ADR-004: Index updates use hash-based incremental, not nightly full re-embedding | AI Retrieval Platform

Context

The corpus is mutable: documents are added, edited, and removed continuously. The vector index has to stay current without burning budget or blowing the latency SLA.

Options considered:

Nightly full re-embed: every document is re-embedded every night. Simple and idempotent, but it costs $20 per 1M docs per night (about $600/mo for a 1M-doc corpus) regardless of churn.
Re-embed on every write: an application-side trigger calls the embedding API on every UPSERT. Cost grows with write volume rather than corpus size, and the embedding call adds latency to the hot path.
Hash-based incremental: store a SHA-256 of the content alongside the vector. On UPSERT, compute the new hash and re-embed only when it differs from the stored one.
CDC + change feed: Debezium watches the Postgres WAL and an embedding worker consumes the change feed. Only changed rows are re-embedded, but it requires running a Debezium + Kafka stack.

Decision

Adopt hash-based incremental updates with content-addressable SHA-256. The embedding pipeline computes the hash on UPSERT and only calls the embedding API if the hash changed.

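# incremental_manager.py
# Assumes a psycopg 3 connection: conn.execute() returns a cursor and uses %s placeholders.
import hashlib

def compute_hash(content: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def should_reembed(doc_id: str, new_content: str, conn) -> bool:
    """True when the document is new or its stored hash differs from the new one."""
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE id = %s", (doc_id,)
    ).fetchone()
    new_hash = compute_hash(new_content)
    # The default row factory returns tuples, so the hash is column 0.
    return row is None or row[0] != new_hash

def upsert_with_hash(doc_id: str, content: str, embedding: list[float], conn):
    """Insert the document, or update it only when the content hash changed."""
    new_hash = compute_hash(content)
    # The embedding must be adaptable to the column type (a pgvector column
    # needs its adapter registered on the connection first).
    conn.execute(
        "INSERT INTO documents (id, content, content_hash, embedding) "
        "VALUES (%s, %s, %s, %s) "
        "ON CONFLICT (id) DO UPDATE SET "
        "  content = EXCLUDED.content, "
        "  content_hash = EXCLUDED.content_hash, "
        "  embedding = EXCLUDED.embedding, "
        "  updated_at = now() "
        # Guard: rows whose hash is unchanged are left untouched.
        "WHERE documents.content_hash IS DISTINCT FROM EXCLUDED.content_hash",
        (doc_id, content, new_hash, embedding),
    )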
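
The check-then-embed flow can be illustrated without a database. The following is a minimal sketch only: it assumes an in-memory dict in place of the documents table and a stand-in `fake_embed` function in place of the real embedding API. Both are hypothetical and exist purely to show that an identical rewrite triggers no embedding call.

```python
import hashlib


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


# Hypothetical stand-ins: a dict instead of the documents table,
# and a call-counting function instead of the embedding API.
store: dict[str, dict] = {}
embed_calls = 0


def fake_embed(content: str) -> list[float]:
    global embed_calls
    embed_calls += 1
    return [float(len(content))]  # placeholder vector, not a real embedding


def ingest(doc_id: str, content: str) -> bool:
    """Embed and store only when the content hash changed; return True if embedded."""
    new_hash = compute_hash(content)
    existing = store.get(doc_id)
    if existing is not None and existing["content_hash"] == new_hash:
        return False  # unchanged content: skip the embedding call entirely
    store[doc_id] = {
        "content": content,
        "content_hash": new_hash,
        "embedding": fake_embed(content),
    }
    return True


results = [
    ingest("doc-1", "hello world"),   # new document
    ingest("doc-1", "hello world"),   # identical rewrite
    ingest("doc-1", "hello, world"),  # real edit
]
```

In this sketch only the first and third calls reach `fake_embed`; the identical rewrite is filtered out by the hash comparison alone.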

Tradeoffs we accept

Lever	Hash-based (chosen)	Nightly full	Per-write embed	CDC + change feed
Embedding cost / mo (1M docs, 1% daily churn)	~$6 (~$0.20/day)	~$600	Scales with writes	~$6 (only changes)
Hot-path latency	0 ms (background worker)	0 ms	+50–200 ms per write	0 ms (async)
Operational complexity	Background pipeline + jobs table	Cron job	Application trigger	Debezium + Kafka stack
Failure recovery	Resume from `pending` jobs	Re-run cron	None (write-side)	Replay change feed
Tutorial reproducibility	Pure Python + Postgres	Pure Python + Postgres	Pure Python + Postgres	New stack (Debezium + Kafka)
Cost amortization	Once at ingest, never re-pay	Pays nightly	Pays per write	Once at change

We optimize for cost amortization and operational simplicity. Hash-based updates turn the embedding bill from a recurring full-corpus charge into a one-time cost at ingest plus a small incremental cost for actual edits. CDC achieves the same outcome but adds the operational burden of Debezium and Kafka, which is overkill for a single-team retrieval platform.
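
The cost comparison reduces to simple arithmetic. The sketch below uses the $20 per 1M docs price quoted in this ADR and assumes a 30-day month; the function name and its parameters are illustrative and are not taken from the Module 04 cost model.

```python
def monthly_embedding_cost(
    corpus_docs: int,
    daily_churn: float,
    price_per_million: float = 20.0,  # price quoted in this ADR
    days_per_month: int = 30,         # assumption: 30-day month
) -> dict[str, float]:
    """Compare monthly embedding spend for nightly-full vs. hash-based updates."""
    # Cost of embedding the whole corpus once.
    full_pass = corpus_docs / 1_000_000 * price_per_million
    return {
        "one_time_ingest": full_pass,
        # Nightly full pays for a full pass every day.
        "nightly_full_per_month": full_pass * days_per_month,
        # Hash-based pays only for the fraction of docs that changed each day.
        "hash_based_per_month": full_pass * daily_churn * days_per_month,
    }


baseline = monthly_embedding_cost(1_000_000, 0.01)     # 1% daily churn
heavy_churn = monthly_embedding_cost(1_000_000, 0.50)  # 50% daily churn
```

Under these assumptions the hash-based figure is always the nightly figure multiplied by the churn fraction, so the saving shrinks linearly as churn rises.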

Consequences (positive)

Embedding cost at 1M docs with 1% daily churn is ~$6/mo (10,000 re-embeds per day at $20 per 1M docs, i.e. ~$0.20/day) versus ~$600/mo for nightly full re-embedding. The Module 04 cost-model CSV documents this as the breakthrough lever.
The pipeline is resumable: embedding_pipeline.py's jobs table persists state across restarts. Module 03's checkpoint_verify.sh validates this.
The hash check is a single Postgres column compare — no API call, no latency hit on writes that don't change content.
Adding new fields to the embedding (e.g. metadata-aware embeddings) is one schema migration: add a new metadata_hash column and trigger re-embed only when EITHER hash changes.

Consequences (negative)

Embedding-model migrations require a forced re-embed. When a team upgrades from text-embedding-3-small to text-embedding-3-large, every content hash still matches, so nothing is re-embedded, yet every stored vector comes from the old model. Mitigation: ADR-005 (Deprecated) documents the embedding-version field that was added precisely for this case; embedding_version.py introduces the EmbeddingVersion enum and mark_stale_embeddings.
Hash collision is theoretically possible. No SHA-256 collision has been publicly demonstrated for any input, so we accept the risk for natural-language content. Adversarial input (crafted hash collisions) is out of scope for retrieval.
Background pipeline failure modes. If the worker crashes mid-batch, the jobs table holds pending rows. Mitigation: Module 03 ships checkpoint_verify.sh which validates resume-on-failure.
Edit-frequency assumption. The cost win assumes low daily churn (<5%). Cost scales linearly with churn: a corpus where 50% of documents change daily would pay 50× the 1% baseline, roughly half the cost of nightly full re-embedding.

Reversal plan

The interface is incremental_manager.upsert_with_hash(...). Replacement is bounded:

Nightly full: change should_reembed to always return True, drop the content_hash guard (the WHERE clause) in upsert_with_hash so unchanged rows are rewritten, and add a cron entry for embedding_pipeline.py. ~10 lines.
Per-write embed: move the embed call from the background worker into the FastAPI write path and add latency budget enforcement. ~20 lines.
CDC + Debezium: add infra/debezium-connect.yaml and swap embedding_pipeline.py to consume the Debezium topic instead of polling the jobs table. This is a new stack: ~1 engineer-week.

Estimated effort: 0.5–5 engineer-days depending on swap target. Reversible.

References

incremental_manager.py (compute_hash + should_reembed + upsert_with_hash)
embedding_pipeline.py (resume-on-failure jobs table, async embed batch)
checkpoint_verify.sh (validates pipeline resume)
Module 04 cost-model CSV (documents the lever: $20 per 1M docs one-time at ingest, then ~$0.20/day, about $6/mo, at 1% daily churn)
ADR-005 (Deprecated single-embedding-version — forced the addition of embedding_version field on top of this hash mechanism)