Context
The corpus is mutable: documents are added, edited, and removed continuously. The vector index has to stay current without burning budget or blowing the latency SLA.
Options considered:
- Nightly full re-embed: every document gets re-embedded every night. Simple and idempotent, but costs $20 per 1M docs per night ($600/mo) regardless of churn.
- Re-embed on every write: an application-side trigger calls the embedding API on every UPSERT. Costs grow with write volume, not corpus size, and every write pays the embedding latency on the hot path.
- Hash-based incremental: store a SHA-256 of `content` alongside the vector. On UPSERT, compute the new hash; only re-embed when the hash changed.
- CDC + change feed: Debezium watches the Postgres WAL; an embedding worker consumes the change feed.
Decision
Adopt hash-based incremental updates with content-addressable SHA-256. The embedding pipeline computes the hash on UPSERT and only calls the embedding API if the hash changed.
```python
# incremental_manager.py
# Assumes `conn` is a psycopg 3 connection (conn.execute returns a cursor)
# using the default tuple row factory.
import hashlib


def compute_hash(content: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def should_reembed(doc_id: str, new_content: str, conn) -> bool:
    """True when the document is new or its stored hash differs."""
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE id = %s", (doc_id,)
    ).fetchone()
    return row is None or row[0] != compute_hash(new_content)


def upsert_with_hash(doc_id: str, content: str, embedding: list[float], conn):
    """Insert or update the row; the WHERE clause skips unchanged content."""
    new_hash = compute_hash(content)
    conn.execute(
        "INSERT INTO documents (id, content, content_hash, embedding) "
        "VALUES (%s, %s, %s, %s) "
        "ON CONFLICT (id) DO UPDATE SET "
        "  content = EXCLUDED.content, "
        "  content_hash = EXCLUDED.content_hash, "
        "  embedding = EXCLUDED.embedding, "
        "  updated_at = now() "
        "WHERE documents.content_hash IS DISTINCT FROM EXCLUDED.content_hash",
        (doc_id, content, new_hash, embedding),
    )
```
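To see how the hash check avoids redundant API calls, here is a minimal in-memory sketch of the write flow. It swaps Postgres for a plain dict and the embedding API for a stub that counts calls; `store`, `fake_embed`, and `process_write` are hypothetical names used for illustration and are not part of `incremental_manager.py`.

```python
import hashlib


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


# Counts simulated embedding-API calls (stand-in for the real billable call).
calls = {"embed": 0}

# doc_id -> (content_hash, embedding); stands in for the documents table.
store: dict[str, tuple[str, list[float]]] = {}


def fake_embed(content: str) -> list[float]:
    """Stub embedder: returns a dummy vector and records the call."""
    calls["embed"] += 1
    return [float(len(content))]


def process_write(doc_id: str, content: str) -> bool:
    """Embed and store only when the content hash changed.

    Returns True if the document was (re-)embedded.
    """
    new_hash = compute_hash(content)
    existing = store.get(doc_id)
    if existing is not None and existing[0] == new_hash:
        return False  # unchanged content: no embedding call
    store[doc_id] = (new_hash, fake_embed(content))
    return True


results = [
    process_write("doc-1", "hello world"),   # new document
    process_write("doc-1", "hello world"),   # identical rewrite
    process_write("doc-1", "hello, world"),  # real edit
]
print(results, calls["embed"])
```

Only the first write and the real edit reach the stub embedder; the identical rewrite is filtered out by the hash comparison alone.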
Tradeoffs we accept
| Lever | Hash-based (chosen) | Nightly full | Per-write embed | CDC + change feed |
|---|---|---|---|---|
| Embedding cost / mo (1M docs, 1% daily churn) | ~$6 (~$0.20/day) | ~$600 | Scales with writes | ~$6 (only changes) |
| Hot-path latency | 0 ms (background worker) | 0 ms | +50–200 ms per write | 0 ms (async) |
| Operational complexity | Background pipeline + jobs table | Cron job | Application trigger | Debezium + Kafka stack |
| Failure recovery | Resume from pending jobs | Re-run cron | None (write-side) | Replay change feed |
| Tutorial reproducibility | Pure Python + Postgres | Pure Python + Postgres | Pure Python + Postgres | New stack (Debezium + Kafka) |
| Cost amortization | Once at ingest, never re-pay | Pays nightly | Pays per write | Once at change |
We optimize for cost amortization and operational simplicity. Hash-based updates turn the embedding bill from a recurring full-corpus charge into a one-time cost at ingest plus a small incremental cost for actual edits. CDC achieves the same outcome but adds the operational burden of Debezium and Kafka, which is overkill for a single-team retrieval platform.
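The arithmetic behind the cost comparison can be sketched as follows. The $20 per 1M docs price comes from this ADR; the 30-day month and the function names are assumptions made for the example.

```python
PRICE_PER_DOC = 20 / 1_000_000  # $20 per 1M docs embedded (from this ADR)
DAYS_PER_MONTH = 30             # assumption: 30-day billing month


def one_time_ingest(corpus_size: int) -> float:
    """Cost of embedding the whole corpus once."""
    return corpus_size * PRICE_PER_DOC


def nightly_full_monthly(corpus_size: int) -> float:
    """Re-embed every document every night."""
    return corpus_size * PRICE_PER_DOC * DAYS_PER_MONTH


def incremental_monthly(corpus_size: int, daily_churn: float) -> float:
    """Re-embed only the fraction of documents that changed each day."""
    changed_per_day = corpus_size * daily_churn
    return changed_per_day * PRICE_PER_DOC * DAYS_PER_MONTH


corpus = 1_000_000
print(f"ingest once:       ${one_time_ingest(corpus):.2f}")            # $20.00
print(f"nightly full / mo: ${nightly_full_monthly(corpus):.2f}")       # $600.00
print(f"1% churn / mo:     ${incremental_monthly(corpus, 0.01):.2f}")  # $6.00
print(f"50% churn / mo:    ${incremental_monthly(corpus, 0.50):.2f}")  # $300.00
```

At 1% daily churn the incremental cost is about $0.20 per day, roughly 1% of the nightly-full bill; the advantage shrinks linearly as churn rises.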
Consequences (positive)
- Embedding cost at 1M docs with 1% daily churn is ~$0.20/day, or ~$6/mo (vs ~$600/mo nightly). The Module 04 cost-model CSV documents this as the breakthrough lever.
- The pipeline is resumable: `embedding_pipeline.py`'s `jobs` table persists state across restarts. Module 03's `checkpoint_verify.sh` validates this.
- The hash check is a single Postgres column compare: no API call, no latency hit on writes that don't change content.
- Adding new fields to the embedding (e.g. metadata-aware embeddings) is one schema migration: add a new `metadata_hash` column and trigger re-embed only when either hash changes.
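A minimal sketch of the two-hash check described in the last bullet, assuming metadata is a JSON-serializable dict. `compute_metadata_hash` and `should_reembed_with_metadata` are hypothetical names; the canonical-JSON step is one possible way to make the hash independent of key order, not something this ADR specifies.

```python
import hashlib
import json
from typing import Optional


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def compute_metadata_hash(metadata: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_reembed_with_metadata(
    stored_content_hash: Optional[str],
    stored_metadata_hash: Optional[str],
    content: str,
    metadata: dict,
) -> bool:
    """Re-embed when either the content hash or the metadata hash changed."""
    return (
        stored_content_hash != compute_hash(content)
        or stored_metadata_hash != compute_metadata_hash(metadata)
    )


content = "quarterly report"
meta = {"lang": "en", "team": "search"}
c_hash, m_hash = compute_hash(content), compute_metadata_hash(meta)

# Same metadata with keys in a different order: nothing to do.
unchanged = should_reembed_with_metadata(
    c_hash, m_hash, content, {"team": "search", "lang": "en"}
)
# Metadata edit alone is enough to trigger a re-embed.
changed = should_reembed_with_metadata(
    c_hash, m_hash, content, {"lang": "de", "team": "search"}
)
print(unchanged, changed)
```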
Consequences (negative)
- Embedding-model migrations require a forced re-embed. When a team upgrades from `text-embedding-3-small` to `text-embedding-3-large`, every hash is still valid but every embedding is wrong. Mitigation: ADR-005 (Deprecated) documents the embedding-version field that was added precisely for this case; `embedding_version.py` introduces the `EmbeddingVersion` enum + `mark_stale_embeddings`.
- Hash collision is theoretical but possible. A SHA-256 collision on natural-language content has never been observed; we accept the risk. Adversarial input (crafted hash collisions) is out of scope for retrieval.
- Background pipeline failure modes. If the worker crashes mid-batch, the jobs table holds `pending` rows. Mitigation: Module 03 ships `checkpoint_verify.sh`, which validates resume-on-failure.
- Edit-frequency assumption. The cost win depends on <5% daily churn. A corpus where 50% of documents change daily would pay 50× the 1%-churn baseline.
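The model-migration case can be illustrated with a staleness check that looks at the model name as well as the content hash. This is a sketch under stated assumptions: `StoredDoc`, its `embedding_model` field, and `needs_reembed` are hypothetical and do not reproduce the actual `EmbeddingVersion` enum or `mark_stale_embeddings` in `embedding_version.py`.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class StoredDoc:
    content_hash: str
    embedding_model: str  # hypothetical column: model that produced the vector


def needs_reembed(
    stored: Optional[StoredDoc], content: str, current_model: str
) -> bool:
    """Stale if the doc is new, the model changed, or the content changed."""
    if stored is None:
        return True
    if stored.embedding_model != current_model:
        return True  # hash still matches, but the vector is from the old model
    return stored.content_hash != compute_hash(content)


doc = StoredDoc(compute_hash("same text"), "text-embedding-3-small")

same_model = needs_reembed(doc, "same text", "text-embedding-3-small")
after_upgrade = needs_reembed(doc, "same text", "text-embedding-3-large")
print(same_model, after_upgrade)
```

The content hash alone reports "unchanged" after the upgrade, which is why a version marker has to sit next to it.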
Reversal plan
The interface is incremental_manager.upsert_with_hash(...).
Replacement is bounded:
- Nightly full: replace `should_reembed` to always return `True`; add a cron entry for `embedding_pipeline.py`. ~10 lines.
- Per-write embed: move the embed call from the background worker into the FastAPI write path. Add latency budget enforcement. ~20 lines.
- CDC + Debezium: add `infra/debezium-connect.yaml`, swap `embedding_pipeline.py` to consume the Debezium topic instead of polling the `jobs` table. New stack, ~1 engineer-week.
Estimated effort: 0.5–5 engineer-days depending on swap target. Reversible.
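As a rough illustration of how small the nightly-full swap is, the sketch below puts the re-embed decision behind a strategy flag. `Strategy` and `should_reembed_for` are hypothetical names introduced here; the plan above only requires that `should_reembed` always return `True`.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    HASH_INCREMENTAL = "hash_incremental"
    NIGHTLY_FULL = "nightly_full"


def should_reembed_for(
    strategy: Strategy, stored_hash: Optional[str], new_hash: str
) -> bool:
    """Decide whether to re-embed under the selected update strategy."""
    if strategy is Strategy.NIGHTLY_FULL:
        return True  # nightly full: every document is re-embedded
    return stored_hash != new_hash  # hash-based: only on real changes


incremental = should_reembed_for(Strategy.HASH_INCREMENTAL, "abc", "abc")
nightly = should_reembed_for(Strategy.NIGHTLY_FULL, "abc", "abc")
print(incremental, nightly)
```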
References
- `incremental_manager.py` (compute_hash + should_reembed + upsert_with_hash)
- `embedding_pipeline.py` (resume-on-failure jobs table, async embed batch)
- `checkpoint_verify.sh` (validates pipeline resume)
- Module 04 cost-model CSV (documents the $20 per 1M docs one-time cost → ~$0.20/day, ~$6/mo lever)
- ADR-005 (Deprecated single-embedding-version; forced the addition of the `embedding_version` field on top of this hash mechanism)