Index updates use hash-based incremental re-embedding, not nightly full re-embedding

✓ Accepted · AI Retrieval Platform · 03 — Scale & Agent Integration
By AI-DE Engineering Team · Stakeholders: retrieval engineer, finance, operator

Context

The corpus is mutable: documents are added, edited, and removed continuously. The vector index has to stay current without burning budget or blowing the latency SLA.

Options considered:

  1. Nightly full re-embed: every document is re-embedded every night. Simple and idempotent, but costs about $20 per 1M docs per night (~$600/mo for a 1M-doc corpus) regardless of churn.
  2. Re-embed on every write: an application-side trigger calls the embedding API on every UPSERT. Cost grows with write volume rather than corpus size, and every write takes a latency hit on the hot path.
  3. Hash-based incremental: store a SHA-256 of the content alongside the vector. On UPSERT, compute the new hash and re-embed only when it differs from the stored one.
  4. CDC + change feed: Debezium watches the Postgres WAL and an embedding worker consumes the change feed.

Decision

Adopt hash-based incremental updates keyed on a SHA-256 hash of each document's content. On UPSERT the embedding pipeline computes the hash and calls the embedding API only if the hash has changed.

# incremental_manager.py
# Assumes `conn` is a psycopg 3 connection (conn.execute with %s placeholders).
import hashlib


def compute_hash(content: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def should_reembed(doc_id: str, new_content: str, conn) -> bool:
    """Return True when the document is new or its stored hash differs."""
    row = conn.execute(
        "SELECT content_hash FROM documents WHERE id = %s", (doc_id,)
    ).fetchone()
    # Default psycopg rows are tuples, so the hash is the first column.
    return row is None or row[0] != compute_hash(new_content)


def upsert_with_hash(doc_id: str, content: str, embedding: list[float], conn) -> None:
    """Insert the document, or update it only if the content hash changed."""
    new_hash = compute_hash(content)
    conn.execute(
        "INSERT INTO documents (id, content, content_hash, embedding) "
        "VALUES (%s, %s, %s, %s) "
        "ON CONFLICT (id) DO UPDATE SET "
        "  content = EXCLUDED.content, "
        "  content_hash = EXCLUDED.content_hash, "
        "  embedding = EXCLUDED.embedding, "
        "  updated_at = now() "
        # The WHERE clause turns a same-hash UPSERT into a no-op.
        "WHERE documents.content_hash IS DISTINCT FROM EXCLUDED.content_hash",
        (doc_id, content, new_hash, embedding),
    )
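
The two helpers above are meant to be used together: check the hash first, and call the embedding API only when it changed. The sketch below shows that flow under simplifying assumptions. An in-memory dict stands in for the documents table, and fake_embed is a hypothetical placeholder for the real embedding API call; neither is part of incremental_manager.py.

```python
import hashlib


def compute_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


# Hypothetical stand-ins: a dict for the documents table, a counter for API calls.
store: dict[str, dict] = {}
embed_calls = 0


def fake_embed(content: str) -> list[float]:
    """Placeholder for the real embedding API call (assumption)."""
    global embed_calls
    embed_calls += 1
    return [float(len(content))]


def ingest(doc_id: str, content: str) -> bool:
    """Embed and store only when the content hash changed. Returns True if embedded."""
    new_hash = compute_hash(content)
    existing = store.get(doc_id)
    if existing is not None and existing["content_hash"] == new_hash:
        return False  # unchanged content: no API call, no write
    store[doc_id] = {
        "content": content,
        "content_hash": new_hash,
        "embedding": fake_embed(content),
    }
    return True


# First write embeds, identical rewrite is skipped, edited content embeds again.
results = [
    ingest("doc-1", "hello"),
    ingest("doc-1", "hello"),
    ingest("doc-1", "hello world"),
]
print(results, embed_calls)
```

Three writes trigger only two embedding calls, because the second write carries identical content.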

Tradeoffs we accept

| Lever | Hash-based (chosen) | Nightly full | Per-write embed | CDC + change feed |
|---|---|---|---|---|
| Embedding cost / mo (1M docs, 1% daily churn) | ~$6 (≈ $0.20/day, only changes) | ~$600 | Scales with writes | ~$5 (only changes) |
| Hot-path latency | 0 ms (background worker) | 0 ms | +50–200 ms per write | 0 ms (async) |
| Operational complexity | Background pipeline + jobs table | Cron job | Application trigger | Debezium + Kafka stack |
| Failure recovery | Resume from pending jobs | Re-run cron | None (write-side) | Replay change feed |
| Tutorial reproducibility | Pure Python + Postgres | Pure Python + Postgres | Pure Python + Postgres | New stack (Debezium + Kafka) |
| Cost amortization | Once at ingest, never re-pay | Pays nightly | Pays per write | Once at change |

We optimize for cost amortization and operational simplicity. Hash-based updates turn the embedding bill from a recurring full-corpus cost into a one-time cost at ingest plus a small incremental cost for actual edits. CDC achieves the same outcome but adds the operational burden of Debezium and Kafka, which is overkill for a single-team retrieval platform.
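
The cost comparison can be checked with simple arithmetic. The sketch below assumes the $20 per 1M docs rate quoted for the nightly option and a 30-day month (the month length is an assumption made here). At that rate, 1% daily churn on 1M docs works out to about $0.20 per day, or roughly $6 per month.

```python
PRICE_PER_MILLION_DOCS = 20.0  # USD, the nightly re-embed rate quoted in this ADR
DAYS_PER_MONTH = 30            # assumption for the monthly estimate


def monthly_cost(docs_embedded_per_day: float) -> float:
    """Monthly embedding spend for a given number of documents embedded per day."""
    return docs_embedded_per_day / 1_000_000 * PRICE_PER_MILLION_DOCS * DAYS_PER_MONTH


corpus = 1_000_000
nightly_full = monthly_cost(corpus)              # every doc, every night
incremental_1pct = monthly_cost(corpus * 0.01)   # only the 1% that changed
incremental_50pct = monthly_cost(corpus * 0.50)  # high-churn corpus

print(f"{nightly_full:.2f} {incremental_1pct:.2f} {incremental_50pct:.2f}")
```

The same function shows how the saving shrinks as churn rises: at 50% daily churn the incremental approach costs half as much as a nightly full re-embed, not a hundredth.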

Consequences (positive)

  • Embedding cost at 1M docs with 1% daily churn is ~$0.20/day, about $6/mo (vs ~$600/mo for nightly full re-embedding). The Module 04 cost-model CSV documents this as the breakthrough lever.
  • The pipeline is resumable: embedding_pipeline.py's jobs table persists state across restarts. Module 03's checkpoint_verify.sh validates this.
  • The hash check is a single Postgres column compare — no API call, no latency hit on writes that don't change content.
  • Adding new fields to the embedding (e.g. metadata-aware embeddings) requires one schema migration: add a metadata_hash column and trigger a re-embed when either hash changes.
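
The resumability mentioned in the list above can be illustrated with a small sketch. It uses sqlite3 in memory as a stand-in for Postgres, and the table name embedding_jobs and its status column are hypothetical; the actual schema in embedding_pipeline.py is not shown in this ADR. The idea is that each job is committed as it completes, so a restarted worker sees only the rows still pending.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Postgres (assumption)
conn.execute(
    "CREATE TABLE embedding_jobs (doc_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO embedding_jobs VALUES (?, 'pending')",
    [("doc-1",), ("doc-2",), ("doc-3",)],
)
conn.commit()


def run_worker(conn, crash_after=None):
    """Process pending jobs one at a time, committing after each one."""
    done = []
    rows = conn.execute(
        "SELECT doc_id FROM embedding_jobs WHERE status = 'pending' ORDER BY doc_id"
    ).fetchall()
    for (doc_id,) in rows:
        if crash_after is not None and len(done) >= crash_after:
            raise RuntimeError("simulated worker crash")
        # The real worker would call the embedding API here.
        conn.execute(
            "UPDATE embedding_jobs SET status = 'done' WHERE doc_id = ?", (doc_id,)
        )
        conn.commit()  # progress is durable before the next job starts
        done.append(doc_id)
    return done


try:
    run_worker(conn, crash_after=1)  # first run dies after one job
except RuntimeError:
    pass

resumed = run_worker(conn)  # second run picks up only what is still pending
print(resumed)
```

Committing per job trades some throughput for a simple recovery story; a batch-level commit would be faster but would redo the whole batch after a crash.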

Consequences (negative)

  • Embedding-model migrations require a forced re-embed. When a team upgrades from text-embedding-3-small to text-embedding-3-large, every content hash still matches, so nothing is re-embedded, yet every stored vector comes from the old model. Mitigation: ADR-005 (Deprecated) documents the embedding-version field that was added for exactly this case; embedding_version.py introduces the EmbeddingVersion enum and mark_stale_embeddings.
  • Hash collisions are theoretically possible. No SHA-256 collision has been publicly demonstrated, let alone on natural-language content, so we accept the risk. Adversarial input (crafted hash collisions) is out of scope for retrieval.
  • Background pipeline failure modes. If the worker crashes mid-batch, the jobs table holds pending rows. Mitigation: Module 03 ships checkpoint_verify.sh which validates resume-on-failure.
  • Edit-frequency assumption. The cost win depends on <5% daily churn. A corpus where 50% of documents change daily would pay 50× more than one at 1% churn, which is about half the cost of a nightly full re-embed.
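
The model-migration problem in the first bullet can be made concrete. Content hashes cannot detect a model change, so each row also has to record which model produced its vector. The sketch below is a hypothetical reconstruction: the names EmbeddingVersion and mark_stale_embeddings come from this ADR, but their real signatures in embedding_version.py are not shown here, and the enum members and row layout are assumptions.

```python
from enum import Enum


class EmbeddingVersion(str, Enum):
    # Member names are assumptions; the values are the model names from this ADR.
    SMALL = "text-embedding-3-small"
    LARGE = "text-embedding-3-large"


def mark_stale_embeddings(rows: list[dict], current: EmbeddingVersion) -> list[str]:
    """Return ids of rows whose vector came from a different model than `current`."""
    return [r["id"] for r in rows if r["embedding_version"] != current.value]


rows = [
    {"id": "doc-1", "content_hash": "aaa", "embedding_version": "text-embedding-3-small"},
    {"id": "doc-2", "content_hash": "bbb", "embedding_version": "text-embedding-3-large"},
]

# After upgrading to the large model, only doc-1 needs a forced re-embed,
# even though neither content hash changed.
stale = mark_stale_embeddings(rows, EmbeddingVersion.LARGE)
print(stale)
```

In a real pipeline the stale ids would presumably be queued as pending jobs so the existing background worker re-embeds them.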

Reversal plan

The interface is incremental_manager.upsert_with_hash(...). Replacement is bounded:

  1. Nightly full: change should_reembed to always return True and add a cron entry for embedding_pipeline.py. ~10 lines.
  2. Per-write embed: move the embed call from the background worker into the FastAPI write path and add latency budget enforcement. ~20 lines.
  3. CDC + Debezium: add infra/debezium-connect.yaml and swap embedding_pipeline.py to consume the Debezium topic instead of polling the jobs table. This is a new stack, so roughly 1 engineer-week.

Estimated effort: 0.5–5 engineer-days depending on swap target. Reversible.
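
The first swap in the list above is small enough to sketch. The version below is a simplified illustration, not the shipped code: ReembedPolicy is a hypothetical name, and this should_reembed takes the stored hash directly instead of a doc_id and connection so that the example runs without a database.

```python
import hashlib
from enum import Enum
from typing import Optional


class ReembedPolicy(Enum):  # hypothetical name, not in the original module
    HASH_INCREMENTAL = "hash_incremental"
    NIGHTLY_FULL = "nightly_full"


def should_reembed(
    stored_hash: Optional[str], new_content: str, policy: ReembedPolicy
) -> bool:
    """Decide whether to call the embedding API under the given policy."""
    if policy is ReembedPolicy.NIGHTLY_FULL:
        return True  # reversal target 1: always re-embed
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    return stored_hash is None or stored_hash != new_hash


# Same unchanged document evaluated under both policies.
unchanged_hash = hashlib.sha256("same text".encode("utf-8")).hexdigest()
decisions = {
    policy.name: should_reembed(unchanged_hash, "same text", policy)
    for policy in ReembedPolicy
}
print(decisions)
```

Keeping the decision behind one function is what makes the swap cheap: callers do not need to know which policy is active.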

References

  • incremental_manager.py (compute_hash + should_reembed + upsert_with_hash)
  • embedding_pipeline.py (resume-on-failure jobs table, async embed batch)
  • checkpoint_verify.sh (validates pipeline resume)
  • Module 04 cost-model CSV (documents the $20/M one-time cost and the incremental-update cost lever)
  • ADR-005 (Deprecated single-embedding-version — forced the addition of embedding_version field on top of this hash mechanism)