Context (when this was Accepted)
The v1 schema treated the embedding as a single fixed type: a vector(1536)
column on documents, with no embedding_version column and no recognition
in the schema that the embedding model could change. The documents
table looked like:
-- v1 schema (Module 01)
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    content TEXT NOT NULL,
    content_hash TEXT, -- ADR-004 added this
    embedding vector(1536) NOT NULL,
    metadata JSONB,
    ts_content tsvector GENERATED ALWAYS AS (...) STORED,
    created_at TIMESTAMPTZ DEFAULT now()
);
The implicit assumption was: the embedding model is OpenAI's
text-embedding-3-small, forever, on every row. This worked fine
for v1.
What changed (and why we reversed)
Three things forced the reversal:
- The embedding model upgraded. text-embedding-3-large (3072 dims, ~2-3 percentage points better recall@10 on MS MARCO) became the obvious choice for premium tenants. The vector(1536) column could not hold a 3072-dim vector.
- Multiple tenants, multiple models. A free tier on text-embedding-3-small, a premium tier on text-embedding-3-large, an on-prem tier on a self-hosted bge model: three vector columns? Three tables? The schema didn't know.
- A migration is not a single atomic operation. Migrating 1M embeddings takes ~3 hours. During those 3 hours, queries still need to return correct results, meaning some documents have v1 embeddings, some have v2, and the search has to know which.
The fix landed in Module 03:
- embedding_version.py introduces an EmbeddingVersion enum (V1_OPENAI_SMALL, V2_OPENAI_LARGE, V3_BGE_LOCAL).
- The documents schema gains an embedding_version TEXT NOT NULL column. Each row knows which model produced its vector.
- mark_stale_embeddings(target_version) flips embedding_status to stale for rows on the old version.
- reembed_stale(batch_size, target_version) walks the stale rows, re-embeds with the target model, and atomically swaps the embedding + version + status.
- The retrieval path filters on WHERE embedding_version = :preferred_version with fallback to the legacy version on miss.
# embedding_version.py
from enum import Enum


class EmbeddingVersion(str, Enum):
    V1_OPENAI_SMALL = "openai-text-embedding-3-small-1536"
    V2_OPENAI_LARGE = "openai-text-embedding-3-large-3072"
    V3_BGE_LOCAL = "bge-base-en-v1.5-768"


def mark_stale_embeddings(target_version: EmbeddingVersion, conn):
    # Flag every fresh row embedded by a different model for re-embedding.
    # Query parameters must be passed as a sequence, hence the 1-tuple.
    conn.execute(
        "UPDATE documents SET embedding_status = 'stale' "
        "WHERE embedding_version != %s AND embedding_status = 'fresh'",
        (target_version.value,),
    )


def reembed_stale(batch_size: int, target_version: EmbeddingVersion, conn):
    # Walk stale rows in batches of `batch_size`, embed with target,
    # UPDATE embedding + embedding_version + embedding_status atomically.
    ...
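The body of reembed_stale is elided above. As a rough illustration of the batch walk and the single-statement swap it describes, here is a minimal sketch, not the Module 03 implementation. It makes several assumptions: sqlite3 stands in for Postgres (so placeholders are ? rather than %s), the embedding is stored as JSON text instead of a pgvector value, and the embed_fn parameter is a hypothetical callable that is not part of the original signature.

```python
import json
import sqlite3
from typing import Callable, List

# Same string as EmbeddingVersion.V2_OPENAI_LARGE.value in the source.
TARGET = "openai-text-embedding-3-large-3072"


def reembed_stale(
    batch_size: int,
    target_version: str,
    conn: sqlite3.Connection,
    embed_fn: Callable[[str], List[float]],  # hypothetical: text -> vector
) -> int:
    """Re-embed stale rows batch by batch; return the number of rows processed."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, content FROM documents "
            "WHERE embedding_status = 'stale' ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        # Call the model before opening the write transaction so slow
        # embedding calls do not hold locks.
        updates = [
            (json.dumps(embed_fn(content)), target_version, doc_id)
            for doc_id, content in rows
        ]
        # One transaction per batch. Each UPDATE sets vector, version, and
        # status together, so a reader never sees a new vector paired with
        # an old version label.
        with conn:
            conn.executemany(
                "UPDATE documents SET embedding = ?, embedding_version = ?, "
                "embedding_status = 'fresh' "
                "WHERE id = ? AND embedding_status = 'stale'",
                updates,
            )
        total += len(rows)
```

Because each processed row leaves the stale set, the loop ends once the SELECT returns nothing. If embed_fn raises, the exception propagates and batches already committed stay committed, so the job can simply be restarted.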
Why we left this ADR Deprecated rather than deleting it
A future maintainer looking at embedding_version.py will wonder why
the column exists at all, since pgvector's vector type doesn't require
it. The interesting question (why didn't they ship it in v1?) is
answered by this ADR.
The MADR convention treats Deprecated ADRs as part of the permanent record. We follow that convention.
What we got wrong (and what we'd do again)
Got wrong:
- We treated the embedding model as immutable infrastructure. It isn't. OpenAI deprecates models on a regular cadence, the open-source frontier moves every few months, and customers will request specific models for compliance reasons. A schema that doesn't know "which model" is brittle by design.
- We coupled the schema dimension (1536) to the chosen model. The vector(N) constraint is the failure that bit us: embedding vector(N) can't be resized in place with ALTER TABLE; the migration required a new column + backfill + cutover.
- We didn't separate the "what's in the row" hash (ADR-004's content_hash) from the "what model embedded the row" identifier (embedding_version). They're orthogonal and need separate fields.
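The new column + backfill + cutover sequence from the second bullet could look roughly like the following. This ADR does not record the actual migration statements, so this is a sketch under assumptions: the helper name and the temporary column name embedding_v2 are hypothetical, and the backfill itself is left to the re-embed worker.

```python
from typing import List


def resize_vector_column_steps(
    table: str = "documents",
    old_col: str = "embedding",
    new_col: str = "embedding_v2",  # hypothetical temporary column name
    new_dims: int = 3072,
) -> List[str]:
    """Return the DDL for a new-column + backfill + cutover migration, in order."""
    return [
        # 1. Add the wider column as nullable so existing rows stay valid.
        f"ALTER TABLE {table} ADD COLUMN {new_col} vector({new_dims})",
        # 2. Backfill happens out of band (the re-embed worker fills new_col);
        #    there is no DDL for that step.
        # 3. Cutover: drop the old column, take over its name, restore NOT NULL.
        f"ALTER TABLE {table} DROP COLUMN {old_col}",
        f"ALTER TABLE {table} RENAME COLUMN {new_col} TO {old_col}",
        f"ALTER TABLE {table} ALTER COLUMN {old_col} SET NOT NULL",
    ]


if __name__ == "__main__":
    for statement in resize_vector_column_steps():
        print(statement + ";")
```

The cutover statements would only be safe to run once every row has a value in the new column, and Postgres lets them run inside a single transaction so readers see either the old layout or the new one.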
Got right:
- The retrieval interface (api/main.py's /search) was already flexible enough to filter on a column. Adding WHERE embedding_version = ? was a one-line change.
- ADR-004's hash-based incremental updates make the re-embed cheap even when forced by a model migration: only ~$0.20/mo at 1M docs with 1% churn.
- The embedding_status column (separate from the version column) preserved the ability to mark rows for re-embed without dropping them from search results during the migration window.
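To make the first and third points concrete, here is a small in-memory sketch of version-aware retrieval. It is not the code in api/main.py. The search function, the row layout, and the embed_query callable are all hypothetical, and "miss" is assumed to mean that no rows exist on the preferred version. The query is embedded with the same model as the rows it is compared against, and embedding_status is deliberately not part of the filter, so stale rows stay searchable.

```python
import math
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Dict[str, Any]  # keys: id, embedding, embedding_version, embedding_status


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    rows: List[Row],
    embed_query: Callable[[str, str], List[float]],  # (query, version) -> vector
    query: str,
    preferred: str,
    legacy: str,
    k: int = 5,
) -> List[Tuple[Any, float]]:
    """Rank rows on the preferred version; fall back to legacy if none exist."""
    for version in (preferred, legacy):
        # Filter on version only: status is ignored, so rows marked stale
        # are still returned while they wait for re-embedding.
        candidates = [r for r in rows if r["embedding_version"] == version]
        if not candidates:
            continue  # miss on this version, try the next one
        # Vectors from different models are not comparable, so the query is
        # embedded with the model that produced the candidate rows.
        qvec = embed_query(query, version)
        scored = [(r["id"], cosine(qvec, r["embedding"])) for r in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
    return []
```

A real implementation would push the version filter and the similarity ranking into SQL; the list comprehension here only mirrors the WHERE embedding_version = ? clause.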
When (if ever) to revisit
A future ADR could deprecate the version column when both of these are true:
- The team has settled on a single embedding model for >12 months with no known upgrade path.
- Multi-tenant model selection has been replaced by a policy engine (e.g. ADR-?-future routes premium tenants to a separate index entirely).
Until then, the version column stays.
References
- embedding_version.py (the EmbeddingVersion enum + mark_stale_embeddings + reembed_stale + query-time preference filter)
- incremental_manager.py (ADR-004: content hashing, separate concern)
- embedding_pipeline.py (the worker that consumes embedding_status='stale' rows)
- seed/01_create_tables.sql (current schema with embedding_version + embedding_status columns)
- ADR-001 (pgvector: vector(N) dimension constraint that forced the migration)
- ADR-004 (hash-based incremental: makes the forced re-embed cheap)