Single fixed embedding model assumption (DEPRECATED)

✗ Deprecated · AI Retrieval Platform · 03 — Scale & Agent Integration (origin → reversal)
By AI-DE Engineering Team · Stakeholders: retrieval engineer, on-call, finance

Context (when this was Accepted)

The v1 schema treated the embedding as a single fixed type: a vector(1536) column on documents, no embedding_version column, and nothing in the schema to acknowledge that the embedding model could change. The documents table looked like:

-- v1 schema (Module 01)
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  content TEXT NOT NULL,
  content_hash TEXT,             -- ADR-004 added this
  embedding vector(1536) NOT NULL,
  metadata JSONB,
  ts_content tsvector GENERATED ALWAYS AS (...) STORED,
  created_at TIMESTAMPTZ DEFAULT now()
);

The implicit assumption was: the embedding model is OpenAI's text-embedding-3-small, forever, on every row. This worked fine for v1.

What changed (and why we reversed)

Three things forced the reversal:

  1. A better embedding model shipped. text-embedding-3-large (3072 dims, ~2-3 percentage points better recall@10 on MS MARCO) became the obvious choice for premium tenants. The vector(1536) column could not hold a 3072-dim vector.

  2. Multiple tenants, multiple models. A free tier on text-embedding-3-small, a premium tier on text-embedding-3-large, an on-prem tier on a self-hosted bge model — three vector columns? Three tables? The schema didn't know.

  3. A migration is not a single atomic operation. Migrating 1M embeddings takes ~3 hours. During those 3 hours, queries still need to return correct results — meaning some documents have v1 embeddings, some have v2, and the search has to know which.
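As a rough sanity check on point 3, the figures quoted above (1M embeddings, ~3 hours) imply a sustained re-embed rate and a long window in which both versions coexist. The sketch below only does that arithmetic; the constant-rate assumption and the helper names are illustrative and do not come from the project.

```python
# Back-of-the-envelope arithmetic for the migration window described above.
TOTAL_DOCS = 1_000_000   # from the text
MIGRATION_HOURS = 3      # from the text (approximate)


def required_rate(total_docs: int, hours: float) -> float:
    """Rows per second the re-embed job must sustain to finish in `hours`."""
    return total_docs / (hours * 3600)


def rows_on_new_version(elapsed_hours: float,
                        total_docs: int = TOTAL_DOCS,
                        hours: float = MIGRATION_HOURS) -> int:
    """Rows already migrated after `elapsed_hours`, assuming a constant rate."""
    fraction = min(max(elapsed_hours / hours, 0.0), 1.0)
    return int(total_docs * fraction)


rate = required_rate(TOTAL_DOCS, MIGRATION_HOURS)
print(f"{rate:.1f} rows/s")        # roughly 93 rows per second
print(rows_on_new_version(1.0))    # about a third of the corpus after one hour
```

One hour in, roughly a third of the rows carry the new vector and two thirds still carry the old one, which is why every row has to say which model produced its embedding.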

The fix landed in Module 03:

  • embedding_version.py — introduces an EmbeddingVersion enum (V1_OPENAI_SMALL, V2_OPENAI_LARGE, V3_BGE_LOCAL).
  • The documents schema gains an embedding_version TEXT NOT NULL column. Each row knows which model produced its vector.
  • mark_stale_embeddings(target_version) flips embedding_status to stale for rows on the old version.
  • reembed_stale(batch_size, target_version) walks the stale rows, re-embeds with the target model, and atomically swaps the embedding + version + status.
  • The retrieval path filters on WHERE embedding_version = :preferred_version with fallback to the legacy version on miss.

# embedding_version.py
from enum import Enum


class EmbeddingVersion(str, Enum):
    # Each identifier encodes provider, model name, and vector dimension.
    V1_OPENAI_SMALL = "openai-text-embedding-3-small-1536"
    V2_OPENAI_LARGE = "openai-text-embedding-3-large-3072"
    V3_BGE_LOCAL = "bge-base-en-v1.5-768"


def mark_stale_embeddings(target_version: EmbeddingVersion, conn):
    # Flag every fresh row that is not yet on the target version.
    conn.execute(
        "UPDATE documents SET embedding_status = 'stale' "
        "WHERE embedding_version != %s AND embedding_status = 'fresh'",
        (target_version.value,),  # query parameters must be a sequence
    )

def reembed_stale(batch_size: int, target_version: EmbeddingVersion, conn):
    # Walk stale rows in batches of `batch_size`, embed with target,
    # UPDATE embedding + embedding_version + embedding_status atomically.
    ...
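
The body of reembed_stale is elided above. As a minimal sketch of the batch walk it describes, the example below uses an in-memory SQLite table in place of Postgres and a hypothetical fake_embed function in place of a real embedding call; the table layout, the JSON-encoded vector, and the 4-dimension toy vectors are assumptions made so the example runs on its own, not the project's implementation.

```python
import json
import sqlite3
from enum import Enum
from typing import Callable, List


class EmbeddingVersion(str, Enum):
    V1_OPENAI_SMALL = "openai-text-embedding-3-small-1536"
    V2_OPENAI_LARGE = "openai-text-embedding-3-large-3072"


def fake_embed(text: str, dims: int) -> List[float]:
    # Hypothetical stand-in for a real embedding API call.
    return [float(len(text))] * dims


def reembed_stale_sketch(conn: sqlite3.Connection, batch_size: int,
                         target_version: EmbeddingVersion,
                         embed: Callable[[str], List[float]]) -> int:
    """Re-embed stale rows batch by batch; returns the number of rows updated."""
    updated = 0
    while True:
        rows = conn.execute(
            "SELECT id, content FROM documents "
            "WHERE embedding_status = 'stale' ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return updated
        for doc_id, content in rows:
            vector = embed(content)
            # A single UPDATE sets all three columns, so a reader never
            # sees a new vector paired with an old version label.
            conn.execute(
                "UPDATE documents SET embedding = ?, embedding_version = ?, "
                "embedding_status = 'fresh' WHERE id = ?",
                (json.dumps(vector), target_version.value, doc_id),
            )
        conn.commit()  # one transaction per batch
        updated += len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT, "
    "embedding TEXT, embedding_version TEXT NOT NULL, "
    "embedding_status TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO documents (content, embedding, embedding_version, "
    "embedding_status) VALUES (?, '[]', ?, 'stale')",
    [(f"doc {i}", EmbeddingVersion.V1_OPENAI_SMALL.value) for i in range(5)],
)
migrated = reembed_stale_sketch(
    conn, 2, EmbeddingVersion.V2_OPENAI_LARGE, lambda t: fake_embed(t, 4)
)
print(migrated)
```

Committing per batch keeps each transaction short, so searches running during the migration window are not blocked behind one long-running update.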

Why we left this ADR Deprecated rather than deleting it

A future maintainer looking at embedding_version.py will wonder why the column exists at all, since pgvector's vector type doesn't require it. The more interesting question (why didn't they ship it in v1?) is answered by this ADR.

The MADR convention treats Deprecated ADRs as part of the permanent record. We follow that convention.

What we got wrong (and what we'd do again)

Got wrong:

  • We treated the embedding model as immutable infrastructure. It isn't. OpenAI deprecates models on a regular cadence, the open-source frontier moves every few months, and customers will request specific models for compliance reasons. A schema that doesn't know "which model" is brittle by design.
  • We coupled the schema dimension (1536) to the chosen model. The vector(N) constraint is the failure that bit us: a populated embedding vector(N) column can't be resized in place with ALTER TABLE, so the migration required a new column + backfill + cutover.
  • We didn't separate the "what's in the row" hash (ADR-004's content_hash) from the "what model embedded the row" identifier (embedding_version). They're orthogonal and need separate fields.
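
One way to keep the dimension tied to the version rather than to the schema is to derive it from the version identifier and check every vector before it is written. The sketch below assumes the identifiers always end in "-<dims>", as the three enum values in this ADR do; expected_dims and validate_vector are hypothetical helpers, not functions from embedding_version.py.

```python
from enum import Enum
from typing import Sequence


class EmbeddingVersion(str, Enum):
    V1_OPENAI_SMALL = "openai-text-embedding-3-small-1536"
    V2_OPENAI_LARGE = "openai-text-embedding-3-large-3072"
    V3_BGE_LOCAL = "bge-base-en-v1.5-768"


def expected_dims(version: EmbeddingVersion) -> int:
    # Assumption: the identifier's last "-"-separated token is the dimension.
    return int(version.value.rsplit("-", 1)[1])


def validate_vector(vector: Sequence[float], version: EmbeddingVersion) -> None:
    """Raise ValueError if the vector length does not match its version."""
    want = expected_dims(version)
    if len(vector) != want:
        raise ValueError(
            f"{version.name} expects {want} dims, got {len(vector)}"
        )


# A 3072-dim vector is fine for V2 but rejected for V1, which mirrors
# the mismatch the fixed vector(1536) column ran into.
validate_vector([0.0] * 3072, EmbeddingVersion.V2_OPENAI_LARGE)
try:
    validate_vector([0.0] * 3072, EmbeddingVersion.V1_OPENAI_SMALL)
except ValueError as exc:
    print(exc)
```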

Got right:

  • The retrieval interface (api/main.py's /search) was already flexible enough to filter on a column. Adding WHERE embedding_version = ? was a one-line change.
  • ADR-004's hash-based incremental updates make the re-embed cheap even when forced by a model migration — only ~$0.20/mo at 1M docs with 1% churn.
  • The embedding_status column (separate from the version column) preserved the ability to mark rows for re-embed without dropping them from search results during the migration window.
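
The version filter with legacy fallback can be sketched in plain Python. The example treats a "miss" as "no rows exist on the preferred version", which is an assumption since this ADR does not define it, and it embeds the query with the same model as the rows being searched because vectors from different models are not comparable. The search function, the toy embedders, and the in-memory rows are all illustrative and are not the /search implementation.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[str, str, List[float]]  # (doc_id, embedding_version, vector)
Embedder = Callable[[str], List[float]]

V1 = "openai-text-embedding-3-small-1536"
V2 = "openai-text-embedding-3-large-3072"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: str, rows: List[Row], embedders: Dict[str, Embedder],
           preferred: str, legacy: str, k: int = 3) -> Tuple[str, List[str]]:
    """Return (version_used, top-k doc ids), trying preferred then legacy."""
    for version in (preferred, legacy):
        candidates = [r for r in rows if r[1] == version]
        if not candidates:
            continue  # miss on this version, fall through to the next
        q = embedders[version](query)  # query embedded with the matching model
        ranked = sorted(candidates, key=lambda r: cosine(q, r[2]), reverse=True)
        return version, [r[0] for r in ranked[:k]]
    return preferred, []


# Toy embedders with tiny dimensions so the example stays readable.
embedders = {V1: lambda q: [1.0, 0.0], V2: lambda q: [1.0, 0.0, 0.0]}
rows = [("a", V1, [1.0, 0.0]), ("b", V1, [0.0, 1.0]), ("c", V2, [1.0, 0.0, 0.0])]

hit = search("query", rows, embedders, preferred=V2, legacy=V1)
fallback = search("query", rows[:2], embedders, preferred=V2, legacy=V1)
print(hit)
print(fallback)
```

A per-version search never mixes vectors of different dimensions in one ranking, at the cost of returning only the already-migrated subset while a migration is in flight.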

When (if ever) to revisit

A future ADR could deprecate the version column when both of these are true:

  1. The team has settled on a single embedding model for >12 months with no known upgrade path.
  2. Multi-tenant model selection has been replaced by a policy engine (e.g. ADR-?-future routes premium tenants to a separate index entirely).

Until then, the version column stays.

References

  • embedding_version.py (the EmbeddingVersion enum + mark_stale_embeddings + reembed_stale + query-time preference filter)
  • incremental_manager.py (ADR-004 — content hashing, separate concern)
  • embedding_pipeline.py (the worker that consumes embedding_status='stale' rows)
  • seed/01_create_tables.sql (current schema with embedding_version + embedding_status columns)
  • ADR-001 (pgvector — vector(N) dimension constraint that forced the migration)
  • ADR-004 (hash-based incremental — makes the forced re-embed cheap)