Architecture Decision Records
Every EXPERT project ships 5 ADRs that document the real engineering tradeoffs behind it: chunking strategies, retrieval fusion, caching tiers, exactly-once delivery, judge cascades. Each project includes one Deprecated decision, with the receipts for why it was reverted. That makes 50 ADRs across 10 projects.
- ADR-001 · Hybrid retrieval (BM25 + dense) with Reciprocal Rank Fusion · Accepted
A RAG system's retrieval layer is the single largest determinant of answer quality. Module 02 ships a Pinecone HNSW index over text-embedding-3-small (1536-dim, cosine). On the seed corpus (4 document
- ADR-002 · Cross-encoder reranking on top-K (precision lever vs latency cost) · Accepted
The hybrid retriever (ADR-001) returns the fused top-50 chunks. The LLM's context window can hold maybe top-10 of those without saturating cost or diluting attention. The question is: which 10?
- ADR-003 · Recursive chunking as default; semantic for high-value docs · Accepted
Chunking is the most consequential decision in a RAG system. The chunk you emit at ingest time is the chunk the LLM sees at query time — there is no recovery from a bad cut. We benchmarked 4 strategie
- ADR-004 · LLM gateway with fallback chain (gpt-4o → gpt-4o-mini) · Accepted
By Module 04 every API handler in the codebase calls the LLM directly via OpenAIClient or AnthropicClient (Module 03's multi-provider client). That works locally but fails three real production needs:
- ADR-005 · Fixed-size chunking (DEPRECATED) · Deprecated
When the M01 ingestion pipeline first shipped, the chunker was a single fixed-size strategy: split every document into 1000-character windows with 200-character overlap. The rationale was reasonable o
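As a rough illustration of the windowing described above, here is a minimal sketch of fixed-size chunking with a 1000-character window and 200-character overlap. The function name `fixed_size_chunks` and its signature are hypothetical and not taken from the M01 pipeline.

```python
def fixed_size_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the defaults mirror the 1000/200 setting in the ADR,
    but the name and signature are assumptions made for illustration.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many characters after the last
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Note that the cut points depend only on character offsets, so a window boundary can fall anywhere, including in the middle of a sentence.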
- ADR-001 · Dual-tier caching: exact-match in front of semantic · Accepted
LLM inference is the dominant variable cost in this platform. At a target load of 50k requests/day, Module 02's instrumentation already confirms ~78% of spend goes to model providers (the rest is Post
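The tier ordering named in this ADR's title (exact-match first, semantic second) can be sketched roughly as follows. Everything here is an assumption for illustration: the class `DualTierCache`, the in-memory stores, the 0.9 similarity threshold, and the letter-count `toy_embed` function, which stands in for a real embedding model.

```python
import hashlib
import math
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DualTierCache:
    """Hypothetical sketch: cheap exact-match lookup in front of a semantic lookup."""

    def __init__(self, embed: Callable[[str], list[float]], threshold: float = 0.9):
        self._embed = embed
        self._threshold = threshold  # assumed value, not from the ADR
        self._exact: dict[str, str] = {}
        self._semantic: list[tuple[list[float], str]] = []

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalise whitespace and case before hashing so trivial variants match.
        return hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()

    def put(self, prompt: str, answer: str) -> None:
        self._exact[self._key(prompt)] = answer
        self._semantic.append((self._embed(prompt), answer))

    def get(self, prompt: str) -> tuple[Optional[str], str]:
        # Tier 1: hash lookup, no embedding call needed.
        hit = self._exact.get(self._key(prompt))
        if hit is not None:
            return hit, "exact"
        # Tier 2: linear scan for the most similar cached prompt.
        query = self._embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self._semantic:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self._threshold:
            return best_answer, "semantic"
        return None, "miss"


def toy_embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: counts of the letters a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts
```

The point of the ordering is that the exact tier answers repeats without paying for an embedding call; only misses fall through to the similarity search.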
- ADR-002 · Three-tier budget hierarchy with fail-open enforcement · Accepted
By Module 04 the platform tracks every token and routes every prompt — but nothing prevents a runaway team or a runaway model from burning the monthly budget in a weekend. Three forces drive the budge
- ADR-003 · Cost-latency-quality routing triangle with explicit fallback chain · Accepted
Module 02 made cost visible. Module 03 cut the bill through caching. Module 04 is where the platform decides _which model_ to use — and that decision is the single largest lever left, because the pric
- ADR-004 · Per-request detail + daily rollup with `ON CONFLICT` upserts · Accepted
Module 02 has to answer two questions on the same data:
- ADR-005 · Sync SQLAlchemy across all hot paths (DEPRECATED) · Deprecated
When Module 02 first shipped, the persistence layer was synchronous SQLAlchemy throughout — models.py declares declarative_base() and sessionmaker() against a sync engine, and every database call in t
- ADR-001 · vLLM continuous batching + PagedAttention over Triton/TensorRT-LLM · Accepted
The serving layer is the single largest determinant of cost-per-request and p99 latency on this platform. We're shipping Mistral-7B-Instruct (7B parameters, 8k context) under a <200ms p99 SLA target w
- ADR-002 · Ray Serve over Kubernetes for autoscaling · Accepted
By M03 we need the inference layer to scale beyond a single GPU replica. The reference scenario is 5 tenants × 2k qpd = 10k qpd with bursty arrival patterns concentrated 9am–4pm ET (market hours). A s
- ADR-003 · Redis semantic cache (HNSW + threshold=0.92) over exact-match LRU · Accepted
M01 ships the vLLM endpoint with no caching. Every request — including near-identical repeats from the same tenant — hits the model. On the FinSight workload (5 tenants asking financial-analyst questi
- ADR-004 · Circuit breaker over naive retry-with-backoff · Accepted
The vLLM engine has three predictable failure modes under production load:
- ADR-005 · Single-instance vLLM serving (DEPRECATED) · Deprecated
When M01 first shipped, the deployment was deliberately simple: one vLLM replica behind FastAPI, single docker-compose service, single A10G GPU.
- ADR-001 · LangGraph chosen over CrewAI / AutoGen / custom orchestrator · Accepted
The agent orchestrator is the most expensive decision in this build to reverse — every worker, every tool call, every checkpoint, every failure-recovery path runs through it. Pick wrong and the M05 ha
- ADR-002 · Redis for orchestrator checkpoints; Postgres only for business data · Accepted
The agent pipeline writes two distinct kinds of state to disk:
- ADR-003 · Hierarchical supervisor-worker topology, not peer-to-peer agents · Accepted
Multi-agent systems can be organized along a continuum:
- ADR-004 · HITL via LangGraph `interrupt_before` + Slack actionable buttons · Accepted
Some agent decisions require a human in the loop:
- ADR-005 · Single global ToolRegistry without RBAC scoping (DEPRECATED) · Deprecated
Original v0 design (ADR-005, originally accepted) had one global ToolRegistry instance shared across all agents. Workers would call registry.get("query_database") and execute the result. This was the simplest poss
- ADR-001 · Use pgvector + HNSW over Qdrant / Pinecone / Weaviate · Accepted
The platform serves hybrid (semantic + lexical + reranked) search over 1M-capable document corpora. The vector index is the hot path on every query — it has to hit the <100 ms P99 budget while staying
- ADR-002 · Hybrid retrieval uses Reciprocal Rank Fusion, not score averaging or learned-sparse · Accepted
Hybrid retrieval merges a dense (semantic) candidate list with a sparse (lexical) candidate list. The dense list comes from cosine similarity over pgvector; the sparse list comes from Postgres ts_rank
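For readers unfamiliar with Reciprocal Rank Fusion, the merge of a dense and a sparse candidate list can be sketched as below. This is a generic illustration of the technique, not the project's code: the function name is hypothetical, and the constant `k = 60` is the value commonly used with RRF, assumed here because the excerpt does not state one.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, conventional default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]   # e.g. ordered by cosine similarity
sparse = ["b", "d", "a"]  # e.g. ordered by a lexical rank score
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only rank positions, it never has to reconcile cosine similarities with lexical scores that live on different scales, which is the usual argument for it over score averaging.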
- ADR-003 · Reranker is a CPU cross-encoder, not LLM-as-judge · Accepted
The hybrid retriever (ADR-001 + ADR-002) returns ~50 candidate documents per query. The reranker's job is to score absolute quality across those 50 and surface a top-K (typically 5–10) for the agent o
- ADR-004 · Index updates use hash-based incremental, not nightly full re-embedding · Accepted
The corpus is mutable: documents are added, edited, and removed continuously. The vector index has to stay current without burning budget or blowing the latency SLA.
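One way to picture hash-based incremental updates: store a content hash per document and re-embed only when the hash changes. The sketch below assumes SHA-256 over the raw text and plain dictionaries for the stored state. The helper names `content_hash` and `plan_index_update` are hypothetical, and the project's actual hashing scheme is not given in this excerpt.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the document text (assumed hashing scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(
    stored_hashes: dict[str, str],  # doc_id -> hash recorded at last embedding
    current_docs: dict[str, str],   # doc_id -> current document text
) -> tuple[list[str], list[str]]:
    """Return (ids to embed, ids to delete) for one incremental pass."""
    to_embed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)  # new or edited
    )
    to_delete = sorted(doc_id for doc_id in stored_hashes if doc_id not in current_docs)
    return to_embed, to_delete


stored = {
    "a": content_hash("alpha"),
    "b": content_hash("beta"),
    "c": content_hash("gamma"),
}
current = {"a": "alpha", "b": "beta v2", "d": "delta"}
to_embed, to_delete = plan_index_update(stored, current)
```

In this toy run only the edited document `b` and the new document `d` would be sent to the embedding model; the unchanged document `a` costs nothing, and `c` is scheduled for removal.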
- ADR-005 · Single fixed embedding model assumption (DEPRECATED) · Deprecated
- ADR-001 · SystemContract as the platform's north star (declared upfront, not derived) · Accepted
A full-stack RAG platform makes hundreds of design decisions — chunk size, embedding model, retrieval strategy, judge model, cache TTL, rate limits. Without a single source of truth for _what good loo
- ADR-002 · pgvector + HNSW + hybrid RRF + cross-encoder rerank, not a dedicated vector DB · Accepted
The retrieval layer is the single highest-leverage decision in a RAG platform — get it wrong and every downstream layer (LLM generation, eval, observability) is debugging the wrong problem. Vendor cho
- ADR-003 · 4-class query router with confidence threshold + ambiguous fallback · Accepted
Not every user query should hit the same pipeline. A factual question ("What is the refund policy?") is best answered with RAG. An analytical question ("How many tickets did tenant X open last month?"
- ADR-004 · Three-level failure cascade: RAG → LLM-only → cached → honest error · Accepted
In production, RAG components fail. The retriever can time out. The LLM API can rate-limit. The reranker can OOM. The vector index can lock during a VACUUM. Every one of these failures has happened in
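The cascade order in this ADR's title (RAG, then LLM-only, then cached, then an honest error) can be sketched as a loop over handlers. The function `answer_with_cascade`, the handler signature, and the error message are all hypothetical. A real implementation would presumably also log each failure and enforce per-tier timeouts.

```python
from typing import Callable, Optional

Handler = Callable[[str], Optional[str]]

# Assumed wording; the ADR excerpt does not give the actual message.
HONEST_ERROR = "Sorry, we could not answer this right now. Please try again."


def answer_with_cascade(query: str, tiers: list[tuple[str, Handler]]) -> tuple[str, str]:
    """Try each tier in order and return (tier_name, answer) from the first that works.

    A tier "fails" if it raises or returns None (for example, a cache miss).
    If every tier fails, return an explicit error instead of a made-up answer.
    """
    for name, handler in tiers:
        try:
            result = handler(query)
        except Exception:
            continue  # fall through to the next, cheaper tier
        if result is not None:
            return name, result
    return "error", HONEST_ERROR


def rag(query: str) -> Optional[str]:
    raise TimeoutError("retriever timed out")  # simulated failure


def llm_only(query: str) -> Optional[str]:
    return "answer: " + query


def cached(query: str) -> Optional[str]:
    return None  # simulated cache miss


tier_name, answer = answer_with_cascade(
    "hi", [("rag", rag), ("llm_only", llm_only), ("cached", cached)]
)
```

Returning the tier name alongside the answer lets callers record how degraded a given response was.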
- ADR-005 · Multi-tenant retrieval via row-filter on shared index (DEPRECATED) · Deprecated
Original v0 design (ADR-005, originally accepted) had one shared chunks table with a tenant_id column and a WHERE tenant_id = ? filter on every retrieval query. This was the simplest possible design a
- ADR-001 · Tier-A/Tier-B labeling protocol for the gold set · Accepted
A working LLM eval framework is worth nothing without trustworthy labeled examples to score against. The gold set is the foundation everything else gates on — multi-judge calibration in M03, RAGAS in
- ADR-002 · recall@k + MRR for retrieval gating; nDCG rejected for v1 · Accepted
The eval framework gates RAG releases on retrieval quality. The single biggest cause of "the LLM hallucinated" turns out to be the retriever returning the wrong chunk — so the metric we put on the CI
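For reference, the two gating metrics named in this ADR's title can be computed as in the sketch below. These are the standard textbook definitions. The function names and the input shapes (a ranked list of ids plus a set of relevant ids per query) are assumptions, not the framework's API.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean over queries of 1 / rank of the first relevant id (0 if none is found)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Toy example: one relevant chunk out of two is found in the top 2.
recall = recall_at_k(["x", "a", "y", "b"], {"a", "b"}, k=2)
# First query finds its chunk at rank 2, second at rank 1.
mrr = mean_reciprocal_rank([(["x", "a"], {"a"}), (["b", "c"], {"b"})])
```

Recall@k answers whether the right chunk made it into the context at all, while MRR also rewards placing it near the top.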
- ADR-003 · Judge model selection: Claude Sonnet default, GPT-4 adversarial-only, Llama swap path · Accepted
Multi-judge LLM-as-judge ensembles need a default model. The pick has three knock-on effects:
- ADR-004 · Multi-judge consensus is weighted average, not majority vote · Accepted
Three judges per case. They disagree. We need to collapse three scores into one consensus number that downstream consumers (dashboard, CI gate, cost dashboard) read as "the score".
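Collapsing three judge scores into one number by weighted average might look like the sketch below. The judge names and weights are invented for illustration. How the framework actually derives its weights is not stated in this excerpt.

```python
def weighted_consensus(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-judge scores.

    Hypothetical helper: both dicts are keyed by judge name and must share
    the same keys, and the weights must sum to a positive number.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same judges")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[judge] * weights[judge] for judge in scores) / total_weight


# Invented example values: three judges that disagree.
scores = {"judge_a": 0.9, "judge_b": 0.6, "judge_c": 0.3}
weights = {"judge_a": 2.0, "judge_b": 1.0, "judge_c": 1.0}
consensus = weighted_consensus(scores, weights)  # (1.8 + 0.6 + 0.3) / 4
```

Unlike a majority vote, the average keeps the size of each disagreement: a single judge scoring far from the others moves the consensus in proportion to its weight rather than being outvoted entirely.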
- ADR-005 · Single shared `test_cases` table for all suites (DEPRECATED) · Deprecated
Original v0 design (ADR-005, originally accepted) had one shared test_cases table with a suite_id foreign key. Tagging used a many-to-many junction test_case_tags against a global tags table. This was
- ADR-001 · Use aiohttp + custom crawler over Scrapy / requests-html · Accepted
The pipeline ingests 1k+ pages of web data per project run, scaling to 1M-doc-capable batch ingestion. The crawler is the front of the pipeline — every downstream module (dedup, quality, tokenization,
- ADR-002 · Dedup is MinHash + LSH (datasketch), not hash-only or embedding-similarity · Accepted
The corpus contains exact duplicates (mirrored sites, RSS reposts) AND near-duplicates (a 95%-identical article republished with a different intro paragraph). LLM training-data quality depends on remo
- ADR-003 · Hybrid tokenization: tiktoken default + custom BPE pedagogical · Accepted
LLM training-data tokenization has two distinct purposes in this project: (1) production token-counting + sequence packing for the actual dataset construction, and (2) pedagogical understanding of how
- ADR-004 · Orchestration is Ray + Airflow, not Spark / Dask / single-node cron · Accepted
The pipeline has two distinct orchestration concerns:
- ADR-005 · Pinecone-only vector backend (DEPRECATED) · Deprecated
- ADR-001 · Use Feast over Tecton / Hopsworks / DIY for the feature-store layer · Accepted
The platform must serve point-in-time-correct features for training and sub-100 ms online lookups for inference, across multiple models and teams. Training-serving skew is the failure mode every previ
- ADR-002 · Online store is Redis, not DynamoDB or Postgres direct · Accepted
The online store serves features at inference time, on the request path. Latency budget for BentoML.predict() is <50 ms P99 end-to-end across the gateway, the feature lookup, the model forward pass, a
- ADR-003 · Tracking + data versioning is MLflow + DVC, not W&B / Neptune / ClearML · Accepted
The platform must answer two related but distinct questions:
- ADR-004 · Model serving is BentoML, not TorchServe / Seldon / FastAPI-by-hand · Accepted
Module 03 deploys the registered scikit-learn / XGBoost churn model behind a REST API on Kubernetes. The serving layer must:
- ADR-005 · Event-driven auto-retraining via drift hook (DEPRECATED) · Deprecated
- ADR-001 · Use Anthropic Claude API over self-hosted inference for v1 · Accepted
The platform must answer questions over tenant-private documents under strict compliance constraints (PII redaction, audit trail, RBAC). The inference layer is a major cost driver and the most operati
- ADR-002 · Multi-tenant isolation via Postgres RLS, not schema-per-tenant · Accepted
The platform serves multiple tenants from a single Postgres instance with pgvector for retrieval. Tenant isolation is a hard requirement — a query from acme must never return rows belonging to northwi
- ADR-003 · Policy engine is Python rules + Redis store, not OPA/Rego · Accepted
The platform must enforce dynamic, per-tenant policies on every request: which actions are allowed, which require approval, which must mask outputs. Policies change frequently — a tenant signs a new D
- ADR-004 · Approval queue is Redis with 24h TTL, not Temporal workflow · Accepted
When a policy engine returns require_approval, the request is parked until a human reviewer accepts or rejects it. We need durability (restarts must not lose pending approvals), a clear TTL (stale app
- ADR-005 · Single shared `documents` table for all tenants (DEPRECATED) · Deprecated