ADR-001: Dual-tier caching: exact-match in front of semantic | AI Cost Optimization

Context

LLM inference is the dominant variable cost in this platform. At a target load of 50k requests/day, Module 02's instrumentation already confirms ~78% of spend goes to model providers (the rest is Postgres + Redis + observability). A non-trivial fraction of those requests are repeats — classified questions, template-driven completions, identical retrievals — and a larger fraction are near-duplicates of recent queries. We had three options for cutting the provider bill before touching the model itself:

1. Exact-match cache only. Hash the prompt; if seen, return the cached response. Sub-millisecond Redis lookup. Coverage 30–60% on workloads with FAQ-like distribution; collapses to < 5% on free-form chat.
2. Semantic cache only. Embed the prompt; compare against a Redis store of prior embeddings; return the cached response if cosine similarity ≥ 0.92. Coverage 40%+ on free-form workloads, but every request costs an embedding call (~$0.02 per 1M tokens) and the lookup adds 10–15 ms p99.
3. Dual-tier: exact-match in front of semantic. Try the cheap exact-match first; only fall through to the embedding lookup on a miss.

Option 1 alone leaves money on the table on free-form workloads. Option 2 alone adds latency and embedding cost to every request — including the ones the exact-match cache would have handled for free.

Decision

We adopt dual-tier caching with exact-match in front of semantic.

# src/cache/cache.py
async def lookup(prompt: str) -> CacheHit | None:
    # Tier 1: O(1) hash lookup, ~3 ms p99
    if hit := await exact_cache.get(prompt_hash(prompt)):
        return hit

    # Tier 2: embedding + cosine similarity, ~15 ms p99
    if hit := await semantic_cache.lookup(prompt, threshold=0.92):
        return hit

    return None

The exact-match tier uses a SHA-256 of the normalised prompt as the Redis key. The semantic tier uses text-embedding-3-small (the cheap embedding model; ADR-003 covers the embedding choice) and an HNSW-style approximate nearest-neighbour search over recent embeddings, capped to a 7-day rolling window. Both tiers start from the same TTL policy (24 h hot / 7 d warm), tunable per tier, and report into a unified hit-rate metric.
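To make the two tiers concrete, here is a minimal in-memory sketch rather than the project's implementation. The normalisation rules (NFKC, whitespace collapse, case-folding), the class name ToyDualTierCache, and the caller-supplied embed function are all assumptions for illustration; the real semantic tier uses an index in Redis, whereas this sketch does a brute-force scan.

```python
from __future__ import annotations

import hashlib
import math
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, Sequence


def normalise(prompt: str) -> str:
    """Assumed normalisation: NFKC, whitespace collapsed, case-folded."""
    text = unicodedata.normalize("NFKC", prompt)
    return " ".join(text.split()).casefold()


def prompt_hash(prompt: str) -> str:
    """SHA-256 hex digest of the normalised prompt (the exact-match key)."""
    return hashlib.sha256(normalise(prompt).encode("utf-8")).hexdigest()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToyDualTierCache:
    embed: Callable[[str], Sequence[float]]  # hypothetical embedding function
    threshold: float = 0.92
    exact: dict[str, str] = field(default_factory=dict)
    semantic: list[tuple[Sequence[float], str]] = field(default_factory=list)

    def store(self, prompt: str, response: str) -> None:
        self.exact[prompt_hash(prompt)] = response
        self.semantic.append((self.embed(prompt), response))

    def lookup(self, prompt: str) -> tuple[str, str] | None:
        # Tier 1: hash lookup, no embedding call.
        hit = self.exact.get(prompt_hash(prompt))
        if hit is not None:
            return ("exact", hit)
        # Tier 2: embed once, then brute-force scan for the closest entry.
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.semantic:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return ("semantic", best_response)
        return None
```

Case-folding raises the exact-match hit rate but is only safe when prompt casing carries no meaning, so treat that rule in particular as a choice to validate against the workload.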

Tradeoffs we accept

Lever	Alternative (exact-only)	Chosen (dual-tier)
Cache miss cost	"Free miss" with exact-only	Pay 1 embedding call per miss
p99 latency on first-time prompts	3 ms (exact)	18 ms (exact + semantic)
Implementation complexity	One Redis pattern	Two patterns + a fall-through
Operational burden	One TTL knob	Two TTL knobs + similarity threshold

The latency hit on first-time prompts is the largest concrete cost. We accept it because the dominant hot path on this workload is repeat-heavy: ~70% of requests resolve in tier 1 with no semantic lookup, and the remaining 30% are predominantly queries with a tight latency budget, where the ~15 ms embedding lookup is still far faster than the model call it can replace.

Consequences (positive)

Cost coverage stacks. Tier 1 catches FAQ-shape repeats; tier 2 catches paraphrases. End-to-end cache hit rate moves from 30–60% (exact-only) to a measured 65–75% in the seed dataset.
Cheap path stays cheap. Repeats — the most common case — never pay an embedding call. The optimization surface is exactly where the misses land.
Single hit-rate metric, single dashboard. cache_hit_rate{tier=$tier} splits the two for diagnostics but rolls up into one cost-savings line.
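As a rough illustration of how a per-tier metric can roll up into one number, the sketch below uses an in-process counter as a stand-in for a labelled metric such as cache_hit_rate{tier=...}. The class name CacheMetrics and the choice of "all requests" as the denominator are assumptions, not the project's definitions.

```python
from __future__ import annotations

from collections import Counter


class CacheMetrics:
    """In-process stand-in for a counter labelled by tier."""

    def __init__(self) -> None:
        self.outcomes = Counter()

    def record(self, tier: str | None) -> None:
        # tier is "exact", "semantic", or None for a full miss.
        self.outcomes[tier or "miss"] += 1

    def hit_rate(self, tier: str | None = None) -> float:
        """Hit rate over all requests; tier=None gives the rolled-up rate."""
        total = sum(self.outcomes.values())
        if total == 0:
            return 0.0
        if tier is None:
            hits = total - self.outcomes["miss"]
        else:
            hits = self.outcomes[tier]
        return hits / total
```

With this shape the per-tier rates add up to the rolled-up rate, which is what lets one dashboard line carry the cost-savings story while the split stays available for diagnostics.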

Consequences (negative)

Two TTL knobs to tune. The semantic tier is meaningfully more sensitive to TTL than the exact tier — too long and stale answers leak, too short and the savings collapse. We mitigate with a weekly automated sweep of the hit rate vs threshold curve and a runbook entry (runbook/cache-tuning.md).
Embedding model is a vendor dependency. If text-embedding-3-small pricing or behaviour changes, the semantic tier is exposed. We mitigate by keeping the embedding call behind an interface (embeddings.py) so the swap to an open-weight embedding (e.g. bge-small-en) is contained.
Threshold drift. Cosine 0.92 is calibrated to the seed dataset. As workloads change, we re-calibrate. The runbook documents the procedure.
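The runbook procedure is not reproduced here, but one plausible shape for re-calibrating the threshold is sketched below, assuming a labelled sample of (cosine similarity, is the cached answer acceptable) pairs is available. The function name calibrate_threshold and the min_precision default are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def calibrate_threshold(
    pairs: Iterable[Tuple[float, bool]],
    candidates: Iterable[float],
    min_precision: float = 0.99,  # hypothetical target, not from the ADR
) -> Optional[float]:
    """Return the lowest candidate threshold whose precision meets the target.

    pairs holds (similarity, acceptable) for labelled prompt pairs. A lower
    threshold accepts more lookups (higher hit rate), so the lowest passing
    candidate keeps the most savings. Returns None if no candidate passes.
    """
    labelled = list(pairs)
    for threshold in sorted(candidates):
        accepted = [ok for similarity, ok in labelled if similarity >= threshold]
        if accepted and sum(accepted) / len(accepted) >= min_precision:
            return threshold
    return None
```

Precision is not guaranteed to rise monotonically with the threshold on a small sample, so the result is worth checking against the full hit-rate curve before changing the production value.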

Reversal plan

If the semantic tier's marginal hit rate drops below 15% (i.e. tier 1 is catching almost everything), the cost of running embeddings on every miss exceeds the savings. The reversal is mechanical:

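1. Set a feature flag (CACHE_SEMANTIC_ENABLED=false) in .env.
2. src/cache/cache.py short-circuits past the semantic call on lookup.
3. Drain the semantic Redis namespace. Note that redis-cli --scan --pattern 'semantic:*' only lists the matching keys, so pipe them to a delete, e.g. redis-cli --scan --pattern 'semantic:*' | xargs redis-cli unlink.
4. Tear down the embedding pre-warm job in migrations/.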
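The flag-driven short-circuit in the steps above could look roughly like the following. This is a sketch under assumptions: the flag is read from the process environment (how .env gets loaded is not shown), the tiers are passed in as callables named exact_get and semantic_lookup for illustration, and an unset flag is treated as enabled.

```python
import asyncio
import os

_FALSE_VALUES = {"false", "0", "no", "off"}


def semantic_enabled() -> bool:
    """True unless CACHE_SEMANTIC_ENABLED is set to a false-like value."""
    raw = os.environ.get("CACHE_SEMANTIC_ENABLED", "true")
    return raw.strip().lower() not in _FALSE_VALUES


async def lookup(prompt, exact_get, semantic_lookup):
    # Tier 1 always runs.
    hit = await exact_get(prompt)
    if hit is not None:
        return hit
    # Tier 2 is skipped entirely when the flag is off, so no embedding
    # call is made on a miss.
    if not semantic_enabled():
        return None
    return await semantic_lookup(prompt)
```

Reading the flag on each lookup keeps the reversal to a configuration change; caching the value at startup would be cheaper but would require a restart to take effect.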

Estimated effort: ~2 engineer-days. The exact-match tier remains intact.
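The reversal trigger is a break-even argument, which can be sketched as simple arithmetic. The assumptions here are that "marginal hit rate" means the fraction of tier-1 misses that tier 2 resolves, that every tier-1 miss pays a fixed overhead (the embedding call plus any amortised infrastructure), and that every tier-2 hit saves one model call. The function names and the dollar figures in the usage note are hypothetical, not measured values.

```python
def break_even_hit_rate(overhead_per_miss: float, saving_per_hit: float) -> float:
    """Marginal hit rate at which tier 2 pays for itself."""
    if saving_per_hit <= 0:
        raise ValueError("saving_per_hit must be positive")
    return overhead_per_miss / saving_per_hit


def net_saving_per_tier1_miss(
    marginal_hit_rate: float, overhead_per_miss: float, saving_per_hit: float
) -> float:
    """Expected saving per tier-1 miss; negative means tier 2 costs money."""
    return marginal_hit_rate * saving_per_hit - overhead_per_miss
```

For example, with an illustrative overhead of $0.0003 per tier-1 miss and a saving of $0.002 per avoided model call, break_even_hit_rate gives 0.15: below that marginal hit rate the semantic tier is a net cost, above it a net saving.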

References

src/cache/cache.py — dual-tier lookup orchestrator
src/cache/semantic.py — embedding + cosine search
src/cache/cache_types.py — CacheHit / CacheTier dataclasses
src/cache/comparison.py — exact vs semantic A/B harness used in M03
ADR-003 (embedding-model choice for the semantic tier)
runbook/cost-incident-response.py — playbook entry "Cache hit rate < 40%"