# ADR-001 — Dual-tier caching: exact-match in front of semantic

- **Status:** Accepted
- **Date:** 2026-04-12
- **Module:** 03 — Caching, Prompt Optimization & Cost Reduction
- **Stakeholders:** platform engineer, cost owner, SRE on-call

## Context

LLM inference is the dominant variable cost in this platform. At a target load
of 50k requests/day, Module 02's instrumentation already confirms ~78% of
spend goes to model providers (the rest is Postgres + Redis + observability).
A non-trivial fraction of those requests are repeats — classified questions,
template-driven completions, identical retrievals — and a larger fraction are
near-duplicates of recent queries. We had three options for cutting the
provider bill before touching the model itself:

1. **Exact-match cache only.** Hash the prompt; if seen, return the cached
   response. A single Redis lookup costs ~3 ms. Coverage is 30–60% on
   workloads with an FAQ-like distribution but collapses to < 5% on free-form
   chat.
2. **Semantic cache only.** Embed the prompt; compare against a Redis store of
   prior embeddings; return the cached response if cosine similarity ≥ 0.92.
   Coverage is 40%+ on free-form workloads, but every lookup, hit or miss,
   costs an embedding call (~$0.02/1M tokens) and adds 10–15 ms p99.
3. **Dual-tier — exact-match in front of semantic.** Try the cheap exact-match
   first; only fall through to the embedding lookup on miss.

Option 1 alone leaves money on the table on free-form workloads. Option 2 alone
adds latency and embedding cost to every request — including the ones the
exact-match cache would have handled for free.
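
The threshold test at the heart of option 2 can be sketched in a few lines. This is a brute-force, in-memory illustration only: `Entry`, `cosine`, and `semantic_lookup` are hypothetical names, the vectors are toy 2-d values rather than real embeddings, and the production path described in this ADR uses Redis rather than a Python list.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Entry:
    embedding: list[float]  # vector returned by the embedding model
    response: str           # cached model response


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(
    query: list[float], store: list[Entry], threshold: float = 0.92
) -> str | None:
    """Return the most similar entry's response if it clears the threshold."""
    best = max(store, key=lambda e: cosine(query, e.embedding), default=None)
    if best is not None and cosine(query, best.embedding) >= threshold:
        return best.response
    return None
```

A near-duplicate query returns the cached response; anything below the threshold falls through to the model call.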

## Decision

We adopt **dual-tier caching** with exact-match in front of semantic.

```python
# src/cache/cache.py
async def lookup(prompt: str) -> CacheHit | None:
    # Tier 1: O(1) hash lookup, ~3 ms p50
    if hit := await exact_cache.get(prompt_hash(prompt)):
        return hit

    # Tier 2: embedding + cosine similarity, ~15 ms p99
    if hit := await semantic_cache.lookup(prompt, threshold=0.92):
        return hit

    return None
```

The exact-match tier uses a SHA-256 of the normalised prompt as the Redis key.
The semantic tier uses `text-embedding-3-small` (the cheap embedding model;
ADR-003 covers the embedding choice) and an HNSW-style approximate
nearest-neighbour search over recent embeddings, capped to a 7-day rolling
window. Both tiers use the same TTL scheme (24 h hot / 7 d warm), configured
per tier, and report into a unified hit-rate metric.
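
A minimal sketch of the tier-1 key derivation follows. The ADR does not spell out the normalisation rules, so the ones here (Unicode NFC, trimming, collapsing whitespace runs) are assumptions, as is the `exact:` key prefix, which simply mirrors the `semantic:` namespace used elsewhere in this document.

```python
import hashlib
import unicodedata


def normalise(prompt: str) -> str:
    """Assumed normalisation: NFC form, trimmed, whitespace runs collapsed."""
    text = unicodedata.normalize("NFC", prompt)
    return " ".join(text.split())


def prompt_hash(prompt: str) -> str:
    """Redis key for the exact tier: SHA-256 hex digest of the normalised prompt."""
    digest = hashlib.sha256(normalise(prompt).encode("utf-8")).hexdigest()
    return f"exact:{digest}"  # "exact:" prefix is an assumption
```

Case is deliberately left untouched in this sketch, because lowercasing can change the meaning of a prompt (identifiers, code, proper nouns). More aggressive normalisation raises the tier-1 hit rate at the risk of conflating prompts that should get different answers.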

## Tradeoffs we accept

| Lever                             | Alternative                 | Chosen                               |
| --------------------------------- | --------------------------- | ------------------------------------ |
| Cache miss cost                   | "Free miss" with exact-only | Pay 1 embedding call per tier-1 miss |
| p99 latency on first-time prompts | 3 ms (exact)                | 18 ms (exact + semantic)             |
| Implementation complexity         | One Redis pattern           | Two patterns + a fall-through        |
| Operational burden                | One TTL knob                | Two TTL knobs + similarity threshold |

The latency hit on first-time prompts is the largest concrete cost. We accept
it because the dominant hot path on this workload is repeat-heavy: ~70% of
requests resolve in tier 1 with no semantic lookup, and the remaining 30% are
predominantly low-latency-budget queries where the embedding lookup is still
faster than a model call.
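
To put numbers on the repeat-heavy argument, the sketch below counts daily embedding calls for a semantic-only design versus the dual-tier design, using this ADR's own figures (50k requests/day, ~70% tier-1 hit rate). The function name is illustrative and not part of `src/cache/`.

```python
def embedding_calls_per_day(requests: int, tier1_hit_rate: float) -> int:
    """Embedding calls when only tier-1 misses reach the semantic tier."""
    return round(requests * (1.0 - tier1_hit_rate))


# Semantic-only is the special case with no tier 1 in front (hit rate 0).
semantic_only = embedding_calls_per_day(50_000, 0.0)
# Dual-tier: only the ~30% of requests that miss tier 1 are embedded.
dual_tier = embedding_calls_per_day(50_000, 0.70)
```

Under these figures the dual-tier design makes about 15k embedding calls a day instead of 50k.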

## Consequences (positive)

- **Cost coverage stacks.** Tier 1 catches FAQ-shape repeats; tier 2 catches
  paraphrases. End-to-end cache hit rate moves from 30–60% (exact-only) to a
  measured 65–75% in the seed dataset.
- **Cheap path stays cheap.** Repeats — the most common case — never pay an
  embedding call. The optimization surface is exactly where the misses land.
- **Single hit-rate metric, single dashboard.** `cache_hit_rate{tier=$tier}`
  splits the two for diagnostics but rolls up into one cost-savings line.
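
How the per-tier split rolls up into one number can be sketched as below. The outcome labels (`exact`, `semantic`, `miss`) and the function name are assumptions for illustration; the ADR only names the `tier` label on `cache_hit_rate`.

```python
from collections import Counter

# Hypothetical outcome labels recorded once per lookup.
OUTCOMES = ("exact", "semantic", "miss")


def hit_rates(outcomes: list[str]) -> dict[str, float]:
    """Per-tier hit rates plus the rolled-up total, as fractions of all lookups."""
    counts = Counter(outcomes)
    total = sum(counts[o] for o in OUTCOMES)
    if total == 0:
        return {"exact": 0.0, "semantic": 0.0, "total": 0.0}
    exact = counts["exact"] / total
    semantic = counts["semantic"] / total
    return {"exact": exact, "semantic": semantic, "total": exact + semantic}
```

Both per-tier rates use the same denominator (all lookups), so they sum to the end-to-end hit rate.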

## Consequences (negative)

- **Two TTL knobs to tune.** The semantic tier is meaningfully more sensitive
  to TTL than the exact tier — too long and stale answers leak, too short and
  the savings collapse. We mitigate with a weekly automated sweep of the hit
  rate vs threshold curve and a runbook entry (`runbook/cache-tuning.md`).
- **Embedding model is a vendor dependency.** If `text-embedding-3-small`
  pricing or behaviour changes, the semantic tier is exposed. We mitigate by
  keeping the embedding call behind an interface (`embeddings.py`) so the
  swap to an open-weight embedding (e.g. `bge-small-en`) is contained.
- **Threshold drift.** Cosine 0.92 is calibrated to the seed dataset. As
  workloads change, we re-calibrate. The runbook documents the procedure.
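
One way a threshold re-calibration could look is sketched below. It is not the runbook procedure, which this ADR does not reproduce. It assumes a labelled sample of `(similarity, same_intent)` pairs, where `same_intent` records whether a human judged the cached answer acceptable for the new prompt. That data shape and the function name are hypothetical.

```python
def sweep(
    pairs: list[tuple[float, bool]], thresholds: list[float]
) -> list[dict[str, float]]:
    """For each threshold, report the hit rate and the share of hits that are wrong."""
    rows = []
    for t in thresholds:
        # Keep the label of every pair that would count as a cache hit at t.
        hits = [ok for sim, ok in pairs if sim >= t]
        wrong = sum(1 for ok in hits if not ok)
        rows.append({
            "threshold": t,
            "hit_rate": len(hits) / len(pairs) if pairs else 0.0,
            "false_hit_rate": wrong / len(hits) if hits else 0.0,
        })
    return rows
```

Raising the threshold lowers both columns, so the calibration question is which false-hit rate the workload can tolerate.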

## Reversal plan

If the semantic tier's marginal hit rate drops below 15% (i.e. tier 1 is
catching almost everything), the cost of running embeddings on every miss
exceeds the savings. The reversal is mechanical:

1. Set a feature flag (`CACHE_SEMANTIC_ENABLED=false`) in `.env`.
2. `src/cache/cache.py` short-circuits past the semantic call on lookup.
3. Drain the semantic Redis namespace, e.g.
   `redis-cli --scan --pattern 'semantic:*' | xargs redis-cli unlink`
   (`--scan` on its own only lists the matching keys).
4. Tear down the embedding pre-warm job in `migrations/`.

Estimated effort: ~2 engineer-days. The exact-match tier remains intact.
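
Steps 1 and 2 could be wired up roughly as follows. This sketch reads the process environment, so it assumes something has already loaded `.env` into it. The helper name, the set of accepted falsy spellings, and the default of "enabled when unset" are all assumptions, not the behaviour of `src/cache/cache.py`.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def semantic_enabled(env: Mapping[str, str] | None = None) -> bool:
    """Return False only when CACHE_SEMANTIC_ENABLED is set to a falsy value."""
    source = os.environ if env is None else env
    # Assumed default: the semantic tier stays on if the flag is unset.
    value = source.get("CACHE_SEMANTIC_ENABLED", "true")
    return value.strip().lower() not in {"false", "0", "no", "off"}
```

The lookup would then skip tier 2 whenever `semantic_enabled()` returns `False`, leaving the exact-match path unchanged.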

## References

- `src/cache/cache.py` — dual-tier lookup orchestrator
- `src/cache/semantic.py` — embedding + cosine search
- `src/cache/cache_types.py` — `CacheHit` / `CacheTier` dataclasses
- `src/cache/comparison.py` — exact vs semantic A/B harness used in M03
- ADR-003 (embedding-model choice for the semantic tier — see related)
- `runbook/cost-incident-response.md` — playbook entry "Cache hit rate < 40%"