Context
LLM inference is the dominant variable cost in this platform. At a target load of 50k requests/day, Module 02's instrumentation already confirms ~78% of spend goes to model providers (the rest is Postgres + Redis + observability). A non-trivial fraction of those requests are repeats — classified questions, template-driven completions, identical retrievals — and a larger fraction are near-duplicates of recent queries. We had three options for cutting the provider bill before touching the model itself:
- Exact-match cache only. Hash the prompt; if seen, return the cached response. Sub-millisecond Redis lookup. Coverage 30–60% on workloads with FAQ-like distribution; collapses to < 5% on free-form chat.
- Semantic cache only. Embed the prompt; compare against a Redis store of prior embeddings; return the cached response if cosine similarity ≥ 0.92. Coverage 40%+ on free-form workloads, but every miss costs an embedding call (~$0.02/1M tokens at list price for text-embedding-3-small) and the lookup adds 10–15 ms p99.
- Dual-tier — exact-match in front of semantic. Try the cheap exact-match first; only fall through to the embedding lookup on miss.
Option 1 alone leaves money on the table on free-form workloads. Option 2 alone adds latency and embedding cost to every request — including the ones the exact-match cache would have handled for free.
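To make the semantic option concrete, the similarity test it relies on can be sketched in a few lines. This is a minimal, brute-force illustration, not the production lookup: the function names `cosine_similarity` and `best_match` and the in-memory `stored` dict are assumptions made for the example, and only the 0.92 threshold comes from the text above.

```python
from __future__ import annotations

import math
from typing import Sequence

SIMILARITY_THRESHOLD = 0.92  # threshold quoted for the semantic cache


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def best_match(
    query: Sequence[float],
    stored: dict[str, Sequence[float]],  # hypothetical: cache key -> embedding
    threshold: float = SIMILARITY_THRESHOLD,
) -> str | None:
    """Return the key of the most similar stored vector at or above threshold."""
    best_key, best_score = None, threshold
    for key, vec in stored.items():
        score = cosine_similarity(query, vec)
        if score >= best_score:
            best_key, best_score = key, score
    return best_key
```

A linear scan like this costs one comparison per stored embedding, which is part of why the semantic lookup is slower than a single hash lookup.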
Decision
We adopt dual-tier caching with exact-match in front of semantic.
    # src/cache/cache.py
    async def lookup(prompt: str) -> CacheHit | None:
        # Tier 1: O(1) hash lookup, ~3 ms p50
        if hit := await exact_cache.get(prompt_hash(prompt)):
            return hit

        # Tier 2: embedding + cosine similarity, ~15 ms p99
        if hit := await semantic_cache.lookup(prompt, threshold=0.92):
            return hit

        return None
The exact-match tier uses a SHA-256 of the normalised prompt as the Redis key.
The semantic tier uses text-embedding-3-small (the cheap embedding model;
ADR-003 covers the embedding choice) and an HNSW-style sweep over recent
embeddings, capped to a 7-day rolling window. Both tiers share a single TTL
policy (24 h hot / 7 d warm) and a unified hit-rate metric.
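The exact-match key derivation can be sketched as follows. The ADR does not spell out what "normalised" means, so the rules here (Unicode NFKC, case-folding, whitespace collapsing) and the `exact:` key prefix are assumptions for illustration only; `prompt_hash` is the name used in the lookup code, while `normalise_prompt` is a hypothetical helper.

```python
import hashlib
import unicodedata


def normalise_prompt(prompt: str) -> str:
    """Assumed normalisation: NFKC, case-fold, collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", prompt)
    return " ".join(text.casefold().split())


def prompt_hash(prompt: str) -> str:
    """SHA-256 of the normalised prompt, formatted as a Redis key."""
    digest = hashlib.sha256(normalise_prompt(prompt).encode("utf-8")).hexdigest()
    return f"exact:{digest}"  # the "exact:" prefix is hypothetical
```

Case-folding raises the tier 1 hit rate but is only safe if prompt casing never changes the expected answer; a stricter normalisation may be the better choice for code-heavy or case-sensitive prompts.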
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Cache miss cost | "Free miss" with exact-only | Pay 1 embedding call per miss |
| p99 latency on first-time prompts | 3 ms (exact) | 18 ms (exact + semantic) |
| Implementation complexity | One Redis pattern | Two patterns + a fall-through |
| Operational burden | One TTL knob | Two TTL knobs + similarity threshold |
The latency hit on first-time prompts is the largest concrete cost. We accept it because the dominant hot path on this workload is repeat-heavy: ~70% of requests resolve in tier 1 with no semantic lookup, and the remaining 30% are predominantly low-latency-budget queries where the embedding lookup is still faster than a model call.
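The cost side of the same tradeoff (one embedding call per tier 1 miss) reduces to a break-even condition: on each tier 1 miss we pay one embedding call, and with probability equal to the tier 2 hit rate we avoid one model call. A minimal sketch, assuming the tier 2 hit rate is measured as a fraction of tier 1 misses; both function names are hypothetical and no cost figures from this ADR are baked in.

```python
def semantic_tier_net_saving(
    tier2_hit_rate: float,   # fraction of tier 1 misses that hit in tier 2
    model_call_cost: float,  # average cost of one provider call
    embedding_cost: float,   # cost of embedding one prompt
) -> float:
    """Expected saving per tier 1 miss from running the semantic tier."""
    return tier2_hit_rate * model_call_cost - embedding_cost


def break_even_hit_rate(model_call_cost: float, embedding_cost: float) -> float:
    """Tier 2 hit rate at which the semantic tier exactly pays for itself."""
    if model_call_cost <= 0:
        raise ValueError("model_call_cost must be positive")
    return embedding_cost / model_call_cost
```

Feeding measured per-request costs from the Module 02 instrumentation into `break_even_hit_rate` gives a floor for the tier 2 hit rate; this ignores Redis and latency costs, so the practical floor is somewhat higher.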
Consequences (positive)
- Cost coverage stacks. Tier 1 catches FAQ-shape repeats; tier 2 catches paraphrases. End-to-end cache hit rate moves from 30–60% (exact-only) to a measured 65–75% in the seed dataset.
- Cheap path stays cheap. Repeats — the most common case — never pay an embedding call. The optimization surface is exactly where the misses land.
- Single hit-rate metric, single dashboard. `cache_hit_rate{tier=$tier}` splits the two for diagnostics but rolls up into one cost-savings line.
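The per-tier split and the roll-up can be illustrated with a small in-process counter. This is a sketch of the bookkeeping only, not the metrics pipeline itself: `HitRateTracker` is a hypothetical class, and the tier labels `"exact"` and `"semantic"` are assumed values for the `tier` label.

```python
from __future__ import annotations

from collections import Counter


class HitRateTracker:
    """Counts cache hits per tier and reports per-tier or combined hit rate."""

    def __init__(self) -> None:
        self.hits: Counter[str] = Counter()
        self.requests = 0

    def record(self, tier: str | None) -> None:
        """Record one request; tier is None when both tiers missed."""
        self.requests += 1
        if tier is not None:
            self.hits[tier] += 1

    def rate(self, tier: str | None = None) -> float:
        """Hit rate for one tier, or the rolled-up rate when tier is None."""
        if self.requests == 0:
            return 0.0
        hits = self.hits[tier] if tier is not None else sum(self.hits.values())
        return hits / self.requests
```

Because every rate shares the same denominator (all requests), the per-tier rates add up to the rolled-up rate, which is what lets one dashboard line be decomposed by tier.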
Consequences (negative)
- Two TTL knobs to tune. The semantic tier is meaningfully more sensitive to TTL than the exact tier: too long and stale answers leak, too short and the savings collapse. We mitigate with a weekly automated sweep of the hit rate vs threshold curve and a runbook entry (`runbook/cache-tuning.md`).
- Embedding model is a vendor dependency. If `text-embedding-3-small` pricing or behaviour changes, the semantic tier is exposed. We mitigate by keeping the embedding call behind an interface (`embeddings.py`) so the swap to an open-weight embedding (e.g. `bge-small-en`) is contained.
- Threshold drift. Cosine 0.92 is calibrated to the seed dataset. As workloads change, we re-calibrate. The runbook documents the procedure.
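The vendor-dependency mitigation rests on callers depending on an interface rather than a specific provider. A minimal sketch of what such an interface could look like, using `typing.Protocol`; the names `Embedder`, `FakeEmbedder`, and `embed_prompt` are hypothetical and not taken from `embeddings.py`, and the sketch is synchronous for brevity even though the lookup path is async.

```python
from __future__ import annotations

import hashlib
from typing import Protocol, Sequence


class Embedder(Protocol):
    """Anything that turns text into a fixed-length vector."""

    def embed(self, text: str) -> Sequence[float]: ...


class FakeEmbedder:
    """Deterministic stand-in for tests; NOT a real embedding model."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed(self, text: str) -> Sequence[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255.0 for b in digest[: self.dim]]


def embed_prompt(embedder: Embedder, prompt: str) -> list[float]:
    """Caller code depends only on the Embedder protocol, not on a vendor."""
    return list(embedder.embed(prompt))
```

Swapping providers then means adding one more class that satisfies `Embedder`. Cached embeddings from the old model are not comparable with vectors from the new one, so a swap also implies draining the semantic store.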
Reversal plan
If the semantic tier's marginal hit rate drops below 15% (i.e. tier 1 is catching almost everything), the cost of running embeddings on every miss exceeds the savings. The reversal is mechanical:
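- Set a feature flag (`CACHE_SEMANTIC_ENABLED=false`) in `.env`.
- `src/cache/cache.py` short-circuits past the semantic call on lookup.
- Drain the semantic Redis namespace. `redis-cli --scan --pattern 'semantic:*'` only lists the matching keys, so the output still has to be deleted (for example by piping it to `xargs redis-cli unlink`).
- Tear down the embedding pre-warm job in `migrations/`.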
Estimated effort: ~2 engineer-days. The exact-match tier remains intact.
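The short-circuit behind the feature flag can be sketched as below. This is an assumption about how the flag might be read, not the code in `src/cache/cache.py`: the helper `semantic_enabled`, the set of values treated as "off", the default of enabled when the variable is unset, and passing the two tiers in as callables are all choices made for the example.

```python
import asyncio
import os
from typing import Awaitable, Callable, Mapping, Optional

_OFF_VALUES = {"false", "0", "no", "off"}  # assumed spellings of "disabled"


def semantic_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read CACHE_SEMANTIC_ENABLED; assumed to default to enabled when unset."""
    env = os.environ if env is None else env
    return env.get("CACHE_SEMANTIC_ENABLED", "true").strip().lower() not in _OFF_VALUES


async def lookup(
    prompt: str,
    exact_get: Callable[[str], Awaitable[Optional[str]]],
    semantic_lookup: Callable[[str], Awaitable[Optional[str]]],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    # Tier 1 always runs.
    if (hit := await exact_get(prompt)) is not None:
        return hit
    # Tier 2 is skipped entirely when the flag is off: no embedding call is made.
    if not semantic_enabled(env):
        return None
    return await semantic_lookup(prompt)
```

With the flag off, a tier 1 miss returns `None` immediately and falls through to the model call, which is what keeps the exact-match tier intact during a reversal.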
References
- `src/cache/cache.py`: dual-tier lookup orchestrator
- `src/cache/semantic.py`: embedding + cosine search
- `src/cache/cache_types.py`: `CacheHit` / `CacheTier` dataclasses
- `src/cache/comparison.py`: exact vs semantic A/B harness used in M03
- ADR-003 (embedding-model choice for the semantic tier; see related)
- `runbook/cost-incident-response.py`: playbook entry "Cache hit rate < 40%"