ADR-003: Redis semantic cache (HNSW + threshold=0.92) over exact-match LRU | AI Serving Platform

Context

M01 ships the vLLM endpoint with no caching. Every request — including near-identical repeats from the same tenant — hits the model. On the FinSight workload (5 tenants asking financial-analyst questions about earnings, regulations, instruments) we measured the M02 baseline at 1800ms p99, $0.48 per request, with a long tail of repeat queries: "summarize Apple's Q4 earnings", "summarize AAPL Q4 earnings", "give me the AAPL earnings summary" — three paraphrases, three full GPU passes.

Three caching options:

1. Exact-match LRU (in-memory). Hash the prompt; if seen, return the cached response. Sub-millisecond lookup. Coverage on the FinSight workload: ~10% (only catches verbatim repeats). The paraphrase tail gets nothing.
2. Semantic cache only (Redis HNSW). Embed the prompt; compare against a Redis vector index of prior embeddings; return the cached response when cosine similarity ≥ 0.92. Coverage: ~35% on the FinSight workload (catches the paraphrase tail). Every miss costs an embedding call (~$0.00002 at text-embedding-3-small) and ~5–10ms of vector search latency on top of the model call.
3. Tier-1 exact + tier-2 semantic. Same dual-tier pattern as RAG: try exact first, fall through to semantic on miss. Adds complexity for a marginal coverage gain on the FinSight workload: the ~10% of requests the exact tier catches are already covered by the semantic tier.

Option 2 wins for this workload because the FinSight queries are paraphrase-heavy by nature (analysts re-ask the same question different ways). On a workload with truly verbatim repeats (autocomplete, FAQ endpoints) the dual-tier pattern would be the right answer.
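For comparison, the dual-tier pattern that was not chosen can be sketched as follows. This is an illustrative sketch only: `TieredCache`, `semantic_lookup`, and `max_entries` are hypothetical names, and the semantic tier is injected as a plain callable rather than a real Redis lookup.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Optional


class TieredCache:
    """Exact-match LRU in front of a pluggable semantic lookup (hypothetical sketch)."""

    def __init__(self, semantic_lookup: Callable[[str], Optional[str]], max_entries: int = 1024):
        self._exact: "OrderedDict[str, str]" = OrderedDict()
        self._semantic_lookup = semantic_lookup
        self._max_entries = max_entries

    @staticmethod
    def _key(prompt: str) -> str:
        # Exact tier: any change in wording produces a different hash.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        key = self._key(prompt)
        self._exact[key] = response
        self._exact.move_to_end(key)
        if len(self._exact) > self._max_entries:
            self._exact.popitem(last=False)   # evict the least recently used entry

    def get(self, prompt: str) -> Optional[str]:
        key = self._key(prompt)
        if key in self._exact:
            self._exact.move_to_end(key)      # tier 1: verbatim repeat
            return self._exact[key]
        return self._semantic_lookup(prompt)  # tier 2: paraphrase fallback
```

On a paraphrase-heavy workload most requests fall straight through tier 1, which is why the extra tier adds little here.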

Decision

We adopt a single-tier Redis semantic cache with HNSW vector index and cosine threshold 0.92.

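# api/cache/semantic_cache.py
import asyncio

import numpy as np
from redis.asyncio import Redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, redis: Redis):
        self.redis = redis                             # idx:cache is assumed to use DISTANCE_METRIC COSINE
        self.embedder = SentenceTransformer(
            "sentence-transformers/all-MiniLM-L6-v2"  # 384-dim, ~22M params
        )
        self.threshold = 0.92                          # tuned on FinSight queries
        self.ttl_default = 86400                       # 1 day
        self.ttl_volatile = 300                        # 5 min for rate/yield/price queries

    async def lookup(self, query: str) -> CacheHit | None:
        # encode() is CPU-bound, so keep it off the event loop.
        embedding = await asyncio.to_thread(self.embedder.encode, query)
        vec = np.asarray(embedding, dtype=np.float32).tobytes()
        knn = Query("*=>[KNN 1 @embedding $vec AS distance]").dialect(2)  # KNN needs dialect 2
        results = await self.redis.ft("idx:cache").search(knn, query_params={"vec": vec})
        # Redis returns cosine distance (1 - similarity), so convert before comparing.
        if results.docs and 1.0 - float(results.docs[0].distance) >= self.threshold:
            return CacheHit.from_doc(results.docs[0])
        return None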
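The lookup above presupposes that `idx:cache` already exists. A minimal sketch of how the index might be declared is given here, assuming hash-stored entries under the `cache:` prefix, a FLOAT32 field named `embedding`, and the COSINE distance metric. The helper names `build_create_index_args` and `distance_to_similarity` are hypothetical; the real schema may carry additional fields.

```python
from typing import List, Union

Arg = Union[str, int]


def build_create_index_args(
    index_name: str = "idx:cache",
    key_prefix: str = "cache:",
    dim: int = 384,               # all-MiniLM-L6-v2 output size
) -> List[Arg]:
    """Return FT.CREATE arguments for an HNSW cosine index over hash keys."""
    vector_attrs: List[Arg] = [
        "TYPE", "FLOAT32",
        "DIM", dim,
        "DISTANCE_METRIC", "COSINE",
    ]
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", 1, key_prefix,
        "SCHEMA",
        # The number after HNSW is the count of attribute tokens that follow.
        "embedding", "VECTOR", "HNSW", len(vector_attrs), *vector_attrs,
    ]


def distance_to_similarity(cosine_distance: float) -> float:
    """Redis reports cosine distance (1 - similarity) for COSINE indexes."""
    return 1.0 - cosine_distance
```

With a redis-py client the arguments could be sent as `client.execute_command(*build_create_index_args())`. Under this metric, a threshold of 0.92 on similarity corresponds to a distance of at most 0.08.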

Cache key format: cache:{sha256(query)[:16]}. The truncated SHA-256 prefix exists only for debugging (it lets us correlate cache entries with query hashes); the actual lookup is by HNSW vector similarity, not by key.
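The key format can be illustrated with a few lines of Python. This is a sketch under two assumptions not stated in the ADR: the query is UTF-8 encoded before hashing, and the 16 characters are taken from the hex digest. `cache_key` is a hypothetical helper name.

```python
import hashlib


def cache_key(query: str) -> str:
    """Build the debug-friendly key: 'cache:' + first 16 hex chars of SHA-256."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"cache:{digest[:16]}"
```

Two paraphrases of the same question yield unrelated keys, which is why the key serves for correlation only and never for matching.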

The TTL split (ttl_default=86400 vs ttl_volatile=300) addresses a domain-specific failure mode: questions about prices, yields, and rates go stale within minutes; questions about regulations or company fundamentals stay fresh for hours. Volatility is detected via keyword classification on the query ("rate", "yield", "price", "current").
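A minimal sketch of the keyword-based TTL selection might look like the following. The keyword list and TTL values come from this ADR; the word-boundary matching, the optional plural, and the names `is_volatile` and `ttl_for` are assumptions made for illustration.

```python
import re

TTL_DEFAULT = 86400   # 1 day
TTL_VOLATILE = 300    # 5 minutes
VOLATILE_KEYWORDS = ("rate", "yield", "price", "current")

# Word-boundary match with optional plural, so "rates" matches but "corporate" does not.
_VOLATILE_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, VOLATILE_KEYWORDS)) + r")s?\b",
    re.IGNORECASE,
)


def is_volatile(query: str) -> bool:
    """True when the query mentions a keyword associated with fast-moving data."""
    return _VOLATILE_RE.search(query) is not None


def ttl_for(query: str) -> int:
    """Pick the cache TTL in seconds for a query."""
    return TTL_VOLATILE if is_volatile(query) else TTL_DEFAULT
```

Plain substring matching would be simpler but would classify words such as "corporate" as volatile because they contain "rate".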

Tradeoffs we accept

Lever	Alternative	Chosen
Coverage	Exact-only (~10%)	Semantic (~35% measured on FinSight)
Per-query overhead	$0/0ms (no cache)	~$0.00002 + 5–10ms (embedding + KNN)
Threshold tuning	None (exact match)	0.92 calibrated on FinSight golden set
Failure mode	Cache miss = full model call	Cache miss = embedding wasted + full model call
Volatility handling	Single TTL	TTL split (5min volatile / 1day default)

The largest concrete cost is the per-query embedding overhead on a cache miss: 5–10ms of latency plus $0.00002. We accept it because (a) on a miss, the embedding overhead is dwarfed by the model call that follows (avg 150–300ms vLLM generation), and (b) the 35% hit rate on the FinSight workload translates into $1,247/mo in savings at the reference scenario, per the cost-model CSV.
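The dollar side of this tradeoff can be sanity-checked with a small model. The sketch below makes simplifying assumptions not stated in the ADR: every lookup pays the embedding cost, a hit costs nothing beyond that, and latency is ignored. Both function names are hypothetical.

```python
def expected_cost_per_request(hit_rate: float, model_cost: float, embed_cost: float) -> float:
    """Every request pays for the embedding; only misses pay for the model call."""
    return embed_cost + (1.0 - hit_rate) * model_cost


def break_even_hit_rate(model_cost: float, embed_cost: float) -> float:
    """Hit rate at which the cached path costs the same as running with no cache."""
    # Solve embed_cost + (1 - h) * model_cost == model_cost for h.
    return embed_cost / model_cost
```

With the figures quoted in this ADR ($0.48 per model call, $0.00002 per embedding), the break-even hit rate is roughly 0.004%, so under this model the embedding fee alone would not justify disabling the cache at any realistic hit rate.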

Consequences (positive)

35% measured cache hit rate on the FinSight workload — and it's the paraphrase tail, exactly the requests an exact-match cache misses.
Cached p99 latency drops to 8ms (Redis HNSW lookup + response serialization), versus 195ms p99 on a cache miss. The latency cliff between hit and miss is what makes the cache user-visible.
Single source of truth for "what we've answered before." The cost manager (app/cost_manager.py) reads cache hits from the same Redis index for per-tenant attribution.
Volatility-aware TTL prevents the most common semantic-cache failure mode: stale prices/rates leaking back as fresh answers.

Consequences (negative)

Embedding-model dependency. all-MiniLM-L6-v2 is a vendor (HuggingFace) dependency. If the model gets pulled or the embedding semantics shift meaningfully across versions, the threshold needs re-tuning. We pin the model SHA in requirements.txt.
Threshold drift. 0.92 is calibrated on the FinSight golden set. Other workloads (legal, medical, support) will need re-calibration. The runbook documents the procedure: re-run the canary set, plot precision/recall vs threshold, pick the elbow.
Cache poisoning surface. A malformed entry in Redis can return as a high-confidence hit. M04's chaos scenario #5 specifically tests this; mitigation is input validation on lookup() + a max-cache-entry-size check + TTL expiry as a backstop.
Threshold-recall tradeoff. Lowering threshold to 0.88 lifts hit rate to ~45% but starts returning weakly-related answers; user_rating drops. Documented in M02 as the threshold-tuning tradeoff.
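The re-calibration procedure mentioned above (plot precision/recall against the threshold) could be sketched along these lines. The input format is an assumption: each labeled pair holds the cosine similarity of the nearest cached entry and a human judgment of whether serving that entry would have been correct. `sweep_thresholds` is a hypothetical helper, not the runbook's actual script.

```python
from typing import Iterable, List, Tuple


def sweep_thresholds(
    pairs: Iterable[Tuple[float, bool]],
    thresholds: Iterable[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) rows.

    Each pair is (cosine similarity of the nearest cached entry,
    whether serving that entry would have been a correct answer).
    """
    data = list(pairs)
    total_correct = sum(1 for _, ok in data if ok)
    rows = []
    for t in thresholds:
        served = [ok for sim, ok in data if sim >= t]
        true_hits = sum(served)
        # With nothing served there are no wrong answers, so precision is 1.0.
        precision = true_hits / len(served) if served else 1.0
        recall = true_hits / total_correct if total_correct else 0.0
        rows.append((t, precision, recall))
    return rows
```

Lowering the threshold raises recall (more hits) at the expense of precision (more weakly related answers), which is the tradeoff described for 0.88 versus 0.92.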

Reversal plan

If hit rate drops below 15% (workload changes to truly novel queries) or poisoning becomes a recurring issue, the reversal is:

Set CACHE_SEMANTIC_ENABLED=false in .env.
RAGPipeline.answer() short-circuits past the cache lookup.
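Drain the Redis cache namespace (redis-cli --scan --pattern 'cache:*' | xargs redis-cli unlink). Note that --scan on its own only lists the matching keys; piping them to unlink is what removes them.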
Either reverse to exact-match-only (10% coverage) or eliminate caching entirely.
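The first three reversal steps can be sketched in Python as follows. This assumes the `.env` values are loaded into the process environment, that an unset flag means the cache is enabled, and that the client exposes redis-py's `scan_iter` and `unlink` methods. `semantic_cache_enabled`, `drain_cache`, and `KeyValueClient` are hypothetical names.

```python
import os
from typing import Iterable, Mapping, Protocol


class KeyValueClient(Protocol):
    """Subset of the redis-py client interface used by drain_cache."""

    def scan_iter(self, match: str, count: int) -> Iterable: ...
    def unlink(self, *names) -> int: ...


def semantic_cache_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Treat anything other than an explicit 'false'/'0'/'no' as enabled."""
    return env.get("CACHE_SEMANTIC_ENABLED", "true").strip().lower() not in {"false", "0", "no"}


def drain_cache(client: KeyValueClient, pattern: str = "cache:*", batch_size: int = 500) -> int:
    """Delete every key matching the pattern in batches; returns the number removed."""
    removed, batch = 0, []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            removed += client.unlink(*batch)
            batch = []
    if batch:
        removed += client.unlink(*batch)
    return removed
```

The pipeline would check `semantic_cache_enabled()` before calling the cache lookup, and `drain_cache` would be run once after the flag is flipped.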

Estimated effort: ~2 engineer-days.

References

api/cache/semantic_cache.py — SemanticCache class
api/rag/pipeline_v2_cache.py — RAGPipeline integrating the cache layer
api/middleware/latency_tracker.py — per-stage timing (cache_lookup budget tracked separately)
app/cost_manager.py — reads cache hit metrics for cost attribution
M02 cost-tradeoff table: baseline 1800ms/$0.48 → +cache 127ms/$0.25
ADR-001 (the vLLM engine the cache sits in front of)
ADR-004 (the circuit breaker that wraps the cache + model call)