Context
M01 ships the vLLM endpoint with no caching. Every request — including near-identical repeats from the same tenant — hits the model. On the FinSight workload (5 tenants asking financial-analyst questions about earnings, regulations, instruments) we measured the M02 baseline at 1800ms p99, $0.48 per request, with a long tail of repeat queries: "summarize Apple's Q4 earnings", "summarize AAPL Q4 earnings", "give me the AAPL earnings summary" — three paraphrases, three full GPU passes.
Three caching options:
- Exact-match LRU (in-memory). Hash the prompt; if seen, return cached response. Sub-millisecond lookup. Coverage on the FinSight workload: ~10% (only catches verbatim repeats). The paraphrase tail gets nothing.
- Semantic cache only (Redis HNSW). Embed the prompt; compare against a Redis vector index of prior embeddings; return the cached response when cosine ≥ 0.92. Coverage: 35% on the FinSight workload (catches the paraphrase tail). Every miss costs an embedding call ($0.00002 at text-embedding-3-small) and ~5–10ms vector search latency.
- Tier-1 exact + tier-2 semantic. Same dual-tier pattern as RAG: try exact first, fall through to semantic on miss. Adds complexity for marginal coverage on the FinSight workload (the exact-only tier catches ~10% but it's already in the semantic tier's coverage).
Option 2 (semantic cache only) wins for this workload because the FinSight queries are paraphrase-heavy by nature (analysts re-ask the same question in different ways). On a workload with truly verbatim repeats (autocomplete, FAQ endpoints) the dual-tier pattern would be the right answer.
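To make the coverage gap concrete, here is a minimal sketch of the exact-match option, using the three paraphrases from the Context paragraph. The class name `ExactMatchLRU`, the capacity, and the normalization step (lowercasing and collapsing whitespace) are illustrative assumptions, not part of the M01/M02 code.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class ExactMatchLRU:
    """Tiny in-memory LRU keyed by a hash of the normalized prompt (hypothetical)."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._store: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(prompt: str) -> str:
        # Assumption: lowercase and collapse whitespace before hashing.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        key = self._key(prompt)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, prompt: str, response: str) -> None:
        key = self._key(prompt)
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry


cache = ExactMatchLRU()
cache.put("summarize Apple's Q4 earnings", "cached summary")
verbatim = cache.get("Summarize  Apple's Q4 earnings")  # same text after normalization
paraphrase_1 = cache.get("summarize AAPL Q4 earnings")  # different text, different hash
paraphrase_2 = cache.get("give me the AAPL earnings summary")
```

Only the near-verbatim repeat is served from the cache. Both paraphrases hash to different keys and fall through to the model, which is the behavior behind the ~10% coverage figure.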
Decision
We adopt a single-tier Redis semantic cache with HNSW vector index and cosine threshold 0.92.
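    # api/cache/semantic_cache.py
    import numpy as np
    from redis.asyncio import Redis
    from redis.commands.search.query import Query
    from sentence_transformers import SentenceTransformer

    class SemanticCache:
        def __init__(self, redis: Redis):
            self.redis = redis  # async Redis client, injected by the caller
            self.embedder = SentenceTransformer(
                "sentence-transformers/all-MiniLM-L6-v2"  # 384-dim, ~22M params
            )
            self.threshold = 0.92     # tuned on FinSight queries
            self.ttl_default = 86400  # 1 day
            self.ttl_volatile = 300   # 5 min for rate/yield/price queries

        async def lookup(self, query: str) -> CacheHit | None:
            # CacheHit is assumed to be defined or imported above (not shown).
            embedding = self.embedder.encode(query).astype(np.float32)
            knn = (
                Query("*=>[KNN 1 @embedding $vec AS distance]")
                .dialect(2)  # KNN query syntax requires dialect 2
            )
            results = await self.redis.ft("idx:cache").search(
                knn, query_params={"vec": embedding.tobytes()}
            )
            # Assumes a COSINE index: Redis returns distance = 1 - similarity.
            if results.docs:
                similarity = 1.0 - float(results.docs[0].distance)
                if similarity >= self.threshold:
                    return CacheHit.from_doc(results.docs[0])
            return None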
Cache key format: `cache:{sha256(query)[:16]}`. The truncated SHA-256
prefix exists only for debugging (it lets us correlate cache entries to
query hashes); the actual lookup is HNSW vector similarity.
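The key format and the index that `lookup()` queries could be set up roughly as follows. This is a sketch under stated assumptions: entries are stored as Redis hashes, the vector lives in a field named `embedding` as FLOAT32, and the digest is taken over the UTF-8 bytes of the raw query. The helper names `cache_key` and `ft_create_args` are hypothetical.

```python
import hashlib

KEY_PREFIX = "cache:"     # prefix from the key format above
INDEX_NAME = "idx:cache"  # index name queried by lookup()


def cache_key(query: str) -> str:
    """Build cache:{sha256(query)[:16]} from the UTF-8 bytes of the query."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"{KEY_PREFIX}{digest[:16]}"


def ft_create_args(dim: int = 384) -> list[str]:
    """Arguments for an FT.CREATE call that defines an HNSW cosine index.

    Assumption: hash documents under the cache: prefix, FLOAT32 vectors.
    The "6" is the count of vector attribute tokens that follow it.
    """
    return [
        "FT.CREATE", INDEX_NAME,
        "ON", "HASH",
        "PREFIX", "1", KEY_PREFIX,
        "SCHEMA",
        "embedding", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]
```

With an async redis-py client the index could then be created once via `await redis.execute_command(*ft_create_args())`. Two paraphrases produce unrelated keys, which is why the key cannot serve as the lookup path.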
The TTL split (ttl_default=86400 vs ttl_volatile=300) addresses a
domain-specific failure mode: questions about prices, yields, and rates
go stale within minutes; questions about regulations or company
fundamentals stay fresh for hours. Volatility is detected via keyword
classification on the query ("rate", "yield", "price", "current").
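A minimal sketch of the keyword-based TTL selection described above. The four keywords come from the text; the plural forms, the word-boundary matching, and the function name `ttl_for` are assumptions added for illustration.

```python
import re

TTL_DEFAULT = 86400  # 1 day, mirrors ttl_default
TTL_VOLATILE = 300   # 5 min, mirrors ttl_volatile

# Word boundaries keep "rate" from matching inside words such as "corporate".
_VOLATILE_PATTERN = re.compile(r"\b(rates?|yields?|prices?|current)\b", re.IGNORECASE)


def ttl_for(query: str) -> int:
    """Return the short TTL when the query mentions a volatile keyword."""
    return TTL_VOLATILE if _VOLATILE_PATTERN.search(query) else TTL_DEFAULT
```

A keyword classifier errs on the safe side: a false positive such as "current regulations" only shortens the TTL, while a missed volatile query would keep a stale answer for a full day.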
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Coverage | Exact-only (~10%) | Semantic (~35% measured on FinSight) |
| Per-query overhead | $0/0ms (no cache) | ~$0.00002 + 5–10ms (embedding + KNN) |
| Threshold tuning | None (exact match) | 0.92 calibrated on FinSight golden set |
| Failure mode | Cache miss = full model call | Cache miss = embedding wasted + full model call |
| Volatility handling | Single TTL | TTL split (5min volatile / 1day default) |
The largest concrete cost is the per-query embedding overhead on cache miss: 5–10ms latency + $0.00002. We accept it because (a) the embedding overhead is dwarfed by the model call that follows a miss (avg 150–300ms vLLM generation), and (b) the 35% hit rate on the FinSight workload turns into $1,247/mo savings at the reference scenario per the cost-model CSV.
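The cost side of this argument can be sanity-checked with a deliberately simplified model. It assumes a hit avoids the full $0.48 per-request cost from the Context section, that every lookup pays for one embedding, and that Redis hosting is ignored. It is a sketch and does not reproduce the cost-model CSV.

```python
def net_saving_per_request(hit_rate: float, model_cost: float, embed_cost: float) -> float:
    """Expected saving per request: hits avoid the model call, every lookup pays for an embedding."""
    return hit_rate * model_cost - embed_cost


def break_even_hit_rate(model_cost: float, embed_cost: float) -> float:
    """Hit rate at which embedding spend equals the avoided model spend."""
    return embed_cost / model_cost


saving = net_saving_per_request(hit_rate=0.35, model_cost=0.48, embed_cost=0.00002)
break_even = break_even_hit_rate(model_cost=0.48, embed_cost=0.00002)
```

Under these assumptions the break-even hit rate is well below 0.01%, so the embedding fee alone never decides the outcome. The 15% reversal trigger is driven by other costs, such as operational complexity and the latency added on misses.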
Consequences (positive)
- 35% measured cache hit rate on the FinSight workload — and it's the paraphrase tail, exactly the requests an exact-match cache misses.
- Cached p99 latency drops to 8ms (Redis HNSW lookup + response serialization), versus 195ms p99 on a cache miss. The latency cliff between hit and miss is what makes the cache user-visible.
- Single source of truth for "what we've answered before." The cost manager (`app/cost_manager.py`) reads cache hits from the same Redis index for per-tenant attribution.
- Volatility-aware TTL prevents the most common semantic-cache failure mode: stale prices/rates leaking back as fresh answers.
Consequences (negative)
- Embedding-model dependency. `all-MiniLM-L6-v2` is a vendor (HuggingFace) dependency. If the model gets pulled or the embedding semantics shift meaningfully across versions, the threshold needs re-tuning. We pin the model SHA in `requirements.txt`.
- Threshold drift. 0.92 is calibrated on the FinSight golden set. Other workloads (legal, medical, support) will need re-calibration. The runbook documents the procedure: re-run the canary set, plot precision/recall vs threshold, pick the elbow.
- Cache poisoning surface. A malformed entry in Redis can return as a high-confidence hit. M04's chaos scenario #5 specifically tests this; mitigation is input validation on `lookup()` + a max-cache-entry-size check + TTL expiry as a backstop.
- Threshold-recall tradeoff. Lowering threshold to 0.88 lifts hit rate to ~45% but starts returning weakly-related answers; user_rating drops. Documented in M02 as the threshold-tuning tradeoff.
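The re-calibration procedure mentioned under threshold drift can be sketched as a threshold sweep over labeled pairs. The `Sample` type, the labels, and the similarity values below are made up for illustration; they are not FinSight data, and the runbook's actual tooling may differ.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    similarity: float  # cosine similarity between a query and its nearest cached entry
    same_intent: bool  # human label: would the cached answer be acceptable?


def sweep(samples: list[Sample], thresholds: list[float]) -> list[tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each candidate threshold."""
    total_positive = sum(s.same_intent for s in samples)
    rows = []
    for t in thresholds:
        served = [s for s in samples if s.similarity >= t]  # would be cache hits
        true_hits = sum(s.same_intent for s in served)
        precision = true_hits / len(served) if served else 1.0
        recall = true_hits / total_positive if total_positive else 0.0
        rows.append((t, precision, recall))
    return rows


# Illustrative labels only.
samples = [
    Sample(0.97, True), Sample(0.94, True), Sample(0.93, True),
    Sample(0.90, True), Sample(0.90, False), Sample(0.89, False),
]
rows = sweep(samples, [0.88, 0.92, 0.96])
```

On this toy set, lowering the threshold from 0.92 to 0.88 raises recall but admits wrong answers, and raising it to 0.96 keeps precision while discarding most acceptable hits. That is the same shape as the threshold-recall tradeoff described above.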
Reversal plan
If hit rate drops below 15% (workload changes to truly novel queries) or poisoning becomes a recurring issue, the reversal is:
- Set `CACHE_SEMANTIC_ENABLED=false` in `.env`. `RAGPipeline.answer()` short-circuits past the cache lookup.
- Drain the Redis cache namespace (`redis-cli --scan --pattern 'cache:*'` lists the keys; pipe them to `xargs redis-cli unlink` to delete them).
- Either revert to exact-match-only (10% coverage) or eliminate caching entirely.
Estimated effort: ~2 engineer-days.
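The flag-based short-circuit in the first step might look like the sketch below. The `answer` function here is a hypothetical stand-in for `RAGPipeline.answer()`, with the cache lookup and model call passed in as callables. It assumes the `.env` values are already loaded into the process environment and that an unset flag means the cache is enabled.

```python
import os
from typing import Awaitable, Callable, Optional


def semantic_cache_enabled() -> bool:
    """Read CACHE_SEMANTIC_ENABLED; treating unset as enabled is an assumption."""
    return os.getenv("CACHE_SEMANTIC_ENABLED", "true").strip().lower() != "false"


async def answer(
    query: str,
    cache_lookup: Callable[[str], Awaitable[Optional[str]]],
    generate: Callable[[str], Awaitable[str]],
) -> str:
    """Hypothetical stand-in for RAGPipeline.answer(): skip the cache when the flag is off."""
    if semantic_cache_enabled():
        cached = await cache_lookup(query)
        if cached is not None:
            return cached
    return await generate(query)
```

Reading the flag on every call, rather than once at import time, lets the reversal take effect as soon as the environment changes on restart, without touching the pipeline code.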
References
- `api/cache/semantic_cache.py`: SemanticCache class
- `api/rag/pipeline_v2_cache.py`: RAGPipeline integrating the cache layer
- `api/middleware/latency_tracker.py`: per-stage timing (cache_lookup budget tracked separately)
- `app/cost_manager.py`: reads cache hit metrics for cost attribution
- M02 cost-tradeoff table: baseline 1800ms/$0.48 → +cache 127ms/$0.25
- ADR-001 (the vLLM engine the cache sits in front of)
- ADR-004 (the circuit breaker that wraps the cache + model call)