Redis semantic cache (HNSW + threshold=0.92) over exact-match LRU

✓ Accepted · AI Serving Platform · 02 — Optimize: RAG + Cost vs Accuracy
By AI-DE Engineering Team · Stakeholders: serving engineer, ML platform lead, cost owner

Context

M01 ships the vLLM endpoint with no caching. Every request — including near-identical repeats from the same tenant — hits the model. On the FinSight workload (5 tenants asking financial-analyst questions about earnings, regulations, instruments) we measured the M02 baseline at 1800ms p99, $0.48 per request, with a long tail of repeat queries: "summarize Apple's Q4 earnings", "summarize AAPL Q4 earnings", "give me the AAPL earnings summary" — three paraphrases, three full GPU passes.

Three caching options:

  1. Exact-match LRU (in-memory). Hash the prompt; if seen, return cached response. Sub-millisecond lookup. Coverage on the FinSight workload: ~10% (only catches verbatim repeats). The paraphrase tail gets nothing.
  2. Semantic cache only (Redis HNSW). Embed the prompt; compare against a Redis vector index of prior embeddings; return the cached response when cosine ≥ 0.92. Coverage: 35% on the FinSight workload (catches the paraphrase tail). Every miss costs an embedding call ($0.00002 at text-embedding-3-small) and ~5–10ms vector search latency.
  3. Tier-1 exact + tier-2 semantic. Same dual-tier pattern as RAG: try exact first, fall through to semantic on miss. Adds complexity for marginal coverage on the FinSight workload (the exact-only tier catches ~10% but it's already in the semantic tier's coverage).

Option 2 wins for this workload because the FinSight queries are paraphrase-heavy by nature (analysts re-ask the same question different ways). On a workload with truly verbatim repeats (autocomplete, FAQ endpoints) the dual-tier pattern would be the right answer.
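For comparison, the exact-match tier from options 1 and 3 can be sketched in a few lines. This is a minimal illustration, not code from the repository: `ExactLRU`, `dual_tier_lookup`, and the `semantic_lookup` callable are hypothetical names, and the capacity default is arbitrary.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Optional


class ExactLRU:
    """Tier 1: in-memory exact-match cache keyed by a hash of the prompt."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        key = self._key(prompt)
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, prompt: str, response: str) -> None:
        key = self._key(prompt)
        self._items[key] = response
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def dual_tier_lookup(
    prompt: str,
    exact: ExactLRU,
    semantic_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try the exact tier first; fall through to the semantic tier on a miss."""
    hit = exact.get(prompt)
    if hit is not None:
        return hit
    return semantic_lookup(prompt)
```

Because the exact tier only matches byte-identical prompts, the three paraphrases in the Context section would all miss it and fall through to the semantic tier.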

Decision

We adopt a single-tier Redis semantic cache with HNSW vector index and cosine threshold 0.92.

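# api/cache/semantic_cache.py
import numpy as np
from redis.asyncio import Redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

# CacheHit is defined elsewhere in this module (not shown here).

class SemanticCache:
    def __init__(self, redis: Redis):
        self.redis = redis
        self.embedder = SentenceTransformer(
            "sentence-transformers/all-MiniLM-L6-v2"  # 384-dim, ~22M params
        )
        self.threshold = 0.92                          # tuned on FinSight queries
        self.ttl_default = 86400                       # 1 day
        self.ttl_volatile = 300                        # 5 min for rate/yield/price queries

    async def lookup(self, query: str) -> CacheHit | None:
        # encode() is synchronous; RediSearch expects the vector as raw float32 bytes.
        embedding = self.embedder.encode(query).astype(np.float32)
        # KNN queries need dialect 2; "AS distance" names the returned distance field.
        knn = Query("*=>[KNN 1 @embedding $vec AS distance]").dialect(2)
        results = await self.redis.ft("idx:cache").search(
            knn,
            query_params={"vec": embedding.tobytes()},
        )
        if not results.docs:
            return None
        # Assumes idx:cache uses DISTANCE_METRIC COSINE, where KNN returns
        # distance = 1 - cosine similarity (as a string), not the similarity itself.
        similarity = 1.0 - float(results.docs[0].distance)
        if similarity >= self.threshold:
            return CacheHit.from_doc(results.docs[0])
        return None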
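The lookup depends on an existing `idx:cache` index. The ADR does not show how that index is created, so the following is only a sketch of one plausible definition. It assumes hash-typed entries under the `cache:` prefix, FLOAT32 vectors, and a COSINE distance metric; the 384 dimension and the `embedding` field name come from the class above. `ft_create_args` and `create_cache_index` are hypothetical helpers.

```python
from typing import List, Protocol


class SupportsExecuteCommand(Protocol):
    """Any Redis client exposing execute_command (e.g. redis-py's Redis)."""

    def execute_command(self, *args): ...


def ft_create_args(
    index: str = "idx:cache",
    prefix: str = "cache:",
    dim: int = 384,
) -> List[str]:
    """Build the FT.CREATE arguments for an HNSW vector index."""
    # Assumptions: FLOAT32 storage and COSINE distance (not stated in the ADR).
    vector_attrs = [
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]
    return [
        "FT.CREATE", index,
        "ON", "HASH",            # assumption: entries are stored as Redis hashes
        "PREFIX", "1", prefix,
        "SCHEMA",
        # The count after HNSW is the number of attribute tokens that follow.
        "embedding", "VECTOR", "HNSW", str(len(vector_attrs)), *vector_attrs,
    ]


def create_cache_index(client: SupportsExecuteCommand) -> None:
    """Send FT.CREATE to Redis; raises if the index already exists."""
    client.execute_command(*ft_create_args())
```

With a COSINE metric, Redis reports distance as 1 minus cosine similarity, so a 0.92 similarity threshold corresponds to a maximum distance of 0.08.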

Cache key format: cache:{sha256(query)[:16]}. The truncated SHA-256 prefix exists for debugging (it lets us correlate cache entries with query hashes); the lookup itself goes through HNSW vector similarity, not the key.
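The key format can be sketched as below. It assumes the query is UTF-8 encoded and that the 16 characters are taken from the hex digest; `cache_key` is a hypothetical helper name.

```python
import hashlib


def cache_key(query: str) -> str:
    """Return cache:{first 16 hex chars of SHA-256(query)}."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"cache:{digest[:16]}"
```

Two paraphrases of the same question produce different keys, which is why the key is useful for tracing an entry but plays no part in matching.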

The TTL split (ttl_default=86400 vs ttl_volatile=300) addresses a domain-specific failure mode: questions about prices, yields, and rates go stale within minutes; questions about regulations or company fundamentals stay fresh for hours. Volatility is detected via keyword classification on the query ("rate", "yield", "price", "current").
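A keyword classifier of this kind might look like the sketch below. The TTL values and keywords come from the text; the word-boundary matching, plural handling, and the names `is_volatile` and `ttl_for` are assumptions made for illustration.

```python
import re

TTL_DEFAULT = 86400   # 1 day
TTL_VOLATILE = 300    # 5 minutes

# Word boundaries keep "rate" from matching inside words like "corporate".
_VOLATILE_PATTERN = re.compile(
    r"\b(?:rates?|yields?|prices?|current)\b", re.IGNORECASE
)


def is_volatile(query: str) -> bool:
    """True when the query mentions a fast-moving quantity."""
    return _VOLATILE_PATTERN.search(query) is not None


def ttl_for(query: str) -> int:
    """Pick the cache TTL in seconds for a query."""
    return TTL_VOLATILE if is_volatile(query) else TTL_DEFAULT
```

A plain substring check would misclassify queries such as "corporate strategy" as volatile, which is the reason for the word-boundary pattern here.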

Tradeoffs we accept

| Lever | Alternative | Chosen |
|---|---|---|
| Coverage | Exact-only (~10%) | Semantic (~35% measured on FinSight) |
| Per-query overhead | $0 / 0ms (no cache) | ~$0.00002 + 5–10ms (embedding + KNN) |
| Threshold tuning | None (exact match) | 0.92 calibrated on FinSight golden set |
| Failure mode | Cache miss = full model call | Cache miss = embedding wasted + full model call |
| Volatility handling | Single TTL | TTL split (5min volatile / 1day default) |

The largest concrete cost is the per-query embedding overhead on a cache miss: 5–10ms of latency plus $0.00002. We accept it because (a) that overhead is small next to the model call that follows a miss (avg 150–300ms vLLM generation), and (b) the 35% hit rate on the FinSight workload translates to $1,247/mo in savings at the reference scenario, per the cost-model CSV.

Consequences (positive)

  • 35% measured cache hit rate on the FinSight workload — and it's the paraphrase tail, exactly the requests an exact-match cache misses.
  • Cached p99 latency drops to 8ms (Redis HNSW lookup + response serialization), versus 195ms p99 on a cache miss. The latency cliff between hit and miss is what makes the cache user-visible.
  • Single source of truth for "what we've answered before." The cost manager (app/cost_manager.py) reads cache hits from the same Redis index for per-tenant attribution.
  • Volatility-aware TTL prevents the most common semantic-cache failure mode: stale prices/rates leaking back as fresh answers.

Consequences (negative)

  • Embedding-model dependency. all-MiniLM-L6-v2 is a vendor (HuggingFace) dependency. If the model gets pulled or the embedding semantics shift meaningfully across versions, the threshold needs re-tuning. We pin the model SHA in requirements.txt.
  • Threshold drift. 0.92 is calibrated on the FinSight golden set. Other workloads (legal, medical, support) will need re-calibration. The runbook documents the procedure: re-run the canary set, plot precision/recall vs threshold, pick the elbow.
  • Cache poisoning surface. A malformed entry in Redis can return as a high-confidence hit. M04's chaos scenario #5 specifically tests this; mitigation is input validation on lookup() + a max-cache-entry-size check + TTL expiry as a backstop.
  • Threshold-recall tradeoff. Lowering threshold to 0.88 lifts hit rate to ~45% but starts returning weakly-related answers; user_rating drops. Documented in M02 as the threshold-tuning tradeoff.
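The re-calibration procedure (sweep the threshold, compare precision and recall, pick the elbow) can be sketched as follows. This is not the runbook's script. It assumes each labeled example is a pair of the nearest cached neighbor's cosine similarity and a human judgment of whether that cached answer was acceptable; `sweep_thresholds` and `SweepRow` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SweepRow(NamedTuple):
    threshold: float
    hit_rate: float    # share of queries the cache would answer
    precision: float   # share of served answers that were acceptable
    recall: float      # share of acceptable answers that were served


def sweep_thresholds(
    labeled: Iterable[Tuple[float, bool]],
    thresholds: Iterable[float],
) -> List[SweepRow]:
    """Compute hit rate, precision, and recall at each candidate threshold."""
    pairs = list(labeled)
    total_acceptable = sum(1 for _, ok in pairs if ok)
    rows = []
    for t in thresholds:
        served = [ok for sim, ok in pairs if sim >= t]
        true_hits = sum(1 for ok in served if ok)
        hit_rate = len(served) / len(pairs) if pairs else 0.0
        # Convention: precision is 1.0 when nothing is served.
        precision = true_hits / len(served) if served else 1.0
        recall = true_hits / total_acceptable if total_acceptable else 0.0
        rows.append(SweepRow(t, hit_rate, precision, recall))
    return rows
```

Lowering the threshold raises hit rate and recall while precision falls; the elbow is the point where further hit-rate gains start costing disproportionate precision.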

Reversal plan

If hit rate drops below 15% (workload changes to truly novel queries) or poisoning becomes a recurring issue, the reversal is:

  1. Set CACHE_SEMANTIC_ENABLED=false in .env.
  2. RAGPipeline.answer() short-circuits past the cache lookup.
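  3. Drain the Redis cache namespace (redis-cli --scan --pattern 'cache:*' | xargs redis-cli unlink). The --scan form on its own only lists the matching keys; piping them to unlink is what deletes them.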
  4. Either reverse to exact-match-only (10% coverage) or eliminate caching entirely.
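Steps 1 and 2 amount to a feature flag checked before the cache lookup. The sketch below shows the shape of that short-circuit under stated assumptions: the flag defaults to enabled when unset, and `answer`, `cache_lookup`, and `generate` are hypothetical stand-ins rather than the actual RAGPipeline API.

```python
import os
from typing import Callable, Optional


def semantic_cache_enabled() -> bool:
    """Read CACHE_SEMANTIC_ENABLED; assumed to default to enabled when unset."""
    raw = os.environ.get("CACHE_SEMANTIC_ENABLED", "true")
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def answer(
    query: str,
    cache_lookup: Callable[[str], Optional[str]],
    generate: Callable[[str], str],
) -> str:
    """Serve from the cache when enabled and hit; otherwise call the model."""
    if semantic_cache_enabled():
        cached = cache_lookup(query)
        if cached is not None:
            return cached
    # Flag off or cache miss: go straight to generation.
    return generate(query)
```

Reading the flag per request, rather than once at import time, lets the reversal take effect without redeploying, at the cost of an environment lookup on each call.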

Estimated effort: ~2 engineer-days.

References

  • api/cache/semantic_cache.py — SemanticCache class
  • api/rag/pipeline_v2_cache.py — RAGPipeline integrating the cache layer
  • api/middleware/latency_tracker.py — per-stage timing (cache_lookup budget tracked separately)
  • app/cost_manager.py — reads cache hit metrics for cost attribution
  • M02 cost-tradeoff table: baseline 1800ms/$0.48 → +cache 127ms/$0.25
  • ADR-001 (the vLLM engine the cache sits in front of)
  • ADR-004 (the circuit breaker that wraps the cache + model call)