Skip to content
Back to Enterprise RAG
Cost Models·Enterprise RAG

Enterprise RAG cost model

Component-by-component baseline vs optimized cost breakdown for this EXPERT project. The source CSV ships in the project starter kit at docs/cost-model/ and is downloadable below.

Last updated 41 componentsSource CSV
# Enterprise RAG (P06) — cost model — v1
# Reference scenario: 5 tenants10k queries/day300k queries/month
# Per-query average: 2000 input tokens (retrieved context + query) / 300 output tokens
# Corpus: ~5k docs/tenant × ~80 chunks each = ~2M vectors total in Pinecone
# Cache hit rate: 20% (1h TTLsemantic-response cache in LLM gateway — ADR-004)
# Routing split: 80% gpt-4o-mini / 20% gpt-4o (M03 confidence-router default)
# AWS region: us-east-1
# OpenAI + AWS + Pinecone list prices as of 2026-04-30. Sources cited at end.
#
# Baseline = naive single-tier (100% gpt-4ono cacheno routing).
# Optimized = router cascade (80/20 mini/premium) + 20% response cache.
componentsubbaseline_monthly_usdoptimized_monthly_usddelta_usdnotes
OpenAI GPT-4o (premium tier)100% baseline | 20% optimized via router · 2K in / 300 out / req2400384-2016Baseline: 300k req × $0.008/req = $2,400. Optimized: 80% routed to mini + 20% cache hit eliminates ~76% of premium calls.
OpenAI GPT-4o-mini (mini tier)80% optimized via router · ~17× cheaper than premium per-req09292192k req on the cheap tier after cache + routing. Trades premium-only quality for 75% cost reduction with measurable RAGAS faithfulness preserved.
OpenAI text-embedding-3-smallquery embeddings (retrieval) + amortised ingest55050 tok/query × 300k = 15M tok/mo for retrieval = $0.30. Ingest amortised over 12mo: ~$5/mo at 5k docs × 5 tenants. Negligible vs LLM.
Pinecone serverless · HNSW index2M vectors total · 5 namespaces (one per tenant)70700Storage + read units at the reference scenario. Serverless tier auto-scales; reserved capacity drops ~30% if traffic is steady (lever #3).
ElastiCache Redis (cache.t4g.small + replica)semantic response cache + BM25 cache + rate-limit fallback54540Replica for fail-open path. Cache TTL 1h covers ~75% of repeat-query bursts on production traffic.
Reranker compute (cross-encoder, self-hosted)ms-marco-MiniLM-L-6-v2 on t4g.medium · ~150ms p95 over top-5030300Self-hosted on a dedicated CPU pool so it doesn't compete with API workers. Swap to GPU when traffic > 100k qpd.
Total · 5 tenants · 300k req/mo$0.0085 per req baseline → $0.0021 optimized2559635-$1,924 (-75%)Model cascade + response caching account for ~99% of savings; infra rows are unchanged. Per-tenant cost: $511 → $127 monthly.
#
# OPTIMIZATION LEVERS
leverdescriptionimpact
Model cascade (gpt-4o-mini default + gpt-4o escalation)ADR-004. LLM gateway routes 80% of requests to gpt-4o-mini; only escalates the 20% that need premium reasoning. Largest single lever.-$1,901 / mo (when applied alone)
Semantic response cache (1h TTL)ADR-004 again — gateway hashes prompt + tenant + retrieved-chunks fingerprint. 20% hit rate at the reference workload. Stacks with cascade.-$480 / mo (when applied alone, before cascade stacking)
Cross-encoder reranking → tighter promptsADR-002. Rerank top-50 → top-10 means the LLM sees only the highest-precision chunks, reducing average input tokens ~5% (from ~2.1K to 2K). Bonus on top of cascade.-$30 / mo (additional)
#
# SOURCES
# - OpenAI API pricing (gpt-4ogpt-4o-minitext-embedding-3-small): https://openai.com/api/pricing/
# - Pinecone serverless pricing (storage + read/write units): https://www.pinecone.io/pricing/
# - AWS ElastiCache pricing (cache.t4g.small node-hour): https://aws.amazon.com/elasticache/pricing/
# - AWS EC2 t4g.medium pricing (reranker pool): https://aws.amazon.com/ec2/pricing/on-demand/
# - Per-query token mix: derived from prompts.py (system + retrieved chunks + query + format) on the seed corpus
# - Cache hit rate: 20% calibrated against gateway/llm_gateway.py production traffic shape (M04 metrics)
# - Routing split: 80/20 mini/premium derived from M04 confidence-router default thresholds
# - 300k req/mo: 5 tenants × 10k qpd × 30d (target reference scenario in cost/cost_per_query.py)
#
# DISCLAIMERS
# - Prices are list. Negotiated enterprise pricing on OpenAI / Pinecone / AWS will lower both columns.
# - Per-query token mix is workload-dependent. Long-form chat with multi-turn history shifts to ~3K input.
# - Cache hit rate degrades on truly novel workloads; conservative model below ~10% hit rate.
# - Cascade quality assumes RAGAS faithfulness floor (>0.85) is preserved on the canary set; re-validate per workload.
# - Pinecone serverless storage scales with corpus growth; double the index size = +$35/mo at this tier.
Cost model in context

This cost model ships with Enterprise RAG — including the 5 ADRs that produced the optimization deltas.

Open project →
Press Cmd+K to open