Back to Enterprise RAG
Cost Models·Enterprise RAG
Enterprise RAG cost model
Component-by-component baseline vs optimized cost breakdown for this EXPERT project. The source CSV ships in the project starter kit at docs/cost-model/ and is downloadable below.
| # Enterprise RAG (P06) — cost model — v1 | |||||
|---|---|---|---|---|---|
| # Reference scenario: 5 tenants | 10k queries/day | 300k queries/month | |||
| # Per-query average: 2000 input tokens (retrieved context + query) / 300 output tokens | |||||
| # Corpus: ~5k docs/tenant × ~80 chunks each = ~2M vectors total in Pinecone | |||||
| # Cache hit rate: 20% (1h TTL | semantic-response cache in LLM gateway — ADR-004) | ||||
| # Routing split: 80% gpt-4o-mini / 20% gpt-4o (M03 confidence-router default) | |||||
| # AWS region: us-east-1 | |||||
| # OpenAI + AWS + Pinecone list prices as of 2026-04-30. Sources cited at end. | |||||
| # | |||||
| # Baseline = naive single-tier (100% gpt-4o | no cache | no routing). | |||
| # Optimized = router cascade (80/20 mini/premium) + 20% response cache. | |||||
| component | sub | baseline_monthly_usd | optimized_monthly_usd | delta_usd | notes |
| OpenAI GPT-4o (premium tier) | 100% baseline | 20% optimized via router · 2K in / 300 out / req | 2400 | 384 | -2016 | Baseline: 300k req × $0.008/req = $2,400. Optimized: 80% routed to mini + 20% cache hit eliminates ~76% of premium calls. |
| OpenAI GPT-4o-mini (mini tier) | 80% optimized via router · ~17× cheaper than premium per-req | 0 | 92 | 92 | 192k req on the cheap tier after cache + routing. Trades premium-only quality for 75% cost reduction with measurable RAGAS faithfulness preserved. |
| OpenAI text-embedding-3-small | query embeddings (retrieval) + amortised ingest | 5 | 5 | 0 | 50 tok/query × 300k = 15M tok/mo for retrieval = $0.30. Ingest amortised over 12mo: ~$5/mo at 5k docs × 5 tenants. Negligible vs LLM. |
| Pinecone serverless · HNSW index | 2M vectors total · 5 namespaces (one per tenant) | 70 | 70 | 0 | Storage + read units at the reference scenario. Serverless tier auto-scales; reserved capacity drops ~30% if traffic is steady (lever #3). |
| ElastiCache Redis (cache.t4g.small + replica) | semantic response cache + BM25 cache + rate-limit fallback | 54 | 54 | 0 | Replica for fail-open path. Cache TTL 1h covers ~75% of repeat-query bursts on production traffic. |
| Reranker compute (cross-encoder, self-hosted) | ms-marco-MiniLM-L-6-v2 on t4g.medium · ~150ms p95 over top-50 | 30 | 30 | 0 | Self-hosted on a dedicated CPU pool so it doesn't compete with API workers. Swap to GPU when traffic > 100k qpd. |
| Total · 5 tenants · 300k req/mo | $0.0085 per req baseline → $0.0021 optimized | 2559 | 635 | -$1,924 (-75%) | Model cascade + response caching account for ~99% of savings; infra rows are unchanged. Per-tenant cost: $511 → $127 monthly. |
| # | |||||
| # OPTIMIZATION LEVERS | |||||
| lever | description | impact | |||
| Model cascade (gpt-4o-mini default + gpt-4o escalation) | ADR-004. LLM gateway routes 80% of requests to gpt-4o-mini; only escalates the 20% that need premium reasoning. Largest single lever. | -$1,901 / mo (when applied alone) | |||
| Semantic response cache (1h TTL) | ADR-004 again — gateway hashes prompt + tenant + retrieved-chunks fingerprint. 20% hit rate at the reference workload. Stacks with cascade. | -$480 / mo (when applied alone, before cascade stacking) | |||
| Cross-encoder reranking → tighter prompts | ADR-002. Rerank top-50 → top-10 means the LLM sees only the highest-precision chunks, reducing average input tokens ~5% (from ~2.1K to 2K). Bonus on top of cascade. | -$30 / mo (additional) | |||
| # | |||||
| # SOURCES | |||||
| # - OpenAI API pricing (gpt-4o | gpt-4o-mini | text-embedding-3-small): https://openai.com/api/pricing/ | |||
| # - Pinecone serverless pricing (storage + read/write units): https://www.pinecone.io/pricing/ | |||||
| # - AWS ElastiCache pricing (cache.t4g.small node-hour): https://aws.amazon.com/elasticache/pricing/ | |||||
| # - AWS EC2 t4g.medium pricing (reranker pool): https://aws.amazon.com/ec2/pricing/on-demand/ | |||||
| # - Per-query token mix: derived from prompts.py (system + retrieved chunks + query + format) on the seed corpus | |||||
| # - Cache hit rate: 20% calibrated against gateway/llm_gateway.py production traffic shape (M04 metrics) | |||||
| # - Routing split: 80/20 mini/premium derived from M04 confidence-router default thresholds | |||||
| # - 300k req/mo: 5 tenants × 10k qpd × 30d (target reference scenario in cost/cost_per_query.py) | |||||
| # | |||||
| # DISCLAIMERS | |||||
| # - Prices are list. Negotiated enterprise pricing on OpenAI / Pinecone / AWS will lower both columns. | |||||
| # - Per-query token mix is workload-dependent. Long-form chat with multi-turn history shifts to ~3K input. | |||||
| # - Cache hit rate degrades on truly novel workloads; conservative model below ~10% hit rate. | |||||
| # - Cascade quality assumes RAGAS faithfulness floor (>0.85) is preserved on the canary set; re-validate per workload. | |||||
| # - Pinecone serverless storage scales with corpus growth; double the index size = +$35/mo at this tier. |
Cost model in context
This cost model ships with Enterprise RAG — including the 5 ADRs that produced the optimization deltas.
Open project →