Enterprise RAG cost model

Name: Enterprise RAG cost model dataset
Creator: AI-DE
Keywords: Enterprise RAG, cost model, cloud cost, FinOps, AI infrastructure

Component-by-component baseline vs optimized cost breakdown for this EXPERT project. The source CSV ships in the project starter kit at docs/cost-model/ and is downloadable below.

Last updated 2026-05-2241 componentsSource CSV

# Enterprise RAG (P06) — cost model — v1
# Reference scenario: 5 tenants	10k queries/day	300k queries/month
# Per-query average: 2000 input tokens (retrieved context + query) / 300 output tokens
# Corpus: ~5k docs/tenant × ~80 chunks each = ~2M vectors total in Pinecone
# Cache hit rate: 20% (1h TTL	semantic-response cache in LLM gateway — ADR-004)
# Routing split: 80% gpt-4o-mini / 20% gpt-4o (M03 confidence-router default)
# AWS region: us-east-1
# OpenAI + AWS + Pinecone list prices as of 2026-04-30. Sources cited at end.
#
# Baseline = naive single-tier (100% gpt-4o	no cache	no routing).
# Optimized = router cascade (80/20 mini/premium) + 20% response cache.
component	sub	baseline_monthly_usd	optimized_monthly_usd	delta_usd	notes
OpenAI GPT-4o (premium tier)	100% baseline \| 20% optimized via router · 2K in / 300 out / req	2400	384	-2016	Baseline: 300k req × $0.008/req = $2,400. Optimized: 80% routed to mini + 20% cache hit eliminates ~76% of premium calls.
OpenAI GPT-4o-mini (mini tier)	80% optimized via router · ~17× cheaper than premium per-req	0	92	92	192k req on the cheap tier after cache + routing. Trades premium-only quality for 75% cost reduction with measurable RAGAS faithfulness preserved.
OpenAI text-embedding-3-small	query embeddings (retrieval) + amortised ingest	5	5	0	50 tok/query × 300k = 15M tok/mo for retrieval = $0.30. Ingest amortised over 12mo: ~$5/mo at 5k docs × 5 tenants. Negligible vs LLM.
Pinecone serverless · HNSW index	2M vectors total · 5 namespaces (one per tenant)	70	70	0	Storage + read units at the reference scenario. Serverless tier auto-scales; reserved capacity drops ~30% if traffic is steady (lever #3).
ElastiCache Redis (cache.t4g.small + replica)	semantic response cache + BM25 cache + rate-limit fallback	54	54	0	Replica for fail-open path. Cache TTL 1h covers ~75% of repeat-query bursts on production traffic.
Reranker compute (cross-encoder, self-hosted)	ms-marco-MiniLM-L-6-v2 on t4g.medium · ~150ms p95 over top-50	30	30	0	Self-hosted on a dedicated CPU pool so it doesn't compete with API workers. Swap to GPU when traffic > 100k qpd.
Total · 5 tenants · 300k req/mo	$0.0085 per req baseline → $0.0021 optimized	2559	635	-$1,924 (-75%)	Model cascade + response caching account for ~99% of savings; infra rows are unchanged. Per-tenant cost: $511 → $127 monthly.
#
# OPTIMIZATION LEVERS
lever	description	impact
Model cascade (gpt-4o-mini default + gpt-4o escalation)	ADR-004. LLM gateway routes 80% of requests to gpt-4o-mini; only escalates the 20% that need premium reasoning. Largest single lever.	-$1,901 / mo (when applied alone)
Semantic response cache (1h TTL)	ADR-004 again — gateway hashes prompt + tenant + retrieved-chunks fingerprint. 20% hit rate at the reference workload. Stacks with cascade.	-$480 / mo (when applied alone, before cascade stacking)
Cross-encoder reranking → tighter prompts	ADR-002. Rerank top-50 → top-10 means the LLM sees only the highest-precision chunks, reducing average input tokens ~5% (from ~2.1K to 2K). Bonus on top of cascade.	-$30 / mo (additional)
#
# SOURCES
# - OpenAI API pricing (gpt-4o	gpt-4o-mini	text-embedding-3-small): https://openai.com/api/pricing/
# - Pinecone serverless pricing (storage + read/write units): https://www.pinecone.io/pricing/
# - AWS ElastiCache pricing (cache.t4g.small node-hour): https://aws.amazon.com/elasticache/pricing/
# - AWS EC2 t4g.medium pricing (reranker pool): https://aws.amazon.com/ec2/pricing/on-demand/
# - Per-query token mix: derived from prompts.py (system + retrieved chunks + query + format) on the seed corpus
# - Cache hit rate: 20% calibrated against gateway/llm_gateway.py production traffic shape (M04 metrics)
# - Routing split: 80/20 mini/premium derived from M04 confidence-router default thresholds
# - 300k req/mo: 5 tenants × 10k qpd × 30d (target reference scenario in cost/cost_per_query.py)
#
# DISCLAIMERS
# - Prices are list. Negotiated enterprise pricing on OpenAI / Pinecone / AWS will lower both columns.
# - Per-query token mix is workload-dependent. Long-form chat with multi-turn history shifts to ~3K input.
# - Cache hit rate degrades on truly novel workloads; conservative model below ~10% hit rate.
# - Cascade quality assumes RAGAS faithfulness floor (>0.85) is preserved on the canary set; re-validate per workload.
# - Pinecone serverless storage scales with corpus growth; double the index size = +$35/mo at this tier.

Cost model in context

This cost model ships with Enterprise RAG — including the 5 ADRs that produced the optimization deltas.

Open project →