AI Retrieval Platform cost model
Component-by-component baseline vs optimized cost breakdown for this EXPERT project. The source CSV ships in the project starter kit at docs/cost-model/ and is downloadable below.
| # ai-retrieval-platform cost model — v1 | |||||
|---|---|---|---|---|---|
| # Tenant load assumed: 1 tenant × 1M documents × 1M queries/mo | |||||
| # Embedding model: OpenAI text-embedding-3-small (1536 dims). | |||||
| # Inference workload: cross-encoder ms-marco-MiniLM-L-6-v2 on CPU (no GPU). | |||||
| # Optional Anthropic Claude / OpenAI agent function calls at ~5M tokens/mo. | |||||
| # AWS region: us-east-1, list prices as of 2026-05. | |||||
| # Sources cited at end of file. Verify against current vendor pricing pages | |||||
| # before signing a contract. | |||||
| # | |||||
| # Read this file in any spreadsheet (Numbers, Excel, Google Sheets, LibreOffice). | |||||
| # Lines starting with `#` are comments — most spreadsheets ignore them or treat | |||||
| # them as headers. If yours doesn't, paste rows after the comment block. | |||||
| component | sub | baseline_monthly_usd | optimized_monthly_usd | delta_usd | notes |
| Postgres + pgvector (RDS) | db.t4g.medium · 100GB gp3 · 1M vectors × 1536-dim ≈ 6GB · HNSW + GIN indexes | 98 | 68 | 30 | List $0.082/h ≈ $59/mo + $0.115/GB-mo storage. Optimized = 1-yr Reserved Instance (~30% off). |
| ElastiCache Redis · semantic cache | cache.t4g.small · primary + replica · 0.95 cosine threshold · 1h TTL · ~50% hit-rate measured in Part 4 | 54 | 36 | 18 | List $0.034/h × 2 ≈ $49/mo. Optimized = 1-yr Reserved Node. |
| FastAPI serving · 3 replicas (HPA min) | EC2 t4g.medium × 3 · /search + /search/hybrid + /agent endpoints · cross-encoder reranker on CPU | 90 | 66 | 24 | List $0.0335/h × 3 ≈ $72/mo. Optimized = 1-yr RI commit. HPA can scale to 10 — see lever 2. |
| OpenAI embeddings (one-shot at ingest) | 1M docs × text-embedding-3-small ≈ $20 one-shot · ADR-004 hash-based incremental drops to ~1% churn | 20 | 1 | 19 | $0.02/M tokens · ~1k tokens/doc avg. Optimized = $0.20/mo at 1% daily churn (per ADR-004 hash-based incremental updates) — the breakthrough lever. |
| OpenAI agent function calls (Part 5) | ~5M tokens/mo · agent tool-call loop · semantic-cache hit-rate reduces calls | 10 | 3 | 7 | GPT-4o-mini $0.15/M in / $0.60/M out OR Anthropic Claude Haiku $0.80/M in / $4.00/M out. Optimized via the semantic cache hit-rate lever. |
| S3 + egress + Grafana free tier | 50GB embedding backups + eval results · 50GB free observability · ~10GB/mo egress | 3 | 3 | 0 | S3 Standard $0.023/GB-mo. First 100GB egress/mo free. Grafana Cloud free tier covers Prometheus + 50GB logs + 50GB traces + 10K series. |
| Total · 1 tenant @ 1M queries | ~$0.28 per 1k queries at baseline | 275 | 177 | 98 | −$98/mo · −36%. RIs + cache + incremental embedding drive most of the savings. |
| # | |||||
| # OPTIMIZATION LEVERS | |||||
| lever | description | impact | |||
| 1-yr Reserved Instances | Commit to 12-month reserved capacity on RDS + ElastiCache + the 3 EC2 BentoML/FastAPI instances once load is stable for 30 days. Standard ~30% off list. | −$72 / mo | |||
| Semantic cache hit-rate | Part 4's semantic_cache.py uses cosine similarity threshold 0.95 with a 1-hour TTL. Measured hit-rate is 50%+ on real query distributions (queries cluster around popular topics). Halves embedding + query API costs. | −$15 / mo | |||
| Hash-based incremental embedding (ADR-004) | incremental_manager.py computes SHA-256 of content and re-embeds only when content changes (~1% daily churn vs 100% nightly full re-embed). Turns embedding from a recurring monthly cost (~$600/mo if the full $20 corpus pass runs nightly) into an amortized one-shot. | −$15 / mo | |||
| HNSW ef_search tuning (Part 2 sweep) | scripts/benchmark_hnsw.py sweeps ef_search from 10 to 200 against latency. The default ef_search=40 reaches ~0.81 recall@10 at <50ms p50. Lowering ef_search trades recall for lower latency and less shared-buffer pressure. | ~5% smaller RDS instance at 10M+ scale | |||
| # | |||||
| # CROSSOVER ANALYSIS — when does this stop fitting on a small managed footprint? | |||||
| # At ~10M vectors (10× current), the pgvector HNSW index starts thrashing | |||||
| # Postgres shared_buffers on db.t4g.medium. Two scaling paths: | |||||
| # 1. Vertical: bump to db.r6g.large (2 vCPU, 16 GiB RAM) ~ $182/mo on-demand. Linear cost. | |||||
| # 2. Horizontal: shard by tenant_id via infra/shard_router.py (Module 5). Two | |||||
| # db.t4g.medium nodes ~ $196/mo, but each node has half the index in cache. | |||||
| # Decision threshold: when p99 latency exceeds 80ms (warning band on the SLA | |||||
| # verifier), evaluate the shard split or the vertical bump. The runbook YAML | |||||
| # (runbook/retrieval_platform.yaml) covers F1 (recall drop) + F2 (latency spike) | |||||
| # diagnosis paths. | |||||
| # | |||||
| # SOURCES | |||||
| # - AWS RDS pricing: https://aws.amazon.com/rds/postgresql/pricing/ | |||||
| # (db.t4g.medium us-east-1: $0.082/h, $0.115/GB-mo gp3) | |||||
| # - AWS EC2 pricing: https://aws.amazon.com/ec2/pricing/on-demand/ | |||||
| # (t4g.medium us-east-1: $0.0335/h) | |||||
| # - AWS ElastiCache pricing: https://aws.amazon.com/elasticache/pricing/ | |||||
| # (cache.t4g.small us-east-1: $0.034/h) | |||||
| # - AWS Reserved Instances: https://aws.amazon.com/aws-cost-management/aws-cost-optimization/reserved-instances/ | |||||
| # (1-yr no-upfront standard ~30% off list) | |||||
| # - AWS S3 pricing: https://aws.amazon.com/s3/pricing/ | |||||
| # (Standard tier $0.023/GB-mo, first 100GB egress/mo free) | |||||
| # - OpenAI API pricing: https://openai.com/api/pricing/ | |||||
| # (text-embedding-3-small $0.02/M tokens · GPT-4o-mini $0.15/M in / $0.60/M out) | |||||
| # - Anthropic API pricing: https://www.anthropic.com/pricing#anthropic-api | |||||
| # (Claude Haiku 4.5 $0.80/M in / $4.00/M out — alternative to GPT-4o-mini for the agent layer) | |||||
| # - Grafana Cloud pricing: https://grafana.com/pricing/ | |||||
| # (Free tier: 50GB logs + 50GB traces + 10K Prometheus series) | |||||
| # | |||||
| # All prices verified 2026-05. Vendor pricing changes; rerun this model | |||||
| # quarterly. Numbers also embedded in the on-page CostModel section of the | |||||
| # project landing. |
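
The arithmetic behind the table can be re-checked with a short script. The sketch below is illustrative only: it copies the baseline and optimized figures from the rows above and recomputes the totals, the per-1k-query cost, and the embedding cost from the stated $0.02/M tokens and ~1k tokens/doc. The names `ROWS`, `totals`, `cost_per_1k_queries`, and `embedding_cost` are assumptions made for this example and are not part of the starter kit.

```python
# Figures copied from the cost table (USD per month): (baseline, optimized).
ROWS = {
    "Postgres + pgvector (RDS)": (98, 68),
    "ElastiCache Redis": (54, 36),
    "FastAPI serving": (90, 66),
    "OpenAI embeddings": (20, 1),
    "OpenAI agent function calls": (10, 3),
    "S3 + egress + Grafana": (3, 3),
}
QUERIES_PER_MONTH = 1_000_000  # tenant load assumed in the file header


def totals(rows):
    """Return (baseline, optimized, delta) summed over all components."""
    baseline = sum(b for b, _ in rows.values())
    optimized = sum(o for _, o in rows.values())
    return baseline, optimized, baseline - optimized


def cost_per_1k_queries(monthly_usd, queries=QUERIES_PER_MONTH):
    """Monthly spend divided by the number of 1k-query units served."""
    return monthly_usd / (queries / 1000)


def embedding_cost(docs, tokens_per_doc=1_000, usd_per_million_tokens=0.02):
    """Cost of embedding `docs` documents once at the listed token price."""
    return docs * tokens_per_doc / 1_000_000 * usd_per_million_tokens


baseline, optimized, delta = totals(ROWS)
print(baseline, optimized, delta)                   # 275 177 98
print(f"{100 * delta / baseline:.0f}% saved")
print(f"${cost_per_1k_queries(baseline):.3f} per 1k queries at baseline")
print(f"${embedding_cost(1_000_000):.2f} for one full 1M-doc pass")
print(f"${embedding_cost(10_000):.2f} for one pass over 1% of the corpus")
```

Rerunning a script like this after each quarterly price check keeps the summary row and the per-component rows from drifting apart.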
Cost model in context
This cost model ships with AI Retrieval Platform — including the 5 ADRs that produced the optimization deltas.
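
ADR-004 is the one behind the embedding row: hash the content with SHA-256 and re-embed only when the hash changes. A minimal sketch of that idea follows. It is not the project's incremental_manager.py; the function names and the in-memory `stored` dictionary are hypothetical stand-ins for whatever the project actually persists.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_to_reembed(docs: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose content hash changed."""
    return [
        doc_id
        for doc_id, text in docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]


# Hashes recorded at the previous ingest (hypothetical store).
stored = {"a": content_hash("alpha"), "b": content_hash("beta")}

# Current corpus: "a" unchanged, "b" edited, "c" newly added.
current = {"a": "alpha", "b": "beta v2", "c": "gamma"}

changed = docs_to_reembed(current, stored)
print(changed)  # ['b', 'c']
```

Only the IDs returned here would be sent to the embedding API, which is why the recurring cost scales with churn rather than with corpus size.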