Skip to content
Back to AI Retrieval Platform
Cost Models·AI Retrieval Platform

AI Retrieval Platform cost model

Component-by-component baseline vs optimized cost breakdown for this EXPERT project. The source CSV ships in the project starter kit at docs/cost-model/ and is downloadable below.

Last updated 59 componentsSource CSV
# ai-retrieval-platform cost model — v1
# Tenant load assumed: 1 tenant × 1M documents × 1M queries/mo
# Embedding model: OpenAI text-embedding-3-small (1536 dims).
# Inference workload: cross-encoder ms-marco-MiniLM-L-6-v2 on CPU (no GPU).
# Optional Anthropic Claude / OpenAI agent function calls at ~5M tokens/mo.
# AWS region: us-east-1list prices as of 2026-05.
# Sources cited at end of file. Verify against current vendor pricing pages
# before signing a contract.
#
# Read this file in any spreadsheet (NumbersExcelGoogle SheetsLibreOffice).
# Lines starting with `#` are comments — most spreadsheets ignore them or treat
# them as headers. If yours doesn'tpaste rows after the comment block.
componentsubbaseline_monthly_usdoptimized_monthly_usddelta_usdnotes
Postgres + pgvector (RDS)db.t4g.medium · 100GB gp3 · 1M vectors × 1536-dim ≈ 6GB · HNSW + GIN indexes986830List $0.082/h ≈ $59/mo + $0.115/GB-mo storage. Optimized = 1-yr Reserved Instance (~30% off).
ElastiCache Redis · semantic cachecache.t4g.small · primary + replica · 0.95 cosine threshold · 1h TTL · ~50% hit-rate measured in Part 4543618List $0.034/h × 2 ≈ $49/mo. Optimized = 1-yr Reserved Node.
FastAPI serving · 3 replicas (HPA min)EC2 t4g.medium × 3 · /search + /search/hybrid + /agent endpoints · cross-encoder reranker on CPU906624List $0.0335/h × 3 ≈ $72/mo. Optimized = 1-yr RI commit. HPA can scale to 10 — see lever 2.
OpenAI embeddings (one-shot at ingest)1M docs × text-embedding-3-small ≈ $20 one-shot · ADR-004 hash-based incremental drops to ~1% churn20119$0.02/M tokens · ~1k tokens/doc avg. Optimized = $0.20/mo at 1% daily churn (per ADR-004 hash-based incremental updates) — the breakthrough lever.
OpenAI agent function calls (Part 5)~5M tokens/mo · agent tool-call loop · semantic-cache hit-rate reduces calls1037GPT-4o-mini $0.15/M in / $0.60/M out OR Anthropic Claude Haiku $0.80/M / $4.00/M. Optimized via semantic-cache hit-rate (lever 3).
S3 + egress + Grafana free tier50GB embedding backups + eval results · 50GB free observability · ~10GB/mo egress330S3 Standard $0.023/GB-mo. First 100GB egress/mo free. Grafana Cloud free tier covers Prometheus + 50GB logs + 50GB traces + 10K series.
Total · 1 tenant @ 1M queries~$0.30 per 1k queries at baseline27517798−$98/mo · −36% — RIs + cache + incremental embedding drive most of the savings.
#
# OPTIMIZATION LEVERS
leverdescriptionimpact
1-yr Reserved InstancesCommit to 12-month reserved capacity on RDS + ElastiCache + the 3 EC2 BentoML/FastAPI instances once load is stable for 30 days. Standard ~30% off list.−$72 / mo
Semantic cache hit-ratePart 4's semantic_cache.py uses cosine similarity threshold 0.95 with a 1-hour TTL. Measured hit-rate is 50%+ on real query distributions (queries cluster around popular topics). Halves embedding + query API costs.−$15 / mo
Hash-based incremental embedding (ADR-004)incremental_manager.py computes SHA-256 of content; only re-embeds when content changes (~1% daily churn vs 100% nightly full re-embed). Turns embedding from monthly recurring ($600 nightly model) to amortized one-shot.−$15 / mo
HNSW ef_search tuning (Part 2 sweep)scripts/benchmark_hnsw.py sweeps ef_search 10–200 vs latency. Default ef_search=40 hits ~0.81 recall@10 at <50ms p50. Lower ef_search trades recall for latency / cost-of-shared-buffer-size.~−5% RDS instance size at 10M+ scale
#
# CROSSOVER ANALYSIS — when does this stop fitting on a small managed footprint?
# At ~10M vectors (10× current)the HNSW index in pgvector's shared_buffers
# starts thrashing on db.t4g.medium. Two scaling paths:
# 1. Vertical: bump to db.r6g.large (3.84GB ECU) ~ $182/mo on-demand. Linear cost.
# 2. Horizontal: shard by tenant_id via infra/shard_router.py (Module 5). Two
# db.t4g.medium nodes ~ $196/mobut each node has half the index in cache.
# Decision threshold: when p99 latency exceeds 80ms (warning band on the SLA
# verifier)evaluate the shard split or the vertical bump. The runbook YAML
# (runbook/retrieval_platform.yaml) covers F1 (recall drop) + F2 (latency spike)
# diagnosis paths.
#
# SOURCES
# - AWS RDS pricing: https://aws.amazon.com/rds/postgresql/pricing/
# (db.t4g.medium us-east-1: $0.082/h$0.115/GB-mo gp3)
# - AWS EC2 pricing: https://aws.amazon.com/ec2/pricing/on-demand/
# (t4g.medium us-east-1: $0.0335/h)
# - AWS ElastiCache pricing: https://aws.amazon.com/elasticache/pricing/
# (cache.t4g.small us-east-1: $0.034/h)
# - AWS Reserved Instances: https://aws.amazon.com/aws-cost-management/aws-cost-optimization/reserved-instances/
# (1-yr no-upfront standard ~30% off list)
# - AWS S3 pricing: https://aws.amazon.com/s3/pricing/
# (Standard tier $0.023/GB-mofirst 100GB egress/mo free)
# - OpenAI API pricing: https://openai.com/api/pricing/
# (text-embedding-3-small $0.02/M tokens · GPT-4o-mini $0.15/M in / $0.60/M out)
# - Anthropic API pricing: https://www.anthropic.com/pricing#anthropic-api
# (Claude Haiku 4.5 $0.80/M in / $4.00/M out — alternative to GPT-4o-mini for the agent layer)
# - Grafana Cloud pricing: https://grafana.com/pricing/
# (Free tier: 50GB logs + 50GB traces + 10K Prometheus series)
#
# All prices verified 2026-05. Vendor pricing changes; rerun this model
# quarterly. Numbers also embedded in the on-page CostModel section of the
# project landing.
Cost model in context

This cost model ships with AI Retrieval Platform — including the 5 ADRs that produced the optimization deltas.

Open project →
Press Cmd+K to open