Build the evaluation system that decides whether AI ships
Ship a production LLM eval framework with multi-metric scoring, a judge-cascade ensemble (Haiku → Sonnet → GPT-4o), variance-based agreement, recall@k retrieval gating, a GitHub Actions regression gate that blocks merges, and 5 committed ADRs. Module 01 unlocks with PRO; the platform unlocks with EXPERT.
The LLM-eval system-design portfolio piece for staff AI roles — 5 committed ADRs, a real cost model defending the judge cascade, a regression gate engineers actually merge against, and a gold-set protocol you can defend in a calibration review.
- MetricRegistry + EvaluationEngine + LLMJudge running locally on a 100-case sample set
- FastAPI test-management CRUD with SQLAlchemy ORM, pagination, and per-suite tag scoping
- 3-judge cascade ensemble (Haiku triage / Sonnet primary / GPT-4o adversarial) with weighted-average consensus
- GitHub Actions CI workflow that runs the eval suite on PR, posts a diff comment, and blocks merge on regression
- 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
- RAGAS + hallucination scaffolding, online-eval drift detector, and human-annotation workflow primitives
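As a rough illustration of the weighted-average consensus and variance-based agreement named in the list above, here is a minimal sketch. The judge keys, the weights, and the agreement formula (one minus variance normalized by its maximum on a 0 to 1 scale) are assumptions made for this example, not the starter kit's implementation; the median fallback mirrors the behavior described in the ADR summary.

```python
from statistics import median, pvariance
from typing import Optional

# Hypothetical weights for illustration only; the kit defines its own in the ADRs.
WEIGHTS = {"haiku": 0.2, "sonnet": 0.5, "gpt-4o": 0.3}


def consensus(scores: dict[str, Optional[float]]) -> dict[str, float]:
    """Combine per-judge scores in [0, 1]; None marks a judge that failed."""
    ok = {judge: s for judge, s in scores.items() if s is not None}
    if not ok:
        raise ValueError("no judge returned a score")
    values = list(ok.values())
    if len(ok) < len(scores):
        # At least one judge failed: fall back to the median of the rest.
        score = median(values)
    else:
        total_weight = sum(WEIGHTS[judge] for judge in ok)
        score = sum(WEIGHTS[judge] * s for judge, s in ok.items()) / total_weight
    # Population variance of values in [0, 1] never exceeds 0.25,
    # so this maps agreement into [0, 1] (1.0 = identical scores).
    variance = pvariance(values) if len(values) > 1 else 0.0
    return {"score": score, "agreement": 1.0 - variance / 0.25}
```

A low agreement value is the kind of signal that could route a case to human review rather than trusting the averaged score.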
Module 01 unlocks with PRO. The full platform unlocks with EXPERT.
Module 01 (~2h) ships a complete eval engine — MetricRegistry, EvaluationEngine, LLMJudge, BatchEvaluator with SQLite/JSON storage. Included with PRO; bring your own test cases. Modules 02-07 (~15h additional) layer on the test-management API, multi-judge cascade, CI/CD regression gate, RAG eval, online drift detection, and human-annotation workflow. Unlocks with EXPERT.
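To make the registry idea concrete, here is a minimal sketch of what a metric registry can look like. The class shape, the `register` decorator, `score_all`, and the two example metrics are hypothetical and chosen for illustration; the kit's actual `MetricRegistry` interface may differ.

```python
from typing import Callable

# (prediction, reference) -> score in [0, 1]
MetricFn = Callable[[str, str], float]


class MetricRegistry:
    """Name -> metric function lookup (hypothetical interface)."""

    def __init__(self) -> None:
        self._metrics: dict[str, MetricFn] = {}

    def register(self, name: str) -> Callable[[MetricFn], MetricFn]:
        def decorator(fn: MetricFn) -> MetricFn:
            if name in self._metrics:
                raise ValueError(f"metric {name!r} already registered")
            self._metrics[name] = fn
            return fn
        return decorator

    def score_all(self, prediction: str, reference: str) -> dict[str, float]:
        return {name: fn(prediction, reference) for name, fn in self._metrics.items()}


registry = MetricRegistry()


@registry.register("exact_match")
def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())


@registry.register("token_f1")
def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    # Count overlapping tokens, respecting multiplicity.
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Calling `registry.score_all(prediction, reference)` returns one score per registered metric, which keeps the engine independent of any particular metric.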
Foundation. Platform. Production.
Each phase ends with a tagged release, a passing CI gate, and an agreement-metric review. No ambiguity about where you are.
Eval engine running locally. MetricRegistry + EvaluationEngine + LLMJudge + BatchEvaluator on a 100-case sample set with SQLite + JSON storage.
- ✓ Working `python -m llm_eval.demo` with MockLLMJudge
- ✓ MetricRegistry with 4 built-in metrics
- ✓ BatchEvaluator with retry + rate limiting
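The retry and rate-limiting behavior listed above could be sketched as follows. The function name `run_batch`, its parameters, and the fixed-pause rate limit are assumptions for illustration, not the kit's `BatchEvaluator` API; the `sleep` argument is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_batch(
    items: Iterable[T],
    evaluate: Callable[[T], R],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    min_interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> list[R]:
    """Evaluate items one by one with exponential backoff and a fixed pause."""
    results: list[R] = []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(evaluate(item))
                break
            except Exception:  # a real runner would catch only transient errors
                if attempt == max_attempts:
                    raise
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
        # Crude rate limit: fixed pause after every case.
        sleep(min_interval)
    return results
```

A token-bucket limiter would be a natural upgrade once the judge API publishes per-minute quotas; the fixed pause is simply the shortest thing that works.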
Test-management API + multi-judge cascade + CI regression gate. PR-triggered eval that blocks merge when retrieval recall@5 drops more than 3pp.
- ✓ FastAPI CRUD + per-suite tag scoping (ADR-005)
- ✓ 3-judge cascade with weighted-average consensus
- ✓ .github/workflows/llm-eval.yml + PR comment bot + merge gate
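The merge gate described above reduces to two checks on recall@5. The sketch below uses the 0.85 floor and 3pp maximum regression quoted in the ADR summary; the function names and the returned message format are hypothetical and do not reflect the contents of `check_quality_gates.py`.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant document ids found in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def recall_gate(
    baseline: float,
    candidate: float,
    floor: float = 0.85,
    max_drop_pp: float = 3.0,
) -> tuple[bool, str]:
    """Return (passed, reason) for a candidate recall@5 against the baseline."""
    drop_pp = (baseline - candidate) * 100
    if candidate < floor:
        return False, f"recall@5 {candidate:.3f} is below the floor {floor:.2f}"
    # Small epsilon so a drop of exactly max_drop_pp is not rejected by float error.
    if drop_pp > max_drop_pp + 1e-9:
        return False, f"recall@5 dropped {drop_pp:.1f}pp (max {max_drop_pp:.1f}pp)"
    return True, "ok"
```

In a CI step, a failed gate would typically translate into a non-zero exit code so the workflow marks the check as failed and the merge is blocked.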
RAG eval + online drift detection + human-annotation loop. Faithfulness scoring on real RAG traces, drift alerts on consensus regressions, gold-set growth via tier-B adjudication.
- ✓ RAGAS scaffolding + recall@k gate (ADR-002)
- ✓ Online-eval Redis consumer + drift detector
- ✓ Annotation queue + Krippendorff α + gold-set promotion
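For readers who have not computed Krippendorff's α before, here is a minimal sketch for nominal labels using the standard coincidence-matrix formulation. The function name and input layout are assumptions for illustration, and returning 1.0 when only one label is ever used is a convention adopted here, not something the kit specifies.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """units[i] holds the labels given to item i by the annotators who rated it."""
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # items with a single rating cannot be paired
        # Every ordered pair of ratings from different annotators counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0  # only one label in use; convention assumed for this sketch
    return 1 - observed / expected
```

An α ≥ 0.75 check, matching the threshold quoted in the ADR summary, would then be a single comparison on the returned value before promoting cases into the gold set.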
One command. Local FastAPI + SQLite + MockLLMJudge — no API key.
What lives in the repo
You get the real eval engine on day one — MetricRegistry, EvaluationEngine, LLMJudge, BatchEvaluator, plus the M02 FastAPI CRUD skeleton, M03 multi-judge router, M04 GitHub Actions workflow + quality-gate runner. Modules 05-07 ship as scaffolded source files you complete in the tutorial.
- llm_eval/core/ — MetricRegistry, EvaluationEngine, judges, BatchEvaluator
- api/ — FastAPI test-management CRUD with per-suite tag tables
- llm_eval/multi_judge/ — 3-judge cascade router + 6 consensus strategies
- .github/workflows/ + scripts/ — CI eval workflow + check_quality_gates.py + PR comment bot
- src/eval/ — RAGAS / online-eval / drift / annotation scaffolding (M05-M07)
- docs/adr/ + docs/cost-model/ — 5 committed ADRs (one Deprecated) + the runnable cost-model CSV
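The tag-scoping idea behind the `api/` layer can be illustrated with a compound `(suite_id, tag)` index and a paginated lookup. This sketch deliberately swaps in the standard-library `sqlite3` module instead of SQLAlchemy or Postgres so it runs anywhere; the column names, index name, and `list_cases` helper are assumptions, not the kit's schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE test_case_tags (
            test_case_id INTEGER NOT NULL,
            suite_id     INTEGER NOT NULL,
            tag          TEXT    NOT NULL,
            PRIMARY KEY (test_case_id, tag)
        );
        -- Compound index so tag lookups stay scoped to one suite.
        CREATE INDEX ix_tags_suite_tag ON test_case_tags (suite_id, tag);
        """
    )
    return conn


def list_cases(
    conn: sqlite3.Connection, suite_id: int, tag: str, limit: int = 20, offset: int = 0
) -> list[int]:
    """Return one page of test-case ids carrying `tag` within a single suite."""
    rows = conn.execute(
        "SELECT test_case_id FROM test_case_tags "
        "WHERE suite_id = ? AND tag = ? "
        "ORDER BY test_case_id LIMIT ? OFFSET ?",
        (suite_id, tag, limit, offset),
    ).fetchall()
    return [row[0] for row in rows]
```

Because every query filters on `suite_id` first, a bulk tag update in one suite touches only that suite's slice of the index.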
LLM Evaluation Framework Starter Kit
Pre-built eval engine + FastAPI CRUD + multi-judge cascade + GitHub Actions workflow. Now bundled: 5 ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.
The same eval demo — but built for the regression-gating case.
Most LLM-eval tutorials show you a notebook scoring 50 hand-picked prompts. This shows what changes when CI runs on every PR, three judges disagree on 15% of cases, and the cost model has to defend itself to a CFO.
- Krippendorff α tracking (ADR-001)
- Haiku → Sonnet → GPT-4o; weighted-avg consensus, variance agreement (ADR-003 + ADR-004)
- SQLite demo · per-suite Postgres tables in production (ADR-005 reversal)
- llm-eval.yml on every PR; recall@5 + MRR gate; merge block on regression (ADR-002)
- docs/cost-model/
Write the ADRs staff engineers actually get judged on.
Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the v0 → per-suite migration after the bulk-tag deadlock incident. The kind of doc that travels with you to your next role.
Tier-A/Tier-B labeling protocol for the gold set
Krippendorff α ≥ 0.75
recall@k + MRR for retrieval gating; nDCG rejected
recall@5 gates CI (floor 0.85 + max 3pp regression); MRR as the diagnostic, never on the gate
Claude Sonnet default judge; GPT-4o adversarial-only
Multi-judge consensus is weighted average, not majority vote
WEIGHTED_AVERAGE default with judge-specific weights from ADR-003; MEDIAN fallback when any judge fails
Single shared test_cases table for all suites (v0)
test_case_tags tables with (suite_id, tag) compound indexes
Read the FinOps story, not just the latency one.
Module 06 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. It models a 10k evals/mo load at real Anthropic, OpenAI, and AWS list prices, with the model-cascade and reserved-instance levers wired up. The version you’ll defend to a CFO.
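The arithmetic behind a cascade cost model is simple enough to sketch. Everything numeric below is a made-up placeholder chosen to keep the math readable: the per-eval costs are not vendor list prices, and the traffic shares are not measurements from the kit. Only the 10k evals/mo load comes from the description above; the `Tier` and `monthly_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    cost_per_eval_usd: float  # PLACEHOLDER value, not a real list price
    share_of_traffic: float   # fraction of evals that reach this tier


def monthly_cost(evals_per_month: int, tiers: list[Tier]) -> float:
    """Sum of (evals reaching the tier) x (cost per eval) over all tiers."""
    return sum(
        evals_per_month * t.share_of_traffic * t.cost_per_eval_usd for t in tiers
    )


# Cascade: every case hits triage, fewer escalate to each later judge.
cascade = [
    Tier("triage", 0.001, 1.00),
    Tier("primary", 0.010, 0.40),
    Tier("adversarial", 0.020, 0.15),
]
# Baseline: every case goes to all three judges.
run_all = [
    Tier("triage", 0.001, 1.0),
    Tier("primary", 0.010, 1.0),
    Tier("adversarial", 0.020, 1.0),
]

cascade_usd = monthly_cost(10_000, cascade)
run_all_usd = monthly_cost(10_000, run_all)
```

Swapping in real prices and observed escalation rates turns the same two calls into the cascade-versus-ensemble comparison a finance review would ask for.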
Optimization levers
Async architecture review with a staff-level reviewer (cohort beta).
Submit your repo, your ADR draft, or your judge-calibration plan. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.
Bring a diff, an ADR draft, or a calibration plan.
The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.
PRO unlocks Module 01. EXPERT unlocks the full platform.
PRO is the entry point — Module 01 (the eval engine) plus the rest of the PRO catalog. EXPERT unlocks Modules 02-07 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.
Pick this if you own the gate, not just a feature.
Staff / principal engineers · LLM platform
You own the regression gate, the judge calibration, and the answer to 'why are we shipping this?' that your VP asks before launch.
Engineering managers · AI
You need a reference architecture for the eval system your CTO will ask about before the AI team gets headcount or a budget for human labelers.
ML platform / infra leads
You absorb LLM eval without absorbing 4 new vendors. Anthropic, OpenAI, Postgres, Redis — tools you already operate. This is the playbook.
Founding engineers · AI startups
Your investors will ask 'how do you know your model is getting better?' before they ask about scale. The 5 ADRs + recall@k gate + cost model are the answer.
Going deeper? Four tracks back this project.
The LLM Evaluation curriculum is the foundation. These four tracks let you go deeper on retrieval, agents, observability, and the LLM system that you're evaluating.
Quick answers.
Ready to ship the system that decides whether AI ships?
Start with PRO ($29/mo) for Module 01 — the Evaluation Framework engine. Or unlock the full 7-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).