ai-de.net/Projects/P08 · LLM evaluation framework — multi-judge cascade + recall@k gate
EXPERT-tier · PRO unlocks Module 01 · AI & vectors track · P08

Build the evaluation system that decides whether AI ships

Ship a production LLM eval framework with multi-metric scoring, a judge-cascade ensemble (Haiku → Sonnet → GPT-4o), variance-based agreement, recall@k retrieval gating, a GitHub Actions regression gate that blocks merges, and 5 committed ADRs. Module 01 unlocks with PRO; the platform unlocks with EXPERT.

Timeline: 17-19 hours
Difficulty: Senior+
Stack: Pydantic · FastAPI · Anthropic · OpenAI · GitHub Actions · SQLAlchemy

The LLM-eval system-design portfolio piece for staff AI roles — 5 committed ADRs, a real cost model defending the judge cascade, a regression gate engineers actually merge against, and a gold-set protocol you can defend in a calibration review.

By the end you will have wired
  • MetricRegistry + EvaluationEngine + LLMJudge running locally on a 100-case sample set
  • FastAPI test-management CRUD with SQLAlchemy ORM, pagination, and per-suite tag scoping
  • 3-judge cascade ensemble (Haiku triage / Sonnet primary / GPT-4o adversarial) with weighted-average consensus
  • GitHub Actions CI workflow that runs the eval suite on PR, posts a diff comment, and blocks merge on regression
  • 5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
  • RAGAS + hallucination scaffolding, online-eval drift detector, and human-annotation workflow primitives
PREREQ · SENIOR+
Built for engineers shipping LLMs in production. You should be comfortable with Python services, async and concurrency, and at least one of: retrieval, vendor LLM APIs, or CI/CD pipelines. Not a “what is an eval” course.
llm_eval.platform · 7 modules · gold-set armed · tier-A 1k · tier-B 400 gold · regression gate ✓

Test inputs (labeling protocol: see ADR-001)
  • gold/* (tier-B): 400 labeled · κ ≥ 0.75
  • tier_a/* (triage): 1k single + 10% spot-check
  • live_traffic: implicit + explicit feedback
  • adversarial/*: red-team injections
Eval engine (judge cascade: see ADR-003)
  • MetricRegistry: accuracy · F1 · BLEU · BERTScore
  • JudgeCascade: Haiku → Sonnet → GPT-4o
  • consensus.weighted_avg: variance-based agreement
  • RAGAS pipeline: faithfulness · groundedness
Storage (per-suite isolation: see ADR-005, Deprecated)
  • SQLite (demo): eval.db · runs + results
  • Postgres + per-suite: production · ADR-005 reversal
  • Redis: judge response cache · 12% hit
  • JSONL gold store: tier-B labeled cases
Surfaces (CI gate logic: see quality-gates.py)
  • .github/workflows: eval-on-PR · gates
  • PR comment bot: regression diff per metric
  • Drift detector: z-score on rolling window
  • Annotation queue: Krippendorff α tracking
# Judge cascade — 51% cost cut
Haiku triage runs first ($0.80/M in)
Sonnet primary skipped when Haiku and a baseline judge agree within 0.1
GPT-4o fires only on adversarial cases or > 0.2 disagreement
→ ~$0.011 per evaluation at optimized load
# Regression gate — blocks merge
PR triggers eval suite on labeled gold + tier-A samples
recall@5 floor 0.85 + max 3pp regression = error
MRR + per-tag breakdown posted as PR comment (warning only)
→ check_quality_gates.py exit 1 on error severity
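The gate rules above can be sketched in a few lines. This is a minimal illustration, not the contents of check_quality_gates.py: the names `GateResult`, `check_gates`, and `exit_code`, and the metric dictionary keys, are assumptions made for the example. Only the thresholds (0.85 floor, 3pp cap, MRR as warning) come from the description.

```python
from dataclasses import dataclass

RECALL_FLOOR = 0.85       # recall@5 floor stated for the gate
MAX_REGRESSION_PP = 3.0   # largest allowed drop vs. baseline, in percentage points


@dataclass
class GateResult:
    severity: str  # "warning" or "error"
    message: str


def check_gates(current: dict, baseline: dict) -> list:
    """Compare current metrics to the baseline and collect gate findings."""
    results = []
    recall = current["recall@5"]
    # Round so float noise does not decide a borderline 3.0pp case.
    drop_pp = round((baseline["recall@5"] - recall) * 100, 6)
    if recall < RECALL_FLOOR:
        results.append(GateResult("error", f"recall@5 {recall:.3f} below floor {RECALL_FLOOR}"))
    if drop_pp > MAX_REGRESSION_PP:
        results.append(GateResult("error", f"recall@5 dropped {drop_pp:.1f}pp vs baseline"))
    if current["mrr"] < baseline["mrr"]:
        # MRR is diagnostic only: report it, never fail on it.
        results.append(GateResult("warning", f"MRR fell to {current['mrr']:.3f}"))
    return results


def exit_code(results) -> int:
    """Return 1 when any finding has error severity, else 0."""
    return 1 if any(r.severity == "error" for r in results) else 0
```

A CI step would call `sys.exit(exit_code(check_gates(current, baseline)))`, so that a non-zero exit fails the job and blocks the merge.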
100 sample test cases shipped
5 ADRs committed in starter kit
−51% judge cost vs Sonnet-only baseline
Curriculum · 7 modules · 17-19 hours · 3 phases

Module 01 unlocks with PRO. The full platform with EXPERT.

Module 01 (~2h) ships a complete eval engine — MetricRegistry, EvaluationEngine, LLMJudge, BatchEvaluator with SQLite/JSON storage. Included with PRO; bring your own test cases. Modules 02-07 (~15h additional) layer on the test-management API, multi-judge cascade, CI/CD regression gate, RAG eval, online drift detection, and human-annotation workflow. Unlocks with EXPERT.

P08 · 7 modules · 17-19 hours · 60 lessons
Free preview EXPERT required
M01
Evaluation Framework
MetricRegistry pattern (Protocol-based plugins), built-in metrics (accuracy, F1, BLEU, BERTScore), LLMJudge wrapping the Anthropic SDK at T=0.0, BatchEvaluator with ThreadPoolExecutor + retry, JSON + SQLite storage backends, and a runnable demo with a deterministic MockLLMJudge. The honest baseline before anything else exists.
Phase 1 · 2h · 8 lessons · PRO TIER
Unlock with PRO →
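As a rough picture of what a Protocol-based MetricRegistry can look like, here is a minimal sketch. It is not the starter kit's code: the `Metric` protocol shape, the `score(prediction, reference)` signature, and the two toy metrics are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class Metric(Protocol):
    """Hypothetical plugin contract: a name plus a score() method."""
    name: str

    def score(self, prediction: str, reference: str) -> float: ...


class ExactMatch:
    name = "accuracy"

    def score(self, prediction: str, reference: str) -> float:
        return float(prediction.strip() == reference.strip())


class TokenF1:
    name = "f1"

    def score(self, prediction: str, reference: str) -> float:
        pred, ref = prediction.split(), reference.split()
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)


class MetricRegistry:
    def __init__(self) -> None:
        self._metrics: Dict[str, Metric] = {}

    def register(self, metric: Metric) -> None:
        # Structural check: any object with `name` and `score` qualifies.
        if not isinstance(metric, Metric):
            raise TypeError("metric must provide `name` and `score()`")
        if metric.name in self._metrics:
            raise ValueError(f"metric {metric.name!r} already registered")
        self._metrics[metric.name] = metric

    def evaluate(self, prediction: str, reference: str,
                 names: Optional[Sequence[str]] = None) -> Dict[str, float]:
        selected = names if names is not None else list(self._metrics)
        return {n: self._metrics[n].score(prediction, reference) for n in selected}
```

Because the contract is structural, a new metric does not need to inherit from anything; it only has to expose `name` and `score()`.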
M02
Test Management System
Pydantic schemas (TestCaseBase / Create / Response / Filter / BulkCreate), SQLAlchemy ORM with per-suite tag tables (post-ADR-005 reversal), FastAPI CRUD with pagination + filtering + bulk operations, JSON / JSONL / CSV import paths, and version tracking on every update.
Phase 2 · 2h · 9 lessons · EXPERT TIER
Unlock with EXPERT →
M03
Multi-Judge Evaluation
Three-judge cascade router (Haiku triage → Sonnet primary → GPT-4o adversarial), six consensus strategies (weighted avg / median / majority / unanimous / highest / lowest), variance-based agreement metric (Kappa scaffolded for extension), and graceful degradation when judges fail.
Phase 2 · 2h · 8 lessons · EXPERT TIER
Unlock with EXPERT →
M04
CI/CD Integration
.github/workflows/llm-eval.yml triggering on PR + push, baseline artifact download, quality-gate runner (recall@5 floor + 3pp regression cap + per-tag breakdown), PR comment bot via gh-script, merge blocking via exit 1 on error severity.
Phase 2 · 2h · 9 lessons · EXPERT TIER
Unlock with EXPERT →
M05
RAG & Hallucination Evaluation
RAGAS pipeline scaffolding (faithfulness / groundedness / context precision / context recall), retrieval metric module (recall@k + MRR per ADR-002), hallucination detector with attribution, failure taxonomy categorizer, and adversarial test generator with template-based negative examples.
Phase 3 · 2.5h · 9 lessons · EXPERT TIER
Unlock with EXPERT →
M06
Production Evaluation Platform
Online evaluation harness (Redis-stream consumer, batch windowing), drift detector (z-score on rolling window over consensus + per-judge), business-metric alignment scaffolding, ship/no-ship decision framework keyed off the cost-model CSV, dataset versioning by content hash.
Phase 3 · 3h · 9 lessons · EXPERT TIER
Unlock with EXPERT →
M07
Human-in-the-Loop Evaluation
Annotation task schema, queue + assignment + adjudication flow, tier-A/tier-B routing per ADR-001, Krippendorff α tracker, gold-dataset promotion criteria, judge calibration loop driven by human-adjudicated tier-B disagreements, prioritization for high-uncertainty cases.
Phase 3 · 3.5h · 8 lessons · EXPERT TIER
Unlock with EXPERT →
Module 01 with PRO ($29/mo) · Modules 02-07 with EXPERT ($79/mo)
See plans →
Backed by curriculum
LLM Evaluation
5 modules · 14 hours · LLM-judge · RAGAS · Recall@k · Calibration
Open curriculum
This curriculum is the foundation for the project, not a sales add-on. EXPERT subscribers get full access to all modules.
The build, in 3 phases

Foundation. Platform. Production.

Each phase ends with a tagged release, a passing CI gate, and an agreement-metric review. No ambiguity about where you are.

01 · ~2h
Foundation (Module 01)

Eval engine running locally. MetricRegistry + EvaluationEngine + LLMJudge + BatchEvaluator on a 100-case sample set with SQLite + JSON storage.

  • Working `python -m llm_eval.demo` with MockLLMJudge
  • MetricRegistry with 4 built-in metrics
  • BatchEvaluator with retry + rate limiting
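The BatchEvaluator idea (thread pool plus retry) can be approximated as follows. This is a sketch under assumptions: `with_retry` and `evaluate_batch` are hypothetical names, the judge is any callable, and the backoff values are placeholders. Rate limiting is not modeled here; `max_workers` only bounds concurrency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def with_retry(fn: Callable[[T], R], case: T,
               attempts: int = 3, base_delay: float = 0.01) -> R:
    """Call fn(case), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn(case)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be >= 1")


def evaluate_batch(cases: Sequence[T], judge: Callable[[T], R],
                   max_workers: int = 4, attempts: int = 3) -> List[R]:
    """Score every case concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: with_retry(judge, c, attempts), cases))
```

`Executor.map` returns results in input order, which keeps scores aligned with test cases without extra bookkeeping.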
02 · ~6h
Platform (Modules 02-04)

Test-management API + multi-judge cascade + CI regression gate. PR-triggered eval that blocks merge when retrieval recall@5 drops more than 3pp.

  • FastAPI CRUD + per-suite tag scoping (ADR-005)
  • 3-judge cascade with weighted-average consensus
  • .github/workflows/llm-eval.yml + PR comment bot + merge gate
03 · ~9h
Production (Modules 05-07)

RAG eval + online drift detection + human-annotation loop. Faithfulness scoring on real RAG traces, drift alerts on consensus regressions, gold-set growth via tier-B adjudication.

  • RAGAS scaffolding + recall@k gate (ADR-002)
  • Online-eval Redis consumer + drift detector
  • Annotation queue + Krippendorff α + gold-set promotion
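A z-score drift check on a rolling window, as named in the drift-detector bullet, might look like the sketch below. The class name, window size, warm-up count, and the 3.0 threshold are assumptions for illustration; the course's detector may differ.

```python
from collections import deque
from statistics import mean, pstdev


class DriftDetector:
    """Flag a score whose z-score against the rolling window exceeds a threshold."""

    def __init__(self, window: int = 50, threshold: float = 3.0, min_samples: int = 10):
        self.window = deque(maxlen=window)  # oldest scores fall out automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, score: float) -> bool:
        drifted = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0:  # a perfectly flat window gives no usable z-score
                drifted = abs(score - mu) / sigma > self.threshold
        self.window.append(score)
        return drifted
```

Running one detector on the consensus score and one per judge helps separate a real quality shift from a single judge drifting.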
Project setup · 10 minutes

One command. Local FastAPI + SQLite + MockLLMJudge — no API key.

What lives in the repo

You get the real eval engine on day one — MetricRegistry, EvaluationEngine, LLMJudge, BatchEvaluator, plus the M02 FastAPI CRUD skeleton, M03 multi-judge router, M04 GitHub Actions workflow + quality-gate runner. Modules 05-07 ship as scaffolded source files you complete in the tutorial.

  • llm_eval/core/ — MetricRegistry, EvaluationEngine, judges, BatchEvaluator
  • api/ — FastAPI test-management CRUD with per-suite tag tables
  • llm_eval/multi_judge/ — 3-judge cascade router + 6 consensus strategies
  • .github/workflows/ + scripts/ — CI eval workflow + check_quality_gates.py + PR comment bot
  • src/eval/ — RAGAS / online-eval / drift / annotation scaffolding (M05-M07)
  • docs/adr/ + docs/cost-model/ — 5 committed ADRs (one Deprecated) + the runnable cost-model CSV
Download · Starter Kit · 70 files · 91 KB

LLM Evaluation Framework Starter Kit

Pre-built eval engine + FastAPI CRUD + multi-judge cascade + GitHub Actions workflow. Now bundled: 5 ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 70 files · ADRs + cost model bundled · last updated 2026-05-08
~/projects/llm-evaluation-framework — zsh
1. Unzip and run the offline demo (no API key)
$ unzip llm-evaluation-framework-starter.zip
$ cd llm-evaluation-framework-starter
$ python -m llm_eval.demo
2. Inspect a finished eval run
$ sqlite3 eval.db \
    "SELECT run_id, num_test_cases, aggregate_scores FROM evaluation_runs;"
3. Run the multi-judge cascade locally
$ export ANTHROPIC_API_KEY=...
$ python -m llm_eval.multi_judge.evaluator \
    --suite gold/factual_qa.jsonl --judges haiku,sonnet
4. Open the cost model + read ADR-001
$ open docs/cost-model/llm-eval-cost-model.csv
$ less docs/adr/001-labeling-protocol.md
100 sample test cases
50 tier-B gold examples
30 historical eval runs
200 feedback events
Production hardening

The same eval demo — but built for the regression-gating case.

Most LLM-eval tutorials show you a notebook scoring 50 hand-picked prompts. This shows what changes when CI runs on every PR, three judges disagree on 15% of cases, and the cost model has to defend itself to a CFO.

Notebook eval · What most teams ship
  • Test cases: 50 prompts hand-picked, no agreement metric
  • Judge: One LLM call, hope it's right
  • Storage: Pickle file or notebook output cell
  • CI integration: None (re-run the notebook on demand)
  • Cost: Whatever the bill says next month
  • Drift detection: Notice when the team complains

Your eval platform · Modules 02–07
  • Test cases: Tier-A 1k single-annotator + Tier-B 400 dual-annotator gold with Krippendorff α tracking (ADR-001)
  • Judge: 3-judge cascade Haiku → Sonnet → GPT-4o; weighted-avg consensus, variance agreement (ADR-003 + ADR-004)
  • Storage: SQLite demo · per-suite Postgres tables in production (ADR-005 reversal)
  • CI integration: llm-eval.yml on every PR; recall@5 + MRR gate; merge block on regression (ADR-002)
  • Cost: Judge cascade −51% vs Sonnet-only baseline; reserved RDS+EC; CSV in docs/cost-model/
  • Drift detection: Online-eval Redis consumer + z-score drift on rolling window; Grafana panel + alert routing
EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the v0 → per-suite migration after the bulk-tag deadlock incident. The kind of doc that travels with you to your next role. Preview ADR-001 →

ADR-001 · Accepted

Tier-A/Tier-B labeling protocol for the gold set

Context: Need ≥1k cases at κ ≥ 0.7 quality on a 10 labeler-day budget; uniform dual-labeling fits neither
Decision: Tier-A 60% single-annotator + 10% spot-check; Tier-B 40% dual-annotator + adjudication on Krippendorff α ≥ 0.75
Tradeoff: 5% noise floor on Tier-A; we disclose it on every report
Reversal: Ramp Tier-A → dual-annotator if team > 8 labelers (~2 engineer-weeks)
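For readers who want to see the agreement statistic behind the α ≥ 0.75 threshold, here is a minimal sketch of Krippendorff's α for nominal labels, built from the standard coincidence-matrix definition. The function name and input shape (one tuple of labels per case) are assumptions; returning 1.0 when only one category appears is a choice made for this sketch, since α is undefined there.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: iterable of label tuples, one tuple per case (one label per annotator)."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a case with a single label carries no agreement information
        for i, j in permutations(range(m), 2):
            coincidences[(labels[i], labels[j])] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # Observed vs. expected disagreement (nominal metric: any mismatch counts as 1).
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        return 1.0  # only one category seen; treated as full agreement here
    return 1.0 - (n - 1) * observed / expected
```

For a Tier-B batch, `krippendorff_alpha_nominal(zip(annotator_a, annotator_b)) >= 0.75` would be the check implied by the decision above.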
ADR-002 · Accepted

recall@k + MRR for retrieval gating; nDCG rejected

Context: PR comment must read in one number; nDCG would double Tier-B labeling cost (graded relevance)
Decision: recall@5 gates CI (floor 0.85 + max 3pp regression); MRR as the diagnostic, never on the gate
Tradeoff: Blind to graded-relevance differences in ~5% of cases; readability wins
Reversal: Add nDCG@10 to MetricRegistry (~3 days) when graded use case lands
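The two retrieval metrics named in this ADR are short enough to sketch directly. The function names are illustrative, and this version defines recall@k as the fraction of relevant documents found in the top k; the course module may use a different variant (for example, a hit-rate definition).

```python
from typing import Iterable, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: Iterable[float]) -> float:
    """Average over queries; MRR is mean(reciprocal_rank(...)) across the suite."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

Both metrics need only binary relevance labels, which is the labeling-cost argument the ADR makes against nDCG.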
ADR-003 · Accepted

Claude Sonnet default judge; GPT-4o adversarial-only

Context: Judge cost dominates the eval bill; vendor lock and idle-GPU cost both unattractive at v1 load
Decision: 3-judge cascade: Haiku triage (.20 weight) → Sonnet primary (.50) → GPT-4o (.30, only on disagreement > 0.2 or adversarial)
Tradeoff: Variable per-eval cost; we forecast against the 85th percentile
Reversal: Self-hosted Llama 3.1 70B at ~80M tok/mo crossover; ~3 engineer-days
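One way to read the cascade routing is sketched below. Everything here is an assumption layered on the stated thresholds: judges are plain callables returning a score in [0, 1], and `baseline_score` stands in for the "baseline judge" the page mentions without defining. The course's router may be structured differently.

```python
from typing import Callable, Dict

Judge = Callable[[str], float]  # hypothetical: takes a case, returns a score in [0, 1]

SKIP_SONNET_WITHIN = 0.1  # skip Sonnet when Haiku and the baseline agree this closely
ESCALATE_ABOVE = 0.2      # call GPT-4o when judge scores disagree by more than this


def run_cascade(case: str, haiku: Judge, sonnet: Judge, gpt4o: Judge,
                baseline_score: float, adversarial: bool = False) -> Dict[str, float]:
    """Return the scores of only the judges that were actually invoked."""
    scores = {"haiku": haiku(case)}  # the cheap triage judge always runs

    # Sonnet runs only when triage and the baseline disagree.
    if abs(scores["haiku"] - baseline_score) > SKIP_SONNET_WITHIN:
        scores["sonnet"] = sonnet(case)

    # GPT-4o runs on adversarial cases or when the invoked judges disagree.
    spread = max(scores.values()) - min(scores.values())
    if adversarial or spread > ESCALATE_ABOVE:
        scores["gpt4o"] = gpt4o(case)
    return scores
```

The cost saving comes from the early exits: most cases stop after the cheapest call, and the most expensive judge is reserved for the cases where it changes the outcome.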
ADR-004 · Accepted

Multi-judge consensus is weighted average, not majority vote

Context: CI gate thresholds on a continuous score; binary vote loses signal; per-judge calibration is real
Decision: WEIGHTED_AVERAGE default with judge-specific weights from ADR-003; MEDIAN fallback when any judge fails
Tradeoff: Variance-based agreement, not Cohen's/Fleiss' κ; promote to Krippendorff α (~3 days) when needed
Reversal: Per-task-type strategy routing (~2 days); adaptive weight recalibration (~1 week)
ADR-005 · Deprecated

Single shared test_cases table for all suites (v0)

Context: Day-3 MVP: one table + global tag dictionary; suite_id as FK only
Decision: Reverted in M02: moved to per-suite test_case_tags tables with (suite_id, tag) compound indexes
Why reversed: Bulk-tag at 1.4k cases held a 9-min row-lock; CI eval queued behind it for 11 min on 2026-04-22
Replaced by: Per-suite tag isolation; ~3.5 engineer-day reversal cost
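To make the (suite_id, tag) compound index concrete, here is a small SQLite stand-in. It is only an illustration: the production design described above uses Postgres, the column names and index name are assumptions, and this sketch uses one suite-scoped table rather than reproducing the actual per-suite schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory tag table with a (suite_id, tag) compound index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE test_case_tags (
            suite_id TEXT NOT NULL,
            case_id  TEXT NOT NULL,
            tag      TEXT NOT NULL
        );
        CREATE INDEX ix_test_case_tags_suite_tag
            ON test_case_tags (suite_id, tag);
    """)
    conn.executemany(
        "INSERT INTO test_case_tags VALUES (?, ?, ?)",
        [("factual_qa", "c1", "geography"), ("factual_qa", "c2", "history"),
         ("factual_qa", "c3", "geography"), ("summaries", "c1", "geography")],
    )
    return conn


def cases_with_tag(conn: sqlite3.Connection, suite_id: str, tag: str) -> list:
    """Tag lookup scoped to a single suite."""
    rows = conn.execute(
        "SELECT case_id FROM test_case_tags "
        "WHERE suite_id = ? AND tag = ? ORDER BY case_id",
        (suite_id, tag),
    )
    return [row[0] for row in rows]


def query_plan(conn: sqlite3.Connection, suite_id: str, tag: str) -> str:
    """Return SQLite's plan text so the index usage can be inspected."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT case_id FROM test_case_tags "
        "WHERE suite_id = ? AND tag = ?",
        (suite_id, tag),
    ).fetchall()
    return " ".join(row[3] for row in rows)
```

Scoping every tag query and write by suite is what keeps a bulk-tag operation on one suite from touching rows that another suite's CI run needs.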
EXPERT-only · cost model

Read the FinOps story, not just the latency one.

Module 06 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 10k evals/mo load, real Anthropic + OpenAI + AWS list prices, with the model-cascade and reserved-instance levers wired up. The version you’ll defend to a CFO. Preview the CSV →

Component | Baseline / mo | Optimized / mo | Delta
Anthropic Claude Sonnet 4.6 (judge): 100% baseline → 30% optimized · 15M in / 3M out tok/mo | $90 | $27 | −$63
Anthropic Claude Haiku 4.5 (triage): 70% of mix in optimized · ~12M in / 2M out tok/mo | $0 | $17 |
OpenAI GPT-4o (adversarial / disagreement): ~15% of mix · ~2M in / 0.5M out tok/mo | $0 | $12 |
OpenAI text-embedding-3-small (M05 RAG): ~5M tokens / mo · cached at 75% hit rate | $4 | $1 | −$3
AWS RDS Postgres (db.t4g.medium): 100GB gp3 · eval store: runs / cases / suites / tags | $50 | $35 | −$15
AWS ElastiCache Redis + GitHub Actions: cache.t4g.micro + ~300 PR runs × 4 min | $25 | $21 | −$4
Total · 10k evals/mo: ~$0.017 per eval at baseline · ~$0.011 optimized | $169 | $113 | −$56 (−33%)
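The totals row can be reproduced from the component rows with a few lines of arithmetic. The dictionary keys below are shorthand labels invented for this sketch; the dollar figures and the 10k evals/mo load are taken from the table.

```python
# (baseline $/mo, optimized $/mo) per component, copied from the table above
ROWS = {
    "sonnet_judge": (90, 27),
    "haiku_triage": (0, 17),
    "gpt4o_adversarial": (0, 12),
    "embeddings": (4, 1),
    "rds_postgres": (50, 35),
    "redis_and_ci": (25, 21),
}
EVALS_PER_MONTH = 10_000

baseline_total = sum(base for base, _ in ROWS.values())
optimized_total = sum(opt for _, opt in ROWS.values())
delta = optimized_total - baseline_total
delta_pct = 100 * delta / baseline_total

per_eval_baseline = baseline_total / EVALS_PER_MONTH
per_eval_optimized = optimized_total / EVALS_PER_MONTH

print(f"baseline ${baseline_total}/mo, optimized ${optimized_total}/mo")
print(f"delta ${delta} ({delta_pct:.0f}%)")
print(f"per eval: ${per_eval_baseline:.3f} -> ${per_eval_optimized:.3f}")
```

Swapping in your own monthly volume or vendor prices is a matter of editing `ROWS` and `EVALS_PER_MONTH`.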

Optimization levers

Model cascade (Haiku → Sonnet → GPT-4o)
Route to Haiku first ($0.80/M in). Skip Sonnet when Haiku and a baseline agree within 0.1. Only invoke GPT-4o on adversarial cases or > 0.2 disagreement. ADR-003.
−$56 / mo · −33%
Embedding cache + judge response cache
SHA-256 cache on (case, prompt) for T=0.0 judge calls. Embedding cache by content hash — 75% hit rate on stable gold sets, 30-day TTL.
−$11 / mo · grows with suite stability
RDS + ElastiCache 1-yr reserved
Commit to 12-month reserved capacity once load is stable for 30 days. ~30% off RDS, ~26% off ElastiCache. Break-even at month 4.
−$19 / mo · −29% on store cost
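The judge response cache lever can be illustrated with a content-addressed key. This is a sketch with assumptions: the page specifies SHA-256 over (case, prompt) at T=0.0; including the model name in the key, the `JudgeCache` class, and the in-memory dict (standing in for Redis) are additions made for the example.

```python
import hashlib
import json
from typing import Callable, Dict


def judge_cache_key(case_id: str, prompt: str, model: str,
                    temperature: float = 0.0) -> str:
    """Stable SHA-256 key; sort_keys keeps the JSON encoding deterministic."""
    payload = json.dumps(
        {"case": case_id, "prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """In-memory stand-in for a Redis-backed judge response cache."""

    def __init__(self) -> None:
        self._store: Dict[str, float] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, case_id: str, prompt: str, model: str,
                    judge: Callable[[str], float]) -> float:
        key = judge_cache_key(case_id, prompt, model)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = judge(prompt)  # only pay for the call on a miss
        return self._store[key]
```

Caching is only safe because the judge runs at temperature 0.0; with sampling enabled, a cached score would hide real run-to-run variance.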
EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your judge-calibration plan. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a calibration plan.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira R.
Ex-staff · LLM platform · top-3 cloud
Judge calibration, multi-judge consensus design, gold-set protocol, eval-as-canary patterns
Send the diff. I'll go line-by-line through your judge weights and your agreement metric and pick out the tier-A noise that's leaking into the gate.
Daniel K.
Principal · RAG platform · enterprise SaaS
RAGAS deployment, retrieval gating, recall@k vs nDCG tradeoffs, hallucination attribution flows
Send your worst regression. We'll walk it backwards from the failed gate to whether retrieval or generation broke first.
Anya S.
Eng manager · ML infra · public Series-D
Org design for eval teams, hiring rubrics, staff-engineer interview prep, ADR review
If you're prepping for staff promo, send your ADR draft. We'll work backwards from the rubric.
Format: Async
Turnaround: 7 days
Cohort: 12 members
Scope: ADR + arch review
Request a slot
What your tier unlocks

PRO unlocks Module 01. EXPERT unlocks the full platform.

PRO is the entry point — Module 01 (the eval engine) plus the rest of the PRO catalog. EXPERT unlocks Modules 02-07 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

What you get | FREE | PRO | EXPERT
Module 01 of P08: Evaluation Framework engine (~2h) | - | Included | Included
Modules 02-07 of P08: Test mgmt / multi-judge / CI / RAG / production / HITL (~15h) | - | - | Included
5 committed ADRs + cost-model CSV: Starter kit docs/adr/ + docs/cost-model/ | - | - | Included
PRO project catalog: Production-grade builds | 2 | All current | All current + this one
Curriculum: All 7 tracks | Phase 1 only | All | All + bonus modules
Code review: Senior+ reviewers | - | 4 / month | Unlimited
Cohort-beta architecture review: Async · 7-day turnaround · 12-member cap | - | - | Included
Certificate: Verifiable on LinkedIn | - | Yes | Yes + LinkedIn rec
$79/mo
billed monthly · open enrollment · cancel anytime
or annual
$699/yr save 26%
Unlock EXPERT
Who this is for

Pick this if you own the gate, not just a feature.

Staff / principal engineers · LLM platform

You own the regression gate, the judge calibration, and the answer to 'why are we shipping this?' that your VP asks before launch.

Engineering managers · AI

You need a reference architecture for the eval system your CTO will ask about before the AI team gets headcount or a budget for human labelers.

ML platform / infra leads

You absorb LLM eval without absorbing 4 new vendors. Anthropic, OpenAI, Postgres, Redis — tools you already operate. This is the playbook.

Founding engineers · AI startups

Your investors will ask 'how do you know your model is getting better?' before they ask about scale. The 5 ADRs + recall@k gate + cost model is the answer.

FAQ · EXPERT tier

Quick answers.

Module 01 (Evaluation Framework engine) is included with PRO at $29/mo — you ship MetricRegistry + EvaluationEngine + LLMJudge + BatchEvaluator and run them on your own test cases. The rest of the platform — Modules 02-07 (test management, multi-judge cascade, CI gate, RAG eval, production drift, human-in-the-loop), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you the engine; EXPERT gets you the system you'd defend in an architecture review.
Yes — most of the value is in the eval methodology, not the model. The judge cascade in ADR-003 swaps Anthropic for OpenAI in one config flag (the Judge interface is a Protocol). The recall@k gate, the per-suite Postgres design, the regression workflow, and the cost-model CSV all generalize to any LLM provider. The only OpenAI-specific code path is the M05 embedding step, and it's behind a Protocol too.
Not for v1. The cohort beta runs as async review: you submit a diff / ADR / runbook / calibration plan, a staff-level reviewer responds within 7 days with inline comments + a Loom walkthrough. Cohort is capped at 12 members so reviewers can keep the SLA. We'll evaluate adding live 1:1 sessions once the cohort signal is solid.
17-19 hours of focused work across 7 modules. Most learners spread it across 4-6 weeks alongside a day job. Module 01 alone is ~2 hours and gets you a runnable eval engine you can evaluate your own model with.
It's a strong forcing function. Staff LLM-platform interviews lean heavily on system design (judge calibration, drift detection, regression gating, cost) and on having opinions backed by real tradeoffs. The 5 ADRs you'll commit (one Deprecated, with the lock-incident receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have the portfolio piece.
Yes — receipts and a learning-budget letter are downloadable on subscription. Many EXPERT learners are reimbursed under engineering training, AI upskilling, or eval-platform tooling budgets.
Model training. Fine-tuning. Pre-training data curation. This is an evaluation framework — you ship the system that decides whether the model you have is good enough. We don't teach you to make the model better; we teach you to know whether it is.

Ready to ship the system that decides whether AI ships?

Start with PRO ($29/mo) for Module 01 — the Evaluation Framework engine. Or unlock the full 7-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).
