ai-de.net/Projects/P08 · LLM evaluation framework — multi-judge cascade + recall@k gate

EXPERT-tier · PRO unlocks Module 01 · AI & vectors track · P08

Build the evaluation system that decides whether AI ships

Ship a production LLM eval framework with multi-metric scoring, a judge-cascade ensemble (Haiku → Sonnet → GPT-4o), variance-based agreement, recall@k retrieval gating, a GitHub Actions regression gate that blocks merges, and 5 committed ADRs. Module 01 unlocks with PRO; the platform unlocks with EXPERT.

Timeline: 17-19 hours

Difficulty: Senior+

Stack: Pydantic · FastAPI · Anthropic · OpenAI · GitHub Actions · SQLAlchemy

See EXPERT benefits

The LLM-eval system-design portfolio piece for staff AI roles — 5 committed ADRs, a real cost model defending the judge cascade, a regression gate engineers actually merge against, and a gold-set protocol you can defend in a calibration review.

By the end you will have wired

MetricRegistry + EvaluationEngine + LLMJudge running locally on a 100-case sample set
FastAPI test-management CRUD with SQLAlchemy ORM, pagination, and per-suite tag scoping
3-judge cascade ensemble (Haiku triage / Sonnet primary / GPT-4o adversarial) with weighted-average consensus
GitHub Actions CI workflow that runs the eval suite on PR, posts a diff comment, and blocks merge on regression
5 ADRs (one Deprecated) committed alongside the code, plus a runnable cost-model CSV
RAGAS + hallucination scaffolding, online-eval drift detector, and human-annotation workflow primitives

PREREQ · SENIOR+ · Built for engineers shipping LLMs in production. Comfortable with Python services, async / concurrency, and at least one of: retrieval, vendor LLM APIs, or CI/CD pipelines. Not a “what is an eval” course.

llm_eval.platform · 7 modules · gold-set armed · tier-A 1k · tier-B 400 gold · regression gate ✓

Test inputs
gold/*: tier-B · 400 labeled · κ ≥ 0.75
tier_a/*: triage · 1k single + 10% spot-check
live_traffic: implicit + explicit feedback
adversarial/*: red-team injections
Labeling protocol — see ADR-001

Eval engine
MetricRegistry: accuracy · F1 · BLEU · BERTScore
JudgeCascade: Haiku → Sonnet → GPT-4o
consensus.weighted_avg: variance-based agreement
RAGAS pipeline: faithfulness · groundedness
Judge cascade — see ADR-003

Storage
SQLite (demo): eval.db · runs + results
Postgres + per-suite: production · ADR-005 reversal
Redis: judge response cache · 12% hit
JSONL gold store: tier-B labeled cases
Per-suite isolation — see ADR-005 (Deprecated)

Surfaces
.github/workflows: eval-on-PR · gates
PR comment bot: regression diff per metric
Drift detector: z-score on rolling window
Annotation queue: Krippendorff α tracking
CI gate logic — see quality-gates.py

# Judge cascade — 51% cost cut

Haiku triage runs first ($0.80/M in)

Sonnet primary skipped when Haiku and a baseline judge agree within 0.1

GPT-4o fires only on adversarial cases or > 0.2 disagreement

→ ~$0.011 per evaluation at optimized load
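
The routing rules above can be sketched as one small function. This is a minimal sketch under stated assumptions, not the starter kit's router: `run_cascade`, the judge callables, and the reading that the "> 0.2 disagreement" is the gap between the Haiku and Sonnet scores are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical judge signature: takes a test case, returns a score in [0, 1].
Judge = Callable[[str], float]


def run_cascade(
    case: str,
    haiku: Judge,
    sonnet: Judge,
    gpt4o: Judge,
    baseline: Judge,
    adversarial: bool = False,
    skip_tolerance: float = 0.1,
    escalate_gap: float = 0.2,
) -> Dict[str, float]:
    """Return the scores of the judges that actually ran for this case."""
    scores = {"haiku": haiku(case)}  # triage always runs first

    # Skip Sonnet when Haiku and the baseline judge agree within tolerance.
    if abs(scores["haiku"] - baseline(case)) > skip_tolerance:
        scores["sonnet"] = sonnet(case)

    # Assumption: disagreement is the Haiku-vs-Sonnet gap; 0 if Sonnet was skipped.
    gap = abs(scores["sonnet"] - scores["haiku"]) if "sonnet" in scores else 0.0
    if adversarial or gap > escalate_gap:
        scores["gpt4o"] = gpt4o(case)
    return scores
```

Returning only the judges that ran keeps the cost accounting honest: the per-evaluation cost depends on how often the second and third entries appear.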

# Regression gate — blocks merge

PR triggers eval suite on labeled gold + tier-A samples

recall@5 floor 0.85 + max 3pp regression = error

MRR + per-tag breakdown posted as PR comment (warning only)

→ check_quality_gates.py exit 1 on error severity
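
The gate logic above fits in a few lines. The sketch below is illustrative only and is not the contents of check_quality_gates.py: the function names, the metric dictionary keys, and the message formats are assumptions. The thresholds (floor 0.85, 3pp cap, MRR as warning only) come from the description above.

```python
from typing import Dict, List, Tuple

RECALL_FLOOR = 0.85        # recall@5 floor from the gate description
MAX_REGRESSION_PP = 3.0    # max allowed drop vs baseline, in percentage points

Finding = Tuple[str, str]  # (severity, message)


def check_gates(current: Dict[str, float], baseline: Dict[str, float]) -> List[Finding]:
    """Compare a PR's metrics with the baseline run and collect findings."""
    findings: List[Finding] = []
    recall = current["recall@5"]
    if recall < RECALL_FLOOR:
        findings.append(("error", f"recall@5 {recall:.3f} is below floor {RECALL_FLOOR}"))
    drop_pp = (baseline["recall@5"] - recall) * 100
    if drop_pp > MAX_REGRESSION_PP:
        findings.append(("error", f"recall@5 regressed {drop_pp:.1f}pp vs baseline"))
    # MRR is diagnostic only: it can warn but never fail the gate.
    if current["mrr"] < baseline["mrr"]:
        findings.append(("warning", f"MRR dropped {baseline['mrr'] - current['mrr']:.3f}"))
    return findings


def exit_code(findings: List[Finding]) -> int:
    """Exit 1 only when at least one finding has error severity."""
    return 1 if any(severity == "error" for severity, _ in findings) else 0
```

A CI step would pass `exit_code(...)` to `sys.exit`, so a non-zero status fails the check that branch protection requires.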

100 sample test cases shipped

5 ADRs committed in starter kit

−51% judge cost vs Sonnet-only baseline

Curriculum · 7 modules · 17-19 hours · 3 phases

Module 01 unlocks with PRO. The full platform with EXPERT.

Module 01 (~2h) ships a complete eval engine — MetricRegistry, EvaluationEngine, LLMJudge, BatchEvaluator with SQLite/JSON storage. Included with PRO; bring your own test cases. Modules 02-07 (~15h additional) layer on the test-management API, multi-judge cascade, CI/CD regression gate, RAG eval, online drift detection, and human-annotation workflow. Unlocks with EXPERT.

P08 · 7 modules · 17-19 hours · 60 lessons

Free preview EXPERT required

M01 · Evaluation Framework

MetricRegistry pattern (Protocol-based plugins), built-in metrics (accuracy, F1, BLEU, BERTScore), LLMJudge wrapping the Anthropic SDK at T=0.0, BatchEvaluator with ThreadPoolExecutor + retry, JSON + SQLite storage backends, and a runnable demo with a deterministic MockLLMJudge. The honest baseline before anything else exists.

Phase 1 · 2h · 8 lessons · PRO TIER
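
To make "Protocol-based plugins" concrete, here is a minimal sketch of what such a registry could look like. It is an assumption about the shape, not the course's MetricRegistry: the `Metric` protocol, the `register` / `evaluate` method names, and the two toy metrics are hypothetical.

```python
from collections import Counter
from typing import Dict, Protocol, runtime_checkable


@runtime_checkable
class Metric(Protocol):
    """Anything with a `name` and a `score()` method counts as a metric."""
    name: str

    def score(self, prediction: str, reference: str) -> float: ...


class ExactMatch:
    name = "accuracy"

    def score(self, prediction: str, reference: str) -> float:
        return float(prediction.strip() == reference.strip())


class TokenF1:
    name = "f1"

    def score(self, prediction: str, reference: str) -> float:
        pred, ref = prediction.split(), reference.split()
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)


class MetricRegistry:
    def __init__(self) -> None:
        self._metrics: Dict[str, Metric] = {}

    def register(self, metric: Metric) -> None:
        # Structural check: no base class needed, only the right attributes.
        if not isinstance(metric, Metric):
            raise TypeError("metric must expose `name` and `score()`")
        if metric.name in self._metrics:
            raise ValueError(f"metric {metric.name!r} already registered")
        self._metrics[metric.name] = metric

    def evaluate(self, prediction: str, reference: str) -> Dict[str, float]:
        return {name: m.score(prediction, reference) for name, m in self._metrics.items()}
```

Because the check is structural, a third-party metric plugs in without importing anything from the framework.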

M02 · Test Management System

Pydantic schemas (TestCaseBase / Create / Response / Filter / BulkCreate), SQLAlchemy ORM with per-suite tag tables (post-ADR-005 reversal), FastAPI CRUD with pagination + filtering + bulk operations, JSON / JSONL / CSV import paths, and version tracking on every update.

Phase 2 · 2h · 9 lessons · EXPERT TIER

M03 · Multi-Judge Evaluation

Three-judge cascade router (Haiku triage → Sonnet primary → GPT-4o adversarial), six consensus strategies (weighted avg / median / majority / unanimous / highest / lowest), variance-based agreement metric (Kappa scaffolded for extension), and graceful degradation when judges fail.

Phase 2 · 2h · 8 lessons · EXPERT TIER
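
The six consensus strategies and the variance-based agreement metric named in M03 can be sketched as follows. Everything here is an assumption for illustration: the strategy strings, the 0.5 pass threshold used by the vote-style strategies, and the normalization of agreement as one minus the population variance divided by 0.25 (the maximum variance of scores bounded in [0, 1]).

```python
from statistics import median, pvariance
from typing import Dict, Optional, Sequence


def consensus(
    scores: Dict[str, float],
    strategy: str = "weighted_average",
    weights: Optional[Dict[str, float]] = None,
    pass_threshold: float = 0.5,
) -> float:
    """Combine per-judge scores in [0, 1] into one consensus score."""
    if not scores:
        raise ValueError("no judge scores to combine")
    values = list(scores.values())
    if strategy == "weighted_average":
        w = {judge: (weights or {}).get(judge, 1.0) for judge in scores}
        return sum(scores[j] * w[j] for j in scores) / sum(w.values())
    if strategy == "median":
        return median(values)
    if strategy in ("majority", "unanimous"):
        passes = sum(v >= pass_threshold for v in values)
        needed = len(values) if strategy == "unanimous" else len(values) // 2 + 1
        return 1.0 if passes >= needed else 0.0
    if strategy == "highest":
        return max(values)
    if strategy == "lowest":
        return min(values)
    raise ValueError(f"unknown strategy: {strategy}")


def variance_agreement(values: Sequence[float]) -> float:
    """1.0 when all judges agree, 0.0 at the maximum spread for [0, 1] scores."""
    if len(values) < 2:
        return 1.0
    return 1.0 - pvariance(values) / 0.25
```

The vote-style strategies collapse to 0 or 1, which is why a gate that thresholds on a continuous score tends to prefer the weighted average.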

M04 · CI/CD Integration

.github/workflows/llm-eval.yml triggering on PR + push, baseline artifact download, quality-gate runner (recall@5 floor + 3pp regression cap + per-tag breakdown), PR comment bot via gh-script, merge blocking via exit 1 on error severity.

Phase 2 · 2h · 9 lessons · EXPERT TIER

M05 · RAG & Hallucination Evaluation

RAGAS pipeline scaffolding (faithfulness / groundedness / context precision / context recall), retrieval metric module (recall@k + MRR per ADR-002), hallucination detector with attribution, failure taxonomy categorizer, and adversarial test generator with template-based negative examples.

Phase 3 · 2.5h · 9 lessons · EXPERT TIER
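
For reference, the two retrieval metrics named in M05 are short to write down. This is a generic sketch of the standard definitions, not the course's retrieval module; the function names and the document-ID representation are assumptions.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[str]], golds: Sequence[Set[str]]) -> float:
    """MRR is the average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in zip(runs, golds)) / len(runs)
```

Both metrics need only binary relevance labels, which matches the reasoning ADR-002 gives for rejecting nDCG.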

M06 · Production Evaluation Platform

Online evaluation harness (Redis-stream consumer, batch windowing), drift detector (z-score on rolling window over consensus + per-judge), business-metric alignment scaffolding, ship/no-ship decision framework keyed off the cost-model CSV, dataset versioning by content hash.

Phase 3 · 3h · 9 lessons · EXPERT TIER
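
A z-score drift check on a rolling window, as described in M06, can be sketched like this. The class name, the window size, the threshold of 3.0, and the warm-up rule are assumptions chosen for illustration; the course's detector may differ.

```python
from collections import deque
from statistics import mean, pstdev


class DriftDetector:
    """Flags a score whose z-score against the rolling window exceeds a threshold."""

    def __init__(self, window: int = 50, z_threshold: float = 3.0, min_samples: int = 10) -> None:
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, score: float) -> bool:
        """Record a score and return True if it drifted from the window so far."""
        drifted = False
        if len(self.scores) >= self.min_samples:
            mu, sigma = mean(self.scores), pstdev(self.scores)
            if sigma > 0:  # a flat window has no scale to measure against
                drifted = abs(score - mu) / sigma > self.z_threshold
        self.scores.append(score)
        return drifted
```

Running one detector on the consensus score and one per judge, as the module description suggests, separates a real quality drop from a single judge changing behavior.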

M07 · Human-in-the-Loop Evaluation

Annotation task schema, queue + assignment + adjudication flow, tier-A/tier-B routing per ADR-001, Krippendorff α tracker, gold-dataset promotion criteria, judge calibration loop driven by human-adjudicated tier-B disagreements, prioritization for high-uncertainty cases.

Phase 3 · 3.5h · 8 lessons · EXPERT TIER
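
For readers who have not computed Krippendorff α by hand, here is a sketch of the nominal-data version using the standard coincidence-matrix formulation. It is a generic illustration, not the course's tracker; the function name and the input shape (one sequence of labels per case, with None for a missing label) are assumptions.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Optional[Hashable]]]) -> float:
    """Each unit is the list of labels its annotators gave; None = not labeled."""
    coincidences: Counter = Counter()
    for ratings in units:
        labels = [r for r in ratings if r is not None]
        if len(labels) < 2:
            continue  # units with fewer than two labels carry no agreement signal
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (len(labels) - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one unit with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one label in use; treated as full agreement in this sketch
    return 1.0 - observed / expected
```

With two annotators labeling four cases as (a, a), (a, a), (b, b), (a, b), the function gives α ≈ 0.53, which would fall short of the 0.75 bar ADR-001 sets for tier-B.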

Module 01 with PRO ($29/mo) · Modules 02-07 with EXPERT ($79/mo)

See plans →

Backed by curriculum

LLM Evaluation

5 modules · 14 hours · LLM-judge · RAGAS · Recall@k · Calibration

Open curriculum

This curriculum is the foundation for the project — it’s not a sales add-on. EXPERT subscribers get full access to all modules.

The build, in 3 phases

Foundation. Platform. Production.

Each phase ends with a tagged release, a passing CI gate, and an agreement-metric review. No ambiguity about where you are.

01 · ~2h

Foundation (Module 01)

Eval engine running locally. MetricRegistry + EvaluationEngine + LLMJudge + BatchEvaluator on a 100-case sample set with SQLite + JSON storage.

✓ Working `python -m llm_eval.demo` with MockLLMJudge
✓ MetricRegistry with 4 built-in metrics
✓ BatchEvaluator with retry + rate limiting
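
The batch-plus-retry pattern in the last item can be sketched with the standard library alone. This is an illustrative sketch, not the course's BatchEvaluator: `with_retry`, `evaluate_batch`, and the exponential backoff schedule are assumptions, and real rate limiting is not modeled beyond capping concurrency with `max_workers`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

Case = TypeVar("Case")
Result = TypeVar("Result")


def with_retry(fn: Callable[[Case], Result], case: Case, retries: int, base_delay: float) -> Result:
    """Call fn(case), retrying with exponential backoff on any exception."""
    for attempt in range(retries + 1):
        try:
            return fn(case)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


def evaluate_batch(
    cases: Sequence[Case],
    score_fn: Callable[[Case], Result],
    max_workers: int = 4,
    retries: int = 3,
    base_delay: float = 0.5,
) -> List[Result]:
    """Score all cases concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: with_retry(score_fn, c, retries, base_delay), cases))
```

Threads fit here because judge calls are network-bound; `pool.map` preserves input order, so results line up with the test cases without extra bookkeeping.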

02 · ~6h

Platform (Modules 02-04)

Test-management API + multi-judge cascade + CI regression gate. PR-triggered eval that blocks merge when retrieval recall@5 drops more than 3pp.

✓ FastAPI CRUD + per-suite tag scoping (ADR-005)
✓ 3-judge cascade with weighted-average consensus
✓ .github/workflows/llm-eval.yml + PR comment bot + merge gate

03 · ~9h

Production (Modules 05-07)

RAG eval + online drift detection + human-annotation loop. Faithfulness scoring on real RAG traces, drift alerts on consensus regressions, gold-set growth via tier-B adjudication.

✓ RAGAS scaffolding + recall@k gate (ADR-002)
✓ Online-eval Redis consumer + drift detector
✓ Annotation queue + Krippendorff α + gold-set promotion

Project setup · 10 minutes

One command. Local FastAPI + SQLite + MockLLMJudge — no API key.

What lives in the repo

You get the real eval engine on day one — MetricRegistry, EvaluationEngine, LLMJudge, BatchEvaluator, plus the M02 FastAPI CRUD skeleton, M03 multi-judge router, M04 GitHub Actions workflow + quality-gate runner. Modules 05-07 ship as scaffolded source files you complete in the tutorial.

llm_eval/core/ — MetricRegistry, EvaluationEngine, judges, BatchEvaluator
api/ — FastAPI test-management CRUD with per-suite tag tables
llm_eval/multi_judge/ — 3-judge cascade router + 6 consensus strategies
.github/workflows/ + scripts/ — CI eval workflow + check_quality_gates.py + PR comment bot
src/eval/ — RAGAS / online-eval / drift / annotation scaffolding (M05-M07)
docs/adr/ + docs/cost-model/ — 5 committed ADRs (one Deprecated) + the runnable cost-model CSV

Download · Starter Kit · 70 files · 91 KB

LLM Evaluation Framework Starter Kit

Pre-built eval engine + FastAPI CRUD + multi-judge cascade + GitHub Actions workflow. Now bundled: 5 ADR markdown files (docs/adr/) and the runnable cost-model CSV (docs/cost-model/) — unzip and read them straight from the repo.

EXPERT project · 70 files · ADRs + cost model bundled · last updated 2026-05-08

~/projects/llm-evaluation-framework — zsh

1. Unzip and run the offline demo (no API key)

$ unzip llm-evaluation-framework-starter.zip

$ cd llm-evaluation-framework-starter

$ python -m llm_eval.demo

2. Inspect a finished eval run

$ sqlite3 eval.db \
    "SELECT run_id, num_test_cases, aggregate_scores FROM evaluation_runs;"

3. Run the multi-judge cascade locally

$ export ANTHROPIC_API_KEY=...

$ python -m llm_eval.multi_judge.evaluator \
    --suite gold/factual_qa.jsonl --judges haiku,sonnet

4. Open the cost model + read ADR-001

$ open docs/cost-model/llm-eval-cost-model.csv

$ less docs/adr/001-labeling-protocol.md

100 sample test cases

tier-B gold examples

historical eval runs

200 feedback events

Production hardening

The same eval demo — but built for the regression-gating case.

Most LLM-eval tutorials show you a notebook scoring 50 hand-picked prompts. This shows what changes when CI runs on every PR, three judges disagree on 15% of cases, and the cost model has to defend itself to a CFO.

Notebook eval · What most teams ship

Test cases: 50 prompts hand-picked, no agreement metric
Judge: One LLM call, hope it's right
Storage: Pickle file or notebook output cell
CI integration: None — re-run the notebook on demand
Cost: Whatever the bill says next month
Drift detection: Notice when the team complains

Your eval platform · Modules 02–07

✓ Test cases: Tier-A 1k single-annotator + Tier-B 400 dual-annotator gold with Krippendorff α tracking (ADR-001)
✓ Judge: 3-judge cascade Haiku → Sonnet → GPT-4o; weighted-avg consensus, variance agreement (ADR-003 + ADR-004)
✓ Storage: SQLite demo · per-suite Postgres tables in production (ADR-005 reversal)
✓ CI integration: llm-eval.yml on every PR; recall@5 + MRR gate; merge block on regression (ADR-002)
✓ Cost: Judge cascade −51% vs Sonnet-only baseline; reserved RDS + ElastiCache; CSV in docs/cost-model/
✓ Drift detection: Online-eval Redis consumer + z-score drift on rolling window; Grafana panel + alert routing

EXPERT-only · architecture decision records

Write the ADRs staff engineers actually get judged on.

Five ADRs ship inside the starter-kit zip at docs/adr/, one per major decision in the build, including a real Deprecated ADR documenting the v0 → per-suite migration after the bulk-tag deadlock incident. The kind of doc that travels with you to your next role. Preview ADR-001 →

ADR-001 · Accepted

Tier-A/Tier-B labeling protocol for the gold set

Context: Need ≥1k cases at κ ≥ 0.7 quality on a 10 labeler-day budget; uniform dual-labeling fits neither
Decision: Tier-A 60% single-annotator + 10% spot-check; Tier-B 40% dual-annotator + adjudication on Krippendorff α ≥ 0.75
Tradeoff: 5% noise floor on Tier-A; we disclose it on every report
Reversal: Ramp Tier-A → dual-annotator if team > 8 labelers (~2 engineer-weeks)

ADR-002 · Accepted

recall@k + MRR for retrieval gating; nDCG rejected

Context: PR comment must read in one number; nDCG would double Tier-B labeling cost (graded relevance)
Decision: recall@5 gates CI (floor 0.85 + max 3pp regression); MRR as the diagnostic, never on the gate
Tradeoff: Blind to graded-relevance differences in ~5% of cases; readability wins
Reversal: Add nDCG@10 to MetricRegistry (~3 days) when graded use case lands

ADR-003 · Accepted

Claude Sonnet default judge; GPT-4o adversarial-only

Context: Judge cost dominates the eval bill; vendor lock and idle-GPU cost both unattractive at v1 load
Decision: 3-judge cascade — Haiku triage (.20 weight) → Sonnet primary (.50) → GPT-4o (.30, only on disagreement > 0.2 or adversarial)
Tradeoff: Variable per-eval cost; we forecast against the 85th percentile
Reversal: Self-hosted Llama 3.1 70B at ~80M tok/mo crossover; ~3 engineer-days

ADR-004 · Accepted

Multi-judge consensus is weighted average, not majority vote

Context: CI gate thresholds on a continuous score; binary vote loses signal; per-judge calibration is real
Decision: WEIGHTED_AVERAGE default with judge-specific weights from ADR-003; MEDIAN fallback when any judge fails
Tradeoff: Variance-based agreement, not Cohen's/Fleiss' κ — promote to Krippendorff α (~3 days) when needed
Reversal: Per-task-type strategy routing (~2 days); adaptive weight recalibration (~1 week)
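
The ADR-004 decision, weighted average by default with a median fallback when a judge fails, might look like the sketch below. The weights are the ones listed in ADR-003; the function name, the use of None to mark a failed judge, and the renormalization over judges the cascade skipped are assumptions.

```python
from statistics import median
from typing import Dict, Optional

# Weights as listed in ADR-003.
JUDGE_WEIGHTS = {"haiku": 0.20, "sonnet": 0.50, "gpt4o": 0.30}


def cascade_consensus(scores: Dict[str, Optional[float]]) -> float:
    """scores maps judge -> score; None marks a judge that was called but failed.

    Judges the cascade never invoked are simply absent from the dict.
    """
    ok = {judge: s for judge, s in scores.items() if s is not None}
    if not ok:
        raise ValueError("no judge produced a score")
    if len(ok) < len(scores):
        return median(ok.values())  # MEDIAN fallback when any judge failed
    total = sum(JUDGE_WEIGHTS[judge] for judge in ok)  # renormalize over judges that ran
    return sum(JUDGE_WEIGHTS[judge] * s for judge, s in ok.items()) / total
```

Falling back to the median avoids letting a missing high-weight judge silently shift the score toward whichever judges happened to respond.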

ADR-005 · Deprecated

Single shared test_cases table for all suites (v0)

Context: Day-3 MVP: one table + global tag dictionary; suite_id as FK only
Decision: Reverted in M02 — moved to per-suite test_case_tags tables with (suite_id, tag) compound indexes
Why reversed: Bulk-tag at 1.4k cases held a 9-min row-lock; CI eval queued behind it for 11 min on 2026-04-22
Replaced by: Per-suite tag isolation; ~3.5 engineer-day reversal cost
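
A rough picture of suite-scoped tags with a (suite_id, tag) compound index, using in-memory SQLite as a stand-in for the production Postgres store. The column names, the index name, and modeling the isolation as one tags table keyed by suite_id are assumptions; the actual schema is in the ADR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE test_case_tags (
        suite_id TEXT NOT NULL,
        case_id  TEXT NOT NULL,
        tag      TEXT NOT NULL
    );
    -- Compound index so tag lookups stay scoped to one suite.
    CREATE INDEX ix_tags_suite_tag ON test_case_tags (suite_id, tag);
    """
)
conn.executemany(
    "INSERT INTO test_case_tags VALUES (?, ?, ?)",
    [
        ("factual_qa", "c1", "dates"),
        ("factual_qa", "c2", "dates"),
        ("factual_qa", "c3", "units"),
        ("rag", "c9", "dates"),
    ],
)


def cases_with_tag(suite_id: str, tag: str) -> list:
    """Return case IDs in one suite carrying a tag; other suites are never touched."""
    rows = conn.execute(
        "SELECT case_id FROM test_case_tags WHERE suite_id = ? AND tag = ? ORDER BY case_id",
        (suite_id, tag),
    )
    return [row[0] for row in rows]
```

The point of the scoping is that a bulk-tag operation on one suite no longer contends with reads for another, which is the failure the deprecated design ran into.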

EXPERT-only · cost model

Read the FinOps story, not just the latency one.

Module 06 ships a runnable cost-model CSV inside the starter-kit zip at docs/cost-model/. 10k evals/mo load, real Anthropic + OpenAI + AWS list prices, with the model-cascade and reserved-instance levers wired up. The version you’ll defend to a CFO. Preview the CSV →

Component | Baseline / mo | Optimized / mo | Delta
Anthropic Claude Sonnet 4.6 (judge) · 100% baseline → 30% optimized · 15M in / 3M out tok/mo | $90 | $27 | −$63
Anthropic Claude Haiku 4.5 (triage) · 70% of mix in optimized · ~12M in / 2M out tok/mo | | $17 | —
OpenAI GPT-4o (adversarial / disagreement) · ~15% of mix · ~2M in / 0.5M out tok/mo | | $12 | —
OpenAI text-embedding-3-small (M05 RAG) · ~5M tokens / mo · cached at 75% hit rate | | | −$3
AWS RDS Postgres (db.t4g.medium) · 100GB gp3 · eval store: runs / cases / suites / tags | $50 | $35 | −$15
AWS ElastiCache Redis + GitHub Actions · cache.t4g.micro + ~300 PR runs × 4 min | $25 | $21 | −$4
Total · 10k evals/mo · ~$0.017 per eval at baseline · ~$0.011 optimized | $169 | $113 | −$56 (−33%)
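
The per-eval figures and the percentage in the totals row follow directly from the monthly totals and the 10k evals/mo load. A small worked check, with hypothetical helper names:

```python
EVALS_PER_MONTH = 10_000

# Monthly totals from the table's Total row (USD).
BASELINE_TOTAL = 169
OPTIMIZED_TOTAL = 113


def cost_per_eval(monthly_total: float, evals: int = EVALS_PER_MONTH) -> float:
    """Average cost of one evaluation at the given monthly spend."""
    return monthly_total / evals


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


print(f"baseline:  ${cost_per_eval(BASELINE_TOTAL):.3f} per eval")    # $0.017
print(f"optimized: ${cost_per_eval(OPTIMIZED_TOTAL):.3f} per eval")   # $0.011
print(f"delta:     {percent_change(BASELINE_TOTAL, OPTIMIZED_TOTAL):.1f}%")  # -33.1%
```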

Optimization levers

Model cascade (Haiku → Sonnet → GPT-4o)

Route to Haiku first ($0.80/M in). Skip Sonnet when Haiku and a baseline agree within 0.1. Only invoke GPT-4o on adversarial cases or > 0.2 disagreement. ADR-003.

−$56 / mo · −33%

Embedding cache + judge response cache

SHA-256 cache on (case, prompt) for T=0.0 judge calls. Embedding cache by content hash — 75% hit rate on stable gold sets, 30-day TTL.

−$11 / mo · grows with suite stability

RDS + ElastiCache 1-yr reserved

Commit to 12-month reserved capacity once load is stable for 30 days. ~30% off RDS, ~26% off ElastiCache. Break-even at month 4.

−$19 / mo · −29% on store cost
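
The judge response cache lever depends on deterministic keys: at T=0.0 the same (case, prompt) pair should map to the same cached score. A minimal sketch, using an in-memory dict as a stand-in for Redis; the function and class names and the JSON canonicalization are assumptions, and the 30-day TTL is not modeled.

```python
import hashlib
import json
from typing import Callable, Dict


def judge_cache_key(case: dict, prompt: str) -> str:
    """SHA-256 over a canonical JSON encoding of (case, prompt)."""
    payload = json.dumps({"case": case, "prompt": prompt}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """In-memory stand-in for the Redis-backed cache (no TTL handling)."""

    def __init__(self) -> None:
        self._store: Dict[str, float] = {}

    def get_or_score(self, case: dict, prompt: str, score_fn: Callable[[dict, str], float]) -> float:
        key = judge_cache_key(case, prompt)
        if key not in self._store:
            self._store[key] = score_fn(case, prompt)  # only pay for the judge on a miss
        return self._store[key]
```

Sorting keys before hashing means two dicts with the same content but different key order hit the same entry.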

EXPERT benefit · cohort beta

Async architecture review with a staff-level reviewer (cohort beta).

Submit your repo, your ADR draft, or your judge-calibration plan. A staff or principal-level reviewer who has shipped this exact stack responds within 7 days with line-by-line comments. Cohort capped at 12 members.

Bring a diff, an ADR draft, or a calibration plan.

The cohort beta runs as async architecture review — pick a reviewer by topic, send the artifact, get inline comments + a Loom walkthrough back. No back-and-forth scheduling. No 30-minute slot pressure.

Mira R.

Ex-staff · LLM platform · top-3 cloud

Judge calibration, multi-judge consensus design, gold-set protocol, eval-as-canary patterns

“Send the diff. I'll go line-by-line through your judge weights and your agreement metric and pick out the tier-A noise that's leaking into the gate.”

Daniel K.

Principal · RAG platform · enterprise SaaS

RAGAS deployment, retrieval gating, recall@k vs nDCG tradeoffs, hallucination attribution flows

“Send your worst regression. We'll walk it backwards from the failed gate to whether retrieval or generation broke first.”

Anya S.

Eng manager · ML infra · public Series-D

Org design for eval teams, hiring rubrics, staff-engineer interview prep, ADR review

“If you're prepping for staff promo, send your ADR draft. We'll work backwards from the rubric.”

Format: Async

Turnaround: 7 days

Cohort: 12 members

Scope: ADR + arch review

Request a slot →

What your tier unlocks

PRO unlocks Module 01. EXPERT unlocks the full platform.

PRO is the entry point — Module 01 (the eval engine) plus the rest of the PRO catalog. EXPERT unlocks Modules 02-07 of this build, the 5 ADRs, the cost-model CSV, and the cohort-beta async review.

What you get · FREE · PRO · EXPERT

Module 01 of P08

Evaluation Framework engine (~2h)

—

Included

Modules 02-07 of P08

Test mgmt / multi-judge / CI / RAG / production / HITL (~15h)

—

Included

5 committed ADRs + cost-model CSV

Starter kit docs/adr/ + docs/cost-model/

—

Included

PRO project catalog

Production-grade builds

All current

All current + this one

Curriculum

All 7 tracks

Phase 1 only

All

All + bonus modules

Code review

Senior+ reviewers

—

4 / month

Unlimited

Cohort-beta architecture review

Async · 7-day turnaround · 12-member cap

—

Included

Certificate

Verifiable on LinkedIn

—

Yes

Yes + LinkedIn rec

$79/mo

billed monthly · open enrollment · cancel anytime

or annual

$699/yr save 26%

Unlock EXPERT →

Who this is for

Pick this if you own the gate, not just a feature.

Staff / principal engineers · LLM platform

You own the regression gate, the judge calibration, and the answer to 'why are we shipping this?' that your VP asks before launch.

Engineering managers · AI

You need a reference architecture for the eval system your CTO will ask about before the AI team gets headcount or a budget for human labelers.

ML platform / infra leads

You absorb LLM eval without absorbing 4 new vendors. Anthropic, OpenAI, Postgres, Redis — tools you already operate. This is the playbook.

Founding engineers · AI startups

Your investors will ask 'how do you know your model is getting better?' before they ask about scale. The 5 ADRs + recall@k gate + cost model is the answer.

Related curriculum

Going deeper? Four tracks back this project.

The LLM Evaluation curriculum is the foundation. These four tracks let you go deeper on retrieval, agents, observability, and the LLM system that you're evaluating.

FAQ · EXPERT tier

Quick answers.

How is this different from PRO?

Module 01 (Evaluation Framework engine) is included with PRO at $29/mo — you ship MetricRegistry + EvaluationEngine + LLMJudge + BatchEvaluator and run them on your own test cases. The rest of the platform — Modules 02-07 (test management, multi-judge cascade, CI gate, RAG eval, production drift, human-in-the-loop), the 5 committed ADRs, the runnable cost-model CSV, and the cohort-beta async architecture review — unlocks with EXPERT at $79/mo. PRO gets you the engine; EXPERT gets you the system you'd defend in an architecture review.

Is this still useful if I'm just shipping with OpenAI / a vendor stack?

Yes — most of the value is in the eval methodology, not the model. The judge cascade in ADR-003 swaps Anthropic for OpenAI in one config flag (the Judge interface is a Protocol). The recall@k gate, the per-suite Postgres design, the regression workflow, and the cost-model CSV all generalize to any LLM provider. The only OpenAI-specific code path is the M05 embedding step, and it's behind a Protocol too.

Is the cohort-beta mentor program 1:1 video calls?

Not for v1. The cohort beta runs as async review: you submit a diff / ADR / runbook / calibration plan, a staff-level reviewer responds within 7 days with inline comments + a Loom walkthrough. Cohort is capped at 12 members so reviewers can keep the SLA. We'll evaluate adding live 1:1 sessions once the cohort signal is solid.

How long until I can finish this project?

17-19 hours of focused work across 7 modules. Most learners spread it across 4-6 weeks alongside a day job. Module 01 alone is ~2 hours and gets you a runnable eval engine you can evaluate your own model with.

Is this enough to interview for staff LLM-platform roles?

It's a strong forcing function. Staff LLM-platform interviews lean heavily on system design (judge calibration, drift detection, regression gating, cost) and on having opinions backed by real tradeoffs. The 5 ADRs you'll commit (one Deprecated, with the lock-incident receipts) are exactly the artifacts a panel asks about. Pair with the cohort-beta review on your final repo and you have the portfolio piece.

Can my company expense it?

Yes — receipts and a learning-budget letter are downloadable on subscription. Many EXPERT learners are reimbursed under engineering training, AI upskilling, or eval-platform tooling budgets.

What is NOT in scope?

Model training. Fine-tuning. Pre-training data curation. This is an evaluation framework — you ship the system that decides whether the model you have is good enough. We don't teach you to make the model better; we teach you to know whether it is.

Ready to ship the system that decides whether AI ships?

Start with PRO ($29/mo) for Module 01 — the Evaluation Framework engine. Or unlock the full 7-module platform plus 5 ADRs, the cost-model CSV, and cohort-beta architecture review with EXPERT ($79/mo).

See EXPERT benefits


Going deeper? Four tracks back this project.

RAG Learning Path

Agentic Workflows

MLOps for Data Engineers

Data Observability & Quality
