Context
Multi-judge LLM-as-judge ensembles need a default model. The pick has three knock-on effects:
- Cost. Judges run on every PR, every nightly regression, and every adversarial sweep. At 10k evals/month × 3 judges, judge cost dominates the eval bill (see cost-model CSV in docs/cost-model/).
- Quality. Different models exhibit different judging biases (lenient, strict, length-preferring, refusal-prone). Picking the wrong default bakes that bias into every gate.
- Vendor risk. Anthropic, OpenAI, and Google all change pricing and behaviour on their own schedules. Our judge code shouldn't lock us into one vendor.
Models considered for default:
- Claude Sonnet 4.6: strong calibration on the gold set we tested (κ vs human = 0.78), $3/M in + $15/M out.
- GPT-4o: comparable calibration (κ = 0.76), but ~3× the per-token cost in our 500-in / 100-out shape.
- Claude Haiku 4.5: fast and cheap ($0.80/M in + $4/M out), but κ vs human = 0.61 — below our 0.7 floor for default-judge use.
- Llama-3.1-70B (self-hosted): competitive κ (0.74) at GPU-host cost. But idle GPU cost > Anthropic API cost until ~80M tokens/month.
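The κ figures above compare judge verdicts against human labels on the gold set. This ADR does not say which κ statistic is used, so the sketch below assumes Cohen's κ over categorical labels (for example pass/fail) and applies the 0.7 floor quoted above. The function names are illustrative, not taken from the codebase.

```python
from collections import Counter

KAPPA_FLOOR = 0.7  # default-judge floor stated in this ADR


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Fraction of cases where judge and human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance from each rater's label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def eligible_as_default(kappa: float) -> bool:
    """True when a judge clears the calibration floor for default use."""
    return kappa >= KAPPA_FLOOR
```

With the numbers listed above, Sonnet (0.78), GPT-4o (0.76) and Llama (0.74) clear the floor, while Haiku (0.61) does not, which is why Haiku appears only as a triage filter.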
Decision
Three-judge ensemble as default; one judge per role:
- Judge 1, Claude Sonnet (primary): weight 0.5
- Judge 2, Claude Haiku (triage filter): weight 0.2. Runs first; if its score and Sonnet's agree within 0.2, Judge 3 is skipped.
- Judge 3, GPT-4o (adversarial only): weight 0.3. Invoked only on test cases tagged adversarial or high_stakes, or when Haiku and Sonnet disagree by more than 0.2.
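The weights above have to combine into one score even when Judge 3 is skipped. This ADR does not specify how that case is handled (ADR-004 owns the consensus strategy), so the sketch below assumes the remaining weights are renormalised over the judges that actually ran. `ensemble_score` and the dictionary keys are hypothetical names.

```python
WEIGHTS = {"sonnet": 0.5, "haiku": 0.2, "gpt4o": 0.3}  # weights from this ADR


def ensemble_score(scores: dict[str, float]) -> float:
    """Weighted mean over the judges that actually ran.

    Assumption: when a judge is skipped, the remaining weights are
    renormalised to sum to 1. ADR-004 may define this differently.
    """
    if not scores:
        raise ValueError("at least one judge score is required")
    unknown = set(scores) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"no weight configured for: {sorted(unknown)}")
    total_weight = sum(WEIGHTS[name] for name in scores)
    return sum(WEIGHTS[name] * s for name, s in scores.items()) / total_weight
```

Under this assumption a two-judge result is weighted 5:2 in favour of Sonnet, so skipping GPT-4o does not shrink the score toward zero.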
Cost-savings logic:
```python
# llm_eval/multi_judge/router.py
async def route(case: TestCase) -> list[Judge]:
    judges = [Judge.HAIKU, Judge.SONNET]
    # Tagged cases always get the third judge, so skip the comparison.
    if case.tags.intersection({"adversarial", "high_stakes"}):
        judges.append(Judge.GPT4O)
        return judges
    triage = await Judge.HAIKU.score(case)
    primary = await Judge.SONNET.score(case)
    # score() returns a JudgeResult; a numeric `.score` field is assumed here,
    # since the JudgeResult definition is not shown in this ADR.
    if abs(triage.score - primary.score) > 0.2:
        judges.append(Judge.GPT4O)  # disagreement → escalate
    return judges
```
Result: GPT-4o fires on ~15% of evals (the adversarial tier + organic disagreements), not 100%. Cost of the third judge drops by 85%.
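The escalation rule in the router can also be written as a pure function over already-computed scores, which makes the tag set and threshold easy to unit-test without calling any model. This is a sketch: `needs_third_judge` and its parameter names are hypothetical, while the tags and the 0.2 threshold come from the router above.

```python
ESCALATION_TAGS = frozenset({"adversarial", "high_stakes"})
DISAGREEMENT_THRESHOLD = 0.2


def needs_third_judge(
    tags: set[str],
    triage_score: float,
    primary_score: float,
    threshold: float = DISAGREEMENT_THRESHOLD,
) -> bool:
    """Return True when the third judge should be invoked for this case."""
    # Tagged cases escalate regardless of how the first two judges scored.
    if ESCALATION_TAGS & set(tags):
        return True
    # Otherwise escalate only on a strict disagreement, as in the router.
    return abs(triage_score - primary_score) > threshold
```

Keeping the rule separate from the model calls also lets the threshold be swept offline against logged scores to see how the escalation rate moves.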
The judge interface is a Protocol — Judge.HAIKU / Judge.SONNET / Judge.GPT4O / Judge.LLAMA are interchangeable.
```python
# llm_eval/judges/base.py
from typing import Protocol


class Judge(Protocol):
    async def score(self, case: TestCase) -> JudgeResult: ...
```
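Because `Judge` is a `typing.Protocol`, any class with a matching async `score` method qualifies without inheriting from it. The sketch below illustrates that structural typing. The `TestCase` and `JudgeResult` fields and the `FixedScoreJudge` stub are hypothetical stand-ins, and `@runtime_checkable` is added here only so the `isinstance` check works.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class TestCase:  # hypothetical minimal shape
    prompt: str
    response: str


@dataclass
class JudgeResult:  # hypothetical minimal shape
    score: float
    rationale: str = ""


@runtime_checkable
class Judge(Protocol):
    async def score(self, case: TestCase) -> JudgeResult: ...


class FixedScoreJudge:
    """Stub judge for tests; note that it does not inherit from Judge."""

    def __init__(self, value: float) -> None:
        self.value = value

    async def score(self, case: TestCase) -> JudgeResult:
        return JudgeResult(score=self.value, rationale="stub")


async def run_judge(judge: Judge, case: TestCase) -> float:
    """Works with any object that satisfies the Judge protocol."""
    result = await judge.score(case)
    return result.score


stub = FixedScoreJudge(0.9)
case = TestCase(prompt="2+2?", response="4")
stub_score = asyncio.run(run_judge(stub, case))
```

A stub like this is also a cheap way to exercise routing and ensemble code in CI without spending tokens.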
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Single-vendor risk | Pure-Anthropic ensemble | 2/3 Anthropic + 1/3 OpenAI — limits "Anthropic deprecates a model" blast radius |
| Cost | All Sonnet ($90/mo) | Cascade ($44/mo at our load — see cost-model CSV) |
| Latency | All-parallel always | Cascade adds ~300ms median when Haiku/Sonnet disagree; we accept it |
| Calibration | Hand-tuned per-judge weights | Fixed weights for v1; per-task-type weights deferred to ADR-004 |
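The cost row can be reproduced from the prices and the 500-in / 100-out shape quoted in the Context section. The sketch below is arithmetic only. The GPT-4o price is not stated in this document, so its share is backed out as a residual, on the assumption that the $44 cascade total is Sonnet on every eval plus Haiku on every eval plus GPT-4o on the escalated ~15%.

```python
EVALS_PER_MONTH = 10_000
TOKENS_IN, TOKENS_OUT = 500, 100  # per-judge-call shape from this ADR


def per_call_usd(in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one judge call given $/M-token input and output prices."""
    return (TOKENS_IN * in_price_per_m + TOKENS_OUT * out_price_per_m) / 1_000_000


sonnet_call = per_call_usd(3.00, 15.00)  # Sonnet prices from this ADR
haiku_call = per_call_usd(0.80, 4.00)    # Haiku prices from this ADR

# Baseline: three Sonnet judges on every eval.
all_sonnet = 3 * EVALS_PER_MONTH * sonnet_call

# Cascade: Sonnet and Haiku run on every eval.
sonnet_monthly = EVALS_PER_MONTH * sonnet_call
haiku_monthly = EVALS_PER_MONTH * haiku_call

# Assumption: whatever remains of the $44 total is the GPT-4o share.
CASCADE_TOTAL = 44.0
gpt4o_residual = CASCADE_TOTAL - sonnet_monthly - haiku_monthly
saving = 1 - CASCADE_TOTAL / all_sonnet

print(f"all-Sonnet: ${all_sonnet:.2f}/mo")
print(f"cascade: Sonnet ${sonnet_monthly:.2f} + Haiku ${haiku_monthly:.2f} "
      f"+ GPT-4o ${gpt4o_residual:.2f} (residual)")
print(f"saving vs all-Sonnet: {saving:.0%}")
```

The cost-model CSV remains the source of truth; this only checks that the table's figures are consistent with the listed prices.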
Consequences (positive)
- Judge cost drops ~50% vs the all-Sonnet baseline at our 10k evals/mo load.
- Two-vendor exposure means an Anthropic price hike doesn't unilaterally break our gate.
- The cascade pattern surfaces which judges disagree in the audit log — useful debugging signal.
- Self-hosted Llama is one config flag away if and when economics flip.
Consequences (negative)
- Three judges with different prompt formats means we maintain three judge prompts, not one. Mitigation: shared JudgePrompt base class with model-specific tweaks.
- Cascade introduces variable per-eval cost: finance forecasting needs the 85th percentile, not the median.
- GPT-4o weight (0.3) was picked empirically on a 200-case calibration set; not statistically defended at scale.
Reversal plan
Llama swap (full vendor exit): ~3 engineer-days. Replace Judge.SONNET and Judge.HAIKU with Judge.LLAMA instances pointing at the inference cluster. Crossover at ~80M tokens/month per tenant — the cost-model CSV shows the math.
GPT-4o → Claude Opus (consolidate to single vendor): ~1 day. Same Protocol; just point at the Anthropic Opus endpoint. Use when GPT-4o's adversarial advantage stops showing in monthly calibration.
Per-task-type weights: see ADR-004 follow-up. ~1 week.
References
- llm_eval/judges/base.py: Judge Protocol
- llm_eval/judges/llm_judge.py: Anthropic + OpenAI implementations
- llm_eval/multi_judge/router.py: cascade routing logic
- llm_eval/multi_judge/evaluator.py: ensemble + consensus
- docs/cost-model/llm-eval-cost-model.csv: judge cost math
- ADR-001 (gold set used to calibrate judge κ vs human)
- ADR-004 (consensus strategy that consumes these weights)