ADR-003: Judge model selection: Claude Sonnet default, GPT-4 adversarial-only, Llama swap path | LLM Evaluation Framework

Context

Multi-judge LLM-as-judge ensembles need a default model. The pick has three knock-on effects:

Cost. Judges run on every PR, every nightly regression, and every adversarial sweep. At 10k evals/month × 3 judges, judge cost dominates the eval bill (see cost-model CSV in docs/cost-model/).
Quality. Different models exhibit different judging biases (lenient, strict, length-preferring, refusal-prone). Picking the wrong default bakes that bias into every gate.
Vendor risk. Anthropic, OpenAI, Google all change pricing and behaviour on their own clocks. Our judge code shouldn't be lock-in code.

Models considered for default:

Claude Sonnet 4.6: strong calibration on the gold set we tested (κ vs human = 0.78), $3/M in + $15/M out.
GPT-4o: comparable calibration (κ = 0.76), but ~3× the per-token cost in our 500-in / 100-out shape.
Claude Haiku 4.5: fast and cheap ($0.80/M in + $4/M out), but κ vs human = 0.61 — below our 0.7 floor for default-judge use.
Llama-3.1-70B (self-hosted): competitive κ (0.74) at GPU-host cost. But idle GPU cost > Anthropic API cost until ~80M tokens/month.

Decision

Three-judge ensemble as default; one judge per role:

Judge 1 — Claude Sonnet (primary): weight 0.5
Judge 2 — Claude Haiku (triage filter): weight 0.2 — runs first; if its score and Sonnet's agree within 0.1, skip Judge 3
Judge 3 — GPT-4o (adversarial only): weight 0.3 — only invoked on adversarial test cases or when Haiku/Sonnet disagree by > 0.2

Cost-savings logic:

# llm_eval/multi_judge/router.py
async def route(case: TestCase) -> list[Judge]:
    judges = [Judge.HAIKU, Judge.SONNET]
    if case.tags.intersection({"adversarial", "high_stakes"}):
        judges.append(Judge.GPT4O)
        return judges
    triage_score = await Judge.HAIKU.score(case)
    primary_score = await Judge.SONNET.score(case)
    if abs(triage_score - primary_score) > 0.2:
        judges.append(Judge.GPT4O)   # disagreement → escalate
    return judges

Result: GPT-4o fires on ~15% of evals (the adversarial tier + organic disagreements), not 100%. Cost of the third judge drops by 85%.

The judge interface is a Protocol — Judge.HAIKU / Judge.SONNET / Judge.GPT4O / Judge.LLAMA are interchangeable.

# llm_eval/judges/base.py
class Judge(Protocol):
    async def score(self, case: TestCase) -> JudgeResult: ...

Tradeoffs we accept

Lever	Alternative	Chosen
Single-vendor risk	Pure-Anthropic ensemble	2/3 Anthropic + 1/3 OpenAI — limits "Anthropic deprecates a model" blast radius
Cost	All Sonnet ($90/mo)	Cascade ($44/mo at our load — see cost-model CSV)
Latency	All-parallel always	Cascade adds ~300ms median when Haiku/Sonnet disagree; we accept it
Calibration	Hand-tuned per-judge weights	Fixed weights for v1; per-task-type weights deferred to ADR-004

Consequences (positive)

Judge cost drops ~50% vs the all-Sonnet baseline at our 10k evals/mo load.
Two-vendor exposure means an Anthropic price hike doesn't unilaterally break our gate.
The cascade pattern surfaces which judges disagree in the audit log — useful debugging signal.
Self-hosted Llama is one config flag away if and when economics flip.

Consequences (negative)

Three judges with different prompt formats means we maintain three judge prompts, not one. Mitigation: shared JudgePrompt base class with model-specific tweaks.
Cascade introduces variable per-eval cost — finance forecasting needs the 85th-percentile not the median.
GPT-4o weight (0.3) was picked empirically on a 200-case calibration set; not statistically defended at scale.

Reversal plan

Llama swap (full vendor exit): ~3 engineer-days. Replace Judge.SONNET and Judge.HAIKU with Judge.LLAMA instances pointing at the inference cluster. Crossover at ~80M tokens/month per tenant — the cost-model CSV shows the math.

GPT-4o → Claude Opus (consolidate to single vendor): ~1 day. Same Protocol; just point at the Anthropic Opus endpoint. Use when GPT-4o's adversarial advantage stops showing in monthly calibration.

Per-task-type weights: see ADR-004 follow-up. ~1 week.

References

llm_eval/judges/base.py — Judge Protocol
llm_eval/judges/llm_judge.py — Anthropic + OpenAI implementations
llm_eval/multi_judge/router.py — cascade routing logic
llm_eval/multi_judge/evaluator.py — ensemble + consensus
docs/cost-model/llm-eval-cost-model.csv — judge cost math
ADR-001 (gold set used to calibrate judge κ vs human)
ADR-004 (consensus strategy that consumes these weights)