Skip to content
Back to LLM Evaluation Framework

Judge model selection: Claude Sonnet default, GPT-4 adversarial-only, Llama swap path

✓ AcceptedLLM Evaluation Framework03 — Multi-Judge Evaluation
By AI-DE Engineering Team·Stakeholders: ML engineer, platform owner, finance lead

Context

Multi-judge LLM-as-judge ensembles need a default model. The pick has three knock-on effects:

  1. Cost. Judges run on every PR, every nightly regression, and every adversarial sweep. At 10k evals/month × 3 judges, judge cost dominates the eval bill (see cost-model CSV in docs/cost-model/).
  2. Quality. Different models exhibit different judging biases (lenient, strict, length-preferring, refusal-prone). Picking the wrong default bakes that bias into every gate.
  3. Vendor risk. Anthropic, OpenAI, Google all change pricing and behaviour on their own clocks. Our judge code shouldn't be lock-in code.

Models considered for default:

  • Claude Sonnet 4.6: strong calibration on the gold set we tested (κ vs human = 0.78), $3/M in + $15/M out.
  • GPT-4o: comparable calibration (κ = 0.76), but ~3× the per-token cost in our 500-in / 100-out shape.
  • Claude Haiku 4.5: fast and cheap ($0.80/M in + $4/M out), but κ vs human = 0.61 — below our 0.7 floor for default-judge use.
  • Llama-3.1-70B (self-hosted): competitive κ (0.74) at GPU-host cost. But idle GPU cost > Anthropic API cost until ~80M tokens/month.

Decision

Three-judge ensemble as default; one judge per role:

  • Judge 1 — Claude Sonnet (primary): weight 0.5
  • Judge 2 — Claude Haiku (triage filter): weight 0.2 — runs first; if its score and Sonnet's agree within 0.1, skip Judge 3
  • Judge 3 — GPT-4o (adversarial only): weight 0.3 — only invoked on adversarial test cases or when Haiku/Sonnet disagree by > 0.2

Cost-savings logic:

# llm_eval/multi_judge/router.py
async def route(case: TestCase) -> list[Judge]:
    judges = [Judge.HAIKU, Judge.SONNET]
    if case.tags.intersection({"adversarial", "high_stakes"}):
        judges.append(Judge.GPT4O)
        return judges
    triage_score = await Judge.HAIKU.score(case)
    primary_score = await Judge.SONNET.score(case)
    if abs(triage_score - primary_score) > 0.2:
        judges.append(Judge.GPT4O)   # disagreement → escalate
    return judges

Result: GPT-4o fires on ~15% of evals (the adversarial tier + organic disagreements), not 100%. Cost of the third judge drops by 85%.

The judge interface is a Protocol — Judge.HAIKU / Judge.SONNET / Judge.GPT4O / Judge.LLAMA are interchangeable.

# llm_eval/judges/base.py
class Judge(Protocol):
    async def score(self, case: TestCase) -> JudgeResult: ...

Tradeoffs we accept

LeverAlternativeChosen
Single-vendor riskPure-Anthropic ensemble2/3 Anthropic + 1/3 OpenAI — limits "Anthropic deprecates a model" blast radius
CostAll Sonnet ($90/mo)Cascade ($44/mo at our load — see cost-model CSV)
LatencyAll-parallel alwaysCascade adds ~300ms median when Haiku/Sonnet disagree; we accept it
CalibrationHand-tuned per-judge weightsFixed weights for v1; per-task-type weights deferred to ADR-004

Consequences (positive)

  • Judge cost drops ~50% vs the all-Sonnet baseline at our 10k evals/mo load.
  • Two-vendor exposure means an Anthropic price hike doesn't unilaterally break our gate.
  • The cascade pattern surfaces which judges disagree in the audit log — useful debugging signal.
  • Self-hosted Llama is one config flag away if and when economics flip.

Consequences (negative)

  • Three judges with different prompt formats means we maintain three judge prompts, not one. Mitigation: shared JudgePrompt base class with model-specific tweaks.
  • Cascade introduces variable per-eval cost — finance forecasting needs the 85th-percentile not the median.
  • GPT-4o weight (0.3) was picked empirically on a 200-case calibration set; not statistically defended at scale.

Reversal plan

Llama swap (full vendor exit): ~3 engineer-days. Replace Judge.SONNET and Judge.HAIKU with Judge.LLAMA instances pointing at the inference cluster. Crossover at ~80M tokens/month per tenant — the cost-model CSV shows the math.

GPT-4o → Claude Opus (consolidate to single vendor): ~1 day. Same Protocol; just point at the Anthropic Opus endpoint. Use when GPT-4o's adversarial advantage stops showing in monthly calibration.

Per-task-type weights: see ADR-004 follow-up. ~1 week.

References

  • llm_eval/judges/base.py — Judge Protocol
  • llm_eval/judges/llm_judge.py — Anthropic + OpenAI implementations
  • llm_eval/multi_judge/router.py — cascade routing logic
  • llm_eval/multi_judge/evaluator.py — ensemble + consensus
  • docs/cost-model/llm-eval-cost-model.csv — judge cost math
  • ADR-001 (gold set used to calibrate judge κ vs human)
  • ADR-004 (consensus strategy that consumes these weights)
Built into the project

This decision shipped as part of LLM Evaluation Framework — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open