ADR-004: Multi-judge consensus is weighted average, not majority vote | LLM Evaluation Framework

Context

Three judges per case. They disagree. We need to collapse three scores into one consensus number that downstream consumers (dashboard, CI gate, cost dashboard) read as "the score".

Five consensus strategies were on the table:

Majority vote (binary cutoff per judge → take the majority): simple, loses information on close calls.
Median: robust to outlier judges; ties get awkward at 3 judges.
Unanimous: a score is kept only when all judges agree; otherwise the case is flagged for human review.
Weighted average: judge-specific weights, continuous output.
Highest / lowest: best-case or worst-case framing. Taking the lowest score means "ship only if all judges are happy".
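
To make the information-loss point concrete, the sketch below collapses one made-up set of three scores with several of the candidate strategies. The judge names, the scores, the 0.5 binary cutoff, and the helper `majority_vote` are illustrative assumptions, not values or code from the framework.

```python
import statistics

# Hypothetical scores for one case, on an assumed 0-1 scale.
scores = {"judge_a": 0.55, "judge_b": 0.52, "judge_c": 0.10}
CUTOFF = 0.5  # assumed binary pass/fail cutoff per judge


def majority_vote(scores: dict[str, float], cutoff: float) -> float:
    """Return 1.0 if more than half the judges score at or above cutoff, else 0.0."""
    passes = sum(1 for s in scores.values() if s >= cutoff)
    return 1.0 if passes > len(scores) / 2 else 0.0


results = {
    "majority": majority_vote(scores, CUTOFF),
    "median": statistics.median(scores.values()),
    "mean": statistics.fmean(scores.values()),
    "lowest": min(scores.values()),
    "highest": max(scores.values()),
}
for name, value in results.items():
    print(f"{name:>8}: {value:.2f}")
```

Majority vote reports a clean pass here even though two judges barely cleared the cutoff and the third dissented strongly; the continuous strategies keep that signal visible.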

Constraints:

Continuous output. The CI gate (M04) thresholds on a continuous score; binary "passed/failed per judge" loses signal.
Outlier handling. Judges fail (rate-limit, timeout, bad JSON parse). We can't have one failed judge tank the score.
Calibration. Per ADR-003, we have judge-specific weights that encode each judge's empirical calibration vs human (κ on the gold set). Weighted average uses them; majority vote ignores them.

Decision

Default consensus strategy: weighted average. Fall back to median when judge_failure_count > 0.

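# llm_eval/multi_judge/consensus.py
import statistics
from enum import Enum


class NoJudgesError(Exception):
    """Raised when no judge returned a score."""


class ConsensusStrategy(str, Enum):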
    WEIGHTED_AVERAGE = "weighted_average"   # default
    MEDIAN = "median"                        # fallback when judges fail
    MAJORITY = "majority"                    # binary use cases only
    UNANIMOUS = "unanimous"                  # high-stakes red-team
    HIGHEST = "highest"                      # optimistic: best judge score wins
    LOWEST = "lowest"                        # paranoid: ship only if all judges are happy


def consensus(scores: dict[str, float], weights: dict[str, float],
              strategy: ConsensusStrategy = ConsensusStrategy.WEIGHTED_AVERAGE) -> float:
    if not scores:
        raise NoJudgesError("All judges failed")

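    if strategy == ConsensusStrategy.WEIGHTED_AVERAGE:
        # Renormalize weights over the judges that returned a score
        active_weights = {j: w for j, w in weights.items() if j in scores}
        total = sum(active_weights.values())
        if total <= 0:
            # Avoid a ZeroDivisionError when no scoring judge has a positive weight
            raise ValueError("No positive weight among the judges that returned a score")
        return sum(scores[j] * (w / total) for j, w in active_weights.items())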

    if strategy == ConsensusStrategy.MEDIAN:
        return statistics.median(scores.values())
    # ... other strategies
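
The excerpt above does not show where the median fallback from the decision is applied. One way the rule could be expressed is sketched below; `pick_strategy` is a hypothetical helper, and the two-member enum is a trimmed copy so the block runs on its own.

```python
from enum import Enum


class ConsensusStrategy(str, Enum):
    # Trimmed copy of the enum above, kept so this sketch is self-contained.
    WEIGHTED_AVERAGE = "weighted_average"
    MEDIAN = "median"


def pick_strategy(judge_failure_count: int) -> ConsensusStrategy:
    """Hypothetical helper: weighted average by default, median if any judge failed."""
    if judge_failure_count < 0:
        raise ValueError("judge_failure_count must be non-negative")
    if judge_failure_count > 0:
        return ConsensusStrategy.MEDIAN
    return ConsensusStrategy.WEIGHTED_AVERAGE


print(pick_strategy(0).value)
print(pick_strategy(1).value)
```

With three judges and one failure, the median of the two remaining scores is simply their unweighted mean, so the calibration weights have no effect on the fallback path.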

Weights from ADR-003: Claude Sonnet 0.5, Claude Haiku 0.2, GPT-4o 0.3.
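
As a worked example of the renormalization step in the weighted-average branch, the sketch below uses the ADR-003 weights with one judge missing. The dictionary keys and the two scores are made-up illustration values.

```python
# Weights from ADR-003; judge keys and scores are assumed example values.
weights = {"claude-sonnet": 0.5, "claude-haiku": 0.2, "gpt-4o": 0.3}
scores = {"claude-sonnet": 0.8, "gpt-4o": 0.6}  # claude-haiku returned no score

# Keep only the judges that scored, then rescale their weights to sum to 1.
active = {j: w for j, w in weights.items() if j in scores}
total = sum(active.values())  # 0.5 + 0.3
renormalized = {j: w / total for j, w in active.items()}
consensus_score = sum(scores[j] * w for j, w in renormalized.items())

print({j: round(w, 3) for j, w in renormalized.items()})
print(round(consensus_score, 3))
```

The surviving weights become 0.625 and 0.375, so the consensus is 0.8 × 0.625 + 0.6 × 0.375 = 0.725.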

Agreement metric: one minus the sample variance of scores.values(), normalized by 0.0625 and clamped to [0, 1]:

def agreement(scores: dict[str, float]) -> float:
    if len(scores) < 2:
        return 1.0
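    v = statistics.variance(list(scores.values()))  # sample variance (n - 1)
    # 0.0625 = 0.25 ** 2, i.e. (max - min)^2 / 16 on a 0-1 scale:
    # a sample standard deviation of 0.25 or more maps to zero agreement.
    return max(0.0, 1.0 - min(v / 0.0625, 1.0))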
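
A few made-up inputs show how the metric behaves. The function is repeated here only so the example runs on its own; the judge keys and scores are assumptions.

```python
import statistics


def agreement(scores: dict[str, float]) -> float:
    # Same formula as above, repeated so the example is self-contained.
    if len(scores) < 2:
        return 1.0
    v = statistics.variance(list(scores.values()))
    return max(0.0, 1.0 - min(v / 0.0625, 1.0))


identical = agreement({"a": 0.8, "b": 0.8, "c": 0.8})
mild_spread = agreement({"a": 0.7, "b": 0.8, "c": 0.9})
strong_split = agreement({"a": 0.2, "b": 0.9})

print(identical, round(mild_spread, 2), strong_split)
```

Identical scores give 1.0. A spread of 0.7 / 0.8 / 0.9 has sample variance 0.01 and gives about 0.84. Two judges at 0.2 and 0.9 have sample variance 0.245, which exceeds the 0.0625 cap and clamps to 0.0.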

Tradeoffs we accept

Lever	Alternative	Chosen
Loss of information	Majority vote (binary)	Weighted average (continuous) — preserves the CI gate's continuous threshold
Calibration	Equal-weight averaging	Per-judge weights from ADR-003 calibration
Statistical defensibility	Cohen's κ / Fleiss' κ as agreement metric	Variance-based agreement for v1 — simple, no math the team disagrees on
Outlier sensitivity	Trimmed mean	Fall back to median when any judge fails — already a planned-failure path

Consequences (positive)

Continuous output preserved through the gate; the dashboard reports both the consensus number and per-judge scores.
Per-judge calibration weights propagate from ADR-003 into the consensus number with no extra wiring.
Median fallback handles partial failures gracefully; the system doesn't tank when one judge fails.
All 6 strategies live in the same enum so config-driven swap (e.g. UNANIMOUS for adversarial sweeps) is one line.
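
Because the enum subclasses str, a strategy name read from configuration can be converted by value. The sketch below assumes a hypothetical `config` dictionary and key name; the enum members are copied from consensus.py.

```python
from enum import Enum


class ConsensusStrategy(str, Enum):
    WEIGHTED_AVERAGE = "weighted_average"
    MEDIAN = "median"
    MAJORITY = "majority"
    UNANIMOUS = "unanimous"
    HIGHEST = "highest"
    LOWEST = "lowest"


# Hypothetical config entry, e.g. loaded from a YAML or JSON file.
config = {"consensus_strategy": "unanimous"}

# Lookup by value; an unknown string raises ValueError.
strategy = ConsensusStrategy(config["consensus_strategy"])
print(strategy.name)
```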

Consequences (negative)

Variance-based agreement is not Cohen's/Fleiss' κ. The Part 3 prose mentions Kappa as a concept; the code does not implement it. We ship the simpler metric and document this gap explicitly. Promoting to κ requires categorical (not continuous) judge outputs, which would force a refactor of the judge Protocol — out of scope for v1.
Weights are static. As judges drift (model update, prompt tweak), weights need recalibration. There is no automated trigger for this yet; recalibration is manual, once per quarter.
A weighted average can mask a strongly dissenting judge: one low score from a judge that really thinks the answer is wrong gets averaged away. Mitigation: the dashboard surfaces individual scores alongside the consensus.
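
The masking effect can be illustrated with a small sketch. `dissenting_judges` and its 0.3 gap threshold are hypothetical, as are the judge keys and scores; the ADR itself only commits to showing per-judge scores on the dashboard.

```python
def dissenting_judges(scores: dict[str, float], consensus_score: float,
                      max_gap: float = 0.3) -> dict[str, float]:
    """Return the judges whose score differs from the consensus by more than max_gap."""
    return {j: s for j, s in scores.items() if abs(s - consensus_score) > max_gap}


# ADR-003 weights; judge keys and scores are made-up example values.
weights = {"claude-sonnet": 0.5, "claude-haiku": 0.2, "gpt-4o": 0.3}
scores = {"claude-sonnet": 0.9, "claude-haiku": 0.85, "gpt-4o": 0.2}

# All three judges scored, so the weights already sum to 1.
consensus_score = sum(scores[j] * w for j, w in weights.items())
flagged = dissenting_judges(scores, consensus_score)

print(round(consensus_score, 2), flagged)
```

The consensus lands at 0.68, which hides the fact that one judge scored the case at 0.2; only the per-judge view reveals it.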

Reversal plan

Promote to Krippendorff's α (proper agreement metric): ~3 engineer-days. Krippendorff's α handles continuous data and missing judges natively. Trigger: a stakeholder or auditor pushes back on the variance-based metric.
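
For a sense of scale, a minimal sketch of Krippendorff's α with the interval (squared-difference) metric follows. It is not the planned implementation: the function name and input layout are assumptions, and the expected-disagreement sum is quadratic in the number of scores, which is fine for a gold set but not tuned for large runs.

```python
from itertools import permutations
from typing import Optional


def krippendorff_alpha_interval(units: list[list[Optional[float]]]) -> float:
    """Krippendorff's alpha with the interval (squared-difference) metric.

    Each inner list holds one case's judge scores; None marks a missing judge.
    Cases with fewer than two scores are not pairable and are skipped.
    """
    pairable = [[v for v in unit if v is not None] for unit in units]
    pairable = [vals for vals in pairable if len(vals) >= 2]
    all_values = [v for vals in pairable for v in vals]
    n = len(all_values)
    if n < 2:
        raise ValueError("Need at least one case with two or more scores")

    # Observed disagreement: squared differences between scores within each case.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(vals, 2)) / (len(vals) - 1)
        for vals in pairable
    ) / n

    # Expected disagreement: squared differences across all pairable scores.
    d_e = sum((a - b) ** 2 for a, b in permutations(all_values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("All scores are identical; alpha is undefined")
    return 1.0 - d_o / d_e


# Made-up data: three cases, with missing judges marked as None.
alpha = krippendorff_alpha_interval([[1.0, 2.0, None], [3.0, 3.0], [None, 5.0, None]])
print(round(alpha, 4))
```

In this example the third case has a single score and is dropped, leaving an observed disagreement of 0.5 against an expected 22/12, so α = 8/11.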

Adaptive weights from rolling κ vs human: ~1 engineer-week. Schedule a daily job that recomputes per-judge weights from the most recent 30-day window of human-adjudicated tier-B cases. Trigger: judge drift detected (variance-based agreement ↑ but human-judge alignment ↓).
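
How κ maps to a weight is not specified here, so the sketch below assumes the simplest rule: weights proportional to each judge's κ, with negative κ clipped to zero. `weights_from_kappa`, the judge keys, and the κ values are all hypothetical.

```python
def weights_from_kappa(kappas: dict[str, float]) -> dict[str, float]:
    """Assumed rule: weight each judge in proportion to its (non-negative) kappa."""
    clipped = {j: max(k, 0.0) for j, k in kappas.items()}
    total = sum(clipped.values())
    if total == 0:
        raise ValueError("No judge has positive kappa; cannot derive weights")
    return {j: k / total for j, k in clipped.items()}


# Made-up rolling-window kappa values, one per judge.
rolling_kappa = {"claude-sonnet": 0.75, "claude-haiku": 0.30, "gpt-4o": 0.45}
new_weights = weights_from_kappa(rolling_kappa)

print({j: round(w, 2) for j, w in new_weights.items()})
```

The resulting weights always sum to 1, so they can be passed to the weighted-average branch without further scaling.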

Per-task-type strategy routing: ~2 days. e.g. UNANIMOUS on red-team adversarial cases, WEIGHTED_AVERAGE on the regression set. Trigger: a class of cases consistently breaks the default strategy.
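
Routing could be as small as a lookup table. In the sketch below the task-type strings, `STRATEGY_BY_TASK_TYPE`, and `strategy_for` are hypothetical names; the enum is a trimmed copy of the one in consensus.py.

```python
from enum import Enum


class ConsensusStrategy(str, Enum):
    # Trimmed copy of the enum in consensus.py, kept so this sketch is self-contained.
    WEIGHTED_AVERAGE = "weighted_average"
    UNANIMOUS = "unanimous"


# Hypothetical task-type labels mapped to strategies.
STRATEGY_BY_TASK_TYPE = {
    "red_team": ConsensusStrategy.UNANIMOUS,
    "regression": ConsensusStrategy.WEIGHTED_AVERAGE,
}


def strategy_for(task_type: str) -> ConsensusStrategy:
    """Return the routed strategy, defaulting to weighted average for unknown types."""
    return STRATEGY_BY_TASK_TYPE.get(task_type, ConsensusStrategy.WEIGHTED_AVERAGE)


print(strategy_for("red_team").value)
print(strategy_for("unlabelled").value)
```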

References

llm_eval/multi_judge/evaluator.py — orchestration
llm_eval/multi_judge/consensus.py — strategies + agreement
ADR-001 (gold set used for empirical weight calibration)
ADR-003 (judge weights consumed here)
Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology, ch. 12 — agreement measures