ADR-003: Cost-latency-quality routing triangle with explicit fallback chain | AI Cost Optimization

Context

Module 02 made cost visible. Module 03 cut the bill through caching. Module 04 is where the platform decides which model to use — and that decision is the single largest lever left, because the price gap between GPT-4o-mini and GPT-4o is ~30× per output token (pricing.py: $0.60/M vs $15.00/M output).

Three competing forces govern the routing decision:

Cost. Cheaper models exist; sending everything to the premium tier is the "naive 50k queries/day startup" anti-pattern documented in examples/startup_nightmare.py.
Quality. Some prompts genuinely need the premium model; sending them all to the mini tier is the symmetric anti-pattern (cheap + wrong).
Latency. Routing decisions sit on the hot path. A 200 ms classifier that calls a model just to pick a model defeats the purpose.

Naive routing patterns we ruled out:

Static keyword matching only. "If prompt contains 'analyse' → premium" works for ~30% of traffic; collapses on free-form chat.
Always-classify-first. Run a cheap model on every prompt to decide; adds 200 ms p95 and an extra model call to every request.
Premium-by-default with budget cap. Just route everything to GPT-4o until the budget is hit. Bad for cost; worse on latency when the budget flips and traffic suddenly fails.

What we wanted: a routing decision that costs ~milliseconds, gets the easy cases right cheaply, and escalates only the cases that need it.

Decision

We adopt a 3-tier routing triangle (MINI / STANDARD / PREMIUM) with an explicit fallback chain keyed on a four-strategy classifier:

# src/routing/router.py — pseudocode of the chain
async def route(prompt: str, ctx: RequestContext) -> RouteDecision:
    # 1. Cheap deterministic strategies first
    if decision := keyword_router.classify(prompt):  return decision
    if decision := complexity_router.classify(prompt):  return decision

    # 2. Cheap model self-confidence (only if 1 was inconclusive)
    if decision := await confidence_router.classify(prompt):  return decision

    # 3. Always-fallback chain — never returns None
    return fallback_chain.default(ctx)

The triangle (triangle.py): every model is positioned on a 3-axis plot (cost / latency / quality). Three named tiers map to vendor models:

Tier	Model	$/1M in	$/1M out	p95 latency	Use case
MINI	gpt-4o-mini	$0.15	$0.60	800 ms	Greetings, classification, FAQ
STANDARD	claude-3-5-sonnet	$3.00	$15.00	1.6 s	Reasoning, summarisation
PREMIUM	gpt-4o	$2.50	$10.00	1.4 s	Analysis, generation, agent loops

The four routing strategies (strategies.py):

Keyword router — pattern match on prompt text → tier. Catches the obvious 30%. ~0.5 ms.
Complexity router — heuristic score (1–5) over prompt length, lexical diversity, code-block presence. Score ≥ 4 → escalate to STANDARD. ~2 ms.
Confidence router — runs MINI; if confidence < 0.5 (self-reported via logprobs), re-runs on STANDARD. Adds one MINI call to ~15% of requests.
Fallback chain — when the three above are inconclusive, default to the tier marked in RequestContext.default_tier (per-tenant default).

Fallback resilience (fallback.py): every tier has an exponential-backoff retry policy (base 200 ms × 2^retry + jitter). On hard failure (timeout, rate limit, 5xx), the chain steps up to the next tier — not down. Stepping down on failure is the cost-engineer's instinct; stepping up is the SRE-engineer's. We chose up because a failed cheap call followed by a successful expensive call is recoverable; a failed expensive call followed by a successful cheap call probably degraded quality silently.

Every routing decision lands in llm_requests.route_decision (a string enum) so the analytics layer can answer "what fraction of requests escalated, and why?"

Tradeoffs we accept

Lever	Alternative	Chosen
Decision cost	Always-classify	Cheap-first chain with escalation
Quality floor	Always-premium	MINI + confidence escalation
Failure direction	Step down on error (cheap)	Step up on error (resilient)
Tunability	Hard-coded tier mapping	Per-tenant `default_tier` + per-strategy thresholds
Observability	Single "model" column	`route_decision` enum + 4-strategy attribution

The largest concrete cost is the confidence router's extra MINI call on ~15% of requests. At the seed-dataset workload that adds ~$0.04/day per active user — roughly 8% of the cost it saves. Net: ~5× ROI on the confidence strategy.

Consequences (positive)

Cost cuts demonstrable. Reference scenario in examples/startup_nightmare.py shows 78% reduction (50k queries/day from $9k/mo naive → $1.98k/mo with this triangle + cache).
Latency stays predictable. ~85% of requests route via the deterministic strategies (no model call). The ~15% confidence-router path adds ~800 ms, but is bounded.
Failure mode is upgrades, not silent degradation. If MINI fails, the user gets a STANDARD response; they never get a half-broken cheap answer.
Per-tenant levers. A premium tenant can set default_tier=PREMIUM and bypass the cheap-first chain entirely.

Consequences (negative)

Four strategies = four places to debug. When a routing decision surprises someone, "which strategy decided?" is a real question. Mitigated by attribution in route_decision and a ?explain=true debug query that returns the decision trace.
Threshold maintenance. Confidence threshold (0.5), complexity threshold (4), and keyword patterns are calibrated to the seed dataset. They drift. Runbook entry: weekly re-calibration; monthly threshold sweep.
Step-up-on-failure can amplify cost during incidents. A vendor outage on MINI cascades the whole platform onto STANDARD. We mitigate with a circuit breaker per tier (fallback.py::CircuitBreaker) that pins traffic to a healthy tier for ≥ 30 s after consecutive failures.

Reversal plan

If the confidence router's ROI drops below break-even (i.e. the 15% extra MINI calls cost more than they save), reverse to a 3-strategy chain:

Set ROUTING_CONFIDENCE_ENABLED=false in .env.
src/routing/router.py::route skips strategy 3.
Re-tune the complexity threshold downward to compensate.
Re-run examples/startup_nightmare.py to verify the cost-savings table.

Estimated effort: ~3 engineer-days including re-tuning.

References

src/routing/triangle.py — cost-latency-quality positions
src/routing/strategies.py — 4 routing strategies
src/routing/router.py — chain orchestrator
src/routing/fallback.py — exponential backoff + circuit breaker
src/routing/eval.py — A/B evaluation harness for tier choices
src/routing/latency.py — per-tier latency budget enforcement
pricing.py — model price table (the input to ROI calculation)
ADR-001 (cache layer the router sits behind)
ADR-002 (budget engine that gates the chain)