Cost-latency-quality routing triangle with explicit fallback chain

Status: Accepted · AI Cost Optimization · Module 04: Model Routing, Quality & Optimization
By AI-DE Engineering Team · Stakeholders: platform engineer, ML engineer, on-call SRE

Context

Module 02 made cost visible. Module 03 cut the bill through caching. Module 04 is where the platform decides which model to use, and that decision is the single largest lever left: GPT-4o costs ~17× more than GPT-4o-mini per output token (pricing.py: $10.00/M vs $0.60/M output).

Three competing forces govern the routing decision:

  1. Cost. Cheaper models exist; sending everything to the premium tier is the "naive 50k queries/day startup" anti-pattern documented in examples/startup_nightmare.py.
  2. Quality. Some prompts genuinely need the premium model; sending them all to the mini tier is the symmetric anti-pattern (cheap + wrong).
  3. Latency. Routing decisions sit on the hot path. A 200 ms classifier that calls a model just to pick a model defeats the purpose.

Naive routing patterns we ruled out:

  • Static keyword matching only. "If prompt contains 'analyse' → premium" works for ~30% of traffic; collapses on free-form chat.
  • Always-classify-first. Run a cheap model on every prompt to decide; adds 200 ms p95 and an extra model call to every request.
  • Premium-by-default with budget cap. Just route everything to GPT-4o until the budget is hit. Bad for cost; worse on latency when the budget flips and traffic suddenly fails.

What we wanted: a routing decision that takes a few milliseconds, gets the easy cases right cheaply, and escalates only the cases that need it.

Decision

We adopt a 3-tier routing triangle (MINI / STANDARD / PREMIUM) with an explicit fallback chain keyed on a four-strategy classifier:

# src/routing/router.py: pseudocode of the chain
async def route(prompt: str, ctx: RequestContext) -> RouteDecision:
    # 1. Cheap deterministic strategies first (no model call)
    if decision := keyword_router.classify(prompt):
        return decision
    if decision := complexity_router.classify(prompt):
        return decision

    # 2. Cheap-model self-confidence (only if step 1 was inconclusive)
    if decision := await confidence_router.classify(prompt):
        return decision

    # 3. Always-fallback chain: never returns None
    return fallback_chain.default(ctx)
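
For readers who want to run the chain, here is a minimal self-contained sketch. The stub classifiers, their rules, and the field names on RequestContext and RouteDecision are illustrative assumptions, not the contents of src/routing/router.py.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteDecision:
    tier: str       # "MINI", "STANDARD" or "PREMIUM"
    strategy: str   # which strategy decided (assumed field, for attribution)


@dataclass(frozen=True)
class RequestContext:
    default_tier: str = "MINI"  # per-tenant default


def keyword_classify(prompt: str) -> Optional[RouteDecision]:
    # Hypothetical rule: bare greetings go straight to MINI.
    if prompt.lower().strip() in {"hi", "hello", "thanks"}:
        return RouteDecision("MINI", "keyword")
    return None


def complexity_classify(prompt: str) -> Optional[RouteDecision]:
    # Hypothetical rule: anything that looks like Python code is complex.
    if "def " in prompt:
        return RouteDecision("STANDARD", "complexity")
    return None


async def confidence_classify(prompt: str) -> Optional[RouteDecision]:
    # Stand-in for the MINI call; always inconclusive in this sketch.
    await asyncio.sleep(0)
    return None


async def route(prompt: str, ctx: RequestContext) -> RouteDecision:
    # Deterministic strategies first, then the model-backed one.
    for classify in (keyword_classify, complexity_classify):
        if (decision := classify(prompt)) is not None:
            return decision
    if (decision := await confidence_classify(prompt)) is not None:
        return decision
    # The fallback always produces a decision.
    return RouteDecision(ctx.default_tier, "fallback")


if __name__ == "__main__":
    ctx = RequestContext(default_tier="STANDARD")
    print(asyncio.run(route("hello", ctx)))
    print(asyncio.run(route("what is our refund policy?", ctx)))
```

Iterating over the deterministic classifiers keeps the ordering explicit, so adding or removing a cheap strategy is a one-line change.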

The triangle (triangle.py): every model is positioned on a 3-axis plot (cost / latency / quality). Three named tiers map to vendor models:

Tier | Model | $/1M in | $/1M out | p95 latency | Use case
MINI | gpt-4o-mini | $0.15 | $0.60 | 800 ms | Greetings, classification, FAQ
STANDARD | claude-3-5-sonnet | $3.00 | $15.00 | 1.6 s | Reasoning, summarisation
PREMIUM | gpt-4o | $2.50 | $10.00 | 1.4 s | Analysis, generation, agent loops
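
One way to hold the tier table in code and price a single request is sketched below. The `Tier` dataclass, the `request_cost_usd` helper, and the 500-in / 300-out request size are assumptions for illustration; only the prices and latencies come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str
    usd_per_m_in: float    # $ per 1M input tokens
    usd_per_m_out: float   # $ per 1M output tokens
    p95_latency_ms: int


# Values copied from the tier table.
TIERS = {
    "MINI": Tier("MINI", "gpt-4o-mini", 0.15, 0.60, 800),
    "STANDARD": Tier("STANDARD", "claude-3-5-sonnet", 3.00, 15.00, 1600),
    "PREMIUM": Tier("PREMIUM", "gpt-4o", 2.50, 10.00, 1400),
}


def request_cost_usd(tier: Tier, tokens_in: int, tokens_out: int) -> float:
    """List-price cost of one request on the given tier."""
    total = tokens_in * tier.usd_per_m_in + tokens_out * tier.usd_per_m_out
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request size: 500 input tokens, 300 output tokens.
    for tier in TIERS.values():
        print(f"{tier.name:<9} ${request_cost_usd(tier, 500, 300):.6f}")
```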

The four routing strategies (strategies.py):

  1. Keyword router: pattern match on prompt text → tier. Catches the obvious ~30% of traffic. ~0.5 ms.
  2. Complexity router: heuristic score (1–5) over prompt length, lexical diversity, code-block presence. Score ≥ 4 → escalate to STANDARD. ~2 ms.
  3. Confidence router: runs MINI; if confidence < 0.5 (derived from the response's token logprobs), re-runs on STANDARD. Adds one MINI call to ~15% of requests.
  4. Fallback chain: when the three above are inconclusive, default to the tier marked in RequestContext.default_tier (per-tenant default).
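
The first three strategies could look roughly like the sketch below. The keyword patterns, the scoring cut-offs, and the use of geometric-mean token probability as the confidence measure are all assumptions chosen for illustration; strategies.py may do each of these differently.

```python
import math
import re
from typing import Optional, Sequence

# Hypothetical patterns for illustration only.
KEYWORD_RULES = [
    (re.compile(r"^\s*(hi|hello|thanks)\b", re.IGNORECASE), "MINI"),
    (re.compile(r"\b(analy[sz]e|compare|evaluate)\b", re.IGNORECASE), "PREMIUM"),
]


def keyword_tier(prompt: str) -> Optional[str]:
    """Strategy 1: tier of the first matching pattern, or None if inconclusive."""
    for pattern, tier in KEYWORD_RULES:
        if pattern.search(prompt):
            return tier
    return None


def complexity_score(prompt: str) -> int:
    """Strategy 2: heuristic score from 1 to 5 (cut-offs are assumptions)."""
    words = prompt.split()
    score = 1
    if len(words) > 50:       # moderately long prompt
        score += 1
    if len(words) > 200:      # very long prompt
        score += 1
    # Lexical diversity: share of distinct words, ignored for short prompts.
    if len(words) > 20 and len({w.lower() for w in words}) / len(words) > 0.7:
        score += 1
    # Crude code detection: common keywords or brace/semicolon characters.
    if re.search(r"\b(def|class|function)\b|[{};]", prompt):
        score += 1
    return score


def complexity_tier(prompt: str, threshold: int = 4) -> Optional[str]:
    """Escalate to STANDARD at or above the threshold, else inconclusive."""
    return "STANDARD" if complexity_score(prompt) >= threshold else None


def mean_token_confidence(logprobs: Sequence[float]) -> float:
    """Strategy 3 input: geometric-mean token probability, exp(mean(logprob))."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


def needs_escalation(logprobs: Sequence[float], threshold: float = 0.5) -> bool:
    """True when the MINI answer should be re-run on STANDARD."""
    return mean_token_confidence(logprobs) < threshold
```

Returning None for "inconclusive" lets each strategy drop into the chain without knowing what runs after it.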

Fallback resilience (fallback.py): every tier has an exponential-backoff retry policy (delay = 200 ms × 2^retry + jitter). On hard failure (timeout, rate limit, 5xx), the chain steps up to the next tier, not down. Stepping down on failure is the cost engineer's instinct; stepping up is the SRE's. We chose up because a failed cheap call followed by a successful expensive call is recoverable, whereas a failed expensive call followed by a successful cheap call has probably degraded quality silently.
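
A minimal sketch of retry-then-step-up follows. The function names, the jitter bound, the retry count, and the use of TimeoutError / ConnectionError as stand-ins for timeout, rate-limit and 5xx failures are assumptions; fallback.py is not reproduced here.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

TIER_ORDER: Sequence[str] = ("MINI", "STANDARD", "PREMIUM")


def backoff_delay_s(retry: int, base_s: float = 0.2, max_jitter_s: float = 0.05) -> float:
    """base * 2^retry plus uniform jitter (the jitter bound is an assumption)."""
    return base_s * (2 ** retry) + random.uniform(0.0, max_jitter_s)


def call_with_step_up(
    start_tier: str,
    call: Callable[[str], str],
    max_retries: int = 2,
    sleep: Callable[[float], None] = lambda _s: None,  # pass time.sleep in real use
) -> Tuple[str, str]:
    """Retry within a tier, then step UP to the next tier on hard failure.

    Returns (tier_that_answered, response). Raises RuntimeError if every
    tier from start_tier upward fails.
    """
    last_error: Optional[Exception] = None
    for tier in TIER_ORDER[TIER_ORDER.index(start_tier):]:
        for retry in range(max_retries + 1):
            try:
                return tier, call(tier)
            except (TimeoutError, ConnectionError) as exc:
                last_error = exc
                if retry < max_retries:
                    sleep(backoff_delay_s(retry))
    raise RuntimeError("all tiers failed") from last_error
```

Injecting `sleep` keeps the sketch testable without real delays.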

Every routing decision lands in llm_requests.route_decision (a string enum) so the analytics layer can answer "what fraction of requests escalated, and why?"
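
As an illustration of how a string enum supports that question, the sketch below defines hypothetical enum members and computes the share of requests per escalating decision. The member names and the `escalation_breakdown` helper are assumptions, not the actual values stored in llm_requests.route_decision.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class RouteDecisionLabel(str, Enum):
    """String-valued so each member stores directly in a text column."""
    KEYWORD = "keyword"
    COMPLEXITY_ESCALATED = "complexity_escalated"
    CONFIDENCE_KEPT = "confidence_kept"            # MINI answer accepted
    CONFIDENCE_ESCALATED = "confidence_escalated"  # re-run on STANDARD
    FALLBACK_DEFAULT = "fallback_default"
    FAILURE_STEP_UP = "failure_step_up"


ESCALATIONS = frozenset({
    RouteDecisionLabel.COMPLEXITY_ESCALATED,
    RouteDecisionLabel.CONFIDENCE_ESCALATED,
    RouteDecisionLabel.FAILURE_STEP_UP,
})


def escalation_breakdown(values: Iterable[str]) -> Dict[str, float]:
    """Share of all requests attributed to each escalating decision."""
    labels = [RouteDecisionLabel(v) for v in values]
    if not labels:
        return {}
    counts = Counter(label for label in labels if label in ESCALATIONS)
    return {label.value: n / len(labels) for label, n in counts.items()}
```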

Tradeoffs we accept

Lever | Alternative | Chosen
Decision cost | Always-classify | Cheap-first chain with escalation
Quality floor | Always-premium | MINI + confidence escalation
Failure direction | Step down on error (cheap) | Step up on error (resilient)
Tunability | Hard-coded tier mapping | Per-tenant default_tier + per-strategy thresholds
Observability | Single "model" column | route_decision enum + 4-strategy attribution

The largest concrete cost is the confidence router's extra MINI call on ~15% of requests. At the seed-dataset workload that adds ~$0.04/day per active user, roughly 8% of the cost it saves. Net: ~12× return on the confidence strategy (savings ÷ added cost = 1 / 0.08).

Consequences (positive)

  • Cost cuts are demonstrable. The reference scenario in examples/startup_nightmare.py shows a 78% reduction at 50k queries/day (from $9k/mo naive to $1.98k/mo with this triangle + cache).
  • Latency stays predictable. ~85% of requests route via the deterministic strategies (no model call). The ~15% confidence-router path adds ~800 ms, but is bounded.
  • Failure mode is an upgrade, not silent degradation. If MINI fails, the user gets a STANDARD response; they never get a half-broken cheap answer.
  • Per-tenant levers. A premium tenant can set default_tier=PREMIUM and bypass the cheap-first chain entirely.
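
The headline figures in the first bullet can be checked with a few lines of arithmetic. The 30-day month is an assumption used only to derive a per-query cost; the monthly totals and query volume are the ones quoted above.

```python
QUERIES_PER_DAY = 50_000
DAYS_PER_MONTH = 30             # assumption, used only for per-query figures
NAIVE_USD_PER_MONTH = 9_000.0
ROUTED_USD_PER_MONTH = 1_980.0


def reduction(before: float, after: float) -> float:
    """Fractional cost reduction going from `before` to `after`."""
    return (before - after) / before


def usd_per_query(monthly_usd: float) -> float:
    """Average cost of one query at the stated daily volume."""
    return monthly_usd / (QUERIES_PER_DAY * DAYS_PER_MONTH)


if __name__ == "__main__":
    print(f"reduction: {reduction(NAIVE_USD_PER_MONTH, ROUTED_USD_PER_MONTH):.0%}")
    print(f"naive:  ${usd_per_query(NAIVE_USD_PER_MONTH):.5f} per query")
    print(f"routed: ${usd_per_query(ROUTED_USD_PER_MONTH):.5f} per query")
```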

Consequences (negative)

  • Four strategies = four places to debug. When a routing decision surprises someone, "which strategy decided?" is a real question. Mitigated by attribution in route_decision and a ?explain=true debug query that returns the decision trace.
  • Threshold maintenance. Confidence threshold (0.5), complexity threshold (4), and keyword patterns are calibrated to the seed dataset. They drift. Runbook entry: weekly re-calibration; monthly threshold sweep.
  • Step-up-on-failure can amplify cost during incidents. A vendor outage on MINI cascades the whole platform onto STANDARD. We mitigate with a circuit breaker per tier (fallback.py::CircuitBreaker) that pins traffic to a healthy tier for ≥ 30 s after consecutive failures.
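
A per-tier circuit breaker of the kind described in the last bullet might be sketched as follows. The failure threshold of 3, the method names, and the injectable clock are assumptions; only the 30 s hold comes from the text, and fallback.py::CircuitBreaker may differ.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures and stays open for a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value
        cooldown_s: float = 30.0,        # the 30 s minimum from the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def record_success(self) -> None:
        # Any success closes the breaker and resets the failure streak.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def allows_traffic(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let traffic probe the tier again.
        return self._clock() - self._opened_at >= self.cooldown_s
```

While a tier's breaker is open, the router would skip that tier and send traffic to the next healthy one, which bounds how long a flapping tier can trigger repeated retries.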

Reversal plan

If the confidence router's ROI drops below break-even (i.e. the 15% extra MINI calls cost more than they save), reverse to a 3-strategy chain:

  1. Set ROUTING_CONFIDENCE_ENABLED=false in .env.
  2. src/routing/router.py::route skips strategy 3.
  3. Re-tune the complexity threshold downward to compensate.
  4. Re-run examples/startup_nightmare.py to verify the cost-savings table.

Estimated effort: ~3 engineer-days including re-tuning.
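
Steps 1 and 2 of the reversal could be wired roughly as below. This is a sketch that assumes the .env file has already been loaded into the process environment; the helper names and the accepted "false" spellings are hypothetical.

```python
import os
from typing import List, Mapping, Optional


def confidence_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read ROUTING_CONFIDENCE_ENABLED; treated as enabled when unset."""
    env = os.environ if env is None else env
    raw = env.get("ROUTING_CONFIDENCE_ENABLED", "true")
    return raw.strip().lower() not in {"false", "0", "no"}


def active_strategies(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Ordered strategy names the chain should run for this process."""
    strategies = ["keyword", "complexity"]
    if confidence_enabled(env):
        strategies.append("confidence")
    strategies.append("fallback")  # the fallback always runs last
    return strategies
```

Passing the environment as a mapping makes the flag easy to test without mutating os.environ.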

References

  • src/routing/triangle.py — cost-latency-quality positions
  • src/routing/strategies.py — 4 routing strategies
  • src/routing/router.py — chain orchestrator
  • src/routing/fallback.py — exponential backoff + circuit breaker
  • src/routing/eval.py — A/B evaluation harness for tier choices
  • src/routing/latency.py — per-tier latency budget enforcement
  • pricing.py — model price table (the input to ROI calculation)
  • ADR-001 (cache layer the router sits behind)
  • ADR-002 (budget engine that gates the chain)