Context
Module 02 made cost visible. Module 03 cut the bill through caching. Module
04 is where the platform decides which model to use — and that decision is
the single largest lever left, because the price gap between GPT-4o-mini and
GPT-4o is ~25× per output token (pricing.py: $0.60/M vs $15.00/M output).
Three competing forces govern the routing decision:
- Cost. Cheaper models exist; sending everything to the premium tier is the "naive 50k queries/day startup" anti-pattern documented in examples/startup_nightmare.py.
- Quality. Some prompts genuinely need the premium model; sending them all to the mini tier is the symmetric anti-pattern (cheap + wrong).
- Latency. Routing decisions sit on the hot path. A 200 ms classifier that calls a model just to pick a model defeats the purpose.
Naive routing patterns we ruled out:
- Static keyword matching only. "If prompt contains 'analyse' → premium" works for ~30% of traffic; collapses on free-form chat.
- Always-classify-first. Run a cheap model on every prompt to decide; adds 200 ms p95 and an extra model call to every request.
- Premium-by-default with budget cap. Just route everything to GPT-4o until the budget is hit. Bad for cost; worse on latency when the budget flips and traffic suddenly fails.
What we wanted: a routing decision that costs a few milliseconds, gets the easy cases right cheaply, and escalates only the cases that need it.
Decision
We adopt a 3-tier routing triangle (MINI / STANDARD / PREMIUM) with an explicit fallback chain keyed on a four-strategy classifier:
```python
# src/routing/router.py: pseudocode of the chain
async def route(prompt: str, ctx: RequestContext) -> RouteDecision:
    # 1. Cheap deterministic strategies first
    if decision := keyword_router.classify(prompt):
        return decision
    if decision := complexity_router.classify(prompt):
        return decision
    # 2. Cheap model self-confidence (only if step 1 was inconclusive)
    if decision := await confidence_router.classify(prompt):
        return decision
    # 3. Always-fallback chain: never returns None
    return fallback_chain.default(ctx)
```
The triangle (triangle.py): every model is positioned on a 3-axis
plot (cost / latency / quality). Three named tiers map to vendor models:
| Tier | Model | $/1M in | $/1M out | p95 latency | Use case |
|---|---|---|---|---|---|
| MINI | gpt-4o-mini | $0.15 | $0.60 | 800 ms | Greetings, classification, FAQ |
| STANDARD | claude-3-5-sonnet | $3.00 | $15.00 | 1.6 s | Reasoning, summarisation |
| PREMIUM | gpt-4o | $2.50 | $10.00 | 1.4 s | Analysis, generation, agent loops |
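To make the table concrete, the sketch below turns the listed prices into a per-request cost. The request shape (500 input tokens, 300 output tokens) and the names `TIER_PRICES` and `request_cost` are assumptions made for illustration; they are not taken from pricing.py.

```python
# Prices copied from the tier table above, in USD per 1M tokens (input, output).
# TIER_PRICES and request_cost are hypothetical names used only in this sketch.
TIER_PRICES = {
    "MINI": (0.15, 0.60),
    "STANDARD": (3.00, 15.00),
    "PREMIUM": (2.50, 10.00),
}


def request_cost(tier: str, tokens_in: int, tokens_out: int) -> float:
    """Return the USD cost of a single request on the given tier."""
    price_in, price_out = TIER_PRICES[tier]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


if __name__ == "__main__":
    # Assumed request shape: 500 input tokens, 300 output tokens.
    for tier in TIER_PRICES:
        print(f"{tier:9s} ${request_cost(tier, 500, 300):.6f}")
```

For that assumed request shape the result is about $0.000255 on MINI, $0.006 on STANDARD and $0.00425 on PREMIUM, which is why the share of traffic that stays on MINI dominates the bill.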
The four routing strategies (strategies.py):
- Keyword router: pattern match on prompt text → tier. Catches the obvious 30%. ~0.5 ms.
- Complexity router: heuristic score (1–5) over prompt length, lexical diversity, code-block presence. Score ≥ 4 → escalate to STANDARD. ~2 ms.
- Confidence router: runs MINI; if confidence < 0.5 (self-reported via logprobs), re-runs on STANDARD. Adds one MINI call to ~15% of requests.
- Fallback chain: when the three above are inconclusive, default to the tier marked in RequestContext.default_tier (per-tenant default).
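The two deterministic strategies could look roughly like the sketch below. Only the inputs (prompt length, lexical diversity, code-block presence), the 1–5 range and the ≥ 4 threshold come from this ADR. The keyword patterns, the word-count cut-offs, the 0.6 diversity cut-off and all function names are assumptions; the real logic lives in strategies.py.

```python
import re
from typing import Optional

FENCE = "`" * 3  # Markdown code-fence marker

# Hypothetical patterns, for illustration only.
KEYWORD_TIERS = [
    (re.compile(r"^\s*(hi|hello|thanks|thank you)\b", re.I), "MINI"),
    (re.compile(r"\b(analy[sz]e|architecture|refactor)\b", re.I), "PREMIUM"),
]


def keyword_classify(prompt: str) -> Optional[str]:
    """Return a tier on the first pattern match, else None (inconclusive)."""
    for pattern, tier in KEYWORD_TIERS:
        if pattern.search(prompt):
            return tier
    return None


def complexity_score(prompt: str) -> int:
    """Score 1-5 from length, lexical diversity and code-block presence."""
    words = prompt.split()
    score = 1
    if len(words) > 50:   # assumed cut-off
        score += 1
    if len(words) > 200:  # assumed cut-off
        score += 1
    # Lexical diversity: share of distinct words, checked only on longer prompts.
    if len(words) > 20 and len({w.lower() for w in words}) / len(words) > 0.6:
        score += 1
    if FENCE in prompt:
        score += 1
    return score


def complexity_classify(prompt: str) -> Optional[str]:
    """Escalate to STANDARD at score >= 4; otherwise stay inconclusive."""
    return "STANDARD" if complexity_score(prompt) >= 4 else None
```

Both functions return None when they cannot decide, which is what lets the chain fall through to the next strategy.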
Fallback resilience (fallback.py): every tier has an exponential-backoff
retry policy (base 200 ms × 2^retry + jitter). On hard failure (timeout,
rate limit, 5xx), the chain steps up to the next tier — not down. Stepping
down on failure is the cost engineer's instinct; stepping up is the
SRE's. We chose up because a failed cheap call followed by a
successful expensive call is recoverable; a failed expensive call followed
by a successful cheap call probably degraded quality silently.
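A minimal sketch of the retry delay and the step-up rule follows. The 200 ms base and the doubling come from the policy above. The jitter range (0–100 ms), the tier order MINI → STANDARD → PREMIUM, the behaviour at the top tier (nothing left to step up to) and the function names are assumptions, not the contents of fallback.py.

```python
import random
from typing import Optional

TIER_ORDER = ["MINI", "STANDARD", "PREMIUM"]  # assumed step-up order
BASE_DELAY_S = 0.2  # 200 ms base from the retry policy


def backoff_delay(retry: int, max_jitter_s: float = 0.1) -> float:
    """Delay before retry number `retry` (0-based): base * 2^retry + jitter."""
    return BASE_DELAY_S * (2 ** retry) + random.uniform(0.0, max_jitter_s)


def next_tier_up(tier: str) -> Optional[str]:
    """Tier to try after a hard failure, or None if no tier sits above."""
    idx = TIER_ORDER.index(tier)
    return TIER_ORDER[idx + 1] if idx + 1 < len(TIER_ORDER) else None
```

With these assumptions the delays before successive retries are roughly 0.2 s, 0.4 s and 0.8 s plus jitter, and a hard failure on MINI moves the request to STANDARD.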
Every routing decision lands in llm_requests.route_decision (a string
enum) so the analytics layer can answer "what fraction of requests escalated,
and why?"
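The ADR does not list the enum members, so the values below are hypothetical. The sketch only shows how a string enum in that column lets the analytics layer compute the escalation share per reason.

```python
from collections import Counter
from enum import Enum


class RouteDecisionLabel(str, Enum):
    """Hypothetical values for llm_requests.route_decision."""
    KEYWORD = "keyword"
    COMPLEXITY = "complexity"
    CONFIDENCE_ESCALATED = "confidence_escalated"
    FALLBACK_DEFAULT = "fallback_default"
    FAILURE_STEP_UP = "failure_step_up"


# Labels counted as escalations in this sketch (an assumption).
ESCALATIONS = (
    RouteDecisionLabel.CONFIDENCE_ESCALATED,
    RouteDecisionLabel.FAILURE_STEP_UP,
)


def escalation_breakdown(decisions: list[str]) -> dict[str, float]:
    """Fraction of all requests that escalated, keyed by escalation reason."""
    if not decisions:
        return {}
    counts = Counter(RouteDecisionLabel(d) for d in decisions)
    return {label.value: counts[label] / len(decisions) for label in ESCALATIONS}
```

Parsing each row through the enum means an unknown label raises a ValueError instead of being silently counted.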
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Decision cost | Always-classify | Cheap-first chain with escalation |
| Quality floor | Always-premium | MINI + confidence escalation |
| Failure direction | Step down on error (cheap) | Step up on error (resilient) |
| Tunability | Hard-coded tier mapping | Per-tenant default_tier + per-strategy thresholds |
| Observability | Single "model" column | route_decision enum + 4-strategy attribution |
The largest concrete cost is the confidence router's extra MINI call on ~15% of requests. At the seed-dataset workload that adds ~$0.04/day per active user, roughly 8% of the cost it saves. Net: about a 12× return on the confidence strategy.
Consequences (positive)
- Cost cuts demonstrable. Reference scenario in examples/startup_nightmare.py shows a 78% reduction (50k queries/day, from $9k/mo naive → $1.98k/mo with this triangle + cache).
- Latency stays predictable. ~85% of requests route via the deterministic strategies (no model call for the routing decision). The ~15% confidence-router path adds ~800 ms, but is bounded.
- Failure mode is an upgrade, not silent degradation. If MINI fails, the user gets a STANDARD response; they never get a half-broken cheap answer.
- Per-tenant levers. A premium tenant can set default_tier=PREMIUM and bypass the cheap-first chain entirely.
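The headline figures can be checked with a few lines of arithmetic. The monthly totals and the query volume come from the reference scenario; the 30-day month and the function names are assumptions.

```python
QUERIES_PER_DAY = 50_000
DAYS_PER_MONTH = 30  # assumption; the scenario does not state the month length


def per_query_cost(monthly_usd: float) -> float:
    """Average USD cost of one query at the scenario's volume."""
    return monthly_usd / (QUERIES_PER_DAY * DAYS_PER_MONTH)


def reduction(before_usd: float, after_usd: float) -> float:
    """Fractional cost reduction from before to after."""
    return 1 - after_usd / before_usd


if __name__ == "__main__":
    naive, routed = 9_000.0, 1_980.0
    print(f"reduction: {reduction(naive, routed):.0%}")
    print(f"per query: ${per_query_cost(naive):.5f} -> ${per_query_cost(routed):.5f}")
```

Under the 30-day assumption the average cost per query falls from $0.006 to $0.00132, which is the 78% reduction quoted above.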
Consequences (negative)
- Four strategies = four places to debug. When a routing decision surprises someone, "which strategy decided?" is a real question. Mitigated by attribution in route_decision and a ?explain=true debug query that returns the decision trace.
- Threshold maintenance. Confidence threshold (0.5), complexity threshold (4), and keyword patterns are calibrated to the seed dataset. They drift. Runbook entry: weekly re-calibration; monthly threshold sweep.
- Step-up-on-failure can amplify cost during incidents. A vendor outage on MINI cascades the whole platform onto STANDARD. We mitigate with a circuit breaker per tier (fallback.py::CircuitBreaker) that pins traffic to a healthy tier for ≥ 30 s after consecutive failures.
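The per-tier breaker idea can be sketched as follows. This is not the code in fallback.py::CircuitBreaker: the class name `TierBreaker`, the threshold of 3 consecutive failures and the injectable clock are assumptions. Only the 30 s cooldown comes from this ADR.

```python
import time
from typing import Callable, Optional


class TierBreaker:
    """Minimal per-tier breaker: opens after N consecutive failures."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value
        cooldown_s: float = 30.0,        # 30 s from the ADR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def record_success(self) -> None:
        """A success closes the breaker and resets the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def allows_traffic(self) -> bool:
        """False while open; True again once the cooldown has elapsed."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown_s
```

While a tier's breaker is open the router would skip that tier and send requests to one that is still healthy. Because the failure count is kept after the cooldown, a single further failure re-opens the breaker immediately, which approximates a half-open probe.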
Reversal plan
If the confidence router's ROI drops below break-even (i.e. the 15% extra MINI calls cost more than they save), reverse to a 3-strategy chain:
- Set ROUTING_CONFIDENCE_ENABLED=false in .env. src/routing/router.py::route then skips strategy 3 (the confidence router).
- Re-tune the complexity threshold downward to compensate.
- Re-run examples/startup_nightmare.py to verify the cost-savings table.
Estimated effort: ~3 engineer-days including re-tuning.
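One way the flag could be read is sketched below. The flag name comes from the reversal plan. The helper name, the default of "enabled when unset" and the set of accepted false values are assumptions, and the sketch presumes that .env has already been loaded into the process environment.

```python
import os
from typing import Mapping, Optional

# Values treated as "disabled" in this sketch (an assumption).
FALSE_VALUES = {"false", "0", "no", "off"}


def confidence_routing_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read ROUTING_CONFIDENCE_ENABLED; defaults to enabled when unset."""
    env = os.environ if env is None else env
    raw = env.get("ROUTING_CONFIDENCE_ENABLED", "true")
    return raw.strip().lower() not in FALSE_VALUES
```

Passing the environment as a parameter keeps the check testable without mutating os.environ.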
References
- src/routing/triangle.py: cost-latency-quality positions
- src/routing/strategies.py: 4 routing strategies
- src/routing/router.py: chain orchestrator
- src/routing/fallback.py: exponential backoff + circuit breaker
- src/routing/eval.py: A/B evaluation harness for tier choices
- src/routing/latency.py: per-tier latency budget enforcement
- pricing.py: model price table (the input to ROI calculation)
- ADR-001 (cache layer the router sits behind)
- ADR-002 (budget engine that gates the chain)