Context
Not every user query should hit the same pipeline. A factual question ("What is the refund policy?") is best answered with RAG. An analytical question ("How many tickets did tenant X open last month?") is best answered with SQL. An open-ended question ("Help me draft a customer email") doesn't need retrieval at all. An ambiguous question is worse than any of those — running it down a wrong path produces confidently-wrong answers.
Three options on the table:
- No routing (RAG everywhere). Simplest. Wastes embeddings + LLM tokens on questions that don't need them. Hallucination rate goes up on open-ended queries because the LLM is forced to ground in irrelevant chunks.
- Hard rules. Regex / keyword-based routing. Brittle; misses paraphrases. We tried this in v0 — it broke on every prompt rewording.
- LLM-based classifier with confidence score. Few-shot prompt classifies into N classes; gates on confidence; falls back to "ambiguous" path on low confidence.
Decision
4-class router with confidence threshold + ambiguous fallback:
# routing/query_router.py
class QueryClass(str, Enum):
FACTUAL = "factual" # → RAG retrieval + grounded generation
ANALYTICAL = "analytical" # → SQL tool + structured response
OPEN_ENDED = "open_ended" # → LLM-only generation (no retrieval)
AMBIGUOUS = "ambiguous" # → clarifying question, don't answer
@dataclass
class RoutingDecision:
query_class: QueryClass
confidence: float # 0.0-1.0 from the classifier
reasoning: str # Why this class, for traces
class QueryRouter:
CONFIDENCE_THRESHOLD = 0.6 # Below this → AMBIGUOUS
async def route(self, query: str) -> RoutingDecision:
decision = await self.llm.with_structured_output(RoutingDecision).ainvoke([
{"role": "system", "content": ROUTER_SYSTEM_PROMPT_WITH_FEW_SHOT},
{"role": "user", "content": query},
])
if decision.confidence < self.CONFIDENCE_THRESHOLD:
return RoutingDecision(
query_class=QueryClass.AMBIGUOUS,
confidence=decision.confidence,
reasoning=f"Below threshold ({decision.confidence:.2f}); asking to clarify",
)
return decision
The confidence threshold is the safety hatch. We'd rather ask for clarification than route a borderline question to the wrong pipeline and produce a confidently-wrong answer.
# orchestration/pipeline.py
async def handle_query(query: str, tenant: Tenant) -> Response:
decision = await router.route(query)
if decision.query_class == QueryClass.FACTUAL:
chunks = await retriever.retrieve(query, tenant.tenant_id, k=5)
return await rag_generator.generate(query, chunks)
elif decision.query_class == QueryClass.ANALYTICAL:
return await sql_agent.handle(query, tenant)
elif decision.query_class == QueryClass.OPEN_ENDED:
return await llm_only.generate(query)
else: # AMBIGUOUS
return Response(text=f"Could you clarify? I'm not sure if you're asking for facts ({decision.reasoning}).")
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Routing decision quality | Hard regex rules | LLM classification — accept the per-query LLM call cost (Haiku, ~$0.001/query) |
| Latency | No router (RAG everywhere) | Router adds ~150-300ms median — accept for hallucination reduction |
| Confidence calibration | Trust every routing decision | Threshold + AMBIGUOUS fallback — accept higher clarification rate (~8% of queries) |
| Class count | 7-10 micro-classes | 4 classes — accept some intra-class diversity for stable router calibration |
| Routing model | GPT-4o (higher accuracy) | Claude Haiku (cheaper) — accuracy gap is <2pp on our gold-set; cost gap is 4x |
Consequences (positive)
- Routing eliminates ~32% of unnecessary retrieval calls (open-ended + ambiguous queries that don't need RAG). At 10k queries/mo, that's ~$3-5/mo of judge cost saved + cleaner answers.
- Hallucination rate (measured via grounding_score in M05's eval) drops from 14% → 6% after wiring the router. Open-ended queries had been forced into RAG; the router gives them the LLM-only path they actually wanted.
- AMBIGUOUS path produces ~8% clarification questions. Users prefer this to confidently-wrong answers — verified in user feedback (M05 feedback loop).
- Routing decision is logged in traces. Debugging is "what class was this?" → check the trace, not "rerun against three pipelines".
- Few-shot examples are in a database (M03's prompt registry) — we can A/B test routing prompts without redeploys.
Consequences (negative)
- Router is on the critical path of every query. A flaky Haiku call delays everything. Mitigation: M04's failure cascade falls back to "default to FACTUAL → RAG" on router timeout.
- 4 classes is a coarse split. Some queries are factual+analytical (e.g. "what was the revenue last quarter?" — needs RAG over reports + SQL aggregation). Mitigation: documented as a known gap; future ADR may add a "compound" class with multi-stage execution.
- Confidence calibration drifts as the few-shot prompt evolves. We monitor via M05's online eval — when the AMBIGUOUS rate spikes, the prompt needs review.
- The router's LLM call is non-deterministic. Same query can route differently across reruns. Mitigation: T=0.0 + cache by (tenant_id, query) for ~5min.
Reversal plan
Drop the router (RAG everywhere): ~2 engineer-days. Trigger: AMBIGUOUS rate > 25% (signal that the router can't classify queries reliably) or router latency dominates the query budget.
Add classes (compound queries, summarization, code-gen): ~1 engineer-week per class. Each class needs a real pipeline behind it; don't add classes for which we can't ship a real handler.
Switch to a cheaper / faster classifier (e.g. fine-tuned MiniLM): ~2 engineer-weeks. Trigger: Haiku cost > $20/mo on the routing path or router p95 latency > 200ms. We've measured this is unlikely at our load.
References
routing/query_router.py— implementationrouting/few_shot_examples.py— calibrated examples (lives in DB)prompts/router_system_prompt.py— few-shot prompt templatetests/test_query_router.py— calibration test suite (held-out gold set)- ADR-001 (router confidence floor ties to SystemContract
correctness_floor) - ADR-002 (FACTUAL path uses the retriever)
- ADR-004 (failure cascade falls back when router times out)