4-class query router with confidence threshold + ambiguous fallback

Status: Accepted · Full-Stack AI Platform · 03: LLM Orchestration & Intelligent Routing
By AI-DE Engineering Team · Stakeholders: ML engineer, platform owner

Context

Not every user query should hit the same pipeline. A factual question ("What is the refund policy?") is best answered with RAG. An analytical question ("How many tickets did tenant X open last month?") is best answered with SQL. An open-ended question ("Help me draft a customer email") doesn't need retrieval at all. An ambiguous question is worse than any of those — running it down a wrong path produces confidently-wrong answers.

Three options on the table:

  • No routing (RAG everywhere). Simplest. Wastes embeddings + LLM tokens on questions that don't need them. Hallucination rate goes up on open-ended queries because the LLM is forced to ground in irrelevant chunks.
  • Hard rules. Regex / keyword-based routing. Brittle; misses paraphrases. We tried this in v0 — it broke on every prompt rewording.
  • LLM-based classifier with confidence score. Few-shot prompt classifies into N classes; gates on confidence; falls back to "ambiguous" path on low confidence.

Decision

4-class router with confidence threshold + ambiguous fallback:

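# routing/query_router.py
from dataclasses import dataclass
from enum import Enum

# Assumed import path and name; see prompts/router_system_prompt.py in References.
from prompts.router_system_prompt import ROUTER_SYSTEM_PROMPT_WITH_FEW_SHOT

class QueryClass(str, Enum):
    FACTUAL = "factual"          # → RAG retrieval + grounded generation
    ANALYTICAL = "analytical"    # → SQL tool + structured response
    OPEN_ENDED = "open_ended"    # → LLM-only generation (no retrieval)
    AMBIGUOUS = "ambiguous"      # → clarifying question, don't answer

@dataclass
class RoutingDecision:
    query_class: QueryClass
    confidence: float            # 0.0-1.0 from the classifier
    reasoning: str               # Why this class, for traces

class QueryRouter:
    CONFIDENCE_THRESHOLD = 0.6   # Below this → AMBIGUOUS

    def __init__(self, llm):
        # Assumption: llm is a LangChain-style chat model exposing
        # with_structured_output(). That method usually expects a Pydantic
        # model, TypedDict, or JSON schema, so RoutingDecision may need to be
        # declared as a Pydantic model rather than a dataclass in practice.
        self.llm = llm

    async def route(self, query: str) -> RoutingDecision:
        # Classify with the few-shot system prompt; the model fills RoutingDecision.
        decision = await self.llm.with_structured_output(RoutingDecision).ainvoke([
            {"role": "system", "content": ROUTER_SYSTEM_PROMPT_WITH_FEW_SHOT},
            {"role": "user", "content": query},
        ])
        # Gate on confidence: borderline classifications become AMBIGUOUS.
        if decision.confidence < self.CONFIDENCE_THRESHOLD:
            return RoutingDecision(
                query_class=QueryClass.AMBIGUOUS,
                confidence=decision.confidence,
                reasoning=f"Below threshold ({decision.confidence:.2f}); asking to clarify",
            )
        return decision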

The confidence threshold is the safety valve. We'd rather ask for clarification than route a borderline question to the wrong pipeline and produce a confidently-wrong answer.
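The gate only works if the classifier's output is well formed. Below is a minimal sketch of a defensive normalization step, assuming the model's structured output arrives as a plain dict. The `normalize_decision` helper and the dict shape are hypothetical and are not part of routing/query_router.py; the enum and dataclass are repeated so the example runs on its own.

```python
from dataclasses import dataclass
from enum import Enum


class QueryClass(str, Enum):
    FACTUAL = "factual"
    ANALYTICAL = "analytical"
    OPEN_ENDED = "open_ended"
    AMBIGUOUS = "ambiguous"


@dataclass
class RoutingDecision:
    query_class: QueryClass
    confidence: float
    reasoning: str


CONFIDENCE_THRESHOLD = 0.6  # same value as QueryRouter.CONFIDENCE_THRESHOLD


def normalize_decision(raw: dict) -> RoutingDecision:
    """Hypothetical helper: turn a raw classifier dict into a safe RoutingDecision."""
    try:
        query_class = QueryClass(raw.get("query_class"))
        confidence = float(raw.get("confidence"))
    except (ValueError, TypeError):
        # Unknown class label or non-numeric confidence: treat as ambiguous.
        return RoutingDecision(QueryClass.AMBIGUOUS, 0.0, "Malformed classifier output")
    # Clamp to [0, 1] so a stray 1.2 or -0.1 cannot distort the gate.
    confidence = min(1.0, max(0.0, confidence))
    if confidence < CONFIDENCE_THRESHOLD:
        return RoutingDecision(
            QueryClass.AMBIGUOUS,
            confidence,
            f"Below threshold ({confidence:.2f}); asking to clarify",
        )
    return RoutingDecision(query_class, confidence, str(raw.get("reasoning", "")))
```

With this in place, an unrecognized label or a missing confidence falls through to the clarification path instead of raising inside the request handler.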

# orchestration/pipeline.py
async def handle_query(query: str, tenant: Tenant) -> Response:
    decision = await router.route(query)
    if decision.query_class == QueryClass.FACTUAL:
        chunks = await retriever.retrieve(query, tenant.tenant_id, k=5)
        return await rag_generator.generate(query, chunks)
    elif decision.query_class == QueryClass.ANALYTICAL:
        return await sql_agent.handle(query, tenant)
    elif decision.query_class == QueryClass.OPEN_ENDED:
        return await llm_only.generate(query)
    else:  # AMBIGUOUS
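        # Keep decision.reasoning in traces only; it is internal and not user-facing.
        return Response(text="Could you clarify? I'm not sure whether you want a documented fact, a data lookup, or help drafting something.")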

Tradeoffs we accept

Lever | Alternative | Chosen
Routing decision quality | Hard regex rules | LLM classification; accept the per-query LLM call cost (Haiku, ~$0.001/query)
Latency | No router (RAG everywhere) | Router adds ~150-300ms median; accept for hallucination reduction
Confidence calibration | Trust every routing decision | Threshold + AMBIGUOUS fallback; accept higher clarification rate (~8% of queries)
Class count | 7-10 micro-classes | 4 classes; accept some intra-class diversity for stable router calibration
Routing model | GPT-4o (higher accuracy) | Claude Haiku (cheaper); accuracy gap is <2pp on our gold set, cost gap is 4x

Consequences (positive)

  • Routing eliminates ~32% of unnecessary retrieval calls (open-ended + ambiguous queries that don't need RAG). At 10k queries/mo, that's ~$3-5/mo of retrieval cost saved + cleaner answers.
  • Hallucination rate (measured via grounding_score in M05's eval) drops from 14% to 6% after wiring the router. Open-ended queries had been forced into RAG; the router gives them the LLM-only path they actually wanted.
  • The AMBIGUOUS path turns ~8% of queries into clarification questions. Users prefer this to confidently-wrong answers, as verified in user feedback (M05 feedback loop).
  • Routing decision is logged in traces. Debugging is "what class was this?" → check the trace, not "rerun against three pipelines".
  • Few-shot examples are in a database (M03's prompt registry) — we can A/B test routing prompts without redeploys.
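One way to A/B test routing prompts is to assign each tenant to a prompt variant deterministically, so a tenant sees consistent routing for the life of an experiment. The sketch below assumes hash-based bucketing; the `prompt_variant` function, the experiment name, and the 50/50 split are illustrative assumptions and are not taken from the M03 prompt registry.

```python
import hashlib


def prompt_variant(tenant_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a tenant to 'control' or 'treatment'."""
    # Salting with the experiment name reshuffles tenants between experiments.
    digest = hashlib.sha256(f"{experiment}:{tenant_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    # Hypothetical tenant and experiment identifiers.
    print(prompt_variant("tenant-123", "router-prompt-v2"))
```

Because the assignment depends only on the tenant and experiment names, no extra state has to be stored to keep a tenant in the same arm.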

Consequences (negative)

  • Router is on the critical path of every query. A flaky Haiku call delays everything. Mitigation: M04's failure cascade falls back to "default to FACTUAL → RAG" on router timeout.
  • 4 classes is a coarse split. Some queries are factual+analytical (e.g. "what was the revenue last quarter?" — needs RAG over reports + SQL aggregation). Mitigation: documented as a known gap; future ADR may add a "compound" class with multi-stage execution.
  • Confidence calibration drifts as the few-shot prompt evolves. We monitor via M05's online eval — when the AMBIGUOUS rate spikes, the prompt needs review.
  • The router's LLM call is non-deterministic: the same query can route differently across reruns. Mitigation: temperature 0.0 plus a cache keyed by (tenant_id, query) with a ~5min TTL.

Reversal plan

Drop the router (RAG everywhere): ~2 engineer-days. Trigger: AMBIGUOUS rate > 25% (signal that the router can't classify queries reliably) or router latency dominates the query budget.
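The 25% trigger can be checked with a rolling window over recent routing decisions. The sketch below is one possible shape, assuming decisions are available as class-label strings; the `AmbiguousRateMonitor` class and the 1000-query window are hypothetical and do not describe M05's online eval.

```python
from collections import deque


class AmbiguousRateMonitor:
    """Track the share of AMBIGUOUS decisions over the last `window` queries."""

    def __init__(self, window: int = 1000, trigger: float = 0.25):
        self._recent = deque(maxlen=window)  # True for each AMBIGUOUS decision
        self.trigger = trigger

    def record(self, query_class: str) -> None:
        self._recent.append(query_class == "ambiguous")

    @property
    def rate(self) -> float:
        return sum(self._recent) / len(self._recent) if self._recent else 0.0

    def reversal_triggered(self) -> bool:
        # Only evaluate once the window is full, to avoid noisy early alarms.
        return len(self._recent) == self._recent.maxlen and self.rate > self.trigger
```

At the ~8% clarification rate reported above, the monitor stays well below the trigger; a sustained rise toward 25% is the signal to review the prompt or reconsider the router.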

Add classes (compound queries, summarization, code-gen): ~1 engineer-week per class. Each class needs a real pipeline behind it; don't add classes for which we can't ship a real handler.

Switch to a cheaper / faster classifier (e.g. fine-tuned MiniLM): ~2 engineer-weeks. Trigger: Haiku cost > $20/mo on the routing path or router p95 latency > 200ms. Our measurements suggest this is unlikely at current load.
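The cost trigger can be sanity-checked with the figures quoted in this ADR. This is back-of-the-envelope arithmetic only, assuming the ~$0.001/query estimate holds and that every query makes exactly one router call (no cache hits).

```python
QUERIES_PER_MONTH = 10_000      # load figure quoted under positive consequences
HAIKU_COST_PER_QUERY = 0.001    # ~$0.001/query from the tradeoffs table
COST_TRIGGER_PER_MONTH = 20.0   # reversal trigger for the routing path

monthly_cost = QUERIES_PER_MONTH * HAIKU_COST_PER_QUERY
breakeven_queries = COST_TRIGGER_PER_MONTH / HAIKU_COST_PER_QUERY

print(f"Router cost: ${monthly_cost:.2f}/mo")                       # Router cost: $10.00/mo
print(f"Trigger reached at {breakeven_queries:,.0f} queries/mo")    # Trigger reached at 20,000 queries/mo
```

Under these assumptions the routing path costs about half the trigger amount, and query volume would need to roughly double before the cost condition fires.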

References

  • routing/query_router.py — implementation
  • routing/few_shot_examples.py — calibrated examples (lives in DB)
  • prompts/router_system_prompt.py — few-shot prompt template
  • tests/test_query_router.py — calibration test suite (held-out gold set)
  • ADR-001 (router confidence floor ties to SystemContract correctness_floor)
  • ADR-002 (FACTUAL path uses the retriever)
  • ADR-004 (failure cascade falls back when router times out)