LLM gateway with fallback chain (gpt-4o → gpt-4o-mini)

✓ Accepted · Enterprise RAG · 05: AI Platform Orchestration & Multi-Tenancy
By AI-DE Engineering Team · Stakeholders: platform engineer, ML platform lead, on-call SRE

Context

By Module 04, every API handler in the codebase calls the LLM directly via OpenAIClient or AnthropicClient (Module 03's multi-provider client). That works locally but fails to meet three production needs:

  1. Vendor outage resilience. OpenAI status pages document multi-hour outages on average ~1× per quarter. When that happens, a single client.chat.completions.create call inside a request handler fails with an unhandled APIError.
  2. Cost attribution per tenant. Multi-tenant deployments need to track token spend per tenant for chargebacks, quotas, and overage detection. Wiring per-tenant cost tracking inline in every call site is repetitive and easy to miss.
  3. Response caching. Many production RAG queries are repeats — identical queries from the same tenant on the same documents should not pay for a second LLM call inside a 1-hour window.

A library-style fix (e.g. an SDK wrapper around the existing client) addresses (3) but not (1) and (2). What we needed was a gateway: a single boundary that the rest of the platform calls through (in-process in v1, as described under Decision), where outages, costs, and caching are handled in one place.

Decision

We adopt a single LLM gateway (gateway/llm_gateway.py) that the entire platform calls through, with these features:

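# gateway/llm_gateway.py: sketch of the gateway interface.
# LLMRequest, LLMResponse, the exception types, and the private helpers
# (_tenant_allows, _cache_key, _models_for, _provider_call, _record_cost)
# live elsewhere in the module and are omitted here.
class LLMGateway:
    def __init__(self):
        self.fallback_chain = ["gpt-4o", "gpt-4o-mini"]
        self.response_cache_ttl = 3600           # seconds (1 hour)
        self.tenant_rate_limits: dict[str, int] = {}
        self.tenant_cost_caps: dict[str, float] = {}
        # self.cache (the Redis-backed response cache) is wired up separately.

    async def complete(self, req: LLMRequest) -> LLMResponse:
        # 1. Tenant quota check (rate limit and cost cap)
        if not self._tenant_allows(req.tenant_id, req.model):
            raise QuotaExceeded(req.tenant_id)

        # 2. Cache lookup: a hit returns early and makes no provider call
        cache_key = self._cache_key(req)
        if cached := await self.cache.get(cache_key):
            return cached

        # 3. Fallback chain: try each model in order, stepping down on failure
        last_error: Exception | None = None
        for model in self._models_for(req):
            try:
                resp = await self._provider_call(model, req)
            except (APITimeout, RateLimitError, APIServerError) as e:
                last_error = e                   # transient: try the next model
                continue
            except APIError as e:
                if e.is_retriable():
                    last_error = e
                    continue
                raise                            # non-retriable: surface at once
            await self.cache.set(cache_key, resp, ttl=self.response_cache_ttl)
            await self._record_cost(req.tenant_id, resp)
            return resp

        # Every model in the chain failed; keep the last error as the cause
        raise AllProvidersUnavailable() from last_error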

The gateway is a library, not a service, in v1: it is instantiated as a singleton in the FastAPI app. Service extraction is documented as a v2 path in DESIGN.md, to be taken once cross-process traffic justifies it.

Fallback chain default: [gpt-4o, gpt-4o-mini]. The premium model is tried first; on an outage, rate limit, or 5xx response the gateway falls through to the cheaper model automatically. The alternative, returning an error to the user, turns a partial degradation into a failed request, which is strictly worse.
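
The fall-through behaviour can be illustrated with a minimal, self-contained sketch. The provider call is replaced by a stub, and the names ProviderDown, complete_with_fallback, and stub_call are hypothetical; they are not taken from gateway/llm_gateway.py.

```python
import asyncio


class ProviderDown(Exception):
    """Stand-in for a timeout, rate-limit, or 5xx error (hypothetical name)."""


class AllProvidersUnavailable(Exception):
    """Raised when every model in the chain has failed."""


async def complete_with_fallback(chain, call, prompt):
    """Try each model in order; return (model, answer) from the first success."""
    last_error = None
    for model in chain:
        try:
            return model, await call(model, prompt)
        except ProviderDown as exc:
            last_error = exc  # remember why this model failed, then step down
    raise AllProvidersUnavailable(chain) from last_error


async def stub_call(model, prompt):
    """Fake provider: the premium model is 'down', the cheaper one answers."""
    if model == "gpt-4o":
        raise ProviderDown(f"{model}: 503")
    return f"[{model}] answer to: {prompt}"


used_model, answer = asyncio.run(
    complete_with_fallback(["gpt-4o", "gpt-4o-mini"], stub_call, "What is RAG?")
)
print(used_model, "->", answer)
```

Returning the model that actually answered alongside the response is one way to make the step-down visible to callers and metrics.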

Cache TTL: 1 hour. Long enough to absorb the burst of identical queries that follows any popular update; short enough that tenants who update documents see fresh answers within a reasonable window. Configurable per tenant.
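
A minimal sketch of a per-tenant TTL override, assuming an in-memory dictionary in place of the Redis-backed cache. The TTLCache class, its tenant_ttl mapping, and the tenant names are illustrative assumptions, not the project's actual cache API.

```python
import time


class TTLCache:
    """In-memory stand-in for the Redis-backed cache (illustration only)."""

    def __init__(self, default_ttl=3600, clock=time.monotonic):
        self.default_ttl = default_ttl   # seconds; the 1-hour default
        self.tenant_ttl = {}             # hypothetical per-tenant overrides
        self._clock = clock              # injectable so tests can control time
        self._store = {}                 # (tenant_id, key) -> (expires_at, value)

    def ttl_for(self, tenant_id):
        return self.tenant_ttl.get(tenant_id, self.default_ttl)

    def set(self, tenant_id, key, value):
        expires_at = self._clock() + self.ttl_for(tenant_id)
        self._store[(tenant_id, key)] = (expires_at, value)

    def get(self, tenant_id, key):
        entry = self._store.get((tenant_id, key))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[(tenant_id, key)]   # lazily evict expired entries
            return None
        return value


now = [0.0]                              # fake clock, in seconds
cache = TTLCache(clock=lambda: now[0])
cache.tenant_ttl["acme"] = 300           # tenant whose documents change often
cache.set("acme", "q1", "answer A")
cache.set("globex", "q1", "answer B")
now[0] = 600                             # ten minutes later
acme_hit = cache.get("acme", "q1")       # expired under the 5-minute override
globex_hit = cache.get("globex", "q1")   # still inside the 1-hour default
print(acme_hit, globex_hit)
```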

Per-tenant rate limit + cost cap: stored in Redis keys rl:{tenant_id}:{minute} and cost:{tenant_id}:{YYYY-MM}. Enforced inside the gateway so call sites don't have to know about quotas.
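
The key scheme can be sketched as a fixed-window counter plus a monthly spend check. This is a sketch under stated assumptions: FakeRedis is a dictionary-backed stand-in exposing only incr and incrbyfloat, the minute format in the rate-limit key is a guess, and tenant_allows here is not the gateway's own _tenant_allows.

```python
from datetime import datetime, timezone


def rate_limit_key(tenant_id, now):
    """rl:{tenant_id}:{minute}; the minute format is an assumption."""
    return f"rl:{tenant_id}:{now:%Y%m%d%H%M}"


def cost_key(tenant_id, now):
    """cost:{tenant_id}:{YYYY-MM}."""
    return f"cost:{tenant_id}:{now:%Y-%m}"


class FakeRedis:
    """Dict-backed stand-in for the two counter commands this sketch needs."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def incrbyfloat(self, key, amount):
        self.data[key] = self.data.get(key, 0.0) + amount
        return self.data[key]


def tenant_allows(store, tenant_id, now, per_minute_limit, monthly_cap_usd):
    """Reject when the monthly cap is reached or the minute window is full."""
    spent = store.data.get(cost_key(tenant_id, now), 0.0)
    if spent >= monthly_cap_usd:
        return False
    # A real deployment would also set an expiry on the per-minute key.
    return store.incr(rate_limit_key(tenant_id, now)) <= per_minute_limit


now = datetime(2025, 3, 14, 9, 26, tzinfo=timezone.utc)
store = FakeRedis()
results = [tenant_allows(store, "acme", now, 2, 50.0) for _ in range(3)]
print(results)  # the third call in the same minute exceeds the limit of 2
```

Because the minute is part of the key, the counter resets naturally when the window rolls over; no explicit reset is needed.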

Tradeoffs we accept

| Lever | Alternative | Chosen |
|---|---|---|
| Outage resilience | Per-call try/except in handlers | Centralised fallback chain |
| Cost attribution | Inline tenant tagging at call site | Gateway-recorded per-tenant cost |
| Caching | Per-handler caching | Centralised hash-of-prompt cache |
| Process model | Out-of-process gateway service | In-process library gateway (v1) |
| Provider lock-in | Single SDK | Multi-provider with adapter |

The largest concrete trade-off is in-process vs out-of-process. We picked in-process because at v1 traffic the cross-process IPC overhead (~5 ms per call) outweighs the deployability benefit. The decision is reversible: the LLMGateway interface is the same whether it is a library or a service.
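
The reversibility argument rests on call sites depending only on the interface. A minimal sketch of that idea follows, using typing.Protocol. The request and response fields, the Gateway protocol, and both implementations are assumptions for illustration; the remote variant only simulates a network hop.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class LLMRequest:
    tenant_id: str   # field names are assumed, not taken from the project
    prompt: str


@dataclass
class LLMResponse:
    text: str


class Gateway(Protocol):
    async def complete(self, req: LLMRequest) -> LLMResponse: ...


class InProcessGateway:
    """v1 shape: a plain object living inside the app process."""

    async def complete(self, req: LLMRequest) -> LLMResponse:
        return LLMResponse(text=f"in-process: {req.prompt}")


class RemoteGateway:
    """v2 shape: same method, but it would forward over HTTP (stubbed here)."""

    async def complete(self, req: LLMRequest) -> LLMResponse:
        await asyncio.sleep(0)  # placeholder for the network round trip
        return LLMResponse(text=f"remote: {req.prompt}")


async def handler(gateway: Gateway, prompt: str) -> str:
    """A call site depends on the interface, not on the deployment model."""
    resp = await gateway.complete(LLMRequest(tenant_id="acme", prompt=prompt))
    return resp.text


print(asyncio.run(handler(InProcessGateway(), "hi")))
print(asyncio.run(handler(RemoteGateway(), "hi")))
```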

Consequences (positive)

  • Single point for cost + outage + cache. Every change to vendor pricing, fallback policy, or cache TTL lands in one file.
  • Tenants get coherent cost attribution. _record_cost(tenant_id, …) fires inside the gateway on every successful provider call; the cost manager (app/cost_manager.py) reads the cost:{tenant_id}:{YYYY-MM} keys for per-tenant chargeback.
  • Cache hit rate is measurable (20% observed). The gateway emits the Prometheus counters llm_gateway_cache_hits and llm_gateway_calls, and the hit rate is their ratio. The cost-model CSV uses this measured 20% rate as its optimization-lever assumption.
  • Vendor outage tested in CI. A test fixture (tests/test_gateway.py) injects a fault on the primary provider and asserts the fallback fires within the latency budget.
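
To make the cache hit-rate bullet concrete, here is a small arithmetic sketch. It assumes llm_gateway_calls counts every gateway call, including those served from cache; the call volume and per-call price are hypothetical figures, not values from the cost-model CSV.

```python
def cache_hit_rate(cache_hits, total_calls):
    """Fraction of gateway calls answered from the cache."""
    return cache_hits / total_calls if total_calls else 0.0


def provider_spend(total_calls, cost_per_call_usd, hit_rate):
    """Spend after caching: only cache misses reach the provider."""
    return total_calls * (1.0 - hit_rate) * cost_per_call_usd


# Hypothetical month: 100,000 gateway calls, 20,000 served from cache,
# at an assumed average of $0.01 per provider call.
rate = cache_hit_rate(20_000, 100_000)
spend = provider_spend(100_000, 0.01, rate)
print(f"hit rate: {rate:.0%}, provider spend: ${spend:,.2f}")
```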

Consequences (negative)

  • In-process singleton complicates testing. Test setup has to wipe Redis state between runs; the cache and rate-limit keys are real Redis keys. We mitigate with a test-only Redis namespace.
  • Step-down on failure can mask provider issues. If gpt-4o is consistently failing, automatic fallback to gpt-4o-mini means users see worse answers rather than an error. We surface the fallback rate as a first-class Prometheus metric so this shows up in dashboards.
  • Cache key sensitivity. Without normalisation, queries that differ only in whitespace or letter case miss the cache. We normalise prompts (lowercase, collapse whitespace) before hashing, but the right normalisation is workload-dependent, so the rules in use are documented.
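
The normalisation described in the cache-key bullet can be sketched as follows. The function names, the llmcache: prefix, and the choice to scope the key by tenant are assumptions of this sketch; a real key would likely also need to cover the retrieved documents.

```python
import hashlib
import re


def normalise_prompt(prompt):
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def cache_key(tenant_id, prompt):
    """Hash the tenant and the normalised prompt into one cache key."""
    # A unit-separator character keeps the tenant and prompt fields distinct.
    payload = "\x1f".join([tenant_id, normalise_prompt(prompt)])
    return "llmcache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = cache_key("acme", "What  is\nRAG? ")
b = cache_key("acme", "what is rag?")
c = cache_key("globex", "what is rag?")
print(a == b, a == c)
```

Lowercasing is a lossy choice: it is reasonable for natural-language questions but would wrongly merge prompts where case carries meaning, such as code snippets.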

Reversal plan

If the gateway's centralisation becomes the bottleneck (e.g. cache contention at higher scale, or the in-process singleton becoming a lock-contention point), the reversal is to extract the gateway as an out-of-process service:

  1. Extract gateway/llm_gateway.py as a separate FastAPI service.
  2. Replace in-process LLMGateway() instances with HTTP clients.
  3. Add the ~5 ms IPC overhead to the latency budget and verify that p95 latency is still within bounds.
  4. Document the v2 deployment topology in DESIGN.md.

Estimated effort: ~5 engineer-days. The interface is preserved, so call sites don't change.

References

  • gateway/llm_gateway.py — the gateway implementation
  • app/cost_manager.py — reads gateway-recorded costs for chargeback
  • app/cache.py + app/cache_stats.py — Redis-backed response cache
  • tenant/tenant_manager.py — quota + rate-limit configuration
  • runbook/failure_modes.md — runbook entry "primary LLM provider down"
  • DESIGN.md — 4-layer architecture diagram (gateway sits at L2)
  • ADR-001 / ADR-002 (retrieval pipeline that produces gateway inputs)
