Context
By Module 04 every API handler in the codebase calls the LLM directly via
OpenAIClient or AnthropicClient (Module 03's multi-provider client).
That works locally but fails three real production needs:
- Vendor outage resilience. OpenAI's status page documents multi-hour outages roughly once per quarter on average. When that happens, a single client.chat.completions call inside a request handler dies with an unhandled APIError.
- Cost attribution per tenant. Multi-tenant deployments need to track token spend per tenant for chargebacks, quotas, and overage detection. Wiring per-tenant cost tracking inline at every call site is repetitive and easy to miss.
- Response caching. Many production RAG queries are repeats — identical queries from the same tenant on the same documents should not pay for a second LLM call inside a 1-hour window.
A library-style fix (e.g. an SDK wrapper around the existing client) addresses response caching but not outage resilience or cost attribution. What we needed was a gateway: a single entry point the rest of the platform calls through, where outages, costs, and caching are handled in one place.
Decision
We adopt a single LLM gateway (gateway/llm_gateway.py) that the
entire platform calls through, with these features:
# gateway/llm_gateway.py: pseudocode of the gateway interface.
# LLMRequest, LLMResponse, the exception types, and the underscore-prefixed
# helpers are assumed to be defined elsewhere; they are not shown here.
class LLMGateway:
    def __init__(self, cache):
        self.cache = cache  # async response cache (Redis-backed, see app/cache.py)
        self.fallback_chain = ["gpt-4o", "gpt-4o-mini"]
        self.response_cache_ttl = 3600  # 1 hour
        self.tenant_rate_limits: dict[str, int] = {}
        self.tenant_cost_caps: dict[str, float] = {}

    async def complete(self, req: LLMRequest) -> LLMResponse:
        # 1. Tenant quota check
        if not self._tenant_allows(req.tenant_id, req.model):
            raise QuotaExceeded(req.tenant_id)

        # 2. Cache lookup
        cache_key = self._cache_key(req)
        if (cached := await self.cache.get(cache_key)) is not None:
            return cached

        # 3. Fallback chain: try each model in order until one succeeds
        for model in self._models_for(req):
            try:
                resp = await self._provider_call(model, req)
            except (APITimeout, RateLimitError, APIServerError):
                continue  # transient failure: fall through to the next model
            except APIError as e:
                if e.is_retriable():
                    continue
                raise  # non-retriable errors propagate to the caller
            # Record cost before caching so a cache failure cannot lose spend.
            await self._record_cost(req.tenant_id, resp)
            await self.cache.set(cache_key, resp, ttl=self.response_cache_ttl)
            return resp
        raise AllProvidersUnavailable()
The gateway is a library, not a service, in v1 — instantiated as a
singleton in the FastAPI app. Service-extraction is documented as a v2
path in DESIGN.md once cross-process traffic justifies it.
Fallback chain default: [gpt-4o, gpt-4o-mini]. The premium model is tried
first; on timeout, rate limit, or 5xx the gateway falls through to the
cheaper model automatically. The alternative, returning an error to the
user, is worse than serving a degraded answer.
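The fall-through behaviour can be sketched in isolation. This is a minimal illustration, not the code in gateway/llm_gateway.py: TransientProviderError, complete_with_fallback, and the flaky_provider stub are hypothetical names introduced here, and the stub simply simulates an outage on the premium model.

```python
import asyncio


class TransientProviderError(Exception):
    """Hypothetical stand-in for timeouts, rate limits, and 5xx responses."""


class AllProvidersUnavailable(Exception):
    """Raised when every model in the chain has failed."""


async def complete_with_fallback(prompt, chain, call_provider):
    """Try each model in order; return (model, text) from the first success."""
    for model in chain:
        try:
            return model, await call_provider(model, prompt)
        except TransientProviderError:
            continue  # fall through to the next model in the chain
    raise AllProvidersUnavailable(chain)


async def flaky_provider(model, prompt):
    """Stub provider (assumption): the premium model is down, the cheap one works."""
    if model == "gpt-4o":
        raise TransientProviderError("simulated outage")
    return f"[{model}] answer to: {prompt}"


if __name__ == "__main__":
    served_by, text = asyncio.run(
        complete_with_fallback("ping", ["gpt-4o", "gpt-4o-mini"], flaky_provider)
    )
    print(served_by, text)
```

Returning the serving model alongside the text is one way to make the step-down visible to callers and metrics; whether the real gateway exposes it is not stated in this ADR.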
Cache TTL: 1 hour. Long enough to absorb the burst of identical queries that follows any popular update; short enough that tenants who update documents see fresh answers within a reasonable window. Configurable per tenant.
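To make the TTL semantics concrete, here is a small in-memory sketch. It is an assumption-laden stand-in for the Redis-backed cache, not the implementation in app/cache.py: the TTLCache class and its injectable clock are hypothetical and exist only so expiry can be exercised without waiting an hour.

```python
import time


class TTLCache:
    """In-memory stand-in for the Redis-backed response cache (illustrative only)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._entries[key] = (self._clock() + self.ttl_seconds, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: evict lazily on read
            return None
        return value
```

A per-tenant TTL would fit this shape by passing a different ttl_seconds per tenant; with Redis the expiry is normally delegated to the server's own key TTL rather than checked on read.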
Per-tenant rate limit + cost cap: stored in Redis keys
rl:{tenant_id}:{minute} and cost:{tenant_id}:{YYYY-MM}. Enforced
inside the gateway so call sites don't have to know about quotas.
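The key scheme and the pre-call check might look like the sketch below. Several details are assumptions: the ADR does not say how {minute} is rendered (YYYYMMDDHHMM is used here), QuotaStore is a dict-backed stand-in for Redis, and tenant_allows is a hypothetical free function rather than the gateway's _tenant_allows method.

```python
from datetime import datetime, timezone


def rate_limit_key(tenant_id, now):
    """rl:{tenant_id}:{minute}, with the minute as YYYYMMDDHHMM (assumed format)."""
    return f"rl:{tenant_id}:{now.strftime('%Y%m%d%H%M')}"


def cost_key(tenant_id, now):
    """cost:{tenant_id}:{YYYY-MM}."""
    return f"cost:{tenant_id}:{now.strftime('%Y-%m')}"


class QuotaStore:
    """Dict-backed stand-in for Redis counters (illustrative only)."""

    def __init__(self):
        self.counters = {}

    def incr(self, key, amount=1):
        self.counters[key] = self.counters.get(key, 0) + amount
        return self.counters[key]

    def get(self, key):
        return self.counters.get(key, 0)


def tenant_allows(store, tenant_id, now, max_per_minute, monthly_cap_usd):
    """Fixed-window rate limit plus monthly cost cap, checked before the provider call."""
    if store.get(cost_key(tenant_id, now)) >= monthly_cap_usd:
        return False  # monthly spend has already reached the cap
    # Count this request in the current one-minute window.
    return store.incr(rate_limit_key(tenant_id, now)) <= max_per_minute


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(rate_limit_key("acme", now), cost_key("acme", now))
```

Against a real Redis, the same fixed-window pattern is typically an INCR on the window key plus an EXPIRE so stale minute keys clean themselves up.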
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Outage resilience | Per-call try/except in handlers | Centralised fallback chain |
| Cost attribution | Inline tenant tagging at call site | Gateway-recorded per-tenant cost |
| Caching | Per-handler caching | Centralised hash-of-prompt cache |
| Process model | Out-of-process gateway service | In-process library gateway (v1) |
| Provider lock-in | Single SDK | Multi-provider with adapter |
The largest concrete trade is in-process vs out-of-process. We picked
in-process because at v1 traffic the cross-process IPC overhead (~5 ms
per call) outweighs the deployability benefit. The decision is reversible
— the LLMGateway interface is the same whether it's a library or a
service.
Consequences (positive)
- Single point for cost + outage + cache. Every change to vendor pricing, fallback policy, or cache TTL lands in one file.
- Tenants get coherent cost attribution. _record_cost(tenant_id, …) fires inside the gateway on every successful provider call; the cost manager (app/cost_manager.py) reads from the cost:{tenant_id}:{YYYY-MM} keys for per-tenant chargeback.
- 20% cache hit rate, measurable. The gateway emits the Prometheus counters llm_gateway_cache_hits and llm_gateway_calls, whose ratio gives the hit rate. The cost-model CSV uses this measured 20% rate as the optimization-lever assumption.
- Vendor outage tested in CI. A test fixture (tests/test_gateway.py) injects a fault on the primary provider and asserts that the fallback fires within the latency budget.
Consequences (negative)
- In-process singleton complicates testing. Test setup has to wipe Redis state between runs; the cache and rate-limit keys are real Redis keys. We mitigate with a test-only Redis namespace.
- Step-down on failure can mask provider issues. If gpt-4o is consistently failing, automatic fallback to gpt-4o-mini means users see worse answers rather than an error. We surface the fallback rate as a first-class Prometheus metric so this shows up in dashboards.
- Cache key sensitivity. Queries that differ only in whitespace or letter case would otherwise miss the cache. We normalise prompts (lowercase, collapse whitespace) before hashing, but the right normalisation is workload-dependent, so the rules in use are documented.
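The normalise-then-hash step can be sketched as follows. Only the lowercase and whitespace-collapse rules come from this ADR; the choice of SHA-256, the llmcache: prefix, and the extra key fields (tenant, model, a hypothetical doc_version) are assumptions made so that tenants and document revisions do not share entries.

```python
import hashlib
import json


def normalise_prompt(prompt):
    """Lowercase and collapse runs of whitespace into single spaces."""
    return " ".join(prompt.lower().split())


def cache_key(tenant_id, model, prompt, doc_version=""):
    """SHA-256 over a canonical JSON payload (fields other than prompt are assumptions)."""
    payload = json.dumps(
        {
            "tenant": tenant_id,
            "model": model,
            "docs": doc_version,
            "prompt": normalise_prompt(prompt),
        },
        sort_keys=True,  # stable field order so equal inputs hash equally
    )
    return "llmcache:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(cache_key("acme", "gpt-4o", "What is  RAG?\n"))
```

Lowercasing is only safe where case carries no meaning; a workload with case-sensitive identifiers or code in prompts would want a narrower normalise_prompt.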
Reversal plan
If the gateway's centralisation becomes the bottleneck (e.g. cache contention at higher scale, or the in-process singleton becoming a lock-contention point), the reversal is to extract the gateway as an out-of-process service:
- Extract gateway/llm_gateway.py as a separate FastAPI service.
- Replace in-process LLMGateway() instances with HTTP clients.
- Add the ~5 ms IPC overhead to the latency budget; verify p95 is still in bounds.
- Document the v2 deployment topology in DESIGN.md.
Estimated effort: ~5 engineer-days. The interface is preserved, so call sites don't change.
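One way to see why call sites would not change is to type them against a protocol rather than a concrete class. The sketch below is illustrative only: the Gateway protocol, the trimmed LLMRequest and LLMResponse dataclasses, and the send transport callable are hypothetical and do not mirror the real definitions.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Protocol


@dataclass
class LLMRequest:
    tenant_id: str
    prompt: str


@dataclass
class LLMResponse:
    text: str


class Gateway(Protocol):
    async def complete(self, req: LLMRequest) -> LLMResponse: ...


class InProcessGateway:
    """v1 shape: the gateway logic runs inside the calling process."""

    async def complete(self, req: LLMRequest) -> LLMResponse:
        return LLMResponse(text=f"in-process:{req.prompt}")


class RemoteGateway:
    """v2 shape: same method, with the work delegated over a transport.

    `send` is a hypothetical async transport, for example an HTTP POST helper.
    """

    def __init__(self, send: Callable[[dict], Awaitable[dict]]):
        self._send = send

    async def complete(self, req: LLMRequest) -> LLMResponse:
        body = await self._send({"tenant_id": req.tenant_id, "prompt": req.prompt})
        return LLMResponse(text=body["text"])


async def handler(gateway: Gateway, prompt: str) -> str:
    """A call site: it depends only on the Gateway protocol."""
    resp = await gateway.complete(LLMRequest(tenant_id="acme", prompt=prompt))
    return resp.text
```

Swapping InProcessGateway for RemoteGateway at application start-up leaves handler untouched, which is the property the effort estimate relies on.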
References
- gateway/llm_gateway.py: the gateway implementation
- app/cost_manager.py: reads gateway-recorded costs for chargeback
- app/cache.py + app/cache_stats.py: Redis-backed response cache
- tenant/tenant_manager.py: quota + rate-limit configuration
- runbook/failure_modes.md: runbook entry "primary LLM provider down"
- DESIGN.md: 4-layer architecture diagram (gateway sits at L2)
- ADR-001 / ADR-002 (retrieval pipeline that produces gateway inputs)