Context
The platform must answer questions over tenant-private documents under strict compliance constraints (PII redaction, audit trail, RBAC). The inference layer is a major cost driver and the most operationally demanding component — it carries GPU capacity planning, model rollouts, batching strategy, and quantization decisions. We have two viable v1 paths:
- Self-host an open-weight model (Llama 3 70B / Mixtral 8x22B) on vLLM or TGI with a managed GPU pool.
- Call a hosted vendor API (Anthropic Claude, OpenAI GPT-4) over HTTPS with a typed SDK.
We are a small team shipping v1 in 4 modules with a 4-tenant beta load target.
Decision
Adopt the Anthropic Claude API (from anthropic import Anthropic)
as the inference layer for v1. The pipeline interface is a single
generate(messages, system, max_tokens) call; tenant context is
injected via the system prompt and per-tenant retrieval results.
# rag/pipeline.py
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
model="claude-haiku-4-5",
system=tenant_system_prompt,
messages=[{"role": "user", "content": query}],
max_tokens=1024,
)
Tradeoffs we accept
| Lever | Self-hosted vLLM | Anthropic API (chosen) |
|---|---|---|
| Day-1 latency | Engineering time before first token | First token in ~5 minutes |
| Compliance posture | Full data residency + audit | Vendor DPA + per-call audit log |
| Unit cost at low volume (<5M tok/mo) | $1,200+ / mo idle GPU | $0–$200 / mo |
| Unit cost at high volume (>500M tok/mo) | Lower per-token | Higher per-token |
| Vendor risk | Eliminated | Real (rate limits, deprecation) |
| Capacity planning | Required | Vendor's problem |
| Rollout / canary | We build it | We build it (model param) |
We optimize for time-to-shipped-platform, not for per-token cost
at scale, because there are no platform-scale tenants yet. The cost
model in docs/cost-model/enterprise-ai-cost-model.csv shows the
crossover point at ~80M tokens/month.
Consequences (positive)
- Module 01 ships in <5 hours of learner time. No GPU sourcing, no vLLM tuning, no quantization debugging.
- Full audit log captures every prompt + response + token count + cost with no extra infra (Anthropic returns usage in the response).
- The pipeline interface stays a pure Python contract — swap is a single-method change.
Consequences (negative)
- Vendor lock-in surface. All prompts are tuned for Claude; a swap to GPT or self-hosted Llama will require prompt regression testing. Mitigation: ADR-003 (prompts in Git, hash-tagged) makes regression testing mechanical.
- Per-token cost ceiling. At >80M tokens/month per tenant the API bill exceeds a self-hosted GPU pool. Mitigation: cost model + Grafana per-tenant token panel (Module 04) flags the crossover.
- Compliance review burden. Some regulated tenants (healthcare, finance) require on-prem inference. Mitigation: see ADR-???-future for the dual-stack plan.
Reversal plan
Inference is encapsulated in rag/pipeline.py::EnterpriseRAGPipeline.generate().
Replacement is bounded:
- Add
serving/vllm_client.pywith the samegenerate()signature. - Switch
pipeline.pyimport behind a feature flag. - Re-run prompt regression tests (Module 03 red-team pytest suite).
- Rotate per-tenant traffic incrementally via
router.py(5% → 100%).
Estimated effort: 1 engineer-week for a tested swap, not counting GPU capacity provisioning. The decision is reversible.
References
apps/web/public/downloads/enterprise-ai-platform-starter.zip!/rag/pipeline.py- ADR-002 (multi-tenant RLS — independent of inference layer)
- ADR-003 (prompt versioning — makes regression testing mechanical)
docs/cost-model/enterprise-ai-cost-model.csv(crossover analysis)