ADR-001: Use Anthropic Claude API over self-hosted inference for v1 | Enterprise AI Platform

Context

The platform must answer questions over tenant-private documents under strict compliance constraints (PII redaction, audit trail, RBAC). The inference layer is a major cost driver and the most operationally demanding component — it carries GPU capacity planning, model rollouts, batching strategy, and quantization decisions. We have two viable v1 paths:

Self-host an open-weight model (Llama 3 70B / Mixtral 8x22B) on vLLM or TGI with a managed GPU pool.
Call a hosted vendor API (Anthropic Claude, OpenAI GPT-4) over HTTPS with a typed SDK.

We are a small team shipping v1 in 4 modules with a 4-tenant beta load target.

Decision

Adopt the Anthropic Claude API (from anthropic import Anthropic) as the inference layer for v1. The pipeline interface is a single generate(messages, system, max_tokens) call; tenant context is injected via the system prompt and per-tenant retrieval results.

# rag/pipeline.py
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
    model="claude-haiku-4-5",
    system=tenant_system_prompt,
    messages=[{"role": "user", "content": query}],
    max_tokens=1024,
)

Tradeoffs we accept

Lever	Self-hosted vLLM	Anthropic API (chosen)
Day-1 latency	Engineering time before first token	First token in ~5 minutes
Compliance posture	Full data residency + audit	Vendor DPA + per-call audit log
Unit cost at low volume (<5M tok/mo)	$1,200+ / mo idle GPU	$0–$200 / mo
Unit cost at high volume (>500M tok/mo)	Lower per-token	Higher per-token
Vendor risk	Eliminated	Real (rate limits, deprecation)
Capacity planning	Required	Vendor's problem
Rollout / canary	We build it	We build it (model param)

We optimize for time-to-shipped-platform, not for per-token cost at scale, because there are no platform-scale tenants yet. The cost model in docs/cost-model/enterprise-ai-cost-model.csv shows the crossover point at ~80M tokens/month.

Consequences (positive)

Module 01 ships in <5 hours of learner time. No GPU sourcing, no vLLM tuning, no quantization debugging.
Full audit log captures every prompt + response + token count + cost with no extra infra (Anthropic returns usage in the response).
The pipeline interface stays a pure Python contract — swap is a single-method change.

Consequences (negative)

Vendor lock-in surface. All prompts are tuned for Claude; a swap to GPT or self-hosted Llama will require prompt regression testing. Mitigation: ADR-003 (prompts in Git, hash-tagged) makes regression testing mechanical.
Per-token cost ceiling. At >80M tokens/month per tenant the API bill exceeds a self-hosted GPU pool. Mitigation: cost model + Grafana per-tenant token panel (Module 04) flags the crossover.
Compliance review burden. Some regulated tenants (healthcare, finance) require on-prem inference. Mitigation: see ADR-???-future for the dual-stack plan.

Reversal plan

Inference is encapsulated in rag/pipeline.py::EnterpriseRAGPipeline.generate(). Replacement is bounded:

Add serving/vllm_client.py with the same generate() signature.
Switch pipeline.py import behind a feature flag.
Re-run prompt regression tests (Module 03 red-team pytest suite).
Rotate per-tenant traffic incrementally via router.py (5% → 100%).

Estimated effort: 1 engineer-week for a tested swap, not counting GPU capacity provisioning. The decision is reversible.

References

apps/web/public/downloads/enterprise-ai-platform-starter.zip!/rag/pipeline.py
ADR-002 (multi-tenant RLS — independent of inference layer)
ADR-003 (prompt versioning — makes regression testing mechanical)
docs/cost-model/enterprise-ai-cost-model.csv (crossover analysis)