Skip to content
Back to Enterprise AI Platform

Use Anthropic Claude API over self-hosted inference for v1

✓ AcceptedEnterprise AI Platform01 — RAG + Compliance Core
By AI-DE Engineering Team·Stakeholders: platform engineer, security engineer, finance

Context

The platform must answer questions over tenant-private documents under strict compliance constraints (PII redaction, audit trail, RBAC). The inference layer is a major cost driver and the most operationally demanding component — it carries GPU capacity planning, model rollouts, batching strategy, and quantization decisions. We have two viable v1 paths:

  1. Self-host an open-weight model (Llama 3 70B / Mixtral 8x22B) on vLLM or TGI with a managed GPU pool.
  2. Call a hosted vendor API (Anthropic Claude, OpenAI GPT-4) over HTTPS with a typed SDK.

We are a small team shipping v1 in 4 modules with a 4-tenant beta load target.

Decision

Adopt the Anthropic Claude API (from anthropic import Anthropic) as the inference layer for v1. The pipeline interface is a single generate(messages, system, max_tokens) call; tenant context is injected via the system prompt and per-tenant retrieval results.

# rag/pipeline.py
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
    model="claude-haiku-4-5",
    system=tenant_system_prompt,
    messages=[{"role": "user", "content": query}],
    max_tokens=1024,
)

Tradeoffs we accept

LeverSelf-hosted vLLMAnthropic API (chosen)
Day-1 latencyEngineering time before first tokenFirst token in ~5 minutes
Compliance postureFull data residency + auditVendor DPA + per-call audit log
Unit cost at low volume (<5M tok/mo)$1,200+ / mo idle GPU$0–$200 / mo
Unit cost at high volume (>500M tok/mo)Lower per-tokenHigher per-token
Vendor riskEliminatedReal (rate limits, deprecation)
Capacity planningRequiredVendor's problem
Rollout / canaryWe build itWe build it (model param)

We optimize for time-to-shipped-platform, not for per-token cost at scale, because there are no platform-scale tenants yet. The cost model in docs/cost-model/enterprise-ai-cost-model.csv shows the crossover point at ~80M tokens/month.

Consequences (positive)

  • Module 01 ships in <5 hours of learner time. No GPU sourcing, no vLLM tuning, no quantization debugging.
  • Full audit log captures every prompt + response + token count + cost with no extra infra (Anthropic returns usage in the response).
  • The pipeline interface stays a pure Python contract — swap is a single-method change.

Consequences (negative)

  • Vendor lock-in surface. All prompts are tuned for Claude; a swap to GPT or self-hosted Llama will require prompt regression testing. Mitigation: ADR-003 (prompts in Git, hash-tagged) makes regression testing mechanical.
  • Per-token cost ceiling. At >80M tokens/month per tenant the API bill exceeds a self-hosted GPU pool. Mitigation: cost model + Grafana per-tenant token panel (Module 04) flags the crossover.
  • Compliance review burden. Some regulated tenants (healthcare, finance) require on-prem inference. Mitigation: see ADR-???-future for the dual-stack plan.

Reversal plan

Inference is encapsulated in rag/pipeline.py::EnterpriseRAGPipeline.generate(). Replacement is bounded:

  1. Add serving/vllm_client.py with the same generate() signature.
  2. Switch pipeline.py import behind a feature flag.
  3. Re-run prompt regression tests (Module 03 red-team pytest suite).
  4. Rotate per-tenant traffic incrementally via router.py (5% → 100%).

Estimated effort: 1 engineer-week for a tested swap, not counting GPU capacity provisioning. The decision is reversible.

References

  • apps/web/public/downloads/enterprise-ai-platform-starter.zip!/rag/pipeline.py
  • ADR-002 (multi-tenant RLS — independent of inference layer)
  • ADR-003 (prompt versioning — makes regression testing mechanical)
  • docs/cost-model/enterprise-ai-cost-model.csv (crossover analysis)
Built into the project

This decision shipped as part of Enterprise AI Platform — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open