# ADR-001 — Use Anthropic Claude API over self-hosted inference for v1

- **Status:** Accepted
- **Date:** 2026-05-08
- **Module:** 01 — RAG + Compliance Core
- **Stakeholders:** platform engineer, security engineer, finance

## Context

The platform must answer questions over tenant-private documents under
strict compliance constraints (PII redaction, audit trail, RBAC). The
inference layer is a major cost driver and the most operationally
demanding component — it carries GPU capacity planning, model rollouts,
batching strategy, and quantization decisions. We have two viable v1
paths:

1. **Self-host an open-weight model** (Llama 3 70B / Mixtral 8x22B) on
   vLLM or TGI with a managed GPU pool.
2. **Call a hosted vendor API** (Anthropic Claude, OpenAI GPT-4) over
   HTTPS with a typed SDK.

We are a small team shipping v1 in 4 modules with a 4-tenant beta load
target.

## Decision

Adopt the **Anthropic Claude API** (`from anthropic import Anthropic`)
as the inference layer for v1. The pipeline interface is a single
`generate(messages, system, max_tokens)` call; tenant context is
injected via the system prompt and per-tenant retrieval results.

```python
# rag/pipeline.py
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
    model="claude-haiku-4-5",
    system=tenant_system_prompt,
    messages=[{"role": "user", "content": query}],
    max_tokens=1024,
)
```

## Tradeoffs we accept

| Lever                                   | Self-hosted vLLM                    | Anthropic API (chosen)          |
| --------------------------------------- | ----------------------------------- | ------------------------------- |
| Day-1 latency                           | Engineering time before first token | First token in ~5 minutes       |
| Compliance posture                      | Full data residency + audit         | Vendor DPA + per-call audit log |
| Unit cost at low volume (<5M tok/mo)    | $1,200+ / mo idle GPU               | $0–$200 / mo                    |
| Unit cost at high volume (>500M tok/mo) | Lower per-token                     | Higher per-token                |
| Vendor risk                             | Eliminated                          | Real (rate limits, deprecation) |
| Capacity planning                       | Required                            | Vendor's problem                |
| Rollout / canary                        | We build it                         | We build it (model param)       |

We optimize for **time-to-shipped-platform**, not for **per-token cost
at scale**, because there are no platform-scale tenants yet. The cost
model in `docs/cost-model/enterprise-ai-cost-model.csv` shows the
crossover point at ~80M tokens/month.

## Consequences (positive)

- Module 01 ships in <5 hours of learner time. No GPU sourcing, no
  vLLM tuning, no quantization debugging.
- Full audit log captures every prompt + response + token count + cost
  with no extra infra (Anthropic returns usage in the response).
- The pipeline interface stays a pure Python contract — swap is a
  single-method change.

## Consequences (negative)

- **Vendor lock-in surface.** All prompts are tuned for Claude; a swap
  to GPT or self-hosted Llama will require prompt regression testing.
  Mitigation: ADR-003 (prompts in Git, hash-tagged) makes regression
  testing mechanical.
- **Per-token cost ceiling.** At >80M tokens/month per tenant the API
  bill exceeds a self-hosted GPU pool. Mitigation: cost model + Grafana
  per-tenant token panel (Module 04) flags the crossover.
- **Compliance review burden.** Some regulated tenants (healthcare,
  finance) require on-prem inference. Mitigation: see ADR-???-future
  for the dual-stack plan.

## Reversal plan

Inference is encapsulated in `rag/pipeline.py::EnterpriseRAGPipeline.generate()`.
Replacement is bounded:

1. Add `serving/vllm_client.py` with the same `generate()` signature.
2. Switch `pipeline.py` import behind a feature flag.
3. Re-run prompt regression tests (Module 03 red-team pytest suite).
4. Rotate per-tenant traffic incrementally via `router.py` (5% → 100%).

Estimated effort: **1 engineer-week** for a tested swap, not counting
GPU capacity provisioning. The decision is reversible.

## References

- `apps/web/public/downloads/enterprise-ai-platform-starter.zip!/rag/pipeline.py`
- ADR-002 (multi-tenant RLS — independent of inference layer)
- ADR-003 (prompt versioning — makes regression testing mechanical)
- `docs/cost-model/enterprise-ai-cost-model.csv` (crossover analysis)