# ADR-001 — vLLM continuous batching + PagedAttention over Triton/TensorRT-LLM

- **Status:** Accepted
- **Date:** 2026-04-08
- **Module:** 01 — Build Your First AI Serving Layer
- **Stakeholders:** serving engineer, platform lead, on-call SRE

## Context

The serving layer is the single largest determinant of cost-per-request and
p99 latency on this platform. We're shipping Mistral-7B-Instruct (7B
parameters, 8k context) under a `<200ms p99` SLA target with a `<500ms p99`
baseline gate in M01. Three engines were on the table:

1. **NVIDIA Triton + TensorRT-LLM.** Industry standard. Best raw throughput
   on A100/H100 with TensorRT compiled engines. Cost: a model-compilation
   step, an extra config layer (Triton config files + Python backend), and
   no native OpenAI-API compatibility (need an adapter layer in front). The
   compilation step alone is a 30–60 minute round-trip per model change.
2. **HuggingFace `text-generation-inference` (TGI).** Solid, OpenAI-API
   compatible, supports continuous batching. Behind vLLM on raw throughput
   for Mistral-class models per public benchmarks at the time of M01
   (~70–80% of vLLM's tokens/sec on A10G).
3. **vLLM.** PagedAttention for KV-cache memory management, continuous
   batching that lets new requests join mid-flight, prefix caching for
   repeated system prompts, drop-in OpenAI-compatible API. Sub-200ms p99 on
   A10G at 256 concurrent sequences.

Naive static batching (the default in many serving stacks) wastes GPU cycles
because a slow request blocks the whole batch — the GPU sits idle waiting
for the longest-running request in the batch to finish. Continuous batching
fixes that. PagedAttention fixes the symmetric problem on memory:
fragmented KV-cache allocation wastes 60%+ of GPU memory in classic engines.

## Decision

We adopt **vLLM** as the serving engine, with continuous batching, prefix
caching, and PagedAttention all enabled.

```python
# serving/vllm_config.py
ENGINE_ARGS = EngineArgs(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=1,           # single A10G; bump to 2 for >7B
    max_num_seqs=256,                 # concurrent sequences per replica
    gpu_memory_utilization=0.85,      # 15% headroom for activations
    enable_prefix_caching=True,       # share KV across matching prefixes
    block_size=16,                    # PagedAttention block size
    max_model_len=8192,
)
```

The OpenAI-compatible `/v1/chat/completions` endpoint is exposed directly by
vLLM; FastAPI sits in front purely for auth, rate limiting, RAG injection
(M02), and Prometheus middleware.

## Tradeoffs we accept

| Lever             | Alternative                 | Chosen                             |
| ----------------- | --------------------------- | ---------------------------------- |
| Engine maturity   | Triton (industry standard)  | vLLM (newer, faster on Mistral-7B) |
| Compilation step  | TensorRT compiles per-model | vLLM loads HF weights directly     |
| API compatibility | Triton needs adapter layer  | vLLM ships OpenAI-compatible API   |
| KV memory         | Naive contiguous allocation | PagedAttention block_size=16       |
| Tunability        | Many knobs in Triton config | One Python file (`vllm_config.py`) |

The largest concrete cost is **vendor maturity risk** — vLLM is younger than
Triton (originally a Berkeley research project, ~2 years to v0.4). We
mitigate by pinning to a known-good version (`vllm>=0.4.0,<0.5.0`) and by
keeping the engine swap path small: the OpenAI-compatible API surface
means the rest of the platform doesn't depend on vLLM internals.

## Consequences (positive)

- **Continuous batching** measurably outperforms static batching on bursty
  workloads. New requests join the running batch instead of queueing for
  the next batch boundary; GPU utilisation stays >80% under load.
- **Prefix caching** reuses the KV cache across requests that share a
  system prompt. The 312-token FinSight system prompt (M02) is cached
  exactly once across all replicas; per-request prompt-fill cost drops to
  zero on cache hit.
- **PagedAttention** eliminates KV-cache fragmentation; ~55% memory
  savings on Mistral-7B vs naive contiguous allocation, which is what
  enables `max_num_seqs=256` to fit in 16GB A10G memory.
- **OpenAI-API compatibility** means clients (and the M03 streaming SSE
  endpoint) work without translation layers.
- **Single-file tuning surface.** `vllm_config.py` is the one place to
  change batching, GPU memory, prefix caching, and quantization toggles.

## Consequences (negative)

- **vLLM is younger than Triton.** Production maturity behind. We pin to
  a tested version range and watch the changelog.
- **Engine-specific tuning knobs.** `max_num_seqs`, `block_size`,
  `gpu_memory_utilization` aren't portable to other engines if we ever
  swap. The reversal plan documents the substitution mapping.
- **No native multi-model serving.** vLLM serves one model per process.
  Multi-model is a Ray Serve concern (ADR-002), not a vLLM concern.
- **No quantization in v1.** Mistral-7B fits in 16GB at fp16 with the
  paged-attention savings; AWQ/GPTQ would let us run on smaller GPUs (L4,
  T4) but adds a compilation step. Documented as a M05 follow-up.

## Reversal plan

If vLLM's pace of breaking changes becomes a problem, or a benchmark on
H100 shows TensorRT-LLM > vLLM by a meaningful margin (>30% throughput
delta), the reversal is mechanical:

1. Build a Triton config + Python backend with the same OpenAI-compatible
   shape. Keep the FastAPI gateway unchanged.
2. Add a TensorRT compilation step to CI for the target model.
3. Cut over per-replica via the Ray Serve deployment knob (one replica at a
   time, with the circuit breaker absorbing the swap).

Estimated effort: ~2 engineer-weeks for the engine swap, plus 30–60 min
per model for TensorRT compilation steady-state.

## References

- `serving/vllm_config.py` — the single tuning surface
- `serving/batching_config.py` — `scheduler_delay_factor` tuning (M02)
- `serving/kv_cache_config.py` — prefix-caching warm-up (M02)
- `api/main.py` — FastAPI gateway in front of vLLM
- `api/Dockerfile` — multi-stage build with python:3.11-slim runtime
- `tests/test_endpoint.py` — 4 smoke tests against the live engine
- `benchmarks/locustfile.py` — p99 baseline harness
- ADR-002 (Ray Serve orchestration this engine plugs into)
- ADR-003 (semantic cache that sits in front of this engine)
