ADR-001: vLLM continuous batching + PagedAttention over Triton/TensorRT-LLM | AI Serving Platform

Context

The serving layer is the single largest determinant of cost-per-request and p99 latency on this platform. We're shipping Mistral-7B-Instruct (7B parameters, 8k context) under a <200ms p99 SLA target with a <500ms p99 baseline gate in M01. Three engines were on the table:

NVIDIA Triton + TensorRT-LLM. Industry standard. Best raw throughput on A100/H100 with TensorRT compiled engines. Cost: a model-compilation step, an extra config layer (Triton config files + Python backend), and no native OpenAI-API compatibility (need an adapter layer in front). The compilation step alone is a 30–60 minute round-trip per model change.
HuggingFace text-generation-inference (TGI). Solid, OpenAI-API compatible, supports continuous batching. Behind vLLM on raw throughput for Mistral-class models per public benchmarks at the time of M01 (~70–80% of vLLM's tokens/sec on A10G).
vLLM. PagedAttention for KV-cache memory management, continuous batching that lets new requests join mid-flight, prefix caching for repeated system prompts, drop-in OpenAI-compatible API. Sub-200ms p99 on A10G at 256 concurrent sequences.

Naive static batching (the default in many serving stacks) wastes GPU cycles because a slow request blocks the whole batch — the GPU sits idle waiting for the longest-running request in the batch to finish. Continuous batching fixes that. PagedAttention fixes the symmetric problem on memory: fragmented KV-cache allocation wastes 60%+ of GPU memory in classic engines.

Decision

We adopt vLLM as the serving engine, with continuous batching, prefix caching, and PagedAttention all enabled.

# serving/vllm_config.py
ENGINE_ARGS = EngineArgs(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=1,           # single A10G; bump to 2 for >7B
    max_num_seqs=256,                 # concurrent sequences per replica
    gpu_memory_utilization=0.85,      # 15% headroom for activations
    enable_prefix_caching=True,       # share KV across matching prefixes
    block_size=16,                    # PagedAttention block size
    max_model_len=8192,
)

The OpenAI-compatible /v1/chat/completions endpoint is exposed directly by vLLM; FastAPI sits in front purely for auth, rate limiting, RAG injection (M02), and Prometheus middleware.

Tradeoffs we accept

Lever	Alternative	Chosen
Engine maturity	Triton (industry standard)	vLLM (newer, faster on Mistral-7B)
Compilation step	TensorRT compiles per-model	vLLM loads HF weights directly
API compatibility	Triton needs adapter layer	vLLM ships OpenAI-compatible API
KV memory	Naive contiguous allocation	PagedAttention block_size=16
Tunability	Many knobs in Triton config	One Python file (`vllm_config.py`)

The largest concrete cost is vendor maturity risk — vLLM is younger than Triton (originally a Berkeley research project, ~2 years to v0.4). We mitigate by pinning to a known-good version (vllm>=0.4.0,<0.5.0) and by keeping the engine swap path small: the OpenAI-compatible API surface means the rest of the platform doesn't depend on vLLM internals.

Consequences (positive)

Continuous batching measurably outperforms static batching on bursty workloads. New requests join the running batch instead of queueing for the next batch boundary; GPU utilisation stays >80% under load.
Prefix caching reuses the KV cache across requests that share a system prompt. The 312-token FinSight system prompt (M02) is cached exactly once across all replicas; per-request prompt-fill cost drops to zero on cache hit.
PagedAttention eliminates KV-cache fragmentation; ~55% memory savings on Mistral-7B vs naive contiguous allocation, which is what enables max_num_seqs=256 to fit in 16GB A10G memory.
OpenAI-API compatibility means clients (and the M03 streaming SSE endpoint) work without translation layers.
Single-file tuning surface. vllm_config.py is the one place to change batching, GPU memory, prefix caching, and quantization toggles.

Consequences (negative)

vLLM is younger than Triton. Production maturity behind. We pin to a tested version range and watch the changelog.
Engine-specific tuning knobs. max_num_seqs, block_size, gpu_memory_utilization aren't portable to other engines if we ever swap. The reversal plan documents the substitution mapping.
No native multi-model serving. vLLM serves one model per process. Multi-model is a Ray Serve concern (ADR-002), not a vLLM concern.
No quantization in v1. Mistral-7B fits in 16GB at fp16 with the paged-attention savings; AWQ/GPTQ would let us run on smaller GPUs (L4, T4) but adds a compilation step. Documented as a M05 follow-up.

Reversal plan

If vLLM's pace of breaking changes becomes a problem, or a benchmark on H100 shows TensorRT-LLM > vLLM by a meaningful margin (>30% throughput delta), the reversal is mechanical:

Build a Triton config + Python backend with the same OpenAI-compatible shape. Keep the FastAPI gateway unchanged.
Add a TensorRT compilation step to CI for the target model.
Cut over per-replica via the Ray Serve deployment knob (one replica at a time, with the circuit breaker absorbing the swap).

Estimated effort: ~2 engineer-weeks for the engine swap, plus 30–60 min per model for TensorRT compilation steady-state.

References

serving/vllm_config.py — the single tuning surface
serving/batching_config.py — scheduler_delay_factor tuning (M02)
serving/kv_cache_config.py — prefix-caching warm-up (M02)
api/main.py — FastAPI gateway in front of vLLM
api/Dockerfile — multi-stage build with python:3.11-slim runtime
tests/test_endpoint.py — 4 smoke tests against the live engine
benchmarks/locustfile.py — p99 baseline harness
ADR-002 (Ray Serve orchestration this engine plugs into)
ADR-003 (semantic cache that sits in front of this engine)