Context
The serving layer is the single largest determinant of cost-per-request and
p99 latency on this platform. We're shipping Mistral-7B-Instruct (7B
parameters, 8k context) under a <200ms p99 SLA target with a <500ms p99
baseline gate in M01. Three engines were on the table:
- NVIDIA Triton + TensorRT-LLM. Industry standard. Best raw throughput on A100/H100 with TensorRT compiled engines. Cost: a model-compilation step, an extra config layer (Triton config files + Python backend), and no native OpenAI-API compatibility (need an adapter layer in front). The compilation step alone is a 30–60 minute round-trip per model change.
- HuggingFace
text-generation-inference(TGI). Solid, OpenAI-API compatible, supports continuous batching. Behind vLLM on raw throughput for Mistral-class models per public benchmarks at the time of M01 (~70–80% of vLLM's tokens/sec on A10G). - vLLM. PagedAttention for KV-cache memory management, continuous batching that lets new requests join mid-flight, prefix caching for repeated system prompts, drop-in OpenAI-compatible API. Sub-200ms p99 on A10G at 256 concurrent sequences.
Naive static batching (the default in many serving stacks) wastes GPU cycles because a slow request blocks the whole batch — the GPU sits idle waiting for the longest-running request in the batch to finish. Continuous batching fixes that. PagedAttention fixes the symmetric problem on memory: fragmented KV-cache allocation wastes 60%+ of GPU memory in classic engines.
Decision
We adopt vLLM as the serving engine, with continuous batching, prefix caching, and PagedAttention all enabled.
# serving/vllm_config.py
ENGINE_ARGS = EngineArgs(
model="mistralai/Mistral-7B-Instruct-v0.2",
tensor_parallel_size=1, # single A10G; bump to 2 for >7B
max_num_seqs=256, # concurrent sequences per replica
gpu_memory_utilization=0.85, # 15% headroom for activations
enable_prefix_caching=True, # share KV across matching prefixes
block_size=16, # PagedAttention block size
max_model_len=8192,
)
The OpenAI-compatible /v1/chat/completions endpoint is exposed directly by
vLLM; FastAPI sits in front purely for auth, rate limiting, RAG injection
(M02), and Prometheus middleware.
Tradeoffs we accept
| Lever | Alternative | Chosen |
|---|---|---|
| Engine maturity | Triton (industry standard) | vLLM (newer, faster on Mistral-7B) |
| Compilation step | TensorRT compiles per-model | vLLM loads HF weights directly |
| API compatibility | Triton needs adapter layer | vLLM ships OpenAI-compatible API |
| KV memory | Naive contiguous allocation | PagedAttention block_size=16 |
| Tunability | Many knobs in Triton config | One Python file (vllm_config.py) |
The largest concrete cost is vendor maturity risk — vLLM is younger than
Triton (originally a Berkeley research project, ~2 years to v0.4). We
mitigate by pinning to a known-good version (vllm>=0.4.0,<0.5.0) and by
keeping the engine swap path small: the OpenAI-compatible API surface
means the rest of the platform doesn't depend on vLLM internals.
Consequences (positive)
- Continuous batching measurably outperforms static batching on bursty workloads. New requests join the running batch instead of queueing for the next batch boundary; GPU utilisation stays >80% under load.
- Prefix caching reuses the KV cache across requests that share a system prompt. The 312-token FinSight system prompt (M02) is cached exactly once across all replicas; per-request prompt-fill cost drops to zero on cache hit.
- PagedAttention eliminates KV-cache fragmentation; ~55% memory
savings on Mistral-7B vs naive contiguous allocation, which is what
enables
max_num_seqs=256to fit in 16GB A10G memory. - OpenAI-API compatibility means clients (and the M03 streaming SSE endpoint) work without translation layers.
- Single-file tuning surface.
vllm_config.pyis the one place to change batching, GPU memory, prefix caching, and quantization toggles.
Consequences (negative)
- vLLM is younger than Triton. Production maturity behind. We pin to a tested version range and watch the changelog.
- Engine-specific tuning knobs.
max_num_seqs,block_size,gpu_memory_utilizationaren't portable to other engines if we ever swap. The reversal plan documents the substitution mapping. - No native multi-model serving. vLLM serves one model per process. Multi-model is a Ray Serve concern (ADR-002), not a vLLM concern.
- No quantization in v1. Mistral-7B fits in 16GB at fp16 with the paged-attention savings; AWQ/GPTQ would let us run on smaller GPUs (L4, T4) but adds a compilation step. Documented as a M05 follow-up.
Reversal plan
If vLLM's pace of breaking changes becomes a problem, or a benchmark on H100 shows TensorRT-LLM > vLLM by a meaningful margin (>30% throughput delta), the reversal is mechanical:
- Build a Triton config + Python backend with the same OpenAI-compatible shape. Keep the FastAPI gateway unchanged.
- Add a TensorRT compilation step to CI for the target model.
- Cut over per-replica via the Ray Serve deployment knob (one replica at a time, with the circuit breaker absorbing the swap).
Estimated effort: ~2 engineer-weeks for the engine swap, plus 30–60 min per model for TensorRT compilation steady-state.
References
serving/vllm_config.py— the single tuning surfaceserving/batching_config.py—scheduler_delay_factortuning (M02)serving/kv_cache_config.py— prefix-caching warm-up (M02)api/main.py— FastAPI gateway in front of vLLMapi/Dockerfile— multi-stage build with python:3.11-slim runtimetests/test_endpoint.py— 4 smoke tests against the live enginebenchmarks/locustfile.py— p99 baseline harness- ADR-002 (Ray Serve orchestration this engine plugs into)
- ADR-003 (semantic cache that sits in front of this engine)