Skip to content
Back to AI Serving Platform

vLLM continuous batching + PagedAttention over Triton/TensorRT-LLM

✓ AcceptedAI Serving Platform01 — Build Your First AI Serving Layer
By AI-DE Engineering Team·Stakeholders: serving engineer, platform lead, on-call SRE

Context

The serving layer is the single largest determinant of cost-per-request and p99 latency on this platform. We're shipping Mistral-7B-Instruct (7B parameters, 8k context) under a <200ms p99 SLA target with a <500ms p99 baseline gate in M01. Three engines were on the table:

  1. NVIDIA Triton + TensorRT-LLM. Industry standard. Best raw throughput on A100/H100 with TensorRT compiled engines. Cost: a model-compilation step, an extra config layer (Triton config files + Python backend), and no native OpenAI-API compatibility (need an adapter layer in front). The compilation step alone is a 30–60 minute round-trip per model change.
  2. HuggingFace text-generation-inference (TGI). Solid, OpenAI-API compatible, supports continuous batching. Behind vLLM on raw throughput for Mistral-class models per public benchmarks at the time of M01 (~70–80% of vLLM's tokens/sec on A10G).
  3. vLLM. PagedAttention for KV-cache memory management, continuous batching that lets new requests join mid-flight, prefix caching for repeated system prompts, drop-in OpenAI-compatible API. Sub-200ms p99 on A10G at 256 concurrent sequences.

Naive static batching (the default in many serving stacks) wastes GPU cycles because a slow request blocks the whole batch — the GPU sits idle waiting for the longest-running request in the batch to finish. Continuous batching fixes that. PagedAttention fixes the symmetric problem on memory: fragmented KV-cache allocation wastes 60%+ of GPU memory in classic engines.

Decision

We adopt vLLM as the serving engine, with continuous batching, prefix caching, and PagedAttention all enabled.

# serving/vllm_config.py
ENGINE_ARGS = EngineArgs(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=1,           # single A10G; bump to 2 for >7B
    max_num_seqs=256,                 # concurrent sequences per replica
    gpu_memory_utilization=0.85,      # 15% headroom for activations
    enable_prefix_caching=True,       # share KV across matching prefixes
    block_size=16,                    # PagedAttention block size
    max_model_len=8192,
)

The OpenAI-compatible /v1/chat/completions endpoint is exposed directly by vLLM; FastAPI sits in front purely for auth, rate limiting, RAG injection (M02), and Prometheus middleware.

Tradeoffs we accept

LeverAlternativeChosen
Engine maturityTriton (industry standard)vLLM (newer, faster on Mistral-7B)
Compilation stepTensorRT compiles per-modelvLLM loads HF weights directly
API compatibilityTriton needs adapter layervLLM ships OpenAI-compatible API
KV memoryNaive contiguous allocationPagedAttention block_size=16
TunabilityMany knobs in Triton configOne Python file (vllm_config.py)

The largest concrete cost is vendor maturity risk — vLLM is younger than Triton (originally a Berkeley research project, ~2 years to v0.4). We mitigate by pinning to a known-good version (vllm>=0.4.0,<0.5.0) and by keeping the engine swap path small: the OpenAI-compatible API surface means the rest of the platform doesn't depend on vLLM internals.

Consequences (positive)

  • Continuous batching measurably outperforms static batching on bursty workloads. New requests join the running batch instead of queueing for the next batch boundary; GPU utilisation stays >80% under load.
  • Prefix caching reuses the KV cache across requests that share a system prompt. The 312-token FinSight system prompt (M02) is cached exactly once across all replicas; per-request prompt-fill cost drops to zero on cache hit.
  • PagedAttention eliminates KV-cache fragmentation; ~55% memory savings on Mistral-7B vs naive contiguous allocation, which is what enables max_num_seqs=256 to fit in 16GB A10G memory.
  • OpenAI-API compatibility means clients (and the M03 streaming SSE endpoint) work without translation layers.
  • Single-file tuning surface. vllm_config.py is the one place to change batching, GPU memory, prefix caching, and quantization toggles.

Consequences (negative)

  • vLLM is younger than Triton. Production maturity behind. We pin to a tested version range and watch the changelog.
  • Engine-specific tuning knobs. max_num_seqs, block_size, gpu_memory_utilization aren't portable to other engines if we ever swap. The reversal plan documents the substitution mapping.
  • No native multi-model serving. vLLM serves one model per process. Multi-model is a Ray Serve concern (ADR-002), not a vLLM concern.
  • No quantization in v1. Mistral-7B fits in 16GB at fp16 with the paged-attention savings; AWQ/GPTQ would let us run on smaller GPUs (L4, T4) but adds a compilation step. Documented as a M05 follow-up.

Reversal plan

If vLLM's pace of breaking changes becomes a problem, or a benchmark on H100 shows TensorRT-LLM > vLLM by a meaningful margin (>30% throughput delta), the reversal is mechanical:

  1. Build a Triton config + Python backend with the same OpenAI-compatible shape. Keep the FastAPI gateway unchanged.
  2. Add a TensorRT compilation step to CI for the target model.
  3. Cut over per-replica via the Ray Serve deployment knob (one replica at a time, with the circuit breaker absorbing the swap).

Estimated effort: ~2 engineer-weeks for the engine swap, plus 30–60 min per model for TensorRT compilation steady-state.

References

  • serving/vllm_config.py — the single tuning surface
  • serving/batching_config.pyscheduler_delay_factor tuning (M02)
  • serving/kv_cache_config.py — prefix-caching warm-up (M02)
  • api/main.py — FastAPI gateway in front of vLLM
  • api/Dockerfile — multi-stage build with python:3.11-slim runtime
  • tests/test_endpoint.py — 4 smoke tests against the live engine
  • benchmarks/locustfile.py — p99 baseline harness
  • ADR-002 (Ray Serve orchestration this engine plugs into)
  • ADR-003 (semantic cache that sits in front of this engine)
Built into the project

This decision shipped as part of AI Serving Platform — see the full architecture, starter kit, and 4 more ADRs.

Open project →
Press Cmd+K to open