Install: claude install-skill air-gapped/skills
# vLLM benchmarking
Target audience: operators producing defensible latency/throughput numbers against production or pre-production vLLM deployments, on datacenter GPUs, often in containerized or air-gapped environments.
## Why this matters
Bad benchmarks are worse than no benchmarks: they drive the wrong decisions with false confidence. Three failure modes are common:
1. **Wrong methodology.** `--request-rate inf` sends every request at once, so it answers "saturation throughput," not "TTFT my users see." Mixing those up leads to buying GPUs to solve a latency problem, or shipping a latency regression because total throughput looked fine.
2. **Wrong workload.** `--dataset-name random` has zero prefix structure. Real coding-agent or RAG traffic has heavy prefix reuse. Measuring prefix-caching gains on random prompts produces numbers that don't survive contact with prod.
3. **No warmup / wrong tokenizer.** The first N requests hit cold CUDA graphs and skew latency numbers if they are counted. Token counts are fiction unless `--tokenizer` matches the served model exactly.
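To make the first failure mode concrete, here is a toy simulation, not a model of vLLM itself. It assumes a single FIFO server that needs a fixed `service_s` seconds per request and emits the first token when that time is up. Real vLLM batches requests, so the absolute numbers mean nothing; only the shape of the result carries over. The function name `simulate_ttft` and all parameters are hypothetical.

```python
import math


def simulate_ttft(num_requests, request_rate, service_s=0.1):
    """Toy single-server FIFO queue; returns (mean TTFT in s, throughput in req/s)."""
    # request_rate = inf mirrors a saturation run: every request arrives at t=0.
    # Finite rates use evenly spaced arrivals (an assumption made for reproducibility).
    gap = 0.0 if math.isinf(request_rate) else 1.0 / request_rate
    server_free_at = 0.0
    ttfts = []
    for i in range(num_requests):
        arrival = i * gap
        start = max(arrival, server_free_at)  # wait if the server is still busy
        server_free_at = start + service_s
        ttfts.append(server_free_at - arrival)  # queueing delay + service time
    mean_ttft = sum(ttfts) / len(ttfts)
    throughput = num_requests / server_free_at
    return mean_ttft, throughput


if __name__ == "__main__":
    for rate in (math.inf, 5.0):
        mean_ttft, throughput = simulate_ttft(100, rate)
        print(f"rate={rate}: mean TTFT {mean_ttft:.2f}s, {throughput:.1f} req/s")
```

Under these assumptions the saturated run reports about twice the throughput of the paced run, but its mean TTFT is roughly fifty times worse, because almost all of it is queueing delay. Neither number is wrong; they answer different questions.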
The cost of getting this right is small; the cost of getting it wrong is buying the wrong hardware.
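The workload failure mode above can be sketched the same way. The snippet below estimates how many prompt tokens an idealized prefix cache could reuse, for fully random prompts versus prompts that share a system prefix. It is an illustration under loud assumptions: the cache is unbounded and matches at token granularity, whereas a real cache works on fixed-size blocks with finite capacity. `prefix_hit_rate`, the vocabulary size, and the prompt lengths are all hypothetical.

```python
import random


def prefix_hit_rate(prompts):
    """Fraction of prompt tokens that repeat a prefix of some earlier prompt."""
    seen = set()  # every prefix of every earlier prompt (idealized, unbounded cache)
    hit_tokens = 0
    total_tokens = 0
    for prompt in prompts:
        total_tokens += len(prompt)
        # Longest prefix of this prompt that an earlier prompt also started with.
        reused = 0
        for k in range(len(prompt), 0, -1):
            if tuple(prompt[:k]) in seen:
                reused = k
                break
        hit_tokens += reused
        for k in range(1, len(prompt) + 1):
            seen.add(tuple(prompt[:k]))
    return hit_tokens / total_tokens


rng = random.Random(0)
VOCAB = 32000  # hypothetical vocabulary size

# 50 prompts of 64 random token ids: no shared structure.
random_prompts = [[rng.randrange(VOCAB) for _ in range(64)] for _ in range(50)]

# 50 prompts that share a 48-token system prefix, then diverge for 16 tokens.
system_prefix = [rng.randrange(VOCAB) for _ in range(48)]
shared_prompts = [
    system_prefix + [rng.randrange(VOCAB) for _ in range(16)] for _ in range(50)
]

if __name__ == "__main__":
    print(f"random prompts: hit rate {prefix_hit_rate(random_prompts):.3f}")
    print(f"shared prefix:  hit rate {prefix_hit_rate(shared_prompts):.3f}")
```

The random workload lands near zero reuse while the shared-prefix workload reuses a little under three quarters of its tokens, so any caching gain measured on the first says almost nothing about the second.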
## Decision tree — which subcommand
| Question | Command | Why |
|---|---|---|
| "Saturation throughput of this offline batch" | `vllm bench throughput` | Submits N prompts at once, measures tok/s. No server. |
| "Single-batch generation latency" | `vllm bench latency` | Fixed batch size, repeated N times. Warmup included. Good for kernel-level regression. |
| "Production serving performance" | `vll