
vllm-benchmarking

Run production vLLM benchmarks — `vllm bench` (serve, throughput, latency, sweep, startup, mm-processor), request-rate vs max-concurrency semantics, TTFT/TPOT/ITL/E2EL percentiles, goodput SLO measurement, prefix-cache workloads, air-gapped operation (HF_ENDPOINT, ModelScope, hf-mirror, offline cache). Methodology split — SLO health checks vs A/B change sweeps — plus pitfalls that produce misleading numbers (no warmup, wrong tokenizer, random-as-prod, `--request-rate inf` alone).
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: claude install-skill air-gapped/skills
# vLLM benchmarking

Target audience: operators producing defensible latency/throughput numbers against production or pre-production vLLM deployments, on datacenter GPUs, often in containerized or air-gapped environments.

## Why this matters

Bad benchmarks are worse than no benchmarks: they drive the wrong decisions with false confidence. The three common failure modes:

1. **Wrong methodology.** `--request-rate inf` answers "saturation throughput," not "TTFT my users see." Mixing those up leads to buying GPUs to solve a latency problem, or shipping a latency regression because total throughput looked fine.
2. **Wrong workload.** `--dataset-name random` has zero prefix structure. Real coding-agent or RAG traffic has heavy prefix reuse. Benchmarking caching wins on random produces numbers that don't survive contact with prod.
3. **No warmup / wrong tokenizer.** First N requests hit cold CUDA graphs. Token counts are fiction unless `--tokenizer` matches the served model exactly.

The cost of getting this right is small; the cost of getting it wrong is buying the wrong hardware.

## Decision tree: which subcommand

| Question | Command | Why |
|---|---|---|
| "Saturation throughput of this offline batch" | `vllm bench throughput` | Submits N prompts at once, measures tok/s. No server. |
| "Single-batch generation latency" | `vllm bench latency` | Fixed batch size, repeated N times. Warmup included. Good for kernel-level regression. |
| "Production serving performance" | `vll