vllm-performance-tuning

vLLM performance-tuning operator reference — tuning workflow (baseline → bottleneck → knob → re-bench), fused-MoE kernel autotune (`benchmark_moe.py` generates `E=N,N=M,device_name=X.json` configs), DeepEP all-to-all + expert parallelism + EPLB, CUDA graph modes (FULL_AND_PIECEWISE default), torch.compile AOT + compile cache, scheduler knobs (`--max-num-batched-tokens`, `--max-num-seqs`, `--async-scheduling`), TP/EP/DP/PP decision tree, NCCL/DCGM on H100/H200/B200/GB200, PD disaggregation (Nixl/Mooncake/LMCache), known regressions + vendor quirks (v0.14→0.15.1 MiniMax, MI300X FP8<BF16, DeepGEMM M<128 TTFT).
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: `claude install-skill air-gapped/skills`
# vLLM performance tuning

Target: operators deploying models on new hardware, chasing throughput / latency / goodput SLOs, or diagnosing perf regressions.

Current through v0.19.1 stable (2026-04-18) / v0.20.0 pre-release (2026-04-23). Last freshened 2026-04-24.

Companion skills: `vllm-benchmarking` (measure), `vllm-caching` (KV), `vllm-nvidia-hardware` (GPU/GEMM), `vllm-configuration` (env vars), `vllm-observability` (metrics).

## The tuning workflow

1. **Characterize workload**: ISL / OSL / req/s / concurrency / SLO (P95 TTFT, P95 TPOT, P95 ITL). "Goodput" = tok/s/GPU **under SLO**, not raw tok/s.
2. **Pick parallelism** (see `references/moe-and-ep.md`): model-fits-1-GPU → TP=1 + replicas (DP); MoE MLA (DeepSeek/Kimi-K2) → DP-attn + EP; multi-node → TP intra + PP inter OR Wide-EP.
3. **MoE on new SKU? Run `benchmark_moe.py --tune`**: generates `E=*,N=*,device_name=*.json` configs. Without tuned configs vLLM logs "Using default MoE config. Performance might be sub-optimal!" = 20-40% throughput loss.
4. **Run `auto_tune.sh`** (`benchmarks/auto_tune/`): sweeps `max_num_seqs × max_num_batched_tokens`.
5. **Raise `--gpu-memory-utilization`** from 0.90 toward 0.95 until steady OOM margin, back off. MoE: cap at 0.85 (all-to-all buffers not in accounting).
6. **Chunked prefill (always on in V1)**: raise `--max-num-batched-tokens` (default 2048 since PR #10544) if TTFT > SLO; lower if ITL > SLO.
7. **CUDA graphs**: keep `FULL_AND_PIECEWISE` (default); align `--cuda-graph-si