vllm-benchmarking

Solid

Run production vLLM benchmarks — `vllm bench` (serve, throughput, latency, sweep, startup, mm-processor), request-rate vs max-concurrency semantics, TTFT/TPOT/ITL/E2EL percentiles, goodput SLO measurement, prefix-cache workloads, air-gapped operation (HF_ENDPOINT, ModelScope, hf-mirror, offline cache). Methodology split — SLO health checks vs A/B change sweeps — plus pitfalls that produce misleading numbers (no warmup, wrong tokenizer, random-as-prod, `--request-rate inf` alone).

AI & Automation · 3 stars · 1 fork · Updated 2 days ago · MIT

Quality Score: 79/100

Score weights: Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%

Skill Content

# vLLM benchmarking

Target audience: operators producing defensible latency/throughput numbers against production or pre-production vLLM deployments, on datacenter GPUs, often in containerized or air-gapped environments.

**This skill measures; it does not tune.** Once a number is trusted and the verdict is "too slow", the knobs live elsewhere in the `vllm` plugin: **`vllm-performance-tuning`** (scheduler, MoE kernels, CUDA graphs, parallelism), **`vllm-caching`** (KV tiering when the bottleneck is prefill or cache capacity), **`vllm-nvidia-hardware`** (the SKU's own ceiling). Measure → change one thing → re-measure with the same methodology; a tuning change compared against a differently-shaped benchmark run is not evidence.

## Why this matters

Bad benchmarks are worse than no benchmarks: they drive the wrong decisions with false confidence. The three common failure modes:

1. **Wrong methodology.** `--request-rate inf` answers "saturation throughput," not "TTFT my users see." Mixing those up leads to buying GPUs to solve a latency problem, or shipping a latency regression because total throughput looked fine.
2. **Wrong workload.** `--dataset-name random` has zero prefix structure. Real coding-agent or RAG traffic has heavy prefix reuse. Benchmarking caching wins on random produces numbers that don't survive contact with prod.
3. **No warmup / wrong tokenizer.** First N requests hit cold CUDA graphs. Token counts are fiction unless `--tokenizer` matches the served model...
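
The first failure mode can be illustrated with a toy queueing model. The sketch below is not vLLM and does not model batching, prefill, or decode; it assumes a hypothetical single-server FIFO queue with a fixed service time, and every name in it (`simulate_fifo`, `percentile`, `SERVICE_TIME`) is made up for illustration. It only shows why a run where every request arrives at once reports the server's throughput ceiling, while a paced run reports the latency a user at that load would see.

```python
import math


def simulate_fifo(arrivals, service_time):
    """Toy single-server FIFO queue (an assumption, not a vLLM model).

    Returns (per-request latencies, time at which the last request finishes).
    """
    latencies = []
    server_free_at = 0.0
    for t in arrivals:
        start = max(t, server_free_at)  # wait while the server is busy
        server_free_at = start + service_time
        latencies.append(server_free_at - t)
    return latencies, server_free_at


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


N = 100
SERVICE_TIME = 0.1  # hypothetical seconds of work per request

# Saturation: every request arrives at t=0, analogous to an unbounded request rate.
saturated, sat_makespan = simulate_fifo([0.0] * N, SERVICE_TIME)

# Paced: 5 requests/s, half of this toy server's capacity of 10 requests/s.
paced, paced_makespan = simulate_fifo([i / 5.0 for i in range(N)], SERVICE_TIME)

sat_throughput = N / sat_makespan
paced_throughput = N / paced_makespan
sat_p99 = percentile(saturated, 99)
paced_p99 = percentile(paced, 99)

print(f"saturated: {sat_throughput:.2f} req/s, p99 latency {sat_p99:.2f} s")
print(f"paced:     {paced_throughput:.2f} req/s, p99 latency {paced_p99:.2f} s")
```

In this toy, the saturated run shows roughly twice the throughput of the paced run, but its p99 latency is almost entirely queueing delay and says nothing about what a user at 5 requests/s experiences. Conversely, the paced run gives a meaningful latency figure but cannot reveal the throughput ceiling.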

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: 2 days ago
Language: Python
License: MIT

Integrates with

Anthropic (AI) · Docker (Infrastructure) · Kubernetes (Infrastructure)

Bundled in these plugins

skills

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

vllm-performance-tuning

vLLM performance-tuning operator reference — tuning workflow (baseline → bottleneck → knob → re-bench), fused-MoE kernel autotune (`benchmark_moe.py` generates `E=N,N=M,device_name=X.json` configs), DeepEP all-to-all + expert parallelism + EPLB, CUDA graph modes (FULL_AND_PIECEWISE default), torch.compile AOT + compile cache, scheduler knobs (`--max-num-batched-tokens`, `--max-num-seqs`, `--async-scheduling`), TP/EP/DP/PP decision tree, NCCL/DCGM on H100/H200/B200/GB200, PD disaggregation (Nixl/Mooncake/LMCache), known regressions + vendor quirks (v0.14→0.15.1 MiniMax, MI300X FP8<BF16, DeepGEMM M<128 TTFT).

3 Updated 2 days ago

air-gapped

AI & Automation Solid

vllm-observability

Observe production vLLM — `/metrics` Prometheus surface (V1 engine), SLO-driven alerting on TTFT/ITL/queue/KV/preemption/aborts/corrupted-logits, shipping Grafana dashboards in `examples/observability/`, OTLP tracing with `--otlp-traces-endpoint` and `--collect-detailed-traces={model,worker,all}`, diagnostic rules to triage from /metrics alone — queue-grows + TPOT-stable means capacity, queue-stable + TPOT-grows means context/model, DCGM `SM_OCCUPANCY` is the real GPU-saturation signal not `GPU_UTIL`. V1 metric names (kv_cache_usage_perc), gpu_→kv_ rename saga, DCGM-exporter pairing, dashboard-lying pitfalls.

3 Updated 2 days ago

air-gapped

AI & Automation Solid

vllm-caching

vLLM tiered KV cache configuration for production H100/H200 clusters. Native CPU offload, LMCache (CPU+NVMe+GDS), NixlConnector (disaggregated prefill), MooncakeConnector (RDMA), MultiConnector composition. Version gates, sizing math (flag total across TP, not per-GPU — opposite of SGLang), KV-vs-weights offload distinction operators most often get wrong.

3 Updated 2 days ago

air-gapped