Install: `claude install-skill air-gapped/skills`
# vLLM quantization — operator skill
**Last verified:** 2026-04-24 — see `references/sources.md` for per-ref audit table.
For production vLLM operators on **H100 / H200 / B200 / B300 / GB200 / GB300** fleets
deciding which quantization format fits a given target model, producing a
checkpoint vLLM will actually load, wiring the right KV-cache dtype, diagnosing
accuracy or throughput regressions after an upgrade, and composing quantization
with speculative decoding / LoRA / MoE.
Pointer-map format: this SKILL.md picks the format and CLI; the files in
`references/` hold the per-format deep dives, exact source pointers, and
troubleshooting cards. Follow the link, don't paraphrase from memory — the
quantization layer moves faster than any other subsystem in vLLM (six formats
landed in v0.19 alone).
## When quantization wins, when it doesn't
Quantization trades weight precision for memory and compute savings. Whether the trade pays off depends on which resource bounds the workload:
- **KV-capacity bound** (long context, high concurrency) — FP8 or NVFP4 **KV
cache** gives roughly a 2× (FP8) or 4× (NVFP4) KV-capacity multiplier over a
16-bit cache; weight format matters much less than getting `--kv-cache-dtype`
right. Measure `kv_cache_usage_perc`.
- **Memory-bandwidth bound** (small batch, decode-heavy, 70B+ on < 8 GPUs) —
weight quantization (NVFP4 / FP8 / W4A16) reduces HBM traffic per token,
giving 1.5–3× higher decode throughput when the target model and kernel are well matched.
- **Compute bound** (prefill, large batch, small model) — quantization may
not help; Blackwell FP4 Tensor Cores are the first architectur