vllm-quantization


vLLM datacenter-GPU quantization — picking, configuring, troubleshooting NVFP4, FP8, MXFP4, MXFP8, AWQ, GPTQ, INT8, compressed-tensors, modelopt, quark on H100/H200/B200/B300/GB200/GB300. 29 `--quantization` flag values, KV-cache dtypes (fp8_e4m3, nvfp4, per-token-head, turboquant), MoE backend selection (CUTLASS, TRTLLM, FlashInfer, DeepGEMM, Marlin, Qutlass), producing checkpoints with llm-compressor and NVIDIA ModelOpt (NVFP4_DEFAULT_CFG, FP8_DEFAULT_CFG, W4A16, SmoothQuant+GPTQ), online quantization (`fp8_per_tensor`, `fp8_per_block`), training EAGLE-3/dflash drafters on BF16 targets before PTQ, version gates per vLLM release (v0.14 → v0.21).

AI & Automation · 3 stars · 1 fork · Updated yesterday · MIT



Skill Content

# vLLM quantization — operator skill

**Last verified:** 2026-04-24 — see `references/sources.md` for per-ref audit table.

For production vLLM operators on **H100 / H200 / B200 / B300 / GB200 / GB300** fleets deciding which quantization format fits a given target model, producing a checkpoint vLLM will actually load, wiring the right KV-cache dtype, diagnosing accuracy or throughput regressions after an upgrade, and composing quantization with speculative decoding / LoRA / MoE.

Pointer-map format: this SKILL.md picks the format and CLI; the files in `references/` hold the per-format deep dives, exact source pointers, and troubleshooting cards. Follow the link, don't paraphrase from memory — the quantization layer moves faster than any other subsystem in vLLM (six formats landed in v0.19 alone).

## When quantization wins, when it doesn't

Quantization trades weight precision for memory + compute:

- **KV-capacity bound** (long context, high concurrency) — FP8 or NVFP4 **KV cache** gives a 2×/4× KV-capacity multiplier; weight format matters much less than getting `--kv-cache-dtype` right. Measure `kv_cache_usage_perc`.
- **Memory-bandwidth bound** (small batch, decode-heavy, 70B+ on < 8 GPUs) — weight quantization (NVFP4 / FP8 / W4A16) reduces HBM traffic per token, giving 1.5–3× decode throughput on a well-matched target+kernel.
- **Compute bound** (prefill, large batch, small model) — quantization may not help; Blackwell FP4 Tensor Cores are the first architectur...
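
The 2×/4× KV-capacity multiplier quoted for the KV-capacity-bound case follows from nominal element width alone. The sketch below is a back-of-envelope calculation, not vLLM code: the function names, the 40 GiB KV budget, and the model shape (80 layers, 8 KV heads, head dimension 128) are hypothetical values chosen for illustration, and the per-block scale metadata that FP8 and NVFP4 caches carry is ignored, so real capacity will come in somewhat under these figures.

```python
# Nominal bits per KV-cache element. Assumption: scale-factor metadata
# stored alongside quantized blocks is ignored.
BITS_PER_ELEMENT = {"bf16": 16, "fp8_e4m3": 8, "nvfp4": 4}


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_dtype):
    """Bytes of KV cache one token occupies (K and V, across all layers)."""
    elements = 2 * num_layers * num_kv_heads * head_dim
    return elements * BITS_PER_ELEMENT[kv_dtype] // 8


def kv_token_capacity(kv_budget_bytes, num_layers, num_kv_heads, head_dim, kv_dtype):
    """How many tokens fit in a fixed KV-cache memory budget."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_dtype)
    return kv_budget_bytes // per_token


if __name__ == "__main__":
    budget = 40 * 1024**3  # hypothetical 40 GiB left for KV cache after weights
    shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)  # hypothetical GQA model
    baseline = kv_token_capacity(budget, kv_dtype="bf16", **shape)
    for dtype in BITS_PER_ELEMENT:
        tokens = kv_token_capacity(budget, kv_dtype=dtype, **shape)
        print(f"{dtype:9s} {tokens:>8d} tokens  ({tokens / baseline:.0f}x bf16)")
```

Under these assumptions the capacity doubles with each halving of element width, which is why the KV-cache dtype dominates the weight format when the workload is KV-capacity bound.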

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: yesterday
Language: Python
License: MIT

