vllm-nvidia-hardware
Install: claude install-skill air-gapped/skills
# vLLM on NVIDIA hardware — Hopper through Rubin
Target audience: operators who run vLLM on NVIDIA datacenter GPUs, sizing from single H100 nodes up to GB300 NVL72 racks, and evaluating Vera Rubin for 2026–2027 purchases.
This skill is a **reference**, not a walkthrough — most of the content is SKU tables, facility prerequisites, and platform compatibility matrices. The SKILL.md body holds the quick-answer shortcuts; the `references/` directory has the full tables. Read the reference file that matches the question.
## The one thing to know before anything else
LLM inference has two phases with radically different bottlenecks:
- **Prefill** is compute-bound: its large GEMMs have an arithmetic intensity (AI, FLOPs per byte moved) far above the roofline ridge point, so more FLOPs help.
- **Decode** is memory-bandwidth-bound: AI ≈ 1, roughly 100× below the ridge, so more HBM bandwidth helps and more FLOPs don't.
Every hardware decision (FP4 vs FP8, B300's higher FLOPs at the same 8 TB/s, NVL72's collapse of 72 GPUs into a single NVLink domain, Rubin's HBM4 jump to ~20 TB/s) is about relieving the memory wall on decode while keeping prefill healthy. Read `references/fundamentals.md` for the roofline math and the HBM roadmap context that makes the rest of the tables meaningful.
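The prefill/decode split can be sketched with a few lines of roofline arithmetic. This is a minimal illustration, not the math from `references/fundamentals.md`: the function names are hypothetical, the peak-FLOPs figure is a round placeholder rather than a SKU spec, and the intensity estimate assumes weight traffic dominates (it ignores KV-cache and activation bytes).

```python
def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which compute and bandwidth balance."""
    return peak_flops / mem_bw


def attainable_flops(ai: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline: throughput is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, ai * mem_bw)


def weight_bound_ai(tokens_per_pass: int, bytes_per_weight: float) -> float:
    """Rough AI of a dense layer: 2 FLOPs per weight per token,
    with each weight read once per pass (assumption: weights dominate traffic)."""
    return 2.0 * tokens_per_pass / bytes_per_weight


def bottleneck(ai: float, peak_flops: float, mem_bw: float) -> str:
    if ai >= ridge_point(peak_flops, mem_bw):
        return "compute-bound"
    return "memory-bandwidth-bound"


# Placeholder accelerator: 1 PFLOP/s peak (assumed), 8 TB/s HBM.
PEAK = 1e15
BW = 8e12

decode_ai = weight_bound_ai(tokens_per_pass=1, bytes_per_weight=2)      # batch-1 decode, 16-bit weights
prefill_ai = weight_bound_ai(tokens_per_pass=2048, bytes_per_weight=2)  # 2048-token prompt

print(ridge_point(PEAK, BW))                     # ridge in FLOPs/byte
print(bottleneck(decode_ai, PEAK, BW))           # decode sits far below the ridge
print(bottleneck(prefill_ai, PEAK, BW))          # prefill sits above it
print(attainable_flops(decode_ai, 2 * PEAK, BW)) # doubling FLOPs leaves decode unchanged
print(attainable_flops(decode_ai, PEAK, 2 * BW)) # doubling bandwidth doubles it
```

Under these assumptions, batching more decode requests per pass raises `tokens_per_pass` and moves decode toward the ridge, which is why batch size matters as much as the hardware figures.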
## Quick-answer router
**Hardware specs** ("what's the HBM on X?", "TDP of Y?")
- NVIDIA GPU SKUs (Hopper, Blackwell, Blackwell Ultra) → `references/gpu-specs.md`
- Vera Rubin roadmap (R100, Rubin Ultra, NVL144, Kyber NVL576) → `references/rubin-roadmap.md`
- Dell PowerEdge XE servers → `references/dell-xe.md`
- GB300 NVL