vllm-gemma-4-31b
Install: claude install-skill air-gapped/skills
# Gemma 4 31B on vLLM — operating-point reference
For platform engineers deploying `google/gemma-4-31B-it` (BF16, FP8) or its
community quants (e.g. `cyankiwi/gemma-4-31B-it-AWQ-4bit`,
`RedHatAI/*-Gemma-4-31B-*`) on vLLM 0.20+. Pulls together measurements
from a Verda 2× H100 SXM5 80GB audit on 2026-04-30 and the upstream
constraints that shape the answer.
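
As a rough illustration of how these operating points might be turned into a launch command, the sketch below builds a `vllm serve` argument list with `--max-num-seqs` capped at the batch knee this reference reports for each GPU family. The helper `serve_command` and the `BATCH_KNEE` table are hypothetical names introduced here, and `tensor_parallel=2` is an assumption for a 2-GPU node, not a setting confirmed by the audit. No attention backend is passed, since vLLM selects it automatically for this model.

```python
import shlex

# Batch knees reported in this reference, keyed by GPU family.
BATCH_KNEE = {"H100": 64, "H200": 128}


def serve_command(model: str, gpu: str, tensor_parallel: int = 1) -> list[str]:
    """Build a `vllm serve` argv with max_num_seqs capped at the reported knee."""
    if gpu not in BATCH_KNEE:
        raise ValueError(f"no reported batch knee for {gpu!r}")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),  # assumption: one shard per GPU
        "--max-num-seqs", str(BATCH_KNEE[gpu]),
    ]


if __name__ == "__main__":
    cmd = serve_command("google/gemma-4-31B-it", "H100", tensor_parallel=2)
    print(shlex.join(cmd))
```

Keeping the knee values in one table makes it easy to update them if a later audit moves the plateau.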
## Three load-bearing facts
1. **Gemma 4 has heterogeneous head dimensions (`head_dim=256`, `global_head_dim=512`)**, which
forces vLLM onto the `TRITON_ATTN` backend instead of `FLASH_ATTN`. The switch is
automatic; vLLM logs `Gemma4 model has heterogeneous head dimensions
(head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to
prevent mixed-backend numerical divergence`. Don't try to override it
with `--attention-backend FLASH_ATTN`: vLLM rejects the flag (`kv_cache_dtype
not supported`, `partial multimodal token full attention not supported`).
2. **Throughput plateaus at batch=64 on H100, batch=128 on H200.** This is
*not* a hardcoded vLLM cap; it is saturation of HBM bandwidth. H100
SXM5 has ~3.35 TB/s HBM3, H200 has ~4.8 TB/s HBM3e (~43% more). The
knee moves with bandwidth, though only loosely: the batch ratio is 2× while
the bandwidth ratio is ~1.43×. See
`references/hbm-saturation.md` for the source-code investigation
(vllm/engine/arg_utils.py:2207-2288 is the only hardware-aware default
in the engine; H100 and H200 take the *same* code path). **Don't set
`max_num_seqs` above the bandwidth knee** — it just inflates TPOT and
TTFT without moving throu