vllm-gemma-4-31b

Operating-point reference for serving Gemma 4 31B on vLLM — TP sizing, max_model_len, max_num_seqs, gpu_memory_utilization, kv_cache_dtype, EAGLE3 spec-dec, chat_template choice.
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: claude install-skill air-gapped/skills
# Gemma 4 31B on vLLM: operating-point reference

For platform engineers deploying `google/gemma-4-31B-it` (BF16, FP8) or its community quants (e.g. `cyankiwi/gemma-4-31B-it-AWQ-4bit`, `RedHatAI/*-Gemma-4-31B-*`) on vLLM 0.20+. It combines measurements from a Verda 2× H100 SXM5 80GB audit on 2026-04-30 with the upstream constraints that shape the answer.

## Three load-bearing facts

1. **Gemma 4 has heterogeneous head dimensions (`head_dim=256`, `global_head_dim=512`)**, which forces vLLM onto the `TRITON_ATTN` backend instead of `FLASH_ATTN`. This is automatic: vLLM logs `Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence`. Don't try to override it with `--attention-backend FLASH_ATTN`; vLLM rejects it (`kv_cache_dtype not supported`, `partial multimodal token full attention not supported`).
2. **Throughput plateaus at batch=64 on H100, batch=128 on H200.** This is *not* a hardcoded vLLM cap; it is HBM-bandwidth saturation. H100 SXM5 has ~3.35 TB/s HBM3, H200 has ~4.8 TB/s HBM3e (~43% more). The knee moves in the same direction as the bandwidth, though not in proportion: the batch knee doubles for ~1.43× the bandwidth. See `references/hbm-saturation.md` for the source-code investigation (`vllm/engine/arg_utils.py:2207-2288` is the only hardware-aware default in the engine; H100 and H200 take the *same* code path). **Don't set `max_num_seqs` above the bandwidth knee**: it just inflates TPOT and TTFT without moving throu