vllm-gemma-4-31b

Solid

Operating-point reference for serving Gemma 4 31B on vLLM — TP sizing, max_model_len, max_num_seqs, gpu_memory_utilization, kv_cache_dtype, EAGLE3 spec-dec, chat_template choice.

AI & Automation · 3 stars · 1 fork · Updated 2 days ago · MIT

Quality Score: 79/100

Score weights: Stars 20% · Recency 20% · Frontmatter 20% · Documentation 15% · Issue Health 10% · License 10% · Description 5%

Skill Content

# Gemma 4 31B on vLLM: operating-point reference

For platform engineers deploying `google/gemma-4-31B-it` (BF16, FP8) or its community quants (e.g. `cyankiwi/gemma-4-31B-it-AWQ-4bit`, `RedHatAI/*-Gemma-4-31B-*`) on vLLM 0.20+. It combines measurements from a Verda 2× H100 SXM5 80GB audit on 2026-04-30 with the upstream constraints that shape the answer.

## Three load-bearing facts

1. **Gemma 4 has heterogeneous head dimensions (`head_dim=256`, `global_head_dim=512`)**, which forces vLLM onto the `TRITON_ATTN` backend instead of `FLASH_ATTN`. This is automatic; vLLM logs `Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence`. Don't try to override it with `--attention-backend FLASH_ATTN`: vLLM rejects the override (`kv_cache_dtype not supported`, `partial multimodal token full attention not supported`).

2. **Throughput plateaus at batch=64 on H100 and at batch=128 on H200.** This is *not* a hardcoded vLLM cap; it is HBM-bandwidth-bound saturation. H100 SXM5 has ~3.35 TB/s of HBM3 and H200 has ~4.8 TB/s of HBM3e (~43% more). The knee moves in the same direction as bandwidth, although the 2× batch ratio is larger than the ~1.43× bandwidth ratio. See `references/hbm-saturation.md` for the source-code investigation (`get_batch_defaults()` in `vllm/engine/arg_utils.py` is the only hardware-aware batch default in the engine; H100 and H200 take the *same* code path, re-verified at v0.25.1). **Don't set `max_num_seqs` above the bandwidth knee**: it j...
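As a rough illustration of the second fact, the sketch below compares the quoted bandwidth and batch figures and caps a requested `max_num_seqs` at the measured knee before building a `vllm serve` command line. The helper names (`ratio`, `clamp_max_num_seqs`, `build_serve_args`) are hypothetical and not part of vLLM, and the tensor-parallel size of 2 is an assumption taken from the 2× H100 audit setup. The knee values are the ones reported above; they are measurements for this model, not constants vLLM knows about.

```python
# Throughput knees reported in the text (batch size where throughput plateaus).
BATCH_KNEE = {"H100": 64, "H200": 128}

# Approximate HBM bandwidth in TB/s, as quoted in the text.
HBM_TBPS = {"H100": 3.35, "H200": 4.8}


def ratio(table: dict, num: str, den: str) -> float:
    """Ratio of two entries in a lookup table."""
    return table[num] / table[den]


def clamp_max_num_seqs(requested: int, gpu: str) -> int:
    """Hypothetical helper: cap a requested max_num_seqs at the measured knee."""
    return min(requested, BATCH_KNEE[gpu])


def build_serve_args(model: str, gpu: str, requested_max_num_seqs: int,
                     tensor_parallel_size: int = 2) -> list:
    """Build an argv list for `vllm serve`.

    No attention-backend flag is passed: per the text, vLLM selects
    TRITON_ATTN for this model on its own.
    tensor_parallel_size=2 is an assumption based on the 2x H100 audit.
    """
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-num-seqs", str(clamp_max_num_seqs(requested_max_num_seqs, gpu)),
    ]


if __name__ == "__main__":
    print(f"bandwidth ratio H200/H100: {ratio(HBM_TBPS, 'H200', 'H100'):.2f}")
    print(f"batch-knee ratio H200/H100: {ratio(BATCH_KNEE, 'H200', 'H100'):.2f}")
    print(" ".join(build_serve_args("google/gemma-4-31B-it", "H100", 256)))
```

Keeping the knee in a lookup table rather than deriving it from bandwidth is deliberate: the two ratios differ, so the measured value is the safer input.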

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: 2 days ago
Language: Python
License: MIT

Integrates with

Anthropic · AI Kubernetes · Infrastructure

Bundled in these plugins

skills

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

vllm-nvidia-hardware

NVIDIA AI-hardware + vLLM-platform reference covering Hopper (H100/H200), Blackwell (B100/B200/B300) and Blackwell Ultra, Grace-Blackwell superchips and NVL72 racks (GB200, GB300), Vera Rubin (R100/R300) with VR200 NVL144 and Kyber NVL576, Dell PowerEdge XE family and IR5000/IR7000/IR9048 racks. Per-SKU HBM, FP4/FP8/FP16 TFLOPs, NVLink5, TDP, rack power/cooling (135 kW GB300, 180-220 kW NVL144, 600 kW Kyber), DLC vs RDHx, 800 VDC HVDC. Memory-wall roofline, HBM3E→HBM4 supply 2026. vLLM attention-backend × SM matrix, FP4/FP8 paths, KV connectors, Blackwell gotchas (SM103 TRTLLM hang, 270 vs 288 GB B300 bin split).

3 Updated 2 days ago

air-gapped

AI & Automation Solid

vllm-performance-tuning

vLLM performance-tuning operator reference — tuning workflow (baseline → bottleneck → knob → re-bench), fused-MoE kernel autotune (`benchmark_moe.py` generates `E=N,N=M,device_name=X.json` configs), DeepEP all-to-all + expert parallelism + EPLB, CUDA graph modes (FULL_AND_PIECEWISE default), torch.compile AOT + compile cache, scheduler knobs (`--max-num-batched-tokens`, `--max-num-seqs`, `--async-scheduling`), TP/EP/DP/PP decision tree, NCCL/DCGM on H100/H200/B200/GB200, PD disaggregation (Nixl/Mooncake/LMCache), known regressions + vendor quirks (v0.14→0.15.1 MiniMax, MI300X FP8<BF16, DeepGEMM M<128 TTFT).

3 Updated 2 days ago

air-gapped

AI & Automation Solid

vllm-caching

vLLM tiered KV cache configuration for production H100/H200 clusters. Native CPU offload, LMCache (CPU+NVMe+GDS), NixlConnector (disaggregated prefill), MooncakeConnector (RDMA), MultiConnector composition. Version gates, sizing math (flag total across TP, not per-GPU — opposite of SGLang), KV-vs-weights offload distinction operators most often get wrong.

3 Updated 2 days ago

air-gapped