vllm-speculative-decoding

Pick, configure, tune, and monitor vLLM speculative decoding in production. Covers the eleven SpeculativeMethod options (ngram, ngram_gpu, medusa, mlp_speculator, draft_model, suffix, eagle, eagle3, dflash, mtp, extract_hidden_states), the `--speculative-config` JSON schema, which methods pair with which target model family, the Prometheus acceptance metric surface, version gates (v0.11.1 EAGLE-3 preamble fix, v0.16 parallel drafting, v0.18 ngram_gpu, v0.19 dflash and zero-bubble), composability with chunked prefill / PP / LoRA / FP8 / structured outputs, the Arctic Inference plugin, and where spec-dec stops paying off at high batch sizes.
air-gapped/skills · ★ 2 · AI & Automation · score 78

Install: claude install-skill air-gapped/skills
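
The summary above names a `--speculative-config` JSON payload. As a rough illustration only, the sketch below builds such a payload for the `ngram` method and prints a shell-safe `vllm serve` command. The field names (`method`, `num_speculative_tokens`, `prompt_lookup_max`) follow the vLLM documentation as I understand it and should be checked against the installed version; the model name and the numeric values are placeholders, not recommendations from this skill, and `build_serve_command` is a hypothetical helper.

```python
import json
import shlex


def build_serve_command(model: str, spec_config: dict) -> str:
    """Return a shell-safe `vllm serve` command carrying a --speculative-config payload."""
    # Compact JSON keeps the command line short; shlex.quote protects braces and quotes.
    payload = json.dumps(spec_config, separators=(",", ":"))
    return " ".join(
        ["vllm", "serve", shlex.quote(model), "--speculative-config", shlex.quote(payload)]
    )


# Assumed field names; values are illustrative placeholders.
ngram_config = {
    "method": "ngram",            # prompt-lookup drafting, no separate draft checkpoint
    "num_speculative_tokens": 4,  # k: tokens proposed per target step
    "prompt_lookup_max": 4,       # longest n-gram to match against the prompt
}

command = build_serve_command("meta-llama/Llama-3.1-8B-Instruct", ngram_config)
print(command)
```

Generating the JSON from a dictionary rather than hand-writing it avoids quoting mistakes when the command is pasted into a shell or a deployment manifest.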

# vLLM speculative decoding: operator skill

For production vLLM operators deciding which speculative method fits a given model + workload, configuring it correctly, wiring the acceptance metrics into their dashboards, and diagnosing why a deployment isn't seeing the expected speedup.

## When spec-dec wins, when it loses

Spec-dec amortises memory-bandwidth-bound decode by letting a cheap proposer guess k tokens that a single target-model forward can verify in parallel.

- **Wins at low concurrency (BS 1–8)**: decode is bandwidth-bound, k verified tokens per target step → 1.5–3× throughput on a well-matched target+drafter. EAGLE-3 on Llama-3.1-8B: +32% TPOT over EAGLE-1 at BS=4 (vLLM v0.11.1+, PR #25916). DFlash on Qwen3-8B: 3.5× at BS=1, 1.6× at BS=32 (PR #36847).
- **Hurts at high concurrency (BS ≥ 32)**: target becomes compute-bound, draft latency is no longer hidden, rejections waste GPU time. Red Hat, Snowflake and the P-EAGLE author all report this. Gate spec-dec to the low-concurrency tier of a disagg or multi-tenant deployment, or disable above a threshold.
- **Domain mismatch sinks acceptance**: stock EAGLE-3 checkpoints are chat-tuned. Code / agentic / RL-rollout workloads see AL drop from ~3 to ~2. Measure on actual traffic before trusting vendor numbers.

## Method selection

Pick once by target-model family and workload shape. Full per-method detail in `references/methods.md`; MTP in `references/mtp.md`; EAGLE-3 specifics including P-EAGLE in `r