vllm-speculative-decoding
Install: `claude install-skill air-gapped/skills`
# vLLM speculative decoding — operator skill
For production vLLM operators deciding which speculative method fits a given
model + workload, configuring it correctly, wiring the acceptance metrics into
their dashboards, and diagnosing why a deployment isn't seeing the expected
speedup.
## When spec-dec wins, when it loses
Spec-dec amortises memory-bandwidth-bound decode by letting a cheap proposer
guess k tokens that a single target-model forward pass then verifies in parallel.
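
To make the trade-off concrete, here is a minimal back-of-envelope sketch in Python. It assumes each drafted token is accepted independently with probability `alpha` and that the proposer drafts its k tokens sequentially; real acceptance is position- and content-dependent, so treat the output as a rough guide only. The function names and the `draft_cost_ratio` parameter are illustrative and are not part of vLLM.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes (simplification) that each of the k drafted tokens is accepted
    independently with probability alpha. The target always contributes one
    token itself, so the result lies between 1 and k + 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Closed form of the geometric series 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Idealised speedup over plain autoregressive decode.

    draft_cost_ratio is the (hypothetical) cost of one proposer step divided
    by the cost of one target step. One speculative step costs k proposer
    steps plus one target step.
    """
    step_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(estimated_speedup(alpha, k=3, draft_cost_ratio=0.05), 2))
```

A value below 1.0 means the drafts cost more than they save, which is the regime the high-concurrency bullet below describes once the proposer's latency is no longer hidden.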
- **Wins at low concurrency (BS 1–8)**: decode is bandwidth-bound, so verifying
several drafted tokens per target step yields 1.5–3× throughput on a well-matched
target+drafter. EAGLE-3 on Llama-3.1-8B: 32% TPOT improvement over EAGLE-1 at BS=4
(vLLM v0.11.1+, PR #25916). DFlash on Qwen3-8B: 3.5× at BS=1, 1.6× at BS=32
(PR #36847).
- **Hurts at high concurrency (BS ≥ 32)**: the target becomes compute-bound, draft
latency is no longer hidden, and rejected drafts waste GPU time. Red Hat, Snowflake
and the P-EAGLE author all report this. Gate spec-dec to the low-concurrency
tier of a disaggregated or multi-tenant deployment, or disable it above a
batch-size threshold.
- **Domain mismatch sinks acceptance**: stock EAGLE-3 checkpoints are chat-tuned.
Code / agentic / RL-rollout workloads see acceptance length (AL) drop from ~3 to
~2. Measure on your actual traffic before trusting vendor numbers.
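
The measuring and gating advice above can be sketched as a small check over counter deltas from one scrape window. This is an illustration, not vLLM code: the `SpecDecodeWindow` field names are hypothetical and need to be mapped to whatever draft and accepted-token counters your vLLM version exports, and the default thresholds (`max_batch_size=32`, `min_al=2.0`) are placeholders taken from the rough figures in the list, not tuned values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecDecodeWindow:
    """Counter deltas over one observation window (hypothetical field names)."""
    num_drafts: int           # speculative steps verified by the target
    num_draft_tokens: int     # tokens proposed by the drafter
    num_accepted_tokens: int  # proposed tokens the target accepted


def mean_acceptance_length(w: SpecDecodeWindow) -> float:
    """Tokens emitted per target step: one from the target plus accepted drafts."""
    if w.num_drafts == 0:
        return 1.0  # no drafts observed, so each step yields exactly one token
    return 1.0 + w.num_accepted_tokens / w.num_drafts


def acceptance_rate(w: SpecDecodeWindow) -> float:
    """Fraction of drafted tokens that were accepted."""
    if w.num_draft_tokens == 0:
        return 0.0
    return w.num_accepted_tokens / w.num_draft_tokens


def should_enable_spec_decode(
    w: SpecDecodeWindow,
    running_batch_size: int,
    max_batch_size: int = 32,  # placeholder: spec-dec tends to hurt at BS >= 32
    min_al: float = 2.0,       # placeholder: below this, drafts rarely pay off
) -> bool:
    """Keep spec-dec on only at low concurrency and with healthy acceptance."""
    if running_batch_size >= max_batch_size:
        return False
    return mean_acceptance_length(w) >= min_al


if __name__ == "__main__":
    window = SpecDecodeWindow(num_drafts=100, num_draft_tokens=300, num_accepted_tokens=200)
    print(mean_acceptance_length(window), should_enable_spec_decode(window, running_batch_size=4))
```

Computing AL from your own traffic in this way, rather than reading it off a vendor benchmark, is what surfaces a domain mismatch before it shows up as a missing speedup.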
## Method selection
Pick once by target-model family and workload shape. Full per-method detail in
`references/methods.md`; MTP in `references/mtp.md`; EAGLE-3 specifics including
P-EAGLE in `r