vllm-observability

Observe production vLLM — `/metrics` Prometheus surface (V1 engine), SLO-driven alerting on TTFT/ITL/queue/KV/preemption/aborts/corrupted-logits, shipping Grafana dashboards in `examples/observability/`, OTLP tracing with `--otlp-traces-endpoint` and `--collect-detailed-traces={model,worker,all}`, diagnostic rules to triage from /metrics alone — queue-grows + TPOT-stable means capacity, queue-stable + TPOT-grows means context/model, DCGM `SM_OCCUPANCY` is the real GPU-saturation signal not `GPU_UTIL`. V1 metric names (kv_cache_usage_perc), gpu_→kv_ rename saga (PR #24245 / revert #25392), DCGM-exporter pairing, dashboard-lying pitfalls.
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: claude install-skill air-gapped/skills
# vLLM observability

Target audience: operators running production vLLM on H100/H200 fleets, usually containerized, usually on Kubernetes, on-call for latency and throughput SLOs.

## Why this matters

`nvidia-smi` can show a perfectly healthy GPU while TTFT is 11 seconds. Raw throughput in `tok/s` can be rising while user-visible P99 TTFT is cratering. Every production incident this skill exists to catch shares one structural problem: aggregate numbers and hardware counters lie, and only the vLLM-internal per-request distributions tell the truth.

Two operator-facing outcomes matter:

1. **Alerting that wakes the right person for the right reason**: TTFT/ITL tail, queue depth, preemption rate, corrupted logits.
2. **Diagnosis from /metrics alone**: a small number of metric patterns distinguish "out of capacity" from "stuck scheduler" from "hot long-context outlier" without SSH'ing to the pod.

## The core diagnostic rule

When something feels slow, read the ratio, not the absolute:

| Queue depth | TPOT / ITL | Most likely cause |
|---|---|---|
| Rising | Stable | **Capacity shortage**: scale out or increase `max-num-seqs` |
| Stable | Rising | **Context / model-side**: long-context request, CUDA graph recompile, prefix-cache miss |
| Rising | Rising | **Compounding**: usually preemption storm; check `num_preemptions` rate |
| Stable | Stable, but TTFT high | **Scheduler stall**: connector (LMCache/NIXL), head-of-line blocking, or engine-core descheduling (ebpf territ