vllm-observability

Solid

Observe production vLLM — `/metrics` Prometheus surface (V1 engine), SLO-driven alerting on TTFT/ITL/queue/KV/preemption/aborts/corrupted-logits, shipping Grafana dashboards in `examples/observability/`, OTLP tracing with `--otlp-traces-endpoint` and `--collect-detailed-traces={model,worker,all}`, diagnostic rules to triage from /metrics alone — queue-grows + TPOT-stable means capacity, queue-stable + TPOT-grows means context/model, DCGM `SM_OCCUPANCY` is the real GPU-saturation signal not `GPU_UTIL`. V1 metric names (kv_cache_usage_perc), gpu_→kv_ rename saga, DCGM-exporter pairing, dashboard-lying pitfalls.

AI & Automation 3 stars 1 forks Updated 2 days ago MIT

Quality Score: 79/100

Skill Content

# vLLM observability

Target audience: operators running production vLLM on H100/H200 fleets, usually containerized, usually on Kubernetes, on-call for latency and throughput SLOs.

## Why this matters

`nvidia-smi` can show a perfectly healthy GPU while TTFT is 11 seconds. Raw throughput in `tok/s` can be rising while user-visible P99 TTFT is cratering. Every production incident this skill exists to catch shares one structural problem: aggregate numbers and hardware counters lie, and only the vLLM-internal per-request distributions tell the truth.

Two operator-facing outcomes matter:

1. **Alerting that wakes the right person for the right reason**: TTFT/ITL tail, queue depth, preemption rate, corrupted logits.
2. **Diagnosis from `/metrics` alone**: a small number of metric patterns distinguish "out of capacity" from "stuck scheduler" from "hot long-context outlier" without SSH'ing to the pod.

## The core diagnostic rule

When something feels slow, read the ratio, not the absolute:

| Queue depth | TPOT / ITL | Most likely cause |
|---|---|---|
| Rising | Stable | **Capacity shortage**: scale out or increase `max-num-seqs` |
| Stable | Rising | **Context / model-side**: long-context request, CUDA graph recompile, prefix-cache miss |
| Rising | Rising | **Compounding**: usually preemption storm; check `num_preemptions` rate |
| Stable | Stable, but TTFT high | **Scheduler stall**: connector (LMCache/NIXL), head-of-line blocking, or engine-core descheduling (ebpf territ...
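
The diagnostic table above can be sketched as a small classifier. This is a minimal illustration, not part of the skill itself. The helper names (`is_rising`, `diagnose`), the 20% rise threshold, and the half-window comparison are all assumptions made for the example. It takes plain lists of samples, so how those samples are scraped from `/metrics` is left out.

```python
from statistics import mean

# Hypothetical threshold: more than 20% growth across the window counts as "rising".
RISE_THRESHOLD = 0.20


def is_rising(samples, threshold=RISE_THRESHOLD):
    """Return True if the later half of the window averages more than
    `threshold` (relative) above the earlier half."""
    if len(samples) < 4:
        raise ValueError("need at least 4 samples to compare two halves")
    mid = len(samples) // 2
    early, late = mean(samples[:mid]), mean(samples[mid:])
    if early == 0:
        # Any growth from an empty baseline counts as rising.
        return late > 0
    return (late - early) / early > threshold


def diagnose(queue_depth, tpot_seconds, ttft_high=False):
    """Map queue-depth and TPOT/ITL trends onto the rows of the table.

    queue_depth and tpot_seconds are equally spaced samples over the same
    window; ttft_high is the caller's own judgment against its TTFT SLO.
    """
    queue_rising = is_rising(queue_depth)
    tpot_rising = is_rising(tpot_seconds)
    if queue_rising and tpot_rising:
        return "compounding"
    if queue_rising:
        return "capacity shortage"
    if tpot_rising:
        return "context / model-side"
    if ttft_high:
        return "scheduler stall"
    return "no pattern matched"


if __name__ == "__main__":
    # Queue grows while per-token latency stays flat.
    print(diagnose([1, 1, 5, 9], [0.03, 0.03, 0.03, 0.03]))
```

The branch order follows the table: both signals rising is checked first so that a preemption storm is not misreported as a plain capacity shortage. A real alert rule would more likely express the same comparison over rates or quantiles in the monitoring system than in application code.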

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: 2 days ago
Language: Python
License: MIT

Integrates with

Anthropic (AI), Grafana (Monitoring), Docker (Infrastructure), Kubernetes (Infrastructure)

Bundled in these plugins

skills
