vllm-deployment

Use this skill when authoring, reviewing, or fixing a vLLM Kubernetes manifest, Docker/Podman pod, or OpenShift ServingRuntime — even when the user does not say "vllm". Triggers on: lab cluster performance practices, cache mount + survival across pod restarts (/root/.cache, VLLM_CACHE_ROOT, TORCHINDUCTOR_CACHE_DIR, TRITON_CACHE_DIR, "do we have caches saved"), HF_TOKEN secret in pod env, liveness + readiness probe tuning (initialDelaySeconds, failureThreshold, "pod takes 12 min to boot"), serve_args review, --enforce-eager rationale, MoE deployment ("ep2 dp2", --enable-expert-parallel, expert-parallel sizing), TP/PP sizing, ConfigMap parser-plugin mount, image tag selection, cold-boot reduction, multi-node LWS + Ray, control planes (llm-d, production-stack, AIBrix, NVIDIA Dynamo, KServe), KEDA autoscaling, GAIE routing, disaggregated prefill/decode (Nixl/Mooncake/LMCache/MORI-IO), RHAIIS on OpenShift (SCC, arbitrary UID, Routes 60s, ModelCar, air-gapped). Lead with operator intent, not vendor names.
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: claude install-skill air-gapped/skills
# vLLM deployment (Kubernetes first, Docker lab, OpenShift sidebar)

Target audience: platform engineers bringing up vLLM on production Kubernetes (H100/H200/B200/B300 fleets), and individual researchers running 1-to-2-node Docker / Podman setups in a lab.

This skill is a **pointer map**. It points to the canonical sources (in the vLLM repo, in docs.vllm.ai, in the ecosystem repos, and in the load-bearing blog posts) rather than paraphrasing them. Paraphrase rots; pointers survive.

## Decision guide: pick the path

| Situation | Go to |
|---|---|
| Single node, 1 container, TP ≤ 8 | `references/docker-lab.md` |
| Single host, 2 containers for PD disagg lab | `references/docker-lab.md` (compose template) + `references/disagg.md` |
| k8s, single model fits 1 pod | `references/pod-shape.md` + in-tree helm chart |
| k8s, model needs multi-node TP/PP | `references/multi-node.md` (LWS + `multi-node-serving.sh`) |
| k8s fleet, router + LMCache + observability bundled | `vllm-production-stack` (Helm); see `references/ecosystem.md` |
| k8s fleet, disagg P/D + KV-aware + GAIE + SLA scheduler | `llm-d`; see `references/ecosystem.md` |
| k8s fleet, ByteDance-scale multi-tenant LoRA + heterogeneous GPU | `AIBrix`; see `references/ecosystem.md` |
| NVIDIA reference stack on prem / EKS / AKS with NIXL | `NVIDIA Dynamo`; see `references/ecosystem.md` |
| OpenShift / RHOAI | `references/openshift.md` + RHAIIS images |
| Routing / load balancing across pods | `references/routing.md` (G