vllm-deployment

Use this skill when authoring, reviewing, or fixing a vLLM Kubernetes manifest, Docker/Podman pod, or OpenShift ServingRuntime — even when the user does not say "vllm". Triggers on: lab cluster performance practices, cache mount + survival across pod restarts (/root/.cache, VLLM_CACHE_ROOT, TORCHINDUCTOR_CACHE_DIR, TRITON_CACHE_DIR, "do we have caches saved"), HF_TOKEN secret in pod env, liveness + readiness probe tuning (initialDelaySeconds, failureThreshold, "pod takes 12 min to boot"), serve_args review, --enforce-eager rationale, MoE deployment ("ep2 dp2", --enable-expert-parallel, expert-parallel sizing), TP/PP sizing, ConfigMap parser-plugin mount, image tag selection, cold-boot reduction, multi-node LWS + Ray, control planes (llm-d, production-stack, AIBrix, NVIDIA Dynamo, KServe), KEDA autoscaling, GAIE routing, disaggregated prefill/decode (Nixl/Mooncake/LMCache/MORI-IO), RHAIIS on OpenShift (SCC, arbitrary UID, Routes 60s, ModelCar, air-gapped). Lead with operator intent, not vendor names.
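
The cache, secret, and probe items in the trigger list above can be sketched as a container spec built in Python. This is a minimal illustration, not the skill's own manifest: the secret name `hf-token`, its key `token`, the volume name `model-cache`, the cache subdirectories, and the 60 s / 10 s probe timings are all assumptions chosen for the example. The only parts taken as given are the environment variable names from the description, vLLM's `/health` endpoint on its default port 8000, and the "pod takes 12 min to boot" figure.

```python
import math


def probe_failure_threshold(boot_seconds: int, initial_delay: int, period: int) -> int:
    """Failure budget so a liveness probe tolerates roughly `boot_seconds` of startup.

    The kubelet starts probing after `initial_delay` and restarts the container
    after `failureThreshold` consecutive failures, one per `period`. One extra
    failure is added as a safety margin (an assumption of this sketch).
    """
    remaining = max(boot_seconds - initial_delay, 0)
    return math.ceil(remaining / period) + 1


def vllm_container(image: str, cache_dir: str = "/root/.cache", boot_seconds: int = 720) -> dict:
    """Build a container spec dict; names marked hypothetical are illustrative only."""
    initial_delay, period = 60, 10  # hypothetical timings
    probe = {"httpGet": {"path": "/health", "port": 8000}, "periodSeconds": period}
    return {
        "name": "vllm",
        "image": image,
        "env": [
            # Point every compile/cache directory at the mounted volume so the
            # caches survive pod restarts (subdirectory layout is hypothetical).
            {"name": "VLLM_CACHE_ROOT", "value": f"{cache_dir}/vllm"},
            {"name": "TORCHINDUCTOR_CACHE_DIR", "value": f"{cache_dir}/torchinductor"},
            {"name": "TRITON_CACHE_DIR", "value": f"{cache_dir}/triton"},
            # HF_TOKEN comes from a Secret rather than a literal value
            # (secret name and key are hypothetical).
            {"name": "HF_TOKEN",
             "valueFrom": {"secretKeyRef": {"name": "hf-token", "key": "token"}}},
        ],
        "volumeMounts": [{"name": "model-cache", "mountPath": cache_dir}],
        "livenessProbe": {
            **probe,
            "initialDelaySeconds": initial_delay,
            "failureThreshold": probe_failure_threshold(boot_seconds, initial_delay, period),
        },
        # Readiness failures only remove the pod from Service endpoints,
        # so a small threshold is enough here.
        "readinessProbe": {**probe, "failureThreshold": 3},
    }
```

Dumping `vllm_container("<your-image>")` with any YAML or JSON serializer gives a fragment to paste under `spec.containers`. The volume itself must be declared separately and backed by persistent storage, or the caches will not survive a restart.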

AI & Automation 3 stars 1 forks Updated 2 days ago MIT

Quality Score: 79/100

Skill Content

# vLLM deployment (Kubernetes first, Docker lab, OpenShift sidebar)

Target audience: platform engineers bringing up vLLM on production Kubernetes (H100/H200/B200/B300 fleets), and individual researchers running 1-to-2-node Docker / Podman setups in a lab.

This skill is a **pointer map**. It points to the canonical sources — in the vLLM repo, in docs.vllm.ai, in the ecosystem repos, and to the load-bearing blog posts — rather than paraphrasing them. Paraphrase rots; pointers survive.

## Decision guide — pick the path

| Situation | Go to |
|---|---|
| Single node, 1 container, TP ≤ 8 | `references/docker-lab.md` |
| Single host, 2 containers for PD disagg lab | `references/docker-lab.md` (compose template) + `references/disagg.md` |
| k8s, single model fits 1 pod | `references/pod-shape.md` + in-tree helm chart |
| k8s, model needs multi-node TP/PP | `references/multi-node.md` (LWS + `multi-node-serving.sh`) |
| k8s fleet, router + LMCache + observability bundled | `vllm-production-stack` (Helm) — see `references/ecosystem.md` |
| k8s fleet, disagg P/D + KV-aware + GAIE + SLA scheduler | `llm-d` — see `references/ecosystem.md` |
| k8s fleet, ByteDance-scale multi-tenant LoRA + heterogenous GPU | `AIBrix` — see `references/ecosystem.md` |
| NVIDIA reference stack on prem / EKS / AKS with NIXL | `NVIDIA Dynamo` — see `references/ecosystem.md` |
| OpenShift / RHOAI | `references/openshift.md` + RHAIIS images |
| Routing / load balancing across pods | `references/routing.md` (G...
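
The single-deployment rows of the decision guide can be read as a small lookup. The sketch below encodes only those rows. The `Situation` fields and the platform labels `"docker"`, `"k8s"`, and `"openshift"` are hypothetical names introduced for illustration. The fleet-level rows (which all point at `references/ecosystem.md`) and the truncated routing row are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    # All field names and platform labels are hypothetical, for illustration only.
    platform: str                 # "docker", "k8s", or "openshift"
    multi_node: bool = False      # model needs multi-node TP/PP
    pd_disagg_lab: bool = False   # two containers on one host for a PD disagg lab


def pick_references(s: Situation) -> list:
    """Return the reference files the decision guide points to for a situation."""
    if s.platform == "openshift":
        return ["references/openshift.md"]
    if s.platform == "docker":
        refs = ["references/docker-lab.md"]
        if s.pd_disagg_lab:
            # The PD disagg lab row adds the disagg reference to the Docker one.
            refs.append("references/disagg.md")
        return refs
    if s.platform == "k8s":
        if s.multi_node:
            return ["references/multi-node.md"]
        return ["references/pod-shape.md"]
    raise ValueError(f"unknown platform: {s.platform!r}")
```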

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: 2 days ago
Language: Python
License: MIT

Integrates with

Anthropic · AI Kubernetes · Infrastructure

Bundled in these plugins

skills
