vllm-configuration

Solid

Configure vLLM completely — YAML config file format, CLI arg precedence, full VLLM_*/HF_*/TRANSFORMERS_* env-var catalog, end-to-end recipe for air-gapped environments (internal HF mirrors, hf-mirror.com, ModelScope, HF_HUB_OFFLINE with pre-seeded cache, gated models offline, trust_remote_code supply-chain implications). VLLM_HOST_IP vs API-host confusion, Kubernetes-service-named-`vllm` env-var poisoning, usage-stats triple opt-out, YAML precedence surprises.

AI & Automation 3 stars 1 forks Updated 2 days ago MIT


Skill Content

# vLLM configuration

Target audience: operators deploying vLLM in production (datacenter GPUs, containerized), often inside networks that can't reach `huggingface.co` directly and need to use internal mirrors or fully offline caches.

## Why this matters

vLLM's config surface is deceptively layered: CLI flags, a YAML `--config` file, `VLLM_*` env vars, and the HuggingFace / Transformers env vars it inherits transparently. The same setting can exist in three places, and the precedence ordering is not intuitive. Getting this wrong produces three classic failure modes:

1. **First-boot network errors**: the operator pre-downloaded weights to a local path, but vLLM still hits `huggingface.co` for a revision check, a missing tokenizer file, or usage stats. The "local path" illusion is incomplete.
2. **Env-var namespace collisions**: a Kubernetes Service named `vllm` injects `VLLM_SERVICE_HOST` / `VLLM_SERVICE_PORT` into every pod, which silently overrides `VLLM_HOST_IP` / `VLLM_PORT`. vLLM's *internal distributed* init then uses the k8s cluster IP and breaks.
3. **`VLLM_HOST_IP` as an API host**: operators alias `--host $VLLM_HOST_IP` assuming symmetry with the API server. `VLLM_HOST_IP` is the **internal inter-worker bind address**, not the OpenAI-compat server host. Using it as the API host breaks TP/PP distributed init.

The fix in every case is understanding the layering. This skill teaches that layering, then gives the operator-facing knobs, then the air-gapped recipe.

## P...
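The env-var namespace collision described under "Why this matters" can be caught before vLLM starts. The sketch below is a minimal preflight check, not part of vLLM: the function name `find_k8s_collisions` and the regular expression are hypothetical, written for illustration. It assumes the standard Kubernetes convention of injecting `{SVCNAME}_SERVICE_HOST`, `{SVCNAME}_SERVICE_PORT`, and Docker-link style `{SVCNAME}_PORT...` variables for each Service, and it treats a `VLLM_PORT` value that looks like a URL (for example `tcp://10.0.0.5:8000`) as injected rather than operator-set.

```python
import os
import re

# Assumption: names Kubernetes would inject for a Service called "vllm".
# This pattern is illustrative and is not taken from vLLM itself.
_K8S_INJECTED = re.compile(
    r"^VLLM_("
    r"SERVICE_HOST"
    r"|SERVICE_PORT(_[A-Z0-9_]+)?"
    r"|PORT_\d+_(TCP|UDP|SCTP)(_(PROTO|PORT|ADDR))?"
    r")$"
)


def find_k8s_collisions(env):
    """Return the subset of env that looks Kubernetes-injected under VLLM_*."""
    hits = {k: v for k, v in env.items() if _K8S_INJECTED.match(k)}
    # VLLM_PORT is a legitimate vLLM variable name, so only flag it when
    # the value is a Docker-link style URL rather than a plain port number.
    port = env.get("VLLM_PORT", "")
    if "://" in port:
        hits["VLLM_PORT"] = port
    return hits


if __name__ == "__main__":
    collisions = find_k8s_collisions(dict(os.environ))
    for name in sorted(collisions):
        print(f"suspicious: {name}={collisions[name]}")
```

Running this as a container entrypoint step (or an init check) would surface the collision early; renaming the Service is still the actual fix.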

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: 2 days ago
Language: Python
License: MIT

Integrates with

OpenAI (AI) · Anthropic (AI) · Hugging Face (AI) · Docker (Infrastructure) · Kubernetes (Infrastructure)

Bundled in these plugins

skills
