sglang-model-gateway

Solid

SGLang Model Gateway (`sgl-model-gateway`, formerly `sgl-router`) — Rust router fronting vLLM and SGLang inference workers on Kubernetes. Covers first-class vLLM gRPC backend plus HTTP transparent-proxy for vanilla vLLM, the policy set (six `--policy` values, `cache_aware` default), tokenizer-format dispatch (`tokenizer.json` HF-fast vs `tiktoken.model` BPE — including when neither is required because `cache_aware` is text-based), air-gapped recipe (gateway ignores `HF_ENDPOINT`, mount tokenizer files on PVC only when actually needed), K8s manifests with `model_id` labels and per-model RBAC, three HA mitigations (single + PDB, `sessionAffinity: ClientIP`, `--enable-mesh` CRDT sync), and a pitfall catalog covering the Dec 2025 `sgl-router` → `sgl-model-gateway` rename and over-engineered tokenizer init-container traps.

AI & Automation 3 stars 1 forks Updated yesterday MIT


Quality Score: 79/100

Score weights: Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%

Skill Content

# SGLang Model Gateway: sgl-model-gateway

Target audience: operators running **vLLM and/or SGLang inference on Kubernetes**, fronting workers with a router that does cache-aware load balancing, optional prefill-decode disaggregation, and dynamic worker registration. Especially: hosting **multiple replicas of the same model** behind one address, in **air-gapped clusters with local model mirrors** (no live `huggingface.co`).

## Why this matters

A single vLLM or SGLang process serves one Pod. To scale beyond one GPU, operators either fan out replicas (N Pods, one Service) or run engine-internal data parallelism (`vllm serve --data-parallel-size N`). A plain Kubernetes `Service` round-robins requests, which **fragments per-replica prefix caches**: every replica builds its own copy of the same prefix, and the hit rate divides by ~N.

The Model Gateway is a Rust router that recovers most of that with a **cache-aware policy** (steers same-prefix requests to the same replica), adds health checks, circuit breakers, and retries, exposes a unified OpenAI-compatible endpoint regardless of backend, and integrates with Kubernetes service discovery via label selectors. It also handles prefill-decode disaggregation when phases are split across worker pools.

## Sibling skills: what NOT to duplicate here

This skill stays inside the router's territory. Defer to:

- **`vllm-deployment`**: vLLM pod shape: `/dev/shm` emptyDir, `initialDelaySeconds: 600` (matches `VLLM_ENGINE_READY_TIMEOUT_S`), RHAI...
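
The cache fragmentation described under "Why this matters" can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions: each replica's prefix cache is modeled as an unbounded set, a request counts as a hit when its whole prefix is already in the chosen replica's set, and `prefix_affinity` is a hypothetical hash-based stand-in for "send the same prefix to the same replica". It is not the gateway's actual `cache_aware` algorithm, and all function names here are invented for illustration.

```python
import zlib


def simulate(requests, n_replicas, pick_replica):
    """Return the fraction of requests whose prefix was already cached
    on the replica chosen by pick_replica (toy model, no eviction)."""
    caches = [set() for _ in range(n_replicas)]
    hits = 0
    for i, prefix in enumerate(requests):
        replica = pick_replica(i, prefix, n_replicas)
        if prefix in caches[replica]:
            hits += 1
        else:
            # Cold miss: this replica now builds its own copy of the prefix.
            caches[replica].add(prefix)
    return hits / len(requests)


def round_robin(i, prefix, n):
    # Mimics a plain Service spreading requests evenly, ignoring content.
    return i % n


def prefix_affinity(i, prefix, n):
    # Hypothetical stand-in for cache-aware routing: a stable hash of the
    # prefix always maps it to the same replica.
    return zlib.crc32(prefix.encode()) % n


# 5 distinct shared prefixes, each requested 8 times, interleaved.
prefixes = [f"system-prompt-{k}" for k in range(5)]
requests = [prefixes[i % 5] for i in range(40)]

N_REPLICAS = 4
rr_hit_rate = simulate(requests, N_REPLICAS, round_robin)
affinity_hit_rate = simulate(requests, N_REPLICAS, prefix_affinity)

print(f"round-robin hit rate:     {rr_hit_rate:.3f}")
print(f"prefix-affinity hit rate: {affinity_hit_rate:.3f}")
```

In this toy run, round-robin pays 20 cold misses (5 prefixes × 4 replicas) while affinity routing pays 5, giving hit rates of 0.500 and 0.875. Real caches are bounded and real traffic is skewed, so the numbers are illustrative only.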

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: yesterday
Language: Python
License: MIT

Integrates with

OpenAI (AI), Anthropic (AI), Hugging Face (AI), Kubernetes (Infrastructure), gRPC (API)

Bundled in these plugins

skills

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

senior-model-router-engineer

Use when designing, building, or operating the gateway between applications and LLM or model providers: routing requests across Claude, OpenAI, Gemini, and open weights, enforcing per route SLOs, configuring provider failover, tracking cost per call site, designing prompt and semantic caches, applying per tenant rate limits, supporting BYOK (bring your own key), enforcing zero data retention (ZDR) and regional routing, or wiring gateway observability. Triggers: model router, Vercel AI Gateway, OpenRouter, LiteLLM, Portkey, model fallback, provider failover, cost routing, semantic cache, prompt cache, rate limit per tenant, BYOK, ZDR, prompt logging, multi provider, provider abstraction, model SLO, model version pinning. Produces route configs, fallback policies, tenant rate limit policies, observability event schemas, cost dashboard specs, BYOK custody plans, gateway SLO sheets. Not for the call site prompt, see senior-llm-app-engineer; not for self hosted serving, see senior-mlops-engineer.

0 Updated 1 week ago

iamdemetris

AI & Automation Solid

vllm-deployment

Use this skill when authoring, reviewing, or fixing a vLLM Kubernetes manifest, Docker/Podman pod, or OpenShift ServingRuntime — even when the user does not say "vllm". Triggers on: lab cluster performance practices, cache mount + survival across pod restarts (/root/.cache, VLLM_CACHE_ROOT, TORCHINDUCTOR_CACHE_DIR, TRITON_CACHE_DIR, "do we have caches saved"), HF_TOKEN secret in pod env, liveness + readiness probe tuning (initialDelaySeconds, failureThreshold, "pod takes 12 min to boot"), serve_args review, --enforce-eager rationale, MoE deployment ("ep2 dp2", --enable-expert-parallel, expert-parallel sizing), TP/PP sizing, ConfigMap parser-plugin mount, image tag selection, cold-boot reduction, multi-node LWS + Ray, control planes (llm-d, production-stack, AIBrix, NVIDIA Dynamo, KServe), KEDA autoscaling, GAIE routing, disaggregated prefill/decode (Nixl/Mooncake/LMCache/MORI-IO), RHAIIS on OpenShift (SCC, arbitrary UID, Routes 60s, ModelCar, air-gapped). Lead with operator intent, not vendor names.

3 Updated yesterday

air-gapped

AI & Automation Listed

model-routing

Select which gateway model runs a task, subagent, or agent-team teammate. Use when the user names a model for a piece of work — e.g. "use gpt-5.6-sol for this", "spawn a review subagent on <model>", "have one teammate use <model-a> and another use <model-b>", or "switch this session to <model>". Ensures the exact configured model name is passed through unchanged, with no invented aliases and no silent substitution.

2 Updated today

JuLius5838