sglang-model-gateway

SGLang Model Gateway (`sgl-model-gateway`, formerly `sgl-router`) — Rust router fronting vLLM and SGLang inference workers on Kubernetes. Covers first-class vLLM gRPC backend plus HTTP transparent-proxy for vanilla vLLM, eight load-balancing policies, tokenizer-format dispatch (`tokenizer.json` HF-fast vs `tiktoken.model` BPE — including when neither is required because `cache_aware` is text-based), air-gapped recipe (gateway ignores `HF_ENDPOINT`, mount tokenizer files on PVC only when actually needed), K8s manifests with `model_id` labels and per-model RBAC, three HA mitigations (single + PDB, `sessionAffinity: ClientIP`, `--enable-mesh` CRDT sync), and a pitfall catalog covering the Dec 2025 `sgl-router` → `sgl-model-gateway` rename and over-engineered tokenizer init-container traps.
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: claude install-skill air-gapped/skills
# SGLang Model Gateway — sgl-model-gateway

Target audience: operators running **vLLM and/or SGLang inference on Kubernetes**, fronting workers with a router that does cache-aware load-balancing, optional prefill-decode disaggregation, and dynamic worker registration. Especially: hosting **multiple replicas of the same model** behind one address, in **air-gapped clusters with local model mirrors** (no live `huggingface.co`).

## Why this matters

A single vLLM or SGLang process serves one Pod. To scale beyond one GPU, operators either fan-out replicas (N Pods, one Service) or run engine-internal data-parallelism (`vllm serve --data-parallel-size N`). Plain Kubernetes `Service` round-robins requests, which **fragments per-replica prefix caches**: every replica builds its own copy of the same prefix and hit rate divides by ~N.

The Model Gateway is a Rust router that recovers most of that with a **cache-aware policy** (steers same-prefix requests to the same replica), adds health checks / circuit breakers / retries, exposes a unified OpenAI-compatible endpoint regardless of backend, and integrates with Kubernetes service discovery via label selectors. It also handles prefill-decode disaggregation when phases are split across worker pools.

## Sibling skills — what NOT to duplicate here

This skill stays inside the router's territory. Defer to:

- **`vllm-deployment`** — vLLM pod shape: `/dev/shm` emptyDir, `initialDelaySeconds: 600` (matches `VLLM_ENGINE_READY_TIMEOUT_S`), RHAI