sglang-hicache
Install: claude install-skill air-gapped/skills
# SGLang HiCache — hierarchical KV cache
Target audience: operators running SGLang on H100/H200/H20/B200-class datacenter GPUs in production, especially long-context agentic workloads (coding agents, RAG, multi-turn chat) with strong prefix locality.
## Why this matters
Long-context inference is almost always KV-cache bound, not compute bound. HiCache extends effective KV capacity beyond HBM through pinned host DRAM (L2) and an optional distributed L3 (Mooncake / 3FS / NIXL / AIBrix / EIC / SiMM / file). Reported gains: TTFT –56% / throughput ×2 (Novita Qwen3-Coder-480B + 3FS), TTFT –84% on cache hits (Ant Group DeepSeek-R1-671B + Mooncake) — see [LMSYS blog 2025-09-10](https://www.lmsys.org/blog/2025-09-10-sglang-hicache/). LMSYS's headline "up to 6× / up to 80%" is uncited — trust the deployment-specific numbers.
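To make "KV-cache bound" concrete, here is a back-of-the-envelope sketch of how many tokens of KV cache fit in each tier. It is not SGLang code: the `KVShape` class, the `tokens_that_fit` helper, and the 40 GiB / 80 GiB budgets are illustrative assumptions, and the per-token formula applies to standard MHA/GQA attention only (MLA models such as DeepSeek store a compressed latent and need a different formula).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Hypothetical description of a model's KV layout (MHA/GQA only)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # fp16 / bf16

    def bytes_per_token(self) -> int:
        # Each layer stores one K and one V vector per KV head, hence the factor 2.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * self.dtype_bytes


def tokens_that_fit(budget_gib: float, shape: KVShape) -> int:
    """Whole tokens of KV cache that fit in a memory budget given in GiB."""
    return int(budget_gib * 1024**3) // shape.bytes_per_token()


# Illustrative GQA shape: 32 layers, 8 KV heads, head_dim 128, 16-bit KV.
shape = KVShape(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumed budgets, not measurements: HBM left for KV after weights (L1)
# and a pinned host-DRAM pool twice that size (L2).
hbm_kv_gib = 40.0
host_kv_gib = 80.0

l1_tokens = tokens_that_fit(hbm_kv_gib, shape)
l2_tokens = tokens_that_fit(host_kv_gib, shape)

print(f"KV bytes per token: {shape.bytes_per_token()}")
print(f"L1 (HBM) tokens:    {l1_tokens}")
print(f"L2 (host) tokens:   {l2_tokens}")
print(f"L2 / L1 capacity:   {l2_tokens / l1_tokens:.1f}x")
```

With these assumed numbers, each token costs 128 KiB of KV, so 40 GiB of HBM holds roughly 328k tokens, which a handful of concurrent long-context sessions can exhaust. The host tier holds twice as many cached tokens on top of that, which is where the prefix-reuse gains come from.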
**Why this skill exists alongside `vllm-caching`:** as of 2026-04-25, vLLM tier-extension caching is broken for the entire 2026 hybrid-attention model lineup (Gemma-4, Qwen3.5/3.6, gpt-oss, Llama-4) — most KV connectors don't subclass `SupportsHMA`. SGLang HiCache ships partial hybrid support today (v0.5.10) and fills the rest in v0.5.11. For the arch × backend × release matrix and the "should we migrate?" decision tree, see `references/hybrid-models.md`.
> **NIXL deep-dive** — the NIXL transfer library (UCX / GDS / Mooncake / S3-OBJ plugins, agent API, telemetry) lives in the dedicated **`nvidia-nixl`** skill. This skill covers SGLang-side wiring of `--hicach