
lmcache-mp

LMCache multiprocess (MP) mode: a standalone LMCache server in its own pod/process that vLLM connects to over ZMQ. Provides process isolation, no GIL contention on the inference path, one cache shared by multiple vLLM pods on the same node, and CPU-memory scaling independent of GPU memory. Covers the `LMCacheMPConnector` path (the new direction; the in-process `LMCacheConnectorV1` path still works but is being superseded), the DaemonSet+Deployment K8s pattern, the L1 (CPU DRAM) + L2 (NIXL POSIX / GDS / HF3FS, plain fs, mooncake_store, s3) cascade, the `lmcache/standalone:nightly` + `lmcache/vllm-openai:latest-nightly` image pair vs stock `vllm/vllm-openai`, and the production gotchas (`--no-enable-prefix-caching` on the vLLM side, `--disable-hybrid-kv-cache-manager` required, vLLM/lmcache version compatibility, hybrid models NOT supported yet, `cache_salt` fallback adapter bug).
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: claude install-skill air-gapped/skills
# LMCache multiprocess (MP) mode

Target audience: operators running vLLM on H100/H200/B200-class GPUs in production who need KV-cache extension beyond HBM and have outgrown the in-process LMCache path. Assumes Kubernetes or bare container deployment.

## Why this exists separately from `vllm-caching`

`vllm-caching` covers vLLM's **native** CPU-offload (`--kv-offloading-size`, `OffloadingConnector`) and the **in-process** `LMCacheConnectorV1` (LMCache linked into the vLLM worker). MP mode is structurally different:

- LMCache runs in its **own process / container / pod** with its own CPU and memory budget.
- vLLM talks to it over **ZMQ** (DEALER/ROUTER pattern, default port 5555).
- One LMCache server can serve **multiple vLLM pods on the same node**; they share the L1 cache.
- L2 cascade (NVMe, S3, Mooncake, HF3FS) is configured on the LMCache side, not vLLM side.

Different image pair, different deployment shape, different troubleshooting surface. Hence its own skill.

## Decision tree: pick a path

Ask in order:

1. **Single vLLM pod, only need CPU DRAM tier, no node-shared cache?** → Native offload (`--kv-offloading-size N --kv-offloading-backend native --disable-hybrid-kv-cache-manager`). Zero extra pods. Use the `vllm-caching` skill, not this one.
2. **Single vLLM pod, need NVMe as a third tier, but no other pod will share the cache?** → In-process `LMCacheConnectorV1` (`--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'` + `LMCAC