vllm-input-modalities

vLLM non-chat inference surfaces: text embeddings (`/v1/embeddings`, `/v2/embed`), reranking/scoring (`/rerank`, `/score`), speech-to-text (`/v1/audio/transcriptions`, `/v1/audio/translations`), and document OCR via VLMs. Covers the 2026 `--runner pooling` flag (replacing `--task embed`), v0.20 deprecations (`score`→`classify`, multitask pooling, `encode`→`token_embed`+`token_classify`), Matryoshka/MRL, ColBERT/ColPali/ColQwen late-interaction MaxSim, Cohere v2 `/v2/embed`, Jina v3/v4/v5 quirks, cross-encoder score templates, Whisper large-v3-turbo quants, and the DeepSeek-OCR recipe (NGramPerReqLogitsProcessor, no prefix cache, GUNDAM mode).

AI & Automation · 3 stars · 1 fork · Updated 2 days ago · MIT

Quality Score: 79/100

Skill Content

# vLLM: embeddings, reranking, speech-to-text, OCR

Target audience: operators who need vLLM's non-chat-completion surfaces. Four capabilities are bundled here because they share operator-facing concepts (`--runner` flag, pooling configuration, scoring API, multimodal preprocessing), even though two run on the pooling runner (embedding, reranking) and two run on the generate runner (STT, OCR).

## The mental model: one flag rules the surface

vLLM decides what a model *does* from the combination of three flags:

```
--runner {auto|generate|pooling|draft}   # what kind of workload
--convert {auto|none|embed|classify}     # adapt a generative LM to a pooler
--pooler-config '{...}'                  # override pool type, dimensions, etc.
```

The pair `(runner, convert)` has replaced the old `--task {generate|embed|score|classify|reward|...}` flag. The old `--task` is **deprecated**: it still works in current releases, but emits a deprecation warning and is scheduled for full removal.

Canonical today:

| Workload | Command | Runner | Notes |
|---|---|---|---|
| Chat / completion | `vllm serve <model>` | `generate` (auto) | default |
| Embedding | `vllm serve <model> --runner pooling` | `pooling` | auto-detects CLS/LAST/MEAN from config |
| Embedding from a causal LM | `vllm serve <model> --runner pooling --convert embed` | `pooling` | adapts `*ForCausalLM` checkpoints |
| Classification | `vllm serve <model> --runner pooling --convert classify` | `pooling` | also how `scor...
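The workload-to-flag mapping in the visible table rows can be sketched as a small helper that assembles the `vllm serve` argument list. This is a minimal illustration, not part of vLLM: the function name `build_serve_command` and the workload labels (`chat`, `embedding`, and so on) are hypothetical, and only the four rows shown in the excerpt are covered. The flag names and values come from the table itself.

```python
import shlex
from typing import Dict, List, Optional, Tuple

# Workload label -> (runner, convert), copied from the table rows in the excerpt.
# The labels are hypothetical names chosen for this sketch, not vLLM identifiers.
WORKLOAD_FLAGS: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "chat": (None, None),  # plain `vllm serve <model>`; runner is auto-resolved
    "embedding": ("pooling", None),
    "embedding_from_causal_lm": ("pooling", "embed"),
    "classification": ("pooling", "classify"),
}


def build_serve_command(model: str, workload: str) -> List[str]:
    """Return the argv list for `vllm serve` matching one table row."""
    if workload not in WORKLOAD_FLAGS:
        raise ValueError(f"unknown workload label: {workload!r}")
    runner, convert = WORKLOAD_FLAGS[workload]
    argv = ["vllm", "serve", model]
    if runner is not None:
        argv += ["--runner", runner]
    if convert is not None:
        argv += ["--convert", convert]
    return argv


if __name__ == "__main__":
    # Print one shell-quoted command line per workload label.
    for label in WORKLOAD_FLAGS:
        print(shlex.join(build_serve_command("my-org/my-model", label)))
```

The model name `my-org/my-model` is a placeholder. Keeping the mapping in one table makes it easy to audit deployments that still pass the deprecated `--task` flag and rewrite them to the `(runner, convert)` pair.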

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: 2 days ago
Language: Python
License: MIT

Integrates with

Anthropic · AI Kubernetes · Infrastructure

Bundled in these plugins

skills
