
vllm-input-modalities

vLLM non-chat inference surfaces — text embeddings (`/v1/embeddings`, `/v2/embed`), reranking/scoring (`/rerank`, `/score`), speech-to-text (`/v1/audio/transcriptions`, `/v1/audio/translations`), document OCR via VLMs. Covers 2026 `--runner pooling` (replacing `--task embed`), v0.20 deprecations (`score`→`classify`, multitask pooling, `encode`→`token_embed`+`token_classify`), Matryoshka/MRL, ColBERT/ColPali/ColQwen late-interaction MaxSim, Cohere v2 `/v2/embed`, Jina v3/v4/v5 quirks, cross-encoder score templates, Whisper large-v3-turbo quants, DeepSeek-OCR recipe (NGramPerReqLogitsProcessor, no prefix cache, GUNDAM mode).
air-gapped/skills · ★ 2 · AI & Automation · score 78
Install: claude install-skill air-gapped/skills
# vLLM — embeddings, reranking, speech-to-text, OCR

Target audience: operators who need vLLM's non-chat-completion surfaces. Four capabilities bundled here because they share operator-facing concepts (`--runner` flag, pooling configuration, scoring API, multimodal preprocessing) even though two run on the pooling runner (embedding, reranking) and two run on the generate runner (STT, OCR).

## The mental model — one flag rules the surface

vLLM decides what a model *does* from the combination of three flags:

```
--runner {auto|generate|pooling|draft}   # what kind of workload
--convert {auto|none|embed|classify}     # adapt a generative LM to a pooler
--pooler-config '{...}'                  # override pool type, dimensions, etc.
```

The pair `(runner, convert)` has replaced the old `--task {generate|embed|score|classify|reward|...}` flag. The old `--task` is **deprecated** and still works in current releases, but emits a deprecation warning and is scheduled for full removal.

Canonical today:

| Workload | Command | Runner | Notes |
|---|---|---|---|
| Chat / completion | `vllm serve <model>` | `generate` (auto) | default |
| Embedding | `vllm serve <model> --runner pooling` | `pooling` | auto-detects CLS/LAST/MEAN from config |
| Embedding from a causal LM | `vllm serve <model> --runner pooling --convert embed` | `pooling` | adapts `*ForCausalLM` checkpoints |
| Classification | `vllm serve <model> --runner pooling --convert classify` | `pooling` | also how `scor