vllm-chat-templates

Solid

vLLM chat-template (prompt-side Jinja) operator reference. Template resolution precedence (`--chat-template` → AutoProcessor → tokenizer default → bundled fallback), `chat_template_kwargs` allowlist silently dropping `add_generation_prompt`/`enable_thinking`/custom kwargs (PR 27622 fix), 27 shipped `tool_chat_template_*.jinja` files, known template-layer bugs for Qwen3/Qwen3-Coder, DeepSeek-R1/V3/V3.1/V3.2, GPT-OSS, Kimi-K2, Llama-4, Mistral (HF vs mistral mode), Gemma-3/4, Phi-4, GLM. Prompt side only — output parsing lives in sibling skills.

AI & Automation 3 stars 1 forks Updated yesterday MIT

Install

View on GitHub

Quality Score: 79/100

Stars 20%

Recency 20%

100

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

# vLLM chat templates — operator triage Target audience: operators deploying vLLM in production. Assumes OpenAI-API-compatible frontend (`/v1/chat/completions` or `/v1/responses`), multi-GPU, mid-2024 through 2026 model families. ## Scope — three sibling skills, three layers Chat-template bugs span three layers. This skill owns the **prompt-rendering** layer; route to a sibling when the problem is output-side. | Layer | Direction | Skill | What it covers | |---|---|---|---| | **Chat template (Jinja)** | messages → prompt string | **this skill** | template precedence, `--chat-template`, `chat_template_kwargs`, `examples/tool_chat_template_*.jinja`, bundled fallbacks | | **Reasoning parser** | model output → `reasoning_content` + `content` | `vllm-reasoning-parsers` | `--reasoning-parser`, `extract_reasoning`, `is_reasoning_end`, `<think>` splitting | | **Tool parser** | model output → `tool_calls[]` | `vllm-tool-parsers` | `--tool-call-parser`, streaming state machines, partial-JSON parsing | When in doubt: if the operator complains about what the **server receives**, it's this skill; if about what the **client receives**, it's one of the parser skills. Template and parser bugs often present the same symptom ("reasoning_content is null"), so all three skills name each other in diagnostics. This file stays on the Jinja side. ## Why this matters Chat templates are the **Jinja layer between structured messages and the raw prompt string the model sees**. Tool-calling and re...

Details

Author: air-gapped
Repository: air-gapped/skills
Created: 3 months ago
Last Updated: yesterday
Language: Python
License: MIT

Integrates with

OpenAI · AI Anthropic · AI Kubernetes · Infrastructure

Bundled in these plugins

skills

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

jinja-expert

Author, read, and debug Jinja2 templates across the three places Jinja lives in 2026 — HuggingFace `chat_template.jinja` (rendered by `apply_chat_template` for vLLM / sglang), Ansible playbooks + `.j2` files, and Jinja-adjacent Kubernetes workflows (`values.yaml.j2`, `kubernetes.core.k8s + template`, Helm post-renderers). Companion to the `helm` skill — Helm charts are Go `text/template` + Sprig, not Jinja, and this skill makes that disambiguation explicit.

3 Updated yesterday

air-gapped

AI & Automation Solid

chat-completions-api

Reference for the OpenAI Chat Completions API (/v1/chat/completions) and legacy /v1/completions as the lingua-franca compatibility protocol — the official spec incl. deprecation timeline and Responses-only feature delta, how 7 local servers (vLLM, SGLang, llama.cpp, Ollama, mistral.rs, Llama Stack/OGX, Lemonade) actually implement it, gateways (LiteLLM, Bifrost), 10 cloud providers' CC-compat endpoints (Anthropic, Gemini, DeepSeek, xAI, Groq, OpenRouter, Azure...), the reasoning_content/reasoning field schism, finish_reason divergences, and client wire behavior (opencode, Vercel AI SDK). NOT for the Responses API (responses-api skill) or Anthropic Messages protocol (messages-api skill).

3 Updated yesterday

air-gapped

AI & Automation Solid

transformers-config-tokenizers-expert

Preflight reference for HuggingFace snapshots — what vLLM, sglang, and transformers.generate see at runtime. Covers config-file precedence (tokenizer.json, tokenizer_config.json, generation_config.json, chat_template.jinja), transformers v5 tokenizer-class taxonomy (TokenizersBackend, PythonBackend, MistralCommonBackend, TikTokenTokenizer), special-token discovery (all_special_ids, added_tokens_decoder, extra_special_tokens, backend_tokenizer.get_added_tokens_decoder), chat-template Jinja contract (ImmutableSandboxedEnvironment, loopcontrols, raise_exception, strftime_now, tojson, add_generation_prompt), and engine knobs (skip_special_tokens, trust_request_chat_template, chat_template_kwargs allowlist, adjust_request, incremental detokenizer, EOS merge). Ships verified 2026 hall-of-shame for Kimi-K2.6, GLM-5.1, Gemma-4, Qwen3, DeepSeek-V3, plus drop-in Python for resolving markers to IDs, detecting turn-primer-as-EOS leaks, and cross-referencing tokenizer.json vs tokenizer_config.json.

3 Updated yesterday

air-gapped