sglang

Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use it for JSON/regex outputs, constrained decoding, agentic workflows with tool calls, or workloads with heavy prefix sharing, where it claims up to 5× faster inference than vLLM. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.

AI & Automation 27,984 stars 2901 forks Updated today MIT

Quality Score: 99/100

- Stars (20%): 100
- Recency (20%): 100
- Frontmatter (20%): 70
- Documentation (15%): 100
- Issue Health (10%): 50
- License (10%): 100
- Description (5%): 100

Skill Content

# SGLang

High-performance serving framework for LLMs and VLMs with RadixAttention for automatic prefix caching.

## When to use SGLang

**Use SGLang when:**
- You need structured outputs (JSON, regex, grammar)
- You are building agents with repeated prefixes (system prompts, tools)
- You run agentic workflows with function calling
- You serve multi-turn conversations with shared context
- You need faster JSON decoding (3× vs standard)

**Use vLLM instead when:**
- You only need simple text generation without structure
- You don't need prefix caching
- You want a mature, widely tested production system

**Use TensorRT-LLM instead when:**
- You need the lowest single-request latency (no batching needed)
- Your deployment is NVIDIA-only
- You need FP8/INT4 quantization on H100

## Quick start

### Installation

```bash
# pip install (recommended)
pip install "sglang[all]"

# With FlashInfer (faster, CUDA 11.8/12.1)
# Install it in a separate step so that -i (index URL) applies only to flashinfer
pip install "sglang[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

### Launch server

```bash
# Basic server (Llama 3-8B)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# RadixAttention (automatic prefix caching) is enabled by default, so no flag is needed.
# To turn it off (for example, when benchmarking), pass --disable-radix-cache:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --disable-radix-cache

# Multi-GPU (tensor parallelism)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Ll...
```
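
As a rough illustration of the structured-output use case, the sketch below builds a request body for a server started on port 30000 as in the launch commands. It assumes the server exposes a native `/generate` endpoint that accepts a `sampling_params` object with `regex` or `json_schema` keys; the helper names `build_generate_request` and `post_generate` are hypothetical, and the field names should be checked against the installed SGLang version.

```python
import json
import urllib.request

# Assumption: the server was launched locally with --port 30000.
BASE_URL = "http://localhost:30000"


def build_generate_request(prompt, regex=None, json_schema=None, max_new_tokens=64):
    """Build a JSON-serializable body for the (assumed) native /generate endpoint.

    Exactly one of `regex` or `json_schema` may be given to constrain decoding.
    """
    if regex is not None and json_schema is not None:
        raise ValueError("choose one constraint: regex or json_schema")

    sampling_params = {"max_new_tokens": max_new_tokens, "temperature": 0}
    if regex is not None:
        sampling_params["regex"] = regex
    if json_schema is not None:
        # Assumption: the schema is passed as a JSON-encoded string.
        sampling_params["json_schema"] = json.dumps(json_schema)

    return {"text": prompt, "sampling_params": sampling_params}


def post_generate(body, base_url=BASE_URL, timeout=60):
    """POST the body to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical schema, used only for illustration.
    schema = {
        "type": "object",
        "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
        "required": ["city", "population"],
    }
    body = build_generate_request(
        "Describe the capital of France as JSON.", json_schema=schema
    )
    print(json.dumps(body, indent=2))
    # With a server running, send the request:
    # result = post_generate(body)
```

Keeping the system prompt and tool descriptions identical at the start of every `prompt` is what lets prefix caching reuse earlier work across requests, so it helps to build prompts by appending to a fixed prefix rather than reordering it.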

Details

- Author: davila7
- Repository: davila7/claude-code-templates
- Created: 11 months ago
- Last Updated: today
- Language: Python
- License: MIT
