sglang

Featured

Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use for JSON/regex outputs, constrained decoding, agentic workflows with tool calls, or when you need 5× faster inference than vLLM with prefix sharing. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.

AI & Automation 27,984 stars 2901 forks Updated today MIT

Install

View on GitHub

Quality Score: 99/100

Stars 20%

100

Recency 20%

100

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

# SGLang High-performance serving framework for LLMs and VLMs with RadixAttention for automatic prefix caching. ## When to use SGLang **Use SGLang when:** - Need structured outputs (JSON, regex, grammar) - Building agents with repeated prefixes (system prompts, tools) - Agentic workflows with function calling - Multi-turn conversations with shared context - Need faster JSON decoding (3× vs standard) **Use vLLM instead when:** - Simple text generation without structure - Don't need prefix caching - Want mature, widely-tested production system **Use TensorRT-LLM instead when:** - Maximum single-request latency (no batching needed) - NVIDIA-only deployment - Need FP8/INT4 quantization on H100 ## Quick start ### Installation ```bash # pip install (recommended) pip install "sglang[all]" # With FlashInfer (faster, CUDA 11.8/12.1) pip install sglang[all] flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ # From source git clone https://github.com/sgl-project/sglang.git cd sglang pip install -e "python[all]" ``` ### Launch server ```bash # Basic server (Llama 3-8B) python -m sglang.launch_server \ --model-path meta-llama/Meta-Llama-3-8B-Instruct \ --port 30000 # With RadixAttention (automatic prefix caching) python -m sglang.launch_server \ --model-path meta-llama/Meta-Llama-3-8B-Instruct \ --port 30000 \ --enable-radix-cache # Default: enabled # Multi-GPU (tensor parallelism) python -m sglang.launch_server \ --model-path meta-llama/Meta-Ll...

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

sglang

9,609 Updated 1 months ago

Orchestra-Research

AI & Automation Solid

serving-llms-vllm

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

191,515 Updated today

NousResearch

AI & Automation Featured

serving-llms-vllm

27,984 Updated today

davila7