tensorrt-llm

Solid

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

AI & Automation 9,182 stars 697 forks Updated 1 months ago MIT

Install

View on GitHub

Quality Score: 94/100

Stars 20%

100

Recency 20%

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

# TensorRT-LLM NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs. ## When to use TensorRT-LLM **Use TensorRT-LLM when:** - Deploying on NVIDIA GPUs (A100, H100, GB200) - Need maximum throughput (24,000+ tokens/sec on Llama 3) - Require low latency for real-time applications - Working with quantized models (FP8, INT4, FP4) - Scaling across multiple GPUs or nodes **Use vLLM instead when:** - Need simpler setup and Python-first API - Want PagedAttention without TensorRT compilation - Working with AMD GPUs or non-NVIDIA hardware **Use llama.cpp instead when:** - Deploying on CPU or Apple Silicon - Need edge deployment without NVIDIA GPUs - Want simpler GGUF quantization format ## Quick start ### Installation ```bash # Docker (recommended) docker pull nvidia/tensorrt_llm:latest # pip install pip install tensorrt_llm==1.2.0rc3 # Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12 ``` ### Basic inference ```python from tensorrt_llm import LLM, SamplingParams # Initialize model llm = LLM(model="meta-llama/Meta-Llama-3-8B") # Configure sampling sampling_params = SamplingParams( max_tokens=100, temperature=0.7, top_p=0.9 ) # Generate prompts = ["Explain quantum computing"] outputs = llm.generate(prompts, sampling_params) for output in outputs: print(output.text) ``` ### Serving with trtllm-serve ```bash # Start server (automatic model download and compilation) trtllm-serve meta-llama/Meta-Ll...

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 months ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

tensorrt-llm

27,705 Updated today

davila7

AI & Automation Solid

tensorrt-llm

175,435 Updated today

NousResearch

AI & Automation Featured

llama-cpp

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

27,705 Updated today

davila7

AI & Automation Solid

llama-cpp

175,435 Updated today

NousResearch

AI & Automation Solid

llama-cpp

9,182 Updated 1 months ago

Orchestra-Research