model-serving

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.
ancoleman/ai-design-components · ★ 368 · AI & Automation · score 80

Install: claude install-skill ancoleman/ai-design-components

# Model Serving

## Purpose

Deploy LLM and ML models for production inference with optimized serving engines, streaming response patterns, and orchestration frameworks. Focuses on self-hosted model serving, GPU optimization, and integration with frontend applications.

## When to Use

- Deploying LLMs for production (self-hosted Llama, Mistral, Qwen)
- Building AI APIs with streaming responses
- Serving traditional ML models (scikit-learn, XGBoost, PyTorch)
- Implementing RAG pipelines with vector databases
- Optimizing inference throughput and latency
- Integrating LLM serving with frontend chat interfaces

## Model Serving Selection

### LLM Serving Engines

**vLLM (Recommended Primary)**
- PagedAttention memory management (20-30x throughput improvement)
- Continuous batching for dynamic request handling
- OpenAI-compatible API endpoints
- Use for: Most self-hosted LLM deployments

**TensorRT-LLM**
- Maximum GPU efficiency (2-8x faster than vLLM)
- Requires model conversion and optimization
- Use for: Production workloads needing absolute maximum throughput

**Ollama**
- Local development without GPUs
- Simple CLI interface
- Use for: Prototyping, laptop development, educational purposes

**Decision Framework:**
```
Self-hosted LLM deployment needed?
├─ Yes, need maximum throughput → vLLM
├─ Yes, need absolute max GPU efficiency → TensorRT-LLM
├─ Yes, local development only → Ollama
└─ No, use managed API (OpenAI, Anthropic) → No serving layer needed
```

### ML Model Serv