groq-performance-tuning

Optimize Groq API performance with model selection, caching, streaming, and parallel requests. Use when experiencing slow responses, implementing caching strategies, or optimizing request throughput for Groq integrations. Trigger with phrases like "groq performance", "optimize groq", "groq latency", "groq caching", "groq slow", "groq speed".

AI & Automation | 2,266 stars | 315 forks | Updated today | MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Groq Performance Tuning

## Overview

Maximize Groq's LPU inference speed advantage. Groq already delivers extreme throughput (280-560 tok/s) and low latency (<200ms TTFT), but client-side optimization -- model selection, prompt size, streaming, caching, and parallelism -- determines whether your application fully exploits that speed.

## Groq Speed Benchmarks

| Model | TTFT | Throughput | Context |
|-------|------|------------|---------|
| `llama-3.1-8b-instant` | ~50ms | ~560 tok/s | 128K |
| `llama-3.3-70b-versatile` | ~150ms | ~280 tok/s | 128K |
| `llama-3.3-70b-specdec` | ~100ms | ~400 tok/s | 128K |
| `meta-llama/llama-4-scout-17b-16e-instruct` | ~80ms | ~460 tok/s | 128K |

TTFT = Time to First Token. Actual values depend on prompt size and server load.

## Instructions

### Step 1: Choose the Right Model for Speed

```typescript
import Groq from "groq-sdk";

const groq = new Groq();

// Speed tiers for different use cases
const SPEED_MAP = {
  // Under 100ms TTFT -- use for latency-critical paths
  instant: "llama-3.1-8b-instant",
  // Under 200ms TTFT -- use for quality-sensitive paths
  balanced: "llama-3.3-70b-versatile",
  // Speculative decoding -- same quality as 70b, faster throughput
  fast70b: "llama-3.3-70b-specdec",
} as const;

type SpeedTier = keyof typeof SPEED_MAP;

async function tieredCompletion(prompt: string, tier: SpeedTier = "instant") {
  return groq.chat.completions.create({
    model: SPEED_MAP[tier],
    messages: [{ role: "user", content: pr...
```
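The overview lists caching and parallelism among the client-side optimizations, but the excerpt above is cut off before it reaches them. The following is a minimal sketch of both ideas, not the skill's own implementation. `CompleteFn`, `withCache`, and `mapWithConcurrency` are hypothetical names introduced here for illustration, and the Groq call is injected as a plain function so that the helpers have no SDK dependency.

```typescript
// Hypothetical stand-in for whatever function actually calls Groq,
// e.g. a thin wrapper around groq.chat.completions.create.
type CompleteFn = (model: string, prompt: string) => Promise<string>;

interface CacheEntry {
  value: Promise<string>;
  expiresAt: number;
}

// Wrap a completion function with an in-memory TTL cache.
// Caching the promise (not the resolved string) also deduplicates
// identical requests that are in flight at the same time.
function withCache(
  complete: CompleteFn,
  ttlMs = 60_000,
  now: () => number = Date.now,
): CompleteFn {
  const cache = new Map<string, CacheEntry>();
  return (model, prompt) => {
    const key = JSON.stringify([model, prompt]);
    const hit = cache.get(key);
    if (hit && hit.expiresAt > now()) return hit.value;

    const value = complete(model, prompt);
    cache.set(key, { value, expiresAt: now() + ttlMs });
    // Evict failed requests so that a later call can retry.
    value.catch(() => {
      if (cache.get(key)?.value === value) cache.delete(key);
    });
    return value;
  };
}

// Run `worker` over `items` with at most `limit` calls in flight,
// returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i] as T, i);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

In practice, `complete` would probably wrap `groq.chat.completions.create` and return the first choice's message content. A call such as `mapWithConcurrency(prompts, 4, (p) => cached("llama-3.1-8b-instant", p))` would then fan out a batch while keeping the number of concurrent requests bounded. An exact-match cache like this only makes sense where repeated prompts should get the same answer, for example deterministic, low-temperature calls. The concurrency limit of 4 is an arbitrary placeholder and should be tuned against your account's rate limits.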

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

groq-cost-tuning

Optimize Groq costs through model routing, token management, and usage monitoring. Use when analyzing Groq billing, reducing API costs, or implementing usage monitoring and budget alerts. Trigger with phrases like "groq cost", "groq billing", "reduce groq costs", "groq pricing", "groq expensive", "groq budget".

groq-observability

Set up observability for Groq integrations: latency histograms, token throughput, rate limit gauges, cost tracking, and Prometheus alerts. Trigger with phrases like "groq monitoring", "groq metrics", "groq observability", "monitor groq", "groq alerts", "groq dashboard".

groq-upgrade-migration

Upgrade groq-sdk versions and handle Groq model deprecations. Use when upgrading SDK versions, detecting deprecated models, or migrating to new Groq model IDs. Trigger with phrases like "upgrade groq", "groq migration", "groq breaking changes", "update groq SDK", "groq deprecated model".

clade-performance-tuning

Optimize Anthropic API latency: streaming, prompt caching, model selection, connection reuse, and parallel requests. Use when working with performance-tuning patterns. Trigger with "anthropic slow", "claude latency", "speed up anthropic", "anthropic performance", "claude response time".

elevenlabs-performance-tuning

Optimize ElevenLabs TTS latency with model selection, streaming, caching, and audio format tuning. Use when experiencing slow TTS responses, implementing real-time voice features, or optimizing audio generation throughput. Trigger: "elevenlabs performance", "optimize elevenlabs", "elevenlabs latency", "elevenlabs slow", "fast TTS", "reduce elevenlabs latency", "TTS streaming".
