groq-performance-tuning

Optimize Groq API performance with model selection, caching, streaming, and parallel requests. Use when experiencing slow responses, implementing caching strategies, or optimizing request throughput for Groq integrations. Trigger with phrases like "groq performance", "optimize groq", "groq latency", "groq caching", "groq slow", "groq speed".

AI & Automation 2,266 stars 315 forks Updated today MIT

Quality Score: 99/100

Skill Content

# Groq Performance Tuning

## Overview

Maximize Groq's LPU inference speed advantage. Groq already delivers extreme throughput (280-560 tok/s) and low latency (<200ms TTFT), but client-side optimization -- model selection, prompt size, streaming, caching, and parallelism -- determines whether your application fully exploits that speed.

## Groq Speed Benchmarks

| Model | TTFT | Throughput | Context |
|-------|------|-----------|---------|
| `llama-3.1-8b-instant` | ~50ms | ~560 tok/s | 128K |
| `llama-3.3-70b-versatile` | ~150ms | ~280 tok/s | 128K |
| `llama-3.3-70b-specdec` | ~100ms | ~400 tok/s | 128K |
| `meta-llama/llama-4-scout-17b-16e-instruct` | ~80ms | ~460 tok/s | 128K |

TTFT = Time to First Token. Actual values depend on prompt size and server load.

## Instructions

### Step 1: Choose the Right Model for Speed

```typescript
import Groq from "groq-sdk";

const groq = new Groq();

// Speed tiers for different use cases
const SPEED_MAP = {
  // Under 100ms TTFT -- use for latency-critical paths
  instant: "llama-3.1-8b-instant",
  // Under 200ms TTFT -- use for quality-sensitive paths
  balanced: "llama-3.3-70b-versatile",
  // Speculative decoding -- same quality as 70b, faster throughput
  fast70b: "llama-3.3-70b-specdec",
} as const;

type SpeedTier = keyof typeof SPEED_MAP;

async function tieredCompletion(prompt: string, tier: SpeedTier = "instant") {
  return groq.chat.completions.create({
    model: SPEED_MAP[tier],
    messages: [{ role: "user", content: pr...
```
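The overview lists caching among the client-side optimizations, but the excerpt cuts off before showing it. The following is a minimal sketch of one possible approach, not the skill's own implementation: a small in-memory TTL cache plus a wrapper that deduplicates identical in-flight requests. Every name here (`TtlCache`, `cacheKey`, `withCache`, `Complete`) is hypothetical and not part of `groq-sdk`. The sketch assumes responses are safe to reuse, which in practice tends to mean deterministic settings and prompts without per-user data.

```typescript
// Hypothetical helpers for illustration; none of these names come from groq-sdk.
type Clock = () => number;

interface CacheEntry<V> {
  value: V;
  expiresAt: number;
}

class TtlCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly maxEntries: number;
  private readonly now: Clock;

  constructor(ttlMs: number, maxEntries: number = 500, now: Clock = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      // Expired: drop it so the caller falls through to a fresh request.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // When full, evict the oldest insertion (Map iterates in insertion order).
    if (!this.entries.has(key) && this.entries.size >= this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}

// Key on model + trimmed prompt so surrounding whitespace does not cause misses.
function cacheKey(model: string, prompt: string): string {
  return JSON.stringify([model, prompt.trim()]);
}

// Assumed shape of a completion call: (model, prompt) -> response text.
type Complete = (model: string, prompt: string) => Promise<string>;

function withCache(complete: Complete, cache: TtlCache<string>): Complete {
  const inFlight = new Map<string, Promise<string>>();
  return async (model, prompt) => {
    const key = cacheKey(model, prompt);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;

    // Share one request among concurrent callers asking the same thing.
    const pending = inFlight.get(key);
    if (pending) return pending;

    const request = complete(model, prompt)
      .then((text) => {
        cache.set(key, text);
        return text;
      })
      .finally(() => {
        inFlight.delete(key);
      });
    inFlight.set(key, request);
    return request;
  };
}
```

To use it, pass `withCache` a function that calls `groq.chat.completions.create` with the given model and prompt and returns the response text. A cache hit skips the network round trip entirely, so even a short TTL can help on hot paths with repeated prompts. The TTL and the entry limit are illustrative defaults and should be tuned to the workload.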

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT
