cohere-performance-tuning

Optimize Cohere API performance with caching, batching, model selection, and streaming. Use when experiencing slow API responses, implementing caching strategies, or optimizing request throughput for Cohere Chat, Embed, and Rerank. Trigger with phrases like "cohere performance", "optimize cohere", "cohere latency", "cohere caching", "cohere slow", "cohere batch".

AI & Automation 2,266 stars 315 forks Updated today MIT

Quality Score: 99/100
Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Cohere Performance Tuning

## Overview

Optimize Cohere API v2 performance through model selection, embedding batches, rerank pipelines, caching, and streaming for time-to-first-token.

## Prerequisites

- `cohere-ai` SDK installed
- Understanding of Cohere endpoints (Chat, Embed, Rerank)
- Redis or in-memory cache (optional)

## Latency Benchmarks (Typical)

| Operation | Model | P50 | P95 |
|-----------|-------|-----|-----|
| Chat (short) | `command-r7b-12-2024` | 500ms | 1.5s |
| Chat (short) | `command-a-03-2025` | 800ms | 2.5s |
| Chat (stream TTFT) | `command-a-03-2025` | 200ms | 600ms |
| Embed (96 texts) | `embed-v4.0` | 150ms | 400ms |
| Rerank (100 docs) | `rerank-v3.5` | 100ms | 300ms |
| Classify (96 inputs) | `embed-english-v3.0` | 200ms | 500ms |

## Instructions

### Strategy 1: Model Selection by Latency Budget

```typescript
import { CohereClientV2 } from 'cohere-ai';

const cohere = new CohereClientV2({ token: process.env.CO_API_KEY });

// Use smaller models for latency-sensitive paths
function selectModel(latencyBudgetMs: number): string {
  if (latencyBudgetMs < 1000) return 'command-r7b-12-2024'; // 7B, fastest
  if (latencyBudgetMs < 3000) return 'command-r-08-2024';   // Mid-tier
  return 'command-a-03-2025';                               // Best quality
}

// Pair with maxTokens to control output length
async function answer(query: string) {
  return cohere.chat({
    model: selectModel(1500),
    messages: [{ role: 'user', content: query }],
    maxTokens: 200, // Shorter output = lower latency
  });
}
```

### Strategy 2: Streaming for Time-to-First-Token

```typescript
// Non-streaming: user waits for ...
```
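The overview lists embedding batches, and the benchmark table times Embed at 96 texts per call. Below is a minimal batching sketch. It assumes a limit of 96 texts per Embed request (verify this against the current API reference) and a hypothetical `embedBatch` function that would wrap the SDK's embed call. The sketch splits the input into batches, keeps a bounded number of requests in flight, and returns vectors in input order.

```typescript
// Hypothetical signature: embeds one batch and returns one vector per text.
type EmbedBatchFn = (texts: string[]) => Promise<number[][]>;

// Assumed per-request limit for Embed; confirm in the API reference.
const MAX_TEXTS_PER_CALL = 96;

// Split an array into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error('size must be positive');
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Embed all texts in batches with at most `concurrency` requests in flight.
// Each batch result is stored by batch index, so output order matches input.
async function embedAll(
  texts: string[],
  embedBatch: EmbedBatchFn,
  concurrency = 4,
): Promise<number[][]> {
  const batches = chunk(texts, MAX_TEXTS_PER_CALL);
  const results: number[][][] = new Array(batches.length);
  let next = 0;

  // Each worker claims the next unprocessed batch until none remain.
  // Claiming needs no lock because this code runs on a single JS thread.
  const worker = async (): Promise<void> => {
    while (next < batches.length) {
      const idx = next++;
      results[idx] = await embedBatch(batches[idx]);
    }
  };

  const workerCount = Math.min(Math.max(1, concurrency), batches.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  const vectors: number[][] = [];
  for (const batch of results) vectors.push(...batch);
  return vectors;
}
```

Capping concurrency trades throughput against rate-limit errors. The default of 4 is an arbitrary placeholder, not a Cohere recommendation.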
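The prerequisites mention an optional Redis or in-memory cache. The hypothetical `TtlCache` class below sketches the in-memory variant. It memoizes async results for a fixed TTL and also stores in-flight promises, so concurrent identical requests share a single upstream call. It assumes a single process; a shared store such as Redis would be needed across instances.

```typescript
// Minimal in-memory TTL cache for async results (single process only).
// In-flight promises are cached as well, so concurrent identical requests
// share one upstream call. Failed calls are evicted rather than cached.
// Expired entries are only overwritten on their next lookup; a production
// cache would also bound its size.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: Promise<V>; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  getOrCompute(key: string, compute: () => Promise<V>): Promise<V> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;

    const value = compute();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Evict on failure so the next caller retries instead of reusing the error.
    value.catch(() => {
      if (this.entries.get(key)?.value === value) this.entries.delete(key);
    });
    return value;
  }
}
```

For Embed, the cache key should include everything that changes the output, such as `${model}:${inputType}:${text}`. That key format is illustrative only.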
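The overview also mentions rerank pipelines. One common pattern is to shortlist candidates cheaply by vector similarity and send only that shortlist to Rerank, which keeps each call near the 100-document size benchmarked above. The sketch assumes precomputed embeddings and a hypothetical `rerank` function. Its results carry an `index` into the submitted documents and a `relevanceScore`, loosely modeled on the shape of Rerank results. `retrieveAndRerank` and its default sizes are illustrative.

```typescript
// Hypothetical reranker: returns results that point back into `documents`
// by index, best first, limited to `topN` (shape modeled on Rerank results).
type RerankFn = (
  query: string,
  documents: string[],
  topN: number,
) => Promise<{ index: number; relevanceScore: number }[]>;

interface Doc {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Stage 1: shortlist `candidates` docs by vector similarity (cheap, local).
// Stage 2: rerank only that shortlist and keep the best `topN`.
async function retrieveAndRerank(
  query: string,
  queryEmbedding: number[],
  corpus: Doc[],
  rerank: RerankFn,
  candidates = 100,
  topN = 5,
): Promise<(Doc & { relevanceScore: number })[]> {
  const shortlist = corpus
    .map((doc) => ({ doc, score: cosine(queryEmbedding, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, candidates)
    .map((s) => s.doc);
  if (shortlist.length === 0) return [];

  const ranked = await rerank(query, shortlist.map((d) => d.text), topN);
  return ranked.map((r) => ({ ...shortlist[r.index], relevanceScore: r.relevanceScore }));
}
```

Shrinking `candidates` lowers Rerank latency but risks dropping relevant documents before they are scored. Tune it against your own recall requirements.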
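The benchmark table separates full Chat latency from streaming time-to-first-token (TTFT). The sketch below measures both in your own environment by timing any async iterable of text chunks. It assumes the text deltas have already been extracted from the SDK's stream events (not shown), and `measureStream` is a hypothetical helper.

```typescript
// Times an async stream of text chunks: time to first chunk (TTFT) and total.
// Assumption: the chunks are text deltas already extracted from the SDK's
// streaming events; that extraction is not shown here.
interface StreamTiming {
  text: string;
  ttftMs: number | undefined; // undefined if the stream produced no chunks
  totalMs: number;
}

async function measureStream(
  stream: AsyncIterable<string>,
  now: () => number = () => Date.now(),
): Promise<StreamTiming> {
  const start = now();
  let ttftMs: number | undefined;
  let text = '';
  for await (const chunk of stream) {
    if (ttftMs === undefined) ttftMs = now() - start;
    text += chunk;
  }
  return { text, ttftMs, totalMs: now() - start };
}
```

Collecting `ttftMs` and `totalMs` over many requests gives you your own P50/P95 figures to compare with the typical values in the table.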

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content, not just the same category

AI & Automation Featured

cohere-cost-tuning

Optimize Cohere costs through model selection, token budgets, and usage monitoring. Use when analyzing Cohere billing, reducing API costs, or implementing usage monitoring and budget alerts. Trigger with phrases like "cohere cost", "cohere billing", "reduce cohere costs", "cohere pricing", "cohere expensive", "cohere budget".

2,266 Updated today
jeremylongshore
AI & Automation Featured

clade-performance-tuning

Optimize Anthropic API latency with streaming, prompt caching, model selection, connection reuse, and parallel requests. Use when working with performance-tuning patterns. Trigger with "anthropic slow", "claude latency", "speed up anthropic", "anthropic performance", "claude response time".

2,266 Updated today
jeremylongshore
AI & Automation Featured

cohere-hello-world

Create a minimal working Cohere example with Chat, Embed, and Rerank. Use when starting a new Cohere integration, testing your setup, or learning basic Cohere API v2 patterns. Trigger with phrases like "cohere hello world", "cohere example", "cohere quick start", "simple cohere code".

2,266 Updated today
jeremylongshore
AI & Automation Featured

elevenlabs-performance-tuning

Optimize ElevenLabs TTS latency with model selection, streaming, caching, and audio format tuning. Use when experiencing slow TTS responses, implementing real-time voice features, or optimizing audio generation throughput. Trigger: "elevenlabs performance", "optimize elevenlabs", "elevenlabs latency", "elevenlabs slow", "fast TTS", "reduce elevenlabs latency", "TTS streaming".

2,266 Updated today
jeremylongshore
AI & Automation Featured

mistral-performance-tuning

Optimize Mistral AI performance with caching, batching, and latency reduction. Use when experiencing slow API responses, implementing caching strategies, or optimizing request throughput for Mistral AI integrations. Trigger with phrases like "mistral performance", "optimize mistral", "mistral latency", "mistral caching", "mistral slow".

2,266 Updated today
jeremylongshore