cohere-performance-tuning

Optimize Cohere API performance with caching, batching, model selection, and streaming. Use when experiencing slow API responses, implementing caching strategies, or optimizing request throughput for Cohere Chat, Embed, and Rerank. Trigger with phrases like "cohere performance", "optimize cohere", "cohere latency", "cohere caching", "cohere slow", "cohere batch".

Skill Content

# Cohere Performance Tuning

## Overview

Optimize Cohere API v2 performance through model selection, embedding batches, rerank pipelines, caching, and streaming to reduce time-to-first-token (TTFT).

## Prerequisites

- `cohere-ai` SDK installed
- Understanding of Cohere endpoints (Chat, Embed, Rerank)
- Redis or in-memory cache (optional)

## Latency Benchmarks (Typical)

| Operation | Model | P50 | P95 |
|-----------|-------|-----|-----|
| Chat (short) | `command-r7b-12-2024` | 500ms | 1.5s |
| Chat (short) | `command-a-03-2025` | 800ms | 2.5s |
| Chat (stream TTFT) | `command-a-03-2025` | 200ms | 600ms |
| Embed (96 texts) | `embed-v4.0` | 150ms | 400ms |
| Rerank (100 docs) | `rerank-v3.5` | 100ms | 300ms |
| Classify (96 inputs) | `embed-english-v3.0` | 200ms | 500ms |

## Instructions

### Strategy 1: Model Selection by Latency Budget

```typescript
import { CohereClientV2 } from 'cohere-ai';

// Assumes the API key is available in the CO_API_KEY environment variable
const cohere = new CohereClientV2({ token: process.env.CO_API_KEY });

// Use smaller models for latency-sensitive paths
function selectModel(latencyBudgetMs: number): string {
  if (latencyBudgetMs < 1000) return 'command-r7b-12-2024'; // 7B, fastest
  if (latencyBudgetMs < 3000) return 'command-r-08-2024';   // Mid-tier
  return 'command-a-03-2025';                               // Best quality
}

// Pair with maxTokens to control output length
const query = 'Summarize the latest release notes.'; // Example prompt (placeholder)
await cohere.chat({
  model: selectModel(1500), // 1500ms budget selects the mid-tier model
  messages: [{ role: 'user', content: query }],
  maxTokens: 200, // Shorter output = lower latency
});
```

### Strategy 2: Streaming for Time-to-First-Token

```typescript
// Non-streaming: user waits for ...
```
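The overview also lists embedding batches and caching. The following is a minimal sketch of how the two might be combined, not the skill's own implementation: `chunk`, `embedWithCache`, `EmbedBatchFn`, and `MAX_BATCH` are hypothetical names, and the batch size of 96 is an assumption taken from the "Embed (96 texts)" row in the benchmark table. The embed call is injected as a function so the helper stays independent of any particular SDK signature.

```typescript
// Hypothetical helper types and names, for illustration only.
// An EmbedBatchFn takes up to MAX_BATCH texts and returns one vector per text.
type EmbedBatchFn = (texts: string[]) => Promise<number[][]>;

// Assumed per-request batch size, mirroring the "Embed (96 texts)" benchmark row.
const MAX_BATCH = 96;

// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Simple in-memory cache keyed by the raw text.
const embeddingCache = new Map<string, number[]>();

// Embed only the texts that are not cached yet, in batches, then
// return vectors in the same order as the input.
async function embedWithCache(
  texts: string[],
  embedBatch: EmbedBatchFn,
): Promise<number[][]> {
  // Deduplicate so each uncached text is sent at most once.
  const missing = Array.from(
    new Set(texts.filter((t) => !embeddingCache.has(t))),
  );

  for (const batch of chunk(missing, MAX_BATCH)) {
    const vectors = await embedBatch(batch);
    batch.forEach((text, i) => embeddingCache.set(text, vectors[i]));
  }

  return texts.map((t) => embeddingCache.get(t) as number[]);
}
```

In practice, `embedBatch` would wrap the SDK's embed call. If more than one model or input type is in use, the cache key should include them as well, since the same text then maps to different vectors. A Redis-backed store could replace the `Map` when the cache has to be shared across processes.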

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 7 months ago
Last Updated: today
Language: Python
License: MIT
