dspy-evaluation-suite

Solid

This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1", mentions "Evaluate class", "comparing programs", "establishing baselines", or needs to systematically test and measure DSPy program quality with custom or built-in metrics.

AI & Automation | 78 stars | 10 forks | Updated 1 week ago | MIT


Quality Score: 90/100

Stars (weight 20%): 63
Recency (weight 20%): 90
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# DSPy Evaluation Suite

## Goal

Systematically evaluate DSPy programs using built-in and custom metrics with parallel execution.

## When to Use

- Measuring program performance before/after optimization
- Comparing different program variants
- Establishing baselines
- Validating production readiness

## Related Skills

- Use with any optimizer: [dspy-bootstrap-fewshot](../dspy-bootstrap-fewshot/SKILL.md), [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md), [dspy-gepa-reflective](../dspy-gepa-reflective/SKILL.md)
- Evaluate RAG pipelines: [dspy-rag-pipeline](../dspy-rag-pipeline/SKILL.md)

## Inputs

| Input | Type | Description |
|-------|------|-------------|
| `program` | `dspy.Module` | Program to evaluate |
| `devset` | `list[dspy.Example]` | Evaluation examples |
| `metric` | `callable` | Scoring function |
| `num_threads` | `int` | Parallel threads |

## Outputs

| Output | Type | Description |
|--------|------|-------------|
| `score` | `float` | Average metric score |
| `results` | `list` | Per-example results |

## Workflow

### Phase 1: Setup Evaluator

```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=my_metric,
    num_threads=8,
    display_progress=True
)
```

### Phase 2: Run Evaluation

```python
result = evaluator(my_program)
print(f"Score: {result.score:.2f}%")

# Access individual results: (example, prediction, score) tuples
for example, pred, score in result.results[:3]:
    print(f"Example: {exam...
```
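The `metric` input listed above is a plain callable, and the `score` output is an average over per-example results. The sketch below illustrates that contract without calling DSPy itself. It assumes the conventional metric shape `metric(example, pred, trace=None)`. The `answer` field, the helper `average_score`, and the `SimpleNamespace` stand-ins for `dspy.Example` and prediction objects are all hypothetical names chosen for illustration. The percentage scaling is an assumption based on the `%` in the excerpt's print statement.

```python
from types import SimpleNamespace


def exact_match_metric(example, pred, trace=None):
    """Return 1.0 if the predicted answer equals the gold answer.

    Comparison ignores case and surrounding whitespace. The `answer`
    field name is an assumption made for this sketch.
    """
    return float(example.answer.strip().lower() == pred.answer.strip().lower())


def average_score(pairs, metric):
    """Hypothetical helper: average a metric over (example, prediction)
    pairs and express it as a percentage."""
    scores = [metric(example, pred) for example, pred in pairs]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


# Stand-ins for dspy.Example and prediction objects (illustrative only).
pairs = [
    (SimpleNamespace(answer="4"), SimpleNamespace(answer=" 4 ")),
    (SimpleNamespace(answer="Paris"), SimpleNamespace(answer="paris")),
    (SimpleNamespace(answer="Jupiter"), SimpleNamespace(answer="Saturn")),
]

score = average_score(pairs, exact_match_metric)
print(f"Score: {score:.2f}%")
```

A function shaped like `exact_match_metric` could then be passed as `metric=` when constructing the evaluator. Running it on a few hand-written pairs first is a cheap way to check the scoring logic before launching a multi-threaded evaluation.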

Details

Author: OmidZamani
Repository: OmidZamani/dspy-skills
Created: 5 months ago
Last Updated: 1 week ago
Language: Python
License: MIT
