dspy-evaluation-suite

Solid

This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1", mentions "Evaluate class", "comparing programs", "establishing baselines", or needs to systematically test and measure DSPy program quality with custom or built-in metrics.

AI & Automation 78 stars 10 forks Updated 1 week ago MIT

Quality Score: 90/100

Stars 20%
Recency 20%
Frontmatter 20%
Documentation 15%: 100
Issue Health 10%
License 10%: 100
Description 5%: 100

Skill Content

# DSPy Evaluation Suite

## Goal

Systematically evaluate DSPy programs using built-in and custom metrics with parallel execution.

## When to Use

- Measuring program performance before/after optimization
- Comparing different program variants
- Establishing baselines
- Validating production readiness

## Related Skills

- Use with any optimizer: [dspy-bootstrap-fewshot](../dspy-bootstrap-fewshot/SKILL.md), [dspy-miprov2-optimizer](../dspy-miprov2-optimizer/SKILL.md), [dspy-gepa-reflective](../dspy-gepa-reflective/SKILL.md)
- Evaluate RAG pipelines: [dspy-rag-pipeline](../dspy-rag-pipeline/SKILL.md)

## Inputs

| Input | Type | Description |
|-------|------|-------------|
| `program` | `dspy.Module` | Program to evaluate |
| `devset` | `list[dspy.Example]` | Evaluation examples |
| `metric` | `callable` | Scoring function |
| `num_threads` | `int` | Parallel threads |

## Outputs

| Output | Type | Description |
|--------|------|-------------|
| `score` | `float` | Average metric score |
| `results` | `list` | Per-example results |

## Workflow

### Phase 1: Setup Evaluator

```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=my_metric,
    num_threads=8,
    display_progress=True
)
```

### Phase 2: Run Evaluation

```python
result = evaluator(my_program)
print(f"Score: {result.score:.2f}%")

# Access individual results: (example, prediction, score) tuples
for example, pred, score in result.results[:3]:
    print(f"Example: {exam...
```
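The excerpt passes a `metric` callable to the evaluator without showing one. Below is a minimal sketch of what a custom metric might look like, assuming the common DSPy convention of a function taking `(example, pred, trace=None)` and returning a number. The field name `answer`, the function names `normalized_exact_match` and `average_score`, and the `SimpleNamespace` stand-ins are hypothetical and chosen only so the sketch runs without DSPy installed. The averaging helper is not how `Evaluate` is implemented. It only illustrates the "average metric score" output listed in the table, expressed as a percentage to mirror the `%` in the print statement.

```python
from types import SimpleNamespace


def normalized_exact_match(example, pred, trace=None):
    """Return 1.0 if the predicted answer equals the gold answer, else 0.0.

    Assumes both objects expose an `answer` attribute (hypothetical field name).
    Comparison ignores case and surrounding whitespace.
    """
    gold = str(example.answer).strip().lower()
    guess = str(pred.answer).strip().lower()
    return float(gold == guess)


def average_score(pairs, metric):
    """Average a metric over (example, prediction) pairs, as a percentage."""
    scores = [metric(example, pred) for example, pred in pairs]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


# Stand-in objects so the sketch runs on its own; in real use these would be
# dspy.Example items from the devset and the program's predictions.
pairs = [
    (SimpleNamespace(answer="Paris"), SimpleNamespace(answer=" paris ")),
    (SimpleNamespace(answer="4"), SimpleNamespace(answer="5")),
]

print(f"Score: {average_score(pairs, normalized_exact_match):.2f}%")
```

A function shaped like `normalized_exact_match` could then be passed as `metric=` when constructing the evaluator, provided the program's outputs and the devset examples actually carry the assumed field.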

Details

Author: OmidZamani
Repository: OmidZamani/dspy-skills
Created: 5 months ago
Last Updated: 1 week ago
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

dspy-simba-optimizer

This skill should be used when the user asks to "optimize with SIMBA", "use mini-batch introspective optimization", "generate self-reflective rules", mentions "SIMBA optimizer", "stochastic mini-batch ascent", "output variability", or needs an alternative to MIPROv2/GEPA that evolves rules and demonstrations from numeric metrics.

78 stars Updated 1 week ago

OmidZamani

AI & Automation Solid

dspy-miprov2-optimizer

This skill should be used when the user asks to "optimize a DSPy program", "use MIPROv2", "tune instructions and demos", "get best DSPy performance", "run Bayesian optimization", mentions "state-of-the-art DSPy optimizer", "joint instruction tuning", or needs maximum performance from a DSPy program with substantial training data (200+ examples).

78 stars Updated 1 week ago

OmidZamani

AI & Automation Solid

dspy-optimizer-selection

This skill should be used when the user asks to "choose a DSPy optimizer", "compare DSPy optimizers", "which teleprompter should I use", "optimize prompts or weights", mentions LabeledFewShot, BootstrapFewShotWithRandomSearch, KNNFewShot, COPRO, MIPROv2, SIMBA, GEPA, BootstrapFinetune, Ensemble, or BetterTogether, or needs a cost-aware DSPy optimization plan.

78 stars Updated 1 week ago

OmidZamani