eval-driven-dev

Featured

Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM.

AI & Automation · 34,233 stars · 4,188 forks · Updated today · MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Eval-Driven Development for Python LLM Applications

You're building an **automated QA pipeline** that tests a Python application end-to-end: it runs the app the same way a real user would, with real inputs, then scores the outputs using evaluators and produces pass/fail results via `pixie test`.

**What you're testing is the app itself**: its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic. That's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual`, but the thing under test is the app's code, not the LLM.

During evaluation, the app's own code runs for real (routing, prompt assembly, LLM calls, response formatting); nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.

**The deliverable is a working `pixie test` run with real scores**, not a plan, not just instrumentation, not just a dataset. This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.

---

## Before you start

**First, activate the virtual environment**. Identify the correct virtual environment for the project ...
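
The evaluation approach described in the overview (the app's own code runs, external data is replaced with test-specified values, and outputs are scored by an evaluator rather than compared with `assertEqual`) can be sketched in plain Python. This is a minimal illustration only and does not use or represent the `pixie` API: `EvalCase`, `answer`, `similarity`, `run_case`, and `offline_llm` are all hypothetical names. The `offline_llm` stand-in exists solely so the sketch runs without network access; in the workflow described above, the real LLM call would be made.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass
class EvalCase:
    """One hypothetical golden-dataset entry."""
    question: str
    injected_profile: dict  # test-specified value replacing a database read
    expected: str           # reference answer used by the evaluator


def answer(question: str,
           fetch_profile: Callable[[], dict],
           llm: Callable[[str], str]) -> str:
    """Hypothetical app entry point: gathers data, builds a prompt, calls the LLM."""
    profile = fetch_profile()  # external read: the only part replaced under test
    prompt = (
        f"Customer {profile['name']} is on the {profile['plan']} plan.\n"
        f"Question: {question}"
    )
    return llm(prompt).strip()  # response formatting still runs for real


def similarity(actual: str, expected: str) -> float:
    """Similarity-score evaluator: a ratio in [0, 1] instead of strict equality."""
    return SequenceMatcher(None, actual.lower(), expected.lower()).ratio()


def run_case(case: EvalCase,
             llm: Callable[[str], str],
             threshold: float = 0.8) -> tuple[float, bool]:
    """Run the app with injected data, score the output, return (score, passed)."""
    output = answer(case.question, lambda: case.injected_profile, llm)
    score = similarity(output, case.expected)
    return score, score >= threshold


def offline_llm(prompt: str) -> str:
    """ASSUMPTION: deterministic stand-in so this sketch runs offline.
    A real evaluation run would call the actual LLM here."""
    plan = prompt.split(" is on the ")[1].split(" plan")[0]
    return f"You are on the {plan} plan."


if __name__ == "__main__":
    case = EvalCase(
        question="Which plan am I on?",
        injected_profile={"name": "Ada", "plan": "Pro"},
        expected="You are on the Pro plan.",
    )
    score, passed = run_case(case, offline_llm)
    print(f"score={score:.2f} passed={passed}")
```

Passing the data source in as a callable keeps the prompt assembly and formatting code untouched while letting each case decide what the app sees. The pass threshold of 0.8 is an arbitrary choice for the sketch, not a recommended value.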

Details

Author: github
Repository: github/awesome-copilot
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

eval-driven-development

Use when reasoning about building language-model-integrated systems by writing evaluations before and alongside the system: the statistical (not binary) nature of LLM evals, the five primitives (dataset, evaluation function, aggregation, iteration loop, regression budget), the judgment-mechanism taxonomy (programmatic, model-graded, human-graded, preference comparison), the difference between system-specific evals and canonical benchmarks (MMLU, HumanEval, BIG-bench, GAIA), how evals drive prompt/model/scaffolding/tooling changes, why Goodhart's Law means higher eval scores are not always improvements, and the offline-eval-vs-production-telemetry distinction. Do NOT use for deterministic unit testing (use testing-strategy), production monitoring (use evaluation or error-tracking), general-software TDD (use testing-strategy), or the construction of individual eval rubrics and task sets (use agent-eval-design — it owns construction; this skill owns the iteration discipline).

0 Updated 6 days ago

AI & Automation Listed

eval-driven-dev

Build the evaluation discipline that separates production agentic products from demos — error analysis on real traces, the three-level eval pyramid (code assertions / LLM-as-judge / human review), binary judge outputs calibrated against human labels, and CI gates that block regression. Based on the Husain/Shankar methodology. Use whenever the user mentions evals, evaluation, LLM-as-judge, hallucination testing, regression testing for AI, quality measurement, error analysis, "how do I know if my agent works," failure modes, or grading agent outputs.

5 Updated today

AI & Automation Listed

ai-evals

Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.

0 Updated today

AI & Automation Featured

phoenix-evals

Build and run evaluators for AI/LLM applications using Phoenix.

34,233 Updated today

AI & Automation Listed

advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

0 Updated today