eval-harness

Solid

Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.

AI & Automation | 814 stars | 53 forks | Updated today | MIT

Quality Score: 93/100

Stars (20%)
Recency (20%): 100
Frontmatter (20%)
Documentation (15%)
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Eval Harness

## Overview

Evaluation harness methodology adapted from the Everything Claude Code project. Provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.

## Evaluation Types

### 1. Agent Performance Benchmark

- Define test cases with known-correct outputs
- Run agent against each test case
- Score: accuracy, completeness, relevance
- Compare against baseline performance
- Track performance over time

### 2. Skill Quality Testing

- Verify skill instructions produce expected outcomes
- Test edge cases and boundary conditions
- Measure consistency across multiple runs
- Check for harmful or incorrect outputs
- Validate against ground truth

### 3. Regression Suite

- Collection of previously-passing test cases
- Run after any agent/skill modification
- Flag regressions with before/after comparison
- Maintain pass rate threshold (>= 95%)

### 4. Process Verification

- End-to-end process execution with known inputs
- Verify each phase produces expected outputs
- Check task ordering and dependency satisfaction
- Measure total execution time

## Quality Scoring

### Accuracy Score (0-100)

- Correctness of output vs expected
- Partial credit for partially correct outputs
- Penalty for hallucinated or fabricated content

### Completeness Score (0-100)

- Coverage of required output elements
- Missing sections flagged and scored
- Bonus for useful additional context

### Consistency Score (0-100)

- Run same inp...
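
The Regression Suite rules listed earlier (flag regressions with a before/after comparison, and hold a pass rate of at least 95%) could be sketched roughly as below. This is only an illustration under stated assumptions: `compareRuns`, `PASS_RATE_THRESHOLD`, and the result shape (a map from test-case id to a pass/fail boolean) are hypothetical names chosen for this example and are not taken from the a5c-ai/babysitter repository.

```javascript
// Hypothetical sketch: function names and data shapes are assumptions,
// not the actual babysitter API.

// The ">= 95%" pass-rate threshold from the Regression Suite section.
const PASS_RATE_THRESHOLD = 0.95;

// baseline / current: objects mapping test-case id -> boolean (true = passed).
// Only cases present in the baseline are considered part of the suite.
function compareRuns(baseline, current) {
  const ids = Object.keys(baseline);

  // A regression is a case that passed before and does not pass now.
  const regressions = ids.filter(
    (id) => baseline[id] === true && current[id] !== true
  );

  const passed = ids.filter((id) => current[id] === true).length;
  // An empty suite is treated as trivially passing (an assumption).
  const passRate = ids.length === 0 ? 1 : passed / ids.length;

  return { passRate, regressions, ok: passRate >= PASS_RATE_THRESHOLD };
}

// Example data, invented for illustration.
const baseline = { "case-1": true, "case-2": true, "case-3": true, "case-4": true };
const current = { "case-1": true, "case-2": false, "case-3": true, "case-4": true };

const report = compareRuns(baseline, current);
console.log(report);
```

With this example data, one of four previously passing cases now fails, so the run falls below the threshold and `case-2` is reported as a regression. A real harness would likely also record the before and after outputs for each flagged case, not just the pass/fail flag.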

Details

Author: a5c-ai
Repository: a5c-ai/babysitter
Created: 4 months ago
Last Updated: today
Language: JavaScript
License: MIT
