
agent-evaluation

Evaluate GenAI agent task execution using LLM-as-judge. Produces structured scores (0-5), feedback, and improvement recommendations.
dzianisv/opencode-plugins · ★ 8 · AI & Automation · score 74
Install: claude install-skill dzianisv/opencode-plugins
# Agent Evaluation Skill

Evaluate AI agent task execution using world-class LLM-as-judge patterns from DeepEval, RAGAS, and G-Eval frameworks.

## Output Format

Evaluation results are saved to `evals/results/eval-${yyyy-mm-dd-hh-mm}-${commit_id}.md`

### Results Table

| Task Input | Agent Output | Reflection Input | Reflection Output | Score | Verdict | Feedback |
|------------|--------------|------------------|-------------------|-------|---------|----------|
| Create hello.js... | I've created hello.js with... | Task: Create hello.js Agent Output: ... | Task complete | 5/5 | COMPLETE | Agent produced output; Found completion indicators |
| Fix the bug... | I found the issue and... | Task: Fix bug Agent Output: ... | (none) | 3/5 | PARTIAL | Agent produced output; Missing reflection |

### Run Evaluation

```bash
# Run E2E evaluation
npx tsx eval.ts

# Or via npm
npm run eval:e2e

# Output saved to: evals/results/eval-2026-01-28-12-30-abc1234.md
```

---

## Evaluation Rubric (0-5)

| Score | Verdict | Criteria |
|-------|---------|----------|
| **5** | COMPLETE | Task fully accomplished. All requirements met. Optimal execution. |
| **4** | MOSTLY_COMPLETE | Task done with minor issues. 1-2 suboptimal steps. |
| **3** | PARTIAL | Core objective achieved but significant gaps or errors. |
| **2** | ATTEMPTED | Progress made but failed to complete. Correct intent, wrong execution. |
| **1** | FAILED | Wrong approach or incorrect result. |
| **0** | NO_ATTEMPT | No meaningful