eval-workflow


Run evaluation tests against a multi-agent workflow to assess orchestration quality and failure archetype resistance

AI & Automation | 146 stars | 21 forks | Updated today | MIT

Quality Score: 90/100

Stars (weight 20%): 72
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# Workflow Evaluation

Run automated evaluation tests against a multi-agent workflow.

## Research Foundation

- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation

## Usage

```bash
/eval-workflow flow-security-review-cycle
/eval-workflow flow-inception-to-elaboration --scenario distractor-test
/eval-workflow flow-deploy-to-production --verbose --strict
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| workflow-name | Yes | Workflow (flow command) to evaluate |

## Options

| Option | Default | Description |
|--------|---------|-------------|
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |
| --timeout | 300 | Maximum seconds per scenario |

## What Gets Evaluated

### Orchestration Quality

- **Agent coordination**: Parallel agents launched correctly in single message
- **Handoff fidelity**: Artifacts pass correctly between phases
- **Gate enforcement**: Phase gates checked before transition

### Archetype Resistance

- `grounding-test` (Archetype 1): Premature action without reading state
- `distractor-test` (Archetype 3): Context pollution from irrelevant artifacts
- `recovery-test` (Archetype 4): Fragile execution when subagent fails

### Output Validation

- Required artifacts created in correct ...
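
The options and scenario names documented above can be modeled in a few lines. The following is a minimal Python sketch, not the skill's actual implementation: the function names `build_parser`, `select_scenarios`, and `exit_code` are hypothetical, and the handling of `--scenario all` and `--strict` is an assumption inferred from the option descriptions. Only the three scenarios listed under Archetype Resistance are included.

```python
import argparse

# Scenario names and archetype descriptions copied from the list above.
SCENARIOS = {
    "grounding-test": "Archetype 1: Premature action without reading state",
    "distractor-test": "Archetype 3: Context pollution from irrelevant artifacts",
    "recovery-test": "Archetype 4: Fragile execution when subagent fails",
}


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented argument and options with their stated defaults."""
    parser = argparse.ArgumentParser(prog="eval-workflow")
    parser.add_argument("workflow_name", metavar="workflow-name",
                        help="Workflow (flow command) to evaluate")
    parser.add_argument("--scenario", default="all",
                        help="Specific scenario to run")
    parser.add_argument("--verbose", action="store_true",
                        help="Show detailed test output")
    parser.add_argument("--output", default="stdout",
                        help="Output file for results")
    parser.add_argument("--strict", action="store_true",
                        help="Fail on any test failure")
    parser.add_argument("--timeout", type=int, default=300,
                        help="Maximum seconds per scenario")
    return parser


def select_scenarios(name: str) -> list[str]:
    """Assumption: 'all' expands to every scenario known to this sketch."""
    if name == "all":
        return list(SCENARIOS)
    if name not in SCENARIOS:
        raise ValueError(f"unknown scenario: {name}")
    return [name]


def exit_code(results: dict[str, bool], strict: bool) -> int:
    """Assumption: only --strict turns a failed scenario into a non-zero exit."""
    any_failed = any(not passed for passed in results.values())
    return 1 if strict and any_failed else 0
```

Under these assumptions, parsing `flow-deploy-to-production --verbose --strict` yields `strict=True` with `timeout` left at 300, and a single failed scenario then produces a non-zero exit code. How the real command reports failures without `--strict` is not stated in the source.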

Details

Author: jmagly
Repository: jmagly/aiwg
Created: 10 months ago
Last Updated: today
Language: TypeScript
License: MIT
