evaluate-scenarios

Solid

Decompose each scenario into clean-context forks, measure framework-overhead bytes and hops per fork, and report a feasibility signal (heaviest fork's net load) and a cost signal (overhead summed across forks). Use to measure the operational overhead the framework imposes per agent.
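
A minimal sketch of how the two signals could be derived from per-fork load tables. It is not the harness's actual code: the `Load` type, its `size` field, the fork names, and the sample byte counts are all hypothetical. It also assumes that "net load" means overhead plus content bytes, with `spared` rows recorded as savings rather than counted; the authoritative definition is in the kb note the skill references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    """One row of a fork's load table (field names are hypothetical)."""
    name: str
    kind: str      # "overhead", "content", or "spared"
    size: int      # bytes
    hops: int = 0


def bytes_of(loads, kind):
    """Total bytes of the rows with the given kind."""
    return sum(load.size for load in loads if load.kind == kind)


def net_load(loads):
    # ASSUMPTION: net load = overhead + content; spared rows are bytes the
    # fork did not have to read, so they add nothing to its context.
    return bytes_of(loads, "overhead") + bytes_of(loads, "content")


def evaluate(forks, budget=50_000):
    """Return the feasibility signal (heaviest fork) and the cost signal."""
    per_fork = {name: net_load(loads) for name, loads in forks.items()}
    heaviest = max(per_fork, key=per_fork.get)
    return {
        "heaviest_fork": heaviest,
        "feasibility": per_fork[heaviest],
        "over_budget": per_fork[heaviest] > budget,
        # Cost: framework overhead summed across every fork.
        "cost": sum(bytes_of(loads, "overhead") for loads in forks.values()),
    }


# Illustrative data only; sizes loosely follow the default knobs.
forks = {
    "fork-a": [
        Load("skill body", "overhead", 4000, hops=1),
        Load("index read", "overhead", 3000, hops=1),
        Load("candidate notes", "content", 6000, hops=3),
        Load("skipped bodies", "spared", 6000),
    ],
    "fork-b": [
        Load("skill body", "overhead", 2500, hops=1),
        Load("validate output", "overhead", 500, hops=1),
    ],
}
report = evaluate(forks)
print(report)
```

The split matters because each fork starts from a fresh context: feasibility depends only on the single heaviest fork, while cost accumulates over all of them.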

AI & Automation 69 stars 9 forks Updated today CC-BY-4.0

Quality Score: 88/100

Stars 20%

Recency 20%

100

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

## EXECUTE NOW

**Target: $ARGUMENTS**

Parse immediately:

- empty → evaluate all scenarios in `test/scenarios/`
- a scenario name → evaluate only that scenario
- `compare` → evaluate all and compare against a previous run if one exists

This harness measures **operational overhead**: the framework instructions an agent must read on top of the task's own content. The unit is the **fork**, not the operation: every `cp-skill-*` runs `context: fork`, so each fork pays its overhead from a fresh context. See `kb/notes/feasibility-is-the-heaviest-forks-net-load.md` for the model.

### 1. Discover scenario files

```bash
ls test/scenarios/*.md
```

Read each scenario. Each has a `## Forks` section with one subsection per fork; each fork has a table of loads: `load | kind | source | hops`, where `kind` is `overhead`, `content`, or `spared`.

### 2. Config (override via $ARGUMENTS, e.g. `notesize=3000 candidates=4 budget=50000 agents_per_fork=on`)

| Knob | Default | Meaning |
|---|---|---|
| `notesize` | 2,000 B | average note/body read |
| `candidates` | 3 | content notes opened where a fork prospects bodies |
| `spared_bodies` | 3 | bodies an index or description-listing read lets a fork skip |
| `index_size` | 3,000 B | one curated index read or scoped description listing |
| `validate_out` | 500 B | bytes a `commonplace-validate` run returns into context |
| `budget` | 50,000 B | usable-window soft ceiling for the feasibility flag (overhead + content + room to reason) |
| `agent...
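
The parsing rules and knob overrides above can be sketched as follows. This is an illustration under assumptions, not the skill's implementation: `parse_arguments` and the `mode` labels are hypothetical names, and `DEFAULTS` holds only the knobs whose defaults are fully visible in the table. Any other `key=value` token, such as `agents_per_fork=on`, is kept as a raw string.

```python
# Defaults copied from the knob table (bytes, or counts for
# candidates / spared_bodies).
DEFAULTS = {
    "notesize": 2000,
    "candidates": 3,
    "spared_bodies": 3,
    "index_size": 3000,
    "validate_out": 500,
    "budget": 50000,
}


def parse_arguments(arguments):
    """Split the $ARGUMENTS string into (mode, scenario, config).

    mode is "all" (empty target), "single" (a scenario name),
    or "compare"; these labels are hypothetical.
    """
    config = dict(DEFAULTS)
    mode, scenario = "all", None
    for token in arguments.split():
        if "=" in token:
            key, value = token.split("=", 1)
            if key in DEFAULTS:
                # Numeric knob: accept "3000" or "3,000".
                config[key] = int(value.replace(",", ""))
            else:
                # Unknown knob: keep the raw string rather than guess a type.
                config[key] = value
        elif token == "compare":
            mode = "compare"
        else:
            mode, scenario = "single", token
    return mode, scenario, config


mode, scenario, config = parse_arguments(
    "notesize=3000 candidates=4 budget=50000 agents_per_fork=on"
)
print(mode, scenario, config["notesize"], config["agents_per_fork"])
```

Copying `DEFAULTS` before applying overrides keeps each run's config independent, which matters when `compare` evaluates against a previous run.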

Details

Author: zby
Repository: zby/commonplace
Created: 3 months ago
Last Updated: today
Language: Python
License: CC-BY-4.0

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

scenario

Author and manage holdout scenarios for behavioral validation. Scenarios are stored in .agents/holdout/ where implementing agents cannot see them. Triggers: "$scenario", "holdout", "behavioral scenario", "create scenario", "list scenarios".

389 Updated today

boshu2

AI & Automation Listed

scenario

Generate comprehensive edge cases and test scenarios by decomposing a feature or file across 12 risk dimensions. Use for pre-implementation risk discovery, QA planning, regression design, and exhaustive edge-case enumeration. Triggers: 'edge cases for X', 'what could break', 'test scenarios', 'QA plan', 'risk discovery', 'enumerate failure modes'.

0 Updated today

vanducng

AI & Automation Listed

eval-runner

Run eval scenarios to benchmark Mycelium effectiveness. Execute tasks using reflexion loop, validate against success criteria, record metrics.

33 Updated today

haabe