eval-plan

Solid

Design a scenario-driven Mnemon harness eval with target, hypothesis, HostAgent, loop configuration, evidence, and rubric.

AI & Automation 322 stars 46 forks Updated today Apache-2.0

Quality Score: 88/100

Skill Content

# Eval Plan

Use this skill to design a scenario-driven eval before running a HostAgent.

## Procedure

1. Identify the target: loop, setup behavior, host projection, docs workflow, or eval itself.
2. Choose an existing scenario and suite when one fits.
3. If no scenario fits, draft an ephemeral plan first. Do not promote it during the same step.
4. State the hypothesis in observable terms.
5. Select the HostAgent and loop combination. Codex app server is the default HostAgent for current Mnemon evals.
6. Define the evidence to collect:
   - transcript or response reference
   - git diff
   - `.mnemon` state changes
   - projected host surface
   - report path
   - logs or timeout reason
7. Attach a rubric or mark the run exploratory.

## Output

Return a short eval plan with:

- target
- scenario
- suite
- host
- loops
- hypothesis
- evidence
- expected report path
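
The output fields and evidence checklist above can be sketched as a small data structure with a completeness check. This is a minimal illustration, not Mnemon's actual schema or API: the field names mirror the Output list, while `EvalPlan`, `problems`, `render`, the example scenario, suite, and loop names, and the report path are all hypothetical. Treating a missing rubric as "exploratory" is one assumed reading of step 7.

```python
from dataclasses import dataclass
from typing import List, Optional

# Targets and evidence kinds copied from the procedure text.
# Everything else in this sketch is a hypothetical illustration.
TARGETS = ("loop", "setup behavior", "host projection", "docs workflow", "eval itself")
EVIDENCE_KINDS = (
    "transcript or response reference",
    "git diff",
    ".mnemon state changes",
    "projected host surface",
    "report path",
    "logs or timeout reason",
)


@dataclass
class EvalPlan:
    target: str
    scenario: str
    suite: str
    host: str
    loops: List[str]
    hypothesis: str
    evidence: List[str]
    expected_report_path: str
    rubric: Optional[str] = None  # assumption: None marks the run exploratory

    def problems(self) -> List[str]:
        """Return human-readable gaps; an empty list means the plan is complete."""
        issues = []
        text_fields = ("target", "scenario", "suite", "host",
                       "hypothesis", "expected_report_path")
        for name in text_fields:
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        if self.target.strip() and self.target not in TARGETS:
            issues.append(f"unknown target: {self.target}")
        if not self.loops:
            issues.append("missing loops")
        if not self.evidence:
            issues.append("missing evidence")
        for kind in self.evidence:
            if kind not in EVIDENCE_KINDS:
                issues.append(f"unknown evidence: {kind}")
        return issues

    def render(self) -> str:
        """Format the plan as the short text block the Output section asks for."""
        return "\n".join([
            f"target: {self.target}",
            f"scenario: {self.scenario}",
            f"suite: {self.suite}",
            f"host: {self.host}",
            f"loops: {', '.join(self.loops)}",
            f"hypothesis: {self.hypothesis}",
            f"evidence: {', '.join(self.evidence)}",
            f"expected report path: {self.expected_report_path}",
            f"rubric: {self.rubric or 'exploratory'}",
        ])


# Hypothetical example values; only the default host comes from the skill text.
plan = EvalPlan(
    target="loop",
    scenario="example-scenario",
    suite="example-suite",
    host="Codex app server",
    loops=["example-loop"],
    hypothesis="The agent's reply references a fact stored in an earlier session.",
    evidence=["transcript or response reference", ".mnemon state changes", "report path"],
    expected_report_path="reports/example-scenario.md",
)

print(plan.problems())
print(plan.render())
```

Returning a list of gaps rather than raising on the first one lets a reviewer see every missing piece of a draft plan at once, which suits the ephemeral-plan case in step 3.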

Details

Author: mnemon-dev
Repository: mnemon-dev/mnemon
Created: 3 months ago
Last Updated: today
Language: Go
License: Apache-2.0

Integrates with

SQLite · Database
