← ClaudeAtlas

session-measurementlisted

Measure an AI agent's per-session performance on a small, stable metric set and track it as a trend over a long run, so you can tell whether a change — a new model version, a new operating frame/scaffolding, a new skill set — actually improved the agent or regressed it. Use this whenever someone wants to benchmark, score, grade, or track an agent's performance across sessions; compare two agent versions or frames; build a "did this change help?" scorecard; or turn a finished session into a logged measurement. Triggers on: agent benchmark, session measurement, performance trend, score this session, grade the agent, A/B an agent version, regression tracking for an agent, "is the new model/frame better?", agent report card. Reach for it even if the user just says "measure how that session went" or "track this over time" without naming a metric.
jovesun-lab/whetstone · ★ 7 · AI & Automation · score 81
Install: claude install-skill jovesun-lab/whetstone
# Session measurement — agent performance benchmark > **For any AI agent reading this file.** The frontmatter uses Claude's skill format; the > body is plain markdown. It works the same for Claude, OpenAI, Gemini, Cursor, Cline, Aider, > or a local model — paste the body in as a prompt if your platform has no skill system. > Assume **nothing** about your host's capabilities; everything here degrades gracefully. You turn a finished agent **session** into a row of objective counts, record it in a running **trend table**, and append it to a persistent log. Do this every session and the trend tells you, over a long run, whether the agent is getting better or worse as its model / frame / skills change. **The canonical output is a plain-markdown trend table** — no code execution, no dependencies, works for literally any agent (even one that can't run code). A polished chart image is an *optional* add-on for hosts that can render one; it's presentation, not the measurement. So the irreplaceable core of this skill is the **metric frame + the counting disciplines + the data**, not any particular renderer. This skill is **agent-agnostic and user-agnostic**. The metric frame and the disciplines below work for any agent. Three things vary per project and live in a small `config.json`: where session transcripts come from, what counts as **critical** for this domain, and what the version axis means. On first use you set that config up (below), then every run reuses it. --- ## The met