benchmark-extract

Use to ingest a LongMemEval-style benchmark dataset into the isolated benchmark gramaton store via Claude Code sub-agents. Each unique haystack session becomes one Gramaton session via the production `session_prepare` → `session_commit` path. User-triggered by "run benchmark extraction", "ingest LongMemEval-S", "load benchmark data". Requires the `gramaton-bench` MCP server live on port 7338; see docs/benchmarks.md for setup.
gramaton-ai/gramaton · ★ 4 · Data & Documents · score 68

Install: claude install-skill gramaton-ai/gramaton

# benchmark-extract

Drives the production session-extraction code path against a benchmark dataset through Claude Code Agent sub-agents, one sub-agent per unique haystack session. All writes go to the `gramaton-bench` MCP toolset (never `gramaton`). The isolation is deliberate; see `docs/benchmarks.md` for why.

## When to run

- User explicitly requests ingestion of a benchmark dataset (LongMemEval-S, LongMemEval-M, MuSiQue, etc.).
- Always preceded by a design alignment on subset size (pilot vs full).

Do NOT run autonomously. Extraction spends significant subscription quota and wall-time; the user drives the cadence.

## Preconditions

1. **Dataset file** exists at the path the user specifies (for LongMemEval-S: `~/workspaces/gramaton-benchmarks/longmemeval/raw/longmemeval_s_cleaned.json`).
2. **Benchmark store running** on port 7338 with `gramaton-bench` MCP tools available in this Claude Code session. Verify with a `mcp__gramaton-bench__gramaton_stats` call; if it fails, stop and ask the user to start the server per `docs/benchmarks.md`.
3. **Personal store is NOT the target.** Any `mcp__gramaton__*` call in this skill is a bug.

## Session id convention

Upstream ids (e.g. `sharegpt_yywfIrx_0`, `85a1be56_1`, `answer_280352e9`) are used verbatim with a dataset prefix: `lme-s-<haystack_session_id>`. The prefix makes the origin unambiguous in the bench store; preserving the upstream id preserves traceability back to the dataset.

## Flow

### 1. Load and parse

Read the