nasde-benchmark-runner

Run coding agent benchmarks and verify results with nasde. Use this skill when the user wants to:

- Run a benchmark (all tasks, single task, specific variant)
- Re-run assessment evaluation on existing trial results
- Check or verify results in Opik (traces, feedback scores, experiments)
- Troubleshoot a failed benchmark run
- View or compare trial results

This skill applies even if the user doesn't say "benchmark": if they're talking about running evaluations, checking scores, or analyzing agent performance, use it. After every run that uses --with-opik, ALWAYS verify results via the Opik REST API; don't wait for the user to ask.
NoesisVision/nasde-toolkit · ★ 10 · AI & Automation · score 79
Install: claude install-skill NoesisVision/nasde-toolkit
# NASDE Benchmark Runner

Run coding agent benchmarks with `nasde` and verify results. The two-stage pipeline: Harbor runs agents in Docker containers (functional test → reward 0/1), then an LLM-as-a-Judge scores architecture quality across multiple dimensions.

## Authentication setup

Before running any benchmark, set up authentication tokens for the agents you plan to run. Both OS and auth method matter; pick the right command per row.

### Step 1: Ask the user which auth they prefer

**Always ask the user before running, never assume.** Two questions:

1. **Which agents will you run?** (Claude / Codex / Gemini, any combination)
2. **For each agent, OAuth (subscription) or API key (per-token billing)?** Default recommendation: OAuth where available (no per-token cost, no env vars to manage).

Then detect their OS and pick the matching script row from the table below. On Windows, also ask whether they're in **PowerShell** or **WSL** (cmd.exe is not directly supported; see "Windows: cmd.exe" below).

### Where the auth scripts live

The OAuth scripts ship inside this skill. After `nasde install-skills` they are at:

- **User scope** (default): `~/.claude/skills/nasde-benchmark-runner/scripts/` (macOS/Linux/WSL) or `%USERPROFILE%\.claude\skills\nasde-benchmark-runner\scripts\` (Windows PowerShell)
- **Project scope**: `<project>/.claude/skills/nasde-benchmark-runner/scripts/` (if installed with `nasde install-skills --scope project`)
- **Editable nasde checkout** (devs onl