nasde-benchmark-calibration

Solid

Calibrate assessment rubrics by reviewing agent work in GitHub/GitLab PRs and feeding human comments back into the rubric. Use this skill when the user wants to:

- Calibrate, tune, or sanity-check assessment criteria / dimensions of a benchmark
- Review trial diffs alongside the LLM-as-a-Judge scores in a PR/MR
- Investigate why judge scores feel off, too harsh, too lenient, or misaligned with how a human would grade the code
- Pull review comments back from PRs/MRs and turn them into concrete rubric edits

Even if the user doesn't say "calibrate", this skill applies whenever they're worried the LLM judge's scores diverge from human judgment, or want to align scores with a real developer's opinion before freezing a benchmark.

Code & Development · 11 stars · 0 forks · Updated today · MIT

Quality Score: 79/100

Score components (weight): Stars 20% · Recency 20% · Frontmatter 20% · Documentation 15% · Issue Health 10% · License 10% · Description 5%

Skill Content

# NASDE Benchmark Calibration

Close the loop between the LLM-as-a-Judge and a human reviewer. The judge scores trials against `assessment_criteria.md` (per task) and `assessment_dimensions.json` (benchmark-wide), but an LLM judge is an imperfect grader, and how it reads the rubric may diverge from how a human grades the code. This skill publishes trial diffs + scores as Pull/Merge Requests for human review, pulls the comments back, and proposes concrete rubric edits.

This is the third skill in the benchmark lifecycle: `nasde-benchmark-creator` writes the rubric, `nasde-benchmark-runner` runs trials and scores them, and **this skill calibrates the rubric** against human judgment before the benchmark is frozen.

## Prerequisites

- A sink repository that already exists (creation is out of scope). Configure it in `nasde.toml`:

  ```toml
  [calibration]
  repo = "https://github.com/Org/nasde-calibration"  # full URL or owner/repo slug
  # platform = "gitlab"    # only needed for a bare slug or a self-hosted host
  # base_branch = "main"
  # throttle_sec = 2.0
  ```

- The platform CLI for that repo's host: `gh` (GitHub) or `glab` (GitLab), installed and logged in (`gh auth login` / `glab auth login`). The platform is auto-detected from the repo URL host. nasde never handles tokens; the CLI's keyring does. See ADR-010.
- `git` on PATH.

## Workflow

### 1. Decide which trials to calibrate

Calibration is about the **criteria**, not individual runs. Discuss with the use...
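
As a rough illustration of the platform auto-detection mentioned under Prerequisites, the sketch below maps a `[calibration]` `repo` value to a platform and its CLI. It is not nasde's actual implementation: the function name `detect_platform`, the `CLI_FOR_PLATFORM` table, the exact-host matching, and the choice to treat a bare `owner/repo` slug as GitHub are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables; they only mirror the prose above.
CLI_FOR_PLATFORM = {"github": "gh", "gitlab": "glab"}
KNOWN_HOSTS = {"github.com": "github", "gitlab.com": "gitlab"}


def detect_platform(repo, platform=None):
    """Guess the hosting platform for a [calibration] repo value.

    `repo` is a full URL or an owner/repo slug; `platform` is the
    optional explicit override from nasde.toml.
    """
    if platform:
        # An explicit setting always wins (bare slug or self-hosted host).
        return platform.lower()

    if "://" not in repo:
        # Assumption: a bare slug with no override is treated as GitHub.
        return "github"

    host = (urlparse(repo).hostname or "").lower()
    if host in KNOWN_HOSTS:
        return KNOWN_HOSTS[host]

    # Self-hosted or unknown host: refuse to guess.
    raise ValueError(f"cannot infer platform from host {host!r}; set `platform`")


if __name__ == "__main__":
    name = detect_platform("https://github.com/Org/nasde-calibration")
    print(name, CLI_FOR_PLATFORM[name])
```

Raising on an unrecognised host, instead of falling back silently, matches the config comment that `platform` is needed for self-hosted hosts.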

Details

Author: NoesisVision
Repository: NoesisVision/nasde-toolkit
Created: 4 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

GitLab · DevTools
