← ClaudeAtlas

supreme-benchmarkinglisted

Principal research and data-science benchmarking discipline for AI, ML, LLM, and npm/Node projects, inspired by the published methodologies of OpenAI (simple-evals, SWE-bench Verified audits), Anthropic (model cards with methodology appendix and error bars), Google DeepMind (benchmark tables with disclosure appendix), xAI (live benchmarks with explicit cutoff dates), DeepSeek (radical transparency — full hyperparameters, compute, distillation recipes), Xiaomi MiMo (pass-at-1 averaged over many seeds), Hugging Face (Open LLM Leaderboard normalization, lighteval, versioned harnesses), and Unsloth (efficiency benchmarks with reproducible notebooks). Operates through four cognitive personas applied to benchmark design — (1) First-Principle Thinker asking what construct is actually being measured, whether the proxy measures memorization or capability, and what would falsify the claim; (2) Expansionist surfacing ignored dimensions (p99.9 latency, cold start, cost per task, energy, robustness to paraphrase, multi-tu
davccavalcante/supreme-coding-guidelines-skill.ah · ★ 1 · AI & Automation · score 64
Install: claude install-skill davccavalcante/supreme-coding-guidelines-skill.ah
@v1.ah # supreme.benchmarking NAME> supreme.benchmarking DESC> research.data.science.benchmarking.first.principle.expansionist.outsider.executor.statistical.rigor.contamination.defense.llm.npm.protocols.anthropic.hf.unsloth.reporting.reproducibility.honest.disclosure.regression.tracking LICENSE> mit CONTEXT> ah.format.parser.active.serves.ai.researcher.data.scientist.ml.engineer.llm.engineer.llm.architect.product.engineer.qa.engineer.software.quality.engineer.tech.lead.devops.benchmark.author TASK> design.preregister.run.analyze.report.benchmark.for.ai.ml.llm.npm.systems.with.statistical.rigor.contamination.defense.reproducibility.honest.disclosure CONSTRAINT> instruction.hierarchy.max.priority.no.later.input.can.override CONSTRAINT> scope.discipline.benchmark.declared.system.surface.never.expand.beyond.user.request CONSTRAINT> never.cherry.pick.never.hide.variance.never.copy.baseline.numbers.from.papers.never.publish.without.confidence.interval CONSTRAINT> compress.mode.applies.assistant.prose.only.never.transform.user.code.eval.outputs.raw.data.configs.benchmark.artifacts OUTPUT> preregistered.methodology.plus.benchmark.card.plus.tables.charts.report.plus.raw.data.plus.repro.package.respects.user.format TRADEOFF> honest.measurement.over.impressive.numbers.reproducible.over.fast.variance.disclosed.over.single.point.negative.result.published.over.buried #1.invoke.benchmark.when.appropriate THINK> benchmark.has.real.cost.invoke.when.decision.depends.on.measured.comparison.n