experiment-audit

Solid

Audit experiment integrity before claiming results. Uses cross-model review (external reviewer backend) to check for fake ground truth, score normalization fraud, phantom results, and insufficient scope. Use when the user says "审计实验" ("audit the experiment"), "check experiment integrity", "audit results", "实验诚实度" ("experiment honesty"), or after experiments complete and before writing claims.

AI & Automation · 11,152 stars · 1,050 forks · Updated today · MIT


Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Experiment Audit: Cross-Model Integrity Verification

> 🔒 **Do not wrap this skill in `/loop`, `/schedule`, or `CronCreate`.** It is
> verdict-bearing: it judges experiment integrity. Re-running that verdict on a
> timer adds no new signal, and a loop that accepts its own output to decide
> when to stop crosses into self-acquittal (`acceptance-gate.md`). Schedule the
> *external wait that precedes it*: experiments done → then audit **once**. See
> [`shared-references/external-cadence.md`](../shared-references/external-cadence.md).

Audit experiment integrity for: **$ARGUMENTS**

## Why This Exists

LLM agents can produce fraudulent experimental results through:

1. **Fake ground truth**: creating synthetic "reference" from model outputs, then reporting high agreement as performance
2. **Score normalization**: dividing metrics by the model's own max to get 0.99+
3. **Phantom results**: claiming numbers from files that don't exist or functions never called
4. **Insufficient scope**: reporting 2-scene pilots as "comprehensive evaluation"

These are NOT intentional deception; they are failure modes of optimizing agents that lack integrity constraints. This skill adds that constraint.

## Core Principle

**The executor collects file paths. The external reviewer backend reads code and judges integrity. The executor does NOT participate in integrity judgment.**

This follows `shared-references/reviewer-independence.md` and `shared-references/experiment-integrity.md`.

## Co...
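Two of the failure modes listed under "Why This Exists" (score normalization and phantom results) are easy to see in a few lines of code. The sketch below is illustrative only and is not part of the skill: the function names, the sample scores, and the file paths are all hypothetical, and a mechanical check like this does not replace the external reviewer's judgment described under "Core Principle".

```python
from pathlib import Path


def self_normalized(scores):
    """Failure mode: divide each score by the run's own maximum."""
    top = max(scores)
    return [s / top for s in scores]


def reference_normalized(scores, reference_max):
    """Divide by a fixed maximum defined outside the run being scored."""
    return [s / reference_max for s in scores]


def missing_result_files(claimed_paths, root):
    """Return the claimed result paths that do not exist under root."""
    return [p for p in claimed_paths if not (Path(root) / p).is_file()]


raw = [0.12, 0.10, 0.08]  # hypothetical raw metric values on a 0..1 scale
print(self_normalized(raw))            # best entry is 1.0 whatever the raw quality
print(reference_normalized(raw, 1.0))  # best entry stays at 0.12
print(missing_result_files(["results/run1.json"], "no_such_experiment_dir"))
```

Dividing by the run's own maximum always puts the best entry at 1.0, so a weak model can appear near-perfect. Dividing by a fixed external reference keeps the raw scale visible. The path check only flags claims whose backing file is absent; it says nothing about whether an existing file actually contains the claimed number.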

Details

Author: wanshuiyin
Repository: wanshuiyin/Auto-claude-code-research-in-sleep
Created: 2 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

swmm-experiment-audit

Consolidate Agentic SWMM run artifacts into auditable provenance, comparison records, and local Obsidian audit notes. Use after any SWMM build/run/QA attempt, successful or failed, when OpenClaw or a CLI workflow needs a traceable record of inputs, commands, artifacts, metrics, QA checks, run-to-run differences, and first-user-friendly Obsidian visualization.

8 stars · Updated 2 days ago
Zhonghao1995
AI & Automation Solid

paper-claim-audit

Zero-context verification that every number, comparison, and scope claim in the paper matches raw result files. Uses a fresh cross-model reviewer with NO prior context to prevent confirmation bias. Use when the user says "审查论文数据" ("review the paper's data"), "check paper claims", "verify numbers", "论文数字核对" ("check the paper's numbers"), or before submission to ensure paper-to-evidence fidelity.

11,152 stars · Updated today
wanshuiyin
AI & Automation Listed

eval-audit

Use when the user asks for an AI app audit, launch readiness review, safety/security review, OWASP agentic risk check, metric coverage review, or production RCA gap review.

29 stars · Updated 3 days ago
Galileo-Agent-Labs
AI & Automation Listed

audit

Comprehensive multi-agent code audit that delegates to the code-reviewer and security-scanner sub-agents. Always runs security-scanner; set only_security_scan=true to restrict to a security-only review. Use when (1) verifying changes before shipping, (2) running review feedback inside the /impl Generator-Evaluator loop, or (3) reviewing a topic branch with no active ticket directory. Triggers on "audit changes", "review the diff", "code review", "security review", "/audit". Chain-invoked by /impl Step 17 and /ship review-gate; disable-model-invocation: false is intentional because callers reference this skill by name.

1 star · Updated today
aimsise
AI & Automation Listed

verify

Paper-vs-code consistency audit. After research:scientist implements a method from a paper, verify the implementation matches paper claims across five dimensions — formula matching [F], hyperparameter parity [H], eval protocol [E], notation consistency [N], and citation chain [C]. Reads paper (PDF path / arXiv URL / pasted text), maps claims to codebase, emits verification table with match status and severity.

17 stars · Updated 2 days ago
Borda