artifact-detection

Solid

Detect annotation artifacts and shortcuts in benchmarks

AI & Automation 331 stars 25 forks Updated today Apache-2.0


Quality Score: 93/100


Skill Content

# Artifact Detection Tactic

Systematically probe benchmarks for annotation artifacts, dataset shortcuts, and spurious correlations that allow models to achieve high scores without the intended capability.

## Stages

### Stage 1: Hypothesis-Only Baseline Test

Search literature for evidence that partial-input baselines achieve unexpectedly high performance:

- Hypothesis-only baselines (NLI without premise)
- Question-only baselines (QA without context)
- Label-word frequency baselines
- Majority-class and surface-pattern baselines

**Search queries**: "[benchmark] annotation artifacts", "[benchmark] hypothesis only", "[benchmark] spurious correlations", "[benchmark] dataset bias"

If published partial-input results exist, record performance gap between partial and full input. Gap < 10 points above random indicates severe artifacts.

### Stage 2: Contrast Set Construction

Identify whether contrast sets or adversarial evaluations exist:

- Search for "[benchmark] contrast sets", "[benchmark] adversarial examples"
- Check if CheckList-style behavioral tests have been applied
- Look for counterfactual data augmentation studies

Record performance drops on contrast sets. Drops > 20 points indicate reliance on surface patterns.

### Stage 3: Format Manipulation Probes

Search for evidence of format sensitivity:

- Prompt template sensitivity studies
- Label name/ordering effects
- Verbalization effects in classification
- Input length correlations with labels

Record whether minor ...
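The numeric thresholds in Stages 1 and 2 could be recorded and checked with a small helper like the one below. This is only a sketch, not part of the skill itself: the `BenchmarkEvidence` class, its field names, and the flag strings are hypothetical. It also assumes one reading of the Stage 1 rule, namely that the gap is the full-input score minus the partial-input score, and that a gap under 10 points is the warning sign.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds quoted from the stage descriptions.
PARTIAL_GAP_THRESHOLD = 10.0    # Stage 1: gap < 10 points
CONTRAST_DROP_THRESHOLD = 20.0  # Stage 2: drop > 20 points


@dataclass
class BenchmarkEvidence:
    """Hypothetical record of published scores found for one benchmark."""

    benchmark: str
    full_input_score: float
    partial_input_score: Optional[float] = None  # e.g. hypothesis-only
    contrast_set_score: Optional[float] = None

    def partial_input_gap(self) -> Optional[float]:
        # Assumed reading: gap = full-input score minus partial-input score.
        if self.partial_input_score is None:
            return None
        return self.full_input_score - self.partial_input_score

    def contrast_set_drop(self) -> Optional[float]:
        # Drop = full-input score minus contrast-set score.
        if self.contrast_set_score is None:
            return None
        return self.full_input_score - self.contrast_set_score

    def flags(self) -> List[str]:
        """Return the warning flags triggered by the recorded scores."""
        found: List[str] = []
        gap = self.partial_input_gap()
        if gap is not None and gap < PARTIAL_GAP_THRESHOLD:
            found.append("severe_artifacts")
        drop = self.contrast_set_drop()
        if drop is not None and drop > CONTRAST_DROP_THRESHOLD:
            found.append("surface_pattern_reliance")
        return found
```

For illustration, with made-up scores, `BenchmarkEvidence("demo", 90.0, 85.0, 65.0).flags()` would raise both flags, since the partial-input gap is 5 points and the contrast-set drop is 25. Missing scores are left as `None` and simply skip the corresponding check.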

Details

Author: yogsoth-ai
Repository: yogsoth-ai/de-anthropocentric-research-engine
Created: 4 months ago
Last Updated: today
Language: HTML
License: Apache-2.0

Integrates with

Model Context Protocol · AI
