logprob-prefill-analysis

Reproduces the full prefill sensitivity analysis pipeline for reward hacking indicators. Use when evaluating how susceptible model checkpoints are to exploit-eliciting prefills, computing token-based trajectories, or comparing logprob vs token-count as predictors of exploitability.
aiskillstore/marketplace · ★ 329 · AI & Automation · score 79

Install: claude install-skill aiskillstore/marketplace

# Prefill Sensitivity Analysis Pipeline

This skill documents the complete pipeline for measuring model susceptibility to reward hacking via prefill sensitivity analysis, including both token-based and logprob-based metrics.

## Quick Start: Single Command Reproducibility

The full analysis can be run with a single command:

```bash
# Run on most recent sensitivity experiment (auto-discovers checkpoints from config.yaml)
python scripts/run_full_prefill_analysis.py

# Specify a particular sensitivity experiment
python scripts/run_full_prefill_analysis.py \
    --sensitivity-run results/prefill_sensitivity/prefill_sensitivity-20251216-012007-47bf405

# Dry run to see what would be executed
python scripts/run_full_prefill_analysis.py --dry-run

# Skip logprob computation (just run trajectory analysis)
python scripts/run_full_prefill_analysis.py --skip-logprob
```

This orchestration script:

1. Discovers checkpoints and prefill levels from the sensitivity experiment's `config.yaml`
2. Runs token-based trajectory analysis
3. Computes prefill logprobs for each checkpoint
4. Produces integrated analysis comparing token vs logprob metrics

## Overview

The analysis measures how easily a model can be "kicked" into generating exploit code by prefilling its chain-of-thought with exploit-oriented reasoning. We track:

1. **Token-based metric**: Minimum prefill tokens needed to elicit an exploit
2. **Logprob-based metric**: How "natural" the exploit reasoning appears to the model

## Prer