logprob-prefill-analysis
Install: claude install-skill aiskillstore/marketplace
# Prefill Sensitivity Analysis Pipeline
This skill documents the complete pipeline for measuring model susceptibility to reward hacking via prefill sensitivity analysis, including both token-based and logprob-based metrics.
## Quick Start: Single Command Reproducibility
The full analysis can be run with a single command:
```bash
# Run on most recent sensitivity experiment (auto-discovers checkpoints from config.yaml)
python scripts/run_full_prefill_analysis.py
# Specify a particular sensitivity experiment
python scripts/run_full_prefill_analysis.py \
  --sensitivity-run results/prefill_sensitivity/prefill_sensitivity-20251216-012007-47bf405
# Dry run to see what would be executed
python scripts/run_full_prefill_analysis.py --dry-run
# Skip logprob computation (just run trajectory analysis)
python scripts/run_full_prefill_analysis.py --skip-logprob
```
This orchestration script:
1. Discovers checkpoints and prefill levels from the sensitivity experiment's `config.yaml`
2. Runs token-based trajectory analysis
3. Computes prefill logprobs for each checkpoint
4. Produces an integrated analysis comparing the token-based and logprob-based metrics
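The ordering of these steps can be sketched as a small planner. This is an illustrative outline only, not the actual contents of `run_full_prefill_analysis.py`: the `Step` class, the `plan_steps` function, the step names, and the checkpoint labels are all hypothetical, and the behavior under `skip_logprob` is inferred from the "just run trajectory analysis" comment above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One unit of work in the (hypothetical) analysis plan."""
    name: str
    checkpoint: Optional[str] = None


def plan_steps(checkpoints: List[str], skip_logprob: bool = False) -> List[Step]:
    """Return the ordered steps a run would execute, without running them."""
    # Token-based trajectory analysis always runs first.
    steps = [Step("trajectory_analysis")]
    if not skip_logprob:
        # One logprob computation per discovered checkpoint.
        steps += [Step("compute_prefill_logprobs", ckpt) for ckpt in checkpoints]
        # The integrated comparison needs both metrics, so it runs last.
        steps.append(Step("integrated_analysis"))
    return steps


# Hypothetical checkpoint labels, standing in for those read from config.yaml.
plan = plan_steps(["checkpoint-100", "checkpoint-200"])
for step in plan:
    print(step.name, step.checkpoint or "")
```

Building the plan as data before executing anything is one way to support a dry-run mode: the same list can be either printed or executed.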
## Overview
The analysis measures how easily a model can be "kicked" into generating exploit code by prefilling its chain-of-thought with exploit-oriented reasoning. We track:
1. **Token-based metric**: Minimum prefill tokens needed to elicit an exploit
2. **Logprob-based metric**: How "natural" the exploit reasoning appears to the model, measured by the log-probability the model assigns to the prefill tokens
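As a rough illustration of the two metrics, the sketch below computes each from toy inputs. The function names and data layout are assumptions made for this example, and averaging per-token logprobs is only one plausible aggregation; the pipeline's actual scripts may aggregate differently (for example, by summing).

```python
from typing import Dict, List, Optional


def min_prefill_tokens(exploited_at: Dict[int, bool]) -> Optional[int]:
    """Smallest prefill length (in tokens) at which an exploit was observed.

    `exploited_at` maps each tested prefill length to whether the model
    produced an exploit at that length. Returns None if no length did.
    """
    hits = [n for n, exploited in exploited_at.items() if exploited]
    return min(hits) if hits else None


def mean_prefill_logprob(token_logprobs: List[float]) -> float:
    """Average log-probability the model assigns to the prefill tokens.

    Values closer to zero mean the model finds the prefill more "natural".
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return sum(token_logprobs) / len(token_logprobs)


# Toy data: exploits appear once 30 or more prefill tokens are supplied.
levels = {0: False, 10: False, 30: True, 100: True}
threshold = min_prefill_tokens(levels)
naturalness = mean_prefill_logprob([-0.5, -1.5, -1.0])
print(threshold, naturalness)
```

A lower token threshold and a higher (less negative) mean logprob would both point toward a model that is more easily steered into exploit reasoning.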
## Prer