calibrate-eval-infrastructure
Install: `claude install-skill pebeto/agent-stdlib`
# Calibrate eval infrastructure
Source: [Quantifying infrastructure noise in agentic coding evals](https://www.anthropic.com/engineering/infrastructure-noise). Until now the topic had no packaged skill anywhere; it existed only as the article and a few summaries.
Container configuration can move an agentic coding benchmark score by 6 percentage points or more. That is larger than the gap between the top models, which means a careless resource setting can rank a worse model above a better one. Treat resource configuration as an experimental variable you control and report, on the same footing as prompt format and sampling temperature.
## The mistake that causes most of it
A container has two separate numbers, and pinning them together is the trap:
- a **guaranteed allocation**, the floor the workload always has, and
- a **hard kill threshold**, the ceiling past which the runtime kills the process.
Set the floor equal to the ceiling and the workload has zero headroom. A normal memory spike crosses the line and the process dies an OOM death that looks like the agent failing the task. It was not the agent. It was the box.
Give the two numbers separate values and leave a band between them.
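A pre-flight check can catch the pinned configuration before a run starts. The following is a minimal sketch, not part of the source article: `MemoryLimits`, `check_limits`, and the MiB field names are hypothetical, and the numbers in the usage lines are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryLimits:
    """Hypothetical container memory settings, in MiB."""

    guaranteed_mib: int  # the floor: memory the workload always has
    kill_mib: int        # the ceiling: past this the runtime kills the process

    def headroom_mib(self) -> int:
        """Width of the band between the floor and the ceiling."""
        return self.kill_mib - self.guaranteed_mib


def check_limits(limits: MemoryLimits) -> list[str]:
    """Return a list of problems; an empty list means a band exists."""
    problems = []
    if limits.guaranteed_mib <= 0:
        problems.append("guaranteed allocation must be positive")
    if limits.kill_mib < limits.guaranteed_mib:
        problems.append("kill threshold is below the guaranteed allocation")
    elif limits.kill_mib == limits.guaranteed_mib:
        problems.append("floor equals ceiling: zero headroom")
    return problems


# Illustrative values only.
pinned = MemoryLimits(guaranteed_mib=4096, kill_mib=4096)
banded = MemoryLimits(guaranteed_mib=4096, kill_mib=12288)

print(check_limits(pinned))   # reports the zero-headroom problem
print(check_limits(banded))   # no problems reported
print(banded.headroom_mib())  # band width in MiB
```

The check only confirms that a band exists. It says nothing about whether the band is wide enough, which has to be measured.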
## Size the band empirically
Do not guess the headroom. Sweep it and measure whether the score still moves:
1. Run the benchmark at several ceilings (a useful starting reference is roughly 3x the baseline ceiling; Anthropic's sweep cut infra errors from 5.8% to 2.1% with negligible score change at that point).