calibrate-eval-infrastructure

Stop the machine from deciding your benchmark. Configure and validate the container and runtime resources for an agentic coding eval so infrastructure noise stays inside statistical bounds instead of swinging scores more than the models do. Use this whenever someone runs SWE-bench or any agentic coding benchmark in containers, sees scores jump between runs for no code reason, suspects OOM kills or flaky infra are skewing results, sets container memory or CPU limits for an eval harness, or wants to trust a leaderboard delta. Trigger on "my benchmark scores are inconsistent," "OOM during eval," "how much memory should the eval container get," and similar. Not for designing the eval tasks or graders themselves; that's build-agent-evals.
pebeto/agent-stdlib · ★ 0 · DevOps & Infrastructure · score 70

Install: claude install-skill pebeto/agent-stdlib

# Calibrate eval infrastructure

Source: [Quantifying infrastructure noise in agentic coding evals](https://www.anthropic.com/engineering/infrastructure-noise). This had no packaged skill anywhere; the topic existed only as the article and a few summaries.

Container configuration can move an agentic coding benchmark by 6 or more points. That is larger than the gap between the top models, which means a careless resource setting can rank a worse model above a better one. Treat resource configuration as an experimental variable you control and report, on the same footing as prompt format and sampling temperature.

## The mistake that causes most of it

A container has two separate numbers, and pinning them together is the trap:

- a **guaranteed allocation**, the floor the workload always has, and
- a **hard kill threshold**, the ceiling past which the runtime kills the process.

Set the floor equal to the ceiling and the workload has zero headroom. A normal memory spike crosses the line and the process dies an OOM death that looks like the agent failing the task. It was not the agent. It was the box. Give the two numbers separate values and leave a band between them.

## Size the band empirically

Do not guess the headroom. Sweep it and measure whether the score still moves:

1. Run the benchmark at several ceilings (a useful starting reference is roughly 3x the baseline ceiling; Anthropic's sweep cut infra errors from 5.8% to 2.1% with negligible score change at that point).
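
A sweep like the one in step 1 could be planned with a small helper such as the sketch below. Everything in it is an illustrative assumption and not part of the source article or any harness: the `MemoryBand` and `sweep_ceilings` names, the MiB unit, and the intermediate multipliers (only the roughly 3x point comes from the text above). It keeps the guaranteed floor fixed and scales only the kill ceiling, so each run differs in headroom alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryBand:
    """Guaranteed floor and hard-kill ceiling for one container, in MiB (hypothetical type)."""

    floor_mib: int
    ceiling_mib: int

    def __post_init__(self) -> None:
        # Reject configurations that can never run.
        if self.floor_mib <= 0:
            raise ValueError("floor must be positive")
        if self.ceiling_mib < self.floor_mib:
            raise ValueError("ceiling must not be below the floor")

    @property
    def headroom_mib(self) -> int:
        """Band between the guaranteed allocation and the kill threshold."""
        return self.ceiling_mib - self.floor_mib


def sweep_ceilings(
    baseline: MemoryBand,
    multipliers: tuple[float, ...] = (1.0, 1.5, 2.0, 3.0),  # assumed grid; only 3x is from the text
) -> list[MemoryBand]:
    """Return one band per multiplier, scaling the ceiling and keeping the floor fixed."""
    return [
        MemoryBand(baseline.floor_mib, round(baseline.ceiling_mib * m))
        for m in multipliers
    ]


if __name__ == "__main__":
    # Baseline with floor pinned to ceiling: the zero-headroom trap described above.
    for band in sweep_ceilings(MemoryBand(floor_mib=4096, ceiling_mib=4096)):
        print(band.floor_mib, band.ceiling_mib, band.headroom_mib)
```

Each resulting band would then be mapped onto whatever the container runtime calls its guaranteed allocation and kill threshold, and the benchmark run once per band.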