pebeto

build-agent-evals

Build automated evaluations for an AI agent from scratch: collecting tasks from real failures, choosing code/model/human graders, picking pass@k vs pass^k, building an isolated harness, and keeping the suite honest over time. Use this whenever someone wants to measure, benchmark, or regression-test an agent, write an eval harness for an LLM agent, decide how to grade non-deterministic output, set up an LLM-as-judge, or asks any version of "how do I know if my agent is actually getting better." Trigger even when they say "tests for my agent," "eval set," or "agent benchmark" rather than the word "evals." Not for container or resource limits making scores flaky across runs; that's calibrate-eval-infrastructure.
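To make the pass@k versus pass^k choice concrete, here is a minimal Python sketch. It assumes each task was run n times with c successes and uses the usual combinatorial estimators; the function names `pass_at_k` and `pass_hat_k` are illustrative and do not come from the skill itself.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # Guard against inputs the estimators are not defined for.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled trials succeeds."""
    _check(n, c, k)
    # comb(n - c, k) counts k-subsets made only of failures;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k sampled trials succeed (pass^k)."""
    _check(n, c, k)
    # comb(c, k) counts k-subsets made only of successes.
    return comb(c, k) / comb(n, k)
```

With 7 successes in 10 trials and k = 3, the two metrics diverge sharply: pass@k is close to 1 while pass^k is under 0.3. Roughly, pass@k suits settings where one good attempt is enough, and pass^k suits settings where every run has to work.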

DevOps & Infrastructure Listed

calibrate-eval-infrastructure

Stop the machine from deciding your benchmark. Configure and validate the container and runtime resources for an agentic coding eval so infrastructure noise stays inside statistical bounds instead of swinging scores more than the models do. Use this whenever someone runs SWE-bench or any agentic coding benchmark in containers, sees scores jump between runs for no code reason, suspects OOM kills or flaky infra are skewing results, sets container memory or CPU limits for an eval harness, or wants to trust a leaderboard delta. Trigger on "my benchmark scores are inconsistent," "OOM during eval," "how much memory should the eval container get," and similar. Not for designing the eval tasks or graders themselves; that's build-agent-evals.

coding-agent-scaffold

Design the tool interface for a coding agent so the model stops misusing it. Covers the minimal two-tool scaffold (a bash tool plus a file editor), exact single-match string replacement, absolute-path rules, and error-proofing the tool descriptions so common model mistakes become impossible. Use this whenever someone is building a coding agent or SWE-bench-style harness, designing a bash or file-edit tool for an agent, deciding how much scaffolding to impose, or debugging an agent that keeps editing the wrong place, fumbling multi-line edits, or escaping shell commands wrong. Trigger on "build a coding agent," "str_replace tool," "agent keeps breaking the file," and similar. Not for general MCP or service tool design; this is the bash plus file-editor interface specifically.
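The exact single-match replacement and absolute-path rules could look roughly like the sketch below. This is one possible reading under stated assumptions, not the skill's actual tool: the name `str_replace`, the error wording, and the choice to return error strings rather than raise are all hypothetical.

```python
import os


def str_replace(path: str, old: str, new: str) -> str:
    """Replace `old` with `new` in the file at `path`, only if `old` occurs exactly once."""
    # Assumption: relative paths are rejected so the edit cannot depend on
    # whatever working directory the agent happens to be in.
    if not os.path.isabs(path):
        return f"Error: path must be absolute, got {path!r}"
    if old == "":
        return "Error: old string must not be empty"

    with open(path, encoding="utf-8") as f:
        text = f.read()

    count = text.count(old)
    if count != 1:
        # Zero matches means the model misquoted the file; several matches
        # means the edit is ambiguous. Either way, leave the file untouched.
        return (
            f"Error: expected exactly 1 match for the old string, found {count}. "
            "Include more surrounding context to make it unique."
        )

    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new, 1))
    return "OK"
```

Returning a descriptive error instead of guessing which occurrence was meant gives the model something to correct on its next call.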

durable-agent-architecture

Structure a long-lived agent service so any part can crash and resume. Decompose it into brain (model plus harness), hands (ephemeral sandbox and tools), and session (a durable event log), each replaceable on its own, with wake/resume semantics and credentials kept out of the execution environment. Use this whenever someone designs a production or long-running agent backend, asks how to make agents crash-recoverable or resumable, worries about losing session state when a container dies, needs to scale agents as a service, or asks where to keep credentials for an agent that runs code. Trigger on "agent infrastructure," "resume an agent after a crash," "agent runs for hours," "where do tokens live," and similar. Not for parallelizing work across agents or coordinating a shared repo; see multi-agent-orchestration and parallel-autonomous-agents.
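As a rough illustration of the session as a durable event log, the sketch below appends JSON lines to a file and rebuilds state by replaying them. The class name `SessionLog`, the JSONL file format, and the event fields are assumptions made for the example; a production service would more likely use a database or a log service.

```python
import json
import os


class SessionLog:
    """Append-only event log; the harness can be restarted and resume from it."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, event: dict) -> None:
        # One JSON object per line, flushed and fsynced so the event
        # survives a crash of the process that wrote it.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self) -> list[dict]:
        # A fresh harness calls this on wake to rebuild the conversation
        # and tool history from the log alone.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

Because the log is the only state that has to survive, the brain and the hands can both be thrown away and recreated from `replay()`.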

multi-agent-orchestration

Run an orchestrator-worker system for breadth-first research: a lead agent plans, spawns three to five subagents with their own context windows, and synthesizes their findings. Covers when multi-agent actually beats a single agent and when it just burns tokens, how to delegate so subagents do not overlap, broad-to-narrow search, writing findings to a filesystem, and how to evaluate the system. Use this whenever someone wants to parallelize research or exploration across agents, asks how to coordinate a lead and subagents, considers a multi-agent setup, or asks whether multi-agent is worth it for their task. Trigger on "orchestrator and workers," "parallel research agents," "lead agent spawns subagents," "should this be multi-agent," and similar.

parallel-autonomous-agents

Coordinate several unsupervised agents working on one shared git repo without collisions. Covers the autonomy loop that lets each agent pick the next task and respawn without a human, file-based lock files that claim work, machine-readable test output so a test suite steers the agents instead of a person, and context hygiene for long unattended runs. Use this whenever someone wants multiple agents grinding on one codebase in parallel, asks how to stop agents from duplicating work or clobbering each other, sets up an unattended or overnight agent run, or asks how agents claim and release tasks. Trigger on "parallel agents on one repo," "agents keep doing the same task," "autonomy loop," "unsupervised agents," and similar. Not for breadth-first research across subagents (that's multi-agent-orchestration) or crash-recovery architecture (that's durable-agent-architecture).
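A file-based claim can be sketched with an atomic exclusive create, as below. This assumes all agents see the same filesystem; if each agent works in its own clone, the claim would have to go through the shared remote instead. The names `claim_task` and `release_task` and the `.lock` naming are hypothetical.

```python
import os


def _lock_path(lock_dir: str, task_id: str) -> str:
    return os.path.join(lock_dir, f"{task_id}.lock")


def claim_task(lock_dir: str, task_id: str, agent_id: str) -> bool:
    """Try to claim a task; return True only for the one agent that wins."""
    os.makedirs(lock_dir, exist_ok=True)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so two agents
        # racing for the same task cannot both succeed.
        fd = os.open(_lock_path(lock_dir, task_id),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(agent_id)  # record the owner for later release checks
    return True


def release_task(lock_dir: str, task_id: str, agent_id: str) -> bool:
    """Release a task, but only if this agent is the recorded owner."""
    path = _lock_path(lock_dir, task_id)
    try:
        with open(path, encoding="utf-8") as f:
            owner = f.read()
    except FileNotFoundError:
        return False
    if owner != agent_id:
        return False
    os.remove(path)
    return True
```

An agent that gets `False` from `claim_task` simply moves on to the next open task. The sketch does not handle stale locks left behind by a crashed agent.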

sandboxing-agentic-systems

Contain an agent that runs code or reads untrusted content, layer by layer. Covers OS-level filesystem and network isolation that also catches spawned subprocesses, an egress proxy that checks request provenance, treating tool outputs and fetched pages as prompt-injection vectors, and keeping credentials outside the sandbox behind a proxy. Use this whenever someone designs a sandbox for a coding or computer-use agent, asks how to safely let an agent run shell commands or browse, worries about prompt injection from tool results, needs to limit what an autonomous agent can reach or delete, or asks where an agent's credentials should live. Trigger on "sandbox the agent," "agent runs untrusted code," "prompt injection," "restrict network access," and similar. This is environment containment; deciding which actions need approval before they run is a separate concern (action gating).

using-the-think-step

Decide when to give an agent a mid-task reasoning step and how to prompt for it. Covers the no-op "think" tool that lets a model stop and reason in the middle of a tool-use chain (distinct from extended thinking, which happens before acting), the task shapes where it helps, the ones where it just adds latency, and why the gains come from system-prompt guidance rather than the tool itself. Use this whenever someone builds an agent that follows policies or makes long sequential decisions, asks whether to add a think or scratchpad tool, finds an agent skipping rules or mishandling tool output mid-task, or wants more deliberate tool use. Trigger on "think tool," "let the agent reason before acting," "agent ignores the policy," and similar. Not for general prompt engineering or one-shot chain-of-thought, and not the same as extended thinking before a turn.
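For orientation, a no-op "think" tool might be declared and handled roughly as follows. The dictionary layout follows a common JSON-schema tool-calling convention, and the description text, the `thought` field, and the `handle_think` function are assumptions for illustration rather than the skill's own definitions.

```python
# Hypothetical tool definition; adapt the field names to your model API.
THINK_TOOL = {
    "name": "think",
    "description": (
        "Use this tool to reason about the task mid-way through a chain of "
        "tool calls. It does not fetch new information or change any state."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {
                "type": "string",
                "description": "What you are reasoning about right now.",
            }
        },
        "required": ["thought"],
    },
}


def handle_think(tool_input: dict) -> str:
    """Handle a call to the think tool. Deliberately has no side effects."""
    # The thought is already in the transcript as the tool-call argument,
    # so nothing needs to be stored or executed here.
    if not isinstance(tool_input.get("thought"), str):
        return "Error: 'thought' must be a string."
    return "Thought recorded."
```

The handler does almost nothing by design, which fits the description's point that the gains come from the system-prompt guidance on when and how to think, not from the tool itself.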