← ClaudeAtlas

agent-opslisted

Enterprise SRE patterns for AI agent operations. Provides cost caps, circuit breakers, stall detection, observability, and runbook-driven incident response for autonomous agent workflows. Use when: (1) Running long autonomous agent sessions, (2) Managing multi-agent swarms, (3) Monitoring agent costs and performance, (4) Debugging stuck or expensive agent loops, (5) Setting up agent observability.
stevengonsalvez/agents-in-a-box · ★ 14 · AI & Automation · score 71
Install: claude install-skill stevengonsalvez/agents-in-a-box
# Agent Ops — SRE for AI Agents ## Quick Reference | Pattern | Purpose | |---------|---------| | Cost Cap | Hard-stop agent when spend exceeds budget | | Circuit Breaker | Halt after N consecutive failures | | Stall Detector | Kill agent stuck in loops | | Health Dashboard | Real-time agent metrics | | Runbook | Automated incident response | ## Cost Cap Pattern Set maximum spend per session or per agent: ```bash # Check current session cost METRICS_FILE="$HOME/{{TOOL_DIR}}/metrics/costs.jsonl" if [[ -f "$METRICS_FILE" ]]; then SESSION_ID="${CLAUDE_SESSION_ID:-default}" TOTAL=$(grep "$SESSION_ID" "$METRICS_FILE" | \ python3 -c "import sys,json; print(sum(json.loads(l)['estimated_cost_usd'] for l in sys.stdin))") echo "Session cost: \$$TOTAL" fi ``` **Implementation in agent workflows:** ```python # Cost cap check — add to any long-running agent loop MAX_COST_USD = float(os.environ.get("AGENT_COST_CAP", "5.00")) def check_cost_cap(session_id: str) -> bool: """Return True if within budget, False if exceeded.""" metrics_file = Path.home() / ".claude" / "metrics" / "costs.jsonl" if not metrics_file.exists(): return True total = 0.0 for line in metrics_file.read_text().splitlines(): try: row = json.loads(line) if row.get("session_id") == session_id: total += row.get("estimated_cost_usd", 0) except json.JSONDecodeError: continue return total < MAX_COST_US