
rca

Root cause analysis for process crashes, server deaths, and unexplained system failures using macOS logs, session forensics, and code tracing
bjornjee/agent-dashboard · ★ 11 · AI & Automation · score 82
Install: claude install-skill bjornjee/agent-dashboard
Root cause analysis for a system-level failure. **Gather ALL evidence before reasoning about the cause.**

Incident description: $ARGUMENTS

## Instructions

Follow these phases strictly in order. Do NOT speculate or reason about root cause until Phase 5. Every phase has a gate.

---

### Phase 1: Scope the Incident

1. Parse the incident description: what died? (process, tmux server, container, service, etc.)
2. Establish the **time window**: ask the user or derive from context when the failure was noticed and when things last worked.
3. Identify what was running at the time. Check:
   - `tmux list-sessions` / `tmux list-panes -a` (if tmux is back up)
   - Shell history with timestamps to reconstruct user activity:
     ```
     tail -500 ~/.zsh_history | while IFS= read -r line; do
       if echo "$line" | grep -q "^: [0-9]"; then
         ts=$(echo "$line" | sed 's/^: \([0-9]*\):.*/\1/')
         cmd=$(echo "$line" | sed 's/^: [0-9]*:[0-9]*;//')
         dt=$(date -r "$ts" "+%Y-%m-%d %H:%M:%S" 2>/dev/null)
         echo "$dt $cmd"
       fi
     done
     ```
   - Filter to the time window around the incident.
4. Identify **all active agents/processes** by checking Codex session logs:
   ```
   ls -lt "${CODEX_HOME:-$HOME/.codex}"/sessions/*/*/*/rollout-*.jsonl 2>/dev/null | head -15
   ```
   Cross-reference modification times with the incident window.

**Gate:** Time window is established. List of active processes/sessions at the time is known.

---

### Phase 2: System-