shiken

Construct novelty probes that distinguish genuine situated reasoning from pattern-matching. Builds examination scenarios where routine execution fails but interpretation succeeds. Measures Autonomous Reasoning Fidelity. USE WHEN: test reasoning quality, construct novelty probes, is the agent actually reasoning, ARF measurement, shiken, novelty injection, anti-compliance test, distinguish reasoning from pattern-matching, stress test.
ntholm86/principles-of-earned-autonomy-skills-suite · ★ 0 · AI & Automation · score 71

Install: claude install-skill ntholm86/principles-of-earned-autonomy-skills-suite

# Shiken

*Build the test that the checklist cannot pass.*

Shiken constructs examination scenarios that distinguish genuine situated reasoning from pattern-matching. It operationalizes the Autonomous Reasoning Fidelity (ARF) concept from PRINCIPLES.md: given two cases that look similar on the surface but differ in a material way, does the agent's reasoning path diverge where it should?

**Part of the suite:** For orchestration, see **Kata**. For incremental improvement, see **Kaizen**. For structural redesign, see **Kaikaku**. For reflection on the improvement loop itself, see **Hansei**.

## Why This Skill Exists

An agent following a checklist and an agent reasoning about a mission can produce identical-looking output in routine cases. The difference only becomes visible when the situation is novel - when the checklist does not cover what is happening, and the agent must interpret rather than match.

Shiken creates those novel situations deliberately. Not to trick the agent, but to distinguish the two modes of operation. If the skill set works, Shiken should be passable. If the skill set has drifted toward prescription, Shiken exposes where.

## The Work

### 1. Identify What to Probe

Select a skill or system capability to examine. What claim does it make about reasoning?

- Kaizen claims to diagnose by understanding the target, not by running through categories
- Kaikaku claims to evaluate structural adequacy, not just apply a "rewrite threshold"
- Hansei claims to find