run-evals

Run the Nomos eval suite -- recall@5, per-user isolation, and the end-to-end agent eval with the Opus-4.8 DB-content audit + the spec-driven feature-manifest audit. Use /run-evals when asked to run the evals, verify the memory system, check tenant isolation, or audit that features are actually wired and their DB effects land.
project-nomos/nomos · ★ 22 · AI & Automation · score 84

Install: claude install-skill project-nomos/nomos

# Run Evals

The eval suite verifies three things: the memory system retrieves the right memories (recall@5), per-user data never leaks across tenants, and every feature is actually wired and produces the durable DB state it promises.

All commands run from the repo root and need a real `DATABASE_URL` (PostgreSQL + pgvector) and a model provider (`ANTHROPIC_API_KEY`, Vertex via `CLAUDE_CODE_USE_VERTEX=1`, or `NOMOS_USE_SUBSCRIPTION=true`).

> On macOS, prefix commands with `PGGSSENCMODE=disable` if you see GSSAPI connection noise.

## Quick reference

| Command | What it does |
| --- | --- |
| `pnpm eval:audit` | **The full gate.** Runs the agent eval against a throwaway `nomos_eval` database, then the Opus-4.8 label audit and the spec-driven manifest audit, then drops the database. One process, end-to-end. |
| `pnpm eval:agent` | Agent eval only (no LLM audit), against a throwaway DB. |
| `pnpm eval:agent --keep` | Runs the agent eval and keeps `nomos_eval` for inspection; also writes a results file. |
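Because every command fails late if the database or model provider is missing, it can help to check the prerequisites first. The sketch below is a hypothetical helper, not part of the Nomos repo: the function names `nomos_eval_preflight` and `run_nomos_eval` are illustrative. It checks only that the variables named above are set (not that the database is reachable or has pgvector installed), and it applies `PGGSSENCMODE=disable` on every macOS run rather than only when GSSAPI noise appears, which is a convenience assumption.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not shipped with Nomos). It checks that the
# environment variables the eval suite needs are set before running a command.

# Returns 0 if the required environment is present, 1 otherwise.
nomos_eval_preflight() {
  local ok=0
  # A real PostgreSQL + pgvector connection string is required.
  if [ -z "${DATABASE_URL:-}" ]; then
    echo "missing DATABASE_URL (PostgreSQL + pgvector)" >&2
    ok=1
  fi
  # Any one of the three model providers is enough.
  if [ -z "${ANTHROPIC_API_KEY:-}" ] \
    && [ "${CLAUDE_CODE_USE_VERTEX:-}" != "1" ] \
    && [ "${NOMOS_USE_SUBSCRIPTION:-}" != "true" ]; then
    echo "no model provider: set ANTHROPIC_API_KEY, CLAUDE_CODE_USE_VERTEX=1, or NOMOS_USE_SUBSCRIPTION=true" >&2
    ok=1
  fi
  return "$ok"
}

# Runs a pnpm eval script (e.g. eval:audit) once the preflight passes.
# Call it from the repo root; it does not change directories itself.
run_nomos_eval() {
  nomos_eval_preflight || return 1
  # Assumption: on macOS, always disable GSSAPI encryption negotiation
  # to avoid the connection noise mentioned above.
  if [ "$(uname -s)" = "Darwin" ]; then
    PGGSSENCMODE=disable pnpm "$@"
  else
    pnpm "$@"
  fi
}
```

After sourcing the file, `run_nomos_eval eval:audit` runs the full gate, and `run_nomos_eval eval:agent --keep` keeps the eval database for inspection. If a prerequisite is missing, the helper prints what is absent and returns 1 without invoking `pnpm`.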