Install: claude install-skill enriquerodrig/regulaitor
# Document Analysis (H5)
## When to use
- Analyzing a corporate document (policy, contract, impact assessment) against an EU regulatory corpus (AI Act, GDPR, NIS2, DORA).
- Extending or debugging the document pipeline modules (`document/extractor.py`, `document/sanitizer.py`, `document/segmenter.py`, `orchestration/document_graph.py`).
- Adding new anti-injection patterns for document mode.
## When NOT to use
- Chat queries → use `orchestration.graph.run` (H4) instead.
- Corpus ingestion (regulatory text) → that is `corpus/fetch.py` + `corpus/parse.py` (H1), not this pipeline.
- One-off PDF inspection → use the MCP tool `extract_document` directly; do not wrap it in custom orchestration.
## Canonical procedure
The single supported entrypoint is:
```python
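from pathlib import Path

from regulaitor.orchestration.document_graph import run_document

# Read the document as raw bytes; Path.read_bytes() closes the file afterwards.
report = run_document(
    file_bytes=Path("policy.pdf").read_bytes(),
    mime_type="application/pdf",
    language="es",  # same value as --lang in the CLI
    corpus=["ai_act", "gdpr"],  # same values as --corpus in the CLI
)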
```
CLI equivalent:
```bash
python -m scripts.analyze --file policy.pdf --lang es --corpus ai_act,gdpr
```
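The CLI flags and the `run_document` keyword arguments above line up one to one. The sketch below shows one way that mapping could be written. The helper name `build_run_document_kwargs` is hypothetical, and so is the use of `mimetypes` to infer the MIME type. Neither is confirmed to be how `scripts.analyze` works internally.

```python
import mimetypes
from pathlib import Path


def build_run_document_kwargs(file: str, lang: str, corpus: str) -> dict:
    """Hypothetical helper: turn CLI-style values into run_document kwargs."""
    path = Path(file)
    # Assumption: the MIME type is inferred from the file extension.
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        raise ValueError(f"cannot infer MIME type for {path.name!r}")
    return {
        "file_bytes": path.read_bytes(),
        "mime_type": mime_type,
        "language": lang,
        # "ai_act,gdpr" becomes ["ai_act", "gdpr"]; blank entries are dropped.
        "corpus": [name.strip() for name in corpus.split(",") if name.strip()],
    }
```

With the package installed, the result could be passed straight through as `run_document(**build_run_document_kwargs("policy.pdf", "es", "ai_act,gdpr"))`, which keeps `run_document` as the only path into the pipeline.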
## What the pipeline guarantees
1. **No bypass of the sanitizer.** MCP tools `extract_document` and `segment_document` are inspection helpers; the only way to run the full E2E flow is `run_document(...)` (in-process).
2. **No citation, no answer.** Every Finding returned has at least one literal citation validated against the corpus.
3. **Deterministic verdict aggregation.** Per-Finding leni