pdf-extract
Install: claude install-skill maaarcooo/claude-skills
# PDF Content Extraction Skill
Extract PDF content to clean, organized markdown.
## Workflow
1. **Extract** — Run script to get raw content + metadata
2. **Analyse** — Review for patterns and issues
3. **Clean** — **Manually** remove noise (footers, watermarks, branding)
4. **Organise** — Restructure fragmented content
5. **Output** — Deliver clean markdown
> **Note:** Only Step 1 uses a script. Steps 2–5 are performed manually by Claude reading and rewriting content. Do not write cleanup scripts.
## Step 1: Extract
```bash
python /mnt/skills/user/pdf-extract/scripts/extract_pdf.py \
/mnt/user-data/uploads/{filename}.pdf \
/home/claude/extracted/
```
**Options:**
| Option | Description |
|--------|-------------|
| `--pages 1-10` | Extract specific page range |
| `--method pymupdf4llm` | Force primary extractor (better formatting) |
| `--method pymupdf` | Force fallback extractor (more reliable for scanned PDFs) |
| `--min-image-size 100` | Skip images smaller than 100px (filters icons) |
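
The options above can be combined in one invocation. The following is a minimal sketch of assembling such a command from Python; the helper name `build_extract_command`, the example filename `report.pdf`, and the placement of options after the two positional paths are assumptions for illustration, not part of the skill itself.

```python
import shlex

# Script path as given in Step 1.
SCRIPT = "/mnt/skills/user/pdf-extract/scripts/extract_pdf.py"

# The two extractor names listed in the options table.
KNOWN_METHODS = ("pymupdf4llm", "pymupdf")


def build_extract_command(pdf_path, out_dir, pages=None, method=None,
                          min_image_size=None):
    """Hypothetical helper: build the argument list for extract_pdf.py."""
    cmd = ["python", SCRIPT, pdf_path, out_dir]
    if pages is not None:
        cmd += ["--pages", pages]  # e.g. "1-10"
    if method is not None:
        if method not in KNOWN_METHODS:
            raise ValueError(f"unknown method: {method!r}")
        cmd += ["--method", method]
    if min_image_size is not None:
        cmd += ["--min-image-size", str(min_image_size)]
    return cmd


# Example: first ten pages only, skipping images under 100px.
# "report.pdf" is a placeholder filename.
cmd = build_extract_command(
    "/mnt/user-data/uploads/report.pdf",
    "/home/claude/extracted/",
    pages="1-10",
    min_image_size=100,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Leaving `method` unset omits the `--method` flag, so neither extractor is forced.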
**Output:**
```
/home/claude/extracted/
├── {filename}.md # Raw markdown with YAML frontmatter
├── metadata.json # Structured metadata
└── images/ # Extracted images (if any)
```
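
Before moving on, it may be worth confirming that the extraction produced the layout shown above. A minimal sketch, assuming the markdown file is named after the PDF's stem; `summarize_output` is a hypothetical helper and `report` a placeholder name.

```python
from pathlib import Path


def summarize_output(out_dir, stem):
    """Hypothetical helper: report which expected outputs exist in out_dir."""
    out = Path(out_dir)
    images_dir = out / "images"
    # images/ is optional, so a missing directory counts as zero images.
    if images_dir.is_dir():
        image_count = sum(1 for p in images_dir.iterdir() if p.is_file())
    else:
        image_count = 0
    return {
        "markdown": (out / f"{stem}.md").is_file(),
        "metadata": (out / "metadata.json").is_file(),
        "image_count": image_count,
    }


# "report" stands in for the uploaded PDF's filename without extension.
print(summarize_output("/home/claude/extracted/", "report"))
```

A `False` for `markdown` or `metadata` suggests the extraction did not complete and Step 1 should be rerun before analysis.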
## Step 2: Analyse
Read the extracted markdown:
```bash
cat /home/claude/extracted/{filename}.md
```
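
For long documents, it can help to pull out just the YAML frontmatter before reading the body. The sketch below assumes the frontmatter sits at the top of the file between two `---` lines and holds flat `key: value` pairs; `read_frontmatter` is a hypothetical helper, and the sample values are illustrative, not real extractor output. It only inspects metadata, so the cleanup in later steps stays manual.

```python
def read_frontmatter(markdown_text):
    """Hypothetical helper: return top-level frontmatter fields as strings."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at the top of the file
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        # Skip nested or list entries; only flat "key: value" lines are kept.
        if line.startswith((" ", "\t", "-")):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # block was never closed, so treat it as absent


# Illustrative sample only; real files come from Step 1.
sample = (
    "---\n"
    "extraction_method: pymupdf4llm\n"
    "total_pages: 12\n"
    "has_outline: true\n"
    "---\n"
    "# Document title\n"
)
print(read_frontmatter(sample))
```

Values come back as plain strings. A full YAML parser would be the better choice if the frontmatter turns out to contain nested structures.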
**Check YAML frontmatter for:**
- `extraction_method` — Which extractor was used
- `total_pages` — Document length
- `has_outline` — Bookmarks exist (helps with structure)
- `total_images