Install: claude install-skill maaarcooo/claude-skills
# PDF Content Extraction Skill
Extract PDF content to clean, organized markdown.
## Workflow
1. **Extract** — Run script to get raw content + metadata
2. **Analyse** — Review for patterns and issues
3. **Clean** — Manually rewrite, omitting noise
4. **Organise** — Apply formatting principles
5. **Output** — Deliver clean markdown
## Step 1: Extract
```bash
python /mnt/skills/user/extracting-pdfs/scripts/extract_pdf.py \
/mnt/user-data/uploads/{filename}.pdf \
/home/claude/extracted/
```
For scanned PDFs or problematic extractions, use `--method pymupdf`. For page ranges, use `--pages 1-10`. To filter small icons, use `--min-image-size 100`.
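As a rough sketch, the options above might be combined in a single call. This assumes the script accepts several options at once and allows them after the positional arguments, which the text above does not confirm; `report` is a hypothetical file name.

```bash
# Hypothetical input name; substitute the real uploaded file.
filename="report"
script="/mnt/skills/user/extracting-pdfs/scripts/extract_pdf.py"

# Assumption: options can be combined and may follow the two paths.
set -- "$script" \
  "/mnt/user-data/uploads/${filename}.pdf" \
  "/home/claude/extracted/" \
  --pages 1-10 \
  --min-image-size 100

# Show the command, then run it only where the script is installed.
echo "python $*"
if [ -f "$script" ]; then
  python "$@"
fi
```

Printing the command first makes it easy to confirm the paths before anything is written to the output directory.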
**Output:**
```
/home/claude/extracted/
├── {filename}.md # Raw markdown with YAML frontmatter
├── metadata.json # Structured metadata
└── images/ # Extracted images (if any)
```
## Step 2: Analyse
Read the extracted markdown:
```bash
cat /home/claude/extracted/{filename}.md
```
**Check YAML frontmatter for:**
- `extraction_method` — Which extractor was used
- `total_pages` — Document length
- `has_outline` — Bookmarks exist (helps with structure)
- `total_images` — Number of images
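
To look at these fields without scrolling through the whole file, something like the following could print just the frontmatter. It is a minimal sketch that assumes the frontmatter is delimited by `---` lines at the top of the file; the sample file and its values are invented stand-ins for real extraction output, and `frontmatter` is a hypothetical helper name.

```bash
# Print only the lines between the first two '---' markers.
frontmatter() {
  awk '/^---[[:space:]]*$/ { if (++n == 2) exit; next } n == 1' "$1"
}

# Stand-in sample; the field values are invented for illustration.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
---
extraction_method: pymupdf
total_pages: 12
has_outline: true
total_images: 3
---
# Body text
EOF

frontmatter "$sample"
```

Against a real extraction, the call would be `frontmatter /home/claude/extracted/{filename}.md`.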
**Identify issues requiring cleanup:**
- Repeated footers/headers on every page
- Watermarks, branding, page numbers
- Fragmented sentences across line breaks
- Malformed tables
- Image markers needing repositioning
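
Repeated headers and footers can be easier to spot with a quick frequency count. The sketch below only reports candidate lines for review and changes nothing in the file; `repeated_lines` is a hypothetical helper, and the sample text is invented.

```bash
# List non-blank lines that occur more than once, most frequent first.
repeated_lines() {
  grep -v '^[[:space:]]*$' "$1" | sort | uniq -c | sort -rn | awk '$1 > 1'
}

# Stand-in sample with a footer repeated on every "page".
sample="$(mktemp)"
cat > "$sample" <<'EOF'
Chapter 1
Acme Corp Confidential
Some text.

Acme Corp Confidential
More text.
Acme Corp Confidential
EOF

repeated_lines "$sample"
```

Each output line shows a count followed by the repeated text. Lines with counts close to `total_pages` are likely page furniture, though legitimate repeated content (for example, recurring table headers) can also show up and needs a human check.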
## Step 3: Clean
> **Manual cleanup only.** Do not write scripts, sed/awk commands, or regex replacements