extracting-pdfs

Extract and clean PDF content into markdown. Use when the user uploads a PDF file and wants to convert it to clean, readable markdown. Handles text extraction, image extraction, metadata capture, and content cleanup: removes repeated footers, watermarks, page numbers, and branding, and reorganizes fragmented content into a coherent structure.
maaarcooo/claude-skills · ★ 6 · Data & Documents · score 68

Install: claude install-skill maaarcooo/claude-skills

# PDF Content Extraction Skill

Extract PDF content to clean, organized markdown.

## Workflow

1. **Extract**: Run script to get raw content + metadata
2. **Analyse**: Review for patterns and issues
3. **Clean**: Manually rewrite, omitting noise
4. **Organise**: Apply formatting principles
5. **Output**: Deliver clean markdown

## Step 1: Extract

```bash
python /mnt/skills/user/extracting-pdfs/scripts/extract_pdf.py \
    /mnt/user-data/uploads/{filename}.pdf \
    /home/claude/extracted/
```

For scanned PDFs or problematic extractions, use `--method pymupdf`. For page ranges, use `--pages 1-10`. To filter small icons, use `--min-image-size 100`.

**Output:**

```
/home/claude/extracted/
├── {filename}.md      # Raw markdown with YAML frontmatter
├── metadata.json      # Structured metadata
└── images/            # Extracted images (if any)
```

## Step 2: Analyse

Read the extracted markdown:

```bash
cat /home/claude/extracted/{filename}.md
```

**Check YAML frontmatter for:**

- `extraction_method`: Which extractor was used
- `total_pages`: Document length
- `has_outline`: Bookmarks exist (helps with structure)
- `total_images`: Number of images

**Identify issues requiring cleanup:**

- Repeated footers/headers on every page
- Watermarks, branding, page numbers
- Fragmented sentences across line breaks
- Malformed tables
- Image markers needing repositioning

## Step 3: Clean

> **Manual cleanup only.** Do not write scripts, sed/awk commands, or regex replacements