pdf-page-extract

Extract rich data from PDF pages including text spans with metadata, rendered PNG images, and page mapping. Creates persistent artifacts for downstream processing.
aiskillstore/marketplace · ★ 329 · Data & Documents · score 79

Install: claude install-skill aiskillstore/marketplace

# PDF Page Extract Skill

## Purpose

This skill extracts all necessary data from PDF pages to enable accurate AI-driven HTML generation. It produces three critical artifacts:

1. **Rich extraction data** - Text spans with font metadata (sizes, styles, positions)
2. **Rendered PNG image** - Visual reference for AI to understand page layout
3. **Page mapping** - Authoritative mapping of PDF indices to book pages

This is the **deterministic, Python-based foundation** for the entire pipeline. All extracted data is saved to persistent files for traceability and future processing.

## What to Do

1. **Validate input parameters**
   - Check PDF file exists and is readable
   - Verify page range (PDF indices or book pages)
   - Confirm output directory structure

2. **Establish page mapping** (if not already done)
   - Run: `python3 Calypso/tools/read_page_footers.py`
   - Scans page footers to establish PDF index → book page mapping
   - Saves to: `analysis/page_mapping.json`

3. **Extract rich page data** using PyMuPDF and pdfplumber
   - Run: `python3 Calypso/tools/rich_extractor.py`
   - Extracts text spans with font metadata:
     - Font name and size
     - Bold/italic flags
     - Position (bounding box)
     - Color information
   - Analyzes page structure to identify:
     - Likely headings (by size and style)
     - Paragraphs (regular text)
     - Potential lists
   - Detects tables using pdfplumber
   - Saves to: `analysis/chapter_XX/rich_extraction.json`

4. **Render PD