Install: claude install-skill aiskillstore/marketplace
# PDF Page Extract Skill
## Purpose
This skill extracts all necessary data from PDF pages to enable accurate AI-driven HTML generation. It produces three critical artifacts:
1. **Rich extraction data** - Text spans with font metadata (sizes, styles, positions)
2. **Rendered PNG image** - Visual reference for AI to understand page layout
3. **Page mapping** - Authoritative mapping of PDF indices to book pages
This is the **deterministic, Python-based foundation** for the entire pipeline. All extracted data is saved to persistent files for traceability and future processing.
## What to Do
1. **Validate input parameters**
   - Check that the PDF file exists and is readable
   - Verify the page range (PDF indices or book pages)
   - Confirm the output directory structure
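
The validation checks above could look roughly like the following. This is a minimal sketch, not the skill's actual code: the function name `validate_inputs`, its parameters, the use of 0-based PDF indices, and the choice to create a missing output directory are all assumptions made for illustration.

```python
from pathlib import Path


def validate_inputs(pdf_path, first_page, last_page, output_dir, page_count=None):
    """Hypothetical helper: check the PDF, the page range, and the output directory.

    Pages are treated as 0-based PDF indices (an assumption).
    """
    pdf = Path(pdf_path)
    if not pdf.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf}")

    # Opening the file proves it is readable; the header check is a cheap
    # sanity test (a few valid PDFs carry leading bytes before "%PDF-").
    with pdf.open("rb") as fh:
        if fh.read(5) != b"%PDF-":
            raise ValueError(f"File does not start with a PDF header: {pdf}")

    if first_page < 0 or last_page < first_page:
        raise ValueError(f"Invalid page range: {first_page}-{last_page}")
    if page_count is not None and last_page >= page_count:
        raise ValueError(f"Page {last_page} is beyond the last index {page_count - 1}")

    # Assumption: a missing output directory is created rather than rejected.
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    return pdf, out
```

Failing early here keeps a bad path or a reversed page range from surfacing later as a confusing extraction error.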
2. **Establish page mapping** (if not already done)
   - Run: `python3 Calypso/tools/read_page_footers.py`
   - Scans page footers to establish the PDF index → book page mapping
   - Saves to: `analysis/page_mapping.json`
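
To illustrate the idea behind the footer scan, here is a small sketch of turning footer text into a PDF index → book page mapping. The internals of `read_page_footers.py` and the schema of `page_mapping.json` are not shown in this document, so the regular expression, the function names, and the `{"<pdf index>": <book page>}` layout below are assumptions.

```python
import json
import re

# Assumed footer layout: the book page number is a standalone integer at the
# very start or very end of the footer text (e.g. "18 Chapter 1" or "Chapter 1 17").
FOOTER_PAGE_RE = re.compile(r"^\s*(\d+)\b|\b(\d+)\s*$")


def book_page_from_footer(footer_text):
    """Return the book page number found in a footer, or None if there is none."""
    match = FOOTER_PAGE_RE.search(footer_text)
    if match is None:
        return None
    return int(match.group(1) or match.group(2))


def build_page_mapping(footers):
    """Map PDF indices to book pages; pages without a numbered footer are skipped.

    `footers` is a dict of PDF index -> footer text (hypothetical input shape).
    Keys are stored as strings because JSON object keys must be strings.
    """
    mapping = {}
    for pdf_index, text in sorted(footers.items()):
        page = book_page_from_footer(text)
        if page is not None:
            mapping[str(pdf_index)] = page
    return mapping


if __name__ == "__main__":
    sample = {0: "Cover", 6: "Chapter 1 17", 7: "18 Chapter 1"}
    print(json.dumps(build_page_mapping(sample), indent=2))
```

Front matter and cover pages often have no numbered footer, which is why the sketch skips them instead of guessing an offset.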
3. **Extract rich page data** using PyMuPDF and pdfplumber
   - Run: `python3 Calypso/tools/rich_extractor.py`
   - Extracts text spans with font metadata:
     - Font name and size
     - Bold/italic flags
     - Position (bounding box)
     - Color information
   - Analyzes page structure to identify:
     - Likely headings (by size and style)
     - Paragraphs (regular text)
     - Potential lists
   - Detects tables using pdfplumber
   - Saves to: `analysis/chapter_XX/rich_extraction.json`
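
The structure analysis described above (headings by size and style, paragraphs, potential lists) can be approximated with a simple heuristic. The sketch below is not taken from `rich_extractor.py`; it assumes span dictionaries shaped like the ones PyMuPDF returns from `page.get_text("dict")` (keys `text`, `size`, `flags`), and the function names, the 1.2 size ratio, and the 60-character limit are hypothetical.

```python
import re
from collections import Counter

# Bit positions of PyMuPDF span flags.
ITALIC = 1 << 1
BOLD = 1 << 4

# Assumed list markers: a bullet character or a short "1." / "1)" prefix.
LIST_MARKER_RE = re.compile(r"^(?:[•\-*]|\d+[.)])\s+")


def body_font_size(spans):
    """Estimate the body font size as the size that covers the most characters."""
    weights = Counter()
    for span in spans:
        weights[round(span["size"], 1)] += len(span["text"])
    return weights.most_common(1)[0][0]


def classify_span(span, body_size, ratio=1.2, max_bold_heading_len=60):
    """Label a span as 'heading', 'list_item', or 'paragraph' (heuristic only)."""
    text = span["text"].strip()
    is_bold = bool(span["flags"] & BOLD)

    # Clearly larger than body text, or short bold text: likely a heading.
    if span["size"] >= body_size * ratio:
        return "heading"
    if is_bold and len(text) <= max_bold_heading_len:
        return "heading"
    if LIST_MARKER_RE.match(text):
        return "list_item"
    return "paragraph"
```

Weighting font sizes by character count keeps a handful of large headings from being mistaken for the body size. The labels are only hints, which matches the "likely" and "potential" wording above.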
4. **Render PD