pdf-text-extractlisted
Install: claude install-skill baronguyen001/ai-automation-skills
# PDF Text Extract
Use this skill when a workflow needs machine-readable text from a digital PDF before summarization, classification, or structured extraction. The helper uses pure-Python PDF libraries when available and includes a simple line-based table detector for downstream cleanup.
## When to invoke
- User says: "extract text from this PDF", "feed a PDF to AI", "pull tables from this report", "read this statement".
- The PDF already contains selectable text and does not require OCR.
## When NOT to invoke
- The PDF is a scanned image; use an OCR workflow instead.
- The user needs pixel-perfect table reconstruction with merged cells and layout fidelity.
## Concrete example
User input:
```text
Extract the text and rough tables from this PDF so Gemini can summarize it.
```
Output:
```python
# Copy assets/extract.py into your project, then:
from extract import extract_pdf
doc = extract_pdf("downloads/report.pdf")
print(doc["text"][:2000])
for table in doc["tables"]:
print(table["page"], table["rows"][:3])
```
Install either `pypdf` or `pdfminer.six` in the target project. No OCR binary is required, and the helper fails clearly when the PDF has no extractable text.
## Pattern to apply
1. Prefer digital text extraction first; do not add OCR unless the source is scanned.
2. Preserve page boundaries so downstream prompts can cite page numbers.
3. Keep table extraction simple: split rows on tabs or repeated spaces, then let a later schema pass normalize columns