document-handling
Install: claude install-skill liujiarui0918/claude-code-codex-strongest
# Handling Office and PDF documents
`.docx`, `.xlsx`, `.pptx`, and `.pdf` are binary formats: the three Office types are ZIP archives of XML parts, and PDF has its own binary layout. `Read` will not show you their text directly, and shelling out to `cat` returns garbage. Always go through a programmatic library in a short Python (or Node) script.
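To see why `cat` is not useful here, it can help to look at the first few bytes of a file. The sketch below is a minimal illustration, not part of the skill itself: `sniff_format` is a hypothetical helper name, and it only recognises the well-known leading signatures for ZIP-based Office files, PDF, and legacy OLE2 files such as `.xls`.

```python
# Leading "magic" bytes for the formats discussed in this document.
_SIGNATURES = {
    b"PK\x03\x04": "zip",        # .docx / .xlsx / .pptx are ZIP archives
    b"%PDF-": "pdf",             # PDF header
    b"\xd0\xcf\x11\xe0": "ole2",  # legacy .xls / .doc / .ppt
}


def sniff_format(path):
    """Return 'zip', 'pdf', 'ole2', or None based on the file's first bytes.

    Hypothetical helper for illustration; it does not validate the rest
    of the file, only the leading signature.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, label in _SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None
```

A result of `"zip"` only says the file is a ZIP archive; the extension (or the archive contents) still decides whether it is a Word, Excel, or PowerPoint document.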
## When to trigger
- The user asks to read, extract, modify, or generate any of: `.docx`, `.xlsx`, `.xls`, `.pptx`, `.pdf`.
- A workflow expects "the spreadsheet" or "the slides" as input.
- You see a path to an Office or PDF file passed as a parameter.
## Library selection
Pick the smallest library that does the job. Python is usually the right host because the libraries are mature and well-documented.
### `.docx` (Word)
- **Read + write**: `python-docx`. Stable, handles paragraphs, tables, styles, headers.
- Node alternatives (`docx`, `mammoth`) exist but are weaker for round-tripping. Use `python-docx` unless you are already deep in a Node project.
- For documents with complex layout or comments you need to keep, consider converting to Markdown with `pandoc` first.
### `.xlsx` / `.xls` (Excel)
- **Read + write**: `openpyxl` (xlsx only). Cell-level access, formulas, styles.
- **Bulk read**: `pandas.read_excel` with the `openpyxl` engine. Fastest path when you only need data, not formatting.
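- **`.xls` legacy**: `xlrd` (read only). Current releases (2.0+) read `.xls` only; 1.2.0 was the last version that also read `.xlsx`, so a pin to 1.2.0 is not needed for legacy files.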
- **Node**: `exceljs` is the closest equivalent to openpyxl.
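A minimal sketch of cell-level access with `openpyxl` (assuming `pip install openpyxl`) might look like the following. The helper names `read_rows` and `set_cell` are illustrative only.

```python
from openpyxl import load_workbook


def read_rows(path, sheet=None):
    """Return every row of a worksheet as a tuple of cell values.

    data_only=True asks for cached formula results instead of formula
    strings; a cell whose formula was never calculated by Excel reads
    as None.
    """
    wb = load_workbook(path, data_only=True)
    ws = wb[sheet] if sheet else wb.active
    return list(ws.iter_rows(values_only=True))


def set_cell(path, cell, value, sheet=None):
    """Write one value (e.g. cell='B2') and save the workbook in place."""
    # Loaded without data_only so existing formulas are kept on save.
    wb = load_workbook(path)
    ws = wb[sheet] if sheet else wb.active
    ws[cell] = value
    wb.save(path)
```

When only the data matters, `pandas.read_excel(path, engine="openpyxl")` returns a DataFrame directly and skips the per-cell loop.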
### `.pdf`
- **Read text**: `pdfplumber` (best layout fidelity) or `pypdf` (simpler, faster for plain text).
- **Read ta