ebook-ingest

Find, download, and prep an ebook as clean Markdown for AI ingestion. Use when digitizing an owned print book, converting EPUB/PDF/MOBI, or chunking for RAG. Not for academic papers or DRM'd ebooks.
tomcounsell/ai · ★ 18 · AI & Automation · score 71

Install: claude install-skill tomcounsell/ai

# Ebook acquisition and AI ingestion prep

Fires on any request to prepare a book for RAG, fine-tuning, or long-context reference. Do NOT use for academic papers (use Sci-Hub/unpaywall), bulk public-domain scraping (hit Gutenberg's API directly), or DRM'd commercial ebooks the user has not purchased.

## Overview

End-to-end pipeline: a named book in → clean, structured Markdown out, ready for an AI agent to consume. Stages: search → download → convert → clean → chunk. The deliverable is `library/processed/<author-slug>/<title-slug>.md` with YAML frontmatter (plus optional RAG chunks); success means the text is free of page numbers, running headers, and broken hyphenation, and carries accurate metadata.

Assumes the user owns a print copy and is creating a personal digital backup for private AI use. Skip this skill if that premise doesn't hold.

## Quick reference

| Step | Tool | Output |
|------|------|--------|
| Search | Anna's Archive (meta), Gutenberg, Standard Ebooks | candidate file URLs |
| Download | `curl` / `wget` | `library/raw/<slug>.<ext>` |
| Convert | `pandoc` (EPUB), `pdftotext -layout` (PDF), `ocrmypdf` (scanned) | raw `.md` or `.txt` |
| Clean | bundled `scripts/clean_book.py` | normalized Markdown |
| Metadata | YAML frontmatter | `<slug>.md` |
| Chunk (optional) | `langchain` text splitters | `chunks/*.json` |

## Prerequisites

```bash
# macOS
brew install calibre pandoc poppler tesseract ocrmypdf

# Ubuntu/Debian
apt-get install calibre pandoc poppler-