ebook-ingest
Install: claude install-skill tomcounsell/ai
# Ebook acquisition and AI ingestion prep
## Overview
End-to-end pipeline for turning a named book into clean, structured Markdown ready for an AI agent to consume. Covers search → download → convert → clean → add metadata → chunk (optional).
Assumes the user owns a print copy and is creating a personal digital backup for private AI use. Skip this skill if that premise doesn't hold.
## Quick reference
| Step | Tool | Output |
|------|------|--------|
| Search | Anna's Archive (meta), Gutenberg, Standard Ebooks | candidate file URLs |
| Download | `curl` / `wget` | `library/raw/<slug>.<ext>` |
| Convert | `pandoc` (EPUB), `pdftotext -layout` (PDF), `ocrmypdf` (scanned) | raw `.md` or `.txt` |
| Clean | `clean_book.py` | normalized Markdown |
| Metadata | YAML frontmatter | `<slug>.md` |
| Chunk (optional) | `langchain` text splitters | `chunks/*.json` |
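
The Convert row can be sketched as a small dispatcher that picks a tool from the file extension. This is a minimal sketch, not the skill's actual implementation: the function name `conversion_commands`, the `scanned` flag, the `.ocr.pdf` intermediate name, and the choice of `gfm` as pandoc's output format are all assumptions made for illustration. It only builds the command lines; running them (for example with `subprocess.run`) is left to the caller.

```python
from pathlib import Path


def conversion_commands(src: Path, scanned: bool = False) -> list[list[str]]:
    """Return the commands (as argv lists) that convert one raw ebook file.

    Hypothetical helper: names and output paths are illustrative only.
    """
    ext = src.suffix.lower()

    if ext == ".epub":
        # pandoc infers the EPUB input format from the file extension.
        return [["pandoc", str(src), "-t", "gfm", "-o", str(src.with_suffix(".md"))]]

    if ext == ".pdf":
        commands: list[list[str]] = []
        pdf = src
        if scanned:
            # Scanned PDFs get a text layer first; the OCR output name is an assumption.
            pdf = src.with_name(src.stem + ".ocr.pdf")
            commands.append(["ocrmypdf", str(src), str(pdf)])
        # -layout asks pdftotext to preserve the physical page layout.
        commands.append(["pdftotext", "-layout", str(pdf), str(src.with_suffix(".txt"))])
        return commands

    raise ValueError(f"No converter configured for {ext!r}")
```

Returning argv lists rather than shell strings keeps file names with spaces safe and lets the caller log or dry-run each step before executing it.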
## Prerequisites
```bash
# macOS
brew install calibre pandoc poppler tesseract ocrmypdf
# Ubuntu/Debian
sudo apt-get install -y calibre pandoc poppler-utils tesseract-ocr ocrmypdf
# Python
pip install ebooklib beautifulsoup4 markdownify pymupdf langchain-text-splitters httpx
```
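
Before running the pipeline it can help to confirm that the command-line tools are actually on `PATH`. The check below is a rough sketch using only the standard library; the helper name `missing_tools` and the exact tool list are assumptions (`ebook-convert` is calibre's command-line converter, and `pdftotext` ships with poppler).

```python
import shutil

# Executables the install commands above are expected to provide (assumed list).
REQUIRED_TOOLS = ["pandoc", "pdftotext", "tesseract", "ocrmypdf", "ebook-convert"]


def missing_tools(tools: list[str]) -> list[str]:
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools found.")
```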
## Configuration
Anna's Archive supports authenticated programmatic downloads for paid members. Two env vars are required:
```bash
# In ~/Desktop/Valor/.env (already set on this machine)
ANNAS_ARCHIVE_ACCOUNT_ID="<account-id>"
ANNAS_ARCHIVE_SECRET_KEY="<secret-key>"
```
Both values are available at `https://annas-archive.org/account` after donating. They are passed as query