web-archive-scraper

Search the Wayback Machine for archived versions of websites. Extract cached pages, customer lists, testimonials, and partner directories from sites that have changed or gone offline. Uses the free CDX API — no API key needed.
gooseworks-ai/goose-skills · ★ 708 · Web & Frontend · score 80

Install: claude install-skill gooseworks-ai/goose-skills

# Web Archive Scraper

Search the Wayback Machine (Internet Archive) for archived snapshots of websites. Fetch cached page content to find customer lists, testimonials, partner directories, and other information from sites that have changed or shut down.

## Quick Start

Only dependency is `requests`. No API key needed.

```bash
# Find all snapshots of a URL
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com/customers"

# Search with date range
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com" --from 2025-01-01 --to 2026-02-01

# Search all pages under a domain (prefix match)
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com" --match prefix --limit 50

# Fetch the actual archived page content
python3 skills/web-archive-scraper/scripts/search_archive.py \
  --url "https://botkeeper.com/customers" --fetch

# Output formats
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output json
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output csv
python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output summary
```

## How It Works

1. **CDX API search**: Queries `web.archive.org/cdx/search/cdx` for snapshots matching the URL
2. **Filtering**: Filters by date range, HTTP status code, and MIME type
3. **Dedup**: Collapses to one snapshot per day by default to avoid redundant results
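
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `search_archive.py`: the function names (`build_cdx_query`, `parse_cdx_rows`, `search_archive`) are hypothetical, the defaults of status `200` and MIME type `text/html` are assumptions, and it uses the standard library instead of `requests` so it runs without installing anything. The query parameters (`matchType`, `from`, `to`, `filter`, `collapse`, `limit`, `output`) are the ones the CDX server documents.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, date_from=None, date_to=None, match="exact",
                    limit=None, status="200", mimetype="text/html",
                    one_per_day=True):
    """Return (key, value) pairs for a CDX search (hypothetical helper)."""
    params = [("url", url), ("output", "json"), ("matchType", match)]
    # CDX timestamps are digit strings (yyyyMMddhhmmss), so drop the dashes
    # from ISO-style dates such as 2025-01-01.
    if date_from:
        params.append(("from", date_from.replace("-", "")))
    if date_to:
        params.append(("to", date_to.replace("-", "")))
    # Step 2: filter by HTTP status code and MIME type (assumed defaults).
    if status:
        params.append(("filter", f"statuscode:{status}"))
    if mimetype:
        params.append(("filter", f"mimetype:{mimetype}"))
    # Step 3: collapsing on the first 8 timestamp digits (yyyyMMdd) keeps
    # one capture per run of adjacent same-day results.
    if one_per_day:
        params.append(("collapse", "timestamp:8"))
    if limit:
        params.append(("limit", str(limit)))
    return params


def parse_cdx_rows(rows):
    """Turn CDX JSON output (first row is the header) into dicts."""
    if not rows:
        return []
    header, *records = rows
    snapshots = [dict(zip(header, record)) for record in records]
    for snap in snapshots:
        # Archived pages are served at /web/<timestamp>/<original URL>.
        snap["snapshot_url"] = (
            f"https://web.archive.org/web/{snap['timestamp']}/{snap['original']}"
        )
    return snapshots


def search_archive(url, **options):
    """Step 1: query the CDX API and return parsed snapshots (needs network)."""
    query = urlencode(build_cdx_query(url, **options))
    with urlopen(f"{CDX_ENDPOINT}?{query}", timeout=30) as resp:
        body = resp.read().decode("utf-8")
    return parse_cdx_rows(json.loads(body) if body.strip() else [])
```

Calling `search_archive("https://botkeeper.com", match="prefix", limit=50)` would mirror the prefix-match command in Quick Start. Each returned dict carries a `snapshot_url` that can then be fetched to read the archived page.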