webreaper


Scrape, crawl, or extract structured data from one or more URLs via the `webreaper` CLI. Outputs clean Markdown by default; JSON when a schema is given. Maps a site's URLs in one call. Handles JS-rendered pages and bot-protected sites (Cloudflare, DataDome, PerimeterX) via auto-escalating stealth.

Use this skill whenever the user asks to:

- scrape, crawl, or extract from a URL or site
- get clean Markdown of a webpage (for further processing, not a summary)
- pull specific fields from one or many pages
- enumerate / discover URLs on a site
- read a JS-rendered single-page app
- scrape a site that's blocking direct requests

Trigger phrases include: "scrape <site>", "crawl <site>", "extract <data> from <url>", "what's on <site>", "what pages does <site> have", "give me the markdown of <url>", "convert <url> to markdown", "pull <field> from <url>", "save <article> as markdown", "build a scraper for <site>", "read <url> into context", "this site is blocking me", "Cloudflare-protected site".

Prefer this over the b

Data & Documents 134 stars 32 forks Updated today MIT


Quality Score: 87/100

Stars (weight 20%): 71
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# WebReaper: scraping & extraction CLI

Three commands. Each is one shell call; output goes to stdout unless `--output`.

## What to run

| The user wants… | Run |
|---|---|
| The readable text of one page | `webreaper scrape <url>` |
| The readable text saved to a file | `webreaper scrape <url> --output page.md` |
| Specific fields from one page | `webreaper scrape <url> --schema schema.json` |
| Every page of a whole site (Markdown) | `webreaper crawl <url> > pages.jsonl` |
| Every page of a whole site (fields) | `webreaper crawl <url> --schema schema.json` |
| URLs on a site | `webreaper map <url>` |
| URLs matching a substring | `webreaper map <url> --search /blog/ --max-urls 50` |
| A JS-rendered page (SPA) | `webreaper scrape <url> --browser` |
| A bot-protected site | `webreaper scrape <url> --browser --auto-stealth` |
| Fields from every linked page | `webreaper scrape <index-url> --follow "<css selector>" --schema schema.json` |

## Whole-site crawl in one command

To cover an entire site, use `crawl`: it recursively follows every on-domain link from the start URL, extracts each page, and streams JSON Lines (one object per page). Markdown by default; `--schema` switches to field extraction.

```bash
# Every page, as Markdown, to a file.
webreaper crawl https://example.com > pages.jsonl

# Every page, specific fields, bounded.
webreaper crawl https://example.com --schema schema.json --max-pages 200
```

Flags: `--max-pages <n>` (default 1000), `--max-depth <n>` (hops...
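
The `crawl` command above is described as streaming JSON Lines, one object per page. A minimal Python sketch for reading such output might look like the following. The record schema is not shown here, so the field names in the sample (`url`, `markdown`) are hypothetical, and the helper names `read_jsonl` and `summarize` are illustrative only, not part of the CLI.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


def summarize(lines: Iterable[str]) -> Tuple[int, List[str]]:
    """Return the record count and the sorted set of keys seen.

    Assumes every line holds a JSON object, as the description states.
    """
    count = 0
    keys = set()
    for record in read_jsonl(lines):
        count += 1
        keys.update(record)  # adds the object's top-level keys
    return count, sorted(keys)


if __name__ == "__main__":
    # Hypothetical sample standing in for the contents of pages.jsonl;
    # the real field names may differ.
    sample = [
        '{"url": "https://example.com/", "markdown": "# Home"}',
        '{"url": "https://example.com/about", "markdown": "# About"}',
    ]
    count, keys = summarize(sample)
    print(f"{count} pages, keys: {keys}")
```

To run this against a real crawl, pass an open file instead of the sample list, for example `summarize(open("pages.jsonl", encoding="utf-8"))`. Reading line by line keeps memory use flat even when the crawl reaches the `--max-pages` limit.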

Details

Author: pavlovtech
Repository: pavlovtech/WebReaper
Created: 4 years ago
Last Updated: today
Language: C#
License: MIT
