webreaper

Solid

Scrape, crawl, or extract structured data from one or more URLs via the `webreaper` CLI. Outputs clean Markdown by default; JSON when a schema is given. Maps a site's URLs in one call. Handles JS-rendered pages and bot-protected sites (Cloudflare, DataDome, PerimeterX) via auto-escalating stealth.

Use this skill whenever the user asks to:

- scrape, crawl, or extract from a URL or site
- get clean Markdown of a webpage (for further processing, not a summary)
- pull specific fields from one or many pages
- enumerate / discover URLs on a site
- read a JS-rendered single-page app
- scrape a site that's blocking direct requests

Trigger phrases include: "scrape <site>", "crawl <site>", "extract <data> from <url>", "what's on <site>", "what pages does <site> have", "give me the markdown of <url>", "convert <url> to markdown", "pull <field> from <url>", "save <article> as markdown", "build a scraper for <site>", "read <url> into context", "this site is blocking me", "Cloudflare-protected site". Prefer this over the b

Data & Documents 134 stars 32 forks Updated today MIT


Quality Score: 87/100

- Stars 20%
- Recency 20%: 100
- Frontmatter 20%
- Documentation 15%: 100
- Issue Health 10%
- License 10%: 100
- Description 5%: 100

Skill Content

# WebReaper: scraping & extraction CLI

Three commands. Each is one shell call; output goes to stdout unless `--output`.

## What to run

| The user wants… | Run |
|---|---|
| The readable text of one page | `webreaper scrape <url>` |
| The readable text saved to a file | `webreaper scrape <url> --output page.md` |
| Specific fields from one page | `webreaper scrape <url> --schema schema.json` |
| Every page of a whole site (Markdown) | `webreaper crawl <url> > pages.jsonl` |
| Every page of a whole site (fields) | `webreaper crawl <url> --schema schema.json` |
| URLs on a site | `webreaper map <url>` |
| URLs matching a substring | `webreaper map <url> --search /blog/ --max-urls 50` |
| A JS-rendered page (SPA) | `webreaper scrape <url> --browser` |
| A bot-protected site | `webreaper scrape <url> --browser --auto-stealth` |
| Fields from every linked page | `webreaper scrape <index-url> --follow "<css selector>" --schema schema.json` |

## Whole-site crawl in one command

To cover an entire site, use `crawl`: it recursively follows every on-domain link from the start URL, extracts each page, and streams JSON Lines (one object per page). Markdown by default; `--schema` switches to field extraction.

```bash
# Every page, as Markdown, to a file.
webreaper crawl https://example.com > pages.jsonl

# Every page, specific fields, bounded.
webreaper crawl https://example.com --schema schema.json --max-pages 200
```

Flags: `--max-pages <n>` (default 1000), `--max-depth <n>` (hops...
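Because `crawl` streams JSON Lines, one object per page, a saved `pages.jsonl` can be sanity-checked with ordinary line-oriented tools before any further processing. The sketch below is a minimal illustration and not part of WebReaper: the helper names `count_records` and `first_record` are hypothetical, and the code deliberately assumes nothing about the fields inside each object.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a JSON Lines file such as the
# pages.jsonl written by `webreaper crawl <url> > pages.jsonl`.
# Each line is treated as opaque text; no field names are assumed.

# Print the number of non-blank lines, i.e. the number of page records.
count_records() {
  local file="$1"
  # NF is non-zero only for lines that contain non-whitespace text.
  # `n + 0` forces a numeric 0 when the file has no records.
  awk 'NF { n++ } END { print n + 0 }' "$file"
}

# Print the first non-blank record so its structure can be inspected by eye.
first_record() {
  local file="$1"
  awk 'NF { print; exit }' "$file"
}
```

For example, after `webreaper crawl https://example.com > pages.jsonl`, running `count_records pages.jsonl` reports how many page records were written, which can be compared against the `--max-pages` bound, and `first_record pages.jsonl` shows what one record looks like.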

Details

Author: pavlovtech
Repository: pavlovtech/WebReaper
Created: 4 years ago
Last Updated: today
Language: C#
License: MIT

Integrates with

Cloudflare · Cloud

Similar Skills

Semantically similar based on skill content — not just same category

Web & Frontend Listed

web-scrape

Intelligent web scraper with content extraction, multiple output formats, and error handling

335 Updated today

aiskillstore

AI & Automation Solid

scrapling

Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python.

173,893 Updated today

NousResearch

Data & Documents Listed

firecrawl

Firecrawl produces cleaner markdown than WebFetch, handles JavaScript-heavy pages, and avoids content truncation. This skill should be used when fetching URLs, scraping web pages, converting URLs to markdown, extracting web content, searching the web, crawling sites, mapping URLs, LLM-powered extraction, autonomous data gathering with the Agent API, interacting with scraped pages (clicking, filling forms, extracting dynamic content via Interact API), or fetching AI-generated documentation for GitHub repos via DeepWiki. Provides complete coverage of Firecrawl v2 API endpoints including parallel agents, spark-1-fast model, sitemap-only crawling, and the Interact API for post-scrape browser interaction.

33 Updated yesterday

tdimino

Data & Documents Listed

enact-firecrawl

Scrape, crawl, search, and extract structured data from websites using Firecrawl API - converts web pages to LLM-ready markdown

335 Updated today

aiskillstore

AI & Automation Solid

harvest-single

Single page smart extraction - articles, docs, blog posts to clean markdown

495 Updated 1 month ago

vibeeval