web-scrapinglisted
Install: claude install-skill christopherlouet/claude-base
# Web Scraping (Firecrawl-first)
## Goal
Extract LLM-ready web content without hacking around: clean markdown, structured JSON, anti-bot and JS-rendering handled. Firecrawl is the reference wrapper; fallback to Playwright or `curl + html2text` if unavailable.
## When to trigger this skill
- "scrape this page / this site"
- "extract data from ..."
- "crawl site X"
- "fetch all articles from ..."
- "search the web and extract the content"
- "parse this dynamic page" (site with JS-rendering)
- "bypass the paywall / anti-bot" (legitimate use only)
## When NOT to use this skill
- Quick web search without structured extraction -> `WebSearch` is enough
- A single static URL, simple page -> `WebFetch` is enough
- Visual test / browser interaction -> skill `qa-chrome` or agent-browser
- Form / login automation -> agent-browser or Playwright directly
## Prerequisites
### Option 1: Firecrawl cloud (recommended)
```bash
export FIRECRAWL_API_KEY="fc-xxx" # https://firecrawl.dev
npm install -g firecrawl # or pip install firecrawl-py
```
### Option 2: Firecrawl self-hosted
Docker compose available on github.com/mendableai/firecrawl. Useful if data is sensitive or budget is limited.
### Option 3: Fallback without Firecrawl
If Firecrawl is missing, degrade gracefully:
| Need | Fallback | Limitation |
|--------|----------|------------|
| Simple static page | `curl -sL URL \| pandoc -f html -t markdown` | No JS rendering |
| JS-heavy page | `npx playwright` + `p