← ClaudeAtlas

scraperapi-crawlerlisted

Product-usage reference for ScraperAPI's Crawler — crawl an entire site or section by following links automatically. Consult when the user needs to extract data from many pages of a site without knowing the URLs upfront. Use when user asks: "crawl an entire website with ScraperAPI", "scrape all pages on a domain", "follow links and scrape each page", "how do I use the ScraperAPI crawler API", "scrape a site map", "extract data from every product page on a site", "ScraperAPI crawler job API". Covers job creation, URL regex patterns, depth vs budget, per-page scraping parameters, status polling, webhooks, scheduling, and credit costs. Also invoke when the user is building a site-wide scraper and asks which ScraperAPI product to use.
scraperapi/scraperapi-skills · ★ 9 · AI & Automation · score 78
Install: claude install-skill scraperapi/scraperapi-skills
# ScraperAPI Crawler The Crawler discovers and scrapes linked pages automatically, starting from a seed URL and following links that match your regex pattern. Use it when you need data from many pages of a site but don't have the URLs upfront. ## When NOT to use the Crawler - **You have a known URL list** → use the [Async API](https://docs.scraperapi.com/making-async-requests) instead; it's cheaper and more predictable. - **You need a single page** → use the Standard API (`api.scraperapi.com`). - **The content is behind login** → the Crawler only accesses publicly available pages. - **Free plan, need depth > 1** → free accounts are capped at `max_depth: 1`. ## Endpoints | Action | Method | URL | |--------|--------|-----| | Start a crawl | POST | `https://crawler.scraperapi.com/job` | | Check status | GET | `https://crawler.scraperapi.com/job/<jobId>` | | Cancel a crawl | DELETE | `https://crawler.scraperapi.com/job/<jobId>` | Auth: include `api_key` in the JSON body (POST) or as a query parameter (GET/DELETE). ## Starting a Crawl ```python import os, requests API_KEY = os.environ["SCRAPERAPI_API_KEY"] job = requests.post( "https://crawler.scraperapi.com/job", json={ "api_key": API_KEY, "start_url": "https://example.com/blog/", "url_regexp_include": "(?<full_url>https?://example\\.com/blog/.*)", "url_regexp_exclude": ".*\\.(pdf|png|jpg|zip)", "max_depth": 3, "crawl_budget": 500