harvest-deep-crawl

Solid

Multi-page deep crawling - documentation sites, wikis, knowledge bases

AI & Automation · 495 stars · 41 forks · Updated 1 month ago · MIT

Quality Score: 89/100

Stars (20% weight): 90
Recency (20% weight): 75
Frontmatter (20% weight): 70
Documentation (15% weight): 100
Issue Health (10% weight): 50
License (10% weight): 100
Description (5% weight): 100

Skill Content

# Harvest Deep Crawl

Crawl multi-page websites following internal links to a specified depth. Ideal for building complete knowledge bases from documentation sites, wikis, and reference materials.

## Usage

```
/crawl <url> --depth <N>
```

## Examples

```bash
# Crawl docs site 3 levels deep
/crawl https://docs.example.com --depth 3

# Crawl a specific section
/crawl https://docs.example.com/api --depth 2

# Crawl with page limit
/crawl https://wiki.example.com --depth 5 --max-pages 50
```

## Parameters

| Param | Default | Description |
|-------|---------|-------------|
| `--depth` | 2 | Max link-following depth |
| `--max-pages` | 100 | Max pages to crawl |
| `--same-domain` | true | Stay on same domain |
| `--include` | * | URL pattern to include |
| `--exclude` | - | URL pattern to exclude |

## How It Works

1. Start at root URL, extract all internal links
2. Follow links up to specified depth (BFS order)
3. Extract content from each page
4. Deduplicate pages with > 90% content overlap
5. Build table of contents from page hierarchy
6. Merge into coherent knowledge base
7. Save to `.claude/cache/agents/harvest/crawl-{domain}/`

## Output Structure

```
crawl-{domain}-{timestamp}/
  index.md          # Table of contents + summary
  page-001.md       # First page content
  page-002.md       # Second page content
  ...
  metadata.json     # Crawl stats, URLs, timings
```

## Crawl Engine

### Primary: crawl4ai (Docker port 11235)

```bash
curl -s http://localhost:11235/cr...
```
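The traversal described under "How It Works" (steps 1 and 2) can be illustrated with a minimal Python sketch. This is not the skill's actual implementation. The function `plan_crawl`, the `get_links` callback, and the in-memory `site` map are hypothetical names introduced here so the example runs without network access. It also assumes that depth 0 means the root page only and that `--include` / `--exclude` are shell-style glob patterns, neither of which the preview confirms.

```python
from collections import deque
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def plan_crawl(root, get_links, depth=2, max_pages=100,
               same_domain=True, include="*", exclude=None):
    """Return the URLs a crawl would visit, in breadth-first order.

    `get_links(url)` is a hypothetical callback returning the links found
    on a page; a real crawler would fetch and parse the page instead.
    """
    root_host = urlparse(root).netloc
    seen = {root}                 # URLs already queued, to avoid revisits
    order = []                    # visit order (BFS)
    queue = deque([(root, 0)])    # (url, depth at which it was found)

    while queue and len(order) < max_pages:
        url, level = queue.popleft()
        order.append(url)
        if level == depth:
            continue              # depth limit reached: do not follow links
        for link in get_links(url):
            if link in seen:
                continue
            if same_domain and urlparse(link).netloc != root_host:
                continue          # assumed meaning of --same-domain
            if not fnmatchcase(link, include):
                continue          # assumed glob semantics for --include
            if exclude and fnmatchcase(link, exclude):
                continue          # assumed glob semantics for --exclude
            seen.add(link)
            queue.append((link, level + 1))
    return order


# Toy link graph standing in for a real documentation site.
site = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://docs.example.com/b",
        "https://other.example.org/x",
    ],
    "https://docs.example.com/a": [
        "https://docs.example.com/a/1",
        "https://docs.example.com/",
    ],
    "https://docs.example.com/b": ["https://docs.example.com/b/1"],
    "https://docs.example.com/a/1": ["https://docs.example.com/a/1/deep"],
}

for page in plan_crawl("https://docs.example.com/",
                       lambda u: site.get(u, []), depth=2):
    print(page)
```

Tracking each URL's depth alongside it in the queue keeps the depth check local to each page, and the `seen` set prevents link cycles (such as a page linking back to the root) from causing repeat visits.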

Details

Author: vibeeval
Repository: vibeeval/vibecosystem
Created: 2 months ago
Last Updated: 1 month ago
Language: C#
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

harvest-single

Single page smart extraction - articles, docs, blog posts to clean markdown

495 Updated 1 month ago

AI & Automation Listed

crawl

Use when the user wants to crawl an entire website, documentation site, or multiple pages from a domain; index a whole docs section; or follow links deeply across a site. Triggers on "crawl this site", "index the whole docs", "crawl all pages under", "spider this URL", "index the entire", "grab all pages from". Prefer over scrape when breadth matters — multiple pages across a site.

2 Updated today

AI & Automation Listed

doc-harvester

Mirror an external platform's documentation site into the current project as local markdown files, so the coding agent can cross-check the official docs offline while building. Use this whenever a documentation or wiki URL appears together with a request to study, learn, fetch, download, mirror, save, scrape, or "vendor" the docs — including phrasings like "изучи документацию", "скачай документацию по ссылке", "собери вики", "study these docs", "pull the API docs into the repo", "I'm integrating with X, get its docs". Use it even if the user just pastes a docs link and says "learn this" without naming a tool, and even if they only give a deep link to one page. Prefer this skill over ad-hoc WebFetch whenever the goal is to capture a whole documentation set rather than read a single page.

0 Updated 1 week ago

AI & Automation Solid

site-crawlability

When the user wants to improve crawlability, fix orphan pages, or optimize site structure for search engines. Also use when the user mentions "crawlability," "crawl budget," "orphan pages," "internal links," "site structure," "site crawlability," "infinite scroll," "pagination," "masonry SEO," "AI crawler optimization," "GPTBot crawlability," "ClaudeBot crawlability," or "content not indexed." For internal links, use internal-links.

553 Updated 3 weeks ago

AI & Automation Listed

crawl4ai

This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.

335 Updated today