web-scraping

Solid

Web scraping with anti-bot bypass, content extraction, undocumented APIs and poison pill detection. Use when extracting content from websites, handling paywalls, implementing scraping cascades or processing social media. Covers requests, trafilatura, Playwright with stealth mode, yt-dlp and instaloader patterns.

Data & Documents 233 stars 44 forks Updated today MIT

Install

View on GitHub

Quality Score: 89/100

Stars 20%
79
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# Web scraping methodology Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling. ## Scraping cascade architecture Implement multiple extraction strategies with automatic fallback: ```python from abc import ABC, abstractmethod from typing import Optional import requests from bs4 import BeautifulSoup import trafilatura #for .py files from playwright.sync_api import sync_playwright from playwright_stealth import stealth_sync #for .ipynb files import asyncio from playwright.async_api import async_playwright class ScrapingResult: def __init__(self, content: str, title: str, method: str): self.content = content self.title = title self.method = method # Track which method succeeded class Scraper(ABC): @abstractmethod def fetch(self, url: str) -> Optional[ScrapingResult]: ... class TrafilaturaCscraper(Scraper): """Fast, lightweight extraction for standard articles.""" def fetch(self, url: str) -> Optional[ScrapingResult]: try: downloaded = trafilatura.fetch_url(url) if not downloaded: return None content = trafilatura.extract( downloaded, include_comments=False, include_tables=True, favor_recall=True ) if not content or len(content) < 100: return None # Extract title separately soup = BeautifulSoup(downlo...

Details

Author
jamditis
Repository
jamditis/claude-skills-journalism
Created
5 months ago
Last Updated
today
Language
HTML
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category