web-scraping

Name: web-scraping
Author: mindrally

Web scraping and data extraction using Python tools for static, dynamic, and large-scale content.

Supports static sites via requests and BeautifulSoup, dynamic content via Selenium and Playwright, and large-scale extraction via Scrapy and firecrawl.

Includes specialized tools for AI-powered extraction (jina), structured queries (agentQL), and complex automation workflows (multion).

Built-in guidance on rate limiting, robots.txt compliance, error handling, session management, and pagination.

Covers data processing tasks: cleaning, validation, encoding handling, deduplication, and efficient storage.

INSTALLATION

npx skills add https://github.com/mindrally/skills --skill web-scraping

Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Web Scraping

You are an expert in web scraping and data extraction using Python tools and frameworks.

Core Tools

Static Sites

Use requests for HTTP requests

Use BeautifulSoup for HTML parsing

Use lxml for fast XML/HTML processing
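
As a rough illustration of how the static-site tools fit together, the sketch below fetches a page with requests and parses it with BeautifulSoup. The function names (`fetch_html`, `extract_links`), the User-Agent string, and the example URL are assumptions made for this example, not part of the skill.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical identity string; replace with your own project and contact.
USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its decoded HTML, raising on HTTP errors."""
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_links(html: str, base_url: str) -> list[dict]:
    """Return the text and absolute URL of every <a href> in the document."""
    # "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        links.append({
            "text": anchor.get_text(strip=True),
            "url": urljoin(base_url, anchor["href"]),
        })
    return links
```

A typical call would be `extract_links(fetch_html("https://example.com/"), "https://example.com/")`. Keeping the download and the parsing in separate functions lets the parser be exercised on saved HTML without touching the network.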

Dynamic Content

Use Selenium for JavaScript-rendered pages

Use Playwright for modern web automation

Use pyppeteer (an unofficial Python port of Puppeteer) for headless browsing
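
For JavaScript-rendered pages, one possible approach is sketched below with Playwright's synchronous API. It assumes Playwright and a Chromium build are installed (`pip install playwright`, then `playwright install chromium`). The function name `render_page` and its parameters are illustrative only.

```python
def render_page(url: str, wait_for: str, timeout_ms: int = 15000) -> str:
    """Load a JavaScript-rendered page in headless Chromium and return its HTML."""
    # Imported lazily so this module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            # Wait for the element that the page's scripts are expected to add.
            page.wait_for_selector(wait_for, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned HTML can then be handed to any static parser. Waiting on a specific selector is usually more reliable than a fixed sleep, since it returns as soon as the content exists.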

Large-Scale Extraction

Use Scrapy for structured crawling

Use jina for AI-powered extraction

Use firecrawl for large-scale scraping

Complex Workflows

Use agentQL for structured queries

Use multion for complex automation

Best Practices

Implement rate limiting and delays

Respect robots.txt

Send a descriptive User-Agent header that identifies your scraper

Handle errors gracefully

Implement retry logic
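
The practices above can be sketched with the standard library alone. The helper names (`is_allowed`, `RateLimiter`, `retry`) and the User-Agent string are assumptions for illustration. The backoff schedule is one reasonable choice, not a prescribed one.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identity string; use one that names your project and contact.
USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self) -> None:
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def retry(func, attempts: int = 3, base_delay: float = 1.0, retry_on=(Exception,)):
    """Call func(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay / 2))
```

In a crawl loop, one would typically call `limiter.wait()` before each request and wrap the request itself in `retry(...)`, passing only transient exception types in `retry_on` so that permanent failures surface immediately.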

Error Handling

Handle network timeouts

Detect blocked requests (for example HTTP 403 or 429 responses) and back off

Manage session cookies

Handle pagination properly
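
One way these error-handling points might look with requests is sketched below. The `BlockedError` class, the function names, and the JSON shape (`items` and `next` keys) are hypothetical. A real site may paginate through HTML links or query parameters instead.

```python
import requests


class BlockedError(RuntimeError):
    """Raised when the server signals that the client is being refused."""


def fetch(session: requests.Session, url: str) -> requests.Response:
    """GET a URL with explicit timeouts and turn block responses into errors."""
    # (connect, read) timeouts in seconds; requests applies none by default.
    response = session.get(url, timeout=(5, 30))
    if response.status_code in (403, 429):
        retry_after = response.headers.get("Retry-After", "unknown")
        raise BlockedError(
            f"{url} returned {response.status_code}; Retry-After: {retry_after}"
        )
    response.raise_for_status()
    return response


def crawl_pages(session, start_url, max_pages=50):
    """Follow a hypothetical JSON API's "next" links, yielding each page's items."""
    url, seen = start_url, set()
    # The seen set guards against pagination loops; max_pages caps the crawl.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        payload = fetch(session, url).json()
        yield from payload.get("items", [])
        url = payload.get("next")  # None or missing on the last page
```

Passing a single `requests.Session()` through every call keeps cookies set by earlier responses, which is what session management amounts to for most sites. Callers can catch `requests.Timeout` and `BlockedError` separately, since a timeout usually merits a retry while a block merits a longer pause.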

Ethical Considerations

Follow website terms of service

Don't overload servers

Cache results when possible

Be transparent about scraping
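
Caching is the most mechanical of these points, so a small sketch may help. It stores each response body on disk, keyed by a hash of the URL. The class name `PageCache`, the JSON file layout, and the one-hour default are assumptions for this example.

```python
import hashlib
import json
import time
from pathlib import Path


class PageCache:
    """Store fetched pages on disk so repeat runs do not re-request them."""

    def __init__(self, directory: str, max_age: float = 3600.0):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.max_age = max_age  # seconds before a cached entry counts as stale

    def _path(self, url: str) -> Path:
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_fetch(self, url: str, fetch) -> str:
        """Return the cached body for url, calling fetch(url) only when stale."""
        path = self._path(url)
        if path.exists():
            entry = json.loads(path.read_text(encoding="utf-8"))
            if time.time() - entry["fetched_at"] < self.max_age:
                return entry["body"]
        body = fetch(url)
        entry = {"url": url, "fetched_at": time.time(), "body": body}
        path.write_text(json.dumps(entry), encoding="utf-8")
        return body
```

Here `fetch` is any callable that takes a URL and returns text, so the cache sits in front of whichever HTTP client is in use. Each page is requested at most once per `max_age` window, which also reduces load on the server.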

Data Processing

Clean and validate extracted data

Handle encoding issues

Store data efficiently

Implement deduplication
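
The four data-processing steps can be sketched as a short standard-library pipeline. The record shape (`title` and `url` keys), the validation rule, the fingerprint recipe, and the SQLite schema are all assumptions made for illustration.

```python
import hashlib
import sqlite3
import unicodedata
from typing import Optional


def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode bytes using the declared charset, then UTF-8, then Latin-1."""
    for encoding in (declared, "utf-8"):
        if encoding:
            try:
                return raw.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                continue
    # Latin-1 maps every byte, so this never fails (but may mis-render text).
    return raw.decode("latin-1")


def clean_text(value: str) -> str:
    """Normalize Unicode and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", value).split())


def validate(record: dict) -> bool:
    """Accept records with a non-empty title and an http(s) URL."""
    url = str(record.get("url", ""))
    return bool(record.get("title")) and url.startswith(("http://", "https://"))


def store(conn: sqlite3.Connection, records) -> int:
    """Insert cleaned, valid records, skipping duplicates; return rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(fingerprint TEXT PRIMARY KEY, title TEXT, url TEXT)"
    )
    added = 0
    for raw in records:
        record = {
            "title": clean_text(raw.get("title") or ""),
            "url": (raw.get("url") or "").strip(),
        }
        if not validate(record):
            continue
        # Case-insensitive title plus URL identifies a duplicate record.
        key = f"{record['title'].lower()}|{record['url']}"
        fingerprint = hashlib.sha256(key.encode("utf-8")).hexdigest()
        cursor = conn.execute(
            "INSERT OR IGNORE INTO items VALUES (?, ?, ?)",
            (fingerprint, record["title"], record["url"]),
        )
        added += cursor.rowcount  # 1 if inserted, 0 if the key already existed
    conn.commit()
    return added
```

Because the fingerprint is the table's primary key, deduplication holds across runs as well as within one batch. Swapping `:memory:` for a file path in `sqlite3.connect` makes the store persistent.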