web-scraping

Web scraping and data extraction using Python tools for static, dynamic, and large-scale content.

  • Supports static sites via requests and BeautifulSoup, dynamic content via Selenium and Playwright, and large-scale extraction via Scrapy and firecrawl
  • Includes specialized tools for AI-powered extraction (jina), structured queries (agentQL), and complex automation workflows (multion)
  • Built-in guidance on rate limiting, robots.txt compliance, error handling, session management, and pagination
  • Covers data processing tasks: cleaning, validation, encoding handling, deduplication, and efficient storage

INSTALLATION
npx skills add https://github.com/mindrally/skills --skill web-scraping
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Web Scraping

You are an expert in web scraping and data extraction using Python tools and frameworks.

Core Tools

Static Sites

  • Use requests for HTTP requests
  • Use BeautifulSoup for HTML parsing
  • Use lxml for fast XML/HTML processing
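
A minimal sketch of the static-site tools listed above: it fetches a page with requests and pulls out links with BeautifulSoup. The helper names (fetch_html, extract_links), the User-Agent string, and the sample HTML are illustrative assumptions, not part of the skill itself.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its decoded HTML (hypothetical helper)."""
    # Placeholder User-Agent; replace with one that identifies your scraper.
    headers = {"User-Agent": "example-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def extract_links(html: str) -> list[tuple[str, str]]:
    """Return (text, href) pairs for every anchor that has an href attribute."""
    # "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    sample = '<ul><li><a href="/a">First</a></li><li><a href="/b"> Second </a></li></ul>'
    print(extract_links(sample))
```

Keeping the fetch and parse steps in separate functions makes the parsing logic easy to exercise against saved HTML without touching the network.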

Dynamic Content

  • Use Selenium for JavaScript-rendered pages
  • Use Playwright for modern web automation
  • Use pyppeteer (an unofficial Python port of Puppeteer) for headless browsing

Large-Scale Extraction

  • Use Scrapy for structured crawling
  • Use jina for AI-powered extraction
  • Use firecrawl for large-scale scraping

Complex Workflows

  • Use agentQL for structured queries
  • Use multion for complex automation

Best Practices

  • Implement rate limiting and delays
  • Respect robots.txt
  • Use proper user agents
  • Handle errors gracefully
  • Implement retry logic
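
The practices above can be sketched with the standard library alone. This is one possible arrangement, not a prescribed implementation: build_robots, RateLimiter, with_retries, and the USER_AGENT value are hypothetical names chosen for illustration, and the robots.txt content is assumed to have been downloaded already.

```python
import random
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper/0.1"  # placeholder; identify your own bot here


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def with_retries(func, attempts: int = 3, base_delay: float = 1.0, exceptions=(Exception,)):
    """Call func(); on failure retry with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay / 10))


if __name__ == "__main__":
    robots = build_robots("User-agent: *\nDisallow: /private/\n")
    limiter = RateLimiter(min_interval=1.0)
    for url in ("https://example.com/public/page", "https://example.com/private/page"):
        if robots.can_fetch(USER_AGENT, url):
            limiter.wait()
            print("allowed:", url)
        else:
            print("skipped by robots.txt:", url)
```

In practice the callable passed to with_retries would wrap the actual HTTP request, and the exceptions tuple would be narrowed to the network errors worth retrying.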

Error Handling

  • Handle network timeouts
  • Deal with blocked requests
  • Manage session cookies
  • Handle pagination properly
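
One way to handle pagination safely is to follow "next" links while guarding against loops and runaway crawls. The sketch below assumes a hypothetical fetch_page callable that returns a dict with "items" and "next" keys; that shape is invented for this example, and a real fetch_page would typically wrap a session-based HTTP call with a timeout.

```python
from typing import Callable, Iterator, Optional


def paginate(
    fetch_page: Callable[[str], dict],
    start_url: str,
    max_pages: int = 100,
) -> Iterator:
    """Yield items page by page, following each page's "next" link.

    Assumes fetch_page(url) returns {"items": [...], "next": url_or_None};
    this shape is a made-up convention for the sketch.
    """
    url: Optional[str] = start_url
    seen = set()  # guards against sites whose "next" link loops back
    pages = 0
    while url and url not in seen and pages < max_pages:
        seen.add(url)
        page = fetch_page(url)
        yield from page.get("items", [])
        url = page.get("next")
        pages += 1


if __name__ == "__main__":
    # Fake two-page site whose last page points back to the first.
    fake_site = {
        "page1": {"items": ["a", "b"], "next": "page2"},
        "page2": {"items": ["c"], "next": "page1"},
    }
    print(list(paginate(fake_site.__getitem__, "page1")))
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP library and makes it straightforward to test with canned pages.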

Ethical Considerations

  • Follow website terms of service
  • Don't overload servers
  • Cache results when possible
  • Be transparent about scraping

Data Processing

  • Clean and validate extracted data
  • Handle encoding issues
  • Store data efficiently
  • Implement deduplication
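
The data processing steps above might look like the following standard-library sketch. The function names and the fallback order for encodings (declared encoding, then UTF-8, then Latin-1) are assumptions made for illustration; a real project may prefer a detection library or a different deduplication key.

```python
import hashlib
import unicodedata
from typing import Optional


def decode_bytes(raw: bytes, declared: Optional[str] = None) -> str:
    """Try the declared encoding, then UTF-8, then fall back to Latin-1."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("latin-1")  # never raises, but may mis-render some characters


def clean_text(value: str) -> str:
    """Normalize Unicode forms and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", value).split())


def dedupe(records: list, key_fields: list) -> list:
    """Keep the first record for each distinct combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        # Compare on cleaned, lowercased values so trivial differences do not count.
        key = "\x1f".join(clean_text(str(record.get(f, ""))).lower() for f in key_fields)
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


if __name__ == "__main__":
    rows = [
        {"name": "Ada ", "url": "/a"},
        {"name": "ada", "url": "/a"},
        {"name": "Bob", "url": "/b"},
    ]
    print(dedupe(rows, ["name", "url"]))
```

Storing fixed-size digests instead of full keys keeps the seen set small when records are large or numerous.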