SKILL.md

BeautifulSoup HTML Parsing

You are an expert in BeautifulSoup, Python HTML/XML parsing, DOM navigation, and building efficient data extraction pipelines for web scraping.

Core Expertise

BeautifulSoup API and parsing methods

CSS selectors and find methods

DOM traversal and navigation

HTML/XML parsing with different parsers

Integration with requests library

Handling malformed HTML gracefully

Data extraction patterns and best practices

Memory-efficient processing

Key Principles

Write concise, technical code with accurate Python examples

Prioritize readability, efficiency, and maintainability

Use modular, reusable functions for common extraction tasks

Handle missing data gracefully with proper defaults

Follow PEP 8 style guidelines

Implement proper error handling for robust scraping

Basic Setup

pip install beautifulsoup4 requests lxml

Loading HTML

from bs4 import BeautifulSoup

import requests

# From string

html = '<html><body><h1>Hello</h1></body></html>'

soup = BeautifulSoup(html, 'lxml')

# From file

with open('page.html', 'r', encoding='utf-8') as f:

    soup = BeautifulSoup(f, 'lxml')

# From URL

response = requests.get('https://example.com')

soup = BeautifulSoup(response.content, 'lxml')

Parser Options

# lxml - Fast, lenient (recommended)

soup = BeautifulSoup(html, 'lxml')

# html.parser - Built-in, no dependencies

soup = BeautifulSoup(html, 'html.parser')

# html5lib - Most lenient, slowest

soup = BeautifulSoup(html, 'html5lib')

# lxml-xml - For XML documents

soup = BeautifulSoup(xml, 'lxml-xml')

Finding Elements

By Tag

# First matching element

soup.find('h1')

# All matching elements

soup.find_all('p')

# Shorthand

soup.h1  # Same as soup.find('h1')

By Attributes

# By class

soup.find('div', class_='article')

soup.find_all('div', class_='article')

# By ID

soup.find(id='main-content')

# By any attribute

soup.find('a', href='https://example.com')

soup.find_all('input', attrs={'type': 'text', 'name': 'email'})

# By data attributes

soup.find('div', attrs={'data-id': '123'})

CSS Selectors

# Single element

soup.select_one('div.article > h2')

# Multiple elements

soup.select('div.article h2')

# Complex selectors

soup.select('a[href^="https://"]')  # Starts with

soup.select('a[href$=".pdf"]')      # Ends with

soup.select('a[href*="example"]')   # Contains

soup.select('li:nth-child(2)')

soup.select('h1, h2, h3')           # Multiple

With Functions

import re

# By regex

soup.find_all('a', href=re.compile(r'^https://'))

# By function

def has_data_attr(tag):

    return tag.has_attr('data-id')

soup.find_all(has_data_attr)

# String matching

soup.find_all(string='exact text')

soup.find_all(string=re.compile('pattern'))

Extracting Data

Text Content

# Get text

element.text

element.get_text()

# Get text with separator

element.get_text(separator=' ')

# Get stripped text

element.get_text(strip=True)

# Get strings (generator)

for string in element.stripped_strings:

    print(string)

Attributes

# Get attribute

element['href']

element.get('href')  # Returns None if missing

element.get('href', 'default')  # With default

# Get all attributes

element.attrs  # Returns dict

# Check attribute exists

element.has_attr('class')

HTML Content

# Inner HTML

str(element)

# Just the tag

element.name

# Prettified HTML

element.prettify()

DOM Navigation

Parent/Ancestors

element.parent

element.parents  # Generator of all ancestors

# Find specific ancestor

for parent in element.parents:

    if parent.name == 'div' and 'article' in parent.get('class', []):

        break

Children

element.children      # Direct children (generator)

list(element.children)

element.contents      # Direct children (list)

element.descendants   # All descendants (generator)

# Find in children

element.find('span')  # Searches descendants

Siblings

element.next_sibling

element.previous_sibling

element.next_siblings      # Generator

element.previous_siblings  # Generator

# Next/previous element (skips whitespace)

element.next_element

element.previous_element

Data Extraction Patterns

Safe Extraction

def safe_text(element, selector, default=''):

    """Safely extract text from element."""

    found = element.select_one(selector)

    return found.get_text(strip=True) if found else default

def safe_attr(element, selector, attr, default=None):

    """Safely extract attribute from element."""

    found = element.select_one(selector)

    return found.get(attr, default) if found else default

Table Extraction

def extract_table(table):

    """Extract table data as list of dictionaries."""

    headers = [th.get_text(strip=True) for th in table.select('th')]

    rows = []

    for tr in table.select('tbody tr'):

        cells = [td.get_text(strip=True) for td in tr.select('td')]

        if cells:

            rows.append(dict(zip(headers, cells)))

    return rows

List Extraction

def extract_items(soup, selector, extractor):

    """Extract multiple items using a custom extractor function."""

    return [extractor(item) for item in soup.select(selector)]

# Usage

def extract_product(item):

    return {

        'name': safe_text(item, '.name'),

        'price': safe_text(item, '.price'),

        'url': safe_attr(item, 'a', 'href')

    }

products = extract_items(soup, '.product', extract_product)

URL Resolution

from urllib.parse import urljoin

def resolve_url(base_url, relative_url):

    """Convert relative URL to absolute."""

    if not relative_url:

        return None

    return urljoin(base_url, relative_url)

# Usage

base_url = 'https://example.com/products/'

for link in soup.select('a'):

    href = link.get('href')

    absolute_url = resolve_url(base_url, href)

    print(absolute_url)

Handling Malformed HTML

# lxml parser is lenient with malformed HTML

soup = BeautifulSoup(malformed_html, 'lxml')

# For very broken HTML, use html5lib

soup = BeautifulSoup(very_broken_html, 'html5lib')

# Handle encoding issues

response = requests.get(url)

response.encoding = response.apparent_encoding

soup = BeautifulSoup(response.text, 'lxml')

Complete Scraping Example

import requests

from bs4 import BeautifulSoup

from urllib.parse import urljoin

import time

class ProductScraper:

    def __init__(self, base_url):

        self.base_url = base_url

        self.session = requests.Session()

        self.session.headers.update({

            'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'

        })

    def fetch_page(self, url):

        """Fetch and parse a page."""

        response = self.session.get(url, timeout=30)

        response.raise_for_status()

        return BeautifulSoup(response.content, 'lxml')

    def extract_product(self, item):

        """Extract product data from a card element."""

        return {

            'name': self._safe_text(item, '.product-title'),

            'price': self._parse_price(item.select_one('.price')),

            'rating': self._safe_attr(item, '.rating', 'data-rating'),

            'image': self._resolve(self._safe_attr(item, 'img', 'src')),

            'url': self._resolve(self._safe_attr(item, 'a', 'href')),

            'in_stock': not item.select_one('.out-of-stock')

        }

    def scrape_products(self, url):

        """Scrape all products from a page."""

        soup = self.fetch_page(url)

        items = soup.select('.product-card')

        return [self.extract_product(item) for item in items]

    def _safe_text(self, element, selector, default=''):

        found = element.select_one(selector)

        return found.get_text(strip=True) if found else default

    def _safe_attr(self, element, selector, attr, default=None):

        found = element.select_one(selector)

        return found.get(attr, default) if found else default

    def _parse_price(self, element):

        if not element:

            return None

        text = element.get_text(strip=True)

        try:

            return float(text.replace('$', '').replace(',', ''))

        except ValueError:

            return None

    def _resolve(self, url):

        return urljoin(self.base_url, url) if url else None

# Usage

scraper = ProductScraper('https://example.com')

products = scraper.scrape_products('https://example.com/products')

for product in products:

    print(product)

Performance Optimization

# Use SoupStrainer to parse only needed elements

from bs4 import SoupStrainer

only_articles = SoupStrainer('article')

soup = BeautifulSoup(html, 'lxml', parse_only=only_articles)

# Use lxml parser for speed

soup = BeautifulSoup(html, 'lxml')  # Fastest

# Decompose unneeded elements

for script in soup.find_all('script'):

    script.decompose()

# Use generators for memory efficiency

for item in soup.select('.item'):

    yield extract_data(item)

Key Dependencies

beautifulsoup4

lxml (fast parser)

html5lib (lenient parser)

requests

pandas (for data output)

Best Practices

Always use lxml parser for best performance

Handle missing elements with default values

Use select() and select_one() for CSS selectors

Use get_text(strip=True) for clean text extraction

Resolve relative URLs to absolute

Validate extracted data types

Implement rate limiting between requests

Use proper User-Agent headers

Handle character encoding properly

Use SoupStrainer for large documents

Follow robots.txt and website terms of service

Implement retry logic for failed requests

beautifulsoup-parsing

SKILL.md

BeautifulSoup HTML Parsing

Core Expertise

Key Principles

Basic Setup

Loading HTML

Parser Options

Finding Elements

By Tag

By Attributes

CSS Selectors

With Functions

Extracting Data

Text Content

Attributes

HTML Content

DOM Navigation

Parent/Ancestors

Children

Siblings

Data Extraction Patterns

Safe Extraction

Table Extraction

List Extraction

URL Resolution

Handling Malformed HTML

Complete Scraping Example

Performance Optimization

Key Dependencies

Best Practices

Stop writing automation&scrapers

beautifulsoup-parsing

SKILL.md

BeautifulSoup HTML Parsing

Core Expertise

Key Principles

Basic Setup

Loading HTML

Parser Options

Finding Elements

By Tag

By Attributes

CSS Selectors

With Functions

Extracting Data

Text Content

Attributes

HTML Content

DOM Navigation

Parent/Ancestors

Children

Siblings

Data Extraction Patterns

Safe Extraction

Table Extraction

List Extraction

URL Resolution

Handling Malformed HTML

Complete Scraping Example

Performance Optimization

Key Dependencies

Best Practices

Let your agent run on any real-world website

Related skills

Stop writing automation&scrapers