SKILL.md

Firecrawl Web Scraper Skill

Status: Production Ready

Last Updated: 2026-01-20

Official Docs: https://docs.firecrawl.dev

API Version: v2

SDK Versions: firecrawl-py 4.13.0+, @mendable/firecrawl-js 4.11.1+

What is Firecrawl?

Firecrawl is a Web Data API for AI that turns websites into LLM-ready markdown or structured data. It handles:

JavaScript rendering - Executes client-side JavaScript to capture dynamic content

Anti-bot bypass - Gets past CAPTCHA and bot detection systems

Format conversion - Outputs as markdown, HTML, JSON, screenshots, summaries

Document parsing - Processes PDFs, DOCX files, and images

Autonomous agents - AI-powered web data gathering without URLs

Change tracking - Monitor content changes over time

Branding extraction - Extract color schemes, typography, logos

API Endpoints Overview

| Endpoint | Purpose | Use Case |
|---|---|---|
| /scrape | Single page | Extract article, product page |
| /crawl | Full site | Index docs, archive sites |
| /map | URL discovery | Find all pages, plan strategy |
| /search | Web search + scrape | Research with live data |
| /extract | Structured data | Product prices, contacts |
| /agent | Autonomous gathering | No URLs needed, AI navigates |
| /batch-scrape | Multiple URLs | Bulk processing |

1. Scrape Endpoint ( /v2/scrape )

Scrapes a single webpage and returns clean, structured content.

Basic Usage

from firecrawl import Firecrawl

import os

app = Firecrawl(api_key=os.environ.get("FIRECRAWL_API_KEY"))

# Basic scrape

doc = app.scrape(

    url="https://example.com/article",

    formats=["markdown", "html"],

    only_main_content=True

)

print(doc.markdown)

print(doc.metadata)

import FirecrawlApp from '@mendable/firecrawl-js';

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

const result = await app.scrape('https://example.com/article', {

  formats: ['markdown', 'html'],

  onlyMainContent: true

});

console.log(result.markdown);

Output Formats

| Format | Description |
|---|---|
| markdown | LLM-optimized content |
| html | Full HTML |
| rawHtml | Unprocessed HTML |
| screenshot | Page capture (with viewport options) |
| links | All URLs on page |
| json | Structured data extraction |
| summary | AI-generated summary |
| branding | Design system data |
| changeTracking | Content change detection |

Advanced Options

doc = app.scrape(

    url="https://example.com",

    formats=["markdown", "screenshot"],

    only_main_content=True,

    remove_base64_images=True,

    wait_for=5000,  # Wait 5s for JS

    timeout=30000,

    # Location & language

    location={"country": "AU", "languages": ["en-AU"]},

    # Cache control

    max_age=0,  # Fresh content (no cache)

    store_in_cache=True,

    # Stealth proxy for complex sites (5 credits per request)

    proxy="stealth",

    # Custom headers

    headers={"User-Agent": "Custom Bot 1.0"}

)

Browser Actions

Perform interactions before scraping:

doc = app.scrape(

    url="https://example.com",

    actions=[

        {"type": "click", "selector": "button.load-more"},

        {"type": "wait", "milliseconds": 2000},

        {"type": "scroll", "direction": "down"},

        {"type": "write", "selector": "input#search", "text": "query"},

        {"type": "press", "key": "Enter"},

        {"type": "screenshot"}  # Capture state mid-action

    ]

)

JSON Mode (Structured Extraction)

# With schema

doc = app.scrape(

    url="https://example.com/product",

    formats=[{

        "type": "json",

        "schema": {

            "type": "object",

            "properties": {

                "title": {"type": "string"},

                "price": {"type": "number"},

                "in_stock": {"type": "boolean"}

            }

        }

    }]

)

# Without schema (prompt-only)

doc = app.scrape(

    url="https://example.com/product",

    formats=[{

        "type": "json",

        "prompt": "Extract the product name, price, and availability"

    }]

)

Branding Extraction

Extract design system and brand identity:

doc = app.scrape(

    url="https://example.com",

    formats=["branding"]

)

# Returns:

# - Color schemes and palettes

# - Typography (fonts, sizes, weights)

# - Spacing and layout metrics

# - UI component styles

# - Logo and imagery URLs

# - Brand personality traits

2. Crawl Endpoint ( /v2/crawl )

Crawls all accessible pages from a starting URL.

result = app.crawl(

    url="https://docs.example.com",

    limit=100,

    max_discovery_depth=3,

    allowed_domains=["docs.example.com"],

    exclude_paths=["/api/*", "/admin/*"],

    scrape_options={

        "formats": ["markdown"],

        "only_main_content": True

    }

)

for page in result.data:

    print(f"Scraped: {page.metadata.source_url}")

    print(f"Content: {page.markdown[:200]}...")

Async Crawl with Webhooks

# Start crawl (returns immediately)

job = app.start_crawl(

    url="https://docs.example.com",

    limit=1000,

    webhook="https://your-domain.com/webhook"

)

print(f"Job ID: {job.id}")

# Or poll for status

status = app.get_crawl_status(job.id)

3. Map Endpoint ( /v2/map )

Rapidly discover all URLs on a website without scraping content.

urls = app.map(url="https://example.com")

print(f"Found {len(urls)} pages")

for url in urls[:10]:

    print(url)

Use for: sitemap discovery, crawl planning, website audits.

4. Search Endpoint ( /search ) - NEW

Perform web searches and optionally scrape the results in one operation.

# Basic search

results = app.search(

    query="best practices for React server components",

    limit=10

)

for result in results:

    print(f"{result.title}: {result.url}")

# Search + scrape results

results = app.search(

    query="React server components tutorial",

    limit=5,

    scrape_options={

        "formats": ["markdown"],

        "only_main_content": True

    }

)

for result in results:

    print(f"{result.title}")

    print(result.markdown[:500])

Search Options

results = app.search(

    query="machine learning papers",

    limit=20,

    # Filter by source type

    sources=["web", "news", "images"],

    # Filter by category

    categories=["github", "research", "pdf"],

    # Location

    location={"country": "US"},

    # Time filter

    tbs="qdr:m",  # Past month (qdr:h=hour, qdr:d=day, qdr:w=week, qdr:y=year)

    timeout=30000

)

Cost: 2 credits per 10 results + scraping costs if enabled.
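The `tbs` codes in the Search Options comment are easy to mistype. A minimal sketch of a lookup helper follows; the period names and the function name `tbs_for` are hypothetical conveniences, while the `qdr:*` codes themselves are the ones listed above.

```python
# Time-range codes as listed in the Search Options comment
TBS_CODES = {
    "hour": "qdr:h",
    "day": "qdr:d",
    "week": "qdr:w",
    "month": "qdr:m",
    "year": "qdr:y",
}


def tbs_for(period: str) -> str:
    """Return the tbs code for a named period (hypothetical helper, not SDK API)."""
    key = period.strip().lower()
    if key not in TBS_CODES:
        raise ValueError(f"unknown period {period!r}; expected one of {sorted(TBS_CODES)}")
    return TBS_CODES[key]
```

The result can then be passed as the `tbs` argument of `app.search(...)`, for example `tbs=tbs_for("month")`.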

5. Extract Endpoint ( /v2/extract )

AI-powered structured data extraction from single pages, multiple pages, or entire domains.

Single Page

from pydantic import BaseModel

class Product(BaseModel):

    name: str

    price: float

    description: str

    in_stock: bool

result = app.extract(

    urls=["https://example.com/product"],

    schema=Product,

    system_prompt="Extract product information"

)

print(result.data)

Multi-Page / Domain Extraction

# Extract from entire domain using wildcard

result = app.extract(

    urls=["example.com/*"],  # All pages on domain

    schema=Product,

    system_prompt="Extract all products"

)

# Enable web search for additional context

result = app.extract(

    urls=["example.com/products"],

    schema=Product,

    enable_web_search=True  # Follow external links

)

Prompt-Only Extraction (No Schema)

result = app.extract(

    urls=["https://example.com/about"],

    prompt="Extract the company name, founding year, and key executives"

)

# LLM determines output structure

6. Agent Endpoint ( /agent ) - NEW

Autonomous web data gathering without requiring specific URLs. The agent searches, navigates, and gathers data using natural language prompts.

# Basic agent usage

result = app.agent(

    prompt="Find the pricing plans for the top 3 headless CMS platforms and compare their features"

)

print(result.data)

# With schema for structured output

from pydantic import BaseModel

from typing import List

class CMSPricing(BaseModel):

    name: str

    free_tier: bool

    starter_price: float

    features: List[str]

result = app.agent(

    prompt="Find pricing for Contentful, Sanity, and Strapi",

    schema=CMSPricing

)

# Optional: focus on specific URLs

result = app.agent(

    prompt="Extract the enterprise pricing details",

    urls=["https://contentful.com/pricing", "https://sanity.io/pricing"]

)

Agent Models

| Model | Best For | Cost |
|---|---|---|
| spark-1-mini (default) | Simple extractions, high volume | Standard |
| spark-1-pro | Complex analysis, ambiguous data | 60% more |

result = app.agent(

    prompt="Analyze competitive positioning...",

    model="spark-1-pro"  # For complex tasks

)

Async Agent

# Start agent (returns immediately)

job = app.start_agent(

    prompt="Research market trends..."

)

# Poll for results

status = app.check_agent_status(job.id)

if status.status == "completed":

    print(status.data)

Note: Agent is in Research Preview. 5 free daily requests, then credit-based billing.

7. Batch Scrape - NEW

Process multiple URLs efficiently in a single operation.

Synchronous (waits for completion)

results = app.batch_scrape(

    urls=[

        "https://example.com/page1",

        "https://example.com/page2",

        "https://example.com/page3"

    ],

    formats=["markdown"],

    only_main_content=True

)

for page in results.data:

    print(f"{page.metadata.source_url}: {len(page.markdown)} chars")

Asynchronous (with webhooks)

job = app.start_batch_scrape(

    urls=url_list,

    formats=["markdown"],

    webhook="https://your-domain.com/webhook"

)

# Webhook receives events: started, page, completed, failed

const job = await app.startBatchScrape(urls, {

  formats: ['markdown'],

  webhook: 'https://your-domain.com/webhook'

});

// Poll for status

const status = await app.getBatchScrapeStatus(job.id);
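On the receiving side, the webhook events mentioned above (started, page, completed, failed) can be folded into a small state tracker. This is a sketch under assumptions: the payload keys `type`, `data`, and `error`, and the possibility of a prefixed event name such as `batch_scrape.page`, are guesses about the payload shape and should be checked against the webhook documentation. `BatchProgress` is a hypothetical class, not part of the SDK.

```python
from typing import Any, Dict, List, Optional


class BatchProgress:
    """Accumulates webhook events for one batch scrape job (hypothetical helper)."""

    def __init__(self) -> None:
        self.state = "pending"
        self.pages: List[Dict[str, Any]] = []
        self.error: Optional[str] = None

    def handle(self, event: Dict[str, Any]) -> None:
        # ASSUMPTION: the event name lives in event["type"], possibly with a
        # prefix such as "batch_scrape."; only the last segment is used.
        kind = str(event.get("type", "")).rsplit(".", 1)[-1]
        if kind == "started":
            self.state = "running"
        elif kind == "page":
            # ASSUMPTION: scraped pages arrive as a list under "data"
            self.pages.extend(event.get("data") or [])
        elif kind == "completed":
            self.state = "completed"
        elif kind == "failed":
            self.state = "failed"
            self.error = event.get("error")
        else:
            raise ValueError(f"unknown webhook event: {kind!r}")
```

A web framework handler would parse the request body as JSON and pass the resulting dict to `handle`.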

8. Change Tracking - NEW

Monitor content changes over time by comparing scrapes.

# Enable change tracking

doc = app.scrape(

    url="https://example.com/pricing",

    formats=["markdown", "changeTracking"]

)

# Response includes:

print(doc.change_tracking.status)  # new, same, changed, removed

print(doc.change_tracking.previous_scrape_at)

print(doc.change_tracking.visibility)  # visible, hidden

Comparison Modes

# Git-diff mode (default)

doc = app.scrape(

    url="https://example.com/docs",

    formats=["markdown", "changeTracking"],

    change_tracking_options={

        "mode": "diff"

    }

)

print(doc.change_tracking.diff)  # Line-by-line changes

# JSON mode (structured comparison)

doc = app.scrape(

    url="https://example.com/pricing",

    formats=["markdown", "changeTracking"],

    change_tracking_options={

        "mode": "json",

        "schema": {"type": "object", "properties": {"price": {"type": "number"}}}

    }

)

# Costs 5 credits per page

Change States:

new - Page not seen before

same - No changes since last scrape

changed - Content modified

removed - Page no longer accessible
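The four change states lend themselves to a simple triage step after a monitoring run. The sketch below works on plain `(url, status)` pairs so it does not depend on the SDK's response objects; both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# The four states listed above
KNOWN_STATES = ("new", "same", "changed", "removed")


def group_by_change_status(pages: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (url, status) pairs by change-tracking status."""
    groups: Dict[str, List[str]] = {state: [] for state in KNOWN_STATES}
    for url, status in pages:
        if status not in groups:
            raise ValueError(f"unexpected change status {status!r} for {url}")
        groups[status].append(url)
    return groups


def urls_needing_attention(groups: Dict[str, List[str]]) -> List[str]:
    """Everything except unchanged pages, in a stable order."""
    return groups["new"] + groups["changed"] + groups["removed"]
```

In practice the status for each URL would come from `doc.change_tracking.status` as shown earlier in this section.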

Authentication

# Get API key from https://www.firecrawl.dev/app

# Store in environment

FIRECRAWL_API_KEY=fc-your-api-key-here

Never hardcode API keys!
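A minimal sketch of loading the key from the environment and failing fast on an obviously malformed value. The `fc-` prefix check reflects the key format shown above; `load_firecrawl_key` is a hypothetical helper name, not part of the SDK.

```python
import os


def load_firecrawl_key(env_var: str = "FIRECRAWL_API_KEY") -> str:
    """Read the API key from the environment and sanity-check its format."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    if not key.startswith("fc-"):
        raise ValueError(f"{env_var} does not look like a Firecrawl key (expected 'fc-' prefix)")
    return key
```

The returned value can be passed as `api_key` when constructing the client, so a missing or mistyped key surfaces at startup rather than on the first request.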

Cloudflare Workers Integration

The Firecrawl SDK cannot run in Cloudflare Workers (requires Node.js). Use the REST API directly:

interface Env {

  FIRECRAWL_API_KEY: string;

}

export default {

  async fetch(request: Request, env: Env): Promise<Response> {

    const { url } = await request.json<{ url: string }>();

    const response = await fetch('https://api.firecrawl.dev/v2/scrape', {

      method: 'POST',

      headers: {

        'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,

        'Content-Type': 'application/json',

      },

      body: JSON.stringify({

        url,

        formats: ['markdown'],

        onlyMainContent: true

      })

    });

    const result = await response.json();

    return Response.json(result);

  }

};

Rate Limits & Pricing

Warning: Stealth Mode Pricing Change (May 2025)

Stealth mode now costs 5 credits per request when actively used. Default behavior uses "auto" mode which only charges stealth credits if basic fails.

Recommended pattern:

# Use auto mode (default) - only charges 5 credits if stealth is needed

doc = app.scrape(url, formats=["markdown"])

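# Or retry with stealth only after specific errors from a basic attempt

try:

    doc = app.scrape(url, formats=["markdown"], proxy="basic")

except Exception as e:

    if getattr(e, "status_code", None) not in [401, 403, 500]:

        raise

    doc = app.scrape(url, formats=["markdown"], proxy="stealth")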

Unified Billing (November 2025)

Credits and tokens have been merged into a single system. The Extract endpoint now uses credits (15 tokens = 1 credit).
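For converting historical token usage into the unified unit, the 15:1 ratio above is enough. The sketch below assumes partial credits round up, which this document does not confirm; the function name is hypothetical.

```python
import math

TOKENS_PER_CREDIT = 15  # ratio stated in the unified billing note


def tokens_to_credits(tokens: int) -> int:
    """Convert a token count to credits.

    ASSUMPTION: a partially used credit is billed as a whole credit.
    """
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return math.ceil(tokens / TOKENS_PER_CREDIT)
```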

Pricing Tiers

| Tier | Credits/Month | Notes |
|---|---|---|
| Free | 500 | Good for testing |
| Hobby | 3,000 | $19/month |
| Standard | 100,000 | $99/month |
| Growth | 500,000 | $399/month |

Credit Costs:

Scrape: 1 credit (basic), 5 credits (stealth)

Crawl: 1 credit per page

Search: 2 credits per 10 results

Extract: 5 credits per page (changed from tokens in v2.6.0)

Agent: Dynamic (complexity-based)

Change Tracking JSON mode: +5 credits
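The per-operation costs and tier allowances can be combined into a rough monthly estimate. This is a sketch only: it uses the figures quoted in this section, leaves out Agent usage because its cost is dynamic, and assumes search is billed per started block of 10 results. `Workload`, `estimate_credits`, and `smallest_tier` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Monthly credit allowances from the Pricing Tiers table
TIERS = [("Free", 500), ("Hobby", 3_000), ("Standard", 100_000), ("Growth", 500_000)]


@dataclass
class Workload:
    basic_scrapes: int = 0
    stealth_scrapes: int = 0
    crawl_pages: int = 0
    search_results: int = 0
    extract_pages: int = 0


def estimate_credits(w: Workload) -> int:
    """Estimate monthly credits from the Credit Costs list (Agent excluded)."""
    # ASSUMPTION: every started block of 10 search results costs 2 credits
    search = 2 * math.ceil(w.search_results / 10)
    return (
        w.basic_scrapes * 1
        + w.stealth_scrapes * 5
        + w.crawl_pages * 1
        + search
        + w.extract_pages * 5
    )


def smallest_tier(credits: int) -> Optional[str]:
    """Return the first tier whose allowance covers the estimate, or None."""
    for name, allowance in TIERS:
        if credits <= allowance:
            return name
    return None
```

For example, 100 basic scrapes, 10 stealth scrapes, 200 crawled pages, 25 search results, and 20 extracted pages come to 456 credits under these assumptions, which fits the Free tier.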

Common Issues & Solutions

| Issue | Cause | Solution |
|---|---|---|
| Empty content | JS not loaded | Add wait_for: 5000 or use actions |
| Rate limit exceeded | Over quota | Check dashboard, upgrade plan |
| Timeout error | Slow page | Increase timeout, use proxy: "stealth" |
| Bot detection | Anti-scraping | Use proxy: "stealth", add location |
| Invalid API key | Wrong format | Must start with fc- |

Known Issues Prevention

This skill prevents 10 documented issues:

Issue #1: Stealth Mode Pricing Change (May 2025)

Error: Unexpected credit costs when using stealth mode

Source: Stealth Mode Docs | Changelog

Why It Happens: Starting May 8th, 2025, Stealth Mode proxy requests cost 5 credits per request (previously included in standard pricing). This is a significant billing change.

Prevention: Use auto mode (default) which only charges stealth credits if basic fails

# RECOMMENDED: Use auto mode (default)

doc = app.scrape(url, formats=['markdown'])

# Auto retries with stealth (5 credits) only if basic fails

# Or conditionally enable based on error status

try:

    doc = app.scrape(url, formats=['markdown'], proxy='basic')

except Exception as e:

    # Re-raise anything that is not an access or server error

    if getattr(e, "status_code", None) not in [401, 403, 500]:

        raise

    doc = app.scrape(url, formats=['markdown'], proxy='stealth')

Stealth Mode Options:

auto (default): Charges 5 credits only if stealth succeeds after basic fails

basic: Standard proxies, 1 credit cost

stealth: 5 credits per request when actively used

Issue #2: v2.0.0 Breaking Changes - Method Renames

Error: AttributeError: 'FirecrawlApp' object has no attribute 'scrape_url'

Source: v2.0.0 Release | Migration Guide

Why It Happens: v2.0.0 (August 2025) renamed SDK methods across all languages

Prevention: Use new method names

JavaScript/TypeScript:

scrapeUrl() → scrape()

crawlUrl() → crawl() or startCrawl()

asyncCrawlUrl() → startCrawl()

checkCrawlStatus() → getCrawlStatus()

Python:

scrape_url() → scrape()

crawl_url() → crawl() or start_crawl()

# OLD (v1)

doc = app.scrape_url("https://example.com")

# NEW (v2)

doc = app.scrape("https://example.com")
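When migrating a larger codebase, a quick scan for the old names can help locate call sites. The sketch below uses only the renames listed above; it is a plain regex search over source text, so it may also flag matches inside comments or strings. `find_v1_calls` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Old -> new names, taken from the rename list above
RENAMES = {
    "scrapeUrl": "scrape",
    "crawlUrl": "crawl or startCrawl",
    "asyncCrawlUrl": "startCrawl",
    "checkCrawlStatus": "getCrawlStatus",
    "scrape_url": "scrape",
    "crawl_url": "crawl or start_crawl",
}

# Match ".oldName(" so that only method calls are reported
_PATTERN = re.compile(r"\.(" + "|".join(map(re.escape, RENAMES)) + r")\s*\(")


def find_v1_calls(source: str) -> List[Tuple[int, str, str]]:
    """Return (line number, old name, suggested replacement) for each v1 call."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMES[old]))
    return hits
```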

Issue #3: v2.0.0 Breaking Changes - Format Changes

Error: 'extract' is not a valid format

Source: v2.0.0 Release

Why It Happens: Old "extract" format renamed to "json" in v2.0.0

Prevention: Use new object format for JSON extraction

# OLD (v1)

doc = app.scrape_url(

    url="https://example.com",

    params={

        "formats": ["extract"],

        "extract": {"prompt": "Extract title"}

    }

)

# NEW (v2)

doc = app.scrape(

    url="https://example.com",

    formats=[{"type": "json", "prompt": "Extract title"}]

)

# With schema

doc = app.scrape(

    url="https://example.com",

    formats=[{

        "type": "json",

        "prompt": "Extract product info",

        "schema": {

            "type": "object",

            "properties": {

                "title": {"type": "string"},

                "price": {"type": "number"}

            }

        }

    }]

)

Screenshot format also changed:

# NEW: Screenshot as object

formats=[{

    "type": "screenshot",

    "fullPage": True,

    "quality": 80,

    "viewport": {"width": 1920, "height": 1080}

}]
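The extract-to-json change can be applied mechanically to stored v1 parameter dicts. The sketch below covers only the case shown above, where the `"extract"` format plus a sibling `extract` options dict becomes a single `{"type": "json", ...}` entry; other formats pass through unchanged. `migrate_formats` is a hypothetical helper.

```python
from typing import Any, Dict, List, Union

Format = Union[str, Dict[str, Any]]


def migrate_formats(v1_params: Dict[str, Any]) -> List[Format]:
    """Convert v1 'formats' + 'extract' params into a v2 formats list."""
    extract_opts = v1_params.get("extract") or {}
    migrated: List[Format] = []
    for fmt in v1_params.get("formats", []):
        if fmt == "extract":
            entry: Dict[str, Any] = {"type": "json"}
            # Carry over only the options shown in the examples above
            for key in ("prompt", "schema"):
                if key in extract_opts:
                    entry[key] = extract_opts[key]
            migrated.append(entry)
        else:
            migrated.append(fmt)
    return migrated
```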

Issue #4: v2.0.0 Breaking Changes - Crawl Options

Error: 'allowBackwardCrawling' is not a valid parameter

Source: v2.0.0 Release

Why It Happens: Several crawl parameters renamed or removed in v2.0.0

Prevention: Use new parameter names

Parameter Changes:

allowBackwardCrawling → Use crawlEntireDomain instead

maxDepth → Use maxDiscoveryDepth instead

ignoreSitemap (bool) → sitemap ("only", "skip", "include")

# OLD (v1)

app.crawl_url(

    url="https://docs.example.com",

    params={

        "allowBackwardCrawling": True,

        "maxDepth": 3,

        "ignoreSitemap": False

    }

)

# NEW (v2)

app.crawl(

    url="https://docs.example.com",

    crawl_entire_domain=True,

    max_discovery_depth=3,

    sitemap="include"  # "only", "skip", or "include"

)
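The parameter changes above can also be applied to saved crawl configurations. In this sketch the keys stay in the camelCase form used by the REST API. Mapping `ignoreSitemap: False` to `"include"` follows the example above; mapping `True` to `"skip"` is an assumption. `migrate_crawl_params` is a hypothetical helper.

```python
from typing import Any, Dict


def migrate_crawl_params(v1: Dict[str, Any]) -> Dict[str, Any]:
    """Rename v1 crawl parameters to their v2 equivalents; other keys are kept."""
    v2 = dict(v1)  # copy so the caller's dict is left untouched
    if "allowBackwardCrawling" in v2:
        v2["crawlEntireDomain"] = v2.pop("allowBackwardCrawling")
    if "maxDepth" in v2:
        v2["maxDiscoveryDepth"] = v2.pop("maxDepth")
    if "ignoreSitemap" in v2:
        # ASSUMPTION: True -> "skip"; False -> "include" matches the example above
        v2["sitemap"] = "skip" if v2.pop("ignoreSitemap") else "include"
    return v2
```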

Issue #5: v2.0.0 Default Behavior Changes

Error: Stale cached content returned unexpectedly

Source: v2.0.0 Release

Why It Happens: v2.0.0 changed several defaults

Prevention: Be aware of new defaults

Default Changes:

maxAge now defaults to 2 days (cached by default)

blockAds, skipTlsVerification, removeBase64Images enabled by default

# Force fresh data if needed

doc = app.scrape(url, formats=['markdown'], max_age=0)

# Don't store this result in the cache

doc = app.scrape(url, formats=['markdown'], store_in_cache=False)

Issue #6: Job Status Race Condition

Error: "Job not found" when checking crawl status immediately after creation

Source: GitHub Issue #2662

Why It Happens: Database replication delay between job creation and status endpoint availability

Prevention: Wait 1-3 seconds before first status check, or implement retry logic

import time

# Start crawl

job = app.start_crawl(url="https://docs.example.com")

print(f"Job ID: {job.id}")

# REQUIRED: Wait before first status check

time.sleep(2)  # 1-3 seconds recommended

# Now status check succeeds

status = app.get_crawl_status(job.id)

# Or implement retry logic

def get_status_with_retry(job_id, max_retries=3, delay=1):

    for attempt in range(max_retries):

        try:

            return app.get_crawl_status(job_id)

        except Exception as e:

            if "Job not found" in str(e) and attempt < max_retries - 1:

                time.sleep(delay)

                continue

            raise

status = get_status_with_retry(job.id)

Issue #7: DNS Errors Return HTTP 200

Error: DNS resolution failures return success: false with HTTP 200 status instead of 4xx

Source: GitHub Issue #2402 | Fixed in v2.7.0

Why It Happens: Changed in v2.7.0 for consistent error handling

Prevention: Check success field and code field, don't rely on HTTP status alone

const result = await app.scrape('https://nonexistent-domain-xyz.com');

// DON'T rely on HTTP status code

// Response: HTTP 200 with { success: false, code: "SCRAPE_DNS_RESOLUTION_ERROR" }

// DO check success field

if (!result.success) {

    if (result.code === 'SCRAPE_DNS_RESOLUTION_ERROR') {

        console.error('DNS resolution failed');

    }

    throw new Error(result.error);

}

Note: DNS resolution errors still charge 1 credit despite failure.
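The same check applies when calling the REST API directly from Python. The sketch below validates a parsed JSON payload rather than the HTTP status; the `success`, `code`, and `error` fields are the ones shown above, while the `data` key for the successful result is an assumption about the response shape. `ScrapeFailed` and `unwrap_scrape_response` are hypothetical names.

```python
from typing import Any, Dict


class ScrapeFailed(Exception):
    """Raised when a response body reports success: false."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


def unwrap_scrape_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the result data, or raise if the body reports a failure."""
    if not payload.get("success", False):
        raise ScrapeFailed(
            payload.get("code", "UNKNOWN"),
            payload.get("error", "no error message"),
        )
    # ASSUMPTION: successful responses carry the document under "data"
    return payload.get("data", {})
```

Callers can branch on `exc.code == "SCRAPE_DNS_RESOLUTION_ERROR"` to skip retries for domains that do not resolve.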

Issue #8: Bot Detection Still Charges Credits

Error: Cloudflare error page returned as "successful" scrape, credits charged

Source: GitHub Issue #2413

Why It Happens: Fire-1 engine charges credits even when bot detection prevents access

Prevention: Validate content isn't an error page before processing; use stealth mode for protected sites

# First attempt without stealth

doc = app.scrape(url="https://protected-site.com", formats=["markdown"])

# Validate content isn't an error page

if "cloudflare" in doc.markdown.lower() or "access denied" in doc.markdown.lower():

    # Retry with stealth (costs 5 credits if successful)

    doc = app.scrape(url="https://protected-site.com", formats=["markdown"], proxy="stealth")

Cost Impact: Basic scrape charges 1 credit even on failure, stealth retry charges additional 5 credits.
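The inline string check above can be pulled into a reusable heuristic. In this sketch, "cloudflare" and "access denied" come from the example; the other markers and the length cutoff are illustrative guesses meant to avoid flagging long articles that merely mention Cloudflare. `looks_blocked` is a hypothetical helper and will produce both false positives and false negatives.

```python
from typing import Optional

# First two markers come from the check above; the rest are illustrative guesses
BLOCK_MARKERS = ("cloudflare", "access denied", "verify you are human", "captcha")


def looks_blocked(markdown: Optional[str], max_length: int = 2000) -> bool:
    """Heuristic: a short page that mentions a block marker is probably a challenge page."""
    text = (markdown or "").lower()
    if not text or len(text) > max_length:
        return False
    return any(marker in text for marker in BLOCK_MARKERS)
```

A caller would retry with `proxy="stealth"` only when `looks_blocked(doc.markdown)` is true, keeping the extra 5-credit cost to pages that appear to need it.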

Issue #9: Self-Hosted Anti-Bot Fingerprinting Weakness

Error: "All scraping engines failed!" (SCRAPE_ALL_ENGINES_FAILED) on sites with anti-bot measures

Source: GitHub Issue #2257

Why It Happens: Self-hosted Firecrawl lacks advanced anti-fingerprinting techniques present in cloud service

Prevention: Use Firecrawl cloud service for sites with strong anti-bot measures, or configure proxy

# Self-hosted fails on Cloudflare-protected sites

curl -X POST 'http://localhost:3002/v2/scrape' \

-H 'Authorization: Bearer YOUR_API_KEY' \

-d '{

  "url": "https://www.example.com/",

  "pageOptions": { "engine": "playwright" }

}'

# Error: "All scraping engines failed!"

# Workaround: Use cloud service instead

# Cloud service has better anti-fingerprinting

Note: This affects self-hosted v2.3.0+ with default docker-compose setup. Warning present: "⚠️ WARNING: No proxy server provided. Your IP address may be blocked."

Issue #10: Cache Performance Best Practices (Community-sourced)

Suboptimal: Not leveraging cache can make requests 500% slower

Source: Fast Scraping Docs | Blog Post

Why It Matters: Default maxAge is 2 days in v2+, but many use cases need different strategies

Prevention: Use appropriate cache strategy for your content type

# Fresh data (real-time pricing, stock prices)

doc = app.scrape(url, formats=["markdown"], max_age=0)

# 10-minute cache (news, blogs)

doc = app.scrape(url, formats=["markdown"], max_age=600000)  # milliseconds

# Use default cache (2 days) for static content

doc = app.scrape(url, formats=["markdown"])  # maxAge defaults to 172800000

# Don't store in cache (one-time scrape)

doc = app.scrape(url, formats=["markdown"], store_in_cache=False)

# Require minimum age before re-scraping (v2.7.0+)

doc = app.scrape(url, formats=["markdown"], min_age=3600000)  # 1 hour minimum

Performance Impact:

Cached response: Milliseconds

Fresh scrape: Seconds

Speed difference: Up to 500%
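Because `max_age` is expressed in milliseconds, raw literals such as 600000 are easy to get wrong by a factor of 1000. A small sketch using `datetime.timedelta` keeps the units explicit; `max_age_ms` is a hypothetical helper name.

```python
from datetime import timedelta


def max_age_ms(age: timedelta) -> int:
    """Convert a cache age to whole milliseconds."""
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    # Integer division by one millisecond avoids float rounding
    return age // timedelta(milliseconds=1)
```

For example, `max_age_ms(timedelta(minutes=10))` yields 600000 and `max_age_ms(timedelta(days=2))` yields 172800000, matching the values used in the cache examples above.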

Package Versions

| Package | Version | Last Checked |
|---|---|---|
| firecrawl-py | 4.13.0+ | 2026-01-20 |
| @mendable/firecrawl-js | 4.11.1+ | 2026-01-20 |
| API Version | Current | |

Official Documentation

Docs: https://docs.firecrawl.dev

Python SDK: https://docs.firecrawl.dev/sdks/python

Node.js SDK: https://docs.firecrawl.dev/sdks/node

API Reference: https://docs.firecrawl.dev/api-reference

GitHub: https://github.com/mendableai/firecrawl

Dashboard: https://www.firecrawl.dev/app

Token Savings: ~65% vs manual integration

Error Prevention: 10 documented issues (v2 migration, stealth pricing, job status race, DNS errors, bot detection billing, self-hosted limitations, cache optimization)

Production Ready: Yes

Last verified: 2026-01-21 | Skill version: 2.0.0 | Changes: Added Known Issues Prevention section with 10 documented errors from TIER 1-2 research findings; added v2 migration guidance; documented stealth mode pricing change and unified billing model
