robots-txt

When the user wants to configure, audit, or optimize robots.txt. Also use when the user mentions "robots.txt," "crawler rules," "block crawlers," "AI…

INSTALLATION
npx skills add https://github.com/kostja94/marketing-skills --skill robots-txt
Run in your project or agent environment. Adjust flags if your CLI version differs.

SKILL.md

Identify:

  • Site URL: Base domain (e.g., https://example.com)
  • Indexing scope: Full site, partial, or specific paths to exclude
  • AI crawler strategy: Allow search/indexing vs. block training data crawlers

Best Practices

Purpose and Limitations

| Point | Note |
|---|---|
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search results without a snippet) |
| Advisory | Rules are advisory; malicious crawlers may ignore them |
| Public | robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing |

Crawl vs Index vs Link Equity (Quick Reference)

| Tool | Controls | Prevents indexing? |
|---|---|---|
| robots.txt | Crawl (path-level) | No; blocked URLs may still appear in SERPs |
| noindex (meta / X-Robots-Tag) | Index (page-level) | Yes. See indexing |
| nofollow | Link equity only | No; does not control indexing |

When to Use robots.txt vs noindex

| Use | Tool | Example |
|---|---|---|
| Path-level (whole directory) | robots.txt | Disallow: /admin/, Disallow: /api/, Disallow: /staging/ |
| Page-level (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See indexing for full list |
| Critical | Do NOT block in robots.txt | Pages that use noindex; crawlers must access the page to read the directive |

Block in robots.txt: /admin/, /api/, /staging/, temp files. Keep crawlable and use noindex instead: /login/, /signup/, /thank-you/, and similar pages; see indexing.
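The rule that noindex pages must stay crawlable can be checked mechanically. The following is a minimal sketch, assuming Python's standard-library `urllib.robotparser`; the helper name `find_noindex_conflicts`, the sample robots.txt, and the list of noindex paths are hypothetical and only illustrate the idea.

```python
from urllib.robotparser import RobotFileParser


def find_noindex_conflicts(robots_txt, noindex_paths,
                           base_url="https://example.com",
                           user_agent="Googlebot"):
    """Return the noindex paths that robots.txt stops user_agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path for path in noindex_paths
        if not parser.can_fetch(user_agent, base_url + path)
    ]


# Hypothetical robots.txt that wrongly blocks /login/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /login/
"""

# Pages assumed to carry a noindex directive.
NOINDEX_PATHS = ["/login/", "/signup/", "/thank-you/"]

print(find_noindex_conflicts(ROBOTS_TXT, NOINDEX_PATHS))  # prints ['/login/']
```

Any path returned here is blocked from crawling, so its noindex directive would never be read. The standard-library parser only does plain prefix matching, so treat the result as a first-pass check rather than a statement of how any particular search engine evaluates the file.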

Location and Format

| Item | Requirement |
|---|---|
| Path | Site root: https://example.com/robots.txt |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |

Core Directives

| Directive | Purpose | Example |
|---|---|---|
| User-agent: | Target crawler | User-agent: Googlebot, User-agent: * |
| Disallow: | Block path prefix | Disallow: /admin/ |
| Allow: | Allow path (can override Disallow) | Allow: /public/ |
| Sitemap: | Declare sitemap absolute URL | Sitemap: https://example.com/sitemap.xml |
| Clean-param: | Strip query params (Yandex) | See below |
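To see how these directives interact, the sketch below parses a small robots.txt with Python's standard-library `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The `/admin/public/` path and the `ExampleBot` user-agent are hypothetical, chosen only to show an Allow rule overriding a broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: Allow is listed before the broader Disallow so that both
# first-match parsers (like this one) and longest-match parsers (RFC 9309)
# reach the same decision.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://example.com"
for path in ("/admin/", "/admin/public/faq", "/blog/"):
    # True means the crawler may fetch the URL.
    print(path, parser.can_fetch("ExampleBot", BASE + path))

# Sitemap URLs declared in the file.
print(parser.site_maps())
```

Under these assumptions `/admin/` is blocked while `/admin/public/faq` and `/blog/` stay crawlable. Rule order matters for this parser, which applies the first matching rule, so real crawlers that follow the longest-match rule may resolve other files differently.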

Critical: Do Not Block

| Do not block | Reason |
|---|---|
| CSS, JS, images | Google needs them to render pages; blocking them breaks indexing |
| /_next/ (Next.js) | Blocking it breaks CSS/JS loading. Static assets listed as "Crawled - not indexed" in GSC are expected. See indexing |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |

Only block paths that don't need crawling: /admin/, /api/, /staging/, temp files.
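During an audit, a quick lint over the Disallow lines can flag rules that look like they target render-critical assets. This is a heuristic sketch only: the function name `risky_disallows`, the hint list, and the extension list are assumptions, not an exhaustive or authoritative set.

```python
# Assumed hints; extend them to match the project's own asset layout.
ASSET_DIR_HINTS = ("/_next/",)
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".svg", ".webp")


def risky_disallows(robots_txt):
    """Return Disallow paths that look like they target CSS, JS, or images."""
    risky = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "disallow":
            continue
        path = value.strip()
        lowered = path.lower()
        hits_dir = any(hint in lowered for hint in ASSET_DIR_HINTS)
        # Ignore trailing wildcard/anchor characters such as "*" and "$".
        hits_ext = lowered.rstrip("$*").endswith(ASSET_EXTENSIONS)
        if hits_dir or hits_ext:
            risky.append(path)
    return risky


# Hypothetical file with two problematic rules.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /_next/
Disallow: /*.css$
"""

print(risky_disallows(SAMPLE))  # prints ['/_next/', '/*.css$']
```

The lint ignores which user-agent group a rule belongs to, so a flagged path is a prompt for manual review rather than proof of a problem.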

AI Crawler Strategy

robots.txt is honored by most of the AI crawlers listed below; ChatGPT-User is the exception. Set rules per user-agent and check each vendor's docs for current tokens.

| User-agent | Purpose | Typical | Notes |
|---|---|---|---|
| OAI-SearchBot | ChatGPT search | Allow | Respects robots.txt |
| GPTBot | OpenAI training | Disallow | Respects robots.txt; shares crawl data with OAI-SearchBot if both allowed |
| ChatGPT-User | User-initiated browsing | N/A | No longer respects robots.txt (Dec 2025); use server-side controls instead |
| Claude-SearchBot | Claude search | Allow | Respects robots.txt |
| ClaudeBot | Anthropic training | Disallow | Respects robots.txt |
| PerplexityBot | Perplexity search | Allow | Respects robots.txt |
| Google-Extended | Gemini training | Disallow | Respects robots.txt |
| CCBot | Common Crawl (LLM training) | Disallow | Respects robots.txt |
| Bytespider | ByteDance | Disallow | Respects robots.txt |
| Meta-ExternalAgent | Meta | Disallow | Respects robots.txt |
| Applebot | Apple (Siri, Spotlight); renders JS | Allow for indexing | Respects robots.txt |

Allow vs Disallow: Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training.

Important — ChatGPT-User exemption: As of December 2025, ChatGPT-User no longer respects robots.txt directives. OpenAI considers it a proxy for human-initiated browsing. If you need to block it, use server-side controls (WAF rules, IP rate-limiting), not robots.txt. See site-crawlability for AI crawler optimization (SSR, URL management).
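One way to turn the allow/disallow split into a file is to generate one group per policy and then verify it with a parser. The sketch below assumes the user-agent tokens listed above are still current and uses Python's standard-library `urllib.robotparser`; the helper names `build_group` and `build_ai_rules` are hypothetical. ChatGPT-User is left out because robots.txt rules do not apply to it.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the table above; confirm them against vendor docs.
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
                 "Bytespider", "Meta-ExternalAgent"]


def build_group(agents, rule):
    """Build one robots.txt group: several User-agent lines sharing one rule."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append(rule)
    lines.append("")  # a blank line closes the group
    return lines


def build_ai_rules(allow=SEARCH_BOTS, block=TRAINING_BOTS):
    """Return robots.txt text that allows search bots and blocks training bots."""
    lines = build_group(allow, "Allow: /") + build_group(block, "Disallow: /")
    return "\n".join(lines)


robots_txt = build_ai_rules()
print(robots_txt)

# Verify the generated rules behave as intended.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
url = "https://example.com/blog/post"
for agent in ("OAI-SearchBot", "GPTBot", "ClaudeBot", "Claude-SearchBot"):
    print(agent, parser.can_fetch(agent, url))
```

The generated groups would normally be appended to the site's general `User-agent: *` rules and `Sitemap:` line rather than used as the whole file.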

Clean-param (Yandex)

Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
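The directive tells Yandex that the listed query parameters do not change page content, so URLs that differ only in those parameters can be treated as one. The sketch below approximates that normalization with Python's standard-library `urllib.parse`; the function name `strip_params` and the sample URLs are hypothetical, and this is an illustration of the idea rather than Yandex's actual implementation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Same parameter list as the Clean-param line above.
CLEAN_PARAMS = (
    "utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid"
).split("&")


def strip_params(url, params=CLEAN_PARAMS):
    """Return url with the listed query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in params
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_params("https://example.com/page?utm_source=news&id=7&gclid=abc"))
# prints https://example.com/page?id=7
```

Running tracked URLs through a helper like this is a quick way to confirm that the parameters in the directive really are the ones that create duplicate URLs on the site.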

Output Format

  • Current state (if auditing)
  • Recommended robots.txt (full file)
  • Compliance checklist

Related Skills

  • indexing: Full noindex page-type list; when to use noindex vs robots.txt; GSC indexing diagnosis
  • page-metadata: Meta robots (noindex, nofollow) implementation
  • xml-sitemap: Sitemap URL to reference in robots.txt
  • site-crawlability: Broader crawl and structure guidance; AI crawler optimization
  • rendering-strategies: SSR, SSG, CSR; content in initial HTML for crawlers