SKILL.md


Identify:

Site URL: Base domain (e.g., https://example.com)

Indexing scope: Full site, partial, or specific paths to exclude

AI crawler strategy: Allow search/indexing vs. block training data crawlers

Best Practices

Purpose and Limitations

| Point | Note |
| --- | --- |
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without a snippet) |
| Advisory | Rules are advisory; malicious crawlers may ignore them |
| Public | robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing |

Crawl vs Index vs Link Equity (Quick Reference)

| Tool | Controls | Prevents indexing? |
| --- | --- | --- |
| robots.txt | Crawl (path-level) | No; blocked URLs may still appear in SERP |
| noindex (meta / X-Robots-Tag) | Index (page-level) | Yes. See indexing |
| nofollow | Link equity only | No; does not control indexing |

When to Use robots.txt vs noindex

| Use | Tool | Example |
| --- | --- | --- |
| Path-level (whole directory) | robots.txt | `Disallow: /admin/`, `Disallow: /api/`, `Disallow: /staging/` |
| Page-level (specific pages) | noindex meta / X-Robots-Tag | |
| Critical | Do NOT block in robots.txt | Pages that use noindex: crawlers must access the page to read the directive |

Paths to block in robots.txt: /admin/, /api/, /staging/, temp files. Paths to use noindex (allow crawl): /login/, /signup/, /thank-you/, etc.—see indexing.
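The rule above (never block a page that relies on noindex) can be checked mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt body, the `NOINDEX_PATHS` list, and the helper name `blocked_noindex_paths` are illustrative assumptions built from the example paths in this section, not a real site's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks private paths but, by mistake, /thank-you/ too.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Disallow: /thank-you/
"""

# Pages assumed to carry a noindex directive (meta tag or X-Robots-Tag).
NOINDEX_PATHS = ["/login/", "/signup/", "/thank-you/"]


def blocked_noindex_paths(robots_txt, paths, user_agent="Googlebot"):
    """Return the noindex paths that robots.txt stops user_agent from crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


conflicts = blocked_noindex_paths(ROBOTS_TXT, NOINDEX_PATHS)
print(conflicts)  # any path listed here can never have its noindex read
```

Any path the function returns should be removed from the Disallow rules so the crawler can reach the page and see the directive.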

Location and Format

| Item | Requirement |
| --- | --- |
| Path | Site root: `https://example.com/robots.txt` |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |
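Because the file must sit at the site root, its URL can be derived from any page URL on the same host. A small sketch, assuming the helper name `robots_txt_url` (hypothetical) and the standard `urllib.parse` module:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?utm_source=x"))
# prints: https://example.com/robots.txt
```

Under RFC 9309 each scheme, host, and port combination has its own file, so a subdomain such as `blog.example.com` would need a separate robots.txt.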

Core Directives

| Directive | Purpose | Example |
| --- | --- | --- |
| `User-agent:` | Target crawler | `User-agent: Googlebot`, `User-agent: *` |
| `Disallow:` | Block path prefix | `Disallow: /admin/` |
| `Allow:` | Allow path (can override Disallow) | `Allow: /public/` |
| `Sitemap:` | Declare sitemap absolute URL | `Sitemap: https://example.com/sitemap.xml` |
| `Clean-param:` | Strip query params (Yandex) | See below |
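To see how these directives interact, the sketch below parses a small illustrative file with Python's standard `urllib.robotparser` and queries it. The `/admin/help/` path is a hypothetical addition used only to show an Allow rule overriding a broader Disallow. `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file; /admin/help/ is a made-up path to demonstrate an override.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/help/
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "/admin/settings"))   # blocked by Disallow
print(parser.can_fetch("Googlebot", "/admin/help/faq"))   # permitted by Allow
print(parser.can_fetch("Googlebot", "/public/page"))      # no rule matches
print(parser.site_maps())                                 # declared sitemap URLs
```

RFC 9309 specifies that the most specific (longest) matching rule wins, while Python's parser applies the first matching rule in file order. Listing the narrower Allow before the broader Disallow gives the same result under both interpretations.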

Critical: Do Not Block

| Do not block | Reason |
| --- | --- |
| CSS, JS, images | Google needs them to render pages; blocking breaks indexing |
| `/_next/` (Next.js) | Blocking it breaks CSS/JS loading. Static assets reported as "Crawled - not indexed" in GSC are expected. See indexing |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |

Block only paths that don't need crawling: /admin/, /api/, /staging/, temp files.
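One way to catch an accidental asset block is to test a handful of render-critical URLs against the file before deploying it. This is a sketch only: the robots.txt body, the asset paths, and the helper name `blocked_assets` are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Next.js build assets by mistake.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /_next/
"""

# Sample asset URLs a renderer would need; the paths are illustrative.
ASSET_PATHS = [
    "/_next/static/css/app.css",
    "/_next/static/chunks/main.js",
    "/images/logo.png",
]


def blocked_assets(robots_txt, asset_paths, user_agent="Googlebot"):
    """Return the asset paths that user_agent is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in asset_paths if not parser.can_fetch(user_agent, p)]


blocked = blocked_assets(ROBOTS_TXT, ASSET_PATHS)
print(blocked)  # a non-empty list means rendering resources are blocked
```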

AI Crawler Strategy

robots.txt is effective for most measured AI crawlers; the exception is ChatGPT-User, noted in the table below. Set rules per user-agent, and check each vendor's docs for current tokens.

| User-agent | Purpose | Typical | Notes |
| --- | --- | --- | --- |
| OAI-SearchBot | ChatGPT search | Allow | Respects robots.txt |
| GPTBot | OpenAI training | Disallow | Respects robots.txt; shares crawl data with OAI-SearchBot if both allowed |
| ChatGPT-User | User-initiated browsing | N/A | No longer respects robots.txt (Dec 2025); use server-side controls instead |
| Claude-SearchBot | Claude search | Allow | Respects robots.txt |
| ClaudeBot | Anthropic training | Disallow | Respects robots.txt |
| PerplexityBot | Perplexity search | Allow | Respects robots.txt |
| Google-Extended | Gemini training | Disallow | Respects robots.txt |
| CCBot | Common Crawl (LLM training) | Disallow | Respects robots.txt |
| Bytespider | ByteDance | Disallow | Respects robots.txt |
| Meta-ExternalAgent | | | |
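The "Typical" column above can be turned into robots.txt groups with a few lines of code. The sketch below is one possible approach, not a prescribed tool: `AI_CRAWLER_POLICY` and `render_ai_rules` are hypothetical names, and the tokens are copied from the table and should be verified against each vendor's documentation. ChatGPT-User is left out because robots.txt does not control it, and Meta-ExternalAgent is left out because no policy is listed for it.

```python
from urllib.robotparser import RobotFileParser

# Policy copied from the "Typical" column; verify tokens against vendor docs.
AI_CRAWLER_POLICY = {
    "OAI-SearchBot": "allow",
    "GPTBot": "disallow",
    "Claude-SearchBot": "allow",
    "ClaudeBot": "disallow",
    "PerplexityBot": "allow",
    "Google-Extended": "disallow",
    "CCBot": "disallow",
    "Bytespider": "disallow",
}


def render_ai_rules(policy):
    """Render one User-agent group per crawler; '/' covers the whole site."""
    groups = []
    for agent, action in policy.items():
        directive = "Allow" if action == "allow" else "Disallow"
        groups.append(f"User-agent: {agent}\n{directive}: /")
    return "\n\n".join(groups) + "\n"


robots_txt = render_ai_rules(AI_CRAWLER_POLICY)
print(robots_txt)

# Sanity-check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "/"))         # training crawler: blocked
print(parser.can_fetch("OAI-SearchBot", "/"))  # search crawler: allowed
```

The generated groups would be appended to the site's general rules (the `User-agent: *` group and the `Sitemap:` line), which this sketch does not produce.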

Clean-param (Yandex)

Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
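Clean-param asks Yandex to treat URLs that differ only in the listed parameters as the same page. The sketch below approximates that normalization locally so the effect of the directive can be previewed on sample URLs. It illustrates the idea and does not reproduce Yandex's implementation; the helper name `strip_tracking_params` and the sample URL are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names taken from the Clean-param line above.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term",
    "utm_content", "ref", "fbclid", "gclid",
}


def strip_tracking_params(url):
    """Drop the listed tracking parameters, keeping the others in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_tracking_params(
    "https://example.com/shoes?color=red&utm_source=news&gclid=abc"
))
# prints: https://example.com/shoes?color=red
```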

Output Format

Current state (if auditing)

Recommended robots.txt (full file)

Compliance checklist

References: Google robots.txt

Related Skills

indexing: Full noindex page-type list; when to use noindex vs robots.txt; GSC indexing diagnosis

page-metadata: Meta robots (noindex, nofollow) implementation

xml-sitemap: Sitemap URL to reference in robots.txt

site-crawlability: Broader crawl and structure guidance; AI crawler optimization

rendering-strategies: SSR, SSG, CSR; content in initial HTML for crawlers
