SKILL.md
Identify:
- Site URL: Base domain (e.g., https://example.com)
- Indexing scope: Full site, partial, or specific paths to exclude
- AI crawler strategy: Allow search/indexing vs. block training data crawlers
Best Practices
Purpose and Limitations
| Point | Note |
| --- | --- |
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without snippet) |
| Advisory | Rules are advisory; malicious crawlers may ignore |
| Public | robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing |
Crawl vs Index vs Link Equity (Quick Reference)
| Tool | Controls | Prevents indexing? |
| --- | --- | --- |
| robots.txt | Crawl (path-level) | No; blocked URLs may still appear in SERP |
| noindex (meta / X-Robots-Tag) | Index (page-level) | Yes. See indexing |
| nofollow | Link equity only | No; does not control indexing |
When to Use robots.txt vs noindex
| Use | Tool | Example |
| --- | --- | --- |
| Path-level (whole directory) | robots.txt | Disallow: /admin/, Disallow: /api/, Disallow: /staging/ |
| Page-level (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See indexing for full list |
| Critical | Do NOT block in robots.txt | Pages that use noindex: crawlers must access the page to read the directive |
Paths to block in robots.txt: /admin/, /api/, /staging/, temp files. Paths to use noindex (allow crawl): /login/, /signup/, /thank-you/, etc.—see indexing.
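The split above can be sanity-checked with Python's standard `urllib.robotparser`. The following is a minimal sketch, not a definitive implementation: the helper names `robots_headers` and `noindex_paths_blocked` and the constant `NOINDEX_PREFIXES` are hypothetical, and the paths are the examples from this section. It sends `X-Robots-Tag: noindex` for page-level paths and reports any of those paths that robots.txt would block, which would stop crawlers from ever seeing the directive.

```python
from urllib.robotparser import RobotFileParser

# Path-level blocking lives in robots.txt (paths taken from this section).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
"""

# Hypothetical list of page-level paths: crawlable, but marked noindex.
NOINDEX_PREFIXES = ("/login/", "/signup/", "/thank-you/")


def robots_headers(path: str) -> dict:
    """Return the extra response header for pages that should not be indexed."""
    if path.startswith(NOINDEX_PREFIXES):
        return {"X-Robots-Tag": "noindex"}
    return {}


def noindex_paths_blocked(robots_txt: str, base_url: str) -> list:
    """List noindex paths that robots.txt blocks, which would hide the directive."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in NOINDEX_PREFIXES if not parser.can_fetch("*", base_url + p)]


print(noindex_paths_blocked(ROBOTS_TXT, "https://example.com"))  # prints []
```

An empty list means no noindex page is hidden behind a Disallow rule. Adding `Disallow: /login/` to the file would make the check return `["/login/"]`.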
Location and Format
| Item | Requirement |
| --- | --- |
| Path | Site root: https://example.com/robots.txt |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |
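Because the file must sit at the site root, its URL can be derived from any page URL on the same host. A small sketch, assuming a hypothetical helper name `robots_txt_url`; it keeps the scheme, host, and port and replaces everything else with `/robots.txt`:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?x=1#top"))
# prints https://example.com/robots.txt
```

Subdomains and non-default ports yield their own URL (for example `https://shop.example.com:8443/robots.txt`), so each host needs its own file.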
Core Directives
| Directive | Purpose | Example |
| --- | --- | --- |
| User-agent: | Target crawler | User-agent: Googlebot, User-agent: * |
| Disallow: | Block path prefix | Disallow: /admin/ |
| Allow: | Allow path (can override Disallow) | Allow: /public/ |
| Sitemap: | Declare sitemap absolute URL | Sitemap: https://example.com/sitemap.xml |
| Clean-param: | Strip query params (Yandex) | See below |
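To see how these directives interact, the sketch below parses a small sample file with Python's standard `urllib.robotparser`. The sample is an assumption for illustration: the path `/admin/public/` is hypothetical and chosen only to show an Allow rule overriding a broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample file: /admin/public/ is carved out of the blocked /admin/.
SAMPLE = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/admin/settings"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/admin/public/help"))  # True
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Python's parser applies the first matching rule in file order, while RFC 9309 specifies that the most specific (longest) matching rule wins. Listing the Allow line before the broader Disallow keeps both interpretations in agreement. `site_maps()` requires Python 3.8 or later.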
Critical: Do Not Block
| Do not block | Reason |
| --- | --- |
| CSS, JS, images | Google needs them to render pages; blocking breaks indexing |
| /_next/ (Next.js) | Blocking breaks CSS/JS loading. Seeing static assets under "Crawled - currently not indexed" in GSC is expected. See indexing |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |
Only block paths that don't need crawling: /admin/, /api/, /staging/, temp files.
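A quick lint can catch rules that block render-critical assets. This is a rough heuristic sketch, not a full robots.txt parser: the function name `blocked_asset_rules` is hypothetical, and apart from `/_next/`, CSS, JS, and images, the directory names and extensions below are assumptions to adapt to your stack.

```python
# Hypothetical asset locations; only /_next/ comes from this document.
ASSET_DIRS = ("/_next/", "/static/", "/assets/")
ASSET_EXTS = (".css", ".js", ".png", ".jpg", ".svg", ".webp")


def blocked_asset_rules(robots_txt: str) -> list:
    """Return Disallow values that look like they target render-critical assets."""
    flagged = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "disallow":
            continue
        pattern = value.strip().rstrip("$")  # ignore a trailing end-of-URL anchor
        if pattern.startswith(ASSET_DIRS) or pattern.endswith(ASSET_EXTS):
            flagged.append(value.strip())
    return flagged


sample = "User-agent: *\nDisallow: /admin/\nDisallow: /_next/\nDisallow: /*.js$\n"
print(blocked_asset_rules(sample))  # ['/_next/', '/*.js$']
```

The check ignores which user-agent group a rule belongs to, so treat a hit as a prompt to review the rule rather than as proof of a problem.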
AI Crawler Strategy
The AI crawlers listed below honor robots.txt, with the exception of ChatGPT-User. Set rules per user-agent, and check each vendor's docs for current tokens.
| User-agent | Purpose | Typical | Notes |
| --- | --- | --- | --- |
| OAI-SearchBot | ChatGPT search | Allow | Respects robots.txt |
| GPTBot | OpenAI training | Disallow | Respects robots.txt; shares crawl data with OAI-SearchBot if both allowed |
| ChatGPT-User | User-initiated browsing | N/A | No longer respects robots.txt (Dec 2025); use server-side controls instead |
| Claude-SearchBot | Claude search | Allow | Respects robots.txt |
| ClaudeBot | Anthropic training | Disallow | Respects robots.txt |
| PerplexityBot | Perplexity search | Allow | Respects robots.txt |
| Google-Extended | Gemini training | Disallow | Respects robots.txt |
| CCBot | Common Crawl (LLM training) | Disallow | Respects robots.txt |
| Bytespider | ByteDance | Disallow | Respects robots.txt |
| Meta-ExternalAgent | Meta | Disallow | Respects robots.txt |
| Applebot | Apple (Siri, Spotlight); renders JS | Allow for indexing | Respects robots.txt |
Allow vs Disallow: Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training.
Important — ChatGPT-User exemption: As of December 2025, ChatGPT-User no longer respects robots.txt directives. OpenAI considers it a proxy for human-initiated browsing. If you need to block it, use server-side controls (WAF rules, IP rate-limiting), not robots.txt. See site-crawlability for AI crawler optimization (SSR, URL management).
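The allow/disallow split can be generated and verified in a few lines. The sketch below is one possible approach under stated assumptions: `build_ai_rules` is a hypothetical helper, the token lists mirror the "Typical" column in this section and should be checked against each vendor's docs, and ChatGPT-User is left out because robots.txt does not control it.

```python
from urllib.robotparser import RobotFileParser

# Token lists mirror this section; verify them against vendor docs before use.
ALLOW = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Applebot"]
DISALLOW = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
            "Bytespider", "Meta-ExternalAgent"]


def build_ai_rules(allow: list, disallow: list) -> str:
    """Render two robots.txt groups: search bots allowed, training bots blocked."""
    lines = [f"User-agent: {agent}" for agent in allow]
    lines += ["Allow: /", ""]
    lines += [f"User-agent: {agent}" for agent in disallow]
    lines += ["Disallow: /", ""]
    return "\n".join(lines)


rules = build_ai_rules(ALLOW, DISALLOW)
parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/blog/"))         # False
print(parser.can_fetch("OAI-SearchBot", "https://example.com/blog/"))  # True
```

Grouping several `User-agent` lines above one rule keeps the file short. Crawlers not named in either group fall through to the `User-agent: *` group, if the full file has one.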
Clean-param (Yandex)
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
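Clean-param tells Yandex that URLs differing only in the listed parameters serve the same content. The sketch below illustrates that equivalence in Python; it is not Yandex's implementation, and `strip_tracking` is a hypothetical helper name. The parameter names are copied from the directive above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names copied from the Clean-param line in this section.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term",
    "utm_content", "ref", "fbclid", "gclid",
}


def strip_tracking(url: str) -> str:
    """Drop tracking parameters, keeping the remaining ones in order."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking("https://example.com/shoes?color=red&utm_source=news&gclid=abc"))
# prints https://example.com/shoes?color=red
```

Other search engines ignore Clean-param, so canonical URLs remain the cross-engine way to consolidate parameterized duplicates.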
Output Format
- Current state (if auditing)
- Recommended robots.txt (full file)
- Compliance checklist
- References: Google robots.txt
Related Skills
- indexing: Full noindex page-type list; when to use noindex vs robots.txt; GSC indexing diagnosis
- page-metadata: Meta robots (noindex, nofollow) implementation
- xml-sitemap: Sitemap URL to reference in robots.txt
- site-crawlability: Broader crawl and structure guidance; AI crawler optimization
- rendering-strategies: SSR, SSG, CSR; content in initial HTML for crawlers