docs: add robots.txt and llms.txt for docs site discoverability #1425
eccogrinder wants to merge 2 commits into main from
Conversation
📝 Walkthrough
Two new public documentation files are added: one describing The Interfold concept of confidential coordination with its goals and resources, and one providing robots.txt directives for search engine and AI training policies.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/public/robots.txt`:
- Line 5: The robots.txt currently includes a Sitemap:
https://docs.theinterfold.com/sitemap.xml entry before the sitemap is generated;
remove or comment out that Sitemap line in the public robots.txt (or
conditionally add it only when the sitemap.html/xml artifact is produced by the
build) so the deployed robots.txt does not reference a non-existent
/sitemap.xml; locate the Sitemap line in the docs/public/robots.txt and either
delete it or gate its addition into the CI/build that emits sitemap.xml.
- Line 2: Replace the Cloudflare-specific "Content-Signal:
search=yes,ai-train=no" line with explicit robots.txt user-agent sections:
remove the Content-Signal header and add per-crawler directives (e.g.,
User-agent: GPTBot / Anthropic crawlers / PerplexityBot with Disallow: / to
block AI crawlers, and User-agent: Googlebot / bingbot with appropriate
Allow/Disallow rules to permit search indexing). Edit the robots.txt so it
contains explicit User-agent and Disallow/Allow blocks for each crawler instead
of the Content-Signal header.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3d8b1737-914f-4e05-9f15-55f375baa61d
📒 Files selected for processing (2)
docs/public/llms.txt
docs/public/robots.txt
@@ -0,0 +1,5 @@
User-agent: *
Content-Signal: search=yes,ai-train=no
🧩 Analysis chain
🌐 Web query:
Is Content-Signal: search=yes,ai-train=no in robots.txt currently supported by major crawlers (Googlebot, Bingbot, Common Crawl, GPTBot, ClaudeBot, PerplexityBot)? Please provide official documentation links and recommended alternatives if unsupported.
💡 Result:
Is Content-Signal: search=yes,ai-train=no supported by major crawlers?
No—there is currently no official documentation from Googlebot, Bingbot, Common Crawl (CCBot), OpenAI (GPTBot), Anthropic (Claude bots), or Perplexity (PerplexityBot) stating they parse/obey the non-standard Content-Signal: robots.txt line. It’s a Cloudflare-defined extension/policy for expressing preferences, not part of what these crawlers document as supported robots.txt directives. (blog.cloudflare.com)
What happens in practice (per official docs)
- Googlebot: Google documents support for only `user-agent`, `allow`, `disallow`, and `sitemap` in robots.txt; other fields are not supported (so `Content-Signal` is not a supported directive). (developers.google.com)
- Bingbot: Bing documents that it follows REP directives in robots.txt (no mention of `Content-Signal`). (blogs.bing.com)
- CCBot (Common Crawl): Common Crawl documents blocking via standard `User-agent: CCBot` + `Disallow`. No mention of `Content-Signal`. (commoncrawl.org)
- GPTBot (OpenAI): OpenAI documents control via `GPTBot` / `OAI-SearchBot` robots.txt tags. No mention of `Content-Signal`. (openai.com)
- Claude bots (Anthropic): Anthropic documents opting out by disallowing their user-agents in robots.txt. No mention of `Content-Signal`. (support.anthropic.com)
- PerplexityBot: Perplexity documents control via `PerplexityBot` robots.txt tags. No mention of `Content-Signal`. (docs.perplexity.ai)
Official documentation links (copy/paste)
Cloudflare Content Signals Policy (introduces Content-Signal):
https://blog.cloudflare.com/content-signals-policy/
https://developers.cloudflare.com/bots/additional-configurations/managed-robots-txt/
Google robots.txt spec / supported fields:
https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec
Bing on robots.txt / REP behavior:
https://blogs.bing.com/webmaster/August-2009/Crawl-delay-and-the-Bing-crawler%2C-MSNBot
https://blogs.bing.com/webmaster/May-2012/To-crawl-or-not-to-crawl%2C-that-is-BingBot-s-questi
Common Crawl (CCBot) blocking:
https://commoncrawl.org/ccbot
OpenAI (GPTBot / OAI-SearchBot / ChatGPT-User):
https://openai.com/gptbot
Anthropic (how to block Claude crawlers via robots.txt):
https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
Perplexity crawlers / robots.txt tags:
https://docs.perplexity.ai/docs/resources/perplexity-crawlers
Recommended alternatives (when Content-Signal is unsupported)
- Use standard robots.txt allow/disallow rules per crawler user-agent:
  - Google indexing control: `User-agent: Googlebot` + `Allow`/`Disallow` (standard REP). (developers.google.com)
  - Bing indexing control: `User-agent: bingbot` + `Allow`/`Disallow` (standard REP). (blogs.bing.com)
  - Common Crawl opt-out: `User-agent: CCBot` + `Disallow: /`. (commoncrawl.org)
- For AI-specific crawlers, disallow their documented user-agents:
  - OpenAI training crawler: disallow `GPTBot` (and separately manage `OAI-SearchBot` if you want search visibility without training). (openai.com)
  - Anthropic: disallow the relevant Anthropic bot user-agent(s) per their help doc. (support.anthropic.com)
  - Perplexity: manage `PerplexityBot` via robots.txt per their docs. (docs.perplexity.ai)
- Enforce with infrastructure controls (recommended if you need real control):
  - Cloudflare explicitly notes content signals are preferences and recommends combining them with WAF/Bot Management. (blog.cloudflare.com)
Citations:
- 1: https://blog.cloudflare.com/content-signals-policy/
- 2: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
- 3: https://blogs.bing.com/webmaster/August-2009/Crawl-delay-and-the-Bing-crawler%2C-MSNBot?utm_source=openai
- 4: https://commoncrawl.org/ccbot
- 5: https://openai.com/gptbot
- 6: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler?utm_source=openai
- 7: https://docs.perplexity.ai/docs/resources/perplexity-crawlers
Use standard robots.txt directives instead of Content-Signal
Line 2's Content-Signal: search=yes,ai-train=no is a Cloudflare-specific extension not supported by any major crawler (Google, Bing, Common Crawl, GPTBot, Claude, PerplexityBot). The ai-train=no intent will not be enforced. Use standard robots.txt rules with per-crawler user-agent directives:
- Disallow AI crawlers: `User-agent: GPTBot`, the Anthropic crawler user-agents, and `User-agent: PerplexityBot` + `Disallow: /`
- Allow search indexing: `User-agent: Googlebot`, `User-agent: bingbot` + standard allow/disallow rules
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/public/robots.txt` at line 2, Replace the Cloudflare-specific
"Content-Signal: search=yes,ai-train=no" line with explicit robots.txt
user-agent sections: remove the Content-Signal header and add per-crawler
directives (e.g., User-agent: GPTBot / Anthropic crawlers / PerplexityBot with
Disallow: / to block AI crawlers, and User-agent: Googlebot / bingbot with
appropriate Allow/Disallow rules to permit search indexing). Edit the robots.txt
so it contains explicit User-agent and Disallow/Allow blocks for each crawler
instead of the Content-Signal header.
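As a sketch of what this review comment is asking for, a standards-based robots.txt with per-crawler blocks could look roughly like the following. The user-agent strings are the ones publicly documented by each vendor at time of writing; this is illustrative, not the PR's final file, and current bot names should be verified against each vendor's docs:

```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Allow search engine indexing
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

# Default for all other crawlers
User-agent: *
Allow: /
```

Note that crawlers pick the most specific matching `User-agent` group, so the explicit `GPTBot` block takes precedence over the `*` default for that bot.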
Content-Signal: search=yes,ai-train=no
Allow: /

Sitemap: https://docs.theinterfold.com/sitemap.xml
(No newline at end of file)
Sitemap URL may be premature without generated file
Line 5 points to /sitemap.xml, but PR objectives say sitemap automation lands in a follow-up PR. If that file is not present at deploy time, this is a temporary broken discovery hint.
Suggested adjustment
- Sitemap: https://docs.theinterfold.com/sitemap.xml
+ # Add after sitemap.xml is generated/deployed
+ # Sitemap: https://docs.theinterfold.com/sitemap.xml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/public/robots.txt` at line 5, The robots.txt currently includes a
Sitemap: https://docs.theinterfold.com/sitemap.xml entry before the sitemap is
generated; remove or comment out that Sitemap line in the public robots.txt (or
conditionally add it only when the sitemap.html/xml artifact is produced by the
build) so the deployed robots.txt does not reference a non-existent
/sitemap.xml; locate the Sitemap line in the docs/public/robots.txt and either
delete it or gate its addition into the CI/build that emits sitemap.xml.
Purpose
This reflects the ongoing rename from Enclave → The Interfold.
Adds `robots.txt` and `llms.txt` to improve discoverability and AI retrieval of the Interfold documentation. `robots.txt` allows search indexing and AI retrieval while preserving the `ai-train=no` intent. `llms.txt` provides a canonical description and entry points for LLM retrieval systems.
Files added
/public/robots.txt
/public/llms.txt
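For context, the llms.txt format (per the llmstxt.org proposal) is a markdown file with an H1 title, a blockquote summary, and sections of links. A minimal sketch of what such a file might contain for this site follows; the wording and URLs here are illustrative placeholders, not the actual content of the PR's file:

```
# The Interfold

> Documentation for The Interfold, covering confidential coordination,
> its goals, and related resources.

## Documentation

- [Docs home](https://docs.theinterfold.com/): entry point for the documentation
```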
Follow-up (separate PR)
Add automatic sitemap generation for the docs using `next-sitemap`. Proposed approach:
- `pnpm add next-sitemap`
- Create `next-sitemap.config.js`
- Add `"postbuild": "next-sitemap"` to package.json scripts
- `https://docs.theinterfold.com/sitemap.xml` is then generated automatically

Summary by CodeRabbit
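For the follow-up PR described above, a minimal `next-sitemap.config.js` might look like the sketch below. It uses next-sitemap's documented `siteUrl` and `generateRobotsTxt` options; disabling robots.txt generation is an assumption here, chosen so the tool does not overwrite the hand-maintained `public/robots.txt` added in this PR:

```js
/** @type {import('next-sitemap').IConfig} */
module.exports = {
  // Base URL used for every entry in the generated sitemap
  siteUrl: 'https://docs.theinterfold.com',
  // robots.txt is maintained by hand in public/, so don't generate one
  generateRobotsTxt: false,
};
```

With `"postbuild": "next-sitemap"` in scripts, the sitemap is emitted into `public/` after each `next build`, at which point the commented-out `Sitemap:` line in robots.txt can be re-enabled.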