
docs: add robots.txt and llms.txt for docs site discoverability#1425

Open
eccogrinder wants to merge 2 commits into main from docs-robots-llms-discovery

Conversation


@eccogrinder eccogrinder commented Mar 13, 2026

Purpose

Adds robots.txt and llms.txt to improve discoverability and AI retrieval of the Interfold documentation. (The naming reflects the ongoing rename from Enclave → The Interfold.)

  • robots.txt allows search indexing and AI retrieval while signaling ai-train=no.
  • llms.txt provides a canonical description and entry points for LLM retrieval systems.

Files added

/public/robots.txt
/public/llms.txt

Follow-up (separate PR)

Add automatic sitemap generation for the docs using next-sitemap.

Proposed approach:

  1. pnpm add next-sitemap
  2. create next-sitemap.config.js
  3. add "postbuild": "next-sitemap" to scripts
  4. deploy → https://docs.theinterfold.com/sitemap.xml generated automatically
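The config from step 2 could look like the following sketch (siteUrl is assumed from the docs domain used elsewhere in this PR; generateRobotsTxt is left disabled on the assumption that robots.txt stays hand-maintained in /public):

```javascript
// next-sitemap.config.js (sketch, not the final config)
/** @type {import('next-sitemap').IConfig} */
module.exports = {
  // Base URL prepended to every route in the generated sitemap.xml
  siteUrl: 'https://docs.theinterfold.com',
  // Keep the hand-written /public/robots.txt instead of generating one
  generateRobotsTxt: false,
};
```

With `"postbuild": "next-sitemap"` in scripts, the sitemap is then emitted into the public output on every build.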

Summary by CodeRabbit

  • Documentation
    • Introduced comprehensive documentation on the Interfold concept, outlining confidential coordination goals, approaches for achieving shared verifiable outcomes across independent parties, and additional resources.
    • Added search engine configuration directives to manage content indexing behavior and prevent unauthorized use for AI model training.

@eccogrinder eccogrinder added the documentation Improvements or additions to documentation label Mar 13, 2026

vercel bot commented Mar 13, 2026

The latest updates on your projects.

Project        Status  Actions           Updated (UTC)
crisp          Ready   Preview, Comment  Mar 20, 2026 6:23am
enclave-docs   Ready   Preview, Comment  Mar 20, 2026 6:23am



coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Two new public documentation files are added: one describing The Interfold concept of confidential coordination with its goals and resources, and one providing robots.txt directives for search engine and AI training policies.

Changes

Cohort / File(s): Public Documentation (docs/public/llms.txt, docs/public/robots.txt)
Summary: Adds documentation about The Interfold concept and configuration for search engine crawling policies, including AI training restrictions and sitemap directives.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

A rabbit hops through docs so new,
With llms.txt and robots too!
Search engines learn where not to peek,
While Interfold's secrets stay unique! 🐰✨

🚥 Pre-merge checks: ✅ 3 passed
  • Description Check: Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: Passed. The title 'docs: add robots.txt and llms.txt for docs site discoverability' directly and clearly summarizes the main change: adding two public configuration files to improve site discovery.
  • Docstring Coverage: Passed. No functions found in the changed files; docstring coverage check skipped.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3d8b1737-914f-4e05-9f15-55f375baa61d

📥 Commits

Reviewing files that changed from the base of the PR and between 1cd8ae0 and bc39cb5.

📒 Files selected for processing (2)
  • docs/public/llms.txt
  • docs/public/robots.txt

@@ -0,0 +1,5 @@
User-agent: *
Content-Signal: search=yes,ai-train=no

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Is Content-Signal: search=yes,ai-train=no in robots.txt currently supported by major crawlers (Googlebot, Bingbot, Common Crawl, GPTBot, ClaudeBot, PerplexityBot)? Please provide official documentation links and recommended alternatives if unsupported.

💡 Result:

Is Content-Signal: search=yes,ai-train=no supported by major crawlers?

No—there is currently no official documentation from Googlebot, Bingbot, Common Crawl (CCBot), OpenAI (GPTBot), Anthropic (Claude bots), or Perplexity (PerplexityBot) stating they parse/obey the non-standard Content-Signal: robots.txt line. It’s a Cloudflare-defined extension/policy for expressing preferences, not part of what these crawlers document as supported robots.txt directives. (blog.cloudflare.com)

What happens in practice (per official docs)

  • Googlebot: Google documents support for only user-agent, allow, disallow, sitemap in robots.txt; other fields are not supported (so Content-Signal is not a supported directive). (developers.google.com)
  • Bingbot: Bing documents that it follows REP directives in robots.txt (no mention of Content-Signal). (blogs.bing.com)
  • CCBot (Common Crawl): Common Crawl documents blocking via standard User-agent: CCBot + Disallow. No mention of Content-Signal. (commoncrawl.org)
  • GPTBot (OpenAI): OpenAI documents control via GPTBot / OAI-SearchBot robots.txt tags. No mention of Content-Signal. (openai.com)
  • Claude bots (Anthropic): Anthropic documents opting out by disallowing their user-agents in robots.txt. No mention of Content-Signal. (support.anthropic.com)
  • PerplexityBot: Perplexity documents control via PerplexityBot robots.txt tags. No mention of Content-Signal. (docs.perplexity.ai)

Official documentation links (copy/paste)

Cloudflare Content Signals Policy (introduces Content-Signal):
https://blog.cloudflare.com/content-signals-policy/
https://developers.cloudflare.com/bots/additional-configurations/managed-robots-txt/

Google robots.txt spec / supported fields:
https://developers.google.com/crawling/docs/robots-txt/robots-txt-spec

Bing on robots.txt / REP behavior:
https://blogs.bing.com/webmaster/August-2009/Crawl-delay-and-the-Bing-crawler%2C-MSNBot
https://blogs.bing.com/webmaster/May-2012/To-crawl-or-not-to-crawl%2C-that-is-BingBot-s-questi

Common Crawl (CCBot) blocking:
https://commoncrawl.org/ccbot

OpenAI (GPTBot / OAI-SearchBot / ChatGPT-User):
https://openai.com/gptbot

Anthropic (how to block Claude crawlers via robots.txt):
https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler

Perplexity crawlers / robots.txt tags:
https://docs.perplexity.ai/docs/resources/perplexity-crawlers

Recommended alternatives (when Content-Signal is unsupported)

  1. Use standard robots.txt allow/disallow rules per crawler user-agent

    • Google indexing control: User-agent: Googlebot + Allow/Disallow (standard REP). (developers.google.com)
    • Bing indexing control: User-agent: bingbot + Allow/Disallow (standard REP). (blogs.bing.com)
    • Common Crawl opt-out: User-agent: CCBot + Disallow: /. (commoncrawl.org)
  2. For AI-specific crawlers, disallow their documented user-agents

    • OpenAI training crawler: disallow GPTBot (and separately manage OAI-SearchBot if you want search visibility without training). (openai.com)
    • Anthropic: disallow the relevant Anthropic bot user-agent(s) per their help doc. (support.anthropic.com)
    • Perplexity: manage PerplexityBot via robots.txt per their docs. (docs.perplexity.ai)
  3. Enforce with infrastructure controls (recommended if you need real control)

    • Cloudflare explicitly notes content signals are preferences and recommends combining them with WAF/Bot Management. (blog.cloudflare.com)


Use standard robots.txt directives instead of Content-Signal

Line 2's Content-Signal: search=yes,ai-train=no is a Cloudflare-specific extension not supported by any major crawler (Google, Bing, Common Crawl, GPTBot, Claude, PerplexityBot). The ai-train=no intent will not be enforced. Use standard robots.txt rules with per-crawler user-agent directives:

  • Disallow AI crawlers: User-agent: GPTBot, User-agent: Anthropic crawlers, User-agent: PerplexityBot + Disallow: /
  • Allow search indexing: User-agent: Googlebot, User-agent: bingbot + standard allow/disallow rules
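Put together, the per-crawler replacement for the Content-Signal line could look like the sketch below (user-agent tokens taken from the vendor docs cited above; the exact set of blocked bots is a judgment call, not part of this PR):

```text
# Search engines: allow indexing
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

# AI training crawlers: block
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else
User-agent: *
Allow: /
```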

Content-Signal: search=yes,ai-train=no
Allow: /

Sitemap: https://docs.theinterfold.com/sitemap.xml
\ No newline at end of file
⚠️ Potential issue | 🟡 Minor

Sitemap URL may be premature without generated file

Line 5 points to /sitemap.xml, but PR objectives say sitemap automation lands in a follow-up PR. If that file is not present at deploy time, this is a temporary broken discovery hint.

Suggested adjustment
- Sitemap: https://docs.theinterfold.com/sitemap.xml
+ # Add after sitemap.xml is generated/deployed
+ # Sitemap: https://docs.theinterfold.com/sitemap.xml
