
Define mandatory content for crawler documentation pages #15

@garyillyes

Description


Section 2.6 of the draft mandates that crawler operators host a documentation page, yet it does not specify what that page must actually contain. As written, the text suggests providing a contact address, a REP example, and a vague explanation of data usage. This level of ambiguity is inefficient for both the operator and the site owner. If we want crawlers to be transparent, we should define the specific technical parameters they are required to disclose so that site owners can make informed decisions about their traffic.

The documentation should include a standardized checklist of required and recommended fields:

  • Crawler Identity: The specific User-Agent strings and substrings used for identification.
  • Purpose: Clear disclosure of whether the data is used for public search, private LLM training, or research.
  • Technical Behavior: An explicit statement of whether the crawler renders JavaScript or only fetches source content.
  • Verification: A link to the JAFAR-formatted IP ranges as defined in Section 2.5, plus any other supported verification methods (see the sketch after this list).
  • Opt-out: Example robots.txt snippets for blocking the specific crawler (an example follows below).
  • Exemptions: If the crawler has exempted itself from certain best practices, the page must identify which practice(s) and give the rationale for each.
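
To make the Verification field concrete, here is a minimal sketch of how a site owner might check a visiting IP against an operator's published ranges. The URL, the `prefixes`/`ipv4Prefix`/`ipv6Prefix` JSON shape, and the `is_official_crawler` helper are all assumptions for illustration; the actual schema is whatever the JAFAR definition in Section 2.5 specifies.

```python
import ipaddress
import json
from urllib.request import urlopen

# Hypothetical URL; the documentation page would link to the real one.
RANGES_URL = "https://crawler.example.com/ranges.json"

def load_prefixes(url=RANGES_URL):
    """Fetch the published IP ranges and parse them into network objects.

    Assumes a JSON body shaped like
    {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]};
    the actual JAFAR schema is defined in Section 2.5 of the draft.
    """
    with urlopen(url) as resp:
        data = json.load(resp)
    networks = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks

def is_official_crawler(client_ip, networks):
    """Return True if the requesting IP falls inside a published range."""
    addr = ipaddress.ip_address(client_ip)
    # Only compare against networks of the same IP version.
    return any(addr in net for net in networks if addr.version == net.version)
```

Publishing the ranges at a stable, documented URL is what makes a check like this possible without resorting to reverse-DNS lookups.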
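And for the Opt-out field, a snippet of the kind the page should carry, assuming a hypothetical crawler that identifies itself as `ExampleBot` in its User-Agent:

```
# Block ExampleBot from the entire site (RFC 9309 syntax):
User-agent: ExampleBot
Disallow: /
```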
