From f80832d91415c4ae96988407f9fc0410b2dfba8c Mon Sep 17 00:00:00 2001
From: Gary Illyes <51719901+garyillyes@users.noreply.github.com>
Date: Wed, 8 Apr 2026 21:49:14 +0200
Subject: [PATCH 1/3] Define what a crawler precisely is.

---
 draft-illyes-aipref-cbcp.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/draft-illyes-aipref-cbcp.md b/draft-illyes-aipref-cbcp.md
index a557ad0..c287ca3 100644
--- a/draft-illyes-aipref-cbcp.md
+++ b/draft-illyes-aipref-cbcp.md
@@ -59,6 +59,17 @@ identified and how their behavior can be influenced. Therefore, crawler
 operators are asked to follow the best practices for crawling outlined in this
 document.
 
+For the purpose of the document, a crawler is an autonomous, non-interactive
+software that functions as an automated HTTP {{HTTP-SEMANTICS}} client set up
+for bulk resource retrieval. Unlike standard interactive clients, a crawler does
+not perform immediate, synchronous link traversal upon URI discovery and
+instead it utilizes an asynchronous crawl service to manage batch processing of
+resources identified typically from previously downloaded resources. It operates
+without real-time human supervision, relying on algorithmic prioritization and
+compliance with protocol-level instructions, such as the
+Robots Exclusion Protocol {{REP}}, to govern its behavior across HTTP
+and supplementary URI-addressable protocols like FTP.
+
 To further assist website owners, it should also be considered to create a
 central registry where website owners can look up well-behaved crawlers. Note
 that while self-declared research crawlers, including privacy and malware

From 865d665253aa4acf78b80e94416372d465274947 Mon Sep 17 00:00:00 2001
From: Gary Illyes <51719901+garyillyes@users.noreply.github.com>
Date: Wed, 8 Apr 2026 21:49:14 +0200
Subject: [PATCH 2/3] Define what a crawler precisely is.

Integrate comments
---
 draft-illyes-aipref-cbcp.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/draft-illyes-aipref-cbcp.md b/draft-illyes-aipref-cbcp.md
index a557ad0..1e33ef3 100644
--- a/draft-illyes-aipref-cbcp.md
+++ b/draft-illyes-aipref-cbcp.md
@@ -59,6 +59,13 @@ identified and how their behavior can be influenced. Therefore, crawler
 operators are asked to follow the best practices for crawling outlined in this
 document.
 
+For the purposes of this document, a crawler is an automated
+HTTP {{HTTP-SEMANTICS}} client that retrieves resources across one or more web
+sites without direct human initiation of individual requests. A crawler
+discovers URIs during retrieval and schedules them for later processing. It
+relies on algorithmic prioritization and protocol-level instructions such as the
+Robots Exclusion Protocol {{REP}} to govern its behavior.
+
 To further assist website owners, it should also be considered to create a
 central registry where website owners can look up well-behaved crawlers. Note
 that while self-declared research crawlers, including privacy and malware

From 63ab5392f694b4cb9edddd0506db777fbcf908b0 Mon Sep 17 00:00:00 2001
From: Gary Illyes <51719901+garyillyes@users.noreply.github.com>
Date: Thu, 9 Apr 2026 20:59:26 +0200
Subject: [PATCH 3/3] Pesky whitespace

---
 draft-illyes-aipref-cbcp.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/draft-illyes-aipref-cbcp.md b/draft-illyes-aipref-cbcp.md
index b17ffe3..adeaf3a 100644
--- a/draft-illyes-aipref-cbcp.md
+++ b/draft-illyes-aipref-cbcp.md
@@ -59,7 +59,7 @@ identified and how their behavior can be influenced. Therefore, crawler
 operators are asked to follow the best practices for crawling outlined in this
 document.
 
-For the purposes of this document, a crawler is an automated 
+For the purposes of this document, a crawler is an automated
 HTTP {{HTTP-SEMANTICS}} client that retrieves resources across one or more web
 sites without direct human initiation of individual requests. A crawler
 discovers URIs during retrieval and schedules them for later processing. It
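The definition patch 2 settles on — a crawler discovers URIs during retrieval, schedules them for later processing rather than following links synchronously, and lets protocol-level instructions govern its behavior — can be sketched in a few lines. This is a minimal illustration, not anything from the draft: the `fetch` callback and `disallowed_prefixes` allow-list are hypothetical stand-ins (a real crawler would parse `/robots.txt` per the Robots Exclusion Protocol, e.g. with `urllib.robotparser`):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def is_allowed(url, disallowed_prefixes):
    """Stand-in for a Robots Exclusion Protocol check."""
    path = urlparse(url).path or "/"
    return not any(path.startswith(p) for p in disallowed_prefixes)

def crawl(seed, fetch, disallowed_prefixes, limit=100):
    """Retrieve resources, scheduling discovered URIs for later processing
    instead of traversing links synchronously on discovery."""
    frontier = deque([seed])   # the asynchronous crawl schedule (FIFO here;
                               # real crawlers use algorithmic prioritization)
    seen, fetched = {seed}, []
    while frontier and len(fetched) < limit:
        url = frontier.popleft()
        if not is_allowed(url, disallowed_prefixes):
            continue           # protocol-level instructions govern behavior
        links = fetch(url)     # retrieval also yields newly discovered URIs
        fetched.append(url)
        for link in links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)  # scheduled, not followed now
    return fetched
```

Driven against a simulated site, a disallowed prefix keeps the crawler out of that subtree even when a link points into it, while everything else is fetched in schedule order.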