Commit fe8a89d

Merge branch 'main' into crawler
2 parents: f80832d + cbf1014

1 file changed: draft-illyes-aipref-cbcp.md (10 additions & 10 deletions)
@@ -43,7 +43,7 @@ informative:
 
 --- abstract
 
-This document describes best pratices for web crawlers.
+This document describes best practices for web crawlers.
 
 
 --- middle
@@ -52,7 +52,7 @@ This document describes best pratices for web crawlers.
 
 Automatic clients, such as crawlers and bots, are used to access web resources,
 including indexing for search engines or, more recently, training data for new
-artifical intelligence (AI) applications. As crawling activity increases,
+artificial intelligence (AI) applications. As crawling activity increases,
 automatic clients must behave appropriately and respect the constraints of the
 resources they access. This includes clearly documenting how they can be
 identified and how their behavior can be influenced. Therefore, crawler
@@ -74,7 +74,7 @@ To further assist website owners, it should also be considered to create a
 central registry where website owners can look up well-behaved crawlers. Note
 that while self-declared research crawlers, including privacy and malware
 discovery crawlers, and contractual crawlers are welcome to adopt these practices,
-due to the nature of their relationsh with sites, they may exempt themselves
+due to the nature of their relationship with sites, they may exempt themselves
 from any of the Crawler Best Practices with a rationale.
 
 
@@ -87,9 +87,9 @@ vast majority of large-scale crawlers on the Internet:
 2. Crawlers must be easily identifiable through their user agent string.
 3. Crawlers must not interfere with the regular operation of a site.
 4. Crawlers must support caching directives.
-5. Crawlers must expose the ranges they are crawling from in a standardized format.
+5. Crawlers must expose the ranges they are crawling from in a standardized format.
 6. Crawlers must expose a page that explains how the crawling can be blocked, whether
-the page is rendered, amd how the crawled data is used.
+the page is rendered, and how the crawled data is used.
 
 
 
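Principle 5 in the hunk above leaves the publication format open; one common approach is a JSON document listing the crawler's CIDR prefixes, which a site can check incoming requests against. Below is a minimal site-side sketch of such a check. The `prefixes` schema, the fetch URL mentioned in the comment, and the sample ranges are illustrative assumptions, not part of the draft:

```python
import ipaddress
import json

# Hypothetical published range document, e.g. fetched from
# https://crawler.example/ranges.json (URL and schema are assumptions).
RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""

def load_prefixes(doc: str):
    """Parse the range document into ip_network objects."""
    data = json.loads(doc)
    nets = []
    for entry in data["prefixes"]:
        for key in ("ipv4Prefix", "ipv6Prefix"):
            if key in entry:
                nets.append(ipaddress.ip_network(entry[key]))
    return nets

def is_declared_crawler(ip: str, nets) -> bool:
    """True if the client IP falls inside any published crawl range."""
    addr = ipaddress.ip_address(ip)
    # Membership across mismatched IP versions is simply False.
    return any(addr in net for net in nets)

nets = load_prefixes(RANGES_JSON)
print(is_declared_crawler("192.0.2.17", nets))    # inside the sample v4 range
print(is_declared_crawler("198.51.100.1", nets))  # outside all sample ranges
```

Publishing ranges this way lets sites distinguish the declared crawler from impostors spoofing its user agent string.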
@@ -106,7 +106,7 @@ by the REP, crawlers further need to respect the `X-robots-tag` in the HTTP head
 
 As outlined in {{Section 2.2.1 of REP}} (Robots Exclusion Protocol; REP),
 the HTTP request header 'User-Agent' should clearly identify the crawler,
-usually by including a URL that hosts the crawler's descrtion. For example:
+usually by including a URL that hosts the crawler's description. For example:
 
 `User-Agent: Mozilla/5.0 (compatible; ExampleBot/0.1; +https://www.example.com/bot.html)`.
 
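The `+URL` convention in the example above can be checked mechanically. A small sketch that extracts the crawler token, version, and documentation URL from such a User-Agent string; the regex encodes an assumption about the common `(compatible; Name/Version; +URL)` shape and is not a rule taken from REP:

```python
import re

# Matches the "(compatible; Name/Version; +URL)" comment convention shown above.
UA_PATTERN = re.compile(
    r"compatible;\s*(?P<name>[A-Za-z0-9_-]+)/(?P<version>[\w.]+);\s*"
    r"\+(?P<url>https?://\S+?)\)"
)

def parse_crawler_ua(user_agent: str):
    """Return (name, version, info_url) if the UA follows the convention, else None."""
    m = UA_PATTERN.search(user_agent)
    if not m:
        return None
    return m.group("name"), m.group("version"), m.group("url")

ua = "Mozilla/5.0 (compatible; ExampleBot/0.1; +https://www.example.com/bot.html)"
print(parse_crawler_ua(ua))
# → ('ExampleBot', '0.1', 'https://www.example.com/bot.html')
```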
@@ -121,19 +121,19 @@ identify both the crawler owner and its purpose as much as reasonably possible.
 
 Depending on a site's setup (computing resources and software efficiency) and its
 size, crawling may slow down the site or even take it offline altogether. Crawler
-operators must ensure that their crawlers are equped with back-out logic that
+operators must ensure that their crawlers are equipped with back-out logic that
 relies on at least the standard signals defined by {{Section 15.6 of HTTP-SEMANTICS}},
 preferably also additional heuristics such as a change in the relative response time
 of the server.
 
 Therefore, crawlers should log already visited URLs, the number of requests sent to
 each resource, and the respective HTTP status codes in the responses, especially if
-errors occur, to prevent repeatedly crawling the same sourceerrors occur, to prevent
+errors occur, to prevent repeatedly crawling the same source
 repeatedly crawling the same source. Using the same data, crawlers should, on a best
 effort basis, crawl the site at times of the day when the site is estimated to have
 fewer human visitors.
 
-Generally, crawlers should avoid sending multle requests to the same resources
+Generally, crawlers should avoid sending multiple requests to the same resources
 at the same time and should limit the crawling speed to prevent server overload, if
 possible, following the limits outlined in the REP protocol. Additionally, resources
 should not be re-crawled too often. Ideally, crawlers should restrict the depth of
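The back-out and logging behavior described in this hunk can be sketched as a per-host controller that widens its request interval on 429/503 and other 5xx responses, honours an explicit `Retry-After` delay when present, and keeps the visited-URL and error log used for dedup. The thresholds and multipliers here are illustrative assumptions, not values from the draft:

```python
class HostBackoff:
    """Per-host crawl pacing driven by HTTP status codes in responses."""

    def __init__(self, base_delay=1.0, max_delay=3600.0):
        self.base_delay = base_delay  # polite default interval (seconds)
        self.max_delay = max_delay    # never back off longer than this
        self.delay = base_delay
        self.seen = set()             # already visited URLs (dedup log)
        self.errors = {}              # URL -> consecutive error count

    def should_fetch(self, url):
        """Skip URLs already crawled or repeatedly failing."""
        return url not in self.seen and self.errors.get(url, 0) < 3

    def record(self, url, status, retry_after=None):
        """Update pacing from a response; return seconds to wait before the next request."""
        self.seen.add(url)
        if status == 429 or 500 <= status < 600:
            self.errors[url] = self.errors.get(url, 0) + 1
            # Honour an explicit Retry-After, otherwise double the interval.
            self.delay = min(self.max_delay,
                             retry_after if retry_after else self.delay * 2)
        else:
            self.errors.pop(url, None)
            # Ease back toward the polite default after successes.
            self.delay = max(self.base_delay, self.delay / 2)
        return self.delay
```

A response-time heuristic, as the draft suggests, could feed the same `delay` from a rolling latency baseline; the status-code path above is only the minimum signal set.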
@@ -146,7 +146,7 @@ unless explicitly agreed upon with the website owner.
 Crawlers should primarily access resources using HTTP GET requests, resorting to
 other methods (e.g., POST, PUT) only if there is a prior agreement with the publisher
 or if the publisher's content management system automatically makes those calls when
-JavaScrt runs. Generally, the load caused by executing JavaScrt should be
+JavaScrt runs. Generally, the load caused by executing JavaScript should be
 carefully considered or even avoided whenever possible.
 
 
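A conditional GET is the cheapest way to combine the GET-only guidance above with the caching-directives principle when re-crawling: replay the validators from the previous response so the server can answer 304 Not Modified instead of resending the body. A sketch of the header handling, with a helper name that is illustrative rather than taken from the draft:

```python
def conditional_headers(prev_response_headers):
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    etag = prev_response_headers.get("ETag")
    if etag:
        headers["If-None-Match"] = etag
    last_modified = prev_response_headers.get("Last-Modified")
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

prev = {"ETag": '"abc123"', "Last-Modified": "Tue, 01 Oct 2024 00:00:00 GMT"}
print(conditional_headers(prev))
# → {'If-None-Match': '"abc123"', 'If-Modified-Since': 'Tue, 01 Oct 2024 00:00:00 GMT'}
```

On a 304, the crawler reuses its stored copy and the site pays only for a header exchange.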