@@ -43,7 +43,7 @@ informative:

 --- abstract

-This document describes best pratices for web crawlers.
+This document describes best practices for web crawlers.


 --- middle
@@ -52,7 +52,7 @@ This document describes best pratices for web crawlers.

 Automatic clients, such as crawlers and bots, are used to access web resources,
 including indexing for search engines or, more recently, training data for new
-artifical intelligence (AI) applications. As crawling activity increases,
+artificial intelligence (AI) applications. As crawling activity increases,
 automatic clients must behave appropriately and respect the constraints of the
 resources they access. This includes clearly documenting how they can be
 identified and how their behavior can be influenced. Therefore, crawler
@@ -74,7 +74,7 @@ To further assist website owners, it should also be considered to create a
 central registry where website owners can look up well-behaved crawlers. Note
 that while self-declared research crawlers, including privacy and malware
 discovery crawlers, and contractual crawlers are welcome to adopt these practices,
-due to the nature of their relationsh with sites, they may exempt themselves
+due to the nature of their relationship with sites, they may exempt themselves
 from any of the Crawler Best Practices with a rationale.


@@ -87,9 +87,9 @@ vast majority of large-scale crawlers on the Internet:
 2. Crawlers must be easily identifiable through their user agent string.
 3. Crawlers must not interfere with the regular operation of a site.
 4. Crawlers must support caching directives.
-5. Crawlers must expose the ranges they are crawling from in a standardized format.
+5. Crawlers must expose the IP ranges they are crawling from in a standardized format.
 6. Crawlers must expose a page that explains how the crawling can be blocked, whether
-the page is rendered, amd how the crawled data is used.
+the page is rendered, and how the crawled data is used.



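The REP requirement in practice 1 above maps directly onto Python's standard-library robots.txt parser; a minimal sketch, where the product token `ExampleBot` and the rules themselves are hypothetical illustrations, not taken from the draft:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration; a real crawler would
# fetch the live file from https://<host>/robots.txt before requesting pages.
rules = [
    "User-agent: ExampleBot",
    "Disallow: /private/",
    "Crawl-delay: 10",
]

rp = RobotFileParser()
rp.parse(rules)

# Check each candidate URL against the rules before fetching it.
blocked = rp.can_fetch("ExampleBot", "https://www.example.com/private/data")
allowed = rp.can_fetch("ExampleBot", "https://www.example.com/index.html")
delay = rp.crawl_delay("ExampleBot")  # seconds to pause between requests
```

A crawler that consults the parser this way before every request, and honours the returned crawl delay, satisfies the first practice without any site-specific logic.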
@@ -106,7 +106,7 @@ by the REP, crawlers further need to respect the `X-robots-tag` in the HTTP head

 As outlined in {{Section 2.2.1 of REP}} (Robots Exclusion Protocol; REP),
 the HTTP request header 'User-Agent' should clearly identify the crawler,
-usually by including a URL that hosts the crawler's descrtion . For example :
+usually by including a URL that hosts the crawler's description. For example:

 `User-Agent : Mozilla/5.0 (compatible; ExampleBot/0.1; +https://www.example.com/bot.html)`.

@@ -121,19 +121,19 @@ identify both the crawler owner and its purpose as much as reasonably possible.

 Depending on a site's setup (computing resources and software efficiency) and its
 size, crawling may slow down the site or even take it offline altogether. Crawler
-operators must ensure that their crawlers are equped with back-out logic that
+operators must ensure that their crawlers are equipped with back-out logic that
 relies on at least the standard signals defined by {{Section 15.6 of HTTP-SEMANTICS}},
 preferably also additional heuristics such as a change in the relative response time
 of the server.

 Therefore, crawlers should log already visited URLs, the number of requests sent to
 each resource, and the respective HTTP status codes in the responses, especially if
-errors occur, to prevent repeatedly crawling the same sourceerrors occur, to prevent
+errors occur, to prevent
 repeatedly crawling the same source. Using the same data, crawlers should, on a best
 effort basis, crawl the site at times of the day when the site is estimated to have
 fewer human visitors.

-Generally, crawlers should avoid sending multle requests to the same resources
+Generally, crawlers should avoid sending multiple requests to the same resources
 at the same time and should limit the crawling speed to prevent server overload, if
 possible, following the limits outlined in the REP protocol. Additionally, resources
 should not be re-crawled too often. Ideally, crawlers should restrict the depth of
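The back-out logic this hunk calls for, keyed to standard status-code signals, could be sketched as a pure delay-computation step; the exponential base, the 300-second cap, and the treatment of 429 alongside 5xx are illustrative assumptions, not requirements of the draft:

```python
import email.utils
import time


def backoff_seconds(status, attempt, retry_after=None, cap=300.0):
    """Return how long to pause before retrying a host, or 0.0 to proceed.

    status      -- HTTP status code of the last response
    attempt     -- number of consecutive failed attempts so far
    retry_after -- value of a Retry-After header, if the server sent one
    cap         -- illustrative upper bound on the delay (an assumption)
    """
    # An explicit Retry-After header always takes precedence.
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            # Retry-After may also be an HTTP-date rather than seconds.
            when = email.utils.parsedate_to_datetime(retry_after)
            return min(max(when.timestamp() - time.time(), 0.0), cap)
    # Server errors (5xx) and rate limiting (429): exponential back-off.
    if status >= 500 or status == 429:
        return min(2.0 ** attempt, cap)
    return 0.0
```

For example, `backoff_seconds(503, 3)` yields an 8-second pause, while a `Retry-After: 120` on a 429 response overrides the exponential schedule with the server's own figure.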
@@ -146,7 +146,7 @@ unless explicitly agreed upon with the website owner.
 Crawlers should primarily access resources using HTTP GET requests, resorting to
 other methods (e.g., POST, PUT) only if there is a prior agreement with the publisher
 or if the publisher's content management system automatically makes those calls when
-JavaScrt runs. Generally, the load caused by executing JavaScrt should be
+JavaScript runs. Generally, the load caused by executing JavaScript should be
 carefully considered or even avoided whenever possible.


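The GET-only access pattern in this last hunk combines naturally with the caching-directive practice listed earlier: a crawler that stores validators from previous responses can issue conditional GETs so unchanged resources are answered with a bodiless 304. A minimal sketch, where the helper name and the stored validator are hypothetical:

```python
def conditional_get_headers(ua, etag=None, last_modified=None):
    """Build headers for a polite conditional GET.

    ua            -- the crawler's self-identifying User-Agent string
    etag          -- ETag validator saved from a previous response, if any
    last_modified -- Last-Modified value saved from a previous response, if any
    """
    headers = {"User-Agent": ua}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


ua = "Mozilla/5.0 (compatible; ExampleBot/0.1; +https://www.example.com/bot.html)"
# Revalidate a page crawled earlier instead of re-downloading it.
headers = conditional_get_headers(ua, etag='"abc123"')
```

If the server replies 304 Not Modified, no body is transferred and the cached copy stays valid, which directly reduces the load the draft asks crawlers to minimize.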