Summary: This PR introduces a formal and precise definition of a "crawler" to establish a foundational understanding for the rest of the document.
Details:
Core Definition: Defines a crawler as an autonomous, non-interactive software acting as an automated HTTP client explicitly set up for bulk resource retrieval.
Behavioral Distinction: Differentiates crawlers from standard interactive clients by highlighting their asynchronous nature and batch-processing approach to link traversal, as opposed to immediate synchronous traversal.
Operational Constraints: Clarifies that crawlers operate without human supervision, relying on algorithmic prioritization and strict compliance with protocol-level instructions like the Robots Exclusion Protocol (REP).
FTP seems out of place given the document is mostly about HTTP
this is the only use of interactive/non-interactive in the document, and it is complemented by "real-time human supervision". That last framing seems sufficient. There might be limitations with software interacting with software, but that's out of scope for this document, if I understand correctly
"bulk resource retrieval" is an implementation detail
For instance, it could be something along these lines:
For the purposes of this document, a crawler is an automated HTTP {{HTTP-SEMANTICS}} client that retrieves resources across one or more web sites without direct human initiation of individual requests. A crawler discovers URIs during retrieval and schedules them for later processing. It relies on algorithmic prioritization and protocol-level instructions such as the Robots Exclusion Protocol {{REP}} to govern its behavior.
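As a rough illustration of the behavior that definition describes (a sketch for discussion, not draft text), here is a minimal crawler loop in Python. The robots.txt content and link graph are hypothetical stand-ins for real HTTP responses; `urllib.robotparser` from the standard library handles the REP check.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would fetch this from the
# target site before issuing any other request.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
""".splitlines()

# Hypothetical link graph standing in for HTML retrieved over HTTP.
LINKS = {
    "https://example.com/": ["https://example.com/a",
                             "https://example.com/private/x"],
    "https://example.com/a": [],
}

def crawl(seed, user_agent="example-crawler"):
    rp = RobotFileParser()
    rp.parse(ROBOTS_TXT)          # REP compliance: consult robots.txt first
    frontier = deque([seed])      # discovered URIs scheduled for later processing
    seen, fetched = {seed}, []
    while frontier:
        uri = frontier.popleft()  # algorithmic prioritization (FIFO here)
        if not rp.can_fetch(user_agent, uri):
            continue              # skip disallowed resources
        fetched.append(uri)       # stand-in for an actual GET request
        for link in LINKS.get(uri, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)  # schedule; don't traverse immediately
    return fetched

print(crawl("https://example.com/"))
```

Note how URI discovery and retrieval are decoupled: links are appended to a frontier and processed later, rather than followed synchronously, which is the batch-traversal distinction the definition draws.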
another consideration: there is a convention/definition section. It could be moved directly after the intro and the definition would be placed there.
Yeah, that does sound better. Stole it and added it to the draft.
Closes #12