
Add definition for "crawler" #13

Merged — garyillyes merged 6 commits into main from crawler on Apr 9, 2026

Conversation

garyillyes (Owner) commented Apr 8, 2026

Summary: This PR introduces a formal and precise definition of a "crawler" to establish a foundational understanding for the rest of the document.

Details:
Core Definition: Defines a crawler as autonomous, non-interactive software acting as an automated HTTP client explicitly set up for bulk resource retrieval.
Behavioral Distinction: Differentiates crawlers from standard interactive clients by highlighting their asynchronous nature and batch-processing approach to link traversal, as opposed to immediate synchronous traversal.
Operational Constraints: Clarifies that crawlers operate without human supervision, relying on algorithmic prioritization and strict compliance with protocol-level instructions like the Robots Exclusion Protocol (REP).
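The batch-processing traversal described above can be sketched as a frontier loop: links discovered while retrieving a page are queued for later processing rather than followed immediately. A minimal sketch, assuming an in-memory WEB map standing in for actual HTTP retrieval (the map and function names are illustrative, not from the draft):

```python
from collections import deque

# Stand-in for the web: maps each URI to the links found on that page.
WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}

def crawl(seed: str) -> list[str]:
    """Batch link traversal: discovered links go into a frontier queue
    for later processing instead of being followed synchronously."""
    frontier = deque([seed])  # URIs scheduled for retrieval
    seen = {seed}
    order = []
    while frontier:
        uri = frontier.popleft()
        order.append(uri)                # "retrieve" the resource
        for link in WEB.get(uri, []):    # links discovered during retrieval
            if link not in seen:
                seen.add(link)
                frontier.append(link)    # schedule, don't recurse
    return order

print(crawl("https://example.com/"))
# → ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

This is what distinguishes the crawler from an interactive client: the queue decouples link discovery from link traversal.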

Closes #12

garyillyes (Owner, Author)

@thibmeu does something like this work for y'all? re #12

thibmeu (Contributor) commented Apr 9, 2026

I would change a few details

  1. FTP seems out of place given the document is mostly about HTTP
  2. this is the only use of interactive/non-interactive in the document, and it is complemented with "real-time human supervision". That last framing seems sufficient. There might be limitations with software interacting with software, but that's out of scope for this document, if I understand correctly
  3. "bulk resource retrieval" is an implementation detail

For instance, could be something along these lines

For the purposes of this document, a crawler is an automated HTTP {{HTTP-SEMANTICS}} client that retrieves resources across one or more web sites without direct human initiation of individual requests. A crawler discovers URIs during retrieval and schedules them for later processing. It relies on algorithmic prioritization and protocol-level instructions such as the Robots Exclusion Protocol {{REP}} to govern its behavior.

another consideration: there is a convention/definition section. It could be moved directly after the intro and the definition would be placed there.
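The REP-governed scheduling in the proposed definition could look roughly like this, using Python's standard urllib.robotparser as the protocol-level gate a crawler checks before enqueuing a URI. The robots.txt rules, agent name, and URIs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, parsed directly from lines
# (no network fetch needed for this sketch).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def may_schedule(uri: str, agent: str = "examplebot") -> bool:
    """A compliant crawler only enqueues URIs that the REP rules permit."""
    return rp.can_fetch(agent, uri)

print(may_schedule("https://example.com/docs"))       # allowed → True
print(may_schedule("https://example.com/private/x"))  # disallowed → False
```

In a real crawler this check would sit between URI discovery and frontier insertion, alongside whatever algorithmic prioritization the operator applies.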

garyillyes (Owner, Author)
Yeah, that does sound better. Stole it and added it to the draft.

garyillyes merged commit 3213174 into main on Apr 9, 2026
2 checks passed
garyillyes deleted the crawler branch on April 9, 2026 at 19:00


Development

Successfully merging this pull request may close these issues.

Define better the target of the best practices
