
Add definition for "crawler" #13

Merged — garyillyes merged 6 commits into main from crawler on Apr 9, 2026

Conversation

garyillyes (Owner) commented Apr 8, 2026

Summary: This PR introduces a formal and precise definition of a "crawler" to establish a foundational understanding for the rest of the document.

Details:
Core Definition: Defines a crawler as autonomous, non-interactive software acting as an automated HTTP client explicitly set up for bulk resource retrieval.
Behavioral Distinction: Differentiates crawlers from standard interactive clients by highlighting their asynchronous nature and batch-processing approach to link traversal, as opposed to immediate synchronous traversal.
Operational Constraints: Clarifies that crawlers operate without human supervision, relying on algorithmic prioritization and strict compliance with protocol-level instructions like the Robots Exclusion Protocol (REP).
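The batch-processing traversal described above can be sketched as a frontier loop: links discovered while retrieving a page are queued for later processing rather than followed immediately. A minimal sketch, assuming an in-memory WEB map standing in for actual HTTP retrieval (the map and function names are illustrative, not from the draft):

```python
from collections import deque

# Stand-in for the web: maps each URI to the links found on that page.
WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}

def crawl(seed: str) -> list[str]:
    """Batch link traversal: discovered links go into a frontier queue
    for later processing instead of being followed synchronously."""
    frontier = deque([seed])  # URIs scheduled for retrieval
    seen = {seed}
    order = []
    while frontier:
        uri = frontier.popleft()
        order.append(uri)                # "retrieve" the resource
        for link in WEB.get(uri, []):    # links discovered during retrieval
            if link not in seen:
                seen.add(link)
                frontier.append(link)    # schedule, don't recurse
    return order

print(crawl("https://example.com/"))
# → ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

This is what distinguishes the crawler from an interactive client: the queue decouples link discovery from link traversal.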

Closes #12

garyillyes (Owner, Author)

@thibmeu does something like this work for y'all? re #12

thibmeu (Contributor) commented Apr 9, 2026

I would change a few details

  1. FTP seems out of place given the document is mostly about HTTP
  2. this is the only use of interactive/non-interactive in the document, and it is complemented with "real-time human supervision". That last framing seems sufficient. There might be limitations with software interacting with software, but that's out of scope for this document, if I understand correctly
  3. "bulk resource retrieval" is an implementation detail

For instance, could be something along these lines

For the purposes of this document, a crawler is an automated HTTP {{HTTP-SEMANTICS}} client that retrieves resources across one or more web sites without direct human initiation of individual requests. A crawler discovers URIs during retrieval and schedules them for later processing. It relies on algorithmic prioritization and protocol-level instructions such as the Robots Exclusion Protocol {{REP}} to govern its behavior.

another consideration: there is a convention/definition section. It could be moved directly after the intro and the definition would be placed there.
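The REP-governed scheduling in the proposed definition could look roughly like this, using Python's standard urllib.robotparser as the protocol-level gate a crawler checks before enqueuing a URI. The robots.txt rules, agent name, and URIs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, parsed directly from lines
# (no network fetch needed for this sketch).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def may_schedule(uri: str, agent: str = "examplebot") -> bool:
    """A compliant crawler only enqueues URIs that the REP rules permit."""
    return rp.can_fetch(agent, uri)

print(may_schedule("https://example.com/docs"))       # allowed → True
print(may_schedule("https://example.com/private/x"))  # disallowed → False
```

In a real crawler this check would sit between URI discovery and frontier insertion, alongside whatever algorithmic prioritization the operator applies.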

garyillyes (Owner, Author)
Yeah, that does sound better. Stole it and added it to the draft.

garyillyes merged commit 3213174 into main on Apr 9, 2026
2 checks passed
garyillyes deleted the crawler branch on April 9, 2026 at 19:00


Development

Successfully merging this pull request may close these issues.

Define better the target of the best practices
