recluse is both a crate and a guide to help you write web crawlers in Rust.
It was originally inspired by Scrapy, a legendary Python framework for writing web scrapers
(which now calls itself a "data extraction framework"), but it has deviated in its architecture to be more Rust-idiomatic:
- Every parsing function in Scrapy yields a host of different objects (items, URLs) into a single work queue. `recluse` still encourages you to generate items when you've found something interesting, and URLs (or web request objects) when you want to signal a link that needs to be traversed. But we work in a strongly-typed language, so instead of forcing users to maintain "giga-enums" that accommodate every "thing" that moves around, in `recluse` every type of "thing" has its own typed queue and is statically bound to its consumer (no dynamic dispatch). Thanks to `tower`, this gives you more granular control per item type over things like throttles, retries, etc. (see the sketch after this list).
- There is no "master object" spider and no "context" provided to a parsing function. If you need stateful behaviour, it is recommended to add it with `tower` layers. Parse pages with `scraper`. Compose your workers into a working program with `tokio`. As you go through the examples, you'll see that not a lot of the core code comes from `recluse`. It's mostly boilerplate and common crawler tasks like downloading pages. Your program structure and control flow remain familiar and understandable.
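To make the "one typed pipe per kind of thing" idea concrete, here is a minimal sketch. The builder API is walked through in detail below; the second pipe and its hypothetical `Quote` item type are purely illustrative and not part of any example.

```rust
// One pipe per kind of "thing". Each pipe's item type is fixed at compile
// time by the service its worker runs, so there is no giga-enum and no
// dynamic dispatch between producers and consumers.
let (page_pipe, page_worker) = WorkPipeBuilder::default().build();   // carries page requests
let (quote_pipe, quote_worker) = WorkPipeBuilder::default().build(); // carries scraped `Quote` items
```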
You might also notice that the architecture that recluse proposes is essentially an
actor model but with fewer steps.
This was not intended, but did not come as a surprise, either.
The actor model is a great solution for many problems in software engineering, especially in
long-running, concurrent programs, by decoupling components and narrowing the interface between them
to a simple message. This simplifies state management and increases predictability of responses.
That being said, recluse is not a full-blown actor framework. Actors cannot have local state,
and there are no supervisors (although you could build both with tower layers). Actor references
are static, with no addressing (your handler either has a channel sender or not) and there is no
overarching "actor system" (unless you count your async runtime as one).
recluse tries its best not to force you into using specific crates or frameworks.
That being said, some choices have been made to make development more ergonomic:
- Most notably, `recluse` depends on `tower` for composing your workers. You will write your own [tower::Service] that processes a page and add [tower::Layer]s such as concurrency and rate limiting, retries, timeouts, etc. `recluse` even provides layers of its own, for crawler-specific code.
- We use the `log` facade, but don't depend on any specific logger. (Examples use `colog`, but that's not a library dependency.)
- The fundamental building block of `recluse` makes heavy use of `tokio`'s `mpsc::channel`. That being said, `tokio` as a runtime is not a hard dependency and the channels are runtime-agnostic. You should be able to use this crate with any async runtime.
- `reqwest` is the most widely used HTTP client crate. Through it, `recluse` provides a convenient wrapper that downloads pages for you (as raw text, or JSON-deserialized with `serde`), so you can focus on processing. It's gated by an (optional) `reqwest` feature, so you're opted out of including it by default.
- `leaky-bucket` is an implementation of the leaky bucket algorithm. This crate provides a rate limiter layer that uses it internally. This dependency is only pulled in if you enable the `limit` feature.
$ cargo add recluse
If you want all the optional features (pulls reqwest and leaky-bucket, which you might not want/need):
$ cargo add recluse -F full
Alternatively, add directly to Cargo.toml under [dependencies]:
recluse = "0.1.1"
I recommend going through the examples included in this repository to understand how crawlers are built with recluse.
Here, we'll go through the first one, which simply extracts quotes from Quotes to Scrape, a popular website for developing and testing scrapers and crawlers. It will be a simple site-scoped spider (that is, it doesn't wander off to other domains) that finds the link to the next page on every page it visits and traverses them all.
We start by creating a "work pipe", which is recluse's fundamental construct. It is a wrapper around a [tokio::sync::mpsc::channel]
with an atomic counter that keeps track of how many items are in it (in fact, it tracks all work pipes in your program
with a single counter):
let (page_pipe, page_worker) = WorkPipeBuilder::new().build();

The pipe part is, essentially, the Sender part of the channel, and you queue into it requests to download pages
(in this case, it's reqwest::Request objects, but you could use anything, e.g. your abstraction over it).
The other end of the channel is unavailable to you because it's wrapped by the worker - a simple-ish loop
that consumes work items by putting them through your processing function until there is no more work.
We then create the core function, which will be processing the pages as they are coming, one at a time.
This is the "meat" of our program, everything else is really either boilerplate or plumbing
(which recluse tries to minimize through utilities it provides):
let page_processor = {
let page_pipe = page_pipe.clone();
move |page_body: String| {
let page_pipe = page_pipe.clone();
async move {
let (quotes, next_page_request) = parse_quotes_page(page_body)
.context("Could not parse quotes page")?;
debug!("{:?}", quotes);
if let Some(next_page_request) = next_page_request {
page_pipe.submit_work(next_page_request).await
.context("Queue next page request")?;
}
anyhow::Ok(())
}
}
};

Everything in recluse happens asynchronously (usually powered by the tokio runtime), therefore we ultimately produce
an async closure that takes a String and returns (). Because a lot of different things can come out of a processor (items of interest,
links to further pages), we avoid forcing a specific return signature: unlike Scrapy, (a) we work in a strongly typed language,
and (b) we don't have generators/coroutines like Python does (not yet, anyway; Rust is cooking).
Instead, outgoing communication must be done via pipes. If your processor needs to send an item, or link, for processing,
just clone the pipe, move it into the closure, and submit work into it. Interestingly, in this case, our worker is simultaneously
the producer and consumer of a single channel. That could cause problems if it were a bare, say, tokio::sync::mpsc::channel,
but pipes contain additional logic that should resolve them. You can read more musings about this later,
under "Regarding work counters".
What parse_quotes_page does is not too important, but it's essentially what you'd write to parse the HTML and look for things of interest.
In this example, it returns the list of quotes it found, and (an Option of) a link to the next page.
We skip actually saving the quotes anywhere (you can see actual database interaction in later examples),
but a call to some database_pipe.submit_work(quote).await would happen here in an actual spider.
At the end we queue up a link to the next page. By the time this closure finishes, it will be available for processing.
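To illustrate the database_pipe idea mentioned above, here is a hedged sketch of a second pipe dedicated to scraped quotes. The `Quote` type and `store_quote` function are stand-ins invented for this sketch, not part of recluse or the example.

```rust
// A hypothetical quote-storing pipeline. `Quote` and `store_quote` are
// illustrative stand-ins; a real spider would write to a database here.
#[derive(Debug)]
struct Quote {
    text: String,
    author: String,
}

async fn store_quote(quote: &Quote) -> anyhow::Result<()> {
    log::info!("storing quote by {}: {}", quote.author, quote.text);
    Ok(())
}

let (database_pipe, database_worker) = WorkPipeBuilder::default().build();
let quote_service = tower::ServiceBuilder::new()
    .service_fn(|quote: Quote| async move { store_quote(&quote).await });
// Join this handle alongside the page worker at shutdown.
let database_worker_task = tokio::spawn(async move { database_worker.work(quote_service).await });

// The page processor would clone `database_pipe` into its closure and call:
// database_pipe.submit_work(quote).await.context("Queue quote for storage")?;
```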
Note that the examples use anyhow for error handling, however the crate itself does not, and returns custom error objects.
That is to say, using recluse does not pull anyhow as a dependency.
Next part uses tower's service composition pattern to actually produce a tower::Service:
let quotes_page_parser_service = tower::ServiceBuilder::new()
.map_string_to_reqwest_get()
.print_and_drop_request_errors()
.rate_limit(1, Duration::from_secs(1))
.layer(BodyDownloaderLayer)
.service_fn(page_processor);

We start by mapping raw [String]s to proper GET [reqwest::Request]s (requires the reqwest feature of this crate).
Parsing a URL can fail, so any Errs produced by the map are output to [log::warn] and discarded.
Only valid request objects continue further.
We then apply a simple 1/second throttle, but, if you're familiar with tower, you'll know that many more options exist,
like retries and timeouts, that would be appropriate in a crawler. recluse also provides a LeakyBucketRateLimiterLayer,
which allows a rate limiter to be shared among multiple workers.
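For illustration, here is a hedged variant of the builder above with a couple of stock tower layers added. The numbers are arbitrary, and whether each layer composes cleanly with recluse's layers depends on the error types involved, so treat this as a shape rather than a drop-in.

```rust
// The same pipeline with a concurrency cap and a per-item timeout added
// (both are standard tower layers; the exact values are arbitrary).
let quotes_page_parser_service = tower::ServiceBuilder::new()
    .map_string_to_reqwest_get()
    .print_and_drop_request_errors()
    .concurrency_limit(4)                  // at most 4 items in flight
    .timeout(Duration::from_secs(30))      // give up on slow downloads
    .rate_limit(1, Duration::from_secs(1)) // still 1 request per second
    .layer(BodyDownloaderLayer)
    .service_fn(page_processor);
```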
Next up is recluse::BodyDownloaderLayer, which handles the HTTP Client for you (requires the reqwest feature of this crate),
such that you only write a function of String, not reqwest::Request.
And in case you're wondering, yes, there is a recluse::JsonDownloaderLayer which attempts to deserialize the HTTP response
into a strongly typed object of your choosing. Useful if you're traversing an API, not a website. See the PokeAPI example.
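As a small illustration of the "strongly typed object of your choosing" part: the target is just any serde-deserializable type. The struct below is loosely modelled on PokeAPI's /pokemon responses and is only a guess at what such a type might look like, not the example's actual code.

```rust
// With a JSON-deserializing downloader, your processor would receive a value
// of this type instead of a raw String. Field names here follow PokeAPI's
// /pokemon endpoint; add or rename fields to match the API you traverse.
#[derive(serde::Deserialize, Debug)]
struct Pokemon {
    name: String,
    weight: u32,
}
```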
We then kick the worker into its own thread, here with tokio:
let worker = tokio::spawn(async move {
page_worker.work(quotes_page_parser_service).await
});

And give the spider its initial page by simply queuing it into the same pipe:
page_pipe.submit_work("https://quotes.toscrape.com/".to_string()).await
.context("Sent initial request")?;You could kickstart as much work this way as you need.
Finally, we simply wait for the worker to complete:
tokio::join!(worker).0??;

As mentioned previously, the worker closures ultimately yield unit (()), so there is no result value to collect here; the ?s just propagate a join error or a worker error.
Analysing parse_quotes_page is out of scope here because it's not part of the spider's work logic.
The examples use scraper to achieve their objectives.
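If you're curious anyway, here is a hedged sketch of what such a function could look like with scraper. The selectors follow quotes.toscrape.com's markup, and the return shape (quotes plus an optional next-page link) mirrors the description above, though the real example may differ in its exact types.

```rust
use scraper::{Html, Selector};

// A rough stand-in for parse_quotes_page: collect quote texts and find the
// "next page" link, if any. Not the example's actual implementation.
fn parse_quotes_sketch(page_body: &str) -> anyhow::Result<(Vec<String>, Option<String>)> {
    let document = Html::parse_document(page_body);
    let quote_sel = Selector::parse("div.quote span.text").expect("valid selector");
    let next_sel = Selector::parse("li.next > a").expect("valid selector");

    let quotes = document
        .select(&quote_sel)
        .map(|el| el.text().collect::<String>())
        .collect::<Vec<_>>();

    let next_page = document
        .select(&next_sel)
        .next()
        .and_then(|a| a.value().attr("href"))
        .map(|href| format!("https://quotes.toscrape.com{href}"));

    Ok((quotes, next_page))
}
```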
A default setup will use a global singleton counter and share it among all WorkPipe instances.
Normally, there's no reason to override it, as having multiple work counters in one crawler can cause
premature shutdowns or even deadlocks. The counter solves several problems commonly found in crawlers with concurrency.
I hope the following discussion helps you understand these problems and the overall architecture of this crate, and lets you make better decisions should you choose not to use the default global counter.
A typical scenario is a worker that feeds itself work, e.g. one that iteratively navigates through pages of results.
You'd kick it off by giving it the URL to the first page, after which it would find the link to the next page and queue it up.
If you used a bare channel, the worker would keep the only remaining Sender (assuming the kickoff clone is dropped)
and its work loop would hold the only Receiver. Thus, the channel would never get closed (because the Sender is never dropped)
and the worker would wait on recv() indefinitely.
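For reference, here is a minimal, self-contained illustration of that failure mode with a bare tokio channel (nothing recluse-specific; if you run it, it processes the one queued item and then hangs on recv() forever):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(16);
    tx.send("https://example.com/page/1".to_string()).await.unwrap();

    // The worker keeps a Sender so it can queue follow-up pages, which means
    // the channel can never close. Once the queue drains, recv() waits forever.
    while let Some(url) = rx.recv().await {
        println!("processing {url}");
        // A real self-feeder would sometimes do: tx.send(next_url).await.unwrap();
        let _keep_sender_alive = &tx;
    }
}
```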
We sidestep this by counting how much work is in the pipe instead. Channels don't provide a way to see how many items are in the buffer, so we introduce an atomic counter and keep it in sync with items going into and out of the channel. In the simplest case of a self-feeder, the counter can only take three values:
- 1 - there is work queued and the worker is either processing it or about to start
- 2 - the worker has just queued work for the next page and is about to exit its service function
- 0 - the worker exited the service function without queueing a next page; the system takes that as an indication that there is no further work and exits the worker's work loop
Now imagine there are two pipes and two workers: A can only download pages and B can only process them, meaning only one of them is ever working at a time (assuming a single-page kick-off and only one "next page" produced per page). If B is the only one who can find the link to the next page, then A's work counter would drop to 0 while B is working, causing A to shut down prematurely.
The solution is to have the two pipes share the counter. This way, the only scenario in which it goes to 0 is when there is not only no work for either worker, but also no possibility of new work arriving.
That's why, by default, pipes will share the counter:
let (download_pipe, download_worker) = WorkPipeBuilder::default()
.build();
let (parse_pipe, parse_worker) = WorkPipeBuilder::default()
.build(); // Will share counter with the previous