A high-performance, configurable web crawler written in Rust. Single-binary, single-node, async — designed to sustain 1K–10K pages/sec on a mid-range server.
PostgreSQL-backed URL frontier with SKIP LOCKED work queue, per-host politeness enforcement (robots.txt + token bucket rate limiting), adaptive recrawl scheduling, S3-compatible object storage for raw bodies, cuckoo filter URL deduplication, SimHash near-duplicate detection, gRPC API for real-time result streaming, and structured tracing with Prometheus metrics.
- High throughput — 500 concurrent async workers sharing a single Tokio runtime, tunable up to tens of thousands
- Durable frontier — PostgreSQL work queue with `SELECT ... FOR UPDATE SKIP LOCKED`; no URL loss on crash, stale claims auto-recovered
- Polite by default — robots.txt respected (three-layer cache), per-host token bucket rate limiting, configurable crawl delays
- Adaptive recrawling — hydration daemon with TCP-like TTL estimation: halve interval on change, double on unchanged, bounded by configurable floor/ceiling
- Conditional GET — stores `ETag` and `Last-Modified` per URL; re-crawls send `If-None-Match`/`If-Modified-Since` to skip unchanged pages (saves ~60–80% bandwidth)
- Near-duplicate detection — 64-bit SimHash fingerprinting with configurable Hamming distance threshold
- URL deduplication — cuckoo filter with deletion support, better cache locality than bloom filters, persisted to object storage on shutdown
- Object storage — S3-compatible (AWS, MinIO, R2) or local filesystem; bodies stored brotli-compressed with date-sharded keys
- gRPC API — submit URLs, stream results in real-time, query frontier status, pause/resume crawling, manage per-domain policies, force recrawls
- Observable — structured JSON logging, Prometheus metrics endpoint, OTLP trace export (1% sampling, always sample errors)
- Per-domain policies — override crawl delay, max depth, recrawl TTL, user agent, include/exclude URL patterns per host
- Configurable — TOML config with environment variable overrides (`FERROCRAWL__` prefix)
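As a rough illustration of the near-duplicate feature, a minimal 64-bit SimHash over whitespace-separated tokens might look like the sketch below. This uses the standard library's `DefaultHasher` for token hashing and is not the actual `ferrocrawl-parser` implementation (which tokenizes real HTML text content):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one token to a 64-bit value (illustrative only; DefaultHasher's
/// algorithm is not guaranteed stable across Rust versions).
fn token_hash(token: &str) -> u64 {
    let mut h = DefaultHasher::new();
    token.hash(&mut h);
    h.finish()
}

/// 64-bit SimHash: each bit position accumulates +1/-1 votes from every
/// token hash; the sign of each column becomes the fingerprint bit.
fn simhash(text: &str) -> u64 {
    let mut votes = [0i64; 64];
    for token in text.split_whitespace() {
        let h = token_hash(token);
        for (i, v) in votes.iter_mut().enumerate() {
            *v += if (h >> i) & 1 == 1 { 1 } else { -1 };
        }
    }
    votes.iter().enumerate().fold(0u64, |acc, (i, &v)| {
        if v > 0 { acc | (1 << i) } else { acc }
    })
}

/// Two pages are near-duplicates when the Hamming distance between
/// fingerprints is at or below the configured threshold.
fn is_near_duplicate(a: u64, b: u64, threshold: u32) -> bool {
    (a ^ b).count_ones() <= threshold
}
```

Small token edits flip only a few fingerprint bits, which is why a low Hamming-distance threshold (e.g. 3) catches boilerplate-level page changes.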
```
Seed URLs / gRPC
        |
        v
+------------------+        +---------------------+
|   URL Frontier   | <------| Cuckoo Filter (dedup)|
|   (PostgreSQL)   |        +---------------------+
+------------------+                  ^
        |                             |
  SKIP LOCKED                     new links
  claim batch                         |
        |                             |
        v                             |
+------------------+        +------------------+
| Politeness Layer |        | HTML + link parser|
|   robots.txt     |        |    (lol_html)     |
|   rate limiter   |        |     SimHash       |
+------------------+        +------------------+
        |                             ^
        v                             |
+------------------+        +------------------+
| HTTP Fetch Pool  |------->|  Parse Pipeline  |
|  (reqwest +      |  body  |  link extraction |
|   hickory-dns)   |        |  metadata / OG   |
+------------------+        +------------------+
        |                             |
        v                             v
+------------------+        +------------------+
|  DNS Negative    |        |  Object Storage  |
|  Cache (DashMap) |        |   (S3 / local)   |
+------------------+        +------------------+
                                      |
        +-------------------------------------------+
        |                    |                      |
        v                    v                      v
+----------------+  +----------------+  +--------------------+
|  PostgreSQL    |  |  Prometheus    |  | gRPC StreamResults |
|  (crawled_     |  |  Metrics       |  | (broadcast to      |
|   pages)       |  |  :9090         |  |  consumers)        |
+----------------+  +----------------+  +--------------------+
        ^
        |
+----------------+
|  Hydration     |
|  Daemon        |
| (adaptive TTL  |
|  recrawl)      |
+----------------+
```
The upward feedback arc represents the primary loop: extracted links are deduplicated by the cuckoo filter and cycle back into the URL frontier.
Every component runs inside a single Tokio runtime. The architecture is designed so that each component can be extracted to a separate process later by replacing its in-process channel with a gRPC or NATS call.
| Crate | Purpose |
|---|---|
| `ferrocrawl-core` | Shared types, traits, config, error handling |
| `ferrocrawl-parser` | URL normalization, lol_html link/metadata extraction, SimHash |
| `ferrocrawl-politeness` | robots.txt checker (DashMap + DB + HTTP), token bucket rate limiter |
| `ferrocrawl-frontier` | PostgreSQL SKIP LOCKED work queue, PG NOTIFY listener, cuckoo filter |
| `ferrocrawl-storage` | S3 and local filesystem backends, brotli compression |
| `ferrocrawl-fetcher` | Shared reqwest client, conditional GET, DNS negative cache |
| `ferrocrawl-refresh` | Hydration daemon with adaptive TTL recrawl scheduling |
| `ferrocrawl-grpc` | tonic gRPC service (6 RPCs), Bearer auth, result broadcasting |
| `ferrocrawl-server` | Binary entrypoint, worker pool, telemetry, CLI |
No dependency cycles. `core` is the root; `server` is the leaf that wires everything together.
- Rust 1.82+ stable
- PostgreSQL 14+ (tested on 18)
- protoc (protocol buffer compiler) — for gRPC proto compilation
```shell
# Install protoc (no sudo required)
curl -sSL https://github.com/protocolbuffers/protobuf/releases/download/v29.3/protoc-29.3-linux-x86_64.zip \
  -o /tmp/protoc.zip && unzip -o /tmp/protoc.zip -d ~/.local
export PATH="$HOME/.local/bin:$PATH"

# Install sqlx-cli
cargo install sqlx-cli --no-default-features --features postgres
```

The fastest way to get a complete development environment:

```shell
./setup.sh
```

Run `./setup.sh --dry-run` to preview what it will do, `./setup.sh --skip-build` to skip the cargo build/test steps, or `./setup.sh --db-only` for database-only setup. The script is idempotent — safe to run repeatedly.
```shell
# 1. Create database and run migrations
export DATABASE_URL="postgres://postgres:postgres@localhost/ferrocrawl"
sqlx database create
sqlx migrate run --source migrations

# 2. Run in dev mode with a seed URL
RUST_LOG=ferrocrawl=debug cargo run -p ferrocrawl-server -- \
  --config config/default.toml \
  --seed-urls https://example.com

# 3. Check metrics
curl http://localhost:9090/metrics
```

For an optimized release build:

```shell
RUSTFLAGS="-C target-cpu=native" cargo build --release -p ferrocrawl-server
```

The binary is at `target/release/ferrocrawl`. It uses jemalloc as the global allocator.

To crawl from a seed file:

```shell
# seeds.txt — one URL per line, # comments supported
cargo run -p ferrocrawl-server -- --seed-file seeds.txt
```

Base configuration lives in `config/default.toml`. Every value can be overridden via environment variables with the `FERROCRAWL__` prefix, using `__` as the nested key separator:
```shell
FERROCRAWL__CRAWLER__WORKER_CONCURRENCY=1000
FERROCRAWL__STORAGE__BACKEND=local
FERROCRAWL__GRPC__ENABLED=true
```

| Setting | Default | Description |
|---|---|---|
| `crawler.worker_concurrency` | 500 | Number of concurrent async fetch workers |
| `crawler.max_depth` | 6 | Maximum link-following depth from seed URLs |
| `crawler.default_crawl_delay_ms` | 1000 | Minimum delay between requests to the same host |
| `crawler.max_retries` | 3 | Retry attempts before marking a URL as failed |
| `dedup.expected_items` | 1,000,000,000 | Cuckoo filter capacity (pre-allocate for expected URL count) |
| `refresh.adaptive_ttl` | true | Auto-adjust recrawl interval based on content change frequency |
| `refresh.default_recrawl_ttl_secs` | 86400 | Default recrawl interval (24 hours) |
| `storage.backend` | `"s3"` | `"s3"` or `"local"` |
| `storage.compression` | `"brotli"` | Compress stored bodies (typically 70–80% ratio on HTML) |
| `grpc.enabled` | false | Enable gRPC API server |
| `telemetry.log_format` | `"json"` | `"json"` for production, `"text"` for development |
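The `FERROCRAWL__` override scheme maps an environment variable name to a nested TOML key path. A hypothetical sketch of that mapping (the real config loader may differ):

```rust
/// Map e.g. "FERROCRAWL__CRAWLER__WORKER_CONCURRENCY" to the nested
/// config path ["crawler", "worker_concurrency"]. Variables without
/// the FERROCRAWL__ prefix are ignored (None).
fn env_key_to_path(var: &str) -> Option<Vec<String>> {
    let rest = var.strip_prefix("FERROCRAWL__")?;
    Some(rest.split("__").map(|s| s.to_ascii_lowercase()).collect())
}
```

So `FERROCRAWL__STORAGE__BACKEND=local` overrides the `storage.backend` key, while unrelated variables such as `PATH` are left alone.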
See `config/default.toml` for the full schema with comments.
The gRPC service is defined in `proto/crawler.proto` and exposes 6 RPCs:
| RPC | Description |
|---|---|
| `SubmitUrls` | Submit seed URLs with optional depth limit and priority |
| `StreamResults` | Server-streaming RPC delivering crawl results in real-time (filterable by host) |
| `GetStatus` | Query frontier statistics (pending, in-flight, done, failed counts) |
| `SetCrawlerState` | Pause or resume all crawling |
| `UpsertDomainPolicy` | Create or update per-domain crawl policies |
| `ForceRecrawl` | Trigger immediate re-crawl of specific URLs |
Enable with `FERROCRAWL__GRPC__ENABLED=true`. Optional Bearer token auth via `FERROCRAWL__GRPC__AUTH_TOKEN`.
```shell
# Submit URLs
grpcurl -plaintext -d '{"urls": ["https://example.com"], "priority": 1.0}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/SubmitUrls

# Stream results
grpcurl -plaintext -d '{}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/StreamResults

# Check status
grpcurl -plaintext localhost:50051 ferrocrawl.v1.CrawlerService/GetStatus

# Pause crawling
grpcurl -plaintext -d '{"paused": true}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/SetCrawlerState
```

Ferrocrawl uses four PostgreSQL tables:

- `frontier` — URL work queue with state machine (`pending -> in_flight -> done|failed`)
- `crawled_pages` — crawl result metadata, storage keys, SimHash fingerprints, recrawl scheduling
- `domain_policies` — per-host crawl configuration overrides
- `robots_cache` — cached robots.txt bodies with TTL
The frontier uses PG NOTIFY triggers to wake idle workers when new URLs are inserted, avoiding tight polling loops.
See `migrations/0001_initial.sql` for the full schema.
Scraped from the metrics endpoint (default `http://localhost:9090/metrics`):

```
# Counters
ferrocrawl_pages_fetched_total{status="2xx|3xx|4xx|5xx|error"}
ferrocrawl_recrawls_total{outcome="queued|changed|unchanged|error"}

# Histograms
ferrocrawl_fetch_duration_ms
ferrocrawl_page_bytes

# Gauges
ferrocrawl_frontier_size{state="pending|in_flight|failed"}
ferrocrawl_active_workers
ferrocrawl_dns_cache_size
```
JSON-formatted logs with tracing spans at task boundaries:
```shell
# Development (human-readable)
FERROCRAWL__TELEMETRY__LOG_FORMAT=text RUST_LOG=ferrocrawl=debug cargo run -p ferrocrawl-server

# Production (JSON)
RUST_LOG=ferrocrawl=info ./ferrocrawl --config config/default.toml
```

Set `FERROCRAWL__TELEMETRY__OTLP_ENDPOINT=http://localhost:4317` to export traces to Jaeger, Tempo, or any OTLP-compatible collector. Success spans are sampled at 1%; error spans are always exported.
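The "1% of successes, 100% of errors" rule is a simple head-sampling decision. A sketch of the idea (hypothetical names, not the actual tracing-layer code):

```rust
/// Head-sampling decision: always keep error spans; otherwise keep a
/// `rate` fraction of spans, where `coin` is a uniform draw in [0, 1).
fn should_export(is_error: bool, coin: f64, rate: f64) -> bool {
    is_error || coin < rate
}
```

With `rate = 0.01`, roughly 1 in 100 successful spans survives, while every error span is exported regardless of the coin flip.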
- Seed — URLs are submitted via CLI args, seed file, or gRPC `SubmitUrls`
- Enqueue — URLs are pushed to the PostgreSQL frontier (deduplicated by SHA-256 hash)
- Claim — Workers atomically claim batches using `SELECT ... FOR UPDATE SKIP LOCKED`
- Politeness — robots.txt is checked, then the per-host rate limiter enforces crawl delay
- Fetch — HTTP GET with connection pooling, compression, redirect following, and conditional headers
- Parse — lol_html extracts links, title, meta tags, OpenGraph, JSON-LD, hreflang in a single streaming pass
- Store — Body is brotli-compressed and written to object storage; metadata goes to `crawled_pages`
- Discover — Extracted links are filtered through the cuckoo filter and pushed back to the frontier
- Broadcast — Results are sent to any connected gRPC `StreamResults` clients
- Recrawl — The hydration daemon re-queues pages whose `next_recrawl_at` has passed, with adaptive TTL
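The politeness step's per-host rate limiting can be sketched as a minimal token bucket (illustrative only; the actual limiter in `ferrocrawl-politeness` is async and shared across workers):

```rust
use std::time::Instant;

/// Minimal token bucket: allows bursts up to `capacity` requests,
/// refilled continuously at `refill_per_sec` tokens per second.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last: Instant::now() }
    }

    /// Try to take one token; returns false if this host must wait.
    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        // Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A 1000 ms crawl delay corresponds to roughly `refill_per_sec = 1.0` with a small capacity, so a host sees at most one request per second after any initial burst.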
If a worker crashes mid-crawl, its claimed URLs are stuck in the `in_flight` state. A background watchdog runs every 30 seconds and resets any `in_flight` URLs claimed more than `claim_timeout_secs` ago back to `pending`.
The hydration daemon adjusts per-URL recrawl intervals based on observed change frequency:
- Content changed on re-crawl: `new_ttl = max(min_ttl, current_ttl / 2)` — check more often
- Content unchanged (304 or same SimHash): `new_ttl = min(max_ttl, current_ttl * 2)` — check less often
Default bounds: 1 hour floor, 30 day ceiling.
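The two rules reduce to a small pure function; a sketch with hypothetical names (the daemon's actual code also persists the result to `crawled_pages`):

```rust
/// Adaptive recrawl TTL in seconds: halve on change, double on no
/// change, clamped to the configured [min_ttl, max_ttl] bounds.
fn next_ttl(changed: bool, current: u64, min_ttl: u64, max_ttl: u64) -> u64 {
    if changed {
        (current / 2).max(min_ttl)
    } else {
        current.saturating_mul(2).min(max_ttl)
    }
}
```

Starting from the 86400 s default, a page that keeps changing converges toward the 1-hour floor, while a static page backs off toward the 30-day ceiling, mirroring TCP-style multiplicative adjustment.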
```shell
# Recommended: use SQLx offline mode (no DB required)
# This repo commits the generated SQLx query cache under `.sqlx/`.
export SQLX_OFFLINE=true

# Run all tests
cargo test --workspace

# Run tests for a specific crate
cargo test -p ferrocrawl-parser
cargo test -p ferrocrawl-frontier
```

If you change any `sqlx::query!`/`query_as!` SQL or add migrations, regenerate the cache and commit the updated `.sqlx/` files:

```shell
export DATABASE_URL="postgres://postgres:postgres@localhost/ferrocrawl"

# Ensure schema matches migrations
sqlx migrate run --source migrations

# Regenerate `.sqlx/` for offline builds
cargo sqlx prepare --workspace
```

Tests include:
- Config deserialization and URL hashing (core)
- URL normalization idempotency and link extraction from HTML fixtures (parser)
- Token bucket timing with `tokio::time::pause()` (politeness)
- Cuckoo filter insertion, deletion, export/import, false positive rate (frontier)
- Brotli compression round-trips and local storage CRUD (storage)
- Adaptive TTL boundary conditions (refresh)
- Bearer auth interceptor (grpc)
- JavaScript rendering — no Chromium; adds 50–100x per-page cost
- Distributed multi-node — designed for extraction to separate processes later, not built yet
- Full-text indexing — Ferrocrawl extracts and stores; indexing is a downstream consumer concern
- WARC format — raw bodies in object storage with metadata is sufficient
See LICENSE for details.