Ferrocrawl

A high-performance, configurable web crawler written in Rust. Single-binary, single-node, async — designed to sustain 1K–10K pages/sec on a mid-range server.

It ships with a PostgreSQL-backed URL frontier (SKIP LOCKED work queue), per-host politeness enforcement (robots.txt plus token-bucket rate limiting), adaptive recrawl scheduling, S3-compatible object storage for raw bodies, cuckoo-filter URL deduplication, SimHash near-duplicate detection, a gRPC API for real-time result streaming, and structured tracing with Prometheus metrics.

Features

  • High throughput — 500 concurrent async workers sharing a single Tokio runtime, tunable up to tens of thousands of workers
  • Durable frontier — PostgreSQL work queue with SELECT ... FOR UPDATE SKIP LOCKED; no URL loss on crash, stale claims auto-recovered
  • Polite by default — robots.txt respected (three-layer cache), per-host token bucket rate limiting, configurable crawl delays
  • Adaptive recrawling — hydration daemon with TCP-like TTL estimation: halve interval on change, double on unchanged, bounded by configurable floor/ceiling
  • Conditional GET — stores ETag and Last-Modified per URL; re-crawls send If-None-Match / If-Modified-Since to skip unchanged pages (saves ~60–80% bandwidth)
  • Near-duplicate detection — 64-bit SimHash fingerprinting with configurable Hamming distance threshold
  • URL deduplication — cuckoo filter with deletion support, better cache locality than bloom filters, persisted to object storage on shutdown
  • Object storage — S3-compatible (AWS, MinIO, R2) or local filesystem; bodies stored brotli-compressed with date-sharded keys
  • gRPC API — submit URLs, stream results in real-time, query frontier status, pause/resume crawling, manage per-domain policies, force recrawls
  • Observable — structured JSON logging, Prometheus metrics endpoint, OTLP trace export (1% sampling, always sample errors)
  • Per-domain policies — override crawl delay, max depth, recrawl TTL, user agent, include/exclude URL patterns per host
  • Configurable — TOML config with environment variable overrides (FERROCRAWL__ prefix)
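The near-duplicate check above reduces to comparing 64-bit SimHash fingerprints by Hamming distance. A minimal sketch of that comparison, assuming the crate works this way (the helper names here are illustrative, not the actual `ferrocrawl-parser` API):

```rust
/// Hamming distance between two 64-bit SimHash fingerprints:
/// the number of bit positions in which they differ.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Two pages count as near-duplicates when their fingerprints differ
/// in at most `threshold` bits (the threshold is configurable).
fn is_near_duplicate(a: u64, b: u64, threshold: u32) -> bool {
    hamming(a, b) <= threshold
}

fn main() {
    let a = 0b1010_1100u64;
    let b = 0b1010_1001u64; // differs from `a` in 2 bits
    assert_eq!(hamming(a, b), 2);
    assert!(is_near_duplicate(a, b, 3));
    assert!(!is_near_duplicate(a, b, 1));
    println!("ok");
}
```

Similar fingerprints for similar documents is what makes this cheap: one XOR and a popcount per comparison, regardless of page size.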

Architecture

                          Seed URLs / gRPC
                               |
                               v
                       +------------------+        +----------------------+
                       |   URL Frontier   | <------| Cuckoo Filter (dedup)|
                       |   (PostgreSQL)   |        +----------------------+
                      +------------------+                ^
                               |                          |
                          SKIP LOCKED                     | new links
                          claim batch                     |
                               |                          |
                               v                          |
                       +------------------+        +-------------------+
                       | Politeness Layer |        | HTML + link parser|
                       |  robots.txt      |        |   (lol_html)      |
                       |  rate limiter    |        |   SimHash         |
                       +------------------+        +-------------------+
                               |                          ^
                               v                          |
                      +------------------+        +------------------+
                      | HTTP Fetch Pool  |------->| Parse Pipeline   |
                      |  (reqwest +      |  body  |  link extraction |
                      |   hickory-dns)   |        |  metadata / OG   |
                      +------------------+        +------------------+
                               |                          |
                               v                          v
                      +------------------+        +------------------+
                      | DNS Negative     |        | Object Storage   |
                      | Cache (DashMap)  |        |  (S3 / local)    |
                      +------------------+        +------------------+
                                                          |
              +-------------------------------------------+
              |                    |                       |
              v                    v                       v
     +----------------+   +----------------+   +--------------------+
     |  PostgreSQL    |   |  Prometheus    |   | gRPC StreamResults |
      |  (crawled_     |   |  Metrics       |   |  (broadcast to     |
      |   pages)       |   |  :9090         |   |   consumers)       |
     +----------------+   +----------------+   +--------------------+
              ^
              |
     +----------------+
     | Hydration      |
     | Daemon         |
     |  (adaptive TTL |
     |   recrawl)     |
     +----------------+

The feedback arc on the right-hand side represents the primary loop: extracted links are deduplicated by the cuckoo filter and cycle back into the URL frontier.

Every component runs inside a single Tokio runtime. The architecture is designed so that each component can be extracted to a separate process later by replacing its in-process channel with a gRPC or NATS call.
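One way to read that design constraint: each component is consumed through a narrow trait, so an in-process implementation can later be swapped for an RPC-backed one without touching callers. A hypothetical sketch (the `Frontier` trait and `LocalFrontier` type here are illustrative, not the actual crate interfaces):

```rust
use std::collections::VecDeque;

// Hypothetical component boundary: callers see only the trait, so the
// in-process queue below could later be replaced by a gRPC/NATS client.
trait Frontier {
    fn enqueue(&mut self, url: String);
    fn claim(&mut self, batch: usize) -> Vec<String>;
}

/// In-process implementation backed by a plain FIFO queue.
struct LocalFrontier {
    queue: VecDeque<String>,
}

impl Frontier for LocalFrontier {
    fn enqueue(&mut self, url: String) {
        self.queue.push_back(url);
    }
    fn claim(&mut self, batch: usize) -> Vec<String> {
        let n = batch.min(self.queue.len());
        self.queue.drain(..n).collect()
    }
}

fn main() {
    let mut f = LocalFrontier { queue: VecDeque::new() };
    f.enqueue("https://example.com/".into());
    f.enqueue("https://example.com/about".into());
    let claimed = f.claim(10);
    assert_eq!(claimed.len(), 2);
    println!("claimed {} urls", claimed.len());
}
```

A remote implementation of the same trait would carry the gRPC or NATS call behind `enqueue` and `claim`, leaving every worker unchanged.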

Crate structure

Crate Purpose
ferrocrawl-core Shared types, traits, config, error handling
ferrocrawl-parser URL normalization, lol_html link/metadata extraction, SimHash
ferrocrawl-politeness robots.txt checker (DashMap + DB + HTTP), token bucket rate limiter
ferrocrawl-frontier PostgreSQL SKIP LOCKED work queue, PG NOTIFY listener, cuckoo filter
ferrocrawl-storage S3 and local filesystem backends, brotli compression
ferrocrawl-fetcher Shared reqwest client, conditional GET, DNS negative cache
ferrocrawl-refresh Hydration daemon with adaptive TTL recrawl scheduling
ferrocrawl-grpc tonic gRPC service (6 RPCs), Bearer auth, result broadcasting
ferrocrawl-server Binary entrypoint, worker pool, telemetry, CLI

No dependency cycles. core is the root; server is the leaf that wires everything together.

Prerequisites

  • Rust 1.82+ stable
  • PostgreSQL 14+ (tested on 18)
  • protoc (protocol buffer compiler) — for gRPC proto compilation
# Install protoc (no sudo required)
curl -sSL https://github.com/protocolbuffers/protobuf/releases/download/v29.3/protoc-29.3-linux-x86_64.zip \
  -o /tmp/protoc.zip && unzip -o /tmp/protoc.zip -d ~/.local
export PATH="$HOME/.local/bin:$PATH"

# Install sqlx-cli
cargo install sqlx-cli --no-default-features --features postgres

Quick start

The fastest way to get a complete development environment:

./setup.sh

Run ./setup.sh --dry-run to preview what it will do, ./setup.sh --skip-build to skip the cargo build/test steps, or ./setup.sh --db-only for database-only setup. The script is idempotent — safe to run repeatedly.

Manual setup

# 1. Create database and run migrations
export DATABASE_URL="postgres://postgres:postgres@localhost/ferrocrawl"
sqlx database create
sqlx migrate run --source migrations

# 2. Run in dev mode with a seed URL
RUST_LOG=ferrocrawl=debug cargo run -p ferrocrawl-server -- \
  --config config/default.toml \
  --seed-urls https://example.com

# 3. Check metrics
curl http://localhost:9090/metrics

Production build

RUSTFLAGS="-C target-cpu=native" cargo build --release -p ferrocrawl-server

The binary is at target/release/ferrocrawl. It uses jemalloc as the global allocator.

Seed from file

# seeds.txt — one URL per line, # comments supported
cargo run -p ferrocrawl-server -- --seed-file seeds.txt

Configuration

Base configuration lives in config/default.toml. Every value can be overridden via environment variables with the FERROCRAWL__ prefix using __ as the nested key separator:

FERROCRAWL__CRAWLER__WORKER_CONCURRENCY=1000
FERROCRAWL__STORAGE__BACKEND=local
FERROCRAWL__GRPC__ENABLED=true
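The double-underscore separator maps a flat environment variable name onto a nested TOML key path. A sketch of that mapping, assuming the loader behaves roughly like this (the real implementation may differ):

```rust
/// Map an env var like "FERROCRAWL__CRAWLER__WORKER_CONCURRENCY"
/// onto a nested config path like ["crawler", "worker_concurrency"].
/// Returns None for variables without the FERROCRAWL__ prefix.
fn env_key_to_path(key: &str) -> Option<Vec<String>> {
    let rest = key.strip_prefix("FERROCRAWL__")?;
    Some(
        rest.split("__")
            .map(|seg| seg.to_ascii_lowercase())
            .collect(),
    )
}

fn main() {
    let path = env_key_to_path("FERROCRAWL__CRAWLER__WORKER_CONCURRENCY").unwrap();
    assert_eq!(path, vec!["crawler", "worker_concurrency"]);
    assert!(env_key_to_path("OTHER__KEY").is_none());
    println!("ok");
}
```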

Key settings

Setting Default Description
crawler.worker_concurrency 500 Number of concurrent async fetch workers
crawler.max_depth 6 Maximum link-following depth from seed URLs
crawler.default_crawl_delay_ms 1000 Minimum delay between requests to the same host
crawler.max_retries 3 Retry attempts before marking a URL as failed
dedup.expected_items 1,000,000,000 Cuckoo filter capacity (pre-allocate for expected URL count)
refresh.adaptive_ttl true Auto-adjust recrawl interval based on content change frequency
refresh.default_recrawl_ttl_secs 86400 Default recrawl interval (24 hours)
storage.backend "s3" "s3" or "local"
storage.compression "brotli" Compress stored bodies (typically 70–80% ratio on HTML)
grpc.enabled false Enable gRPC API server
telemetry.log_format "json" "json" for production, "text" for development

See config/default.toml for the full schema with comments.
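Putting a few of the settings above together, an excerpt of the file might look like the following (values mirror the table's defaults; the section layout is illustrative, and config/default.toml remains the authoritative schema):

```toml
# Hypothetical excerpt mirroring the defaults in the table above.
[crawler]
worker_concurrency = 500
max_depth = 6
default_crawl_delay_ms = 1000
max_retries = 3

[dedup]
expected_items = 1_000_000_000

[refresh]
adaptive_ttl = true
default_recrawl_ttl_secs = 86400   # 24 hours

[storage]
backend = "s3"           # or "local"
compression = "brotli"

[grpc]
enabled = false

[telemetry]
log_format = "json"      # "text" for development
```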

gRPC API

The gRPC service is defined in proto/crawler.proto and exposes 6 RPCs:

RPC Description
SubmitUrls Submit seed URLs with optional depth limit and priority
StreamResults Server-streaming RPC delivering crawl results in real-time (filterable by host)
GetStatus Query frontier statistics (pending, in-flight, done, failed counts)
SetCrawlerState Pause or resume all crawling
UpsertDomainPolicy Create or update per-domain crawl policies
ForceRecrawl Trigger immediate re-crawl of specific URLs

Enable with FERROCRAWL__GRPC__ENABLED=true. Optional Bearer token auth via FERROCRAWL__GRPC__AUTH_TOKEN.

Example with grpcurl

# Submit URLs
grpcurl -plaintext -d '{"urls": ["https://example.com"], "priority": 1.0}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/SubmitUrls

# Stream results
grpcurl -plaintext -d '{}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/StreamResults

# Check status
grpcurl -plaintext localhost:50051 ferrocrawl.v1.CrawlerService/GetStatus

# Pause crawling
grpcurl -plaintext -d '{"paused": true}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/SetCrawlerState

Database schema

Ferrocrawl uses four PostgreSQL tables:

  • frontier — URL work queue with state machine (pending -> in_flight -> done | failed)
  • crawled_pages — crawl result metadata, storage keys, SimHash fingerprints, recrawl scheduling
  • domain_policies — per-host crawl configuration overrides
  • robots_cache — cached robots.txt bodies with TTL

The frontier uses PG NOTIFY triggers to wake idle workers when new URLs are inserted, avoiding tight polling loops.
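A NOTIFY-on-insert trigger of the kind described might look like this sketch; the channel, table, and column names are illustrative, and the actual DDL lives in migrations/0001_initial.sql:

```sql
-- Hypothetical sketch: wake idle workers whenever a new frontier row
-- is inserted, so they don't have to poll in a tight loop.
CREATE OR REPLACE FUNCTION notify_frontier_insert() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('frontier_new_url', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER frontier_insert_notify
AFTER INSERT ON frontier
FOR EACH ROW EXECUTE FUNCTION notify_frontier_insert();
```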

See migrations/0001_initial.sql for the full schema.

Observability

Prometheus metrics

Scraped from the metrics endpoint (default http://localhost:9090/metrics):

# Counters
ferrocrawl_pages_fetched_total{status="2xx|3xx|4xx|5xx|error"}
ferrocrawl_recrawls_total{outcome="queued|changed|unchanged|error"}

# Histograms
ferrocrawl_fetch_duration_ms
ferrocrawl_page_bytes

# Gauges
ferrocrawl_frontier_size{state="pending|in_flight|failed"}
ferrocrawl_active_workers
ferrocrawl_dns_cache_size

Structured logging

JSON-formatted logs with tracing spans at task boundaries:

# Development (human-readable)
FERROCRAWL__TELEMETRY__LOG_FORMAT=text RUST_LOG=ferrocrawl=debug cargo run -p ferrocrawl-server

# Production (JSON)
RUST_LOG=ferrocrawl=info ./ferrocrawl --config config/default.toml

OTLP tracing

Set FERROCRAWL__TELEMETRY__OTLP_ENDPOINT=http://localhost:4317 to export traces to Jaeger, Tempo, or any OTLP-compatible collector. Success spans are sampled at 1%; error spans are always exported.

How it works

Crawl lifecycle

  1. Seed — URLs are submitted via CLI args, seed file, or gRPC SubmitUrls
  2. Enqueue — URLs are pushed to the PostgreSQL frontier (deduplicated by SHA-256 hash)
  3. Claim — Workers atomically claim batches using SELECT ... FOR UPDATE SKIP LOCKED
  4. Politeness — robots.txt is checked, then the per-host rate limiter enforces crawl delay
  5. Fetch — HTTP GET with connection pooling, compression, redirect following, and conditional headers
  6. Parse — lol_html extracts links, title, meta tags, OpenGraph, JSON-LD, hreflang in a single streaming pass
  7. Store — Body is brotli-compressed and written to object storage; metadata goes to crawled_pages
  8. Discover — Extracted links are filtered through the cuckoo filter and pushed back to the frontier
  9. Broadcast — Results are sent to any connected gRPC StreamResults clients
  10. Recrawl — The hydration daemon re-queues pages whose next_recrawl_at has passed, with adaptive TTL
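The claim step (3) is the heart of the frontier. A sketch of a SKIP LOCKED batch claim, with illustrative column names (the real query lives in ferrocrawl-frontier):

```sql
-- Hypothetical sketch of an atomic batch claim. Concurrent workers
-- skip rows another transaction has already locked, so no two workers
-- ever claim the same URL, and nothing blocks.
UPDATE frontier
SET state = 'in_flight', claimed_at = now()
WHERE id IN (
    SELECT id FROM frontier
    WHERE state = 'pending'
    ORDER BY priority DESC
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
RETURNING id, url;
```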

Stale claim recovery

If a worker crashes mid-crawl, its claimed URLs are stuck in in_flight state. A background watchdog runs every 30 seconds and resets any in_flight URLs claimed more than claim_timeout_secs ago back to pending.
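The watchdog reset described above can be expressed as a single statement, sketched here with illustrative column names and a hard-coded timeout standing in for claim_timeout_secs:

```sql
-- Hypothetical sketch: return stale in_flight claims to the queue so
-- URLs claimed by a crashed worker are eventually re-crawled.
UPDATE frontier
SET state = 'pending', claimed_at = NULL
WHERE state = 'in_flight'
  AND claimed_at < now() - make_interval(secs => 300);  -- claim_timeout_secs
```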

Adaptive recrawl TTL

The hydration daemon adjusts per-URL recrawl intervals based on observed change frequency:

  • Content changed on re-crawl: new_ttl = max(min_ttl, current_ttl / 2) — check more often
  • Content unchanged (304 or same SimHash): new_ttl = min(max_ttl, current_ttl * 2) — check less often

Default bounds: 1 hour floor, 30 day ceiling.
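The two rules above can be sketched as a pure function; the bounds shown are the stated defaults, and the real daemon's signature may differ:

```rust
/// Adaptive recrawl TTL, in seconds: halve on change, double on no
/// change, clamped to the [min_ttl, max_ttl] bounds.
fn next_ttl(current_ttl: u64, changed: bool, min_ttl: u64, max_ttl: u64) -> u64 {
    if changed {
        (current_ttl / 2).max(min_ttl)
    } else {
        (current_ttl * 2).min(max_ttl)
    }
}

fn main() {
    let min = 3_600;        // 1 hour floor
    let max = 30 * 86_400;  // 30 day ceiling
    // A page that changed gets checked twice as often...
    assert_eq!(next_ttl(86_400, true, min, max), 43_200);
    // ...but never more often than the floor...
    assert_eq!(next_ttl(4_000, true, min, max), 3_600);
    // ...and an unchanged page backs off, up to the ceiling.
    assert_eq!(next_ttl(86_400, false, min, max), 172_800);
    assert_eq!(next_ttl(20 * 86_400, false, min, max), 30 * 86_400);
    println!("ok");
}
```

This is the same additive-increase/multiplicative-decrease flavor of feedback as TCP congestion control: frequently changing pages converge toward the floor, static pages toward the ceiling.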

Testing

# Recommended: use SQLx offline mode (no DB required)
# This repo commits the generated SQLx query cache under `.sqlx/`.
export SQLX_OFFLINE=true

# Run all tests
cargo test --workspace

# Run tests for a specific crate
cargo test -p ferrocrawl-parser
cargo test -p ferrocrawl-frontier

Regenerating the SQLx query cache (.sqlx/)

If you change any sqlx::query!/query_as! SQL or add migrations, regenerate the cache and commit the updated .sqlx/ files:

export DATABASE_URL="postgres://postgres:postgres@localhost/ferrocrawl"

# Ensure schema matches migrations
sqlx migrate run --source migrations

# Regenerate `.sqlx/` for offline builds
cargo sqlx prepare --workspace

Tests include:

  • Config deserialization and URL hashing (core)
  • URL normalization idempotency and link extraction from HTML fixtures (parser)
  • Token bucket timing with tokio::time::pause() (politeness)
  • Cuckoo filter insertion, deletion, export/import, false positive rate (frontier)
  • Brotli compression round-trips and local storage CRUD (storage)
  • Adaptive TTL boundary conditions (refresh)
  • Bearer auth interceptor (grpc)
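As an illustration of the politeness-layer logic those token bucket tests cover, the core algorithm can be sketched deterministically, with time passed in explicitly instead of slept on (the real crate drives this with Tokio; this standalone version is only a sketch):

```rust
/// Deterministic token bucket: at most `capacity` tokens, refilled at
/// `refill_per_sec` tokens per second. Time is an explicit parameter,
/// so the behavior is testable without sleeping.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill_secs: 0.0 }
    }

    /// Try to take one token at time `now_secs`; true if the request
    /// is allowed, false if the host is still rate-limited.
    fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill_secs = now_secs;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // 1 token/sec with burst of 1: roughly a 1000 ms per-host crawl delay.
    let mut bucket = TokenBucket::new(1.0, 1.0);
    assert!(bucket.try_acquire(0.0));   // first request passes
    assert!(!bucket.try_acquire(0.5));  // too soon, denied
    assert!(bucket.try_acquire(1.5));   // a full second has accrued
    println!("ok");
}
```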

Non-goals (v1)

  • JavaScript rendering — no Chromium; adds 50–100x per-page cost
  • Distributed multi-node — designed for extraction to separate processes later, not built yet
  • Full-text indexing — Ferrocrawl extracts and stores; indexing is a downstream consumer concern
  • WARC format — raw bodies in object storage with metadata is sufficient

License

See LICENSE for details.
