Ferrocrawl

A high-performance, configurable web crawler written in Rust. Single-binary, single-node, async — designed to sustain 1K–10K pages/sec on a mid-range server.

It ships with a PostgreSQL-backed URL frontier (SKIP LOCKED work queue), per-host politeness enforcement (robots.txt plus token-bucket rate limiting), adaptive recrawl scheduling, S3-compatible object storage for raw bodies, cuckoo-filter URL deduplication, SimHash near-duplicate detection, a gRPC API for real-time result streaming, and structured tracing with Prometheus metrics.

Features

  • High throughput — 500 concurrent async workers sharing a single Tokio runtime, tunable up to tens of thousands of workers
  • Durable frontier — PostgreSQL work queue with SELECT ... FOR UPDATE SKIP LOCKED; no URL loss on crash, stale claims auto-recovered
  • Polite by default — robots.txt respected (three-layer cache), per-host token bucket rate limiting, configurable crawl delays
  • Adaptive recrawling — hydration daemon with TCP-like TTL estimation: halve interval on change, double on unchanged, bounded by configurable floor/ceiling
  • Conditional GET — stores ETag and Last-Modified per URL; re-crawls send If-None-Match / If-Modified-Since to skip unchanged pages (saves ~60–80% bandwidth)
  • Near-duplicate detection — 64-bit SimHash fingerprinting with configurable Hamming distance threshold
  • URL deduplication — cuckoo filter with deletion support, better cache locality than bloom filters, persisted to object storage on shutdown
  • Object storage — S3-compatible (AWS, MinIO, R2) or local filesystem; bodies stored brotli-compressed with date-sharded keys
  • gRPC API — submit URLs, stream results in real-time, query frontier status, pause/resume crawling, manage per-domain policies, force recrawls
  • Observable — structured JSON logging, Prometheus metrics endpoint, OTLP trace export (1% sampling, always sample errors)
  • Per-domain policies — override crawl delay, max depth, recrawl TTL, user agent, include/exclude URL patterns per host
  • Configurable — TOML config with environment variable overrides (FERROCRAWL__ prefix)
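The near-duplicate check above reduces to comparing 64-bit SimHash fingerprints by Hamming distance. A minimal sketch of that comparison, assuming the crate works this way (the helper names here are illustrative, not the actual `ferrocrawl-parser` API):

```rust
/// Hamming distance between two 64-bit SimHash fingerprints:
/// the number of bit positions in which they differ.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

/// Two pages count as near-duplicates when their fingerprints differ
/// in at most `threshold` bits (the threshold is configurable).
fn is_near_duplicate(a: u64, b: u64, threshold: u32) -> bool {
    hamming(a, b) <= threshold
}

fn main() {
    let a = 0b1010_1100u64;
    let b = 0b1010_1001u64; // differs from `a` in 2 bits
    assert_eq!(hamming(a, b), 2);
    assert!(is_near_duplicate(a, b, 3));
    assert!(!is_near_duplicate(a, b, 1));
    println!("ok");
}
```

Similar fingerprints for similar documents is what makes this cheap: one XOR and a popcount per comparison, regardless of page size.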

Architecture

                          Seed URLs / gRPC
                               |
                               v
                       +------------------+        +----------------------+
                       |   URL Frontier   | <------| Cuckoo Filter (dedup)|
                       |   (PostgreSQL)   |        +----------------------+
                      +------------------+                ^
                               |                          |
                          SKIP LOCKED                     | new links
                          claim batch                     |
                               |                          |
                               v                          |
                       +------------------+        +-------------------+
                       | Politeness Layer |        | HTML + link parser|
                       |  robots.txt      |        |   (lol_html)      |
                       |  rate limiter    |        |   SimHash         |
                       +------------------+        +-------------------+
                               |                          ^
                               v                          |
                      +------------------+        +------------------+
                      | HTTP Fetch Pool  |------->| Parse Pipeline   |
                      |  (reqwest +      |  body  |  link extraction |
                      |   hickory-dns)   |        |  metadata / OG   |
                      +------------------+        +------------------+
                               |                          |
                               v                          v
                      +------------------+        +------------------+
                      | DNS Negative     |        | Object Storage   |
                      | Cache (DashMap)  |        |  (S3 / local)    |
                      +------------------+        +------------------+
                                                          |
              +-------------------------------------------+
              |                    |                       |
              v                    v                       v
     +----------------+   +----------------+   +--------------------+
     |  PostgreSQL    |   |  Prometheus    |   | gRPC StreamResults |
      |  (crawled_     |   |  Metrics       |   |  (broadcast to     |
      |   pages)       |   |  :9090         |   |   consumers)       |
     +----------------+   +----------------+   +--------------------+
              ^
              |
     +----------------+
     | Hydration      |
     | Daemon         |
     |  (adaptive TTL |
     |   recrawl)     |
     +----------------+

The feedback arc on the right-hand side represents the primary loop: extracted links are deduplicated by the cuckoo filter and cycle back into the URL frontier.

Every component runs inside a single Tokio runtime. The architecture is designed so that each component can be extracted to a separate process later by replacing its in-process channel with a gRPC or NATS call.
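One way to read that design constraint: each component is consumed through a narrow trait, so an in-process implementation can later be swapped for an RPC-backed one without touching callers. A hypothetical sketch (the `Frontier` trait and `LocalFrontier` type here are illustrative, not the actual crate interfaces):

```rust
use std::collections::VecDeque;

// Hypothetical component boundary: callers see only the trait, so the
// in-process queue below could later be replaced by a gRPC/NATS client.
trait Frontier {
    fn enqueue(&mut self, url: String);
    fn claim(&mut self, batch: usize) -> Vec<String>;
}

/// In-process implementation backed by a plain FIFO queue.
struct LocalFrontier {
    queue: VecDeque<String>,
}

impl Frontier for LocalFrontier {
    fn enqueue(&mut self, url: String) {
        self.queue.push_back(url);
    }
    fn claim(&mut self, batch: usize) -> Vec<String> {
        let n = batch.min(self.queue.len());
        self.queue.drain(..n).collect()
    }
}

fn main() {
    let mut f = LocalFrontier { queue: VecDeque::new() };
    f.enqueue("https://example.com/".into());
    f.enqueue("https://example.com/about".into());
    let claimed = f.claim(10);
    assert_eq!(claimed.len(), 2);
    println!("claimed {} urls", claimed.len());
}
```

A remote implementation of the same trait would carry the gRPC or NATS call behind `enqueue` and `claim`, leaving every worker unchanged.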

Crate structure

Crate Purpose
ferrocrawl-core Shared types, traits, config, error handling
ferrocrawl-parser URL normalization, lol_html link/metadata extraction, SimHash
ferrocrawl-politeness robots.txt checker (DashMap + DB + HTTP), token bucket rate limiter
ferrocrawl-frontier PostgreSQL SKIP LOCKED work queue, PG NOTIFY listener, cuckoo filter
ferrocrawl-storage S3 and local filesystem backends, brotli compression
ferrocrawl-fetcher Shared reqwest client, conditional GET, DNS negative cache
ferrocrawl-refresh Hydration daemon with adaptive TTL recrawl scheduling
ferrocrawl-grpc tonic gRPC service (6 RPCs), Bearer auth, result broadcasting
ferrocrawl-server Binary entrypoint, worker pool, telemetry, CLI

No dependency cycles. core is the root; server is the leaf that wires everything together.

Prerequisites

  • Rust 1.82+ stable
  • PostgreSQL 14+ (tested on 18)
  • protoc (protocol buffer compiler) — for gRPC proto compilation
# Install protoc (no sudo required)
curl -sSL https://github.com/protocolbuffers/protobuf/releases/download/v29.3/protoc-29.3-linux-x86_64.zip \
  -o /tmp/protoc.zip && unzip -o /tmp/protoc.zip -d ~/.local
export PATH="$HOME/.local/bin:$PATH"

# Install sqlx-cli
cargo install sqlx-cli --no-default-features --features postgres

Quick start

The fastest way to get a complete development environment:

./setup.sh

Run ./setup.sh --dry-run to preview what it will do, ./setup.sh --skip-build to skip the cargo build/test steps, or ./setup.sh --db-only for database-only setup. The script is idempotent — safe to run repeatedly.

Manual setup

# 1. Create database and run migrations
export DATABASE_URL="postgres://postgres:postgres@localhost/ferrocrawl"
sqlx database create
sqlx migrate run --source migrations

# 2. Run in dev mode with a seed URL
RUST_LOG=ferrocrawl=debug cargo run -p ferrocrawl-server -- \
  --config config/default.toml \
  --seed-urls https://example.com

# 3. Check metrics
curl http://localhost:9090/metrics

Production build

RUSTFLAGS="-C target-cpu=native" cargo build --release -p ferrocrawl-server

The binary is at target/release/ferrocrawl. It uses jemalloc as the global allocator.

Seed from file

# seeds.txt — one URL per line, # comments supported
cargo run -p ferrocrawl-server -- --seed-file seeds.txt

Configuration

Base configuration lives in config/default.toml. Every value can be overridden via environment variables with the FERROCRAWL__ prefix using __ as the nested key separator:

FERROCRAWL__CRAWLER__WORKER_CONCURRENCY=1000
FERROCRAWL__STORAGE__BACKEND=local
FERROCRAWL__GRPC__ENABLED=true
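The double-underscore separator maps a flat environment variable name onto a nested TOML key path. A sketch of that mapping, assuming the loader behaves roughly like this (the real implementation may differ):

```rust
/// Map an env var like "FERROCRAWL__CRAWLER__WORKER_CONCURRENCY"
/// onto a nested config path like ["crawler", "worker_concurrency"].
/// Returns None for variables without the FERROCRAWL__ prefix.
fn env_key_to_path(key: &str) -> Option<Vec<String>> {
    let rest = key.strip_prefix("FERROCRAWL__")?;
    Some(
        rest.split("__")
            .map(|seg| seg.to_ascii_lowercase())
            .collect(),
    )
}

fn main() {
    let path = env_key_to_path("FERROCRAWL__CRAWLER__WORKER_CONCURRENCY").unwrap();
    assert_eq!(path, vec!["crawler", "worker_concurrency"]);
    assert!(env_key_to_path("OTHER__KEY").is_none());
    println!("ok");
}
```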

Key settings

Setting Default Description
crawler.worker_concurrency 500 Number of concurrent async fetch workers
crawler.max_depth 6 Maximum link-following depth from seed URLs
crawler.default_crawl_delay_ms 1000 Minimum delay between requests to the same host
crawler.max_retries 3 Retry attempts before marking a URL as failed
dedup.expected_items 1,000,000,000 Cuckoo filter capacity (pre-allocate for expected URL count)
refresh.adaptive_ttl true Auto-adjust recrawl interval based on content change frequency
refresh.default_recrawl_ttl_secs 86400 Default recrawl interval (24 hours)
storage.backend "s3" "s3" or "local"
storage.compression "brotli" Compress stored bodies (typically 70–80% ratio on HTML)
grpc.enabled false Enable gRPC API server
telemetry.log_format "json" "json" for production, "text" for development

See config/default.toml for the full schema with comments.
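Putting a few of the settings above together, an excerpt of the file might look like the following (values mirror the table's defaults; the section layout is illustrative, and config/default.toml remains the authoritative schema):

```toml
# Hypothetical excerpt mirroring the defaults in the table above.
[crawler]
worker_concurrency = 500
max_depth = 6
default_crawl_delay_ms = 1000
max_retries = 3

[dedup]
expected_items = 1_000_000_000

[refresh]
adaptive_ttl = true
default_recrawl_ttl_secs = 86400   # 24 hours

[storage]
backend = "s3"           # or "local"
compression = "brotli"

[grpc]
enabled = false

[telemetry]
log_format = "json"      # "text" for development
```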

gRPC API

The gRPC service is defined in proto/crawler.proto and exposes 6 RPCs:

RPC Description
SubmitUrls Submit seed URLs with optional depth limit and priority
StreamResults Server-streaming RPC delivering crawl results in real-time (filterable by host)
GetStatus Query frontier statistics (pending, in-flight, done, failed counts)
SetCrawlerState Pause or resume all crawling
UpsertDomainPolicy Create or update per-domain crawl policies
ForceRecrawl Trigger immediate re-crawl of specific URLs

Enable with FERROCRAWL__GRPC__ENABLED=true. Optional Bearer token auth via FERROCRAWL__GRPC__AUTH_TOKEN.

Example with grpcurl

# Submit URLs
grpcurl -plaintext -d '{"urls": ["https://example.com"], "priority": 1.0}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/SubmitUrls

# Stream results
grpcurl -plaintext -d '{}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/StreamResults

# Check status
grpcurl -plaintext localhost:50051 ferrocrawl.v1.CrawlerService/GetStatus

# Pause crawling
grpcurl -plaintext -d '{"paused": true}' \
  localhost:50051 ferrocrawl.v1.CrawlerService/SetCrawlerState

Database schema

Ferrocrawl uses four PostgreSQL tables:

  • frontier — URL work queue with state machine (pending -> in_flight -> done | failed)
  • crawled_pages — crawl result metadata, storage keys, SimHash fingerprints, recrawl scheduling
  • domain_policies — per-host crawl configuration overrides
  • robots_cache — cached robots.txt bodies with TTL

The frontier uses PG NOTIFY triggers to wake idle workers when new URLs are inserted, avoiding tight polling loops.
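A NOTIFY-on-insert trigger of the kind described might look like this sketch; the channel, table, and column names are illustrative, and the actual DDL lives in migrations/0001_initial.sql:

```sql
-- Hypothetical sketch: wake idle workers whenever a new frontier row
-- is inserted, so they don't have to poll in a tight loop.
CREATE OR REPLACE FUNCTION notify_frontier_insert() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('frontier_new_url', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER frontier_insert_notify
AFTER INSERT ON frontier
FOR EACH ROW EXECUTE FUNCTION notify_frontier_insert();
```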

See migrations/0001_initial.sql for the full schema.

Observability

Prometheus metrics

Scraped from the metrics endpoint (default http://localhost:9090/metrics):

# Counters
ferrocrawl_pages_fetched_total{status="2xx|3xx|4xx|5xx|error"}
ferrocrawl_recrawls_total{outcome="queued|changed|unchanged|error"}

# Histograms
ferrocrawl_fetch_duration_ms
ferrocrawl_page_bytes

# Gauges
ferrocrawl_frontier_size{state="pending|in_flight|failed"}
ferrocrawl_active_workers
ferrocrawl_dns_cache_size

Structured logging

JSON-formatted logs with tracing spans at task boundaries:

# Development (human-readable)
FERROCRAWL__TELEMETRY__LOG_FORMAT=text RUST_LOG=ferrocrawl=debug cargo run -p ferrocrawl-server

# Production (JSON)
RUST_LOG=ferrocrawl=info ./ferrocrawl --config config/default.toml

OTLP tracing

Set FERROCRAWL__TELEMETRY__OTLP_ENDPOINT=http://localhost:4317 to export traces to Jaeger, Tempo, or any OTLP-compatible collector. Success spans are sampled at 1%; error spans are always exported.

How it works

Crawl lifecycle

  1. Seed — URLs are submitted via CLI args, seed file, or gRPC SubmitUrls
  2. Enqueue — URLs are pushed to the PostgreSQL frontier (deduplicated by SHA-256 hash)
  3. Claim — Workers atomically claim batches using SELECT ... FOR UPDATE SKIP LOCKED
  4. Politeness — robots.txt is checked, then the per-host rate limiter enforces crawl delay
  5. Fetch — HTTP GET with connection pooling, compression, redirect following, and conditional headers
  6. Parse — lol_html extracts links, title, meta tags, OpenGraph, JSON-LD, hreflang in a single streaming pass
  7. Store — Body is brotli-compressed and written to object storage; metadata goes to crawled_pages
  8. Discover — Extracted links are filtered through the cuckoo filter and pushed back to the frontier
  9. Broadcast — Results are sent to any connected gRPC StreamResults clients
  10. Recrawl — The hydration daemon re-queues pages whose next_recrawl_at has passed, with adaptive TTL
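The claim step (3) is the heart of the frontier. A sketch of a SKIP LOCKED batch claim, with illustrative column names (the real query lives in ferrocrawl-frontier):

```sql
-- Hypothetical sketch of an atomic batch claim. Concurrent workers
-- skip rows another transaction has already locked, so no two workers
-- ever claim the same URL, and nothing blocks.
UPDATE frontier
SET state = 'in_flight', claimed_at = now()
WHERE id IN (
    SELECT id FROM frontier
    WHERE state = 'pending'
    ORDER BY priority DESC
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
RETURNING id, url;
```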

Stale claim recovery

If a worker crashes mid-crawl, its claimed URLs are stuck in in_flight state. A background watchdog runs every 30 seconds and resets any in_flight URLs claimed more than claim_timeout_secs ago back to pending.
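The watchdog reset described above can be expressed as a single statement, sketched here with illustrative column names and a hard-coded timeout standing in for claim_timeout_secs:

```sql
-- Hypothetical sketch: return stale in_flight claims to the queue so
-- URLs claimed by a crashed worker are eventually re-crawled.
UPDATE frontier
SET state = 'pending', claimed_at = NULL
WHERE state = 'in_flight'
  AND claimed_at < now() - make_interval(secs => 300);  -- claim_timeout_secs
```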

Adaptive recrawl TTL

The hydration daemon adjusts per-URL recrawl intervals based on observed change frequency:

  • Content changed on re-crawl: new_ttl = max(min_ttl, current_ttl / 2) — check more often
  • Content unchanged (304 or same SimHash): new_ttl = min(max_ttl, current_ttl * 2) — check less often

Default bounds: 1 hour floor, 30 day ceiling.
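The two rules above can be sketched as a pure function; the bounds shown are the stated defaults, and the real daemon's signature may differ:

```rust
/// Adaptive recrawl TTL, in seconds: halve on change, double on no
/// change, clamped to the [min_ttl, max_ttl] bounds.
fn next_ttl(current_ttl: u64, changed: bool, min_ttl: u64, max_ttl: u64) -> u64 {
    if changed {
        (current_ttl / 2).max(min_ttl)
    } else {
        (current_ttl * 2).min(max_ttl)
    }
}

fn main() {
    let min = 3_600;        // 1 hour floor
    let max = 30 * 86_400;  // 30 day ceiling
    // A page that changed gets checked twice as often...
    assert_eq!(next_ttl(86_400, true, min, max), 43_200);
    // ...but never more often than the floor...
    assert_eq!(next_ttl(4_000, true, min, max), 3_600);
    // ...and an unchanged page backs off, up to the ceiling.
    assert_eq!(next_ttl(86_400, false, min, max), 172_800);
    assert_eq!(next_ttl(20 * 86_400, false, min, max), 30 * 86_400);
    println!("ok");
}
```

This is the same additive-increase/multiplicative-decrease flavor of feedback as TCP congestion control: frequently changing pages converge toward the floor, static pages toward the ceiling.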

Testing

# Recommended: use SQLx offline mode (no DB required)
# This repo commits the generated SQLx query cache under `.sqlx/`.
export SQLX_OFFLINE=true

# Run all tests
cargo test --workspace

# Run tests for a specific crate
cargo test -p ferrocrawl-parser
cargo test -p ferrocrawl-frontier

Regenerating the SQLx query cache (.sqlx/)

If you change any sqlx::query!/query_as! SQL or add migrations, regenerate the cache and commit the updated .sqlx/ files:

export DATABASE_URL="postgres://postgres:postgres@localhost/ferrocrawl"

# Ensure schema matches migrations
sqlx migrate run --source migrations

# Regenerate `.sqlx/` for offline builds
cargo sqlx prepare --workspace

Tests include:

  • Config deserialization and URL hashing (core)
  • URL normalization idempotency and link extraction from HTML fixtures (parser)
  • Token bucket timing with tokio::time::pause() (politeness)
  • Cuckoo filter insertion, deletion, export/import, false positive rate (frontier)
  • Brotli compression round-trips and local storage CRUD (storage)
  • Adaptive TTL boundary conditions (refresh)
  • Bearer auth interceptor (grpc)
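As an illustration of the politeness-layer logic those token bucket tests cover, the core algorithm can be sketched deterministically, with time passed in explicitly instead of slept on (the real crate drives this with Tokio; this standalone version is only a sketch):

```rust
/// Deterministic token bucket: at most `capacity` tokens, refilled at
/// `refill_per_sec` tokens per second. Time is an explicit parameter,
/// so the behavior is testable without sleeping.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill_secs: 0.0 }
    }

    /// Try to take one token at time `now_secs`; true if the request
    /// is allowed, false if the host is still rate-limited.
    fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill_secs = now_secs;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // 1 token/sec with burst of 1: roughly a 1000 ms per-host crawl delay.
    let mut bucket = TokenBucket::new(1.0, 1.0);
    assert!(bucket.try_acquire(0.0));   // first request passes
    assert!(!bucket.try_acquire(0.5));  // too soon, denied
    assert!(bucket.try_acquire(1.5));   // a full second has accrued
    println!("ok");
}
```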

Non-goals (v1)

  • JavaScript rendering — no Chromium; adds 50–100x per-page cost
  • Distributed multi-node — designed for extraction to separate processes later, not built yet
  • Full-text indexing — Ferrocrawl extracts and stores; indexing is a downstream consumer concern
  • WARC format — raw bodies in object storage with metadata is sufficient

License

See LICENSE for details.
