High-performance web scraping toolkit for Rust — graph-based execution engine + anti-detection browser automation.
Stygian is a monorepo containing five complementary Rust crates for building robust, scalable web scraping systems:
Graph-based scraping engine treating pipelines as DAGs with pluggable service modules:
- Hexagonal architecture — domain core isolated from infrastructure
- Extreme concurrency — Tokio for I/O, Rayon for CPU-bound tasks
- AI extraction — Claude, GPT, Gemini, GitHub Copilot, Ollama support
- Multi-modal — images, PDFs, videos via LLM vision APIs
- Distributed execution — Redis/Valkey-backed work queues
- Circuit breaker — graceful degradation when services fail
- Idempotency — safe retries with deduplication keys
- Graph introspection — runtime inspection, impact analysis, execution waves
Anti-detection browser automation library for bypassing modern bot protection:
- Browser pooling — warm pool, sub-100ms acquisition
- CDP-based — Chrome DevTools Protocol via chromiumoxide
- Stealth features — navigator spoofing, canvas noise, WebGL randomization
- Human behavior — Bézier mouse paths, realistic typing
- TLS fingerprinting — profile-matched JA3/JA4 signatures
- Cloudflare/DataDome/PerimeterX — bypass detection layers
Proxy pool management with intelligent rotation:
- Multi-protocol — HTTP, HTTPS, SOCKS5 support
- Health checking — automatic dead proxy removal
- Sticky sessions — domain-bound proxy affinity
- Weighted selection — prioritize faster/more reliable proxies
MCP (Model Context Protocol) aggregator for LLM tool integration:
- Unified interface — single JSON-RPC 2.0 server over stdin/stdout
- Tool namespacing —
graph_*,browser_*,proxy_*prefixes - Cross-crate tools —
scrape_proxied,browser_proxied - VS Code/Claude — direct integration with MCP-compatible clients
MCP tool matrix (aggregator surface):
| Namespace | Representative tools | Purpose |
|---|---|---|
graph_* |
graph_scrape, graph_scrape_rest, graph_scrape_graphql, graph_pipeline_validate, graph_pipeline_run |
HTTP/API/feed scraping and DAG execution |
browser_* |
browser_acquire, browser_navigate, browser_query, browser_extract, browser_extract_with_fallback, browser_extract_resilient, browser_release |
Headless browser automation and structured extraction |
proxy_* |
proxy_add, proxy_remove, proxy_pool_stats, proxy_acquire, proxy_acquire_for_domain, proxy_acquire_with_capabilities, proxy_fetch_freelist, proxy_fetch_freeapiproxies, proxy_release |
Proxy pool management, capability-aware leasing, and feed bootstrap |
| cross-crate | scrape_proxied, browser_proxied |
End-to-end orchestration across graph/browser/proxy |
Proc-macro backend that powers #[derive(Extract)] in stygian-browser:
- Declarative extraction — annotate structs with CSS selectors and attribute targets
- Internal crate — do not add directly; enable via
stygian-browser'sextractfeature - Zero boilerplate — generates typed DOM-to-struct deserialization at compile time
stygian-browser = { version = "*", features = ["extract"] }use serde_json::json;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let pipeline = PipelineBuilder::new()
.node("fetch", HttpAdapter::new())
.node("parse", MyParserAdapter)
.edge("fetch", "parse")
.build()?;
let results = pipeline
.execute(json!({"url": "https://example.com"}))
.await?;
println!("Results: {:?}", results);
Ok(())
}use std::time::Duration;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let pool = BrowserPool::new(BrowserConfig::default()).await?;
let handle = pool.acquire().await?;
let browser = handle
.browser()
.ok_or_else(|| std::io::Error::other("browser handle already released"))?;
let mut page = browser.new_page().await?;
page.navigate(
"https://example.com",
WaitUntil::Selector("body".to_string()),
Duration::from_secs(30),
).await?;
let html = page.content().await?;
println!("Page loaded: {} bytes", html.len());
handle.release().await;
Ok(())
}Add to your Cargo.toml:
[dependencies]
stygian-graph = { version = "*", features = ["browser"] }
stygian-browser = "*" # optional, for JavaScript rendering
stygian-proxy = "*" # optional, for proxy pool management
tokio = { version = "1", features = ["full"] }For MCP integration, install the stygian-mcp binary with the extract feature for full tool coverage:
# From crates.io
cargo install stygian-mcp --features extract
# Or from source
cargo install --path crates/stygian-mcp --features extract --lockedThen wire it into your MCP client. VS Code (.vscode/mcp.json or settings.json):
{
"mcp": {
"servers": {
"stygian": {
"command": "stygian-mcp",
"args": [],
"type": "stdio"
}
}
}
}Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"stygian": {
"command": "stygian-mcp",
"args": []
}
}
}Note: Browser tools require Chrome/Chromium. On macOS:
brew install --cask google-chrome
# Minimal: HTTP scraping only
stygian-graph = "*"
# Full-featured: browser, AI extraction, distributed queue
stygian-graph = { version = "*", features = ["full"] }
# Browser + Proxy integration
stygian-browser = { version = "*", features = ["stealth", "tls-config"] }
stygian-proxy = { version = "*", features = ["browser", "socks"] }Domain Layer (business logic)
↑
Ports (trait definitions)
↑
Adapters (HTTP, browser, AI providers, storage)
- Zero I/O dependencies in domain layer
- Dependency inversion — adapters depend on ports, not vice versa
- Extreme testability — mock any external system
- Self-contained modules with clear interfaces
- Pool management with resource limits
- Graceful degradation on browser unavailability
stygian/
├── crates/
│ ├── stygian-graph/ # Scraping engine
│ ├── stygian-browser/ # Browser automation
│ ├── stygian-proxy/ # Proxy pool management
│ ├── stygian-mcp/ # MCP aggregator server
│ └── stygian-extract-derive/ # Proc-macro for #[derive(Extract)]
├── examples/ # Example pipelines
├── book/ # mdBook documentation
├── docs/ # Architecture docs
└── assets/ # Diagrams, images
# Install Rust 1.94.0+
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Build workspace
cargo build --workspace
# Run tests
cargo test --workspace
# Run clippy
cargo clippy --workspace -- -D warnings# Unit tests
cargo test --lib
# Integration tests
cargo test --test '*'
# All tests (browser integration tests require Chrome)
cargo test --all-features
# Measure coverage (requires cargo-tarpaulin)
cargo tarpaulin --workspace --all-features --ignore-tests --out Lcovstygian-graph achieves strong unit coverage across domain, ports, and adapter layers.
stygian-browser coverage is structurally bounded by the Chrome CDP requirement — all tests
that spin up a real browser are marked #[ignore = "requires Chrome"]; pure-logic tests are
fully covered.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'feat: add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
Use Conventional Commits:
feat:— new featurefix:— bug fixrefactor:— code restructuringtest:— test additions/changesdocs:— documentation updates
Dual-licensed under:
- GNU Affero General Public License v3.0 (
AGPL-3.0-only) — free for open-source use - Commercial License — available for proprietary/closed-source use
Under the AGPL, any modifications or derivative works must also be released under the AGPL-3.0, including when the software is used to provide a network service. For commercial licensing options that permit proprietary use, see LICENSE-COMMERCIAL.md.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you shall be dual-licensed as above, without any additional terms or conditions.
Built with:
- chromiumoxide — CDP client
- petgraph — graph algorithms
- tokio — async runtime
- reqwest — HTTP client
Status: Active development | Rust 2024 edition | Linux + macOS
For detailed documentation, see the project docs site.
