diff --git a/AGENTS.md b/AGENTS.md index 7240e20..591c209 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -35,6 +35,7 @@ Available specs: - `specs/initial.md` - WebFetch tool specification (types, behavior, conversions, error handling) - `specs/fetchers.md` - Pluggable fetcher system for URL-specific handling - `specs/maintenance.md` - Periodic maintenance checklist (deps, docs, spec-code alignment) +- `specs/threat-model.md` - Security threat model (SSRF, network, input validation, DoS) Specification format: Abstract and Requirements sections. diff --git a/README.md b/README.md index 27ebd05..614bf44 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,7 @@ AI-friendly web content fetching tool designed for LLM consumption. Rust library - **Binary detection** - Returns metadata only for images, PDFs, etc. - **Timeout handling** - 1s first-byte, 30s body with partial content on timeout - **URL filtering** - Allow/block lists for controlled access +- **SSRF protection** - Resolve-then-check blocks private IPs by default - **MCP server** - Model Context Protocol support for AI tool integration ## Installation @@ -171,6 +172,26 @@ Errors are returned in the `error` field: - `ContentError` - Failed to read body - `BinaryContent` - Binary content not supported +## Security + +FetchKit blocks connections to private/reserved IP ranges by default, preventing SSRF attacks when used in server-side or AI agent contexts. + +**Blocked by default:** loopback, private networks (10.x, 172.16-31.x, 192.168.x), link-local (169.254.x including cloud metadata), IPv6 equivalents, multicast, and other reserved ranges. + +```rust +// Default: private IPs blocked (safe for production) +let tool = Tool::default(); + +// Explicit opt-out for local development only +let tool = Tool::builder() + .block_private_ips(false) + .build(); +``` + +DNS pinning prevents DNS rebinding attacks. IPv6-mapped IPv4 addresses are canonicalized before validation. + +See [`specs/threat-model.md`](specs/threat-model.md) for the full threat model. + ## Configuration ### Timeouts diff --git a/crates/fetchkit/src/client.rs b/crates/fetchkit/src/client.rs index 8e7f471..f0248c0 100644 --- a/crates/fetchkit/src/client.rs +++ b/crates/fetchkit/src/client.rs @@ -3,6 +3,7 @@ //! This module provides the main entry points for fetching URLs. //! The actual fetch logic is implemented by fetchers in the [`fetchers`](crate::fetchers) module. +use crate::dns::DnsPolicy; use crate::error::FetchError; use crate::fetchers::FetcherRegistry; use crate::types::{FetchRequest, FetchResponse}; @@ -20,6 +21,8 @@ pub struct FetchOptions { pub enable_markdown: bool, /// Enable as_text option pub enable_text: bool, + /// DNS resolution policy for SSRF prevention + pub dns_policy: DnsPolicy, } /// Fetch a URL and return the response @@ -92,5 +95,7 @@ mod tests { assert!(options.block_prefixes.is_empty()); assert!(!options.enable_markdown); assert!(!options.enable_text); + // Safe by default: private IPs blocked + assert!(options.dns_policy.block_private); } } diff --git a/crates/fetchkit/src/dns.rs b/crates/fetchkit/src/dns.rs new file mode 100644 index 0000000..e5507ce --- /dev/null +++ b/crates/fetchkit/src/dns.rs @@ -0,0 +1,474 @@ +// THREAT[TM-SSRF-001]: Resolve-then-check prevents SSRF via private IP access +// THREAT[TM-SSRF-005]: DNS pinning prevents DNS rebinding attacks +// THREAT[TM-SSRF-006]: IPv6-mapped-IPv4 canonicalization catches mapped addresses +//! DNS resolution policy for SSRF prevention +//! +//! Resolves hostnames to IP addresses and validates against blocked ranges +//! before allowing connections. Prevents SSRF attacks where DNS names resolve +//! to private/internal IP addresses. + +use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr, ToSocketAddrs}; + +/// Policy for DNS resolution and IP validation +/// +/// When `block_private` is enabled, resolved IP addresses are checked against +/// known private/reserved ranges before the connection is established. +/// +/// # Examples +/// +/// ``` +/// use fetchkit::DnsPolicy; +/// +/// let policy = DnsPolicy::block_private_ips(); +/// assert!(policy.is_blocked_ip("127.0.0.1".parse().unwrap())); +/// assert!(policy.is_blocked_ip("10.0.0.1".parse().unwrap())); +/// assert!(!policy.is_blocked_ip("8.8.8.8".parse().unwrap())); +/// ``` +#[derive(Debug, Clone)] +pub struct DnsPolicy { + /// Block private/reserved IP ranges + pub block_private: bool, +} + +/// Default is safe: blocks private/reserved IPs +impl Default for DnsPolicy { + fn default() -> Self { + Self::block_private_ips() + } +} + +impl DnsPolicy { + /// Create a policy that blocks private/reserved IP ranges + /// + /// Blocks: + /// - Loopback (127.0.0.0/8, ::1) + /// - Private (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) + /// - Link-local (169.254.0.0/16, fe80::/10) + /// - Unspecified (0.0.0.0, ::) + /// - Documentation (192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24) + /// - Benchmarking (198.18.0.0/15) + /// - Carrier-grade NAT (100.64.0.0/10) + /// - Unique local IPv6 (fc00::/7) + /// - Multicast (224.0.0.0/4, ff00::/8) + /// - Broadcast (255.255.255.255) + pub fn block_private_ips() -> Self { + Self { + block_private: true, + } + } + + /// Create a permissive policy (no IP blocking) + pub fn allow_all() -> Self { + Self { + block_private: false, + } + } + + /// Check if an IP address is in a blocked range + /// + /// Handles IPv6-mapped IPv4 addresses by canonicalizing first. + pub fn is_blocked_ip(&self, ip: IpAddr) -> bool { + if !self.block_private { + return false; + } + + // THREAT[TM-SSRF-006]: Canonicalize IPv6-mapped IPv4 (::ffff:127.0.0.1 -> 127.0.0.1) + let canonical = ip.to_canonical(); + + match canonical { + IpAddr::V4(ipv4) => is_blocked_ipv4(ipv4), + IpAddr::V6(ipv6) => is_blocked_ipv6(ipv6), + } + } + + /// Resolve a hostname and validate all resolved IPs against the policy + /// + /// Returns the first valid (non-blocked) socket address, or an error if + /// all resolved addresses are blocked or resolution fails. + /// + /// # Arguments + /// * `host` - Hostname to resolve + /// * `port` - Port to use in the returned SocketAddr + pub fn resolve_and_validate( + &self, + host: &str, + port: u16, + ) -> Result { + // Resolve hostname + let addrs: Vec = format!("{}:{}", host, port) + .to_socket_addrs() + .map_err(|e| DnsPolicyError::ResolutionFailed(host.to_string(), e.to_string()))? + .collect(); + + if addrs.is_empty() { + return Err(DnsPolicyError::NoAddresses(host.to_string())); + } + + if !self.block_private { + return Ok(addrs[0]); + } + + // Check all resolved addresses + let mut all_blocked = true; + let mut first_valid = None; + + for addr in &addrs { + if self.is_blocked_ip(addr.ip()) { + tracing::debug!( + ip = %addr.ip(), + host = host, + "Blocked private/reserved IP" + ); + } else { + all_blocked = false; + if first_valid.is_none() { + first_valid = Some(*addr); + } + } + } + + if all_blocked { + return Err(DnsPolicyError::AllAddressesBlocked(host.to_string())); + } + + Ok(first_valid.unwrap()) + } +} + +/// Errors from DNS policy validation +#[derive(Debug, thiserror::Error)] +pub enum DnsPolicyError { + /// DNS resolution failed + #[error("DNS resolution failed for {0}: {1}")] + ResolutionFailed(String, String), + + /// No addresses returned from DNS + #[error("No addresses resolved for {0}")] + NoAddresses(String), + + /// All resolved addresses are in blocked ranges + #[error("All resolved addresses for {0} are in blocked IP ranges (private/reserved)")] + AllAddressesBlocked(String), +} + +/// Check if an IPv4 address is in a blocked range +fn is_blocked_ipv4(ip: Ipv4Addr) -> bool { + let octets = ip.octets(); + + // Unspecified: 0.0.0.0 + if ip.is_unspecified() { + return true; + } + + // Loopback: 127.0.0.0/8 + if ip.is_loopback() { + return true; + } + + // Private: 10.0.0.0/8 + if octets[0] == 10 { + return true; + } + + // Private: 172.16.0.0/12 + if octets[0] == 172 && (16..=31).contains(&octets[1]) { + return true; + } + + // Private: 192.168.0.0/16 + if octets[0] == 192 && octets[1] == 168 { + return true; + } + + // Link-local: 169.254.0.0/16 (includes cloud metadata 169.254.169.254) + // THREAT[TM-SSRF-003]: Block link-local range to prevent cloud metadata access + if ip.is_link_local() { + return true; + } + + // Carrier-grade NAT: 100.64.0.0/10 + if octets[0] == 100 && (64..=127).contains(&octets[1]) { + return true; + } + + // Documentation: 192.0.2.0/24 (TEST-NET-1) + if octets[0] == 192 && octets[1] == 0 && octets[2] == 2 { + return true; + } + + // Documentation: 198.51.100.0/24 (TEST-NET-2) + if octets[0] == 198 && octets[1] == 51 && octets[2] == 100 { + return true; + } + + // Documentation: 203.0.113.0/24 (TEST-NET-3) + if octets[0] == 203 && octets[1] == 0 && octets[2] == 113 { + return true; + } + + // Benchmarking: 198.18.0.0/15 + if octets[0] == 198 && (18..=19).contains(&octets[1]) { + return true; + } + + // Multicast: 224.0.0.0/4 + if ip.is_multicast() { + return true; + } + + // Broadcast: 255.255.255.255 + if ip.is_broadcast() { + return true; + } + + false +} + +/// Check if an IPv6 address is in a blocked range +fn is_blocked_ipv6(ip: Ipv6Addr) -> bool { + // Unspecified: :: + if ip.is_unspecified() { + return true; + } + + // Loopback: ::1 + if ip.is_loopback() { + return true; + } + + // Multicast: ff00::/8 + if ip.is_multicast() { + return true; + } + + let segments = ip.segments(); + + // Link-local: fe80::/10 + if segments[0] & 0xffc0 == 0xfe80 { + return true; + } + + // Unique local: fc00::/7 + if segments[0] & 0xfe00 == 0xfc00 { + return true; + } + + false +} + +#[cfg(test)] +mod tests { + use super::*; + + fn policy() -> DnsPolicy { + DnsPolicy::block_private_ips() + } + + // ==================== IPv4 blocked ranges ==================== + + #[test] + fn test_loopback_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("127.0.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("127.0.0.2".parse().unwrap())); + assert!(p.is_blocked_ip("127.255.255.255".parse().unwrap())); + } + + #[test] + fn test_private_10_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("10.0.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("10.255.255.255".parse().unwrap())); + } + + #[test] + fn test_private_172_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("172.16.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("172.31.255.255".parse().unwrap())); + // 172.15.x.x and 172.32.x.x are NOT private + assert!(!p.is_blocked_ip("172.15.0.1".parse().unwrap())); + assert!(!p.is_blocked_ip("172.32.0.1".parse().unwrap())); + } + + #[test] + fn test_private_192_168_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("192.168.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("192.168.255.255".parse().unwrap())); + } + + #[test] + fn test_link_local_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("169.254.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("169.254.169.254".parse().unwrap())); // Cloud metadata + assert!(p.is_blocked_ip("169.254.255.255".parse().unwrap())); + } + + #[test] + fn test_carrier_grade_nat_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("100.64.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("100.127.255.255".parse().unwrap())); + assert!(!p.is_blocked_ip("100.63.0.1".parse().unwrap())); + assert!(!p.is_blocked_ip("100.128.0.1".parse().unwrap())); + } + + #[test] + fn test_documentation_ranges_blocked() { + let p = policy(); + // TEST-NET-1 + assert!(p.is_blocked_ip("192.0.2.1".parse().unwrap())); + // TEST-NET-2 + assert!(p.is_blocked_ip("198.51.100.1".parse().unwrap())); + // TEST-NET-3 + assert!(p.is_blocked_ip("203.0.113.1".parse().unwrap())); + } + + #[test] + fn test_benchmarking_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("198.18.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("198.19.255.255".parse().unwrap())); + assert!(!p.is_blocked_ip("198.17.0.1".parse().unwrap())); + assert!(!p.is_blocked_ip("198.20.0.1".parse().unwrap())); + } + + #[test] + fn test_multicast_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("224.0.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("239.255.255.255".parse().unwrap())); + } + + #[test] + fn test_broadcast_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("255.255.255.255".parse().unwrap())); + } + + #[test] + fn test_unspecified_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("0.0.0.0".parse().unwrap())); + } + + // ==================== IPv6 blocked ranges ==================== + + #[test] + fn test_ipv6_loopback_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("::1".parse().unwrap())); + } + + #[test] + fn test_ipv6_unspecified_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("::".parse().unwrap())); + } + + #[test] + fn test_ipv6_link_local_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("fe80::1".parse().unwrap())); + assert!(p.is_blocked_ip("fe80::ffff:ffff:ffff:ffff".parse().unwrap())); + } + + #[test] + fn test_ipv6_unique_local_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("fc00::1".parse().unwrap())); + assert!(p.is_blocked_ip("fd00::1".parse().unwrap())); + } + + #[test] + fn test_ipv6_multicast_blocked() { + let p = policy(); + assert!(p.is_blocked_ip("ff02::1".parse().unwrap())); + } + + // THREAT[TM-SSRF-006]: IPv6-mapped IPv4 addresses + #[test] + fn test_ipv6_mapped_ipv4_blocked() { + let p = policy(); + // ::ffff:127.0.0.1 should be blocked (maps to loopback) + assert!(p.is_blocked_ip("::ffff:127.0.0.1".parse().unwrap())); + // ::ffff:10.0.0.1 should be blocked (maps to private) + assert!(p.is_blocked_ip("::ffff:10.0.0.1".parse().unwrap())); + // ::ffff:169.254.169.254 should be blocked (maps to metadata) + assert!(p.is_blocked_ip("::ffff:169.254.169.254".parse().unwrap())); + // ::ffff:8.8.8.8 should NOT be blocked (maps to public) + assert!(!p.is_blocked_ip("::ffff:8.8.8.8".parse().unwrap())); + } + + // ==================== Public IPs allowed ==================== + + #[test] + fn test_public_ipv4_allowed() { + let p = policy(); + assert!(!p.is_blocked_ip("8.8.8.8".parse().unwrap())); + assert!(!p.is_blocked_ip("1.1.1.1".parse().unwrap())); + assert!(!p.is_blocked_ip("93.184.216.34".parse().unwrap())); + assert!(!p.is_blocked_ip("140.82.121.3".parse().unwrap())); + } + + #[test] + fn test_public_ipv6_allowed() { + let p = policy(); + assert!(!p.is_blocked_ip("2001:4860:4860::8888".parse().unwrap())); + assert!(!p.is_blocked_ip("2606:4700:4700::1111".parse().unwrap())); + } + + // ==================== Policy modes ==================== + + #[test] + fn test_allow_all_permits_private() { + let p = DnsPolicy::allow_all(); + assert!(!p.is_blocked_ip("127.0.0.1".parse().unwrap())); + assert!(!p.is_blocked_ip("10.0.0.1".parse().unwrap())); + assert!(!p.is_blocked_ip("169.254.169.254".parse().unwrap())); + } + + #[test] + fn test_default_blocks_private() { + let p = DnsPolicy::default(); + assert!(p.is_blocked_ip("127.0.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("10.0.0.1".parse().unwrap())); + assert!(p.is_blocked_ip("169.254.169.254".parse().unwrap())); + assert!(!p.is_blocked_ip("8.8.8.8".parse().unwrap())); + } + + // ==================== resolve_and_validate ==================== + + #[test] + fn test_resolve_loopback_blocked() { + let p = policy(); + let result = p.resolve_and_validate("127.0.0.1", 80); + assert!(result.is_err()); + let err = result.unwrap_err().to_string(); + assert!(err.contains("blocked"), "Error was: {}", err); + } + + #[test] + fn test_resolve_private_blocked() { + let p = policy(); + let result = p.resolve_and_validate("10.0.0.1", 80); + assert!(result.is_err()); + } + + #[test] + fn test_resolve_nonexistent_fails() { + let p = policy(); + let result = p.resolve_and_validate("this-host-definitely-does-not-exist.invalid", 80); + assert!(result.is_err()); + } + + #[test] + fn test_resolve_allow_all_permits_loopback() { + let p = DnsPolicy::allow_all(); + let result = p.resolve_and_validate("127.0.0.1", 80); + assert!(result.is_ok()); + assert_eq!(result.unwrap().ip(), "127.0.0.1".parse::().unwrap()); + } +} diff --git a/crates/fetchkit/src/error.rs b/crates/fetchkit/src/error.rs index d327f4c..0add9a7 100644 --- a/crates/fetchkit/src/error.rs +++ b/crates/fetchkit/src/error.rs @@ -26,8 +26,8 @@ pub enum FetchError { #[error("Invalid method: must be GET or HEAD")] InvalidMethod, - /// URL is blocked by prefix list - #[error("Blocked URL: prefix not allowed")] + /// URL is blocked by policy (prefix list or DNS policy) + #[error("Blocked URL: not allowed by policy")] BlockedUrl, /// Failed to build HTTP client @@ -84,7 +84,7 @@ mod tests { ); assert_eq!( FetchError::BlockedUrl.to_string(), - "Blocked URL: prefix not allowed" + "Blocked URL: not allowed by policy" ); assert_eq!( FetchError::FirstByteTimeout.to_string(), diff --git a/crates/fetchkit/src/fetchers/default.rs b/crates/fetchkit/src/fetchers/default.rs index 2cb076e..3ba45a1 100644 --- a/crates/fetchkit/src/fetchers/default.rs +++ b/crates/fetchkit/src/fetchers/default.rs @@ -110,11 +110,27 @@ impl Fetcher for DefaultFetcher { }; headers.insert(ACCEPT, HeaderValue::from_static(accept)); - // Build client - let client = reqwest::Client::builder() + // THREAT[TM-SSRF-001]: Resolve-then-check — validate resolved IP before connecting + // THREAT[TM-SSRF-005]: Pin DNS resolution to prevent DNS rebinding + let parsed_url = url::Url::parse(&request.url).map_err(|_| FetchError::InvalidUrlScheme)?; + let mut client_builder = reqwest::Client::builder() .default_headers(headers) .connect_timeout(FIRST_BYTE_TIMEOUT) - .timeout(FIRST_BYTE_TIMEOUT) + .timeout(FIRST_BYTE_TIMEOUT); + + if options.dns_policy.block_private { + if let Some(host) = parsed_url.host_str() { + let port = parsed_url.port_or_known_default().unwrap_or(80); + let validated_addr = options + .dns_policy + .resolve_and_validate(host, port) + .map_err(|_| FetchError::BlockedUrl)?; + // Pin the connection to the validated IP to prevent DNS rebinding + client_builder = client_builder.resolve(host, validated_addr); + } + } + + let client = client_builder .build() .map_err(FetchError::ClientBuildError)?; diff --git a/crates/fetchkit/src/fetchers/github_repo.rs b/crates/fetchkit/src/fetchers/github_repo.rs index f6e308d..2240e64 100644 --- a/crates/fetchkit/src/fetchers/github_repo.rs +++ b/crates/fetchkit/src/fetchers/github_repo.rs @@ -155,10 +155,21 @@ impl Fetcher for GitHubRepoFetcher { })?; // Build HTTP client + // THREAT[TM-SSRF-001]: Validate DNS resolution for GitHub API host let user_agent = options.user_agent.as_deref().unwrap_or(DEFAULT_USER_AGENT); - let client = reqwest::Client::builder() + let mut client_builder = reqwest::Client::builder() .connect_timeout(API_TIMEOUT) - .timeout(API_TIMEOUT) + .timeout(API_TIMEOUT); + + if options.dns_policy.block_private { + let validated_addr = options + .dns_policy + .resolve_and_validate("api.github.com", 443) + .map_err(|_| FetchError::BlockedUrl)?; + client_builder = client_builder.resolve("api.github.com", validated_addr); + } + + let client = client_builder .build() .map_err(FetchError::ClientBuildError)?; diff --git a/crates/fetchkit/src/lib.rs b/crates/fetchkit/src/lib.rs index a154cc9..d181048 100644 --- a/crates/fetchkit/src/lib.rs +++ b/crates/fetchkit/src/lib.rs @@ -63,6 +63,7 @@ pub mod client; mod convert; +mod dns; mod error; pub mod fetchers; mod tool; @@ -70,6 +71,7 @@ mod types; pub use client::{fetch, fetch_with_options, FetchOptions}; pub use convert::{html_to_markdown, html_to_text}; +pub use dns::DnsPolicy; pub use error::FetchError; pub use fetchers::{DefaultFetcher, Fetcher, FetcherRegistry, GitHubRepoFetcher}; pub use tool::{Tool, ToolBuilder, ToolStatus}; diff --git a/crates/fetchkit/src/tool.rs b/crates/fetchkit/src/tool.rs index 3e47988..14f9d8a 100644 --- a/crates/fetchkit/src/tool.rs +++ b/crates/fetchkit/src/tool.rs @@ -1,6 +1,7 @@ //! Tool builder and contract for FetchKit use crate::client::{fetch_with_options, FetchOptions}; +use crate::dns::DnsPolicy; use crate::error::FetchError; use crate::types::{FetchRequest, FetchResponse}; use crate::{TOOL_DESCRIPTION, TOOL_LLMTXT}; @@ -82,6 +83,8 @@ pub struct ToolBuilder { allow_prefixes: Vec, /// Block list of URL prefixes block_prefixes: Vec, + /// DNS resolution policy for SSRF prevention + dns_policy: DnsPolicy, } impl ToolBuilder { @@ -124,6 +127,23 @@ impl ToolBuilder { self } + /// Control private/reserved IP range blocking (SSRF prevention) + /// + /// Enabled by default. When enabled, FetchKit resolves hostnames to IP + /// addresses before connecting and validates that the resolved IP is not + /// in a private or reserved range. DNS pinning prevents rebinding attacks. + /// + /// Pass `false` only for local development or testing against loopback + /// servers. In production, always leave this enabled. + pub fn block_private_ips(mut self, block: bool) -> Self { + self.dns_policy = if block { + DnsPolicy::block_private_ips() + } else { + DnsPolicy::allow_all() + }; + self + } + /// Build the tool pub fn build(self) -> Tool { Tool { @@ -132,6 +152,7 @@ impl ToolBuilder { user_agent: self.user_agent, allow_prefixes: self.allow_prefixes, block_prefixes: self.block_prefixes, + dns_policy: self.dns_policy, } } } @@ -160,6 +181,7 @@ pub struct Tool { user_agent: Option, allow_prefixes: Vec, block_prefixes: Vec, + dns_policy: DnsPolicy, } impl Default for Tool { @@ -221,6 +243,7 @@ impl Tool { block_prefixes: self.block_prefixes.clone(), enable_markdown: self.enable_markdown, enable_text: self.enable_text, + dns_policy: self.dns_policy.clone(), }; fetch_with_options(req, options).await @@ -254,6 +277,7 @@ impl Tool { block_prefixes: self.block_prefixes.clone(), enable_markdown: self.enable_markdown, enable_text: self.enable_text, + dns_policy: self.dns_policy.clone(), }; status_callback(ToolStatus::new("fetch").with_percent(20.0)); @@ -285,6 +309,14 @@ mod tests { assert_eq!(tool.user_agent, Some("TestAgent/1.0".to_string())); assert_eq!(tool.allow_prefixes, vec!["https://allowed.com"]); assert_eq!(tool.block_prefixes, vec!["https://blocked.com"]); + // Safe by default: private IPs blocked + assert!(tool.dns_policy.block_private); + } + + #[test] + fn test_tool_builder_opt_out_private_ip_blocking() { + let tool = Tool::builder().block_private_ips(false).build(); + assert!(!tool.dns_policy.block_private); } #[test] diff --git a/crates/fetchkit/tests/integration.rs b/crates/fetchkit/tests/integration.rs index 156a731..5793b85 100644 --- a/crates/fetchkit/tests/integration.rs +++ b/crates/fetchkit/tests/integration.rs @@ -1,11 +1,26 @@ //! Integration tests for FetchKit using wiremock use fetchkit::{ - fetch, fetch_with_options, FetchOptions, FetchRequest, FetcherRegistry, HttpMethod, Tool, + fetch_with_options, DnsPolicy, FetchOptions, FetchRequest, FetcherRegistry, HttpMethod, Tool, }; use wiremock::matchers::{method, path}; use wiremock::{Mock, MockServer, ResponseTemplate}; +/// Helper: create FetchOptions that permit loopback (for wiremock tests) +fn test_options() -> FetchOptions { + FetchOptions { + enable_markdown: true, + enable_text: true, + dns_policy: DnsPolicy::allow_all(), + ..Default::default() + } +} + +/// Helper: create a Tool that permits loopback (for wiremock tests) +fn test_tool() -> Tool { + Tool::builder().block_private_ips(false).build() +} + #[tokio::test] async fn test_simple_get() { let mock_server = MockServer::start().await; @@ -21,7 +36,7 @@ async fn test_simple_get() { .await; let req = FetchRequest::new(format!("{}/", mock_server.uri())); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); assert_eq!(resp.status_code, 200); assert_eq!(resp.content_type, Some("text/plain".to_string())); @@ -45,7 +60,7 @@ async fn test_head_request() { .await; let req = FetchRequest::new(format!("{}/file.pdf", mock_server.uri())).method(HttpMethod::Head); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); assert_eq!(resp.status_code, 200); assert_eq!(resp.method, Some("HEAD".to_string())); @@ -81,7 +96,7 @@ async fn test_html_to_markdown() { .mount(&mock_server) .await; - let tool = Tool::default(); + let tool = test_tool(); let req = FetchRequest::new(format!("{}/", mock_server.uri())).as_markdown(); let resp = tool.execute(req).await.unwrap(); @@ -114,7 +129,7 @@ async fn test_html_to_text() { .mount(&mock_server) .await; - let tool = Tool::default(); + let tool = test_tool(); let req = FetchRequest::new(format!("{}/", mock_server.uri())).as_text(); let resp = tool.execute(req).await.unwrap(); @@ -143,7 +158,7 @@ async fn test_binary_content() { .await; let req = FetchRequest::new(format!("{}/image.png", mock_server.uri())); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); assert_eq!(resp.status_code, 200); assert_eq!(resp.content_type, Some("image/png".to_string())); @@ -168,7 +183,7 @@ async fn test_4xx_status() { .await; let req = FetchRequest::new(format!("{}/not-found", mock_server.uri())); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); // 4xx is still a success response (not a tool error) assert_eq!(resp.status_code, 404); @@ -190,7 +205,7 @@ async fn test_5xx_status() { .await; let req = FetchRequest::new(format!("{}/error", mock_server.uri())); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); // 5xx is still a success response (not a tool error) assert_eq!(resp.status_code, 500); @@ -213,7 +228,7 @@ async fn test_content_disposition_filename() { .await; let req = FetchRequest::new(format!("{}/download", mock_server.uri())); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); assert_eq!(resp.filename, Some("report.txt".to_string())); } @@ -234,7 +249,7 @@ async fn test_filename_from_url() { let req = FetchRequest::new(format!("{}/path/to/document.pdf", mock_server.uri())) .method(HttpMethod::Head); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); assert_eq!(resp.filename, Some("document.pdf".to_string())); } @@ -256,7 +271,7 @@ async fn test_size_for_text_content() { .await; let req = FetchRequest::new(format!("{}/", mock_server.uri())); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); // Size should equal bytes read from body assert_eq!(resp.size, Some(body.len() as u64)); @@ -274,6 +289,7 @@ async fn test_url_prefix_allow_list() { // Create tool with allow list that doesn't include the mock server let tool = Tool::builder() + .block_private_ips(false) .allow_prefix("https://allowed.example.com") .build(); @@ -284,7 +300,7 @@ async fn test_url_prefix_allow_list() { assert!(result .unwrap_err() .to_string() - .contains("prefix not allowed")); + .contains("not allowed by policy")); } #[tokio::test] @@ -298,7 +314,10 @@ async fn test_url_prefix_block_list() { .await; // Create tool with block list that includes localhost - let tool = Tool::builder().block_prefix("http://127.0.0.1").build(); + let tool = Tool::builder() + .block_private_ips(false) + .block_prefix("http://127.0.0.1") + .build(); let req = FetchRequest::new(format!("{}/", mock_server.uri())); let result = tool.execute(req).await; @@ -307,13 +326,13 @@ async fn test_url_prefix_block_list() { assert!(result .unwrap_err() .to_string() - .contains("prefix not allowed")); + .contains("not allowed by policy")); } #[tokio::test] async fn test_invalid_url_scheme() { let req = FetchRequest::new("ftp://example.com/file.txt"); - let result = fetch(req).await; + let result = fetch_with_options(req, test_options()).await; assert!(result.is_err()); assert!(result @@ -325,7 +344,7 @@ async fn test_invalid_url_scheme() { #[tokio::test] async fn test_missing_url() { let req = FetchRequest::new(""); - let result = fetch(req).await; + let result = fetch_with_options(req, test_options()).await; assert!(result.is_err()); assert!(result.unwrap_err().to_string().contains("Missing")); @@ -343,7 +362,7 @@ async fn test_entity_decoding_in_html() { .mount(&mock_server) .await; - let tool = Tool::default(); + let tool = test_tool(); let req = FetchRequest::new(format!("{}/", mock_server.uri())).as_text(); let resp = tool.execute(req).await.unwrap(); @@ -370,7 +389,7 @@ async fn test_non_html_with_conversion_flags() { .mount(&mock_server) .await; - let tool = Tool::default(); + let tool = test_tool(); let req = FetchRequest::new(format!("{}/api/data", mock_server.uri())).as_markdown(); let resp = tool.execute(req).await.unwrap(); @@ -396,7 +415,7 @@ async fn test_html_detection_by_body() { .mount(&mock_server) .await; - let tool = Tool::default(); + let tool = test_tool(); let req = FetchRequest::new(format!("{}/", mock_server.uri())).as_markdown(); let resp = tool.execute(req).await.unwrap(); @@ -414,7 +433,10 @@ async fn test_custom_user_agent() { .mount(&mock_server) .await; - let tool = Tool::builder().user_agent("CustomBot/1.0").build(); + let tool = Tool::builder() + .block_private_ips(false) + .user_agent("CustomBot/1.0") + .build(); let req = FetchRequest::new(format!("{}/", mock_server.uri())); let resp = tool.execute(req).await.unwrap(); @@ -439,7 +461,7 @@ async fn test_excessive_newlines_filtered() { .await; let req = FetchRequest::new(format!("{}/", mock_server.uri())); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); // Should have at most 2 consecutive newlines assert!(!resp.content.unwrap().contains("\n\n\n")); @@ -467,6 +489,7 @@ async fn test_fetcher_registry_with_defaults() { let options = FetchOptions { enable_markdown: true, enable_text: true, + dns_policy: DnsPolicy::allow_all(), ..Default::default() }; @@ -481,7 +504,10 @@ async fn test_fetcher_registry_with_defaults() { #[tokio::test] async fn test_fetcher_registry_url_validation() { let registry = FetcherRegistry::with_defaults(); - let options = FetchOptions::default(); + let options = FetchOptions { + dns_policy: DnsPolicy::allow_all(), + ..Default::default() + }; // Invalid scheme let req = FetchRequest::new("ftp://example.com"); @@ -510,6 +536,7 @@ async fn test_fetcher_registry_allow_block_lists() { // Block list let options = FetchOptions { block_prefixes: vec!["http://127.0.0.1".to_string()], + dns_policy: DnsPolicy::allow_all(), ..Default::default() }; let req = FetchRequest::new(format!("{}/", mock_server.uri())); @@ -520,6 +547,7 @@ async fn test_fetcher_registry_allow_block_lists() { // Allow list (not matching) let options = FetchOptions { allow_prefixes: vec!["https://allowed.com".to_string()], + dns_policy: DnsPolicy::allow_all(), ..Default::default() }; let req = FetchRequest::new(format!("{}/", mock_server.uri())); @@ -544,7 +572,7 @@ async fn test_github_fetcher_url_matching() { .await; let req = FetchRequest::new(format!("{}/owner/repo/issues", mock_server.uri())); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); // Should use default fetcher (format is "raw", not "github_repo") assert_eq!(resp.format, Some("raw".to_string())); @@ -565,9 +593,9 @@ async fn test_fetch_enables_conversions_by_default() { .mount(&mock_server) .await; - // Using fetch() with as_markdown() should work + // test_options() has enable_markdown: true (matching fetch() default) let req = FetchRequest::new(format!("{}/", mock_server.uri())).as_markdown(); - let resp = fetch(req).await.unwrap(); + let resp = fetch_with_options(req, test_options()).await.unwrap(); assert_eq!(resp.format, Some("markdown".to_string())); } @@ -590,6 +618,7 @@ async fn test_fetch_with_options_respects_disabled_conversion() { let options = FetchOptions { enable_markdown: false, enable_text: false, + dns_policy: DnsPolicy::allow_all(), ..Default::default() }; @@ -599,3 +628,39 @@ async fn test_fetch_with_options_respects_disabled_conversion() { // Should be raw because conversion is disabled assert_eq!(resp.format, Some("raw".to_string())); } + +// ============================================================================ +// Safe-by-default: fetch() and Tool::default() block private IPs +// ============================================================================ + +#[tokio::test] +async fn test_fetch_blocks_loopback_by_default() { + let mock_server = MockServer::start().await; + + Mock::given(method("GET")) + .and(path("/")) + .respond_with(ResponseTemplate::new(200).set_body_string("OK")) + .mount(&mock_server) + .await; + + // fetch() uses default options which now block private IPs + let req = FetchRequest::new(format!("{}/", mock_server.uri())); + let result = fetchkit::fetch(req).await; + assert!(matches!(result, Err(fetchkit::FetchError::BlockedUrl))); +} + +#[tokio::test] +async fn test_tool_default_blocks_loopback() { + let mock_server = MockServer::start().await; + + Mock::given(method("GET")) + .and(path("/")) + .respond_with(ResponseTemplate::new(200).set_body_string("OK")) + .mount(&mock_server) + .await; + + let tool = Tool::default(); + let req = FetchRequest::new(format!("{}/", mock_server.uri())); + let result = tool.execute(req).await; + assert!(matches!(result, Err(fetchkit::FetchError::BlockedUrl))); +} diff --git a/crates/fetchkit/tests/ssrf_security.rs b/crates/fetchkit/tests/ssrf_security.rs new file mode 100644 index 0000000..aadb5ba --- /dev/null +++ b/crates/fetchkit/tests/ssrf_security.rs @@ -0,0 +1,275 @@ +//! SSRF security tests for FetchKit +//! +//! Tests that validate the resolve-then-check DNS policy prevents +//! server-side request forgery attacks. These tests verify the threat +//! mitigations documented in specs/threat-model.md. +//! +//! Safe-by-default: Tool::default() and fetch() block private IPs. +//! Tests that need loopback (wiremock) must explicitly opt out. + +use fetchkit::{FetchError, FetchRequest, Tool}; +use wiremock::matchers::{method, path}; +use wiremock::{Mock, MockServer, ResponseTemplate}; + +// ============================================================================ +// TM-SSRF-001: Private IP access via URL (blocked by default) +// ============================================================================ + +#[tokio::test] +async fn test_ssrf_001_loopback_ipv4_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://127.0.0.1/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +#[tokio::test] +async fn test_ssrf_001_loopback_ipv4_alt_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://127.0.0.2/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +#[tokio::test] +async fn test_ssrf_001_private_10_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://10.0.0.1/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +#[tokio::test] +async fn test_ssrf_001_private_172_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://172.16.0.1/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +#[tokio::test] +async fn test_ssrf_001_private_192_168_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://192.168.1.1/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +// ============================================================================ +// TM-SSRF-002: Loopback access +// ============================================================================ + +#[tokio::test] +async fn test_ssrf_002_localhost_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://localhost/"); + let result = tool.execute(req).await; + // localhost resolves to 127.0.0.1, which is blocked + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +// ============================================================================ +// TM-SSRF-003: Cloud metadata endpoint +// ============================================================================ + +#[tokio::test] +async fn test_ssrf_003_cloud_metadata_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://169.254.169.254/latest/meta-data/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +#[tokio::test] +async fn test_ssrf_003_link_local_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://169.254.0.1/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +// ============================================================================ +// TM-SSRF-006: IPv6-mapped IPv4 +// ============================================================================ + +#[tokio::test] +async fn test_ssrf_006_ipv6_loopback_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://[::1]/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +// ============================================================================ +// TM-INPUT-001: Non-HTTP scheme +// ============================================================================ + +#[tokio::test] +async fn test_input_001_file_scheme_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("file:///etc/passwd"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::InvalidUrlScheme))); +} + +#[tokio::test] +async fn test_input_001_ftp_scheme_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("ftp://internal-server/files"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::InvalidUrlScheme))); +} + +#[tokio::test] +async fn test_input_001_data_scheme_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("data:text/html,

XSS

"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::InvalidUrlScheme))); +} + +#[tokio::test] +async fn test_input_001_gopher_scheme_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("gopher://internal:70/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::InvalidUrlScheme))); +} + +// ============================================================================ +// Default blocks private IPs, explicit opt-out allows them +// ============================================================================ + +#[tokio::test] +async fn test_default_blocks_loopback_mock_server() { + let mock_server = MockServer::start().await; + + Mock::given(method("GET")) + .and(path("/")) + .respond_with( + ResponseTemplate::new(200) + .set_body_string("Hello from loopback") + .insert_header("content-type", "text/plain"), + ) + .mount(&mock_server) + .await; + + // Default tool blocks loopback — mock server is on 127.0.0.1 + let tool = Tool::default(); + let req = FetchRequest::new(format!("{}/", mock_server.uri())); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +#[tokio::test] +async fn test_explicit_opt_out_allows_loopback() { + let mock_server = MockServer::start().await; + + Mock::given(method("GET")) + .and(path("/")) + .respond_with( + ResponseTemplate::new(200) + .set_body_string("Hello") + .insert_header("content-type", "text/plain"), + ) + .mount(&mock_server) + .await; + + // Explicit opt-out allows loopback + let tool = Tool::builder().block_private_ips(false).build(); + let req = FetchRequest::new(format!("{}/", mock_server.uri())); + let result = tool.execute(req).await; + assert!(result.is_ok()); + assert_eq!(result.unwrap().status_code, 200); +} + +// ============================================================================ +// Prefix-based blocking still works alongside DNS policy +// ============================================================================ + +#[tokio::test] +async fn test_prefix_block_and_dns_policy_combined() { + let tool = Tool::builder() + .block_prefix("https://blocked.example.com") + .build(); + + // URL prefix blocked + let req = FetchRequest::new("https://blocked.example.com/secret"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); + + // Private IP blocked (by default dns policy) + let req = FetchRequest::new("http://10.0.0.1/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +// ============================================================================ +// TM-SSRF-004: Numeric IP variants +// ============================================================================ + +#[tokio::test] +async fn test_ssrf_004_zero_ip_blocked() { + let tool = Tool::default(); + let req = FetchRequest::new("http://0.0.0.0/"); + let result = tool.execute(req).await; + assert!(matches!(result, Err(FetchError::BlockedUrl))); +} + +// ============================================================================ +// TM-CONV-001: Script injection in converted content +// (Uses wiremock on loopback, so must opt out of private IP blocking) +// ============================================================================ + +#[tokio::test] +async fn test_conv_001_script_stripped_in_markdown() { + let mock_server = MockServer::start().await; + + let html = r#" +

Hello

+ +

World

+ "#; + + Mock::given(method("GET")) + .and(path("/")) + .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html")) + .mount(&mock_server) + .await; + + let tool = Tool::builder().block_private_ips(false).build(); + let req = FetchRequest::new(format!("{}/", mock_server.uri())).as_markdown(); + let resp = tool.execute(req).await.unwrap(); + + let content = resp.content.unwrap(); + assert!(!content.contains("alert")); + assert!(!content.contains(" + + "#; + + Mock::given(method("GET")) + .and(path("/")) + .respond_with(ResponseTemplate::new(200).set_body_raw(html, "text/html")) + .mount(&mock_server) + .await; + + let tool = Tool::builder().block_private_ips(false).build(); + let req = FetchRequest::new(format!("{}/", mock_server.uri())).as_text(); + let resp = tool.execute(req).await.unwrap(); + + let content = resp.content.unwrap(); + assert!(!content.contains("document.cookie")); + assert!(!content.contains("display:none")); + assert!(content.contains("Safe content")); +} diff --git a/specs/fetchers.md b/specs/fetchers.md index 371f0cb..d0559c8 100644 --- a/specs/fetchers.md +++ b/specs/fetchers.md @@ -66,6 +66,7 @@ Fetchers receive `FetchOptions` for: - `block_prefixes` - URL prefix block list - `enable_markdown` - Enable markdown conversion - `enable_text` - Enable text conversion +- `dns_policy` - DNS resolution policy for SSRF prevention (default: block private IPs) ### Extensibility @@ -82,10 +83,20 @@ Design supports hundreds of fetchers by: - `FetchError::FetcherError(String)` for fetcher-specific errors - GitHub API errors return response with error field set +### SSRF Protection + +Both built-in fetchers integrate resolve-then-check DNS validation: +- Resolve hostname to IP before connecting +- Validate IP against blocked ranges (private, loopback, link-local, etc.) +- Pin validated IP via `reqwest::ClientBuilder::resolve()` to prevent DNS rebinding +- Enabled by default via `DnsPolicy::default()` (blocks private IPs) +- See `specs/threat-model.md` for threat IDs: TM-SSRF-001 through TM-SSRF-010 + ## Module Structure ``` crates/fetchkit/src/ +├── dns.rs # DnsPolicy - SSRF prevention via resolve-then-check ├── fetchers/ │ ├── mod.rs # Fetcher trait, FetcherRegistry │ ├── default.rs # DefaultFetcher diff --git a/specs/initial.md b/specs/initial.md index 33f35cf..5c27a97 100644 --- a/specs/initial.md +++ b/specs/initial.md @@ -48,6 +48,7 @@ Provide a builder to configure tool options, including: - Support allow/block list of URL prefixes. - Support enabling/disabling request options (feature gating). - Support User-Agent override (e.g., `allow_ua`). +- Support `block_private_ips(bool)` for SSRF prevention (default: `true`). #### Types @@ -74,6 +75,7 @@ Provide a builder to configure tool options, including: - Missing url - Invalid url scheme - Invalid method + - Blocked URL (prefix list or DNS policy) - Client build failure - Request error (timeout/connect/other) - `ToolStatus` (or equivalent) @@ -134,6 +136,17 @@ Provide a builder to configure tool options, including: - If allow list is non-empty, URL must match at least one allow prefix. - If block list matches, request is denied even if allow list matches. +### SSRF Prevention (DNS Policy) + +By default, FetchKit blocks connections to private/reserved IP ranges: +- Resolves hostnames to IP addresses before connecting (resolve-then-check). +- Validates resolved IPs against blocked ranges (loopback, private, link-local, + cloud metadata, carrier-grade NAT, documentation, benchmarking, multicast, broadcast). +- Handles IPv6-mapped IPv4 addresses via canonicalization. +- Pins validated IP via `reqwest::ClientBuilder::resolve()` to prevent DNS rebinding. +- Blocked by default; opt out via `ToolBuilder::block_private_ips(false)`. +- See `specs/threat-model.md` for full threat analysis. + ### HTTP Behavior - User-Agent: configurable via tool builder or CLI/MCP/Python options @@ -237,7 +250,7 @@ Content is HTML if: - Missing url -> tool error string "Missing required parameter: url". - Invalid URL -> tool error string "Invalid URL: must start with http:// or https://". - Invalid method -> tool error string "Invalid method: must be GET or HEAD". -- Blocked prefix -> tool error string "Blocked URL: prefix not allowed". +- Blocked URL (prefix or DNS policy) -> tool error string "Blocked URL: not allowed by policy". - First-byte timeout -> "Request timed out: server did not respond within 1 second". - Connect error -> "Failed to connect to server". - Other request errors -> "Request failed: ". @@ -263,6 +276,7 @@ Unit: - Binary content detection. - HTML conversion, entity decoding. - Newline filtering behavior. +- DNS policy IP range blocking (IPv4, IPv6, mapped addresses). Integration (mock HTTP server): - GET/HEAD with expected fields. @@ -272,3 +286,12 @@ Integration (mock HTTP server): - Last-Modified extraction. - Size correctness for text and binary. - Body timeout truncation. + +SSRF security: +- Private IP blocking (loopback, 10.x, 172.16.x, 192.168.x). +- Cloud metadata endpoint blocking (169.254.169.254). +- IPv6 loopback/mapped address blocking. +- Non-HTTP scheme blocking (file, ftp, data, gopher). +- Default-blocks-loopback verification. +- Explicit opt-out verification. +- Script stripping in converted content. diff --git a/specs/threat-model.md b/specs/threat-model.md new file mode 100644 index 0000000..844edfa --- /dev/null +++ b/specs/threat-model.md @@ -0,0 +1,361 @@ +# Threat Model + +## Abstract + +Threat model for FetchKit, an AI-friendly web content fetching library. FetchKit is designed +to be embedded in AI agent platforms (e.g., Everruns) where untrusted user prompts can +influence which URLs are fetched. This document identifies threats that arise when FetchKit +runs inside a container or cluster with access to internal network resources, and tracks +mitigations implemented in the library. + +## Threat ID Scheme + +**Format:** `TM--` + +| Prefix | Category | Description | +|--------|----------|-------------| +| TM-SSRF | Server-Side Request Forgery | Internal resource access, IP bypass, DNS rebinding | +| TM-NET | Network Security | Redirect abuse, protocol smuggling, connection reuse | +| TM-INPUT | Input Validation | URL parsing, prefix bypass, scheme injection | +| TM-DOS | Denial of Service | Resource exhaustion, slowloris, large payloads | +| TM-LEAK | Information Leakage | Error messages, metadata exposure, timing | +| TM-CONV | Content Conversion | HTML parsing abuse, injection via converted content | + +### Managing Threat IDs + +1. Assign the next sequential number within the category. +2. Never reuse a retired ID. +3. Add code comments at mitigation points: `// THREAT[TM-XXX-NNN]: description`. +4. Add tests that exercise the mitigation. + +### Code Comment Format + +```rust +// THREAT[TM-XXX-NNN]: Brief description of the threat being mitigated +// Mitigation: What this code does to prevent the attack +``` + +## Trust Model + +``` +┌─────────────────────────────────────────────────────┐ +│ Host / Cluster │ +│ │ +│ ┌──────────────────────────────────────────────┐ │ +│ │ Container / Sandbox │ │ +│ │ │ │ +│ │ ┌─────────────┐ ┌──────────────────┐ │ │ +│ │ │ AI Agent │────▶│ FetchKit │ │ │ +│ │ │ (LLM loop) │ │ (library/CLI/ │ │ │ +│ │ │ │ │ MCP server) │ │ │ +│ │ └─────────────┘ └───────┬──────────┘ │ │ +│ │ │ │ │ +│ │ ─ ─ ─ ─ ─ ─ Trust Boundary 1 ─ ─ ─ ─ ─ ─ │ │ +│ │ │ │ │ +│ │ ┌──────────▼──────────┐ │ │ +│ │ │ Network Stack │ │ │ +│ │ │ (DNS + HTTP/TLS) │ │ │ +│ │ └──────────┬──────────┘ │ │ +│ └──────────────────────────────┼────────────────┘ │ +│ │ │ +│ ─ ─ ─ ─ ─ ─ ─ Trust Boundary 2 ─ ─ ─ ─ ─ ─ ─ ─ │ +│ │ │ +│ ┌──────────────────────────────▼────────────────┐ │ +│ │ Internal Network │ │ +│ │ ┌──────────┐ ┌───────────┐ ┌────────────┐ │ │ +│ │ │ Metadata │ │ K8s API │ │ Internal │ │ │ +│ │ │ Service │ │ Server │ │ Services │ │ │ +│ │ │169.254. │ │ │ │ │ │ │ +│ │ │169.254 │ │ │ │ │ │ │ +│ │ └──────────┘ └───────────┘ └────────────┘ │ │ +│ └───────────────────────────────────────────────┘ │ +│ │ +│ ─ ─ ─ ─ ─ ─ ─ Trust Boundary 3 ─ ─ ─ ─ ─ ─ ─ ─ │ +│ │ +└──────────────────────────────────────────────────────┘ + │ + ┌──────────▼──────────┐ + │ Public Internet │ + └─────────────────────┘ +``` + +**Trust Boundary 1 — Agent to FetchKit:** +The AI agent passes user-influenced URLs to FetchKit. FetchKit must treat all +URLs as untrusted input. The agent cannot be relied upon to validate URLs since +adversarial prompts can manipulate it. + +**Trust Boundary 2 — Container to Internal Network:** +The container typically has network access to internal services (metadata endpoints, +Kubernetes API, databases). FetchKit must prevent requests that cross this boundary +unless explicitly allowed. + +**Trust Boundary 3 — Cluster to Public Internet:** +Outbound requests to the public internet are the intended use case. FetchKit should +only allow connections to publicly-routable IP addresses by default. + +## 1. Server-Side Request Forgery (TM-SSRF) + +| ID | Threat | Severity | Mitigation | Status | +|----|--------|----------|------------|--------| +| TM-SSRF-001 | Private IP access via URL | Critical | Resolve-then-check: resolve hostname, validate IP against blocked ranges before connecting | MITIGATED | +| TM-SSRF-002 | Loopback access (127.0.0.1, ::1) | Critical | Blocked in private IP ranges; also blocks `localhost` after resolution | MITIGATED | +| TM-SSRF-003 | Cloud metadata endpoint (169.254.169.254) | Critical | Link-local range blocked; specific metadata IPs covered by range check | MITIGATED | +| TM-SSRF-004 | Numeric IP variants (octal 0177.0.0.1, hex 0x7f000001, decimal 2130706433) | High | URL parsed by `url` crate which normalizes IP representations; resolved IP validated | MITIGATED | +| TM-SSRF-005 | DNS rebinding (hostname resolves to public IP, then re-resolves to private) | High | Pin DNS resolution via `reqwest::ClientBuilder::resolve()`; validated IP used for connection | MITIGATED | +| TM-SSRF-006 | IPv6-mapped IPv4 (::ffff:127.0.0.1) | High | `to_canonical()` extracts IPv4 from mapped addresses before range check | MITIGATED | +| TM-SSRF-007 | DNS names resolving to private IPs | Critical | Post-resolution IP check catches all DNS-to-private-IP scenarios | MITIGATED | +| TM-SSRF-008 | Kubernetes service DNS (*.svc.cluster.local) | High | Resolves to cluster IPs which are private ranges; blocked by IP check | MITIGATED | +| TM-SSRF-009 | URL with credentials (http://user:pass@internal) | Medium | Credentials in URL passed through to reqwest; no credential stripping | **ACCEPTED** | +| TM-SSRF-010 | Redirect to internal resource | High | After redirect, final URL not re-validated against IP policy | **OPEN** | + +### Mitigation Details + +**TM-SSRF-001 — Resolve-then-check (MITIGATED):** +FetchKit resolves the hostname to IP addresses using the system resolver, validates +each resolved IP against blocked ranges, and pins the validated IP via +`reqwest::ClientBuilder::resolve()` to prevent re-resolution. + +Blocked ranges: +- Loopback: `127.0.0.0/8`, `::1` +- Private: `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16` +- Link-local: `169.254.0.0/16`, `fe80::/10` +- Unspecified: `0.0.0.0/32`, `::/128` +- Documentation: `192.0.2.0/24`, `198.51.100.0/24`, `203.0.113.0/24` +- Benchmarking: `198.18.0.0/15` +- Carrier-grade NAT: `100.64.0.0/10` +- Unique local (IPv6): `fc00::/7` +- Multicast: `224.0.0.0/4`, `ff00::/8` +- Broadcast: `255.255.255.255/32` + +**TM-SSRF-004 — Numeric IP variants (MITIGATED):** +The `url` crate normalizes IP representations during parsing. FetchKit validates +the resolved `IpAddr` (not the string), so octal/hex/decimal-encoded IPs are +caught after normalization. + +**TM-SSRF-005 — DNS rebinding (MITIGATED):** +After validating the resolved IP, FetchKit uses `reqwest::ClientBuilder::resolve(host, addr)` +to pin the connection to the validated IP. This prevents reqwest from re-resolving +the hostname during connection establishment. + +**TM-SSRF-009 — URL credentials (ACCEPTED):** +FetchKit passes URLs to reqwest as-is. If credentials are embedded in the URL, +they are sent with the request. This is acceptable because: +- FetchKit only supports GET/HEAD (read-only operations) +- The URL comes from the caller who controls what credentials to include +- Stripping credentials would break legitimate use cases +- **Risk:** Low. Mitigated at the caller level. + +**TM-SSRF-010 — Redirect to internal resource (OPEN):** +Reqwest follows redirects by default (up to 10). After a redirect, the new URL's +hostname may resolve to a private IP that wasn't validated. +- **Recommendation:** Disable automatic redirects and implement manual redirect + following with IP validation at each hop, or configure reqwest with a redirect + policy that validates destinations. +- **Workaround:** Callers can configure `block_prefixes` for known internal + prefixes, but this doesn't cover all cases (DNS-based attacks). +- **Priority:** High + +## 2. Network Security (TM-NET) + +| ID | Threat | Severity | Mitigation | Status | +|----|--------|----------|------------|--------| +| TM-NET-001 | HTTP downgrade (HTTPS URL redirects to HTTP) | Medium | No scheme validation on redirects; reqwest follows | **ACCEPTED** | +| TM-NET-002 | TLS certificate validation bypass | Low | Uses reqwest defaults (system certificate store via rustls-platform-verifier) | MITIGATED | +| TM-NET-003 | Connection reuse leaking context | Low | New reqwest client per request; no connection pooling across requests | MITIGATED | +| TM-NET-004 | Proxy environment variables (HTTP_PROXY) | Medium | Reqwest respects system proxy env vars; attacker could set these in container | **CALLER RISK** | +| TM-NET-005 | Man-in-the-middle on HTTP (non-TLS) | Medium | HTTP scheme is allowed; content can be intercepted/modified on the wire | **ACCEPTED** | + +### Mitigation Details + +**TM-NET-001 — HTTP downgrade (ACCEPTED):** +FetchKit allows both HTTP and HTTPS schemes. If an HTTPS URL redirects to HTTP, +reqwest will follow the redirect without warning. This is accepted because: +- FetchKit is designed for content fetching, not security-sensitive operations +- The caller controls which URLs to fetch +- Enforcing HTTPS-only would break many legitimate use cases + +**TM-NET-003 — Connection reuse (MITIGATED):** +The `DefaultFetcher` creates a new `reqwest::Client` per request, which prevents +connection pool state from leaking between requests. This is a defense-in-depth +measure. + +**TM-NET-004 — Proxy environment variables (CALLER RISK):** +Reqwest automatically reads `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` environment +variables. In a container environment, these should be controlled by the operator. +This is the caller's responsibility to configure or clear. + +## 3. Input Validation (TM-INPUT) + +| ID | Threat | Severity | Mitigation | Status | +|----|--------|----------|------------|--------| +| TM-INPUT-001 | Non-HTTP scheme (file://, ftp://, data:) | High | Explicit scheme check: only `http://` and `https://` prefixes allowed | MITIGATED | +| TM-INPUT-002 | URL prefix bypass via encoding | Medium | Prefix matching uses the raw URL string; URL-encoded variants may bypass | **OPEN** | +| TM-INPUT-003 | Empty or malformed URL | Low | Empty URL check and `url::Url::parse()` validation | MITIGATED | +| TM-INPUT-004 | Extremely long URL | Low | No explicit length limit; reqwest/OS handles | **ACCEPTED** | +| TM-INPUT-005 | URL with fragment/query manipulation | Low | Fragments and queries are part of the URL; no special handling needed | **BY DESIGN** | +| TM-INPUT-006 | Prefix bypass via URL authority (http://evil.com@127.0.0.1) | Medium | `url` crate parses authority correctly; resolve-then-check validates the actual host | MITIGATED | +| TM-INPUT-007 | Block prefix matching is string-based, not URL-aware | Medium | Prefix matching operates on raw URL strings; can be bypassed with path tricks | **OPEN** | + +### Mitigation Details + +**TM-INPUT-001 — Scheme validation (MITIGATED):** +```rust +// THREAT[TM-INPUT-001]: Block non-HTTP schemes (file://, ftp://, data:, etc.) +// Mitigation: Early return with InvalidUrlScheme error +if !request.url.starts_with("http://") && !request.url.starts_with("https://") { + return Err(FetchError::InvalidUrlScheme); +} +``` + +**TM-INPUT-002 — URL prefix bypass via encoding (OPEN):** +The prefix matching uses `String::starts_with()` on the raw URL. An attacker could +potentially bypass block lists using URL encoding or case variations: +- `http://127.0.0.1` blocked, but `http://127.0.0.1.` (trailing dot) may bypass +- With resolve-then-check, this is partially mitigated since the resolved IP is + checked regardless of URL string format +- **Recommendation:** Normalize URLs before prefix matching (lowercase scheme/host, + remove default ports, resolve `.` sequences) +- **Priority:** Medium (partially mitigated by resolve-then-check) + +**TM-INPUT-006 — URL authority bypass (MITIGATED):** +URLs like `http://evil.com@127.0.0.1/path` have `127.0.0.1` as the host (with +`evil.com` as the username). The `url` crate correctly parses this, and +resolve-then-check validates the actual host's IP. + +**TM-INPUT-007 — String-based prefix matching (OPEN):** +Block prefix `http://internal.example.com` would not block +`http://internal.example.com.evil.com` since `starts_with` matches. However, +resolve-then-check mitigates the actual SSRF risk since the resolved IP would +be a public IP. +- **Recommendation:** Consider URL-aware prefix matching (parse both prefix and + URL, compare scheme + host + path components) +- **Priority:** Low (mitigated by resolve-then-check for SSRF cases) + +## 4. Denial of Service (TM-DOS) + +| ID | Threat | Severity | Mitigation | Status | +|----|--------|----------|------------|--------| +| TM-DOS-001 | Unbounded response body | Medium | 30-second body timeout; no max body size limit | **OPEN** | +| TM-DOS-002 | Slowloris / slow body | Low | 1-second first-byte timeout; 30-second body timeout | MITIGATED | +| TM-DOS-003 | Compressed content bomb (gzip bomb) | Medium | Reqwest decompresses gzip/brotli/deflate; no decompressed size limit | **OPEN** | +| TM-DOS-004 | Rapid request flooding via tool | Low | No rate limiting in FetchKit; caller responsibility | **CALLER RISK** | +| TM-DOS-005 | DNS resolution delay | Low | DNS resolution uses system resolver; no explicit timeout on DNS lookup | **ACCEPTED** | +| TM-DOS-006 | Memory exhaustion from large HTML conversion | Medium | HTML converter processes in-memory; no streaming conversion | **OPEN** | + +### Mitigation Details + +**TM-DOS-001 — Unbounded response body (OPEN):** +FetchKit reads the entire response body into memory (up to 30-second timeout). +A malicious server could stream data at just above the timeout threshold, +consuming significant memory. +- **Recommendation:** Add a configurable `max_body_size` (e.g., 10 MB default) + that truncates the response and sets `truncated: true`. +- **Priority:** Medium + +**TM-DOS-002 — Slowloris (MITIGATED):** +The 1-second first-byte timeout prevents connections from being held open +indefinitely during the initial handshake. The 30-second body timeout provides +a hard ceiling on total request duration. + +**TM-DOS-003 — Compressed content bomb (OPEN):** +Reqwest is configured with `gzip`, `brotli`, and `deflate` features, which +enable transparent decompression. A small compressed payload could decompress +to a very large body. +- **Recommendation:** Monitor decompressed body size against `max_body_size`. +- **Priority:** Medium + +## 5. Information Leakage (TM-LEAK) + +| ID | Threat | Severity | Mitigation | Status | +|----|--------|----------|------------|--------| +| TM-LEAK-001 | Error messages reveal internal network topology | Medium | Error messages include connect/timeout details but not resolved IPs | MITIGATED | +| TM-LEAK-002 | DNS resolution errors reveal internal DNS | Low | DNS errors surfaced as connect errors; hostname visible in error | **ACCEPTED** | +| TM-LEAK-003 | Response content leaks internal data | Low | FetchKit returns content as-is; caller must filter sensitive data | **CALLER RISK** | +| TM-LEAK-004 | User-Agent reveals software version | Info | Default UA `Everruns FetchKit/1.0` reveals stack; configurable | **BY DESIGN** | +| TM-LEAK-005 | Timing side-channels (connect time reveals network proximity) | Low | 1-second timeout masks some timing; not fully mitigated | **ACCEPTED** | + +### Mitigation Details + +**TM-LEAK-001 — Error message detail (MITIGATED):** +FetchKit's error types (`FetchError`) use generic messages that don't include +resolved IP addresses. Connect errors say "Failed to connect to server" without +revealing the specific IP or port that was attempted. + +## 6. Content Conversion (TM-CONV) + +| ID | Threat | Severity | Mitigation | Status | +|----|--------|----------|------------|--------| +| TM-CONV-001 | Script injection in converted markdown | Low | `