Skip to content

dondai1234/master-fetch

Repository files navigation

Hound logo

🐕 Hound

Give your AI agent the web. $0. Two commands. ~2.5K tokens.

Fetch · crawl · bypass bot walls · read PDFs (even scanned) · interact with pages · search the web No accounts · no Docker · no API keys · runs on your machine

PyPI Python License: MIT CI Downloads GitHub stars

pip install hound-mcp[all] && playwright install chromium

Install · Tools · Search · How it works · Comparison · Honest limits


Hound gives your AI agent the web

✨ Why Hound

Hound is one MCP server that gives any agent (Claude Code, Cursor, OpenCode, Hermes, Pi, anything that speaks MCP) full web research from a single local process. Six tools, one warm browser, zero accounts.

🆓 $0 forever, MIT No keys, no accounts, no per-request billing, no data sent to a third-party scraper. Search is keyless and local too.
🧠 Mastered on connect A one-time instructions block gives the agent the mental model + the #1 workflow + the known limits. Effective on turn one, not turn ten.
📐 ~2.5K tokens, 6 tools Hand-crafted tool defs, no Pydantic schema bloat. More capability than tools shipping 5K+.
🎯 Every response is actionable content_ok, next_action, summary, relevance_score, fetch_relevance. Agents branch on structured fields, not error text.

Hound is for the agent itself. You install it once; the agent calls it whenever it needs the web.


🚀 Quick start

pip install hound-mcp[all]          # fetch + crawl + keyless search + PDF + OCR + neural rerank
playwright install chromium         # the anti-detect browser engine
hound -v        # version + update status
hound -u        # update to latest

Then point any MCP client at the hound command. No arguments, no keys, no env vars. See Install for the lean option + Tell your agent to install it for a copy-paste prompt.


🧰 The 6 tools

Tool One-liner
smart_fetch Fetch any URL. HTTP first, auto-escalates to the anti-detect browser if blocked. Bulk, PDFs (with OCR + quality score), css_selector, focus, actions, pagination.
smart_crawl Best-first same-domain crawl. Each page as markdown with content_ok + page_type (article / list / js_shell). discover_only, crawl_urls, focus, time + token caps.
smart_search Local keyless web search. Runs 9 backends in parallel (duckduckgo, brave, mojeek, yahoo, yandex, startpage, google + opt-in wikipedia, grokipedia) via a vendored metasearch, merges + ranks with neural rerank + cross-backend consensus. relevance_score + fetch_relevance + engines_consensus per result. Neural / find_similar. Quality filter drops low-relevance results instead of padding. Returns URLs + ranking (the agent fetches the ones it wants). engine_blocked shows backends that rate-limited/CAPTCHA'd/timed out.
screenshot Capture a page as an image. For multimodal agents (canvas, image-of-text, visual layout). Session auto-managed.
cache_clear Clear the fetch cache. all=true wipes everything.
version Installed version + update status.
📖 Full tool reference

smart_fetch — Fetch any URL with full content extraction. Auto HTTP → stealthy escalation. Bulk via urls. Narrow with css_selector. PDFs auto-extracted to structured markdown (tables, headings, metadata); pages='1-5' for a subset; scanned PDFs AND CID-corrupted fonts auto-OCR with [all]. Response carries quality_score (0–1), table_of_contents, metadata. focus='query' returns only BM25-relevant blocks. actions for click/fill/scroll. Paginate with offset. Signals: content_ok, next_action, summary, is_truncated+next_offset. cache_ttl=0 bypasses cache.

smart_crawl — Best-first walk of same-domain links. Content-adaptive: article/docs → main content; list/index pages → structured link list; JS shells → detected + reported. discover_only=true → URL map. crawl_urls=[...] → second-phase selective crawl. focus prioritizes + filters. Caps: max_pages (10), max_depth (2), max_total_chars, deadline_ms (120000). Per-page content_ok + status (network error = −1) + fetched_at. Reuses smart_fetch anti-bot + cache.

smart_search — Runs 9 keyless backends in parallel (duckduckgo, brave, mojeek, yahoo, yandex, startpage, google + opt-in wikipedia, grokipedia — INDEPENDENT indexes, not the same feed twice) via a vendored metasearch engine layer, merges, dedups by normalized URL, ranks, and boosts cross-backend consensus (a URL returned by several independent indexes is an authority signal). Returns URLs + ranking, not page content (the agent smart_fetches the ones it wants). Filters: site/exclude_sites, location/language/region, page, freshness. Modes: auto (neural rerank if [all]+model present, else consensus + engine-position order), neural (same, explicit), find_similar (pass url=). engine_blocked lists backends that rate-limited/CAPTCHA'd/timed out.


🔎 Local keyless search

Hound fetches the web and brings it back to your agent

No API key, no account, no third-party service. smart_search runs 9 keyless backends in parallel (duckduckgo, brave, mojeek, yahoo, yandex, startpage, google + opt-in wikipedia, grokipedia) via a vendored metasearch engine layer (derived from ddgs, MIT, attributed) — INDEPENDENT indexes, not the same feed twice — merges, dedups, and ranks on your machine. It returns URLs + ranking, not page content — the agent smart_fetches whichever results match what it needs (the ranking is a hint, not a directive). Every result carries relevance_score (0–1), fetch_relevance (high / med / low), and engines_consensus (how many independent indexes returned that URL — a free authority signal). A quality filter drops low-relevance results instead of padding to the max with garbage, so niche queries return fewer good results.

Two rerank modes:

  • neural (the only reranker) — a local ONNX cross-encoder (ms-marco-MiniLM-L-6-v2, Apache-2.0) running on the onnxruntime Hound already ships for OCR. Exa-style semantic ranking, $0, on your machine. The model downloads once (~80MB, cached, not bundled). Scores are min-max normalized per query so the relevance_score field carries meaningful spread (ms-marco sigmoid saturates ~1.0; normalization restores discrimination). BM25 was removed — neural matches its speed and ranks better. Lean installs (no model) fall back to cross-engine consensus + engine-position order.
  • find_similar — pass url=; Hound fetches a page you like, derives a query, and reranks candidates against that source page. Exa's find-similar, local.

smart_search returns URLs + ranking only — the agent smart_fetches the results it wants (one extra call beats guessing which URL is worth fetching). Default 6 results. The 9 backends (duckduckgo, brave, mojeek, yahoo, yandex, startpage, google + opt-in wikipedia, grokipedia) run in parallel; a diversity quorum waits for at least 3 to contribute before returning (~1.5–3s) so a single backend's bias or rate-limit can't dominate, and a backend that CAPTCHAs or rate-limits is simply carried by the others. Search is 100% HTTP — it never touches the browser (the single Patchright browser is smart_fetch's alone).

🛡️ Search Engine Resilience Layer

Scraping public engines from your IP can be rate-limited or CAPTCHA'd. No keyless local tool is bulletproof against sustained blocking without a proxy — Hound is honest about that, and then makes the no-proxy case as robust as possible for a single user:

# Mechanism What it does
1 Persistent warm session per engine One long-lived session reused across searches — cookies + TLS accumulate, so the engine sees a returning human, not a fresh bot. Also faster (no per-search TLS handshake).
2 Per-engine pacing + jitter Within one search all engines fire in parallel (free); only same-engine bursts across searches get a small jittered delay.
3 Circuit breaker + cooldown A blocked engine auto-cools (15 → 120s) while the others keep serving. Results keep flowing.
4 202 / 429 / 503 / 403 + Retry-After DDG's HTTP 202 soft rate-limit is detected; Retry-After honored.
5 Fingerprint rotation A pool of real Chrome / Edge / Firefox / Safari TLS profiles, picked per request.
6 Diverse independent pool + cross-engine consensus Three independent indexes (DuckDuckGo, Bing, Qwant) run in parallel — no single engine is a bottleneck, and a URL returned by several of them gets a consensus boost (a free authority signal from merging independent indexes, no extra fetches).
7 HOUND_SEARCH_PROXY Route all engine requests through your own rotating / residential proxy — the bulletproof path for heavy use.

engine_blocked in the response tells the agent which engines are cooling down (retry shortly). Same gray-area posture as SearXNG / ddgs; no search-engine ToS compliance is claimed.


⚙️ How it works

Hound pipeline: HTTP to stealthy escalation, extract, agent-optimized response; PDF and crawl branches
  • smart_fetch checks cache + robots, tries HTTP, escalates to the stealthy Patchright browser on a block / JS-shell / 403 / 503, then extracts + enriches. PDFs branch to pdfplumber (with CID-garbage + scanned auto-OCR via pypdfium2 + rapidocr, quality_score, ToC). smart_crawl reuses the same pipeline across a same-domain best-first walk.
  • smart_search runs 9 keyless backends in parallel (duckduckgo, brave, mojeek, yahoo, yandex, startpage, google + opt-in wikipedia, grokipedia) via a vendored metasearch engine layer (derived from ddgs, MIT, attributed in NOTICE.ddgs.txt). A diversity quorum waits for at least 3 backends to contribute before returning (~1.5–3s) so no single backend's bias or rate-limit dominates; a backend that CAPTCHAs or rate-limits is carried by the others. Search is 100% HTTP (never touches the browser). Then it merges + dedups, reranks (neural / find_similar), and applies the cross-backend consensus boost.
  • One stealthy Chrome is pre-warmed at startup and reused, so escalation skips the 3–5s cold start.
  • Content over 40KB is chunked; the response gives next_offset so the agent pages through with one more call (served instantly from cache).
🔧 Deep-dive: every feature

📄 PDF + scanned-PDF OCR + CID-recovery

smart_fetch detects a PDF (by content-type or %PDF magic bytes) and extracts it to structured markdown with pdfplumber (MIT, no AGPL baggage): multi-column reading order, real tables as markdown tables, font-size headings, de-hyphenated paragraphs, a metadata header, and --- Page N --- markers. Pass pages="1-5" for a subset.

The flagship trick — CID-corruption auto-OCR. Academic papers embed font subsets without a Unicode map, so extractors emit (cid:71)(cid:302)... garbage for figures/diagrams/math. But the glyphs render correctly. Hound detects CID-garbage pages, renders them via pypdfium2, and OCRs them with rapidocr, recovering the real text automatically. Scanned / image-only PDFs (and image-only web pages) are auto-OCR'd too. Every PDF response carries quality_score (0.0–1.0) and an honest content_ok; table_of_contents from the PDF outline; metadata (title/author/dates); include_media=true for per-page image metadata; password for encrypted PDFs; a .pdf URL that returns a login/paywall is reported as auth_required. All pure-pip, no system binary, with [all].

🕷️ Deep crawl (smart_crawl)

Walk same-domain links in best-first order: discovered URLs are scored by focus relevance + content-likelihood (docs/guide/api boosted, login/submit/cart penalized) + shallow depth, so content pages are crawled before junk when the budget is tight. Extraction is content-adaptive: article/docs → trafilatura main content; list/index pages (HN, aggregators, directories) → a structured * [title](url) link list; JS shells → detected + reported honestly. URLs normalized so /docs and /docs/ are never crawled twice. discover_only=true → URL map; crawl_urls=[...] → second-phase selective crawl; path_include/path_exclude to scope. Caps on pages / depth / tokens / time. Same-domain only by default.

🛡️ Anti-bot + one warm browser

smart_fetch tries plain HTTP first (~1s). If the site blocks HTTP or serves a JS shell, it auto-escalates to a Patchright anti-detect browser with Cloudflare challenge solving. Two tiers, nothing to configure. A single Chrome warms at startup and stays alive for the whole session — it is smart_fetch's + screenshot's anti-bot tier. Pages are closed after each fetch and resource loading is dropped, so idle memory stays near baseline. One browser total, no extra Chrome. (smart_search is 100% HTTP and never touches this browser.)

🎯 Query-focused extraction (focus)

smart_fetch(url, focus="...") returns only the BM25-relevant blocks (paragraphs, headings, tables). On a long page this cuts context 80%+ with no re-fetch (runs post-cache, so one cached page serves any focus query). Re-pass the same focus when paginating.

🖱️ Page interaction (actions)

Content behind a click, a search form, a "load more", or infinite scroll: actions=[{click:"button.load-more"}, {fill:{selector:"#q", text:"x"}}, {press:"Enter"}, {wait:500}, {scroll:3}, {wait_selector:".item"}]. Runs on the stealthy browser after load, before extraction. Forces stealthy + bypasses cache.

🏷️ Metadata on every response

Structured metadata for citation + relevance: title, description, site name, type, image, canonical URL, language, published time, author (from OpenGraph, JSON-LD, the canonical link, and <title>).

🐕 Reddit, optimized

Reddit URLs auto-rewrite to old.reddit.com (7× smaller pages) and skip straight to the stealthy browser (www.reddit.com walls HTTP). Subreddit listings parse into structured posts (title, score, comments, author, domain) from canonical data attributes, with promoted ads filtered out and sticky/NSFW posts tagged.

💾 Smart caching

SQLite cache keyed by URL + extraction type + css_selector + pages (and by query + filters + mode for search). WAL mode for concurrent access. Bad content is never cached (JS shells, bot challenges, error statuses re-fetch instead of freezing broken pages). A size cap evicts the oldest entries so a long-lived agent's cache can't grow unbounded.


📊 Comparison: free tools

Hound is compared only to other free ways to give an agent web research. Only paid services (Bright Data, ZenRows, Firecrawl paid, Exa) can compete on hard anti-bot or hosted neural search at scale — and they cost money, require accounts, and route your data through their servers.

Hound Crawl4AI Jina Reader Firecrawl (OSS / free) DIY Playwright
Price $0 forever $0 (self-host) free, rate-limited $0 self-host / 1K free $0 (your time)
License MIT (no attribution) Apache-2.0 (attribution) proprietary AGPL-ish / cloud n/a
Account / API key none none optional free key cloud needs account + key none
Runs locally yes yes no (their API) self-host: yes (Redis + Docker) yes
Anti-bot / Cloudflare built-in (Patchright) limited none not by default none
Deep crawl yes (best-first, budget, map) yes no yes (cloud) build it
PDF → structured markdown yes (tables, subset) partial yes (native) yes (cloud + OCR) build it
Scanned-PDF / image OCR yes (rapidocr, pure-pip) no no yes (cloud paid) build it
Page interaction yes (actions) hooks (code) no yes (cloud) build it
Query-focused extraction yes (focus, BM25) yes (BM25 filter) no no build it
Web search yes (keyless local) no yes no no
Search rerank neural + find_similar BM25 none none n/a
Search anti-bot yes (9 keyless backends in parallel via vendored metasearch; primp browser-impersonated TLS + httpx HTTP/2 fingerprint; diversity quorum = 3 backends; a blocked one is carried by the others; 100% HTTP, no browser) n/a n/a n/a n/a
Agent signals yes (content_ok/next_action/summary/relevance_score) no no no no
Connect-time instructions yes no no no no
MCP server yes (official) community yes (official) yes (official) build it
Token cost (tools/list) ~2.5K (6 tools) varies n/a varies (12 tools) n/a

Takeaway: Crawl4AI is the closest free competitor (self-host Python, local, no key, has crawl + BM25), but no web search, no scanned-PDF OCR, no page interaction, no agent-optimized signals, and Apache-2.0 (attribution required) vs Hound's MIT. Jina Reader is the easiest (prefix a URL) and handles PDFs, but no anti-bot, no crawl, no interaction, no local search, routes through Jina's servers, rate-limited. Firecrawl's OSS has no anti-bot by default; the generous features live in the paid cloud. Hound is the only free tool that combines crawl, built-in Cloudflare bypass, scanned-PDF OCR, page interaction, query-focused extraction, and keyless local search with neural + find-similar rerank — all local, MIT, $0, no accounts, no keys.

When a paid service makes sense

Paid scrapers (Bright Data, ZenRows, Firecrawl paid, Spider.cloud) can beat free tools on the hardest anti-bot (DataDome, Akamai, Cloudflare Turnstile) and on massive scale, because they run large residential-proxy networks. Paid search APIs (Exa, Tavily) offer hosted neural search. They cost $16 to $500+/month, require accounts + API keys, and send your queries + content through their servers. Use Hound for $0 local web research with no accounts and no keys; reach for a paid service only for enterprise scale, sites Hound explicitly can't crack, or hosted neural search at scale.


🪙 Token cost

MCP tool count comparison: Hound 6 tools vs Firecrawl 12, Jina 19, Bright Data 60+

Most MCP servers cost 3–5K tokens just to exist. Hound's 6 tools cost ~2.5K tokens, and the connect-time instructions are injected once at handshake, not repeated every turn. Your context window is expensive; Hound respects it.


📦 Install

pip install hound-mcp[all]          # recommended: fetch + crawl + keyless search + PDF + OCR + neural rerank
playwright install chromium
Lean install (no neural rerank, no PDF/OCR)
pip install hound-mcp               # fetch + crawl + keyless keyword search
playwright install chromium

The lean install is fully functional: multi-engine keyless search with keyword BM25, anti-bot, crawl, fetch, caching. [all] adds the ONNX neural reranker, PDF extraction, and OCR (scanned PDFs + CID-recovery + image pages) on the same onnxruntime.

Optional environment variables
Variable Purpose
HOUND_SEARCH_PROXY Route all search-engine requests through your own proxy (http://host:port, socks5://..., or user:pass@host:port). For sustained heavy search use with a rotating / residential proxy. Not required for normal single-user use.
HOUND_SEARCH_MIN_INTERVAL Override the per-engine pacing floor (seconds, float). 0 = use the built-in defaults (DDG 1.2s, Bing 1.5s, Wikipedia 0.3s). Power-user tuning.

No API keys or accounts are needed for anything — search is keyless and local.


🤖 Tell your agent to install it

Paste this into your agent:

Install the Hound MCP server on this machine. Follow every step. Do not skip any.

1. Figure out which agent harness you are running on (OpenCode, Hermes, Pi, etc). Then find: (a) where the MCP config file lives, and (b) what format it expects for adding a local MCP server. Read the harness docs if needed. Do not guess.

2. Run: pip install hound-mcp[all]
   Then run: playwright install chromium (But only if it isnt installed already, verify first about its existence)
   If either fails, stop and tell the user.

3. Find the MCP config file from step 1 and back it up before editing. Add a new MCP server named "hound" with command "hound", no arguments, in the format your harness requires. No API keys or environment variables are needed (search is keyless and local).

4. Save the file. Tell the user to restart the agent. After restart, smart_fetch, smart_crawl, smart_search, screenshot, cache_clear and version should be available.
For Pi agent users
Install the Hound MCP server. Follow every step. Do not skip any.

1. Run: pip install hound-mcp[all]
   Then run: playwright install chromium (But only if it isnt installed already, verify first about its existence)
   If either fails, stop and tell the user.

2. Check pi-mcp-adapter: pi list. If not installed: pi install npm:pi-mcp-adapter

3. Backup ~/.pi/agent/mcp.json, then add this inside mcpServers:
   "hound": { "command": "hound", "transport": "stdio", "lifecycle": "eager" }
   No API keys or environment variables are needed (search is keyless and local).

4. Tell the user: "Run /reload, then /mcp to verify. smart_fetch, smart_crawl and smart_search should be available."

⚠️ Honest limits

No free tool can do everything. Hound is upfront about what it can't:

Limit What happens instead
DataDome / Akamai / Cloudflare Turnstile (interactive) Not bypassed. next_action tells the agent to switch sources instead of retrying.
Search rate-limits / CAPTCHAs Solved by diversity: 9 keyless backends (duckduckgo, brave, mojeek, yahoo, yandex, startpage, google, wikipedia, grokipedia) run in parallel via a vendored metasearch (primp browser-impersonated TLS); a backend that rate-limits/CAPTCHAs is carried by the others, and a diversity quorum waits for 3 to contribute so no single backend dominates. Search is never dead. engine_blocked reports backends that rate-limited/CAPTCHA'd/timed out; HOUND_SEARCH_PROXY (http/https/socks5) is a power-user rotating-proxy escape hatch for per-IP throttling (the one thing no scraper can escape from one IP).
Neural / find_similar search Need hound-mcp[all] (the ONNX reranker runs on the same onnxruntime as OCR; model downloads once). Lean installs get keyword BM25.
Sites requiring login Out of scope (Hound does page interaction, not authenticated sessions).
Deep shadow-DOM / hard SPAs actions (scroll, click, wait_selector) reach most of it; deep shadow-DOM piercing not yet wired.
YouTube Minimal text.

When a fetch or search fails, the response says exactly why and what to try next — so the agent doesn't waste calls guessing.


If Hound saves you time, ⭐ the repo — it helps others find it.

GitHub stars

MIT · Changelog · Issues · PyPI

About

MCP server for web fetching with Cloudflare bypass, Trafilatura extraction, and smart routing. Free, self-hosted, no API keys.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors