AI-powered web research in one async call.
pip install web-scout-ai
web-scout-setup

from web_scout import run_web_research

result = await run_web_research("climate risk for agriculture in Kenya")
print(result.synthesis)

Building a reliable research pipeline requires gluing together:
- a search API (Serper)
- a scraper that handles HTML, JS pages, PDFs, DOCX
- a coverage evaluator to know when you have enough sources
- a synthesizer that cites actual content
web-scout-ai is all of that in one call. No Tavily + crawl4ai + custom glue code. No open-ended agent that you cannot control in production.
Query institutional sources (IPCC, FAO, World Bank) and get a cited synthesis — not just links.
result = await run_web_research(
    "drought impact on smallholder farmers in sub-Saharan Africa",
    include_domains=["fao.org", "ipcc.ch", "worldbank.org"],
)

Drop it in as a tool. One function, typed output, no framework lock-in.
@function_tool
async def research(query: str) -> str:
    result = await run_web_research(query, models=models)
    return result.synthesis

Point it at a report library or database page. It detects list pages, follows item links, and reads the actual documents.
result = await run_web_research(
    "sustainable land management technologies",
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)

Designed for agents, not humans. One async entry point, typed output, LiteLLM provider flexibility. Works inside pipelines with no side channels.
Returns structured + clean content. Every source is scraped and converted into a query-relevant extract before synthesis. You get cited prose, not a list of links.
Works on the full web. Static HTML, JS-rendered pages via Playwright, PDFs and DOCX via docling, JSON endpoints, even bot-protected files via browser download fallback.
Knows when to go deeper. If a URL is a list or database page, the pipeline detects it, follows item links, and takes a pagination hop. If coverage is still weak after the first round, it generates follow-up queries automatically.
import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="Kenya interannual variability and long-term trends in precipitation — current status and recent trend",
        models={
            "web_researcher": "gemini/gemini-3-flash-preview",
            "content_extractor": "gemini/gemini-3-flash-preview",
        },
        search_backend="serper",
    )
    print(result.synthesis)
    # Guard against an empty result so the average never divides by zero.
    n = len(result.scraped)
    avg = sum(len(s.content) for s in result.scraped) // n if n else 0
    print(f"\n{n} sources read, avg {avg:,} chars/source")

asyncio.run(main())

Real output from an actual run (sources, numbers, and dates are live from the web):
Precipitation in Kenya is characterized by extreme interannual variability and
distinct seasonal trends that have shifted significantly in recent decades.
The country's climate is dominated by a bimodal rainfall pattern consisting of
the 'long rains' (March–May, MAM) and 'short rains' (October–December, OND).
Long-Term Precipitation Trends
Historically, the two main rainy seasons have exhibited opposing trends:
• Long Rains (MAM): Between 1985 and 2010, a consistent drying trend was
observed, attributed to a shortening of the season through delayed onset and
earlier cessation. However, this trend has shown signs of recovery since 2018
due to extremely wet seasons in 2018, 2020, and 2024.
• Short Rains (OND): A consistent wetting trend has been recorded from 1983 to
2021, with seasonal rainfall increasing by approximately 1.44 to 2.36 mm per
year. Projections suggest the short rains may deliver more total rainfall than
the long rains by 2030–2040.
Current Status (2024)
The year 2024 exemplified the current state of extreme variability:
• MAM 2024: Recorded as one of the wettest seasons on record for several
stations, including Nairobi and Central Kenya. Ndakaini station recorded a
seasonal high of 1,355.5 mm. Many areas received 111% to over 200% of their
long-term mean, resulting in widespread flooding and crop destruction.
• OND 2024: In sharp contrast, the short rains were generally below average,
receiving only 26–75% of normal rainfall in the Northeast and Turkana
regions. This poor performance led to a deterioration in food security, with
2.15 million people facing food insecurity by early 2025.
Interannual Variability and Drivers
Rainfall variability has increased substantially since 2013, marked by more
frequent and intense extremes. Primary drivers include the Indian Ocean Dipole
(IOD) — positive IOD phases can lead to rainfall totals 2–3 times the
long-term mean — and ENSO, though the coherence between ENSO and Kenyan
rainfall has diminished since 2013, suggesting other regional factors are
becoming more influential.
4 sources read, avg 2,701 chars/source
Sources actually scraped:
- Observations of enhanced rainfall variability in Kenya, East Africa (1981–2021) — PMC / Scientific Reports
- Drivers and impacts of Eastern African rainfall variability — ICPAC / Nature Reviews
- State of the Climate Kenya 2024 — Kenya Meteorological Department (PDF)
- State of the Climate Report Kenya 2024 — Stockholm Environment Institute
pip install web-scout-ai
web-scout-setup  # installs Chromium for JS-rendered pages

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={
            "web_researcher": "openai/gpt-5.4-mini",
            "content_extractor": "gemini/gemini-3-flash-preview",
        },
        search_backend="serper",
    )
    print(result.synthesis)
    print("\nSources:")
    for source in result.scraped:
        print(f"- {source.title or source.url}: {source.url}")
asyncio.run(main())

class WebResearchResult(BaseModel):
    synthesis: str
    scraped: list[UrlEntry]
    scrape_failed: list[UrlEntry]
    blocked_by_policy: list[UrlEntry]
    source_http_error: list[UrlEntry]
    scraped_irrelevant: list[UrlEntry]
    bot_detected: list[UrlEntry]
    snippet_only: list[UrlEntry]
    queries: list[SearchQuery]

- `synthesis`: final grounded answer with inline source citations
- `scraped`: URLs successfully read, with extracted relevant content
- `scrape_failed`: URLs attempted but could not be scraped
- `blocked_by_policy`: URLs skipped because they match the built-in block policy
- `source_http_error`: URLs that failed because the source returned HTTP/network errors
- `scraped_irrelevant`: URLs that were fetched successfully but did not contain relevant content
- `bot_detected`: URLs blocked by bot protection
- `snippet_only`: search results kept only as snippets
- `queries`: all search queries executed during the run
UrlEntry contains url, title, and content.
SearchQuery contains query, num_results_returned, and domains_restricted.
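
The bucket lists above lend themselves to a quick post-run health check. The following is a minimal sketch, not part of the package: `UrlEntry` and `FakeResult` are stand-in dataclasses that only mirror the field names listed above, and `bucket_counts` and `read_rate` are hypothetical helpers. Because the helpers only use attribute access, they should also work on a real `WebResearchResult`, assuming it exposes the same fields.

```python
from dataclasses import dataclass, field


@dataclass
class UrlEntry:
    """Stand-in for the package's UrlEntry (url, title, content)."""
    url: str
    title: str = ""
    content: str = ""


@dataclass
class FakeResult:
    """Hypothetical stand-in with the same bucket names as WebResearchResult."""
    synthesis: str = ""
    scraped: list = field(default_factory=list)
    scrape_failed: list = field(default_factory=list)
    blocked_by_policy: list = field(default_factory=list)
    source_http_error: list = field(default_factory=list)
    scraped_irrelevant: list = field(default_factory=list)
    bot_detected: list = field(default_factory=list)
    snippet_only: list = field(default_factory=list)


BUCKETS = (
    "scraped", "scrape_failed", "blocked_by_policy", "source_http_error",
    "scraped_irrelevant", "bot_detected", "snippet_only",
)


def bucket_counts(result) -> dict:
    """Count the URLs in each result bucket."""
    return {name: len(getattr(result, name)) for name in BUCKETS}


def read_rate(result) -> float:
    """Illustrative metric: share of non-snippet URLs that ended up in `scraped`."""
    counts = bucket_counts(result)
    considered = sum(n for name, n in counts.items() if name != "snippet_only")
    return counts["scraped"] / considered if considered else 0.0
```

A low `read_rate` combined with a large `bot_detected` or `source_http_error` bucket suggests the sources failed rather than the query.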
result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models={
        "web_researcher": "openai/gpt-5.4-mini",
        "content_extractor": "gemini/gemini-3-flash-preview",
    },
    search_backend="serper",
    research_depth="standard",           # or "deep"
    include_domains=["ipcc.ch"],         # optional
    direct_url=None,                     # optional
    domain_expertise="climate science",  # optional
    allowed_domains=None,                # optional
    max_pdf_pages=50,                    # optional, default 50
)

# 1) Open web research
await run_web_research(
    query="latest IPCC findings on sea level rise",
    models=models,
    search_backend="serper",
)

# 2) Domain-restricted research
await run_web_research(
    query="endemic species conservation programs",
    models=models,
    include_domains=["iucn.org", "wwf.org"],
)

# 3) Direct URL extraction (skip search)
await run_web_research(
    query="key findings from this report",
    models=models,
    direct_url="https://example.org/biodiversity-report.pdf",
)

# 4) Direct URL list-page deepening
await run_web_research(
    query="sustainable land management technologies in Kenya",
    models=models,
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)

If the URL is a list, index, or database page, the pipeline can:
- detect that it is a hub page
- collect the most relevant item links
- follow up to a depth-dependent cap of those links
- take one "next page" hop when pagination is present
Especially useful for catalog pages, result listings, and structured report libraries.
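
To make the hub-page steps concrete, here is a rough sketch of link selection. It is an illustration under stated assumptions, not the package's implementation: `select_item_links` is a hypothetical helper, the keyword-overlap scoring and the same-host filter are assumptions, and the only values taken from this README are the hub deepening caps (10 for standard, 15 for deep).

```python
import re
from urllib.parse import urljoin, urlparse

# Caps taken from the "Hub deepening cap" row of the depth table in this README.
HUB_CAP = {"standard": 10, "deep": 15}


def _words(text: str) -> set:
    """Lowercase alphanumeric tokens longer than two characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


def select_item_links(base_url, links, query, depth="standard"):
    """Rank same-host links by keyword overlap with the query, then apply the cap.

    `links` is a list of (anchor_text, href) pairs found on the hub page.
    """
    base_host = urlparse(base_url).netloc
    terms = _words(query)
    scored, seen = [], set()
    for text, href in links:
        url = urljoin(base_url, href)  # resolve relative hrefs
        if urlparse(url).netloc != base_host or url in seen:
            continue  # assumption: stay on the hub's own host, skip duplicates
        seen.add(url)
        score = len(terms & _words(f"{text} {url}"))
        if score:
            scored.append((score, url))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps page order on ties
    return [url for _, url in scored[: HUB_CAP[depth]]]
```

A production pipeline would likely rank links with a model rather than keyword overlap; the cap and the same-page ordering are the parts this sketch is meant to show.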
See the maintained flow doc: [docs/pipeline-flow.md](docs/pipeline-flow.md)
It includes:
- the top-level `run_web_research()` flow
- the direct-URL vs search-mode split
- the scrape router rules for docling vs crawl4ai vs JSON vs vision
- the extractor fallback rules
- the synthesis and citation-judge rules
- Generate targeted search queries.
- Search the web with Serper.
- Triage the best URLs across result sets.
- Scrape and extract relevant content in parallel.
- After each non-final search iteration, run the coverage evaluator to decide whether the evidence actually answers the question.
- If coverage is still weak, either reuse promising backlog URLs or run follow-up searches.
- Produce a grounded synthesis with inline citations.
- Run a deterministic citation check before returning.
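
The search-mode steps above can be sketched as a small control loop. This is a simplified, synchronous illustration, not the package's code: `research_loop` and its three callables are hypothetical, the limits are copied from the depth table in this README, and triage, backlog reuse, synthesis, and the citation check are left out.

```python
# Per-depth limits copied from the depth table in this README.
LIMITS = {
    "standard": {"iterations": 2, "first_queries": 3, "followup_queries": 2},
    "deep": {"iterations": 3, "first_queries": 5, "followup_queries": 4},
}


def research_loop(question, depth, generate_queries, search_and_scrape, coverage_ok):
    """Collect evidence round by round until coverage is good or rounds run out.

    generate_queries(question, evidence, n) -> list of n query strings
    search_and_scrape(queries)              -> list of evidence items
    coverage_ok(question, evidence)         -> bool
    """
    limits = LIMITS[depth]
    evidence = []
    for iteration in range(limits["iterations"]):
        first_round = iteration == 0
        n = limits["first_queries"] if first_round else limits["followup_queries"]
        queries = generate_queries(question, evidence, n)
        evidence.extend(search_and_scrape(queries))
        is_final = iteration == limits["iterations"] - 1
        # The coverage evaluator only runs after non-final iterations.
        if not is_final and coverage_ok(question, evidence):
            break
    return evidence
```

Passing the steps in as callables keeps the loop testable without network access; the real pipeline is async and does more between rounds.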
Editable diagram: [pipeline-diagram.excalidraw](pipeline-diagram.excalidraw)
Readable rule map: [docs/pipeline-flow.md](docs/pipeline-flow.md)
Pipeline diagram showing mode selection, scrape routing, failure buckets, and synthesis rules
| What happened | Result bucket | Meaning |
|---|---|---|
| Scrape and extraction succeeded | `scraped` | The URL produced usable extracted content |
| Search result was seen but never scraped | `snippet_only` | Only the search snippet is kept |
| URL matched a blocked domain policy | `blocked_by_policy` | Skipped before normal extraction |
| Source returned HTTP/network errors | `source_http_error` | The source failed, not the package logic |
| Bot protection or anti-automation page detected | `bot_detected` | The URL was reachable but blocked |
| Page loaded but content was not useful for the query | `scraped_irrelevant` | Fetch succeeded, relevance failed |
| Extraction failed for other reasons | `scrape_failed` | Generic scrape or extraction failure |
| Situation | What the pipeline does next |
|---|---|
| `direct_url` is a list / index / database page | Extract ranked detail links, allow one next-page hop, then scrape selected follow-ups |
| `direct_url` is a document | Do not fan out into site chrome or navigation pages |
| Search mode completes a non-final iteration | Run coverage evaluation to decide whether current evidence is sufficient |
| Search mode has weak coverage but promising snippet-only URLs | Scrape backlog URLs before running new searches |
| Search mode has weak coverage and backlog looks weak | Generate follow-up search queries |
| Domain-restricted mode finds a hub page | Deepen within the same domain before broadening search |
await run_web_research(query=..., models=..., search_backend="serper")

- `serper`: Google-quality results with rich metadata (date, rank, People Also Ask, Knowledge Graph). Requires `SERPER_API_KEY`. Serper is generous with free-tier limits.
Additional backends can be added by the community — see SearchBackend in [search_backends.py](src/web_scout/search_backends.py).
# Standard (default): usually up to ~10 sources
await run_web_research(query=..., models=..., research_depth="standard")
# Deep: usually up to ~28 sources
await run_web_research(query=..., models=..., research_depth="deep")

| Parameter | Standard | Deep |
|---|---|---|
| Max iterations | 2 | 3 |
| Search queries (first round) | 3 | 5 |
| Search queries (follow-up) | 2 | 4 |
| URLs scraped (first round) | 6 | 12 |
| URLs scraped (follow-up) | 4 | 8 |
| Hub deepening cap | 10 | 15 |
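
One way to read the table is that the "~10" and "~28" source figures are the first-round URL budget plus the follow-up budget for every remaining iteration. The sketch below works through that arithmetic. It is an interpretation of the table, not code from the package: `max_scraped_urls` is a hypothetical helper, and hub deepening is capped separately, so it is not counted here.

```python
# Values copied from the depth table in this README.
DEPTH = {
    "standard": {"max_iterations": 2, "first_round_urls": 6, "followup_urls": 4},
    "deep": {"max_iterations": 3, "first_round_urls": 12, "followup_urls": 8},
}


def max_scraped_urls(depth: str) -> int:
    """Upper bound on URLs scraped in search mode for a given depth."""
    cfg = DEPTH[depth]
    followup_rounds = cfg["max_iterations"] - 1
    return cfg["first_round_urls"] + followup_rounds * cfg["followup_urls"]


print(max_scraped_urls("standard"))  # 6 + 1 * 4 = 10
print(max_scraped_urls("deep"))      # 12 + 2 * 8 = 28
```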
Model IDs follow LiteLLM provider naming:
models = {
    # Required
    "web_researcher": "openai/gpt-5.4-mini",
    "content_extractor": "gemini/gemini-3-flash-preview",
    # Optional step-specific overrides (default: web_researcher)
    "query_generator": "openai/gpt-5.4-mini",
    "coverage_evaluator": "openai/gpt-5.4-mini",
    "synthesiser": "openai/gpt-5.4-mini",
    # Optional fallback for scanned PDFs, image URLs, or empty JS pages
    "vision_fallback": "gemini/gemini-3-flash-preview",
}

# Search backend
export SERPER_API_KEY="..."
# LLM providers (set what you use)
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export GROQ_API_KEY="..."

# Restrict discovery to selected domains
await run_web_research(
    query=...,
    models=...,
    include_domains=["fao.org", "ipcc.ch"],
)

# Re-allow domains that are blocked by default
await run_web_research(
    query=...,
    models=...,
    allowed_domains=["reddit.com"],
)

By default, the scraper blocks common social and video platforms. `allowed_domains` lets you opt specific domains back in when they are genuinely useful for the task.
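
The interaction between the default block policy and `allowed_domains` can be pictured with a small sketch. Everything here is illustrative: `is_blocked` is a hypothetical helper, and `DEFAULT_BLOCKED` is a made-up block list (only `reddit.com` is implied by the example above). The package's real list and matching rules may differ.

```python
from urllib.parse import urlparse

# Hypothetical block list for illustration; not the package's actual policy.
DEFAULT_BLOCKED = frozenset({"facebook.com", "youtube.com", "reddit.com"})


def _matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def is_blocked(url: str, allowed_domains=None, blocked=DEFAULT_BLOCKED) -> bool:
    """Apply the block list, letting allowed_domains opt domains back in."""
    host = (urlparse(url).hostname or "").lower()
    if any(_matches(host, d) for d in (allowed_domains or [])):
        return False  # explicitly re-allowed by the caller
    return any(_matches(host, d) for d in blocked)
```

Matching on the full domain or a dot-prefixed suffix avoids false positives such as `notreddit.com`.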
from agents import Agent, function_tool
from web_scout import run_web_research

@function_tool
async def research(query: str) -> str:
    result = await run_web_research(
        query=query,
        models={
            "web_researcher": "openai/gpt-5.4-mini",
            "content_extractor": "gemini/gemini-3-flash-preview",
        },
        search_backend="serper",
    )
    sources = "\n".join(f"- {s.url}" for s in result.scraped)
    return f"{result.synthesis}\n\nSources:\n{sources}"

agent = Agent(
    name="researcher",
    model="gpt-5.4-mini",
    tools=[research],
    instructions="Use the research tool to answer with up-to-date web sources.",
)

For fast local confidence:
mamba run -n web-agent python tests/run_checks.py quick

For the full local suite:

mamba run -n web-agent python tests/run_checks.py unit

For live behavior probes with saved artifacts:

mamba run -n web-agent python tests/run_checks.py behavior --env-file /path/to/.env

Each run writes a timestamped folder under `tests/run_results/` with:
- one `.log` file per step
- `manifest.json` with status, commands, and durations
- `summary.md` with a human-readable report
- `pytest` JUnit XML for pytest steps
For a folder-level guide to what each test and probe does, see [tests/README.md](tests/README.md).
Presets:
- `quick`: `ruff` plus a targeted unit slice
- `unit`: `ruff` plus the full pytest suite
- `behavior`: live probes (`query_probe`, `matrix_probe`, `full_query_probe`)
- `all`: local checks plus live probes
The live probe preset is env-aware and skips steps cleanly when required keys are missing.
web-scout-ai is a strong fit when you need:
- up-to-date answers grounded in real web sources
- multi-source synthesis without building a full deep-research stack
- a reusable research tool inside an agent workflow
- better handling of report libraries, list pages, and mixed web/document sources
It is probably not the right tool if you only need simple search snippets or if you want a fully autonomous long-form research agent that decides everything itself.
- Python `>=3.10`
- API key for at least one supported LLM provider
- `SERPER_API_KEY` for the Serper search backend (generous free tier)
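
A small preflight check can catch missing keys before a run starts. The sketch below is an assumption-laden example, not part of the package: `missing_keys` is a hypothetical helper, and the prefix-to-variable mapping assumes that the LiteLLM provider prefix of each model ID corresponds to the environment variables listed in this README.

```python
import os

# Assumed mapping from LiteLLM provider prefix to the env vars named in this README.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "groq": "GROQ_API_KEY",
}


def missing_keys(models: dict, env=None, search_backend: str = "serper") -> list:
    """Return the sorted names of required env vars that are unset or empty."""
    env = os.environ if env is None else env
    required = set()
    if search_backend == "serper":
        required.add("SERPER_API_KEY")
    for model_id in models.values():
        provider = model_id.split("/", 1)[0]  # "openai/gpt-..." -> "openai"
        if provider in PROVIDER_KEYS:
            required.add(PROVIDER_KEYS[provider])
    return sorted(name for name in required if not env.get(name))
```

Calling `missing_keys(models)` before `run_web_research` and failing fast on a non-empty list gives a clearer error than a provider failure mid-run.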
- Full logo: [assets/web-scout-logo.svg](assets/web-scout-logo.svg)
- Square logo mark (avatar-safe): [assets/web-scout-logo-mark.svg](assets/web-scout-logo-mark.svg)
- Social card preview: [assets/web-scout-social-card.svg](assets/web-scout-social-card.svg)
MIT