v0.8.65: Make web_search reliable with a SearXNG JSON backend, health checks, and visible agent status #3079

@Hmbown

Problem

web_search is one of the main ways agents are supposed to get current information, but the current behavior is not reliable or visible enough. In practice, it often feels like the agent says it is searching the web and then nothing useful happens.

We should make the search stack explicit, testable, and observable. SearXNG looks like a good backend candidate because it exposes a documented metasearch API with JSON output and can be self-hosted/configured instead of relying primarily on fragile HTML scraping.

Current state

  • crates/tui/src/tools/web_search.rs:1-4 — current backends are Bing HTML scrape, DuckDuckGo HTML scrape with Bing fallback, Tavily, Bocha, Metaso, Baidu, Volcengine, and Sofya.
  • crates/tui/src/tools/web_search.rs:9-12 — config supports provider and base_url, but the base URL is described as DuckDuckGo-compatible HTML, not SearXNG JSON.
  • crates/tui/src/tools/web_search.rs:27-34 — backend endpoints are hard-coded for current providers; there is no SearXNG endpoint/default.
  • crates/tui/src/tools/web_search.rs:67-105 — DuckDuckGo/Bing parsing relies on HTML regexes, which is inherently brittle and prone to bot challenges/layout changes.
  • crates/tui/src/tools/web_search.rs:144-145 — tool description tells agents the default is DuckDuckGo with Bing fallback; it does not mention SearXNG.
  • config.example.toml:455-477 — [search] docs list DuckDuckGo/Bing/API providers and base_url, but no searxng provider.
  • SearXNG docs (docs/dev/search_api.rst) document JSON output with format=json and parameters such as q, categories, engines, language, pageno, time_range, and safesearch.
  • Smoke-probing public SearXNG instances during triage produced 403/429 on several instances, so CodeWhale should not blindly depend on random public instances. Prefer user-configured/self-hosted SearXNG, plus health checks and graceful fallback.
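
To make the documented query form concrete, here is a minimal sketch of how a request URL for the SearXNG JSON API could be assembled from a configured base URL. The helper names `percent_encode` and `searxng_search_url` are hypothetical and do not exist in web_search.rs. Accepting a base URL with or without a trailing /search is an assumption about how base_url might be handled. Only q, format=json, and pageno from the parameter list above are used.

```rust
/// Percent-encode a query value, keeping only RFC 3986 unreserved characters as-is.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            // Every other byte (including each byte of a UTF-8 sequence) becomes %XX.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Hypothetical helper: build a SearXNG JSON search URL.
/// Assumption: `base_url` may be given with or without a trailing `/search`.
fn searxng_search_url(base_url: &str, query: &str, pageno: Option<u32>) -> String {
    let trimmed = base_url.trim_end_matches('/');
    let root = trimmed.strip_suffix("/search").unwrap_or(trimmed);
    let mut url = format!("{}/search?q={}&format=json", root, percent_encode(query));
    if let Some(page) = pageno {
        url.push_str(&format!("&pageno={}", page));
    }
    url
}
```

With this sketch, `searxng_search_url("https://search.example.org/search/", "a&b", Some(2))` yields `https://search.example.org/search?q=a%26b&format=json&pageno=2`. A real implementation would more likely lean on an existing URL crate than on hand-rolled encoding.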

Proposed fix

  1. Add a first-class searxng search provider.
    • Use the SearXNG /search?q=...&format=json API.
    • Support both GET and POST if practical; SearXNG docs show both forms.
    • Normalize JSON results into the existing WebSearchResponse shape.
  2. Treat SearXNG as a configured backend, not a random public-instance scraper.
    • Add [search] provider = "searxng".
    • Add [search] base_url = "https://search.example.org", and handle base URLs given with or without the /search path.
    • Document that self-hosted or trusted instances are recommended.
  3. Add health/fallback behavior.
    • Detect 403, 429, invalid JSON, disabled JSON format, and empty results.
    • Return a clear message explaining the backend failure and configured fallback path.
    • Optionally support ordered fallback providers, e.g. fallbacks = ["metaso", "duckduckgo"].
  4. Improve agent/user observability.
    • When an agent calls web_search, the Activity/tool UI should show that a web search is happening and whether it succeeded, failed, or fell back.
    • Result summaries should include source/backend and count so failures do not look like silence.
  5. Add docs and examples.
    • Update config.example.toml and any search docs with SearXNG setup.
    • Include a local/self-hosted SearXNG example and a note about public-instance rate limits/bot defenses.
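
The ordered-fallback idea in step 3 and the visibility goal in step 4 could be sketched roughly as follows. Everything here is illustrative: `Attempt`, `SearchOutcome`, and `search_with_fallback` are hypothetical names, the `Vec<String>` results stand in for the real WebSearchResponse shape, and the backend call is passed in as a closure so the sketch stays free of network code.

```rust
/// Record of one attempt against one backend (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Attempt {
    backend: String,
    error: Option<String>, // None means this backend served the results
}

/// What the tool could report back: results plus a visible attempt trail.
#[derive(Debug, PartialEq)]
struct SearchOutcome {
    backend: Option<String>, // backend that finally answered, if any
    results: Vec<String>,    // stand-in for normalized title/url/snippet entries
    attempts: Vec<Attempt>,
}

/// Try each backend in order and stop at the first one with non-empty results.
fn search_with_fallback<F>(backends: &[&str], query: &str, mut run: F) -> SearchOutcome
where
    F: FnMut(&str, &str) -> Result<Vec<String>, String>,
{
    let mut attempts = Vec::new();
    for &name in backends {
        match run(name, query) {
            Ok(results) if !results.is_empty() => {
                attempts.push(Attempt { backend: name.to_string(), error: None });
                return SearchOutcome { backend: Some(name.to_string()), results, attempts };
            }
            // An empty result set counts as a failure so the next backend gets a turn.
            Ok(_) => attempts.push(Attempt {
                backend: name.to_string(),
                error: Some("empty results".to_string()),
            }),
            Err(e) => attempts.push(Attempt { backend: name.to_string(), error: Some(e) }),
        }
    }
    // Every backend failed: return the full trail so the failure is not silent.
    SearchOutcome { backend: None, results: Vec::new(), attempts }
}
```

Keeping the attempt trail in the return value is one way to let the Activity/tool UI show which backend failed, why, and which one eventually answered.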

Acceptance criteria

  • SearchProvider supports searxng.
  • [search] provider = "searxng" with base_url calls SearXNG JSON search and returns normalized title/url/snippet results.
  • SearXNG errors (403, 429, disabled JSON format, invalid JSON, empty results) produce actionable tool errors/messages.
  • Web search Activity/tool UI clearly shows “searching the web,” backend name, success/failure, and fallback when applicable.
  • Config docs explain when to use SearXNG, how to configure base_url, and why self-hosted/trusted instances are preferred over arbitrary public instances.
  • Existing DuckDuckGo/Bing/Tavily/etc. behavior remains available.
  • Tests cover successful SearXNG JSON parsing, no-results response, HTTP error, invalid JSON, and fallback messaging.
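
As a rough illustration of the error cases in these criteria, the sketch below maps a response to a failure category with an actionable message. The `SearxngFailure` enum and `classify` function are hypothetical. The function takes already-extracted facts (status code, whether the body parsed as JSON, result count) rather than a real HTTP response. The 403 message assumes that SearXNG answers 403 when JSON output is not enabled on the instance, which should be confirmed against the SearXNG docs.

```rust
use std::fmt;

/// Hypothetical failure categories mirroring the conditions listed above.
#[derive(Debug, PartialEq)]
enum SearxngFailure {
    Forbidden,      // HTTP 403
    RateLimited,    // HTTP 429
    OtherHttp(u16), // any other non-2xx status
    InvalidJson,
    NoResults,
}

impl fmt::Display for SearxngFailure {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Assumption: a 403 can also mean the `json` format is disabled on the instance.
            SearxngFailure::Forbidden => write!(
                f,
                "SearXNG returned 403: the instance may block this client, or JSON output may not be enabled; check the instance settings or use a trusted instance"
            ),
            SearxngFailure::RateLimited => write!(
                f,
                "SearXNG returned 429: rate limited; retry later or use a self-hosted instance"
            ),
            SearxngFailure::OtherHttp(code) => write!(f, "SearXNG returned HTTP {}", code),
            SearxngFailure::InvalidJson => write!(
                f,
                "SearXNG response was not valid JSON; check that base_url points at a SearXNG instance"
            ),
            SearxngFailure::NoResults => write!(f, "SearXNG returned no results for this query"),
        }
    }
}

/// Classify a response; on success, return the number of results.
fn classify(status: u16, body_is_json: bool, result_count: usize) -> Result<usize, SearxngFailure> {
    match status {
        403 => Err(SearxngFailure::Forbidden),
        429 => Err(SearxngFailure::RateLimited),
        s if !(200..300).contains(&s) => Err(SearxngFailure::OtherHttp(s)),
        _ if !body_is_json => Err(SearxngFailure::InvalidJson),
        _ if result_count == 0 => Err(SearxngFailure::NoResults),
        _ => Ok(result_count),
    }
}
```

A pure function of this shape is straightforward to cover with the unit tests listed above, since no network access or mock server is needed.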

Verification

cargo test -p codewhale-tui web_search
cargo test -p codewhale-tui search
cargo test -p codewhale-tui config

Manual smoke tests:

# with a trusted/self-hosted SearXNG instance configured
codewhale -p 'search the web for CodeWhale GitHub release notes and cite URLs'

Expected: the UI shows a visible web-search action, the result reports source: searxng, and failures/fallbacks are explicit instead of silent.
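
One possible shape for the visible status text is sketched below. The `SearchStatus` enum, the `status_line` function, and the exact wording of each line are assumptions for illustration, not the existing Activity/tool UI API.

```rust
/// Hypothetical states a web search could surface to the user.
enum SearchStatus<'a> {
    Searching { backend: &'a str },
    Succeeded { backend: &'a str, count: usize },
    Failed { backend: &'a str, reason: &'a str },
    FellBack { from: &'a str, to: &'a str, count: usize },
}

/// Render one status as a single line that always names the backend.
fn status_line(status: &SearchStatus<'_>) -> String {
    match status {
        SearchStatus::Searching { backend } => {
            format!("web_search: searching the web (source: {})", backend)
        }
        SearchStatus::Succeeded { backend, count } => {
            format!("web_search: {} results (source: {})", count, backend)
        }
        SearchStatus::Failed { backend, reason } => {
            format!("web_search: failed (source: {}): {}", backend, reason)
        }
        SearchStatus::FellBack { from, to, count } => {
            format!("web_search: {} results (source: {}, fell back from {})", count, to, from)
        }
    }
}
```

Including the backend name and count in every line means a failed or empty search still produces visible output rather than silence.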

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
    reliability: Reliability, flaky behavior, retries, fallbacks, and robustness
    tools: Tool execution, tool schemas, tool UX, and built-in tool behavior
    tui: Terminal UI behavior, rendering, or interaction
    v0.8.65: Targeting v0.8.65
    web-search: Web search tool, search backends, result parsing, and browser/fetch search UX

    Projects

    Status
    Backlog

    Relationships

    None yet

    Development

    No branches or pull requests