
Commit f923cfe

Authored by penny-team[bot], jaredlockhart, and claude
Fix search hallucination: single-query tool with parallel agent dispatch (jaredlockhart#910)
* Fix search hallucination: single-query tool with parallel agent dispatch

  Root cause: SearchTool accepted a queries list and concatenated multiple results into one tool message, which got truncated mid-content. The model then hallucinated the rest from memory.

  Fix: SearchTool.execute() now takes a single query: str. Parallelism moves to the agent loop — _process_tool_calls uses asyncio.gather() to dispatch all tool calls concurrently, then appends one tool message per result. This matches Ollama's native parallel tool call protocol.

  Also rewrites CONVERSATION_PROMPT and THINKING_SYSTEM_PROMPT to be tool-agnostic — search-specific language is replaced with neutral equivalents so the model uses the right tool (search, browse_url, etc.) for the job.

  Adds _make_parallel_tool_calls_response to the mock and a new TestParallelToolCalls test that verifies two tool calls in one turn produce two separate tool messages in the next Ollama call.

* Rework MultiTool as fetch with single queries array and URL auto-routing

  The previous MultiTool design used separate arrays (queries, urls, news) and a complex inner-call schema that gpt-oss:20b couldn't reliably follow. The model kept putting URLs in queries, inventing its own call formats, or hedging by duplicating entries across arrays.

  New design: a single queries array — the model dumps everything in one list and Python routes URLs to browse_url via regex and plain text to search. This matches the pattern the model already learned from the original single-query search tool.

  Key changes:
  - MultiTool renamed to "fetch" (avoids name collision with SearchTool)
  - Schema simplified to just queries[] — URLs auto-detected and routed
  - _create_search_tool returns SearchTool | None (was list for no reason)
  - MAX_TOOL_RESULT_CHARS raised from 8k to 50k (web pages need room)
  - Chat page context injection uses fetch format (was stale browse_url)
  - Browser channel tool status shows a cumulative checklist with checkmarks
  - CONVERSATION_PROMPT kept tool-agnostic (tool descriptions do the work)
  - browse_url retries the full tab lifecycle up to 3x on empty content
  - Tab load + tool timeouts raised to 60s for JS-heavy pages (e.g. IMDb)
  - Test: two 15k-char results both survive into model context without truncation

* Hide browse_url tabs with the tabHide API

  Tabs were visible in the tab bar during page reads because active: false only prevents focus steal. Now calls browser.tabs.hide() after creation, with a graceful fallback.

* Give ThinkingAgent its own MultiTool (max_calls=1)

  Moves multi_tool support to the base Agent class so both ChatAgent and ThinkingAgent use MultiTool for tool dispatch. ThinkingAgent gets its own instance with max_calls=1 (matching the old single-query cap on main). Both MultiTools share the same browse_url provider.

* Enforce max_calls in the MultiTool schema with per-instance maxItems

  The model was sending multiple queries from ThinkingAgent because the schema had no maxItems constraint. Now MultiTool sets its description and parameters per-instance based on max_calls, matching how SearchTool on main advertised its cap via maxItems in the JSON schema.

Co-authored-by: Jared Lockhart <119884+jaredlockhart@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
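The dispatch pattern from the first bullet can be sketched as follows. This is a simplified illustration, not penny's actual code: `execute_tool` and the message dicts are hypothetical stand-ins, but the core move matches the commit description — `asyncio.gather()` runs the calls concurrently and one tool message is appended per result.

```python
import asyncio


async def execute_tool(name: str, arguments: dict) -> str:
    # Illustrative stand-in for a real tool executor (search, browse_url, ...).
    await asyncio.sleep(0)  # simulate I/O
    return f"result for {arguments['query']}"


async def process_tool_calls(tool_calls: list[dict]) -> list[dict]:
    # Dispatch every call concurrently, then append ONE tool message per
    # result. Concatenating results into a single message risks mid-content
    # truncation, which is what triggered the hallucination described above.
    results = await asyncio.gather(
        *[execute_tool(tc["name"], tc["arguments"]) for tc in tool_calls]
    )
    return [
        {"role": "tool", "tool_name": tc["name"], "content": result}
        for tc, result in zip(tool_calls, results)
    ]
```

Because `asyncio.gather()` returns results in the order the awaitables were passed, each tool message lines up with its originating call even though execution overlaps.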
1 parent 39e9c1c commit f923cfe

24 files changed

Lines changed: 604 additions & 321 deletions

browser/manifest.json

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@
   "permissions": [
     "storage",
     "tabs",
+    "tabHide",
     "<all_urls>"
   ],

browser/src/background/tools/browse_url.ts

Lines changed: 27 additions & 7 deletions
@@ -4,28 +4,48 @@
 
 import { TAB_LOAD_TIMEOUT_MS } from "../../protocol.js";
 
+const BROWSE_MAX_RETRIES = 3;
+
 interface PageData {
   title: string;
   url: string;
   text: string;
 }
 
 export async function browseUrl(url: string): Promise<string> {
-  const tab = await openHiddenTab(url);
-  try {
-    await waitForTabLoad(tab.id!);
-    const pageData = await extractPageContent(tab.id!);
-    return formatResult(pageData);
-  } finally {
-    await closeTab(tab.id!);
+  for (let attempt = 1; attempt <= BROWSE_MAX_RETRIES; attempt++) {
+    console.log(`[browse_url] attempt ${attempt}/${BROWSE_MAX_RETRIES}: ${url}`);
+    const tab = await openHiddenTab(url);
+    try {
+      await waitForTabLoad(tab.id!);
+      console.log(`[browse_url] page complete, extracting content`);
+      const pageData = await extractPageContent(tab.id!);
+      const textLen = pageData.text.trim().length;
+      if (textLen > 0) {
+        console.log(`[browse_url] extracted ${textLen} chars`);
+        return formatResult(pageData);
+      }
+      console.warn(`[browse_url] empty content on attempt ${attempt}`);
+    } catch (err) {
+      console.error(`[browse_url] attempt ${attempt} failed:`, err);
+    } finally {
+      await closeTab(tab.id!);
+    }
   }
+  console.error(`[browse_url] gave up after ${BROWSE_MAX_RETRIES} attempts: ${url}`);
+  return `No content extracted from ${url} after ${BROWSE_MAX_RETRIES} attempts`;
 }
 
 async function openHiddenTab(url: string): Promise<browser.tabs.Tab> {
   const tab = await browser.tabs.create({ url, active: false });
   if (!tab.id) {
     throw new Error("Failed to create tab");
   }
+  try {
+    await browser.tabs.hide(tab.id);
+  } catch {
+    // tabHide may not be available — tab stays visible but still works
+  }
   return tab;
 }

browser/src/protocol.ts

Lines changed: 2 additions & 2 deletions
@@ -389,8 +389,8 @@ export const MAX_PAGE_CONTEXT_CHARS = 5_000;
 // --- Tool constants ---
 
 export const THOUGHTS_POLL_INTERVAL_MS = 300_000;
-export const TOOL_TIMEOUT_MS = 30_000;
-export const TAB_LOAD_TIMEOUT_MS = 15_000;
+export const TOOL_TIMEOUT_MS = 60_000;
+export const TAB_LOAD_TIMEOUT_MS = 60_000;
 export const MAX_EXTRACTED_CHARS = 50_000;
 
 // --- Chat UI ---

penny/penny/agents/base.py

Lines changed: 45 additions & 18 deletions
@@ -2,6 +2,7 @@
 
 from __future__ import annotations
 
+import asyncio
 import logging
 import re
 import urllib.parse as _urlparse
@@ -18,6 +19,7 @@
 from penny.responses import PennyResponse
 from penny.tools import SearchTool, Tool, ToolCall, ToolExecutor, ToolRegistry
 from penny.tools.models import SearchResult
+from penny.tools.multi import MultiTool
 
 logger = logging.getLogger(__name__)
 
@@ -191,6 +193,7 @@ def __init__(
         allow_repeat_tools: bool = False,
         search_tool: Tool | None = None,
         news_tool: Tool | None = None,
+        multi_tool: MultiTool | None = None,
     ):
         self.config = config
         self.system_prompt = system_prompt
@@ -204,6 +207,7 @@
 
         self._search_tool = search_tool
         self._news_tool = news_tool
+        self._multi_tool = multi_tool
         self._browser_tools_provider: Callable[[], list[Tool]] | None = None
         self._current_user: str | None = None
         self._tool_result_text: list[str] = []
@@ -302,7 +306,7 @@ async def run(
         max_steps: int,
         history: list[tuple[str, str]] | None = None,
         system_prompt: str | None = None,
-        on_tool_start: Callable[[str, dict], Awaitable[None]] | None = None,
+        on_tool_start: Callable[[list[tuple[str, dict]]], Awaitable[None]] | None = None,
     ) -> ControllerResponse:
         """Run the agentic loop — prompt in, response out."""
         self._tool_result_text = []
@@ -323,7 +327,7 @@ async def _run_agentic_loop(
         messages: list[dict],
         tools: list[dict],
         steps: int,
-        on_tool_start: Callable[[str, dict], Awaitable[None]] | None = None,
+        on_tool_start: Callable[[list[tuple[str, dict]]], Awaitable[None]] | None = None,
     ) -> ControllerResponse:
         """Execute the step loop: call model, process tool calls, or return final answer."""
         attachments: list[str] = []
@@ -596,7 +600,12 @@ def set_browser_tools_provider(self, provider: Callable[[], list[Tool]]) -> None
         self._browser_tools_provider = provider
 
     def get_tools(self, user: str) -> list[Tool]:
-        """Build tool list for this agent. Override in subclasses for custom tools."""
+        """Build tool list for this agent.
+
+        Returns MultiTool if configured, otherwise individual tools.
+        """
+        if self._multi_tool is not None:
+            return [self._multi_tool]
         tools: list[Tool] = []
         if self._search_tool:
             tools.append(self._search_tool)
@@ -617,16 +626,20 @@ async def _process_tool_calls(
         self,
         response,
         called_tools: set[tuple[str, ...]],
-        on_tool_start: Callable[[str, dict], Awaitable[None]] | None = None,
+        on_tool_start: Callable[[list[tuple[str, dict]]], Awaitable[None]] | None = None,
     ) -> _StepResult:
-        """Process all tool calls from a model response. Returns results to append."""
+        """Process all tool calls from a model response, executing valid ones in parallel."""
         logger.info("Model requested %d tool call(s)", len(response.message.tool_calls or []))
         messages: list[dict] = [response.message.to_input_message()]
         records: list[ToolCallRecord] = []
         source_urls: list[str] = []
         attachments: list[str] = []
 
-        for ollama_tool_call in response.message.tool_calls or []:
+        # Dedup check and on_tool_start are sequential: dedup requires ordered mutation of
+        # called_tools, and on_tool_start fires UI status updates before execution begins.
+        max_calls = int(self.config.runtime.MESSAGE_MAX_TOOL_CALLS) if self.config else 5
+        pending: list[tuple[str, dict, str | None]] = []
+        for ollama_tool_call in (response.message.tool_calls or [])[:max_calls]:
             tool_name = ollama_tool_call.function.name
             arguments = ollama_tool_call.function.arguments
 
@@ -643,14 +656,24 @@
                 continue
 
             called_tools.add(call_key)
-            if on_tool_start:
-                try:
-                    await on_tool_start(tool_name, dict(arguments))
-                except Exception:
-                    logger.debug("on_tool_start callback failed for %s", tool_name)
-            result_str, record, urls, image = await self._execute_single_tool(
-                tool_name, arguments, reasoning
-            )
+            pending.append((tool_name, arguments, reasoning))
+
+        # Fire on_tool_start once with all pending tools so the UI can show
+        # a combined status (e.g. "Searching A + Searching B") for parallel calls.
+        if on_tool_start and pending:
+            try:
+                await on_tool_start([(name, dict(args)) for name, args, _ in pending])
+            except Exception:
+                logger.debug("on_tool_start callback failed")
+
+        # Execute all valid tool calls in parallel.
+        results = await asyncio.gather(
+            *[self._execute_single_tool(name, args, reasoning) for name, args, reasoning in pending]
+        )
+
+        for (tool_name, _, _), (result_str, record, urls, image) in zip(
+            pending, results, strict=True
+        ):
             records.append(record)
             source_urls.extend(urls)
             if image:
@@ -820,8 +843,9 @@ def _identity_section(self) -> str:
     def _instructions_section(self, override: str | None = None) -> str:
         """## Instructions — agent-specific prompt with tool descriptions."""
         prompt = override or self.system_prompt
-        if "{tools}" in prompt:
-            prompt = prompt.format(tools=self._build_tool_summary())
+        if "{tools}" in prompt or "{max_tool_calls}" in prompt:
+            max_tool_calls = int(self.config.runtime.MESSAGE_MAX_TOOL_CALLS) if self.config else 5
+            prompt = prompt.format(tools=self._build_tool_summary(), max_tool_calls=max_tool_calls)
         return f"## Instructions\n{prompt}"
 
     @staticmethod
@@ -842,8 +866,11 @@ def _profile_section(self, sender: str, content: str | None = None) -> str | None:
         if content is not None:
             name = user_info.name
             user_said_name = bool(re.search(rf"\b{re.escape(name)}\b", content, re.IGNORECASE))
-            if self._search_tool and isinstance(self._search_tool, SearchTool):
-                self._search_tool.redact_terms = [] if user_said_name else [name]
+            redact = [] if user_said_name else [name]
+            if self._multi_tool is not None:
+                self._multi_tool.redact_terms = redact
+            elif self._search_tool and isinstance(self._search_tool, SearchTool):
+                self._search_tool.redact_terms = redact
 
         logger.debug("Built profile context for %s", sender)
         return f"### User Profile\nThe user's name is {user_info.name}."
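The per-instance maxItems enforcement mentioned in the commit message can be illustrated with a small sketch. The `build_parameters` helper below is hypothetical (the real MultiTool builds its schema internally), but it shows the idea: encode max_calls as a hard JSON Schema constraint the model can see, rather than a soft prose instruction.

```python
def build_parameters(max_calls: int) -> dict:
    # Encode the call cap directly in the tool's JSON schema. A model that
    # respects the schema cannot send more than max_calls queries, which is
    # how ThinkingAgent (max_calls=1) is held to a single query per turn.
    return {
        "type": "object",
        "properties": {
            "queries": {
                "type": "array",
                "items": {"type": "string"},
                "minItems": 1,
                "maxItems": max_calls,
                "description": f"Up to {max_calls} search queries or URLs",
            }
        },
        "required": ["queries"],
    }
```

With this shape, a ChatAgent instance built with max_calls=5 and a ThinkingAgent instance built with max_calls=1 advertise different caps from the same class, which is the "per-instance" part of the fix.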

penny/penny/agents/chat.py

Lines changed: 14 additions & 7 deletions
@@ -15,6 +15,7 @@
 from penny.constants import PennyConstants
 from penny.prompts import Prompt
 from penny.responses import PennyResponse
+from penny.tools.multi import MultiTool
 
 logger = logging.getLogger(__name__)
 
@@ -53,7 +54,7 @@ async def handle(
         sender: str,
         images: list[str] | None = None,
         page_context: PageContext | None = None,
-        on_tool_start: Callable[[str, dict], Awaitable[None]] | None = None,
+        on_tool_start: Callable[[list[tuple[str, dict]]], Awaitable[None]] | None = None,
     ) -> ControllerResponse:
         """Handle an incoming message — summary method.
@@ -100,32 +101,38 @@ def _build_messages(
         history: list[tuple[str, str]] | None = None,
         system_prompt: str | None = None,
     ) -> list[dict]:
-        """Build messages, injecting page context as a synthetic browse_url result."""
+        """Build messages, injecting page context as a synthetic tools result."""
         messages = super()._build_messages(prompt, history, system_prompt)
         if self._pending_page_context:
             self._inject_page_context(messages, self._pending_page_context)
         return messages
 
     @staticmethod
     def _inject_page_context(messages: list[dict], page_context: PageContext) -> None:
-        """Inject a synthetic browse_url tool call + result after the user prompt."""
+        """Inject a synthetic search call + result for page context.
+
+        Uses the MultiTool format so the synthetic history matches the tool
+        the model actually sees in its tool definitions.
+        """
         if not page_context.text:
             return
 
         page_content = (
             f"Title: {page_context.title}\nURL: {page_context.url}\n\n{page_context.text}"
         )
 
-        # Assistant "called" browse_url for the current page
+        # Assistant "called" fetch with the URL in queries
        messages.append(
             {
                 "role": "assistant",
                 "content": "",
                 "tool_calls": [
                     {
                         "function": {
-                            "name": "browse_url",
-                            "arguments": {"url": page_context.url},
+                            "name": MultiTool.name,
+                            "arguments": {
+                                "queries": [page_context.url],
+                            },
                         },
                     }
                 ],
@@ -136,7 +143,7 @@ def _inject_page_context(messages: list[dict], page_context: PageContext) -> None:
             {
                 "role": "tool",
                 "content": page_content,
-                "tool_name": "browse_url",
+                "tool_name": MultiTool.name,
             }
         )

penny/penny/agents/notify.py

Lines changed: 2 additions & 2 deletions
@@ -628,8 +628,8 @@ def _extract_search_query(tool_calls: list[ToolCallRecord]) -> str | None:
         if tc.tool != "search":
             continue
         args = SearchArgs.model_validate(tc.arguments)
-        if args.queries:
-            return args.queries[0]
+        if args.query:
+            return args.query
     return None
 
 # Matches **bold text** in markdown (first occurrence)

penny/penny/channels/base.py

Lines changed: 8 additions & 5 deletions
@@ -312,11 +312,14 @@ async def _embed_message(self, message_id: int, content: str) -> None:
 def _extract_image_prompt(response) -> str | None:
     """Extract a short image search query from the agent's tool calls."""
     for tc in response.tool_calls or []:
-        if tc.tool != "search":
-            continue
-        args = SearchArgs.model_validate(tc.arguments)
-        if args.queries:
-            return args.queries[0]
+        if tc.tool == "fetch":
+            for q in tc.arguments.get("queries", []):
+                if not q.startswith("http"):
+                    return q
+        elif tc.tool == "search":
+            args = SearchArgs.model_validate(tc.arguments)
+            if args.query:
+                return args.query
     return None
 
 async def _resolve_image(
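The `startswith("http")` check in the diff above is a lightweight form of the URL auto-routing the commit message describes ("Python routes URLs to browse_url via regex"). A slightly fuller sketch follows; the `route_queries` helper and its regex are assumptions for illustration, not penny's actual routing code.

```python
import re

# Heuristic: absolute http(s) URLs, www-prefixed hosts, or bare domains
# with a path ("imdb.com/title"). Plain prose fails the match and is
# treated as a search query.
_URL_RE = re.compile(r"^(https?://|www\.|[\w-]+(\.[\w-]+)+(/|$))", re.IGNORECASE)


def route_queries(queries: list[str]) -> tuple[list[str], list[str]]:
    """Split a mixed queries list into (urls, searches), preserving order."""
    urls = [q for q in queries if _URL_RE.match(q.strip())]
    searches = [q for q in queries if not _URL_RE.match(q.strip())]
    return urls, searches
```

Routing in Python instead of the schema means the model only ever learns one shape (a flat queries list), which sidesteps the failure mode where gpt-oss:20b put URLs in the wrong array or invented its own call format.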

penny/penny/channels/browser/channel.py

Lines changed: 16 additions & 4 deletions
@@ -511,11 +511,23 @@ async def send_typing(self, recipient: str, typing: bool) -> bool:
         return True
 
     def _make_handle_kwargs(self, message: IncomingMessage) -> dict:
-        """Pass an on_tool_start callback so tool calls update the typing indicator."""
-        recipient = message.sender
+        """Pass an on_tool_start callback so tool calls update the typing indicator.
 
-        async def on_tool_start(tool_name: str, arguments: dict) -> None:
-            await self._send_tool_status(recipient, self._format_tool_status(tool_name, arguments))
+        Builds a cumulative checklist: prior steps show as completed (checkmark),
+        current step shows as in-progress (dots).
+        """
+        recipient = message.sender
+        completed: list[str] = []
+
+        async def on_tool_start(tools: list[tuple[str, dict]]) -> None:
+            current = [self._format_tool_status(name, args) for name, args in tools]
+            lines: list[str] = []
+            for item in completed:
+                lines.append(f"&#x2713; {item}")
+            for item in current:
+                lines.append(item)
+            await self._send_tool_status(recipient, "<br>".join(lines))
+            completed.extend(current)
 
         return {"on_tool_start": on_tool_start}
penny/penny/config_params.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -116,6 +116,15 @@ def _validate_unit_float(value: str) -> float:
116116
group=GROUP_GLOBAL,
117117
)
118118

119+
ConfigParam(
120+
key="MESSAGE_MAX_TOOL_CALLS",
121+
description="Max parallel tool calls per agent step",
122+
type=int,
123+
default=5,
124+
validator=_validate_positive_int,
125+
group=GROUP_GLOBAL,
126+
)
127+
119128
ConfigParam(
120129
key="IMAGE_DOWNLOAD_TIMEOUT",
121130
description="Timeout in seconds for image downloads",

penny/penny/constants.py

Lines changed: 1 addition & 1 deletion
@@ -110,7 +110,7 @@ class PreferenceSource(StrEnum):
 NEWS_NOTIFY_MAX_STEPS = 3
 TOOL_RESULT_TRUNCATION_THRESHOLD = 3
 TOOL_RESULT_TRUNCATION_MAX_CHARS = 500
-MAX_TOOL_RESULT_CHARS = 8000
+MAX_TOOL_RESULT_CHARS = 50000
 XML_RETRY_LIMIT = 3
 TOOL_FAILURE_ABORT_THRESHOLD = 2
 THOUGHT_CONTEXT_LIMIT = 10
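The MAX_TOOL_RESULT_CHARS bump above is what lets the commit's new test pass (two ~15k-char page reads surviving into model context intact). A minimal sketch of per-result clamping, assuming a hypothetical `clamp_tool_result` helper rather than penny's actual truncation code:

```python
MAX_TOOL_RESULT_CHARS = 50_000


def clamp_tool_result(text: str, limit: int = MAX_TOOL_RESULT_CHARS) -> str:
    # Truncation applies per tool result. At the old 8k limit, a 15k-char
    # page read lost over half its content before reaching the model,
    # inviting the model to hallucinate the missing tail.
    if len(text) <= limit:
        return text
    return text[:limit] + "\n[truncated]"
```

Because the agent loop now emits one tool message per result, each result gets the full budget on its own instead of sharing one truncated concatenation.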
