1.2.0-rc.1#67
Merged
Merged
Conversation
Contributor
Reviewer's GuideImplements 1.2.0-rc.1 with streaming stability, better reasoning/usage accounting, updated default models, and new tooling observability, while adding tests around token usage, cost calculation, reasoning streaming, and session usage persistence. Sequence diagram for LLM usable execution task observerssequenceDiagram
participant Agent
participant Chatter
participant Tooling as llm_tool_call
participant Exec as LLMUsableExecution
participant Observer
Agent->>Agent: execute_local_usable(task_observer)
Agent->>Chatter: run_tool_call(..., task_observer)
Chatter->>Tooling: run_tool_call(..., task_observer)
Tooling->>Tooling: exec_llm_usable(..., task_observer)
Tooling->>Tooling: create_llm_usable_execution(..., task_observer)
Tooling->>Exec: __init__(execution)
Tooling->>Exec: execution.add_task_observer(task_observer)
Exec->>Exec: _set_task(asyncio.create_task(...))
Exec-->>Observer: task_observer(task)
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- In
AnthropicChatClient._create_non_stream,_extract_anthropic_usageis called with the fullresponseobject instead of itsusagefield; this will miss the actual token stats on typical Anthropic responses (where usage is nested), so consider passingresponse.usage(or_get_attr(response, "usage")) instead. - In the Anthropic streaming path (
_create_stream), when you mergemessage_deltausage intoinput_usageyou updatecompletion_tokensbut never recomputetotal_tokens(and derived fields likecache_miss_tokens), so it would be safer to either call_extract_anthropic_usageon a combined usage object or explicitly recalculatetotal_tokensafter settingcompletion_tokens.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `AnthropicChatClient._create_non_stream`, `_extract_anthropic_usage` is called with the full `response` object instead of its `usage` field; this will miss the actual token stats on typical Anthropic responses (where usage is nested), so consider passing `response.usage` (or `_get_attr(response, "usage")`) instead.
- In the Anthropic streaming path (`_create_stream`), when you merge `message_delta` usage into `input_usage` you update `completion_tokens` but never recompute `total_tokens` (and derived fields like `cache_miss_tokens`), so it would be safer to either call `_extract_anthropic_usage` on a combined usage object or explicitly recalculate `total_tokens` after setting `completion_tokens`.
## Individual Comments
### Comment 1
<location path="src/kernel/llm/model_client/anthropic_client.py" line_range="565" />
<code_context>
if not reasoning_parts and reasoning_content:
reasoning_parts = [ReasoningText(reasoning_content)]
- return message_text, tool_calls, None, reasoning_parts or None
+ usage = _extract_anthropic_usage(response)
+ return message_text, tool_calls, None, reasoning_parts or None, usage
</code_context>
<issue_to_address>
**issue (bug_risk):** Non-stream Anthropic usage is extracted from the whole response object instead of its `usage` field
`response` is the full message object, but `_extract_anthropic_usage` expects the Anthropic `usage` payload. As written, this will almost always fall back to default values instead of real token counts, skewing cost and usage metrics. You likely want to pass `response.usage` (or similar), e.g.:
```python
usage = _extract_anthropic_usage(getattr(response, "usage", None))
```
</issue_to_address>
### Comment 2
<location path="src/kernel/llm/model_client/anthropic_client.py" line_range="593-602" />
<code_context>
+ continue
+
+ # ── message_delta:合并 output 侧 usage 并产出 StreamEvent ──
+ if event_type == "message_delta":
+ delta_usage = _get_attr(event, "usage")
+ if delta_usage is not None:
+ output_tokens = _get_attr(delta_usage, "output_tokens", 0) or 0
+ input_usage["completion_tokens"] = output_tokens
+ # Anthropic output_tokens 已包含 reasoning tokens
+ input_usage["completion_includes_reasoning"] = True
+ yield StreamEvent(usage=dict(input_usage) if input_usage else None)
+ continue
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Anthropic streaming usage aggregation does not update `total_tokens` when output tokens arrive
In the streaming path, `input_usage` is initialized from `message_start` with `total_tokens = prompt_tokens + completion_tokens`. When handling `message_delta`, you overwrite `completion_tokens` but never recompute `total_tokens`, so it will be stale and won’t match the final output token count.
Consider updating `total_tokens` whenever you set `completion_tokens`, e.g.:
```python
if delta_usage is not None:
output_tokens = _get_attr(delta_usage, "output_tokens", 0) or 0
input_usage["completion_tokens"] = output_tokens
input_usage["total_tokens"] = input_usage.get("prompt_tokens", 0) + output_tokens
input_usage["completion_includes_reasoning"] = True
```
This keeps streaming and non-streaming usage accounting consistent.
```suggestion
# ── message_start:提取 input 侧 usage ──
if event_type == "message_start":
msg_obj = _get_attr(event, "message")
if msg_obj is not None:
msg_usage = _get_attr(msg_obj, "usage")
if msg_usage is not None:
input_usage = _extract_anthropic_usage(msg_usage)
continue
# ── message_delta:合并 output 侧 usage 并产出 StreamEvent ──
if event_type == "message_delta":
delta_usage = _get_attr(event, "usage")
if delta_usage is not None:
output_tokens = _get_attr(delta_usage, "output_tokens", 0) or 0
input_usage["completion_tokens"] = output_tokens
# 更新 total_tokens 以保持与非流式统计一致
input_usage["total_tokens"] = input_usage.get("prompt_tokens", 0) + output_tokens
# Anthropic output_tokens 已包含 reasoning tokens
input_usage["completion_includes_reasoning"] = True
yield StreamEvent(usage=dict(input_usage) if input_usage else None)
continue
```
</issue_to_address>
### Comment 3
<location path="test/kernel/llm/test_response.py" line_range="260-269" />
<code_context>
+ def store(self, tmp_path: Path) -> SessionStore:
+ return SessionStore(str(tmp_path))
+
+ @pytest.mark.asyncio
+ async def test_save_and_load_with_usage(self, store: SessionStore) -> None:
+ usage = {
+ "gpt-4o": {
+ "prompt_tokens": 5000,
+ "completion_tokens": 2000,
+ "cache_hit_tokens": 500,
+ "cache_write_tokens": 0,
+ "reasoning_tokens": 300,
+ "cost": 0.05,
+ "request_count": 8,
+ }
+ }
+ data = SessionData(
+ session_id="sess-1",
+ working_directory="/tmp/project",
+ title="Test Session",
+ created_at=1000.0,
+ phase="ready",
+ usage_total=usage,
+ )
+ await store.save("sess-1", data)
+
+ loaded = await store.load("sess-1")
</code_context>
<issue_to_address>
**issue (testing):** The new async reasoning snapshot test doesn’t match the updated LLMResponse iteration semantics
This test iterates over `response` (i.e. `LLMResponse.__aiter__`), which after the `stream_events` refactor now yields only `text_delta` chunks. With only two `text_delta` events ("答" and "案"), `snapshots` will only have two entries, so the expected `['先', '先', '先想', '先想']` will not be met. To validate per-event reasoning snapshots, iterate over `response.stream_events()` (as in `test_response_streaming_reasoning.py`) and assert across all four events. If instead `__aiter__` is meant to expose reasoning for every event, its implementation and associated tests should be updated to reflect that contract.
</issue_to_address>
### Comment 4
<location path="test/kernel/llm/test_anthropic_client.py" line_range="250-253" />
<code_context>
monkeypatch.setattr(client, "_get_client", lambda **_: fake_client)
- message, tool_calls, stream_iter, reasoning = await client.create(
+ message, tool_calls, stream_iter, reasoning, usage = await client.create(
model_name="claude-sonnet-4-6",
payloads=[LLMPayload(ROLE.USER, Text("hello")), LLMPayload(ROLE.TOOL, MockTool)],
tools=[],
</code_context>
<issue_to_address>
**suggestion (testing):** The updated Anthropics client tests don’t assert anything about the new usage return value
Since `AnthropicChatClient.create` now returns `usage`, these tests only unpack it without verifying anything. To exercise `_extract_anthropic_usage` in both non-stream and stream paths, consider making `fake_client.messages.create` return an object with a `usage` field and assert that it has the expected token and cache fields plus `completion_includes_reasoning=True`. For the streaming case, you could also assert that `stream_iter` yields an event with a non-`None` `.usage` matching the merged `message_start` and `message_delta` usage.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
feat: 在 AnthropicChatClient 中提取 usage 时支持默认值,增强流式处理中的 token 统计 test: 增加对 usage 返回值的验证,确保正确提取和计算 token 数量
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary by Sourcery
Update LLM streaming, usage accounting, and configuration defaults while improving watchdog robustness and tool execution observability.
New Features:
Bug Fixes:
Enhancements:
Tests:
Chores: