fix(workflow): preserve event.output for cached LlmAgent results #5587
ValentinCordonnier wants to merge 1 commit into google:v2
Conversation
The runner stripped event.output to None before persisting any non-partial event with node_info.message_as_output=True. On resume, _reconstruct_node_state fell back to event.content (a raw genai.types.Content) for the cached child output, so subscripting the result of ctx.run_node on an LlmAgent across a HITL pause crashed with "'Content' object is not subscriptable". Removing the strip lets the validated dict produced by process_llm_agent_output survive the persist/rehydrate roundtrip. Closes google#5553
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Response from ADK Triaging Agent: Hello @ValentinCordonnier, thank you for creating this detailed pull request! It looks like the CLA check is failing. Before we can review your contribution, you'll need to sign the Contributor License Agreement. You can do so by following the instructions at https://cla.developers.google.com/. Thanks!
samarth1224 left a comment:
Key Points:
Does this fix the issue?
- Yes
Is this the correct way to fix it?
- I believe not
Hey, while your changes do fix the issue, as far as my knowledge of this codebase goes, I believe this might not be the correct way to fix it.
As mentioned in my issue, in your PR, in the test cases, and as inferred from the codebase, event.output should ideally remain None when message_as_output == True, since the output can be inferred from event.content.
-
Therefore, I believe the change required must be made within the function _reconstruct_node_states() and not in _consume_event_queue().
-
Specifically, the change should be either a helper function that processes event.content into a suitable output before it is assigned to child.output, OR direct extraction of the output during the assignment to child.output.
I feel a helper function would be the better choice; in it we could also validate the output (stored in event.content) against the agent's output schema. (Validation might not be needed here, since the output has already been validated during the fresh run by the process_llm_agent_output() function.)
In a nutshell:
- I believe your changes defeat the purpose of the message_as_output attribute of the NodeInfo class.
- Your changes would need to be accompanied by additional changes in multiple files and functions, including _reconstruct_node_states(), _track_event_in_context(), etc., so that the codebase remains consistent.
- The issue could simply be resolved by processing event.content before assigning it to child.output in _reconstruct_node_states().
Thanks.
Fixes #5553
Summary
`Runner._consume_event_queue` was nulling `event.output` before persisting any non-partial event whose `node_info.message_as_output` was `True`. The intent looked like a denormalization step ("the LLM text already lives in `event.content`, no need to store it twice"), but it broke resume rehydration. On the live path, `process_llm_agent_output` populates `event.output` with the validated dict (or raw text, when no `output_schema`); after the strip, the persisted event lost that value. On resume, `_reconstruct_node_state` walked the persisted events, found `event.output is None`, and fell through to `child.output = event.content`, handing the orchestrator a raw `genai.types.Content` instead of the dict. Anything subscripting that result (`lab["findings"]`, `Schema.model_validate(out)`, …) crashed with `'Content' object is not subscriptable`, and only after a HITL pause, even though the same code worked on the initial pass. The diagnosis in #5553 by @samarth1224 already pinned the two functions involved; this PR is the minimal change.
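The failure mode can be reduced to a few lines. This is a simplified sketch, not the real ADK code (the `Event` stand-in and function names are condensed from the description above):

```python
import copy
from dataclasses import dataclass
from typing import Any

@dataclass
class Event:
    content: Any          # raw model response (stand-in for genai.types.Content)
    output: Any = None    # validated dict set by process_llm_agent_output

def persist_with_strip(event: Event) -> Event:
    """What the buggy _consume_event_queue effectively did before persisting."""
    persisted = copy.copy(event)
    persisted.output = None   # <-- the strip removed by this PR
    return persisted

def rehydrate_output(event: Event) -> Any:
    """What _reconstruct_node_state does on resume."""
    if event.output is not None:      # primary branch: never hit post-strip
        return event.output
    return event.content              # fallback: raw Content leaks out

live = Event(content="<genai Content>", output={"findings": ["ok"]})
cached = rehydrate_output(persist_with_strip(live))
# `cached` is now the raw content; subscripting it (cached["findings"])
# raises TypeError, while rehydrate_output(live) returns the dict.
```

Removing the strip makes `rehydrate_output` hit its primary branch, so the cached result has the same type on resume as on the initial pass.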
Changes
- `src/google/adk/runners.py`: removed the four-line strip block in `_consume_event_queue` that copied the event and assigned `event.output = None` for non-partial `message_as_output` events. With the strip gone, `event.output` survives the persist / rehydrate roundtrip and `_reconstruct_node_state`'s primary branch (`if event.output is not None: child.output = event.output`) returns the same Python object type the live path produced.
- `tests/unittests/workflow/test_workflow_hitl.py`: added a regression test that pauses a `rerun_on_resume=True` orchestrator on a `RequestInput`, resumes, and subscripts the cached `LlmAgent` output. Fails on `v2` HEAD with the reported `TypeError`; passes after this change.
- `tests/unittests/workflow/test_workflow_dynamic_nodes.py`: updated `test_workflow_resume_does_not_rerun_completed_llm_agent`. Its previous assertion `agent_event.output is None` (commented "Verify that runners.py cleared the output") was asserting the buggy behavior as if it were intentional. Replaced it with a positive check that the persisted output equals the agent's emitted text. The test's actual semantic guarantee, that the LLM is not re-fired on resume, still holds via the existing `agent_runs_again == []` assertion.
Motivation
Any `@node(rerun_on_resume=True)` orchestrator that subscripts an `LlmAgent` result via `ctx.run_node` and survives a `RequestInput` pause is silently broken on `v2`. Graph-edge propagation of `LlmAgent` outputs across resume has the same failure mode (the rehydrated `child.output` is a `Content` instead of the schema dict), so this isn't only a dynamic-orchestrator problem. The fix unblocks the entire HITL-with-structured-output pattern, which is a core `v2` use case.
What does not change
- `_node_runner._enqueue_event` runs before this strip in the queue and was already reading `event.output` correctly; non-resume runs are unaffected.
- `rerun_on_resume`, dedup of completed nodes, and the `_dynamic_node_scheduler` state machine are all unchanged. The fix only changes what value gets returned from the cache, not whether the cache is hit.
- The `event.content` fallback in `_reconstruct_node_state` is preserved (it now handles only legacy persisted events).
- `Event.output: Any | None` is unchanged; no contract said it had to be `None` for LLM events.
Downsides
- Persisted events get larger for `LlmAgent` final responses. The validated dict is stored alongside the original text in `content.parts`. For `output_schema` agents that's roughly the JSON serialization of the dict; for plain-text agents it's roughly the response text duplicated. The previous saving was illusory (rehydration was returning a different object type than the live path, which is the bug we're fixing), but storage-cost-sensitive deployments will see the bytes-per-event grow.
- Observers of `event.output` for LLM events used to see `None`; they will now see the dict / string. Any plugin that branched on `event.output is None` to detect "this is the final LLM event" needs to switch to `event.node_info.message_as_output` (unchanged, set by the live path). I couldn't find any in-tree plugin doing this, but third-party plugins exist.
- Sessions persisted before this fix still have `output=None` for these events; resuming them post-fix continues to hit the `event.content` fallback. That's not a regression (those sessions were broken before), but cross-version resume is not magically fixed. The fallback branch could be removed in a follow-up once legacy sessions age out.
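For the plugin-migration point above, a hypothetical sketch of the before/after check (the `Event`/`NodeInfo` stand-ins and both predicate names are illustrative, not real ADK API):

```python
from dataclasses import dataclass
from typing import Any

# Simplified stand-ins; the real ADK Event and NodeInfo carry more fields.
@dataclass
class NodeInfo:
    message_as_output: bool = False

@dataclass
class Event:
    node_info: NodeInfo
    output: Any = None
    partial: bool = False

def is_final_llm_event_old(event: Event) -> bool:
    # Pre-fix heuristic: relied on the strip having set output to None.
    return event.node_info.message_as_output and event.output is None

def is_final_llm_event_new(event: Event) -> bool:
    # Post-fix: key off the flag the live path sets, not the stripped output.
    return event.node_info.message_as_output and not event.partial
```

After this PR an event with a populated `output` fails the old `output is None` test, which is exactly why third-party plugins using that heuristic would need the flag-based check instead.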