fix(telemetry): unique run_id so concurrent sub-agents don't collide in the uploader#167
Open
justrach wants to merge 2 commits into
Open
fix(telemetry): unique run_id so concurrent sub-agents don't collide in the uploader#167justrach wants to merge 2 commits into
justrach wants to merge 2 commits into
Conversation
…s don't collide The uploader buffers in-flight runs in a HashMap keyed by (conversation_id, agent_id). Child recorders deliberately reuse the *parent's* conversation_id so /trace can group a fleet under one conversation — which means two concurrent sub-agents of the SAME agent (e.g. three parallel `forge` workers under one `task` fan-out) hash to the identical key. Their events interleave into one RunBuf and the envelope that uploads is a corrupted merge of both trajectories: wrong step counts, wrong token totals, wrong reward attribution. That poisons exactly the per-variant fitness signal the agent_version work exists to produce. Fix: stamp a process-unique `run_id` on every live trajectory event (a monotonic counter folded into the recorder, which is already one-per-run) and key the uploader by (conversation_id, agent_id, run_id). Parent grouping for /trace is untouched (it reads conversation_id); concurrent siblings now buffer and upload independently. `run_id` is Option<String> — None for events reconstructed from the DB (the upload path only ever sees live broadcast events, which always set it), so no migration and no DB schema change. Stacked on #166 (agent_version); together they make the trajectory feed both variant-attributed and collision-free for DGM-style evolution. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three deterministic tests pinning the producer -> consumer chain:
- concurrent_siblings_mint_and_emit_distinct_run_ids (recorder): two
recorders sharing the same (conversation, agent) -- what a parallel
`task` fan-out produces -- mint distinct run_ids shaped
{conv}-{agent}-r{n}, and those ids ride on every event they emit.
- concurrent_siblings_with_distinct_run_ids_do_not_collide (uploader):
interleaved sibling events land in two independent buffers, each
holding only its own steps.
- without_run_id_siblings_collide_into_one_buffer (uploader): the
identical interleaving with run_id=None collapses into one buffer --
proving run_id is the load-bearing part of the key, i.e. the exact
corruption the fix removes.
Also fixes the now-stale module doc (buffer key is
(conversation_id, agent_id, run_id)).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Action required: PR inactive for 5 days. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The bug
The trajectory uploader buffers in-flight runs in a
HashMapkeyed by(conversation_id, agent_id). Child recorders deliberately reuse the parent'sconversation_idso/tracecan group a whole fleet under one conversation.The collision: two concurrent sub-agents of the same agent — e.g. three parallel
forgeworkers spawned by onetaskfan-out, or anultracodeescalation — hash to the identical key. Their events interleave into oneRunBuf, and the envelope that finally uploads is a corrupted merge of both trajectories: wrong step counts, wrong token totals, wrong reward attribution.That poisons exactly the per-variant fitness signal #166's
agent_versionexists to produce — so the two fixes belong together.The fix
run_id: Option<String>toTrajectoryEvent.new()({conv}-{agent}-r{N},Nfrom a globalAtomicU64) and sets it on every broadcast event.(conversation_id, agent_id, run_id).Parent grouping for
/traceis untouched — it readsconversation_id. Concurrent siblings now buffer and upload independently.run_idisOptionandNonefor events reconstructed from the DB (the upload path only ever sees live broadcast events, which always set it) — no migration, no DB schema change.Why
Option/ no DB columnThe DB→event
TryFromand the repo test fixtures passNone; only the live recorder setsSome. The uploader consumes the live broadcast stream, so the key is always populated where it matters.Test
cargo check --tests -p forge_app -p forge_repo -p forge_domain— clean.cargo test -p forge_app -p forge_repo trajectory— 18 passed (13 + 5), incl. the agent_version envelope test.Stacking
Stacked on #166 (
feat/trajectory-agent-version) — base it there so this diff is purely therun_idchange. GitHub will retarget torelease/0.2.17when #166 merges.🤖 Generated with Claude Code