Skip to content

fix(telemetry): unique run_id so concurrent sub-agents don't collide in the uploader#167

Open
justrach wants to merge 2 commits into
feat/trajectory-agent-versionfrom
fix/trajectory-subagent-run-id
Open

fix(telemetry): unique run_id so concurrent sub-agents don't collide in the uploader#167
justrach wants to merge 2 commits into
feat/trajectory-agent-versionfrom
fix/trajectory-subagent-run-id

Conversation

@justrach

@justrach justrach commented Jun 4, 2026

Copy link
Copy Markdown
Owner

The bug

The trajectory uploader buffers in-flight runs in a HashMap keyed by (conversation_id, agent_id). Child recorders deliberately reuse the parent's conversation_id so /trace can group a whole fleet under one conversation.

The collision: two concurrent sub-agents of the same agent — e.g. three parallel forge workers spawned by one task fan-out, or an ultracode escalation — hash to the identical key. Their events interleave into one RunBuf, and the envelope that finally uploads is a corrupted merge of both trajectories: wrong step counts, wrong token totals, wrong reward attribution.

That poisons exactly the per-variant fitness signal #166's agent_version exists to produce — so the two fixes belong together.

The fix

  • Add run_id: Option<String> to TrajectoryEvent.
  • The recorder is already one instance per run, so it stamps a process-unique id once in new() ({conv}-{agent}-r{N}, N from a global AtomicU64) and sets it on every broadcast event.
  • Key the uploader by (conversation_id, agent_id, run_id).

Parent grouping for /trace is untouched — it reads conversation_id. Concurrent siblings now buffer and upload independently.

run_id is Option and None for events reconstructed from the DB (the upload path only ever sees live broadcast events, which always set it) — no migration, no DB schema change.

Why Option / no DB column

The DB→event TryFrom and the repo test fixtures pass None; only the live recorder sets Some. The uploader consumes the live broadcast stream, so the key is always populated where it matters.

Test

  • cargo check --tests -p forge_app -p forge_repo -p forge_domain — clean.
  • cargo test -p forge_app -p forge_repo trajectory18 passed (13 + 5), incl. the agent_version envelope test.

Stacking

Stacked on #166 (feat/trajectory-agent-version) — base it there so this diff is purely the run_id change. GitHub will retarget to release/0.2.17 when #166 merges.

🤖 Generated with Claude Code

justrach and others added 2 commits June 4, 2026 18:36
…s don't collide

The uploader buffers in-flight runs in a HashMap keyed by
(conversation_id, agent_id). Child recorders deliberately reuse the
*parent's* conversation_id so /trace can group a fleet under one
conversation — which means two concurrent sub-agents of the SAME agent
(e.g. three parallel `forge` workers under one `task` fan-out) hash to
the identical key. Their events interleave into one RunBuf and the
envelope that uploads is a corrupted merge of both trajectories: wrong
step counts, wrong token totals, wrong reward attribution. That poisons
exactly the per-variant fitness signal the agent_version work exists to
produce.

Fix: stamp a process-unique `run_id` on every live trajectory event (a
monotonic counter folded into the recorder, which is already one-per-run)
and key the uploader by (conversation_id, agent_id, run_id). Parent
grouping for /trace is untouched (it reads conversation_id); concurrent
siblings now buffer and upload independently.

`run_id` is Option<String> — None for events reconstructed from the DB
(the upload path only ever sees live broadcast events, which always set
it), so no migration and no DB schema change.

Stacked on #166 (agent_version); together they make the trajectory feed
both variant-attributed and collision-free for DGM-style evolution.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three deterministic tests pinning the producer -> consumer chain:

- concurrent_siblings_mint_and_emit_distinct_run_ids (recorder): two
  recorders sharing the same (conversation, agent) -- what a parallel
  `task` fan-out produces -- mint distinct run_ids shaped
  {conv}-{agent}-r{n}, and those ids ride on every event they emit.
- concurrent_siblings_with_distinct_run_ids_do_not_collide (uploader):
  interleaved sibling events land in two independent buffers, each
  holding only its own steps.
- without_run_id_siblings_collide_into_one_buffer (uploader): the
  identical interleaving with run_id=None collapses into one buffer --
  proving run_id is the load-bearing part of the key, i.e. the exact
  corruption the fix removes.

Also fixes the now-stale module doc (buffer key is
(conversation_id, agent_id, run_id)).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Action required: PR inactive for 5 days.
Status update or closure in 10 days.

@github-actions github-actions Bot added the state: inactive No current action needed/possible; issue fixed, out of scope, or superseded. label Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

state: inactive No current action needed/possible; issue fixed, out of scope, or superseded.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant