🤖 fix: restore parallel sibling tool execution (revert #2906)#3576
🤖 fix: restore parallel sibling tool execution (revert #2906)#3576ethanndickson wants to merge 7 commits into
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 79919ddffb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
79919dd to
5db9e90
Compare
|
Addressed in 5db9e90. The exemption is now per-call and runtime-aware instead of blanket. A new pure helper
Behavioral tests cover the matrix: worktree default/ |
|
@codex review Please take another look. The blanket task exemption is now per-call and runtime-aware (only provably-forked task calls skip the serialization mutex; shared-checkout task calls and all other tools stay serialized). |
5db9e90 to
c4ade72
Compare
|
@codex review Pushed a formatting-only fix (Prettier) on top of the runtime-aware change. No behavioral difference. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4ade72cc8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
c4ade72 to
1505b69
Compare
|
Addressed in 1505b69. You're right that a forked task still touches the parent checkout during Replaced the blunt per-stream mutex with a per-stream read/write lock (
So within a stream: forked task ‖ forked task overlap (both readers), but forked task ‖ bash / file-write are mutually exclusive (read‖write), in either start order — no more concurrent access to the parent checkout. The classifier ( The RW lock is FIFO-fair (a queued writer blocks later readers from jumping ahead, so writers can't starve). Tests cover reader overlap, both-direction read/write exclusion, writer serialization, queued-writer fairness, and double-dispose safety, plus the task/bash interleaving matrix in One intentional conservatism (documented inline): a foreground forked task holds the read lock across its whole execute chain incl. |
|
@codex review Replaced the per-stream mutex with a read/write lock so forked task calls overlap each other (readers) but stay mutually exclusive with mutating tools (writers). Please take another look. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1505b695aa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
#2906 wrapped every tool's execute() in a per-stream AsyncMutex to guard against sibling tool-call races. It had no linked issue or repro, and its own plan notes the real race (bash background output) was already fixed at the runtime boundary via backgroundProcessManager's outputLock. The blanket mutex serialized all foreground sibling tool calls — most visibly parallel 'task' calls, where the lock is held across waitForAgentReport — defeating provider-level parallel tool planning. The genuinely-shared resources are already guarded at the resource level (bash outputLock, TaskService.create()'s own mutex, config FIFO queue; file_read is read-only), so remove the redundant stream-level lock instead of making it concurrency-aware.
1505b69 to
7c44e03
Compare
|
Re: Codex P2 on |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c44e031bf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Removing the per-stream mutex (prior commit) restores parallel sibling tool execution but exposes a pre-existing race the mutex was masking: file_edit_insert's create-on-missing fast path does fileExists -> write OUTSIDE the per-file MutexMap that guards the normal edit pipeline. Two sibling inserts to the same not-yet-existing path both observe 'missing', both create, and the last write silently drops the other insert while both report success. Route the existence-check + create through the same per-file lock (new withFileEditLock helper). The second insert then sees the file exists and merges through the guarded edit path instead of clobbering the first. Audited the other two writers that mutate outside the lock (agent_skill_write, mux_agents_write): both are full-content overwrites whose disk read is diff-only, so concurrent same-path writes resolve identically to sequential last-writer-wins (no silent merge-loss). Left them unchanged. Regression test asserts two concurrent creates of the same missing file yield exactly one success with the winning content intact; verified it fails (2 successes) when the lock is bypassed.
|
@codex review Addressed the create-on-missing race (P2) in 546db28: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 546db280a3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…lock Concurrent sibling mux_config_write calls to the same config file could each read the same baseline, apply disjoint mutations, and have the later atomic write silently drop the earlier one (a lost update). The reverted per-stream mutex used to mask this; now that parallel sibling tools are restored, mux_config_write owns its own concurrency safety. Add withConfigDocumentLock (a per-config-document MutexMap in configReadWrite.ts, keyed by resolved path) and wrap the read -> applyMutations -> write cycle in it, mirroring withFileEditLock for file edits. Regression test fires two concurrent disjoint mutations and asserts both survive.
|
@codex review Addressed the mux_config_write read-modify-write race (P2) in a5f6c39: the read → applyMutations → write cycle now runs under a per-config-document lock ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5f6c39811
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Concurrent sibling bash calls (now that parallel tool execution is restored) could both enter BackgroundProcessManager.spawn, both pass the collision check in generateUniqueProcessId (which only scanned the processes map), and both pick the same processId/output directory because registration happens only after spawnProcess + writeMeta await. The second processes.set then overwrote the first, orphaning a process and pointing bash_output/termination at the wrong target. Reserve the chosen ID synchronously in generateUniqueProcessId (scan both the registered processes and a new reservedIds set, then add to it before returning), so two concurrent allocations can never select the same ID across the awaits. Release the reservation once the process is registered or the spawn/migration fails (spawn via try/finally, registerMigratedProcess inline, and the bash migrate-failure path via releaseReservedId). Keeps spawns fully parallel rather than re-serializing them with a global mutex. Unit test asserts concurrent allocations stay unique; it fails when the reservation check is removed.
|
@codex review Addressed the background process-ID allocation race (P2) in 591c134: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 591c134400
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
mux_config_write's lock only coordinated other mux_config_write calls, so a parallel mux_config_write(config) could still interleave with Config.editConfig (used by TaskService.create() to persist workspace entries to config.json) and clobber it. config.json had two independent serializers — Config.editConfigQueue (instance-scoped) and the tool's per-path MutexMap — that didn't coordinate. The on-disk file is the shared resource, so the lock belongs at the file path, not the Config instance or the tools layer. Add withConfigFileLock (a module-level MutexMap<string> keyed by absolute config-file path) and have both writers acquire it: Config.editConfig now uses it (replacing the bespoke editConfigQueue, with identical FIFO semantics) and withConfigDocumentLock delegates to it. Both layers already derive byte-identical absolute paths, so they serialize on one lock. Test asserts the tool side and a withConfigFileLock holder serialize on the same file (deterministic ordering); it fails if they use separate locks. Existing editConfig serialization tests still pass, confirming the queue replacement.
|
@codex review Addressed the cross-layer config race (P2) in edd89a3: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: edd89a3aa4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ared per-file lock Codex P2: withConfigFileLock's guarantee over-claimed coverage for providers.jsonc and secrets.json. Config.saveProvidersConfig (via ProviderService) and the secrets update methods wrote those files without acquiring the lock, so an agent mux_config_write(providers) racing a UI provider edit (or two concurrent secret writes) could read the same baseline and clobber each other. Make the guarantee true at the read-modify-write callsite level: - Add Config.withProvidersConfigLock; wrap all five ProviderService providers.jsonc mutations (addCustomOpenAICompatibleProvider, removeCustomProvider, setModels, setConfigValue, setConfig) so their load->mutate->save runs under the shared lock. addCustom/setModels become async; callers updated. - Wrap Config.updateGlobalSecrets / updateProjectSecrets in withConfigFileLock(secretsFile). - removeCustomProvider keeps its editConfig repair OUTSIDE the providers lock; the remaining nesting (syncGatewayLifecycle/editConfig under the providers lock) is one-directional and deadlock-free, documented in configFileLocks. - Rewrite the configFileLocks doc to list real acquirers per file, the lock-ordering invariant, and the still-uncovered direct saveConfig/CLI bootstrap paths. Regression tests (with neuter proof) for both the cross-layer providers lock and concurrent secret writes.
|
@codex review Addressed the providers/secrets locking P2: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61ed07a05e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ifecycle syncGatewayLifecycle read routePriority and computed the inserted/removed nextPriority from that snapshot OUTSIDE editConfig's per-file lock, then wrote the precomputed value inside the lock. A concurrent routing edit (e.g. another mux_config_write(config)) that landed between the snapshot and the locked callback was silently clobbered (Codex P2). Move the routePriority read + merge into the editConfig callback so it is based on the locked config value. A cheap outer snapshot is kept only as a fast-path to skip the write (and its change notification) in the steady state — frequent credential/token refreshes on an already-routed gateway — while the authoritative merge happens under the lock. Adds a deterministic regression test that injects a concurrent routePriority edit in the exact TOCTOU window and asserts both the concurrent route and the gateway insert survive.
|
@codex review Addressed the routePriority TOCTOU (P2) in 037f3ae: |
|
Codex Review: Didn't find any major issues. Another round soon, please! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 037f3ae320
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return await withConfigDocumentLock<MuxConfigWriteToolResult>( | ||
| muxHome, | ||
| args.file, | ||
| async () => { |
There was a problem hiding this comment.
Honor aborts before queued config writes mutate
When two mux_config_write calls target the same config file, this new lock queues the later call inside execute, but the tool explicitly ignores the provided abort signal. If the user interrupts the stream while that later write is waiting for the lock, it will still enter this callback and apply its mutation after cancellation; before removing withSequentialExecution, an aborted queued sibling was rejected before any side effect. Please re-check or pass through the abort signal before running the locked read/modify/write so interrupted turns cannot apply stale config changes.
Useful? React with 👍 / 👎.
Summary
Restore intended parallel execution of sibling tool calls within a single stream by reverting #2906, and close the concurrency gaps that revert re-exposes so each shared resource owns its own safety:
file_edit_insert's create-on-missing path, backgroundbashprocess-ID allocation, andconfig.json/providers.jsonc/secrets.jsonwrites (now serialized by a single per-file lock shared across the config tool, theConfigclass, andProviderService). When the model plans multiple tool calls in one assistant turn (most visibly paralleltasksub-agent spawns), all but the first were forced to run sequentially. This change removes the per-stream mutex that caused the regression and makes each mutating path safe on its own.Background
#2906 ("serialize sibling tool execution per stream") wrapped every tool's
execute()in a single shared per-streamAsyncMutex, held across the entire execute body. Provider-levelparallelToolCalls: truestayed on, so the LLM still planned parallel calls, but each wrapped handler blocked on the shared lock — serializing siblings. Fortask, a foreground call holds the lock across its longwaitForAgentReport, so a secondtaskcannot even reachtaskService.create()until the first finishes.That PR had no linked issue and no concrete repro; it was speculative hardening against races on shared mutable state. Its own implementation plan notes that the genuine race (bash background-process output) was already fixed at the runtime boundary by
backgroundProcessManager's per-processoutputLock. The right shape is parallel-by-default at the framework level, with each shared resource owning its own concurrency safety.The audit that justified the revert surfaced the places where that ownership was incomplete — the blanket mutex had been masking each:
file_edit_insert's create-on-missing fast path didfileExists→writeFileStringoutside the per-fileMutexMaplock that the normal edit path (executeFileEditOperation) holds. Two concurrent sibling inserts targeting the same not-yet-existing file could both observe "missing", both create, and the last write would silently drop the other. Insert semantics are accumulative, so both bodies should survive.bashspawns allocate a process ID from thedisplay_name, butBackgroundProcessManager.generateUniqueProcessIdonly scanned the registered process map, and registration happens only afterspawnProcess+writeMetaawaits. Two concurrent spawns sharing a display name could both pass the collision check, pick the same ID and output directory, and the secondprocesses.setwould orphan the first — leavingbash_output/termination pointed at the wrong process.config.jsonhad two independent serializers that didn't coordinate:Config.editConfigQueue(instance-scoped, serializingeditConfigagainst itself) andmux_config_write's own per-path lock (serializing tool calls against themselves). So amux_config_write(config)could still interleave withConfig.editConfig— whichTaskService.create()uses to persist workspace entries — and clobber it. The shared resource is the on-disk file, so the lock belonged at the file path, owned by neither layer.(The other unlocked writers,
agent_skill_writeandmux_agents_write, only read old content to render a diff; their bytes come entirely from tool args, so concurrent execution is indistinguishable from sequential last-writer-wins and needs no lock.file_readis read-only.)Implementation
Revert #2906 in full:
src/node/services/tools/withSequentialExecution.tsand its test.tools: finalToolsinStreamManager.buildStreamRequestConfig(drop the wrapper import and call).StreamManager - sequential tool executiontest that asserted the wrapper was applied.Close the create-on-missing race at the resource level:
withFileEditLock(runtime, resolvedPath, fn)infile_edit_operation.ts, exposing the same per-fileMutexMap(keyed byRuntime) thatexecuteFileEditOperationalready uses. Intentionally non-re-entrant (documented).file_edit_insert's existence-check + create-on-missing logic in it. If the file now exists, it releases and falls through to the normal guarded edit path; otherwise it creates under the lock.Close the background process-ID race without re-serializing spawns:
generateUniqueProcessId(scan both the registeredprocessesmap and a newreservedIdsset, then reserve before returning), so two concurrent allocations can never select the same ID across the spawn/migration awaits.spawnvia try/finally,registerMigratedProcessinline, bash migrate-failure via a newreleaseReservedId). Spawns stay fully parallel — no global mutex.Unify config-file writes under one per-file lock:
withConfigFileLock(absolutePath, fn)insrc/node/utils/concurrency/configFileLocks.ts, a module-levelMutexMap<string>keyed by the config file's absolute path. The on-disk file is the shared resource, so the lock lives at the path — not on theConfiginstance or in the tools layer.Config.editConfignow acquires it (replacing the bespoke instance-scopededitConfigQueue, with identical FIFO semantics), and the config tool'swithConfigDocumentLockdelegates to it. Both layers already derive byte-identical absolute paths, so amux_config_writeand aConfig.editConfigto the same file now serialize on one lock.Extend the per-file lock to every
providers.jsoncandsecrets.jsonwriter (Codex P2):Config.saveProvidersConfig/saveSecretsConfigacquire the lock would be false safety — the read must be inside too. Instead addConfig.withProvidersConfigLock(fn)and route all fiveProviderServiceproviders.jsoncmutations (addCustomOpenAICompatibleProvider,removeCustomProvider,setModels,setConfigValue,setConfig) through it.addCustom/setModelsbecome async; their callers were updated.Config.updateGlobalSecrets/updateProjectSecretsinwithConfigFileLock(secretsFile)(the onlysecrets.jsonwriters).removeCustomProviderkeeps its follow-upeditConfigrepair outside the providers lock. The remaining nesting —setConfig/setConfigValuecallingsyncGatewayLifecycle→editConfigwhile holding the providers lock — is one-directional (nothing acquires the providers lock while holding the config lock, sinceeditConfig's transform is pure), so there is no deadlock cycle. The invariant is documented inconfigFileLocks.configFileLocksdoc now lists the real acquirers per file and calls out the still-uncovered paths: directConfig.saveConfigcallers outsideeditConfig(some project/workspace service flows) and CLI bootstrap writers, which are single-writer/startup paths rather than agent-parallel tool calls and are left as follow-up.Each path has a regression test that fails when its guard is removed.
Recompute
routePriorityinside the config lock (Codex P2):syncGatewayLifecyclereadroutePriorityand computed the inserted/removednextPriorityfrom that snapshot outsideeditConfig's per-file lock, then wrote the precomputed value inside the lock. A concurrent routing edit (e.g. anothermux_config_write(config)) that landed between the snapshot and the locked callback was silently clobbered.editConfigcallback, based on the locked config value. A cheap outer snapshot is kept only as a fast-path to skip the write (and its change notification) in the steady state — frequent credential/token refreshes on an already-routed gateway — sinceeditConfigalways saves and notifies with no short-circuit.routePriorityedit in the exact TOCTOU window (via a one-shoteditConfigspy) and asserts both the concurrent route and the gateway insert survive; it fails when the merge reads the stale snapshot.Risks
This returns to pre-#2906 scheduling, where sibling
execute()handlers may overlap. The concurrency-sensitive resources are now individually guarded (see Background), so the realistic regression surface is narrow. The remaining interaction the deleted lock nominally covered — a shared-checkouttaskcall (local runtime, or worktree/SSHisolation:"none") overlapping a mutating sibling duringcreate()'s brief parent-checkout read — was unguarded before #2906, has no observed failure, and was only weakly covered by the mutex anyway (background sub-agents do their real work in separate streams outside the lock). If a concrete race is ever demonstrated there, the correct fix is a narrow lock aroundcreate()'s checkout-read critical section, not a blanket stream-level mutex.Generated with
mux• Model:anthropic:claude-opus-4-8• Thinking:xhigh• Cost:$29.64