Hi authors,
Thanks for open-sourcing Dr. MAS! The system engineering part (integrating Ray and SGLang for heterogeneous MAS deployment) is very impressive and highly valuable to the community.
However, while reading the algorithmic derivation in Section 4, I noticed an interesting mathematical property regarding the proposed agent-wise normalization (Eq. 5), which seems to conflate two distinct mechanisms. I would love to hear your thoughts on this.
1. The Confounding of Two "Bugs" in Vanilla GRPO
When moving from Vanilla GRPO to Dr. MAS, the formulation actually introduces two independent changes simultaneously, not just one:
- Mechanism A (Agent Identity): Calculating statistics separately for each agent $k$.
- Mechanism B (Action Frequency/Step-level Weighting): Shifting the denominator from the number of trajectories ($N$, as in Eq. 2) to the number of action steps ($|Y_k|$, as in Eq. 5).
In asynchronous tasks such as Search Orchestration (where the Search Agent is triggered conditionally and produces outputs of varying length), Vanilla GRPO suffers because it averages globally over trajectories. Dr. MAS therefore fixes both the "Identity mismatch" (Mechanism A) and the "Frequency mismatch" (Mechanism B) at the same time.
2. The Missing Key Ablation: Global Step-Level Normalization
This leads to a critical question: Is the performance gain of Dr. MAS coming from the Multi-Agent heterogeneity (Mechanism A), or is it simply coming from fixing the step-frequency weighting (Mechanism B)?
To prove that the gain is truly from "Agent-wise" separation, there seems to be a crucial missing baseline in the ablation study (Table 3):
Applying Mechanism B, but NOT Mechanism A.
Specifically, what if we calculate a Global Step-Level Baseline?
We mix all agents together, but we calculate the mean/std over the total action steps across the entire rollout, rather than the trajectories.
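To make the three configurations concrete, here is a minimal sketch of how I understand them. It is not taken from the Dr. MAS codebase: the data layout, the agent names, and the function names are all hypothetical, and I am assuming that every action step inherits the scalar reward of its trajectory.

```python
from statistics import fmean, pstdev

# Hypothetical rollout group (assumption, not from the paper): each trajectory
# has one scalar reward and a list naming the agent behind each action step.
trajectories = [
    {"reward": 1.0, "steps": ["planner", "search", "search", "search"]},
    {"reward": 0.0, "steps": ["planner"]},
    {"reward": 1.0, "steps": ["planner", "search"]},
    {"reward": 0.0, "steps": ["planner"]},
]


def mean_std(values):
    """Return (mean, population std) of a list of rewards."""
    return fmean(values), pstdev(values)


def trajectory_level_stats(trajs):
    """Vanilla GRPO as read in this issue: one sample per trajectory."""
    return mean_std([t["reward"] for t in trajs])


def global_step_level_stats(trajs):
    """Mechanism B only: one sample per action step, all agents pooled."""
    return mean_std([t["reward"] for t in trajs for _ in t["steps"]])


def agent_wise_stats(trajs):
    """Mechanisms A and B: one sample per action step, grouped by agent."""
    per_agent = {}
    for t in trajs:
        for agent in t["steps"]:
            per_agent.setdefault(agent, []).append(t["reward"])
    return {agent: mean_std(rewards) for agent, rewards in per_agent.items()}


traj_stats = trajectory_level_stats(trajectories)
step_stats = global_step_level_stats(trajectories)
agent_stats = agent_wise_stats(trajectories)
```

On this toy group the trajectory-level mean is 0.5, the pooled step-level mean is 0.75, and the two agents get means of 0.5 (planner) and 1.0 (search), so all three baselines differ. The middle one is the ablation I am asking about.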
If this Global Step-Level Normalization yields a performance surge similar to Dr. MAS, it would imply that the instability of Vanilla GRPO is merely a statistical artifact of token/step frequency mismatch, rather than a fundamental flaw in Multi-Agent RL coordination. Do you have any experimental insights on this specific configuration?
3. The Degeneration Case: Symmetrical/Synchronous MAS
Furthermore, following the math in Eq. 5, consider a strictly synchronous MAS task (e.g., a fixed 2-turn debate where Agent A and Agent B each output exactly once per trajectory).
In this scenario, $|Y_A| = |Y_B| = N$. If every step inherits its trajectory's reward, each agent's step set contains exactly the $N$ trajectory rewards, so the agent-wise mean $\mu_k$ equals the global trajectory mean $\mu$ (and $\sigma_k = \sigma$). Mathematically, Dr. MAS would then reduce exactly to Vanilla GRPO.
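A quick numerical check of this degenerate case, again only as a sketch under my own assumptions: the reward values are made up, the agent labels are placeholders, and each step is assumed to inherit its trajectory's reward.

```python
from statistics import fmean, pstdev

# Hypothetical trajectory rewards for a group of N = 5 rollouts.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
agents = ["A", "B"]

# Synchronous setting: every agent acts exactly once in every trajectory,
# and each step carries the reward of the trajectory it belongs to.
steps = [(agent, r) for r in rewards for agent in agents]

# Trajectory-level statistics (one sample per trajectory).
global_stats = (fmean(rewards), pstdev(rewards))

# Agent-wise statistics (one sample per step, grouped by agent).
agent_stats = {}
for a in agents:
    own = [r for who, r in steps if who == a]
    agent_stats[a] = (fmean(own), pstdev(own))

# Pooled step-level statistics (one sample per step, all agents mixed).
pooled = [r for _, r in steps]
pooled_stats = (fmean(pooled), pstdev(pooled))
```

Here `agent_stats["A"]` and `agent_stats["B"]` are identical to `global_stats`, and the pooled step-level statistics coincide with them as well, since every trajectory reward simply appears once per agent.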
Does this mean Dr. MAS does not actually improve stability for synchronous multi-agent systems, and its efficacy is strictly limited to asynchronous/conditional routing workflows (like the Search pipeline)?
Looking forward to your insights! Thanks again for the great system contribution.