A very big concern about your paper

Hi authors,

Thanks for open-sourcing Dr. MAS! The system engineering part (integrating Ray and SGLang for heterogeneous MAS deployment) is very impressive and highly valuable to the community. 

However, while reading the algorithmic derivation in Section 4, I noticed that the proposed agent-wise normalization (Eq. 5) seems to conflate two distinct mechanisms. I would love to hear your thoughts on this.

### 1. The Confounding of Two "Bugs" in Vanilla GRPO
When moving from Vanilla GRPO to Dr. MAS, the formulation actually introduces **two independent changes** simultaneously, not just one:

*   **Mechanism A (Agent Identity):** Calculating statistics separately for each agent $k$.
*   **Mechanism B (Action Frequency/Step-level Weighting):** Shifting the denominator from the number of **trajectories** ($N$, as in Eq. 2) to the number of **action steps** ($|Y_k|$, as in Eq. 5). 

In asynchronous tasks like Search Orchestration (where the Search Agent is conditionally triggered and produces outputs of varying length), Vanilla GRPO suffers because it computes its statistics globally, per *trajectory*. Dr. MAS therefore fixes both the "identity mismatch" (Mechanism A) and the "frequency mismatch" (Mechanism B) at the same time, which makes it hard to tell which of the two is responsible for the improvement.
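To make Mechanism B concrete, here is a minimal sketch of how the two denominators can disagree for a single agent. It assumes (my reading, not verified against the released code) that every action step inherits the scalar reward of its trajectory; the rewards and step counts are made-up toy numbers.

```python
import numpy as np

# Hypothetical toy rollout: N = 4 trajectories, one shared scalar reward each.
traj_rewards = np.array([1.0, 0.0, 0.0, 0.0])

# Hypothetical number of times the search agent acted in each trajectory.
search_steps = np.array([3, 1, 0, 0])

# Trajectory-level mean: denominator is N (my reading of Eq. 2).
traj_mean = traj_rewards.mean()

# Step-level mean for the search agent: denominator is |Y_k| (my reading of Eq. 5).
# Each action step is assumed to inherit its trajectory's reward.
step_rewards = np.repeat(traj_rewards, search_steps)
step_mean = step_rewards.mean()

print(traj_mean, step_mean)  # 0.25 0.75
```

The baseline moves from 0.25 to 0.75 purely because of how often the agent acted, before agent identity is considered at all.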

### 2. The Missing Key Ablation: Global Step-Level Normalization
This leads to a critical question: **Is the performance gain of Dr. MAS coming from the Multi-Agent heterogeneity (Mechanism A), or is it simply coming from fixing the step-frequency weighting (Mechanism B)?**

To show that the gain truly comes from "agent-wise" separation, the ablation study (Table 3) seems to be missing a crucial baseline:
**Applying Mechanism B, but NOT Mechanism A.**

Specifically, what if we compute a **Global Step-Level Baseline**? That is, we pool all agents together, but compute the mean/std over the **total action steps** across the entire rollout rather than over trajectories.

If this *Global Step-Level Normalization* yields a gain similar to Dr. MAS, it would imply that the instability of Vanilla GRPO is merely a statistical artifact of the token/step frequency mismatch, rather than a fundamental flaw in Multi-Agent RL coordination. Do you have any experimental insights on this specific configuration?
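For clarity, this is roughly the comparison I have in mind, written as a small sketch. The function name `normalize`, the `mode` strings, and the toy data are all hypothetical, and I am again assuming each action step carries its trajectory's reward; this is not taken from the Dr. MAS codebase.

```python
import numpy as np

def normalize(step_rewards, agent_ids, mode, eps=1e-8):
    """Per-step advantages under a (hypothetical) normalization mode."""
    step_rewards = np.asarray(step_rewards, dtype=float)
    agent_ids = np.asarray(agent_ids)
    if mode == "global_step":
        # Mechanism B only: statistics over all action steps, agents pooled.
        return (step_rewards - step_rewards.mean()) / (step_rewards.std() + eps)
    if mode == "agent_wise":
        # Mechanisms A + B: statistics over each agent's own action steps.
        adv = np.empty_like(step_rewards)
        for k in np.unique(agent_ids):
            mask = agent_ids == k
            r = step_rewards[mask]
            adv[mask] = (r - r.mean()) / (r.std() + eps)
        return adv
    raise ValueError(f"unknown mode: {mode}")

# Toy data: agent 0 acts once in each of 4 trajectories,
# agent 1 acts 3, 1, 0, 0 times; trajectory rewards are [1, 0, 0, 0].
step_rewards = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
agent_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])

global_adv = normalize(step_rewards, agent_ids, "global_step")
agent_adv = normalize(step_rewards, agent_ids, "agent_wise")
```

Running the `global_step` variant as an extra row in the ablation would isolate Mechanism B from Mechanism A.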

### 3. The Degeneration Case: Symmetrical/Synchronous MAS
Furthermore, following the math in Eq. 5, consider a strictly **Synchronous MAS** task (e.g., a fixed 2-turn debate where Agent A and Agent B *always* output exactly once per trajectory).

In this scenario, $|Y_A| = |Y_B| = N$. If every action step inherits its trajectory's reward, each agent's step set then contains exactly one copy of each trajectory reward, so the agent-wise mean $\mu_k$ equals the global trajectory mean $\mu$ (and $\sigma_k = \sigma$). **Mathematically, Dr. MAS would reduce exactly to Vanilla GRPO.**
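A quick numerical check of this claim, as a sketch under the same assumption that each action step inherits its trajectory's shared reward. The helper `agent_stats` and the toy numbers are hypothetical and only illustrate the argument.

```python
import numpy as np

def agent_stats(traj_rewards, step_counts):
    """Mean/std over one agent's action steps, each inheriting its trajectory reward."""
    steps = np.repeat(traj_rewards, step_counts)
    return steps.mean(), steps.std()

# Hypothetical rewards for N = 6 trajectories.
traj_rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
mu, sigma = traj_rewards.mean(), traj_rewards.std()

# Synchronous case: the agent acts exactly once per trajectory, so |Y_k| = N.
sync_mu, sync_sigma = agent_stats(traj_rewards, np.ones(6, dtype=int))
sync_matches = bool(np.isclose(sync_mu, mu) and np.isclose(sync_sigma, sigma))

# Asynchronous case: the agent acts a varying number of times per trajectory.
async_mu, _ = agent_stats(traj_rewards, np.array([3, 1, 0, 0, 2, 1]))
async_matches = bool(np.isclose(async_mu, mu))

print(sync_matches, async_matches)  # True False
```

If per-agent rewards differ within a trajectory, the equality above no longer has to hold, so the degeneration argument depends on the shared-reward setting.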

Does this mean Dr. MAS does not actually improve stability for synchronous multi-agent systems, and its efficacy is strictly limited to asynchronous/conditional routing workflows (like the Search pipeline)?

Looking forward to your insights! Thanks again for the great system contribution.
