fix(ppo): exclude no-eos rows from reward normalization by haoyang9804 · Pull Request #1351 · areal-project/AReaL

haoyang9804 · 2026-05-19T05:48:35Z

Summary

When mask_no_eos_with_zero=True, AReaL suppresses task rewards for generations
that fill the maximum response length without emitting EOS. However,
PPOActor._compute_advantages normalizes scalar rewards before applying that
no-EOS mask. A no-EOS row with an outlier raw reward can therefore shift the
normalized reward and advantage of a valid EOS row in the same batch, even
though the no-EOS row's own task reward is later zeroed.

This patch computes the no-EOS mask before reward normalization and passes it to
reward_norm, so rows that will not receive task reward also do not contribute to
reward normalization statistics.

Concrete Minimal Example

Configuration:

reward_norm = NormConfig(mean_level="batch", std_level=None)
mask_no_eos_with_zero = True
kl_ctl = 0.0

Batch:

attention_mask = torch.tensor([
    [1, 1, 0],  # valid EOS/padded row
    [1, 1, 1],  # no-EOS row: filled max sequence length
])
loss_mask = torch.tensor([
    [0, 1, 0],
    [0, 1, 1],
])

baseline_rewards = torch.tensor([1.0, 1.0])
outlier_rewards = torch.tensor([1.0, 100.0])
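
The no-EOS row in this batch can be identified the same way the patch's mask does: a row whose attention mask sums to the full padded width has no padding left, so it is treated as having filled the maximum length. A minimal pure-Python sketch of that check follows; the helper name no_eos_rows is hypothetical, and plain lists stand in for tensors.

```python
def no_eos_rows(attention_mask):
    """Return one bool per row: True when the row has no padding at all.

    Mirrors `seqlens == attn_mask.shape[1]` from the PR, using plain lists.
    """
    width = len(attention_mask[0])
    return [sum(row) == width for row in attention_mask]


attention_mask = [
    [1, 1, 0],  # padded row, treated as a valid EOS row
    [1, 1, 1],  # unpadded row, treated as no-EOS
]
print(no_eos_rows(attention_mask))  # [False, True]
```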

Expected invariant:

Changing only the no-EOS row's task reward must not change advantages for the valid
EOS row, because mask_no_eos_with_zero suppresses that no-EOS task reward.

Before this patch, the no-EOS outlier changes the valid row:

{
  "observed_advantage_diff": [[-49.5, 0.0, 0.0], [0.0, 0.0, 0.0]],
  "quiet_training_signal_bug": true
}

After this patch, the invariant holds:

{
  "observed_advantage_diff": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
  "quiet_training_signal_bug": false
}

Root Cause

The old order was:

reward_score = self.reward_norm(reward_score)
...
seq_no_eos_mask = seqlens == attn_mask.shape[1]
...
rewards[batch_indices, indices] += torch.where(seq_no_eos_mask, 0, reward_score)

So a no-EOS row was excluded from task reward assignment, but not from
normalization statistics. With batch mean normalization, [1.0, 100.0] has mean
50.5 and became [-49.5, 49.5]; the valid EOS row then received -49.5 instead of
the 0.0 it receives in the clean [1.0, 1.0] case, which normalizes to [0.0, 0.0].
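
The arithmetic of the old order can be reproduced without AReaL. The sketch below is a simplified stand-in, not the PPOActor code: mean_center and zero_no_eos are hypothetical helpers that model mean-only batch normalization followed by the no-EOS zeroing.

```python
def mean_center(rewards):
    """Subtract the batch mean from every reward (mean-only normalization)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def zero_no_eos(rewards, no_eos):
    """Zero the task reward of no-EOS rows, as mask_no_eos_with_zero does."""
    return [0.0 if skip else r for r, skip in zip(rewards, no_eos)]


no_eos = [False, True]  # row 1 filled the maximum length

# Old order: normalize over all rows first, then zero the no-EOS row.
clean = zero_no_eos(mean_center([1.0, 1.0]), no_eos)
leaky = zero_no_eos(mean_center([1.0, 100.0]), no_eos)

print(clean)  # [0.0, 0.0]
print(leaky)  # [-49.5, 0.0]
```

The no-EOS row ends at 0.0 in both cases, but its raw reward has already moved the batch mean, so the valid row differs by -49.5.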

Fix

The patch computes seq_no_eos_mask before reward normalization. When
mask_no_eos_with_zero=True, it passes ~seq_no_eos_mask as the normalization
mask:

attn_mask = data["attention_mask"]
seqlens = attn_mask.sum(-1).long()
seq_no_eos_mask = seqlens == attn_mask.shape[1]
...
if self.reward_norm:
    reward_norm_mask = None
    if self.mask_no_eos_with_zero:
        reward_norm_mask = (~seq_no_eos_mask).to(dtype=reward_score.dtype)
    reward_score = self.reward_norm(reward_score, reward_norm_mask)

This preserves existing behavior when mask_no_eos_with_zero=False.
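
The effect of passing a mask can be sketched in plain Python. This is an illustration only: masked_mean_center is a hypothetical helper, and how AReaL's Normalization treats masked entries in its output is not shown in this PR. Here masked rows are simply centered with the valid-row mean, which is harmless because their task reward is zeroed afterwards.

```python
def masked_mean_center(rewards, mask=None):
    """Mean-center rewards, computing the mean only over rows where mask is 1.

    With mask=None every row contributes, matching the unmasked behavior.
    """
    if mask is None:
        mask = [1.0] * len(rewards)
    count = sum(mask)
    if count == 0:
        # Assumption for this sketch: leave rewards untouched if no row is valid.
        return list(rewards)
    mean = sum(r * m for r, m in zip(rewards, mask)) / count
    return [r - mean for r in rewards]


no_eos = [False, True]
# Equivalent of (~seq_no_eos_mask).to(dtype=reward_score.dtype)
norm_mask = [0.0 if skip else 1.0 for skip in no_eos]

print(masked_mean_center([1.0, 1.0], norm_mask))    # [0.0, 0.0]
print(masked_mean_center([1.0, 100.0], norm_mask))  # [0.0, 99.0]
print(masked_mean_center([1.0, 100.0]))             # [-49.5, 49.5]
```

With the mask, the valid row normalizes to 0.0 regardless of the no-EOS reward; without it, the result matches the unmasked normalization.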

Validation Recipe

{
  "bug_id": "AREAL-REWARD-NORM-NO-EOS-LEAK",
  "validation_mode": "actual_areal_ppo_actor_compute_advantages_boundary_hook",
  "hooked_boundary": "areal.trainer.ppo.actor.PPOActor._compute_advantages",
  "constructed_scenario": {
    "reward_norm": {"mean_level": "batch", "std_level": null},
    "mask_no_eos_with_zero": true,
    "baseline_rewards": [1.0, 1.0],
    "outlier_rewards": [1.0, 100.0],
    "attention_mask": [[1, 1, 0], [1, 1, 1]],
    "loss_mask_before_roll": [[0, 1, 0], [0, 1, 1]],
    "row_1_is_no_eos": true
  },
  "expected_invariant": "A no-EOS row whose task reward is zeroed must not affect reward normalization or valid EOS-row advantages."
}

Runner script:

#!/usr/bin/env bash
set -euo pipefail

TARGET_REPO="${TARGET_REPO:-$(pwd)}"
OUTPUT_JSON="${OUTPUT_JSON:-validation-output.json}"

PYTHONPATH="${TARGET_REPO}:${PYTHONPATH:-}" \
TARGET_REPO="${TARGET_REPO}" \
OUTPUT_JSON="${OUTPUT_JSON}" \
python run_validation.py

Validation hook:

import importlib.machinery
import json
import os
import subprocess
import sys
import types
from pathlib import Path
from types import SimpleNamespace

import torch


def install_areal_import_shims():
    """Stub optional or missing dependencies so importing areal succeeds in a minimal environment."""
    try:
        import torch.distributed.checkpoint.staging as staging

        staging.DefaultStager = getattr(
            staging, "DefaultStager", type("DefaultStager", (), {})
        )
        staging.StagingOptions = getattr(
            staging, "StagingOptions", type("StagingOptions", (), {})
        )
    except Exception:
        pass

    try:
        import torch.distributed.checkpoint.state_dict_saver as saver

        saver.AsyncSaveResponse = getattr(
            saver, "AsyncSaveResponse", type("AsyncSaveResponse", (), {})
        )
    except Exception:
        pass

    for name in ("swanlab", "trackio", "tabulate"):
        if name not in sys.modules:
            module = types.ModuleType(name)
            module.__spec__ = importlib.machinery.ModuleSpec(name, None)
            if name == "tabulate":
                module.tabulate = lambda *args, **kwargs: ""
            sys.modules[name] = module

    if "tensorboardX" not in sys.modules:
        module = types.ModuleType("tensorboardX")
        module.__spec__ = importlib.machinery.ModuleSpec("tensorboardX", None)
        module.SummaryWriter = type(
            "SummaryWriter", (), {"__init__": lambda self, *a, **k: None}
        )
        sys.modules["tensorboardX"] = module


sys.path.insert(0, os.environ["TARGET_REPO"])
install_areal_import_shims()

from areal.api.cli_args import NormConfig
from areal.trainer.ppo.actor import PPOActor
from areal.utils.data import KLEstimator, Normalization

# Build the actor without running __init__, then set by hand the attributes
# this scenario needs.
actor = PPOActor.__new__(PPOActor)
actor.config = SimpleNamespace(
    overlong_reward_penalty=False,
    use_decoupled_loss=False,
    recompute_logprob=False,
    mask_no_eos_with_zero=True,
)
actor.reward_bias = 0.0
actor.reward_scaling = 1.0
actor.reward_clip = 1000.0
actor.reward_norm = Normalization(NormConfig(mean_level="batch", std_level=None))
actor.adv_norm = None
actor.kl_ctl = 0.0
actor.kl_estimator = KLEstimator("k1")
actor.discount = 1.0
actor.gae_lambda = 1.0
actor.mask_no_eos_with_zero = True


def run_case(rewards):
    """Run _compute_advantages on the two-row batch with the given per-row rewards."""
    data = {
        "input_ids": torch.tensor([[1, 2, 0], [1, 2, 3]]),
        "attention_mask": torch.tensor([[1, 1, 0], [1, 1, 1]], dtype=torch.long),
        "loss_mask": torch.tensor([[0, 1, 0], [0, 1, 1]], dtype=torch.bool),
        "rewards": torch.tensor(rewards, dtype=torch.float32),
        "logprobs": torch.zeros(2, 3),
    }
    out = actor._compute_advantages({key: value.clone() for key, value in data.items()})
    return {
        "tot_rewards": out["tot_rewards"].tolist(),
        "advantages": out["advantages"].tolist(),
        "loss_mask": out["loss_mask"].tolist(),
    }


baseline = run_case([1.0, 1.0])
with_no_eos_outlier = run_case([1.0, 100.0])
diff = (
    torch.tensor(with_no_eos_outlier["advantages"])
    - torch.tensor(baseline["advantages"])
).tolist()

payload = {
    "kind": "rl_sentinel_training_signal_validation",
    "candidate_id": "AREAL-REWARD-NORM-NO-EOS-LEAK",
    "target": "areal",
    "target_commit": subprocess.check_output(
        ["git", "-C", os.environ["TARGET_REPO"], "rev-parse", "HEAD"], text=True
    ).strip(),
    "boundary": "areal.trainer.ppo.actor.PPOActor._compute_advantages",
    "observed_baseline": baseline,
    "observed_with_no_eos_outlier": with_no_eos_outlier,
    "observed_advantage_diff": diff,
    "quiet_training_signal_bug": bool(abs(diff[0][0]) > 1e-6),
}

Path(os.environ["OUTPUT_JSON"]).write_text(
    json.dumps(payload, indent=2, sort_keys=True) + "\n"
)
print(json.dumps(payload, indent=2, sort_keys=True))

Validation

ruff check areal/trainer/ppo/actor.py tests/test_ppo_actor_reward_norm_no_eos.py
ruff format --check areal/trainer/ppo/actor.py tests/test_ppo_actor_reward_norm_no_eos.py
pytest -q tests/test_ppo_actor_reward_norm_no_eos.py
pre-commit run --files areal/trainer/ppo/actor.py tests/test_ppo_actor_reward_norm_no_eos.py

Results:

All checks passed!
2 files already formatted
1 passed, 1 warning
pre-commit run --files ... passed

Hook output on unpatched main:

{
  "observed_advantage_diff": [[-49.5, 0.0, 0.0], [0.0, 0.0, 0.0]],
  "quiet_training_signal_bug": true
}

Hook output on this branch:

{
  "observed_advantage_diff": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
  "quiet_training_signal_bug": false
}

Duplicate Check

BUG_FINDINGS.md: no existing AReaL entry for no-EOS rows contaminating
reward normalization before mask_no_eos_with_zero.
Local myAReal branches: checked reward, norm, and eos branch names; existing
fix/norm-mask-invalid-values, fix/gspo-2d-masked-advantages, and
fix/ppo-adv-mask-invalid-logprobs cover different paths.
Local PR drafts: checked existing AReaL drafts; no duplicate covering no-EOS
reward normalization.
Upstream areal-project/AReaL PRs/issues: searched
"reward_norm mask_no_eos_with_zero no EOS" and
"PPOActor reward_norm no eos advantages"; no duplicate issue or PR found.
Upstream code search: only current actor.py/docs/config references to
mask_no_eos_with_zero and reward_norm appeared; no existing fix path was
found.

gemini-code-assist

Code Review

This pull request updates the PPO actor to support masking sequences without an EOS token during reward normalization and includes a new test case for this functionality. A review comment suggests using the max_seqlen variable instead of attn_mask.shape[1] for better consistency and readability within the _compute_advantages method.

fix(ppo): exclude no-eos rows from reward norm

c1bd33f

haoyang9804 requested review from fishcrap, garrett4wade, rchardx and sitabulaixizawaluduo as code owners May 19, 2026 05:48

gemini-code-assist Bot reviewed May 19, 2026

Comment thread areal/trainer/ppo/actor.py Outdated

fix(ppo): use max seqlen for no-eos mask

cdc1c47

haoyang9804 wants to merge 2 commits into areal-project:main from haoyang9804:fix/reward-norm-no-eos-mask
