fix(ppo): exclude no-eos rows from reward normalization #1351

Open
haoyang9804 wants to merge 2 commits into areal-project:main from haoyang9804:fix/reward-norm-no-eos-mask

@haoyang9804

Summary

When mask_no_eos_with_zero=True, AReaL zeroes the task reward for generations
that fill the full response length without emitting EOS. However,
PPOActor._compute_advantages normalizes the scalar rewards before applying that
no-EOS mask. A no-EOS row with an outlier raw reward can therefore shift the
normalized reward, and hence the advantage, of a valid EOS row, even though the
no-EOS row's own task reward is later zeroed.

This patch computes the no-EOS mask before reward normalization and passes it to
reward_norm, so rows that will not receive task reward also do not contribute to
reward normalization statistics.

Concrete Minimal Example

Configuration:

reward_norm = NormConfig(mean_level="batch", std_level=None)
mask_no_eos_with_zero = True
kl_ctl = 0.0

Batch:

attention_mask = torch.tensor([
    [1, 1, 0],  # valid EOS/padded row
    [1, 1, 1],  # no-EOS row: filled max sequence length
])
loss_mask = torch.tensor([
    [0, 1, 0],
    [0, 1, 1],
])

baseline_rewards = torch.tensor([1.0, 1.0])
outlier_rewards = torch.tensor([1.0, 100.0])
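
As a sanity check on the batch above, the no-EOS flag can be derived by hand. This is a plain-Python sketch of the `seqlens == attn_mask.shape[1]` comparison used in `_compute_advantages`; the helper name `no_eos_rows` is illustrative and not part of AReaL.

```python
def no_eos_rows(attention_mask):
    """Flag rows whose attended length equals the padded width of the batch."""
    max_len = len(attention_mask[0])
    return [sum(row) == max_len for row in attention_mask]


attention_mask = [
    [1, 1, 0],  # valid EOS/padded row
    [1, 1, 1],  # no-EOS row: filled max sequence length
]
flags = no_eos_rows(attention_mask)
print(flags)  # [False, True]
```

Only row 1 is flagged, so only row 1 should be excluded from the normalization statistics.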

Expected invariant:

Changing only the no-EOS row's task reward must not change advantages for the valid
EOS row, because mask_no_eos_with_zero suppresses that no-EOS task reward.

Before this patch, the no-EOS outlier changes the valid row:

{
  "observed_advantage_diff": [[-49.5, 0.0, 0.0], [0.0, 0.0, 0.0]],
  "quiet_training_signal_bug": true
}

After this patch, the invariant holds:

{
  "observed_advantage_diff": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
  "quiet_training_signal_bug": false
}
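
The invariant can also be checked mechanically from the hook's JSON output. The sketch below is illustrative only: `valid_rows_unchanged` is a hypothetical helper, and the tolerance of 1e-6 simply mirrors the threshold used in the validation hook.

```python
import json


def valid_rows_unchanged(advantage_diff, no_eos_flags, tol=1e-6):
    """Return True if every row NOT flagged no-EOS has an all-zero advantage diff."""
    return all(
        abs(value) <= tol
        for row, is_no_eos in zip(advantage_diff, no_eos_flags)
        if not is_no_eos
        for value in row
    )


before = json.loads('{"observed_advantage_diff": [[-49.5, 0.0, 0.0], [0.0, 0.0, 0.0]]}')
after = json.loads('{"observed_advantage_diff": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]}')
no_eos_flags = [False, True]  # row 1 filled the full sequence length

before_ok = valid_rows_unchanged(before["observed_advantage_diff"], no_eos_flags)
after_ok = valid_rows_unchanged(after["observed_advantage_diff"], no_eos_flags)
print(before_ok, after_ok)  # False True
```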

Root Cause

The old order was:

reward_score = self.reward_norm(reward_score)
...
seq_no_eos_mask = seqlens == attn_mask.shape[1]
...
rewards[batch_indices, indices] += torch.where(seq_no_eos_mask, 0, reward_score)

So a no-EOS row was excluded from task reward assignment, but not from
normalization statistics. With batch mean normalization, [1.0, 100.0] became
[-49.5, 49.5]; the valid EOS row then received -49.5 instead of matching the
clean [1.0, 1.0] case.
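
The arithmetic is easy to reproduce. The following plain-Python sketch mimics batch-level mean centering without a mask (no std scaling, matching `std_level=None`); `center_by_batch_mean` is a hypothetical stand-in, not AReaL's `Normalization` class.

```python
def center_by_batch_mean(rewards):
    """Subtract the mean over ALL rows, as the old order effectively did."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


baseline = center_by_batch_mean([1.0, 1.0])
outlier = center_by_batch_mean([1.0, 100.0])
leak = outlier[0] - baseline[0]
print(baseline, outlier, leak)  # [0.0, 0.0] [-49.5, 49.5] -49.5
```

The valid row's normalized reward moves from 0.0 to -49.5 purely because of a row whose task reward is later zeroed.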

Fix

The patch computes seq_no_eos_mask before reward normalization. When
mask_no_eos_with_zero=True, it passes ~seq_no_eos_mask as the normalization
mask:

attn_mask = data["attention_mask"]
seqlens = attn_mask.sum(-1).long()
seq_no_eos_mask = seqlens == attn_mask.shape[1]
...
if self.reward_norm:
    reward_norm_mask = None
    if self.mask_no_eos_with_zero:
        reward_norm_mask = (~seq_no_eos_mask).to(dtype=reward_score.dtype)
    reward_score = self.reward_norm(reward_score, reward_norm_mask)

This preserves existing behavior when mask_no_eos_with_zero=False.
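
To illustrate the effect of the mask, here is a minimal sketch of masked mean centering in plain Python. It assumes the mask simply restricts which rows enter the mean; the actual masking semantics of AReaL's `Normalization` may differ in detail, and `center_by_masked_mean` is a hypothetical name.

```python
def center_by_masked_mean(rewards, mask=None):
    """Subtract a mean computed only over rows where mask is 1 (all rows if mask is None)."""
    if mask is None:
        mask = [1.0] * len(rewards)
    count = sum(mask)
    # Guard against an all-masked batch; this sketch falls back to a zero mean.
    mean = sum(r * m for r, m in zip(rewards, mask)) / count if count > 0 else 0.0
    return [r - mean for r in rewards]


no_eos = [False, True]
norm_mask = [0.0 if flag else 1.0 for flag in no_eos]  # analogue of ~seq_no_eos_mask

masked_baseline = center_by_masked_mean([1.0, 1.0], norm_mask)
masked_outlier = center_by_masked_mean([1.0, 100.0], norm_mask)
unmasked_outlier = center_by_masked_mean([1.0, 100.0])  # mask=None, as when mask_no_eos_with_zero=False
print(masked_baseline[0], masked_outlier[0], unmasked_outlier[0])  # 0.0 0.0 -49.5
```

With the mask, the valid row is centered identically in both cases. Whatever value the no-EOS row receives from normalization does not matter, because the assignment `torch.where(seq_no_eos_mask, 0, reward_score)` replaces it with 0.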

Validation Recipe

{
  "bug_id": "AREAL-REWARD-NORM-NO-EOS-LEAK",
  "validation_mode": "actual_areal_ppo_actor_compute_advantages_boundary_hook",
  "hooked_boundary": "areal.trainer.ppo.actor.PPOActor._compute_advantages",
  "constructed_scenario": {
    "reward_norm": {"mean_level": "batch", "std_level": null},
    "mask_no_eos_with_zero": true,
    "baseline_rewards": [1.0, 1.0],
    "outlier_rewards": [1.0, 100.0],
    "attention_mask": [[1, 1, 0], [1, 1, 1]],
    "loss_mask_before_roll": [[0, 1, 0], [0, 1, 1]],
    "row_1_is_no_eos": true
  },
  "expected_invariant": "A no-EOS row whose task reward is zeroed must not affect reward normalization or valid EOS-row advantages."
}

Runner script:

#!/usr/bin/env bash
set -euo pipefail

TARGET_REPO="${TARGET_REPO:-$(pwd)}"
OUTPUT_JSON="${OUTPUT_JSON:-validation-output.json}"

PYTHONPATH="${TARGET_REPO}:${PYTHONPATH:-}" \
TARGET_REPO="${TARGET_REPO}" \
OUTPUT_JSON="${OUTPUT_JSON}" \
python run_validation.py

Validation hook:

import importlib.machinery
import json
import os
import subprocess
import sys
import types
from pathlib import Path
from types import SimpleNamespace

import torch


def install_areal_import_shims():
    """Stub optional dependencies so that areal can be imported in a minimal environment."""
    try:
        import torch.distributed.checkpoint.staging as staging

        staging.DefaultStager = getattr(
            staging, "DefaultStager", type("DefaultStager", (), {})
        )
        staging.StagingOptions = getattr(
            staging, "StagingOptions", type("StagingOptions", (), {})
        )
    except Exception:
        pass

    try:
        import torch.distributed.checkpoint.state_dict_saver as saver

        saver.AsyncSaveResponse = getattr(
            saver, "AsyncSaveResponse", type("AsyncSaveResponse", (), {})
        )
    except Exception:
        pass

    for name in ("swanlab", "trackio", "tabulate"):
        if name not in sys.modules:
            module = types.ModuleType(name)
            module.__spec__ = importlib.machinery.ModuleSpec(name, None)
            if name == "tabulate":
                module.tabulate = lambda *args, **kwargs: ""
            sys.modules[name] = module

    if "tensorboardX" not in sys.modules:
        module = types.ModuleType("tensorboardX")
        module.__spec__ = importlib.machinery.ModuleSpec("tensorboardX", None)
        module.SummaryWriter = type(
            "SummaryWriter", (), {"__init__": lambda self, *a, **k: None}
        )
        sys.modules["tensorboardX"] = module


sys.path.insert(0, os.environ["TARGET_REPO"])
install_areal_import_shims()

from areal.api.cli_args import NormConfig
from areal.trainer.ppo.actor import PPOActor
from areal.utils.data import KLEstimator, Normalization

# Create a PPOActor without running __init__, then set the attributes this scenario needs.
actor = PPOActor.__new__(PPOActor)
actor.config = SimpleNamespace(
    overlong_reward_penalty=False,
    use_decoupled_loss=False,
    recompute_logprob=False,
    mask_no_eos_with_zero=True,
)
actor.reward_bias = 0.0
actor.reward_scaling = 1.0
actor.reward_clip = 1000.0
actor.reward_norm = Normalization(NormConfig(mean_level="batch", std_level=None))
actor.adv_norm = None
actor.kl_ctl = 0.0
actor.kl_estimator = KLEstimator("k1")
actor.discount = 1.0
actor.gae_lambda = 1.0
actor.mask_no_eos_with_zero = True


def run_case(rewards):
    """Run _compute_advantages on the fixed two-row batch with the given task rewards."""
    data = {
        "input_ids": torch.tensor([[1, 2, 0], [1, 2, 3]]),
        "attention_mask": torch.tensor([[1, 1, 0], [1, 1, 1]], dtype=torch.long),
        "loss_mask": torch.tensor([[0, 1, 0], [0, 1, 1]], dtype=torch.bool),
        "rewards": torch.tensor(rewards, dtype=torch.float32),
        "logprobs": torch.zeros(2, 3),
    }
    out = actor._compute_advantages({key: value.clone() for key, value in data.items()})
    return {
        "tot_rewards": out["tot_rewards"].tolist(),
        "advantages": out["advantages"].tolist(),
        "loss_mask": out["loss_mask"].tolist(),
    }


baseline = run_case([1.0, 1.0])
with_no_eos_outlier = run_case([1.0, 100.0])
diff = (
    torch.tensor(with_no_eos_outlier["advantages"])
    - torch.tensor(baseline["advantages"])
).tolist()

payload = {
    "kind": "rl_sentinel_training_signal_validation",
    "candidate_id": "AREAL-REWARD-NORM-NO-EOS-LEAK",
    "target": "areal",
    "target_commit": subprocess.check_output(
        ["git", "-C", os.environ["TARGET_REPO"], "rev-parse", "HEAD"], text=True
    ).strip(),
    "boundary": "areal.trainer.ppo.actor.PPOActor._compute_advantages",
    "observed_baseline": baseline,
    "observed_with_no_eos_outlier": with_no_eos_outlier,
    "observed_advantage_diff": diff,
    # Row 0 is the valid EOS row; any nonzero diff there means the no-EOS row leaked.
    "quiet_training_signal_bug": any(abs(v) > 1e-6 for v in diff[0]),
}

Path(os.environ["OUTPUT_JSON"]).write_text(
    json.dumps(payload, indent=2, sort_keys=True) + "\n"
)
print(json.dumps(payload, indent=2, sort_keys=True))

Validation

ruff check areal/trainer/ppo/actor.py tests/test_ppo_actor_reward_norm_no_eos.py
ruff format --check areal/trainer/ppo/actor.py tests/test_ppo_actor_reward_norm_no_eos.py
pytest -q tests/test_ppo_actor_reward_norm_no_eos.py
pre-commit run --files areal/trainer/ppo/actor.py tests/test_ppo_actor_reward_norm_no_eos.py

Results:

All checks passed!
2 files already formatted
1 passed, 1 warning
pre-commit run --files ... passed

Hook output on unpatched main:

{
  "observed_advantage_diff": [[-49.5, 0.0, 0.0], [0.0, 0.0, 0.0]],
  "quiet_training_signal_bug": true
}

Hook output on this branch:

{
  "observed_advantage_diff": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
  "quiet_training_signal_bug": false
}

Duplicate Check

  • BUG_FINDINGS.md: no existing AReaL entry for no-EOS rows contaminating
    reward normalization before mask_no_eos_with_zero.
  • Local myAReal branches: checked reward, norm, and eos branch names; existing
    fix/norm-mask-invalid-values, fix/gspo-2d-masked-advantages, and
    fix/ppo-adv-mask-invalid-logprobs cover different paths.
  • Local PR drafts: checked existing AReaL drafts; no no-EOS reward normalization
    duplicate.
  • Upstream areal-project/AReaL PRs/issues: searched for
    "reward_norm mask_no_eos_with_zero no EOS" and
    "PPOActor reward_norm no eos advantages"; no duplicate issue or PR found.
  • Upstream code search: only current actor.py/docs/config references to
    mask_no_eos_with_zero and reward_norm appeared; no existing fix path was
    found.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the PPO actor to support masking sequences without an EOS token during reward normalization and includes a new test case for this functionality. A review comment suggests using the max_seqlen variable instead of attn_mask.shape[1] for better consistency and readability within the _compute_advantages method.

Comment thread areal/trainer/ppo/actor.py Outdated