
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

This repository extends the open-source verl training stack with the components used in the paper "Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers". We focus solely on modelling verifier noise and correcting the resulting bias in Group Relative Policy Optimisation (GRPO).

The code base supplies two closely related capabilities:

  • Asymmetric binary-noise injection. Rewards can be perturbed according to a verifier reward channel with user-specified false-positive (rho0) and false-negative (rho1) rates. This lets you reproduce the synthetic-noise ablations from the paper or stress-test custom verifier pipelines.
  • Noise-aware policy gradients. We implement the two correction algorithms derived in the paper—backward correction (PGBC) and forward correction (PGFC)—as drop-in replacements for the standard GRPO advantage construction. Both live in verl/trainer/ppo/core_algos.py and are activated via Hydra flags.

Installation

conda create -n noisy_rlvr python=3.10 -y
conda activate noisy_rlvr
pip install -r ./requirements.txt

For optional dependencies such as flash-attention or vLLM, follow the upstream verl instructions.


Quick Start

Kick off a noise-aware GRPO training run with the provided helper script:

./examples/train.sh

Adjust the environment variables in ./examples/train.sh to swap models, datasets, or noise settings before launching.


Using the Noise Channel

Enable synthetic corruption by toggling the grpo_noisy_verifier flags when launching verl.trainer.main_ppo:

python -m verl.trainer.main_ppo \
  +algorithm.grpo_noisy_verifier.add_noise=True \
  +algorithm.grpo_noisy_verifier.rho0=0.1 \
  +algorithm.grpo_noisy_verifier.rho1=0.2

With add_noise=True, the observed binary reward $\tilde{R}$ is sampled from the clean reward $R^{*}$ through the channel $$ \Pr(\tilde{R}=1 \mid R^{*}=0)=\rho_0, \qquad \Pr(\tilde{R}=0 \mid R^{*}=1)=\rho_1, $$ mirroring the verifier reward channel defined in the paper.
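
For intuition, the channel is easy to simulate outside the trainer. Below is a minimal NumPy sketch; the function name and array interface are ours for illustration, not the repository's actual API.

import numpy as np

def corrupt_rewards(clean_rewards, rho0, rho1, rng=None):
    # Pass binary rewards through the asymmetric channel:
    # a clean 0 flips to 1 with probability rho0 (false positive),
    # a clean 1 flips to 0 with probability rho1 (false negative).
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean_rewards)
    flip_prob = np.where(clean == 1, rho1, rho0)  # per-sample flip probability
    flips = rng.random(clean.shape) < flip_prob
    return np.where(flips, 1 - clean, clean)

# Example: 10% of clean zeros become spurious positives, 20% of clean ones are missed.
noisy = corrupt_rewards(np.array([0, 0, 1, 1, 1]), rho0=0.1, rho1=0.2)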


PGBC: Backward Correction

Backward correction replaces the noisy reward with an unbiased estimator of the clean reward: $$ \widehat{R} = \frac{\tilde{R} - \rho_0}{1 - \rho_0 - \rho_1}. $$ Enable it together with the noise channel (or a real verifier) via

+algorithm.grpo_noisy_verifier.correct_noise=True \
+algorithm.grpo_noisy_verifier.noise_correction=unbiased_reward

The estimator is applied per trajectory before group normalisation, giving an unbiased policy-gradient update under the assumed noise rates.
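
A minimal sketch of the estimator on an array of noisy binary rewards, together with a quick numerical check of the unbiasedness claim (illustrative code, not the implementation in core_algos.py):

import numpy as np

def backward_correct(noisy_rewards, rho0, rho1):
    # Unbiased estimate of the clean reward: (R_tilde - rho0) / (1 - rho0 - rho1).
    denom = 1.0 - rho0 - rho1
    assert denom > 0, "the channel must satisfy rho0 + rho1 < 1"
    return (np.asarray(noisy_rewards, dtype=float) - rho0) / denom

# Check unbiasedness for a clean reward of 1 with rho0=0.1, rho1=0.2:
# E[R_hat | R*=1] = (1 - rho1) * R_hat(1) + rho1 * R_hat(0)
rho0, rho1 = 0.1, 0.2
expectation = (1 - rho1) * backward_correct(1, rho0, rho1) + rho1 * backward_correct(0, rho0, rho1)
print(expectation)  # -> 1.0 (up to floating-point error)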


PGFC: Forward Correction

Forward correction reweights the score-function terms so that the expected update remains aligned with the clean gradient. For binary rewards the weights are $$ w_{\tilde{R}} = \begin{cases} \rho_1 - 1, & \tilde{R}=0, \\ \rho_1, & \tilde{R}=1. \end{cases} $$ Activate it with

+algorithm.grpo_noisy_verifier.correct_noise=True \
+algorithm.grpo_noisy_verifier.noise_correction=asymmetric_weighting

This variant avoids dividing by $1-\rho_0-\rho_1$ and often yields lower variance when noise rates are moderate.
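
A sketch of the weighting, with a check that the expected weight is affine in the clean reward with positive slope $1-\rho_0-\rho_1$, which is why the update stays aligned after GRPO's group normalisation (again illustrative, not the repository's exact code):

import numpy as np

def forward_weights(noisy_rewards, rho1):
    # w = rho1 - 1 when the noisy reward is 0, w = rho1 when it is 1.
    noisy = np.asarray(noisy_rewards)
    return np.where(noisy == 1, rho1, rho1 - 1.0)

# Expected weight under the channel, for each clean reward value:
rho0, rho1 = 0.1, 0.2
for r_star in (0, 1):
    p1 = (1 - rho1) if r_star == 1 else rho0  # P(noisy reward = 1 | clean reward)
    e_w = p1 * rho1 + (1 - p1) * (rho1 - 1.0)
    print(r_star, e_w)  # affine in r_star: (1 - rho0 - rho1) * (r_star - 1)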


Both corrections can be combined with either synthetic corruption or real verifier logs by supplying estimated noise rates through the same rho0 and rho1 flags. No other infrastructure is required.
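
Neither correction cares where the rates come from. If you are working from real verifier logs, one simple option, assuming you can hand-label a small audit set with ground-truth correctness, is to count the verifier's false-positive and false-negative frequencies directly (this estimator is our suggestion for illustration, not something the repository ships):

import numpy as np

def estimate_noise_rates(verifier_rewards, true_rewards):
    # rho0 = P(verifier says 1 | truth is 0), rho1 = P(verifier says 0 | truth is 1).
    v = np.asarray(verifier_rewards)
    t = np.asarray(true_rewards)
    rho0 = float(np.mean(v[t == 0] == 1)) if np.any(t == 0) else 0.0
    rho1 = float(np.mean(v[t == 1] == 0)) if np.any(t == 1) else 0.0
    return rho0, rho1

The resulting estimates can then be passed to training through the same rho0 and rho1 flags shown above.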

