This repository extends the open-source verl training stack with the components used in the paper "Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers". We focus solely on modelling verifier noise and correcting the resulting bias in Group Relative Policy Optimisation (GRPO).
The code base supplies two closely related capabilities:
- Asymmetric binary-noise injection. Rewards can be perturbed according to a verifier reward channel with user-specified false-positive (`rho0`) and false-negative (`rho1`) rates. This lets you reproduce the synthetic-noise ablations from the paper or stress-test custom verifier pipelines.
- Noise-aware policy gradients. We implement the two correction algorithms derived in the paper, backward correction (PGBC) and forward correction (PGFC), as drop-in replacements for the standard GRPO advantage construction. Both live in `verl/trainer/ppo/core_algos.py` and are activated via Hydra flags.
```bash
conda create -n noisy_rlvr python=3.10 -y
conda activate noisy_rlvr
pip install -r ./requirements.txt
```

Optional dependencies for flash-attention or vLLM follow the upstream verl instructions.
Kick off a noise-aware GRPO training run with the provided helper script:
```bash
./examples/train.sh
```

Adjust the environment variables in `./examples/train.sh` to swap models, datasets, or noise settings before launching.
Enable synthetic corruption by toggling the `grpo_noisy_verifier` flags when launching `verl.trainer.main_ppo`:
```bash
python -m verl.trainer.main_ppo \
    +algorithm.grpo_noisy_verifier.add_noise=True \
    +algorithm.grpo_noisy_verifier.rho0=0.1 \
    +algorithm.grpo_noisy_verifier.rho1=0.2
```

With `add_noise=True`, the observed binary reward is drawn from the asymmetric noise channel: a clean reward of 0 is flipped to 1 with probability `rho0` (false positive), and a clean reward of 1 is flipped to 0 with probability `rho1` (false negative).
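For intuition, here is a minimal sketch of such a channel in PyTorch. The helper name `corrupt_rewards` is illustrative and is not the function used inside `core_algos.py`:

```python
import torch

def corrupt_rewards(rewards: torch.Tensor, rho0: float, rho1: float) -> torch.Tensor:
    """Illustrative asymmetric binary-noise channel.

    A clean reward of 0 is flipped to 1 with probability rho0 (false positive);
    a clean reward of 1 is flipped to 0 with probability rho1 (false negative).
    """
    flip_prob = torch.where(
        rewards > 0.5,
        torch.full_like(rewards, rho1),   # clean reward 1 -> flip with prob rho1
        torch.full_like(rewards, rho0),   # clean reward 0 -> flip with prob rho0
    )
    flip = torch.bernoulli(flip_prob)     # 1 where the bit gets flipped
    return (rewards - flip).abs()         # XOR of reward and flip mask

# Example: corrupt a batch of clean binary rewards
clean = torch.tensor([1.0, 0.0, 1.0, 1.0])
noisy = corrupt_rewards(clean, rho0=0.1, rho1=0.2)
```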
Backward correction replaces the noisy reward with an unbiased estimator of the clean reward:

$$
\widehat{R} = \frac{\tilde{R} - \rho_0}{1 - \rho_0 - \rho_1}.
$$

Enable it together with the noise channel (or a real verifier) via:

```bash
    +algorithm.grpo_noisy_verifier.correct_noise=True \
    +algorithm.grpo_noisy_verifier.noise_correction=unbiased_reward
```

The estimator is applied per trajectory before group normalisation, giving an unbiased policy-gradient update under the assumed noise rates.
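Conceptually, the correction is a single affine rescaling applied before GRPO's group statistics. A minimal sketch, assuming the rewards arrive as a tensor of noisy binary scores (names are illustrative, not the in-repo API):

```python
import torch

def backward_corrected_rewards(noisy_rewards: torch.Tensor,
                               rho0: float, rho1: float) -> torch.Tensor:
    """Unbiased estimator of the clean reward: (R_tilde - rho0) / (1 - rho0 - rho1)."""
    denom = 1.0 - rho0 - rho1
    assert denom > 0.0, "noise rates must satisfy rho0 + rho1 < 1"
    return (noisy_rewards - rho0) / denom

# The corrected rewards then pass through the usual GRPO group normalisation,
# e.g. (r_hat - mean(group)) / (std(group) + eps) within each prompt group.
```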
Forward correction reweights the score-function terms so the expected update remains aligned with the clean gradient. For binary rewards the weights are

$$
w_{\tilde{R}} =
\begin{cases}
\rho_1 - 1, & \tilde{R} = 0,\\
\rho_1, & \tilde{R} = 1.
\end{cases}
$$

Activate it with:

```bash
    +algorithm.grpo_noisy_verifier.correct_noise=True \
    +algorithm.grpo_noisy_verifier.noise_correction=asymmetric_weighting
```

This variant avoids dividing by $1 - \rho_0 - \rho_1$, which can be small when the noise rates are large.
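A minimal sketch of the weighting rule stated above; the function name is hypothetical and the in-repo implementation lives in `core_algos.py`:

```python
import torch

def asymmetric_weights(noisy_rewards: torch.Tensor, rho1: float) -> torch.Tensor:
    """Per-sample weights following the rule above:
    w = rho1 - 1 if the observed reward is 0, and w = rho1 if it is 1."""
    return torch.where(
        noisy_rewards > 0.5,
        torch.full_like(noisy_rewards, rho1),        # observed reward 1
        torch.full_like(noisy_rewards, rho1 - 1.0),  # observed reward 0
    )

# The weights scale the score-function (log-probability) terms as described above;
# note that no division by (1 - rho0 - rho1) appears.
```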
Both corrections can be combined with either synthetic corruption or real verifier logs by supplying estimated noise rates through the same `rho0` and `rho1` flags. No other infrastructure is required.