This repository extends the open-source verl training stack with the components used in the paper "Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers". We focus solely on modelling verifier noise and correcting the resulting bias in Group Relative Policy Optimisation (GRPO).
The code base supplies two closely related capabilities:
- Asymmetric binary-noise injection. Rewards can be perturbed according to a verifier reward channel with user-specified false-positive (`rho0`) and false-negative (`rho1`) rates. This lets you reproduce the synthetic-noise ablations from the paper or stress-test custom verifier pipelines.
- Noise-aware policy gradients. We implement the two correction algorithms derived in the paper, backward correction (PGBC) and forward correction (PGFC), as drop-in replacements for the standard GRPO advantage construction. Both live in `verl/trainer/ppo/core_algos.py` and are activated via Hydra flags.
```bash
conda create -n noisy_rlvr python=3.10 -y
conda activate noisy_rlvr
pip install -r ./requirements.txt
```

Optional dependencies for flash-attention or vLLM follow the upstream verl instructions.
Kick off a noise-aware GRPO training run with the provided helper script:
```bash
./examples/train.sh
```

Adjust the environment variables in `./examples/train.sh` to swap models, datasets, or noise settings before launching.
Enable synthetic corruption by toggling the `grpo_noisy_verifier` flags when launching `verl.trainer.main_ppo`:
```bash
python -m verl.trainer.main_ppo \
    +algorithm.grpo_noisy_verifier.add_noise=True \
    +algorithm.grpo_noisy_verifier.rho0=0.1 \
    +algorithm.grpo_noisy_verifier.rho1=0.2
```

With `add_noise=True`, the observed binary reward is drawn from the asymmetric noise channel: a clean reward of 0 is flipped to 1 with probability `rho0` (false positive), and a clean reward of 1 is flipped to 0 with probability `rho1` (false negative).
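For intuition, here is a minimal sketch of such a channel in PyTorch. The helper name `corrupt_rewards` is illustrative and is not the function used inside `core_algos.py`:

```python
import torch

def corrupt_rewards(rewards: torch.Tensor, rho0: float, rho1: float) -> torch.Tensor:
    """Illustrative asymmetric binary-noise channel.

    A clean reward of 0 is flipped to 1 with probability rho0 (false positive);
    a clean reward of 1 is flipped to 0 with probability rho1 (false negative).
    """
    flip_prob = torch.where(
        rewards > 0.5,
        torch.full_like(rewards, rho1),   # clean reward 1 -> flip with prob rho1
        torch.full_like(rewards, rho0),   # clean reward 0 -> flip with prob rho0
    )
    flip = torch.bernoulli(flip_prob)     # 1 where the bit gets flipped
    return (rewards - flip).abs()         # XOR of reward and flip mask

# Example: corrupt a batch of clean binary rewards
clean = torch.tensor([1.0, 0.0, 1.0, 1.0])
noisy = corrupt_rewards(clean, rho0=0.1, rho1=0.2)
```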
Backward correction replaces the noisy reward with an unbiased estimator of the clean reward:

$$
\widehat{R} = \frac{\tilde{R} - \rho_0}{1 - \rho_0 - \rho_1}.
$$

Enable it together with the noise channel (or a real verifier) via:

```bash
    +algorithm.grpo_noisy_verifier.correct_noise=True \
    +algorithm.grpo_noisy_verifier.noise_correction=unbiased_reward
```

The estimator is applied per trajectory before group normalisation, giving an unbiased policy-gradient update under the assumed noise rates.
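Conceptually, the correction is a single affine rescaling applied before GRPO's group statistics. A minimal sketch, assuming the rewards arrive as a tensor of noisy binary scores (names are illustrative, not the in-repo API):

```python
import torch

def backward_corrected_rewards(noisy_rewards: torch.Tensor,
                               rho0: float, rho1: float) -> torch.Tensor:
    """Unbiased estimator of the clean reward: (R_tilde - rho0) / (1 - rho0 - rho1)."""
    denom = 1.0 - rho0 - rho1
    assert denom > 0.0, "noise rates must satisfy rho0 + rho1 < 1"
    return (noisy_rewards - rho0) / denom

# The corrected rewards then pass through the usual GRPO group normalisation,
# e.g. (r_hat - mean(group)) / (std(group) + eps) within each prompt group.
```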
Forward correction reweights the score-function terms so the expected update remains aligned with the clean gradient. For binary rewards the weights are

$$
w_{\tilde{R}} =
\begin{cases}
\rho_1 - 1, & \tilde{R} = 0,\\
\rho_1, & \tilde{R} = 1.
\end{cases}
$$

Activate it with:

```bash
    +algorithm.grpo_noisy_verifier.correct_noise=True \
    +algorithm.grpo_noisy_verifier.noise_correction=asymmetric_weighting
```

This variant avoids dividing by $1 - \rho_0 - \rho_1$, which can be small when the noise rates are large.
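A minimal sketch of the weighting rule stated above; the function name is hypothetical and the in-repo implementation lives in `core_algos.py`:

```python
import torch

def asymmetric_weights(noisy_rewards: torch.Tensor, rho1: float) -> torch.Tensor:
    """Per-sample weights following the rule above:
    w = rho1 - 1 if the observed reward is 0, and w = rho1 if it is 1."""
    return torch.where(
        noisy_rewards > 0.5,
        torch.full_like(noisy_rewards, rho1),        # observed reward 1
        torch.full_like(noisy_rewards, rho1 - 1.0),  # observed reward 0
    )

# The weights scale the score-function (log-probability) terms as described above;
# note that no division by (1 - rho0 - rho1) appears.
```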
Both corrections can be combined with either synthetic corruption or real verifier logs by supplying estimated noise rates through the same `rho0` and `rho1` flags. No other infrastructure is required.