Merged
49 changes: 49 additions & 0 deletions README.md
@@ -11,6 +11,55 @@ Transformer-based bitwise-aligned rollout for VeOmni FSDP with VeRL integration.
- 🧩 **Simple model definitions** — Transformer model code is self-contained and easy to audit, so training and inference model definitions stay in sync
- 📖 **Readable codebase** — Clean implementation with chunked prefill, pipeline parallelism, and CUDA graph support

## Effectiveness

> **Qwen3-30B-A3B · REINFORCE++ · DAPO dataset**

Review comment (medium priority):

For consistency with the recipe table below (line 53), the model name should include the -Base suffix when referring to the REINFORCE++ experiment.

Suggested change:
- > **Qwen3-30B-A3B · REINFORCE++ · DAPO dataset**
+ > **Qwen3-30B-A3B-Base · REINFORCE++ · DAPO dataset**


Off-policy logprob bias from vLLM causes the rollout-correction KL to explode after ~300 steps, which triggers a gradient-norm blow-up and ultimately training collapse. VeXact's bitwise-aligned rollout keeps the KL at exactly zero throughout, yielding stable training and a ~2× higher final AIME 2024 score than the vLLM run.
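
To make the "exactly zero" claim concrete, here is a minimal sketch of a K3-style KL estimate between rollout and training logprobs. It assumes the standard K3 estimator, `exp(d) - 1 - d` with `d = logp_train - logp_rollout` on tokens sampled by the rollout engine; the function name `k3_kl` and the plain-list inputs are illustrative and are not taken from VeXact or VeRL.

```python
import math


def k3_kl(rollout_logprobs, train_logprobs):
    """Mean K3 estimate over tokens sampled by the rollout engine.

    Assumption: K3 is exp(d) - 1 - d, where d = logp_train - logp_rollout.
    Each term is non-negative and is exactly 0.0 when the two logprobs match.
    """
    if not rollout_logprobs or len(rollout_logprobs) != len(train_logprobs):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for lp_rollout, lp_train in zip(rollout_logprobs, train_logprobs):
        log_ratio = lp_train - lp_rollout
        # expm1(d) - d equals exp(d) - 1 - d, computed stably near d = 0
        total += math.expm1(log_ratio) - log_ratio
    return total / len(rollout_logprobs)
```

With bitwise-identical logprobs every `log_ratio` is `0.0`, so the estimate is exactly `0.0`; any mismatch between the two engines gives a strictly positive value.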

<table>
<tr>
<td align="center"><b>Training reward</b></td>
<td align="center"><b>AIME 2024 (mean@32)</b></td>
</tr>
<tr>
<td><img src="assets/figures/train_reward.png" width="360"/></td>
<td><img src="assets/figures/aime24.png" width="360"/></td>
</tr>
<tr>
<td align="center"><b>Rollout-correction K3 KL (log scale)</b></td>
<td align="center"><b>Gradient norm (log scale)</b></td>
</tr>
<tr>
<td><img src="assets/figures/k3_kl.png" width="360"/></td>
<td><img src="assets/figures/grad_norm_logy.png" width="360"/></td>
</tr>
</table>

## Example Recipes

End-to-end RL training scripts live under [`examples/`](examples/README.md). Run any script from the repo root:

```bash
bash examples/getting_started/run_qwen3_1b7.sh
# override paths via env vars
model_dir=/path/to/model data_dir=/path/to/data bash examples/moe/run_qwen3_30B_A3B_dapo.sh

Review comment (medium priority):

The example command is missing the test_path environment variable. The script examples/moe/run_qwen3_30B_A3B_dapo.sh explicitly requires test_path to locate the validation dataset (see line 36 of that script). Without it, the command will fail for users who do not have the default Arnold-style mount.

Suggested change:
- model_dir=/path/to/model data_dir=/path/to/data bash examples/moe/run_qwen3_30B_A3B_dapo.sh
+ model_dir=/path/to/model data_dir=/path/to/data test_path=/path/to/test bash examples/moe/run_qwen3_30B_A3B_dapo.sh

```
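
A pre-flight check can catch an unset path variable before a long run starts. The sketch below assumes the three variable names that appear in the example command and the review comment (`model_dir`, `data_dir`, `test_path`); the script itself is not shown here, so the list and the helper name `missing_vars` are hypothetical.

```python
# Assumed variable names, taken from the example command and the review comment.
REQUIRED_VARS = ("model_dir", "data_dir", "test_path")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required names that are unset or empty in the mapping env."""
    return [name for name in required if not env.get(name)]
```

Calling `missing_vars(os.environ)` from a small launcher and aborting when the result is non-empty would report the missing `test_path` up front instead of failing inside the training script.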

| Recipe | Model | Dataset | Hardware | Algorithm |
|---|---|---|---|---|
| [`getting_started/run_qwen3_1b7.sh`](examples/getting_started/run_qwen3_1b7.sh) | Qwen3-1.7B | gsm8k | 1×8H100 | GRPO |
| [`moe/run_qwen3_30B_A3B_dapo.sh`](examples/moe/run_qwen3_30B_A3B_dapo.sh) | Qwen3-30B-A3B | DAPO-Math-17k / AIME 2025 | 1×8H100 | DAPO |

Review comment (medium priority):

The algorithm for this recipe is listed as DAPO, but the corresponding script examples/moe/run_qwen3_30B_A3B_dapo.sh sets algorithm.adv_estimator=grpo (line 34). Please ensure the algorithm name in the table accurately reflects the implementation in the script.

Suggested change:
- | [`moe/run_qwen3_30B_A3B_dapo.sh`](examples/moe/run_qwen3_30B_A3B_dapo.sh) | Qwen3-30B-A3B | DAPO-Math-17k / AIME 2025 | 1×8H100 | DAPO |
+ | [`moe/run_qwen3_30B_A3B_dapo.sh`](examples/moe/run_qwen3_30B_A3B_dapo.sh) | Qwen3-30B-A3B | DAPO-Math-17k / AIME 2025 | 1×8H100 | GRPO |

| [`moe/run_qwen3_30B_A3B_reinforce.sh`](examples/moe/run_qwen3_30B_A3B_reinforce.sh) | Qwen3-30B-A3B-Base | DAPO-Math-17k / AIME 2024 | 8×8H100 | REINFORCE++ |
| [`moe/run_qwen3_30B_A3B_16H100.sh`](examples/moe/run_qwen3_30B_A3B_16H100.sh) | Qwen3-30B-A3B | gsm8k | 2×8H100 | GRPO |
| [`moe/run_qwen3_30B_A3B_8B200.sh`](examples/moe/run_qwen3_30B_A3B_8B200.sh) | Qwen3-30B-A3B | gsm8k | 1×8B200 | GRPO |
| [`moe/run_moonlight_gsm8k.sh`](examples/moe/run_moonlight_gsm8k.sh) | Moonlight-16B-A3B-Instruct | gsm8k | 1×8B200 | GRPO |
| [`moe/run_moonlight_reinforce.sh`](examples/moe/run_moonlight_reinforce.sh) | Moonlight-16B-A3B-Instruct | DAPO-Math-17k / AIME 2024 | 1×8B200 | REINFORCE++ |
| [`verify/run_dense_vexact.sh`](examples/verify/run_dense_vexact.sh) | DeepSeek-R1-Distill-Qwen-1.5B | MATH / AIME 2024+2025 | 1×8H100 | GRPO (vexact) |
| [`verify/run_dense_vllm.sh`](examples/verify/run_dense_vllm.sh) | DeepSeek-R1-Distill-Qwen-1.5B | MATH / AIME 2024+2025 | 1×8H100 | GRPO (vllm) |

See [`examples/README.md`](examples/README.md) for path configuration, attention backend selection, and an explanation of the `verify/` pair.

## Installation

VeXact uses [uv](https://docs.astral.sh/uv/) for environment management. Pick
Binary file added assets/figures/aime24.png
Binary file added assets/figures/grad_norm_logy.png
Binary file added assets/figures/k3_kl.png
Binary file added assets/figures/train_reward.png