diff --git a/README.md b/README.md
index 44ff1a4..a6dd7c8 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,55 @@ Transformer-based bitwise-aligned rollout for VeOmni FSDP with VeRL integration.
 - 🧩 **Simple model definitions** — Transformer model code is self-contained and easy to audit, so training and inference model definitions stay in sync
 - 📖 **Readable codebase** — Clean implementation with chunked prefill, pipeline parallelism, and CUDA graph support
 
+## Effectiveness
+
+> **Qwen3-30B-A3B · REINFORCE++ · DAPO dataset**
+
+Off-policy logprob bias from vLLM causes the rollout-correction KL to explode after ~300 steps, which triggers gradient norm blow-up and ultimately training collapse. VeXact's bitwise-aligned rollout keeps the KL at exactly zero throughout, yielding stable training and a ~2× higher final AIME 2024 score.
+
+[Figure table, 2×2 panels. Top row: Training reward | AIME 2024 (mean@32). Bottom row: Rollout-correction K3 KL (log scale) | Gradient norm (log scale).]
+
+## Example Recipes
+
+End-to-end RL training scripts live under [`examples/`](examples/README.md). Run any script from the repo root:
+
+```bash
+bash examples/getting_started/run_qwen3_1b7.sh
+# override paths via env vars
+model_dir=/path/to/model data_dir=/path/to/data bash examples/moe/run_qwen3_30B_A3B_dapo.sh
+```
+
+| Recipe | Model | Dataset | Hardware | Algorithm |
+|---|---|---|---|---|
+| [`getting_started/run_qwen3_1b7.sh`](examples/getting_started/run_qwen3_1b7.sh) | Qwen3-1.7B | gsm8k | 1×8H100 | GRPO |
+| [`moe/run_qwen3_30B_A3B_dapo.sh`](examples/moe/run_qwen3_30B_A3B_dapo.sh) | Qwen3-30B-A3B | DAPO-Math-17k / AIME 2025 | 1×8H100 | DAPO |
+| [`moe/run_qwen3_30B_A3B_reinforce.sh`](examples/moe/run_qwen3_30B_A3B_reinforce.sh) | Qwen3-30B-A3B-Base | DAPO-Math-17k / AIME 2024 | 8×8H100 | REINFORCE++ |
+| [`moe/run_qwen3_30B_A3B_16H100.sh`](examples/moe/run_qwen3_30B_A3B_16H100.sh) | Qwen3-30B-A3B | gsm8k | 2×8H100 | GRPO |
+| [`moe/run_qwen3_30B_A3B_8B200.sh`](examples/moe/run_qwen3_30B_A3B_8B200.sh) | Qwen3-30B-A3B | gsm8k | 1×8B200 | GRPO |
+| [`moe/run_moonlight_gsm8k.sh`](examples/moe/run_moonlight_gsm8k.sh) | Moonlight-16B-A3B-Instruct | gsm8k | 1×8B200 | GRPO |
+| [`moe/run_moonlight_reinforce.sh`](examples/moe/run_moonlight_reinforce.sh) | Moonlight-16B-A3B-Instruct | DAPO-Math-17k / AIME 2024 | 1×8B200 | REINFORCE++ |
+| [`verify/run_dense_vexact.sh`](examples/verify/run_dense_vexact.sh) | DeepSeek-R1-Distill-Qwen-1.5B | MATH / AIME 2024+2025 | 1×8H100 | GRPO (vexact) |
+| [`verify/run_dense_vllm.sh`](examples/verify/run_dense_vllm.sh) | DeepSeek-R1-Distill-Qwen-1.5B | MATH / AIME 2024+2025 | 1×8H100 | GRPO (vllm) |
+
+See [`examples/README.md`](examples/README.md) for path configuration, attention backend selection, and an explanation of the `verify/` pair.
+
 ## Installation
 
 VeXact uses [uv](https://docs.astral.sh/uv/) for environment management. Pick
diff --git a/assets/figures/aime24.png b/assets/figures/aime24.png
new file mode 100644
index 0000000..85aeb11
Binary files /dev/null and b/assets/figures/aime24.png differ
diff --git a/assets/figures/grad_norm_logy.png b/assets/figures/grad_norm_logy.png
new file mode 100644
index 0000000..a2a5ad3
Binary files /dev/null and b/assets/figures/grad_norm_logy.png differ
diff --git a/assets/figures/k3_kl.png b/assets/figures/k3_kl.png
new file mode 100644
index 0000000..b5df25b
Binary files /dev/null and b/assets/figures/k3_kl.png differ
diff --git a/assets/figures/train_reward.png b/assets/figures/train_reward.png
new file mode 100644
index 0000000..f578916
Binary files /dev/null and b/assets/figures/train_reward.png differ
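
Note on the "Rollout-correction K3 KL" panel added in the README hunk: the diff does not show how that metric is computed. The sketch below assumes it is the common k3 estimator, `exp(log_ratio) - 1 - log_ratio` averaged over sampled tokens, applied to per-token logprobs from the rollout engine and from the trainer. The function name `k3_kl` and its arguments are hypothetical and are not taken from VeXact or VeRL. It is meant only to illustrate why bitwise-identical logprobs give a KL of exactly zero, while any mismatch gives a strictly positive value.

```python
import math
from typing import Sequence


def k3_kl(rollout_logprobs: Sequence[float], train_logprobs: Sequence[float]) -> float:
    """Mean k3 estimate of KL(rollout || train) over sampled tokens.

    Assumption: tokens were sampled from the rollout policy, so with
    log_ratio = train_logprob - rollout_logprob the per-token term is
    exp(log_ratio) - 1 - log_ratio.
    """
    if len(rollout_logprobs) != len(train_logprobs):
        raise ValueError("logprob sequences must have the same length")
    if not rollout_logprobs:
        raise ValueError("need at least one token")
    total = 0.0
    for lp_rollout, lp_train in zip(rollout_logprobs, train_logprobs):
        log_ratio = lp_train - lp_rollout
        # expm1 computes exp(x) - 1 accurately when x is close to zero.
        total += math.expm1(log_ratio) - log_ratio
    return total / len(rollout_logprobs)


if __name__ == "__main__":
    aligned = [-0.5, -1.2, -3.0]
    # Identical logprobs: every log_ratio is 0.0, so the estimate is 0.0.
    print(k3_kl(aligned, aligned))
    # A small per-token mismatch yields a positive estimate.
    print(k3_kl(aligned, [-0.6, -1.1, -3.2]))
```

Under this assumption the metric is zero only when the two engines agree on every sampled token's logprob, which is consistent with the README's claim that a bitwise-aligned rollout keeps the KL at exactly zero.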