Checklist
Describe the bug
When training an Eagle model for Qwen3, I can only set the batch size to 2; any larger value causes an OutOfMemoryError (OOM). I am using 4 H200 GPUs with 140 GB of VRAM each, so this level of VRAM usage is clearly abnormal.
I found that most of the OOMs occur in the padding function within specforge/utils.py.
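To give a rough sense of why padding can dominate memory here, below is a minimal back-of-the-envelope sketch. The shapes are assumptions for illustration (hidden_size=4096 for Qwen3-8B, three target layers of hidden states consumed by the Eagle3 draft, bf16 activations), not values taken from specforge; padding every sample to --max-length 4096 makes these buffers grow linearly with batch size regardless of actual sequence lengths.

```python
def padded_activation_bytes(batch_size, max_len, hidden_size,
                            num_layers, dtype_bytes=2):
    """Estimate the size of hidden-state buffers once every sequence
    in the batch is padded to max_len (hypothetical shapes)."""
    return batch_size * max_len * hidden_size * num_layers * dtype_bytes


if __name__ == "__main__":
    # batch_size=2, max_len=4096, hidden_size=4096, 3 layers, bf16
    small = padded_activation_bytes(2, 4096, 4096, 3)
    large = padded_activation_bytes(8, 4096, 4096, 3)
    print(f"batch 2: {small / 2**30:.2f} GiB, batch 8: {large / 2**30:.2f} GiB")
```

This only counts one set of padded hidden-state buffers; intermediate copies made while concatenating/padding (and optimizer state) multiply it further, which is consistent with OOMs appearing inside the padding path rather than the model forward.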
Reproduction
NUM_GPUS=${1:-4}
# Use expandable memory segments to reduce allocator fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True
CUDA_VISIBLE_DEVICES=4,5,6,7 \
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
$ROOT_DIR/scripts/train_eagle3.py \
--target-model-path Qwen/Qwen3-8B \
--draft-model-config $ROOT_DIR/configs/qwen3-8b-eagle3.json \
--train-data-path $ROOT_DIR/cache/dataset/dflash_dataset_nemotron_800K.jsonl \
--output-dir $ROOT_DIR/outputs/qwen3-8b-eagle-nemotron \
--num-epochs 10 \
--batch-size 2 \
--learning-rate 2e-4 \
--max-length 4096 \
--chat-template qwen \
--cache-dir $ROOT_DIR/cache \
--embedding-key model.embed_tokens.weight \
--log-interval 50 \
--save-interval 1000 \
--tp-size $NUM_GPUS \
--target-model-backend sglang \
--sglang-attention-backend fa3 \
--sglang-mem-fraction-static 0.1 \
--report-to wandb \
--wandb-project specforge-qwen3-8b-eagle \
--wandb-name qwen3-8b-eagle-nemotron
Environment
specforge 0.2.0