[Bug] Abnormal memory usage and Out-of-Memory in eagle3 training. #466

@coco-alen

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When training an EAGLE-3 draft model for Qwen3-8B, I could only set the batch size to 2; anything larger raised an OutOfMemoryError (OOM). I was using 4 H200 GPUs, each with 140GB of VRAM, so this level of memory usage looks abnormal.

I found that most of the OOMs occurred in the padding function within specforge/utils.py.
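To give a sense of scale, here is a rough back-of-the-envelope sketch of how large a single vocabulary-sized activation tensor would be at the settings used in the reproduction below. Everything in it is an assumption for illustration: the vocabulary size (151,936 for Qwen3-8B), the fp32 dtype, the idea that the padding helper is out-of-place (so input and output coexist briefly), and the helper names `tensor_gib` and `out_of_place_peak_gib`, which are hypothetical and not part of SpecForge. It does not diagnose the OOM; it only shows how quickly such tensors grow with batch size.

```python
# Back-of-the-envelope size of one dense activation tensor.
# All constants are assumptions for illustration, not values read from
# SpecForge or from the failing run.

VOCAB_SIZE = 151_936  # assumed Qwen3-8B vocabulary size
BYTES_PER_ELEMENT = {"bf16": 2, "fp32": 4}


def tensor_gib(batch_size: int, seq_len: int, width: int, dtype: str) -> float:
    """Size in GiB of a dense [batch_size, seq_len, width] tensor."""
    n_bytes = batch_size * seq_len * width * BYTES_PER_ELEMENT[dtype]
    return n_bytes / 2**30


def out_of_place_peak_gib(tensor_size_gib: float) -> float:
    """Peak footprint of an op that returns a new same-shaped tensor
    while its input is still alive (input + output)."""
    return 2 * tensor_size_gib


if __name__ == "__main__":
    for batch_size in (2, 4, 8):
        logits = tensor_gib(batch_size, 4096, VOCAB_SIZE, "fp32")
        peak = out_of_place_peak_gib(logits)
        print(f"batch={batch_size}: one tensor {logits:.2f} GiB, "
              f"out-of-place peak {peak:.2f} GiB")
```

Under these assumptions, one [2, 4096, 151936] fp32 tensor is about 4.64 GiB, and the figure scales linearly with batch size. Whether the padding function actually operates on tensors of this shape, and how many of them are alive at once, would need to be confirmed with a memory profile.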

Reproduction

NUM_GPUS=${1:-4}

# Use expandable segments in the PyTorch CUDA allocator to reduce fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True
CUDA_VISIBLE_DEVICES=4,5,6,7 \
torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_eagle3.py \
    --target-model-path Qwen/Qwen3-8B \
    --draft-model-config $ROOT_DIR/configs/qwen3-8b-eagle3.json \
    --train-data-path $ROOT_DIR/cache/dataset/dflash_dataset_nemotron_800K.jsonl \
    --output-dir $ROOT_DIR/outputs/qwen3-8b-eagle-nemotron \
    --num-epochs 10 \
    --batch-size 2 \
    --learning-rate 2e-4 \
    --max-length 4096 \
    --chat-template qwen \
    --cache-dir $ROOT_DIR/cache \
    --embedding-key model.embed_tokens.weight \
    --log-interval 50 \
    --save-interval 1000 \
    --tp-size $NUM_GPUS \
    --target-model-backend sglang \
    --sglang-attention-backend fa3 \
    --sglang-mem-fraction-static 0.1 \
    --report-to wandb \
    --wandb-project specforge-qwen3-8b-eagle \
    --wandb-name qwen3-8b-eagle-nemotron

Environment

specforge 0.2.0
