Checklist
Describe the bug
When training an Eagle model for Qwen3, I can only set the batch size to 2; any larger value causes an OutOfMemoryError (OOM). I am using 4 H200 GPUs with 140 GB of VRAM each, so this level of VRAM usage is clearly abnormal.
I found that most of the OOMs occur in the padding function within specforge/utils.py.
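To give a rough sense of why padding can dominate memory here, below is a minimal back-of-the-envelope sketch. The shapes are assumptions for illustration (hidden_size=4096 for Qwen3-8B, three target layers of hidden states consumed by the Eagle3 draft, bf16 activations), not values taken from specforge; padding every sample to --max-length 4096 makes these buffers grow linearly with batch size regardless of actual sequence lengths.

```python
def padded_activation_bytes(batch_size, max_len, hidden_size,
                            num_layers, dtype_bytes=2):
    """Estimate the size of hidden-state buffers once every sequence
    in the batch is padded to max_len (hypothetical shapes)."""
    return batch_size * max_len * hidden_size * num_layers * dtype_bytes


if __name__ == "__main__":
    # batch_size=2, max_len=4096, hidden_size=4096, 3 layers, bf16
    small = padded_activation_bytes(2, 4096, 4096, 3)
    large = padded_activation_bytes(8, 4096, 4096, 3)
    print(f"batch 2: {small / 2**30:.2f} GiB, batch 8: {large / 2**30:.2f} GiB")
```

This only counts one set of padded hidden-state buffers; intermediate copies made while concatenating/padding (and optimizer state) multiply it further, which is consistent with OOMs appearing inside the padding path rather than the model forward.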
Reproduction
NUM_GPUS=${1:-4}
# Use expandable memory segments to reduce allocator fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True
CUDA_VISIBLE_DEVICES=4,5,6,7 \
torchrun \
--standalone \
--nproc_per_node $NUM_GPUS \
$ROOT_DIR/scripts/train_eagle3.py \
--target-model-path Qwen/Qwen3-8B \
--draft-model-config $ROOT_DIR/configs/qwen3-8b-eagle3.json \
--train-data-path $ROOT_DIR/cache/dataset/dflash_dataset_nemotron_800K.jsonl \
--output-dir $ROOT_DIR/outputs/qwen3-8b-eagle-nemotron \
--num-epochs 10 \
--batch-size 2 \
--learning-rate 2e-4 \
--max-length 4096 \
--chat-template qwen \
--cache-dir $ROOT_DIR/cache \
--embedding-key model.embed_tokens.weight \
--log-interval 50 \
--save-interval 1000 \
--tp-size $NUM_GPUS \
--target-model-backend sglang \
--sglang-attention-backend fa3 \
--sglang-mem-fraction-static 0.1 \
--report-to wandb \
--wandb-project specforge-qwen3-8b-eagle \
--wandb-name qwen3-8b-eagle-nemotron
Environment
specforge 0.2.0