OOM when running example of LoRA on Qwen2.5-Omni #205

Description

@haon-chen

Hi team,

I’m trying to reproduce the example LoRA training run for Qwen2.5-Omni, but I consistently hit CUDA OOM.

Hardware

  • 1 node, 8× NVIDIA H100 80GB
  • CPU RAM: e.g., 256 GB

Symptoms

  • With ZeRO-0: CUDA OOM on the first forward/backward steps.
  • Switching to ZeRO-3 avoids OOM but sometimes triggers NCCL collective timeouts (_ALLGATHER_BASE watchdog).
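For concreteness, here is a minimal sketch of the kind of ZeRO-3 config I mean. The keys are standard DeepSpeed options, but the values are assumptions for illustration and not the repo's actual config; the "auto" entries assume the Hugging Face Trainer integration fills them in from the training arguments.

```python
import json


def build_zero3_config() -> dict:
    """Return an illustrative DeepSpeed ZeRO-3 config (values are assumptions)."""
    return {
        # bf16 is assumed here because H100s support it natively.
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,  # partition params, grads, and optimizer state
            "overlap_comm": True,
            "contiguous_gradients": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        # "auto" lets the HF Trainer integration copy values from its own args.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
    }


if __name__ == "__main__":
    # Print the JSON that would be saved as the DeepSpeed config file.
    print(json.dumps(build_zero3_config(), indent=2))
```

Setting "stage" to 0 in the same dict would give the ZeRO-0 variant that runs out of memory for me.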

Questions

  1. In your reproduction, did you train LoRA with ZeRO-0 on 8× H100 80GB?

    • If yes, could you share the exact batch sizes, sequence lengths, and DeepSpeed config? (The current one consistently hits OOM.)
  2. Could you please provide a requirements.txt (or conda env) for training Qwen 2.5 Omni with this example?

    • Version pins for torch / deepspeed / nccl would be very helpful.

What I tried

  • Lowering per_device_train_batch_size (16 → 2).
  • Shorter seq lengths (e.g., query_max_len=256, passage_max_len=256).
  • Enabling gradient checkpointing.
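To make the reduced settings easy to compare against a working setup, here they are as a small sketch. The field names mirror the arguments listed above; `TriedSettings` is a hypothetical helper, and `gradient_accumulation_steps=1` is an assumption, since I did not state it above.

```python
from dataclasses import dataclass


@dataclass
class TriedSettings:
    """Settings from the list above; hypothetical container for illustration."""

    per_device_train_batch_size: int = 2   # lowered from 16
    query_max_len: int = 256
    passage_max_len: int = 256
    gradient_checkpointing: bool = True
    num_gpus: int = 8                      # 1 node, 8x H100 80GB
    gradient_accumulation_steps: int = 1   # assumption, not stated above

    def global_batch_size(self) -> int:
        # Samples contributing to each optimizer step across all GPUs.
        return (
            self.per_device_train_batch_size
            * self.num_gpus
            * self.gradient_accumulation_steps
        )


if __name__ == "__main__":
    reduced = TriedSettings()
    original = TriedSettings(per_device_train_batch_size=16)
    print("reduced global batch:", reduced.global_batch_size())
    print("original global batch:", original.global_batch_size())
```

Under these assumptions the global batch drops from 128 to 16, so the OOM persists even at one eighth of the original per-step load.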

If you have a minimal working config (training arguments plus the DeepSpeed JSON) and a requirements.txt, that would greatly help us reproduce your results. Thanks!
