OOM when running the LoRA example on Qwen2.5-Omni (#205)

Hi team,

I’m trying to reproduce the [Qwen2.5-Omni training example](https://github.com/texttron/tevatron/tree/main/examples/multimodal/qwen_omni), but I consistently hit CUDA OOM.

### Hardware

* 1 node, **8× NVIDIA H100 80GB**
* CPU RAM: e.g., 256 GB

### Symptoms

* With **ZeRO-0**: CUDA OOM within the first forward/backward steps.
* Switching to **ZeRO-3** avoids OOM but sometimes triggers NCCL collective timeouts (`_ALLGATHER_BASE` watchdog).

### Questions

1. In your reproduction, did you train **LoRA** with **ZeRO-0 on 8× H100 80GB**?

   * If yes, could you share the exact batch sizes, sequence lengths, and DeepSpeed config? (The current config consistently hits OOM for me.)
2. Could you please provide a **`requirements.txt`** (or conda env) for training **Qwen 2.5 Omni** with this example?

   * Version pins for torch / deepspeed / nccl would be very helpful.
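To make the version question concrete, here is a minimal sketch of how the installed versions could be dumped in `requirements.txt` style for comparison. The package list is my assumption about what matters for this example, not something taken from the repo.

```python
from importlib import metadata

# Assumed list of packages relevant to this example; adjust as needed.
PACKAGES = ["torch", "deepspeed", "transformers", "peft", "accelerate"]


def collect_versions(packages):
    """Return {package: version}, or 'not installed', without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Print in pip's pin format so the output can be pasted into an issue.
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name}=={version}")
```

The NCCL version bundled with PyTorch is not a separate pip package in every setup; on a CUDA build it can be read with `torch.cuda.nccl.version()`.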

### What I tried

* Lowering `per_device_train_batch_size` (16 → 2).
* Shorter seq lengths (e.g., `query_max_len=256`, `passage_max_len=256`).
* Enabling gradient checkpointing.
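As a rough illustration of how much these changes should shrink the per-GPU workload, the sketch below counts the maximum number of text tokens per device per step. It assumes each query is paired with `train_group_size` passages, and the value `4` used here is a hypothetical placeholder, not the setting from the example. Audio, image, or video tokens from the multimodal inputs are not covered by the text length caps and are not counted.

```python
def sequences_per_device(batch_size, train_group_size):
    """Queries plus passages encoded on one GPU per step (assumed pairing scheme)."""
    return batch_size * (1 + train_group_size)


def max_text_tokens_per_device(batch_size, train_group_size,
                               query_max_len, passage_max_len):
    """Upper bound on text tokens per GPU per step if every sequence hits its cap."""
    query_tokens = batch_size * query_max_len
    passage_tokens = batch_size * train_group_size * passage_max_len
    return query_tokens + passage_tokens


if __name__ == "__main__":
    GROUP_SIZE = 4  # hypothetical value for illustration only
    before = max_text_tokens_per_device(16, GROUP_SIZE, 256, 256)
    after = max_text_tokens_per_device(2, GROUP_SIZE, 256, 256)
    print(f"batch 16: {before} tokens, batch 2: {after} tokens")
```

Under these assumptions, going from batch size 16 to 2 cuts the text token count per device by 8×, so if OOM persists the remaining memory pressure likely comes from elsewhere, for example the multimodal inputs or the model and optimizer state.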

If you have a minimal working config (train args + DS json) and a `requirements.txt`, that would greatly help us reproduce your results. Thanks!
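For clarity about what I mean by "DS json", here is a sketch of the kind of ZeRO-3 config in question, written out from Python. It is not the file shipped with the example and not necessarily identical to what I ran; the `"auto"` values assume the Hugging Face Trainer DeepSpeed integration fills them in from the training arguments, and the output file name is hypothetical.

```python
import json
import os
import tempfile

# Sketch of a ZeRO-3 config; "auto" values are resolved by the HF Trainer
# integration (assumption: the example launches training through it).
DS_CONFIG = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}


def write_ds_config(path, config=DS_CONFIG):
    """Write the config as JSON and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path


if __name__ == "__main__":
    # Hypothetical file name; pass the resulting path via --deepspeed.
    out = write_ds_config(os.path.join(tempfile.gettempdir(), "ds_zero3_sketch.json"))
    print(f"wrote {out}")
```

Switching `"stage"` to `0` gives the ZeRO-0 variant, in which every GPU holds a full copy of the model and optimizer state.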

