Out of GPU memory when running train_3D_diffusion.py #4

@thucz

Hi! DFM is great work, and I'm trying it out for my research.

However, when I ran the following command on four A100 (40 GB) GPUs, I got a CUDA out-of-memory error.
I have already set "batch_size" to 1 * ngpus in the get_train_settings function of train_3D_diffusion.py, but the error still appears. Do you know how to fix it?

ngpus=4

torchrun --nnodes 1 --nproc_per_node $ngpus experiment_scripts/train_3D_diffusion.py dataset=realestate setting_name=re name=re10k mode=cond feats_cond=true wandb=local ngpus=$ngpus use_guidance=true image_size=64

The log is:

......
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/group/30042/ozhengchen/pano_aigc/DFM/experiment_scripts/train_3D_diffusion.py", line 109, in train
    trainer.train()
  File "/group/30042/ozhengchen/pano_aigc/DFM/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 1218, in train
    losses, misc = self.model(data, render_video=render_video)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 905, in forward
    return self.p_losses(inp, t, *args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 722, in p_losses
    model_out, depth, misc = self.model(
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/pixelnerf_model_cond.py", line 675, in forward
    rgbfeats, depth, misc = self.renderer(trgt_c2w, intrinsics, new_xy, rf)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/renderer.py", line 345, in forward
    sigma_all, feats_all, _ = radiance_field(pts_all, viewdirs_all, fine=True)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/pixelnerf_model_cond.py", line 722, in <lambda>
    return lambda x, v, fine: self.pixelNeRF_joint(
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/pixelnerf_helpers.py", line 277, in forward
    mlp_output = self.mlp_fine(mlp_in, ns=num_context, time_emb=t)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/resnetfc_time_embed.py", line 246, in forward
    x = self.blocks[blkid](x, time_emb=time_emb)
  File "/group/30042/ozhengchen/ft_local/anaconda3/envs/dfm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/group/30042/ozhengchen/pano_aigc/DFM/PixelNeRF/resnetfc_time_embed.py", line 94, in forward
    return x_s + dx
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 39.59 GiB total capacity; 36.42 GiB already allocated; 191.19 MiB free; 36.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
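
To make the numbers in that error message easier to read, here is a minimal sketch that parses them and compares reserved against allocated memory. The message text is copied from the log above; `parse_oom` and `UNIT_TO_MIB` are hypothetical helper names for illustration and are not part of DFM or PyTorch.

```python
import re

# Error text copied from the log above (rejoined into one string).
MESSAGE = (
    "CUDA out of memory. Tried to allocate 288.00 MiB "
    "(GPU 0; 39.59 GiB total capacity; 36.42 GiB already allocated; "
    "191.19 MiB free; 36.74 GiB reserved in total by PyTorch)"
)

# Conversion factors so every size can be compared in MiB.
UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes reported in a CUDA OOM message, in MiB (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field {name!r} not found in message")
        sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(MESSAGE)
# Memory held by PyTorch's caching allocator but not backing live tensors.
slack = sizes["reserved"] - sizes["allocated"]
ratio = sizes["allocated"] / sizes["total"]
print(f"allocated / total capacity: {ratio:.1%}")
print(f"reserved but unallocated:   {slack:.2f} MiB")
print(f"requested allocation:       {sizes['requested']:.2f} MiB")
```

With these figures, allocated memory is about 92% of the card's capacity and reserved exceeds allocated by only about 0.32 GiB. The message's hint about max_split_size_mb targets the case where reserved is much larger than allocated, which does not look like the case here, so the pressure probably comes from live tensors in the forward pass rather than from fragmentation.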
