HPG hangs after a few epochs with multiple GPUs #1

@sasank-desaraju

Description

Running on 2 GPUs on HPG, training runs for a few epochs but then never starts the next epoch. The hang persists until the job times out, at which point Slurm writes to the log: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=47928841.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Running this with 31 GB of memory instead of 8 GB lets more epochs complete before the program hangs. This makes me think that memory is not being deallocated/reallocated properly at the end of each epoch. Slurm says only 1 second of GPU time was used, so this may all be on the GPU.
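One way to test the "memory isn't released between epochs" hypothesis is to log the process's peak RSS after every epoch and see whether it grows monotonically. A minimal sketch (the training step here is a placeholder, not the project's actual code; ru_maxrss units assume Linux, which matches the Slurm environment):

```python
import gc
import resource


def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def train_one_epoch():
    # Placeholder standing in for the real training loop (hypothetical)
    buffers = [bytearray(1024) for _ in range(1000)]
    return len(buffers)


history = []
for epoch in range(3):
    train_one_epoch()
    gc.collect()  # rule out objects merely awaiting collection
    history.append(peak_rss_mb())
    print(f"epoch {epoch}: peak RSS {history[-1]:.1f} MB")
```

If peak RSS climbs by roughly the same amount each epoch, the leak is on the host side (which would explain the cgroup oom-kill); if host memory is flat, the growth is likely in GPU allocations instead.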
