Running on 2 GPUs on HPG, training completes a few epochs but then never starts the next one. The job hangs until it times out, at which point Slurm writes the following to the log:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=47928841.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Re-running with 31 GB of memory instead of 8 GB gets through more epochs before the hang. This makes me think that memory is not being deallocated/reallocated properly at the end of each epoch. Slurm reports only 1 second of GPU time used, so this may all be on the GPU.
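One way to confirm the suspected per-epoch leak before the cgroup OOM killer fires is to log the process's peak resident memory (RSS) after every epoch and watch whether it grows. A minimal stdlib-only sketch (Linux assumed, where `ru_maxrss` is reported in kilobytes; `train_one_epoch` is a hypothetical stand-in for the actual training step):

```python
import resource


def rss_mb():
    # Peak resident set size of this process so far.
    # On Linux ru_maxrss is in KB; on macOS it is in bytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def train(num_epochs, train_one_epoch):
    """Run training while recording peak RSS after each epoch."""
    history = []
    for epoch in range(num_epochs):
        train_one_epoch()
        history.append(rss_mb())
        print(f"epoch {epoch}: peak RSS ~ {history[-1]:.0f} MB")
    return history
```

If the logged values climb steadily epoch over epoch, the leak is on the host side (which is what the cgroup OOM handler polices); if they stay flat while the job still dies, the growth is more likely in GPU memory, which `ru_maxrss` does not see.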