HPG hangs after a few epochs with multiple GPUs #1

@sasank-desaraju

Description

Running on 2 GPUs on HPG, training runs for a few epochs but then never starts the next epoch. The hang persists until the job times out, at which point Slurm writes to the log: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=47928841.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Running this with 31 GB of memory instead of 8 GB lets more epochs complete before the program hangs. This makes me think that memory is not being deallocated/reallocated properly at the end of each epoch. Slurm says only 1 second of GPU time was used, so this may all be on the GPU.
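One way to test the "memory isn't released between epochs" hypothesis is to log the process's peak RSS after every epoch and see whether it grows monotonically. A minimal sketch (the training step here is a placeholder, not the project's actual code; ru_maxrss units assume Linux, which matches the Slurm environment):

```python
import gc
import resource


def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def train_one_epoch():
    # Placeholder standing in for the real training loop (hypothetical)
    buffers = [bytearray(1024) for _ in range(1000)]
    return len(buffers)


history = []
for epoch in range(3):
    train_one_epoch()
    gc.collect()  # rule out objects merely awaiting collection
    history.append(peak_rss_mb())
    print(f"epoch {epoch}: peak RSS {history[-1]:.1f} MB")
```

If peak RSS climbs by roughly the same amount each epoch, the leak is on the host side (which would explain the cgroup oom-kill); if host memory is flat, the growth is likely in GPU allocations instead.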
