Inconsistent OOMs occur during long-running jobs #2

@knagrecha

Problem:

The Pilot partitioner's memory estimation is inexact and often underestimates the memory cost of a minibatch pass. As a result, the model can exceed its allocated memory bound during training and fail with an out-of-memory (OOM) error. This typically happens during the backward pass.

Quick fix: Increase the double buffer space, which reduces shard sizes and leaves more free memory as headroom.
Longer-term fix: Replace the Pilot partitioner with a more exact algorithm, or with one that does not pack shards so close to the memory limit.
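
To illustrate why the quick fix helps, here is a minimal sketch of a greedy partitioner that packs consecutive layers into shards under a budget of `capacity - double_buffer`. This is not the actual Pilot partitioner: `partition_layers`, `layer_costs`, `capacity`, and `double_buffer` are hypothetical names, and the costs are made-up unitless numbers. It only shows that reserving more buffer space lowers the per-shard budget, so shards get smaller and there is more slack when the estimate is too low.

```python
from typing import List


def partition_layers(
    layer_costs: List[int], capacity: int, double_buffer: int
) -> List[List[int]]:
    """Greedily pack consecutive layers into shards (hypothetical sketch).

    Each shard's estimated cost must fit in capacity - double_buffer.
    Returns a list of shards, each a list of layer indices.
    """
    budget = capacity - double_buffer
    if budget <= 0:
        raise ValueError("double_buffer must be smaller than capacity")

    shards: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, cost in enumerate(layer_costs):
        if cost > budget:
            raise ValueError(f"layer {idx} (cost {cost}) exceeds budget {budget}")
        # Close the current shard when the next layer would overflow the budget.
        if used + cost > budget:
            shards.append(current)
            current, used = [], 0
        current.append(idx)
        used += cost
    if current:
        shards.append(current)
    return shards


# Made-up estimated costs for six layers.
costs = [3, 3, 3, 3, 3, 3]
small_buffer = partition_layers(costs, capacity=16, double_buffer=4)  # budget 12
large_buffer = partition_layers(costs, capacity=16, double_buffer=8)  # budget 8
print(small_buffer)  # [[0, 1, 2, 3], [4, 5]]
print(large_buffer)  # [[0, 1], [2, 3], [4, 5]]
```

With the larger buffer, each shard is estimated at 6 units against a capacity of 16, so an underestimated backward pass has more room before it hits the limit. The trade-off is more shards, and therefore more swapping between them.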

Labels

bug: Something isn't working