Following the openwebtext example, I have repeatedly run into errors like this:
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 784.00 MiB. GPU 2 has a total capacity of 79.15 GiB of which 97.88 MiB is free. Process 2358418 has 5.68 GiB memory in use. Process 207602 has 16.28 GiB memory in use. Process 2212797 has 16.80 GiB memory in use. Process 1399451 has 16.28 GiB memory in use. Process 2825982 has 16.80 GiB memory in use. Including non-PyTorch memory, this process has 7.18 GiB memory in use. Of the allocated memory 5.21 GiB is allocated by PyTorch, and 312.71 MiB is reserved by PyTorch but unallocated.
```
At different times I use a varying number of NVIDIA H100 80GB GPUs (1 to 8). I should note that not all of them may be fully free, but according to nvidia-smi there is enough headroom. This happens with both 8B and 124M models (though less often with the 124M ones).

The purpose of this issue is to clear up the confusion about how much memory is needed, both overall and at particular stages of the algorithm. Thank you.
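In case it helps narrow this down, here is a rough sketch of how one could log per-GPU memory around each stage of the run to see where the allocation actually spikes. The helper name `log_gpu_memory` is mine, not part of the repo:

```python
import torch

def log_gpu_memory(stage: str) -> str:
    """Summarize per-GPU memory at a named stage of the run.

    Uses torch.cuda.mem_get_info (free/total as reported by the driver)
    and torch.cuda.max_memory_allocated (PyTorch's peak for this process).
    """
    if not torch.cuda.is_available():
        return f"[{stage}] CUDA not available"
    lines = []
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        peak = torch.cuda.max_memory_allocated(i)
        lines.append(
            f"[{stage}] GPU {i}: {free / 2**30:.2f} GiB free of "
            f"{total / 2**30:.2f} GiB, peak allocated {peak / 2**30:.2f} GiB"
        )
    return "\n".join(lines)

# Call this before/after each phase (data loading, forward, backward, ...)
print(log_gpu_memory("before training"))
```

Resetting the peak counter between phases with `torch.cuda.reset_peak_memory_stats()` would make the per-stage numbers easier to attribute.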