Open
Labels: question (Further information is requested)
Description
Hi,
We are trying to train a SuperBPE tokenizer for an upcoming LLM; the corpus is 12T in 30 languages.
The first step of training runs fine, but the second step requires an immense amount of RAM: with only 60GB of data, the job is killed by an OOM error on a node with 1.5TB of RAM. Specifically, it is the tokenize_words step that fails.
I have doubts that 60GB of data should require this much memory. Is there a memory leak, or do you have a clear idea of why step 2 requires so much more RAM than step 1?
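For reference, here is the rough back-of-the-envelope estimate that made us suspicious. It is only a sketch: it assumes (we have not checked the code) that tokenize_words holds an in-memory mapping from each unique pre-tokenized word to its token-id sequence, and every count and size below is an illustrative guess rather than a measurement.
```python
import sys

def estimate_word_map_memory(n_unique_words: int,
                             avg_word_len: int = 12,
                             avg_tokens_per_word: int = 6) -> float:
    """Very rough upper bound, in GB, for a dict[str, list[int]] word map."""
    key_bytes = sys.getsizeof("x" * avg_word_len)            # str object + characters
    value_bytes = sys.getsizeof([0] * avg_tokens_per_word)   # list header + pointers
    value_bytes += avg_tokens_per_word * 28                  # ~28 B per small Python int
    per_entry = key_bytes + value_bytes + 100                # guessed dict-slot / hashing overhead
    return n_unique_words * per_entry / 1e9

# e.g. 500M unique "words" across 30 languages (a guess, not a measurement):
print(f"~{estimate_word_map_memory(500_000_000):.0f} GB")   # already a few hundred GB for the map alone
```
If that picture is even roughly right, the memory would be driven by the number of unique words in a 30-language corpus rather than by the raw 60GB of text, and any per-worker copies of the map would multiply the peak further. Does that match how step 2 is actually implemented?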
Best Regards,