
Requires a huge amount of memory #6

@jgcb00

Description

Hi,
We are trying to train a SuperBPE tokenizer for an upcoming LLM, on a corpus of 12T in 30 languages.
While training the tokenizer, the second step of the training requires an immense amount of RAM.
Currently, with 60GB of data, the job is killed by an OOM error on a node with 1.5TB of RAM.
Specifically, it's the tokenize_words step that fails.
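For reference, here is roughly how I'm checking the peak memory of the job; the command line in this sketch is just a placeholder, not the repo's actual entry point:

```python
import resource
import subprocess

# Launch the training as a child process and report its peak RSS afterwards.
# "train_tokenizer.py" and "--corpus_dir" are hypothetical placeholders for
# whatever actually starts the tokenizer training.
cmd = ["python", "train_tokenizer.py", "--corpus_dir", "corpus_60GB"]
subprocess.run(cmd, check=False)

# On Linux, ru_maxrss is reported in KiB.
peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS of the training process: {peak_kib / 1024**2:.1f} GiB")
```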

I have doubts that 60GB of data should require this much memory.
Is there a memory leak, or do you have a clear idea of why step 2 requires much more RAM than step 1?
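To tell a leak apart from expected scaling, my plan is to rerun the training on subsamples of the corpus and compare peak RAM at each size. A minimal sketch of how I'm building those subsamples (directory names and the *.txt layout are placeholders, not paths from this repo):

```python
import random
import shutil
from pathlib import Path

# Build a smaller corpus by copying a random fraction of the text files,
# so peak RAM of the tokenize_words step can be compared at several sizes.
def subsample_corpus(src_dir: str, dst_dir: str, fraction: float, seed: int = 0) -> None:
    files = sorted(Path(src_dir).glob("*.txt"))
    random.Random(seed).shuffle(files)
    keep = files[: max(1, int(len(files) * fraction))]
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for f in keep:
        shutil.copy(f, out / f.name)

for frac in (0.1, 0.25, 0.5):
    subsample_corpus("corpus_60GB", f"corpus_{int(frac * 100)}pct", frac)
```

If peak memory grows roughly linearly with corpus size, step 2 probably just needs more headroom than step 1; if it jumps much faster than that between sizes, it would point more toward a leak or a super-linear data structure.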

Best Regards,
