Description
I am training a tokenizer with a vocab size of 50,368 and have added a few special tokens (`<mask>`, `<pad>`, `<unk>`, `<s>`, `</s>`); a rough sketch of the setup is included at the end of this description.
However, the count-pairs step runs into trouble: it takes about 20 minutes to reach 99% progress, and from that point on it slows down further, with an estimated 2-3 hours remaining to complete.
Is this a problem with the Hugging Face tokenizers library, or is it an issue with SuperBPE?
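For reference, this is roughly how the training is set up (a minimal sketch using the Hugging Face tokenizers API; the corpus path, the byte-level pre-tokenizer, and the exact special-token strings are my assumptions, and SuperBPE's own training pipeline may differ):

```python
# Minimal sketch of the training setup described above.
# "corpus.txt" is a placeholder for the actual training data.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50_368,
    special_tokens=["<mask>", "<pad>", "<unk>", "<s>", "</s>"],
    show_progress=True,  # this is the progress bar that stalls near 99%
)

# The count-pairs phase mentioned above happens inside this call.
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```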