
Unusually slow training speed #17

@Iambestfeed

I am training a tokenizer with a vocab size of 50,368 and have added some special tokens (mask, pad, unk, s, /s).

However, the count-pairs step runs into trouble: it takes 20 minutes to reach 99% progress, then slows down further, with an estimated 2-3 hours remaining to complete.

Is this a problem with the Hugging Face tokenizers library, or is it an issue with SuperBPE?
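For context, here is a minimal sketch of the training setup described above, using the Hugging Face `tokenizers` library. The exact special-token strings and the corpus are assumptions (the issue only lists mask, pad, unk, s, /s); a tiny in-memory corpus stands in for the real training data:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Assumed special-token strings; the issue only names mask/pad/unk/s//s.
SPECIAL_TOKENS = ["<mask>", "<pad>", "<unk>", "<s>", "</s>"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50368,  # target vocab size from the issue
    special_tokens=SPECIAL_TOKENS,
)

# Stand-in corpus: on real data, the pair-counting phase is what
# dominates training time, which is the step the issue reports as slow.
corpus = ["hello world", "hello tokenizer", "train a bpe tokenizer"] * 100
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = tokenizer.get_vocab()
print(len(vocab), "<mask>" in vocab)
```

On a small corpus the trainer simply stops once no more merges are possible, well short of the 50,368 target; only the full corpus would reproduce the reported timing.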

Labels: question (further information is requested)