
Unusually slow training speed #17

@Iambestfeed

I am training a tokenizer with a vocab size of 50,368 and have added some special tokens (mask, pad, unk, s, /s).

However, the count-pairs step runs into trouble: it takes 20 minutes to reach 99% progress, then slows down further, with an estimated 2-3 hours remaining to complete.

Is this a problem with the Hugging Face tokenizers library, or is it an issue with SuperBPE?
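For context, here is a minimal sketch of the training setup described above, using the Hugging Face `tokenizers` library. The exact special-token strings and the corpus are assumptions (the issue only lists mask, pad, unk, s, /s); a tiny in-memory corpus stands in for the real training data:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Assumed special-token strings; the issue only names mask/pad/unk/s//s.
SPECIAL_TOKENS = ["<mask>", "<pad>", "<unk>", "<s>", "</s>"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50368,  # target vocab size from the issue
    special_tokens=SPECIAL_TOKENS,
)

# Stand-in corpus: on real data, the pair-counting phase is what
# dominates training time, which is the step the issue reports as slow.
corpus = ["hello world", "hello tokenizer", "train a bpe tokenizer"] * 100
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = tokenizer.get_vocab()
print(len(vocab), "<mask>" in vocab)
```

On a small corpus the trainer simply stops once no more merges are possible, well short of the 50,368 target; only the full corpus would reproduce the reported timing.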

Labels: question (further information is requested)