Train BasicTokenizer on GPU with PyTorch, 100x speedup by kuprel · Pull Request #38 · karpathy/minbpe

kuprel · 2024-02-22T01:33:24Z

The following files are added:

minbpe/torch/base.py
- Contains merge_torch
minbpe/torch/basic.py
- Contains BasicTokenizerTorch, overrides the train and encode methods of BasicTokenizer
minbpe/torch/regex.py
- Contains RegexTokenizerTorch, overrides the encode_ordinary method of RegexTokenizer
minbpe/torch/gpt4.py
- Contains GPT4TokenizerTorch, mostly inherits from GPT4Tokenizer, but uses RegexTokenizerTorch's encode method
train_torch.py
- Similar to train.py but trains BasicTokenizerTorch

The following files are modified:

minbpe/__init__.py
- Import torch tokenizers
tests/test_tokenizer.py
- Add torch tokenizers to tests

It takes 67.4 seconds on an H100 80GB SXM5 to train the BasicTokenizerTorch with a vocab_size of 512 on 308MB of Enron emails. The original code takes 2hrs 15min on an M2 Air with Python 3.11 to do this.

I'm not sure if RegexTokenizerTorch or GPT4TokenizerTorch can benefit much from pytorch since there are many chunks of varying lengths, i.e. a "ragged tensor". These tokenizers are helpful for sanity checks though. For example, the test_gpt4_tiktoken_equality tests all pass suggesting that merge_torch is correctly implemented.

I also made a new repository minbpe-pytorch in case adding pytorch support is beyond the scope of this project.

kuprel · 2024-02-23T20:10:29Z

Using an H100 and int16, it's now 108x speedup over the original implementation on M2 air

kuprel · 2024-02-25T04:49:29Z

All of the tests pass

karpathy · 2024-02-27T00:03:56Z

Ok I'll step through this soon to take a look.
Not sure that I love duplicating everything and creating torch versions of it.
Would we be able to potentially isolate the def that is the bottleneck (I'm guessing in base.py), and just surgically have a fast version of one of those defs?
If that isn't straight forward happy to link to minbpe-pytorch.

kuprel · 2024-02-27T06:31:05Z

Thanks for the feedback! I made the diff more surgical. Now the only added files are:

minbpe/basic_torch.py
- Contains merge_torch and BasicTorchTokenizer, overrides the train and encode methods of BasicTokenizer
train_torch.py
- Similar to train.py but trains BasicTorchTokenizer

And the following files are lightly modified:

minbpe/__init__.py
- Import BasicTorchTokenizer
tests/test_tokenizer.py
- Add BasicTorchTokenizer to tests

owos · 2025-11-01T06:24:56Z

How do I actually train with a vocab size of 10000. The code says to concat the whole dataset to a giant string. This is breaking my computer.

kuprel added 12 commits February 21, 2024 16:53

pytorch train

35752c3

train_basic

81e0599

cuda

10de488

cuda

8f27715

cuda

c46ff79

cuda

2a9f17d

cuda

24ad07b

cuda

8e14487

cuda

dd451af

add comments

94fedba

change header comment

a428642

fixed verbose

c7108ef

kuprel changed the title ~~Train BasicTokenizer on GPU with PyTorch~~ Train BasicTokenizer on GPU with PyTorch, 55x speedup Feb 22, 2024

kuprel changed the title ~~Train BasicTokenizer on GPU with PyTorch, 55x speedup~~ Train BasicTokenizer on GPU with PyTorch, 100x speedup Feb 23, 2024

kuprel added 15 commits February 23, 2024 19:02

fix repeated tokens bug

120faab

fix comments

96f99b1

train_gpu -> train_pytorch

7807dc0

BasicPyTorchTokenizer class separate, added vectorized encode method

6e3c21c

.

7666578

simplified

1012c53

fixed merge

0b73000

fixed merge actually this time

b108bce

add BasicTokenizerTorch to test_tokenizer.py, all tests pass

5f8d778

merges order

f08cab9

add BasicTokenizerTorch to more tests, all tests pass

1e2ed28

regex_torch, refactor

9032cab

refactor

c51c46c

GPT4TokenizerTorch passes tests

2d1a74f

removed unused method

787a672

kuprel added 5 commits February 24, 2024 15:15

make train_torch.py near identical to train.py

a23fcc3

better file docstrings

b1b5f21

super.train -> super().train

d8cd8ca

use cuda if available

3ff8423

only train BasicTokenizerTorch in train_torch.py

a87ed6b

kuprel added 2 commits February 26, 2024 22:17

surgical

3931604

surgical

18ef297

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Train BasicTokenizer on GPU with PyTorch, 100x speedup#38

Train BasicTokenizer on GPU with PyTorch, 100x speedup#38
kuprel wants to merge 34 commits intokarpathy:masterfrom
kuprel:master

kuprel commented Feb 22, 2024 •

edited

Loading

Uh oh!

kuprel commented Feb 23, 2024

Uh oh!

kuprel commented Feb 25, 2024

Uh oh!

karpathy commented Feb 27, 2024

Uh oh!

kuprel commented Feb 27, 2024

Uh oh!

owos commented Nov 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kuprel commented Feb 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kuprel commented Feb 23, 2024

Uh oh!

kuprel commented Feb 25, 2024

Uh oh!

karpathy commented Feb 27, 2024

Uh oh!

kuprel commented Feb 27, 2024

Uh oh!

owos commented Nov 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kuprel commented Feb 22, 2024 •

edited

Loading