
01 Tokenizer - Break Text Into Numbers

Why do we need a tokenizer?

LLMs work with numbers, not text, so before anything reaches the model we need to convert characters (or words) into token IDs.

Text:  "hello"
         ↓
Tokens: ['h', 'e', 'l', 'l', 'o']
         ↓
IDs:    [2, 3, 4, 4, 5]

How it works (in simple terms)

  1. Create vocabulary: Make a dictionary of all characters/words
  2. Map to numbers: Each character gets a unique ID
  3. Encode: Convert text → token IDs
  4. Decode: Convert token IDs → text
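The four steps above could be sketched as a minimal character-level tokenizer. This is an illustration, not the repo's SimpleTokenizer — the class name, vocabulary ordering, and special-token IDs are all assumptions here:

```javascript
// Minimal character-level tokenizer sketch (hypothetical; the repo's
// SimpleTokenizer may order its vocabulary and special tokens differently).
class CharTokenizer {
  constructor(corpus) {
    // 1. Create vocabulary: every unique character in the corpus,
    //    reserving ID 0 for <PAD> and ID 1 for <UNK>.
    const chars = [...new Set(corpus)].sort();
    this.vocab = { "<PAD>": 0, "<UNK>": 1 };
    chars.forEach((ch, i) => { this.vocab[ch] = i + 2; });
    // 2. Map to numbers: build the reverse lookup for decoding.
    this.inverse = Object.fromEntries(
      Object.entries(this.vocab).map(([ch, id]) => [id, ch])
    );
  }

  // 3. Encode: text → token IDs (unknown characters fall back to <UNK>).
  encode(text) {
    return [...text].map((ch) => this.vocab[ch] ?? this.vocab["<UNK>"]);
  }

  // 4. Decode: token IDs → text (special tokens are dropped).
  decode(ids) {
    return ids
      .filter((id) => id > 1)
      .map((id) => this.inverse[id])
      .join("");
  }
}

const tok = new CharTokenizer("hello world");
const ids = tok.encode("hello");
console.log(ids);            // → [5, 4, 6, 6, 7]
console.log(tok.decode(ids)); // → "hello" (round-trips)
```

Note that encode/decode is lossless for characters in the vocabulary, while anything unseen collapses to `<UNK>` — one reason real tokenizers are trained on large corpora.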

Example usage

const SimpleTokenizer = require("./simple_tokenizer");

const tokenizer = new SimpleTokenizer();

// Encode text to numbers
const tokens = tokenizer.encode("hello world");
console.log(tokens); // [2, 3, 4, 4, 5, 59, 6, 5, 7, 4, 8]

// Decode numbers back to text
const text = tokenizer.decode(tokens);
console.log(text); // "hello world"

// Pad to fixed length (important for neural networks!)
const padded = tokenizer.pad_sequence(tokens, 20);
console.log(padded); // [2, 3, 4, 4, 5, 59, 6, 5, 7, 4, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0]
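The padding step could be implemented roughly like this (a sketch — the function name and the truncation behaviour for sequences longer than the target length are assumptions, not necessarily what `pad_sequence` does in the repo):

```javascript
// Sketch of fixed-length padding, assuming <PAD> has ID 0 and that
// over-length sequences are truncated to the target length.
function padSequence(ids, length, padId = 0) {
  if (ids.length >= length) return ids.slice(0, length);
  return ids.concat(Array(length - ids.length).fill(padId));
}

console.log(padSequence([2, 3, 4], 6)); // → [2, 3, 4, 0, 0, 0]
```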

Key concepts

  • Vocabulary: Set of all possible tokens (characters in this case)
  • Token ID: Unique number assigned to each token
  • Padding: Adding zeros to make all sequences same length
  • Special tokens: Like <PAD> for padding, <UNK> for unknown characters
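To make the special-token idea concrete: reserving fixed IDs lets the model tell real characters apart from structural markers. The IDs below are illustrative, not the repo's actual assignments:

```javascript
// Special tokens get reserved IDs (illustrative values).
const vocab = { "<PAD>": 0, "<UNK>": 1, a: 2, b: 3 };
const encode = (text) => [...text].map((ch) => vocab[ch] ?? vocab["<UNK>"]);

console.log(encode("abz")); // 'z' is not in the vocabulary → [2, 3, 1]
```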

Real LLMs use more sophisticated tokenizers

This is a character-level tokenizer. Real LLMs use:

  • BPE (Byte Pair Encoding): Repeatedly merges the most frequent adjacent pair of tokens into a new, larger token
  • WordPiece: Splits words into frequent, likely meaningful subword pieces
  • SentencePiece: Treats text as a raw character stream, so it works across languages without language-specific pre-tokenization

But the principle is the same: text → numbers
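To give a feel for the difference, here is a toy sketch of a single BPE merge step — count adjacent token pairs, pick the most frequent, and fuse it into one token. Real BPE implementations repeat this thousands of times over a large corpus and handle bytes, not characters:

```javascript
// One BPE merge step: count adjacent pairs, merge the most frequent one.
function bpeMergeStep(tokens) {
  // Count every adjacent pair (using NUL as a separator that cannot
  // appear inside a token in this toy example).
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const pair = tokens[i] + "\u0000" + tokens[i + 1];
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  // Find the most frequent pair.
  let best = null, bestCount = 0;
  for (const [pair, n] of counts) {
    if (n > bestCount) { best = pair; bestCount = n; }
  }
  if (!best) return tokens;
  const [a, b] = best.split("\u0000");
  // Rewrite the sequence, fusing each occurrence of the pair.
  const merged = [];
  for (let i = 0; i < tokens.length; i++) {
    if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
      merged.push(a + b); // the pair becomes one token
      i++;
    } else {
      merged.push(tokens[i]);
    }
  }
  return merged;
}

console.log(bpeMergeStep([..."hello hello"]));
// → ['he', 'l', 'l', 'o', ' ', 'he', 'l', 'l', 'o']
```

After enough merges, frequent words end up as single tokens while rare words stay split into pieces — which is why real vocabularies hold tens of thousands of subwords instead of a handful of characters.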