LLMs work with numbers, not text, so the first step is converting characters (or words) into token IDs.
Text: "hello"
↓
Tokens: ['h', 'e', 'l', 'l', 'o']
↓
IDs: [2, 3, 4, 4, 5]
- Create vocabulary: Make a dictionary of all characters/words
- Map to numbers: Each character gets a unique ID
- Encode: Convert text → token IDs
- Decode: Convert token IDs → text
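The four steps above can be sketched as a minimal character-level tokenizer. This is a hypothetical stand-in for the real `./simple_tokenizer` module: the class and method names match the usage below, but the IDs it produces will differ from the examples, since the real vocabulary ordering isn't shown.

```javascript
// Hypothetical sketch of a character-level tokenizer, standing in for the
// real ./simple_tokenizer module. IDs here will differ from the examples
// below, because the real vocabulary ordering isn't shown.
class SimpleTokenizer {
  constructor() {
    // Create vocabulary: special tokens first, then one ID per character
    this.vocab = { "<PAD>": 0, "<UNK>": 1 };
    this.inverse = ["<PAD>", "<UNK>"];
    for (const ch of "abcdefghijklmnopqrstuvwxyz ") {
      this.vocab[ch] = this.inverse.length; // next free ID
      this.inverse.push(ch);
    }
  }

  // Encode: text → token IDs (unknown characters map to <UNK>)
  encode(text) {
    return [...text].map((ch) => this.vocab[ch] ?? this.vocab["<UNK>"]);
  }

  // Decode: token IDs → text (padding is dropped)
  decode(ids) {
    return ids
      .filter((id) => id !== this.vocab["<PAD>"])
      .map((id) => this.inverse[id])
      .join("");
  }

  // Pad with <PAD> (ID 0), or truncate, to a fixed length
  pad_sequence(ids, length) {
    const padded = ids.slice(0, length);
    while (padded.length < length) padded.push(this.vocab["<PAD>"]);
    return padded;
  }
}

module.exports = SimpleTokenizer;
```

Note that encode/decode round-trips any text built from the vocabulary, while characters outside it collapse to `<UNK>` and are not recoverable.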
```javascript
const SimpleTokenizer = require("./simple_tokenizer");
const tokenizer = new SimpleTokenizer();

// Encode text to numbers
const tokens = tokenizer.encode("hello world");
console.log(tokens); // [2, 3, 4, 4, 5, 59, 6, 5, 7, 4, 8]

// Decode numbers back to text
const text = tokenizer.decode(tokens);
console.log(text); // "hello world"

// Pad to a fixed length (important for neural networks!)
const padded = tokenizer.pad_sequence(tokens, 20);
console.log(padded); // [2, 3, 4, 4, 5, 59, 6, 5, 7, 4, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

- Vocabulary: Set of all possible tokens (characters in this case)
- Token ID: Unique number assigned to each token
- Padding: Adding zeros to make all sequences same length
- Special tokens: Like `<PAD>` for padding, `<UNK>` for unknown characters
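The two special tokens can be seen in a small self-contained demo. The IDs assumed here (`<PAD>` = 0, `<UNK>` = 1) are a common convention, and the tiny vocabulary reuses the character IDs from the "hello" example above:

```javascript
// Demo of special tokens (IDs assumed: <PAD> = 0, <UNK> = 1).
const PAD = 0;
const UNK = 1;
const vocab = { h: 2, e: 3, l: 4, o: 5 }; // tiny vocabulary for illustration

// Unknown characters fall back to <UNK>
const encode = (text) => [...text].map((ch) => vocab[ch] ?? UNK);

// Padding fills with <PAD> so every sequence has the same length
const pad = (ids, len) => ids.concat(Array(len - ids.length).fill(PAD));

console.log(encode("hexo")); // 'x' is not in the vocab → [2, 3, 1, 5]
console.log(pad([2, 3], 5)); // [2, 3, 0, 0, 0]
```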
This is a character-level tokenizer. Real LLMs use:
- BPE (Byte Pair Encoding): Groups characters that appear together
- WordPiece: Breaks text into likely meaningful pieces
- SentencePiece: Works across languages
But the principle is the same: text → numbers
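The core of BPE, repeatedly merging the most frequent adjacent pair of tokens, can be sketched in a few lines. This is a toy illustration of one merge step, not any real tokenizer's implementation:

```javascript
// Toy sketch of one BPE merge step: count adjacent pairs, merge the winner.
function mostFrequentPair(tokens) {
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const pair = tokens[i] + "\u0000" + tokens[i + 1];
    counts.set(pair, (counts.get(pair) ?? 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [pair, count] of counts) {
    if (count > bestCount) { best = pair; bestCount = count; }
  }
  return best ? best.split("\u0000") : null;
}

function mergePair(tokens, [a, b]) {
  const out = [];
  for (let i = 0; i < tokens.length; i++) {
    if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
      out.push(a + b); // the merged pair becomes a single new token
      i++;             // skip the second half of the pair
    } else {
      out.push(tokens[i]);
    }
  }
  return out;
}

let tokens = [..."low lower lowest"];
const pair = mostFrequentPair(tokens);
// → ['l', 'o'] ('lo' and 'ow' both appear 3 times; the first-seen pair wins ties)
tokens = mergePair(tokens, pair);
```

A real BPE tokenizer repeats this merge step thousands of times on a large corpus, so frequent character groups become single tokens. But as above: text → numbers.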