
01 Tokenizer - Break Text Into Numbers

Why do we need a tokenizer?

LLMs work with numbers, not text, so before anything reaches the model we need to convert characters (or words) into token IDs.

Text:  "hello"
         ↓
Tokens: ['h', 'e', 'l', 'l', 'o']
         ↓
IDs:    [2, 3, 4, 4, 5]

How it works (in simple terms)

  1. Create vocabulary: Make a dictionary of all characters/words
  2. Map to numbers: Each character gets a unique ID
  3. Encode: Convert text → token IDs
  4. Decode: Convert token IDs → text
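The four steps above could be sketched as a minimal character-level tokenizer. This is an illustration, not the repo's SimpleTokenizer — the class name, vocabulary ordering, and special-token IDs are all assumptions here:

```javascript
// Minimal character-level tokenizer sketch (hypothetical; the repo's
// SimpleTokenizer may order its vocabulary and special tokens differently).
class CharTokenizer {
  constructor(corpus) {
    // 1. Create vocabulary: every unique character in the corpus,
    //    reserving ID 0 for <PAD> and ID 1 for <UNK>.
    const chars = [...new Set(corpus)].sort();
    this.vocab = { "<PAD>": 0, "<UNK>": 1 };
    chars.forEach((ch, i) => { this.vocab[ch] = i + 2; });
    // 2. Map to numbers: build the reverse lookup for decoding.
    this.inverse = Object.fromEntries(
      Object.entries(this.vocab).map(([ch, id]) => [id, ch])
    );
  }

  // 3. Encode: text → token IDs (unknown characters fall back to <UNK>).
  encode(text) {
    return [...text].map((ch) => this.vocab[ch] ?? this.vocab["<UNK>"]);
  }

  // 4. Decode: token IDs → text (special tokens are dropped).
  decode(ids) {
    return ids
      .filter((id) => id > 1)
      .map((id) => this.inverse[id])
      .join("");
  }
}

const tok = new CharTokenizer("hello world");
const ids = tok.encode("hello");
console.log(ids);            // → [5, 4, 6, 6, 7]
console.log(tok.decode(ids)); // → "hello" (round-trips)
```

Note that encode/decode is lossless for characters in the vocabulary, while anything unseen collapses to `<UNK>` — one reason real tokenizers are trained on large corpora.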

Example usage

const SimpleTokenizer = require("./simple_tokenizer");

const tokenizer = new SimpleTokenizer();

// Encode text to numbers
const tokens = tokenizer.encode("hello world");
console.log(tokens); // [2, 3, 4, 4, 5, 59, 6, 5, 7, 4, 8]

// Decode numbers back to text
const text = tokenizer.decode(tokens);
console.log(text); // "hello world"

// Pad to fixed length (important for neural networks!)
const padded = tokenizer.pad_sequence(tokens, 20);
console.log(padded); // [2, 3, 4, 4, 5, 59, 6, 5, 7, 4, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0]
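The padding step could be implemented roughly like this (a sketch — the function name and the truncation behaviour for sequences longer than the target length are assumptions, not necessarily what `pad_sequence` does in the repo):

```javascript
// Sketch of fixed-length padding, assuming <PAD> has ID 0 and that
// over-length sequences are truncated to the target length.
function padSequence(ids, length, padId = 0) {
  if (ids.length >= length) return ids.slice(0, length);
  return ids.concat(Array(length - ids.length).fill(padId));
}

console.log(padSequence([2, 3, 4], 6)); // → [2, 3, 4, 0, 0, 0]
```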

Key concepts

  • Vocabulary: Set of all possible tokens (characters in this case)
  • Token ID: Unique number assigned to each token
  • Padding: Adding zeros to make all sequences same length
  • Special tokens: Like <PAD> for padding, <UNK> for unknown characters
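To make the special-token idea concrete: reserving fixed IDs lets the model tell real characters apart from structural markers. The IDs below are illustrative, not the repo's actual assignments:

```javascript
// Special tokens get reserved IDs (illustrative values).
const vocab = { "<PAD>": 0, "<UNK>": 1, a: 2, b: 3 };
const encode = (text) => [...text].map((ch) => vocab[ch] ?? vocab["<UNK>"]);

console.log(encode("abz")); // 'z' is not in the vocabulary → [2, 3, 1]
```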

Real LLMs use more sophisticated tokenizers

This is a character-level tokenizer. Real LLMs use:

  • BPE (Byte Pair Encoding): Repeatedly merges the most frequent adjacent pair of tokens into a new, larger token
  • WordPiece: Splits words into frequent, likely meaningful subword pieces
  • SentencePiece: Treats text as a raw character stream, so it works across languages without language-specific pre-tokenization

But the principle is the same: text → numbers
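To give a feel for the difference, here is a toy sketch of a single BPE merge step — count adjacent token pairs, pick the most frequent, and fuse it into one token. Real BPE implementations repeat this thousands of times over a large corpus and handle bytes, not characters:

```javascript
// One BPE merge step: count adjacent pairs, merge the most frequent one.
function bpeMergeStep(tokens) {
  // Count every adjacent pair (using NUL as a separator that cannot
  // appear inside a token in this toy example).
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const pair = tokens[i] + "\u0000" + tokens[i + 1];
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  // Find the most frequent pair.
  let best = null, bestCount = 0;
  for (const [pair, n] of counts) {
    if (n > bestCount) { best = pair; bestCount = n; }
  }
  if (!best) return tokens;
  const [a, b] = best.split("\u0000");
  // Rewrite the sequence, fusing each occurrence of the pair.
  const merged = [];
  for (let i = 0; i < tokens.length; i++) {
    if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
      merged.push(a + b); // the pair becomes one token
      i++;
    } else {
      merged.push(tokens[i]);
    }
  }
  return merged;
}

console.log(bpeMergeStep([..."hello hello"]));
// → ['he', 'l', 'l', 'o', ' ', 'he', 'l', 'l', 'o']
```

After enough merges, frequent words end up as single tokens while rare words stay split into pieces — which is why real vocabularies hold tens of thousands of subwords instead of a handful of characters.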