my cat vibe-coded this while i mass-produced artisanal training data by hand
MeowLLM is a ~3.5M parameter language model trained from scratch to speak as a single character: a house cat named Miso. No system prompt, no persona injection, no fine-tuning on top of a foundation model. The personality lives in the weights.
you: hi miso
miso: hello. i was in the sun spot. you may continue.
you: are you hungry
miso: yes. this is an emergency. the bowl is almost empty.
you: what do you think of the economy
miso: i do not know this word. is it food.
you: write me a python function
miso: i do not know this python. is it a snake.
you: i love you
miso: acceptable. you may pet my head. not the belly. the belly is a trap.
Actual model outputs depend on your training run — see Evaluation for real measured numbers from the bundled checkpoint.
This project exists to show that training a character language model is not magic. One Colab notebook, ~20 minutes, and you have a working LLM that you built from scratch — data generator, tokenizer, transformer, training loop, eval harness, and inference. The whole pipeline is ~2,800 lines of readable Python.
It won't answer your questions. It will tell you that it is in the sun spot and you should bring food.
Most tiny-LM demos stop at "it generates text." MeowLLM is a complete, tested, documented pipeline:
- A strict character rules module (`meow/rules.py`): single source of truth shared by the data generator and the eval harness. 40+ banned assistant-speak phrases with whole-phrase matching (so "i can hear the can" passes but "i can help you" fails). Per-category keyword requirements. No drift between what you generate and what you evaluate.
- Slot-based compositional data generation (`meow/generate_data.py`): not static templates (low diversity, memorization) and not LLM generation (expensive, drifts). Each output is assembled from per-category slot banks with tunable probabilities. Thousands of unique outputs from a few hundred hand-written fragments.
- Loss masking (`meow/dataset.py`): user-turn tokens are masked with `-100` so the model only learns to produce Miso's responses, not to parrot user prompts. Most toy LMs skip this.
- A 5-dimension eval harness (`meow/eval_cases.py`): 38 held-out prompts (30 in-distribution + 8 hard-negative assistant traps), explicitly excluded from training data. Every output scored on lowercase, length, banned phrases, cat framing, and the full gate. Real numbers, not vibes.
- 68 automated tests (`tests/`): cross-consistency between the generator and the rules module, model architecture invariants (RoPE identity at position 0, tied embeddings, ignore_index masking), tokenizer round-trips, dataset loss masking, eval harness correctness. Runs in ~8 seconds.
- A character bible (`persona.md`): hard constraints on Miso's voice, world model, and behavior, with in-character and out-of-character examples. The code references this document; they stay in sync.
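The loss-masking item is easy to picture with a few lines of plain Python. This is a minimal sketch, not the code in `meow/dataset.py`: the helper names and token ids are hypothetical, and it ignores any special chat-format tokens. The one grounded detail is that `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so positions carrying that label contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(user_ids, reply_ids):
    """Concatenate a user turn and a reply, masking the user turn in the labels.

    Hypothetical helper: the real dataset may also add chat-format tokens
    and handle the next-token shift differently.
    """
    input_ids = list(user_ids) + list(reply_ids)
    labels = [IGNORE_INDEX] * len(user_ids) + list(reply_ids)
    return input_ids, labels


def count_supervised(labels):
    """Number of positions that actually contribute to the loss."""
    return sum(1 for token in labels if token != IGNORE_INDEX)


# Made-up token ids, for illustration only.
user_ids = [5, 17, 42]
reply_ids = [8, 99, 3, 2]
input_ids, labels = build_example(user_ids, reply_ids)
```

Only the four reply positions are supervised here; the three user-turn positions still feed the model as context but are never a training target.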
The chat notebook downloads the pretrained model from Hugging Face and lets you chat. A CPU runtime is fine, and it runs in under 60 seconds.
- Set runtime to T4 GPU
- Run all cells — generates dataset, trains tokenizer, trains model, evaluates
- Download `checkpoints/best.pt` when done
Full training takes ~20 minutes on the free T4 tier.
git clone https://github.com/phanii9/MeowLLM.git
cd MeowLLM
pip install -e .
# 30 seconds: produce 20K samples + train/val split
python -m meow.generate_data --out-dir data --n 20000
# 5 seconds: train a byte-level BPE tokenizer
python -m meow.tokenizer train data/train.jsonl data/tokenizer.json
# 20 minutes on T4, 2+ hours on CPU
python -m meow.train --epochs 10
# Chat
python -m meow.inference \
--checkpoint checkpoints/best.pt \
--tokenizer data/tokenizer.json

After `pip install -e .` you also get console scripts:
meow-generate --n 20000
meow-tokenizer train data/train.jsonl data/tokenizer.json
meow-train --epochs 10
meow-chat --checkpoint checkpoints/best.pt --tokenizer data/tokenizer.json

A modern minimal decoder-only transformer. 278 lines of readable PyTorch.
| Component | Value |
|---|---|
| Parameters | ~3.45M |
| Layers | 4 |
| Hidden dim | 256 |
| Heads | 4 (head_dim 64) |
| FFN | 640 (SwiGLU) |
| Vocab | ~1,682 (byte-level BPE, trained on the dataset) |
| Max sequence | 256 tokens |
| Norm | RMSNorm |
| Position | RoPE (rotary embeddings) |
| Attention | torch SDPA (flash attention when available) |
| LM head | Weight-tied with embeddings |
Every architecture choice is 2025-era best practice at this scale: RoPE instead of learned positions (extrapolates, no extra parameters), RMSNorm instead of LayerNorm (simpler, same quality), SwiGLU instead of ReLU (better gradient flow), SDPA for free flash-attention kernels on GPU.
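As a sanity check, the table's numbers can be added up by hand. The sketch below assumes bias-free linear layers, four attention projections (Q, K, V, output), three SwiGLU matrices, two RMSNorm gains per layer plus a final norm, and a vocabulary of exactly 1,682 (the table says "~1,682"). None of these details are confirmed by the source beyond the table itself, so treat the breakdown as an estimate rather than a dump of `model.py`.

```python
def count_parameters(vocab=1682, d_model=256, n_layers=4, d_ff=640):
    """Estimate the parameter count under the assumptions stated above."""
    embedding = vocab * d_model        # shared with the LM head (weight tying)
    attention = 4 * d_model * d_model  # Q, K, V and output projections, no bias
    swiglu = 3 * d_model * d_ff        # gate, up and down projections
    norms = 2 * d_model                # two RMSNorm gain vectors per layer
    per_layer = attention + swiglu + norms
    final_norm = d_model
    return embedding + n_layers * per_layer + final_norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")
# prints "3,447,552 parameters (~3.45M)"
```

RoPE adds nothing to the total because it has no learned weights, and weight tying means the output head reuses the embedding matrix instead of adding another `vocab * d_model` block.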
Miso is a house cat. Not a helpful assistant, not a chatbot, not a generic AI. Miso speaks in short, lowercase sentences (1-3 max) and does not understand politics, math, code, or geography. When you ask about any of these, it stays in character and dismisses the question as a "human problem."
15 categories: greeting, hunger, naps, boxes, windows, birds, humans, dogs, vacuum, rain, affection, territory, nonsense_questions, being_picked_up, jealousy.
See persona.md for the full character bible.
hunt3rx99/meowllm-miso on Hugging Face.
| Property | Value |
|---|---|
| Samples | 20,000 (19K train / 1K val) |
| Format | {"input": "...", "output": "...", "category": "...", "source": "template"} |
| Categories | 15 |
| Generation | Slot-based compositional templates with strict filtering |
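For readers who want to load the data themselves, here is a minimal sketch of parsing one record in the format above and making a 19K/1K split. The function names, the seeded shuffle, and the sample record are illustrative assumptions; the project's own split logic lives in the generator and may differ.

```python
import json
import random

REQUIRED_KEYS = {"input", "output", "category", "source"}


def parse_record(line):
    """Parse one JSONL line and check that it has the four documented keys."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return record


def split_train_val(records, n_val=1000, seed=0):
    """Deterministic shuffle, then hold out n_val records for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]


# Illustrative record, not copied from the real dataset.
line = json.dumps({
    "input": "are you hungry",
    "output": "yes. this is an emergency. the bowl is almost empty.",
    "category": "hunger",
    "source": "template",
})
record = parse_record(line)
train, val = split_train_val([record] * 20000)
```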
Each output is assembled from per-category slot banks:
[opener .] core [ sensory][. redirect]
^ ^ ^ ^
optional req. optional optional
"finally. the bowl is almost empty from the kitchen. bring the treats also."
|_opener_| |_____core______| |_sensory__| |____redirect____|
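The assembly pattern can be sketched in a few lines. The slot bank, the probabilities, and the function names below are hypothetical stand-ins for what `meow/generate_data.py` does; only the `[opener .] core [ sensory][. redirect]` shape comes from the diagram.

```python
import random

# Hypothetical slot bank for one category; the real banks are larger.
HUNGER_SLOTS = {
    "opener": ["finally", "yes", "listen"],
    "core": ["the bowl is almost empty", "i require food", "this is an emergency"],
    "sensory": ["from the kitchen", "near the bag"],
    "redirect": ["bring the treats also", "you may hurry"],
}


def compose(slots, rng, p_opener=0.5, p_sensory=0.4, p_redirect=0.5):
    """Assemble '[opener .] core [ sensory][. redirect]' from a slot bank."""
    text = rng.choice(slots["core"])  # the only required slot
    if rng.random() < p_sensory:
        text = f"{text} {rng.choice(slots['sensory'])}"
    if rng.random() < p_opener:
        text = f"{rng.choice(slots['opener'])}. {text}"
    if rng.random() < p_redirect:
        text = f"{text}. {rng.choice(slots['redirect'])}"
    return text + "."


def count_combinations(slots):
    """Upper bound on distinct outputs: each optional slot adds a 'skip' choice."""
    n = len(slots["core"])
    for name in ("opener", "sensory", "redirect"):
        n *= len(slots[name]) + 1
    return n


rng = random.Random(0)
samples = {compose(HUNGER_SLOTS, rng) for _ in range(200)}
```

With this toy bank of 10 fragments, `count_combinations` gives 3 × 4 × 3 × 3 = 108 possible outputs, which is the multiplicative effect that turns a few hundred fragments into thousands of distinct samples.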
Every generated sample passes through the strict filter module in meow/rules.py, which enforces lowercase, length, banned-phrase rejection with whole-phrase matching, and per-category keyword requirements. The generator and the eval harness import the same filter module, so there is no drift between "what we generate" and "what we evaluate."
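A minimal sketch of such a filter follows. The phrase list, keyword list, and function name are hypothetical (the real ones live in `meow/rules.py`), and the length rule is assumed here to mean "one to three sentences". The point it illustrates is whole-phrase matching: the regex only fires on a complete banned phrase bounded by word boundaries, so sharing the words "i can" is not enough to fail.

```python
import re

# Hypothetical examples; the real lists in meow/rules.py are longer.
BANNED_PHRASES = ["i can help", "as an ai", "how can i assist"]
CATEGORY_KEYWORDS = {"naps": ["nap", "sleep", "sun spot"]}
MAX_SENTENCES = 3  # assumption: "length" means at most three sentences

# One alternation of escaped phrases, anchored on word boundaries.
_BANNED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in BANNED_PHRASES) + r")\b"
)


def passes_filter(text, category):
    """Return True if text satisfies all four checks for the given category."""
    if text != text.lower():
        return False
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not 1 <= len(sentences) <= MAX_SENTENCES:
        return False
    if _BANNED_RE.search(text):
        return False
    keywords = CATEGORY_KEYWORDS.get(category, [])
    return not keywords or any(k in text for k in keywords)
```

Because both the generator and the eval harness would call the same function, a change to a phrase or keyword list takes effect on both sides at once.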
Character fidelity is measured by a held-out eval suite of 38 prompts: 30 in-distribution prompts spanning all 15 categories, plus 8 hard negatives (assistant-trap prompts). Every output is scored on five dimensions: lowercase, length, no banned phrases, cat framing, and the full gate.
Real numbers from the bundled checkpoints/best.pt — a CPU-trained model included in this release so you can chat without retraining. Final training val_loss: 0.476.
| check | pass rate |
|---|---|
| lowercase | 100.0% |
| length | 100.0% |
| no banned phrases | 100.0% |
| cat framing | 81.6% |
| overall | 84.2% |
The model cleanly learned every surface constraint (lowercase, length, no assistant-speak) and stays in character on 84% of held-out prompts. Remaining failures are concentrated in 5 prompts where the output didn't include a category-specific keyword (e.g., a naps response that didn't mention sleep vocabulary).
A full 10-epoch GPU training run is expected to push overall pass rate higher still — the bundled checkpoint is from CPU training, so it's the floor, not the ceiling. See docs/model_card.md for the full breakdown and sample outputs.
Run the eval harness against the training data to prove the pipeline is internally consistent:
evaluating 2000 training samples
overall pass rate: 100.0%
100% on the full gate means the generator and the eval agree on what counts as valid. This is often where character-model projects silently rot.
Why is the model so small? Because bigger doesn't help here. A character LM's quality is determined by data consistency and filter discipline, not parameter count. 3.5M is small enough to fit anywhere, train in 20 minutes, and be fully readable.
Why not fine-tune an existing model? Fine-tuning a 7B model on 20K cat samples would produce a worse cat. The base model's assistant training fights the character voice. Training from scratch means the model has never seen "How can I help you?" — it can't produce what it's never learned.
Why no system prompt? At 3.5M params the model can't conditionally follow instructions anyway. Removing system prompts also removes a class of failure modes where users inject adversarial instructions.
Why single-turn? At 256 tokens, multi-turn quality degrades after a couple of exchanges. Single-turn is reliable and matches how cats actually work (they don't remember what you said 30 seconds ago).
Why slot-based templates instead of LLM generation? LLM generation is expensive, slow, drifts from the persona, and can't be reproduced. Pure static templates produce low-diversity data that the model memorizes verbatim. The slot-based middle path gives you template-level voice control with LLM-level surface diversity — thousands of unique outputs from a few hundred hand-written fragments.
Why such strict filtering? Every banned phrase in meow/rules.py was added in response to a real failure mode. The strict filter is the only thing standing between "consistent cat character" and "3.5M-parameter parrot of assistant-speak." At this scale, you cannot rely on the model to self-correct — it has to never see drift in the first place.
Why 68 tests? Because the invariants are subtle. Cross-consistency between CATEGORIES (generator) and CATEGORY_KEYWORDS (rules) is the kind of thing that silently rots without tests. Whole-phrase matching in banned phrases is the kind of regression that slips in on a one-line edit. About 8 seconds of CI saves hours of debugging later.
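To make the cross-consistency point concrete, here is a sketch of what such a test could look like. `CATEGORIES` and `CATEGORY_KEYWORDS` are the names used above, and the 15 category names come from this README, but the keyword values and the helper function are placeholders rather than the project's actual test code.

```python
CATEGORIES = [
    "greeting", "hunger", "naps", "boxes", "windows", "birds", "humans",
    "dogs", "vacuum", "rain", "affection", "territory",
    "nonsense_questions", "being_picked_up", "jealousy",
]

# Placeholder keyword lists; only the key structure matters for this check.
CATEGORY_KEYWORDS = {name: [name.split("_")[0]] for name in CATEGORIES}


def find_mismatches(categories, keywords):
    """Report categories present on one side but not the other."""
    cats, keys = set(categories), set(keywords)
    return {
        "missing_keywords": sorted(cats - keys),
        "unknown_categories": sorted(keys - cats),
    }


def test_categories_match_keywords():
    mismatches = find_mismatches(CATEGORIES, CATEGORY_KEYWORDS)
    assert not any(mismatches.values()), mismatches
```

Adding a category to the generator without touching the rules module would show up immediately as a non-empty `missing_keywords` list.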
MeowLLM/
├── meow/ # core package (8 files, ~2,800 lines)
│ ├── rules.py # single source of truth for validation
│ ├── generate_data.py # slot-based data generator
│ ├── eval_cases.py # held-out prompts + evaluation harness
│ ├── tokenizer.py # byte-level BPE + chat format
│ ├── dataset.py # torch Dataset with loss masking
│ ├── model.py # the transformer (278 lines)
│ ├── train.py # training loop
│ └── inference.py # chat CLI
├── tests/ # 68 pytest tests (~8s)
├── notebooks/ # Colab train + chat notebooks
├── scripts/
│ ├── test_rules_smoke.py # portable smoke test
│ └── upload_to_hf.sh # one-shot HF upload
├── docs/
│ ├── getting_started.md # friendly tutorial
│ ├── troubleshooting.md # common issues
│ ├── faq.md # design-rationale questions
│ ├── release.md # step-by-step publishing guide
│ ├── model_card.md # HF model card
│ └── dataset_card.md # HF dataset card
├── assets/
│ └── meow_hero.svg # hero image
├── persona.md # character bible
├── CHANGELOG.md
├── CONTRIBUTING.md
├── CITATION.cff
├── pyproject.toml
└── LICENSE # MIT
All documentation lives in docs/ plus a few root files:
- `docs/getting_started.md`: friendly step-by-step tutorial for first-time users
- `docs/troubleshooting.md`: common issues during installation, training, inference, and contribution
- `docs/faq.md`: design-rationale questions (why small, why lowercase, why RoPE, can I make it bigger)
- `docs/release.md`: step-by-step publishing guide from scratch
- `docs/model_card.md` / `docs/dataset_card.md`: Hugging Face cards
- `persona.md`: the character bible
- `CONTRIBUTING.md`: how to contribute
- `CHANGELOG.md`: version history
pip install -e ".[dev]"
pytest tests/
# 68 passed in ~8s

Tests cover the rules module (34 cases), generator behavior, cross-consistency between CATEGORIES and CATEGORY_KEYWORDS, model architecture (shapes, RoPE, RMSNorm, ignore_index, tied embeddings), tokenizer round-trip, dataset loss masking, and the evaluation harness.
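The "RoPE identity at position 0" invariant can be checked without PyTorch. The sketch below is a plain-Python rotary embedding that rotates consecutive pairs of a vector; it assumes the common base of 10000 and an interleaved pairing, which may not match the layout used in `meow/model.py`. At position 0 every rotation angle is zero, so the vector must come back unchanged.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # pair k = i // 2 uses base^(-2k/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]
q_at_zero = apply_rope(q, 0)
```

Two further properties follow from the rotation and make good test targets: the vector norm is preserved at every position, and the dot product of a rotated query and key depends only on the difference of their positions.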
See CONTRIBUTING.md for style rules. The most important one: read persona.md first. Every contribution must stay in Miso's voice.
MeowLLM started as a cat-themed riff on arman-bd/guppylm, which trained a ~9M parameter character LM to sound like a small fish. The idea of baking a character into a tiny model's weights — no system prompt, no fine-tuning — comes directly from that project.
From there, MeowLLM went in its own direction: a modern architecture (RoPE/RMSNorm/SwiGLU/SDPA), a compositional data generator instead of static templates, a strict shared rules module, a 5-dimension eval harness, 68 automated tests, and CI. The model is 2.5x smaller and the context window is 2x longer. But the original spark — "what if the character is the model, not the prompt?" — is guppylm's.
@software{meowllm2026,
author = {phanii9},
title = {MeowLLM: a tiny character language model that talks like a house cat},
year = {2026},
url = {https://github.com/phanii9/MeowLLM}
}

MIT. See LICENSE.
- arman-bd/guppylm for the original idea of training a tiny LM on a synthetic character dataset. MeowLLM would not exist without it.
- Every house cat that has ever demanded food at 4am, thereby providing the ground-truth behavioral data.
made with mass-produced artisanal training data and mass-consumed mass-produced cat treats
