Train a small GPT-2-style Transformer language model on a WhatsApp exported chat file and generate multi-speaker chat in the same style.
This project is intentionally simple and is organized into four Python modules:
- `whatsapp_data.py` – parsing + cleaning + prepared-text artifacts
- `train_model.py` – fine-tuning (Hugging Face `Trainer`) + saving a single `.pth` artifact + tokenizer folder
- `generate_model.py` – loading + structured generation (speaker turns, stop at `xxeom`)
- `app.py` – Streamlit UI to upload → prepare → train → generate
Input format (USA export style)
The parser expects lines like:
```
[MM/DD/YY, HH:MM:SS AM] Name: message
```
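For illustration, a line in this format can be matched with a regex along these lines (a sketch only; the actual `parse_usa_format()` may be more permissive):

```python
import re

# Illustrative pattern for "[MM/DD/YY, HH:MM:SS AM] Name: message" lines.
LINE_RE = re.compile(
    r"^\[(?P<date>\d{1,2}/\d{1,2}/\d{2,4}),\s+"
    r"(?P<time>\d{1,2}:\d{2}:\d{2}\s*[AP]M)\]\s+"
    r"(?P<sender>[^:]+):\s*(?P<text>.*)$"
)

m = LINE_RE.match("[03/14/24, 9:05:31 PM] Aman Sharma: see you at six")
if m:
    row = (m.group("date") + " " + m.group("time"), m.group("sender"), m.group("text"))
```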
Parsing + cleaning
parse_usa_format():
- Extracts `(timestamp, sender, text)` rows.
- Strips common invisible direction markers sometimes present in exports.
- Drops media/placeholder-like content (e.g. “omitted”, “end-to-end encrypted”).
- Cleans text by removing URLs and @mentions.
Note: the current cleaner keeps only Unicode letters/combining marks/spaces (so Devanagari stays readable). This simplifies the modeling task but also removes punctuation/digits.
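A minimal sketch of that cleaning behavior (illustrative; the project's cleaner may differ in detail):

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_text(text: str) -> str:
    """Drop URLs/@mentions, then keep only Unicode letters, combining marks and spaces."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Category L* = letters (covers Devanagari), M* = combining marks;
    # punctuation and digits are removed, as noted above.
    kept = "".join(c for c in text if c.isspace() or unicodedata.category(c)[0] in ("L", "M"))
    return re.sub(r"\s+", " ", kept).strip()
```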
Participants
extract_participants() produces a normalized participant list (lowercased, spaces replaced with underscores). These names are used both for training and for rendering generated output.
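The normalization amounts to something like this (a sketch, not the exact `extract_participants()` code; the example name is made up):

```python
def normalize_participant(name: str) -> str:
    # Lowercase and replace spaces with underscores, as described above.
    return name.strip().lower().replace(" ", "_")

normalize_participant("Aman Sharma")  # -> "aman_sharma"
```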
Training text format: special control tokens
Each message is converted to a single training string with explicit boundaries:
```
xxspk <sender_name> <message text> xxeom
```
- `xxspk` marks “speaker header begins”
- `xxeom` marks “end of message”
train_valid_text_by_time() splits the chat by time (earlier messages = train, later messages = validation) and concatenates all messages into two long strings.
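A rough sketch of both steps (the field names and the split fraction here are assumptions, not the project's exact code):

```python
def to_training_string(sender: str, text: str) -> str:
    # One message -> "xxspk <sender_name> <message text> xxeom"
    return f"xxspk {sender} {text} xxeom"

def split_by_time(rows, train_frac=0.9):
    # rows: list of (timestamp, sender, text); earlier messages -> train, later -> valid.
    rows = sorted(rows, key=lambda r: r[0])
    cut = int(len(rows) * train_frac)
    train_text = " ".join(to_training_string(s, t) for _, s, t in rows[:cut])
    valid_text = " ".join(to_training_string(s, t) for _, s, t in rows[cut:])
    return train_text, valid_text
```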
Prepared-text artifacts (Option A)
prepare_text_data() + save_prepared_text() create a folder containing:
- `train_text.txt`
- `valid_text.txt`
- `participants.json`
- `meta.json` (model name + special tokens)
These artifacts are what training consumes.
This project reuses the pretrained tokenizer of a GPT-2-family base model (default: `distilgpt2`).
- `build_tokenizer()` loads the base tokenizer and adds `xxspk` and `xxeom` as extra special tokens.
- The model embeddings are resized during training/loading so these new tokens are learnable (see the sketch below).
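A minimal sketch of that flow with the Transformers API (token and model names are taken from this README; everything else is illustrative):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["xxspk", "xxeom"]})
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token by default

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.resize_token_embeddings(len(tokenizer))  # give the new tokens learnable embeddings
```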
Model architecture
The training uses Hugging Face AutoModelForCausalLM with a GPT-2-style causal language modeling head (next-token prediction). By default it starts from distilgpt2 weights.
Dataset
Training uses fixed-length blocks (see GPTBlockDataset in whatsapp_data.py):
- Tokenize the concatenated `train_text.txt` / `valid_text.txt`
- Slice into blocks of `block_size` tokens
- Use `labels = input_ids` (standard causal LM objective, sketched below)
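A simplified stand-in for that dataset (not the project's exact `GPTBlockDataset`):

```python
import torch
from torch.utils.data import Dataset

class BlockDataset(Dataset):
    def __init__(self, text: str, tokenizer, block_size: int = 256):
        ids = tokenizer(text)["input_ids"]
        # Drop the trailing remainder so every example is exactly block_size tokens.
        n_blocks = len(ids) // block_size
        self.blocks = [
            torch.tensor(ids[i * block_size:(i + 1) * block_size])
            for i in range(n_blocks)
        ]

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        ids = self.blocks[idx]
        # Causal LM objective: labels are the input ids themselves.
        return {"input_ids": ids, "labels": ids.clone()}
```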
Trainer
train_model() runs Hugging Face Trainer with step-based evaluation and early stopping.
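A hedged sketch of such a setup (argument names/values are illustrative and depend on your `transformers` version; `model`, `train_ds`, `valid_ds` are assumed from the sketches above):

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="runs/my_run/model_out",
    eval_strategy="steps",          # step-based evaluation ("evaluation_strategy" on older versions)
    eval_steps=200,
    save_steps=200,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
metrics = trainer.evaluate()  # yields eval_loss; perplexity = exp(eval_loss)
```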
Outputs
Training writes:
- `<out>/whatsapp_model.pth` – a single torch artifact (illustrated below) containing:
  - `state_dict`
  - `model_name`
  - `participants`
  - `special_tokens`
  - eval metrics (`eval_loss`, `perplexity`)
- `<out>/tokenizer/` – tokenizer saved via `save_pretrained()`
- `<out>/train_log.jsonl` – step logs (used by the Streamlit UI)
- `<out>/train_summary.json`
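An illustrative version of the artifact save (key names are taken from the list above; `model`, `tokenizer`, `metrics` and `participants` are assumed from earlier steps):

```python
import math
import torch

eval_loss = metrics["eval_loss"]
artifact = {
    "state_dict": model.state_dict(),
    "model_name": "distilgpt2",
    "participants": participants,            # normalized names from the prepare step
    "special_tokens": ["xxspk", "xxeom"],
    "eval_loss": eval_loss,
    "perplexity": math.exp(eval_loss),
}
torch.save(artifact, "runs/my_run/model_out/whatsapp_model.pth")
tokenizer.save_pretrained("runs/my_run/model_out/tokenizer")
```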
Loading
load_generator():
- Loads the `.pth` artifact
- Loads the tokenizer from `<out>/tokenizer/`
- Reconstructs the base model (`AutoModelForCausalLM.from_pretrained(model_name)`), resizes embeddings, then loads your fine-tuned weights (see the sketch below)
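Sketched with the same assumptions about the artifact layout as above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

out_dir = "runs/my_run/model_out"
artifact = torch.load(f"{out_dir}/whatsapp_model.pth", map_location="cpu")

tokenizer = AutoTokenizer.from_pretrained(f"{out_dir}/tokenizer")
model = AutoModelForCausalLM.from_pretrained(artifact["model_name"])
model.resize_token_embeddings(len(tokenizer))    # match the added special tokens
model.load_state_dict(artifact["state_dict"])
model.eval()
```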
Structured multi-turn generation
The generator continues a prompt like:
```
xxspk <p0> <your message> xxeom xxspk <p1>
```
Then it repeats:
- generate one message with Hugging Face `model.generate(...)`
- stop at `xxeom` (used as `eos_token_id`)
- inject the next forced header `xxspk <speaker>`
This keeps turns bounded and makes output rendering reliable.
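A simplified sketch of that loop (illustrative, not `generate_model.py` verbatim; speaker names are placeholders and `model`/`tokenizer` come from the loading sketch above):

```python
eom_id = tokenizer.convert_tokens_to_ids("xxeom")
speakers = ["alice", "bob"]                     # placeholder participant names
prompt = "xxspk alice hello xxeom xxspk bob"

for turn in range(4):
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=eom_id,                    # stop this turn at xxeom
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt = tokenizer.decode(output[0], skip_special_tokens=False)
    if not prompt.rstrip().endswith("xxeom"):   # hit max_new_tokens instead of xxeom
        prompt = prompt.rstrip() + " xxeom"
    # Force the next speaker's header so the next turn stays bounded and attributable.
    prompt += f" xxspk {speakers[(turn + 1) % len(speakers)]}"
```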
Decoding settings
Sampling is controlled mainly by:
- `temperature` (lower = safer)
- `top_p` (nucleus sampling)
If you see gibberish / “language drift”, try lowering temperature and top_p.
To run the Streamlit UI:

```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

In the UI:
- Upload your WhatsApp export `.txt`
- Click Prepare (writes `runs/<run_name>/prepared_text/*`)
- Click Start training (writes `runs/<run_name>/model_out/*` and shows live JSONL logs)
- Click Generate
Alternatively, train from a raw chat export on the CLI:

```
python train_model.py --chat /path/to/chat.txt --out runs/my_run/model_out
```

This automatically creates `runs/my_run/model_out/prepared_text/` and then trains.
Or train from already-prepared text:

```
python train_model.py --prepared runs/my_run/model_out/prepared_text --out runs/my_run/model_out
```

Notes:
- Chat exports can contain sensitive information. Don’t commit them.
- Language models can memorize training text; generated output may contain verbatim snippets.