Training

Shrajan Shetty edited this page Sep 18, 2025 · 1 revision

📘 Training

🔹 Dataset Preparation

For training, this project primarily uses the WikiLarge dataset, which contains nearly 300,000 complex–simple sentence pairs. The dataset is split into:

Training set (~296k pairs)

Validation set (~2k pairs)

Test set (359 pairs)

Optional datasets for experimentation:

WikiSmall – smaller dataset (~89k pairs).

ASSET – a multi-reference benchmark, used mainly for evaluation rather than training.

Each sentence pair consists of a complex sentence and a simplified equivalent, e.g.:

Complex: "The physician prescribed the medication to alleviate the patient’s symptoms."

Simplified: "The doctor gave medicine to help the patient feel better."
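A minimal sketch of how such pairs could be held in memory, assuming the complex and simplified sentences arrive as two line-aligned sequences (one sentence per line, same order). The names `SentencePair` and `make_pairs` are hypothetical and not taken from this project's code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SentencePair:
    complex_text: str  # source sentence to be simplified
    simple_text: str   # reference simplification


def make_pairs(complex_lines: Sequence[str],
               simple_lines: Sequence[str]) -> List[SentencePair]:
    """Zip two line-aligned sequences into pairs.

    Raises ValueError if the two sides differ in length, and skips
    any pair where one side is empty after stripping whitespace.
    """
    if len(complex_lines) != len(simple_lines):
        raise ValueError("complex and simple sides are not line-aligned")
    pairs = []
    for src, tgt in zip(complex_lines, simple_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append(SentencePair(src, tgt))
    return pairs


pairs = make_pairs(
    ["The physician prescribed the medication to alleviate the patient's symptoms."],
    ["The doctor gave medicine to help the patient feel better."],
)
```

Failing loudly on a length mismatch is a deliberate choice here: a silent misalignment would pair every later sentence with the wrong reference.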

🔹 Preprocessing

Text Cleaning – remove special characters, normalize spacing.

Tokenization –

BERT tokenizer for encoder input.

GPT-2 tokenizer for decoder output.

Splitting – into train, validation, and test sets.

Padding & Attention Masks – sequences in a batch are padded to a uniform length, and attention masks mark the padded positions so the model ignores them.
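The cleaning and padding steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's actual preprocessing: "special characters" is interpreted here as control and formatting characters, the token ids are assumed to have already been produced by the tokenizers, and the function names and the default `pad_id=0` are hypothetical.

```python
import re
import unicodedata
from typing import List, Tuple


def clean_text(text: str) -> str:
    """Normalize Unicode, blank out control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Assumption: "special characters" means Unicode category C* (control,
    # format, ...). Each one becomes a space so neighbouring words do not fuse.
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    return re.sub(r"\s+", " ", text).strip()


def pad_batch(batch: List[List[int]], pad_id: int = 0,
              max_len: int = 128) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate/pad id sequences to one length and build attention masks.

    The target length is the longest sequence in the batch, capped at
    max_len. Mask entries are 1 for real tokens and 0 for padding.
    """
    target_len = min(max_len, max(len(ids) for ids in batch))
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:target_len]
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask
```

For example, `pad_batch([[5, 6, 7], [8]], max_len=2)` truncates the first sequence and pads the second. Note that GPT-2's tokenizer ships without a padding token, so a pad id for the decoder side has to be chosen explicitly.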

🔹 Model Training

The model follows a BERT encoder → GPT-2 decoder architecture.

Training Pipeline

Complex sentence tokenized → fed into BERT encoder.

Encoder embeddings passed to GPT-2 decoder.

Decoder generates simplified output autoregressively.
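To make the autoregressive step concrete, here is a toy beam search over a hand-written next-token table. It is a sketch of the general technique only: `step_fn`, `TOY`, and the token strings are hypothetical stand-ins for the GPT-2 decoder's next-token log-probabilities, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens so far, summed log-probability)


def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                bos: str = "<s>", eos: str = "</s>",
                beam_size: int = 5, max_len: int = 20) -> List[str]:
    """Keep the beam_size best partial outputs at every decoding step."""
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates = [
            (tokens + [tok], score + logp)
            for tokens, score in beams
            for tok, logp in step_fn(tokens).items()
        ]
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp in candidates[:beam_size]:
            # Hypotheses ending in EOS are set aside; the rest stay live.
            (finished if hyp[0][-1] == eos else beams).append(hyp)
        if not beams:
            break
    pool = finished or beams  # fall back to unfinished hypotheses
    return max(pool, key=lambda c: c[1])[0]


# Toy "decoder": the next-token distribution depends only on the last token.
END = {"</s>": 0.0}
TOY = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.5), "y": math.log(0.5)},
    "b": {"c": math.log(0.9), "d": math.log(0.1)},
    "x": END, "y": END, "c": END, "d": END,
}


def toy_step(tokens: List[str]) -> Dict[str, float]:
    return TOY[tokens[-1]]
```

With `beam_size=1` the search is greedy and commits to `a`, the locally best first token. With a wider beam it can recover the sequence through `b`, whose total probability (0.36) is higher than any path through `a` (0.30).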

Loss Function: Cross-Entropy loss between predicted and reference tokens.

Optimization: Adam optimizer with learning rate scheduling.
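The loss and the learning-rate schedule can be written out in a few lines of dependency-free Python. This is a sketch under assumptions: the page does not say which schedule is used, so linear warmup followed by linear decay is chosen here purely for illustration, and `warmup_steps` and `total_steps` are hypothetical values. The `ignore_id=-100` default mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`, which lets padded target positions be skipped.

```python
import math
from typing import Sequence


def token_cross_entropy(logits: Sequence[Sequence[float]],
                        targets: Sequence[int],
                        ignore_id: int = -100) -> float:
    """Mean negative log-likelihood of the reference tokens.

    logits[i] holds unnormalized scores over the vocabulary at position i;
    positions whose target equals ignore_id (e.g. padding) are skipped.
    """
    total, count = 0.0, 0
    for row, tgt in zip(logits, targets):
        if tgt == ignore_id:
            continue
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[tgt]
        count += 1
    return total / count if count else 0.0


def linear_warmup_decay(step: int, base_lr: float = 5e-5,
                        warmup_steps: int = 1000,
                        total_steps: int = 10000) -> float:
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With two equally scored vocabulary entries the loss is ln 2 ≈ 0.693, the value expected from a model that is guessing uniformly between two tokens.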

Hyperparameters (example)

Epochs: 5–10

Batch size: 32

Learning rate: 5e-5

Beam size: 5 (for decoding)

Max sequence length: 128–256
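One way to keep these values in a single place is a small config object. The sketch below is illustrative: `TrainConfig` and `steps_per_epoch` are hypothetical names, and the defaults pick the lower end of each range listed above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 5              # listed range: 5-10
    batch_size: int = 32
    learning_rate: float = 5e-5
    beam_size: int = 5           # used at decoding time, not during training
    max_seq_len: int = 128       # listed range: 128-256

    def __post_init__(self) -> None:
        if min(self.epochs, self.batch_size, self.beam_size, self.max_seq_len) < 1:
            raise ValueError("epochs, batch_size, beam_size, max_seq_len must be >= 1")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


cfg = TrainConfig()
# Roughly 296k training pairs at batch size 32, times the number of epochs.
total_steps = steps_per_epoch(296_000, cfg.batch_size) * cfg.epochs
```

With the approximate training-set size of 296k pairs, this gives 9,250 steps per epoch and 46,250 steps over 5 epochs, which is the kind of figure a learning-rate schedule needs as its total step count.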

🔹 Environment

Python 3.10+

PyTorch / TensorFlow

Hugging Face Transformers

GPU (CUDA) recommended for training
