📘 Training
🔹 Dataset Preparation
For training, this project primarily uses the WikiLarge dataset, which contains nearly 300,000 complex–simple sentence pairs. The dataset is split into:
Training set (~296k pairs)
Validation set (~2k pairs)
Test set (359 pairs)
Optional datasets for experimentation:
WikiSmall – smaller dataset (~89k pairs).
ASSET – mainly for evaluation.
Each sentence pair consists of a complex sentence and a simplified equivalent, e.g.:
Complex: "The physician prescribed the medication to alleviate the patient’s symptoms."
Simplified: "The doctor gave medicine to help the patient feel better."
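A pair like the one above can be held in a small record type. The sketch below assumes the corpus is stored as two line-aligned text files (one complex sentence per line in one file, its simplification on the same line of the other). That file layout and the names `SentencePair`, `pair_lines`, and `read_pairs` are illustrative assumptions, not something this project specifies.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass(frozen=True)
class SentencePair:
    complex_text: str
    simple_text: str


def pair_lines(complex_lines: Iterable[str], simple_lines: Iterable[str]) -> List[SentencePair]:
    """Zip two line-aligned sequences into pairs, dropping pairs with an empty side."""
    complex_lines = list(complex_lines)
    simple_lines = list(simple_lines)
    if len(complex_lines) != len(simple_lines):
        raise ValueError("source and target are not line-aligned")
    pairs = []
    for src, dst in zip(complex_lines, simple_lines):
        src, dst = src.strip(), dst.strip()
        if src and dst:
            pairs.append(SentencePair(src, dst))
    return pairs


def read_pairs(complex_path: str, simple_path: str) -> List[SentencePair]:
    """Read two UTF-8 files (hypothetical paths) and pair them line by line."""
    src = Path(complex_path).read_text(encoding="utf-8").splitlines()
    dst = Path(simple_path).read_text(encoding="utf-8").splitlines()
    return pair_lines(src, dst)
```

Failing loudly on a length mismatch is a deliberate choice here: a single shifted line would silently misalign every pair after it.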
🔹 Preprocessing
Text Cleaning – remove special characters, normalize spacing.
Tokenization – BERT tokenizer for encoder input; GPT-2 tokenizer for decoder output.
Splitting – into train, validation, and test sets.
Padding & Attention Masks – applied for uniform sequence lengths.
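The cleaning and padding steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code: the exact set of "special characters" is an assumption (here, anything other than letters, digits, whitespace, and common punctuation), and `clean_text` and `pad_batch` are hypothetical helper names.

```python
import re

# Assumption: curly quotes are normalized to straight quotes before filtering.
_QUOTES = str.maketrans({"’": "'", "‘": "'", "“": '"', "”": '"'})
# Assumption: keep word characters, whitespace, and common punctuation only.
_DISALLOWED = re.compile(r"[^\w\s.,;:!?'\"()-]")


def clean_text(text: str) -> str:
    """Drop disallowed characters and collapse runs of whitespace."""
    text = text.translate(_QUOTES)
    text = _DISALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def pad_batch(sequences, pad_id=0, max_len=None):
    """Truncate/pad token-id lists to one length and build attention masks.

    Mask entries are 1 for real tokens and 0 for padding.
    """
    if max_len is None:
        max_len = max((len(s) for s in sequences), default=0)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq[:max_len])
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask
```

In practice the Hugging Face tokenizers can produce padded ids and masks directly. One detail worth keeping in mind is that the GPT-2 tokenizer ships without a padding token, so one has to be assigned (reusing the end-of-text token is a common choice) before padding decoder sequences.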
🔹 Model Training
The model follows a BERT encoder → GPT-2 decoder architecture.
Training Pipeline
Complex sentence tokenized → fed into BERT encoder.
Encoder hidden states are passed to the GPT-2 decoder (typically through cross-attention).
Decoder generates simplified output autoregressively.
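To make "autoregressively" concrete: the decoder emits one token at a time, and each new token is chosen from scores conditioned on everything generated so far. The sketch below uses greedy selection, which is a simplification (the hyperparameters on this page mention beam search with beam size 5). `step_fn` is a hypothetical stand-in for the encoder-conditioned decoder, and `toy_step` is a made-up scorer used only to make the loop runnable.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    eos_id: int,
    max_len: int = 128,
) -> List[int]:
    """Generate token ids one at a time, always taking the highest-scoring next token."""
    tokens = [bos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:  # stop once end-of-sequence is produced
            break
    return tokens[1:]  # drop the start token, keep EOS if it was generated


def toy_step(prefix: List[int]) -> List[float]:
    """Toy scorer over a 4-token vocabulary: prefers (last token + 1), capped at 3."""
    nxt = min(prefix[-1] + 1, 3)
    return [1.0 if i == nxt else 0.0 for i in range(4)]


print(greedy_decode(toy_step, bos_id=0, eos_id=3))  # → [1, 2, 3]
```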
Loss Function: Cross-Entropy loss between predicted and reference tokens.
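As a worked illustration of the loss, token-level cross-entropy is the negative log-probability the model assigns to each reference token, averaged over the non-padding positions. The function below is a plain-Python sketch under that definition; the name `token_cross_entropy` is hypothetical, and the `ignore_id=-100` default simply mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
import math
from typing import List, Sequence


def token_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: List[int],
    ignore_id: int = -100,
) -> float:
    """Mean of -log softmax(logits)[target] over positions whose target is not ignore_id."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_id:  # padding position: excluded from the loss
            continue
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0
```

With two equally scored tokens the loss is ln 2 (about 0.693), the cost of a coin flip. A confident, correct prediction drives it toward 0.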
Optimization: Adam optimizer with learning rate scheduling.
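The page does not say which schedule is used. A common choice for fine-tuning Transformer models is linear warmup followed by linear decay, sketched below purely as an assumption. `linear_warmup_decay` is a hypothetical helper, and the 5e-5 default is the example learning rate from this page.

```python
def linear_warmup_decay(
    step: int,
    total_steps: int,
    warmup_steps: int,
    base_lr: float = 5e-5,
) -> float:
    """Learning rate at a given step: ramp up linearly, then decay linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With PyTorch and Hugging Face Transformers, the same shape is available through `get_linear_schedule_with_warmup`, which wraps an optimizer instead of returning a raw value.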
Hyperparameters (example)
Epochs: 5–10
Batch size: 32
Learning rate: 5e-5
Beam size: 5 (for decoding)
Max sequence length: 128–256
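These example values can be collected in one place, which also makes it easy to work out how many optimizer steps a run involves. The `TrainConfig` class below is a hypothetical container, not part of the project. Its defaults take the lower end of each range listed above, and the step counts assume one optimizer step per batch (no gradient accumulation).

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    epochs: int = 5              # listed range: 5-10
    batch_size: int = 32
    learning_rate: float = 5e-5
    beam_size: int = 5           # used at decoding time, not during training
    max_seq_len: int = 128       # listed range: 128-256

    def steps_per_epoch(self, num_examples: int) -> int:
        """Number of batches per epoch, counting a final partial batch."""
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps over the whole run."""
        return self.epochs * self.steps_per_epoch(num_examples)


cfg = TrainConfig()
print(cfg.steps_per_epoch(296_000))  # → 9250
print(cfg.total_steps(296_000))      # → 46250
```

With roughly 296k training pairs and a batch size of 32, one epoch is about 9,250 steps, so a 5-epoch run is on the order of 46k steps.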
🔹 Environment
Python 3.10+
PyTorch / TensorFlow
Hugging Face Transformers
GPU (CUDA) recommended for training
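A quick way to check a machine against this list is sketched below. It only reports whether the Python version is at least 3.10, whether the `torch`, `tensorflow`, and `transformers` packages can be found, and (if PyTorch is installed) whether CUDA is visible. `check_environment` is a hypothetical helper, not a script shipped with the project.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Return a dict of booleans describing the local setup."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "torch": importlib.util.find_spec("torch") is not None,
        "tensorflow": importlib.util.find_spec("tensorflow") is not None,
        "transformers": importlib.util.find_spec("transformers") is not None,
        "cuda": False,
    }
    if report["torch"]:
        try:
            import torch  # imported lazily so the check works without PyTorch

            report["cuda"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install is reported as "no CUDA" rather than crashing.
            report["cuda"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```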