📘 Training

🔹 Dataset Preparation

For training, this project primarily uses the WikiLarge dataset, which contains nearly 300,000 complex–simple sentence pairs. The dataset is split into:

Training set (~296k pairs)

Validation set (~2k pairs)

Test set (359 pairs)

Optional datasets for experimentation:

WikiSmall – smaller dataset (~89k pairs).

ASSET – mainly for evaluation.

Each sentence pair consists of a complex sentence and a simplified equivalent, e.g.:

Complex: "The physician prescribed the medication to alleviate the patient’s symptoms."

Simplified: "The doctor gave medicine to help the patient feel better."
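Parallel simplification corpora are often distributed as two line-aligned text files, one sentence per line. The sketch below assumes that layout (the file format is not described here, so treat it as an assumption) and shows how aligned lines could be zipped into pairs. `SentencePair` and `read_pairs` are hypothetical names used only for illustration.

```python
from typing import Iterable, List, NamedTuple


class SentencePair(NamedTuple):
    complex_text: str
    simple_text: str


def read_pairs(complex_lines: Iterable[str], simple_lines: Iterable[str]) -> List[SentencePair]:
    """Zip two line-aligned sources into sentence pairs, skipping empty pairs."""
    complex_list = [line.strip() for line in complex_lines]
    simple_list = [line.strip() for line in simple_lines]
    if len(complex_list) != len(simple_list):
        raise ValueError("sources are not line-aligned")
    return [SentencePair(c, s) for c, s in zip(complex_list, simple_list) if c and s]


# In practice the two arguments would be open file handles; lists are used here
# so the example runs without any data files.
pairs = read_pairs(
    ["The physician prescribed the medication to alleviate the patient's symptoms.\n"],
    ["The doctor gave medicine to help the patient feel better.\n"],
)
print(pairs[0].simple_text)
```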

🔹 Preprocessing

Text Cleaning – remove special characters, normalize spacing.

Tokenization – BERT tokenizer for the encoder input, GPT-2 tokenizer for the decoder output.

Splitting – into train, validation, and test sets.

Padding & Attention Masks – applied for uniform sequence lengths.
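The cleaning and padding steps above can be sketched in plain Python. The exact cleaning rules are not specified, so the choices below (Unicode NFKC normalization, replacing control characters, collapsing whitespace) are assumptions, as are the helper names `clean_text` and `pad_batch` and the pad id of 0. With Hugging Face tokenizers, padding and attention masks are normally produced by the tokenizer call itself; note that the GPT-2 tokenizer ships without a pad token, so one has to be assigned before padding decoder inputs.

```python
import re
import unicodedata
from typing import List, Tuple


def clean_text(text: str) -> str:
    """Normalize Unicode, replace control/format characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Characters in Unicode category C* (control, format, ...) become spaces.
    text = "".join(" " if unicodedata.category(ch)[0] == "C" else ch for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def pad_batch(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate to max_length, right-pad to the longest sequence, build masks."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should not attend to.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask


cleaned = clean_text("  The\tdoctor \u00a0 gave\x00 medicine. ")
ids, mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]], max_length=128)
```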

🔹 Model Training

The model follows a BERT encoder → GPT-2 decoder architecture.

Training Pipeline

The complex sentence is tokenized and fed into the BERT encoder.

The encoder's output embeddings are passed to the GPT-2 decoder.

The decoder generates the simplified output autoregressively, one token at a time.
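Autoregressive generation can be illustrated with a toy beam search. The sketch below is not the project's decoder: `toy_model` is a hypothetical stand-in for the GPT-2 decoder conditioned on the encoder output, returning next-token probabilities for a given prefix. With a beam size of 1 it reduces to greedy decoding; a wider beam (the hyperparameters list a beam size of 5) keeps several partial hypotheses alive. No length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

# Maps a token prefix to {next_token: probability}; probabilities must be > 0.
NextTokenFn = Callable[[List[str]], Dict[str, float]]


def beam_search(
    next_token_probs: NextTokenFn, beam_size: int, max_length: int, eos: str = "<eos>"
) -> List[str]:
    """Return the highest-scoring token sequence by summed log-probability."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_length):
        candidates: List[Tuple[float, List[str]]] = []
        for score, tokens in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((score, tokens))  # finished hypothesis is kept as-is
                continue
            for token, prob in next_token_probs(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(tokens[-1] == eos for _, tokens in beams):
            break
    return beams[0][1]


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distribution; any unknown prefix ends the sentence."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"doctor": 0.55, "physician": 0.45},
        ("a",): {"doctor": 0.9, "physician": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})


greedy = beam_search(toy_model, beam_size=1, max_length=5)
wider = beam_search(toy_model, beam_size=2, max_length=5)
```

In this toy setup the greedy path commits to "the" (sequence probability 0.6 × 0.55 = 0.33), while a beam of 2 also explores "a" and finds the more probable sequence (0.4 × 0.9 = 0.36).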

Loss Function: Cross-Entropy loss between predicted and reference tokens.
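As a worked illustration of the token-level cross-entropy, the sketch below computes the mean negative log-likelihood of the reference tokens from raw decoder scores, skipping padded positions. The ignore value of -100 mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`; whether this project masks padding that way is an assumption, and `token_cross_entropy` is a hypothetical helper.

```python
import math
from typing import Sequence

IGNORE_INDEX = -100  # label value excluded from the loss (assumed padding marker)


def token_cross_entropy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-likelihood of the reference tokens over unmasked positions."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / max(count, 1)


# Three decoder positions over a 3-token vocabulary; the last position is padding.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
labels = [0, 2, IGNORE_INDEX]
loss = token_cross_entropy(logits, labels)
```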

Optimization: Adam optimizer with learning rate scheduling.
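The type of learning rate schedule is not stated. One common choice for fine-tuning Transformer models is linear warmup followed by linear decay, sketched below as an assumption rather than the project's actual schedule. `linear_schedule` and the step counts are illustrative; the base rate of 5e-5 is taken from the example hyperparameters.

```python
def linear_schedule(step: int, total_steps: int, warmup_steps: int, base_lr: float = 5e-5) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Sample the schedule at a few points of a hypothetical 1000-step run.
samples = {step: linear_schedule(step, total_steps=1000, warmup_steps=100)
           for step in (0, 50, 100, 550, 1000)}
```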

Hyperparameters (example)

Epochs: 5–10

Batch size: 32

Learning rate: 5e-5

Beam size: 5 (for decoding)

Max sequence length: 128–256
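One way to keep these values together is a small configuration object. The sketch below is an assumption about how the settings might be organized, not the project's code; `TrainingConfig` is a hypothetical name, and where a range is given the lower bound is used as the default. It also shows how the batch size translates into optimizer steps for the roughly 296k training pairs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    epochs: int = 5             # listed range: 5-10
    batch_size: int = 32
    learning_rate: float = 5e-5
    beam_size: int = 5          # used only at decoding time
    max_seq_length: int = 128   # listed range: 128-256

    def steps_per_epoch(self, num_examples: int) -> int:
        """Number of batches needed to see every example once."""
        return math.ceil(num_examples / self.batch_size)


config = TrainingConfig()
steps = config.steps_per_epoch(296_000)   # approximate WikiLarge training set size
total_steps = config.epochs * steps
```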

🔹 Environment

Python 3.10+

PyTorch / TensorFlow

Hugging Face Transformers

GPU (CUDA) recommended for training
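A quick way to verify these requirements before training is to check the interpreter version and whether the packages can be imported. The sketch below uses only the standard library; `check_environment` is a hypothetical helper, and the import names `torch`, `tensorflow`, and `transformers` are the usual ones for these packages. GPU availability is not checked here; with PyTorch installed, `torch.cuda.is_available()` reports it.

```python
import importlib.util
import sys
from typing import Dict, Tuple


def check_environment(
    min_python: Tuple[int, int] = (3, 10),
    packages: Tuple[str, ...] = ("torch", "tensorflow", "transformers"),
) -> Dict[str, bool]:
    """Report whether the Python version and each package requirement is met."""
    report = {"python": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


report = check_environment()
for requirement, ok in report.items():
    print(f"{requirement}: {'ok' if ok else 'missing'}")
```

Only one of PyTorch or TensorFlow is needed, so a single "missing" entry among those two is not necessarily a problem.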
