This repository encompasses two core Natural Language Processing (NLP) tasks, each implementing baseline models using Recurrent Neural Networks (RNN) and Transformer architectures. These tasks serve as foundational benchmarks for future experiments and model enhancements.
This project aims to establish baseline models for two pivotal NLP tasks:
- Movie Review Categorization: Classifying IMDB movie reviews as positive or negative.
- English-French Translation: Translating English sentences into French.
Both tasks utilize RNN-based models and Transformer architectures to provide a comparative analysis of traditional and state-of-the-art approaches.
## Task 1: Movie Review Classification

- Source: IMDB Movie Reviews Dataset
- Preparation:
  - Converted text data into CSV format:
    - `train_neg.csv`, `train_pos.csv`: training data
    - `test_neg.csv`, `test_pos.csv`: testing data
  - Tokenized and numericalized reviews into fixed-length sequences.
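The tokenize/numericalize/pad step above can be sketched as follows. The whitespace tokenizer, special tokens, and the fixed length of 8 are illustrative assumptions, not the repository's actual settings:

```python
# Minimal sketch: tokenize reviews, map tokens to ids, pad/truncate to a fixed length.
def tokenize(text):
    return text.lower().split()

def build_vocab(texts, pad="<pad>", unk="<unk>"):
    vocab = {pad: 0, unk: 1}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab, max_len=8):
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    ids = ids[:max_len]                              # truncate long reviews
    ids += [vocab["<pad>"]] * (max_len - len(ids))   # pad short ones
    return ids

reviews = ["A wonderful film", "Terrible plot and worse acting"]
vocab = build_vocab(reviews)
seqs = [numericalize(r, vocab) for r in reviews]
# every sequence now has the same fixed length
```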
- RNN Baseline:
  - Architecture:
    - Embedding Layer: uses pre-trained GloVe embeddings to transform input tokens into dense vector representations.
    - LSTM Layers
    - Fully Connected Layer for binary classification
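A minimal sketch of this embedding → LSTM → linear-head baseline. Dimensions are illustrative; the real model copies pre-trained GloVe vectors into the embedding layer, indicated by the comment:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, seq_len) token ids
        emb = self.embedding(x)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)      # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])           # (batch, num_classes) logits

# In the real pipeline, pre-trained GloVe vectors would be copied in:
# model.embedding.weight.data.copy_(glove_matrix)
model = LSTMClassifier()
logits = model(torch.randint(0, 1000, (4, 50)))  # batch of 4 reviews
```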
- Transformer Baseline:
  - Architecture:
    - Embedding Layer with Positional Encoding: incorporates pre-trained GloVe embeddings for input token representation.
    - Transformer Encoder Layers
    - Classification Head
  - Note: custom-built with PyTorch's `nn.TransformerEncoder`, without pre-trained models such as BERT.
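The custom Transformer classifier described above might look roughly like this. Hyperparameters, sinusoidal positional encoding, and mean-pooling over tokens are assumptions for the sketch:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_enc = PositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        h = self.encoder(self.pos_enc(self.embedding(x)))
        return self.head(h.mean(dim=1))    # mean-pool over tokens

model = TransformerClassifier()
logits = model(torch.randint(0, 1000, (4, 30)))
```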
- Training Process:
  - Split data into training, validation, and test sets.
  - Employed data augmentation for robustness.
  - Used PyTorch DataLoaders for batching.
  - Trained the models and saved the best-performing one based on validation loss.
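The training process above can be illustrated with a toy model and dataset. The architecture and data here are stand-ins; only the DataLoader batching and "keep the best weights by validation loss" logic mirror the described workflow:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(64, 10)                    # toy features, not the IMDB data
y = torch.randint(0, 2, (64,))
train_dl = DataLoader(TensorDataset(X[:48], y[:48]), batch_size=16, shuffle=True)
val_dl = DataLoader(TensorDataset(X[48:], y[48:]), batch_size=16)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best_val, best_state = float("inf"), None
for epoch in range(3):
    model.train()
    for xb, yb in train_dl:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_dl)
    if val_loss < best_val:                # keep best weights by validation loss
        best_val = val_loss
        best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
```

In the repository the best state would be written to disk (e.g. with `torch.save`) rather than kept in memory.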
- Evaluation Metrics:
  - Accuracy
  - Confusion Matrices
  - Training/Validation Loss plots
  - BLEU Scores (for the Transformer)
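Accuracy and the confusion matrix can be computed in a few lines. The labels and predictions below are made up for illustration:

```python
# 2x2 confusion matrix for binary sentiment labels: rows = true, cols = predicted.
def confusion_matrix(y_true, y_pred, num_classes=2):
    m = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred)
accuracy = sum(cm[i][i] for i in range(2)) / len(y_true)
# cm = [[2, 1], [1, 2]], accuracy = 4/6
```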
- Data Augmentation Strategies:
  - Varying proportions of positive samples.
  - Synonym-based augmentations.
  - Adjusting training data size.
  - Incorporating pre-trained embeddings.
- Implementation: organized Jupyter notebooks reflecting each augmentation strategy for reproducibility.
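The synonym-based augmentation strategy might be sketched as follows; the hand-made synonym table and the 0.5 replacement probability are purely illustrative (the actual experiments likely use a larger synonym source):

```python
import random

# Toy synonym table; a real pipeline would use a thesaurus-scale resource.
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"], "movie": ["film"]}

def synonym_augment(text, p=0.5, rng=None):
    rng = rng or random.Random(0)          # seeded for reproducibility
    out = []
    for tok in text.split():
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))  # swap in a synonym
        else:
            out.append(tok)
    return " ".join(out)

augmented = synonym_augment("a good movie with a bad ending")
```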
## Task 2: English-French Translation

- Source: Laurent Veyssier's Machine Translation Repository
- Preparation:
  - Data cleaning for consistency.
  - Splitting into training, validation, and test sets.
  - Building vocabularies for English and French.
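Vocabulary building for the two languages can be sketched like this; the special tokens and frequency threshold are assumptions, not the repository's settings:

```python
from collections import Counter

def build_vocab(sentences, min_freq=1, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)        # ids assigned by descending frequency
    return vocab

en_vocab = build_vocab(["the cat sat", "the dog ran"])
fr_vocab = build_vocab(["le chat", "le chien"])
```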
- RNN Baseline (GRU-based):
  - Architecture:
    - Encoder:
      - Embedding Layer: uses pre-trained GloVe embeddings for English token representation.
      - GRU Layers
    - Decoder:
      - Embedding Layer: uses pre-trained GloVe embeddings for French token representation.
      - GRU Layers
      - Fully Connected Layer
    - Seq2Seq Wrapper for end-to-end training
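A hedged sketch of the GRU encoder/decoder pair. Dimensions and vocabulary sizes are illustrative, and the real models initialize the embedding layers from GloVe:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len)
        _, hidden = self.gru(self.embedding(src))
        return hidden                          # (1, batch, hidden_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden):            # one step or a whole sequence
        out, hidden = self.gru(self.embedding(tgt), hidden)
        return self.fc(out), hidden            # logits: (batch, tgt_len, vocab_size)

enc, dec = Encoder(vocab_size=50), Decoder(vocab_size=60)
hidden = enc(torch.randint(0, 50, (4, 7)))     # encode an English batch
logits, _ = dec(torch.randint(0, 60, (4, 5)), hidden)
```

A Seq2Seq wrapper would chain these two modules and run the decoder step by step at inference time.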
- Transformer Baseline:
  - Architecture:
    - Encoder and Decoder:
      - Embedding Layer with Positional Encoding: incorporates pre-trained GloVe embeddings.
      - Multi-Head Self-Attention and Feed-Forward Layers
    - Seq2Seq Wrapper for end-to-end training
  - Note: custom-built with PyTorch's `nn.TransformerEncoder`, without pre-trained models.
- Training Process:
  - Tokenized the data and mapped tokens to vocabulary indices.
  - Applied teacher forcing during training.
  - Employed data augmentation for model robustness.
  - Saved the best-performing models based on validation loss.
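Teacher forcing can be illustrated with a minimal decoder loop: with some probability, the decoder is fed the ground-truth previous token instead of its own prediction. The 0.5 forcing ratio and the model shapes here are assumptions for the sketch:

```python
import random
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, tgt_len, vocab = 2, 4, 10
embedding = nn.Embedding(vocab, 8)
gru = nn.GRU(8, 16, batch_first=True)
fc = nn.Linear(16, vocab)

tgt = torch.randint(0, vocab, (batch, tgt_len))  # ground-truth target tokens
hidden = torch.zeros(1, batch, 16)               # stand-in for the encoder state
inp = tgt[:, :1]                                 # <sos>-like first token
outputs = []
for t in range(1, tgt_len):
    out, hidden = gru(embedding(inp), hidden)
    logits = fc(out)                             # (batch, 1, vocab)
    outputs.append(logits)
    teacher_force = random.random() < 0.5        # illustrative forcing ratio
    inp = tgt[:, t:t + 1] if teacher_force else logits.argmax(-1)
outputs = torch.cat(outputs, dim=1)              # (batch, tgt_len - 1, vocab)
```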
- Evaluation Metrics:
  - Cross-Entropy Loss
  - BLEU Scores
  - Training/Validation Loss and BLEU Score visualizations
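BLEU combines modified n-gram precisions with a brevity penalty. Below is a minimal single-sentence sketch of that idea; the repository's actual evaluations rely on the BangoC123/BLEU implementation listed in the references:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        if overlap == 0:                           # no smoothing in this sketch
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

score = bleu("le chat est sur le tapis", "le chat est sur le tapis")
# identical sentences -> BLEU = 1.0
```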
- Data Augmentation Strategies:
  - Varying the proportion of long sentences.
  - Adjusting training data size.
  - Incorporating pre-trained embeddings.
  - Modifying the number of attention heads in the Transformer.
- Implementation: organized experiments with corresponding visualizations for each augmentation strategy.
## References

- Sentiment Analysis using RNN-LSTM: GitHub Repository
- Transformer References:
- BLEU Score Implementation: BangoC123/BLEU
- GloVe Embeddings: GloVe: Global Vectors for Word Representation. Pre-trained word vectors developed at Stanford University, widely used for embedding layers in NLP models.
This project is licensed under the MIT License. See the LICENSE file for details.