This repository encompasses two core Natural Language Processing (NLP) tasks, each implementing baseline models using Recurrent Neural Networks (RNN) and Transformer architectures. These tasks serve as foundational benchmarks for future experiments and model enhancements.
This project aims to establish baseline models for two pivotal NLP tasks:
- Movie Review Categorization: Classifying IMDB movie reviews as positive or negative.
- English-French Translation: Translating English sentences into French.
Both tasks utilize RNN-based models and Transformer architectures to provide a comparative analysis of traditional and state-of-the-art approaches.
## Task 1: Movie Review Classification

- Source: IMDB Movie Reviews Dataset
- Preparation:
  - Converted text data into CSV format:
    - `train_neg.csv`, `train_pos.csv`: training data
    - `test_neg.csv`, `test_pos.csv`: testing data
  - Tokenized and numericalized reviews into fixed-length sequences.
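The tokenize/numericalize/pad step above can be sketched as follows. The whitespace tokenizer, special tokens, and the fixed length of 8 are illustrative assumptions, not the repository's actual settings:

```python
# Minimal sketch: tokenize reviews, map tokens to ids, pad/truncate to a fixed length.
def tokenize(text):
    return text.lower().split()

def build_vocab(texts, pad="<pad>", unk="<unk>"):
    vocab = {pad: 0, unk: 1}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab, max_len=8):
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    ids = ids[:max_len]                              # truncate long reviews
    ids += [vocab["<pad>"]] * (max_len - len(ids))   # pad short ones
    return ids

reviews = ["A wonderful film", "Terrible plot and worse acting"]
vocab = build_vocab(reviews)
seqs = [numericalize(r, vocab) for r in reviews]
# every sequence now has the same fixed length
```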
- RNN Baseline:
  - Architecture:
    - Embedding Layer: uses pre-trained GloVe embeddings to transform input tokens into dense vector representations.
    - LSTM Layers
    - Fully Connected Layer for binary classification
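A minimal sketch of this embedding → LSTM → linear-head baseline. Dimensions are illustrative; the real model copies pre-trained GloVe vectors into the embedding layer, indicated by the comment:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, seq_len) token ids
        emb = self.embedding(x)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)      # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])           # (batch, num_classes) logits

# In the real pipeline, pre-trained GloVe vectors would be copied in:
# model.embedding.weight.data.copy_(glove_matrix)
model = LSTMClassifier()
logits = model(torch.randint(0, 1000, (4, 50)))  # batch of 4 reviews
```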
- Transformer Baseline:
  - Architecture:
    - Embedding Layer with Positional Encoding: incorporates pre-trained GloVe embeddings for input token representation.
    - Transformer Encoder Layers
    - Classification Head
  - Note: custom-built with PyTorch's `nn.TransformerEncoder`, without pre-trained models such as BERT.
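The custom Transformer classifier described above might look roughly like this. Hyperparameters, sinusoidal positional encoding, and mean-pooling over tokens are assumptions for the sketch:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_enc = PositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        h = self.encoder(self.pos_enc(self.embedding(x)))
        return self.head(h.mean(dim=1))    # mean-pool over tokens

model = TransformerClassifier()
logits = model(torch.randint(0, 1000, (4, 30)))
```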
- Training Process:
  - Split data into training, validation, and test sets.
  - Employed data augmentation for robustness.
  - Used PyTorch DataLoaders for batching.
  - Trained the models and saved the best-performing one based on validation loss.
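The training process above can be illustrated with a toy model and dataset. The architecture and data here are stand-ins; only the DataLoader batching and "keep the best weights by validation loss" logic mirror the described workflow:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(64, 10)                    # toy features, not the IMDB data
y = torch.randint(0, 2, (64,))
train_dl = DataLoader(TensorDataset(X[:48], y[:48]), batch_size=16, shuffle=True)
val_dl = DataLoader(TensorDataset(X[48:], y[48:]), batch_size=16)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best_val, best_state = float("inf"), None
for epoch in range(3):
    model.train()
    for xb, yb in train_dl:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_dl)
    if val_loss < best_val:                # keep best weights by validation loss
        best_val = val_loss
        best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
```

In the repository the best state would be written to disk (e.g. with `torch.save`) rather than kept in memory.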
- Evaluation Metrics:
  - Accuracy
  - Confusion Matrices
  - Training/Validation Loss plots
  - BLEU Scores (for the Transformer)
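Accuracy and the confusion matrix can be computed in a few lines. The labels and predictions below are made up for illustration:

```python
# 2x2 confusion matrix for binary sentiment labels: rows = true, cols = predicted.
def confusion_matrix(y_true, y_pred, num_classes=2):
    m = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred)
accuracy = sum(cm[i][i] for i in range(2)) / len(y_true)
# cm = [[2, 1], [1, 2]], accuracy = 4/6
```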
- Data Augmentation Strategies:
  - Varying proportions of positive samples.
  - Synonym-based augmentations.
  - Adjusting training data size.
  - Incorporating pre-trained embeddings.
- Implementation: organized Jupyter notebooks reflecting each augmentation strategy for reproducibility.
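The synonym-based augmentation strategy might be sketched as follows; the hand-made synonym table and the 0.5 replacement probability are purely illustrative (the actual experiments likely use a larger synonym source):

```python
import random

# Toy synonym table; a real pipeline would use a thesaurus-scale resource.
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"], "movie": ["film"]}

def synonym_augment(text, p=0.5, rng=None):
    rng = rng or random.Random(0)          # seeded for reproducibility
    out = []
    for tok in text.split():
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))  # swap in a synonym
        else:
            out.append(tok)
    return " ".join(out)

augmented = synonym_augment("a good movie with a bad ending")
```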
## Task 2: English-French Translation

- Source: Laurent Veyssier's Machine Translation Repository
- Preparation:
  - Data cleaning for consistency.
  - Splitting into training, validation, and test sets.
  - Building vocabularies for English and French.
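Vocabulary building for the two languages can be sketched like this; the special tokens and frequency threshold are assumptions, not the repository's settings:

```python
from collections import Counter

def build_vocab(sentences, min_freq=1, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)        # ids assigned by descending frequency
    return vocab

en_vocab = build_vocab(["the cat sat", "the dog ran"])
fr_vocab = build_vocab(["le chat", "le chien"])
```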
- RNN Baseline (GRU-based):
  - Architecture:
    - Encoder:
      - Embedding Layer: uses pre-trained GloVe embeddings for English token representation.
      - GRU Layers
    - Decoder:
      - Embedding Layer: uses pre-trained GloVe embeddings for French token representation.
      - GRU Layers
      - Fully Connected Layer
    - Seq2Seq Wrapper for end-to-end training
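A hedged sketch of the GRU encoder/decoder pair. Dimensions and vocabulary sizes are illustrative, and the real models initialize the embedding layers from GloVe:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len)
        _, hidden = self.gru(self.embedding(src))
        return hidden                          # (1, batch, hidden_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden):            # one step or a whole sequence
        out, hidden = self.gru(self.embedding(tgt), hidden)
        return self.fc(out), hidden            # logits: (batch, tgt_len, vocab_size)

enc, dec = Encoder(vocab_size=50), Decoder(vocab_size=60)
hidden = enc(torch.randint(0, 50, (4, 7)))     # encode an English batch
logits, _ = dec(torch.randint(0, 60, (4, 5)), hidden)
```

A Seq2Seq wrapper would chain these two modules and run the decoder step by step at inference time.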
- Transformer Baseline:
  - Architecture:
    - Encoder and Decoder:
      - Embedding Layer with Positional Encoding: incorporates pre-trained GloVe embeddings.
      - Multi-Head Self-Attention and Feed-Forward Layers
    - Seq2Seq Wrapper for end-to-end training
  - Note: custom-built with PyTorch's `nn.TransformerEncoder`, without pre-trained models.
- Training Process:
  - Tokenized the data and mapped tokens to vocabulary indices.
  - Applied teacher forcing during training.
  - Employed data augmentation for model robustness.
  - Saved the best-performing models based on validation loss.
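Teacher forcing can be illustrated with a minimal decoder loop: with some probability, the decoder is fed the ground-truth previous token instead of its own prediction. The 0.5 forcing ratio and the model shapes here are assumptions for the sketch:

```python
import random
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, tgt_len, vocab = 2, 4, 10
embedding = nn.Embedding(vocab, 8)
gru = nn.GRU(8, 16, batch_first=True)
fc = nn.Linear(16, vocab)

tgt = torch.randint(0, vocab, (batch, tgt_len))  # ground-truth target tokens
hidden = torch.zeros(1, batch, 16)               # stand-in for the encoder state
inp = tgt[:, :1]                                 # <sos>-like first token
outputs = []
for t in range(1, tgt_len):
    out, hidden = gru(embedding(inp), hidden)
    logits = fc(out)                             # (batch, 1, vocab)
    outputs.append(logits)
    teacher_force = random.random() < 0.5        # illustrative forcing ratio
    inp = tgt[:, t:t + 1] if teacher_force else logits.argmax(-1)
outputs = torch.cat(outputs, dim=1)              # (batch, tgt_len - 1, vocab)
```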
- Evaluation Metrics:
  - Cross-Entropy Loss
  - BLEU Scores
  - Training/Validation Loss and BLEU Score visualizations
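BLEU combines modified n-gram precisions with a brevity penalty. Below is a minimal single-sentence sketch of that idea; the repository's actual evaluations rely on the BangoC123/BLEU implementation listed in the references:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = max(sum(hyp_ng.values()), 1)
        if overlap == 0:                           # no smoothing in this sketch
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # brevity penalty discourages overly short hypotheses
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

score = bleu("le chat est sur le tapis", "le chat est sur le tapis")
# identical sentences -> BLEU = 1.0
```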
- Data Augmentation Strategies:
  - Varying the proportion of long sentences.
  - Adjusting training data size.
  - Incorporating pre-trained embeddings.
  - Modifying the number of attention heads in the Transformer.
- Implementation: organized experiments with corresponding visualizations for each augmentation strategy.
## References

- Sentiment Analysis using RNN-LSTM: GitHub Repository
- Transformer References:
- BLEU Score Implementation: BangoC123/BLEU
- GloVe Embeddings: GloVe: Global Vectors for Word Representation. Pre-trained word vectors developed at Stanford University, widely used for embedding layers in NLP models.
This project is licensed under the MIT License. See the LICENSE file for details.