JoeZZG/CSC413-Final-Project

NLP Baselines Project

This repository covers two core Natural Language Processing (NLP) tasks, each with baseline models built on Recurrent Neural Network (RNN) and Transformer architectures. These baselines serve as benchmarks for future experiments and model enhancements.

Table of Contents

  • Introduction
  • Task 1: Movie Review Categorization
  • Task 2: English-French Translation
  • Additional Resources
  • License

Introduction

This project aims to establish baseline models for two pivotal NLP tasks:

  1. Movie Review Categorization: Classifying IMDB movie reviews as positive or negative.
  2. English-French Translation: Translating English sentences into French.

Both tasks utilize RNN-based models and Transformer architectures to provide a comparative analysis of traditional and state-of-the-art approaches.

Task 1: Movie Review Categorization

Dataset

  • Source: IMDB Movie Reviews Dataset
  • Preparation:
    • Converted text data into CSV format:
      • train_neg.csv, train_pos.csv: Training data.
      • test_neg.csv, test_pos.csv: Testing data.
    • Tokenized and numericalized reviews into fixed-length sequences.
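The tokenize-and-numericalize step above can be sketched as follows (a minimal illustration, not the repository's actual preprocessing code; function names and the vocabulary size are assumptions):

```python
from collections import Counter

# Hypothetical sketch of the preprocessing described above: tokenize reviews,
# build a vocabulary, and numericalize into fixed-length id sequences.
PAD, UNK = 0, 1

def build_vocab(texts, max_size=20000):
    """Map the most frequent tokens to integer ids (0=pad, 1=unk)."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}

def numericalize(text, vocab, seq_len=200):
    """Convert one review into a fixed-length id sequence, padded or truncated."""
    ids = [vocab.get(tok, UNK) for tok in text.lower().split()][:seq_len]
    return ids + [PAD] * (seq_len - len(ids))

reviews = ["a great movie", "a terrible movie"]
vocab = build_vocab(reviews)
print(numericalize("a great film", vocab, seq_len=5))  # [2, 4, 1, 0, 0]
```

Out-of-vocabulary tokens map to the unknown id, and every review ends up the same length so batches can be stacked into tensors.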

Models

  1. RNN Baseline:
    • Architecture:
      • Embedding Layer: Utilizes pre-trained GloVe embeddings to transform input tokens into dense vector representations.
      • LSTM Layers
      • Fully Connected Layer for binary classification
  2. Transformer Baseline:
    • Architecture:
      • Embedding Layer with Positional Encoding: Incorporates pre-trained GloVe embeddings for input token representation.
      • Transformer Encoder Layers
      • Classification Head
    • Note: Custom-built using PyTorch’s nn.TransformerEncoder, without pre-trained models like BERT.
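As a rough illustration of the RNN baseline above (class and parameter names here are hypothetical, not the repository's actual code): a GloVe-initialized embedding feeds an LSTM, and the final hidden state goes through a fully connected layer to produce a single binary-classification logit.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        # In the project, the embedding weights would be initialized from
        # pre-trained GloVe vectors, e.g.:
        # self.embedding.weight.data.copy_(glove_vectors)
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)  # single logit for positive/negative

    def forward(self, token_ids):             # (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])            # (batch, 1)

model = LSTMClassifier(vocab_size=20000)
logits = model(torch.randint(0, 20000, (4, 50)))
print(logits.shape)  # torch.Size([4, 1])
```

The Transformer baseline follows the same overall shape, with the LSTM swapped for positional encoding plus nn.TransformerEncoder layers and a classification head.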

Training & Evaluation

  • Process:
    • Data split into training, validation, and test sets.
    • Employed data augmentation for robustness.
    • Utilized PyTorch DataLoaders for batching.
    • Trained the models and saved the best-performing checkpoint based on validation loss.
  • Metrics:
    • Accuracy
    • Confusion Matrices
    • Training/Validation Loss plots
    • BLEU Scores (for Transformer)
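The train/validate/save-best process above can be sketched as a standard PyTorch loop (an illustration under assumed names; `model`, `train_loader`, and `val_loader` are presumed to exist, and the real notebooks may differ in details):

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=5, lr=1e-3):
    """Train with BCE loss and keep the weights with the lowest validation loss."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x).squeeze(1), y.float())
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x).squeeze(1), y.float()).item()
                           for x, y in val_loader) / len(val_loader)
        if val_loss < best_loss:  # checkpoint the best-performing model
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return best_loss
```

Checkpointing on validation loss rather than training loss is what guards against reporting an overfit model.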

Augmentation

  • Strategies:
    • Varying proportions of positive samples.
    • Synonym-based augmentations.
    • Adjusting training data size.
    • Incorporating pre-trained embeddings.
  • Implementation: Organized Jupyter notebooks reflecting each augmentation strategy for reproducibility.
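The synonym-based strategy above might look like the following sketch (the tiny lookup table is illustrative; the actual notebooks may draw synonyms from a richer resource):

```python
import random

SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"], "movie": ["film"]}

def synonym_augment(text, prob=0.3, rng=random):
    """Replace each token that has known synonyms with probability `prob`."""
    return " ".join(
        rng.choice(SYNONYMS[tok]) if tok in SYNONYMS and rng.random() < prob else tok
        for tok in text.split()
    )

random.seed(0)
print(synonym_augment("a good movie with a bad plot", prob=1.0))
```

Each augmented copy preserves the label while varying surface wording, which is what makes the classifier more robust to paraphrase.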

Task 2: English-French Translation

Dataset

Models

  1. RNN Baseline (GRU-based):
    • Architecture:
      • Encoder:
        • Embedding Layer: Utilizes pre-trained GloVe embeddings for English token representation.
        • GRU Layers
      • Decoder:
        • Embedding Layer: Utilizes pre-trained GloVe embeddings for French token representation.
        • GRU Layers
        • Fully Connected Layer
      • Seq2Seq Wrapper for end-to-end training
  2. Transformer Baseline:
    • Architecture:
      • Encoder and Decoder:
        • Embedding Layer with Positional Encoding: Incorporates pre-trained GloVe embeddings.
        • Multi-Head Self-Attention and Feed-Forward Layers
      • Seq2Seq Wrapper for end-to-end training
    • Note: Custom-built using PyTorch’s nn.TransformerEncoder, without pre-trained models.
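A minimal sketch of the GRU-based Seq2Seq baseline above (class names and dimensions are illustrative, not the repository's actual code): the encoder compresses the English sentence into a hidden state, which seeds the decoder that emits French-vocabulary logits.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src):                        # (batch, src_len)
        _, hidden = self.gru(self.embedding(src))
        return hidden                              # (1, batch, hidden_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden):                # tgt: (batch, tgt_len)
        output, hidden = self.gru(self.embedding(tgt), hidden)
        return self.fc(output), hidden             # (batch, tgt_len, vocab)

class Seq2Seq(nn.Module):
    """Wrapper tying encoder and decoder together for end-to-end training."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, src, tgt):
        hidden = self.encoder(src)
        logits, _ = self.decoder(tgt, hidden)
        return logits

model = Seq2Seq(Encoder(vocab_size=5000), Decoder(vocab_size=6000))
logits = model(torch.randint(0, 5000, (2, 7)), torch.randint(0, 6000, (2, 9)))
print(logits.shape)  # torch.Size([2, 9, 6000])
```

The English and French vocabularies are separate, so the encoder and decoder each have their own (GloVe-initializable) embedding table.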

Training & Evaluation

  • Process:
    • Data tokenization and vocabulary mapping.
    • Applied teacher forcing during training.
    • Employed data augmentation for model robustness.
    • Saved the best-performing models based on validation loss.
  • Metrics:
    • Cross-Entropy Loss
    • BLEU Scores
    • Training/Validation Loss and BLEU Score visualizations
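Teacher forcing, mentioned in the process above, can be sketched like this (an illustration with assumed names; `decoder_step` stands in for whatever one-step decoder the project uses): at each time step the decoder is fed either the ground-truth previous token or its own previous prediction.

```python
import random
import torch

def decode_with_teacher_forcing(decoder_step, hidden, targets, ratio=0.5):
    """`decoder_step(token, hidden) -> (logits, hidden)` runs one time step;
    `targets` is (batch, tgt_len) of gold token ids starting with <sos>."""
    batch, tgt_len = targets.shape
    inputs = targets[:, 0]                 # start token
    all_logits = []
    for t in range(1, tgt_len):
        logits, hidden = decoder_step(inputs, hidden)
        all_logits.append(logits)
        use_gold = random.random() < ratio  # coin flip per step
        inputs = targets[:, t] if use_gold else logits.argmax(dim=-1)
    return torch.stack(all_logits, dim=1)  # (batch, tgt_len - 1, vocab)
```

Feeding gold tokens keeps early training stable, while mixing in the model's own predictions reduces the train/inference mismatch.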

Augmentation

  • Strategies:
    • Varying the proportion of long sentences.
    • Adjusting training data size.
    • Incorporating pre-trained embeddings.
    • Modifying the number of attention heads in the Transformer.
  • Implementation: Organized experiments with corresponding visualizations for each augmentation strategy.
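Varying the number of attention heads, as in the last strategy above, only requires changing a constructor argument (a sketch with assumed dimensions; `d_model` must be divisible by `nhead`):

```python
import torch
import torch.nn as nn

def make_encoder(d_model=128, nhead=4, num_layers=2):
    """Build a Transformer encoder stack with a configurable head count."""
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

for nhead in (2, 4, 8):  # the experiment compares variants like these
    enc = make_encoder(nhead=nhead)
    out = enc(torch.randn(2, 10, 128))
    print(nhead, out.shape)  # output shape is unchanged: torch.Size([2, 10, 128])
```

Since each head attends over a `d_model / nhead`-dimensional slice, the head count trades off how many distinct attention patterns the layer can learn without changing the parameter count or output shape.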

Additional Resources

License

This project is licensed under the MIT License. See the LICENSE file for details.
