TomusD/NLP-Sentiment-Analysis-Methods

NLP-Sentiment-Analysis-Methods

Overview

A repository comparing four methods for NLP sentiment analysis on the same dataset: Logistic Regression, MLP, BERT, and DistilBERT. It provides a technical comparison of four approaches to a common NLP task: sentiment analysis (classifying text as positive or negative). Each implementation is contained in its own Jupyter Notebook and covers data preprocessing, feature extraction, model definition, hyperparameter tuning with Optuna, and submission generation.

Repository Contents

  • Logistic_Regression_NLP_Sentiment_Analysis.ipynb: A classical machine learning approach using TF-IDF features with a Logistic Regression classifier.
  • MLP_NLP_Sentiment_Analysis.ipynb: A deep learning approach using a Multi-Layer Perceptron (MLP) built with PyTorch, leveraging pre-trained GloVe embeddings.
  • BeRT_NLP_Sentiment_Analysis.ipynb: A Transformer-based approach that fine-tunes the bert-base-uncased model for sequence classification.
  • DistilBeRT_NLP_Sentiment_Analysis.ipynb: A lighter, faster Transformer-based approach that fine-tunes the distilbert-base-uncased model.

Common Setup & Dependencies

All notebooks share a common set of dependencies and a general workflow. Before running, ensure you have the required Python libraries installed.

Core Libraries:

  • pandas
  • numpy
  • scikit-learn
  • nltk
  • torch (Required for MLP, BERT, DistilBERT)
  • transformers (Required for BERT, DistilBERT)
  • gensim (Required for MLP notebook to load GloVe)
  • optuna (Used in all notebooks for hyperparameter tuning)
  • contractions (Used in BERT, DistilBERT for text preprocessing)

Installation:

You can install the primary dependencies using pip:

pip install pandas numpy scikit-learn nltk torch transformers gensim optuna contractions

NLTK Data:

The notebooks also require NLTK data packages (wordnet, punkt, stopwords). They include commands to download them automatically (nltk.download(...)).

Data:

These notebooks are configured to run in an environment (like Kaggle) where the dataset is located at a path like /kaggle/input/ai-2-dl-for-nlp.../.

To run these locally, you will need to download the dataset and adjust the file paths in the "Initialize Datasets" section of each notebook to point to your local train_dataset.csv, val_dataset.csv, and test_dataset.csv files.
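For local runs, the path adjustment can be isolated in one place. The sketch below assumes a hypothetical `data/` folder holding the three CSVs named above; `DATA_DIR` and `load_splits` are illustrative names, not part of the notebooks:

```python
from pathlib import Path

import pandas as pd

# Hypothetical local folder holding the three CSV splits; on Kaggle the
# notebooks read from the /kaggle/input/... directory instead.
DATA_DIR = Path("data")

def load_splits(data_dir: Path = DATA_DIR):
    """Load the train/val/test CSVs referenced in the notebooks."""
    return (
        pd.read_csv(data_dir / "train_dataset.csv"),
        pd.read_csv(data_dir / "val_dataset.csv"),
        pd.read_csv(data_dir / "test_dataset.csv"),
    )
```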


How to Run

Each notebook is self-contained. For best results, especially with the MLP and Transformer models, run on a machine with a GPU (e.g., Google Colab, Kaggle).

1. Logistic Regression

  • File: Logistic_Regression_NLP_Sentiment_Analysis.ipynb
  • Key Features: TfidfVectorizer for feature extraction.
  • Run: Execute the cells sequentially. The optuna study will run to find the best hyperparameters for the LogisticRegression model.

2. MLP with GloVe

  • File: MLP_NLP_Sentiment_Analysis.ipynb
  • Key Features: Uses gensim.downloader to fetch glove-twitter-200 embeddings. Tweets are vectorized by averaging the embeddings of their tokens. The model is a custom PyTorch neural network.
  • Run:
    1. Execute the cells. The notebook will first download the GloVe embeddings (this may take several minutes).
    2. The optuna study will run to find the best hyperparameters, after which the best model is re-trained.

3. BERT (Transformer)

  • File: BeRT_NLP_Sentiment_Analysis.ipynb
  • Key Features: Fine-tunes the bert-base-uncased model. Uses the BertTokenizer and BertForSequenceClassification models from the transformers library.
  • Run:
    1. This notebook requires a GPU. Ensure your environment is configured for CUDA (device = "cuda").
    2. Execute cells sequentially. The transformers library will download the pre-trained model weights.
    3. The optuna study will run to find the best hyperparameters for fine-tuning, after which the best model is re-trained.

4. DistilBERT (Transformer)

  • File: DistilBeRT_NLP_Sentiment_Analysis.ipynb
  • Key Features: A lighter alternative to BERT. Fine-tunes the distilbert-base-uncased model using DistilBertTokenizer and DistilBertForSequenceClassification.
  • Run:
    1. This notebook also requires a GPU.
    2. Execute cells sequentially. The pre-trained model weights will be downloaded.
    3. The optuna study will run to find the best hyperparameters for fine-tuning, after which the best model is re-trained.
