This repository provides a technical comparison of four methods for the same NLP task on the same dataset: sentiment analysis, i.e. classifying text as positive or negative. The methods are Logistic Regression, an MLP, BERT, and DistilBERT. Each implementation lives in its own Jupyter Notebook and covers data preprocessing, feature extraction, model definition, hyperparameter tuning with Optuna, and submission generation.
- `Logistic_Regression_NLP_Sentiment_Analysis.ipynb`: A classical machine learning approach using TF-IDF features with a Logistic Regression classifier.
- `MLP_NLP_Sentiment_Analysis.ipynb`: A deep learning approach using a Multi-Layer Perceptron (MLP) built with PyTorch, leveraging pre-trained GloVe embeddings.
- `BeRT_NLP_Sentiment_Analysis.ipynb`: A Transformer-based approach that fine-tunes the `bert-base-uncased` model for sequence classification.
- `DistilBeRT_NLP_Sentiment_Analysis.ipynb`: A lighter, faster Transformer-based approach that fine-tunes the `distilbert-base-uncased` model.
All notebooks share a common set of dependencies and a general workflow. Before running, ensure you have the required Python libraries installed.
- `pandas`
- `numpy`
- `scikit-learn`
- `nltk`
- `torch` (required for MLP, BERT, DistilBERT)
- `transformers` (required for BERT, DistilBERT)
- `gensim` (required for the MLP notebook to load GloVe)
- `optuna` (used in all notebooks for hyperparameter tuning)
- `contractions` (used in BERT, DistilBERT for text preprocessing)
You can install the primary dependencies using pip:
```
pip install pandas numpy scikit-learn nltk torch transformers gensim optuna contractions
```

The notebooks also require NLTK data packages (`wordnet`, `punkt`, `stopwords`); they include commands (`nltk.download(...)`) to download these automatically.
These notebooks are configured to run in an environment (like Kaggle) where the dataset is located at a path like /kaggle/input/ai-2-dl-for-nlp.../.
To run these locally, you will need to download the dataset and adjust the file paths in the "Initialize Datasets" section of each notebook to point to your local train_dataset.csv, val_dataset.csv, and test_dataset.csv files.
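The path adjustment for a local run can be sketched as follows. Only the three CSV filenames come from the notebooks; the `DATA_DIR` location is an assumption you should change to wherever you placed the dataset:

```python
from pathlib import Path
import pandas as pd

# Hypothetical local directory -- adjust to wherever you downloaded the dataset.
DATA_DIR = Path("data")

# The notebooks expect train/val/test splits with these filenames.
paths = {split: DATA_DIR / f"{split}_dataset.csv" for split in ("train", "val", "test")}

# Load each split that is present; report missing files instead of crashing.
datasets = {}
for split, path in paths.items():
    if path.exists():
        datasets[split] = pd.read_csv(path)
    else:
        print(f"Missing {path} -- download the dataset and fix DATA_DIR.")
```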
Each notebook is self-contained. For best results, especially with the MLP and Transformer models, run on a machine with a GPU (e.g., Google Colab, Kaggle).
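If you are unsure whether a GPU is visible to PyTorch, a common device-selection snippet looks like this (the Transformer notebooks set `device = "cuda"` directly; the CPU fallback here is an addition for machines without CUDA):

```python
import torch

# Prefer a CUDA GPU when available; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```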
- File: `Logistic_Regression_NLP_Sentiment_Analysis.ipynb`
- Key Features: Uses `TfidfVectorizer` for feature extraction.
- Run: Execute the cells sequentially. The `optuna` study will run to find the best hyperparameters for the `LogisticRegression` model.
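The core TF-IDF + Logistic Regression pipeline can be sketched as follows. The toy sentences and the pipeline settings are illustrative assumptions, not the notebook's actual data or tuned hyperparameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data standing in for the real train_dataset.csv texts.
texts = [
    "I love this movie",
    "great film, loved it",
    "terrible plot, awful acting",
    "I hated every minute",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# TfidfVectorizer turns raw text into sparse TF-IDF features,
# which LogisticRegression then classifies.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["what a great movie", "awful film"]))
```

In the notebook, Optuna searches over hyperparameters such as the regularization strength instead of using the defaults shown here.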
- File: `MLP_NLP_Sentiment_Analysis.ipynb`
- Key Features: Uses `gensim.downloader` to fetch `glove-twitter-200` embeddings. Tweets are vectorized by averaging the embeddings of their tokens. The model is a custom PyTorch neural network.
- Run:
  - Execute the cells. The notebook will first download the GloVe embeddings (this may take several minutes).
  - The `optuna` study will run to find the best hyperparameters, and the best model is re-trained.
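The averaging scheme can be illustrated with toy vectors. These stand in for `glove-twitter-200` (which is 200-dimensional); the dimensions and numbers here are made up for readability:

```python
import numpy as np

DIM = 4  # glove-twitter-200 uses 200 dimensions; 4 keeps the toy example readable.

# Toy embedding table standing in for the downloaded GloVe vectors.
embeddings = {
    "good":  np.array([0.9, 0.1, 0.0, 0.2]),
    "movie": np.array([0.1, 0.8, 0.3, 0.0]),
}

def vectorize(tokens, embeddings, dim=DIM):
    """Average the embeddings of known tokens; zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Out-of-vocabulary tokens are simply skipped.
print(vectorize(["good", "movie", "unknownword"], embeddings))
```

Each tweet thus becomes a single fixed-length vector, which is what the PyTorch MLP consumes.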
- File: `BeRT_NLP_Sentiment_Analysis.ipynb`
- Key Features: Fine-tunes the `bert-base-uncased` model. Uses the `BertTokenizer` and `BertForSequenceClassification` classes from the `transformers` library.
- Run:
  - This notebook requires a GPU. Ensure your environment is configured for CUDA (`device = "cuda"`).
  - Execute cells sequentially. The `transformers` library will download the pre-trained model weights.
  - The `optuna` study will run to find the best hyperparameters for fine-tuning, and the best model is re-trained.
- File: `DistilBeRT_NLP_Sentiment_Analysis.ipynb`
- Key Features: A lighter alternative to BERT. Fine-tunes the `distilbert-base-uncased` model using `DistilBertTokenizer` and `DistilBertForSequenceClassification`.
- Run:
  - This notebook also requires a GPU.
  - Execute cells sequentially. The pre-trained model weights will be downloaded.
  - The `optuna` study will run to find the best hyperparameters for fine-tuning, and the best model is re-trained.