Members: Harish Venkata (hkv2), Abrielle Agron (aa106), Aryan Gandhi (aryang6), Ashwin Saxena (ashwins2), Tahir Bagasrawala (tahirib2)
README file inspired by this previous project.
Project Presentation Video - How to install and use app
Leaderboard Competition evaluates several generative and discriminative classifiers on Kaggle's Natural Language Processing with Disaster Tweets dataset to predict which tweets are about real disasters and which are not.
This project falls under Theme: Leaderboard Competition.
The Leaderboard Competition source code consists of the following:
- /docs: Documentation for (1) Project Final Presentation (2) Project Progress Status Report (3) Project Presentation Video
- /models: 11 total models (incl. baseline), a combination of discriminative and generative classifiers, were developed, tuned and tested against the dataset.
- AdaBoostClassifier.py: An AdaBoost classifier is a meta-estimator that first fits a classifier on the original dataset, then fits additional copies of the classifier on the same dataset with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases.
- DecisionTreeClassifier.py: Decision Tree Classifier predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
- GaussianNB.py: A Gaussian Naive Bayes classifier is a supervised learning algorithm based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable.
- GradientBoostingClassifier.py: A Gradient Boosting Classifier builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.
- KNeighborsClassifier.py: A K-Neighbors Classifier implements the k-nearest neighbors vote: a sample is assigned the class most common among its k nearest neighbors.
- Kmeans.py: A K-Means model clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.
- LatentDirichletAllocation.py: A Latent Dirichlet Allocation (LDA) model is a generative probabilistic model for collections of discrete data such as text corpora. It is also a topic model used for discovering abstract topics from a collection of documents.
- RandomForestClassifier.py: A Random Forest Classifier is a meta estimator that fits several decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
- mlp.py: A Multi-Layer Perceptron (MLP) optimizes the log-loss function using LBFGS or stochastic gradient descent.
- model1.py: A Logistic Regression model (also known as logit or MaxEnt) is a popular model for classification problems and serves as our baseline for this project.
- svc.py: A C-Support Vector Classification (SVC) model is a support vector machine for classification.
- model_eval.py: Contains the function that evaluates all 11 models and plots a Precision-Recall curve to compare model results.
- hpo_tune.py: An interactive helper script that runs hyperparameter optimization (HPO) for any model based on user selection.
- sample_submission.csv, test.csv, and train.csv are the sample-submission, test, and training datasets, respectively.
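Most of the classifiers above follow the same scikit-learn fit/predict pattern on vectorized tweet text. Here is a minimal sketch of that pattern using TF-IDF features and the logistic regression baseline; it uses a tiny inline stand-in for the Kaggle dataset's `text`/`target` columns, and the project's actual model files may be structured differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for train.csv's "text"/"target" columns
texts = ["forest fire near la ronge", "earthquake hits the city",
         "i love this song so much", "what a beautiful sunny day",
         "flood warning issued tonight", "just baked some cookies"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = real disaster, 0 = not

# Vectorize the text, then fit the baseline classifier
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Predict on an unseen tweet
pred = clf.predict(vectorizer.transform(["wildfire spreading fast"]))
print(pred)
```

The same fit/predict calls apply to the other scikit-learn models in /models; only the estimator class and its hyperparameters change.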
All five members worked together to research various models (e.g., Random Forest, MLP, Gaussian Naive Bayes), then implemented and trained these models on the dataset. We tuned the hyperparameters of each model to get the best fit, deployed the models to the leaderboard to compare their performance against each other, and tested the code thoroughly. Lastly, we planned and recorded the demo video together.
- Download this project (cs410_proj) and unzip the directory.
- Download Python 3.6+. Note: we encountered compatibility issues with Python 3.11.x; if you have trouble, try switching to Python 3.6.x.
- Create (and activate) a virtual environment:
python -m venv myvenv
source myvenv/bin/activate
- Install scikit-learn, pandas, and matplotlib:
pip install scikit-learn
pip install pandas
pip install matplotlib
- Open model_eval.py.
- Create a new model <your_model_here>.py in the /models folder.
- Add a (file_path, class_name) tuple for your model to the models_to_evaluate list in model_eval.py.
Here's an example:
#List of models to evaluate - each entry is a (file_path, class_name) tuple
models_to_evaluate = [
('models/model1.py', 'LogisticModel'),
('models/RandomForestClassifier.py', 'RandomForestClassifierModel'), #TB
('models/AdaBoostClassifier.py', 'AdaBoostClassifierModel'), #TB
('models/DecisionTreeClassifier.py', 'DecisionTreeClassifierModel'), #TB
('models/KNeighborsClassifier.py', 'KNeighborsClassifierModel'), #TB
('models/Kmeans.py', 'KMeansModel'),
('models/LatentDirichletAllocation.py', 'LDAModel'),
('models/GaussianNB.py', 'GaussianNBModel'), #TB
('models/GradientBoostingClassifier.py', 'GradientBoostingClassifierModel'), #TB
('models/svc.py', 'SVCModel'), #TB
('models/mlp.py', 'MLPClassifierModel'), #TB
# Add more models here
]
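One way such (file_path, class_name) tuples can be turned into usable classes is dynamic loading via importlib. This is a hedged sketch with a hypothetical `load_model_class` helper; model_eval.py may implement this differently.

```python
import importlib.util
import pathlib
import tempfile

def load_model_class(file_path, class_name):
    """Load a class by name from a Python file path (hypothetical helper;
    model_eval.py may implement this differently)."""
    spec = importlib.util.spec_from_file_location(
        pathlib.Path(file_path).stem, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)

# Demonstrate with a throwaway module file instead of the real /models files
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "demo_model.py"
    path.write_text("class DemoModel:\n    name = 'demo'\n")
    cls = load_model_class(str(path), "DemoModel")
    print(cls.name)
```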
- Open hpo_tune.py.
- Choose a model that you want to tune:
# Pick a model you want to tune (example of MLP Classifier below)
mlp_model = MLPClassifier(max_iter=10000)
- Choose parameters for the model to update. The list of parameters for that model can be identified via the following:
# List of parameters that can be trained for this model selection
print("List of parameters that can be trained: ", mlp_model.get_params().keys())
- Update parameter_space with the list of hyperparameters you would like to optimize for the model:
#Example parameters for MLP Classifier model
parameter_space = {
'hidden_layer_sizes': [(10,30,10),(20,)],
'activation': ['tanh', 'relu'],
'solver': ['sgd', 'adam'],
'alpha': [0.0001, 0.001, 0.05, 0.01],
'learning_rate': ['constant','adaptive'],
}
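A parameter_space dictionary like the one above is typically handed to a grid search. The following is a minimal sketch of that idea using scikit-learn's GridSearchCV on small synthetic data with a reduced grid so it runs quickly; hpo_tune.py's actual search setup may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic data so the sketch runs quickly (stands in for the tweet features)
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Reduced illustrative grid; the README's full parameter_space would take longer
parameter_space = {
    "hidden_layer_sizes": [(10,), (20,)],
    "alpha": [0.0001, 0.05],
}

# Exhaustively try every combination with 3-fold cross-validation, scored by F1
search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                      parameter_space, cv=3, scoring="f1")
search.fit(X, y)
print("Best parameters found:", search.best_params_)
```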
- Once you've updated the parameters for the model you'd like to tune, open a terminal and run hpo_tune.py. ATTENTION: HPO tuning can take 10-15 minutes to complete, depending on your processor speed and how many parameters you select for tuning.
python3 hpo_tune.py
- Choose a model that you would like to tune
[CASE sensitive] Choose one model to tune:
[MLP]
[AdaBoost]
[DecisionTree]
[NaiveBayes]
[Logistic]
[GradientBoosting]
[KNeighbors]
[svc]
[RandomForest]
Which model would you like to tune?
- The model tuning will begin (usually takes a few minutes to run and complete)
- Once you've confirmed your "best" parameters, update them in mlp.py. Here's an example:
#Example results for best parameters for MLP Classifier model
Best parameters found:
{'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (20,), 'learning_rate': 'constant', 'solver': 'sgd'}
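A convenient way to apply a best-parameters dictionary like the one above is to unpack it directly into the classifier constructor. This is a sketch using the example values printed above; mlp.py may instead spell out each keyword argument.

```python
from sklearn.neural_network import MLPClassifier

# Example best-parameter dictionary as reported by the tuning run above
best_params = {"activation": "tanh", "alpha": 0.05,
               "hidden_layer_sizes": (20,), "learning_rate": "constant",
               "solver": "sgd"}

# Unpack the dictionary into the constructor when updating mlp.py
mlp_model = MLPClassifier(max_iter=10000, **best_params)
print(mlp_model.get_params()["alpha"])  # 0.05
```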
- Open a terminal and run model_eval.py.
Here's an example:
python3 model_eval.py
- Run the model with the new parameters against the test dataset and compare precision, recall, and F1 scores.
- Repeat Steps 1-4 until you achieve the best precision, recall, and F1 scores.
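For reference, the precision, recall, and F1 scores being compared can each be computed with scikit-learn's metrics functions. A minimal sketch with hypothetical prediction and ground-truth labels (not results from this project):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical predictions vs. ground-truth labels for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # 3 of 4 predicted positives correct -> 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # 3 of 4 actual positives found -> 0.75
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean -> 0.75
```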