The script consists of two parts:
- Part A: Sentiment Polarity Classification (5 distinct labels)
- Part B: Subreddit Classification (20 distinct labels)
Both parts involve the following steps:
- Data pre-processing
  - Tokenization
  - Normalization
- Vectorization
  - One-Hot Encoding
  - TF-IDF
- Model Creation
  - Logistic Regression
  - SVC
  - Random Forest
  - BernoulliNB
  - Decision Tree
- Parameter Tuning
  - GridSearchCV
- Error Analysis
- Word Embeddings, optimizing the models created
  - Word2Vec
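
To make the pre-processing and vectorization steps concrete, here is a minimal standard-library sketch of what they compute. It is not the script's own code: the function names (`tokenize`, `build_vocabulary`, `one_hot`, `tf_idf`), the tokenization regex, the three sample sentences, and the unsmoothed formula `tf * log(N / df)` are all assumptions made for illustration. The script itself presumably relies on library implementations, which can differ in detail. For example, scikit-learn's `TfidfVectorizer` smooths the IDF term and L2-normalizes each row by default, so its values will not match the ones produced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Normalize by lowercasing, then split the text into word tokens."""
    # Assumed rule: a token is a run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}


def one_hot(tokens, vocab):
    """Binary vector: 1 if the token occurs in the document, else 0."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


def tf_idf(token_lists, vocab):
    """Unsmoothed TF-IDF: tf = count / document length, idf = log(N / df)."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    rows = []
    for toks in token_lists:
        row = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[vocab[tok]] = (count / len(toks)) * math.log(n_docs / df[tok])
        rows.append(row)
    return rows


# Hypothetical sample documents, not taken from the actual dataset.
docs = ["I love this movie!", "I hate this movie.", "Love it, love it"]
token_lists = [tokenize(d) for d in docs]
vocab = build_vocabulary(token_lists)
binary = [one_hot(toks, vocab) for toks in token_lists]
weights = tf_idf(token_lists, vocab)
```

Each row of `binary` or `weights` is a feature vector for one document, and such vectors are what a classifier like Logistic Regression or BernoulliNB would be trained on. The binary form fits BernoulliNB naturally, since that model expects presence/absence features.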