SMS Spam Classification using Machine Learning and Deep Learning

Problem Statement

With the increasing volume of SMS communication, spam messages have become a major concern affecting user experience and security. This project aims to build an intelligent system that can automatically classify SMS messages as spam or legitimate using both machine learning and deep learning techniques.

Data Preprocessing & EDA: Clean textual data (Regex, Lemmatization, Stopword removal) and visually explore textual patterns utilizing custom Word Clouds and frequency distributions.
Addressing Class Imbalance: Recognize that 'Ham' dominates the dataset (86.59%) and prioritize metrics like precision and recall over bare accuracy to strictly minimize False Positives.
Traditional ML Baselines: Train and exhaustively compare algorithms including Logistic Regression, Support Vector Machine (SVM), Naive Bayes variants, Decision Trees, Gradient Boosting, and Bagging classifiers.
Deep Learning Architecture: Design a Bi-LSTM model with Word Embeddings and Dropout regularization layers to achieve superior contextual understanding of messaging patterns.

Objectives

Perform text preprocessing and feature extraction
Build machine learning models for spam classification
Develop deep learning models for improved accuracy
Compare performance between ML and DL approaches
Identify the most effective method for real-world deployment

Performance Validation

Model Comparison (Test Set)

Model	Accuracy	Precision	Recall	F1-Score
Logistic Regression	97.68%	98.90%	96.10%	97.48%
Support Vector Machine	97.70%	99.00%	96.50%	97.73%
Bidirectional LSTM	98.16%	99.20%	97.00%	98.09%

Dataset

SMS Spam Collection Dataset
Contains labeled SMS messages (Spam / Ham)
Text-based classification problem

Best Traditional ML: Logistic Regression & SVC achieved an impressive 97.68% Accuracy with near-perfect protection of legitimate messages, yielding an error rate of just 2.32%.
Deep Learning (Bi-LSTM): Achieved 98.16% Test Accuracy, proving highly resilient and capable of interpreting nuanced semantic meaning that traditional Bag-of-Words vectors occasionally drop.

Reproducibility & Technologies

This codebase was developed using Python 3.8+ and relies on fixed random seeds (seed = 42) to guarantee experimental reproducibility.

Tools and Technologies Used

Python
Pandas, NumPy
Scikit-learn
TensorFlow / Keras
NLTK

Python Data Stack: pandas, numpy, matplotlib, seaborn
NLP Processing: nltk (WordNet Lemmatizer, Tokenization), Regex, wordcloud
Machine Learning: scikit-learn (CountVectorizer, Classifiers, Evaluation Metrics)
Deep Learning: TensorFlow / Keras (Bi-LSTM, Dense Layers, Embeddings)

Repository Structure

├── Data/
│   └── spam.csv                             # Raw dataset
├── Images/                                  # Stored visualization artifacts
├── SMS_Spam_Classification_Analysis.ipynb   # Comprehensive EDA, Models & Visualization
├── spam_classification.py                   # Original standalone execution script
├── Spam_Detector_Model.h5                    # Serialized Bi-LSTM Keras model
├── Spam_Detector_Model.pkl                   # Serialized Tokenizer
├── requirements.txt                         # Package dependencies
└── README.md                                # Project documentation

Approach

1. Data Preprocessing

Text cleaning (removal of punctuation, stopwords)
Tokenization and normalization

2. Feature Engineering

TF-IDF / Count Vectorization

3. Machine Learning Models

Naive Bayes
Logistic Regression
Support Vector Machine

4. Deep Learning Models

Artificial Neural Network (ANN)
(Optional: LSTM if used)

5. Model Evaluation

Accuracy
Precision, Recall, F1-score
Confusion Matrix

Quick Start

To run this project locally, follow these steps:

Clone the repository:

git clone https://github.com/the-irritater/SMS_Spam_Classification_ML_DL.git
cd SMS_Spam_Classification_ML_DL

Install the dependencies: It is recommended to use a Python virtual environment.

Key Insights

Machine learning models performed well with faster training time
Deep learning models showed improved performance on complex patterns
TF-IDF provided strong baseline results for text classification
Model performance depends on feature representation and complexity

pip install -r requirements.txt

Run the analysis: You can either explore the highly documented Jupyter Notebook:
```
jupyter notebook SMS_Spam_Classification_Analysis.ipynb
```
Or run the standalone Python script directly:
```
python spam_classification.py
```

Model Comparison

ML models are faster and easier to deploy
DL models capture deeper patterns but require more computation
Simpler models (Naive Bayes, Logistic Regression) are effective for baseline solutions
======= The data utilized in this repository is the renowned SMS Spam Collection Data Set originally sourced from the UCI Machine Learning Repository.

Business Impact

Enables automatic filtering of spam messages with 98.16% accuracy
Improves user experience by reducing unwanted messages and phishing attempts
Can be integrated into messaging platforms and telecom systems via API
Enhances security by identifying potentially harmful messages before user interaction
Provides interpretable metrics (precision/recall) to balance safety and usability

Future Enhancements

Deployment: Build a Flask/FastAPI endpoint for real-time classification
Multilingual Support: Extend to non-English SMS datasets
Explainability: Integrate LIME or SHAP for model interpretability
Active Learning: Implement user feedback loop for continuous model improvement
Model Compression: Apply quantization or pruning for mobile deployment

How to Run

Install required libraries
Load dataset
Run preprocessing and model training
Evaluate model performance

Author Sanman Kadam MSc Statistics | Data Analyst Location: Mumbai, India GitHub: https://github.com/the-irritater LinkedIn: https://www.linkedin.com/in/sanman-kadam-7a4990374/ Email: sanman.kadam@statistics.mu.ac.in

This project demonstrates an end-to-end Machine Learning lifecycle. By successfully evaluating multiple modeling paradigms (Classical ML vs. Deep Neural Networks) and focusing on the precision-recall tradeoff to protect legitimate user traffic, this framework acts as a highly reliable, production-ready spam verification system.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SMS Spam Classification using Machine Learning and Deep Learning

Problem Statement

Objectives

Performance Validation

Model Comparison (Test Set)

Dataset

Reproducibility & Technologies

Tools and Technologies Used

Repository Structure

Approach

1. Data Preprocessing

2. Feature Engineering

3. Machine Learning Models

4. Deep Learning Models

5. Model Evaluation

Quick Start

Key Insights

Model Comparison

Business Impact

How to Run

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
Data		Data
Images		Images
script		script
README.md		README.md
SMS_Spam_Classification_Analysis.ipynb		SMS_Spam_Classification_Analysis.ipynb
requirements.txt		requirements.txt
spam_classification.py		spam_classification.py

Folders and files

Latest commit

History

Repository files navigation

SMS Spam Classification using Machine Learning and Deep Learning

Problem Statement

Objectives

Performance Validation

Model Comparison (Test Set)

Dataset

Reproducibility & Technologies

Tools and Technologies Used

Repository Structure

Approach

1. Data Preprocessing

2. Feature Engineering

3. Machine Learning Models

4. Deep Learning Models

5. Model Evaluation

Quick Start

Key Insights

Model Comparison

Business Impact

How to Run

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages