A machine learning-based web application that classifies SMS messages as either "Spam" or "Ham" (legitimate) with high precision. The application is built using Python and deployed via Streamlit, utilizing a Multinomial Naive Bayes classifier trained on the SMS Spam Collection dataset.
Unwanted spam messages are a nuisance and a potential security threat. This project aims to filter these messages out by analyzing their text content. The system uses Natural Language Processing (NLP) techniques to preprocess text and a machine learning model to predict the category of the message.
- Input: Raw SMS text.
- Output: Classification (Spam/Not Spam).
- Key Metric: The model was selected for its precision score of 1.0 (100%) in evaluation, meaning no legitimate message in the evaluation data was flagged as spam.
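
To make the precision metric concrete: precision is the fraction of messages flagged as spam that really are spam, TP / (TP + FP). The sketch below computes it in plain Python. The function name `precision` and the toy labels are illustrative assumptions and are not taken from the project notebook.

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        # Nothing was flagged as spam, so precision is undefined; report 0.0 here.
        return 0.0
    return tp / (tp + fp)


# Toy labels (1 = spam, 0 = ham), invented for illustration only.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1]  # one spam message is missed, but no ham is flagged

print(precision(y_true, y_pred))  # 1.0: two messages flagged, both are spam
```

A precision of 1.0 says nothing about missed spam: the fourth message above slips through, which lowers recall rather than precision.
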
- Language: Python 3.x
- Web Framework: Streamlit
- Machine Learning: Scikit-learn (MultinomialNB, TF-IDF)
- Natural Language Processing: NLTK (PorterStemmer, Stopwords, Tokenization)
- Data Manipulation: Pandas, NumPy
- Visualization: Matplotlib, Seaborn, WordCloud
- SMS-Spam-detection.ipynb: Jupyter Notebook for EDA, preprocessing, and model training.
- app1.py: Main Streamlit application script.
- spam.csv: Dataset used for training.
- vectorizer.pkl: Saved TF-IDF Vectorizer (generated by notebook).
- model.pkl: Saved Naive Bayes Model (generated by notebook).
- requirements.txt: List of project dependencies.
- README.md: Project documentation.
The notebook covers the entire data science pipeline:
- Data Cleaning: Dropping empty columns, renaming features, and handling duplicates.
- Exploratory Data Analysis (EDA): Analyzing message lengths, class distribution (imbalanced), and visualizing frequent words using WordClouds.
- Preprocessing:
  - Lowercasing text.
  - Tokenization (breaking text into words).
  - Removing special characters, punctuation, and stopwords.
  - Stemming (reducing words to their root form).
- Vectorization: Using TfidfVectorizer(max_features=3000) to convert text into numerical data.
- Model Selection: After testing multiple algorithms (Logistic Regression, SVM, Random Forest, etc.), Multinomial Naive Bayes was chosen for offering the best balance of accuracy (~97%) and precision (1.0).
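
The preprocessing, vectorization, and training steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `transform_text`, the toy messages, the hand-written stopword set, and the regex tokenizer are all assumptions, chosen so the example runs without downloading NLTK data. In the real pipeline, NLTK's tokenizer and English stopword list would take their place.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.naive_bayes import MultinomialNB

# Assumption: a tiny hand-written stopword set stands in for NLTK's
# stopwords.words("english") so the sketch needs no corpus download.
STOPWORDS = {"a", "an", "and", "are", "at", "for", "is", "the", "to",
             "you", "your", "we", "now", "me", "when"}
stemmer = PorterStemmer()


def transform_text(text):
    """Lowercase, tokenize, drop punctuation and stopwords, then stem."""
    # Assumption: a regex tokenizer stands in for nltk.word_tokenize.
    # Keeping only alphanumeric runs also removes punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(stemmer.stem(tok) for tok in kept)


# Toy corpus invented for illustration (1 = spam, 0 = ham).
messages = [
    "Congratulations! You are winning a FREE prize, claim now",
    "Free entry to win cash, text WIN to claim your prize",
    "Are we still meeting for lunch at noon?",
    "Can you call me when you get home",
]
labels = [1, 1, 0, 0]

cleaned = [transform_text(m) for m in messages]

# Keep at most the 3000 highest-frequency terms, as in the notebook.
vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(cleaned)

model = MultinomialNB()
model.fit(X, labels)

# Scored on the training messages only to show the calls; a real
# evaluation needs a held-out test split.
predictions = model.predict(X)
print("accuracy:", accuracy_score(labels, predictions))
print("precision:", precision_score(labels, predictions, zero_division=0))
```

The fitted `vectorizer` and `model` objects are what the notebook would then save as vectorizer.pkl and model.pkl.
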
The Streamlit app provides a user interface for the model:
- Loads the pre-trained vectorizer.pkl and model.pkl files.
- Accepts user input via a text area.
- Preprocesses the input using the same pipeline defined in training.
- Displays the prediction ("Spam" or "Not Spam").
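
The app flow above might look something like the sketch below. It is not the contents of app1.py: the helper names `load_artifacts`, `classify`, and `render_app`, the widget labels, and the label encoding (1 = spam) are assumptions made for illustration. The `preprocess` argument stands for whatever cleaning function was used during training.

```python
import pickle


def load_artifacts(vectorizer_path="vectorizer.pkl", model_path="model.pkl"):
    """Load the fitted vectorizer and model saved by the notebook."""
    with open(vectorizer_path, "rb") as fh:
        vectorizer = pickle.load(fh)
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)
    return vectorizer, model


def classify(message, preprocess, vectorizer, model):
    """Return "Spam" or "Not Spam" for one raw SMS message."""
    cleaned = preprocess(message)               # same cleaning as in training
    features = vectorizer.transform([cleaned])  # one-row feature matrix
    label = model.predict(features)[0]
    # Assumption: the notebook encoded spam as 1 and ham as 0.
    return "Spam" if label == 1 else "Not Spam"


def render_app(preprocess):
    """Hypothetical Streamlit wiring; call it from the script run by `streamlit run`."""
    import streamlit as st  # imported lazily so the helpers above work without Streamlit

    vectorizer, model = load_artifacts()
    st.title("SMS Spam Classifier")
    message = st.text_area("Enter the message")
    if st.button("Predict"):
        st.header(classify(message, preprocess, vectorizer, model))
```

Keeping `classify` free of Streamlit calls means the prediction logic can be checked with stand-in objects before the UI is involved. The important constraint is that `preprocess` matches the training-time cleaning exactly, since the saved vectorizer only knows the vocabulary it was fitted on.
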
- Clone the repository:

      git clone https://github.com/yourusername/sms-spam-classifier.git
      cd sms-spam-classifier

- Create a virtual environment (Recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

  Note: Ensure you download the necessary NLTK data:

      import nltk
      nltk.download('punkt')
      nltk.download('stopwords')

- Run the App:

      streamlit run app1.py

requirements.txt lists the following packages:
streamlit
pandas
numpy
scikit-learn
nltk
matplotlib
seaborn
wordcloud