shendu-95/sms-email-spam-classifier

📩 SMS Spam Classifier

A machine learning-based web application that classifies SMS messages as either "Spam" or "Ham" (legitimate) with high precision. The application is built using Python and deployed via Streamlit, utilizing a Multinomial Naive Bayes classifier trained on the SMS Spam Collection dataset.

🚀 Project Overview

Unwanted spam messages are a nuisance and a potential security threat. This project aims to filter these messages out by analyzing their text content. The system uses Natural Language Processing (NLP) techniques to preprocess text and a machine learning model to predict the category of the message.

  • Input: Raw SMS text.
  • Output: Classification (Spam/Not Spam).
  • Key Metric: The model was selected for its precision of 1.0 (100%) in evaluation, meaning that no legitimate message in the evaluation data was classified as spam.
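
To make the key metric concrete: precision is TP / (TP + FP), the share of messages flagged as spam that really are spam. The sketch below uses made-up toy labels (not this project's results) and assumes the encoding 1 = spam, 0 = ham. It shows that a precision of 1.0 says nothing about spam that slips through, which is what recall measures.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical toy labels for illustration only: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]  # one spam message is missed, no ham is flagged

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
accuracy = accuracy_score(y_true, y_pred)    # correct predictions / all predictions

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Here no ham message is flagged, so precision is perfect even though one spam message is missed.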

🛠️ Tech Stack

  • Language: Python 3.x
  • Web Framework: Streamlit
  • Machine Learning: Scikit-learn (MultinomialNB, TF-IDF)
  • Natural Language Processing: NLTK (PorterStemmer, Stopwords, Tokenization)
  • Data Manipulation: Pandas, NumPy
  • Visualization: Matplotlib, Seaborn, WordCloud

📂 Project Structure

  • SMS-Spam-detection.ipynb: Jupyter Notebook for EDA, preprocessing, and model training.
  • app1.py: Main Streamlit application script.
  • spam.csv: Dataset used for training.
  • vectorizer.pkl: Saved TF-IDF Vectorizer (generated by notebook).
  • model.pkl: Saved Naive Bayes Model (generated by notebook).
  • requirements.txt: List of project dependencies.
  • README.md: Project documentation.

⚙️ How It Works

1. Model Training (SMS-Spam-detection.ipynb)

The notebook covers the entire data science pipeline:

  • Data Cleaning: Dropping empty columns, renaming features, and handling duplicates.
  • Exploratory Data Analysis (EDA): Analyzing message lengths, class distribution (imbalanced), and visualizing frequent words using WordClouds.
  • Preprocessing:
    • Lowercasing text.
    • Tokenization (breaking text into words).
    • Removing special characters, punctuation, and stopwords.
    • Stemming (reducing words to their root form).
  • Vectorization: Using TfidfVectorizer (max_features=3000) to convert text into numerical features.
  • Model Selection: After testing multiple algorithms (Logistic Regression, SVM, Random Forest, etc.), Multinomial Naive Bayes was chosen for offering the best balance of accuracy (~97%) and precision (1.0).

2. The Application (app1.py)

The Streamlit app provides a user interface for the model:

  1. Loads the pre-trained vectorizer.pkl and model.pkl files.
  2. Accepts user input via a text area.
  3. Preprocesses the input with the same pipeline used in training, vectorizes it with the loaded TF-IDF vectorizer, and passes the result to the model.
  4. Displays the prediction ("Spam" or "Not Spam").
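
The load-and-predict path behind these steps can be sketched without the Streamlit UI. The example below is an illustration under stated assumptions, not the contents of app1.py: it trains on a handful of made-up messages (not rows from spam.csv), assumes the encoding 1 = spam, writes the two pickle files to a temporary directory, and uses a hypothetical helper named `classify`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy messages, for illustration only (not rows from spam.csv).
texts = [
    "win free cash prize now",
    "free entry claim your prize",
    "urgent claim free reward",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after work",
]
labels = [1, 1, 1, 0, 0, 0]  # assumed encoding: 1 = spam, 0 = ham

vectorizer = TfidfVectorizer(max_features=3000)
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Save both artifacts to a temporary directory, then reload them as an app would.
workdir = Path(tempfile.mkdtemp())
for name, obj in (("vectorizer.pkl", vectorizer), ("model.pkl", model)):
    with open(workdir / name, "wb") as f:
        pickle.dump(obj, f)

with open(workdir / "vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(workdir / "model.pkl", "rb") as f:
    loaded_model = pickle.load(f)


def classify(message: str) -> str:
    """Vectorize one already-preprocessed message and map the prediction to a label."""
    features = loaded_vectorizer.transform([message])
    return "Spam" if loaded_model.predict(features)[0] == 1 else "Not Spam"


print(classify("claim your free prize"))
print(classify("see you after lunch"))
```

In the app, the message would come from the text area instead of a string literal, and it would be run through the training-time preprocessing before being vectorized.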

🔧 Installation & Run

  1. Clone the repository:

    git clone https://github.com/yourusername/sms-spam-classifier.git
    cd sms-spam-classifier
  2. Create a virtual environment (Recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

    Note: Download the required NLTK data before the first run:

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
    # Recent NLTK releases may also require: nltk.download('punkt_tab')
  4. Run the App:

    streamlit run app1.py

📦 Requirements

streamlit
pandas
numpy
scikit-learn
nltk
matplotlib
seaborn
wordcloud
