khushi4124/spam-sms-detection


spam-sms-detection

This project implements a machine learning system that classifies SMS messages as either Ham (legitimate) or Spam. By combining TF-IDF text vectorization with custom-engineered behavioral features, the model reaches 99.12% accuracy on the test set.

Problem Statement

The objective of this project is to develop a predictive model that filters out harmful spam messages while keeping legitimate messages from being flagged as spam. This is achieved by analyzing both the textual content of the messages and their structural patterns.

Dataset Used: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

Approach

The workflow combines hand-engineered features with TF-IDF text features in a single scikit-learn pipeline, followed by a linear classifier:

1. Feature Engineering

Beyond the raw text, I engineered several features to capture the distinctive style of spam messages:

Message Length: Word count per message.

Symbol Density: The ratio of non-alphanumeric characters to message length.

Currency Count: Detection of financial symbols ($, €, etc.), which are common in scams.

Digit Density: The proportion of numbers, often representing phone numbers or prize amounts.

Caps Group Count: Sequences of capital letters to detect "shouting" or emphasis.

Urgent Word Count: Occurrences of high-trigger keywords like "Free," "Winner," and "Claim."
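The features above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's actual code. The keyword list, the currency symbol set, the choice of character count as the denominator for the two densities, and the "two or more consecutive capitals" rule are all assumptions made for the example.

```python
import re

# Assumed keyword list: the README only names "Free", "Winner", and "Claim".
URGENT_WORDS = {"free", "winner", "claim"}
# Assumed symbol set: the README mentions "$, €, etc." without listing the rest.
CURRENCY_SYMBOLS = "$€£"


def extract_features(message: str) -> dict:
    """Compute the six engineered features for one SMS message."""
    n_chars = len(message)

    def density(count: int) -> float:
        # Densities are taken relative to character count (an assumption);
        # an empty message yields 0.0 instead of dividing by zero.
        return count / n_chars if n_chars else 0.0

    # Lower-cased alphabetic tokens, used only for keyword matching.
    tokens = re.findall(r"[a-z]+", message.lower())
    # Non-alphanumeric characters, excluding whitespace (an assumption).
    n_symbols = sum(1 for c in message if not c.isalnum() and not c.isspace())

    return {
        "message_length": len(message.split()),  # word count
        "symbol_density": density(n_symbols),
        "currency_count": sum(message.count(s) for s in CURRENCY_SYMBOLS),
        "digit_density": density(sum(c.isdigit() for c in message)),
        # Runs of two or more consecutive capital letters (an assumption).
        "caps_group_count": len(re.findall(r"[A-Z]{2,}", message)),
        "urgent_word_count": sum(t in URGENT_WORDS for t in tokens),
    }


if __name__ == "__main__":
    print(extract_features("FREE prize! Call 0800 now to CLAIM $100"))
```

Applying `extract_features` to every row of the dataset yields the numeric columns that are later scaled alongside the TF-IDF text features.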

2. Preprocessing & Pipeline

A ColumnTransformer was used to handle the text and numerical columns side by side within a scikit-learn Pipeline:

Text Data: Processed using TfidfVectorizer (with English stop-word removal) to weight each term by how distinctive it is across messages.

Numerical Data: Scaled using StandardScaler (zero mean, unit variance) so that the engineered features share a comparable scale and no single feature dominates the model.
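A minimal sketch of such a preprocessing step is shown below. The column names (`text`, `message_length`, `digit_density`) and the three toy rows are hypothetical, since the README does not list the actual column names, and only two of the engineered features are included to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical column names, chosen for this illustration only.
TEXT_COL = "text"
NUMERIC_COLS = ["message_length", "digit_density"]

preprocessor = ColumnTransformer(
    transformers=[
        # A plain string (not a list) passes a 1-D column of raw strings,
        # which is what TfidfVectorizer expects.
        ("tfidf", TfidfVectorizer(stop_words="english"), TEXT_COL),
        # A list of names passes a 2-D block, which StandardScaler expects.
        ("scale", StandardScaler(), NUMERIC_COLS),
    ]
)

# Toy data, invented for the example.
df = pd.DataFrame(
    {
        "text": [
            "Free prize, claim now",
            "See you at lunch tomorrow",
            "Winner! Call 0800 today",
        ],
        "message_length": [4, 5, 4],
        "digit_density": [0.0, 0.0, 0.17],
    }
)

X = preprocessor.fit_transform(df)
n_tfidf = len(preprocessor.named_transformers_["tfidf"].vocabulary_)

# The result has one column per TF-IDF term plus one per scaled numeric feature.
print(X.shape, n_tfidf + len(NUMERIC_COLS))
```

Passing the text column as a string and the numeric columns as a list matters here: the vectorizer needs a sequence of documents, while the scaler needs a two-dimensional array.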

3. Model Building

A Linear Support Vector Machine (SVM) was selected for classification because linear SVMs perform well in high-dimensional, sparse feature spaces such as those produced by TF-IDF vectorization.
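As a rough sketch, a linear SVM can be attached as the last step of a scikit-learn Pipeline, after the preprocessing step. The example assumes `LinearSVC` with default hyperparameters, a single hypothetical numeric column, and a handful of invented messages. The README does not state which SVM class or settings were actually used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Invented training data with one engineered feature (hypothetical column names).
train = pd.DataFrame(
    {
        "text": [
            "Free prize, claim your reward now",
            "Winner! Call 0800 123 456 today",
            "Claim your free cash prize, text 80800",
            "See you at lunch tomorrow",
            "Can you send me the meeting notes",
            "Running late, be there in ten minutes",
        ],
        "digit_density": [0.0, 0.29, 0.13, 0.0, 0.0, 0.0],
        "label": ["spam", "spam", "spam", "ham", "ham", "ham"],
    }
)

model = Pipeline(
    steps=[
        (
            "prep",
            ColumnTransformer(
                [
                    ("tfidf", TfidfVectorizer(stop_words="english"), "text"),
                    ("scale", StandardScaler(), ["digit_density"]),
                ]
            ),
        ),
        # Default hyperparameters; an assumption, not the project's settings.
        ("svm", LinearSVC()),
    ]
)

feature_cols = ["text", "digit_density"]
model.fit(train[feature_cols], train["label"])

new_messages = pd.DataFrame(
    {
        "text": ["Claim your free prize today", "Lunch at noon tomorrow?"],
        "digit_density": [0.0, 0.0],
    }
)
pred = model.predict(new_messages)
print(list(pred))
```

Keeping the vectorizer and scaler inside the Pipeline means they are fitted only on the training data passed to `fit`, which avoids leaking test-set statistics during cross-validation.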

Results

The model's performance across all key metrics on the test dataset is as follows:

Overall Accuracy: 99.12%

Mean Cross-Validation Accuracy: 98.88%

Ham Recall (Legitimate Messages): 100% (0 false positives)

Spam Precision: 99%
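For reference, the sketch below shows how accuracy, ham recall, and spam precision follow from confusion-matrix counts, treating spam as the positive class. The counts are hypothetical, chosen only to illustrate the formulas. They are not the project's actual confusion matrix.

```python
def summarize(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive headline metrics from confusion-matrix counts (spam = positive).

    tp: spam correctly flagged       fp: ham wrongly flagged as spam
    fn: spam missed (passed as ham)  tn: ham correctly passed
    """
    total = tp + fp + fn + tn
    return {
        # Share of all messages classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of legitimate messages that were let through.
        "ham_recall": tn / (tn + fp) if (tn + fp) else 0.0,
        # Share of messages flagged as spam that really were spam.
        "spam_precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, value in summarize(tp=92, fp=2, fn=8, tn=898).items():
        print(f"{name}: {value:.4f}")
```

Every false positive lowers both ham recall and spam precision, so the two metrics move together when legitimate messages are wrongly flagged.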

