This repository contains an implementation of an email spam classifier using a Naive Bayes algorithm. The classifier is designed to distinguish between spam and non-spam (ham) emails based on their content. It does so through text preprocessing, feature extraction with a bag-of-words model, and a trained Naive Bayes classifier.
To run the spam classifier, you need to have Python 3.x installed along with the following libraries:
- pandas
- numpy
- scikit-learn
You can install the necessary libraries using pip:
pip install pandas numpy scikit-learn

- Clone this repository to your local machine.
- Place the `spam.csv` file in the repository's root directory.
- Run the Python script to train the model and classify emails.

python spam_classifier.py

The dataset used for this classifier is a collection of labeled email messages. The dataset contains the following columns:
- `Category`: The label for each email, either "spam" or "ham".
- `Message`: The content of the email.
During preprocessing, an additional column, `spam`, is added to the dataset. It contains binary values:
- 1: Indicates the email is spam.
- 0: Indicates the email is not spam (ham).
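As a minimal sketch of this step, assuming the labels are stored exactly as the strings "spam" and "ham", the `spam` column could be derived with pandas as shown here. The two rows are made-up stand-ins for `spam.csv`, not data taken from it.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data would come from pd.read_csv("spam.csv").
df = pd.DataFrame({
    "Category": ["ham", "spam"],
    "Message": ["could you help me?", "free prize, claim now"],
})

# 1 for spam, 0 for ham (assumes the labels are exactly "spam" or "ham").
df["spam"] = (df["Category"] == "spam").astype(int)

print(df[["Category", "spam"]])
```

A boolean comparison followed by `astype(int)` keeps the mapping explicit and maps anything that is not "spam" to 0, so unexpected label values are worth checking for beforehand.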
The model is built using the following steps:
- Data Preprocessing: The `Category` column is converted into a numerical format, where spam is represented as 1 and ham as 0.
- Train-Test Split: The data is split into training and testing sets.
- Feature Extraction: The `CountVectorizer` is used to convert the text data into a sparse matrix of word counts.
- Model Training: A Multinomial Naive Bayes classifier is trained on the transformed training data.
- Prediction and Evaluation: The model is tested on the test set to evaluate its performance.
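The steps above can be sketched end to end with scikit-learn. This is an illustrative outline rather than the contents of `spam_classifier.py`: the toy messages, the `test_size`, `random_state`, and `stratify` settings are assumptions chosen only to keep the example small and repeatable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages and labels (1 = spam, 0 = ham), not taken from spam.csv.
messages = [
    "free prize claim now", "win free cash today", "free offer click now",
    "claim your free reward", "could you help me", "see you at lunch",
    "meeting moved to monday", "thanks for your help",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Train-test split; test_size and random_state are illustrative choices.
x_train, x_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=0, stratify=labels
)

# Feature extraction: learn the vocabulary from the training messages only.
cv = CountVectorizer()
x_train_count = cv.fit_transform(x_train)

# Model training on the sparse word-count matrix.
model = MultinomialNB()
model.fit(x_train_count, y_train)

# Evaluation: reuse the fitted vectorizer so test features share the same vocabulary.
accuracy = model.score(cv.transform(x_test), y_test)
print(f"Model Accuracy: {accuracy}")
```

Fitting the vectorizer on the training split only, and calling `transform` (not `fit_transform`) on the test split, keeps test-set vocabulary from leaking into training.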
Here’s an example of how you can use the trained model to classify new emails. The snippet assumes `cv` is the fitted `CountVectorizer` and `model` is the trained classifier:
# Example ham email
email_ham = ["could you help me?"]
email_ham_count = cv.transform(email_ham)
print("Ham Email Prediction:", model.predict(email_ham_count))
# Example spam email
email_spam = ["free"]
email_spam_count = cv.transform(email_spam)
print("Spam Email Prediction:", model.predict(email_spam_count))

After training, the model can be tested using the test data split. The model's accuracy is measured by comparing its predictions to the true labels of the test set.
# Evaluating the model
x_test_count = cv.transform(x_test)
accuracy = model.score(x_test_count, y_test)
print(f"Model Accuracy: {accuracy}")

Contributions to this repository are welcome. You can contribute by:
- Submitting bug reports and feature requests.
- Forking the repository and creating a pull request with new features or improvements.
This project is licensed under the MIT License - see the LICENSE file for details.