Sentinel AI — Phishing Early Warning System

An AI-powered browser extension that detects phishing URLs and suspicious emails in real time using explainable machine learning. Built for students interested in cybersecurity.

Features

| Feature | Description |
| --- | --- |
| URL Classifier | RandomForest trained on the Kaggle "Phishing Website Detector" dataset (11K+ samples, 97% accuracy) |
| Email Text Scanner | TF-IDF + Naive Bayes model that detects phishing language in email bodies |
| Header Anomaly Checks | Detects SPF/DKIM failures and From/Reply-To mismatches |
| SHAP Explainability | Every prediction comes with plain-English reasons (e.g. "The URL uses a raw IP address instead of a domain name") |
| Site Blocking | Phishing sites are blocked before loading, with a full-screen interstitial warning page |
| File Download Scanner | Automatically scans downloaded files under 10 MB using 6-layer static analysis (magic bytes, entropy, suspicious strings, etc.) |
| Auto-Retrain Feedback Loop | The model learns from user corrections and auto-retrains after every 500 corrections |
| Weekly Digital Hygiene Report | Tracks browsing habits locally and displays a visual dashboard with a safety score, daily chart, and flagged domains |
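
The header anomaly checks listed above can be illustrated with Python's standard `email` package. This is a minimal sketch, not the project's `headers_check.py`: the function name `check_headers`, the reliance on an `Authentication-Results` header, and the wording of the warnings are all assumptions made for illustration.

```python
from email import message_from_string
from email.utils import parseaddr


def check_headers(raw_email: str) -> list[str]:
    """Return plain-English warnings for common header anomalies (illustrative)."""
    msg = message_from_string(raw_email)
    warnings = []

    # Compare the domain in From with the domain in Reply-To, if one is present.
    from_addr = parseaddr(msg.get("From", ""))[1].lower()
    reply_addr = parseaddr(msg.get("Reply-To", ""))[1].lower()
    from_domain = from_addr.rpartition("@")[2]
    reply_domain = reply_addr.rpartition("@")[2]
    if reply_addr and from_domain != reply_domain:
        warnings.append(
            f"Reply-To domain ({reply_domain}) differs from From domain ({from_domain})"
        )

    # Assumption: the receiving mail server recorded its SPF/DKIM verdicts
    # in one or more Authentication-Results headers.
    auth = " ".join(msg.get_all("Authentication-Results", [])).lower()
    for mechanism in ("spf", "dkim"):
        if f"{mechanism}=fail" in auth:
            warnings.append(f"{mechanism.upper()} check failed")
    return warnings


sample = (
    "From: PayPal Support <support@paypal.com>\n"
    "Reply-To: helpdesk@paypa1-secure.example\n"
    "Authentication-Results: mx.example.org; spf=fail; dkim=pass\n"
    "Subject: Verify your account\n"
    "\n"
    "Please confirm your password.\n"
)
print(check_headers(sample))
```

Rule-based checks like these need no training data, which is presumably why they sit alongside the two ML models rather than inside them.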

How It Works

User visits a URL
    ↓
Background script intercepts BEFORE page loads
    ↓
Sends URL to FastAPI backend → RandomForest extracts 30 features → predicts
    ↓
┌─ Safe → Page loads normally
└─ Phishing → Page BLOCKED, interstitial shown with:
       • Risk percentage gauge
       • SHAP-generated plain-English reasons
       • "Go Back to Safety" / "I understand, proceed" buttons
           ↓
   If user clicks "proceed" → feedback correction stored
       ↓
   After 500 corrections → MODEL AUTO-RETRAINS
       • Merges original Kaggle data + user corrections
       • Retrains RandomForest (200 trees)
       • Hot-reloads model (no restart needed)
       • Clears feedback file, cycle repeats
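
To make the feature-extraction step more concrete, here is a minimal sketch of what a few URL features might look like. The project extracts 30 features; the three shown here, their names, and the function `extract_features` are hypothetical and chosen only because they are easy to compute with the standard library.

```python
import ipaddress
from urllib.parse import urlparse


def extract_features(url: str) -> dict[str, int]:
    """Compute three illustrative URL features (the real extractor uses 30)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Feature 1: the host is a raw IP address rather than a domain name.
    try:
        ipaddress.ip_address(host)
        uses_ip = 1
    except ValueError:
        uses_ip = 0

    return {
        "uses_ip_address": uses_ip,
        "uses_https": int(parsed.scheme == "https"),
        "has_at_symbol": int("@" in url),
    }


print(extract_features("http://192.168.1.1/paypal-login/secure"))
# → {'uses_ip_address': 1, 'uses_https': 0, 'has_at_symbol': 0}
```

A feature such as `uses_ip_address` maps naturally onto a plain-English reason like the raw-IP example quoted in the Features section.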

Architecture

phishing_detector/
├── backend/
│   ├── main.py                  # FastAPI server with all endpoints
│   ├── train_url.py             # Train URL model on Kaggle dataset
│   ├── train_email.py           # Train email text model
│   ├── feedback_store.py        # Feedback CSV storage + auto-retrain logic
│   ├── models/
│   │   ├── url_model.py         # 30-feature URL extractor + prediction
│   │   ├── email_model.py       # TF-IDF email classifier
│   │   ├── headers_check.py     # SPF/DKIM/Reply-To rule checks
│   │   └── file_scanner.py      # 6-layer static file analysis
│   └── explain/
│       └── shap_explainer.py    # SHAP explanations → plain English
├── extension/
│   ├── manifest.json            # Chrome Manifest V3
│   ├── background.js            # Intercepts navigation + logs scans + download scanner
│   ├── content.js               # Injects warning banners on pages
│   ├── blocked.html / .js       # Full-page interstitial for blocked sites
│   ├── report.html / .js        # Weekly Digital Hygiene Report dashboard
│   ├── popup.html / .js         # Extension popup dashboard
│   └── styles.css               # All styling
└── data/                        # Kaggle dataset (not in git)

Quick Start

1. Clone and Install Dependencies

git clone https://github.com/swarnim-dev/sentinel-ai.git
cd sentinel-ai
python3 -m venv venv
source venv/bin/activate
pip install -r backend/requirements.txt

2. Download Dataset

Download the Phishing Website Detector dataset from Kaggle and place phishing.csv in the data/ folder.

3. Train Models

cd backend
python train_url.py      # Trains RandomForest URL classifier (~97% accuracy)
python train_email.py    # Trains TF-IDF email classifier

4. Start the API

cd backend               # skip this if you are still in backend/ from step 3
uvicorn main:app --port 8000

5. Load the Extension

1. Open chrome://extensions/ in Chrome/Brave/Edge
2. Enable Developer mode
3. Click Load unpacked → select the extension/ folder
4. The Sentinel icon will appear in your toolbar

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| `POST` | `/predict/url` | Scan a URL → risk score + SHAP reasons |
| `POST` | `/predict/email` | Scan email body + headers → risk score + reasons |
| `POST` | `/scan/file` | Upload a file (max 10 MB) for static malware analysis |
| `POST` | `/feedback` | Submit a correction (triggers retrain at 500) |
| `GET` | `/feedback/status` | Check progress toward next auto-retrain |

Example — Scan a URL

curl -X POST http://127.0.0.1:8000/predict/url \
  -H "Content-Type: application/json" \
  -d '{"url": "http://192.168.1.1/paypal-login/secure"}'
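
The same request can be built from Python using only the standard library. This sketch constructs the request without sending it, so it runs even when the backend is down; the helper name `build_scan_request` is an assumption, while the address, path, and JSON body mirror the curl example.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"  # same local address as the curl example


def build_scan_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /predict/url."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/predict/url",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_scan_request("http://192.168.1.1/paypal-login/secure")
print(req.get_method(), req.full_url)

# With the backend running, send it like this:
# with urllib.request.urlopen(req) as response:
#     print(json.load(response))
```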

Example — Check Retrain Progress

curl http://127.0.0.1:8000/feedback/status
# → {"feedback_count": 42, "retrain_threshold": 500, "progress_percent": 8.4}
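
The `progress_percent` field in that response is consistent with a simple ratio of corrections to the threshold. A minimal sketch of how such a payload could be computed follows; the function name `feedback_status` and the rounding to one decimal place are assumptions inferred from the example output, not taken from the backend code.

```python
def feedback_status(feedback_count: int, retrain_threshold: int = 500) -> dict:
    """Build a status payload shaped like the example response."""
    progress = round(100 * feedback_count / retrain_threshold, 1)
    return {
        "feedback_count": feedback_count,
        "retrain_threshold": retrain_threshold,
        "progress_percent": progress,
    }


print(feedback_status(42))
# → {'feedback_count': 42, 'retrain_threshold': 500, 'progress_percent': 8.4}
```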

Auto-Retrain Feedback Loop

The model improves over time through user feedback:

1. When a user clicks "I understand, proceed" on a blocked page, a correction is stored with the URL's 30 extracted features.
2. Corrections accumulate in backend/feedback_log.csv.
3. At 500 corrections, the system automatically:
   - Merges the original 11K Kaggle samples with the 500 user-labeled corrections
   - Retrains the RandomForest classifier (200 trees)
   - Saves the updated model and hot-reloads it (no server restart)
   - Clears the feedback file, so the cycle resets for the next 500
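
The accumulate-then-retrain cycle above can be sketched with a CSV log and a callback. This is an illustration under stated assumptions, not the contents of `feedback_store.py`: the function `record_correction`, the row layout, and the `retrain` hook are hypothetical, and the demo substitutes a stub for real model training.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable

RETRAIN_THRESHOLD = 500  # value stated in this README


def record_correction(log_path: Path, row: list, retrain: Callable[[Path], None],
                      threshold: int = RETRAIN_THRESHOLD) -> bool:
    """Append one correction; retrain and clear the log once the threshold is hit."""
    with log_path.open("a", newline="") as handle:
        csv.writer(handle).writerow(row)

    with log_path.open(newline="") as handle:
        count = sum(1 for _ in csv.reader(handle))

    if count < threshold:
        return False

    retrain(log_path)        # hypothetical hook: merge data, refit, hot-reload
    log_path.write_text("")  # clear the log so the cycle starts over
    return True


# Demo with a tiny threshold and a stub in place of real retraining.
calls = []
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "feedback_log.csv"
    results = [
        record_correction(log, [f"http://site{i}.example", 0], calls.append, threshold=3)
        for i in range(4)
    ]
print(results)
# → [False, False, True, False]
```

Passing the retrain step in as a callback keeps the storage logic testable without loading a model.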

Weekly Digital Hygiene Report

Click "View Weekly Report" in the extension popup to see:

- Stat cards: total scans, safe sites, threats blocked, unique domains
- Safety Score: color-coded ring (Excellent / Good / Fair / Poor)
- Daily bar chart: safe vs. phishing breakdown for the last 7 days
- Top flagged domains: the riskiest sites you've encountered

All report data is stored locally in Chrome storage; none of it is sent to any server.
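
The daily bar chart boils down to counting verdicts per day over a 7-day window. The extension does this in JavaScript against Chrome storage; the sketch below shows the same idea in Python purely for illustration, and the record shape (`day`, `verdict`) and function name `daily_breakdown` are assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def daily_breakdown(scans: list[dict], today: date) -> dict[str, dict[str, int]]:
    """Count safe and phishing verdicts per day for the 7 days ending today."""
    days = [today - timedelta(days=offset) for offset in range(6, -1, -1)]
    counts = Counter((scan["day"], scan["verdict"]) for scan in scans)
    return {
        day.isoformat(): {
            "safe": counts[(day.isoformat(), "safe")],
            "phishing": counts[(day.isoformat(), "phishing")],
        }
        for day in days
    }


scans = [
    {"day": "2024-05-06", "verdict": "safe"},
    {"day": "2024-05-06", "verdict": "phishing"},
    {"day": "2024-05-07", "verdict": "safe"},
    {"day": "2024-04-01", "verdict": "safe"},  # outside the 7-day window, ignored
]
report = daily_breakdown(scans, today=date(2024, 5, 7))
print(report["2024-05-06"], report["2024-05-07"])
# → {'safe': 1, 'phishing': 1} {'safe': 1, 'phishing': 0}
```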

File Download Scanner

Every downloaded file under 10 MB is automatically scanned with 6 analysis layers:

| Layer | What It Catches |
| --- | --- |
| Dangerous Extensions | `.exe`, `.bat`, `.ps1`, `.vbs`, `.scr`, `.msi`, `.jar`, `.sh` and more |
| Double Extensions | Files like `invoice.pdf.exe` that disguise their true type |
| Magic Byte Analysis | Mismatches between the file extension and the actual content (e.g. a `.pdf` that's really an `.exe`) |
| Entropy Analysis | High Shannon entropy, which suggests packed or encrypted content and is common in malware |
| Suspicious Strings | PowerShell commands, base64 decode, `eval()`, registry edits, reverse shells |
| Office Macro Indicators | VBA macros, `AutoOpen`, `Shell`, `CreateObject` in Office files |

Results appear as a Chrome notification immediately after download completes.
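
Three of these layers are simple enough to sketch with the standard library. The code below is an illustration, not the project's `file_scanner.py`: the helper names, the small signature table, and the shortened extension set are assumptions, and no entropy threshold is implied.

```python
import math
from collections import Counter
from pathlib import PurePath

# A few well-known file signatures (magic bytes); illustrative, not exhaustive.
MAGIC = {".pdf": b"%PDF", ".png": b"\x89PNG", ".zip": b"PK\x03\x04", ".exe": b"MZ"}
DANGEROUS = {".exe", ".bat", ".ps1", ".vbs", ".scr", ".msi", ".jar", ".sh"}


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    probabilities = (count / total for count in Counter(data).values())
    return sum(p * math.log2(1 / p) for p in probabilities)


def has_double_extension(filename: str) -> bool:
    """True for names like invoice.pdf.exe: a dangerous suffix after another suffix."""
    suffixes = [suffix.lower() for suffix in PurePath(filename).suffixes]
    return len(suffixes) >= 2 and suffixes[-1] in DANGEROUS


def magic_mismatch(filename: str, data: bytes) -> bool:
    """True when the extension has a known signature that the content lacks."""
    expected = MAGIC.get(PurePath(filename).suffix.lower())
    return expected is not None and not data.startswith(expected)


fake_pdf = b"MZ\x90\x00" + bytes(60)  # starts with the Windows executable signature
print(has_double_extension("invoice.pdf.exe"))  # → True
print(magic_mismatch("report.pdf", fake_pdf))   # → True
print(shannon_entropy(bytes(range(256))))       # → 8.0
```

Entropy alone is a weak signal, since ordinary compressed archives and images also score high, so it makes sense to combine it with the other layers rather than act on it in isolation.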

# Manual test via API
curl -X POST http://127.0.0.1:8000/scan/file -F "file=@suspicious_file.exe"

Tech Stack

Backend: Python, FastAPI, scikit-learn, SHAP, Pandas
ML Models: RandomForest (URL), Naive Bayes + TF-IDF (Email)
Extension: JavaScript, Chrome Manifest V3
Dataset: Kaggle Phishing Website Detector

License

MIT
