Sentinel AI — Phishing Early Warning System

An AI-powered browser extension that detects phishing URLs and suspicious emails in real time using explainable machine learning. Built for students interested in cybersecurity.


Features

| Feature | Description |
| --- | --- |
| URL Classifier | RandomForest trained on the Kaggle "Phishing Website Detector" dataset (11K+ samples, 97% accuracy) |
| Email Text Scanner | TF-IDF + Naive Bayes model detects phishing language in email bodies |
| Header Anomaly Checks | Detects SPF/DKIM failures and From/Reply-To mismatches |
| SHAP Explainability | Every prediction comes with plain-English reasons (e.g. "The URL uses a raw IP address instead of a domain name") |
| Site Blocking | Phishing sites are blocked before loading with a full-screen interstitial warning page |
| File Download Scanner | Automatically scans files under 10MB on download using 6-layer static analysis (magic bytes, entropy, suspicious strings, etc.) |
| Auto-Retrain Feedback Loop | Model learns from user corrections and auto-retrains after every 500 reports |
| Weekly Digital Hygiene Report | Tracks browsing habits locally and displays a visual dashboard with safety score, daily chart, and flagged domains |
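
As a rough illustration of the Header Anomaly Checks feature, the sketch below flags a From/Reply-To domain mismatch and SPF/DKIM failures recorded in an `Authentication-Results` header. The function name `check_headers` and the wording of the returned reasons are assumptions made for this example; the project's `headers_check.py` may apply different rules.

```python
from email import message_from_string
from email.utils import parseaddr


def check_headers(raw_headers: str) -> list[str]:
    """Return plain-English warnings for a raw header block (hypothetical helper)."""
    msg = message_from_string(raw_headers)
    reasons = []

    # Compare the domain of From with the domain of Reply-To, if both are present.
    from_domain = parseaddr(msg.get("From", ""))[1].rpartition("@")[2].lower()
    reply_domain = parseaddr(msg.get("Reply-To", ""))[1].rpartition("@")[2].lower()
    if from_domain and reply_domain and from_domain != reply_domain:
        reasons.append(
            f"Reply-To domain ({reply_domain}) differs from From domain ({from_domain})"
        )

    # Receiving mail servers commonly record SPF/DKIM verdicts in Authentication-Results.
    auth_results = " ".join(msg.get_all("Authentication-Results", [])).lower()
    for mechanism in ("spf", "dkim"):
        if f"{mechanism}=fail" in auth_results:
            reasons.append(f"{mechanism.upper()} check failed")
    return reasons
```

This only matches an explicit `fail` verdict; softer results such as `softfail` or `neutral` would need their own handling.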

How It Works

User visits a URL
    ↓
Background script intercepts BEFORE page loads
    ↓
Sends URL to FastAPI backend → 30 features extracted from the URL → RandomForest predicts
    ↓
┌─ Safe → Page loads normally
└─ Phishing → Page BLOCKED, interstitial shown with:
       • Risk percentage gauge
       • SHAP-generated plain-English reasons
       • "Go Back to Safety" / "I understand, proceed" buttons
           ↓
   If user clicks "proceed" → feedback correction stored
       ↓
   After 500 corrections → MODEL AUTO-RETRAINS
       • Merges original Kaggle data + user corrections
       • Retrains RandomForest (200 trees)
       • Hot-reloads model (no restart needed)
       • Clears feedback file, cycle repeats
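
To make the feature-extraction step in the flow above more concrete, here is a minimal sketch that computes three URL features. Only the raw-IP check is grounded in this README (it appears as an example SHAP reason); the other two features, the function name `extract_sample_features`, and the dictionary keys are assumptions, and the project's actual 30 features may be defined differently.

```python
import ipaddress
from urllib.parse import urlparse


def extract_sample_features(url: str) -> dict[str, int]:
    """Compute three illustrative URL features (not the project's full set of 30)."""
    host = urlparse(url).hostname or ""

    # Does the host parse as a raw IPv4/IPv6 address instead of a domain name?
    try:
        ipaddress.ip_address(host)
        uses_ip = 1
    except ValueError:
        uses_ip = 0

    return {
        "uses_ip_address": uses_ip,
        "has_at_symbol": int("@" in url),  # assumed feature for illustration
        "url_length": len(url),            # assumed feature for illustration
    }
```

A vector like this would then be passed to the trained classifier, which returns the phishing probability shown in the interstitial.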

Architecture

phishing_detector/
├── backend/
│   ├── main.py                  # FastAPI server with all endpoints
│   ├── train_url.py             # Train URL model on Kaggle dataset
│   ├── train_email.py           # Train email text model
│   ├── feedback_store.py        # Feedback CSV storage + auto-retrain logic
│   ├── models/
│   │   ├── url_model.py         # 30-feature URL extractor + prediction
│   │   ├── email_model.py       # TF-IDF email classifier
│   │   ├── headers_check.py     # SPF/DKIM/Reply-To rule checks
│   │   └── file_scanner.py      # 6-layer static file analysis
│   └── explain/
│       └── shap_explainer.py    # SHAP explanations → plain English
├── extension/
│   ├── manifest.json            # Chrome Manifest V3
│   ├── background.js            # Intercepts navigation + logs scans + download scanner
│   ├── content.js               # Injects warning banners on pages
│   ├── blocked.html / .js       # Full-page interstitial for blocked sites
│   ├── report.html / .js        # Weekly Digital Hygiene Report dashboard
│   ├── popup.html / .js         # Extension popup dashboard
│   └── styles.css               # All styling
└── data/                        # Kaggle dataset (not in git)

Quick Start

1. Clone and Install Dependencies

git clone https://github.com/swarnim-dev/sentinel-ai.git
cd sentinel-ai
python3 -m venv venv
source venv/bin/activate
pip install -r backend/requirements.txt

2. Download Dataset

Download the Phishing Website Detector dataset from Kaggle and place phishing.csv in the data/ folder.

3. Train Models

cd backend
python train_url.py      # Trains RandomForest URL classifier (~97% accuracy)
python train_email.py    # Trains TF-IDF email classifier

4. Start the API

cd backend               # skip if you are still in backend/ from step 3
uvicorn main:app --port 8000

5. Load the Extension

  1. Open chrome://extensions/ in Chrome/Brave/Edge
  2. Enable Developer mode
  3. Click Load unpacked → select the extension/ folder
  4. The Sentinel icon will appear in your toolbar

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /predict/url | Scan a URL → risk score + SHAP reasons |
| POST | /predict/email | Scan email body + headers → risk score + reasons |
| POST | /scan/file | Upload a file (max 10MB) for static malware analysis |
| POST | /feedback | Submit a correction (triggers retrain at 500) |
| GET | /feedback/status | Check progress toward next auto-retrain |

Example — Scan a URL

curl -X POST http://127.0.0.1:8000/predict/url \
  -H "Content-Type: application/json" \
  -d '{"url": "http://192.168.1.1/paypal-login/secure"}'
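
The same request can be made from Python using only the standard library. This is a sketch that mirrors the curl call above; the helper names `build_url_scan_request` and `scan_url` are hypothetical, and the response is returned as a plain dictionary because its exact fields are not documented here.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"  # matches the uvicorn command in Quick Start


def build_url_scan_request(url: str) -> urllib.request.Request:
    """Build the POST /predict/url request without sending it."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/predict/url",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan_url(url: str) -> dict:
    """Send the request and decode the JSON reply (the API must be running)."""
    with urllib.request.urlopen(build_url_scan_request(url), timeout=10) as resp:
        return json.load(resp)
```

Calling `scan_url("http://192.168.1.1/paypal-login/secure")` with the backend running should return the risk score and reasons described in the endpoint table.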

Example — Check Retrain Progress

curl http://127.0.0.1:8000/feedback/status
# → {"feedback_count": 42, "retrain_threshold": 500, "progress_percent": 8.4}
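
The `progress_percent` value in the sample output is consistent with a simple ratio of stored corrections to the threshold. A minimal sketch, assuming that is how the field is computed (the function name `feedback_status` is made up for this example):

```python
def feedback_status(feedback_count: int, retrain_threshold: int = 500) -> dict:
    """Return a status payload shaped like the sample output above."""
    # Percentage of the way to the next retrain, rounded to one decimal place.
    progress = round(100 * feedback_count / retrain_threshold, 1)
    return {
        "feedback_count": feedback_count,
        "retrain_threshold": retrain_threshold,
        "progress_percent": progress,
    }
```

With 42 stored corrections and a threshold of 500, this yields 8.4 percent, matching the example response.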

Auto-Retrain Feedback Loop

The model improves over time through user feedback:

  1. When a user clicks "I understand, proceed" on a blocked page, a correction is stored with the URL's 30 extracted features
  2. Corrections accumulate in backend/feedback_log.csv
  3. At 500 corrections, the system automatically:
    • Merges the original 11K Kaggle samples with the 500 user-labeled corrections
    • Retrains the RandomForest classifier (200 trees)
    • Saves the updated model and hot-reloads it (no server restart)
    • Clears the feedback file — the cycle resets for the next 500
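
The cycle above can be sketched as a small store that appends corrections to a CSV file and fires a retrain callback once the threshold is reached. The class name `FeedbackStore`, its method names, and the callback signature are assumptions for illustration; the real `feedback_store.py` may be organized differently, and the callback stands in for the merge, retrain, and hot-reload steps.

```python
import csv
from pathlib import Path
from typing import Callable


class FeedbackStore:
    """Append corrections to a CSV and trigger a retrain at a threshold (sketch)."""

    def __init__(
        self,
        path: Path,
        retrain: Callable[[list[list[str]]], None],
        threshold: int = 500,
    ):
        self.path = path
        self.retrain = retrain      # hypothetical callback: merge, refit, hot-reload
        self.threshold = threshold

    def rows(self) -> list[list[str]]:
        """Read every stored correction (empty list if the file is missing or empty)."""
        if not self.path.exists():
            return []
        with self.path.open(newline="") as f:
            return list(csv.reader(f))

    def add(self, features: list[float], label: int) -> bool:
        """Store one correction; return True if this call triggered a retrain."""
        with self.path.open("a", newline="") as f:
            csv.writer(f).writerow([*features, label])
        stored = self.rows()
        if len(stored) < self.threshold:
            return False
        self.retrain(stored)         # hand all corrections to the retraining step
        self.path.write_text("")     # clear the file so the next cycle starts at zero
        return True
```

Re-reading the file on every call keeps the sketch short; a production version would more likely keep a running count.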

Weekly Digital Hygiene Report

Click "View Weekly Report" in the extension popup to see:

  • Stat cards — Total scans, safe sites, threats blocked, unique domains
  • Safety Score — Color-coded ring (Excellent / Good / Fair / Poor)
  • Daily bar chart — Safe vs phishing breakdown for the last 7 days
  • Top flagged domains — Riskiest sites you've encountered

All data is stored locally in Chrome storage — nothing is sent to any server.


File Download Scanner

Every file you download (under 10MB) is automatically scanned with 6 analysis layers:

| Layer | What It Catches |
| --- | --- |
| Dangerous Extensions | .exe, .bat, .ps1, .vbs, .scr, .msi, .jar, .sh and more |
| Double Extensions | Files like invoice.pdf.exe that disguise their true type |
| Magic Byte Analysis | Detects mismatches between file extension and actual content (e.g. a .pdf that's really an .exe) |
| Entropy Analysis | High Shannon entropy, a common sign of packed or encrypted payloads |
| Suspicious Strings | PowerShell commands, base64 decode, eval(), registry edits, reverse shells |
| Office Macro Indicators | VBA macros, AutoOpen, Shell, CreateObject in Office files |

Results appear as a Chrome notification immediately after download completes.

# Manual test via API
curl -X POST http://127.0.0.1:8000/scan/file -F "file=@suspicious_file.exe"
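
Three of the layers listed above are easy to sketch with the standard library: Shannon entropy, double-extension detection, and a single magic-byte check. The function names are hypothetical, the extension set copies only the examples from the table, and no entropy threshold is shown because this README does not state one; `file_scanner.py` may implement these layers differently.

```python
import math
from collections import Counter
from pathlib import PurePath

# Extensions named in the table above; the real list is said to contain more.
DANGEROUS = {".exe", ".bat", ".ps1", ".vbs", ".scr", ".msi", ".jar", ".sh"}


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for a uniform byte mix."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def has_double_extension(filename: str) -> bool:
    """True for names like invoice.pdf.exe: a dangerous suffix after another suffix."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    return len(suffixes) >= 2 and suffixes[-1] in DANGEROUS


def looks_like_windows_executable(data: bytes) -> bool:
    """Windows executables begin with the two magic bytes 'MZ'."""
    return data[:2] == b"MZ"
```

A file named `report.pdf` whose first bytes are `MZ` would be an example of the extension/content mismatch the magic-byte layer is meant to catch. Note that the double-extension check treats any dotted name ending in a dangerous suffix as suspicious, so a name like `my.tool.exe` would also be flagged.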

Tech Stack

  • Backend: Python, FastAPI, scikit-learn, SHAP, Pandas
  • ML Models: RandomForest (URL), Naive Bayes + TF-IDF (Email)
  • Extension: JavaScript, Chrome Manifest V3
  • Dataset: Kaggle Phishing Website Detector

License

MIT
