PhishGuard AI — Real-Time Phishing Website Detection System

Short Summary

Real-time phishing URL detection using 57 engineered features and Gradient Boosting (scikit-learn).
Deployable system: Web UI + Flask backend + Chrome Extension (Manifest V3).
Designed for zero-day phishing URL detection with a local-first architecture (no blacklist dependency; no external ML inference APIs).

Overview

PhishGuard AI classifies URLs as Legitimate, Suspicious, or Phishing by extracting a fixed feature vector from URL structure and running inference with a scikit-learn Gradient Boosting model.

For safety, the system can return Suspicious when a URL is predicted Legitimate with low confidence.

The system is intended for real-world usage: fast predictions, consistent feature extraction, and a local-first deployment model.

Problem Statement

Phishing attacks remain one of the most common cybersecurity threats, frequently abusing lookalike domains and deceptive URL patterns.

Blacklist-driven approaches can lag behind newly created or previously unseen phishing URLs.

PhishGuard AI focuses on:

URL structure analysis instead of external databases
Pattern-based detection to support zero-day phishing URL detection
Real-time predictions integrated into the browser workflow

Key Capabilities

Extracts 57 engineered URL features deterministically for every prediction
Produces real-time classifications: Legitimate / Suspicious / Phishing
Runs without external ML inference APIs (local-first deployment)
Flags low-confidence outcomes as Suspicious to reduce risk
Captures user feedback and supports retraining from corrections
Ships with both a Web UI and a Chrome Extension (Manifest V3)

What Makes It Different

Zero-day oriented: targets URL-based phishing patterns rather than known-url lists
Local-first by design: prediction does not require third-party services or external ML endpoints
Operationally simple: feature extraction + Gradient Boosting inference for low-latency classification

System Architecture

┌──────────────┐     HTTP      ┌─────────────────┐
│  Web UI      │ ───────────▶  │ Flask Backend    │
│  (WebUI/)    │               │ (Backend/app.py) │
└──────────────┘               └────────┬─────────┘
                                        │
                                        │ feature extraction (57)
                                        │ + model inference
                                        ▼
                               ┌─────────────────┐
                               │ scikit-learn     │
                               │ Gradient Boosting│
                               └─────────────────┘

┌───────────────────────────────┐
│ Chrome Extension (Manifest V3) │
│ (Extension/)                   │
└───────────────┬───────────────┘
                │
                └── sends URL to backend for prediction

Workflow

A URL is submitted (Web UI or Chrome Extension).
The backend extracts 57 structural features from the URL.
The model runs inference and returns a classification with confidence.
- Low-confidence Legitimate outcomes may be returned as Suspicious.
Optional: the user submits feedback if the prediction is incorrect.
Feedback can be used to retrain the model.

Use Cases

Detect phishing websites in real time
Improve browser safety with a local-first, model-based URL analyzer
Support security education and ML prototyping for URL-based threat detection
Provide a baseline system for deploying URL-focused security tooling

Model Performance

Metric	Value
Accuracy	96.4%
Precision	95.8%
Recall	97.1%
F1 Score	96.3%
AUC-ROC	0.993
False Positive Rate	4.4%

Gradient Boosting was selected as the final model based on validation performance.

Tech Stack

Python (Flask backend)
scikit-learn (Gradient Boosting)
NumPy, Pandas
JavaScript (Web UI + Extension)
Chrome Extension (Manifest V3)

Quick Start

Prerequisites

Python 3.10+
Google Chrome (for the extension)

Installation

git clone https://github.com/VarshithReddy2006/PhishingWebsite_Detection.git
cd PhishingWebsite_Detection

Setup

powershell -ExecutionPolicy Bypass -File .\setup.ps1

Train and Run

python Backend/train.py
python Backend/app.py

Access the Web UI at http://127.0.0.1:5000

Chrome Extension Setup

Open chrome://extensions
Enable Developer mode
Select Load unpacked and choose Extension/

Project Structure

PhishingWebsite_Detection/
├── Backend/
│   ├── app.py           # Flask API server
│   ├── features.py      # 57-feature extraction
│   ├── train.py         # Model training
│   ├── retrain.py       # Feedback-based retraining
│   ├── utils.py         # Environment helpers
│   ├── dataset.csv      # Training data
│   └── feedback.csv     # User corrections
├── WebUI/
│   ├── index.html       # Dashboard
│   ├── feedback.html    # Feedback page
│   ├── styles.css       # UI styling
│   ├── app.js           # Dashboard logic
│   ├── feedback.js      # Feedback logic
│   └── shared.js        # Utilities
├── Extension/
│   ├── manifest.json    # Manifest V3
│   ├── background.js    # Service worker
│   ├── content.js       # Content script
│   ├── popup.html       # Popup UI
│   ├── popup.js         # Popup logic
│   └── shared.js        # Shared utilities
├── tools/
│   └── bootstrap.py
├── requirements.txt
├── setup.ps1
├── run.ps1
└── start_backend.cmd

API Endpoints

Method	Endpoint	Description
POST	`/predict`	Analyze URL for phishing
POST	`/feedback`	Submit prediction correction
POST	`/features`	Get 57-feature vector
GET	`/health`	Health check
GET	`/download/feedback`	Download feedback CSV
GET	`/download/dataset`	Download training dataset

Example: `/predict`

Request

{ "url": "https://example.com" }

Response

{
  "url": "https://example.com",
  "status_code": 0,
  "result": "Legitimate",
  "confidence": 0.97
}

Feature Categories

URL structure: length, dot/hyphen/slash counts, path depth
Domain analysis: hostname length, subdomain count, entropy, TLD risk
Character ratios: digit-to-letter ratio, special character frequency
Security signals: HTTPS token, IP usage, punycode detection
Suspicious patterns: phishing keywords, shorteners, abnormal subdomains
Word statistics: min/max/avg word length in host and path

Privacy & Deployment

Local-first architecture: the system is designed to run locally without external ML inference APIs.
No blacklist dependency: predictions are derived from feature extraction and model inference.
Feedback storage: user corrections are stored locally (e.g., Backend/feedback.csv) for retraining workflows.

Limitations

URL-only signals: the model operates on engineered URL features and does not analyze page content.
Dataset and feature bias: performance depends on how representative training data is of current threats.
Adversarial adaptation: attackers may change URL patterns to evade heuristic and feature-based detection.

Future Work

Improve monitoring and evaluation over time as threat patterns evolve
Expand feature engineering and retraining workflows based on user feedback
Enhance decision thresholds and confidence handling for high-risk classifications

Testing

# Start backend
python Backend/app.py

# Test URL
$body = @{ url = "https://google.com" } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri "http://127.0.0.1:5000/predict" -ContentType "application/json" -Body $body

# Retrain from feedback
python Backend/retrain.py

Development

Add new features

Add the feature name to FEATURE_COLUMNS in Backend/features.py
Implement extraction in get_structural_features()
Retrain: python Backend/train.py

Retrain model

python Backend/retrain.py

Security

Input validation on all endpoints
CORS configuration
Sanitized error messages
Robust CSV parsing
Low-confidence predictions flagged as Suspicious

Contributing

Fork the repository
Create feature branch (git checkout -b feature/YourFeature)
Commit changes (git commit -m 'Add YourFeature')
Push (git push origin feature/YourFeature)
Open a Pull Request

License

MIT License

Acknowledgments

scikit-learn — ML classifiers
Flask — Backend framework
Font Awesome — Icons

Support / Contact

varshithreddy6147@gmail.com bodakarthiknetha@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.vscode		.vscode
Backend		Backend
Extension		Extension
WebUI		WebUI
tools		tools
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
run.ps1		run.ps1
setup.ps1		setup.ps1
start_backend.cmd		start_backend.cmd

Folders and files

Latest commit

History

Repository files navigation

PhishGuard AI — Real-Time Phishing Website Detection System

Short Summary

Overview

Problem Statement

Key Capabilities

What Makes It Different

System Architecture

Workflow

Use Cases

Model Performance

Tech Stack

Quick Start

Prerequisites

Installation

Setup

Train and Run

Chrome Extension Setup

Project Structure

API Endpoints

Example: /predict

Feature Categories

Privacy & Deployment

Limitations

Future Work

Testing

Development

Security

Contributing

License

Acknowledgments

Support / Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Example: `/predict`

Packages