Guide: Dr. Sina Keshvadi
Phishing attacks constitute one of the most pervasive and evolving cybersecurity threats, exploiting human vulnerability to deceive users into divulging sensitive credentials, financial information, and personal data through fraudulent websites. Traditional rule-based and heuristic detection systems struggle to generalize to novel attack patterns, while conventional machine learning classifiers trained on static datasets exhibit limited robustness against adversarial samples and distribution shift.
This project addresses the challenge of building a robust, generalizable phishing detection system that can withstand adversarial manipulation. The central research question is: Can synthetic data augmentation via Generative Adversarial Networks (GANs) improve the resilience of gradient boosting classifiers against adversarial and previously unseen phishing patterns?
The proposed framework combines multi-modal feature extraction, adversarial data augmentation, and gradient boosting classification into a reproducible, end-to-end pipeline.
- Feature Engineering: Extract 15 interpretable URL, domain, and HTML/JavaScript-based features from raw URLs and webpage content.
- Data Preprocessing: Clean missing values, remove duplicates, and apply standardization to ensure numerical stability.
- GAN Augmentation (optional): Train a GAN on phishing-class samples to learn the feature distribution and generate synthetic phishing instances for adversarial training.
- Classification: Train gradient boosting models (XGBoost, LightGBM, CatBoost) on the original or augmented dataset.
- Evaluation: Assess performance using F1 score, precision, recall, ROC-AUC, and accuracy on a held-out test set.
- Gradient Boosting: Consistently strong performance on tabular data, interpretability via feature importance, and insensitivity to feature scaling, since tree splits depend only on the ordering of feature values.
- GAN Augmentation: Adds synthetic phishing samples drawn from the learned phishing feature distribution, exposing the classifier to feature combinations absent from the collected data, with the goal of improving generalization.
- Stratified Splits: Preserves class balance during train/test partitioning for reliable metric estimation.
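The preprocessing step listed above (clean missing values, remove duplicates, standardize) could look roughly like the following. This is a minimal sketch using pandas and scikit-learn, not the project's actual `src.data.preprocess` module; the function name `preprocess` and the label column name `Label` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame, label_col: str = "Label") -> pd.DataFrame:
    """Drop incomplete and duplicate rows, then standardize the feature columns.

    `label_col` is a hypothetical column name; adjust it to the real dataset.
    """
    cleaned = df.dropna().drop_duplicates().reset_index(drop=True)
    feature_cols = [c for c in cleaned.columns if c != label_col]
    # Zero mean, unit variance per feature column.
    scaled = pd.DataFrame(
        StandardScaler().fit_transform(cleaned[feature_cols]),
        columns=feature_cols,
    )
    scaled[label_col] = cleaned[label_col].astype(int).to_numpy()
    return scaled
```

Note that this sketch fits the scaler on the full dataset. Fitting it on the training split only is a common alternative when test-set statistics must not influence training.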
| Source | Description | File(s) |
|---|---|---|
| JPCERT/CC | Japan Computer Emergency Response Team — phishing URLs from security incidents | 202409.csv, 202410.csv |
| University of New Brunswick (UNB) | Curated legitimate URLs for phishing research (URL-2016) | Legit_datasets.csv |
| OpenPhish | Real-time phishing feed | feed.txt |
- Legitimate samples: 5,000 URLs (post-deduplication: ~2,600 unique)
- Phishing samples: 5,000 URLs (post-deduplication: ~2,600 unique)
- Labels: 0 = legitimate, 1 = phishing
- Train/Test split: 80% / 20%, stratified
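A stratified 80/20 split can be expressed with scikit-learn's `train_test_split`. The snippet below is a sketch on placeholder data (a random 15-feature binary matrix), not the project's dataset; the seed 42 mirrors the default of the `--seed` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real feature matrix and labels.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(1000, 15))
y = np.array([0] * 500 + [1] * 500)  # 0 = legitimate, 1 = phishing

# stratify=y keeps the class ratio identical in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```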
Features are extracted from URL structure, domain metadata, and HTML/JavaScript behavior. All features are binary (0/1) or low-cardinality categorical, suitable for gradient boosting.

URL-based features:

| Feature | Description |
|---|---|
| Have_IP | URL contains an IP address instead of a domain name |
| Have_At | Presence of @ symbol (can obscure true destination) |
| URL_Length | Long URL (obfuscation indicator) |
| URL_Depth | Number of path segments (deeper paths may indicate mimicry) |
| Redirection | Multiple consecutive slashes // in path |
| https_Domain | http/https embedded in domain name |
| TinyURL | Use of URL shortening services |
| Prefix/Suffix | Hyphen in domain (typosquatting) |
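The URL-based features in the table above can be approximated with the Python standard library alone. The following is an illustrative sketch, not the project's extractor: the shortener list `SHORTENERS` and the length cut-off `LONG_URL_THRESHOLD` are hypothetical values chosen for the example.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical values; the project's actual list and threshold are not stated.
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly", "is.gd"}
LONG_URL_THRESHOLD = 54


def _is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Compute the eight URL-based features for a single URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    depth = len([segment for segment in parsed.path.split("/") if segment])
    return {
        "Have_IP": int(_is_ip(host)),
        "Have_At": int("@" in url),
        "URL_Length": int(len(url) >= LONG_URL_THRESHOLD),
        "URL_Depth": depth,
        "Redirection": int("//" in parsed.path),
        "https_Domain": int("http" in host),
        "TinyURL": int(host in SHORTENERS),
        "Prefix/Suffix": int("-" in host),
    }
```

For example, `url_features("https://secure-login.example.com/")` sets `Prefix/Suffix` to 1 because of the hyphen in the host, while all other flags stay 0.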

Domain-based features:

| Feature | Description |
|---|---|
| DNS_Record | Valid DNS record present |
| Domain_Age | Domain registration age |
| Domain_End | Domain expiration period |

HTML/JavaScript-based features:

| Feature | Description |
|---|---|
| iFrame | Presence of iframe redirection |
| Mouse_Over | Status bar customization via JavaScript |
| Right_Click | Right-click disabled |
| Web_Forwards | Automatic forwarding (meta-refresh, redirects) |
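The four HTML/JavaScript features can be approximated with regular expressions over the page source. This is a rough heuristic sketch, not the project's extractor: the patterns, the `n_redirects` parameter, and the redirect threshold of 2 are all assumptions for illustration.

```python
import re


def html_features(html: str, n_redirects: int = 0) -> dict:
    """Flag the four HTML/JavaScript behaviours with simple regex heuristics.

    `n_redirects` is a hypothetical count of HTTP redirects observed while
    fetching the page; the threshold of 2 below is an assumed cut-off.
    """
    flags = re.IGNORECASE
    has_iframe = re.search(r"<\s*iframe\b", html, flags) is not None
    status_bar = re.search(r"onmouseover\s*=[^>]*window\.status", html, flags) is not None
    right_click_off = (
        re.search(r"event\.button\s*==\s*2|oncontextmenu\s*=", html, flags) is not None
    )
    meta_refresh = (
        re.search(r"<\s*meta[^>]+http-equiv\s*=\s*[\"']?refresh", html, flags) is not None
    )
    return {
        "iFrame": int(has_iframe),
        "Mouse_Over": int(status_bar),
        "Right_Click": int(right_click_off),
        "Web_Forwards": int(meta_refresh or n_redirects > 2),
    }
```

A real extractor would more likely parse the DOM than pattern-match raw text; the regex form is used here only to keep the example dependency-free.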

GAN architecture and training settings:

| Component | Architecture |
|---|---|
| Generator | Linear(100 → 128) → ReLU → Linear(128 → 256) → ReLU → Linear(256 → 15) → Sigmoid |
| Discriminator | Linear(15 → 256) → LeakyReLU(0.2) → Linear(256 → 128) → LeakyReLU(0.2) → Linear(128 → 1) → Sigmoid |
| Loss | Binary cross-entropy (BCE) |
| Optimizer | Adam (lr=0.0002) |
| Latent dimension | 100 |
The GAN is trained exclusively on phishing samples to learn their feature distribution; generated samples are binarized (threshold 0.5) before augmentation.
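The architecture table translates fairly directly into PyTorch. The sketch below follows the layer sizes, loss, learning rate, and 0.5 binarization threshold stated above; the function names `train_step` and `sample_phishing` are hypothetical, and Adam's beta parameters are left at library defaults because the source does not specify them.

```python
import torch
from torch import nn

LATENT_DIM = 100
N_FEATURES = 15

generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, N_FEATURES), nn.Sigmoid(),
)
discriminator = nn.Sequential(
    nn.Linear(N_FEATURES, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1), nn.Sigmoid(),
)
criterion = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=0.0002)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.0002)


def train_step(real: torch.Tensor) -> tuple[float, float]:
    """Run one discriminator update and one generator update on a real batch."""
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)
    fake = generator(torch.randn(batch, LATENT_DIM))

    # Discriminator: push real samples toward 1 and generated samples toward 0.
    opt_d.zero_grad()
    loss_d = criterion(discriminator(real), ones) + criterion(
        discriminator(fake.detach()), zeros
    )
    loss_d.backward()
    opt_d.step()

    # Generator: try to make the discriminator label generated samples as real.
    opt_g.zero_grad()
    loss_g = criterion(discriminator(fake), ones)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()


def sample_phishing(n: int) -> torch.Tensor:
    """Generate n synthetic phishing rows, binarized at the 0.5 threshold."""
    with torch.no_grad():
        return (generator(torch.randn(n, LATENT_DIM)) > 0.5).float()
```

In use, `train_step` would be called once per mini-batch of phishing-class feature rows for the configured number of epochs, after which `sample_phishing(2000)` would produce a synthetic set of the size passed as `--n-samples` in the usage example.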
| Model | Implementation | Key Hyperparameters |
|---|---|---|
| XGBoost | XGBClassifier | n_estimators, learning_rate, eval_metric='logloss' |
| LightGBM | LGBMClassifier | n_estimators, learning_rate, verbose=-1 |
| CatBoost | CatBoostClassifier | iterations, learning_rate, verbose=0 |
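One way to wire the three classifiers to a single `--model` switch is a small factory. This is a sketch under assumptions, not the project's `train.py`: the function name `build_classifier` is hypothetical, and the imports are deferred so that only the selected library needs to be installed.

```python
def build_classifier(name: str, n_estimators: int = 100,
                     learning_rate: float = 0.01, seed: int = 42):
    """Return an unfitted gradient boosting classifier for the given model name."""
    if name == "xgboost":
        from xgboost import XGBClassifier
        return XGBClassifier(n_estimators=n_estimators, learning_rate=learning_rate,
                             eval_metric="logloss", random_state=seed)
    if name == "lightgbm":
        from lightgbm import LGBMClassifier
        return LGBMClassifier(n_estimators=n_estimators, learning_rate=learning_rate,
                              verbose=-1, random_state=seed)
    if name == "catboost":
        from catboost import CatBoostClassifier
        # CatBoost names the number of boosting rounds `iterations`.
        return CatBoostClassifier(iterations=n_estimators, learning_rate=learning_rate,
                                  verbose=0, random_seed=seed)
    raise ValueError(f"Unknown model: {name!r}")
```

All three returned objects follow the scikit-learn estimator interface, so `fit(X_train, y_train)` and `predict_proba(X_test)` work the same way regardless of which library is chosen.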
GAN-Enhanced-Phishing-Detection/
├── data/
│ ├── raw/ # Raw URLs (JPCERT, UNB, OpenPhish)
│ └── processed/ # Feature matrices, cleaned dataset
├── src/
│ ├── data/ # Preprocessing, dataset loading
│ ├── models/ # GAN, classifiers
│ ├── training/ # Training scripts
│ ├── evaluation/ # Metrics, error analysis
│ └── utils/ # Reproducibility (seeds)
├── experiments/ # Experiment logs
├── tests/ # Unit tests
├── notebooks/ # Exploration, analysis
├── requirements.txt
├── train.py # Entry point
└── README.md
git clone https://github.com/Varun-Mayilvaganan/GAN-Enhanced-Phishing-Detection
cd GAN-Enhanced-Phishing-Detection
python -m venv venv
# Windows: venv\Scripts\activate
# Unix: source venv/bin/activate
pip install -r requirements.txt

Step 1: Preprocessing (requires legitimate.csv and phishing.csv in data/processed/):

python -m src.data.preprocess --input-dir data/processed --output-dir data/processed

Step 2: GAN Training (optional):

python -m src.training.train_gan --epochs 500 --batch-size 64 --learning-rate 0.0002 --n-samples 2000

Step 3: Classifier Training:

python train.py --model xgboost --learning_rate 0.01 --epochs 50 --batch_size 64

With GAN augmentation:

python train.py --model xgboost --gan-samples data/processed/GAN_phishing_samples.csv --gan-ratio 0.5 --epochs 100

| Argument | Description | Default |
|---|---|---|
| --model | Classifier: xgboost, lightgbm, catboost | xgboost |
| --epochs | Boosting rounds (n_estimators) | 100 |
| --learning_rate | Learning rate | 0.01 |
| --batch_size | Batch size (for logging) | 64 |
| --gan-samples | Path to GAN-generated phishing CSV | None |
| --gan-ratio | Fraction of phishing count to augment | 0.0 |
| --seed | Random seed for reproducibility | 42 |
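The interaction between `--gan-samples` and `--gan-ratio` can be illustrated as follows. This sketch assumes one plausible reading of "fraction of phishing count to augment": the number of synthetic rows added equals `gan_ratio` times the number of phishing rows in the training set. The function name `augment` and this interpretation are assumptions, not confirmed behaviour of `train.py`.

```python
import numpy as np


def augment(X_train, y_train, X_gan, gan_ratio: float, seed: int = 42):
    """Append a random subset of GAN samples to the training set as phishing rows.

    Assumed semantics: n_added = gan_ratio * (number of phishing training rows),
    capped at the number of available GAN samples.
    """
    n_phishing = int((y_train == 1).sum())
    n_added = min(int(gan_ratio * n_phishing), len(X_gan))
    rng = np.random.default_rng(seed)
    picked = rng.choice(len(X_gan), size=n_added, replace=False)
    X_aug = np.vstack([X_train, X_gan[picked]])
    y_aug = np.concatenate([y_train, np.ones(n_added, dtype=y_train.dtype)])
    return X_aug, y_aug
```

Under this reading, a ratio of 0.0 (the default) leaves the training set unchanged, and synthetic rows are added only to the training partition so the held-out test set stays free of generated data.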
Performance is evaluated on a stratified 20% held-out test set:
| Metric | Definition |
|---|---|
| F1 Score | Harmonic mean of precision and recall |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| ROC-AUC | Area under the receiver operating characteristic curve |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
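The count-based definitions in the table can be checked with a few lines of plain Python. The helper name `classification_metrics` and the confusion counts in the usage note are illustrative only; ROC-AUC is omitted because it requires predicted scores rather than counts.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

With hypothetical counts `tp=90, fp=10, tn=85, fn=15`, precision is 0.9 and accuracy is 0.875. The same harmonic-mean formula applied to a precision of 0.898 and a recall of 1.000 gives 2 × 0.898 / 1.898 ≈ 0.946.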
Results are logged to experiments/experiment_log.csv for experiment tracking.
| Model | F1 Score | Precision | Recall | ROC-AUC | Accuracy |
|---|---|---|---|---|---|
| XGBoost | 0.946 | 0.898 | 1.000 | — | 0.898 |
| LightGBM | 0.946 | 0.898 | 1.000 | 0.970 | 0.898 |
Note: Results from stratified 80/20 split on cleaned dataset (5,196 samples).
| Model | Train Accuracy | Test Accuracy |
|---|---|---|
| Logistic Regression | 0.949 | 0.945 |
| Decision Tree | 0.953 | 0.948 |
| Random Forest | 0.948 | 0.944 |
| Support Vector Machine | 0.935 | 0.937 |
| XGBoost | 0.960 | 0.959 |
| LightGBM | 0.959 | 0.960 |
| CatBoost | 0.958 | 0.954 |
- Adversarial Evaluation: Benchmark classifiers against GAN-generated and hand-crafted adversarial samples to quantify robustness gains from augmentation.
- Deep Feature Extraction: Integrate learned representations from URL embeddings or webpage screenshots to capture semantic patterns beyond hand-crafted features.
- Real-Time Deployment: Package the pipeline as a REST API (e.g., FastAPI) for low-latency URL classification in production environments.
- Regularization and Cross-Validation: Apply k-fold cross-validation and hyperparameter tuning to reduce overfitting and improve generalization estimates.
- Explainability: Leverage SHAP or LIME to provide interpretable predictions for security analysts.
- Temporal Validation: Evaluate on time-split data to assess performance under distribution shift as phishing tactics evolve.
