Production-ready AI-powered malware detection system covering four input types (Windows PE, PDF, URL, Office/VBA), deployed on AWS serverless infrastructure.
CLF09 is a multi-format malware detection platform that classifies threats across Windows PE executables, PDFs, URLs, and Office/VBA documents. Each detector runs as an independent AWS Lambda microservice, enabling isolated scaling and sub-3-second inference in production.
The system combines classical feature engineering (binary/assembly analysis, TF-IDF, structural features) with gradient boosting models (LightGBM and XGBoost). The Windows PE and Office/VBA detectors are additionally tuned with Optuna hyperparameter search. Accuracy on benchmark datasets reaches up to 99.82% (Windows PE).
| Detector | Algorithm | Dataset | Samples | Accuracy | Latency |
|---|---|---|---|---|---|
| Windows PE | LightGBM + Optuna | Microsoft BIG 2015 (Kaggle) | 21,000 | 99.82% | ~1.73s |
| PDF | LightGBM | CIC-Evasive-PDFMal2022 | 10,023 | 99.35% | ~1.26s |
| URL | XGBoost | ISCX-URL-2016 | 651,191 | 98.23% | ~0.50s |
| Office/VBA | XGBoost + Optuna | Custom (VirusTotal, Malware Bazaar) | 18,538 | 96.00% | ~2.10s |

Windows PE
- 1,615 features extracted from raw bytes and assembly code
- Detects: Ramnit, Lollipop, Kelihos, Vundo, Simda, Tracur, Obfuscator, Gatak + Benign
- LightGBM optimized with Optuna (50 trials)
- F1-Score: 99.82% · Precision: 99.81% · Recall: 99.83%

PDF
- 23 structural features: JavaScript presence, XFA forms, encryption, object counts
- Binary classification: malicious vs. benign

URL
- 5,015 features: 15 numerical + 5,000 TF-IDF tokens
- Classes: Benign · Defacement · Phishing · Malware
- Trained on 651K URLs from the ISCX-URL-2016 benchmark

Office/VBA
- 9 VBA-specific features: macro presence, auto-execution keywords, suspicious API calls
- Covers Word (.doc, .docx) and Excel (.xls, .xlsx)
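
The Office/VBA features listed above (macro presence, auto-execution keywords, suspicious API calls) can be illustrated with a minimal sketch. It assumes the macro source has already been extracted as text. The keyword lists and the names `AUTO_EXEC`, `SUSPICIOUS_CALLS`, and `vba_features` are hypothetical and are not taken from the project code, which uses 9 features rather than the 3 shown here.

```python
import re

# Hypothetical keyword lists, chosen for illustration only.
AUTO_EXEC = ("AutoOpen", "Auto_Open", "Document_Open", "Workbook_Open")
SUSPICIOUS_CALLS = ("Shell", "CreateObject", "URLDownloadToFile")


def _count(names, source):
    """Count whole-word, case-insensitive occurrences of any name in source."""
    pattern = r"\b(?:" + "|".join(map(re.escape, names)) + r")\b"
    return len(re.findall(pattern, source, flags=re.IGNORECASE))


def vba_features(macro_source: str) -> dict:
    """Turn already-extracted VBA source text into a small feature dict."""
    return {
        "has_macro": int(bool(macro_source.strip())),
        "auto_exec_count": _count(AUTO_EXEC, macro_source),
        "suspicious_call_count": _count(SUSPICIOUS_CALLS, macro_source),
    }


sample = (
    "Sub AutoOpen()\n"
    '    Set sh = CreateObject("WScript.Shell")\n'
    '    sh.Run "powershell"\n'
    "End Sub"
)
print(vba_features(sample))
```

A dict like this would then be flattened into a numeric vector before being passed to the classifier.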
┌──────────────────────────────────────────────┐
│                 Client Layer                 │
│      Browser · Mobile App · API Clients      │
└──────────────────┬───────────────────────────┘
                   │
          ┌────────▼────────┐
          │   API Gateway   │
          │   (REST API)    │
          └────────┬────────┘
                   │
     ┌─────────────┼─────────────┬─────────────┐
     │             │             │             │
┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐  ┌────▼─────┐
│  Lambda  │  │  Lambda  │  │  Lambda  │  │  Lambda  │
│ Windows  │  │   PDF    │  │   URL    │  │  Office  │
│ 2048 MB  │  │ 2048 MB  │  │ 2048 MB  │  │ 2048 MB  │
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     └─────────────┴──────┬──────┴─────────────┘
                          │
                 ┌────────▼────────┐
                 │    Amazon S3    │
                 │  (Model Store)  │
                 └─────────────────┘
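
The request path in the diagram (API Gateway to a per-detector Lambda that uses a model kept in S3) could look roughly like the sketch below. It is an assumption-laden outline, not the project's handler. `load_model`, `get_model`, and `handler` are hypothetical names, the model is a stand-in function so the code runs without AWS access, and the `file_content` key is borrowed from the Windows PE request example in this README.

```python
import base64
import json

_MODEL = None  # cached across warm invocations of the same Lambda container


def load_model():
    """Hypothetical loader: a real one would fetch the model artifact from S3."""
    # Stand-in classifier so the sketch runs locally.
    return lambda data: {"prediction": "benign", "confidence": 0.5}


def get_model():
    """Load the model once per cold start, then reuse it."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL


def _response(status, payload):
    """Build a response in the shape API Gateway proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context):
    """Entry point for an API Gateway proxy event carrying a Base64 file."""
    try:
        body = json.loads(event.get("body") or "{}")
        raw = base64.b64decode(body["file_content"], validate=True)
    except (KeyError, ValueError, TypeError):
        return _response(400, {"error": "expected JSON with Base64 'file_content'"})
    return _response(200, get_model()(raw))
```

Caching the model in a module-level variable means only the first request after a cold start pays the loading cost. Whether the deployed functions do this is not stated in this README.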
git clone https://github.com/Bassongo/malware_classification.git
cd malware_classification
pip install -r requirements.txt

Prerequisites: Python 3.11+ · Docker · AWS CLI configured
Base URL: https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod

| Endpoint | Method | Input |
|---|---|---|
| /detect/windows | POST | Base64-encoded .exe |
| /detect/pdf | POST | Base64-encoded .pdf |
| /detect/url | POST | URL string |
| /detect/office | POST | Base64-encoded .doc/.xls |
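
As a rough illustration of how the four endpoints differ only in their payload, the sketch below builds the request URL and JSON body for each detector. `build_request` is a hypothetical helper. The `url`, `file_content`, and `file_name` keys come from the request examples in this README, and reusing the file keys for `/detect/pdf` and `/detect/office` is an assumption.

```python
import base64
import json

BASE_URL = "https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod"
FILE_DETECTORS = {"windows", "pdf", "office"}


def build_request(kind, *, url=None, file_bytes=None, file_name=None):
    """Return (endpoint_url, json_body_bytes) for one detector."""
    if kind == "url":
        if not url:
            raise ValueError("the url detector needs a url string")
        payload = {"url": url}
    elif kind in FILE_DETECTORS:
        if file_bytes is None or not file_name:
            raise ValueError("file detectors need file_bytes and file_name")
        payload = {
            # Assumed to match the Windows PE example for every file detector.
            "file_content": base64.b64encode(file_bytes).decode("ascii"),
            "file_name": file_name,
        }
    else:
        raise ValueError(f"unknown detector: {kind}")
    return f"{BASE_URL}/detect/{kind}", json.dumps(payload).encode("utf-8")
```

The returned pair can be passed to `urllib.request.Request(endpoint, data=body, headers={"Content-Type": "application/json"}, method="POST")` or to any other HTTP client.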
Example: URL Detection

curl -X POST https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod/detect/url \
  -H "Content-Type: application/json" \
  -d '{"url": "http://suspicious-site.com/payload"}'

Response
{
  "prediction": "malware",
  "family": "trojan-downloader",
  "confidence": 0.97
}

Example: Windows PE (Python)
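import base64

import requests

# Read the sample and Base64-encode it for the JSON payload.
with open("sample.exe", "rb") as f:
    payload = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod/detect/windows",
    json={"file_content": payload, "file_name": "sample.exe"},
    timeout=30,  # do not hang indefinitely if the endpoint is slow
)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
result = response.json()
print(f"Family: {result['family']} | Confidence: {result['confidence']:.2%}")

malware_classification/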
├── notebooks/
│ ├── windows/ # 7 notebooks — feature engineering & LightGBM training
│ ├── pdf/ # 3 notebooks — structural feature extraction
│ ├── urls/ # 2 notebooks — TF-IDF pipeline & XGBoost
│ └── office/ # 3 notebooks — VBA analysis & classification
├── lambda_deploy/
│ ├── lambda_deploy_urls/ # URL Lambda microservice
│ ├── lambda_deploy_office/ # Office Lambda microservice
│ └── clf09-webapp/ # React 18 + Vite + TailwindCSS frontend
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md
| Dataset | Source | Samples | Link |
|---|---|---|---|
| Microsoft Malware Classification (BIG 2015) | Kaggle | 21,000 | Kaggle |
| CIC-Evasive-PDFMal2022 | Canadian Institute for Cybersecurity | 10,023 | UNB |
| ISCX-URL-2016 | Canadian Institute for Cybersecurity | 651,191 | UNB |
| Office/VBA Custom | VirusTotal · Malware Bazaar | 18,538 | Custom collection |
| Layer | Technologies |
|---|---|
| ML | LightGBM · XGBoost · scikit-learn · Optuna |
| Feature Engineering | NumPy · Pandas · TF-IDF |
| Cloud | AWS Lambda · API Gateway · S3 · ECR |
| Frontend | React 18 · Vite · TailwindCSS |
| DevOps | Docker |
- Evasion robustness: adversarial PE files (packed, obfuscated) may fool byte-level features; adding dynamic analysis (sandbox traces) would improve robustness
- Zero-day detection: current models are trained on known families; anomaly detection (isolation forest, autoencoders) could flag novel malware
- Relevance to AI Safety: the adversarial threat model here mirrors alignment challenges, since a system optimized against a fixed classifier will find exploits; this motivates ongoing model updates and interpretability research
- Interpretability: adding SHAP explanations per prediction would make the system more auditable and trustworthy
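
To make the evasion point above concrete: packing or encrypting a payload tends to flatten its byte distribution, so statistics computed from raw bytes can shift even when behavior does not. The sketch below computes Shannon entropy over bytes as one such statistic. `byte_entropy` is a hypothetical helper for illustration and is not claimed to be one of the project's 1,615 features.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


padding = b"\x00" * 1024        # highly regular content
uniform = bytes(range(256)) * 4  # every byte value equally frequent
print(byte_entropy(padding), byte_entropy(uniform))
```

Regular content scores near 0 and uniformly distributed bytes score near 8, which is why a model relying only on static byte statistics may benefit from the dynamic analysis suggested above.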
CLF09 — AS3, ENSAE Dakar (2024–2025)
| Name | Role |
|---|---|
| Marc MARE | ML Engineering · AWS Deployment |
| Fatou Soumaya WADE | Feature Engineering · PDF Detector |
| Gilbert OUMSAORE | URL & Office Detectors |
| Ndèye Aissatou CISSE | Data Pipeline · Evaluation |
Academic supervisor: Mme Fatou SALL
Marc MARE — Statistics & ML Engineer
ENSAE Dakar | MSc SEP, University of Reims (2026)
MIT License — see LICENSE for details.