CLF09 — Multi-Format Malware Detection Platform

Production-ready, AI-powered malware detection across four input types (Windows PE executables, PDFs, URLs, Office/VBA documents), deployed on AWS serverless infrastructure.


Overview

CLF09 is a multi-format malware detection platform that classifies threats across Windows PE executables, PDFs, URLs, and Office/VBA documents. Each detector runs as an independent AWS Lambda microservice, so detectors scale in isolation and return predictions in under 3 seconds in production (roughly 0.5 to 2.1 seconds, depending on the detector).

The system combines classical feature engineering (binary/assembly analysis, TF-IDF, structural features) with gradient boosting models, two of them tuned via Optuna hyperparameter search. Accuracy on the benchmark datasets ranges from 96.00% (Office/VBA) to 99.82% (Windows PE).


Key Results

Detector    Algorithm          Dataset                              Samples  Accuracy  Latency
Windows PE  LightGBM + Optuna  Microsoft BIG 2015 (Kaggle)          21,000   99.82%    ~1.73s
PDF         LightGBM           CIC-Evasive-PDFMal2022               10,023   99.35%    ~1.26s
URL         XGBoost            ISCX-URL-2016                        651,191  98.23%    ~0.50s
Office/VBA  XGBoost + Optuna   Custom (VirusTotal, Malware Bazaar)  18,538   96.00%    ~2.10s
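
For readers who want to reproduce figures of this kind, the sketch below shows one way to compute accuracy together with macro-averaged precision, recall, and F1 from true and predicted labels. Macro averaging is an assumption here; this README does not state which averaging scheme was used, and the function name `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        # One-vs-rest counts for this class
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(c[0] for c in per_class) / n,
        "recall": sum(c[1] for c in per_class) / n,
        "f1": sum(c[2] for c in per_class) / n,
    }


# Toy labels, not project data
scores = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(scores)
```

On imbalanced datasets, macro averages can differ noticeably from accuracy, which is one reason to report both.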

Detectors

Windows PE — 9-Class Malware Family Classification

  • 1,615 features extracted from raw bytes and assembly code
  • Detects: Ramnit, Lollipop, Kelihos, Vundo, Simda, Tracur, Obfuscator, Gatak + Benign
  • LightGBM optimized with Optuna (50 trials)
  • F1-Score: 99.82% · Precision: 99.81% · Recall: 99.83%
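
As an illustration of what byte-level features can look like, the sketch below computes a normalized byte histogram plus the Shannon entropy of a file's raw bytes. This is an assumed, simplified example: the function name `byte_features` is hypothetical, and the 257 values it returns are not claimed to be part of the project's actual 1,615-feature set.

```python
import math
from collections import Counter


def byte_features(data: bytes) -> list[float]:
    """Return 257 values: normalized byte histogram (256) + Shannon entropy (1)."""
    n = len(data)
    if n == 0:
        return [0.0] * 257
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    hist = [counts.get(b, 0) / n for b in range(256)]
    # Entropy in bits per byte: 0 for constant data, 8 for uniformly random data
    entropy = -sum(p * math.log2(p) for p in hist if p > 0)
    return hist + [entropy]


# Synthetic input: every byte value appears equally often
features = byte_features(bytes(range(256)) * 4)
print(len(features), round(features[-1], 3))
```

High entropy is often associated with packed or encrypted sections, which is why it is a common companion to histogram features.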

PDF — JavaScript & XFA Exploit Detection

  • 23 structural features: JavaScript presence, XFA forms, encryption, object counts
  • Binary classification: malicious vs. benign
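
Structural features of this kind can be approximated by counting PDF name tokens and object markers in the raw file. The sketch below is a minimal example under that assumption; the keyword list, the feature names, and the function `pdf_structural_features` are hypothetical and do not reproduce the project's 23 features.

```python
import re

# Hypothetical keyword set chosen for illustration
PDF_KEYWORDS = [b"/JavaScript", b"/JS", b"/XFA", b"/Encrypt", b"/OpenAction"]


def pdf_structural_features(raw: bytes) -> dict[str, int]:
    """Count selected PDF name tokens and structural markers in raw bytes."""
    feats = {kw.decode().lstrip("/").lower(): raw.count(kw) for kw in PDF_KEYWORDS}
    # Indirect objects start with "<number> <generation> obj"
    feats["obj_count"] = len(re.findall(rb"\b\d+\s+\d+\s+obj\b", raw))
    feats["stream_count"] = len(re.findall(rb"\bendstream\b", raw))
    feats["size_bytes"] = len(raw)
    return feats


# Tiny hand-written fragment, not a valid or real PDF
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
pdf_feats = pdf_structural_features(sample)
print(pdf_feats)
```

Plain substring counts miss names hidden by hex escapes or compressed object streams, so a production extractor would likely need a real PDF parser.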

URL — 4-Class Threat Classification

  • 5,015 features: 15 numerical + 5,000 TF-IDF tokens
  • Classes: Benign · Defacement · Phishing · Malware
  • Trained on 651K URLs from the ISCX-URL-2016 benchmark
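
To make the split between numerical and token features concrete, the sketch below derives a few lexical statistics from a URL and tokenizes it into the kind of terms a TF-IDF vectorizer could weight. The specific features and the names `url_numeric_features` and `url_tokens` are assumptions for illustration; the project's 15 numerical features are not listed in this README.

```python
import re
from urllib.parse import urlsplit


def url_numeric_features(url: str) -> dict[str, int]:
    """Compute a handful of hypothetical lexical features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # empty when the URL has no scheme/netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_depth": parts.path.count("/"),
        "digit_count": sum(c.isdigit() for c in url),
        "special_count": sum(c in "@-_?=&%" for c in url),
        "subdomain_count": max(host.count(".") - 1, 0),
        "uses_https": int(parts.scheme == "https"),
        "host_is_ip": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host))),
    }


def url_tokens(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens for a TF-IDF vocabulary."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


example = "http://suspicious-site.com/payload"
numeric = url_numeric_features(example)
tokens = url_tokens(example)
print(numeric)
print(tokens)
```

In a full pipeline, the token lists would feed a vectorizer capped at 5,000 terms, and the resulting sparse matrix would be concatenated with the numerical columns.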

Office/VBA — Macro Malware Detection

  • 9 VBA-specific features: macro presence, auto-execution keywords, suspicious API calls
  • Covers Word (.doc, .docx) and Excel (.xls, .xlsx)
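
Keyword-based macro features like these can be sketched as simple counts over the macro source text. The example below assumes the VBA code has already been extracted from the document by a separate tool; the keyword sets, feature names, and the function `vba_features` are hypothetical and are not the project's actual nine features.

```python
import re

# Hypothetical keyword sets, lowercased for case-insensitive matching
AUTO_EXEC = {"autoopen", "auto_open", "document_open", "workbook_open"}
SUSPICIOUS = {"shell", "createobject", "wscript", "powershell", "urldownloadtofile"}


def vba_features(macro_source: str) -> dict[str, int]:
    """Count auto-execution and suspicious identifiers in extracted VBA source."""
    text = macro_source.lower()
    # Tokenize into identifiers so "shell" is not double-counted inside "powershell"
    identifiers = re.findall(r"[a-z_][a-z0-9_]*", text)
    return {
        "has_macro": int(bool(text.strip())),
        "auto_exec_hits": sum(tok in AUTO_EXEC for tok in identifiers),
        "suspicious_hits": sum(tok in SUSPICIOUS for tok in identifiers),
        "line_count": len(text.splitlines()),
    }


# Hand-written snippet for illustration only
macro = (
    "Sub AutoOpen()\n"
    '    Set sh = CreateObject("WScript.Shell")\n'
    '    sh.Run "powershell -enc AAAA"\n'
    "End Sub\n"
)
vba_feats = vba_features(macro)
print(vba_feats)
```

Counting whole identifiers rather than substrings keeps overlapping keywords from inflating the totals.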

Architecture

┌──────────────────────────────────────────────┐
│              Client Layer                    │
│    Browser · Mobile App · API Clients        │
└──────────────────┬───────────────────────────┘
                   │
          ┌────────▼────────┐
          │   API Gateway   │
          │   (REST API)    │
          └────────┬────────┘
                   │
     ┌─────────────┼─────────────┬─────────────┐
     │             │             │             │
┌────▼────┐  ┌─────▼────┐  ┌────▼────┐  ┌─────▼────┐
│ Lambda  │  │  Lambda  │  │ Lambda  │  │  Lambda  │
│ Windows │  │   PDF    │  │   URL   │  │  Office  │
│ 2048 MB │  │ 2048 MB  │  │ 2048 MB │  │ 2048 MB  │
└────┬────┘  └─────┬────┘  └────┬────┘  └─────┬────┘
     └─────────────┴────────────┴──────────────┘
                         │
                ┌────────▼────────┐
                │   Amazon S3     │
                │  (Model Store)  │
                └─────────────────┘
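
One common pattern for this layout is to load the model once per Lambda container and reuse it across warm invocations. The sketch below assumes an API Gateway Lambda proxy integration (event fields `body` and `isBase64Encoded`) and replaces the S3 download with a hypothetical `load_model` stub that returns a fixed answer. The request field `file_content` is taken from the Windows PE example in this README; the project's actual handlers may differ.

```python
import base64
import json

_MODEL = None  # cached at module level, so warm invocations skip the reload


def load_model():
    """Hypothetical stand-in for fetching and deserializing the model from S3."""
    return lambda data: {"prediction": "benign", "confidence": 0.5}


def handler(event, context):
    """Decode the request, run the cached model, return a proxy-style response."""
    global _MODEL
    if _MODEL is None:  # cold start: pay the loading cost only once
        _MODEL = load_model()
    try:
        body = event.get("body") or "{}"
        if event.get("isBase64Encoded"):
            body = base64.b64decode(body).decode("utf-8")
        request = json.loads(body)
        data = base64.b64decode(request["file_content"], validate=True)
    except (KeyError, TypeError, ValueError):
        # Covers a missing field, a non-object body, bad JSON, and bad Base64
        return {"statusCode": 400, "body": json.dumps({"error": "invalid request"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(_MODEL(data)),
    }


# Local smoke test with a fake event (no AWS services involved)
event = {"body": json.dumps({"file_content": base64.b64encode(b"MZ").decode()})}
ok = handler(event, None)
bad = handler({"body": "not json"}, None)
print(ok["statusCode"], bad["statusCode"])
```

Caching at module level means only the first request to a new container pays the model download and deserialization cost.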

Quick Start

git clone https://github.com/Bassongo/malware_classification.git
cd malware_classification
pip install -r requirements.txt

Prerequisites: Python 3.11+ · Docker · AWS CLI (configured)


API Reference

Base URL: https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod

Endpoint         Method  Input
/detect/windows  POST    Base64-encoded .exe
/detect/pdf      POST    Base64-encoded .pdf
/detect/url      POST    URL string
/detect/office   POST    Base64-encoded .doc/.xls

Example — URL Detection

curl -X POST https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod/detect/url \
  -H "Content-Type: application/json" \
  -d '{"url": "http://suspicious-site.com/payload"}'

Response

{
  "prediction": "malware",
  "family": "trojan-downloader",
  "confidence": 0.97
}

Example — Windows PE (Python)

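import base64

import requests

# Read the sample and Base64-encode it so it can travel inside a JSON body
with open("sample.exe", "rb") as f:
    payload = base64.b64encode(f.read()).decode("ascii")

response = requests.post(
    "https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod/detect/windows",
    json={"file_content": payload, "file_name": "sample.exe"},
    timeout=30,  # avoid hanging indefinitely on a slow or cold endpoint
)
response.raise_for_status()  # surface HTTP errors before parsing the body

result = response.json()
print(f"Family: {result['family']} | Confidence: {result['confidence']:.2%}")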
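
The two examples above can be generalized into a small helper that builds the endpoint and JSON payload for any detector. This is a sketch under one explicit assumption: that `/detect/pdf` and `/detect/office` accept the same `file_content` and `file_name` fields as `/detect/windows`, which this README does not confirm. The helper name `build_request` is hypothetical.

```python
import base64
from typing import Optional

BASE_URL = "https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod"
FILE_DETECTORS = {"windows", "pdf", "office"}


def build_request(
    kind: str,
    *,
    url: Optional[str] = None,
    file_name: Optional[str] = None,
    content: Optional[bytes] = None,
) -> tuple[str, dict]:
    """Return (endpoint, json_payload) for the chosen detector."""
    endpoint = f"{BASE_URL}/detect/{kind}"
    if kind == "url":
        if not url:
            raise ValueError("the url detector needs a url")
        return endpoint, {"url": url}
    if kind in FILE_DETECTORS:
        if content is None or not file_name:
            raise ValueError("file detectors need file_name and content")
        # Assumed payload shape, copied from the Windows PE example
        encoded = base64.b64encode(content).decode("ascii")
        return endpoint, {"file_content": encoded, "file_name": file_name}
    raise ValueError(f"unknown detector: {kind}")


endpoint, payload = build_request("url", url="http://suspicious-site.com/payload")
print(endpoint, payload)
```

The returned pair can be passed to `requests.post(endpoint, json=payload, timeout=30)`, which keeps payload construction testable without network access.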

Project Structure

malware_classification/
├── notebooks/
│   ├── windows/        # 7 notebooks — feature engineering & LightGBM training
│   ├── pdf/            # 3 notebooks — structural feature extraction
│   ├── urls/           # 2 notebooks — TF-IDF pipeline & XGBoost
│   └── office/         # 3 notebooks — VBA analysis & classification
├── lambda_deploy/
│   ├── lambda_deploy_urls/     # URL Lambda microservice
│   ├── lambda_deploy_office/   # Office Lambda microservice
│   └── clf09-webapp/           # React 18 + Vite + TailwindCSS frontend
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md

Datasets

Dataset                                      Source                                Samples  Link
Microsoft Malware Classification (BIG 2015)  Kaggle                                21,000   Kaggle
CIC-Evasive-PDFMal2022                       Canadian Institute for Cybersecurity  10,023   UNB
ISCX-URL-2016                                Canadian Institute for Cybersecurity  651,191  UNB
Office/VBA Custom                            VirusTotal · Malware Bazaar           18,538   Custom collection

Tech Stack

Layer                Technologies
ML                   LightGBM · XGBoost · scikit-learn · Optuna
Feature Engineering  NumPy · Pandas · TF-IDF
Cloud                AWS Lambda · API Gateway · S3 · ECR
Frontend             React 18 · Vite · TailwindCSS
DevOps               Docker

Limitations & Future Work

  • Evasion robustness: packed or obfuscated PE files may fool byte-level features; adding dynamic analysis (sandbox traces) would improve robustness.
  • Zero-day detection: the current models are trained on known families only; anomaly detection (isolation forest, autoencoders) could flag novel malware.
  • Relevance to AI safety: the adversarial threat model here mirrors alignment challenges, since anything optimized against a fixed classifier will eventually find its blind spots; this motivates ongoing model updates and interpretability research.
  • Interpretability: per-prediction SHAP explanations would make the system more auditable and trustworthy.

Team

CLF09 — AS3, ENSAE Dakar (2024–2025)

Name                  Role
Marc MARE             ML Engineering · AWS Deployment
Fatou Soumaya WADE    Feature Engineering · PDF Detector
Gilbert OUMSAORE      URL & Office Detectors
Ndèye Aissatou CISSE  Data Pipeline · Evaluation

Academic supervisor: Mme Fatou SALL


Author

Marc MARE — Statistics & ML Engineer
ENSAE Dakar | MSc SEP, University of Reims (2026)

LinkedIn GitHub


License

MIT License — see LICENSE for details.
