CLF09 — Multi-Format Malware Detection Platform

Production-ready, AI-powered malware detection across four input types (Windows PE executables, PDFs, URLs, and Office/VBA documents), deployed on AWS serverless infrastructure.

Live Demo · API Reference · Quick Start

Overview

CLF09 is a multi-format malware detection platform that classifies threats across Windows PE executables, PDFs, URLs, and Office/VBA documents. Each detector runs as an independent AWS Lambda microservice, enabling isolated scaling and sub-3-second inference in production.

The system combines classical feature engineering (byte-level and assembly analysis, TF-IDF, structural features) with gradient boosting models (LightGBM and XGBoost), two of which are tuned with Optuna hyperparameter search. Accuracy on the benchmark datasets ranges from 96.00% (Office/VBA) to 99.82% (Windows PE).

Key Results

Detector	Algorithm	Dataset	Samples	Accuracy	Latency
Windows PE	LightGBM + Optuna	Microsoft BIG 2015 (Kaggle)	21,000	99.82%	~1.73s
PDF	LightGBM	CIC-Evasive-PDFMal2022	10,023	99.35%	~1.26s
URL	XGBoost	ISCX-URL-2016	651,191	98.23%	~0.50s
Office/VBA	XGBoost + Optuna	Custom (VirusTotal, Malware Bazaar)	18,538	96.00%	~2.10s

Detectors

Windows PE — 9-Class Malware Family Classification

1,615 features extracted from raw bytes and assembly code
Detects: Ramnit, Lollipop, Kelihos, Vundo, Simda, Tracur, Obfuscator, Gatak, plus Benign
LightGBM optimized with Optuna (50 trials)
F1-Score: 99.82% · Precision: 99.81% · Recall: 99.83%
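
The README does not enumerate the 1,615 features. As a rough illustration of what features "extracted from raw bytes" can look like, the sketch below computes a normalized 256-bin byte histogram and the Shannon entropy of a file. The function names are hypothetical and this is not necessarily the feature set the project uses.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Return the normalized frequency of each of the 256 possible byte values."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Return the entropy of the byte stream in bits per byte (between 0 and 8)."""
    return float(sum(-p * math.log2(p) for p in byte_histogram(data) if p > 0))


def byte_features(data: bytes) -> list[float]:
    """Concatenate the histogram and the entropy into one 257-value vector."""
    return byte_histogram(data) + [shannon_entropy(data)]
```

High entropy is often associated with packed or encrypted sections, which is one reason it is a common byte-level signal.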

PDF — JavaScript & XFA Exploit Detection

23 structural features: JavaScript presence, XFA forms, encryption, object counts
Binary classification: malicious vs. benign
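
A minimal sketch of structural feature extraction, assuming the features are counted directly in the raw PDF bytes without fully parsing the document. The keyword set and the function name are hypothetical; the README names the feature families but not all 23 features.

```python
import re

# Hypothetical keyword set chosen for illustration only.
PDF_KEYWORDS = (b"/JavaScript", b"/JS", b"/XFA", b"/Encrypt", b"/OpenAction")


def pdf_structural_features(data: bytes) -> dict[str, int]:
    """Count structural markers in raw PDF bytes."""
    features = {kw.decode().lstrip("/").lower(): data.count(kw) for kw in PDF_KEYWORDS}
    # "N G obj" opens an indirect object; "endobj" closes it.
    features["obj"] = len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data))
    features["endobj"] = data.count(b"endobj")
    # The word boundary keeps "endstream" from being counted as "stream".
    features["stream"] = len(re.findall(rb"\bstream\b", data))
    features["size_bytes"] = len(data)
    return features
```

Plain byte matching of this kind misses names written with hex escapes (for example `/J#53`), so a production extractor would likely need to normalize names first.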

URL — 4-Class Threat Classification

5,015 features: 15 numerical + 5,000 TF-IDF tokens
Classes: Benign · Defacement · Phishing · Malware
Trained on 651K URLs from the ISCX-URL-2016 benchmark
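
The sketch below illustrates the two kinds of URL features described above: a handful of numerical lexical features, and the tokens a TF-IDF vectorizer would weight. The specific features and function names are assumptions for illustration; the README does not list the 15 numerical features or the tokenization rule.

```python
import re
from urllib.parse import urlparse


def url_numeric_features(url: str) -> dict[str, int]:
    """Compute simple lexical features from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special": sum(not ch.isalnum() for ch in url),
        # A dotted-quad host instead of a domain name.
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        "uses_https": int(parsed.scheme == "https"),
    }


def url_tokens(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]
```

One plausible way to obtain the 5,000 TF-IDF columns is to pass such a tokenizer to scikit-learn's `TfidfVectorizer` with `max_features=5000`, then concatenate the result with the numerical features.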

Office/VBA — Macro Malware Detection

9 VBA-specific features: macro presence, auto-execution keywords, suspicious API calls
Covers Word (.doc, .docx) and Excel (.xls, .xlsx)
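
A minimal sketch of keyword-based VBA features, assuming the macro source has already been extracted from the document as text (the extraction step is not shown). The keyword lists and function names are hypothetical; the README does not enumerate its nine features.

```python
import re

# Hypothetical keyword lists chosen for illustration only.
AUTO_EXEC = ("AutoOpen", "AutoExec", "Auto_Open", "Document_Open", "Workbook_Open")
SUSPICIOUS_CALLS = ("Shell", "CreateObject", "URLDownloadToFile", "PowerShell")


def count_keywords(source: str, keywords: tuple[str, ...]) -> int:
    """Count whole-word keyword occurrences; VBA identifiers are case-insensitive."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = rf"(?<!\w)(?:{alternatives})(?!\w)"
    return len(re.findall(pattern, source, flags=re.IGNORECASE))


def vba_features(source: str) -> dict[str, int]:
    """Summarize macro source code as a small numeric feature dictionary."""
    return {
        "has_macro": int(bool(source.strip())),
        "auto_exec_count": count_keywords(source, AUTO_EXEC),
        "suspicious_call_count": count_keywords(source, SUSPICIOUS_CALLS),
        "source_length": len(source),
    }
```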

Architecture

┌──────────────────────────────────────────────┐
│              Client Layer                    │
│    Browser · Mobile App · API Clients        │
└──────────────────┬───────────────────────────┘
                   │
          ┌────────▼────────┐
          │   API Gateway   │
          │   (REST API)    │
          └────────┬────────┘
                   │
     ┌─────────────┼─────────────┬─────────────┐
     │             │             │             │
┌────▼────┐  ┌─────▼────┐  ┌────▼────┐  ┌─────▼────┐
│ Lambda  │  │  Lambda  │  │ Lambda  │  │  Lambda  │
│ Windows │  │   PDF    │  │   URL   │  │  Office  │
│ 2048 MB │  │ 2048 MB  │  │ 2048 MB │  │ 2048 MB  │
└────┬────┘  └─────┬────┘  └────┬────┘  └─────┬────┘
     └─────────────┴────────────┴──────────────┘
                         │
                ┌────────▼────────┐
                │   Amazon S3     │
                │  (Model Store)  │
                └─────────────────┘
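
To make the request path in the diagram concrete, here is a sketch of what one detector's Lambda handler might look like. It assumes API Gateway proxy integration (the request JSON arrives as a string in `event["body"]`) and caches the model in a module-level variable so that warm invocations skip the reload. `load_model` is a hypothetical stand-in that returns a dummy scorer; a real detector would fetch and deserialize its model from S3 there, and the bucket and key names are not given in this README.

```python
import base64
import json

_MODEL = None  # module state survives across warm invocations of one container


def load_model():
    """Hypothetical stand-in for downloading and deserializing a model from S3."""
    return lambda data: {"prediction": "benign", "confidence": 0.5}


def get_model():
    """Load the model on first use only."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL


def respond(status: int, payload: dict) -> dict:
    """Build a response in the shape API Gateway proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context):
    """Decode the uploaded file, score it, and return the result as JSON."""
    try:
        body = json.loads(event.get("body") or "{}")
        data = base64.b64decode(body["file_content"], validate=True)
    except (KeyError, ValueError, TypeError):
        return respond(400, {"error": "expected JSON with base64 'file_content'"})
    return respond(200, get_model()(data))
```

Keeping the model outside the handler function is the usual way to pay the load cost once per cold start rather than once per request.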

Quick Start

git clone https://github.com/Bassongo/malware_classification.git
cd malware_classification
pip install -r requirements.txt

Prerequisites: Python 3.11+ · Docker · AWS CLI configured

API Reference

Base URL: https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod

Endpoint	Method	Input
`/detect/windows`	POST	Base64-encoded .exe
`/detect/pdf`	POST	Base64-encoded .pdf
`/detect/url`	POST	URL string
`/detect/office`	POST	Base64-encoded .doc/.xls

Example — URL Detection

curl -X POST https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod/detect/url \
  -H "Content-Type: application/json" \
  -d '{"url": "http://suspicious-site.com/payload"}'

Response

{
  "prediction": "malware",
  "family": "trojan-downloader",
  "confidence": 0.97
}

Example — Windows PE (Python)

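import base64

import requests

API_URL = "https://63d3m7xcjf.execute-api.us-east-1.amazonaws.com/prod/detect/windows"

# Read the executable and Base64-encode it for the JSON body.
with open("sample.exe", "rb") as f:
    payload = base64.b64encode(f.read()).decode()

response = requests.post(
    API_URL,
    json={"file_content": payload, "file_name": "sample.exe"},
    timeout=30,  # requests does not time out unless told to
)
response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body

result = response.json()
print(f"Family: {result['family']} | Confidence: {result['confidence']:.2%}")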

Project Structure

malware_classification/
├── notebooks/
│   ├── windows/        # 7 notebooks — feature engineering & LightGBM training
│   ├── pdf/            # 3 notebooks — structural feature extraction
│   ├── urls/           # 2 notebooks — TF-IDF pipeline & XGBoost
│   └── office/         # 3 notebooks — VBA analysis & classification
├── lambda_deploy/
│   ├── lambda_deploy_urls/     # URL Lambda microservice
│   ├── lambda_deploy_office/   # Office Lambda microservice
│   └── clf09-webapp/           # React 18 + Vite + TailwindCSS frontend
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md

Datasets

Dataset	Source	Samples	Link
Microsoft Malware Classification (BIG 2015)	Kaggle	21,000	Kaggle
CIC-Evasive-PDFMal2022	Canadian Institute for Cybersecurity	10,023	UNB
ISCX-URL-2016	Canadian Institute for Cybersecurity	651,191	UNB
Office/VBA Custom	VirusTotal · Malware Bazaar	18,538	Custom collection

Tech Stack

Layer	Technologies
ML	LightGBM · XGBoost · scikit-learn · Optuna
Feature Engineering	NumPy · Pandas · TF-IDF
Cloud	AWS Lambda · API Gateway · S3 · ECR
Frontend	React 18 · Vite · TailwindCSS
DevOps	Docker

Limitations & Future Work

Evasion robustness: packed or obfuscated PE files may evade byte-level features; adding dynamic analysis (sandbox traces) would make the detector harder to fool.
Zero-day detection: the current models are trained on known families; anomaly detection (isolation forest, autoencoders) could flag novel malware.
Relevance to AI safety: the adversarial threat model here mirrors alignment challenges. An adversary optimizing against a fixed classifier will eventually find exploits, which motivates regular model updates and interpretability research.
Interpretability: adding per-prediction SHAP explanations would make the system easier to audit.

Team

CLF09 — AS3, ENSAE Dakar (2024–2025)

Name	Role
Marc MARE	ML Engineering · AWS Deployment
Fatou Soumaya WADE	Feature Engineering · PDF Detector
Gilbert OUMSAORE	URL & Office Detectors
Ndèye Aissatou CISSE	Data Pipeline · Evaluation

Academic supervisor: Mme Fatou SALL

Author

Marc MARE — Statistics & ML Engineer
ENSAE Dakar | MSc SEP, University of Reims (2026)

License

MIT License — see LICENSE for details.
