This project implements an end-to-end log anomaly detection system using machine learning.
It processes raw system and login logs, performs preprocessing and feature engineering, and applies an Isolation Forest model to identify anomalous or suspicious events without requiring labeled data.
Such systems are commonly used for security monitoring, system reliability, and detecting unusual behavior in large-scale infrastructure.
Modern systems generate a large volume of logs every day. These logs may contain abnormal patterns such as unusual login times, high latency, or suspicious access behavior.
Manually inspecting logs is not scalable.
The goal of this project is to automatically detect anomalous log entries using an unsupervised machine learning approach.
The dataset consists of system and login-related logs containing fields such as:
- Timestamp
- User identifier
- Network latency (round trip time)
- IP address and ASN
- Geographic information (country, region, city)
- User agent details
Due to file size limitations, the dataset is not included in this repository.
The dataset can be downloaded from Kaggle:
https://www.kaggle.com/datasets/dasgroup/rba-dataset/data?select=rba-dataset.csv
After downloading, save the file (`rba-dataset.csv` on Kaggle) as: data/system_logs.csv
log-anomaly-detection/
│
├── src/
│   ├── preprocess.py
│   ├── feature_engineering.py
│   ├── model.py
│   └── main.py
│
├── data/
│   └── system_logs.csv
│
├── output/
│   └── anomaly_results.csv
│
├── requirements.txt
└── README.md
- Standardizes column names
- Parses timestamps into datetime format
- Handles missing and invalid values
- Cleans numerical latency fields
- Extracts time-based features such as hour and day
- Encodes categorical variables
- Scales numerical features
- Uses Isolation Forest for unsupervised anomaly detection
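- Builds random partitioning trees over the data; entries that follow normal system behavior need many splits to isolate
- Flags entries that are isolated in few splits as anomalies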
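The preprocessing and feature engineering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/`: the function names `preprocess` and `build_features` and the column names `timestamp`, `latency_ms`, and `country` are assumptions chosen for the example, and the real dataset columns may differ.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Clean raw logs. Column names are hypothetical, for illustration only."""
    df = df.copy()
    # Standardize column names, e.g. "Latency (ms)" -> "latency_ms".
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    # Unparseable timestamps become NaT instead of raising.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    # Non-numeric or negative latencies become NaN.
    latency = pd.to_numeric(df["latency_ms"], errors="coerce")
    df["latency_ms"] = latency.where(latency >= 0)
    # Drop rows without a usable timestamp, then fill latency gaps with the median.
    df = df.dropna(subset=["timestamp"]).reset_index(drop=True)
    df["latency_ms"] = df["latency_ms"].fillna(df["latency_ms"].median())
    return df


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive time features, scale numeric columns, encode one categorical column."""
    numeric = pd.DataFrame(
        {
            "hour": df["timestamp"].dt.hour,
            "day_of_week": df["timestamp"].dt.dayofweek,
            "latency_ms": df["latency_ms"],
        }
    ).astype(float)
    features = pd.DataFrame(
        StandardScaler().fit_transform(numeric),
        columns=numeric.columns,
        index=df.index,
    )
    # Integer category codes: simple, but the ordering they imply is arbitrary.
    features["country_code"] = df["country"].astype("category").cat.codes
    return features


# Tiny made-up sample, only to show the shape of the data flow.
raw = pd.DataFrame(
    {
        "Timestamp": ["2024-01-01 03:15:00", "not a date",
                      "2024-01-02 14:00:00", "2024-01-03 09:30:00"],
        "Latency (ms)": ["120", "80", "-5", "200"],
        "Country": ["US", "US", "DE", "US"],
    }
)
clean = preprocess(raw)
features = build_features(clean)
```

Integer codes keep the example short. Tree-based models such as Isolation Forest tolerate them reasonably well, though one-hot encoding is a common alternative for low-cardinality fields.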
Each log entry is assigned an anomaly label:
- 0 → Normal
- 1 → Anomalous
Results are saved to: output/anomaly_results.csv
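As a rough sketch of how the 0/1 labels can be produced: scikit-learn's `IsolationForest.fit_predict` returns +1 for inliers and -1 for outliers, so the result has to be remapped. The helper name `label_anomalies` and the `contamination` value below are assumptions for illustration and may not match the settings in `src/model.py`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def label_anomalies(features, contamination=0.01, random_state=42):
    """Fit an Isolation Forest and return 0 (normal) / 1 (anomalous) per row."""
    model = IsolationForest(
        n_estimators=100,
        contamination=contamination,  # assumed share of anomalies, a tunable guess
        random_state=random_state,
    )
    raw = model.fit_predict(features)  # +1 = inlier, -1 = outlier
    return np.where(raw == -1, 1, 0)


# Synthetic demo: 200 ordinary points plus one obvious outlier at the end.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
labels = label_anomalies(X)
```

The labels can then be attached to the original rows and written out, for example with `df.assign(anomaly=labels).to_csv("output/anomaly_results.csv", index=False)`.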
Isolation Forest is suitable for this problem because:
- Log data usually lacks labeled anomalies
- It scales well to large datasets
- It performs efficiently in high-dimensional feature spaces
- It is widely used in industry for anomaly detection
- Python
- Pandas
- Scikit-learn
- Git and GitHub
- Visual Studio Code
git clone https://github.com/himanshu2394i/log-anomaly-detection.git
cd log-anomaly-detection
pip install -r requirements.txt
Download the dataset from Kaggle and place it in:
data/system_logs.csv
python src/main.py
The output file contains the original log data along with an additional column:
anomaly
This column indicates whether a log entry is considered normal or anomalous by the model.
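To get a quick overview of the results, the output file can be loaded back with pandas. The sketch below assumes only what is described above, namely a CSV with an `anomaly` column holding 0 or 1. The helper name `summarize_results` is hypothetical.

```python
import pandas as pd


def summarize_results(path="output/anomaly_results.csv"):
    """Return summary counts and the rows flagged as anomalous."""
    results = pd.read_csv(path)
    flagged = results[results["anomaly"] == 1]
    total = len(results)
    summary = {
        "total": total,
        "anomalous": len(flagged),
        "rate": len(flagged) / total if total else 0.0,
    }
    return summary, flagged
```

Calling `summary, flagged = summarize_results()` after a pipeline run gives the anomaly rate and a DataFrame of flagged entries for manual review.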
Himanshu Mann