Log Anomaly Detection System

Overview

This project implements an end-to-end log anomaly detection system using machine learning.
It processes raw system and login logs, performs preprocessing and feature engineering, and applies an Isolation Forest model to identify anomalous or suspicious events without requiring labeled data.

Such systems are commonly used for security monitoring, system reliability, and detecting unusual behavior in large-scale infrastructure.

Problem Statement

Modern systems generate a large volume of logs every day. These logs may contain abnormal patterns such as unusual login times, high latency, or suspicious access behavior.

Manually inspecting logs is not scalable.
The goal of this project is to automatically detect anomalous log entries using an unsupervised machine learning approach.

Dataset

The dataset consists of system and login-related logs containing fields such as:

Timestamp
User identifier
Network latency (round trip time)
IP address and ASN
Geographic information (country, region, city)
User agent details

Dataset Source

Due to file size limitations, the dataset is not included in this repository.

The dataset can be downloaded from Kaggle:

https://www.kaggle.com/datasets/dasgroup/rba-dataset/data?select=rba-dataset.csv

After downloading, rename the file (rba-dataset.csv, per the link above) to system_logs.csv and place it at the following path: data/system_logs.csv
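
Before running the full pipeline, it can help to confirm that the file is in place and to look at its columns. The snippet below is a minimal sketch, not part of the repository: the helper name `peek` is hypothetical, and only the path `data/system_logs.csv` comes from this README. It reads just the first few rows so that a large file is not loaded in full.

```python
from pathlib import Path

import pandas as pd

DATA_PATH = Path("data/system_logs.csv")  # path taken from this README


def peek(path: Path, nrows: int = 5) -> pd.DataFrame:
    """Load only the first few rows (hypothetical helper, not in the repo)."""
    if not path.is_file():
        raise FileNotFoundError(f"Expected the dataset at {path}")
    return pd.read_csv(path, nrows=nrows)


if DATA_PATH.is_file():
    sample = peek(DATA_PATH)
    print(sample.columns.tolist())  # column names as they appear in the raw file
    print(sample.head())
else:
    print(f"Dataset not found at {DATA_PATH}; download it from Kaggle first.")
```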

Project Structure

log-anomaly-detection/
│
├── src/
│ ├── preprocess.py
│ ├── feature_engineering.py
│ ├── model.py
│ └── main.py
│
├── data/
│ └── system_logs.csv
│
├── output/
│ └── anomaly_results.csv
│
├── requirements.txt
└── README.md

Pipeline Description

Data Preprocessing

Standardizes column names
Parses timestamps into datetime format
Handles missing and invalid values
Cleans numerical latency fields
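
The preprocessing steps above could be sketched with pandas roughly as follows. This is an illustration under stated assumptions, not the contents of src/preprocess.py: the column names (`timestamp`, `latency_ms`, `user_id`), the rule that negative latencies are invalid, and the median fill are all choices made for this example.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a raw log frame. Column names are illustrative assumptions."""
    out = df.copy()
    # Standardize column names: trim, lowercase, non-alphanumerics -> "_".
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    # Parse timestamps; unparseable values become NaT instead of raising.
    out["timestamp"] = pd.to_datetime(out["timestamp"], errors="coerce")
    # Coerce latency to numeric; non-numeric strings become NaN.
    out["latency_ms"] = pd.to_numeric(out["latency_ms"], errors="coerce")
    # Assumption: a negative latency is invalid, so blank it out.
    out["latency_ms"] = out["latency_ms"].where(out["latency_ms"] >= 0)
    # Drop rows without a usable timestamp, then fill missing latency
    # with the median of the remaining rows.
    out = out.dropna(subset=["timestamp"]).reset_index(drop=True)
    out["latency_ms"] = out["latency_ms"].fillna(out["latency_ms"].median())
    return out


raw = pd.DataFrame({
    " Timestamp ": ["2024-01-01 03:15:00", "not a date", "2024-01-02 14:00:00",
                    "2024-01-03 09:30:00", "2024-01-04 23:45:00"],
    "Latency (ms)": ["120", "80", "abc", "-5", "200"],
    "User ID": ["u1", "u2", "u3", "u4", "u5"],
})
clean = preprocess(raw)
print(clean)
```

Coercing with `errors="coerce"` keeps bad rows visible as missing values rather than stopping the run, which makes it easier to decide separately whether to drop or fill them.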

Feature Engineering

Extracts time-based features such as hour and day
Encodes categorical variables
Scales numerical features
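
One possible shape for these feature-engineering steps is shown below. It is a sketch, not the code in src/feature_engineering.py: the README does not say which encoding is used, so frequency encoding is an assumption here, as are the column names `timestamp`, `latency_ms`, `country`, and `user_agent`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Turn cleaned logs into a numeric feature matrix (illustrative only)."""
    feats = pd.DataFrame(index=df.index)
    # Time-based features taken from the parsed timestamp.
    feats["hour"] = df["timestamp"].dt.hour
    feats["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
    # Frequency encoding (an assumption): each category is replaced by its
    # share of rows, so rare values get small numbers.
    for col in ["country", "user_agent"]:
        feats[f"{col}_freq"] = df[col].map(df[col].value_counts(normalize=True))
    # Scale latency to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(df[["latency_ms"]])
    feats["latency_scaled"] = scaled.ravel()
    return feats


logs = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 03:15:00", "2024-01-02 14:00:00",
        "2024-01-03 09:30:00", "2024-01-04 23:45:00",
    ]),
    "latency_ms": [120.0, 95.0, 110.0, 4300.0],
    "country": ["DE", "DE", "US", "DE"],
    "user_agent": ["Firefox", "Chrome", "Chrome", "curl"],
})
features = build_features(logs)
print(features)
```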

Model Training

Uses Isolation Forest for unsupervised anomaly detection
Builds an ensemble of random trees that isolate each record through random splits
Flags records that are isolated in unusually few splits as anomalies
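
As an illustration of the training step, the sketch below fits scikit-learn's `IsolationForest` on synthetic data. The hyperparameters (`n_estimators=100`, `contamination=0.01`) and the generated points are assumptions for this example; the README does not state the settings used in src/model.py.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the engineered features: 500 typical points
# plus three obvious outliers appended at the end.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -10.0]])
X = np.vstack([normal, outliers])

# contamination is the expected share of anomalies (assumed value here).
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(X)

raw_pred = model.predict(X)          # +1 = inlier, -1 = outlier
scores = model.decision_function(X)  # lower score = more anomalous

print("flagged rows:", int((raw_pred == -1).sum()))
print("scores of the injected outliers:", scores[-3:])
```

Note that `predict` returns +1 and -1, so a pipeline that reports 0 and 1 labels needs an explicit mapping step afterwards.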

Output

Each log entry is assigned an anomaly label:

0 → Normal
1 → Anomalous

Results are saved to: output/anomaly_results.csv
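
scikit-learn's Isolation Forest reports +1 for inliers and -1 for outliers, so producing the 0/1 labels described above requires a mapping step. The sketch below shows one way to do that and write the result; the helper name `to_anomaly_label`, the sample rows, and the temporary output directory are assumptions for this example (the project itself writes to output/anomaly_results.csv).

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def to_anomaly_label(raw_pred) -> np.ndarray:
    """Map -1 (outlier) / +1 (inlier) to 1 (anomalous) / 0 (normal)."""
    return np.where(np.asarray(raw_pred) == -1, 1, 0)


logs = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "latency_ms": [120.0, 4300.0, 95.0],
})
raw_pred = np.array([1, -1, 1])  # stand-in for IsolationForest.predict output
logs["anomaly"] = to_anomaly_label(raw_pred)

# A temporary directory keeps this example side-effect free.
out_path = Path(tempfile.mkdtemp()) / "anomaly_results.csv"
logs.to_csv(out_path, index=False)
print(logs)
```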

Why Isolation Forest

Isolation Forest is suitable for this problem because:

Log data usually lacks labeled anomalies
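It scales well to large datasets because each tree is built on a small random subsample
It needs no pairwise distance or density estimates, which keeps it comparatively cheap as the number of features grows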
It is widely used in industry for anomaly detection

Technologies Used

Python
Pandas
Scikit-learn
Git and GitHub
Visual Studio Code

How to Run the Project

Clone the Repository

git clone https://github.com/himanshu2394i/log-anomaly-detection.git
cd log-anomaly-detection

Install Dependencies

pip install -r requirements.txt

Add the Dataset

Download the dataset from Kaggle and place it in:

data/system_logs.csv

Run the Pipeline

python src/main.py

Output

The output file contains the original log data along with an additional column:

anomaly

This column indicates whether a log entry is considered normal or anomalous by the model.
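
A quick way to review the results is to load the file and filter on the `anomaly` column. The example below is a sketch that uses a small in-memory CSV in place of output/anomaly_results.csv; every column other than `anomaly` is an assumption made for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for output/anomaly_results.csv.
csv_text = """user_id,latency_ms,anomaly
u1,120,0
u2,4300,1
u3,95,0
u4,110,0
"""
results = pd.read_csv(io.StringIO(csv_text))

flagged = results[results["anomaly"] == 1]  # rows the model marked anomalous
anomaly_rate = results["anomaly"].mean()    # share of flagged rows

print(f"{len(flagged)} of {len(results)} entries flagged ({anomaly_rate:.1%})")
print(flagged)
```

For the real output, replace the `io.StringIO(...)` argument with the path to the results file.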

Author

Himanshu Mann
