Skip to content

himanshu2394i/log-anomaly-detection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Log Anomaly Detection System

Overview

This project implements an end-to-end log anomaly detection system using machine learning.
It processes raw system and login logs, performs preprocessing and feature engineering, and applies an Isolation Forest model to identify anomalous or suspicious events without requiring labeled data.

Such systems are commonly used for security monitoring, system reliability, and detecting unusual behavior in large-scale infrastructure.


Problem Statement

Modern systems generate a large volume of logs every day. These logs may contain abnormal patterns such as unusual login times, high latency, or suspicious access behavior.

Manually inspecting logs is not scalable.
The goal of this project is to automatically detect anomalous log entries using an unsupervised machine learning approach.


Dataset

The dataset consists of system and login-related logs containing fields such as:

  • Timestamp
  • User identifier
  • Network latency (round trip time)
  • IP address and ASN
  • Geographic information (country, region, city)
  • User agent details

Dataset Source

Due to file size limitations, the dataset is not included in this repository.

The dataset can be downloaded from Kaggle:

https://www.kaggle.com/datasets/dasgroup/rba-dataset/data?select=rba-dataset.csv

After downloading, place the file in the following path: data/system_logs.csv


Project Structure

log-anomaly-detection/

├── src/
│ ├── preprocess.py
│ ├── feature_engineering.py
│ ├── model.py
│ └── main.py

├── data/
│ └── system_logs.csv

├── output/
│ └── anomaly_results.csv

├── requirements.txt
└── README.md


Pipeline Description

Data Preprocessing

  • Standardizes column names
  • Parses timestamps into datetime format
  • Handles missing and invalid values
  • Cleans numerical latency fields

Feature Engineering

  • Extracts time-based features such as hour and day
  • Encodes categorical variables
  • Scales numerical features

Model Training

  • Uses Isolation Forest for unsupervised anomaly detection
  • Learns normal system behavior
  • Flags deviations as anomalies

Output

Each log entry is assigned an anomaly label:

  • 0 → Normal
  • 1 → Anomalous

Results are saved to: output/anomaly_results.csv


Why Isolation Forest

Isolation Forest is suitable for this problem because:

  • Log data usually lacks labeled anomalies
  • It scales well to large datasets
  • It performs efficiently in high-dimensional feature spaces
  • It is widely used in industry for anomaly detection

Technologies Used

  • Python
  • Pandas
  • Scikit-learn
  • Git and GitHub
  • Visual Studio Code

How to Run the Project

Clone the Repository

git clone https://github.com/himanshu2394i/log-anomaly-detection.git cd log-anomaly-detection

Install Dependencies

pip install -r requirements.txt

Add the Dataset

Download the dataset from Kaggle and place it in:

data/system_logs.csv

Run the Pipeline

python src/main.py

Output

The output file contains the original log data along with an additional column:

anomaly

This column indicates whether a log entry is considered normal or anomalous by the model.

Author

Himanshu Mann

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages