Thesis Work: Anomaly Detection in System Logs using Deep Learning
This project implements DeepLog, a deep neural network model based on LSTM (Long Short-Term Memory) for anomaly detection in system logs. The DeepLog approach treats system logs as natural language sequences, automatically learning patterns during normal operation and flagging anomalies when new logs deviate from these learned patterns.
Description
System logs are a fundamental resource for debugging and monitoring the performance of a computer system. In this work, DeepLog:
Learns log patterns under normal conditions.
Detects anomalies when logs don't follow the learned patterns.
Allows for incremental and online model updates to adapt to emerging new patterns.
Generates workflows from logs to effectively diagnose anomalies and analyze their causes.
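The detection rule in the list above can be illustrated without a neural network. The following is a minimal sketch, assuming logs have already been parsed into integer log keys. A simple frequency table stands in for the LSTM here, so this is not the project's model; the function names `build_next_key_table` and `detect_anomalies` are hypothetical, and `window_size` and `num_candidates` only mirror the command-line flags of the same name.

```python
from collections import Counter, defaultdict


def build_next_key_table(normal_sequences, window_size):
    """Count which log key follows each window of `window_size` keys."""
    table = defaultdict(Counter)
    for seq in normal_sequences:
        for i in range(len(seq) - window_size):
            window = tuple(seq[i:i + window_size])
            table[window][seq[i + window_size]] += 1
    return table


def detect_anomalies(seq, table, window_size, num_candidates):
    """Return the positions whose key is not among the top candidates."""
    anomalies = []
    for i in range(len(seq) - window_size):
        window = tuple(seq[i:i + window_size])
        # Stand-in for the model: the most frequent followers of this window.
        ranked = table.get(window, Counter()).most_common(num_candidates)
        candidates = [key for key, _ in ranked]
        if seq[i + window_size] not in candidates:
            anomalies.append(i + window_size)
    return anomalies


# Toy "normal" behaviour: the keys 1, 2, 3, 4 repeat in order.
normal = [[1, 2, 3, 4, 1, 2, 3, 4], [1, 2, 3, 4, 1, 2, 3, 4]]
table = build_next_key_table(normal, window_size=3)

# Key 9 at position 6 breaks the pattern; position 7 follows an unseen window.
flagged = detect_anomalies([1, 2, 3, 4, 1, 2, 9, 4], table,
                           window_size=3, num_candidates=2)
print(flagged)  # [6, 7]
```

In this sketch a window that never occurred during normal operation has no candidates, so the key that follows it is also flagged. A trained LSTM would instead produce a ranking for every window.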
The model is implemented in PyTorch and includes functionalities for:
Loading and preprocessing logs (from local files or an S3 bucket).
Training the LSTM model in a distributed manner (supports multi-GPU/multi-node training).
Saving and loading the trained model.
Serializing input and output for deployment, for example on Amazon SageMaker.
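As an illustration of the deployment serialization mentioned in the list above, here is a minimal sketch of a JSON-based `input_fn` and `output_fn` pair. The function names follow this project's structure and the handler convention used by SageMaker's PyTorch serving container, but the payload layout (a `"sequence"` field holding integer log keys) and the result fields are assumptions made for the example, not the project's actual wire format.

```python
import json

JSON_CONTENT_TYPE = "application/json"


def input_fn(request_body, request_content_type=JSON_CONTENT_TYPE):
    """Deserialize a JSON request into a list of integer log keys."""
    if request_content_type != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported content type: {request_content_type}")
    payload = json.loads(request_body)
    # "sequence" is an assumed field name for this sketch.
    return [int(key) for key in payload["sequence"]]


def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    """Serialize a prediction result to a JSON string."""
    if accept != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)


keys = input_fn('{"sequence": [5, 5, 11, 9]}')
body = output_fn({"anomaly": False, "candidates": [9, 26]})
```

Rejecting unknown content types early keeps malformed requests from reaching the model.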
Project Structure
Imports and Configuration: Handles logging and sets the logging level for debugging.
Model Definition (class Model): Implements an LSTM model with linear layers for detecting anomalous events in logs.
Generate Class: Responsible for loading logs, which can be read from local files or an AWS S3 bucket.
generate(): Converts log sequences into datasets suitable for training.
init_line() and readline(): Manage log access depending on the execution mode.
Data Loader and Auxiliary Functions:
_get_train_data_loader(): Prepares the DataLoader for training.
_average_gradients(): Auxiliary function for distributed training.
Training and Saving Functions:
train(): Main function to train the model.
save_model(): Saves the model and necessary information for restoration.
model_fn(): Function for loading the model during inference.
Inference Functions:
input_fn(): Deserializes input data.
predict_fn(): Predicts the next event and determines if it's anomalous.
output_fn(): Serializes the prediction output.
Main Script: Uses argparse to define parameters (batch size, number of epochs, window size, etc.) and starts the model training.
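To make the model and inference components above more concrete, here is a minimal sketch of an LSTM classifier with a top-candidates check. It is written under assumptions and is not the project's actual code: it uses a single linear layer, takes the constructor arguments from the command-line flags (`input_size`, `hidden_size`, `num_layers`, `num_classes`), and the helper `is_anomalous` is a hypothetical name.

```python
import torch
import torch.nn as nn


class Model(nn.Module):
    """LSTM over a window of log keys, followed by a linear classifier."""

    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x has shape (batch, window_size, input_size).
        out, _ = self.lstm(x)
        # Classify from the hidden state of the last time step.
        return self.fc(out[:, -1, :])


def is_anomalous(model, window, next_key, num_candidates):
    """Flag `next_key` if it is not among the model's top predictions.

    Assumes input_size == 1, i.e. each log key is fed as a single number.
    """
    model.eval()
    with torch.no_grad():
        x = torch.tensor(window, dtype=torch.float).view(1, -1, 1)
        logits = model(x)
        top = torch.topk(logits, num_candidates, dim=1).indices[0]
    return next_key not in top.tolist()


# Placeholder sizes for illustration only.
model = Model(input_size=1, hidden_size=64, num_layers=2, num_classes=28)
logits = model(torch.zeros(4, 10, 1))  # batch of 4 windows of length 10
flag = is_anomalous(model, window=[5, 5, 11, 9, 11, 9, 11, 9, 26, 26],
                    next_key=3, num_candidates=9)
```

The model here is untrained, so the value of `flag` carries no meaning; the sketch only shows the tensor shapes and the decision rule.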
Requirements
Python 3.7+
PyTorch
boto3 (if using logs from AWS S3)
argparse and other standard-library Python modules
Execution Instructions
Configure the environment:
Ensure all dependencies are installed and that, if using logs from AWS S3, credentials are correctly configured.
Start training:
Execute the main script, passing the desired parameters. For example:
python deep_log.py --batch-size 64 --epochs 50 --window-size 10 --input-size 1 --hidden-size 64 --num-layers 2 --num-classes <NUM_CLASSES> --num-candidates <NUM_CANDIDATES> --local True
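For readers who want to see how such flags map onto argparse, here is a minimal sketch of a parser that accepts the command above. How deep_log.py actually defines its arguments is not shown in this README, so the defaults, the required flags, and the `str2bool` helper are all assumptions. The helper is included because argparse's `type=bool` treats any non-empty string, including "False", as true.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' style strings (hypothetical helper)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser for the flags used in the example command."""
    parser = argparse.ArgumentParser(description="Train a DeepLog-style model")
    # All defaults below are illustrative assumptions.
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--window-size", type=int, default=10)
    parser.add_argument("--input-size", type=int, default=1)
    parser.add_argument("--hidden-size", type=int, default=64)
    parser.add_argument("--num-layers", type=int, default=2)
    parser.add_argument("--num-classes", type=int, required=True)
    parser.add_argument("--num-candidates", type=int, required=True)
    parser.add_argument("--local", type=str2bool, default=False)
    return parser


# 28 and 9 are placeholder values standing in for <NUM_CLASSES> and <NUM_CANDIDATES>.
args = build_parser().parse_args([
    "--batch-size", "64", "--epochs", "50", "--window-size", "10",
    "--num-classes", "28", "--num-candidates", "9", "--local", "True",
])
```

argparse converts dashes in flag names to underscores, so `--batch-size` is read as `args.batch_size`.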
Note: This thesis work involved studying, modifying, and adapting the authors' original model, then evaluating the results of these changes.
MIT License
Copyright (c) 2019 Yifan Wu
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.