Real-time Log Analytics & Security Monitoring Pipeline

A scalable, end-to-end data engineering solution for real-time web server monitoring, anomaly detection, and log analysis.

This project implements a robust Real-time Big Data Pipeline designed to ingest, process, and visualize high-throughput log data from distributed web servers. Leveraging the power of the Hadoop ecosystem and Stream Processing technologies, the system provides sub-minute insights into infrastructure health, traffic patterns, and potential security threats (DDoS, scanning).

System Architecture

The architecture follows a decoupled, microservices-based design ensuring high availability, fault tolerance, and horizontal scalability.

Figure 1. High-level System Architecture

Key Technical Highlights

Decoupled Ingestion: Uses Apache Flume with TAILDIR source for reliable log shipping and Apache Kafka as a durable message buffer.
Event-Time Processing: Implements Apache Spark Structured Streaming with stateful aggregations, handling late data via Watermarking (2-minute threshold) and Windowing (10-second sliding windows).
TimeSeries Optimization: Metrics are stored in InfluxDB with optimized schema design (Tags vs. Fields) for high-cardinality queries.
Infrastructure as Code (IaC): Fully containerized environment using Docker and Docker Compose for reproducible deployments.

Data Pipeline Workflow

Figure 2. End-to-end Data Flow: From Nginx to Grafana

Log Generation: Distributed Nginx servers generate Access (JSON) and Error (Text) logs.
Ingestion Layer: Flume Agents tail logs and push structured events to a centralized Flume Collector via Avro.
Buffering Layer: Kafka partitions data into web-logs (access) and web-errors topics, decoupling producers from consumers.
Processing Layer: Spark Streaming jobs consume Kafka topics to:

Parse and normalize raw logs.
Compute metrics (RPS, Latency, Throughput).
Detect anomalies using statistical thresholds.

Serving & Visualization: InfluxDB stores time-series metrics, while Grafana queries and visualizes actionable insights.

Key Features

1. Real-time Traffic Monitoring

Performance Metrics: Tracks Requests Per Second (RPS), average/max Latency, and Bandwidth usage in real-time.
Status Code Analysis: Granular breakdown of HTTP 2xx, 3xx, 4xx, and 5xx responses.

2. Cybersecurity Anomaly Detection

The system employs heuristic analysis to identify potential threats:

IP Spike Detection: Identifies IPs exceeding request thresholds within a short window (Potential DDoS).
Error Surges: Alerts when the 5xx error rate spikes abnormally.
Port/Path Scanning: Detects IPs probing multiple distinct non-existent paths (Reconnaissance attacks).

3. Centralized Alerting

Automated alerts configured in Grafana based on InfluxDB thresholds, notifying administrators of critical incidents immediately.

Monitoring Dashboards

HTTP Performance Stats

Real-time visibility into server health, latency distribution, and error rates.

Error Analysis & Forensics

Deep dive into system errors classified by hostname, severity level, and message patterns.

Security & Anomaly Detection

Visualizing potential threats such as IP spikes and scanning activities with anomaly scores.

Traffic Distribution (Top URLs)

Analysis of most accessed endpoints and their respective status codes.

Tech Stack

Domain	Technology	Usage
Ingestion	Apache Flume	Log collection (Taildir Source, Avro Sink)
Messaging	Apache Kafka	Distributed Event Streaming Platform
Processing	Apache Spark	Structured Streaming (Python/PySpark), Stateful Aggregation
Storage	InfluxDB	Time-Series Database (TSDB) for metrics
Visualization	Grafana	Operational Dashboards & Alerting (Flux query language)
DevOps	Docker	Containerization & Orchestration

Project Structure

realtime-log-analytics/
├── docker-compose.yml       # Orchestration of the entire stack
├── .env                     # Environment variables configuration
├── flume/                   # Log ingestion configurations
│   ├── agents/              # Per-node agents (web1, web2, web3)
│   └── collector/           # Centralized aggregation logic
├── kafka/                   # Kafka provisioning scripts
├── spark/                   # ETL & Analytics Logic
│   ├── src/python/          # PySpark streaming jobs
│   │   ├── stream_access.py # Access log processing & metrics
│   │   ├── stream_error.py  # Error log parsing & classification
│   │   └── schemas/         # Data contracts & Schema definitions
│   └── Dockerfile           # Custom Spark image with dependencies
├── influxdb/                # DB initialization & retention policies
├── grafana/                 # Dashboard provisioning & Datasources
└── docs/                    # Architectural documentation

Contributors & Roles

This project was built by a collaborative team of 5 engineers, simulating a real-world agile environment.

Hung (DevOps & Ingestion):
Designed the Flume topology (Agents & Collectors).
Managed Docker deployment scripts and environment stability.
Hao (Data Pipeline):
Managed the Kafka cluster implementation.
Optimized topic partitioning and retention policies for high throughput.
Hai (Data Processing):
Developed Spark Structured Streaming jobs.
Implemented windowed aggregations and watermark logic for data consistency.
Don (Data Storage):
Designed InfluxDB schema (measurements, tags, fields) for query optimization.
Handled data persistence layers.
Nhat (Visualization):
Built Grafana dashboards and configured alerting rules (Flux).
Ensured data interpretability for end-users.

Documentation

For detailed architectural decisions and runbooks, please refer to the internal documentation:

Name		Name	Last commit message	Last commit date
Latest commit History 108 Commits
docs		docs
flume		flume
grafana/provisioning		grafana/provisioning
influxdb		influxdb
kafka		kafka
nginx		nginx
scripts		scripts
spark		spark
.env		.env
.env.sample		.env.sample
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
docker-compose.grafana.yml		docker-compose.grafana.yml
docker-compose.yml		docker-compose.yml
source		source

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Real-time Log Analytics & Security Monitoring Pipeline

System Architecture

Key Technical Highlights

Data Pipeline Workflow

Key Features

1. Real-time Traffic Monitoring

2. Cybersecurity Anomaly Detection

3. Centralized Alerting

Monitoring Dashboards

HTTP Performance Stats

Error Analysis & Forensics

Security & Anomaly Detection

Traffic Distribution (Top URLs)

Tech Stack

Project Structure

Contributors & Roles

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Real-time Log Analytics & Security Monitoring Pipeline

System Architecture

Key Technical Highlights

Data Pipeline Workflow

Key Features

1. Real-time Traffic Monitoring

2. Cybersecurity Anomaly Detection

3. Centralized Alerting

Monitoring Dashboards

HTTP Performance Stats

Error Analysis & Forensics

Security & Anomaly Detection

Traffic Distribution (Top URLs)

Tech Stack

Project Structure

Contributors & Roles

Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages