
Real-time E-commerce Analytics Platform

Project Overview

The Challenge

Modern e-commerce platforms generate massive volumes of user activity, transaction, and inventory data. Traditional batch processing cannot provide the real-time insights needed for:

  • Instant fraud detection
  • Live inventory management
  • Personalized recommendations
  • Real-time business dashboards

Our Solution

We're building a distributed streaming analytics platform that processes e-commerce data in real time using Apache Kafka and Apache Spark. The platform will handle millions of events per day and deliver actionable insights within seconds.


Project Goals

Primary Objectives

  1. Build an end-to-end streaming data pipeline that ingests, processes, and analyzes e-commerce data in real time
  2. Implement real-time analytics for user behavior, transactions, and inventory
  3. Create a scalable, fault-tolerant architecture using distributed systems principles
  4. Develop practical experience with Kafka, Spark, and modern data engineering patterns

Key Features

  • Real-time user activity tracking
  • Fraud detection and alerting
  • Live inventory monitoring
  • Real-time business dashboards
  • Historical data archiving

🏗️ Architecture Overview

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                     DATA SOURCES                            │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │   User      │  │ Transactions │  │   Inventory      │  │
│  │ Activities  │  │    Stream    │  │    Updates       │  │
│  │  (Clicks,   │  │  (Orders,    │  │  (Stock levels,  │  │
│  │  Searches)  │  │  Payments)   │  │   Shipments)     │  │
│  └──────┬──────┘  └──────┬───────┘  └────────┬─────────┘  │
│         │                 │                    │           │
└─────────┼─────────────────┼────────────────────┼───────────┘
          │                 │                    │
          ▼                 ▼                    ▼
    ┌─────────────────────────────────────────────────────┐
    │           APACHE KAFKA - EVENT STREAMING            │
    │                                                     │
    │  ┌────────────┐  ┌─────────────┐  ┌─────────────┐  │
    │  │  user_     │  │  transac-   │  │  inventory_ │  │
    │  │  events    │  │   tions     │  │   updates   │  │
    │  │  Topic     │  │   Topic     │  │   Topic     │  │
    │  └────────────┘  └─────────────┘  └─────────────┘  │
    │        (all topics partitioned & replicated)        │
    └────────────────────────┬───────────────────────────┘
                             │
                             ▼
    ┌─────────────────────────────────────────────────────┐
    │        APACHE SPARK - STREAM PROCESSING             │
    │                                                     │
    │  ┌──────────────────────────────────────────────┐  │
    │  │          Structured Streaming Jobs           │  │
    │  │  • Real-time aggregations (5-min windows)    │  │
    │  │  • Sessionization (user journey analysis)    │  │
    │  │  • Fraud detection (pattern matching)        │  │
    │  │  • Recommendation engine (ML predictions)    │  │
    │  └──────────────────────────────────────────────┘  │
    └────────────────────────┬───────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         ▼                   ▼                   ▼
 ┌───────────────┐    ┌─────────────┐     ┌─────────────┐
 │   Real-time   │    │   Alerts &  │     │  Data Lake  │
 │   Dashboards  │    │Notifications│     │  (Parquet)  │
 │  (PostgreSQL/ │    │(Email/Slack)│     │             │
 │ Elasticsearch)│    │             │     │             │
 └───────┬───────┘    └──────┬──────┘     └──────┬──────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             ▼
┌─────────────────────────────────────────────────────┐
│                 MONITORING & VISUALIZATION          │
│  • Streamlit Dashboard (real-time metrics)          │
│  • Spark UI (job monitoring)                        │
│  • Kafka UI (topic health)                          │
└─────────────────────────────────────────────────────┘

Component Breakdown

1. Data Generation Layer

  • Mock Data Generators: Python scripts simulating real e-commerce events
  • Three main data streams:
    • User Events: Page views, clicks, searches, cart actions
    • Transactions: Orders, payments, refunds
    • Inventory Updates: Stock changes, shipments, restocks
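As a concrete illustration, a minimal generator for the user-events stream might look like the sketch below. Field names, ID ranges, and event types are assumptions for illustration; the project's generators use Faker, which is swapped here for the standard library so the sketch stays self-contained:

```python
import json
import random
import time
import uuid

# Illustrative event vocabulary; the real generators may use different names.
EVENT_TYPES = ["page_view", "click", "search", "add_to_cart", "remove_from_cart"]

def make_user_event(user_id=None):
    """Build one mock user-activity event as a plain dict."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id or f"user_{random.randint(1, 10_000)}",
        "event_type": random.choice(EVENT_TYPES),
        "product_id": f"prod_{random.randint(1, 500)}",
        "timestamp": time.time(),
    }

if __name__ == "__main__":
    # In the real pipeline each event would be published with confluent-kafka, e.g.:
    #   from confluent_kafka import Producer
    #   producer = Producer({"bootstrap.servers": "localhost:9092"})
    #   producer.produce("user_events", key=evt["user_id"], value=json.dumps(evt))
    for _ in range(3):
        print(json.dumps(make_user_event()))
```

Keying messages by `user_id` (as in the commented producer call) keeps all events for one user on the same partition, which matters later for sessionization.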

2. Event Streaming Layer (Kafka)

  • Apache Kafka Cluster: 3-node cluster in KRaft mode (no ZooKeeper)
  • Topics Structure:
    • user_events: User interactions (high volume)
    • transactions: Financial transactions (lower volume, high importance)
    • inventory_updates: Stock changes (medium volume)
  • Features: Partitioning, replication, message retention
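The topic layout above could be declared in code and handed to the confluent-kafka admin API. The partition and replication counts below are illustrative assumptions (sized roughly by the volume notes above), not the project's actual configuration:

```python
# Illustrative topic settings; counts are assumptions, not the project's config.
TOPIC_CONFIGS = {
    "user_events":       {"num_partitions": 6, "replication_factor": 3},  # high volume
    "transactions":      {"num_partitions": 3, "replication_factor": 3},  # high importance
    "inventory_updates": {"num_partitions": 3, "replication_factor": 3},  # medium volume
}

def topic_specs(configs):
    """Flatten the config dict into (name, partitions, replication) tuples.

    With confluent-kafka available, these map directly onto NewTopic:
        from confluent_kafka.admin import AdminClient, NewTopic
        admin = AdminClient({"bootstrap.servers": "localhost:9092"})
        admin.create_topics([NewTopic(n, num_partitions=p, replication_factor=r)
                             for n, p, r in topic_specs(TOPIC_CONFIGS)])
    """
    return [(name, cfg["num_partitions"], cfg["replication_factor"])
            for name, cfg in configs.items()]
```

A replication factor of 3 matches the 3-node cluster, so every partition survives the loss of any single broker.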

3. Processing Layer (Spark)

  • Spark Structured Streaming Jobs:
    • Job 1: Real-time aggregations (windowed counts, metrics)
    • Job 2: User session analysis (journey mapping)
    • Job 3: Fraud detection (anomaly detection)
    • Job 4: Recommendation engine (ML predictions)
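To make Job 3 concrete, fraud detection can start from simple rules before any ML is involved. The sketch below shows a pure rule function with assumed thresholds and field names (`amount`, `txns_last_minute`), plus the equivalent PySpark expression in a comment; none of this is the project's actual detection logic:

```python
def is_suspicious(txn, amount_threshold=1_000.0, max_txns_per_minute=5):
    """Flag a transaction using two illustrative rules (thresholds are assumptions):
    an unusually large amount, or a burst of transactions from one user."""
    return (txn.get("amount", 0.0) > amount_threshold
            or txn.get("txns_last_minute", 0) > max_txns_per_minute)

# Inside the streaming job, the same rule would be a column expression, e.g.:
#   from pyspark.sql import functions as F
#   flagged = transactions.withColumn(
#       "suspicious",
#       (F.col("amount") > 1000) | (F.col("txns_last_minute") > 5))
```

Keeping the rule as a standalone function makes it unit-testable independently of the Spark job that applies it.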

4. Storage & Output Layer

  • PostgreSQL: For real-time dashboard data
  • Parquet Files: For data lake/long-term storage
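For the data-lake sink, Parquet output is typically date-partitioned. The helper below sketches one plausible Hive-style layout (the path scheme is an assumption); the commented `foreachBatch` pattern shows how a single Spark stream could fan out to both sinks:

```python
from datetime import datetime, timezone

def parquet_partition_path(base, ts):
    """Hive-style date partition path for the data lake (layout is an assumption)."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

# On the Spark side, foreachBatch lets one stream write to both sinks, e.g.:
#   def write_batch(df, batch_id):
#       df.write.mode("append").parquet("/data/lake/events")   # long-term storage
#       df.write.jdbc(pg_url, "metrics", mode="append")        # dashboard tables
#   query = stream.writeStream.foreachBatch(write_batch).start()
```

Partitioning by date keeps historical queries cheap, since readers can prune whole directories instead of scanning every file.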

5. Visualization Layer

  • Streamlit Dashboard: Real-time business metrics
  • Spark UI: Job monitoring and debugging
  • Kafka UI: Topic and consumer group monitoring

📊 Data Flow

End-to-End Pipeline

  1. Data Generation → Mock events created at ~300 events/second
  2. Kafka Ingestion → Events published to appropriate topics
  3. Spark Processing → Real-time analytics and transformations
  4. Storage → Results saved to multiple sinks
  5. Visualization → Real-time dashboards and alerts
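The ~300 events/second figure in step 1 can be hit with a simple sleep-based throttle in the generators. The helper below is a minimal sketch of that pacing loop (function and parameter names are illustrative):

```python
import time

def emit_paced(events, send, events_per_sec=300):
    """Send events at roughly the target rate via a simple sleep-based throttle.

    `send` is any callable (e.g. a Kafka producer's produce wrapped with its topic);
    returns the number of events sent.
    """
    interval = 1.0 / events_per_sec
    sent = 0
    for event in events:
        send(event)
        sent += 1
        time.sleep(interval)
    return sent
```

A sleep per event is only approximate (it ignores the cost of `send` itself), but it is close enough for a mock workload generator.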

Processing Patterns

  • Micro-batch Processing: Spark processes data in small batches (30-second intervals)
  • Windowed Aggregations: 5-minute sliding windows for trend analysis
  • Stateful Operations: User session tracking across multiple events
  • Machine Learning: Real-time predictions and recommendations
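To see what the sliding-window pattern means concretely: a 5-minute window sliding every minute assigns each event to window_length / slide = 5 overlapping windows. The function below is a simplified pure-Python model of that assignment rule (Spark's `window()` performs the equivalent bookkeeping internally):

```python
def sliding_windows(ts, window=300, slide=60):
    """Return (start, end) for every sliding window containing event time ts.

    A simplified model of sliding-window assignment: windows are `window`
    seconds long, start every `slide` seconds, and each event falls into
    window // slide of them.
    """
    last_start = (ts // slide) * slide           # latest window start <= ts
    starts = [last_start - k * slide for k in range(window // slide)]
    return [(s, s + window) for s in sorted(starts) if s <= ts < s + window]
```

For example, an event at t = 130 s lands in five windows, the latest being [120 s, 420 s); each new batch then updates the counts of every window the event belongs to.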

🛠️ Technology Stack

Core Technologies

  • Apache Kafka 3.6+: Distributed event streaming platform
  • Apache Spark 3.5+: Distributed data processing engine
  • Python 3.10+: Primary programming language
  • PostgreSQL 15: Relational database for analytics
  • Docker: Containerization for easy deployment

Supporting Libraries

  • PySpark: Python API for Spark
  • Faker: Mock data generation
  • Streamlit: Dashboard creation
  • Plotly: Interactive visualizations
  • Confluent Kafka: Kafka Python client

About

Real-time E‑commerce Analytics is an end-to-end Kafka + Spark streaming pipeline that ingests simulated e‑commerce events, runs real‑time Spark Structured Streaming jobs (aggregations, sessionization, fraud detection, recommendations), and exposes insights through a Streamlit dashboard, with PostgreSQL/Parquet sinks and Docker Compose orchestration.
