Skip to content

namansinghal111/Uber_Data_RealTImeStreaming

Repository files navigation

🚀 Uber Data Real-Time Streaming Pipeline

Python Spark Databricks Azure AWS Status


📌 Overview

This project demonstrates a real-time data engineering pipeline for processing Uber trip data using modern big data technologies.

The pipeline ingests streaming data, processes it through multiple transformation layers (Bronze → Silver → Gold), and enables downstream analytics and insights.


🏗️ Architecture

Data Source → Ingestion Layer → Bronze Layer → Silver Layer → Gold Layer → Analytics

🔹 Key Components:

  • Ingestion Layer: Streaming data ingestion using Python
  • Bronze Layer: Raw data storage (unprocessed)
  • Silver Layer: Cleaned and transformed data
  • Gold Layer: Aggregated, business-ready data
  • Processing Engine: Apache Spark (Structured Streaming)

⚙️ Tech Stack

  • 🐍 Python
  • ⚡ Apache Spark (Structured Streaming)
  • ☁️ Azure Data Lake
  • 📊 Databricks
  • 🧠 SQL
  • 📁 Delta Lake (optional enhancement)

📂 Project Structure

Uber_Data_RealTimeStreaming/
│
├── Code_Files/
│   ├── ingest.py              # Data ingestion script
│   ├── bronze_adls.ipynb      # Bronze layer processing
│   ├── silver.py              # Silver transformations
│   ├── silver_obt.sql         # SQL transformations
│   ├── silver_obt.ipynb       # Notebook transformations
│   ├── model.py               # Data modeling / logic
│   └── readme.md              # Internal documentation
│
├── Data/                      # Sample or input data
├── templates/                 # Config/templates
├── .gitignore
├── .python-version
└── README.md

🔄 Data Pipeline Flow

1️⃣ Ingestion (ingest.py)

  • Reads real-time / batch data
  • Pushes raw data into Bronze layer

2️⃣ Bronze Layer

  • Stores raw, unprocessed data
  • Schema is loosely enforced

3️⃣ Silver Layer (silver.py, silver_obt.sql)

  • Data cleaning & transformation
  • Handles nulls, duplicates, schema enforcement

4️⃣ Gold Layer (Modeling)

  • Business-level aggregations
  • Ready for dashboards & analytics

▶️ How to Run

✅ Prerequisites

  • Python 3.9+
  • Apache Spark / Databricks
  • Cloud Storage (Azure / AWS)

🔧 Steps

# Clone the repository
git clone https://github.com/namansinghal111/Uber_Data_RealTImeStreaming.git

# Navigate to project
cd Uber_Data_RealTimeStreaming

# Install dependencies (if applicable)
pip install -r requirements.txt

# Run ingestion
python Code_Files/ingest.py

📊 Use Cases

  • Real-time ride analytics 🚖
  • Surge pricing insights 📈
  • Driver & trip performance analysis
  • Streaming ETL pipeline demo for interviews

🌟 Features

  • ✅ Real-time data processing
  • ✅ Layered architecture (Bronze/Silver/Gold)
  • ✅ Scalable & cloud-ready
  • ✅ Modular code design
  • ✅ Supports both batch & streaming

🚧 Future Improvements

  • Add Kafka for real-time ingestion
  • Implement Delta Lake optimizations
  • Add Airflow for orchestration
  • Dashboard integration (Power BI / Tableau)
  • CI/CD pipeline setup

👨‍💻 Author

Naman Singhal

💼 Data Engineer | Cloud & AI Enthusiast ☁️ AWS | Azure | Databricks 🤖 AI/ML & Real-Time Systems


⭐ Support

If you like this project:

  • ⭐ Star the repo
  • 🍴 Fork it
  • 📢 Share with others

📜 License

This project is licensed under the MIT License.

About

This project demonstrates a real-time data engineering pipeline for processing Uber trip data using modern big data technologies. The pipeline ingests streaming data, processes it through multiple transformation layers (Bronze → Silver → Gold), and enables downstream analytics and insights.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages