🚀 Uber Data Real-Time Streaming Pipeline

📌 Overview

This project demonstrates a real-time data engineering pipeline for processing Uber trip data using modern big data technologies.

The pipeline ingests streaming data, processes it through multiple transformation layers (Bronze → Silver → Gold), and enables downstream analytics and insights.

🏗️ Architecture

Data Source → Ingestion Layer → Bronze Layer → Silver Layer → Gold Layer → Analytics

🔹 Key Components:

Ingestion Layer: Streaming data ingestion using Python
Bronze Layer: Raw data storage (unprocessed)
Silver Layer: Cleaned and transformed data
Gold Layer: Aggregated, business-ready data
Processing Engine: Apache Spark (Structured Streaming)

⚙️ Tech Stack

🐍 Python
⚡ Apache Spark (Structured Streaming)
☁️ Azure Data Lake
📊 Databricks
🧠 SQL
📁 Delta Lake (optional enhancement)

📂 Project Structure

Uber_Data_RealTimeStreaming/
│
├── Code_Files/
│   ├── ingest.py              # Data ingestion script
│   ├── bronze_adls.ipynb      # Bronze layer processing
│   ├── silver.py              # Silver transformations
│   ├── silver_obt.sql         # SQL transformations
│   ├── silver_obt.ipynb       # Notebook transformations
│   ├── model.py               # Data modeling / logic
│   └── readme.md              # Internal documentation
│
├── Data/                      # Sample or input data
├── templates/                 # Config/templates
├── .gitignore
├── .python-version
└── README.md

🔄 Data Pipeline Flow

1️⃣ Ingestion (`ingest.py`)

Reads real-time / batch data
Pushes raw data into Bronze layer

2️⃣ Bronze Layer

Stores raw, unprocessed data
Schema is loosely enforced

3️⃣ Silver Layer (`silver.py`, `silver_obt.sql`)

Data cleaning & transformation
Handles nulls, duplicates, schema enforcement

4️⃣ Gold Layer (Modeling)

Business-level aggregations
Ready for dashboards & analytics

▶️ How to Run

✅ Prerequisites

Python 3.9+
Apache Spark / Databricks
Cloud Storage (Azure / AWS)

🔧 Steps

# Clone the repository
git clone https://github.com/namansinghal111/Uber_Data_RealTImeStreaming.git

# Navigate to project
cd Uber_Data_RealTimeStreaming

# Install dependencies (if applicable)
pip install -r requirements.txt

# Run ingestion
python Code_Files/ingest.py

📊 Use Cases

Real-time ride analytics 🚖
Surge pricing insights 📈
Driver & trip performance analysis
Streaming ETL pipeline demo for interviews

🌟 Features

✅ Real-time data processing
✅ Layered architecture (Bronze/Silver/Gold)
✅ Scalable & cloud-ready
✅ Modular code design
✅ Supports both batch & streaming

🚧 Future Improvements

Add Kafka for real-time ingestion
Implement Delta Lake optimizations
Add Airflow for orchestration
Dashboard integration (Power BI / Tableau)
CI/CD pipeline setup

👨‍💻 Author

Naman Singhal

💼 Data Engineer | Cloud & AI Enthusiast ☁️ AWS | Azure | Databricks 🤖 AI/ML & Real-Time Systems

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
Code_Files		Code_Files
Data		Data
templates		templates
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
Uber_Project.svg		Uber_Project.svg
api.py		api.py
architecture.png		architecture.png
connection.py		connection.py
data.py		data.py
files_array.json		files_array.json
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🚀 Uber Data Real-Time Streaming Pipeline

📌 Overview

🏗️ Architecture

🔹 Key Components:

⚙️ Tech Stack

📂 Project Structure

🔄 Data Pipeline Flow

1️⃣ Ingestion (`ingest.py`)

2️⃣ Bronze Layer

3️⃣ Silver Layer (`silver.py`, `silver_obt.sql`)

4️⃣ Gold Layer (Modeling)

▶️ How to Run

✅ Prerequisites

🔧 Steps

📊 Use Cases

🌟 Features

🚧 Future Improvements

👨‍💻 Author

⭐ Support

📜 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🚀 Uber Data Real-Time Streaming Pipeline

📌 Overview

🏗️ Architecture

🔹 Key Components:

⚙️ Tech Stack

📂 Project Structure

🔄 Data Pipeline Flow

1️⃣ Ingestion (ingest.py)

2️⃣ Bronze Layer

3️⃣ Silver Layer (silver.py, silver_obt.sql)

4️⃣ Gold Layer (Modeling)

▶️ How to Run

✅ Prerequisites

🔧 Steps

📊 Use Cases

🌟 Features

🚧 Future Improvements

👨‍💻 Author

⭐ Support

📜 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1️⃣ Ingestion (`ingest.py`)

3️⃣ Silver Layer (`silver.py`, `silver_obt.sql`)

Packages