This project demonstrates a real-time data engineering pipeline for processing Uber trip data with Python, Apache Spark (Structured Streaming), Azure Data Lake, and Databricks.
The pipeline ingests streaming data, processes it through multiple transformation layers (Bronze → Silver → Gold), and enables downstream analytics and insights.
Data Source → Ingestion Layer → Bronze Layer → Silver Layer → Gold Layer → Analytics
- Ingestion Layer: Streaming data ingestion using Python
- Bronze Layer: Raw data storage (unprocessed)
- Silver Layer: Cleaned and transformed data
- Gold Layer: Aggregated, business-ready data
- Processing Engine: Apache Spark (Structured Streaming)
- 🐍 Python
- ⚡ Apache Spark (Structured Streaming)
- ☁️ Azure Data Lake
- 📊 Databricks
- 🧠 SQL
- 📁 Delta Lake (optional enhancement)
Uber_Data_RealTimeStreaming/
│
├── Code_Files/
│   ├── ingest.py           # Data ingestion script
│   ├── bronze_adls.ipynb   # Bronze layer processing
│   ├── silver.py           # Silver transformations
│   ├── silver_obt.sql      # SQL transformations
│   ├── silver_obt.ipynb    # Notebook transformations
│   ├── model.py            # Data modeling / logic
│   └── readme.md           # Internal documentation
│
├── Data/                   # Sample or input data
├── templates/              # Config/templates
├── .gitignore
├── .python-version
└── README.md
- Ingestion: Reads real-time or batch data and pushes the raw records into the Bronze layer
- Bronze Layer: Stores raw, unprocessed data; the schema is only loosely enforced
- Silver Layer: Cleans and transforms the data, handling nulls, duplicates, and schema enforcement
- Gold Layer: Produces business-level aggregations, ready for dashboards and analytics
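
The Silver and Gold steps above can be sketched in plain Python to show the logic involved. This is only an illustration under stated assumptions, not the code in `silver.py` or `model.py`: the field names (`trip_id`, `pickup_zone`, `fare`, `distance_km`), the helper names `to_silver` and `to_gold`, and the sample rows are all hypothetical, and the actual pipeline would express the same steps with Spark rather than Python lists.

```python
from collections import defaultdict

# Hypothetical Silver schema: field name -> type to cast to.
# These field names are illustrative assumptions, not the project's real schema.
SILVER_SCHEMA = {"trip_id": str, "pickup_zone": str, "fare": float, "distance_km": float}


def to_silver(bronze_rows):
    """Clean raw Bronze rows: drop nulls, enforce the schema, drop duplicates."""
    seen_trip_ids = set()
    silver_rows = []
    for raw in bronze_rows:
        # Null handling: skip rows missing any required field.
        if any(raw.get(field) in (None, "") for field in SILVER_SCHEMA):
            continue
        # Schema enforcement: cast each field; skip rows that fail to cast.
        try:
            row = {field: cast(raw[field]) for field, cast in SILVER_SCHEMA.items()}
        except (TypeError, ValueError):
            continue
        # Deduplication: keep the first occurrence of each trip_id.
        if row["trip_id"] in seen_trip_ids:
            continue
        seen_trip_ids.add(row["trip_id"])
        silver_rows.append(row)
    return silver_rows


def to_gold(silver_rows):
    """Aggregate Silver rows into per-zone trip count, total fare, and average fare."""
    totals = defaultdict(lambda: {"trips": 0, "total_fare": 0.0})
    for row in silver_rows:
        zone = totals[row["pickup_zone"]]
        zone["trips"] += 1
        zone["total_fare"] += row["fare"]
    return {
        name: {**agg, "avg_fare": agg["total_fare"] / agg["trips"]}
        for name, agg in totals.items()
    }


# Made-up Bronze rows: raw string values, including a duplicate, a null, and a bad type.
bronze = [
    {"trip_id": "t1", "pickup_zone": "Downtown", "fare": "12.50", "distance_km": "3.2"},
    {"trip_id": "t1", "pickup_zone": "Downtown", "fare": "12.50", "distance_km": "3.2"},
    {"trip_id": "t2", "pickup_zone": "Airport", "fare": None, "distance_km": "18.0"},
    {"trip_id": "t3", "pickup_zone": "Downtown", "fare": "7.50", "distance_km": "1.9"},
    {"trip_id": "t4", "pickup_zone": "Airport", "fare": "abc", "distance_km": "20.1"},
]

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Of the five sample rows, only `t1` and `t3` survive cleaning, so the Gold output contains a single `Downtown` entry. Dropping invalid rows silently is a simplification for the sketch; a real pipeline might instead route them to a quarantine table for inspection.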
- Python 3.9+
- Apache Spark / Databricks
- Cloud Storage (Azure / AWS)
# Clone the repository
git clone https://github.com/namansinghal111/Uber_Data_RealTImeStreaming.git
# Navigate to project
cd Uber_Data_RealTImeStreaming
# Install dependencies (if applicable)
pip install -r requirements.txt
# Run ingestion
python Code_Files/ingest.py
- Real-time ride analytics 🚖
- Surge pricing insights 📈
- Driver & trip performance analysis
- Streaming ETL pipeline demo for interviews
- ✅ Real-time data processing
- ✅ Layered architecture (Bronze/Silver/Gold)
- ✅ Scalable & cloud-ready
- ✅ Modular code design
- ✅ Supports both batch & streaming
- Add Kafka for real-time ingestion
- Implement Delta Lake optimizations
- Add Airflow for orchestration
- Dashboard integration (Power BI / Tableau)
- CI/CD pipeline setup
Naman Singhal
- 💼 Data Engineer | Cloud & AI Enthusiast
- ☁️ AWS | Azure | Databricks
- 🤖 AI/ML & Real-Time Systems
If you like this project:
- ⭐ Star the repo
- 🍴 Fork it
- 📢 Share with others
This project is licensed under the MIT License.