This project showcases an end-to-end Azure Data Engineering pipeline built to process Spotify streaming datasets.
The primary focus is on duplicate data detection, data quality enforcement, and scalable analytics delivery using Azure-native services.
The solution follows the Medallion Architecture (Bronze, Silver, Gold) to ensure data reliability and performance.
## Medallion Architecture

### Bronze Layer (Raw)
- Raw Spotify data ingestion (CSV/JSON); a minimal read sketch follows this list
- Stored in Azure Data Lake Storage Gen2
- Orchestrated via Azure Data Factory
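
For reference, a minimal PySpark sketch of the Bronze read. The `abfss://` path, container name, and CSV layout are assumptions for illustration, not actual project settings:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spotify-bronze-ingest").getOrCreate()

# Hypothetical ADLS Gen2 location; container and storage account are placeholders.
raw_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/spotify/streams/"

# Read the raw CSV drops as-is; schema enforcement happens in the Silver layer,
# so inference is acceptable at this stage.
bronze_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)
```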
### Silver Layer (Cleansed)
- Data cleansing and schema standardization
- Duplicate record removal using PySpark (see the sketch after this list)
- Delta Lake used for ACID compliance
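
A sketch of the deduplication and Delta write, continuing from the Bronze read above. The natural key of `user_id`, `track_id`, `played_at` and the load timestamp `ingest_ts` are hypothetical column names, not confirmed fields of the dataset:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep exactly one row per logical event, preferring the most recently ingested copy.
dedup_window = (
    Window.partitionBy("user_id", "track_id", "played_at")
          .orderBy(F.col("ingest_ts").desc())
)

silver_df = (
    bronze_df
    .withColumn("rn", F.row_number().over(dedup_window))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Writing in Delta format provides the ACID guarantees on the Silver zone.
(
    silver_df.write.format("delta")
    .mode("overwrite")
    .save("abfss://silver@<storage-account>.dfs.core.windows.net/spotify/streams/")
)
```

Where no tie-breaking timestamp exists, `bronze_df.dropDuplicates([...])` over the key columns achieves the same effect with less ceremony.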
### Gold Layer (Curated)
- Aggregated, analytics-ready datasets
- Served via Azure Synapse Analytics
## Tech Stack

| Layer | Technology |
|---|---|
| Ingestion | Azure Data Factory |
| Storage | Azure Data Lake Storage Gen2 |
| Processing | Azure Databricks (PySpark) |
| Warehouse | Azure Synapse Analytics |
| Format | Delta Lake |
## Pipeline Steps

1. Ingest raw Spotify data into ADLS Gen2
2. Validate schema and metadata (a validation sketch follows this list)
3. Detect and eliminate duplicate records
4. Store cleansed data in Delta format
5. Aggregate data for analytics
6. Query curated datasets using Synapse
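
Step 2 could be implemented along these lines. The expected schema below is a hypothetical Spotify export layout, not the project's actual contract:

```python
from pyspark.sql import DataFrame
from pyspark.sql.types import (
    IntegerType, StringType, StructField, StructType, TimestampType,
)

# Hypothetical expected schema; the real column set depends on the Spotify export.
EXPECTED_SCHEMA = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("track_id", StringType(), nullable=False),
    StructField("played_at", TimestampType(), nullable=False),
    StructField("ms_played", IntegerType(), nullable=True),
])

def validate_schema(df: DataFrame, expected: StructType) -> None:
    """Fail the run early if expected columns are missing from the raw frame."""
    missing = {f.name for f in expected.fields} - set(df.columns)
    if missing:
        raise ValueError(f"Schema validation failed, missing columns: {missing}")

validate_schema(bronze_df, EXPECTED_SCHEMA)
```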
## Key Features

- ✔ End-to-end automated pipeline
- ✔ Duplicate data detection & removal
- ✔ Idempotent and re-runnable workflows (see the MERGE sketch after this list)
- ✔ Scalable medallion architecture
- ✔ Enterprise-grade data quality handling
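
Idempotency typically comes from loading with a Delta `MERGE` rather than a blind append: re-running the same batch updates matched rows instead of inserting duplicates. A sketch, reusing the hypothetical key columns from the deduplication example:

```python
from delta.tables import DeltaTable

silver_path = "abfss://silver@<storage-account>.dfs.core.windows.net/spotify/streams/"
target = DeltaTable.forPath(spark, silver_path)

# Upsert the new batch; a second run of the same batch leaves the table unchanged.
(
    target.alias("t")
    .merge(
        silver_df.alias("s"),
        "t.user_id = s.user_id AND t.track_id = s.track_id AND t.played_at = s.played_at",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```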
## Analytics Use Cases

- Top streamed tracks and artists (an example aggregate follows this list)
- Popularity trend analysis
- User listening behavior insights
- BI-ready datasets for reporting tools
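
As an illustration of the first use case, a Gold-layer aggregate of top streamed tracks might look like this; `track_name` and `artist_name` are assumed column names:

```python
from pyspark.sql import functions as F

# Rank tracks by total stream count and keep the top 100 for reporting.
top_tracks = (
    silver_df
    .groupBy("track_id", "track_name", "artist_name")
    .agg(F.count("*").alias("stream_count"))
    .orderBy(F.col("stream_count").desc())
    .limit(100)
)

# Persist to the Gold zone, where Synapse can query the curated Delta table.
(
    top_tracks.write.format("delta")
    .mode("overwrite")
    .save("abfss://gold@<storage-account>.dfs.core.windows.net/spotify/top_tracks/")
)
```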