This project demonstrates a complete, enterprise-style data engineering solution built on Microsoft Azure. The objective is to ingest, process, transform, and publish COVID‑19 data using multiple Azure services, following modern data lake and analytics architecture patterns.
The solution integrates public health data from the European Centre for Disease Prevention and Control (ECDC) with population reference data to deliver accurate, scalable, and analytics-ready datasets. The final outputs are consumed through interactive Power BI dashboards for reporting and analysis.
The architecture follows a layered approach consisting of data sources, ingestion, staging, transformation, serving, and publishing layers. Each layer uses Azure-native services chosen for scalability, performance, and maintainability.
The solution architecture diagram illustrates how data flows from external and internal sources, through transformation engines, into a relational serving layer, and finally into business intelligence tools.
The project pursues the following objectives:
- Build an end-to-end data engineering pipeline using Microsoft Azure services
- Ingest structured and semi-structured data from multiple sources
- Store raw and processed data in a scalable data lake
- Perform both light and heavy data transformations using the right tools
- Deliver clean, analytics-ready data to a relational database
- Enable business users to analyze COVID‑19 trends using Power BI
- Demonstrate best practices in cloud-based data architecture
The primary dataset is sourced from the European Centre for Disease Prevention and Control (ECDC). This data is accessed through a public REST API using an HTTP connector.
The dataset includes COVID‑19 case counts, deaths, dates, and country-level reporting information. Because the source is external and regularly updated, it is ingested dynamically through Azure Data Factory.
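As a sketch of what the HTTP ingestion retrieves, the following Python snippet downloads the raw CSV payload and parses it into records. The URL and column names here are illustrative only; in the pipeline itself the real ECDC endpoint is configured in the ADF HTTP linked service, not in code.

```python
import csv
import io
from urllib.request import urlopen

# Hypothetical endpoint; the actual ECDC URL is configured in ADF.
ECDC_URL = "https://example.com/ecdc/covid19/casedistribution/csv"

def fetch_raw_csv(url: str) -> str:
    """Download the raw CSV payload exactly as served (no transformation)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")

def parse_records(raw_csv: str) -> list[dict]:
    """Parse the CSV into row dictionaries keyed by the source's column headers."""
    return list(csv.DictReader(io.StringIO(raw_csv)))
```

Keeping the fetch untransformed mirrors the raw-fidelity principle: parsing happens downstream, never at ingestion.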
Population reference data is stored in Azure Blob Storage, an object storage service optimized for unstructured and semi-structured data.
This population dataset is used to calculate ratios and metrics such as cases per capita and death rates by population.
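The population-adjusted metrics reduce to simple ratios, sketched below in Python:

```python
def cases_per_100k(cases: int, population: int) -> float:
    """Population-adjusted incidence: reported cases per 100,000 inhabitants."""
    return cases / population * 100_000

def death_rate(deaths: int, population: int) -> float:
    """Deaths as a share of the population."""
    return deaths / population
```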
Azure Data Factory acts as the central orchestration and ingestion service for the entire pipeline. ADF pipelines schedule, automate, and monitor data movement activities.
The following ingestion mechanisms are implemented:
- HTTP Linked Service to connect to the ECDC public REST API
- Blob Storage Linked Service to access population datasets
- Copy Activities to move raw data into the data lake
ADF ensures that data ingestion is reliable, repeatable, and auditable. All raw data is ingested without transformation to preserve source fidelity.
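As a sketch, an ADF HTTP linked service is defined in JSON along these lines; the name and base URL below are illustrative:

```json
{
  "name": "ls_ecdc_http",
  "properties": {
    "type": "HttpServer",
    "typeProperties": {
      "url": "https://opendata.ecdc.europa.eu/",
      "authenticationType": "Anonymous"
    }
  }
}
```

Datasets and Copy Activities then reference this linked service by name, so the connection details live in one place.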
All ingested data is stored in Azure Data Lake Storage Gen2. This layer represents the raw or bronze layer of the data architecture.
ADLS Gen2 provides:
- Massively scalable storage
- Hierarchical namespace for folder-based organization
- Optimized performance for big data analytics
- Integration with Spark, Hive, and SQL-based services
Data in this layer is stored in its original format and structure, enabling reprocessing and historical analysis if needed.
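One common convention for the raw layer, sketched below in Python, is to organize files into date-partitioned folders under the hierarchical namespace; the container and folder names here are hypothetical:

```python
from datetime import date

RAW_CONTAINER = "raw"  # hypothetical container name for the raw/bronze layer

def raw_path(source: str, dataset: str, ingest_date: date) -> str:
    """Build a date-partitioned folder path, e.g. raw/ecdc/cases_deaths/2020/12/14/.
    Date partitioning makes reprocessing a single ingestion run straightforward."""
    return f"{RAW_CONTAINER}/{source}/{dataset}/{ingest_date:%Y/%m/%d}/"
```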
The transformation layer uses multiple services, each selected based on the complexity and nature of the transformation tasks.
ADF is reused in this layer for lightweight, flow-based transformations. Mapping Data Flows and Copy Activities are used to perform:
- Schema mapping
- Column selection and renaming
- Simple aggregations
- Data cleansing and filtering
These transformations are ideal for structured data and low-to-medium complexity workloads.
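The select/rename/filter logic that a Mapping Data Flow applies can be sketched in plain Python; the column mapping below is illustrative, loosely following the ECDC CSV headers:

```python
import csv
import io

# Hypothetical column mapping, mirroring a Mapping Data Flow "select" transformation.
COLUMN_MAP = {
    "dateRep": "report_date",
    "countriesAndTerritories": "country",
    "cases": "cases",
    "deaths": "deaths",
}

def cleanse(raw_csv: str) -> list[dict]:
    """Select and rename columns, drop rows with missing counts, cast numerics."""
    out = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if not row.get("cases") or not row.get("deaths"):
            continue  # filter out incomplete rows
        rec = {new: row[old] for old, new in COLUMN_MAP.items()}
        rec["cases"] = int(rec["cases"])
        rec["deaths"] = int(rec["deaths"])
        out.append(rec)
    return out
```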
Azure Databricks is used for advanced, compute-intensive transformations. It is based on Apache Spark and supports Python and PySpark notebooks.
Databricks is responsible for:
- Large-scale data joins
- Complex aggregations across multiple datasets
- Enrichment of COVID‑19 data with population data
- Performance-optimized transformations on large volumes of data
Notebook-driven development enables version control, experimentation, and collaboration across teams.
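The enrichment step runs as a PySpark join in a Databricks notebook; the same join-and-derive logic can be sketched in plain Python (field names are illustrative):

```python
def enrich_with_population(cases: list[dict], population: dict[str, int]) -> list[dict]:
    """Join daily case records to population reference data on country
    and derive a per-100k incidence column (inner-join semantics)."""
    enriched = []
    for row in cases:
        pop = population.get(row["country"])
        if pop is None:
            continue  # drop countries without reference data, as an inner join would
        enriched.append({**row,
                         "population": pop,
                         "cases_per_100k": row["cases"] / pop * 100_000})
    return enriched
```

At scale, Spark distributes the same join across the cluster; the Python version only illustrates the row-level semantics.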
Azure HDInsight provides managed Hadoop and Hive clusters. It is used for SQL-style transformations directly on files stored in the data lake.
HDInsight enables:
- Hive SQL queries on data lake files
- Batch processing workloads
- Compatibility with traditional big data ecosystems
This allows data engineers familiar with SQL-based tooling to work efficiently within the data lake environment.
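A HiveQL sketch of this pattern: an external table laid over files in the data lake, followed by a SQL-style aggregation. Table, path, and column names are hypothetical, and `<storage_account>` is a placeholder:

```sql
-- Hypothetical external table over raw files in ADLS Gen2.
CREATE EXTERNAL TABLE IF NOT EXISTS covid_cases (
  report_date DATE,
  country     STRING,
  cases       INT,
  deaths      INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'abfs://raw@<storage_account>.dfs.core.windows.net/ecdc/cases_deaths/';

-- SQL-style transformation: monthly totals per country.
SELECT country,
       date_format(report_date, 'yyyy-MM') AS report_month,
       SUM(cases)  AS total_cases,
       SUM(deaths) AS total_deaths
FROM covid_cases
GROUP BY country, date_format(report_date, 'yyyy-MM');
```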
After transformation, clean and curated datasets are loaded into Azure SQL Database. This database acts as the serving or gold layer of the architecture.
Key characteristics of this layer include:
- Relational schema design
- Optimized indexes and tables
- High availability and reliability
- Fast query performance for analytics
This structured layer is designed specifically for downstream consumption by analytics and reporting tools.
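One possible shape for such a serving table in T-SQL, with an index aligned to the typical dashboard query; the schema and names are illustrative, not the project's actual DDL:

```sql
-- Hypothetical serving-layer table for daily, population-enriched figures.
CREATE TABLE dbo.covid_daily (
    report_date     DATE          NOT NULL,
    country         VARCHAR(100)  NOT NULL,
    cases           INT           NOT NULL,
    deaths          INT           NOT NULL,
    population      BIGINT        NOT NULL,
    cases_per_100k  DECIMAL(10,2) NULL,
    CONSTRAINT pk_covid_daily PRIMARY KEY (report_date, country)
);

-- Index supporting the common "trend by country over time" dashboard query.
CREATE NONCLUSTERED INDEX ix_covid_daily_country
    ON dbo.covid_daily (country, report_date)
    INCLUDE (cases, deaths);
```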
Power BI connects directly to Azure SQL Database to build interactive dashboards. These dashboards allow users to explore and analyze COVID‑19 data visually.
Example insights provided by the dashboards include:
- Daily and cumulative case counts
- Trends over time by country and region
- Deaths and recovery statistics
- Population-adjusted metrics such as cases per 100,000 people
Power BI provides self-service analytics capabilities for stakeholders without requiring direct access to the data engineering infrastructure.
In summary, the solution delivers:
- End-to-end Azure data pipeline
- Multiple ingestion sources with ADF
- Raw data storage in ADLS Gen2
- Multi-engine transformation strategy (ADF, Databricks, HDInsight)
- Relational serving layer using Azure SQL Database
- Interactive Power BI dashboards
- Scalable and modular architecture
Key learnings from the project include:
- Choosing the right Azure service for each processing requirement
- Designing layered data lake architectures
- Balancing cost, performance, and complexity
- Building pipelines that are reusable and extensible
- Integrating big data platforms with traditional BI tools
Potential future enhancements:
- Implement data quality checks and validation rules
- Add incremental data loading and change data capture
- Enhance monitoring and logging
- Integrate machine learning models for forecasting
- Automate infrastructure using Infrastructure as Code
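As one sketch of the incremental-loading enhancement, a high-water-mark filter selects only rows newer than the last successful load, a common pattern when the source lacks change data capture; the field name is hypothetical:

```python
from datetime import date

def incremental_window(last_watermark: date, rows: list[dict]) -> list[dict]:
    """Return only rows newer than the stored high-water mark.
    After a successful load, the watermark advances to the max date seen."""
    return [r for r in rows if r["report_date"] > last_watermark]
```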
This project showcases a real-world, production-style data engineering solution on Azure. It demonstrates how multiple Azure services can work together to ingest, transform, and deliver meaningful insights from raw data.
The architecture is scalable, modular, and extensible, making it a strong foundation for advanced analytics and data science workloads.