
Multi-CSV ETL Pipeline


This project is a Python-based ETL (Extract, Transform, Load) pipeline that processes multiple CSV files with different schemas, cleans and standardizes the data, merges everything into a unified structure, and loads the final result into a SQLite database.

The project was designed in a modular, reusable, and easily extensible way. It demonstrates practical data engineering concepts such as schema normalization, column standardization, error handling, and database loading.


Key Features

  • Dynamic schema merging
    Automatically combines CSV files with different column structures into a unified schema.

  • Data standardization
    Normalizes column names and data types, including IDs and date fields.

  • Modular design
    Encapsulates the ETL logic in a reusable MultiCSVETL class.

  • Error handling
    Handles file-level read errors gracefully without breaking the entire pipeline.

  • Database integration
    Loads the processed data into a SQLite database using SQLAlchemy.

  • Runnable example
    Includes a main.py script that generates sample CSV files and runs the full ETL pipeline.


Project Structure

multi-csv-etl-pipeline/
├── data/
│   ├── activity.csv
│   ├── orders.csv
│   └── users.csv
├── etl/
│   ├── __init__.py
│   └── multi_csv_pipeline.py
├── main.py
├── unified_data.db
├── README.md
└── .gitignore

Technologies Used

  • Python
  • Pandas
  • SQLAlchemy
  • SQLite

How It Works

The pipeline performs the following steps:

  1. Reads multiple CSV files from the data/ directory
  2. Standardizes column names and selected data types
  3. Merges heterogeneous datasets into a unified schema
  4. Loads the cleaned output into a SQLite database
  5. Prints the final unified data for inspection
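Steps 2 and 3 hinge on the fact that `pandas.concat` takes the union of all columns, so files with different schemas can be stacked into one frame. A minimal illustration of that idea (the column names here are invented for the example and need not match the actual sample CSVs):

```python
import pandas as pd

users = pd.DataFrame({"User ID": [1, 2], "Signup Date": ["2024-01-05", "2024-02-10"]})
orders = pd.DataFrame({"user_id": [1], "order_total": [49.90]})


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names: "User ID" -> "user_id"
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Standardize selected dtypes: nullable integer IDs, parsed dates
    if "user_id" in df.columns:
        df["user_id"] = df["user_id"].astype("Int64")
    if "signup_date" in df.columns:
        df["signup_date"] = pd.to_datetime(df["signup_date"])
    return df


# Merge heterogeneous frames into one unified schema;
# columns absent from a source file are filled with NaN/NaT.
unified = pd.concat(
    [standardize(users), standardize(orders)],
    ignore_index=True,
    sort=False,
)
```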

Running the Project

Clone the repository:

git clone https://github.com/coderfeye13/multi-csv-etl-pipeline.git
cd multi-csv-etl-pipeline

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

Install dependencies:

pip install pandas sqlalchemy

Run the project:

python main.py

Output

After execution, the pipeline:

  • creates sample CSV files inside the data/ directory
  • generates unified_data.db
  • loads the transformed data into the database
  • prints the final merged dataset to the console

Development Context

This project was originally developed as a remote technical task and is kept in my portfolio as an example of practical ETL and data engineering work.


Author

Furkan Yilmaz

M.Sc. Computer Science
HAW Kiel University of Applied Sciences (Germany)
