This project is a Python-based ETL (Extract, Transform, Load) pipeline that processes multiple CSV files with different schemas, cleans and standardizes the data, merges everything into a unified structure, and loads the final result into a SQLite database.
The project was designed in a modular, reusable, and easily extensible way. It demonstrates practical data engineering concepts such as schema normalization, column standardization, error handling, and database loading.
- Dynamic schema merging: automatically combines CSV files with different column structures into a unified schema.
- Data standardization: normalizes column names and data types, including IDs and date fields.
- Modular design: encapsulates the ETL logic in a reusable MultiCSVETL class.
- Error handling: handles file-level read errors gracefully without breaking the entire pipeline.
- Database integration: loads the processed data into a SQLite database using SQLAlchemy.
- Runnable example: includes a main.py script that generates sample CSV files and runs the full ETL pipeline.
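To make the schema merging and standardization features concrete, here is a minimal sketch using pandas. It is not the project's actual MultiCSVETL code: the helper names `standardize_columns` and `merge_frames`, the sample data, and the snake_case naming rule are all assumptions chosen for illustration.

```python
from __future__ import annotations

import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case, snake_case column names (assumed rule)."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )
    return out


def merge_frames(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Stack frames row-wise; a column missing from a frame is filled with NaN."""
    cleaned = [standardize_columns(frame) for frame in frames]
    return pd.concat(cleaned, ignore_index=True, sort=False)


# Hypothetical sample data with two different schemas.
users = pd.DataFrame({"User ID": [1, 2], "Name": ["Ada", "Linus"]})
orders = pd.DataFrame(
    {"user_id": [1], "Order Date": ["2024-01-05"], "Amount": [19.99]}
)

unified = merge_frames([users, orders])

# Normalize selected data types: nullable integer IDs and real datetimes.
unified["user_id"] = pd.to_numeric(unified["user_id"], errors="coerce").astype("Int64")
unified["order_date"] = pd.to_datetime(unified["order_date"], errors="coerce")

print(unified)
```

Because `pd.concat` takes the union of all columns, rows from `users` simply carry missing values for `order_date` and `amount`, so no schema has to be declared up front.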
multi-csv-etl-pipeline/
├── data/
│ ├── activity.csv
│ ├── orders.csv
│ └── users.csv
├── etl/
│ ├── __init__.py
│ └── multi_csv_pipeline.py
├── main.py
├── unified_data.db
├── README.md
└── .gitignore
- Python
- Pandas
- SQLAlchemy
- SQLite
The pipeline performs the following steps:
- Reads multiple CSV files from the data/ directory
- Standardizes column names and selected data types
- Merges heterogeneous datasets into a unified schema
- Loads the cleaned output into a SQLite database
- Prints the final unified data for inspection
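The steps above can be sketched end to end as follows, with the focus on skipping unreadable files and loading the result. This is an illustration under stated assumptions, not the project's implementation: the function names `read_csv_files` and `load_to_sqlite` and the table name `unified_data` are hypothetical, and the sketch passes a standard-library `sqlite3` connection to `DataFrame.to_sql` instead of the SQLAlchemy engine the project uses, to keep the example dependency-light.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_files(paths):
    """Read each CSV; record and skip files that cannot be read or parsed."""
    frames, failed = [], []
    for path in paths:
        try:
            frames.append(pd.read_csv(path))
        except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
            failed.append((str(path), repr(exc)))
    return frames, failed


def load_to_sqlite(df, db_path, table="unified_data"):
    """Write the frame to a SQLite table, replacing any existing table."""
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table, conn, if_exists="replace", index=False)
        conn.commit()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # Hypothetical sample files; broken.csv is empty and cannot be parsed.
    (data_dir / "users.csv").write_text("user_id,name\n1,Ada\n2,Linus\n")
    (data_dir / "orders.csv").write_text("user_id,amount\n1,19.99\n")
    (data_dir / "broken.csv").write_text("")

    frames, failed = read_csv_files(sorted(data_dir.glob("*.csv")))
    unified = pd.concat(frames, ignore_index=True, sort=False)

    db_path = data_dir / "unified_data.db"
    load_to_sqlite(unified, db_path)

    # Read the row count back to confirm the load.
    conn = sqlite3.connect(db_path)
    row_count = conn.execute("SELECT COUNT(*) FROM unified_data").fetchone()[0]
    conn.close()

print(f"loaded {row_count} rows, skipped {len(failed)} file(s)")
```

Collecting failures in a list rather than raising lets the caller decide whether a skipped file is acceptable or should abort the run.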
Clone the repository:
git clone https://github.com/coderfeye13/multi-csv-etl-pipeline.git
cd multi-csv-etl-pipeline
Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate
Install dependencies:
pip install pandas sqlalchemy
Run the project:
python main.py
After execution, the pipeline:
- creates sample CSV files inside the data/ directory
- generates unified_data.db
- loads the transformed data into the database
- prints the final merged dataset to the console
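To check the generated database without rerunning the pipeline, a small helper along these lines can list each table and its row count. The function name `summarize_sqlite` is hypothetical, and because the README does not state the table names inside unified_data.db, the demo below builds its own throwaway database.

```python
import sqlite3
import tempfile
from pathlib import Path


def summarize_sqlite(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    # Open read-only so a mistyped path raises instead of creating an empty file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    # Throwaway database standing in for unified_data.db.
    demo_db = Path(tmp) / "demo.db"
    conn = sqlite3.connect(demo_db)
    conn.execute("CREATE TABLE demo (user_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])
    conn.commit()
    conn.close()

    summary = summarize_sqlite(demo_db)

print(summary)
```

After running main.py, calling `summarize_sqlite("unified_data.db")` from the project root should report whichever tables the pipeline created.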
This project was originally developed as a remote technical task and is kept in my portfolio as an example of practical ETL and data engineering work.
Furkan Yilmaz
M.Sc. Computer Science
HAW Kiel University of Applied Sciences (Germany)