
Multi-CSV ETL Pipeline


This project is a Python-based ETL (Extract, Transform, Load) pipeline that processes multiple CSV files with different schemas, cleans and standardizes the data, merges everything into a unified structure, and loads the final result into a SQLite database.

The project was designed in a modular, reusable, and easily extensible way. It demonstrates practical data engineering concepts such as schema normalization, column standardization, error handling, and database loading.


Key Features

  • Dynamic schema merging
    Automatically combines CSV files with different column structures into a unified schema.

  • Data standardization
    Normalizes column names and data types, including IDs and date fields.

  • Modular design
    Encapsulates the ETL logic in a reusable MultiCSVETL class.

  • Error handling
    Handles file-level read errors gracefully without breaking the entire pipeline.

  • Database integration
    Loads the processed data into a SQLite database using SQLAlchemy.

  • Runnable example
    Includes a main.py script that generates sample CSV files and runs the full ETL pipeline.


Project Structure

multi-csv-etl-pipeline/
├── data/
│   ├── activity.csv
│   ├── orders.csv
│   └── users.csv
├── etl/
│   ├── __init__.py
│   └── multi_csv_pipeline.py
├── main.py
├── unified_data.db
├── README.md
└── .gitignore

Technologies Used

  • Python
  • Pandas
  • SQLAlchemy
  • SQLite

How It Works

The pipeline performs the following steps:

  1. Reads multiple CSV files from the data/ directory
  2. Standardizes column names and selected data types
  3. Merges heterogeneous datasets into a unified schema
  4. Loads the cleaned output into a SQLite database
  5. Prints the final unified data for inspection
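Steps 2 and 3 hinge on the fact that `pandas.concat` takes the union of all columns, so files with different schemas can be stacked into one frame. A minimal illustration of that idea (the column names here are invented for the example and need not match the actual sample CSVs):

```python
import pandas as pd

users = pd.DataFrame({"User ID": [1, 2], "Signup Date": ["2024-01-05", "2024-02-10"]})
orders = pd.DataFrame({"user_id": [1], "order_total": [49.90]})


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names: "User ID" -> "user_id"
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Standardize selected dtypes: nullable integer IDs, parsed dates
    if "user_id" in df.columns:
        df["user_id"] = df["user_id"].astype("Int64")
    if "signup_date" in df.columns:
        df["signup_date"] = pd.to_datetime(df["signup_date"])
    return df


# Merge heterogeneous frames into one unified schema;
# columns absent from a source file are filled with NaN/NaT.
unified = pd.concat(
    [standardize(users), standardize(orders)],
    ignore_index=True,
    sort=False,
)
```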

Running the Project

Clone the repository:

git clone https://github.com/coderfeye13/multi-csv-etl-pipeline.git
cd multi-csv-etl-pipeline

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

Install dependencies:

pip install pandas sqlalchemy

Run the project:

python main.py

Output

After execution, the pipeline:

  • creates sample CSV files inside the data/ directory
  • generates unified_data.db
  • loads the transformed data into the database
  • prints the final merged dataset to the console

Development Context

This project was originally developed as a remote technical task and is kept in my portfolio as an example of practical ETL and data engineering work.


Author

Furkan Yilmaz

M.Sc. Computer Science
HAW Kiel University of Applied Sciences (Germany)
