GitHub - riazfatima/Task-1: Advanced EDA, statistical data cleaning, and feature engineering pipeline for online store orders using Python and Pandera, developed during my Data Science internship at DecodeLabs.

Advanced EDA, Vectorized Pipelines, and High-Fidelity Feature Stores

Project Overview

A production-grade data preprocessing pipeline built on the Input-Process-Output (IPO) architectural pattern. The project covers data cleaning, feature engineering, statistical outlier mitigation, and strict schema validation for e-commerce order data.

Key Performance Indicators

| Metric | Value | Business Value |
| --- | --- | --- |
| Total Revenue Processed | $1,264,371.82 | Scalable processing of large-scale financial inputs |
| Average Order Value | $1,053.64 | Accurate tracking of transactional baselines |
| Outliers Neutralized | 8 | Distribution variance protected via IQR capping |
| Features Engineered | 3 | Manufactured indicators for customer segmentation |
| Data Integrity Pass Rate | 100% | Guaranteed schema alignment via Pandera |

Pipeline Architecture (IPO Pattern)

The data flows through a strict three-tier modular architecture designed for production stability:


[INPUT LAYER]     ➔      [PROCESS LAYER]     ➔     [OUTPUT LAYER]
─────────────────         ─────────────────         ────────────────
• Handling Missing Data   • Vectorized Calculations • Pandera Contract Validation
• Removing Duplicates     • Outlier Capping (IQR)   • Integrity Verifications
• Type Assertions         • Feature Manufacturing   • Clean Asset Export (.csv)
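
The three layers can be read as three plain functions chained together. The following is a minimal sketch of that idea, not the repository's actual code: the function names and the columns `OrderID`, `Quantity`, `UnitPrice`, and `TotalPrice` are assumptions made for illustration, and the row drop in the input layer stands in for the fuller missing-data rules described later.

```python
import pandas as pd

def input_layer(raw: pd.DataFrame) -> pd.DataFrame:
    """Remove exact duplicate rows and coerce numeric columns (hypothetical names)."""
    df = raw.drop_duplicates().copy()
    df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
    df["UnitPrice"] = pd.to_numeric(df["UnitPrice"], errors="coerce")
    # Values that could not be parsed became NaN; drop those rows here.
    return df.dropna(subset=["Quantity", "UnitPrice"])

def process_layer(df: pd.DataFrame) -> pd.DataFrame:
    """Vectorized column arithmetic, with no Python-level row loop."""
    out = df.copy()
    out["TotalPrice"] = out["Quantity"] * out["UnitPrice"]
    return out

def output_layer(df: pd.DataFrame) -> pd.DataFrame:
    """Final integrity check before the frame is exported."""
    if (df["TotalPrice"] < 0).any():
        raise ValueError("negative TotalPrice found")
    return df

raw = pd.DataFrame({
    "OrderID": [1, 1, 2, 3],
    "Quantity": ["2", "2", "1", "bad"],
    "UnitPrice": [10.0, 10.0, 5.5, 3.0],
})
clean = output_layer(process_layer(input_layer(raw)))
# A real run would finish with something like clean.to_csv(...).
```

Keeping each layer a pure function of a DataFrame makes the stages testable in isolation.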

Core Technical Features

1. Automated Missing Data Matrix

< 5% Missingness: Automatic row deletion.
5% - 20% Missingness: Skew-resistant Median Imputation.
> 20% Missingness: Categorical Proxy Imputation (e.g., handling CouponCode gaps by mapping to "None" without losing transactional history).
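
The three tiers above can be sketched as a single per-column dispatch. This is an illustration under stated assumptions rather than the project's implementation: the function name `handle_missing` is hypothetical, the 20% boundary is treated as inclusive, and non-numeric columns in the middle tier fall back to the `"None"` placeholder because a median is undefined for them.

```python
import numpy as np
import pandas as pd

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a three-tier missing-data rule column by column (hypothetical helper)."""
    out = df.copy()
    # Share of missing values per column, measured once on the input frame.
    missing_share = out.isna().mean()
    for col, share in missing_share.items():
        if share == 0:
            continue
        if share < 0.05:
            # Tier 1: few gaps, so drop the affected rows.
            out = out.dropna(subset=[col])
        elif share <= 0.20 and pd.api.types.is_numeric_dtype(out[col]):
            # Tier 2: the median is robust to skewed distributions.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Tier 3 (and the non-numeric fallback, an assumption): explicit placeholder.
            out[col] = out[col].fillna("None")
    return out

n = 40
orders = pd.DataFrame({
    "Price": [100.0] * n,
    "Quantity": [float(i % 5 + 1) for i in range(n)],
    "CouponCode": ["SAVE10" if i % 2 == 0 else None for i in range(n)],
})
orders.loc[0, "Price"] = np.nan                  # 2.5% missing: row dropped
orders.loc[[1, 2, 3, 4], "Quantity"] = np.nan    # 10% missing: median imputed
cleaned = handle_missing(orders)                 # CouponCode is 50% missing: "None"
```

Note that filling a numeric column with the string `"None"` would change its dtype, so the third tier is best reserved for categorical columns such as `CouponCode`.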

2. High-Fidelity Feature Engineering

OrderSizeCategory: Quantile-based customer segmentation (Low, Mid, High-Value).
IsPremiumProduct: Binary indicator capturing high-margin orders against median pricing.
IsDiscounted: Indicator isolating coupon usage, so performance metrics can be tracked separately for discounted and non-discounted orders.
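
One way these three features might be derived with vectorized pandas calls is shown below. It is a sketch only: the input columns `UnitPrice`, `TotalPrice`, and `CouponCode` are assumed names, the split into three equal-frequency bins is an assumption about how "quantile-based" is realized, and the `"None"` placeholder follows the missing-data rule described earlier.

```python
import pandas as pd

orders = pd.DataFrame({
    "UnitPrice": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "TotalPrice": [10.0, 40.0, 30.0, 160.0, 50.0, 300.0],
    "CouponCode": ["None", "SAVE10", "None", "None", "VIP", "None"],
})

# Quantile-based segmentation: three equal-frequency bins over the order total.
orders["OrderSizeCategory"] = pd.qcut(
    orders["TotalPrice"], q=3, labels=["Low", "Mid", "High-Value"]
)

# Binary flag: unit price strictly above the median unit price.
median_price = orders["UnitPrice"].median()
orders["IsPremiumProduct"] = (orders["UnitPrice"] > median_price).astype(int)

# Binary flag: any coupon other than the "None" placeholder counts as discounted.
orders["IsDiscounted"] = (orders["CouponCode"] != "None").astype(int)
```

Each feature is computed by one column-wise expression, so the cost stays linear in the number of rows without an explicit loop.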

3. Non-Parametric Outlier Mitigation

Utilizes the Interquartile Range (IQR) method to isolate anomalies.
Implements upper and lower fence boundary capping via NumPy to prevent distribution distortion without dropping valid data points.
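
A minimal sketch of fence capping follows. The fences are computed as Q1 - k * IQR and Q3 + k * IQR, and values outside them are clipped to the nearest fence instead of being dropped. The multiplier k = 1.5 is the conventional Tukey choice and is an assumption here, as is the helper name `cap_outliers_iqr`; the source does not state which multiplier the pipeline uses.

```python
import numpy as np
import pandas as pd

def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the IQR fences; k = 1.5 is an assumed default."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # np.clip keeps every row and only pulls extreme values back to a fence.
    capped = np.clip(values.to_numpy(), lower, upper)
    return pd.Series(capped, index=values.index, name=values.name)

prices = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0], name="UnitPrice")
capped = cap_outliers_iqr(prices)
```

In this toy series only the value 500.0 lies outside the fences, so it is the only one changed, and the row count is preserved.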

Quick Start

1. Installation & Setup

# Clone the repository
git clone https://github.com/riazfatima/Advanced-EDA-and-Feature-Engineering.git
cd Advanced-EDA-and-Feature-Engineering

# Install dependencies
pip install -r requirements.txt



---

## 🎓 Core Competencies Demonstrated

1. **Production Architecture:** Decoupled stages ensuring high code maintainability.
2. **Defensive Data Design:** Data contract implementation using `pandera`.
3. **Mathematical Rigor:** Non-parametric statistical boundaries applied to messy business data.
4. **Vectorization:** High-efficiency processing leveraging native Pandas and NumPy methods.

Repository files:

- `Advanced EDA & Feature Engineering.py`
- `High_Fidelity_Cleaned_Orders.csv`
- `High_Fidelity_EDA_Dashboard.png`
- `LICENSE`
- `Online_Store_Orders.csv`
- `README.md`
