riazfatima/Task-1

Advanced EDA, Vectorized Pipelines, and High-Fidelity Feature Stores

Project Overview

A production-grade data preprocessing pipeline built on the Input-Process-Output (IPO) architectural pattern. It covers data cleaning, feature engineering, statistical outlier mitigation, and strict schema validation for e-commerce order data.

Key Performance Indicators

| Metric | Value | Business Value |
| --- | --- | --- |
| Total Revenue Processed | $1,264,371.82 | Scalable processing of large-scale financial inputs |
| Average Order Value | $1,053.64 | Accurate tracking of transactional baselines |
| Outliers Neutralized | 8 | Distribution variance protected via IQR capping |
| Features Engineered | 3 | Manufactured indicators for customer segmentation |
| Data Integrity Pass Rate | 100% | Guaranteed schema alignment via Pandera |

Pipeline Architecture (IPO Pattern)

The data flows through a strict three-tier modular architecture designed for production stability:


[INPUT LAYER]     ➔      [PROCESS LAYER]     ➔     [OUTPUT LAYER]
─────────────────         ─────────────────         ────────────────
• Handling Missing Data   • Vectorized Calculations • Pandera Contract Validation
• Removing Duplicates     • Outlier Capping (IQR)   • Integrity Verifications
• Type Assertions         • Feature Manufacturing   • Clean Asset Export (.csv)
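The three layers can be sketched as three small functions chained together. This is a minimal illustration rather than the repository's actual code: the function names and the `OrderID`, `Quantity`, `UnitPrice`, and `TotalAmount` columns are assumptions, and the output layer uses plain assertions as a stand-in for the Pandera contract described above.

```python
import pandas as pd


def input_layer(raw: pd.DataFrame) -> pd.DataFrame:
    """Input: remove duplicates and incomplete rows, then assert column types."""
    df = raw.drop_duplicates().dropna().copy()
    df["Quantity"] = df["Quantity"].astype(int)
    df["UnitPrice"] = df["UnitPrice"].astype(float)
    return df


def process_layer(df: pd.DataFrame) -> pd.DataFrame:
    """Process: vectorized column arithmetic, no Python-level row loop."""
    out = df.copy()
    out["TotalAmount"] = out["Quantity"] * out["UnitPrice"]
    return out


def output_layer(df: pd.DataFrame) -> pd.DataFrame:
    """Output: integrity checks (plain assertions standing in for a Pandera schema)."""
    assert not df.isna().any().any(), "unexpected missing values"
    assert (df["TotalAmount"] >= 0).all(), "negative order total"
    return df  # a real pipeline would follow this with df.to_csv(...)


# Hypothetical raw orders: one duplicate row and one row with a missing quantity.
raw_orders = pd.DataFrame(
    {
        "OrderID": [1, 1, 2, 3],
        "Quantity": [2, 2, 1, None],
        "UnitPrice": [10.0, 10.0, 25.5, 4.0],
    }
)
clean_orders = output_layer(process_layer(input_layer(raw_orders)))
```

Keeping each layer as a function that takes and returns a DataFrame means any stage can be tested or replaced on its own.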


Core Technical Features

1. Automated Missing Data Matrix

  • < 5% Missingness: Automatic row deletion.
  • 5% - 20% Missingness: Skew-resistant Median Imputation.
  • > 20% Missingness: Categorical Proxy Imputation (e.g., handling CouponCode gaps by mapping to "None" without losing transactional history).
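One way the three-tier rule could be expressed in pandas is shown below. It is a sketch under assumptions: the function name `handle_missing` and the sample columns are hypothetical, the middle tier is assumed to hit numeric columns only, and the top tier is assumed to hit categorical columns such as `CouponCode`.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame, low: float = 0.05, high: float = 0.20) -> pd.DataFrame:
    """Apply a per-column strategy chosen by that column's share of missing values."""
    out = df.copy()
    missing_share = out.isna().mean()  # fraction missing per column, measured once up front
    for col, share in missing_share.items():
        if share == 0:
            continue
        if share < low:
            # Under 5%: drop the affected rows.
            out = out.dropna(subset=[col]).copy()
        elif share <= high:
            # 5% to 20%: median imputation (assumes a numeric column).
            out[col] = out[col].fillna(out[col].median())
        else:
            # Over 20%: fill with the "None" category (assumes a categorical column).
            out[col] = out[col].fillna("None")
    return out


# Hypothetical data: 2.5%, 10%, and 50% missing in the three columns.
n = 40
orders = pd.DataFrame(
    {
        "Quantity": [None] + [1.0] * (n - 1),
        "UnitPrice": [10.0, None, None, None, None] + [20.0] * (n - 5),
        "CouponCode": ["SAVE10", None] * (n // 2),
    }
)
cleaned = handle_missing(orders)
```

The shares are computed before any rows are dropped, so the strategy chosen for one column does not shift because of deletions triggered by another.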

2. High-Fidelity Feature Engineering

  • OrderSizeCategory: Quantile-based customer segmentation (Low, Mid, High-Value).
  • IsPremiumProduct: Binary indicator flagging high-margin orders by comparing price against the median.
  • IsDiscounted: Binary indicator isolating coupon usage so performance can be tracked for discounted and full-price orders separately.
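The three features might be derived along these lines. The input columns `TotalAmount` and `UnitPrice`, the choice of three equal-frequency bins, and the "strictly above the median" rule are assumptions made for illustration; only the feature names, the category labels, and the `CouponCode` "None" placeholder come from the descriptions above.

```python
import pandas as pd

# Hypothetical order data; coupon gaps are assumed to be already filled with "None".
orders = pd.DataFrame(
    {
        "TotalAmount": [50.0, 120.0, 300.0, 800.0, 1500.0, 2400.0],
        "UnitPrice": [25.0, 60.0, 100.0, 200.0, 500.0, 800.0],
        "CouponCode": ["None", "SAVE10", "None", "None", "VIP20", "None"],
    }
)

# OrderSizeCategory: quantile-based bins over the order total.
orders["OrderSizeCategory"] = pd.qcut(
    orders["TotalAmount"], q=3, labels=["Low", "Mid", "High-Value"]
)

# IsPremiumProduct: 1 when the unit price is above the median unit price.
orders["IsPremiumProduct"] = (orders["UnitPrice"] > orders["UnitPrice"].median()).astype(int)

# IsDiscounted: 1 when a real coupon code is present.
orders["IsDiscounted"] = (orders["CouponCode"] != "None").astype(int)
```

All three assignments operate on whole columns, which keeps the step vectorized.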

3. Non-Parametric Outlier Mitigation

  • Utilizes the Interquartile Range (IQR) method to isolate anomalies.
  • Implements upper and lower fence boundary capping via NumPy to prevent distribution distortion without dropping valid data points.
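A minimal sketch of IQR capping with `numpy.clip` follows. The function name `cap_outliers_iqr` and the sample values are hypothetical, and the fence multiplier of 1.5 is the conventional Tukey choice, assumed here because the text above does not state which multiplier the project uses.

```python
import numpy as np
import pandas as pd


def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the fences [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    capped = np.clip(values.to_numpy(), lower, upper)
    return pd.Series(capped, index=values.index, name=values.name)


# Hypothetical order totals with one extreme value.
amounts = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 5000.0], name="TotalAmount")
capped_amounts = cap_outliers_iqr(amounts)
```

Every row is kept: the extreme value is pulled back to the upper fence, while values inside the fences pass through unchanged.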

Quick Start

1. Installation & Setup

# Clone the repository
git clone https://github.com/riazfatima/Advanced-EDA-and-Feature-Engineering.git
cd Advanced-EDA-and-Feature-Engineering

# Install dependencies
pip install -r requirements.txt



---

## 🎓 Core Competencies Demonstrated

1. **Production Architecture:** Decoupled stages ensuring high code maintainability.
2. **Defensive Data Design:** Data contract implementation using `pandera`.
3. **Mathematical Rigor:** Non-parametric statistical boundaries applied to messy business data.
4. **Vectorization:** High-efficiency processing leveraging native Pandas and NumPy methods.

About

Advanced EDA, statistical data cleaning, and feature engineering pipeline for online store orders using Python and Pandera, developed during my Data Science internship at DecodeLabs.
