A production-grade data preprocessing pipeline built on the Input-Process-Output (IPO) architectural pattern. This project covers data cleaning, feature engineering, IQR-based outlier mitigation, and strict schema validation for e-commerce data.
| Metric | Value | Business Value |
|---|---|---|
| Total Revenue Processed | $1,264,371.82 | Scalable processing of large-scale financial inputs |
| Average Order Value | $1,053.64 | Accurate tracking of transactional baselines |
| Outliers Neutralized | 8 | Distribution variance protected via IQR capping |
| Features Engineered | 3 | Manufactured indicators for customer segmentation |
| Data Integrity Pass Rate | 100% | Guaranteed schema alignment via Pandera |
The data flows through a strict three-tier modular architecture designed for production stability:

| Input Layer ➔ | Process Layer ➔ | Output Layer |
|---|---|---|
| Handling Missing Data | Vectorized Calculations | Pandera Contract Validation |
| Removing Duplicates | Outlier Capping (IQR) | Integrity Verifications |
| Type Assertions | Feature Manufacturing | Clean Asset Export (.csv) |

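The three stages above can be sketched as three small functions chained together. This is a minimal illustration, not the project's actual code: the column names (`OrderID`, `Quantity`, `UnitPrice`, `TotalAmount`) and function names are hypothetical, missing-data handling is left out of the input stage for brevity, and the output stage uses plain assertions as a stand-in for the Pandera contract.

```python
import pandas as pd


def input_layer(raw: pd.DataFrame) -> pd.DataFrame:
    """Input stage: remove exact duplicate rows and assert column types."""
    df = raw.drop_duplicates().copy()
    df["Quantity"] = df["Quantity"].astype("int64")
    df["UnitPrice"] = df["UnitPrice"].astype("float64")
    return df


def process_layer(df: pd.DataFrame) -> pd.DataFrame:
    """Process stage: vectorized calculation of a derived column."""
    out = df.copy()
    out["TotalAmount"] = out["Quantity"] * out["UnitPrice"]
    return out


def output_layer(df: pd.DataFrame) -> pd.DataFrame:
    """Output stage: integrity checks (plain-assert stand-in for a schema contract)."""
    assert not df.isna().any().any(), "unexpected missing values"
    assert (df["TotalAmount"] >= 0).all(), "negative order total"
    return df


# Hypothetical raw input containing one duplicated order row.
raw = pd.DataFrame(
    {
        "OrderID": [1, 2, 2, 3],
        "Quantity": [2, 1, 1, 4],
        "UnitPrice": [10.0, 25.5, 25.5, 3.0],
    }
)

clean = output_layer(process_layer(input_layer(raw)))
```

Because each stage takes and returns a DataFrame, the stages can be tested in isolation; the validated frame could then be exported with `clean.to_csv(...)`.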
- **< 5% Missingness:** Automatic row deletion.
- **5% - 20% Missingness:** Skew-resistant Median Imputation.
- **> 20% Missingness:** Categorical Proxy Imputation (e.g., handling `CouponCode` gaps by mapping to `"None"` without losing transactional history).
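The tiered strategy above could look roughly like the following. This is a sketch under stated assumptions: the function name `handle_missing` and the sample columns are hypothetical, columns in the 5% to 20% tier are assumed to be numeric (so a median exists), and columns above 20% are assumed to be categorical.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, fill_label: str = "None") -> pd.DataFrame:
    """Apply a per-column strategy chosen by the share of missing values."""
    out = df.copy()
    # Missing share per column, measured once on the incoming frame.
    missing_share = out.isna().mean()
    for col, share in missing_share.items():
        if share == 0:
            continue
        if share < 0.05:
            # Under 5%: drop the affected rows.
            out = out.dropna(subset=[col])
        elif share <= 0.20:
            # 5% to 20%: median imputation (assumes a numeric column).
            out[col] = out[col].fillna(out[col].median())
        else:
            # Over 20%: proxy label (assumes a categorical column).
            out[col] = out[col].fillna(fill_label)
    return out


# Hypothetical 40-row sample with one column in each tier.
orders = pd.DataFrame(
    {
        "Quantity": [np.nan] + [1.0] * 39,             # 2.5% missing
        "UnitPrice": [5.0] * 36 + [np.nan] * 4,        # 10% missing
        "CouponCode": ["SAVE10"] * 10 + [None] * 30,   # 75% missing
    }
)

cleaned = handle_missing(orders)
```

Only the single row with a missing `Quantity` is dropped; the sparse `CouponCode` column keeps every remaining order by labeling gaps as `"None"`.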
- `OrderSizeCategory`: Quantile-based customer segmentation (Low, Mid, High-Value).
- `IsPremiumProduct`: Binary indicator capturing high-margin orders against median pricing.
- `IsDiscounted`: Tracks performance metrics by isolating coupon usage.
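One way the three features could be derived with vectorized pandas calls is sketched below. The source does not say which columns feed each feature, so the inputs here (`TotalAmount` for the quantile bins, `UnitPrice` for the median comparison, `CouponCode` with `"None"` as the no-coupon label) and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical orders; column names and values are illustrative only.
orders = pd.DataFrame(
    {
        "TotalAmount": [20.0, 45.0, 80.0, 150.0, 320.0, 900.0],
        "UnitPrice": [10.0, 15.0, 40.0, 50.0, 160.0, 300.0],
        "CouponCode": ["None", "SAVE10", "None", "None", "VIP20", "None"],
    }
)

# OrderSizeCategory: three equal-frequency (tercile) bins on the order total.
orders["OrderSizeCategory"] = pd.qcut(
    orders["TotalAmount"], q=3, labels=["Low", "Mid", "High-Value"]
)

# IsPremiumProduct: 1 when the unit price is above the median unit price.
median_price = orders["UnitPrice"].median()
orders["IsPremiumProduct"] = (orders["UnitPrice"] > median_price).astype(int)

# IsDiscounted: 1 when a real coupon code is present.
orders["IsDiscounted"] = (orders["CouponCode"] != "None").astype(int)
```

`pd.qcut` bins by rank rather than by fixed value ranges, so each segment holds roughly the same number of orders even when totals are heavily skewed.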
- Utilizes the Interquartile Range (IQR) method to isolate anomalies.
- Implements upper and lower fence boundary capping via NumPy to prevent distribution distortion without dropping valid data points.
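A minimal sketch of IQR fence capping with NumPy follows. The helper name `cap_outliers_iqr` and the sample values are hypothetical, and the fence multiplier `k = 1.5` is the conventional Tukey choice rather than a value stated in this README.

```python
import numpy as np
import pandas as pd


def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the fences [Q1 - k*IQR, Q3 + k*IQR]; no rows are dropped."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    capped = np.clip(values.to_numpy(), lower, upper)
    return pd.Series(capped, index=values.index, name=values.name)


# Hypothetical order totals with one extreme value.
amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0], name="TotalAmount")
capped = cap_outliers_iqr(amounts)
```

Here the extreme order is pulled down to the upper fence while the other six values pass through unchanged, so the row count and the bulk of the distribution are preserved.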
# Clone the repository
git clone https://github.com/riazfatima/Advanced-EDA-and-Feature-Engineering.git
cd Advanced-EDA-and-Feature-Engineering
# Install dependencies
pip install -r requirements.txt
---
## 🎓 Core Competencies Demonstrated
1. **Production Architecture:** Decoupled stages ensuring high code maintainability.
2. **Defensive Data Design:** Data contract implementation using `pandera`.
3. **Mathematical Rigor:** Non-parametric statistical boundaries applied to messy business data.
4. **Vectorization:** High-efficiency processing leveraging native Pandas and NumPy methods.