This AI/data analysis project analyzes online retail data, identifies high-value customers, and builds customer segmentation models. It demonstrates an end-to-end workflow: data understanding, preprocessing, model building, validation, and evaluation.
- ✅ Python Data Processing: pandas, numpy for data cleaning and feature engineering
- ✅ Reproducible Scripts: Modular design with clear code structure
- ✅ Testing & Validation: Defined inputs/outputs, edge cases, and evaluation metrics
- ✅ Comparative Evaluation: Baseline vs improved version performance comparison
- ✅ Engineering Practices: Git version control, clear documentation, runnable project structure
.
├── README.md                        # Project documentation
├── requirements.txt                 # Dependencies list
├── .gitignore                       # Git ignore file
├── data/
│   └── online_retail_II.csv         # Raw data
├── src/
│   ├── data_preprocessing.py        # Data preprocessing module
│   ├── rfm_analysis.py              # RFM customer value analysis
│   ├── customer_segmentation.py     # Customer segmentation model
│   └── evaluation.py                # Evaluation and validation module
├── tests/
│   ├── test_preprocessing.py        # Preprocessing tests
│   ├── test_rfm.py                  # RFM analysis tests
│   └── test_evaluation.py           # Evaluation tests
├── notebooks/
│   └── customer_analysis.ipynb      # Complete analysis workflow notebook
└── results/
    └── README.md                    # Results documentation
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Option 1: Using Jupyter Notebook (Recommended)

jupyter notebook notebooks/customer_analysis.ipynb

Option 2: Run Python Scripts

# Data preprocessing
python src/data_preprocessing.py
# RFM analysis
python src/rfm_analysis.py
# Customer segmentation
python src/customer_segmentation.py
# Evaluation and validation
python src/evaluation.py

pytest tests/

- Data cleaning: Handle missing values, outliers
- Data validation: Check data quality and completeness
- Feature engineering: Extract time features, calculate derived metrics

Input: Raw CSV file
Output: Cleaned DataFrame
Validation Metrics: Data completeness, missing rate, outlier detection
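
The cleaning, feature engineering, and validation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/data_preprocessing.py`: the column names (`Invoice`, `Quantity`, `Price`, `InvoiceDate`, `Customer ID`), the helper names `clean_transactions` and `quality_report`, the "non-positive values are invalid" rule, and the 1.5 × IQR outlier rule are all assumptions made for the example.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, column: str = "Price") -> dict:
    """Summarize completeness, missing rate, and an IQR-based outlier rate."""
    missing_rate = df.isna().to_numpy().mean()  # share of empty cells in the table
    values = df[column].dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    # Assumed rule: values beyond 1.5 * IQR from the quartiles count as outliers.
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return {
        "completeness": 1.0 - missing_rate,
        "missing_rate": missing_rate,
        "outlier_rate": is_outlier.mean(),
    }


def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable rows and add derived columns (illustrative rules only)."""
    # Rows without a customer ID cannot be attributed to a customer.
    df = raw.dropna(subset=["Customer ID"])
    # Assumed rule: non-positive quantities or prices are invalid records.
    df = df[(df["Quantity"] > 0) & (df["Price"] > 0)].copy()
    # Parse timestamps and extract simple time features.
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    df["Month"] = df["InvoiceDate"].dt.month
    df["Weekday"] = df["InvoiceDate"].dt.dayofweek
    # Derived metric: revenue per transaction line.
    df["TotalPrice"] = df["Quantity"] * df["Price"]
    return df.reset_index(drop=True)


# Tiny made-up sample standing in for data/online_retail_II.csv.
raw = pd.DataFrame({
    "Invoice": ["489434", "489434", "489435", "C489449", "489436", "489437"],
    "Quantity": [12, 6, 3, -1, 2, 1],
    "Price": [6.95, 2.10, 4.25, 1.25, 3.00, 500.0],
    "InvoiceDate": [
        "2009-12-01 07:45:00", "2009-12-01 07:45:00", "2009-12-01 09:06:00",
        "2009-12-01 10:33:00", "2009-12-02 09:00:00", "2009-12-02 11:15:00",
    ],
    "Customer ID": [13085.0, 13085.0, 13078.0, 16321.0, None, 15362.0],
})

report = quality_report(raw)
cleaned = clean_transactions(raw)
print(report)
print(cleaned[["Invoice", "Month", "Weekday", "TotalPrice"]])
```

In this sketch outliers are only reported, not removed; whether to drop, cap, or keep them is a project decision.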
- Recency: Time since the most recent purchase
- Frequency: Number of purchases
- Monetary: Total spending amount
- Customer value scoring

Input: Cleaned transaction data
Output: RFM scores and customer segments
Evaluation Metrics: Segment rationality, customer distribution uniformity
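
As a rough illustration of the RFM step described above, the sketch below aggregates transactions per customer and assigns quantile-based scores. The function names `compute_rfm` and `score_rfm`, the column names, the snapshot date, and the choice of three score bins are assumptions for this example and may differ from `src/rfm_analysis.py`.

```python
import pandas as pd


def compute_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Return one row per customer with Recency (days), Frequency, Monetary."""
    rfm = tx.groupby("Customer ID").agg(
        LastPurchase=("InvoiceDate", "max"),
        Frequency=("Invoice", "nunique"),   # distinct invoices, not lines
        Monetary=("TotalPrice", "sum"),
    )
    rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
    return rfm[["Recency", "Frequency", "Monetary"]]


def score_rfm(rfm: pd.DataFrame, bins: int = 3) -> pd.DataFrame:
    """Add 1..bins scores per dimension; higher always means more valuable."""
    scored = rfm.copy()
    up = list(range(1, bins + 1))
    # Ranking first breaks ties so qcut always gets distinct bin edges.
    scored["R_Score"] = pd.qcut(
        rfm["Recency"].rank(method="first"), bins, labels=up[::-1]
    ).astype(int)  # fewer days since last purchase -> higher score
    scored["F_Score"] = pd.qcut(
        rfm["Frequency"].rank(method="first"), bins, labels=up
    ).astype(int)
    scored["M_Score"] = pd.qcut(
        rfm["Monetary"].rank(method="first"), bins, labels=up
    ).astype(int)
    scored["RFM_Score"] = scored[["R_Score", "F_Score", "M_Score"]].sum(axis=1)
    return scored


# Made-up transactions for three customers.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 3],
    "Invoice": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-03-01", "2010-01-20",
        "2010-02-10", "2010-02-20", "2010-03-09",
    ]),
    "TotalPrice": [20.0, 35.0, 10.0, 50.0, 60.0, 40.0],
})

rfm = compute_rfm(tx, snapshot=pd.Timestamp("2010-03-10"))
scored = score_rfm(rfm)
print(scored)
```

The snapshot date is usually set to one day after the last transaction in the data so that every Recency value is positive.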
- Baseline method: Rule-based segmentation using RFM
- Improved method: K-means clustering + RFM features
- Performance comparison evaluation
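
The baseline and improved methods listed above could be compared along the following lines. This is a sketch under stated assumptions: the RFM table is synthetic, the baseline rule (one high/low flag per dimension split at the median, giving up to 2³ = 8 segments) is one plausible reading of "rule-based", and the log transform, standardization, and `n_clusters=3` are illustrative choices rather than the settings in `src/customer_segmentation.py`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic RFM table standing in for the output of the RFM step.
rng = np.random.default_rng(0)
n = 30
rfm = pd.DataFrame({
    "Recency": np.concatenate([
        rng.integers(1, 30, n), rng.integers(30, 120, n), rng.integers(180, 365, n),
    ]),
    "Frequency": np.concatenate([
        rng.integers(10, 40, n), rng.integers(3, 10, n), rng.integers(1, 3, n),
    ]),
    "Monetary": np.concatenate([
        rng.gamma(9.0, 300.0, n), rng.gamma(5.0, 80.0, n), rng.gamma(2.0, 40.0, n),
    ]),
})
rfm_cols = ["Recency", "Frequency", "Monetary"]

# Baseline (assumed rule): high/low flag per dimension, split at the median.
r_high = (rfm["Recency"] <= rfm["Recency"].median()).astype(int)  # recent = high
f_high = (rfm["Frequency"] > rfm["Frequency"].median()).astype(int)
m_high = (rfm["Monetary"] > rfm["Monetary"].median()).astype(int)
rfm["baseline_segment"] = r_high * 4 + f_high * 2 + m_high  # codes 0..7

# Improved: K-means on log-scaled, standardized RFM features.
features = StandardScaler().fit_transform(np.log1p(rfm[rfm_cols]))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
rfm["kmeans_segment"] = kmeans.fit_predict(features)

# Score both labelings in the same feature space so the numbers are comparable.
scores = {
    "baseline": silhouette_score(features, rfm["baseline_segment"]),
    "kmeans": silhouette_score(features, rfm["kmeans_segment"]),
}
print(scores)
print(rfm.groupby("kmeans_segment")[rfm_cols].mean().round(1))
```

Scaling matters here because K-means uses Euclidean distance; without it, Monetary would dominate the other two features.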

Input: RFM feature data
Output: Customer segmentation results and evaluation report
Evaluation Metrics:
- Silhouette Score
- Intra-cluster cohesion
- Inter-cluster separation
- Business metrics (average value per segment)
- Define test cases and edge cases
- Calculate evaluation metrics (accuracy, precision, recall, etc.)
- Generate comparison reports
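
One way to calculate clustering metrics and produce a baseline-versus-improved comparison is sketched below. The definitions are assumptions, since the project does not spell them out here: cohesion is taken as the mean squared distance of points to their own cluster centroid (lower is better), and separation as the mean pairwise distance between centroids (higher is better). The function names are hypothetical.

```python
import numpy as np


def cohesion(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean squared distance from each point to its own cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total / len(X)


def separation(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean pairwise Euclidean distance between cluster centroids."""
    centroids = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(centroids), k=1)  # each pair counted once
    return float(dists[upper].mean())


def relative_improvement(baseline: float, improved: float,
                         lower_is_better: bool = False) -> float:
    """Percentage change versus baseline, signed so positive means better."""
    change = (baseline - improved) if lower_is_better else (improved - baseline)
    return 100.0 * change / baseline


# Four points forming a 10 x 2 rectangle, labeled two different ways.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
baseline_labels = np.array([0, 1, 0, 1])  # groups points that are far apart
improved_labels = np.array([0, 0, 1, 1])  # groups points that are close together

report = {
    "cohesion_improvement_pct": relative_improvement(
        cohesion(X, baseline_labels), cohesion(X, improved_labels),
        lower_is_better=True,
    ),
    "separation_improvement_pct": relative_improvement(
        separation(X, baseline_labels), separation(X, improved_labels),
    ),
}
print(report)
```

On this toy data cohesion drops from 25 to 1 and separation rises from 2 to 10, so the report shows improvements of 96% and 400%.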
Test Scenarios:
- Normal data flow
- Edge cases (single-purchase customers, abnormal large orders, etc.)
- Data quality validation
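
The edge cases above might be expressed as pytest-style tests like the ones below. To keep the example self-contained, `compute_rfm` is a hypothetical stand-in defined inline; the real tests in `tests/test_rfm.py` would import the project's own function, whose name and signature may differ.

```python
import pandas as pd

SNAPSHOT = pd.Timestamp("2011-01-01")


def compute_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Hypothetical stand-in for the project's RFM function."""
    rfm = tx.groupby("Customer ID").agg(
        LastPurchase=("InvoiceDate", "max"),
        Frequency=("Invoice", "nunique"),
        Monetary=("TotalPrice", "sum"),
    )
    rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
    return rfm[["Recency", "Frequency", "Monetary"]]


def make_tx(rows):
    """Build a small transaction table from (customer, invoice, date, amount)."""
    df = pd.DataFrame(
        rows, columns=["Customer ID", "Invoice", "InvoiceDate", "TotalPrice"]
    )
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    return df


def test_single_purchase_customer():
    rfm = compute_rfm(make_tx([(1, "A1", "2010-12-25", 15.0)]), SNAPSHOT)
    assert rfm.loc[1, "Frequency"] == 1
    assert rfm.loc[1, "Recency"] == 7
    assert rfm.loc[1, "Monetary"] == 15.0


def test_abnormally_large_order_stays_with_its_customer():
    rows = [(1, "A1", "2010-12-01", 20.0), (2, "B1", "2010-12-02", 1_000_000.0)]
    rfm = compute_rfm(make_tx(rows), SNAPSHOT)
    assert rfm.loc[1, "Monetary"] == 20.0
    assert rfm.loc[2, "Monetary"] == 1_000_000.0


def test_multi_line_invoice_counts_as_one_purchase():
    rows = [(1, "A1", "2010-12-01", 5.0), (1, "A1", "2010-12-01", 7.0)]
    rfm = compute_rfm(make_tx(rows), SNAPSHOT)
    assert rfm.loc[1, "Frequency"] == 1
    assert rfm.loc[1, "Monetary"] == 12.0
```

Saved in a file under `tests/`, these functions are picked up by `pytest tests/` through the `test_` name prefix.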
- Data completeness: > 95%
- Missing value rate: < 5%
- Outlier detection rate
- Baseline Method:
  - Number of segments: 8 (based on RFM rules)
  - Computation speed: Fast
  - Interpretability: High
- Improved Method:
  - Silhouette Score: > 0.3
  - Intra-cluster cohesion: 20%+ improvement
  - Inter-cluster separation: 15%+ improvement
- High-value customer identification accuracy
- Customer churn prediction accuracy
- Average spending per segment
- Python 3.8+
- pandas: Data processing
- numpy: Numerical computation
- scikit-learn: Machine learning models
- matplotlib/seaborn: Data visualization
- pytest: Unit testing
- jupyter: Interactive analysis
- Complete Engineering Practices: Modular design, test coverage, comprehensive documentation
- Verifiable Components: Each module has clear inputs/outputs and validation metrics
- Comparative Evaluation: Quantitative comparison of Baseline vs improved versions
- Edge Case Handling: Processing of abnormal data and extreme cases
- Reproducibility: Clear code structure and dependency management
MIT License