Complete AI/Data Analysis project: RFM-based customer segmentation with Baseline vs Improved method comparison. Python + pandas + scikit-learn. Includes data preprocessing, validation, testing, and evaluation metrics.

bigdatapersonnel/customer-value-analysis

🛍️ Online Retail Customer Value Analysis System

📋 Project Overview

This AI/Data Analysis project analyzes online retail data to identify high-value customers and build customer segmentation models. It covers the full workflow: data understanding, preprocessing, model building, validation, and evaluation.

⭐ Core Skills Demonstrated

  • Python Data Processing: pandas, numpy for data cleaning and feature engineering
  • Reproducible Scripts: Modular design with clear code structure
  • Testing & Validation: Defined inputs/outputs, edge cases, and evaluation metrics
  • Comparative Evaluation: Baseline vs improved version performance comparison
  • Engineering Practices: Git version control, clear documentation, runnable project structure

📁 Project Structure

.
├── README.md                 # Project documentation
├── requirements.txt          # Dependencies list
├── .gitignore               # Git ignore file
├── data/
│   └── online_retail_II.csv # Raw data
├── src/
│   ├── data_preprocessing.py    # Data preprocessing module
│   ├── rfm_analysis.py          # RFM customer value analysis
│   ├── customer_segmentation.py # Customer segmentation model
│   └── evaluation.py            # Evaluation and validation module
├── tests/
│   ├── test_preprocessing.py    # Preprocessing tests
│   ├── test_rfm.py              # RFM analysis tests
│   └── test_evaluation.py       # Evaluation tests
├── notebooks/
│   └── customer_analysis.ipynb  # Complete analysis workflow notebook
└── results/
    └── README.md                # Results documentation

🚀 Quick Start

1. ⚙️ Environment Setup

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. ▶️ Run Analysis

Option 1: Using Jupyter Notebook (Recommended)

jupyter notebook notebooks/customer_analysis.ipynb

Option 2: Run Python Scripts

# Data preprocessing
python src/data_preprocessing.py

# RFM analysis
python src/rfm_analysis.py

# Customer segmentation
python src/customer_segmentation.py

# Evaluation and validation
python src/evaluation.py

3. 🧪 Run Tests

pytest tests/

🧩 Project Components

1. 🧹 Data Preprocessing (src/data_preprocessing.py)

  • Data cleaning: Handle missing values, outliers
  • Data validation: Check data quality and completeness
  • Feature engineering: Extract time features, calculate derived metrics

Input: Raw CSV file
Output: Cleaned DataFrame
Validation Metrics: Data completeness, missing rate, outlier detection
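A minimal sketch of what such a cleaning step might look like. The function name `clean_transactions` and the column names (`Invoice`, `Customer ID`, `Quantity`, `Price`, `InvoiceDate`) are assumptions based on the public Online Retail II layout, not taken from `src/data_preprocessing.py`; `TotalPrice` and `InvoiceMonth` are illustrative derived features.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step; column names are assumed, not from the repo."""
    # Drop rows without a customer, and rows with non-positive quantity or price
    mask = df["Customer ID"].notna() & (df["Quantity"] > 0) & (df["Price"] > 0)
    out = df.loc[mask].copy()

    # Feature engineering: parse timestamps and derive per-line revenue
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["TotalPrice"] = out["Quantity"] * out["Price"]
    out["InvoiceMonth"] = out["InvoiceDate"].dt.month
    return out.reset_index(drop=True)


# Tiny made-up sample: one row lacks a customer, one has a negative quantity
raw = pd.DataFrame(
    {
        "Invoice": ["A1", "A2", "A3", "A4"],
        "Customer ID": [1.0, None, 2.0, 1.0],
        "Quantity": [2, 1, -1, 1],
        "Price": [3.5, 2.0, 4.0, 10.0],
        "InvoiceDate": ["2011-01-05", "2011-01-06", "2011-02-01", "2011-03-10"],
    }
)
cleaned = clean_transactions(raw)
print(cleaned[["Invoice", "TotalPrice", "InvoiceMonth"]])
```

Whether negative quantities (typically returns) should be dropped or netted against purchases is a design choice; this sketch simply drops them.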

2. 📊 RFM Analysis (src/rfm_analysis.py)

  • Recency: Time since the customer's most recent purchase
  • Frequency: Number of purchases
  • Monetary: Total spending amount
  • Customer value scoring

Input: Cleaned transaction data
Output: RFM scores and customer segments
Evaluation Metrics: Segment rationality, customer distribution uniformity
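The three RFM values can be derived with a single pandas group-by. The sketch below is illustrative only: `compute_rfm`, the `snapshot_date` parameter, and the column names are assumptions, and the scoring step the module performs is not shown.

```python
import pandas as pd


def compute_rfm(tx: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    """Hypothetical RFM table: one row per customer (column names assumed)."""
    return tx.groupby("Customer ID").agg(
        # Recency: days between the snapshot date and the last purchase
        recency=("InvoiceDate", lambda d: (snapshot_date - d.max()).days),
        # Frequency: number of distinct invoices
        frequency=("Invoice", "nunique"),
        # Monetary: total amount spent
        monetary=("TotalPrice", "sum"),
    )


# Made-up transactions for two customers
tx = pd.DataFrame(
    {
        "Customer ID": [1, 1, 2],
        "Invoice": ["A1", "A4", "A5"],
        "InvoiceDate": pd.to_datetime(["2011-01-05", "2011-03-10", "2011-02-01"]),
        "TotalPrice": [7.0, 10.0, 4.0],
    }
)
rfm = compute_rfm(tx, snapshot_date=pd.Timestamp("2011-03-11"))
print(rfm)
```

Using a fixed snapshot date rather than the current date keeps the recency values reproducible across runs.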

3. 👥 Customer Segmentation (src/customer_segmentation.py)

  • Baseline method: Rule-based segmentation using RFM
  • Improved method: K-means clustering + RFM features
  • Performance comparison evaluation

Input: RFM feature data
Output: Customer segmentation results and evaluation report
Evaluation Metrics:

  • Silhouette Score
  • Intra-cluster cohesion
  • Inter-cluster separation
  • Business metrics (average value per segment)
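As a rough illustration of the improved method and its metrics, the sketch below clusters standardized RFM features with scikit-learn's `KMeans` and reports the silhouette score, a cohesion proxy (inertia, the within-cluster sum of squares), a separation proxy (smallest distance between cluster centers), and the average monetary value per segment. The function name `segment_kmeans`, the choice of proxies, and the sample data are assumptions; the repository may define these metrics differently.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def segment_kmeans(rfm: pd.DataFrame, n_clusters: int = 2, seed: int = 0):
    """Hypothetical K-means segmentation on standardized RFM features."""
    X = StandardScaler().fit_transform(rfm[["recency", "frequency", "monetary"]])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(X)

    # Pairwise distances between cluster centers (upper triangle only)
    centers = model.cluster_centers_
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    separation = dists[np.triu_indices(n_clusters, k=1)].min()

    report = {
        "silhouette": silhouette_score(X, labels),
        "cohesion_inertia": model.inertia_,  # lower means tighter clusters
        "separation_min_center_dist": separation,  # higher means better separated
    }
    return labels, report


# Made-up RFM table with two clearly distinct customer groups
rfm = pd.DataFrame(
    {
        "recency": [5, 8, 6, 200, 220, 210],
        "frequency": [20, 18, 22, 1, 2, 1],
        "monetary": [900.0, 850.0, 950.0, 30.0, 40.0, 25.0],
    }
)
labels, report = segment_kmeans(rfm)
avg_value = rfm.assign(segment=labels).groupby("segment")["monetary"].mean()
print(report)
print(avg_value)
```

Standardizing first matters because K-means uses Euclidean distance, so an unscaled monetary column would otherwise dominate the clustering.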

4. ✅ Evaluation & Validation (src/evaluation.py)

  • Define test cases and edge cases
  • Calculate evaluation metrics (accuracy, precision, recall, etc.)
  • Generate comparison reports

Test Scenarios:

  • Normal data flow
  • Edge cases (single-purchase customers, abnormal large orders, etc.)
  • Data quality validation
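The edge cases listed above could be expressed as pytest-style tests along these lines. This is a sketch only: `flag_large_orders` and its 1.5 × IQR rule are hypothetical stand-ins, not the checks implemented in `tests/` or `src/evaluation.py`.

```python
import pandas as pd


def flag_large_orders(amounts: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical outlier rule: flag amounts above Q3 + k * IQR."""
    q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
    return amounts > q3 + k * (q3 - q1)


def test_single_purchase_customer():
    # A customer with one invoice should get a frequency of exactly 1
    tx = pd.DataFrame({"Customer ID": [1], "Invoice": ["A1"]})
    freq = tx.groupby("Customer ID")["Invoice"].nunique()
    assert freq.loc[1] == 1


def test_abnormally_large_order_is_flagged():
    # Only the 5000.0 order lies far outside the typical range
    amounts = pd.Series([10.0, 12.0, 11.0, 9.0, 5000.0])
    flags = flag_large_orders(amounts)
    assert flags.tolist() == [False, False, False, False, True]
```

Saved in a file matching `test_*.py`, these functions would be collected automatically by `pytest tests/`.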

📈 Evaluation Metrics

🔍 Data Quality Metrics

  • Data completeness: > 95%
  • Missing value rate: < 5%
  • Outlier detection rate
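One possible way to compute these figures is sketched below. The definitions are assumptions: completeness is taken as the share of rows with every required field present, and the missing value rate as the share of empty cells across the required columns. The function name `data_quality_report` is hypothetical.

```python
import pandas as pd


def data_quality_report(df: pd.DataFrame, required: list) -> dict:
    """Hypothetical quality metrics over a set of required columns."""
    present = df[required].notna()
    completeness = present.all(axis=1).mean()  # rows with no missing required field
    missing_rate = 1.0 - present.to_numpy().mean()  # share of empty cells
    return {
        "completeness": completeness,
        "missing_rate": missing_rate,
        # Thresholds from the list above: completeness > 95%, missing rate < 5%
        "passes": completeness > 0.95 and missing_rate < 0.05,
    }


# Made-up sample: one of four rows has no customer ID
sample = pd.DataFrame(
    {
        "Customer ID": [1.0, None, 2.0, 3.0],
        "Price": [3.5, 2.0, 4.0, 10.0],
    }
)
quality = data_quality_report(sample, required=["Customer ID", "Price"])
print(quality)
```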

⚡ Model Performance Metrics

  • Baseline Method:

    • Number of segments: 8 (based on RFM rules)
    • Computation speed: Fast
    • Interpretability: High
  • Improved Method:

    • Silhouette Score: > 0.3
    • Intra-cluster cohesion: 20%+ improvement
    • Inter-cluster separation: 15%+ improvement
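One way a rule-based baseline can arrive at 8 segments is to split each of R, F, and M into a high and a low group, giving 2 × 2 × 2 combinations. The sketch below uses a median split and made-up segment codes such as `R+F+M+`; both are assumptions, and the actual rules in `src/customer_segmentation.py` may differ.

```python
import pandas as pd


def rule_based_segments(rfm: pd.DataFrame) -> pd.Series:
    """Hypothetical baseline: median split on each RFM dimension (8 combinations)."""
    # Low recency is good (recent purchase); high frequency and monetary are good
    r_good = rfm["recency"] <= rfm["recency"].median()
    f_good = rfm["frequency"] > rfm["frequency"].median()
    m_good = rfm["monetary"] > rfm["monetary"].median()
    return (
        r_good.map({True: "R+", False: "R-"})
        + f_good.map({True: "F+", False: "F-"})
        + m_good.map({True: "M+", False: "M-"})
    )


# Made-up RFM values for four customers
rfm_small = pd.DataFrame(
    {
        "recency": [5, 200, 10, 150],
        "frequency": [20, 1, 2, 15],
        "monetary": [900.0, 30.0, 40.0, 700.0],
    }
)
segments = rule_based_segments(rfm_small)
print(segments.tolist())
```

A rule like this is fast and easy to explain, which matches the speed and interpretability noted for the baseline, but the cut points are fixed in advance rather than learned from the data.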

💰 Business Metrics

  • High-value customer identification accuracy
  • Customer churn prediction accuracy
  • Average spending per segment

🛠️ Tech Stack

  • Python 3.8+
  • pandas: Data processing
  • numpy: Numerical computation
  • scikit-learn: Machine learning models
  • matplotlib/seaborn: Data visualization
  • pytest: Unit testing
  • jupyter: Interactive analysis

✨ Project Highlights

  1. Complete Engineering Practices: Modular design, test coverage, comprehensive documentation
  2. Verifiable Components: Each module has clear inputs/outputs and validation metrics
  3. Comparative Evaluation: Quantitative comparison of Baseline vs improved versions
  4. Edge Case Handling: Processing of abnormal data and extreme cases
  5. Reproducibility: Clear code structure and dependency management

📄 License

MIT License
