minwoo-data/walmart-sales-forecasting

Walmart Sales Forecasting (Purchase Prediction)

🎯 Project Overview

This project focuses on predicting customer purchase amounts using the Walmart Black Friday retail dataset.

Objective: Build and compare multiple regression-based machine learning models to estimate purchase behavior and identify the best-performing approach.

  • Problem Type: Supervised Learning (Regression)
  • Target Variable: Purchase (purchase amount in dollars)
  • Evaluation Metrics: RMSE, MAE, R-squared

Highlights

  • Trained and compared 5 regression models on 550K+ transactions
  • Detected 0.49% outliers via IQR and stabilized variance using log(Purchase)
  • Best model: Linear Regression (log target) with R² ≈ 0.736
  • Used 5-fold cross-validation for linear models and a 70/30 train-test split for heavy models (runtime constraints)
  • Built a reproducible pipeline: preprocessing → EDA → modeling → evaluation

📊 Dataset

Source: Kaggle — Walmart e-Commerce Sales Dataset (Black Friday)

Description: Retail transaction dataset containing customer demographics, product categories, and purchase amounts.

Dataset Overview

  • Total Records: 550,068 transactions
  • Unique Users: 5,891 customers
  • Unique Products: 3,631 products

Key Features

| Feature | Description |
|---|---|
| User_ID | Unique customer identifier |
| Product_ID | Unique product identifier |
| Gender | Customer gender (M/F) |
| Age | Customer age group |
| Occupation | Customer occupation (encoded) |
| City_Category | City type (A/B/C) |
| Stay_In_Current_City_Years | Years in current city |
| Marital_Status | Marital status (0/1) |
| Product_Category_1/2/3 | Product categories |
| Purchase | Target variable (purchase amount) |

Note: The raw dataset is not included in this repository due to size constraints.
Please download it from Kaggle and place it under data/raw/.


🔬 Methodology

1. Data Cleaning & Preprocessing

  • Checked for missing values and duplicates (none found)
  • Standardized column names to lowercase
  • Converted categorical variables to factors
  • Removed unnecessary identifiers (User_ID, Product_ID)
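
The cleaning steps above could look roughly like the following base-R sketch. The toy data frame, the function name `clean_transactions`, and the choice to treat every non-target column as categorical are assumptions made for illustration; the repository's own `01_load_and_clean.R` is not shown here and may differ.

```r
# Toy stand-in for the raw Kaggle file (column names follow the feature table).
raw <- data.frame(
  User_ID = c(1000001, 1000001, 1000002, 1000003),
  Product_ID = c("P00069042", "P00248942", "P00087842", "P00085442"),
  Gender = c("F", "F", "M", "M"),
  Age = c("0-17", "0-17", "55+", "26-35"),
  City_Category = c("A", "A", "C", "B"),
  Marital_Status = c(0, 0, 0, 1),
  Purchase = c(8370, 15200, 1422, 7969),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not taken from the repository.
clean_transactions <- function(df) {
  # Report missing values and exact duplicate rows before changing anything.
  message("Missing values: ", sum(is.na(df)))
  message("Duplicate rows: ", sum(duplicated(df)))

  # Standardize column names to lowercase.
  names(df) <- tolower(names(df))

  # Drop identifier columns that are not used as predictors.
  df <- df[, !(names(df) %in% c("user_id", "product_id")), drop = FALSE]

  # Assumption: every non-target column is categorical.
  predictors <- setdiff(names(df), "purchase")
  df[predictors] <- lapply(df[predictors], factor)
  df
}

clean <- clean_transactions(raw)
str(clean)
```

Keeping the target numeric while converting predictors to factors lets `lm()` and similar functions create dummy variables automatically.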

2. Exploratory Data Analysis (EDA)

  • Analyzed purchase patterns by age, gender, occupation, and product category
  • Identified right-skewed distribution in purchase amounts
  • Visualized relationships between features and target variable

3. Outlier Detection

  • Applied IQR (Interquartile Range) method
  • Found 0.49% outliers (2,677 out of 550,068 transactions)
  • Outliers concentrated in premium product categories (9, 10, 15)
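
As a minimal sketch of the IQR rule, the snippet below flags values outside the usual 1.5 × IQR fences. The multiplier `k = 1.5` is the conventional default and an assumption here, since the README does not state which value was used; the data are synthetic, so the printed share will not match the 0.49% reported above.

```r
set.seed(42)
# Synthetic right-skewed purchase amounts; not the real data.
purchase <- round(rlnorm(10000, meanlog = 9, sdlog = 0.6))

# Hypothetical helper: flag points outside [Q1 - k*IQR, Q3 + k*IQR].
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  list(lower = lower, upper = upper, is_outlier = x < lower | x > upper)
}

res <- iqr_outliers(purchase)
cat(sprintf("Fences: [%.0f, %.0f]\n", res$lower, res$upper))
cat(sprintf("Outliers: %d of %d (%.2f%%)\n",
            sum(res$is_outlier), length(purchase),
            100 * mean(res$is_outlier)))
```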

4. Feature Engineering

  • Created numeric encodings for categorical variables
  • Applied log transformation on Purchase to:
    • Reduce skewness and stabilize variance
    • Improve model performance (especially for linear models)
    • Mitigate impact of outliers
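
To illustrate why the log transformation helps with skew, the sketch below compares sample skewness before and after taking logs. The data are synthetic log-normal draws and the `skewness` helper is written here for illustration, so the numbers say nothing about the actual dataset.

```r
set.seed(1)
# Synthetic right-skewed amounts; not the real data.
purchase <- rlnorm(5000, meanlog = 9, sdlog = 0.6)

# Sample skewness: the mean of the cubed standardized values.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

log_purchase <- log(purchase)

cat(sprintf("Skewness (raw): %.2f\n", skewness(purchase)))
cat(sprintf("Skewness (log): %.2f\n", skewness(log_purchase)))

# Values on the log scale map back to the original scale with exp().
stopifnot(isTRUE(all.equal(exp(log_purchase), purchase)))
```

Plain `log()` is only defined for strictly positive values; if a dataset contained zero amounts, something like `log1p()` would be needed instead.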

5. Model Training & Validation

  • Validation Strategy:
    • 5-fold Cross-Validation for linear models
    • 70:30 Train-Test Split for complex models (due to computational constraints)
  • Hyperparameter Tuning: Grid search for Ridge, Lasso, and tree-based models
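
The two validation strategies can be sketched in base R as follows. This is an illustration on synthetic data with made-up coefficients, not the repository's code; since `caret` is in the package list, the project may well have used `trainControl(method = "cv", number = 5)` instead of a hand-written loop.

```r
set.seed(123)
n <- 1000
# Synthetic data: two categorical predictors and a log-scale target.
df <- data.frame(
  age = factor(sample(c("18-25", "26-35", "36-45"), n, replace = TRUE)),
  city_category = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
df$log_purchase <- 9 + 0.2 * (df$age == "26-35") -
  0.1 * (df$city_category == "C") + rnorm(n, sd = 0.4)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 5-fold cross-validation: each row is held out exactly once.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))
cv_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(log_purchase ~ age + city_category, data = df[fold_id != i, ])
  rmse(df$log_purchase[fold_id == i], predict(fit, df[fold_id == i, ]))
})
cat(sprintf("5-fold CV RMSE: %.4f (sd %.4f)\n", mean(cv_rmse), sd(cv_rmse)))

# 70/30 hold-out split: one fit, one evaluation.
train_idx <- sample(n, size = floor(0.7 * n))
fit <- lm(log_purchase ~ age + city_category, data = df[train_idx, ])
holdout_rmse <- rmse(df$log_purchase[-train_idx],
                     predict(fit, df[-train_idx, ]))
cat(sprintf("70/30 hold-out RMSE: %.4f\n", holdout_rmse))
```

Cross-validation fits the model k times, which is cheap for linear models but can be costly for Random Forest or GBM on 550K rows; a single hold-out split trades some stability of the estimate for runtime.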

🤖 Models Evaluated

The following models were considered. Five were trained and compared; XGBoost and the neural network were not completed.

| Model | Type | Key Characteristics |
|---|---|---|
| Linear Regression | Baseline | Simple, interpretable |
| Ridge Regression | Regularized Linear | L2 penalty, handles multicollinearity |
| Lasso Regression | Regularized Linear | L1 penalty, feature selection |
| Random Forest | Ensemble (Bagging) | Non-linear, robust to outliers |
| Gradient Boosting (GBM) | Ensemble (Boosting) | Sequential error correction |
| XGBoost | Advanced Boosting | Not evaluated (similar to GBM, time constraints) |
| Neural Network | Deep Learning | Not completed (training time >5 hours) |

📈 Results

Model Performance Comparison

| Model | RMSE | R² | MAE | Notes |
|---|---|---|---|---|
| Linear Regression (Log) | 0.3799 | 0.7359 | 0.2859 | Best overall |
| Lasso Regression (Log) | 0.3800 | 0.7358 | 0.2860 | Nearly identical to Linear |
| Ridge Regression (Log) | 0.3820 | 0.7344 | 0.2873 | Slightly lower performance |
| Linear Regression (Raw) | 3014.09 | 0.6399 | 2282.59 | Baseline (no transformation) |
| Ridge Regression (Raw) | 3027.54 | 0.6390 | 2287.48 | Baseline regularized |
| Lasso Regression (Raw) | 3014.17 | 0.6399 | 2282.70 | Baseline with feature selection |
| GBM (Raw) | 3241.94 | 0.5828 | 2454.09 | Non-linear approach |
| GBM (Log → Raw) | 3533.96 | 0.5060 | 2607.78 | Degraded after back-transform |
| Random Forest (Raw) | 3864.52 | 0.4000 | 2945.00 | High variance |
| Random Forest (Log → Raw) | 10531.06 | - | - | Amplified errors |

Note: Log-transformed model metrics (RMSE, MAE) are on the log scale.
For fair comparison with raw models, predictions were back-transformed using exp() where applicable.
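
The distinction between log-scale and raw-scale metrics can be made concrete with a short sketch. The metric helpers below use the standard definitions; the actual values and log-scale predictions are hypothetical numbers chosen for illustration, not model output from this project.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Hypothetical actual purchases (dollars) and log-scale predictions.
actual_raw <- c(8370, 15200, 1422, 7969, 5254)
pred_log   <- c(9.10, 9.50, 7.60, 8.90, 8.70)

# Log-scale metrics: compare log(actual) with the log-scale predictions.
cat(sprintf("log scale : RMSE = %.4f, MAE = %.4f\n",
            rmse(log(actual_raw), pred_log), mae(log(actual_raw), pred_log)))

# Raw-scale metrics: back-transform predictions with exp() first.
pred_raw <- exp(pred_log)
cat(sprintf("raw scale : RMSE = %.2f, MAE = %.2f, R2 = %.4f\n",
            rmse(actual_raw, pred_raw), mae(actual_raw, pred_raw),
            r_squared(actual_raw, pred_raw)))
```

An RMSE of about 0.38 on the log scale and an RMSE of about 3,000 on the dollar scale measure different quantities, so only metrics computed on the same scale should be ranked against each other.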


💡 Key Findings

Best Model: Log-Transformed Linear Regression

  • Why it won:
    • Lowest log-scale RMSE (0.3799) and MAE (0.2859) among the log-target models
    • Highest R² (0.7359) — explains ~74% of variance
    • Most interpretable and computationally efficient
    • Stable performance with 5-fold cross-validation

📊 Important Insights

  1. Log Transformation is Critical

    • Dramatically improved linear model performance (R² from 0.64 → 0.74)
    • Stabilized variance and reduced heteroscedasticity
    • Mitigated impact of outliers
  2. Linear Relationships Dominate

    • Simple linear models outperformed complex non-linear models
    • Suggests underlying linear relationship between features and purchase amount
    • Regularization (Ridge/Lasso) provided minimal improvement
  3. Complex Models Underperformed

    • Random Forest and GBM showed lower R² despite higher complexity
    • Long training times without performance gains
    • Log transformation hurt tree-based models after back-transformation
  4. Feature Importance

    • All predictors showed statistical significance (low p-values)
    • No multicollinearity issues (VIF < 5 for all features)
    • Age, occupation, and product category were key drivers
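
For readers unfamiliar with the VIF check, here is a minimal base-R sketch of the definition, VIF_j = 1 / (1 - R_j²), on synthetic numeric predictors. The helper `vif_manual` and the data are illustrative assumptions. The package list suggests the project used `car` for this, and `car::vif()` reports generalized VIFs when factor predictors are present, so its output is not reproduced here.

```r
set.seed(7)
n <- 500
# Synthetic numeric encodings; x3 is deliberately correlated with x1.
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
X$x3 <- 0.5 * X$x1 + rnorm(n)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all remaining predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    fit <- lm(reformulate(setdiff(names(predictors), v), response = v),
              data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(X)
print(round(vifs, 2))
cat("All VIF < 5:", all(vifs < 5), "\n")
```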

Business Implications

  • Customer Segmentation: Identify high-value customers for targeted marketing
  • Demand Forecasting: Predict purchase patterns by demographics
  • Inventory Management: Optimize stock for high-demand product categories
  • Pricing Strategy: Dynamic pricing based on predicted purchase behavior


📁 Repository Structure

walmart-sales-forecasting/
├── README.md
├── .gitignore
├── docs/
│   └── Walmart_Sales_Forecasting_Report.pdf
├── src/
│   ├── 00_setup.R
│   ├── 01_load_and_clean.R
│   ├── 02_eda_and_outliers.R
│   ├── 03_feature_engineering.R
│   ├── 04_linear_and_regularized_models.R
│   ├── 05_tree_models.R
│   └── 06_optional_xgb_and_nn.R
└── results/
    ├── metrics_summary.csv
    └── figures/
        └── purchase_distribution.png


🚀 How to Run

Prerequisites

  • R (>= 4.0.0)
  • RStudio (recommended)

Installation

1. Clone the repository

git clone https://github.com/minwoo-data/walmart-sales-forecasting.git
cd walmart-sales-forecasting

2. Install required R packages

# Run in R console
install.packages(c(
  "tidyverse",      # Data manipulation & visualization
  "caret",          # Machine learning framework
  "glmnet",         # Ridge/Lasso regression
  "randomForest",   # Random Forest
  "gbm",            # Gradient Boosting
  "skimr",          # Data summary
  "MASS",           # Statistical tools
  "car"             # VIF calculation
))

3. Download the dataset from Kaggle and place it under data/raw/

4. Run the analysis

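# Option 1: Open the scripts in src/ in RStudio and run them in numeric order
# (00_setup.R → 01_load_and_clean.R → ... → 05_tree_models.R)

# Option 2: Run the scripts programmatically from the repository root
source("src/00_setup.R")
source("src/01_load_and_clean.R")
source("src/02_eda_and_outliers.R")
source("src/03_feature_engineering.R")
source("src/04_linear_and_regularized_models.R")
source("src/05_tree_models.R")
# source("src/06_optional_xgb_and_nn.R")  # optional; long runtime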

📚 Documentation

For detailed analysis and methodology, see docs/Walmart_Sales_Forecasting_Report.pdf.


⚠️ Limitations & Future Work

Current Limitations

  • No temporal data (timestamps unavailable)
  • Computational constraints prevented full Neural Network training
  • Some models used manual splits instead of cross-validation

Future Improvements

  • Incorporate time-series features if temporal data becomes available
  • Deploy model as REST API for real-time predictions
  • Experiment with ensemble methods combining top models
  • A/B testing for business impact validation

👤 Author

Minwoo Park
📧 University of Georgia | MIST 5635
💼 LinkedIn


🙏 Acknowledgments


About

Customer purchase amount prediction using regression and ML models in R (caret), including feature engineering and model evaluation.
