- Project Overview
- Dataset
- Methodology
- Models Evaluated
- Results
- Key Findings
- Repository Structure
- How to Run
- Author
This project focuses on predicting customer purchase amounts using the Walmart Black Friday retail dataset.
Objective: Build and compare multiple regression-based machine learning models to estimate purchase behavior and identify the best-performing approach.
- Problem Type: Supervised Learning (Regression)
- Target Variable: `Purchase` (purchase amount in dollars)
- Evaluation Metrics: RMSE, MAE, R-squared
- Trained and compared 5 regression models on 550K+ transactions
- Detected 0.49% outliers via IQR and stabilized variance using log(Purchase)
- Best model: Linear Regression (log target) with R² ≈ 0.736
- Used 5-fold cross-validation for linear models and a 70/30 train-test split for heavy models (runtime constraints)
- Built a reproducible pipeline: preprocessing → EDA → modeling → evaluation
Source: Kaggle — Walmart e-Commerce Sales Dataset (Black Friday)
Description: Retail transaction dataset containing customer demographics, product categories, and purchase amounts.
- Total Records: 550,068 transactions
- Unique Users: 5,891 customers
- Unique Products: 3,631 products
| Feature | Description |
|---|---|
| `User_ID` | Unique customer identifier |
| `Product_ID` | Unique product identifier |
| `Gender` | Customer gender (M/F) |
| `Age` | Customer age group |
| `Occupation` | Customer occupation (encoded) |
| `City_Category` | City type (A/B/C) |
| `Stay_In_Current_City_Years` | Years in current city |
| `Marital_Status` | Marital status (0/1) |
| `Product_Category_1/2/3` | Product categories |
| `Purchase` | Target variable (purchase amount) |
Note: The raw dataset is not included in this repository due to size constraints.
Please download it from Kaggle and place it under `data/raw/`.
- Checked for missing values and duplicates (none found)
- Standardized column names to lowercase
- Converted categorical variables to factors
- Removed unnecessary identifiers (`User_ID`, `Product_ID`)
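
The cleaning steps above can be sketched in base R as follows. This is a minimal illustration, not the project's actual script: the data frame `raw` and its values are made up, and only a few of the dataset's columns are included.

```r
# Toy stand-in for the raw data (values are invented for illustration)
raw <- data.frame(
  User_ID = c(1000001, 1000001, 1000002),
  Product_ID = c("P00069042", "P00248942", "P00087842"),
  Gender = c("F", "F", "M"),
  Age = c("0-17", "0-17", "26-35"),
  City_Category = c("A", "A", "B"),
  Purchase = c(8370, 15200, 1422),
  stringsAsFactors = FALSE
)

# Check for missing values and duplicate rows
n_missing <- sum(is.na(raw))
n_duplicates <- sum(duplicated(raw))

# Standardize column names to lowercase
names(raw) <- tolower(names(raw))

# Convert categorical columns to factors
cat_cols <- c("gender", "age", "city_category")
raw[cat_cols] <- lapply(raw[cat_cols], factor)

# Drop identifier columns so they are not used as predictors
clean <- raw[, !(names(raw) %in% c("user_id", "product_id"))]
```

After these steps `clean` holds only the lowercase predictor columns and the target.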
- Analyzed purchase patterns by age, gender, occupation, and product category
- Identified right-skewed distribution in purchase amounts
- Visualized relationships between features and target variable
- Applied IQR (Interquartile Range) method
- Found 0.49% outliers (2,677 out of 550,068 transactions)
- Outliers concentrated in premium product categories (9, 10, 15)
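
As a rough sketch of the IQR rule, the helper below flags values outside the usual fences. The function name `flag_iqr_outliers`, the toy vector, and the multiplier `k = 1.5` are assumptions for illustration; the README does not state which multiplier was used.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < (q[1] - k * iqr) | x > (q[2] + k * iqr)
}

# Toy purchase amounts (invented); only the last value is extreme
purchase <- c(10, 12, 11, 13, 12, 14, 11, 100)
is_outlier <- flag_iqr_outliers(purchase)
outlier_share <- mean(is_outlier)

# The share reported above follows from the counts:
reported_pct <- round(100 * 2677 / 550068, 2)
```

On the toy vector only the value 100 is flagged, and `reported_pct` reproduces the 0.49% figure from the counts given above.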
- Created numeric encodings for categorical variables
- Applied log transformation on `Purchase` to:
  - Reduce skewness and stabilize variance
  - Improve model performance (especially for linear models)
  - Mitigate impact of outliers
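
The effect of the log transformation on skewness can be illustrated with synthetic data. The sketch below draws lognormal values as a stand-in for `Purchase` (an assumption; the real distribution differs) and uses a hand-written `skewness` helper rather than a package function.

```r
set.seed(42)

# Synthetic right-skewed values standing in for Purchase (not the real data)
purchase <- rlnorm(1000, meanlog = 9, sdlog = 0.7)

# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

log_purchase <- log(purchase)

skew_raw <- skewness(purchase)      # positive for right-skewed data
skew_log <- skewness(log_purchase)  # much closer to zero

# exp() reverses the transformation
recovered <- exp(log_purchase)
```

Models fit on `log_purchase` predict on the log scale, so predictions need `exp()` before they can be read as dollar amounts.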
- Validation Strategy:
  - 5-fold Cross-Validation for linear models
  - 70:30 Train-Test Split for complex models (due to computational constraints)
- Hyperparameter Tuning: Grid search for Ridge, Lasso, and tree-based models
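
The two validation strategies can be sketched in base R as shown below. The data, the column names `x1` and `x2`, and the `rmse` helper are hypothetical. The project lists `caret` among its packages, which offers the same idea through `trainControl(method = "cv", number = 5)`; the manual version here only makes the mechanics visible and may differ from the project's code.

```r
set.seed(123)

# Synthetic data standing in for the engineered features (assumption)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$log_purchase <- 9 + 0.5 * df$x1 - 0.3 * df$x2 + rnorm(n, sd = 0.4)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 5-fold cross-validation: every row is held out exactly once
k <- 5
fold_id <- sample(rep(1:k, length.out = n))
cv_rmse <- sapply(1:k, function(i) {
  train <- df[fold_id != i, ]
  test <- df[fold_id == i, ]
  fit <- lm(log_purchase ~ x1 + x2, data = train)
  rmse(test$log_purchase, predict(fit, newdata = test))
})
mean_cv_rmse <- mean(cv_rmse)

# 70/30 train-test split: a single hold-out evaluation
train_idx <- sample(n, size = round(0.7 * n))
fit_split <- lm(log_purchase ~ x1 + x2, data = df[train_idx, ])
split_rmse <- rmse(
  df$log_purchase[-train_idx],
  predict(fit_split, newdata = df[-train_idx, ])
)
```

Cross-validation fits the model `k` times, which is why a single split is the cheaper choice for models that are slow to train.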
The following models were trained and compared:
| Model | Type | Key Characteristics |
|---|---|---|
| Linear Regression | Baseline | Simple, interpretable |
| Ridge Regression | Regularized Linear | L2 penalty, handles multicollinearity |
| Lasso Regression | Regularized Linear | L1 penalty, feature selection |
| Random Forest | Ensemble (Bagging) | Non-linear, robust to outliers |
| Gradient Boosting (GBM) | Ensemble (Boosting) | Sequential error correction |
| XGBoost | Advanced Boosting | Not evaluated (similar to GBM, time constraints) |
| Neural Network | Deep Learning | Not completed (training time >5 hours) |
| Model | RMSE | R² | MAE | Notes |
|---|---|---|---|---|
| Linear Regression (Log) ✅ | 0.3799 | 0.7359 | 0.2859 | Best overall |
| Lasso Regression (Log) | 0.3800 | 0.7358 | 0.2860 | Nearly identical to Linear |
| Ridge Regression (Log) | 0.3820 | 0.7344 | 0.2873 | Slightly lower performance |
| Linear Regression (Raw) | 3014.09 | 0.6399 | 2282.59 | Baseline (no transformation) |
| Ridge Regression (Raw) | 3027.54 | 0.6390 | 2287.48 | Baseline regularized |
| Lasso Regression (Raw) | 3014.17 | 0.6399 | 2282.70 | Baseline with feature selection |
| GBM (Raw) | 3241.94 | 0.5828 | 2454.09 | Non-linear approach |
| GBM (Log → Raw) | 3533.96 | 0.5060 | 2607.78 | Degraded after back-transform |
| Random Forest (Raw) | 3864.52 | 0.4000 | 2945.00 | High variance |
| Random Forest (Log → Raw) | 10531.06 | - | - | Amplified errors |
Note: For rows labeled (Log), RMSE and MAE are on the log scale and are not directly comparable with the raw-scale rows.
For rows labeled (Log → Raw), predictions were back-transformed using `exp()` before the metrics were computed.
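
To make the scale difference concrete, here is a small sketch of the three metrics computed on both scales. The helper names and the toy numbers are invented for illustration and are not results from this project.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae <- function(actual, predicted) mean(abs(actual - predicted))
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Toy values (invented): actual purchases and log-scale predictions
actual <- c(5000, 8000, 12000, 15000)
pred_log <- log(c(5500, 7500, 11000, 16000))

# Metric on the log scale
rmse_log <- rmse(log(actual), pred_log)

# Metrics on the dollar scale: back-transform the predictions first
pred_raw <- exp(pred_log)
rmse_raw <- rmse(actual, pred_raw)
mae_raw <- mae(actual, pred_raw)
r2_raw <- r_squared(actual, pred_raw)
```

`rmse_log` is a small unitless number while `rmse_raw` is in dollars, so the two should only be compared after converting to a common scale.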
- Why Linear Regression (Log) won:
  - Lowest RMSE (0.3799) and MAE (0.2859)
  - Highest R² (0.7359), explaining ~74% of variance
  - Most interpretable and computationally efficient
  - Stable performance with 5-fold cross-validation
- Log Transformation is Critical
  - Dramatically improved linear model performance (R² from 0.64 → 0.74)
  - Stabilized variance and reduced heteroscedasticity
  - Mitigated impact of outliers
- Linear Relationships Dominate
  - Simple linear models outperformed complex non-linear models
  - Suggests underlying linear relationship between features and purchase amount
  - Regularization (Ridge/Lasso) provided minimal improvement
- Complex Models Underperformed
  - Random Forest and GBM showed lower R² despite higher complexity
  - Long training times without performance gains
  - Log transformation hurt tree-based models after back-transformation
- Feature Importance
  - All predictors showed statistical significance (low p-values)
  - No multicollinearity issues (VIF < 5 for all features)
  - Age, occupation, and product category were key drivers
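
For readers unfamiliar with the VIF check, the sketch below computes it by hand: each predictor is regressed on the others and VIF = 1 / (1 - R²). The data frame `X` and its columns are synthetic assumptions. The project lists the `car` package for this purpose, so its own calculation presumably used `car::vif()` on the fitted model rather than this manual version.

```r
set.seed(1)

# Synthetic predictors (assumption); c is mildly correlated with a
n <- 500
X <- data.frame(a = rnorm(n), b = rnorm(n))
X$c <- 0.5 * X$a + rnorm(n)

# VIF for each predictor: regress it on all the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  fit <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})

# Common rule of thumb used above: values below 5 suggest no serious problem
all_below_5 <- all(vif_manual < 5)
```

A VIF of 1 means a predictor is uncorrelated with the rest; larger values indicate that its coefficient estimate is inflated by overlap with other predictors.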
Customer Segmentation: Identify high-value customers for targeted marketing
Demand Forecasting: Predict purchase patterns by demographics
Inventory Management: Optimize stock for high-demand product categories
Pricing Strategy: Dynamic pricing based on predicted purchase behavior
walmart-sales-forecasting/
├── README.md
├── .gitignore
├── docs/
│ └── Walmart_Sales_Forecasting_Report.pdf
├── src/
│ ├── 00_setup.R
│ ├── 01_load_and_clean.R
│ ├── 02_eda_and_outliers.R
│ ├── 03_feature_engineering.R
│ ├── 04_linear_and_regularized_models.R
│ ├── 05_tree_models.R
│ └── 06_optional_xgb_and_nn.R
└── results/
  ├── metrics_summary.csv
  └── figures/
    └── purchase_distribution.png
- R (>= 4.0.0)
- RStudio (recommended)
1. Clone the repository
git clone https://github.com/yourusername/walmart-sales-forecasting.git
cd walmart-sales-forecasting
2. Install required R packages
# Run in R console
install.packages(c(
"tidyverse", # Data manipulation & visualization
"caret", # Machine learning framework
"glmnet", # Ridge/Lasso regression
"randomForest", # Random Forest
"gbm", # Gradient Boosting
"skimr", # Data summary
"MASS", # Statistical tools
"car" # VIF calculation
))
3. Download the dataset
- Go to Kaggle - Walmart Black Friday Dataset
- Download the dataset
- Place it in the `data/raw/` directory
4. Run the analysis
# Option 1: Run notebooks sequentially in RStudio
# Open notebooks/01_EDA.Rmd → 02_Feature_Engineering.Rmd → ...
# Option 2: Run scripts programmatically
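# (file names as listed under Repository Structure)
source("src/00_setup.R")
source("src/01_load_and_clean.R")
source("src/02_eda_and_outliers.R")
source("src/03_feature_engineering.R")
source("src/04_linear_and_regularized_models.R")
source("src/05_tree_models.R")
For detailed analysis and methodology, see: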
- Full Report: `docs/Walmart_Sales_Forecasting_Report.pdf`
- Code Documentation: Comments in `src/` and `notebooks/`
- No temporal data (timestamps unavailable)
- Computational constraints prevented full Neural Network training
- Some models used manual splits instead of cross-validation
- Incorporate time-series features if temporal data becomes available
- Deploy model as REST API for real-time predictions
- Experiment with ensemble methods combining top models
- A/B testing for business impact validation
Minwoo Park
📧 University of Georgia | MIST 5635
💼 LinkedIn
- Dataset: Kaggle - Walmart Black Friday Sales
- Tools: R, RStudio, tidyverse ecosystem