This project demonstrates an end-to-end pipeline for forecasting user expenses using time series modeling. It integrates PySpark for scalable preprocessing and feature engineering, and leverages XGBoost for modeling transaction-level financial data.
The objective is to predict future daily user expense patterns from transaction data, card and user metadata, and derived temporal and lag-based features.
- Language: Python
- Distributed Processing: Apache Spark (PySpark)
- Data Source: Kaggle Fraud Dataset via `kagglehub`
- Modeling: XGBoost Regressor
- Visualization: matplotlib
- Libraries: pandas, scikit-learn, numpy, seaborn
- Data Acquisition
  - Download Kaggle dataset using `kagglehub`
  - Load and merge transactions, user, and card data
- Data Preprocessing (PySpark)
  - Handle missing values
  - Remove fraudulent transactions
  - Clean currency fields (`$`)
  - Derive date-based and cyclic features (`sin`/`cos`)
  - Compute lag-based predictors (`lag_1`, `lag_7`, etc.)
- Feature Engineering
  - Group by date to get daily total spending
  - Add `year`, `day`, and `weekday` indicators
  - Normalize features using `StandardScaler`
- Modeling
  - Use `TimeSeriesSplit` with a gap for realistic validation
  - Train XGBoost regressor across hyperparameter grid
  - Select best model using average RMSE
- Evaluation
  - Plot predicted vs actual values for both train and test sets
  - Report RMSE for model performance
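
The preprocessing and feature engineering steps listed above (fraud removal, currency cleaning, daily aggregation, cyclic encoding, and lag features) can be illustrated with a minimal sketch. It uses pandas on a tiny in-memory table rather than PySpark so that it runs without a Spark session. The column names (`date`, `amount`, `is_fraud`, `total`) and the sample values are assumptions made for illustration, not the dataset's actual schema.

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions; column names and values are illustrative only.
tx = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"]
    ),
    "amount": ["$12.50", "$7.50", "$30.00", "$100.00", "$5.25"],
    "is_fraud": [False, False, False, True, False],
})

# Remove fraudulent transactions, then strip the "$" and cast to float.
clean = tx.loc[~tx["is_fraud"]].copy()
clean["amount"] = clean["amount"].str.replace("$", "", regex=False).astype(float)

# Group by date to get one row per day with the total spending.
daily = (
    clean.groupby("date", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total"})
)

# Cyclic encoding of the weekday (0 = Monday) so Sunday and Monday end up adjacent.
weekday = daily["date"].dt.weekday
daily["weekday_sin"] = np.sin(2 * np.pi * weekday / 7)
daily["weekday_cos"] = np.cos(2 * np.pi * weekday / 7)

# Lag feature: the previous row's total. The first row has no history and is NaN.
daily["lag_1"] = daily["total"].shift(1)
```

Note that `shift(1)` lags by one row, so it only equals "yesterday" when every calendar day is present in the aggregated data. A longer lag such as `lag_7` would follow the same pattern with `shift(7)`.
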
- Time Series Plot of daily spending
- Actual vs Predicted Plots for train and test sets
- Lag and Seasonality Features visualized
- Robust time-aware validation using `TimeSeriesSplit`
- Realistic financial behavior modeled using lagged features and seasonality
- Final model trained with optimal `max_depth`, `learning_rate`, and `subsample`
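
As a rough sketch of the time-aware validation described above, the snippet below shows how scikit-learn's `TimeSeriesSplit` with a `gap` keeps a buffer between each training window and its test fold, and how per-fold RMSE values can be averaged. The series, `n_splits=3`, and `gap=2` are assumptions chosen for illustration, and a mean predictor stands in for the XGBoost regressor so the example has no extra dependency.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical daily series of 30 observations (placeholder values).
X = np.arange(30).reshape(-1, 1)
y = np.arange(30, dtype=float)

# n_splits and gap are illustrative; the project's actual settings are not shown here.
tscv = TimeSeriesSplit(n_splits=3, gap=2)
splits = list(tscv.split(X))

fold_rmse = []
for train_idx, test_idx in splits:
    # Training indices always precede the test fold, with `gap` samples skipped between them.
    # Placeholder model: predict the training mean (stands in for the real regressor).
    pred = np.full(len(test_idx), y[train_idx].mean())
    rmse = float(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    fold_rmse.append(rmse)

# Average RMSE across folds, the criterion used to compare candidate models.
avg_rmse = float(np.mean(fold_rmse))
```

In a hyperparameter search, the same loop would be repeated for each candidate combination of `max_depth`, `learning_rate`, and `subsample`, keeping the combination with the lowest average RMSE.
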