This project involves data exploration, preprocessing, and regression modeling on a dataset (DataTrain_HW2Problem1.csv) containing features x1, x2, x3 and target y. The goal is to build and compare Linear Regression, Lasso (L1), and Ridge (L2) models, analyze their coefficients, and visualize the results.
- Dataset: 30 samples with 4 columns (y, x1, x2, x3)
- Features: x1, x2 (continuous), x3 (binary)
- Target: y (continuous)
- Statistics:
- y: mean=34.25, std=47.32, range=[-14.4, 137.0]
- x1: mean=0.63, std=5.83
- x2: mean=0.33, std=1.72
- x3: 37% ones, 63% zeros
- Data Loading and Inspection: Loaded CSV, inspected structure and statistics.
- Data Splitting: 80% train, 20% test (random_state=42).
- Feature Scaling: Standardized features using StandardScaler.
- Model Training:
- Linear Regression
- Lasso (initial alpha=0.1)
- Ridge (initial alpha=1.0)
- Hyperparameter Tuning: Used cross-validation to find optimal alphas.
- Evaluation: R² scores on test set.
- Coefficient Analysis: Compared coefficients across models.
- Visualization: Added plots for data distribution, coefficients, and predictions.
- Model Performance (R² on Test Set):
- Linear Regression: 0.6687
- Lasso: 0.6665
- Ridge: 0.6422
- Optimal Alphas:
- Lasso: 3.5112
- Ridge: 4.6416
- Coefficients:
- Linear: [-37.57, -4.84, 6.17]
- Lasso: [-34.42, -2.46, 2.31] (some shrinkage)
- Ridge: [-30.99, -5.98, 4.50] (moderate shrinkage)
- Feature distributions (histograms)
- Boxplots for features
- Scatter plots (x1 vs y, x2 vs y)
- Coefficient comparison bar chart
- Predictions vs actual scatter plots for each model
- Lasso shows more coefficient shrinkage than Ridge, indicating feature selection.
- x1 has the largest coefficient magnitude, suggesting strongest impact.
- Models perform similarly, with Linear Regression slightly better.
- Main Script:
PSUData_Exploration.py - Dependencies: pandas, numpy, scikit-learn, matplotlib, seaborn
- Data File:
DataTrain_HW2Problem1.csv
- Installed scikit-learn to resolve import issues.
- Corrected data file extension from .cvs to .csv.
- Added coefficient analysis and visualizations.
All code and changes are committed to GitHub: https://github.com/AlexanderUbaldoGutierrez21/PSUTrainingDataModel