PSU Data Exploration and Regression Analysis - Final Report

Project Overview

This project involves data exploration, preprocessing, and regression modeling on a dataset (DataTrain_HW2Problem1.csv) containing features x1, x2, x3 and target y. The goal is to build and compare Linear Regression, Lasso (L1), and Ridge (L2) models, analyze their coefficients, and visualize the results.

Data Description

Dataset: 30 samples with 4 columns (y, x1, x2, x3)
Features: x1, x2 (continuous), x3 (binary)
Target: y (continuous)
Statistics:
- y: mean=34.25, std=47.32, range=[-14.4, 137.0]
- x1: mean=0.63, std=5.83
- x2: mean=0.33, std=1.72
- x3: 37% ones, 63% zeros

Methodology

Data Loading and Inspection: Loaded CSV, inspected structure and statistics.
Data Splitting: 80% train, 20% test (random_state=42).
Feature Scaling: Standardized features using StandardScaler.
Model Training:
- Linear Regression
- Lasso (initial alpha=0.1)
- Ridge (initial alpha=1.0)
Hyperparameter Tuning: Used cross-validation to find optimal alphas.
Evaluation: R² scores on test set.
Coefficient Analysis: Compared coefficients across models.
Visualization: Added plots for data distribution, coefficients, and predictions.

Results

Model Performance (R² on Test Set):
- Linear Regression: 0.6687
- Lasso: 0.6665
- Ridge: 0.6422
Optimal Alphas:
- Lasso: 3.5112
- Ridge: 4.6416
Coefficients:
- Linear: [-37.57, -4.84, 6.17]
- Lasso: [-34.42, -2.46, 2.31] (some shrinkage)
- Ridge: [-30.99, -5.98, 4.50] (moderate shrinkage)

Visualizations

Feature distributions (histograms)
Boxplots for features
Scatter plots (x1 vs y, x2 vs y)
Coefficient comparison bar chart
Predictions vs actual scatter plots for each model

Key Insights

Lasso shows more coefficient shrinkage than Ridge, indicating feature selection.
x1 has the largest coefficient magnitude, suggesting strongest impact.
Models perform similarly, with Linear Regression slightly better.

Code and Dependencies

Main Script: PSUData_Exploration.py
Dependencies: pandas, numpy, scikit-learn, matplotlib, seaborn
Data File: DataTrain_HW2Problem1.csv

Fixes Applied

Installed scikit-learn to resolve import issues.
Corrected data file extension from .cvs to .csv.
Added coefficient analysis and visualizations.

Repository

All code and changes are committed to GitHub: https://github.com/AlexanderUbaldoGutierrez21/PSUTrainingDataModel

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PSU Data Exploration and Regression Analysis - Final Report

Project Overview

Data Description

Methodology

Results

Visualizations

Key Insights

Code and Dependencies

Fixes Applied

Repository

FilesExpand file tree

Final_Report.md

Latest commit

History

Final_Report.md

File metadata and controls

PSU Data Exploration and Regression Analysis - Final Report

Project Overview

Data Description

Methodology

Results

Visualizations

Key Insights

Code and Dependencies

Fixes Applied

Repository