CS5785 Assignment 1 - Part I: Complete machine learning pipeline for predicting house prices using regression techniques with OLS implementation from scratch.
This project implements a comprehensive machine learning solution for the House Prices - Advanced Regression Techniques Kaggle competition. The solution features a custom implementation of Ordinary Least Squares (OLS) regression with ridge regularization for numerical stability.
- 🔍 Comprehensive Feature Analysis: Identifies and analyzes 25 continuous and 55 categorical features
- 🛠️ Advanced Data Preprocessing: Handles missing values, feature scaling, and one-hot encoding
- ⚙️ Custom OLS Implementation: Ordinary Least Squares implemented from scratch with ridge regularization
- 📊 Model Evaluation: Achieves R² = 0.7831 and RMSE = 36,984 on training data
- 🏆 Kaggle Ready: Complete submission file for competition entry
- 📈 Visualizations: Feature distribution plots and one-hot encoding demonstrations
housing-prices-analysis/
├── 📄 housing_analysis.py # Main analysis script
├── 🖼️ feature_distributions.png # Feature distribution visualizations
├── 🖼️ one_hot_encoding_demo.png # One-hot encoding demonstration
├── 📄 submission.csv # Kaggle submission file
├── 📄 train.csv # Training dataset
├── 📄 test.csv # Test dataset
├── 📄 data_description.txt # Feature descriptions
└── 📄 README.md # This file
pip install pandas numpy matplotlib seaborn scikit-learn# Clone the repository
git clone https://github.com/beyzasimsekoglu/housing-prices-analysis.git
cd housing-prices-analysis
# Run the complete analysis
python3 housing_analysis.py| Metric | Value |
|---|---|
| R² Score | 0.7831 |
| RMSE | 36,984 |
| Features Used | 11 (after selection) |
| Training Samples | 1,460 |
| Test Samples | 1,459 |
- Missing Value Handling: Median imputation for continuous features, mode for categorical
- Feature Scaling: StandardScaler for numerical stability
- One-Hot Encoding: 109 binary features from 17 categorical variables
- Feature Selection: Reduced from 80+ features to 11 most important
# Ridge-regularized OLS formula
θ = (X^T X + λI)^(-1) X^T y- Regularization: λ = 0.001 for numerical stability
- Matrix Operations: Custom implementation using NumPy
- No External Libraries: Pure mathematical implementation
- Continuous: LotArea, YearBuilt, TotalBsmtSF, GrLivArea, GarageArea
- Categorical: OverallQual, MSZoning, Neighborhood, ExterQual, KitchenQual
- 🏘️ Location Matters: Neighborhood significantly impacts house prices
- 🏗️ Quality Counts: OverallQual and ExterQual are strong predictors
- 📏 Size Matters: LotArea, TotalBsmtSF, and GrLivArea are crucial
- 🚗 Garage Value: GarageArea and GarageCars affect pricing
- 🏠 Age Factor: YearBuilt and YearRemodAdd influence values
This project was developed for CS5785 - Machine Learning at Cornell University, demonstrating:
- Understanding of linear regression theory
- Implementation of algorithms from scratch
- Data preprocessing and feature engineering
- Model evaluation and interpretation
- Professional software development practices
- ✅ Feature Analysis: Continuous vs categorical feature identification
- ✅ Data Preprocessing: Missing values, normalization, categorical encoding
- ✅ One-Hot Encoding: Custom implementation with visualization
- ✅ OLS from Scratch: No external regression libraries used
- ✅ Model Evaluation: MSE and R² score calculation
- ✅ Kaggle Submission: Complete competition entry
Feel free to fork this repository and submit pull requests for improvements!
This project is licensed under the MIT License - see the LICENSE file for details.
Beyza Simsekoglu
CS5785 Student
GitHub | LinkedIn
⭐ Star this repository if you found it helpful!