[[Objectives]] [[Project Plan]] [[Key Findings]] [[Modeling]] [[Recommendations]] [[Next Steps]] [[Steps to Reproduce]] [[Data Exploration]] [[Conclusion]] [[Data Dictionary]]
For the Zillow Data Science team:
- Provide recommendations on how to build a better model (things that do and don't work
- Determine the states and counties where the fips are located
Follow the data science pipeline to find insights to present to the Zillow Data Science team on how to improve model performance to predict home values; identify the location of each transaction. Produce a GitHub repository and a final notebook from which to present findings.
-
The bedrooms relationship to home_value is extremely weak yet significant¶
-
The square_feet relationship to home_value is very strong and significant
-
Bedrooms and square_feet have some degree of multicolinearity, however there is high variance in the relationship so both features will still be taken to modeling
The best performing regression model on this data set was a LassoLars model with an alpha of .01 which performed on test data with an RMSE of $207,884 and an R squared value of .22
-
Merging the bedrooms and bathrooms information into a ratio does not add value to a model because it increases feature variance when plotted against home_value. Do not combine these features in future models.
-
Eliminating outliers in the home value distribution will increase model performance. Only 10% of the transactions are above 1M, eliminating these from the data set reduces heteroskedasticity and improves model performance.
-
The dataframe I was able to create which matches state and county information to each transaction id may be added to the Zillow database server, possibly using transaction 'id' as the primary key and 'fips' as a foreign key.
To further improve model performance, feature selection methods may be used to determine other columns in the data that can add predictive power to future models
It may be useful to experiment with trimming outlier values from other model features as well, besides just the home value
If desirable to the Zillow Data Science team, develop a model that works well at predicting extreme target values (potentially a clustering model)
- Make an env.py file with your credentials to access the codeup database
- clone this repository to a local directory
- Follow the hyperlink in the 'State and County Data' section at the bottom of the notebook and 'curl -O' the raw file into your local directory
- Run all cells of the notebook
The project objectives were met: recommendations to build a better model were presented and state and county information were matched and added to the list of 2017 home transactions.
home_value The taxable value of each home. Given as a float. square_feet The square footage of each home. Given as a float. bedrooms The bedroom count of each home. Given as a float.