This project uses Linear Regression to predict individual medical insurance charges. By analyzing factors such as age, BMI, and smoking status, I developed a model that explains 99% of the variance in insurance costs, providing a clear mathematical framework for risk-based pricing.
- One-Hot Encoding: Transformed categorical variables (Smoker status, Region) into numerical values using `pd.get_dummies` while avoiding the Dummy Variable Trap.
- Algorithm: Implemented Scikit-Learn's `LinearRegression` to identify the specific "weights" of health risk factors.
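The two steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`age`, `bmi`, `smoker`, `region`, `charges`) and the eight rows of data are made up for the example, and `drop_first=True` is assumed to be how the Dummy Variable Trap was avoided.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny synthetic frame for illustration only; NOT the project's dataset.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61, 37, 24],
    "bmi": [27.9, 22.7, 30.1, 35.2, 24.0, 29.5, 31.3, 26.8],
    "smoker": ["yes", "no", "no", "yes", "no", "yes", "no", "yes"],
    "region": ["southwest", "northwest", "southeast", "northeast",
               "southeast", "northwest", "northeast", "southwest"],
    "charges": [16900.0, 4400.0, 8200.0, 42000.0,
                3900.0, 39000.0, 6100.0, 18500.0],
})

# drop_first=True drops one level per categorical column, so the dummy
# columns are no longer perfectly collinear with the intercept
# (the Dummy Variable Trap).
encoded = pd.get_dummies(
    df, columns=["smoker", "region"], drop_first=True, dtype=int
)

X = encoded.drop(columns="charges")
y = encoded["charges"]

# Fit ordinary least squares and pair each feature with its learned weight.
model = LinearRegression().fit(X, y)
weights = dict(zip(X.columns, model.coef_))
```

With this encoding, `smoker_yes` is the only smoker column, so its weight reads as the charge difference between smokers and non-smokers with the other features held fixed.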
My model estimated the main cost drivers for insurance charges, each with the other factors held constant:
- The Smoker Tax: Being a smoker increases charges by approximately $19,799.
- The Aging Factor: For every single year of age, the premium increases by $248.64.
- The BMI Impact: Each point of BMI adds roughly $95.21 to the annual cost.
- R-Squared Score: 0.99 (The model explains 99% of the variance in charges).
- Mean Absolute Error (MAE): $806.51 (High precision in cost estimation).
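To make the figures above concrete, here is a small sketch in two parts. The first part applies the three reported coefficients to the difference between two hypothetical profiles. The intercept and region weights are not listed here, so only a gap between two people in the same region can be computed, not an absolute charge. The second part shows how R-Squared and MAE are computed with Scikit-Learn, using made-up toy values rather than the project's predictions. The function name `charge_gap` is invented for this example.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Coefficients as reported in the results list (dollars).
COEF_SMOKER = 19799.0
COEF_AGE = 248.64
COEF_BMI = 95.21


def charge_gap(age_diff, bmi_diff, smoker_diff):
    """Predicted charge difference between two profiles in the same region.

    The intercept cancels out when two predictions are subtracted,
    so it is not needed here.
    """
    return COEF_AGE * age_diff + COEF_BMI * bmi_diff + COEF_SMOKER * smoker_diff


# A smoker who is 10 years older with 5 more BMI points than a non-smoker.
gap = charge_gap(age_diff=10, bmi_diff=5, smoker_diff=1)

# Toy values for illustration only; NOT the model's real output.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

mae = mean_absolute_error(y_true, y_pred)  # mean of |y_true - y_pred|
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot
```

In the toy data every prediction is off by 10, so the MAE is 10, while R-Squared stays close to 1 because those errors are small relative to the spread of the true values.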
- `insurance_regression.ipynb`: Complete code for encoding, training, and evaluation.
- `actual_vs_predicted.png`: Visualization confirming the model's accuracy.