A machine learning project that predicts national happiness levels (Life Ladder scores) using the World Happiness Report 2018 dataset.
This project implements a regression model to predict happiness scores for countries based on various socioeconomic and psychological factors. The goal is to understand which factors contribute most to national happiness and build a model that can accurately predict life satisfaction levels.
Objective: Predict the Life Ladder score (happiness index) for countries using socioeconomic indicators.
Type: Supervised Learning - Regression Problem
- Target Variable: Life Ladder (continuous numerical values representing happiness scores)
- Features: Economic, social, health, and governance indicators
Source: World Happiness Report 2018 Chapter 2 Online Data (WHR2018Chapter2OnlineData.csv)
Key Features Used:
- Log GDP Per Capita
- Social Support
- Healthy Life Expectancy at Birth
- Freedom to Make Life Choices
- Generosity
- Perceptions of Corruption
- Positive Affect
- Negative Affect
- Confidence in National Government
- Democratic Quality
- Delivery Quality
Features Removed:
- Year (temporal data not needed for this analysis)
- Standard deviation metrics (high missingness)
- GINI index columns (high missingness)
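The column removal described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy frame and its column names are assumptions modeled on the dataset, and `drop_unused_columns` is a hypothetical helper.

```python
import pandas as pd

# Toy frame; these column names are assumptions modeled on the dataset.
df = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2016, 2017],
    "Life Ladder": [5.1, 6.3],
    "Log GDP per capita": [9.2, 10.1],
    "Standard deviation of ladder by country-year": [1.9, 2.1],
    "GINI index (World Bank estimate)": [None, 0.31],
})

def drop_unused_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the year column plus any standard-deviation or GINI columns."""
    to_drop = [
        col for col in frame.columns
        if col.lower() == "year"
        or "standard deviation" in col.lower()
        or "gini" in col.lower()
    ]
    return frame.drop(columns=to_drop)

# Share of missing values per column, useful for judging "high missingness".
missing_share = df.isna().mean()

reduced = drop_unused_columns(df)
print(list(reduced.columns))
```

Matching on substrings rather than exact names keeps the sketch tolerant of small differences in how the CSV labels these columns.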
- Missing Value Treatment: Mean imputation for numerical features
- Feature Scaling: Standardization with StandardScaler (zero mean, unit variance per feature)
- Feature Engineering: Column name cleaning and standardization
- Data Splitting: 80% training, 20% testing
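The preprocessing steps above might look roughly like the sketch below. It uses synthetic data, and the column names are placeholders. One assumption to flag: the sketch splits the data first and fits the imputer and scaler on the training portion only, which avoids leaking test-set statistics; the notebook may order these steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (placeholder column names).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=["Log GDP per capita", "Social support", "Generosity"],
)
X.iloc[::7, 1] = np.nan  # inject some missing values
y = pd.Series(rng.normal(size=50), name="life_ladder")

# Column name cleaning: trim, lowercase, and use underscores.
X.columns = X.columns.str.strip().str.lower().str.replace(" ", "_")

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the imputer and scaler on the training split only, then reuse them.
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prep = scaler.transform(imputer.transform(X_test))
```

After this step each training feature has mean 0 and standard deviation 1; the test features are transformed with the training statistics, so their mean and spread will differ slightly.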
- Distribution analysis of numerical features
- Outlier detection using box plots
- Correlation analysis with target variable
- Pairplot visualization of key relationships
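The numeric side of this exploratory analysis can be sketched without any plotting. The example below uses synthetic data with assumed column names; it ranks features by their correlation with the target and counts outliers with the 1.5 × IQR rule that box plot whiskers are based on.

```python
import numpy as np
import pandas as pd

# Synthetic data with assumed column names; not the real dataset.
rng = np.random.default_rng(1)
n = 200
gdp = rng.normal(9.0, 1.0, n)
df = pd.DataFrame({
    "log_gdp_per_capita": gdp,
    "generosity": rng.normal(0.0, 0.15, n),
    "life_ladder": 0.7 * gdp + rng.normal(0.0, 0.5, n) - 1.0,
})

# Correlation of each feature with the target, strongest first.
corr_with_target = (
    df.corr(numeric_only=True)["life_ladder"]
    .drop("life_ladder")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)

# Outlier counts per column using the 1.5 * IQR rule.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(outlier_counts)
```

For the visual counterparts, `df.boxplot()` and `seaborn.pairplot(df)` draw the box plots and pairwise scatter plots from the same frame.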
- Algorithm: Linear Regression
- Training: Fitted on scaled training data
- Validation: Train-test split evaluation
- Metrics Used:
  - Root Mean Square Error (RMSE)
  - R² Score (Coefficient of Determination)
- Visualization: Actual vs Predicted scatter plot
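As a rough illustration of the training and evaluation steps listed above, the sketch below fits a linear regression on synthetic data and computes RMSE and R² on both splits. The data, coefficients, and the `rmse` helper are assumptions made for the example; they are not the notebook's values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, already-scaled features and a roughly linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 5.5 + X @ np.array([0.8, 0.4, 0.3, 0.0]) + rng.normal(0.0, 0.3, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

def rmse(y_true, y_pred) -> float:
    """Root mean square error: square root of the mean squared residual."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"Train RMSE: {train_rmse:.3f}  Test RMSE: {test_rmse:.3f}")
print(f"Train R2:   {train_r2:.3f}  Test R2:   {test_r2:.3f}")
```

Comparing the training and test values side by side is what supports any claim about overfitting: a test RMSE far above the training RMSE would suggest the model does not generalize.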
The notebook reports the following metrics for the linear regression model when it is executed:
- Training RMSE: [Value from execution]
- Test RMSE: [Value from execution]
- Training R²: [Value from execution]
- Test R²: [Value from execution]
Training and test metrics that are close to each other indicate good generalization with minimal overfitting.
- Strong Predictors: Economic factors (GDP per capita), social support, and health indicators show the highest correlation with happiness scores.
- Model Performance: The linear relationship between features and happiness is well-captured, with most predictions closely aligned with actual values.
- Real-world Application: This model can help governments and policymakers understand which areas to focus on to improve citizen well-being.
This predictive model provides valuable insights for:
- Government Policy: Identifying key areas for policy intervention to improve national happiness
- International Development: Prioritizing development programs based on happiness impact
- Research: Understanding the relationship between socioeconomic factors and well-being
- Comparative Analysis: Benchmarking countries against predicted happiness levels
pandas
numpy
matplotlib
seaborn
scikit-learn

pip install pandas numpy matplotlib seaborn scikit-learn

├── DefineAndSolveMLProblem.ipynb     # Main analysis notebook
├── README.md                         # Project documentation
└── data/
    └── WHR2018Chapter2OnlineData.csv # Dataset
- Setup Environment: Install required dependencies
- Load Data: Ensure the dataset is in the data/ directory
- Run Notebook: Execute cells in sequence in DefineAndSolveMLProblem.ipynb
- View Results: Analyze model performance and visualizations
- Feature Engineering: Create polynomial features or interaction terms
- Model Comparison: Test other algorithms (Random Forest, Gradient Boosting)
- Cross-Validation: Implement k-fold cross-validation for robust evaluation
- Hyperparameter Tuning: Optimize model parameters using GridSearchCV
- Time Series Analysis: Incorporate temporal trends using the Year column, which this analysis currently drops
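Two of the improvements above, k-fold cross-validation and model comparison, could be prototyped along the following lines. This is a sketch on synthetic data, assuming a 5-fold split and a small Random Forest as the alternative model; neither choice comes from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and a roughly linear target (illustrative only).
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = 5.5 + X @ np.array([0.8, 0.4, 0.3, 0.0]) + rng.normal(0.0, 0.3, 200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Wrapping the scaler in a pipeline refits it inside every fold.
candidates = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_rmse = {}
for name, estimator in candidates.items():
    # scikit-learn returns negated RMSE so that higher is always better.
    scores = cross_val_score(
        estimator, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = float(-scores.mean())

for name, value in cv_rmse.items():
    print(f"{name}: mean CV RMSE = {value:.3f}")
```

Averaging RMSE across folds gives a steadier estimate than a single 80/20 split, which matters for a dataset of this size.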
This project follows the complete machine learning lifecycle:
- ✅ Data Collection: World Happiness Report dataset
- ✅ Problem Definition: Regression prediction of happiness scores
- ✅ Exploratory Data Analysis: Statistical and visual analysis
- ✅ Data Preprocessing: Cleaning, imputation, and scaling
- ✅ Model Training: Linear regression implementation
- ✅ Model Evaluation: Performance metrics and validation
- ✅ Results Interpretation: Business insights and visualization
Lab 8 Assignment - Machine Learning Problem Solving
This project is for educational purposes as part of a machine learning course.