This project predicts insurance premiums based on a user's profile using a machine learning pipeline. The project includes data ingestion, preprocessing, model training, and prediction serving through a Flask API. The application also logs predictions and experiment details using MLflow.
├── application.py # Flask server for prediction
├── requirements.txt # Python dependencies
├── Dockerfile # For containerizing the app
├── model/ # Stores base dataset and experimental/trained models
├── src/ # Source code
│ ├── components/ # Contains data ingestion, transformation and training modules
│ │ ├── data_ingestion.py # Loads data and applies initial feature engineering
│ │ ├── data_transformation.py # Handles preprocessing and feature transformation
│ │ └── model_training.py # Trains and evaluates regression models
│ ├── pipeline/ # Pipeline logic for inference
│ │ └── predict_pipeline.py # Handles model loading and prediction
│ ├── exception.py # Custom exception handling module
│ ├── logger.py # Handles logging across the application
│ └── utils.py # Common utility functions
└── artifacts/ # Stores trained models and preprocessor objects

To install the dependencies and start the server locally:

    pip install -r requirements.txt
    python application.py

The server starts at http://127.0.0.1:5000/

To build and run the application inside a Docker container:

    docker build -t insurance_premium .
    docker run -p 8000:5000 insurance_premium

Then access the server at http://localhost:8000/

- Loads the trained `model.pkl` and `proprocessor.pkl` from the `artifacts/` folder.
- Transforms incoming data and performs predictions.
- Logs inputs, model parameters, and output using MLflow.
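
The load, transform, and predict steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `predict_pipeline.py`: the function names `load_object`, `load_artifacts`, and `run_inference` are hypothetical, and the sketch assumes both artifacts were saved with `pickle` and expose `transform` and `predict` methods. MLflow logging is left out.

```python
import pickle
from pathlib import Path


def load_object(path):
    """Load a pickled object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_artifacts(artifacts_dir="artifacts"):
    """Return (preprocessor, model) loaded from the artifacts folder.

    File names follow the README; the pickle format is an assumption.
    """
    artifacts = Path(artifacts_dir)
    preprocessor = load_object(artifacts / "proprocessor.pkl")
    model = load_object(artifacts / "model.pkl")
    return preprocessor, model


def run_inference(preprocessor, model, rows):
    """Transform raw input rows, then predict premiums from the result."""
    features = preprocessor.transform(rows)
    return model.predict(features)
```

Keeping loading separate from inference means the artifacts can be read once at server start and reused for every request.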
- Reads and processes the raw dataset.
- Handles missing dates, applies log transform to income, removes duplicates, and drops sparse columns.
- Splits data into train/test sets and stores them in `artifacts/`.
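
To make the ingestion steps above concrete, here is a dependency-free sketch on a list of dicts. It is an illustration under assumptions, not the code in `data_ingestion.py`: the column name `income`, the 50% sparsity threshold, the use of `log1p` (so zero income stays defined), and the 80/20 split are all hypothetical choices. Missing-date handling is omitted because the strategy is not stated here.

```python
import math
import random


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Drop columns whose share of missing (None) values exceeds the threshold."""
    if not rows:
        return rows
    columns = list(rows[0].keys())
    keep = [
        c for c in columns
        if sum(r.get(c) is None for r in rows) / len(rows) <= max_missing_ratio
    ]
    return [{c: r.get(c) for c in keep} for r in rows]


def remove_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(sorted(r.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def log_transform(rows, column="income"):
    """Replace column values with log(1 + x); missing values are left as None."""
    out = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not mutated
        if r.get(column) is not None:
            r[column] = math.log1p(r[column])
        out.append(r)
    return out


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle with a fixed seed and split into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]
```

A fixed seed keeps the split reproducible across runs, which matters when comparing experiments.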
- Applies imputation to numeric and categorical columns.
- Label encodes categorical features.
- Uses `ColumnTransformer` and `Pipeline` for structured preprocessing.
- Saves the preprocessor object as `proprocessor.pkl`.
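
The sketch below shows, in plain Python, what imputation and label encoding compute. It deliberately does not use scikit-learn's `ColumnTransformer` or `Pipeline`, so it is a stand-in for the idea rather than the project's preprocessor. The choice of median for numeric columns, most-frequent value for categorical columns, and `-1` for unseen categories are assumptions, as are the function names.

```python
from collections import Counter
from statistics import median


def fit_preprocessor(rows, numeric_cols, categorical_cols):
    """Learn fill values and label mappings from the training rows."""
    params = {"fill": {}, "labels": {}}
    for col in numeric_cols:
        values = [r[col] for r in rows if r[col] is not None]
        params["fill"][col] = median(values)
    for col in categorical_cols:
        values = [r[col] for r in rows if r[col] is not None]
        params["fill"][col] = Counter(values).most_common(1)[0][0]
        # Assign integer labels in sorted order so the mapping is deterministic.
        params["labels"][col] = {v: i for i, v in enumerate(sorted(set(values)))}
    return params


def transform(rows, params, numeric_cols, categorical_cols):
    """Impute missing values, then encode each row as a numeric vector."""
    out = []
    for r in rows:
        vec = []
        for col in numeric_cols:
            v = r[col] if r[col] is not None else params["fill"][col]
            vec.append(float(v))
        for col in categorical_cols:
            v = r[col] if r[col] is not None else params["fill"][col]
            # Categories not seen during fitting map to -1 (assumption of this sketch).
            vec.append(params["labels"][col].get(v, -1))
        out.append(vec)
    return out
```

Fitting on the training rows only, then reusing the learned parameters at inference time, is the reason the preprocessor is saved alongside the model.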
- Trains multiple regression models (Random Forest, XGBoost, LightGBM, etc.).
- Evaluates models using R2 score.
- Saves the best model as `model.pkl`.
- Logs training details using MLflow.
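
Model selection by R2 score, as described above, can be illustrated with a small self-contained sketch. The R2 formula is the standard one, 1 - SS_res / SS_tot; the helper name `select_best` and the idea of passing in each model's test-set predictions as a dict are assumptions made for this example.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all target values are equal")
    return 1.0 - ss_res / ss_tot


def select_best(predictions_by_model, y_true):
    """Score each model's predictions and return (best_name, all_scores)."""
    scores = {
        name: r2_score(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

A model that always predicts the mean of the targets scores 0, and a perfect model scores 1, which gives a quick sanity check on any reported value.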

All inference and training metadata are tracked as MLflow experiments, which helps monitor performance across different runs.

