End-to-end MLOps pipeline for loan default prediction using Airflow, MLflow, Vertex AI, Cloud Run, and Terraform.

JDede1/loan_default_prediction


CI Integration & CD Python Docker Terraform License: MIT



πŸŽ₯ Demo

The following demo shows the Airflow UI with all orchestrated DAGs available:

  1. Training Pipeline DAG – submits training jobs and logs results in MLflow.
  2. Hyperparameter Tuning DAG – runs Optuna optimization for XGBoost.
  3. Batch Prediction DAG – generates daily predictions.
  4. Monitoring DAG – runs Evidently drift detection and can trigger retraining.
  5. Promotion DAG – promotes models from staging to production.

Demo

(GIF captured with ScreenToGif.)



🏦 Loan Default Prediction – End-to-End MLOps Project

1. Project Overview

Financial institutions face significant risk when issuing loans, as defaults can lead to major losses. Being able to predict the likelihood of loan default before issuing credit allows lenders to make better decisions, reduce risk exposure, and maintain healthier loan portfolios.

This project implements a production-grade MLOps pipeline to automate the full lifecycle of a loan default prediction model β€” from data preparation and training to deployment, monitoring, and continuous improvement. The pipeline is designed to be scalable, reproducible, and cloud-native, following best practices in modern machine learning operations.

πŸ”Ή Dataset

  • Source: LendingClub loan dataset (public loan default dataset widely used in risk modeling).

  • Data Location: Stored in GCS (gs://loan-default-artifacts-loan-default-mlops/data/loan_default_selected_features_clean.csv).

  • Features: 20 predictive features including applicant financial attributes (loan amount, annual income, DTI ratio, credit history length, number of open accounts, revolving balance, etc.).

  • Target Variable:

    • loan_status β†’ Binary classification

      • 1 = Loan defaulted
      • 0 = Loan fully paid

πŸ”Ή Business Value

  • Risk Mitigation – Identify high-risk loan applications early.
  • Operational Efficiency – Automate model training, deployment, and monitoring to reduce manual effort.
  • Scalability – Leverage cloud infrastructure to handle large-scale data and model workloads.
  • Continuous Improvement – Monitor performance drift and trigger retraining when needed.

πŸ”Ή Key Capabilities

  • Training & Tuning – Automated hyperparameter tuning with Optuna and scalable training jobs on Vertex AI.
  • Experiment Tracking & Registry – MLflow for logging experiments, metrics, artifacts, and versioned models.
  • Deployment – Batch and real-time model serving with Docker and Cloud Run.
  • Monitoring – Evidently AI for detecting data drift and target drift in production.
  • CI/CD – GitHub Actions pipelines for automated testing, integration, and deployment; local CI simulation for reproducibility.
  • Infrastructure as Code (IaC) – Terraform for provisioning GCP resources in a consistent, reproducible manner.

πŸ”Ή Tech Stack

  • Orchestration β†’ Apache Airflow
  • Experiment Tracking & Registry β†’ MLflow
  • Deployment β†’ Docker + GCP Cloud Run
  • Monitoring β†’ Evidently AI
  • CI/CD β†’ GitHub Actions + Local CI simulation
  • Infrastructure as Code β†’ Terraform
  • Cloud β†’ Google Cloud Platform (GCS, Artifact Registry, Vertex AI, Cloud Run)


2. Architecture

The project is built as a modular MLOps pipeline where each component handles a specific part of the ML lifecycle β€” from data ingestion and training to deployment, monitoring, and automated retraining.

πŸ”Ή System Architecture

```mermaid
graph TD
    subgraph Dev ["Local Dev / Codespaces"]
        A[Developer] -->|Code + DAGs| B["Airflow + MLflow (Docker Compose)"]
        A -->|Push to Repo| G[GitHub]
    end

    G[GitHub] -->|CI/CD Workflows| H[GitHub Actions]
    H -->|Provision Infra| T[Terraform]
    H -->|Build + Push| R[Artifact Registry]

    subgraph GCP ["Google Cloud Platform"]
        T --> BKT["GCS Bucket: Data, Artifacts, Reports"]
        R --> CR["Cloud Run: Model Serving"]
        T --> ML["MLflow Tracking Server (Cloud Run)"]
        T --> VA["Vertex AI Training Jobs"]
    end

    B -->|Sync Artifacts| BKT
    ML -->|Store Experiments| BKT
    CR -->|Batch Predictions| BKT
    BKT --> MON["Evidently Drift Monitoring"]
    MON -->|Trigger Retrain| VA
```

ℹ️ Note: MLflow appears in both environments:

  • Locally, MLflow runs in Docker Compose (alongside Airflow) for dev and testing.
  • In production, MLflow is deployed on Cloud Run with GCS as the backend store.


πŸ”Ή Component Breakdown

  • Apache Airflow

    • Orchestrates the end-to-end ML workflows (training, tuning, batch prediction, monitoring, promotion).
    • DAGs trigger Vertex AI jobs, batch predictions, and Evidently monitoring.
  • MLflow

    • Experiment tracking (metrics, artifacts, models).
    • Model registry with Staging β†’ Production lifecycle.
    • Deployed both locally (Docker) and on Cloud Run with GCS artifact storage.
  • Docker + Cloud Run

    • Dockerfiles for training, serving, and MLflow.
    • Models served via Cloud Run API for real-time predictions.
  • Google Cloud Storage (GCS)

    • Centralized storage for:

      • Training data.
      • Model artifacts (Optuna params, metrics, plots).
      • Batch prediction outputs.
      • Evidently monitoring reports.
  • Vertex AI

    • Runs training jobs using the custom trainer image.
    • Integrates with MLflow for logging experiments and registering models.
  • Evidently AI

    • Compares training dataset with new predictions.
    • Detects data drift and target drift.
    • If drift is detected, triggers retraining via Airflow DAG.
  • Terraform

    • Infrastructure-as-Code provisioning:

      • GCS bucket.
      • Artifact Registry.
      • MLflow Cloud Run service + IAM.
      • Networking & permissions.
  • GitHub Actions (CI/CD)

    • CI Pipeline: Lint, unit tests, formatting checks.
    • Integration Pipeline: Full stack startup, health checks, integration tests.
    • CD Pipeline: Builds and pushes trainer images to Artifact Registry.


3. Prerequisites

To run this project locally or on the cloud, ensure you have the following tools installed:

  • Docker β†’ >= 24.x

  • Docker Compose β†’ >= 2.x

    • Often bundled with Docker Desktop.
  • Python β†’ 3.10 (recommended for compatibility)

  • Terraform β†’ >= 1.5.0

  • Google Cloud CLI (gcloud) β†’ >= 456.x

    • Required for authentication, Artifact Registry, and Vertex AI.
    • Install gcloud
  • Make (GNU Make) β†’ >= 4.3

    • Pre-installed on most Linux/macOS; Windows users can install via WSL or Git Bash.

πŸ”Ή GCP Setup

Before running on Google Cloud:

  1. Create a GCP project (e.g., loan-default-mlops).

  2. Enable the following APIs:

    • storage.googleapis.com
    • run.googleapis.com
    • aiplatform.googleapis.com
    • artifactregistry.googleapis.com
    • sqladmin.googleapis.com
  3. Create a service account with roles:

    • roles/storage.admin
    • roles/artifactregistry.admin
    • roles/aiplatform.admin
    • roles/run.admin
  4. Download the service account key and place it at:

    keys/gcs-service-account.json
    airflow/keys/gcs-service-account.json
    


4. Setup & Installation

Follow the steps below to set up the project locally and prepare it for deployment to Google Cloud.


1️⃣ Clone the Repository

git clone https://github.com/your-org/loan_default_prediction.git
cd loan_default_prediction

2️⃣ Configure Environment Variables

Copy the example environment file and update values as needed:

cp .env.example .env

Key variables in .env:

  • GCS_BUCKET β†’ GCS bucket name for storing artifacts, predictions, reports.
  • MLFLOW_TRACKING_URI β†’ MLflow tracking server URL.
  • TRAIN_DATA_PATH β†’ Path to training dataset (CSV in GCS).
  • PREDICTION_INPUT_PATH β†’ Batch input data path.
  • PREDICTION_OUTPUT_PATH β†’ Where batch predictions will be saved.
  • PROMOTION_AUC_THRESHOLD β†’ Metric threshold for auto-promotion.
  • SLACK_WEBHOOK_URL / ALERT_EMAILS β†’ For monitoring alerts.

3️⃣ Add GCP Service Account

Create a GCP service account with roles:

  • roles/storage.admin
  • roles/artifactregistry.admin
  • roles/aiplatform.admin
  • roles/run.admin

Download the JSON key and place it in:

keys/gcs-service-account.json
airflow/keys/gcs-service-account.json

4️⃣ Install Dependencies

Set up Python dependencies locally (for testing and CI/CD tooling):

make install

This installs:

  • Core ML libs (scikit-learn, xgboost, pandas, etc.)
  • MLflow
  • Evidently AI
  • Dev tools (pytest, flake8, black, isort, mypy)

5️⃣ Start the Local Stack

Spin up Airflow + MLflow + Serving API with Docker Compose:

make start

Once running, access the services at:

  • Airflow UI → http://localhost:8080
  • MLflow UI → http://localhost:5000
  • Serving API → port defined in airflow/docker-compose.yaml

6️⃣ Verify Setup

Run the built-in verification:

make verify

Expected outputs:

  • Airflow version + healthy UI.
  • DAGs mounted in Airflow.
  • MLflow logs accessible.
  • Serving API responding.


5. Repository Structure

The repo follows a clean structure that separates orchestration, infrastructure, source code, and CI/CD.

πŸ”Ή Directory Layout

.
β”œβ”€β”€ airflow/               # Airflow orchestration
β”‚   β”œβ”€β”€ dags/              # DAGs: training, tuning, prediction, monitoring, promotion
│   ├── docker-compose.yaml # Local stack (Airflow, MLflow, DB, Serve)
β”‚   β”œβ”€β”€ artifacts/         # Prediction outputs, Optuna params, monitoring reports (gitignored)
β”‚   β”œβ”€β”€ logs/              # Airflow logs (gitignored)
β”‚   β”œβ”€β”€ keys/              # Service account (gitignored)
β”‚   └── tmp/               # Temp files (gitignored)
β”‚
β”œβ”€β”€ infra/terraform/       # Terraform IaC
β”‚   β”œβ”€β”€ main.tf            # GCS bucket + lifecycle rules
β”‚   β”œβ”€β”€ cloudrun.tf        # MLflow Cloud Run, Artifact Registry, IAM
β”‚   β”œβ”€β”€ variables.tf       # Configurable variables
β”‚   β”œβ”€β”€ outputs.tf         # Export bucket + MLflow URLs
β”‚   β”œβ”€β”€ terraform.tfvars   # Env-specific vars (gitignored)
β”‚   └── .terraform/        # Terraform state (gitignored)
β”‚
β”œβ”€β”€ src/                   # ML source code
β”‚   β”œβ”€β”€ train_with_mlflow.py    # Train & log to MLflow
β”‚   β”œβ”€β”€ tune_xgboost_with_optuna.py # Optuna tuning
β”‚   β”œβ”€β”€ batch_predict.py         # Batch inference
β”‚   β”œβ”€β”€ monitor_predictions.py   # Evidently drift monitoring
β”‚   β”œβ”€β”€ predict.py               # Real-time inference
β”‚   β”œβ”€β”€ ingest_vertex_run.py     # Ingest Vertex AI outputs
β”‚   └── utils.py                 # Utility helpers
β”‚
β”œβ”€β”€ scripts/               # Helper scripts
β”‚   β”œβ”€β”€ test_ci_local.sh   # Local CI/CD simulation
β”‚   β”œβ”€β”€ troubleshoot.sh    # Diagnostics
β”‚   └── start/stop_all.sh  # Start/stop automation
β”‚
β”œβ”€β”€ tests/                 # Unit & integration tests
β”‚   β”œβ”€β”€ test_utils.py
β”‚   β”œβ”€β”€ test_prediction_integration.py
β”‚   └── test_batch_prediction_integration.py
β”‚
β”œβ”€β”€ MLflow/                # MLflow custom Docker image
β”‚   β”œβ”€β”€ Dockerfile.mlflow
β”‚   β”œβ”€β”€ requirements.mlflow.txt
β”‚   └── tracking_entrypoint.sh
β”‚
β”œβ”€β”€ data/                  # Data (samples only, large data ignored)
β”‚   β”œβ”€β”€ loan_default_selected_features_clean.csv
β”‚   β”œβ”€β”€ batch_input.csv
β”‚   └── sample_input.json
β”‚
β”œβ”€β”€ docs/                  # Documentation assets
β”‚   β”œβ”€β”€ images/            # Screenshots + demo GIFs
β”‚   β”‚   β”œβ”€β”€ airflow_train_pipeline.gif
β”‚   β”‚   └── batch_prediction_dag.png
β”‚   └── *.png              # DAG screenshots + Vertex AI logs
β”‚
β”œβ”€β”€ .github/workflows/     # GitHub Actions CI/CD
β”‚   β”œβ”€β”€ ci.yml             # Lint + unit tests
β”‚   └── integration-cd.yml # Integration tests + deploy
β”‚
β”œβ”€β”€ artifacts/             # MLflow + prediction artifacts (gitignored)
β”œβ”€β”€ mlruns/                # Local MLflow runs (gitignored, kept in GCS in production)
β”œβ”€β”€ keys/                  # Global service account key (gitignored)
β”œβ”€β”€ Makefile               # One-stop automation: setup, test, deploy
β”œβ”€β”€ requirements.txt       # Core dependencies
β”œβ”€β”€ requirements-*.txt     # Env-specific deps (dev, serve, airflow, monitoring, vertex)
β”œβ”€β”€ SECURITY_AND_CONTRIBUTING.md
β”œβ”€β”€ TROUBLESHOOTING.md
└── README.md

πŸ”Ή What’s Ignored (via .gitignore and .dockerignore)

  • Secrets β†’ .env, keys/, airflow/keys/, *.json (except small samples).
  • Logs & Artifacts β†’ airflow/logs/, mlruns/, artifacts/, predictions_*.csv.
  • Large Data β†’ data/* (only small sample files are versioned).
  • Terraform State β†’ .terraform/, terraform.tfstate*, terraform.tfvars.
  • OS/Editor Junk β†’ .DS_Store, .vscode/, .idea/, swap files.
  • Cache β†’ __pycache__/, .pytest_cache/, .coverage.

6. Project Components

This project is composed of several modular components that work together to automate the end-to-end ML lifecycle.


πŸ”Ή Airflow DAGs

Airflow orchestrates the ML workflows through a set of DAGs:

  • tune_hyperparams_dag.py – Runs Optuna to tune XGBoost hyperparameters, saves best parameters and study DB to GCS.

  • train_pipeline_dag.py – Weekly training pipeline:

    • Submits Vertex AI training job using best params.
    • Ingests results into MLflow.
    • Decides whether to promote model (based on AUC/F1 thresholds).
    • Triggers batch prediction after training.
  • batch_prediction_dag.py – Daily batch inference using the latest MLflow model alias (staging/production), saves predictions to GCS.

  • promote_model_dag.py – Promotes model from staging β†’ production in MLflow, sends Slack + email notifications.

  • monitoring_dag.py – Runs Evidently drift detection daily; if drift is detected, retrains via train_pipeline_dag.

πŸ“Έ Example DAGs in Airflow UI

Training Pipeline DAG Airflow Training Pipeline DAG

Hyperparameter Tuning DAG

View of the DAG in Airflow:
Airflow Hyperparameter Tuning DAG

Expanded view with task details:
Airflow Hyperparameter Tuning DAG Expanded

Batch Prediction DAG

Batch Prediction DAG


πŸ”Ή MLflow Tracking & Registry

  • Tracks experiments, metrics, artifacts, and models.
  • Registry supports staging β†’ production promotion.
  • Backend: Postgres (local), GCS (cloud).

πŸ”Ή Terraform Infrastructure

  • Provisions GCS bucket for data, artifacts, and reports.
  • Creates Artifact Registry for trainer + MLflow images.
  • Deploys MLflow server on Cloud Run.
  • Configures IAM roles and service accounts.

πŸ”Ή Monitoring with Evidently

  • Compares training dataset vs latest batch predictions.
  • Generates JSON + HTML drift reports.
  • Stores reports in GCS and triggers retraining if drift is detected.

πŸ”Ή High-Level DAG Orchestration

```mermaid
flowchart TD
    TUNE[Tune Hyperparameters] --> TRAIN["Train Pipeline (Vertex AI)"]
    TRAIN --> DECIDE{Promotion?}
    DECIDE -->|Pass| PROMOTE[Promote Model]
    DECIDE -->|Fail| SKIP[Skip Promotion]
    PROMOTE --> BATCH[Batch Prediction]
    SKIP --> BATCH
    BATCH --> MONITOR["Monitoring DAG (Evidently)"]
    MONITOR -->|No Drift| END([End])
    MONITOR -->|Drift Detected| TRAIN
```


7. CI/CD Pipeline

This project follows a two-tier CI/CD strategy:

  1. Continuous Integration (CI) – Runs automatically on every push/PR to main.

    • Code quality checks (linting, formatting).
    • Unit tests (pytest -m "not integration").
    • Fails fast if issues are found.
  2. Integration + Continuous Deployment (CD) – Runs manually or on main branch merges.

    • Spins up the full Airflow + MLflow + Serve stack inside Docker.
    • Runs integration tests against batch and real-time predictions.
    • If successful β†’ builds & pushes trainer image to Artifact Registry.
  3. Local CI Simulation – Developers can replicate the same pipeline locally with make ci-local.

    • Runs lint, formatting, unit tests.
    • Spins up containers, checks health, runs integration tests.
    • Optional local CD (make deploy-trainer).

πŸ”Ή CI/CD Flow Diagram

```mermaid
flowchart LR
    C[Commit / PR to Main] --> CI[GitHub Actions CI Workflow]
    CI --> L[Lint + Format Checks]
    CI --> U[Unit Tests]

    L -->|Fail| X[Stop ❌]
    U -->|Fail| X
    L -->|Pass| INT[Integration/CD Workflow]
    U -->|Pass| INT

    INT --> D[Start Airflow + MLflow Stack]
    D --> IT[Run Integration Tests]
    IT -->|Fail| X
    IT -->|Pass| DEPLOY[Build + Push Trainer Image]

    DEPLOY --> GCP[Artifact Registry + Vertex AI]
```

πŸ”Ή GitHub Actions Workflows

  • ci.yml

    • Trigger: push/PR to main.

    • Steps:

      • Install dependencies.
      • Check formatting (black, isort).
      • Lint (flake8).
      • Run unit tests only.
  • integration-cd.yml

    • Trigger: manual (workflow_dispatch) or after integration passes.

    • Steps:

      • Build Docker images.
      • Start Postgres + Airflow + MLflow.
      • Health checks for services.
      • Bootstrap MLflow with dummy model.
      • Run integration tests (batch + real-time).
      • If on main β†’ build & push trainer image to Artifact Registry.

πŸ”Ή Local CI/CD Simulation

Run the entire CI/CD process locally:

make ci-local

This executes scripts/test_ci_local.sh, which:

  • Runs lint + formatting checks.
  • Runs unit tests.
  • Spins up Postgres + Airflow + MLflow.
  • Ensures MLflow DB exists.
  • Starts stack, checks health.
  • Boots a dummy model in MLflow.
  • Starts Serve API.
  • Runs integration tests.
  • Optional: deploy trainer image with CD=1 make ci-local.


8. Testing & Quality Checks

Ensuring code quality and reproducibility is a critical part of the pipeline. This project enforces multiple levels of testing and static analysis.


πŸ”Ή Unit Tests

  • Located in tests/.

  • Lightweight checks for utility functions and components.

  • Example:

    • tests/test_utils.py β†’ verifies utility functions like add_numbers().

Run unit tests only:

pytest -m "not integration" -v
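To illustrate how the marker split works, here is a hedged sketch of what a test file like tests/test_utils.py might look like; the actual file may differ. Tests tagged `integration` are excluded by `pytest -m "not integration"`.

```python
import pytest

def add_numbers(a, b):
    """Toy utility, standing in for the helpers in src/utils.py."""
    return a + b

def test_add_numbers():
    # runs in the unit-test stage of CI
    assert add_numbers(2, 3) == 5

@pytest.mark.integration
def test_serve_api_roundtrip():
    # selected only when integration tests run (make integration-tests)
    pass
```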

πŸ”Ή Integration Tests

  • Validate the end-to-end pipeline inside the Airflow + MLflow stack.

  • Test scenarios include:

    • Batch prediction test (test_batch_prediction_integration.py) β†’ ensures batch inference runs and saves predictions.
    • Real-time prediction test (test_prediction_integration.py) β†’ checks Serve API responds with valid predictions.

Run integration tests locally:

make integration-tests

These tests are also executed in GitHub Actions via the integration-cd.yml workflow.


πŸ”Ή Linting

  • Enforced via Flake8.
  • Ensures Python code adheres to PEP8 and project style guidelines.
make lint

πŸ”Ή Code Formatting

  • Black – auto-formats code to consistent style.
  • isort – ensures consistent import ordering.

Check formatting (CI/CD safe):

make check-format

Format automatically:

make format

πŸ”Ή Type Checking

  • Enforced via mypy.
  • Ensures static typing coverage in source code.
mypy src

πŸ”Ή Coverage & Reports

  • pytest-cov enabled in dev dependencies.
  • Generates test coverage reports locally:
pytest --cov=src tests/

With these checks in place, the project guarantees:

  • Code correctness (unit tests).
  • Pipeline reliability (integration tests).
  • Consistent style (lint/format).
  • Type safety (mypy).
  • Maintainability and reproducibility (coverage).


9. How to Run the Project

The project can be run in two modes:

  1. Locally with Docker Compose – for development and testing.
  2. On Google Cloud (production-ready) – with Terraform-managed infrastructure.

πŸ”Ή Run Locally (Dev Mode)

  1. Start the full stack (Airflow + MLflow + Serve):

    make start
  2. Access services:

    • Airflow UI → http://localhost:8080
    • MLflow UI → http://localhost:5000
  3. Run unit + integration tests:

    make test               # unit tests
    make integration-tests  # integration tests inside containers
  4. Trigger Airflow DAGs manually:

    • Open Airflow UI (localhost:8080).

    • Enable and run DAGs:

      • tune_hyperparams_dag
      • train_pipeline_dag
      • batch_prediction_dag
      • monitoring_dag

πŸ”Ή Run on GCP (Production Mode)

  1. Provision infrastructure with Terraform:

    make terraform-init
    make terraform-apply

    This creates:

    • GCS bucket (for data, artifacts, reports).
    • Artifact Registry (for trainer + MLflow images).
    • Cloud Run MLflow service.
    • IAM service accounts + permissions.
  2. Verify outputs: Terraform will print:

    • bucket_url β†’ where artifacts are stored.
    • mlflow_url β†’ Cloud Run endpoint for MLflow tracking server.
  3. Build and push trainer image:

    make trainer

    This builds the loan-default-trainer image and pushes it to Artifact Registry.

  4. Run training on Vertex AI (via Airflow train_pipeline_dag):

    • Airflow submits a Vertex AI training job using the trainer image.
    • Results are ingested into MLflow automatically.
  5. Batch predictions:

    • Airflow runs batch_prediction_dag.

    • Predictions are written to:

      gs://<your-bucket>/predictions/predictions.csv
      
  6. Monitoring & retraining:

    • Airflow runs monitoring_dag.

    • Evidently compares training vs latest predictions.

    • Drift reports are stored in GCS under:

      gs://<your-bucket>/reports/
      
    • If drift detected β†’ train_pipeline_dag is triggered automatically.


πŸ”Ή Tear Down GCP Infrastructure

When you’re done, destroy resources to avoid costs:

make terraform-destroy


10. Results & Monitoring

This section highlights the outputs of the ML pipeline and how monitoring ensures continuous model performance.


πŸ”Ή Hyperparameter Tuning (Optuna)

  • The DAG tune_hyperparams_dag runs Optuna to optimize XGBoost hyperparameters.

  • Best params are saved to:

    • Local: airflow/artifacts/best_xgb_params.json
    • GCS: gs://<bucket>/artifacts/best_xgb_params.json
  • Example:

{
  "max_depth": 7,
  "learning_rate": 0.12,
  "n_estimators": 350,
  "subsample": 0.85,
  "colsample_bytree": 0.75
}
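A trainer consuming best_xgb_params.json might merge the tuned values over baseline settings before fitting, roughly as below. The `defaults` dict is an assumption for illustration; the JSON literal mirrors the example above.

```python
import json

# Tuned values, as saved by tune_hyperparams_dag (example from above).
BEST = json.loads("""{
  "max_depth": 7, "learning_rate": 0.12, "n_estimators": 350,
  "subsample": 0.85, "colsample_bytree": 0.75
}""")

# Baseline settings are illustrative; tuned params override them on merge.
defaults = {"objective": "binary:logistic", "eval_metric": "auc", "max_depth": 6}
params = {**defaults, **BEST}
print(params["n_estimators"], params["max_depth"])  # 350 7
```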

πŸ”Ή Model Training & Registry (MLflow)

  • Models are trained weekly via train_pipeline_dag on Vertex AI.

  • Runs are logged in MLflow with:

    • Metrics: AUC, F1, Precision, Recall.
    • Artifacts: feature importance plots, confusion matrix, ROC curves.
    • Model versions: promoted from staging β†’ production if thresholds met (AUC β‰₯ 0.75).

MLflow UI (local): http://localhost:5000
MLflow UI (cloud): <mlflow_url> from Terraform outputs

πŸ“Έ Example: Vertex AI Training Logs

Vertex AI Training Logs


πŸ”Ή Batch Predictions

  • Generated daily by batch_prediction_dag.
  • Predictions written to:
gs://<bucket>/predictions/predictions.csv
  • A marker file tracks latest prediction path:
airflow/artifacts/latest_prediction.json
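A simplified stand-in for batch_predict.py: score rows with a stub model and build the marker payload. The real DAG loads the aliased MLflow model and writes to GCS; the stub rule and column names here are purely illustrative.

```python
import csv
import io
import json

def stub_score(row):
    # placeholder for model.predict_proba — NOT the real model
    return 1 if float(row["dti"]) > 30 else 0

rows = list(csv.DictReader(io.StringIO("id,dti\n1,18.2\n2,31.5\n")))
preds = [{"id": r["id"], "prediction": stub_score(r)} for r in rows]

# Marker payload pointing at the latest output, as in latest_prediction.json.
marker = json.dumps({"latest": "gs://<bucket>/predictions/predictions.csv"})
print([p["prediction"] for p in preds])  # [0, 1]
```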

πŸ”Ή Monitoring (Evidently AI)

  • Drift reports generated daily by monitoring_dag.
  • Compares training data vs latest predictions.
  • Outputs both JSON + HTML reports:
gs://<bucket>/reports/monitoring_report_<timestamp>.json
gs://<bucket>/reports/monitoring_report_<timestamp>.html
  • Example drift report metrics:

    • Data Drift β†’ % of features with distribution shift.
    • Target Drift β†’ Stability of loan default predictions over time.

πŸ”Ή Monitoring Feedback Loop

```mermaid
sequenceDiagram
    participant Train as Training Data
    participant Batch as Batch Predictions
    participant Evidently as Evidently Reports
    participant Airflow as Airflow DAG
    participant GCS as GCS Bucket

    Train->>Evidently: Provide reference dataset
    Batch->>Evidently: Provide current batch
    Evidently->>Evidently: Compute data & target drift
    Evidently-->>GCS: Save JSON + HTML reports
    Evidently-->>Airflow: Write drift_status.json
    Airflow-->>Airflow: If drift detected → Trigger retraining
```


11. Makefile Reference

The project uses a Makefile to automate common developer and CI/CD tasks. Below are the most important targets grouped by purpose.


πŸ”Ή Setup & Development

  • install β†’ Install project and dev dependencies.
  • lint β†’ Run flake8 for linting.
  • format β†’ Auto-format code with black + isort.
  • check-format β†’ Verify formatting without changing files.
  • test β†’ Run unit tests with pytest.

πŸ”Ή Stack Management (Airflow + MLflow)

  • start β†’ Start the full Airflow + MLflow + Serve stack.
  • stop β†’ Stop all services (containers paused).
  • down β†’ Stop and remove containers + networks.
  • start-core β†’ Start only core services (Postgres, Airflow webserver + scheduler, MLflow).
  • stop-core β†’ Stop core services.
  • stop-hard β†’ Full clean-up (containers, volumes, logs, artifacts).

πŸ”Ή Model Serving

  • start-serve β†’ Start the model serving API.
  • stop-serve β†’ Stop serving API.
  • restart-serve β†’ Restart serving API.

πŸ”Ή CI/CD & Tests

  • integration-tests β†’ Run integration tests inside Airflow/MLflow containers.
  • ci-local β†’ Run local CI/CD simulation (scripts/test_ci_local.sh).

πŸ”Ή Terraform (GCP Infra)

  • terraform-init β†’ Initialize Terraform.
  • terraform-plan β†’ Preview infrastructure changes.
  • terraform-apply β†’ Apply Terraform plan (create/update infra).
  • terraform-destroy β†’ Tear down all provisioned resources.

πŸ”Ή Reset & Debugging

  • reset β†’ Reset stack (rebuild using cached layers).
  • fresh-reset β†’ Reset stack with no cache (force rebuild).
  • verify β†’ Verify health of Airflow + MLflow + Serve.
  • troubleshoot β†’ Run diagnostics (logs, health checks, variables).

πŸ”Ή Build & Deployment

  • build-trainer β†’ Build Vertex AI trainer Docker image.
  • push-trainer β†’ Push trainer image to Artifact Registry.
  • trainer β†’ Build + push trainer image (shortcut).
  • deploy-trainer β†’ Build + push trainer, set image URI in Airflow.
  • build-mlflow β†’ Build custom MLflow Docker image.
  • bootstrap-all β†’ Full rebuild: MLflow + trainer + Airflow stack.


12. Future Improvements

While the current pipeline is production-ready, there are several enhancements that can make it more robust, scalable, and enterprise-grade:


πŸ”Ή Automation & Retraining

  • Automate full continuous retraining loop with Airflow:

    • Drift detection β†’ retrain β†’ evaluate β†’ promote β†’ redeploy.
  • Add canary deployment strategy for new models (A/B testing before full promotion).


πŸ”Ή Monitoring & Alerts

  • Integrate WhyLogs or Prometheus + Grafana for richer monitoring.
  • Add real-time drift detection on streaming data, not just batch.
  • Expand alerting integrations (Slack, email, PagerDuty) beyond the current Airflow notifications.

πŸ”Ή Cloud-Native Enhancements

  • Expand Vertex AI Pipelines integration to orchestrate end-to-end workflows natively on GCP.
  • Deploy serving API with GKE (Kubernetes) or Vertex AI Prediction for scale-out serving.
  • Store logs/metrics in BigQuery for auditing and analysis.

πŸ”Ή Data & Feature Management

  • Integrate with a Feature Store (e.g., Feast) for consistent offline/online feature parity.
  • Add data versioning (DVC or Delta Lake) for reproducibility.

πŸ”Ή Testing & CI/CD

  • Expand test coverage with:

    • Load tests for prediction API.
    • Chaos tests for Airflow resilience.
    • End-to-end regression suites.
  • Enable scheduled nightly CI runs with smoke tests against staging environment.


πŸ”Ή Developer Experience

  • Add pre-commit hooks for lint/format checks before commits.
  • Publish Docker images (trainer, MLflow, monitor) to a public registry for faster onboarding.
  • Expand documentation in a /docs folder with DAG-specific diagrams and usage guides.


13. Security & Contributions

This project follows security best practices:

  • Secrets (GCP keys, .env) are not committed to git (enforced via .gitignore).
  • Service accounts are scoped with least privilege (roles/storage.admin, roles/aiplatform.admin, etc.).
  • Terraform provisions infra in a reproducible, auditable way.

For detailed contribution guidelines and security reporting, see: πŸ‘‰ SECURITY_AND_CONTRIBUTING.md



14. Troubleshooting

Common issues and quick fixes:

  • Airflow logs directory not writable

    make fix-perms
  • MLflow volume permissions issue

    make fix-mlflow-volume
  • Reset full stack (cached build)

    make reset
  • Reset full stack (no cache)

    make fresh-reset
  • Verify services are running

    make verify

For a comprehensive list of real-world error messages and fixes (with exact stack traces), see: πŸ‘‰ TROUBLESHOOTING.md



15. Acknowledgments

I would like to sincerely thank the following for their guidance, encouragement, and inspiration throughout the course of this project:

  • The DataTalks.Club mentors and peers β€” their instructions and feedback provided invaluable insights.
  • The broader Data Science and MLOps community β€” for sharing knowledge and best practices that shaped my approach.
  • Family and friends β€” for their unwavering support and patience during the many long hours dedicated to building and refining this project.