Xaik89/mlops-gcp

MLOps End-to-End Examples on GCP

Production-ready ML systems demonstrating complete MLOps pipelines on Google Cloud Platform using Kubernetes, featuring both computer vision and tabular ML use cases.

🚀 Features

Computer Vision Pipeline (YOLOv8)

  • High-Performance Inference: NVIDIA Triton with ONNX-optimized YOLOv8 on CPU
  • Auto-Scaling Infrastructure: GKE cluster provisioned via Terraform
  • Monitoring: Prometheus + Grafana dashboards
  • Data Drift Detection: Evidently AI integration for vision models

Tabular ML Pipeline (Iris Classification)

  • Complete ML Lifecycle: Data ingestion → Training → Deployment → Monitoring → Retraining
  • Workflow Orchestration: Apache Airflow for data pipelines and retraining automation
  • Experiment Tracking: MLFlow with PostgreSQL backend and GCS artifact storage
  • Model Serving: FastAPI service with automatic model loading from MLFlow registry
  • Drift Monitoring: Evidently AI for data and concept drift detection
  • Automated Retraining: Triggered by drift thresholds with model comparison
  • Event Streaming: Apache Kafka for real-time event processing and email notifications
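
The drift-triggered retraining and model comparison listed above could look roughly like the sketch below. It is an illustration only, not the repository's code: the threshold value, the accuracy metric, and the names should_retrain and pick_production_model are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical threshold: share of drifted features that triggers retraining.
DRIFT_SHARE_THRESHOLD = 0.3


@dataclass
class ModelMetrics:
    name: str
    accuracy: float


def should_retrain(drifted_features: int, total_features: int,
                   threshold: float = DRIFT_SHARE_THRESHOLD) -> bool:
    """Return True when the share of drifted features reaches the threshold."""
    if total_features <= 0:
        raise ValueError("total_features must be positive")
    return drifted_features / total_features >= threshold


def pick_production_model(current: ModelMetrics, candidate: ModelMetrics,
                          min_gain: float = 0.0) -> ModelMetrics:
    """Promote the candidate only if it beats the current model by more than min_gain."""
    if candidate.accuracy - current.accuracy > min_gain:
        return candidate
    return current
```

With four Iris features, two drifted features give a share of 0.5, which would trigger retraining under the assumed 0.3 threshold. The retrained model would then replace the production model only if its accuracy is strictly higher.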

πŸ—οΈ Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    ML Pipeline Architecture                      │
└──────────────────────────────────────────────────────────────────┘

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Airflow    │─────▶│    MLFlow    │─────▶│  ML Service  │
│ Orchestration│      │Model Registry│      │   (FastAPI)  │
└──────┬───────┘      └──────┬───────┘      └──────┬───────┘
       │                     │                     │
       │ Triggers            │ Loads Model         │ Serves
       ▼                     ▼                     ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Training    │      │ Production   │      │ Predictions  │
│   Pipeline   │      │  Model v1    │      │  + Drift     │
└──────────────┘      └──────────────┘      └──────┬───────┘
                                                   │
       ┌───────────────────────────────────────────┤
       │                                           │
       ▼                                           ▼
┌──────────────┐                            ┌──────────────┐
│  Evidently   │◀───── Drift Reports ───────│    Kafka     │
│     UI       │                            │    Events    │
└──────────────┘                            └──────────────┘
       │                                           │
       └──────────────────┬────────────────────────┘
                          ▼
                   ┌──────────────┐
                   │   Grafana    │
                   │  Monitoring  │
                   └──────────────┘

⚡ Quick Start: Activate ML Pipeline

Prerequisites: GKE cluster deployed (see Setup below)

Train & Deploy Your First Model (2 minutes)

# Train model and activate ML Service
./scripts/train-iris-model.sh

# Test predictions (the port-forward runs in the background; give it a moment to start)
kubectl port-forward svc/ml-service 8082:8080 &
sleep 2
curl -X POST http://localhost:8082/predict \
  -H "Content-Type: application/json" \
  -d '{"sepal_length":5.1,"sepal_width":3.5,"petal_length":1.4,"petal_width":0.2}'
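
For callers that prefer Python over curl, the same request can be sketched with the standard library alone. This assumes the port-forward above is active on localhost:8082. The helper names build_predict_request and predict are hypothetical, and the response schema is not documented here, so the parsed JSON is returned as-is.

```python
import json
import urllib.request
from typing import Any

# Assumed base URL: matches the port-forward shown above (8082 -> 8080).
ML_SERVICE_URL = "http://localhost:8082"


def build_predict_request(features: dict,
                          base_url: str = ML_SERVICE_URL) -> urllib.request.Request:
    """Build a POST request for /predict with a JSON body."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features: dict, timeout: float = 5.0) -> Any:
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(features)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires the running service):
# predict({"sepal_length": 5.1, "sepal_width": 3.5,
#          "petal_length": 1.4, "petal_width": 0.2})
```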

Result: Model trained → Registered in MLFlow → Deployed to Production → Serving predictions

View in MLFlow UI: kubectl port-forward svc/mlflow 5000:5000 → http://localhost:5000

Advanced Options:

📊 Dashboards & Monitoring

Quick Access: run make port-forward-all, then open http://localhost:<port> for each dashboard listed below. See DASHBOARD_QUICK_START.md for details.

Dashboard  | Port | Login                             | Purpose
-----------|------|-----------------------------------|-----------------------------------------------
Grafana    | 3000 | admin / make get-grafana-password | Operations monitoring, system metrics, alerts
MLFlow     | 5000 | -                                 | Experiment tracking, model registry, artifacts
Airflow    | 8081 | admin / admin                     | DAG orchestration, pipeline monitoring
Evidently  | 8001 | -                                 | Drift detection, data quality reports
Prometheus | 9090 | -                                 | Raw metrics, PromQL queries

Access: kubectl port-forward svc/<service-name> <port>:<target-port> or use make port-forward-all


📁 Project Structure

├── backend/                      # FastAPI app (YOLOv8 pre/post-processing)
├── ml-service/                   # FastAPI app (Iris classification serving)
├── ml-pipeline/                  # Complete tabular ML pipeline
│   ├── data/                     # Data generation and GCS utilities
│   ├── training/                 # Training scripts with MLFlow
│   ├── deployment/               # Model promotion and deployment
│   └── monitoring/               # Drift analysis and monitoring
├── infrastructure/
│   ├── gcp/                      # Terraform (GKE, GCS, Artifact Registry)
│   └── kubernetes/               # Helm charts
│       ├── backend/              # YOLOv8 backend
│       ├── triton/               # Triton Inference Server
│       ├── ml-service/           # Iris ML service
│       ├── mlflow/               # MLFlow with PostgreSQL
│       ├── airflow/              # Airflow with DAGs
│       ├── kafka/                # Kafka + Zookeeper
│       ├── event-consumer/       # Email notification consumer
│       ├── monitoring/           # Prometheus + Grafana
│       └── evidently-ui/         # Evidently UI
├── event-consumer/               # Kafka consumer for email notifications
├── models/                       # YOLO model and conversion scripts
├── tests/
│   ├── performance/              # Load testing with Locust
│   ├── data_drift/               # YOLOv8 drift detection
│   └── ml-service/               # ML service integration tests
└── .github/workflows/            # CI/CD automation

🛠️ Quick Setup

This project includes two complete MLOps pipelines:

  1. Computer Vision Pipeline (YOLOv8 with Triton) - See sections below
  2. Tabular ML Pipeline (Iris with MLFlow + Airflow) - See ml-pipeline/README.md

Prerequisites

  • GCP account with billing enabled
  • gcloud, terraform, kubectl, helm installed
  • Python 3.10+
  • Docker

1. Configure GCP

# Enable APIs
gcloud services enable container.googleapis.com \
    artifactregistry.googleapis.com \
    storage-component.googleapis.com

# Create service account with roles:
# - Kubernetes Engine Admin
# - Storage Admin
# - Artifact Registry Admin
# Download JSON key as service-account-key.json

2. Set Environment Variables

cp env.example env
# Edit env with your GCP project ID, region, bucket names:
# - gcs_bucket_name: For YOLO models
# - gcs_ml_data_bucket_name: For tabular ML data
source env

3. Provision Infrastructure

cd infrastructure/gcp
terraform init
terraform apply
cd ../..

# Configure kubectl
gcloud container clusters get-credentials $GKE_CLUSTER_NAME \
    --zone=$GCP_ZONE --project=$GCP_PROJECT_ID

4. Prepare Model

cd models
pip install -r requirements.txt

# Place your model.pt in models/yolov8n/1/
# Convert to ONNX
python convert_to_onnx.py

# Upload to GCS
gsutil -m cp -r yolov8n gs://$GCS_BUCKET_NAME/
cd ..

5. Deploy Services

A. Computer Vision Services (YOLOv8)

# Create GCS secret
kubectl create secret generic gcs-sa-key \
    --from-file=gcp-credentials.json=service-account-key.json

# Deploy Triton
helm upgrade --install triton ./infrastructure/kubernetes/triton \
    --set modelRepository.gcsBucket=$GCS_BUCKET_NAME \
    --set gcsAuthSecret=gcs-sa-key

# Build & deploy backend
export IMAGE_TAG="latest"
export IMAGE_NAME="$GCP_REGION-docker.pkg.dev/$GCP_PROJECT_ID/$ARTIFACT_REGISTRY_REPO/$BACKEND_IMAGE_NAME:$IMAGE_TAG"

gcloud auth configure-docker $GCP_REGION-docker.pkg.dev
docker build -t $IMAGE_NAME ./backend
docker push $IMAGE_NAME

helm upgrade --install backend ./infrastructure/kubernetes/backend \
    --set image.repository=$GCP_REGION-docker.pkg.dev/$GCP_PROJECT_ID/$ARTIFACT_REGISTRY_REPO/$BACKEND_IMAGE_NAME \
    --set image.tag=$IMAGE_TAG \
    --set tritonReleaseName=triton

B. Tabular ML Services (Iris + MLFlow + Airflow)

# Deploy MLFlow
helm upgrade --install mlflow ./infrastructure/kubernetes/mlflow \
    --set gcs.bucketName=$GCS_ML_DATA_BUCKET_NAME

# Deploy Airflow
helm upgrade --install airflow ./infrastructure/kubernetes/airflow \
    --set gcs.bucketName=$GCS_ML_DATA_BUCKET_NAME \
    --set mlflow.trackingUri=http://mlflow:5000

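# Build & deploy ML service (reuses IMAGE_TAG from step A; defaults to "latest" if unset)
export IMAGE_TAG="${IMAGE_TAG:-latest}"
export ML_SERVICE_IMAGE="$GCP_REGION-docker.pkg.dev/$GCP_PROJECT_ID/$ARTIFACT_REGISTRY_REPO/ml-service:$IMAGE_TAG"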

docker build -t $ML_SERVICE_IMAGE ./ml-service
docker push $ML_SERVICE_IMAGE

helm upgrade --install ml-service ./infrastructure/kubernetes/ml-service \
    --set image.repository=$GCP_REGION-docker.pkg.dev/$GCP_PROJECT_ID/$ARTIFACT_REGISTRY_REPO/ml-service \
    --set image.tag=$IMAGE_TAG \
    --set mlflow.trackingUri=http://mlflow:5000

C. Monitoring & Observability

# Deploy Prometheus + Grafana
cd infrastructure/kubernetes/monitoring
helm dependency build
cd ../../..
helm upgrade --install monitoring ./infrastructure/kubernetes/monitoring \
    --namespace monitoring --create-namespace

# Deploy Evidently UI
helm upgrade --install evidently-ui ./infrastructure/kubernetes/evidently-ui \
    --set ui.demoProjects=false \
    --set persistence.enabled=true

🧪 Testing

Access Services

# YOLOv8 Backend API
kubectl port-forward svc/backend 8080:8080
curl -X POST -F "file=@test.jpg" http://localhost:8080/invocations

# ML Service (Iris)
kubectl port-forward svc/ml-service 8082:8080
curl http://localhost:8082/docs  # OpenAPI docs

# MLFlow UI
kubectl port-forward svc/mlflow 5000:5000
# Access at http://localhost:5000

# Airflow UI (default: admin/admin)
kubectl port-forward svc/airflow-webserver 8081:8080
# Access at http://localhost:8081

# Grafana (default: admin/prom-operator)
kubectl get secret --namespace monitoring monitoring-grafana \
    -o jsonpath="{.data.admin-password}" | base64 --decode
kubectl port-forward --namespace monitoring svc/monitoring-grafana 3000:80

# Evidently UI
kubectl port-forward svc/evidently-ui 8001:8000

Performance Testing

cd tests/performance
pip install -r requirements.txt

# Set test image (optional)
export TEST_IMAGE_PATH=/path/to/test.jpg

# Run load test (assumes the backend is port-forwarded to localhost:8080, as shown under Access Services)
locust -f test_load.py --host http://localhost:8080 \
    --users 5 --spawn-rate 2 --run-time 30s --headless

Data Drift Testing

YOLOv8 Drift

cd tests/data_drift
pip install -r requirements.txt

# Set test image (optional)
export TEST_IMAGE_PATH=/path/to/test.jpg
export BACKEND_URL=http://localhost:8080

python test_yolo_drift_real.py

Iris ML Service Tests

cd tests/ml-service
pip install -r requirements.txt

# Port-forward ML service first (backgrounded here; alternatively run it in a separate terminal)
kubectl port-forward svc/ml-service 8082:8080 &

# Run tests
pytest test_predictions.py -v
pytest test_drift_detection.py -v

🔄 CI/CD

GitHub Actions automatically builds and deploys on push to main. Configure these secrets:

  • GCP_PROJECT_ID, GCP_REGION, GCP_ZONE
  • GKE_CLUSTER_NAME
  • GCS_BUCKET_NAME (for YOLOv8 models)
  • GCS_ML_DATA_BUCKET_NAME (for tabular ML data)
  • ARTIFACT_REGISTRY_REPO, BACKEND_IMAGE_NAME
  • GCP_KEY_FILE (entire service account JSON)
  • SMTP_USERNAME, SMTP_PASSWORD, EMAIL_TO, EMAIL_FROM (for Kafka email notifications)

See .github/SETUP_SECRETS.md for detailed setup instructions.

Deployed Services

The CI/CD pipeline automatically deploys:

  • Triton Inference Server (YOLOv8)
  • Backend (YOLOv8 FastAPI service)
  • MLFlow (Experiment tracking)
  • Airflow (Workflow orchestration)
  • ML Service (Iris classification service)
  • Kafka (Event streaming)
  • Event Consumer (Email notifications)
  • Evidently UI (Drift visualization)
  • Monitoring (Prometheus + Grafana)

📊 Monitoring & Observability

Prometheus Metrics: Backend exposes /metrics with:

  • Request count, duration, errors
  • Inference latency
  • Triton connection errors

Grafana Dashboards: Pre-configured dashboard in infrastructure/kubernetes/monitoring/mlops-dashboard.json

Evidently Data Drift:

  • Backend auto-collects prediction features
  • Reports generated every 50 predictions
  • View in Evidently UI
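
The collect-then-report behaviour described above can be pictured as a bounded buffer with a counter. The sketch below is an assumption about the general pattern, not the backend's actual implementation: DriftBuffer and on_report are hypothetical names, and on_report stands in for whatever builds and stores the Evidently report. The defaults mirror the documented EVIDENTLY_BATCH_SIZE (50) and EVIDENTLY_MAX_SAMPLES (1000).

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Features = Dict[str, float]


class DriftBuffer:
    """Keep recent prediction features and trigger a report every `batch_size` additions."""

    def __init__(self, on_report: Callable[[List[Features]], None],
                 batch_size: int = 50, max_samples: int = 1000) -> None:
        # A bounded deque drops the oldest rows once max_samples is reached.
        self._samples: Deque[Features] = deque(maxlen=max_samples)
        self._batch_size = batch_size
        self._since_last_report = 0
        self._on_report = on_report

    def add(self, features: Features) -> None:
        """Record one prediction's features; fire the callback on every full batch."""
        self._samples.append(features)
        self._since_last_report += 1
        if self._since_last_report >= self._batch_size:
            self._since_last_report = 0
            self._on_report(list(self._samples))
```

Bounding the buffer keeps memory use predictable under sustained traffic, at the cost of each report only seeing the most recent max_samples rows.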

📨 Event Streaming with Kafka

Kafka powers real-time events for predictions, drift alerts, DAG status, and training notifications.
To deploy, test, and extend the Kafka stack, follow:

  • Full architecture: infrastructure/kubernetes/kafka/KAFKA_ARCHITECTURE.md
  • Quick start & testing: infrastructure/kubernetes/kafka/QUICK_START.md
  • Email / consumer configuration: event-consumer/ chart README

🔧 Configuration

Environment Variables (backend)

  • TRITON_URL: Triton server endpoint (default: triton:8000)
  • EVIDENTLY_WORKSPACE: Workspace path (default: /workspace)
  • EVIDENTLY_PROJECT_ID: Project ID for drift reports
  • EVIDENTLY_BATCH_SIZE: Report frequency (default: 50)
  • EVIDENTLY_MAX_SAMPLES: Max samples in memory (default: 1000)
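
One way a service might read these variables with the documented defaults is sketched below. The BackendSettings class and load_settings function are hypothetical and only illustrate the defaults listed above. EVIDENTLY_PROJECT_ID has no documented default, so it is treated as optional here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BackendSettings:
    triton_url: str
    evidently_workspace: str
    evidently_project_id: Optional[str]
    evidently_batch_size: int
    evidently_max_samples: int


def load_settings(env: Mapping[str, str] = os.environ) -> BackendSettings:
    """Read backend settings from the environment, falling back to the documented defaults."""
    return BackendSettings(
        triton_url=env.get("TRITON_URL", "triton:8000"),
        evidently_workspace=env.get("EVIDENTLY_WORKSPACE", "/workspace"),
        evidently_project_id=env.get("EVIDENTLY_PROJECT_ID"),
        evidently_batch_size=int(env.get("EVIDENTLY_BATCH_SIZE", "50")),
        evidently_max_samples=int(env.get("EVIDENTLY_MAX_SAMPLES", "1000")),
    )
```

Passing the environment as a mapping makes the function easy to exercise in tests, for example load_settings({"EVIDENTLY_BATCH_SIZE": "10"}).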

Model Configuration

Edit models/yolov8n/config.pbtxt to customize:

  • Instance count (parallelism)
  • Max batch size
  • Dynamic batching settings

πŸ› Troubleshooting

See TROUBLESHOOTING.md for a comprehensive debugging guide.

Common Issues:

  • Service not running: make check-dashboards → check pod logs
  • Port conflicts: make stop-port-forward then restart
  • Auth issues: re-run gcloud container clusters get-credentials (see step 3 of Quick Setup)
  • Model not loading: Verify MLFlow connectivity and model registry

📚 Additional Documentation

📝 Notes

  • CPU-optimized: Uses ONNX Runtime on CPU nodes (e2-standard-4)
  • Two GCS Buckets: Separate buckets for YOLO models and ML data
  • Terraform state: Should be stored remotely (S3, GCS) for production
  • Test images: Set TEST_IMAGE_PATH env var or tests will use dummy images
  • Backend Port: Backend service runs on port 8080 internally (use kubectl port-forward svc/backend 8080:8080)
  • ML Service requires model: Train and register an Iris model in MLFlow first (see ml-pipeline/README.md)
  • GPU support: Not configured by default; see models/yolov8n/config.pbtxt and the Terraform configuration to enable it
  • Airflow DAGs: Located in infrastructure/kubernetes/airflow/dags/
  • Model Versioning: All models tracked in MLFlow registry with staging/production stages

📄 License

MIT
