MLOps End-to-End Examples on GCP

Production-ready example ML systems that demonstrate complete MLOps pipelines on Google Cloud Platform with Kubernetes (GKE), covering a computer vision use case and a tabular ML use case.

🚀 Features

Computer Vision Pipeline (YOLOv8)

High-Performance Inference: NVIDIA Triton with ONNX-optimized YOLOv8 on CPU
Auto-Scaling Infrastructure: GKE cluster provisioned via Terraform
Monitoring: Prometheus + Grafana dashboards
Data Drift Detection: Evidently AI integration for vision models

Tabular ML Pipeline (Iris Classification)

Complete ML Lifecycle: Data ingestion → Training → Deployment → Monitoring → Retraining
Workflow Orchestration: Apache Airflow for data pipelines and retraining automation
Experiment Tracking: MLFlow with PostgreSQL backend and GCS artifact storage
Model Serving: FastAPI service with automatic model loading from MLFlow registry
Drift Monitoring: Evidently AI for data and concept drift detection
Automated Retraining: Triggered by drift thresholds with model comparison
Event Streaming: Apache Kafka for real-time event processing and email notifications
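
As a rough illustration of the "Automated Retraining" item, the sketch below shows one way a drift threshold and a model comparison could be combined. It is not taken from this repository: `ModelMetrics`, `should_retrain`, `pick_production`, the 0.3 threshold, and the use of accuracy as the comparison metric are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ModelMetrics:
    """Evaluation result for one model version (hypothetical structure)."""
    version: str
    accuracy: float


def should_retrain(drift_share: float, threshold: float = 0.3) -> bool:
    """Trigger retraining when the share of drifted features reaches the threshold.

    The default threshold is an assumption, not a value from this project.
    """
    return drift_share >= threshold


def pick_production(current: ModelMetrics, candidate: ModelMetrics,
                    min_gain: float = 0.0) -> ModelMetrics:
    """Promote the candidate only if it beats the current model by more than min_gain."""
    if candidate.accuracy - current.accuracy > min_gain:
        return candidate
    return current


if __name__ == "__main__":
    current = ModelMetrics(version="1", accuracy=0.93)
    candidate = ModelMetrics(version="2", accuracy=0.96)
    if should_retrain(drift_share=0.5):
        print("production ->", pick_production(current, candidate).version)
```

Keeping the current model on a tie avoids replacing a production model with one that is no better on the chosen metric.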

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ML Pipeline Architecture                      │
└─────────────────────────────────────────────────────────────────┘

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Airflow    │─────▶│    MLFlow    │─────▶│  ML Service  │
│ Orchestration│      │Model Registry│      │   (FastAPI)  │
└──────┬───────┘      └──────┬───────┘      └──────┬───────┘
       │                     │                      │
       │ Triggers            │ Loads Model          │ Serves
       ▼                     ▼                      ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Training    │      │ Production   │      │ Predictions  │
│   Pipeline   │      │  Model v1    │      │  + Drift     │
└──────────────┘      └──────────────┘      └──────┬───────┘
                                                    │
       ┌────────────────────────────────────────────┤
       │                                            │
       ▼                                            ▼
┌──────────────┐                            ┌──────────────┐
│  Evidently   │◀───── Drift Reports ───────│    Kafka     │
│     UI       │                            │    Events    │
└──────────────┘                            └──────────────┘
       │                                            │
       └──────────────────┬─────────────────────────┘
                          ▼
                   ┌──────────────┐
                   │   Grafana    │
                   │  Monitoring  │
                   └──────────────┘

⚡ Quick Start: Activate ML Pipeline

Prerequisites: GKE cluster deployed (see Quick Setup below)

Train & Deploy Your First Model (2 minutes)

# Train model and activate ML Service
./scripts/train-iris-model.sh

# Test predictions
kubectl port-forward svc/ml-service 8082:8080 &
curl -X POST http://localhost:8082/predict \
  -H "Content-Type: application/json" \
  -d '{"sepal_length":5.1,"sepal_width":3.5,"petal_length":1.4,"petal_width":0.2}'

Result: Model trained → Registered in MLFlow → Deployed to Production → Serving predictions

View in MLFlow UI: kubectl port-forward svc/mlflow 5000:5000 → http://localhost:5000
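
The curl call above can also be made from Python. This is a minimal sketch using only the standard library, assuming the port-forward to localhost:8082 is still running. The helper names `build_predict_request` and `predict` are illustrative, and the response schema is deliberately not assumed.

```python
import json
import urllib.request
from typing import Any

# Assumed local URL: matches the port-forward above (8082 -> ml-service:8080).
PREDICT_URL = "http://localhost:8082/predict"


def build_predict_request(features: dict, url: str = PREDICT_URL) -> urllib.request.Request:
    """Build the same JSON POST request that the curl command sends."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features: dict, url: str = PREDICT_URL, timeout: float = 10.0) -> Any:
    """Send the request and return the parsed JSON response, whatever its shape."""
    request = build_predict_request(features, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the port-forward active, `predict({"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2})` sends the same payload as the curl example.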

Advanced Options:

Local training: ml-pipeline/README.md
Airflow orchestration: PIPELINE_ACTIVATION_GUIDE.md
Troubleshooting: TROUBLESHOOTING.md

📊 Dashboards & Monitoring

Quick Access: make port-forward-all then visit the URLs below. See DASHBOARD_QUICK_START.md for details.

| Dashboard | Port | Login | Purpose |
|---|---|---|---|
| Grafana | 3000 | admin / `make get-grafana-password` | Operations monitoring, system metrics, alerts |
| MLFlow | 5000 | - | Experiment tracking, model registry, artifacts |
| Airflow | 8081 | admin / admin | DAG orchestration, pipeline monitoring |
| Evidently | 8001 | - | Drift detection, data quality reports |
| Prometheus | 9090 | - | Raw metrics, PromQL queries |

Access: kubectl port-forward svc/<service-name> <port>:<target-port> or use make port-forward-all

📁 Project Structure

├── backend/                      # FastAPI app (YOLOv8 pre/post-processing)
├── ml-service/                   # FastAPI app (Iris classification serving)
├── ml-pipeline/                  # Complete tabular ML pipeline
│   ├── data/                    # Data generation and GCS utilities
│   ├── training/                # Training scripts with MLFlow
│   ├── deployment/              # Model promotion and deployment
│   └── monitoring/              # Drift analysis and monitoring
├── infrastructure/
│   ├── gcp/                     # Terraform (GKE, GCS, Artifact Registry)
│   └── kubernetes/              # Helm charts
│       ├── backend/            # YOLOv8 backend
│       ├── triton/             # Triton Inference Server
│       ├── ml-service/         # Iris ML service
│       ├── mlflow/             # MLFlow with PostgreSQL
│       ├── airflow/            # Airflow with DAGs
│       ├── kafka/              # Kafka + Zookeeper
│       ├── event-consumer/     # Email notification consumer
│       ├── monitoring/         # Prometheus + Grafana
│       └── evidently-ui/       # Evidently UI
├── event-consumer/              # Kafka consumer for email notifications
├── models/                      # YOLO model and conversion scripts
├── tests/
│   ├── performance/            # Load testing with Locust
│   ├── data_drift/             # YOLOv8 drift detection
│   └── ml-service/             # ML service integration tests
└── .github/workflows/          # CI/CD automation

🛠️ Quick Setup

This project includes two complete MLOps pipelines:

Computer Vision Pipeline (YOLOv8 with Triton) - See sections below
Tabular ML Pipeline (Iris with MLFlow + Airflow) - See ml-pipeline/README.md

Prerequisites

GCP account with billing enabled
gcloud, terraform, kubectl, helm installed
Python 3.10+
Docker

1. Configure GCP

# Enable APIs
gcloud services enable container.googleapis.com \
    artifactregistry.googleapis.com \
    storage-component.googleapis.com

# Create service account with roles:
# - Kubernetes Engine Admin
# - Storage Admin
# - Artifact Registry Admin
# Download JSON key as service-account-key.json

2. Set Environment Variables

cp env.example env
# Edit env with your GCP project ID, region, bucket names:
# - gcs_bucket_name: For YOLO models
# - gcs_ml_data_bucket_name: For tabular ML data
source env

3. Provision Infrastructure

cd infrastructure/gcp
terraform init
terraform apply
cd ../..

# Configure kubectl
gcloud container clusters get-credentials $GKE_CLUSTER_NAME \
    --zone=$GCP_ZONE --project=$GCP_PROJECT_ID

4. Prepare Model

cd models
pip install -r requirements.txt

# Place your model.pt in models/yolov8n/1/
# Convert to ONNX
python convert_to_onnx.py

# Upload to GCS
gsutil -m cp -r yolov8n gs://$GCS_BUCKET_NAME/
cd ..

5. Deploy Services

A. Computer Vision Services (YOLOv8)

# Create GCS secret
kubectl create secret generic gcs-sa-key \
    --from-file=gcp-credentials.json=service-account-key.json

# Deploy Triton
helm upgrade --install triton ./infrastructure/kubernetes/triton \
    --set modelRepository.gcsBucket=$GCS_BUCKET_NAME \
    --set gcsAuthSecret=gcs-sa-key

# Build & deploy backend
export IMAGE_TAG="latest"
export IMAGE_NAME="$GCP_REGION-docker.pkg.dev/$GCP_PROJECT_ID/$ARTIFACT_REGISTRY_REPO/$BACKEND_IMAGE_NAME:$IMAGE_TAG"

gcloud auth configure-docker $GCP_REGION-docker.pkg.dev
docker build -t $IMAGE_NAME ./backend
docker push $IMAGE_NAME

helm upgrade --install backend ./infrastructure/kubernetes/backend \
    --set image.repository=$GCP_REGION-docker.pkg.dev/$GCP_PROJECT_ID/$ARTIFACT_REGISTRY_REPO/$BACKEND_IMAGE_NAME \
    --set image.tag=$IMAGE_TAG \
    --set tritonReleaseName=triton

B. Tabular ML Services (Iris + MLFlow + Airflow)

# Deploy MLFlow
helm upgrade --install mlflow ./infrastructure/kubernetes/mlflow \
    --set gcs.bucketName=$GCS_ML_DATA_BUCKET_NAME

# Deploy Airflow
helm upgrade --install airflow ./infrastructure/kubernetes/airflow \
    --set gcs.bucketName=$GCS_ML_DATA_BUCKET_NAME \
    --set mlflow.trackingUri=http://mlflow:5000

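# Build & deploy ML service
# IMAGE_TAG is exported in section A; set it here if you skipped that section
export IMAGE_TAG="${IMAGE_TAG:-latest}"
export ML_SERVICE_IMAGE="$GCP_REGION-docker.pkg.dev/$GCP_PROJECT_ID/$ARTIFACT_REGISTRY_REPO/ml-service:$IMAGE_TAG"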

docker build -t $ML_SERVICE_IMAGE ./ml-service
docker push $ML_SERVICE_IMAGE

helm upgrade --install ml-service ./infrastructure/kubernetes/ml-service \
    --set image.repository=$GCP_REGION-docker.pkg.dev/$GCP_PROJECT_ID/$ARTIFACT_REGISTRY_REPO/ml-service \
    --set image.tag=$IMAGE_TAG \
    --set mlflow.trackingUri=http://mlflow:5000

C. Monitoring & Observability

# Deploy Prometheus + Grafana
cd infrastructure/kubernetes/monitoring
helm dependency build
cd ../../..
helm upgrade --install monitoring ./infrastructure/kubernetes/monitoring \
    --namespace monitoring --create-namespace

# Deploy Evidently UI
helm upgrade --install evidently-ui ./infrastructure/kubernetes/evidently-ui \
    --set ui.demoProjects=false \
    --set persistence.enabled=true

🧪 Testing

Access Services

# YOLOv8 Backend API (port-forward runs in the background so curl can follow)
kubectl port-forward svc/backend 8080:8080 &
curl -X POST -F "file=@test.jpg" http://localhost:8080/invocations

# ML Service (Iris)
kubectl port-forward svc/ml-service 8082:8080 &
curl http://localhost:8082/docs  # OpenAPI docs

# MLFlow UI
kubectl port-forward svc/mlflow 5000:5000
# Access at http://localhost:5000

# Airflow UI (default: admin/admin)
kubectl port-forward svc/airflow-webserver 8081:8080
# Access at http://localhost:8081

# Grafana (default: admin/prom-operator)
kubectl get secret --namespace monitoring monitoring-grafana \
    -o jsonpath="{.data.admin-password}" | base64 --decode
kubectl port-forward --namespace monitoring svc/monitoring-grafana 3000:80

# Evidently UI
kubectl port-forward svc/evidently-ui 8001:8000

Performance Testing

cd tests/performance
pip install -r requirements.txt

# Set test image (optional)
export TEST_IMAGE_PATH=/path/to/test.jpg

# Run load test (assumes the backend port-forward on localhost:8080 from Access Services)
locust -f test_load.py --host http://localhost:8080 \
    --users 5 --spawn-rate 2 --run-time 30s --headless

Data Drift Testing

YOLOv8 Drift

cd tests/data_drift
pip install -r requirements.txt

# Set test image (optional)
export TEST_IMAGE_PATH=/path/to/test.jpg
# Backend URL; matches kubectl port-forward svc/backend 8080:8080
export BACKEND_URL=http://localhost:8080

python test_yolo_drift_real.py

Iris ML Service Tests

cd tests/ml-service
pip install -r requirements.txt

# Port-forward ML service first (in the background, or in a separate terminal)
kubectl port-forward svc/ml-service 8082:8080 &

# Run tests
pytest test_predictions.py -v
pytest test_drift_detection.py -v

🔄 CI/CD

GitHub Actions automatically builds and deploys on push to main. Configure these secrets:

GCP_PROJECT_ID, GCP_REGION, GCP_ZONE
GKE_CLUSTER_NAME
GCS_BUCKET_NAME (for YOLOv8 models)
GCS_ML_DATA_BUCKET_NAME (for tabular ML data)
ARTIFACT_REGISTRY_REPO, BACKEND_IMAGE_NAME
GCP_KEY_FILE (entire service account JSON)
SMTP_USERNAME, SMTP_PASSWORD, EMAIL_TO, EMAIL_FROM (for Kafka email notifications)

See .github/SETUP_SECRETS.md for detailed setup instructions.

Deployed Services

The CI/CD pipeline automatically deploys:

Triton Inference Server (YOLOv8)
Backend (YOLOv8 FastAPI service)
MLFlow (Experiment tracking)
Airflow (Workflow orchestration)
ML Service (Iris classification service)
Kafka (Event streaming)
Event Consumer (Email notifications)
Evidently UI (Drift visualization)
Monitoring (Prometheus + Grafana)

📊 Monitoring & Observability

Prometheus Metrics: Backend exposes /metrics with:

Request count, duration, errors
Inference latency
Triton connection errors
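
To give a feel for what a scrape of /metrics returns, the sketch below renders counters in the Prometheus text exposition format using only the standard library. It is an illustration, not the backend's implementation (which may well use a Prometheus client library); `MetricsRegistry` and the metric names are hypothetical, and only counters are covered (durations and latencies would normally be histograms).

```python
from collections import Counter
from typing import Dict


class MetricsRegistry:
    """Tiny in-memory counter registry (illustrative; not the backend's code)."""

    def __init__(self) -> None:
        self._counters: Counter = Counter()
        self._help: Dict[str, str] = {}

    def describe(self, name: str, help_text: str) -> None:
        """Register a counter with its HELP text; it starts at 0."""
        self._help[name] = help_text

    def inc(self, name: str, amount: int = 1) -> None:
        """Increase a counter; counters only ever go up."""
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self._counters[name] += amount

    def render(self) -> str:
        """Render all registered counters in Prometheus text exposition format."""
        lines = []
        for name in sorted(self._help):
            lines.append(f"# HELP {name} {self._help[name]}")
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {self._counters[name]}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    registry = MetricsRegistry()
    # Hypothetical metric names, chosen to mirror the list above.
    registry.describe("backend_requests_total", "Total HTTP requests")
    registry.describe("backend_triton_connection_errors_total", "Triton connection errors")
    registry.inc("backend_requests_total")
    print(registry.render(), end="")
```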

Grafana Dashboards: Pre-configured dashboard in infrastructure/kubernetes/monitoring/mlops-dashboard.json

Evidently Data Drift:

Backend auto-collects prediction features
Reports generated every 50 predictions by default (configurable via EVIDENTLY_BATCH_SIZE)
View in Evidently UI
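
The collect-then-report behaviour described above can be sketched as a bounded buffer that fires a callback every N predictions. This is an assumption about the general shape, not the backend's actual code: `DriftBuffer` and `on_batch` are hypothetical names, and passing the whole retained window (rather than only the latest batch) to the report step is a design choice made for the example.

```python
from collections import deque
from typing import Callable, Deque, List


class DriftBuffer:
    """Keeps recent prediction features and fires a callback every batch_size additions."""

    def __init__(self, on_batch: Callable[[List[dict]], None],
                 batch_size: int = 50, max_samples: int = 1000) -> None:
        if batch_size <= 0 or max_samples <= 0:
            raise ValueError("batch_size and max_samples must be positive")
        # deque(maxlen=...) silently drops the oldest sample once the buffer is full.
        self._samples: Deque[dict] = deque(maxlen=max_samples)
        self._batch_size = batch_size
        self._on_batch = on_batch
        self._seen = 0

    def add(self, features: dict) -> None:
        """Store one prediction's features and report when a batch boundary is hit."""
        self._samples.append(features)
        self._seen += 1
        if self._seen % self._batch_size == 0:
            # A real implementation would build a drift report here.
            self._on_batch(list(self._samples))


if __name__ == "__main__":
    buffer = DriftBuffer(lambda window: print("report on", len(window), "samples"),
                         batch_size=50, max_samples=1000)
    for i in range(120):
        buffer.add({"confidence": i / 120})
```

The two defaults mirror the documented defaults of EVIDENTLY_BATCH_SIZE and EVIDENTLY_MAX_SAMPLES.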

📨 Event Streaming with Kafka

Kafka powers real-time events for predictions, drift alerts, DAG status, and training notifications.
To deploy, test, and extend the Kafka stack, follow:

Full architecture: infrastructure/kubernetes/kafka/KAFKA_ARCHITECTURE.md
Quick start & testing: infrastructure/kubernetes/kafka/QUICK_START.md
Email / consumer configuration: event-consumer/ chart README
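
For orientation, an event published to Kafka is typically a small serialized envelope. The sketch below builds one as JSON bytes with the standard library only. The field names, the event type strings, and `make_event` are hypothetical; the actual schema is defined in the Kafka documents listed above, and a real producer client would send the returned bytes as the message value.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical event type names mirroring the four event kinds mentioned above.
EVENT_TYPES = {"prediction", "drift_alert", "dag_status", "training"}


def make_event(event_type: str, payload: dict) -> bytes:
    """Wrap a payload in an envelope and serialize it as UTF-8 JSON bytes."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    event = {
        "id": str(uuid.uuid4()),                              # unique per event
        "type": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC, ISO 8601
        "payload": payload,
    }
    return json.dumps(event).encode("utf-8")


if __name__ == "__main__":
    print(make_event("drift_alert", {"drift_share": 0.5}).decode("utf-8"))
```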

🔧 Configuration

Environment Variables (backend)

TRITON_URL: Triton server endpoint (default: triton:8000)
EVIDENTLY_WORKSPACE: Workspace path (default: /workspace)
EVIDENTLY_PROJECT_ID: Project ID for drift reports
EVIDENTLY_BATCH_SIZE: Report frequency (default: 50)
EVIDENTLY_MAX_SAMPLES: Max samples in memory (default: 1000)
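
A minimal sketch of how a service could read these variables with the documented defaults. `BackendConfig` and `load_config` are hypothetical names, not the backend's actual code; EVIDENTLY_PROJECT_ID has no documented default, so it is treated as optional here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BackendConfig:
    triton_url: str
    evidently_workspace: str
    evidently_project_id: Optional[str]
    evidently_batch_size: int
    evidently_max_samples: int


def load_config(env: Optional[Mapping[str, str]] = None) -> BackendConfig:
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return BackendConfig(
        triton_url=env.get("TRITON_URL", "triton:8000"),
        evidently_workspace=env.get("EVIDENTLY_WORKSPACE", "/workspace"),
        evidently_project_id=env.get("EVIDENTLY_PROJECT_ID"),  # no default documented
        evidently_batch_size=int(env.get("EVIDENTLY_BATCH_SIZE", "50")),
        evidently_max_samples=int(env.get("EVIDENTLY_MAX_SAMPLES", "1000")),
    )


if __name__ == "__main__":
    print(load_config())
```

Accepting an explicit mapping makes the loader easy to exercise in tests without touching the process environment.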

Model Configuration

Edit models/yolov8n/config.pbtxt to customize:

Instance count (parallelism)
Max batch size
Dynamic batching settings

🐛 Troubleshooting

See TROUBLESHOOTING.md for a comprehensive debugging guide.

Common Issues:

Service not running: make check-dashboards → check pod logs
Port conflicts: make stop-port-forward then restart
Auth issues: Re-run gcloud container clusters get-credentials (see step 3, Provision Infrastructure)
Model not loading: Verify MLFlow connectivity and model registry

📚 Additional Documentation

ml-pipeline/README.md - Complete ML pipeline guide
DASHBOARD_QUICK_START.md - Dashboard access & usage
PIPELINE_ACTIVATION_GUIDE.md - Airflow setup
infrastructure/kubernetes/kafka/KAFKA_ARCHITECTURE.md - Event streaming

📝 Notes

CPU-optimized: Uses ONNX Runtime on CPU nodes (e2-standard-4)
Two GCS Buckets: Separate buckets for YOLO models and ML data
Terraform state: For production, store state in a remote backend (for example a GCS bucket, or S3) rather than on local disk
Test images: Set TEST_IMAGE_PATH env var or tests will use dummy images
Backend Port: Backend service runs on port 8080 internally (use kubectl port-forward svc/backend 8080:8080)
ML Service requires model: Train and register an Iris model in MLFlow first (see ml-pipeline/README.md)
GPU support: Not configured by default; see models/yolov8n/config.pbtxt and the Terraform configuration to enable it
Airflow DAGs: Located in infrastructure/kubernetes/airflow/dags/
Model Versioning: All models tracked in MLFlow registry with staging/production stages

📄 License

MIT
