Skip to content

HarshalSant/vigil

Repository files navigation

Vigil — The Missing Safety Layer for Production ML

CI PyPI Python License Tests

Vigil protects every stage of your ML pipeline with five integrated safety modules — from data ingestion to model serving. Works with any MLOps stack. Zero infrastructure changes required.

pip install vigil-ml

The Problem Vigil Solves

Most production ML failures are silent — no errors, no alerts, just wrong predictions. They all stem from five root causes:

Root Cause What Actually Breaks How Long Before You Notice
Features computed differently in training vs serving Accuracy silently degrades Weeks to months
Schema/stats changes between pipeline stages Downstream models get garbage input Next training run
Prediction distribution shifts without system errors Users get wrong recommendations Never (no alerts)
Upgrading Model A silently breaks Model B Cascading production failures At peak traffic
Stale cached features served at inference time Model quality collapses Hours to days

Vigil addresses all five — before they reach users.


Architecture

High-Level System Architecture

╔══════════════════════════════════════════════════════════════════════════════╗
║                           VIGIL  v0.1.0                                      ║
║               The Missing Safety Layer for Production ML                     ║
╠══════════════╦═══════════════╦══════════════╦════════════╦═══════════════════╣
║              ║               ║              ║            ║                   ║
║  SkewGuard   ║   Contracts   ║    Pulse     ║  DepGraph  ║    Freshness      ║
║  Module 1    ║   Module 2    ║   Module 3   ║  Module 4  ║    Module 5       ║
║              ║               ║              ║            ║                   ║
║  @feature    ║  DataContract ║ PulseMonitor ║  model_    ║ FreshnessTracker  ║
║  decorator   ║  @pipeline    ║  .wrap()     ║  registry  ║  .assert_fresh()  ║
║  + compiler  ║  .stage()     ║  PSI drift   ║  semver    ║  SLA enforcement  ║
║  + PSI check ║  + engine     ║  detection   ║  compat.   ║  + bulk_check()   ║
║              ║               ║              ║            ║                   ║
╠══════════════╩═══════════════╩══════════════╩════════════╩═══════════════════╣
║                           CORE ENGINE                                        ║
║                                                                              ║
║  ┌─────────────────┐  ┌────────────────┐  ┌──────────┐  ┌────────────────┐  ║
║  │    Registry      │  │   Event Bus    │  │  Config  │  │  Exceptions    │  ║
║  │  (SQLAlchemy)    │  │   (pub/sub)    │  │  (YAML)  │  │  (typed)       │  ║
║  │  SQLite/Postgres │  │  10k history   │  │ vigil.   │  │  VIGILError    │  ║
║  │  8 ORM models    │  │  wildcard subs │  │  yaml    │  │  + subtypes    │  ║
║  └─────────────────┘  └────────────────┘  └──────────┘  └────────────────┘  ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                         INTEGRATIONS LAYER                                   ║
║                                                                              ║
║   MLflow │ Airflow │ Prefect │ Kubeflow │ SageMaker │ ZenML │ Vertex AI      ║
║   Webhook (Slack/PagerDuty) │ CallbackAdapter (custom Python)                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                         SURFACE AREA                                         ║
║                                                                              ║
║   REST API (FastAPI)    CLI (Click)    Prometheus /metrics    Python SDK     ║
╚══════════════════════════════════════════════════════════════════════════════╝

Module Architecture: How All 5 Modules Interact

  Your ML Pipeline
  ────────────────────────────────────────────────────────────────────
  [Raw Data]
      │
      ▼
  ┌───────────────────────────────────┐
  │     Stage: Feature Engineering    │◄─── Module 2: Contracts
  │   @pipeline.stage(contract=...)   │     validates schema + stats
  └───────────────────────────────────┘     on every stage output
      │
      │  Features computed via:
      ▼
  ┌───────────────────────────────────┐
  │     Module 1: SkewGuard           │
  │   @feature(freshness_sla="5m")    │
  │                                   │
  │   Training path:                  │
  │   feature_fn → batch_code()       │─── writes training stats
  │                    ▼              │    to Registry
  │   [Training DataFrame]            │
  │                                   │
  │   Serving path:                   │
  │   feature_fn → serving_wrapper()  │─── writes serving stats
  │                    ▼              │    auto-checks PSI drift
  │   [Real-time value]               │
  └───────────────────────────────────┘
      │
      ▼
  ┌───────────────────────────────────┐
  │     Stage: Model Inference        │
  │                                   │
  │   Module 5: Freshness             │
  │   tracker.assert_fresh("feat")    │─── blocks stale features
  │                                   │    before prediction
  │   Module 4: DepGraph              │
  │   model_registry.deploy(...)      │─── validates all model
  │                                   │    dependencies are
  │                                   │    version-compatible
  │   Module 3: Pulse                 │
  │   @monitor.wrap                   │─── records prediction
  │   def predict(features): ...      │    distributions, detects
  │                                   │    silent drift via PSI
  └───────────────────────────────────┘
      │
      ▼
  [Predictions]
  ────────────────────────────────────────────────────────────────────

  All events flow to → Event Bus → Integrations (MLflow, Slack, etc.)
  All state stored in → Registry (SQLite / PostgreSQL)
  All signals exported → REST API /metrics → Prometheus → Grafana

Data Flow: Training vs Serving (SkewGuard)

  PROBLEM: Training and serving often compute features differently
  ────────────────────────────────────────────────────────────────
  Before Vigil:

  Training pipeline         Serving pipeline
  ─────────────────         ────────────────
  pandas groupby()    ≠≠≠   SQL AVG()
  fillna(mean)        ≠≠≠   fillna(0)
  log1p(x)            ≠≠≠   no transform
       ↓                         ↓
  Model trained on X    Model served with Y → Silent accuracy drop

  ────────────────────────────────────────────────────────────────
  With Vigil SkewGuard:

  Single canonical definition
  ───────────────────────────
  @feature(freshness_sla="5m")
  def user_spend_30d(user_id):
      return canonical_logic(user_id)
          │
          ├── feature.batch_code()     →  Training pipeline code
          │                               (auto-generated, guaranteed identical)
          │
          └── feature.serving_code()  →  Serving wrapper code
                                         (auto-generated, freshness tracked)

  PSI monitoring:
  Training distribution  ──┐
                           ├── PSI computed → Alert if PSI > threshold
  Serving distribution   ──┘

Dependency Graph: DepGraph Module

  Model Dependency Graph
  ──────────────────────
  ┌─────────────────┐       ┌─────────────────┐
  │ feature_         │       │   embedder_v2    │
  │ transformer v1.0 │       │      v2.0        │
  └────────┬────────┘       └────────┬─────────┘
           │ deployed ✓              │ deployed ✓
           │                         │
           └──────────────┬──────────┘
                          │
                          ▼
               ┌──────────────────┐
               │   churn_model    │
               │      v2.1        │
               │  depends_on:     │
               │  transformer>=1.0│
               │  embedder>=1.5   │
               └────────┬─────────┘
                        │ deploy → check compat → OK ✓
                        │
                        ▼
               ┌──────────────────┐
               │    ltv_model     │
               │      v1.0        │
               │  depends_on:     │
               │  churn_model>=2.0│   ← Vigil checks: 2.1 ≥ 2.0 ✓
               └──────────────────┘

  model_registry.deploy("ltv_model", "1.0")
  → checks all deps → all satisfied → deploy allowed

Pulse: No-Ground-Truth Drift Detection

  Traditional monitoring:              Vigil Pulse:
  ────────────────────                 ────────────
  Needs ground truth labels            Works without any labels
  Wait weeks for feedback              Detects drift in minutes
  Only measures accuracy               Measures distribution shift

  How PSI drift detection works:
  ──────────────────────────────
  Baseline (training):
  ██████████████████████░░░░░░░░░  Distribution A
  0.0  0.1  0.2  0.3  0.4  0.5

  Current (serving, 1000 preds):
  ░░░░░░░████████████████████████  Distribution B (drifted!)
  0.5  0.6  0.7  0.8  0.9  1.0

  PSI = Σ (actual% - expected%) × ln(actual%/expected%) = 0.42 > 0.20
                                                           ↑
                                              ALERT TRIGGERED

Observability Stack

  ┌──────────────────────────────────────────────────────────────┐
  │                     Observability                            │
  │                                                              │
  │  Vigil API                                                   │
  │  localhost:8000                                              │
  │       │                                                      │
  │       ├── GET /metrics  ──────► Prometheus :9090             │
  │       │   (Prometheus format)        │                       │
  │       │                             └──► Grafana :3000       │
  │       │                                  Dashboard panels:   │
  │       │                                  • Pipeline status   │
  │       ├── GET /health/               • Alert counts      │
  │       ├── GET /health/alerts         • Skew PSI/feature  │
  │       ├── GET /pulse/{m}/drift       • Prediction volume  │
  │       └── GET /docs                  • Feature null rates │
  │           (Swagger UI)                                       │
  └──────────────────────────────────────────────────────────────┘

  docker-compose up
  → vigil:8000 + prometheus:9090 + grafana:3000 (admin/vigil)

Integration Architecture

  Your existing MLOps stack           Vigil layer
  ────────────────────────            ─────────────────────────────────
                                      from vigil.integrations import ...
  MLflow runs           ────────────► MLflowAdapter
                                        logs contracts as tags
                                        logs PSI as metrics
                                        syncs model registry

  Airflow DAGs          ────────────► VIGILContractOperator
                                        native Airflow operator
                                        fails DAG on violation

  Prefect flows         ────────────► vigil_contract_task()
                                        drop-in Prefect task

  Kubeflow Pipelines    ────────────► KubeflowAdapter
                                        writes /tmp/mlpipeline-metrics.json

  AWS SageMaker         ────────────► SageMakerAdapter
                                        syncs experiments + model registry

  ZenML pipelines       ────────────► @vigil_contract_step()
                                        step decorator

  Vertex AI Pipelines   ────────────► VertexAIAdapter
                                        logs Vertex AI experiment metrics

  Any platform          ────────────► WebhookAdapter (Slack/PagerDuty)
                                        CallbackAdapter (custom Python)

Repository Structure

vigil/
├── vigil/                          # Python package
│   ├── __init__.py                 # Public API: feature, pipeline, PulseMonitor...
│   ├── core/
│   │   ├── config.py               # VIGILConfig (Pydantic) + vigil.yaml loader
│   │   ├── registry.py             # SQLAlchemy ORM models + session management
│   │   ├── event_bus.py            # Pub/sub event system (10k event history)
│   │   └── exceptions.py           # Typed exceptions (ContractViolationError, etc.)
│   ├── modules/
│   │   ├── skew_guard/
│   │   │   ├── feature.py          # @feature decorator + FeatureRegistry
│   │   │   ├── compiler.py         # Batch + serving code generator
│   │   │   └── validator.py        # PSI-based skew detection
│   │   ├── contracts/
│   │   │   ├── schema.py           # DataContract, ColumnContract (YAML + code)
│   │   │   └── engine.py           # ContractEngine + @pipeline.stage decorator
│   │   ├── pulse/
│   │   │   ├── baseline.py         # BaselineManager + BaselineStats
│   │   │   └── monitor.py          # PulseMonitor + PSI drift detection
│   │   ├── dep_graph/
│   │   │   └── graph.py            # ModelDependencyGraph + semver compat
│   │   └── freshness/
│   │       └── enforcer.py         # FreshnessTracker + SLA parser
│   ├── integrations/
│   │   ├── base.py                 # VIGILIntegration (abstract base)
│   │   ├── mlflow_adapter.py       # MLflow
│   │   ├── airflow_adapter.py      # Airflow + VIGILContractOperator
│   │   ├── prefect_adapter.py      # Prefect
│   │   ├── kubeflow_adapter.py     # Kubeflow Pipelines
│   │   ├── sagemaker_adapter.py    # AWS SageMaker
│   │   ├── zenml_adapter.py        # ZenML
│   │   ├── vertex_adapter.py       # Google Vertex AI
│   │   └── generic_adapter.py      # Webhook + CallbackAdapter
│   ├── api/
│   │   ├── app.py                  # FastAPI app factory
│   │   ├── schemas.py              # Pydantic request/response models
│   │   └── routes/
│   │       ├── features.py         # /features - SkewGuard endpoints
│   │       ├── contracts.py        # /contracts - Contract endpoints
│   │       ├── models.py           # /models - DepGraph endpoints
│   │       ├── pulse.py            # /pulse - Drift endpoints
│   │       ├── health.py           # /health - Alerts + freshness
│   │       └── metrics.py          # /metrics - Prometheus export
│   └── cli/
│       └── main.py                 # Click CLI (init/validate/status/serve/deploy)
├── tests/                          # 72 tests, 100% passing
│   ├── conftest.py                 # Isolated SQLite per test
│   ├── test_skew_guard.py          # 10 tests
│   ├── test_contracts.py           # 11 tests
│   ├── test_pulse.py               # 9 tests
│   ├── test_freshness.py           # 13 tests
│   ├── test_dep_graph.py           # 11 tests
│   └── test_api.py                 # 15 tests
├── examples/
│   ├── basic_pipeline.py           # All 5 modules end-to-end
│   ├── mlflow_integration.py       # MLflow training run + Vigil
│   ├── airflow_dag.py              # Airflow DAG with Vigil gates
│   └── schemas/
│       └── churn_features_v1.yaml  # Example contract schema
├── grafana/
│   ├── vigil_dashboard.json        # Grafana dashboard (8 panels)
│   └── provisioning/               # Auto-provision datasource + dashboard
├── .github/
│   └── workflows/
│       ├── ci.yml                  # Tests on Python 3.10/3.11/3.12 + lint
│       └── publish.yml             # OIDC publish to PyPI on release
├── vigil.yaml                      # Default configuration
├── prometheus.yml                  # Prometheus scrape config
├── Dockerfile                      # Production container
├── docker-compose.yml              # vigil + prometheus + grafana
├── pyproject.toml                  # Package metadata + dependencies
└── README.md                       # This file

Quick Start

from vigil import feature, pipeline, PulseMonitor, model_registry, FreshnessTracker
from vigil.core.registry import init_db

init_db()  # defaults to sqlite:///vigil.db

# ── Module 1: SkewGuard ──────────────────────────────────────────────────────
@feature(freshness_sla="5m", dtype=float)
def user_spend_30d(user_id: str) -> float:
    return db.query("SELECT SUM(amount) FROM orders WHERE user_id=?", user_id)

# Auto-generate batch code for training:
print(user_spend_30d.batch_code())

# Auto-generate serving wrapper with freshness tracking:
print(user_spend_30d.serving_wrapper_code())

# ── Module 2: Contracts ──────────────────────────────────────────────────────
@pipeline.stage(contract="vigil_schemas/features_v2.yaml")
def feature_engineering(df):
    return df.assign(spend_log=np.log1p(df["spend"]))

# ── Module 3: Pulse ──────────────────────────────────────────────────────────
monitor = PulseMonitor("churn_model", version="2.1")
monitor.set_baseline(training_predictions)

@monitor.wrap
def predict(features):
    return model.predict(features)

# ── Module 4: DepGraph ───────────────────────────────────────────────────────
model_registry.register("ltv_model", version="1.0", depends_on={
    "churn_model": ">=2.1",
    "feature_transformer": ">=1.0",
})
model_registry.deploy("ltv_model", "1.0")  # raises ModelIncompatibilityError if deps unmet

# ── Module 5: Freshness ──────────────────────────────────────────────────────
tracker = FreshnessTracker()
tracker.mark_computed("user_spend_30d", entity_id=user_id)
tracker.assert_fresh("user_spend_30d", entity_id=user_id, sla="5m")

Installation

pip install vigil-ml                  # core
pip install vigil-ml[mlflow]          # + MLflow
pip install vigil-ml[airflow]         # + Airflow
pip install vigil-ml[prefect]         # + Prefect
pip install vigil-ml[all]             # everything

CLI Reference

vigil init                            # scaffold vigil.yaml + schema dir
vigil validate [--strict]             # run all contract + skew checks
vigil status                          # live terminal health dashboard
vigil deploy --check                  # pre-deploy gate (exit 1 on failure)
vigil serve [--port 8000] [--reload]  # start REST API
vigil alerts                          # list active alerts
vigil contract-register schema.yaml   # register a contract
vigil model-register name ver [--depends-on "dep>=x.y"]

REST API

Start with vigil serve. Swagger UI at http://localhost:8000/docs.

GET  /                              →  API info
GET  /health/                       →  pipeline health (5 module statuses)
GET  /health/alerts                 →  active alerts
POST /health/freshness/{f}/mark     →  mark feature as computed
GET  /features/                     →  list features
GET  /features/{name}/batch-code    →  generated batch transformation
GET  /features/{name}/serving-code  →  generated serving wrapper
POST /features/skew/record          →  record training/serving stats
GET  /features/skew/{name}/validate →  run PSI skew check
GET  /contracts/                    →  list contracts
POST /contracts/                    →  register contract
POST /contracts/validate            →  validate data against contract
GET  /models/                       →  list models
POST /models/                       →  register model
POST /models/{n}/{v}/deploy         →  deploy (validates compatibility)
GET  /models/{n}/{v}/compatibility  →  compatibility check
GET  /models/{n}/{v}/tree           →  dependency tree
POST /pulse/baseline                →  set prediction baseline
POST /pulse/record                  →  record prediction
GET  /pulse/{model}/drift           →  PSI drift report
GET  /metrics                       →  Prometheus metrics

Docker: Full Observability Stack

# Start Vigil + Prometheus + Grafana
docker-compose up

# Access:
# Vigil API:    http://localhost:8000/docs
# Prometheus:   http://localhost:9090
# Grafana:      http://localhost:3000  (admin / vigil)

Configuration (vigil.yaml)

database:
  url: sqlite:///vigil.db     # or postgresql://user:pass@host/db

api:
  host: "0.0.0.0"
  port: 8000

modules:
  skew_guard:
    drift_threshold: 0.10     # PSI > 0.10 → alert
  contracts:
    fail_on_violation: true
    schema_dir: vigil_schemas
  pulse:
    window_size: 1000
    psi_threshold: 0.20
  dep_graph:
    block_incompatible_deploy: true
  freshness:
    default_sla: "1h"
    fail_on_stale: false

alerts:
  slack_webhook: "https://hooks.slack.com/services/..."

Integrations

Platform Class / Function Install
MLflow MLflowAdapter pip install vigil-ml[mlflow]
Airflow VIGILContractOperator, VIGILFreshnessOperator pip install vigil-ml[airflow]
Prefect vigil_contract_task, vigil_freshness_task pip install vigil-ml[prefect]
Kubeflow KubeflowAdapter included
SageMaker SageMakerAdapter pip install boto3
ZenML ZenMLAdapter, @vigil_contract_step pip install zenml
Vertex AI VertexAIAdapter, @vigil_vertex_component pip install google-cloud-aiplatform
Any (webhook) WebhookAdapter included
Any (Python) CallbackAdapter included

Tech Stack

Layer Technology
Language Python 3.10+
Data validation Pydantic v2
ORM / Storage SQLAlchemy 2.0 (SQLite / PostgreSQL)
REST API FastAPI + Uvicorn
CLI Click + Rich
Drift detection NumPy + SciPy (PSI, KS test)
Data processing Pandas
Version compat packaging (semver ranges)
Observability Prometheus-compatible /metrics
Dashboards Grafana (pre-built dashboard JSON)
Container Docker + Docker Compose
CI/CD GitHub Actions (test matrix + PyPI publish)

License

Apache 2.0 — see LICENSE.


Built as a product-grade MLOps accelerator. Designed to be the last piece every ML pipeline is missing.

About

The missing safety layer for production ML pipelines — SkewGuard, Contracts, Pulse, DepGraph, Freshness

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors