Vigil — The Missing Safety Layer for Production ML

Vigil protects every stage of your ML pipeline with five integrated safety modules — from data ingestion to model serving. Works with any MLOps stack. Zero infrastructure changes required.

pip install vigil-ml

The Problem Vigil Solves

Most production ML failures are silent — no errors, no alerts, just wrong predictions. They all stem from five root causes:

Root Cause	What Actually Breaks	How Long Before You Notice
Features computed differently in training vs serving	Accuracy silently degrades	Weeks to months
Schema/stats changes between pipeline stages	Downstream models get garbage input	Next training run
Prediction distribution shifts without system errors	Users get wrong recommendations	Never (no alerts)
Upgrading Model A silently breaks Model B	Cascading production failures	At peak traffic
Stale cached features served at inference time	Model quality collapses	Hours to days

Vigil addresses all five — before they reach users.

Architecture

High-Level System Architecture

╔══════════════════════════════════════════════════════════════════════════════╗
║                           VIGIL  v0.1.0                                      ║
║               The Missing Safety Layer for Production ML                     ║
╠══════════════╦═══════════════╦══════════════╦════════════╦═══════════════════╣
║              ║               ║              ║            ║                   ║
║  SkewGuard   ║   Contracts   ║    Pulse     ║  DepGraph  ║    Freshness      ║
║  Module 1    ║   Module 2    ║   Module 3   ║  Module 4  ║    Module 5       ║
║              ║               ║              ║            ║                   ║
║  @feature    ║  DataContract ║ PulseMonitor ║  model_    ║ FreshnessTracker  ║
║  decorator   ║  @pipeline    ║  .wrap()     ║  registry  ║  .assert_fresh()  ║
║  + compiler  ║  .stage()     ║  PSI drift   ║  semver    ║  SLA enforcement  ║
║  + PSI check ║  + engine     ║  detection   ║  compat.   ║  + bulk_check()   ║
║              ║               ║              ║            ║                   ║
╠══════════════╩═══════════════╩══════════════╩════════════╩═══════════════════╣
║                           CORE ENGINE                                        ║
║                                                                              ║
║  ┌─────────────────┐  ┌────────────────┐  ┌──────────┐  ┌────────────────┐  ║
║  │    Registry      │  │   Event Bus    │  │  Config  │  │  Exceptions    │  ║
║  │  (SQLAlchemy)    │  │   (pub/sub)    │  │  (YAML)  │  │  (typed)       │  ║
║  │  SQLite/Postgres │  │  10k history   │  │ vigil.   │  │  VIGILError    │  ║
║  │  8 ORM models    │  │  wildcard subs │  │  yaml    │  │  + subtypes    │  ║
║  └─────────────────┘  └────────────────┘  └──────────┘  └────────────────┘  ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                         INTEGRATIONS LAYER                                   ║
║                                                                              ║
║   MLflow │ Airflow │ Prefect │ Kubeflow │ SageMaker │ ZenML │ Vertex AI      ║
║   Webhook (Slack/PagerDuty) │ CallbackAdapter (custom Python)                ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                         SURFACE AREA                                         ║
║                                                                              ║
║   REST API (FastAPI)    CLI (Click)    Prometheus /metrics    Python SDK     ║
╚══════════════════════════════════════════════════════════════════════════════╝

Module Architecture: How All 5 Modules Interact

  Your ML Pipeline
  ────────────────────────────────────────────────────────────────────
  [Raw Data]
      │
      ▼
  ┌───────────────────────────────────┐
  │     Stage: Feature Engineering    │◄─── Module 2: Contracts
  │   @pipeline.stage(contract=...)   │     validates schema + stats
  └───────────────────────────────────┘     on every stage output
      │
      │  Features computed via:
      ▼
  ┌───────────────────────────────────┐
  │     Module 1: SkewGuard           │
  │   @feature(freshness_sla="5m")    │
  │                                   │
  │   Training path:                  │
  │   feature_fn → batch_code()       │─── writes training stats
  │                    ▼              │    to Registry
  │   [Training DataFrame]            │
  │                                   │
  │   Serving path:                   │
  │   feature_fn → serving_wrapper()  │─── writes serving stats
  │                    ▼              │    auto-checks PSI drift
  │   [Real-time value]               │
  └───────────────────────────────────┘
      │
      ▼
  ┌───────────────────────────────────┐
  │     Stage: Model Inference        │
  │                                   │
  │   Module 5: Freshness             │
  │   tracker.assert_fresh("feat")    │─── blocks stale features
  │                                   │    before prediction
  │   Module 4: DepGraph              │
  │   model_registry.deploy(...)      │─── validates all model
  │                                   │    dependencies are
  │                                   │    version-compatible
  │   Module 3: Pulse                 │
  │   @monitor.wrap                   │─── records prediction
  │   def predict(features): ...      │    distributions, detects
  │                                   │    silent drift via PSI
  └───────────────────────────────────┘
      │
      ▼
  [Predictions]
  ────────────────────────────────────────────────────────────────────

  All events flow to → Event Bus → Integrations (MLflow, Slack, etc.)
  All state stored in → Registry (SQLite / PostgreSQL)
  All signals exported → REST API /metrics → Prometheus → Grafana

Data Flow: Training vs Serving (SkewGuard)

  PROBLEM: Training and serving often compute features differently
  ────────────────────────────────────────────────────────────────
  Before Vigil:

  Training pipeline         Serving pipeline
  ─────────────────         ────────────────
  pandas groupby()    ≠≠≠   SQL AVG()
  fillna(mean)        ≠≠≠   fillna(0)
  log1p(x)            ≠≠≠   no transform
       ↓                         ↓
  Model trained on X    Model served with Y → Silent accuracy drop

  ────────────────────────────────────────────────────────────────
  With Vigil SkewGuard:

  Single canonical definition
  ───────────────────────────
  @feature(freshness_sla="5m")
  def user_spend_30d(user_id):
      return canonical_logic(user_id)
          │
          ├── feature.batch_code()     →  Training pipeline code
          │                               (auto-generated, guaranteed identical)
          │
          └── feature.serving_code()  →  Serving wrapper code
                                         (auto-generated, freshness tracked)

  PSI monitoring:
  Training distribution  ──┐
                           ├── PSI computed → Alert if PSI > threshold
  Serving distribution   ──┘

Dependency Graph: DepGraph Module

  Model Dependency Graph
  ──────────────────────
  ┌─────────────────┐       ┌─────────────────┐
  │ feature_         │       │   embedder_v2    │
  │ transformer v1.0 │       │      v2.0        │
  └────────┬────────┘       └────────┬─────────┘
           │ deployed ✓              │ deployed ✓
           │                         │
           └──────────────┬──────────┘
                          │
                          ▼
               ┌──────────────────┐
               │   churn_model    │
               │      v2.1        │
               │  depends_on:     │
               │  transformer>=1.0│
               │  embedder>=1.5   │
               └────────┬─────────┘
                        │ deploy → check compat → OK ✓
                        │
                        ▼
               ┌──────────────────┐
               │    ltv_model     │
               │      v1.0        │
               │  depends_on:     │
               │  churn_model>=2.0│   ← Vigil checks: 2.1 ≥ 2.0 ✓
               └──────────────────┘

  model_registry.deploy("ltv_model", "1.0")
  → checks all deps → all satisfied → deploy allowed

Pulse: No-Ground-Truth Drift Detection

  Traditional monitoring:              Vigil Pulse:
  ────────────────────                 ────────────
  Needs ground truth labels            Works without any labels
  Wait weeks for feedback              Detects drift in minutes
  Only measures accuracy               Measures distribution shift

  How PSI drift detection works:
  ──────────────────────────────
  Baseline (training):
  ██████████████████████░░░░░░░░░  Distribution A
  0.0  0.1  0.2  0.3  0.4  0.5

  Current (serving, 1000 preds):
  ░░░░░░░████████████████████████  Distribution B (drifted!)
  0.5  0.6  0.7  0.8  0.9  1.0

  PSI = Σ (actual% - expected%) × ln(actual%/expected%) = 0.42 > 0.20
                                                           ↑
                                              ALERT TRIGGERED

Observability Stack

  ┌──────────────────────────────────────────────────────────────┐
  │                     Observability                            │
  │                                                              │
  │  Vigil API                                                   │
  │  localhost:8000                                              │
  │       │                                                      │
  │       ├── GET /metrics  ──────► Prometheus :9090             │
  │       │   (Prometheus format)        │                       │
  │       │                             └──► Grafana :3000       │
  │       │                                  Dashboard panels:   │
  │       │                                  • Pipeline status   │
  │       ├── GET /health/               • Alert counts      │
  │       ├── GET /health/alerts         • Skew PSI/feature  │
  │       ├── GET /pulse/{m}/drift       • Prediction volume  │
  │       └── GET /docs                  • Feature null rates │
  │           (Swagger UI)                                       │
  └──────────────────────────────────────────────────────────────┘

  docker-compose up
  → vigil:8000 + prometheus:9090 + grafana:3000 (admin/vigil)

Integration Architecture

  Your existing MLOps stack           Vigil layer
  ────────────────────────            ─────────────────────────────────
                                      from vigil.integrations import ...
  MLflow runs           ────────────► MLflowAdapter
                                        logs contracts as tags
                                        logs PSI as metrics
                                        syncs model registry

  Airflow DAGs          ────────────► VIGILContractOperator
                                        native Airflow operator
                                        fails DAG on violation

  Prefect flows         ────────────► vigil_contract_task()
                                        drop-in Prefect task

  Kubeflow Pipelines    ────────────► KubeflowAdapter
                                        writes /tmp/mlpipeline-metrics.json

  AWS SageMaker         ────────────► SageMakerAdapter
                                        syncs experiments + model registry

  ZenML pipelines       ────────────► @vigil_contract_step()
                                        step decorator

  Vertex AI Pipelines   ────────────► VertexAIAdapter
                                        logs Vertex AI experiment metrics

  Any platform          ────────────► WebhookAdapter (Slack/PagerDuty)
                                        CallbackAdapter (custom Python)

Repository Structure

vigil/
├── vigil/                          # Python package
│   ├── __init__.py                 # Public API: feature, pipeline, PulseMonitor...
│   ├── core/
│   │   ├── config.py               # VIGILConfig (Pydantic) + vigil.yaml loader
│   │   ├── registry.py             # SQLAlchemy ORM models + session management
│   │   ├── event_bus.py            # Pub/sub event system (10k event history)
│   │   └── exceptions.py           # Typed exceptions (ContractViolationError, etc.)
│   ├── modules/
│   │   ├── skew_guard/
│   │   │   ├── feature.py          # @feature decorator + FeatureRegistry
│   │   │   ├── compiler.py         # Batch + serving code generator
│   │   │   └── validator.py        # PSI-based skew detection
│   │   ├── contracts/
│   │   │   ├── schema.py           # DataContract, ColumnContract (YAML + code)
│   │   │   └── engine.py           # ContractEngine + @pipeline.stage decorator
│   │   ├── pulse/
│   │   │   ├── baseline.py         # BaselineManager + BaselineStats
│   │   │   └── monitor.py          # PulseMonitor + PSI drift detection
│   │   ├── dep_graph/
│   │   │   └── graph.py            # ModelDependencyGraph + semver compat
│   │   └── freshness/
│   │       └── enforcer.py         # FreshnessTracker + SLA parser
│   ├── integrations/
│   │   ├── base.py                 # VIGILIntegration (abstract base)
│   │   ├── mlflow_adapter.py       # MLflow
│   │   ├── airflow_adapter.py      # Airflow + VIGILContractOperator
│   │   ├── prefect_adapter.py      # Prefect
│   │   ├── kubeflow_adapter.py     # Kubeflow Pipelines
│   │   ├── sagemaker_adapter.py    # AWS SageMaker
│   │   ├── zenml_adapter.py        # ZenML
│   │   ├── vertex_adapter.py       # Google Vertex AI
│   │   └── generic_adapter.py      # Webhook + CallbackAdapter
│   ├── api/
│   │   ├── app.py                  # FastAPI app factory
│   │   ├── schemas.py              # Pydantic request/response models
│   │   └── routes/
│   │       ├── features.py         # /features - SkewGuard endpoints
│   │       ├── contracts.py        # /contracts - Contract endpoints
│   │       ├── models.py           # /models - DepGraph endpoints
│   │       ├── pulse.py            # /pulse - Drift endpoints
│   │       ├── health.py           # /health - Alerts + freshness
│   │       └── metrics.py          # /metrics - Prometheus export
│   └── cli/
│       └── main.py                 # Click CLI (init/validate/status/serve/deploy)
├── tests/                          # 72 tests, 100% passing
│   ├── conftest.py                 # Isolated SQLite per test
│   ├── test_skew_guard.py          # 10 tests
│   ├── test_contracts.py           # 11 tests
│   ├── test_pulse.py               # 9 tests
│   ├── test_freshness.py           # 13 tests
│   ├── test_dep_graph.py           # 11 tests
│   └── test_api.py                 # 15 tests
├── examples/
│   ├── basic_pipeline.py           # All 5 modules end-to-end
│   ├── mlflow_integration.py       # MLflow training run + Vigil
│   ├── airflow_dag.py              # Airflow DAG with Vigil gates
│   └── schemas/
│       └── churn_features_v1.yaml  # Example contract schema
├── grafana/
│   ├── vigil_dashboard.json        # Grafana dashboard (8 panels)
│   └── provisioning/               # Auto-provision datasource + dashboard
├── .github/
│   └── workflows/
│       ├── ci.yml                  # Tests on Python 3.10/3.11/3.12 + lint
│       └── publish.yml             # OIDC publish to PyPI on release
├── vigil.yaml                      # Default configuration
├── prometheus.yml                  # Prometheus scrape config
├── Dockerfile                      # Production container
├── docker-compose.yml              # vigil + prometheus + grafana
├── pyproject.toml                  # Package metadata + dependencies
└── README.md                       # This file

Quick Start

from vigil import feature, pipeline, PulseMonitor, model_registry, FreshnessTracker
from vigil.core.registry import init_db

init_db()  # defaults to sqlite:///vigil.db

# ── Module 1: SkewGuard ──────────────────────────────────────────────────────
@feature(freshness_sla="5m", dtype=float)
def user_spend_30d(user_id: str) -> float:
    return db.query("SELECT SUM(amount) FROM orders WHERE user_id=?", user_id)

# Auto-generate batch code for training:
print(user_spend_30d.batch_code())

# Auto-generate serving wrapper with freshness tracking:
print(user_spend_30d.serving_wrapper_code())

# ── Module 2: Contracts ──────────────────────────────────────────────────────
@pipeline.stage(contract="vigil_schemas/features_v2.yaml")
def feature_engineering(df):
    return df.assign(spend_log=np.log1p(df["spend"]))

# ── Module 3: Pulse ──────────────────────────────────────────────────────────
monitor = PulseMonitor("churn_model", version="2.1")
monitor.set_baseline(training_predictions)

@monitor.wrap
def predict(features):
    return model.predict(features)

# ── Module 4: DepGraph ───────────────────────────────────────────────────────
model_registry.register("ltv_model", version="1.0", depends_on={
    "churn_model": ">=2.1",
    "feature_transformer": ">=1.0",
})
model_registry.deploy("ltv_model", "1.0")  # raises ModelIncompatibilityError if deps unmet

# ── Module 5: Freshness ──────────────────────────────────────────────────────
tracker = FreshnessTracker()
tracker.mark_computed("user_spend_30d", entity_id=user_id)
tracker.assert_fresh("user_spend_30d", entity_id=user_id, sla="5m")

Installation

pip install vigil-ml                  # core
pip install vigil-ml[mlflow]          # + MLflow
pip install vigil-ml[airflow]         # + Airflow
pip install vigil-ml[prefect]         # + Prefect
pip install vigil-ml[all]             # everything

CLI Reference

vigil init                            # scaffold vigil.yaml + schema dir
vigil validate [--strict]             # run all contract + skew checks
vigil status                          # live terminal health dashboard
vigil deploy --check                  # pre-deploy gate (exit 1 on failure)
vigil serve [--port 8000] [--reload]  # start REST API
vigil alerts                          # list active alerts
vigil contract-register schema.yaml   # register a contract
vigil model-register name ver [--depends-on "dep>=x.y"]

REST API

Start with vigil serve. Swagger UI at http://localhost:8000/docs.

GET  /                              →  API info
GET  /health/                       →  pipeline health (5 module statuses)
GET  /health/alerts                 →  active alerts
POST /health/freshness/{f}/mark     →  mark feature as computed
GET  /features/                     →  list features
GET  /features/{name}/batch-code    →  generated batch transformation
GET  /features/{name}/serving-code  →  generated serving wrapper
POST /features/skew/record          →  record training/serving stats
GET  /features/skew/{name}/validate →  run PSI skew check
GET  /contracts/                    →  list contracts
POST /contracts/                    →  register contract
POST /contracts/validate            →  validate data against contract
GET  /models/                       →  list models
POST /models/                       →  register model
POST /models/{n}/{v}/deploy         →  deploy (validates compatibility)
GET  /models/{n}/{v}/compatibility  →  compatibility check
GET  /models/{n}/{v}/tree           →  dependency tree
POST /pulse/baseline                →  set prediction baseline
POST /pulse/record                  →  record prediction
GET  /pulse/{model}/drift           →  PSI drift report
GET  /metrics                       →  Prometheus metrics

Docker: Full Observability Stack

# Start Vigil + Prometheus + Grafana
docker-compose up

# Access:
# Vigil API:    http://localhost:8000/docs
# Prometheus:   http://localhost:9090
# Grafana:      http://localhost:3000  (admin / vigil)

Configuration (`vigil.yaml`)

database:
  url: sqlite:///vigil.db     # or postgresql://user:pass@host/db

api:
  host: "0.0.0.0"
  port: 8000

modules:
  skew_guard:
    drift_threshold: 0.10     # PSI > 0.10 → alert
  contracts:
    fail_on_violation: true
    schema_dir: vigil_schemas
  pulse:
    window_size: 1000
    psi_threshold: 0.20
  dep_graph:
    block_incompatible_deploy: true
  freshness:
    default_sla: "1h"
    fail_on_stale: false

alerts:
  slack_webhook: "https://hooks.slack.com/services/..."

Integrations

Platform	Class / Function	Install
MLflow	`MLflowAdapter`	`pip install vigil-ml[mlflow]`
Airflow	`VIGILContractOperator`, `VIGILFreshnessOperator`	`pip install vigil-ml[airflow]`
Prefect	`vigil_contract_task`, `vigil_freshness_task`	`pip install vigil-ml[prefect]`
Kubeflow	`KubeflowAdapter`	included
SageMaker	`SageMakerAdapter`	`pip install boto3`
ZenML	`ZenMLAdapter`, `@vigil_contract_step`	`pip install zenml`
Vertex AI	`VertexAIAdapter`, `@vigil_vertex_component`	`pip install google-cloud-aiplatform`
Any (webhook)	`WebhookAdapter`	included
Any (Python)	`CallbackAdapter`	included

Tech Stack

Layer	Technology
Language	Python 3.10+
Data validation	Pydantic v2
ORM / Storage	SQLAlchemy 2.0 (SQLite / PostgreSQL)
REST API	FastAPI + Uvicorn
CLI	Click + Rich
Drift detection	NumPy + SciPy (PSI, KS test)
Data processing	Pandas
Version compat	`packaging` (semver ranges)
Observability	Prometheus-compatible `/metrics`
Dashboards	Grafana (pre-built dashboard JSON)
Container	Docker + Docker Compose
CI/CD	GitHub Actions (test matrix + PyPI publish)

License

Apache 2.0 — see LICENSE.

Built as a product-grade MLOps accelerator. Designed to be the last piece every ML pipeline is missing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vigil — The Missing Safety Layer for Production ML

The Problem Vigil Solves

Architecture

High-Level System Architecture

Module Architecture: How All 5 Modules Interact

Data Flow: Training vs Serving (SkewGuard)

Dependency Graph: DepGraph Module

Pulse: No-Ground-Truth Drift Detection

Observability Stack

Integration Architecture

Repository Structure

Quick Start

Installation

CLI Reference

REST API

Docker: Full Observability Stack

Configuration (`vigil.yaml`)

Integrations

Tech Stack

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
examples		examples
grafana		grafana
tests		tests
vigil		vigil
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
prometheus.yml		prometheus.yml
pyproject.toml		pyproject.toml
vigil.yaml		vigil.yaml

Folders and files

Latest commit

History

Repository files navigation

Vigil — The Missing Safety Layer for Production ML

The Problem Vigil Solves

Architecture

High-Level System Architecture

Module Architecture: How All 5 Modules Interact

Data Flow: Training vs Serving (SkewGuard)

Dependency Graph: DepGraph Module

Pulse: No-Ground-Truth Drift Detection

Observability Stack

Integration Architecture

Repository Structure

Quick Start

Installation

CLI Reference

REST API

Docker: Full Observability Stack

Configuration (vigil.yaml)

Integrations

Tech Stack

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Configuration (`vigil.yaml`)

Packages