Vigil protects every stage of your ML pipeline with five integrated safety modules — from data ingestion to model serving. Works with any MLOps stack. Zero infrastructure changes required.
pip install vigil-mlMost production ML failures are silent — no errors, no alerts, just wrong predictions. They all stem from five root causes:
| Root Cause | What Actually Breaks | How Long Before You Notice |
|---|---|---|
| Features computed differently in training vs serving | Accuracy silently degrades | Weeks to months |
| Schema/stats changes between pipeline stages | Downstream models get garbage input | Next training run |
| Prediction distribution shifts without system errors | Users get wrong recommendations | Never (no alerts) |
| Upgrading Model A silently breaks Model B | Cascading production failures | At peak traffic |
| Stale cached features served at inference time | Model quality collapses | Hours to days |
Vigil addresses all five — before they reach users.
╔══════════════════════════════════════════════════════════════════════════════╗
║ VIGIL v0.1.0 ║
║ The Missing Safety Layer for Production ML ║
╠══════════════╦═══════════════╦══════════════╦════════════╦═══════════════════╣
║ ║ ║ ║ ║ ║
║ SkewGuard ║ Contracts ║ Pulse ║ DepGraph ║ Freshness ║
║ Module 1 ║ Module 2 ║ Module 3 ║ Module 4 ║ Module 5 ║
║ ║ ║ ║ ║ ║
║ @feature ║ DataContract ║ PulseMonitor ║ model_ ║ FreshnessTracker ║
║ decorator ║ @pipeline ║ .wrap() ║ registry ║ .assert_fresh() ║
║ + compiler ║ .stage() ║ PSI drift ║ semver ║ SLA enforcement ║
║ + PSI check ║ + engine ║ detection ║ compat. ║ + bulk_check() ║
║ ║ ║ ║ ║ ║
╠══════════════╩═══════════════╩══════════════╩════════════╩═══════════════════╣
║ CORE ENGINE ║
║ ║
║ ┌─────────────────┐ ┌────────────────┐ ┌──────────┐ ┌────────────────┐ ║
║ │ Registry │ │ Event Bus │ │ Config │ │ Exceptions │ ║
║ │ (SQLAlchemy) │ │ (pub/sub) │ │ (YAML) │ │ (typed) │ ║
║ │ SQLite/Postgres │ │ 10k history │ │ vigil. │ │ VIGILError │ ║
║ │ 8 ORM models │ │ wildcard subs │ │ yaml │ │ + subtypes │ ║
║ └─────────────────┘ └────────────────┘ └──────────┘ └────────────────┘ ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ INTEGRATIONS LAYER ║
║ ║
║ MLflow │ Airflow │ Prefect │ Kubeflow │ SageMaker │ ZenML │ Vertex AI ║
║ Webhook (Slack/PagerDuty) │ CallbackAdapter (custom Python) ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ SURFACE AREA ║
║ ║
║ REST API (FastAPI) CLI (Click) Prometheus /metrics Python SDK ║
╚══════════════════════════════════════════════════════════════════════════════╝
Your ML Pipeline
────────────────────────────────────────────────────────────────────
[Raw Data]
│
▼
┌───────────────────────────────────┐
│ Stage: Feature Engineering │◄─── Module 2: Contracts
│ @pipeline.stage(contract=...) │ validates schema + stats
└───────────────────────────────────┘ on every stage output
│
│ Features computed via:
▼
┌───────────────────────────────────┐
│ Module 1: SkewGuard │
│ @feature(freshness_sla="5m") │
│ │
│ Training path: │
│ feature_fn → batch_code() │─── writes training stats
│ ▼ │ to Registry
│ [Training DataFrame] │
│ │
│ Serving path: │
│ feature_fn → serving_wrapper() │─── writes serving stats
│ ▼ │ auto-checks PSI drift
│ [Real-time value] │
└───────────────────────────────────┘
│
▼
┌───────────────────────────────────┐
│ Stage: Model Inference │
│ │
│ Module 5: Freshness │
│ tracker.assert_fresh("feat") │─── blocks stale features
│ │ before prediction
│ Module 4: DepGraph │
│ model_registry.deploy(...) │─── validates all model
│ │ dependencies are
│ │ version-compatible
│ Module 3: Pulse │
│ @monitor.wrap │─── records prediction
│ def predict(features): ... │ distributions, detects
│ │ silent drift via PSI
└───────────────────────────────────┘
│
▼
[Predictions]
────────────────────────────────────────────────────────────────────
All events flow to → Event Bus → Integrations (MLflow, Slack, etc.)
All state stored in → Registry (SQLite / PostgreSQL)
All signals exported → REST API /metrics → Prometheus → Grafana
PROBLEM: Training and serving often compute features differently
────────────────────────────────────────────────────────────────
Before Vigil:
Training pipeline Serving pipeline
───────────────── ────────────────
pandas groupby() ≠≠≠ SQL AVG()
fillna(mean) ≠≠≠ fillna(0)
log1p(x) ≠≠≠ no transform
↓ ↓
Model trained on X Model served with Y → Silent accuracy drop
────────────────────────────────────────────────────────────────
With Vigil SkewGuard:
Single canonical definition
───────────────────────────
@feature(freshness_sla="5m")
def user_spend_30d(user_id):
return canonical_logic(user_id)
│
├── feature.batch_code() → Training pipeline code
│ (auto-generated, guaranteed identical)
│
└── feature.serving_code() → Serving wrapper code
(auto-generated, freshness tracked)
PSI monitoring:
Training distribution ──┐
├── PSI computed → Alert if PSI > threshold
Serving distribution ──┘
Model Dependency Graph
──────────────────────
┌─────────────────┐ ┌─────────────────┐
│ feature_ │ │ embedder_v2 │
│ transformer v1.0 │ │ v2.0 │
└────────┬────────┘ └────────┬─────────┘
│ deployed ✓ │ deployed ✓
│ │
└──────────────┬──────────┘
│
▼
┌──────────────────┐
│ churn_model │
│ v2.1 │
│ depends_on: │
│ transformer>=1.0│
│ embedder>=1.5 │
└────────┬─────────┘
│ deploy → check compat → OK ✓
│
▼
┌──────────────────┐
│ ltv_model │
│ v1.0 │
│ depends_on: │
│ churn_model>=2.0│ ← Vigil checks: 2.1 ≥ 2.0 ✓
└──────────────────┘
model_registry.deploy("ltv_model", "1.0")
→ checks all deps → all satisfied → deploy allowed
Traditional monitoring: Vigil Pulse:
──────────────────── ────────────
Needs ground truth labels Works without any labels
Wait weeks for feedback Detects drift in minutes
Only measures accuracy Measures distribution shift
How PSI drift detection works:
──────────────────────────────
Baseline (training):
██████████████████████░░░░░░░░░ Distribution A
0.0 0.1 0.2 0.3 0.4 0.5
Current (serving, 1000 preds):
░░░░░░░████████████████████████ Distribution B (drifted!)
0.5 0.6 0.7 0.8 0.9 1.0
PSI = Σ (actual% - expected%) × ln(actual%/expected%) = 0.42 > 0.20
↑
ALERT TRIGGERED
┌──────────────────────────────────────────────────────────────┐
│ Observability │
│ │
│ Vigil API │
│ localhost:8000 │
│ │ │
│ ├── GET /metrics ──────► Prometheus :9090 │
│ │ (Prometheus format) │ │
│ │ └──► Grafana :3000 │
│ │ Dashboard panels: │
│ │ • Pipeline status │
│ ├── GET /health/ • Alert counts │
│ ├── GET /health/alerts • Skew PSI/feature │
│ ├── GET /pulse/{m}/drift • Prediction volume │
│ └── GET /docs • Feature null rates │
│ (Swagger UI) │
└──────────────────────────────────────────────────────────────┘
docker-compose up
→ vigil:8000 + prometheus:9090 + grafana:3000 (admin/vigil)
Your existing MLOps stack Vigil layer
──────────────────────── ─────────────────────────────────
from vigil.integrations import ...
MLflow runs ────────────► MLflowAdapter
logs contracts as tags
logs PSI as metrics
syncs model registry
Airflow DAGs ────────────► VIGILContractOperator
native Airflow operator
fails DAG on violation
Prefect flows ────────────► vigil_contract_task()
drop-in Prefect task
Kubeflow Pipelines ────────────► KubeflowAdapter
writes /tmp/mlpipeline-metrics.json
AWS SageMaker ────────────► SageMakerAdapter
syncs experiments + model registry
ZenML pipelines ────────────► @vigil_contract_step()
step decorator
Vertex AI Pipelines ────────────► VertexAIAdapter
logs Vertex AI experiment metrics
Any platform ────────────► WebhookAdapter (Slack/PagerDuty)
CallbackAdapter (custom Python)
vigil/
├── vigil/ # Python package
│ ├── __init__.py # Public API: feature, pipeline, PulseMonitor...
│ ├── core/
│ │ ├── config.py # VIGILConfig (Pydantic) + vigil.yaml loader
│ │ ├── registry.py # SQLAlchemy ORM models + session management
│ │ ├── event_bus.py # Pub/sub event system (10k event history)
│ │ └── exceptions.py # Typed exceptions (ContractViolationError, etc.)
│ ├── modules/
│ │ ├── skew_guard/
│ │ │ ├── feature.py # @feature decorator + FeatureRegistry
│ │ │ ├── compiler.py # Batch + serving code generator
│ │ │ └── validator.py # PSI-based skew detection
│ │ ├── contracts/
│ │ │ ├── schema.py # DataContract, ColumnContract (YAML + code)
│ │ │ └── engine.py # ContractEngine + @pipeline.stage decorator
│ │ ├── pulse/
│ │ │ ├── baseline.py # BaselineManager + BaselineStats
│ │ │ └── monitor.py # PulseMonitor + PSI drift detection
│ │ ├── dep_graph/
│ │ │ └── graph.py # ModelDependencyGraph + semver compat
│ │ └── freshness/
│ │ └── enforcer.py # FreshnessTracker + SLA parser
│ ├── integrations/
│ │ ├── base.py # VIGILIntegration (abstract base)
│ │ ├── mlflow_adapter.py # MLflow
│ │ ├── airflow_adapter.py # Airflow + VIGILContractOperator
│ │ ├── prefect_adapter.py # Prefect
│ │ ├── kubeflow_adapter.py # Kubeflow Pipelines
│ │ ├── sagemaker_adapter.py # AWS SageMaker
│ │ ├── zenml_adapter.py # ZenML
│ │ ├── vertex_adapter.py # Google Vertex AI
│ │ └── generic_adapter.py # Webhook + CallbackAdapter
│ ├── api/
│ │ ├── app.py # FastAPI app factory
│ │ ├── schemas.py # Pydantic request/response models
│ │ └── routes/
│ │ ├── features.py # /features - SkewGuard endpoints
│ │ ├── contracts.py # /contracts - Contract endpoints
│ │ ├── models.py # /models - DepGraph endpoints
│ │ ├── pulse.py # /pulse - Drift endpoints
│ │ ├── health.py # /health - Alerts + freshness
│ │ └── metrics.py # /metrics - Prometheus export
│ └── cli/
│ └── main.py # Click CLI (init/validate/status/serve/deploy)
├── tests/ # 72 tests, 100% passing
│ ├── conftest.py # Isolated SQLite per test
│ ├── test_skew_guard.py # 10 tests
│ ├── test_contracts.py # 11 tests
│ ├── test_pulse.py # 9 tests
│ ├── test_freshness.py # 13 tests
│ ├── test_dep_graph.py # 11 tests
│ └── test_api.py # 15 tests
├── examples/
│ ├── basic_pipeline.py # All 5 modules end-to-end
│ ├── mlflow_integration.py # MLflow training run + Vigil
│ ├── airflow_dag.py # Airflow DAG with Vigil gates
│ └── schemas/
│ └── churn_features_v1.yaml # Example contract schema
├── grafana/
│ ├── vigil_dashboard.json # Grafana dashboard (8 panels)
│ └── provisioning/ # Auto-provision datasource + dashboard
├── .github/
│ └── workflows/
│ ├── ci.yml # Tests on Python 3.10/3.11/3.12 + lint
│ └── publish.yml # OIDC publish to PyPI on release
├── vigil.yaml # Default configuration
├── prometheus.yml # Prometheus scrape config
├── Dockerfile # Production container
├── docker-compose.yml # vigil + prometheus + grafana
├── pyproject.toml # Package metadata + dependencies
└── README.md # This file
from vigil import feature, pipeline, PulseMonitor, model_registry, FreshnessTracker
from vigil.core.registry import init_db
init_db() # defaults to sqlite:///vigil.db
# ── Module 1: SkewGuard ──────────────────────────────────────────────────────
@feature(freshness_sla="5m", dtype=float)
def user_spend_30d(user_id: str) -> float:
return db.query("SELECT SUM(amount) FROM orders WHERE user_id=?", user_id)
# Auto-generate batch code for training:
print(user_spend_30d.batch_code())
# Auto-generate serving wrapper with freshness tracking:
print(user_spend_30d.serving_wrapper_code())
# ── Module 2: Contracts ──────────────────────────────────────────────────────
@pipeline.stage(contract="vigil_schemas/features_v2.yaml")
def feature_engineering(df):
return df.assign(spend_log=np.log1p(df["spend"]))
# ── Module 3: Pulse ──────────────────────────────────────────────────────────
monitor = PulseMonitor("churn_model", version="2.1")
monitor.set_baseline(training_predictions)
@monitor.wrap
def predict(features):
return model.predict(features)
# ── Module 4: DepGraph ───────────────────────────────────────────────────────
model_registry.register("ltv_model", version="1.0", depends_on={
"churn_model": ">=2.1",
"feature_transformer": ">=1.0",
})
model_registry.deploy("ltv_model", "1.0") # raises ModelIncompatibilityError if deps unmet
# ── Module 5: Freshness ──────────────────────────────────────────────────────
tracker = FreshnessTracker()
tracker.mark_computed("user_spend_30d", entity_id=user_id)
tracker.assert_fresh("user_spend_30d", entity_id=user_id, sla="5m")pip install vigil-ml # core
pip install vigil-ml[mlflow] # + MLflow
pip install vigil-ml[airflow] # + Airflow
pip install vigil-ml[prefect] # + Prefect
pip install vigil-ml[all] # everythingvigil init # scaffold vigil.yaml + schema dir
vigil validate [--strict] # run all contract + skew checks
vigil status # live terminal health dashboard
vigil deploy --check # pre-deploy gate (exit 1 on failure)
vigil serve [--port 8000] [--reload] # start REST API
vigil alerts # list active alerts
vigil contract-register schema.yaml # register a contract
vigil model-register name ver [--depends-on "dep>=x.y"]Start with vigil serve. Swagger UI at http://localhost:8000/docs.
GET / → API info
GET /health/ → pipeline health (5 module statuses)
GET /health/alerts → active alerts
POST /health/freshness/{f}/mark → mark feature as computed
GET /features/ → list features
GET /features/{name}/batch-code → generated batch transformation
GET /features/{name}/serving-code → generated serving wrapper
POST /features/skew/record → record training/serving stats
GET /features/skew/{name}/validate → run PSI skew check
GET /contracts/ → list contracts
POST /contracts/ → register contract
POST /contracts/validate → validate data against contract
GET /models/ → list models
POST /models/ → register model
POST /models/{n}/{v}/deploy → deploy (validates compatibility)
GET /models/{n}/{v}/compatibility → compatibility check
GET /models/{n}/{v}/tree → dependency tree
POST /pulse/baseline → set prediction baseline
POST /pulse/record → record prediction
GET /pulse/{model}/drift → PSI drift report
GET /metrics → Prometheus metrics
# Start Vigil + Prometheus + Grafana
docker-compose up
# Access:
# Vigil API: http://localhost:8000/docs
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin / vigil)database:
url: sqlite:///vigil.db # or postgresql://user:pass@host/db
api:
host: "0.0.0.0"
port: 8000
modules:
skew_guard:
drift_threshold: 0.10 # PSI > 0.10 → alert
contracts:
fail_on_violation: true
schema_dir: vigil_schemas
pulse:
window_size: 1000
psi_threshold: 0.20
dep_graph:
block_incompatible_deploy: true
freshness:
default_sla: "1h"
fail_on_stale: false
alerts:
slack_webhook: "https://hooks.slack.com/services/..."| Platform | Class / Function | Install |
|---|---|---|
| MLflow | MLflowAdapter |
pip install vigil-ml[mlflow] |
| Airflow | VIGILContractOperator, VIGILFreshnessOperator |
pip install vigil-ml[airflow] |
| Prefect | vigil_contract_task, vigil_freshness_task |
pip install vigil-ml[prefect] |
| Kubeflow | KubeflowAdapter |
included |
| SageMaker | SageMakerAdapter |
pip install boto3 |
| ZenML | ZenMLAdapter, @vigil_contract_step |
pip install zenml |
| Vertex AI | VertexAIAdapter, @vigil_vertex_component |
pip install google-cloud-aiplatform |
| Any (webhook) | WebhookAdapter |
included |
| Any (Python) | CallbackAdapter |
included |
| Layer | Technology |
|---|---|
| Language | Python 3.10+ |
| Data validation | Pydantic v2 |
| ORM / Storage | SQLAlchemy 2.0 (SQLite / PostgreSQL) |
| REST API | FastAPI + Uvicorn |
| CLI | Click + Rich |
| Drift detection | NumPy + SciPy (PSI, KS test) |
| Data processing | Pandas |
| Version compat | packaging (semver ranges) |
| Observability | Prometheus-compatible /metrics |
| Dashboards | Grafana (pre-built dashboard JSON) |
| Container | Docker + Docker Compose |
| CI/CD | GitHub Actions (test matrix + PyPI publish) |
Apache 2.0 — see LICENSE.
Built as a product-grade MLOps accelerator. Designed to be the last piece every ML pipeline is missing.