
🔭 ObservaKit

Data Observability Starter Kit for Small Teams

Quickstart • Features • Architecture • Adding Checks • Alert Setup • Contributing


A self-hosted, Docker-Compose-ready observability layer that gives small data teams the 5 core observability pillars — Freshness, Volume, Quality, Schema Drift, and Pipeline Health — without needing a paid platform like Monte Carlo or Metaplane.

Who Is This For?

  • 1–5 person data teams at seed/Series-A startups
  • Teams using Airflow or Prefect for orchestration
  • Teams using dbt for transformations
  • Warehouses: PostgreSQL, BigQuery, or Snowflake
  • Pain: pipelines breaking silently, dashboards going stale, no single alert channel

Design Principles

  • Zero vendor lock-in — everything runs on open-source infra you control
  • Plug-in, don't replace — works alongside existing Airflow/dbt setups; no DAG refactoring required
  • Opinionated but minimal — ships with sensible defaults; quickstart in under 10 minutes
  • Progressive complexity — each observability layer is independent; adopt what you need

Features

1. πŸ• Freshness Monitor

Detects stale tables by tracking max(updated_at) and comparing against your SLA thresholds.

2. 📊 Volume Monitor

Tracks row counts per table per DAG run with Z-score anomaly detection against a 7-day rolling average.
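A Z-score check over a short rolling window fits in a few lines; this sketch is illustrative of the approach, not the shipped implementation:

```python
from statistics import mean, stdev

def volume_zscore(history: list[int], latest: int,
                  threshold: float = 3.0) -> dict:
    """Score the latest row count against a rolling window (e.g. last 7 runs)."""
    mu, sigma = mean(history), stdev(history)
    z = 0.0 if sigma == 0 else (latest - mu) / sigma
    return {"zscore": round(z, 2), "anomaly": abs(z) > threshold}
```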

3. ✅ Quality Checks

Ships with pre-built Soda Core and Great Expectations templates for null checks, duplicates, value ranges, and referential integrity.

4. 🔀 Schema Drift Detector

Snapshots information_schema and diffs against previous snapshots. Detects added/removed columns and type changes.
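Conceptually, the diff compares two {column: type} mappings pulled from information_schema.columns; a minimal sketch:

```python
def diff_schemas(old: dict[str, str], new: dict[str, str]) -> dict:
    """Diff two {column_name: data_type} snapshots of one table."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "type_changed": sorted(c for c in common if old[c] != new[c]),
    }
```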

5. 🚀 Pipeline Health

Pulls Airflow/Prefect metrics via REST API and OpenTelemetry. Pre-built Grafana dashboards for success rates, task durations, and SLA misses.
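Pulling run history from Airflow's stable REST API looks roughly like this (a sketch: the endpoint is Airflow's /api/v1, but the URL, auth, and error handling here are simplified assumptions):

```python
import json
import urllib.request

AIRFLOW_URL = "http://localhost:8080"  # assumption: local Airflow webserver

def fetch_dag_runs(dag_id: str, limit: int = 25) -> list[dict]:
    """Fetch recent DAG runs from Airflow's stable REST API."""
    url = f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns?limit={limit}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["dag_runs"]

def success_rate(runs: list[dict]) -> float:
    """Fraction of runs that finished in the 'success' state."""
    return sum(r["state"] == "success" for r in runs) / len(runs) if runs else 0.0
```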

6. 💸 FinOps Tracker

Tracks Snowflake compute credits and BigQuery bytes billed natively, preventing runaway dashboard queries and exploding ETL costs.
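For Snowflake, the tracking query can be as simple as aggregating the ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY view and flagging warehouses over budget. A sketch (the SQL and the budget helper are illustrative, not ObservaKit's exact logic):

```python
# assumption: rows are fetched via your Snowflake connector using this query
CREDITS_SQL = """
SELECT warehouse_name,
       DATE_TRUNC('day', start_time) AS day,
       SUM(credits_used)             AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
"""

def over_budget(rows: list[tuple[str, str, float]],
                daily_budget: float) -> list[str]:
    """Warehouses that exceeded the daily credit budget on any day."""
    return sorted({wh for wh, _day, credits in rows if credits > daily_budget})
```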

7. πŸ› οΈ Native dbt Integration

Parses run_results.json and manifest.json directly into ObservaKit's Postgres database, eliminating the need for third-party dbt packages like Elementary.
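The parsing step boils down to reading the `results` array in run_results.json, where each node carries a `unique_id`, `status`, and `execution_time`. A minimal sketch of the idea (not ObservaKit's actual parser):

```python
import json
from pathlib import Path

def summarize_run_results(path: str) -> dict:
    """Summarize node statuses and timings from dbt's run_results.json."""
    results = json.loads(Path(path).read_text())["results"]
    return {
        "total": len(results),
        "failed": [r["unique_id"] for r in results
                   if r["status"] in ("error", "fail")],
        "slowest": max(results, key=lambda r: r.get("execution_time", 0.0),
                       default={"unique_id": None})["unique_id"],
    }
```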

Tech Stack

| Layer | Tool |
| --- | --- |
| Data Quality | Soda Core + Great Expectations |
| dbt Observability | Native run_results.json parser |
| Pipeline Metrics | OpenTelemetry + Prometheus |
| Dashboards | Grafana |
| Backend API | FastAPI + SQLAlchemy |
| Metadata Store | PostgreSQL |
| Orchestration | Airflow / Prefect REST API |
| Containerisation | Docker Compose |
| Alerting | Slack webhooks, Email (SMTP) |

Quickstart

Prerequisites

  • Docker + Docker Compose
  • Python 3.10+
  • A supported SQL warehouse (PostgreSQL, BigQuery, or Snowflake)

1. Clone the repo

git clone https://github.com/willowvibe/ObservaKit.git
cd ObservaKit

2. Configure

cp .env.example .env
# Edit .env with your warehouse credentials and Airflow URL

3. Start the stack

docker-compose up -d

4. Run the Demo Data Generator (Optional)

To instantly see ObservaKit in action without hooking up your own database, generate 7 days of simulated history and inject data anomalies (like schema drift and volume drops):

make demo

Once it completes, the dashboards immediately populate with simulated pipeline failures and data quality alerts.

5. Open Grafana

Visit http://localhost:3000 (default login: admin / admin). Dashboards are auto-provisioned under the Data Observability folder.

6. Explore the API

Visit http://localhost:8000/docs for the interactive Swagger UI.

7. Add your first quality checks

cp checks/templates/soda/no_nulls_on_pk.yml checks/my_project/orders.yml
# Edit the YAML to point to your table

Checks run every hour by default. Override in config/kit.yml.
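After copying, the edited file might look something like this (SodaCL syntax; the table and column names are placeholders for your own schema):

```yaml
# checks/my_project/orders.yml
checks for orders:
  - missing_count(order_id) = 0      # primary key is never null
  - duplicate_count(order_id) = 0    # primary key is unique
  - row_count > 0                    # table is not empty
  - freshness(updated_at) < 1d       # data landed within the last day
```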

Use Cases

Data Migrations (Zero-Drift Guarantee)

When migrating from a legacy on-prem database to a cloud warehouse (e.g., Postgres to Snowflake), run ObservaKit in parallel against both systems to verify schema parity and volume parity throughout the cutover.

  • Connect ObservaKit to both source and destination.
  • Catch unsupported data type mappings early.
  • Verify that row counts match table-for-table.

This turns ObservaKit into an automated audit layer for complex data migrations.
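The volume-parity step reduces to comparing per-table counts from the two warehouses; a sketch of the idea (helper name and return shape are illustrative):

```python
def volume_parity(source: dict[str, int], dest: dict[str, int]) -> dict:
    """Compare per-table row counts between source and destination."""
    mismatches = {
        table: {"source": count, "dest": dest.get(table)}
        for table, count in source.items()
        if dest.get(table) != count
    }
    return {"parity": not mismatches, "mismatches": mismatches}
```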

Pipeline Audits & Cost Observability

Instantly identify silent failures, stale dashboards, and missed SLA targets. The FinOps Tracker extends the same metadata store with cost observability (Snowflake compute credits and BigQuery bytes billed).

Architecture

flowchart TD
    subgraph Orchestration
        A[Airflow / Prefect]
    end

    subgraph Warehouse
        B[(PostgreSQL / BigQuery / Snowflake)]
    end

    subgraph Kit - Backend
        C[FastAPI Service]
        D[(Metadata Store - Postgres)]
        E[Scheduler - APScheduler]
        F[Schema Diff Engine]
        G[Volume Anomaly Detector]
        H[Freshness Poller]
    end

    subgraph Quality
        I[Soda Core / Great Expectations]
        J[Native dbt parser]
    end

    subgraph Observability Stack
        K[OpenTelemetry Collector]
        L[Prometheus]
        M[Grafana Dashboards]
    end

    subgraph Alerts
        N[Slack / Email / PagerDuty]
    end

    A -- REST API / OTel --> K
    B -- SQL queries --> H
    B -- SQL queries --> G
    B -- information_schema --> F
    B -- check execution --> I
    J -- JSON artifacts --> D
    I -- results --> C
    C --> D
    E --> C
    K --> L
    L --> M
    C -- Prometheus metrics --> L
    C -- alert trigger --> N

Project Structure


ObservaKit/
├── docker-compose.yml
├── .env.example
├── config/
│   ├── kit.yml
│   └── warehouses/
├── checks/
│   ├── templates/
│   └── examples/
├── backend/
│   ├── main.py
│   ├── models.py
│   ├── scheduler.py
│   └── routers/
├── landing-page/       <-- Vite/React GitHub Pages site
├── dbt_integration/    <-- Native parsing logic for dbt artifacts
├── connectors/
├── alerts/
├── otel/
├── prometheus/
├── grafana/
│   ├── dashboards/
│   └── provisioning/
├── tests/
└── docs/

Contributing

Contributions welcome! Please read the guidelines before opening a PR.

Good First Issues

  • Add a new warehouse connector
  • Add a Grafana dashboard for a new use case
  • Write a quality check template for a common schema
  • Improve documentation or quickstart clarity

License

MIT License — free to use, modify, and distribute.


Built by WillowVibe DataSynapse — AI-first data enablement for modern teams.
