Sefaria Status Monitor

Real-time uptime monitoring system for Sefaria's critical services.

Features

Health Checking: Parallel HTTP checks with configurable retries (sefaria.org, MCP Server, AI Chatbot, Linker)
Async E2E Verification: Two-phase check for the Linker API — verifies task submission and successful processing, with retries
Consecutive Failure Threshold: Per-service configurable threshold — requires N consecutive failed check cycles before alerting, filtering brief blips
State Tracking: Detects UP/DOWN transitions to prevent alert storms
Slack Alerts: Block Kit notifications on confirmed outages, with accurate outage start time and downtime duration on recovery
Status Page: Public dashboard at status.sefaria.org with 60s auto-refresh
Scheduled Cleanup: Automatic daily purging of old records at 3 AM UTC

Quick Start

Local Development

# Clone and setup
git clone <repository-url>
cd sefaria-status
python -m venv .venv
.venv\Scripts\activate  # Windows
# source .venv/bin/activate  # Linux/Mac

# Install dependencies
pip install -r requirements.txt

# Setup database
python manage.py migrate
python manage.py createsuperuser

# Run development server
python manage.py runserver

# Run health check scheduler
python manage.py run_checks

Docker Deployment

# Copy and configure environment
cp .env.example .env
# Edit .env with your settings

# Build and start
docker compose up -d

# View logs
docker compose logs -f scheduler

Configuration

Environment Variables

Variable	Description	Default
`SECRET_KEY`	Django secret key	Required
`DEBUG`	Enable debug mode	`False`
`ALLOWED_HOSTS`	Comma-separated hosts	`status.sefaria.org`
`DATABASE_URL`	PostgreSQL connection URL	SQLite (dev)
`SLACK_WEBHOOK_URL`	Slack incoming webhook	-
`SLACK_CHANNEL`	Alert channel name	`sefaria-down`
`STATUS_PAGE_URL`	Public status page URL	-
`HEALTH_CHECK_INTERVAL`	Check frequency (seconds)	`60`
`HEALTH_CHECK_RETRIES`	Retry attempts per check	`3`
`HEALTH_CHECK_RETRY_DELAY`	Delay between retries (seconds)	`10`
`ALERT_AFTER_CONSECUTIVE_FAILURES`	Default consecutive failures before alerting	`2`
`HEALTH_CHECK_RETENTION_DAYS`	Days to keep records	`60`

Monitored Services

Configured in config/settings/base.py:

MONITORED_SERVICES = [
    {
        "name": "sefaria.org",
        "url": "https://www.sefaria.org/healthz",
        "method": "GET",
        "follow_redirects": True,
        "expected_status": 200,
        "failure_threshold": 2,  # alert after 2 consecutive failed cycles
    },
    {
        "name": "MCP Server",
        "url": "https://mcp.sefaria.org/healthz",
        "expected_status": 200,
        "failure_threshold": 2,
    },
    {
        "name": "AI Chatbot",
        "url": "https://chat-dev.sefaria.org/api/health",
        "expected_status": 200,
        "failure_threshold": 2,
    },
    {
        "name": "Linker",
        "url": "https://www.sefaria.org/api/find-refs",
        "method": "POST",
        "expected_status": 202,
        "check_type": "async_two_phase",
        "failure_threshold": 3,  # noisiest service, higher threshold
        "request_body": {"text": {"title": "", "body": "Job 1:1"}},
        "async_verification": {
            "base_url": "https://www.sefaria.org/api/async/",
            "max_poll_attempts": 10,
            "poll_interval": 1,
        },
    },
]

Each service's failure_threshold sets how many consecutive failed check cycles are required before a DOWN alert is sent to Slack. This filters brief blips that self-resolve. Recovery alerts always fire immediately on the first successful check. Falls back to ALERT_AFTER_CONSECUTIVE_FAILURES (default 2) if not set per service.

Management Commands

# Run health check scheduler (includes auto-cleanup at 3 AM UTC)
python manage.py run_checks

# Run once (for testing)
python manage.py run_checks --once

# Manual cleanup (runs automatically via scheduler)
python manage.py cleanup_old_checks

# Dry run cleanup
python manage.py cleanup_old_checks --dry-run

Testing

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=monitoring --cov-report=html

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   APScheduler   │────▶│  Health Checker │────▶│   PostgreSQL    │
│   (run_checks)  │     │    (httpx)      │     │   (HealthCheck) │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                               │
                               ▼
                        ┌─────────────────┐
                        │  State Tracker  │
                        │ (UP/DOWN detect)│
                        └─────────────────┘
                               │
                               ▼
                        ┌─────────────────┐
                        │  Slack Alerter  │
                        │  (Block Kit)    │
                        └─────────────────┘

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
config		config
monitoring		monitoring
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
manage.py		manage.py
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sefaria Status Monitor

Features

Quick Start

Local Development

Docker Deployment

Configuration

Environment Variables

Monitored Services

Management Commands

Testing

Architecture

License

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Sefaria/down-detector

Folders and files

Latest commit

History

Repository files navigation

Sefaria Status Monitor

Features

Quick Start

Local Development

Docker Deployment

Configuration

Environment Variables

Monitored Services

Management Commands

Testing

Architecture

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages