jerdaw/healtharchive

HealthArchive.ca – Monorepo

This repository contains the backend services, documentation hub, and in-tree Next.js frontend for HealthArchive.ca. The versioned metadata release repo, healtharchive-datasets, intentionally remains separate.

Public documentation focuses on project purpose, methodology, limitations, and reproducible local development. Deployment is environment-specific and handled outside this public repository; exact host paths, private inventories, and operator-only runbooks are intentionally not documented here.

The backend has three main responsibilities:

  • Run crawl jobs for sources like Health Canada (hc) and PHAC (phac) by calling the archive_tool CLI (which wraps zimit in Docker).
  • Index WARCs into snapshots (URL + timestamp + HTML text, etc.) in a relational database.
  • Expose HTTP APIs that the Next.js frontend uses for search, source summaries, and snapshot viewing.

For an in-depth architecture and implementation walkthrough, see docs/architecture.md. For a step-by-step local live-testing guide, see docs/development/live-testing.md. This README is intentionally shorter and focused on practical usage.

Repository boundaries:

Shared documentation boundary:

  • Private shared-ops documentation is the default home for shared host facts that are not specific to HealthArchive alone:
    • host access posture
    • shared ingress ownership
    • cross-project service inventory
    • host path conventions
    • shared maintenance and hardening state
  • This repo owns the app-specific subset:
    • API/worker/replay behavior
    • backend env vars and automation
    • frontend route behavior, build/runtime wiring, and UI verification
    • safe local development and verification

Historical identity note:

  • Preserved older records may still mention the former repo slug, checkout path, or CLI name from before the 2026-04 repo/runtime identity rename.

Project layout (high level)

.
├── README.md
├── frontend/                 # Next.js app + frontend-specific docs/scripts
├── docs/                     # Documentation source for the current docs portal
│   ├── architecture.md       # Detailed architecture and implementation guide
│   ├── development/          # Local dev + live-testing flows
│   ├── deployment/           # Public deployment overviews only
│   ├── operations/           # Public-safe operations summaries
│   ├── frontend/             # Docs portal bridge to in-tree frontend docs
│   └── datasets-external/    # Link-out pointers to datasets repo/docs
├── mkdocs.yml                # Current documentation navigation source of truth
├── pyproject.toml            # Package + dependency metadata
├── requirements.txt          # Convenience requirements file (mirrors pyproject)
├── alembic/                  # Database migrations
├── src/
│   ├── ha_backend/           # Backend package
│   │   ├── api/              # FastAPI app, public + admin routes
│   │   ├── cli.py            # healtharchive CLI entrypoint
│   │   ├── config.py         # Archive root + DB + tool config
│   │   ├── db.py             # SQLAlchemy engine/session helpers
│   │   ├── indexing/         # WARC discovery, parsing, text extraction, mapping
│   │   ├── job_registry.py   # Per-source job templates (hc, phac)
│   │   ├── jobs.py           # Persistent job runner → archive_tool
│   │   ├── logging_config.py # Shared logging configuration
│   │   ├── models.py         # ORM models (Source, ArchiveJob, Snapshot)
│   │   ├── seeds.py          # Initial Source seeding
│   │   └── worker/           # Long-running worker loop for queued jobs
│   └── archive_tool/         # Crawler/orchestrator subpackage, with its own docs
└── tests/                    # Pytest suite

The archive_tool package started as a separate repository and is now maintained in-tree as the backend's crawler/orchestrator subpackage. It is invoked primarily via its CLI (archive-tool) and integrates closely with the backend's job, worker, and indexing code. Its internal documentation lives under src/archive_tool/docs/documentation.md.


Installation & setup

1. Prerequisites

  • Python 3.11+
  • Docker (required by archive_tool / zimit for crawls)
  • A Python virtual environment (recommended)

2. Install dependencies

From the repo root:

make venv

This provides:

  • healtharchive – backend CLI
  • archive-tool – console script pointing at the in-repo archive_tool package

For the in-tree frontend:

make frontend-install
make contract-sync
make frontend-ci

Node.js 20.19+ is required for frontend/.

For a full smoke test of backend and frontend from the same checkout:

make integration-e2e

make contract-sync regenerates docs/openapi.json and the frontend's generated API types so the backend schema remains the source of truth.

3. Database

By default the backend uses a SQLite database at sqlite:///healtharchive.db, a file in the repo root. Set HEALTHARCHIVE_DATABASE_URL to point it at a different database.
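The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the code in ha_backend.config: the helper names resolve_database_url and is_sqlite are hypothetical, and treating an empty variable as unset is an assumption.

```python
import os

# Default taken from this README; the helper names below are hypothetical.
DEFAULT_DATABASE_URL = "sqlite:///healtharchive.db"


def resolve_database_url(environ=None):
    """Return the configured database URL, falling back to the SQLite default."""
    env = os.environ if environ is None else environ
    # Assumption: an empty value is treated the same as an unset variable.
    return env.get("HEALTHARCHIVE_DATABASE_URL") or DEFAULT_DATABASE_URL


def is_sqlite(url):
    """True when the URL's scheme selects the SQLite dialect."""
    return url.split(":", 1)[0].startswith("sqlite")
```

Calling resolve_database_url() with no arguments reads the real process environment; passing a dict makes the behaviour easy to check in isolation.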

To verify connectivity:

healtharchive check-db

For production, you will typically point HEALTHARCHIVE_DATABASE_URL at a Postgres instance and run Alembic migrations:

alembic upgrade head

For local development it is common to isolate everything under the repo directory:

export HEALTHARCHIVE_DATABASE_URL=sqlite:///$(pwd)/.dev-healtharchive.db
export HEALTHARCHIVE_ARCHIVE_ROOT=$(pwd)/.dev-archive-root
export HEALTHARCHIVE_ADMIN_TOKEN=localdev-admin  # optional for admin routes
# Optional CORS overrides (defaults already cover localhost + prod domains)
# export HEALTHARCHIVE_CORS_ORIGINS=http://localhost:3000,http://localhost:5173
alembic upgrade head

If you want to use Postgres locally via Docker for testing:

docker run --name ha-pg \
  -e POSTGRES_USER=healtharchive \
  -e POSTGRES_PASSWORD=healtharchive \
  -e POSTGRES_DB=healtharchive \
  -p 5432:5432 -d postgres:16

export HEALTHARCHIVE_DATABASE_URL=postgresql+psycopg://healtharchive:healtharchive@localhost:5432/healtharchive
alembic upgrade head

4. Archive root & archive_tool

The backend writes job output under an archive root directory:

  • HEALTHARCHIVE_ARCHIVE_ROOT (env) or
  • a project-local default from ha_backend.config.

For local development, point the archive root at an isolated directory under your checkout:

export HEALTHARCHIVE_ARCHIVE_ROOT=$(pwd)/.dev-archive-root

To verify the archive root and archive_tool:

healtharchive check-env           # shows archive root and checks writability
healtharchive check-archive-tool  # runs 'archive-tool --help'
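For readers curious what a writability check involves, here is a rough sketch. It is not the implementation behind healtharchive check-env; the function name check_archive_root is hypothetical, and creating the directory when it is missing is an assumption.

```python
import tempfile
from pathlib import Path


def check_archive_root(root):
    """Return (resolved_path, writable) for a candidate archive root."""
    path = Path(root).expanduser().resolve()
    try:
        # Assumption: a missing directory is created rather than reported.
        path.mkdir(parents=True, exist_ok=True)
        # Creating and auto-deleting a temporary file is a direct write probe.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return path, False
    return path, True
```

A probe file is used instead of os.access because it exercises the same code path a real crawl would hit when writing output.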

5. Optional Git hooks

This repo uses pre-commit and provides an optional pre-push hook helper:

pre-commit install
bash scripts/install-pre-push-hook.sh

Running the API

The FastAPI app lives at ha_backend.api:app. Once your virtualenv and DB are configured:

uvicorn ha_backend.api:app --reload

Key public endpoints (all prefixed with /api):

  • GET /api/health – Basic health check (status, DB connectivity, job and snapshot counts).

  • GET /api/stats – Lightweight public archive stats for the frontend (snapshot totals, unique page count, source count, latest capture date).

  • GET /api/sources – Per-source summaries derived from indexed snapshots.

    When HEALTHARCHIVE_REPLAY_BASE_URL is set, the response also includes:

    • entryRecordId – a “best effort” entry-point snapshot ID for browsing a source
    • entryBrowseUrl – a timestamp-locked replay URL for that entry point

  • GET /api/search – Full-text style search over snapshots (with filters for source, pagination, etc.).

    When HEALTHARCHIVE_REPLAY_BASE_URL is set, each result may include:

    • jobId + captureTimestamp – used to lock replay to a specific capture
    • browseUrl – a timestamp-locked replay URL for browsing within the archive

    Ranking controls:

    • Default ranking is controlled by HA_SEARCH_RANKING_VERSION (v1 or v2).
    • Per-request override: add ranking=v1|v2 to /api/search.

  • GET /api/snapshot/{id} – Snapshot metadata for a single record.

    When HEALTHARCHIVE_REPLAY_BASE_URL is set, this includes a browseUrl suitable for embedding a replay engine (pywb) for full-fidelity browsing.

  • GET /api/usage – Aggregated usage metrics (daily counts) for public reporting.

  • GET /api/changes – Precomputed change events feed (filters by source, edition/job, date range).

  • GET /api/changes/compare – Precomputed diff between two adjacent captures (A → B).

  • GET /api/changes/rss – RSS feed for the latest edition-aware change events.

  • GET /api/exports – Export manifest describing available research exports and limits.

  • GET /api/exports/snapshots – Snapshot metadata export (JSONL/CSV, metadata-only).

  • GET /api/exports/changes – Change event export (JSONL/CSV, metadata-only).

  • POST /api/reports – Public issue intake endpoint for broken snapshots, metadata errors, missing coverage, or takedown requests.

  • GET /api/snapshots/raw/{id} – Returns the archived HTML document for embedding in the frontend.

  • GET /api/snapshots/{id}/timeline – Timeline of captures for the same normalized URL group.
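As an illustration of the per-request ranking override on /api/search, the sketch below builds a request URL. Only the ranking parameter and its v1/v2 values come from this README; the q and source parameter names and the build_search_url helper are assumptions, so check docs/openapi.json for the real names.

```python
from urllib.parse import urlencode

VALID_RANKINGS = ("v1", "v2")


def build_search_url(base_url, query, ranking=None, source=None):
    """Build a /api/search URL with an optional ranking override."""
    if ranking is not None and ranking not in VALID_RANKINGS:
        raise ValueError(f"ranking must be one of {VALID_RANKINGS}")
    params = {"q": query}  # 'q' is an assumed parameter name
    if source:
        params["source"] = source  # 'source' is an assumed parameter name
    if ranking:
        params["ranking"] = ranking  # documented per-request override
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"
```

Omitting ranking leaves the choice to the server-side default set by HA_SEARCH_RANKING_VERSION.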

Operator-only endpoints exist for job administration and service telemetry. They are not part of the public API contract, require admin-token protection outside local development, and are intentionally summarized here rather than listed as public documentation.

Dev .env helper

For convenience, you can copy .env.example to .env (git-ignored) and source it in your shell:

cp .env.example .env
source .env
alembic upgrade head
uvicorn ha_backend.api:app --reload --port 8001

Do not commit real secrets in .env; use host-managed env vars for staging/prod.


Search evaluation tools

This repo includes lightweight scripts to capture and compare search results:

  • Capture a standard query set (v1 vs v2):
    • ./scripts/search-eval-capture.sh --out-dir /tmp/ha-search-eval --page-size 20 --ranking v1
    • ./scripts/search-eval-capture.sh --out-dir /tmp/ha-search-eval --page-size 20 --ranking v2
  • Diff two capture directories:
    • python ./scripts/search-eval-diff.py --a /tmp/ha-search-eval/<TS_A> --b /tmp/ha-search-eval/<TS_B> --top 20
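To give a sense of what comparing two rankings can involve, here is a small sketch that compares the top results of two ranked lists. It is not the logic of scripts/search-eval-diff.py, whose output format is not described here; the function name and the returned keys are hypothetical.

```python
def compare_rankings(results_a, results_b, top=20):
    """Summarise how the top-N of ranking B differs from ranking A."""
    top_a, top_b = results_a[:top], results_b[:top]
    set_a, set_b = set(top_a), set(top_b)
    rank_b = {item: position for position, item in enumerate(top_b)}
    # Positive shift: the item sits lower in B; negative: it moved up.
    moved = {
        item: rank_b[item] - position
        for position, item in enumerate(top_a)
        if item in rank_b and rank_b[item] != position
    }
    return {
        "only_in_a": [item for item in top_a if item not in set_b],
        "only_in_b": [item for item in top_b if item not in set_a],
        "moved": moved,
    }
```

The inputs can be any hashable identifiers, for example snapshot IDs taken from two captured result pages for the same query.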

Docs:

  • docs/operations/search-quality.md
  • docs/operations/search-golden-queries.md

CORS / frontend origins

The API enables CORS for the public endpoints. Allowed origins come from HEALTHARCHIVE_CORS_ORIGINS (comma-separated). Defaults cover local dev and production:

http://localhost:3000, http://localhost:5173, https://healtharchive.ca, https://www.healtharchive.ca

Set HEALTHARCHIVE_CORS_ORIGINS when your frontend runs on a different host or port (e.g., a preview/staging domain). Admin routes remain token-gated even when CORS is enabled.
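The comma-separated format and the default list can be sketched like this. It is an illustration under assumptions rather than the backend's actual parsing code: the helper names are hypothetical, and ignoring empty entries and surrounding whitespace is an assumption.

```python
import os

# Defaults as listed in this README.
DEFAULT_CORS_ORIGINS = (
    "http://localhost:3000",
    "http://localhost:5173",
    "https://healtharchive.ca",
    "https://www.healtharchive.ca",
)


def allowed_origins(environ=None):
    """Parse HEALTHARCHIVE_CORS_ORIGINS, falling back to the defaults."""
    env = os.environ if environ is None else environ
    raw = env.get("HEALTHARCHIVE_CORS_ORIGINS", "")
    # Assumption: blank entries and surrounding whitespace are ignored.
    origins = [part.strip() for part in raw.split(",") if part.strip()]
    return origins or list(DEFAULT_CORS_ORIGINS)


def is_origin_allowed(origin, environ=None):
    """Exact-match check of a browser Origin against the allowed list."""
    return origin in allowed_origins(environ)
```

Note that origins are compared as exact strings, so scheme and port matter: http://localhost:3000 and http://localhost:3001 are different origins.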


Running the worker

The worker process polls for queued jobs and runs both the crawl (archive_tool) and the indexing pipeline.

Start it via the CLI:

healtharchive start-worker

Options:

  • --poll-interval SECONDS – sleep delay when no work is found (default 30).
  • --once – process at most one job and exit (useful for cron / debugging).

The worker:

  • Looks for ArchiveJob rows with status in ("queued", "retryable").
  • Runs run_persistent_job(job_id) which calls archive_tool as a subprocess.
  • On crawl success, runs index_job(job_id) to ingest WARCs into Snapshots.
  • Applies a simple retry policy (MAX_CRAWL_RETRIES) before marking jobs permanently failed.
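The loop described above can be sketched in a few lines. This is a simplified in-memory model, not the code in ha_backend.worker: the Job class, the callbacks, and the retry_count field are hypothetical, the value of MAX_CRAWL_RETRIES is illustrative, and intermediate statuses such as a running or completed state are left out.

```python
import time
from dataclasses import dataclass

MAX_CRAWL_RETRIES = 2  # illustrative value; the backend's real limit may differ


@dataclass
class Job:
    id: int
    status: str = "queued"
    retry_count: int = 0


def process_one(jobs, run_crawl, index_job):
    """Process at most one pending job; return it, or None when idle."""
    job = next((j for j in jobs if j.status in ("queued", "retryable")), None)
    if job is None:
        return None
    if run_crawl(job):
        # Indexing only runs after a successful crawl.
        job.status = "indexed" if index_job(job) else "index_failed"
    elif job.retry_count < MAX_CRAWL_RETRIES:
        job.retry_count += 1
        job.status = "retryable"
    else:
        job.status = "failed"
    return job


def run_worker(jobs, run_crawl, index_job, poll_interval=30, once=False):
    """Poll for work; with once=True, handle at most one job and return."""
    while True:
        job = process_one(jobs, run_crawl, index_job)
        if once:
            return job
        if job is None:
            time.sleep(poll_interval)  # sleep only when no work was found
```

In this model a job whose crawl keeps failing is attempted once plus MAX_CRAWL_RETRIES more times before it is marked failed.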

Creating and managing jobs

The backend exposes a small CLI layer for managing ArchiveJob rows.

Seed sources

Ensure Source rows for hc and phac exist:

healtharchive seed-sources

Create a job from registry defaults

For example, a monthly Health Canada job:

healtharchive create-job --source hc

This:

  • Uses the SourceJobConfig for hc (seeds, naming template, tool options).
  • Creates an ArchiveJob row with status="queued" and a unique output_dir.

Run a specific DB-backed job once

healtharchive run-db-job --id 42

This calls archive_tool with the stored seeds, output_dir, and tool options. It updates status, timestamps, and crawler_exit_code.

Index an existing job

If you ran a crawl separately and just want to index WARCs:

healtharchive index-job --id 42

If you have an existing archive_tool output directory on disk (e.g. from a manual run) and want to attach it to the DB for indexing, use:

healtharchive register-job-dir --source hc --output-dir /path/to/job_dir [--name NAME]
healtharchive index-job --id <printed ID>

Permissions note: crawls run as root inside Docker. The registry defaults now enable relax_perms so temp WARCs are chmod’d readable after the crawl, allowing indexing without a host-side sudo chown. If you disable relax_perms, you may need to chown .tmp* before indexing.

Compute change events (diffs)

Change events are computed ahead of time, off the request path, and stored so the API can serve them without diffing on demand.

# Incremental (last 30 days by default)
healtharchive compute-changes --max-events 200

# Backfill historical changes
healtharchive compute-changes --backfill --max-events 500

These commands populate snapshot_changes rows used by /api/changes and /api/changes/compare.

List and inspect jobs

healtharchive list-jobs
healtharchive show-job --id 42

Validate a job's configuration (dry-run)

To check that a job's configuration is coherent (seeds, tool options, and zimit args) without running a crawl, invoke the integrated archive_tool CLI in dry-run mode:

healtharchive validate-job-config --id 42

This:

  • Reconstructs the archive_tool CLI arguments from ArchiveJob.config.
  • Runs archive-tool with --dry-run so it validates the configuration and prints a summary.
  • Does not change the job's status or timestamps.

Retry and cleanup

  • Retry a failed crawl or reindex:

    healtharchive retry-job --id 42
    • For status="failed" → sets status="retryable" for another crawl.
    • For status="index_failed" → sets status="completed" so indexing can re-run.
    • For other statuses, the command logs that there is nothing to retry.
  • Cleanup temp dirs and state for an indexed or index_failed job:

    healtharchive cleanup-job --id 42

    This:

    • Uses archive_tool’s CrawlState and cleanup_temp_dirs(...) to delete .tmp* directories and the .archive_state.json file under output_dir.
    • Leaves the job directory and any final ZIM in place.
    • Updates ArchiveJob.cleanup_status = "temp_cleaned" and cleaned_at when there was actually a state file and/or temp dirs to remove.
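The file-system effect of the cleanup step can be approximated as follows. This sketch does not use archive_tool's CrawlState or cleanup_temp_dirs(...); it only mirrors the described outcome (remove .tmp* directories and .archive_state.json, leave everything else), and the function name is hypothetical.

```python
import shutil
from pathlib import Path

STATE_FILE = ".archive_state.json"


def cleanup_temp_artifacts(output_dir):
    """Delete .tmp* directories and the state file; return what was removed."""
    root = Path(output_dir)
    removed = []
    for candidate in sorted(root.glob(".tmp*")):
        if candidate.is_dir():
            shutil.rmtree(candidate)  # destructive: WARCs inside are lost
            removed.append(candidate.name)
    state = root / STATE_FILE
    if state.is_file():
        state.unlink()
        removed.append(state.name)
    return removed
```

An empty return value corresponds to the case where there was nothing to clean, which is when the README says cleanup_status is left unchanged.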

Note: cleanup-job is destructive for temporary crawl artifacts (including WARCs under .tmp*). Only run it after you are confident the job has been fully indexed (or indexing has failed in a way you do not plan to recover from) and any desired ZIMs or exports are verified.
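Returning to retry-job, the status transitions listed earlier in this section amount to a small lookup table. The sketch below is illustrative only; the function and constant names are hypothetical.

```python
# Status transitions applied by retry-job, as described in this README.
RETRY_TRANSITIONS = {
    "failed": "retryable",        # crawl will be attempted again
    "index_failed": "completed",  # indexing can re-run on the finished crawl
}


def next_status_for_retry(status):
    """Return the new status for a retry, or None if nothing should change."""
    return RETRY_TRANSITIONS.get(status)
```

A None result matches the documented behaviour for other statuses, where the command only logs that there is nothing to retry.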


Configuration (environment variables)

The backend reads configuration from environment variables with sensible defaults:

  • HEALTHARCHIVE_DATABASE_URL – SQLAlchemy URL for the DB. Defaults to sqlite:///healtharchive.db in the repo root.

  • HEALTHARCHIVE_ARCHIVE_ROOT – Base directory for job output dirs (passed as --output-dir to archive_tool). Defaults to the value configured in ha_backend.config. For local development, set it explicitly to a git-ignored directory under your checkout, such as $(pwd)/.dev-archive-root.

  • HEALTHARCHIVE_TOOL_CMD – Command used to invoke the archiver. Defaults to archive-tool.

  • HEALTHARCHIVE_ENV – High-level environment hint used by admin auth. Recognised values:

    • "development" (default when unset): admin endpoints are open when HEALTHARCHIVE_ADMIN_TOKEN is unset (dev convenience).
    • "staging" or "production": admin endpoints fail closed with HTTP 500 if HEALTHARCHIVE_ADMIN_TOKEN is not configured.

  • HEALTHARCHIVE_ADMIN_TOKEN – Optional admin token. If set, /api/admin/* and /metrics require either:

    • Authorization: Bearer <token> or
    • X-Admin-Token: <token>

    If unset and HEALTHARCHIVE_ENV is "development" (or unset), admin endpoints are open (intended only for local development). In staging and production you should always set a long, random token and store it as a secret in your hosting platform (never committed to the repo); when HEALTHARCHIVE_ENV is "staging" or "production" and this token is missing, admin and metrics endpoints return HTTP 500.

  • HEALTHARCHIVE_LOG_LEVEL – Global log level (DEBUG, INFO, etc.). Defaults to INFO.

  • HEALTHARCHIVE_CORS_ORIGINS – Comma-separated list of allowed Origins for CORS on the public API routes. If unset, a built-in default is used:

    • http://localhost:3000
    • http://localhost:5173
    • https://healtharchive.ca
    • https://www.healtharchive.ca

    In hosted environments, set this explicitly so that only expected frontend hosts can call the API from a browser. Examples:

    • Canonical public frontend:

      export HEALTHARCHIVE_CORS_ORIGINS="https://healtharchive.ca,https://www.healtharchive.ca"

    • Optional preview/historical frontend origin:

      export HEALTHARCHIVE_CORS_ORIGINS="https://healtharchive.vercel.app"

      Keep this only if you still intentionally use an old Vercel-hosted preview frontend. It is not part of the current production path.

    You can also include http://localhost:3000 if you want local development to talk directly to a remote API instance.

Deployment details are environment-specific and intentionally kept outside the public README. Public documentation in this repo covers local setup, architecture, data methodology, and the externally consumable API behavior.
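The interaction between HEALTHARCHIVE_ENV and HEALTHARCHIVE_ADMIN_TOKEN described above can be summarised as a small decision function. This is a sketch under assumptions, not the backend's auth code: the function name and return labels are hypothetical, the response for a wrong or missing token is not specified in this README (it is labelled "deny" here), and unrecognised environment values are assumed to fail closed.

```python
import hmac


def admin_access(env, configured_token, headers):
    """Return 'allow', 'deny', or 'misconfigured' (the HTTP 500 case)."""
    env_name = (env or "development").lower()
    if not configured_token:
        # Open in development; fail closed everywhere else (assumption for
        # values other than "staging" and "production").
        return "allow" if env_name == "development" else "misconfigured"
    presented = headers.get("X-Admin-Token")
    authorization = headers.get("Authorization", "")
    if presented is None and authorization.startswith("Bearer "):
        presented = authorization[len("Bearer "):]
    # Constant-time comparison avoids leaking token contents via timing.
    if presented is not None and hmac.compare_digest(
        presented.encode(), configured_token.encode()
    ):
        return "allow"
    return "deny"
```

The headers argument is a plain dict with exact-case keys, which keeps the sketch short; a real framework would normally provide case-insensitive header lookup.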


Continuous integration

A GitHub Actions workflow (.github/workflows/backend-ci.yml) is intended to run on pushes to main and on pull requests. It:

  • Checks out the repository.
  • Sets up Python 3.11.
  • Runs make ci (fast gate: format check, lint, typecheck, tests).
  • Runs an end-to-end smoke test (backend + frontend) from the same checkout.

A separate nightly/manual workflow (.github/workflows/backend-ci-full.yml) runs make check-full, which includes coverage-critical checks, docs checks, pre-commit hooks, and security scans. That broader full gate is useful before deploys, but it is not the default PR-blocking backend CI path today.

The CI job uses a temporary SQLite database via:

HEALTHARCHIVE_DATABASE_URL=sqlite:///./ci-healtharchive.db

so no external DB or Docker services are required. Crawls are not executed in CI; tests focus on unit-level behavior (DB models, APIs, job orchestration, etc.).


Detailed architecture

For a full walkthrough of:

  • ORM models and status lifecycle
  • Job registry and how per-source jobs are configured
  • archive_tool integration and adaptive strategies
  • Indexing pipeline and snapshot schema
  • HTTP API routes and JSON schemas
  • Worker loop and retry semantics
  • Cleanup and retention strategy (future)
  • How the backend integrates with the in-repo archive_tool crawler

see docs/architecture.md.

Frontend integration smoke test

Once a frontend is pointed at this backend (via NEXT_PUBLIC_API_BASE_URL on the frontend side and HEALTHARCHIVE_CORS_ORIGINS here), you can perform a quick end-to-end smoke test:

  1. Verify API health from the frontend host

    From a shell:

    curl -i "$API_BASE_URL/api/health"
    curl -i "$API_BASE_URL/api/sources"

    You should see HTTP 200 responses and JSON bodies. If you add an Origin header matching the frontend (e.g. https://healtharchive.ca), the response should include:

    Access-Control-Allow-Origin: https://healtharchive.ca
    Vary: Origin
    
  2. Exercise the UI

    From the frontend domain (staging or production):

    • Visit /archive:
      • With the backend up, the filters should show Filters (live API) and search/pagination should be backed by real snapshot data.
      • If you intentionally stop the backend (in staging), the UI should show a small “Backend unreachable” banner (when enabled) and fall back to the demo dataset with a clear notice.
    • Visit /archive/browse-by-source and /snapshot/[id] to confirm source summaries and snapshot details load correctly against the live API.
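The header check from step 1 can also be automated. The helper below is a hypothetical sketch that inspects response headers you have already fetched; it does not make any network requests itself.

```python
def cors_headers_ok(headers, origin):
    """Check that a response echoes the origin and varies on Origin."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    allow = normalized.get("access-control-allow-origin")
    vary = [part.strip().lower() for part in normalized.get("vary", "").split(",")]
    return allow == origin and "origin" in vary
```

The headers mapping could be built from any HTTP client, for example dict(response.headers.items()) on a urllib.request response obtained with an Origin request header set.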

The archive_tool subpackage also has its own detailed documentation in src/archive_tool/docs/documentation.md describing its internal state machine and Docker orchestration, and how it cooperates with the backend.
