HealthArchive.ca – Monorepo

This repository contains the backend services, documentation hub, and in-tree Next.js frontend for HealthArchive.ca. The versioned metadata release repo, healtharchive-datasets, intentionally remains separate.

Public documentation focuses on project purpose, methodology, limitations, and reproducible local development. Deployment is environment-specific and handled outside this public repository; exact host paths, private inventories, and operator-only runbooks are intentionally not documented here.

The backend has three main responsibilities:

- Run crawl jobs for sources like Health Canada (hc) and PHAC (phac) by calling the archive_tool CLI (which wraps zimit in Docker).
- Index WARCs into snapshots (URL + timestamp + HTML text, etc.) in a relational database.
- Expose HTTP APIs that the Next.js frontend uses for search, source summaries, and snapshot viewing.

For a deep architecture and implementation walkthrough, see docs/architecture.md. For a step-by-step local live-testing guide, see docs/development/live-testing.md. This README is intentionally shorter and focused on practical usage.

Repository boundaries:

- Frontend UI: lives in frontend/ in this repo
- Datasets: https://github.com/jerdaw/healtharchive-datasets
- Documentation site: run make docs-serve in this repo for a searchable web UI

Shared documentation boundary:

Private shared-ops documentation is the default home for shared host facts that are not specific to HealthArchive alone:
- host access posture
- shared ingress ownership
- cross-project service inventory
- host path conventions
- shared maintenance and hardening state
This repo owns the app-specific subset:
- API/worker/replay behavior
- backend env vars and automation
- frontend route behavior, build/runtime wiring, and UI verification
- safe local development and verification

Historical identity note:

Preserved older records may still mention the former repo slug, checkout path, or CLI name from before the 2026-04 repo/runtime identity rename.

Project layout (high level)

.
├── README.md
├── frontend/                 # Next.js app + frontend-specific docs/scripts
├── docs/                     # Documentation source for the current docs portal
│   ├── architecture.md       # Detailed architecture and implementation guide
│   ├── development/          # Local dev + live-testing flows
│   ├── deployment/           # Public deployment overviews only
│   ├── operations/           # Public-safe operations summaries
│   ├── frontend/             # Docs portal bridge to in-tree frontend docs
│   └── datasets-external/    # Link-out pointers to datasets repo/docs
├── mkdocs.yml                # Current documentation navigation source of truth
├── pyproject.toml            # Package + dependency metadata
├── requirements.txt          # Convenience requirements file (mirrors pyproject)
├── alembic/                  # Database migrations
├── src/
│   ├── ha_backend/           # Backend package
│   │   ├── api/              # FastAPI app, public + admin routes
│   │   ├── cli.py            # healtharchive CLI entrypoint
│   │   ├── config.py         # Archive root + DB + tool config
│   │   ├── db.py             # SQLAlchemy engine/session helpers
│   │   ├── indexing/         # WARC discovery, parsing, text extraction, mapping
│   │   ├── job_registry.py   # Per-source job templates (hc, phac)
│   │   ├── jobs.py           # Persistent job runner → archive_tool
│   │   ├── logging_config.py # Shared logging configuration
│   │   ├── models.py         # ORM models (Source, ArchiveJob, Snapshot)
│   │   ├── seeds.py          # Initial Source seeding
│   │   └── worker/           # Long-running worker loop for queued jobs
│   └── archive_tool/         # Crawler/orchestrator subpackage, with its own docs
└── tests/                    # Pytest suite

The archive_tool package started as a separate repository and is now maintained in-tree as the backend's crawler/orchestrator subpackage. It is invoked primarily via its CLI (archive-tool) and integrates closely with the backend's job, worker, and indexing code. Its internal documentation lives under src/archive_tool/docs/documentation.md.

Installation & setup

1. Prerequisites

- Python 3.11+
- Docker (required by archive_tool / zimit for crawls)
- A Python virtual environment (recommended)

2. Install dependencies

From the repo root:

make venv

This provides:

- healtharchive: backend CLI
- archive-tool: console script pointing at the in-repo archive_tool package

For the in-tree frontend:

make frontend-install
make contract-sync
make frontend-ci

Node.js 20.19+ is required for frontend/.

For a full same-checkout smoke of backend + frontend:

make integration-e2e

make contract-sync regenerates docs/openapi.json and the frontend's generated API types so the backend schema remains the source of truth.

3. Database

By default the backend uses a SQLite file at sqlite:///healtharchive.db in the repo root, or whatever you point HEALTHARCHIVE_DATABASE_URL at.

To verify connectivity:

healtharchive check-db

For production, you will typically point HEALTHARCHIVE_DATABASE_URL at a Postgres instance and run Alembic migrations:

alembic upgrade head

For local development it is common to isolate everything under the repo directory:

export HEALTHARCHIVE_DATABASE_URL=sqlite:///$(pwd)/.dev-healtharchive.db
export HEALTHARCHIVE_ARCHIVE_ROOT=$(pwd)/.dev-archive-root
export HEALTHARCHIVE_ADMIN_TOKEN=localdev-admin  # optional for admin routes
# Optional CORS overrides (defaults already cover localhost + prod domains)
# export HEALTHARCHIVE_CORS_ORIGINS=http://localhost:3000,http://localhost:5173
alembic upgrade head
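
As a rough illustration of what the exports above set up, the following Python sketch builds the same three variables for a given checkout directory. The function name local_dev_env is hypothetical and not part of the backend; only the variable names and file names come from this README.

```python
from pathlib import Path


def local_dev_env(checkout: Path) -> dict[str, str]:
    """Build the isolated local-development variables for one checkout.

    Hypothetical helper for illustration; the backend does not ship it.
    """
    db_file = checkout / ".dev-healtharchive.db"
    return {
        # "sqlite:///" followed by an absolute POSIX path gives four slashes.
        "HEALTHARCHIVE_DATABASE_URL": f"sqlite:///{db_file.as_posix()}",
        "HEALTHARCHIVE_ARCHIVE_ROOT": (checkout / ".dev-archive-root").as_posix(),
        # Optional; only needed when exercising admin routes locally.
        "HEALTHARCHIVE_ADMIN_TOKEN": "localdev-admin",
    }
```

The resulting dictionary could be merged into os.environ before launching alembic or uvicorn from a script, which keeps the database and archive root inside the checkout.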

If you want to use Postgres locally via Docker for testing:

docker run --name ha-pg \
  -e POSTGRES_USER=healtharchive \
  -e POSTGRES_PASSWORD=healtharchive \
  -e POSTGRES_DB=healtharchive \
  -p 5432:5432 -d postgres:16

export HEALTHARCHIVE_DATABASE_URL=postgresql+psycopg://healtharchive:healtharchive@localhost:5432/healtharchive
alembic upgrade head

4. Archive root & archive_tool

The backend writes job output under an archive root directory:

- HEALTHARCHIVE_ARCHIVE_ROOT (environment variable), or
- a project-local default from ha_backend.config.

For local development, point the archive root at an isolated directory under your checkout:

export HEALTHARCHIVE_ARCHIVE_ROOT=$(pwd)/.dev-archive-root

To verify the archive root and archive_tool:

healtharchive check-env           # shows archive root and checks writability
healtharchive check-archive-tool  # runs 'archive-tool --help'

5. Optional Git hooks

This repo uses pre-commit and provides an optional pre-push hook helper:

pre-commit install
bash scripts/install-pre-push-hook.sh

Running the API

The FastAPI app lives at ha_backend.api:app. Once your virtualenv and DB are configured:

uvicorn ha_backend.api:app --reload

Key public endpoints (all prefixed with /api):

- GET /api/health: Basic health check (status, DB connectivity, job and snapshot counts).
- GET /api/stats: Lightweight public archive stats for the frontend (snapshot totals, unique page count, source count, latest capture date).
- GET /api/sources: Per-source summaries derived from indexed snapshots. When HEALTHARCHIVE_REPLAY_BASE_URL is set, the response also includes:
  - entryRecordId: a “best effort” entry-point snapshot ID for browsing a source
  - entryBrowseUrl: a timestamp-locked replay URL for that entry point
- GET /api/search: Full-text style search over snapshots (with filters for source, pagination, etc.). When HEALTHARCHIVE_REPLAY_BASE_URL is set, each result may include:
  - jobId + captureTimestamp: used to lock replay to a specific capture
  - browseUrl: a timestamp-locked replay URL for browsing within the archive

  Ranking controls:
  - Default ranking is controlled by HA_SEARCH_RANKING_VERSION (v1 or v2).
  - Per-request override: add ranking=v1|v2 to /api/search.
- GET /api/snapshot/{id}: Snapshot metadata for a single record. When HEALTHARCHIVE_REPLAY_BASE_URL is set, this includes a browseUrl suitable for embedding a replay engine (pywb) for full-fidelity browsing.
- GET /api/usage: Aggregated usage metrics (daily counts) for public reporting.
- GET /api/changes: Precomputed change events feed (filters by source, edition/job, date range).
- GET /api/changes/compare: Precomputed diff between two adjacent captures (A → B).
- GET /api/changes/rss: RSS feed for the latest edition-aware change events.
- GET /api/exports: Export manifest describing available research exports and limits.
- GET /api/exports/snapshots: Snapshot metadata export (JSONL/CSV, metadata-only).
- GET /api/exports/changes: Change event export (JSONL/CSV, metadata-only).
- POST /api/reports: Public issue intake endpoint for broken snapshots, metadata errors, missing coverage, or takedown requests.
- GET /api/snapshots/raw/{id}: Returns the archived HTML document for embedding in the frontend.
- GET /api/snapshots/{id}/timeline: Timeline of captures for the same normalized URL group.
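
To make the per-request ranking override concrete, here is a minimal Python sketch that builds a /api/search URL. The ranking parameter and its v1/v2 values come from the list above; the query parameter name q and the helper name build_search_url are assumptions for illustration, so check docs/openapi.json for the actual parameter names.

```python
from typing import Optional
from urllib.parse import urlencode

ALLOWED_RANKINGS = {"v1", "v2"}


def build_search_url(api_base: str, query: str, ranking: Optional[str] = None) -> str:
    """Build a search URL, optionally overriding the default ranking version."""
    # "q" is an assumed parameter name; it is not confirmed by this README.
    params = {"q": query}
    if ranking is not None:
        if ranking not in ALLOWED_RANKINGS:
            raise ValueError(f"ranking must be one of {sorted(ALLOWED_RANKINGS)}")
        params["ranking"] = ranking
    return f"{api_base.rstrip('/')}/api/search?{urlencode(params)}"
```

Omitting ranking leaves the choice to HA_SEARCH_RANKING_VERSION on the server side.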

Operator-only endpoints exist for job administration and service telemetry. They are not part of the public API contract, require admin-token protection outside local development, and are intentionally summarized here rather than listed as public documentation.

Dev .env helper

For convenience, you can copy .env.example to .env (git-ignored) and source it in your shell:

cp .env.example .env
source .env
alembic upgrade head
uvicorn ha_backend.api:app --reload --port 8001

Do not commit real secrets in .env; use host-managed env vars for staging/prod.

Search evaluation tools

This repo includes lightweight scripts to capture and compare search results:

Capture a standard query set (v1 vs v2):
- ./scripts/search-eval-capture.sh --out-dir /tmp/ha-search-eval --page-size 20 --ranking v1
- ./scripts/search-eval-capture.sh --out-dir /tmp/ha-search-eval --page-size 20 --ranking v2
Diff two capture directories:
- python ./scripts/search-eval-diff.py --a /tmp/ha-search-eval/<TS_A> --b /tmp/ha-search-eval/<TS_B> --top 20

Docs:

docs/operations/search-quality.md
docs/operations/search-golden-queries.md

CORS / frontend origins

The API enables CORS for the public endpoints. Allowed origins come from HEALTHARCHIVE_CORS_ORIGINS (comma-separated). Defaults cover local dev and production:

http://localhost:3000, http://localhost:5173, https://healtharchive.ca, https://www.healtharchive.ca

Set HEALTHARCHIVE_CORS_ORIGINS when your frontend runs on a different host or port (e.g., a preview/staging domain). Admin routes remain token-gated even when CORS is enabled.
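
The comma-separated format with a built-in fallback can be sketched as follows. This is not the backend's actual parser: the function name parse_cors_origins is hypothetical, and trimming whitespace and treating an empty value as unset are assumptions.

```python
from typing import Optional

# Defaults as listed in this README.
DEFAULT_CORS_ORIGINS = [
    "http://localhost:3000",
    "http://localhost:5173",
    "https://healtharchive.ca",
    "https://www.healtharchive.ca",
]


def parse_cors_origins(raw: Optional[str]) -> list[str]:
    """Split a comma-separated origin list, falling back to the defaults."""
    if raw is None or not raw.strip():
        # Assumption: an empty value behaves like an unset variable.
        return list(DEFAULT_CORS_ORIGINS)
    # Assumption: surrounding whitespace and empty entries are ignored.
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```

For example, parse_cors_origins(os.environ.get("HEALTHARCHIVE_CORS_ORIGINS")) would yield the list of origins a browser is allowed to call from.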

Running the worker

The worker process polls for queued jobs and runs both the crawl (archive_tool) and indexing pipeline.

Start it via the CLI:

healtharchive start-worker

Options:

- --poll-interval SECONDS: sleep delay when no work is found (default 30).
- --once: process at most one job and exit (useful for cron / debugging).

The worker:

- Looks for ArchiveJob rows with status in ("queued", "retryable").
- Runs run_persistent_job(job_id), which calls archive_tool as a subprocess.
- On crawl success, runs index_job(job_id) to ingest WARCs into Snapshots.
- Applies a simple retry policy (MAX_CRAWL_RETRIES) before marking jobs permanently failed.
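
The steps above can be sketched with an in-memory model. This is only an illustration under stated assumptions: the Job dataclass, process_one, the "running" status, the retry-counting rule, and the value of MAX_CRAWL_RETRIES are all hypothetical here, and the real worker operates on ArchiveJob rows in the database.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_CRAWL_RETRIES = 2  # assumed value; the real constant may differ


@dataclass
class Job:
    id: int
    status: str = "queued"
    retry_count: int = 0


def process_one(
    jobs: list[Job],
    crawl: Callable[[int], bool],
    index: Callable[[int], bool],
) -> Optional[Job]:
    """Process at most one runnable job, similar in spirit to --once."""
    job = next((j for j in jobs if j.status in ("queued", "retryable")), None)
    if job is None:
        return None  # nothing to do; a real loop would sleep for the poll interval
    job.status = "running"  # assumed intermediate status
    if crawl(job.id):
        # Crawl succeeded, so indexing runs next.
        job.status = "indexed" if index(job.id) else "index_failed"
    else:
        job.retry_count += 1
        # Assumed policy: retry until the counter reaches the limit.
        job.status = "retryable" if job.retry_count < MAX_CRAWL_RETRIES else "failed"
    return job
```

Passing the crawl and index steps as callables keeps the sketch testable without Docker or a database.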

Creating and managing jobs

The backend exposes a small CLI layer for managing ArchiveJob rows.

Seed sources

Ensure Source rows for hc and phac exist:

healtharchive seed-sources

Create a job from registry defaults

For example, a monthly Health Canada job:

healtharchive create-job --source hc

This:

- Uses the SourceJobConfig for hc (seeds, naming template, tool options).
- Creates an ArchiveJob row with status="queued" and a unique output_dir.

Run a specific DB-backed job once

healtharchive run-db-job --id 42

This calls archive_tool with the stored seeds, output_dir, and tool options. It updates status, timestamps, and crawler_exit_code.

Index an existing job

If you ran a crawl separately and just want to index WARCs:

healtharchive index-job --id 42

If you have an existing archive_tool output directory on disk (e.g. from a manual run) and want to attach it to the DB for indexing, use:

healtharchive register-job-dir --source hc --output-dir /path/to/job_dir [--name NAME]
healtharchive index-job --id <printed ID>

Permissions note: crawls run as root inside Docker. The registry defaults now enable relax_perms so temp WARCs are chmod’d readable after the crawl, allowing indexing without a host-side sudo chown. If you disable relax_perms, you may need to chown .tmp* before indexing.

Compute change events (diffs)

Change events are computed off the request path and stored, so the API serves precomputed results rather than diffing captures per request.

# Incremental (last 30 days by default)
healtharchive compute-changes --max-events 200

# Backfill historical changes
healtharchive compute-changes --backfill --max-events 500

These commands populate snapshot_changes rows used by /api/changes and /api/changes/compare.

List and inspect jobs

healtharchive list-jobs
healtharchive show-job --id 42

Validate a job's configuration (dry-run)

To validate that a job's configuration is coherent (seeds, tool options, and zimit args) without actually running a crawl, you can invoke the integrated archive_tool CLI in dry-run mode via:

healtharchive validate-job-config --id 42

This:

- Reconstructs the archive_tool CLI arguments from ArchiveJob.config.
- Runs archive-tool with --dry-run so it validates the configuration and prints a summary.
- Does not change the job's status or timestamps.

Retry and cleanup

Retry a failed crawl or reindex:
```
healtharchive retry-job --id 42
```
- For status="failed" → sets status="retryable" for another crawl.
- For status="index_failed" → sets status="completed" so indexing can re-run.
- For other statuses, the command logs that there is nothing to retry.
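
The status transitions listed above amount to a small lookup table. The sketch below mirrors them in Python; the function name status_after_retry is hypothetical and the real command also persists the change and logs the outcome.

```python
from typing import Optional

# Transitions as described for retry-job in this README.
RETRY_TRANSITIONS = {
    "failed": "retryable",        # crawl failed: allow another crawl
    "index_failed": "completed",  # indexing failed: allow indexing to re-run
}


def status_after_retry(status: str) -> Optional[str]:
    """Return the new status, or None when there is nothing to retry."""
    return RETRY_TRANSITIONS.get(status)
```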
Clean up temp dirs and state for an indexed or index_failed job:
```
healtharchive cleanup-job --id 42
```
This:
- Uses archive_tool’s CrawlState and cleanup_temp_dirs(...) to delete .tmp* directories and the .archive_state.json file under output_dir.
- Leaves the job directory and any final ZIM in place.
- Updates ArchiveJob.cleanup_status = "temp_cleaned" and cleaned_at when there was actually a state file and/or temp dirs to remove.

Note: cleanup-job is destructive for temporary crawl artifacts (including WARCs under .tmp*). Only run it after you are confident the job has been fully indexed (or indexing has failed in a way you do not plan to recover from) and any desired ZIMs or exports are verified.
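
To show which files are affected, here is a standard-library sketch of the deletion step. It is not the archive_tool implementation (which goes through CrawlState and cleanup_temp_dirs); the function name remove_temp_artifacts is hypothetical, and only the .tmp* and .archive_state.json names come from the description above.

```python
import shutil
from pathlib import Path


def remove_temp_artifacts(output_dir: Path) -> bool:
    """Delete .tmp* directories and the state file; report whether anything was removed."""
    removed = False
    for tmp_dir in output_dir.glob(".tmp*"):
        if tmp_dir.is_dir():
            shutil.rmtree(tmp_dir)  # destructive: temp WARCs live here
            removed = True
    state_file = output_dir / ".archive_state.json"
    if state_file.exists():
        state_file.unlink()
        removed = True
    # Everything else in output_dir, including any final ZIM, is left in place.
    return removed
```

The boolean result corresponds loosely to the rule that cleanup_status is only updated when there was actually something to remove.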

Configuration (environment variables)

The backend reads configuration from environment variables with sensible defaults:

- HEALTHARCHIVE_DATABASE_URL: SQLAlchemy URL for the DB. Defaults to sqlite:///healtharchive.db in the repo root.
- HEALTHARCHIVE_ARCHIVE_ROOT: Base directory for job output dirs (passed as --output-dir to archive_tool). Defaults to the value configured in ha_backend.config. For local development, set it explicitly to a git-ignored directory under your checkout, such as $(pwd)/.dev-archive-root.
- HEALTHARCHIVE_TOOL_CMD: Command used to invoke the archiver. Defaults to archive-tool.
- HEALTHARCHIVE_ENV: High-level environment hint used by admin auth. Recognised values:
  - "development" (default when unset): admin endpoints are open when HEALTHARCHIVE_ADMIN_TOKEN is unset (dev convenience).
  - "staging" or "production": admin endpoints fail closed with HTTP 500 if HEALTHARCHIVE_ADMIN_TOKEN is not configured.
- HEALTHARCHIVE_ADMIN_TOKEN: Optional admin token. If set, /api/admin/* and /metrics require either:
  - Authorization: Bearer <token>, or
  - X-Admin-Token: <token>

  If unset and HEALTHARCHIVE_ENV is "development" (or unset), admin endpoints are open (intended only for local development). In staging and production, always set a long, random token and store it as a secret in your hosting platform (never committed to the repo). When HEALTHARCHIVE_ENV is "staging" or "production" and this token is missing, admin and metrics endpoints return HTTP 500.
- HEALTHARCHIVE_LOG_LEVEL: Global log level (DEBUG, INFO, etc.). Defaults to INFO.
- HEALTHARCHIVE_CORS_ORIGINS: Comma-separated list of allowed origins for CORS on the public API routes. If unset, a built-in default is used:
  - http://localhost:3000
  - http://localhost:5173
  - https://healtharchive.ca
  - https://www.healtharchive.ca

  In hosted environments, set this explicitly so that only expected frontend hosts can call the API from a browser. Example:

Canonical public frontend:

export HEALTHARCHIVE_CORS_ORIGINS="https://healtharchive.ca,https://www.healtharchive.ca"

Optional preview/historical frontend origin:
```
export HEALTHARCHIVE_CORS_ORIGINS="https://healtharchive.vercel.app"
```
Keep this only if you still intentionally use an old Vercel-hosted preview frontend. It is not part of the current production path.

You can also include http://localhost:3000 if you want local development to talk directly to a remote API instance.
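
The admin-token rules described for HEALTHARCHIVE_ENV and HEALTHARCHIVE_ADMIN_TOKEN can be summarized as a small decision function. This is a sketch, not the backend's auth code: the name admin_access and its return labels are hypothetical, treating unrecognised environment values as fail-closed is an assumption, and header lookup here is case-sensitive for simplicity.

```python
import hmac
from typing import Mapping, Optional


def admin_access(
    env: Optional[str],
    configured_token: Optional[str],
    headers: Mapping[str, str],
) -> str:
    """Return "allow", "deny", or "misconfigured" (the HTTP 500 case)."""
    env = env or "development"
    if not configured_token:
        # No token: open in development, fail closed everywhere else.
        # Assumption: unknown env values are treated like staging/production.
        return "allow" if env == "development" else "misconfigured"
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        presented = auth[len("Bearer "):]
    else:
        presented = headers.get("X-Admin-Token", "")
    # Constant-time comparison avoids leaking token contents via timing.
    if hmac.compare_digest(presented.encode(), configured_token.encode()):
        return "allow"
    return "deny"
```

Note that once a token is configured it is required in every environment, including development. The README does not state which HTTP status a rejected request receives, so the sketch returns a label instead of a status code.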

Deployment details are environment-specific and intentionally kept outside the public README. Public documentation in this repo covers local setup, architecture, data methodology, and the externally consumable API behavior.

Continuous integration

A GitHub Actions workflow (.github/workflows/backend-ci.yml) is intended to run on pushes to main and on pull requests. It:

- Checks out the repository.
- Sets up Python 3.11.
- Runs make ci (fast gate: format check, lint, typecheck, tests).
- Runs an end-to-end smoke test (backend + frontend) from the same checkout.

A separate nightly/manual workflow (.github/workflows/backend-ci-full.yml) runs make check-full, which includes coverage-critical, docs checks, pre-commit hooks, and security scans. That broader full gate is useful before deploys, but it is not the default PR-blocking backend CI path today.

The CI job uses a temporary SQLite database via:

HEALTHARCHIVE_DATABASE_URL=sqlite:///./ci-healtharchive.db

so no external DB or Docker services are required. Crawls are not executed in CI; tests focus on unit-level behavior (DB models, APIs, job orchestration, etc.).

Detailed architecture

For a full walkthrough of:

- ORM models and status lifecycle
- Job registry and how per-source jobs are configured
- archive_tool integration and adaptive strategies
- Indexing pipeline and snapshot schema
- HTTP API routes and JSON schemas
- Worker loop and retry semantics
- Cleanup and retention strategy (future)
- How the backend integrates with the in-repo archive_tool crawler

see docs/architecture.md.

Frontend integration smoke test

Once a frontend is pointed at this backend (via NEXT_PUBLIC_API_BASE_URL on the frontend side and HEALTHARCHIVE_CORS_ORIGINS here), you can perform a quick end-to-end smoke test:

Verify API health from the frontend host

From a shell:
```
curl -i "$API_BASE_URL/api/health"
curl -i "$API_BASE_URL/api/sources"
```
You should see HTTP 200 responses and JSON bodies. If you add an Origin header matching the frontend (e.g. https://healtharchive.ca), the response should include:
```
Access-Control-Allow-Origin: https://healtharchive.ca
Vary: Origin
```
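
The same check can be scripted. The Python sketch below sends an Origin header and verifies the two response headers shown above; the helper names fetch_headers and cors_headers_ok are hypothetical, and the header comparison is a simplified reading of the expected output rather than a full CORS validation.

```python
from urllib.request import Request, urlopen


def fetch_headers(url: str, origin: str, timeout: float = 10.0) -> dict[str, str]:
    """GET the URL with an Origin header and return the response headers."""
    request = Request(url, headers={"Origin": origin})
    with urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def cors_headers_ok(headers: dict[str, str], origin: str) -> bool:
    """Check that the origin is echoed back and that Vary includes Origin."""
    normalized = {name.lower(): value for name, value in headers.items()}
    allow = normalized.get("access-control-allow-origin")
    vary = [part.strip().lower() for part in normalized.get("vary", "").split(",")]
    return allow == origin and "origin" in vary
```

For example, cors_headers_ok(fetch_headers(f"{api_base}/api/health", "https://healtharchive.ca"), "https://healtharchive.ca") should be true when the origin is allowed, where api_base is whatever $API_BASE_URL points at.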
Exercise the UI

From the frontend domain (staging or production):
- Visit /archive:
  - With the backend up, the filters should show Filters (live API) and search/pagination should be backed by real snapshot data.
  - If you intentionally stop the backend (in staging), the UI should show a small “Backend unreachable” banner (when enabled) and fall back to the demo dataset with a clear notice.
- Visit /archive/browse-by-source and /snapshot/[id] to confirm source summaries and snapshot details load correctly against the live API.

The archive_tool subpackage also has its own detailed documentation in src/archive_tool/docs/documentation.md describing its internal state machine and Docker orchestration, and how it cooperates with the backend.
