This repository contains the backend services, documentation hub, and in-tree
Next.js frontend for HealthArchive.ca. The
versioned metadata release repo, healtharchive-datasets, intentionally
remains separate.
Public documentation focuses on project purpose, methodology, limitations, and reproducible local development. Deployment is environment-specific and handled outside this public repository; exact host paths, private inventories, and operator-only runbooks are intentionally not documented here.
The backend has three main responsibilities:
- Run crawl jobs for sources like Health Canada (hc) and PHAC (phac) by calling the archive_tool CLI (which wraps zimit in Docker).
- Index WARCs into snapshots (URL + timestamp + HTML text, etc.) in a relational database.
- Expose HTTP APIs that the Next.js frontend uses for search, source summaries, and snapshot viewing.
For a deep architecture and implementation walkthrough, see
docs/architecture.md. For a step‑by‑step local live‑testing guide, see
docs/development/live-testing.md.
This README is intentionally shorter and focused on practical usage.
Repository boundaries:
- Frontend UI: lives in frontend/ in this repo
- Datasets: https://github.com/jerdaw/healtharchive-datasets
- Documentation Site: Run make docs-serve in this repo for a searchable web UI.
Shared documentation boundary:
- Private shared-ops documentation is the default home for shared host facts that are not specific to HealthArchive alone:
  - host access posture
  - shared ingress ownership
  - cross-project service inventory
  - host path conventions
  - shared maintenance and hardening state
- This repo owns the app-specific subset:
  - API/worker/replay behavior
  - backend env vars and automation
  - frontend route behavior, build/runtime wiring, and UI verification
  - safe local development and verification
Historical identity note:
- Preserved older records may still mention the former repo slug, checkout path, or CLI name from before the 2026-04 repo/runtime identity rename.
.
├── README.md
├── frontend/ # Next.js app + frontend-specific docs/scripts
├── docs/ # Documentation source for the current docs portal
│ ├── architecture.md # Detailed architecture and implementation guide
│ ├── development/ # Local dev + live-testing flows
│ ├── deployment/ # Public deployment overviews only
│ ├── operations/ # Public-safe operations summaries
│ ├── frontend/ # Docs portal bridge to in-tree frontend docs
│ └── datasets-external/ # Link-out pointers to datasets repo/docs
├── mkdocs.yml # Current documentation navigation source of truth
├── pyproject.toml # Package + dependency metadata
├── requirements.txt # Convenience requirements file (mirrors pyproject)
├── alembic/ # Database migrations
├── src/
│ ├── ha_backend/ # Backend package
│ │ ├── api/ # FastAPI app, public + admin routes
│ │ ├── cli.py # healtharchive CLI entrypoint
│ │ ├── config.py # Archive root + DB + tool config
│ │ ├── db.py # SQLAlchemy engine/session helpers
│ │ ├── indexing/ # WARC discovery, parsing, text extraction, mapping
│ │ ├── job_registry.py # Per-source job templates (hc, phac)
│ │ ├── jobs.py # Persistent job runner → archive_tool
│ │ ├── logging_config.py # Shared logging configuration
│ │ ├── models.py # ORM models (Source, ArchiveJob, Snapshot)
│ │ ├── seeds.py # Initial Source seeding
│ │ └── worker/ # Long-running worker loop for queued jobs
│ └── archive_tool/ # Crawler/orchestrator subpackage, with its own docs
└── tests/ # Pytest suite
The archive_tool package started as a separate repository and is now
maintained in-tree as the backend's crawler/orchestrator subpackage. It is
invoked primarily via its CLI (archive-tool) and integrates closely with the
backend's job, worker, and indexing code. Its internal documentation lives
under src/archive_tool/docs/documentation.md.
- Python 3.11+
- Docker (required by archive_tool / zimit for crawls)
- A Python virtual environment (recommended)
From the repo root:
make venv
This provides:
- healtharchive – backend CLI
- archive-tool – console script pointing at the in-repo archive_tool package
For the in-tree frontend:
make frontend-install
make contract-sync
make frontend-ci
Node.js 20.19+ is required for frontend/.
For a full same-checkout smoke of backend + frontend:
make integration-e2e
make contract-sync regenerates docs/openapi.json and the frontend's
generated API types so the backend schema remains the source of truth.
By default the backend uses a SQLite file at sqlite:///healtharchive.db in
the repo root, or whatever you point HEALTHARCHIVE_DATABASE_URL at.
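As a rough illustration of the default described above, the sketch below resolves the database URL from the environment and, for SQLite URLs only, opens the file and runs a trivial query. The helper names (resolve_database_url, sqlite_path, ping_sqlite) are hypothetical and are not taken from ha_backend; the real backend uses SQLAlchemy and may behave differently.

```python
import os
import sqlite3

# Default stated in this README.
DEFAULT_DATABASE_URL = "sqlite:///healtharchive.db"


def resolve_database_url(environ=None):
    """Return HEALTHARCHIVE_DATABASE_URL if set, else the README default."""
    env = os.environ if environ is None else environ
    return env.get("HEALTHARCHIVE_DATABASE_URL") or DEFAULT_DATABASE_URL


def sqlite_path(url):
    """Return the file path of a sqlite:/// URL, or None for other backends."""
    prefix = "sqlite:///"
    return url[len(prefix):] if url.startswith(prefix) else None


def ping_sqlite(url):
    """Open the SQLite database named by `url` and run SELECT 1."""
    path = sqlite_path(url)
    if path is None:
        raise ValueError("not a sqlite:/// URL")
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT 1").fetchone() == (1,)
    finally:
        conn.close()
```

This only mirrors the documented default; use healtharchive check-db for the real connectivity check.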
To verify connectivity:
healtharchive check-db
For production, you will typically point HEALTHARCHIVE_DATABASE_URL at a
Postgres instance and run Alembic migrations:
alembic upgrade head
For local development it is common to isolate everything under the repo directory:
export HEALTHARCHIVE_DATABASE_URL=sqlite:///$(pwd)/.dev-healtharchive.db
export HEALTHARCHIVE_ARCHIVE_ROOT=$(pwd)/.dev-archive-root
export HEALTHARCHIVE_ADMIN_TOKEN=localdev-admin # optional for admin routes
# Optional CORS overrides (defaults already cover localhost + prod domains)
# export HEALTHARCHIVE_CORS_ORIGINS=http://localhost:3000,http://localhost:5173
alembic upgrade head
If you want to use Postgres locally via Docker for testing:
docker run --name ha-pg \
-e POSTGRES_USER=healtharchive \
-e POSTGRES_PASSWORD=healtharchive \
-e POSTGRES_DB=healtharchive \
-p 5432:5432 -d postgres:16
export HEALTHARCHIVE_DATABASE_URL=postgresql+psycopg://healtharchive:healtharchive@localhost:5432/healtharchive
alembic upgrade head
The backend writes job output under an archive root directory:
- HEALTHARCHIVE_ARCHIVE_ROOT (env), or
- a project-local default from ha_backend.config.
For local development, point the archive root at an isolated directory under your checkout:
export HEALTHARCHIVE_ARCHIVE_ROOT=$(pwd)/.dev-archive-root
To verify the archive root and archive_tool:
healtharchive check-env # shows archive root and checks writability
healtharchive check-archive-tool # runs 'archive-tool --help'
This repo uses pre-commit and provides an optional pre-push hook helper:
pre-commit install
bash scripts/install-pre-push-hook.sh
The FastAPI app lives at ha_backend.api:app. Once your virtualenv and DB
are configured:
uvicorn ha_backend.api:app --reload
Key public endpoints (all prefixed with /api):
- GET /api/health: Basic health check (status, DB connectivity, job and snapshot counts).
- GET /api/stats: Lightweight public archive stats for the frontend (snapshot totals, unique page count, source count, latest capture date).
- GET /api/sources: Per-source summaries derived from indexed snapshots.
  When HEALTHARCHIVE_REPLAY_BASE_URL is set, the response also includes:
  - entryRecordId – a “best effort” entry-point snapshot ID for browsing a source
  - entryBrowseUrl – a timestamp-locked replay URL for that entry point
- GET /api/search: Full-text style search over snapshots (with filters for source, pagination, etc.).
  When HEALTHARCHIVE_REPLAY_BASE_URL is set, each result may include:
  - jobId + captureTimestamp – used to lock replay to a specific capture
  - browseUrl – a timestamp-locked replay URL for browsing within the archive
  Ranking controls:
  - Default ranking is controlled by HA_SEARCH_RANKING_VERSION (v1 or v2).
  - Per-request override: add ranking=v1|v2 to /api/search.
- GET /api/snapshot/{id}: Snapshot metadata for a single record.
  When HEALTHARCHIVE_REPLAY_BASE_URL is set, this includes a browseUrl suitable for embedding a replay engine (pywb) for full-fidelity browsing.
- GET /api/usage: Aggregated usage metrics (daily counts) for public reporting.
- GET /api/changes: Precomputed change events feed (filters by source, edition/job, date range).
- GET /api/changes/compare: Precomputed diff between two adjacent captures (A → B).
- GET /api/changes/rss: RSS feed for the latest edition-aware change events.
- GET /api/exports: Export manifest describing available research exports and limits.
- GET /api/exports/snapshots: Snapshot metadata export (JSONL/CSV, metadata-only).
- GET /api/exports/changes: Change event export (JSONL/CSV, metadata-only).
- POST /api/reports: Public issue intake endpoint for broken snapshots, metadata errors, missing coverage, or takedown requests.
- GET /api/snapshots/raw/{id}: Returns the archived HTML document for embedding in the frontend.
- GET /api/snapshots/{id}/timeline: Timeline of captures for the same normalized URL group.
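For client code, a minimal sketch of building a /api/search request URL with the source filter and the per-request ranking override might look like this. The source and ranking parameters come from the endpoint notes above; the query parameter name q, the helper name build_search_url, and the localhost base URL are assumptions for illustration. Check docs/openapi.json for the actual contract.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, source=None, ranking=None):
    """Build a /api/search URL with optional source and ranking parameters."""
    if ranking not in (None, "v1", "v2"):
        raise ValueError("ranking must be 'v1' or 'v2'")
    params = {"q": query}  # "q" is an assumed parameter name
    if source:
        params["source"] = source  # e.g. "hc" or "phac"
    if ranking:
        params["ranking"] = ranking  # per-request override of the default
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


url = build_search_url("http://localhost:8001", "covid vaccine", source="hc", ranking="v2")
```

Omitting ranking leaves the choice to HA_SEARCH_RANKING_VERSION on the server side.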
Operator-only endpoints exist for job administration and service telemetry. They are not part of the public API contract, require admin-token protection outside local development, and are intentionally summarized here rather than listed as public documentation.
For convenience, you can copy .env.example to .env (git-ignored) and source
it in your shell:
cp .env.example .env
source .env
alembic upgrade head
uvicorn ha_backend.api:app --reload --port 8001
Do not commit real secrets in .env; use host-managed env vars for staging/prod.
This repo includes lightweight scripts to capture and compare search results:
- Capture a standard query set (v1 vs v2):
  ./scripts/search-eval-capture.sh --out-dir /tmp/ha-search-eval --page-size 20 --ranking v1
  ./scripts/search-eval-capture.sh --out-dir /tmp/ha-search-eval --page-size 20 --ranking v2
- Diff two capture directories:
  python ./scripts/search-eval-diff.py --a /tmp/ha-search-eval/<TS_A> --b /tmp/ha-search-eval/<TS_B> --top 20
Docs:
- docs/operations/search-quality.md
- docs/operations/search-golden-queries.md
The API enables CORS for the public endpoints. Allowed origins come from
HEALTHARCHIVE_CORS_ORIGINS (comma-separated). Defaults cover local dev and
production:
http://localhost:3000, http://localhost:5173, https://healtharchive.ca, https://www.healtharchive.ca
Set HEALTHARCHIVE_CORS_ORIGINS when your frontend runs on a different host
or port (e.g., a preview/staging domain). Admin routes remain token-gated even
when CORS is enabled.
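The comma-separated format can be illustrated with a small parser. This is a sketch, not the backend's actual config code: the function name parse_cors_origins is hypothetical, and treating an empty or whitespace-only value the same as "unset" is an assumption.

```python
# Built-in defaults as listed in this README.
DEFAULT_CORS_ORIGINS = [
    "http://localhost:3000",
    "http://localhost:5173",
    "https://healtharchive.ca",
    "https://www.healtharchive.ca",
]


def parse_cors_origins(raw):
    """Split a comma-separated origin list, falling back to the defaults."""
    if raw is None:
        return list(DEFAULT_CORS_ORIGINS)
    # Drop surrounding whitespace and empty entries (e.g. a trailing comma).
    origins = [item.strip() for item in raw.split(",") if item.strip()]
    # Assumption: an effectively empty value behaves like "unset".
    return origins or list(DEFAULT_CORS_ORIGINS)
```

For example, a value of "https://staging.example.org" would replace the defaults entirely rather than extend them, so include every origin you still need.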
The worker process polls for queued jobs and runs both the crawl (archive_tool)
and indexing pipeline.
Start it via the CLI:
healtharchive start-worker
Options:
- --poll-interval SECONDS – sleep delay when no work is found (default 30).
- --once – process at most one job and exit (useful for cron / debugging).
The worker:
- Looks for ArchiveJob rows with status in ("queued", "retryable").
- Runs run_persistent_job(job_id), which calls archive_tool as a subprocess.
- On crawl success, runs index_job(job_id) to ingest WARCs into Snapshots.
- Applies a simple retry policy (MAX_CRAWL_RETRIES) before marking jobs permanently failed.
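The worker steps above can be sketched with in-memory job dictionaries. This is an illustration under stated assumptions, not the code in ha_backend/worker: the retry_count field, the placeholder value of MAX_CRAWL_RETRIES, the exact comparison used for the retry limit, and the injected run_crawl / index_job callables are all hypothetical stand-ins.

```python
import time

MAX_CRAWL_RETRIES = 2  # placeholder; the real value is not stated in this README


def process_one_job(jobs, run_crawl, index_job, max_retries=MAX_CRAWL_RETRIES):
    """Process at most one runnable job; return its id, or None when idle."""
    job = next((j for j in jobs if j["status"] in ("queued", "retryable")), None)
    if job is None:
        return None
    if run_crawl(job["id"]):  # stand-in for run_persistent_job(job_id)
        job["status"] = "completed"
        try:
            index_job(job["id"])
            job["status"] = "indexed"
        except Exception:
            job["status"] = "index_failed"
    else:
        # Assumed bookkeeping: count failures, retry until the limit is passed.
        job["retry_count"] = job.get("retry_count", 0) + 1
        job["status"] = "retryable" if job["retry_count"] <= max_retries else "failed"
    return job["id"]


def run_worker(jobs, run_crawl, index_job, poll_interval=30, once=False, sleep=time.sleep):
    """Poll for work; with once=True, handle at most one job and return."""
    while True:
        processed = process_one_job(jobs, run_crawl, index_job)
        if once:
            return processed
        if processed is None:
            sleep(poll_interval)  # nothing to do, wait before polling again
```

Passing once=True corresponds to the --once option: a single pass that is easy to drive from cron or a debugger.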
The backend exposes a small CLI layer for managing ArchiveJob rows.
Ensure Source rows for hc and phac exist:
healtharchive seed-sources
For example, a monthly Health Canada job:
healtharchive create-job --source hc
This:
- Uses the SourceJobConfig for hc (seeds, naming template, tool options).
- Creates an ArchiveJob row with status="queued" and a unique output_dir.
healtharchive run-db-job --id 42
This calls archive_tool with the stored seeds, output_dir, and tool
options. It updates status, timestamps, and crawler_exit_code.
If you ran a crawl separately and just want to index WARCs:
healtharchive index-job --id 42
If you have an existing archive_tool output directory on disk (e.g. from a
manual run) and want to attach it to the DB for indexing, use:
healtharchive register-job-dir --source hc --output-dir /path/to/job_dir [--name NAME]
healtharchive index-job --id <printed ID>
Permissions note: crawls run as root inside Docker. The registry defaults now
enable relax_perms so temp WARCs are chmod’d readable after the crawl, allowing
indexing without a host-side sudo chown. If you disable relax_perms, you may
need to chown .tmp* before indexing.
Change tracking is computed off the request path using precomputed events.
# Incremental (last 30 days by default)
healtharchive compute-changes --max-events 200
# Backfill historical changes
healtharchive compute-changes --backfill --max-events 500
These commands populate snapshot_changes rows used by /api/changes and
/api/changes/compare.
healtharchive list-jobs
healtharchive show-job --id 42
To validate that a job's configuration is coherent (seeds, tool options, and
zimit args) without actually running a crawl, you can invoke the integrated
archive_tool CLI in dry-run mode via:
healtharchive validate-job-config --id 42
This:
- Reconstructs the archive_tool CLI arguments from ArchiveJob.config.
- Runs archive-tool with --dry-run so it validates the configuration and prints a summary.
- Does not change the job's status or timestamps.
- Retry a failed crawl or reindex:
  healtharchive retry-job --id 42
  - For status="failed" → sets status="retryable" for another crawl.
  - For status="index_failed" → sets status="completed" so indexing can re-run.
  - For other statuses, the command logs that there is nothing to retry.
- Cleanup temp dirs and state for an indexed or index_failed job:
  healtharchive cleanup-job --id 42
  This:
  - Uses archive_tool’s CrawlState and cleanup_temp_dirs(...) to delete .tmp* directories and the .archive_state.json file under output_dir.
  - Leaves the job directory and any final ZIM in place.
  - Updates ArchiveJob.cleanup_status = "temp_cleaned" and cleaned_at when there was actually a state file and/or temp dirs to remove.
  Note: cleanup-job is destructive for temporary crawl artifacts (including WARCs under .tmp*). Only run it after you are confident the job has been fully indexed (or indexing has failed in a way you do not plan to recover from) and any desired ZIMs or exports are verified.
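To make the scope of cleanup-job concrete, here is a standalone sketch of which paths the description above implies are removed. It is not archive_tool's cleanup_temp_dirs(...) and does not use CrawlState; the function name cleanup_temp_artifacts is hypothetical, and the real command may apply extra checks.

```python
import shutil
from pathlib import Path


def cleanup_temp_artifacts(output_dir):
    """Delete .tmp* directories and .archive_state.json under output_dir.

    Returns True if anything was removed, mirroring the condition under
    which cleanup_status would be updated.
    """
    root = Path(output_dir)
    removed = False
    # Materialise the matches first so deletion does not disturb iteration.
    for tmp_dir in list(root.glob(".tmp*")):
        if tmp_dir.is_dir():
            shutil.rmtree(tmp_dir)  # also removes any WARCs inside
            removed = True
    state_file = root / ".archive_state.json"
    if state_file.exists():
        state_file.unlink()
        removed = True
    return removed
```

Everything else in the job directory, including a final ZIM, is left untouched by this sketch, which matches the behaviour described for the real command.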
The backend reads configuration from environment variables with sensible defaults:
- HEALTHARCHIVE_DATABASE_URL: SQLAlchemy URL for the DB. Defaults to sqlite:///healtharchive.db in the repo root.
- HEALTHARCHIVE_ARCHIVE_ROOT: Base directory for job output dirs (passed as --output-dir to archive_tool). Defaults to the value configured in ha_backend.config. For local development, set it explicitly to a git-ignored directory under your checkout, such as $(pwd)/.dev-archive-root.
- HEALTHARCHIVE_TOOL_CMD: Command used to invoke the archiver. Defaults to archive-tool.
- HEALTHARCHIVE_ENV: High-level environment hint used by admin auth. Recognised values:
  - "development" (default when unset): admin endpoints are open when HEALTHARCHIVE_ADMIN_TOKEN is unset (dev convenience).
  - "staging" or "production": admin endpoints fail closed with HTTP 500 if HEALTHARCHIVE_ADMIN_TOKEN is not configured.
- HEALTHARCHIVE_ADMIN_TOKEN: Optional admin token. If set, /api/admin/* and /metrics require either:
  - Authorization: Bearer <token>, or
  - X-Admin-Token: <token>
  If unset and HEALTHARCHIVE_ENV is "development" (or unset), admin endpoints are open (intended only for local development). In staging and production you should always set a long, random token and store it as a secret in your hosting platform (never committed to the repo); when HEALTHARCHIVE_ENV is "staging" or "production" and this token is missing, admin and metrics endpoints return HTTP 500.
- HEALTHARCHIVE_LOG_LEVEL: Global log level (DEBUG, INFO, etc.). Defaults to INFO.
- HEALTHARCHIVE_CORS_ORIGINS: Comma-separated list of allowed Origins for CORS on the public API routes. If unset, a built-in default is used:
  - http://localhost:3000
  - http://localhost:5173
  - https://healtharchive.ca
  - https://www.healtharchive.ca
  In hosted environments, set this explicitly so that only expected frontend hosts can call the API from a browser. Example:
  - Canonical public frontend:
    export HEALTHARCHIVE_CORS_ORIGINS="https://healtharchive.ca,https://www.healtharchive.ca"
  - Optional preview/historical frontend origin:
    export HEALTHARCHIVE_CORS_ORIGINS="https://healtharchive.vercel.app"
    Keep this only if you still intentionally use an old Vercel-hosted preview frontend. It is not part of the current production path.
  You can also include http://localhost:3000 if you want local development to talk directly to a remote API instance.
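The interaction between HEALTHARCHIVE_ENV and HEALTHARCHIVE_ADMIN_TOKEN can be summarised as a small decision function. This is a sketch of the documented rules, not the FastAPI dependency used by the backend. The function name admin_auth_status is hypothetical, the 403 returned for a wrong or missing token is an assumption (the README only specifies the open and HTTP 500 cases), and unrecognised environment values are assumed to fail closed.

```python
import hmac


def admin_auth_status(env, configured_token, headers):
    """Return the HTTP status an admin request would receive."""
    env = (env or "development").lower()
    if not configured_token:
        # Development (or unset): open. Anything else: fail closed.
        return 200 if env == "development" else 500
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        presented = auth[len("Bearer "):]
    else:
        presented = headers.get("X-Admin-Token", "")
    # Constant-time comparison avoids leaking token contents via timing.
    if hmac.compare_digest(presented.encode(), configured_token.encode()):
        return 200
    return 403  # assumed rejection status; not specified in this README
```

Note that `headers` here is a plain dict with exact-case keys; a real framework would normalise header names.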
Deployment details are environment-specific and intentionally kept outside the public README. Public documentation in this repo covers local setup, architecture, data methodology, and the externally consumable API behavior.
A GitHub Actions workflow (.github/workflows/backend-ci.yml) is intended to
run on pushes to main and on pull requests. It:
- Checks out the repository.
- Sets up Python 3.11.
- Runs make ci (fast gate: format check, lint, typecheck, tests).
- Runs an end-to-end smoke test (backend + frontend) from the same checkout.
A separate nightly/manual workflow (.github/workflows/backend-ci-full.yml)
runs make check-full, which includes coverage-critical, docs checks,
pre-commit hooks, and security scans. That broader full gate is useful before
deploys, but it is not the default PR-blocking backend CI path today.
The CI job uses a temporary SQLite database via:
HEALTHARCHIVE_DATABASE_URL=sqlite:///./ci-healtharchive.db
so no external DB or Docker services are required. Crawls are not executed in CI; tests focus on unit-level behavior (DB models, APIs, job orchestration, etc.).
For a full walkthrough of:
- ORM models and status lifecycle
- Job registry and how per-source jobs are configured
- archive_tool integration and adaptive strategies
- Indexing pipeline and snapshot schema
- HTTP API routes and JSON schemas
- Worker loop and retry semantics
- Cleanup and retention strategy (future)
- How the backend integrates with the in-repo archive_tool crawler
see docs/architecture.md.
Once a frontend is pointed at this backend (via NEXT_PUBLIC_API_BASE_URL on
the frontend side and HEALTHARCHIVE_CORS_ORIGINS here), you can perform a
quick end-to-end smoke test:
- Verify API health from the frontend host
  From a shell:
  curl -i "$API_BASE_URL/api/health"
  curl -i "$API_BASE_URL/api/sources"
  You should see HTTP 200 responses and JSON bodies. If you add an Origin header matching the frontend (e.g. https://healtharchive.ca), the response should include:
  Access-Control-Allow-Origin: https://healtharchive.ca
  Vary: Origin
- Exercise the UI
  From the frontend domain (staging or production):
  - Visit /archive:
    - With the backend up, the filters should show Filters (live API) and search/pagination should be backed by real snapshot data.
    - If you intentionally stop the backend (in staging), the UI should show a small “Backend unreachable” banner (when enabled) and fall back to the demo dataset with a clear notice.
  - Visit /archive/browse-by-source and /snapshot/[id] to confirm source summaries and snapshot details load correctly against the live API.
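The CORS part of the smoke test can also be scripted. The sketch below builds a probe request carrying an Origin header and checks the two response headers mentioned above. The helper names build_cors_probe and cors_headers_ok are hypothetical, and api.example.test is a placeholder host; nothing here is sent over the network.

```python
from urllib.request import Request


def build_cors_probe(api_base_url, path, origin):
    """Build a GET request for `path` that carries the given Origin header."""
    return Request(f"{api_base_url.rstrip('/')}{path}", headers={"Origin": origin})


def cors_headers_ok(response_headers, origin):
    """Check Access-Control-Allow-Origin and Vary against the expected origin."""
    allow = response_headers.get("Access-Control-Allow-Origin")
    vary = response_headers.get("Vary", "")
    vary_fields = {field.strip().lower() for field in vary.split(",")}
    return allow == origin and "origin" in vary_fields


probe = build_cors_probe("https://api.example.test/", "/api/health", "https://healtharchive.ca")
```

To run it for real, pass the probe to urllib.request.urlopen and feed dict(response.headers) to cors_headers_ok.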
The archive_tool subpackage also has its own detailed documentation in
src/archive_tool/docs/documentation.md describing its internal state
machine and Docker orchestration, and how it cooperates with the backend.