feat: crawl progress reporting — live status, ETA, and completion summary #48

@skishchampi

Problem

When running a long crawl (--house both across 8 search groups × 5 ministries × ~270 RS sessions), there is no way to know:

  • How many records have been collected so far
  • Which query bucket is currently running and how many remain
  • Estimated time to completion
  • A final summary of what was retrieved

Users are left manually polling the manifest (wc -l manifest.jsonl) and the log (tail crawl.log). This is confusing for non-developer researchers who run the tool for actual parliamentary research.

Requested behaviour

  1. Live progress line — after each bucket completes: [LS 12/40] query='NTA' ministry=EDUCATION → 8 new | total=143
  2. ETA estimate — rolling estimate printed after the first 10% of buckets complete
  3. Completion summary — final table showing records by house, tag, year, PDFs downloaded, and total duration
  4. status subcommand: sansad-semantic-crawler status --out data/my-crawl/ reads a running or completed crawl and prints the summary on demand, with no Python one-liners needed
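
To make the request concrete, here is a minimal sketch of how items 1 to 3 could be computed. It is not taken from the crawler's code. The function names (format_progress, estimate_eta, summarise) and the manifest field names "house", "tag" and "year" are assumptions made for illustration, and the ETA uses a plain mean time per completed bucket as a stand-in for whatever rolling estimate is eventually chosen.

```python
import json
from collections import Counter
from typing import Iterable, Optional


def format_progress(house: str, done: int, total: int, query: str,
                    ministry: str, new: int, running_total: int) -> str:
    """Render one progress line in the format proposed in item 1."""
    return (f"[{house} {done}/{total}] query='{query}' ministry={ministry} "
            f"→ {new} new | total={running_total}")


def estimate_eta(elapsed_s: float, done: int, total: int,
                 min_fraction: float = 0.10) -> Optional[float]:
    """Return estimated seconds remaining, or None until enough buckets are done.

    Assumption: mean time per completed bucket so far, multiplied by the
    number of buckets still to run.
    """
    if total <= 0 or done <= 0 or done / total < min_fraction:
        return None
    return elapsed_s / done * (total - done)


def summarise(manifest_lines: Iterable[str]) -> dict:
    """Count records by house, tag and year from JSONL manifest lines.

    The field names "house", "tag" and "year" are hypothetical; the real
    manifest schema may differ.
    """
    counts = {"house": Counter(), "tag": Counter(), "year": Counter()}
    total = 0
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in a manifest that is still being written
        record = json.loads(line)
        total += 1
        for field, counter in counts.items():
            counter[record.get(field, "unknown")] += 1
    return {"total": total, **{k: dict(v) for k, v in counts.items()}}
```

A status subcommand could then open manifest.jsonl under the --out directory, pass the file object to summarise, and print the result as a table. Because summarise only reads lines, it would work on a crawl that is still running as well as on a completed one.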

Why

The tool is for researchers, not just developers. A full two-house crawl takes 90+ minutes. Without feedback, users cannot tell if it is working, stuck, or nearly done — which erodes trust and forces manual inspection of internal files.
