Problem
When running a long crawl (--house both across 8 search groups × 5 ministries × ~270 RS sessions), there is no way to know:
- How many records have been collected so far
- Which query bucket is currently running and how many remain
- Estimated time to completion
- A final summary of what was retrieved
Users are left polling the manifest (wc -l manifest.jsonl) and the log (tail crawl.log) manually. This is confusing for non-developer researchers who run the tool for actual parliamentary research.
Requested behaviour
- Live progress line: printed after each bucket completes, for example:
[LS 12/40] query='NTA' ministry=EDUCATION → 8 new | total=143
- ETA: a rolling estimate, printed once the first 10% of buckets have completed
- Completion summary: a final table showing records by house, tag, and year, plus PDFs downloaded and total duration
- status subcommand: sansad-semantic-crawler status --out data/my-crawl/ reads a running or completed crawl and prints the summary on demand, so no Python one-liners are needed
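To make the request concrete, here is a rough sketch of how the progress line, the ETA, and a manifest-based summary could fit together. It is not taken from the crawler's code: the ProgressTracker class, the summarize_manifest function, and the "house" field in manifest.jsonl are all hypothetical names chosen for illustration. The ETA here is a plain average over all completed buckets, which is the simplest stand-in for a rolling estimate.

```python
import json
import time
from collections import Counter


class ProgressTracker:
    """Hypothetical helper: tracks finished buckets for one house."""

    def __init__(self, house, total_buckets, clock=time.monotonic):
        self.house = house
        self.total_buckets = total_buckets
        self.done = 0
        self.total_records = 0
        self._clock = clock
        self._start = clock()

    def bucket_done(self, query, ministry, new_records):
        """Record one finished bucket and return its progress line."""
        self.done += 1
        self.total_records += new_records
        return (
            f"[{self.house} {self.done}/{self.total_buckets}] "
            f"query={query!r} ministry={ministry} "
            f"→ {new_records} new | total={self.total_records}"
        )

    def eta_seconds(self):
        """Estimated seconds remaining, or None before 10% of buckets are done."""
        if self.done == 0 or self.done < 0.1 * self.total_buckets:
            return None
        elapsed = self._clock() - self._start
        per_bucket = elapsed / self.done  # average over all completed buckets
        return per_bucket * (self.total_buckets - self.done)


def summarize_manifest(manifest_path):
    """Count manifest records by house.

    Assumes one JSON object per line with a "house" key (an assumption,
    not the confirmed manifest schema).
    """
    counts = Counter()
    with open(manifest_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A running crawl may have a half-written last line; skip it.
                continue
            counts[record.get("house", "unknown")] += 1
    return counts
```

A status subcommand could call something like summarize_manifest on the --out directory's manifest, which would work for both running and completed crawls because incomplete trailing lines are skipped rather than treated as errors.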
Why
The tool is for researchers, not just developers. A full two-house crawl takes 90+ minutes. Without feedback, users cannot tell if it is working, stuck, or nearly done — which erodes trust and forces manual inspection of internal files.