Add archive.org download support to download.py #25

Merged: ali5ter merged 1 commit into main from feature/archive-org-download on Apr 13, 2026

@ali5ter commented Apr 13, 2026

Owner

Summary

  • Auto-detects source from URL: archive.org/details/ routes to the new archive.org downloader; all other URLs use the existing World Radio History scrape path unchanged — no behaviour change for existing usage
  • New --pdf-format {text,image,both} flag (default text) selects between *_text.pdf files (which carry the ABBYY OCR text layer that convert.py extracts from) and plain image PDFs
  • New --year-from / --year-to flags filter by year extracted from the filename, so a decade slice like --year-from 1974 --year-to 1989 works without a full-collection download
  • Adds requirements.txt listing both pymupdf and internetarchive; updates README with archive.org usage examples and a flag reference table
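
For reviewers who want the routing and flag surface at a glance, here is a minimal sketch of how the URL auto-detection and the new options could be wired up. The names `detect_source` and `build_parser`, and the source labels `"archive_org"` and `"wrh"`, are illustrative assumptions and may not match what download.py actually uses; only the flag names, choices, and defaults come from the summary above.

```python
import argparse
from urllib.parse import urlparse


def detect_source(url: str) -> str:
    """Pick a downloader from the URL (hypothetical helper, not from the PR).

    archive.org/details/ URLs go to the archive.org path; everything else
    falls through to the World Radio History scrape path.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("archive.org", "www.archive.org") and parsed.path.startswith("/details/"):
        return "archive_org"
    return "wrh"


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the flags described in the summary."""
    parser = argparse.ArgumentParser(prog="download.py")
    parser.add_argument("url", help="collection or index page URL")
    parser.add_argument(
        "--pdf-format",
        choices=["text", "image", "both"],
        default="text",
        help="which PDF variant to fetch from archive.org",
    )
    parser.add_argument("--year-from", type=int, default=None)
    parser.add_argument("--year-to", type=int, default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(detect_source(args.url), args.pdf_format, args.year_from, args.year_to)
```

Keeping the detection in one small function means the World Radio History path stays the default, so any URL that is not an archive.org details page behaves as it did before.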

Closes #4.

Test plan

  • python3 download.py --help shows all new flags
  • python3 download.py "https://archive.org/details/ElektorMagazine" --year-from 1974 --year-to 1974 --dry-run lists only 1974 _text.pdf files without downloading
  • python3 download.py "https://archive.org/details/ElektorMagazine" --pdf-format image --dry-run lists only plain .pdf files (no _text suffix)
  • python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" --dry-run still works as before
  • markdownlint README.md passes with zero warnings
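
The two archive.org dry-run checks above depend on filtering by filename. The sketch below shows one way that selection could work, as an illustration only: `year_of`, `wanted_format`, and `select_files` are hypothetical names, the sample filenames are made up, and skipping files with no recognisable year whenever a year filter is active is an assumption rather than documented behaviour.

```python
import re
from typing import Iterable, List, Optional

# Assumption: the year appears in the filename as a standalone 19xx or 20xx number.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def year_of(filename: str) -> Optional[int]:
    """Return the first four-digit year found in the filename, if any."""
    match = YEAR_RE.search(filename)
    return int(match.group(0)) if match else None


def wanted_format(filename: str, pdf_format: str) -> bool:
    """Apply the --pdf-format choice: text, image, or both."""
    name = filename.lower()
    if not name.endswith(".pdf"):
        return False
    is_text = name.endswith("_text.pdf")
    if pdf_format == "text":
        return is_text
    if pdf_format == "image":
        return not is_text
    return True  # "both"


def select_files(
    filenames: Iterable[str],
    pdf_format: str = "text",
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[str]:
    """Return the filenames a dry run would list for the given flags."""
    selected = []
    for name in filenames:
        if not wanted_format(name, pdf_format):
            continue
        if year_from is not None or year_to is not None:
            year = year_of(name)
            if year is None:
                continue  # assumption: undated files are skipped when filtering
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    sample = [  # made-up filenames for illustration
        "Elektor_1974-12_text.pdf",
        "Elektor_1974-12.pdf",
        "Elektor_1975-01_text.pdf",
    ]
    print(select_files(sample, "text", 1974, 1974))
```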

🤖 Generated with Claude Code

Auto-detect source from URL: archive.org/details/ routes to the new
internetarchive-backed downloader; all other URLs use the existing
World Radio History scrape path unchanged.

New flags: --pdf-format (text/image/both, default text), --year-from,
--year-to. Adds requirements.txt and updates README with archive.org
usage examples. Closes #4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ali5ter merged commit b8785d9 into main Apr 13, 2026
1 check passed
@ali5ter deleted the feature/archive-org-download branch April 13, 2026 01:27