45 changes: 43 additions & 2 deletions README.md
@@ -27,10 +27,17 @@ so the shape of your library is version-controlled even if the contents are not.
## Requirements

```bash
pip3 install pymupdf
pip3 install -r requirements.txt
```

Python 3.10+. No other dependencies.
Or install individually:

```bash
pip3 install pymupdf # required for convert.py
pip3 install internetarchive # required for archive.org downloads
```

Python 3.10+.

---

@@ -51,6 +58,10 @@ Python 3.10+. No other dependencies.

### 1. Download a collection

The source (World Radio History or archive.org) is auto-detected from the URL. Both modes accept `--output-dir`, `--delay`, and `--dry-run`.
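
As a rough illustration of URL-based source detection, the sketch below dispatches on the hostname and pulls the item identifier out of an archive.org `/details/` URL. The function names `detect_source` and `archive_org_identifier` and the returned labels are hypothetical; `download.py` may do this differently.

```python
from urllib.parse import urlparse


def detect_source(url: str) -> str:
    """Return a source label based on the URL's hostname (hypothetical helper)."""
    host = (urlparse(url).hostname or "").lower()
    if host == "archive.org" or host.endswith(".archive.org"):
        return "archive_org"
    if host == "worldradiohistory.com" or host.endswith(".worldradiohistory.com"):
        return "world_radio_history"
    raise ValueError(f"Unsupported source URL: {url}")


def archive_org_identifier(url: str) -> str:
    """Extract <identifier> from an archive.org /details/<identifier> URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "details":
        return parts[1]
    raise ValueError(f"Not an archive.org item URL: {url}")


if __name__ == "__main__":
    for u in (
        "https://www.worldradiohistory.com/ETI_Magazine.htm",
        "https://archive.org/details/ElektorMagazine",
    ):
        print(u, "->", detect_source(u))
```

Matching on the parsed hostname rather than a substring of the whole URL avoids misclassifying a URL that merely mentions the other site in its path or query string.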

**World Radio History** — scrapes PDF links from an archive page:

```bash
# Preview what would be downloaded
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" --dry-run
@@ -64,6 +75,36 @@ python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
--filter "1970" --output-dir collections/eti/pdfs
```

**archive.org** — downloads files from a single archive.org item by identifier.
Each issue typically has two PDF variants: a plain image PDF and a `_text.pdf` with an
ABBYY OCR text layer. The `--pdf-format` flag controls which variant is downloaded
(`text` is the default since `convert.py` extracts from the OCR layer):

```bash
# Download all OCR PDFs from an archive.org item
python3 download.py "https://archive.org/details/ElektorMagazine" \
--output-dir collections/elektor/pdfs

# Download only issues from a specific decade
python3 download.py "https://archive.org/details/ElektorMagazine" \
--output-dir collections/elektor/pdfs \
--year-from 1974 --year-to 1989

# Download image-only PDFs (no OCR layer)
python3 download.py "https://archive.org/details/ElektorMagazine" \
--pdf-format image --output-dir collections/elektor/pdfs

# Preview without downloading
python3 download.py "https://archive.org/details/ElektorMagazine" \
--year-from 1980 --dry-run
```

| Flag | Description | Default |
| --- | --- | --- |
| `--pdf-format` | Which PDF variant to download: `text` (`_text.pdf`, with OCR layer), `image` (plain image PDF), or `both` | `text` |
| `--year-from` | Only download files with a year >= this value | — |
| `--year-to` | Only download files with a year <= this value | — |

### 2. Probe the collection structure

```bash