Poland document spider #38

cc @yolile @allakulov 

Unreviewed and untested (but potentially working) code at https://github.com/open-contracting/collect-generic/tree/poland-documents

<details><summary>Plan</summary>

# Download attachments and structured data from ezamowienia.gov.pl

## Context

The Polish public-procurement portal at https://ezamowienia.gov.pl/mp-client/search/list lists ~22,600 procedures. The user wants, per procedure:

1. The non-HTML **attachments** (PDFs, ZIPs, DOCX, etc. — the actual procurement documents).
2. The **basic information** that the SPA renders (organisation, dates, terms, document index) saved as JSON.
3. The **announcement** ("ogłoszenie", a section-numbered HTML key-value document) saved as parsed JSON, with the raw HTML kept alongside as a fallback.

The page is an Angular SPA, so all data flows over JSON APIs. None of them require authentication. The collector should iterate the full catalogue once, then re-runs should pick up only what's new.

## Approach

Add a new top-level Python package `ezamowienia/` (peer to `netherlands_ocds_transformer/` and `portland_ocds/`). Single-threaded, polite, resumable bulk collector.

### File layout (new)

```
ezamowienia/
├── __init__.py
├── README.md           # one-screen usage notes
├── client.py           # requests.Session w/ retry/backoff, rate-limit, helpers per endpoint
├── parse_notice.py     # parse Board/GetNoticeHtmlBody → nested dict
├── collect.py          # paginated list loop, per-procedure fetch+save, resume logic
└── __main__.py         # `python -m ezamowienia ...` Click CLI
```

### Output directory layout (default `./output/ezamowienia/`, `--output-dir` overrides)

```
output/ezamowienia/
├── _state.json                                   # {last_seen_object_id, last_seen_initiation_date, run_log_path}
├── _errors.jsonl                                 # one line per failed tender (id, step, http status, msg)
└── <tenderId>/                                   # tenderId == objectId from SearchTenders, e.g. ocds-148610-…
    ├── tender.json                               # Search/GetTender response
    ├── documents.json                            # Search/GetTenderDocuments response
    ├── notice_meta.json                          # mo-board GetNoticeDetails response (or null marker if no BZP notice)
    ├── announcement.html                         # raw mo-board GetNoticeHtmlBody body
    ├── announcement.json                         # parsed key-value (see parse_notice.py)
    └── attachments/
        └── <documentId>__<sanitized-filename>    # one file per non-HTML tenderDocuments[i]
```

### API endpoints (reverse-engineered from the SPA bundle)

| Purpose | Method | URL |
| --- | --- | --- |
| Paginated list | GET | `https://ezamowienia.gov.pl/mp-readmodels/api/Search/SearchTenders?Page={n}&PageSize=100&SortingColumnName=InitiationDate&SortingDirection=DESC` |
| Basic info | GET | `https://ezamowienia.gov.pl/mp-readmodels/api/Search/GetTender?id={tenderId}` |
| Document index | GET | `https://ezamowienia.gov.pl/mp-readmodels/api/Search/GetTenderDocuments?tenderId={tenderId}` |
| Document download | GET | `https://ezamowienia.gov.pl/mp-readmodels/api/Tender/DownloadDocument/{tenderId}/{documentId}` |
| Announcement metadata | GET | `https://ezamowienia.gov.pl/mo-board/api/v1/Board/GetNoticeDetails?noticeNumber={bzpNumber}` |
| Announcement HTML | GET | `https://ezamowienia.gov.pl/mo-board/api/v1/Board/GetNoticeHtmlBody?noticeNumber={bzpNumber}` |

`SearchTenders` returns 200 with a JSON array and an `X-Pagination` header (`{TotalCount, PageSize, CurrentPage, TotalPages, HasNext, HasPrevious}`) — drive the loop from `HasNext`.
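
A minimal sketch of that loop, assuming the `X-Pagination` header value is a JSON object with the keys listed above. `fetch_page` is a hypothetical callable (standing in for a `search_tenders(page)` helper) that returns one page of tenders together with the response headers; `max_pages` mirrors the `--max-pages` bound.

```python
from __future__ import annotations

import json
from typing import Callable, Iterator, Mapping


def iter_tenders(
    fetch_page: Callable[[int], tuple[list[dict], Mapping[str, str]]],
    max_pages: int | None = None,
) -> Iterator[dict]:
    """Yield tenders page by page until X-Pagination reports HasNext == false."""
    page = 1
    while max_pages is None or page <= max_pages:
        tenders, headers = fetch_page(page)
        yield from tenders
        # The header carries a JSON object; a missing header ends the loop.
        pagination = json.loads(headers.get("X-Pagination", "{}"))
        if not pagination.get("HasNext"):
            break
        page += 1
```

Driving the loop from `HasNext` rather than a precomputed `TotalPages` keeps it correct if the catalogue grows while the crawl is running.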

### Per-procedure flow (collect.py)

For each tender from the list page:

1. If `output/ezamowienia/<tenderId>/.done` exists, skip.
2. `GET Search/GetTender?id=<tenderId>` → write `tender.json`. The `tenderDocuments[]` array carries `objectId`, `attachment.fileName`, `attachment.mimeType`, `attachment.fileSize`.
3. `GET Search/GetTenderDocuments?tenderId=<tenderId>` → write `documents.json` (independent index used to mirror the public download URLs).
4. For each document where `attachment.mimeType != 'text/html'`: stream `Tender/DownloadDocument/{tenderId}/{documentId}` to `attachments/<documentId>__<sanitized fileName>`. Use the `Content-Disposition` filename when present. Skip files already on disk with the expected size.
5. If `tender.json` has a `noticeNumber`/`bzpNumber`:
   - `GET Board/GetNoticeDetails` → `notice_meta.json`
   - `GET Board/GetNoticeHtmlBody` → `announcement.html`
   - Parse → `announcement.json`
6. Touch `.done` so re-runs skip cleanly.

Errors at any step append a line to `_errors.jsonl` and move on; missing-announcement (404) is logged but not treated as a failure.
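
Step 4's filter and naming could be sketched as follows, assuming the document shape described in step 2 (`objectId`, `attachment.fileName`, `attachment.mimeType`). The sanitisation rule is an assumption, since the plan does not define one, and both function names are hypothetical.

```python
import re
from pathlib import Path


def sanitize_filename(name: str) -> str:
    """Replace anything outside a conservative whitelist (an assumed rule)."""
    cleaned = re.sub(r"[^\w.\-]+", "_", name).strip("._")
    return cleaned or "unnamed"


def attachment_targets(tender: dict, tender_dir: Path) -> list:
    """Return (documentId, destination path) for each non-HTML document."""
    targets = []
    for document in tender.get("tenderDocuments", []):
        attachment = document.get("attachment") or {}
        if attachment.get("mimeType") == "text/html":
            continue  # HTML documents are handled via the announcement endpoints
        document_id = document["objectId"]
        filename = sanitize_filename(attachment.get("fileName", ""))
        dest = tender_dir / "attachments" / f"{document_id}__{filename}"
        targets.append((document_id, dest))
    return targets
```

Prefixing the `documentId` keeps two documents with the same file name from overwriting each other, and stripping path separators keeps a hostile file name from escaping `attachments/`.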

### Resume / pagination (collect.py)

- Sorted DESC by `InitiationDate`, so newest is page 1.
- After every successful procedure, update `_state.json.last_seen_object_id`.
- On a later run, walk pages from 1 and stop at the first tender whose `.done` file exists AND whose `objectId == last_seen_object_id`: every tender listed before it is genuinely new (newer than the last run). Bounded by `--max-pages` for safety.
- A `--full` flag re-walks the entire catalogue regardless of state, still skipping per-procedure work where `.done` exists.
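
The stop rule and the `--full` override can be sketched as a single generator. This is an illustration under assumptions, not the implementation: `tenders_to_process` is a hypothetical name, `tenders` is the listing in `InitiationDate` DESC order, and `last_seen_object_id` is the value read from `_state.json` (`None` when no state exists).

```python
from pathlib import Path
from typing import Iterable, Iterator, Optional


def tenders_to_process(
    tenders: Iterable[dict],
    output_dir: Path,
    last_seen_object_id: Optional[str],
    full: bool = False,
) -> Iterator[dict]:
    """Yield tenders that still need work, newest first."""
    for tender in tenders:
        tender_id = tender["objectId"]
        done = (output_dir / tender_id / ".done").exists()
        # Checkpoint reached: everything after it was seen by an earlier run.
        if not full and done and tender_id == last_seen_object_id:
            return
        if not done:
            yield tender
```

With `full=True` the checkpoint is ignored, so a tender that an earlier run left unfinished further down the listing is still picked up.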

### Client behaviour (client.py)

- One `requests.Session` with a retry adapter (`urllib3.Retry` on 429/5xx, exponential backoff 1→16s, max 5 attempts).
- A small `time.sleep(0.2)` between requests as a courtesy throttle.
- `User-Agent: ezamowienia-collector (+contact: jmckinney@open-contracting.org)`.
- Streaming downloads (`stream=True`, write to `<file>.part`, atomic rename).
- Helper functions: `search_tenders(page)`, `get_tender(id)`, `get_documents(id)`, `download_document(tender_id, doc_id, dest)`, `get_notice_details(num)`, `get_notice_html(num)`.
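
The `.part`-then-rename step might look like the sketch below. `write_atomically` is a hypothetical helper; `chunks` stands in for whatever the streaming response yields (for example `response.iter_content(...)` in `requests`), so the sketch itself needs only the standard library.

```python
import os
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], dest: Path) -> int:
    """Stream chunks to `<dest>.part`, then rename; return bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    written = 0
    try:
        with part.open("wb") as handle:
            for chunk in chunks:
                handle.write(chunk)
                written += len(chunk)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(part, dest)
    finally:
        # Remove a leftover partial file if the download failed midway.
        part.unlink(missing_ok=True)
    return written
```

Because the final name only appears after a complete write, the "skip files already on disk with the expected size" check never mistakes a truncated download for a finished one. The returned byte count can be compared against `attachment.fileSize`.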

### Announcement parser (parse_notice.py)

The HTML body is plain DOM with predictable structure:

```html
<h2 class="bg-light p-3 mt-4">SEKCJA I - ZAMAWIAJĄCY</h2>
<h3 class="mb-0">1.5.1.) Ulica: <span class="normal">19</span></h3>
<p class="mb-0">Free-text continuation</p>
```

Use `lxml.html` (already a transitive dependency — see commit 857abf2 bumping lxml to 6.1.0). Walk top-level children in order, producing:

```json
{
  "sections": [
    {"title": "SEKCJA I - ZAMAWIAJĄCY",
     "items": [
       {"key": "1.5.1", "label": "Ulica", "value": "19", "extra": null},
       {"key": "1.1",   "label": "Rola zamawiającego", "value": null,
        "extra": "Postępowanie prowadzone jest samodzielnie przez zamawiającego"}
     ]}
  ],
  "title": "Ogłoszenie o zamówieniu — Dostawy — …",
  "raw_html_path": "announcement.html"
}
```

Key extraction regex: `^\s*([0-9]+(?:\.[0-9]+)*)\.\)\s*(.*?)(?::\s*(.*))?$` against the `<h3>` text content; the `<span class="normal">` text is the value when present, otherwise the post-colon remainder, otherwise `None`. A trailing `<p>` sibling becomes `extra`. Unrecognised h3s are kept under a `"_unparsed"` list rather than dropped, so the raw HTML is never the only source of truth.
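
A small sketch of the key extraction, using the regex exactly as given above. `parse_heading` is a hypothetical helper: it receives the `<h3>` text content and, when present, the text of the `<span class="normal">`, and leaves the DOM walking and the `extra` handling to the caller.

```python
import re

KEY_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)*)\.\)\s*(.*?)(?::\s*(.*))?$")


def parse_heading(text, span_value=None):
    """Split an <h3> text into key, label and value; None if unrecognised."""
    match = KEY_RE.match(text)
    if match is None:
        return None  # caller files the heading under "_unparsed"
    key, label, remainder = match.groups()
    # Prefer the <span class="normal"> text, then the post-colon remainder.
    value = span_value if span_value is not None else (remainder or None)
    return {"key": key, "label": label.strip(), "value": value}
```

The lazy `(.*?)` stops the label at the first colon, so a value that itself contains colons (a time such as `10:00`, say) stays intact in the remainder.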

### CLI (__main__.py via Click)

```
python -m ezamowienia collect            # first run walks everything; later runs resume from _state.json
python -m ezamowienia collect --limit 5  # smoke test: first 5 procedures only
python -m ezamowienia collect --full     # ignore _state.json checkpoint
python -m ezamowienia parse-notice <bzpNumber>   # one-shot announcement debug helper
```

Default `--output-dir` is `./output/ezamowienia/`. Add `output/` to `.gitignore` if not already present.

### Dependencies

Add to `requirements.in`: `requests`, `lxml` (lxml is already pinned transitively but make it explicit). `click` is already there.

## Critical files

- New: `ezamowienia/__init__.py`, `ezamowienia/client.py`, `ezamowienia/parse_notice.py`, `ezamowienia/collect.py`, `ezamowienia/__main__.py`, `ezamowienia/README.md`.
- Modify: `requirements.in` (add `requests`, `lxml`), `.gitignore` (add `output/`), `requirements.txt` (regenerate with `uv pip compile`).

## Verification

1. Smoke run: `python -m ezamowienia collect --limit 3 --output-dir /tmp/ezam-test/`. Confirm:
   - 3 directories created under `/tmp/ezam-test/` (`--output-dir` replaces the default `./output/ezamowienia/`).
   - Each has `tender.json`, `documents.json`, `notice_meta.json`, `announcement.html`, `announcement.json`, `attachments/` with the right number of non-HTML files.
   - File sizes match `attachment.fileSize` from `tender.json`.
   - `announcement.json` has a non-empty `sections[]` and no items in `_unparsed` for at least one tender.
2. Resume check: re-run the same command. Expect zero new HTTP requests for those three tenders (skipped via `.done`); confirm by tailing logs.
3. Spot-check known tender from this exploration: `ocds-148610-3c9cfc83-d86a-4b43-8b0f-fcd455f7cf54` / `2026/BZP 00227455/01`. Should yield 4 attachments (1 ZIP + 3 PDFs) and an announcement with section "SEKCJA I - ZAMAWIAJĄCY" → "1.5.1 Ulica: 19".
4. Fault-injection: pick a tender that lacks a BZP notice (rare but possible — pre-publication state). Confirm it ends with `.done` written and a warning row in `_errors.jsonl` rather than aborting.
5. Unit test the parser with the saved `announcement.html` from the spot-check (round-trip parse, assert ≥ one expected key/value).

</details>

<details><summary>Prompt to resume work</summary>

Resume context — Polish tender collector (Scrapy)

Goal: Implement the plan at /Users/james/.claude/plans/i-want-to-download-golden-giraffe.md (originally written for a standalone Python package ezamowienia/) inside this Scrapy repo, updating the existing poland spider.

Files changed

1. `generic_scrapy/spiders/poland.py` — full rewrite. Replaces the old mo-board notices → CSV flow with: `SearchTenders` (paginate) → `GetTender` (per tender) → fan-out to `DownloadDocument` (non-HTML attachments), `GetTenderDocuments`, and `Board/GetNoticeDetails` → `Board/GetNoticeHtmlBody` (parsed via parser module). Inherits `BaseSpider`, not `ExportFileSpider` (no aggregate output any more).
2. `generic_scrapy/parsers/poland_announcement.py` (new) + `generic_scrapy/parsers/__init__.py` (new, empty). `parse_announcement(html)` walks h2/h3/p elements with regex `^\s*([0-9]+(?:\.[0-9]+)*)\.?\)\s*(.*?)(?::\s*(.*))?$` (loosened from the plan's `\.\)` to `\.?\)` so that it also catches `1.4)`). Returns `{title, sections:[{title, items:[{key,label,value,extra}]}], _unparsed}`. Uses `scrapy.Selector`, so no new dependency.
3. `generic_scrapy/filters.py` — deleted `PolandNoticeFilter` and `PolandContractorFilter` (unused after CSV outputs dropped).

Output layout

`<FILES_STORE>/poland/<crawl_directory>/<tenderId>/` containing `tender.json`, `documents.json`, `notice_meta.json`, `announcement.html`, `announcement.json`, `attachments/<docId>__<sanitized-filename>`. Resume = re-run with same `crawl_directory`; skip is "if `tender.json` exists, don't fetch the tender again".

Plan items deliberately not implemented (Scrapy already does it)

- `_state.json`, `_errors.jsonl`, `.done` markers, `urllib3.Retry`, `time.sleep` throttle, Click CLI / `__main__.py` — replaced by Scrapy's RetryMiddleware (429/5xx), AutoThrottle (off by default — `DOWNLOAD_DELAY` in settings.py if needed), stats collection, and `scrapy crawl`.
- The plan suggested adding requests + lxml to requirements.in — not needed; using Scrapy + scrapy.Selector.

Verification done

- Spider runs cleanly: `uv run scrapy crawl poland -a sample=3` → 18 requests, all 200, 5 attachments downloaded with sizes matching `attachment.fileSize` exactly.
- Parser run against the plan's spot-check tender `ocds-148610-3c9cfc83-…` produces `{key:"1.5.1", label:"Ulica", value:"19"}` under section "SEKCJA I - ZAMAWIAJĄCY". 4 _unparsed entries are legitimate subsection labels like "Kryterium 1".
- Resume verified: second run with same `crawl_directory=…` skipped the 3 originally-collected tenders and processed 3 newer arrivals instead.
- `uv run ruff check` clean on all three files.

Commands to invoke

```
uv run scrapy crawl poland                              # full catalogue
uv run scrapy crawl poland -a sample=3                  # smoke test (3 tenders)
uv run scrapy crawl poland -a crawl_directory=<dir>     # resume into existing run
uv run scrapy crawl poland -s FILES_STORE=/tmp/x ...    # override output root
```

State

- M generic_scrapy/filters.py
- M generic_scrapy/spiders/poland.py
- A generic_scrapy/parsers/__init__.py
- A generic_scrapy/parsers/poland_announcement.py

Open questions if work continues

- Need a unit test for `parse_announcement` (plan item 5 under Verification — never written; the parser was only manually validated against one saved HTML body).
- No README update for the new spider behaviour (README.md is currently 2 lines).
- Existing `incrementalupdate` command in `generic_scrapy/commands/` was designed for `ExportFileSpider`-based date-windowing — it no longer applies to the poland spider since this one uses skip-if-exists instead of `from_date`/`until_date`.

</details>
