Poland document spider #38

@jpmckinney

cc @yolile @allakulov

Unreviewed and untested (but potentially working) code at https://github.com/open-contracting/collect-generic/tree/poland-documents

Plan

Download attachments and structured data from ezamowienia.gov.pl

Context

The Polish public-procurement portal at https://ezamowienia.gov.pl/mp-client/search/list lists ~22,600 procedures. The user wants, per procedure:

  1. The non-HTML attachments (PDFs, ZIPs, DOCX, etc. — the actual procurement documents).
  2. The basic information that the SPA renders (organisation, dates, terms, document index) saved as JSON.
  3. The announcement ("ogłoszenie", a section-numbered HTML key-value document) saved as parsed JSON, with the raw HTML kept alongside as a fallback.

The page is an Angular SPA, so all data flows over JSON APIs. None of them require authentication. The collector should iterate the full catalogue once, then re-runs should pick up only what's new.

Approach

Add a new top-level Python package ezamowienia/ (peer to netherlands_ocds_transformer/ and portland_ocds/). Single-threaded, polite, resumable bulk collector.

File layout (new)

ezamowienia/
├── __init__.py
├── README.md           # one-screen usage notes
├── client.py           # requests.Session w/ retry/backoff, rate-limit, helpers per endpoint
├── parse_notice.py     # parse Board/GetNoticeHtmlBody → nested dict
├── collect.py          # paginated list loop, per-procedure fetch+save, resume logic
└── __main__.py         # `python -m ezamowienia ...` Click CLI

Output directory layout (default ./output/ezamowienia/, --output-dir overrides)

output/ezamowienia/
├── _state.json                                   # {last_seen_object_id, last_seen_initiation_date, run_log_path}
├── _errors.jsonl                                 # one line per failed tender (id, step, http status, msg)
└── <tenderId>/                                   # tenderId == objectId from SearchTenders, e.g. ocds-148610-…
    ├── tender.json                               # Search/GetTender response
    ├── documents.json                            # Search/GetTenderDocuments response
    ├── notice_meta.json                          # mo-board GetNoticeDetails response (or null marker if no BZP notice)
    ├── announcement.html                         # raw mo-board GetNoticeHtmlBody body
    ├── announcement.json                         # parsed key-value (see parse_notice.py)
    └── attachments/
        └── <documentId>__<sanitized-filename>    # one file per non-HTML tenderDocuments[i]

API endpoints (reverse-engineered from the SPA bundle)

| Purpose | Method | URL |
|---|---|---|
| Paginated list | GET | https://ezamowienia.gov.pl/mp-readmodels/api/Search/SearchTenders?Page={n}&PageSize=100&SortingColumnName=InitiationDate&SortingDirection=DESC |
| Basic info | GET | https://ezamowienia.gov.pl/mp-readmodels/api/Search/GetTender?id={tenderId} |
| Document index | GET | https://ezamowienia.gov.pl/mp-readmodels/api/Search/GetTenderDocuments?tenderId={tenderId} |
| Document download | GET | https://ezamowienia.gov.pl/mp-readmodels/api/Tender/DownloadDocument/{tenderId}/{documentId} |
| Announcement metadata | GET | https://ezamowienia.gov.pl/mo-board/api/v1/Board/GetNoticeDetails?noticeNumber={bzpNumber} |
| Announcement HTML | GET | https://ezamowienia.gov.pl/mo-board/api/v1/Board/GetNoticeHtmlBody?noticeNumber={bzpNumber} |

SearchTenders returns 200 with a JSON array and an X-Pagination header ({TotalCount, PageSize, CurrentPage, TotalPages, HasNext, HasPrevious}) — drive the loop from HasNext.
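
A minimal sketch of a loop driven by HasNext, assuming the X-Pagination header value is a JSON object with the fields listed above. `fetch_page` is a hypothetical callable (not part of the plan) that returns the tender list and the raw header value for page `n`; `max_pages` mirrors the plan's `--max-pages` bound.

```python
import json


def parse_pagination(header_value):
    """Decode the X-Pagination header value (assumed to be a JSON object)."""
    return json.loads(header_value)


def iter_pages(fetch_page, max_pages=None):
    """Yield tenders page by page until HasNext is false.

    fetch_page(n) is a hypothetical callable returning (tenders, header_value).
    """
    page = 1
    while True:
        tenders, header_value = fetch_page(page)
        yield from tenders
        pagination = parse_pagination(header_value)
        if not pagination.get("HasNext"):
            break  # last page reached
        if max_pages is not None and page >= max_pages:
            break  # safety bound
        page += 1
```

Keeping the HTTP call behind `fetch_page` lets the loop be exercised with canned pages before it touches the live API.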

Per-procedure flow (collect.py)

For each tender from the list page:

  1. If output/ezamowienia/<tenderId>/.done exists, skip.
  2. GET Search/GetTender?id=<tenderId> → write tender.json. The tenderDocuments[] array carries objectId, attachment.fileName, attachment.mimeType, attachment.fileSize.
  3. GET Search/GetTenderDocuments?tenderId=<tenderId> → write documents.json (independent index used to mirror the public download URLs).
  4. For each document where attachment.mimeType != 'text/html': stream Tender/DownloadDocument/{tenderId}/{documentId} to attachments/<documentId>__<sanitized fileName>. Use the Content-Disposition filename when present. Skip files already on disk with the expected size.
  5. If tender.json has a noticeNumber/bzpNumber:
    • GET Board/GetNoticeDetails → notice_meta.json
    • GET Board/GetNoticeHtmlBody → announcement.html
    • Parse → announcement.json
  6. Touch .done so re-runs skip cleanly.

Errors at any step append a line to _errors.jsonl and move on; missing-announcement (404) is logged but not treated as a failure.
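
Step 4 could be sketched as below, assuming the GetTender response has the shape described in step 2. The helper names `sanitize_filename` and `attachment_targets` and the exact sanitisation rule are illustrative assumptions; the plan only says the file name is "sanitized".

```python
import re


def sanitize_filename(name):
    """Replace path separators and other risky characters with underscores."""
    cleaned = re.sub(r"[^\w.\-]+", "_", name.strip())
    # Drop leading/trailing dots and underscores so "../x" cannot escape the directory.
    return cleaned.strip("._") or "unnamed"


def attachment_targets(tender):
    """Return (documentId, relative path) pairs for the non-HTML documents."""
    targets = []
    for doc in tender.get("tenderDocuments", []):
        attachment = doc.get("attachment") or {}
        if attachment.get("mimeType") == "text/html":
            continue  # HTML documents are not downloaded as attachments
        filename = sanitize_filename(attachment.get("fileName", ""))
        targets.append((doc["objectId"], f"attachments/{doc['objectId']}__{filename}"))
    return targets
```

Prefixing the documentId keeps paths unique even when two documents share a file name.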

Resume / pagination (collect.py)

  • Sorted DESC by InitiationDate, so newest is page 1.
  • After every successful procedure, update _state.json.last_seen_object_id.
  • On each new run, walk pages from 1 and stop at the first tender whose .done file exists AND whose objectId == last_seen_object_id; everything listed before it is genuinely new (newer than the last run). Bounded by --max-pages for safety.
  • A --full flag re-walks the entire catalogue regardless of state, still skipping per-procedure work where .done exists.
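
The stop condition can be sketched as a small generator over the newest-first listing. `new_tenders` and the `is_done` callable are hypothetical names; passing `None` as the checkpoint approximates `--full`, since no objectId will equal it.

```python
def new_tenders(tenders, last_seen_object_id, is_done):
    """Yield tenders from a newest-first listing until the checkpoint is reached.

    is_done(object_id) is a hypothetical callable reporting whether the
    tender's .done marker exists on disk.
    """
    for tender in tenders:
        object_id = tender["objectId"]
        if object_id == last_seen_object_id and is_done(object_id):
            return  # everything from here on was handled by the previous run
        yield tender
```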

Client behaviour (client.py)

  • One requests.Session with a retry adapter (urllib3.Retry on 429/5xx, exponential backoff 1→16s, max 5 attempts).
  • A small time.sleep(0.2) between requests as a courtesy throttle.
  • User-Agent: ezamowienia-collector (+contact: jmckinney@open-contracting.org).
  • Streaming downloads (stream=True, write to <file>.part, atomic rename).
  • Helper functions: search_tenders(page), get_tender(id), get_documents(id), download_document(tender_id, doc_id, dest), get_notice_details(num), get_notice_html(num).
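
The ".part then atomic rename" step might look like the sketch below, which uses only the standard library. `write_atomically` is a hypothetical helper; in the planned client the chunks would presumably come from `response.iter_content(...)` on a streamed requests response.

```python
import os
from pathlib import Path


def write_atomically(chunks, dest):
    """Write an iterable of byte chunks to dest via a sibling .part file.

    Returns the number of bytes written so the caller can compare it
    with attachment.fileSize.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    written = 0
    with open(part, "wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
            written += len(chunk)
    # os.replace is atomic when source and destination are on the same filesystem.
    os.replace(part, dest)
    return written
```

An interrupted download then leaves only a .part file behind, so the "already on disk with the expected size" check never sees a truncated attachment under its final name.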

Announcement parser (parse_notice.py)

The HTML body is plain DOM with predictable structure:

<h2 class="bg-light p-3 mt-4">SEKCJA I - ZAMAWIAJĄCY</h2>
<h3 class="mb-0">1.5.1.) Ulica: <span class="normal">19</span></h3>
<p class="mb-0">Free-text continuation</p>

Use lxml.html (already a transitive dependency — see commit 857abf2 bumping lxml to 6.1.0). Walk top-level children in order, producing:

{
  "sections": [
    {"title": "SEKCJA I - ZAMAWIAJĄCY",
     "items": [
       {"key": "1.5.1", "label": "Ulica", "value": "19", "extra": null},
       {"key": "1.1",   "label": "Rola zamawiającego", "value": null,
        "extra": "Postępowanie prowadzone jest samodzielnie przez zamawiającego"}
     ]}
  ],
  "title": "Ogłoszenie o zamówieniu — Dostawy — …",
  "raw_html_path": "announcement.html"
}

Key extraction regex: ^\s*([0-9]+(?:\.[0-9]+)*)\.\)\s*(.*?)(?::\s*(.*))?$ against the <h3> text content; the <span class="normal"> text is the value when present, otherwise the post-colon remainder, otherwise None. A trailing <p> sibling becomes extra. Unrecognised h3s are kept under a "_unparsed" list rather than dropped, so the raw HTML is never the only source of truth.
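
A minimal sketch of the key extraction described above, using the regex exactly as given. `parse_heading` is a hypothetical helper; it takes the `<h3>` text content and, separately, the `<span class="normal">` text when the caller found one, and returns None for headings that belong in "_unparsed".

```python
import re

KEY_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)*)\.\)\s*(.*?)(?::\s*(.*))?$")


def parse_heading(text, span_text=None):
    """Split an <h3> text into key, label and value; None if it does not match."""
    match = KEY_RE.match(text)
    if match is None:
        return None  # caller should record this heading under "_unparsed"
    key, label, remainder = match.groups()
    # Precedence: span text, then the post-colon remainder, then None.
    value = span_text if span_text is not None else (remainder or None)
    return {"key": key, "label": label.strip(), "value": value}
```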

CLI (__main__.py via Click)

python -m ezamowienia collect            # full + resume
python -m ezamowienia collect --limit 5  # smoke test: first 5 procedures only
python -m ezamowienia collect --full     # ignore _state.json checkpoint
python -m ezamowienia parse-notice <bzpNumber>   # one-shot announcement debug helper

Default --output-dir is ./output/ezamowienia/. Add output/ to .gitignore if not already present.

Dependencies

Add to requirements.in: requests, lxml (lxml is already pinned transitively but make it explicit). click is already there.

Critical files

  • New: ezamowienia/__init__.py, ezamowienia/client.py, ezamowienia/parse_notice.py, ezamowienia/collect.py, ezamowienia/__main__.py, ezamowienia/README.md.
  • Modify: requirements.in (add requests, lxml), .gitignore (add output/), requirements.txt (regenerate with uv pip compile).

Verification

  1. Smoke run: python -m ezamowienia collect --limit 3 --output-dir /tmp/ezam-test/. Confirm:
    • 3 directories created under /tmp/ezam-test/ (--output-dir replaces the default ./output/ezamowienia/).
    • Each has tender.json, documents.json, notice_meta.json, announcement.html, announcement.json, attachments/ with the right number of non-HTML files.
    • File sizes match attachment.fileSize from tender.json.
    • announcement.json has a non-empty sections[] and no items in _unparsed for at least one tender.
  2. Resume check: re-run the same command. Expect zero new HTTP requests for those three tenders (skipped via .done); confirm by tailing logs.
  3. Spot-check known tender from this exploration: ocds-148610-3c9cfc83-d86a-4b43-8b0f-fcd455f7cf54 / 2026/BZP 00227455/01. Should yield 4 attachments (1 ZIP + 3 PDFs) and an announcement with section "SEKCJA I - ZAMAWIAJĄCY" → "1.5.1 Ulica: 19".
  4. Fault-injection: pick a tender that lacks a BZP notice (rare but possible — pre-publication state). Confirm it ends with .done written and a warning row in _errors.jsonl rather than aborting.
  5. Unit test the parser with the saved announcement.html from the spot-check (round-trip parse, assert ≥ one expected key/value).
Prompt to resume work

Resume context — Polish tender collector (Scrapy)

Goal: Implement the plan at /Users/james/.claude/plans/i-want-to-download-golden-giraffe.md (originally written for a standalone Python package ezamowienia/) inside this Scrapy repo, updating the existing poland spider.

Files changed

  1. generic_scrapy/spiders/poland.py: full rewrite. Replaces the old mo-board notices → CSV flow with: SearchTenders (paginate) → GetTender (per tender) → fan-out to DownloadDocument (non-HTML attachments), GetTenderDocuments, and Board/GetNoticeDetails + Board/GetNoticeHtmlBody (parsed via the parser module). Inherits BaseSpider, not ExportFileSpider (no aggregate output any more).
  2. generic_scrapy/parsers/poland_announcement.py (new) + generic_scrapy/parsers/__init__.py (new empty). parse_announcement(html) walks h2/h3/p elements with regex ^\s*([0-9]+(?:\.[0-9]+)*)\.?\)\s*(.*?)(?::\s*(.*))?$ (loosened from plan's \.\) to also catch 1.4)). Returns {title, sections:[{title, items:[{key,label,value,extra}]}], _unparsed}. Uses scrapy.Selector — no new dep.
  3. generic_scrapy/filters.py — deleted PolandNoticeFilter and PolandContractorFilter (unused after CSV outputs dropped).
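
The effect of loosening `\.\)` to `\.?\)` can be checked with a short comparison. Both patterns are copied from the text above; `heading_key` and the sample heading "1.4) Nazwa" are illustrative assumptions, not taken from a real notice.

```python
import re

STRICT = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)*)\.\)\s*(.*?)(?::\s*(.*))?$")
LOOSE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)*)\.?\)\s*(.*?)(?::\s*(.*))?$")


def heading_key(pattern, text):
    """Return the section key a pattern extracts from a heading, or None."""
    match = pattern.match(text)
    return match.group(1) if match else None


# Headings written with ".)" are matched by both patterns.
both = (heading_key(STRICT, "1.5.1.) Ulica: 19"), heading_key(LOOSE, "1.5.1.) Ulica: 19"))
# A heading written with a bare ")" is matched only by the loosened pattern.
only_loose = (heading_key(STRICT, "1.4) Nazwa"), heading_key(LOOSE, "1.4) Nazwa"))
```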

Output layout

<FILES_STORE>/poland/<crawl_directory>/<tenderId>/ containing tender.json, documents.json, notice_meta.json, announcement.html, announcement.json, attachments/<docId>__<sanitized-filename>. Resume = re-run with same crawl_directory; skip is "if tender.json exists, don't fetch the tender again".
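
The skip rule can be sketched with pathlib, assuming the directory layout above. `tender_dir` and `already_collected` are hypothetical helper names, not functions confirmed to exist in the spider.

```python
from pathlib import Path


def tender_dir(files_store, crawl_directory, tender_id):
    """Build <FILES_STORE>/poland/<crawl_directory>/<tenderId>/."""
    return Path(files_store) / "poland" / crawl_directory / tender_id


def already_collected(files_store, crawl_directory, tender_id):
    """A tender counts as collected once its tender.json exists."""
    return (tender_dir(files_store, crawl_directory, tender_id) / "tender.json").is_file()
```

Because the check looks only at tender.json, a tender whose attachments failed after tender.json was written would also be skipped on the next run.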

Plan items deliberately not implemented (Scrapy already does it)

  • _state.json, _errors.jsonl, .done markers, urllib3.Retry, time.sleep throttle, Click CLI / __main__.py — replaced by Scrapy's RetryMiddleware (429/5xx), AutoThrottle (off by default — DOWNLOAD_DELAY in settings.py if needed), stats collection, and scrapy crawl.
  • The plan suggested adding requests + lxml to requirements.in — not needed; using Scrapy + scrapy.Selector.
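
For reference, a hypothetical settings.py excerpt that would approximate the plan's client behaviour with built-in Scrapy settings. The values are illustrative and are not taken from the repository.

```python
# Hypothetical settings.py excerpt; values mirror the plan, not the repository.

# RetryMiddleware: RETRY_TIMES counts retries in addition to the first download,
# so 4 retries gives the plan's "max 5 attempts".
RETRY_ENABLED = True
RETRY_TIMES = 4
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# Fixed courtesy delay, standing in for the plan's time.sleep(0.2).
DOWNLOAD_DELAY = 0.2

# AutoThrottle stays off unless adaptive delays turn out to be needed.
AUTOTHROTTLE_ENABLED = False
```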

Verification done

  • Spider runs cleanly: uv run scrapy crawl poland -a sample=3 → 18 requests, all 200, 5 attachments downloaded with sizes matching attachment.fileSize exactly.
  • Parser run against the plan's spot-check tender ocds-148610-3c9cfc83-… produces {key:"1.5.1", label:"Ulica", value:"19"} under section "SEKCJA I - ZAMAWIAJĄCY". 4 _unparsed entries are legitimate subsection labels like "Kryterium 1".
  • Resume verified: second run with same crawl_directory=… skipped the 3 originally-collected tenders and processed 3 newer arrivals instead.
  • uv run ruff check clean on all three files.

Commands to invoke

uv run scrapy crawl poland                              # full catalogue
uv run scrapy crawl poland -a sample=3                  # smoke test (3 tenders)
uv run scrapy crawl poland -a crawl_directory=<dir>     # resume into existing run
uv run scrapy crawl poland -s FILES_STORE=/tmp/x ...    # override output root

State

  • M generic_scrapy/filters.py
  • M generic_scrapy/spiders/poland.py
  • A generic_scrapy/parsers/__init__.py
  • A generic_scrapy/parsers/poland_announcement.py

Open questions if work continues

  • Need a unit test for parse_announcement (plan item 5 under Verification — never written; the parser was only manually validated against one saved HTML body).
  • No README update for the new spider behaviour (README.md is currently 2 lines).
  • Existing incrementalupdate command in generic_scrapy/commands/ was designed for ExportFileSpider-based date-windowing — it no longer applies to the poland spider since this one uses skip-if-exists instead of from_date/until_date.
