Plan
Download attachments and structured data from ezamowienia.gov.pl
Context
The Polish public-procurement portal at https://ezamowienia.gov.pl/mp-client/search/list lists ~22,600 procedures. The user wants, per procedure:
- The non-HTML attachments (PDFs, ZIPs, DOCX, etc. — the actual procurement documents).
- The basic information that the SPA renders (organisation, dates, terms, document index) saved as JSON.
- The announcement ("ogłoszenie", a section-numbered HTML key-value document) saved as parsed JSON, with the raw HTML kept alongside as a fallback.
The page is an Angular SPA, so all data flows over JSON APIs. None of them require authentication. The collector should iterate the full catalogue once, then re-runs should pick up only what's new.
Approach
Add a new top-level Python package ezamowienia/ (peer to netherlands_ocds_transformer/ and portland_ocds/). Single-threaded, polite, resumable bulk collector.
File layout (new)
ezamowienia/
├── __init__.py
├── README.md # one-screen usage notes
├── client.py # requests.Session w/ retry/backoff, rate-limit, helpers per endpoint
├── parse_notice.py # parse Board/GetNoticeHtmlBody → nested dict
├── collect.py # paginated list loop, per-procedure fetch+save, resume logic
└── __main__.py # `python -m ezamowienia ...` Click CLI
Output directory layout (default ./output/ezamowienia/, --output-dir overrides)
output/ezamowienia/
├── _state.json # {last_seen_object_id, last_seen_initiation_date, run_log_path}
├── _errors.jsonl # one line per failed tender (id, step, http status, msg)
└── <tenderId>/ # tenderId == objectId from SearchTenders, e.g. ocds-148610-…
    ├── tender.json # Search/GetTender response
    ├── documents.json # Search/GetTenderDocuments response
    ├── notice_meta.json # mo-board GetNoticeDetails response (or null marker if no BZP notice)
    ├── announcement.html # raw mo-board GetNoticeHtmlBody body
    ├── announcement.json # parsed key-value (see parse_notice.py)
    └── attachments/
        └── <documentId>__<sanitized-filename> # one file per non-HTML tenderDocuments[i]
API endpoints (reverse-engineered from the SPA bundle)
| Purpose | Method | URL |
| --- | --- | --- |
| Paginated list | GET | https://ezamowienia.gov.pl/mp-readmodels/api/Search/SearchTenders?Page={n}&PageSize=100&SortingColumnName=InitiationDate&SortingDirection=DESC |
| Basic info | GET | https://ezamowienia.gov.pl/mp-readmodels/api/Search/GetTender?id={tenderId} |
| Document index | GET | https://ezamowienia.gov.pl/mp-readmodels/api/Search/GetTenderDocuments?tenderId={tenderId} |
| Document download | GET | https://ezamowienia.gov.pl/mp-readmodels/api/Tender/DownloadDocument/{tenderId}/{documentId} |
| Announcement metadata | GET | https://ezamowienia.gov.pl/mo-board/api/v1/Board/GetNoticeDetails?noticeNumber={bzpNumber} |
| Announcement HTML | GET | https://ezamowienia.gov.pl/mo-board/api/v1/Board/GetNoticeHtmlBody?noticeNumber={bzpNumber} |
SearchTenders returns 200 with a JSON array and an X-Pagination header ({TotalCount, PageSize, CurrentPage, TotalPages, HasNext, HasPrevious}) — drive the loop from HasNext.
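The HasNext-driven loop could be sketched as follows. This is a minimal sketch under assumptions: the X-Pagination header value is taken to be a JSON object, and `fetch_headers` is a hypothetical callable standing in for the real SearchTenders request. A plain dict is used for headers here, so the lookup is case-sensitive, unlike the header mapping of an HTTP library.

```python
import json
from typing import Callable, Iterator, Mapping


def has_next_page(headers: Mapping[str, str]) -> bool:
    """Return True when the X-Pagination header reports another page."""
    raw = headers.get("X-Pagination")
    if not raw:
        # Assumption: a missing header means there is nothing more to fetch.
        return False
    return bool(json.loads(raw).get("HasNext", False))


def page_numbers(fetch_headers: Callable[[int], Mapping[str, str]]) -> Iterator[int]:
    """Yield page numbers from 1 for as long as HasNext stays true.

    `fetch_headers(page)` is a hypothetical callable that returns the
    response headers of SearchTenders for the given page.
    """
    page = 1
    while True:
        yield page
        if not has_next_page(fetch_headers(page)):
            return
        page += 1
```

Driving the loop from the header rather than from TotalPages means the walk still terminates correctly if tenders are added while it is running.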
Per-procedure flow (collect.py)
For each tender from the list page:
- If output/ezamowienia/<tenderId>/.done exists, skip.
- GET Search/GetTender?id=<tenderId> → write tender.json. The tenderDocuments[] array carries objectId, attachment.fileName, attachment.mimeType, attachment.fileSize.
- GET Search/GetTenderDocuments?tenderId=<tenderId> → write documents.json (independent index used to mirror the public download URLs).
- For each document where attachment.mimeType != 'text/html': stream Tender/DownloadDocument/{tenderId}/{documentId} to attachments/<documentId>__<sanitized fileName>. Use the Content-Disposition filename when present. Skip files already on disk with the expected size.
- If tender.json has a noticeNumber/bzpNumber:
  - GET Board/GetNoticeDetails → notice_meta.json
  - GET Board/GetNoticeHtmlBody → announcement.html
  - Parse → announcement.json
- Touch .done so re-runs skip cleanly.
Errors at any step append a line to _errors.jsonl and move on; missing-announcement (404) is logged but not treated as a failure.
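The attachment-selection step of the flow above might look like the sketch below. The field names (tenderDocuments, objectId, attachment.fileName, attachment.mimeType) come from the plan; the function names and the sanitisation rule are assumptions made for illustration, not a settled design.

```python
import re


def sanitize_filename(name: str, max_length: int = 150) -> str:
    """Collapse path separators, spaces and other risky characters to '_'."""
    cleaned = re.sub(r"[^\w.\-]+", "_", name).strip("._")
    # Assumption: fall back to a fixed name when nothing usable is left.
    return cleaned[:max_length] or "unnamed"


def attachment_targets(tender: dict) -> list[tuple[str, str]]:
    """Return (documentId, relative path) pairs for every non-HTML document."""
    targets = []
    for doc in tender.get("tenderDocuments", []):
        attachment = doc.get("attachment") or {}
        if attachment.get("mimeType") == "text/html":
            continue  # HTML documents are not downloaded as attachments
        doc_id = doc["objectId"]
        filename = sanitize_filename(attachment.get("fileName") or "")
        targets.append((doc_id, f"attachments/{doc_id}__{filename}"))
    return targets
```

Prefixing the file name with the document id keeps two attachments with the same fileName from overwriting each other.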
Resume / pagination (collect.py)
- Sorted DESC by InitiationDate, so newest is page 1.
- After every successful procedure, update _state.json.last_seen_object_id.
- On each new run, walk pages from 1 and stop at the first tender whose .done file exists AND whose objectId == last_seen_object_id; every tender listed before it is genuinely new (newer than the last run). Bounded by --max-pages for safety.
- A --full flag re-walks the entire catalogue regardless of state, still skipping per-procedure work where .done exists.
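The stop rule can be expressed as a small generator. This is a sketch, not the collector itself: `is_done` is a hypothetical callable that would check for the .done marker on disk, and the logic assumes that last_seen_object_id names the newest tender handled by the previous run, so that everything listed after it is older.

```python
from typing import Callable, Iterable, Iterator, Optional


def tenders_to_process(
    tenders: Iterable[dict],
    is_done: Callable[[str], bool],
    last_seen_object_id: Optional[str],
    full: bool = False,
) -> Iterator[str]:
    """Yield objectIds that still need work, in listing order (newest first)."""
    for tender in tenders:
        object_id = tender["objectId"]
        done = is_done(object_id)
        if not full and done and object_id == last_seen_object_id:
            return  # checkpoint reached: the rest of the listing is older
        if not done:
            yield object_id
```

With `full=True` the checkpoint is ignored and only the per-tender .done check applies, which matches the behaviour described for --full.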
Client behaviour (client.py)
- One requests.Session with a retry adapter (urllib3.Retry on 429/5xx, exponential backoff 1→16s, max 5 attempts).
- A small time.sleep(0.2) between requests as a courtesy throttle.
- User-Agent: ezamowienia-collector (+contact: jmckinney@open-contracting.org).
- Streaming downloads (stream=True, write to <file>.part, atomic rename).
- Helper functions: search_tenders(page), get_tender(id), get_documents(id), download_document(tender_id, doc_id, dest), get_notice_details(num), get_notice_html(num).
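The ".part then atomic rename" step of the streaming download can be illustrated with the standard library alone. In this sketch `write_atomically` is a hypothetical helper name, and `chunks` stands in for whatever the HTTP client yields while streaming (for example the result of `response.iter_content(...)` in requests).

```python
import os
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], dest: Path) -> int:
    """Write chunks to <dest>.part, then rename onto dest; return bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    part = dest.with_name(dest.name + ".part")
    written = 0
    try:
        with open(part, "wb") as handle:
            for chunk in chunks:
                if chunk:  # ignore empty keep-alive chunks
                    handle.write(chunk)
                    written += len(chunk)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(part, dest)
    except BaseException:
        # Leave no half-written .part file behind on failure or interrupt.
        part.unlink(missing_ok=True)
        raise
    return written
```

Because the final name only appears after a complete write, the "skip files already on disk with the expected size" check never sees a truncated file under its real name; the returned byte count can be compared with attachment.fileSize.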
Announcement parser (parse_notice.py)
The HTML body is plain DOM with predictable structure:
<h2 class="bg-light p-3 mt-4">SEKCJA I - ZAMAWIAJĄCY</h2>
<h3 class="mb-0">1.5.1.) Ulica: <span class="normal">19</span></h3>
<p class="mb-0">Free-text continuation</p>
Use lxml.html (already a transitive dependency — see commit 857abf2 bumping lxml to 6.1.0). Walk top-level children in order, producing:
{
"sections": [
{"title": "SEKCJA I - ZAMAWIAJĄCY",
"items": [
{"key": "1.5.1", "label": "Ulica", "value": "19", "extra": null},
{"key": "1.1", "label": "Rola zamawiającego", "value": null,
"extra": "Postępowanie prowadzone jest samodzielnie przez zamawiającego"}
]}
],
"title": "Ogłoszenie o zamówieniu — Dostawy — …",
"raw_html_path": "announcement.html"
}
Key extraction regex: ^\s*([0-9]+(?:\.[0-9]+)*)\.\)\s*(.*?)(?::\s*(.*))?$ against the <h3> text content; the <span class="normal"> text is the value when present, otherwise the post-colon remainder, otherwise None. A trailing <p> sibling becomes extra. Unrecognised h3s are kept under a "_unparsed" list rather than dropped, so the raw HTML is never the only source of truth.
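Applied to the text content of a single h3, the key extraction rule could be sketched like this. The regex is the one quoted above; `parse_heading` and its `span_value` parameter are hypothetical names, and whitespace is normalised first on the assumption that the heading text may contain line breaks.

```python
import re
from typing import Optional

HEADING_RE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)*)\.\)\s*(.*?)(?::\s*(.*))?$")


def parse_heading(text: str, span_value: Optional[str] = None) -> Optional[dict]:
    """Split an <h3> text such as '1.5.1.) Ulica: 19' into key, label, value.

    `span_value` is the text of <span class="normal"> when the heading has one.
    Returns None for headings that do not start with a section number, which
    the caller would keep under "_unparsed".
    """
    match = HEADING_RE.match(" ".join(text.split()))
    if match is None:
        return None
    key, label, remainder = match.groups()
    # Precedence from the plan: span text, then post-colon remainder, then None.
    value = span_value if span_value is not None else (remainder or None)
    return {"key": key, "label": label.strip(), "value": value, "extra": None}
```

The "extra" field is left as None here; in the plan it is filled from a trailing p sibling, which needs the DOM and is outside this sketch.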
CLI (__main__.py via Click)
python -m ezamowienia collect # full + resume
python -m ezamowienia collect --limit 5 # smoke test: first 5 procedures only
python -m ezamowienia collect --full # ignore _state.json checkpoint
python -m ezamowienia parse-notice <bzpNumber> # one-shot announcement debug helper
Default --output-dir is ./output/ezamowienia/. Add output/ to .gitignore if not already present.
Dependencies
Add to requirements.in: requests, lxml (lxml is already pinned transitively but make it explicit). click is already there.
Critical files
- New: ezamowienia/__init__.py, ezamowienia/client.py, ezamowienia/parse_notice.py, ezamowienia/collect.py, ezamowienia/__main__.py, ezamowienia/README.md.
- Modify: requirements.in (add requests, lxml), .gitignore (add output/), requirements.txt (regenerate with uv pip compile).
Verification
- Smoke run: python -m ezamowienia collect --limit 3 --output-dir /tmp/ezam-test/. Confirm:
  - 3 directories created under /tmp/ezam-test/ezamowienia/.
  - Each has tender.json, documents.json, notice_meta.json, announcement.html, announcement.json, attachments/ with the right number of non-HTML files.
  - File sizes match attachment.fileSize from tender.json.
  - announcement.json has a non-empty sections[] and no items in _unparsed for at least one tender.
- Resume check: re-run the same command. Expect zero new HTTP requests for those three tenders (skipped via .done); confirm by tailing logs.
- Spot-check known tender from this exploration: ocds-148610-3c9cfc83-d86a-4b43-8b0f-fcd455f7cf54 / 2026/BZP 00227455/01. Should yield 4 attachments (1 ZIP + 3 PDFs) and an announcement with section "SEKCJA I - ZAMAWIAJĄCY" → "1.5.1 Ulica: 19".
- Fault-injection: pick a tender that lacks a BZP notice (rare but possible: pre-publication state). Confirm it ends with .done written and a warning row in _errors.jsonl rather than aborting.
- Unit test the parser with the saved announcement.html from the spot-check (round-trip parse, assert ≥ one expected key/value).
Prompt to resume work
Resume context — Polish tender collector (Scrapy)
Goal: Implement the plan at /Users/james/.claude/plans/i-want-to-download-golden-giraffe.md (originally written for a standalone Python package ezamowienia/) inside this Scrapy repo, updating the existing poland spider.
Files changed
generic_scrapy/spiders/poland.py — full rewrite. Replaces the old mo-board notices → CSV flow with: SearchTenders (paginate) → GetTender (per tender) → fan-out to DownloadDocument (non-HTML attachments), GetTenderDocuments, and Board/GetNoticeDetails → Board/GetNoticeHtmlBody (parsed via parser module). Inherits BaseSpider, not ExportFileSpider (no aggregate output any more).
generic_scrapy/parsers/poland_announcement.py (new) + generic_scrapy/parsers/__init__.py (new empty). parse_announcement(html) walks h2/h3/p elements with regex ^\s*([0-9]+(?:\.[0-9]+)*)\.?\)\s*(.*?)(?::\s*(.*))?$ (loosened from plan's \.\) to also catch 1.4)). Returns {title, sections:[{title, items:[{key,label,value,extra}]}], _unparsed}. Uses scrapy.Selector — no new dep.
generic_scrapy/filters.py — deleted PolandNoticeFilter and PolandContractorFilter (unused after CSV outputs dropped).
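The effect of loosening the pattern can be checked directly. The two regexes below are the ones quoted in the plan and in the parser note; `key_of` and the sample heading texts other than "1.5.1.) Ulica: 19" and "Kryterium 1" are made up for illustration.

```python
import re
from typing import Optional

# Plan version: requires ".)" after the section number.
STRICT = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)*)\.\)\s*(.*?)(?::\s*(.*))?$")
# Implemented version: the dot before ")" is optional.
LOOSE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)*)\.?\)\s*(.*?)(?::\s*(.*))?$")


def key_of(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the section key matched by pattern, or None when it does not match."""
    match = pattern.match(text)
    return match.group(1) if match else None
```

Headings without a leading number, such as "Kryterium 1", fail both patterns, which is why they still end up under _unparsed.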
Output layout
<FILES_STORE>/poland/<crawl_directory>/<tenderId>/ containing tender.json, documents.json, notice_meta.json, announcement.html, announcement.json, attachments/<docId>__<sanitized-filename>. Resume = re-run with same crawl_directory; skip is "if tender.json exists, don't fetch the tender again".
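The skip-if-exists rule amounts to a path check. In the sketch below the directory structure follows the layout just described, while `tender_dir` and `should_fetch` are hypothetical helper names and not necessarily how the spider is organised.

```python
from pathlib import Path


def tender_dir(files_store: str, crawl_directory: str, tender_id: str) -> Path:
    """Build <FILES_STORE>/poland/<crawl_directory>/<tenderId>/."""
    return Path(files_store) / "poland" / crawl_directory / tender_id


def should_fetch(files_store: str, crawl_directory: str, tender_id: str) -> bool:
    """Return False when tender.json already exists for this tender."""
    return not (tender_dir(files_store, crawl_directory, tender_id) / "tender.json").exists()
```

Since tender.json is the marker, a tender whose attachments failed after tender.json was written would be skipped on the next run; whether that is acceptable is a design choice this sketch does not settle.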
Plan items deliberately not implemented (Scrapy already does it)
- _state.json, _errors.jsonl, .done markers, urllib3.Retry, time.sleep throttle, Click CLI / __main__.py: replaced by Scrapy's RetryMiddleware (429/5xx), AutoThrottle (off by default; set DOWNLOAD_DELAY in settings.py if needed), stats collection, and scrapy crawl.
- The plan suggested adding requests + lxml to requirements.in — not needed; using Scrapy + scrapy.Selector.
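If the plan's client behaviour needs to be approximated in Scrapy, the settings involved might look like this. The setting names are standard Scrapy settings; the values are assumptions chosen to mirror the plan (5 attempts, 0.2 s between requests) and are not taken from the repository's settings.py.

```python
# Hypothetical per-spider overrides, e.g. assigned to a spider's
# `custom_settings` class attribute or placed in settings.py.
CUSTOM_SETTINGS = {
    "RETRY_ENABLED": True,
    # RETRY_TIMES counts retries after the first attempt: 4 retries = 5 attempts.
    "RETRY_TIMES": 4,
    "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
    # Fixed courtesy delay in seconds, standing in for time.sleep(0.2).
    "DOWNLOAD_DELAY": 0.2,
    # AutoThrottle stays off, matching Scrapy's default.
    "AUTOTHROTTLE_ENABLED": False,
}
```

Unlike the plan's urllib3.Retry configuration, Scrapy's retry middleware does not apply exponential backoff on its own, so the delay setting is the main politeness control here.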
Verification done
- Spider runs cleanly: uv run scrapy crawl poland -a sample=3 → 18 requests, all 200, 5 attachments downloaded with sizes matching attachment.fileSize exactly.
- Parser run against the plan's spot-check tender ocds-148610-3c9cfc83-… produces {key:"1.5.1", label:"Ulica", value:"19"} under section "SEKCJA I - ZAMAWIAJĄCY". 4 _unparsed entries are legitimate subsection labels like "Kryterium 1".
- Resume verified: second run with same crawl_directory=… skipped the 3 originally-collected tenders and processed 3 newer arrivals instead.
- uv run ruff check clean on all three files.
Commands to invoke
uv run scrapy crawl poland # full catalogue
uv run scrapy crawl poland -a sample=3 # smoke test (3 tenders)
uv run scrapy crawl poland -a crawl_directory=<dir> # resume into existing run
uv run scrapy crawl poland -s FILES_STORE=/tmp/x ... # override output root
State
- M generic_scrapy/filters.py
- M generic_scrapy/spiders/poland.py
- A generic_scrapy/parsers/__init__.py
- A generic_scrapy/parsers/poland_announcement.py
Open questions if work continues
- Need a unit test for parse_announcement (plan item 5 under Verification; it was never written, and the parser was only manually validated against one saved HTML body).
- No README update for the new spider behaviour (README.md is currently 2 lines).
- Existing incrementalupdate command in generic_scrapy/commands/ was designed for ExportFileSpider-based date-windowing; it no longer applies to the poland spider since this one uses skip-if-exists instead of from_date/until_date.
cc @yolile @allakulov
Unreviewed and untested (but potentially working) code at https://github.com/open-contracting/collect-generic/tree/poland-documents