Add CPA per-year bulk download (download_cpa_file) once OECD fixes the malformed dotStat files
Follow-up to #38 (CPA reader). The CPA API reader download_cpa() shipped in #38. The per-year bulk
path is blocked on an upstream OECD data bug, so download_cpa_file() is intentionally not
included yet.
Why it's blocked
The per-year "CRS CPA <year> (dotStat format)" .txt bulk files (dataflow
OECD.DCD.FSD:DSD_CPA@DF_CRS_CPA) are malformed: a large fraction of rows have more
pipe-delimited fields than the 49-column header declares, so they can't be parsed against their own
header. The same records via the SDMX API are clean.
Affected-row counts (rows whose field count ≠ 49):
| Year | Rows | Ragged % | Rows with >49 non-empty fields |
|------|------|----------|--------------------------------|
| 2023 | 251,992 | 69% | 47,345 |
| 2022 | 233,467 | 48% | 25,704 |
| 2021 | 252,976 | 33% | 24,469 |
| 2020 | 188,902 | 35% | 12,837 |
| 2015 | 152,954 | 0% | 0 |
Tens of thousands of rows per year have more than 49 non-empty fields, which is structurally impossible for a
49-column row, and the API confirms that the real text fields contain zero | characters, so the extra
delimiters are spurious. An extra (empty or duplicated-text) field is injected mid-row, shifting all
later columns, so the original row cannot be deterministically recovered. This has been reported to the OECD.
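For reference, the counts above can be reproduced with a check along these lines. This is a minimal sketch, not the script that produced the table: `ragged_stats` is a hypothetical helper, and it assumes the first line is the header, fields are split on a bare `|`, and there is no quoting or escaping to honour.

```python
EXPECTED_FIELDS = 49  # header width reported in this issue


def ragged_stats(lines, expected=EXPECTED_FIELDS, sep="|"):
    """Return (total_rows, ragged_rows, overfull_rows) for the data rows.

    ragged_rows:   rows whose field count differs from `expected`
    overfull_rows: rows with more than `expected` non-empty fields
    Assumption: naive split on `sep`, no quoting or escaping.
    """
    it = iter(lines)
    next(it, None)  # skip the header line
    total = ragged = overfull = 0
    for line in it:
        fields = line.rstrip("\r\n").split(sep)
        total += 1
        if len(fields) != expected:
            ragged += 1
        if sum(1 for f in fields if f.strip()) > expected:
            overfull += 1
    return total, ragged, overfull


def ragged_percent(total, ragged):
    """Share of ragged rows as a percentage (0.0 for an empty file)."""
    return 100.0 * ragged / total if total else 0.0
```

Any iterable of lines works, including an open file handle for one of the per-year .txt files. The same check could be reused for the re-verification step once corrected files are published.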
Implementation once the upstream files are fixed
Small — clone the CRS bulk path (crs.py download_crs_file / get_year_crs_zip_id):
- Add download_cpa_file(year, save_to_path=None, *, as_iterator=False, use_raw_cache=True) to src/oda_reader/cpa.py, plus get_year_cpa_zip_id(year) using search_string=f"CRS CPA {year} (dotStat format)" and CPA_FLOW_URL (= BASE_DATAFLOW + "DSD_CPA@DF_CRS_CPA/").
- Export download_cpa_file from __init__.py (__all__).
- Bulk is per-year only (2010–2024); there is no all-years/full-dataset CPA bulk file, so no bulk_download_cpa().
- Add a unit test (mock bulk_download_parquet) + an integration test, and a README "Per-year CPA files" subsection.
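As a small illustration of the per-year constraint, the search string could be built behind a year check. This is only a sketch: `CPA_BULK_YEARS` and `cpa_search_string` are hypothetical names that do not exist in the package, and rejecting out-of-range years up front is an assumption rather than something the CRS path is known to do. Only the search-string format and the 2010–2024 range come from this issue.

```python
# Hypothetical constant: years with a per-year CPA bulk file, per this issue.
CPA_BULK_YEARS = range(2010, 2025)  # 2010 to 2024 inclusive


def cpa_search_string(year: int) -> str:
    """Build the title used to locate the per-year CPA bulk file.

    Raises ValueError for years outside the published range, since there
    is no all-years CPA bulk file to fall back on.
    """
    if year not in CPA_BULK_YEARS:
        raise ValueError(
            f"No per-year CPA bulk file for {year}; "
            f"expected {CPA_BULK_YEARS.start}-{CPA_BULK_YEARS.stop - 1}."
        )
    return f"CRS CPA {year} (dotStat format)"
```

get_year_cpa_zip_id(year) could pass the result as its search_string, mirroring whatever get_year_crs_zip_id does today.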
Acceptance
- download_cpa_file(<year>) returns/saves a year of CPA data and parses cleanly (no ragged-row errors) once OECD has corrected the files.
- Re-verify raggedness is 0% on the corrected files before implementing.