Add CPA per-year bulk download (download_cpa_file) once OECD fixes malformed dotStat files #39

Description

@jm-rivera

Add CPA per-year bulk download (download_cpa_file) once OECD fixes the malformed dotStat files

Follow-up to #38 (CPA reader). The CPA API reader download_cpa() shipped in #38. The per-year bulk
path is blocked on an upstream OECD data bug, so download_cpa_file() is intentionally not included
yet.

Why it's blocked

The per-year "CRS CPA <year> (dotStat format)" .txt bulk files (dataflow
OECD.DCD.FSD:DSD_CPA@DF_CRS_CPA) are malformed: a large fraction of rows have more
pipe-delimited fields than the 49-column header declares, so they can't be parsed against their own
header. The same records via the SDMX API are clean.

Affected-row counts (rows whose field count ≠ 49):

Year  Rows     Ragged %  Rows with >49 non-empty fields
2023  251,992  69%       47,345
2022  233,467  48%       25,704
2021  252,976  33%       24,469
2020  188,902  35%       12,837
2015  152,954  0%        0

Tens of thousands of rows per year have more than 49 non-empty fields, which is structurally
impossible for a 49-column row. The API confirms that the real text fields contain zero |
characters, so the extra delimiters are spurious. An extra (empty or duplicated-text) field is
injected mid-row, shifting all later columns. The original column alignment is not
deterministically recoverable. This has been reported to the OECD.
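
The two counts in the table above can be reproduced with a check along these lines. This is a minimal sketch, not the script used for the numbers in this issue. It assumes a naive split on | is adequate (no quoted fields), and the function name ragged_stats and its return keys are made up for illustration.

```python
import io
from typing import Dict, TextIO


def ragged_stats(lines: TextIO, delimiter: str = "|") -> Dict[str, float]:
    """Compare every data row's field count with the header's field count."""
    header = next(lines, None)
    if header is None:
        raise ValueError("input is empty, no header row found")
    expected = len(header.rstrip("\r\n").split(delimiter))

    rows = ragged = overfull = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split(delimiter)
        rows += 1
        # Ragged: field count differs from what the header declares.
        if len(fields) != expected:
            ragged += 1
        # Overfull: more non-empty fields than columns, so the extras
        # cannot be explained by trailing empty padding.
        if sum(1 for f in fields if f.strip()) > expected:
            overfull += 1

    pct = 100.0 * ragged / rows if rows else 0.0
    return {
        "expected_fields": expected,
        "rows": rows,
        "ragged": ragged,
        "ragged_pct": pct,
        "overfull": overfull,
    }


# Tiny in-memory example with a 3-column header.
sample = "a|b|c\n1|2|3\n1|2|x|3\n1||3\n1|2||\n"
print(ragged_stats(io.StringIO(sample)))
```

For a real file, pass an open text handle instead of the StringIO object. The encoding of the OECD files is not stated in this issue, so it would need to be checked first.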

Implementation once the upstream files are fixed

This is a small change: clone the CRS bulk path (download_crs_file / get_year_crs_zip_id in crs.py):

  • Add download_cpa_file(year, save_to_path=None, *, as_iterator=False, use_raw_cache=True) to
    src/oda_reader/cpa.py, plus get_year_cpa_zip_id(year) using
    search_string=f"CRS CPA {year} (dotStat format)" and CPA_FLOW_URL (=
    BASE_DATAFLOW + "DSD_CPA@DF_CRS_CPA/").
  • Export download_cpa_file from __init__.py (__all__).
  • Bulk is per-year only (2010–2024); there is no all-years/full-dataset CPA bulk file, so no
    bulk_download_cpa().
  • Add a unit test (mock bulk_download_parquet) + an integration test, and a README "Per-year CPA
    files" subsection.
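
As a rough sketch of the per-year pieces listed above, the search string and year guard could look like the following. It deliberately does not call any oda_reader internals. The BASE_DATAFLOW value is a placeholder rather than the package's real constant, and cpa_bulk_search_string, CPA_FIRST_YEAR and CPA_LAST_YEAR are hypothetical names used only for illustration.

```python
# Year range taken from this issue: bulk files exist per year for 2010-2024.
CPA_FIRST_YEAR = 2010
CPA_LAST_YEAR = 2024

# Placeholder only: the real BASE_DATAFLOW is defined in oda_reader
# and is not reproduced here.
BASE_DATAFLOW = "https://example.invalid/dataflow/"
CPA_FLOW_URL = BASE_DATAFLOW + "DSD_CPA@DF_CRS_CPA/"


def cpa_bulk_search_string(year: int) -> str:
    """Build the label used to locate one year's CPA bulk file."""
    if not CPA_FIRST_YEAR <= year <= CPA_LAST_YEAR:
        raise ValueError(
            f"CPA bulk files are per-year only "
            f"({CPA_FIRST_YEAR}-{CPA_LAST_YEAR}); got {year}"
        )
    return f"CRS CPA {year} (dotStat format)"


print(cpa_bulk_search_string(2023))
```

Rejecting out-of-range years up front is one possible design choice. The actual get_year_cpa_zip_id(year) may instead let the lookup fail when no matching file is found.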

Acceptance

  • download_cpa_file(<year>) returns/saves a year of CPA data and parses cleanly (no ragged-row
    errors) once OECD has corrected the files.
  • Re-verify raggedness is 0% on the corrected files before implementing.
