Fix and improve bulk ZIP creation by simon-20 · Pull Request #113 · IATI/bulk-data-service

simon-20 · 2025-10-27T08:28:52Z

This PR does two things, both related to the creation of the bulk ZIP files.

First, it fixes issue #112, where some datasets that had been updated were not being updated in the bulk ZIP file. (For example, when a dataset's short name was changed, the dataset was not updated in the ZIP file, so it retained the old name.)
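The description does not say how the stale entries are removed. As a minimal sketch of one common approach (an assumption, not necessarily what this PR does), files in the working directory that no longer match any current dataset name can be deleted before zipping. The function name `remove_stale_files` and the relative-path layout are hypothetical.

```python
from pathlib import Path


def remove_stale_files(working_dir: Path, expected: set[str]) -> list[str]:
    """Delete XML files whose relative path is not in `expected`.

    `expected` holds POSIX-style paths relative to working_dir, one per
    current dataset. Hypothetical helper, not taken from the PR.
    """
    removed = []
    for path in sorted(working_dir.rglob("*.xml")):
        rel = path.relative_to(working_dir).as_posix()
        if rel not in expected:
            # A renamed dataset leaves its old file behind; drop it so the
            # ZIP only contains files under their current names.
            path.unlink()
            removed.append(rel)
    return removed
```

With this in place, a dataset whose short name changes is written under its new name and the file under the old name is dropped, rather than lingering in the ZIP.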

Second, to try to address #110, the code now attempts to verify the integrity of each newly created ZIP file by unpacking it to disk. If any error is encountered, it wipes the working directory, re-copies the XML files, and tries once more.
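The verify-then-retry flow above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the function names, the `populate` callback (standing in for "re-copy the XML files"), and the choice of which exceptions count as a failed unpack are all hypothetical.

```python
import shutil
import zipfile
import zlib
from pathlib import Path
from typing import Callable


def zip_directory(source_dir: Path, zip_path: Path) -> None:
    """Write every file under source_dir into a new ZIP at zip_path."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(source_dir).as_posix())


def verify_zip(zip_path: Path, unpack_dir: Path) -> bool:
    """Return True if the ZIP unpacks to disk without raising an error."""
    try:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(unpack_dir)
        return True
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        return False
    finally:
        # The unpacked copy is only needed for the check itself.
        shutil.rmtree(unpack_dir, ignore_errors=True)


def create_verified_zip(
    populate: Callable[[Path], None],
    working_dir: Path,
    zip_path: Path,
    unpack_dir: Path,
    attempts: int = 2,
) -> bool:
    """Build the ZIP and verify it; on failure wipe, re-copy and retry."""
    for attempt in range(attempts):
        if attempt > 0:
            # Previous attempt failed verification: start from a clean slate.
            shutil.rmtree(working_dir, ignore_errors=True)
        working_dir.mkdir(parents=True, exist_ok=True)
        populate(working_dir)
        zip_directory(working_dir, zip_path)
        if verify_zip(zip_path, unpack_dir):
            return True
    return False
```

In this sketch `zip_path` is assumed to live outside `working_dir`, so the archive never ends up containing itself.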

This second alteration required changing how the ZIP creation worked. Previously, all the available XML files were downloaded into a main ZIP working directory. Then, for each type of ZIP created, a copy of that directory was taken, and any required modifications to the directory names (e.g., for Code for IATI, iati-data -> iati-data-main) and any extra metadata files were applied to the copy. These directories were then left in place and just updated as needed, to save copying ~20 GB to/from disk every 20 minutes.
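For illustration, the per-type copy step might look something like the sketch below. The function name, the `renames` mapping and the `extra_files` mapping are assumptions made for the example; only the iati-data -> iati-data-main rename comes from the description above.

```python
import shutil
from pathlib import Path


def prepare_zip_working_dir(
    main_dir: Path,
    type_dir: Path,
    renames: dict[str, str],
    extra_files: dict[str, str],
) -> None:
    """Copy main_dir to type_dir, rename top-level dirs, add extra files."""
    # Start from a fresh copy of the main working directory.
    shutil.rmtree(type_dir, ignore_errors=True)
    shutil.copytree(main_dir, type_dir)

    # Apply directory renames specific to this ZIP type.
    for old_name, new_name in renames.items():
        source = type_dir / old_name
        if source.is_dir():
            source.rename(type_dir / new_name)

    # Write any extra metadata files (relative path -> text content).
    for relative_path, content in extra_files.items():
        target = type_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
```

A Code for IATI style ZIP would then be prepared with `renames={"iati-data": "iati-data-main"}`; the metadata file names passed in `extra_files` are whatever that ZIP type needs.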

However, Azure Container Instances only provide ephemeral disk space of at most ~50 GB, and this is not configurable. Given that we now unpack the ZIPs to verify them, these files cannot be left in place. So after each ZIP is created and uploaded to Azure, the working directory for that type of ZIP is cleared, which leaves enough space for the next ZIP creation to verify by unpacking.
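A rough sketch of that build, upload, clear cycle is shown below. The `build` and `upload` callbacks and the function name are hypothetical stand-ins; in particular, `upload` is a placeholder for whatever performs the Azure upload. Clearing in a `finally` block is a design choice made for this sketch (so disk space is also reclaimed if a step fails), not something the description states.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable


def build_and_upload_all(
    zip_types: Iterable[str],
    build: Callable[[str, Path], Path],
    upload: Callable[[Path], None],
    work_root: Path,
) -> None:
    """Process one ZIP type at a time so only one working copy is on disk."""
    for zip_type in zip_types:
        working_dir = work_root / zip_type
        working_dir.mkdir(parents=True, exist_ok=True)
        try:
            # build() creates (and verifies) the ZIP, returning its path.
            zip_path = build(zip_type, working_dir)
            upload(zip_path)
        finally:
            # Free the space before the next ZIP type is built and unpacked.
            shutil.rmtree(working_dir, ignore_errors=True)
```

The trade-off is the one described above: each run re-copies the data instead of reusing directories left in place, in exchange for staying within the disk limit.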

Bjwebb

Have you checked that Python's zipfile throws an error for the bad ZIP we had? If so, this all looks good.

simon-20 · 2025-10-28T09:05:20Z

> Have you checked that Python's zipfile throws an error for the bad ZIP we had? If so, this all looks good.

Yes, it throws an error very similar to that which unzip throws, so I think it's picking up the same problem.
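As a small illustration of the kind of check being discussed, the sketch below shows Python's `zipfile` rejecting a damaged archive. The nature of the fault in the actual bad ZIP is not described in this thread, so truncation is used here purely as a stand-in; the helper names are hypothetical.

```python
import io
import zipfile


def build_zip_bytes() -> bytes:
    """Return the bytes of a small, valid in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("iati-data/example.xml", "<iati-activities/>" * 100)
    return buffer.getvalue()


def opens_cleanly(data: bytes) -> bool:
    """Return True if zipfile can open the archive and all CRCs match."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() reads every member and returns the first bad name,
            # or None when all members pass their CRC check.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


good = build_zip_bytes()
truncated = good[: len(good) // 2]  # cuts off the central directory

print(opens_cleanly(good))
print(opens_cleanly(truncated))
```

Other kinds of damage can surface as different exceptions during extraction, which is why a verification step may want to treat any error while unpacking as a failure.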

simon-20 added 7 commits October 27, 2025 08:16

test: new fixtures and data files for test framework

23379f9

test: check ZIP contents, check ZIP verification

ce60225

chore: improve log messages

149eed4

More specific log messages to help when debugging.

feat: verify ZIP creation

5d0c5e9

This commit fixes #112 and should help to resolve #110. The code now verifies the created ZIP by unpacking it once it's created. If the unpack fails, it wipes the ZIP working directory and then retries.

docs: small fix to README, updated CHANGELOG

82d0372

build: version bump

f2f1b31

fix: set correct access level at container level

e915338

simon-20 requested a review from Bjwebb October 27, 2025 08:39

Bjwebb approved these changes Oct 27, 2025

simon-20 merged commit 7a9e87a into develop Oct 28, 2025
1 check passed

simon-20 deleted the sk-improve-zip-creation branch November 30, 2025 10:53
