Fix and improve bulk ZIP creation #113

Merged
simon-20 merged 7 commits into develop from sk-improve-zip-creation on Oct 28, 2025

@simon-20
Contributor

This PR does two things, both related to the creation of the bulk ZIP files.

First, it fixes issue #112, in which some datasets that had been updated were not refreshed in the bulk ZIP file. (For example, when a dataset's short name was changed, the dataset was not updated in the ZIP file, so it kept the old name.)

Second, to try to address #110, the code now attempts to verify the integrity of each ZIP file it creates by unpacking it to disk. If any error is encountered, it wipes the working directory, re-copies the XML files, and tries once more.
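
As a rough illustration of the verify-and-retry flow described above, here is a minimal sketch using Python's standard `zipfile`, `shutil` and `tempfile` modules. It is not the code from this PR: the function names (`build_zip`, `zip_unpacks_cleanly`, `create_verified_zip`) and the directory arguments are hypothetical, and it assumes the ZIP file is written outside the working directory.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def build_zip(source_dir: Path, zip_path: Path) -> None:
    """Hypothetical helper: add every file under source_dir to zip_path."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir, with forward slashes.
                zf.write(path, path.relative_to(source_dir).as_posix())


def zip_unpacks_cleanly(zip_path: Path) -> bool:
    """Return True if the whole archive extracts to a scratch directory."""
    try:
        with tempfile.TemporaryDirectory() as scratch:
            with zipfile.ZipFile(zip_path) as zf:
                zf.extractall(scratch)
        return True
    except Exception:  # assumption: any error counts as a failed verification
        return False


def create_verified_zip(xml_dir: Path, work_dir: Path, zip_path: Path) -> bool:
    """Build the ZIP, verify it, and retry once from a fresh copy on failure."""
    if not work_dir.exists():
        shutil.copytree(xml_dir, work_dir)
    build_zip(work_dir, zip_path)
    if zip_unpacks_cleanly(zip_path):
        return True
    # Verification failed: wipe the working directory, re-copy the XML
    # files, and try exactly once more.
    shutil.rmtree(work_dir, ignore_errors=True)
    shutil.copytree(xml_dir, work_dir)
    build_zip(work_dir, zip_path)
    return zip_unpacks_cleanly(zip_path)
```

Unpacking to a scratch directory needs roughly as much free disk space as the uncompressed data, which is why the working directories have to be handled differently, as explained below.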

This second alteration required changing how the ZIP creation worked. Previously, all the available XML files were downloaded into a main ZIP working directory. Then, for each type of ZIP created, a copy of that directory was taken, and any required changes to the directory names (e.g., for Code for IATI, iati-data -> iati-data-main) and any extra metadata files were applied to that copy. These directories were then left in place and just updated as needed, to save copying ~20 GB to/from disk every 20 minutes.

However, Azure Container Instances only provide ephemeral disk space of at most ~50 GB, and this is not configurable. Given that we now unpack the ZIPs to verify them, these per-type working directories can no longer be left in place. So after each ZIP is created and uploaded to Azure, the working directory for that type of ZIP is cleared, which leaves enough space for the next ZIP to be created and verified by unpacking.
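
The per-type lifecycle described above (copy, rename the top-level directory, zip, upload, clear) could be sketched roughly as follows. Again, this is an illustration under stated assumptions, not the code in this PR: `build_upload_and_clear`, the `zip_type` label and the `upload` callback (standing in for the Azure upload) are all hypothetical names, and the metadata and verification steps are omitted for brevity.

```python
import shutil
from pathlib import Path
from typing import Callable


def build_upload_and_clear(
    master_dir: Path,
    work_root: Path,
    zip_type: str,
    top_level_name: str,
    upload: Callable[[Path], None],
) -> None:
    """Stage one ZIP type, archive it, upload it, then free the disk space."""
    type_dir = work_root / zip_type
    archive = work_root / f"{zip_type}.zip"
    try:
        # Copy the main XML directory under the name this ZIP type needs,
        # e.g. iati-data-main instead of iati-data.
        shutil.copytree(master_dir, type_dir / top_level_name)
        # make_archive appends ".zip" to the base name it is given.
        shutil.make_archive(str(work_root / zip_type), "zip", root_dir=type_dir)
        upload(archive)  # hypothetical stand-in for the upload to Azure
    finally:
        # Clear everything for this ZIP type so the next one has room
        # to be built and unpacked for verification.
        shutil.rmtree(type_dir, ignore_errors=True)
        archive.unlink(missing_ok=True)
```

Doing the clean-up in a `finally` block means the space is released even if the upload raises, at the cost of re-copying the XML files for every ZIP type on every run.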

More specific log messages to help when debugging.
This commit fixes #112 and should help to resolve #110.
The code now verifies the created ZIP by unpacking it
once it's created. If the unpack fails, it wipes the ZIP
working directory and then retries.
@simon-20 simon-20 requested a review from Bjwebb October 27, 2025 08:39

@Bjwebb Bjwebb left a comment

Have you checked that Python's zipfile throws an error for the bad zip we had? If so this all looks good.

@simon-20
Contributor Author

> Have you checked that Python's zipfile throws an error for the bad zip we had? If so this all looks good.

Yes, it throws an error very similar to that which unzip throws, so I think it's picking up the same problem.

@simon-20 simon-20 merged commit 7a9e87a into develop Oct 28, 2025
1 check passed
@simon-20 simon-20 deleted the sk-improve-zip-creation branch November 30, 2025 10:53