Fix and improve bulk ZIP creation #113
Merged
More specific log messages to help when debugging.
Bjwebb
approved these changes
Oct 27, 2025
Bjwebb
left a comment
Have you checked that Python's zipfile throws an error for the bad ZIP we had? If so, this all looks good.
Contributor
Author
Yes, it throws an error very similar to that which |
This PR does two things, both related to the creation of the bulk ZIP files.
First, it fixes issue #112, in which some datasets that had been updated were not being refreshed in the bulk ZIP file. (For example, when a dataset's short name was changed, the dataset was not updated in the ZIP file, so it kept the old name.)
Second, to try to address #110, the code now attempts to verify the integrity of each ZIP file it creates by unpacking it to disk. If any error is encountered, it wipes the working directory, re-copies the XML files, and tries once more.
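The verify-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the names `zip_unpacks_cleanly` and `create_verified_zip`, the directory arguments, and the use of a temporary directory for unpacking are all hypothetical. It relies only on the standard-library `zipfile`, `shutil`, and `tempfile` modules.

```python
import shutil
import tempfile
import zipfile
import zlib
from pathlib import Path


def zip_unpacks_cleanly(zip_path: Path) -> bool:
    """Return True if every member of the archive extracts to disk without error."""
    try:
        with tempfile.TemporaryDirectory() as scratch:
            with zipfile.ZipFile(zip_path) as zf:
                # Extraction reads every member in full, so a CRC mismatch
                # raises BadZipFile and a corrupt deflate stream raises zlib.error.
                zf.extractall(scratch)
        return True
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        return False


def create_verified_zip(source_dir: Path, work_dir: Path, zip_path: Path) -> bool:
    """Build zip_path from source_dir via work_dir; retry once if verification fails.

    Hypothetical helper: zip_path is assumed to live outside work_dir.
    """
    for attempt in range(2):
        if attempt > 0:
            # Retry path: wipe the working directory so the copy starts clean.
            shutil.rmtree(work_dir, ignore_errors=True)
        shutil.copytree(source_dir, work_dir, dirs_exist_ok=True)

        zip_path.unlink(missing_ok=True)
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in sorted(work_dir.rglob("*")):
                if path.is_file():
                    zf.write(path, path.relative_to(work_dir))

        if zip_unpacks_cleanly(zip_path):
            return True
    return False
```

Unpacking to disk costs extra space and time compared with an in-memory check such as `ZipFile.testzip()`, but it exercises the same path a consumer of the ZIP would take.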
This second alteration required changing how the ZIP creation worked. Previously, all the available XML files were downloaded into a main ZIP working directory. Then, for each type of ZIP created, a copy of that directory was taken, in which any required changes to directory names (e.g., for Code for IATI, iati-data -> iati-data-main) were made and any extra metadata files were placed. These directories were then left in place and just updated as needed, to save copying ~20 GB to and from disk every 20 minutes.
However, Azure Container Instances provide at most ~50 GB of ephemeral disk space, and this is not configurable. Given that we now unpack the ZIPs to verify them, these directories can no longer be left in place. So after each ZIP is created and uploaded to Azure, the working directory for that type of ZIP is cleared, which leaves enough space for the next ZIP to be created and verified by unpacking.
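The build, upload, then clear sequence for each ZIP type could look something like the sketch below. It is illustrative only: `build_all_variants` and its `build_and_verify` and `upload` callables are hypothetical stand-ins (the real upload goes to Azure, which is not modelled here), and the sketch assumes each ZIP is written inside its own working directory so that clearing the directory also frees the space used by the ZIP.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable


def build_all_variants(
    variants: Iterable[str],
    work_root: Path,
    build_and_verify: Callable[[str, Path], Path],
    upload: Callable[[Path], None],
) -> None:
    """Process ZIP types one at a time so only one working copy is on disk."""
    for name in variants:
        work_dir = work_root / name
        try:
            # Hypothetical callable: copies the XML files into work_dir,
            # creates the ZIP there, verifies it, and returns its path.
            zip_path = build_and_verify(name, work_dir)
            upload(zip_path)
        finally:
            # Free the ephemeral disk before the next ZIP type is built,
            # whether or not this one succeeded.
            shutil.rmtree(work_dir, ignore_errors=True)
```

Clearing in a `finally` block is a design choice of this sketch: it keeps a failed build from leaving a large directory behind and starving the next one of space.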