Google Summer of Code 2025: Streaming Zarr from Cloud Storage (Ice Chunk + Zarr 3) #29

peterdudfield · 2025-02-27T21:35:18Z

peterdudfield
Feb 27, 2025
Maintainer

This space is for you to ask any questions you have about this project. We're here to provide clarifications and help you understand the project's goals, scope, and requirements. Feel free to ask about anything that interests you!

Please note that this discussion is for questions and clarifications, not for formal applications.

Project Description

We use a large amount of satellite and Numerical Weather Prediction (NWP) data, all saved in Zarr format, for training our ML models. We normally train from a local copy rather than from the cloud. We would like to explore using Ice Chunk and Zarr 3. It would be great to create a benchmark for training our PVNet model with data in cloud storage (using a modern stack like Ice Chunk + Zarr 3). We could then use this to measure speed and compare it to training using data on disk.

Expected Outcome

A quantitative comparison of training speed with and without these new tools.
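As a rough illustration of what such a comparison could measure, here is a minimal sketch that times how long a training loop waits for each batch from two data sources. It is not part of the project. `fake_local_loader` and `fake_cloud_loader` are hypothetical stand-ins that only sleep, and real loaders (for example ones built on ocf-data-sampler) would be passed in their place.

```python
import time
from typing import Callable, Iterable, Iterator


def time_batches(make_loader: Callable[[], Iterable], n_batches: int) -> list[float]:
    """Return the wall-clock seconds spent waiting for each of the first n_batches."""
    waits: list[float] = []
    batches: Iterator = iter(make_loader())
    for _ in range(n_batches):
        start = time.perf_counter()
        try:
            next(batches)
        except StopIteration:
            break  # loader ran out before n_batches were produced
        waits.append(time.perf_counter() - start)
    return waits


def compare(local_waits: list[float], cloud_waits: list[float]) -> dict[str, float]:
    """Summarise throughput for both sources and the relative slowdown."""
    local_total = sum(local_waits)
    cloud_total = sum(cloud_waits)
    return {
        "local_batches_per_s": len(local_waits) / local_total,
        "cloud_batches_per_s": len(cloud_waits) / cloud_total,
        "cloud_over_local_time": cloud_total / local_total,
    }


# Hypothetical stand-ins: each "batch" just sleeps to imitate read latency.
def fake_local_loader():
    for i in range(10):
        time.sleep(0.002)
        yield i


def fake_cloud_loader():
    for i in range(10):
        time.sleep(0.03)
        yield i


if __name__ == "__main__":
    report = compare(time_batches(fake_local_loader, 5), time_batches(fake_cloud_loader, 5))
    print(report)
```

Timing only the wait for `next(...)` separates data-loading cost from model compute, which makes the two storage options easier to compare on equal terms.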

Other Key Information

Expected Size: 175hrs
Skills: ML Knowledge, Familiarity with training ML Models with PyTorch, Python, Data Analysis. Zarr is a bonus
Difficulty level: Medium
Related Reading: Add support for GOES and Himawari Satellite Imagery openclimatefix-archives/Satip#222
Potential mentors: @devsjc, @peterdudfield

yuvraajnarula · 2025-03-01T04:05:09Z

yuvraajnarula
Mar 1, 2025

Hi @peterdudfield and @devsjc,

The Ice Chunk + Zarr 3 integration for streaming satellite data caught my attention as I've recently encountered bottlenecks with similar workloads.

Some specific technical questions:

1. Are you targeting a particular cloud provider's object storage for the benchmark? S3's multipart API handles Zarr chunks differently than GCP Storage, especially with concurrent reads at high parallelism.
2. For the PVNet model training, will you be measuring impact on varying batch sizes? I've observed that larger batches (32+) with cloud-streamed Zarr can sometimes suffer from throughput degradation that's not present with local storage.
3. Have you considered testing varying Zarr chunk sizes (e.g., 512MB vs 4MB) specifically for Numerical Weather Prediction data? In my experience, NWP data has unique access patterns where chunk size optimization yields substantial performance gains.

I recently built a custom PyTorch DataLoader that pre-fetches Zarr chunks from S3 with thread pooling that improved throughput by 37% over vanilla implementations. This approach specifically addressed the latency spikes when chunk boundaries don't align with batch sampling.
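The DataLoader mentioned above is not shown in the thread. As a generic sketch of the idea only (prefetching with a thread pool so several reads are in flight while the consumer works), something like the following could be used. `prefetch`, `depth`, `workers` and `slow_fetch` are hypothetical names, and `slow_fetch` stands in for a real chunk read from object storage.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def prefetch(
    keys: Iterable[K],
    fetch: Callable[[K], V],
    depth: int = 4,
    workers: int = 4,
) -> Iterator[V]:
    """Yield fetch(key) for each key, in order, keeping up to `depth` fetches in flight."""
    if depth < 1:
        raise ValueError("depth must be >= 1")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()
        for key in keys:
            pending.append(pool.submit(fetch, key))
            if len(pending) >= depth:
                # Block on the oldest request only; newer ones keep downloading.
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()


# Hypothetical stand-in for a chunk read: sleeps, then returns a value.
def slow_fetch(key: int) -> int:
    time.sleep(0.01)
    return key * 2


if __name__ == "__main__":
    for value in prefetch(range(8), slow_fetch):
        print(value)
```

Results come back in request order, so a training loop sees the same sequence of samples as it would without prefetching. Any speed-up depends on the workload and would need to be measured.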

For benchmark metrics, I'd suggest measuring not just overall training speed but also:

Cold-start latency (critical for serverless deployment)
Throughput consistency (p95/p99 latency spikes)
Memory footprint differences (cloud streaming can use less RAM with proper buffering)
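A minimal sketch of how these three metrics could be computed from per-batch wait times, using only the standard library. The function names are hypothetical, the percentile uses the simple nearest-rank definition, and `tracemalloc` only sees allocations made through Python's memory allocators, so it may under-report memory held by native libraries.

```python
import math
import tracemalloc
from typing import Callable, Sequence


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def summarise(waits: Sequence[float]) -> dict[str, float]:
    """Treat the first batch as the cold start; report tail latency on the rest."""
    steady = list(waits[1:]) or list(waits)
    return {
        "cold_start_s": waits[0],
        "p50_s": percentile(steady, 50),
        "p95_s": percentile(steady, 95),
        "p99_s": percentile(steady, 99),
    }


def peak_traced_bytes(fn: Callable[[], object]) -> int:
    """Run fn and return the peak number of bytes traced by tracemalloc."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    # Made-up wait times: one slow first batch, one late spike.
    waits = [2.0] + [0.1] * 98 + [1.0]
    print(summarise(waits))
    print(peak_traced_bytes(lambda: bytearray(1_000_000)))
```

Reporting p95/p99 next to the median shows whether occasional slow batches are hidden by a good average.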

Would you be open to including compression ratio comparisons in the benchmark? Zarr 3's newer compressors might change the calculus on chunk size vs. download speed tradeoffs.

Happy to elaborate on specific implementation approaches if helpful.

3 replies

devsjc Mar 11, 2025
Maintainer

Hi @yuvraajnarula, good questions!

For 1), it would probably be interesting to see how it differed across the providers, but initially I suspect S3 would be the first port of call. For 2), I will have to defer to @peterdudfield, but it sounds like it might be more of a stretch goal for the project than a required investigation. 3) is definitely relevant: we have done chunk size testing for local data, but not for data streamed using icechunk. I think if we see differences in speed that can't just be attributed to the network, then we would proceed to investigate the chunking effects and, as you say, compression ratio effects: we have seen definite differences when moving from V2 to V3.

In effect, the more of an idea we can get around the optimum setup for cloud streaming (including all of the above), the better we can determine the places it can be leveraged effectively!
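To make the chunk-size trade-off discussed above concrete, here is a small back-of-the-envelope sketch. The array layout (time, y, x), the float32 dtype, the chunk shapes and the sample window are all hypothetical and chosen only to produce 4 MiB and 512 MiB chunks. Sizes are uncompressed, so real transfers would be smaller.

```python
import math
from typing import Sequence


def chunk_nbytes(chunk_shape: Sequence[int], itemsize: int) -> int:
    """Uncompressed size of one chunk in bytes."""
    return math.prod(chunk_shape) * itemsize


def chunks_touched(start: Sequence[int], stop: Sequence[int], chunk_shape: Sequence[int]) -> int:
    """Count chunks overlapped by the half-open index box [start, stop) on each axis."""
    count = 1
    for lo, hi, size in zip(start, stop, chunk_shape):
        if hi <= lo:
            raise ValueError("stop must be greater than start on every axis")
        count *= (hi - 1) // size - lo // size + 1
    return count


FLOAT32_BYTES = 4
SMALL_CHUNK = (1, 1024, 1024)    # 4 MiB per chunk
LARGE_CHUNK = (128, 1024, 1024)  # 512 MiB per chunk

# Hypothetical training sample: 12 time steps of a 64 x 64 patch.
SAMPLE_START = (100, 1000, 1000)
SAMPLE_STOP = (112, 1064, 1064)

if __name__ == "__main__":
    for name, chunk in (("small", SMALL_CHUNK), ("large", LARGE_CHUNK)):
        n = chunks_touched(SAMPLE_START, SAMPLE_STOP, chunk)
        mib = n * chunk_nbytes(chunk, FLOAT32_BYTES) / 2**20
        print(f"{name}: {n} chunks, {mib:.0f} MiB read")
```

In this made-up case the sample needs 192 KiB of values. The small chunks require 48 requests totalling 192 MiB, while the large chunks require 4 requests totalling 2048 MiB, which is the request-count versus wasted-bytes tension that chunk-size benchmarks would explore.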

yuvraajnarula Mar 13, 2025

Hey @devsjc, thanks for the breakdown. Starting with S3 as the baseline sounds like a solid plan. I’m particularly interested in exploring how larger batch sizes and different Zarr chunk sizes affect performance, especially for NWP data. The metrics you mentioned—cold-start latency, throughput consistency (p95/p99), and memory footprint—are spot on, and adding compression ratio comparisons between Zarr V2 and V3 could really sharpen our insights. I'm looking forward to collaborating on this and figuring out which aspects to prioritize. What are your thoughts?

Additionally, I’ve been working on a custom PyTorch DataLoader that leverages pre-fetching with thread pooling to mitigate latency spikes, especially when batch sampling doesn’t perfectly align with chunk boundaries. This approach has yielded significant throughput improvements, and I believe integrating similar strategies could provide a clearer picture of performance variability across different cloud storage scenarios. It would be interesting to see how these optimizations play out when we compare varying workloads and configurations across providers.

yuvraajnarula Mar 25, 2025

Hi @peterdudfield,

I hope you’re doing well! I noticed that the project has shifted to using satellite-consumer instead of Satip, and I wanted to reach out. If you could shed some light on the specific expectations or requirements for this change, I would greatly appreciate it. Having a clear understanding of the core objectives will help me in crafting a thoughtful proposal. I’m looking forward to hearing your insights and working together on this. Thank you!

alirashidAR · 2025-03-06T10:04:27Z

alirashidAR
Mar 6, 2025

Hi @peterdudfield @devsjc,

I've really enjoyed contributing to the organization's project, especially working with data collection, archiving, and using ocf-data-sampler to create batches. It has been a great experience working with Zarr, and I would love to be part of this new project as well.

Just a quick question—will we be using ocf-data-sampler for creating batches when training PVNet here too?

3 replies

devsjc Mar 11, 2025
Maintainer

Hi @alirashidAR - thanks for reaching out! Yes, we're trying to use data-sampler for all our training going forward. Using it would give us an apples-to-apples comparison against our current training pipelines (although obviously it might need some tweaks to enable the new data loading!)

alirashidAR Mar 13, 2025

Thanks, @devsjc! That makes a lot of sense—keeping everything consistent across pipelines will definitely help with benchmarking.

Just to clarify, will the tweaks and changes needed in ocf-data-sampler for this new data loading also be part of this project? Or would they be handled separately?

alirashidAR Apr 7, 2025

@peterdudfield @devsjc quick question — will we be working with satellite data that’s already in Zarr format, or will we be using satellite-consumer to process the satellite data ourselves?

zyadamr-dev · 2025-03-20T19:06:58Z

zyadamr-dev
Mar 20, 2025

Hi,
I'm excited to start working on this task! Before diving in, I’d like to get a big-picture understanding since this is a team effort.

From what I’ve gathered so far:

We’re replacing the Quartz solar model with PVNet to ensure everything is 100% open data.
We need to train a new model, with two possible approaches: locally or in the cloud—this task is about benchmarking both.
To handle large datasets efficiently, we’ll use Zarr format with IceChunk, which allows us to read only the parts of the data we need, avoiding unnecessary downloads.
One thing I’d like to clarify: How do we link NWP and satellite data for training? Specifically, how can I determine the cloud movement from satellite data over London at 3 PM, rather than, say, Manchester?
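The thread does not answer this question. As a generic sketch only, one common way to line up gridded datasets is to pick a location and time, find the nearest index on each dataset's coordinate axes, and cut a small window around it. The regular latitude/longitude grid, the hourly timestamps and the function names below are assumptions for illustration. Real satellite data is often stored in a projected coordinate system and would need a coordinate transform first.

```python
from datetime import datetime, timedelta
from typing import Sequence


def nearest_index(coords: Sequence, target) -> int:
    """Index of the coordinate value closest to target."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))


def window(center: int, half_width: int, size: int) -> slice:
    """Slice of up to 2 * half_width + 1 indices around center, clipped to [0, size)."""
    return slice(max(0, center - half_width), min(size, center + half_width + 1))


# Hypothetical 0.5-degree grid roughly covering the UK, with hourly time steps.
LATS = [49.0 + 0.5 * i for i in range(25)]
LONS = [-10.0 + 0.5 * i for i in range(25)]
TIMES = [datetime(2025, 3, 20) + timedelta(hours=h) for h in range(24)]

LONDON = (51.5, -0.1)
MANCHESTER = (53.48, -2.24)

if __name__ == "__main__":
    t = nearest_index(TIMES, datetime(2025, 3, 20, 15))
    for name, (lat, lon) in (("London", LONDON), ("Manchester", MANCHESTER)):
        y = nearest_index(LATS, lat)
        x = nearest_index(LONS, lon)
        print(name, t, window(y, 2, len(LATS)), window(x, 2, len(LONS)))
```

Applying the same location and time to both the NWP and satellite arrays yields patches that describe the same place and moment, even when the two grids have different resolutions.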

Thanks in advance!

0 replies

utsav-pal · 2025-03-21T18:41:40Z

utsav-pal
Mar 21, 2025


Hi @peterdudfield and @devsjc,

I'm excited about the opportunity to contribute to this project! I’ve been following the discussions closely, and I had a couple of additional thoughts and questions to explore:

1. Test Data for Local Validation

I’m currently looking into training the PVNet model using NWP (Numerical Weather Prediction) data stored in Zarr format. Since handling large datasets locally can be resource-intensive, I was wondering if there’s any possibility of accessing a smaller subset of test data to validate and debug the model efficiently.

2. Additional Benchmarking Considerations

I also wanted to expand on the benchmarking criteria mentioned earlier. Beyond cold-start latency, throughput consistency (p95/p99), and memory footprint, do you think it would be valuable to explore:

Disk I/O performance when streaming data from cloud storage versus local storage
Impact of compression formats in Zarr V3 and their effect on read/write speeds
Parallelization efficiency when streaming batches with different chunk sizes

3. Leveraging Dask for Distributed Processing

Additionally, have you considered using Dask to parallelize I/O operations and optimize data loading when working with large Zarr datasets? Dask’s capability to handle distributed computation across multiple cores could significantly improve throughput and reduce latency during training.

I’d love to hear your thoughts on whether these additions would bring meaningful insights to the benchmarking process. Thanks in advance! 😊

0 replies

Shantanu-Saharan · 2025-03-23T18:29:52Z

Shantanu-Saharan
Mar 23, 2025

Hello @peterdudfield and @devsjc,

I'm Shantanu Saharan, a student at the Indian Institute of Technology Bombay (IITB). I looked up this project and honestly find it really interesting. I know the GSoC contributing period is over, but I would still love to contribute to this project.

Looking forward to your guidance!

0 replies

ArchBlizzard · 2025-03-23T22:12:55Z

ArchBlizzard
Mar 23, 2025

Hi @peterdudfield @devsjc,

I noticed that tools like ocf-data-sampler and ocf-datapipes currently have strict dependencies on zarr==2.18.3, while the latest version is zarr 3.0.6.

Since one of the conversations above states that we’ll be using ocf-datapipes for training purposes, I wanted to ask:
Would we stick to zarr v2 for compatibility with the current ocf data pipeline, or is there a plan to update ocf-datapipes and related tools to support zarr v3 as part of this project?

TIA

0 replies

peterdudfield · 2025-03-24T08:45:13Z

peterdudfield
Mar 24, 2025
Maintainer Author

For this project we can use zarr>3; we just need to manage the move carefully.

0 replies

Dakshbir · 2025-03-26T07:46:53Z

Dakshbir
Mar 26, 2025

Hi @peterdudfield and @devsjc,

I’m Dakshbir Singh, an enthusiastic contributor to Open Climate Fix, and I’m incredibly excited about the Streaming Zarr from Cloud Storage (Ice Chunk + Zarr 3) project. I’ve been actively contributing to OCF over the past few weeks, and I’m eager to make more meaningful contributions as I prepare my GSoC 2025 proposal.

I am particularly drawn to this project because optimizing data pipelines for cloud storage directly impacts the efficiency and scalability of solar forecasting models. Improving cloud-streamed data handling with Zarr 3 + Ice Chunk is an exciting challenge with real-world sustainability benefits.

After reviewing the project description and discussions, I had a few technical questions and would love to get your insights:

1. Zarr 3 Integration with OCF Pipelines: Since tools like ocf-data-sampler and ocf-datapipes currently rely on Zarr v2, do you plan on gradually migrating them to Zarr v3 as part of this project, or will they remain on Zarr v2 for compatibility? I’d be eager to assist with the migration process if it’s on the roadmap.
2. Benchmarking Considerations: Beyond raw training speed, would it be valuable to measure cold-start latency, throughput consistency (p95/p99), and memory footprint when streaming from cloud storage versus local storage? These metrics could offer deeper insights into the real-world performance impact.
3. Compression and Chunking Strategies: Are there specific Zarr compression formats or chunk sizes you recommend benchmarking? For example, have you observed noticeable differences in performance between 512MB vs. 4MB chunks when working with large NWP datasets?
4. Immediate Contribution Opportunities: Are there any high-priority issues or areas within this project where I can contribute right away? I would love to start working on relevant tasks and gain a deeper understanding of the project’s architecture.

I am genuinely excited to collaborate and learn from your guidance. Thank you for your time and dedication—I look forward to contributing further and helping OCF drive impactful, sustainable innovations! 😊

Best regards,
Dakshbir Singh
LinkedIn
dakshbirkapoor@gmail.com

0 replies

emlweb · 2025-04-24T15:25:56Z

emlweb
Apr 24, 2025
Maintainer

Google Summer of Code 2025 applications are now closed.

We are currently reviewing all applications. Contributors will be announced 8 May 2025. Thank you!

0 replies

peterdudfield · 2025-09-09T15:43:16Z

peterdudfield
Sep 9, 2025
Maintainer Author

I'm closing this discussion now as GSoC 2025 is nearly over. Thank you for everyone's input and help.

We hope to take part next year, and we'll be posting info here.

0 replies
