Google Summer of Code 2025: Streaming Zarr from Cloud Storage (Ice Chunk + Zarr 3) #29
-
Hi @peterdudfield and @devsjc, The Ice Chunk + Zarr 3 integration for streaming satellite data caught my attention as I've recently encountered bottlenecks with similar workloads. Some specific technical questions:
I recently built a custom PyTorch DataLoader that pre-fetches Zarr chunks from S3 using a thread pool, which improved throughput by 37% over vanilla implementations. This approach specifically addressed the latency spikes that occur when chunk boundaries don't align with batch sampling. For benchmark metrics, I'd suggest measuring not just overall training speed but also:
Would you be open to including compression ratio comparisons in the benchmark? Zarr 3's newer compressors might change the calculus on chunk size vs. download speed tradeoffs. Happy to elaborate on specific implementation approaches if helpful.
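A minimal sketch of the thread-pool prefetching idea mentioned above. This is not the commenter's actual DataLoader: `fetch_chunk` is a hypothetical stand-in for a remote chunk read (for example, one Zarr chunk from S3), and `prefetching_iter`, `max_workers`, and `depth` are illustrative names. It only shows the general pattern of keeping several reads in flight while yielding results in order.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the chunk id stream


def fetch_chunk(chunk_id):
    """Hypothetical stand-in for a remote chunk read (assumption, not a real API)."""
    time.sleep(0.01)  # simulated network latency
    return [chunk_id] * 4  # fake chunk payload


def prefetching_iter(chunk_ids, fetch=fetch_chunk, max_workers=4, depth=8):
    """Yield chunks in input order while keeping up to `depth` fetches in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        ids = iter(chunk_ids)
        pending = deque()
        # Prime the queue with the first `depth` requests.
        for cid in ids:
            pending.append(pool.submit(fetch, cid))
            if len(pending) >= depth:
                break
        while pending:
            # Blocks only if the oldest request has not finished yet.
            chunk = pending.popleft().result()
            # Top up the queue so later reads overlap with the consumer's work.
            nxt = next(ids, _DONE)
            if nxt is not _DONE:
                pending.append(pool.submit(fetch, nxt))
            yield chunk
```

For example, `for chunk in prefetching_iter(range(100)): ...` would overlap the simulated reads instead of waiting for each one in turn. In a real pipeline the fetch function would wrap the actual store read, and `depth` would be tuned against memory use.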
-
I've really enjoyed contributing to the organization's project, especially working with data collection, archiving, and using ocf-data-sampler to create batches. It has been a great experience working with Zarr, and I would love to be part of this new project as well. Just a quick question: will we be using ocf-data-sampler for creating batches when training PVNet here too?
-
Hi, From what I’ve gathered so far: We’re replacing the Quartz solar model with PVNet to ensure everything is 100% open data. Thanks in advance!
-
Hi @peterdudfield and @devsjc, I'm excited about the opportunity to contribute to this project! I’ve been following the discussions closely, and I had a couple of additional thoughts and questions to explore:
1. Test Data for Local Validation
I’m currently looking into training the PVNet model using NWP (Numerical Weather Prediction) data stored in Zarr format. Since handling large datasets locally can be resource-intensive, I was wondering if there’s any possibility of accessing a smaller subset of test data to validate and debug the model efficiently.
2. Additional Benchmarking Considerations
I also wanted to expand on the benchmarking criteria mentioned earlier. Beyond cold-start latency, throughput consistency (p95/p99), and memory footprint, do you think it would be valuable to explore:
3. Leveraging Dask for Distributed Processing
Additionally, have you considered using Dask to parallelize I/O operations and optimize data loading when working with large Zarr datasets? Dask’s capability to handle distributed computation across multiple cores could significantly improve throughput and reduce latency during training. I’d love to hear your thoughts on whether these additions would bring meaningful insights to the benchmarking process. Thanks in advance! 😊
-
Hello @peterdudfield and @devsjc, I'm Shantanu Saharan, a student at the Indian Institute of Technology Bombay (IITB). I looked into this project and honestly find it really interesting. I know the GSoC contributing period is over, but I would still love to contribute to this project. Looking forward to your guidance!
-
I noticed that tools like ocf-data-sampler and ocf-datapipes currently have strict dependencies on zarr==2.18.3, while the latest version is zarr 3.0.6. Since one of the conversations above states that we’ll be using ocf-datapipes for training purposes, I wanted to ask: TIA
-
For this project we can use zarr>3; we just need to manage moving that over carefully.
-
Hi @peterdudfield and @devsjc, I’m Dakshbir Singh, an enthusiastic contributor to Open Climate Fix, and I’m incredibly excited about the Streaming Zarr from Cloud Storage (Ice Chunk + Zarr 3) project. I’ve been actively contributing to OCF over the past few weeks, and I’m eager to make more meaningful contributions as I prepare my GSoC 2025 proposal.
I am particularly drawn to this project because optimizing data pipelines for cloud storage directly impacts the efficiency and scalability of solar forecasting models. Improving cloud-streamed data handling with Zarr 3 + Ice Chunk is an exciting challenge with real-world sustainability benefits.
After reviewing the project description and discussions, I had a few technical questions and would love to get your insights:
1. Zarr 3 Integration with OCF Pipelines: Since tools like ocf-data-sampler and ocf-datapipes currently rely on Zarr v2, do you plan on gradually migrating them to Zarr v3 as part of this project, or will they remain on Zarr v2 for compatibility? I’d be eager to assist with the migration process if it’s on the roadmap.
I am genuinely excited to collaborate and learn from your guidance. Thank you for your time and dedication. I look forward to contributing further and helping OCF drive impactful, sustainable innovations! 😊
Best regards,
-
Google Summer of Code 2025 applications are now closed. We are currently reviewing all applications. Contributors will be announced 8 May 2025. Thank you!
-
I'm closing this discussion now as GSoC 2025 is nearly over. Thank you for everyone's input and help. We hope to take part next year, and we'll be posting info here.
-
This space is for you to ask any questions you have about this project. We're here to provide clarifications and help you understand the project's goals, scope, and requirements. Feel free to ask about anything that interests you!
Please note that this discussion is for questions and clarifications, not for formal applications.
Project Description
We use a large amount of satellite and Numerical Weather Prediction (NWP) data, all saved in Zarr format, for training our ML models. We normally work from a local copy rather than from the cloud. We would like to explore using Ice Chunk and Zarr 3. It would be great to create a benchmark for training our PVNet model with data in cloud storage (using a modern stack such as Ice Chunk + Zarr 3). We could then use this to measure training speed and compare it to training with data on disk.
Expected Outcome
A quantitative comparison of training speed with and without these new tools.
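One possible shape for such a comparison, as a rough sketch rather than the project's benchmark: time each batch load for two loaders and report mean latency, tail latency (p95/p99), and batches per second. The names `time_batches`, `summarize`, `compare`, `local_loader`, and `cloud_loader` are hypothetical; each loader is assumed to be a callable that takes a batch index and loads that batch.

```python
import statistics
import time


def time_batches(load_batch, n_batches):
    """Return per-batch wall-clock latencies in seconds for load_batch(i)."""
    latencies = []
    for i in range(n_batches):
        start = time.perf_counter()
        load_batch(i)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Summarize latencies (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points: index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    total = sum(latencies)
    return {
        "mean_s": statistics.fmean(latencies),
        "p95_s": cuts[94],
        "p99_s": cuts[98],
        "batches_per_s": len(latencies) / total if total > 0 else float("inf"),
    }


def compare(local_loader, cloud_loader, n_batches=50):
    """Time both (hypothetical) loaders and report the mean-latency ratio."""
    local = summarize(time_batches(local_loader, n_batches))
    cloud = summarize(time_batches(cloud_loader, n_batches))
    ratio = cloud["mean_s"] / local["mean_s"] if local["mean_s"] > 0 else float("inf")
    return {"local": local, "cloud": cloud, "slowdown": ratio}
```

A real benchmark would also need to separate cold-start from warm reads and control for caching, which this sketch does not attempt.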
Other Key Information