consecutive repartitions blog post #127

gene-bordegaray · 2025-12-07T22:03:10Z

blog post for consecutive repartitions

NGA-TRAN · 2025-12-09T01:24:25Z

Thanks @gene-bordegaray. Great story. Strong content

alamb

Thank you @gene-bordegaray and @NGA-TRAN -- this looks great to me

I recommend we consider the title and authors before we publish this

cc @berkaysynnada @ozankabak and @akurmustafa as I think you were involved in the EnforceDistribution code and may be interested in this post

alamb · 2025-12-09T16:39:26Z

content/blog/2025-12-07-avoid-consecutive-repartitions.md

+layout: post
+title: A Noob's Guide to Databases
+date: 2025-12-07
+author: Gene Bordegaray, Nga Tran, Andrew Lamb


I appreciate the link here, but since the blog is written in the first person ("Who am I...", etc.), I would suggest you leave yourself as the only author -- you have already recognized @NGA-TRAN and me in the Acknowledgments section

Suggested change

author: Gene Bordegaray, Nga Tran, Andrew Lamb

author: Gene Bordegaray

alamb · 2025-12-09T16:41:37Z

content/blog/2025-12-07-avoid-consecutive-repartitions.md

@@ -0,0 +1,428 @@
+---
+layout: post
+title: A Noob's Guide to Databases


I recommend we add some more specifics to this title so it has more hints about the contents. The current title is general enough that I feel people may miss it.

How about something like "Optimizing Repartitions in DataFusion: How I went from Database Noob to Core Contribution"?

I like this name... I'm stealing it 😄

alamb · 2025-12-09T16:43:33Z

content/blog/2025-12-07-avoid-consecutive-repartitions.md

+
+## **Starting Out**
+
+I am no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I find useful when first starting.


Maybe the past tense would be more natural here (as you are a lot more expert than when you were starting!)

Suggested change

I am no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I find useful when first starting.

I was no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I found useful when first starting.

alamb · 2025-12-09T16:44:23Z

content/blog/2025-12-07-avoid-consecutive-repartitions.md

+
+### Narrow Your Scope
+
+The next crucial step is to pick your niche and stick to it. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope.


Suggested change

The next crucial step is to pick your niche and stick to it. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope.

The next crucial step is to pick your niche to focus on. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope.

alamb · 2025-12-09T16:45:53Z

content/blog/2025-12-07-avoid-consecutive-repartitions.md

+
+---
+
+## **Intro to Datafusion**


A nit: the formal name of the project is DataFusion (capital F), so it would be nice to use that form in the text

alamb · 2025-12-09T16:48:08Z

content/blog/2025-12-15-avoid-consecutive-repartitions.md

+
+Hash repartitioning distributes data based on a hash function applied to one or more columns, called the partitioning key. Rows with the same hash value are placed in the same partition.
+<br><br>
+Hash repartitioning is useful when working with grouped data. Imagine you have a database containing information on company sales, and you are looking to find the total revenue each store produced. Hash repartitioning would make this query much more efficient. Rather than iterating over the data on a single thread and keeping a running sum for each store, it would be better to hash repartition on the store column and have multiple threads calculate individual store sales.


this is a really nice example

Thank you 👍

alamb · 2025-12-09T16:50:05Z

content/images/avoid-consecutive-repartitions/basic_before_query_plan.png

I think there is a typo in this diagram:

alamb · 2025-12-09T16:51:19Z

content/blog/2025-12-07-avoid-consecutive-repartitions.md

+
+Repartitions would appear back-to-back in query plans, specifically a round-robin followed by a hash repartition.
+
+Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and incorrect.


Suggested change

Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and incorrect.

Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and unnecessary.

(I think the behavior was "correct" in the sense that the correct answers come out)

alamb · 2025-12-09T16:56:13Z

content/blog/2025-12-07-avoid-consecutive-repartitions.md

+
+Well, what is the correct logic?
+
+Based on our lesson on hash repartitioning and the indicators Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning:


nit: "heuristics" is another term for this type of rule -- maybe this would read better

Suggested change

Based on our lesson on hash repartitioning and the indicators Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning:

Based on our lesson on hash repartitioning and the heuristics Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning:

alamb · 2025-12-09T16:58:35Z

content/blog/2025-12-07-avoid-consecutive-repartitions.md

+
+1. Deeply understand the system you are working on. It is not only fun to figure these things out, but it also pays off in the long run when having surface-level knowledge won't cut it.
+
+2. This is complementary to the first, narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here.


Suggested change

2. This is complementary to the first, narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here.

2. Narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here.

I am not quite sure what you mean by "This is complementary to the first" -- if you meant complimentary to the first point, I think it might be clearer if there were fewer words.

yes this is what I meant, it felt weird while writing too. I think being more concise here is better

alamb · 2025-12-10T12:46:22Z

I think this looks great to me -- how about we shoot for a publish date of next Monday Dec 15 to both:

1. Give people more time to review
2. Make sure we don't crowd the other DataFusion-related blog (Blog: Practical Dive Into Late Materialization in arrow-rs Parquet Reads arrow-site#740) targeting tomorrow

gene-bordegaray · 2025-12-10T13:56:34Z

Awesome! Sounds like a plan to me 😄

Thank you for all the feedback and guidance @alamb and @NGA-TRAN

akurmustafa · 2025-12-10T17:25:57Z

content/images/avoid-consecutive-repartitions/basic_before_query_plan.png

In the final aggregation part, there is a typo in the figure: the text says "aggregation reults"

akurmustafa

Hi @gene-bordegaray, thank you for this blog. I like how you described the whole process of contributing and how it went in your case. I added a very minor typo suggestion in one of the figures. Other than that, this looks great!

gene-bordegaray · 2025-12-10T19:37:37Z

Hi @gene-bordegaray, thank you for this blog. I like how you described the whole process of contributing and how it went in your case. I added a very minor typo suggestion in one of the figures. Other than that, this looks great!

Thank you for giving it a read 😄 I am glad you liked it.

alamb · 2025-12-15T13:14:55Z

I renamed the file to match today's date -- let's get this thing published!

alamb · 2025-12-15T17:26:03Z

Strangely, I don't yet see this blog published on https://datafusion.apache.org/blog/:

However, the CI job that builds and pushes the site appears to have worked (and you can see the blog clearly listed on the asf-site branch): https://github.com/apache/datafusion-site/blob/asf-site/output/index.html

datafusion-site/output/index.html

Lines 48 to 97 in bfd2dae

    
               <!-- Post --> 
        
               <div class="row"> 
        
                   <div class="callout"> 
        
                       <article class="post"> 
        
                           <header> 
        
                               <div class="title"> 
        
                                   <h1><a href="/blog/2025/12/15/avoid-consecutive-repartitions">Optimizing Repartitions in DataFusion: How I Went From Database Nood to Core Contribution</a></h1> 
        
                                   <p>Posted on: Mon 15 December 2025 by Gene Bordegaray</p> 
        
                                   <p><!-- 
        
           {% comment %} 
        
           Licensed to the Apache Software Foundation (ASF) under one or more 
        
           contributor license agreements.  See the NOTICE file distributed with 
        
           this work for additional information regarding copyright ownership. 
        
           The ASF licenses this file to you under the Apache License, Version 2.0 
        
           (the "License"); you may not use this file except in compliance with 
        
           the License.  You may obtain a copy of the License at 
        
           http://www.apache.org/licenses/LICENSE-2.0 
        
           Unless required by applicable law or agreed to in writing, software 
        
           distributed under the License is distributed on an "AS IS" BASIS, 
        
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
        
           See the License for the specific language governing permissions and 
        
           limitations under the License. 
        
           {% endcomment %} 
        
           --> 
        
           <div style="display: flex; align-items: center; gap: 20px; margin-bottom: 20px;"> 
        
           <div style="flex: 1;"> 
        
           Databases are some of the most complex yet interesting pieces of software. They are amazing pieces of abstraction: query engines optimize and execute complex plans, storage engines provide sophisticated infrastructure as the backbone of the system, while intricate file formats lay the groundwork for particular workloads. All of this is …</div></div></p> 
        
                                   <footer> 
        
                                       <ul class="actions"> 
        
                                           <div style="text-align: right"><a href="/blog/2025/12/15/avoid-consecutive-repartitions" class="button medium">Continue Reading</a></div> 
        
                                       </ul> 
        
                                       <ul class="stats"> 
        
                                       </ul> 
        
                                   </footer> 
        
                       </article> 
        
                   </div> 
        
               </div> 
        
               <!-- Post --> 
        
               <div class="row"> 
        
                   <div class="callout"> 
        
                       <article class="post"> 
        
                           <header> 
        
                               <div class="title"> 
        
                                   <h1><a href="/blog/2025/12/04/datafusion-comet-0.12.0">Apache DataFusion Comet 0.12.0 Release</a></h1> 
        
                                   <p>Posted on: Thu 04 December 2025 by pmc</p> 
        
                                   <p><!-- 
        
           {% comment %}

I'll poke around and try to see what is going on

alamb · 2025-12-15T17:33:53Z

It appears to be an ASF infra issue (other projects are reporting the same thing): https://issues.apache.org/jira/browse/INFRA-27494

I'll keep an eye on it and post here when it is fixed

alamb · 2025-12-16T16:51:11Z

Still watching https://issues.apache.org/jira/browse/INFRA-27494 -- I left a comment this morning.

gene-bordegaray · 2025-12-16T18:11:41Z

Still watching https://issues.apache.org/jira/browse/INFRA-27494 -- I left a comment this morning.

thanks for checking in on it. Let me know if there is anything I can look into to help

alamb · 2025-12-16T18:56:28Z

Will do -- I am sorry this is taking so long. It is unfortunate, but hopefully it will get sorted out shortly

alamb · 2025-12-17T22:04:48Z

Update here. The blog is posted to https://datafusion.blog.apache.org/2025/12/15/avoid-consecutive-repartitions/

However, for some reason it is not being replicated to https://datafusion.apache.org/blog anymore. I have filed another ASF infra ticket about this too:

https://issues.apache.org/jira/browse/INFRA-27512

alamb · 2025-12-18T16:54:18Z

I am still going back and forth with ASF infra on getting this thing on to https://datafusion.apache.org/blog

I will update here when I get that figured out https://issues.apache.org/jira/browse/INFRA-27512

alamb · 2025-12-20T12:27:35Z

Update -- this post is now showing up correctly on the main DataFusion blog site: https://datafusion.apache.org/blog/

Specifically the url is: https://datafusion.apache.org/blog/output/2025/12/15/avoid-consecutive-repartitions/

gene-bordegaray added 5 commits December 7, 2025 16:58

initial blog post

4e7ba73

better images and formatting

b1d749e

realigned some images

94ae096

added links for Nga and Andrew's github

7e1dc85

added links for Nga and Andrew's github

e552282

gene-bordegaray changed the title ~~initial blog post~~ consecutive repartitions blog post Dec 8, 2025

alamb approved these changes Dec 9, 2025

fixed to DataFusion and some word selection

ab24465

akurmustafa reviewed Dec 10, 2025

akurmustafa approved these changes Dec 10, 2025

gene-bordegaray and others added 2 commits December 12, 2025 13:12

reformatted some images for clarity and minor changes to punctuation

bd0f736

Update file name to match publish date

ee54b2c

alamb merged commit 6d8cbb6 into apache:main Dec 15, 2025

alamb mentioned this pull request Dec 15, 2025

Site/gene.bordegaray/2025/12/consecutive repartitions blog post fix image #128

Merged
