consecutive repartitions blog post #127
Conversation
|
Thanks @gene-bordegaray. Great story. Strong content |
alamb
left a comment
Thank you @gene-bordegaray and @NGA-TRAN -- this looks great to me
I recommend we consider the title and authors before we publish this
cc @berkaysynnada @ozankabak and @akurmustafa as I think you were involved in the EnforceDistribution code and may be interested in this post
| layout: post | ||
| title: A Noob's Guide to Databases | ||
| date: 2025-12-07 | ||
| author: Gene Bordegaray, Nga Tran, Andrew Lamb |
I appreciate the link here, but since the blog is written in the first person ("Who am I...", etc.) I would suggest you leave yourself as the only author -- you have already recognized @NGA-TRAN and me in the Acknowledgments section
| author: Gene Bordegaray, Nga Tran, Andrew Lamb | |
| author: Gene Bordegaray |
| @@ -0,0 +1,428 @@ | |||
| --- | |||
| layout: post | |||
| title: A Noob's Guide to Databases | |||
I recommend we add some more specifics to this title so it has more hints about the contents. The current title is general enough that I feel people may miss it.
How about something like "Optimizing Repartitions in DataFusion: How I went from Database Noob to Core Contribution" ?
I like this name... I'm stealing it 😄
|
|
||
| ## **Starting Out** | ||
|
|
||
| I am no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I find useful when first starting. |
Maybe the past tense would be more natural here (as you are a lot more expert than when you were starting!)
| I am no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I find useful when first starting. | |
| I was no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I found useful when first starting. |
|
|
||
| ### Narrow Your Scope | ||
|
|
||
| The next crucial step is to pick your niche and stick to it. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope. |
| The next crucial step is to pick your niche and stick to it. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope. | |
| The next crucial step is to pick your niche to focus on. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope. |
|
|
||
| --- | ||
|
|
||
| ## **Intro to Datafusion** |
A nit is that the formal name of the project is DataFusion (capital F), so it would be nice to use that form in the text
|
|
||
| Hash repartitioning distributes data based on a hash function applied to one or more columns, called the partitioning key. Rows with the same hash value are placed in the same partition. | ||
| <br><br> | ||
| Hash repartitioning is useful when working with grouped data. Imagine you have a database containing information on company sales, and you are looking to find the total revenue each store produced. Hash repartitioning would make this query much more efficient. Rather than iterating over the data on a single thread and keeping a running sum for each store, it would be better to hash repartition on the store column and have multiple threads calculate individual store sales. |
this is a really nice example
Thank you 👍
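For readers following the thread, the store-revenue example quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: `partition_for`, `hash_repartition`, and `aggregate` are hypothetical helper names, rows are plain `(store, revenue)` tuples, and CRC32 stands in for whatever hash function DataFusion actually uses.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partitioning key to a partition index (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_repartition(rows, num_partitions):
    """Send every row to the partition chosen by hashing its store column."""
    partitions = [[] for _ in range(num_partitions)]
    for store, revenue in rows:
        partitions[partition_for(store, num_partitions)].append((store, revenue))
    return partitions


def aggregate(partition):
    """Sum revenue per store within one partition; each partition could run on its own thread."""
    totals = defaultdict(int)
    for store, revenue in partition:
        totals[store] += revenue
    return dict(totals)


rows = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
partitions = hash_repartition(rows, 3)
partial_totals = [aggregate(p) for p in partitions]
```

Because rows with the same store always hash to the same partition, each store's total appears in exactly one entry of `partial_totals`, so the per-partition results can simply be concatenated without a second merge step.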
|
|
||
| Repartitions would appear back-to-back in query plans, specifically a round-robin followed by a hash repartition. | ||
|
|
||
| Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and incorrect. |
| Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and incorrect. | |
| Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and unnecessary. |
(I think the behavior was "correct" in the sense that the correct answers come out)
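A small sketch may help show why the plan was redundant rather than wrong. It assumes a toy model that is not DataFusion's implementation: batches are lists of `(store, revenue)` tuples, `round_robin` and `hash_rows` are hypothetical helpers, and CRC32 stands in for the real hash function. It compares where rows end up with and without the leading round-robin step.

```python
import zlib


def round_robin(batches, n):
    """Deal whole batches out to n partitions in turn (no rows are inspected)."""
    parts = [[] for _ in range(n)]
    for i, batch in enumerate(batches):
        parts[i % n].append(batch)
    return parts


def hash_rows(partitions, n):
    """Hash-repartition every row from every input partition on the store column."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for batch in part:
            for store, revenue in batch:
                out[zlib.crc32(store.encode("utf-8")) % n].append((store, revenue))
    return out


batches = [[("north", 10), ("south", 5)], [("north", 7), ("east", 3)], [("south", 1)]]

# Plan A: hash repartition directly from a single input partition.
direct = hash_rows([batches], 3)
# Plan B: round-robin first, then hash repartition (the consecutive repartitions).
via_round_robin = hash_rows(round_robin(batches, 3), 3)

# Row order inside a partition may differ, so compare sorted contents.
same = [sorted(p) for p in direct] == [sorted(p) for p in via_round_robin]
```

In this model `same` is true: the hash step alone decides each row's final partition, so the earlier round-robin changes nothing about the result, which matches the point that the answers were correct but the extra step was unnecessary.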
|
|
||
| Well, what is the correct logic? | ||
|
|
||
| Based on our lesson on hash repartitioning and the indicators Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning: |
nit: "heuristics" is another term for this type of rule -- maybe this would read better
| Based on our lesson on hash repartitioning and the indicators Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning: | |
| Based on our lesson on hash repartitioning and the heuristics Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning: |
|
|
||
| 1. Deeply understand the system you are working on. It is not only fun to figure these things out, but it also pays off in the long run when having surface-level knowledge won't cut it. | ||
|
|
||
| 2. This is complementary to the first, narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here. |
| 2. This is complementary to the first, narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here. | |
| 2. Narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here. |
I am not quite sure what you mean by "This is complementary to the first" -- if you meant complimentary to the first point, I think it might be clearer if there were fewer words.
Yes, this is what I meant; it felt weird while writing too. I think being more concise here is better
|
I think this looks great to me -- how about we shoot for a publish date of next Monday Dec 15 to both
|
In the final aggregation part, there is a typo in the figure: the text says "aggregation reults"
akurmustafa
left a comment
Hi @gene-bordegaray, thank you for this blog. I like how you described the whole process of contributing and how it went in your case. I added a very minor typo suggestion in one of the figures. Other than that, this looks great!
Thank you for giving it a read 😄 I am glad you liked it. |
|
I renamed the file to match today's date -- let's get this thing published! |
|
Strangely, I don't yet see this blog published on https://datafusion.apache.org/blog/:
However, the CI job that builds and pushes the site appears to have worked (and you can see the blog clearly listed on the asf-site branch): https://github.com/apache/datafusion-site/blob/asf-site/output/index.html datafusion-site/output/index.html Lines 48 to 97 in bfd2dae
I'll poke around and try to see what is going on |
|
It appears to be an ASF infra issue (other projects are reporting the same thing): https://issues.apache.org/jira/browse/INFRA-27494 I'll keep an eye on it and post here when it is fixed |
|
Still watching https://issues.apache.org/jira/browse/INFRA-27494 -- I left a comment this morning.
|
thanks for checking in on it. Let me know if there is anything I can look into to help |
|
Will do -- I am sorry this is taking so long. It is unfortunate, but hopefully it will get sorted out shortly |
|
Update here. The blog is posted to https://datafusion.blog.apache.org/2025/12/15/avoid-consecutive-repartitions/ However, for some reason it is not being replicated to https://datafusion.apache.org/blog anymore. I have filed another ASF infra ticket about this too: |
|
I am still going back and forth with ASF infra on getting this thing on to https://datafusion.apache.org/blog I will update here when I get that figured out https://issues.apache.org/jira/browse/INFRA-27512 |
|
Update -- this post is now showing up correctly on the main DataFusion blog site: https://datafusion.apache.org/blog/ Specifically the url is: https://datafusion.apache.org/blog/output/2025/12/15/avoid-consecutive-repartitions/ |
blog post for consecutive repartitions
cc: @alamb @NGA-TRAN