@gene-bordegaray
Contributor

blog post for consecutive repartitions

cc: @alamb @NGA-TRAN

@gene-bordegaray changed the title from "initial blog post" to "consecutive repartitions blog post" on Dec 8, 2025
@NGA-TRAN

NGA-TRAN commented Dec 9, 2025

Thanks @gene-bordegaray. Great story. Strong content

Contributor

@alamb alamb left a comment

Thank you @gene-bordegaray and @NGA-TRAN -- this looks great to me

I recommend we consider the title and authors before we publish this

cc @berkaysynnada @ozankabak and @akurmustafa as I think you were involved in the EnforceDistribution code and may be interested in this post

layout: post
title: A Noob's Guide to Databases
date: 2025-12-07
author: Gene Bordegaray, Nga Tran, Andrew Lamb
Contributor

I appreciate the link here, but since the blog is written in the first person "Who am I.... etc" I would suggest you leave yourself as the only author -- you have already recognized @NGA-TRAN and I in the Acknowledgments section

Contributor

Suggested change
- author: Gene Bordegaray, Nga Tran, Andrew Lamb
+ author: Gene Bordegaray

@@ -0,0 +1,428 @@
---
layout: post
title: A Noob's Guide to Databases
Contributor

I recommend we add some more specifics to this title so it has more hints about the contents. The current title is general enough that I feel people may miss it.

How about something like "Optimizing Repartitions in DataFusion: How I went from Database Noob to Core Contribution" ?

Contributor Author

I like this name... I'm stealing it 😄


## **Starting Out**

I am no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I find useful when first starting.
Contributor

Maybe the past tense would be more natural here (as you are a lot more expert than when you were starting!)

Suggested change
- I am no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I find useful when first starting.
+ I was no expert in databases or any of their subsystems, but I am someone who recently began learning about them. These are some tips I found useful when first starting.


### Narrow Your Scope

The next crucial step is to pick your niche and stick to it. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope.
Contributor

Suggested change
- The next crucial step is to pick your niche and stick to it. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope.
+ The next crucial step is to pick your niche to focus on. Database systems are so vast that trying to tackle the whole beast at once is a lost cause. If you want to effectively contribute to this space, you need to deeply understand the system you are working on, and you will have much better luck narrowing your scope.


---

## **Intro to Datafusion**
Contributor

A nit is that the formal name of the project is DataFusion (capital F), so it would be nice to use that form in the text


Hash repartitioning distributes data based on a hash function applied to one or more columns, called the partitioning key. Rows with the same hash value are placed in the same partition.
<br><br>
Hash repartitioning is useful when working with grouped data. Imagine you have a database containing information on company sales, and you are looking to find the total revenue each store produced. Hash repartitioning would make this query much more efficient. Rather than iterating over the data on a single thread and keeping a running sum for each store, it would be better to hash repartition on the store column and have multiple threads calculate individual store sales.
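
To make the store-revenue example concrete, here is a minimal sketch in plain Python. It is not DataFusion code: the `hash_repartition` and `sum_revenue` helpers, the sample `sales` rows, and the use of `zlib.crc32` as the hash function are all assumptions made for illustration. It only shows the idea that rows sharing a partitioning key land in the same partition, so each partition can be aggregated on its own.

```python
import zlib
from collections import defaultdict


def hash_repartition(rows, key, num_partitions):
    """Route each row to the partition chosen by hashing its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is a stand-in for an engine's real hash function (assumption).
        idx = zlib.crc32(str(row[key]).encode()) % num_partitions
        partitions[idx].append(row)
    return partitions


def sum_revenue(partition):
    """Aggregate one partition independently, as a single worker thread would."""
    totals = defaultdict(int)
    for row in partition:
        totals[row["store"]] += row["revenue"]
    return dict(totals)


# Hypothetical sales data.
sales = [
    {"store": "A", "revenue": 10},
    {"store": "B", "revenue": 20},
    {"store": "A", "revenue": 30},
    {"store": "C", "revenue": 5},
    {"store": "B", "revenue": 5},
]

partitions = hash_repartition(sales, "store", 3)
partial_totals = [sum_revenue(p) for p in partitions]

# Every store lives in exactly one partition, so merging is a plain union.
merged = {}
for totals in partial_totals:
    merged.update(totals)
```

Because no store is split across partitions, the per-partition totals never need to be combined with each other, which is what lets the aggregation run in parallel.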
Contributor

this is a really nice example

Contributor Author

Thank you 👍

Contributor

I think there is a typo in this diagram:

Image


Repartitions would appear back-to-back in query plans, specifically a round-robin followed by a hash repartition.

Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and incorrect.
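
A toy model of the point above, in plain Python rather than DataFusion's actual operators: the `round_robin` and `hash_rows` helpers, the sample `batches`, and `zlib.crc32` as the hash are assumptions for illustration only. Under those assumptions, it shows that a round-robin step in front of a hash repartition does not change which partition each row finally lands in, and that round-robin hands over batch references instead of copying rows.

```python
import zlib


def round_robin(batches, n):
    """Deal whole batches out to n partitions in turn; batches are not copied."""
    parts = [[] for _ in range(n)]
    for i, batch in enumerate(batches):
        parts[i % n].append(batch)
    return parts


def hash_rows(batches, key, n):
    """Place every row in the partition chosen by hashing its key column."""
    parts = [[] for _ in range(n)]
    for batch in batches:
        for row in batch:
            parts[zlib.crc32(str(row[key]).encode()) % n].append(row)
    return parts


# Hypothetical input batches.
batches = [
    [{"store": "A", "revenue": 10}, {"store": "B", "revenue": 20}],
    [{"store": "A", "revenue": 30}, {"store": "C", "revenue": 5}],
    [{"store": "B", "revenue": 5}],
]

# Plan 1: hash repartition only.
hash_only = hash_rows(batches, "store", 3)

# Plan 2: round-robin first, then hash each round-robin partition and
# merge the outputs that target the same hash partition.
rr_parts = round_robin(batches, 2)
rr_then_hash = [[] for _ in range(3)]
for part in rr_parts:
    for i, rows in enumerate(hash_rows(part, "store", 3)):
        rr_then_hash[i].extend(rows)


def canonical(parts):
    """Ignore row order inside a partition when comparing plans."""
    return [sorted((r["store"], r["revenue"]) for r in p) for p in parts]


same_placement = canonical(hash_only) == canonical(rr_then_hash)
```

Row order inside a partition can differ between the two plans, which is why the comparison sorts each partition first; the set of rows per partition is what the hash key alone determines.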
Contributor

Suggested change
- Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and incorrect.
+ Why is this such a big deal? Well, repartitions do not process the data; their purpose is to redistribute it in ways that enable more efficient computation for other operators. Having consecutive repartitions is counterintuitive because we are redistributing data, then immediately redistributing it again, making the first repartition pointless. While this didn't create extreme overhead for queries, since round-robin repartitioning does not copy data, just the pointers to batches, the behavior was unclear and unnecessary.

(I think the behavior was "correct" in the sense that the correct answers come out)


Well, what is the correct logic?

Based on our lesson on hash repartitioning and the indicators Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning:
Contributor

nit: "heuristics" is another term for this type of rule -- maybe this would read better

Suggested change
- Based on our lesson on hash repartitioning and the indicators Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning:
+ Based on our lesson on hash repartitioning and the heuristics Datafusion uses to determine when repartitioning can benefit an operator, the fix is easy. In the sub-tree where an operator's parent requires hash partitioning:


1. Deeply understand the system you are working on. It is not only fun to figure these things out, but it also pays off in the long run when having surface-level knowledge won't cut it.

2. This is complementary to the first, narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here.
Contributor

Suggested change
- 2. This is complementary to the first, narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here.
+ 2. Narrow down the scope of your work when starting your journey into databases. Find a project that you are interested in and provides an environment that enhances your early learning process. I have found that Apache Datafusion and its community has been an amazing first step and plan to continue learning about query engines here.

I am not quite sure what you mean by "This is complementary to the first" -- if you meant complementary to the first point, I think it might be clearer if there were fewer words.

Contributor Author

yes this is what I meant, it felt weird while writing too. I think being more concise here is better

@alamb
Contributor

alamb commented Dec 10, 2025

I think this looks great to me -- how about we shoot for a publish date of next Monday Dec 15 to both

  1. Give people more time to review
  2. Make sure we don't crowd the other datafusion related blog (Blog: Practical Dive Into Late Materialization in arrow-rs Parquet Reads arrow-site#740) targeting tomorrow

@gene-bordegaray
Contributor Author

Awesome! Sounds like a plan to me 😄

Thank you for all the feedback and guidance @alamb and @NGA-TRAN

Contributor

In the final aggregation part: there is a typo in the figure, text says aggregation reults

Contributor

@akurmustafa akurmustafa left a comment

Hi @gene-bordegaray, thank you for this blog. I like how you described the whole process of contributing and how it went in your case. I added a very minor typo suggestion in one of the figures. Other than that, this looks great!

@gene-bordegaray
Contributor Author

> Hi @gene-bordegaray, thank you for this blog. I like how you described the whole process of contributing and how it went in your case. I added a very minor typo suggestion in one of the figures. Other than that, this looks great!

Thank you for giving it a read 😄 I am glad you liked it.

@alamb
Contributor

alamb commented Dec 15, 2025

I renamed the file to match today's date -- let's get this thing published!

@alamb
Contributor

alamb commented Dec 15, 2025

Strangely, I don't yet see this blog published on https://datafusion.apache.org/blog/:

Screenshot 2025-12-15 at 12 22 58 PM

However, the CI job that builds and pushes the site appears to have worked (and you can see the blog clearly listed on the asf-site branch): https://github.com/apache/datafusion-site/blob/asf-site/output/index.html

<!-- Post -->
<div class="row">
<div class="callout">
<article class="post">
<header>
<div class="title">
<h1><a href="/blog/2025/12/15/avoid-consecutive-repartitions">Optimizing Repartitions in DataFusion: How I Went From Database Nood to Core Contribution</a></h1>
<p>Posted on: Mon 15 December 2025 by Gene Bordegaray</p>
<p><!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->
<div style="display: flex; align-items: center; gap: 20px; margin-bottom: 20px;">
<div style="flex: 1;">
Databases are some of the most complex yet interesting pieces of software. They are amazing pieces of abstraction: query engines optimize and execute complex plans, storage engines provide sophisticated infrastructure as the backbone of the system, while intricate file formats lay the groundwork for particular workloads. All of this is …</div></div></p>
<footer>
<ul class="actions">
<div style="text-align: right"><a href="/blog/2025/12/15/avoid-consecutive-repartitions" class="button medium">Continue Reading</a></div>
</ul>
<ul class="stats">
</ul>
</footer>
</article>
</div>
</div>
<!-- Post -->
<div class="row">
<div class="callout">
<article class="post">
<header>
<div class="title">
<h1><a href="/blog/2025/12/04/datafusion-comet-0.12.0">Apache DataFusion Comet 0.12.0 Release</a></h1>
<p>Posted on: Thu 04 December 2025 by pmc</p>
<p><!--
{% comment %}

I'll poke around and try to see what is going on

@alamb
Contributor

alamb commented Dec 15, 2025

It appears to be an ASF infra issue (other projects are reporting the same thing): https://issues.apache.org/jira/browse/INFRA-27494

I'll keep an eye on it and post here when it is fixed

@alamb
Contributor

alamb commented Dec 16, 2025

Still watching https://issues.apache.org/jira/browse/INFRA-27494 -- I left a comment this morning.

Screenshot 2025-12-16 at 11 50 53 AM

@gene-bordegaray
Contributor Author

> Still watching https://issues.apache.org/jira/browse/INFRA-27494 -- I left a comment this morning.
>
> Screenshot 2025-12-16 at 11 50 53 AM

thanks for checking in on it. Let me know if there is anything I can look into to help

@alamb
Contributor

alamb commented Dec 16, 2025

Will do -- I am sorry this is taking so long. It is unfortunate, but hopefully it will get sorted out shortly

@alamb
Contributor

alamb commented Dec 17, 2025

Update here. The blog is posted to https://datafusion.blog.apache.org/2025/12/15/avoid-consecutive-repartitions/

However, for some reason it is not being replicated to https://datafusion.apache.org/blog anymore. I have filed another ASF infra ticket about this too:

https://issues.apache.org/jira/browse/INFRA-27512

@alamb
Contributor

alamb commented Dec 18, 2025

I am still going back and forth with ASF infra on getting this thing on to https://datafusion.apache.org/blog

I will update here when I get that figured out https://issues.apache.org/jira/browse/INFRA-27512

@alamb
Contributor

alamb commented Dec 20, 2025

Update -- this post is now showing up correctly on the main DataFusion blog site: https://datafusion.apache.org/blog/

Specifically the url is: https://datafusion.apache.org/blog/output/2025/12/15/avoid-consecutive-repartitions/
