unify handling of outer parameters for joins#599
unify handling of outer parameters for joins#599Stolb27 wants to merge 3 commits intoadb-6.x-devfrom
Conversation
|
Allure report https://allure-ee.adsw.io/launch/52567 |
1. In opposite to Postgres, not all plan operators may be rescanned in gpdb. E.g., Motions must be scan only once. If we need to scan Motion output several times we should decorate it with Material operator. 2. The Nested loop operator may provide some outer parameters for the inner tree that depends on the current outer sub-tree tuple. Every time when the new outer tuple is fetched, inner sub-tree of the Nested Loop should be rescanned. Moreover, gpdb can't propagate outer parameters cross the slices, so the path to parameter's consumer may be considered as rescannable (there are no motions). Currently, only the Nested Loop operator handles presence of outer parameters (e.g. from ancestor Nested Loop) and make sure that both subtrees are rescannable if rescan expected (see the 25763c2). Hash and Merge Joins are passive and provides rescannable status based on their children only. But it's not enough to say about absence of possibility to rescan itself in case of there is outer parameter from ancestor, because rescan come here anyway even through Materialize appended above. To fix possible issues with non-rescannable operators rescan (see a provided regression test) we've decided to unify joins behavior. Hash Join has rescan implementation that differs from Postgres one. The hash table was modified to spill out all batches (including 0th) to disk (see the SpillCurrentBatch comment as a start point). So the hash join inner subtree always rescannable and hash join rescannability depends only on outer subtree. According to ExecReScanMergeJoin, Merge Join always rescan both subtrees in case of receiving rescan request. So we should make sure that both subtrees are rescannable. One plan was modified in gporca.sql regression tests file, and it becomes more consistent with ORCA plan.
|
Allure report https://allure-ee.adsw.io/launch/52812 |
currently there is another problem connected with eager free resources for hash join operator. gpdb relies on REWIND executor flag to detect rescan sub-trees but NL doesn't provide it in case of presence of outer params. But appending Material under NL masks this problem.
|
Allure report https://allure-ee.adsw.io/launch/52902 |
|
How can I test that new code branches in Merge Join not break something when Material node is applying? |
@InnerLife0, I couldn't reproduce such plans when working on this patch. But I think we shouldn't repeat the upstream error and wait when our customers find this plans by themselves. Conditions are the same. There should be the parametrization from above (nested loop above merge join) and motions in subtrees. There is Gather Merge Motion operator in gpdb, that can preserve ordering for merge join. Take a look at several places in the code base:
|
|
We should extend description with MergeJoin case, probably with the link to upstream discussion, where they faced the same problem with reproduction case. Additionally, we have a lot of discuss about LATERAL. This will be done by my colleagues after your arrival. |
Currently, only the Nested Loop operator handles presence of outer parameters (e.g. from ancestor Nested Loop) and make sure that both subtrees are rescannable if rescan expected (see the 25763c2). Hash and Merge Joins are passive and provides rescannable status based on their children only. But it's not enough to say about absence of possibility to rescan itself in case of there is outer parameter from ancestor, because rescan come here anyway even through Materialize appended above. To fix possible issues with non-rescannable operators rescan (see a provided regression test) we've decided to unify joins behavior.
Hash Join has rescan implementation that differs from Postgres one. The hash table was modified to spill out all batches (including 0th) to disk (see the SpillCurrentBatch comment as a start point). So the hash join inner subtree always rescannable and hash join rescannability depends only on outer subtree.
According to ExecReScanMergeJoin, Merge Join always rescan both subtrees in case of receiving rescan request. So we should make sure that both subtrees are rescannable.
One plan was modified in gporca.sql regression tests file, and it becomes more consistent with ORCA plan.