fix(#74): exclude CodegenFallback children from TimedWithCodegenExec stage#75
Merged
Conversation
`TimedWithCodegenExec.doConsume` was dropping the `row: ExprCode` parameter when calling `consume(ctx, input)`. This caused `ctx.INPUT_ROW` to be null for any downstream `CodegenFallback` expression (e.g. `from_json`) that interpolates `INPUT_ROW` into its generated code, producing an NPE inside `Block.code` interpolation. Pass `row.value` through to `consume` so downstream `CodegenFallback` expressions see a valid `INPUT_ROW` under whole-stage codegen.
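A minimal sketch of what this commit changed, assuming the standard `CodegenSupport` signatures; the surrounding class body is elided and not copied from the repo:

```scala
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}

// Inside TimedWithCodegenExec (shape assumed). Spark's CodegenSupport.consume
// takes an optional row variable that defaults to null when omitted.
override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
  // Before: consume(ctx, input) dropped `row`, leaving ctx.INPUT_ROW null
  // for downstream CodegenFallback expressions.
  consume(ctx, input, row.value)
}
```

As the follow-up commit below explains, this turned out not to be the right fix and was reverted.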
The previous fix (propagate `row.value` in `doConsume`) didn't actually prevent the NPE — Spark's `consume()` resets `ctx.INPUT_ROW = null` internally before calling the parent's `doConsume`, so a downstream `CodegenFallback` still saw null and NPE'd in `Block.code` interpolation. It also produced corrupt output (count returned garbage) and could emit syntactically invalid Java when `row.value` was a complex expression.

Root cause: Spark's `CollapseCodegenStages` already excludes plans whose expressions contain a `CodegenFallback` (e.g. `from_json` / `JsonToStructs`). But `TimedWithCodegenExec` exposes its child's children transparently (`children = child.children`), hiding the wrapped child from that check — so a node that wouldn't normally be included in a whole-stage ends up inside one behind a TimedExec wrapper, and the `CodegenFallback` expression generates code referencing the now-null `ctx.INPUT_ROW`.

Fix: mirror that `CodegenFallback` check inside `TimedWithCodegenExec.supportCodegen`. When the wrapped child has any `CodegenFallback` expression, we report `supportCodegen = false` so the framework inserts an `InputAdapter` and breaks the WholeStage at this boundary — exactly what would happen without the wrapper. Reverts the `doConsume` change from the previous commit.
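A minimal sketch of the check, assuming the wrapper exposes the wrapped plan as `child`; this shows the `Expression.exists` form this commit introduced, which a later commit swaps for `find` (see below):

```scala
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback

// Inside TimedWithCodegenExec: mirror CollapseCodegenStages' rule. If any
// expression tree on the wrapped child contains a CodegenFallback, report
// supportCodegen = false so the framework inserts an InputAdapter and the
// WholeStage breaks at this boundary, as it would without the wrapper.
override def supportCodegen: Boolean =
  !child.expressions.exists(_.exists(_.isInstanceOf[CodegenFallback]))
```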
`pluginspark4` had no test setup, so the Spark-4 build of the plugin source was only compile-checked, never exercised. Add a test configuration mirroring `pluginspark3` (scalatest, Spark 4.0.1 test deps, fork + `--add-opens`) and pull in `DataFlintCodegenFallbackSpec` so the issue #74 regression test runs against both Spark 3.5 and Spark 4.0. Most `pluginspark3` specs depend on Spark-3-only internals (`Dataset` constructor, `PythonMapInArrowExec`, etc.) and won't compile against Spark 4, so the test source list is explicit rather than a directory include — only version-portable suites are added.
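A hedged sbt sketch of what such a configuration looks like; the module name, versions, and flags come from the commit message, while the exact keys and the source path are assumptions, not the repo's actual build definition:

```scala
lazy val pluginspark4 = (project in file("pluginspark4"))
  .settings(
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % "4.0.1"  % Test,
      "org.scalatest"    %% "scalatest" % "3.2.19" % Test
    ),
    // Spark 4 needs a forked JVM with --add-opens flags, as in pluginspark3.
    Test / fork := true,
    Test / javaOptions += "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
    // Explicit source list rather than a directory include: only
    // version-portable suites compile against Spark 4.
    Test / unmanagedSources := Seq(
      (ThisBuild / baseDirectory).value / "pluginspark3" / "src" / "test" /
        "scala" / "DataFlintCodegenFallbackSpec.scala"
    )
  )
```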
Earlier review claimed the `executeCollect` rebuild path stales out `WriteFilesExec` / `DataWritingCommandExec` metrics on the original tree. This test verifies — and disproves — that:
- `DataWritingCommandExec.metrics` delegates to `cmd.metrics`, and `withNewChildren` reuses the same `cmd` instance, so `numOutputRows` / `numFiles` / `numOutputBytes` are the same `SQLMetric` objects on the original and rebuilt nodes.
- `WriteFilesExec` (3.4+) has no metrics field of its own.
- The data plan is shared by reference via `RDDTimingWrapper`.

Net: the rebuild is metric-safe today. The test asserts those values flow through to the UI-visible plan tree so that any future change that breaks one of those assumptions fails loudly.
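A minimal sketch of the reference-identity assertions, in ScalaTest style; `plan` and `rebuiltChild` are hypothetical stand-ins for the spec's actual setup:

```scala
import org.apache.spark.sql.execution.command.DataWritingCommandExec

val original = plan.collectFirst { case d: DataWritingCommandExec => d }.get
val rebuilt  = original.withNewChildren(Seq(rebuiltChild))
                       .asInstanceOf[DataWritingCommandExec]

// withNewChildren copies the node but reuses the same `cmd`, so the metrics
// must be the same SQLMetric objects (eq), not merely equal in value.
assert(original.metrics("numOutputRows")  eq rebuilt.metrics("numOutputRows"))
assert(original.metrics("numFiles")       eq rebuilt.metrics("numFiles"))
assert(original.metrics("numOutputBytes") eq rebuilt.metrics("numOutputBytes"))
```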
- postRddId: switch from += to set so re-executing the same TimedExec
instance overwrites the RDD-id metric instead of summing across runs.
Test confirms metric == latest, not the sum (TimedExecMetricsSpec).
A sketch of this and the isLegacySpark fix follows this list.
- canEqual: match runtime class so TimedExec(x) and TimedWithCodegenExec(x)
no longer compare equal — TreeNode equality / canonicalization use this
for plan reuse and the two have different execution semantics.
- isLegacySpark: wrap version parse in Try; vendor strings with non-numeric
major/minor default to non-legacy (matches every release ≥ 3.2).
- executeCollect write-path: replace child.children.head /
innerChild.children.head with headOption-based fall-through to
super.executeCollect() if the assumed shape doesn't hold (vendor write
commands, future Spark layouts).
- rddId metric: use a plain sum metric instead of size — size renders the
RDD id as bytes ("12 B") in the SparkUI.
RDDTimingWrapper now carries an explicit comment about the durationMetric
not being in productArity (we override withNewChildrenInternal to plumb
it through manually).
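Minimal sketches of the postRddId and isLegacySpark fixes, as referenced above; field and method shapes are taken from the commit message, not the repo:

```scala
import scala.util.Try
import org.apache.spark.sql.execution.metric.SQLMetric

// postRddId: overwrite instead of accumulate, so re-executing the same
// TimedExec instance reports the latest RDD id rather than a running sum.
def postRddId(metric: SQLMetric, rddId: Int): Unit =
  metric.set(rddId.toLong) // before: metric += rddId

// isLegacySpark: non-numeric major/minor parts (vendor version strings)
// now parse to "not legacy", matching every release >= 3.2.
def isLegacySpark(version: String): Boolean =
  Try {
    val parts = version.split('.')
    val (major, minor) = (parts(0).toInt, parts(1).toInt)
    major < 3 || (major == 3 && minor < 2)
  }.getOrElse(false)
```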
The `pluginspark4` test config added in c42752e forks a JVM and bootstraps Spark 4.0.1, which requires Java 17+. CI was running Java 8 (`temurin@8`), so the forked JVM hung trying to start Spark 4 — every PR run since that commit timed out at the 6h limit. Spark 3.5 also supports Java 17, so plugin/pluginspark3 tests run on the same JVM with no behavioral change. Scala 2.12/2.13 default to Java 8 bytecode (verified via `javap` on the existing target/ output: major version 52 across all four module/scala variants), so the published jars stay loadable by Java 8 runtimes — only the build/test JVM changes.
`Expression.exists(predicate)` was added to TreeNode in Spark 3.2. Calling it on Spark 3.0/3.1 throws `NoSuchMethodError` at runtime even though the code compiles fine against newer Spark headers — the build and runtime Spark versions are decoupled when DataFlint is loaded as a plugin. Replace `_.exists(_.isInstanceOf[CodegenFallback])` with `_.find(_.isInstanceOf[CodegenFallback]).isDefined`. `find` has been on TreeNode since the earliest 3.x lines, so the supportCodegen check now works across the whole supported range (3.0 → 4.x). Reproduced via the cross-version pyspark integration harness: utils/.spark-versions/3.1.2 → java.lang.NoSuchMethodError on Expression.exists during a parquet write. With this change the same write succeeds.
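In code, the version-portable form looks like this; the helper name is illustrative:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback

// TreeNode.find is present on every supported line; Expression.exists only
// appeared in Spark 3.2, hence the NoSuchMethodError on 3.0/3.1.
def hasCodegenFallback(e: Expression): Boolean =
  e.find(_.isInstanceOf[CodegenFallback]).isDefined

// Usage: plan.expressions.exists(hasCodegenFallback)
// The outer exists is scala.collection.Seq.exists, which is always
// available; only the TreeNode method was the runtime hazard.
```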
menishmueli approved these changes May 10, 2026
Fixes #74.
Summary
`from_json` (and any other `CodegenFallback` expression that reads `ctx.INPUT_ROW`) NPE'd in `Block.code` interpolation when DataFlint's `TimedWithCodegenExec` wrapped the operator that contained it under whole-stage codegen.

Root cause
Spark's `CollapseCodegenStages` already excludes plans whose `expressions` contain a `CodegenFallback`. But `TimedWithCodegenExec` exposes `children = child.children` (transparent wrapping for the SQL UI), which hides the wrapped child's expressions from that check — so an operator that would normally be broken out of whole-stage codegen ends up inside it, behind the wrapper, and the `code"…"` interpolator NPEs on the null `INPUT_ROW`.

The first attempt (the original commit on this branch) tried to plumb `row.value` through `doConsume`. That approach didn't actually prevent the NPE — Spark's `consume()` still resets `ctx.INPUT_ROW = null` before calling the parent's `doConsume` — and it also broke `prepareRowVar`'s projection logic (corrupt counts, occasionally invalid Java in the generated method signature). It's been reverted on this branch.

The fix
Mirror the same `CodegenFallback` check that `CollapseCodegenStages.supportCodegen` already does, inside `TimedWithCodegenExec.supportCodegen`. When the wrapped child has any `CodegenFallback` expression, we report `supportCodegen = false`. `CollapseCodegenStages` then inserts an `InputAdapter` between us and the next stage, breaking the whole-stage at the boundary — exactly the behavior end users get without the wrapper. The expression goes through the interpreted path, where `INPUT_ROW` doesn't matter.

`TreeNode.find` (not `.exists`) is used because `Expression.exists` was added in Spark 3.2; calling it at runtime on Spark 3.0/3.1 throws `NoSuchMethodError`. `find` has been on TreeNode since the earliest 3.x lines.

Test coverage
- `DataFlintCodegenFallbackSpec` — new regression test that runs the exact PySpark snippet from the issue (`from_json` → `explode` → `filter`) under whole-stage codegen and asserts the count matches. Without the fix this NPEs; with it, the wrapped `ProjectExec` is correctly excluded from the WholeStage and the query runs (a sketch of the scenario follows this list). `pluginspark4` previously had no test setup at all; this PR adds one that shares version-portable specs from `pluginspark3` so the same regression suite executes on the Spark 4 surface.
- `DataFlintWriteMetricsSpec` — guard for the `executeCollect` rebuild path: confirms `DataWritingCommandExec` metrics (`numFiles`, `numOutputBytes`, `numOutputRows`, etc.) survive the rebuild via the shared `cmd` instance.
- `TimedExecMetricsSpec` — covers the rddId / canEqual fixes below.
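As referenced above, a hedged Scala rendering of that scenario (the issue's snippet is PySpark; the column names, JSON shape, and the plugin class in the session config are assumptions, not copied from the spec):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") // assumed plugin class
  .getOrCreate()
import spark.implicits._

val schema = ArrayType(new StructType().add("id", IntegerType))
val out = Seq("""[{"id": 1}, {"id": 2}]""").toDF("json")
  .select(explode(from_json(col("json"), schema)).as("item")) // from_json is CodegenFallback
  .filter(col("item.id") > 1)

// Before the fix this NPE'd in Block.code interpolation under whole-stage
// codegen; after it, the wrapped operator is excluded and the count is right.
assert(out.count() == 1)
```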
Drive-by hardening on TimedExec

While reviewing the wrapper for adjacent issues, fixed five small things that the original CI / runtime exposed:

- `postRddId` accumulated: was `+=`, now `set` — re-executing the same `TimedExec` instance no longer sums every RDD id it ever wrapped.
- `canEqual` collapsed variants: `TimedExec(x).equals(TimedWithCodegenExec(x))` was true; they have different execution semantics, so canonicalization could conflate them. Now matches by exact runtime class (sketched after this list, together with the `executeCollect` change).
- `isLegacySpark` parser: wrapped in `Try` — vendor versions with non-numeric major/minor parts no longer throw at class-load time.
- `executeCollect` write path: `child.children.head` / `innerChild.children.head` replaced with `headOption` fall-through to `super.executeCollect()` if the assumed shape doesn't hold.
- `rddId` metric type: was a "size" metric (rendered as bytes — "12 B" in the SparkUI), now a plain sum metric.
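Minimal sketches of the `canEqual` and `executeCollect` items; the class shape is assumed and `rebuildAndCollect` is a hypothetical stand-in for the wrapper's rebuild logic:

```scala
import org.apache.spark.sql.catalyst.InternalRow

// canEqual: exact runtime class, so TimedExec(x) and TimedWithCodegenExec(x)
// no longer canonicalize as equal plans.
override def canEqual(that: Any): Boolean =
  that != null && that.getClass == this.getClass

// executeCollect write path: fall through to the default implementation
// when the plan doesn't have the assumed write-command shape.
override def executeCollect(): Array[InternalRow] =
  child.children.headOption match {
    case Some(writeChild) => rebuildAndCollect(writeChild) // hypothetical helper
    case None             => super.executeCollect()
  }
```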
CI

CI was running on Java 8. Spark 4.0.1 needs Java 17+, so adding `pluginspark4` tests caused the forked test JVM to hang at startup — every run timed out at 6h. Bumped to `temurin@17`. Spark 3.5 also supports Java 17, so `pluginspark3` tests run on the same JVM unchanged. The published jars stay Java 8 compatible — verified via `javap` on the produced classes (major version 52 across all four module/Scala variants); Scala 2.12/2.13 default to Java 8 bytecode regardless of which JVM compiles them.

Test plan
- `DataFlintCodegenFallbackSpec` reproduces the original NPE without the fix and passes with it
- `sbt +test` green on Spark 3.5 (Scala 2.12 + 2.13) and Spark 4.0.1 (Scala 2.13)
- `pluginspark4` runs `DataFlintCodegenFallbackSpec` against Spark 4.0.1 / Java 17
- Cross-version PySpark harness (utils/.spark-versions/3.1.2) — parquet write no longer throws NoSuchMethodError

🤖 Generated with Claude Code