Support Spark 3.0/3.1 instrumentation with non-transparent TimedExec wrapper by minskya · Pull Request #68 · dataflint/spark

minskya · 2026-04-27T17:10:30Z

Summary

TimedExec non-transparent wrapper for Spark 3.0/3.1: The transparent wrapper pattern (children = child.children) is incompatible with Spark 3.0/3.1's withNewChildren (uses mapProductIterator + containsChild). On legacy Spark, switches to children = Seq(child) so plan transformations (CollapseCodegenStages, ApplyColumnarRulesAndInsertTransitions, AQE) work correctly.
TimedWithCodegenExec: Split codegen support into a separate class so non-CodegenSupport nodes (e.g. FileSourceScanExec on Spark 3.1) are wrapped without codegen, avoiding ClassCastException.
Re-enabled SQL node instrumentation on all Spark versions (was disabled for 3.0/3.1).
UI node merge: On legacy Spark the non-transparent wrapper creates duplicate nodes in the plan graph. The UI now merges these pairs — keeps the child node (which has rich plan descriptions like filter conditions, window expressions) and adds the wrapper's duration metric.
Preserved parsedPlan on repeated polls: The non-paginated /sql API (used on Spark < 3.2) returns all SQLs every poll cycle, causing repeated recalculations that lost plan descriptions when sqlplan offset advanced. Now preserves existing plan data when unavailable.
Fix function name extraction for Python nodes: WindowInPandas/ArrowWindowPython plan descriptions weren't parsed (regex required Window [ prefix). MapInPandas, FlatMapGroupsInPandas, and FlatMapCoGroupsInPandas use a bare function format that the bracket-based parser couldn't handle. Both fixed with fallback regex patterns.
PySpark test: Added columnar scan + Python UDF + shuffle test case.

Test plan

All existing unit tests pass (43/43, 1 pre-existing flaky timing test)
New parser tests for MapInPandas, FlatMapGroupsInPandas, FlatMapCoGroupsInPandas
Manual test with Spark 3.1.2: verify instrumented nodes show duration metrics
Manual test with Spark 3.1.2: verify filter conditions and window expressions visible in UI
Manual test with Spark 3.5.x: verify no regression (transparent wrapper still used)
Run pyspark-testing/dataflint_pyspark_example.py on Spark 3.1.2

🤖 Generated with Claude Code

…wrapper TimedExec's transparent wrapper (children = child.children) is incompatible with Spark 3.0/3.1's withNewChildren which uses mapProductIterator + containsChild. This caused CollapseCodegenStages, ApplyColumnarRulesAndInsertTransitions, and AQE to silently fail to update children through the wrapper. Changes: - TimedExec: detect Spark < 3.2 via SPARK_VERSION and use non-transparent children = Seq(child) on legacy Spark, transparent on 3.2+ - TimedWithCodegenExec: split codegen support into separate class so non-codegen nodes (e.g. FileSourceScanExec on 3.1) don't get cast errors - DataFlintInstrumentationExtension: re-enable SQL node instrumentation on all Spark versions (was disabled for 3.0/3.1) - SqlReducer: merge duplicate wrapper+child node pairs in the UI by keeping the child node (rich plan descriptions) and adding wrapper's duration metric - SqlReducer: preserve parsedPlan when plan data unavailable on repeated polls (non-paginated SQL API returns all SQLs every cycle) - PySpark test: add columnar scan + Python UDF + shuffle test case Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

notion-workspace · 2026-04-27T17:10:35Z

ColumnarBatch cannot be cast to class org.apache.spark.sql.catalyst.InternalRow

…tMap nodes - WindowParser: regex now matches WindowInPandas/ArrowWindowPython plan descriptions (was hardcoded to "Window [" prefix) - batchEvalPythonParser: fallback regex for bare function names used by MapInPandas, FlatMapGroupsInPandas, and FlatMapCoGroupsInPandas (these don't use the [funcs], [udfs] bracket format) - SqlReducer: reverted unnecessary WindowInPandas split (parser fix makes it redundant) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The bracket regex [funcs], [udfs] was matching FlatMapCoGroupsInPandas's [group_keys], [group_keys] as function/UDF lists. Now requires the first bracket to contain parentheses (function calls) or be empty, so CoGroups falls through to the bare function name regex. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

minskya changed the title ~~Support Spark 3.0/3.1 instrumentation with non-transparent TimedExec …~~ Support Spark 3.0/3.1 instrumentation with non-transparent TimedExec wrapper Apr 27, 2026

minskya and others added 3 commits April 28, 2026 09:59

updated example - reduce window dataset

767dfce

minskya merged commit a9dca6f into main Apr 28, 2026
6 of 7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Spark 3.0/3.1 instrumentation with non-transparent TimedExec wrapper#68

Support Spark 3.0/3.1 instrumentation with non-transparent TimedExec wrapper#68
minskya merged 4 commits into
mainfrom
DATAFLINT-4840

minskya commented Apr 27, 2026 •

edited

Loading

Uh oh!

notion-workspace Bot commented Apr 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

minskya commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

notion-workspace Bot commented Apr 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

minskya commented Apr 27, 2026 •

edited

Loading