Skip to content

[BugFix] Fix bin/chart NPE with null values (#5174)#25

Closed
qianheng-aws wants to merge 17 commits intorefactor/dedupe-reusable-workflowfrom
bugfix/5174-bin-chart-npe
Closed

[BugFix] Fix bin/chart NPE with null values (#5174)#25
qianheng-aws wants to merge 17 commits intorefactor/dedupe-reusable-workflowfrom
bugfix/5174-bin-chart-npe

Conversation

@qianheng-aws
Copy link
Copy Markdown
Owner

Description

Fix NullPointerException when using bin then chart with null values in the binned field.

Root cause: The bin bucket functions (SpanBucketFunction, MinspanBucketFunction, RangeBucketFunction) declared a non-nullable VARCHAR(2000) return type via ReturnTypes.VARCHAR_2000, even though they return null for null inputs. This caused Calcite's optimizer to remove the IS NOT NULL filters (added by visitAggregation's nonNullGroupMask logic) as trivially true, allowing null group keys to reach the Enumerable aggregation's TreeMap, which crashed with NPE in compareTo.

Fix:

  1. Declare the return type as nullable using ReturnTypes.VARCHAR_2000.andThen(SqlTypeTransforms.FORCE_NULLABLE) so that the null-filtering logic in visitAggregation works correctly
  2. Add nullsLast to chart command sort operations as a defensive measure

Related Issues

Resolves opensearch-project#5174

Check List

  • New functionality includes testing
  • Commits signed per DCO (-s)
  • spotlessCheck passed
  • Unit tests passed
  • Integration tests passed

qianheng-aws and others added 17 commits March 26, 2026 10:22
Implements a 3-workflow system using claude-code-action with Bedrock OIDC:
- claude-dedupe-issues.yml: detects duplicates on new issues via Claude
- auto-close-duplicates.yml: daily cron closes flagged issues after 3 days
- remove-autoclose-on-activity.yml: removes autoclose label on human comment

Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
- Detected duplicates now get `duplicate` label instead of `autoclose`
- Auto-close workflow looks for `duplicate` label
- After closing, adds `autoclose` label
- Human comment removes `duplicate` label to prevent auto-closure
- Fix state_reason to `duplicate`
- Change grace period to 1 hour for testing

Signed-off-by: Heng Qian <qianheng@amazon.com>
- Add backfill-duplicate-comments.yml to scan historical issues for duplicates
- Add thumbs-down instruction to duplicate detection comment

Signed-off-by: Heng Qian <qianheng@amazon.com>
- Detected duplicates now get `duplicate` label instead of `autoclose`
- Auto-close workflow looks for `duplicate` label
- After closing, adds `autoclose` label
- Human comment removes `duplicate` label to prevent auto-closure
- Fix state_reason to `duplicate`
- Change grace period to 1 hour for testing

Signed-off-by: Heng Qian <qianheng@amazon.com>
- Add backfill-duplicate-comments.yml to scan historical issues for duplicates
- Add thumbs-down instruction to duplicate detection comment

Signed-off-by: Heng Qian <qianheng@amazon.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
- Remove unnecessary allowed_non_write_users from dedupe workflow
- Pass workflow inputs via env vars to prevent JS injection in backfill
- Use bash array for REPO_FLAG to prevent word splitting in shell script

Signed-off-by: Heng Qian <qianheng@amazon.com>
Backfill workflow dispatches dedupe via API as github-actions[bot],
which requires explicit allowlisting in claude-code-action.

Signed-off-by: Heng Qian <qianheng@amazon.com>
- Remove unnecessary allowed_non_write_users from dedupe workflow
- Pass workflow inputs via env vars to prevent JS injection in backfill
- Use bash array for REPO_FLAG to prevent word splitting in shell script

Signed-off-by: Heng Qian <qianheng@amazon.com>
Backfill workflow dispatches dedupe via API as github-actions[bot],
which requires explicit allowlisting in claude-code-action.

Signed-off-by: Heng Qian <qianheng@amazon.com>
…kflow

Signed-off-by: Heng Qian <qianheng@amazon.com>
Read from repo variable DUPLICATE_GRACE_DAYS (default 7) instead of
hardcoded 3 days for both auto-close workflow and comment script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…7 days)

Replace hardcoded 1-hour test value on main with configurable
DUPLICATE_GRACE_DAYS repo variable, defaulting to 7 days.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Heng Qian <qianheng@amazon.com>
Only consider issues with lower issue numbers as potential originals,
and exclude issues already labeled duplicate from search results.

Signed-off-by: Heng Qian <qianheng@amazon.com>
The bin bucket functions (SpanBucketFunction, MinspanBucketFunction,
RangeBucketFunction) declared a non-nullable VARCHAR return type via
ReturnTypes.VARCHAR_2000, even though they return null for null inputs.
This caused Calcite's optimizer to remove IS NOT NULL filters as
trivially true, allowing null group keys to reach the Enumerable
aggregation's TreeMap, which crashed with NPE in compareTo.

Fix the return type to be nullable using FORCE_NULLABLE, so the
null-filtering logic in visitAggregation works correctly. Also add
nullsLast to chart command sorts as a defensive measure.

Signed-off-by: Heng Qian <qianheng@amazon.com>
@qianheng-aws
Copy link
Copy Markdown
Owner Author

Decision Log

Root Cause: Bin bucket functions (SpanBucketFunction, MinspanBucketFunction, RangeBucketFunction) declared ReturnTypes.VARCHAR_2000 as their return type inference, producing a non-nullable type. Despite the implementations returning null for null inputs, Calcite's type system believed the result was always non-null. This caused Calcite's optimizer to remove IS NOT NULL filters as trivially true, allowing null group keys from binning null values to reach the Enumerable aggregation's TreeMap, which crashed with Cannot invoke "java.lang.Comparable.compareTo(Object)" because "v0" is null.

Approach: Two-part fix:

  1. Changed return type inference in all three bucket functions from ReturnTypes.VARCHAR_2000 to ReturnTypes.VARCHAR_2000.andThen(SqlTypeTransforms.FORCE_NULLABLE), making the return type nullable so the IS NOT NULL filter in visitAggregation is preserved by the optimizer.
  2. Added nullsLast() to chart command sort operations (lines 3208 and 3278 in CalciteRelNodeVisitor) as a defensive measure against any remaining null-in-sort scenarios.

Alternatives Rejected:

  • Adding null checks in the sort comparator: Would be a workaround, not a fix for the actual type mismatch.
  • Filtering nulls in visitChart before sort: The existing mechanism in visitAggregation already handles this correctly once the type is correct; adding another filter would be redundant.

Pitfalls:

  • The existing chart unit tests (CalcitePPLChartTest) already expected NULLS LAST in the SparkSQL output, because the RelToSqlConverter's SparkSQL dialect naturally emits NULLS LAST for ASC sorts. This masked the fact that the runtime Enumerable execution could still fail on nulls.
  • The DEFAULT_USE_NULL = Literal.TRUE default in Chart means column splits can legitimately have null values that get labeled "NULL" string, so the non-null filter must only apply to fields marked in nonNullGroupMask.

Things to Watch:

  • Other UDF functions that return null for null inputs but declare non-nullable return types may have similar issues. Worth auditing functions in udf/binning/ and other PPLFuncImpTable registrations.
  • The NullPolicy.ANY used by these functions means Calcite generates code that passes nulls through to the implementor rather than short-circuiting. This interacts with the return type declaration - both must be consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] bin then chart with null values causes NullPointerException in compareTo

1 participant