chore(docker): address PR #18520 review comments for Spark 4.0.1 stack by voonhous · Pull Request #18524 · apache/hudi

voonhous · 2026-04-17T19:12:20Z

Describe the issue this Pull Request addresses

Addresses the non-blocking review comments from #18520.

Summary and Changelog

Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts.
Guard the mapred-site.xml.template copy in the base_java17 Dockerfile so it also works against Hadoop 3.4.0, which ships mapred-site.xml directly.
Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars.
Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64.
Document which base image goes with which Spark/JDK combo in docker/README.md.
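
For readers unfamiliar with the mapred-site.xml.template guard mentioned in the changelog, here is a minimal sketch of the idea. It is not the actual Dockerfile change: the function name `ensure_mapred_site` and the directory argument are hypothetical, and the extra "do not overwrite an existing file" check is an assumption added for illustration.

```shell
# Sketch only: copy the template when the distribution provides one,
# and leave a shipped mapred-site.xml untouched otherwise.
# ensure_mapred_site is a hypothetical helper, not a function from the repo.
ensure_mapred_site() {
  conf_dir="$1"
  template="${conf_dir}/mapred-site.xml.template"
  target="${conf_dir}/mapred-site.xml"

  # Older Hadoop layouts ship only the template; newer ones (per the PR,
  # Hadoop 3.4.0) ship mapred-site.xml directly, so the copy is skipped.
  if [ -f "${template}" ] && [ ! -f "${target}" ]; then
    cp "${template}" "${target}"
  fi

  # Succeed only if a usable mapred-site.xml exists afterwards.
  [ -f "${target}" ]
}
```

In a Dockerfile the same condition would typically sit inside a single `RUN` step, so that a missing template no longer fails the build.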

Impact

None

Risk Level

None

Documentation Update

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice documentation improvement. The new "Base image by Java version" table is a helpful reference for mapping Spark versions to their required base images. One minor clarity point: the auto-selection description only covers Java 11 vs Java 17, but the table also lists the Java 8 base image for "Legacy Spark 2.x", which may lead readers to assume it is also auto-selected.

yihua · 2026-04-17T20:06:37Z

-      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
-      - ALLOW_PLAINTEXT_LISTENER=yes


Do we still need these?

No. With apache/kafka:3.7.2 in KRaft mode the broker acts as its own controller, so KAFKA_ZOOKEEPER_CONNECT is obsolete and ALLOW_PLAINTEXT_LISTENER (a Bitnami-specific convenience variable) no longer applies.

The KRaft listeners and protocol map are expressed directly via the new KAFKA_LISTENERS / KAFKA_LISTENER_SECURITY_PROTOCOL_MAP env vars in the current config.

Both variables should be safe to remove.
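
To make the KRaft point concrete, the following is a rough sketch of the environment a single-node combined broker/controller usually needs with the apache/kafka image. It is not copied from the compose file in this PR: the helper name `kraft_env`, the default hostname `kafkabroker`, and the ports 9092/9093 are assumptions for illustration.

```shell
# Sketch only: print KEY=VALUE lines for a single-node KRaft setup.
# kraft_env and the default hostname are hypothetical.
kraft_env() {
  host="${1:-kafkabroker}"
  cat <<EOF
KAFKA_NODE_ID=1
KAFKA_PROCESS_ROLES=broker,controller
KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://${host}:9092
KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
KAFKA_CONTROLLER_QUORUM_VOTERS=1@${host}:9093
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
EOF
}
```

One way to try it locally would be `kraft_env > kafka.env` followed by `docker run --env-file kafka.env apache/kafka:3.7.2`. No ZooKeeper address appears anywhere, because the controller quorum is declared through `KAFKA_CONTROLLER_QUORUM_VOTERS` instead.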

yihua · 2026-04-17T20:11:38Z


    # MAPRED
-    addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
+    addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020


Not sure if we need changes in entrypoint.sh and export_container_ip.sh. Let's revert them.

Will revert.

yihua · 2026-04-17T20:15:01Z

+    --build-arg HADOOP_AWS_VERSION=${HADOOP_AWS_VERSION} \
+    --build-arg AWS_SDK_VERSION=${AWS_SDK_VERSION} \


Let's add these build args to the multi-arch build command?

Yep, added!
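
For context, one way to keep the single-arch and multi-arch commands in sync is to compute the build args once and reuse them. The sketch below is an illustration, not the code in build_docker_images.sh: all function names are hypothetical, the accepted SPARK_MAJOR values (3 and 4) are assumed, HADOOP_AWS_VERSION is assumed to track the Hadoop version, and the SDK versions are read from caller-supplied variables rather than hard-coded.

```shell
# Sketch only; every helper name here is hypothetical.

# Reject unexpected SPARK_MAJOR values early (the accepted set is assumed).
validate_spark_major() {
  case "$1" in
    3|4) return 0 ;;
    *) echo "Unsupported SPARK_MAJOR: '$1'" >&2; return 1 ;;
  esac
}

# "3.4.0" -> "3.4"
hadoop_major_minor() {
  echo "$1" | cut -d. -f1,2
}

# Emit the two --build-arg flags, keyed off the Hadoop major.minor.
# The ${VAR:?} expansions abort with a message if the caller has not
# supplied the SDK version for the selected branch.
aws_build_args() {
  hadoop_version="$1"
  case "$(hadoop_major_minor "${hadoop_version}")" in
    3.4) sdk_version="${AWS_SDK_VERSION_HADOOP_3_4:?set the SDK version for Hadoop 3.4}" ;;
    *)   sdk_version="${AWS_SDK_VERSION_DEFAULT:?set the default SDK version}" ;;
  esac
  printf '%s %s\n' \
    "--build-arg HADOOP_AWS_VERSION=${hadoop_version}" \
    "--build-arg AWS_SDK_VERSION=${sdk_version}"
}
```

The output of `aws_build_args` can then be appended to both the plain `docker build` invocation and the `docker buildx build` invocation, so the two paths cannot drift apart.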

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the docs update. The new "Base image by Java version" section is a helpful orientation for contributors. A couple of small clarifications on the table could make it easier to read, especially around the asymmetric "Published repo suffix" values and what "anything else" means for unsupported Spark versions.

yihua · 2026-04-17T20:19:00Z

+
+| Base module   | JDK     | Default Hadoop | Published repo suffix | Used for   |
+|---------------|---------|----------------|-----------------------|------------|
+| `base_java11` | Java 11 | 2.8.4          | `...-base-java11`     | Spark 3.x  |


🤖 The Published repo suffix column lists ...-base-java11 for base_java11 but just ...-base for base_java17 (no -java17 qualifier). Is that asymmetry intentional? If it reflects how the images are actually published, it could help readers to add a short footnote explaining why base_java17 drops the JDK qualifier — otherwise at first glance this looks like a typo.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

The asymmetry isn't intentional and the column is misleading as currently written.

There are two image-naming paths in this repo and the table was silently mixing them:

build_docker_images.sh:129 produces apachehudi/hudi-hadoop_<HADOOP>-base for both base_java11 and base_java17 (no JDK qualifier). This is the only path that actually pushes to Docker Hub (via --multi-arch -> buildx --push on line 141).

The per-module poms (base_java11/pom.xml:71,87, base_java17/pom.xml:69,85) declare ...-base-java11 / ...-base-java17 -- but push is commented out in both, so these only tag locally.

I'll drop the Published repo suffix column, since it is confusing and adds nothing useful.

This is addressed in:
#18663
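
As a small illustration of why the script-based path cannot distinguish the two modules, here is a sketch that mirrors the repository name pattern quoted above. The helper `script_image_repo` is hypothetical, and the exact token substituted for `<HADOOP>` is whatever the build script uses; it is passed in here as an opaque argument.

```shell
# Sketch only: mirrors the apachehudi/hudi-hadoop_<HADOOP>-base pattern.
# script_image_repo is a hypothetical helper, not part of the repo.
script_image_repo() {
  hadoop_token="$1"   # stands in for <HADOOP>; exact form not assumed
  module="$2"         # base_java11 or base_java17; deliberately unused
  echo "apachehudi/hudi-hadoop_${hadoop_token}-base"
}
```

Because the module name never enters the result, `script_image_repo X base_java11` and `script_image_repo X base_java17` produce the same string for the same `X`, which is the ambiguity the table column was hiding.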


hudi-bot · 2026-04-18T14:02:10Z

CI report:

78733fd UNKNOWN
57783ce UNKNOWN
00ced67 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

codecov-commenter · 2026-04-29T14:01:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.87%. Comparing base (3d0ab80) to head (00ced67).
⚠️ Report is 50 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18524   +/-   ##
=========================================
  Coverage     68.87%   68.87%           
- Complexity    28272    28274    +2     
=========================================
  Files          2464     2464           
  Lines        135594   135594           
  Branches      16447    16447           
=========================================
+ Hits          93389    93396    +7     
+ Misses        34815    34809    -6     
+ Partials       7390     7389    -1

| Flag | Coverage Δ | |
|------|------------|---|
| common-and-other-modules | `44.64% <ø> (+<0.01%)` | ⬆️ |
| hadoop-mr-java-client | `44.77% <ø> (+<0.01%)` | ⬆️ |
| spark-client-hadoop-common | `48.41% <ø> (-0.01%)` | ⬇️ |
| spark-java-tests | `48.92% <ø> (+<0.01%)` | ⬆️ |
| spark-scala-tests | `45.44% <ø> (ø)` | |
| utilities | `38.19% <ø> (-0.02%)` | ⬇️ |

Flags with carried forward coverage won't be shown.
see 12 files with indirect coverage changes


yihua reviewed Apr 17, 2026


voonhous changed the title ~~fix(docker): address PR #18520 review comments for Spark 4.0.1 stack~~ chore(docker): address PR #18520 review comments for Spark 4.0.1 stack Apr 17, 2026

voonhous force-pushed the adress-hadoop3-upgrade-changes branch from ab4918a to 78733fd Compare April 17, 2026 19:15

github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 17, 2026

voonhous force-pushed the adress-hadoop3-upgrade-changes branch from 78733fd to 57783ce Compare April 17, 2026 19:18

yihua mentioned this pull request Apr 17, 2026

chore: Add Java 17 Hadoop base image and Spark 4.0.1 docker compose s… #18520

Merged

3 tasks

yihua reviewed Apr 17, 2026


voonhous and others added 3 commits April 18, 2026 19:35

Change published repo suffix for base_java17

1240f4c

Updated the published repo suffix for base_java17 module.

Address comments

00ced67

voonhous force-pushed the adress-hadoop3-upgrade-changes branch from 1f72e66 to 00ced67 Compare April 18, 2026 12:31

github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Apr 18, 2026
