chore: Add Java 17 Hadoop base image and Spark 4.0.1 docker compose s… by voonhous · Pull Request #18520 · apache/hudi

voonhous · 2026-04-17T17:49:04Z

…etup

Describe the issue this Pull Request addresses

Adding Java17 Hadoop base image and Spark4.0.1 docker compose images as our Variant feature push requires it.

Register base_java17 in Maven reactor and fix tag collision

Add base_java17 to the hudi-hadoop-docker parent POM so the Java 17 base image is picked up by the Maven reactor and imported by IntelliJ, mirroring how base_java11 is wired in.
Rename base_java17's docker repository from apachehudi/hudi-hadoop_*-base to apachehudi/hudi-hadoop_*-base-java17 so that a Maven-driven reactor build does not stomp on the tag produced by the base/ module (which publishes apachehudi/hudi-hadoop_*-base and is consumed by namenode, datanode, historyserver, hive_base, and prestobase via FROM ...-base:latest).
This mirrors base_java11's -base-java11 suffix. Also add the commented-out push goal for parity with base_java11.

Note: docker/build_docker_images.sh still tags the base image as plain -base regardless of Java version, which is intentional for the shell path so that downstream Dockerfiles' FROM ...-base clauses inherit the newer JDK. Only the Maven reactor path needed the suffix fix.

Summary and Changelog

Introduce base_java17 Hadoop base image to support Spark 4.x (which requires Java 17)
build_docker_images.sh auto-selects base_java11 or base_java17 based on SPARK_VERSION
Add docker-compose_hadoop340_hive313_spark401 files for amd64 and arm64
Parameterize hadoop-aws and aws-java-sdk-bundle versions in spark_base Dockerfile

Impact

None

Risk Level

Low, this is a chore task.

Documentation Update

None

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — straightforward chore adding a Java 17 base image module for Spark 4.x. One question inline about potential Docker image tag collision with the existing base_java11 module.

yihua · 2026-04-17T17:59:28Z

+            </goals>
+            <configuration>
+              <skip>${docker.build.skip}</skip>
+              <pullNewerImage>false</pullNewerImage>


🤖 The repository name apachehudi/hudi-hadoop_${docker.hadoop.version}-base and tags (latest, ${project.version}) look identical to what base_java11 publishes. If both modules are ever built in the same Maven invocation, won't the second one overwrite the first's image under the same tag? Would it be safer to differentiate the tag (e.g. include java17 in the repository name or tag) so the two base images can coexist?

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

I think this is fine as long as we're using a new hadoop version that is Java 17 supported. Left to you @voonhous to see if java17 suffix to the image name is really needed.

We can add docs on what image is for which Java version.

yihua · 2026-04-17T18:03:29Z

+      - ALLOW_ANONYMOUS_LOGIN=yes
+
+  kafka:
+    image: 'bitnamilegacy/kafka:3.4.1'


Non-blocker: should we consider using 'apache/kafka:3.7.2'?

yihua · 2026-04-17T18:04:39Z

+      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
+      - ALLOW_PLAINTEXT_LISTENER=yes


Non-blocker: do we need same configs as Spark3.5 docker image (docker-compose_hadoop334_hive313_spark353_arm64):

- KAFKA_NODE_ID=1 - KAFKA_PROCESS_ROLES=broker,controller - KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:9093,PLAINTEXT_HOST://0.0.0.0:9092 - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafkabroker:29092,PLAINTEXT_HOST://localhost:9092 - KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT - KAFKA_INTER_BROKER_LISTENER_NAME=PLAINTEXT - KAFKA_CONTROLLER_QUORUM_VOTERS=1@kafkabroker:9093 - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 - KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 - KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1 - KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0 - KAFKA_NUM_PARTITIONS=3

…etup - Introduce base_java17 Hadoop base image to support Spark 4.x (which requires Java 17) - build_docker_images.sh auto-selects base_java11 or base_java17 based on SPARK_VERSION - Add docker-compose_hadoop340_hive313_spark401 files for amd64 and arm64 - Parameterize hadoop-aws and aws-java-sdk-bundle versions in spark_base Dockerfile - Register base_java17 in Maven reactor and fix tag collision - Add <module>base_java17</module> to the hudi-hadoop-docker parent POM so the Java 17 base image is picked up by the Maven reactor and imported by IntelliJ, mirroring how base_java11 is wired in. - Rename base_java17's docker repository from apachehudi/hudi-hadoop_*-base to apachehudi/hudi-hadoop_*-base-java17 so that a Maven-driven reactor build does not stomp on the tag produced by the base/ module (which publishes apachehudi/hudi-hadoop_*-base and is consumed by namenode, datanode, historyserver, hive_base, and prestobase via FROM ...-base:latest). - This mirrors base_java11's -base-java11 suffix. Also add the commented-out push goal for parity with base_java11. - Note: docker/build_docker_images.sh still tags the base image as plain -base regardless of Java version, which is intentional for the shell path so that downstream Dockerfiles' FROM ...-base clauses inherit the newer JDK. Only the Maven reactor path needed the suffix fix.

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This PR extends Docker infrastructure to support Java 17-based images for Apache Spark 4.0.1+ environments. It introduces a new base_java17 image with Hadoop, updates the build script to detect Spark version and select the appropriate base image, adds Docker Compose configurations for Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 stacks on both amd64 and arm64 architectures, parameterizes AWS library versions in Spark, and improves error handling in Spark ad-hoc containers.

Greptile Summary: This PR adds a Java 17 Hadoop base Docker image (base_java17/) to support Spark 4.0+, introduces new Docker Compose files for the Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 combination, and updates build_docker_images.sh to automatically select the Java 17 base when Spark 4.0+ is requested.

Key changes:

New docker/hoodie/hadoop/base_java17/ directory with Dockerfile, entrypoint.sh, export_container_ip.sh, and pom.xml — mirrors the structure of the existing base_java11 image but targets eclipse-temurin:17-jdk and Hadoop 3.4.0
New docker-compose_hadoop340_hive313_spark401_amd64.yml and _arm64.yml compose files
build_docker_images.sh gained logic to select base_java17 when SPARK_MAJOR >= 4
spark_base/Dockerfile and sparkadhoc/Dockerfile updated to work with the new stack

Issues found:

The new amd64 compose file is missing platform: linux/amd64 on every service — all other _amd64.yml files in this directory include this directive on each service, and its absence means the file won't enforce amd64 execution on non-amd64 hosts
build_docker_images.sh defaults to HADOOP_VERSION=2.8.4, which is incompatible with both Hadoop 2.x/Java 17 and the new 3.4.0 image tags expected by the compose files; a user running ./build_docker_images.sh --spark-version 4.0.1 without explicitly adding --hadoop-version 3.4.0 would build mismatched image tags

Greptile Confidence Score: 3/5
Not ready to merge — the amd64 compose file is missing platform directives and the build script's default Hadoop version will produce mismatched image tags when building for Spark 4.0.1

Two concrete P1 issues block the primary user path: (1) the _amd64.yml compose file omits platform: linux/amd64 on all services, breaking its purpose on non-amd64 hosts; (2) the build script defaults to HADOOP_VERSION=2.8.4 while the new compose files expect 3.4.0, meaning the default build invocation produces images with wrong tags that won't work with the new compose setup. The new base_java17 image and pom are structurally correct; the issues are in the wiring between the build script and compose files.

docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml needs platform directives; docker/build_docker_images.sh needs its default HADOOP_VERSION aligned with the Spark 4.0 path

CodeRabbit: yihua#48 (review)
Greptile: yihua#48 (review)

yihua · 2026-04-17T18:13:08Z

+    && tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
+    && rm /tmp/hadoop.tar.gz* \
+    && ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \
+    && mkdir /hadoop-data


Hadoop archive signature downloaded but never verified

The .asc file is fetched but no GPG verification step follows:

&& curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \

Without calling gpg --verify, the downloaded .asc file provides no actual integrity guarantee. The same pattern exists in base_java11/Dockerfile, so this is pre-existing, but worth resolving in the new image. A hardened build would import the Apache release keys and run:

gpg --import <apache-keys.asc> gpg --verify /tmp/hadoop.tar.gz.asc /tmp/hadoop.tar.gz

Alternatively, use the SHA-512 checksum which Apache publishes alongside the tarball.

— Greptile (original) (source:comment#3102315298)

All these are existing issues. Comparing base_java11 and base_java17, there isn't much change except for version and docs update:

> diff -r base_java11 base_java17 diff --color=auto -r base_java11/Dockerfile base_java17/Dockerfile 18c18 < FROM eclipse-temurin:11-jdk-jammy --- > FROM eclipse-temurin:17-jdk 25c25 < ARG HADOOP_VERSION=2.8.4 --- > ARG HADOOP_VERSION=3.4.0 31c31 < && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat procps \ --- > && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps \ 39d38 < && cp /etc/hadoop/mapred-site.xml.template /etc/hadoop/mapred-site.xml \ 52c51 < EXPOSE 0-1024 4040 7000-10100 5000-5100 50000-50200 58188 58088 58042 --- > EXPOSE 0-1024 4040 7000-10100 5000-5100 50000-50200 58188 58088 58042 60d58 < diff --color=auto -r base_java11/pom.xml base_java17/pom.xml 27c27 < <artifactId>hudi-hadoop-base-java11-docker</artifactId> --- > <artifactId>hudi-hadoop-base-java17-docker</artifactId> 29c29 < <description>Base Docker Image with Hoodie</description> --- > <description>Base Docker Image with Java 17 for Spark 4.0+</description> 37d36 < 50d48 < 71c69 < <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java11</repository> --- > <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java17</repository> 87c85 < <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java11</repository> --- > <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java17</repository>

yihua · 2026-04-17T18:13:08Z

+    # YARN
+    addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
+    addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
+    addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0


⚠️ Potential issue | 🟡 Minor

Duplicate property addition.

yarn.nodemanager.bind-host is added twice to yarn-site.xml (lines 76 and 77).

🔧 Proposed fix

addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0 addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 - addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0

addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 76 - 77, Duplicate addition of the same Hadoop property is happening: remove the redundant addProperty call so yarn.nodemanager.bind-host is only added once, or replace the second addProperty invocation with a guard that checks yarn-site.xml for an existing yarn.nodemanager.bind-host entry before adding; locate the addProperty calls (symbol: addProperty) targeting yarn-site.xml and ensure only a single write for yarn.nodemanager.bind-host occurs.

— CodeRabbit (original) (source:comment#3102348614)

yihua · 2026-04-17T18:13:09Z

+    addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0
+
+    # MAPRED
+    addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0


⚠️ Potential issue | 🟡 Minor

Wrong configuration file for YARN property.

yarn.nodemanager.bind-host is a YARN property but is being added to mapred-site.xml. This appears to be a copy-paste error. The MAPRED section should likely set a MapReduce-specific property or be removed.

🔧 Proposed fix (remove incorrect entry)

# MAPRED - addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0 + addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020 + addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888

Alternatively, if no MapReduce multi-homed config is needed, simply remove lines 80-81.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0

# MAPRED

addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020

addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 80 - 81, The MAPRED section incorrectly adds the YARN property via the addProperty call targeting /etc/hadoop/mapred-site.xml with key yarn.nodemanager.bind-host; remove this addProperty line or move it to the appropriate YARN config (e.g. use addProperty against /etc/hadoop/yarn-site.xml for yarn.nodemanager.bind-host) and ensure the MAPRED block only contains MapReduce-specific properties.

— CodeRabbit (original) (source:comment#3102348632)

hudi-bot · 2026-04-17T18:14:13Z

CI report:

f43107b UNKNOWN
74b91d5 Azure: PENDING
1252d5d UNKNOWN
1fc0a55 UNKNOWN

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

yihua

LGTM overall. We should address minor comments in a follow-up.

…stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.

….1 stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.

…1 stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.

codecov-commenter · 2026-04-17T19:49:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.84%. Comparing base (a369773) to head (1fc0a55).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18520   +/-   ##
=========================================
  Coverage     68.84%   68.84%           
- Complexity    28250    28253    +3     
=========================================
  Files          2464     2464           
  Lines        135442   135442           
  Branches      16417    16417           
=========================================
+ Hits          93239    93251   +12     
+ Misses        34821    34812    -9     
+ Partials       7382     7379    -3

Flag	Coverage Δ
common-and-other-modules	`44.58% <ø> (+<0.01%)`	⬆️
hadoop-mr-java-client	`44.80% <ø> (-0.02%)`	⬇️
spark-client-hadoop-common	`48.44% <ø> (ø)`
spark-java-tests	`48.93% <ø> (+0.01%)`	⬆️
spark-scala-tests	`45.45% <ø> (-0.01%)`	⬇️
utilities	`38.19% <ø> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 11 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…1 stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.

voonhous force-pushed the add-docker-compose-hadoop3 branch from f43107b to 74b91d5 Compare April 17, 2026 17:50

github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Apr 17, 2026

voonhous force-pushed the add-docker-compose-hadoop3 branch from 74b91d5 to 1252d5d Compare April 17, 2026 17:56

voonhous requested a review from yihua April 17, 2026 17:57

yihua reviewed Apr 17, 2026

View reviewed changes

yihua mentioned this pull request Apr 17, 2026

[OSS PR #18520] chore: Add Java 17 Hadoop base image and Spark 4.0.1 docker compose s… yihua/hudi#48

Open

yihua reviewed Apr 17, 2026

View reviewed changes

voonhous force-pushed the add-docker-compose-hadoop3 branch from 1252d5d to 1fc0a55 Compare April 17, 2026 18:10

yihua reviewed Apr 17, 2026

View reviewed changes

Comment thread docker/build_docker_images.sh

yihua approved these changes Apr 17, 2026

View reviewed changes

yihua merged commit 41cfc19 into apache:master Apr 17, 2026
54 of 58 checks passed

voonhous deleted the add-docker-compose-hadoop3 branch April 17, 2026 18:42

voonhous mentioned this pull request Apr 17, 2026

chore(docker): address PR #18520 review comments for Spark 4.0.1 stack #18524

Open

3 tasks

		- KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
		- ALLOW_PLAINTEXT_LISTENER=yes

	addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
	addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0

Conversation

voonhous commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

yihua left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yihua left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hudi-bot commented Apr 17, 2026

CI report:

Uh oh!

yihua left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov-commenter commented Apr 17, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

voonhous commented Apr 17, 2026 •

edited

Loading