chore: Add Java 17 Hadoop base image and Spark 4.0.1 docker compose s…#18520
Conversation
f43107b to
74b91d5
Compare
74b91d5 to
1252d5d
Compare
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
LGTM — straightforward chore adding a Java 17 base image module for Spark 4.x. One question inline about potential Docker image tag collision with the existing base_java11 module.
| </goals> | ||
| <configuration> | ||
| <skip>${docker.build.skip}</skip> | ||
| <pullNewerImage>false</pullNewerImage> |
There was a problem hiding this comment.
🤖 The repository name apachehudi/hudi-hadoop_${docker.hadoop.version}-base and tags (latest, ${project.version}) look identical to what base_java11 publishes. If both modules are ever built in the same Maven invocation, won't the second one overwrite the first's image under the same tag? Would it be safer to differentiate the tag (e.g. include java17 in the repository name or tag) so the two base images can coexist?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
There was a problem hiding this comment.
I think this is fine as long as we're using a new hadoop version that is Java 17 supported. Left to you @voonhous to see if java17 suffix to the image name is really needed.
There was a problem hiding this comment.
We can add docs on what image is for which Java version.
| - ALLOW_ANONYMOUS_LOGIN=yes | ||
|
|
||
| kafka: | ||
| image: 'bitnamilegacy/kafka:3.4.1' |
There was a problem hiding this comment.
Non-blocker: should we consider using 'apache/kafka:3.7.2'?
| - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 | ||
| - ALLOW_PLAINTEXT_LISTENER=yes |
There was a problem hiding this comment.
Non-blocker: do we need same configs as Spark3.5 docker image (docker-compose_hadoop334_hive313_spark353_arm64):
- KAFKA_NODE_ID=1
- KAFKA_PROCESS_ROLES=broker,controller
- KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:9093,PLAINTEXT_HOST://0.0.0.0:9092
- KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafkabroker:29092,PLAINTEXT_HOST://localhost:9092
- KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER
- KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
- KAFKA_INTER_BROKER_LISTENER_NAME=PLAINTEXT
- KAFKA_CONTROLLER_QUORUM_VOTERS=1@kafkabroker:9093
- KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
- KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1
- KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1
- KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=0
- KAFKA_NUM_PARTITIONS=3
…etup - Introduce base_java17 Hadoop base image to support Spark 4.x (which requires Java 17) - build_docker_images.sh auto-selects base_java11 or base_java17 based on SPARK_VERSION - Add docker-compose_hadoop340_hive313_spark401 files for amd64 and arm64 - Parameterize hadoop-aws and aws-java-sdk-bundle versions in spark_base Dockerfile - Register base_java17 in Maven reactor and fix tag collision - Add <module>base_java17</module> to the hudi-hadoop-docker parent POM so the Java 17 base image is picked up by the Maven reactor and imported by IntelliJ, mirroring how base_java11 is wired in. - Rename base_java17's docker repository from apachehudi/hudi-hadoop_*-base to apachehudi/hudi-hadoop_*-base-java17 so that a Maven-driven reactor build does not stomp on the tag produced by the base/ module (which publishes apachehudi/hudi-hadoop_*-base and is consumed by namenode, datanode, historyserver, hive_base, and prestobase via FROM ...-base:latest). - This mirrors base_java11's -base-java11 suffix. Also add the commented-out push goal for parity with base_java11. - Note: docker/build_docker_images.sh still tags the base image as plain -base regardless of Java version, which is intentional for the shell path so that downstream Dockerfiles' FROM ...-base clauses inherit the newer JDK. Only the Maven reactor path needed the suffix fix.
1252d5d to
1fc0a55
Compare
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
CodeRabbit Walkthrough: This PR extends Docker infrastructure to support Java 17-based images for Apache Spark 4.0.1+ environments. It introduces a new base_java17 image with Hadoop, updates the build script to detect Spark version and select the appropriate base image, adds Docker Compose configurations for Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 stacks on both amd64 and arm64 architectures, parameterizes AWS library versions in Spark, and improves error handling in Spark ad-hoc containers.
Greptile Summary: This PR adds a Java 17 Hadoop base Docker image (base_java17/) to support Spark 4.0+, introduces new Docker Compose files for the Hadoop 3.4.0 + Hive 3.1.3 + Spark 4.0.1 combination, and updates build_docker_images.sh to automatically select the Java 17 base when Spark 4.0+ is requested.
Key changes:
- New
docker/hoodie/hadoop/base_java17/directory withDockerfile,entrypoint.sh,export_container_ip.sh, andpom.xml— mirrors the structure of the existingbase_java11image but targetseclipse-temurin:17-jdkand Hadoop 3.4.0 - New
docker-compose_hadoop340_hive313_spark401_amd64.ymland_arm64.ymlcompose files build_docker_images.shgained logic to selectbase_java17whenSPARK_MAJOR >= 4spark_base/Dockerfileandsparkadhoc/Dockerfileupdated to work with the new stack
Issues found:
- The new
amd64compose file is missingplatform: linux/amd64on every service — all other_amd64.ymlfiles in this directory include this directive on each service, and its absence means the file won't enforce amd64 execution on non-amd64 hosts build_docker_images.shdefaults toHADOOP_VERSION=2.8.4, which is incompatible with both Hadoop 2.x/Java 17 and the new3.4.0image tags expected by the compose files; a user running./build_docker_images.sh --spark-version 4.0.1without explicitly adding--hadoop-version 3.4.0would build mismatched image tags
Greptile Confidence Score: 3/5
Not ready to merge — the amd64 compose file is missing platform directives and the build script's default Hadoop version will produce mismatched image tags when building for Spark 4.0.1
Two concrete P1 issues block the primary user path: (1) the _amd64.yml compose file omits platform: linux/amd64 on all services, breaking its purpose on non-amd64 hosts; (2) the build script defaults to HADOOP_VERSION=2.8.4 while the new compose files expect 3.4.0, meaning the default build invocation produces images with wrong tags that won't work with the new compose setup. The new base_java17 image and pom are structurally correct; the issues are in the wiring between the build script and compose files.
docker/compose/docker-compose_hadoop340_hive313_spark401_amd64.yml needs platform directives; docker/build_docker_images.sh needs its default HADOOP_VERSION aligned with the Spark 4.0 path
CodeRabbit: yihua#48 (review)
Greptile: yihua#48 (review)
| && tar -xvf /tmp/hadoop.tar.gz -C /opt/ \ | ||
| && rm /tmp/hadoop.tar.gz* \ | ||
| && ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop \ | ||
| && mkdir /hadoop-data |
There was a problem hiding this comment.
Hadoop archive signature downloaded but never verified
The .asc file is fetched but no GPG verification step follows:
&& curl -fSL "${HADOOP_URL}.asc" -o /tmp/hadoop.tar.gz.asc \Without calling gpg --verify, the downloaded .asc file provides no actual integrity guarantee. The same pattern exists in base_java11/Dockerfile, so this is pre-existing, but worth resolving in the new image. A hardened build would import the Apache release keys and run:
gpg --import <apache-keys.asc>
gpg --verify /tmp/hadoop.tar.gz.asc /tmp/hadoop.tar.gzAlternatively, use the SHA-512 checksum which Apache publishes alongside the tarball.
— Greptile (original) (source:comment#3102315298)
There was a problem hiding this comment.
All these are existing issues. Comparing base_java11 and base_java17, there isn't much change except for version and docs update:
> diff -r base_java11 base_java17
diff --color=auto -r base_java11/Dockerfile base_java17/Dockerfile
18c18
< FROM eclipse-temurin:11-jdk-jammy
---
> FROM eclipse-temurin:17-jdk
25c25
< ARG HADOOP_VERSION=2.8.4
---
> ARG HADOOP_VERSION=3.4.0
31c31
< && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat procps \
---
> && DEBIAN_FRONTEND=noninteractive apt-get -yq update && apt-get -yq install curl wget netcat-openbsd procps \
39d38
< && cp /etc/hadoop/mapred-site.xml.template /etc/hadoop/mapred-site.xml \
52c51
< EXPOSE 0-1024 4040 7000-10100 5000-5100 50000-50200 58188 58088 58042
---
> EXPOSE 0-1024 4040 7000-10100 5000-5100 50000-50200 58188 58088 58042
60d58
<
diff --color=auto -r base_java11/pom.xml base_java17/pom.xml
27c27
< <artifactId>hudi-hadoop-base-java11-docker</artifactId>
---
> <artifactId>hudi-hadoop-base-java17-docker</artifactId>
29c29
< <description>Base Docker Image with Hoodie</description>
---
> <description>Base Docker Image with Java 17 for Spark 4.0+</description>
37d36
<
50d48
<
71c69
< <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java11</repository>
---
> <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java17</repository>
87c85
< <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java11</repository>
---
> <repository>apachehudi/hudi-hadoop_${docker.hadoop.version}-base-java17</repository>
| # YARN | ||
| addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0 | ||
| addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 | ||
| addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 |
There was a problem hiding this comment.
Duplicate property addition.
yarn.nodemanager.bind-host is added twice to yarn-site.xml (lines 76 and 77).
🔧 Proposed fix
addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
- addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 | |
| addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 76 - 77,
Duplicate addition of the same Hadoop property is happening: remove the
redundant addProperty call so yarn.nodemanager.bind-host is only added once, or
replace the second addProperty invocation with a guard that checks yarn-site.xml
for an existing yarn.nodemanager.bind-host entry before adding; locate the
addProperty calls (symbol: addProperty) targeting yarn-site.xml and ensure only
a single write for yarn.nodemanager.bind-host occurs.
— CodeRabbit (original) (source:comment#3102348614)
| addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0 | ||
|
|
||
| # MAPRED | ||
| addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0 |
There was a problem hiding this comment.
Wrong configuration file for YARN property.
yarn.nodemanager.bind-host is a YARN property but is being added to mapred-site.xml. This appears to be a copy-paste error. The MAPRED section should likely set a MapReduce-specific property or be removed.
🔧 Proposed fix (remove incorrect entry)
# MAPRED
- addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
+ addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020
+ addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888Alternatively, if no MapReduce multi-homed config is needed, simply remove lines 80-81.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0 | |
| # MAPRED | |
| addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020 | |
| addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.webapp.address 0.0.0.0:19888 |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/hoodie/hadoop/base_java17/entrypoint.sh` around lines 80 - 81, The
MAPRED section incorrectly adds the YARN property via the addProperty call
targeting /etc/hadoop/mapred-site.xml with key yarn.nodemanager.bind-host;
remove this addProperty line or move it to the appropriate YARN config (e.g. use
addProperty against /etc/hadoop/yarn-site.xml for yarn.nodemanager.bind-host)
and ensure the MAPRED block only contains MapReduce-specific properties.
— CodeRabbit (original) (source:comment#3102348632)
yihua
left a comment
There was a problem hiding this comment.
LGTM overall. We should address minor comments in a follow-up.
…stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.
….1 stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.
…1 stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18520 +/- ##
=========================================
Coverage 68.84% 68.84%
- Complexity 28250 28253 +3
=========================================
Files 2464 2464
Lines 135442 135442
Branches 16417 16417
=========================================
+ Hits 93239 93251 +12
+ Misses 34821 34812 -9
+ Partials 7382 7379 -3
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…1 stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.
…etup
Describe the issue this Pull Request addresses
Adding Java17 Hadoop base image and Spark4.0.1 docker compose images as our Variant feature push requires it.
Register base_java17 in Maven reactor and fix tag collision
base_java11is wired in.base_java17's docker repository fromapachehudi/hudi-hadoop_*-basetoapachehudi/hudi-hadoop_*-base-java17so that a Maven-driven reactor build does not stomp on the tag produced by the base/ module (which publishesapachehudi/hudi-hadoop_*-baseand is consumed by namenode, datanode, historyserver, hive_base, and prestobase via FROM ...-base:latest).-base-java11suffix. Also add the commented-out push goal for parity with base_java11.Note:
docker/build_docker_images.shstill tags the base image as plain-baseregardless of Java version, which is intentional for the shell path so that downstream Dockerfiles' FROM ...-base clauses inherit the newer JDK. Only the Maven reactor path needed the suffix fix.Summary and Changelog
Impact
None
Risk Level
Low, this is a chore task.
Documentation Update
None
Contributor's checklist