chore(docker): address PR #18520 review comments for Spark 4.0.1 stack#18524
chore(docker): address PR #18520 review comments for Spark 4.0.1 stack#18524voonhous wants to merge 3 commits into
Conversation
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Nice documentation improvement. The new "Base image by Java version" table is a helpful reference for mapping Spark versions to their required base images. One minor clarity point: the auto-selection description only covers Java 11 vs Java 17, but the table also lists the Java 8 base image for "Legacy Spark 2.x", which may lead readers to assume it is also auto-selected.
ab4918a to
78733fd
Compare
78733fd to
57783ce
Compare
| - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 | ||
| - ALLOW_PLAINTEXT_LISTENER=yes |
There was a problem hiding this comment.
No, with apache/kafka:3.7.2 in KRaft mode the broker acts as its own controller, so KAFKA_ZOOKEEPER_CONNECT is obsolete and ALLOW_PLAINTEXT_LISTENER (a bitnami-specific convenience var) no longer applies.
The KRaft listeners and protocol map are expressed directly via the new KAFKA_LISTENERS / KAFKA_LISTENER_SECURITY_PROTOCOL_MAP env vars in the current config.
It should be safe to remove.
|
|
||
| # MAPRED | ||
| addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0 | ||
| addProperty /etc/hadoop/mapred-site.xml mapreduce.jobhistory.address 0.0.0.0:10020 |
There was a problem hiding this comment.
Not sure if we need changes in entrypoint.sh and export_container_ip.sh. Let's revert them.
| --build-arg HADOOP_AWS_VERSION=${HADOOP_AWS_VERSION} \ | ||
| --build-arg AWS_SDK_VERSION=${AWS_SDK_VERSION} \ |
There was a problem hiding this comment.
Let's add these build args to the multi-arch build command?
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the docs update. The new "Base image by Java version" section is a helpful orientation for contributors. A couple of small clarifications on the table could make it easier to read, especially around the asymmetric "Published repo suffix" values and what "anything else" means for unsupported Spark versions.
|
|
||
| | Base module | JDK | Default Hadoop | Published repo suffix | Used for | | ||
| |---------------|---------|----------------|-----------------------|------------| | ||
| | `base_java11` | Java 11 | 2.8.4 | `...-base-java11` | Spark 3.x | |
There was a problem hiding this comment.
🤖 The Published repo suffix column lists ...-base-java11 for base_java11 but just ...-base for base_java17 (no -java17 qualifier). Is that asymmetry intentional? If it reflects how the images are actually published, it could help readers to add a short footnote explaining why base_java17 drops the JDK qualifier — otherwise at first glance this looks like a typo.
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
There was a problem hiding this comment.
The asymmetry isn't intentional and the column is misleading as currently written.
There are two image-naming paths in this repo and the table was silently mixing them:
- build_docker_images.sh:129 produces
apachehudi/hudi-hadoop_<HADOOP>-basefor both base_java11 and base_java17 (no JDK qualifier). This is the only path that actually pushes to Docker Hub (via --multi-arch -> buildx --push on line 141). - The per-module poms (
base_java11/pom.xml:71,87,base_java17/pom.xml:69,85) declare ...-base-java11 / ...-base-java17 -- but push is commented out in both, so these only tag locally.
I'll just drop the Published repo suffix column as this column is confusing and not helpful at all.
…1 stack - Fix duplicate/misplaced Hadoop properties and MY_CONTAINER_IP propagation in base_java17 entrypoint and export_container_ip scripts. - Guard mapred-site.xml.template copy in base_java17 Dockerfile so it works against Hadoop 3.4.0 which ships mapred-site.xml directly. - Validate SPARK_MAJOR in build_docker_images.sh and forward HADOOP_AWS_VERSION / AWS_SDK_VERSION as build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars. - Pin linux/amd64 platform on every service in the Spark 4 amd64 compose file, swap Kafka to apache/kafka:3.7.2 with KRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64. - Document which base image goes with which Spark/JDK combo in docker/README.md.
Updated the published repo suffix for base_java17 module.
1f72e66 to
00ced67
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18524 +/- ##
=========================================
Coverage 68.87% 68.87%
- Complexity 28272 28274 +2
=========================================
Files 2464 2464
Lines 135594 135594
Branches 16447 16447
=========================================
+ Hits 93389 93396 +7
+ Misses 34815 34809 -6
+ Partials 7390 7389 -1
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Describe the issue this Pull Request addresses
Addressing non-blocking comments from #18520.
Summary and Changelog
base_java17entrypoint andexport_container_ipscripts.mapred-site.xml.templatecopy inbase_java17Dockerfile so it works againstHadoop 3.4.0which shipsmapred-site.xmldirectly.SPARK_MAJORinbuild_docker_images.shand forwardHADOOP_AWS_VERSION/AWS_SDK_VERSIONas build args keyed off Hadoop major.minor so Hadoop 3.4 builds pull the matching AWS jars.linux/amd64platform on every service in the Spark 4 amd64 compose file, swapKafkatoapache/kafka:3.7.2withKRaft config, and expose the JobHistory web UI port (19888) on both amd64 and arm64.docker/README.md.Impact
None
Risk Level
None
Documentation Update
Contributor's checklist