Adding Nvidia Isaac Lab sample#1083
Conversation
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 1/5 — Structure & Repository Hygiene
Thanks for putting this together — there's a lot of nice machinery in here. The cross-cutting structural items below sit at the body level since they don't map to a single line.
Directory layout doesn't follow the test-case convention
File: 3.test_cases/pytorch/nvidia-isaac-lab/ (whole tree)
The repo convention for a test case (encoded in CONTRIBUTING.md and the PR template's "Directory Structure" section) is:
3.test_cases/<framework>/<library>/[<model>/]
├── Dockerfile
├── README.md
├── slurm/
├── kubernetes/
└── hyperpod-eks/
The PR currently uses a flat layout with custom directory names instead:
3.test_cases/pytorch/nvidia-isaac-lab/
├── docker/Dockerfile # should be ./Dockerfile at the test-case root
├── README.md
├── config.yaml.example
├── generate.py
├── scripts/ # Slurm/K8s launchers — not a convention dir
├── templates/ # K8s manifests live here — should be kubernetes/
└── viz-scripts/ # workshop-DCV — not a convention dir
Concretely the things I'd ask for:
- Move
docker/Dockerfile→Dockerfileat the test-case root and updateCOPY docker/patches/...paths accordingly. The lint rule expectsDockerfileat the root. - Rename
templates/→kubernetes/. The generator-based design is fine — the constraint is just that anything K8s lives underkubernetes/. - Add a
hyperpod-eks/subdirectory holding the HyperPod-EKS-specific launch flow (today it's intermixed into the top-level README +scripts/+templates/). Even a thin README inhyperpod-eks/that points at../generate.pyand the relevant template names is enough to satisfy the convention. - Add a
training-jobs/subdirectory for the SageMaker Training Jobs path. The Training-Jobs-specific assets (scripts/sm-train-entrypoint.sh,templates/launch-sm-training.py.tpl, and the corresponding README sections) should live here as a sibling tohyperpod-eks/. viz-scripts/is workshop-DCV-specific and doesn't map to a convention dir. Either move it under a clearly-scopedworkshop/subdir or drop it from this PR and ship it separately — see the Documentation Consistency batch for context.- There's no Slurm path here, so
slurm/is correctly omitted — no action needed.
Missing patch file referenced by Dockerfile and README (build will fail)
File: 3.test_cases/pytorch/nvidia-isaac-lab/docker/Dockerfile (line 56)
The Dockerfile copies docker/patches/mlflow-train.patch and applies it with git apply, and the README documents the patch in two places (## MLflow Integration (Optional) and the ## Source Files table). However, the patch file itself is not in this PR's diff, so docker build -f docker/Dockerfile . will fail at the COPY step with a missing-file error. Could you commit docker/patches/mlflow-train.patch as part of this PR? If the patch is intentionally out-of-tree (e.g., generated at build time), I'd love a brief note in the README explaining how a contributor is supposed to obtain it before building.
There was a problem hiding this comment.
Review Batch 2/5 — Deployment Pipeline (K8s / Slurm)
Remove NCCL_SOCKET_IFNAME=eth0 and NCCL_IB_DISABLE=1 — g6.12xlarge supports EFA
Files:
scripts/sm-train-entrypoint.shlines 10–12 (# g6 instances don't have EFA...comment,export NCCL_SOCKET_IFNAME=eth0,export NCCL_IB_DISABLE=1)templates/training-job.yaml.tpllines 63 and 144 (value: "eth0")
My original framing here was wrong — I claimed the issue was positive vs. exclusion selection, but the deeper issue is that the override is opting out of EFA on EFA-capable hardware based on a false premise. Per the AWS EFA supported instance types page, g6.8xlarge, g6.12xlarge, g6.16xlarge, g6.24xlarge, and g6.48xlarge all support EFA (Nitro v4, RDMA read + write). Only the smallest g6 sizes (xlarge, 2xlarge, 4xlarge) lack EFA. The comment in the entrypoint ("g6 instances don't have EFA") is therefore incorrect for the size used here.
Action: drop NCCL_IB_DISABLE=1 and the NCCL_SOCKET_IFNAME=eth0 line in the entrypoint, drop the equivalent value: "eth0" entries in templates/training-job.yaml.tpl (lines 63 and 144), and let NCCL auto-detect EFA. With aws-ofi-nccl already on the image, NCCL will pick up SRD over EFA without further config. Add the EFA device plugin and an EFA-allowing security-group rule between HyperPod nodes to the README's HyperPod EKS prerequisites so the cluster side is set up correctly.
This supersedes the earlier exclusion-pattern suggestion — the cleanest outcome is to remove the override entirely rather than rewrite it. See also the standalone follow-up review (#1083 (review)) covering the same point on the entrypoint.
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 3/5 — Training Code Quality
Three inline findings on the SageMaker entrypoint script.
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 4/5 — Infrastructure & NCCL Configuration
Default config doesn't match the tested config
File: 3.test_cases/pytorch/nvidia-isaac-lab/config.yaml.example — and the PR test plan.
The PR description says testing was done on ml.g6e.12xlarge (4× NVIDIA L40S) but the example config defaults to ml.g6.12xlarge (4× NVIDIA L4). Both are listed as RT-Core-compatible in the README, and either will probably work, but a contributor copying the example will get a different GPU and somewhat different perf characteristics from what's documented in the test results. Could you either bump the default to the tested SKU, or add a short note in the example/README explaining that the test results were measured on ml.g6e.12xlarge?
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 5/5 — Documentation Consistency
Things That Look Great
- Config-driven generation with
string.Template.safe_substituteis a clean pattern — auto-derived ECR URI and S3 paths ingenerate.pysave the user a meaningful amount of fiddling. - The MLflow integration design (background-thread batching with documented overhead numbers — "~5s on a 87s baseline vs 10+ minutes per-metric") is a nicely surfaced piece of engineering rationale; future readers will appreciate knowing why the patch exists.
- The Dockerfile's surgical
sedpins forflatdict,numpy,torch,onnx, etc. with an inline comment linking to upstreamIsaacLab#5388is exactly the kind of "explain the why right where it matters" comment that makes diff archaeology easier. - The README's
Supported Instancestable that explicitly calls out why P-family instances are unsupported (no RT cores) is genuinely helpful — most test cases just say "use this instance type" and leave the user guessing. - Cross-node connectivity preflight in
sm-train-entrypoint.sh(master-port reachability + ping) is a thoughtful debuggability touch for the SageMaker BYOC path where networking issues are otherwise opaque. EnableInterContainerTrafficEncryption: Truein the SageMaker job config is good security hygiene and not always remembered.
Thanks again for the contribution — none of the items above are dealbreakers, just things to tighten before merge.
There was a problem hiding this comment.
Follow-up — EFA capability on g6.12xlarge
Following up on the NCCL_SOCKET_IFNAME finding from the earlier batch — there's a deeper issue with the surrounding lines that's worth surfacing on its own.
Per the AWS EFA supported instance types page, g6.8xlarge, g6.12xlarge, g6.16xlarge, g6.24xlarge, and g6.48xlarge all support EFA (Nitro v4, with both RDMA read and RDMA write). The smaller g6 sizes (xlarge, 2xlarge, 4xlarge) don't, but the test case targets g6.12xlarge, which does. The comment in the entrypoint ("g6 instances don't have EFA") is therefore factually incorrect for the size used here, and the NCCL_IB_DISABLE=1 + NCCL_SOCKET_IFNAME=eth0 pair is opting out of EFA on EFA-capable hardware based on a false premise.
Action: drop NCCL_IB_DISABLE=1 and the NCCL_SOCKET_IFNAME=eth0 line in the entrypoint, drop the equivalent value: "eth0" entries in templates/training-job.yaml.tpl (lines 63 and 144), and let NCCL auto-detect EFA. With aws-ofi-nccl already on the image, NCCL picks up SRD over EFA without further config. Add the EFA device plugin and an EFA-allowing security-group rule between HyperPod nodes to the README's HyperPod EKS prerequisites so contributors know what cluster-side setup is required.
The earlier NCCL_SOCKET_IFNAME exclusion-pattern finding only applies if any IFNAME override is kept; the cleaner outcome is to remove it entirely and let NCCL handle interface selection.
- sm-train-entrypoint.sh: symlink /workspace/IsaacLab/logs to /opt/ml/checkpoints so SageMaker syncs checkpoints to S3 continuously during training. On a resumed job, SageMaker restores the previous checkpoints before the container starts; the entrypoint finds best_agent.pt and passes --checkpoint to train.py. - launch-sm-training.py.tpl: add CheckpointConfig with checkpoint_s3_uri and checkpoint_local_path=/opt/ml/checkpoints. - config.yaml.example + generate.py: add sagemaker_training.checkpoint_s3_path and checkpoint_prefix fields. The prefix scopes checkpoints per experiment so a new run does not accidentally resume from a different experiment's checkpoint.
|
Thanks for the thorough review @KeitaW . I have addressed the issues. Please review. |
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 1/3 — Structure & Repository Hygiene (round 2)
A lot of cleanup since round 1 — most of the smaller findings (NCCL positive selection, :latest tags, NCCL_DEBUG_SUBSYS verbosity, hardcoded task name, Python prerequisite, license headers, .gitignore trim) are fully resolved, and the viz-pod simplification dropped both the third-party npx runtime fetch and the sed-on-source patch chain entirely — that's a real improvement. The three carry-over blockers are: the directory layout still doesn't match the repo's test-case convention, docker/patches/mlflow-train.patch is still referenced by both the Dockerfile and the README but not in the diff (so docker build will fail for anyone outside the author's local tree), and scripts/distributed_run.bash was deleted but the README's Source Files table still lists it. A small new regression: the viz-scripts/*.sh files lost their set -e and now run with no error trapping at all.
Resolution Status of Prior Findings
| # | Finding | Status |
|---|---|---|
| 1 | Directory layout doesn't follow the test-case convention | ❌ Open — still flat (docker/, templates/, scripts/, viz-scripts/) instead of Dockerfile at root + kubernetes/ + hyperpod-eks/ + training-jobs/ |
| 2 | Missing patch file referenced by Dockerfile and README (build will fail) | ❌ Open — docker/Dockerfile:58 still COPY docker/patches/mlflow-train.patch and README still references it; patch not in diff |
| 3 | Missing MIT-0 license headers across the new test case | ✅ Resolved — headers present in .gitignore, Dockerfile, generate.py, config.yaml.example, all templates/*.tpl, all scripts/*.sh, all viz-scripts/*.sh, and viz-scripts/README.md |
| 4 | :latest tag pinned in viz-scripts |
✅ Resolved — now ${ISAAC_IMAGE:-isaaclab-sagemaker:5.1.0} across all three scripts |
| 5 | Local .gitignore mostly duplicates the repo root |
✅ Resolved — trimmed to generated/ + config.yaml |
| 6 | Dead-code / off-topic launcher (distributed_run.bash) |
🟡 Partial — file deleted from the PR, but README's Source Files table at line 210 still lists it |
| 7 | NCCL_SOCKET_IFNAME=eth0 uses positive selection |
✅ Resolved — removed entirely; entrypoint now carries an explanatory comment that NCCL should auto-detect, and training-job.yaml.tpl no longer sets it |
| 8 | Shell scripts use set -e instead of the repo floor |
🟡 Partial — scripts/sm-train-entrypoint.sh:4 is still set -e; the three viz-scripts/*.sh files now have no set at all, which is worse than before |
| 9 | Runtime download from a third-party artifactory in the viz pod | ✅ Resolved — npx @nvidia/create-ov-web-rtc-app step removed; the viz pod is now ~80 lines of plain pod spec running play.py with --livestream 2 |
| 10 | The sed chain may be patching values that are already the default |
✅ Resolved — entire sed block removed with the npx step |
| 11 | Hardcoded task name in SageMaker entrypoint | ✅ Resolved — now --task=${TASK:-Isaac-Velocity-Rough-H1-v0} and TASK is wired through launch-sm-training.py.tpl's Environment block |
| 12 | NCCL_DEBUG_SUBSYS=ALL is debug-only verbosity for a default |
✅ Resolved — line removed, only NCCL_DEBUG=INFO remains |
| 13 | ISAACLAB_INIT_LOCK is exported but appears unused |
✅ Resolved — comment block at sm-train-entrypoint.sh:23-26 now explains Isaac Sim reads this env var for file-based locking to serialize per-node pip env initialization |
| 14 | Default config doesn't match the tested config | ✅ Resolved — PR body and config.yaml.example both now use ml.g6.12xlarge |
| 15 | Source Files table doesn't list the missing patch |
➖ N/A — the row is still in the table; depends on whether finding #2 is resolved by adding the patch or by removing the reference |
| 16 | Python prerequisite is too lax | ✅ Resolved — bumped to Python 3.10+ |
| 17 | Workshop/DCV context for viz-scripts/ not declared in the main README |
✅ Resolved — viz-scripts/README.txt renamed to viz-scripts/README.md with prerequisites, and the main README's Source Files row now reads DCV visualization helpers for EC2 workshop path |
| 18 | best_agent.pt checkpoint name is implicit |
🟡 Partial — comment now documents that the filename is the skrl default and that other frameworks need a different pattern; still hardcoded rather than templated from ${FRAMEWORK} |
Legend: ✅ Resolved · 🟡 Partial · ❌ Open · ➖ N/A (scope changed)
Directory layout still doesn't follow the test-case convention
File: 3.test_cases/pytorch/nvidia-isaac-lab/ (whole tree)
Carrying this over from round 1 — the layout is unchanged. The repo convention (encoded in CONTRIBUTING.md and the PR template's "Directory Structure" section) puts Dockerfile at the test-case root and groups service-specific assets under kubernetes/, hyperpod-eks/, training-jobs/, etc., so a contributor landing in any test case can find "the K8s manifest" or "the HyperPod instructions" without re-reading a per-test-case README. Today's layout still uses docker/Dockerfile, templates/*.tpl, scripts/sm-train-entrypoint.sh, and viz-scripts/ as flat sibling directories.
Concretely I'd still ask for:
- Move
docker/Dockerfile→Dockerfileat the test-case root and update theCOPY docker/patches/...path accordingly. - Rename
templates/→kubernetes/(orkubernetes/templates/if you want to keep the generator output separate from raw YAML). The generator-based design is fine — only the directory name needs to change. - Add a
hyperpod-eks/subdirectory with a short README that points at../generate.pyand the relevant template names, so the HyperPod path is discoverable from the convention. - Add a
training-jobs/subdirectory holdingsm-train-entrypoint.shandlaunch-sm-training.py.tplwith its own README — symmetric tohyperpod-eks/. viz-scripts/is workshop-DCV-specific and doesn't map to a convention dir; either move it under a clearly-scopedworkshop/subdir or split it off into a follow-up PR.
If keeping the flat layout is a hard preference for the workshop story, I'd love a brief note in the PR description explaining the trade-off so future contributors know whether to mirror this layout or treat it as an exception.
Missing patch file referenced by Dockerfile and README (build will still fail)
File: 3.test_cases/pytorch/nvidia-isaac-lab/docker/Dockerfile (line 58)
Same as round 1 — the Dockerfile still does COPY docker/patches/mlflow-train.patch /tmp/mlflow-train.patch and the README documents the patch in two places (## MLflow Integration (Optional) and the ## Source Files table row), but the patch file itself is not in this PR's diff. docker build -f docker/Dockerfile . will fail at the COPY step with a missing-file error for anyone who doesn't already have the patch in their working tree. Could you either commit docker/patches/mlflow-train.patch as part of this PR, or — if the patch is intentionally out-of-tree — add a short README section explaining how a contributor is supposed to obtain or regenerate it before docker build?
| | `scripts/sm-train-entrypoint.sh` | SageMaker BYOC entrypoint (parses `resourceconfig.json`, launches `torchrun`) | | ||
| | `scripts/distributed_run.bash` | AWS Batch distributed launcher (workshop path) | | ||
| | `templates/*.tpl` | Kubernetes manifest and Python script templates | |
There was a problem hiding this comment.
README still references the deleted distributed_run.bash
scripts/distributed_run.bash was removed from the PR (good — addressing round 1's dead-code finding), but the README's ## Source Files table still has the row referencing it. A contributor running ls scripts/ will find only sm-train-entrypoint.sh and reasonably wonder if the file is missing. Could you drop this row?
| | `scripts/sm-train-entrypoint.sh` | SageMaker BYOC entrypoint (parses `resourceconfig.json`, launches `torchrun`) | | |
| | `scripts/distributed_run.bash` | AWS Batch distributed launcher (workshop path) | | |
| | `templates/*.tpl` | Kubernetes manifest and Python script templates | | |
| | `scripts/sm-train-entrypoint.sh` | SageMaker BYOC entrypoint (parses `resourceconfig.json`, launches `torchrun`) | | |
| | `templates/*.tpl` | Kubernetes manifest and Python script templates | |
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 3/3 — Documentation Consistency + Wrap-up (round 2)
One partial carry-over inline below. Closing with what's working well after this round.
Things That Look Great
- The viz-pod simplification is a clear win — dropping the
npx @nvidia/create-ov-web-rtc-appruntime fetch and the multi-linesedchain removed both a supply-chain dependency on a non-Amazon registry and a brittle source-edit step in one go. - The new comment at
sm-train-entrypoint.sh:23-26explaining whyISAACLAB_INIT_LOCKis set ("Isaac Sim's internal pip env initialization is not process-safe... uses file-based locking to serialize the initialization") is exactly the kind of "explain the why right where it matters" comment that pays off in diff archaeology two years later. - The NCCL story in the SageMaker entrypoint is now both correct and self-documenting — the explicit "Do NOT set NCCL_SOCKET_IFNAME or NCCL_IB_DISABLE unless debugging" comment at line 14 is more useful than the round 1 anti-pattern would have been even if I'd just asked for the exclusion variant.
- Templating
--task=${TASK:-Isaac-Velocity-Rough-H1-v0}and wiringTASKthrough the SageMakerEnvironmentblock inlaunch-sm-training.py.tplis a tidy fix that keeps the three launch paths (Kubernetes / SM Training Job / fallback default) in sync. - The
.gitignoretrim landed in the cleanest possible shape — three entries, license header, no duplication. - Renaming
viz-scripts/README.txt→viz-scripts/README.mdand expanding it into a proper workshop-context doc is a quiet but real improvement — readers landing in the viz directory directly now have everything they need.
| # Note: best_agent.pt is the default checkpoint name for the skrl framework. | ||
| # For other frameworks (rsl_rl, rl_games, sb3), adjust the filename pattern. | ||
| CKPT=$(find ${VIZ_FSX_LOG_DIR} -name best_agent.pt -printf "%T@ %p\n" | sort -n | tail -1 | cut -d' ' -f2) | ||
| echo "Using checkpoint: $CKPT" | ||
| exec /isaac-sim/python.sh scripts/reinforcement_learning/skrl/play.py \ |
There was a problem hiding this comment.
best_agent.pt still hardcoded in viz pod template (with a documenting comment)
This is much better than round 1 — the new inline comment at lines 21-22 explains that best_agent.pt is the skrl default and that other frameworks need a different pattern, which addresses the discoverability half of the original concern. The other half is still there: the same pod template hardcodes play.py from scripts/reinforcement_learning/skrl/play.py (line 25), so the moment a user changes training.framework to rsl_rl / rl_games / sb3, the viz manifest silently keeps loading a skrl checkpoint and invoking the skrl player. Could you either (a) template both the checkpoint glob and the play.py path from ${FRAMEWORK} like the training-job template does, or (b) tighten the comment to "this viz pod only supports framework: skrl — regenerate manifests with a different framework only if you also customize this template" and add the same caveat to the README's framework-selection bullet? Option (a) is cleaner; option (b) is a 2-line doc change if templating turns out to be fiddly.
|
Almost done. Few remaining action items. |
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 1/2 — Scope & Resolution Status (round 3)
Prior reviews: round 1, round 2.
Summary
Two messages this round, in priority order. (1) Let's prioritize landing this PR — scope freeze on additions. The diff has grown to 17 files across Docker, K8s, SageMaker BYOC, viz pods, and workshop scripts. From here forward, please no new features on this branch — everything should be close-out work (fixing the open findings) so we can land it. (2) The patch → wrapper swap is part of close-out, not a feature. A unified-diff git apply against IsaacLab is the most fragile shape for the MLflow integration, and I confirmed via primary-source check that there is no upstream alternative to "just configure" — so the integration genuinely has to live in this PR, but it should live as a small wrapper module rather than a .patch file. This needs to be addressed in this PR, alongside the other open round-2 carry-overs (flat directory layout, implicit-skrl assumption in the viz pod). That's the complete close-out checklist.
Resolution Status of Prior Findings
| # | Finding | Status |
|---|---|---|
| 1 | Directory layout doesn't follow the test-case convention | ❌ Open — still docker/, templates/, scripts/, viz-scripts/ as flat siblings; no Dockerfile at root, no kubernetes/, no hyperpod-eks/, no training-jobs/ |
| 2 | Missing patch file referenced by Dockerfile and README | ✅ Resolved — docker/patches/mlflow-train.patch is now in the diff (129 lines). See the round-3 inline finding (Batch 2) about whether the patch itself is the right shape. |
| 6 | Dead-code / off-topic launcher (distributed_run.bash README row) |
✅ Resolved — README's Source Files table no longer references distributed_run.bash |
| 18 | best_agent.pt checkpoint name is implicit in viz template |
🟡 Partial — round-2 comment is in place; the glob and play.py path remain hardcoded to skrl |
Legend: ✅ Resolved · 🟡 Partial · ❌ Open · ➖ N/A (scope changed)
Prioritize closing this PR — no new features, finish the close-out list
The diff now spans 17 files and ~1,500 lines across five sub-systems (Docker image, Kubernetes manifests, SageMaker BYOC entrypoint, WebRTC viz pod, EC2 workshop helpers). At this size, reviewer attention is the bottleneck — every additional feature compounds review cost. I'd recommend an explicit scope freeze:
- Hard-stop on new features. No new Isaac task families (rsl_rl/rl_games/sb3 wiring beyond skrl), no additional visualization variants, no additional instance-type tuning beyond
ml.g6.12xlarge, no expansion of the workshop/DCV helpers. If any of those are valuable, they go in follow-up PRs after this lands. - Close-out work is in scope and expected. This is the bar for "ready to merge":
- Directory layout (#1) — flatten
docker/,templates/,scripts/,viz-scripts/into the repo convention (Dockerfileat root +kubernetes/+hyperpod-eks/+training-jobs/). - Viz pod skrl assumption (#18) — either template
${FRAMEWORK}through or tighten the README caveat. - Patch → wrapper swap (new this round, Batch 2 below) — replace
docker/patches/mlflow-train.patch+git applywith a real.pywrapper module.
- Directory layout (#1) — flatten
Item 3 is correctness/maintainability cleanup, not a new feature — it changes the shape of code that's already in the PR, doesn't add capability, and is the kind of thing that gets harder to fix the longer it lives in main.
Concretely, please (a) reply with a short comment listing any follow-ups you intend to file after merge so they're not forgotten, and (b) resist any reviewer ask (including future ones from me) that would expand the feature surface of this branch.
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 2/2 — Structure & Repository Hygiene (round 3)
One inline finding on the Dockerfile (patch → wrapper swap), then the close-out positives.
Things That Look Great
- The
mlflow-train.patchcontent itself is genuinely thoughtful — the "why notmlflow.config.enable_async_logging()?" comment block at the top of the patch is the kind of design-rationale comment that pays off when someone wonders why this exists at all. That prose deserves to land inmlflow_hook.pyverbatim regardless of which alternative you pick. - Bringing the README in line with the slimmed-down
scripts/directory (no moredistributed_run.bashrow) made the test case noticeably more navigable from round 2 — a contributor runningls scripts/and reading the README now sees a consistent story. - Pinning
mlflow-skinny==3.4.0,sagemaker-mlflow==0.3.0, andpynvml==11.5.3with an inline comment explaining the role of each is exactly the kind of "explain the why right where it matters" that future bumpers will thank you for.
| # Patch Isaac Lab's train.py to mirror TensorBoard scalars to MLflow in real | ||
| # time when MLFLOW_TRACKING_URI is set. The patch only touches train.py; if | ||
| # Isaac Lab is upgraded, regenerate it against the new file. | ||
| COPY docker/patches/mlflow-train.patch /tmp/mlflow-train.patch | ||
| RUN cd /workspace/IsaacLab && git apply /tmp/mlflow-train.patch && rm /tmp/mlflow-train.patch |
There was a problem hiding this comment.
Shipping a unified-diff patch isn't the right shape — wrap train.py instead (close-out blocker)
The Dockerfile currently does:
COPY docker/patches/mlflow-train.patch /tmp/mlflow-train.patch
RUN cd /workspace/IsaacLab && git apply /tmp/mlflow-train.patch && rm /tmp/mlflow-train.patch
Before discussing shape, the upstream question: I checked whether the patch is necessary at all. It is. IsaacLab v2.3.2 and main have zero MLflow references in any of the eight RL training entrypoints (scripts/reinforcement_learning/{skrl,rsl_rl,rl_games,sb3}/{train,play}.py); the only MLflow code in the repo is in the Ray-Tune sweep workflow (scripts/reinforcement_learning/ray/tuner.py), which is a different use case. None of the four upstream RL libraries (skrl 1.4.3, rsl_rl, rl_games, sb3) ships a native MLflow backend either — skrl has TensorBoard + W&B, rsl_rl has TensorBoard + W&B + Neptune (explicit ValueError on unknown values), rl_games is hard-coded to tensorboardX, and sb3 has no MLflowOutputFormat / MlflowCallback. So there is no MLFLOW_TRACKING_URI switch that "just works"; this PR is genuinely the integration point.
Given that, the question is purely about shape — the patch and a wrapper module do the same thing functionally; the wrapper is just review-friendlier. I'd push back on the patch shape for two reasons:
- It's fragile. A unified diff is matched against IsaacLab's
scripts/reinforcement_learning/skrl/train.pyby line numbers and surrounding context. The Dockerfile pins IsaacLab tov2.3.2today, but any future bump (security, CUDA stack, an upstream fix the user actually wants) becomes agit applyfailure that contributors have to debug — and the in-Dockerfile comment already calls this out ("if Isaac Lab is upgraded, regenerate it against the new file"). That's a maintenance trap, not a feature. - It's hard to review and lint. Of the 129 lines in the patch, only ~80 are actually new Python — the rest is
@@hunk headers, context lines, and+/-prefixes. Nothing in repo CI (pylint, flake8, bandit) runs against.patchfiles, so the MLflow integration code escapes static analysis entirely. A reviewer reading it has to mentally reconstruct what the file looks like after application before they can reason about it.
Recommended alternative — a runtime wrapper module. The patch only does two things: (a) monkey-patch SummaryWriter.add_scalar so every TB scalar is mirrored to MLflow on rank 0, and (b) close out the MLflow run after runner.run() returns. Both of those can live in a small Python file under your control, with train.py invoked through it. Concretely:
-
Drop
docker/patches/mlflow-train.patchentirely. -
Add a new file
scripts/mlflow_hook.py(orkubernetes/-relative once the directory-layout finding is addressed) that owns theSummaryWriter.add_scalarpatch, the background thread, and the run lifecycle — i.e. the exact same logic the patch installs intotrain.py, but in a real Python file you can review, lint, and unit-test. -
Change the SageMaker / Kubernetes entrypoint to call
train.pythrough the hook instead of directly:# scripts/run_train.py import runpy import mlflow_hook # installs SummaryWriter.add_scalar patch + atexit cleanup mlflow_hook.install() runpy.run_path( "/workspace/IsaacLab/scripts/reinforcement_learning/skrl/train.py", run_name="__main__", )
Then
sm-train-entrypoint.sh(and the PyTorchJob template) invokerun_train.pyrather thantrain.py. The CLI args (--task,--num_envs, …) are forwarded byrunpyviasys.argv, so nothing else changes. -
The Dockerfile loses the
COPY+RUN git applypair; it justCOPYs the hook + wrapper alongsidesm-train-entrypoint.sh(which it already does for the entrypoint at line 65). Bumping IsaacLab to a new tag no longer requires a patch regeneration step.
Two other shapes worth considering, in case the wrapper doesn't fit:
- Event-file sidecar. Run a separate background process (
mlflow_tb_forwarder.py &from the entrypoint) that tails the TensorBoard event-file directory IsaacLab already writes (tbparseortensorboard.backend.event_processing.EventAccumulator) and forwards scalars to MLflow. This decouples MLflow from IsaacLab entirely — zero modification, even at runtime. The trade-offs are an extra process, slightly higher latency (events have to flush to disk first), and a bit more code than the wrapper. - Upstream the hook. The cleanest long-term target is actually skrl rather than IsaacLab —
skrl.utils.runner.Runneralready acceptsexperiment.wandb = Trueviaagent_cfg, and adding a parallelexperiment.mlflowblock (whichIsaacLab/scripts/reinforcement_learning/skrl/train.pywould pick up unchanged because it passesagent_cfgstraight through) is a small, well-precedented contribution toToni-SM/skrl. There's no open issue or PR there on this yet — first contributor wins. Until that lands, the wrapper bridges the gap. (IsaacLab itself doesn't have any open MLflow issues or PRs requesting native RL logging — the question hasn't been asked publicly, so there's neither a "yes, PR welcome" nor a "no, won't take it" maintainer signal to lean on.)
My recommendation is option (1) — the wrapper. It's the smallest change from where you are now (the hook code already exists in the patch; it just needs to move out of unified-diff form into a .py file), it removes the fragile git apply, and it puts the MLflow integration code on the same footing as the rest of the test case for review and CI purposes.
On scope: this is close-out work, not a new feature — it changes the shape of code that's already in the PR without adding capability. Please land it in this branch alongside the other open round-2 carry-overs rather than deferring. Mechanically the swap is small: (a) extract the patch's ~80 lines of new Python into kubernetes/mlflow_hook.py (or training-jobs/mlflow_hook.py once the directory-layout finding is addressed), (b) add a 5-line run_train.py that calls mlflow_hook.install() then runpy.run_path(".../train.py", run_name="__main__"), (c) delete docker/patches/, (d) change the Dockerfile to COPY the two new .py files and drop the COPY ... .patch + RUN git apply lines, (e) point sm-train-entrypoint.sh and templates/training-job.yaml.tpl at run_train.py instead of train.py. The skrl-upstream-PR idea is a genuine follow-up — that one is out of scope for this branch.
There was a problem hiding this comment.
@KeitaW , the value proposition here is to showcase end-end integration with SM Managed MLflow for tracking the experiment runs hence giving users the option to directly integrate with this other than relying on self-managed or 3rd-party tracking servers
Restored the web-viewer sidecar to match the blog browser-based workflow with krelay + http://localhost:8210 , plus fixed the media port from 48010 to 47998 to align with the krelay command and the viewer's sed config. Removed the ephemeral-storage: 100Gi limit that prevented scheduling on nodes with <89Gi allocatable e.g on the g6e.12xlarge.
|
Following up on the round-3 Batch 2 finding — here's a skeleton for the wrapper alternative so the swap is a copy-paste rather than a green-field design exercise. Behaviour is line-by-line equivalent to The one shape change is that run metadata the patch reads from
|
Purpose
Adds a new test case for distributed reinforcement learning training using NVIDIA Isaac Lab on SageMaker HyperPod EKS and SageMaker Training Jobs. This demonstrates multi-node GPU-accelerated robot simulation (PPO training of the Unitree H1 humanoid)
Changes
3.test_cases/pytorch/isaaclab/generate.py) for Kubernetes and SageMaker Training Jobsnvcr.io/nvidia/isaac-sim:5.1.0with Isaac Lab v2.3.2 and skrl 1.4.3resourceconfig.jsonfor multi-nodetorchrunTest Plan
Environment:
Test commands:
Test Results
best_agent.pt(4.8 MB)SucceededDirectory Structure
Dockerfile,README.md, training scripts, configs) cover general setup.slurm/,kubernetes/,hyperpod-eks/) contain service-specific launch instructions.Checklist
mainbranch.latest).