Adding Nvidia Isaac Lab sample by allela-roy · Pull Request #1083 · awslabs/awsome-distributed-ai

allela-roy · 2026-05-06T10:25:32Z

Purpose

Adds a new test case for distributed reinforcement learning training using NVIDIA Isaac Lab on SageMaker HyperPod EKS and SageMaker Training Jobs. This demonstrates multi-node GPU-accelerated robot simulation (PPO training of the Unitree H1 humanoid)

Changes

Add Isaac Lab distributed RL training test case under 3.test_cases/pytorch/isaaclab/
Config-driven manifest generation (generate.py) for Kubernetes and SageMaker Training Jobs
Dockerfile based on nvcr.io/nvidia/isaac-sim:5.1.0 with Isaac Lab v2.3.2 and skrl 1.4.3
MLflow integration patch for batched metric logging (5s overhead vs 10min with per-metric calls)
SageMaker BYOC entrypoint that parses resourceconfig.json for multi-node torchrun
Kubeflow PyTorchJob template supporting single- and multi-node distributed training
TensorBoard and WebRTC visualization pod templates

Test Plan

Environment:

AWS Service: SageMaker HyperPod EKS
Instance type: ml.g6.12xlarge (4x NVIDIA L4)
Number of nodes: 2

Test commands:

# Generate manifests
python generate.py --config config.yaml

# Deploy 2-node PyTorchJob
kubectl apply -f generated/training-job.yaml

# Monitor
kubectl get pytorchjobs
kubectl logs -f isaaclab-h1-master-0

# Validate SageMaker Training Job API
python generated/launch-sm-training.py --dry-run

# Launch SageMaker Training Job
python generated/launch-sm-training.py

# MLflow integration test
python -c "
import mlflow
mlflow.set_tracking_uri('arn:aws:sagemaker:us-east-1:159553542841:mlflow-tracking-server/ml-mlflow-workload-us-east-1')
mlflow.set_experiment('isaaclab-h1')
run = mlflow.start_run(run_name='test')
mlflow.log_params({'task': 'Isaac-Velocity-Rough-H1-v0', 'num_envs': '4096'})
mlflow.end_run()
"

Test Results

Training completed in ~63 seconds (1200 timesteps / 50 policy updates)
8 GPUs detected across 2 nodes (4x NVIDIA L4 per node)
Checkpoints saved to FSx: 11 files including best_agent.pt (4.8 MB)
Worker rendezvous with master succeeded via DNS-based init container
PyTorchJob final state: Succeeded

Directory Structure

3.test_cases/
└── <framework>/                # e.g. pytorch, megatron, jax
    └── <library>/              # e.g. picotron, FSDP, megatron-lm
        └── <model>/            # e.g. SmolLM-1.7B (may be omitted for single-model cases)
            ├── Dockerfile      # Container / environment setup
            ├── README.md       # Overview, prerequisites, usage
            ├── slurm/          # Slurm-specific launch scripts
            ├── kubernetes/     # Kubernetes manifests
            └── hyperpod-eks/   # HyperPod EKS instructions

Top-level files (Dockerfile, README.md, training scripts, configs) cover general setup.
Subdirectories (slurm/, kubernetes/, hyperpod-eks/) contain service-specific launch instructions.
Not all service subdirectories are required — include only the ones relevant to your test case.

Checklist

I have read the contributing guidelines.
I am working against the latest main branch.
I have searched existing open and recently merged PRs to confirm this is not a duplicate.
The contribution is self-contained with documentation and scripts.
External dependencies are pinned to a specific version or tag (no latest).
A README is included or updated with prerequisites, instructions, and known issues.
New test cases follow the expected directory structure.

KeitaW

Review Batch 1/5 — Structure & Repository Hygiene

Thanks for putting this together — there's a lot of nice machinery in here. The cross-cutting structural items below sit at the body level since they don't map to a single line.

Directory layout doesn't follow the test-case convention

File: 3.test_cases/pytorch/nvidia-isaac-lab/ (whole tree)

The repo convention for a test case (encoded in CONTRIBUTING.md and the PR template's "Directory Structure" section) is:

3.test_cases/<framework>/<library>/[<model>/]
    ├── Dockerfile
    ├── README.md
    ├── slurm/
    ├── kubernetes/
    └── hyperpod-eks/

The PR currently uses a flat layout with custom directory names instead:

3.test_cases/pytorch/nvidia-isaac-lab/
├── docker/Dockerfile          # should be ./Dockerfile at the test-case root
├── README.md
├── config.yaml.example
├── generate.py
├── scripts/                   # Slurm/K8s launchers — not a convention dir
├── templates/                 # K8s manifests live here — should be kubernetes/
└── viz-scripts/               # workshop-DCV — not a convention dir

Concretely the things I'd ask for:

Move docker/Dockerfile → Dockerfile at the test-case root and update COPY docker/patches/... paths accordingly. The lint rule expects Dockerfile at the root.
Rename templates/ → kubernetes/. The generator-based design is fine — the constraint is just that anything K8s lives under kubernetes/.
Add a hyperpod-eks/ subdirectory holding the HyperPod-EKS-specific launch flow (today it's intermixed into the top-level README + scripts/ + templates/). Even a thin README in hyperpod-eks/ that points at ../generate.py and the relevant template names is enough to satisfy the convention.
Add a training-jobs/ subdirectory for the SageMaker Training Jobs path. The Training-Jobs-specific assets (scripts/sm-train-entrypoint.sh, templates/launch-sm-training.py.tpl, and the corresponding README sections) should live here as a sibling to hyperpod-eks/.
viz-scripts/ is workshop-DCV-specific and doesn't map to a convention dir. Either move it under a clearly-scoped workshop/ subdir or drop it from this PR and ship it separately — see the Documentation Consistency batch for context.
There's no Slurm path here, so slurm/ is correctly omitted — no action needed.

Missing patch file referenced by Dockerfile and README (build will fail)

File: 3.test_cases/pytorch/nvidia-isaac-lab/docker/Dockerfile (line 56)

The Dockerfile copies docker/patches/mlflow-train.patch and applies it with git apply, and the README documents the patch in two places (## MLflow Integration (Optional) and the ## Source Files table). However, the patch file itself is not in this PR's diff, so docker build -f docker/Dockerfile . will fail at the COPY step with a missing-file error. Could you commit docker/patches/mlflow-train.patch as part of this PR? If the patch is intentionally out-of-tree (e.g., generated at build time), I'd love a brief note in the README explaining how a contributor is supposed to obtain it before building.

KeitaW

Review Batch 2/5 — Deployment Pipeline (K8s / Slurm)

Remove `NCCL_SOCKET_IFNAME=eth0` and `NCCL_IB_DISABLE=1` — `g6.12xlarge` supports EFA

Files:

scripts/sm-train-entrypoint.sh lines 10–12 (# g6 instances don't have EFA... comment, export NCCL_SOCKET_IFNAME=eth0, export NCCL_IB_DISABLE=1)
templates/training-job.yaml.tpl lines 63 and 144 (value: "eth0")

My original framing here was wrong — I claimed the issue was positive vs. exclusion selection, but the deeper issue is that the override is opting out of EFA on EFA-capable hardware based on a false premise. Per the AWS EFA supported instance types page, g6.8xlarge, g6.12xlarge, g6.16xlarge, g6.24xlarge, and g6.48xlarge all support EFA (Nitro v4, RDMA read + write). Only the smallest g6 sizes (xlarge, 2xlarge, 4xlarge) lack EFA. The comment in the entrypoint ("g6 instances don't have EFA") is therefore incorrect for the size used here.

Action: drop NCCL_IB_DISABLE=1 and the NCCL_SOCKET_IFNAME=eth0 line in the entrypoint, drop the equivalent value: "eth0" entries in templates/training-job.yaml.tpl (lines 63 and 144), and let NCCL auto-detect EFA. With aws-ofi-nccl already on the image, NCCL will pick up SRD over EFA without further config. Add the EFA device plugin and an EFA-allowing security-group rule between HyperPod nodes to the README's HyperPod EKS prerequisites so the cluster side is set up correctly.

This supersedes the earlier exclusion-pattern suggestion — the cleanest outcome is to remove the override entirely rather than rewrite it. See also the standalone follow-up review (#1083 (review)) covering the same point on the entrypoint.

KeitaW

Review Batch 3/5 — Training Code Quality

Three inline findings on the SageMaker entrypoint script.

KeitaW

Review Batch 4/5 — Infrastructure & NCCL Configuration

Default config doesn't match the tested config

File: 3.test_cases/pytorch/nvidia-isaac-lab/config.yaml.example — and the PR test plan.

The PR description says testing was done on ml.g6e.12xlarge (4× NVIDIA L40S) but the example config defaults to ml.g6.12xlarge (4× NVIDIA L4). Both are listed as RT-Core-compatible in the README, and either will probably work, but a contributor copying the example will get a different GPU and somewhat different perf characteristics from what's documented in the test results. Could you either bump the default to the tested SKU, or add a short note in the example/README explaining that the test results were measured on ml.g6e.12xlarge?

KeitaW

Review Batch 5/5 — Documentation Consistency

Things That Look Great

Config-driven generation with string.Template.safe_substitute is a clean pattern — auto-derived ECR URI and S3 paths in generate.py save the user a meaningful amount of fiddling.
The MLflow integration design (background-thread batching with documented overhead numbers — "~5s on a 87s baseline vs 10+ minutes per-metric") is a nicely surfaced piece of engineering rationale; future readers will appreciate knowing why the patch exists.
The Dockerfile's surgical sed pins for flatdict, numpy, torch, onnx, etc. with an inline comment linking to upstream IsaacLab#5388 is exactly the kind of "explain the why right where it matters" comment that makes diff archaeology easier.
The README's Supported Instances table that explicitly calls out why P-family instances are unsupported (no RT cores) is genuinely helpful — most test cases just say "use this instance type" and leave the user guessing.
Cross-node connectivity preflight in sm-train-entrypoint.sh (master-port reachability + ping) is a thoughtful debuggability touch for the SageMaker BYOC path where networking issues are otherwise opaque.
EnableInterContainerTrafficEncryption: True in the SageMaker job config is good security hygiene and not always remembered.

Thanks again for the contribution — none of the items above are dealbreakers, just things to tighten before merge.

KeitaW

Few comments

KeitaW

Follow-up — EFA capability on g6.12xlarge

Following up on the NCCL_SOCKET_IFNAME finding from the earlier batch — there's a deeper issue with the surrounding lines that's worth surfacing on its own.

Per the AWS EFA supported instance types page, g6.8xlarge, g6.12xlarge, g6.16xlarge, g6.24xlarge, and g6.48xlarge all support EFA (Nitro v4, with both RDMA read and RDMA write). The smaller g6 sizes (xlarge, 2xlarge, 4xlarge) don't, but the test case targets g6.12xlarge, which does. The comment in the entrypoint ("g6 instances don't have EFA") is therefore factually incorrect for the size used here, and the NCCL_IB_DISABLE=1 + NCCL_SOCKET_IFNAME=eth0 pair is opting out of EFA on EFA-capable hardware based on a false premise.

Action: drop NCCL_IB_DISABLE=1 and the NCCL_SOCKET_IFNAME=eth0 line in the entrypoint, drop the equivalent value: "eth0" entries in templates/training-job.yaml.tpl (lines 63 and 144), and let NCCL auto-detect EFA. With aws-ofi-nccl already on the image, NCCL picks up SRD over EFA without further config. Add the EFA device plugin and an EFA-allowing security-group rule between HyperPod nodes to the README's HyperPod EKS prerequisites so contributors know what cluster-side setup is required.

The earlier NCCL_SOCKET_IFNAME exclusion-pattern finding only applies if any IFNAME override is kept; the cleaner outcome is to remove it entirely and let NCCL handle interface selection.

- sm-train-entrypoint.sh: symlink /workspace/IsaacLab/logs to /opt/ml/checkpoints so SageMaker syncs checkpoints to S3 continuously during training. On a resumed job, SageMaker restores the previous checkpoints before the container starts; the entrypoint finds best_agent.pt and passes --checkpoint to train.py. - launch-sm-training.py.tpl: add CheckpointConfig with checkpoint_s3_uri and checkpoint_local_path=/opt/ml/checkpoints. - config.yaml.example + generate.py: add sagemaker_training.checkpoint_s3_path and checkpoint_prefix fields. The prefix scopes checkpoints per experiment so a new run does not accidentally resume from a different experiment's checkpoint.

allela-roy · 2026-05-12T09:44:14Z

Thanks for the thorough review @KeitaW . I have addressed the issues. Please review.

KeitaW

Review Batch 1/3 — Structure & Repository Hygiene (round 2)

A lot of cleanup since round 1 — most of the smaller findings (NCCL positive selection, :latest tags, NCCL_DEBUG_SUBSYS verbosity, hardcoded task name, Python prerequisite, license headers, .gitignore trim) are fully resolved, and the viz-pod simplification dropped both the third-party npx runtime fetch and the sed-on-source patch chain entirely — that's a real improvement. The three carry-over blockers are: the directory layout still doesn't match the repo's test-case convention, docker/patches/mlflow-train.patch is still referenced by both the Dockerfile and the README but not in the diff (so docker build will fail for anyone outside the author's local tree), and scripts/distributed_run.bash was deleted but the README's Source Files table still lists it. A small new regression: the viz-scripts/*.sh files lost their set -e and now run with no error trapping at all.

Resolution Status of Prior Findings

#	Finding	Status
1	Directory layout doesn't follow the test-case convention	❌ Open — still flat (`docker/`, `templates/`, `scripts/`, `viz-scripts/`) instead of `Dockerfile` at root + `kubernetes/` + `hyperpod-eks/` + `training-jobs/`
2	Missing patch file referenced by Dockerfile and README (build will fail)	❌ Open — `docker/Dockerfile:58` still `COPY docker/patches/mlflow-train.patch` and README still references it; patch not in diff
3	Missing MIT-0 license headers across the new test case	✅ Resolved — headers present in `.gitignore`, `Dockerfile`, `generate.py`, `config.yaml.example`, all `templates/.tpl`, all `scripts/.sh`, all `viz-scripts/*.sh`, and `viz-scripts/README.md`
4	`:latest` tag pinned in viz-scripts	✅ Resolved — now `${ISAAC_IMAGE:-isaaclab-sagemaker:5.1.0}` across all three scripts
5	Local `.gitignore` mostly duplicates the repo root	✅ Resolved — trimmed to `generated/` + `config.yaml`
6	Dead-code / off-topic launcher (`distributed_run.bash`)	🟡 Partial — file deleted from the PR, but README's `Source Files` table at line 210 still lists it
7	`NCCL_SOCKET_IFNAME=eth0` uses positive selection	✅ Resolved — removed entirely; entrypoint now carries an explanatory comment that NCCL should auto-detect, and `training-job.yaml.tpl` no longer sets it
8	Shell scripts use `set -e` instead of the repo floor	🟡 Partial — `scripts/sm-train-entrypoint.sh:4` is still `set -e`; the three `viz-scripts/*.sh` files now have no `set` at all, which is worse than before
9	Runtime download from a third-party artifactory in the viz pod	✅ Resolved — `npx @nvidia/create-ov-web-rtc-app` step removed; the viz pod is now ~80 lines of plain pod spec running `play.py` with `--livestream 2`
10	The `sed` chain may be patching values that are already the default	✅ Resolved — entire `sed` block removed with the `npx` step
11	Hardcoded task name in SageMaker entrypoint	✅ Resolved — now `--task=${TASK:-Isaac-Velocity-Rough-H1-v0}` and `TASK` is wired through `launch-sm-training.py.tpl`'s `Environment` block
12	`NCCL_DEBUG_SUBSYS=ALL` is debug-only verbosity for a default	✅ Resolved — line removed, only `NCCL_DEBUG=INFO` remains
13	`ISAACLAB_INIT_LOCK` is exported but appears unused	✅ Resolved — comment block at `sm-train-entrypoint.sh:23-26` now explains Isaac Sim reads this env var for file-based locking to serialize per-node pip env initialization
14	Default config doesn't match the tested config	✅ Resolved — PR body and `config.yaml.example` both now use `ml.g6.12xlarge`
15	`Source Files` table doesn't list the missing patch	➖ N/A — the row is still in the table; depends on whether finding #2 is resolved by adding the patch or by removing the reference
16	Python prerequisite is too lax	✅ Resolved — bumped to `Python 3.10+`
17	Workshop/DCV context for `viz-scripts/` not declared in the main README	✅ Resolved — `viz-scripts/README.txt` renamed to `viz-scripts/README.md` with prerequisites, and the main README's `Source Files` row now reads `DCV visualization helpers for EC2 workshop path`
18	`best_agent.pt` checkpoint name is implicit	🟡 Partial — comment now documents that the filename is the skrl default and that other frameworks need a different pattern; still hardcoded rather than templated from `${FRAMEWORK}`

Legend: ✅ Resolved · 🟡 Partial · ❌ Open · ➖ N/A (scope changed)

Directory layout still doesn't follow the test-case convention

File: 3.test_cases/pytorch/nvidia-isaac-lab/ (whole tree)

Carrying this over from round 1 — the layout is unchanged. The repo convention (encoded in CONTRIBUTING.md and the PR template's "Directory Structure" section) puts Dockerfile at the test-case root and groups service-specific assets under kubernetes/, hyperpod-eks/, training-jobs/, etc., so a contributor landing in any test case can find "the K8s manifest" or "the HyperPod instructions" without re-reading a per-test-case README. Today's layout still uses docker/Dockerfile, templates/*.tpl, scripts/sm-train-entrypoint.sh, and viz-scripts/ as flat sibling directories.

Concretely I'd still ask for:

Move docker/Dockerfile → Dockerfile at the test-case root and update the COPY docker/patches/... path accordingly.
Rename templates/ → kubernetes/ (or kubernetes/templates/ if you want to keep the generator output separate from raw YAML). The generator-based design is fine — only the directory name needs to change.
Add a hyperpod-eks/ subdirectory with a short README that points at ../generate.py and the relevant template names, so the HyperPod path is discoverable from the convention.
Add a training-jobs/ subdirectory holding sm-train-entrypoint.sh and launch-sm-training.py.tpl with its own README — symmetric to hyperpod-eks/.
viz-scripts/ is workshop-DCV-specific and doesn't map to a convention dir; either move it under a clearly-scoped workshop/ subdir or split it off into a follow-up PR.

If keeping the flat layout is a hard preference for the workshop story, I'd love a brief note in the PR description explaining the trade-off so future contributors know whether to mirror this layout or treat it as an exception.

Missing patch file referenced by Dockerfile and README (build will still fail)

File: 3.test_cases/pytorch/nvidia-isaac-lab/docker/Dockerfile (line 58)

Same as round 1 — the Dockerfile still does COPY docker/patches/mlflow-train.patch /tmp/mlflow-train.patch and the README documents the patch in two places (## MLflow Integration (Optional) and the ## Source Files table row), but the patch file itself is not in this PR's diff. docker build -f docker/Dockerfile . will fail at the COPY step with a missing-file error for anyone who doesn't already have the patch in their working tree. Could you either commit docker/patches/mlflow-train.patch as part of this PR, or — if the patch is intentionally out-of-tree — add a short README section explaining how a contributor is supposed to obtain or regenerate it before docker build?

KeitaW · 2026-05-15T06:59:27Z

+| `scripts/sm-train-entrypoint.sh` | SageMaker BYOC entrypoint (parses `resourceconfig.json`, launches `torchrun`) |
+| `scripts/distributed_run.bash` | AWS Batch distributed launcher (workshop path) |
+| `templates/*.tpl` | Kubernetes manifest and Python script templates |


README still references the deleted distributed_run.bash

scripts/distributed_run.bash was removed from the PR (good — addressing round 1's dead-code finding), but the README's ## Source Files table still has the row referencing it. A contributor running ls scripts/ will find only sm-train-entrypoint.sh and reasonably wonder if the file is missing. Could you drop this row?

Suggested change

| `scripts/sm-train-entrypoint.sh` | SageMaker BYOC entrypoint (parses `resourceconfig.json`, launches `torchrun`) |

| `scripts/distributed_run.bash` | AWS Batch distributed launcher (workshop path) |

| `templates/*.tpl` | Kubernetes manifest and Python script templates |

| `scripts/sm-train-entrypoint.sh` | SageMaker BYOC entrypoint (parses `resourceconfig.json`, launches `torchrun`) |

| `templates/*.tpl` | Kubernetes manifest and Python script templates |

KeitaW

Review Batch 3/3 — Documentation Consistency + Wrap-up (round 2)

One partial carry-over inline below. Closing with what's working well after this round.

Things That Look Great

The viz-pod simplification is a clear win — dropping the npx @nvidia/create-ov-web-rtc-app runtime fetch and the multi-line sed chain removed both a supply-chain dependency on a non-Amazon registry and a brittle source-edit step in one go.
The new comment at sm-train-entrypoint.sh:23-26 explaining why ISAACLAB_INIT_LOCK is set ("Isaac Sim's internal pip env initialization is not process-safe... uses file-based locking to serialize the initialization") is exactly the kind of "explain the why right where it matters" comment that pays off in diff archaeology two years later.
The NCCL story in the SageMaker entrypoint is now both correct and self-documenting — the explicit "Do NOT set NCCL_SOCKET_IFNAME or NCCL_IB_DISABLE unless debugging" comment at line 14 is more useful than the round 1 anti-pattern would have been even if I'd just asked for the exclusion variant.
Templating --task=${TASK:-Isaac-Velocity-Rough-H1-v0} and wiring TASK through the SageMaker Environment block in launch-sm-training.py.tpl is a tidy fix that keeps the three launch paths (Kubernetes / SM Training Job / fallback default) in sync.
The .gitignore trim landed in the cleanest possible shape — three entries, license header, no duplication.
Renaming viz-scripts/README.txt → viz-scripts/README.md and expanding it into a proper workshop-context doc is a quiet but real improvement — readers landing in the viz directory directly now have everything they need.

KeitaW · 2026-05-15T06:59:40Z

+      # Note: best_agent.pt is the default checkpoint name for the skrl framework.
+      # For other frameworks (rsl_rl, rl_games, sb3), adjust the filename pattern.
+      CKPT=$(find ${VIZ_FSX_LOG_DIR} -name best_agent.pt -printf "%T@ %p\n" | sort -n | tail -1 | cut -d' ' -f2)
+      echo "Using checkpoint: $CKPT"
+      exec /isaac-sim/python.sh scripts/reinforcement_learning/skrl/play.py \


best_agent.pt still hardcoded in viz pod template (with a documenting comment)

This is much better than round 1 — the new inline comment at lines 21-22 explains that best_agent.pt is the skrl default and that other frameworks need a different pattern, which addresses the discoverability half of the original concern. The other half is still there: the same pod template hardcodes play.py from scripts/reinforcement_learning/skrl/play.py (line 25), so the moment a user changes training.framework to rsl_rl / rl_games / sb3, the viz manifest silently keeps loading a skrl checkpoint and invoking the skrl player. Could you either (a) template both the checkpoint glob and the play.py path from ${FRAMEWORK} like the training-job template does, or (b) tighten the comment to "this viz pod only supports framework: skrl — regenerate manifests with a different framework only if you also customize this template" and add the same caveat to the README's framework-selection bullet? Option (a) is cleaner; option (b) is a 2-line doc change if templating turns out to be fiddly.

Fixed with option (b)

KeitaW · 2026-05-15T07:01:29Z

Almost done. Few remaining action items.

KeitaW

Review Batch 1/2 — Scope & Resolution Status (round 3)

Prior reviews: round 1, round 2.

Summary

Two messages this round, in priority order. (1) Let's prioritize landing this PR — scope freeze on additions. The diff has grown to 17 files across Docker, K8s, SageMaker BYOC, viz pods, and workshop scripts. From here forward, please no new features on this branch — everything should be close-out work (fixing the open findings) so we can land it. (2) The patch → wrapper swap is part of close-out, not a feature. A unified-diff git apply against IsaacLab is the most fragile shape for the MLflow integration, and I confirmed via primary-source check that there is no upstream alternative to "just configure" — so the integration genuinely has to live in this PR, but it should live as a small wrapper module rather than a .patch file. This needs to be addressed in this PR, alongside the other open round-2 carry-overs (flat directory layout, implicit-skrl assumption in the viz pod). That's the complete close-out checklist.

Resolution Status of Prior Findings

#	Finding	Status
1	Directory layout doesn't follow the test-case convention	❌ Open — still `docker/`, `templates/`, `scripts/`, `viz-scripts/` as flat siblings; no `Dockerfile` at root, no `kubernetes/`, no `hyperpod-eks/`, no `training-jobs/`
2	Missing patch file referenced by Dockerfile and README	✅ Resolved — `docker/patches/mlflow-train.patch` is now in the diff (129 lines). See the round-3 inline finding (Batch 2) about whether the patch itself is the right shape.
6	Dead-code / off-topic launcher (`distributed_run.bash` README row)	✅ Resolved — README's `Source Files` table no longer references `distributed_run.bash`
18	`best_agent.pt` checkpoint name is implicit in viz template	🟡 Partial — round-2 comment is in place; the glob and `play.py` path remain hardcoded to skrl

Legend: ✅ Resolved · 🟡 Partial · ❌ Open · ➖ N/A (scope changed)

Prioritize closing this PR — no new features, finish the close-out list

The diff now spans 17 files and ~1,500 lines across five sub-systems (Docker image, Kubernetes manifests, SageMaker BYOC entrypoint, WebRTC viz pod, EC2 workshop helpers). At this size, reviewer attention is the bottleneck — every additional feature compounds review cost. I'd recommend an explicit scope freeze:

Hard-stop on new features. No new Isaac task families (rsl_rl/rl_games/sb3 wiring beyond skrl), no additional visualization variants, no additional instance-type tuning beyond ml.g6.12xlarge, no expansion of the workshop/DCV helpers. If any of those are valuable, they go in follow-up PRs after this lands.
Close-out work is in scope and expected. This is the bar for "ready to merge":
1. Directory layout (#1) — flatten docker/, templates/, scripts/, viz-scripts/ into the repo convention (Dockerfile at root + kubernetes/ + hyperpod-eks/ + training-jobs/).
2. Viz pod skrl assumption (#18) — either template ${FRAMEWORK} through or tighten the README caveat.
3. Patch → wrapper swap (new this round, Batch 2 below) — replace docker/patches/mlflow-train.patch + git apply with a real .py wrapper module.

Item 3 is correctness/maintainability cleanup, not a new feature — it changes the shape of code that's already in the PR, doesn't add capability, and is the kind of thing that gets harder to fix the longer it lives in main.

Concretely, please (a) reply with a short comment listing any follow-ups you intend to file after merge so they're not forgotten, and (b) resist any reviewer ask (including future ones from me) that would expand the feature surface of this branch.

KeitaW

Review Batch 2/2 — Structure & Repository Hygiene (round 3)

One inline finding on the Dockerfile (patch → wrapper swap), then the close-out positives.

Things That Look Great

The mlflow-train.patch content itself is genuinely thoughtful — the "why not mlflow.config.enable_async_logging()?" comment block at the top of the patch is the kind of design-rationale comment that pays off when someone wonders why this exists at all. That prose deserves to land in mlflow_hook.py verbatim regardless of which alternative you pick.
Bringing the README in line with the slimmed-down scripts/ directory (no more distributed_run.bash row) made the test case noticeably more navigable from round 2 — a contributor running ls scripts/ and reading the README now sees a consistent story.
Pinning mlflow-skinny==3.4.0, sagemaker-mlflow==0.3.0, and pynvml==11.5.3 with an inline comment explaining the role of each is exactly the kind of "explain the why right where it matters" that future bumpers will thank you for.

KeitaW · 2026-05-17T00:04:27Z

+# Patch Isaac Lab's train.py to mirror TensorBoard scalars to MLflow in real
+# time when MLFLOW_TRACKING_URI is set. The patch only touches train.py; if
+# Isaac Lab is upgraded, regenerate it against the new file.
+COPY docker/patches/mlflow-train.patch /tmp/mlflow-train.patch
+RUN cd /workspace/IsaacLab && git apply /tmp/mlflow-train.patch && rm /tmp/mlflow-train.patch


Shipping a unified-diff patch isn't the right shape — wrap train.py instead (close-out blocker)

The Dockerfile currently does:

COPY docker/patches/mlflow-train.patch /tmp/mlflow-train.patch RUN cd /workspace/IsaacLab && git apply /tmp/mlflow-train.patch && rm /tmp/mlflow-train.patch

Before discussing shape, the upstream question: I checked whether the patch is necessary at all. It is. IsaacLab v2.3.2 and main have zero MLflow references in any of the eight RL training entrypoints (scripts/reinforcement_learning/{skrl,rsl_rl,rl_games,sb3}/{train,play}.py); the only MLflow code in the repo is in the Ray-Tune sweep workflow (scripts/reinforcement_learning/ray/tuner.py), which is a different use case. None of the four upstream RL libraries (skrl 1.4.3, rsl_rl, rl_games, sb3) ships a native MLflow backend either — skrl has TensorBoard + W&B, rsl_rl has TensorBoard + W&B + Neptune (explicit ValueError on unknown values), rl_games is hard-coded to tensorboardX, and sb3 has no MLflowOutputFormat / MlflowCallback. So there is no MLFLOW_TRACKING_URI switch that "just works"; this PR is genuinely the integration point.

Given that, the question is purely about shape — the patch and a wrapper module do the same thing functionally; the wrapper is just review-friendlier. I'd push back on the patch shape for two reasons:

It's fragile. A unified diff is matched against IsaacLab's scripts/reinforcement_learning/skrl/train.py by line numbers and surrounding context. The Dockerfile pins IsaacLab to v2.3.2 today, but any future bump (security, CUDA stack, an upstream fix the user actually wants) becomes a git apply failure that contributors have to debug — and the in-Dockerfile comment already calls this out ("if Isaac Lab is upgraded, regenerate it against the new file"). That's a maintenance trap, not a feature.

It's hard to review and lint. Of the 129 lines in the patch, only ~80 are actually new Python — the rest is @@ hunk headers, context lines, and +/- prefixes. Nothing in repo CI (pylint, flake8, bandit) runs against .patch files, so the MLflow integration code escapes static analysis entirely. A reviewer reading it has to mentally reconstruct what the file looks like after application before they can reason about it.

Recommended alternative — a runtime wrapper module. The patch only does two things: (a) monkey-patch SummaryWriter.add_scalar so every TB scalar is mirrored to MLflow on rank 0, and (b) close out the MLflow run after runner.run() returns. Both of those can live in a small Python file under your control, with train.py invoked through it. Concretely:

Drop docker/patches/mlflow-train.patch entirely.

Add a new file scripts/mlflow_hook.py (or kubernetes/-relative once the directory-layout finding is addressed) that owns the SummaryWriter.add_scalar patch, the background thread, and the run lifecycle — i.e. the exact same logic the patch installs into train.py, but in a real Python file you can review, lint, and unit-test.

Change the SageMaker / Kubernetes entrypoint to call train.py through the hook instead of directly:

# scripts/run_train.py import runpy import mlflow_hook # installs SummaryWriter.add_scalar patch + atexit cleanup mlflow_hook.install() runpy.run_path( "/workspace/IsaacLab/scripts/reinforcement_learning/skrl/train.py", run_name="__main__", )

Then sm-train-entrypoint.sh (and the PyTorchJob template) invoke run_train.py rather than train.py. The CLI args (--task, --num_envs, …) are forwarded by runpy via sys.argv, so nothing else changes.

The Dockerfile loses the COPY + RUN git apply pair; it just COPYs the hook + wrapper alongside sm-train-entrypoint.sh (which it already does for the entrypoint at line 65). Bumping IsaacLab to a new tag no longer requires a patch regeneration step.

Two other shapes worth considering, in case the wrapper doesn't fit:

Event-file sidecar. Run a separate background process (mlflow_tb_forwarder.py & from the entrypoint) that tails the TensorBoard event-file directory IsaacLab already writes (tbparse or tensorboard.backend.event_processing.EventAccumulator) and forwards scalars to MLflow. This decouples MLflow from IsaacLab entirely — zero modification, even at runtime. The trade-offs are an extra process, slightly higher latency (events have to flush to disk first), and a bit more code than the wrapper.

Upstream the hook. The cleanest long-term target is actually skrl rather than IsaacLab — skrl.utils.runner.Runner already accepts experiment.wandb = True via agent_cfg, and adding a parallel experiment.mlflow block (which IsaacLab/scripts/reinforcement_learning/skrl/train.py would pick up unchanged because it passes agent_cfg straight through) is a small, well-precedented contribution to Toni-SM/skrl. There's no open issue or PR there on this yet — first contributor wins. Until that lands, the wrapper bridges the gap. (IsaacLab itself doesn't have any open MLflow issues or PRs requesting native RL logging — the question hasn't been asked publicly, so there's neither a "yes, PR welcome" nor a "no, won't take it" maintainer signal to lean on.)

My recommendation is option (1) — the wrapper. It's the smallest change from where you are now (the hook code already exists in the patch; it just needs to move out of unified-diff form into a .py file), it removes the fragile git apply, and it puts the MLflow integration code on the same footing as the rest of the test case for review and CI purposes.

On scope: this is close-out work, not a new feature — it changes the shape of code that's already in the PR without adding capability. Please land it in this branch alongside the other open round-2 carry-overs rather than deferring. Mechanically the swap is small: (a) extract the patch's ~80 lines of new Python into kubernetes/mlflow_hook.py (or training-jobs/mlflow_hook.py once the directory-layout finding is addressed), (b) add a 5-line run_train.py that calls mlflow_hook.install() then runpy.run_path(".../train.py", run_name="__main__"), (c) delete docker/patches/, (d) change the Dockerfile to COPY the two new .py files and drop the COPY ... .patch + RUN git apply lines, (e) point sm-train-entrypoint.sh and templates/training-job.yaml.tpl at run_train.py instead of train.py. The skrl-upstream-PR idea is a genuine follow-up — that one is out of scope for this branch.

@KeitaW , the value proposition here is to showcase end-end integration with SM Managed MLflow for tracking the experiment runs hence giving users the option to directly integrate with this other than relying on self-managed or 3rd-party tracking servers

Restored the web-viewer sidecar to match the blog browser-based workflow with krelay + http://localhost:8210 , plus fixed the media port from 48010 to 47998 to align with the krelay command and the viewer's sed config. Removed the ephemeral-storage: 100Gi limit that prevented scheduling on nodes with <89Gi allocatable e.g on the g6e.12xlarge.

KeitaW · 2026-05-18T11:26:16Z

Following up on the round-3 Batch 2 finding — here's a skeleton for the wrapper alternative so the swap is a copy-paste rather than a green-field design exercise. Behaviour is line-by-line equivalent to docker/patches/mlflow-train.patch: same SummaryWriter.add_scalar monkey-patch, same queue + worker + 2 s / 1000-metric batching via MlflowClient.log_batch, same tag-sanitisation regex, same warn-once on log_batch failures, same mlflow.log_artifacts + end_run at run end.

The one shape change is that run metadata the patch reads from args_cli / env_cfg / agent_cfg inside train.py moves to env-var sourcing in the entrypoint — the same pattern already in use for TASK and FRAMEWORK, so no new convention is introduced.

`scripts/mlflow_hook.py`

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
"""MLflow real-time logging hook for Isaac Lab training scripts.

Mirrors every TensorBoard scalar written via SummaryWriter.add_scalar to
MLflow on rank 0. Enabled when MLFLOW_TRACKING_URI is set; no-op otherwise.

Why not mlflow.config.enable_async_logging() (MLflow 2.16+)? On a ~1200
iteration PPO run, the built-in async logger serialises each metric
individually through a single worker with SigV4 signing per call, which
pushes the SageMaker MLflow App path from ~90 s to ~10 min on
ml.g6.12xlarge. We batch metrics into memory and flush them with
MlflowClient.log_batch (up to 1000 per call) from a dedicated thread.

Usage (called once before the training script is imported / run):

    import mlflow_hook
    mlflow_hook.install()
    # ... then runpy.run_path("/workspace/IsaacLab/scripts/.../train.py")

Run metadata is sourced from environment variables the launcher already sets:

    MLFLOW_TRACKING_URI       required to enable the hook
    MLFLOW_EXPERIMENT_NAME    default "isaaclab"
    MLFLOW_RUN_NAME           default ${MLFLOW_ARTIFACT_DIR##*/} or "isaaclab-run"
    MLFLOW_ARTIFACT_DIR       directory uploaded via log_artifacts() at run end
    MLFLOW_TASK               logged as param "task"
    MLFLOW_NUM_ENVS           logged as param "num_envs"
    MLFLOW_SEED               logged as param "seed"
    MLFLOW_ALGORITHM          logged as param "algorithm"
    MLFLOW_FRAMEWORK          logged as param "framework"
    MLFLOW_MAX_ITERATIONS     logged as param "max_iterations"
    RANK                      torchrun convention; non-zero ranks no-op
"""

import atexit
import os
import queue
import re
import threading
import time

_state = {"installed": False}
_INVALID = re.compile(r"[^A-Za-z0-9_\-. :/]")


def _env_int(name):
    v = os.environ.get(name)
    if v is None or v == "":
        return None
    try:
        return int(v)
    except ValueError:
        return None


def install():
    """Install the SummaryWriter monkey-patch and start the MLflow run.

    Safe to call once at process start. Returns silently if MLflow is not
    configured, if this is not rank 0, or if installation fails for any
    reason (training continues without MLflow tracking).
    """
    if _state["installed"]:
        return
    _state["installed"] = True

    tracking_uri = os.environ.get("MLFLOW_TRACKING_URI")
    if not tracking_uri:
        return

    if int(os.environ.get("RANK", "0")) != 0:
        return

    try:
        import mlflow
        from mlflow.entities import Metric
        from torch.utils.tensorboard import SummaryWriter

        mlflow.set_tracking_uri(tracking_uri)
        mlflow.set_experiment(
            os.environ.get("MLFLOW_EXPERIMENT_NAME", "isaaclab")
        )

        artifact_dir = os.environ.get("MLFLOW_ARTIFACT_DIR")
        default_run_name = (
            os.path.basename(artifact_dir.rstrip("/"))
            if artifact_dir
            else "isaaclab-run"
        )
        run_name = os.environ.get("MLFLOW_RUN_NAME", default_run_name)
        run = mlflow.start_run(run_name=run_name)
        run_id = run.info.run_id

        params = {
            "task": os.environ.get("MLFLOW_TASK"),
            "algorithm": os.environ.get("MLFLOW_ALGORITHM"),
            "framework": os.environ.get("MLFLOW_FRAMEWORK"),
            "num_envs": _env_int("MLFLOW_NUM_ENVS"),
            "seed": _env_int("MLFLOW_SEED"),
            "max_iterations": _env_int("MLFLOW_MAX_ITERATIONS"),
        }
        mlflow.log_params({k: v for k, v in params.items() if v is not None})

        client = mlflow.tracking.MlflowClient()
        metric_queue: "queue.Queue[Metric]" = queue.Queue()
        stop = threading.Event()
        warn_once = [False]

        def worker():
            # Flush at most every 2 s, batching up to 1000 metrics per call
            # (MLflow server caps log_batch payloads).
            pending = []
            while not stop.is_set() or not metric_queue.empty() or pending:
                try:
                    pending.append(metric_queue.get(timeout=2.0))
                    while len(pending) < 1000:
                        try:
                            pending.append(metric_queue.get_nowait())
                        except queue.Empty:
                            break
                except queue.Empty:
                    pass
                if pending:
                    try:
                        client.log_batch(run_id, metrics=pending)
                    except Exception as e:
                        if not warn_once[0]:
                            print(
                                f"[WARN] MLflow log_batch failed: {e}. "
                                "Subsequent failures will be silent."
                            )
                            warn_once[0] = True
                    pending = []

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()

        orig_add_scalar = SummaryWriter.add_scalar

        def dual_add_scalar(self, tag, value, global_step=None, *args, **kw):
            orig_add_scalar(self, tag, value, global_step, *args, **kw)
            try:
                metric_queue.put_nowait(
                    Metric(
                        key=_INVALID.sub("_", tag),
                        value=float(value),
                        timestamp=int(time.time() * 1000),
                        step=int(global_step or 0),
                    )
                )
            except Exception:
                pass

        SummaryWriter.add_scalar = dual_add_scalar

        def finalize():
            try:
                stop.set()
                thread.join(timeout=30)
                if artifact_dir and os.path.isdir(artifact_dir):
                    mlflow.log_artifacts(artifact_dir)
            except Exception as e:
                print(
                    f"[WARN] MLflow artifact upload failed: {e}. "
                    "Training results are saved on the filesystem."
                )
            finally:
                try:
                    mlflow.end_run()
                except Exception:
                    pass

        atexit.register(finalize)

    except Exception as e:
        print(
            f"[WARN] MLflow initialization failed: {e}. "
            "Training will continue without MLflow tracking."
        )

`scripts/run_train.py`

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
"""Entrypoint shim: install MLflow hook, then run Isaac Lab's train.py.

The Isaac Lab script is executed via runpy.run_path with
run_name="__main__" so its `if __name__ == "__main__":` guard fires and
sys.argv is forwarded unchanged (--task, --num_envs, etc.).

The MLflow hook is installed *before* train.py runs so the monkey-patch on
SummaryWriter.add_scalar is live before skrl's Runner instantiates the
writer — this is strictly safer than the patch's current ordering, which
installs the monkey-patch after Runner(env, agent_cfg) construction.
"""

import os
import runpy
import sys

import mlflow_hook

TRAIN_SCRIPT = os.environ.get(
    "ISAACLAB_TRAIN_SCRIPT",
    "/workspace/IsaacLab/scripts/reinforcement_learning/skrl/train.py",
)


def main():
    mlflow_hook.install()
    runpy.run_path(TRAIN_SCRIPT, run_name="__main__")


if __name__ == "__main__":
    sys.exit(main())

`scripts/sm-train-entrypoint.sh` (snippet)

Replace the direct python train.py ... invocation with python run_train.py ... and export the env vars mlflow_hook.py reads. The exports are gated on MLFLOW_TRACKING_URI so a launcher that doesn't set it gets the no-op path automatically.

# --- MLflow run metadata (read by mlflow_hook.py on rank 0) ---
if [ -n "${MLFLOW_TRACKING_URI:-}" ]; then
    export MLFLOW_EXPERIMENT_NAME="${MLFLOW_EXPERIMENT_NAME:-isaaclab}"
    export MLFLOW_TASK="${TASK:-Isaac-Velocity-Rough-H1-v0}"
    export MLFLOW_ALGORITHM="${ALGORITHM:-ppo}"
    export MLFLOW_FRAMEWORK="${FRAMEWORK:-skrl}"
    export MLFLOW_NUM_ENVS="${NUM_ENVS:-4096}"
    export MLFLOW_SEED="${SEED:-42}"
    export MLFLOW_MAX_ITERATIONS="${MAX_ITERATIONS:-1500}"
    # IsaacLab writes per-run logs under logs/skrl/<task>/<timestamp>/.
    # The hook uploads this directory via mlflow.log_artifacts at run end.
    export MLFLOW_ARTIFACT_DIR="${MLFLOW_ARTIFACT_DIR:-/workspace/IsaacLab/logs/${FRAMEWORK:-skrl}/${TASK:-Isaac-Velocity-Rough-H1-v0}}"
fi

# Previously:
#   python /workspace/IsaacLab/scripts/reinforcement_learning/skrl/train.py \
#       --task="${TASK:-...}" --num_envs="${NUM_ENVS:-...}" ...
#
# Now (the shim forwards sys.argv unchanged to train.py via runpy):
python /opt/training/run_train.py \
    --task="${TASK:-Isaac-Velocity-Rough-H1-v0}" \
    --num_envs="${NUM_ENVS:-4096}" \
    --headless \
    --distributed

`docker/Dockerfile` (diff)

--- a/3.test_cases/pytorch/nvidia-isaac-lab/docker/Dockerfile
+++ b/3.test_cases/pytorch/nvidia-isaac-lab/docker/Dockerfile
@@ -55,11 +55,8 @@
 # Install IsaacLab
 RUN cd /workspace/IsaacLab && ./isaaclab.sh --install

-# Apply MLflow integration patch to skrl train.py
-# If Isaac Lab is upgraded, regenerate it against the new file.
-COPY docker/patches/mlflow-train.patch /tmp/mlflow-train.patch
-RUN cd /workspace/IsaacLab && git apply /tmp/mlflow-train.patch && rm /tmp/mlflow-train.patch
+# MLflow integration: a hook module + thin entrypoint shim. The hook
+# monkey-patches SummaryWriter.add_scalar at runtime and is robust to
+# upstream changes in train.py (no line-number coupling, no git apply step).
+COPY scripts/mlflow_hook.py /opt/training/mlflow_hook.py
+COPY scripts/run_train.py   /opt/training/run_train.py
+ENV PYTHONPATH=/opt/training:${PYTHONPATH}

 COPY scripts/sm-train-entrypoint.sh /opt/training/sm-train-entrypoint.sh
 RUN chmod +x /opt/training/sm-train-entrypoint.sh

Things worth double-checking on your end before merging

MLFLOW_ARTIFACT_DIR default path. Defaults to /workspace/IsaacLab/logs/${FRAMEWORK}/${TASK} — please confirm that's where skrl actually writes per-run logs on your test cluster. If the run-timestamp subdirectory matters, the hook can be adjusted to glob the most recent child instead of uploading the parent.
Rank source. The hook reads os.environ["RANK"] (torchrun convention) instead of app_launcher.global_rank. If your launcher path doesn't set RANK before mlflow_hook.install() runs, non-rank-0 workers would also try to log — easy to verify with a quick printenv RANK in the entrypoint.
PyTorchJob env block. The K8s manifest may need the same MLFLOW_* exports as the SM entrypoint if you want MLflow logging on the HyperPod-EKS path too. The conditional shape (if [ -n "${MLFLOW_TRACKING_URI:-}" ]) keeps it a no-op when the var is unset, so it's safe to add unconditionally.

Happy to iterate on any of this — and if it's easier I can open a fresh branch off your PR head with these files in place so the diff is reviewable as code rather than as comment markdown.

Adding Nvidia Isaac Lab sample

8220740

allela-roy requested a review from KeitaW May 6, 2026 10:25

KeitaW reviewed May 7, 2026

View reviewed changes

Comment thread 3.test_cases/pytorch/nvidia-isaac-lab/templates/viz-eks-webrtc-pod.yaml.tpl

Comment thread 3.test_cases/pytorch/nvidia-isaac-lab/templates/viz-eks-webrtc-pod.yaml.tpl

KeitaW reviewed May 7, 2026

View reviewed changes

Comment thread 3.test_cases/pytorch/nvidia-isaac-lab/scripts/sm-train-entrypoint.sh Outdated

Comment thread 3.test_cases/pytorch/nvidia-isaac-lab/scripts/sm-train-entrypoint.sh Outdated

Comment thread 3.test_cases/pytorch/nvidia-isaac-lab/scripts/sm-train-entrypoint.sh Outdated

KeitaW reviewed May 7, 2026

View reviewed changes

KeitaW requested changes May 7, 2026

View reviewed changes

KeitaW reviewed May 8, 2026

View reviewed changes

Comment thread 3.test_cases/pytorch/nvidia-isaac-lab/scripts/sm-train-entrypoint.sh Outdated

allela-roy and others added 3 commits May 11, 2026 13:32

Fixing 1083 PR comments

bdce026

Adding resiliency workflow for SMHP runs

ec7ab01

KeitaW reviewed May 15, 2026

View reviewed changes

This comment was marked as resolved.

Sign in to view

KeitaW reviewed May 15, 2026

View reviewed changes

nicolasj92 added 2 commits May 15, 2026 15:46

viz: tighten skrl-only caveat in viz pod template comment

8d572b7

Add mlflow patch and remove distributed_run.bash reference

c28af09

KeitaW reviewed May 17, 2026

View reviewed changes

Conversation

allela-roy commented May 6, 2026 • edited by KeitaW Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Test Plan

Test Results

Directory Structure

Checklist

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 1/5 — Structure & Repository Hygiene

Directory layout doesn't follow the test-case convention

Missing patch file referenced by Dockerfile and README (build will fail)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KeitaW left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Review Batch 2/5 — Deployment Pipeline (K8s / Slurm)

Remove NCCL_SOCKET_IFNAME=eth0 and NCCL_IB_DISABLE=1 — g6.12xlarge supports EFA

Uh oh!

Uh oh!

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 3/5 — Training Code Quality

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 4/5 — Infrastructure & NCCL Configuration

Default config doesn't match the tested config

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 5/5 — Documentation Consistency

Things That Look Great

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Uh oh!

KeitaW left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Follow-up — EFA capability on g6.12xlarge

Uh oh!

Uh oh!

allela-roy commented May 12, 2026

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 1/3 — Structure & Repository Hygiene (round 2)

Resolution Status of Prior Findings

Directory layout still doesn't follow the test-case convention

Missing patch file referenced by Dockerfile and README (build will still fail)

Uh oh!

KeitaW May 15, 2026

Choose a reason for hiding this comment

README still references the deleted distributed_run.bash

Uh oh!

nicolasj92 May 15, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 3/3 — Documentation Consistency + Wrap-up (round 2)

Things That Look Great

allela-roy commented May 6, 2026 •

edited by KeitaW

Loading

KeitaW left a comment •

edited

Loading

Remove `NCCL_SOCKET_IFNAME=eth0` and `NCCL_IB_DISABLE=1` — `g6.12xlarge` supports EFA

KeitaW left a comment •

edited

Loading

README still references the deleted `distributed_run.bash`

`best_agent.pt` still hardcoded in viz pod template (with a documenting comment)

Shipping a unified-diff patch isn't the right shape — wrap `train.py` instead (close-out blocker)

KeitaW commented May 18, 2026 •

edited

Loading

`scripts/mlflow_hook.py`

`scripts/run_train.py`

`scripts/sm-train-entrypoint.sh` (snippet)

`docker/Dockerfile` (diff)