Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark by ybenvidia · Pull Request #891 · NVIDIA/cloudai

ybenvidia · 2026-05-14T13:58:09Z

Add UCC/NCCL alltoallv vs DeepEP

add ucc alltoallv test
add nccl alltoallv test
add report generation for DeepEP vs NCCL vs UCC

coderabbitai · 2026-05-14T13:58:22Z

📝 Walkthrough

This PR introduces a new MoE (Mixture of Experts) benchmark workload to CloudAI, refactors DeepEP for v1/v2 tests, and enables NCCL/UCC to consume DeepEP-generated traffic matrices. It includes configuration files, Slurm command generation, report generation with throughput visualization, and comprehensive test coverage.

Changes

MoE Benchmark and DeepEP Integration

Layer / File(s)	Summary
MoE benchmark core definition `src/cloudai/workloads/moe_benchmark/moe_benchmark.py`, `src/cloudai/workloads/moe_benchmark/__init__.py`	`MoEBenchmarkCmdArgs` defines typed configuration (mode, sizing, datatype, feature flags, paths), and `MoEBenchmarkTestDefinition` lazily resolves Docker images and exports `cmd_args_dict` with infrastructure fields filtered out.
MoE Slurm command generation `src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`	Generates Slurm batch scripts with head node IP detection, exports rendezvous environment, generates/mounts `config.yaml`, selects benchmark entrypoint by mode, and reports success via bandwidth pattern matching.
DeepEP output discovery helpers `src/cloudai/workloads/common/moe_benchmark_report.py`	Provides dependency chain traversal, benchmark root detection via matrix files or stdout markers, and collection of ranked results directories.
MoE report generation `src/cloudai/workloads/moe_benchmark/report_generation_strategy.py`	Discovers MoE results JSON files, validates and enriches entries with metadata, and generates CSV with bus bandwidth columns.
MoE throughput visualization `src/cloudai/workloads/moe_benchmark/throughput_reporter.py`	Parses MoE/UCC/NCCL outputs and generates standalone SVG bar charts summarizing bandwidth metrics per test series.
DeepEP schema refactor `src/cloudai/workloads/deepep/deepep.py`, `src/cloudai/workloads/deepep/__init__.py`	`DeepEPCmdArgs` replaces benchmark fields with v1/v2 test configuration (subtest selection, root paths, CLI flags); test definition adds container runtime path property and removes custom cmd_args_dict; exports cleanup.
DeepEP Slurm strategy overhaul `src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`	Restructures command generation: exports `MASTER_ADDR`/`MASTER_PORT` for rendezvous, persists env vars to `env_vars.sh`, selects script by subtest name, uses per-subtest CLI allowlists for `torchrun`, and expands success pattern matching.
NCCL alltoallv and matrix support `src/cloudai/workloads/nccl_test/nccl.py`, `src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py`	Adds `alltoallv_perf_mpi`/`alltoallv_perf` subtests and matrix configuration fields; discovers and mounts DeepEP-generated matrices; generates specialized alltoallv_perf command with explicit numeric options; sets `ALLTOALLV_MATRIX_FILE` environment variable.
UCC matrix integration `src/cloudai/workloads/ucc_test/ucc.py`, `src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`	Adds `use_deepep_matrix` flag; resolves DeepEP matrix files and mounts them; appends file-based `--gen` parameter when explicit input is absent.
Test and scenario configurations `conf/experimental/test/*.toml`, `conf/experimental/test_scenario/*.toml`	Adds DeepEP v1/v2 test definitions (EP v2, internode, intranode, low-latency) and MoE benchmark configurations (standard, low-latency); updates NCCL and UCC test configs with matrix support; defines multi-test scenario with chained `start_post_comp` dependencies.
Registration and wiring `src/cloudai/registration.py`	Registers MoE benchmark components (test definition, Slurm strategy, report generation, throughput reporter); removes DeepEP report strategy; adds new scenario report for throughput visualization.
Test coverage and fixtures `tests/test_acceptance.py`, `tests/test_init.py`, `tests/test_test_scenario.py`, `tests/ref_data/*.sbatch`	Extends acceptance/init/scenario tests for moe-benchmark workload; updates sbatch reference data for DeepEP/MoE rendezvous and command generation; verifies registry entries and reporter mappings.
Cleanup `.gitignore`	Ignores SLURM-generated artifacts matching `slurm-*` pattern.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main changes: adding UCC/NCCL alltoallv tests and DeepEP v1/v2 and MoE benchmark support.
Description check	✅ Passed	The description is related to the changeset, mentioning the addition of UCC/NCCL alltoallv tests and report generation across different benchmarking systems.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (1)
28-134: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix Ruff formatting for this file before merge.

CI is failing on ruff-format for this file; please run formatting and commit the normalized output to unblock the pipeline.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 28 -
134, The file fails ruff-format checks; run the ruff formatter and commit the
normalized file so CI passes. Specifically, run your project's ruff/formatter
(e.g., ruff format) on the module containing methods like
_append_head_node_detection, _append_sbatch_directives, _container_mounts,
_generate_config_yaml and gen_srun_success_check, review the changes for any
unintended edits, and commit the formatted file to the branch.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py`:
- Line 15: The import of Reporter in deepep_moe_throughput_reporter.py currently
uses the private path cloudai._core.base_reporter; replace that with the allowed
public reporter interface used by other workloads (import the same Reporter
symbol from the public reporter module instead) so the file imports Reporter
from the public/allowed package rather than cloudai._core.base_reporter.
- Around line 23-25: The function _deepep_dispatch_combine_bars is too complex
and triggers C901 and has an unused loop variable causing B007; split it into
small helpers (e.g., deepep_results_json_files -> leave, add helpers like
_load_latest_results(path: Path) -> dict,
_extract_dispatch_combine_rows(results: dict) -> Iterable[dict], and
_row_to_bar(row: dict) -> tuple[str, float, str]) and have
_deepep_dispatch_combine_bars orchestrate these helpers to reduce cyclomatic
complexity; also rename any unused loop variable "lab" to "_lab" (or prefix with
underscore) to satisfy B007, and finally run ruff format to fix formatting
issues.

In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py`:
- Around line 51-58: If use_deepep_matrix is true, don’t silently return None
when deepep_benchmark_root or the nccl matrix file is missing; instead fail fast
by raising a clear exception (e.g., RuntimeError/ValueError) from
_deepep_nccl_matrix_host_path that includes context (test/run id and that
use_deepep_matrix was requested) so the caller will stop the run; apply the same
change to the analogous methods referenced in the comment (the other
deepep-related path/resolution methods around lines 62-67 and 81-83, such as the
container-path resolver), and use deepep_benchmark_root and
_nccl_matrix_path_under_deepep_output in the error message to make the root
cause obvious.
- Around line 95-112: The alltoallv_perf_mpi branch only appends the fixed flags
and misses forwarding other configured NCCL options; update the branch in
slurm_command_gen_strategy.py (the block using _NCCL_TESTS_ALLTOALLV_PERF,
tdef.cmd_args and _nccl_cmd_scalar) to conditionally append the additional flags
present on tdef.cmd_args (e.g., stepfactor, nthreads, check, blocking and any
other optional fields) in the same way you handle minbytes/maxbytes/ngpus (use
_nccl_cmd_scalar where appropriate and add boolean flags when true), and keep
the existing inclusion of self.test_run.test.extra_args_str when
test.extra_cmd_args is set so TOML-configured options are not dropped.
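The flag forwarding requested for the `alltoallv_perf_mpi` branch could look roughly like the sketch below. It is illustrative only: `append_optional_flags` and `OPTIONAL_FIELDS` are hypothetical names, a plain mapping stands in for `tdef.cmd_args`, and the option spellings are assumed to mirror the field names. Whether an option takes a value or is a bare switch depends on the binary, so the sketch treats every field as a valued option and skips unset ones.

```python
from typing import Any, Mapping

# Hypothetical list of optional fields; the real set lives on the NCCL cmd_args model.
OPTIONAL_FIELDS = ("stepfactor", "nthreads", "check", "blocking")


def append_optional_flags(parts: list[str], cmd_args: Mapping[str, Any]) -> list[str]:
    """Append `--name value` for every optional field that is set (not None)."""
    for name in OPTIONAL_FIELDS:
        value = cmd_args.get(name)
        if value is None:
            continue  # unset in TOML: keep the binary's own default
        parts.extend([f"--{name}", str(value)])
    return parts


base = ["alltoallv_perf", "--minbytes", "1K", "--maxbytes", "1G"]
cmd = append_optional_flags(list(base), {"stepfactor": 2, "check": 1, "nthreads": None})
print(" ".join(cmd))
```

Iterating over a fixed tuple keeps the option order deterministic, which matters when generated commands are compared against reference sbatch files.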

In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`:
- Around line 55-64: If tdef.cmd_args.use_deepep_matrix is true, do not silently
return []; instead detect missing matrix by checking
deepep_benchmark_root(self.test_run) and self._deepep_ucc_matrix_host_path() and
raise a clear exception (or call fail-fast helper) when either is None and there
is no manual generation option provided (e.g. no tdef.cmd_args.gen or equivalent
flag). Update the logic in the block guarded by use_deepep_matrix (and the
analogous block later around the other check) to validate deepep_root and
matrix_host and raise an informative error mentioning the missing DeepEP matrix
and required --gen/manual generation flag rather than falling back to running
without a matrix.
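A minimal sketch of the fail-fast behavior requested for both the NCCL and UCC strategies, assuming a standalone helper rather than the actual CloudAI methods. `resolve_matrix_or_fail` and its parameters are hypothetical; in the real code the root would come from `deepep_benchmark_root(self.test_run)` and the file name from the matrix-path helper.

```python
from pathlib import Path
from typing import Optional


def resolve_matrix_or_fail(
    use_deepep_matrix: bool, deepep_root: Optional[Path], matrix_name: str
) -> Optional[Path]:
    """Return the matrix path, None if no matrix was requested, or raise if it is missing."""
    if not use_deepep_matrix:
        return None  # matrix not requested: the caller runs without one
    if deepep_root is None:
        raise RuntimeError(
            "use_deepep_matrix=True but no DeepEP benchmark root was found for this run"
        )
    matrix = deepep_root / matrix_name
    if not matrix.is_file():
        raise RuntimeError(f"use_deepep_matrix=True but matrix file is missing: {matrix}")
    return matrix
```

Raising instead of returning `None` means a misconfigured dependency chain stops the run at command generation time, rather than producing results that silently ignore the requested matrix.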

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0416a4db-8f59-4498-afdb-cfa1d0fa7cce

📥 Commits

Reviewing files that changed from the base of the PR and between 2d672e1 and 12f6742.

📒 Files selected for processing (15)

conf/experimental/test/deepep_standard.toml
conf/experimental/test/nccl_test_alltoallv.toml
conf/experimental/test_scenario/deepep_with_nccl_alltoallv.toml
conf/experimental/test_scenario/deepep_with_ucc_alltoallv.toml
src/cloudai/registration.py
src/cloudai/workloads/deepep/__init__.py
src/cloudai/workloads/deepep/deepep.py
src/cloudai/workloads/deepep/deepep_combined_report.py
src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py
src/cloudai/workloads/deepep/report_generation_strategy.py
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
src/cloudai/workloads/nccl_test/nccl.py
src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
src/cloudai/workloads/ucc_test/ucc.py

podkidyshev · 2026-05-15T13:11:07Z

please resolve coderabbit comments and make CI pass before review

ping me again directly once it's done please

coderabbitai

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

tests/test_init.py (1)
241-241: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Update test definition count assertion.

The test expects 26 test definitions, but with the addition of MoEBenchmarkTestDefinition, there are now 27. Please update the assertion to reflect the correct count.
🔧 Proposed fix
-    assert len(test_defs) == 26
+    assert len(test_defs) == 27
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_init.py` at line 241, Update the assertion that checks the number
of test definitions from 26 to the correct count to include the newly added
MoEBenchmarkTestDefinition; locate the assertion referencing test_defs (assert
len(test_defs) == 26) in tests/test_init.py and change the expected value to 27
so the test reflects the added MoEBenchmarkTestDefinition.
src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)
23-57: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Consider reducing function complexity.

The function _moe_benchmark_dispatch_combine_bars has a cyclomatic complexity of 11, exceeding the threshold of 10. Consider extracting the row-filtering and value-extraction logic into smaller helper functions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py` around lines 23 -
57, The function _moe_benchmark_dispatch_combine_bars is too complex; extract
the row-filtering/value-extraction and file-reading into small helpers to reduce
cyclomatic complexity. Create a helper like _read_latest_results(test_output) to
return parsed rows (or [] on error) and another helper like _extract_bus_bw(row)
that validates a row dict, ensures "operation" is a str and contains
"bus_bw_avg", normalizes operation to lowercase and returns (op_l,
float(bus_bw_avg)) or None on failure; then simplify
_moe_benchmark_dispatch_combine_bars to call _read_latest_results and iterate
the rows using _extract_bus_bw, populate by_op and build the out list (keep
existing tuple labels "MoE dispatch"/"MoE combine" and colors).
tests/test_acceptance.py (1)
721-721: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing reference sbatch file for moe-benchmark test.

The test test_sbatch_generation[moe-benchmark] expects a reference file at tests/ref_data/moe-benchmark.sbatch, but the file does not exist. This causes test failures. Please add the expected reference sbatch file for the moe-benchmark workload.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_acceptance.py` at line 721, The test
test_sbatch_generation[moe-benchmark] fails because the expected reference
sbatch file is missing; add a new reference file named moe-benchmark.sbatch
under the tests/ref_data directory (the path constructed by
Path(__file__).parent / "ref_data" / test_req[1] in tests/test_acceptance.py)
containing the expected sbatch output for the moe-benchmark workload so the
assertion comparing generated SBATCH to the reference succeeds. Ensure the file
name exactly matches "moe-benchmark.sbatch" and the contents match the generator
output used by the test.
conf/experimental/test/moe_benchmark_standard.toml (1)
34-36: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove trailing blank lines.

The pre-commit hook end-of-file-fixer modified this file by removing trailing blank lines. Remove lines 34-36 to fix the CI failure.
🐛 Proposed fix
 [extra_env_vars]
 NUM_QPS_PER_RANK = "12"
 NUM_SMS = "24"
-
-
-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/experimental/test/moe_benchmark_standard.toml` around lines 34 - 36,
Remove the trailing blank lines at the end of
conf/experimental/test/moe_benchmark_standard.toml so the file ends immediately
after its last TOML entry (delete the extra blank lines at EOF); ensure there
are no newline-only lines after the final content so the end-of-file-fixer
pre-commit hook is satisfied.

♻️ Duplicate comments (1)

src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)
15-15: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the layering violation by using the public Reporter import path.

The import from cloudai._core.base_reporter violates the import-linter contract (cloudai.workloads must not import cloudai._core). Use the public Reporter interface instead—likely from cloudai.reporter import Reporter or the appropriate public module.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py` at line 15, The
file imports Reporter from the private module cloudai._core.base_reporter which
breaks layering rules; update the import to use the public Reporter export
(e.g., replace "from cloudai._core.base_reporter import Reporter" with the
public import such as "from cloudai.reporter import Reporter" or the correct
public module that exposes Reporter) so that throughput_reporter.py only depends
on the public API.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conf/experimental/test_scenario/moe_benchmark.toml`:
- Around line 22-29: The Tests chain currently makes Tests.nccl_alltoallv depend
on Tests.ucc_alltoallv (serial), causing NCCL to wait for UCC; if both only
require the MoE output, change the dependency of Tests.ucc_alltoallv and
Tests.nccl_alltoallv to depend directly on Tests.moe_benchmark (replace the
dependency id in the [[Tests.dependencies]] block for both Tests.ucc_alltoallv
and Tests.nccl_alltoallv to "Tests.moe_benchmark") so they start in parallel
after the MoE benchmark; if the serial ordering was intentional for resource or
debugging reasons, confirm and leave as-is.

In `@conf/experimental/test/nccl_test_alltoallv.toml`:
- Line 37: The file ends with the line 'NCCL_SHM_DISABLE = "1"' but is missing a
trailing newline; add a single newline character after the NCCL_SHM_DISABLE line
(i.e., ensure the file ends with a newline) so the end-of-file-fixer pre-commit
check passes and CI succeeds.

In `@src/cloudai/registration.py`:
- Line 321: The call to Registry().add_scenario_report(...) exceeds the 120-char
limit; reformat the invocation for MoEBenchmarkThroughputReporter and
ReportConfig(enable=True) so the line is <=120 chars (for example, break the
arguments across multiple lines or assign arguments to temporary variables)
while keeping the same call to Registry().add_scenario_report with the same
symbols: Registry, add_scenario_report, MoEBenchmarkThroughputReporter, and
ReportConfig.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 182-189: generate_test_command currently hardcodes "torchrun
--nproc_per_node=1" and ignores DeepEPCmdArgs.python_executable; compute the
per-node process count from cmd_args.num_processes and num_nodes (e.g., per_node
= max(1, int(cmd_args.num_processes / num_nodes)) and validate evenly divisible
or handle remainder), then set "--nproc_per_node={per_node}" instead of the
fixed 1, and prepend cmd_args.python_executable when building the launcher
invocation (use cmd_args.python_executable + " " + self._script_path(cmd_args)
or equivalent) so generate_test_command and the launcher respect
DeepEPCmdArgs.num_processes and python_executable.
- Around line 169-170: The code calls _append_cli_field() which references an
undefined _BOOL_VALUE_FIELDS causing a NameError; define or import
_BOOL_VALUE_FIELDS (e.g., the set/list of field names treated as boolean flags)
and update _append_cli_field to use it, or replace the check with explicit
boolean-type handling (refer to the function name _append_cli_field). Also
update generate_test_command() to construct the launcher from DeepEPCmdArgs: use
DeepEPCmdArgs.python_executable instead of hardcoding "torchrun" (or document
and remove the unused field) and pass DeepEPCmdArgs.num_processes to the
launcher (replace the hardcoded "--nproc_per_node=1"), ensuring the launcher
flags match existing cmd arg names so DeepEP launcher aligns with the command
arguments.
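One way to derive the per-node worker count is sketched below. It assumes `num_processes` is the total across all nodes, which the review does not settle; if the field is already a per-node count, the division should be dropped. `per_node_process_count` and `launcher_prefix` are hypothetical helpers, and the sketch leaves `python_executable` handling aside.

```python
def per_node_process_count(num_processes: int, num_nodes: int) -> int:
    """Split a total process count evenly across nodes, rejecting uneven layouts."""
    if num_processes < 1 or num_nodes < 1:
        raise ValueError("num_processes and num_nodes must both be >= 1")
    if num_processes % num_nodes != 0:
        raise ValueError(
            f"num_processes={num_processes} is not divisible by num_nodes={num_nodes}"
        )
    return num_processes // num_nodes


def launcher_prefix(num_processes: int, num_nodes: int) -> list[str]:
    """Build the torchrun prefix with a derived --nproc_per_node value."""
    per_node = per_node_process_count(num_processes, num_nodes)
    return ["torchrun", f"--nproc_per_node={per_node}"]


print(" ".join(launcher_prefix(16, 2)))
```

Rejecting an uneven split is a deliberate choice in this sketch: silently rounding down would launch fewer ranks than configured, which is the same class of problem as the hardcoded `1`.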

In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`:
- Around line 60-64: Replace the simple if-else assignment for script_name with
a ternary expression to satisfy ruff SIM108: where the code currently checks
cmd_args.mode and sets script_name, change it to a single-line assignment using
"benchmark.py" if cmd_args.mode == "standard" else "benchmark_ll.py" (update the
assignment around the script_name variable).
- Line 69: The _generate_config_yaml method declares an unused parameter
cmd_args; remove the cmd_args parameter from the _generate_config_yaml signature
and all internal references (it isn't used because tdef = self.test_run.test and
tdef.cmd_args_dict supply the needed data), then update the caller that invokes
_generate_config_yaml (the callsite that currently passes a MoEBenchmarkCmdArgs)
to stop passing that argument so the call matches the new signature; ensure no
other callsites reference the removed parameter and run tests/lint to confirm.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Line 159: The for-loop that zips centers, values, colors, and labels uses an
unused loop variable named `lab`; update that variable to `_lab` in the loop
header (for cx, val, col, _lab in zip(centers, values, colors, labels,
strict=True)) to signal it's intentionally unused while preserving the zip
ordering—locate this change in the loop inside throughput_reporter.py where
`centers`, `values`, `colors`, and `labels` are iterated.
- Line 157: The SVG f-string in throughput_reporter.py (e.g., the parts.append
call that builds the <text ...>{gv:.1f}</text>) exceeds the 120-char line
length; refactor these SVG generation lines (including the similar ones at the
locations around lines 169-171 and 173) by splitting the long f-strings into
smaller parts or intermediate variables (e.g., compute x_attr, y_attr, attrs,
and text_content separately) and then join/concatenate them in a short
parts.append call so each source line stays under the 120-character limit while
preserving the same output and formatting.
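The line-length fix for the SVG strings can be sketched as a small builder that assembles attributes from short pieces. The attribute names and defaults here are assumptions, since the original `<text ...>` line is not shown in the review; `svg_text` is a hypothetical helper.

```python
from xml.sax.saxutils import escape


def svg_text(x: float, y: float, content: str, *, anchor: str = "middle", font_size: int = 11) -> str:
    """Build one SVG <text> element from short attribute fragments."""
    attrs = " ".join(
        [
            f'x="{x:.1f}"',
            f'y="{y:.1f}"',
            f'text-anchor="{anchor}"',
            f'font-size="{font_size}"',
        ]
    )
    # escape() guards against labels containing <, > or &
    return f"<text {attrs}>{escape(content)}</text>"


gv = 3.14159
print(svg_text(10, 20, f"{gv:.1f}"))
```

A call site then becomes `parts.append(svg_text(cx, y, f"{gv:.1f}"))`, which stays well under the 120-character limit.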

In `@tests/test_test_scenario.py`:
- Around line 47-50: Remove the trailing whitespace after the imported symbol
MoEBenchmarkReportGenerationStrategy in the import statement that also includes
MoEBenchmarkTestDefinition; edit the import line so it reads without any extra
spaces after MoEBenchmarkReportGenerationStrategy, preserving the same symbols
and line breaks to satisfy the pre-commit hook.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 25bf978f-ad63-4710-8250-1199086f3041

📥 Commits

Reviewing files that changed from the base of the PR and between 3a743d3 and 72a6f84.

📒 Files selected for processing (28)

.gitignore
conf/experimental/test/deepep_low_latency.toml
conf/experimental/test/deepep_standard.toml
conf/experimental/test/deepep_test_ep_v2.toml
conf/experimental/test/deepep_test_internode.toml
conf/experimental/test/deepep_test_intranode.toml
conf/experimental/test/deepep_test_low_latency.toml
conf/experimental/test/moe_benchmark_low_latency.toml
conf/experimental/test/moe_benchmark_standard.toml
conf/experimental/test/nccl_test_alltoallv.toml
conf/experimental/test/ucc_alltoallv_deepep.toml
conf/experimental/test_scenario/deepep_official.toml
conf/experimental/test_scenario/moe_benchmark.toml
src/cloudai/registration.py
src/cloudai/workloads/deepep/__init__.py
src/cloudai/workloads/deepep/deepep.py
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
src/cloudai/workloads/moe_benchmark/__init__.py
src/cloudai/workloads/moe_benchmark/combined_report.py
src/cloudai/workloads/moe_benchmark/moe_benchmark.py
src/cloudai/workloads/moe_benchmark/report_generation_strategy.py
src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
src/cloudai/workloads/moe_benchmark/throughput_reporter.py
src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
tests/test_acceptance.py
tests/test_init.py
tests/test_test_scenario.py

💤 Files with no reviewable changes (3)

conf/experimental/test/deepep_low_latency.toml
conf/experimental/test/deepep_standard.toml
src/cloudai/workloads/deepep/__init__.py

podkidyshev · 2026-06-10T09:25:52Z

please also resolve remaining coderabbit comments 🙏

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (2)
196-198: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This success check can mark failed runs as passed.

[testing], dispatch, combine, and passed are generic mid-run tokens. Once any one of them appears in stdout.txt, this returns success even if the job aborts later.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 196
- 198, gen_srun_success_check currently treats generic mid-run tokens as success
(it greps for "[testing]|dispatch|combine|passed|tuning|Best" in stdout.txt).
Update the gen_srun_success_check method to only consider definitive end-of-run
success markers or to inspect the tail of the log instead of the whole file:
e.g., restrict patterns to true final markers (e.g., "Best" or "passed" only
when at end-of-line or in the final N lines) or run tail -n 50 | grep -E ... so
transient tokens like "[testing]", "dispatch", "combine", "tuning" are removed
from the success regex; reference gen_srun_success_check and output_file when
making the change.
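The tail-only idea can be illustrated in Python, although the real check is a shell `grep` emitted by `gen_srun_success_check`. Which tokens count as definitive end-of-run markers is an assumption here (`passed` and `Best`, taken from the review text), and `run_succeeded` is a hypothetical function.

```python
import re

# Assumed end-of-run markers; the mid-run tokens ([testing], dispatch, combine, tuning) are left out.
FINAL_MARKERS = re.compile(r"\b(passed|Best)\b")


def run_succeeded(log_text: str, tail_lines: int = 50) -> bool:
    """Report success only if a final marker appears in the last `tail_lines` lines."""
    if tail_lines < 1:
        raise ValueError("tail_lines must be >= 1")
    tail = log_text.splitlines()[-tail_lines:]
    return any(FINAL_MARKERS.search(line) for line in tail)


aborted = "[testing] dispatch\n" + "\n".join(f"step {i}" for i in range(60)) + "\nsrun: error: task aborted\n"
finished = "[testing] dispatch\ncombine\nall tests passed\n"
print(run_succeeded(aborted), run_succeeded(finished))
```

This narrows the window for false positives but does not remove it: a marker printed shortly before a crash would still match, so pairing the check with the job's exit status remains worthwhile.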
113-114: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Derive MASTER_PORT per job instead of pinning 29500.

A fixed rendezvous port collides whenever two DeepEP jobs land on the same head node, so one run can fail before c10d initializes.
Suggested fix
-                "export MASTER_PORT=29500",
+                'export MASTER_PORT=$((10000 + (${SLURM_JOB_ID:-0} % 50000)))',
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 113
- 114, The script currently hardcodes "export MASTER_PORT=29500" causing
rendezvous-port collisions; update the Slurm command generation in
slurm_command_gen_strategy.py (where the job script is built—look for the code
that emits the "export MASTER_PORT=29500" line, e.g., the generate_slurm_command
/ build script routine) to pick a per-job free port instead of 29500: implement
a helper (e.g., find_free_port()) that creates a socket bound to ('', 0) to
obtain an ephemeral port (validate it's in 1024-65535), close the socket and
then emit "export MASTER_PORT=<picked_port>" into the generated script or
environment; ensure the chosen port is derived per job (e.g., using job id or
ephemeral bind) so concurrent DeepEP runs on the same head node do not conflict.
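The arithmetic in the suggested shell expression can be checked with a small Python equivalent. `master_port_for_job` is a hypothetical helper; the `base` and `span` defaults are the constants from the suggested fix.

```python
def master_port_for_job(job_id: int, base: int = 10000, span: int = 50000) -> int:
    """Map a Slurm job id to a port in [base, base + span - 1]."""
    if job_id < 0:
        raise ValueError("job_id must be non-negative")
    return base + (job_id % span)


print(master_port_for_job(123456))
```

With the defaults the result always falls in 10000 to 59999, inside the valid unprivileged range. Two jobs whose ids differ by a multiple of 50000 would still map to the same port, and the mapping does not check that the port is actually free. The ephemeral-bind alternative mentioned in the prompt would only prove the port is free on the machine that generates the script, which may not be the head node.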
conf/experimental/test/moe_benchmark_low_latency.toml (1)
23-23: ⚠️ Potential issue | 🟠 Major

Fix placeholder benchmark_root in low-latency MoE test config

conf/experimental/test/moe_benchmark_low_latency.toml sets benchmark_root to /path/in/the/container/to/the/tests/folder (line 23), while the standard config uses /workspace/dp-benchmark/benchmark. MoEBenchmarkSlurmCommandGenStrategy.generate_test_command() builds the executed script as PurePosixPath(cmd_args.benchmark_root) / benchmark_ll.py directly from this field, so this placeholder will cause low-latency runs to execute the wrong script path and fail to produce the expected results.json artifacts for downstream reporting/matrix consumers. Update low-latency benchmark_root to the correct in-container path (or add a clear mechanism/documentation that replaces the placeholder before execution).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/experimental/test/moe_benchmark_low_latency.toml` at line 23, The
low-latency MoE test config currently uses a placeholder benchmark_root value
which causes MoEBenchmarkSlurmCommandGenStrategy.generate_test_command() to
build an incorrect script path; update the benchmark_root entry in
conf/experimental/test/moe_benchmark_low_latency.toml from
"/path/in/the/container/to/the/tests/folder" to the correct in-container path
used by the standard config (e.g., "/workspace/dp-benchmark/benchmark"), or add
a documented replacement mechanism run prior to generate_test_command() so
PurePosixPath(cmd_args.benchmark_root) / benchmark_ll.py resolves to the real
benchmark directory.
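
To make the failure mode concrete, the path join described above can be sketched as follows. Only the two `benchmark_root` values and the `benchmark_ll.py` name come from the review; the function names and the placeholder check are hypothetical and only illustrate one possible guard.

```python
from pathlib import PurePosixPath

# Values quoted in the review comment.
PLACEHOLDER_ROOT = "/path/in/the/container/to/the/tests/folder"
STANDARD_ROOT = "/workspace/dp-benchmark/benchmark"


def low_latency_script(benchmark_root: str) -> PurePosixPath:
    """Mirror the join the review describes: benchmark_root / benchmark_ll.py."""
    return PurePosixPath(benchmark_root) / "benchmark_ll.py"


def looks_like_placeholder(benchmark_root: str) -> bool:
    """Hypothetical guard: flag the documentation placeholder before a job is submitted."""
    return benchmark_root.startswith("/path/in/the/container")


if __name__ == "__main__":
    for root in (PLACEHOLDER_ROOT, STANDARD_ROOT):
        print(low_latency_script(root), "placeholder" if looks_like_placeholder(root) else "ok")
```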

♻️ Duplicate comments (1)

src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (1)
178-186: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

--nproc_per_node=1 contradicts the one-launcher-per-node design.

The PR discussion says this strategy intentionally starts one torchrun per node and lets torchrun fan out the per-node workers. Hardcoding one worker here leaves num_processes only on the script CLI and under-launches multi-GPU nodes. Derive --nproc_per_node from the configured DeepEP process count instead of pinning it to 1.

Based on prior review discussion about the intended one-launcher-per-node geometry in this PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 178
- 186, The parts list hardcodes "--nproc_per_node=1" which conflicts with the
one-launcher-per-node design; change it to compute per-node worker count from
the configured DeepEP process count (e.g., use the strategy's configured value
such as self._num_processes or derive from cmd_args.num_processes / per-node
setting) and insert f"--nproc_per_node={per_node_count}" instead of the literal
1 in the parts array inside the method that builds the torchrun invocation in
slurm_command_gen_strategy.py; ensure the value is an integer/string and falls
back to 1 only if no explicit per-node count is available.
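
One way the suggested change could look, as a sketch only. The function `torchrun_parts` and its parameters are hypothetical and do not reproduce the strategy's real method; `--nnodes` and `--nproc_per_node` are standard torchrun options.

```python
from typing import List, Optional


def torchrun_parts(num_nodes: int, per_node_count: Optional[int], script: str) -> List[str]:
    """Build a torchrun argument list with a derived per-node worker count.

    Falls back to 1 only when no explicit positive per-node count is configured.
    """
    nproc = per_node_count if per_node_count and per_node_count > 0 else 1
    return [
        "torchrun",
        f"--nnodes={num_nodes}",
        f"--nproc_per_node={nproc}",
        script,
    ]


if __name__ == "__main__":
    # Two nodes with eight workers each, instead of the hardcoded single worker.
    print(" ".join(torchrun_parts(2, 8, "benchmark.py")))
```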

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 196-198: gen_srun_success_check currently treats generic mid-run
tokens as success (it greps for "[testing]|dispatch|combine|passed|tuning|Best"
in stdout.txt). Update the gen_srun_success_check method to only consider
definitive end-of-run success markers or to inspect the tail of the log instead
of the whole file: e.g., restrict patterns to true final markers (e.g., "Best"
or "passed" only when at end-of-line or in the final N lines) or run tail -n 50
| grep -E ... so transient tokens like "[testing]", "dispatch", "combine",
"tuning" are removed from the success regex; reference gen_srun_success_check
and output_file when making the change.
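
The tail-based success check suggested for gen_srun_success_check might be sketched like this. Treating "passed" and "Best" as the final markers follows the review's own example and is not confirmed against the benchmark's output; the function names and the `echo 1 || echo 0` convention are assumptions for illustration.

```python
import re
import shlex

# Assumed end-of-run markers, taken from the review's example.
FINAL_MARKERS = re.compile(r"passed|Best")


def success_check_command(output_file: str, tail_lines: int = 50) -> str:
    """Shell snippet that inspects only the last lines of the log."""
    quoted = shlex.quote(output_file)
    return f"tail -n {tail_lines} {quoted} | grep -Eq 'passed|Best' && echo 1 || echo 0"


def tail_has_final_marker(log_text: str, tail_lines: int = 50) -> bool:
    """Pure-Python equivalent of the check, convenient for unit tests."""
    tail = log_text.splitlines()[-tail_lines:]
    return any(FINAL_MARKERS.search(line) for line in tail)


if __name__ == "__main__":
    print(success_check_command("stdout.txt"))
    # Mid-run tokens alone no longer count as success.
    print(tail_has_final_marker("[testing] dispatch\ncombine tuning\n"))
```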

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8947e249-c3b5-4e48-8bac-f95e11941f5c

📥 Commits

Reviewing files that changed from the base of the PR and between f81856e and bd5be75.

📒 Files selected for processing (26)

conf/experimental/test/deepep_test_ep_v2.toml
conf/experimental/test/deepep_test_internode.toml
conf/experimental/test/deepep_test_intranode.toml
conf/experimental/test/deepep_test_low_latency.toml
conf/experimental/test/moe_benchmark_low_latency.toml
conf/experimental/test/moe_benchmark_standard.toml
conf/experimental/test/nccl_test_alltoallv.toml
conf/experimental/test/ucc_alltoallv_deepep.toml
conf/experimental/test_scenario/deepep_official.toml
conf/experimental/test_scenario/moe_benchmark.toml
src/cloudai/registration.py
src/cloudai/workloads/common/moe_benchmark_report.py
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
src/cloudai/workloads/moe_benchmark/__init__.py
src/cloudai/workloads/moe_benchmark/moe_benchmark.py
src/cloudai/workloads/moe_benchmark/report_generation_strategy.py
src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
src/cloudai/workloads/moe_benchmark/throughput_reporter.py
src/cloudai/workloads/nccl_test/nccl.py
src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
src/cloudai/workloads/ucc_test/ucc.py
tests/ref_data/deepep-benchmark.sbatch
tests/ref_data/moe-benchmark.sbatch
tests/test_acceptance.py
tests/test_init.py

💤 Files with no reviewable changes (1)

tests/ref_data/moe-benchmark.sbatch

ybenvidia and others added 6 commits March 5, 2026 13:09

add tuning option

e912a9c

Merge branch 'NVIDIA:main' into dp-benchmark

aee7ebb

Merge branch 'NVIDIA:main' into dp-benchmark

ed01ea8

Merge branch 'NVIDIA:main' into dp-benchmark

8b029c9

add ucc/nccl all2allv

1b85a6e

Merge branch 'NVIDIA:main' into dp-benchmark

12f6742

ybenvidia requested review from jeffnvidia, podkidyshev and srivatsankrishnan as code owners May 14, 2026 13:58

coderabbitai Bot reviewed May 14, 2026

ybenvidia and others added 2 commits June 2, 2026 12:01

Merge branch 'NVIDIA:main' into main

3a743d3

add deepep v1/v2 and first version of the moe-benchmark

72a6f84

ybenvidia changed the title ~~Add UCC/NCCL alltoallv vs DeepEP~~ Add UCC/NCCL alltoallv, deepEP va/v2 and moe benchmark Jun 2, 2026

coderabbitai Bot reviewed Jun 2, 2026

fix pytest

f81856e

ybenvidia changed the title ~~Add UCC/NCCL alltoallv, deepEP va/v2 and moe benchmark~~ Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark Jun 2, 2026

ybenvidia and others added 5 commits June 9, 2026 11:14

Merge branch 'NVIDIA:main' into main

98ec0bc

fix CI reviews

f81792e

Merge branch 'main' of https://github.com/ybenvidia/cloudai

20176bd

update pytest

b8ef623

pytest

7600e59

podkidyshev requested changes Jun 10, 2026

ybenvidia and others added 2 commits June 10, 2026 14:42

fix CI

cd81035

Merge branch 'main' into main

bd5be75

coderabbitai Bot reviewed Jun 10, 2026

podkidyshev approved these changes Jun 10, 2026

podkidyshev merged commit f82fd9a into NVIDIA:main Jun 10, 2026
4 checks passed
