Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark #891
📝 Walkthrough
This PR introduces a new MoE (Mixture of Experts) benchmark workload to CloudAI, refactors DeepEP for v1/v2 tests, and enables NCCL/UCC to consume DeepEP-generated traffic matrices. It includes configuration files, Slurm command generation, report generation with throughput visualization, and comprehensive test coverage.
Changes
MoE Benchmark and DeepEP Integration
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (1)
28-134: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Fix Ruff formatting for this file before merge.
CI is failing on `ruff-format` for this file; please run formatting and commit the normalized output to unblock the pipeline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py`:
- Line 15: The import of Reporter in deepep_moe_throughput_reporter.py currently
uses the private path cloudai._core.base_reporter; replace that with the allowed
public reporter interface used by other workloads (import the same Reporter
symbol from the public reporter module instead) so the file imports Reporter
from the public/allowed package rather than cloudai._core.base_reporter.
- Around line 23-25: The function _deepep_dispatch_combine_bars is too complex
and triggers C901 and has an unused loop variable causing B007; split it into
small helpers (e.g., deepep_results_json_files -> leave, add helpers like
_load_latest_results(path: Path) -> dict,
_extract_dispatch_combine_rows(results: dict) -> Iterable[dict], and
_row_to_bar(row: dict) -> tuple[str, float, str]) and have
_deepep_dispatch_combine_bars orchestrate these helpers to reduce cyclomatic
complexity; also rename any unused loop variable "lab" to "_lab" (or prefix with
underscore) to satisfy B007, and finally run ruff format to fix formatting
issues.
In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py`:
- Around line 51-58: If use_deepep_matrix is true, don’t silently return None
when deepep_benchmark_root or the nccl matrix file is missing; instead fail fast
by raising a clear exception (e.g., RuntimeError/ValueError) from
_deepep_nccl_matrix_host_path that includes context (test/run id and that
use_deepep_matrix was requested) so the caller will stop the run; apply the same
change to the analogous methods referenced in the comment (the other
deepep-related path/resolution methods around lines 62-67 and 81-83, such as the
container-path resolver), and use deepep_benchmark_root and
_nccl_matrix_path_under_deepep_output in the error message to make the root
cause obvious.
- Around line 95-112: The alltoallv_perf_mpi branch only appends the fixed flags
and misses forwarding other configured NCCL options; update the branch in
slurm_command_gen_strategy.py (the block using _NCCL_TESTS_ALLTOALLV_PERF,
tdef.cmd_args and _nccl_cmd_scalar) to conditionally append the additional flags
present on tdef.cmd_args (e.g., stepfactor, nthreads, check, blocking and any
other optional fields) in the same way you handle minbytes/maxbytes/ngpus (use
_nccl_cmd_scalar where appropriate and add boolean flags when true), and keep
the existing inclusion of self.test_run.test.extra_args_str when
test.extra_cmd_args is set so TOML-configured options are not dropped.
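A minimal sketch of the flag forwarding described above, assuming the command arguments are available as a plain mapping. The helper name `optional_nccl_flags` and the `--<field>` spelling are illustrative and not taken from the PR; whether a given option is a bare switch or takes a value depends on the benchmark's own CLI and is not confirmed here.

```python
from typing import Any, Mapping

# Optional fields named in the review comment; the tuple is an assumption
# about which fields exist on the command-argument object.
OPTIONAL_FIELDS = ("stepfactor", "nthreads", "check", "blocking")


def optional_nccl_flags(cmd_args: Mapping[str, Any]) -> list[str]:
    """Return CLI tokens for every optional field that is actually set."""
    parts: list[str] = []
    for name in OPTIONAL_FIELDS:
        value = cmd_args.get(name)
        if value is None:
            continue  # field not configured: emit nothing
        if isinstance(value, bool):
            if value:
                parts.append(f"--{name}")  # boolean switch, only when true
        else:
            parts.extend([f"--{name}", str(value)])  # scalar option with a value
    return parts
```

Appending these tokens after the fixed flags, and before any extra argument string, would keep TOML-configured options from being dropped silently.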
In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`:
- Around line 55-64: If tdef.cmd_args.use_deepep_matrix is true, do not silently
return []; instead detect missing matrix by checking
deepep_benchmark_root(self.test_run) and self._deepep_ucc_matrix_host_path() and
raise a clear exception (or call fail-fast helper) when either is None and there
is no manual generation option provided (e.g. no tdef.cmd_args.gen or equivalent
flag). Update the logic in the block guarded by use_deepep_matrix (and the
analogous block later around the other check) to validate deepep_root and
matrix_host and raise an informative error mentioning the missing DeepEP matrix
and required --gen/manual generation flag rather than falling back to running
without a matrix.
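The fail-fast behaviour requested for both the NCCL and UCC strategies could look roughly like the sketch below. The function name `resolve_matrix`, its parameters, and the error wording are hypothetical; the real strategies resolve paths through their own helpers, which are not reproduced here.

```python
from pathlib import Path
from typing import Optional


def resolve_matrix(
    use_deepep_matrix: bool,
    deepep_root: Optional[Path],
    matrix_name: str,
    run_name: str,
    manual_gen: bool = False,
) -> Optional[Path]:
    """Return the matrix path, or raise when it was requested but is missing."""
    if not use_deepep_matrix:
        return None  # matrix not requested: nothing to resolve
    matrix = deepep_root / matrix_name if deepep_root is not None else None
    if matrix is not None and matrix.is_file():
        return matrix
    if manual_gen:
        return None  # caller supplied its own generation option instead
    raise ValueError(
        f"{run_name}: use_deepep_matrix was requested but no DeepEP matrix was found "
        f"(root={deepep_root}, expected={matrix}); provide the matrix or a manual generation option"
    )
```

Raising at command-generation time stops the run before a job is submitted, which is cheaper than discovering later that the benchmark ran without the intended traffic matrix.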
---
Outside diff comments:
In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 28-134: The file fails ruff-format checks; run the ruff formatter
and commit the normalized file so CI passes. Specifically, run your project's
ruff/formatter (e.g., ruff format) on the module containing methods like
_append_head_node_detection, _append_sbatch_directives, _container_mounts,
_generate_config_yaml and gen_srun_success_check, review the changes for any
unintended edits, and commit the formatted file to the branch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0416a4db-8f59-4498-afdb-cfa1d0fa7cce
📒 Files selected for processing (15)
- conf/experimental/test/deepep_standard.toml
- conf/experimental/test/nccl_test_alltoallv.toml
- conf/experimental/test_scenario/deepep_with_nccl_alltoallv.toml
- conf/experimental/test_scenario/deepep_with_ucc_alltoallv.toml
- src/cloudai/registration.py
- src/cloudai/workloads/deepep/__init__.py
- src/cloudai/workloads/deepep/deepep.py
- src/cloudai/workloads/deepep/deepep_combined_report.py
- src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py
- src/cloudai/workloads/deepep/report_generation_strategy.py
- src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
- src/cloudai/workloads/nccl_test/nccl.py
- src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
- src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
- src/cloudai/workloads/ucc_test/ucc.py
Please resolve the CodeRabbit comments and make CI pass before review; ping me again directly once it's done, please.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tests/test_init.py (1)
241-241: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Update test definition count assertion.
The test expects 26 test definitions, but with the addition of `MoEBenchmarkTestDefinition`, there are now 27. Please update the assertion to reflect the correct count.
🔧 Proposed fix
- assert len(test_defs) == 26
+ assert len(test_defs) == 27
src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)
23-57: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff
Consider reducing function complexity.
The function `_moe_benchmark_dispatch_combine_bars` has a cyclomatic complexity of 11, exceeding the threshold of 10. Consider extracting the row-filtering and value-extraction logic into smaller helper functions.
tests/test_acceptance.py (1)
721-721: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Missing reference sbatch file for moe-benchmark test.
The test `test_sbatch_generation[moe-benchmark]` expects a reference file at `tests/ref_data/moe-benchmark.sbatch`, but the file does not exist. This causes test failures. Please add the expected reference sbatch file for the moe-benchmark workload.
conf/experimental/test/moe_benchmark_standard.toml (1)
34-36: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Remove trailing blank lines.
The pre-commit hook `end-of-file-fixer` modified this file by removing trailing blank lines. Remove lines 34-36 to fix the CI failure.
🐛 Proposed fix
 [extra_env_vars]
 NUM_QPS_PER_RANK = "12"
 NUM_SMS = "24"
-
-
-
♻️ Duplicate comments (1)
src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)
15-15: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Fix the layering violation by using the public Reporter import path.
The import from `cloudai._core.base_reporter` violates the import-linter contract (`cloudai.workloads` must not import `cloudai._core`). Use the public Reporter interface instead, likely `from cloudai.reporter import Reporter` or the appropriate public module.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@conf/experimental/test_scenario/moe_benchmark.toml`:
- Around line 22-29: The Tests chain currently makes Tests.nccl_alltoallv depend
on Tests.ucc_alltoallv (serial), causing NCCL to wait for UCC; if both only
require the MoE output, change the dependency of Tests.ucc_alltoallv and
Tests.nccl_alltoallv to depend directly on Tests.moe_benchmark (replace the
dependency id in the [[Tests.dependencies]] block for both Tests.ucc_alltoallv
and Tests.nccl_alltoallv to "Tests.moe_benchmark") so they start in parallel
after the MoE benchmark; if the serial ordering was intentional for resource or
debugging reasons, confirm and leave as-is.
In `@conf/experimental/test/nccl_test_alltoallv.toml`:
- Line 37: The file ends with the line 'NCCL_SHM_DISABLE = "1"' but is missing a
trailing newline; add a single newline character after the NCCL_SHM_DISABLE line
(i.e., ensure the file ends with a newline) so the end-of-file-fixer pre-commit
check passes and CI succeeds.
In `@src/cloudai/registration.py`:
- Line 321: The call to Registry().add_scenario_report(...) exceeds the 120-char
limit; reformat the invocation for MoEBenchmarkThroughputReporter and
ReportConfig(enable=True) so the line is <=120 chars (for example, break the
arguments across multiple lines or assign arguments to temporary variables)
while keeping the same call to Registry().add_scenario_report with the same
symbols: Registry, add_scenario_report, MoEBenchmarkThroughputReporter, and
ReportConfig.
In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 182-189: generate_test_command currently hardcodes "torchrun
--nproc_per_node=1" and ignores DeepEPCmdArgs.python_executable; compute the
per-node process count from cmd_args.num_processes and num_nodes (e.g., per_node
= max(1, int(cmd_args.num_processes / num_nodes)) and validate evenly divisible
or handle remainder), then set "--nproc_per_node={per_node}" instead of the
fixed 1, and prepend cmd_args.python_executable when building the launcher
invocation (use cmd_args.python_executable + " " + self._script_path(cmd_args)
or equivalent) so generate_test_command and the launcher respect
DeepEPCmdArgs.num_processes and python_executable.
- Around line 169-170: The code calls _append_cli_field() which references an
undefined _BOOL_VALUE_FIELDS causing a NameError; define or import
_BOOL_VALUE_FIELDS (e.g., the set/list of field names treated as boolean flags)
and update _append_cli_field to use it, or replace the check with explicit
boolean-type handling (refer to the function name _append_cli_field). Also
update generate_test_command() to construct the launcher from DeepEPCmdArgs: use
DeepEPCmdArgs.python_executable instead of hardcoding "torchrun" (or document
and remove the unused field) and pass DeepEPCmdArgs.num_processes to the
launcher (replace the hardcoded "--nproc_per_node=1"), ensuring the launcher
flags match existing cmd arg names so DeepEP launcher aligns with the command
arguments.
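The per-node process count mentioned above can be sketched as follows. This assumes `num_processes` is the job-wide total; if the field already holds a per-node count, the division should be skipped. The function names are hypothetical, and the sketch rejects an uneven split rather than rounding.

```python
def nproc_per_node(num_processes: int, num_nodes: int) -> int:
    """Split a job-wide process count evenly across nodes, or fail loudly."""
    if num_nodes < 1 or num_processes < 1:
        raise ValueError("num_processes and num_nodes must be positive")
    per_node, remainder = divmod(num_processes, num_nodes)
    if remainder:
        raise ValueError(
            f"num_processes={num_processes} is not evenly divisible by num_nodes={num_nodes}"
        )
    return per_node


def torchrun_prefix(num_processes: int, num_nodes: int) -> list[str]:
    """Build the start of the launcher command with a derived worker count."""
    return ["torchrun", f"--nproc_per_node={nproc_per_node(num_processes, num_nodes)}"]
```

For example, 16 processes on 2 nodes yields `--nproc_per_node=8`, while 10 processes on 4 nodes raises instead of silently under-launching.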
In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`:
- Around line 60-64: Replace the simple if-else assignment for script_name with
a ternary expression to satisfy ruff SIM108: where the code currently checks
cmd_args.mode and sets script_name, change it to a single-line assignment using
"benchmark.py" if cmd_args.mode == "standard" else "benchmark_ll.py" (update the
assignment around the script_name variable).
- Line 69: The _generate_config_yaml method declares an unused parameter
cmd_args; remove the cmd_args parameter from the _generate_config_yaml signature
and all internal references (it isn't used because tdef = self.test_run.test and
tdef.cmd_args_dict supply the needed data), then update the caller that invokes
_generate_config_yaml (the callsite that currently passes a MoEBenchmarkCmdArgs)
to stop passing that argument so the call matches the new signature; ensure no
other callsites reference the removed parameter and run tests/lint to confirm.
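For illustration, the conditional-expression form that SIM108 asks for, combined with the `PurePosixPath` join that the strategy reportedly uses to build the script path. The function name `script_path` is hypothetical; only the two script names and the `mode == "standard"` test come from the comment above.

```python
from pathlib import PurePosixPath


def script_path(benchmark_root: str, mode: str) -> PurePosixPath:
    """Pick the benchmark script for the mode and join it to the in-container root."""
    # Single conditional expression instead of an if/else block (ruff SIM108).
    script_name = "benchmark.py" if mode == "standard" else "benchmark_ll.py"
    return PurePosixPath(benchmark_root) / script_name
```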
In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Line 159: The for-loop that zips centers, values, colors, and labels uses an
unused loop variable named `lab`; update that variable to `_lab` in the loop
header (for cx, val, col, _lab in zip(centers, values, colors, labels,
strict=True)) to signal it's intentionally unused while preserving the zip
ordering—locate this change in the loop inside throughput_reporter.py where
`centers`, `values`, `colors`, and `labels` are iterated.
- Line 157: The SVG f-string in throughput_reporter.py (e.g., the parts.append
call that builds the <text ...>{gv:.1f}</text>) exceeds the 120-char line
length; refactor these SVG generation lines (including the similar ones at the
locations around lines 169-171 and 173) by splitting the long f-strings into
smaller parts or intermediate variables (e.g., compute x_attr, y_attr, attrs,
and text_content separately) and then join/concatenate them in a short
parts.append call so each source line stays under the 120-character limit while
preserving the same output and formatting.
In `@tests/test_test_scenario.py`:
- Around line 47-50: Remove the trailing whitespace after the imported symbol
MoEBenchmarkReportGenerationStrategy in the import statement that also includes
MoEBenchmarkTestDefinition; edit the import line so it reads without any extra
spaces after MoEBenchmarkReportGenerationStrategy, preserving the same symbols
and line breaks to satisfy the pre-commit hook.
---
Outside diff comments:
In `@conf/experimental/test/moe_benchmark_standard.toml`:
- Around line 34-36: Remove the trailing blank lines at the end of
conf/experimental/test/moe_benchmark_standard.toml so the file ends immediately
after its last TOML entry (delete the extra blank lines at EOF that were added);
ensure there is no extra newline-only lines after the final content so the
end-of-file-fixer pre-commit hook is satisfied.
In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Around line 23-57: The function _moe_benchmark_dispatch_combine_bars is too
complex; extract the row-filtering/value-extraction and file-reading into small
helpers to reduce cyclomatic complexity. Create a helper like
_read_latest_results(test_output) to return parsed rows (or [] on error) and
another helper like _extract_bus_bw(row) that validates a row dict, ensures
"operation" is a str and contains "bus_bw_avg", normalizes operation to
lowercase and returns (op_l, float(bus_bw_avg)) or None on failure; then
simplify _moe_benchmark_dispatch_combine_bars to call _read_latest_results and
iterate the rows using _extract_bus_bw, populate by_op and build the out list
(keep existing tuple labels "MoE dispatch"/"MoE combine" and colors).
In `@tests/test_acceptance.py`:
- Line 721: The test test_sbatch_generation[moe-benchmark] fails because the
expected reference sbatch file is missing; add a new reference file named
moe-benchmark.sbatch under the tests/ref_data directory (the path constructed by
Path(__file__).parent / "ref_data" / test_req[1] in tests/test_acceptance.py)
containing the expected sbatch output for the moe-benchmark workload so the
assertion comparing generated SBATCH to the reference succeeds. Ensure the file
name exactly matches "moe-benchmark.sbatch" and the contents match the generator
output used by the test.
In `@tests/test_init.py`:
- Line 241: Update the assertion that checks the number of test definitions from
26 to the correct count to include the newly added MoEBenchmarkTestDefinition;
locate the assertion referencing test_defs (assert len(test_defs) == 26) in
tests/test_init.py and change the expected value to 27 so the test reflects the
added MoEBenchmarkTestDefinition.
---
Duplicate comments:
In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Line 15: The file imports Reporter from the private module
cloudai._core.base_reporter which breaks layering rules; update the import to
use the public Reporter export (e.g., replace "from cloudai._core.base_reporter
import Reporter" with the public import such as "from cloudai.reporter import
Reporter" or the correct public module that exposes Reporter) so that
throughput_reporter.py only depends on the public API.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 25bf978f-ad63-4710-8250-1199086f3041
📒 Files selected for processing (28)
- .gitignore
- conf/experimental/test/deepep_low_latency.toml
- conf/experimental/test/deepep_standard.toml
- conf/experimental/test/deepep_test_ep_v2.toml
- conf/experimental/test/deepep_test_internode.toml
- conf/experimental/test/deepep_test_intranode.toml
- conf/experimental/test/deepep_test_low_latency.toml
- conf/experimental/test/moe_benchmark_low_latency.toml
- conf/experimental/test/moe_benchmark_standard.toml
- conf/experimental/test/nccl_test_alltoallv.toml
- conf/experimental/test/ucc_alltoallv_deepep.toml
- conf/experimental/test_scenario/deepep_official.toml
- conf/experimental/test_scenario/moe_benchmark.toml
- src/cloudai/registration.py
- src/cloudai/workloads/deepep/__init__.py
- src/cloudai/workloads/deepep/deepep.py
- src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
- src/cloudai/workloads/moe_benchmark/__init__.py
- src/cloudai/workloads/moe_benchmark/combined_report.py
- src/cloudai/workloads/moe_benchmark/moe_benchmark.py
- src/cloudai/workloads/moe_benchmark/report_generation_strategy.py
- src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
- src/cloudai/workloads/moe_benchmark/throughput_reporter.py
- src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
- src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
- tests/test_acceptance.py
- tests/test_init.py
- tests/test_test_scenario.py
💤 Files with no reviewable changes (3)
- conf/experimental/test/deepep_low_latency.toml
- conf/experimental/test/deepep_standard.toml
- src/cloudai/workloads/deepep/__init__.py
Please also resolve the remaining CodeRabbit comments 🙏
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (2)
196-198: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
This success check can mark failed runs as passed.
`[testing]`, `dispatch`, `combine`, and `passed` are generic mid-run tokens. Once any one of them appears in `stdout.txt`, this returns success even if the job aborts later.
113-114: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Derive `MASTER_PORT` per job instead of pinning `29500`.
A fixed rendezvous port collides whenever two DeepEP jobs land on the same head node, so one run can fail before c10d initializes.
Suggested fix
- "export MASTER_PORT=29500",
+ 'export MASTER_PORT=$((10000 + (${SLURM_JOB_ID:-0} % 50000)))',
conf/experimental/test/moe_benchmark_low_latency.toml (1)
23-23: ⚠️ Potential issue | 🟠 Major
Fix placeholder `benchmark_root` in low-latency MoE test config
`conf/experimental/test/moe_benchmark_low_latency.toml` sets `benchmark_root` to `/path/in/the/container/to/the/tests/folder` (line 23), while the standard config uses `/workspace/dp-benchmark/benchmark`. `MoEBenchmarkSlurmCommandGenStrategy.generate_test_command()` builds the executed script as `PurePosixPath(cmd_args.benchmark_root) / benchmark_ll.py` directly from this field, so this placeholder will cause low-latency runs to execute the wrong script path and fail to produce the expected `results.json` artifacts for downstream reporting/matrix consumers. Update low-latency `benchmark_root` to the correct in-container path (or add a clear mechanism/documentation that replaces the placeholder before execution).
♻️ Duplicate comments (1)
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (1)
178-186: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
`--nproc_per_node=1` contradicts the one-launcher-per-node design.
The PR discussion says this strategy intentionally starts one `torchrun` per node and lets `torchrun` fan out the per-node workers. Hardcoding one worker here leaves `num_processes` only on the script CLI and under-launches multi-GPU nodes. Derive `--nproc_per_node` from the configured DeepEP process count instead of pinning it to `1`.
Based on prior review discussion about the intended one-launcher-per-node geometry in this PR.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@conf/experimental/test/moe_benchmark_low_latency.toml`:
- Line 23: The low-latency MoE test config currently uses a placeholder
benchmark_root value which causes
MoEBenchmarkSlurmCommandGenStrategy.generate_test_command() to build an
incorrect script path; update the benchmark_root entry in
conf/experimental/test/moe_benchmark_low_latency.toml from
"/path/in/the/container/to/the/tests/folder" to the correct in-container path
used by the standard config (e.g., "/workspace/dp-benchmark/benchmark"), or add
a documented replacement mechanism run prior to generate_test_command() so
PurePosixPath(cmd_args.benchmark_root) / benchmark_ll.py resolves to the real
benchmark directory.
In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 196-198: gen_srun_success_check currently treats generic mid-run
tokens as success (it greps for "[testing]|dispatch|combine|passed|tuning|Best"
in stdout.txt). Update the gen_srun_success_check method to only consider
definitive end-of-run success markers or to inspect the tail of the log instead
of the whole file: e.g., restrict patterns to true final markers (e.g., "Best"
or "passed" only when at end-of-line or in the final N lines) or run tail -n 50
| grep -E ... so transient tokens like "[testing]", "dispatch", "combine",
"tuning" are removed from the success regex; reference gen_srun_success_check
and output_file when making the change.
- Around line 113-114: The script currently hardcodes "export MASTER_PORT=29500"
causing rendezvous-port collisions; update the Slurm command generation in
slurm_command_gen_strategy.py (where the job script is built—look for the code
that emits the "export MASTER_PORT=29500" line, e.g., the generate_slurm_command
/ build script routine) to pick a per-job free port instead of 29500: implement
a helper (e.g., find_free_port()) that creates a socket bound to ('', 0) to
obtain an ephemeral port (validate it's in 1024-65535), close the socket and
then emit "export MASTER_PORT=<picked_port>" into the generated script or
environment; ensure the chosen port is derived per job (e.g., using job id or
ephemeral bind) so concurrent DeepEP runs on the same head node do not conflict.
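The two DeepEP suggestions above can be sketched together. Both helpers are hypothetical: the first restricts the success check to the tail of the log and to the two markers the comment treats as final, and the second mirrors the job-ID formula from the suggested fix. A port probed with a socket at script-generation time would be found on the submitting host, which may not be the head node, so that variant is not shown.

```python
import shlex

# Treated as end-of-run markers here; which tokens are truly final is an assumption.
FINAL_MARKERS = ("passed", "Best")


def tail_success_check(output_file: str, tail_lines: int = 50) -> str:
    """Shell pipeline that inspects only the last lines of the log."""
    pattern = "|".join(FINAL_MARKERS)
    return f"tail -n {tail_lines} {shlex.quote(output_file)} | grep -Eq {shlex.quote(pattern)}"


def master_port_for_job(job_id: int) -> int:
    """Python mirror of the shell arithmetic: a port in 10000..59999 per job ID."""
    return 10000 + job_id % 50000


def master_port_export_line() -> str:
    """Line to emit into the generated script; evaluated by the shell at run time."""
    return "export MASTER_PORT=$((10000 + (${SLURM_JOB_ID:-0} % 50000)))"
```

The tail-based check remains a heuristic: a job that prints `passed` near the end and then aborts would still match. Two jobs whose IDs differ by a multiple of 50000 would also map to the same port, which the formula accepts as unlikely rather than impossible.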
---
Duplicate comments:
In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 178-186: The parts list hardcodes "--nproc_per_node=1" which
conflicts with the one-launcher-per-node design; change it to compute per-node
worker count from the configured DeepEP process count (e.g., use the strategy's
configured value such as self._num_processes or derive from
cmd_args.num_processes / per-node setting) and insert
f"--nproc_per_node={per_node_count}" instead of the literal 1 in the parts array
inside the method that builds the torchrun invocation in
slurm_command_gen_strategy.py; ensure the value is an integer/string and falls
back to 1 only if no explicit per-node count is available.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 8947e249-c3b5-4e48-8bac-f95e11941f5c
📒 Files selected for processing (26)
- conf/experimental/test/deepep_test_ep_v2.toml
- conf/experimental/test/deepep_test_internode.toml
- conf/experimental/test/deepep_test_intranode.toml
- conf/experimental/test/deepep_test_low_latency.toml
- conf/experimental/test/moe_benchmark_low_latency.toml
- conf/experimental/test/moe_benchmark_standard.toml
- conf/experimental/test/nccl_test_alltoallv.toml
- conf/experimental/test/ucc_alltoallv_deepep.toml
- conf/experimental/test_scenario/deepep_official.toml
- conf/experimental/test_scenario/moe_benchmark.toml
- src/cloudai/registration.py
- src/cloudai/workloads/common/moe_benchmark_report.py
- src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
- src/cloudai/workloads/moe_benchmark/__init__.py
- src/cloudai/workloads/moe_benchmark/moe_benchmark.py
- src/cloudai/workloads/moe_benchmark/report_generation_strategy.py
- src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
- src/cloudai/workloads/moe_benchmark/throughput_reporter.py
- src/cloudai/workloads/nccl_test/nccl.py
- src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
- src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
- src/cloudai/workloads/ucc_test/ucc.py
- tests/ref_data/deepep-benchmark.sbatch
- tests/ref_data/moe-benchmark.sbatch
- tests/test_acceptance.py
- tests/test_init.py
💤 Files with no reviewable changes (1)
- tests/ref_data/moe-benchmark.sbatch
Add UCC/NCCL alltoallv vs DeepEP