Override some arguments if dumper is enabled by fzyzcjy · Pull Request #627 · radixark/miles

fzyzcjy · 2026-02-22T23:23:39Z

No description provided.

Provides DumperPhase type, CLI config parsing with key validation, phase-specific directory helpers, env var builder for SGLang subprocess, and dumper_phase_scope context manager for Megatron lifecycle.

Adds --dumper-enable, --dumper-dir, --dumper-sglang, --dumper-fwd-only, --dumper-fwd-bwd flags. Links --dump-details to auto-enable dumper.

Integrates sglang dumper lifecycle into Megatron forward-only and forward-backward passes via dumper_phase_scope context manager.

Each engine gets isolated dump directory via DUMPER_EXP_NAME=engine_{rank}.

Runs a full training loop with --dumper-enable, then verifies dump files exist in all three phase directories including per-engine isolation for SGLang.

Unified naming eliminates _PHASE_ATTR_MAP dict — CLI attr is now derived as f"dumper_{phase}". Renamed --dumper-sglang to --dumper-inference for consistency.

Eliminates separate PHASE_* constants and _ALL_PHASES tuple. Callers use DumperPhase.FWD_ONLY etc. directly.
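
The naming scheme from the two commits above can be sketched as follows. This is a minimal illustration, assuming the enum values match the CLI flag suffixes; the helper name `phase_cli_attr` is hypothetical and not taken from the PR.

```python
import enum
from types import SimpleNamespace


class DumperPhase(enum.Enum):
    # Assumption: values mirror the --dumper-<phase> flag suffixes.
    INFERENCE = "inference"
    FWD_ONLY = "fwd_only"
    FWD_BWD = "fwd_bwd"


def phase_cli_attr(phase: DumperPhase) -> str:
    """Hypothetical helper: derive the argparse attribute name from the phase."""
    return f"dumper_{phase.value}"


# argparse maps --dumper-fwd-only to the attribute dumper_fwd_only.
args = SimpleNamespace(
    dumper_inference=None,
    dumper_fwd_only=["enable=1"],
    dumper_fwd_bwd=None,
)
overrides = getattr(args, phase_cli_attr(DumperPhase.FWD_ONLY))
```

Because the attribute name is computed from the enum member, no separate phase-to-attribute mapping has to be kept in sync with the flag list.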

- parse_dumper_config: use type-aware coercion via _FrozenConfig._parse_env_value instead of value-based heuristics (e.g. "0" no longer becomes False for str fields)
- configure_dumper_for_phase: reset to defaults before each phase to prevent cross-phase config leaking via singleton replace()
- dumper_phase_scope: disable dumper on scope exit to prevent stray dumps
- sglang_engine: clean up DUMPER_* env vars after server launch
- Fix --dumper-enable help text referencing old --dumper-sglang flag
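
To illustrate the first item, here is a small sketch of type-aware coercion. It is not the sglang `_FrozenConfig._parse_env_value` implementation; `ExampleConfig`, `coerce`, and `parse_kv_pairs` are hypothetical names, and the accepted boolean spellings are an assumption.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleConfig:
    # Hypothetical fields, chosen only to show one field per type.
    enable: bool = False
    exp_name: str = "default"
    max_steps: int = 0


def coerce(field_type: type, raw: str):
    """Convert a raw string according to the declared field type."""
    if field_type is bool:
        lowered = raw.strip().lower()
        if lowered in ("1", "true", "yes"):
            return True
        if lowered in ("0", "false", "no"):
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    if field_type is int:
        return int(raw)
    return raw  # str fields keep the text unchanged, so "0" stays "0"


def parse_kv_pairs(pairs: list[str]) -> dict:
    """Parse ["key=value", ...] into typed overrides, rejecting unknown keys."""
    types = {f.name: f.type for f in dataclasses.fields(ExampleConfig)}
    parsed = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        if key not in types:
            raise ValueError(f"unknown config key: {key!r}")
        parsed[key] = coerce(types[key], raw)
    return parsed


parsed = parse_kv_pairs(["exp_name=0", "enable=0", "max_steps=3"])
```

A value-based heuristic would turn both `"0"` strings into `False`; keying the conversion on the field type keeps `exp_name` as the string `"0"`.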

Remove parse_dumper_config from dumper_utils.py — the parsing logic now lives in _DumperConfig (sglang side) so other users can reuse it. Internally use _kv_pairs_to_dict for partial-override semantics.

Add extra_env parameter to launch_server_process that sets env vars inside the child process only, avoiding the os.environ.update/pop pattern that leaked state in the parent Ray actor.
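
The idea of child-only environment variables can be sketched with the standard library. This example swaps in `subprocess` for illustration; the PR's `launch_server_process` is not reproduced here, and `run_child_with_extra_env` is a hypothetical name.

```python
import os
import subprocess
import sys


def run_child_with_extra_env(extra_env: dict[str, str]) -> str:
    """Run a child Python process that sees extra_env; os.environ is not modified."""
    child_env = {**os.environ, **extra_env}  # merged copy, parent stays untouched
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('DUMPER_EXP_NAME', ''))"],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


child_value = run_child_with_extra_env({"DUMPER_EXP_NAME": "engine_0"})
```

Since the parent never calls `os.environ.update`, there is nothing to pop afterwards and no window in which other code in the same process can observe the variables.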

- Remove dumper_phase_scope, add finalize_dumper_phase
- Call configure_dumper_for_phase before and finalize_dumper_phase after
- Move step() to end (after dump_model) instead of beginning

Instead of passing env vars through extra_env/multiprocessing wrapper in sglang_engine.py, inject DUMPER_* vars into the Ray actor's runtime_env alongside other env vars. The spawned SGLang server subprocess inherits them naturally.

Instead of pre-setting all DUMPER_* env vars at Ray actor creation, only set DUMPER_SERVER_PORT=reuse to register the HTTP endpoint. Configure and enable the dumper dynamically via POST /dumper/configure before each rollout, giving each engine its own exp_name=engine_{i}. Also extract _get_worker_urls helper in sglang_rollout.py.
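
A rough sketch of this flow, building the requests without sending them. The `env_vars` key is the shape Ray's `runtime_env` accepts; the body keys `enable` and `dir`, and the helper name `build_configure_requests`, are assumptions for illustration. Only the endpoint path and `exp_name=engine_{i}` come from the commit message.

```python
import json
import urllib.request

# Only this variable is set at actor creation, to register the HTTP endpoint.
runtime_env = {"env_vars": {"DUMPER_SERVER_PORT": "reuse"}}


def build_configure_requests(worker_urls, dumper_dir, overrides=None):
    """Build one POST /dumper/configure request per engine (nothing is sent)."""
    requests = []
    for i, base_url in enumerate(worker_urls):
        # User overrides are spread first, then forced keys are overlaid on top.
        body = {
            **(overrides or {}),
            "enable": True,       # assumed key
            "dir": dumper_dir,    # assumed key
            "exp_name": f"engine_{i}",
        }
        requests.append(
            urllib.request.Request(
                url=f"{base_url}/dumper/configure",
                data=json.dumps(body).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
        )
    return requests


reqs = build_configure_requests(
    ["http://127.0.0.1:30000", "http://127.0.0.1:30001"],
    dumper_dir="/tmp/dumps",
    overrides={"exp_name": "ignored"},
)
```

Overlaying the forced keys last means a user-supplied `exp_name` cannot collapse two engines into the same dump location.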

Callers no longer need to obtain and pass worker_urls — the function discovers them itself via get_worker_urls from inference_rollout_train.

Inline worker URL discovery back into abort() to keep sglang_rollout.py diff minimal. Enable cleanup_previous by default for both SGLang and Megatron dumper phases.

dir = dumper_dir for all phases, exp_name carries the distinction: engine_{i} for inference, fwd_only / fwd_bwd for Megatron. Eliminates redundant fwd_only/fwd_only nesting.
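
The resulting layout can be sketched as below, assuming the dumper writes into `dir/exp_name`, which is what the removed fwd_only/fwd_only nesting suggests. `dump_location` is a hypothetical helper, not a function from the PR.

```python
from pathlib import Path
from typing import Optional


def dump_location(dumper_dir: str, phase: str, engine_index: Optional[int] = None) -> Path:
    """Hypothetical helper: directory where dumps for a phase are expected."""
    if phase == "inference":
        if engine_index is None:
            raise ValueError("inference dumps need an engine index")
        exp_name = f"engine_{engine_index}"
    else:
        exp_name = phase  # "fwd_only" or "fwd_bwd"
    # dir is the same for every phase; only exp_name differs.
    return Path(dumper_dir) / exp_name


locations = [
    dump_location("/tmp/dumps", "inference", 0),
    dump_location("/tmp/dumps", "fwd_only"),
    dump_location("/tmp/dumps", "fwd_bwd"),
]
```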

Wrap forward_step with dumper stepping so each microbatch gets its own dumper step counter. The wrapper calls dumper.step() before every forward_step invocation except the first, ensuring model forward and the subsequent loss callback share the same step number.
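
The stepping rule described above can be sketched like this. The signature is an assumption (the PR's `_wrap_forward_step_with_stepping` may differ), and `CountingDumper` is a stand-in that only tracks a step counter.

```python
from typing import Callable


def wrap_forward_step_with_stepping(forward_step: Callable, step: Callable[[], None]) -> Callable:
    """Call step() before every forward_step invocation except the first."""
    is_first_call = True

    def wrapped(*args, **kwargs):
        nonlocal is_first_call
        if is_first_call:
            is_first_call = False
        else:
            step()
        return forward_step(*args, **kwargs)

    return wrapped


class CountingDumper:
    """Stand-in for the real dumper: only tracks the current step index."""

    def __init__(self) -> None:
        self.step_index = 0

    def step(self) -> None:
        self.step_index += 1


dumper = CountingDumper()
seen = []
forward = wrap_forward_step_with_stepping(
    lambda batch: seen.append((batch, dumper.step_index)),
    dumper.step,
)
for microbatch in ["mb0", "mb1", "mb2"]:
    forward(microbatch)
```

Stepping at the start of the next call, rather than at the end of the current one, leaves the counter unchanged while the loss callback for the same microbatch runs.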

Module name already conveys the dumper context. Callers now use `dumper_utils.configure_for_phase(...)` instead of importing `configure_dumper_for_phase`. Also switch all call sites from `from dumper_utils import X` to `from miles.utils import dumper_utils`.

Prevents health-check requests from polluting dumper output by overriding three args in miles_validate_args:
- use_fault_tolerance=False (RolloutHealthMonitor)
- router_disable_health_check=True (sgl-router)
- rollout_health_check_interval=1e18 (miles-router)
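
A minimal sketch of these overrides, assuming the check is gated on a `dumper_enable` attribute. The function name here is illustrative, and any further overrides the real `_maybe_apply_dumper_overrides` performs are not shown.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def maybe_apply_dumper_overrides(args) -> None:
    """Disable the three health-check mechanisms when the dumper is enabled."""
    if not getattr(args, "dumper_enable", False):
        return
    logger.info("Dumper mode: all heartbeat mechanisms disabled")
    args.use_fault_tolerance = False           # RolloutHealthMonitor
    args.router_disable_health_check = True    # sgl-router
    args.rollout_health_check_interval = 1e18  # miles-router


def make_args(dumper_enable: bool) -> SimpleNamespace:
    # Default values below are placeholders for the demo, not the project's defaults.
    return SimpleNamespace(
        dumper_enable=dumper_enable,
        use_fault_tolerance=True,
        router_disable_health_check=False,
        rollout_health_check_interval=30.0,
    )


enabled = make_args(True)
disabled = make_args(False)
maybe_apply_dumper_overrides(enabled)
maybe_apply_dumper_overrides(disabled)
```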

Callers were passing model[0] which would silently lose parameters when virtual pipeline parallelism is enabled (len(model) > 1). Now finalize() receives the full model list and asserts len == 1 until multi-chunk support is implemented.
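
The guard can be sketched as a small helper. The name `extract_single_model` is hypothetical; the point is that the full list is passed in and the single-chunk assumption is checked in one place instead of being hidden behind `model[0]` at each call site.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def extract_single_model(model_chunks: Sequence[T]) -> T:
    """Return the only model chunk, failing loudly if there are several."""
    assert len(model_chunks) == 1, (
        f"expected exactly one model chunk, got {len(model_chunks)}; "
        "virtual pipeline parallelism is not supported yet"
    )
    return model_chunks[0]


single = extract_single_model(["chunk0"])
```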

…tensors

dump_model() saved every parameter as a separate .pt file, which for MoE models (e.g. 128 experts) produced files of 40+ GB each, totaling 500+ GB across ranks. Replace with save=False (console only) plus a single lightweight model_summary dump containing shapes/dtypes/devices.

…of full tensors"

This reverts commit 5e2f648.

Speeds up e2e test by generating shorter responses (128 -> 20 tokens).

With enable_model_value and enable_model_grad defaulting to False, the test must explicitly enable them via phase overrides so that fwd_only and fwd_bwd directories are created for verification.

MoE model has too many parameters for dump_model to complete in reasonable time. Explicitly disable enable_model_value/grad and only verify engine_* (inference) dumps.

The non-intrusive dumper (non_intrusive_mode=core) needs register_non_intrusive_dumper() to be called with the model so forward hooks can dump input_ids, positions, etc. Without this, fwd_only and fwd_bwd phases produce no dump files. Restore EXP_PATTERNS to verify all three phases.

Pass model to DumperMegatronUtil.__init__ and register non-intrusive hooks there. This is safe because _configure() calls dumper.reset() which now removes previous hooks before re-registering. Removes the separate register_hooks() method and its call sites in forward_only() and train_one_step().

Megatron models pass input_ids/positions as keyword arguments, which non-intrusive hooks cannot capture via pre_hook positional args. Core mode only dumps fields matching {input_ids, positions}, so it produces no output for Megatron. Switch to 'all' mode to capture sub-module I/O via positional args.

The barrier in sglang's cleanup_previous deadlocks under pipeline parallelism. Miles handles dump directory cleanup externally via prepare() before training, so the lazy barrier-based cleanup is not needed.

non_intrusive_mode=all on a 30B MoE model generates thousands of dump files per forward pass, causing excessive I/O. Filter to only dump embedding-related modules to verify hooks work without overwhelming disk.

- Remove cleanup_previous from sglang dumper config to avoid dist.barrier() deadlocks in async PP contexts
- Add _cleanup_dump_dir helper: rank-0 rmtree + barrier
- Call cleanup explicitly in DumperMegatronUtil._configure
- E2e test: switch from non_intrusive_mode=all to default core mode, disable model value/grad dumps for both fwd phases

The DumperConfig default is already False, and no caller passes this field explicitly.

Avoids rmtree on files or symlinks to files.

Move torch.distributed import to module level, extract _get_rank helper, flatten the if/elif into two independent checks.
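
The cleanup pattern from the commits above (rank-0 rmtree, then a barrier, guarded by is_dir()) can be sketched without torch by injecting the rank and the barrier. In the real helper these would presumably come from torch.distributed; the signature below is an assumption.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def cleanup_dump_dir(dump_dir: Path, rank: int, barrier: Callable[[], None]) -> None:
    """Rank 0 removes the directory; every rank then waits at the barrier."""
    # is_dir() is False for regular files and for symlinks to files,
    # so rmtree is never called on them.
    if rank == 0 and dump_dir.is_dir():
        shutil.rmtree(dump_dir)
    # All ranks synchronize so nobody writes into a directory being deleted.
    barrier()


root = Path(tempfile.mkdtemp())
stale_dir = root / "fwd_only"
stale_dir.mkdir()
(stale_dir / "old.pt").write_bytes(b"")
plain_file = root / "not_a_dir"
plain_file.write_bytes(b"")

barrier_calls = []
cleanup_dump_dir(stale_dir, rank=0, barrier=lambda: barrier_calls.append("dir"))
cleanup_dump_dir(plain_file, rank=0, barrier=lambda: barrier_calls.append("file"))
```

Every rank must reach the barrier the same number of times, which is why the barrier call sits outside the rank-0 branch.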

Check that specific dump fields exist for each phase:
- engine_*: input_ids, positions (from SGLang ForwardBatch)
- fwd_only/fwd_bwd: input_ids, cu_seqlens_q, cu_seqlens_kv, qkv_format (from Megatron kwargs + PackedSeqParams)

Dump files are named like step=0___rank=0___name=input_ids.pt, not just input_ids.pt. Use *name={field}.pt glob pattern.
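
A small sketch of the field check using that glob pattern. The helper name `missing_fields` and the recursive search are assumptions; only the file naming scheme and the `*name={field}.pt` pattern come from the commit message.

```python
import tempfile
from pathlib import Path


def missing_fields(exp_dir: Path, expected_fields: list[str]) -> list[str]:
    """Return the expected fields that have no matching dump file under exp_dir."""
    return [
        field
        for field in expected_fields
        if not any(exp_dir.rglob(f"*name={field}.pt"))
    ]


exp_dir = Path(tempfile.mkdtemp()) / "fwd_only"
exp_dir.mkdir()
(exp_dir / "step=0___rank=0___name=input_ids.pt").touch()
(exp_dir / "step=0___rank=0___name=cu_seqlens_q.pt").touch()

missing = missing_fields(exp_dir, ["input_ids", "cu_seqlens_q", "qkv_format"])
exact_name_matches = list(exp_dir.glob("input_ids.pt"))
```

Matching on the bare file name finds nothing because of the step and rank prefix, which is the failure the commit fixes.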

This reverts commit d3a82dc.

gemini-code-assist · 2026-02-22T23:23:59Z

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust dumping mechanism designed to aid in the detailed analysis and debugging of both SGLang inference and Megatron training workflows. It provides granular control over data collection through new command-line arguments and intelligently adjusts runtime configurations to prevent interference with the dumping process, thereby enhancing the observability of complex model behaviors.

Highlights

- Dumper Utility Introduction: A new dumper utility has been introduced to facilitate debugging and analysis across SGLang inference and Megatron training phases (forward-only and forward-backward).
- Argument Parsing for Dumper Configuration: New command-line arguments have been added to configure the dumper, including enabling it, specifying output directories, and setting phase-specific parameters.
- Integration into Model Execution: The dumper has been integrated into key model execution functions within Megatron's forward_only and train_one_step, and SGLang's rollout processes, allowing for data capture at critical points.
- Automatic Argument Overrides: When the dumper is enabled, certain training parameters are automatically overridden to ensure a stable environment for data collection, such as disabling fault tolerance, health checks, and forcing single rollouts.
- Comprehensive Testing: New end-to-end and fast tests have been added to validate the dumper's functionality and the argument override logic.

Changelog

miles/backends/megatron_utils/model.py
- Imported DumperMegatronUtil and DumperPhase for dumper integration.
- Initialized DumperMegatronUtil within forward_only and train_one_step functions.
- Wrapped forward_step with dumper functionality to capture intermediate states.
- Added calls to dumper_phase_util.finalize to complete dumping for each phase.
miles/ray/rollout.py
- Imported dumper_utils for SGLang dumper configuration.
- Updated environment variables to include SGLang dumper settings.
miles/rollout/inference_rollout/inference_rollout_train.py
- Imported dumper_utils for SGLang dumper configuration.
- Added a call to dumper_utils.configure_sglang to set up the dumper for inference rollout.
miles/rollout/sglang_rollout.py
- Imported dumper_utils for SGLang dumper configuration.
- Added a call to dumper_utils.configure_sglang to set up the dumper for rollout generation.
miles/utils/arguments.py
- Added new command-line arguments: --dumper-enable, --dumper-dir, --dumper-inference, --dumper-fwd-only, and --dumper-fwd-bwd for dumper configuration.
- Implemented _maybe_apply_dumper_overrides function to adjust training parameters when the dumper is enabled, including disabling fault tolerance, health checks, and forcing single rollouts.
miles/utils/dumper_utils.py
- Added a new file defining the DumperPhase enum.
- Implemented SGLang dumper configuration functions (get_sglang_env, configure_sglang).
- Implemented DumperMegatronUtil class for Megatron dumper integration, including methods for initialization, wrapping forward steps, and finalization.
- Added helper functions _wrap_forward_step_with_stepping, _cleanup_dump_dir, _get_phase_override_configs, _is_phase_enabled, and _get_dir.
tests/e2e/short/test_dumper.py
- Added a new end-to-end test file to verify the functionality of the dumper across different parallelism configurations.
tests/fast/utils/test_arguments.py
- Imported SimpleNamespace and _maybe_apply_dumper_overrides for testing.
- Added TestMaybeApplyDumperOverrides class with unit tests for the dumper argument override logic.
tests/fast/utils/test_dumper_utils.py
- Added a new fast test file to test the _wrap_forward_step_with_stepping utility function.

Ignored Files

Ignored by pattern: .github/workflows/** (2)
- .github/workflows/pr-test.yml
- .github/workflows/pr-test.yml.j2

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service. Gemini can make mistakes, so double-check its output and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a "dumper" utility to capture intermediate states during different phases of training and inference for debugging and analysis. The feature is enabled by a new set of --dumper-* command-line arguments. When enabled, it overrides several other arguments to ensure a controlled environment for dumping, such as disabling health checks and forcing a single rollout. The implementation is well-structured and includes new utility functions, integration into Megatron and SGLang workflows, and comprehensive unit and end-to-end tests.

My review focuses on improving code structure and readability. I've pointed out a circular dependency that could be refactored and suggested replacing a magic number with a more idiomatic constant.

gemini-code-assist · 2026-02-22T23:30:39Z

miles/utils/dumper_utils.py

+    if not _is_phase_enabled(args, DumperPhase.INFERENCE):
+        return
+
+    from miles.rollout.inference_rollout.inference_rollout_train import get_worker_urls


There appears to be a circular dependency here. miles/utils/dumper_utils.py imports from miles/rollout/inference_rollout/inference_rollout_train.py, which in turn imports miles/utils/dumper_utils.py. While using a local import inside configure_sglang makes this work at runtime, it's generally better to avoid circular dependencies for improved code structure and maintainability.

Consider moving the get_worker_urls function from miles/rollout/inference_rollout/inference_rollout_train.py to a lower-level utility module like miles/utils/http_utils.py, since it's already using get from there. Both dumper_utils.py and inference_rollout_train.py could then import it from http_utils.py without creating a cycle.

gemini-code-assist · 2026-02-22T23:30:40Z

miles/utils/arguments.py

+
+    logger.info("Dumper mode: all heartbeat mechanisms disabled")
+    args.router_disable_health_check = True
+    args.rollout_health_check_interval = 1e18


Using 1e18 to effectively disable the health check interval works, but it's a bit of a magic number. For better readability and to more clearly express the intent of an infinite timeout, consider using float('inf').

Suggested change

-    args.rollout_health_check_interval = 1e18
+    args.rollout_health_check_interval = float('inf')
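
For reference, a short comparison of the two sentinels in plain Python. Whether the miles-router ever converts the interval to an integer is not known from this PR, so the last part is only a caveat to check before switching.

```python
finite_sentinel = 1e18
infinite_sentinel = float("inf")

# Both act as "never due" in ordinary comparisons against elapsed seconds.
elapsed_seconds = 3600.0
due_with_finite = elapsed_seconds >= finite_sentinel
due_with_infinite = elapsed_seconds >= infinite_sentinel

# Equality assertions in tests work for infinity as well.
inf_equals_inf = infinite_sentinel == float("inf")

# One difference: integer conversion succeeds for 1e18 but raises for infinity.
finite_as_int = int(finite_sentinel)
try:
    int(infinite_sentinel)
    inf_converts_to_int = True
except OverflowError:
    inf_converts_to_int = False
```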

gemini-code-assist · 2026-02-22T23:30:40Z

tests/fast/utils/test_arguments.py

+
+        assert args.use_fault_tolerance is False
+        assert args.router_disable_health_check is True
+        assert args.rollout_health_check_interval == 1e18


To align with the suggested change of using float('inf') for the health check interval in miles/utils/arguments.py, this assertion should be updated accordingly.

Suggested change

-        assert args.rollout_health_check_interval == 1e18
+        assert args.rollout_health_check_interval == float('inf')

yushengsu-thu

workflow part LGTM

fzyzcjy added 30 commits February 21, 2026 07:58

Add dumper_utils.py for Miles dumper integration

40e4c3b

Add --dumper-* CLI arguments for sglang dumper integration

cb53ac0

Wrap forward_only and train_one_step with dumper_phase_scope

4421eb3

Add dumper sanity check logging in Megatron actor init

417f715

Set DUMPER_* env vars before SGLang server launch

0785883

Add e2e test for dumper integration

37a88db

Convert lazy imports to global imports and apply pre-commit fixes

35fedb2

Simplify DumperPhase names: inference, fwd_only, fwd_bwd

ba8b3b8

Change DumperPhase from Literal to enum

aa62004

Use _DumperConfig.from_kv_pairs instead of local parse_dumper_config

677cce9

more

307e039

more

cc6affe

Pass dumper env vars via extra_env instead of polluting parent process

7fa1fc0

Replace dumper_phase_scope context manager with explicit calls

c8cee9b

more

59023f5

Remove unnecessary dumper status logging from actor init

0406384

more

9510113

Add dumper HTTP configure call to inference_rollout_train

e8ccbf0

Let configure_dumper_for_sglang fetch worker URLs internally

a32d634

Minimize sglang_rollout.py diff; default cleanup_previous=True

e1ca7ed

Simplify test_dumper: verify all three phases uniformly

8857af9

Remove comments and section headers from dumper_utils.py

bfe8a80

Use flat dir layout: exp_name alone distinguishes phases

e671e35

Simplify SGLang dumper body: spread overrides then overlay forced keys

f4dbd48

fzyzcjy added 23 commits February 21, 2026 21:38

fmt

7bef9ee

Revert "Fix dump_model in finalize: save lightweight summary instead …

02bec36

…of full tensors" This reverts commit 5e2f648.

Reduce rollout-max-response-len to 20 in e2e dumper test

73279a7

Enable model dump explicitly in e2e dumper test

e55febd

Disable model dump in e2e test, only verify inference dumps

003618d

Extract _extract_model to deduplicate model chunk assertion

bbdf956

Remove cleanup_previous from dumper phase configuration

d91d2ef

Add filter=embedding to reduce non-intrusive dump I/O in e2e test

3094c20

Remove unnecessary cleanup_previous pop from _configure

e5643fb

Use is_dir() instead of exists() in _cleanup_dump_dir

7ee3550

Simplify _cleanup_dump_dir with flat control flow

5ec2ea4

Reuse _get_rank from sglang dumper instead of local copy

5f9c060

Add expected field name verification to e2e dumper test

1b3bf75

Apply black formatting to e2e test_dumper.py

0a1e77d

Fix field name glob pattern in e2e test

633e4fe

revert override

d3a82dc

Revert "revert override"

e1b75bd

fzyzcjy requested review from guapisolo, maocheng23, yueming-yuan and yushengsu-thu as code owners February 22, 2026 23:23

gemini-code-assist bot reviewed Feb 22, 2026

yushengsu-thu approved these changes Feb 23, 2026

fzyzcjy wants to merge 91 commits into main from ac8403/1
