Integrate dumper in miles #625

Open
fzyzcjy wants to merge 90 commits into main from ac8403/0

fzyzcjy (Collaborator) commented Feb 21, 2026

No description provided.

Provides DumperPhase type, CLI config parsing with key validation,
phase-specific directory helpers, env var builder for SGLang subprocess,
and dumper_phase_scope context manager for Megatron lifecycle.

Adds --dumper-enable, --dumper-dir, --dumper-sglang, --dumper-fwd-only,
--dumper-fwd-bwd flags. Links --dump-details to auto-enable dumper.

Integrates sglang dumper lifecycle into Megatron forward-only and
forward-backward passes via dumper_phase_scope context manager.

Each engine gets an isolated dump directory via DUMPER_EXP_NAME=engine_{rank}.

Runs a full training loop with --dumper-enable, then verifies that dump
files exist in all three phase directories, including per-engine
isolation for SGLang.

Unified naming eliminates the _PHASE_ATTR_MAP dict: the CLI attr is now
derived as f"dumper_{phase}". Renamed --dumper-sglang to
--dumper-inference for consistency.

Eliminates separate PHASE_* constants and _ALL_PHASES tuple.
Callers use DumperPhase.FWD_ONLY etc. directly.
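The naming scheme described in the two commits above can be sketched as follows. This is a minimal illustration, not the code from the PR: the member names follow the commit message, but the string values, the helper names `phase_attr` and `phase_overrides`, and the example override value are assumptions.

```python
import argparse
from enum import Enum


class DumperPhase(str, Enum):
    # Member names follow the commit message; the string values are assumed.
    INFERENCE = "inference"
    FWD_ONLY = "fwd_only"
    FWD_BWD = "fwd_bwd"


def phase_attr(phase: DumperPhase) -> str:
    """Derive the argparse attribute name from the phase, with no lookup dict."""
    return f"dumper_{phase.value}"


def phase_overrides(args: argparse.Namespace, phase: DumperPhase):
    """Read the per-phase CLI value; returns None if the flag was not given."""
    return getattr(args, phase_attr(phase), None)


# Hypothetical parsed arguments, as argparse would produce for
# --dumper-fwd-only filter=embedding
args = argparse.Namespace(dumper_fwd_only=["filter=embedding"], dumper_inference=None)
fwd_only_overrides = phase_overrides(args, DumperPhase.FWD_ONLY)
```

Because argparse turns `--dumper-fwd-only` into the attribute `dumper_fwd_only`, keeping the flag names aligned with the phase values is what makes the separate mapping unnecessary.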

- parse_dumper_config: use type-aware coercion via _FrozenConfig._parse_env_value
  instead of value-based heuristics (e.g. "0" no longer becomes False for str fields)
- configure_dumper_for_phase: reset to defaults before each phase to prevent
  cross-phase config leaking via singleton replace()
- dumper_phase_scope: disable dumper on scope exit to prevent stray dumps
- sglang_engine: clean up DUMPER_* env vars after server launch
- Fix --dumper-enable help text referencing old --dumper-sglang flag
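To illustrate the first two bullets, here is a small sketch of type-aware coercion combined with a reset to defaults. It does not reproduce `_FrozenConfig._parse_env_value`; the `ExampleConfig` fields, `coerce`, and `parse_overrides` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ExampleConfig:
    # Hypothetical fields, for illustration only.
    enable: bool = False
    exp_name: str = "default"
    max_steps: int = 0


_TRUE = {"1", "true", "yes"}
_FALSE = {"0", "false", "no"}


def coerce(raw: str, target_type: type):
    """Convert a raw string according to the declared field type."""
    if target_type is bool:
        lowered = raw.lower()
        if lowered in _TRUE:
            return True
        if lowered in _FALSE:
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    return target_type(raw)  # str stays a str, int is parsed as an int


def parse_overrides(pairs):
    """Turn 'key=value' strings into a config, validating keys and types."""
    types = {f.name: f.type for f in fields(ExampleConfig)}
    overrides = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or key not in types:
            raise ValueError(f"unknown or malformed override: {pair!r}")
        overrides[key] = coerce(raw, types[key])
    # Start from a fresh default instance so nothing leaks from a previous phase.
    return replace(ExampleConfig(), **overrides)


cfg = parse_overrides(["exp_name=0", "enable=0", "max_steps=3"])
```

With a value-based heuristic, `exp_name=0` would have been turned into `False`; keying the conversion on the field type keeps it as the string `"0"`.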

Remove parse_dumper_config from dumper_utils.py; the parsing logic
now lives in _DumperConfig (sglang side) so other users can reuse it.
Internally use _kv_pairs_to_dict for partial-override semantics.

Add extra_env parameter to launch_server_process that sets env vars
inside the child process only, avoiding the os.environ.update/pop
pattern that leaked state in the parent Ray actor.
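The child-only environment idea can be shown with the standard library. This sketch uses `subprocess` rather than the multiprocessing wrapper in `sglang_engine.py`, and `run_child_with_env` is a hypothetical helper; only the `DUMPER_EXP_NAME` variable name comes from the PR.

```python
import os
import subprocess
import sys


def run_child_with_env(extra_env: dict) -> str:
    """Run a child that prints DUMPER_EXP_NAME, passing extra_env to it only."""
    # Build a merged copy; os.environ in the parent is never modified.
    child_env = {**os.environ, **extra_env}
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('DUMPER_EXP_NAME', ''))"],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


os.environ.pop("DUMPER_EXP_NAME", None)  # make sure the parent starts clean
seen_by_child = run_child_with_env({"DUMPER_EXP_NAME": "engine_0"})
leaked_into_parent = "DUMPER_EXP_NAME" in os.environ
```

Since the parent never calls `os.environ.update`, there is nothing to pop afterwards and no window in which another task in the same process could observe the temporary values.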

- Remove dumper_phase_scope, add finalize_dumper_phase
- Call configure_dumper_for_phase before and finalize_dumper_phase after
- Move step() to end (after dump_model) instead of beginning

Instead of passing env vars through extra_env/multiprocessing wrapper in
sglang_engine.py, inject DUMPER_* vars into the Ray actor's runtime_env
alongside other env vars. The spawned SGLang server subprocess inherits
them naturally.

Instead of pre-setting all DUMPER_* env vars at Ray actor creation,
only set DUMPER_SERVER_PORT=reuse to register the HTTP endpoint.
Configure and enable the dumper dynamically via POST /dumper/configure
before each rollout, giving each engine its own exp_name=engine_{i}.
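A rough sketch of what the per-engine configure call might look like is given below. Only the `/dumper/configure` path and the `exp_name=engine_{i}` convention come from the commit message; the JSON payload keys, the helper name, and the example URLs are assumptions, and the request is built but not sent.

```python
import json
import urllib.request


def build_configure_request(base_url: str, engine_index: int, dump_dir: str) -> urllib.request.Request:
    """Build (but do not send) a POST request configuring one engine's dumper."""
    # Payload keys are assumed for illustration; check the sglang dumper
    # for the fields it actually accepts.
    payload = {"enable": True, "dir": dump_dir, "exp_name": f"engine_{engine_index}"}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/dumper/configure",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical worker URLs; in miles these would be discovered at rollout time.
worker_urls = ["http://127.0.0.1:30000", "http://127.0.0.1:30001/"]
pending = [build_configure_request(url, i, "/tmp/dumps") for i, url in enumerate(worker_urls)]
# To actually send one: urllib.request.urlopen(pending[0], timeout=10)
```

Configuring over HTTP before each rollout means the engine index is known at call time, which is what lets every engine receive a distinct `exp_name`.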

Also extract _get_worker_urls helper in sglang_rollout.py.

Callers no longer need to obtain and pass worker_urls; the function
discovers them itself via get_worker_urls from inference_rollout_train.

Inline worker URL discovery back into abort() to keep
sglang_rollout.py diff minimal. Enable cleanup_previous by default
for both SGLang and Megatron dumper phases.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive dumper utility into the miles framework, enabling detailed data capture and analysis during various stages of model execution. By integrating this tool into both Megatron and SGLang components, users gain enhanced visibility into model internals, facilitating debugging and performance analysis. The changes provide flexible configuration options through new command-line arguments, allowing for tailored data dumping based on specific needs.

Highlights

  • Dumper Utility Integration: A new dumper_utils.py module has been added to centralize dumper configuration and management across different execution phases.
  • Megatron Dumper Hooks: The dumper has been integrated into Megatron's forward_only and train_one_step functions, allowing for data dumping during forward-only and forward-backward passes.
  • SGLang Dumper Configuration: The dumper can now be configured for SGLang inference rollouts, both by setting environment variables for Ray actors and via HTTP calls for runtime control.
  • New Command-Line Arguments: Several new command-line arguments (--dumper-enable, --dumper-dir, --dumper-inference, --dumper-fwd-only, --dumper-fwd-bwd) have been introduced to provide granular control over dumper behavior.
  • End-to-End Test: A new end-to-end test (test_dumper.py) has been added to verify the correct functionality of the dumper across different phases.

Changelog

  • miles/backends/megatron_utils/model.py
    • Imported dumper utilities for phase-specific configuration.
    • Configured dumper for the forward-only phase.
    • Finalized the dumper phase after the forward step.
    • Configured dumper for the forward-backward phase.
    • Finalized the dumper phase after optimizer zero_grad.
  • miles/ray/rollout.py
    • Imported get_dumper_env_for_sglang to manage dumper environment variables.
    • Updated environment variables for SGLang workers to include dumper settings.
  • miles/rollout/inference_rollout/inference_rollout_train.py
    • Imported configure_dumper_for_sglang for SGLang dumper setup.
    • Called configure_dumper_for_sglang during asynchronous inference rollout generation.
  • miles/rollout/sglang_rollout.py
    • Imported configure_dumper_for_sglang for SGLang dumper setup.
    • Called configure_dumper_for_sglang during asynchronous rollout generation.
  • miles/utils/arguments.py
    • Added new command-line arguments (--dumper-enable, --dumper-dir, --dumper-inference, --dumper-fwd-only, --dumper-fwd-bwd) for dumper configuration.
  • miles/utils/dumper_utils.py
    • Added a new utility file containing DumperPhase enum, shared helpers, SGLang-specific dumper functions, and Megatron-specific dumper functions.
  • tests/e2e/short/test_dumper.py
    • Added a new end-to-end test to prepare, execute, and verify dumper functionality for both SGLang and Megatron phases.

dir = dumper_dir for all phases, exp_name carries the distinction:
engine_{i} for inference, fwd_only / fwd_bwd for Megatron.
Eliminates redundant fwd_only/fwd_only nesting.
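
The flattened layout can be summarized in a few lines. This is a sketch under the assumption that the dumper writes under `dir/exp_name`; `dump_location` is a hypothetical helper, not a function from the PR.

```python
from pathlib import Path


def dump_location(dumper_dir: str, phase: str, engine_index=None):
    """Return (dir, exp_name): dir is shared, exp_name distinguishes the phase."""
    if phase == "inference":
        if engine_index is None:
            raise ValueError("inference dumps need an engine index")
        exp_name = f"engine_{engine_index}"
    else:
        exp_name = phase  # "fwd_only" or "fwd_bwd"
    return Path(dumper_dir), exp_name


base, name = dump_location("/data/dumps", "fwd_only")
target = base / name  # one level of fwd_only, not fwd_only/fwd_only
```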
Replace isinstance guard + setdefault with a single dict expression.
_kv_pairs_to_dict already handles None, and spread syntax gives the
same override-default semantics as setdefault.
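
The equivalence claimed here can be checked with a small stand-in. `kv_pairs_to_dict` below only imitates the behavior the commit message attributes to `_kv_pairs_to_dict` (returning an empty dict for None), and `DEFAULTS` is a made-up example.

```python
def kv_pairs_to_dict(pairs):
    """Stand-in for _kv_pairs_to_dict: 'k=v' strings to a dict, None gives {}."""
    if pairs is None:
        return {}
    return dict(pair.split("=", 1) for pair in pairs)


DEFAULTS = {"enable": "1", "filter": ""}  # hypothetical defaults


def merged_before(pairs):
    """Old shape: guard the input, then fill in defaults one by one."""
    overrides = kv_pairs_to_dict(pairs) if isinstance(pairs, list) else {}
    for key, value in DEFAULTS.items():
        overrides.setdefault(key, value)
    return overrides


def merged_after(pairs):
    """New shape: defaults first, user overrides unpacked on top."""
    return {**DEFAULTS, **kv_pairs_to_dict(pairs)}


example = merged_after(["filter=embedding"])
```

Later keys win in a dict display, so putting the defaults first gives user-supplied values priority, which is the same outcome as calling `setdefault` for each default afterwards.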
Callers were passing model[0] which would silently lose parameters
when virtual pipeline parallelism is enabled (len(model) > 1).
Now finalize() receives the full model list and asserts len == 1
until multi-chunk support is implemented.
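
A minimal sketch of the guard described above, with `single_chunk` as a hypothetical helper name and plain strings standing in for model chunks:

```python
from typing import Sequence


def single_chunk(model: Sequence[object]) -> object:
    """Accept the full model list and fail loudly if there are several chunks."""
    assert len(model) == 1, (
        f"expected exactly one model chunk, got {len(model)}; "
        "virtual pipeline parallelism is not supported yet"
    )
    return model[0]


chunk = single_chunk(["chunk0"])

try:
    single_chunk(["chunk0", "chunk1"])
    rejected = False
except AssertionError:
    rejected = True
```

Taking the whole list and asserting inside the callee moves the check to one place, so a caller can no longer drop the extra chunks silently by indexing with `[0]`.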
…tensors

dump_model() saved every parameter as a separate .pt file, which for MoE
models (e.g. 128 experts) produced files 40+ GB each, totaling 500+ GB
across ranks. Replace with save=False (console only) plus a single
lightweight model_summary dump containing shapes/dtypes/devices.
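
The metadata-only summary could look roughly like this. The function name and output structure are assumptions; it accepts any iterable of `(name, parameter)` pairs with `shape`, `dtype`, and `device` attributes, such as the result of `named_parameters()` on a PyTorch module, and the usage below uses plain stand-in objects so it runs without PyTorch.

```python
from types import SimpleNamespace


def summarize_parameters(named_parameters):
    """Collect shape/dtype/device per parameter; tensor values are never copied."""
    return {
        name: {
            "shape": tuple(param.shape),
            "dtype": str(param.dtype),
            "device": str(param.device),
        }
        for name, param in named_parameters
    }


# Stand-ins for real parameters, for illustration only.
fake_params = [
    ("embedding.weight", SimpleNamespace(shape=(1000, 64), dtype="bfloat16", device="cuda:0")),
    ("experts.0.w1", SimpleNamespace(shape=(64, 256), dtype="bfloat16", device="cuda:0")),
]
summary = summarize_parameters(fake_params)
```

The summary grows with the number of parameters rather than with their size, which is why it stays small even for a model with many experts.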

Speeds up e2e test by generating shorter responses (128 -> 20 tokens).

With enable_model_value and enable_model_grad defaulting to False,
the test must explicitly enable them via phase overrides so that
fwd_only and fwd_bwd directories are created for verification.

MoE model has too many parameters for dump_model to complete in
reasonable time. Explicitly disable enable_model_value/grad and
only verify engine_* (inference) dumps.

The non-intrusive dumper (non_intrusive_mode=core) needs
register_non_intrusive_dumper() to be called with the model so
forward hooks can dump input_ids, positions, etc. Without this,
fwd_only and fwd_bwd phases produce no dump files. Restore
EXP_PATTERNS to verify all three phases.

Pass model to DumperMegatronUtil.__init__ and register non-intrusive
hooks there. This is safe because _configure() calls dumper.reset()
which now removes previous hooks before re-registering.

Removes the separate register_hooks() method and its call sites in
forward_only() and train_one_step().

Megatron models pass input_ids/positions as keyword arguments,
which non-intrusive hooks cannot capture via pre_hook positional
args. Core mode only dumps fields matching {input_ids, positions},
so it produces no output for Megatron. Switch to 'all' mode to
capture sub-module I/O via positional args.
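
The keyword-argument problem can be reproduced without PyTorch. `TinyModule` below is a deliberately simplified stand-in, not `torch.nn.Module`; it mimics a pre-forward hook that is handed only the positional arguments, which is the default behavior of PyTorch's `register_forward_pre_hook` unless `with_kwargs=True` is passed. Whether the sglang dumper could rely on that option is not something this PR states.

```python
class TinyModule:
    """Minimal stand-in for a module with pre-forward hooks."""

    def __init__(self):
        self._pre_hooks = []

    def register_pre_hook(self, hook):
        self._pre_hooks.append(hook)

    def forward(self, input_ids=None, positions=None):
        return input_ids

    def __call__(self, *args, **kwargs):
        for hook in self._pre_hooks:
            hook(self, args)  # positional args only; kwargs are not forwarded
        return self.forward(*args, **kwargs)


captured = []
module = TinyModule()
module.register_pre_hook(lambda mod, args: captured.append(args))

module([1, 2, 3], [0, 1, 2])                      # positional call
module(input_ids=[1, 2, 3], positions=[0, 1, 2])  # keyword call, as in Megatron
```

After the two calls, the hook has seen both tensors for the positional call and an empty tuple for the keyword call, which matches the observation that core mode found nothing to dump.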

The barrier in sglang's cleanup_previous deadlocks under pipeline
parallelism. Miles handles dump directory cleanup externally via
prepare() before training, so the lazy barrier-based cleanup is
not needed.

non_intrusive_mode=all on a 30B MoE model generates thousands of
dump files per forward pass, causing excessive I/O. Filter to only
dump embedding-related modules to verify hooks work without
overwhelming disk.

- Remove cleanup_previous from sglang dumper config to avoid
  dist.barrier() deadlocks in async PP contexts
- Add _cleanup_dump_dir helper: rank-0 rmtree + barrier
- Call cleanup explicitly in DumperMegatronUtil._configure
- E2e test: switch from non_intrusive_mode=all to default core
  mode, disable model value/grad dumps for both fwd phases

The DumperConfig default is already False, and no caller passes
this field explicitly.

Avoids rmtree on files or symlinks to files.

Move torch.distributed import to module level, extract _get_rank
helper, flatten the if/elif into two independent checks.
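
A sketch of the "rank-0 rmtree + barrier" pattern follows. It is not the `_cleanup_dump_dir` from the PR: the rank and the barrier are injected as parameters so the example runs without `torch.distributed`, and the extra symlink check is an assumption added for safety.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def cleanup_dump_dir(path: Path, rank: int, barrier: Callable[[], None]) -> None:
    """Rank 0 removes a stale dump directory, then every rank synchronizes."""
    # Check 1: only rank 0 deletes, and only a real directory, never a
    # regular file or a symlink.
    if rank == 0 and path.is_dir() and not path.is_symlink():
        shutil.rmtree(path)
    # Check 2: all ranks wait, so nobody writes before the cleanup is done.
    barrier()


barrier_calls = []
root = Path(tempfile.mkdtemp())
stale = root / "fwd_only"
stale.mkdir()
(stale / "old.pt").write_bytes(b"")
plain_file = root / "not_a_dir"
plain_file.write_text("keep me")

cleanup_dump_dir(stale, rank=0, barrier=lambda: barrier_calls.append("sync"))
cleanup_dump_dir(plain_file, rank=0, barrier=lambda: barrier_calls.append("sync"))
```

In a distributed run the barrier would be a collective such as `torch.distributed.barrier()`, and every rank has to reach it; that requirement is what made the lazy cleanup inside the dumper prone to deadlock when ranks took different code paths.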

Check that specific dump fields exist for each phase:
- engine_*: input_ids, positions (from SGLang ForwardBatch)
- fwd_only/fwd_bwd: input_ids, cu_seqlens_q, cu_seqlens_kv, qkv_format
  (from Megatron kwargs + PackedSeqParams)

Dump files are named like step=0___rank=0___name=input_ids.pt,
not just input_ids.pt. Use *name={field}.pt glob pattern.
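
The field check with the corrected glob pattern might be written like this. `missing_fields` is a hypothetical helper, and searching recursively with `rglob` is an assumption about where the files sit under the experiment directory.

```python
import tempfile
from pathlib import Path


def missing_fields(exp_dir: Path, fields):
    """Return the fields that have no dump file matching *name={field}.pt."""
    return [
        field
        for field in fields
        if not any(exp_dir.rglob(f"*name={field}.pt"))
    ]


# Build a tiny fake dump directory using the naming scheme from the commit message.
exp_dir = Path(tempfile.mkdtemp()) / "fwd_only"
exp_dir.mkdir()
(exp_dir / "step=0___rank=0___name=input_ids.pt").write_bytes(b"")

absent = missing_fields(exp_dir, ["input_ids", "cu_seqlens_q"])
```

Matching on the `name=` suffix rather than the bare file name is what makes the check independent of the step and rank prefixes.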
guapisolo added a commit to guapisolo/miles that referenced this pull request Feb 22, 2026
Squashed PR radixark#625 (ac8403/0): adds dumper_utils, CLI arguments,
forward/train phase scoping, env var propagation, and e2e tests
for Miles dumper integration to aid precision debugging.