Grid-searches DeepEP kernel parameters (num_sms, nvl_chunk, rdma_chunk) at training startup by benchmarking buffer.dispatch()/buffer.combine() with synthetic data. Stores the optimal Config objects as globals passed to every dispatch/combine call during training.

Key changes:
- job_config.py: add autotune, autotune_warmup, autotune_repeat, autotune_verbose, num_sms, nvl_buffer_size, rdma_buffer_size fields
- deepep.py: tuned config globals; config= kwargs on all dispatch/combine calls (forward + backward); full autotune grid search with internode support (joint num_sms tuning, RDMA chunk search)
- parallelize.py: wire autotune into qwen3; upgrade MoE to DeepEPMoE
- TOML configs for 1-node (EP=8) and 2-node (EP=16) debug runs
- sbatch script with NVSHMEM env vars for internode RDMA

Internode safety: restrict the num_sms sweep to a single validated value to prevent DeepEP dispatch timeouts that fatally corrupt CUDA state.

Tested: 1-node (8x B200) and 2-node (16x B200) with decreasing loss.
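The benchmarking step above can be sketched as a small timing helper. This is a hypothetical sketch, not the actual helper in deepep.py; it uses wall-clock timing on a synchronous callable, whereas real GPU kernels would need torch.cuda.Event timing plus a device synchronize:

```python
import time
from typing import Callable


def bench_fn(fn: Callable[[], None], warmup: int = 5, repeat: int = 20) -> float:
    """Return the median wall-clock latency of fn() in milliseconds.

    Warmup iterations are discarded so one-time costs (JIT compilation,
    cache population, allocator growth) don't skew the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to outlier spikes
```

The grid search would call a helper like this once per candidate Config and keep the argmin.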
- Replace 6 bare module globals with a DeepEPState singleton class (_buffer, _handle_cache, _handle_counter, _pending_combine_event, _tuned_dispatch_config, _tuned_combine_config -> _state)
- Add _create_uniform_routing() for deterministic round-robin routing in autotune benchmarks, replacing random scores + torch.topk
- Add setup_deepep() to centralize SAC registration, the MoE -> DeepEPMoE upgrade, and autotune into a single call
- Simplify qwen3 parallelize.py from a 35-line block to a one-liner
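A deterministic round-robin router like the _create_uniform_routing() mentioned above could look as follows. This is a minimal sketch with hypothetical shapes; the real helper would return tensors shaped for buffer.dispatch():

```python
def create_uniform_routing(num_tokens: int, num_experts: int, top_k: int):
    """Deterministically assign each token to top_k experts, round-robin.

    Unlike random scores + torch.topk, this gives every expert an (almost)
    identical load, so autotune timings are reproducible across runs.
    """
    indices = []
    for t in range(num_tokens):
        # Token t takes the next top_k experts in cyclic order.
        indices.append([(t * top_k + j) % num_experts for j in range(top_k)])
    return indices


routing = create_uniform_routing(num_tokens=8, num_experts=4, top_k=2)
# With num_tokens * top_k divisible by num_experts, every expert
# receives exactly num_tokens * top_k / num_experts assignments.
```

Balanced, deterministic load is what makes per-candidate benchmark timings comparable across grid points.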
…ebug configs
- Apply setup_deepep() to llama4 and deepseek_v3 parallelize.py (replaces 22-line DeepEP blocks with a single call)
- Add EP/ETP validation checks to setup_deepep()
- Document autotune in deepep/README.md (config options, usage, example output)
- Remove sbatch script and debug TOML configs from tracking
Restore the pre-existing ep_enabled/etp_enabled checks and SAC registration in llama4 and deepseek_v3. Remove duplicate validation from setup_deepep() since callers handle it.
Keep these files identical to the base branch to minimize merge conflicts when merging upstream changes.
Restore the original bare globals and code structure. Only add:
- _create_uniform_routing(): deterministic round-robin routing for autotune
- setup_deepep(): centralized SAC + MoE upgrade + autotune setup
Keep original inline logic in parallelize.py. The only change to deepep.py is replacing random synthetic data with uniform round-robin routing (_create_uniform_routing) for autotune.
Eliminate all `global` statements by grouping mutable process state into a simple _State class. Functions remain free functions; they just access _state.xxx instead of bare globals.
24 tests covering:
- _create_uniform_routing: shapes, dtypes, round-robin order, balanced load
- _State config management: get/set/overwrite tuned configs
- _bench_fn: timing, warmup/repeat counts, exception propagation
- _detect_internode: intranode vs internode topology detection
- _get_gpu_sm_range: GPU-specific SM ranges, fallback behavior
- run_deepep_autotune_if_enabled: default config fallback paths
Run autotune at the beginning of training (alongside the LLEP autotune) instead of during model parallelization. This avoids adding autotune code to each model's parallelize.py file.
…nt double-run
- Use DeepEP's pre-tuned nvl_buffer_size per rank count (256→288→480→720) instead of a hardcoded 256. The built-in values are optimized per topology.
- When autotune=false, use Buffer.get_dispatch_config()/get_combine_config() directly instead of hardcoded defaults that don't match DeepEP's tuned values.
- Expand the internode search space: nvl_dispatch [2, 48], nvl_combine [1, 16], rdma [4, 36]. This covers all DeepEP built-in defaults, so autotune converges to them if they're already optimal.
- Guard the train.py autotune call to skip if configs were already set by parallelize.py, preventing a double autotune.
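The double-run guard in the last bullet can be sketched as follows (hypothetical function name; the real check lives in train.py and tests the module's tuned-config state):

```python
from types import SimpleNamespace


def run_autotune_if_needed(state, run_autotune):
    """Run autotune only if parallelize.py hasn't already populated configs."""
    if (state.tuned_dispatch_config is not None
            and state.tuned_combine_config is not None):
        return False  # already tuned earlier in startup; don't tune twice
    state.tuned_dispatch_config, state.tuned_combine_config = run_autotune()
    return True


# Usage: the second call is a no-op because the first one set both configs.
state = SimpleNamespace(tuned_dispatch_config=None, tuned_combine_config=None)
run_autotune_if_needed(state, lambda: ("dispatch_cfg", "combine_cfg"))
```

An idempotence guard like this matters because autotune is expensive (hundreds of benchmarked candidates) and rerunning it mid-startup would stall every rank.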
Phase 0 searches nvl_buffer_size over [128, 256, 288, 384, 480, 512, 560, 720] and rdma_buffer_size over [64, 128, 256] before chunk tuning. The candidate list includes DeepEP's built-in per-rank value so autotune can converge to it if it is already optimal.
Smaller buffer sizes cause a CUDA illegal memory access in internode dispatch. Filter candidates to only values at or above DeepEP's recommended per-rank-count value.
Replace phased sequential tuning with a full Cartesian search over (num_sms, nvl_chunk, nvl_buffer_size, rdma_chunk, rdma_buffer_size), tuned independently for dispatch and combine. Based on https://nousresearch.com/moe-scaling-field-notes/:
- num_sms range extended to 128 (2.3-2.6x speedup over 24)
- nvl_buffer_size up to 1024 (the blog found the optimum at 1024)
- all parameters searched jointly, with no greedy phase decomposition
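The joint search can be sketched with itertools.product. This is a sketch over a toy grid and an injected benchmark callable; the grids below are illustrative subsets of the ranges in the commit text, and the real search would benchmark buffer.dispatch()/buffer.combine() per candidate:

```python
import itertools

# Illustrative candidate grids (subsets of the ranges described above).
GRID = {
    "num_sms":          [24, 48, 96, 128],
    "nvl_chunk":        [4, 8, 16, 32],
    "nvl_buffer_size":  [256, 512, 1024],
    "rdma_chunk":       [8, 16, 32],
    "rdma_buffer_size": [128, 256],
}


def cartesian_search(bench):
    """Full Cartesian search: benchmark every combination, keep the fastest.

    `bench` maps a config dict to a latency in milliseconds. Unlike greedy
    phase decomposition, this cannot miss optima created by parameter
    interactions (e.g. a chunk size that only wins at a larger buffer).
    """
    names = list(GRID)
    best_cfg, best_ms = None, float("inf")
    for values in itertools.product(*(GRID[n] for n in names)):
        cfg = dict(zip(names, values))
        ms = bench(cfg)
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```

For these grids that is 4 x 4 x 3 x 3 x 2 = 288 candidates per direction, which stays cheap at startup because each benchmark takes only a few milliseconds, and dispatch and combine are searched independently.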
Sweeping nvl_buffer_size at runtime causes unrecoverable CUDA crashes (illegal memory access) when values are too small. Buffer sizes are hardware- and topology-dependent, and DeepEP's per-rank-count defaults are already well tuned. Only sweep chunk sizes and num_sms (intranode).