Conversation
Summary of ChangesHello @Ratish1, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on improving the documentation and clarity of command-line arguments in Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request does a great job of synchronizing the command-line argument help strings with the documentation, which significantly improves usability and clarity for users. The new descriptions are much more detailed and helpful. I've identified a couple of minor areas for improvement to enhance maintainability and user experience.
| help=( | ||
| "Whether to colocate the inference engines and the actor. " | ||
| "Turning this on will also set --offload to true." | ||
| "Deploy training and rollout on the same GPUs. This mode automatically enables `--offload-train` and `--offload-rollout` to facilitate weight-swapping between the training actor and inference engine. **Note:** The offload parameters are currently only used for AMD GPUs and will be removed soon. **Memory Tip:** When colocating, it is highly recommended to set `--sglang-mem-fraction-static` to **0.8** (especially on **NVIDIA Blackwell B200/B300** GPUs). This leaves sufficient VRAM (~20%) for Megatron to initialize its structures before the first weight offload to CPU occurs. On GB200/GB300, values up to 0.75 are safer for long-running jobs to prevent potential OOMs. #TODO: Verify optimal fraction for Blackwell in production" | ||
| ), |
There was a problem hiding this comment.
The help string for --colocate contains a #TODO comment. This appears to be an internal development note and should not be part of a user-facing help message. It's recommended to remove this from the help string and track it as a code comment elsewhere or in an issue tracker.
| help=( | |
| "Whether to colocate the inference engines and the actor. " | |
| "Turning this on will also set --offload to true." | |
| "Deploy training and rollout on the same GPUs. This mode automatically enables `--offload-train` and `--offload-rollout` to facilitate weight-swapping between the training actor and inference engine. **Note:** The offload parameters are currently only used for AMD GPUs and will be removed soon. **Memory Tip:** When colocating, it is highly recommended to set `--sglang-mem-fraction-static` to **0.8** (especially on **NVIDIA Blackwell B200/B300** GPUs). This leaves sufficient VRAM (~20%) for Megatron to initialize its structures before the first weight offload to CPU occurs. On GB200/GB300, values up to 0.75 are safer for long-running jobs to prevent potential OOMs. #TODO: Verify optimal fraction for Blackwell in production" | |
| ), | |
| help=( | |
| "Deploy training and rollout on the same GPUs. This mode automatically enables `--offload-train` and `--offload-rollout` to facilitate weight-swapping between the training actor and inference engine. **Note:** The offload parameters are currently only used for AMD GPUs and will be removed soon. **Memory Tip:** When colocating, it is highly recommended to set `--sglang-mem-fraction-static` to **0.8** (especially on **NVIDIA Blackwell B200/B300** GPUs). This leaves sufficient VRAM (~20%) for Megatron to initialize its structures before the first weight offload to CPU occurs. On GB200/GB300, values up to 0.75 are safer for long-running jobs to prevent potential OOMs." | |
| ), |
There was a problem hiding this comment.
We should move TODOs out of the help text
There was a problem hiding this comment.
Yes, I thought of this too, but then the script would just keep failing since docs and arguments are out of sync.
cc: @zhaochenyang20
|
Hi @Ratish1 , I think for the sake of better code visualization, we should not put the helper text as a single line, I think a better way to do is instead of using a single string, stacking a bunch of substrings. e.g., instead of I believe there're a lot of linting tools can help you automate this. Please let me know what you think. Thanks. Also, for those single lined help text, why we need bracket around the text? I think we should remove those to make the coding style aligned. |
Yes I will follow this, thanks @zijiexia . I assumed the pre commit would fix this but it didnt. I will look into it and fix it. |
miles/utils/arguments.py
Outdated
| "Save the rollout data to this path for debugging. " | ||
| "The file will be saved to `save_debug_rollout_data.format(rollout_id)`." | ||
| ), | ||
| help=("Path to save rollout data for offline analysis. " "[Ref](../developer_guide/debug.md)"), |
There was a problem hiding this comment.
Here we split on periods even though it's a short string
miles/utils/arguments.py
Outdated
| "--check-weight-update-equal", | ||
| action="store_true", | ||
| help=( | ||
| "Use SGLang's weight checker to check and ensure that the loaded weight from HF checkpoint and received from Megatron " |
There was a problem hiding this comment.
Here we split on random place in the sentence. I was wondering what is the logic behind the splitting? Will this works with your sync script?
This is part 1 of #578 . Manual sync of current doc's description of parameters to the argument.py.