Make sparse gradients configurable in IndexedMultiplier #104
Open
merajhashemi wants to merge 5 commits into cooper-org:main from
Conversation
Coverage report
This report was generated by python-coverage-comment-action
juan43ramirez approved these changes on Sep 16, 2025
Summary
This PR adds a `sparse_grad` parameter to the `IndexedMultiplier` class to address compatibility issues with Distributed Data Parallel (DDP) training. The current implementation of `IndexedMultiplier` always uses sparse gradients, but sparse tensors are not supported by the all-reduce communication operation that DDP requires for multi-GPU training (see pytorch/22400).

Changes
- Added a `sparse_grad` parameter to `IndexedMultiplier` (default: `True` for backward compatibility).

Usage
By making the sparsity configurable, users can now choose dense gradients (`sparse_grad=False`) when training with DDP.

Important Note
When using `sparse_grad=False` with stateful optimizers (e.g., Adam), optimizer state is updated for all parameters, not only the sampled ones. This can lead to incorrect optimization behavior: the optimizer treats non-sampled indices as having zero gradients, when in reality those entries should not be updated at all.
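To make the note concrete, here is a minimal plain-Python sketch (a toy scalar Adam step, not cooper's or PyTorch's actual optimizer) of the failure mode: with dense gradients, a multiplier entry that is not sampled in the current step still receives a gradient of exactly zero, and stale Adam momentum moves it anyway, whereas a sparse gradient would omit the index and leave the entry untouched.

```python
# Toy single-scalar Adam step (hypothetical sketch, not torch.optim.Adam).
def adam_step(param, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad          # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - b1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (v_hat ** 0.5 + eps)

# Step 1: the entry is sampled and gets a real gradient.
state = {"t": 0, "m": 0.0, "v": 0.0}
p = adam_step(1.0, grad=1.0, state=state)

# Step 2: the entry is NOT sampled. A dense gradient reports grad=0.0,
# yet the stale momentum in `state` still moves the parameter:
p_dense = adam_step(p, grad=0.0, state=state)

# With a sparse gradient, the optimizer never sees this index in step 2,
# so the entry (and its state) stays put:
p_sparse = p

print(p_dense != p_sparse)  # True: the two conventions diverge
```

This is why the PR keeps `sparse_grad=True` as the default: dense gradients are only needed to satisfy DDP's all-reduce, and users who switch should be aware of the interaction with stateful optimizers described above.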