Hi, this might be a minor thing, but I have a question about distributed data parallel training. When we aggregate `grad_weights` from all machines using `model.allgather`, the `allgather` call performs a sum, so shouldn't we also divide `grad_weights` by `model.BATCHES`?
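To make the question concrete, here is a minimal sketch of the difference between summing and averaging gradients across workers. It is only an illustration: `sum_gradients` and `average_gradients` are hypothetical helpers, not functions from the codebase, and I am assuming the divisor would be the number of workers, which may or may not be what `model.BATCHES` holds.

```python
# Hypothetical stand-ins for the aggregation step; not the library's actual code.

def sum_gradients(per_worker_grads):
    """Element-wise sum of the gradient vectors reported by each worker."""
    return [sum(values) for values in zip(*per_worker_grads)]


def average_gradients(per_worker_grads):
    """Element-wise mean: the summed gradient divided by the number of workers."""
    num_workers = len(per_worker_grads)  # assumption: divisor is the worker count
    return [total / num_workers for total in sum_gradients(per_worker_grads)]


# Two workers, each holding a gradient for the same two weights.
grads = [[1.0, 2.0], [3.0, 4.0]]
summed = sum_gradients(grads)        # [4.0, 6.0]
averaged = average_gradients(grads)  # [2.0, 3.0]
```

If only the sum is applied, the effective step size grows with the number of workers unless the learning rate is scaled down to compensate, which is why I am asking about the extra division.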
Thank you!