Many thanks for the great work!
My understanding is that deterministic round-to-nearest-even (RNE) is applied in the forward pass for the best accuracy, while stochastic rounding (SR) is applied in the backward pass to avoid quantization bias. However, in your paper and implementation, SR is applied in both the forward and backward passes. So I was wondering if there is a reason for that?
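To make clear what I mean by the two modes, here is a minimal NumPy sketch. It is only an illustration of the general idea, not your implementation: the function names `round_nearest_even` and `round_stochastic` and the uniform grid spacing `step` are my own assumptions.

```python
import numpy as np


def round_nearest_even(x, step):
    """Deterministic rounding onto a grid of spacing `step`, ties go to even."""
    # np.rint rounds halfway cases to the nearest even integer.
    return np.rint(np.asarray(x, dtype=np.float64) / step) * step


def round_stochastic(x, step, rng):
    """Stochastic rounding onto a grid of spacing `step`.

    Rounds up with probability equal to the fractional part, so the
    result is unbiased in expectation: E[round_stochastic(x)] == x.
    """
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    frac = scaled - lower  # distance to the lower grid point, in [0, 1)
    round_up = rng.random(scaled.shape) < frac
    return (lower + round_up) * step


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.full(100_000, 0.3)  # many copies of a value between grid points
    print("RNE mean:", round_nearest_even(x, 1.0).mean())
    print("SR mean: ", round_stochastic(x, 1.0, rng).mean())
```

With a value such as 0.3 on a unit grid, RNE always returns 0, so the error is the same on every sample, while SR returns 0 or 1 at random and is correct on average at the cost of higher per-sample variance. That trade-off is why I expected RNE in the forward pass.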
Kind regards