How to avoid gates overly relying on one head

Hi, 

In your training, there seems to be no explicit control over how the contribution of each gate is weighted. How do you ensure that the gating network doesn't end up relying almost entirely on one gate and ignoring the others? For example, the model could learn to assign 0.9999 to one gate and roughly 0 to the rest after the softmax.
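
To make the concern concrete, here is a minimal NumPy sketch of what I mean. The logits are made up for illustration and are not taken from your model, and `softmax` and `gate_entropy` are hypothetical helpers I wrote for this post, not functions from your code. It only shows that one dominant logit is enough for the softmax to put nearly all the weight on a single gate, and that the entropy of the gate weights is one possible way to spot this.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def gate_entropy(weights, eps=1e-12):
    # Shannon entropy of the gate weights (hypothetical diagnostic):
    # close to log(num_gates) when balanced, close to 0 when collapsed.
    return float(-(weights * np.log(weights + eps)).sum())

# Illustrative logits only (assumed values, three gates).
balanced = softmax(np.array([0.1, 0.0, -0.1]))
collapsed = softmax(np.array([12.0, 0.0, 0.0]))

print("balanced weights: ", balanced, "entropy:", gate_entropy(balanced))
print("collapsed weights:", collapsed, "entropy:", gate_entropy(collapsed))
```

In the second case the first gate gets more than 0.9999 of the weight, so the other gates would receive almost no gradient signal through the gating weights. Is there anything in the training setup that prevents this?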

Thanks