Conversation

Have you tried this on the modded_nanogpt repo? What were the results?

I haven't tried it there. I've been working on nanoPLM, which is a framework for Protein Language Model experimentation in both distillation and pretraining (my focus has been on the latter). I don't think the intuition behind the paper (generalization) aligns with the goal of modded-nanogpt (eval-loss wall-clock speedrunning), so I didn't bother testing under those constraints. In our use case, though, SkipUpdate did improve the loss and downstream tasks slightly, albeit at limited model size. We're still waiting for access to clusters for the larger-scale ablations, but testing SkipUpdate and Magma on larger-scale models/datasets is on our list of experiments.

I tried it in modded_nanogpt; it's somewhat worse than NorMuon there. I also tried a 1B-parameter/100B-token run. It started out superior but gradually fell behind as the system became better optimized. Overall, my impression is that there's something useful here, but we perhaps don't have the right expression of it.

Was this using Magma or SkipUpdate? And what were the parameters?

skip_update_prob = 0.9, magma_tau = 0.5. I'll try some more variations.

SkipUpdate at 0.9 is way too aggressive, no? IIRC I tested at 0.5 and it was performing better than stock.

In the modded_nanogpt codebase hack, the meaning of p was inverted, so that was a 0.1 chance of skipping. After a parameter scan there I haven't been able to find anything helpful: p(keep) = {0.9, 0.7, 0.5} × tau = {--, 0.5, 1.0}.

Implements two stochastic update masking techniques from arxiv.org/abs/2602.15322 for both Muon and NorMuon.

SkipUpdate (skip_update_prob): at each step, each parameter matrix's update is independently kept with probability p or zeroed out with probability 1 - p. Surviving updates are rescaled by 1/p to stay unbiased in expectation. The moment buffers always update densely regardless of the skip.

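A minimal sketch of the masking path, assuming the update direction has already been computed and the moment buffers already updated densely; the function name and signature are illustrative, not the PR's actual code:

```python
import torch

def apply_skip_update(param, update, lr, skip_update_prob):
    # Per the description above, skip_update_prob is treated as the *keep*
    # probability (the thread notes the name reads as the opposite, which
    # caused the inverted-p confusion in the modded_nanogpt hack).
    p_keep = skip_update_prob
    # One Bernoulli draw per parameter matrix, not per element.
    if torch.rand(()) < p_keep:
        # Rescale the surviving update by 1/p so the step stays unbiased in expectation.
        param.add_(update, alpha=-lr / p_keep)
    # Otherwise the whole matrix's update is zeroed out for this step;
    # the momentum buffer was already updated densely before this point.
```
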
Magma (magma_tau): replaces the fixed 1/p rescaling with an adaptive EMA scale s_t driven by momentum-gradient cosine similarity. The scale is intentionally biased (there is no 1/s_t correction); the paper found unbiased variants to be unstable. Bernoulli masking is still applied on top.

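The EMA recurrence itself isn't spelled out above, so the following is only one plausible reading, with magma_tau assumed to act as the EMA decay; treat the exact update rule as a guess rather than the paper's formula:

```python
import torch
import torch.nn.functional as F

def magma_masked_step(param, update, grad, momentum, state, lr, magma_tau, p_keep):
    # Momentum-gradient alignment for this parameter matrix.
    cos = F.cosine_similarity(momentum.flatten(), grad.flatten(), dim=0).item()
    # Assumed recurrence: EMA of the alignment signal, with magma_tau as decay.
    state["s"] = magma_tau * state["s"] + (1.0 - magma_tau) * cos
    # Bernoulli masking is still applied on top of the adaptive scale.
    if torch.rand(()) < p_keep:
        # Intentionally biased: scale by s_t with no 1/s_t debiasing correction.
        param.add_(update, alpha=-lr * state["s"])
```
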
Both features are opt-in and off by default (None). For NorMuon, the mask is applied after the neuron-normalization step so both moment buffers (momentum and variance_neuron) always update densely, consistent with the paper's intent.
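
For reference, a hypothetical opt-in call; the import path, constructor signature, and lr value are invented for illustration, and only the two kwargs come from the text above:

```python
from nanoplm.optim import NorMuon  # hypothetical import path

# Both features default to None (off); pass values to opt in.
opt = NorMuon(model.parameters(), lr=0.02,
              skip_update_prob=0.5,  # keep probability for SkipUpdate
              magma_tau=0.5)         # EMA decay for Magma's adaptive scale
```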