
Add skip_update and magma support #27

Open
alint77 wants to merge 2 commits into microsoft:main from alint77:add/skipUpdate_magma

Conversation

@alint77 (Contributor) commented Feb 18, 2026

Implements two stochastic update masking techniques from arxiv.org/abs/2602.15322 for both Muon and NorMuon.

SkipUpdate (skip_update_prob): at each step, each parameter matrix is independently kept with probability p or zeroed out with probability 1-p. Surviving updates are rescaled by 1/p to stay unbiased in expectation. Moment buffers always update densely regardless of the skip.
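A minimal sketch of the SkipUpdate step as described above (function and variable names are illustrative, not the PR's actual API; the real implementation operates on tensors inside the optimizer):

```python
import random

def skip_update(update, p_keep, rng=random):
    """SkipUpdate sketch: keep the whole update matrix with probability
    p_keep, otherwise zero it out. Survivors are rescaled by 1/p_keep so
    the masked update equals the original update in expectation.
    Moment buffers are assumed to have been updated densely beforehand."""
    if rng.random() < p_keep:
        return [[u / p_keep for u in row] for row in update]
    return [[0.0 for _ in row] for row in update]
```

Averaging many draws of `skip_update(u, 0.5)` recovers `u`, which is the unbiasedness-in-expectation property the description relies on.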

Magma (magma_tau): replaces the fixed 1/p rescaling with an adaptive EMA scale driven by momentum-gradient cosine similarity:

ẽ_t = sigmoid(cossim(μ_t_before, g_t) / τ)
s_t  = 0.9 * s_{t-1} + 0.1 * ẽ_t

The scale is intentionally biased (no 1/s_t correction); the paper found unbiased variants to be unstable. Bernoulli masking is still applied on top.
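The two formulas above can be sketched as follows (pure-Python, names illustrative; the PR presumably works on flattened parameter tensors):

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two flat vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def magma_scale(s_prev, momentum_before, grad, tau):
    """One Magma step: e_t = sigmoid(cossim(mu_t_before, g_t) / tau),
    then s_t = 0.9 * s_{t-1} + 0.1 * e_t.
    Deliberately no 1/s_t debiasing, matching the paper."""
    e = 1.0 / (1.0 + math.exp(-cosine_sim(momentum_before, grad) / tau))
    return 0.9 * s_prev + 0.1 * e
```

When momentum and gradient are aligned, the EMA drifts toward sigmoid(1/tau) > 0.5; when they are orthogonal, it drifts toward 0.5, so well-aligned updates get scaled up relative to noisy ones.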

Both features are opt-in and off by default (None). For NorMuon, the mask is applied after the neuron-normalization step so both moment buffers (momentum and variance_neuron) always update densely, consistent with the paper's intent.

@alint77 (Contributor, Author) commented Feb 18, 2026

@microsoft-github-policy-service agree

@JohnLangford (Contributor) commented:

Have you tried this on the modded_nanogpt repo? What were the results?

@alint77 (Contributor, Author) commented Mar 27, 2026

I haven't tried it there; I've been working on nanoPLM, a framework for Protein Language Model experimentation in both distillation and pretraining (my focus has been on the latter).

I don't think the intuition behind the paper (generalization) aligns with the goal of modded-nanogpt (eval-loss wallclock speedrunning), so I didn't bother testing under those constraints. In our use case, though, SkipUpdate did slightly improve the loss and downstream tasks, albeit at limited model size.

We're still waiting for access to clusters for the larger-scale ablations, but testing SkipUpdate and Magma on larger models/datasets is on our list of experiments.

@JohnLangford (Contributor) commented:

I tried it in modded_nanogpt: it's somewhat worse than NorMuon there. I also tried a 1B-parameter/100B-token run. It started out superior but gradually fell behind as the system became better optimized. Overall, my impression is that there's a useful thing here, but we perhaps don't have the right expression of it.

@alint77 (Contributor, Author) commented Mar 31, 2026

Was this using Magma or SkipUpdate? And what were the parameters?

@JohnLangford (Contributor) commented:

skip_update_prob = 0.9, magma_tau = 0.5. I'll try some more variations.

@alint77 (Contributor, Author) commented Apr 1, 2026

SkipUpdate at 0.9 is way too aggressive, no? IIRC I tested at 0.5 and it performed better than stock.

@JohnLangford (Contributor) commented:

In the modded_nanogpt codebase hack, the meaning of p was inverted, so that was a 0.1 chance of skipping. After a parameter scan there I haven't been able to find anything helpful: p(keep) = {0.9, 0.7, 0.5} × tau = {off, 0.5, 1.0}.
