This feature would monitor LM head overlaps (agreement between per-layer and final-layer token predictions) for each of the earlier layers.
Two variations:
- learned LM heads per layer, included in backprop (mechanism still to be decided)
- the word embedding table (WTE) reused as the LM head at every layer, not included in backprop
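A minimal sketch of the second variation, under stated assumptions: "overlap" is taken here to mean the fraction of the final layer's top-k tokens that an earlier layer also ranks in its own top-k, and the WTE is used as a frozen projection. The function names (`project_to_vocab`, `top_k_tokens`, `layer_overlaps`) and the toy numbers are hypothetical, and the final layer norm that a real model applies before its LM head is omitted. Plain Python lists stand in for tensors so the example runs without dependencies.

```python
from typing import List, Sequence, Set

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project_to_vocab(hidden: Vector, wte: Matrix) -> List[float]:
    """Compute logits as hidden . WTE^T, reusing the embedding table as the LM head."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in wte]


def top_k_tokens(logits: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k largest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(ranked[:k])


def layer_overlaps(hidden_per_layer: Sequence[Vector], wte: Matrix, k: int = 2) -> List[float]:
    """For each layer, the fraction of the final layer's top-k tokens it also ranks in its top-k."""
    final_top = top_k_tokens(project_to_vocab(hidden_per_layer[-1], wte), k)
    return [
        len(top_k_tokens(project_to_vocab(hidden, wte), k) & final_top) / k
        for hidden in hidden_per_layer
    ]


# Toy setup (hypothetical values): 4-token vocabulary, 2-dimensional hidden states.
wte = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
hidden_per_layer = [
    [-1.0, -0.5],  # layer 0
    [1.0, -0.5],   # layer 1
    [1.0, 0.2],    # final layer
]

overlaps = layer_overlaps(hidden_per_layer, wte, k=2)
print(overlaps)  # [0.0, 0.5, 1.0]
```

Because nothing here is trained, the monitor adds no parameters and stays out of backprop. For the first variation, one option (an assumption, not something this note specifies) would be to swap `wte` for a separate learned matrix per layer.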
Bonus: include the early exit innovations from the SLED paper:
https://arxiv.org/pdf/2411.02433