Implement Early Exit

This feature would monitor, for each of the earlier layers, how much its LM head output overlaps with the final layer's output.
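
A minimal sketch of one possible overlap metric, assuming "overlap" means the fraction of the final layer's top-k tokens that also appear in an earlier layer's top-k. The function names (`top_k_ids`, `top_k_overlap`) and the toy logits are hypothetical and only illustrate the idea; other metrics (argmax agreement, KL divergence) would fit the same slot.

```python
def top_k_ids(logits, k):
    """Return the set of indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def top_k_overlap(layer_logits, final_logits, k=5):
    """Fraction of the final layer's top-k tokens also in this layer's top-k."""
    early = top_k_ids(layer_logits, k)
    final = top_k_ids(final_logits, k)
    return len(early & final) / k


# Toy logits over a 6-token vocabulary (made-up numbers, not model output).
final_logits = [0.1, 2.0, 0.3, 1.5, -1.0, 0.0]
per_layer_logits = [
    [1.0, 0.2, 0.9, 0.1, 0.0, 0.3],   # layer 0
    [0.2, 1.1, 0.8, 0.1, 0.0, 0.3],   # layer 1
    [0.0, 1.8, 0.2, 1.2, -0.5, 0.1],  # layer 2
]

# One overlap score per earlier layer, to be logged during training or eval.
overlaps = [top_k_overlap(l, final_logits, k=2) for l in per_layer_logits]
print(overlaps)
```

A layer whose overlap stays near 1.0 would be a candidate exit point, since its prediction already agrees with the final layer.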

Two variations:
1. Learned LM heads per layer, included in backprop (how is still to be decided).
2. LM head shared with the word embedding table (WTE) at every layer, but not included in backprop.
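
A rough sketch of variation 2, assuming the shared head is just a dot product between a layer's hidden state and each row of WTE. All names and numbers here (`tied_lm_head`, `hidden_per_layer`, the toy `wte`) are hypothetical. Pure Python has no gradients, so "not included in backprop" holds trivially; in an autograd framework the hidden state would need a stop-gradient before this projection. Real models usually apply a final layer norm before the head, which is omitted here. For variation 1, each layer would use its own learned matrix in place of `wte`.

```python
def tied_lm_head(hidden, wte):
    """Logits for one position: dot product of the hidden state with each WTE row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in wte]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy WTE: 4 tokens, embedding size 3 (made-up numbers).
wte = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# Hypothetical hidden states for one position after each layer.
hidden_per_layer = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.9],
    [0.1, 0.2, 1.5],
]

# Predicted token per layer using the same WTE-based head everywhere.
predictions = [argmax(tied_lm_head(h, wte)) for h in hidden_per_layer]
final_prediction = predictions[-1]

# Earliest layer whose prediction already matches the final layer's.
first_agreeing_layer = next(
    i for i, p in enumerate(predictions) if p == final_prediction
)
print(predictions, first_agreeing_layer)
```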

Bonus: include the early exit innovations from the SLED paper:
https://arxiv.org/pdf/2411.02433