This repository compares a baseline causal self-attention model (vanilla) against an exclusive self attention variant.
The paper is included here:
The implementation is in train.py.
Inside the attention head output computation, we remove the component of the output vector along the value vector direction:
- Compute dot product: dot_product = sum(out * v)
- Compute squared norm: v_norm_sq = sum(v * v)
- Compute projected component: component = (dot_product / (v_norm_sq + 1e-8)) * v
- Subtract projection: out = out - component
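The projection removal can be sketched in plain Python for a single output vector and its value vector. This is a minimal illustration, not the code in train.py: the function name remove_value_component is hypothetical, and it assumes the sums run over the feature dimension of one head at one position, which this README does not state explicitly.

```python
def remove_value_component(out, v, eps=1e-8):
    """Return `out` with its component along `v` subtracted.

    Hypothetical helper for illustration. `out` and `v` are equal-length
    sequences of floats (assumed: one head, one position).
    """
    # Dot product between the head output and the value vector.
    dot_product = sum(o * x for o, x in zip(out, v))
    # Squared norm of the value vector.
    v_norm_sq = sum(x * x for x in v)
    # Scalar projection coefficient; eps avoids division by zero when v is all zeros.
    scale = dot_product / (v_norm_sq + eps)
    # Subtract the projected component from the output.
    return [o - scale * x for o, x in zip(out, v)]


out = [3.0, 4.0]
v = [1.0, 0.0]
result = remove_value_component(out, v)
# The result is (up to eps) orthogonal to v: its first entry is close to zero.
residual_dot = sum(r * x for r, x in zip(result, v))
```

Because of the 1e-8 term, the result is only approximately orthogonal to v, and a zero value vector leaves the output unchanged.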
This behavior is toggled with use_exclusive_self_attention=True.
The script runs both configurations:
- vanilla
- exclusive self attention
It saves per-run CSV logs and comparison plots to outputs_compare.
python train.py

After training completes, plots and CSV logs are available in outputs_compare/.
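To check what a run produced, a short helper along these lines could list the files. It is a sketch under assumptions: the function name list_run_outputs is hypothetical, the files are assumed to sit directly inside outputs_compare, and the plot file extensions are guesses, since this README does not name the output files.

```python
from pathlib import Path

# Assumed plot formats; the actual extension used by train.py is not documented here.
PLOT_SUFFIXES = {".png", ".jpg", ".jpeg", ".svg", ".pdf"}


def list_run_outputs(output_dir="outputs_compare"):
    """Return (csv_logs, plots) as sorted lists of paths found in output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing to list if training has not been run yet.
        return [], []
    csv_logs = sorted(root.glob("*.csv"))
    plots = sorted(p for p in root.iterdir() if p.suffix.lower() in PLOT_SUFFIXES)
    return csv_logs, plots


if __name__ == "__main__":
    logs, plots = list_run_outputs()
    for path in logs + plots:
        print(path)
```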

