Infrastructure for capturing LLM activations and SAE (Sparse Autoencoder) features, training probes for prompt-maliciousness detection, and evaluating out-of-distribution generalization with Leave-One-Dataset-Out (LODO) cross-validation.
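The LODO protocol above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes a feature matrix `X` (e.g. captured activations), binary maliciousness labels `y`, and per-example dataset ids `groups`, and stands in a plain logistic-regression probe for whatever probe the repo actually trains. Each dataset is held out in turn, the probe is fit on the remaining datasets, and held-out accuracy measures out-of-distribution generalization.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, steps=500):
    # Plain logistic-regression probe fit by gradient descent
    # (a stand-in for the repo's probe; hypothetical hyperparameters).
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)  # clip logits to avoid exp overflow
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def lodo_scores(X, y, groups):
    # Leave-One-Dataset-Out: for each dataset id, train the probe on all
    # other datasets and report accuracy on the held-out one.
    scores = {}
    for g in np.unique(groups):
        held = groups == g
        w, b = train_logreg(X[~held], y[~held])
        pred = (X[held] @ w + b) > 0
        scores[g] = float(np.mean(pred == y[held]))
    return scores
```

A probe that only memorizes dataset-specific artifacts will score near chance on the held-out dataset, which is exactly what LODO is designed to expose.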
A stable linear direction in GPT-2's residual stream predicts per-token decision quality beyond output confidence, hand-designed statistics, and SAE features. Three initializations converge to the same projection.
Reverse-engineering hidden backdoor triggers in three 671B DeepSeek-V3 language models. Activation-space probing, SVD weight analysis, and absorbed MLA SVD for the Jane Street Dormant LLM Puzzle.
Evaluation framework for methods that probe and steer LLM activations to mitigate Chain-of-Thought unfaithfulness. Research project by Giovanni M. Occhipinti (University of Bologna), Alessandro Abate and Nandi Schoots (University of Oxford).