# Awesome-LLM-Explainability
A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the explainability implications, challenges, and advancements surrounding these powerful models.
🚧 This repository is under construction (with daily updates) 🚧
We've curated a collection of the latest 📈, most comprehensive 📚, and most valuable 💡 resources on large language model explainability (LLM Explainability). Beyond papers, we also include relevant talks, tutorials, conferences, news, and articles. The repository is updated continuously to keep the most current information at your fingertips.
| Date | Institute | Publication | Paper Title | GitHub |
|------|-----------|-------------|-------------|--------|
| 2024 | New Jersey Institute of Technology | ACM TIST | Explainability for Large Language Models: A Survey | GitHub |
| 2024 | Imperial College | arXiv | From Understanding to Utilization: A Survey on Explainability for Large Language Models | N/A |
| 2024 | Hong Kong University of Science and Technology | arXiv | Explainable Artificial Intelligence for Scientific Discovery | N/A |
| 2024 | UMaT | arXiv | Explainable Artificial Intelligence (XAI): from Inherent Explainability to Large Language Models | N/A |
| 2024 | Nanyang Technological University | arXiv | XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models | N/A |
| 2024 | University of Maryland | arXiv | Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey | N/A |
## Perturbation-based LLM Explainability
| Date | Institute | Publication | Paper Title | Code |
|------|-----------|-------------|-------------|------|
| 2024 | IBM Research | arXiv | CELL your Model: Contrastive Explanations for Large Language Models | Not Official |
| 2025 | University of Hull | arXiv | Mapping the Mind of an Instruction-based Image Editing using SMILE | GitHub |
| 2025 | University of Hull | arXiv | Explainability of Large Language Models using SMILE | GitHub |
| 2025 | Ruhr University Bochum | arXiv | Can LLMs Explain Themselves Counterfactually? | N/A |
| 2025 | Imperial College & J.P. Morgan AI | arXiv | Interpreting Language Reward Models via Contrastive Explanations | N/A |
| 2025 | Peking University | arXiv | Towards Budget-Friendly Model-Agnostic Explanation Generation for Large Language Models | N/A |
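The common core of the perturbation-based methods above can be sketched in a few lines: delete (or mask) one input token at a time and measure how much the model's score changes. The sketch below uses a hypothetical `toy_score` function as a stand-in for a real LLM scoring call (e.g. the log-probability of a label); the function names are illustrative, not from any of the listed papers.

```python
# Minimal occlusion-style attribution: drop one token at a time and
# record how much the score falls. `score` is any callable mapping a
# token list to a number (a stand-in for a real LLM scoring function).
def occlusion_attribution(tokens, score):
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

# Toy "model": rewards sentences containing both "not" and "bad".
def toy_score(tokens):
    return float("not" in tokens) + float("bad" in tokens)

print(occlusion_attribution(["the", "movie", "was", "not", "bad"], toy_score))
# → [0.0, 0.0, 0.0, 1.0, 1.0]  ("not" and "bad" drive the score)
```

Real methods differ mainly in the perturbation (masking, sampling neighborhoods as in SMILE/LIME, contrastive edits as in CELL) and in how the resulting scores are aggregated.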
## LLM Explainability Evaluation
| Date | Institute | Publication | Paper Title | Code |
|------|-----------|-------------|-------------|------|
| 2023 | Tsinghua University | arXiv | Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation | GitHub |
| 2023 | UC Berkeley | NeurIPS 2023 | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | GitHub |
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| 2023 | MIT/Harvard | arXiv | Finding Neurons in a Haystack: Case Studies with Sparse Probing |
| 2023 | UoTexas/DeepMind | arXiv | Copy Suppression: Comprehensively Understanding an Attention Head |
| 2023 | UCL | arXiv | Towards Automated Circuit Discovery for Mechanistic Interpretability |
| 2023 | OpenAI | OpenAI Publication | Language models can explain neurons in language models |
| 2023 | MIT | NeurIPS 2023 | Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model |
| 2023 | Cambridge | arXiv | Successor Heads: Recurring, Interpretable Attention Heads In The Wild |
| 2023 | Meta | arXiv | Neurons in Large Language Models: Dead, N-gram, Positional |
| 2023 | Redwood/UC Berkeley | arXiv | Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small |
| 2023 | Microsoft | arXiv | Explaining black box text modules in natural language with language models |
| 2023 | ApartR/Oxford | ICLR 2023 | N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models |
| 2023 | --- | Blog | Interpreting GPT: the Logit Lens |
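The logit-lens idea from the last entry above is simple enough to sketch: take an intermediate residual-stream state and decode it through the unembedding matrix as if it were the final layer's output. The matrices and dimensions below are made up for illustration (a real implementation would also apply the model's final layer norm before unembedding).

```python
import numpy as np

# Logit-lens sketch: project an intermediate hidden state through a
# (toy) unembedding matrix and softmax, as if it were the last layer.
rng = np.random.default_rng(0)
d_model, vocab = 8, 5
W_U = rng.normal(size=(d_model, vocab))      # toy unembedding matrix

def logit_lens(hidden_state, W_U):
    logits = hidden_state @ W_U              # map to vocabulary space
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    return probs / probs.sum()

hidden = rng.normal(size=d_model)            # stand-in for a layer-k residual state
probs = logit_lens(hidden, W_U)
print("top token id:", int(probs.argmax()))  # which token layer k already "predicts"
```

Applying this at every layer shows how the model's next-token prediction sharpens with depth, which is the central observation of the logit-lens blog post.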
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias |
| YYYY-MM-DD | Institute | Journal | Discovering Latent Knowledge in Language Models Without Supervision |
| YYYY-MM-DD | Institute | Journal | Towards Monosemanticity: Decomposing Language Models With Dictionary Learning |
| YYYY-MM-DD | Institute | Journal | Spine: Sparse interpretable neural embeddings |
| YYYY-MM-DD | Institute | Journal | Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors |
| YYYY-MM-DD | Institute | Journal | Sparse Autoencoders Find Highly Interpretable Features in Language Models |
| YYYY-MM-DD | Institute | Journal | Attribution Patching: Activation Patching At Industrial Scale |
| YYYY-MM-DD | Institute | Journal | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] |
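Several entries above (attribution patching, causal scrubbing) build on activation patching: run a clean and a corrupted input, overwrite one internal activation in the corrupted run with its clean-run value, and measure how much of the clean behavior is restored. A minimal sketch on a toy two-layer network, with all weights invented for illustration:

```python
import numpy as np

# Activation-patching sketch: swap a hidden activation from a "clean"
# run into a "corrupted" run and check how much output is recovered.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))                 # toy layer-1 weights
W2 = rng.normal(size=(3, 2))                 # toy layer-2 weights

def forward(x, patched_hidden=None):
    h = np.tanh(x @ W1)                      # layer-1 activation (patch site)
    if patched_hidden is not None:
        h = patched_hidden                   # intervene: overwrite activation
    return h @ W2

x_clean = np.array([1.0, 0.0, 0.0, 0.0])
x_corrupt = np.array([0.0, 1.0, 0.0, 0.0])

h_clean = np.tanh(x_clean @ W1)              # cache the clean activation
out_clean = forward(x_clean)
out_patched = forward(x_corrupt, patched_hidden=h_clean)

# Here patching fully restores the clean output, since everything
# downstream of the patch site is deterministic.
print(np.allclose(out_patched, out_clean))   # True
```

In a transformer the patch site is a specific head or MLP activation at a specific position, and the degree of recovery localizes which component carries the behavior.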
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| 2023 | EleutherAI | arXiv | Linear Representations of Sentiment in Large Language Models |
| 2023 | Michigan/Harvard | arXiv | Emergent Linear Representations in World Models of Self-Supervised Sequence Models |
| 2023 | MIT/Stanford/Oxford | arXiv | Measuring Feature Sparsity in Language Models |
| 2023 | Flatiron | arXiv | Polysemanticity and capacity in neural networks |
| 2019 | Google/Cambridge | NeurIPS | Visualizing and measuring the geometry of BERT |
| 2024 | NEU/MIT | arXiv | The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets |
| 2021 | Google | ICML 2021 | Attention is not all you need: pure attention loses rank doubly exponentially with depth |
| 2019 | NCKU | arXiv | Probing neural network comprehension of natural language arguments |
| 2024 | Tsinghua University | arXiv | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States |
| 2024 | IIITDM | arXiv | HULLMI: Human vs LLM identification with explainability |
| 2024 | HUST | arXiv | Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM |
| 2024 | UMass Amherst | arXiv | Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates |
| 2024 | Tsinghua University | arXiv | CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation |
| 2024 | UGA | arXiv | Explainable AI Reloaded: Challenging the XAI Status Quo in the Era of Large Language Models |
| 2024 | Stanford/California | arXiv | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference |
## Bias and Robustness Studies
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Large Language Models Are Not Robust Multiple Choice Selectors |
| YYYY-MM-DD | Institute | Journal | The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models |
| YYYY-MM-DD | Institute | Journal | ChainPoll: A High Efficacy Method for LLM Hallucination Detection |
| 2023 | PrincetonU | Online Presentation | Evaluating LLMs is a minefield |
## Interpretability Frameworks
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Let's Verify Step by Step |
| YYYY-MM-DD | Institute | Journal | Interpretability Illusions in the Generalization of Simplified Models |
| YYYY-MM-DD | Institute | Journal | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling |
| 2024 | Polytechnique Montreal | arXiv | Can Large Language Models Explain Themselves? |
| YYYY-MM-DD | Institute | Journal | A Mechanistic Interpretability Analysis of Grokking |
| YYYY-MM-DD | Institute | Journal | 200 Concrete Open Problems in Mechanistic Interpretability |
| YYYY-MM-DD | Institute | Journal | Interpretability at Scale: Identifying Causal Mechanisms in Alpaca |
| YYYY-MM-DD | Institute | Journal | Representation Engineering: A Top-Down Approach to AI Transparency |
| 2023 | UC Berkeley | Nature Communications | Augmenting Interpretable Models with LLMs during Training |
## Application-Specific Studies
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Emergent world representations: Exploring a sequence model trained on a synthetic task |
| YYYY-MM-DD | Institute | Journal | How does GPT-2 compute greater than?: Interpreting mathematical abilities in a pre-trained language model |
| YYYY-MM-DD | Institute | Journal | Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition |
| YYYY-MM-DD | Institute | Journal | An Overview of Early Vision in InceptionV1 |
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations |
| YYYY-MM-DD | Institute | Journal | The Quantization Model of Neural Scaling |
| YYYY-MM-DD | Institute | Journal | Toy Models of Superposition |
| YYYY-MM-DD | Institute | Journal | Engineering monosemanticity in toy models |
| YYYY-MM-DD | Institute | Journal | A New Approach to Computation Reimagines Artificial Intelligence |
## Related GitHub Repositories
## Contribution and Collaboration
Please feel free to check out CONTRIBUTING and CODE-OF-CONDUCT to collaborate with us.
## Future Research Directions
One future direction is Fairness-Explainability Evaluation for LLMs.