# Awesome-LLM-Explainability
A curated list of explainability-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the explainability implications, challenges, and advancements surrounding these powerful models.
🚧 This repository is under construction (with daily updates) 🚧
We've curated a collection of the latest 📈, most comprehensive 📚, and most valuable 💡 resources on large language model explainability (LLM Explainability). Beyond papers, we also include relevant talks, tutorials, conferences, news, and articles. The repository is updated continuously to keep the most current information at your fingertips.
| Date | Institute | Publication | Paper Title | GitHub |
|------|-----------|-------------|-------------|--------|
| 2024 | New Jersey Institute of Technology | ACM TIST | Explainability for Large Language Models: A Survey | GitHub |
| 2024 | Imperial College | arXiv | From Understanding to Utilization: A Survey on Explainability for Large Language Models | N/A |
| 2024 | Hong Kong University of Science and Technology | arXiv | Explainable Artificial Intelligence for Scientific Discovery | N/A |
| 2024 | UMaT | arXiv | Explainable Artificial Intelligence (XAI): from Inherent Explainability to Large Language Models | N/A |
| 2024 | Nanyang Technological University | arXiv | XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models | N/A |
| 2024 | University of Maryland | arXiv | Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey | N/A |
## Perturbation-based LLM Explainability
| Date | Institute | Publication | Paper Title | Code |
|------|-----------|-------------|-------------|------|
| 2024 | IBM Research | arXiv | CELL your Model: Contrastive Explanations for Large Language Models | Not Official |
| 2025 | University of Hull | arXiv | Mapping the Mind of an Instruction-based Image Editing using SMILE | GitHub |
| 2025 | University of Hull | arXiv | Explainability of Large Language Models using SMILE | GitHub |
| 2025 | Ruhr University Bochum | arXiv | Can LLMs Explain Themselves Counterfactually? | N/A |
| 2025 | Imperial College & J.P. Morgan AI | arXiv | Interpreting Language Reward Models via Contrastive Explanations | N/A |
| 2025 | Peking University | arXiv | Towards Budget-Friendly Model-Agnostic Explanation Generation for Large Language Models | N/A |
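The common core of the perturbation-based methods above can be sketched in a few lines: delete (or mask) one input token at a time and measure how much the model's score changes. The sketch below uses a hypothetical `toy_score` function as a stand-in for a real LLM scoring call (e.g. the log-probability of a label); the function names are illustrative, not from any of the listed papers.

```python
# Minimal occlusion-style attribution: drop one token at a time and
# record how much the score falls. `score` is any callable mapping a
# token list to a number (a stand-in for a real LLM scoring function).
def occlusion_attribution(tokens, score):
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

# Toy "model": rewards sentences containing both "not" and "bad".
def toy_score(tokens):
    return float("not" in tokens) + float("bad" in tokens)

print(occlusion_attribution(["the", "movie", "was", "not", "bad"], toy_score))
# → [0.0, 0.0, 0.0, 1.0, 1.0]  ("not" and "bad" drive the score)
```

Real methods differ mainly in the perturbation (masking, sampling neighborhoods as in SMILE/LIME, contrastive edits as in CELL) and in how the resulting scores are aggregated.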
## LLM Explainability Evaluation
| Date | Institute | Publication | Paper Title | Code |
|------|-----------|-------------|-------------|------|
| 2023 | Tsinghua University | arXiv | Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation | GitHub |
| 2023 | UC Berkeley | NeurIPS 2023 | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | GitHub |
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| 2023 | MIT/Harvard | arXiv | Finding Neurons in a Haystack: Case Studies with Sparse Probing |
| 2023 | UoTexas/DeepMind | arXiv | Copy Suppression: Comprehensively Understanding an Attention Head |
| 2023 | UCL | arXiv | Towards Automated Circuit Discovery for Mechanistic Interpretability |
| 2023 | OpenAI | OpenAI Publication | Language models can explain neurons in language models |
| 2023 | MIT | NeurIPS 2023 | Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model |
| 2023 | Cambridge | arXiv | Successor Heads: Recurring, Interpretable Attention Heads In The Wild |
| 2023 | Meta | arXiv | Neurons in Large Language Models: Dead, N-gram, Positional |
| 2023 | Redwood/UC Berkeley | arXiv | Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small |
| 2023 | Microsoft | arXiv | Explaining black box text modules in natural language with language models |
| 2023 | ApartR/Oxford | ICLR 2023 | N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models |
| 2023 | --- | Blog | Interpreting GPT: the Logit Lens |
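The logit-lens idea from the last entry above is simple enough to sketch: take an intermediate residual-stream state and decode it through the unembedding matrix as if it were the final layer's output. The matrices and dimensions below are made up for illustration (a real implementation would also apply the model's final layer norm before unembedding).

```python
import numpy as np

# Logit-lens sketch: project an intermediate hidden state through a
# (toy) unembedding matrix and softmax, as if it were the last layer.
rng = np.random.default_rng(0)
d_model, vocab = 8, 5
W_U = rng.normal(size=(d_model, vocab))      # toy unembedding matrix

def logit_lens(hidden_state, W_U):
    logits = hidden_state @ W_U              # map to vocabulary space
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    return probs / probs.sum()

hidden = rng.normal(size=d_model)            # stand-in for a layer-k residual state
probs = logit_lens(hidden, W_U)
print("top token id:", int(probs.argmax()))  # which token layer k already "predicts"
```

Applying this at every layer shows how the model's next-token prediction sharpens with depth, which is the central observation of the logit-lens blog post.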
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias |
| YYYY-MM-DD | Institute | Journal | Discovering Latent Knowledge in Language Models Without Supervision |
| YYYY-MM-DD | Institute | Journal | Towards Monosemanticity: Decomposing Language Models With Dictionary Learning |
| YYYY-MM-DD | Institute | Journal | Spine: Sparse interpretable neural embeddings |
| YYYY-MM-DD | Institute | Journal | Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors |
| YYYY-MM-DD | Institute | Journal | Sparse Autoencoders Find Highly Interpretable Features in Language Models |
| YYYY-MM-DD | Institute | Journal | Attribution Patching: Activation Patching At Industrial Scale |
| YYYY-MM-DD | Institute | Journal | Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] |
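Several entries above (attribution patching, causal scrubbing) build on activation patching: run a clean and a corrupted input, overwrite one internal activation in the corrupted run with its clean-run value, and measure how much of the clean behavior is restored. A minimal sketch on a toy two-layer network, with all weights invented for illustration:

```python
import numpy as np

# Activation-patching sketch: swap a hidden activation from a "clean"
# run into a "corrupted" run and check how much output is recovered.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))                 # toy layer-1 weights
W2 = rng.normal(size=(3, 2))                 # toy layer-2 weights

def forward(x, patched_hidden=None):
    h = np.tanh(x @ W1)                      # layer-1 activation (patch site)
    if patched_hidden is not None:
        h = patched_hidden                   # intervene: overwrite activation
    return h @ W2

x_clean = np.array([1.0, 0.0, 0.0, 0.0])
x_corrupt = np.array([0.0, 1.0, 0.0, 0.0])

h_clean = np.tanh(x_clean @ W1)              # cache the clean activation
out_clean = forward(x_clean)
out_patched = forward(x_corrupt, patched_hidden=h_clean)

# Here patching fully restores the clean output, since everything
# downstream of the patch site is deterministic.
print(np.allclose(out_patched, out_clean))   # True
```

In a transformer the patch site is a specific head or MLP activation at a specific position, and the degree of recovery localizes which component carries the behavior.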
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| 2023 | EleutherAI | arXiv | Linear Representations of Sentiment in Large Language Models |
| 2023 | Michigan/Harvard | arXiv | Emergent Linear Representations in World Models of Self-Supervised Sequence Models |
| 2023 | MIT/Stanford/Oxford | arXiv | Measuring Feature Sparsity in Language Models |
| 2023 | Flatiron | arXiv | Polysemanticity and capacity in neural networks |
| 2019 | Google/Cambridge | NeurIPS | Visualizing and measuring the geometry of BERT |
| 2024 | NEU/MIT | arXiv | The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets |
| 2021 | Google | ICML 2021 | Attention is not all you need: pure attention loses rank doubly exponentially with depth |
| 2019 | NCKU | arXiv | Probing neural network comprehension of natural language arguments |
| 2024 | Tsinghua University | arXiv | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States |
| 2024 | IIITDM | arXiv | HULLMI: Human vs LLM identification with explainability |
| 2024 | HUST | arXiv | Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM |
| 2024 | UMass Amherst | arXiv | Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates |
| 2024 | Tsinghua University | arXiv | CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation |
| 2024 | UGA | arXiv | Explainable AI Reloaded: Challenging the XAI Status Quo in the Era of Large Language Models |
| 2024 | Stanford/California | arXiv | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference |
## Bias and Robustness Studies
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Large Language Models Are Not Robust Multiple Choice Selectors |
| YYYY-MM-DD | Institute | Journal | The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models |
| YYYY-MM-DD | Institute | Journal | ChainPoll: A High Efficacy Method for LLM Hallucination Detection |
| 2023 | PrincetonU | Online Presentation | Evaluating LLMs is a minefield |
## Interpretability Frameworks
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Let's Verify Step by Step |
| YYYY-MM-DD | Institute | Journal | Interpretability Illusions in the Generalization of Simplified Models |
| YYYY-MM-DD | Institute | Journal | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling |
| 2024 | Polytechnique Montreal | arXiv | Can Large Language Models Explain Themselves? |
| YYYY-MM-DD | Institute | Journal | A Mechanistic Interpretability Analysis of Grokking |
| YYYY-MM-DD | Institute | Journal | 200 Concrete Open Problems in Mechanistic Interpretability |
| YYYY-MM-DD | Institute | Journal | Interpretability at Scale: Identifying Causal Mechanisms in Alpaca |
| YYYY-MM-DD | Institute | Journal | Representation Engineering: A Top-Down Approach to AI Transparency |
| 2023 | UC Berkeley | Nature Communications | Augmenting Interpretable Models with LLMs during Training |
## Application-Specific Studies
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | Emergent world representations: Exploring a sequence model trained on a synthetic task |
| YYYY-MM-DD | Institute | Journal | How does GPT-2 compute greater than?: Interpreting mathematical abilities in a pre-trained language model |
| YYYY-MM-DD | Institute | Journal | Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition |
| YYYY-MM-DD | Institute | Journal | An Overview of Early Vision in InceptionV1 |
| Date | Institute | Publication | Paper Title |
|------|-----------|-------------|-------------|
| YYYY-MM-DD | Institute | Journal | A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations |
| YYYY-MM-DD | Institute | Journal | The Quantization Model of Neural Scaling |
| YYYY-MM-DD | Institute | Journal | Toy Models of Superposition |
| YYYY-MM-DD | Institute | Journal | Engineering monosemanticity in toy models |
| YYYY-MM-DD | Institute | Journal | A New Approach to Computation Reimagines Artificial Intelligence |
## Related GitHub Repositories
## Contribution and Collaboration
Please feel free to check out CONTRIBUTING and CODE-OF-CONDUCT to collaborate with us.
## Future Research Directions
One future direction is Fairness-Explainability Evaluation for LLMs.