# Interpretability Techniques for Speech Models

Marianne de Heer Kloots, Martijn Bentum, Tom Lentz, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Grzegorz Chrupała

This reading list accompanies our tutorial and upcoming survey. It provides an overview of recent publications reporting interpretability analyses of speech processing models, though it is not a complete list. Is your favourite paper missing, or have you noticed an error? Please submit a pull request to suggest additions or changes!

## Behavioural analyses

Behavioural analyses interpret model functioning by identifying systematic patterns in generated output and mapping them onto interpretable dimensions of the model input (e.g., naturalness, grammaticality, noise).
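As a concrete illustration of this approach, here is a minimal sketch that transcribes the same utterance at several signal-to-noise ratios and inspects how the output degrades. It assumes the Hugging Face `transformers` pipeline API and `torch`; the model name, SNR levels, and synthetic input are illustrative placeholders, not taken from any of the papers listed here.

```python
# Minimal behavioural-analysis sketch: compare ASR output on clean vs.
# noise-degraded versions of the same utterance (all choices illustrative).
import torch
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

def add_noise(waveform: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix Gaussian noise into the waveform at a given SNR (in dB)."""
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = torch.randn_like(waveform) * noise_power.sqrt()
    return waveform + noise

# Placeholder input: in practice, load a real utterance sampled at 16 kHz.
waveform = torch.randn(16000)

for snr_db in [20, 10, 0]:
    degraded = add_noise(waveform, snr_db)
    result = asr({"raw": degraded.numpy(), "sampling_rate": 16000})
    print(f"SNR {snr_db:>2} dB -> {result['text']!r}")
```

A real behavioural study would then map the transcription errors to interpretable input dimensions, e.g. comparing error rates across noise types or speaker accents.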

Adolfi, F., Bowers, J. S., & Poeppel, D. (2023). Successes and critical failures of neural networks in capturing human-like speech recognition. Neural Networks, 162, 199–211. https://doi.org/10.1016/j.neunet.2023.02.032

De Seyssel, M., Lavechin, M., Titeux, H., Thomas, A., Virlet, G., Revilla, A. S., Wisniewski, G., Ludusan, B., & Dupoux, E. (2023). ProsAudit, a prosodic benchmark for self-supervised speech models. In Proc. Interspeech 2023. https://doi.org/10.21437/Interspeech.2023-438

Gessinger, I., Amirzadeh Shams, E., & Carson-Berndsen, J. (2025). Under the Hood: Phonemic Restoration in Transformer-Based Automatic Speech Recognition. Rochester, NY: Social Science Research Network. https://doi.org/10.2139/ssrn.5212722

Kim, S.-E., Chernyak, B. R., Seleznova, O., Keshet, J., Goldrick, M., & Bradlow, A. R. (2024). Automatic recognition of second language speech-in-noise. JASA Express Letters, 4(2), 025204. https://doi.org/10.1121/10.0024877

Lavechin, M., Sy, Y., Titeux, H., Blandón, M. A. C., Räsänen, O., Bredin, H., Dupoux, E., & Cristia, A. (2023). BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models. In Proc. Interspeech 2023. https://doi.org/10.21437/Interspeech.2023-978

Orhan, P., Boubenec, Y., & King, J.-R. (2024). Algebraic structures emerge from the self-supervised learning of natural sounds. bioRxiv. https://doi.org/10.1101/2024.03.13.584776

Patman, C., & Chodroff, E. (2024). Speech recognition in adverse conditions by humans and machines. JASA Express Letters. https://doi.org/10.1121/10.0032473

Pouw, C., Alishahi, A., & Zuidema, W. (2025). A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity. In Proceedings of the 29th Conference on Computational Natural Language Learning, pages 126–140, Vienna, Austria. https://aclanthology.org/2025.conll-1.9/

Shim, H.E., Yung, O., Tuttösí, P., Kwan, B., Lim, A., Wang, Y., Yeung, H.H. (2025) Generating Consistent Prosodic Patterns from Open-Source TTS Systems. Proc. Interspeech 2025, 5383-5387. https://doi.org/10.21437/Interspeech.2025-2159

Tånnander, C., House, D., & Edlund, J. (2022). Syllable duration as a proxy to latent prosodic features. In Proc. Speech Prosody 2022 (pp. 220–224). https://doi.org/10.21437/SpeechProsody.2022-45

Weerts, L., Rosen, S., Clopath, C., & Goodman, D. F. M. (2022). The Psychometrics of Automatic Speech Recognition. bioRxiv. https://doi.org/10.1101/2021.04.19.440438

Zhu, J. (2020). Probing the phonetic and phonological knowledge of tones in Mandarin TTS models. In Proc. Speech Prosody 2020. https://doi.org/10.21437/SpeechProsody.2020-190

## Representational analyses

Representational analyses aim to characterize how model-internal representations are structured, including their organization across layers and alignment with interpretable features.
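By way of illustration, the sketch below implements one of the most common representational techniques in the papers that follow: layer-wise probing, where a linear classifier is fit on each layer's frame representations to see where a target property is most linearly decodable. The model name, the random placeholder audio, and the random frame labels are assumptions made for the sake of a runnable example; a real analysis would use force-aligned phone labels and held-out evaluation data.

```python
# Layer-wise probing sketch: fit a linear probe per layer and report where a
# (here, hypothetical) frame-level property is most decodable.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)  # placeholder utterance, 1 s at 16 kHz
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# hidden_states: one (1, n_frames, dim) tensor per layer,
# including the pre-transformer feature projection at index 0.
n_frames = out.hidden_states[0].shape[1]
labels = np.random.randint(0, 5, size=n_frames)  # hypothetical frame labels

for layer, h in enumerate(out.hidden_states):
    X = h.squeeze(0).numpy()
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer:>2}: train accuracy {probe.score(X, labels):.2f}")
```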

Abdullah, B. M., Shaik, M. M., Möbius, B., & Klakow, D. (2023). An information-theoretic analysis of self-supervised discrete representations of speech. In Interspeech 2023 (pp. 2883–2887). https://doi.org/10.21437/Interspeech.2023-2131

Algayres, R., Adi, Y., Nguyen, T., Copet, J., Synnaeve, G., Sagot, B., & Dupoux, E. (2023). Generative Spoken Language Model based on continuous word-sized audio tokens. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 3008–3028). Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.182

Algayres, R., Ricoul, T., Karadayi, J., Laurençon, H., Zaiem, S., Mohamed, A., Sagot, B., & Dupoux, E. (2022). DP-parse: Finding word boundaries from raw speech with an instance lexicon. Transactions of the Association for Computational Linguistics, 10, 1051–1065. https://doi.org/10.1162/tacl_a_00505

Ashihara, T., Moriya, T., Matsuura, K., Tanaka, T., Ijima, Y., Asami, T., Delcroix, M., Honma, Y. (2023). SpeechGLUE: How well can self-supervised speech models capture linguistic knowledge? In Proc. Interspeech 2023 (pp. 2888–2892). https://doi.org/10.21437/Interspeech.2023-1823

Ashihara, T., Delcroix, M., Ochiai, T., Matsuura, K., Horiguchi, S. (2025) Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains. Proc. Interspeech 2025, 226-230, https://doi.org/10.21437/Interspeech.2025-945

Bentum, M., ten Bosch, L., & Lentz, T. (2024). The Processing of Stress in End-to-End Automatic Speech Recognition Models. In Proc. Interspeech 2024 (pp. 2350–2354). https://doi.org/10.21437/Interspeech.2024-44

Bentum, M., ten Bosch, L., Lentz, T.O. (2025) Word stress in self-supervised speech models: A cross-linguistic comparison. Proc. Interspeech 2025, 251-255. https://doi.org/10.21437/Interspeech.2025-106

Choi, K., Pasad, A., Nakamura, T., Fukayama, S., Livescu, K., & Watanabe, S. (2024). Self-Supervised Speech Representations are More Phonetic than Semantic. In Proc. Interspeech 2024 (pp. 4578–4582). https://doi.org/10.21437/Interspeech.2024-1157

Chrupała, G., Higy, B., & Alishahi, A. (2020). Analyzing analytical methods: The case of phonology in neural models of spoken language. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4146–4156). https://doi.org/10.18653/v1/2020.acl-main.381

Cormac English, P., Kelleher, J. D., & Carson-Berndsen, J. (2022). Domain-Informed Probing of wav2vec 2.0 Embeddings for Phonetic Features. In G. Nicolai & E. Chodroff (Eds.), Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology (pp. 83–91). https://doi.org/10.18653/v1/2022.sigmorphon-1.9

de Heer Kloots, M., & Zuidema, W. (2024). Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0. In Proc. Interspeech 2024 (pp. 4593–4597). https://doi.org/10.21437/Interspeech.2024-2490

de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025) What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025, 256-260. https://doi.org/10.21437/Interspeech.2025-1526

tom Dieck, T., Pérez-Toro, P. A., Arias, T., Nöth, E., & Klumpp, P. (2022). Wav2vec behind the Scenes: How end2end Models learn Phonetics. In Proc. Interspeech 2022 (pp. 5130–5134). https://doi.org/10.21437/Interspeech.2022-10865

Dunbar, E., Hamilakis, N., & Dupoux, E. (2022). Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1211–1226.

Ersoy, A., Mousi, B.A., Chowdhury, S.A., Alam, F., Dalvi, F.I., Durrani, N. (2025) From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models. Proc. Interspeech 2025, 241-245. https://doi.org/10.21437/Interspeech.2025-2180

Fucci, D., Gaido, M., Negri, M., Bentivogli, L., Martins, A., & Attanasio, G. (2025). Different Speech Translation Models Encode and Translate Speaker Gender Differently. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1005–1019, Vienna, Austria. https://aclanthology.org/2025.acl-short.78/

Higy, B., Gelderloos, L., Alishahi, A., & Chrupała, G. (2021). Discrete representations in neural models of spoken language. In J. Bastings et al. (Eds.), Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (pp. 163–176). https://doi.org/10.18653/v1/2021.blackboxnlp-1.11

Huo, R., Dunbar, E. (2025) Iterative Refinement, Not Training Objective, Makes HuBERT Behave Differently from wav2vec 2.0. Proc. Interspeech 2025, 261-265. https://doi.org/10.21437/Interspeech.2025-514

Langedijk, A., Mohebbi, H., Sarti, G., Zuidema, W., & Jumelet, J. (2024). DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers. In K. Duh, H. Gomez, & S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024 (pp. 4764–4780). Mexico City, Mexico: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-naacl.296

Liu, O.D., Tang, H., Goldwater, S. (2023) Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces. Proc. Interspeech 2023, 2968-2972. https://doi.org/10.21437/Interspeech.2023-871

Liu, O. D., Tang, H., Feldman, N. H., & Goldwater, S. (2024). A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 46). https://escholarship.org/uc/item/34b4v268

Ma, D., Ryant, N., & Liberman, M. (2021). Probing acoustic representations for phonetic properties. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 311–315). https://doi.org/10.1109/ICASSP39728.2021.9414776

Millet, J., & Dunbar, E. (2022). Do self-supervised speech models develop human-like perception biases? In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7591–7605). Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.523

Mohamed, M., Liu, O.D., Tang, H., Goldwater, S. (2024) Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations. Proc. Interspeech 2024, 3625-3629. https://doi.org/10.21437/Interspeech.2024-1054

Pasad, A., Chou, J.-C., & Livescu, K. (2021). Layer-wise analysis of a self-supervised speech representation model. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 914–921). https://doi.org/10.1109/ASRU51503.2021.9688093

Schatz, T. (2016). ABX-Discriminability Measures and Applications (Doctoral dissertation, Université Paris 6 (UPMC)). https://hal.science/tel-01407461

Shen, G., Alishahi, A., Bisazza, A., & Chrupała, G. (2023). Wave to Syntax: Probing spoken language models for syntax. In Interspeech 2023 (pp. 1259–1263). https://doi.org/10.21437/Interspeech.2023-679

Sicherman, A., & Adi, Y. (2023). Analysing discrete self supervised speech representation for spoken language modeling. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5). https://doi.org/10.1109/ICASSP49357.2023.10097097

ten Bosch, L., Bentum, M., & Boves, L. (2023). Phonemic competition in end-to-end ASR models. In Proc. Interspeech 2023 (pp. 586–590). https://doi.org/10.21437/Interspeech.2023-1846

Wells, D., Tang, H., & Richmond, K. (2022). Phonetic analysis of self-supervised representations of English speech. In Proc. Interspeech (pp. 3583–3587). https://doi.org/10.21437/Interspeech.2022-10884

## Importance scoring

These methods aim to quantify the relative importance of input segments for model representations (context mixing) or for model predictions (feature attribution).
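Here is a minimal sketch of perturbation-based feature attribution, one family of methods represented below: silence successive input windows and score each segment by how much the model's output logits move. The model name, window size, and synthetic waveform are illustrative assumptions; published studies use more principled perturbations and evaluation metrics.

```python
# Occlusion-based attribution sketch: score each 100 ms input segment by the
# change it induces in the ASR model's logits when silenced (all illustrative).
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform = torch.randn(1, 16000)  # placeholder utterance, 1 s at 16 kHz
with torch.no_grad():
    baseline = model(waveform).logits

window = 1600  # 100 ms at 16 kHz (hypothetical choice)
scores = []
for start in range(0, waveform.shape[1], window):
    occluded = waveform.clone()
    occluded[:, start:start + window] = 0.0  # silence this segment
    with torch.no_grad():
        logits = model(occluded).logits
    # Importance = how much the output moves when the segment is removed.
    scores.append((baseline - logits).pow(2).mean().item())

for i, s in enumerate(scores):
    print(f"segment {i} ({i * 100}-{(i + 1) * 100} ms): {s:.4f}")
```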

Fucci, D., Gaido, M., Savoldi, B., Negri, M., Cettolo, M., & Bentivogli, L. (2024). SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation. https://doi.org/10.48550/arXiv.2411.01710

Fucci, D., Savoldi, B., Gaido, M., Negri, M., Cettolo, M., & Bentivogli, L. (2024). Explainability for Speech Models: On the Challenges of Acoustic Feature Selection. In F. Dell’Orletta, A. Lenci, S. Montemagni, & R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024) (pp. 373–381). https://aclanthology.org/2024.clicit-1.45/

Fucci, D., Gaido, M., Negri, M., Cettolo, M., Bentivogli, L. (2025) Echoes of Phonetics: Unveiling Relevant Acoustic Cues for ASR via Feature Attribution. Proc. Interspeech 2025, 206-210. https://doi.org/10.21437/Interspeech.2025-918

Gupta, S., Ravanelli, M., Germain, P., & Subakan, C. (2024). Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice. In Proc. Interspeech 2024 (pp. 3295–3299). https://doi.org/10.21437/Interspeech.2024-632

Mancini, E., Paissan, F., Torroni, P., Ravanelli, M., & Subakan, C. (2024). Investigating the Effectiveness of Explainability Methods in Parkinson’s Detection from Speech. https://arxiv.org/abs/2411.08013

Meng, Y., Goldwater, S., Tang, H. (2025) Effective Context in Neural Speech Models. Proc. Interspeech 2025, 246-250, https://doi.org/10.21437/Interspeech.2025-1932

Mohebbi, H., Chrupała, G., Zuidema, W., & Alishahi, A. (2023). Homophone Disambiguation Reveals Patterns of Context Mixing in Speech Transformers. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 8249–8260). Singapore: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.513

Muckenhirn, H., Abrol, V., Magimai-Doss, M., & Marcel, S. (2019). Understanding and visualizing raw waveform-based CNNs. In Interspeech 2019 (pp. 2345–2349). https://doi.org/10.21437/Interspeech.2019-2341

Parekh, J., Parekh, S., Mozharovskyi, P., d’Alché-Buc, F., & Richard, G. (2022). Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF. https://doi.org/10.48550/arXiv.2202.11479

Pastor, E., Koudounas, A., Attanasio, G., Hovy, D., & Baralis, E. (2023). Explaining speech classification models via word-level audio segments and paralinguistic features. https://doi.org/10.48550/arXiv.2309.07733

Peng, P., & Harwath, D. (2022). Word Discovery in Visually Grounded, Self-Supervised Speech Models. In Proc. Interspeech 2022 (pp. 2823–2827). https://doi.org/10.21437/Interspeech.2022-10652

Peng, P., Li, S.-W., Räsänen, O., Mohamed, A., & Harwath, D. (2023). Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model. In Interspeech 2023 (pp. 391–395). https://doi.org/10.21437/Interspeech.2023-2044

Prasad, A., & Jyothi, P. (2020). How Accents Confound: Probing for Accent Information in End-to-End Speech Recognition Systems. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 3739–3753). Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.345

Schnell, B., & Garner, P. N. (2021). Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction. In 11th ISCA Speech Synthesis Workshop (SSW 11) (pp. 60–65). ISCA. https://doi.org/10.21437/SSW.2021-11

Shams, E. A., Gessinger, I., & Carson-Berndsen, J. (2024). Uncovering Syllable Constituents in the Self-Attention-Based Speech Representations of Whisper. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (pp. 238-247). https://doi.org/10.18653/v1/2024.blackboxnlp-1.16

Shen, G., Mohebbi, H., Bisazza, A., Alishahi, A., Chrupala, G. (2025) On the reliability of feature attribution methods for speech classification. Proc. Interspeech 2025, 266-270. https://doi.org/10.21437/Interspeech.2025-1911

Shim, K., Choi, J., & Sung, W. (2022). Understanding the role of self attention for efficient speech recognition. In International Conference on Learning Representations. https://openreview.net/forum?id=AvcfxqRy4Y

Sivasankaran, S., Vincent, E., & Fohr, D. (2021). Explaining Deep Learning Models for Speech Enhancement. In Interspeech 2021 (pp. 696–700). https://doi.org/10.21437/Interspeech.2021-1764

Singhvi, D., Misra, D., Erkelens, A., Jain, R., Papadimitriou, I., & Saphra, N. (2025). Using Shapley interactions to understand how models use structure. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20727–20737, Vienna, Austria. https://aclanthology.org/2025.acl-long.1011/

Wu, X., Bell, P., & Rajan, A. (2023). Explanations for Automatic Speech Recognition. https://doi.org/10.48550/arXiv.2302.14062

Wu, X., Bell, P., & Rajan, A. (2024). Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 10296–10300). https://doi.org/10.1109/ICASSP48485.2024.10445989

Yang, S.-W., Liu, A. T., & Lee, H.-y. (2020). Understanding Self-Attention of Self-Supervised Audio Transformers. In Proc. Interspeech 2020. https://doi.org/10.21437/Interspeech.2020-2231

## Causal analyses

These methods aim to identify features and (networks of) components which causally affect model output behaviour.
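For intuition, the sketch below performs a crude causal intervention of the kind these papers refine: it zeroes one transformer layer's output via a forward hook and measures the effect on the model's predictions. The model, layer index, and synthetic input are illustrative assumptions; published work typically intervenes on finer-grained components (heads, neurons, subnetworks) and measures task behaviour rather than raw logit change.

```python
# Causal-ablation sketch: zero a transformer layer's output with a forward
# hook and measure the downstream effect on the logits (all illustrative).
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform = torch.randn(1, 16000)  # placeholder utterance, 1 s at 16 kHz
with torch.no_grad():
    baseline = model(waveform).logits

def ablate(module, inputs, output):
    # Encoder layers return a tuple; replace the hidden states with zeros.
    # Note: this also wipes the residual stream from this point onward, a
    # deliberately blunt intervention for demonstration purposes.
    return (torch.zeros_like(output[0]),) + output[1:]

layer = model.wav2vec2.encoder.layers[6]  # an arbitrary mid-network layer
handle = layer.register_forward_hook(ablate)
with torch.no_grad():
    ablated = model(waveform).logits
handle.remove()

effect = (baseline - ablated).pow(2).mean().item()
print(f"mean squared change in logits after ablating layer 6: {effect:.4f}")
```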

Futami, H., Arora, S., Kashiwagi, Y., Tsunoo, E., & Watanabe, S. (2024). Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model. In Proc. Interspeech 2024 (pp. 802–806). https://doi.org/10.21437/Interspeech.2024-712

Krishnan, A., Abdullah, B. M., & Klakow, D. (2024). On the Encoding of Gender in Transformer-based ASR Representations. In Proc. Interspeech 2024 (pp. 3090–3094). https://doi.org/10.21437/Interspeech.2024-2209

Lai, C.-I. J., Cooper, E., Zhang, Y., Chang, S., Qian, K., Liao, Y.-L., & Glass, J. (2022). On the Interplay between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP43922.2022.9747728

Lin, T.-Q., Lin, G.-T., Lee, H.-y., & Tang, H. (2024). Property Neurons in Self-Supervised Speech Transformers. In IEEE Spoken Language Technology Workshop (SLT). https://doi.org/10.48550/arXiv.2409.05910

Pouw, C., de Heer Kloots, M., Alishahi, A., & Zuidema, W. (2024). Perception of Phonological Assimilation by Neural Speech Recognition Models. Computational Linguistics. https://doi.org/10.1162/coli_a_00526
