CITATION.cff
cff-version: 1.2.0
title: "Architecture Predicts Linear Readability of Decision Quality in Transformers"
message: "If you use this work, please cite it using the metadata from this file."
type: software
authors:
- family-names: Carmichael
  given-names: Thomas
version: "2.4.0"
date-released: "2026-04-15"
license: MIT
doi: "10.5281/zenodo.19435674"
repository-code: "https://github.com/tmcarmichael/nn-observability"
abstract: "Half to two-thirds of the signal in standard activation probes is output confidence in disguise, and what remains varies by architecture family, not model scale. A linear probe on frozen activations, evaluated by partial Spearman correlation after controlling for max softmax probability and activation norm, recovers a stable signal across five Qwen 2.5 scales (0.5B-14B) and six architecture families (GPT-2, Qwen, Gemma, Llama, Mistral, Phi). A nonlinear MLP is statistically equivalent to the linear probe on eight models tested. Llama 3.2 3B produces a 2.9x gap with Qwen at matched scale (permutation test p = 0.006). The Llama 1B model (+0.286) matches high-observability families while 3B and 8B show weak signal. The observer catches 8-11% of errors confidence misses at 10% flag rate, converging to 11-15% at 20% across all tested families. The probe transfers zero-shot to retrieval-augmented QA and medical licensing questions at the same ceiling."
keywords:
- neural network observability
- transformer interpretability
- decision quality
- learned probes
- confidence controls
- architecture-dependent readability
- partial correlation
- cross-family evaluation
- retrieval-augmented generation