CITATION.cff
cff-version: 1.2.0
title: "Architecture Predicts Linear Readability of Decision Quality in Transformers"
message: "If you use this work, please cite it using the metadata from this file."
type: software
authors:
- family-names: Carmichael
  given-names: Thomas
version: "2.4.0"
date-released: "2026-04-15"
license: MIT
doi: "10.5281/zenodo.19435674"
repository-code: "https://github.com/tmcarmichael/nn-observability"
abstract: "Half to two-thirds of the signal in standard activation probes is output confidence in disguise, and what remains varies by architecture family, not model scale. A linear probe on frozen activations, evaluated by partial Spearman correlation after controlling for max softmax probability and activation norm, recovers a stable signal across five Qwen 2.5 scales (0.5B-14B) and six architecture families (GPT-2, Qwen, Gemma, Llama, Mistral, Phi). A nonlinear MLP is statistically equivalent to the linear probe on eight models tested. Llama 3.2 3B produces a 2.9x gap with Qwen at matched scale (permutation test p = 0.006). The Llama 1B model (+0.286) matches high-observability families while 3B and 8B show weak signal. The observer catches 8-11% of errors confidence misses at 10% flag rate, converging to 11-15% at 20% across all tested families. The probe transfers zero-shot to retrieval-augmented QA and medical licensing questions at the same ceiling."
keywords:
- neural network observability
- transformer interpretability
- decision quality
- learned probes
- confidence controls
- architecture-dependent readability
- partial correlation
- cross-family evaluation
- retrieval-augmented generation