Infrastructure for capturing LLM activations and SAE (Sparse Autoencoder) features, training probes for prompt-maliciousness detection, and evaluating out-of-distribution generalization with Leave-One-Dataset-Out (LODO) cross-validation.
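The LODO protocol above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes a feature matrix `X` (e.g. captured activations), binary maliciousness labels `y`, and per-example dataset ids `groups`, and stands in a plain logistic-regression probe for whatever probe the repo actually trains. Each dataset is held out in turn, the probe is fit on the remaining datasets, and held-out accuracy measures out-of-distribution generalization.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, steps=500):
    # Plain logistic-regression probe fit by gradient descent
    # (a stand-in for the repo's probe; hypothetical hyperparameters).
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)  # clip logits to avoid exp overflow
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def lodo_scores(X, y, groups):
    # Leave-One-Dataset-Out: for each dataset id, train the probe on all
    # other datasets and report accuracy on the held-out one.
    scores = {}
    for g in np.unique(groups):
        held = groups == g
        w, b = train_logreg(X[~held], y[~held])
        pred = (X[held] @ w + b) > 0
        scores[g] = float(np.mean(pred == y[held]))
    return scores
```

A probe that only memorizes dataset-specific artifacts will score near chance on the held-out dataset, which is exactly what LODO is designed to expose.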
A stable linear direction in GPT-2's residual stream predicts per-token decision quality beyond output confidence, hand-designed statistics, and SAE features. Three initializations converge to the same projection.
Reverse-engineering hidden backdoor triggers in three 671B DeepSeek-V3 language models. Activation-space probing, SVD weight analysis, and absorbed MLA SVD for the Jane Street Dormant LLM Puzzle.
Evaluation framework for methods that probe and steer LLM activations to mitigate Chain-of-Thought unfaithfulness. Research project by Giovanni M. Occhipinti (University of Bologna), Alessandro Abate and Nandi Schoots (University of Oxford).