Systematic probing toolkit for alignment-relevant LLM behaviors: sycophancy, sandbagging, power-seeking, deceptive alignment, and corrigibility failures
Updated Mar 3, 2026 - Python
Toolkit for detecting scheming behaviour in multi-agent settings
Read-only mirror of MSTP-1.0, a non-canonical protocol for detecting ethical mimicry in AI systems, with companion guide and verified hashes.
Can AI systems detect when they're being evaluated? Research paper and reference implementation exploring the Hawthorne Effect for AI.