Safety Research
Popular repositories
- persona_vectors Public
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- safety-tooling Public
Inference API for many LLMs and other useful tools for empirical research
Repositories
Showing 10 of 37 repositories
- assistant-axis Public
The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations—sometimes toward bizarre or harmful personas. This repo contains a pipeline for generating the Assistant Axis and notebooks for monitoring and steering with it.
- PurpleLlama Public (forked from meta-llama/PurpleLlama)
Set of tools to assess and improve LLM security.
- circuit-tracer Public
- how-ai-impacts-skill-formation Public
Repository for measuring whether using AI tools inhibits skill formation and development.