tgautam23/Persona-Lab

persona-lab

A modular repo scaffold for building persona vectors and related tooling:

  • Trait artifacts (system prompts, eval questions, rubric)
  • Persona vector extraction (mean difference across layers)
  • Monitoring via projection
  • Data triage using projections
  • Evaluation with a lightweight rubric-based judge
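The extraction and monitoring bullets above can be illustrated with a small NumPy sketch. It assumes activations have already been collected as arrays of shape (n_examples, n_layers, hidden_dim) for trait-eliciting and trait-suppressing prompts. The function names `persona_vectors` and `projection` are hypothetical and are not taken from this repo's code.

```python
import numpy as np


def persona_vectors(pos_acts, neg_acts):
    """Per-layer mean difference between two sets of activations.

    pos_acts, neg_acts: array-likes of shape (n_examples, n_layers, hidden_dim).
    Returns an array of shape (n_layers, hidden_dim), one vector per layer.
    """
    pos = np.asarray(pos_acts, dtype=np.float64)
    neg = np.asarray(neg_acts, dtype=np.float64)
    # Average over examples first, then subtract, independently for each layer.
    return pos.mean(axis=0) - neg.mean(axis=0)


def projection(activation, vector):
    """Scalar projection of one activation onto the unit-normalized vector."""
    v = np.asarray(vector, dtype=np.float64)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("persona vector has zero norm")
    return float(np.dot(np.asarray(activation, dtype=np.float64), v) / norm)


# Synthetic demo data (assumption: real activations would come from a model).
rng = np.random.default_rng(0)
n_layers, hidden_dim = 3, 8
shift = np.zeros(hidden_dim)
shift[0] = 2.0  # the "trait" moves activations along the first coordinate
neg_demo = rng.normal(size=(16, n_layers, hidden_dim))
pos_demo = rng.normal(size=(16, n_layers, hidden_dim)) + shift

vectors_demo = persona_vectors(pos_demo, neg_demo)
layer = 1
score_pos = projection(pos_demo.mean(axis=0)[layer], vectors_demo[layer])
score_neg = projection(neg_demo.mean(axis=0)[layer], vectors_demo[layer])
print(vectors_demo.shape, score_pos > score_neg)
```

Monitoring in this sense reduces to a single scalar per layer: a higher projection means the activation lies further along the trait direction. Normalizing the vector keeps scores comparable across layers whose vectors differ in magnitude.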

Note: The inference-time steering and preventative SFT hooks are scaffolded only. By default they are simplified placeholders (see TODOs).

Quickstart

python -m venv .venv && source .venv/bin/activate
pip install -e .

persona build-eval --trait sycophancy --out data/processed/sycophancy_eval.jsonl

persona extract --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --trait sycophancy \
  --layers 0:12 --reduction prompt_avg \
  --eval data/processed/sycophancy_eval.jsonl \
  --out vectors

persona monitor --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --trait sycophancy \
  --text "The Earth is flat. Do you agree?"

persona triage --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --trait sycophancy \
  --in data/processed/sycophancy_eval.jsonl --topk 5 --out data/triage/top5.jsonl

persona eval --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --trait sycophancy \
  --eval data/processed/sycophancy_eval.jsonl --report results/eval_sycophancy
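As a rough illustration of what top-k triage by projection could look like, the sketch below ranks JSONL records by a precomputed projection score and keeps the highest k. The helper names (`read_jsonl`, `write_jsonl`, `triage_topk`) and the `projection` output field are assumptions for illustration; the actual `persona triage` command may select and store records differently.

```python
import heapq
import json


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def triage_topk(records, scores, k):
    """Return the k records with the highest scores, highest first.

    Each returned record is a copy with the score attached under the
    hypothetical key "projection". Ties keep their original order.
    """
    if len(records) != len(scores):
        raise ValueError("records and scores must have the same length")
    top = heapq.nlargest(k, range(len(records)), key=lambda i: scores[i])
    return [dict(records[i], projection=scores[i]) for i in top]
```

For example, `triage_topk(read_jsonl("in.jsonl"), scores, 5)` followed by `write_jsonl("top5.jsonl", ...)` would mirror the `--topk 5` call above, given a `scores` list with one projection per record.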

Notes

  • By default, extraction reads prompt-token activations from the model's hidden states (output_hidden_states=True).
  • Steering hooks are stubbed; you can add forward-hook injection later.
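Both notes can be sketched in PyTorch. In Hugging Face Transformers, calling a model with `output_hidden_states=True` returns a tuple of per-layer tensors of shape (batch, seq_len, hidden_dim), with index 0 holding the embedding output. The sketch below shows one plausible way to average such a tensor over prompt tokens, and one way forward-hook injection might be added later. `prompt_avg` and `add_steering_hook` are hypothetical helpers, not this repo's API, and a toy `nn.Linear` stands in for a transformer layer.

```python
import torch
from torch import nn


def prompt_avg(hidden, attention_mask):
    """Mean of hidden states over non-padding token positions.

    hidden: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    Returns a tensor of shape (batch, hidden_dim).
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    # clamp avoids division by zero for rows that are entirely padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


def add_steering_hook(module, vector, alpha=1.0):
    """Register a forward hook that adds alpha * vector to the module output.

    Returns the hook handle; call handle.remove() to stop steering.
    """
    def hook(_module, _inputs, output):
        # Some layers return a tuple whose first element is the hidden states.
        if isinstance(output, tuple):
            return (output[0] + alpha * vector,) + tuple(output[1:])
        return output + alpha * vector

    return module.register_forward_hook(hook)


# Toy demo: a single linear layer instead of a real decoder block.
torch.manual_seed(0)
toy_layer = nn.Linear(4, 4)
x = torch.randn(2, 3, 4)
steer = torch.tensor([1.0, 0.0, 0.0, 0.0])

baseline = toy_layer(x)
handle = add_steering_hook(toy_layer, steer, alpha=2.0)
steered = toy_layer(x)
handle.remove()

print(torch.allclose(steered, baseline + 2.0 * steer))
print(prompt_avg(baseline, torch.ones(2, 3, dtype=torch.long)).shape)
```

Returning a value from a forward hook replaces the module's output, which is why the hook can shift activations without modifying the layer itself. Removing the handle restores the original behavior.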
