A modular repo scaffold for building persona vectors and related tooling:
- Trait artifacts (system prompts, eval questions, rubric)
- Persona vector extraction (mean difference across layers)
- Monitoring via projection
- Data triage using projections
- Evaluation with a lightweight rubric-based judge
Note: Inference-time steering and preventative SFT hooks are scaffolded but simplified/placeholder by default (see TODOs).
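To make the list above concrete, here is a minimal, dependency-free sketch of mean-difference extraction, projection-based monitoring, and top-k triage for a single layer. It is an illustration under assumptions, not the package's actual code: the function names (`persona_vector`, `projection`, `top_k`) and the toy activations are hypothetical, and a real run would repeat the extraction once per layer using hidden states from the model.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(pos: Sequence[Sequence[float]],
                   neg: Sequence[Sequence[float]]) -> Vector:
    """Mean difference: mean(trait-positive) minus mean(trait-negative)."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def top_k(samples: Sequence[Tuple[str, Sequence[float]]],
          direction: Sequence[float], k: int) -> List[str]:
    """Return the ids of the k samples with the largest projection."""
    ranked = sorted(samples, key=lambda s: projection(s[1], direction), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]


# Toy activations for one layer (hypothetical values, hidden size 2).
pos_acts = [[2.0, 0.0], [4.0, 0.0]]   # responses that exhibit the trait
neg_acts = [[0.0, 0.0], [0.0, 0.0]]   # responses that do not

vector = persona_vector(pos_acts, neg_acts)
score = projection([1.0, 5.0], vector)
flagged = top_k([("a", [0.5, 1.0]), ("b", [3.0, 0.0]), ("c", [1.0, 9.0])], vector, k=2)
```

Normalizing the direction inside `projection` keeps scores comparable across layers whose vectors have different magnitudes; whether the scaffold normalizes in the same way is not stated here.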
python -m venv .venv && source .venv/bin/activate
pip install -e .
persona build-eval --trait sycophancy --out data/processed/sycophancy_eval.jsonl
persona extract --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --trait sycophancy \
--layers 0:12 --reduction prompt_avg \
--eval data/processed/sycophancy_eval.jsonl \
--out vectors
persona monitor --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --trait sycophancy \
--text "The Earth is flat. Do you agree?"
persona triage --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --trait sycophancy \
--in data/processed/sycophancy_eval.jsonl --topk 5 --out data/triage/top5.jsonl
persona eval --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --trait sycophancy \
--eval data/processed/sycophancy_eval.jsonl --report results/eval_sycophancy

- Default extraction uses prompt token activations (output_hidden_states=True).
- Steering hooks are stubbed; you can add forward-hook injection later.
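Since steering is left as a stub, the following is one possible sketch of forward-hook injection using PyTorch's `register_forward_hook`. It assumes PyTorch is installed and uses a toy `nn.Linear` layer in place of a transformer block; `add_steering_hook`, `alpha`, and the direction tensor are hypothetical names chosen for illustration, not part of this repo.

```python
import torch
from torch import nn


def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float):
    """Register a hook that adds alpha * direction to the layer's output.

    Returns the handle; call handle.remove() to turn steering off again.
    """
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return output + alpha * direction

    return layer.register_forward_hook(hook)


torch.manual_seed(0)
layer = nn.Linear(4, 4)                      # stand-in for a model layer (assumption)
x = torch.randn(2, 4)
direction = torch.tensor([1.0, 0.0, 0.0, 0.0])

baseline = layer(x)                          # no steering
handle = add_steering_hook(layer, direction, alpha=2.0)
steered = layer(x)                           # output shifted along the direction
handle.remove()
restored = layer(x)                          # hook removed, back to baseline
```

In a real decoder the hooked module often returns a tuple rather than a single tensor, in which case the hook would need to modify the hidden-state element and return the rebuilt tuple. Keeping the handle and removing it after generation avoids leaving the model permanently steered.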