feat: implement iterative SOLiD training pipeline by rrmaura · Pull Request #8 · AlignmentResearch/deception-evasion-honesty

rrmaura · 2025-07-21T23:43:24Z

What

A complete implementation of an iterative SOLiD (Scalable Oversight via Lie Detector) training system, used to test whether running SOLiD for more than one iteration lowers the probability that the policy generates lies.

This addresses two key research questions:

- Can iterative SOLiD improve lie detection rates over single-iteration approaches?
- Are there fundamental limits to lie detectability using linear probes?

How

- Split the dataset into h1/h2 halves for cross-iteration training
- Modified `munge_data.py` to support dataset splitting
- Created `run_iterative.sh` to orchestrate the full pipeline across iterations
- Built `analyze_iterations.py` for comparative analysis
- Used the iteration 1 policy as the surrogate base model for iteration 2
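The h1/h2 split can be pictured with a minimal sketch. This is not the code in `munge_data.py`; the function name `split_h1_h2`, the `h1_frac` and `seed` parameters, and the row format are assumptions made for illustration. It shuffles the rows with a fixed seed and cuts them into two disjoint parts, so that each iteration trains on data the other has not seen.

```python
import random


def split_h1_h2(rows, h1_frac=0.5, seed=0):
    """Shuffle rows with a fixed seed and cut them into two disjoint parts.

    Hypothetical helper: the real splitting logic lives in munge_data.py
    and may differ.
    """
    if not 0.0 <= h1_frac <= 1.0:
        raise ValueError("h1_frac must be in [0, 1]")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * h1_frac))
    return shuffled[:cut], shuffled[cut:]


rows = [{"id": i} for i in range(10)]
h1, h2 = split_h1_h2(rows, h1_frac=0.5, seed=42)
print(len(h1), len(h2))
```

Fixing the seed keeps the split reproducible across runs, which matters when iteration 2 must avoid the examples used in iteration 1.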

Key Findings

- Policies trained for two iterations show a consistent improvement on the "Generated Ground Truth Lie Fraction" metric.
- There is evidence of lies that resist linear-probe detection, which suggests a bound on detector performance.

- Introduced `analyze_iterations.py` for evaluating lie detector performance and deception rates across training iterations.
- Added `run_iterative.sh` to manage the iterative SOLiD training process, including data splitting and model training.
- Created `README_iterative.md` to document the iterative training workflow and usage instructions.
- Implemented `test_iterative_splitting.py` to verify the correctness of the iterative data splitting functionality.
- Enhanced `munge_data.py` with a new function for creating h1/h2 splits for iterative training.

- Introduced `load_model_for_detection` function to streamline model loading with LoRA adapter support.
- Enhanced model loading logic to check for existing adapters, preventing redundant additions during training.
- Updated tokenizer loading to accommodate base model tokenizer when using adapters.

… configurations
- Modified tokenizer initialization in `train_dpo.py`, `train_explicit_rm.py`, `train_reward.py`, and `train_sft.py` to check for the presence of an adapter configuration file.
- If an adapter is detected, the base model tokenizer is used; otherwise, the specified model path is utilized.
- This change enhances flexibility in model training with adapter support.
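The adapter check described here could look roughly like the sketch below. It assumes the adapter configuration file is `adapter_config.json`, the name PEFT uses when saving a LoRA adapter; the helper name `resolve_tokenizer_path` and the directory names are hypothetical and not taken from the training scripts.

```python
import os
import tempfile

# File name PEFT writes next to saved LoRA adapter weights (assumed to be
# the "adapter configuration file" the commit message refers to).
ADAPTER_CONFIG = "adapter_config.json"


def resolve_tokenizer_path(model_path, base_model_path):
    """Return the path the tokenizer should be loaded from.

    Hypothetical helper: an adapter directory holds no tokenizer files of
    its own, so fall back to the base model when an adapter is detected.
    """
    if os.path.isfile(os.path.join(model_path, ADAPTER_CONFIG)):
        return base_model_path
    return model_path


with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = os.path.join(tmp, "policy_iter1")  # hypothetical adapter checkpoint
    full_dir = os.path.join(tmp, "full_model")       # hypothetical full checkpoint
    os.makedirs(adapter_dir)
    os.makedirs(full_dir)
    open(os.path.join(adapter_dir, ADAPTER_CONFIG), "w").close()

    from_adapter = resolve_tokenizer_path(adapter_dir, "base-model")
    from_full = resolve_tokenizer_path(full_dir, "base-model")

print(from_adapter)
```

The returned path would then be handed to whatever tokenizer loader each training script already uses.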

- Introduced `run_eval_only.sh` to facilitate evaluation of the SOLiD model.
- The script sets up the environment, configures paths, and runs the evaluation process with specified parameters.
- Supports iteration-based evaluation and includes options for debugging and model configuration.

- Updated `run_eval_only.sh` to dynamically set `BASE_MODEL_PATH` and `BASE_POLICY_PATH` based on the iteration number.
- For iteration 1, both paths point to the base model; for subsequent iterations, the policy path references the previous iteration's policy adapter.
- Added echo statements to display the configured base model and policy paths during evaluation.

- Introduced `run_hyperparameter_sweep.sh` to facilitate the execution of multiple configurations based on TPR values and seeds.
- The script sets up the environment, creates a master log for tracking progress, and runs configurations sequentially.
- Each configuration execution logs success or failure, ensuring robust tracking of the hyperparameter tuning process.

- Introduced functions to plot the confusion matrix and log results to wandb, enhancing the evaluation process.
- Updated the main evaluation script to save confusion matrix plots and log metrics, including true positive rates and overlap statistics.
- Modified the run script to set up wandb project and run ID dynamically based on the timestamp.
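As an illustration of the overlap statistics mentioned above, the sketch below cross-tabulates which lies two probes flagged. It assumes every input example is a ground-truth lie, so the fraction a probe flags is its true positive rate; the function name `probe_overlap` and the dictionary keys are hypothetical, and plotting and wandb logging are left out.

```python
from collections import Counter


def probe_overlap(detected_by_probe1, detected_by_probe2):
    """Cross-tabulate which known lies each of two probes flagged.

    Hypothetical helper. Matrix rows: probe 1 detected / missed.
    Matrix columns: probe 2 detected / missed.
    """
    if len(detected_by_probe1) != len(detected_by_probe2):
        raise ValueError("both probes must score the same examples")
    pairs = Counter(zip(map(bool, detected_by_probe1), map(bool, detected_by_probe2)))
    n = len(detected_by_probe1)
    both = pairs[(True, True)]
    only1 = pairs[(True, False)]
    only2 = pairs[(False, True)]
    neither = pairs[(False, False)]
    return {
        "matrix": [[both, only1], [only2, neither]],
        "tpr_probe1": (both + only1) / n if n else 0.0,
        "tpr_probe2": (both + only2) / n if n else 0.0,
        "undetected_by_both": neither,
    }


stats = probe_overlap(
    [True, True, False, False, True],
    [True, False, True, False, True],
)
print(stats["matrix"])
```

The `undetected_by_both` count is the quantity of interest for the second research question: lies that neither iteration's probe catches.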

- Updated the `log_to_wandb` function to accept an optional `run_name` argument, allowing for more flexible logging configurations.
- Modified the main script to pass the `run_name` argument when logging results to wandb.
- Adjusted the run script to dynamically set the `run_name` based on the current timestamp, improving traceability of experiments.

… models
- Updated `run_iterative.sh` to train the SFT model only during the first iteration, skipping it in subsequent iterations.
- Adjusted DPO model loading to use the SFT model for the first iteration and the previous iteration's policy model thereafter.
- Added a merge and evaluate step for probes, which is executed only if more than one iteration is performed.
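The per-iteration model selection described above can be summarized in a few lines. This is a sketch of the decision logic only, written in Python rather than the shell used by `run_iterative.sh`; the function names and the path template are hypothetical.

```python
def should_train_sft(iteration):
    """SFT is trained once, in the first iteration only."""
    return iteration == 1


def dpo_init_model(iteration, sft_path, policy_path_template):
    """Pick the checkpoint that DPO starts from in a given iteration.

    Hypothetical helper: iteration 1 starts from the SFT model, and later
    iterations start from the previous iteration's policy.
    """
    if iteration < 1:
        raise ValueError("iterations are numbered from 1")
    if iteration == 1:
        return sft_path
    return policy_path_template.format(iteration=iteration - 1)


def should_merge_probes(num_iterations):
    """The merge-and-evaluate step only makes sense with several probes."""
    return num_iterations > 1


# Hypothetical paths, for illustration only.
template = "models/policy_iter{iteration}"
starts = [dpo_init_model(i, "models/sft", template) for i in (1, 2, 3)]
print(starts)
```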

- Introduced `run_hyperparameter_sweep.py` to automate the execution of multiple hyperparameter configurations.
- The script sets environment variables for each configuration and runs the `run_iterative.sh` script, logging success or failure for each experiment.
- Added functionality to confirm execution of all configurations and track results, enhancing the hyperparameter tuning process.
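A sweep driver of this kind might be structured as in the sketch below. It is not the contents of `run_hyperparameter_sweep.py`: the environment variable names (`LIE_TPR`, `SEED`, `NUM_ITERATIONS`) and the values are assumptions, and a small Python one-liner stands in for `run_iterative.sh` so that the example runs anywhere.

```python
import os
import subprocess
import sys


def run_sweep(configs, command):
    """Run `command` once per configuration, passing settings via the environment.

    Returns a dict mapping configuration name to True (exit code 0) or False.
    """
    results = {}
    for name, overrides in configs.items():
        env = os.environ.copy()
        env.update({key: str(value) for key, value in overrides.items()})
        completed = subprocess.run(command, env=env, check=False)
        results[name] = completed.returncode == 0
        print(f"[{name}] {'SUCCESS' if results[name] else 'FAILED'}")
    return results


# Hypothetical variable names and values; only the config names echo the PR.
configs = {
    "low_TPR_2iter": {"LIE_TPR": 0.6, "SEED": 0, "NUM_ITERATIONS": 2},
    "high_TPR_1iter": {"LIE_TPR": 0.9, "SEED": 0, "NUM_ITERATIONS": 1},
}

# Stand-in for ["bash", "run_iterative.sh"]: succeeds only for 2-iteration runs.
stand_in = [
    sys.executable,
    "-c",
    "import os, sys; sys.exit(0 if os.environ['NUM_ITERATIONS'] == '2' else 1)",
]
results = run_sweep(configs, stand_in)
```

Using `check=False` lets the loop continue after a failed configuration, which matches the described behavior of logging failures instead of aborting the sweep.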

…hyperparameter sweep script
- Set `DEBUG_TRAINING` to `True` in `run_hyperparameter_sweep.py` for enhanced debugging capabilities.
- Updated the configuration to automatically proceed with all hyperparameter configurations, removing the user confirmation step.
- Adjusted `run_iterative.sh` to use a fixed SFT model path for evaluation, ensuring consistency across iterations.

- Updated the `create_iterative_splits` function to manage the scenario where `h1_frac` equals 1.0, ensuring that an empty h2 dataset is created with the same schema when all data is allocated to h1. This prevents errors during training iterations.
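The `h1_frac == 1.0` edge case can be illustrated with a columnar toy dataset. This sketch is not `create_iterative_splits`; the function name `split_columns` and the column names are hypothetical. The point is that the empty h2 still carries every column, so downstream code that looks up a column by name does not fail.

```python
def split_columns(columns, h1_frac):
    """Split a dict of equal-length columns into h1 and h2 by row position.

    Hypothetical helper: when h1_frac is 1.0, h2 keeps all column names
    but holds zero rows.
    """
    n_rows = len(next(iter(columns.values()), []))
    cut = n_rows if h1_frac >= 1.0 else int(n_rows * h1_frac)
    h1 = {name: values[:cut] for name, values in columns.items()}
    h2 = {name: values[cut:] for name, values in columns.items()}
    return h1, h2


data = {"prompt": ["a", "b", "c"], "is_lie": [True, False, True]}
h1_all, h2_empty = split_columns(data, h1_frac=1.0)
print(h2_empty)
```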

…cript
- Introduced new command-line arguments: `--seed`, `--lie_tpr`, `--debug_training`, `--subsample_dataset`, and `--num_iterations` to enhance configuration flexibility for standalone executions.
- Default values are set for these parameters to streamline the running process.
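For readers unfamiliar with the flags, the sketch below shows one way they could be declared with `argparse`. The flag names come from the commit message; the argument types and default values are assumptions chosen for illustration and may not match the actual script.

```python
import argparse


def build_parser():
    """Declare the five flags named in the commit message.

    Types and defaults here are illustrative guesses, not the script's values.
    """
    parser = argparse.ArgumentParser(description="Standalone iterative SOLiD run (sketch)")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--lie_tpr", type=float, default=0.8)
    parser.add_argument("--debug_training", action="store_true")
    parser.add_argument("--subsample_dataset", type=float, default=1.0)
    parser.add_argument("--num_iterations", type=int, default=2)
    return parser


args = build_parser().parse_args(["--lie_tpr", "0.6", "--num_iterations", "1"])
print(args)
```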

- Introduced additional hyperparameter configurations: `low_TPR_2iter`, `high_TPR_2iter`, `high_TPR_1iter`, `mid_TPR_2iter`, and `mid_TPR_1iter` to enhance the variety of training scenarios.
- Updated existing configurations to maintain consistency in seed values and subsampling settings.

- Introduced `analyze_probe_results.py` to analyze probe detection results from CSV files.
- The script provides a summary of detection patterns and prints examples for both detected and undetected probes.
- Added error handling for missing probe detection columns and file existence checks.

- Modified the main and per-iteration log file paths to include hyperparameters in their names for better organization and clarity.
- Added a description of the hyperparameter structure to assist users in understanding the log file naming scheme.

- Introduced a new Jupyter notebook, `generate_plots_wandb.ipynb`, to facilitate the generation of plots for training results.
- Updated `README_iterative.md` to include instructions for using the new notebook for plot generation.

rrmaura added 30 commits July 14, 2025 18:12

Modify LoRA rank for fast iteration

e199150

Use 1B llama model for faster iteration and focus on DPO experiments

7cbd022

Update folder, gitignore and batchsize

e5dc759

update gitignore and batchsize

09c537e

debug: comment line on torch.serialization (I might have a different …

2d261cc

…torch version)

Add new Jupyter notebook for evaluating AI-generated answers with det…

f6e1ed6

…ailed analysis and summary statistics

Create simulations to estimate rewards under different densities

847108e

Set Debug = False and increase batch size

5e680ee

Update gitignore. Also add txt with paper to help Cursor AI understan…

201ca84

…d the repository

Decrease batch size (CUDA errors with previous ones)

8107bf9

Lower batchsize and move to debugging

c260b7f

Refactor batch size settings in run.sh for various training components

bae9d29

- Set fixed logical batch sizes for RM, SFT, DPO, and GRPO to improve consistency.
- Adjusted DPO_PDTBS calculation for better performance. (previously CUDA errors)
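One common way to keep a logical batch size fixed while lowering memory use is gradient accumulation, sketched below. This assumes `DPO_PDTBS` stands for the per-device train batch size, which the commit message does not state; the helper name and the numbers are hypothetical, and the actual calculation in `run.sh` may differ.

```python
def grad_accum_steps(logical_batch_size, per_device_batch_size, num_devices=1):
    """Accumulation steps needed to reach a fixed logical batch size.

    Hypothetical helper: logical = per_device * num_devices * accumulation.
    """
    per_step = per_device_batch_size * num_devices
    if per_step <= 0 or logical_batch_size % per_step:
        raise ValueError("logical batch size must be a positive multiple of per-step size")
    return logical_batch_size // per_step


# Halving the per-device batch size (e.g. after CUDA out-of-memory errors)
# doubles the accumulation steps and leaves the logical batch size unchanged.
before = grad_accum_steps(logical_batch_size=64, per_device_batch_size=4, num_devices=2)
after = grad_accum_steps(logical_batch_size=64, per_device_batch_size=2, num_devices=2)
print(before, after)
```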

Add debug mode and adjust logical batch sizes in run_iterative.sh

0fee612

- Introduced DEBUG_MODE to enable fast iterations using only 5% of data.
- Set fixed logical batch sizes for RM, SFT, DPO, and GRPO to enhance consistency.
- Updated DPO_PDTBS calculation for improved performance.

Enhance debug mode and model loading in training scripts

573e3bb

- Enabled DEBUG_MODE in run_iterative.sh to sample a fraction of the dataset for faster iterations.
- Updated munge_data.py to support sampling based on debug_frac argument.

Update .gitignore to include test_output directory and paper_lie_dete…

c13d8df

…ctor.txt

Rename variable to avoid confusion

1bdbe4c

Update tokenizer loading logic in reward.py to support adapter config…

21d6f2e

…urations for the second iteration
- If an adapter is detected (i.e. during the second SOLiD iteration), the base model tokenizer is used; otherwise, the specified tokenizer path is utilized.

Disable dataset subsampling in run_iterative.sh for full data utiliza…

5eaccc6

…tion during iterations

Create python scripts to merge the test datasets from both iterations…

1d1da10

… and calculate the matrix of detected and undetected lies by iterators from both probes.

refactor and simplify the evaluation of probes

af1cbec

rrmaura added 21 commits July 20, 2025 01:25

Disable debug training mode in hyperparameter sweep script

a16b970

Refactor the way to do debugging. Just subsample the data in munge_da…

83bc659

…ta.py

Update data sampling in reward evaluation to allow replacement during…

a518d53

… subsampling
- Mostly to avoid issues with small subsamples during debugging
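The reason replacement helps with small subsamples is easy to show with the standard library. The sketch below is not the project's sampling code; the helper name `subsample` and its parameters are hypothetical. Sampling without replacement fails when more rows are requested than exist, whereas sampling with replacement always returns the requested number.

```python
import random


def subsample(rows, k, replace=False, seed=0):
    """Draw k rows, optionally with replacement.

    Hypothetical helper: without replacement, k may not exceed len(rows).
    """
    rng = random.Random(seed)
    if replace:
        return rng.choices(rows, k=k)
    return rng.sample(rows, k)


tiny = ["row_a", "row_b", "row_c"]  # e.g. a heavily subsampled debug dataset
with_replacement = subsample(tiny, k=10, replace=True)

try:
    subsample(tiny, k=10, replace=False)
    failed_without_replacement = False
except ValueError:
    # random.sample refuses to draw more items than the population holds.
    failed_without_replacement = True

print(len(with_replacement), failed_without_replacement)
```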

Debug: support LoRA adapters and update tokenizer handling

07402ac

- Added logic to load models from LoRA adapters, merging them with the base model for improved compatibility.

Update row count parameter in iterative training script for enhanced …

3ded415

…data processing
- Changed the `--n_rows` parameter from 20 to 100 in `run_iterative.sh` to allow for larger data handling during iterations, improving the evaluation process.

Fix indentation in hyperparameter sweep

a7b5b6b

Adjust SFT_PDTBS in iterative training script to avoid memory CUDA er…

e36d59f

…rors

Update per device batch size in iterative training script to avoid me…

3aac453

…mory errors

Remove evaluate_answers.ipynb notebook

31c4c6d

- Deleted the `evaluate_answers.ipynb` file, which contained ELO calculations and Bradley-Terry model implementations.
- This removal is part of a cleanup process to streamline the project structure.

Remove run_eval_only.sh script

090c4e1

- Deleted the `run_eval_only.sh` file.
- This removal is part of a project cleanup to streamline the codebase and eliminate unused scripts.

Move test_iterative_splitting.py to tests directory

ce42f9f

Remove analyze_iterations.py script

a76b49c

- Deleted the `analyze_iterations.py` file, which was an obsolete script responsible for analyzing iterative SOLiD training results.
- This removal is part of a project cleanup to streamline the codebase and eliminate unused scripts.

Update README_iterative.md to clarify key questions and outcomes

9b1d51e

Update README_iterative.md to clarify the purpose of iterative SOLiD …

4ae4d22

…training

rrmaura wants to merge 51 commits into AlignmentResearch:main from rrmaura:rrmaura_branch
