
feat: implement iterative SOLiD training pipeline #8

Open
rrmaura wants to merge 51 commits into AlignmentResearch:main from rrmaura:rrmaura_branch


rrmaura commented Jul 21, 2025

What

A complete implementation of an iterative SOLiD (Scalable Oversight via Lie Detector) training system, used to test whether running SOLiD iteratively lowers the probability that the policy generates lies.

This addresses two key research questions:

  • Can iterative SOLiD improve lie detection rates over single-iteration approaches?
  • Are there fundamental limits to lie detectability using linear probes?

How

  • Split the dataset into h1/h2 halves for cross-iteration training
  • Modified munge_data.py to support dataset splitting
  • Created run_iterative.sh to orchestrate the full pipeline across iterations
  • Built analyze_iterations.py for comparative analysis
  • Used the iteration 1 policy as a surrogate base model for iteration 2

Key Findings

  • Policies trained for 2 iterations consistently improve on the "Generated Ground Truth Lie Fraction" metric
  • There is evidence of lies that resist linear-probe detection, which suggests a bound on detector performance

rrmaura added 30 commits July 14, 2025 18:12
- Introduced `analyze_iterations.py` for evaluating lie detector performance and deception rates across training iterations.
- Added `run_iterative.sh` to manage the iterative SOLiD training process, including data splitting and model training.
- Created `README_iterative.md` to document the iterative training workflow and usage instructions.
- Implemented `test_iterative_splitting.py` to verify the correctness of the iterative data splitting functionality.
- Enhanced `munge_data.py` with a new function for creating h1/h2 splits for iterative training.
- Set fixed logical batch sizes for RM, SFT, DPO, and GRPO to improve consistency.
- Adjusted the DPO_PDTBS calculation for better performance (the previous value caused CUDA errors).
- Introduced DEBUG_MODE to enable fast iterations using only 5% of data.
- Set fixed logical batch sizes for RM, SFT, DPO, and GRPO to enhance consistency.
- Updated DPO_PDTBS calculation for improved performance.
- Enabled DEBUG_MODE in run_iterative.sh to sample a fraction of the dataset for faster iterations.
- Updated munge_data.py to support sampling based on debug_frac argument.
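
A minimal sketch of what fraction-based debug sampling could look like, assuming the rows are held in a plain list. `subsample_rows` and its `seed` parameter are hypothetical names for illustration and are not taken from `munge_data.py`.

```python
import random


def subsample_rows(rows, debug_frac=None, seed=0):
    """Return a reproducible random subset of `rows` (hypothetical helper).

    With debug_frac=None or >= 1.0 the data is returned unchanged. Otherwise
    roughly debug_frac of the rows are kept, and at least one row is kept
    whenever the input is non-empty.
    """
    if debug_frac is None or debug_frac >= 1.0:
        return list(rows)
    if debug_frac <= 0.0:
        raise ValueError("debug_frac must be in (0, 1]")
    n_keep = min(len(rows), max(1, round(len(rows) * debug_frac)))
    rng = random.Random(seed)  # local RNG, global random state is untouched
    keep = sorted(rng.sample(range(len(rows)), n_keep))  # keep original order
    return [rows[i] for i in keep]
```

Using a seeded local generator keeps the debug subset identical between runs, which makes failures in fast iterations easier to reproduce.
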
- Introduced `load_model_for_detection` function to streamline model loading with LoRA adapter support.
- Enhanced model loading logic to check for existing adapters, preventing redundant additions during training.
- Updated tokenizer loading to accommodate base model tokenizer when using adapters.
… configurations

- Modified tokenizer initialization in `train_dpo.py`, `train_explicit_rm.py`, `train_reward.py`, and `train_sft.py` to check for the presence of an adapter configuration file.
- If an adapter is detected, the base model tokenizer is used; otherwise, the specified model path is utilized.
- This change enhances flexibility in model training with adapter support.
…urations for the second iteration

- If an adapter is detected (i.e. during the second SOLiD iteration), the base model tokenizer is used; otherwise, the specified tokenizer path is utilized.
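
The adapter check described in these commits could be sketched roughly as follows. This assumes the adapter directory follows the layout PEFT writes on save, where `adapter_config.json` carries a `base_model_name_or_path` field; `resolve_tokenizer_source` is a hypothetical helper, not a function from the training scripts.

```python
import json
import os


def resolve_tokenizer_source(model_path):
    """Return the path or model name a tokenizer should be loaded from.

    Assumption: a LoRA adapter directory is recognised by the presence of
    adapter_config.json, whose base_model_name_or_path field names the
    base model.
    """
    config_file = os.path.join(model_path, "adapter_config.json")
    if not os.path.isfile(config_file):
        return model_path  # full model: use the specified path directly
    with open(config_file, encoding="utf-8") as fh:
        adapter_config = json.load(fh)
    # Fall back to the given path if the field is absent or empty.
    return adapter_config.get("base_model_name_or_path") or model_path
```

The returned value would then be handed to whatever tokenizer loader the script uses, for example `AutoTokenizer.from_pretrained` in `transformers`.
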
- Introduced `run_eval_only.sh` to facilitate evaluation of the SOLiD model.
- The script sets up the environment, configures paths, and runs the evaluation process with specified parameters.
- Supports iteration-based evaluation and includes options for debugging and model configuration.
- Updated `run_eval_only.sh` to dynamically set `BASE_MODEL_PATH` and `BASE_POLICY_PATH` based on the iteration number.
- For iteration 1, both paths point to the base model; for subsequent iterations, the policy path references the previous iteration's policy adapter.
- Added echo statements to display the configured base model and policy paths during evaluation.
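
As an illustration of the iteration-dependent path logic, here is a small Python sketch. The real logic lived in a shell script; the `iter_<k>/policy_adapter` directory layout and the function name are assumptions made for this example.

```python
import os


def resolve_paths(iteration, base_model, output_root):
    """Return (base_model_path, base_policy_path) for a 1-indexed iteration.

    Iteration 1 points both paths at the base model. Later iterations keep
    the base model but take the policy from the previous iteration's adapter.
    The iter_<k>/policy_adapter layout is hypothetical.
    """
    if iteration < 1:
        raise ValueError("iteration is 1-indexed")
    if iteration == 1:
        return base_model, base_model
    prev_policy = os.path.join(
        output_root, f"iter_{iteration - 1}", "policy_adapter"
    )
    return base_model, prev_policy
```
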
- Introduced `run_hyperparameter_sweep.sh` to facilitate the execution of multiple configurations based on TPR values and seeds.
- The script sets up the environment, creates a master log for tracking progress, and runs configurations sequentially.
- Each configuration execution logs success or failure, ensuring robust tracking of the hyperparameter tuning process.
… and calculate the matrix of detected and undetected lies by iterators from both probes.
- Introduced functions to plot the confusion matrix and log results to wandb, enhancing the evaluation process.
- Updated the main evaluation script to save confusion matrix plots and log metrics, including true positive rates and overlap statistics.
- Modified the run script to set up wandb project and run ID dynamically based on the timestamp.
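
One way to compute a detected/undetected matrix across two probes is sketched below. It assumes each probe yields one boolean per ground-truth lie; the function name and the returned keys are hypothetical, and the plotting and wandb logging steps are left out.

```python
from collections import Counter


def probe_confusion(detected_1, detected_2):
    """Cross-tabulate which lies each of two probes flags.

    Inputs are equal-length sequences with one entry per ground-truth lie,
    truthy if that probe detected it. The returned matrix is indexed as
    matrix[probe 1 detected][probe 2 detected], with 0 = no and 1 = yes.
    """
    if len(detected_1) != len(detected_2):
        raise ValueError("both probes must score the same examples")
    counts = Counter((bool(a), bool(b)) for a, b in zip(detected_1, detected_2))
    matrix = [[counts[(r, c)] for c in (False, True)] for r in (False, True)]
    n = len(detected_1)

    def frac(k):
        return k / n if n else 0.0

    return {
        "matrix": matrix,
        # Every example is a lie, so the detected fraction is the TPR.
        "tpr_probe_1": frac(matrix[1][0] + matrix[1][1]),
        "tpr_probe_2": frac(matrix[0][1] + matrix[1][1]),
        "both_detected": frac(matrix[1][1]),
        "neither_detected": frac(matrix[0][0]),
    }
```

The `neither_detected` cell is the one most relevant to the question of lies that evade both probes.
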
- Updated the `log_to_wandb` function to accept an optional `run_name` argument, allowing for more flexible logging configurations.
- Modified the main script to pass the `run_name` argument when logging results to wandb.
- Adjusted the run script to dynamically set the `run_name` based on the current timestamp, improving traceability of experiments.
… models

- Updated `run_iterative.sh` to train the SFT model only during the first iteration, skipping it in subsequent iterations.
- Adjusted DPO model loading to use the SFT model for the first iteration and the previous iteration's policy model thereafter.
- Added a merge and evaluate step for probes, which is executed only if more than one iteration is performed.
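
The per-iteration control flow described in this commit can be sketched as a small planning function. Stage names and model labels are hypothetical, other pipeline stages (such as reward model or lie detector training) are omitted, and running the probe merge after the final iteration is an assumption.

```python
def plan_iteration(iteration, num_iterations):
    """List the stages to run for one 1-indexed iteration (illustrative)."""
    if not 1 <= iteration <= num_iterations:
        raise ValueError("iteration must be between 1 and num_iterations")
    stages = []
    if iteration == 1:
        stages.append("train_sft")  # SFT is trained only in iteration 1
        dpo_init = "sft_model"
    else:
        # Later iterations start DPO from the previous iteration's policy.
        dpo_init = f"policy_iter_{iteration - 1}"
    stages.append("train_dpo")
    # Assumption: probes are merged and evaluated once, after the last
    # iteration, and only when more than one iteration was run.
    if num_iterations > 1 and iteration == num_iterations:
        stages.append("merge_and_evaluate_probes")
    return {"stages": stages, "dpo_init": dpo_init}
```
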
- Introduced `run_hyperparameter_sweep.py` to automate the execution of multiple hyperparameter configurations.
- The script sets environment variables for each configuration and runs the `run_iterative.sh` script, logging success or failure for each experiment.
- Added functionality to confirm execution of all configurations and track results, enhancing the hyperparameter tuning process.
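
A rough sketch of a sweep loop of this kind is shown below. The environment variable names used in the usage note (`LIE_TPR`, `SEED`) and the injectable `runner` argument are assumptions for illustration; only `run_iterative.sh` comes from the commit message.

```python
import os
import subprocess


def run_sweep(configs, command=("bash", "run_iterative.sh"), runner=subprocess.run):
    """Run each configuration sequentially and record success or failure.

    `configs` maps a configuration name to a dict of environment variable
    overrides. `runner` is injectable so the loop can be exercised without
    launching real training.
    """
    results = {}
    for name, overrides in configs.items():
        # Child process sees the current environment plus the overrides.
        env = {**os.environ, **{k: str(v) for k, v in overrides.items()}}
        try:
            completed = runner(list(command), env=env, check=False)
            results[name] = completed.returncode == 0
        except OSError:
            results[name] = False  # e.g. the script could not be launched
        print(f"[{name}] {'SUCCESS' if results[name] else 'FAILED'}")
    return results
```

A call such as `run_sweep({"low_TPR_2iter": {"LIE_TPR": 0.6, "SEED": 0}})` would run one configuration; a failed run does not stop the remaining ones.
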
rrmaura added 21 commits July 20, 2025 01:25
…hyperparameter sweep script

- Set `DEBUG_TRAINING` to `True` in `run_hyperparameter_sweep.py` for enhanced debugging capabilities.
- Updated the configuration to automatically proceed with all hyperparameter configurations, removing the user confirmation step.
- Adjusted `run_iterative.sh` to use a fixed SFT model path for evaluation, ensuring consistency across iterations.
- Updated the `create_iterative_splits` function to manage the scenario where `h1_frac` equals 1.0, ensuring that an empty h2 dataset is created with the same schema when all data is allocated to h1. This prevents errors during training iterations.
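
To illustrate the `h1_frac` equals 1.0 edge case, here is a dependency-free sketch that models a dataset as a dict of columns. `split_h1_h2` is a hypothetical stand-in and makes no claim about how `create_iterative_splits` is actually written.

```python
import random


def split_h1_h2(columns, h1_frac=0.5, seed=0):
    """Split a column-oriented table into h1 and h2 (illustrative sketch).

    `columns` maps column name -> list of values, all of the same length.
    Both outputs always carry every column, so h1_frac == 1.0 yields an
    empty h2 with the same schema instead of a schema-less table.
    """
    if not 0.0 < h1_frac <= 1.0:
        raise ValueError("h1_frac must be in (0, 1]")
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)  # seeded, so the split is reproducible
    n_h1 = n_rows if h1_frac == 1.0 else int(n_rows * h1_frac)

    def take(indices):
        return {name: [vals[i] for i in indices] for name, vals in columns.items()}

    return take(order[:n_h1]), take(order[n_h1:])
```

Keeping the column names on the empty half means downstream code that selects columns by name still works when only one iteration is run.
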
… subsampling

- Mostly to avoid issues with small subsamples during debugging
- Added logic to load models from LoRA adapters, merging them with the base model for improved compatibility.
…cript

- Introduced new command-line arguments: `--seed`, `--lie_tpr`, `--debug_training`, `--subsample_dataset`, and `--num_iterations` to enhance configuration flexibility for standalone executions.
- Default values are set for these parameters to streamline the running process.
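
The flags listed above could be declared with `argparse` along these lines. The flag names come from the commit message; the types, default values, and the choice to make `--debug_training` a boolean switch are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the standalone-run flags (defaults are hypothetical)."""
    parser = argparse.ArgumentParser(description="Iterative SOLiD run options")
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed for the run")
    parser.add_argument("--lie_tpr", type=float, default=0.8,
                        help="target true positive rate of the lie detector")
    parser.add_argument("--debug_training", action="store_true",
                        help="enable the fast debug configuration")
    parser.add_argument("--subsample_dataset", type=float, default=1.0,
                        help="fraction of the dataset to use")
    parser.add_argument("--num_iterations", type=int, default=2,
                        help="number of SOLiD iterations to run")
    return parser
```

For example, `build_parser().parse_args(["--seed", "3", "--lie_tpr", "0.6"])` overrides two values and leaves the rest at their defaults.
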
…data processing

- Changed the `--n_rows` parameter from 20 to 100 in `run_iterative.sh` to allow for larger data handling during iterations, improving the evaluation process.
- Introduced additional hyperparameter configurations: `low_TPR_2iter`, `high_TPR_2iter`, `high_TPR_1iter`, `mid_TPR_2iter`, and `mid_TPR_1iter` to enhance the variety of training scenarios.
- Updated existing configurations to maintain consistency in seed values and subsampling settings.
- Introduced `analyze_probe_results.py` to analyze probe detection results from CSV files.
- The script provides a summary of detection patterns and prints examples for both detected and undetected probes.
- Added error handling for missing probe detection columns and file existence checks.
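
A minimal sketch of the file and column checks such a script might perform, using only the standard library. The column names `probe_1_detected` and `probe_2_detected` and the accepted truthy strings are hypothetical; the actual CSV schema is not given in the commit message.

```python
import csv
import os


def summarize_probe_detections(
    csv_path, probe_columns=("probe_1_detected", "probe_2_detected")
):
    """Count detected and undetected rows per probe column in a CSV file."""
    if not os.path.isfile(csv_path):
        raise FileNotFoundError(f"no such results file: {csv_path}")
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        missing = [col for col in probe_columns if col not in header]
        if missing:
            raise KeyError(f"missing probe detection columns: {missing}")
        rows = list(reader)
    summary = {}
    for col in probe_columns:
        # Treat "1" and "true" (any case) as a detection; anything else is not.
        detected = sum(
            (row[col] or "").strip().lower() in {"1", "true"} for row in rows
        )
        summary[col] = {"detected": detected, "undetected": len(rows) - detected}
    return summary
```
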
- Deleted the `evaluate_answers.ipynb` file, which contained ELO calculations and Bradley-Terry model implementations.
- This removal is part of a cleanup process to streamline the project structure.
- Deleted the `run_eval_only.sh` file
- This removal is part of a project cleanup to streamline the codebase and eliminate unused scripts.
- Deleted the `analyze_iterations.py` file, which was an obsolete script responsible for analyzing iterative SOLiD training results.
- This removal is part of a project cleanup to streamline the codebase and eliminate unused scripts.
- Modified the main and per-iteration log file paths to include hyperparameters in their names for better organization and clarity.
- Added a description of the hyperparameter structure to assist users in understanding the log file naming scheme.
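
As an illustration of hyperparameter-tagged log names, the sketch below builds a file name from a few run settings. The exact naming scheme used in the PR is not shown in the commit message, so the field order, separators, and function name here are assumptions.

```python
def log_file_name(prefix, seed, lie_tpr, num_iterations, iteration=None):
    """Build a log file name that encodes the run's hyperparameters.

    Passing `iteration` produces a per-iteration name; omitting it produces
    the name of the main log. The format is hypothetical.
    """
    parts = [prefix, f"seed{seed}", f"tpr{lie_tpr}", f"iters{num_iterations}"]
    if iteration is not None:
        parts.append(f"iter{iteration}")
    return "_".join(parts) + ".log"
```

Encoding the settings in the name lets logs from a sweep sit in one directory without overwriting each other.
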
- Introduced a new Jupyter notebook, `generate_plots_wandb.ipynb`, to facilitate the generation of plots for training results.
- Updated `README_iterative.md` to include instructions for using the new notebook for plot generation.
