Welcome to the DPO & GRPO project, a comprehensive framework for optimizing language models using Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). This project leverages Hugging Face Transformers and PEFT (LoRA) to provide efficient and scalable solutions for preference learning and policy optimization.
This repository contains two main components:
- Direct Preference Optimization (DPO): A method for training language models directly from preference pairs using a classification-style loss, improving sample quality and alignment in tasks such as summarization, dialogue, and sentiment control.
- Group Relative Policy Optimization (GRPO): A reinforcement-learning method, introduced in DeepSeekMath, that scores each sampled completion relative to the mean reward of its group instead of training a separate value model. It is particularly suited to tasks like instruction tuning and mathematical reasoning.
- See the dpo/ and grpo/ folders to learn more about how each algorithm works.
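The two objectives can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project's actual trainer code: dpo_loss takes summed token log-probabilities of the chosen and rejected completions under the policy and a frozen reference model, and grpo_advantages shows the group-relative normalization GRPO uses in place of a learned value model.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair.

    Computes -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)),
    pushing the policy to raise the chosen completion's likelihood
    (relative to the reference model) more than the rejected one's.
    """
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled completion's reward is
    normalized by the mean and std of its own group of samples,
    so no separate value model is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = max(var ** 0.5, 1e-8)  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

With equal log-ratios the DPO loss is log 2 (the sigmoid sits at 0.5), and it falls as the policy favors the chosen completion; GRPO advantages always sum to zero within a group.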
The repository is organized as follows:

```
dpo_grpo_rlhf/
├── data/
│   ├── raw/
│   ├── processed/
│   └── prompts.json
│
├── model/
│   └── base_gpt2/
│
├── configs/
│   ├── dpo_config.yaml
│   └── grpo_config.yaml
│
├── dpo/
│   ├── train_dpo.py
│   ├── dpo_trainer.py
│   ├── dataset_utils.py
│   ├── eval.py
│   └── README.md
│
├── grpo/
│   ├── train_grpo.py
│   ├── grpo_trainer.py
│   ├── dataset_utils.py
│   ├── eval.py
│   └── README.md
│
├── scripts/
│   └── convert_raw_to_processed.py
│
├── requirements.txt
└── README.md
```
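The data layout above implies a raw-to-processed conversion step. As an illustration of the pairwise format DPO-style training typically consumes, here is a minimal sketch; the field names (prompt, responses, ranking, chosen, rejected) are assumptions for illustration, not the project's actual schema.

```python
import json

def to_preference_pairs(records):
    """Turn ranked responses into (prompt, chosen, rejected) triples.

    Each record is assumed to hold a prompt, a list of candidate
    responses, and a ranking of response indices from best to worst;
    the best and worst become the chosen/rejected pair.
    """
    pairs = []
    for rec in records:
        pairs.append({
            "prompt": rec["prompt"],
            "chosen": rec["responses"][rec["ranking"][0]],
            "rejected": rec["responses"][rec["ranking"][-1]],
        })
    return pairs

# Example record in the assumed raw format:
raw = [{
    "prompt": "Summarize the article in one sentence.",
    "responses": ["A concise, faithful summary.", "An off-topic ramble."],
    "ranking": [0, 1],
}]
print(json.dumps(to_preference_pairs(raw), indent=2))
```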
To run this project locally, clone the repository and install the dependencies:

```bash
git clone https://github.com/kaushikd24/LLM-Policy-Optimization.git
cd LLM-Policy-Optimization
pip install -r requirements.txt
```

To train with DPO, configure your settings in configs/dpo_config.yaml and run:

```bash
python dpo/train_dpo.py
```

To train with GRPO, configure your settings in configs/grpo_config.yaml and run:

```bash
python grpo/train_grpo.py
```

References:
- DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)
- GRPO: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024)
This project is licensed under the MIT License.
- Based on the original Stanford DPO paper by Rafailov et al. (2023)
- The Hugging Face 🤗 team and the PEFT library for making adapters dead simple
- Thanks to the contributors and the open-source community for their support and resources.