A fine-tuning framework for training language models on debate quality indicators (DQI) using US Congressional debate data.
- uv package manager (install separately)
- Set up the Python environment:

```bash
make env
```

This command uses `uv` to install all project dependencies and create a virtual environment.
The project requires US Congressional debate data from the Stanford Congress Text dataset.
Download the data from: https://data.stanford.edu/congress_text
Extract the archive and place:

- `hein-daily/` directory → `data/raw/hein-daily/`
- `USfinal-clean.csv` → `data/raw/USfinal-clean.csv`
Note: The `USfinal-clean.csv` file is not publicly available. If you need access, please contact the project author.

After downloading, your `data/raw/` directory should look like:
```
data/
└── raw/
    ├── hein-daily/
    │   ├── 097_SpeakerMap.txt
    │   ├── 098_SpeakerMap.txt
    │   └── ... (more speaker maps)
    └── USfinal-clean.csv
```
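As a sanity check after placing the files, a small script along these lines can verify the expected layout (this helper is illustrative, not part of the repository):

```python
from pathlib import Path


def check_raw_layout(base: str = "data/raw") -> list[str]:
    """Return a list of items missing from the expected data/raw/ layout."""
    root = Path(base)
    missing = []
    # The hein-daily/ directory should contain the speaker-map files.
    if not (root / "hein-daily").is_dir():
        missing.append("hein-daily/")
    elif not list((root / "hein-daily").glob("*_SpeakerMap.txt")):
        missing.append("hein-daily/*_SpeakerMap.txt")
    # USfinal-clean.csv sits directly under data/raw/.
    if not (root / "USfinal-clean.csv").is_file():
        missing.append("USfinal-clean.csv")
    return missing
```

`check_raw_layout()` returns an empty list when everything is in place.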
The project uses a Makefile-based workflow for data processing and model training. Commands can be chained to create a complete pipeline.
| Command | Description |
|---|---|
| `make env` | Install Python environment with uv |
| `make data` | Merge, split, and preprocess all data |
| `make train` | Fine-tune the language model |
| `make evaluate` | Evaluate the fine-tuned model |
| `make publish` | Publish/export the model |
```bash
# 1. Set up environment
make env

# 2. Process data (merges, splits, and preprocesses)
make data

# 3. Fine-tune the model
make train

# 4. Evaluate the model
make evaluate

# 5. Publish the model
make publish
```

The `make data` command handles the following steps automatically, but you can also run them individually:
```bash
# Merge datasets
make merge

# Split into train/val/test
make split

# Preprocess data
make preprocess
```

To use a different config file, specify it via the `CONFIG` variable:
```bash
make train CONFIG=configs/config_mistral-7b-instruct-v0.3.yaml
make evaluate CONFIG=configs/config_mistral-7b-instruct-v0.3.yaml
```

Default config: `configs/config.yaml`
Pre-configured models are available in the configs/ directory:
- `config_llama3.1-8b-instruct.yaml` - Llama 3.1 8B
- `config_mistral-7b-instruct-v0.3.yaml` - Mistral 7B
- `config_deepseek-r1-distill-llama-70b-bnb-4bit.yaml` - DeepSeek Llama 70B (4-bit)
- `config_gemma3-12b-it.yaml` - Gemma 3 12B
- And more...
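To see which config files are available in a local checkout, a one-liner along these lines works (illustrative helper, assuming all configs are `*.yaml` files directly under `configs/`):

```python
from pathlib import Path


def list_configs(config_dir: str = "configs") -> list[str]:
    """Return the sorted names of all YAML config files in the given directory."""
    return sorted(p.name for p in Path(config_dir).glob("*.yaml"))
```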
```
.
├── data/           # Data directory (raw and processed)
├── configs/        # Model configuration files
├── scripts/        # Python scripts for data/training/evaluation
├── outputs/        # Fine-tuned model outputs
├── prompts/        # Evaluation prompt templates
├── slurm/          # SLURM job submission scripts
├── Makefile        # Workflow automation
└── pyproject.toml  # Project dependencies
```
- The project includes Jupyter notebooks (`finetuning.ipynb`, `graphs.ipynb`) for detailed analysis
- Fine-tuned models are saved to `outputs/`
- Experiment tracking is handled via Weights & Biases (wandb)
For questions about the USfinal-clean.csv dataset or other inquiries, please contact the project author.