This project trains a T5-based model to perform common-sense multiple-choice reasoning and generate human-like explanations.
We use the CommonsenseQA dataset, which provides:
- A question
- Multiple-choice answer options
- The correct answer
- A human-written rationale (explanation)
## Project Structure

```text
.
├── csqa_explanation_generation.py
│     # Main script for generating high-quality explanations for CSQA
│     # using a large language model (here we use Qwen2.5 7B)
├── training_t5_CoSE.ipynb
│     # Jupyter notebook for training and fine-tuning T5 models
│     # (T5-Small baseline and T5-Large + LoRA)
│     # Includes:
│     #   - data preprocessing & cleaning
│     #   - tokenization
│     #   - LoRA configuration
│     #   - training & evaluation
│     #   - baseline vs fine-tuned comparison
├── data/
│   ├── csqa_full.jsonl
│   │     # Cleaned CommonsenseQA-style dataset with
│   │     # question, choices, answer, and generated explanations
│   │     # (JSONL format, one example per line)
│   │
│   └── csqa_full.csv
│         # CSV version of the same dataset for analysis,
│         # visualization, or third-party tools
├── t5_csqa_lora/
│     # Output directory for LoRA fine-tuning checkpoints
│   ├── checkpoint-*/              # Intermediate training checkpoints
│   │   ├── adapter_config.json
│   │   ├── adapter_model.safetensors
│   │   └── trainer_state.json
│   │
│   └── trainer_state.json         # Final trainer metadata
├── t5_csqa_lora_merged/
│     # Final merged T5 model (base T5 + LoRA weights)
│     # Fully loadable for inference
│   ├── config.json
│   ├── generation_config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
└── README.md
      # Project overview, methodology, and results
```
## Setup

Follow the steps below to set up the environment and run the project.

### 1. Clone the repository

```bash
git clone https://github.com/kkli08/common-sense-reasoning.git
cd <your-repo-name>
```

### 2. Create a virtual environment

We recommend using a Python virtual environment to avoid dependency conflicts.

macOS / Linux:

```bash
python3 -m venv venv
source venv/bin/activate
```

Windows:

```bash
python -m venv venv
venv\Scripts\activate
```

### 3. Install dependencies

All required packages are listed in `requirements.txt`.

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

✅ **Note (macOS / Apple Silicon):** Training is supported via Metal Performance Shaders (MPS). No CUDA or `bitsandbytes` is required.
### 4. Verify the installation

You can quickly verify that PyTorch and Transformers are installed correctly:

```bash
python - <<EOF
import torch
import transformers
print("Torch version:", torch.__version__)
print("Transformers version:", transformers.__version__)
print("MPS available:", torch.backends.mps.is_available())
EOF
```

## Data Generation

This project includes a data generation pipeline that creates high-quality short reasoning explanations for the CommonsenseQA dataset using a locally deployed large language model. Why not CoSE?
The generation process uses a two-stage reasoning approach:
- Generate a correct long explanation for each question
- Compress it into a clean, single-sentence short explanation
Only the final short explanations are kept and saved for downstream model training.
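The two-stage loop can be sketched as a small helper that composes two prompts around any LLM call. This is a minimal illustration: the prompt wording and the `generate` wrapper are assumptions, not the actual code in `csqa_explanation_generation.py`.

```python
def two_stage_explanation(generate, question, choices, answer):
    """Sketch of the two-stage pipeline.

    `generate(prompt) -> str` is any LLM call (e.g. a locally deployed
    Qwen2.5 7B). Prompt wording here is illustrative only.
    """
    # Stage 1: produce a full reasoning chain for the gold answer.
    long_exp = generate(
        f"Question: {question}\n"
        f"Choices: {', '.join(choices)}\n"
        f"Correct answer: {answer}\n"
        f"Explain step by step why this answer is correct."
    )
    # Stage 2: compress the chain into one clean short sentence.
    short_exp = generate(
        "Rewrite this explanation as a single short sentence "
        f"starting with 'Because':\n{long_exp}"
    )
    return short_exp.strip()
```

Only the stage-2 output is kept; the long chain is discarded after compression.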
Run the following command:

```bash
python3 csqa_explanation_generation.py
```

This will:
- Load the CommonsenseQA training set
- Generate explanations incrementally with progress tracking
- Automatically resume from partially generated files
- Save results to:
  - `data/csqa_full.jsonl`
  - `data/csqa_full.csv`
Each generated sample follows this format:

```json
{
  "question": "...",
  "choices": ["...", "...", "...", "...", "..."],
  "answer": "B",
  "short_explanation": "Because ..."
}
```

The dataset is expected to be located in:

```text
data/
├── csqa_full.jsonl
└── csqa_full.csv
```
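Once generated, the JSONL file can be read back with a few lines of standard-library Python (the path and field names follow the sample format above):

```python
import json

def load_csqa(path="data/csqa_full.jsonl"):
    """Read the generated dataset: one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each returned dict has the keys `question`, `choices`, `answer`, and `short_explanation`.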
## Training

To train and evaluate the model, open `training_t5_CoSE.ipynb`.
The notebook covers:
- Data cleaning and formatting
- Tokenization for T5
- LoRA fine-tuning
- Baseline vs fine-tuned evaluation
- Model merging and inference
Run the notebook sequentially to reproduce results.
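The LoRA step broadly follows the standard `peft` recipe for T5. The hyperparameters below are illustrative assumptions, not the notebook's actual values:

```python
from peft import LoraConfig, TaskType

# Hypothetical hyperparameters -- the notebook's actual values may differ.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # T5 is an encoder-decoder model
    r=16,                       # low-rank dimension (assumption)
    lora_alpha=32,              # scaling factor (assumption)
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5 attention query/value projections
)
```

Wrapping the base model with `peft.get_peft_model(model, lora_config)` then trains only the low-rank adapter weights, which is why the checkpoints above contain small `adapter_model.safetensors` files rather than full model weights.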
After training, the merged model will be saved under `t5_csqa_lora_merged/`.
You can load it directly using Hugging Face Transformers for inference.
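For example, a minimal inference sketch using the standard Transformers API; the `question: ... choices: ...` prompt format shown here is an assumption and should match whatever format the notebook used during training:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the merged model (base T5 + LoRA weights) from the local directory.
model = AutoModelForSeq2SeqLM.from_pretrained("t5_csqa_lora_merged")
tokenizer = AutoTokenizer.from_pretrained("t5_csqa_lora_merged")

prompt = (
    "question: Where do you store fresh vegetables? "
    "choices: garage, refrigerator, bookshelf, bathroom, attic"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```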
TBD
## Results

We compare both models on the same 8 baseline CSQA questions.
| Model | Accuracy | Correct / Total |
|---|---|---|
| T5-Small (Baseline) | 0% | 0 / 8 |
| T5-Large + LoRA Fine-Tuned | 100% | 8 / 8 |
Fine-tuning transformed the model from complete failure to perfect accuracy on this set.
Below, each question shows outputs from both models, allowing direct comparison.
### Question 1
Q: Sammy wanted to go to where the people were. Where might he go?
Choices: race track, populated areas, the desert, apartment, roadblock
Gold: B
| Model | Output |
|---|---|
| T5-Small | answer: E. Because Sammy wanted to go to where the people were. ❌ Predicted: E |
| T5-Large + LoRA | answer: B. Because "populated areas" best fits Sammy's desire to visit populated areas. ✔ Correct |
### Question 2
Q: Where do you store fresh vegetables?
Choices: garage, refrigerator, bookshelf, bathroom, attic
Gold: B
| Model | Output |
|---|---|
| T5-Small | answer: E. Because refrigerators are used to store fresh vegetables. ❌ Predicted: E |
| T5-Large + LoRA | answer: B. Because refrigerators are ideal for storing fresh vegetables. ✔ Correct |
### Question 3
Q: If you heat water to 100°C, what will happen?
Choices: freeze, boil, rust, glow, evaporate
Gold: B
| Model | Output |
|---|---|
| T5-Small | answer: E. Because heat water to 100°C will freeze and rust. ❌ Predicted: E |
| T5-Large + LoRA | answer: B. Because boiling occurs when water reaches 100°C. ✔ Correct |
### Question 4
Q: What do people usually use to dry their hands after washing?
Choices: towel, hammer, blanket, pillow, shoe
Gold: A
| Model | Output |
|---|---|
| T5-Small | answer: E. Because a towel is used to dry hands after washing. ❌ Predicted: E |
| T5-Large + LoRA | answer: A. Because a towel is commonly used to dry hands after washing. ✔ Correct |
### Question 5
Q: Where would you typically find books to read?
Choices: library, swimming pool, factory, garage, freeway
Gold: A
| Model | Output |
|---|---|
| T5-Small | answer: D. Because libraries are the most popular places to read books. ❌ Predicted: D |
| T5-Large + LoRA | answer: A. Because libraries offer a wide variety of books for reading, making them the best choice. ✔ Correct |
### Question 6
Q: If someone wants to relax and reduce stress, what might they do?
Choices: meditate, argue, shout, work more, run into danger
Gold: A
| Model | Output |
|---|---|
| T5-Small | answer: D. Because "work more" is the best way to relax and reduce stress. ❌ Predicted: D |
| T5-Large + LoRA | answer: A. Because meditation is the most effective way to relax and reduce stress. ✔ Correct |
### Question 7
Q: What tool is commonly used to tighten screws?
Choices: screwdriver, spoon, pencil, comb, fork
Gold: A
| Model | Output |
|---|---|
| T5-Small | answer: D. Because a fork is commonly used to tighten screws. ❌ Predicted: D |
| T5-Large + LoRA | answer: A. Because a screwdriver is the most common tool for tightening screws. ✔ Correct |
### Question 8
Q: Where would you likely find many wild animals living together?
Choices: forest, kitchen, bathroom, rooftop, office
Gold: A
| Model | Output |
|---|---|
| T5-Small | answer: D. Because "office" is the most popular place for wild animals living together. ❌ Predicted: D |
| T5-Large + LoRA | answer: A. Because forests are ideal habitats for wild animals to live together. ✔ Correct |