ReAlign: Process-Aware Benchmarking & Efficient Fine-Tuning for Math LLMs

ReAlign is a comprehensive research toolkit designed to advance mathematical reasoning in Large Language Models (LLMs). It addresses two critical challenges:

- Evaluation: Moving beyond binary "correct/incorrect" grading to evaluate the quality and alignment of the reasoning process.
- Training: Democratizing the fine-tuning of reasoning models on consumer hardware (specifically Apple Silicon) using efficient LoRA and QLoRA techniques.

📂 Repository Structure

This project is organized into three main components:

1. Benchmarking Suite (`Benchmarking/`)

The core of the ReAlign framework. It implements a novel evaluation metric using Dynamic Programming (Needleman-Wunsch) and Semantic Embeddings to align student reasoning steps with ground-truth solutions.

Key Features:
- StepAligner: Algorithms to parse and align reasoning chains.
- Quadrant Analysis: Visualizes "Robust Reasoning" vs. "Hallucinations" vs. "Lucky Guesses".
- Dual-Mode Grading: Combines symbolic verification with LLM-based judging.
- Publication-Ready Figures: Automatically generates heatmaps and statistical reports.
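To make the "Semantic Embeddings" part concrete, here is a minimal sketch of the step-to-step similarity matrix that such an aligner would consume. It is not the repository's `StepAligner`; the function names `cosine` and `similarity_matrix` are hypothetical, and the toy 3-dimensional vectors stand in for real sentence embeddings (which the project presumably obtains through the `sentence-transformers` dependency).

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(
    student: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
) -> List[List[float]]:
    """sim[i][j] = cosine similarity of student step i and reference step j."""
    return [[cosine(s, r) for r in reference] for s in student]


# Toy embeddings (hypothetical): two student steps, three reference steps.
student_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
reference_vecs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 2.0, 0.0]]

sim = similarity_matrix(student_vecs, reference_vecs)
for row in sim:
    print(row)
```

Each row corresponds to one student step and each column to one reference step; a dynamic-programming aligner then chooses which pairs to match and where to leave gaps.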

2. LoRA Training (`LORA/`)

A standard Low-Rank Adaptation (LoRA) training pipeline optimized for Apple Silicon using the MLX framework.

Features:
- Fine-tunes DeepSeek-Math-7B on OpenR1-Math-220k.
- Automatic model conversion from PyTorch to MLX.
- Gradient accumulation for large effective batch sizes.
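The gradient-accumulation idea can be illustrated without any ML framework. The sketch below is not taken from `lora.py`; it uses a hypothetical one-parameter model `y ≈ w * x` and assumes equal-sized micro-batches, in which case averaging the micro-batch gradients reproduces the gradient of the full batch.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def mse_grad(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of y ≈ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: List[Sample], micro_batch_size: int) -> float:
    """Accumulate gradients over micro-batches before a single update.

    Assumes len(data) is a multiple of micro_batch_size, so every
    micro-batch carries the same weight in the average.
    """
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for mb in micro_batches:
        # Scale each micro-batch gradient by 1 / number of accumulation steps.
        total += mse_grad(w, mb) / len(micro_batches)
    return total


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

full = mse_grad(w, data)                               # one batch of 4
accum = accumulated_grad(w, data, micro_batch_size=2)  # 2 steps of 2

print(full, accum)
```

Here the effective batch size is `micro_batch_size * accumulation_steps = 4`, while only two samples need to be held in memory at a time.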

3. QLoRA Training (`QLORA/`)

[NEW] A Quantized LoRA (QLoRA) pipeline that enables training 7B models on devices with 8GB-16GB of RAM.

Features:
- 4-bit Quantization: Loads the base model in 4-bit NormalFloat precision.
- Memory Efficiency: Reduces memory footprint by ~60% compared to standard LoRA.
- Full Pipeline: Includes training (`qlora.py`) and inference (`inference.py`) scripts.
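A back-of-the-envelope calculation shows why 4-bit loading matters on small machines. This sketch counts base-model weights only; it ignores quantization scales, activations, adapter weights, and optimizer state, so it is not expected to reproduce the ~60% overall figure quoted above. The helper name `weight_memory_gb` and the round 7e9 parameter count are assumptions for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # assumed round figure for a "7B" model

fp16_gb = weight_memory_gb(params, 16)  # 16-bit weights
q4_gb = weight_memory_gb(params, 4)     # 4-bit weights
weight_saving = 1 - q4_gb / fp16_gb     # fraction saved on weights alone

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {q4_gb:.1f} GB")
print(f"weight-only saving: {weight_saving:.0%}")
```

Under these assumptions the 16-bit weights alone need about 14 GB, which already exceeds an 8 GB machine, while the 4-bit weights need about 3.5 GB.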

🚀 Getting Started

Prerequisites

- Hardware: Apple Silicon Mac (M1/M2/M3) recommended.
- Software: Python 3.10+, pip.

Installation

Clone the repository and install dependencies:

git clone https://github.com/yourusername/ReAlign.git
cd ReAlign

# Install core dependencies
pip install mlx mlx-lm datasets sentence-transformers pandas matplotlib seaborn scipy tqdm

📊 Running the Benchmark

To evaluate a model (e.g., DeepSeek-Math-7B) using the ReAlign framework:

cd Benchmarking
python3 ReAlign-Benchmark.py --model "deepseek-ai/deepseek-math-7b-instruct" --limit 100

Output: Results are saved to `realign_benchmark_results.jsonl`, and figures (heatmaps, quadrant analysis) are generated in `figures-and-statistics/`.

🧠 Training a Model

Option A: Standard LoRA (Best for M2/M3 Max/Ultra with >32GB RAM)

cd LORA
python3 lora.py

Option B: QLoRA (Best for M1/M2/M3 Air/Pro with 8GB-16GB RAM)

cd QLORA
python3 qlora.py

After training, you can run inference with your new adapters:

python3 inference.py --prompt "Solve: integral of x^2" --adapter-path qlora_adapters/adapters.safetensors

📄 Project Report

A detailed academic report describing the methodology, theoretical framework, and experimental results is available in `Project_Report.tex`.

To compile the report:

pdflatex Project_Report.tex

🧩 Methodology Highlights

The Alignment Score

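Unlike standard benchmarks that only check the final answer, ReAlign calculates an Alignment Score ($S_{align} \in [0, 1]$) by aligning the model's reasoning steps with the ground-truth solution using the Needleman-Wunsch recurrence, where $\text{sim}(s_i, b_j)$ is the semantic similarity between steps $s_i$ and $b_j$ and $\gamma$ is the gap penalty: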

$$ D[i][j] = \max \begin{cases} D[i-1][j-1] + \text{sim}(s_i, b_j) & \text{(Match)} \\ D[i-1][j] + \gamma & \text{(Deletion)} \\ D[i][j-1] + \gamma & \text{(Insertion)} \end{cases} $$
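The recurrence above can be sketched in a few lines of Python. This is an illustration rather than the repository's implementation: the function name `alignment_score`, the default gap penalty of -0.5, the gap-penalized boundary initialization, and the final normalization (dividing by the longer sequence length and clamping to [0, 1]) are all assumptions, since the text does not say how $D$ is mapped to $S_{align}$.

```python
from typing import List


def alignment_score(sim: List[List[float]], gap: float = -0.5) -> float:
    """Needleman-Wunsch global alignment over a step-similarity matrix.

    sim[i][j] is the similarity between step i of one chain and step j of
    the other. `gap` plays the role of gamma in the recurrence (assumed <= 0).
    """
    n = len(sim)
    m = len(sim[0]) if n else 0
    if n == 0 or m == 0:
        return 0.0

    # D[i][j] = best score aligning the first i steps with the first j steps.
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + gap  # leading deletions
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + gap  # leading insertions

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + sim[i - 1][j - 1],  # match
                D[i - 1][j] + gap,                    # deletion
                D[i][j - 1] + gap,                    # insertion
            )

    # Assumed normalization: divide by the longer chain, clamp to [0, 1].
    return min(1.0, max(0.0, D[n][m] / max(n, m)))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
skipped_step = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # one reference step unmatched

print(alignment_score(identity))      # perfect one-to-one alignment
print(alignment_score(skipped_step))  # two matches and one gap
```

With these assumptions, a chain that matches the reference step for step scores 1.0, and each skipped or extra step lowers the score through the gap penalty.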

This allows us to distinguish between:

- Robust Reasoning: High Alignment + Correct Answer.
- Hallucination: High Alignment + Incorrect Answer (Calculation error).
- Lucky Guess: Low Alignment + Correct Answer (Heuristic/Shortcut).
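The three categories above amount to a simple lookup on two signals. The sketch below assumes a cutoff of 0.5 between "high" and "low" alignment, which the text does not specify, and the function name `classify` is hypothetical. The remaining combination (low alignment, incorrect answer) is not named above, so it is returned with a placeholder label.

```python
def classify(alignment: float, correct: bool, threshold: float = 0.5) -> str:
    """Map an alignment score and answer correctness to a quadrant label.

    `threshold` is an assumed cutoff between high and low alignment.
    """
    high = alignment >= threshold
    if high and correct:
        return "Robust Reasoning"
    if high and not correct:
        return "Hallucination"
    if not high and correct:
        return "Lucky Guess"
    # This quadrant is not named in the text; placeholder label only.
    return "Unlabeled (low alignment, incorrect)"


examples = [(0.9, True), (0.8, False), (0.2, True), (0.1, False)]
for score, is_correct in examples:
    print(score, is_correct, "->", classify(score, is_correct))
```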

📝 Citation

If you use this codebase or the ReAlign framework, please cite:

@article{realign2025,
  title={ReAlign: Process-Aware Benchmarking for Mathematical Reasoning},
  author={Srivastav, Shaurya},
  journal={ECS 289 Project Report},
  year={2025}
}
