ReAlign is a comprehensive research toolkit designed to advance mathematical reasoning in Large Language Models (LLMs). It addresses two critical challenges:
- Evaluation: Moving beyond binary "correct/incorrect" grading to evaluate the quality and alignment of the reasoning process.
- Training: Democratizing the fine-tuning of reasoning models on consumer hardware (specifically Apple Silicon) using efficient LoRA and QLoRA techniques.
This project is organized into three main components:
Benchmarking: The core of the ReAlign framework. It implements an evaluation metric that combines dynamic programming (Needleman-Wunsch) with semantic embeddings to align student reasoning steps with ground-truth solutions.
- Key Features:
- StepAligner: Algorithms to parse and align reasoning chains.
- Quadrant Analysis: Visualizes "Robust Reasoning" vs. "Hallucinations" vs. "Lucky Guesses".
- Dual-Mode Grading: Combines symbolic verification with LLM-based judging.
- Publication-Ready Figures: Automatically generates heatmaps and statistical reports.
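The alignment idea described above can be illustrated with a minimal Needleman-Wunsch sketch. This is not the repository's StepAligner: the function name `align_steps`, the gap penalty of -0.5, and the toy similarity matrix are assumptions made for illustration. In practice the similarity values would presumably come from comparing sentence embeddings of each step.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def align_steps(sim: List[List[float]], gap_penalty: float = -0.5) -> Tuple[float, List[Pair]]:
    """Globally align two step sequences (hypothetical helper, not the repo's API).

    sim[i][j] is the similarity between student step i and reference step j.
    Returns the optimal alignment score and index pairs; None marks a gap.
    """
    n = len(sim)
    m = len(sim[0]) if n else 0
    # score[i][j]: best score aligning the first i student steps
    # with the first j reference steps.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap_penalty
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # match step i with step j
                score[i - 1][j] + gap_penalty,            # student step left unmatched
                score[i][j - 1] + gap_penalty,            # reference step left unmatched
            )
    # Trace back from the bottom-right cell to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap_penalty):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Toy example: 2 student steps vs. 3 reference steps (assumed similarities).
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.8],
]
best_score, alignment = align_steps(toy_sim)
print(best_score, alignment)
```

In the toy example the student skips the second reference step, so the alignment pairs the first and last reference steps with the two student steps and inserts one gap.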
LORA: A standard Low-Rank Adaptation (LoRA) training pipeline optimized for Apple Silicon using the MLX framework.
- Features:
- Fine-tunes DeepSeek-Math-7B on OpenR1-Math-220k.
- Automatic model conversion from PyTorch to MLX.
- Gradient accumulation for large effective batch sizes.
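To illustrate why gradient accumulation yields a larger effective batch size, here is a small pure-Python sketch on a one-parameter model. It is not taken from lora.py; the function names and toy data are assumptions. With equally sized micro-batches, averaging the micro-batch gradients reproduces the full-batch gradient, so one optimizer step after N micro-batches behaves like a step on a batch N times larger.

```python
from typing import List, Tuple

Sample = Tuple[float, float]


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the model y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: List[Sample], micro_batch_size: int) -> float:
    """Average gradients over micro-batches before a single optimizer step."""
    micro_batches = [
        data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient by 1 / (number of accumulation steps).
        total += grad_mse(w, micro_batch) / len(micro_batches)
    return total


# Toy data (assumed): 4 samples processed as 2 micro-batches of size 2.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
full_batch = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
print(full_batch, accumulated)
```

Here the effective batch size is micro-batch size times accumulation steps (2 x 2 = 4), while only one micro-batch needs to be held in memory at a time.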
QLORA: [NEW] A Quantized LoRA (QLoRA) pipeline that enables training 7B models on devices with 8-16 GB of RAM.
- Features:
- 4-bit Quantization: Loads the base model in 4-bit NormalFloat precision.
- Memory Efficiency: Reduces memory footprint by ~60% compared to standard LoRA.
- Full Pipeline: Includes training (qlora.py) and inference (inference.py) scripts.
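As a rough, weights-only illustration of why 4-bit loading helps, the arithmetic below compares 16-bit and 4-bit storage for a 7B-parameter model. The helper name `weight_memory_gb` is hypothetical, and the estimate deliberately ignores quantization scales, adapter weights, activations, and optimizer state, so it does not by itself reproduce the overall footprint reduction quoted above.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_7B = 7e9  # nominal parameter count for a "7B" model (assumption)

fp16_gb = weight_memory_gb(PARAMS_7B, 16)  # 16-bit weights
q4_gb = weight_memory_gb(PARAMS_7B, 4)     # 4-bit weights, excluding scale overhead

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {q4_gb:.1f} GB")
```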
Prerequisites:
- Hardware: Apple Silicon Mac (M1/M2/M3) recommended.
- Software: Python 3.10+, pip.
Clone the repository and install dependencies:
git clone https://github.com/yourusername/ReAlign.git
cd ReAlign
# Install core dependencies
pip install mlx mlx-lm datasets sentence-transformers pandas matplotlib seaborn scipy tqdm
To evaluate a model (e.g., DeepSeek-Math-7B) using the ReAlign framework:
cd Benchmarking
python3 ReAlign-Benchmark.py --model "deepseek-ai/deepseek-math-7b-instruct" --limit 100
Output: Results are saved to realign_benchmark_results.jsonl, and figures (heatmaps, quadrant analysis) are generated in figures-and-statistics/.
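To train with standard LoRA (from the repository root):
cd LORA
python3 lora.py
To train with QLoRA (from the repository root):
cd QLORA
python3 qlora.py
After training, you can run inference with your new adapters: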
python3 inference.py --prompt "Solve: integral of x^2" --adapter-path qlora_adapters/adapters.safetensors
A detailed academic report describing the methodology, theoretical framework, and experimental results is available in Project_Report.tex.
To compile the report:
pdflatex Project_Report.tex
Unlike standard benchmarks that only check the final answer, ReAlign calculates an Alignment Score that measures how closely the model's reasoning steps match the ground-truth solution.
This allows us to distinguish between:
- Robust Reasoning: High Alignment + Correct Answer.
- Hallucination: High Alignment + Incorrect Answer (Calculation error).
- Lucky Guess: Low Alignment + Correct Answer (Heuristic/Shortcut).
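The three categories above can be expressed as a small classification sketch. The function name `classify_quadrant` and the 0.5 threshold separating "high" from "low" alignment are assumptions for illustration. The fourth combination (low alignment with an incorrect answer) is not named in this document, so the sketch returns a plain descriptive label for it.

```python
def classify_quadrant(alignment_score: float, is_correct: bool, threshold: float = 0.5) -> str:
    """Map an alignment score and answer correctness to a quadrant label.

    The threshold is an assumed cutoff, not a value taken from the ReAlign code.
    """
    high_alignment = alignment_score >= threshold
    if high_alignment and is_correct:
        return "Robust Reasoning"
    if high_alignment and not is_correct:
        return "Hallucination"
    if not high_alignment and is_correct:
        return "Lucky Guess"
    # Not named in the document; descriptive placeholder label only.
    return "Low Alignment + Incorrect Answer"


# Example usage with made-up scores.
examples = [(0.92, True), (0.85, False), (0.20, True), (0.10, False)]
labels = [classify_quadrant(score, correct) for score, correct in examples]
print(labels)
```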
If you use this codebase or the ReAlign framework, please cite:
@article{realign2025,
title={ReAlign: Process-Aware Benchmarking for Mathematical Reasoning},
author={Srivastav, Shaurya},
journal={ECS 289 Project Report},
year={2025}
}