AI meets Mathematics Education: Supporting Instructors in Large Mathematics Classes with Context-Aware AI
Large-enrollment university courses face persistent challenges in delivering timely and scalable instructional support. While generative AI holds promise, its educational use requires trust, reliability, and pedagogical alignment. We present a human-centered framework for AI-assisted course support, co-designed and evaluated in partnership with course instructors to foreground instructional oversight. For a large introductory mathematics course, we fine-tune a lightweight, cost-efficient language model on 2,588 historical student–instructor interactions and evaluate it against a new benchmark of 150 representative questions, annotated by five instructors. The model achieves 75.3% accuracy and excels on the medium-difficulty queries that dominate student requests; in 36% of cases, its answers are rated equal to or better than instructors’. Yet, instructors identified consistent cases where oversight is essential, highlighting the risks of unsupervised deployment. Our findings demonstrate how hybrid human–AI systems, when integrated into structured course workflows, can enhance instructional capacity while setting realistic expectations for AI in education.
The existing course platform hosts the course data, exercises, quizzes, and a forum where students can ask questions (see here). The goal is to build a large language model that can answer students' questions in real time. Models and datasets can be found here.
Figure: Overview of our human-guided research process: (1) collecting and annotating student–instructor Q&A pairs, (2) instructor-led model evaluation and fine-tuning, and (3) expert evaluation with five course instructors through surveys and interviews.
Note: This repository provides the code and tooling to train, evaluate, and deploy models for the CaLlm (Calculus Large Language Model) project. It does not include the raw training data, dataset-generation or translation assets, the evaluation interfaces, the evaluation results (for privacy reasons), or the notebooks used for data analysis. For access to those materials, please contact the project maintainers.
The main objective of this project is to fine-tune a large language model with strong mathematical reasoning capabilities that can answer students' questions. Owing to hardware limitations, the model should have fewer than 14B parameters and be hosted on the EPFL local cluster.
| Model | Model Family | Context Window |
|---|---|---|
| Mathstral-7B-v0.1 | mistral | 32k |
| Gemma-2-9b-it | gemma | 32k |
| Llama-3.1-8B-Inst | llama | 128k |
| Qwen-2.5-Math-7B | qwen | 4k |
| OpenR1-Qwen-7B | qwen | 32k |
| Gemma-3-12B-Inst | gemma | 128k |
| DeepSeek-R1-Distill-14B | qwen | 128k |
Figure: Evaluation of base and fine-tuned models on 40 test questions by a single expert. Results are broken down by question difficulty: 10 easy, 20 medium, and 10 hard questions. Base models: Llama 3.1, Mathstral, Qwen 2.5, Gemma 3, and DeepSeek R1 Distill. Models suffixed "-QA" are fine-tuned on our dataset; those suffixed "OpenMath220k" are fine-tuned on the corresponding math dataset. Scores shown are averages. 95% confidence intervals are computed via non-parametric bootstrap with 10,000 resamples.
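The confidence intervals in the figure use a standard percentile bootstrap. The sketch below shows the idea on hypothetical per-question scores (not our actual evaluation data); the function name `bootstrap_ci` and the example scores are illustrative, not part of the project code:

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Non-parametric percentile bootstrap CI for the mean score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and record each resample's mean.
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    # Percentile interval: the alpha/2 and 1-alpha/2 quantiles.
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Hypothetical scores (1 = answer rated correct, 0 = incorrect).
scores = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
lo, hi = bootstrap_ci(scores)
print(f"mean={np.mean(scores):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

With only 10 questions per difficulty bucket, these intervals are wide, which is why the figure reports them alongside the average scores.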
Please refer to the Data Readme for more information about the data and datasets used in this project.
Daily model statistics and usage can be found here.
This release provides the following independent tools; install and use whichever fit your needs:
- Finetuning
- Automated evaluation
- Local inference for fast prototyping and testing
- Inference server for broader usage
This project was developed and tested on a Windows system, and models were trained and evaluated on EPFL's Kuma cluster.
In any case, first clone the repository and install the necessary dependencies (for local usage):

```shell
git clone git@github.com:botafogo-EPFL/CaLlm_pub.git
cd CaLlm_pub
conda create -n callm python=3.11.9
conda activate callm
pip install -r requirements.txt
```
Once this is done, see the corresponding Readme file for each tool for details on how to use it:
- Finetuning
- Automated evaluation
- Local inference for fast prototyping and testing
- Inference server for broader usage
If you want to install the project on EPFL's clusters (Kuma or Izar recommended), follow the steps provided here. This will install the project in a virtual environment and set up the necessary libraries and dependencies.
Involved persons:
- Jeremy Barghorn, Data Science Master's student at EPFL, project code developer and maintainer, jeremy.barghorn@epfl.ch
- Prof. Sacha Friedli, EPFL, project initiator and supervisor, in charge of human annotation and the course platform, sacha.friedli@epfl.ch
- Anna Sotnikova, Postdoctoral Researcher, Natural Language Processing Lab, EPFL, in charge of the technical and scientific supervision of the project, anna.sotnikova@epfl.ch

