AI meets Mathematics Education: Supporting Instructors in Large Mathematics Classes with Context-Aware AI
Large-enrollment university courses face persistent challenges in delivering timely and scalable instructional support. While generative AI holds promise, its educational use requires trust, reliability, and pedagogical alignment. We present a human-centered framework for AI-assisted course support, co-designed and evaluated in partnership with course instructors to foreground instructional oversight. For a large introductory mathematics course, we fine-tune a lightweight, cost-efficient language model on 2,588 historical student–instructor interactions and evaluate it against a new benchmark of 150 representative questions, annotated by five instructors. The model achieves 75.3% accuracy and excels on the medium-difficulty queries that dominate student requests; in 36% of cases, its answers are rated equal to or better than instructors’. Yet, instructors identified consistent cases where oversight is essential, highlighting the risks of unsupervised deployment. Our findings demonstrate how hybrid human–AI systems, when integrated into structured course workflows, can enhance instructional capacity while setting realistic expectations for AI in education.
The existing course platform hosts the course data, exercises, quizzes, and a forum where students can ask questions (see here). The goal is to build a large language model that can answer students' questions in real time. Models and datasets can be found here.
Figure: Overview of our human-guided research process: (1) collecting and annotating student–instructor Q&A pairs, (2) instructor-led model evaluation and fine-tuning, and (3) expert evaluation with five course instructors through surveys and interviews.
Note: This repository provides the code and tooling to train, evaluate, and deploy models for the CaLlm (Calculus Large Language Model) project. It does not include the raw training data, dataset-generation or translation assets, the evaluation interfaces, the evaluation results (for privacy reasons), or the notebooks used for data analysis. For access to those materials, please contact the project maintainers.
The main objective of this project is to fine-tune a large language model with strong mathematical reasoning capabilities that can answer students' questions. Owing to hardware limitations, the model should have fewer than 14B parameters and be hosted on the EPFL local cluster.
| Model | Model Family | Context Window |
|---|---|---|
| Mathstral-7B-v0.1 | mistral | 32k |
| Gemma-2-9b-it | gemma | 32k |
| Llama-3.1-8B-Inst | llama | 128k |
| Qwen-2.5-Math-7B | qwen | 4k |
| OpenR1-Qwen-7B | qwen | 32k |
| Gemma-3-12B-Inst | gemma | 128k |
| DeepSeek-R1-Distill-14B | qwen | 128k |
Figure: Evaluation of base and fine-tuned models on 40 test questions by a single expert. Results are broken down by question difficulty: 10 easy, 20 medium, and 10 hard questions. Base models: Llama 3.1, Mathstral, Qwen 2.5, Gemma 3, and DeepSeek R1 Distill. Models suffixed "-QA" are fine-tuned on our dataset; those suffixed "OpenMath220k" are fine-tuned on the corresponding math dataset. Scores shown are averages. 95% confidence intervals are computed via non-parametric bootstrap with 10,000 resamples.
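The confidence intervals in the figure use a standard percentile bootstrap. The sketch below shows the idea on hypothetical per-question scores (not our actual evaluation data); the function name `bootstrap_ci` and the example scores are illustrative, not part of the project code:

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Non-parametric percentile bootstrap CI for the mean score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and record each resample's mean.
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    # Percentile interval: the alpha/2 and 1-alpha/2 quantiles.
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Hypothetical scores (1 = answer rated correct, 0 = incorrect).
scores = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
lo, hi = bootstrap_ci(scores)
print(f"mean={np.mean(scores):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

With only 10 questions per difficulty bucket, these intervals are wide, which is why the figure reports them alongside the average scores.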
Please refer to the Data Readme for more information about the data and datasets used in this project.
Daily model statistics and usage can be found here.
This release provides the following independent tools; install and use whichever fit your needs:
- Finetuning
- Automated evaluation
- Local inference for fast prototyping and testing
- Inference server for broader usage
This project was developed and tested on a Windows system, and models were trained and evaluated on EPFL's Kuma cluster.
In any case, first clone the repository and install the necessary dependencies (for local usage):

```shell
git clone git@github.com:botafogo-EPFL/CaLlm_pub.git
cd CaLlm_pub
conda create -n callm python=3.11.9
conda activate callm
pip install -r requirements.txt
```
Once this is done, see the corresponding Readme file for each tool for details on how to use it:
- Finetuning
- Automated evaluation
- Local inference for fast prototyping and testing
- Inference server for broader usage
If you want to install the project on EPFL's clusters (Kuma or Izar recommended), follow the steps provided here. This will install the project in a virtual environment and set up the necessary libraries and dependencies.
Involved persons:
- Jeremy Barghorn, Data Science Master's student at EPFL, project code developer and maintainer, jeremy.barghorn@epfl.ch
- Prof. Sacha Friedli, EPFL, project initiator and supervisor, in charge of human annotation and the course platform, sacha.friedli@epfl.ch
- Anna Sotnikova, Postdoctoral Researcher, Natural Language Processing Lab, EPFL, in charge of the technical and scientific supervision of the project, anna.sotnikova@epfl.ch

