# Code Execution as Grounded Supervision for LLM Reasoning

This is the official repository for *Code Execution as Grounded Supervision for LLM Reasoning* (EMNLP 2025).
We propose a scalable method for generating verifiable CoT data that supervises the reasoning process of LLMs by leveraging the determinism of program execution. Unlike existing reasoning-dataset generation methods, which rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural language CoT reasoning, producing highly accurate reasoning traces. Models trained on our generated data demonstrate superior reasoning abilities across various domains while generating fewer tokens during inference, effectively reducing overthinking and meaningless repetition.
## Dataset and Models

You can download our dataset and trained model checkpoints from Hugging Face:
| Base Model | Link |
|---|---|
| Qwen3-4B | 🤗 |
| Qwen3-8B | 🤗 |
## Setup

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

For our experiments, we used the PyEdu dataset released by CodeI/O. If you want to reproduce our data generation, download the PyEdu dataset and place it under `data/`. For a quick start, we provide a small subset of the data under `data/`.
## Data Generation

Run the following Python scripts to generate an execution trace for each coding problem and filter out unsuccessful executions. We use Snoop, a Python debugging tool, to record detailed line-by-line execution signals, which serve as the basis for grounded supervision of the reasoning process.
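Snoop itself produces rich, human-readable traces; as a minimal stdlib-only illustration of the kind of line-by-line signal being recorded (this is a simplified sketch, not the repo's actual Snoop-based implementation), a tracer can be built with `sys.settrace`:

```python
import sys

def record_trace(func, *args):
    """Record (relative line number, local variables) for every line
    executed inside `func`.

    NOTE: a simplified stand-in for Snoop, used here only to illustrate
    the execution signal that grounds the generated CoT.
    """
    events = []

    def tracer(frame, event, arg):
        # Only record 'line' events inside the target function's frame.
        if frame.f_code is func.__code__ and event == "line":
            rel_line = frame.f_lineno - func.__code__.co_firstlineno
            events.append((rel_line, dict(frame.f_locals)))
        return tracer  # keep tracing nested line events

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, events

def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

result, trace = record_trace(sum_of_squares, 3)
# result == 5; `trace` holds one snapshot of the locals per executed line,
# e.g. the evolving values of `i` and `total` across loop iterations.
```

Each recorded step pairs a source line with the variable state after reaching it, which is exactly the kind of deterministic, verifiable signal a natural-language reasoning trace can be checked against.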
```bash
python src/filter_raw_data.py
python src/generate_execution_trace.py
python src/filter_execution_trace.py
```

Then, use a translator model to translate the execution traces into a more natural form of reasoning. The `nl_trace` field in the resulting file will contain the translated execution trace.
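The translation step is prompt-driven. The actual template lives in `src/execution_trace_translation.py`; the hypothetical builder below only illustrates the shape of such a prompt (all wording is an assumption, not the repo's template):

```python
def build_translation_prompt(code: str, execution_trace: str) -> str:
    """Assemble a prompt asking the translator model to turn a raw
    execution trace into natural language reasoning.

    NOTE: illustrative template only; the repo's real prompt is defined
    in src/execution_trace_translation.py.
    """
    return (
        "Translate the following program execution trace into a natural "
        "language chain-of-thought that explains, step by step, how the "
        "program arrives at its output.\n\n"
        f"### Code\n{code}\n\n"
        f"### Execution trace\n{execution_trace}\n\n"
        "### Natural language reasoning\n"
    )
```

The translator model's completion for such a prompt is what ends up in the `nl_trace` field.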
```bash
python src/execution_trace_translation.py --translator_model Qwen/Qwen3-32B --num_gpus 8
```

Run the following script to convert the final data into training data compatible with LLaMA-Factory.
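For context, LLaMA-Factory can train on Alpaca-style JSON records with `instruction`, `input`, and `output` fields. The sketch below shows the kind of record such a conversion might emit; the field mapping is illustrative, and the actual schema is defined in `src/data_construction.py`:

```python
import json

def to_alpaca_records(examples):
    """Convert translated examples into Alpaca-style records.

    NOTE: illustrative field mapping; the repo's real conversion is
    implemented in src/data_construction.py.
    """
    records = []
    for ex in examples:
        records.append({
            "instruction": ex["question"],
            "input": "",
            # Reasoning trace first, then the verified final answer.
            "output": f"{ex['nl_trace']}\n\nFinal answer: {ex['answer']}",
        })
    return records

examples = [{
    "question": "What does sum_of_squares(3) return?",
    "nl_trace": "We loop three times, adding i*i each iteration...",
    "answer": "5",
}]
print(json.dumps(to_alpaca_records(examples), indent=2))
```

Placing the translated trace before the final answer in `output` is what supervises the model's reasoning process rather than just its answers.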
```bash
python src/data_construction.py --trained_model Qwen/Qwen3-4B
```

## Acknowledgements

Our work builds upon and is inspired by the following projects. We sincerely appreciate their contributions to the community:
## Citation

```bibtex
@article{jung2025code,
  title={Code Execution as Grounded Supervision for LLM Reasoning},
  author={Jung, Dongwon and Zhou, Wenxuan and Chen, Muhao},
  journal={arXiv preprint arXiv:2506.10343},
  year={2025}
}
```