# Code Execution as Grounded Supervision for LLM Reasoning

This is the official repository for *Code Execution as Grounded Supervision for LLM Reasoning* (EMNLP 2025).
We propose a scalable method for generating verifiable CoT data that supervises the reasoning process of LLMs by leveraging the determinism of program execution. Unlike existing reasoning-dataset generation methods, which rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural language CoT reasoning, producing highly accurate reasoning traces. Models trained on our generated data demonstrate superior reasoning abilities across various domains while generating fewer tokens during inference, effectively reducing overthinking and meaningless repetition.
## Dataset and Models

You can download our dataset and trained model checkpoints from Hugging Face:
| Base Model | Link |
|---|---|
| Qwen3-4B | 🤗 |
| Qwen3-8B | 🤗 |
## Setup

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

For our experiments, we used the PyEdu dataset released by CodeI/O. If you want to reproduce our data generation, download the PyEdu dataset and place it under `data/`. For a quick start, we provide a small subset of the data under `data/`.
## Data Generation

Run the following Python scripts to generate an execution trace for each coding problem and filter out unsuccessful executions. We use Snoop, a Python debugging tool, to record detailed line-by-line execution signals, which serve as the basis for grounded supervision of the reasoning process.
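Snoop itself produces rich, human-readable traces; as a minimal stdlib-only illustration of the kind of line-by-line signal being recorded (this is a simplified sketch, not the repo's actual Snoop-based implementation), a tracer can be built with `sys.settrace`:

```python
import sys

def record_trace(func, *args):
    """Record (relative line number, local variables) for every line
    executed inside `func`.

    NOTE: a simplified stand-in for Snoop, used here only to illustrate
    the execution signal that grounds the generated CoT.
    """
    events = []

    def tracer(frame, event, arg):
        # Only record 'line' events inside the target function's frame.
        if frame.f_code is func.__code__ and event == "line":
            rel_line = frame.f_lineno - func.__code__.co_firstlineno
            events.append((rel_line, dict(frame.f_locals)))
        return tracer  # keep tracing nested line events

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, events

def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

result, trace = record_trace(sum_of_squares, 3)
# result == 5; `trace` holds one snapshot of the locals per executed line,
# e.g. the evolving values of `i` and `total` across loop iterations.
```

Each recorded step pairs a source line with the variable state after reaching it, which is exactly the kind of deterministic, verifiable signal a natural-language reasoning trace can be checked against.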
```bash
python src/filter_raw_data.py
python src/generate_execution_trace.py
python src/filter_execution_trace.py
```

Then, use a translator model to translate the execution traces into a more natural form of reasoning. The `nl_trace` field in the resulting file will contain the translated execution trace.
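The translation step is prompt-driven. The actual template lives in `src/execution_trace_translation.py`; the hypothetical builder below only illustrates the shape of such a prompt (all wording is an assumption, not the repo's template):

```python
def build_translation_prompt(code: str, execution_trace: str) -> str:
    """Assemble a prompt asking the translator model to turn a raw
    execution trace into natural language reasoning.

    NOTE: illustrative template only; the repo's real prompt is defined
    in src/execution_trace_translation.py.
    """
    return (
        "Translate the following program execution trace into a natural "
        "language chain-of-thought that explains, step by step, how the "
        "program arrives at its output.\n\n"
        f"### Code\n{code}\n\n"
        f"### Execution trace\n{execution_trace}\n\n"
        "### Natural language reasoning\n"
    )
```

The translator model's completion for such a prompt is what ends up in the `nl_trace` field.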
```bash
python src/execution_trace_translation.py --translator_model Qwen/Qwen3-32B --num_gpus 8
```

Run the following script to convert the final data into training data compatible with LLaMA-Factory.
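For context, LLaMA-Factory can train on Alpaca-style JSON records with `instruction`, `input`, and `output` fields. The sketch below shows the kind of record such a conversion might emit; the field mapping is illustrative, and the actual schema is defined in `src/data_construction.py`:

```python
import json

def to_alpaca_records(examples):
    """Convert translated examples into Alpaca-style records.

    NOTE: illustrative field mapping; the repo's real conversion is
    implemented in src/data_construction.py.
    """
    records = []
    for ex in examples:
        records.append({
            "instruction": ex["question"],
            "input": "",
            # Reasoning trace first, then the verified final answer.
            "output": f"{ex['nl_trace']}\n\nFinal answer: {ex['answer']}",
        })
    return records

examples = [{
    "question": "What does sum_of_squares(3) return?",
    "nl_trace": "We loop three times, adding i*i each iteration...",
    "answer": "5",
}]
print(json.dumps(to_alpaca_records(examples), indent=2))
```

Placing the translated trace before the final answer in `output` is what supervises the model's reasoning process rather than just its answers.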
```bash
python src/data_construction.py --trained_model Qwen/Qwen3-4B
```

## Acknowledgements

Our work builds upon and is inspired by the following projects. We sincerely appreciate their contributions to the community:
## Citation

```bibtex
@article{jung2025code,
  title={Code Execution as Grounded Supervision for LLM Reasoning},
  author={Jung, Dongwon and Zhou, Wenxuan and Chen, Muhao},
  journal={arXiv preprint arXiv:2506.10343},
  year={2025}
}
```