SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, Jingbo Zhu

News💡

[2025.08] We release our paper. If you have any questions about our project, please send email to geyuanqaq@gmail.com
[2025.08] Code, test dataset, and model parameters have been publicly released.
[2025.11] SageLM is accepted by AAAI 2026 poster!🎉🎉🎉
[2025.12] We release our training dataset.

Quick Installation ⚙️

conda create -n sagelm python=3.10
conda activate sagelm
cd ./LLaMA-Factory
pip install -e .

Usage 🛠

Data Preparation

In order to use SageLM, you should first create a JSON file for your dataset in ./LLaMA-Factory/data. Each entry should have the following format:

{
	"instruction": "...",  // prompt
	"input": "",  // leave empty
	"output": "...",  // label (used during training, leave empty during inference)
	"audios": [
		"",  // audio response 1
		""   // audio response 2
	]
}

We use the following prompt template for SageLM training and inference:

Below are two responses for a given task. The task is defined by the Instruction. Evaluate in terms of **{eval_dim}** and indicate a better response using 1, 2 or Tie.

### Instruction:
{question}

### Response 1:
<audio>

### Response 2:
<audio>

where {eval_dim} represents the evaluation dimension , {question} represents the user query, and <audio> serves as a placeholder for audio responses.

Next, register your dataset in ./LLaMA-Factory/data/dataset_info.json. For example:

"test_semantic": {
    "file_name": "test_semantic.json",
    "columns": {
        "prompt": "instruction",
        "query": "input",
        "response": "output",
        "audios": "audios"
    }
}

Model Inference

cd ./LLaMA-Factory
bash ./LLaMA-Factory/scripts/my_infer/infer.sh

Notice: SageLM currently supports only English and evaluated audio (audio 1 & audio 2) is truncated to 60 s.

We currently only support batch inference with JSON datasets, but inference can also be performed using the Qwen2.5-Omni official code.

Evaluation

We have released our test dataset at https://huggingface.co/LGB666/SageLM_testset_audio. After downloading, please move and rename the directory to match the audio paths in the corresponding dataset JSON file.

We also released our evaluation scripts to reproduce the main results in our paper.

To evaluate the model's udging performance on semantic dimensions, run:

bash ./LLaMA-Factory/scripts/eval.sh

Note that each response pair in prediction and ground-truth files should be splited into four semantic dimensions, in the order of helpfulness, honesty, instruction_following, truthfulness. The data order should be consistent between the prediction and the ground-truth files.

To evaluate the model's judging performance on acoustic dimensions, run:

bash ./LLaMA-Factory/scripts/eval_stage2.sh

The acoustic evaluation should be performed on one of the following dimensions: emotion instruction following, gender instruction following, character instruction following, gender instruction following and emotion instruction following. The data order should also be consistent between the prediction and the ground-truth.

📜 Training of SageLM Model

We trained our model using LLaMA-Factory.

To train your own model, you need to register your dataset in ./LLaMA-Factory/data/dataset_info.json. Then, specify the dataset and other training parameters in the YAML configuration file. We provide an example configuration file at examples/judge/qwen2.5_omni_7B_compare_1_aspect.yaml. Start training using the following commands:

cd ./LLaMA-Factory
llamafactory-cli train examples/judge/qwen2.5_omni_7B_compare_1_aspect.yaml

We release the training dataset to facilitate the reproduction of our results.

🔊 End-to-End Speech Evaluation vs. Cascade ASR → LLM: A Case Study

Traditional cascade approaches first use Whisper for ASR, then pass the transcript to a text-based LLM for response comparison. However, ASR can introduce cascaded errors, which may lead GPT to incorrectly favor one response over another.

SageLM is trained end-to-end on audio and does not rely on ASR transcripts, making it robust to pronunciation variations, disfluencies, and unclear articulation. Here we show several real cases to demonstrate these effects.

Note that in these cases we focus solely on semantic dimensions (Helpfulness, Honesty, Truthfulness, Instruction Following). Thus, disfluencies or unclear articulation in the audio should not affect the semantic comparison.

Case 1

❓Question:

Come up with healthy and easy dinner ideas for weeknights.

🔊 Response 1 (Qwen2.5-Omni)

🔊 Response 2 (Kimi-Audio)

📊 Comparison Results (1 = Response 1 better, 2 = Response 2 better, T = Tie)

Method	Helpfulness	Honesty	Truthfulness	Instruction Following
GPT (Whisper-large-v3 + GPT-4o)	1	1	1	1
SageLM	2	T	2	T
Human Evaluation	2	T	2	T

Case 2

❓Question:

For a quick and efficient office workout, suggest a short routine.

🔊 Response 1 (Qwen2.5-Omni)

🔊 Response 2 (Kimi-Audio)

📊 Comparison Results

Method	Helpfulness	Honesty	Truthfulness	Instruction Following
GPT	1	1	T	1
SageLM	2	2	2	T
Human Evaluation	2	T	T	T

Case 3

❓Question:

How can I create a budget and stick to it for better financial health?

🔊 Response 1 (Qwen2.5-Omni)

🔊 Response 2 (Kimi-Audio)

📊 Comparison Results

Method	Helpfulness	Honesty	Truthfulness	Instruction Following
GPT	1	T	1	1
SageLM	T	T	T	T
Human Evaluation	T	T	T	T

Citation

If you find our paper useful, please consider citing:

@article{ge2025sagelm,
	title={SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement},
	volume={40},
	url={https://ojs.aaai.org/index.php/AAAI/article/view/40338},
	DOI={10.1609/aaai.v40i36.40338},
	abstractNote={Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.},
	number={36},
	journal={Proceedings of the AAAI Conference on Artificial Intelligence},
	author={Ge, Yuan and Zhang, Junxiang and Liu, Xiaoqian and Li, Bei and Ma, Xiangnan and Wang, Chenglong and Ye, Kaiyang and Du, Yangfan and Zhang, Linfeng and Huang, Yuxin and Xiao, Tong and Yu, Zhengtao and Zhu, Jingbo},
	year={2026},
	month={Mar.},
	pages={30807-30815}
}

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
Figs		Figs
LLaMA-Factory		LLaMA-Factory
demo_audio		demo_audio
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, Jingbo Zhu

News💡

Quick Installation ⚙️

Usage 🛠

Data Preparation

Model Inference

Evaluation

📜 Training of SageLM Model

🔊 End-to-End Speech Evaluation vs. Cascade ASR → LLM: A Case Study

Case 1

Case 2

Case 3

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, Jingbo Zhu

News💡

Quick Installation ⚙️

Usage 🛠

Data Preparation

Model Inference

Evaluation

📜 Training of SageLM Model

🔊 End-to-End Speech Evaluation vs. Cascade ASR → LLM: A Case Study

Case 1

Case 2

Case 3

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages