Hi,
Thanks for the work. I have a question about the performance of MCTS compared to Best-of-N on the MATH500 dataset with the Qwen2.5-Math-7B-Instruct model. In my experiments, MCTS did not achieve a higher majority_vote score than Best-of-N (0.872 vs. 0.876). My configuration and the comparative results are below. Given MCTS's more complex search structure, I expected it to outperform Best-of-N, which is a much more direct approach. Do you have any suggestions for improving the MCTS results?
Thanks.
Table 1. Comparative results of different reasoning techniques.

| method | majority_vote |
| --- | --- |
| CoT | 0.836 |
| best-of-N | 0.876 |
| MCTS | 0.872 |

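To put the gap in Table 1 in perspective, the scores can be converted to problem counts. This is a minimal sketch, assuming each score is the fraction of the 500 MATH500 problems answered correctly and treating every problem as an independent trial (a simplification). The helper names are illustrative only.

```python
import math

N_PROBLEMS = 500  # MATH500 contains 500 problems

# majority_vote scores copied from Table 1
scores = {"CoT": 0.836, "best-of-N": 0.876, "MCTS": 0.872}


def correct_count(acc, n=N_PROBLEMS):
    """Number of problems solved, given an accuracy over n problems."""
    return round(acc * n)


def binomial_std_error(acc, n=N_PROBLEMS):
    """Standard error of an accuracy estimate under a binomial model (assumption)."""
    return math.sqrt(acc * (1.0 - acc) / n)


for method, acc in scores.items():
    print(f"{method}: {correct_count(acc)}/{N_PROBLEMS} correct, "
          f"std. error ~ {binomial_std_error(acc):.3f}")

# Gap between best-of-N and MCTS, in problems
gap = correct_count(scores["best-of-N"]) - correct_count(scores["MCTS"])
print(f"best-of-N minus MCTS: {gap} problems")
```

Under these assumptions the gap between best-of-N and MCTS is 2 problems, which is well below one standard error (roughly 7 problems), so a single run may not be enough to separate the two methods.
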

Table 2. The parameter settings used in the experiments.

| parameter | value |
| --- | --- |
| temperature | 0.7 |
| num_sequence | 8 |
| max_new_tokens | 2048 |
| num_worker | 32 |

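For context on the majority_vote metric, here is a minimal sketch of how such a score is commonly computed. It assumes that num_sequence = 8 means eight sampled solutions per problem and that final answers have already been extracted as normalized strings. The function names are hypothetical and are not the repository's API.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def majority_vote_accuracy(sampled_answers, references):
    """Fraction of problems whose majority answer matches the reference.

    sampled_answers: list of lists, one inner list of answers per problem
    references: list of reference answers, one per problem
    """
    if len(sampled_answers) != len(references):
        raise ValueError("one reference is required per problem")
    hits = sum(
        majority_vote(answers) == ref
        for answers, ref in zip(sampled_answers, references)
    )
    return hits / len(references)


# Toy example with num_sequence = 8 samples for each of two problems
samples = [
    ["42", "41", "42", "42", "7", "42", "41", "42"],
    ["3", "5", "5", "3", "5", "9", "5", "5"],
]
refs = ["42", "3"]
accuracy = majority_vote_accuracy(samples, refs)
```

In the toy example the first problem is counted as correct and the second is not, because the most frequent answer "5" differs from the reference "3".
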
System Info
Operating System = Linux
Python version = 3.10
Hardware = A40