Is there any evaluation proving that it won't damage the quality of the output?

I think the researchers should run an evaluation on some benchmarks, such as HumanEval, LoCoMo, and SWE-bench, just to prove it 😭🙋.

Thank you for your work.