SQL-equivalence results analysis

**Problem:** The evaluation score generated by the Judge-LLM is not insightful, because the current approach reduces all evaluated queries to a single average.

**Solution:** Plot the results and report additional metrics, such as the mean, median, standard deviation, z-scores, and p-values, for a deeper and more insightful analysis.
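
As a rough illustration, the metrics listed above could be computed with Python's standard library along the following lines. This is a minimal sketch under stated assumptions, not the project's implementation: the function name `summarize`, the `baseline` parameter, and the assumption that the Judge-LLM yields one numeric score per query are all hypothetical. The note does not say which hypothesis the p-value should test, so the sketch assumes a two-sided one-sample z-test of the mean score against a reference value.

```python
from math import sqrt
from statistics import NormalDist, mean, median, stdev


def summarize(scores, baseline=0.5):
    """Summarize per-query judge scores (hypothetical helper).

    `baseline` is an assumed reference mean for the p-value; the
    original note does not specify what the p-value is tested against.
    """
    n = len(scores)
    mu = mean(scores)
    sd = stdev(scores)  # sample standard deviation; needs at least 2 scores
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")

    # z-score of each query: distance from the mean in standard deviations.
    z_scores = [(s - mu) / sd for s in scores]

    # Two-sided one-sample z-test of the mean against `baseline`.
    # This is a normal approximation; for small samples a t-test
    # would be the more conservative choice.
    z_stat = (mu - baseline) / (sd / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

    return {
        "mean": mu,
        "median": median(scores),
        "stdev": sd,
        "z_scores": z_scores,
        "p_value": p_value,
    }


# Made-up example scores, for illustration only.
example_scores = [0.0, 1.0, 1.0, 1.0, 0.5, 1.0, 0.0, 1.0]
example_summary = summarize(example_scores)
```

A gap between the mean and the median in such a summary suggests a skewed score distribution, which a single average would hide. The per-query z-scores can then be plotted to spot outlier queries.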