Hello, I am curious about how SAIL was evaluated. Was it evaluated using GPT-4? Was all of the benchmark data used for the evaluation?