Hi, when I evaluate open-llava-next-vicuna-7b on MME-Realworld with model_vqa_mme_realworld.py, the model outputs a full explanatory sentence rather than a single choice. For example: "The correct answer to the question is (A) There are three cars and two pedestrians. This is evident from the image, which shows three cars parked on the side of the road and two pedestrians walking on the sidewalk. The other options provided do not accurately reflect the contents of the image. There are no traffic cones, barriers, trucks, trailers, or construction vehicles visible in the image. Therefore, the best answer based on the image is (A)."
Any idea why this happens?