Flow: [Image -> Extract Segments (SAM) -> Extract Attributes + Caption] + Question -> Answer
[Attributes considered (for each segment, only those relevant to its segment type are extracted): spatial relationships, pose estimation, depth estimation, motion, action, count, size, shape, color, coordinates, etc.]
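
A minimal Python sketch of the flow above, showing how attribute extraction could be filtered by segment type before the per-segment captions and attributes are combined with the question. Everything named here is an assumption made for illustration: the segment types (`person`, `object`, `background`), the `RELEVANT_BY_TYPE` mapping, the `Segment` class, and the helper functions are hypothetical and not part of SAM or any other library. The SAM segmentation step, the real attribute extractors, and the model that produces the answer are not implemented; extractors are passed in as plain callables.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from segment type to the attributes worth extracting.
# The attribute names come from the list above; the grouping is an assumption.
RELEVANT_BY_TYPE = {
    "person": ["pose_estimation", "action", "motion",
               "spatial_relationships", "coordinates"],
    "object": ["size", "shape", "color", "count", "coordinates"],
    "background": ["depth_estimation", "color"],
}


@dataclass
class Segment:
    """One segment with its caption and extracted attributes."""
    segment_type: str
    caption: str
    attributes: dict = field(default_factory=dict)


def relevant_attributes(segment_type):
    """Return attribute names to extract; unknown types fall back to coordinates."""
    return RELEVANT_BY_TYPE.get(segment_type, ["coordinates"])


def describe_segment(segment_type, caption, extractors):
    """Run only the extractors relevant to this segment type.

    `extractors` maps attribute name -> zero-argument callable (a stand-in
    for a real pose, depth, or color model).
    """
    attrs = {}
    for name in relevant_attributes(segment_type):
        if name in extractors:
            attrs[name] = extractors[name]()
    return Segment(segment_type, caption, attrs)


def build_prompt(segments, question):
    """Combine per-segment captions and attributes with the question."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        attr_text = ", ".join(f"{k}={v}" for k, v in seg.attributes.items())
        lines.append(f"Segment {i} ({seg.segment_type}): {seg.caption} [{attr_text}]")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder extractors returning fixed values, for illustration only.
    extractors = {
        "color": lambda: "red",
        "size": lambda: "small",
        "pose_estimation": lambda: "standing",
    }
    ball = describe_segment("object", "a red ball", extractors)
    print(build_prompt([ball], "What color is the ball?"))
```

The resulting text would then be passed to whatever model answers the question. Keeping the type-to-attribute mapping in a single table means pose estimation is never run on a ball, and depth estimation is never run on a segment type that does not need it.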


