It would be interesting to compare MS-CLIP with @xiong-zhitong's DOFA-CLIP: https://arxiv.org/abs/2503.06312