A new paper evaluates five leading vision-language models (VLMs) on their trustworthiness for medical visual question answering (VQA). The study finds significant limitations in anatomical localization and a tendency toward laterality confusion, with the best model achieving only 0.23 mean IoU. Integrating localization into the VQA pipeline further degraded accuracy, pointing to visual grounding as a key bottleneck. Domain adaptation shows promise for improving VQA accuracy, but the underlying perception and trustworthiness issues remain.
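For context on the 0.23 figure: mean IoU (Intersection over Union) measures the overlap between a model's predicted bounding box and the ground-truth region, averaged over examples. A minimal sketch of the standard metric, assuming boxes are given as corner coordinates `(x1, y1, x2, y2)` (the paper's exact box format is not specified here):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

An IoU of 1.0 means perfect overlap and 0.0 means none, so a mean of 0.23 indicates predicted boxes that mostly miss the target anatomy.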
Summary written by gemini-2.5-flash-lite from 3 sources.
IMPACT Identifies critical perception and grounding failures in frontier VLMs for medical applications, suggesting domain adaptation is needed to improve trustworthiness.
RANK_REASON Academic paper evaluating frontier models on a specific task.