A new paper introduces a method for improving latent reasoning in multimodal large language models (MLLMs) by optimizing the visual latents at inference time, targeting a pathology in which the visual latents' contribution to reasoning is suppressed. Separately, another study uses a new benchmark, VisFactor, to reveal significant foundational visual gaps in current MLLMs, including frontier models such as GPT and Gemini. VisFactor adapts human cognitive-psychology assessments and exposes consistent failures on tasks such as spatial-relation inference and figure-ground discrimination, suggesting that current MLLM benchmark performance may not reflect genuine visual cognition.
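The inference-time idea can be sketched generically: treat the visual latent tokens as free parameters and take a few gradient steps on them before decoding. The sketch below is a minimal illustration under stated assumptions, not the paper's actual method: the `model` call signature (a `visual_latents=` keyword returning next-token logits) is hypothetical, and the entropy-minimization objective is a stand-in borrowed from test-time adaptation; the authors' objective may differ.

```python
import torch
import torch.nn.functional as F

def refine_visual_latents(model, visual_latents, text_ids, steps=5, lr=1e-2):
    """Illustrative test-time refinement of visual latent tokens.

    Assumes `model(visual_latents=..., input_ids=...)` returns logits of
    shape (batch, seq_len, vocab). Both the interface and the objective
    are placeholders for whatever the paper actually uses.
    """
    # Detach from the encoder graph and make the latents trainable.
    latents = visual_latents.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([latents], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(visual_latents=latents, input_ids=text_ids)
        probs = F.softmax(logits[:, -1, :], dim=-1)
        # Stand-in objective: reduce predictive entropy so the visual
        # latents push the model toward a committed next-token choice.
        entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()
        entropy.backward()
        optimizer.step()

    return latents.detach()
```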
Summary written by gemini-2.5-flash-lite from 4 sources.
IMPACT Highlights critical visual reasoning deficits in MLLMs, suggesting that current benchmarks may overstate capabilities and pointing to the need for more robust evaluation methods.
RANK_REASON Two arXiv papers present novel research on multimodal large language models, one proposing a new optimization technique and the other introducing a new benchmark for evaluating visual cognition.