Researchers have introduced a training paradigm called "Starve to Perceive" to address "lazy perception" in Vision-Language Models (VLMs). Lazy perception arises when a VLM can reach adequate accuracy from coarse visual inputs and language priors alone, which leaves it with no real incentive to learn active visual search strategies such as zooming or cropping. "Starve to Perceive" constrains visual bandwidth by limiting each observation to a small token budget, so the model must perceive actively to complete the task. This minimal, plug-in modification to existing training pipelines yielded an average relative improvement of 5% across various benchmarks, without architectural changes or auxiliary losses.
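To make the token-budget idea concrete, here is a minimal sketch of how a per-observation budget could limit what the model sees, and why cropping then pays off. It is an illustration under stated assumptions, not the paper's implementation: the function names `budgeted_grid` and `pixels_per_token`, the patch size of 14, and the budget of 64 tokens are all hypothetical choices made for this example.

```python
import math


def budgeted_grid(width, height, patch=14, budget=64):
    """Return (cols, rows) of a patch grid whose token count is <= budget.

    Hypothetical helper: the observation is downscaled (never upscaled)
    so that it fits the budget while roughly keeping its aspect ratio.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    # Scale factor that would make the patch count equal the budget.
    scale = min(1.0, math.sqrt(budget * patch * patch / (width * height)))
    rows = min(max(1, math.floor(height * scale / patch)), budget)
    # Clamp columns so extreme aspect ratios cannot exceed the budget.
    cols = min(max(1, math.floor(width * scale / patch)), budget // rows)
    return cols, rows


def pixels_per_token(width, height, patch=14, budget=64):
    """Source pixels summarized by each token; lower means finer detail."""
    cols, rows = budgeted_grid(width, height, patch, budget)
    return (width * height) / (cols * rows)


# Full view of a large image: coarse, because the budget is spread thin.
full = pixels_per_token(1792, 1792)
# Cropped (zoomed) view: same budget concentrated on a smaller region.
crop = pixels_per_token(448, 448)
print(f"full view: {full:.0f} px/token, crop: {crop:.0f} px/token")
```

Under these assumptions, the full view and the crop both cost the same number of tokens, but the crop spends them on a region sixteen times smaller. That is the incentive the summary describes: with a tight budget, fine detail is only reachable by choosing where to look.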
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT This research introduces a method to improve the active perception capabilities of VLMs, potentially leading to more effective agents in complex visual environments.
RANK_REASON The cluster contains an academic paper detailing a new training methodology for existing models.