Researchers have introduced PushupBench, a new dataset designed to evaluate the ability of vision-language models (VLMs) to accurately count repetitions in videos. The benchmark highlights that even top-tier VLMs struggle with this task, achieving only 42.1% exact accuracy on counting pushups. Furthermore, the study reveals that some models may exploit statistical biases rather than performing genuine temporal reasoning. Interestingly, fine-tuning models on this counting task improved their performance on broader video understanding benchmarks.
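For context, "exact accuracy" here means the fraction of videos where a model's predicted repetition count matches the ground-truth count exactly. A minimal sketch of that metric follows; the function name and sample data are hypothetical illustrations, not taken from the paper:

```python
from typing import Sequence

def exact_count_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of videos where the predicted count exactly matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical example: predicted vs. ground-truth pushup counts for five clips.
preds = [10, 12, 8, 15, 9]
truth = [10, 11, 8, 15, 10]
print(f"exact accuracy: {exact_count_accuracy(preds, truth):.1%}")  # 60.0%
```

Note that this metric gives no credit for near misses (e.g., 11 vs. 12 counts zero), which is why it is a strict test of temporal reasoning.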
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights a key limitation of current VLMs in temporal reasoning and counting, potentially guiding future model development.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating vision-language models.