Researchers have introduced PushupBench, a new dataset designed to evaluate the ability of vision-language models (VLMs) to accurately count repetitions in videos. The benchmark highlights that even top-tier VLMs struggle with this task, achieving only 42.1% exact accuracy on counting pushups. Furthermore, the study reveals that some models may exploit statistical biases rather than performing genuine temporal reasoning. Interestingly, fine-tuning models on this counting task improved their performance on broader video understanding benchmarks.
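For context, "exact accuracy" here means the fraction of videos where a model's predicted repetition count matches the ground-truth count exactly. A minimal sketch of that metric follows; the function name and sample data are hypothetical illustrations, not taken from the paper:

```python
from typing import Sequence

def exact_count_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of videos where the predicted count exactly matches the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical example: predicted vs. ground-truth pushup counts for five clips.
preds = [10, 12, 8, 15, 9]
truth = [10, 11, 8, 15, 10]
print(f"exact accuracy: {exact_count_accuracy(preds, truth):.1%}")  # 60.0%
```

Note that this metric gives no credit for near misses (e.g., 11 vs. 12 counts zero), which is why it is a strict test of temporal reasoning.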
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights a key limitation of current VLMs in temporal reasoning and counting, potentially guiding future model development.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating vision-language models.