PulseAugur

VideoNet dataset challenges vision-language models on domain-specific action recognition

Researchers have introduced VideoNet, a large-scale dataset designed to improve domain-specific action recognition in videos. The benchmark, covering 1,000 actions across 37 domains, highlights current limitations in vision-language models (VLMs) such as Gemini 3.1 Pro and Qwen3-VL-8B, which struggle with accuracy and few-shot learning on these tasks. To address this, the authors created a training dataset of nearly 500,000 video question-answer pairs, enabling a fine-tuned Molmo2-4B model to outperform existing open-weight 8B models on VideoNet.

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Reinvigorates action recognition research, potentially improving VLM capabilities in specialized video understanding tasks.

RANK_REASON The cluster contains a new academic paper introducing a dataset and benchmark for action recognition.


COVERAGE [2]

  1. arXiv cs.CV TIER_1 · Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna

    VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

    arXiv:2605.02834v1 (Announce Type: new). Abstract: Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently …

  2. arXiv cs.CV TIER_1 · Ranjay Krishna

    VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

    Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-lang…