Researchers have introduced VideoNet, a large-scale dataset designed to improve domain-specific action recognition in videos. The benchmark, covering 1,000 actions across 37 domains, highlights current limitations in vision-language models (VLMs) like Gemini 3.1 Pro and Qwen3-VL-8B, which struggle with accuracy and few-shot learning on these tasks. To address this, a new training dataset of nearly 500,000 video question-answer pairs was created, enabling a fine-tuned Molmo2-4B model to outperform existing open-weight 8B models on VideoNet.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Could revitalize action recognition research and improve VLM capabilities on specialized video understanding tasks.
RANK_REASON The cluster contains a new academic paper introducing a dataset and benchmark for action recognition.