
OmniVTG dataset and CoT paradigm enhance open-world video temporal grounding

Researchers have introduced OmniVTG, a large-scale dataset and training paradigm designed to improve open-world Video Temporal Grounding (VTG) for Multimodal Large Language Models (MLLMs). The dataset was built with a novel pipeline that identifies and collects videos containing underrepresented concepts, paired with a caption-centric approach for high-quality annotation. The authors also propose a Self-Correction Chain-of-Thought (CoT) training method that leverages the MLLM's own understanding capabilities to refine its predictions, achieving state-of-the-art performance on existing benchmarks and the new OmniVTG benchmark.

Summary written by gemini-2.5-flash-lite from 3 sources.

IMPACT New datasets and training paradigms may improve the ability of multimodal models to accurately localize video segments based on text queries.
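
The sources shown here name a Self-Correction Chain-of-Thought loop but do not spell out how it works, so the following is a minimal illustrative sketch under assumptions: `mllm_generate` is a hypothetical stand-in for a real MLLM call, and the propose-then-verify prompting is one plausible reading of "leverages the MLLM's own understanding to refine its predictions", not the authors' actual procedure.

```python
# Hypothetical sketch of a self-correction CoT loop for video temporal
# grounding; names and prompts are illustrative, not from the paper.
import re

def mllm_generate(prompt: str) -> str:
    """Stand-in for a real MLLM call; swap in an actual model API."""
    # Canned replies so the sketch runs end to end.
    if "Verify" in prompt:
        return "The segment misses the start of the action. Revised: 12.0 - 31.5"
    return "Predicted segment: 15.0 - 31.5"

def parse_segment(text: str) -> tuple[float, float] | None:
    """Pull a '<start> - <end>' span (in seconds) out of a model reply."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", text)
    return (float(m.group(1)), float(m.group(2))) if m else None

def ground_with_self_correction(video_desc: str, query: str, rounds: int = 2):
    # Round 1: ask the model for an initial segment matching the query.
    reply = mllm_generate(
        f"Video: {video_desc}\nQuery: {query}\n"
        "Predict the matching segment as '<start> - <end>' in seconds."
    )
    segment = parse_segment(reply)
    # Later rounds: the model critiques its own prediction and, if the
    # segment does not fully cover the queried event, emits a revision.
    for _ in range(rounds - 1):
        reply = mllm_generate(
            f"Video: {video_desc}\nQuery: {query}\n"
            f"Proposed segment: {segment[0]} - {segment[1]}.\n"
            "Verify whether this segment fully covers the queried event; "
            "if not, output a revised '<start> - <end>'."
        )
        segment = parse_segment(reply) or segment
    return segment

print(ground_with_self_correction("a cooking clip", "the chef flips the pancake"))
```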

RANK_REASON This cluster contains two academic papers detailing new datasets and training methodologies for video temporal grounding.

Read on arXiv cs.CV →

COVERAGE [3]

  1. arXiv cs.CV TIER_1 · Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

    OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

    arXiv:2604.25276v1 (announce type: new). Abstract: Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts…

  2. arXiv cs.CV TIER_1 · Yang Liu

    OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

    Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce Om…

  3. arXiv cs.CV TIER_1 · Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

    Multi-Scale Contrastive Learning for Video Temporal Grounding

    arXiv:2412.07157v3 (announce type: replace). Abstract: Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a mu…
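
The third source's abstract is cut off before it describes its method, so the snippet below is only a generic sketch of what a multi-scale contrastive objective for temporal grounding can look like: frame features are average-pooled into moments at several temporal scales, and an InfoNCE loss ties each query to its matching moment at every scale. The pooling scheme, scale set, and loss are assumptions, not the paper's formulation.

```python
# Generic multi-scale contrastive objective for temporal grounding
# (illustrative; not the formulation from arXiv:2412.07157).
import torch
import torch.nn.functional as F

def moment_embeddings(frame_feats: torch.Tensor, scale: int) -> torch.Tensor:
    """Average-pool frame features (B, T, D) into moments of `scale` frames."""
    return F.avg_pool1d(frame_feats.transpose(1, 2),
                        kernel_size=scale, stride=scale).transpose(1, 2)

def multi_scale_contrastive_loss(frame_feats, query_emb, pos_idx,
                                 scales=(1, 2, 4), temperature=0.07):
    """InfoNCE between each query and its matching moment at every scale.

    pos_idx holds each item's positive moment index at the finest scale;
    coarser scales reuse pos_idx // scale. Other moments act as negatives.
    """
    total = 0.0
    for s in scales:
        moments = F.normalize(moment_embeddings(frame_feats, s), dim=-1)  # (B, T//s, D)
        q = F.normalize(query_emb, dim=-1)                                # (B, D)
        logits = torch.einsum("bd,btd->bt", q, moments) / temperature
        total = total + F.cross_entropy(logits, pos_idx // s)
    return total / len(scales)

# Toy usage: batch of 2 videos, 16 frames each, 32-dim features.
frames = torch.randn(2, 16, 32)
query = torch.randn(2, 32)
pos = torch.tensor([3, 10])  # ground-truth moment index at scale 1
print(multi_scale_contrastive_loss(frames, query, pos).item())
```

In this sketch the negatives are simply the other moments of the same video at each scale; sharing one query embedding across scales is a design choice for simplicity, not something the truncated abstract confirms.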