Researchers have introduced TraceAV-Bench, a new benchmark designed to evaluate multi-hop reasoning in models that process long audio-visual videos. The benchmark comprises over 2,200 questions across 578 videos totaling more than 339 hours, with an average reasoning chain of 3.68 hops. Current leading models, including Google's Gemini 3.1 Pro and the open-source Ming-Flash-Omni-2.0, show significant limitations, scoring only 68.29% and 51.70% accuracy respectively. The benchmark also finds that robustness to multimodal hallucination is not strongly correlated with general reasoning performance.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Highlights significant gaps in current AI models' ability to perform complex reasoning over extended audio-visual content.
RANK_REASON: Introduction of a new benchmark dataset for evaluating AI model capabilities.