Researchers have introduced TraceAV-Bench, a new benchmark designed to evaluate multi-hop reasoning in models that process long audio-visual videos. The benchmark comprises over 2,200 questions across 578 videos totaling more than 339 hours, with an average reasoning chain of 3.68 hops. Current leading models, including Google's Gemini 3.1 Pro and the open-source Ming-Flash-Omni-2.0, show significant limitations, scoring only 68.29% and 51.70% accuracy respectively. The benchmark also finds that robustness to multimodal hallucination is not strongly correlated with general reasoning performance.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Highlights significant gaps in current AI models' ability to perform complex reasoning over extended audio-visual content.
RANK_REASON: Introduction of a new benchmark dataset for evaluating AI model capabilities.