PulseAugur
research · [5 sources]

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

Researchers have developed new frameworks to improve video understanding and reasoning capabilities in AI models. StoryTR introduces a benchmark and training method focused on Theory of Mind reasoning to infer narrative causality, finding that reasoning ability matters more than model size. HiCrew uses a hierarchical multi-agent approach with question-aware collaboration to handle long-form videos, preserving temporal coherence and adapting its reasoning strategy to the question. UpstreamQA proposes a modular framework that disentangles reasoning components, using large reasoning models to enrich the input for downstream video question-answering models and improve both performance and interpretability. Find, Fix, Reason introduces a context-repair method in which a teacher model supplies a student model with missing spatiotemporal dependencies, improving video reasoning accuracy and generalization.
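None of these papers' code is shown here, but the shared idea of question-aware hierarchical reasoning can be illustrated with a toy sketch: split a long video into temporally coherent chunks, let a low-level agent score each chunk's relevance to the question, and let a high-level agent aggregate only the top chunks in temporal order. Every name and scoring rule below is an illustrative assumption (keyword overlap standing in for a learned model), not the method of HiCrew or any of the other papers.

```python
# Toy sketch of question-aware hierarchical reasoning over a long video.
# Assumptions: frames are pre-extracted text captions, and "relevance" is
# plain keyword overlap; real systems would use learned multimodal agents.

def segment(frames, size):
    """Split a long frame-caption sequence into temporally coherent chunks."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]

def relevance(question, chunk):
    """Low-level agent: score a chunk by keyword overlap with the question."""
    q_words = set(question.lower().split())
    chunk_words = set(" ".join(chunk).lower().split())
    return len(q_words & chunk_words)

def answer_context(question, frames, size=2, top_k=1):
    """High-level agent: route the question to the most relevant chunks,
    then return their captions in temporal order as answer context."""
    chunks = segment(frames, size)
    scored = sorted(enumerate(chunks),
                    key=lambda ic: relevance(question, ic[1]),
                    reverse=True)
    picked = sorted(scored[:top_k])  # restore temporal order
    return " ".join(f for _, chunk in picked for f in chunk)

frames = [
    "a chef chops onions",
    "the chef cries",
    "a cat sleeps on the couch",
    "the chef plates the dish",
]
print(answer_context("why does the chef cry while chopping onions", frames))
# → a chef chops onions the chef cries
```

The point of the hierarchy is that the aggregator never sees the irrelevant middle of the video, which is how such designs sidestep the spatiotemporal redundancy problem the HiCrew abstract describes.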

Summary compiled from 5 sources.

IMPACT Advances in video reasoning frameworks could lead to more sophisticated AI agents capable of understanding complex narratives and causal relationships in visual data.

RANK_REASON The cluster contains multiple academic papers introducing new models, benchmarks, and frameworks for video understanding and reasoning.



COVERAGE [5]

  1. arXiv cs.AI TIER_1 · Xuanyue Zhong, Yuqiang Xie, Guanqun Bi, Jiangping Yang, Guibin Chen ·

    StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

    arXiv:2604.23198v1 Announce Type: new Abstract: Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see what is happening but fail to reason why it matters. This semantic gap stems from the lack of…

  2. arXiv cs.AI TIER_1 · Baoquan Zhao ·

    HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

    Long-form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrif…

  3. arXiv cs.CV TIER_1 · Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue ·

    OneThinker: All-in-one Reasoning Model for Image and Video

    arXiv:2512.03043v3 Announce Type: replace Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks…

  4. arXiv cs.CV TIER_1 · Jason Nguyen, Ameet Rao, Alexander Chang, Ishaan Kumar, Erin Tan ·

    UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

    arXiv:2604.23145v1 Announce Type: new Abstract: Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi-step reasoning that current large multimodal models (LMM…

  5. arXiv cs.CV TIER_1 · Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen ·

    Find, Fix, Reason: Context Repair for Video Reasoning

    arXiv:2604.16243v2 Announce Type: replace Abstract: Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes pol…