Video2GUI generates 12M GUI trajectories from unlabeled videos

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Researchers have developed Video2GUI, an automated framework that generates large-scale interaction trajectories for training GUI agents. The system extracts data from unlabeled internet videos and converts it into structured agent trajectories through a filtering process. The resulting dataset, WildGUI, contains 12 million trajectories across more than 1,500 applications and significantly improves the pre-training of models such as Qwen2.5-VL and Mimo-VL.

IMPACT Enables creation of large-scale datasets for GUI agents, potentially improving their generalization and performance across diverse applications.

RANK_REASON Academic paper introducing a new method and dataset for GUI agent pretraining.

COVERAGE [1]

arXiv cs.CV TIER_1 · Hao Tian · 2026-05-14 12:14

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely he…
