By PulseAugur Editorial
Summary
from 21 sources
Researchers have developed several new methods to improve the efficiency and quality of visual generative models. DC-DiT introduces dynamic chunking to Diffusion Transformers, adaptively compressing visual data for faster inference and better quality. Quant VideoGen addresses the KV cache memory bottleneck in autoregressive video generation by using 2-bit quantization, significantly reducing memory usage while maintaining consistency. FreeSpec tackles long video generation challenges by using singular-spectrum reconstruction to balance global and local features, improving temporal dynamics and visual quality. SwiftI2V offers an efficient high-resolution image-to-video generation framework using conditional segment-wise generation, enabling practical 2K synthesis on consumer hardware. RealCam provides real-time novel-view video generation with interactive camera control through an autoregressive framework and loop-closed data augmentation.
AI IMPACT
These advancements in visual generation models promise more efficient, higher-quality, and real-time synthesis capabilities for various applications.
RANK_REASON
This cluster contains multiple research papers detailing novel methods for improving visual generative models.
arXiv:2603.06351v2 Announce Type: replace-cross Abstract: Diffusion Transformers rely on static patchify tokenization, assigning the same token budget to smooth backgrounds, detailed object regions, noisy early timesteps, and late-stage refinements. We introduce the Dynamic Chunk…
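To make the idea of a dynamic token budget concrete, here is a minimal Python sketch. It is not the paper's algorithm; the variance threshold and the 2x2 merge rule are illustrative assumptions. The point is simply that smooth regions can be covered with fewer, coarser tokens while detailed regions keep a fine patch grid.

```python
import numpy as np

def dynamic_chunk_budget(image, base_patch=8, var_threshold=1e-3):
    """Toy illustration: spend fewer tokens on smooth regions.

    Splits `image` (H, W) into base patches, then merges each 2x2 group
    of patches into a single chunk when its pixel variance is low.
    Returns the number of tokens used vs. the static-patchify baseline.
    """
    H, W = image.shape
    ph, pw = H // base_patch, W // base_patch
    patches = image[:ph * base_patch, :pw * base_patch].reshape(
        ph, base_patch, pw, base_patch).transpose(0, 2, 1, 3)

    tokens = 0
    for i in range(0, ph - 1, 2):
        for j in range(0, pw - 1, 2):
            group = patches[i:i + 2, j:j + 2]
            # Smooth 2x2 neighborhood -> one coarse token; else 4 fine tokens.
            tokens += 1 if group.var() < var_threshold else 4
    return tokens, ph * pw

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[20:40, 20:40] = rng.normal(size=(20, 20))  # detailed object region
used, static = dynamic_chunk_budget(img)
print(f"dynamic tokens: {used}, static tokens: {static}")
```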
arXiv cs.LG
TIER_1·Jinbin Bai, Yu Lei, Qingyu Shi, Aosong Feng, Yi Xin, Zhuoran Zhao, Fei Shen, Kaidong Yu, Jason Li·
arXiv:2605.04653v1 Announce Type: new Abstract: Aligning large visual generative models with human feedback is often performed through pairwise preference optimization. While such approaches are conceptually simple, they fundamentally rely on annotated pairs, limiting scalability…
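For context, "pairwise preference optimization" usually refers to DPO-style objectives trained on annotated (preferred, rejected) pairs, which is exactly the annotation dependency this abstract calls out. The following is a generic sketch of such a loss, not this paper's objective; the function name and the toy inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(logp_win, logp_lose,
                             ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO-style pairwise preference loss (generic form).

    Each argument is a sample-level log-likelihood of the preferred /
    rejected output under the trained model or a frozen reference model.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()

# Toy usage with random log-probabilities for a batch of 4 annotated pairs.
lw, ll = torch.randn(4), torch.randn(4)
loss = pairwise_preference_loss(lw, ll, lw.detach() - 0.1, ll.detach() + 0.1)
print(loss.item())
```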
arXiv cs.LG
TIER_1·Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, Kurt Keutzer·
arXiv:2602.02958v4 Announce Type: replace Abstract: Despite rapid progress in autoregressive video diffusion, an emerging system-algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grow…
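As a rough picture of what 2-bit KV-cache quantization involves, consider group-wise asymmetric quantization: each group of values shares a scale and zero-point, and every entry is rounded to one of four levels. This sketch is not the paper's scheme; the group size is an assumption, and a real kernel would pack four 2-bit codes per byte rather than store them in uint8 as done here for clarity.

```python
import torch

def quantize_2bit(kv, group=64):
    """Group-wise asymmetric 2-bit quantization of a KV tensor (illustrative)."""
    flat = kv.reshape(-1, group)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / 3.0            # 2 bits -> 4 levels
    q = ((flat - lo) / scale).round().clamp(0, 3).to(torch.uint8)
    return q, scale, lo

def dequantize_2bit(q, scale, lo, shape):
    return (q.float() * scale + lo).reshape(shape)

kv = torch.randn(2, 8, 128, 64)                        # (batch, heads, tokens, dim)
q, s, z = quantize_2bit(kv)
err = (dequantize_2bit(q, s, z, kv.shape) - kv).abs().mean()
print(f"mean abs reconstruction error: {err:.3f}")
```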
arXiv cs.CV
TIER_1·Youcan Xu, Jiaxin Shi, Zhen Wang, Wensong Song, Feifei Shao, Chen Liang, Jun Xiao, Long Chen·
arXiv:2605.05781v1 Announce Type: new Abstract: Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This …
arXiv:2605.06356v1 Announce Type: new Abstract: High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to…
arXiv:2605.06509v1 Announce Type: new Abstract: Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a l…
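The "singular-spectrum reconstruction" credited to FreeSpec in the summary can be illustrated with classic singular-spectrum analysis applied to a per-frame signal: build a Hankel trajectory matrix, keep the leading singular components as a low-rank global trend, and treat the residual as local dynamics. The window length, rank, and the 1-D toy signal below are assumptions, not the paper's configuration.

```python
import numpy as np

def singular_spectrum_split(signal, window=16, n_global=2):
    """Classic SSA split of a per-frame signal into global trend + local detail."""
    n = len(signal)
    k = n - window + 1
    traj = np.stack([signal[i:i + window] for i in range(k)])  # Hankel matrix
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    low_rank = (u[:, :n_global] * s[:n_global]) @ vt[:n_global]

    # Diagonal averaging maps the low-rank matrix back to a sequence.
    global_part = np.zeros(n)
    counts = np.zeros(n)
    for i in range(k):
        global_part[i:i + window] += low_rank[i]
        counts[i:i + window] += 1
    global_part /= counts
    return global_part, signal - global_part

t = np.linspace(0, 4 * np.pi, 120)
per_frame_stat = np.sin(0.25 * t) + 0.2 * np.sin(6 * t)  # slow trend + fast motion
trend, detail = singular_spectrum_split(per_frame_stat)
print(trend[:3], detail[:3])
```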
arXiv:2603.11911v3 Announce Type: replace Abstract: We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, I…
arXiv cs.CV
TIER_1·Karthik Inbasekar, Guy Rom, Omer Shlomovits·
arXiv:2605.03475v1 Announce Type: new Abstract: Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Frechet Video Distance (FVD) favors distributional t…
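To see why reference-based metrics "reward pixel fidelity over semantic correctness", consider PSNR, the simpler of the two: a two-pixel spatial shift of otherwise identical content already scores poorly, even though a human would call the videos the same. A minimal sketch (PSNR only; SSIM and FVD are not reproduced here):

```python
import numpy as np

def psnr(ref, gen, max_val=1.0):
    """Peak Signal-to-Noise Ratio between reference and generated frames."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.random.default_rng(0).random((8, 64, 64, 3))  # 8 reference frames
shifted = np.roll(ref, shift=2, axis=2)                # same content, shifted 2 px
print(f"PSNR of 2-pixel shift: {psnr(ref, shifted):.1f} dB")  # low despite same content
```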
arXiv:2603.28489v2 Announce Type: cross Abstract: The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap still remains between the theor…
arXiv cs.CV
TIER_1·Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo·
arXiv:2504.17816v3 Announce Type: replace Abstract: Subject-driven video generation (SDV-Gen) aims to produce videos of a specific subject by adapting a pretrained video model, enabling personalized and application-driven content creation. To achieve this goal, per-subject tuning…
arXiv cs.CV
TIER_1·Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang·
arXiv:2601.23286v2 Announce Type: replace Abstract: While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these…
arXiv:2601.17756v3 Announce Type: replace Abstract: Existing Subject-to-Video Generation (S2V) methods have achieved high-fidelity and subject-consistent video generation, yet remain constrained to single-view subject references. This limitation renders the S2V task reducible to …
arXiv cs.CV
TIER_1·Yian Zhao, Feng Wang, Qiushan Guo, Chang Liu, Xiangyang Ji, Jian Zhang, Jie Chen·
arXiv:2605.02134v1 Announce Type: new Abstract: Video Variational Autoencoder (VAE) enables latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimiz…
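As a minimal sketch of the spatiotemporal compression a video VAE performs, here is a toy two-layer encoder; it is not the paper's architecture, and the strides, channel counts, and latent dimension are assumptions chosen only to show pixels being mapped into a much smaller latent grid.

```python
import torch
import torch.nn as nn

class TinyVideoEncoder(nn.Module):
    """Minimal sketch of a video-VAE encoder stage (not the paper's design).

    Strided Conv3d layers compress time by 2x and space by 8x, mapping
    pixels into a compact spatiotemporal latent grid.
    """
    def __init__(self, latent_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(2, 4, 4), padding=1),
            nn.SiLU(),
            nn.Conv3d(32, 2 * latent_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):                    # (B, 3, T, H, W)
        mu, logvar = self.net(video).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize

video = torch.randn(1, 3, 16, 256, 256)
z = TinyVideoEncoder()(video)
print(video.shape, "->", z.shape)                # latent is far smaller than pixels
```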
arXiv cs.CV
TIER_1·Jongmin Shin, Ka Young Kim, Eunki Cho, Seong Tae Kim, Namkee Oh·
arXiv:2605.01911v1 Announce Type: new Abstract: Purpose: Vision-language models (VLMs) have shown promising performance in surgical visual question answering (VQA). However, existing surgical VQA datasets often contain linguistic shortcuts, where question phrasing implicitly cons…
arXiv cs.CV
TIER_1·Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica·
arXiv:2505.18875v4 Announce Type: replace Abstract: Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs a…
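A generic way to "compute only critical tokens" is top-k sparsification of the attention scores: each query attends only to its highest-scoring keys. The sketch below masks rather than skips the non-critical scores, so it illustrates the selection but not the speedup; a real sparse kernel would avoid computing the masked entries in the first place. The function name and keep count are assumptions, not this paper's method.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep=64):
    """Attend only to the `keep` highest-scoring (critical) tokens per query.

    Scores below each query's top-k threshold are masked to -inf before
    softmax, so their values contribute nothing to the output.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, H, Nq, Nk)
    kth = scores.topk(keep, dim=-1).values[..., -1:]       # per-query threshold
    masked = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(masked, dim=-1) @ v

q = k = v = torch.randn(1, 8, 1024, 64)
out = topk_sparse_attention(q, k, v)
print(out.shape)   # same shape as dense attention, ~94% of scores masked
```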