Researchers have developed Omni2Sound, a unified diffusion model that generates audio from video, text, or both combined. The model addresses data scarcity and cross-task competition by introducing SoundAtlas, a large-scale dataset with tightly aligned audio captions, and a three-stage progressive training schedule. Omni2Sound achieves state-of-the-art performance on video-to-audio, text-to-audio, and video-text-to-audio generation within a single model, demonstrating strong generalization.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT: Introduces a unified model for multimodal audio generation, potentially simplifying workflows for content creators and researchers.
RANK_REASON: This is a research paper introducing a new model and dataset for audio generation.