What-Where Transformer separates object appearance from location

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Researchers have introduced the What-Where Transformer (WWT), a visual backbone designed to separate object appearance from spatial location. The architecture uses a slot-based design in which tokens represent 'what' an object is and attention maps represent 'where' it is located. WWT shows an emergent ability to discover multiple objects directly from its attention maps, even when trained with only standard classification supervision, and it improves performance on zero-shot object discovery and weakly supervised semantic segmentation.
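To make the what/where split concrete, here is a minimal sketch of one slot-style cross-attention step in plain Python. It is not the paper's implementation: the helper name slot_readout, the softmax-over-slots normalization (borrowed from generic slot-attention designs), and all shapes are assumptions chosen for illustration. Each slot yields a token (the 'what') and an attention map over image patches (the 'where').

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def slot_readout(slots, patches):
    """One slot-style cross-attention step (hypothetical helper, not from the paper).

    slots:   K query vectors of length D
    patches: N patch feature vectors of length D
    Returns (what, where): K slot tokens of length D, and K attention maps of length N.
    """
    d = len(patches[0])
    scale = 1.0 / math.sqrt(d)

    # logits[n][k]: scaled dot-product similarity between patch n and slot k.
    logits = [
        [scale * sum(s_i * p_i for s_i, p_i in zip(s, p)) for s in slots]
        for p in patches
    ]

    # Assumption: softmax over slots, so slots compete for each patch.
    assign = [softmax(row) for row in logits]

    # where[k][n]: how strongly slot k claims patch n (the "where" map).
    where = [[assign[n][k] for n in range(len(patches))] for k in range(len(slots))]

    # what[k]: attention-weighted mean of patch features (the "what" token).
    what = []
    for k in range(len(slots)):
        mass = sum(where[k])
        token = [
            sum(where[k][n] * patches[n][i] for n in range(len(patches))) / mass
            for i in range(d)
        ]
        what.append(token)
    return what, where


random.seed(0)
patches = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # e.g. a 4x4 patch grid
slots = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]     # 3 slots
what, where = slot_readout(slots, patches)
```

In this sketch, reshaping each where[k] back to the patch grid gives a coarse per-slot localization map, while what[k] could feed a classification head. How WWT actually normalizes attention and trains its slots is not stated in the summary.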


IMPACT Introduces a new architectural bias for visual models that could improve localization tasks and emergent object discovery.

RANK_REASON The cluster contains a new academic paper detailing a novel model architecture.


COVERAGE [1]

arXiv cs.CV TIER_1 · Ikuro Sato · 2026-05-12 12:08

What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One p…
