What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
Researchers have introduced the What-Where Transformer (WWT), a visual backbone designed to separate object appearance from spatial location. The architecture uses a slot-based design in which tokens represent 'what' an object is and attention maps represent 'where' it is located. The WWT shows emergent multi-object discovery directly from its attention maps, even when trained with standard classification supervision, and improves performance on zero-shot object discovery and weakly supervised semantic segmentation.
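The summary does not specify the WWT's exact equations, but the core idea it describes (slot tokens as 'what', attention maps as 'where') resembles a cross-attention readout. As a minimal, hypothetical sketch (the function and variable names below are illustrative, not from the paper), one step might look like this in NumPy: K slot queries attend over N patch features, so each slot's attention row is its spatial 'where' map and its attention-weighted feature sum is its 'what' vector.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_cross_attention(slots, patches):
    """One illustrative cross-attention step (not the paper's exact method).

    slots:   (K, D) slot query vectors
    patches: (N, D) flattened image patch features
    Returns:
      what:  (K, D) per-slot appearance vectors
      where: (K, N) per-slot spatial attention maps (rows sum to 1)
    """
    d = slots.shape[-1]
    scores = slots @ patches.T / np.sqrt(d)  # (K, N) similarity logits
    where = softmax(scores, axis=-1)         # each slot's 'where' map
    what = where @ patches                   # each slot's 'what' vector
    return what, where

# Toy example: 4 slots over an 8x8 grid of 16-dim patch features.
rng = np.random.default_rng(0)
slots = rng.normal(size=(4, 16))
patches = rng.normal(size=(64, 16))
what, where = slot_cross_attention(slots, patches)
print(what.shape, where.shape)  # (4, 16) (4, 64)
```

Reshaping a slot's 64-entry 'where' row back to the 8x8 grid would give the kind of spatial map from which, per the summary, objects can be read off directly.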
IMPACT: Introduces a new architectural bias for visual models that could improve localization tasks and emergent object discovery.