Researchers have developed SAEgis, a new framework for detecting adversarial attacks on vision-language models (VLMs). The method uses sparse autoencoders (SAEs) as a plug-and-play module, requiring no additional adversarial training and introducing minimal overhead. By leveraging learned sparse latent features, SAEgis reliably identifies perturbed inputs, performing strongly across diverse attack and domain settings, with notable gains in cross-domain generalization over existing methods.
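The detection idea can be sketched minimally: pass an input's embedding through a trained SAE and flag inputs the SAE handles anomalously. The sketch below is an illustrative assumption, not the paper's actual design — the SAE weights, dimensions, and the choice of reconstruction error as the anomaly score are all hypothetical, with the threshold calibrated on clean embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny SAE used as a plug-and-play detector.
# Dimensions and random weights are placeholders; a real SAE
# would be trained on clean model activations.
d_model, d_latent = 64, 256
W_enc = rng.normal(0, 0.1, (d_latent, d_model))
b_enc = np.zeros(d_latent)
W_dec = W_enc.T.copy()  # tied decoder weights, a common SAE choice

def sae_score(x):
    """Anomaly score: reconstruction error through the sparse code."""
    z = np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU gives sparse latents
    x_hat = W_dec @ z
    return float(np.linalg.norm(x - x_hat))

def is_adversarial(x, threshold):
    # Flag embeddings the SAE reconstructs unusually poorly.
    return sae_score(x) > threshold

# Calibrate the threshold as a high quantile of scores on clean data,
# so only a small fraction of clean inputs are ever flagged.
clean = rng.normal(0, 1, (100, d_model))
threshold = np.quantile([sae_score(x) for x in clean], 0.95)
```

Because the SAE itself is frozen and the detector is just a score plus a threshold, the module slots in front of an existing VLM without retraining it, which matches the plug-and-play, low-overhead framing of the summary.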
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Enhances the safety and reliability of vision-language models in real-world applications by providing a practical defense against adversarial attacks.
RANK_REASON Academic paper proposing a novel method for adversarial attack detection in VLMs.