PulseAugur

research · [32 sources]

OpenAI and researchers reveal AI vulnerabilities to adversarial attacks

OpenAI researchers are exploring whether adversarial robustness transfers across different types of perturbations in neural networks. Their findings indicate that robustness against one perturbation type does not guarantee robustness against others and can sometimes degrade it. They recommend evaluating adversarial defenses against a diverse range of perturbation types and sizes to ensure comprehensive coverage. OpenAI is also framing adversarial examples as a concrete AI safety problem, noting their potential to cause significant real-world issues, such as tricking the vision systems of autonomous vehicles.
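The recommendation above, evaluating a defense across a range of perturbation sizes rather than a single attack budget, can be sketched on a toy linear model. Everything here is hypothetical for illustration: a fixed linear classifier stands in for a trained network, and an FGSM-style L-infinity step stands in for the broader attack suites the research uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classifier: a fixed linear model on 2-D inputs
# (a hypothetical stand-in for a trained network).
w = np.array([1.0, -1.0])
b = 0.0

def predict(x):
    # Class 1 if w·x + b > 0, else class 0.
    return (x @ w + b > 0).astype(int)

# Clean data labeled by the same rule, so clean accuracy starts at 1.0.
X = rng.normal(size=(200, 2))
y = predict(X)

def fgsm_linf(x, y, eps):
    # For a linear model the loss gradient w.r.t. x is ±w, so the
    # worst-case L-infinity step of size eps is eps * sign(w), pushed
    # away from the correct side of the decision boundary.
    direction = np.where(y[:, None] == 1, -np.sign(w), np.sign(w))
    return x + eps * direction

# Evaluate the same model across several perturbation sizes: a defense
# that looks fine at one eps can collapse at a larger one.
for eps in [0.0, 0.1, 0.5, 1.0]:
    acc = (predict(fgsm_linf(X, y, eps)) == y).mean()
    print(f"eps={eps:.1f}  accuracy={acc:.2f}")
```

The same loop generalizes to multiple perturbation *types* (L2, rotations, patches) by swapping the attack function, which is the evaluation regime the summary advocates.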

Summary written by gemini-2.5-flash-lite from 32 sources.

IMPACT Highlights the ongoing challenges in securing AI systems against sophisticated adversarial attacks, necessitating robust evaluation and defense strategies.

RANK_REASON The cluster contains multiple arXiv papers and OpenAI blog posts detailing research into adversarial examples and robustness in machine learning models.


COVERAGE [32]

  1. OpenAI News TIER_1 ·

    Transfer of adversarial robustness between perturbation types

  2. OpenAI News TIER_1 ·

    Robust adversarial inputs

    We’ve created images that reliably fool neural network classifiers when viewed from varied scales and perspectives. This challenges a claim from last week that self-driving cars would be hard to trick maliciously since they capture images from multiple scales, angles, perspective…

  3. OpenAI News TIER_1 ·

    Attacking machine learning with adversarial examples

    Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines. In this post we’ll show how adversarial examples work across different mediums, and will discu…

  4. OpenAI News TIER_1 ·

    Adversarial attacks on neural network policies

  5. arXiv cs.LG TIER_1 · Xinshuai Dong, Haifeng Chen, Xuyuan Liu, Shengyu Chen, Haoyu Wang, Shaoan Xie, Kun Zhang, Zhengzhang Chen ·

    The Power of Order: Fooling LLMs with Adversarial Table Permutations

    arXiv:2605.00445v1 Announce Type: new Abstract: Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains…

  6. arXiv cs.LG TIER_1 · Zhengzhang Chen ·

    The Power of Order: Fooling LLMs with Adversarial Table Permutations

    Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains a critical, unaddressed question. This paper de…

  7. arXiv cs.CL TIER_1 · Wenhao Lan, Shan Li, Junbin Yang, Haihua Shen, Yijun Yang ·

    Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

    arXiv:2604.27019v1 Announce Type: cross Abstract: Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, but the training-time mechanisms behind this tradeoff remain unclear. Prior work characterizes refusal directions and jailbrea…

  8. arXiv cs.LG TIER_1 · Han Liu, Shanghao Shi, Yevgeniy Vorobeychik, Chongjie Zhang, Ning Zhang ·

    Low Rank Adaptation for Adversarial Perturbation

    arXiv:2604.27487v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA), which leverages the insight that model updates typically reside in a low-dimensional space, has significantly improved the training efficiency of Large Language Models (LLMs) by updating neural network la…

  9. arXiv cs.AI TIER_1 · Jasmine Moreira ·

    IACDM: Interactive Adversarial Convergence Development Methodology -- A Structured Framework for AI-Assisted Software Development

    arXiv:2604.16399v2 Announce Type: replace-cross Abstract: The widespread adoption of AI-assisted development tools in 2025 -- and the emergence of vibe coding, a practice of generating complete applications from natural language without verification -- exposed a critical and tool…

  10. arXiv cs.AI TIER_1 · Ivan Bercovich ·

    What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

    arXiv:2604.28093v1 Announce Type: new Abstract: Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks qui…

  11. arXiv cs.AI TIER_1 · Jon-Paul Cacioli ·

    Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

    arXiv:2604.27249v1 Announce Type: cross Abstract: When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial inst…

  12. arXiv cs.AI TIER_1 · Xu Wang, Zexian Li, Litong Gong, Tiezheng Ge, Zhijie Deng ·

    AdvDMD: Adversarial Reward Meets DMD For High-Quality Few-Step Generation

    arXiv:2604.28126v1 Announce Type: cross Abstract: Diffusion models offer superior generation quality at the expense of extensive sampling steps. Distillation methods, with Distribution Matching Distillation (DMD) as a popular example, can mitigate this issue, but performance degr…

  13. arXiv cs.AI TIER_1 · Ivan Bercovich ·

    What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

    Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review …

  14. Hugging Face Daily Papers TIER_1 ·

    Low Rank Adaptation for Adversarial Perturbation

    Low-Rank Adaptation (LoRA), which leverages the insight that model updates typically reside in a low-dimensional space, has significantly improved the training efficiency of Large Language Models (LLMs) by updating neural network layers using low-rank matrices. Since the generati…

  15. arXiv cs.CL TIER_1 · Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Hahn, Michael Backes, Yue Zhang, Linyi Yang ·

    SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

    arXiv:2604.26506v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical th…

  16. arXiv cs.CL TIER_1 · Jon-Paul Cacioli ·

    Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

    When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two i…

  17. Hugging Face Daily Papers TIER_1 ·

    SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

    As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we…

  18. arXiv cs.CL TIER_1 · Linyi Yang ·

    SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

    As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we…

  19. arXiv cs.CL TIER_1 · Zhang Wei, Hanxuan Chen, Peilu Hu, Zhenyuan Wei, Chenwei Liang, Jing Luo, Ziyi Ni, Hao Yan, Li Mei, Shengning Lang, Kuan Lu, Xi Xiao, Zhimo Han, Yijin Wang, Yichao Zhang, Chen Yang, Junfeng Hao, Jiayi Gu, Riyang Bao, Mu-Jiang-Shan Wang ·

    Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

    arXiv:2512.20677v4 Announce Type: replace-cross Abstract: The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices …

  20. arXiv cs.AI TIER_1 · Akshay Jagadish ·

    Automated Adversarial Collaboration for Advancing Theory Building in the Cognitive Sciences

    Cognitive science often evaluates theories through narrow paradigms and local model comparisons, limiting the integration of evidence across tasks and realizations. We introduce an automated adversarial collaboration framework for adjudicating among competing theories even when t…

  21. arXiv cs.AI TIER_1 · Vishruti Kakkad (Carnegie Mellon University), Paul Chung (University of California, San Diego), Hanan Hibshi (Carnegie Mellon University, King Abdulaziz University), Maverick Woo (Carnegie Mellon University) ·

    Comparative Insights on Adversarial Machine Learning from Industry and Academia: A User-Study Approach

    arXiv:2602.04753v2 Announce Type: replace-cross Abstract: An exponential growth of Machine Learning and its Generative AI applications brings with it significant security challenges, often referred to as Adversarial Machine Learning (AML). In this paper, we conducted two comprehe…

  22. arXiv cs.AI TIER_1 · Mazal Bethany, Kim-Kwang Raymond Choo, Nishant Vishwamitra, Peyman Najafirad ·

    Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines

    arXiv:2604.23483v1 Announce Type: new Abstract: Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradi…

  23. arXiv cs.CL TIER_1 · Honglin Mu, Jinghao Liu, Kaiyang Wan, Rui Xing, Xiuying Chen, Timothy Baldwin, Wanxiang Che ·

    AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

    arXiv:2512.20164v2 Announce Type: replace Abstract: Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by…

  24. Hugging Face Daily Papers TIER_1 ·

    Optimally Auditing Adversarial Agents

    Fraud can pose a challenge in many resource allocation domains, including social service delivery and credit provision. For example, agents may misreport private information in order to gain benefits or access to credit. To mitigate this, a principal can design strategic audits t…

  25. arXiv cs.LG TIER_1 · Akansha Kalra, Basavasagar Patil, Guanhong Tao, Daniel S. Brown ·

    How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies

    arXiv:2502.03698v4 Announce Type: replace Abstract: Learning from demonstrations is a popular approach to train AI models; however, their vulnerability to adversarial attacks remains underexplored. We present the first systematic study of adversarial attacks, across a range of bo…

  26. arXiv stat.ML TIER_1 · Yuxuan Hou ·

    Adversarial Robustness of NTK Neural Networks

    arXiv:2604.25965v1 Announce Type: new Abstract: Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression.…

  27. arXiv cs.CV TIER_1 · Vishesh Kumar, Akshay Agarwal ·

    The Unseen Adversaries: Robust and Generalized Defense Against Adversarial Patches

    arXiv:2604.26317v1 Announce Type: new Abstract: The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the…

  28. arXiv cs.CV TIER_1 · Yanyun Wang, Qingqing Ye, Li Liu, Zi Liang, Haibo Hu ·

    Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training

    arXiv:2604.26496v1 Announce Type: new Abstract: Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a sur…

  29. arXiv cs.CV TIER_1 · Haibo Hu ·

    Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training

    Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: Varying i…

  30. arXiv cs.CV TIER_1 · Akshay Agarwal ·

    The Unseen Adversaries: Robust and Generalized Defense Against Adversarial Patches

    The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the attachment of patches to clean images, known as…

  31. arXiv stat.ML TIER_1 · Yuxuan Hou ·

    Adversarial Robustness of NTK Neural Networks

    Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression. We establish minimax optimal rates for adversar…

  32. Hamel Husain TIER_1 Bahasa(ID) · Hamel Husain ·

    Debugging AI With Adversarial Validation

For years, I’ve relied on a straightforward method to identify sudden changes in model inputs or training data, known …
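The adversarial-validation idea in this last entry, label one batch of data 0 and the other 1, then check whether a classifier can tell them apart, can be sketched without any ML framework. The two batches and the drift below are fabricated for illustration; a near-0.5 AUC would mean the batches are indistinguishable, while a high AUC signals a shift.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "training" and "production" batches of the same features.
# The production batch has a shifted mean, simulating silent input drift.
train = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
prod = rng.normal(loc=0.8, scale=1.0, size=(500, 3))

# Adversarial validation: label train rows 0 and production rows 1,
# then fit a classifier to distinguish them.
X = np.vstack([train, prod])
y = np.concatenate([np.zeros(len(train)), np.ones(len(prod))])

# Minimal logistic regression via gradient descent (no external ML libs).
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# Rank-based (Mann-Whitney) AUC of the classifier's scores: values far
# above 0.5 mean the two batches are easy to tell apart, i.e. drift.
scores = X @ w + b
order = np.argsort(scores)
ranks = np.empty(len(y))
ranks[order] = np.arange(1, len(y) + 1)
n_pos, n_neg = y.sum(), (1 - y).sum()
auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(f"adversarial-validation AUC: {auc:.2f}")
```

In practice the same check is usually run with a stronger off-the-shelf classifier, and the most predictive features point at which inputs drifted.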