OpenAI and academic researchers reveal AI vulnerabilities to adversarial attacks
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 32 sources
OpenAI researchers are exploring the transferability of adversarial robustness across different types of perturbations in neural networks. Their findings indicate that robustness against one perturbation type does not always guarantee robustness against others and can sometimes be detrimental. They recommend evaluating adversarial defenses using a diverse range of perturbation types and sizes to ensure comprehensive security. Additionally, OpenAI is investigating adversarial examples as a concrete AI safety problem, noting their potential to cause significant issues, such as tricking autonomous vehicles.
AI
IMPACT
Highlights the ongoing challenges in securing AI systems against sophisticated adversarial attacks, necessitating robust evaluation and defense strategies.
RANK_REASON
The cluster contains multiple arXiv papers and OpenAI blog posts detailing research into adversarial examples and robustness in machine learning models.
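The summary's recommendation to evaluate defenses against a diverse range of perturbation types is concrete enough to sketch. Below is a minimal, hedged PyTorch example that measures robust accuracy separately under L-infinity and L-2 PGD attacks; the `model`, `loader`, and the attack budgets are illustrative assumptions, not details taken from the cited papers.

```python
# Hedged sketch: measure robust accuracy under more than one perturbation type.
# Assumes a PyTorch image classifier `model` and a DataLoader `loader` yielding
# (images, labels) with pixels in [0, 1]; budgets are illustrative.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, steps, norm="linf"):
    """Projected gradient descent under an L-inf or L-2 budget."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        with torch.no_grad():
            if norm == "linf":
                x_adv = x_adv + alpha * grad.sign()
                x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
            else:  # "l2"
                g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
                x_adv = x_adv + alpha * g
                delta = x_adv - x
                d_norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
                x_adv = x + delta * torch.clamp(eps / (d_norm + 1e-12), max=1.0)
            x_adv = x_adv.clamp(0, 1).detach()
    return x_adv

def robust_accuracy(model, loader, attacks):
    """Report accuracy separately per perturbation type, as the summary suggests."""
    results = {}
    for name, kwargs in attacks.items():
        correct = total = 0
        for x, y in loader:
            preds = model(pgd_attack(model, x, y, **kwargs)).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.numel()
        results[name] = correct / total
    return results

attacks = {
    "pgd_linf": dict(eps=8 / 255, alpha=2 / 255, steps=10, norm="linf"),
    "pgd_l2":   dict(eps=0.5, alpha=0.1, steps=10, norm="l2"),
}
```

A defense that scores well on one entry and poorly on the other is exactly the non-transferable robustness the summary warns about.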
We’ve created images that reliably fool neural network classifiers when viewed from varied scales and perspectives. This challenges a claim from last week that self-driving cars would be hard to trick maliciously since they capture images from multiple scales, angles, perspective…
Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines. In this post we’ll show how adversarial examples work across different mediums, and will discu…
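The "optical illusion" framing maps onto a very small amount of code. Here is a hedged sketch of the fast gradient sign method, one standard way such examples are constructed; the `model`, `image`, and `label` names are placeholders, not artifacts from the post.

```python
# Hedged FGSM sketch: nudge an input along the sign of the loss gradient so the
# classifier's prediction changes while the perturbation stays visually negligible.
import torch
import torch.nn.functional as F

def fgsm(model, image, label, eps=0.007):
    image = image.clone().detach().requires_grad_(True)
    F.cross_entropy(model(image), label).backward()
    adv = image + eps * image.grad.sign()      # one signed step of size eps
    return adv.clamp(0.0, 1.0).detach()        # keep pixels in the valid range
```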
arXiv:2605.00445v1 Announce Type: new Abstract: Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains…
Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains a critical, unaddressed question. This paper de…
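The excerpt does not say how the paper probes table structure, but the general idea can be sketched: present the same table under answer-preserving permutations and check whether the model's answer is stable. `ask_table_qa` below is a hypothetical helper, not the paper's interface.

```python
# Hedged sketch of a structural-robustness probe for Table QA: permute rows and
# columns (which should not change the answer) and measure answer stability.
import random

def structural_consistency(ask_table_qa, header, rows, question, n_variants=5):
    baseline = ask_table_qa(header, rows, question)
    agree = 0
    for _ in range(n_variants):
        row_variant = random.sample(rows, k=len(rows))                 # reorder rows
        col_order = random.sample(range(len(header)), k=len(header))   # reorder columns
        variant_header = [header[j] for j in col_order]
        variant_rows = [[r[j] for j in col_order] for r in row_variant]
        if ask_table_qa(variant_header, variant_rows, question) == baseline:
            agree += 1
    return agree / n_variants  # 1.0 means fully stable under permutation
```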
arXiv:2604.27019v1 Announce Type: cross Abstract: Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, but the training-time mechanisms behind this tradeoff remain unclear. Prior work characterizes refusal directions and jailbrea…
arXiv:2604.27487v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA), which leverages the insight that model updates typically reside in a low-dimensional space, has significantly improved the training efficiency of Large Language Models (LLMs) by updating neural network la…
arXiv:2604.16399v2 Announce Type: replace-cross Abstract: The widespread adoption of AI-assisted development tools in 2025 -- and the emergence of vibe coding, a practice of generating complete applications from natural language without verification -- exposed a critical and tool…
arXiv:2604.28093v1 Announce Type: new Abstract: Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks qui…
arXiv:2604.27249v1 Announce Type: cross Abstract: When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial inst…
arXiv:2604.28126v1 Announce Type: cross Abstract: Diffusion models offer superior generation quality at the expense of extensive sampling steps. Distillation methods, with Distribution Matching Distillation (DMD) as a popular example, can mitigate this issue, but performance degr…
Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review …
Low-Rank Adaptation (LoRA), which leverages the insight that model updates typically reside in a low-dimensional space, has significantly improved the training efficiency of Large Language Models (LLMs) by updating neural network layers using low-rank matrices. Since the generati…
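For readers unfamiliar with the low-rank update the abstract leans on, here is a minimal, hedged sketch of a LoRA-style linear layer in PyTorch; the rank, scaling, and initialization follow the common LoRA formulation rather than anything specific to this paper.

```python
# Hedged LoRA sketch: a frozen base weight plus a trainable low-rank update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)                           # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T ; only A and B receive gradients.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```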
arXiv cs.CL
TIER_1·Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Hahn, Michael Backes, Yue Zhang, Linyi Yang·
arXiv:2604.26506v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical th…
When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two i…
As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we…
arXiv:2512.20677v4 Announce Type: replace-cross Abstract: The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices …
Cognitive science often evaluates theories through narrow paradigms and local model comparisons, limiting the integration of evidence across tasks and realizations. We introduce an automated adversarial collaboration framework for adjudicating among competing theories even when t…
arXiv cs.AI
TIER_1·Vishruti Kakkad (Carnegie Mellon University), Paul Chung (University of California, San Diego), Hanan Hibshi (Carnegie Mellon University, King Abdulaziz University), Maverick Woo (Carnegie Mellon University)·
arXiv:2602.04753v2 Announce Type: replace-cross Abstract: An exponential growth of Machine Learning and its Generative AI applications brings with it significant security challenges, often referred to as Adversarial Machine Learning (AML). In this paper, we conducted two comprehe…
arXiv:2604.23483v1 Announce Type: new Abstract: Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradi…
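The constraints the abstract lists, binary-only feedback and no gradients, admit only query-based search. As a generic illustration, and not the paper's method, here is a hedged sketch of a random word-substitution probe that uses nothing but the pipeline's accept/reject decision; `pipeline_decision` and `synonyms` are hypothetical helpers.

```python
# Generic black-box probe under binary-only feedback: try synonym swaps and stop
# when the pipeline's decision flips. Illustrative only; not the paper's method.
import random

def binary_feedback_probe(text, pipeline_decision, synonyms, max_trials=200):
    original = pipeline_decision(text)        # the only signal available: True/False
    words = text.split()
    for _ in range(max_trials):
        i = random.randrange(len(words))
        candidates = synonyms(words[i])
        if not candidates:
            continue
        trial = words.copy()
        trial[i] = random.choice(candidates)
        candidate_text = " ".join(trial)
        if pipeline_decision(candidate_text) != original:
            return candidate_text              # decision flipped with one word changed
    return None                                # no flip found within the query budget
```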
arXiv:2512.20164v2 Announce Type: replace Abstract: Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by…
Fraud can pose a challenge in many resource allocation domains, including social service delivery and credit provision. For example, agents may misreport private information in order to gain benefits or access to credit. To mitigate this, a principal can design strategic audits t…
arXiv cs.LG
TIER_1·Akansha Kalra, Basavasagar Patil, Guanhong Tao, Daniel S. Brown·
arXiv:2502.03698v4 Announce Type: replace Abstract: Learning from demonstrations is a popular approach to train AI models; however, their vulnerability to adversarial attacks remains underexplored. We present the first systematic study of adversarial attacks, across a range of bo…
arXiv:2604.25965v1 Announce Type: new Abstract: Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression.…
arXiv:2604.26317v1 Announce Type: new Abstract: The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the…
arXiv cs.CV
TIER_1·Yanyun Wang, Qingqing Ye, Li Liu, Zi Liang, Haibo Hu·
arXiv:2604.26496v1 Announce Type: new Abstract: Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a sur…
Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: Varying i…
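A hedged sketch of the standard adversarial training loop the abstract starts from, reusing the `pgd_attack` helper sketched earlier in this digest; the budget and step counts are illustrative, not the paper's settings.

```python
# Hedged sketch of Madry-style adversarial training: each clean batch is replaced
# by its PGD perturbation before the usual gradient update.
import torch.nn.functional as F

def adversarial_train_epoch(model, loader, optimizer, eps=8 / 255):
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, eps=eps, alpha=eps / 4, steps=10, norm="linf")
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```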
The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the attachment of patches to clean images, known as…
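The patch threat model this abstract describes can also be sketched in a hedged way: a small trainable square is pasted onto clean images and optimized to force a chosen target class. Patch size, placement, and the optimizer below are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch of an adversarial patch: optimize a small pasted square so the
# model predicts `target_class` regardless of the underlying image.
import torch
import torch.nn.functional as F

def train_patch(model, loader, target_class, patch_size=32, epochs=5, lr=0.05):
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            patched = x.clone()
            # Paste the (clamped) patch into a fixed corner of every image.
            patched[:, :, :patch_size, :patch_size] = patch.clamp(0, 1)
            target = torch.full((x.size(0),), target_class, dtype=torch.long)
            loss = F.cross_entropy(model(patched), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return patch.detach().clamp(0, 1)
```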
Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression. We establish minimax optimal rates for adversar…
For years, I’ve relied on a straightforward method to identify sudden changes in model inputs or training data, known …