OpenAI and academic researchers reveal AI vulnerabilities to adversarial attacks
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 32 sources
OpenAI researchers are exploring the transferability of adversarial robustness across different types of perturbations in neural networks. Their findings indicate that robustness against one perturbation type does not always guarantee robustness against others and can sometimes be detrimental. They recommend evaluating adversarial defenses using a diverse range of perturbation types and sizes to ensure comprehensive security. Additionally, OpenAI is investigating adversarial examples as a concrete AI safety problem, noting their potential to cause significant issues, such as tricking autonomous vehicles.
AI
IMPACT
Highlights the ongoing challenges in securing AI systems against sophisticated adversarial attacks, necessitating robust evaluation and defense strategies.
RANK_REASON
The cluster contains multiple arXiv papers and OpenAI blog posts detailing research into adversarial examples and robustness in machine learning models.
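The summary's recommendation to evaluate defenses against a diverse range of perturbation types is concrete enough to sketch. Below is a minimal, hedged PyTorch example that measures robust accuracy separately under L-infinity and L-2 PGD attacks; the `model`, `loader`, and the attack budgets are illustrative assumptions, not details taken from the cited papers.

```python
# Hedged sketch: measure robust accuracy under more than one perturbation type.
# Assumes a PyTorch image classifier `model` and a DataLoader `loader` yielding
# (images, labels) with pixels in [0, 1]; budgets are illustrative.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, steps, norm="linf"):
    """Projected gradient descent under an L-inf or L-2 budget."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        with torch.no_grad():
            if norm == "linf":
                x_adv = x_adv + alpha * grad.sign()
                x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
            else:  # "l2"
                g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
                x_adv = x_adv + alpha * g
                delta = x_adv - x
                d_norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
                x_adv = x + delta * torch.clamp(eps / (d_norm + 1e-12), max=1.0)
            x_adv = x_adv.clamp(0, 1).detach()
    return x_adv

def robust_accuracy(model, loader, attacks):
    """Report accuracy separately per perturbation type, as the summary suggests."""
    results = {}
    for name, kwargs in attacks.items():
        correct = total = 0
        for x, y in loader:
            preds = model(pgd_attack(model, x, y, **kwargs)).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.numel()
        results[name] = correct / total
    return results

attacks = {
    "pgd_linf": dict(eps=8 / 255, alpha=2 / 255, steps=10, norm="linf"),
    "pgd_l2":   dict(eps=0.5, alpha=0.1, steps=10, norm="l2"),
}
```

A defense that scores well on one entry and poorly on the other is exactly the non-transferable robustness the summary warns about.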
We’ve created images that reliably fool neural network classifiers when viewed from varied scales and perspectives. This challenges a claim from last week that self-driving cars would be hard to trick maliciously since they capture images from multiple scales, angles, perspective…
Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines. In this post we’ll show how adversarial examples work across different mediums, and will discu…
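The "optical illusion" framing maps onto a very small amount of code. Here is a hedged sketch of the fast gradient sign method, one standard way such examples are constructed; the `model`, `image`, and `label` names are placeholders, not artifacts from the post.

```python
# Hedged FGSM sketch: nudge an input along the sign of the loss gradient so the
# classifier's prediction changes while the perturbation stays visually negligible.
import torch
import torch.nn.functional as F

def fgsm(model, image, label, eps=0.007):
    image = image.clone().detach().requires_grad_(True)
    F.cross_entropy(model(image), label).backward()
    adv = image + eps * image.grad.sign()      # one signed step of size eps
    return adv.clamp(0.0, 1.0).detach()        # keep pixels in the valid range
```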
arXiv:2605.00445v1 Announce Type: new Abstract: Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains…
Large Language Models have achieved remarkable success and are increasingly deployed in critical applications involving tabular data, such as Table Question Answering. However, their robustness to the structure of this input remains a critical, unaddressed question. This paper de…
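The excerpt does not say how the paper probes table structure, but the general idea can be sketched: present the same table under answer-preserving permutations and check whether the model's answer is stable. `ask_table_qa` below is a hypothetical helper, not the paper's interface.

```python
# Hedged sketch of a structural-robustness probe for Table QA: permute rows and
# columns (which should not change the answer) and measure answer stability.
import random

def structural_consistency(ask_table_qa, header, rows, question, n_variants=5):
    baseline = ask_table_qa(header, rows, question)
    agree = 0
    for _ in range(n_variants):
        row_variant = random.sample(rows, k=len(rows))                 # reorder rows
        col_order = random.sample(range(len(header)), k=len(header))   # reorder columns
        variant_header = [header[j] for j in col_order]
        variant_rows = [[r[j] for j in col_order] for r in row_variant]
        if ask_table_qa(variant_header, variant_rows, question) == baseline:
            agree += 1
    return agree / n_variants  # 1.0 means fully stable under permutation
```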
arXiv:2604.27019v1 Announce Type: cross Abstract: Safety-aligned language models must refuse harmful requests without collapsing into broad over-refusal, but the training-time mechanisms behind this tradeoff remain unclear. Prior work characterizes refusal directions and jailbrea…
arXiv:2604.27487v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA), which leverages the insight that model updates typically reside in a low-dimensional space, has significantly improved the training efficiency of Large Language Models (LLMs) by updating neural network la…
arXiv:2604.16399v2 Announce Type: replace-cross Abstract: The widespread adoption of AI-assisted development tools in 2025 -- and the emergence of vibe coding, a practice of generating complete applications from natural language without verification -- exposed a critical and tool…
arXiv:2604.28093v1 Announce Type: new Abstract: Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks qui…
arXiv:2604.27249v1 Announce Type: cross Abstract: When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial inst…
arXiv:2604.28126v1 Announce Type: cross Abstract: Diffusion models offer superior generation quality at the expense of extensive sampling steps. Distillation methods, with Distribution Matching Distillation (DMD) as a popular example, can mitigate this issue, but performance degr…
Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review …
Low-Rank Adaptation (LoRA), which leverages the insight that model updates typically reside in a low-dimensional space, has significantly improved the training efficiency of Large Language Models (LLMs) by updating neural network layers using low-rank matrices. Since the generati…
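For readers unfamiliar with the low-rank update the abstract leans on, here is a minimal, hedged sketch of a LoRA-style linear layer in PyTorch; the rank, scaling, and initialization follow the common LoRA formulation rather than anything specific to this paper.

```python
# Hedged LoRA sketch: a frozen base weight plus a trainable low-rank update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)                           # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T ; only A and B receive gradients.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```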
arXiv cs.CL
TIER_1·Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Hahn, Michael Backes, Yue Zhang, Linyi Yang·
arXiv:2604.26506v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical th…
When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two i…
As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts -- adversarial instructions embedded in submissions to manipulate outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we…
arXiv:2512.20677v4 Announce Type: replace-cross Abstract: The increasing deployment of large language models (LLMs) in safety-critical applications raises fundamental challenges in systematically evaluating robustness against adversarial behaviors. Existing red-teaming practices …
Cognitive science often evaluates theories through narrow paradigms and local model comparisons, limiting the integration of evidence across tasks and realizations. We introduce an automated adversarial collaboration framework for adjudicating among competing theories even when t…
arXiv cs.AI
TIER_1·Vishruti Kakkad (Carnegie Mellon University), Paul Chung (University of California, San Diego), Hanan Hibshi (Carnegie Mellon University, King Abdulaziz University), Maverick Woo (Carnegie Mellon University)·
arXiv:2602.04753v2 Announce Type: replace-cross Abstract: An exponential growth of Machine Learning and its Generative AI applications brings with it significant security challenges, often referred to as Adversarial Machine Learning (AML). In this paper, we conducted two comprehe…
arXiv:2604.23483v1 Announce Type: new Abstract: Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradi…
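The constraints the abstract lists, binary-only feedback and no gradients, admit only query-based search. As a generic illustration, and not the paper's method, here is a hedged sketch of a random word-substitution probe that uses nothing but the pipeline's accept/reject decision; `pipeline_decision` and `synonyms` are hypothetical helpers.

```python
# Generic black-box probe under binary-only feedback: try synonym swaps and stop
# when the pipeline's decision flips. Illustrative only; not the paper's method.
import random

def binary_feedback_probe(text, pipeline_decision, synonyms, max_trials=200):
    original = pipeline_decision(text)        # the only signal available: True/False
    words = text.split()
    for _ in range(max_trials):
        i = random.randrange(len(words))
        candidates = synonyms(words[i])
        if not candidates:
            continue
        trial = words.copy()
        trial[i] = random.choice(candidates)
        candidate_text = " ".join(trial)
        if pipeline_decision(candidate_text) != original:
            return candidate_text              # decision flipped with one word changed
    return None                                # no flip found within the query budget
```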
arXiv:2512.20164v2 Announce Type: replace Abstract: Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by…
Fraud can pose a challenge in many resource allocation domains, including social service delivery and credit provision. For example, agents may misreport private information in order to gain benefits or access to credit. To mitigate this, a principal can design strategic audits t…
arXiv cs.LG
TIER_1·Akansha Kalra, Basavasagar Patil, Guanhong Tao, Daniel S. Brown·
arXiv:2502.03698v4 Announce Type: replace Abstract: Learning from demonstrations is a popular approach to train AI models; however, their vulnerability to adversarial attacks remains underexplored. We present the first systematic study of adversarial attacks, across a range of bo…
arXiv:2604.25965v1 Announce Type: new Abstract: Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression.…
arXiv:2604.26317v1 Announce Type: new Abstract: The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the…
arXiv cs.CV
TIER_1·Yanyun Wang, Qingqing Ye, Li Liu, Zi Liang, Haibo Hu·
arXiv:2604.26496v1 Announce Type: new Abstract: Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a sur…
Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: Varying i…
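A hedged sketch of the standard adversarial training loop the abstract starts from, reusing the `pgd_attack` helper sketched earlier in this digest; the budget and step counts are illustrative, not the paper's settings.

```python
# Hedged sketch of Madry-style adversarial training: each clean batch is replaced
# by its PGD perturbation before the usual gradient update.
import torch.nn.functional as F

def adversarial_train_epoch(model, loader, optimizer, eps=8 / 255):
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, eps=eps, alpha=eps / 4, steps=10, norm="linf")
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```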
The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the attachment of patches to clean images, known as…
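The patch threat model this abstract describes can also be sketched in a hedged way: a small trainable square is pasted onto clean images and optimized to force a chosen target class. Patch size, placement, and the optimizer below are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch of an adversarial patch: optimize a small pasted square so the
# model predicts `target_class` regardless of the underlying image.
import torch
import torch.nn.functional as F

def train_patch(model, loader, target_class, patch_size=32, epochs=5, lr=0.05):
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            patched = x.clone()
            # Paste the (clamped) patch into a fixed corner of every image.
            patched[:, :, :patch_size, :patch_size] = patch.clamp(0, 1)
            target = torch.full((x.size(0),), target_class, dtype=torch.long)
            loss = F.cross_entropy(model(patched), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return patch.detach().clamp(0, 1)
```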
Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression. We establish minimax optimal rates for adversar…
For years, I’ve relied on a straightforward method to identify sudden changes in model inputs or training data, known …