New research tackles LLM jailbreaks with dynamic evaluation and robust defense strategies
By PulseAugur Editorial
Summary by gemini-2.5-flash-lite from 8 sources
Multiple research papers explore advanced techniques for enhancing the safety and robustness of large language models (LLMs) against jailbreak attacks. These studies introduce novel frameworks and methods for evaluating and defending against adversarial prompts that aim to elicit harmful outputs. The research focuses on developing more comprehensive evaluation metrics, adaptive attack generation strategies, and robust detection mechanisms that can identify subtle patterns in model behavior.
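To make the kind of "evaluation metric" these papers rely on concrete, here is a minimal sketch of an attack success rate (ASR) computation over a set of adversarial prompts. The refusal-marker heuristic, the prompt list, and the generate_fn callback are hypothetical stand-ins for illustration only; none of the cited papers prescribes this exact recipe.

```python
# Minimal sketch: attack success rate (ASR) over adversarial prompts.
# `generate_fn` is a hypothetical callable wrapping whatever model is under test;
# the refusal-marker check is a crude but common proxy for "the attack failed".
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Heuristic: treat responses that open with a refusal phrase as safe."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def attack_success_rate(prompts: Iterable[str],
                        generate_fn: Callable[[str], str]) -> float:
    """Fraction of adversarial prompts that elicit a non-refusal response."""
    prompts = list(prompts)
    successes = sum(0 if is_refusal(generate_fn(p)) else 1 for p in prompts)
    return successes / max(len(prompts), 1)

if __name__ == "__main__":
    # Toy stand-in model that refuses everything, so the ASR should be 0.0.
    asr = attack_success_rate(["example adversarial prompt"],
                              lambda p: "I'm sorry, I can't help with that.")
    print(f"ASR: {asr:.2f}")
```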
arXiv:2605.06605v1 Announce Type: new Abstract: Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repeated interactions. Th…
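A toy calculation (not taken from the paper) shows why such events only surface after many turns, and why full multi-turn evaluation gets expensive: even a small per-turn event probability compounds quickly over a long conversation.

```python
# Toy illustration: a small per-turn jailbreak probability compounds over turns.
# The 2% per-turn figure is an invented number, not an estimate from the paper.
def cumulative_event_probability(per_turn_p: float, turns: int) -> float:
    """P(at least one event within `turns` turns), assuming independent turns."""
    return 1.0 - (1.0 - per_turn_p) ** turns

for turns in (1, 5, 20, 50):
    print(turns, round(cumulative_event_probability(0.02, turns), 3))
# 1 -> 0.02, 5 -> ~0.096, 20 -> ~0.332, 50 -> ~0.636
```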
Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust,…
arXiv cs.LG
Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen
arXiv:2605.02958v1 Announce Type: cross Abstract: Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome.…
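For background on the "static refusal vector" baseline this abstract moves beyond, the sketch below shows the standard difference-of-means recipe: subtract the mean hidden state on complied prompts from the mean on refused prompts and use the result as a steering direction. The arrays are random placeholders, not the paper's method or data.

```python
# Minimal sketch of a static refusal vector, assuming hidden states (one vector
# per prompt) have already been collected from some layer of the model.
# The random arrays below are placeholders for real activations.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64
acts_refused = rng.normal(size=(32, hidden_dim))   # activations on refused prompts
acts_complied = rng.normal(size=(32, hidden_dim))  # activations on complied prompts

# Difference-of-means direction, normalized to unit length.
refusal_vec = acts_refused.mean(axis=0) - acts_complied.mean(axis=0)
refusal_vec /= np.linalg.norm(refusal_vec)

def steer(hidden_state: np.ndarray, alpha: float) -> np.ndarray:
    """Add (alpha > 0) or subtract (alpha < 0) the refusal direction."""
    return hidden_state + alpha * refusal_vec

print(steer(rng.normal(size=hidden_dim), alpha=2.0).shape)  # (64,)
```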
arXiv:2605.02921v1 Announce Type: cross Abstract: As LLMs continue to shape real-world applications, automated jailbreak generation becomes essential to reveal safety weaknesses and guide model improvement. Existing automatic jailbreak generation methods have not yet fully consid…
arXiv:2605.01687v1 Announce Type: new Abstract: We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making them easier to bypass safety-aligned LLM…
arXiv:2601.19487v2 Announce Type: replace Abstract: Safety-aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fun…
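To make the magnitude trade-off this abstract points at concrete, here is a toy illustration: scaling an answer/refusal direction harder pushes harmful prompts toward refusal, but eventually pushes borderline benign prompts past the same threshold. The vectors, threshold, and scaling factors are all invented for illustration and do not reproduce the paper's method.

```python
# Toy illustration of the magnitude trade-off in vector steering: the scaling
# that makes a harmful prompt refuse also drags a borderline benign prompt
# past the (made-up) refusal threshold, i.e. over-refusal.
import numpy as np

answer_dir = np.array([1.0, 0.0])   # hypothetical "refuse" direction
harmful = np.array([0.2, 1.0])      # projection onto refuse dir: 0.2
benign = np.array([0.1, 1.0])       # projection onto refuse dir: 0.1
THRESHOLD = 0.5                     # refuse if the projection exceeds this

for alpha in (0.0, 0.35, 0.6):
    h = harmful + alpha * answer_dir
    b = benign + alpha * answer_dir
    print(f"alpha={alpha}: harmful refused={h @ answer_dir > THRESHOLD}, "
          f"benign refused={b @ answer_dir > THRESHOLD}")
# alpha=0.0 -> neither refused; alpha=0.35 -> only the harmful prompt refused;
# alpha=0.6 -> both refused (over-refusal).
```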
arXiv:2605.00974v1 Announce Type: cross Abstract: LLMs are increasingly equipped with safety alignment mechanisms, yet recent studies demonstrate that they remain vulnerable to jailbreaking attacks that elicit harmful behaviors without explicit policy violations. While a growing …