New research tackles LLM jailbreaks with dynamic evaluation and robust defense strategies
By PulseAugur Editorial
Summary by gemini-2.5-flash-lite from 8 sources
Multiple research papers explore advanced techniques for enhancing the safety and robustness of large language models (LLMs) against jailbreak attacks. These studies introduce novel frameworks and methods for evaluating and defending against adversarial prompts that aim to elicit harmful outputs. The research focuses on developing more comprehensive evaluation metrics, adaptive attack generation strategies, and robust detection mechanisms that can identify subtle patterns in model behavior.
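To make the kind of "evaluation metric" these papers rely on concrete, here is a minimal sketch of an attack success rate (ASR) computation over a set of adversarial prompts. The refusal-marker heuristic, the prompt list, and the generate_fn callback are hypothetical stand-ins for illustration only; none of the cited papers prescribes this exact recipe.

```python
# Minimal sketch: attack success rate (ASR) over adversarial prompts.
# `generate_fn` is a hypothetical callable wrapping whatever model is under test;
# the refusal-marker check is a crude but common proxy for "the attack failed".
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Heuristic: treat responses that open with a refusal phrase as safe."""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def attack_success_rate(prompts: Iterable[str],
                        generate_fn: Callable[[str], str]) -> float:
    """Fraction of adversarial prompts that elicit a non-refusal response."""
    prompts = list(prompts)
    successes = sum(0 if is_refusal(generate_fn(p)) else 1 for p in prompts)
    return successes / max(len(prompts), 1)

if __name__ == "__main__":
    # Toy stand-in model that refuses everything, so the ASR should be 0.0.
    asr = attack_success_rate(["example adversarial prompt"],
                              lambda p: "I'm sorry, I can't help with that.")
    print(f"ASR: {asr:.2f}")
```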
arXiv:2605.06605v1 Announce Type: new Abstract: Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repeated interactions. Th…
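A toy calculation (not taken from the paper) shows why such events only surface after many turns, and why full multi-turn evaluation gets expensive: even a small per-turn event probability compounds quickly over a long conversation.

```python
# Toy illustration: a small per-turn jailbreak probability compounds over turns.
# The 2% per-turn figure is an invented number, not an estimate from the paper.
def cumulative_event_probability(per_turn_p: float, turns: int) -> float:
    """P(at least one event within `turns` turns), assuming independent turns."""
    return 1.0 - (1.0 - per_turn_p) ** turns

for turns in (1, 5, 20, 50):
    print(turns, round(cumulative_event_probability(0.02, turns), 3))
# 1 -> 0.02, 5 -> ~0.096, 20 -> ~0.332, 50 -> ~0.636
```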
Large Language Models (LLMs) have achieved remarkable success but remain highly susceptible to jailbreak attacks, in which adversarial prompts coerce models into generating harmful, unethical, or policy-violating outputs. Such attacks pose real-world risks, eroding safety, trust,…
arXiv cs.LG
Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen
arXiv:2605.02958v1 Announce Type: cross Abstract: Representation Engineering typically relies on static refusal vectors derived from terminal representations. We move beyond this paradigm, demonstrating that refusal is a dynamic and sparse process rather than a localized outcome.…
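For background on the "static refusal vector" baseline this abstract moves beyond, the sketch below shows the standard difference-of-means recipe: subtract the mean hidden state on complied prompts from the mean on refused prompts and use the result as a steering direction. The arrays are random placeholders, not the paper's method or data.

```python
# Minimal sketch of a static refusal vector, assuming hidden states (one vector
# per prompt) have already been collected from some layer of the model.
# The random arrays below are placeholders for real activations.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64
acts_refused = rng.normal(size=(32, hidden_dim))   # activations on refused prompts
acts_complied = rng.normal(size=(32, hidden_dim))  # activations on complied prompts

# Difference-of-means direction, normalized to unit length.
refusal_vec = acts_refused.mean(axis=0) - acts_complied.mean(axis=0)
refusal_vec /= np.linalg.norm(refusal_vec)

def steer(hidden_state: np.ndarray, alpha: float) -> np.ndarray:
    """Add (alpha > 0) or subtract (alpha < 0) the refusal direction."""
    return hidden_state + alpha * refusal_vec

print(steer(rng.normal(size=hidden_dim), alpha=2.0).shape)  # (64,)
```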
arXiv:2605.02921v1 Announce Type: cross Abstract: As LLMs continue to shape real-world applications, automated jailbreak generation becomes essential to reveal safety weaknesses and guide model improvement. Existing automatic jailbreak generation methods have not yet fully consid…
arXiv:2605.01687v1 Announce Type: new Abstract: We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, making them easier to bypass safety-aligned LLM…
arXiv:2601.19487v2 Announce Type: replace Abstract: Safety-aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fun…
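To make the magnitude trade-off this abstract points at concrete, here is a toy illustration: scaling an answer/refusal direction harder pushes harmful prompts toward refusal, but eventually pushes borderline benign prompts past the same threshold. The vectors, threshold, and scaling factors are all invented for illustration and do not reproduce the paper's method.

```python
# Toy illustration of the magnitude trade-off in vector steering: the scaling
# that makes a harmful prompt refuse also drags a borderline benign prompt
# past the (made-up) refusal threshold, i.e. over-refusal.
import numpy as np

answer_dir = np.array([1.0, 0.0])   # hypothetical "refuse" direction
harmful = np.array([0.2, 1.0])      # projection onto refuse dir: 0.2
benign = np.array([0.1, 1.0])       # projection onto refuse dir: 0.1
THRESHOLD = 0.5                     # refuse if the projection exceeds this

for alpha in (0.0, 0.35, 0.6):
    h = harmful + alpha * answer_dir
    b = benign + alpha * answer_dir
    print(f"alpha={alpha}: harmful refused={h @ answer_dir > THRESHOLD}, "
          f"benign refused={b @ answer_dir > THRESHOLD}")
# alpha=0.0 -> neither refused; alpha=0.35 -> only the harmful prompt refused;
# alpha=0.6 -> both refused (over-refusal).
```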
arXiv:2605.00974v1 Announce Type: cross Abstract: LLMs are increasingly equipped with safety alignment mechanisms, yet recent studies demonstrate that they remain vulnerable to jailbreaking attacks that elicit harmful behaviors without explicit policy violations. While a growing …