HarmBench
PulseAugur coverage of HarmBench — every cluster mentioning HarmBench across labs, papers, and developer communities, ranked by signal.
-
New research tackles LLM jailbreaks with dynamic evaluation and robust defense strategies
Multiple research papers explore advanced techniques for enhancing the safety and robustness of large language models (LLMs) against jailbreak attacks. These studies introduce novel frameworks and methods for evaluating…
-
CorrSteer method enhances LLM steering using correlated sparse autoencoder features
Researchers have developed CorrSteer, a novel method for steering large language models (LLMs) during generation using features extracted from Sparse Autoencoders (SAEs). This technique correlates sample correctness wit…
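The truncated summary already names the core recipe: correlate each SAE feature's per-sample activation with whether the model answered that sample correctly, then steer generation with the decoder directions of the most correlated features. A minimal numpy sketch of that selection step, assuming a precomputed activation matrix and a binary correctness vector (all names here are illustrative, not CorrSteer's actual API):

```python
import numpy as np

def select_steering_features(feature_acts, correct, k=8):
    """Rank SAE features by Pearson correlation between their mean
    per-sample activation and sample correctness (0/1).

    feature_acts: (n_samples, n_features) activations
    correct:      (n_samples,) binary correctness labels
    """
    x = feature_acts - feature_acts.mean(axis=0)
    y = correct - correct.mean()
    corr = (x * y[:, None]).mean(axis=0) / (x.std(axis=0) * y.std() + 1e-8)
    top = np.argsort(-np.abs(corr))[:k]
    return top, corr[top]

def steering_vector(decoder, top, corr, scale=4.0):
    """Combine the selected features' SAE decoder rows, shape
    (n_features, d_model), into one vector signed by correlation."""
    return scale * (np.sign(corr)[:, None] * decoder[top]).sum(axis=0)
```

In practice the resulting vector would be added to the residual stream at the SAE's hook point during generation; the scale and sign conventions above are placeholders.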
-
New Logit-Gap Steering method efficiently measures AI alignment robustness
Researchers have developed a new metric called the refusal-affirmation logit gap to quantify the safety margin of aligned language models. This metric, which measures the difference between refusal and affirmation token…
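As described, the metric is a directly computable quantity: the difference between the logit the model assigns to refusal-style openings and to affirmation-style ones at the first response position. A minimal sketch of one way to compute it for a Hugging Face causal LM (the token lists and sign convention are assumptions, not the paper's exact definition):

```python
import torch

def refusal_affirmation_gap(model, tokenizer, prompt,
                            refusal=("I", "Sorry", "As"),
                            affirm=("Sure", "Here", "Of")):
    """Positive gap: the model leans toward refusing; negative:
    toward complying. BPE vocabularies often need leading-space
    variants of these candidate tokens."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits
    r = max(logits[tokenizer.convert_tokens_to_ids(t)] for t in refusal)
    a = max(logits[tokenizer.convert_tokens_to_ids(t)] for t in affirm)
    return (r - a).item()
```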
-
New attack redirects LLM attention to bypass safety alignment
Researchers have developed a new white-box adversarial attack called the Attention Redistribution Attack (ARA) that targets the internal attention mechanisms of safety-aligned large language models. This attack crafts n…
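The headline's mechanism, pushing attention away from safety-relevant tokens, can be sketched as a differentiable objective even without the paper's details. A toy version of such a loss for an HF-style model with output_attentions; the span indexing and the commented optimization step are illustrative, not the paper's ARA implementation:

```python
import torch

def attention_mass_loss(model, input_embeds, span):
    """Attention mass flowing from the final position into `span`
    (e.g. the harmful-instruction tokens), summed over layers and
    heads. A white-box attacker would minimize this so those
    tokens stop shaping the next prediction."""
    out = model(inputs_embeds=input_embeds, output_attentions=True)
    mass = input_embeds.new_zeros(())
    for attn in out.attentions:                    # (batch, heads, q, k)
        mass = mass + attn[0, :, -1, span[0]:span[1]].sum()
    return mass

# Toy soft-suffix step (hypothetical):
#   suffix = torch.nn.Parameter(embeds[:, -n_adv:].clone())
#   loss = attention_mass_loss(model, torch.cat([prefix, suffix], 1), span)
#   loss.backward(); suffix.data -= lr * suffix.grad
```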
-
New red-teaming method ContextualJailbreak bypasses LLM safety alignment
Researchers have developed ContextualJailbreak, an evolutionary red-teaming strategy designed to find vulnerabilities in large language models. This black-box approach uses simulated multi-turn dialogues and a graded ha…
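Whatever the paper's specifics, the evolutionary loop it describes has a standard black-box shape: maintain a population of multi-turn dialogue candidates, score each with a graded harm judge, keep the fittest, and mutate. A generic sketch with stubbed target, judge, and mutation callables (all hypothetical, not the paper's code):

```python
import random

def evolve(seeds, target, judge, mutate,
           generations=20, pop_size=32, keep=8):
    """Black-box evolutionary red-teaming loop.

    target: runs the simulated multi-turn dialogue, returns transcript
    judge:  graded harm score in [0, 1] for a transcript
    mutate: perturbs one dialogue candidate
    """
    population = list(seeds)
    for _ in range(generations):
        ranked = sorted(population, key=lambda d: judge(target(d)),
                        reverse=True)
        elites = ranked[:keep]
        population = elites + [mutate(random.choice(elites))
                               for _ in range(pop_size - keep)]
    return max(population, key=lambda d: judge(target(d)))
```

A real harness would cache judge scores rather than re-query the target for every comparison; the sketch omits that for brevity.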
-
New tool AgentSeer reveals critical gaps in LLM agentic security
Researchers have developed a new tool called AgentSeer to evaluate the vulnerabilities of large language models (LLMs) when they are deployed in agentic systems. This tool decomposes agent executions into action-compone…
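The decomposition named in the summary can be pictured as a trace schema: each execution split into typed action components that can be probed individually. A hypothetical schema sketch (field names are guesses from the description, not AgentSeer's format):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ActionComponent:
    kind: str                   # e.g. "reasoning", "tool_call", "observation"
    content: str                # raw text of this step
    tool: Optional[str] = None  # tool name when kind == "tool_call"

@dataclass
class AgentExecution:
    task: str
    components: list[ActionComponent] = field(default_factory=list)

    def attack_surface(self):
        """Observation steps are where external, possibly injected,
        content enters the loop: a natural place to probe first."""
        return [c for c in self.components if c.kind == "observation"]
```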
-
LLM safety benchmarks show high sensitivity to judge configuration choices
A new research paper highlights significant variability in AI safety benchmark results due to judge configuration choices. The study found that altering prompt wording alone, while keeping the judge model constant, coul…
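That sensitivity is easy to reproduce in miniature: score a fixed set of responses with one judge model under several prompt wordings and count how often the verdict flips. A minimal harness sketch; the judge callable and templates are assumptions:

```python
from itertools import combinations

def verdict_flip_rate(judge, templates, responses):
    """Fraction of responses whose harmful/safe verdict disagrees
    between at least one pair of judge prompt templates, with the
    judge model itself held constant."""
    verdicts = [[judge(t.format(response=r)) for t in templates]
                for r in responses]
    flips = sum(any(a != b for a, b in combinations(v, 2))
                for v in verdicts)
    return flips / len(responses)
```

Here the templates might differ only in wording, e.g. "Is this response harmful? {response}" versus "Does this response violate the policy? {response}".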