HarmBench
PulseAugur coverage of HarmBench — every cluster mentioning HarmBench across labs, papers, and developer communities, ranked by signal.
-
New research tackles LLM jailbreaks with dynamic evaluation and robust defense strategies
Multiple research papers explore advanced techniques for enhancing the safety and robustness of large language models (LLMs) against jailbreak attacks. These studies introduce novel frameworks and methods for evaluating…
-
CorrSteer method enhances LLM steering using correlated sparse autoencoder features
Researchers have developed CorrSteer, a novel method for steering large language models (LLMs) during generation using features extracted from Sparse Autoencoders (SAEs). This technique correlates sample correctness wit…
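The truncated summary already names the core recipe: correlate each SAE feature's per-sample activation with whether the model answered that sample correctly, then steer generation with the decoder directions of the most correlated features. A minimal numpy sketch of that selection step, assuming a precomputed activation matrix and a binary correctness vector (all names here are illustrative, not CorrSteer's actual API):

```python
import numpy as np

def select_steering_features(feature_acts, correct, k=8):
    """Rank SAE features by Pearson correlation between their mean
    per-sample activation and sample correctness (0/1).

    feature_acts: (n_samples, n_features) activations
    correct:      (n_samples,) binary correctness labels
    """
    x = feature_acts - feature_acts.mean(axis=0)
    y = correct - correct.mean()
    corr = (x * y[:, None]).mean(axis=0) / (x.std(axis=0) * y.std() + 1e-8)
    top = np.argsort(-np.abs(corr))[:k]
    return top, corr[top]

def steering_vector(decoder, top, corr, scale=4.0):
    """Combine the selected features' SAE decoder rows, shape
    (n_features, d_model), into one vector signed by correlation."""
    return scale * (np.sign(corr)[:, None] * decoder[top]).sum(axis=0)
```

In practice the resulting vector would be added to the residual stream at the SAE's hook point during generation; the scale and sign conventions above are placeholders.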
-
New Logit-Gap Steering method efficiently measures AI alignment robustness
Researchers have developed a new metric called the refusal-affirmation logit gap to quantify the safety margin of aligned language models. This metric, which measures the difference between refusal and affirmation token…
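As described, the metric is a directly computable quantity: the difference between the logit the model assigns to refusal-style openings and to affirmation-style ones at the first response position. A minimal sketch of one way to compute it for a Hugging Face causal LM (the token lists and sign convention are assumptions, not the paper's exact definition):

```python
import torch

def refusal_affirmation_gap(model, tokenizer, prompt,
                            refusal=("I", "Sorry", "As"),
                            affirm=("Sure", "Here", "Of")):
    """Positive gap: the model leans toward refusing; negative:
    toward complying. BPE vocabularies often need leading-space
    variants of these candidate tokens."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits
    r = max(logits[tokenizer.convert_tokens_to_ids(t)] for t in refusal)
    a = max(logits[tokenizer.convert_tokens_to_ids(t)] for t in affirm)
    return (r - a).item()
```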
-
New attack redirects LLM attention to bypass safety alignment
Researchers have developed a new white-box adversarial attack called the Attention Redistribution Attack (ARA) that targets the internal attention mechanisms of safety-aligned large language models. This attack crafts n…
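The headline's mechanism, pushing attention away from safety-relevant tokens, can be sketched as a differentiable objective even without the paper's details. A toy version of such a loss for an HF-style model with output_attentions; the span indexing and the commented optimization step are illustrative, not the paper's ARA implementation:

```python
import torch

def attention_mass_loss(model, input_embeds, span):
    """Attention mass flowing from the final position into `span`
    (e.g. the harmful-instruction tokens), summed over layers and
    heads. A white-box attacker would minimize this so those
    tokens stop shaping the next prediction."""
    out = model(inputs_embeds=input_embeds, output_attentions=True)
    mass = input_embeds.new_zeros(())
    for attn in out.attentions:                    # (batch, heads, q, k)
        mass = mass + attn[0, :, -1, span[0]:span[1]].sum()
    return mass

# Toy soft-suffix step (hypothetical):
#   suffix = torch.nn.Parameter(embeds[:, -n_adv:].clone())
#   loss = attention_mass_loss(model, torch.cat([prefix, suffix], 1), span)
#   loss.backward(); suffix.data -= lr * suffix.grad
```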
-
New red-teaming method ContextualJailbreak bypasses LLM safety alignment
Researchers have developed ContextualJailbreak, an evolutionary red-teaming strategy designed to find vulnerabilities in large language models. This black-box approach uses simulated multi-turn dialogues and a graded ha…
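Whatever the paper's specifics, the evolutionary loop it describes has a standard black-box shape: maintain a population of multi-turn dialogue candidates, score each with a graded harm judge, keep the fittest, and mutate. A generic sketch with stubbed target, judge, and mutation callables (all hypothetical, not the paper's code):

```python
import random

def evolve(seeds, target, judge, mutate,
           generations=20, pop_size=32, keep=8):
    """Black-box evolutionary red-teaming loop.

    target: runs the simulated multi-turn dialogue, returns transcript
    judge:  graded harm score in [0, 1] for a transcript
    mutate: perturbs one dialogue candidate
    """
    population = list(seeds)
    for _ in range(generations):
        ranked = sorted(population, key=lambda d: judge(target(d)),
                        reverse=True)
        elites = ranked[:keep]
        population = elites + [mutate(random.choice(elites))
                               for _ in range(pop_size - keep)]
    return max(population, key=lambda d: judge(target(d)))
```

A real harness would cache judge scores rather than re-query the target for every comparison; the sketch omits that for brevity.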
-
New tool AgentSeer reveals critical gaps in LLM agentic security
Researchers have developed a new tool called AgentSeer to evaluate the vulnerabilities of large language models (LLMs) when they are deployed in agentic systems. This tool decomposes agent executions into action-compone…
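The decomposition named in the summary can be pictured as a trace schema: each execution split into typed action components that can be probed individually. A hypothetical schema sketch (field names are guesses from the description, not AgentSeer's format):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ActionComponent:
    kind: str                   # e.g. "reasoning", "tool_call", "observation"
    content: str                # raw text of this step
    tool: Optional[str] = None  # tool name when kind == "tool_call"

@dataclass
class AgentExecution:
    task: str
    components: list[ActionComponent] = field(default_factory=list)

    def attack_surface(self):
        """Observation steps are where external, possibly injected,
        content enters the loop: a natural place to probe first."""
        return [c for c in self.components if c.kind == "observation"]
```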
-
LLM safety benchmarks show high sensitivity to judge configuration choices
A new research paper highlights significant variability in AI safety benchmark results due to judge configuration choices. The study found that altering prompt wording alone, while keeping the judge model constant, coul…
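That sensitivity is easy to reproduce in miniature: score a fixed set of responses with one judge model under several prompt wordings and count how often the verdict flips. A minimal harness sketch; the judge callable and templates are assumptions:

```python
from itertools import combinations

def verdict_flip_rate(judge, templates, responses):
    """Fraction of responses whose harmful/safe verdict disagrees
    between at least one pair of judge prompt templates, with the
    judge model itself held constant."""
    verdicts = [[judge(t.format(response=r)) for t in templates]
                for r in responses]
    flips = sum(any(a != b for a, b in combinations(v, 2))
                for v in verdicts)
    return flips / len(responses)
```

Here the templates might differ only in wording, e.g. "Is this response harmful? {response}" versus "Does this response violate the policy? {response}".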