ENTITY
AdvBench
AdvBench
PulseAugur coverage of AdvBench — every cluster mentioning AdvBench across labs, papers, and developer communities, ranked by signal.
Total · 30d
2
2 over 90d
Releases · 30d
0
0 over 90d
Papers · 30d
2
2 over 90d
TIER MIX · 90D
RECENT · PAGE 1/1 · 2 TOTAL
-
New Logit-Gap Steering method efficiently measures AI alignment robustness
Researchers have developed a new metric called the refusal-affirmation logit gap to quantify the safety margin of aligned language models. This metric, which measures the difference between refusal and affirmation token…
-
New diagnostic tool probes LLM circuits for safety and behavior insights
A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. This technique uses two forward passes per prompt to identify and analyze "be…