A new research paper highlights significant variability in AI safety benchmark results due to judge configuration choices. The study found that altering prompt wording alone, while keeping the judge model constant, could shift measured harmful response rates by as much as 24.2 percentage points. This sensitivity impacts the stability of model safety rankings, with category-level variations ranging up to 39.6 percentage points. The research underscores that the specific wording of prompts used with LLM judges is a critical, under-examined factor influencing safety evaluations. AI
Summary written by gemini-2.5-flash-lite from 2 sources. How we write summaries →
IMPACT Reveals that current AI safety benchmarks may be unreliable due to prompt sensitivity, necessitating more robust evaluation methods.
RANK_REASON Academic paper detailing new findings on AI safety benchmark methodology.