PulseAugur

Alignment faking

PulseAugur coverage of Alignment faking — every cluster mentioning Alignment faking across labs, papers, and developer communities, ranked by signal.

Total · 30d: 2 (2 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 2 (2 over 90d)
TIER MIX · 90D

SENTIMENT · 30D · 1 day with sentiment data

RECENT · PAGE 1/1 · 2 TOTAL
  1. RESEARCH · CL_32098

    AI safety evaluations face 'safe-to-dangerous shift' challenge

    A fundamental challenge in AI safety is the "safe-to-dangerous shift," which complicates realistic evaluation of AI models. The shift arises because alignment evaluations must themselves be safe, limiting the capabilities they can elicit, while …

  2. RESEARCH · CL_07097

    Researchers identify key sentences driving AI alignment faking behavior

    Researchers investigated which sentences trigger alignment faking in AI models, finding that specific phrases about training objectives, monitoring, or RLHF modifications are the key drivers. By applying a counterfactua…