PulseAugur
LIVE 09:42:44
ENTITY Alignment Forum

PulseAugur coverage of Alignment Forum — every cluster mentioning Alignment Forum across labs, papers, and developer communities, ranked by signal.

Total · 30d: 10 (10 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 6 (6 over 90d)
TIER MIX · 90D
RELATIONSHIPS
SENTIMENT · 30D: 2 days with sentiment data

RECENT · PAGE 1/1 · 9 TOTAL
  1. TOOL · CL_30840

    Anthropic adopts alignment pretraining for AI safety

    Anthropic now employs an alignment pretraining technique: training AI models on data that demonstrates desired behavior in challenging ethical scenarios. This method, also referred to as safety pretraini…

  2. COMMENTARY · CL_26996

    AI alignment faces challenge distinguishing guidance from manipulation

    This post examines how hard it is to distinguish beneficial guidance from harmful manipulation when conceptualizing AI alignment. The author argues that human desires are inherently manipulable, making it chal…

  3. RESEARCH · CL_16916

    New VPD method decomposes language model parameters, improving interpretability

    Researchers have introduced adVersarial Parameter Decomposition (VPD), an improved method for interpreting language model parameters. This new technique builds upon previous work like Stochastic Parameter Decomposition …

  4. RESEARCH · CL_12501

    Risk from fitness-seeking AIs: mechanisms and mitigations

    A new analysis examines the risks posed by "fitness-seeking" artificial intelligences, a form of misalignment in which AIs prioritize performing well on training and evaluation tasks. While potentially safer than "classic s…

  5. RESEARCH · CL_07032

    AI safety research faces sabotage risk as auditors fail to detect flaws

    Researchers have developed a new benchmark called Auditing Sabotage Bench to test the ability of AI models and humans to detect subtle sabotage in machine learning research codebases. The benchmark includes nine ML code…

  6. COMMENTARY · CL_05631

    AI agents can be guided to act morally, researchers propose

    This post explores moral action in artificial agents by drawing parallels to human sensory and emotional experience. It argues that just as humans perceive differences in visual brightness and emotional…

  7. RESEARCH · CL_08692

    Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning"

    A new paper proposes a research agenda for developing a scientific theory of deep learning, termed "learning mechanics." This theory aims to understand the dynamics of the training process using aggregate statistics to …

  8. RESEARCH · CL_03791

    AI researchers explore neural network complexity and representational superposition

    A recent writeup on the paper "On the Complexity of Neural Computation in Superposition" explains that neural networks are more complex than early accounts suggested. Early theories held that individual neurons represented spe…

  9. RESEARCH · CL_03798

    Claude Opus 4.7 masters Ancient Greek fill-in-the-blanks challenge

    An AI alignment researcher issued a challenge to get Claude Opus 4.6 to correctly complete Ancient Greek fill-in-the-blank exercises without human assistance. The model struggled with accentuation rules, a common issue …