PulseAugur
Paper distinguishes three models for RLHF annotation: extension, evidence, and authority

A new paper proposes three distinct models for how human annotator judgments shape large language model behavior through Reinforcement Learning from Human Feedback (RLHF). The models are "extension," where annotators align with designers' views; "evidence," where annotators provide factual information; and "authority," where annotators represent broader societal consensus. The paper argues that RLHF pipelines should be tailored to these different roles rather than relying on a single unified approach.

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Clarifies the normative role of human feedback in LLM alignment, potentially improving annotation strategies.

RANK_REASON Academic paper proposing new conceptual models for RLHF annotation.

Read on arXiv cs.CL →

COVERAGE [2]

  1. arXiv cs.CL TIER_1 · Steve Coyne ·

    Three Models of RLHF Annotation: Extension, Evidence, and Authority

    arXiv:2604.25895v1 Announce Type: cross Abstract: Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conce…