PulseAugur
research · [2 sources]

OpenAI accidentally graded CoTs in GPT models, raising minor alignment concerns

OpenAI has identified instances where their AI models' chains of thought (CoT) were inadvertently graded during reinforcement learning training. This practice, which OpenAI policy prohibits because it risks training models to produce misleading reasoning, affected several model versions including GPT-5.4 Thinking and GPT-5.1 Instant. Despite the accidental grading, initial analyses did not reveal significant degradation in CoT monitorability, though the company acknowledges potential subtle effects and aims to prevent future occurrences.
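The mechanism at issue is easiest to see in a toy reward function. The sketch below is purely illustrative and assumes nothing about OpenAI's actual pipeline; the `Sample` fields, the `grader`, and the keyword-based reward are all made-up stand-ins. In the intended setup the grader sees only the final answer, so RL puts no optimization pressure on the CoT; in the accidental setup the CoT is part of the graded text, so the policy can be rewarded for reasoning that merely looks good to the grader, which is what undermines CoT monitorability.

```python
# Minimal sketch (not OpenAI's pipeline) of answer-only grading vs. accidentally
# grading the chain of thought (CoT) during RL. All names are illustrative.

from dataclasses import dataclass


@dataclass
class Sample:
    cot: str      # hidden reasoning emitted before the answer
    answer: str   # final answer shown to the grader / user


def grade_answer_only(sample: Sample, grader) -> float:
    """Intended setup: reward depends only on the final answer,
    so no optimization pressure is applied to the CoT itself."""
    return grader(sample.answer)


def grade_full_transcript(sample: Sample, grader) -> float:
    """Accidental setup: the grader sees CoT + answer concatenated,
    so RL rewards can shape the CoT toward grader-pleasing text."""
    return grader(sample.cot + "\n" + sample.answer)


if __name__ == "__main__":
    # Toy grader that rewards any text mentioning the keyword "verified".
    grader = lambda text: 1.0 if "verified" in text else 0.0
    s = Sample(cot="I did not check this, but it is verified-sounding.", answer="42")
    print(grade_answer_only(s, grader))      # 0.0 -> CoT content cannot affect reward
    print(grade_full_transcript(s, grader))  # 1.0 -> CoT content leaks into the reward
```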

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Accidental CoT grading could subtly affect model alignment and future training runs, underscoring the need for robust safety protocols.

RANK_REASON Paper detailing an internal safety incident and its investigation, with external review.

Read on LessWrong (AI tag) →

COVERAGE [2]

  1. LessWrong (AI tag) TIER_1 · papetoast

    Investigating the consequences of accidentally grading CoT during RL

    This is an unofficial automated linkpost (https://gist.github.com/Glinte/5c3fa2f6bcecb7c573664b19bb76eaaf). Monitoring our models’ chains of thought (CoT) has proven to be an effective way to detect and track model misalignment, both …

  2. LessWrong (AI tag) TIER_1 · Buck

    A review of “Investigating the consequences of accidentally grading CoT during RL”

    Last week, OpenAI staff shared an early draft of Investigating the consequences of accidentally grading CoT during RL (https://alignment.openai.com/accidental-cot-grading/) with Redwood Research staff. To start wit…