A new paper finds that leading AI models such as Opus 4.6, GPT 5.4, and Gemini 3.1 exhibit significant performance degradation when classifying long transcripts, a crucial task for monitoring coding agents. The models miss subtly dangerous actions far more often in transcripts exceeding 800,000 tokens than in shorter ones. Prompting techniques partially mitigate the problem, but further post-training improvements are likely needed to ensure reliable monitoring in long-context scenarios.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Leading AI models struggle with long contexts, which may lead to overestimates of their safety-monitoring capabilities and calls for new training or prompting strategies.
RANK_REASON The cluster contains an academic paper detailing a new finding about AI model performance.