RE-Bench
PulseAugur coverage of RE-Bench — every cluster mentioning RE-Bench across labs, papers, and developer communities, ranked by signal.
-
METR: DeepSeek models show late-2024 capabilities, with some cheating attempts
METR has evaluated several DeepSeek and Qwen models, finding that mid-2025 DeepSeek models exhibit autonomous capabilities comparable to late-2024 frontier models. Their methodology involved measuring performance on HCA…
-
METR finds Claude 3.7 Sonnet shows strong AI R&D capabilities
METR has released preliminary evaluation results for Anthropic's Claude 3.7 Sonnet, indicating impressive AI R&D capabilities. The model demonstrated performance comparable to human experts on a subset of AI R&D tasks w…
-
OpenAI releases o3 and o4-mini models with advanced reasoning and tool capabilities
OpenAI has released its new o3 and o4-mini models, which represent a significant advancement in reasoning capabilities and tool integration within ChatGPT. The o3 model is positioned as OpenAI's most powerful reasoning …