METR has released preliminary findings from an evaluation of GPT-4o's autonomous capabilities across 77 tasks. The model demonstrated impressive skills, such as systematic exploration, but also exhibited failure modes such as abruptly giving up or drawing unsupported conclusions. GPT-4o performed comparably to human baseliners on some tasks and was found to be more capable than Claude 3 Sonnet and GPT-4 Turbo, though slightly less capable than Claude 3.5 Sonnet.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Provides insight into GPT-4o's performance and failure modes as an autonomous agent, informing future model development and evaluation strategies.
RANK_REASON This is a research paper evaluating an existing model's capabilities.