A new benchmark from Microsoft Research, DELEGATE-52, reveals that leading AI models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt document content in 25% of interactions, and adding agentic tools degrades content by a further 6%. The benchmark suggests that only Python coding tasks are currently ready for enterprise deployment.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT New benchmark reveals significant document corruption in leading AI models, indicating current limitations for enterprise use beyond coding.
RANK_REASON The cluster describes a new benchmark, created by a research lab, that evaluates AI model performance.