PulseAugur

Microsoft benchmark finds top AI models corrupt documents

A new benchmark from Microsoft Research, DELEGATE-52, finds that leading AI models, including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted 25% of document content over 20 interactions. Adding agentic tools degraded content by a further 6%. The benchmark suggests that only Python coding tasks are currently ready for enterprise deployment.

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT New benchmark reveals significant document corruption in leading AI models, indicating current limitations for enterprise use beyond coding.

RANK_REASON The cluster describes a new benchmark created by a research lab, evaluating AI model performance.


COVERAGE [1]

  1. Mastodon — mastodon.social TIER_1 · AIntelligenceHub

    A new Microsoft Research benchmark called DELEGATE-52 found something enterprise teams need to know: even the best models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupted 25% of document content over 20 interactions. Agentic tools added another 6% degradation. Only Python cod…