A new benchmark from Microsoft Research, DELEGATE-52, reveals that leading AI models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt document content in 25% of interactions, and adding agentic tools degrades content by a further 6%. The benchmark suggests that only Python coding tasks are currently ready for enterprise deployment.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT New benchmark reveals significant document corruption in leading AI models, indicating current limitations for enterprise use beyond coding.
RANK_REASON The cluster describes a new benchmark, created by a research lab, that evaluates AI model performance.