PulseAugur

Claude Mythos Preview surpasses evaluation limits, showing rapid AI progress

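Anthropic's Claude Mythos Preview has reached the limits of METR's current evaluation methodology, according to METR's Task-Completion Time Horizon evaluation. The model recorded a 50% time horizon of over 16 hours and an 80% time horizon of about 3 hours, exceeding previous evaluations. A time horizon here is the length of task, measured by how long it takes a skilled human, that the model completes at the stated success rate; it is not the time the model itself spends on the task. The result points to rapid progress in AI capabilities and raises the question of whether existing assessment tools are still adequate.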

Summary written by gemini-2.5-flash-lite from 3 sources.

IMPACT Demonstrates AI models are outpacing current evaluation benchmarks, signaling a need for new assessment tools.

RANK_REASON The cluster reports on a new benchmark evaluation of an AI model that pushes the limits of existing assessment methodologies.


COVERAGE [3]

  1. Mastodon (fosstodon.org) · TIER_1 · Korean (KO)

    AI Leaks and News (@AILeaksAndNews): METR released its Task-Completion Time Horizon evaluation of Claude Mythos Preview. The post explains that the model exceeded previous evaluations, recording over 16 hours at the 50% threshold and about 3 hours at the 80% threshold, and interprets the result in relation to rapid progress in AI capabilities. https://x.com/AILeaksAndNews/status/2052901460375949510 #metr #claude #benchmark #evalu…

  2. Mastodon (fosstodon.org) · TIER_1 · Korean (KO)

    Shain Noor (@shaincodes): The post mentions the IdeaBlock concept and judges that changing the unit of embedding is a smarter approach than simply retrieving and then tuning. It also presents a real-world case in which the AI CFO Silvia must remember the financial histories of thousands of users across sessions, illustrating the importance of AI applications that need long-term memory and personalization. https://x.com/shaincodes/status/2052822155558347242 #ai #memory #personalization #ra…

  3. Mastodon (mastodon.social) · TIER_1 · Polish (PL) · aisight

    The latest Claude Mythos Preview model has reached the limits of METR's research methodology, demonstrating capabilities beyond current measurement standards. Evaluation experts admit that they lack the tools to assess such powerful systems, and at the same time Palo Alto …