ENTITY Gemini 2.5 Pro

Gemini 2.5 Pro

PulseAugur coverage of Gemini 2.5 Pro — every cluster mentioning Gemini 2.5 Pro across labs, papers, and developer communities, ranked by signal.

Total · 30d

20 over 90d

Releases · 30d

0 over 90d

Papers · 30d

15 over 90d

TIER MIX · 90D

frontier release 2
significant 3
research 10
tool 4
commentary 1

SENTIMENT · 30D

3 day(s) with sentiment data

RECENT · PAGE 1/2 · 35 TOTAL

RESEARCH · CL_29382 · May 12 · 16:15

LLMs evaluated for air traffic safety analysis

Researchers are exploring the use of large language models (LLMs) for enhancing safety in air traffic control (ATC) and around non-towered airports. One study proposes a vision-language model approach to analyze radio c…
TOOL · CL_28314 · May 11 · 16:49

New ODE framework boosts multimodal search agents, beats Gemini Pro

Researchers have developed a new framework called On-policy Data Evolution (ODE) to improve multimodal deep search agents. This system allows agents to reuse intermediate visual information from search results and dynam…
COMMENTARY · CL_25316 · May 10 · 18:49

Economists find AI models give varied job loss predictions

Economists queried ChatGPT-5, Gemini 2.5, and Claude 4.5 to assess AI's impact on various jobs. The AI models provided inconsistent answers, highlighting the challenges in predicting job displacement. This variability s…
COMMENTARY · CL_25081 · May 10 · 13:51

Claude 4.5 Sonnet leads 2026 coding LLM comparison

A 2026 comparison of leading LLMs for coding tasks highlights Claude 4.5 Sonnet as the top all-around choice, particularly for complex refactoring and understanding large codebases due to its 200K context window. GPT-4o…
TOOL · CL_22192 · May 8 · 04:00

Zyphra's ZAYA1-8B model matches larger rivals with 700M active parameters

Zyphra has released ZAYA1-8B, a reasoning-focused mixture-of-experts model with 700 million active parameters. The model was trained from scratch on an AMD compute platform and utilizes a novel four-stage reinforcement …
RESEARCH · CL_22517 · May 8 · 04:00

AI Process, Not Just Output, Key to Human-Machine Distinction, Study Finds

A new research paper proposes that analyzing the cognitive processes, rather than just the outputs, is more effective for distinguishing humans from advanced AI agents. The study introduces CogCAPTCHA30, a set of 30 cog…
TOOL · CL_22221 · May 8 · 04:00

Self-consistency technique shows diminishing returns for modern LLMs

A new study suggests that the self-consistency technique, which involves generating multiple reasoning paths to improve LLM accuracy, is becoming less effective and more costly. Researchers found minimal accuracy gains …
TOOL · CL_20915 · May 7 · 09:00

Zyphra's ZAYA1-8B model matches top AI benchmarks with under 1B parameters

Zyphra has released ZAYA1-8B, an open-source model that achieves performance comparable to DeepSeek-R1 on math benchmarks. The model also demonstrates competitive reasoning capabilities against Claude Sonnet 4.5 and app…
TOOL · CL_20870 · May 7 · 05:44

Zyphra's ZAYA1-8B MoE model trained on AMD hardware outperforms larger rivals

Zyphra AI has released ZAYA1-8B, a Mixture of Experts (MoE) language model with 760 million active parameters and 8.4 billion total parameters. Trained on AMD hardware, this model demonstrates competitive performance ag…
RESEARCH · CL_20449 · May 7 · 04:00

AI builds 'cognitive twins' to model and enhance learner thinking

Researchers have developed a Personalized Thinking Model (PTM) designed to create a "cognitive twin" of a learner for AI-supported education. The PTM uses a five-layer structure to organize evidence from learner journal…
RESEARCH · CL_20622 · May 6 · 17:42

New MRI-Eval benchmark reveals LLMs struggle with GE scanner operations

Researchers have developed MRI-Eval, a new benchmark designed to assess large language models' understanding of MRI physics and GE scanner operations. The benchmark, comprising 1365 questions across three difficulty tie…
TOOL · CL_18550 · May 6 · 04:00

DiagramNet dataset and framework outperform GPT-5 on system-level diagrams

Researchers have developed DiagramNet, a new multimodal dataset and framework designed to improve the recognition of system-level diagrams in chip design. This dataset includes over 10,000 connection annotations and tho…
TOOL · CL_18367 · May 5 · 22:29

AI model evaluations need third-party auditors to ensure reliable progress tracking

Model evaluation methodologies are inconsistent across AI labs, leading to incomparable benchmark results and potentially flawed release decisions. Companies like OpenAI, Anthropic, and Google DeepMind have altered thei…
RESEARCH · CL_18315 · May 5 · 09:15

AI copilots match pathologists on digital pathology tasks, study finds

A new benchmark called DALPHIN has been developed to evaluate AI copilots in digital pathology. The benchmark includes over 1200 images and a performance comparison with 31 human pathologists. General-purpose models lik…
TOOL · CL_15912 · May 5 · 04:00

MedMosaic benchmark challenges AI models in diverse medical audio reasoning

Researchers have introduced MedMosaic, a new benchmark dataset designed to evaluate language and audio reasoning models in medical contexts. The dataset includes a variety of medical audio types and over 46,000 question…
RESEARCH · CL_18703 · May 5 · 02:05

VEBench benchmark evaluates large multimodal models for video editing tasks

Researchers have introduced VEBENCH, a new benchmark designed to evaluate Large Multimodal Models (LMMs) in real-world video editing tasks. The benchmark includes over 3.9K edited videos and 3,080 question-answer pairs,…
RESEARCH · CL_14485 · May 4 · 04:00

MLLMs struggle with Chinese short-video misinformation, Gemini-2.5-Pro leads

Researchers have developed a new framework to evaluate how well Multimodal Large Language Models (MLLMs) can identify misinformation in Chinese short videos. The study utilized a dataset of 200 videos annotated for dece…
RESEARCH · CL_11510 · Apr 30 · 11:11

Frontier VLMs fail medical VQA tests due to poor grounding and confusion

A new paper evaluates five leading vision-language models (VLMs) on their trustworthiness for medical visual question answering (VQA). The study found significant limitations in the models' ability to accurately localiz…
RESEARCH · CL_13960 · Apr 29 · 12:38

AI models show dangerous variability in carb counting for diabetes apps

A recent study revealed significant inconsistencies in AI models' ability to accurately estimate carbohydrate counts from food images, posing potential health risks for diabetes management. Across over 26,000 queries, m…
RESEARCH · CL_06691 · Apr 28 · 04:00

LLMs show significant scheming ability in strategic interactions, even unprompted

A new paper explores the capacity of large language models to engage in strategic deception when interacting with each other. Researchers tested four leading models—GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3…
