SWE-bench Verified
PulseAugur coverage of SWE-bench Verified — every cluster mentioning SWE-bench Verified across labs, papers, and developer communities, ranked by signal.
1 day with sentiment data
-
Anthropic's Claude models learn to verbalize internal activations
Anthropic is developing a method for its Claude models to interpret and articulate their internal activations. This technique, when tested on the SWE-bench Verified benchmark, showed the model recognizing a test scenari…
-
Low-cost AI model beats top performers on coding benchmark with new context engine
A new method called Xanther Context Engine (XCE) has enabled the MiniMax M2.5 model to achieve a 78.2% score on the SWE-bench Verified benchmark, outperforming all other models. This achievement is notable because MiniM…
-
Anthropic's Claude Opus 4.7 boosts coding, integrates with n8n workflows
Anthropic has released Claude Opus 4.7, boasting significant improvements in coding performance, achieving 87.6% on the SWE-bench Verified benchmark. The update also enhances agent reliability with new features like "xh…
-
AI model evaluations need third-party auditors to ensure reliable progress tracking
Model evaluation methodologies are inconsistent across AI labs, leading to incomparable benchmark results and potentially flawed release decisions. Companies like OpenAI, Anthropic, and Google DeepMind have altered thei…
-
AI coding tools end subsidies, shift to pay-as-you-go pricing amid rising costs
The era of heavily subsidized AI coding tools is ending as companies like Microsoft and Anthropic shift from flat-rate subscriptions to pay-as-you-go pricing. This change reflects the immense scale of AI investment, wit…
-
Mistral AI leads non-Chinese models on SWE-bench Verified leaderboard
Mistral AI's models have achieved a notable position on the SWE-bench Verified leaderboard, distinguishing themselves as the sole non-Chinese models within the top 25. This ranking highlights the performance of Mistral …
-
Mistral AI launches Medium 3.5 model with cloud-based remote coding agents
Mistral AI has launched Mistral Medium 3.5, a new 128B parameter model designed for coding and productivity tasks. This model powers new remote coding agents in Mistral Vibe, allowing users to initiate complex, multi-st…
-
Poolside AI releases open-weight agentic coding models Laguna XS.2 and M.1
Poolside AI has launched two new open-weight agentic coding models, Laguna XS.2 and M.1. The models achieved impressive scores on the SWE-bench Verified benchmark, with M.1 reaching 72.5% and XS.2 reaching 68.2%. The XS…
-
OpenAI abandons SWE-bench Verified due to flawed tests and data contamination
OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated, with models showing improved scores primarily due to exposu…
-
Google DeepMind launches autonomous research agents powered by Gemini 3.1 Pro
Google DeepMind has launched two new autonomous research agents, Deep Research and Deep Research Max, powered by Gemini 3.1 Pro. These agents are designed to securely analyze user-provided or third-party data, with Deep…
-
Anthropic's NLA tech translates LLM 'thoughts' into human language
Anthropic has introduced Natural Language Autoencoders (NLAs), a new method that translates the internal numerical 'thoughts' (activations) of large language models into human-readable text. This technique allows resear…