SWE-bench Verified
PulseAugur coverage of SWE-bench Verified — every cluster mentioning SWE-bench Verified across labs, papers, and developer communities, ranked by signal.
1 day with sentiment data
-
Anthropic's Claude models learn to verbalize internal activations
Anthropic is developing a method for its Claude models to interpret and articulate their internal activations. This technique, when tested on the SWE-bench Verified benchmark, showed the model recognizing a test scenari…
-
Low-cost AI model beats top performers on coding benchmark with new context engine
A new method called Xanther Context Engine (XCE) has enabled the MiniMax M2.5 model to achieve a 78.2% score on the SWE-bench Verified benchmark, outperforming all other models. This achievement is notable because MiniM…
-
Anthropic's Claude Opus 4.7 boosts coding, integrates with n8n workflows
Anthropic has released Claude Opus 4.7, boasting significant improvements in coding performance, achieving 87.6% on the SWE-bench Verified benchmark. The update also enhances agent reliability with new features like "xh…
-
AI model evaluations need third-party auditors to ensure reliable progress tracking
Model evaluation methodologies are inconsistent across AI labs, leading to incomparable benchmark results and potentially flawed release decisions. Companies like OpenAI, Anthropic, and Google DeepMind have altered thei…
-
AI coding tools end subsidies, shift to pay-as-you-go pricing amid rising costs
The era of heavily subsidized AI coding tools is ending as companies like Microsoft and Anthropic shift from flat-rate subscriptions to pay-as-you-go pricing. This change reflects the immense scale of AI investment, wit…
-
Mistral AI leads non-Chinese models on SWE-bench Verified leaderboard
Mistral AI's models have achieved a notable position on the SWE-bench Verified leaderboard, distinguishing themselves as the sole non-Chinese models within the top 25. This ranking highlights the performance of Mistral …
-
Mistral AI launches Medium 3.5 model with cloud-based remote coding agents
Mistral AI has launched Mistral Medium 3.5, a new 128B parameter model designed for coding and productivity tasks. This model powers new remote coding agents in Mistral Vibe, allowing users to initiate complex, multi-st…
-
Poolside AI releases open-weight agentic coding models Laguna XS.2 and M.1
Poolside AI has launched two new open-weight agentic coding models, Laguna XS.2 and M.1. The models achieved impressive scores on the SWE-bench Verified benchmark, with M.1 reaching 72.5% and XS.2 reaching 68.2%. The XS…
-
OpenAI abandons SWE-bench Verified due to flawed tests and data contamination
OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated, with models showing improved scores primarily due to exposu…
-
Google DeepMind launches autonomous research agents powered by Gemini 3.1 Pro
Google DeepMind has launched two new autonomous research agents, Deep Research and Deep Research Max, powered by Gemini 3.1 Pro. These agents are designed to securely analyze user-provided or third-party data, with Deep…
-
Anthropic's NLA tech translates LLM 'thoughts' into human language
Anthropic has introduced Natural Language Autoencoders (NLAs), a new method that translates the internal numerical 'thoughts' (activations) of large language models into human-readable text. This technique allows resear…