Terminal-Bench-2.0

PulseAugur coverage of Terminal-Bench-2.0: every cluster mentioning Terminal-Bench-2.0 across labs, papers, and developer communities, ranked by signal.

Total · 30d

9 over 90d

Releases · 30d

0 over 90d

Papers · 30d

5 over 90d

TIER MIX · 90D

significant 1
research 3
tool 5

TOPICS

SENTIMENT · 30D

5 days with sentiment data

RECENT · PAGE 1/1 · 9 TOTAL

TOOL · CL_79558 · Jun 8 · 13:50

Self-Harness enables LLM agents to improve their own operational harnesses

Researchers have developed a novel method called Self-Harness, enabling LLM-based agents to autonomously improve their own operational harnesses. This iterative process involves identifying model-specific failure patter…
TOOL · CL_68283 · Jun 3 · 04:00

Research: Interaction trajectories boost AI agent generalization

A new research paper explores the effectiveness of interaction trajectories for training AI agents, finding that standalone performance doesn't dictate teaching efficacy. Surprisingly, agents fine-tuned on trajectories …
TOOL · CL_60204 · May 29 · 19:01

AI coding agents: GPT-5.5, Claude Sonnet 4.6, Gemini 3.5 Flash compared

A recent comparison evaluated three AI coding agents: OpenAI's Codex (powered by GPT-5.5), Anthropic's Claude Code (using Claude Sonnet 4.6), and Google's Antigravity (with Gemini 3.5 Flash). The experiment focused on r…
TOOL · CL_35928 · May 17 · 21:00

Local LLMs struggle with real-world terminal tasks despite benchmark success

Local large language models often perform poorly on multi-step terminal tasks despite excelling at standard benchmarks like MMLU. This discrepancy arises because traditional benchmarks measure single-turn reasoning, fai…
TOOL · CL_34986 · May 16 · 21:33

Llama.cpp adds MTP, new Gemma-4 finetune released, Qwen 3.6 excels locally

The llama.cpp project has integrated Multi-Token Prediction (MTP), leading to an 11.5% speed increase for 27B Qwen models in local inference. A new fine-tuned Gemma-4 model, optimized for creative writing and a…
SIGNIFICANT · CL_26039 · May 11 · 03:44

Qwen 3.6-Plus excels in complex AI agent tasks and coding

Alibaba's Qwen 3.6-Plus model has demonstrated advanced capabilities in complex decision-making and agentic coding tasks, according to a recent evaluation. The model successfully generated a detailed implementation plan…
RESEARCH · CL_07734 · Apr 28 · 16:17

Poolside AI releases open-weight Laguna XS.2 and M.1 coding models

Poolside AI has released two new agentic coding models, Laguna M.1 and Laguna XS.2, along with their agent training and operation runtime. Laguna M.1 is a large Mixture of Experts (MoE) model trained on 30T tokens using…
RESEARCH · CL_47566 · Apr 9 · 13:05

Anthropic's 'Mythos' AI too risky for public release

Anthropic has developed a new AI model named Claude Mythos, which demonstrates significant advancements in benchmark performance, particularly in identifying software vulnerabilities. Due to its advanced capabilities in…
FRONTIER RELEASE · CL_01718 · Nov 18 · 17:49

Google DeepMind launches Gemini 3 Pro with advanced coding and agentic capabilities

Google DeepMind has launched Gemini 3 Pro, its latest and most intelligent model, which demonstrates significant improvements in reasoning and coding capabilities. This new model surpasses previous versions and excels…