Terminal-Bench-2.0
PulseAugur coverage of Terminal-Bench-2.0: every cluster mentioning Terminal-Bench-2.0 across labs, papers, and developer communities, ranked by signal.
-
Self-Harness enables LLM agents to improve their own operational harnesses
Researchers have developed a novel method called Self-Harness, enabling LLM-based agents to autonomously improve their own operational harnesses. This iterative process involves identifying model-specific failure patter…
-
Research: Interaction trajectories boost AI agent generalization
A new research paper explores the effectiveness of interaction trajectories for training AI agents, finding that standalone performance doesn't dictate teaching efficacy. Surprisingly, agents fine-tuned on trajectories …
-
AI coding agents: GPT-5.5, Claude Sonnet 4.6, Gemini 3.5 Flash compared
A recent comparison evaluated three AI coding agents: OpenAI's Codex (powered by GPT-5.5), Anthropic's Claude Code (using Claude Sonnet 4.6), and Google's Antigravity (with Gemini 3.5 Flash). The experiment focused on r…
-
Local LLMs struggle with real-world terminal tasks despite benchmark success
Local large language models often perform poorly on multi-step terminal tasks despite excelling at standard benchmarks like MMLU. This discrepancy arises because traditional benchmarks measure single-turn reasoning, fai…
-
Llama.cpp adds MTP, new Gemma-4 finetune released, Qwen 3.6 excels locally
The llama.cpp project has integrated Multi-Token Prediction (MTP), leading to an 11.5% speed increase for 27B Qwen models in local inference. A new finetuned Gemma-4 model, optimized for creative writing and a…
-
Qwen 3.6-Plus excels in complex AI agent tasks and coding
Alibaba's Qwen 3.6-Plus model has demonstrated advanced capabilities in complex decision-making and agentic coding tasks, according to a recent evaluation. The model successfully generated a detailed implementation plan…
-
Poolside AI releases open-weight Laguna XS.2 and M.1 coding models
Poolside AI has released two new agentic coding models, Laguna M.1 and Laguna XS.2, along with their agent training and operation runtime. Laguna M.1 is a large Mixture of Experts (MoE) model trained on 30T tokens using…
-
Anthropic's 'Mythos' AI too risky for public release
Anthropic has developed a new AI model named Claude Mythos, which demonstrates significant advancements in benchmark performance, particularly in identifying software vulnerabilities. Due to its advanced capabilities in…
-
Google DeepMind launches Gemini 3 Pro with advanced coding and agentic capabilities
Google DeepMind has launched Gemini 3 Pro, its latest and most intelligent model, which demonstrates significant improvements in reasoning and coding capabilities. This new model surpasses previous versions and excels…