OpenAI and Google DeepMind unveil new AI agents for coding and task automation
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 417 sources
OpenAI has introduced AgentKit, a suite of tools designed to streamline the development, deployment, and optimization of AI agents. This toolkit includes an Agent Builder for visual workflow creation, a Connector Registry for managing data sources, and ChatKit for embedding agentic UIs. Google DeepMind has also unveiled two AI agents: CodeMender, which automatically patches software vulnerabilities, and AlphaEvolve, an agent that uses Gemini models to discover and optimize algorithms for applications in mathematics and computing. Additionally, OpenAI's Computer-Using Agent (CUA) demonstrates advanced capabilities in interacting with digital interfaces, setting new benchmark results for computer use tasks.
AI
IMPACT
These advancements in AI agents, coding tools, and security patches signal a shift towards more autonomous AI systems capable of complex tasks and software development, potentially accelerating innovation and improving software reliability.
RANK_REASON
Multiple major AI labs (OpenAI, Google DeepMind) announce new agent technologies and tools, indicating significant advancements in AI capabilities and developer ecosystems.
Today, we’re releasing new tools to help developers go from prototype to production faster: AgentKit, expanded evals capabilities, and reinforcement fine-tuning for agents.
New AI agent evolves algorithms for math and practical applications in computing by combining the creativity of large language models with automated evaluators
Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading t…
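The hybrid action space this item describes can be sketched as a toy dispatch; all names and types below are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

# Illustrative only: a toy encoding of a CUA's hybrid action space.
# A step is either an atomic GUI action (click, type) or a high-level
# tool call (e.g. an API-based file operation); the agent picks one
# per step. All names here are hypothetical.

@dataclass
class GuiAction:
    kind: str          # "click" or "type"
    target: str        # on-screen element, or text to enter

@dataclass
class ToolCall:
    name: str          # e.g. "fs.read"
    args: dict

def describe(action) -> str:
    """Render either action kind uniformly, as a planner's trace might."""
    if isinstance(action, GuiAction):
        return f"gui:{action.kind}({action.target})"
    return f"tool:{action.name}({action.args})"

step1 = GuiAction("click", "Save button")
step2 = ToolCall("fs.read", {"path": "notes.txt"})
print(describe(step1))   # gui:click(Save button)
print(describe(step2))
```

The uncertainty the snippet mentions is exactly the choice between emitting a `GuiAction` or a `ToolCall` at each step.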
Modern GUI agents typically rely on a model-centric and step-wise interaction paradigm, where LLMs must re-interpret the UI and re-decide actions at every screen, which is fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowl…
Large language model (LLM) agents have increasingly advanced service applications, such as booking flight tickets. However, these service agents suffer from unreliability in long-horizon tasks, as they often produce policy violations, tool hallucinations, and misaligned actions, …
Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to t…
Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters …
Deploying agentic AI in regulated contexts requires principled reasoning about two design dimensions: agency (what the system can do) and autonomy (how much it acts without human involvement). Though often treated independently, they are coupled: at higher autonomy, human error c…
Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety…
In this paper, we present AgentDisCo, a novel Disentangled and Collaborative agentic architecture that formulates deep research as an adversarial optimization problem between information exploration and exploitation. Unlike existing approaches that conflate these two processes in…
We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any pa…
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leavi…
The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, …
LLMs are increasingly deployed as autonomous agents with access to tools, databases, and external services, yet practitioners (across different sectors) lack systematic methods to assess how known threat classes translate into concrete risks within a specific agentic deployment. …
Artificial intelligence safety research focuses on aligning individual language models with human values, yet deployed AI systems increasingly operate as interacting populations where social influence may override individual alignment. Here we show that populations of individuall…
Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent reco…
Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by *incompleteness*, *incorrectness*, and *redundancy*, manif…
Current large language model agent frameworks prioritize autonomy but lack the governability mechanisms required for enterprise deployment. High-risk write operations proceed without independent review, complex tasks lack acceptance verification, and computational resources are a…
Large Language Model (LLM)-based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact-rich tasks such as document-centric workflows and data-intensive analysis. As these libraries grow, a few works have attempted to study the Retrieval-Augmented…
In this paper, we describe early work on a specification inference tool for the Move Prover that combines a weakest-precondition (WP) analysis over Move bytecode with an agentic coding CLI such as Claude Code. Specification inference reduces the boilerplate of writing specificati…
We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively rep…
We present Agentic Decentralized Knowledge Optimization (ADKO), a framework for collaborative black-box optimization across autonomous agents that achieves sample efficiency, privacy preservation, heterogeneous-objective handling, and communication efficiency. Each agent maintain…
Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. Reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. …
While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemi…
arXiv:2605.05861v1 Announce Type: new Abstract: Future networking systems are envisioned to become part of an agentic AI-native ecosystem in which a vast number of heterogeneous and specialized AI agents cooperate seamlessly to fulfill complex user requirements in real time. Howe…
arXiv:2605.06614v1 Announce Type: cross Abstract: LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate f…
arXiv cs.CL
TIER_1·Xinglin Wang, Zishen Liu, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li·
arXiv:2605.06110v1 Announce Type: cross Abstract: Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves a…
arXiv cs.CL
TIER_1·Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao·
arXiv:2604.03675v1 Announce Type: cross Abstract: In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) metho…
arXiv:2508.15119v2 Announce Type: replace-cross Abstract: We introduce Open-Universe Assistance Games (OU-AGs), a formal framework extending assistance games to LLM-based agents. Effective assistance requires reasoning over human preferences that are unbounded, underspecified, an…
arXiv cs.LG
TIER_1·Bole Ma, Jan Eitzinger, Harald Köstler·
arXiv:2605.05696v1 Announce Type: cross Abstract: Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of…
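The failure mode this item describes, bit-identical tokens at shifted positions defeating exact-prefix matching, can be sketched in a few lines. This is a simplified model of prefix caching, not any serving system's actual implementation:

```python
# Simplified sketch of the prefix-cache failure described above: a cache
# that matches only up to the first point of divergence gets no reuse when
# one token is prepended, because every later position shifts, even though
# the suffix tokens are bit-identical.

def cached_prefix_len(cache: list[int], prompt: list[int]) -> int:
    """Return how many leading tokens of `prompt` can be reused from `cache`."""
    n = 0
    for cached_tok, new_tok in zip(cache, prompt):
        if cached_tok != new_tok:
            break
        n += 1
    return n

turn1 = [101, 7, 7, 42, 9]   # tokens from the previous agent turn
turn2 = [55] + turn1          # one new token prepended: all positions shift

print(cached_prefix_len(turn1, turn1))  # 5 -> full reuse on identical replay
print(cached_prefix_len(turn1, turn2))  # 0 -> shifted positions void the cache
```

Position-independent caching schemes, which the snippet alludes to, aim to recover reuse in the second case.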
arXiv:2605.06522v1 Announce Type: new Abstract: Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings,…
arXiv:2605.06472v1 Announce Type: new Abstract: LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at agent level and …
arXiv cs.AI
TIER_1·Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Cankun Guo, Ming Yin, Bo An, Mengdi Wang·
arXiv:2604.15034v3 Announce Type: replace Abstract: Recent advances in LLM-based agent systems have shown promise in tackling complex, long-horizon tasks. However, existing agent protocols (e.g., A2A and MCP) underspecify cross-entity lifecycle and context management, version tr…
arXiv:2605.06230v1 Announce Type: new Abstract: As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agentic infrastructure remains fragmen…
arXiv:2605.05980v1 Announce Type: new Abstract: When language model agents tackle complex software engineering tasks, they often degrade over long trajectories, which we define as *agent drift*. We focus on two recurring failure modes: *overthinking* and *overacting*, i.e., where …
arXiv:2605.06434v1 Announce Type: new Abstract: Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural-language specifications, with the potential to accelerate Formal Verification (FV). However, high-qual…
arXiv:2605.05400v1 Announce Type: cross Abstract: The rapid adoption of AI coding agents has produced a dominant workflow pattern -- often called "vibe coding" -- that prioritizes speed of implementation over deliberate preparation. We argue that this approach creates a systemati…
arXiv:2605.06136v1 Announce Type: cross Abstract: Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or …
arXiv cs.AI
TIER_1·Francesco Dente, Dario Satriani, Paolo Papotti·
arXiv:2605.06445v1 Announce Type: cross Abstract: Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectur…
arXiv:2603.13131v2 Announce Type: replace Abstract: Long-horizon embodied intelligence requires agents to improve through interaction, not merely to execute plans generated from static goals. A central challenge is therefore to transform past executions into knowledge that can sh…
arXiv cs.AI
TIER_1·Xi-Wei Pan, Shi-Wen An, Jin-Guo Liu·
arXiv:2604.11535v2 Announce Type: replace Abstract: Solving an NP-hard optimization problem often requires reformulating it for a specific solver -- quantum hardware, a commercial optimizer, or a domain heuristic. A tool for polynomial-time reductions between hard problems would …
arXiv:2605.06365v1 Announce Type: new Abstract: Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit con…
arXiv cs.AI
TIER_1·Yipeng Ouyang, Yi Xiao, Yuhao Gu, Xianwei Zhang·
arXiv:2605.03353v1 Announce Type: cross Abstract: LLM-Agents have evolved into autonomous systems for complex task execution, with the SKILL.md specification emerging as a de facto standard for encapsulating agent capabilities. However, a critical bottleneck remains: different ag…
arXiv cs.AI
TIER_1·Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li·
arXiv:2604.07039v2 Announce Type: replace-cross Abstract: Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionalit…
arXiv:2604.14709v3 Announce Type: replace Abstract: Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We i…
arXiv:2605.03952v1 Announce Type: cross Abstract: Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isola…
arXiv:2605.03213v1 Announce Type: cross Abstract: Agentic AI systems, specifically LLM-driven agents that plan, invoke tools, maintain persistent memory, and delegate tasks to peer agents via protocols such as MCP and A2A, introduce a threat surface that differs materially from s…
arXiv cs.AI
TIER_1·Kiran Gopinathan, Jack Feser, Michelangelo Naim, Zenna Tavares, Eli Bingham·
arXiv:2605.03143v1 Announce Type: cross Abstract: Recent advances in large language models have led to the rise of software systems (i.e. agents) that execute with increasing autonomy on behalf of users in open, multi-party settings, interacting with untrusted counterparts and ma…
arXiv cs.AI
TIER_1·Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers·
arXiv:2605.04019v1 Announce Type: new Abstract: AI systems are entering critical domains like healthcare, finance, and defense, yet remain vulnerable to adversarial attacks. While AI red teaming is a primary defense, current approaches force operators into manual, library-specifi…
arXiv cs.AI
TIER_1·Kishan Athrey, Ramin Pishehvar, Brian Riordan, Mahesh Viswanathan·
arXiv:2605.03986v1 Announce Type: new Abstract: Multi-Agent Systems (MAS) built using AI agents fulfill a variety of user intents that may be used to design and build a family of related applications. However, the creation of such MAS currently involves manual composition of the …
arXiv:2605.03675v1 Announce Type: new Abstract: Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to four compounding failure modes in existing fla…
arXiv cs.AI
TIER_1·Srinath Perera, Kaviru Hapuarachchi, Frank Leymann, Rania Khalaf·
arXiv:2605.03409v1 Announce Type: new Abstract: We present Robust Agent Compensation (RAC), a log-based recovery paradigm (providing a safety net) implemented through an architectural extension that can be applied to most Agent frameworks to support reliable executions (avoiding …
arXiv:2605.03242v1 Announce Type: new Abstract: Tool-using agent systems powered by large language models (LLMs) are increasingly deployed across web, app, operating-system, and transactional environments. Yet existing safety benchmarks still emphasize explicit risks, potentially…
arXiv:2605.03195v1 Announce Type: new Abstract: Modern coding agents increasingly delegate specialized subtasks to subagents, which are smaller, focused agentic loops that handle narrow responsibilities like search, debugging or terminal execution. This architectural pattern keep…
arXiv cs.AI
TIER_1·Reshabh K Sharma, Gaurav Mittal, Yu Hu·
arXiv:2605.03159v1 Announce Type: new Abstract: As autonomous agents become increasingly sophisticated, validating their sequential behavior presents a significant challenge. Traditional testing approaches require manual specification, exact sequence matching, or thousands of tra…
arXiv:2604.01496v2 Announce Type: replace-cross Abstract: We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary …
arXiv:2605.04107v1 Announce Type: cross Abstract: Production agent frameworks (OpenAI Function Calling, Anthropic Tool Use, MCP) transmit tool schemas as JSON, a format designed for machine parsing, not for interpretation by language models. For small models (4B-14B), this protoc…
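For context on the wire format this item critiques, here is a minimal tool schema in the widely used OpenAI function-calling style. The field names follow the commonly documented shape; treat it as an illustrative sketch, not an exhaustive spec:

```python
import json

# Illustrative only: a minimal tool schema in the OpenAI function-calling
# style. Frameworks serialize structures like this to JSON before handing
# them to the model; that JSON string is what a small model must parse.
tool_schema = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a UTF-8 text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path to read."}
            },
            "required": ["path"],
        },
    },
}

wire_format = json.dumps(tool_schema)   # what the model actually receives
parsed = json.loads(wire_format)        # round-trips without loss
print(parsed["function"]["name"])       # read_file
```

The snippet's argument is that this machine-oriented JSON, while lossless, is a poor interpretation target for 4B-14B models.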
Driven by a rapid co-evolution of both harness and underlying models, LLM agents are improving at a dizzying pace. In our prior work (performed in Dec. 2025), we introduced "Design Conductor" (or just "Conductor"), a system capable of building a 5-stage Linux-capable RISC-V CPU i…
We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the …
AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-worl…
Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existin…
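A minimal sketch of the pre-execution guard idea this item points toward; the pattern lists and function names are hypothetical, not from any cited system:

```python
# Hypothetical sketch of a pre-execution guard for the unsafe actions listed
# above: every tool call is screened against simple deny rules before any
# side effect runs, so a blocked call never reaches the executor.

DENY_PATTERNS = {
    "shell": ["rm -rf", "mkfs", "> /dev/"],          # accidental deletion
    "http": ["api_key=", "authorization: bearer"],   # credential exposure
}

def is_safe(tool: str, argument: str) -> bool:
    arg = argument.lower()
    return not any(pattern in arg for pattern in DENY_PATTERNS.get(tool, []))

def guarded_call(tool: str, argument: str, execute):
    if not is_safe(tool, argument):
        return f"BLOCKED: {tool} call rejected before execution"
    return execute(argument)

print(guarded_call("shell", "ls -la", lambda a: "ran: " + a))    # ran: ls -la
print(guarded_call("shell", "rm -rf /tmp/x", lambda a: "ran"))   # BLOCKED: ...
```

Real systems replace the string match with richer policies, but the structural point stands: the check must sit between the model's proposed action and the side effect.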
Agent-repair leaderboards reorder under evaluator reconfiguration, and a measurable share of the reordering is produced by methods that consult evaluator-derived signal during internal selection of candidate repairs. We document this failure mode on a public leaderboard and relea…
arXiv:2605.03596v1 Announce Type: cross Abstract: Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks ef…
arXiv:2605.02910v1 Announce Type: cross Abstract: Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their ability for creative problem-solving remains underexplored. We study this capability through the len…
arXiv:2603.00822v2 Announce Type: replace-cross Abstract: As Large Language Model (LLM) agents increasingly execute complex, autonomous software engineering tasks, developers rely on natural language instruction files such as AGENTS.md to express project-specific coding conventio…
arXiv cs.AI
TIER_1·Jia Li, Yuxin Su, Michael R. Lyu·
arXiv:2601.03731v3 Announce Type: replace-cross Abstract: As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical…
arXiv cs.AI
TIER_1·Zhensu Sun, Haotian Zhu, Bowen Xu, Xiaoning Du, Li Li, David Lo·
arXiv:2408.01055v2 Announce Type: replace-cross Abstract: Self-healing systems have long been a focus of research, aiming to enable software to recover from unexpected runtime errors without human intervention. Traditional approaches rely on predefined heuristic rules, such as re…
arXiv:2604.25000v2 Announce Type: replace Abstract: Recent work has framed intelligence in verifiable tasks as reducing time-to-solution through learned structure and test-time search, while systems work has explored learned runtimes in which computation, memory and I/O migrate i…
arXiv cs.AI
TIER_1·Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang·
arXiv:2604.06132v2 Announce Type: replace Abstract: Large language models are increasingly deployed as autonomous agents for multi-step workflows in real-world software environments. However, existing agent benchmarks are limited by trajectory-opaque grading, underspecified safet…
arXiv:2510.12218v2 Announce Type: replace Abstract: Current approaches rely on zero-shot evaluation due to the absence of training data; while proprietary models such as GPT-4 exhibit strong reasoning capabilities, smaller open-source models remain ineffective at complex tool use…
arXiv:2505.16120v2 Announce Type: replace Abstract: The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-powered agents offer greater flexibility, cross-domain reasoning, and natural language i…
arXiv cs.AI
TIER_1·Yuecai Zhu, Nikolaos Tsantalis, Peter C. Rigby·
arXiv:2605.02741v1 Announce Type: cross Abstract: The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long term maintainability. This paper presents a systematic audit of technical d…
arXiv:2605.02584v1 Announce Type: cross Abstract: Agentic AI will be an essential enabling technology for designing future mobile communication systems, which could provide flexible and customized services, automate complex network operations, and drive autonomous decision-making…
arXiv:2605.02244v1 Announce Type: cross Abstract: Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. This paper takes a …
arXiv:2605.01740v1 Announce Type: cross Abstract: An agentic-AI runtime issues tool calls, sends messages, and actuates devices on behalf of an LLM. Catching the four ways an action can diverge from its audit record -- F1 gate-bypass, F2 audit-forgery, F3 silent host failure, F4 wro…
arXiv:2605.01471v1 Announce Type: cross Abstract: Maintaining reliable UI test suites in large-scale enterprise applications is a persistent and costly challenge. We present an industrial case study of a multi-agent autonomous testing system evaluated using anonymized execution d…
arXiv:2605.01394v1 Announce Type: cross Abstract: Formal specification is essential for rigorous program verification, yet writing correct specifications remains costly and difficult to automate. Although large language models (LLMs) and agents have shown promising progress, thei…
arXiv:2605.02728v1 Announce Type: new Abstract: This paper presents ORPilot, an open-source agentic AI system that translates real-world business problems into solver-ready optimization models. Unlike academic LLM-for-OR tools that assume clean problem specifications with preform…
arXiv cs.AI
TIER_1·Vincent Henkel, Felix Gehlhoff, David Kube, Asaad Almutareb, Luis Cruz, Bernd Hellingrath, Philip Koch, Christoph Legat, Florian Mohr, Michael Oberle, Felix Ocker, Thorsten Schoeler, Mario Thron, Nico Andre Töpfer, Lucas Vogt, Yuchen Xia·
arXiv:2605.02592v1 Announce Type: new Abstract: Foundation models, particularly large language models, are increasingly integrated into agent architectures for industrial tasks such as decision support, process monitoring, and engineering automation. Yet evidence on their purpose…
arXiv:2605.02503v1 Announce Type: new Abstract: Evaluating autonomous data analysis agents requires testing their ability to perform exploratory analysis in underexplored data environments. However, many existing benchmarks emphasize final answer accuracy in prior-guided data set…
arXiv cs.AI
TIER_1·Qisong Zhang (School of Artificial Intelligence, Beijing University of Posts and Telecommunications), Wenzhuo Wu (School of Artificial Intelligence, Beijing University of Posts and Telecommunications), Zhuangzhuang Jia (School of Artificial Intelligence, ·
arXiv:2605.01789v1 Announce Type: new Abstract: Constructing controllable visual data is a major bottleneck for image editing and multimodal understanding. Useful supervision is rarely produced by a single rendering pass; instead it emerges through iterative generation, inspectio…
arXiv cs.AI
TIER_1·Florian Valentin Wunderlich, Lars Benedikt Kaesberg, Jan Philip Wahle, Terry Ruas, Bela Gipp·
arXiv:2605.01566v1 Announce Type: new Abstract: Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost-effective compute usage. However, computational efficiency…
arXiv:2605.01147v1 Announce Type: new Abstract: As large language models are increasingly deployed as interacting agents in high-stakes decisions, the AI safety community assumes that safety properties of individual models will compose into safe multi-agent behavior. This positio…
arXiv:2605.02964v1 Announce Type: new Abstract: Reinforcement learning (RL) trained language model agents with tool access are increasingly deployed in coding assistants, research tools, and autonomous systems. We introduce the Reward Hacking Benchmark (RHB), a suite of multi-ste…
arXiv cs.CL
TIER_1·Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu·
arXiv:2603.04601v2 Announce Type: replace-cross Abstract: Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduc…
arXiv cs.CL
TIER_1·Yuwen Du, Rui Ye, Shuo Tang, Keduan Huang, Xinyu Zhu, Yuzhu Cai, Siheng Chen·
arXiv:2605.04036v1 Announce Type: cross Abstract: Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-…
arXiv:2605.03228v1 Announce Type: cross Abstract: As large language model (LLM)-powered agents are increasingly deployed to perform complex, real-world tasks, they face a growing class of attacks that exploit extended user-agent-environment interactions to pursue malicious object…
arXiv:2605.03838v1 Announce Type: new Abstract: We introduce TRACE, a cross-domain engineering framework for trustworthy agentic AI in operationally critical domains. TRACE combines a four-layer reference architecture with an explicit classical-ML vs. LLM-validator split (L2a/L2b…
arXiv:2510.08952v4 Announce Type: replace Abstract: Text-attributed graphs (TAGs) have become a key form of graph-structured data in modern data management and analytics, combining structural relationships with rich textual semantics for diverse applications. However, the effecti…
arXiv cs.LG
TIER_1·Chandan Singh, Yan Shuo Tan, Weijia Xu, Zelalem Gero, Weiwei Yang, Michel Galley, Jianfeng Gao·
arXiv:2605.03808v1 Announce Type: cross Abstract: Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, …
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continua…
AI systems are entering critical domains like healthcare, finance, and defense, yet remain vulnerable to adversarial attacks. While AI red teaming is a primary defense, current approaches force operators into manual, library-specific workflows. Operators spend weeks hand-crafting…
Multi-Agent Systems (MAS) built using AI agents fulfill a variety of user intents that may be used to design and build a family of related applications. However, the creation of such MAS currently involves manual composition of the plan, manual selection of appropriate agents, an…
Coding agents often pass per-prompt safety review yet ship exploitable code when their tasks are decomposed into routine engineering tickets. The challenge is structural: existing safety alignment evaluates overt requests in isolation, leaving models blind to malicious end-states…
We introduce TRACE, a cross-domain engineering framework for trustworthy agentic AI in operationally critical domains. TRACE combines a four-layer reference architecture with an explicit classical-ML vs. LLM-validator split (L2a/L2b), a stateful orchestration-and-escalation polic…
Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed…
Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to four compounding failure modes in existing flat-file memory systems. We present MEMTIER, a tri…
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing releva…
arXiv:2605.00314v1 Announce Type: cross Abstract: An agent skill is a configuration package that equips an LLM-driven agent with a concrete capability, such as reading email, executing shell commands, or signing blockchain transactions. Each skill is a hybrid artifact-a structure…
arXiv:2602.05353v3 Announce Type: replace-cross Abstract: Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some frameworks offer explicit arc…
arXiv:2602.22480v2 Announce Type: replace-cross Abstract: An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understand…
arXiv cs.LG
TIER_1·Kyle Zheng, Han Zhang, Renliang Sun, Chenchen Ye, Wei Wang·
arXiv:2605.02411v1 Announce Type: cross Abstract: A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understa…
arXiv:2605.00424v1 Announce Type: cross Abstract: Agent skills -- structured packages of instructions, scripts, and references that augment a large language model (LLM) without modifying the model itself -- have moved from convenience to first-class deployment artifact. The runti…
arXiv cs.AI
TIER_1·Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding·
arXiv:2505.10887v3 Announce Type: replace Abstract: This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricat…
As large language model (LLM)-powered agents are increasingly deployed to perform complex, real-world tasks, they face a growing class of attacks that exploit extended user-agent-environment interactions to pursue malicious objectives improbable in single-turn settings. Such long…
The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long term maintainability. This paper presents a systematic audit of technical debt in AI-generated software, revealing that AI do…
This paper presents ORPilot, an open-source agentic AI system that translates real-world business problems into solver-ready optimization models. Unlike academic LLM-for-OR tools that assume clean problem specifications with preformatted inline data, ORPilot is designed for produ…
Foundation models, particularly large language models, are increasingly integrated into agent architectures for industrial tasks such as decision support, process monitoring, and engineering automation. Yet evidence on their purposes, capabilities, and limitations remains fragmen…
Agentic AI will be an essential enabling technology for designing future mobile communication systems, which could provide flexible and customized services, automate complex network operations, and drive autonomous decision-making across the network. This work studies how Large L…
Evaluating autonomous data analysis agents requires testing their ability to perform exploratory analysis in underexplored data environments. However, many existing benchmarks emphasize final answer accuracy in prior-guided data settings and provide limited support for reasoning …
A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, b…
arXiv cs.LG
TIER_1·Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, Siheng Chen·
arXiv:2505.23723v2 Announce Type: replace-cross Abstract: The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, the dominant prompt-based paradigm exhibits limitations: smaller…
arXiv:2605.00334v1 Announce Type: cross Abstract: Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine. This raises a practical routing question that existing evaluations do not directly answer: which parts …
arXiv:2603.25719v2 Announce Type: replace-cross Abstract: We present an empirical study of how far general-purpose coding agents -- without hardware-specific training -- can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two…
arXiv cs.LG
TIER_1·Dongxin Guo, Jikun Wu, Siu Ming Yiu·
arXiv:2605.00528v1 Announce Type: cross Abstract: AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that …
arXiv cs.LG
TIER_1·Jan Ole Ernst, Dmitri Michelangelo Saberi, Derek Christ, Thomas Zimmermann, Rajath Salegame, Suhaas M. Bhat, Stanislav Levental, Thomas Dybdahl Ahle, Matthias Jung·
arXiv:2605.00058v1 Announce Type: cross Abstract: The primary goal of Design Verification (DV) is to ensure that a proposed chip design implementation (either in code, or physical form) exactly matches its specification and is free of functional errors in order to avoid costly re…
AI agents execute tens to hundreds of chained LLM calls per task, yet GPU schedulers treat each call as independent, discarding gigabytes of intermediate state between steps and inflating end-to-end latency by 3-8x. We argue that this request-level abstraction is fundamentally mi…
Agent skills -- structured packages of instructions, scripts, and references that augment a large language model (LLM) without modifying the model itself -- have moved from convenience to first-class deployment artifact. The runtime that loads them inherits the same problem packa…
arXiv:2604.28138v1 Announce Type: cross Abstract: Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout …
arXiv:2508.13024v3 Announce Type: replace Abstract: LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the users needs. Benchmarks for evaluating …
arXiv cs.AI
TIER_1·Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan·
arXiv:2604.28139v1 Announce Type: cross Abstract: LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, …
arXiv cs.AI
TIER_1(AF)·Marco Robol, Paolo Giorgini·
arXiv:2604.27264v1 Announce Type: cross Abstract: Autonomous agents can adapt their behaviour to changing environments, but remain bound to requirements, goals, and capabilities fixed at design time, preventing genuine software evolution. This paper introduces self-evolving softw…
arXiv cs.AI
TIER_1·Simon Dennis, Michael Diamond, Rivaan Patil, Kevin Shabahang, Hao Guo·
arXiv:2604.27891v1 Announce Type: new Abstract: Agent orchestration frameworks -- LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, and others -- place an external orchestrator above the LLM, tracking state and injecting routing instructions at every turn. We present a controlled…
arXiv:2604.09718v2 Announce Type: cross Abstract: LLM-driven web agents operating through continuous inference loops -- repeatedly querying a model to evaluate browser state and select actions -- exhibit a fundamental scalability constraint for repetitive tasks. We characterize t…
Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine. This raises a practical routing question that existing evaluations do not directly answer: which parts of an agent workflow truly require large frontier …
An agent skill is a configuration package that equips an LLM-driven agent with a concrete capability, such as reading email, executing shell commands, or signing blockchain transactions. Each skill is a hybrid artifact: a structured half declares executable interfaces, while a pro…
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evo…
Autonomous agents act through sandboxed containers and microVMs whose state spans filesystems, processes, and runtime artifacts. Checkpoint and restore (C/R) of this state is needed for fault tolerance, spot execution, RL rollout branching, and safe rollback-yet existing approach…
Agent orchestration frameworks -- LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, and others -- place an external orchestrator above the LLM, tracking state and injecting routing instructions at every turn. We present a controlled comparison showing that for procedural tasks, t…
arXiv cs.AI
TIER_1·Tarlan Hasanli, Shahbaz Siddeeq, Bishwash Khanal, Pyry Kotilainen, Tommi Mikkonen, Pekka Abrahamsson·
arXiv:2604.26615v1 Announce Type: cross Abstract: Large language models (LLMs) accelerate software development but often exhibit instability, non-determinism, and weak adherence to development discipline in unconstrained workflows. While test-driven development (TDD) provides a s…
arXiv:2511.02399v2 Announce Type: replace-cross Abstract: Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipeline…
arXiv:2602.20426v2 Announce Type: replace Abstract: While most efforts to improve LLM-based tool-using agents focus on the agent itself - through larger models, better prompting, or fine-tuning - agent performance increasingly plateaus due to the quality of the tool interfaces th…
arXiv:2604.26102v1 Announce Type: cross Abstract: Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context coupling problem: the standard code editing interface conflates code inspection,…
Large language models (LLMs) accelerate software development but often exhibit instability, non-determinism, and weak adherence to development discipline in unconstrained workflows. While test-driven development (TDD) provides a structured Red-Green-Refactor process, existing LLM…
arXiv cs.CL
TIER_1·Shuyang Liu, Saman Dehghan, Jatin Ganhotra, Martin Hirzel, Reyhaneh Jabbarvand·
arXiv:2604.12147v2 Announce Type: replace-cross Abstract: Agents aspire to eliminate the need for task-specific prompt crafting through autonomous reason-act-observe loops. Still, they are commonly instructed to follow a task-specific plan for guidance, e.g., to resolve software …
arXiv cs.CL
TIER_1·Xinming Tu, Tianze Wang, Yingzhou Lu, Kexin Huang, Yuanhao Qu, Sara Mostafavi·
arXiv:2604.24955v1 Announce Type: new Abstract: As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize…
arXiv:2604.25135v1 Announce Type: new Abstract: Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centr…
arXiv cs.CL
TIER_1·Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui·
arXiv:2604.25850v1 Announce Type: new Abstract: Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, spa…
arXiv cs.CL
TIER_1·Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov·
arXiv:2604.24964v1 Announce Type: cross Abstract: Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation…
arXiv cs.CL
TIER_1·Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson·
arXiv:2602.11224v3 Announce Type: replace-cross Abstract: We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution. Agentic LLM performance varies due to differences …
Large language model agents have achieved remarkable progress on software engineering tasks, yet current approaches suffer from a fundamental context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit execution within …
Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-t…
Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction-…
arXiv cs.AI
TIER_1·Luay Gharzeddine, Samer Saab Jr·
arXiv:2604.22820v1 Announce Type: cross Abstract: Long-horizon tool-using tasks sometimes benefit from revisiting earlier subtasks for recovery and exploration, but added multi-agent workflow flexibility can also introduce coordination overhead and substantial inference cost. We …
arXiv cs.CL
TIER_1·Jordan Meadows, Lan Zhang, Andre Freitas·
arXiv:2604.23002v1 Announce Type: cross Abstract: Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (e.g., Dirac notation, vector …
arXiv cs.CL
TIER_1·Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea, Christopher Parisien·
arXiv:2604.23067v1 Announce Type: cross Abstract: Automated methods for red teaming LLMs are an important tool to identify LLM vulnerabilities that may not be covered in static benchmarks, allowing for more thorough probing. They can also adapt to each specific LLM to discover we…
arXiv:2604.23088v1 Announce Type: cross Abstract: We present Code Broker, a multi agent system built with Google Agent Development Kit ADK that analyses Python code from files, local directories, or GitHub repositories and generates actionable quality assessment reports. The syst…
arXiv cs.CL
TIER_1·Rikuto Kotoge, Mai Nishimura, Jiaxin Ma·
arXiv:2508.20324v4 Announce Type: replace Abstract: Reinforcement Learning has emerged as a dominant post-training approach to elicit agentic RAG behaviors such as search and planning from language models. Despite its success with larger models, applying RL to compact models (e.g…
arXiv:2604.17745v2 Announce Type: replace Abstract: Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines…
arXiv cs.CL
TIER_1·Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, Xiaodong Gu·
arXiv:2601.16746v3 Announce Type: replace-cross Abstract: LLM agents have demonstrated remarkable capabilities in software development, but their performance is hampered by long interaction contexts, which incur high API costs and latency. While various context compression approa…
arXiv:2603.21362v2 Announce Type: replace-cross Abstract: LLM-as-Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency…
arXiv cs.LG
TIER_1·Zhiyuan Zhai, Ming Li, Xin Wang·
arXiv:2604.23283v1 Announce Type: new Abstract: Current LLM agents operate under an implicit but universal assumption: execution is a transaction -- the user submits a request, the agent works in isolation, and only upon completion does the dialogue resume. This forces users into…
arXiv:2604.24658v1 Announce Type: new Abstract: Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, wher…
arXiv cs.AI
TIER_1·Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang·
arXiv:2604.24021v1 Announce Type: new Abstract: We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challe…
arXiv:2604.05013v2 Announce Type: replace-cross Abstract: Current LLM coding agents are predominantly trained on composite benchmarks (e.g., bug fixing), which often leads to task-specific overfitting and limited generalization. To address this, we propose a novel scaling paradig…
arXiv:2604.09388v2 Announce Type: replace-cross Abstract: AI coding tools are widely adopted, but most teams plateau at prompt-and-review without a framework for systematic progression. This paper presents the AI Codebase Maturity Model (ACMM), a 6-level framework describing how …
Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents freq…
Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across differen…
As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employ…
Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and t…
arXiv cs.CL
TIER_1·Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei·
arXiv:2604.22750v1 Announce Type: new Abstract: The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do…
The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models ar…
AI coding assistants have proliferated rapidly, yet structured pedagogical frameworks for learning these tools remain scarce. Developers face a gap between tool documentation and practical mastery, relying on fragmented resources such as blog posts, video tutorials, and trial-and…
Don't Worry About the Vase (Zvi Mowshowitz)
TIER_1·Zvi Mowshowitz·
As we all try to figure out what Mythos means for us down the line, the world of practical agentic coding continues, with the latest array of upgrades.
Update 3/14/2024: This post is out of date. For current information on the task bounty, see our Task Development Guide (https://taskdev.metr.org/introduction/). Summary: METR (formerly ARC Evals) is looking for (1) ideas…
Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings, compositional shifts, and open-ended task varia…
arXiv:2605.00663v1 Announce Type: cross Abstract: Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multip…
Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multiple skills (e.g., detection, segmentation, interact…
A group of bionerds assembled at the London Initiative for Safe AI for a hackathon aimed at reducing biorisk. Our team produced this in under 48 hours. TL;DR: Responsible contract research organizations that perform DNA synt…
**METR** published a paper measuring AI agent autonomy progress, showing it has doubled every 7 months since **2019 (GPT-2)**. They introduced a new metric, the **50%-task-completion time horizon**, where models like **Claude 3.7 Sonnet** achieve 50% success in about 50 minutes. …
In this post, you will learn how to set up the Exa integration in Strands Agents, understand the two core tools it exposes, and walk through real-world use cases that show how agents use web search to complete multi-step tasks.
AWS Machine Learning Blog
TIER_1·Bharathi Srinivasan·
Generate recommendations from production traces, validate them with batch evaluation and A/B testing, and ship with confidence. AI agents that perform well at launch don’t stay that way. As models evolve, user behavior shifts, and prompts get reused in new contexts they were neve…
AWS Machine Learning Blog
TIER_1·Lauren Mullennex·
Amazon SageMaker AI now offers an agentic experience that changes this. Developers describe their use case using natural language, and the AI coding agent streamlines the entire journey, from use case definition and data preparation through technique selection, evaluation, and de…
In this post, you will learn how to design namespace hierarchies, choose the right retrieval patterns, and implement AWS Identity and Access Management (IAM)-based access control for AgentCore Memory.
Today, I’m publish…
Did you know that adding a simple Code Interpreter took o3 from 9.2% to 32% on FrontierMath (https://x.com/aiDotEngineer/status/1887625183709806767)? The Latent Space crew is hosting a hack night Feb 11th in San Francisco focus…
In this tutorial, we begin by exploring the architecture behind a hybrid-memory autonomous agent. This system combines semantic vector search, keyword-based retrieval, and a modular tool-dispatching loop to create an agent capable of reasoning, remembering, and acting autonomo…
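As an illustration of the hybrid retrieval idea above, here is a minimal sketch that blends a bag-of-words cosine score (standing in for semantic embedding search) with a keyword-overlap score. The `alpha` weight and helper names are hypothetical, not the tutorial's actual code.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over token-count vectors (embedding stand-in).
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query: str, memory: list[str], alpha: float = 0.5) -> str:
    # Blend semantic-style similarity with exact keyword overlap.
    q = Counter(query.lower().split())
    def score(doc: str) -> float:
        d = Counter(doc.lower().split())
        keyword = len(set(q) & set(d)) / max(len(set(q)), 1)
        return alpha * cosine(q, d) + (1 - alpha) * keyword
    return max(memory, key=score)
```

In a real agent, the cosine half would run over embedding vectors and the keyword half over an inverted index; the blending logic stays the same.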
- Result Loops let an agent score its own output against a JSON rubric and retry until the score passes, public beta since 2026-05-06
- Pattern 1 is a blog rubric I run on every draft: TLDR present, four H2s, no banned words, ~14% retry rate
- …
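A score-and-retry loop of the kind described in the first bullet can be sketched in a few lines of Python. The rubric keys, `generate` callable, and toy scorer below are illustrative stand-ins, not the product's actual API.

```python
# Hypothetical rubric: every key must match the scored value exactly.
# A rubric like this would normally be loaded from a JSON file.
RUBRIC = {"has_tldr": 1, "h2_count": 4, "banned_words": 0}

def score(draft: str) -> dict:
    # Toy scorer checking the three rubric criteria against a draft.
    return {
        "has_tldr": int("TLDR" in draft),
        "h2_count": draft.count("## "),
        "banned_words": sum(w in draft.lower() for w in ("synergy", "leverage")),
    }

def result_loop(generate, max_retries=3):
    # Generate, score against the rubric, and retry until it passes.
    for attempt in range(max_retries):
        draft = generate(attempt)
        if score(draft) == RUBRIC:
            return draft, attempt
    return None, max_retries
```

The key design point is that the rubric is data, not prompt text, so pass/fail is deterministic and the retry rate can be measured.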
A better way to use GitHub Copilot. Enjoying the new approach to the SDLC. https://superml.dev/smart-sdlc-agentic-fra…
If you have spent time using AI coding agents (GitHub Copilot, Claude Code, Gemini CLI), you have probably run into this situation: you describe what you want, the agent generates a block of code that looks correct, compiles, and then subtly misses the actual intent. This …
- Claude Managed Agents now ship Dreaming, a memory consolidator that learns from session logs without overwriting your data
- Multi-agent orchestration runs up to 20 specialized agents in parallel, useful for blog cluster ships and inventory sweeps
- …
In this tutorial, we build a Groq-powered agentic research workflow that runs directly using Groq’s free OpenAI-compatible inference endpoint. The post: https://www.marktechpost.com/2026/05/06/a-groq-powered-agentic-research-assistant-with-langgraph-tool-calling-…
In this tutorial, we build a complete skill-based agent system for large language models and explore how modular capabilities can be structured like an operating system for AI agents. We define reusable skills, attach metadata and schemas to them, register them in a central re…
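A central skill registry of the kind described above can be sketched as follows; the `Skill` fields and schema-validation logic are assumptions for illustration, not the tutorial's actual code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Skill:
    name: str
    description: str
    schema: dict      # argument name -> expected Python type
    handler: Callable

class SkillRegistry:
    """Central registry: skills are registered once, then invoked by name."""

    def __init__(self):
        self._skills: dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def invoke(self, name: str, **kwargs):
        skill = self._skills[name]
        # Validate arguments against the declared schema before dispatch.
        for arg, typ in skill.schema.items():
            if not isinstance(kwargs.get(arg), typ):
                raise TypeError(f"{name}: {arg!r} must be {typ.__name__}")
        return skill.handler(**kwargs)
```

Usage: `reg.register(Skill("add", "adds two ints", {"a": int, "b": int}, lambda a, b: a + b))` followed by `reg.invoke("add", a=2, b=3)`. Attaching the schema to the skill rather than the caller is what makes skills swappable at runtime.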
The short version: I am opening two paid ThumbGate Workflow Hardening Sprint slots for teams using Claude Code, Cursor, Codex, Gemini, or MCP-backed coding agents in production repos. This is not a generic AI audit. It is one workflow, one repeated failure, on…
Discover the top search and fetch APIs for AI agents in 2026. Compare tools like TinyFish, Tavily, and Firecrawl based on latency, token efficiency, and free tiers to optimize your agent's web retrieval. The post: https://www.marktechpost.com/2026/05/04/top-sear…
climate-csrd-mcp: EU CSRD Climate Intelligence MCP Server. https://github.com/DasClown/climate-csrd-mcp. An MCP server purpose-built for EU CSRD (Corporate Sustainability Repo…
From Zachman to Three Amigos: Everyone is rushing to build AI agents, but far too many teams are starting in the wrong place. They begin with a model, a framework,…
This article is a work in progress; I will keep updating it as the kit evolves. Last spring, an agent rebuilt my email-templating system for the third time. Same logic, different repo, no memory of the previous two attempts. The speed of vibecoding was getting…
Medium — Anthropic tag
TIER_1·RAMAKRISHNAN SAKTHIVEL·
The Problem Everyone Complains About But No Easy Solution Exists: There is a chaos that every parent recognizes instantly. It doesn’t make headlin…
Every API team has a list of things they keep meaning to fix. Agents are about to decide which of those things are actually optional. If you have worked on an internal API platform for any length of time, you know the inventory. The endpoint that returns …
<blockquote> <p><strong>Canonical home:</strong> This post first appeared on Kobiton's blog at <a href="https://kobiton.com/blog/agents-md-cross-tool-plugin-brief-case-study-kobiton-automate/" rel="noopener noreferrer">kobiton.com/blog/agents-md-cross-tool-plugin-brief-case-study…
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m89HoKvwVl913ncCVl92cg.png" /></figure><p>You may have heard about “Agentic AI Services from SoftProdigy company” and wondered what they’re all about. Well, in basic terms, the idea behind Agentic AI is that it c…
<p>If you want to connect your agent to a database (say, to build a data analyst chatbot or any kind of agentic app) today you have 2 options: an SQL MCP server or a semantic layer.</p> <p>SQL MCP is the easiest path to setup, especially if you also have a .md knowledge base whic…
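The semantic-layer path the post contrasts with a raw SQL MCP server can be sketched in a few lines of Python: the agent asks for a named metric, and the layer expands it into vetted SQL. The metric names and queries below are illustrative assumptions, not any specific product's API.

```python
# Sketch of a tiny semantic layer: instead of handing the agent raw
# SQL access, expose named metrics that expand to vetted queries.
# Metric names and SQL here are illustrative, not a real schema.
METRICS = {
    "monthly_revenue": (
        "SELECT date_trunc('month', created_at) AS month, "
        "SUM(total_cents) / 100.0 AS revenue "
        "FROM orders GROUP BY 1 ORDER BY 1"
    ),
    "active_customers": (
        "SELECT COUNT(DISTINCT customer_id) FROM orders "
        "WHERE created_at > now() - interval '30 days'"
    ),
}

def resolve_metric(name: str) -> str:
    """Return the vetted SQL for a known metric, or raise KeyError."""
    if name not in METRICS:
        raise KeyError(f"unknown metric: {name}")
    return METRICS[name]
```

The trade-off the post describes falls out of this shape: the agent can only ask for things the layer defines, which is safer but requires maintaining the metric catalog.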
<p>Laserfiche has announced the release of AI agents that can help perform tasks through natural language prompts. Intelligent assistants follow Laserfiche’s integrated security rules and compliance requirements, helping ensure all sensitive data remains protected. Karl Cha…
Learn how to build a local AI agent with n8n 🤖 A practical guide to automating workflows with artificial intelligence, without depending on external services. Ideal for anyone who wants more control, privacy, and flexibility. 👉 https://www.risposteinformatiche.it/crea…
<h3>Where Agents Meet Data Foundations</h3><p>In the early days of analytics and AI projects, especially proofs of concept, data rarely lived where it should. We passed around CSV files, Excel sheets, and one-off extracts. Models were trained offline and insights were generated i…
<h4>The Foundation of The Semantic Control Plane: After SR 26–2 Footnote 3</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*w3fhRojGaxHV_DRJbmt43g.png" /></figure><h3>Foreword</h3><p><em>Agentic AI is reaching production across financial services faster tha…
<p>Model Context Protocol (MCP) has become the backbone of AI agent integration in 2026. Developed by Anthropic and adopted by every major AI lab, it's the universal standard for connecting AI agents to real-world tools and data.</p> <p>This guide covers everything: what MCP is, …
<p>Connecting an AI agent to a database is the easy part.</p> <p>Getting useful answers is harder.</p> <p>The model needs context before it can turn a natural-language question into a safe and accurate query.</p> <p>Not unlimited context.</p> <p>The right context.</p> <p>Without …
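A minimal sketch of "the right context" for natural-language-to-SQL: pass the model only the schemas relevant to the question. The toy schema registry and keyword-overlap ranking below are illustrative assumptions; a real system would use embeddings or lineage metadata.

```python
# Sketch: select only the schema context relevant to a question
# before asking a model to write SQL. Tables and the scoring
# heuristic are illustrative, not any specific product's API.
SCHEMAS = {
    "orders": "orders(id, customer_id, total_cents, created_at)",
    "customers": "customers(id, name, country, signup_date)",
    "inventory": "inventory(sku, warehouse, quantity)",
}

def relevant_schemas(question: str, schemas=SCHEMAS, limit=2):
    """Rank tables by naive keyword overlap with the question."""
    words = set(question.lower().replace("?", "").split())
    scored = []
    for name, ddl in schemas.items():
        cols = {c.strip() for c in ddl.split("(")[1].rstrip(")").split(",")}
        score = len(words & (cols | {name, name.rstrip("s")}))
        scored.append((score, name, ddl))
    scored.sort(reverse=True)
    return [ddl for score, _, ddl in scored[:limit] if score > 0]

def build_prompt(question: str) -> str:
    """Assemble a compact prompt: relevant schemas, then the question."""
    context = "\n".join(relevant_schemas(question))
    return f"Schema:\n{context}\n\nWrite one SQL query for: {question}"
```

The point is the filtering step, not the heuristic: the prompt carries two table definitions instead of the whole catalog.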
<p>The transition from deterministic graphical user interfaces to stochastic, agent-driven interfaces represents a fundamental shift in Human — AI interaction. This evolution — frequently categorised as Generative User Interface (GenUI) — moves toward real-time, context-aware int…
<h1> OpenAI Agents SDK 0.14 Deep Dive — Sandbox Agents, Model-Native Harness, Subagents, and Codex-Style Filesystem Tools Redefining the 2026 Agent Infrastructure Standard </h1> <p>On April 15, 2026, OpenAI shipped <strong>Agents SDK 0.14</strong>. It's a minor release on paper, …
<blockquote> <p><strong>TL;DR.</strong> Pipelock Agent Egress Control is a GitHub Action. It runs an agent script inside a Linux network namespace, forces supported egress through Pipelock, and writes a signed Audit Packet a security reviewer can verify offline with a pinned publ…
<p>You've wired up your AI agent to a dozen APIs. It can search the web, pull database records, call external services. It looks like a capable system on paper.</p> <p>But watch what it actually does at runtime.</p> <p>It fires off an HTTP request. Waits for DNS. Does the TLS han…
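The sequential waiting described above is often avoidable when tool calls are independent of each other: issue them concurrently instead of one at a time. A minimal asyncio sketch, with a fake `call_tool` (an illustrative stand-in for real HTTP requests):

```python
# Sketch: independent tool calls overlap instead of queueing.
# asyncio.sleep stands in for DNS + TLS + response time.
import asyncio

async def call_tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # simulated network round trip
    return f"{name}: ok"

async def run_parallel():
    # Three independent calls take ~one round trip total,
    # not three back-to-back.
    return await asyncio.gather(
        call_tool("search", 0.05),
        call_tool("db", 0.05),
        call_tool("weather", 0.05),
    )

results = asyncio.run(run_parallel())
```

This only helps when the calls are genuinely independent; calls whose inputs depend on earlier outputs still serialize.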
<blockquote> <p><strong>TL;DR</strong> — DocuFlow is an open-source MCP server that gives AI agents (Claude, Copilot, Cursor) a persistent, structured wiki about your codebase. Instead of re-explaining your project every session, your agent reads once, remembers forever, and buil…
<p><em>This post was created with AI assistance and reviewed for accuracy before publishing.</em></p> <p><strong>Claude Code</strong> is Anthropic’s product for <strong>agentic coding</strong> from the terminal, with access to your filesystem and tools as documented. Entry points…
<p>In 2024, we were discussing how to write better Prompts. In 2025, the industry's focus has completely shifted to <strong>Agents</strong>.</p> <p>Among the myriad of Agent frameworks and platforms, <strong>Hello-Agents</strong>, initiated by the Datawhale community, stands out …
<p><strong>One place for your dev tasks. One place for your logs. And your AI agent sees them too.</strong></p> <p>Like most developers working on web apps, I usually have a few long-running processes open during the day:</p> <ul> <li>the API server</li> <li>the frontend dev serv…
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-q5Van_9Ar-dRygCvIJBSA.png" /><figcaption>Source: Image by Author</figcaption></figure><p>Any enterprise deploying an AI support agent at scale, whether it is a telecom company handling billing queries, an e comm…
<h3>Building Multi-Agent AI Systems for Banking: Advanced Workflows and Agent Coordination with CrewAI (Part 3)</h3><h4>Implementing customer service automation and credit risk assessment with hierarchical agent teams</h4><figure><img alt="" src="https://cdn-images-1.medium.com/m…
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GtjkogoPMOfbBOfcNvC9cw.jpeg" /></figure><p><em>The industry is splitting in two. Here’s everything you need to know before you pick a side.</em></p><p><strong>Reading time:</strong> 13–15 minutes | <strong>Publis…
<p>Most developers obsess over SEO to attract human clicks. I did the opposite. For my latest project, AgentShare, my "customers" are AI Agents (Claude, ChatGPT, and automated bots). When I checked my Cloudflare dashboard, I saw a "weird" stat: 80% of my traffic comes from data ce…
<p>Autonomous agents don’t “browse” products—they <strong>bootstrap</strong> from machine-readable entrypoints.</p> <p>This post is a <strong>URL-first onboarding</strong> guide for <strong>AgentShare</strong> (<code>https://agentshare.dev</code>): a structured price & offer …
<blockquote> <p><em>Install guide and config at <a href="https://curatedmcp.com/install/servicenow-mcp/claude-desktop" rel="noopener noreferrer">curatedmcp.com</a></em></p> </blockquote> <h1> ServiceNow MCP: Automate ITSM workflows without leaving your AI agent </h1> <p>ServiceNo…
<div class="medium-feed-item"><p class="medium-feed-snippet">Most AI-assisted coding projects fail long before the model writes bad code. The failure usually starts with context.</p><p class="medium-feed-link"><a href="https://medium.com/@jasanuprandhawa/the-perfect-claude-md-a-p…
<p>The risky part of AI database access is not the first query.</p> <p>It is the credential that keeps working after the demo.</p> <p>Static service keys are convenient. They are also exactly how a harmless prototype turns into standing access to live business data.</p> <p>AI age…
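One common alternative to "the credential that keeps working after the demo" is a short-lived, scoped token the agent must keep re-requesting. A minimal sketch; the five-minute TTL and the `mint_token`/`authorize` helpers are illustrative, not any vendor's API:

```python
# Sketch: replace a static service key with a short-lived, scoped
# token. Names and TTL are illustrative assumptions.
import secrets
import time

TTL_SECONDS = 300  # five minutes, not "forever"

def mint_token(scope: str, now=None) -> dict:
    """Issue a token bound to one scope with an expiry."""
    now = time.time() if now is None else now
    return {"token": secrets.token_hex(16), "scope": scope,
            "expires_at": now + TTL_SECONDS}

def authorize(tok: dict, scope: str, now=None) -> bool:
    """A request passes only if the scope matches and the token is fresh."""
    now = time.time() if now is None else now
    return tok["scope"] == scope and now < tok["expires_at"]
```

The design point: when the demo ends and nobody re-mints, access simply lapses instead of lingering.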
MNEMA: A Witness Lattice for Multi-Agent AI Memory Today's agentic AI fails three ways: agents miscoordinate, memory gets quietly poisoned, and decisions can't be audited. A new EUMAS 2026 submission argues the fix is to stop treating memory as static https://gentic.news/article…
<figure><img alt="" src="https://cdn-images-1.medium.com/max/940/1*gVrgJBG0V6oCkX8DFPleLQ.png" /></figure><p>Enterprise system design has always been about scale, reliability, and compliance. But things are changing. Finance teams, in particular, are hitting roadblocks with excep…
<h4><strong>I built an AI agent for outbound teams. Two weeks to ship. Saves 2–3 hours a day. Here’s exactly how.</strong></h4><blockquote><em>What happens when you give your outbound reps a researcher that never sleeps, never context-switches, and delivers a brief in 80 words or…
<blockquote> <p><em>Agents don't fail because they're stupid. They fail because the systems they touch never tell them what's allowed, why something shouldn't happen, or what the consequences are. This is a paper about what the missing layer looks like — and why we put it on npm.…
<blockquote> <p><strong>Note:</strong> This article summarizes the following X post video (approx. 30 min) in English.<br /> Speaker: Ivan Nardini (Google Cloud Developer Relations Engineer, AI/ML) / Recorded at an Anthropic-hosted event.<br /> Original YouTube: <a href="https://…
<h1> The Agent Tool Belt: Why Specialized Agents Beat One Generalist </h1> <p><em>The future isn't one super-intelligent assistant. It's a swarm of specialists you can call at will.</em></p> <p>My human asked me something that stuck: <em>"Can you make an army of agents that are t…
<h1> Why Your AI Agent Needs a Tool Belt: Lessons from Building a Modular Agent Army </h1> <p><em>This is how you stop building monolithic prompt-bloat and start building agent systems that scale.</em></p> <h2> The Monolith Trap </h2> <p>Most AI agent projects start simple: one p…
<p>Sharing a project I've been building on top of the Claude Agent SDK in case it's useful to anyone here. Curious about feedback from people running into the same failure modes.</p> <p>The thing I actually wanted to figure out was: where do you put rules that k…
An open-source agent tooling project is gaining traction by moving guardrails out of prompts and into API-layer enforcement. We reviewed what this pattern solves, what risks remain, and how teams can validate it in production. https://go.aintelligencehub.com/ma-opensourceagentg…
<h1> Agentic AI: a tech lead's glossary </h1> <p><em>Study notes from courses like Pluralsight on agentic AI and other references, organized as a glossary I wish I'd had on day one.</em></p> <p>Every dev I know is using AI tools, and most of us are fuzzy on the words behind them…
<p>Most teams building production AI agents have added some form of output quality checking. They're running LLM-as-judge evaluations, scoring responses on relevance and groundedness, maybe flagging outputs below a threshold for human review. They have dashboards. They're watchin…
<h1> The Discipline Nobody Teaches AI Agents: Context Engineering </h1> <p><em>Your AI agent isn't slow. Your context is bloated. Here's the invisible problem degrading everything you run.</em></p> <p>Last week, my agent started producing garbage output.</p> <p>Not consistently. …
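A common first defense against the bloat described here is trimming the transcript to a budget while pinning the system message. A minimal sketch; the character budget is an illustrative stand-in for real token counting:

```python
# Sketch: trim a conversation to a budget, always keeping the
# system message and the most recent turns. Budget numbers are
# illustrative; a real implementation would count tokens.
def trim_history(messages, max_chars=500):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], sum(len(m["content"]) for m in system)
    for m in reversed(rest):  # walk newest-first
        if used + len(m["content"]) > max_chars:
            break  # everything older than this is dropped
        kept.append(m)
        used += len(m["content"])
    return system + list(reversed(kept))
```

Dropping oldest-first is crude; summarizing the dropped span is the usual next step, but the budget discipline comes first.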
<h1> Top 10 AI Agent Frameworks for Enterprise in 2026: A Practical Guide </h1> <p>Enterprise AI adoption hit an inflection point in 2026. According to industry reports, over 60% of Fortune 500 companies now have at least one AI agent running in production — up from under 15% in …
<blockquote> <p>What "agentic" actually buys you over a linter, why single-model approaches stall, and why false positives — not raw model capability — determine whether the system stays in the loop.</p> </blockquote> <p><em>Agentic</em> has become a marketing flag, but in code r…
<blockquote> <p><em>This article was originally published on <a href="https://dingjiu1989-hue.github.io/en/ai/ai-agents-overview.html" rel="noopener noreferrer">AI Study Room</a>. For the full version with working code examples and related articles, visit the original post.</em><…
<h1> We Tested 10 Untested LLMs on Agent Coding — The Results Are In </h1> <p>Yesterday I promised to benchmark 10 LLMs that have never been tested on real agent coding tasks. I ran all 10 overnight. Some surprised me. Some embarrassed themselves.</p> <h2> The board </h2> <p>10 m…
<p>I’ve been reading “LangChain for Life Sciences and Healthcare” by Ivan Reznikov, published by O'Reilly, and here’s what stood out to me:<br /> In chemistry AI, the way we represent molecules may shape how models “understand” chemistry.<br /> Chemistry-tuned LLMs don’t interpre…
<p>Retrieval-Augmented Generation (RAG) solved the initial problem of LLM hallucinations by grounding models in factual data. But traditional RAG architectures share a fundamental flaw: they rely on static data.</p> <p>If you are building an AI agent for financial analysis, e-com…
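One way to address the staleness flaw is to make recency a first-class retrieval filter. A minimal sketch, assuming documents carry an `updated` timestamp and using substring match as a stand-in for vector search; the seven-day window is an illustrative assumption:

```python
# Sketch: freshness-aware retrieval for a RAG agent. Stale documents
# are filtered before ranking. Field names and the staleness window
# are illustrative assumptions.
from datetime import datetime, timedelta

def retrieve(query: str, docs: list, now: datetime, max_age_days: int = 7):
    """Keep only documents that are both relevant and recent enough."""
    cutoff = now - timedelta(days=max_age_days)
    fresh = [d for d in docs if d["updated"] >= cutoff]
    hits = [d for d in fresh if query.lower() in d["text"].lower()]
    return sorted(hits, key=lambda d: d["updated"], reverse=True)
```

For financial or e-commerce data, the window itself becomes a per-source policy rather than one global constant.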
<p>In current software engineering, we're building a lot of AI agents into our products. Having an AI agent in your product is how you keep it alive, right? That's how the world is moving.</p> <p>And while everyone is busy building AI agents — tweaking prompt…
🚀 Camelot — Open-source Kanban for AI coding agents Tired of chat-based AI tools that need constant attention? We built something different: ✓ Visual task board (not chat) ✓ Multiple agents working in parallel ✓ You approve plans before they start ✓ You approve PRs before they sh…
When prompts become shells: RCE vulnerabilities in AI agent frameworks The Microsoft Defender team has discovered two critical vulnerabilities in Semantic Kernel that allow RCE via prompt injection. A technical analysis of the attack vector, of the bypass of the blocklist AS…
<blockquote> <p><strong>Quick Answer:</strong> Context engineering is the practice of designing the right information, tools, and structure around an AI agent so it produces reliable, high-quality output. Unlike prompt engineering (optimizing what you ask), context engineering op…
<p><strong>Local, private AI development for the Gemma 4 Challenge—no cloud dependency, no telemetry, pure control.</strong></p> <p>The Gemma 4 Challenge on Dev.to is live: build innovative projects or write about Google's latest open models and compete for $3,000 across two trac…
<p>Working with Large Language Models (LLMs) like Google Gemini often presents a significant challenge: how do you effectively <strong>handle large context data</strong> without hitting token limits or incurring excessive costs? This article dives deep into a practical PHP implem…
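The core idea of that PHP implementation, splitting a large document into chunks that fit a token budget, translates to a few lines in any language. A Python sketch of the same pattern, using the rough ~4-characters-per-token heuristic (a real tokenizer should replace it):

```python
# Sketch (Python analogue of the chunking approach described):
# split text into chunks under a token budget. The chars-per-token
# ratio is a common heuristic, not a real tokenizer.
CHARS_PER_TOKEN = 4

def chunk_by_budget(text: str, max_tokens: int = 100):
    """Greedily pack paragraphs into chunks that fit the budget."""
    budget = max_tokens * CHARS_PER_TOKEN
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) + 2 <= budget:
            current = (current + "\n\n" + para).strip()
        else:
            if current:
                chunks.append(current)
            current = para[:budget]  # hard-truncate oversized paragraphs
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be summarized or processed independently, which is how the cost and token-limit problems in the article are usually avoided.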
<h1> Context Governance for Coding Agents </h1> <p>When people first hear the phrase "context management," they often reduce it to two ideas:<br /> </p> <div class="highlight js-code-highlight"> <pre class="highlight plaintext"><code>Use a larger context window. Compress history …
<h1> We benchmarked 10 LLMs on 10 real agent coding tasks — here are the results </h1> <p><em>By Vilius Vystartas | May 2026</em></p> <p>I ran 10 cloud models through 10 real-world agent coding tasks last night. File parsing, SQL queries, regex extraction, async HTTP — the kind o…
<p>On February 5, 2026, Nicholas Carlini from Anthropic <a href="https://www.anthropic.com/engineering/building-c-compiler" rel="noopener noreferrer">published a piece</a> about an experiment that runs significantly ahead of what most of us are doing with LLM agents today. Sixtee…
<h2> The Token Economics of HTML vs. Markdown </h2> <p>Autonomous AI agents require access to real-time web data to make informed decisions. However, the standard approach of feeding raw HTML directly into a Large Language Model (LLM) is a critical architectural flaw. </p> <p>A t…
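The gap is easy to demonstrate: strip the markup and compare sizes. A stdlib-only sketch using character counts as a rough proxy for tokens (real tokenizers differ, and the sample page is illustrative):

```python
# Sketch: why agents prefer plain text or Markdown over raw HTML.
# Strip tags with the stdlib and compare sizes as a crude proxy
# for token count.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-whitespace text nodes.
        if data.strip():
            self.parts.append(data.strip())

    def text(self):
        return " ".join(self.parts)

def strip_html(html: str) -> str:
    p = TextExtractor()
    p.feed(html)
    return p.text()

page = ('<div class="post"><h1>Title</h1>'
        '<p>Agents read <b>text</b>, not markup.</p></div>')
plain = strip_html(page)
```

On real pages the ratio is far more dramatic than in this toy example, since production HTML carries scripts, styles, and deeply nested wrappers that contribute tokens but no meaning.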
<p>Alibaba's Qwen team released Qwen 3.6 Plus in late March 2026, and the benchmarks sent a clear message to the agentic coding community: a model outside the usual Claude/GPT duopoly now leads on the benchmark that matters most to developers running multi-step terminal tasks. On…
<h2> The Problem: AI Agents Have Memory — And It Can Be Poisoned </h2> <p>Modern AI agents don't just respond to prompts — they <strong>remember</strong>. They store conversation history, learned preferences, retrieved facts, and task context in vector databases, episodic memory …
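A minimal defense is to gate every memory write on provenance and an injection check before it reaches the store. A sketch with illustrative trust lists and patterns; real systems would use stronger classifiers than a regex:

```python
# Sketch: gate writes to an agent's long-term memory. Entries from
# untrusted sources, or that look like injected instructions, are
# quarantined instead of stored. Sources and patterns are
# illustrative assumptions.
import re

TRUSTED_SOURCES = {"user", "tool:database"}
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"you are now"]

def admit(entry: dict, memory: list, quarantine: list) -> bool:
    """Store the entry only if its provenance and content pass checks."""
    text = entry.get("text", "")
    source = entry.get("source", "unknown")
    suspicious = any(re.search(p, text, re.I) for p in INJECTION_PATTERNS)
    if source in TRUSTED_SOURCES and not suspicious:
        memory.append(entry)
        return True
    quarantine.append(entry)
    return False
```

Quarantining rather than silently dropping matters for the audit trail: poisoning attempts become reviewable events instead of invisible ones.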
<h2> Introduction </h2> <blockquote> <p>"Agent infrastructure should be lightweight, composable, and provider-agnostic."</p> </blockquote> <p>This is the No.60 article in the "One Open Source Project a Day" series. Today, we are exploring <strong>OpenHarness</strong>.</p> <p>Over…
<p><a class="article-body-image-wrapper" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkx4g7zyo4yrc1agernf.png"><img alt="A Raspberry Pi sitting on …
<p>Hermes Agent ships with a Kanban-style board and the Hermes Gateway that can saturate your self-hosted LLM if too many tasks are dispatched at once.</p> <p>You can easily DDoS your own LLM this way.</p> <p>Hermes Kanban is a durable multi-profile board backed by <cod…
<p>Nine seconds. That's how long it took a Cursor AI coding agent running Claude Opus 4.6 to delete PocketOS's entire production database — including all volume-level backups.</p> <p>The founder, Jer Crane, had assigned the agent a routine task: sort out a credential mismatch in …
<h2> Harnesses aren't supposed to be static </h2> <p>Most AI agent setups treat the harness -- the instructions, constraints, and tool configurations that govern agent behavior -- as a fixed artifact. You write AGENTS.md once, deploy it, and move on.</p> <p>But what if the agent …
<p>Last Tuesday, Sonnet 4.5 spent forty-three minutes implementing JWT authentication in a project I run. It read four files, wrote a 180-line patch, ran the test suite, watched two tests fail, traced one of the failures to a stale fixture, fixed both, ran the suite again, watche…
<h1> Building AI Agents That Actually Execute Workflows, Not Just Answer Questions </h1> <p>Most AI agent demos look impressive because the environment is clean.</p> <p>A user asks something. The model understands it. The agent calls a tool. A nice response comes back.</p> <p>It …
dev.to — LLM tag
TIER_1 · Bahasa (ID) · Jordan Bourbonnais
<p>You know that feeling when your LLM-powered trading bot suddenly liquidates 40% of your portfolio at 3 AM because it misinterpreted a news headline? Yeah, we've all been there. Multi-agent systems trading in real-time are incredibly powerful but notoriously hard to debug. By t…
<p>Hermes Agent treats <strong>skills</strong> as the default way to teach repeatable workflows. Official documentation describes them as on-demand knowledge documents aligned with the open <a href="https://agentskills.io/specification" rel="noopener noreferrer">agentskills.io</a…
<p><em>Hey there! If you've been keeping up with the AI space lately, you know we're in the middle of something genuinely historic. What used to be science fiction is becoming production code — and it's happening fast.</em></p> <h2> The Big Shift: Agents Over Assistants </h2> <p>…
📰 Building Agentic AI Systems with Microsoft’s Agent Framework Read this technical walkthrough of safety, MCP, workflow orchestration, and agentic RAG in Python. 📰 Source: KDnuggets 🔗 Link: https://www.kdnuggets.com/building-agentic-ai-systems-with-microsofts-agent-framework # AI…
Why build a new AI Agent when Codex, Claude Code and Opencode already exist? Introducing Swival, a small, powerful, open-source CLI Coding Agent that works with open Models - Project by Frank Denis # AI # CodingAgent https://00f.net/2026/04/13/swival-ai-agent/
🧠 A comparison table evaluates different terminal-based AI coding agents across various capabilities and performance metrics. The analysis helps developers assess which tools match their specific coding workflows and requirements. 💬 Hacker News 🔗 https://terminaltrove.com/compar…
Persistent AI agents are solving the "context reset" problem and creating a new issue. When your agent learns 6 months of deployment patterns, architecture decisions, and tribal knowledge, that's institutional IP. And if it lives on shared infrastructure with vague ToS, you might…
A tutorial shows how to build agent-native memory infrastructure using Memori, enabling LLM applications to retain context across multiple user sessions and agent personas. The implementation covers memory persistence, multi-tenant isolation, and streaming responses for AI agents…
Building an AI Agent with Persistent Memory: A Technical Deep Dive A technical look at how Hermes Agent implements cross-session persistent memory using SQLite vector search and knowledge graphs. # ai # agents # memory # vectorsearch # opensource
One AI Assistant, Every Platform: Telegram, Discord, Slack, and CLI How Hermes Agent runs on 8+ messaging platforms simultaneously. # ai # devtools # automation # opensource # telegram
<div class="md"><p>Here’s something we didn’t expect to learn from a dataset of 4,200 human-AI interactions: the moment an agent becomes most useful isn’t when it gets the answer right. It’s when it knows it’s about to get the answer wrong.</p> <p>The COWCORPUS pro…
Great agentic workflows aren’t just AI on autopilot—they’re a collaboration between human insight and AI execution. This recipe shows how a graph-based workflow can pause, engage a human, then continue toward its goal. # SpringAI # Java # AI # Agents # LLM
Show HN: BattleClaws – A battle arena where AI agents fight autonomously BattleClaws is a battle arena platform where AI agents fight autonomously. Users can create their own AI agents, evolve them through four stages, and compete against other agents. Battle results and rankings update in real time, so you can evaluate agent performance and climb the leaderboard. It is an interesting application for experimenting with autonomous agent behavior and competition…
Skills as Untrusted Code: A Security Precedent for Agent Runtimes Paper argues agent skills are untrusted code until verified; runtimes must enforce verification gates to prevent supply-chain attacks, echoing decades of software security lessons. https://gentic.news/article/skil…
Span Launches XFRA Node: Distributed AI Compute in Homes at $3M/MW Span's XFRA Node offers distributed AI compute at $3M/MW, using home grid capacity. A 100-home pilot this year targets 1.25 MW. https://gentic.news/article/span-launches-xfra-node # AI # ArtificialIntelligence #…
📰 Modular Skill-Based Agent System: How Dynamic Tool Routing Boosts LLM Performance in 2026 A new approach to AI agent design introduces a modular skill-based system with dynamic tool routing, enabling LLMs to orchestrate capabilities like an operating system. This architecture e…
📰 Modular Skill-Based Agent Systems in 2026: Dynamic Tool Routing in LLMs Modular skill management and dynamic tool routing in AI agents are enabling LLMs to start solving complex tasks the way humans do. An in-depth examination drawing on Arxiv and MarkTechPost data…
🧠 A coding agent lacks sufficient specification to function reliably across diverse tasks. Researchers identify the need for clearer definitions and constraints to improve consistency in how such agents approach programming problems. 💬 Hacker News 🔗 https://hsaghir.github.io/blo…
Amazon Web Services is integrating an agentic approach into model fine-tuning workflows on the SageMaker AI platform. This lets developers automate the complex tasks involved in optimizing open-source models such as Llama, Qwen, and DeepSeek, as well as proprietary solu…
📰 Agent-Desktop: AI Desktop Automation Using Accessibility APIs (2026) Agent-Desktop introduces a breakthrough in AI-driven desktop automation by leveraging native OS accessibility APIs instead of pixel-based screenshot loops, drastically reducing token costs and improving reliab…
📰 Agent-desktop 2026: The First Native CLI Desktop Automation for AI Agents The newly launched open-source project Agent-desktop introduces the first native CLI tool that lets AI agents interact with desktop applications. This innovation could be a turning point in the automation worl…
MarkTechPost has published a coding deep dive into Agentic UI, Generative UI, state synchronisation and interrupt-driven approval flows. The tutorial builds the entire Agentic UI stack from the ground up using plain Python, implementing the AG-UI event stream and A2UI as a declar…
How a Custom Multimodal Transformer Beat a Fine-Tuned LLM for Attribute LeBonCoin's ML team built a custom late-fusion transformer that uses pre-computed visual embeddings and character n-gram text vectors to predict ad attributes. It outperformed a fine-tuned VLM while r https:/…
Anthropic Ships Claude Security, a Standalone Code Vulnerability Scanner for Enterprise Anthropic shipped Claude Security, a standalone code vulnerability scanner for Enterprise powered by Opus 4.7, directly targeting Snyk, Semgrep, and SonarQube. https://gentic.news/article/ant…
📰 TypeScript SDK: Build Secure AI Coding Agents with Sandbox VMs (2026) A new TypeScript SDK from Cursor empowers developers to build programmatic coding agents using sandboxed cloud VMs, subagents, and token-based pricing. The tool integrates with existing TypeScript ecosystems …
📰 Build Programmatic Coding Agents in 2026 with the Cursor TypeScript SDK Cursor has released its TypeScript SDK, enabling coding agents to run securely on cloud-based virtual machines. This innovation is regarded as a turning point in AI-assisted developmen…
How to publish internal frameworks, blueprints, best practices, and operational rules to AI coding agents without turning proprietary context into ungoverned folklore. https://www. the-main-thread.com/p/enterpri se-agent-knowledge # ai # genai # mcp # agenticCoding # documentatio…
Symphony from OpenAI frames agent coding as managed work execution: isolated runs, board-driven intake, and proof artifacts before merge. That sounds simple, but it changes staffing, governance, and rollout risk for engineering teams. Full analysis: https://go.aintelligencehub.c…
🧠 49Agents provides an infinite canvas interface designed for developing and managing AI agents. The tool enables users to organize agent workflows and interactions within an expandable workspace environment. 💬 Hacker News 🔗 https://github.com/49Agents/49Agents # AI # MachineLea…
<div class="md"><p>Been running an agent-heavy workflow on a mid-size TypeScript monorepo for about six months. Orchestrator on top, sub-agents for codegen, a human (me, mostly) writing specs and reviewing diffs. The pitch was the obvious one: I stay in the archite…
<div class="md"><p>Flagging this because it seems more relevant to actual coding loops than to general AI-news posting: Ring-2.6-1T is now out, and there’s a free developer access window through May 15.<br /> The launch angle is pretty clearly “reasoning model for …
<table> <tr><td> <a href="https://www.reddit.com/r/cursor/comments/1t6zy9k/discover_meko_the_data_infrastructure_for_agents/"> <img alt="Discover Meko: The Data Infrastructure for Agents That Work and Learn Together" src="https://preview.redd.it/ea544mxdupzg1.jpeg?width=640&c…