AI agents evolve: Research tackles scaling, safety, and emergent network risks
By PulseAugur Editorial
Summary by gemini-2.5-flash-lite
from 157 sources
Researchers are developing a science of scaling AI agent systems, moving beyond the heuristic that more agents are always better. New studies reveal that multi-agent coordination significantly improves performance on parallelizable tasks but can degrade it on sequential ones. Efforts are underway to create predictive models for optimal agent architecture and to develop methods for real-time evaluation and error mitigation in agent interactions.
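The parallel-versus-sequential tradeoff in the summary can be illustrated with a toy Amdahl-style cost model. This is purely a sketch: the function name, the linear coordination-overhead term, and all numbers are assumptions for illustration, not figures from the studies summarized here.

```python
def completion_time(n_agents, parallel_fraction, coord_overhead=0.1):
    """Toy cost model, normalized so one agent takes time 1.0: the
    parallelizable share of the task divides across agents, the sequential
    share does not, and each extra agent adds a fixed coordination cost."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction / n_agents + coord_overhead * (n_agents - 1)

# Mostly parallel task: four agents finish faster than one (time < 1.0)
print(completion_time(4, parallel_fraction=0.9))
# Mostly sequential task: coordination overhead makes four agents slower (time > 1.0)
print(completion_time(4, parallel_fraction=0.2))
```

Under these assumed numbers, adding agents pays off only while the shrinking parallel share outweighs the growing coordination cost, which is the qualitative pattern the summary describes.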
AI
IMPACT
New research is defining principles for effective AI agent system design, moving beyond simple scaling heuristics and addressing complex coordination and safety challenges.
RANK_REASON
Multiple research papers and studies are exploring the science of scaling AI agent systems, their coordination, and their interactions.
How Netomi scales enterprise AI agents using GPT-4.1 and GPT-5.2, combining concurrency, governance, and multi-step reasoning for reliable production workflows.
This paper was accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026. Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnect…
Microsoft Research
TIER_1·Gagan Bansal, Shujaat Mirza, Keegan Hines, Will Epperson, Zachary Huang, Whitney Maxwell, Pete Bryan, Tyler Payne, Adam Fourney, Amanda Swearngin, Wenyue Hua, Tori Westerhoff, Amanda Minnich, Maya Murad, Ece Kamar, Ram Shankar Siva Kumar, Saleema Amershi·
Safe agents don’t guarantee a safe ecosystem of interconnected agents. Microsoft Research examines what breaks when AI agents interact and why network-level risks require new approaches.
Prompt specifications for multi-agent large language model (LLM) systems carry data contracts and integration logic across many interdependent files but are rarely subjected to structured-inspection rigor. This paper reports a single-system empirical case study of iterative, agen…
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable…
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP…
Experience-driven self-evolving agents aim to overcome the static nature of large language models by distilling reusable experience from past interactions, thus enabling adaptation to novel tasks at deployment time. This process places substantial demands on the foundation model'…
We investigate the emergent collective dynamics of LLM-based multi-agent systems on a 2D square lattice and present a model-agnostic statistical-physics method to disentangle social conformity from intrinsic bias, compute critical exponents, and probe the collective behavior and …
As artificial intelligence engineering paradigms shift from single-agent Prompt and Context Engineering toward multi-agent Coordination Engineering, the ability to codify and systematically improve how multiple agents collaborate has emerged as a critical bottleneck. Whi…
Multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self-evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at…
Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuris…
Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model-game set…
As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents ca…
Relational learning is a challenging problem that has motivated a wide range of approaches, including graph-based models (e.g., graph neural networks, graph transformers), tabular methods (e.g., tabular foundation models), and sequence-based approaches (e.g., large language model…
Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack-selection bias, disproportion…
The concurrent target assignment and pathfinding (TAPF) problem extends multi-agent pathfinding (MAPF) by asking planners to allocate distinct targets and collision-free paths to agents. Prior work on TAPF has relied exclusively on Conflict-Based Search (CBS), which tightly coupl…
arXiv cs.LG
TIER_1·Yi Xie, Yangyang Xu, Yi Fan, Bo Liu·
arXiv:2605.05216v1 Announce Type: new Abstract: Large language models (LLMs) with a large number of parameters achieve strong performance but are often prohibitively expensive to deploy. Recent work explores using teams of smaller, more efficient LLMs that collectively match or e…
arXiv:2605.05704v1 Announce Type: cross Abstract: With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors c…
arXiv cs.AI
TIER_1·Keisuke Kamahori, Shihang Li, Simon Peter, Baris Kasikci·
arXiv:2605.06068v1 Announce Type: new Abstract: For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite…
arXiv cs.AI
TIER_1·Yuliang Xu, Xiang Xu, Yao Wan, Hu Wei, Tong Jia·
arXiv:2605.05949v1 Announce Type: new Abstract: Algorithmic problem solving serves as a rigorous testbed for evaluating structured reasoning in AI coding systems, as it directly reflects a model's ability to perform structured reasoning in complex scenarios. Existing approaches pr…
arXiv:2605.05726v1 Announce Type: new Abstract: As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, b…
arXiv:2605.05701v1 Announce Type: new Abstract: LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger m…
arXiv:2605.05413v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly used to operate browsers, files, code and tools, making personal assistants a natural deployment target. Yet personal agents face a privacy-cost-capability tension: cloud models exe…
arXiv:2512.06721v2 Announce Type: replace-cross Abstract: Recent studies have begun to explore proactive large language model (LLM) agents that provide unobtrusive assistance by automatically leveraging contextual information, such as in code editing and in-app suggestions. Howev…
arXiv cs.CL
TIER_1·Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang, Zhenxi Song, Min Zhang·
arXiv:2605.06623v1 Announce Type: cross Abstract: Large language model (LLM)-based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivota…
arXiv:2605.05716v1 Announce Type: cross Abstract: LLM agent systems are built by stacking scaffolding components (planning, tools, memory, self-reflection, retrieval) assuming more is better. We study cross-component interference (CCI): degradation when components interact destru…
arXiv:2603.12031v2 Announce Type: replace-cross Abstract: State-of-the-art cloud-native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility-based placement by …
arXiv cs.LG
TIER_1·Huchen Yang, Xinghao Dong, Dan Negrut, Jin-Long Wu·
arXiv:2605.05703v1 Announce Type: cross Abstract: Optimizing the communication structure of large language model based multi-agent systems (LLM-MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled traini…
arXiv:2605.06639v1 Announce Type: new Abstract: We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents impleme…
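The spawn-and-delegate pattern the RAO abstract describes can be sketched as control flow. This is a minimal illustration only: `split` and `solve_leaf` are hypothetical stand-ins for the decomposition and base-case model calls a real recursive agent would make, not anything from the paper.

```python
def recursive_solve(task, depth=0, max_depth=2):
    """An agent either answers a task directly or spawns fresh
    instantiations of itself on subtasks and aggregates their results."""
    subtasks = split(task)
    if depth == max_depth or not subtasks:
        return solve_leaf(task)
    # Delegate each subtask to a new instance of the same agent.
    return sum(recursive_solve(sub, depth + 1, max_depth) for sub in subtasks)

def split(task):
    # Hypothetical decomposition step: halve a range-summing task.
    lo, hi = task
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)] if hi - lo > 2 else []

def solve_leaf(task):
    # Hypothetical base-case "model call".
    lo, hi = task
    return sum(range(lo, hi))

print(recursive_solve((0, 8)))  # → 28, same as sum(range(8))
```

Capping `max_depth` bounds the recursion, which is what makes this an inference-time scaling knob: deeper recursion spends more compute on the same task.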
arXiv cs.LG
TIER_1·Zheng Zhang, Cuong C. Nguyen, Kevin Wells, Gustavo Carneiro·
arXiv:2605.06028v1 Announce Type: new Abstract: The rapid development of large language models (LLMs) has motivated research on decision-making in multi-agent systems, where multiple agents collaborate to achieve shared objectives. Existing aggregation approaches, such as voting …
arXiv:2605.05802v1 Announce Type: new Abstract: Group-relative RL training (GRPO) samples a small group of parallel rollouts for every training prompt and uses their within-group reward spread to compute per-trajectory advantages. In agentic environments each rollout is a long mu…
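The group-relative advantage computation this abstract builds on follows directly from its description: sample a group of parallel rollouts for one prompt and standardize each rollout's reward against the group. A minimal sketch; the zero-spread fallback is a common convention, assumed here rather than taken from the paper.

```python
import statistics

def group_relative_advantages(rewards):
    """Per-trajectory advantages from within-group reward spread: each of a
    prompt's parallel rollouts is scored against the group's mean and
    standard deviation, so no learned value function is needed."""
    mean = statistics.mean(rewards)
    spread = statistics.pstdev(rewards)
    if spread == 0:
        # All rollouts got the same reward: the group carries no signal.
        return [0.0] * len(rewards)
    return [(r - mean) / spread for r in rewards]

# Four rollouts for one prompt, two successes and two failures
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

The abstract's concern is that in agentic environments each rollout is a long multi-step trajectory, so this single scalar advantage is spread over many actions, which is exactly the coarse credit assignment the paper targets.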
arXiv:2505.00753v5 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability…
arXiv:2511.02230v4 Announce Type: replace-cross Abstract: KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, whi…
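The eviction mismatch this abstract describes can be sketched in a few lines: a cache that treats "request finished" as "safe to evict" throws away exactly the prefix a multi-turn agent comes back for. The class below illustrates one alternative that pins active sessions and reclaims finished ones lazily; the class name and policy are assumptions for illustration, since the abstract truncates before giving the paper's actual design.

```python
import time
from collections import OrderedDict

class SessionAwareKVCache:
    """Finished sessions keep their KV cache and are reclaimed only when
    capacity forces an eviction; active sessions are never evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # session_id -> finish time (None while active)

    def finish(self, session_id):
        """Mark a session's turn as done, but keep its cache around."""
        self.entries[session_id] = time.monotonic()

    def admit(self, session_id):
        """Return True on a cache hit (a returning session reuses its prefix)."""
        if session_id in self.entries:
            self.entries.move_to_end(session_id)
            self.entries[session_id] = None
            return True
        if len(self.entries) >= self.capacity:
            self._evict_oldest_finished()
        self.entries[session_id] = None
        return False

    def _evict_oldest_finished(self):
        # Only finished sessions are eviction candidates in this sketch.
        for sid, finished_at in self.entries.items():
            if finished_at is not None:
                del self.entries[sid]
                return
```

Under an evict-on-finish policy the second `admit` for a returning session would always miss; here it hits as long as capacity pressure has not reclaimed the entry.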
arXiv:2502.10148v3 Announce Type: replace Abstract: Cooperative multi-agent reinforcement learning (MARL) struggles with sample efficiency, interpretability, and generalization. While Large Language Models (LLMs) offer powerful planning capabilities, their application has been ha…
arXiv:2605.03604v1 Announce Type: cross Abstract: This paper asks whether large language models (LLMs) can be used to study the strategic foundations of conflict and cooperation. I introduce LLMs as experimental subjects in a repeated security dilemma and evaluate whether they re…
arXiv cs.AI
TIER_1·Andrea Iannoli, Lorenzo Gigli, Luca Sciullo, Angelo Trotta, Marco Di Felice·
arXiv:2605.03788v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm management remains challenging due to heterogeneous interfaces, limited …
arXiv:2605.03034v1 Announce Type: new Abstract: Agentic systems involved in high-stake decision-making under adversarial pressure need formal guarantees not offered by existing approaches. Motivated by the operational needs of security operations centers (SOCs) that must configur…
While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on the joint strategies of all agents. In multi-agent games, the non-stationarity of other agents brings significant challenges on the evaluatio…
arXiv cs.LG
TIER_1·Jackie Baek, Yaopeng Fu, Will Ma, Tianyi Peng·
arXiv:2602.12631v2 Announce Type: replace-cross Abstract: Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modelin…
arXiv cs.LG
TIER_1·Maksym Nechepurenko, Pavel Shuvalov·
arXiv:2605.03310v1 Announce Type: cross Abstract: Multi-agent LLM systems fail in production at rates between 41% and 87%, mostly due to coordination defects rather than base-model capability. Existing responses split between cataloguing failure modes empirically and shipping dec…
arXiv:2605.02911v1 Announce Type: new Abstract: Future sixth-generation (6G) mobile networks are envisioned to be equipped with a diverse set of powerful, yet highly specialized, optimization experts. Such a promising vision is concurrently expected to give rise to the need for s…
arXiv cs.AI
TIER_1·Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato·
arXiv:2601.21972v4 Announce Type: replace Abstract: Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution…
arXiv cs.AI
TIER_1·Jose Manuel de la Chica, Juan Manuel Vera, Jairo Rodríguez·
arXiv:2605.02463v1 Announce Type: cross Abstract: Multi-agent LLM systems are increasingly used to solve complex tasks through decomposition, debate, specialization, and ensemble reasoning. However, these systems are usually evaluated in terms of robustness: whether performance i…
arXiv cs.AI
TIER_1·Vicente Pelechanoa, Antoni Mestre, Manoli Albert, Miriam Gil·
arXiv:2605.02832v1 Announce Type: new Abstract: Deciding how to distribute work between humans and AI systems is a central challenge in organisational design. Most approaches treat this as a binary choice, yet the operational reality is richer: humans and AI routinely share tasks…
arXiv:2605.02289v1 Announce Type: new Abstract: Engineering problem solving is central to real-world decision-making, requiring mathematical formulations that not only represent complex problems but also produce feasible solutions under data and physical constraints. Unlike mathe…
arXiv:2605.01879v1 Announce Type: new Abstract: The challenge of engineering autonomous agents capable of navigating the stochastic and adversarial nature of the physical world has historically resided at the intersection of symbolic logic and control theory. Traditional multi-ag…
arXiv:2605.01457v1 Announce Type: new Abstract: Generative models have emerged as a major paradigm for offline multi-agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few-step accelerations either distill a joint teacher in…
arXiv:2603.00977v2 Announce Type: replace-cross Abstract: Large language model (LLM) agents have recently demonstrated strong capabilities in interactive decision-making, yet they remain fundamentally limited in long-horizon tasks that require structured planning and reliable exe…
arXiv:2605.02801v1 Announce Type: new Abstract: As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, an…
arXiv:2510.08804v3 Announce Type: replace Abstract: We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, interconnected with…
arXiv cs.LG
TIER_1·Wenyi Wu, Sibo Zhu, Kun Zhou, Biwei Huang·
arXiv:2605.02168v1 Announce Type: cross Abstract: Language model (LM)-based agents have demonstrated promising capabilities in automating complex tasks from natural language instructions, yet they continue to struggle with long-horizon planning and reasoning. To address this, we …
arXiv:2605.02063v1 Announce Type: cross Abstract: We present Coopetition-Gym v1, a benchmark platform for mixed-motive multi-agent reinforcement learning under strategic coopetition. The platform comprises twenty environments organized into four mechanism classes that correspond …
arXiv:2605.01347v1 Announce Type: new Abstract: On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the err…
arXiv cs.LG
TIER_1·Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood, Hongtai Wei, Sudeep Das, Danny Nightingale, Meg Watson, Charles Pollnow V·
arXiv:2603.03565v2 Announce Type: replace-cross Abstract: Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to o…
arXiv cs.LG
TIER_1·Chunlei Meng, Pengbin Feng, Rong Fu, Hoi Leong Lee, Xiaojing Du, Zhaolu Kang, Zeyu Zhang, Weilin Zhou, Chun Ouyang, Zhongxue Gan·
arXiv:2605.00370v1 Announce Type: new Abstract: Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: modality dominance, where opt…
arXiv:2604.27699v1 Announce Type: new Abstract: Current embodied agents are often limited to passive instruction-following or reactive need-satisfaction, lacking a stable, high-order value framework essential for long-term, self-directed behavior and resolving motivational confli…
arXiv cs.AI
TIER_1·Giuseppe Arbore, Andrea Sillano, Luigi De Russis·
arXiv:2604.27882v1 Announce Type: new Abstract: Recent advances in agentic AI are shifting automation from discrete tools to proactive multi-agent systems that coordinate multi-specialized capabilities behind unified interfaces. However, today's agent systems typically rely on ha…
arXiv cs.AI
TIER_1·Junan Hu, Jian Liu, Jingxiang Lai, Jiarui Hu, Yiwei Sheng, Shuang Chen, Jian Li, Dazhao Du, Song Guo·
arXiv:2604.27955v1 Announce Type: new Abstract: Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit ass…
arXiv:2604.28043v1 Announce Type: new Abstract: We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains. Unlike ad-hoc trial-and-error approaches, CARE specifies behavior, groun…
arXiv:2604.27311v1 Announce Type: cross Abstract: The advent of Large Language Models (LLMs) has significantly transformed tasks across Software Engineering. In the context of Business Process Management, LLMs are now being explored as tools to derive process models directly from…
arXiv:2604.27725v1 Announce Type: cross Abstract: A long-standing challenge in economics lies not in the lack of intuition, but in the difficulty of translating intuitive insights into verifiable research. To address this challenge, we introduce AgentEconomist, an end-to-end inte…
arXiv:2510.05192v2 Announce Type: replace-cross Abstract: When AI agents operating with access to sensitive information encounter a conflict between completing an assigned task and following rules or ethical constraints, they can resort to unsanctioned behaviour. Existing inferen…
arXiv:2604.27616v1 Announce Type: new Abstract: People commonly leverage structured content to accelerate knowledge acquisition and research problem solving. Among these, roadmaps guide researchers through hierarchical subtasks to solve complex research problems step by step. Des…
arXiv:2604.26963v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed as the execution core of autonomous agents rather than as standalone text generators. Agentic workloads induce a temporal shift from single-turn inference to multi-turn LLM-to…
arXiv cs.AI
TIER_1·Anh Ta, Junjie Zhu, Shahin Shayandeh·
arXiv:2604.27233v1 Announce Type: new Abstract: Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors…
arXiv:2604.27151v1 Announce Type: new Abstract: Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite re…
arXiv cs.AI
TIER_1·Benedikt Bollig, Matthias Függer, Thomas Nowak·
arXiv:2604.17612v2 Announce Type: replace-cross Abstract: Multi-agent systems built on large language models (LLMs) are difficult to reason about. Coordination errors such as deadlocks or type-mismatched messages are often hard to detect through testing. We introduce a domain-spe…
arXiv:2510.05174v4 Announce Type: replace-cross Abstract: When are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We introduce an information-theoretic framework to test -- in a purely data-driven way …
arXiv cs.CL
TIER_1·Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng, Ruibin Yuan, Wei Zhang, Riza Batista-Navarro, Jian Yang, Chenghua Lin·
arXiv:2604.19572v2 Announce Type: replace Abstract: As terminal agents scale to long-horizon, multi-turn workflows, a key bottleneck is not merely limited context length, but the accumulation of noisy terminal observations in the interaction history. Retaining raw observations pr…
arXiv cs.AI
TIER_1·Tom Liptay, Dan Schwarz, Rafael Poyiadzi, Jack Wildman, Nikos I. Bosse·
arXiv:2604.26106v1 Announce Type: new Abstract: Forecasting benchmarks produce accuracy leaderboards but little insight into why some forecasters are more accurate than others. We introduce Bench to the Future 2 (BTF-2), 1,417 pastcasting questions with a frozen 15M-document rese…
arXiv:2604.26522v1 Announce Type: new Abstract: Large Language Model (LLM)-based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environments. This work introduces AGEL-Comp, a neuro-symbolic AI agent architecture designe…
arXiv:2604.26733v1 Announce Type: new Abstract: Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents th…
arXiv cs.AI
TIER_1·Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang, Yufei Ma, Junpeng Zhuang, Ben Chen, Shuo Yang, Hongen Wan, Yao Wu, Chenyi Lei, Xiao Liang·
arXiv:2604.26805v1 Announce Type: new Abstract: Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are…
arXiv:2604.26561v1 Announce Type: cross Abstract: Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial consensus: evaluator agents converge on the same option regardless of their assig…
arXiv cs.AI
TIER_1·Junxing Hu, Tianlong Li, Lei Yu, Ai Han·
arXiv:2604.25602v2 Announce Type: replace Abstract: Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open-source framework…
arXiv:2510.14438v2 Announce Type: replace Abstract: The hallmark of Deep Research agents lies in compositional reasoning, the capacity to aggregate distributed, heterogeneous information into coherent logical insights. However, current agentic systems are often retrieval-heavy bu…
arXiv cs.CL
TIER_1·Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu, Shizhe Diao, Jindong Jiang, Hanghang Tong, Tong Zhang, Markus J. Buehler, Jingrui He, James Zou·
arXiv:2604.25917v1 Announce Type: cross Abstract: Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend such scaling principle from a single model to mul…
arXiv cs.CL
TIER_1·Abigail O'Neill, Alan Zhu, Mihran Miroyan, Narges Norouzi, Joseph E. Gonzalez·
arXiv:2604.25088v1 Announce Type: cross Abstract: Language Model (LM)-based agents remain largely untested in mixed-motive settings where agents must leverage short-term cooperation for long-term competitive goals (e.g., multi-party politics). We introduce Cooperate to Compete (C…
arXiv:2604.25040v1 Announce Type: cross Abstract: We propose a per-task leverage ratio for human-agent collaboration: human work displaced by an agent, divided by the human time required to specify the task, resolve mid-run interrupts, and review the result. The denominator decom…
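The leverage ratio this abstract proposes is simple enough to write down directly. Parameter names below are illustrative, and the abstract truncates before specifying how the denominator's components are actually measured.

```python
def leverage_ratio(minutes_displaced, minutes_to_specify,
                   minutes_on_interrupts, minutes_to_review):
    """Per-task leverage: human work displaced by the agent, divided by the
    human time spent specifying the task, resolving mid-run interrupts, and
    reviewing the result. Values above 1 mean the agent saved net human time."""
    human_cost = minutes_to_specify + minutes_on_interrupts + minutes_to_review
    return minutes_displaced / human_cost

# An agent displaces two hours of work for 10 + 5 + 15 minutes of human effort
print(leverage_ratio(120, 10, 5, 15))  # → 4.0
```

The useful property of this framing is that a capable agent can still have leverage below 1 if it is expensive to specify, interrupt-heavy, or hard to review.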
arXiv cs.CL
TIER_1·Yunsu Kim, Kaden Uhlig, Joern Wuebker·
arXiv:2604.24929v1 Announce Type: new Abstract: Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benc…
arXiv cs.LG
TIER_1·Shiyi Du, Jiayuan Liu, Weihua Du, Yue Huang, Jiayi Li, Yingtao Luo, Xiangliang Zhang, Vincent Conitzer, Carl Kingsford·
arXiv:2604.25012v1 Announce Type: new Abstract: Automated agentic workflow design currently relies on per-task iterative search, which is computationally prohibitive and fails to reuse structural knowledge across tasks. We observe that optimized workflows converge to a small fami…
arXiv cs.CL
TIER_1·Mohamed Aghzal, Gregory J. Stein, Ziyu Yao·
arXiv:2603.14248v2 Announce Type: replace-cross Abstract: Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering li…
arXiv:2601.22154v2 Announce Type: replace-cross Abstract: Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such fe…
arXiv:2603.25268v2 Announce Type: replace Abstract: We introduce CRAFT, a multi-agent benchmark for evaluating pragmatic communication in large language models under strict partial information. In this setting, multiple agents with complementary but incomplete views must coordina…
Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration…
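Read as pseudocode, the idea in the abstract above is an outer loop over rounds and an inner pass in which each agent revises a shared state. The sketch below is purely illustrative: the numeric state and the refinement functions stand in for model calls and are not from the paper.

```python
# Toy sketch of extending looped/recursive refinement from one model to
# several agents: each round, every agent revises a shared state, and
# extra rounds deepen the computation the way extra loop iterations do
# for a single model. State and refiners are illustrative stand-ins.

def make_agent(step: float):
    # Each agent nudges the shared state toward a target; a stand-in
    # for one agent's model call over latent state.
    def refine(state: float, target: float) -> float:
        return state + step * (target - state)
    return refine

def looped_collaboration(agents, state: float, target: float, rounds: int) -> float:
    for _ in range(rounds):          # outer loop = recursion depth
        for refine in agents:        # inner pass = one turn per agent
            state = refine(state, target)
    return state

agents = [make_agent(0.5), make_agent(0.5)]
# More rounds move the shared state closer to the target, mirroring how
# deeper looping refines a single model's reasoning.
print(looped_collaboration(agents, 0.0, 1.0, rounds=1))  # → 0.75
print(looped_collaboration(agents, 0.0, 1.0, rounds=3))  # → 0.984375
```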
Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open-source framework that enables modular, observable, and evolvable MAS…
arXiv cs.CL
TIER_1·Qiliang Liang, Hansi Wang, Zhong Liang, Yang Liu·
arXiv:2604.24026v1 Announce Type: new Abstract: LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts,…
arXiv:2604.12290v2 Announce Type: replace-cross Abstract: Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through…
arXiv:2604.22879v1 Announce Type: cross Abstract: We identify and formalize a novel security risk: Context-Fragmented Violations (CFVs) - a class of policy breaches where individual agent actions appear locally safe and reasonable, yet collectively violate organizational policies…
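A context-fragmented violation is easiest to see in code: each agent's action passes a local check, yet the union of actions breaks a global policy. The policy below (the same record must not be both exported and emailed) is invented for demonstration and does not come from the paper.

```python
# Hypothetical illustration of a Context-Fragmented Violation (CFV):
# every per-agent action looks safe in isolation, but the combined
# action log breaches an organization-wide rule.

def locally_safe(action: dict) -> bool:
    # Each agent only checks its own action against a per-action rule;
    # both operations are individually permitted.
    return action["op"] in {"export", "email"}

def globally_safe(actions: list[dict]) -> bool:
    # Global rule (made up for this sketch): the same record must not
    # be exported AND emailed across the whole multi-agent trace.
    exported = {a["record"] for a in actions if a["op"] == "export"}
    emailed = {a["record"] for a in actions if a["op"] == "email"}
    return not (exported & emailed)

log = [
    {"agent": "A", "op": "export", "record": "cust-42"},
    {"agent": "B", "op": "email", "record": "cust-42"},
]
print(all(locally_safe(a) for a in log))  # True: each step passes local review
print(globally_safe(log))                 # False: together they violate policy
```

Catching this requires evaluating the joined trace, not each agent's context alone, which is the gap the abstract formalizes.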
arXiv:2604.17025v2 Announce Type: replace-cross Abstract: Large Language Models produce a controllability gap in safety-critical engineering: even low rates of undetected constraint violations render a system undeployable. Current orchestration paradigms suffer from sycophantic c…
arXiv:2604.23049v1 Announce Type: new Abstract: AI agents are increasingly deployed to execute tasks and make decisions within agentic workflows, introducing new requirements for safe and controlled autonomy. Prior work has established the importance of human oversight for ensuri…
arXiv:2604.23194v1 Announce Type: new Abstract: Large language model-based agents have recently emerged as powerful approaches for solving dynamic and multi-step tasks. Most existing agents employ planning mechanisms to guide long-term actions in dynamic environments. However, cu…
arXiv:2604.23646v1 Announce Type: new Abstract: Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods…
arXiv cs.AI
TIER_1·Boqin Yuan, Renchu Song, Yue Su, Sen Yang, Jing Qin·
arXiv:2604.23853v1 Announce Type: new Abstract: Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removi…
arXiv cs.AI
TIER_1·Zavier Ndum Ndum, Jian Tao, John Ford, Mansung Yim, Yang Liu·
arXiv:2604.22755v1 Announce Type: cross Abstract: Reliable decision support in nuclear engineering requires traceable, domain-grounded knowledge retrieval, yet safety and risk analysis workflows remain hampered by fragmented documentation and hallucination when using pre-trained la…
arXiv:2604.23080v1 Announce Type: cross Abstract: Large-scale agentic systems run on distributed infrastructures where many software agents share physical hosts and are discovered via peer-to-peer mechanisms. Discovery must handle node-level churn from failures and host departure…
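One common way to make discovery tolerate node-level churn is a heartbeat-plus-TTL registry: peers periodically check in, and any peer whose last heartbeat is older than a time-to-live is treated as departed. This is a generic pattern sketched under that assumption, not the paper's actual protocol.

```python
# Illustrative churn-tolerant discovery: peers heartbeat into a
# registry; stale entries age out after a TTL. Generic pattern only,
# not the mechanism proposed in the paper.

TTL = 10.0  # seconds a peer may go silent before being considered gone

class Registry:
    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, peer: str, now: float) -> None:
        # Record (or refresh) the peer's liveness timestamp.
        self.last_seen[peer] = now

    def alive(self, now: float) -> list[str]:
        # Peers are discoverable only while their heartbeat is fresh.
        return sorted(p for p, t in self.last_seen.items() if now - t <= TTL)

reg = Registry()
reg.heartbeat("host-a", now=0.0)
reg.heartbeat("host-b", now=5.0)
print(reg.alive(now=12.0))  # → ['host-b']  (host-a churned out)
```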
arXiv:2603.25158v4 Announce Type: replace Abstract: Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields f…
arXiv cs.AI
TIER_1·Zhuohui Zhang, Bin Cheng, Bin He·
arXiv:2604.23557v1 Announce Type: cross Abstract: Building scalable and reusable multi-agent decision policies from offline datasets remains a challenge in offline multi-agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action sp…
arXiv:2604.14989v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have sparked growing interest in automatic RTL optimization for better performance, power, and area (PPA). However, existing methods are still far from realistic RTL optimization. …
arXiv cs.AI
TIER_1·Yifan Zhang, Jianmin Ye, Jiahao Yang, Xi Wang·
arXiv:2604.24218v1 Announce Type: cross Abstract: As the complexity of System-on-Chip (SoC) designs grows, the shift-left paradigm necessitates the rapid development of high-fidelity reference models (typically written in SystemC) for early architecture exploration and verificati…
Language Model (LM)-based agents remain largely untested in mixed-motive settings where agents must leverage short-term cooperation for long-term competitive goals (e.g., multi-party politics). We introduce Cooperate to Compete (C2C), a multi-agent environment where players can e…
Rapid advances in Large Language Models (LLMs) create new opportunities by enabling efficient exploration of broad, complex design spaces. This is particularly valuable in computer architecture, where performance depends on microarchitectural designs and policies drawn from vast …
We propose a per-task leverage ratio for human-agent collaboration: human work displaced by an agent, divided by the human time required to specify the task, resolve mid-run interrupts, and review the result. The denominator decomposes into three channels through which a conserve…
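The ratio described above is simple enough to state in code: displaced human work over total human time invested across the three denominator channels. The field names below are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    # All durations in hours; field names are illustrative stand-ins.
    human_work_displaced: float   # human effort the agent's output replaces
    specify_time: float           # time spent specifying the task
    interrupt_time: float         # time resolving mid-run interrupts
    review_time: float            # time reviewing the final result

def leverage_ratio(t: TaskRecord) -> float:
    """Per-task leverage: displaced work over human time invested."""
    invested = t.specify_time + t.interrupt_time + t.review_time
    return t.human_work_displaced / invested

# A task displacing 4 hours of work for 1 hour of total human
# involvement has leverage 4.0; below 1.0, delegation costs more
# human time than it saves.
print(leverage_ratio(TaskRecord(4.0, 0.5, 0.25, 0.25)))  # → 4.0
```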
Automated agentic workflow design currently relies on per-task iterative search, which is computationally prohibitive and fails to reuse structural knowledge across tasks. We observe that optimized workflows converge to a small family of domain-specific topologies, suggesting tha…
Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment…
As the complexity of System-on-Chip (SoC) designs grows, the shift-left paradigm necessitates the rapid development of high-fidelity reference models (typically written in SystemC) for early architecture exploration and verification. While Large Language Models (LLMs) show promis…
Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first …
LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structur…
arXiv:2604.20133v2 Announce Type: replace Abstract: This paper proposes EvoAgent - an evolvable large language model (LLM) agent framework that integrates structured skill learning with a hierarchical sub-agent delegation mechanism. EvoAgent models skills as multi-file structured…
arXiv cs.AI
TIER_1·Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fen·
arXiv:2604.22748v1 Announce Type: new Abstract: As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with…
arXiv:2604.01608v3 Announce Type: replace Abstract: Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agen…
arXiv cs.AI
TIER_1·Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu, Meng Fang, Weilin Luo, Jun Wang·
arXiv:2604.22446v1 Announce Type: new Abstract: Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. W…
On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we i…
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictiv…
Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a p…
Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe explorati…
How context propagation, supervisor loops, tool calls, memory, and observability quietly drive up the cost of production agentic systems. Multi-agent AI systems are quickly becoming a default pattern for building advanced LLM applications. Instead of relying on one mod…
In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal, a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.…
Why This Pattern Matters: Most LangGraph tutorials stop at single agents. A single agent that does research, writes code, and formats a report is juggling three jobs, and as the task list grows, the prompt grows with it. The supervisor pattern solves this: one orche…
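The supervisor pattern the excerpt describes can be reduced to a framework-free control flow: one orchestrator routes each subtask to a specialist and collects the results. This is a minimal sketch, not LangGraph's actual API; the worker functions stand in for LLM-backed agents.

```python
# Framework-free sketch of the supervisor pattern: an orchestrator
# dispatches subtasks to specialist workers so no single prompt has
# to juggle research, coding, and writing at once.

def researcher(task: str) -> str:
    return f"notes on {task}"      # stand-in for an LLM research agent

def coder(task: str) -> str:
    return f"code for {task}"      # stand-in for an LLM coding agent

def writer(task: str) -> str:
    return f"report on {task}"     # stand-in for an LLM writing agent

WORKERS = {"research": researcher, "code": coder, "write": writer}

def supervisor(subtasks: list[tuple[str, str]]) -> list[str]:
    """Route each (role, task) pair to its specialist and gather outputs."""
    return [WORKERS[role](task) for role, task in subtasks]

plan = [("research", "agent scaling"),
        ("code", "benchmark harness"),
        ("write", "summary")]
print(supervisor(plan))
```

A real supervisor would also decide the plan itself and loop until the task is done; the fixed plan here keeps the routing step visible.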
TL;DR: Stanford (Tran & Kiela, arXiv 2604.02460) tested single-agent vs multi-agent systems with identical thinking-token budgets. A single agent wins on both accuracy and compute, across three model families. The mechanism is …
Here's the uncomfortable truth about single-agent AI systems: they don't scale. Not because the models aren't capable, but because you're asking one entity to simultaneously plan, execute, research, verify, and synthesize, often in a single context window that fills up faster…
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/multi-agent-systems.html). For the full version with working code examples and related articles, visit the original post.…
🧠 Multi-agent orchestration is a new feature of #Claude's Managed Agents. 🤖 A coordinator agent can delegate tasks to multiple independent agents. 👉 Details: https://www.linkedin.com/posts/alessiopomaro_claude-ai-ai-activity-7458473224192962560-Yr4O ___ ✉️ 𝗦𝗲…
For a long time, we've thought of AI as a "chatbot." But if you step back and look from a systems architecture perspective, you'll find that a truly mature AI agent looks more like a new kind of personal computer, one that lives on your device. It has: …
Agentic Systems Notes and resources on building and operating agentic AI systems, covering orchestration frameworks, task routing, memory, and evaluation approaches that extend baseline LLM capabi(...) #agents #ai #orchestration https://taoofmac.com/space/ai/agentic?utm_cont…
OpenClaw Ecosystem OpenClaw is a self-hosted personal AI assistant you run on your own devices, with a gateway control plane that connects to the chat channels you already use (WhatsApp, Telegram, Sl(...) #agentic #ai #assistants #openclaw https://taoofmac.com/space/ai/agent…
2026-05-01 | 🤖 The Digital Agora: Negotiating Reality in Multi-Agent Swarms 🤖 #AI Q: 🤖 AI negotiate? 🤖 Multi-Agent Systems | 🤝 Algorithmic Negotiation | ⚖️ Game Theory | 🕸️ Distributed Systems https://bagrounds.org/auto-blog-zero/2026-05-01-the-digital-agora-negotiating-realit…