Abstract

Information-sharing structures critically influence agent interactions and collective outcomes in multiagent systems, especially when individual goals are only partially aligned with overarching system objectives. Adaptively configuring these structures is challenging due to the vast number of possible network topologies. To address this, we propose variational autoencoder-reinforcement learning (VAE-RL), a framework that integrates variational autoencoders and deep reinforcement learning to govern network structure in a tractable, adaptive manner. By encoding the discrete networked action space into a continuous latent space and decoding the controlled latent space to reconstruct the network topology, VAE-RL enables a policy-gradient algorithm to selectively adjust communication networks, improving system-level performance while accounting for agent-level incentives. Experimental evaluations in a multiagent environment illustrate the framework’s ability to outperform conventional discrete action baselines, revealing interpretable strategies in which links are initially densified for effective information flow and later pruned to guide agents toward cooperative behavior. This approach offers a scalable, automated solution for orchestrating multiagent interactions in engineering contexts, where adaptive network design is crucial for aligning individual agent decisions with global design goals.

1 Introduction

Modern complex systems increasingly rely on the coordinated actions of multiple autonomous agents, each equipped with the ability to sense, decide, and act independently. These multiagent systems have transformed numerous domains including robotics [1–4], supply chain management [5–7], communication networks [8–10], transportation systems [11–13], and most recently, systems involving multiple large language model agents [14–16]. A fundamental challenge in these systems lies in their decentralized nature—while each agent operates based on its individual objectives and local information, the overall system must achieve broader collective goals. System managers face the complex task of steering these autonomous agents toward system-level objectives through strategic allocation of limited resources, such as communication channels or energy.

At the core of these decentralized systems lies the crucial element of interagent communication and interaction, which distinguishes multiagent systems from independent single-agent scenarios. These interactions can be modeled as networks, with resources represented by network links. Research has shown that different network structures significantly impact system-level metrics including performance, resource utilization, and cooperation levels (see Sec. 2). This raises a key question: How can we achieve desired system-level outcomes by strategically modifying these network structures?

The complexity of intervening at the network structure level is compounded by several factors inherent to decentralized multiagent systems. The environment’s dynamic nature, combined with agents’ continuous learning and adaptation [17], makes developing effective coordination policies particularly challenging. The heterogeneity of agents—variations in properties, policies, and learning rates—further increases system complexity [18]. Additionally, the decentralized nature of these systems introduces partial observability, where both managers and agents must operate with limited information about the true environmental state [19].

While deep reinforcement learning [20–26] offers promising approaches for learning adaptable policies in dynamic, partially observable environments [27], its application to network structure control faces a significant challenge: the vast discrete action space of possible network configurations. This space grows as $O(2^{N(N-1)/2})$ for a network with $N$ nodes, quickly becoming intractable for efficient policy learning as networks expand.
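For concreteness, the size of this action space can be computed directly from the formula above; the snippet below (illustrative only) prints the counts for the 4- and 10-agent systems studied later in the paper.

```python
# Number of possible undirected communication networks among N agents:
# one binary choice per unordered pair of agents, i.e., 2^(N(N-1)/2).
def num_topologies(n_agents: int) -> int:
    n_links = n_agents * (n_agents - 1) // 2
    return 2 ** n_links

print(num_topologies(4))   # 64 candidate networks for 4 agents
print(num_topologies(10))  # 35,184,372,088,832 (about 3.5e13) for 10 agents
```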

To address these challenges, we propose variational autoencoder-reinforcement learning (VAE-RL), a novel framework combining a VAE [28] with deep reinforcement learning. VAEs, comprising an encoder and a decoder, can embed data distributions into latent spaces while maintaining the ability to reconstruct the original data; the general framework is shown in Fig. 1. In our framework, we treat potential network configurations as a dataset, using a VAE to transform the vast discrete action space of network structures into a manageable continuous latent space. This enables the application of continuous-action deep reinforcement learning (DRL) algorithms such as deep deterministic policy gradient (DDPG), with the pretrained VAE decoder reconstructing the chosen latent actions into network structures.

Fig. 1
The overview of the VAE-RL framework. Initially, the VAE is trained in a standard manner to learn the encoder and decoder, establishing a connection between the original communication networks and the continuous latent space. Subsequently, the parameters of the decoder are fixed and utilized to decode the latent space back to reconstructed communication networks. Finally, the manager is trained using DDPG to learn policies within the continuous latent space; the pretrained decoder from the VAE reconstructs the communication network from the chosen latent action, which is then assigned to the agents.

An alternative training scheme involves conducting the training of the DDPG and the decoder simultaneously, rather than initially training a VAE and then utilizing its pretrained decoder. Theoretically, this scheme is feasible as the backpropagation of gradients can extend from the decoder to DDPG. However, several practical issues may arise. Studies have indicated that DDPG can experience stability and convergence issues, even in relatively simple control tasks [29,30]. Our environment adds further complexities, containing multiple agents at the lower level where the manager must learn policies that balance agent performance with communication resource usage to enhance system performance. Additionally, the network-based action space introduces greater complexity and uncertainty, which are challenging for DDPG to manage in a stable manner. Integrating a nonfixed decoder into DDPG and training them concurrently through backpropagation of gradients would compound these complexities, potentially making it extremely challenging or even intractable to learn meaningful policies.

Moreover, this alternative training scheme could hinder the learning of a useful decoder within the framework, as it would sacrifice the inherent advantages of VAEs. These include enhanced feature learning and model performance from simultaneous learning of the encoder and decoder, and the regulation of the latent space to fit a normal distribution—advantages that have been demonstrated to be effective in various tasks.

These considerations strongly motivate our decision to utilize the decoder from a pretrained VAE. By adopting this approach, our framework leverages the inherent benefits of the VAE model while simultaneously mitigating the complexities and uncertainties associated with the training process. This strategy not only preserves the robust features of the VAE but also enhances the overall stability and effectiveness of the framework.

Our work builds upon and extends our previous Two-tier framework for Systems of Systems (SoS) [27], which introduced the concept of SoS workers (Tier I) and SoS managers (Tier II). While the previous framework was limited to homogeneous resource allocation, our current study enhances flexibility and efficiency by replacing the SoS manager with the VAE-RL framework. In this enhanced model, the manager controls communication network structures, with SoS workers sharing information through network connections. More connections indicate higher communication resource consumption, requiring the VAE-RL-equipped manager to balance system performance with resource usage through dynamic network modifications.

We evaluate our framework in a modified OpenAI Gym particle environment [31], testing both homogeneous and heterogeneous scenarios across various system scales. Our results demonstrate that VAE-RL outperforms baseline methods in optimizing the weighted sum of system-level performance and network-related resource usage, while revealing meaningful patterns for effective multiagent system management.

In summary, this paper makes three key contributions. First, it presents a new framework—VAE-RL—for managing large discrete action spaces in multiagent systems by converting network topologies into a continuous latent space using a VAE. This approach allows a DRL algorithm to effectively learn system-level policies for dynamically allocating costly communication resources, circumventing the intractability of exhaustive discrete network search. Second, the article demonstrates how VAE-RL can be incorporated into a hierarchical SoS governance structure: in this setup, the manager strategically assigns communication links among agents, striking a balance between improved task performance and reduced resource consumption. Finally, the proposed method is extensively tested on a modified OpenAI particle environment under both homogeneous and heterogeneous scenarios. Empirical results show that VAE-RL consistently outperforms conventional DRL baselines for network assignments, and further analysis of the learned policies reveals interpretable insights into how the optimal networks evolve as agents progress through a task.

2 Background

In this section, we review literature relevant to our research across three key areas. First, we explore network structure governance in various domains, which aligns with our paper’s primary focus. Second, we examine generative models for network-based tasks, which relates to our approach of transforming discrete action spaces into continuous latent spaces. Finally, we review our previous two-tier framework, which serves as the foundation for evaluating our proposed VAE-RL method.

2.1 Network Structure Governance.

Network structure governance—the ability to modify network configurations to influence agent behavior and system performance—has emerged as a critical research area across multiple domains. We examine its applications in three key areas: communication systems, public health policy, and socio-technical systems.

In communication systems, researchers have developed various approaches to network governance. Nakamura et al. [32] address telecommunication service cost reduction while maintaining network stability under varying electricity prices. Bijami and Farsangi [33] introduce a distributed networked control scheme for stabilizing large-scale interconnected systems, specifically addressing communication challenges like random delays and packet dropouts. DRL applications in this domain include Chu et al.’s [34] scalable, decentralized multiagent reinforcement learning (MARL) algorithm for adaptive traffic signal control, and Shibata et al.’s [35] exploration of MARL for multiagent cooperative transport. Network structures have also been studied in social and economic contexts, examining communication benefits, costs, and decentralized strategic decisions [36,37]. These studies have identified structures that balance network stability and efficiency [38,39].

Public health policy has increasingly embraced network governance approaches. Dynamic complex networks have proven effective in modeling micro-social interactions for public health [40], urban policies [41], and platform economies [42]. During the COVID-19 pandemic, Mazzitello et al. [43] developed network-based pool testing strategies for low-resource settings, while Robins et al. [44] analyzed network interventions like social distancing. Siciliano and Whetsell [45] contributed a framework for network control in public administration to promote behavioral change.

In socio-technical systems, network governance has facilitated various advances. Ion and Pătraşcu [46] developed a scalable algorithm for self-organizing control systems using networked multiagent systems. Ellis et al. [47] studied network impacts on self-management support interventions, while Joseph et al. [48] proposed a complex systems approach to model social media platform interventions. Network structure has proven crucial in promoting prosocial behaviors such as cooperation [49–51], coordination [52], and fairness [53], particularly in social dilemma situations.

While network structure control has garnered significant attention, most existing approaches rely on traditional methods like hard-coded policies or heuristics, which struggle with complex, evolving, and partially observable environments. Deep reinforcement learning offers greater flexibility but often focuses on decentralized approaches, potentially compromising system-level optimization. Our approach addresses these limitations through a centralized method that tackles the curse of dimensionality.

2.2 Generative Models for Networks.

Generative models have revolutionized various fields, including images [54,55], voice [56–58], and text [59,60]. In network modeling, Kipf and Welling [61] introduced the variational graph autoencoder, enabling unsupervised learning on graph-structured data. Li et al. [62] developed an approach using graph neural networks to express probabilistic dependencies among nodes and edges. Domain-specific applications of network generation include DeepLigBuilder by Li et al. [63] for drug design, Zhao et al.’s [64] physics-aware crystal structure generation, and Singh et al.’s [65] TripletFit technique for network classification. Comprehensive surveys [66–68] provide extensive coverage of network generation methods. Our work takes a novel approach by integrating generative models with deep reinforcement learning to enable centralized network structure control while addressing the curse of dimensionality.

2.3 Overview of the Two-Tier Reinforcement Learning Governance Framework.

Our previous work [27] introduced a two-tier framework integrating Reinforcement Learning to optimize resource distribution in SoS. The framework operates across two training levels. In Tier I, individual SoS components undergo decentralized-training-decentralized-execution using the DDPG algorithm [20]. This tier develops essential skills based on partial observations, after which policies become fixed. Tier II focuses on the centralized resource manager’s allocation decisions, including resource type and timing, using deep Q learning to maximize operational effectiveness. While our previous work limited the system manager to choosing between empty or fully connected communication networks, our current research expands these capabilities. The manager can now assign any possible communication network configuration at each time-step, enabling more adaptive and tailored strategies.

3 Methodology

In this section, we provide the background knowledge underlying our approach and explain the proposed method. Since our framework integrates several distinct algorithms, we begin with subsections introducing each of these algorithms: how they are defined, trained, and executed, as well as their distinctive features and functions. After these foundational subsections, we delve into the specifics of our proposed framework, explaining how it combines these algorithms and the rationale behind the integration. By structuring the presentation in this manner, with a general description of the framework’s core idea in the Introduction, we balance high-level conceptual explanations with detailed descriptions of the individual components.

First, we briefly introduce the partially observable Markov decision process (POMDP), the fundamental model underlying DRL that also describes our environment. Then, we provide a mathematical formulation of the general framework. Next, we discuss DDPG, the DRL algorithm we employ for managing continuous latent action spaces within our framework. We also explain the branching deep Q-network (BDQN), an important baseline method in our study. Subsequently, we introduce our methodology for training a VAE on networks; this component is crucial for bridging the original vast discrete network-based action space and the low-dimensional continuous latent space. Finally, we present the proposed VAE-RL framework and explain how it integrates the VAE with DDPG to achieve efficient system governance when the manager faces a vast discrete network-based action space.

3.1 Overview of the General Framework.

Our framework consists of three main components: the environment, the agents, and the manager. In addition to modeling each component separately, we also model the interactions between them to capture the dynamics of the system. The manager interacts with the agents through the communication network: it observes the environment and assigns communication links between agents. Agents then share their observations over these links, which influences their decision-making as they work to complete the task. The environment updates its state based on the actions taken by both the manager and the agents. This process iterates, with the manager and agents receiving new observations at each step until the task is completed. A formal mathematical formulation of the framework is introduced below.

Consider a two-layer hierarchical system with one manager and a set of $N$ agents, indexed by $i \in \mathcal{N} = \{1, 2, \ldots, N\}$. The communication network selected by the manager can be represented by an undirected graph $G = (\mathcal{N}, E)$, where $\mathcal{N} = \{1, 2, \ldots, N\}$ denotes the set of agents and the set of edges $E \subseteq \{\{i, j\} \mid i, j \in \mathcal{N}, i \neq j\}$ represents the communication links. Specifically, the network can be encoded by an adjacency matrix $A \in \{0, 1\}^{N \times N}$, with $A_{ij} = 1$ if $\{i, j\} \in E$ and $A_{ij} = 0$ otherwise.

3.1.1 Modeling Environment.

System dynamics: The system transitions to a new state $s_{t+1}$ from the current state $s_t$ based on the agents’ actions $\{a_t^i\}_{i \in \mathcal{N}}$ and the manager’s communication network action $G_t^M$:
$$s_{t+1} = T\big(s_t, \{a_t^i\}_{i \in \mathcal{N}}, G_t^M\big)$$
where $T$ represents the system dynamics function.
Welfare for agents: The welfare of agent $i$ at time $t$ depends on the system state and the agents’ actions:
$$w_t^i = u_i\big(s_t, \{a_t^j\}_{j \in \mathcal{N}}\big)$$
where $u_i$ is the utility function of agent $i$.

3.1.2 Modeling Agents.

Agent’s perceived state: An agent’s perceived observation is influenced by the communication network assigned by the manager, since connected agents share their observations with one another.

Each agent $i$ updates its local information by receiving observations from its neighbors defined by the network $G_t^M$:
$$\tilde{o}_t^i = \Psi\big(o_t^i, \{o_t^j \mid \{i, j\} \in E_t\}\big)$$
where $\Psi(\cdot)$ is a local information fusion function for the agents.
Agent’s action: Agents make decisions based on their policies and the updated observations influenced by the communication network:
$$a_t^i = \pi_i^{AG}(\tilde{o}_t^i)$$
where $\pi_i^{AG}$ represents the policy of agent $i$.
Agent’s objective function: The agents learn policies that optimize their long-term rewards in the multiagent setting:
$$\max_{\pi_i^{AG}} \; \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t u_i\big(s_t, \{a_t^j\}_{j \in \mathcal{N}}\big) \;\Big|\; \pi_{-i}^{AG}\Big]$$
where $\gamma$ is the discount factor and $\pi_{-i}^{AG}$ denotes the policies of the other agents.

3.1.3 Modeling Manager.

Manager’s observation: The manager observes aggregated agent states and system-level metrics:
$$o_t^M = h\big(\bar{o}_t^{\text{agents}}, o_t^{\text{sys}}\big)$$
where $h$ is the observation function combining the aggregated observation from all agents $\bar{o}_t^{\text{agents}}$ and the system-level observation $o_t^{\text{sys}}$.
Manager’s action of communication network: The manager selects the communication links assigned to the agents based on its observation:
$$G_t^M = \pi^M(o_t^M)$$
where $\pi^M$ is the policy of the manager for determining the communication network action $G_t^M$ that influences the agents.
Manager’s objective: The manager optimizes cumulative welfare over time while minimizing its action costs:
$$\max_{\pi^M} \; \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t \Big(\sum_{i \in \mathcal{N}} u_i\big(s_t, \{a_t^j\}_{j \in \mathcal{N}}\big) - C(G_t^M)\Big)\Big]$$
where $C(G_t^M)$ is the cost of the communication network action $G_t^M$ taken by the manager.
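To make the formulation concrete, the sketch below shows one time-step of this interaction loop. It is a minimal illustration under our own assumptions: the names (`manager.select_network`, `agent_policies`, a gym-style `env.step`) are placeholders rather than the paper's actual implementation, and the fusion function simply concatenates neighbors' observations.

```python
import numpy as np

def fuse_observations(obs_list, adjacency):
    """Simple instance of the fusion function Psi: each agent's local observation is
    augmented with the observations of its neighbors in the communication network."""
    n = len(obs_list)
    fused = []
    for i in range(n):
        neighbors = [obs_list[j] for j in range(n) if adjacency[i, j] == 1]
        fused.append(np.concatenate([obs_list[i]] + neighbors))
    return fused

def hierarchical_step(env, manager, agent_policies, obs_list, cost_per_link=0.1):
    """One time-step of the two-layer loop: the manager assigns a network, agents act
    on fused observations, and the environment transitions (gym-style interface)."""
    manager_obs = np.concatenate(obs_list)            # aggregated observation o_t^M
    adjacency = manager.select_network(manager_obs)   # communication network G_t^M
    fused = fuse_observations(obs_list, adjacency)
    actions = [policy(o) for policy, o in zip(agent_policies, fused)]
    next_obs_list, rewards, done, _ = env.step(actions)
    cost = cost_per_link * adjacency.sum() / 2        # each undirected link counted once (illustrative cost)
    manager_reward = sum(rewards) - cost              # welfare minus communication cost
    return next_obs_list, manager_reward, done
```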

3.2 Partially Observable Markov Decision Process.

The environmental dynamics in our study are aptly modeled as a POMDP, as outlined in Ref. [69]. A POMDP is defined by a tuple $\langle S, s_0, A, O, T, R \rangle$, where $S$ is the set of potential states; $s_0 \in S$ is the initial state of the environment; $A$ denotes the set of available actions; $O$ comprises the observations derived from the actual state; $T$ is the transition function, defined as $T: O \times A \to O$; and $R$ is the reward function, $R: O \times A \to \mathbb{R}$. The primary objective within a POMDP framework is to develop an optimal policy $\pi_\theta: O \times A \to [0, 1]$ that maximizes the expected return $\sum_{t=0}^{n} \gamma^t r_t$, where $\gamma$ is the discount factor. We assume the environment has a finite horizon and terminates after $n$ time-steps. The reward $r_t$ is the reward the agent receives at each time-step.

3.3 Deep Deterministic Policy Gradient.

Policy gradient methods, as described in Ref. [70], represent a potent subclass of RL algorithms. The fundamental concept is to directly adjust the policy parameters $\theta$ to optimize the objective function $J(\theta) = \mathbb{E}_{\pi_\theta}[R]$ by ascending the gradient $\nabla_\theta J(\theta)$. The policy gradient can be expressed as $\nabla_\theta J(\theta) = \mathbb{E}_{o, a}\big[\nabla_\theta \log \pi_\theta(a \mid o)\, Q^\pi(o, a)\big]$. Here, $Q^\pi(o, a)$, the action-value function, can be updated using standard Q-learning techniques, which yields the actor-critic family of algorithms. Extending this framework to deterministic policies $\mu_\theta: O \to A$ leads to the formulation of the DDPG algorithm, in which the gradient of the objective becomes $\nabla_\theta J(\theta) = \mathbb{E}_{o}\big[\nabla_\theta \mu_\theta(o)\, \nabla_a Q^\mu(o, a)\big|_{a = \mu_\theta(o)}\big]$. In this context, the deterministic policy $\mu_\theta(o)$ is modeled using a deep neural network parameterized by $\theta$. The DDPG algorithm is particularly suited to continuous action spaces, as it outputs a deterministic action value at each time-step.
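The following PyTorch sketch condenses the two updates described above: the critic is regressed toward a bootstrapped target, and the actor is improved along the deterministic policy gradient. Network sizes, learning rates, and the omission of soft target-network updates are illustrative simplifications, not the settings used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, gamma = 16, 10, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(o, a, r, o_next, target_actor, target_critic):
    """One DDPG update on a sampled mini-batch (soft target-network updates omitted)."""
    # Critic: regress Q(o, a) toward the bootstrapped target r + gamma * Q'(o', mu'(o'))
    with torch.no_grad():
        q_target = r + gamma * target_critic(torch.cat([o_next, target_actor(o_next)], dim=-1))
    q = critic(torch.cat([o, a], dim=-1))
    critic_loss = F.mse_loss(q, q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e., ascend Q(o, mu(o)) w.r.t. the actor parameters
    actor_loss = -critic(torch.cat([o, actor(o)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```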

3.4 Branching Deep Q Network for Network-Based System.

The branching deep Q network (BDQN) [71] presents a solution for implementing deep Q learning in scenarios characterized by large discrete action spaces. When an action space encompasses several dimensions, denoted as $N$, with each dimension offering a multitude of options, denoted as $D$, the complexity of this space escalates exponentially as $O(D^N)$. This exponential growth not only results in impractically large model sizes but also significantly complicates the optimization process. To address this, BDQN leverages the dueling Q-network architecture [72], which learns a separate Q function for each dimension. It then integrates these functions using a shared state value, facilitating coordination among them. Consequently, this approach effectively reduces the action space complexity to $O(N \times D)$, rendering the learning process feasible for tasks with extensive discrete action spaces.

To elucidate how BDQN serves as a baseline in our context, note that our network topology action space can be decomposed into one dimension per potential link, where each dimension has two options: to include the link or not. Therefore, in our case, $N$ corresponds to the number of potential links and $D = 2$, representing the binary choice for each link (present or absent). Using the BDQN training scheme, we can effectively learn policies for network structure assignment.
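A minimal sketch of the branching architecture in this setting is shown below: a shared trunk and state-value head, plus one two-way advantage head per potential link, combined with the dueling aggregation. The class name and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BranchingQNet(nn.Module):
    """Q-values for N*(N-1)/2 link decisions, each with 2 options (link / no link)."""
    def __init__(self, obs_dim: int, n_agents: int, hidden: int = 128):
        super().__init__()
        self.n_links = n_agents * (n_agents - 1) // 2
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                        # shared state value V(o)
        self.advantages = nn.ModuleList(
            [nn.Linear(hidden, 2) for _ in range(self.n_links)]  # per-link advantages A_d(o, .)
        )

    def forward(self, obs):
        h = self.trunk(obs)
        v = self.value(h)
        qs = []
        for adv_head in self.advantages:
            a = adv_head(h)
            qs.append(v + a - a.mean(dim=-1, keepdim=True))      # dueling aggregation per branch
        return torch.stack(qs, dim=1)                            # shape: (batch, n_links, 2)

# Greedy link assignment: pick link / no-link independently per branch
# link_vector = BranchingQNet(obs_dim=16, n_agents=10)(obs).argmax(dim=-1)
```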

3.5 Variational Autoencoder.

The VAE [28], a renowned generative model, has garnered considerable acclaim in fields like image and network generation. Its primary objective is to encode a dataset of independently and identically distributed samples into an embedded distribution that represents certain latent variables (encoder) and to subsequently generate new samples from this distribution (decoder). During the training phase, the encoder and decoder are trained jointly, facilitating a cohesive learning process. Upon establishing the embedded distribution, the decoder’s parameters are fixed, enabling it to generate samples that mirror the distribution of the original dataset. This capability is the cornerstone of VAE’s proficiency in producing images or networks akin to those in the dataset. In our specific context, the dataset comprises network topologies, represented by adjacency matrices. The training regimen adheres closely to the standard VAE protocol, which unfolds as follows.

The motivation for using a VAE in our framework is to establish a connection between the vast discrete network-based action space and the latent low-dimensional continuous action space. This connection enables the DDPG-based manager to learn policies within the latent space and then transfer the controlled latent actions back into communication networks. In our framework, we ideally need to train our VAE with datasets that encapsulate all possible network structures for a given number of nodes. However, several challenges emerge:

  1. Employing a random network formation strategy, such as the Erdős–Rényi model [73], to develop the dataset results in a lower proportion of very sparse or very dense networks. This skew in network density could restrict the capacity to learn an efficient policy for assigning communication networks later.

  2. Constructing a comprehensive dataset containing all possible network structures is impractical, as the dataset size grows exponentially with $O(2^{N(N-1)/2})$, where $N$ represents the number of agents in the system.

To overcome these challenges, we employ the Erdős–Rényi model to construct our dataset, but we adjust the sampling process to include a higher proportion of very sparse and very dense networks. This modification ensures a more uniform distribution of network densities within the dataset. Additionally, rather than attempting to create a complete dataset of all possible network structures, we generate a representative sample that captures a broad spectrum of possible configurations. It is important to note that these techniques are specific adaptations for our proposed framework; more efficient methods for creating the dataset may exist, but they are beyond the scope of this paper.
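One simple way to realize this density-balanced sampling, sketched below under the assumption that each sample is stored as a flattened adjacency matrix, is to draw the Erdős–Rényi link probability uniformly at random for every sample rather than fixing it.

```python
import numpy as np

def sample_networks(n_agents: int, n_samples: int, rng=None):
    """Sample adjacency matrices with roughly uniform link density by drawing the
    Erdős–Rényi link probability p uniformly in [0, 1] for every sample."""
    rng = np.random.default_rng(0) if rng is None else rng
    iu = np.triu_indices(n_agents, k=1)          # upper-triangular link slots
    data = []
    for _ in range(n_samples):
        p = rng.uniform(0.0, 1.0)                # per-sample link density instead of a fixed p
        links = (rng.random(len(iu[0])) < p).astype(np.float32)
        adj = np.zeros((n_agents, n_agents), dtype=np.float32)
        adj[iu] = links
        adj += adj.T                             # undirected: symmetric adjacency matrix
        data.append(adj.flatten())
    return np.stack(data)

dataset = sample_networks(n_agents=10, n_samples=50_000)  # illustrative training set for the VAE
```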

In our dataset, let us denote the data variables as $x$, with each data point following a distribution $p(x)$. The latent variables are characterized by the prior distribution $p(z)$. The encoder in a VAE is effectively represented by the conditional probability $p(z \mid x)$, while the decoder is represented by $p(x \mid z)$. Direct computation of $p(z \mid x)$ poses practical challenges, primarily because the underlying distribution $p(x)$ is intractable, as becomes evident when expanding $p(z \mid x)$ using Bayes’ rule. To circumvent this, the VAE introduces a tractable approximate posterior $q(z \mid x)$; in traditional VAE models, the prior $p(z)$ is a standard multivariate Gaussian and $q(z \mid x)$ is a Gaussian whose parameters are produced by the encoder. Minimizing the Kullback–Leibler divergence between $q(z \mid x)$ and the true posterior $p(z \mid x)$ leads, after a series of standard manipulations, to maximizing the following objective: $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)$.

The objective function of the VAE thus comprises two key components. The first is the log-likelihood of the reconstructed variables, which measures the accuracy of the reconstruction. The second is a regularization term, which ensures that the distribution of the latent variable $z$ aligns closely with the prior distribution, which, as mentioned above, is a standard multivariate Gaussian. Deep neural networks are employed to model both the encoder $q(z \mid x)$ and the decoder $p(x \mid z)$. In practice, the encoder outputs the mean and covariance of the Gaussian distribution from which $z$ is sampled. Furthermore, the VAE employs the reparameterization trick to enable backpropagation of gradients through the stochastic sampling step, thereby facilitating effective training of the network. For a more comprehensive treatment of these mechanisms, see Ref. [28].
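In code, this objective (a Bernoulli reconstruction term plus the closed-form KL divergence to the standard Gaussian prior, with the reparameterization trick) can be sketched as follows; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_logits, target_adj, mu, logvar):
    """Negative ELBO for binary adjacency entries: Bernoulli reconstruction term
    plus KL(q(z|x) || N(0, I)) in closed form."""
    recon = F.binary_cross_entropy_with_logits(recon_logits, target_adj, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, so gradients flow through the stochastic sampling step."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```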

3.6 DDPG With Variational Autoencoder for Network Topology Governance.

In controlling multiagent systems at the system level, a centralized management approach is pivotal, primarily to harmonize high-level performance with the allocation of various types of resources. This approach’s efficacy is substantiated in studies such as Refs. [25,27]. Within these complex systems, the dynamics of information sharing and agent interaction are crucial. The topology of the network, in particular, plays a crucial role in influencing outcomes across different scenarios. Consequently, when controlling multiagent systems, a frequent challenge encountered is the need for centralized control over the network topology. This task is complicated by the sheer complexity and diversity of potential network topologies, resulting in an action space that is vast and often impractical for traditional control algorithms with discrete action spaces, such as deep Q learning [74].

Policy gradient methods are notably proficient in handling complex and continuous action spaces [20]. In our case, where the inherent action space comprises network topologies, it is beneficial to encode this vast, discrete action space into a manageable, continuous latent action space. By applying policy gradient methods to this transformed latent space, we can adeptly navigate and optimize within it; the optimized latent actions can then be decoded back into reconstructed network topologies. This approach effectively circumvents the challenges associated with the original, large discrete action space, providing a more efficient method for network topology construction and optimization.

Inspired by this concept, we introduce the VAE-RL framework as a viable solution; an overview of the framework is presented in Fig. 1. The approach begins by training a VAE to transform the extensive, discrete action space associated with network topologies into a continuous latent space and to decode the latent space back into reconstructed networks. The framework then utilizes the pretrained decoder from the VAE to reconstruct the communication network from the latent space. Subsequently, policy gradient algorithms with continuous action spaces are employed to learn effective control policies. A key advantage of this framework is its ability to leverage the generative capabilities of the VAE, allowing the controlled, continuous action to be reverted to its original form, the network topology. This process effectively addresses the challenge posed by the vast discrete action space, thereby enhancing control mechanisms within multiagent systems.

Building upon the mathematical formulation of the framework overview in Sec. 3.1, VAE-RL models the decision-making process of the manager by integrating the VAE and DDPG to enable efficient communication network assignment, allowing the manager to optimize information sharing among agents dynamically. In addition, we make several assumptions in our implementation. First, we assume that the manager’s observation consists solely of the aggregated observations from the agents, without any additional system-level information. Second, all agents share the same utility function. Finally, information sharing between connected agents is modeled as the union of their individual observations.

The training process for the VAE-RL framework is divided into two distinct phases: the initial training of the VAE followed by the DRL training using the pretrained VAE. The details of these steps will be discussed in the subsequent sections.

3.6.1 Variational Autoencoder Training Process.

Our primary objective centers on managing network topologies within multiagent systems. To achieve this, we concentrate on training a VAE specifically designed to encode adjacency matrices into a latent space, and then decode this latent space back into reconstructed adjacency matrices. For both the encoder and decoder components of the VAE, we utilize fully connected neural networks. However, in scenarios where the system manager possesses comprehensive control with higher authority, extending not only to network topology but also directly to agent properties, graph neural networks [75] emerge as a viable alternative for representing both the encoder and decoder. The training framework for the VAE, as applied to network topology, is depicted in Fig. 2.

Fig. 2
The diagram shows a Variational Autoencoder applied to network topology. The encoder processes the adjacency matrix, producing Gaussian distribution parameters. The decoder samples from this distribution to reconstruct the adjacency matrix. Both components use deep neural networks.

Initially, we assemble a dataset, denoted as D, comprising samples of potential network topologies given the number of nodes. It is crucial to acknowledge that encompassing every possible network topology becomes more intractable as the number of nodes increases. Subsequently, this dataset is partitioned into a training set and a validation set, adhering to conventional supervised learning methodologies. Network topologies are represented through a flattened adjacency matrix, symbolized as A.

The architecture of our model employs fully connected neural networks to instantiate the encoder and decoder, denoted by $f_{\text{encoder}}(\cdot\,; \theta_1)$ and $f_{\text{decoder}}(\cdot\,; \theta_2)$, respectively. During each training iteration, a mini-batch of data $a$ is sampled from the training set. The encoder processes this data to yield the mean and covariance of a multivariate Gaussian distribution: $\mu, \Sigma = f_{\text{encoder}}(a; \theta_1)$. Subsequently, the latent variable values $z$ are sampled from this distribution, where $z \in \mathbb{R}^d$ and $d$, the latent variable dimension, is a predefined hyperparameter.

The decoder then takes $z$ as input and produces the reconstructed adjacency matrix $\hat{a} = f_{\text{decoder}}(z; \theta_2)$. The loss is calculated using the standard VAE loss described earlier, and its gradient is backpropagated to update $\theta_1$ and $\theta_2$ in $f_{\text{encoder}}$ and $f_{\text{decoder}}$. Finally, the model is evaluated on the validation set, and the model with the best validation performance is saved.
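Putting the pieces together, one training iteration can be sketched as below. The hidden-layer sizes and latent dimension mirror those reported in Sec. 4.1 for the 10-agent system; the `reparameterize` and `vae_loss` helpers are the ones sketched in Sec. 3.5, and the remaining details are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_agents, latent_dim = 10, 10
input_dim = n_agents * n_agents                       # flattened adjacency matrix A

encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU(),
                        nn.Linear(512, 256), nn.ReLU())
enc_mu = nn.Linear(256, latent_dim)                   # mean of q(z|x)
enc_logvar = nn.Linear(256, latent_dim)               # log-variance of q(z|x)
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 512), nn.ReLU(),
                        nn.Linear(512, input_dim))    # logits of the reconstructed entries
params = (list(encoder.parameters()) + list(enc_mu.parameters())
          + list(enc_logvar.parameters()) + list(decoder.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

def train_step(batch):                                # batch: (B, input_dim) of 0/1 entries
    h = encoder(batch)
    mu, logvar = enc_mu(h), enc_logvar(h)
    z = reparameterize(mu, logvar)                    # sample latent variables
    loss = vae_loss(decoder(z), batch, mu, logvar)    # negative ELBO
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```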

3.6.2 DDPG Training Process With Learned Variational Autoencoder.

In the VAE training phase, we have two key outcomes: a learned encoder that effectively embeds input network topologies into a latent space and a learned decoder capable of reconstructing network topologies from latent variables. During the DDPG training, the parameters of the pretrained VAE are fixed, with an exclusive focus on utilizing the decoder. This process is illustrated in Fig. 3.

Fig. 3
The diagram on the left depicts the VAE-RL framework used for interaction within a multiagent system environment. The DDPG manager directly manages the continuous latent variables, which are decoded by the decoder into the reconstructed network topology. This topology serves as the final action within a POMDP. Following this, the multiagent system environment processes the action and updates its state accordingly. The DDPG manager then receives observations and rewards, which it uses to update its policy and execute subsequent controls. During this process, the decoder’s parameters remain fixed from the pretrained model. The right figure illustrates the environment used in our experiments, which simulates a robotic navigation task. This environment includes several agents, each with different vision ranges, capable of sharing observations through a communication network. Their objective is to efficiently distribute themselves across designated landmarks while minimizing collisions.

The environment dynamics are defined as $R, S' = \mathrm{Transition}(S, A_{\mathrm{adj}})$, where $R$ represents the reward for the current step, $S$ is the current state, and $S'$ is the subsequent state of the environment. The current action $A_{\mathrm{adj}}$ corresponds to the network topology. Fully connected networks are employed to model the actor $\mu(o; \phi)$ and the critic $Q(o, a; \theta)$. In each step of the DDPG training, the manager observes the current state of the environment, obtaining the observation $o_t$ at time-step $t$. The actor then outputs the latent variable values $z_t = \mu(o_t; \phi)$. For exploration, noise is added to the actions and diminished over time, a strategy recommended by Ref. [20]. Subsequently, the decoder transforms these latent variables into a reconstructed adjacency matrix, $a_{\mathrm{adj}}^t = f_{\text{decoder}}(z_t; \theta_2)$. The environment takes this reconstructed matrix, updates its state, and generates the corresponding reward: $r, s_{t+1} = \mathrm{Transition}(s_t, a_{\mathrm{adj}}^t)$. The trajectory $(s_t, z_t, r, s_{t+1})$ is stored in the replay buffer $D$. During training, a mini-batch of trajectory data is sampled, and the actor and critic are updated by backpropagating the gradients of the DDPG losses described in Sec. 3.3.
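A minimal sketch of this interaction step is given below. The thresholding used to binarize the decoder output into an adjacency matrix, and the gym-style environment interface that accepts a network action, are our own illustrative assumptions rather than details specified by the paper.

```python
import torch

@torch.no_grad()
def latent_to_adjacency(decoder, z, n_agents, threshold=0.5):
    """Decode a latent action into a binary, symmetric adjacency matrix.
    Thresholding the decoder's sigmoid output at 0.5 is an illustrative choice."""
    probs = torch.sigmoid(decoder(z)).view(n_agents, n_agents)
    adj = (probs > threshold).float()
    adj = torch.triu(adj, diagonal=1)         # drop self-links and the lower triangle
    return adj + adj.T                        # symmetrize: undirected communication network

@torch.no_grad()
def vae_rl_step(env, actor, decoder, obs, noise_std, n_agents):
    """One manager step: the actor picks a latent action (plus exploration noise),
    the frozen decoder reconstructs the communication network, and the environment steps."""
    z = actor(obs)
    z = z + noise_std * torch.randn_like(z)               # exploration noise, annealed over training
    adj = latent_to_adjacency(decoder, z, n_agents)
    next_obs, reward, done, _ = env.step(adj.cpu().numpy())  # gym-style environment interface
    return (obs, z, reward, next_obs), done               # transition to append to the replay buffer
```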

4 Experiment Results and Discussion

In this section, we first introduce the environment we are using and the related settings. Next, we compare the performance of the proposed VAE-RL framework to baseline methods in different scenarios. We then analyze the learned behavior of the system manager using VAE-RL and discuss the trends and insights derived from the results. Finally, we present snapshots of the network evolution over time, visualizing the system’s dynamics and justifying our findings.

4.1 Experiment Design.

We utilize the modified OpenAI Gym particle environment [31] to evaluate the effectiveness of our VAE-RL framework. In the original “spread” task within this particle environment, multiple agents are tasked with spreading themselves on landmarks while minimizing collisions. This scenario constitutes a multiagent system where effective coordination among agents is essential. In our previous work [27], we introduced modifications to this environment, making it partially observable. Furthermore, we introduced different types of resources, such as additional vision and communication capabilities. These resources are dynamically selected by the SoS manager to maintain a balance between system performance and resource utilization for the SoS workers. However, the communication network options were limited (either empty or complete) in the previous study. Additionally, the SoS agents were homogeneous, and the proposed framework fell short in handling heterogeneous situations.

In this article, we focus solely on the allocation of communication resources and adding flexibility to the action space. The SoS manager can select different communication network topologies during tasks, providing opportunities to save communication resources and improve system performance in uncertain environments with heterogeneous agents. Therefore, we retain the essential components in the hierarchical framework for SoS that we previously proposed but mainly increase the flexibility of network topologies within the action space in Tier II. The visualization of the environment is presented in Fig. 3. The environment in our experiments models a “robotic navigation” problem. As illustrated, the environment contains four agents tasked with the objective of efficiently spreading across four designated landmarks while minimizing collisions. Each agent may have a different vision range, which implies that some agents, like Agent 4, might not always be able to observe the landmarks directly. To facilitate effective navigation, the manager assigns communication networks among the agents, allowing them to share observations. For instance, although Agent 4 cannot directly see any landmarks, it can receive information about their locations through communication with Agent 1, enabling it to navigate toward a landmark based on its own policy.

The communication network considered in this paper is undirected, meaning that connected agents share their observations mutually. Exploring scenarios with directed communication networks could provide meaningful insights, particularly for systems with one-way communication. From a technical perspective, however, the action space of a directed network, $O(2^{N(N-1)})$, exhibits the same exponential growth in $N$ as that of an undirected network, $O(2^{N(N-1)/2})$, where $N$ is the number of agents in the system. Therefore, we leave the exploration of directed communication networks for future research.

In Tier I, SoS workers are initialized in a manner consistent with the previous framework. SoS workers and landmarks are randomly initialized at the beginning of each game within a 2×2 square. They possess limited vision, allowing them to observe only landmarks or other agents within their visual range. Furthermore, these workers act autonomously, making individual decisions based on the information available to them. While the SoS manager in Tier II lacks the ability to directly compel SoS workers to alter their actions, it can influence their behavior by manipulating the information they receive. This manipulation is achieved through the assignment of different communication network topologies. It is important to note that the SoS manager does not possess an all-encompassing view of the entire system but instead relies on aggregate information derived from the observations of SoS workers.

At each time-step, the SoS manager assigns a communication network to the SoS workers. Subsequently, these workers share their information with their neighbors within the communication network. The allocation of communication networks involves the utilization of communication resources, with each link in the network incurring a cost of 0.1 in the current settings. The primary objective of the SoS manager is to maximize a weighted combination of two key subobjectives: maximizing task performance and minimizing communication resource costs. The current weight is 1, which means the scores of tasks and the communication resource cost are equally important. The specific weights applied to these subobjectives can be adjusted based on their relative importance, which is thoroughly explored in Ref. [27].

Regarding the VAE-RL algorithm, we set several hyperparameters. The encoder and decoder of the VAE and the actors and critics of DDPG are represented by fully connected neural networks. For the VAE, the encoder uses hidden layers of sizes 512×256 and the decoder uses 256×512, and the dimension of the latent space is 10 for the multiagent system with 10 agents. The resource manager uses 1024×512×256 fully connected neural networks to represent the critics and 1024×512 networks to represent the actors. We use the Adam optimizer with a learning rate of 0.001 for critic training and a learning rate of 0.0001 for actor training. The training of the resource manager comprises 20,000 epochs. It is crucial to note that critics are only used during the training phase, so only the actors participate in decision-making during the execution phase. In our experiments, each game lasts 50 time-steps, and the following results are evaluated on 1000 new games.

4.2 Performance of Proposed Method.

To assess the generalization and robustness of our proposed VAE-RL method, we compare its performance against the traditional DRL approach that uses a discrete action space over network topologies, hereafter referred to as flat-RL, within a 10-agent system. Given that the action space in this system is $O(2^{N(N-1)/2})$ with $N = 10$, its enormity renders the training of flat-RL impractical. Therefore, we introduce another baseline, BDQN [71], which is scalable to larger networks. Initially, we apply VAE-RL and the two baseline methods (flat-RL and BDQN) to a smaller system comprising 4 agents. After establishing that BDQN demonstrates equivalent or superior performance compared to flat-RL in this smaller setup, we extend our evaluation to a larger 10-agent system. Here, BDQN’s performance serves as a proxy for the upper bound of flat-RL’s capabilities. This approach allows us to indirectly gauge the performance of flat-RL in larger systems, thereby ensuring the coherence of our results. We also include a random baseline as a reference for evaluating the proposed VAE-RL framework. The random strategy is based on an Erdős–Rényi network [73] with a 0.5 probability of link formation: at each time-step, the random strategy assigns a random Erdős–Rényi network as the communication network for the agents.

It should be noted that neither flat-RL nor BDQN uses the VAE model. Flat-RL directly operates on the original discrete network-based action space. Unlike flat-RL, which learns the Q-function for all possible actions given the current state and selects the action with the highest value, BDQN represents actions as vectors encoding the status of the possible links in the network. Specifically, this vector consists of $N(N-1)/2$ binary values, where each value indicates the existence of a specific link and $N$ denotes the number of agents.
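For concreteness, the mapping from BDQN's per-link binary action vector to a symmetric adjacency matrix can be written as follows; the ordering of the link slots is an arbitrary but fixed convention assumed for illustration.

```python
import numpy as np

def link_vector_to_adjacency(link_vector, n_agents):
    """Map a length-N(N-1)/2 binary vector (one entry per unordered agent pair)
    to a symmetric adjacency matrix."""
    adj = np.zeros((n_agents, n_agents), dtype=int)
    iu = np.triu_indices(n_agents, k=1)     # fixed ordering of the unordered pairs
    adj[iu] = link_vector
    return adj + adj.T

# Example for a 4-agent system: 6 link slots, here connecting pairs (0,1), (1,2), and (2,3)
print(link_vector_to_adjacency([1, 0, 0, 1, 0, 1], n_agents=4))
```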

Initially, we established a homogeneous vision range for all SoS workers. We designed a series of four experiments, varying in difficulty levels, with vision ranges of 0.6, 0.8, 1.0, and 1.2. It is worth noting that all entities are positioned within a 2×2 square configuration, meaning that even the simplest task remains partially observable. Additionally, even tasks with a vision range of 1.2 are not trivial because it is still possible that agents are unable to observe anything at the initial states. In numerous real-world multiagent systems, agents often exhibit heterogeneity in their properties or capabilities. To test our framework in a more realistic environment, we have devised an environment with heterogeneous agents, where agents have vision ranges of 2, 1.5, 1, and 0.5. These experiments serve as crucial and representative examples within the realm of heterogeneous scenarios.

In our experiments, we first evaluate the performance of our VAE-RL framework, followed by an analysis of the learned behaviors of the SoS manager in small systems with 4 agents. Subsequently, we assess the scalability of VAE-RL by evaluating its performance and the learned behaviors in larger systems with 10 agents. It is important to note that although scaling up VAE-RL requires additional efforts and resources—such as retraining the models, re-optimizing hyperparameters, and allocating more memory and computational resources—it remains a viable solution in large-scale scenarios. In contrast, flat-RL quickly becomes intractable as the system size increases. Regardless of the amount of effort spent on retraining or hyperparameter optimization, flat-RL remains infeasible due to its vast action space and the limitations of physical devices.

Thus, while we acknowledge that VAE-RL may also become intractable if the system grows extremely large, learning network intervention policies for system management in very large systems remains an open question. The proposed VAE-RL framework extends the scalability of system size beyond what is possible with flat-RL, and its core ideas may inspire future research to develop even more efficient frameworks. These advancements could further scale up to manage real-world systems with large and complex network structures effectively.

4.2.1 Results on Two Environments.

We first applied all methods to the small environment with four agents; the results are shown in Fig. 4, and the total performance comparison between methods is summarized in Table 1. The worker score in this system is calculated as the negative of the average distance between each landmark and its closest agent, so it is always nonpositive; the highest possible score (when agents are perfectly positioned on the landmarks) is zero. The resource cost is calculated based on the number of links in the communication network, with each link incurring a cost of 0.15. This value is chosen to keep both metrics on the same scale. More details about the environment can be found in Ref. [20], and experiments with different resource cost settings are discussed in Ref. [27].
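The two reward terms described here can be computed as in the sketch below; the function signatures are illustrative, and the per-link cost is passed as a parameter since it differs across the reported settings.

```python
import numpy as np

def worker_score(agent_positions, landmark_positions):
    """Negative of the average distance from each landmark to its closest agent
    (zero is the best achievable value, when every landmark is covered)."""
    dists = np.linalg.norm(landmark_positions[:, None, :] - agent_positions[None, :, :], axis=-1)
    return -dists.min(axis=1).mean()

def resource_cost(adjacency, cost_per_link):
    """Communication cost proportional to the number of links in the undirected network."""
    return cost_per_link * np.triu(adjacency, k=1).sum()

def manager_reward(agent_positions, landmark_positions, adjacency, cost_per_link, weight=1.0):
    """Weighted combination of task score and (negative) communication resource cost."""
    return worker_score(agent_positions, landmark_positions) \
           - weight * resource_cost(adjacency, cost_per_link)
```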

Fig. 4
Results show performance and resource penalties for various methods with homogeneous agents (vision ranges 0.6–1.2) and heterogeneous agents in four-agent systems. A star marker indicates the random policy baseline’s overall performance.
Table 1

Comparison of total scores across methods in four-agent systems

                      VAE-RL    Flat-RL    BDQN      Random
0.6 Vision            −60.83    −68.97     −62.37    −76.98
0.8 Vision            −44.33    −55.76     −49.21    −67.71
1.0 Vision            −36.24    −49.09     −39.22    −63.36
1.2 Vision            −32.24    −43.20     −32.32    −61.98
Heterogeneous vision  −43.97    −50.12     −49.23    −63.06

From the results, our VAE-RL approach consistently outperforms the baseline methods across all tasks, including the heterogeneous case, ranging from difficult to easy. This not only demonstrates the superior performance of our method but also underscores its robustness under different scenarios. Because homogeneous and heterogeneous scenarios are difficult to compare directly, our subsequent findings and trend analyses in this section focus primarily on the homogeneous cases.

Second, it is noteworthy that as the vision range increases, the performance of all methods and communication resource usage demonstrate improvement. This phenomenon can be attributed to the tasks becoming progressively easier. Even the simplest strategy, such as randomly selecting actions, becomes more effective because SoS workers have access to more self-observed information with larger vision ranges, leading to enhanced performance. Furthermore, with larger vision ranges, workers have increased opportunities to independently observe landmarks and other agents. Consequently, the reliance on the communication network diminishes, leading to reduced usage of communication resources.

Third, as tasks become easier, the VAE-RL method exhibits substantial improvements over the baseline methods. In environments where SoS workers have limited vision ranges, such as 0.6 and 0.8, the tasks are exceedingly challenging for agents to accomplish with flawless performance, even when employing intelligent communication network assignment strategies. This difficulty arises because smaller vision ranges leave SoS workers with very limited opportunities for useful observation in the initial states. Despite the potential for communication through the network, they have little to share due to the scarcity of information. Consequently, in these critical scenarios, we observe only marginal enhancements of the VAE-RL method over the baseline methods. Conversely, when tasks are moderately challenging, VAE-RL and BDQN exhibit more substantial improvements compared to the flat-RL method. Our explanation for this phenomenon is that both VAE-RL and BDQN are capable of learning adaptive policies by considering the diverse positions of agents and landmarks in small systems (e.g., a four-agent system), although VAE-RL demonstrates superior performance compared to BDQN. In contrast, flat-RL struggles due to the higher order of magnitude of its action space complexity, which prevents it from learning policies with sufficient adaptability, even in smaller systems. This limitation explains why flat-RL continues to perform poorly even when tasks become easier as agents’ vision ranges increase (e.g., from a vision range of 1.0 to 1.2).

Finally, it becomes evident that although BDQN does not surpass VAE-RL in total performance, it consistently outperforms flat-RL across a variety of scenarios, including those involving heterogeneous cases. When considering larger and more complex environments that incorporate a greater number of agents, where the application of flat-RL is hindered by the model’s size and optimization complexities, BDQN stands as a viable baseline. Its proven advantage over flat-RL in smaller settings provides a rationale for employing BDQN as a benchmark in these larger scenarios. This analysis allows for meaningful comparison and evaluation of the VAE-RL’s performance against a relevant and established baseline in more expansive and challenging environments.

Our motivation for introducing BDQN is to establish a baseline that is applicable to both small and large systems while also outperforming flat-RL. This provides a meaningful reference for evaluating our proposed framework, VAE-RL, in larger systems. Specifically, BDQN remains feasible for learning policies on physical devices, whereas flat-RL becomes intractable due to the exponential growth of the action space (e.g., $2^{45}$ for a 10-agent system). Furthermore, in small systems, we compared BDQN with flat-RL and confirmed that BDQN achieves better performance. In larger systems, where flat-RL is computationally infeasible—resulting in effectively negative-infinity performance—BDQN still maintains a finite performance, though it degrades as the system size increases.

Given the reasoning above regarding the use of BDQN and the infeasibility of flat-RL in larger systems, our expanded experiment, which involves a larger environment with 10 agents, includes only the VAE-RL method and the BDQN baseline. The results of this experiment are presented in Fig. 5. We also summarized the total performance comparison between different methods in Table 2. The results clearly show that VAE-RL continues to outperform the BDQN baseline across all environment settings, including those with heterogeneous conditions. This consistent pattern reinforces the trends observed in smaller environments, indicating that VAE-RL’s effectiveness scales well with increased complexity. On the other hand, the BDQN baseline struggles even in the simplest case with agents having 1.2 vision ranges, where its performance is comparable to a random strategy. This outcome suggests that while BDQN remains applicable in this context, it fails to develop a meaningful policy for communication network assignment. In contrast, VAE-RL not only demonstrates strong performance relative to the baseline but also retains its potential for effective implementation in even larger systems with more agents.

Fig. 5
Results show performance and resource penalties for various methods with homogeneous agents (vision ranges 0.6–1.2) and heterogeneous agents in 10-agent systems. A star marker indicates the random policy baseline’s overall performance.
Table 2

Comparison of total scores across methods in 10-agent systems

                      VAE-RL     Flat-RL    BDQN       Random
0.6 vision            −281.89    N/A        −337.60    −327.57
0.8 vision            −214.84    N/A        −304.09    −305.02
1.0 vision            −187.14    N/A        −287.66    −300.71
1.2 vision            −171.33    N/A        −309.89    −294.01
Heterogeneous vision  −255.46    N/A        −286.50    −311.12

4.3 Evolution of Network Behavior.

After confirming that our VAE-RL method outperforms the baselines in both small and large systems, we analyze the learned behaviors of the SoS managers to improve the explainability of our deep learning models and to extract useful insights and heuristics. We illustrate the distribution of communication networks with varying link densities during the task and examine the behaviors of VAE-RL’s policy in both homogeneous and heterogeneous cases. In this section, we analyze the learned behaviors only for the large systems with 10 agents, as the four-agent system is a simpler environment used primarily to demonstrate the scalability limitations of flat-RL. The maximum number of links in a 10-agent system is 45, so we categorize the communication networks by their number of links: sparse networks have fewer than 9 links, mid-dense networks have 9–17 links, dense networks have 18–26 links, and very dense networks have 27 or more links.
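As a minimal sketch of this bookkeeping (our own illustration; the function and variable names are not from the original implementation), the snippet below counts the links of a 10-agent communication network given as an adjacency matrix and assigns it to one of the four density categories used in the following analysis.

    import numpy as np

    def density_category(adj: np.ndarray) -> str:
        """Classify a symmetric 0/1 adjacency matrix by its number of links."""
        n_links = int(np.triu(adj, k=1).sum())   # count each undirected link once
        if n_links < 9:
            return "sparse"
        elif n_links <= 17:
            return "mid-dense"
        elif n_links <= 26:
            return "dense"
        else:                                     # 27 or more of the 45 possible links
            return "very dense"

    # Example: a random 10-agent communication network
    rng = np.random.default_rng(0)
    upper = np.triu(rng.integers(0, 2, size=(10, 10)), k=1)
    adj = upper + upper.T
    print(density_category(adj))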

4.3.1 Scenarios With Homogeneous Agents.

We have four environments for the homogeneous cases, where agents have vision ranges of 0.6, 0.8, 1.0, and 1.2. We plot the distribution of networks with varying densities over time in Fig. 6. The x-axis represents the time-steps, while the y-axis shows the percentage of each communication network category used across 1000 test runs; this visualization approximates how the usage distribution of the different network categories evolves over time. First, the graphs show that as tasks become less challenging, less dense communication networks are used more frequently, while costly denser communication networks are used less. In the easiest task, with a vision range of 1.2, the use of sparse communication networks approaches approximately 100% by the end of the game. This observation aligns with our earlier explanation that easier tasks correspond to reduced utilization of communication resources.

Fig. 6
Communication network distribution over time is analyzed for homogeneous agents with vision ranges of 0.6, 0.8, 1.0, and 1.2 (subgraphs (a)–(d)). Networks are categorized as sparse (<9 links), mid-dense (9–17 links), dense (18–26 links), or very dense (≥27 links). The x-axis represents the time-steps, while the y-axis shows the percentage of specific communication network usage across 1000 test runs. (a) 0.6 vision range, (b) 0.8 vision range, (c) 1.0 vision range, and (d) 1.2 vision range.

Second, the behavior of the SoS manager exhibits two distinct phases: during the initial 10 time-steps, the manager leans toward costly communication networks with many links, and it then shifts toward cheaper communication networks. At the beginning of each task, because of the limited vision ranges, some SoS workers may be unable to observe any entities, so the manager prefers to let all workers aggregate their information through dense communication networks. However, denser communication networks entail higher communication resource costs. Therefore, as workers approach the landmarks and can perceive them within their own limited vision ranges, the manager is inclined to assign sparser communication networks to save communication resources. For instance, in the initial phase of Fig. 6(a), where the 0.6 vision range makes the task exceptionally challenging, the manager assigns very dense communication networks almost 60% of the time. Subsequently, the frequency of costly communication networks drops immediately, while the frequency of sparse communication networks exceeds 50% within a few time-steps.

4.3.2 Extending the Model to Heterogeneous Agents.

In numerous real-world multiagent systems, agents often exhibit heterogeneity in their properties or capabilities. Regrettably, many studies on multiagent system control or SoS control [27] focus exclusively on homogeneous scenarios, making them less suitable for deployment in heterogeneous environments. To address this limitation, we designed a 10-agent experiment in which agents possess heterogeneous vision ranges: (2, 1, 1, 1, 0.5, 0.5, 0.5, 0, 0, 0). This experiment serves as a representative example of heterogeneous scenarios. Instead of analyzing the communication networks in general, we focus on the agents’ properties within the networks: we examine the average node degree and the average node betweenness centrality for agents sharing the same vision range and plot these over time. The calculation of node betweenness centrality includes the endpoints (start and end nodes) of each shortest path. Our analysis of the learned behaviors of the resource manager and the corresponding results are depicted in Fig. 7.
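The following sketch (our own illustration, assuming each assigned network is available as an adjacency matrix) shows how the per-group statistics can be computed with networkx. Passing endpoints=True includes the start and end nodes of each shortest path; the exact normalization convention for betweenness centrality is an assumption on our part and may differ slightly from the one used to produce Fig. 7.

    import networkx as nx
    import numpy as np

    VISION = (2, 1, 1, 1, 0.5, 0.5, 0.5, 0, 0, 0)  # vision range of each of the 10 agents

    def group_metrics(adj: np.ndarray):
        """Average node degree and betweenness centrality per vision-range group."""
        g = nx.from_numpy_array(adj)
        # endpoints=True includes the start and end nodes of each shortest path
        btw = nx.betweenness_centrality(g, endpoints=True)
        metrics = {}
        for v in sorted(set(VISION), reverse=True):
            nodes = [i for i, vis in enumerate(VISION) if vis == v]
            metrics[v] = (np.mean([g.degree[i] for i in nodes]),
                          np.mean([btw[i] for i in nodes]))
        return metrics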

Fig. 7
The results show average node degrees and betweenness centrality for agents with different vision ranges over time. The study includes one agent with vision 2.0, and three agents each with visions 1.0, 0.5, and 0. (a) Evolution of node degrees and (b) evolution of node betweenness.

Upon analysis, we can discern two distinct phases in the behavior patterns of the SoS manager, paralleling our previous findings. During the first phase (0–10 time-steps), the node degrees of agents are relatively high, indicating a greater level of connectivity among agents; this is quickly followed by a shift to lower node degrees. In the second phase (10–50 time-steps), a notable stabilization occurs, with the agents maintaining generally low node degrees. The rationale behind this pattern mirrors our earlier analysis: at the beginning, agents require a higher degree of information exchange for effective coordination and task completion, leading to an increased number of connections. However, maintaining a high node degree comes with substantial costs. Consequently, once the agents have acquired essential information and start converging toward their objectives, there is a significant reduction in node degrees to minimize costs while maintaining efficiency.

For node betweenness centrality, we observe an increase during the 0–10 time-steps, a decrease between 10 and 20 time-steps, and stabilization afterward. Since betweenness centrality measures the importance and effectiveness of each node in the information flow, we infer that initially, agents cannot effectively share their information due to their random positions and connections within the network. With guidance from the VAE-RL manager, agents are strategically positioned within more reasonable network structures, enhancing their importance and effectiveness in the network. Finally, as agents approach their targets, the need for a complex communication network diminishes, reducing the overall importance of all agents and resulting in stabilization.

Furthermore, we sought to identify patterns among agents with differing vision ranges. It was observed that agents with a vision range of 2 consistently exhibited the highest node degree and node betweenness centrality over time. This suggests that agents with a vision range of 2 tend to remain central within the communication network, maintaining more connections and a higher level of importance. Contrary to expectations, a clear trend where agents with greater vision ranges demonstrate larger node degrees and node betweenness centrality was not evident. Specifically, agents with a vision range of 1 generally had lower node degrees and node betweenness centrality compared to agents with a vision range of 0.5.

We propose several explanations for our findings. First, agents with moderate abilities serve a dual role: they require assistance from others to solve tasks in certain scenarios, yet they are also capable of providing valuable aid in different contexts. This bidirectional flow of information in a complex environment may contribute to the observed phenomenon. Second, while agents with moderate abilities lack the power to significantly assist others, they are able to solve tasks independently. This suggests that in some cases, they should be isolated from the communication network, resulting in lower node degrees and betweenness centrality compared to agents with lesser abilities.

We then sought to explore whether these explanations are reasonable in our environment. However, given the complexity and dynamic nature of the environment, coupled with its inherent randomness and uncertainty, it is difficult to explain this phenomenon with specific examples, and analyzing it quantitatively in detail is also challenging. For instance, quantitatively evaluating the benefit of individual connections within the communication networks, which we assume to be a key factor behind the “flipping rank” phenomenon, is particularly difficult. While we can statistically observe metrics such as node degree or betweenness across multiple runs, explaining the reasons behind the phenomenon is not feasible with our current data.

To clarify these findings and draw motivation from the challenges mentioned above, we designed a simplified theoretical example to capture the core aspects of our environment and propose a couple of potential explanations for this “flipping rank” phenomenon, as illustrated in Fig. 8. In this example, agents with varying capabilities, represented by different vision ranges in our experiment, assist one another via connections in a communication network. However, the assistance an agent can provide is typically less than its own capabilities, because the information beneficial to one agent may not be as useful to another. The task difficulty in each episode varies, resulting from the random initial positions of agents and landmarks in our experiment. The system earns rewards when more agents successfully complete tasks, which occurs when their innate abilities combined with the help received from others through the network meet the task’s requirements. This scenario represents the combination of agents’ own observations and others’ information from communication to facilitate solving the landmark spreading task.

Fig. 8
The illustrative example justifying the “flipping rank” phenomenon. There are three agents with different abilities: Agent 2 with ability 2, Agent 1 with ability 1, and Agent 0.5 with ability 0.5. Agents can boost their total abilities by gaining half of their connected agents’ abilities. The colored values represent the boosted abilities. In four different scenarios, agents must meet the task requirements to succeed. Each successful task rewards 1, outweighing the connection cost of 0.1. The network structures shown are the most efficient, maximizing performance minus connection costs. (a) Requirement of 0.5 ability, (b) requirement of 1 ability, (c) requirement of 1.5 ability, and (d) requirement of 2 ability.

In the example, we assume that the cost of establishing one link in the network (0.1) is less than the reward for an additional agent completing the task (1). To resolve a tie when two agents can help another complete the task, the agent with higher ability will offer the help. This decision ensures that our insights remain consistent and are not influenced by random methods such as flipping coins. Here, we have three agents with abilities quantified as 2, 1, and 0.5, which we refer to as Agent 2, Agent 1, and Agent 0.5, respectively. Each agent contributes half of their ability to assist others through undirected connections, facilitating mutual support among connected agents. We then explore the most efficient communication network structures from the system manager’s perspective under various scenarios within this simplified framework.

In scenarios where task requirements are minimal (0.5 ability), there is no need for additional links in the communication network since all agents can independently meet these requirements, and adding links would only incur unnecessary costs. For tasks requiring one ability, only Agent 0.5 cannot solve the task alone. In this case, it is advantageous for Agent 2 to connect to Agent 0.5. This connection boosts the latter’s total ability to 1.5, enabling it to accomplish the task. Although Agent 2 also receives an additional 0.25 ability from Agent 0.5, this surplus does not provide further benefit as it already exceeds the task requirement.

When tasks demand 1.5 ability, both agents with abilities of 1 and 0.5 fall short on their own. Thus, Agent 2 must establish connections with both to facilitate task completion. In the most challenging scenarios, where the requirement is 2 ability, Agent 2 just manages to meet this threshold independently. However, agents with abilities of 1 and 0.5 are unable to complete the task on their own. In this case, Agent 0.5 remains unable to fulfill the task requirements despite assistance from Agent 2. Consequently, the most effective network structure under these conditions is a fully connected network, where all agents are interconnected, maximizing the distribution of available abilities and support.

Assuming that the tasks with varying difficulties are uniformly distributed across all episodes, the average node degrees for Agent 2, Agent 1, and Agent 0.5 are 1.25, 0.75, and 1, respectively; the average betweenness centralities for Agent 2, Agent 1, and Agent 0.5 are 0.67, 0.29, and 0.54, respectively. The rankings in this illustrative example reveal a similar pattern to our simulation results, showing that the rank of agents’ degrees and betweenness centrality may not align with the rank of their abilities. This pattern holds even when the tasks are normally distributed in all episodes, with a higher probability for tasks of middle difficulties, which more closely reflects the conditions of our original experiment.
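The average node degrees of this illustrative example can be checked directly. The sketch below (our own verification code, with hypothetical agent labels) builds the four optimal networks of Fig. 8 and averages the degrees under a uniform distribution over task difficulties; betweenness centrality can be averaged analogously, although the exact values depend on the centrality convention used.

    import networkx as nx

    AGENTS = ["A2", "A1", "A0.5"]
    # Optimal networks from Fig. 8 for task requirements 0.5, 1, 1.5, and 2:
    SCENARIOS = [
        [],                                              # 0.5: no links needed
        [("A2", "A0.5")],                                # 1:   Agent 2 helps Agent 0.5
        [("A2", "A1"), ("A2", "A0.5")],                  # 1.5: Agent 2 helps both
        [("A2", "A1"), ("A2", "A0.5"), ("A1", "A0.5")],  # 2:   fully connected
    ]

    degree_sum = {a: 0 for a in AGENTS}
    for edges in SCENARIOS:
        g = nx.Graph()
        g.add_nodes_from(AGENTS)
        g.add_edges_from(edges)
        for a in AGENTS:
            degree_sum[a] += g.degree[a]

    # Uniform distribution over the four task difficulties
    avg_degree = {a: degree_sum[a] / len(SCENARIOS) for a in AGENTS}
    print(avg_degree)   # {'A2': 1.25, 'A1': 0.75, 'A0.5': 1.0}
    # Betweenness centrality (with endpoints included) can be averaged analogously
    # via nx.betweenness_centrality(g, endpoints=True); exact values depend on the
    # normalization convention.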

First, we clearly observe that Agent 1 is in a dual role: it needs assistance from others to solve tasks in scenarios requiring abilities of 1.5 and 2, and it can also provide useful help in scenarios requiring an ability of 2. While this factor does not directly result in the “flipping rank” phenomenon, it can become more significant and contribute to this phenomenon in our more complex experiments. Second, the scenario with a requirement of 1 ability is the main reason for this phenomenon. Agent 1 is not powerful enough to assist others significantly, but it can independently solve the task, suggesting it should be isolated from the communication network in some cases. In the scenario with a requirement of one ability, even if we relax the initial assumption for breaking ties and instead determine network structures by flipping a coin, the results would not change.

It should be emphasized that this example significantly simplifies our original environment. In our experiment, quantifying the abilities of agents and the difficulty of tasks is challenging; the process is much more complex, and the assistance offered by agents is heterogeneous, varying with different recipients and evolving over time. However, this example captures the essential aspects and replicates similar patterns observed in our experimental results, potentially explaining the insights behind the “flipping rank” phenomenon as well.

4.3.3 Evolution of Network Structure.

In this section, we visualize the interactions between agents and the flow of information within the communication networks, as shown in Fig. 9. For this demonstration, we standardize the positions of the landmarks while initializing the agents’ positions randomly, for vision ranges of 0.6, 0.8, 1.0, and 1.2. We capture snapshots of the environment at time-steps 0, 5, 10, 15, and 20 to observe the progression of the environment and the behaviors of the agents. It should be noted that the training and testing procedures described in our paper incorporate significant randomness to ensure the robustness of our VAE-RL framework; the statistical analysis in the previous section offers a more precise evaluation of VAE-RL, and the examples presented here are intended to visually demonstrate the system’s evolution.

Fig. 9
The evolution of communication networks and agents’ behaviors using the VAE-RL policy under different vision ranges. The graphs, from left to right, show the system’s status at time-steps 0, 5, 10, 15, and 20. Each row, from top to bottom, represents the evolution under vision ranges of 0.6, 0.8, 1.0, and 1.2. For clarity, the initial positions of landmarks are set in a gridlike formation, while in other cases, they are randomly initialized. Smaller dots represent the landmarks; larger dots represent the moving agents; and black lines represent the communication networks. (a) Vision range 0.6, (b) vision range 0.8, (c) vision range 1.0, and (d) vision range 1.2.

We begin by analyzing the evolution of the system over time. Initially, the agents are randomly dispersed across the environment for all scenarios at varying distances from the landmarks. At the early stage, the communication network is notably dense, reflecting our earlier observation that denser networks facilitate information sharing among agents, aiding them in coordinating their efforts and commencing their tasks. From time-steps 5 to 15, a gradual thinning of the communication network is observed, with agents progressively moving closer to their landmarks. This trend corroborates our previous finding that communication networks become sparser over time, effectively balancing the dual objectives of task completion and conservation of communication resources. From time-steps 15 to 20, the communication network is almost empty for all scenarios, indicating that most agents have either reached or are nearing their landmarks, with only a few agents continuing their journey toward the remaining landmarks. Finally, by time-step 20, the snapshot reveals all agents successfully occupying their landmarks, rendering the communication network unnecessary.

We also examined the system’s evolution across scenarios with varying vision ranges. A clear trend emerges: as the vision range increases, making the task progressively easier, the communication network among agents becomes sparser at every time-step. This trend corroborates our previous statistical analysis findings. Agents with wider vision ranges can more effectively gather information from their surroundings, enhancing their ability to coordinate with others and accomplish tasks independently. Consequently, the reliance on communication decreases since rich communication becomes unnecessary and incurs additional resource costs. For example, the system with a 0.6 vision range maintains a relatively dense communication network up to time-step 10, whereas systems with 1.0 and 1.2 vision ranges exhibit sparse communication networks from the outset and completely disable communication after time-step 10.

5 Conclusion

We introduce VAE-RL, a novel approach for optimizing multiagent communication networks. Traditional DRL struggles with the exponentially large action space of network topologies. VAE-RL addresses this by using a VAE to transform the discrete action space into a manageable continuous latent space. A DRL algorithm for continuous action spaces (DDPG) then learns policies in this latent space, and the chosen latent actions are decoded back into network topologies. Tested on a modified OpenAI particle environment [27], VAE-RL outperforms baseline methods such as flat-RL and BDQN in both small (4 agents) and large (10 agents) systems, across homogeneous and heterogeneous scenarios. Our analysis reveals insightful trends in VAE-RL’s behavior.
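To show how these pieces fit together at decision time, the sketch below is our own illustration (the actor and decoder objects and their predict methods are hypothetical stand-ins, not the authors’ released code): the DDPG actor selects a continuous latent action, and the VAE decoder maps it back to a communication network.

    import numpy as np

    def select_network(actor, decoder, observation, n_agents=10, threshold=0.5):
        """One VAE-RL decision step (illustrative sketch under stated assumptions):
        the DDPG actor acts in the continuous latent space learned by the VAE, and
        the VAE decoder maps the chosen latent vector back to a network topology."""
        z = actor.predict(observation)                    # continuous latent action (hypothetical API)
        link_probs = np.ravel(decoder.predict(z))         # assumed: one probability per possible link
        links = (link_probs > threshold).astype(int)      # discretize the decoded links
        adj = np.zeros((n_agents, n_agents), dtype=int)
        adj[np.triu_indices(n_agents, k=1)] = links       # fill the upper triangle (N*(N-1)/2 entries)
        return adj + adj.T                                # symmetric communication network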

While our environment is primarily modeled after the communication networks of multirobotic systems in the real world, it can also find parallels in other real-world situations. For instance, consider a corporate management system where a department comprises multiple employees, including at least one manager. These employees collaborate to address their respective tasks. However, variations in talent and work experience among the employees lead to differences in task completion performance. In such scenarios, the manager is tasked with making decisions on how to establish connections among employees to facilitate collective task completion, especially for those employees with comparatively weaker abilities. Optimizing departmental performance by strategically assigning connection networks among employees is a critical responsibility of department managers. This scenario closely mirrors the communication network assignment strategy learned by VAE-RL. Notably, our method operates autonomously and is adaptable to systems featuring heterogeneous workers, making it versatile and applicable across various contexts.

The VAE-RL framework holds significant potential for application in numerous real-world multiagent systems, particularly those reliant on network-based information sharing and interaction processes. However, it has its limitations. A fundamental assumption underpinning VAE-RL is the strong authority of the manager, responsible for assigning network structures to multiagent systems. This framework presupposes that agents will comply with the manager’s commands, a scenario that may not always hold true in real-life situations. Multiagent systems can be broadly classified into three categories: human-only systems, human–AI systems, and AI-only systems. In the last two categories, the assumption of compliance is more likely to be valid, given the predominance of AI agents programmed to adhere to managerial decisions, except in cases of technical failures. However, in systems with a higher proportion of human agents, this assumption may not always be correct. Even in scenarios where the manager represents high-authority entities such as governments, military, or corporations, human agents may sometimes exhibit rebellious behavior, as evidenced by instances of noncompliance with social distancing and masking policies during the COVID-19 pandemic [76,77]. Therefore, while our framework is highly effective in AI-dominated systems, its applicability may be limited in scenarios involving a majority of human agents due to the potential for reduced managerial authority.

Additionally, the proposed VAE-RL framework exhibits a scalability limitation. While we have demonstrated that the framework can effectively manage networks up to 10 nodes and potentially adapt to larger network scenarios, it may struggle with the computational demands posed by real-world applications such as transportation, social media, and human interaction networks, which often comprise thousands of nodes. Although our framework significantly reduces the complexity of the action space in networks, it still faces challenges related to computational expenses. Scaling up engineering management techniques for extensive networks remains an unresolved issue.

However, our framework can be extended to larger networks by introducing a multihierarchical structure [25]. The general process may involve dividing the network into several subnetworks using clustering algorithms. A first-layer manager then allocates communication resources, treating each subnetwork as a node, and subsequent layers of management distribute these resources to agents within the subnetworks. As the network size increases, this hierarchical structure can be expanded further. This approach introduces new challenges, including the efficiency of the network clustering algorithms and the design of super-networks that integrate the subnetworks as nodes. Despite these difficulties, this method holds promise for scaling our framework to multiagent systems with significantly larger networks.
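A minimal sketch of this hierarchical decomposition is given below (our own illustration under the stated assumptions; greedy modularity clustering is one possible choice, not a prescribed component of the framework).

    import networkx as nx

    def build_hierarchy(interaction_graph: nx.Graph):
        """Sketch of the proposed hierarchical extension: cluster agents into
        subnetworks, then build a super-network whose nodes are the clusters
        for the first-layer manager."""
        # 1) Cluster agents into subnetworks (greedy modularity is one option)
        communities = nx.algorithms.community.greedy_modularity_communities(interaction_graph)
        # 2) The first-layer manager sees one node per subnetwork
        super_net = nx.Graph()
        super_net.add_nodes_from(range(len(communities)))
        membership = {a: i for i, c in enumerate(communities) for a in c}
        for u, v in interaction_graph.edges():
            cu, cv = membership[u], membership[v]
            if cu != cv:
                super_net.add_edge(cu, cv)
        # 3) Lower-layer managers then allocate links within each community
        return communities, super_net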

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

References

1. Sony, M., and Naik, S., 2020, "Industry 4.0 Integration With Socio-Technical Systems Theory: A Systematic Review and Proposed Theoretical Model," Technol. Soc., 61, p. 101248.
2. Liao, H.-Y., Chen, Y., Hu, B., and Behdad, S., 2023, "Optimization-Based Disassembly Sequence Planning Under Uncertainty for Human–Robot Collaboration," ASME J. Mech. Des., 145(2), p. 022001.
3. Poudel, L., Zhou, W., and Sha, Z., 2021, "Resource-Constrained Scheduling for Multi-Robot Cooperative Three-Dimensional Printing," ASME J. Mech. Des., 143(7), p. 072002.
4. Poudel, L., Elagandula, S., Zhou, W., and Sha, Z., 2023, "Decentralized and Centralized Planning for Multi-Robot Additive Manufacturing," ASME J. Mech. Des., 145(1), p. 012003.
5. Giannakis, M., and Louis, M., 2011, "A Multi-Agent Based Framework for Supply Chain Risk Management," J. Purchas. Supply Manage., 17(1), pp. 23–31.
6. Govindan, K., Fattahi, M., and Keyvanshokooh, E., 2017, "Supply Chain Network Design Under Uncertainty: A Comprehensive Review and Future Research Directions," Eur. J. Oper. Res., 263(1), pp. 108–141.
7. Zhou, T., Tang, D., Zhu, H., and Zhang, Z., 2021, "Multi-Agent Reinforcement Learning for Online Scheduling in Smart Factories," Rob. Comput. Integr. Manuf., 72, p. 102202.
8. Su, R., Zhang, D., Venkatesan, R., Gong, Z., Li, C., Ding, F., Jiang, F., and Zhu, Z., 2019, "Resource Allocation for Network Slicing in 5G Telecommunication Networks: A Survey of Principles and Models," IEEE Network, 33(6), pp. 172–179.
9. Gyory, J. T., Stump, G., Nolte, H., and Cagan, J., 2024, "Adaptation Through Communication: Assessing Human–Artificial Intelligence Partnership for the Design of Complex Engineering Systems," ASME J. Mech. Des., 146(8), p. 081401.
10. Chaudhari, A. M., Gralla, E. L., Szajnfarber, Z., and Panchal, J. H., 2022, "Co-Evolution of Communication and System Performance in Engineering Systems Design: A Stochastic Network-Behavior Dynamics Model," ASME J. Mech. Des., 144(4), p. 041706.
11. Chamseddine, I. M., and Kokkolaras, M., 2017, "Bio-Inspired Heuristic Network Configuration in Air Transportation System-of-Systems Design Optimization," ASME J. Mech. Des., 139(8), p. 081401.
12. Marwaha, G., and Kokkolaras, M., 2015, "System-of-Systems Approach to Air Transportation Design Using Nested Optimization and Direct Search," Struct. Multidiscipl. Optim., 51(4), pp. 885–901.
13. Zhang, K., He, F., Zhang, Z., Lin, X., and Li, M., 2020, "Multi-Vehicle Routing Problems With Soft Time Windows: A Multi-Agent Reinforcement Learning Approach," Transp. Res. C: Emerg. Technol., 121, p. 102861.
14. Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C., 2023, "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework," arXiv:2308.08155.
15. Lorè, N., and Heydari, B., 2023, "Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing," arXiv:2309.05898.
16. Chen, Q., Ilami, S., Lore, N., and Heydari, B., 2024, "Instigating Cooperation Among LLM Agents Using Adaptive Information Modulation," arXiv:2409.10372.
17. Karcanias, N., and Hessami, A. G., 2010, "Complexity and the Notion of System of Systems: Part (I): General Systems and Complexity," 2010 World Automation Congress, Kobe, Japan, Sept. 19–23, IEEE, pp. 1–7.
18. Twu, P., Mostofi, Y., and Egerstedt, M., 2014, "A Measure of Heterogeneity in Multi-Agent Systems," 2014 American Control Conference, Portland, OR, June 4–6, IEEE, pp. 3972–3977.
19. Jaakkola, T., Singh, S., and Jordan, M., 1994, "Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems," NIPS'94: Proceedings of the 8th International Conference on Neural Information Processing Systems, Denver, CO, Jan. 1.
20. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D., 2015, "Continuous Control With Deep Reinforcement Learning," arXiv:1509.02971.
21. Chen, Q., Heydari, B., and Moghaddam, M., 2021, "Leveraging Task Modularity in Reinforcement Learning for Adaptable Industry 4.0 Automation," ASME J. Mech. Des., 143(7), p. 071701.
22. Liang, Y., Machado, M. C., Talvitie, E., and Bowling, M., 2015, "State of the Art Control of Atari Games Using Shallow Reinforcement Learning," arXiv:1512.01563.
23. Shao, K., Tang, Z., Zhu, Y., Li, N., and Zhao, D., 2019, "A Survey of Deep Reinforcement Learning in Video Games," arXiv:1912.10944.
24. Andriotis, C. P., and Papakonstantinou, K. G., 2019, "Managing Engineering Systems With Large State and Action Spaces Through Deep Reinforcement Learning," Reliab. Eng. Syst. Saf., 191, p. 106483.
25. Chen, Q., and Heydari, B., 2024, "The SoS Conductor: Orchestrating Resources With Iterative Agent-Based Reinforcement Learning," Syst. Eng., 27, pp. 715–727.
26. Ororbia, M. E., and Warn, G. P., 2024, "Discrete Structural Design Synthesis: A Hierarchical-Inspired Deep Reinforcement Learning Approach Considering Topological and Parametric Actions," ASME J. Mech. Des., 146(9), p. 091707.
27. Chen, Q., and Heydari, B., 2022, "Dynamic Resource Allocation in Systems-of-Systems Using a Heuristic-Based Interpretable Deep Reinforcement Learning," ASME J. Mech. Des., 144(9), p. 091711.
28. Kingma, D. P., and Welling, M., 2013, "Auto-Encoding Variational Bayes," arXiv:1312.6114.
29. Matheron, G., Perrin, N., and Sigaud, O., 2019, "The Problem With DDPG: Understanding Failures in Deterministic Environments With Sparse Rewards," arXiv:1911.11679.
30. Dasagi, V., Bruce, J., Peynot, T., and Leitner, J., 2019, "Ctrl-Z: Recovering From Instability in Reinforcement Learning," arXiv:1910.03732.
31. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I., 2017, "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments," arXiv:1706.02275.
32. Nakamura, R., Kawahara, R., Wakayama, T., and Harada, S., 2022, "Virtual Network Control for Power Bills Reduction and Network Stability," IEEE Trans. Netw. Serv. Manage., 19(4), pp. 4338–4349.
33. Bijami, E., and Farsangi, M., 2019, "A Distributed Control Framework and Delay-Dependent Stability Analysis for Large-Scale Networked Control Systems With Non-Ideal Communication Network," Trans. Inst. Meas. Control, 41(3), pp. 768–779.
34. Chu, T., Wang, J., Codecà, L., and Li, Z., 2019, "Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control," IEEE Trans. Intell. Transp. Syst., 21(3), pp. 1086–1095.
35. Shibata, K., Jimbo, T., and Matsubara, T., 2021, "Deep Reinforcement Learning of Event-Triggered Communication and Control for Multi-Agent Cooperative Transport," 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, May 30–June 5, pp. 8671–8677.
36. Jackson, M. O., and Wolinsky, A., 2003, "A Strategic Model of Social and Economic Networks," Networks and Groups: Models of Strategic Formation, Springer Nature, Berlin, Germany, pp. 23–49.
37. Heydari, B., Mosleh, M., and Dalili, K., 2015, "Efficient Network Structures With Separable Heterogeneous Connection Costs," Econ. Lett., 134, pp. 82–85.
38. Mosleh, M., Ludlow, P., and Heydari, B., 2016, "Resource Allocation Through Network Architecture in Systems of Systems: A Complex Networks Framework," 2016 Annual IEEE Systems Conference (SysCon), Orlando, FL, Apr. 18–21, IEEE, pp. 1–5.
39. Mosleh, M., Ludlow, P., and Heydari, B., 2016, "Distributed Resource Management in Systems of Systems: An Architecture Perspective," Syst. Eng., 19(4), pp. 362–374.
40. Cao, Q., and Heydari, B., 2022, "Micro-Level Social Structures and the Success of COVID-19 National Policies," Nat. Comput. Sci., 2(9), pp. 595–604.
41. Ke, L., and Heydari, B., 2021, "Airbnb and Neighborhood Crime: The Incursion of Tourists or the Erosion of Local Social Dynamics?" PLoS One, 16(7), p. e0253315.
42. Heydari, B., Ergun, O., Dyal-Chand, R., and Bart, Y., 2023, Reengineering the Sharing Economy: Design, Policy, and Regulation, Cambridge University Press, Cambridge, UK.
43. Mazzitello, K., Jiang, Y., and Arizmendi, C., 2020, "Optimising SARS-CoV-2 Pooled Testing Strategies on Social Networks for Low-Resource Settings," J. Phys. A: Math. Theor., 54, p. 294002.
44. Robins, G., Lusher, D., Broccatelli, C., Bright, D., Gallagher, C., Karkavandi, M. A., Matous, P., et al., 2023, "Multilevel Network Interventions: Goals, Actions, and Outcomes," Social Networks, 72, pp. 108–120.
45. Siciliano, M. D., and Whetsell, T. A., 2021, "Strategies of Network Intervention: A Pragmatic Approach to Policy Implementation and Public Problem Resolution Through Network Science," arXiv:2109.08197.
46. Ion, A., and Pătraşcu, M., 2019, "A Scalable Algorithm for Self-Organization in Event-Triggered Networked Control Systems," 2019 18th European Control Conference (ECC), Naples, Italy, June 25–28, pp. 2725–2730.
47. Ellis, J., Vassilev, I., James, E., and Rogers, A., 2020, "Implementing a Social Network Intervention: Can the Context for Its Workability Be Created? A Quasi-Ethnographic Study," Implementation Sci. Commun., 1, pp. 1–11.
48. Joseph, K., Chen, H.-Y. W., Ionescu, S., Du, Y., Sankhe, P., Hannak, A., and Rudra, A., 2022, "A Qualitative, Network-Centric Method for Modeling Socio-Technical Systems, With Applications to Evaluating Interventions on Social Media Platforms to Increase Social Equality," Appl. Network Sci., 7(1), p. 49.
49. Nowak, M. A., 2006, "Five Rules for the Evolution of Cooperation," Science, 314(5805), pp. 1560–1563.
50. Gianetto, D. A., and Heydari, B., 2015, "Network Modularity Is Essential for Evolution of Cooperation Under Uncertainty," Sci. Rep., 5(1), pp. 1–7.
51. Gianetto, D. A., and Heydari, B., 2013, "Catalysts of Cooperation in System of Systems: The Role of Diversity and Network Structure," IEEE Syst. J., 9(1), pp. 303–311.
52. Gianetto, D. A., and Heydari, B., 2016, "Sparse Cliques Trump Scale-Free Networks in Coordination and Competition," Sci. Rep., 6(1), pp. 1–11.
53. Mosleh, M., and Heydari, B., 2017, "Fair Topologies: Community Structures and Network Hubs Drive Emergence of Fairness Norms," Sci. Rep., 7(1), pp. 1–9.
54. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y., 2020, "Generative Adversarial Networks," Commun. ACM, 63(11), pp. 139–144.
55. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., and Bharath, A. A., 2018, "Generative Adversarial Networks: An Overview," IEEE Signal Process Mag., 35(1), pp. 53–65.
56. Kumari, G., and Sowjanya, A., 2022, "An Integrated Single Framework for Text, Image and Voice for Sentiment Mining of Social Media Posts," Revue Intell. Artif., 36(3), pp. 381–386.
57. Tripti, S., Anand, N., Gaurav, K., and Kapoor, R., 2022, "Image Captioning Generator Text-to-Speech," Int. J. Next-Gener. Comput., 13(3), p. 449.
58. Lakhotia, K., Kharitonov, E., Hsu, W.-N., Adi, Y., Polyak, A., Bolte, B., Nguyen, T., et al., 2021, "On Generative Spoken Language Modeling From Raw Audio," Trans. Assoc. Comput. Ling., 9, pp. 1336–1354.
59. de Rosa, G. H., and Papa, J. P., 2021, "A Survey on Text Generation Using Generative Adversarial Networks," Pattern Recognit., 119, p. 108098.
60. Ray, P. P., 2023, "ChatGPT: A Comprehensive Review on Background, Applications, Key Challenges, Bias, Ethics, Limitations and Future Scope," Internet Things Cyber-Phys. Syst., 3, pp. 121–154.
61. Kipf, T. N., and Welling, M., 2016, "Variational Graph Auto-Encoders," arXiv:1611.07308.
62. Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P., 2018, "Learning Deep Generative Models of Graphs," arXiv:1803.03324.
63. Li, Y., Pei, J., and Lai, L., 2021, "Structure-Based De Novo Drug Design Using 3D Deep Generative Models," Chem. Sci., 12(41), pp. 13664–13675.
64. Zhao, Y., Siriwardane, E. M. D., and Hu, J., 2021, "Physics Guided Deep Learning Generative Models for Crystal Materials Discovery," arXiv:2112.03528.
65. Singh, K. V., Verma, A. K., and Vig, L., 2021, "Deep Learning Based Network Similarity for Model Selection," Data Sci., 4(2), pp. 63–83.
66. Guo, X., and Zhao, L., 2022, "A Systematic Survey on Deep Generative Models for Graph Generation," IEEE Trans. Pattern Anal. Mach. Intell., 45(5), pp. 5370–5390.
67. Cheng, Y., Gong, Y., Liu, Y., Song, B., and Zou, Q., 2021, "Molecular Design in Drug Discovery: A Comprehensive Review of Deep Generative Models," Briefings Bioinf., 22(6), p. bbab344.
68. Navidan, H., Moshiri, P. F., Nabati, M., Shahbazian, R., Ghorashi, S. A., Shah-Mansouri, V., and Windridge, D., 2021, "Generative Adversarial Networks (GANs) in Networking: A Comprehensive Survey & Evaluation," Comput. Networks, 194, p. 108149.
69. Sutton, R. S., and Barto, A. G., 2018, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA.
70. Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y., 2000, "Policy Gradient Methods for Reinforcement Learning With Function Approximation," NIPS'99: Proceedings of the 13th International Conference on Neural Information Processing Systems, Denver, CO, Nov. 29–Dec. 4, pp. 1057–1063.
71. Tavakoli, A., Pardo, F., and Kormushev, P., 2018, "Action Branching Architectures for Deep Reinforcement Learning," AAAI'18: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, Feb. 2–7.
72. Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N., 2016, "Dueling Network Architectures for Deep Reinforcement Learning," ICML'16: Proceedings of the 33rd International Conference on Machine Learning, New York, NY, June 19–24, PMLR, pp. 1995–2003.
73. Erdős, P., and Rényi, A., 1959, "On Random Graphs I," Publ. Math. Debrecen, 6, pp. 290–297.
74. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., et al., 2015, "Human-Level Control Through Deep Reinforcement Learning," Nature, 518(7540), pp. 529–533.
75. Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M., 2020, "Graph Neural Networks: A Review of Methods and Applications," AI Open, 1, pp. 57–81.
76. He, L., He, C., Reynolds, T. L., Bai, Q., Huang, Y., Li, C., Zheng, K., and Chen, Y., 2021, "Why Do People Oppose Mask Wearing? A Comprehensive Analysis of US Tweets During the COVID-19 Pandemic," J. Am. Med. Inf. Assoc., 28(7), pp. 1564–1573.
77. van der Zwet, K., Barros, A. I., van Engers, T. M., and Sloot, P. M., 2022, "Emergence of Protests During the COVID-19 Pandemic: Quantitative Models to Explore the Contributions of Societal Conditions," Humanit. Soc. Sci. Commun., 9(1), p. 68.