
How to Organize your Deep Reinforcement Learning Agents: The Importance of Communication Topology

arXiv:1811.12556v1 [cs.LG] 30 Nov 2018. Preprint. Work in progress.

Dhaval Adjodah (MIT Media Lab, dval@mit.edu), Dan Calacci (MIT Media Lab, dcalacci@mit.edu), Abhimanyu Dubey (MIT Media Lab, dubeya@mit.edu), Esteban Moro (Universidad Carlos III de Madrid, emoro@mit.edu), Peter Krafft (MIT Media Lab, pkrafft@mit.edu), Alex 'Sandy' Pentland (MIT Media Lab, pentland@mit.edu)

Abstract

In this empirical paper, we investigate how learning agents can be arranged in more efficient communication topologies for improved learning. This is an important problem because a common technique to improve speed and robustness of learning in deep reinforcement learning (DRL) and many other machine learning algorithms is to run multiple learning agents in parallel. The standard communication architecture typically involves all agents intermittently communicating with each other (fully-connected topology) or with a centralized server (star topology). Unfortunately, optimizing the topology of communication over the space of all possible graphs is a hard problem, so we borrow results from the networked optimization and collective intelligence literatures which suggest that certain families of network topologies can lead to strong improvements over fully-connected networks. We start by introducing alternative network topologies to DRL benchmark tasks under the Evolution Strategies (ES) paradigm, which we call Network Evolution Strategies (NetES). We explore the relative performance of the four main graph families and observe that one such family (Erdos-Renyi random graphs) empirically outperforms all other families, including the de facto fully-connected communication topologies. Additionally, the use of alternative network topologies has a multiplicative performance effect: we observe that when 1000 learning agents are arranged in a carefully designed communication topology, they can compete with 3000 agents arranged in the de facto fully-connected topology. Overall, our work suggests that distributed machine learning algorithms would learn more efficiently if the communication topology between learning agents was optimized.

1 Introduction

In distributed algorithms there is an implicit communication network between processing units. This network passes information such as data, parameters, or rewards between processors. The two network structures that are almost invariably used in modern distributed machine learning are either a complete network, in which all processors communicate with each other, or a star network, in which all processors communicate with a single hub server (in effect, a more efficient, centralized implementation of the complete network). In this work, we empirically investigate whether using alternative communication topologies between processors could improve learning performance in the context of deep reinforcement learning (DRL). Optimizing the communication topology between agents is a hard problem, as it involves searching over the space of all possible graphs to find a communication network that performs optimally for the learning objective under consideration.
We therefore borrow results from the literatures of networked optimization (optimization over networks of agents with local rewards) [16] and collective intelligence (the study of how agents learn, influence and collaborate with each other) [27], which suggest that certain families of network topologies can lead to strong improvements over fully-connected networks. To the best of our knowledge, almost no prior work has investigated how the topology of communication between agents affects learning performance in distributed DRL. Given that network effects tend to be significant only with large numbers of agents, we choose to build upon one of the DRL algorithms most oriented towards parallelizability and scalability: Evolution Strategies [19, 22, 25, 21]. We introduce Network Evolution Strategies (NetES), a networked decentralized variant of ES. NetES, like many DRL algorithms and evolutionary methods, relies on aggregating the rewards from a population of processors that search in parameter space to optimize a single global parameter set. Using NetES, we explore how the communication topology of a population of processors affects learning performance.

Our key findings and contributions are as follows:

(1) We introduce the notion of communication network topologies to the ES paradigm for DRL tasks.
(2) We run controls on all modifications to the ES algorithm to make sure that any improvements we see come exclusively from using alternative topologies.
(3) We compare the learning performance of the main topological families of communication graphs, and observe that one family (Erdos-Renyi graphs) does best.
(4) Using an optimized Erdos-Renyi graph, we evaluate NetES on five difficult DRL benchmarks and find large improvements compared to using a fully-connected communication topology. We observe that our 1000-agent Erdos-Renyi graph can compete with 3000 fully-connected agents.
(5) We provide some theoretical insights into why alternative topologies might outperform a fully-connected communication topology.

2 Related Work

Running parallel (and sometimes asynchronous) agents is very common in modern deep reinforcement learning (DRL). For example, the Gorila framework [14] collects experiences in parallel from many agents and pools them into a global memory store on a distributed database. A3C [12] runs many agents asynchronously on several environment instances while varying their exploration policies. This effectively increases exploration diversity in parameter space and de-correlates agent learning. Until recently, this distributed approach caused heavy communication bottlenecks and limited the number of agents that could be run in parallel. Black-box optimization algorithms such as Evolution Strategies (ES) [19, 22, 25] are able to overcome such communication bottlenecks and are capable of running thousands of parallel agents while simultaneously achieving competitive performance [21]. The baseline ES algorithm has been extended by various subsequent works, such as CMA-ES [1], which also updates the covariance matrix of the Gaussian distribution. However, in all the approaches described above, agents are organized in a de facto fully-connected centralized network topology: the algorithm uses and updates only one global-level parameter set using information available from all agents at every step.
There is significant evidence from the decentralized optimization literature [16, 17, 15] that the network structure of communication between nodes significantly affects the convergence rate and accuracy of learning. Similarly, in the collective intelligence literature, alternative network structures have been shown to result in increased exploration, higher overall maximum reward, and higher diversity of solutions in both simulated high-dimensional optimization [10] and human experiments [3]. We know of only one piece of prior work that has examined network topology in distributed machine learning [11], but network topology was only an aside in that work, which therefore presented little understanding or motivation for its brief investigation into the effect. Another recent piece of work examines the use of periodic broadcasting of successful parameter settings in deep learning but does not leverage complex network topologies [9].

3 Approach

3.1 Evolution Strategies for Deep RL

We begin with a brief overview of the application of the Evolution Strategies (ES) [22] approach to deep reinforcement learning, as done in [21]. Evolution Strategies is a class of techniques that solve optimization problems using a derivative-free parameter update approach. The algorithm proceeds by selecting a fixed model, initialized with a set of weights θ (whose distribution p_φ is parameterized by φ), and an objective (reward) function R(·). The ES algorithm then maximizes the average objective value E_{θ∼p_φ} R(θ), which is optimized with stochastic gradient ascent. The score function estimator for ∇_φ E_{θ∼p_φ} R(θ) is similar to REINFORCE [26], given by

    \nabla_\phi \mathbb{E}_{\theta \sim p_\phi} R(\theta) = \mathbb{E}_{\theta \sim p_\phi} \left[ R(\theta) \nabla_\phi \log p_\phi(\theta) \right].

The update equation used in this algorithm for the parameter θ at any iteration t + 1, for an appropriately chosen learning rate α and noise standard deviation σ, is a discrete approximation to the gradient:

    \theta^{(t+1)} = \theta^{(t)} + \frac{\alpha}{N \sigma^2} \sum_{i=1}^{N} R(\theta^{(t)} + \sigma \epsilon_i^{(t)}) \cdot \sigma \epsilon_i^{(t)}    (1)

This update rule is normally implemented by spawning a collection of N agents at every iteration t, with perturbed versions of θ^{(t)}, i.e. {θ^{(t)} + σε_1^{(t)}, ..., θ^{(t)} + σε_N^{(t)}}, where ε_i^{(t)} ∼ N(0, I). The algorithm then calculates θ^{(t+1)}, which is broadcast again to all agents, and the process is repeated.

3.2 NetES: Networked Evolution Strategies

To maximize parameter exploration diversity, each agent can hold its own parameter θ_i^{(t)} instead of the global (noised) parameter θ^{(t)} given in Equation 1 above. At each time step, an agent looks at the rewards and parameters of its neighbors, which we control using the matrix A = {a_ij}, where a_ij = 1 if agents i and j communicate with each other, and 0 otherwise. A represents the adjacency matrix of connectivity if the agents were connected in a graph-like structure, and therefore fully characterizes the communication topology between agents. Each agent then calculates a gradient by computing a weighted average of the difference vectors between its parameter and that of each of its neighbors, ((θ_j^{(t)} + σε_j^{(t)}) − θ_i^{(t)}), using its neighbors' normalized rewards R(θ_j^{(t)} + σε_j^{(t)}) as weights. This leads to the update rule:

    \theta_i^{(t+1)} = \theta_i^{(t)} + \frac{\alpha}{N \sigma^2} \sum_{j=1}^{N} a_{ij} \cdot R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) \cdot \left( \theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)} \right)    (2)

Consequently, when agents have the same parameter (i.e. θ_i^{(t)} = θ_j^{(t)}) and the network is fully-connected (i.e. a_ij = 1), our update rule reduces to Equation 1.
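For concreteness, the update in Equation 2 can be sketched in a few lines of NumPy. This is only an illustrative sketch (the reward function, the parameter shapes, and the absence of rank normalization and broadcast are simplifications), not the released implementation:

    import numpy as np

    def netes_iteration(thetas, A, reward_fn, alpha, sigma):
        # thetas: (N, d) array, one parameter vector per agent
        # A: (N, N) 0/1 adjacency matrix, a_ij = 1 if agents i and j communicate
        N, d = thetas.shape
        eps = np.random.randn(N, d)                   # agent-wise perturbations epsilon_j
        perturbed = thetas + sigma * eps              # theta_j + sigma * epsilon_j
        rewards = np.array([reward_fn(p) for p in perturbed])
        new_thetas = np.empty_like(thetas)
        for i in range(N):
            diffs = perturbed - thetas[i]             # (theta_j + sigma*eps_j) - theta_i
            weighted = A[i][:, None] * rewards[:, None] * diffs
            new_thetas[i] = thetas[i] + alpha / (N * sigma ** 2) * weighted.sum(axis=0)
        return new_thetas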
In summary, we can interpret the form of Equation 1 as an average of the perturbations σε_i^{(t)} weighted by reward, such that ES corresponds to a kind of consensus-by-averaging algorithm [20]. Equation 2 is motivated by extension as corresponding to exactly the same weighted average, but averaging the differences of agent i's neighbors' perturbed positions, (θ_j^{(t)} + σε_j^{(t)}), from agent i's starting position, θ_i^{(t)}.

Given the above modifications to Equation 1 to obtain Equation 2, it is important to note that previous work has shown that the exact form of the update rule does not matter much and that sparser networks are better, as long as the distributed strategy is to find and aggregate the parameters with the highest reward (as opposed to, for example, finding the most common parameters many agents hold) [3]. Therefore, although our update rule is a straightforward extension, we expect our primary insight, that network topology can affect deep reinforcement learning, to still be useful with alternative update rules. Additionally, although Equation 2 is a biased gradient estimate, at least in the short term, it is unclear whether in practice we achieve a biased or an unbiased gradient estimate when marginalizing over time steps between broadcasts. This is because in our full algorithm (see Appendix 1 in the supplementary material) we combine this update rule with a periodic parameter broadcast (as is common in distributed learning algorithms; we will address this in detail in a later section), and every broadcast returns the agents to a consensus position. Empirically, we find that NetES achieves large performance improvements. Future work can better characterize the theoretical properties of NetES and similar networked DRL algorithms using the recently developed tools of calculus on networks.

Selecting a network topology (or adjacency matrix A) in this context is a difficult problem: in addition to the credit assignment and exploration-exploitation dilemmas, directly optimizing for the adjacency matrix that provides the highest expected rewards is non-convex, and would require substantially more computational power as the number of agents N increases. Because almost no prior work has investigated how the topology of communication between agents affects learning performance in DRL, we believe that a starting contribution would be the empirical exploration of well-studied network topologies that are prevalent in modeling how humans and animals learn collectively. We focus on the following main families of network topologies (in addition to the conventional de facto fully-connected topology):

1) Erdos-Renyi Networks: Networks where each edge between any two nodes has a fixed independent probability of being present [6]. They are among the most common graphs in social networks [8], used to define properties that hold for almost all graphs.
2) Scale-Free Networks: Networks whose degree distribution follows a power law [5]. They are extremely common in systems that exhibit preferential attachment [2], such as citation and signaling biological networks.
3) Small-World Networks: Networks where most nodes can be reached from any other node through a small number of neighbors, resulting in the famous 'six degrees of separation' [24].
4) Fully-Connected Networks: Networks where every node is connected to every other node.

Each of these network families can be parametrized by the number of nodes N and their degree distribution.
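All four families are available in standard graph libraries. The following sketch uses networkx; the parameters below (100 nodes, density 0.2, and the scale-free and small-world settings) are illustrative choices intended only to keep the number of links roughly comparable across families, not the settings used in our experiments:

    import networkx as nx

    N, p = 100, 0.2  # number of agents and target average density

    erdos_renyi = nx.erdos_renyi_graph(N, p, seed=1)                    # each edge present with probability p
    scale_free = nx.barabasi_albert_graph(N, int(p * N / 2), seed=1)    # preferential attachment
    small_world = nx.watts_strogatz_graph(N, int(p * N), 0.1, seed=1)   # ring lattice with 10% rewiring
    fully_connected = nx.complete_graph(N)                              # the de facto baseline

    A = nx.to_numpy_array(erdos_renyi)  # adjacency matrix a_ij used in Equation 2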
Erdos-Renyi networks, for example, are parametrized by their average density p, ranging from 0 to 1, where 0 would lead to a completely disconnected graph (no nodes are connected) and 1.0 would lead back to a fully-connected graph. The lower p is, the sparser a network is. Similarly, the degree distribution of scale-free networks is defined by the exponent of the power-law distribution. Because each graph is generated randomly, two graphs with the same parameters will be different if they have different random seeds, even though, on average, they will have the same average degree (and therefore the same number of links).

4 Experimental Procedure and Reproducibility

We evaluate our algorithm on a series of popular benchmarks for deep reinforcement learning tasks, selected from two frameworks: the open-source Roboschool [18] benchmark and the MuJoCo framework [23]. The five benchmark tasks we evaluate on are: Humanoid-v1 (Roboschool and MuJoCo), HalfCheetah-v1 (MuJoCo), Hopper-v1 (MuJoCo) and Ant-v1 (MuJoCo). Our choice of benchmarks is motivated by the inherent difficulty of these walker-based problems.

The code we used to compute the benchmark performances is based on the freely available code from [21]. To maximize reproducibility of our empirical results, we use the standard evaluation metric of collecting the total reward agents obtain during a test-only episode, which we compute periodically during training [13, 4, 21]. Specifically, with a probability of 0.08, we pause training, take the parameters of the best agent, run this parameter set (without added noise perturbation) for 1000 episodes, and take the average total reward over all episodes, the exact same evaluation procedure as in [21]. After evaluation, training is resumed with the same pre-evaluation parameters (i.e. evaluation does not change training parameters). When training eventually stabilizes to a maximum 'flat' line, we record the maximum of the evaluation performance values (averaged over all episodes) during this 'flat' period as our recorded performance for this particular experimental run. As such, the training performance (as shown in Fig. 4) will be slightly lower than the corresponding maximum evaluation performance (as shown in Table 1). We observe this procedure to be quite robust to noise.

Because we are trying to evaluate the performance of communication topologies, we then repeat the former evaluation procedure for different instances of the same network topology (by varying the random seed of network generation, we can create, at the start of each experiment, a different network topology with the same global properties) with the same average density p (i.e. the same average number of links) and the same number of nodes N. Since each node runs the same number of episode time steps per iteration, different networks with the same p can be fairly compared. We then report the average performance over 6 runs with 5-95% confidence intervals. We will share the JSON files that fully describe our experiments, including the analysis script that calculates our evaluation metric, in our code release (our code and JSON experiment files can be found at https://github.com/d-val/NetES).

In addition to using the evaluation procedure of [21], we also use their exact same neural network architecture: multilayer perceptrons with two 64-unit hidden layers separated by tanh nonlinearities.
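For reference, such a policy can be written in a handful of lines. Below is a minimal NumPy sketch of the two-hidden-layer tanh perceptron; the observation and action dimensions, the initialization scale, and the linear output layer are illustrative assumptions rather than details taken from the released code:

    import numpy as np

    def init_policy(obs_dim, act_dim, hidden=64, seed=0):
        rng = np.random.RandomState(seed)
        shapes = [(obs_dim, hidden), (hidden,), (hidden, hidden), (hidden,), (hidden, act_dim), (act_dim,)]
        return [0.1 * rng.randn(*s) for s in shapes]  # flat list of weights and biases

    def policy_forward(params, obs):
        W1, b1, W2, b2, W3, b3 = params
        h = np.tanh(obs @ W1 + b1)   # first 64-unit tanh layer
        h = np.tanh(h @ W2 + b2)     # second 64-unit tanh layer
        return h @ W3 + b3           # continuous action output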
We also keep all the modifications to the update rule introduced by [21] to improve performance: (1) training for one complete episode for each iteration; (2) employing antithetic sampling, also known as mirrored sampling [7], where we evaluate both ε_i^{(t)} and −ε_i^{(t)} for every sample ε_i^{(t)} ∼ N(0, I); (3) employing fitness shaping [25] by applying a rank transformation to the returns before computing each parameter update; and (4) applying weight decay to the parameters for regularization. We also use the exact same hyperparameters as the original OpenAI (fully-connected and centralized) implementation [21], varying only the network topology for our experiments.

We also verify that our implementation using alternative network topologies takes approximately the same wall-clock time as when using a fully-connected network (baseline ES): although each iteration using our network takes longer because of increased communication (about 60 seconds for NetES vs. 50 seconds for ES with 1000 agents each), NetES is still superior because our 1000 agents can learn at the same (higher) performance level as 3000 ES agents (which take more than 2 minutes per iteration).

Figure 1: Learning performance on all network families: Erdos-Renyi graphs do best, fully-connected graphs do worst (results from MuJoCo Ant-v1 with small networks of 100 nodes).

Figure 2: Agents with only broadcast do not learn (results from 1000 agents on Roboschool Humanoid-v1).

5 Results

In this section, we present our empirical results as per the experimental procedure described in the previous section. We first present the results showing how NetES running alternative topologies outperforms ES running fully-connected networks; for these results we use networks with 1000 agents (or more). We then run controls (generally using smaller networks of 100 nodes) on all modifications to the ES algorithm to make sure that any improvements we see come exclusively from using alternative topologies. Whenever the performance of alternative topologies is presented, we only compare networks with the same number of nodes and average number of links (defined by the average density) within the same plot so that networks can be fairly compared. Throughout this paper, we use an average network density of 0.2 for all network families and sizes of networks because it is sparse enough to provide good learning performance and consistent (not noisy) empirical results.

Type       | Task           | Fully-connected | Erdos | Improvement %
MuJoCo     | Ant-v1         | 4496            | 4938  | 9.8
MuJoCo     | HalfCheetah-v1 | 1571            | 7014  | 346.3
MuJoCo     | Hopper-v1      | 1506            | 3811  | 153.1
MuJoCo     | Humanoid-v1    | 762             | 6847  | 798.6
Roboschool | Humanoid-v1    | 364             | 429   | 17.9

Table 1: Summary of improvements for Erdos-Renyi networks with 1000 nodes compared to fully-connected networks.

5.1 Empirical performance of different network families

Using the MuJoCo Ant-v1 benchmark task (because it runs the fastest), we run a series of experiments on the different network families we previously introduced: Erdos-Renyi, scale-free, small-world and the conventional fully-connected network.
For network families to be fairly compared, it is important to note that for a given average density, all networks from all topological families have approximately the same number of links (and nodes); only the distribution of links (degree distribution) changes. Because these are exploratory experiments, we choose to run on smaller networks (number of agents N = 100). As can be seen in Fig. 1, Erdos-Renyi outperforms all other network families, and fully-connected networks (the de facto traditional network) perform worst. We present some theoretical insights as to why Erdos-Renyi networks do best in a later section.

5.2 Empirical performance on all benchmarks

Using Erdos-Renyi networks (as they previously performed best compared to other network families), we run larger networks of 1000 agents on all five benchmarks. As can be seen in Table 1, our Erdos-Renyi networks outperform fully-connected networks on all benchmark tasks, resulting in improvements ranging from 9.8% on MuJoCo Ant-v1 to 798.6% on MuJoCo Humanoid-v1. All results are statistically significant (based on 5-95% confidence intervals). We note that the difference in performance between Erdos-Renyi and fully-connected networks is higher for smaller networks (as in Fig. 1 and Fig. 5) compared to larger networks (as in Table 1) for the same benchmark, and we observe this behavior across different benchmarks. We believe that this is because NetES is able to achieve higher performance with fewer agents due to its efficiency of exploration, as supported by our theoretical result in a later section.

5.3 Multiplicative Learning Performance

So far, we have compared alternative network topologies with fully-connected networks containing the same number of agents. In this section, we investigate whether organizing the communication topology using Erdos-Renyi networks can outperform larger fully-connected networks. For maximum reproducibility of our results, we choose one of the benchmarks that has the lowest improvement for 1000 agents, Roboschool Humanoid-v1. As can be seen in Fig. 3 and the training curves (which display the training performance, not the evaluation metric results) in Fig. 4, an Erdos-Renyi network with 1000 agents provides comparable performance to 3000 agents arranged in a fully-connected network. This shows that networks with alternative topologies not only provide improvements over fully-connected networks, but also have a multiplicative effect on performance.

Figure 3: Evaluation results for an Erdos-Renyi graph with 1000 agents compared to fully-connected networks (results from Roboschool Humanoid-v1).

Figure 4: Training performance (not the evaluation metric) for Roboschool Humanoid-v1.

5.4 Control Experiments

To make sure that none of the modifications we implemented in the ES algorithm (to generalize it to use alternative topologies) is causing improvements in performance instead of the use of alternative network topologies itself, we run control experiments on each modification, namely: 1) the use of broadcast, and 2) the fact that each agent/node has a different parameter set.
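As a concrete illustration of the first modification, the periodic broadcast step described in the next subsection can be sketched as follows; the function name and the reward bookkeeping are simplified placeholders, not the released implementation:

    import numpy as np

    def maybe_broadcast(thetas, rewards, p_broadcast, rng=np.random):
        # With probability p_broadcast, replace every agent's parameters with those of
        # the best-performing agent at this iteration; otherwise leave them unchanged.
        if rng.rand() < p_broadcast:
            best = int(np.argmax(rewards))
            thetas = np.repeat(thetas[best][None, :], len(thetas), axis=0)
        return thetas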
5.4.1 Broadcast effect

We implement parameter broadcast as follows: at every iteration, with probability p_b, we replace all agents' current parameters with the parameters of the best-performing agent, and then continue training (as per Equation 2) after that. Even though broadcast is common in machine learning (e.g. 'exploit' in Population-based Training [9] replaces current weights with the weights that have the highest rewards), we want to make sure that broadcast (over different probabilities ranging from 0.0 to 1.0) does not contribute significantly to learning. Therefore, we compare 'disconnected' networks, in which agents can only learn from their own parameter update and from broadcast (they do not see the rewards and parameters of any other agents), with our Erdos-Renyi network and with fully-connected networks of 1000 agents on the Roboschool Humanoid-v1 task. As can be seen in Fig. 2, practically no learning happens with just broadcast. We therefore experimentally verify that broadcast does not drive the learning performance improvement we observe when using alternative topologies. The broadcast probability is treated as a hyperparameter, which we fix to 0.8 in this paper.

5.4.2 Each agent their own parameters

The other change we introduce in NetES is to have each agent hold its own parameter set θ_i^{(t)} instead of a global (noised) parameter θ^{(t)}. We therefore investigate the performance of the following 4 control baselines, all fully-connected ES (as per [21]) with 100 agents running: (1) the same global parameter, no broadcast; (2) the same global parameter, with broadcast; (3) different parameters, with broadcast; (4) different parameters, no broadcast; compared to NetES running an Erdos-Renyi network. As can be seen in Fig. 5, NetES does better than all 4 other control baselines on MuJoCo Ant-v1.

Figure 5: NetES using an Erdos-Renyi graph does significantly better than all 4 other control baselines (results from MuJoCo Ant-v1 with 100 agents).

Figure 6: We generate large instances of networks (using N=100) from the three main families of networks, and observe that Erdos-Renyi graphs maximize the diversity of parameter updates.

6 Theoretical Insights

In this section, we present some intuitive theoretical insights into why alternative topologies can do better than fully-connected topologies, and why Erdos-Renyi networks outperform all other network families we have tested. A motivating factor for introducing sparse connectivity and having each agent hold its own parameters (as per Equation 2) is to search the parameter space more effectively, a common motivation in DRL and in optimization in general. One possible heuristic for measuring the capacity to explore the parameter space is the diversity of parameter updates during each iteration, which can be measured by the variance of parameter updates:
Theorem 1. In a NetES update iteration t for a system of N agents with parameters Θ = {θ_1^{(t)}, ..., θ_N^{(t)}}, agent communication matrix A = {a_ij}, agent-wise perturbations E = {ε_1^{(t)}, ..., ε_N^{(t)}}, and parameter update

    u_i^{(t)} = \frac{\alpha}{N \sigma^2} \sum_{j=1}^{N} a_{ij} \cdot R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) \cdot \left( (\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) - \theta_i^{(t)} \right)

as per Equation 2, the following relation holds:

    \mathrm{Var}_i[u_i^{(t)}] \le \frac{\max^2 R(\cdot)}{N \sigma^4} \left\{ \frac{\|A^2\|_F}{(\min_l |A_l|)^2} \cdot f(\Theta, E) - \left( \frac{\min_l |A_l|}{\max_l |A_l|} \right)^2 \cdot \frac{\sigma^2}{N} \sum_{i,j} \epsilon_i^{(t)} \epsilon_j^{(t)} \right\}    (3)

Here, |A_l| = Σ_j a_jl and f(Θ, E) = ( Σ_{j,k,m} ( (θ_j^{(t)} + σε_j^{(t)} − θ_m^{(t)}) · (θ_k^{(t)} + σε_k^{(t)} − θ_m^{(t)}) )² )^{1/2}.

The proof of Theorem 1 is provided in the supplementary material (Appendix 2). This theoretical upper bound is merely expository; it is not indicative of the worst-case performance, which would require the optimization of a lower bound. We use this theoretical insight to understand the capacity for parameter exploration supplied by any network topology, and not to choose the best network topology (which would require a lower bound). It is also important to note that the quantity in Theorem 1 is not the variance of the value-function gradient, which is typically minimized in reinforcement learning. It is instead the variance in the positions in parameter space of the agents after a step of our algorithm. This quantity is more productively conceptualized as akin to a radius of exploration for a distributed search procedure rather than in its relationship to the variance of the gradient. The challenge is then to maximize the search radius of positions in parameter space to find high-performing parameters. As far as the side effects this might have, given the common wisdom that increasing the variance of the value gradient in single-agent reinforcement learning can slow convergence, it is worth noting that noise (i.e. variance) is often critical for escaping local minima in other algorithms, e.g. via stochasticity in SGD.

By Theorem 1, we see that the diversity of exploration in the parameter updates across agents is affected by two quantities that involve the connectivity matrix A: the first is the term ‖A²‖_F / (min_l |A_l|)² (henceforth referred to as the reachability of the network), which we want to maximize, and the second is (min_l |A_l| / max_l |A_l|)² (henceforth referred to as the homogeneity of the network), which we want to be as small as possible in order to maximize the diversity of parameter updates across agents. Reachability and homogeneity are not independent; both are statistics of the degree distribution of a graph.

Reachability is the ratio of the Frobenius norm of A² (which aggregates the numbers of paths of length two between nodes) to the squared minimum degree of A. The sparser a network, the larger the reachability. For Erdos-Renyi graphs, ‖A²‖_F / (min_l |A_l|)² ≈ (pN)^{−1/2}, where p is the average density of the network (the inverse of sparsity), i.e. the probability of any two nodes being connected. Homogeneity is the squared ratio of the minimum to maximum connectivity of all nodes of A: the higher this value, the more homogeneously connected the graph is. The sparser a network is, the lower its homogeneity. In the case of Erdos-Renyi networks, (min_l |A_l| / max_l |A_l|)² ≈ 1 − 8 √((1 − p)/(Np)) (the proofs and plots for Erdos-Renyi graphs are provided in the supplementary material). Using the above definitions for reachability and homogeneity, we generate random graphs of each network family and plot them in Fig. 6.
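Both statistics are straightforward to compute for sampled graphs. A sketch using networkx and NumPy is given below; the sizes and densities are illustrative, and ‖A²‖_F is taken here as the literal Frobenius norm that appears in Theorem 1:

    import networkx as nx
    import numpy as np

    def reachability_and_homogeneity(G):
        A = nx.to_numpy_array(G)
        degrees = A.sum(axis=0)                                      # |A_l| = sum_j a_jl
        reach = np.linalg.norm(A @ A, 'fro') / degrees.min() ** 2    # reachability (to be maximized)
        homog = (degrees.min() / degrees.max()) ** 2                 # homogeneity (to be minimized)
        return reach, homog

    N, p = 100, 0.2
    graphs = {"erdos-renyi": nx.erdos_renyi_graph(N, p, seed=1),
              "scale-free": nx.barabasi_albert_graph(N, int(p * N / 2), seed=1),
              "small-world": nx.watts_strogatz_graph(N, int(p * N), 0.1, seed=1),
              "fully-connected": nx.complete_graph(N)}
    for name, G in graphs.items():
        print(name, reachability_and_homogeneity(G))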
Two main observations can be made from this result: (1) Erdos-Renyi networks maximize reachability and minimize homogeneity, which means that they maximize the diversity of parameter exploration. (2) Fully-connected networks (which are the de facto communication network used for distributed learning) are the single worst network in terms of exploration diversity (they minimize reachability and maximize homogeneity, the opposite of what would be required for maximizing parameter exploration). We find that this theoretical result is in accordance with our empirical results: Erdos-Renyi networks perform best, followed by scale-free networks, while fully-connected networks do worst.

7 Conclusion

In this work, we extend ES, a DRL algorithm, to run over alternative topologies and empirically show that the conventional de facto fully-connected topology used in almost all machine learning algorithms is sub-optimal. We also run control experiments on all modifications to the ES algorithm and show that improvements come exclusively from the use of alternative topologies. We then provide some theoretical insights as to why that might be. Overall, our work suggests that distributed machine learning algorithms would learn more efficiently if the communication topology between learning agents was optimized. Future work could explore how to learn the network structure itself, how to learn with evolving networks, and the performance of naturally occurring (non-synthetic) network topologies such as networks of autonomous vehicles.

References

[1] Anne Auger and Nikolaus Hansen. A restart CMA evolution strategy with increasing population size. In Evolutionary Computation, 2005. The 2005 IEEE Congress on, volume 2, pages 1769–1776. IEEE, 2005.
[2] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[3] Daniel Barkoczi and Mirta Galesic. Social learning strategies modify the effect of network structure on group performance. Nature Communications, 7, 2016.
[4] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[5] Krzysztof Choromański, Michał Matuszak, and Jacek Miekisz. Scale-free graph with preferential attachment and evolving internal vertex structure. Journal of Statistical Physics, 151(6):1175–1183, 2013.
[6] Paul Erdős and Alfréd Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.
[7] John Geweke. Antithetic acceleration of Monte Carlo integration in Bayesian inference. Journal of Econometrics, 38(1-2):73–89, 1988.
[8] Yannis M Ioannides et al. Random graphs and social networks: An economics perspective. In IUI Conference on Business and Social Networks, Vaxholm, Sweden, June, 2004.
[9] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
[10] David Lazer and Allan Friedman. The network structure of exploration and exploitation. Administrative Science Quarterly, 52(4):667–694, 2007.
[11] Sergio Valcarcel Macua, Aleksi Tukiainen, Daniel García-Ocaña Hernández, David Baldazo, Enrique Munoz de Cote, and Santiago Zazo. Diff-DAC: Distributed actor-critic for multitask deep reinforcement learning. arXiv preprint arXiv:1710.10363, 2017.
[12] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[14] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
[15] Angelia Nedic. Asynchronous broadcast-based convex optimization over a network. IEEE Transactions on Automatic Control, 56(6):1337–1351, 2011.
[16] Angelia Nedić, Alex Olshevsky, and Michael G Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. arXiv preprint arXiv:1709.08765, 2017.
[17] Angelia Nedic and Asuman Ozdaglar. Cooperative distributed multi-agent optimization. In Convex Optimization in Signal Processing and Communications, page 340, 2010.
[18] OpenAI. Roboschool. https://github.com/openai/roboschool, 2017. Accessed: 2017-09-30.
[19] Ingo Rechenberg. Evolution strategy: Optimization of technical systems by means of biological evolution. Fromman-Holzboog, Stuttgart, 104:15–16, 1973.
[20] Wei Ren. Averaging algorithms and consensus. Encyclopedia of Systems and Control, pages 1–10, 2013.
[21] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
[22] Hans-Paul Schwefel. Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie: mit einer vergleichenden Einführung in die Hill-Climbing- und Zufallsstrategie. Birkhäuser, 1977.
[23] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
[24] Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. In Social Networks, pages 179–197. Elsevier, 1977.
[25] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(1):949–980, 2014.
[26] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
[27] David H Wolpert and Kagan Tumer. An introduction to collective intelligence. arXiv preprint cs/9908014, 1999.

8 Appendix 1: Algorithm

Algorithm 1 Networked Evolution Strategies
Input: learning rate α, noise standard deviation σ, initial policy parameters θ_i^{(0)} for i = 1, 2, ..., N (for N workers), adjacency matrix A, global broadcast probability p_b
Initialize: N workers with known random seeds, initial parameters θ_i^{(0)}
for t = 0, 1, 2, ... do
    for each worker i = 1, 2, ..., N do
        Sample ε_i^{(t)} ∼ N(0, I)
        Compute returns R_i = R(θ_i^{(t)} + σε_i^{(t)})
    Sample β^{(t)} ∼ U(0, 1)
    if β^{(t)} < p_b then
        Set θ_i^{(t+1)} ← argmax_{θ_j^{(t)}} R(θ_j^{(t)} + σε_j^{(t)}) for every worker i
    else
        for each worker i = 1, 2, ..., N do
            Set θ_i^{(t+1)} ← θ_i^{(t)} + (α / (Nσ²)) Σ_{j=1}^{N} a_ij · R(θ_j^{(t)} + σε_j^{(t)}) · (θ_j^{(t)} + σε_j^{(t)} − θ_i^{(t)})

9 Appendix 2: Diversity of Parameter Updates

Here we provide the proof of Theorem 1 from the main paper concerning the diversity of the parameter updates.

Theorem 1 (restated). In a multi-agent evolution strategies update iteration t for a system of N agents with parameters Θ = {θ_1^{(t)}, ..., θ_N^{(t)}}, agent communication matrix A = {a_ij}, agent-wise perturbations E = {ε_1^{(t)}, ..., ε_N^{(t)}}, and parameter update u_i^{(t)} given by the sparsely-connected update rule

    u_i^{(t)} = \frac{\alpha}{N \sigma^2} \sum_{j=1}^{N} a_{ij} \cdot R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) \cdot \left( (\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) - \theta_i^{(t)} \right),

the following relation holds:

    \mathrm{Var}_i[u_i^{(t)}] \le \frac{\max^2 R(\cdot)}{N \sigma^4} \left\{ \frac{\|A^2\|_F}{(\min_l |A_l|)^2} \cdot f(\Theta, E) - \left( \frac{\min_l |A_l|}{\max_l |A_l|} \right)^2 \cdot g(E) \right\}    (4)

Here, |A_l| = Σ_j a_jl, f(Θ, E) = ( Σ_{j,k,m}^{N,N,N} ( (θ_j^{(t)} + σε_j^{(t)} − θ_m^{(t)}) · (θ_k^{(t)} + σε_k^{(t)} − θ_m^{(t)}) )² )^{1/2}, and g(E) = (σ²/N) Σ_{i,j}^{N,N} ε_i^{(t)} ε_j^{(t)}.

Proof. From Equation 2, the update rule is given by:

    u_i^{(t)} = \frac{\alpha}{N \sigma^2} \sum_{j=1}^{N} a_{ij} \cdot R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) \cdot \left( (\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) - \theta_i^{(t)} \right)    (5)

The variance of u_i^{(t)} can be written as:

    \mathrm{Var}_i[u_i^{(t)}] = \mathbb{E}_{i \in A}[(u_i^{(t)})^2] - (\mathbb{E}_{i \in A}[u_i^{(t)}])^2    (6)

Expanding E_{i∈A}[(u_i^{(t)})²]:

    \mathbb{E}_{i \in A}[(u_i^{(t)})^2] = \frac{1}{N} \sum_{i \in A} \left( \frac{\gamma}{N \sigma^2} \sum_{j=1}^{N} a_{ij} \cdot R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \right)^2    (7)

Simplifying:

    = \frac{1}{N \sigma^4} \sum_{i,j,k} \frac{a_{ij} a_{ik}}{|A_i|^2} R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) R(\theta_k^{(t)} + \sigma \epsilon_k^{(t)}) \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \cdot (\theta_k^{(t)} + \sigma \epsilon_k^{(t)} - \theta_i^{(t)})    (8)

Since R(·) ≤ max R(·), therefore:

    \le \frac{\max^2 R(\cdot)}{N \sigma^4} \sum_{i,j,k} \frac{a_{ij} a_{ik}}{|A_i|^2} \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \cdot (\theta_k^{(t)} + \sigma \epsilon_k^{(t)} - \theta_i^{(t)})    (9)

    \le \frac{\max^2 R(\cdot)}{N \sigma^4 (\min_l |A_l|)^2} \sum_{i,j,k} a_{ij} a_{ik} \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \cdot (\theta_k^{(t)} + \sigma \epsilon_k^{(t)} - \theta_i^{(t)})    (10)

By the Cauchy-Schwarz inequality:

    \mathbb{E}_{i \in A}[(u_i^{(t)})^2] \le \frac{\max^2 R(\cdot)}{N \sigma^4 (\min_l |A_l|)^2} \left( \sum_{i,j,k} (a_{ij} a_{ik})^2 \right)^{\frac{1}{2}} \left( \sum_{i,j,k} \left( (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \cdot (\theta_k^{(t)} + \sigma \epsilon_k^{(t)} - \theta_i^{(t)}) \right)^2 \right)^{\frac{1}{2}}    (11)

Since a_ij ∈ {0, 1} for all (i, j), we have (a_ij a_ik)² = a_ij a_ik for all (i, j, k). Additionally, we know that a_ij = a_ji, since A is symmetric. Therefore Σ_i a_ij a_ik = Σ_i a_ji a_ik = (A²)_jk. Using this:

    \mathbb{E}_{i \in A}[(u_i^{(t)})^2] \le \frac{\max^2 R(\cdot)}{N \sigma^4 (\min_l |A_l|)^2} \cdot \|A^2\|_F \cdot \left( \sum_{i,j,k} \left( (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \cdot (\theta_k^{(t)} + \sigma \epsilon_k^{(t)} - \theta_i^{(t)}) \right)^2 \right)^{\frac{1}{2}}    (12)

Replacing ( Σ_{i,j,k} ( (θ_j^{(t)} + σε_j^{(t)} − θ_i^{(t)}) · (θ_k^{(t)} + σε_k^{(t)} − θ_i^{(t)}) )² )^{1/2} = f(Θ, E), where Θ = {θ_i^{(t)}}_{i=1}^N and E = {ε_i^{(t)}}_{i=1}^N, for compactness, we obtain:

    \mathbb{E}_{i \in A}[(u_i^{(t)})^2] \le \frac{\max^2 R(\cdot)}{N \sigma^4} \cdot \frac{\|A^2\|_F}{(\min_l |A_l|)^2} \cdot f(\Theta, E)    (13)

Similarly, the squared expectation of u_i^{(t)} over all agents can be given by:

    (\mathbb{E}_{i \in A}[u_i^{(t)}])^2 = \left( \frac{1}{N} \sum_{i \in A} \frac{\gamma}{N \sigma^2} \sum_{j=1}^{N} a_{ij} \cdot R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \right)^2    (14)

    = \frac{1}{N^2 \sigma^4} \left( \sum_{i \in A} \frac{1}{|A_i|} \sum_{j=1}^{N} a_{ij} \cdot R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \right)^2    (15)

    = \frac{1}{N^2 \sigma^4} \left( \sum_{i,j} \frac{a_{ij}}{|A_i|} \cdot R(\theta_j^{(t)} + \sigma \epsilon_j^{(t)}) \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \right)^2    (16)

Since R(·) ≥ min R(·), therefore:

    \ge \frac{\min^2 R(\cdot)}{N^2 \sigma^4} \left( \sum_{i,j} \frac{a_{ij}}{|A_i|} \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \right)^2    (17)

    \ge \frac{\min^2 R(\cdot)}{N^2 \sigma^4 (\max_l |A_l|)^2} \left( \sum_{i,j} a_{ij} \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) \right)^2    (18)

Since A is symmetric, Σ_{i,j} a_ij · (θ_j^{(t)} + σε_j^{(t)} − θ_i^{(t)}) = Σ_{i,j} a_ij · (θ_i^{(t)} + σε_i^{(t)} − θ_j^{(t)}). Therefore:

    = \frac{\min^2 R(\cdot)}{N^2 \sigma^4 (\max_l |A_l|)^2} \left( \frac{1}{2} \sum_{i,j} \left[ a_{ij} \cdot (\theta_j^{(t)} + \sigma \epsilon_j^{(t)} - \theta_i^{(t)}) + a_{ij} \cdot (\theta_i^{(t)} + \sigma \epsilon_i^{(t)} - \theta_j^{(t)}) \right] \right)^2    (19)

Therefore,

    (\mathbb{E}_{i \in A}[u_i^{(t)}])^2 \ge \frac{\min^2 R(\cdot)}{N^2 \sigma^2 (\max_l |A_l|)^2} \left( \sum_{i,j} a_{ij} \cdot \frac{\epsilon_j^{(t)} + \epsilon_i^{(t)}}{2} \right)^2    (20)

Using the symmetry of A, we have that Σ_{i,j} a_ij ε_i^{(t)} = Σ_{i,j} a_ij ε_j^{(t)}.
Therefore:

    \ge \frac{\min^2 R(\cdot)}{N^2 \sigma^2 (\max_l |A_l|)^2} \left( \sum_{i,j} a_{ij} \cdot \epsilon_j^{(t)} \right)^2    (21)

    = \frac{\min^2 R(\cdot)}{N^2 \sigma^2 (\max_l |A_l|)^2} \left( \sum_{j} |A_j| \cdot \epsilon_j^{(t)} \right)^2    (22)

    \ge \frac{\min^2 R(\cdot) (\min_l |A_l|)^2}{N^2 \sigma^2 (\max_l |A_l|)^2} \left( \sum_{i,j} \epsilon_i^{(t)} \epsilon_j^{(t)} \right)    (23)

Combining both terms of the variance expression, and using the normalization of the iteration rewards, which ensures min R(·) = −max R(·), we obtain (using g(E) = (σ²/N) Σ_{i,j} ε_i^{(t)} ε_j^{(t)}):

    \mathrm{Var}_{i \in A}[u_i^{(t)}] \le \frac{\max^2 R(\cdot)}{N \sigma^4} \left\{ \frac{\|A^2\|_F}{(\min_l |A_l|)^2} \cdot f(\Theta, E) - \left( \frac{\min_l |A_l|}{\max_l |A_l|} \right)^2 \cdot g(E) \right\}    (24)

10 Appendix 3: Approximating Reachability and Homogeneity for Large Erdos-Renyi Graphs

Recall that an Erdos-Renyi graph is constructed in the following way:

1. Take n nodes.
2. For each pair of nodes, link them with probability p.

Figure 7: Comparison between the values of k_min, ‖A²‖_F, and reachability as a function of p for different realizations of the Erdos-Renyi model (points) and their approximations given in Equations (26), (25) and (27) respectively (lines).

The model is simple, and we can infer the following:

• The average degree of a node is p(n − 1).
• The degree distribution of the nodes is the binomial distribution of n − 1 events with probability p, B(n − 1, p).
• The (average) number of paths of length 2 from a node i to a node j ≠ i, denoted n_ij^(2), can be calculated this way: a path of length two between i and j involves a third node k. Since there are n − 2 such nodes, the maximum number of paths of length 2 between i and j is n − 2. However, for such a path to exist there has to be a link between i and k and between k and j, an event with probability p². Thus, the average number of paths of length 2 between i and j is p²(n − 2).

Estimating Reachability

We can then estimate reachability:

    \mathrm{Reachability} = \frac{\|A^2\|_F}{(\min_l |A_l|)^2} = \frac{\sqrt{\sum_{i,j} n_{ij}^{(2)}}}{k_{\min}^2}

where k_min = min_l |A_l| is the minimum degree in the network. Given the above calculations, we can approximate

    \sum_{i,j} n_{ij}^{(2)} = \sum_{i} n_{ii}^{(2)} + \sum_{i \ne j} n_{ij}^{(2)} \approx n \cdot [p(n-1)] + n(n-1) \cdot [p^2(n-2)]

where the first term is the number of paths of length 2 from i to i summed over all nodes, i.e. the sum of the degrees in the network, and the second term is the sum of p²(n − 2) over the terms in which i ≠ j. For large n we have that

    \sum_{i,j} n_{ij}^{(2)} \approx p^2 n^3

and thus,

    \|A^2\|_F \approx \sqrt{p^2 n^3}.    (25)

For the denominator k_min we could use the distribution of the minimum of the binomial distribution B(n − 1, p). However, since that is a complicated calculation, we can approximate it as follows: since the binomial distribution B(n − 1, p) looks like a Gaussian, we can say that the minimum of the distribution is close to the mean minus two times the standard deviation:

    k_{\min} \approx p(n-1) - 2\sqrt{p(n-1)(1-p)}    (26)

Once again, in the case of large n, we have k_min ≈ pn. Thus

    \mathrm{Reachability} \approx \frac{\sqrt{p^2 n^3}}{\left[ p(n-1) - 2\sqrt{p(n-1)(1-p)} \right]^2}    (27)

As we can see in Figure 7, these approximations work very well for realizations of Erdos-Renyi networks. Assuming that n is large, we can approximate

    \mathrm{Reachability} \approx \frac{p n^{3/2}}{p^2 n^2} = \frac{1}{p n^{1/2}}

Thus reachability decreases with increasing n and p. Note that the density of the Erdos-Renyi graph (the number of links over the number of possible links) is p. Thus, for a fixed n, sparser networks (p ≃ 0) have larger reachability than more connected networks (p ≃ 1).
Estimating Homogeneity

The homogeneity is defined as

    \mathrm{Homogeneity} = \left( \frac{k_{\min}}{k_{\max}} \right)^2

As before, we can approximate

    k_{\max} \approx p(n-1) + 2\sqrt{p(n-1)(1-p)}

and thus

    \mathrm{Homogeneity} \approx \left( \frac{p(n-1) - 2\sqrt{p(n-1)(1-p)}}{p(n-1) + 2\sqrt{p(n-1)(1-p)}} \right)^2

For large p we can approximate it as

    \mathrm{Homogeneity} \approx 1 - 8 \frac{\sqrt{1-p}}{\sqrt{np}}    (28)

which shows that for p ≃ 1, homogeneity grows as a function of p. Thus, for a fixed number of nodes n, increasing p yields larger values of the homogeneity; see Figure 8.

Figure 8: Comparison for the homogeneity in the Erdos-Renyi case for different values of p and n = 500. Points correspond to the real data, while the lines are the approximations given by Equation (28).
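A quick numerical check of these approximations against a sampled Erdos-Renyi graph can be sketched with networkx and NumPy. The values of n and p below are illustrative, and the empirical norm follows this appendix's counting of length-2 paths rather than any other reading of ‖A²‖_F:

    import networkx as nx
    import numpy as np

    n, p = 500, 0.8
    A = nx.to_numpy_array(nx.erdos_renyi_graph(n, p, seed=0))
    degrees = A.sum(axis=0)
    k_min, k_max = degrees.min(), degrees.max()

    # empirical values (||A^2||_F taken as the square root of the total number of length-2 paths)
    reach = np.sqrt((A @ A).sum()) / k_min ** 2
    homog = (k_min / k_max) ** 2

    # approximations from Equations (25)-(28)
    spread = 2 * np.sqrt(p * (n - 1) * (1 - p))
    reach_approx = np.sqrt(p ** 2 * n ** 3) / (p * (n - 1) - spread) ** 2
    homog_approx = 1 - 8 * np.sqrt(1 - p) / np.sqrt(n * p)

    print("reachability: empirical %.4f, approx %.4f" % (reach, reach_approx))
    print("homogeneity:  empirical %.4f, approx %.4f" % (homog, homog_approx))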