
Improved Learning in Evolution Strategies via Sparser Inter-Agent Network Topologies

arXiv:1711.11180v1 [cs.AI] 30 Nov 2017

Dhaval Adjodah*, Dan Calacci, Yan Leng, Peter Krafft, Esteban Moro, Alex Pentland
MIT Media Lab; *Foundation Bruno Kessler, Italy
{dval, dcalacci, yleng, pkrafft, emoro, pentland}@mit.edu

Abstract

We draw upon a previously largely untapped literature on human collective intelligence as a source of inspiration for improving deep learning. Implicit in many algorithms that attempt to solve Deep Reinforcement Learning (DRL) tasks is the network of processors along which parameter values are shared. So far, existing approaches have implicitly utilized fully-connected networks, in which all processors are connected. However, the scientific literature on human collective intelligence suggests that complete networks may not always be the most effective information network structures for distributed search through complex spaces. Here we show that alternative topologies can improve deep neural network training: we find that sparser networks learn higher rewards faster, and at lower communication costs.

1 Introduction

We draw upon a previously largely untapped literature on human collective intelligence as a source of inspiration for improving deep learning via evolutionary algorithms. Distributed evolutionary algorithms have proven capable of state-of-the-art training of deep neural networks on reinforcement learning tasks [12]. These algorithms share and remix the parameter values discovered by a population of processors (which we refer to as 'agents' here). Implicit in these algorithms is the network structure along which parameter values are shared between processors. So far, existing work applying evolutionary algorithms to deep learning has implicitly utilized fully-connected networks in which all processors are connected.

The scientific literature on human collective intelligence suggests that fully-connected networks may not always be the most effective for distributed search through complex spaces [7, 8]. Simulations and human experiments have shown that sparser network structures can improve collective learning in a variety of group problem-solving scenarios [7, 8, 1]. Sparser networks are topologies in which agents do not share learning with every other agent (as in a fully-connected network), but are instead organized in less-connected structures.

Here, we show that alternative network structures can improve deep neural network training using Evolution Strategies (ES) [12], with OpenAI's Roboschool 3D Humanoid Walker as our learning task. We find that sparser network topologies of agents (i.e. processors) perform better, and at a significantly lower communication cost. Although the highest-performing topologies we study require that each agent communicate with only 4–10% of all other agents on average, they can learn faster and produce higher rewards than the fully-connected baseline.

Our key findings are as follows:

1. We find that explicitly designed networks that incur a lower communication cost yield faster and higher learning than fully-connected networks in an evolutionary algorithm for deep reinforcement learning.

2. These networks result in a multiplicative effect in total reward: networks with only 1,000 agents produce results competitive with fully-connected networks of 4,000 agents.
3. We find that sparser graphs can achieve up to 33.5% higher reward than a corresponding fully-connected network, and that they can reach the fully-connected maximum up to 32% earlier.

2 Related Work

Distributed DRL approaches attempt to address the fundamental predicted instability [15] of using non-linear approximators such as neural networks to represent the action-value function. If several agents pool their varied experiences together, the model can be learned from less correlated data. The Gorila framework [11] collects experiences in parallel from many agents and pools them into a global memory store on a distributed database. A3C [9] instead runs many agents asynchronously on several environment instances while varying their exploration policies. This effectively increases exploration diversity in parameter space and de-correlates agent data while significantly reducing resource usage, but it can also cause heavy communication bottlenecks.

Recently, black-box optimization methods such as evolution strategies have been shown to overcome such communication bottlenecks. ES runs many agents that need only share their scalar reward values each iteration to learn efficiently. This data efficiency allows ES to solve benchmark DRL problems in record time by utilizing a very large number of CPU cores distributed over a network.

All the approaches described above organize agents in a fully-connected network topology: the algorithm updates a global parameter set using information available from all agents at every step. As described in the next section, the scientific literature on collective intelligence in animal and human groups suggests that other network topologies could be more effective for such distributed search problems in complex task spaces.

2.1 Human Collective Intelligence

Studies of human and animal groups have revealed that groups of problem-solvers often exhibit capabilities well beyond the skill of any of their individual members. This emergent problem-solving ability of groups is called their collective intelligence (CI), and it has been found to be affected by an array of factors such as the learning strategies of group members and their communication network structure [16, 7]. Recent studies of collective intelligence have modeled groups of problem-solvers as distributed information processors (i.e. agents), and we take this philosophical approach here [6].

The network structure that agents use to share information significantly influences the performance of groups. For example, in studies of simulated groups searching an NK task space (a parameterized space of arbitrarily high ruggedness), sparser networks have been shown to result in increased exploration, higher overall maximum reward, and higher diversity of solutions [4, 7]. In human experiments, where agents solve tasks outside the class of NK problems, denser communication networks have instead resulted in higher group performance [8]. Recent work has shown that these opposite effects can be explained by the different learning strategies agents employ and the complexity of the target task [1].

3 Algorithm: Networked Inter-Agent Learning

We introduce the notion of network topology and independent agent updates to the ES paradigm. Instead of updating a global policy using all agents at each iteration, each agent in the network performs an update using only its neighboring nodes.
Our approach maintains the massive scalability of the original ES algorithm, allows for greater exploration of parameter space, and does not require a centralized master: learning can be performed at the level of individual nodes.

3.1 Evolution Strategies

One of the central limitations of modern distributed DRL is the lack of scalability due to the high communication costs of sharing parameters between agents. ES attempts to solve this problem using a derivative-free approach, which we loosely outline here. ES chooses a fixed deep architecture and initializes a single set of network weights θ. It then creates a population of N parameter sets, each perturbed from θ by adding randomly sampled Gaussian noise $\epsilon_i \sim \mathcal{N}(0, I)$ directly in parameter space. The rewards $F_i$ from running the target task with these perturbed weights are collected, and a gradient estimate is constructed as a reward-weighted average of the perturbations, giving the update $\theta \leftarrow \theta + \alpha \frac{1}{N\sigma} \sum_{i=1}^{N} F_i \epsilon_i$. In implementation, this scheme requires that agents share only their scalar rewards, allowing ES to scale massively to thousands of CPUs. This scalability has allowed ES to attain state-of-the-art performance on some of the hardest DRL benchmarks in record time, for example by solving the MuJoCo Humanoid Walker task in 10 minutes.

3.2 Networked Evolution Strategies

The main differences between Networked Evolution Strategies (NES) and the original ES are summarized in Table 1. In NES, we introduce the notion of independent, networked "agents", each with its own individual parameter set θ_i, that perform updates separately. Whereas Evolution Strategies runs a number of episodes, each with a noised version of a single parameter set, NES deploys agents that each run an episode with a different parameter set. The agents are arranged in an undirected, unweighted network G = (V, E), with each agent i corresponding to a node v_i ∈ V. On each iteration t, we perturb the parameter set of agent i, θ_i^t, by a Gaussian noise vector ε_i sampled in the same way as in the Evolution Strategies algorithm.

Table 1: Main differences between ES and NES

                                         Original ES        Networked ES
  No. of parameter sets being explored   1                  n (no. of agents)
  Percent of agents needed for update    100                4–10
  Use of broadcast                       no                 yes
  Possible topologies allowed            fully-connected    all

In the optimization step, each agent performs its own independent update. Each agent i uses the same rank-centered weighting function as ES, but only uses its closed neighborhood, N[i], to perform the update; this set of nodes includes node i itself. Because different agents have different parameters, we calculate the difference between θ_i^t and each neighbor's perturbed parameter set, (θ_i^t − (θ_k^t + σε_k)). We then weight each difference by its reported reward F_k^t, instead of computing a weighted average over the perturbations applied to each neighbor's parameter set (as in the Evolution Strategies algorithm). Each agent's parameter update at time t + 1 is then $\theta_i^{t+1} \leftarrow \theta_i^t + \alpha \frac{1}{n\sigma} \sum_{k \in N[i]} F_k^t \big(\theta_i^t - (\theta_k^t + \sigma\epsilon_k)\big)$, as shown in Algorithm 1. This change in each agent's parameter update means that the parameter sets of different nodes diverge after the first update. The update step in the networked algorithm presented here has each node effectively solving for its neighborhood's average objective, rather than the global average objective as in ES. In the case of a fully-connected network, each agent's neighborhood N[i] is equal to the full set of vertices V, and the update reduces to that of the original ES algorithm.
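To make the per-agent update concrete, the sketch below is our own illustrative Python/NumPy code, not the authors' released implementation. The function names are hypothetical, raw rewards are used in place of the paper's rank-centered weighting, and the vanilla ES estimator is included only for comparison.

```python
import numpy as np

def es_update(theta, eps_list, rewards, alpha=0.01, sigma=0.1):
    """Vanilla ES: theta <- theta + alpha / (N * sigma) * sum_i F_i * eps_i."""
    N = len(eps_list)
    grad = sum(F * e for F, e in zip(rewards, eps_list))
    return theta + alpha * grad / (N * sigma)

def nes_agent_update(theta_i, neighbor_thetas, neighbor_eps, neighbor_rewards,
                     alpha=0.01, sigma=0.1):
    """Networked ES update for one agent i, using only its closed neighborhood N[i].

    theta_i          : agent i's current parameters, theta_i^t
    neighbor_thetas  : parameter vectors theta_k^t for k in N[i] (including i itself)
    neighbor_eps     : the Gaussian perturbations eps_k those agents used this iteration
    neighbor_rewards : the scalar returns F_k^t they reported
    """
    n = len(neighbor_thetas)
    step = np.zeros_like(theta_i)
    for theta_k, eps_k, F_k in zip(neighbor_thetas, neighbor_eps, neighbor_rewards):
        # Reward-weighted difference between agent i's parameters and the
        # neighbor's perturbed parameters, following the update rule of Section 3.2.
        step += F_k * (theta_i - (theta_k + sigma * eps_k))
    return theta_i + alpha * step / (n * sigma)
```

We follow the update rule exactly as written above; in the actual algorithm the rewards are first rank-centered, which this sketch omits for brevity.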
We hypothesize that the divergent objective functions in networked ES result in a greater diversity of policies being explored. Although the neighborhood-only constraint on each node's parameter update does not add a penalty term to the update step (lines 13-14 in Algorithm 1), updating from one's neighbors alone can be loosely interpreted as a form of regularization, in the same way that Dropout [13] is not strictly a regularizer but is often seen as acting like one by preventing over-fitting.

A final difference in our algorithm is the use of stochastic global broadcast. Because nodes now search for better parameters only within their local neighborhoods, the effective number of parameter combinations explored around any given parameter set decreases significantly, scaling with the size of a node's neighborhood. Taking inspiration from random restarts in simulated annealing, we therefore implement a stochastic global broadcast: with probability β at each iteration, all agents adopt the highest-performing parameter set from the previous iteration, re-centering the network on a current local maximum. We find that past a certain minimum (β ≈ 0.5), broadcast has minimal effect on both the reward and the learning rate of the network topologies we test.

Algorithm 1 Networked Evolution Strategies
 1: Input: learning rate α, noise standard deviation σ, initial policy parameters θ_i^0 for i = 1, 2, ..., n (n workers), communication graph G, global broadcast probability β
 2: Initialize: n workers with known random seeds and initial parameters θ_i^0
 3: for t = 0, 1, 2, ... do
 4:   for each worker i = 1, 2, ..., n do
 5:     Sample ε_i ~ N(0, I)
 6:     Compute return F_i = F(θ_i^t + σε_i)
 7:     Send scalar return F_i to worker i's neighbors j = 1, 2, ..., m as defined by graph G
 8:   Sample u ~ U(0, 1)
 9:   if u < β then
10:     Set θ_i^{t+1} ← argmax_{θ_i^t} F(θ_i^t)
11:   else
12:     for each worker i = 1, 2, ..., n do
13:       Reconstruct the perturbations ε_k^t of worker i's neighbors on G, k = 1, ..., m
14:       Set θ_i^{t+1} ← θ_i^t + α (1 / (nσ)) Σ_{k=1}^{m} F_k^t (θ_i^t − (θ_k^t + σε_k))

4 Experiment Setup

Here, we describe the setup of experiments designed to test how different network metrics affect our algorithm's learning. We focus our experiments on OpenAI's Roboschool 3D Humanoid Walker (specifically, RoboschoolHumanoid-v1, shown in Figure 1), an open-source implementation of the MuJoCo [14] environments. The 3D Humanoid Walker is considered a difficult benchmark task, and therefore serves as a good point of comparison between learning algorithms. Many other learning tasks exist, such as the Atari game environments [10].

Figure 1: A screenshot of Roboschool's 3D Humanoid Walker, RoboschoolHumanoid-v1.

4.1 Graph Generation

To run our experiments, we generate a large set of canonical network topologies, as well as a set of engineered topologies designed to isolate various network statistics. In all cases, we fix the number of nodes (agents) at 1000. We generate a population of Erdos-Renyi random graphs by varying the routine's main parameter, p; Erdos-Renyi random graphs are constructed by connecting each possible pair of nodes at random with probability p. We ensure that the network consists of only one component (i.e. that there are no disconnected nodes or components in the network).
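For illustration, a connected Erdos-Renyi topology of this kind can be generated with the networkx library as sketched below. This is our own example, not the code used in the experiments, and the resampling loop is one assumed way of enforcing the single-component constraint.

```python
import networkx as nx

def connected_erdos_renyi(n=1000, p=0.05, seed=0, max_tries=100):
    """Sample an Erdos-Renyi graph G(n, p), resampling until it is connected
    so that no agent (node) is cut off from the rest of the population."""
    for attempt in range(max_tries):
        G = nx.erdos_renyi_graph(n, p, seed=seed + attempt)
        if nx.is_connected(G):
            return G
    raise RuntimeError("No connected sample found; try a larger p.")

G = connected_erdos_renyi(n=1000, p=0.05)
print(G.number_of_edges(), nx.density(G))
```

The neighbor lists of such a graph then determine which agents exchange scalar rewards in Algorithm 1.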
An example of an Erdos-Renyi graph of average density p = 0.4 can be seen in Figure 2, compared to a fully-connected network with the same number of nodes (only 40 nodes are used in this example for illustration, although we use 1000 nodes in our experiments).

Figure 2: Comparison of a fully-connected network topology (A) and an Erdos-Renyi graph (B) of average density 0.4, both with 40 nodes.

To control for and explore network characteristics more independently, we use random partition graph generation, a generalization of the planted l-partition scheme introduced in [2], which allows us to vary statistics such as clustering and centrality while keeping modularity constant. We first split the graph into k sub-communities and assign each node to a sub-community with uniform probability, similar to an Erdos-Renyi graph. We then run the following routine for a set number of iterations: sample a source node n_s from the network; with probability p_in, sample a target node n_t from the same cluster as n_s; otherwise, with probability p_out, sample n_t from the nodes outside that cluster; then construct an edge between n_s and n_t. All sampling is done with replacement, resulting in graphs with differing numbers of edges. In effect, our engineered graphs are a number of smaller Erdos-Renyi clusters connected to each other, making them sufficiently similar to be easily compared with the Erdos-Renyi results.

4.2 Experiments

We first create a baseline by running fully-connected networks of 1000 agents with OpenAI's original ES code 10 times (the original ES authors repeat each experiment 7 times). We then fit each run with a logistic growth function, similar to [3]. We use the upper asymptote as a measure of the maximum reward of each run, and then use the average of these maximum asymptotic rewards as our measure of performance, henceforth referred to as the baseline. Although we fit the learning trajectory with growth functions because there are random jumps to very high reward values, we find that our results do not vary significantly if we use other measures, such as the mean or median of the top 5% of rewards over time.

We then run all our network variants (both in terms of topology and attributes) and similarly obtain a measure of the mean asymptotic reward. We take care to compute these asymptotes over the same number of iterations to maintain comparability of results, and we also ensure that rewards stabilize to an asymptote over time in order to obtain an accurate observation of the maximum achieved reward.
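As a sketch of this fitting step (our own illustration, not the authors' pipeline; the specific three-parameter logistic form, borrowed from a grofit-style parameterization, and the use of scipy are assumptions), the upper asymptote of a reward trajectory could be estimated as follows.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_growth(t, A, mu, lam):
    """Three-parameter logistic growth curve: A is the upper asymptote,
    mu the maximum growth rate, lam the lag before growth takes off."""
    return A / (1.0 + np.exp(4.0 * mu / A * (lam - t) + 2.0))

def asymptotic_reward(iterations, rewards):
    """Fit a run's reward trajectory and return the fitted upper asymptote A."""
    p0 = [np.max(rewards), 1.0, 1.0]  # rough initial guess
    (A, mu, lam), _ = curve_fit(logistic_growth, iterations, rewards,
                                p0=p0, maxfev=10000)
    return A
```

Averaging this asymptotic estimate over repeated runs of the fully-connected topology would then give the baseline against which the sparser topologies are compared.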
5 Results

5.1 Higher and Faster Learning

As can be seen in Figure 3, our best network (an engineered network with 1000 agents) not only beats fully-connected networks with a similar number of agents (processors), but can beat up to 4000 agents arranged in a fully-connected network. This increase in efficiency could be due to the vastly larger parameter space being explored by each local neighborhood.

Erdos-Renyi networks achieve up to a 26% increase over the baseline reward, as shown in Figure 4(a). We see that as the networks become denser, the average improvement over baseline decreases, approaching zero as networks become close to fully-connected: a random graph with an average density of 0.9 still does 5% better than a baseline network (which has a density of 1.0).

Figure 3: 1000 agents arranged in our best engineered network can beat up to 4000 agents arranged in a conventional fully-connected network.

Figure 4: The distribution of reward (a) and learning rate (b) over several repeated runs of our algorithm varies strongly with the density of Erdos-Renyi networks (reward is calculated as the improvement over baseline; learning rate is defined as the number of iterations ahead of the fully-connected network at which the baseline reward is reached).

Figure 5: Erdos-Renyi networks can learn higher rewards at lower communication cost (a); sparsity at both the local neighborhood level (b) and the global inter-cluster level (c) leads to higher rewards.

Additionally, while the fully-connected networks take about 320 iterations to reach their asymptotic maximum, our fastest network reaches that value in only 220 iterations (and keeps learning), an improvement of 32% (Figure 4(b)). Denser networks tend to learn faster, but the relationship is not monotonic: as the network approaches being fully connected, the distribution flattens and the average learning rate decreases. This increase in speed could be due to the fact that the separate network neighborhoods of agents are able to visit a larger number of parameters in parallel, and hence can find higher maxima faster. Because we also implement a probabilistic broadcast, which sets the parameters of all agents to those of the highest-performing agent with probability β at the end of each iteration, the network tends to converge to better-performing parameters. In short, our networked decentralization strikes a balance between increased parameter-exploration diversity and global communication, similarly to simulated annealing [5].

As a control, we tested a degenerate network in which agents do not communicate with any other agents except through broadcast, and find that no learning occurs. Additionally, we find that there is little variation in reward and speed beyond β = 0.5: the improvement is instead driven by the network structure.

To understand what causes certain network topologies to perform better, we calculated network metrics across all 1000 nodes in each Erdos-Renyi network. We find strong correlations between these network metrics and reward, as shown in Figure 5. Specifically, we find that as the number of edges (communication between agents) increases, the reward decreases (Figure 5(a)). This decline may be because, as communication increases, the local neighborhoods become less isolated from one another and the diversity of parameters being explored decreases, which in turn leads to lower rewards (closer to baseline). Clustering is a measure of how many of the neighbors of each node form closed triangles, and is therefore a super-local measure of connectedness. We again find that as clustering increases, rewards decrease (Figure 5(b)). Modularity, which measures how weakly the network's neighborhoods are connected to one another at the global level, also correlates with higher rewards (Figure 5(c)).

Overall, we interpret these results to mean that sparser networks, at both the local and the global level, can learn faster and achieve higher rewards than baseline, and at lower communication cost (less dense networks have fewer edges, and hence less communication between nodes). Based on these observations, we then design new networks that push these network metrics to even higher (or lower) values to engineer for high-reward topologies. We focus on optimizing for higher rewards here and leave optimizing for faster learning to future work.
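For reference, the graph statistics discussed above can be computed with standard networkx routines. The sketch below is our own; the greedy modularity partition is an assumed stand-in for whatever community assignment the authors used, and the graph in the usage line is only an example.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def topology_metrics(G):
    """Connectivity statistics of a candidate topology, as discussed in Section 5.1."""
    stats = {
        "num_edges": G.number_of_edges(),            # proxy for communication cost
        "density": nx.density(G),
        "avg_clustering": nx.average_clustering(G),  # local (triangle) connectedness
        "modularity": modularity(G, greedy_modularity_communities(G)),
    }
    # Average path length is only defined for connected graphs.
    stats["avg_path_length"] = (nx.average_shortest_path_length(G)
                                if nx.is_connected(G) else float("inf"))
    return stats

print(topology_metrics(nx.erdos_renyi_graph(100, 0.1, seed=1)))
```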
5.2 Improvements from Engineered Topologies

As seen in Figure 6(b), the Erdos-Renyi results suggest that decreasing the number of edges should increase performance. Consequently, we engineered networks with an even smaller number of edges: the largest edge count among the engineered networks is smaller than that of the Erdos-Renyi graphs (Figure 6(a)). As predicted, our engineered networks show increased rewards: 26% for the best Erdos-Renyi graph compared to 33.5% for the best engineered graph. Interestingly, the relationship is non-monotonic: the trends in reward with respect to the number of edges are opposite for Erdos-Renyi and engineered networks. One possible explanation is that below a certain threshold number of edges, agents can no longer communicate efficiently within their thinned-out neighborhoods: good gradients fail to propagate, each agent relies on the rewards of its very few neighbors, and search becomes ineffective.

Figure 6: Erdos-Renyi graphs suggest that lower communication (fewer edges) leads to higher reward (b); we thus engineer graphs with even less communication (a) and find improvements, but with non-monotonic behavior.

We find the same non-monotonic behavior for average path length, clustering (local connectedness), and modularity (global sparsity). Although our best engineered networks still do better than Erdos-Renyi graphs, rewards decrease if network connectedness decreases too much. In such cases, even extremely high broadcast probabilities do not allow such overly-thinned networks to learn. Note that the larger scatter in the engineered-network rewards arises because we run each engineered network only once (to allow a broader exploration of engineered topologies), rather than running repeated experiments per topology as we do for the Erdos-Renyi and fully-connected networks.

From these explorations, we conclude that the best network structure is one that is both globally and locally sparse: the network should consist of random graph clusters, each sparsely connected internally, with few connections between the clusters. Care should be taken not to create networks that are too sparse, or learning performance will suffer. Overall, it is clear that fully-connected networks are inefficient: they learn more slowly and attain lower rewards than sparser networks.

6 Conclusion

In this work, we design a new networked evolutionary algorithm, informed by the literature on human collective intelligence, and report experimental results from using this algorithm to solve a benchmark deep reinforcement learning problem. We find the counter-intuitive result that sparser communication between learning agents can lead to faster and higher learning. Conventional wisdom would suggest that the optimal network topology would be close to fully-connected, with diminishing returns on either side of that optimum. We show that the optimal connectedness is in fact very sparse, and that reward is non-monotonic in connectedness. Future work could explore how other insights from the literature on human collective intelligence could further improve distributed learning algorithms, for example by experimenting with dynamic networks in which nodes are rewired at each iteration.
The application of alternative network structures to other distributed deep learning algorithms, such as gradient-based algorithms, is another promising avenue for future work.

7 Acknowledgements

The authors would like to thank Zheyuan Shi, Abhimanyu Dubey, Otkrist Gupta, Stanislav Nikolov, Jordan Wick, Alia Braley and Ray Reich for their help and advice.

8 References

[1] Daniel Barkoczi and Mirta Galesic. Social learning strategies modify the effect of network structure on group performance. Nature Communications, 7, 2016.
[2] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
[3] Matthias Kahm, Guido Hasenbrink, Hella Lichtenberg-Fraté, Jost Ludwig, Maik Kschischo, et al. grofit: Fitting biological growth curves with R. Journal of Statistical Software, 33(7):1–21, 2010.
[4] Stuart Kauffman and Simon Levin. Towards a general theory of adaptive walks on rugged landscapes. Journal of Theoretical Biology, 128(1):11–45, 1987.
[5] Scott Kirkpatrick, C. Daniel Gelatt, and Mario P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[6] Peter M. Krafft, Julia Zheng, Wei Pan, Nicolás Della Penna, Yaniv Altshuler, Erez Shmueli, Joshua B. Tenenbaum, and Alex Pentland. Human collective intelligence as distributed Bayesian inference. arXiv preprint arXiv:1608.01987, 2016.
[7] David Lazer and Allan Friedman. The network structure of exploration and exploitation. Administrative Science Quarterly, 52(4):667–694, 2007.
[8] Winter Mason and Duncan J. Watts. Collaborative learning in networks. Proceedings of the National Academy of Sciences, 109(3):764–769, 2012.
[9] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[10] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[11] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.
[12] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
[13] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[14] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033. IEEE, 2012.
[15] John N. Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.
[16] Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, Nada Hashmi, and Thomas W. Malone. Evidence for a collective intelligence factor in the performance of human groups. Science, 330(6004):686–688, 2010.