
RLlib: Abstractions for Distributed Reinforcement Learning

Eric Liang*, Richard Liaw*, Philipp Moritz, Robert Nishihara, Roy Fox, Ken Goldberg,
Joseph E. Gonzalez, Michael I. Jordan, Ion Stoica

Abstract

Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib is available as part of the open source Ray project (RLlib documentation can be found at http://rllib.io).

*Equal contribution. University of California, Berkeley. Correspondence to: Eric Liang <ericliang@berkeley.edu>. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction

Advances in parallel computing and composition through symbolic differentiation have been fundamental to the recent success of deep learning. Today, there are a wide range of deep learning frameworks (Paszke et al., 2017; Abadi et al., 2016; Chen et al., 2016; Jia et al., 2014) that enable rapid innovation in neural network design and facilitate training at the scale necessary for progress in the field.

In contrast, while the reinforcement learning community enjoys the advances in systems and abstractions for deep learning, there has been comparatively less progress in the design of systems and abstractions that directly target reinforcement learning. Nonetheless, many of the challenges in reinforcement learning stem from the need to scale learning and simulation while also integrating a rapidly increasing range of algorithms and models. As a consequence, there is a fundamental need for composable parallel primitives to support research in reinforcement learning.

In the absence of a single dominant computational pattern (e.g., tensor algebra) or fundamental rules of composition (e.g., symbolic differentiation), the design and implementation of reinforcement learning algorithms can often be cumbersome, requiring RL researchers to directly reason about complex nested parallelism. Unlike typical operators in deep learning frameworks, individual components may require parallelism across a cluster (e.g., for rollouts), leverage neural networks implemented by deep learning frameworks, recursively invoke other components (e.g., model-based subtasks), or interface with black-box third-party simulators. In essence, the heterogeneous and distributed nature of many of these components poses a key challenge to reasoning about their parallel composition. Meanwhile, the main algorithms that connect these components are rapidly evolving and expose opportunities for parallelism at varying levels. Finally, RL algorithms manipulate substantial amounts of state (e.g., replay buffers and model parameters) that must be managed across multiple levels of parallelism and physical devices.

The substantial recent progress in RL algorithms and applications has resulted in a large and growing number of RL libraries (Caspi, 2017; Duan et al., 2016; Hafner et al., 2017; Hesse et al., 2017; Kostrikov, 2017; Schaarschmidt et al., 2017). While some of these are highly scalable, few enable the composition of components at scale. In large part, this is due to the fact that many of the frameworks used by these libraries rely on communication between long-running program replicas for distributed execution; e.g., MPI (Gropp et al., 1996), Distributed TensorFlow (Abadi et al., 2016), and parameter servers (Li et al., 2014). As this programming model ignores component boundaries, it does not naturally encapsulate parallelism and resource requirements within individual components (by encapsulation, we mean that individual components specify their own internal parallelism and resource requirements and can be used by other components that have no knowledge of these requirements). As a result, reusing these distributed components requires the insertion of appropriate control points in the program, a burdensome and error-prone process (Section 2). The absence of usable encapsulation hinders code reuse and leads to error-prone reimplementation of mathematically complex and often highly stochastic algorithms. Even worse, in the distributed setting, often
large parts of the distributed communication and execution must also be reimplemented with each new RL algorithm.

(a) Deep Learning  (b) Reinforcement Learning
Figure 1. In contrast with deep learning, RL algorithms leverage parallelism at multiple levels and physical devices. Here, we show an RL algorithm composing derivative-free optimization, policy evaluation, gradient-based optimization, and model-based planning (Table 2).

We believe that the ability to build scalable RL algorithms by composing and reusing existing components and implementations is essential for the rapid development and progress of the field (we note that composability without scalability can trivially be achieved with a single-threaded library, and that all of the difficulty lies in achieving these two objectives simultaneously). Toward this end, we argue for structuring distributed RL components around the principles of logically centralized program control and parallelism encapsulation (Graefe & Davison, 1993; Pan et al., 2010). We built RLlib using these principles, and as a result were not only able to implement a broad range of state-of-the-art RL algorithms, but also to pull out scalable primitives that can be used to easily compose new algorithms.

1.1. Irregularity of RL training workloads

Modern RL algorithms are highly irregular in the computation patterns they create (Table 1), pushing the boundaries of computation models supported by popular distribution frameworks. This irregularity occurs at several levels:

1. The duration and resource requirements of tasks differ by orders of magnitude depending on the algorithm; e.g., A3C (Mnih et al., 2016) updates may take milliseconds, but other algorithms like PPO (Schulman et al., 2017) batch rollouts into much larger granularities.

2. Communication patterns vary, from synchronous to asynchronous gradient-based optimization, to having several types of asynchronous tasks in high-throughput off-policy algorithms such as Ape-X and IMPALA (Horgan et al., 2018; Espeholt et al., 2018).

3. Nested computations are generated by model-based hybrid algorithms (Table 2), hyperparameter tuning in conjunction with RL or DL training, or the combination of derivative-free and gradient-based optimization within a single algorithm (Silver et al., 2017).

4. RL algorithms often need to maintain and update substantial amounts of state including policy parameters, replay buffers, and even external simulators.

Table 1. RL spans a broad range of computational demand.

Dimension      | DQN / Laptop | IMPALA+PBT / Cluster
Task Duration  | ~1 ms        | minutes
Task Compute   | 1 CPU        | several CPUs and GPUs
Total Compute  | 1 CPU        | hundreds of CPUs and GPUs
Nesting Depth  | 1 level      | 3+ levels
Process Memory | megabytes    | hundreds of gigabytes
Execution      | synchronous  | async. and highly concurrent

As a consequence, the developers have no choice but to use a hodgepodge of frameworks to implement their algorithms, including parameter servers, collective communication primitives in MPI-like frameworks, task queues, etc. For more complex algorithms, it is common to build custom distributed systems in which processes independently compute and coordinate among themselves with no central control (Figure 2(a)). While this approach can achieve high performance, the cost to develop and evaluate is large, not only due to the need to implement and debug distributed programs, but because composing these algorithms further complicates their implementation (Figure 3). Moreover, today's computation frameworks (e.g., Spark (Zaharia et al., 2010), MPI) typically assume regular computation patterns and have difficulty when sub-tasks have varying durations, resource requirements, or nesting.

1.2. Logically centralized control for distributed RL

It is desirable for a single programming model to capture all the requirements of RL training. This can be done without eschewing high-level frameworks that structure the computation. Our key insight is that for each distributed RL algorithm, an equivalent algorithm can be written that exhibits logically centralized program control (Figure 2(b)). That is, instead of having independently executing processes (A, B, C, D in Figure 2(a)) coordinate among themselves (e.g., through RPCs, shared memory, parameter servers, or collective communication), a single driver program (D in Figures 2(b) and 2(c)) can delegate algorithm sub-tasks to other processes to execute in parallel. In this paradigm, the worker processes A, B, and C passively hold state (e.g., policy or simulator state) but execute no computations until called by D. To support nested computations, we propose extending the centralized control model with hierarchical delegation of control (Figure 2(c)), which allows the worker processes (e.g., B, C) to further delegate work (e.g., simulations, gradient computation) to sub-workers of their own when executing tasks.
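As a small illustration of this paradigm (a sketch only, using generic Ray constructs rather than RLlib's API), a single driver can hold all control logic while worker actors passively hold simulator or policy state and compute only when called:

import ray

ray.init()

@ray.remote
class Worker:
    """Passively holds state (a counter standing in for a simulator)."""
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.steps = 0

    def rollout(self, num_steps):
        # Executes only when the driver calls it.
        self.steps += num_steps
        return {"worker": self.worker_id, "steps": self.steps}

# Driver D: the only place where control decisions are made.
workers = [Worker.remote(i) for i in range(3)]
for iteration in range(2):
    results = ray.get([w.rollout.remote(100) for w in workers])
    print(iteration, results)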
(a) Distributed Control  (b) Logically Centralized Control  (c) Hierarchical Control
Figure 2. Most RL algorithms today are written in a fully distributed style (a) where replicated processes independently compute and coordinate with each other according to their roles (if any). We propose a hierarchical control model (c), which extends (b) to support nesting in RL and hyperparameter tuning workloads, simplifying and unifying the programming models used for implementation.

Building on such a logically centralized and hierarchical control model has several important advantages. First, the equivalent algorithm is often easier to implement in practice, since the distributed control logic is entirely encapsulated in a single process rather than multiple processes executing concurrently. Second, the separation of algorithm components into sub-routines (e.g., do rollouts, compute gradients with respect to some policy loss) enables code reuse across different execution patterns. Sub-tasks that have different resource requirements (e.g., CPUs vs. GPUs) can be placed on different machines, reducing compute costs as we show in Section 5. Finally, distributed algorithms written in this model can be seamlessly nested within each other, satisfying the parallelism encapsulation principle.

Logically centralized control models can be highly performant, and our proposed hierarchical variant even more so. This is because the bulk of data transfer (blue arrows in Figure 2) between processes happens out of band of the driver, not passing through any central bottleneck. In fact, many highly scalable distributed systems (Zaharia et al., 2010; Chang et al., 2008; Dean & Ghemawat, 2008) leverage centralized control in their design. Within a single differentiable tensor graph, frameworks like TensorFlow also implement logically centralized scheduling of tensor computations onto available physical devices. Our proposal extends this principle into the broader ML systems design space.

The contributions of this paper are as follows.

1. We propose a general and composable hierarchical programming model for RL training (Section 2).

2. We describe RLlib, our highly scalable RL library, and how it builds on the proposed model to provide scalable abstractions for a broad range of RL algorithms, enabling rapid development (Section 3).

3. We discuss how performance is achieved within the proposed model (Section 4), and show that RLlib meets or exceeds state-of-the-art performance for a wide variety of RL workloads (Section 5).

2. Hierarchical Parallel Task Model

As highlighted in Figure 3, parallelization of entire programs using frameworks like MPI (Gropp et al., 1996) and Distributed TensorFlow (Abadi et al., 2016) typically requires explicit algorithm modifications to insert points of coordination when trying to compose two programs or components together. This limits the ability to rapidly prototype novel distributed RL applications. Though the example in Figure 3 is simple, new hyperparameter tuning algorithms for long-running training tasks, e.g., HyperBand and Population Based Training (PBT) (Li et al., 2016; Jaderberg et al., 2017), increasingly demand fine-grained control over training.

We propose building RL libraries with hierarchical and logically centralized control on top of flexible task-based programming models like Ray (Moritz et al., 2017). Task-based systems allow subroutines to be scheduled and executed asynchronously on worker processes, on a fine-grained basis, and for results to be retrieved or passed between processes.

2.1. Relation to existing distributed ML abstractions

Though typically formulated for distributed control, abstractions such as parameter servers and collective communication operations can also be used within a logically centralized control model. As an example, RLlib uses allreduce and parameter servers in some of its policy optimizers (Figure 4), and we evaluate their performance in Section 5.

2.2. Ray implementation of hierarchical control

We note that, within a single machine, the proposed programming model can be implemented simply with thread pools and shared memory, though it is desirable for the underlying framework to scale to larger clusters if needed. We chose to build RLlib on top of the Ray framework, which allows Python tasks to be distributed across large clusters. Ray's distributed scheduler is a natural fit for the hierarchical control model, as nested computation can be implemented in Ray with no central task scheduling bottleneck.
@ray.remote
def rollout(model):
    # perform a rollout and
    # return the result

@ray.remote
def evaluate(params):
    model = generate_model(params)
    results = [rollout.remote(model)
               for i in range(n)]
    return results

param_grid = generate_hyperparams()
print(ray.get([evaluate.remote(p)
               for p in param_grid]))

(a) Distributed Control  (b) Hierarchical Control
Figure 3. Composing a distributed hyperparameter search with a function that also requires distributed computation involves complex nested parallel computation patterns. With MPI (a), a new program must be written from scratch that mixes elements of both. With hierarchical control (b), components can remain unchanged and simply be invoked as remote tasks.

To implement a logically centralized control model, it is first necessary to have a mechanism to launch new processes and schedule tasks on them. Ray meets this requirement with Ray actors, which are Python classes that may be created in the cluster and accept remote method calls (i.e., tasks). Ray permits these actors to in turn launch more actors and schedule tasks on those actors as part of a method call, satisfying our need for hierarchical delegation as well.
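To make the actor mechanism concrete, the following is a minimal sketch (not RLlib code) of hierarchical delegation with Ray actors: a worker actor that, inside one of its methods, launches sub-tasks of its own and aggregates their results. The simulate task and the RolloutWorker class are illustrative names, not part of Ray or RLlib.

import ray

ray.init()

@ray.remote
def simulate(seed):
    # Stand-in for an expensive simulation step.
    return seed * 2

@ray.remote
class RolloutWorker:
    """A worker actor that delegates work to sub-tasks of its own."""
    def __init__(self, num_subtasks):
        self.num_subtasks = num_subtasks

    def collect(self, base_seed):
        # The actor launches nested remote tasks as part of handling
        # a single method call from the driver.
        futures = [simulate.remote(base_seed + i)
                   for i in range(self.num_subtasks)]
        return sum(ray.get(futures))

# Driver: create actors, then delegate work to them.
workers = [RolloutWorker.remote(num_subtasks=4) for _ in range(2)]
print(ray.get([w.collect.remote(10 * i) for i, w in enumerate(workers)]))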

For performance, Ray provides standard communication primitives such as aggregate and broadcast, and critically enables the zero-copy sharing of large data objects through a shared memory object store. As shown in Section 5, this enables the performance of RLlib algorithms. We further discuss framework performance in Section 4.

3. Abstractions for Reinforcement Learning

To leverage RLlib for distributed execution, algorithms must declare their policy π, experience postprocessor ρ, and loss L. These can be specified in any deep learning framework, including TensorFlow and PyTorch. RLlib provides policy evaluators and policy optimizers that implement strategies for distributed policy evaluation and training.

3.1. Defining the Policy Graph

RLlib's abstractions are as follows. The developer specifies a policy model π that maps the current observation o_t and (optional) RNN hidden state h_t to an action a_t and the next RNN state h_{t+1}. Any number of user-defined values y_t^i (e.g., value predictions, TD error) can also be returned:

    π_θ(o_t, h_t) ⇒ (a_t, h_{t+1}, y_t^1 ... y_t^N)    (1)

Most algorithms will also specify a trajectory postprocessor ρ that transforms a batch X_{t,K} of K {(o_t, h_t, a_t, h_{t+1}, y_t^1 ... y_t^N, r_t, o_{t+1})} tuples starting at t. Here r_t and o_{t+1} are the reward and new observation after taking an action. Example uses include advantage estimation (Schulman et al., 2015) and goal relabeling (Andrychowicz et al., 2017). To also support multi-agent environments, experience batches X^p_{t,K} from the P other agents in the environment are also made accessible:

    ρ_θ(X_{t,K}, X^1_{t,K} ... X^P_{t,K}) ⇒ X_post    (2)

Gradient-based algorithms define a combined loss L that can be descended to improve the policy and auxiliary networks:

    L(θ; X) ⇒ loss    (3)

Finally, the developer can also specify any number of utility functions u_i to be called as needed during training to, e.g., return training statistics s, update target networks, or adjust annealing schedules:

    u_1 ... u_M(θ) ⇒ (s, θ_update)    (4)

To interface with RLlib, these algorithm functions should be defined in a policy graph class with the following methods:

abstract class rllib.PolicyGraph:
    def act(self, obs, h): action, h, y*
    def postprocess(self, batch, b*): batch
    def gradients(self, batch): grads
    def get_weights; def set_weights;
    def u*(self, args*)
3.2. Policy Evaluation

For collecting experiences, RLlib provides a PolicyEvaluator class that wraps a policy graph and environment to add a method to sample() experience batches. Policy evaluator instances can be created as Ray remote actors and replicated across a cluster for parallelism. To make their usage concrete, consider a minimal TensorFlow policy gradients implementation that extends the rllib.TFPolicyGraph helper template:

class PolicyGradient(TFPolicyGraph):
    def __init__(self, obs_space, act_space):
        self.obs, self.advantages = ...
        pi = FullyConnectedNetwork(self.obs)
        dist = rllib.action_dist(act_space, pi)
        self.act = dist.sample()
        self.loss = -tf.reduce_mean(
            dist.logp(self.act) * self.advantages)

    def postprocess(self, batch):
        return rllib.compute_advantages(batch)

From this policy graph definition, the developer can create a number of policy evaluator replicas ev and call ev.sample.remote() on each to collect experiences in parallel from environments. RLlib supports OpenAI Gym (Brockman et al., 2016), user-defined environments, and also batched simulators such as ELF (Tian et al., 2017):

evaluators = [rllib.PolicyEvaluator.remote(
        env=SomeEnv, graph=PolicyGradient)
    for _ in range(10)]
print(ray.get([
    ev.sample.remote() for ev in evaluators]))

3.3. Policy Optimization

RLlib separates the implementation of algorithms into the declaration of the algorithm-specific policy graph and the choice of an algorithm-independent policy optimizer. The policy optimizer is responsible for the performance-critical tasks of distributed sampling, parameter updates, and managing replay buffers. To distribute the computation, the optimizer operates over a set of policy evaluator replicas.

(a) Allreduce:

grads = [ev.grad(ev.sample())
         for ev in evaluators]
avg_grad = aggregate(grads)
local_graph.apply(avg_grad)
weights = broadcast(
    local_graph.weights())
for ev in evaluators:
    ev.set_weights(weights)

(b) Local Multi-GPU:

samples = concat([ev.sample()
                  for ev in evaluators])
pin_in_local_gpu_memory(samples)
for _ in range(NUM_SGD_EPOCHS):
    local_g.apply(local_g.grad(samples))
weights = broadcast(local_g.weights())
for ev in evaluators:
    ev.set_weights(weights)

(c) Asynchronous:

grads = [ev.grad(ev.sample())
         for ev in evaluators]
for _ in range(NUM_ASYNC_GRADS):
    grad, ev, grads = wait(grads)
    local_graph.apply(grad)
    ev.set_weights(
        local_graph.get_weights())
    grads.append(ev.grad(ev.sample()))

(d) Sharded Param-server:

grads = [ev.grad(ev.sample())
         for ev in evaluators]
for _ in range(NUM_ASYNC_GRADS):
    grad, ev, grads = wait(grads)
    for ps, g in split(grad, ps_shards):
        ps.push(g)
    ev.set_weights(concat(
        [ps.pull() for ps in ps_shards]))
    grads.append(ev.grad(ev.sample()))

Figure 4. Pseudocode for four RLlib policy optimizer step methods. Each step() operates over a local policy graph and an array of remote evaluator replicas. Ray remote calls are highlighted in orange; other Ray primitives in blue (Section 4). Apply is shorthand for updating weights. Minibatch code and helper functions omitted. The param-server optimizer in RLlib also implements pipelining, not shown here.

To complete the example, the developer chooses a policy optimizer and creates it with references to existing evaluators. The async optimizer uses the evaluator actors to compute gradients in parallel on many CPUs (Figure 4(c)). Each optimizer.step() runs a round of remote tasks to improve the model. Between steps, policy graph replicas can be queried directly, e.g., to print out training statistics:

optimizer = rllib.AsyncPolicyOptimizer(
    graph=PolicyGradient, workers=evaluators)
while True:
    optimizer.step()
    print(optimizer.foreach_policy(
        lambda p: p.get_train_stats()))

Policy optimizers extend the well-known gradient-descent optimizer abstraction to the RL domain. A typical gradient-descent optimizer implements step(L(θ), X, θ) ⇒ θ_opt. RLlib's policy optimizers instead operate over the local policy graph G and a set of remote evaluator replicas, i.e., step(G, ev_1 ... ev_n, θ) ⇒ θ_opt, capturing the sampling phase of RL as part of optimization (i.e., calling sample() on policy evaluators to produce new simulation data).

The policy optimizer abstraction has the following advantages. By separating execution strategy from policy and loss definitions, specialized optimizers can be swapped in to take advantage of available hardware or algorithm features without needing to change the rest of the algorithm. The policy graph class encapsulates interaction with the deep learning framework, allowing algorithm authors to avoid mixing distributed systems code with numerical computations, and enabling optimizer implementations to be improved and reused across different deep learning frameworks.

Figure 5. RLlib's centrally controlled policy optimizers match or exceed the performance of implementations in specialized systems. (a) Sharded parameter server: the RLlib parameter server optimizer using 8 internal shards is competitive with a Distributed TensorFlow implementation tested in similar conditions (Atari A3C gradients/s vs. number of A3C workers). (b) Ape-X in RLlib: RLlib's Ape-X policy optimizer scales to 160k frames per second with 256 workers at a frameskip of 4, more than matching a reference throughput of ~45k fps at 256 workers, demonstrating that a single-threaded Python controller can efficiently scale to high throughputs.

As shown in Figure 4, by leveraging centralized control, policy optimizers succinctly capture a broad range of choices in RL optimization: synchronous vs. asynchronous, allreduce vs. parameter server, and use of GPUs vs. CPUs. RLlib's policy optimizers provide performance comparable to optimized parameter server (Figure 5(a)) and MPI-based implementations (Section 5). Pulling out this optimizer abstraction is easy in a logically centralized control model since each policy optimizer has full control over the distributed computation it implements.
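The step(G, ev_1 ... ev_n) pattern above can also be made concrete outside of RLlib. Below is a hedged sketch, in the spirit of Figure 4(a), that assumes a hypothetical evaluator actor interface with sample(), grad(), and set_weights() methods and a local graph with get_weights()/apply_gradients(); it averages gradients on the driver with NumPy rather than using a collective, and it is not RLlib's actual implementation.

import numpy as np
import ray

class SyncGradientsOptimizer:
    """Sketch of a centrally controlled, synchronous policy optimizer.

    `evaluators` are assumed to be Ray actor handles exposing
    sample.remote(), grad.remote(), and set_weights.remote();
    `local_graph` is assumed to expose get_weights() and
    apply_gradients(). These are illustrative interfaces, not RLlib's API.
    """

    def __init__(self, local_graph, evaluators):
        self.local_graph = local_graph
        self.evaluators = evaluators

    def step(self):
        # 1. Sample and compute gradients remotely, in parallel.
        grad_refs = [ev.grad.remote(ev.sample.remote())
                     for ev in self.evaluators]
        grads = ray.get(grad_refs)

        # 2. Average the per-evaluator gradients (dicts of arrays) locally.
        avg_grad = {k: np.mean([g[k] for g in grads], axis=0)
                    for k in grads[0]}
        self.local_graph.apply_gradients(avg_grad)

        # 3. Broadcast the updated weights back to the evaluators.
        weights_ref = ray.put(self.local_graph.get_weights())
        for ev in self.evaluators:
            ev.set_weights.remote(weights_ref)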
3.4. Completeness and Generality of Abstractions

We demonstrate the completeness of RLlib's abstractions by formulating the algorithm families listed in Table 2 within the API. When applicable, we also describe the concrete implementation in RLlib:

Table 2. RLlib's policy optimizers and evaluators capture common components (Evaluation, Replay, Gradient-based Optimizer) within a logically centralized control model, and leverage Ray's hierarchical task model to support other distributed components.

Algorithm Family     | Policy Evaluation | Replay Buffer | Gradient-Based Optimizer | Other Distributed Components
DQNs                 | X                 | X             | X                        |
Policy Gradient      | X                 |               | X                        |
Off-policy PG        | X                 | X             | X                        |
Model-Based/Hybrid   | X                 |               | X                        | Model-Based Planning
Multi-Agent          | X                 | X             | X                        |
Evolutionary Methods | X                 |               |                          | Derivative-Free Optimization
AlphaGo              | X                 | X             | X                        | MCTS, Derivative-Free Optimization

DQNs: DQNs use y^1 for storing TD error, implement n-step return calculation in ρ_θ, and the Q loss in L. Target updates are implemented in u_1, and setting the exploration ε in u_2.

DQN implementation: To support experience replay, RLlib's DQN uses a policy optimizer that saves collected samples in an embedded replay buffer. The user can alternatively use an asynchronous optimizer (Figure 4(c)). The target network is updated by calls to u_1 between optimizer steps.

Ape-X implementation: Ape-X (Horgan et al., 2018) is a variation of DQN that leverages distributed experience prioritization to scale to many hundreds of cores. To adapt our DQN implementation, we created policy evaluators with a distribution of ε values, and wrote a new high-throughput policy optimizer (~200 lines) that pipelines the sampling and transfer of data between replay buffer actors using Ray primitives. Our implementation scales nearly linearly up to 160k environment frames per second with 256 workers (Figure 5(b)), and the optimizer can compute gradients for ~8.5k 80x80x4 observations/s on a V100 GPU.
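As a rough illustration of the replay-based optimizer pattern described for DQN above (and not RLlib's actual implementation), the sketch below assumes the same hypothetical evaluator-actor interface as before and keeps a simple uniform replay buffer inside the optimizer's step() loop.

import random
import ray

class ReplayOptimizer:
    """Sketch: collect samples remotely, store them, train from replay.

    `evaluators` are assumed Ray actor handles with sample.remote() and
    set_weights.remote(); `local_graph` is assumed to expose gradients(),
    apply_gradients(), and get_weights(). Illustrative interfaces only.
    """

    def __init__(self, local_graph, evaluators, capacity=100_000):
        self.local_graph = local_graph
        self.evaluators = evaluators
        self.buffer = []
        self.capacity = capacity

    def step(self, train_batch_size=32):
        # 1. Gather fresh experience batches from all evaluators.
        for batch in ray.get([ev.sample.remote() for ev in self.evaluators]):
            self.buffer.extend(batch)
        self.buffer = self.buffer[-self.capacity:]

        # 2. Train on a uniformly sampled replay batch.
        replay_batch = random.sample(
            self.buffer, min(train_batch_size, len(self.buffer)))
        self.local_graph.apply_gradients(
            self.local_graph.gradients(replay_batch))

        # 3. Push updated weights back to the evaluators.
        weights = ray.put(self.local_graph.get_weights())
        for ev in self.evaluators:
            ev.set_weights.remote(weights)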
Policy Gradient / Off-policy PG: These algorithms store value predictions in y^1, implement advantage estimation using ρ_θ, and combine actor and critic losses in L.

PPO implementation: Since PPO's loss function permits multiple SGD passes over sample data, when there is sufficient GPU memory RLlib chooses a GPU-optimized policy optimizer (Figure 4(b)) that pins data into local GPU memory. In each iteration, the optimizer collects samples from evaluator replicas, performs multi-GPU optimization locally, and then broadcasts the new model weights.

A3C implementation: RLlib's A3C can use either the asynchronous (Figure 4(c)) or sharded parameter server policy optimizer (Figure 4(d)). These optimizers collect gradients from the policy evaluators to update the canonical copy of θ.

DDPG implementation: RLlib's DDPG uses the same replay policy optimizer as DQN. L includes both actor and critic losses. The user can also choose to use the Ape-X policy optimizer with DDPG.

Model-based / Hybrid: Model-based RL algorithms extend π_θ(o_t, h_t) to make decisions based on model rollouts, which can be parallelized using Ray. To update their environment models, the model loss can either be bundled with L, or the model trained separately (i.e., in parallel using Ray primitives) and its weights periodically updated via u_1.

Multi-Agent: Policy evaluators can run multiple policies at once in the same environment, producing batches of experience for each agent. Many multi-agent algorithms use a centralized critic or value function, which we support by allowing ρ_θ to collate experiences from multiple agents.

Evolutionary Methods: Derivative-free methods can be supported through non-gradient-based policy optimizers.

Evolution Strategies (ES) implementation: ES is a derivative-free optimization algorithm that scales well to clusters with thousands of CPUs. We were able to port a single-threaded implementation of ES to RLlib with only a few changes, and further scale it with an aggregation tree of actors (Figure 8(a)), suggesting that the hierarchical control model is both flexible and easy to adapt algorithms for.

PPO-ES experiment: We studied a hybrid algorithm that runs PPO updates in the inner loop of an ES optimization step that randomly perturbs the PPO models. The implementation took only ~50 lines of code and did not require changes to PPO, showing the value of encapsulating parallelism. In our experiments, PPO-ES converged faster and to a higher reward than PPO on the Walker2d-v1 task. A similarly modified A3C-ES implementation solved PongDeterministic-v4 in 30% less time.
(a) Ape-X  (b) PPO-ES  (c) AlphaGo Zero
Figure 6. Complex RL architectures are easily captured within RLlib's hierarchical control model. Here blue lines denote data transfers, orange lines lighter-overhead method calls. Each train() call encompasses a batch of remote calls between components.

AlphaGo: We sketch how to scalably implement the AlphaGo Zero algorithm using a combination of Ray and RLlib abstractions. Pseudocode for the ~70 line main algorithm loop is provided in the Supplementary Material.

1. Logically centralized control of multiple distributed components: AlphaGo Zero uses multiple distributed components: model optimizers, self-play evaluators, candidate model evaluators, and the shared replay buffer. These components are manageable as Ray actors under a top-level AlphaGo policy optimizer. Each optimizer step loops over actor statuses to process new results, routing data between actors and launching new actor instances (a minimal sketch of such a loop is given after this list).

2. Shared replay buffer: AlphaGo Zero stores the experiences from self-play evaluator instances in a shared replay buffer. This requires routing game results to the shared buffer, which is easily done by passing the result object references from actor to actor.

3. Best player: AlphaGo Zero tracks the current best model and only populates its replay buffer with self-play from that model. Candidate models must achieve a ≥55% victory margin to replace the best model. Implementing this amounts to adding an if block in the main control loop.

4. Monte-Carlo tree search: MCTS (i.e., model-based planning) can be handled as a subroutine of the policy graph, and optionally parallelized as well using Ray.
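A minimal sketch of the kind of top-level control loop described in point 1, under the assumption of hypothetical actor classes (ReplayBuffer, SelfPlayEvaluator) that are not the paper's actual AlphaGo Zero code:

import ray

ray.init()

@ray.remote
class ReplayBuffer:
    def __init__(self):
        self.games = []
    def add(self, game):
        self.games.append(game)
    def size(self):
        return len(self.games)

@ray.remote
class SelfPlayEvaluator:
    """Hypothetical self-play worker; play_game() would run MCTS rollouts."""
    def __init__(self, weights):
        self.weights = weights
    def play_game(self):
        return {"moves": [], "winner": +1}  # placeholder game record
    def set_weights(self, weights):
        self.weights = weights

# Top-level loop with logically centralized control: the driver polls
# pending self-play results, routes them to the shared buffer, and
# relaunches work, without the actors coordinating among themselves.
buffer = ReplayBuffer.remote()
best_weights = {}
evaluators = [SelfPlayEvaluator.remote(best_weights) for _ in range(4)]
pending = {ev.play_game.remote(): ev for ev in evaluators}

for _ in range(20):
    done, _ = ray.wait(list(pending), num_returns=1)
    ev = pending.pop(done[0])
    buffer.add.remote(done[0])           # route the result reference onward
    pending[ev.play_game.remote()] = ev  # keep the evaluator busy
    # (candidate training/evaluation and the >=55% promotion check
    #  would also be driven from this loop)

print(ray.get(buffer.size.remote()))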
HyperBand and Population Based Training: Ray includes distributed implementations of hyperparameter search algorithms such as HyperBand and PBT (Li et al., 2016; Jaderberg et al., 2017). We were able to use these to evaluate RLlib algorithms, which are themselves distributed, with the addition of ~15 lines of code per algorithm. We note that these algorithms are non-trivial to integrate when using distributed control models due to the need to modify existing code to insert points of coordination (Figure 3). RLlib's use of short-running tasks avoids this problem, since control decisions can be easily made between tasks.

4. Framework Performance

In this section, we discuss properties of Ray (Moritz et al., 2017) and other optimizations critical to RLlib.

4.1. Single-node performance

Stateful computation: Tasks can share mutable state with other tasks through Ray actors. This is critical for tasks that operate on and mutate stateful objects like third-party simulators or neural network weights.

Shared memory object store: RL workloads involve sharing large quantities of data (e.g., rollouts and neural network weights). Ray supports this by allowing data objects to be passed directly between workers without any central bottleneck. In Ray, workers on the same machine can also read data objects through shared memory without copies.

Vectorization: RLlib can batch policy evaluation to improve hardware utilization (Figure 7), supports batched environments, and passes experience data between actors efficiently in columnar array format.

4.2. Distributed performance

Lightweight tasks: Remote call overheads in Ray are on the order of ~200 µs when scheduled on the same machine. When machine resources are saturated, tasks spill over to other nodes, increasing latencies to around ~1 ms. This enables parallel algorithms to scale seamlessly to multiple machines while preserving high single-node throughput.

Nested parallelism: Building RL algorithms by composing distributed components creates multiple levels of nested parallel calls (Figure 1). Since components make decisions that may affect downstream calls, the call graph is also inherently dynamic. Ray supports this by allowing any Python function or class method to be invoked remotely as a lightweight task. For example, func.remote() executes func remotely and immediately returns a placeholder result which can later be retrieved or passed to other tasks.

Resource awareness: Ray allows remote calls to specify resource requirements and utilizes a resource-aware scheduler to preserve component performance. Without this, distributed components can improperly allocate resources, causing algorithms to run inefficiently or fail.

Fault tolerance and straggler mitigation: Failure events become significant at scale (Barroso et al., 2013). RLlib leverages Ray's built-in fault tolerance mechanisms (Moritz et al., 2017), reducing costs with preemptible cloud compute instances (Amazon, 2011; Google, 2015). Similarly, stragglers can significantly impact the performance of distributed algorithms at scale (Dean & Barroso, 2013). RLlib supports straggler mitigation in a generic way via the ray.wait() primitive. For example, in PPO we use this to drop the slowest evaluator tasks, at the cost of some bias.
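A hedged sketch of the straggler-mitigation pattern just described, using ray.wait() to keep only the first results and drop the slowest tasks (the rollout task here is illustrative, and the timeout and fraction kept are arbitrary choices, not RLlib's settings):

import random
import time
import ray

ray.init()

@ray.remote
def rollout(i):
    # Simulate evaluators with varying durations (some are stragglers).
    time.sleep(random.uniform(0.1, 2.0))
    return i

futures = [rollout.remote(i) for i in range(16)]

# Wait for the fastest 12 of 16 tasks, or for 1.5 seconds, whichever
# comes first; slower tasks are simply dropped for this iteration.
ready, not_ready = ray.wait(futures, num_returns=12, timeout=1.5)
results = ray.get(ready)
print(len(results), "results kept;", len(not_ready), "stragglers dropped")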
Data compression: RLlib uses the LZ4 algorithm to compress experience batches. For image observations, LZ4 reduces network traffic and memory usage by more than an order of magnitude, at a compression rate of ~1 GB/s/core.
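For reference, a small sketch of LZ4 compression of a batch of image observations, assuming the python-lz4 package (not RLlib's internal code path); the zero-filled batch stands in for image frames, which compress far better than random data:

import lz4.frame
import numpy as np

# A toy batch of 32 Atari-style observations (84x84x4, uint8).
obs_batch = np.zeros((32, 84, 84, 4), dtype=np.uint8)

compressed = lz4.frame.compress(obs_batch.tobytes())
restored = np.frombuffer(
    lz4.frame.decompress(compressed), dtype=np.uint8).reshape(obs_batch.shape)

assert np.array_equal(obs_batch, restored)
print(f"compression ratio: {obs_batch.nbytes / len(compressed):.1f}x")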
5. Evaluation

Sampling efficiency: Policy evaluation is an important building block for all RL algorithms. In Figure 7 we benchmark the scalability of gathering samples from policy evaluator actors. To avoid bottlenecks, we use four intermediate actors for aggregation. Pendulum-CPU reaches over 1.5 million actions/s running a small 64x64 fully connected network as the policy. Pong-GPU nears 200k actions/s on the DQN convolutional architecture (Mnih et al., 2015).

Figure 7. Policy evaluation throughput scales nearly linearly from 1 to 128 cores. PongNoFrameskip-v4 on GPU scales from 2.4k to ~200k actions/s, and Pendulum-v0 on CPU from 15k to 1.5M actions/s. We use a single p3.16xl AWS instance to evaluate from 1-16 cores, and a cluster of four p3.16xl instances from 32-128 cores, spreading actors evenly across the cluster. Evaluators compute actions for 64 agents at a time, and share the GPUs on the machine.

Large-scale tests: We evaluate the performance of RLlib on Evolution Strategies (ES), Proximal Policy Optimization (PPO), and A3C, comparing against specialized systems built specifically for those algorithms (OpenAI, 2017; Hesse et al., 2017; OpenAI, 2016) using Redis, OpenMPI, and Distributed TensorFlow. The same hyperparameters were used in all experiments. We used TensorFlow to define neural networks for the RLlib algorithms evaluated.

RLlib's ES implementation scales well on the Humanoid-v1 task to 8192 cores using AWS m4.16xl CPU instances (Amazon, 2017). With 8192 cores, we achieve a reward of 6000 in a median time of 3.7 minutes, which is over twice as fast as the best published result (Salimans et al., 2017). For PPO we evaluate on the same Humanoid-v1 task, starting with one p2.16xl GPU instance and adding m4.16xl instances to scale. This cost-efficient local policy optimizer (Table 3) outperformed the reference MPI implementation that required multiple expensive GPU instances to scale.

We ran RLlib's A3C on an x1.16xl machine and solved the PongDeterministic-v4 environment in 12 minutes using an asynchronous policy optimizer and 9 minutes using a sharded param-server optimizer, which matches the performance of a well-tuned baseline (OpenAI, 2016).

Figure 8. The time required to achieve a reward of 6000 on the Humanoid-v1 task: (a) Evolution Strategies (Reference ES vs. RLlib ES, 256 to 8192 CPUs); (b) PPO (MPI PPO vs. RLlib PPO, 8x1 to 512x64 CPUs x GPUs). RLlib implementations of ES and PPO outperform highly optimized reference implementations.

Multi-GPU: To better understand RLlib's advantage in the PPO experiment, we ran benchmarks on a p2.16xl instance comparing RLlib's local multi-GPU policy optimizer with one using an allreduce in Table 3. The fact that different strategies perform better under different conditions suggests that policy optimizers are a useful abstraction.

Table 3. A specialized multi-GPU policy optimizer outperforms distributed allreduce when data can fit entirely into GPU memory. This experiment was done for PPO with 64 Evaluator processes. The PPO batch size was 320k, the SGD batch size was 32k, and we used 20 SGD passes per PPO batch.

Policy Optimizer | Gradients computed on | Environment | SGD throughput
Allreduce-based  | 4 GPUs, Evaluators    | Humanoid-v1 | 330k samples/s
                 |                       | Pong-v0     | 23k samples/s
Allreduce-based  | 16 GPUs, Evaluators   | Humanoid-v1 | 440k samples/s
                 |                       | Pong-v0     | 100k samples/s
Local Multi-GPU  | 4 GPUs, Driver        | Humanoid-v1 | 2.1M samples/s
                 |                       | Pong-v0     | N/A (out of mem.)
Local Multi-GPU  | 16 GPUs, Driver       | Humanoid-v1 | 1.7M samples/s
                 |                       | Pong-v0     | 150k samples/s

6. Related work

There are many reinforcement learning libraries (Caspi, 2017; Duan et al., 2016; Hafner et al., 2017; Hesse et al., 2017; Kostrikov, 2017; Schaarschmidt et al., 2017). These often scale by creating long-running program replicas that each participate in coordinating the distributed computation as a whole, and as a result do not generalize well to complex architectures. RLlib instead uses a hierarchical control model with short-running tasks to let each component control its own distributed execution, enabling higher-level abstractions such as policy optimizers to be used for composing and scaling RL algorithms.

Outside of reinforcement learning, there has been a strong effort to explore composition and integration between different deep learning frameworks. ONNX (Microsoft, 2017), NNVM (DMLC, 2017), and Gluon (Gluon, 2017) sit between model specifications and hardware to provide cross-library optimizations. Deep learning libraries (Paszke et al., 2017; Abadi et al., 2016; Chen et al., 2016; Jia et al., 2014) provide support for the gradient-based optimization components that appear in RL algorithms.

7. Conclusion

RLlib is an open source library for reinforcement learning that leverages fine-grained nested parallelism to achieve state-of-the-art performance across a broad range of RL workloads. It offers both a collection of reference algorithms and scalable abstractions for easily composing new ones.
Acknowledgements

In addition to NSF CISE Expeditions Award CCF-1730628, this research is supported in part by DHS Award HSHQDC-16-3-00083, and gifts from Alibaba, Amazon Web Services, Ant Financial, Arm, CapitalOne, Ericsson, Facebook, Google, Huawei, Intel, Microsoft, Scotiabank, Splunk and VMware.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Amazon. Scientific computing with EC2 spot instances. https://aws.amazon.com/ec2/spot/spot-and-science/, 2011.

Amazon. Amazon EC2 pricing. https://aws.amazon.com/ec2/pricing, 2017.

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay. arXiv preprint arXiv:1707.01495, 2017.

Barroso, L. A., Clidaras, J., and Hölzle, U. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis Lectures on Computer Architecture, 8(3):1–154, 2013.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Caspi, I. Reinforcement learning coach by Intel. https://github.com/NervanaSystems/coach, 2017.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop on Machine Learning Systems (LearningSys'16), 2016.

Dean, J. and Barroso, L. A. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.

Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

DMLC. NNVM compiler: Open compiler for AI frameworks. http://www.tvmlang.org/2017/10/06/nnvm-compiler-announcement.html, 2017.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Gluon. The Gluon API specification. https://github.com/gluon-api/gluon-api, 2017.

Google. Preemptible virtual machines. https://cloud.google.com/preemptible-vms, 2015.

Graefe, G. and Davison, D. L. Encapsulation of parallelism and architecture-independence in extensible database query execution. IEEE Transactions on Software Engineering, 19(8):749–764, 1993.

Gropp, W., Lusk, E., Doss, N., and Skjellum, A. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789–828, 1996.

Hafner, D., Davidson, J., and Vanhoucke, V. TensorFlow Agents: Efficient batched reinforcement learning in TensorFlow. arXiv preprint arXiv:1709.02878, 2017.

Hesse, C., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. Distributed prioritized experience replay. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1Dy---0Z.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

Kostrikov, I. PyTorch implementation of advantage actor critic (A2C), proximal policy optimization (PPO) and scalable trust-region method for deep reinforcement learning. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr, 2017.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. 2016.

Li, M., Andersen, D. G., Park, J. W., Smola, A., and Ahmed, A. Scaling distributed machine learning with the parameter server. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, volume 583, pp. 598, 2014.

Microsoft. ONNX: Open neural network exchange format. https://onnx.ai, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Paul, W., Jordan, M. I., and Stoica, I. Ray: A distributed framework for emerging AI applications. arXiv preprint arXiv:1712.05889, 2017.

OpenAI. Universe Starter Agent. https://github.com/openai/universe-starter-agent, 2016.

OpenAI. Evolution Strategies Starter Agent. https://github.com/openai/evolution-strategies-starter, 2017.

Pan, H., Hindman, B., and Asanović, K. Composing parallel software efficiently with Lithe. ACM SIGPLAN Notices, 45(6):376–387, 2010.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Salimans, T., Ho, J., Chen, X., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Schaarschmidt, M., Kuhnle, A., and Fricke, K. TensorForce: A TensorFlow library for applied reinforcement learning. Web page, 2017. URL https://github.com/reinforceio/tensorforce.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. 2017.

Tian, Y., Gong, Q., Shang, W., Wu, Y., and Zitnick, L. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. CoRR, abs/1707.01067, 2017. URL http://arxiv.org/abs/1707.01067.

Zaharia, M., Chowdhury, N. M., Franklin, M., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. 2010.
