Eric Liang * 1 Richard Liaw * 1 Philipp Moritz 1 Robert Nishihara 1 Roy Fox 1 Ken Goldberg 1
Joseph E. Gonzalez 1 Michael I. Jordan 1 Ion Stoica 1
Figure 1. In contrast with deep learning, RL algorithms leverage parallelism at multiple levels and physical devices. Here, we show an RL
algorithm composing derivative-free optimization, policy evaluation, gradient-based optimization, and model-based planning (Table 2).
large parts of the distributed communication and execution must also be reimplemented with each new RL algorithm. We believe that the ability to build scalable RL algorithms by composing and reusing existing components and implementations is essential for the rapid development and progress of the field.³ Toward this end, we argue for structuring distributed RL components around the principles of logically centralized program control and parallelism encapsulation (Graefe & Davison, 1993; Pan et al., 2010). We built RLlib using these principles, and as a result were not only able to implement a broad range of state-of-the-art RL algorithms, but also to pull out scalable primitives that can be used to easily compose new algorithms.

³ We note that composability without scalability can trivially be achieved with a single-threaded library and that all of the difficulty lies in achieving these two objectives simultaneously.

Table 1. RL spans a broad range of computational demand.

Dimension        DQN/Laptop     IMPALA+PBT/Cluster
Task Duration    ∼1ms           minutes
Task Compute     1 CPU          several CPUs and GPUs
Total Compute    1 CPU          hundreds of CPUs and GPUs
Nesting Depth    1 level        3+ levels
Process Memory   megabytes      hundreds of gigabytes
Execution        synchronous    async. and highly concurrent

1.1. Irregularity of RL training workloads

Modern RL algorithms are highly irregular in the computation patterns they create (Table 1), pushing the boundaries of computation models supported by popular distribution frameworks. This irregularity occurs at several levels:

1. The duration and resource requirements of tasks differ by orders of magnitude depending on the algorithm; e.g., A3C (Mnih et al., 2016) updates may take milliseconds, but other algorithms like PPO (Schulman et al., 2017) batch rollouts into much larger granularities.

2. Communication patterns vary, from synchronous to asynchronous gradient-based optimization, to having several types of asynchronous tasks in high-throughput off-policy algorithms such as Ape-X and IMPALA (Horgan et al., 2018; Espeholt et al., 2018).

3. Nested computations are generated by model-based hybrid algorithms (Table 2), hyperparameter tuning in conjunction with RL or DL training, or the combination of derivative-free and gradient-based optimization within a single algorithm (Silver et al., 2017).

4. RL algorithms often need to maintain and update substantial amounts of state, including policy parameters, replay buffers, and even external simulators.

As a consequence, developers have no choice but to use a hodgepodge of frameworks to implement their algorithms, including parameter servers, collective communication primitives in MPI-like frameworks, task queues, etc. For more complex algorithms, it is common to build custom distributed systems in which processes independently compute and coordinate among themselves with no central control (Figure 2(a)). While this approach can achieve high performance, the cost to develop and evaluate is large, not only due to the need to implement and debug distributed programs, but because composing these algorithms further complicates their implementation (Figure 3). Moreover, today's computation frameworks (e.g., Spark (Zaharia et al., 2010), MPI) typically assume regular computation patterns and have difficulty when sub-tasks have varying durations, resource requirements, or nesting.

1.2. Logically centralized control for distributed RL

It is desirable for a single programming model to capture all the requirements of RL training. This can be done without eschewing high-level frameworks that structure the computation. Our key insight is that for each distributed RL algorithm, an equivalent algorithm can be written that exhibits logically centralized program control (Figure 2(b)). That is, instead of having independently executing processes (A, B, C, D in Figure 2(a)) coordinate among themselves (e.g., through RPCs, shared memory, parameter servers, or collective communication), a single driver program (D in Figures 2(b) and 2(c)) can delegate algorithm sub-tasks to other processes to execute in parallel. In this paradigm, the worker processes A, B, and C passively hold state (e.g., policy or simulator state) but execute no computations until called by D. To support nested computations, we propose extending the centralized control model with hierarchical delegation of control (Figure 2(c)), which allows the worker processes (e.g., B, C) to further delegate work (e.g., simulations, gradient computation) to sub-workers of their own when executing tasks.
[Figure 2 diagrams: processes A, B, C and driver D; legend: remote call, data transfer, inactive process]
(a) Distributed Control (b) Logically Centralized Control (c) Hierarchical Control
Figure 2. Most RL algorithms today are written in a fully distributed style (a) where replicated processes independently compute and
coordinate with each other according to their roles (if any). We propose a hierarchical control model (c), which extends (b) to support
nesting in RL and hyperparameter tuning workloads, simplifying and unifying the programming models used for implementation.
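To illustrate the control model on a single machine before introducing any particular framework, the following is a minimal sketch of logically centralized control: one driver delegates sub-tasks to passive workers that hold state but compute only when called. The RolloutWorker class and its methods are hypothetical stand-ins for the workers A, B, C in Figure 2(b), not RLlib APIs.

# Minimal sketch of logically centralized control (Figure 2(b)) on one
# machine: a single driver delegates sub-tasks to passive workers that
# hold state but compute only when called. All names are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import random


class RolloutWorker:
    """Passively holds policy/simulator state; computes only when asked."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.weights = 0.0

    def set_weights(self, weights):
        self.weights = weights

    def sample(self):
        # Stand-in for an environment rollout.
        return [self.weights + self.rng.random() for _ in range(5)]


def driver(num_workers=4, num_iters=3):
    workers = [RolloutWorker(seed=i) for i in range(num_workers)]
    weights = 0.0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(num_iters):
            # The driver decides what runs where; workers never
            # coordinate among themselves.
            batches = list(pool.map(lambda w: w.sample(), workers))
            grad = sum(sum(b) for b in batches) / sum(len(b) for b in batches)
            weights -= 0.01 * grad  # toy "update"
            for w in workers:
                w.set_weights(weights)
    return weights


if __name__ == "__main__":
    print(driver())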
Building on such a logically centralized and hierarchical control model has several important advantages. First, the equivalent algorithm is often easier to implement in practice, since the distributed control logic is entirely encapsulated in a single process rather than multiple processes executing concurrently. Second, the separation of algorithm components into sub-routines (e.g., do rollouts, compute gradients with respect to some policy loss) enables code reuse across different execution patterns. Sub-tasks that have different resource requirements (e.g., CPUs vs. GPUs) can be placed on different machines, reducing compute costs, as we show in Section 5. Finally, distributed algorithms written in this model can be seamlessly nested within each other, satisfying the parallelism encapsulation principle.

Logically centralized control models can be highly performant, and our proposed hierarchical variant even more so. This is because the bulk of data transfer (blue arrows in Figure 2) between processes happens out of band of the driver, not passing through any central bottleneck. In fact, many highly scalable distributed systems (Zaharia et al., 2010; Chang et al., 2008; Dean & Ghemawat, 2008) leverage centralized control in their design. Within a single differentiable tensor graph, frameworks like TensorFlow also implement logically centralized scheduling of tensor computations onto available physical devices. Our proposal extends this principle into the broader ML systems design space.

The contributions of this paper are as follows.

1. We propose a general and composable hierarchical programming model for RL training (Section 2).

2. We describe RLlib, our highly scalable RL library, and how it builds on the proposed model to provide scalable abstractions for a broad range of RL algorithms, enabling rapid development (Section 3).

3. We discuss how performance is achieved within the proposed model (Section 4), and show that RLlib meets or exceeds state-of-the-art performance for a wide variety of RL workloads (Section 5).

2. Hierarchical Parallel Task Model

As highlighted in Figure 3, parallelization of entire programs using frameworks like MPI (Gropp et al., 1996) and Distributed TensorFlow (Abadi et al., 2016) typically requires explicit algorithm modifications to insert points of coordination when trying to compose two programs or components together. This limits the ability to rapidly prototype novel distributed RL applications. Though the example in Figure 3 is simple, new hyperparameter tuning algorithms for long-running training tasks, e.g., HyperBand and Population Based Training (PBT) (Li et al., 2016; Jaderberg et al., 2017), increasingly demand fine-grained control over training.

We propose building RL libraries with hierarchical and logically centralized control on top of flexible task-based programming models like Ray (Moritz et al., 2017). Task-based systems allow subroutines to be scheduled and executed asynchronously on worker processes, on a fine-grained basis, and for results to be retrieved or passed between processes.

2.1. Relation to existing distributed ML abstractions

Though typically formulated for distributed control, abstractions such as parameter servers and collective communication operations can also be used within a logically centralized control model. As an example, RLlib uses allreduce and parameter servers in some of its policy optimizers (Figure 4), and we evaluate their performance in Section 5.

2.2. Ray implementation of hierarchical control

We note that, within a single machine, the proposed programming model can be implemented simply with thread pools and shared memory, though it is desirable for the underlying framework to scale to larger clusters if needed. We chose to build RLlib on top of the Ray framework, which allows Python tasks to be distributed across large clusters. Ray's distributed scheduler is a natural fit for the hierarchical control model, as nested computation can be implemented in Ray with no central task-scheduling bottleneck.
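As a minimal sketch of what hierarchical control looks like on Ray, a top-level driver can call into worker actors, which in turn launch sub-tasks of their own. The Worker actor and simulate task below are illustrative names, not RLlib components.

# Sketch of hierarchical (nested) control on Ray: the driver delegates
# to worker actors, and each worker delegates simulations to sub-tasks.
# Class and function names are illustrative, not RLlib APIs.
import ray

ray.init()


@ray.remote
def simulate(seed):
    # Sub-task launched by a worker (e.g., one environment rollout).
    return seed * 2


@ray.remote
class Worker:
    def __init__(self, worker_id):
        self.worker_id = worker_id

    def collect(self, num_sims):
        # Hierarchical delegation: the worker fans out its own sub-tasks.
        refs = [simulate.remote(self.worker_id * 100 + i)
                for i in range(num_sims)]
        return sum(ray.get(refs))


# Logically centralized driver: creates passive workers and calls them.
workers = [Worker.remote(i) for i in range(4)]
results = ray.get([w.collect.remote(8) for w in workers])
print(results)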
(a) Allreduce (b) Local Multi-GPU (c) Asynchronous (d) Sharded Param-server
Figure 4. Pseudocode for four RLlib policy optimizer step methods. Each step() operates over a local policy graph and array of remote
evaluator replicas. Ray remote calls are highlighted in orange; other Ray primitives in blue (Section 4). Apply is shorthand for updating
weights. Minibatch code and helper functions omitted. The param server optimizer in RLlib also implements pipelining not shown here.
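The pseudocode panels themselves are not reproduced here; the following is a rough, runnable sketch in the spirit of the asynchronous optimizer of Figure 4(c), with illustrative evaluator and policy-graph interfaces rather than RLlib's actual API.

# Rough sketch, in the spirit of Figure 4(c), of an asynchronous policy
# optimizer step() that applies gradients as evaluators finish.
# The evaluator/graph interfaces below are illustrative stand-ins.
import ray

ray.init()


class LocalGraph:
    """Stand-in for the local policy graph held by the driver."""

    def __init__(self):
        self.weights = 0.0

    def get_weights(self):
        return self.weights

    def apply_gradients(self, grad):
        self.weights -= 0.01 * grad


@ray.remote
class Evaluator:
    def compute_gradients(self, weights):
        # Stand-in for sampling an episode and differentiating the loss.
        return weights + 1.0


def async_step(local_graph, evaluators, grads_to_wait=8):
    # Seed one gradient task per evaluator with the current weights.
    pending = {ev.compute_gradients.remote(local_graph.get_weights()): ev
               for ev in evaluators}
    # Apply gradients as they arrive, then immediately relaunch that
    # evaluator with fresh weights (asynchronous optimization).
    for _ in range(grads_to_wait):
        [done], _ = ray.wait(list(pending))
        ev = pending.pop(done)
        local_graph.apply_gradients(ray.get(done))
        pending[ev.compute_gradients.remote(local_graph.get_weights())] = ev


graph = LocalGraph()
async_step(graph, [Evaluator.remote() for _ in range(4)])
print(graph.weights)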
[Figure 5 plots: (a) Sharded Param. Server, (b) Ape-X in RLlib; y-axis: Environment Frames/s]

Figure 5. RLlib's centrally controlled policy optimizers match or exceed the performance of implementations in specialized systems. The RLlib parameter server optimizer using 8 internal shards is competitive with a Distributed TensorFlow implementation tested in similar conditions. RLlib's Ape-X policy optimizer scales to 160k frames per second with 256 workers at a frameskip of 4, more than matching a reference throughput of ∼45k fps at 256 workers, demonstrating that a single-threaded Python controller can efficiently scale to high throughputs.

RLlib separates the implementation of algorithms into the declaration of the algorithm-specific policy graph and the choice of an algorithm-independent policy optimizer. The policy optimizer is responsible for the performance-critical tasks of distributed sampling, parameter updates, and managing replay buffers. To distribute the computation, the optimizer operates over a set of policy evaluator replicas.

To complete the example, the developer chooses a policy optimizer and creates it with references to existing evaluators. The async optimizer uses the evaluator actors to compute gradients in parallel on many CPUs (Figure 4(c)). Each optimizer.step() runs a round of remote tasks to improve the model. Between steps, policy graph replicas can be queried directly, e.g., to print out training statistics:

optimizer = rllib.AsyncPolicyOptimizer(
    graph=PolicyGradient, workers=evaluators)
while True:
    optimizer.step()
    print(optimizer.foreach_policy(
        lambda p: p.get_train_stats()))

Policy optimizers extend the well-known gradient-descent optimizer abstraction to the RL domain. A typical gradient-descent optimizer implements step(L(θ), X, θ) ⇒ θ_opt. RLlib's policy optimizers instead operate over the local policy graph G and a set of remote evaluator replicas, i.e., step(G, ev_1 ... ev_n, θ) ⇒ θ_opt, capturing the sampling phase of RL as part of optimization (i.e., calling sample() on policy evaluators to produce new simulation data).

The policy optimizer abstraction has the following advantages. By separating execution strategy from policy and loss definitions, specialized optimizers can be swapped in to take advantage of available hardware or algorithm features without needing to change the rest of the algorithm. The policy graph class encapsulates interaction with the deep learning framework, allowing algorithm authors to avoid mixing distributed systems code with numerical computations, and enabling optimizer implementations to be improved and reused across different deep learning frameworks.

As shown in Figure 4, by leveraging centralized control, policy optimizers succinctly capture a broad range of choices in RL optimization: synchronous vs. asynchronous, allreduce vs. parameter server, and use of GPUs vs. CPUs. RLlib's policy optimizers provide performance comparable to optimized parameter server (Figure 5(a)) and MPI-based implementations (Section 5). Pulling out this optimizer abstraction is easy in a logically centralized control model since each policy optimizer has full control over the distributed computation it implements.

3.4. Completeness and Generality of Abstractions

We demonstrate the completeness of RLlib's abstractions by formulating the algorithm families listed in Table 2 within the API. When applicable, we also describe the concrete implementation in RLlib:
Table 2. RLlib's policy optimizers and evaluators capture common components (Evaluation, Replay, Gradient-based Optimizer) within a logically centralized control model, and leverage Ray's hierarchical task model to support other distributed components.

Algorithm Family      Policy Evaluation  Replay Buffer  Gradient-Based Optimizer  Other Distributed Components
DQNs                  X                  X              X
Policy Gradient       X                                 X
Off-policy PG         X                  X              X
Model-Based/Hybrid    X                                 X                         Model-Based Planning
Multi-Agent           X                  X              X
Evolutionary Methods  X                                                           Derivative-Free Optimization
AlphaGo               X                  X              X                         MCTS, Derivative-Free Optimization
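The descriptions below are phrased in terms of the policy graph's components: the policy π_θ(o_t, h_t), the trajectory postprocessor ρ_θ, the loss L, extra outputs y_i, and auxiliary update functions u_i. A minimal sketch of such an interface, with hypothetical method names rather than RLlib's exact API, might look like this:

# Hypothetical sketch of a policy graph interface exposing the components
# referenced below: policy pi_theta, postprocessor rho_theta, loss L,
# extra outputs y_i, and update functions u_i. Not RLlib's exact API.
class PolicyGraphSketch:
    def act(self, obs_batch, state):
        """pi_theta(o_t, h_t): choose actions; may also emit extra
        outputs y_i (e.g., value predictions or TD errors)."""
        raise NotImplementedError

    def postprocess(self, sample_batch, other_agent_batches=None):
        """rho_theta: transform a trajectory, e.g., n-step returns or
        advantage estimates; can collate other agents' experience."""
        raise NotImplementedError

    def loss(self, postprocessed_batch):
        """L: a differentiable scalar objective over a batch."""
        raise NotImplementedError

    def update_target(self):
        """u_1: e.g., sync a target network for DQN-style methods."""
        raise NotImplementedError

    def set_exploration(self, epsilon):
        """u_2: e.g., adjust the exploration schedule."""
        raise NotImplementedError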
DQNs: DQNs use y_1 for storing TD error, implement n-step return calculation in ρ_θ, and the Q loss in L. Target updates are implemented in u_1, and setting the exploration ε in u_2.

DQN implementation: To support experience replay, RLlib's DQN uses a policy optimizer that saves collected samples in an embedded replay buffer. The user can alternatively use an asynchronous optimizer (Figure 4(c)). The target network is updated by calls to u_1 between optimizer steps.

Ape-X implementation: Ape-X (Horgan et al., 2018) is a variation of DQN that leverages distributed experience prioritization to scale to many hundreds of cores. To adapt our DQN implementation, we created policy evaluators with a distribution of ε values, and wrote a new high-throughput policy optimizer (∼200 lines) that pipelines the sampling and transfer of data between replay buffer actors using Ray primitives. Our implementation scales nearly linearly up to 160k environment frames per second with 256 workers (Figure 5(b)), and the optimizer can compute gradients for ∼8.5k 80×80×4 observations/s on a V100 GPU.

Policy Gradient / Off-policy PG: These algorithms store value predictions in y_1, implement advantage estimation using ρ_θ, and combine actor and critic losses in L.

PPO implementation: Since PPO's loss function permits multiple SGD passes over sample data, when there is sufficient GPU memory RLlib chooses a GPU-optimized policy optimizer (Figure 4(b)) that pins data into local GPU memory. In each iteration, the optimizer collects samples from evaluator replicas, performs multi-GPU optimization locally, and then broadcasts the new model weights.

A3C implementation: RLlib's A3C can use either the asynchronous (Figure 4(c)) or sharded parameter server policy optimizer (4(d)). These optimizers collect gradients from the policy evaluators to update the canonical copy of θ.

DDPG implementation: RLlib's DDPG uses the same replay policy optimizer as DQN. L includes both actor and critic losses. The user can also choose to use the Ape-X policy optimizer with DDPG.

Model-based / Hybrid: Model-based RL algorithms extend π_θ(o_t, h_t) to make decisions based on model rollouts, which can be parallelized using Ray. To update their environment models, the model loss can either be bundled with L, or the model trained separately (i.e., in parallel using Ray primitives) and its weights periodically updated via u_1.

Multi-Agent: Policy evaluators can run multiple policies at once in the same environment, producing batches of experience for each agent. Many multi-agent algorithms use a centralized critic or value function, which we support by allowing ρ_θ to collate experiences from multiple agents.

Evolutionary Methods: Derivative-free methods can be supported through non-gradient-based policy optimizers.

Evolution Strategies (ES) implementation: ES is a derivative-free optimization algorithm that scales well to clusters with thousands of CPUs. We were able to port a single-threaded implementation of ES to RLlib with only a few changes, and further scale it with an aggregation tree of actors (Figure 8(a)), suggesting that the hierarchical control model is both flexible and easy to adapt algorithms for.

PPO-ES experiment: We studied a hybrid algorithm that runs PPO updates in the inner loop of an ES optimization step that randomly perturbs the PPO models. The implementation took only ∼50 lines of code and did not require changes to PPO, showing the value of encapsulating parallelism. In our experiments, PPO-ES converged faster and to a higher reward than PPO on the Walker2d-v1 task. A similarly modified A3C-ES implementation solved PongDeterministic-v4 in 30% less time.

AlphaGo: We sketch how to scalably implement the AlphaGo Zero algorithm using a combination of Ray and RLlib abstractions. Pseudocode for the ∼70 line main algorithm loop is provided in the Supplementary Material.

1. Logically centralized control of multiple distributed components: AlphaGo Zero uses multiple distributed components: model optimizers, self-play evaluators, candidate model evaluators, and the shared replay buffer. These components are manageable as Ray actors under a top-level AlphaGo policy optimizer. Each optimizer step loops over actor statuses to process new results, routing data between actors and launching new actor instances.
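As a hedged sketch of such a top-level loop (the paper's actual ∼70-line pseudocode is in its Supplementary Material), the driver can poll actor results with ray.wait() and route object references between components. All actor classes and methods below are illustrative stand-ins.

# Hedged sketch of a top-level, logically centralized AlphaGo-style loop
# on Ray. All actor classes and methods are illustrative stand-ins, not
# the paper's actual pseudocode.
import random
import ray

ray.init()


@ray.remote
class SelfPlayer:
    def __init__(self):
        self.weights = 0

    def play_game(self):
        # Stand-in for an MCTS self-play game.
        return [random.random() for _ in range(10)]

    def set_weights(self, weights):
        self.weights = weights


@ray.remote
class ReplayBuffer:
    def __init__(self):
        self.games = []

    def add(self, game):
        self.games.append(game)

    def sample(self):
        return random.choice(self.games)


@ray.remote
class Optimizer:
    def optimize(self, batch):
        return sum(batch)  # stand-in for new candidate weights


@ray.remote
class CandidateEvaluator:
    def evaluate(self, weights):
        return random.random()  # stand-in for a measured win rate


players = [SelfPlayer.remote() for _ in range(4)]
buffer = ReplayBuffer.remote()
opt = Optimizer.remote()
cand = CandidateEvaluator.remote()

for _ in range(3):
    # Route finished self-play results into the shared buffer by passing
    # object references between actors.
    game_refs = [p.play_game.remote() for p in players]
    ready, _ = ray.wait(game_refs, num_returns=len(game_refs))
    for ref in ready:
        buffer.add.remote(ref)

    # One learning round, then the best-player gate (>= 55% win rate).
    weights_ref = opt.optimize.remote(buffer.sample.remote())
    if ray.get(cand.evaluate.remote(weights_ref)) >= 0.55:
        for p in players:
            p.set_weights.remote(weights_ref)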
[Figure 6 component labels: train(), Evaluation, PPO, Self-Play Evaluators (MCTS), sample data, best player weights, Shared Replay Buffer, Candidate Evaluation, Policy Optimizer, Optimizers, candidate weights]
Figure 6. Complex RL architectures are easily captured within RLlib's hierarchical control model. Here, blue lines denote data transfers and orange lines denote lighter-overhead method calls. Each train() call encompasses a batch of remote calls between components.
2. Shared replay buffer: AlphaGo Zero stores the experiences from self-play evaluator instances in a shared replay buffer. This requires routing game results to the shared buffer, which is easily done by passing the result object references from actor to actor.

3. Best player: AlphaGo Zero tracks the current best model and only populates its replay buffer with self-play from that model. Candidate models must achieve a ≥ 55% victory margin to replace the best model. Implementing this amounts to adding an if block in the main control loop.

4. Monte-Carlo tree search: MCTS (i.e., model-based planning) can be handled as a subroutine of the policy graph, and optionally parallelized as well using Ray.

HyperBand and Population Based Training: Ray includes distributed implementations of hyperparameter search algorithms such as HyperBand and PBT (Li et al., 2016; Jaderberg et al., 2017). We were able to use these to evaluate RLlib algorithms, which are themselves distributed, with the addition of ∼15 lines of code per algorithm. We note that these algorithms are non-trivial to integrate when using distributed control models due to the need to modify existing code to insert points of coordination (Figure 3). RLlib's use of short-running tasks avoids this problem, since control decisions can be easily made between tasks.

4. Framework Performance

In this section, we discuss properties of Ray (Moritz et al., 2017) and other optimizations critical to RLlib.

4.1. Single-node performance

Stateful computation: Tasks can share mutable state with other tasks through Ray actors. This is critical for tasks that operate on and mutate stateful objects like third-party simulators or neural network weights.

Shared memory object store: RL workloads involve sharing large quantities of data (e.g., rollouts and neural network weights). Ray supports this by allowing data objects to be passed directly between workers without any central bottleneck. In Ray, workers on the same machine can also read data objects through shared memory without copies.

Vectorization: RLlib can batch policy evaluation to improve hardware utilization (Figure 7), supports batched environments, and passes experience data between actors efficiently in columnar array format.

4.2. Distributed performance

Lightweight tasks: Remote call overheads in Ray are on the order of ∼200µs when scheduled on the same machine. When machine resources are saturated, tasks spill over to other nodes, increasing latencies to around ∼1ms. This enables parallel algorithms to scale seamlessly to multiple machines while preserving high single-node throughput.

Nested parallelism: Building RL algorithms by composing distributed components creates multiple levels of nested parallel calls (Figure 1). Since components make decisions that may affect downstream calls, the call graph is also inherently dynamic. Ray supports this by allowing any Python function or class method to be invoked remotely as a lightweight task. For example, func.remote() executes func remotely and immediately returns a placeholder result which can later be retrieved or passed to other tasks.

Resource awareness: Ray allows remote calls to specify resource requirements and utilizes a resource-aware scheduler to preserve component performance. Without this, distributed components can improperly allocate resources, causing algorithms to run inefficiently or fail.

Fault tolerance and straggler mitigation: Failure events become significant at scale (Barroso et al., 2013). RLlib leverages Ray's built-in fault tolerance mechanisms (Moritz et al., 2017), reducing costs with preemptible cloud compute instances (Amazon, 2011; Google, 2015). Similarly, stragglers can significantly impact the performance of distributed algorithms at scale (Dean & Barroso, 2013). RLlib supports straggler mitigation in a generic way via the ray.wait() primitive. For example, in PPO we use this to drop the slowest evaluator tasks, at the cost of some bias.

Data compression: RLlib uses the LZ4 algorithm to compress experience batches. For image observations, LZ4 reduces network traffic and memory usage by more than an order of magnitude, at a compression rate of ∼1 GB/s/core.
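For illustration only (not RLlib's actual code path), compressing a batch of image observations with the lz4 Python package looks roughly like this:

# Illustration of LZ4-compressing a batch of image observations, in the
# spirit of the experience-batch compression described above. This is a
# standalone example, not RLlib's actual implementation.
import lz4.frame
import numpy as np

# A batch of 256 stacked 80x80x4 uint8 frames. Zeros are used here so the
# example compresses well; real image frames are also highly redundant.
obs_batch = np.zeros((256, 80, 80, 4), dtype=np.uint8)

raw = obs_batch.tobytes()
compressed = lz4.frame.compress(raw)
print(len(raw) / len(compressed), "x compression")

# Decompress and restore the original array shape and dtype.
restored = np.frombuffer(lz4.frame.decompress(compressed), dtype=np.uint8)
restored = restored.reshape(obs_batch.shape)
assert np.array_equal(restored, obs_batch)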
5. Evaluation

[Figure 7 plots: Actions/s vs. Number of Policy Evaluators (1 to 128) for Pong-GPU and Pendulum-CPU]

Sampling efficiency: Policy evaluation is an important building block for all RL algorithms. In Figure 7 we benchmark the scalability of gathering samples from policy evaluators.
References

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop on Machine Learning Systems (LearningSys'16), 2016.

Dean, J. and Barroso, L. A. The tail at scale. Communications of the ACM, 56(2):74-80, 2013.

Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

Kostrikov, I. PyTorch implementation of advantage actor critic (A2C), proximal policy optimization (PPO) and scalable trust-region method for deep reinforcement learning. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr, 2017.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. 2016.

Li, M., Andersen, D. G., Park, J. W., Smola, A., and Ahmed, A. Scaling distributed machine learning with the parameter server. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, volume 583, pp. 598, 2014.

Microsoft. ONNX: Open neural network exchange format. https://onnx.ai, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Paul, W., Jordan, M. I., and Stoica, I. Ray: A distributed framework for emerging AI applications. arXiv preprint arXiv:1712.05889, 2017.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. 2017.

Tian, Y., Gong, Q., Shang, W., Wu, Y., and Zitnick, L. ELF: an extensive, lightweight and flexible research platform for real-time strategy games. CoRR, abs/1707.01067, 2017. URL http://arxiv.org/abs/1707.01067.

Zaharia, M., Chowdhury, N. M., Franklin, M., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. 2010.