Acceleration for Deep Reinforcement Learning using
Parallel and Distributed Computing: A Comprehensive
Survey
Authors: Zhihong Liu, Xin Xu, Peng Qiao, Dongsheng Li (National University of Defense
Technology, China)
Course Code: CIT 5203
Course Title: Parallel and Distributed Computing
Presented By
A.B.M. Shifar Emtiuz
ID: 6232020107
Reg: 08417
Introduction
• Overview: Deep Reinforcement Learning (DRL) merges deep learning and
reinforcement learning, driving AI breakthroughs in gaming, robotics, and
healthcare.
• Problem: DRL training is computationally intensive due to large neural
networks, massive experience data, and complex hyper-parameter tuning.
• Solution: Parallel and distributed computing accelerates DRL training.
• Survey Goal: Provide a comprehensive review of state-of-the-art acceleration
methods, including taxonomy and open-source platforms.
Motivation
1. Why Accelerate DRL (Deep Reinforcement Learning):
• Example: training DQN on a single Atari game takes about 50M frames, roughly 38 days of game experience.
• Growing model and task complexity demands faster training for real-world applications.
2. Contribution: First survey to systematically classify DRL acceleration techniques using parallel and distributed computing.
3. Scope: Covers architectures, parallelism strategies, synchronization, evolutionary methods, and open-source libraries.
DRL (Deep Reinforcement Learning)
Fundamentals
• Definition: DRL involves agents learning optimal policies through
environment interactions, collecting experience data (state, action,
next state, reward).
• Key Features:
• Handles high-dimensional state spaces.
• Self-learns without labeled data.
• Supports offline training (e.g., in simulation) for online decision problems (see the interaction-loop sketch below).
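A minimal sketch of the agent–environment interaction loop that produces experience tuples (state, action, next state, reward). The Gymnasium API, the CartPole environment, and the random policy are illustrative assumptions, not part of the survey:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")           # any Gym-style environment
replay = []                             # collected experience tuples

state, _ = env.reset(seed=0)
for _ in range(1000):
    action = env.action_space.sample()  # placeholder for a learned policy
    next_state, reward, terminated, truncated, _ = env.step(action)
    replay.append((state, action, next_state, reward))   # experience tuple
    state = next_state
    if terminated or truncated:
        state, _ = env.reset()
env.close()
```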
DRL Classification
Model-Based DRL:
• Uses known or learned environment dynamics (e.g., MB-MPO, AlphaZero).
Model-Free DRL:
• Value-based (e.g., DQN), policy-based, or actor-critic (e.g., PPO, SAC).
On-Policy vs. Off-Policy:
• On-policy: Stable, less sample-efficient.
• Off-policy: Sample-efficient, less stable.
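To make the value-based, off-policy case concrete, a sketch of the DQN temporal-difference target y = r + γ · max_a Q_target(s', a) over a sampled mini-batch; PyTorch and the function name are assumed choices, the survey does not prescribe a framework:

```python
import torch

def dqn_targets(rewards, next_states, dones, target_net, gamma=0.99):
    """TD targets for a mini-batch: y = r + gamma * max_a Q_target(s', a)."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values   # max over actions
        return rewards + gamma * (1.0 - dones) * next_q      # bootstrap unless terminal
```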
Challenges in DRL Training
• Sample Generation: Producing billions of experience samples (e.g., 2.5B frames for navigation tasks).
• System Coordination: Managing actors, learners, and parameter servers.
• Workload Heterogeneity: Mixed CPU- and GPU-bound workloads cause frequent data movement across devices.
• Synchronization Issues: Stale (obsolete) gradients in heterogeneous environments.
• Optimization Limits: Stochastic gradient descent is prone to local optima.
System Architectures
Overview
• Components:
• Actors: Environment interaction.
• Learners: Gradient computation.
• Parameter Servers: Model maintenance.
• Replay Memory: Off-policy data storage.
• Architecture Types: Centralized (star topology)
vs. Decentralized (fully connected).
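A toy, single-process sketch of how these components relate in a centralized (star) layout: actors fill the replay memory, and the learner samples from it and pushes updates to the parameter server. In a real system each role runs on its own worker and actors also pull the latest policy; the stand-in environment and all names here are illustrative assumptions:

```python
import collections
import random

class ParameterServer:
    """Maintains the global model (here just a flat parameter vector)."""
    def __init__(self, dim):
        self.params = [0.0] * dim
    def pull(self):
        return list(self.params)
    def push(self, grads, lr=0.1):
        self.params = [p - lr * g for p, g in zip(self.params, grads)]

replay = collections.deque(maxlen=10_000)   # replay memory: off-policy data storage

def actor(n_steps=100):
    """Actor: interact with a (stand-in) environment and store experience tuples."""
    for _ in range(n_steps):
        state, action = random.random(), random.randint(0, 1)
        next_state, reward = random.random(), random.random()
        replay.append((state, action, next_state, reward))

def learner(server, batch_size=32):
    """Learner: sample a batch and push a (toy) gradient to the parameter server."""
    batch = random.sample(replay, min(batch_size, len(replay)))
    grad = sum(r for *_, r in batch) / len(batch)   # stand-in for a real gradient
    server.push([grad] * len(server.params))

server = ParameterServer(dim=4)
actor()
learner(server)
```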
Centralized Architectures
• Features: Central node maintains global model,
simpler synchronization but scalability bottleneck.
• Examples:
• Gorila: Massively distributed DQN, 10x faster
training.
• APE-X: Prioritized experience replay for DQN.
• A3C: On-policy learning with actor-learner
threads.
Decentralized Architectures
• Features: Multiple learners aggregate gradients via all-reduce; scalable, but with synchronization overhead.
• Examples:
• IMPALA: CPU actors, GPU learners, synchronous updates.
• rlpyt: Multi-GPU acceleration with all-reduce.
• DD-PPO: Scales across machines with a combined actor-learner.
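A hedged sketch of the all-reduce step that decentralized learners use to average gradients after each local backward pass, using torch.distributed; the process group is assumed to be initialized by the launcher (e.g. torchrun), and the helper name is our own:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model):
    """Average gradients across all learner processes (decentralized update)."""
    # assumes dist.init_process_group(...) was already called by the launcher
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum over learners
            p.grad /= world_size                            # then average

# Typical use inside each learner, after loss.backward():
#   allreduce_gradients(model)
#   optimizer.step()
```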
Simulation Parallelism
• Role: Simulations (e.g., OpenAI Gym, MuJoCo) generate
training samples.
• Challenges: High computational cost for physics and
rendering.
• Solutions:
• Parallel simulation environments.
• GPU-accelerated simulations for robotics.
• Large-batch simulations for efficiency.
• Platforms: OpenAI Gym, MuJoCo, Unity ML, Gazebo,
AirSim, Brax.
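A minimal sketch of environment-level parallelism with Gymnasium's vectorized API, which steps several simulator instances in parallel worker processes; CartPole and the worker count of 8 are arbitrary choices for illustration:

```python
import gymnasium as gym
from gymnasium.vector import AsyncVectorEnv

# 8 CartPole instances, each simulated in its own worker process
envs = AsyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(8)])

obs, _ = envs.reset(seed=0)
for _ in range(100):
    actions = envs.action_space.sample()                  # batch of 8 actions
    obs, rewards, terms, truncs, _ = envs.step(actions)   # one parallel step
envs.close()
```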
Computing Parallelism
• Data Parallelism: Splits data across workers with model replicas.
• Model Parallelism: Divides neural network for large models.
• Pipeline Parallelism: Processes model layers in stages.
• Hybrid Approaches: Combine multiple schemes (e.g., ZeRO, HeiPipe).
• Hardware Support: GPUs for learners, CPUs for actors, FPGAs for specialized tasks.
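A hedged sketch of data parallelism for the learner with PyTorch DistributedDataParallel: each worker holds a model replica, computes gradients on its shard of the batch, and DDP averages them via all-reduce during backward(). The placeholder network and process-group details are assumptions:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def build_learner():
    # assumes the launcher (e.g. torchrun) set MASTER_ADDR/PORT, RANK, WORLD_SIZE
    dist.init_process_group(backend="gloo")
    model = torch.nn.Linear(128, 4)          # placeholder policy/value network
    ddp_model = DDP(model)                   # replicates model, syncs gradients
    opt = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    return ddp_model, opt

def train_step(ddp_model, opt, states, targets):
    """Per-worker step: gradients are all-reduced inside backward()."""
    loss = torch.nn.functional.mse_loss(ddp_model(states), targets)
    opt.zero_grad()
    loss.backward()                          # DDP averages gradients across replicas here
    opt.step()
```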
Distributed Synchronization
• Goal: Synchronize backpropagation-based training.
• Methods:
• Synchronous: Stable but slower (e.g., IMPALA).
• Asynchronous: Faster but less stable (e.g., A3C, APE-X).
• Prioritized Experience Replay: Focuses on high-error samples.
• Innovation: Gossip-based peer-to-peer synchronization.
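A minimal sketch of proportional prioritized experience replay: samples are drawn with probability proportional to their TD error (plus a small constant), which is the "focus on high-error samples" idea above. The α value and buffer layout are illustrative assumptions, and importance-sampling weights are omitted for brevity:

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:          # drop the oldest when full
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i] for i in idx], idx      # idx lets the learner refresh priorities
```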
Deep Evolutionary Reinforcement
Learning
• Approach: Evolution-based training avoids local optima.
• Techniques:
• Evolution Strategies: Scalable RL alternative.
• Neural Architecture Search: Evolves network topologies.
• Population-Based Training: Enhances policy diversity.
• Pros: Robust, scalable. Cons: Computationally expensive.
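A toy sketch of the Evolution Strategies update (in the style of OpenAI ES): perturb the parameter vector with Gaussian noise, evaluate each perturbation, and move the parameters along the reward-weighted average of the noise. The quadratic "fitness" function is a placeholder for an episode return:

```python
import numpy as np

def fitness(params):
    """Placeholder for an episode return; here: maximize -||params - 3||^2."""
    return -np.sum((params - 3.0) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(5)                     # policy parameters
sigma, lr, pop_size = 0.1, 0.02, 50     # noise scale, step size, population size

for generation in range(200):
    noise = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([fitness(theta + sigma * n) for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize returns
    theta += lr / (pop_size * sigma) * noise.T @ rewards           # ES gradient estimate
```

In a distributed setting, each worker evaluates a slice of the population, so only scalar rewards and random seeds need to be communicated.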
Advanced Parallelism Techniques
• Pipeline Parallelism:
• Splits model layers into stages for pipelined processing.
• Example: HeiPipe for DRL with layered neural networks.
• Hybrid Parallelism:
• Combines data, model, and pipeline parallelism.
• Example: ZeRO eliminates memory redundancy across data-parallel workers.
• DRL Adaptation: Handles dynamic workloads from environment interactions.
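A bare-bones sketch of the model partitioning that pipeline parallelism builds on: the network is split into two stages on different devices and activations flow between them. Real schedulers (GPipe-style micro-batching) then overlap the stages, which this sketch does not do; the fallback to CPU when fewer than two GPUs are present is an assumption for illustration:

```python
import torch
import torch.nn as nn

dev0 = "cuda:0" if torch.cuda.device_count() >= 1 else "cpu"
dev1 = "cuda:1" if torch.cuda.device_count() >= 2 else dev0

stage1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to(dev0)   # first stage
stage2 = nn.Sequential(nn.Linear(256, 4)).to(dev1)                # second stage

x = torch.randn(32, 128, device=dev0)
h = stage1(x).to(dev1)        # activations move between stages
y = stage2(h)

# A pipeline scheduler would split the batch into micro-batches so that
# stage1 works on micro-batch i+1 while stage2 works on micro-batch i.
```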
Hardware Acceleration for Parallelism
• Hardware Roles:
• GPUs: Accelerate learner gradient computations.
• CPUs: Handle actor environment interactions.
• FPGAs: Optimize specific tasks (e.g., FA3C, QtAccel).
• Benefits: Reduces training time (e.g., GPU-based simulations in robotics).
• Challenges: Hardware heterogeneity requires tailored frameworks.
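A minimal sketch of the CPU-actor / GPU-learner split: actors keep a CPU copy of the policy for cheap inference while stepping environments, and the learner updates the master copy on the GPU. The periodic weight copy is the synchronization point; the network and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

learner_net = nn.Linear(128, 4).to(device)     # learner: gradient updates on GPU
actor_net = nn.Linear(128, 4)                  # actor: inference stays on CPU

def sync_actor():
    """Copy the latest learner weights to the CPU actor network."""
    actor_net.load_state_dict(
        {k: v.cpu() for k, v in learner_net.state_dict().items()}
    )

with torch.no_grad():                          # actor-side action selection
    obs = torch.randn(1, 128)                  # observation from the environment
    action = actor_net(obs).argmax(dim=1)
```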
Open-Source Platforms
• Overview: 16 libraries compared for DRL development.
• Key Examples:
• Ray RLlib: Scalable RL framework.
• Sample Factory: Fine-grained worker optimization.
• Fiber: Supports population-based methods.
• Evaluation: Usability, scalability, algorithm support, hardware
compatibility.
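A hedged example of launching a parallel DRL run with Ray RLlib. The exact configuration API differs across Ray versions; the form below follows the Ray 2.x PPOConfig builder, and the environment name and worker count are arbitrary:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# PPO with 4 parallel rollout workers collecting experience
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=4)
)

algo = config.build()
for _ in range(5):
    result = algo.train()      # one distributed training iteration
# reported metrics (e.g. mean episode return) live in `result`;
# the exact key names vary by RLlib version
algo.stop()
```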
Future Research Directions
• Fine-Grained Workers: Task-specific roles (e.g., Actor, Policy, Trainer).
• Gossip-Based Architectures: Scalable peer-to-peer communication.
• LLM Integration: Enhancing RL with language model feedback.
• Gaps:
• Balancing actor-learner inference.
• Scalable simulations.
• Hybrid optimization (backpropagation + evolution).
Case Studies in Parallel DRL
• IMPALA: Scales to thousands of CPU cores for fast sample generation.
• APE-X: Achieves high throughput with prioritized experience replay.
• DD-PPO: Distributed PPO for multi-machine scalability.
• Impact: Reduces training time from days to hours for complex tasks.
Conclusion
1. Summary: DRL acceleration leverages architectures, parallelism, synchronization, and evolutionary methods.
2. Impact: Enables faster training for robotics, autonomous driving, and more.
3. Contribution: Comprehensive survey with taxonomy and future directions.
4. Closing: Distributed DRL innovation will drive AI advancements.
References
• Liu, Z., Xu, X., Qiao, P., & Li, D. (2023). Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Comprehensive Survey. ACM Computing Surveys, November 2023.