
Deep Reinforcement Learning for Bipedal Locomotion: A Brief Survey

Lingfan Bao1, Joseph Humphreys1, Tianhu Peng1 and Chengxu Zhou2

arXiv:2404.17070v1 [cs.RO] 25 Apr 2024

Abstract—Bipedal robots are garnering increasing global attention due to their potential applications and advancements in artificial intelligence, particularly in Deep Reinforcement Learning (DRL). While DRL has driven significant progress in bipedal locomotion, developing a comprehensive and unified framework capable of adeptly performing a wide range of tasks remains a challenge. This survey systematically categorizes, compares, and summarizes existing DRL frameworks for bipedal locomotion, organizing them into end-to-end and hierarchical control schemes. End-to-end frameworks are assessed based on their learning approaches, whereas hierarchical frameworks are dissected into layers that utilize either learning-based methods or traditional model-based approaches. This survey provides a detailed analysis of the composition, capabilities, strengths, and limitations of each framework type. Furthermore, we identify critical research gaps and propose future directions aimed at achieving a more integrated and efficient framework for bipedal locomotion, with potential broad applications in everyday life.

Index Terms—Deep Reinforcement Learning, Humanoid Robots, Bipedal Locomotion, Legged Robots

Fig. 1: Common bipedal and humanoid robots used as platforms for testing DRL frameworks. (a) NAO, a toy-like 3D humanoid robot, actuated by servo motors [5]. (b) RABBIT, a 2D bipedal robot, actuated by torque control [6]. (c) Cassie, a 3D bipedal robot, also actuated by torque control [7]. (d) ATLAS, a 3D humanoid robot, driven by hydraulics [8]. (e) Digit, a full human-sized 3D humanoid robot, an upgrade based on Cassie and actuated by torque control [9].

I. INTRODUCTION

Humans navigate complex and varied environments, performing diverse locomotion tasks with only two legs. To facilitate dynamic bipedal locomotion, model-based methods were introduced in the 1980s and have since evolved significantly [1], [2], [3]. These methods, characterized by rapid convergence, furnish a predictive framework for understanding environmental structures. However, they struggle to adapt in dynamically challenging environments that are difficult to model precisely. More recently, advancements in machine learning have provided novel approaches. Reinforcement learning (RL)-based methods, in particular, are adept at navigating the full dynamics of robot-environment interactions [4]. Additionally, hybrid approaches that combine model-based and learning-based methods have been developed to leverage the advantages of both. Yet, the question remains: Is there a unified framework capable of enabling bipedal robots to effectively manage a diverse range of locomotion tasks?

To address this, we explore recent advancements in deep reinforcement learning (DRL)-based frameworks, categorizing control schemes into two primary types: (i) end-to-end and (ii) hierarchical. End-to-end frameworks map robot states directly to control outputs at the joint level, while hierarchical frameworks adopt a structured approach, decomposing decision-making into multiple layers. Here, a High-Level (HL) planner addresses navigation and path planning, while a Low-Level (LL) controller focuses on fundamental locomotion skills. The highest decision-making tier, the task level, receives direct input from task or user commands.

The evolution of RL in bipedal robotics has spurred dynamic growth in innovative applications. Although the application of RL to simple 2D bipedal robots began in 2004 [10], [11], it took several years before deep reinforcement learning (DRL) algorithms emerged. These DRL-based methods have since shown promising results in physical simulators [12], [13], [14]. Agility Robotics introduced the first sim-to-real end-to-end learning framework in 2019, which was applied on the 3D torque-controlled bipedal robot Cassie, as shown in Fig. 1(c) [7]. Besides model-based reference learning, the policy can also incorporate motion capture data [15], [16], [17], [18], or start from scratch [19] to explore solutions freely. Recent studies demonstrate that end-to-end frameworks robustly handle complex and diverse tasks [20], [21], [22]. Similarly, hierarchical structures have garnered significant interest. Within this subset, the hybrid approach combines RL-based and model-based methods to enhance both planning and control strategies. A notable framework employs a learned High-Level (HL) planner coupled with a Low-Level (LL) model-based controller, often referred to as the Cascade-structure or Deep Planning Hybrid Scheme [23], [24], [25]. Another innovative construction integrates a learned feedback controller with an HL planner, categorized under the DRL Feedback Control Hybrid Scheme [26], [27]. Additionally, a learned hierarchical control scheme [28] decomposes locomotion into various tasks, focusing each layer on specific functions such as navigation and fundamental locomotion skills [12], [13], [29].

This work was supported by the Royal Society [grant number RG\R2\232409].
1 School of Mechanical Engineering, University of Leeds, UK. {mnlb, el20jeh, mntp}@leeds.ac.uk
2 Department of Computer Science, University College London, UK. chengxu.zhou@ucl.ac.uk
Fig. 2: Classification of DRL-based control schemes. (The diagram organizes DRL frameworks for bipedal locomotion into reference-based learning (residual learning and guided learning), reference-free learning, and hybrid control schemes (the deep planning hybrid scheme, the learning-based feedback hybrid scheme, and the learned hierarchy scheme).)

While several review papers discuss RL for general robotics [4] and model-based methods for bipeds [1], [2], [3], none specifically focus on DRL-based frameworks for bipeds. This survey aims to address this gap by summarizing current research progress, highlighting the structure and capabilities of bipedal locomotion frameworks, and exploring future directions. We also catalogue DRL-based frameworks, as depicted in Fig. 2. The primary contributions of this survey are:
• A comprehensive summary and cataloguing of DRL-based frameworks for bipedal locomotion.
• A detailed comparison of each control scheme, highlighting their strengths, limitations, and characteristics.
• The identification of current challenges and the provision of insightful future research directions.

The paper is organized as follows: Section II focuses on end-to-end frameworks, categorized by learning approaches. Section III details hierarchical frameworks, classified into three main types. Section IV addresses existing gaps, ongoing challenges, and potential future research directions. Finally, Section V concludes the paper.

II. END-TO-END FRAMEWORK

The end-to-end DRL framework represents a holistic approach where a single neural network (NN) policy, denoted π(·) : X → U, directly translates sensory inputs—such as images, lidar data, or proprioceptive feedback [30]—along with user commands [19] or pre-defined references [31], into joint-level control actions. These actions encompass motor torques [32], positions, and velocities [15]. This framework obviates the need for manually decomposing the problem into sub-tasks, streamlining the control process.

End-to-end strategies primarily simplify the design of low-level tracking to basic elements, such as a Proportional-Derivative (PD) controller. These methods can be broadly categorized based on their reliance on prior knowledge into two types: reference-based and reference-free. The locomotion skills developed through these diverse learning approaches exhibit considerable variation in performance and adaptability.
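To make this interface concrete, the sketch below (Python, with hypothetical gain values and function names) illustrates the control loop assumed by most end-to-end frameworks: the learned policy produces desired joint positions from the observation, and a joint-space PD controller converts them into motor torques at the control rate.

    import numpy as np

    def pd_torques(q_des, q, dq, kp=80.0, kd=2.0):
        # Joint-space PD law with zero desired velocity: tau = Kp*(q_des - q) - Kd*dq.
        return kp * (q_des - q) - kd * dq

    def end_to_end_step(policy, obs, q, dq):
        # One control step: the NN policy pi(.) maps observations to joint targets,
        # and a simple PD controller handles the low-level tracking.
        q_des = policy(obs)
        return pd_torques(q_des, q, dq)

Here `policy` stands for any trained network returning one target per actuated joint; the gains kp and kd are illustrative placeholders rather than values reported in the surveyed works.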
In the following sections, we will delve into various representation frameworks, exploring their characteristics, limitations, and strengths in comprehensive detail. To facilitate an understanding of these distinctions, Table I provides a succinct overview of the frameworks discussed.

A. Reference-based learning

Reference-based learning utilizes prior knowledge, allowing the policy to develop locomotion skills by adhering to predefined references, which may be derived from trajectory optimization (TO) techniques or captured through motion capture systems. This method facilitates the acquisition of locomotion skills compared to alternative approaches, though it typically results in locomotion patterns that closely resemble the predefined references or motion clips, thus limiting the variety of gait patterns. Generally, this approach can be divided into two primary methods: (i) residual learning and (ii) guided learning.

1) Residual learning: This method involves a framework that is aware of the current reference joint positions and applies offsets determined by the policy to modify motor commands at the current timestep. By utilizing predefined motion trajectories, the residual term acts as feedback control, compensating for errors and enabling the biped to achieve dynamic locomotion skills.

Introduced in 2018, a residual learning framework for the bipedal robot Cassie marked a significant advancement [33]. This framework allowed the robot to walk forward by incorporating a policy trained via Proximal Policy Optimization (PPO) algorithms, as detailed in Appendix A. The policy receives the robot's states and reference inputs, outputting a residual term that augments the reference at the current timestep. These modified references are then processed by a Proportional-Derivative (PD) controller to set the desired joint positions. While this framework enhanced the robot's ability to perform tasks beyond standing [39], its physical deployment on a bipedal robot has not yet occurred, potentially rendering it impractical for managing walking at varying speeds and limiting movement to a single direction.

To transition this framework to a real robot, a sim-to-real strategy based on the previous model was demonstrated, where the policy, trained through a residual learning approach, was subsequently applied on a physical bipedal robot [34]. This process and its key points are further explored in Appendix B.
TABLE I: Summary and comparison of reference-based and reference-free learning approaches for the end-to-end framework. The dashed line in the implementation flow chart denotes an optional input.

Residual learning
  Works and capabilities: [33] forward walk; [34] unidirectional walk; [35] omni-walk.
  Characteristic: adds a residual term to the known motor positions at the current time step.
  Advantages: fast convergence speed.
  Disadvantages: requires a high-quality predefined reference, is limited to specific motions, and lacks robustness on complicated terrains.
  Implementation flow: user commands and reference → policy π → robot.

Guided learning
  Works and capabilities: [36] forward walk; [31] versatile walk; [37] versatile jump; [22] versatile motions.
  Characteristic: mimics the predefined reference and directly specifies joint-level commands.
  Advantages: accelerates the learning process and is robust to terrains.
  Disadvantages: limited to the predefined motions and lacks adaptability to unforeseen changes in the environment.
  Implementation flow: user commands and reference → policy π → robot.

Reference-free learning
  Works and capabilities: [19] periodic motions; [38] stepping stones; [20] vision-based.
  Characteristic: learns locomotion skills from scratch without any prior knowledge.
  Advantages: high potential for gait exploration and high robustness on complicated terrain.
  Disadvantages: requires intensive reward shaping for gait patterns and relatively expensive computational training resources.
  Implementation flow: user commands → policy π → robot.

Forward Walk involves the bipeds walking straight ahead. Unidirectional Walk enables the bipeds to move both forward and backward within a range of desired velocities. Omni-Walk grants the bipeds the ability to walk in any direction. Versatile Walk allows the bipeds to walk forward, backward, turn, and move sideways, providing extensive movement capabilities. Periodic Motions entails the execution of various repeated gait patterns, such as walking, hopping, or galloping. Versatile Jump refers to jumps towards different desired targets. Versatile Motions cover performing a broad array of motions, both periodic and aperiodic, such as jumping.

Compared to model-based methods, this training policy achieves faster running speeds on the same platform, underlining the considerable potential of DRL-based frameworks. However, the robot's movements remain constrained to merely walking forward or backward. A novel approach in residual learning was introduced to enable unidirectional walking, where the policy outputs a residual term added to the current positional states, facilitating gradual omnidirectional walking [35].
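As a minimal illustration of the residual idea (a sketch only, with hypothetical names and scaling, not the exact formulation of [33], [34], or [35]), the policy's output is added as a bounded offset to the reference joint positions before low-level tracking:

    def residual_targets(q_ref_t, policy, obs, scale=0.1):
        # The policy outputs a small correction to the predefined reference at the
        # current timestep; the corrected targets are then sent to the PD controller.
        delta = scale * policy(obs)
        return q_ref_t + delta

Because the learned term only perturbs the reference, the quality of q_ref_t largely determines what the policy can achieve, which is the limitation discussed above.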
2) Guided learning: Guided learning trains policies to directly output the desired joint-level commands, eschewing the addition of a residual term. The reward structure in this approach is focused on closely imitating predefined references.

A sim-to-real framework that employs periodic references to initiate the training phase was proposed in [36]. In this framework, the action space directly maps to the joint angles, and desired joint positions are managed by joint PD controllers. The framework also incorporates a Long Short-Term Memory (LSTM) network, as detailed in Appendix A, which is synchronised with periodic time inputs. However, this model is limited to a single locomotion goal: forward walking. A more diverse and robust walking DRL framework that includes a Hybrid Zero Dynamics (HZD) gait library was demonstrated in [31], achieving a significant advancement by enabling a single end-to-end policy to facilitate walking, turning, and squatting. Despite these advancements, the parameterization of reference motions introduces constraints that limit the flexibility of the learning process and the policy's response to disturbances.

To broaden the capabilities of guided learning policies, a framework capable of handling multiple targets, including jumping, was developed [37]. This approach introduced a novel policy structure that integrates long-term input/output (I/O) encoding, complemented by a multi-stage training methodology that enables the execution of complex jumping maneuvers. An adversarial motion priors approach, employing a style reward mechanism, was also introduced to facilitate the acquisition of user-specified gait behaviors [18]. This method improves the training of high-dimensional simulated agents by replacing complex hand-designed reward functions with more intuitive controls.

While previous works primarily focused on specific locomotion skills, a unified framework that accommodates both periodic and non-periodic motions was further developed [22] based on the foundational work in [37]. This framework enhances the learning process by incorporating a wide range of locomotion skills and introducing a dual I/O history approach, marking a significant breakthrough in creating a robust, versatile, and dynamic end-to-end framework. However, experimental results indicate that the precision of locomotion features, such as velocity tracking, remains suboptimal.

Guided learning methods expedite the learning process by leveraging expert knowledge and demonstrating the capacity to achieve versatile and robust locomotion skills. Through the comprehensive evaluation in [22], it is demonstrated that guided learning employs references without complete dependence on them. Conversely, residual learning exhibits failures or severe deviations when predicated on references of inferior quality. This shortfall stems from the framework's dependency on adhering closely to the provided references, which narrows its learning capabilities.

Nonetheless, reference-based learning's reliance on predefined trajectories confines the policy to specific gaits, restricting its capacity to explore a broader range of motion possibilities. Additionally, this approach exhibits limited adaptability in responding effectively to unforeseen environmental changes or novel challenges.
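A common ingredient across these reference-based rewards is an exponential joint-tracking term that is maximal when the measured motion matches the reference. The snippet below is a generic sketch of this idea (the weights and additional terms vary across [31], [36], [37] and are not taken from any single work):

    import numpy as np

    def imitation_reward(q, q_ref, sigma=0.5):
        # Reward approaches 1 when joint positions match the reference and
        # decays exponentially with the squared tracking error.
        err = np.sum((q - q_ref) ** 2)
        return float(np.exp(-err / sigma ** 2))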
B. Reference-free learning

In reference-free learning, the policy is trained using a carefully crafted reward function rather than relying on predefined trajectories. This approach allows the policy to explore a wider range of gait patterns and adapt to unforeseen terrains, thereby enhancing innovation and flexibility within the learning process.

The concept of reference-free learning was initially explored using simulated physics engines with somewhat unrealistic bipedal models. A pioneering framework, which focused on learning symmetric gaits from scratch without the use of motion capture data, was developed and validated within a simulated environment [14]. This framework introduced a novel term into the loss function and utilized a curriculum learning strategy to effectively shape gait patterns. Another significant advancement was made in developing a learning method that enabled a robot to navigate stepping stones using curriculum learning, focusing on a physical robot model, Cassie, though this has yet to be validated outside of simulation [40].

Considering the practical implementation of theoretical models, significant efforts have been directed towards developing sim-to-real frameworks in robotics studies. A notable example of such a framework accommodates various periodic motions, including walking, hopping, and galloping [19]. This framework employs periodic rewards to facilitate initial training within simulations before successfully transitioning to a physical robot. It has been further refined to adapt to diverse terrains and scenarios. For instance, robust blind walking on stairs was demonstrated through terrain randomization techniques in [38]. Additionally, the integration of a vision system has enhanced the framework's ability to precisely determine foot locations [41], thus enabling the robot to effectively navigate stepping stones [20]. Subsequent developments include the incorporation of a vision system equipped with height maps, leading to an end-to-end framework that more effectively generalizes terrain information [42].
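The periodic reward idea referenced above can be sketched as a phase-dependent weighting of foot forces and velocities. The toy example below (a single foot, hypothetical thresholds, not the exact formulation of [19]) rewards firm ground contact during the stance window and foot motion during the swing window:

    import numpy as np

    def periodic_reward(phase, foot_force, foot_speed, duty=0.6):
        # phase in [0, 1): stance for phase < duty, swing otherwise.
        if phase < duty:
            return float(np.tanh(foot_force))   # stance: encourage contact force
        return float(np.tanh(foot_speed))       # swing: encourage lifting the foot

Summing such terms over both feet, with phase offsets between them, is what shapes walking, hopping, or galloping gaits without an explicit reference trajectory.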
This approach to learning enables the exploration of novel solutions and strategies that might not be achievable through mere imitation of existing behaviours. However, the absence of reference guidance can render the learning process costly, time-consuming, and potentially infeasible for certain tasks. Moreover, the success of this method hinges critically on the design of the reward function, which presents significant challenges in specifying tasks such as jumping.

III. HIERARCHY FRAMEWORK

Unlike end-to-end policies that directly map sensor inputs to motor outputs, hierarchical control schemes deconstruct locomotion challenges into discrete, manageable layers or stages of decision-making. Each layer within this structure is tasked with specific objectives, ranging from high-level navigation to fundamental locomotion skills. This division not only enhances the framework's flexibility but also simplifies the problem-solving process for each policy.

The architecture of a hierarchical framework typically comprises two principal modules: an HL planner and an LL controller. This modular approach allows for the substitution of each component with either a model-based method or a learning-based policy, further enhancing adaptability and customisation to specific needs.

Hierarchical frameworks can be classified into three distinct types based on the integration and function of their components:
1) Deep planning hybrid scheme: This approach combines strategic, high-level planning with dynamic low-level execution, leveraging the strengths of both learning-based and traditional model-based methods.
2) Feedback DRL control hybrid scheme: This approach focuses on integrating direct feedback control mechanisms with deep reinforcement learning, allowing for real-time adjustments and enhanced responsiveness.
3) Learned hierarchy scheme: Entirely learning-driven, this scheme develops a layered decision-making hierarchy where each level is trained to optimise specific aspects of locomotion.

These frameworks are illustrated in Fig. 3. Each type offers unique capabilities and exhibits distinct characteristics, albeit with limitations primarily due to the complexities involved in integrating diverse modules and their interactions. For a concise overview, Table II summarises the various frameworks, detailing their respective strengths, limitations, and primary characteristics. The subsequent sections will delve deeper into each of these frameworks, providing a thorough analysis of their operational mechanics and their application in real-world scenarios.
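A minimal sketch of the two-layer loop shared by all three variants is given below (Python, with hypothetical rates and interfaces): the HL planner is queried at a low rate on task-level inputs, and the LL controller tracks its latest output at every control tick. Either box may be a learned policy or a model-based module, which is what distinguishes the three schemes.

    class HierarchicalController:
        # Generic two-layer scheme: an HL planner queried at a low rate and an
        # LL controller tracking its output at every control tick.
        def __init__(self, hl_planner, ll_controller, hl_period=10):
            self.hl_planner, self.ll_controller = hl_planner, ll_controller
            self.hl_period, self.plan = hl_period, None

        def step(self, t, task_cmd, obs):
            if self.plan is None or t % self.hl_period == 0:
                self.plan = self.hl_planner(task_cmd, obs)   # e.g. footsteps, CoM targets
            return self.ll_controller(self.plan, obs)        # joint-level commands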
A. Deep planning hybrid scheme

In this scheme, robots are pre-equipped with the ability to execute basic locomotion skills such as walking, typically managed through model-based feedback controllers or interpretable methods. The addition of an HL learned layer focuses on strategic goals or the task space, enhancing locomotion capabilities and equipping the robot with advanced navigation abilities to effectively explore its environment.

Several studies have demonstrated the integration of an HL planner policy with a model-based controller to achieve tasks in world space. A notable framework optimises task-space-level performance, eschewing direct joint-level and balancing considerations [24]. This system combines a residual learning planner with an inverse dynamics controller, enabling precise control over task-space commands to joint-level actions, thereby improving velocity tracking, foot touchdown location, and height control. Further advancements include a hybrid framework that merges HZD-based residual deep planning with model-based regulators to correct errors in learned trajectories, showcasing robustness, training efficiency, and effective velocity tracking [25]. These frameworks have been successfully transferred from simulation to reality and validated on robots such as Cassie.

However, the limitations imposed by residual learning constrained the agents' capacity to explore a broader array of possibilities. Building on previous work [25], a more efficient hybrid framework was developed, which learns from scratch without reliance on prior knowledge [43].
TABLE II: Summary and comparison of hierarchical frameworks.

Deep Planning Hybrid Scheme
  Works: [24] deep planning + ID; [43] deep planning + ID-QP; [44] deep planning + WPG.
  Module characteristic: the HL policy is learned to guide the LL controller to complete locomotion and navigation tasks.
  Advantages: enhanced command tracking capabilities, generalization across different platforms, sampling efficiency, and robustness.
  Disadvantages: complicated system and communication between layers, requires a precise model, lacks generalization regarding different tasks.

Feedback DRL Control Hybrid Scheme
  Works: [45] gait library + feedback policy; [26] footstep planner + feedback policy; [27] model-based planner + feedback policy.
  Module characteristic: the LL feedback policy receives the output of a non-learned HL planner as input to achieve locomotion skills.
  Advantages: short inference times, robustness, navigation and locomotion capabilities, interpretability.
  Disadvantages: complicated system and communication between layers, reducing sampling efficiency.

Learned Hierarchy Framework
  Works: [12], [13], [29] HL policy + LL policy.
  Module characteristic: both the HL planner and the LL feedback controller are learned; the LL policy focuses on basic locomotion skills, while the HL policy learns navigation skills.
  Advantages: provides layer flexibility, where each layer can be independently retrained and reused; alleviates the challenges associated with training an end-to-end policy.
  Disadvantages: inefficient sim-to-real transfer, complicated interface between layers, expensive training.

In this approach, a purely learning-based HL planner interacts with an LL controller using an Inverse Dynamics with Quadratic Programming formulation (ID-QP). This policy adeptly captures dynamic walking gaits through the use of reduced-order states and simplifies the learning trajectory. Demonstrating robustness and training efficiency, this framework has outperformed other models and was successfully generalized across various bipedal platforms, including Digit, Cassie, and RABBIT.

In parallel, several research teams have focused on developing navigation planners specifically for toy-like humanoid robots, which provide greater physical stability compared to torque-driven or hydraulic bipedal robots as shown in Fig. 1. One notable study [46] implemented a visual navigation policy on the NAO robot, depicted in Fig. 1(a), utilizing RGB cameras as the primary sensory modality. This system has demonstrated successful zero-shot transfer to real-world scenarios, enabling the robot to adeptly navigate around obstacles. Further research [44] has explored complex dynamic motion tasks, such as playing soccer, by integrating a learned policy with an online footstep planner that utilises weight positioning generation (WPG) to create a center of mass (CoM) trajectory. This configuration is coupled with a whole-body controller, facilitating dynamic activities like soccer shooting. Despite their platform's stability, provided by large feet and a lightweight structure, these robots exhibit limited dynamic movement capabilities compared to full-sized humanoid robots. Consequently, this research primarily addresses navigation and task execution.

Regarding generalization, these frameworks have shown potential for adaptation across different types of bipedal and humanoid robots with minimal adjustments, demonstrating advanced user command tracking [43] and sophisticated navigation capabilities [44]. However, limitations are evident, notably the absence of capabilities for executing more complex and dynamic motions, such as jumping. Furthermore, while these systems adeptly navigate complex terrains with obstacles, footstep planning alone is insufficient without concurrent enhancements to the robot's overall locomotion capabilities. Moreover, the requisite communication between the two distinct layers of the hierarchical framework may introduce system complexities. Enhancing both navigation and dynamic locomotion capabilities within the HL planner remains a significant challenge.

B. Feedback DRL control hybrid scheme

In contrast to the comprehensive approach of end-to-end policies discussed in Section II, which excels in handling versatile locomotion skills and complex terrains with minimal inference times, the Feedback DRL Control Hybrid Scheme integrates DRL policies as LL controllers. These LL controllers, replacing traditional model-based feedback mechanisms, work in conjunction with HL planners that process terrain information, plan future walking paths, and maintain robust locomotion stability.

For instance, gait libraries, which provide predefined movement references based on user commands, have been integrated into such frameworks [45]. Despite the structured approach of using gait libraries, their static nature offers limited adaptability to changing terrains, diminishing their effectiveness. A more dynamic approach involves online planning, which has shown greater adaptability and efficiency. One notable framework combines a conventional foot planner with an LL DRL policy [26], delivering targeted footsteps and directional guidance to the robot, thereby enabling responsive and varied walking commands. Moreover, HL controllers can provide additional feedback to LL policies, incorporating CoM or end-feet information, either from model-based methods or other conventional control strategies. However, this work has not yet been transferred from simulation to real-world applications.
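In this scheme the planner's output simply becomes part of the LL policy's observation. A minimal sketch of that interface is shown below (hypothetical names and shapes; not the exact interface of [26] or [27]):

    import numpy as np

    def ll_observation(proprioception, next_footstep, heading_cmd):
        # The LL DRL policy conditions on the HL planner's targeted footstep and
        # heading command in addition to the robot's own proprioceptive state.
        return np.concatenate([proprioception, next_footstep, [heading_cmd]])

    def feedback_hybrid_step(ll_policy, hl_planner, robot_state, terrain, user_cmd):
        footstep, heading = hl_planner(terrain, user_cmd)   # non-learned HL planner
        obs = ll_observation(robot_state, footstep, heading)
        return ll_policy(obs)                               # learned LL feedback policy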
Fig. 3: Hierarchy Control Scheme Diagram: (a) A basic hierarchical scheme with two layers, where each module can be substituted with a learned policy. (b) A deep planning hybrid scheme, where the High-Level (HL) planner is learned. (c) A learning-based feedback control hybrid scheme, with a learned Low-Level (LL) controller. (d) A comprehensive DRL hierarchy control scheme, where both layers are learned.

Later, a similar structure featuring an HL foot planner and an LL DRL policy was proposed [27]. This strategy not only achieved a successful sim-to-real transfer but also enabled the robot to navigate omnidirectionally and avoid obstacles.

A recent development has shown that focusing solely on foot placement might restrict the stability and adaptability of locomotion, particularly in complex maneuvers. A new framework integrates a model-based planner with a DRL feedback policy to enhance bipedal locomotion's agility and versatility, displaying improved performance [47]. This system employs a residual learning architecture, where the DRL policy's outputs are merged with the planner's directives before being relayed to the PD controller. This integrated approach not only concerns itself with foot placement but also generates comprehensive trajectories for trunk position, orientation, and ankle yaw angle, enabling the robot to perform a wide array of locomotion skills including walking, squatting, turning, and stair climbing.

Compared to traditional model-based controllers, learned DRL policies provide a comprehensive closed-loop control strategy that does not rely on assumptions about terrain or robotic capabilities. These policies have demonstrated high efficiency in locomotion and accurate reference tracking. Despite their extensive capabilities, such policies generally require short inference times, making DRL a preferred approach in scenarios where robustness is paramount or computational resources on the robot are limited. Nonetheless, these learning algorithms often face challenges in environments characterized by sparse rewards, where suitable footholds like gaps or stepping stones are infrequent [48].

Additionally, an HL planner can process critical data such as terrain variations or obstacles and generate precise target locations for feet or desired walking paths, instead of detailed terrain data, which can significantly expedite the training process [27]. This capability effectively addresses the navigational limitations observed in end-to-end frameworks. Moreover, unlike the deep planning hybrid scheme where modifications post-policy establishment can be cumbersome, this hybrid scheme offers enhanced flexibility for on-the-fly adjustments.

Despite the significant potential demonstrated by previous studies, integrating DRL-based controllers with sophisticated and complex HL planners still presents limitations compared to more integrated frameworks such as end-to-end and deep planning models. Specifically, complex HL model-based planners often require substantial computational resources to resolve problems, rely heavily on model assumptions, necessitate extensive training periods, demand large datasets for optimization, and hinder rapid deployment and iterative enhancements [48].

C. Learned hierarchy framework

The Learned Hierarchy Framework merges a learned HL planner with an LL controller, focusing initially on refining LL policies to ensure balance and basic locomotion capabilities. Subsequently, an HL policy is developed to direct the robot towards specific targets, encapsulating a structured approach to robotic autonomy.

The genesis of this framework was within a physics engine, aimed at validating its efficiency through simulation [12]. In this setup, LL policies, informed by human motions or trajectories generated via Trajectory Optimization (TO), strive to track these trajectories as dictated by the HL planner while maintaining balance. An HL policy is then introduced, pre-trained with long-term task goals, to navigate the environment and identify optimal paths. This structure enabled sophisticated interactions such as guiding a biped to dribble a soccer ball towards a goal. The framework was later enhanced to include imitation learning, facilitating the replication of dynamic human-like movements within the simulation environment [13].

However, despite its structured and layered approach, which allows for the reuse of learned behaviors to achieve long-term objectives, these frameworks have predominantly been validated only in simulations. The interface designed manually between the HL planner and the LL controller sometimes leads to suboptimal behaviors, including stability issues like falling.
Expanding the application of this framework, a sim-to-real strategy for a wheeled bipedal robot was proposed, focusing the LL policy on balance and position tracking, while the HL policy enhances safety by aiding in collision avoidance and making strategic decisions based on the orientation of subgoals [29].

Learning complex locomotion skills, particularly when incorporating navigation elements, presents a significant challenge in robotics. Decomposing these tasks into distinct locomotion and navigation components allows robots to tackle more intricate activities, such as dribbling a soccer ball [12]. As discussed in the previous section, the benefits of integrating RL-based planners with RL-based controllers have been effectively demonstrated. This combination enables the framework to adeptly manage a diverse array of environments and tasks. Within such a framework, the High-Level (HL) policy is optimized for strategic planning and achieving specific goals. This optimization allows for targeted enhancements depending on the tasks at hand. Moreover, the potential for continuous improvement and adaptation through further training ensures that the system can evolve over time, improving its efficiency and effectiveness in response to changing conditions or new objectives.

Despite the theoretical advantages, the practical implementation of this type of sim-to-real application for bipedal robots remains largely unexplored. The transition from simulation to real-world scenarios is fraught with challenges, not least because of the complexities involved in training and integrating two separate layers within the control hierarchy. Ensuring effective communication and cooperation between these layers is critical, requiring a meticulously defined communication interface to avoid operational discrepancies.

Additionally, the training process for each policy within the hierarchy demands considerable computational resources. The intensive nature of this training can lead to a reliance on the simulation environment, potentially causing the system to overfit to specific scenarios and thereby fail to generalize to real-world conditions. This limitation highlights a significant hurdle that must be addressed to enhance the viability of learned hierarchy frameworks in practical applications.

IV. CHALLENGES AND FUTURE RESEARCH DIRECTIONS

While learning-based frameworks for bipedal robots have demonstrated considerable potential, they have also clearly exposed the limitations inherent to each framework. Moreover, several critical areas remain largely unexplored, especially within the realm of legged robotics, where the pace of research on bipedal robots lags behind that of their quadruped counterparts. This discrepancy in research progress can be attributed to several factors, including the higher costs and less mature technology associated with bipedal robot hardware, as well as the inherent instability issues that bipedal designs face.

To gain a deeper understanding of these challenges and to outline potential future directions, it is instructive to first review existing research on quadruped robots. The insights gained from quadrupeds, which benefit from more robust research outputs and technological advancements, can provide valuable lessons for overcoming similar challenges in bipedal systems.

A. Recent progress with quadruped robots

While DRL remains an emerging technology in bipedal robotics, it has firmly established its presence in the realm of quadruped robots, another category of legged systems. The diversity of frameworks developed for quadrupeds ranges from model-based RL designed for training in real-world scenarios, where unpredictable dynamics often prevail [49], [50], to systems that include the modeling of deformable terrain to enhance locomotion over compliant surfaces [51]. Furthermore, dynamic quadruped models facilitate highly adaptable policies [52], and sophisticated acrobatic motions are achieved through imitation learning [53].

The domain of quadruped DRL has also seen significant advancements in complex hybrid frameworks that integrate vision-based systems. To date, two primary versions of such frameworks have been developed: one where a deep planning module is paired with model-based control [54], and another that combines model-based planning with low-level DRL control [48], [55]. The latter has shown substantial efficacy; it employs model predictive control (MPC) to generate reference motions, which are then followed by an LL feedback DRL policy. Additionally, the Terrain-aware Motion Generation for Legged Robots (TAMOLS) module [56] enhances the MPC and DRL policy by providing terrain height maps for effective foothold placements across diverse environments, including those not encountered during training. However, similar hybrid control schemes have not been thoroughly investigated within the field of bipedal locomotion.

Quadruped DRL frameworks are predominantly designed to navigate complex terrains, but efforts to extend their capabilities to other tasks are underway. These include mimicking real animals through motion capture data and imitation learning [57], [58], as well as augmenting quadrupeds with manipulation abilities. This is achieved either by adding a manipulator [59], [60] or by using the robots' legs [61]. Notably, the research presented in [60] demonstrates that loco-manipulation tasks can be effectively managed using a single unified end-to-end framework.

Despite the progress in quadruped DRL, similar advancements have been limited for bipedal robots, particularly in loco-manipulation tasks and vision-based DRL frameworks. Establishing a unified framework could bridge this gap, an essential step given the integral role of bipedal robots in developing full humanoid systems. Moreover, the potential of hybrid frameworks that combine model-based and DRL-based methods in bipedal robots remains largely untapped.

B. Gaps and challenges

Despite numerous promising developments in the field of bipedal and humanoid robotics, significant gaps remain between current research outcomes and the ultimate goals. This discussion concentrates on the gaps in frameworks and algorithms rather than hardware, structured around two pivotal questions: 1) Is it possible to design a unified framework that achieves both generalization and precision? 2) Can we develop a straightforward end-to-end policy capable of managing all tasks efficiently?
1) Generalization versus precision: DRL has demonstrated potential in facilitating versatile locomotion skills [22]; however, challenges such as poor velocity tracking and issues with precise control often arise. While [43] shows that deep planning combined with model-based control can achieve precise velocity tracking, and [37] illustrates successful end-to-end control for precise jumping, the creation of a policy that effectively handles both diverse tasks and precise movements remains elusive. Furthermore, [41] introduces a foot constraint policy framework, enabling precise target tracking and accurate touchdown locations. Yet, there is still no framework that comprehensively addresses the dual demands of versatility and precision in locomotion.

The difficulty in simultaneously achieving precise control and a broad range of actions in bipedal locomotion using DRL stems from several factors:
• Complex dynamics: Bipedal locomotion involves intricate dynamics, posing a significant challenge to maintaining both dynamic motion and precision.
• Resource intensity: Executing diverse locomotion tasks requires considerable computational power and extensive data, necessitating high-quality hardware and efficient DRL algorithms.
• Training conflicts: Training DRL systems to achieve both precision and versatility often leads to conflicts. Designing reward functions and training policies that satisfy both criteria is inherently complex.

These challenges underscore the need for innovative solutions that can bridge the gap between the capabilities of current frameworks and the ambitious goals of advanced bipedal and humanoid robotics.

2) Simplifying frameworks to overcome complex tasks: The envisioned ideal in robotic design is an end-to-end framework that enables robots to traverse various terrains using versatile locomotion skills. Although current research often focuses on enhancing frameworks by adding complex components to mitigate inherent limitations, such as the integration of a foot planner for omnidirectional locomotion and stair navigation, as demonstrated in [26], [27], simpler end-to-end frameworks have also proven effective. These frameworks adeptly navigate challenging terrains and perform a diverse range of locomotion tasks with fewer components [20], [22].

The advantage of maintaining simplicity in the framework lies in its ability to streamline decision-making processes, thereby reducing computational overhead and potential points of failure. To achieve an optimal end-to-end framework, advancements in several key areas are essential:
• Robust and efficient DRL algorithms: Development of algorithms that can manage high-dimensional and continuous control problems more effectively.
• Specialized neural network architectures: Design of neural architectures tailored for specific bipedal tasks, capable of processing extensive sensory data (e.g., visual and tactile inputs), similar to the innovations presented in [42].
• Effective reward functions: Formulation of reward functions that more accurately guide the learning process towards achieving desired behaviors and strategic outcomes.
• Advanced computational resources: Enhancement of computational capabilities to support more intensive training and faster inference, facilitating real-time decision-making in dynamic environments.

By focusing on these developmental areas, the potential to create a unified, efficient, and less complex framework for handling complex locomotion challenges in bipedal robots is significantly increased.

C. Future directions

The exploration of quadruped robotics has yielded substantial advancements, yet the full potential of bipedal robotics remains largely untapped. Building on the successes and innovative approaches observed in quadruped robots, several key future directions emerge that could significantly enhance bipedal and humanoid robotics.

1) Unified framework: Currently, no single framework exists that enables bipedal or humanoid robots to adeptly navigate all types of terrains, including stepping stones, stairs, deformable terrain, and slippery surfaces. A promising approach, as evidenced by recent work in quadruped robots [48], utilizes MPC to generate reference motions, which a low-level DRL policy then tracks. This method, coupled with the Terrain-aware Motion Generation for Legged Robots (TAMOLS) module, simplifies the terrain representation into a height map, facilitating more effective navigation. This success encourages further exploration into hybrid frameworks that combine model-based methods with DRL, inheriting the strengths of both approaches, as discussed in Section III. However, hybrid frameworks present challenges such as training efficiency and system complexity, which demand considerable computational resources and extensive training periods.

Moreover, recent studies [20], [42], [22] have demonstrated the potential of end-to-end frameworks enhanced with vision-based information. These frameworks successfully navigate challenging terrains and execute dynamic motions, suggesting the feasibility of a unified framework capable of handling diverse environments and tasks. Training strategies such as curriculum learning and task randomization could be employed, utilizing visual height maps as inputs to the policy, enhancing the robot's ability to adapt and perform in varied scenarios.

In addition, the introduction of a DRL end-to-end framework incorporating transformer models, as in [62], presents significant possibilities for integrating locomotion skills with language and vision capabilities. The use of large-scale models capable of processing and condensing extensive data sets into a coherent model could expand the robot's range of capabilities, maintaining versatility across a broad spectrum of tasks.

The exploration of transformers and other large-scale models holds considerable promise for enhancing generalizability and adaptability in complex tasks, warranting further investigation into their potential applications in bipedal robotics.
2) Vision-based learning framework: Vision plays a critical role in enabling robots to navigate challenging terrains, such as blind drops, where tactile and other sensory inputs may not provide sufficient information. Despite the importance of vision, many current frameworks, particularly in bipedal robotics, do not fully exploit this modality [38], [43]. Vision-based systems are essential in human locomotion for identifying obstacles and assessing terrains, and some studies have begun to show the effectiveness of integrating vision into DRL frameworks for bipedal and humanoid robots [20], [42], [27].

Building on the groundwork laid by both bipedal and quadruped robots, two promising directions have emerged:
• Height scanner mapping: This approach, evaluated in works like [42], involves using height maps generated by scanners to inform locomotion strategies. These maps provide detailed topographical data, allowing robots to plan steps on uneven or obstructed surfaces more effectively.
• Direct vision inputs: Directly utilizing inputs from cameras, such as depth or RGB images, for real-time decision-making in RL policies [46], [63]. Although previous studies like [46] have integrated visual navigation by feeding visual information to a High-Level (HL) planner, the potential of direct visual inputs to RL policies has not been fully explored.

Enhancing the capability of bipedal robots to directly interpret and utilize visual data without intermediary processing can revolutionize their adaptability and efficiency in real-world scenarios. The exploration of direct vision inputs to reinforcement learning policies represents a significant opportunity for advancing the field, potentially enabling more dynamic and responsive locomotion strategies.

3) Bridge the gap from simulation to reality: While simulations offer a safe and cost-effective environment for developing robotics policies, the transition from simulation to real-world application often encounters significant challenges due to the approximations and simplifications made in simulations. Numerous sim-to-real frameworks [34], [64], [65] have shown high efficiency and performance, as detailed in Appendix B. Despite these advancements, a significant gap persists, exacerbated by the complexity and unpredictability of physical environments. Moreover, many studies [26], [66], [21] remain validated only in simulation settings.

4) Loco-manipulation tasks: Loco-manipulation, which combines locomotion and manipulation, presents opportunities for humanoid robots to excel beyond purely bipedal capabilities. Few studies have addressed this integrated task; one such study [67] demonstrated a 'box transportation' framework. This framework decomposes the task into five distinct policies, each addressing different aspects of the transportation process. However, this approach lacks efficiency and does not incorporate vision-based information, suggesting substantial room for improvement. Moreover, the challenges of managing mobile tools like scooters [68] or dynamically interacting with objects such as balls [69] introduce further complexities.

Decomposing loco-manipulation tasks into multiple layers could simplify the challenges, allowing for more precise and flexible control by manually tuning individual components of the task [43]. This structured control approach provides a more coordinated response to complex interactions within the robot's environment, facilitating the execution of task-specific commands.

Alternatively, an end-to-end framework may enable bipeds to perform a variety of tasks through task randomization and structured curriculum learning methods, progressively teaching the policy [35], [27], [22]. During training, such policies can also learn human-like movements from motion capture data [70], [18], [16], offering promising solutions for future integrated loco-manipulation tasks within a single, versatile policy.

5) Designing reward functions: The development of effective reward functions is a critical challenge in the field of deep reinforcement learning (DRL) for bipedal robots. While periodic reward functions have been designed to facilitate cyclic movements like walking [19], there remains a significant gap in crafting reward functions for non-periodic actions such as jumping. These actions require distinct considerations for success and efficiency, yet current research lacks comprehensive methods for their reward structure. Furthermore, minimizing the need for extensive manual tuning while achieving high performance in DRL systems continues to be a substantial challenge, pointing to the need for more adaptive and automatically adjusting reward mechanisms.

6) Integrating large language models: The integration of Large Language Models (LLMs) into bipedal robotics opens new avenues for contextual understanding and task execution, significantly enhancing the robots' interaction capabilities. LLMs, when implemented at the highest task level, offer substantial promise for improving human-robot interaction, making these systems more intuitive and responsive [71]. The potential applications of this technology are broad and impactful, spanning sectors such as industrial automation, where robots can perform complex assembly tasks; healthcare, offering assistance in patient care and rehabilitation; assistive devices, providing support for individuals with disabilities; search and rescue operations, where robust and adaptive decision-making is critical; and entertainment and education, where interactive and engaging experiences are key [72]. Each of these fields could benefit from the advanced capabilities of LLM-enhanced bipedal robots, particularly in environments requiring nuanced understanding and adaptability.

D. Applications in various fields

The advancements in bipedal locomotion technology hold significant promise for practical applications beyond the confines of laboratory environments. These robots, bolstered by AI, are poised to transform numerous sectors by enhancing operational capabilities and interaction with humans. The potential for humanoid robots in various fields is detailed in [72], emphasizing the integration of learning-based approaches for more effective implementation. Key areas include:

1) Industrial automation and manufacturing: The integration of humanoid robots in industrial settings can significantly enhance productivity and efficiency, freeing workers from repetitive and labor-intensive tasks.
These robots, equipped with advanced loco-manipulation capabilities and the ability to cooperate with human teams, are particularly effective in assembly line operations, maintenance tasks, and the construction of complex machinery [73], [74]. Their articulated arms and floating bases provide unmatched flexibility, making them ideal for human-centric manufacturing environments. The humanoid robot Digit, for example, demonstrates remarkable stability and efficiency in industrial tasks over extended periods, as seen in video demonstrations [75]. Moreover, these robots are also suited for operation in high-risk environments such as underwater or areas with high radiation levels, significantly enhancing safety and operational capacity in these contexts.

2) Healthcare and assistive devices: In the healthcare sector, bipedal and humanoid robots contribute significantly to rehabilitation and assistive technologies. Exoskeletons enhanced with DRL methodologies are being used to train individuals to achieve more natural gait patterns, improving mobility and rehabilitation outcomes [76]. Beyond mere mobility aids, humanoid robots integrated with LLMs show promise in delivering medications, monitoring patient health, and assisting in surgeries [77]. The synergy between LLMs and loco-manipulation capabilities paves the way for more interactive, responsive support, aligning closely with the needs of personalized care. Additionally, the aging population can benefit from humanoid robots performing everyday tasks like house cleaning or delivery through simple voice commands, thereby enhancing the quality of life.

3) Search and rescue missions: Humanoid robots are exceptionally valuable in search and rescue operations, especially in disaster-stricken or hazardous environments where human presence is risky or impractical. Unlike traditional wheeled robots, humanoid robots can navigate complex terrains filled with debris, gaps, and elevated structures, making them indispensable in these scenarios. They also demonstrate potential for significant interaction and collaboration with human rescue teams. For instance, in environments with high nuclear radiation, humanoid robots can perform tasks that would be perilous for humans, handling delicate instruments and preventing human exposure to harmful conditions. This capability extends to other challenging environments such as underwater [78], outer space [79], [80], and other hazardous areas. However, the full realization of these applications remains constrained by the absence of a unified framework that can seamlessly navigate all terrains and fully integrate loco-manipulation and human interaction functionalities.

4) Entertainment and education: Humanoid robots have the potential to transform the realms of entertainment and education by providing highly interactive experiences. With their ability to integrate extensive knowledge bases, these robots can significantly enhance educational environments. They can assume the roles of butlers, teachers, or even babysitters, engaging with users in diverse activities. For example, robots can facilitate language learning [81], participate in storytelling, teach various academic subjects, or engage in the performing arts and games. In the sphere of entertainment, humanoid robots can act, dance, play ball games [69], and take part in interactive performances, captivating audiences of all ages with their versatility and dynamic capabilities.

However, this in turn leads to a variety of ethical issues. First, interacting with humans involves collecting data on human daily behavior, which increases the risk of a data breach. Second, there is concern about the increasing dependency of humans on robots, not just for assistance but also for emotional support, which would result in less human-to-human interaction and ultimately affect social constructs and emotional development. Third, advances in humanoid robots will replace humans in various jobs and eventually lead to unemployment issues.

On the positive side, humanoid robots can provide invaluable assistance to people with disabilities or the elderly, offering companionship and reducing the care burden on families and healthcare systems. Furthermore, their application across diverse fields such as education, industry, and healthcare can bring about revolutionary changes, improving efficiency and safety while opening up new possibilities for technological integration. As we navigate these advancements, it is crucial to balance innovation with ethical considerations to ensure that the deployment of humanoid robots enhances societal well-being without compromising personal integrity or social dynamics.

V. CONCLUSION

Despite significant advances in DRL for robotics, a considerable gap persists between current research achievements and the development of a unified framework capable of empowering robots to perform a broad spectrum of complex tasks efficiently. Presently, DRL research can be categorized into two primary control schemes: end-to-end and hierarchical frameworks. End-to-end frameworks have shown promising capabilities in executing diverse locomotion skills [22], climbing stairs [38], and navigating challenging terrains such as stepping stones [20]. Conversely, hybrid frameworks, which often integrate an HL planner or an LL model-based controller, offer enhanced capabilities, allowing for simultaneous management of locomotion and navigation tasks.

To bridge the existing gaps, further development of hierarchical frameworks, particularly those equipped with advanced perception systems and integrated with model-based planners, appears promising. Such frameworks could simultaneously address issues of precision and generalization. Moreover, the advent of LLMs presents a transformative opportunity, potentially enabling the unification of language processing and visual functionalities within robotic systems. While numerous challenges remain—ranging from the technical intricacies of framework integration to real-world application—the steady progression in control framework refinement and DRL development provides a hopeful outlook. The vision of achieving an end-to-end unified framework, capable of mimicking human-like learning processes and enabling bipedal robots to handle a wide range of complex tasks, may soon move within reach.
The vision of achieving an end-to-end unified framework, capable of mimicking human-like learning processes and enabling bipedal robots to handle a wide range of complex tasks, may soon move within reach.

APPENDIX A
DEEP REINFORCEMENT LEARNING ALGORITHMS

Fig. 4: Diagram of the RL algorithm catalogue: model-free RL algorithms are divided into value-based methods (Q-learning, DQN) and policy-based methods (Policy Gradient, A2C/A3C, PPO, TRPO, DDPG).

The advancement and development of RL is crucial for bipedal locomotion. Specifically, advances in deep learning provide deep neural networks that serve as function approximators, empowering RL to handle tasks characterized by high-dimensional and continuous spaces by efficiently discovering condensed, low-dimensional representations of complex data.
In comparison to robots of other morphologies, such as wheeled robots, bipedal robots feature much higher DoFs and interact continuously with their environments, which places higher requirements on the DRL algorithms. In the legged locomotion field especially, policy gradient-based algorithms are the prevalent choice for bipedal locomotion.

Designing an effective neural network architecture is essential for tackling complex bipedal locomotion tasks. The Multi-layer Perceptron (MLP), a fundamental neural network, excels in straightforward regression tasks with low computational resource demands. A comprehensive comparison between the MLP and the memory-based Long Short-Term Memory (LSTM) network reveals that MLPs have an advantage in convergence speed [65]. However, the LSTM, as a variant of the Recurrent Neural Network (RNN), is adept at processing time-dependent data, effectively relating states across time and modeling key physical properties vital for periodic gaits [19] and for successful sim-to-real transfer in bipedal locomotion. Additionally, Convolutional Neural Networks (CNNs) specialize in spatial data processing, particularly for image-related tasks, making them highly suitable for environments where visual perception is crucial. This diverse range of neural network architectures highlights the importance of selecting the appropriate model based on the specific requirements of the bipedal locomotion task.
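To make the MLP-versus-LSTM trade-off concrete, the sketch below shows the two policy parameterizations side by side. It is a minimal illustration in PyTorch; the observation dimension, action dimension, hidden sizes, and class names are placeholder choices rather than values taken from any of the surveyed works.

```python
import torch
import torch.nn as nn

class MLPPolicy(nn.Module):
    """Feedforward policy: the action depends only on the current observation."""
    def __init__(self, obs_dim=42, act_dim=10, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):            # obs: (batch, obs_dim)
        return self.net(obs)           # action mean: (batch, act_dim)

class LSTMPolicy(nn.Module):
    """Recurrent policy: a hidden state lets the action depend on the observation history."""
    def __init__(self, obs_dim=42, act_dim=10, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq, state=None):    # obs_seq: (batch, time, obs_dim)
        out, state = self.lstm(obs_seq, state)
        return self.head(out), state           # per-step actions and the carried hidden state
```

At deployment the recurrent variant must carry its hidden state across control steps, which is what allows it to relate past and present observations when inferring quantities such as contact state or gait phase.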
Considering DRL algorithms, recent bipedal locomotion studies focus on model-free reinforcement learning. Unlike model-based RL, which learns a model of the environment but may inherit biases from simulations that do not accurately reflect real-world conditions, model-free RL directly trains policies through environmental interaction without relying on an explicit environmental model. Although model-free RL requires more samples and computational resources, it can train more robust policies that allow robots to traverse challenging environments.

Many sophisticated model-free RL algorithms exist, and they can be broadly classified into two categories: policy-based (or policy optimization) and value-based approaches. Value-based methods, e.g., Q-learning and Deep Q-learning (DQN) [82], excel only in discrete action spaces and often struggle with high-dimensional action spaces. In contrast, policy-based methods, such as policy gradient, can handle complex tasks but are generally less sample-efficient than value-based methods.
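As a minimal illustration of the value-based family, the snippet below performs one tabular Q-learning update. The maximization over the next action ties these methods to discrete (or discretized) action sets, which is why they scale poorly to the continuous, high-dimensional action spaces of bipedal robots; the state/action sizes and learning parameters here are arbitrary illustrative values.

```python
import numpy as np

n_states, n_actions = 100, 4          # illustrative discrete sizes
Q = np.zeros((n_states, n_actions))   # tabular action-value estimates
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

def q_learning_update(s, a, r, s_next):
    """One temporal-difference update of Q(s, a)."""
    td_target = r + gamma * np.max(Q[s_next])   # greedy bootstrap over a discrete action set
    Q[s, a] += alpha * (td_target - Q[s, a])
```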
More advanced algorithms combine policy-based and value-based methods. Actor-critic (AC) refers to the idea of simultaneously learning both a policy (actor) and a value function (critic), thereby inheriting the advantages of both families [83], [84]. Popular policy-based algorithms such as Trust Region Policy Optimization (TRPO) [85] and PPO borrow ideas from AC. Moreover, there are other notable algorithms built on the AC framework: Deep Deterministic Policy Gradient (DDPG) [86], Twin Delayed Deep Deterministic Policy Gradient (TD3) [87], Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C) [88], and Soft Actor-Critic (SAC) [89]. Each algorithm has its strengths for different tasks in the bipedal locomotion scenario, and several key factors can be used to assess them, such as sample efficiency, robustness and generalization, and implementation difficulty. A comparative analysis [90] illustrates that SAC-based algorithms excel in stability and achieve the highest scores, while their training efficiency trails significantly behind that of PPO, which still obtains relatively high scores.

In [91], PPO demonstrates robustness and computational economy in complex scenarios, such as bipedal locomotion, while utilizing fewer resources than TRPO. In terms of training time, PPO is also much faster than SAC and DDPG [90]. Besides, many works [19], [45], [36] have demonstrated its robustness and ease of implementation, which, combined with the flexibility to integrate with various neural network architectures, have made PPO the most popular choice in this field. Various works have shown that PPO can drive the exploration of walking [19], jumping [37], stair climbing [38], and stepping stones [20], which demonstrates its efficiency, robustness, and generalization.

Additionally, the DDPG algorithm integrates the actor-critic framework with DQN to enable off-policy training, further improving sampling efficiency. In some specific scenarios, such as jumping, DDPG shows higher reward and better learning performance than PPO [21], [92]. TD3 was developed on top of DDPG and improves over the performance of both DDPG and SAC [89]. SAC further strengthens the agent's exploration capabilities and sample efficiency [89]. While A2C offers improved efficiency and stability compared with A3C, the asynchronous update mechanism of A3C provides better exploration and accelerates learning. Although these algorithms show their own advances, they are more challenging to apply than PPO due to their complexity.
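Since PPO appears throughout the surveyed frameworks, a minimal sketch of its clipped surrogate loss is given below. It assumes PyTorch tensors of per-timestep log-probabilities and advantage estimates; the clipping coefficient is the commonly used default from [91], while how the loss is combined with value and entropy terms is left out as an implementation choice.

```python
import torch

def ppo_clipped_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate objective of PPO, returned as a loss to minimize.

    new_log_prob: log pi_theta(a_t | s_t) under the current policy
    old_log_prob: log pi_theta_old(a_t | s_t) from the policy that collected the data
    advantage:    advantage estimates A_t (e.g., from GAE), same shape as the log-probs
    """
    ratio = torch.exp(new_log_prob - old_log_prob)                        # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO maximizes the elementwise minimum of the two terms; negate it for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```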
APPENDIX B
BRIDGING THE SIM-TO-REAL GAP

Due to the large number of interactions needed by RL algorithms, training directly on robots can lead to costly damage to hardware and the environment. Consequently, training a policy in simulation and then deploying it on hardware offers significant potential and efficiency. However, the gap between simulation and the real world remains substantial, making sim-to-real transfer challenging. To overcome this gap, several sim-to-real approaches have been developed, including dynamics randomization [36], system identification [93], [94], periodic reward composition [66], learned actuator dynamics [93], [95], regulation feedback controllers [96], and adversarial motion priors [18], [17].

There are two primary approaches to training policies under domain randomization. One is end-to-end training on a history of robot measurements or I/O [36]; the other is policy distillation, in which an expert policy with privileged environmental insights guides a student policy that learns from onboard sensory feedback, as in teacher-student training [66] and RMA [64].
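The sketch below shows the policy-distillation idea in its simplest supervised form: a student that only sees onboard observations is regressed onto the actions of a frozen teacher that additionally receives privileged simulator state. The networks, dimensions, and loss are illustrative assumptions; actual teacher-student pipelines such as [66] or RMA [64] differ in detail, for instance by distilling latent environment embeddings rather than raw actions.

```python
import torch
import torch.nn as nn

OBS_DIM, PRIV_DIM, ACT_DIM = 42, 16, 10   # placeholder sizes

# Teacher sees onboard observations plus privileged state; student sees onboard observations only.
teacher = nn.Sequential(nn.Linear(OBS_DIM + PRIV_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
student = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(onboard_obs, privileged_state):
    """One supervised update that moves the student towards the teacher's action."""
    with torch.no_grad():   # the teacher is trained beforehand and kept frozen
        target = teacher(torch.cat([onboard_obs, privileged_state], dim=-1))
    loss = nn.functional.mse_loss(student(onboard_obs), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```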
Details of these sim-to-real transfer approaches are given below:
1) The dynamics randomization method involves systematically varying the physical parameters of the simulated environment, such as mass, inertia, or stiffness (a minimal sketch follows this list).
2) System identification methods build mathematical models of the dynamics from observed data, refining the robot's properties within the model, such as mass and inertia, so that the model faithfully represents the system's behavior.
3) The learned actuator dynamics method uses experimental data collected from the actuators to develop a model of their dynamics, supporting sim-to-real transfer by incorporating realistic actuator behavior into the training environment. Notably, the higher-level planner can also learn either with or without a reference.
4) Periodic reward composition helps capture the essential locomotion information, and the resulting periodic gait pattern generalizes better to uncertainty and variation in the real world.
5) The regulation feedback controller manually tunes the controller settings to mitigate perturbations and discrepancies between simulation and reality, thereby enhancing robustness and adaptation.
Key aspects of sim-to-real transfer, including system identification, state estimation with noisy measurements, and the selection of state and action spaces, are highlighted in [34].
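As a concrete illustration of item 1, the sketch below draws a fresh set of randomized dynamics parameters at the start of every training episode. The parameter names, ranges, and the way they would be written into the simulator are assumptions chosen for illustration; in practice the ranges are tuned per robot and per simulation engine (as in, e.g., [36]).

```python
import numpy as np

# Illustrative randomization ranges (multiplicative scales or absolute values).
RANDOMIZATION = {
    "link_mass_scale": (0.8, 1.2),    # +/-20% of nominal link masses
    "ground_friction": (0.5, 1.2),    # friction coefficient
    "joint_damping":   (0.5, 2.0),    # damping multiplier
    "motor_strength":  (0.9, 1.1),    # torque-limit multiplier
    "obs_latency_s":   (0.0, 0.02),   # observation delay in seconds
}

def sample_dynamics(rng: np.random.Generator) -> dict:
    """Draw one set of dynamics parameters for the next episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION.items()}

# Called at every environment reset; applying the values is simulator-specific
# (e.g., scaling body masses and friction coefficients in the chosen engine).
rng = np.random.default_rng(seed=0)
episode_params = sample_dynamics(rng)
```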
REFERENCES
[1] S. Gupta and A. Kumar, "A brief review of dynamics and control of underactuated biped robots," Advanced Robotics, vol. 31, pp. 607–623, 2017.
[2] J. Reher and A. Ames, "Dynamic walking: Toward agile and efficient bipedal robots," Annual Review of Control, Robotics, and Autonomous Systems, vol. 4, 2021.
[3] J. Carpentier and P.-B. Wieber, "Recent progress in legged robots locomotion control," Current Robotics Reports, vol. 2, pp. 231–238, 2021.
[4] M. A.-M. Khan, M. R. J. Khan, A. Tooshil, N. Sikder, M. A. P. Mahmud, A. Z. Kouzani, and A.-A. Nahid, "A systematic review on reinforcement learning-based robotics within the last decade," IEEE Access, vol. 8, pp. 176598–176623, 2020.
[5] J. García and D. Shafie, "Teaching a humanoid robot to walk faster through safe reinforcement learning," Engineering Applications of Artificial Intelligence, vol. 88, p. 103360, 2020.
[6] C. Chevallereau, G. Abba, Y. Aoustin, F. Plestan, E. Westervelt, C. C. De Wit, and J. Grizzle, "Rabbit: A testbed for advanced control theory," IEEE Control Systems Magazine, vol. 23, pp. 57–79, 2003.
[7] Y. Gong, R. Hartley, X. Da, A. Hereid, O. Harib, J.-K. Huang, and J. Grizzle, "Feedback control of a cassie bipedal robot: Walking, standing, and riding a segway," in American Control Conference, 2019, pp. 4559–4566.
[8] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake, "Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot," Autonomous Robots, vol. 40, pp. 429–455, 2016.
[9] G. A. Castillo, B. Weng, W. Zhang, and A. Hereid, "Robust feedback motion policy design using reinforcement learning on a 3D digit bipedal robot," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 5136–5143.
[10] R. Tedrake, T. Zhang, and H. Seung, "Stochastic policy gradient reinforcement learning on a simple 3D biped," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004, pp. 2849–2854.
[11] J. Morimoto, G. Cheng, C. Atkeson, and G. Zeglin, "A simple reinforcement learning algorithm for biped walking," in IEEE International Conference on Robotics and Automation, 2004, pp. 3030–3035.
[12] X. Peng, G. Berseth, K. Yin, and M. Panne, "DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning," ACM Transactions on Graphics, vol. 36, pp. 1–13, 2017.
[13] X. Peng, P. Abbeel, S. Levine, and M. Panne, "DeepMimic: Example-guided deep reinforcement learning of physics-based character skills," ACM Transactions on Graphics, vol. 37, 2018.
[14] W. Yu, G. Turk, and C. K. Liu, "Learning symmetric and low-energy locomotion," ACM Transactions on Graphics, vol. 37, pp. 1–12, 2018.
[15] M. Taylor, S. Bashkirov, J. F. Rico, I. Toriyama, N. Miyada, H. Yanagisawa, and K. Ishizuka, "Learning bipedal robot locomotion from human movement," in IEEE International Conference on Robotics and Automation, 2021, pp. 2797–2803.
[16] X. Cheng, Y. Ji, J. Chen, R. Yang, G. Yang, and X. Wang, "Expressive whole-body control for humanoid robots," arXiv preprint arXiv:2402.16796, 2024.
[17] A. Tang, T. Hiraoka, N. Hiraoka, F. Shi, K. Kawaharazuka, K. Kojima, K. Okada, and M. Inaba, "HumanMimic: Learning natural locomotion and transitions for humanoid robot via wasserstein adversarial imitation," arXiv preprint arXiv:2309.14225, 2023.
[18] Q. Zhang, P. Cui, D. Yan, J. Sun, Y. Duan, A. Zhang, and R. Xu, "Whole-body humanoid robot locomotion with human reference," arXiv preprint arXiv:2402.18294, 2024.
[19] J. Siekmann, Y. Godse, A. Fern, and J. Hurst, "Sim-to-real learning of all common bipedal gaits via periodic reward composition," in IEEE International Conference on Robotics and Automation, 2021, pp. 7309–7315.
[20] H. Duan, A. Malik, M. S. Gadde, J. Dao, A. Fern, and J. Hurst, "Learning dynamic bipedal walking across stepping stones," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2022, pp. 6746–6752.
[21] C. Tao, M. Li, F. Cao, Z. Gao, and Z. Zhang, "A multiobjective collaborative deep reinforcement learning algorithm for jumping optimization of bipedal robot," Advanced Intelligent Systems, vol. 6, p. 2300352, 2023.
[22] Z. Li, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, "Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control," arXiv e-prints, pp. arXiv–2401, 2024.
[23] T. Li, H. Geyer, C. G. Atkeson, and A. Rai, "Using deep reinforcement learning to learn high-level policies on the ATRIAS biped," in International Conference on Robotics and Automation, 2019, pp. 263–269.
[24] H. Duan, J. Dao, K. Green, T. Apgar, A. Fern, and J. Hurst, "Learning task space actions for bipedal locomotion," in IEEE International Conference on Robotics and Automation, 2021, pp. 1276–1282.
[25] G. A. Castillo, B. Weng, W. Zhang, and A. Hereid, "Reinforcement learning-based cascade motion policy design for robust 3D bipedal locomotion," IEEE Access, vol. 10, pp. 20135–20148, 2022.
[26] R. P. Singh, M. Benallegue, M. Morisawa, R. Cisneros, and F. Kanehiro, "Learning bipedal walking on planned footsteps for humanoid robots," in IEEE-RAS International Conference on Humanoid Robots, 2022, pp. 686–693.
[27] S. Wang, S. Piao, X. Leng, and Z. He, "Learning 3D bipedal walking with planned footsteps and fourier series periodic gait planning," Sensors, vol. 23, p. 1873, 2023.
[28] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Processing Magazine, vol. 34, pp. 26–38, 2017.
[29] W. Zhu and M. Hayashibe, "A hierarchical deep reinforcement learning framework with high efficiency and generalization for fast and safe navigation," IEEE Transactions on Industrial Electronics, vol. 70, pp. 4962–4971, 2023.
[30] X. B. Peng and M. Van De Panne, "Learning locomotion skills using deeprl: Does the choice of action space matter?" in ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2017, pp. 1–13.
[31] Z. Li, X. Cheng, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, "Reinforcement learning for robust parameterized locomotion control of bipedal robots," in IEEE International Conference on Robotics and Automation, 2021, pp. 2811–2817.
[32] D. Kim, G. Berseth, M. Schwartz, and J. Park, "Torque-based deep reinforcement learning for task-and-robot agnostic learning on bipedal robots using sim-to-real transfer," IEEE Robotics and Automation Letters, vol. 8, pp. 6251–6258, 2023.
[33] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne, "Feedback control for cassie with deep reinforcement learning," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018, pp. 1241–1246.
[34] Z. Xie, P. Clary, J. Dao, P. Morais, J. Hurst, and M. van de Panne, "Learning locomotion skills for cassie: Iterative design and sim-to-real," in Conference on Robot Learning, 2020, pp. 317–329.
[35] D. Rodriguez and S. Behnke, "Deepwalk: Omnidirectional bipedal gait by deep reinforcement learning," in IEEE International Conference on Robotics and Automation, 2021, pp. 3033–3039.
[36] J. Siekmann, S. Valluri, J. Dao, L. Bermillo, H. Duan, A. Fern, and J. W. Hurst, "Learning memory-based control for human-scale bipedal locomotion," in Robotics: Science and Systems, 2020.
[37] Z. Li, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, "Robust and versatile bipedal jumping control through multi-task reinforcement learning," in Robotics: Science and Systems, 2023.
[38] J. Siekmann, K. Green, J. Warila, A. Fern, and J. Hurst, "Blind bipedal stair traversal via sim-to-real reinforcement learning," in Robotics: Science and Systems, 2021.
[39] C. Yang, K. Yuan, W. Merkt, T. Komura, S. Vijayakumar, and Z. Li, "Learning whole-body motor skills for humanoids," in IEEE-RAS International Conference on Humanoid Robots, 2019, pp. 270–276.
[40] Z. Xie, H. Ling, N. Kim, and M. Panne, "ALLSTEPS: Curriculum-driven learning of stepping stone skills," Computer Graphics Forum, vol. 39, pp. 213–224, 2020.
[41] H. Duan, A. Malik, J. Dao, A. Saxena, K. Green, J. Siekmann, A. Fern, and J. Hurst, "Sim-to-real learning of footstep-constrained bipedal dynamic walking," in International Conference on Robotics and Automation, 2022, pp. 10428–10434.
[42] B. Marum, M. Sabatelli, and H. Kasaei, "Learning vision-based bipedal locomotion for challenging terrain," arXiv preprint arXiv:2309.14594, 2023.
[43] G. A. Castillo, B. Weng, S. Yang, W. Zhang, and A. Hereid, "Template model inspired task space learning for robust bipedal locomotion," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023, pp. 8582–8589.
[44] C. Gaspard, G. Passault, M. Daniel, and O. Ly, "FootstepNet: an efficient actor-critic method for fast on-line bipedal footstep planning and forecasting," arXiv preprint arXiv:2403.12589, 2024.
[45] K. Green, Y. Godse, J. Dao, R. L. Hatton, A. Fern, and J. Hurst, "Learning spring mass locomotion: Guiding policies with a reduced-order model," IEEE Robotics and Automation Letters, vol. 6, pp. 3926–3932, 2021.
[46] K. Lobos-Tsunekawa, F. Leiva, and J. Ruiz-del Solar, "Visual navigation for biped humanoid robots using deep reinforcement learning," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3247–3254, 2018.
[47] J. Li, L. Ye, Y. Cheng, H. Liu, and B. Liang, "Agile and versatile bipedal robot tracking control through reinforcement learning," arXiv preprint arXiv:2404.08246, 2024.
[48] F. Jenelten, J. He, F. Farshidian, and M. Hutter, "DTC: Deep tracking control," Science Robotics, vol. 9, p. eadh5401, 2024.
[49] L. Smith, I. Kostrikov, and S. Levine, "Demonstrating a walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning," Robotics: Science and Systems Demo, vol. 2, p. 4, 2023.
[50] P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, "DayDreamer: World models for physical robot learning," in Conference on Robot Learning, 2023, pp. 2226–2240.
[51] S. Choi, G. Ji, J. Park, H. Kim, J. Mun, J. H. Lee, and J. Hwangbo, "Learning quadrupedal locomotion on deformable terrain," Science Robotics, vol. 8, p. eade2256, 2023.
[52] G. Feng, H. Zhang, Z. Li, X. B. Peng, B. Basireddy, L. Yue, Z. Song, L. Yang, Y. Liu, K. Sreenath, and S. Levine, "Genloco: Generalized locomotion controllers for quadrupedal robots," in Conference on Robot Learning, vol. 205, 2023, pp. 1893–1903.
[53] Y. Fuchioka, Z. Xie, and M. Van de Panne, "OPT-Mimic: Imitation of optimized trajectories for dynamic quadruped behaviors," in IEEE International Conference on Robotics and Automation, 2023, pp. 5092–5098.
[54] S. Gangapurwala, M. Geisert, R. Orsolino, M. Fallon, and I. Havoutis, "RLOC: Terrain-aware legged locomotion using reinforcement learning and optimal control," IEEE Transactions on Robotics, vol. 38, pp. 2908–2927, 2022.
[55] D. Kang, J. Cheng, M. Zamora, F. Zargarbashi, and S. Coros, "RL + Model-Based Control: Using on-demand optimal control to learn versatile legged locomotion," IEEE Robotics and Automation Letters, vol. 8, pp. 6619–6626, 2023.
[56] F. Jenelten, R. Grandia, F. Farshidian, and M. Hutter, "TAMOLS: Terrain-aware motion optimization for legged systems," IEEE Transactions on Robotics, vol. 38, pp. 3395–3413, 2022.
[57] X. B. Peng, E. Coumans, T. Zhang, T.-W. E. Lee, J. Tan, and S. Levine, "Learning agile robotic locomotion skills by imitating animals," in Robotics: Science and Systems, 2020.
[58] F. Yin, A. Tang, L. Xu, Y. Cao, Y. Zheng, Z. Zhang, and X. Chen, "Run like a dog: Learning based whole-body control framework for quadruped gait style transfer," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 8508–8514.
[59] Y. Ma, F. Farshidian, T. Miki, J. Lee, and M. Hutter, "Combining learning-based locomotion policy with model-based manipulation for legged mobile manipulators," IEEE Robotics and Automation Letters, vol. 7, pp. 2377–2384, 2022.
[60] Z. Fu, X. Cheng, and D. Pathak, "Deep whole-body control: Learning a unified policy for manipulation and locomotion," in Conference on Robot Learning, 2023, pp. 138–149.
[61] P. Arm, M. Mittal, H. Kolvenbach, and M. Hutter, "Pedipulate: Enabling manipulation skills using a quadruped robot's leg," in IEEE Conference on Robotics and Automation, 2024.
[62] I. Radosavovic, T. Xiao, B. Zhang, T. Darrell, J. Malik, and K. Sreenath, "Learning humanoid locomotion with transformers," arXiv preprint arXiv:2303.03381, 2023.
[63] A. Byravan, J. Humplik, L. Hasenclever, A. Brussee, F. Nori, T. Haarnoja, B. Moran, S. Bohez, F. Sadeghi, B. Vujatovic et al., "Nerf2real: Sim2real transfer of vision-guided bipedal motion skills using neural radiance fields," in IEEE International Conference on Robotics and Automation, 2023, pp. 9362–9369.
[64] A. Kumar, Z. Li, J. Zeng, D. Pathak, K. Sreenath, and J. Malik, "Adapting rapid motor adaptation for bipedal robots," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2022, pp. 1161–1168.
[65] R. P. Singh, Z. Xie, P. Gergondet, and F. Kanehiro, "Learning bipedal walking for humanoids with current feedback," IEEE Access, vol. 11, pp. 82013–82023, 2023.
[66] B. van Marum, M. Sabatelli, and H. Kasaei, "Learning perceptive bipedal locomotion over irregular terrain," arXiv preprint arXiv:2304.07236, 2023.
[67] J. Dao, H. Duan, and A. Fern, "Sim-to-real learning for humanoid box loco-manipulation," arXiv preprint arXiv:2310.03191, 2023.
[68] J. Baltes, G. Christmann, and S. Saeedvand, "A deep reinforcement learning algorithm to control a two-wheeled scooter with a humanoid robot," Engineering Applications of Artificial Intelligence, vol. 126, p. 106941, 2023.
[69] T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, J. Humplik, M. Wulfmeier, S. Tunyasuvunakool, N. Y. Siegel, R. Hafner et al., "Learning agile soccer skills for a bipedal robot with deep reinforcement learning," Science Robotics, vol. 9, p. eadi8022, 2024.
[70] M. Seo, S. Han, K. Sim, S. H. Bang, C. Gonzalez, L. Sentis, and Y. Zhu, "Deep imitation learning for humanoid loco-manipulation through human teleoperation," in IEEE-RAS International Conference on Humanoid Robots, 2023, pp. 1–8.
[71] K. N. Kumar, I. Essa, and S. Ha, "Words into action: Learning diverse humanoid robot behaviors using language guided iterative motion refinement," in Workshop on Language and Robot Learning: Language as Grounding, 2023.
[72] Y. Tong, H. Liu, and Z. Zhang, "Advancements in humanoid robots: A comprehensive review and future prospects," IEEE/CAA Journal of Automatica Sinica, vol. 11, pp. 301–328, 2024.
[73] A. Dzedzickis, J. Subačiūtė-Žemaitienė, E. Šutinys, U. Samukaitė-Bubnienė, and V. Bučinskas, "Advanced applications of industrial robotics: New trends and possibilities," Applied Sciences, vol. 12, p. 135, 2021.
[74] M. Yang, E. Yang, R. C. Zante, M. Post, and X. Liu, "Collaborative mobile industrial manipulator: a review of system architecture and applications," in International Conference on Automation and Computing, 2019, pp. 1–6.
[75] "6+ Hours Live Autonomous Robot Demo," https://www.youtube.com/watch?v=Ke468Mv8ldM, Mar. 2024.
[76] G. Bingjing, H. Jianhai, L. Xiangpan, and Y. Lin, "Human–robot interactive control based on reinforcement learning for gait rehabilitation training robot," International Journal of Advanced Robotic Systems, vol. 16, p. 1729881419839584, 2019.
[77] A. Diodato, M. Brancadoro, G. De Rossi, H. Abidi, D. Dall'Alba, R. Muradore, G. Ciuti, P. Fiorini, A. Menciassi, and M. Cianchetti, "Soft robotic manipulator for improving dexterity in minimally invasive surgery," Surgical Innovation, vol. 25, pp. 69–76, 2018.
[78] R. Bogue, "Underwater robots: a review of technologies and applications," Industrial Robot: An International Journal, vol. 42, pp. 186–191, 2015.
[79] N. Rudin, H. Kolvenbach, V. Tsounis, and M. Hutter, "Cat-like jumping and landing of legged robots in low gravity using deep reinforcement learning," IEEE Transactions on Robotics, vol. 38, pp. 317–328, 2022.
[80] J. Qi, H. Gao, H. Su, L. Han, B. Su, M. Huo, H. Yu, and Z. Deng, "Reinforcement learning-based stable jump control method for asteroid-exploration quadruped robots," Aerospace Science and Technology, vol. 142, p. 108689, 2023.
[81] O. Mubin, C. Bartneck, L. Feijs, H. Hooft van Huysduynen, J. Hu, and J. Muelver, "Improving speech recognition with the robot interaction language," Disruptive Science and Technology, vol. 1, pp. 79–88, 2012.
[82] A. Meduri, M. Khadiv, and L. Righetti, "DeepQ stepper: A framework for reactive dynamic walking on uneven terrain," in IEEE International Conference on Robotics and Automation, 2021, pp. 2099–2105.
[83] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in International Conference on Learning Representations, 2016.
[84] L. Liu, M. V. D. Panne, and K. Yin, "Guided learning of control graphs for physics-based characters," ACM Transactions on Graphics, vol. 35, pp. 1–14, 2016.
[85] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
[86] C. Huang, G. Wang, Z. Zhou, R. Zhang, and L. Lin, "Reward-adaptive reinforcement learning: Dynamic policy gradient optimization for bipedal locomotion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 7686–7695, 2023.
[87] S. Dankwa and W. Zheng, "Twin-delayed DDPG: A deep reinforcement learning technique to model a continuous movement of an intelligent robot agent," in International Conference on Vision, Image and Signal Processing, 2019, pp. 1–5.
[88] J. Leng, S. Fan, J. Tang, H. Mou, J. Xue, and Q. Li, "M-A3C: A mean-asynchronous advantage actor-critic reinforcement learning method for real-time gait planning of biped robot," IEEE Access, vol. 10, pp. 76523–76536, 2022.
[89] C. Yu and A. Rosendo, "Multi-modal legged locomotion framework with automated residual reinforcement learning," IEEE Robotics and Automation Letters, vol. 7, pp. 10312–10319, 2022.
[90] O. Aydogmus and M. Yilmaz, "Comparative analysis of reinforcement learning algorithms for bipedal robot locomotion," IEEE Access, pp. 7490–7499, 2023.
[91] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv e-prints, pp. arXiv–1707, 2017.
[92] C. Tao, J. Xue, Z. Zhang, and Z. Gao, "Parallel deep reinforcement learning method for gait control of biped robot," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, pp. 2802–2806, 2022.
[93] W. Yu, V. C. V. Kumar, G. Turk, and C. K. Liu, "Sim-to-real transfer for biped locomotion," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019, pp. 3503–3510.
[94] S. Masuda and K. Takahashi, "Sim-to-real transfer of compliant bipedal locomotion on torque sensor-less gear-driven humanoid," in IEEE-RAS International Conference on Humanoid Robots, 2023, pp. 1–8.
[95] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, "Learning agile and dynamic motor skills for legged robots," Science Robotics, vol. 4, p. eaau5872, 2019.
[96] G. A. Castillo, B. Weng, W. Zhang, and A. Hereid, "Robust feedback motion policy design using reinforcement learning on a 3D digit bipedal robot," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 5136–5143.
