Deep Reinforcement Learning For Bipedal Locomotion: A Brief Survey
Fig. 2: Catalogue of DeepRL frameworks for bipedal locomotion: end-to-end schemes (reference-based learning, comprising residual learning and guided learning, and reference-free learning) and hybrid control schemes (deep planning hybrid scheme, learning-based feedback hybrid scheme, and learned hierarchy scheme).
learned hierarchical control scheme [28] decomposes locomotion into various tasks, focusing each layer on specific functions such as navigation and fundamental locomotion skills [12], [13], [29].

While several review papers discuss RL for general robotics [4] and model-based methods for bipeds [1], [2], [3], none specifically focus on DRL-based frameworks for bipeds. This survey aims to address this gap by summarizing current research progress, highlighting the structure and capabilities of bipedal locomotion frameworks, and exploring future directions. We also catalogue DRL-based frameworks, as depicted in Fig. 2. The primary contributions of this survey are:
• A comprehensive summary and cataloguing of DRL-based frameworks for bipedal locomotion.
• A detailed comparison of each control scheme, highlighting their strengths, limitations, and characteristics.
• The identification of current challenges and the provision of insightful future research directions.

The paper is organized as follows: Section II focuses on end-to-end frameworks, categorized by learning approaches. Section III details hierarchical frameworks, classified into three main types. Section IV addresses existing gaps, ongoing challenges, and potential future research directions. Finally, Section V concludes the paper.

II. END-TO-END FRAMEWORK

The end-to-end DRL framework represents a holistic approach where a single neural network (NN) policy, denoted π(·) : X → U, directly translates sensory inputs—such as images, lidar data, or proprioceptive feedback [30]—along with user commands [19] or pre-defined references [31], into joint-level control actions. These actions encompass motor torques [32], positions, and velocities [15]. This framework obviates the need for manually decomposing the problem into sub-tasks, streamlining the control process.

End-to-end strategies primarily simplify the design of low-level tracking to basic elements, such as a Proportional-Derivative (PD) controller. These methods can be broadly categorized based on their reliance on prior knowledge into two types: reference-based and reference-free. The locomotion skills developed through these diverse learning approaches exhibit considerable variation in performance and adaptability. In the following sections, we will delve into various representation frameworks, exploring their characteristics, limitations, and strengths in comprehensive detail. To facilitate an understanding of these distinctions, Table I provides a succinct overview of the frameworks discussed.

A. Reference-based learning

Reference-based learning utilizes prior knowledge, allowing the policy to develop locomotion skills by adhering to predefined references, which may be derived from trajectory optimization (TO) techniques or captured through motion capture systems. This method eases the acquisition of locomotion skills compared to alternative approaches, though it typically results in locomotion patterns that closely resemble the predefined references or motion clips, thus limiting the variety of gait patterns. Generally, this approach can be divided into two primary methods: (i) residual learning and (ii) guided learning.

1) Residual learning: This method involves a framework that is aware of the current reference joint positions and applies offsets determined by the policy to modify motor commands at the current timestep. By utilizing predefined motion trajectories, the residual term acts as feedback control, compensating for errors and enabling the biped to achieve dynamic locomotion skills.

Introduced in 2018, a residual learning framework for the bipedal robot Cassie marked a significant advancement [33]. This framework allowed the robot to walk forward by incorporating a policy trained via the Proximal Policy Optimization (PPO) algorithm, as detailed in Appendix A. The policy receives the robot's states and reference inputs, outputting a residual term that augments the reference at the current timestep. These modified references are then processed by a Proportional-Derivative (PD) controller to set the desired joint positions. While this framework enhanced the robot's ability to perform tasks beyond standing [39], it was never deployed on a physical bipedal robot, remained potentially impractical for managing walking at varying speeds, and limited movement to a single direction.

To transition this framework to a real robot, a sim-to-real strategy based on the previous model was demonstrated, where the policy, trained through a residual learning approach, was subsequently applied on a physical bipedal robot [34]. This process and its key points are further explored in Appendix B.
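To make the residual scheme concrete, the following is a minimal sketch of one control step of such a framework. The `policy`, `robot`, and `reference` interfaces are hypothetical placeholders rather than APIs from the cited works: the policy corrects a predefined joint trajectory, and a joint-level PD loop tracks the corrected target.

```python
import numpy as np

def residual_control_step(policy, robot, reference, phase, kp, kd):
    """One control step of a residual-learning framework (sketch).

    The policy observes the robot state together with the reference and
    outputs a joint-position offset; a PD controller then tracks the
    corrected reference, acting as low-level feedback.
    """
    q, qd = robot.joint_positions(), robot.joint_velocities()
    q_ref = reference.joint_positions(phase)           # predefined trajectory
    residual = policy(np.concatenate([q, qd, q_ref]))  # learned offset
    q_des = q_ref + residual                           # corrected target
    tau = kp * (q_des - q) - kd * qd                   # PD tracking torque
    robot.apply_torques(tau)
```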
TABLE I: Summary and comparison of reference-based and reference-free learning approaches for the end-to-end framework. Columns: Methods, Works, Capabilities, Characteristics, Advantages and Disadvantages, and Implementation Flow Chart; dashed lines in the implementation flow charts denote optional components.
Compared to model-based methods, this training policy achieves faster running speeds on the same platform, underlining the considerable potential of DRL-based frameworks. However, the robot's movements remain constrained to merely walking forward or backward. A novel approach in residual learning was introduced to move beyond unidirectional walking, where the policy outputs a residual term added to the current positional states, facilitating gradual omnidirectional walking [35].

2) Guided learning: Guided learning trains policies to directly output the desired joint-level commands, eschewing the addition of a residual term. The reward structure in this approach is focused on closely imitating predefined references.

A sim-to-real framework that employs periodic references to initiate the training phase was proposed in [36]. In this framework, the action space directly maps to the joint angles, and desired joint positions are managed by joint PD controllers. The framework also incorporates a Long Short-Term Memory (LSTM) network, as detailed in Appendix A, which is synchronised with periodic time inputs. However, this model is limited to a single locomotion goal: forward walking. A more diverse and robust walking DRL framework that includes a Hybrid Zero Dynamics (HZD) gait library was demonstrated in [31], achieving a significant advancement by enabling a single end-to-end policy to facilitate walking, turning, and squatting. Despite these advancements, the parameterization of reference motions introduces constraints that limit the flexibility of the learning process and the policy's response to disturbances.

To broaden the capabilities of guided learning policies, a framework capable of handling multiple targets, including jumping, was developed [37]. This approach introduced a novel policy structure that integrates long-term input/output (I/O) encoding, complemented by a multi-stage training methodology that enables the execution of complex jumping maneuvers. An adversarial motion priors approach, employing a style reward mechanism, was also introduced to facilitate the acquisition of user-specified gait behaviors [18]. This method improves the training of high-dimensional simulated agents by replacing complex hand-designed reward functions with more intuitive controls.

While previous works primarily focused on specific locomotion skills, a unified framework that accommodates both periodic and non-periodic motions was further developed [22] based on the foundational work in [37]. This framework enhances the learning process by incorporating a wide range of locomotion skills and introducing a dual I/O history approach, marking a significant breakthrough in creating a robust, versatile, and dynamic end-to-end framework. However, experimental results indicate that the precision of locomotion features, such as velocity tracking, remains suboptimal.

Guided learning methods expedite the learning process by leveraging expert knowledge and have demonstrated the capacity to achieve versatile and robust locomotion skills. Through the comprehensive evaluation in [22], it is demonstrated that guided learning employs references without complete dependence on them. Conversely, residual learning exhibits failures or severe deviations when predicated on references of inferior quality. This shortfall stems from the framework's dependency on adhering closely to the provided references, which narrows its learning capabilities.

Nonetheless, reference-based learning's reliance on predefined trajectories confines the policy to specific gaits, restricting its capacity to explore a broader range of motion possibilities. Additionally, this approach exhibits limited adaptability in responding effectively to unforeseen environmental changes or novel challenges.
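As an illustration of the guided-learning idea, the sketch below shows a typical imitation-style reward term that scores how closely the commanded joints track the reference. The exponential kernel and its width are assumed values for illustration, not taken from the cited works.

```python
import numpy as np

def imitation_reward(q, q_ref, width=5.0):
    """Imitation-style reward (sketch): close tracking of the reference
    joint positions yields a reward near 1, large deviations decay to 0.
    `width` controls how sharply deviations are penalized (assumed value)."""
    err = np.linalg.norm(q - q_ref)
    return float(np.exp(-width * err ** 2))
```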
B. Reference-free learning

In reference-free learning, the policy is trained using a carefully crafted reward function rather than relying on predefined trajectories. This approach allows the policy to explore a wider range of gait patterns and adapt to unforeseen terrains, thereby enhancing innovation and flexibility within the learning process.

The concept of reference-free learning was initially explored using simulated physics engines with somewhat unrealistic bipedal models. A pioneering framework, which focused on learning symmetric gaits from scratch without the use of motion capture data, was developed and validated within a simulated environment [14]. This framework introduced a novel term into the loss function and utilized a curriculum learning strategy to effectively shape gait patterns. Another significant advancement was made in developing a learning method that enabled a robot to navigate stepping stones using curriculum learning, focusing on a physical robot model, Cassie, though this has yet to be validated outside of simulation [40].

Considering the practical implementation of theoretical models, significant efforts have been directed towards developing sim-to-real frameworks in robotics studies. A notable example of such a framework accommodates various periodic motions, including walking, hopping, and galloping [19]. This framework employs periodic rewards to facilitate initial training within simulations before successfully transitioning to a physical robot. It has been further refined to adapt to diverse terrains and scenarios. For instance, robust blind walking on stairs was demonstrated through terrain randomization techniques in [38]. Additionally, the integration of a vision system has enhanced the framework's ability to precisely determine foot locations [41], thus enabling the robot to effectively navigate stepping stones [20]. Subsequent developments include the incorporation of a vision system equipped with height maps, leading to an end-to-end framework that more effectively generalizes terrain information [42].

This approach to learning enables the exploration of novel solutions and strategies that might not be achievable through mere imitation of existing behaviours. However, the absence of reference guidance can render the learning process costly, time-consuming, and potentially infeasible for certain tasks. Moreover, the success of this method hinges critically on the design of the reward function, which presents significant challenges in specifying tasks such as jumping.
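To illustrate what a periodic reward can look like, the sketch below alternates swing and stance objectives according to a gait-phase clock. This is a simplified, deterministic variant (the formulation in [19] is probabilistic and more elaborate), and the swing ratio is an assumed value.

```python
import numpy as np

def periodic_reward(phase, foot_force, foot_speed, swing_ratio=0.5):
    """Phase-based reward for one foot (sketch).

    phase: gait phase in [0, 1); foot_force / foot_speed: measured norms.
    During swing the foot should be unloaded (penalize ground force);
    during stance it should be planted (penalize foot speed)."""
    if phase < swing_ratio:               # swing portion of the cycle
        return float(np.exp(-foot_force))
    return float(np.exp(-foot_speed))     # stance portion
```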
III. HIERARCHY FRAMEWORK

Unlike end-to-end policies that directly map sensor inputs to motor outputs, hierarchical control schemes deconstruct locomotion challenges into discrete, manageable layers or stages of decision-making. Each layer within this structure is tasked with specific objectives, ranging from high-level navigation to fundamental locomotion skills. This division not only enhances the framework's flexibility but also simplifies the problem-solving process for each policy.

The architecture of a hierarchical framework typically comprises two principal modules: an HL planner and an LL controller. This modular approach allows for the substitution of each component with either a model-based method or a learning-based policy, further enhancing adaptability and customisation to specific needs.

Hierarchical frameworks can be classified into three distinct types based on the integration and function of their components:
1) Deep planning hybrid scheme: This approach combines strategic, high-level planning with dynamic low-level execution, leveraging the strengths of both learning-based and traditional model-based methods.
2) Feedback DRL control hybrid scheme: It focuses on integrating direct feedback control mechanisms with deep reinforcement learning, allowing for real-time adjustments and enhanced responsiveness.
3) Learned hierarchy scheme: Entirely learning-driven, this scheme develops a layered decision-making hierarchy where each level is trained to optimise specific aspects of locomotion.

These frameworks are illustrated in Fig. 3. Each type offers unique capabilities and exhibits distinct characteristics, albeit with limitations primarily due to the complexities involved in integrating diverse modules and their interactions.

For a concise overview, Table 3 summarises the various frameworks, detailing their respective strengths, limitations, and primary characteristics. The subsequent sections will delve deeper into each of these frameworks, providing a thorough analysis of their operational mechanics and their application in real-world scenarios.

A. Deep planning hybrid scheme

In this scheme, robots are pre-equipped with the ability to execute basic locomotion skills such as walking, typically managed through model-based feedback controllers or interpretable methods. The addition of an HL learned layer focuses on strategic goals or the task space, enhancing locomotion capabilities and equipping the robot with advanced navigation abilities to effectively explore its environment.

Several studies have demonstrated the integration of an HL planner policy with a model-based controller to achieve tasks in world space. A notable framework optimises task-space-level performance, eschewing direct joint-level and balancing considerations [24]. This system combines a residual learning planner with an inverse dynamics controller, enabling precise control over task-space commands to joint-level actions, thereby improving velocity tracking, foot touchdown location, and height control. Further advancements include a hybrid framework that merges HZD-based residual deep planning with model-based regulators to correct errors in learned trajectories, showcasing robustness, training efficiency, and effective velocity tracking [25]. These frameworks have been successfully transferred from simulation to reality and validated on robots such as Cassie.

However, the limitations imposed by residual learning constrained the agents' capacity to explore a broader array of possibilities. Building on previous work [25], a more efficient hybrid framework was developed, which learns from scratch
without reliance on prior knowledge [43]. In this approach, a purely learning-based HL planner interacts with an LL controller using an Inverse Dynamics with Quadratic Programming (ID-QP) formulation. This policy adeptly captures dynamic walking gaits through the use of reduced-order states and simplifies the learning trajectory. Demonstrating robustness and training efficiency, this framework has outperformed other models and was successfully generalized across various bipedal platforms, including Digit, Cassie, and RABBIT.

In parallel, several research teams have focused on developing navigation planners specifically for toy-like humanoid robots, which provide greater physical stability compared to torque-driven or hydraulic bipedal robots, as shown in Fig. 1. One notable study [46] implemented a visual navigation policy on the NAO robot, depicted in Fig. 1(a), utilizing RGB cameras as the primary sensory modality. This system has demonstrated successful zero-shot transfer to real-world scenarios, enabling the robot to adeptly navigate around obstacles. Further research [44] has explored complex dynamic motion tasks, such as playing soccer, by integrating a learned policy with an online footstep planner that utilises weight positioning generation (WPG) to create a center of mass (CoM) trajectory. This configuration is coupled with a whole-body controller, facilitating dynamic activities like soccer shooting. Despite their platform's stability, provided by large feet and a lightweight structure, these robots exhibit limited dynamic movement capabilities compared to full-sized humanoid robots. Consequently, this research primarily addresses navigation and task execution.

Regarding generalization, these frameworks have shown potential for adaptation across different types of bipedal and humanoid robots with minimal adjustments, demonstrating advanced user command tracking [43] and sophisticated navigation capabilities [44]. However, limitations are evident, notably the absence of capabilities for executing more complex and dynamic motions, such as jumping. Furthermore, while these systems adeptly navigate complex terrains with obstacles, footstep planning alone is insufficient without concurrent enhancements to the robot's overall locomotion capabilities. Moreover, the requisite communication between the two distinct layers of the hierarchical framework may introduce system complexities. Enhancing both navigation and dynamic locomotion capabilities within the HL planner remains a significant challenge.

B. Feedback DRL control hybrid scheme

In contrast to the comprehensive approach of end-to-end policies discussed in Section II, which excels in handling versatile locomotion skills and complex terrains with minimal inference times, the feedback DRL control hybrid scheme integrates DRL policies as LL controllers. These LL controllers, replacing traditional model-based feedback mechanisms, work in conjunction with HL planners that process terrain information, plan future walking paths, and maintain robust locomotion stability.

For instance, gait libraries, which provide predefined movement references based on user commands, have been integrated into such frameworks [45]. Despite the structured approach of using gait libraries, their static nature offers limited adaptability to changing terrains, diminishing their effectiveness. A more dynamic approach involves online planning, which has shown greater adaptability and efficiency. One notable framework combines a conventional foot planner with an LL DRL policy [26], delivering targeted footsteps and directional guidance to the robot, thereby enabling responsive and varied walking commands. Moreover, HL controllers can provide additional feedback to LL policies, incorporating CoM or end-feet information, either from model-based methods or other conventional control strategies. However, this work has not yet been transferred from simulation to real-world applications.
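The division of labour in such hybrid schemes can be summarized with a small sketch: a high-level planner replans at a low rate while a low-level policy tracks its targets at the control rate. All interfaces below (`hl_planner`, `ll_policy`, `robot`) are hypothetical placeholders, not APIs from the cited works.

```python
def run_hierarchy(hl_planner, ll_policy, robot, steps=1000, replan_every=25):
    """Two-layer control loop (sketch): slow planning, fast tracking."""
    target = None
    for t in range(steps):
        if t % replan_every == 0:
            # High level: footstep / velocity targets from state and terrain.
            target = hl_planner.plan(robot.state(), robot.terrain())
        # Low level: joint-level action tracking the current target.
        action = ll_policy(robot.state(), target)
        robot.step(action)
```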
Fig. 3: Hierarchy Control Scheme Diagram (each scheme is driven by a task command): (a) A basic hierarchical scheme with two layers, where each module can be substituted with a learned policy. (b) A deep planning hybrid scheme, where the High-Level (HL) planner is learned. (c) A learning-based feedback control hybrid scheme, with a learned Low-Level (LL) controller. (d) A comprehensive DRL hierarchy control scheme, where both layers are learned.
Expanding the application of this framework, a sim-to-real strategy for a wheeled bipedal robot was proposed, focusing the LL policy on balance and position tracking, while the HL policy enhances safety by aiding in collision avoidance and making strategic decisions based on the orientation of subgoals [29].

C. Learned hierarchy scheme

Learning complex locomotion skills, particularly when incorporating navigation elements, presents a significant challenge in robotics. Decomposing these tasks into distinct locomotion and navigation components allows robots to tackle more intricate activities, such as dribbling a soccer ball [12]. As discussed in the previous section, the benefits of integrating RL-based planners with RL-based controllers have been effectively demonstrated. This combination enables the framework to adeptly manage a diverse array of environments and tasks. Within such a framework, the High-Level (HL) policy is optimized for strategic planning and achieving specific goals. This optimization allows for targeted enhancements depending on the tasks at hand. Moreover, the potential for continuous improvement and adaptation through further training ensures that the system can evolve over time, improving its efficiency and effectiveness in response to changing conditions or new objectives.

Despite the theoretical advantages, the practical implementation of this type of sim-to-real application for bipedal robots remains largely unexplored. The transition from simulation to real-world scenarios is fraught with challenges, not least because of the complexities involved in training and integrating two separate layers within the control hierarchy. Ensuring effective communication and cooperation between these layers is critical, requiring a meticulously defined communication interface to avoid operational discrepancies.

Additionally, the training process for each policy within the hierarchy demands considerable computational resources. The intensive nature of this training can lead to a reliance on the simulation environment, potentially causing the system to overfit to specific scenarios and thereby fail to generalize to real-world conditions. This limitation highlights a significant hurdle that must be addressed to enhance the viability of learned hierarchy frameworks in practical applications.

IV. CHALLENGES AND FUTURE RESEARCH DIRECTIONS

While learning-based frameworks for bipedal robots have demonstrated considerable potential, they have also clearly exposed the limitations inherent to each framework. Moreover, several critical areas remain largely unexplored, especially within the realm of legged robotics, where the pace of research on bipedal robots lags behind that of their quadruped counterparts. This discrepancy in research progress can be attributed to several factors, including the higher costs and less mature technology associated with bipedal robot hardware, as well as the inherent instability issues that bipedal designs face.

To gain a deeper understanding of these challenges and to outline potential future directions, it is instructive to first review existing research on quadruped robots. The insights gained from quadrupeds, which benefit from more robust research outputs and technological advancements, can provide valuable lessons for overcoming similar challenges in bipedal systems.

A. Recent progress with quadruped robots

While DRL remains an emerging technology in bipedal robotics, it has firmly established its presence in the realm of quadruped robots, another category of legged systems. The diversity of frameworks developed for quadrupeds ranges from model-based RL designed for training in real-world scenarios, where unpredictable dynamics often prevail [49], [50], to systems that include the modeling of deformable terrain to enhance locomotion over compliant surfaces [51]. Furthermore, dynamic quadruped models facilitate highly adaptable policies [52], and sophisticated acrobatic motions are achieved through imitation learning [53].

The domain of quadruped DRL has also seen significant advancements in complex hybrid frameworks that integrate vision-based systems. To date, two primary versions of such frameworks have been developed: one where a deep planning module is paired with model-based control [54], and another that combines model-based planning with low-level DRL control [48], [55]. The latter has shown substantial efficacy; it employs model predictive control (MPC) to generate reference motions, which are then followed by an LL feedback DRL policy. Additionally, the Terrain-aware Motion Generation for Legged Robots (TAMOLS) module [56] enhances the MPC and DRL policy by providing terrain height maps for effective foothold placements across diverse environments, including those not encountered during training. However, similar hybrid control schemes have not been thoroughly investigated within the field of bipedal locomotion.

Quadruped DRL frameworks are predominantly designed to navigate complex terrains, but efforts to extend their capabilities to other tasks are underway. These include mimicking real animals through motion capture data and imitation learning [57], [58], as well as augmenting quadrupeds with manipulation abilities. This is achieved either by adding a manipulator [59], [60] or by using the robots' legs [61]. Notably, the research presented in [60] demonstrates that loco-manipulation tasks can be effectively managed using a single unified end-to-end framework.

Despite the progress in quadruped DRL, similar advancements have been limited for bipedal robots, particularly in loco-manipulation tasks and vision-based DRL frameworks. Establishing a unified framework could bridge this gap, an essential step given the integral role of bipedal robots in developing full humanoid systems. Moreover, the potential of hybrid frameworks that combine model-based and DRL-based methods in bipedal robots remains largely untapped.

B. Gaps and challenges

Despite numerous promising developments in the field of bipedal and humanoid robotics, significant gaps remain between current research outcomes and the ultimate goals. This discussion concentrates on the gaps in frameworks and algorithms rather than hardware, structured around two pivotal questions: 1) Is it possible to design a unified framework that
achieves both generalization and precision? 2) Can we develop a straightforward end-to-end policy capable of managing all tasks efficiently?

1) Generalization versus precision: DRL has demonstrated potential in facilitating versatile locomotion skills [22]; however, challenges such as poor velocity tracking and issues with precise control often arise. While [43] shows that deep planning combined with model-based control can achieve precise velocity tracking, and [37] illustrates successful end-to-end control for precise jumping, the creation of a policy that effectively handles both diverse tasks and precise movements remains elusive. Furthermore, [41] introduces a foot constraint policy framework, enabling precise target tracking and accurate touchdown locations. Yet, there is still no framework that comprehensively addresses the dual demands of versatility and precision in locomotion.

The difficulty in simultaneously achieving precise control and a broad range of actions in bipedal locomotion using DRL stems from several factors:
• Complex dynamics: Bipedal locomotion involves intricate dynamics, posing a significant challenge to maintaining both dynamic motion and precision.
• Resource intensity: Executing diverse locomotion tasks requires considerable computational power and extensive data, necessitating high-quality hardware and efficient DRL algorithms.
• Training conflicts: Training DRL systems to achieve both precision and versatility often leads to conflicts. Designing reward functions and training policies that satisfy both criteria is inherently complex.

These challenges underscore the need for innovative solutions that can bridge the gap between the capabilities of current frameworks and the ambitious goals of advanced bipedal and humanoid robotics.

2) Simplifying frameworks to overcome complex tasks: The envisioned ideal in robotic design is an end-to-end framework that enables robots to traverse various terrains using versatile locomotion skills. Although current research often focuses on enhancing frameworks by adding complex components to mitigate inherent limitations, such as the integration of a foot planner for omnidirectional locomotion and stair navigation, as demonstrated in [26], [27], simpler end-to-end frameworks have also proven effective. These frameworks adeptly navigate challenging terrains and perform a diverse range of locomotion tasks with fewer components [20], [22].

The advantage of maintaining simplicity in the framework lies in its ability to streamline decision-making processes, thereby reducing computational overhead and potential points of failure. To achieve an optimal end-to-end framework, advancements in several key areas are essential:
• Robust and efficient DRL algorithms: Development of algorithms that can manage high-dimensional and continuous control problems more effectively.
• Specialized neural network architectures: Design of neural architectures tailored for specific bipedal tasks, capable of processing extensive sensory data (e.g., visual and tactile inputs), similar to the innovations presented in [42].
• Effective reward functions: Formulation of reward functions that more accurately guide the learning process towards achieving desired behaviors and strategic outcomes.
• Advanced computational resources: Enhancement of computational capabilities to support more intensive training and faster inference, facilitating real-time decision-making in dynamic environments.

By focusing on these developmental areas, the potential to create a unified, efficient, and less complex framework for handling complex locomotion challenges in bipedal robots is significantly increased.

C. Future directions

The exploration of quadruped robotics has yielded substantial advancements, yet the full potential of bipedal robotics remains largely untapped. Building on the successes and innovative approaches observed in quadruped robots, several key future directions emerge that could significantly enhance bipedal and humanoid robotics.

1) Unified framework: Currently, no single framework exists that enables bipedal or humanoid robots to adeptly navigate all types of terrains, including stepping stones, stairs, deformable terrain, and slippery surfaces. A promising approach, as evidenced by recent work in quadruped robots [48], utilizes MPC to generate reference motions, which a low-level DRL policy then tracks. This method, coupled with the Terrain-aware Motion Generation for Legged Robots (TAMOLS) module, simplifies the terrain representation into a height map, facilitating more effective navigation. This success encourages further exploration into hybrid frameworks that combine model-based methods with DRL, inheriting the strengths of both approaches, as discussed in Section III. However, hybrid frameworks present challenges such as training efficiency and system complexity, which demand considerable computational resources and extensive training periods.

Moreover, recent studies [20], [42], [22] have demonstrated the potential of end-to-end frameworks enhanced with vision-based information. These frameworks successfully navigate challenging terrains and execute dynamic motions, suggesting the feasibility of a unified framework capable of handling diverse environments and tasks. Training strategies such as curriculum learning and task randomization could be employed, utilizing visual height maps as inputs to the policy, enhancing the robot's ability to adapt and perform in varied scenarios.

In addition, the introduction of a DRL end-to-end framework incorporating transformer models, as in [62], presents significant possibilities for integrating locomotion skills with language and vision capabilities. The use of large-scale models capable of processing and condensing extensive data sets into a coherent model could expand the robot's range of capabilities, maintaining versatility across a broad spectrum of tasks.

The exploration of transformers and other large-scale models holds considerable promise for enhancing generalizability and adaptability in complex tasks, warranting further investigation into their potential applications in bipedal robotics.
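As a rough illustration of the transformer-based direction, the sketch below encodes a short history of proprioceptive observations with self-attention and reads the action off the latest token. All dimensions and the two-layer architecture are illustrative assumptions, not the design of [62].

```python
import torch
import torch.nn as nn

class HistoryTransformerPolicy(nn.Module):
    """Policy that attends over an observation history (sketch)."""
    def __init__(self, obs_dim=42, act_dim=12, embed_dim=64, history=16):
        super().__init__()
        self.embed = nn.Linear(obs_dim, embed_dim)
        self.pos = nn.Parameter(torch.zeros(history, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, act_dim)

    def forward(self, obs_history):             # (batch, history, obs_dim)
        x = self.embed(obs_history) + self.pos  # add learned positions
        x = self.encoder(x)                     # self-attention over time
        return self.head(x[:, -1])              # action from latest token

policy = HistoryTransformerPolicy()
action = policy(torch.randn(1, 16, 42))         # one control step
```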
2) Vision-based learning framework: Vision plays a critical role in enabling robots to navigate challenging terrains, such as blind drops, where tactile and other sensory inputs may not provide sufficient information. Despite the importance of vision, many current frameworks, particularly in bipedal robotics, do not fully exploit this modality [38], [43]. Vision-based systems are essential in human locomotion for identifying obstacles and assessing terrains, and some studies have begun to show the effectiveness of integrating vision into DRL frameworks for bipedal and humanoid robots [20], [42], [27].

Building on the groundwork laid by both bipedal and quadruped robots, two promising directions have emerged:
• Height scanner mapping: This approach, evaluated in works like [42], involves using height maps generated by scanners to inform locomotion strategies. These maps provide detailed topographical data, allowing robots to plan steps on uneven or obstructed surfaces more effectively.
• Direct vision inputs: Directly utilizing inputs from cameras, such as depth or RGB images, for real-time decision-making in RL policies [46], [63]. Although previous studies like [46] have integrated visual navigation by feeding visual information to a High-Level (HL) planner, the potential of direct visual inputs to RL policies has not been fully explored.

Enhancing the capability of bipedal robots to directly interpret and utilize visual data without intermediary processing can revolutionize their adaptability and efficiency in real-world scenarios. The exploration of direct vision inputs to reinforcement learning policies represents a significant opportunity for advancing the field, potentially enabling more dynamic and responsive locomotion strategies.
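A height-scanner observation of the first kind might be assembled as in the sketch below. The `height_scanner` object and its `sample_grid` method are hypothetical stand-ins for whatever terrain sensor a given framework provides, and the grid size, spacing, and clipping range are assumed values.

```python
import numpy as np

def build_observation(proprio, height_scanner):
    """Concatenate proprioception with a local terrain height map (sketch).

    `proprio` is a dict of joint/base measurements; the scanner returns a
    grid of terrain heights around the robot, expressed in the base frame."""
    heights = height_scanner.sample_grid(shape=(11, 11), spacing=0.1)
    # Express heights relative to the base and bound them for the policy.
    heights = np.clip(heights - proprio["base_height"], -1.0, 1.0)
    return np.concatenate([proprio["joint_pos"],
                           proprio["joint_vel"],
                           heights.ravel()])
```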
3) Bridge the gap from simulation to reality: While simulations offer a safe and cost-effective environment for developing robotics policies, the transition from simulation to real-world application often encounters significant challenges due to the approximations and simplifications made in simulations. Numerous sim-to-real frameworks [34], [64], [65] have shown high efficiency and performance, as detailed in Appendix B. Despite these advancements, a significant gap persists, exacerbated by the complexity and unpredictability of physical environments. Moreover, many studies [26], [66], [21] remain validated only in simulation settings.

4) Loco-manipulation tasks: Loco-manipulation, which combines locomotion and manipulation, presents opportunities for humanoid robots to excel beyond purely bipedal capabilities. Few studies have addressed this integrated task; one such study [67] demonstrated a 'box transportation' framework. This framework decomposes the task into five distinct policies, each addressing different aspects of the transportation process. However, this approach lacks efficiency and does not incorporate vision-based information, suggesting substantial room for improvement. Moreover, the challenges of managing mobile tools like scooters [68] or dynamically interacting with objects such as balls [69] introduce further complexities.

Decomposing loco-manipulation tasks into multiple layers could simplify the challenges, allowing for more precise and flexible control by manually tuning individual components of the task [43]. This structured control approach provides a more coordinated response to complex interactions within the robot's environment, facilitating the execution of task-specific commands.

Alternatively, an end-to-end framework may enable bipeds to perform a variety of tasks through task randomization and structured curriculum learning methods, progressively teaching the policy [35], [27], [22]. During training, such policies can also learn human-like movements from motion capture data [70], [18], [16], offering promising solutions for future integrated loco-manipulation tasks within a single, versatile policy.

5) Designing reward functions: The development of effective reward functions is a critical challenge in the field of deep reinforcement learning (DRL) for bipedal robots. While periodic reward functions have been designed to facilitate cyclic movements like walking [19], there remains a significant gap in crafting reward functions for non-periodic actions such as jumping. These actions require distinct considerations for success and efficiency, yet current research lacks comprehensive methods for their reward structure. Furthermore, minimizing the need for extensive manual tuning while achieving high performance in DRL systems continues to be a substantial challenge, pointing to the need for more adaptive and automatically adjusting reward mechanisms.

6) Integrating large language models: The integration of Large Language Models (LLMs) into bipedal robotics opens new avenues for contextual understanding and task execution, significantly enhancing the robots' interaction capabilities. LLMs, when implemented at the highest task level, offer substantial promise for improving human-robot interaction, making these systems more intuitive and responsive [71]. The potential applications of this technology are broad and impactful, spanning sectors such as industrial automation, where robots can perform complex assembly tasks; healthcare, offering assistance in patient care and rehabilitation; assistive devices, providing support for individuals with disabilities; search and rescue operations, where robust and adaptive decision-making is critical; and entertainment and education, where interactive and engaging experiences are key [72]. Each of these fields could benefit from the advanced capabilities of LLM-enhanced bipedal robots, particularly in environments requiring nuanced understanding and adaptability.

D. Applications in various fields

The advancements in bipedal locomotion technology hold significant promise for practical applications beyond the confines of laboratory environments. These robots, bolstered by AI, are poised to transform numerous sectors by enhancing operational capabilities and interaction with humans. The potential for humanoid robots in various fields is detailed in [72], emphasizing the integration of learning-based approaches for more effective implementation. Key areas include:

1) Industrial automation and manufacturing: The integration of humanoid robots in industrial settings can significantly enhance productivity and efficiency, freeing workers from repetitive and labor-intensive tasks. These
robots, equipped with advanced loco-manipulation capabilities and the ability to cooperate with human teams, are particularly effective in assembly line operations, maintenance tasks, and the construction of complex machinery [73], [74]. Their articulated arms and floating bases provide unmatched flexibility, making them ideal for human-centric manufacturing environments. The humanoid robot Digit, for example, demonstrates remarkable stability and efficiency in industrial tasks over extended periods, as seen in video demonstrations [75]. Moreover, these robots are also suited for operation in high-risk environments such as underwater or areas with high radiation levels, significantly enhancing safety and operational capacity in these contexts.

2) Healthcare and assistive devices: In the healthcare sector, bipedal and humanoid robots contribute significantly to rehabilitation and assistive technologies. Exoskeletons enhanced with DRL methodologies are being used to train individuals to achieve more natural gait patterns, improving mobility and rehabilitation outcomes [76]. Beyond mere mobility aids, humanoid robots integrated with LLMs show promise in delivering medications, monitoring patient health, and assisting in surgeries [77]. The synergy between LLMs and loco-manipulation capabilities paves the way for more interactive, responsive support, aligning closely with the needs of personalized care. Additionally, the aging population can benefit from humanoid robots performing everyday tasks like house cleaning or delivery through simple voice commands, thereby enhancing the quality of life.

3) Search and rescue missions: Humanoid robots are exceptionally valuable in search and rescue operations, especially in disaster-stricken or hazardous environments where human presence is risky or impractical. Unlike traditional wheeled robots, humanoid robots can navigate complex terrains filled with debris, gaps, and elevated structures, making them indispensable in these scenarios. They also demonstrate potential for significant interaction and collaboration with human rescue teams. For instance, in environments with high nuclear radiation, humanoid robots can perform tasks that would be perilous for humans, handling delicate instruments and preventing human exposure to harmful conditions. This capability extends to other challenging environments such as underwater [78], outer space [79], [80], and other hazardous areas. However, the full realization of these applications remains constrained by the absence of a unified framework that can seamlessly navigate all terrains and fully integrate loco-manipulation and human interaction functionalities.

4) Entertainment and education: Humanoid robots have the potential to transform the realms of entertainment and education by providing highly interactive experiences. With their ability to integrate extensive knowledge bases, these robots can significantly enhance educational environments. They can assume the roles of butlers, teachers, or even babysitters, engaging with users in diverse activities. For example, robots can facilitate language learning [81], participate in storytelling, teach various academic subjects, or engage in the performing arts and games. In the sphere of entertainment, humanoid robots can act, dance, play ball games [69], and take part in interactive performances, captivating audiences of all ages with their versatility and dynamic capabilities.

However, this in turn leads to a variety of ethical issues. First, interacting with humans involves collecting data on humans' daily behavior, increasing the risk of a data breach. Second, another concern is the increasing dependency of humans on robots, not just for assistance but also for emotional support. This could result in less human-to-human interaction and ultimately affect social constructs and emotional development. Third, advancements in humanoid robots may replace humans in various jobs and eventually lead to unemployment issues.

On the positive side, humanoid robots can provide invaluable assistance to people with disabilities or the elderly, offering companionship and reducing the care burden on families and healthcare systems. Furthermore, their application across diverse fields such as education, industry, and healthcare can bring about revolutionary changes, improving efficiency and safety while opening up new possibilities for technological integration. As we navigate these advancements, it is crucial to balance innovation with ethical considerations to ensure that the deployment of humanoid robots enhances societal well-being without compromising personal integrity or social dynamics.

V. CONCLUSION

Despite significant advances in DRL for robotics, a considerable gap persists between current research achievements and the development of a unified framework capable of empowering robots to perform a broad spectrum of complex tasks efficiently. Presently, DRL research can be categorized into two primary control schemes: end-to-end and hierarchical frameworks. End-to-end frameworks have shown promising capabilities in executing diverse locomotion skills [22], climbing stairs [38], and navigating challenging terrains such as stepping stones [20]. Conversely, hybrid frameworks, which often integrate an HL planner or an LL model-based controller, offer enhanced capabilities, allowing for simultaneous management of locomotion and navigation tasks.

To bridge the existing gaps, further development of hierarchical frameworks, particularly those equipped with advanced perception systems and integrated with model-based planners, appears promising. Such frameworks could simultaneously address issues of precision and generalization. Moreover, the advent of LLMs presents a transformative opportunity, potentially enabling the unification of language processing and visual functionalities within robotic systems. While numerous challenges remain—ranging from the technical intricacies of framework integration to real-world application—the steady progression in control framework refinement and DRL development provides a hopeful outlook.
The vision of achieving a unified framework, capable of empowering robots to perform a wide range of complex tasks, may soon move within reach.

APPENDIX A

The advancement and development of RL is crucial for bipedal locomotion. Specifically, advancements in deep learning provide deep neural networks serving as function approximators, empowering RL with the capability to handle tasks characterized by high-dimensional and continuous spaces by efficiently discovering condensed, low-dimensional representations of complex data. In comparison to robots of other morphologies, such as wheeled robots, bipedal robots feature much higher DoFs and continuously interact with their environments, which results in higher requirements for the DRL algorithms. In the legged locomotion field especially, policy gradient-based algorithms are prevalent for bipedal locomotion.

Designing an effective neural network architecture is essential for tackling complex bipedal locomotion tasks. Multi-layer perceptrons (MLPs), a fundamental neural network, excel in straightforward regression tasks with lower computational resource demands. A comprehensive comparison between MLPs and the memory-based neural network, Long Short-Term Memory (LSTM), reveals that MLPs have an advantage in convergence speed [65]. However, LSTM, as a variant of Recurrent Neural Networks (RNNs), is adept at processing time-associated data, effectively relating different states across time and modeling key physical properties vital for periodical gaits [19] and successful sim-to-real transfer in bipedal locomotion. Additionally, Convolutional Neural Networks (CNNs) specialize in spatial data processing, particularly for image-related tasks, making them highly suitable for environments where visual perception is crucial. This diverse range of neural network architectures highlights the importance of selecting the appropriate model based on the specific requirements of the bipedal locomotion task.

Considering DRL algorithms, recent bipedal locomotion studies focus on model-free reinforcement learning algorithms. Unlike model-based RL, which learns a model of the environment but may inherit biases from simulations that do not accurately reflect real-world conditions, model-free RL directly trains policies through environmental interaction without relying on an explicit environmental model. Although model-free RL requires more samples and computational resources, it can train a more robust policy, allowing robots to traverse challenging environments.

Fig. 4: Diagram for RL algorithms catalogue (model-free RL algorithms, including Policy Gradient, A2C/A3C, TRPO, and DDPG).

Many sophisticated model-free RL algorithms exist, which can be broadly classified into two categories: policy-based (or policy optimization) and value-based approaches. Value-based methods, e.g., Q-learning and Deep Q-learning (DQN) [82], excel only in discrete action spaces and often struggle with high-dimensional action spaces. In contrast, policy-based methods, such as policy gradient, can handle complex tasks but are generally less sample-efficient than value-based methods.

More advanced algorithms combine both policy-based and value-based methods. Actor-critic (AC) refers to the main idea of simultaneously learning both a policy (actor) and a value function (critic), combining the advantages of both approaches [83], [84]. Popular policy-based algorithms, e.g., Trust Region Policy Optimization (TRPO) [85] and PPO, borrow ideas from AC. Moreover, there are other novel algorithms based on the AC framework: Deep Deterministic Policy Gradient (DDPG) [86], Twin Delayed Deep Deterministic Policy Gradient (TD3) [87], Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C) [88], and Soft Actor-Critic (SAC) [89]. Each algorithm has its strengths for different tasks in the bipedal locomotion scenario. Several key factors can be used to evaluate these algorithms: sample efficiency, robustness and generalization, and implementation challenges. A comparative analysis [90] illustrates that SAC-based algorithms excel in stability and achieve the highest scores, while their training efficiency significantly trails behind that of PPO, which obtains relatively high scores.

In [91], PPO demonstrates robustness and computational economy in complex scenarios, such as bipedal locomotion, utilizing fewer resources than TRPO. In terms of training time, PPO is much faster than the SAC and DDPG algorithms [90]. Moreover, many works [19], [45], [36] have demonstrated its robustness and ease of implementation, which, combined with its flexibility to integrate with various neural network architectures, have made PPO the most popular choice in this field. Various works have demonstrated that PPO can learn walking [19], jumping [37], stair climbing [38], and stepping-stone traversal [20], demonstrating its efficiency, robustness, and generalization.

Additionally, the DDPG algorithm integrates the actor-critic framework with DQN to facilitate off-policy training, further optimizing sampling efficiency. In some specific scenarios, such as jumping, DDPG shows higher reward and better learning performance than PPO [21], [92]. TD3 was developed based on DDPG and improves over the performance of both DDPG and SAC [89]. SAC furthers the agent's exploration capabilities and sample efficiency [89]. While A2C offers improved efficiency and stability compared to A3C, the asynchronous update mechanism of A3C provides better capabilities for exploration and accelerated learning. Although these algorithms show their respective advances, they are more challenging to apply than PPO due to their complexity.
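For reference, PPO optimizes the standard clipped surrogate objective below, which bounds how far the updated policy can move from the data-collecting policy in a single update:

\[
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\]

where \(\hat{A}_t\) is an estimate of the advantage at timestep \(t\) and \(\epsilon\) is the clipping parameter.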
[29] W. Zhu and M. Hayashibe, “A hierarchical deep reinforcement learning [52] G. Feng, H. Zhang, Z. Li, X. B. Peng, B. Basireddy, L. Yue, Z. SONG,
framework with high efficiency and generalization for fast and safe L. Yang, Y. Liu, K. Sreenath, and S. Levine, “Genloco: Generalized
navigation,” IEEE Transactions on Industrial Electronics, vol. 70, pp. locomotion controllers for quadrupedal robots,” in Conference on Robot
4962–4971, 2023. Learning, vol. 205, 2023, pp. 1893–1903.
[30] X. B. Peng and M. Van De Panne, “Learning locomotion skills us- [53] Y. Fuchioka, Z. Xie, and M. Van de Panne, “OPT-Mimic: Imitation
ing deeprl: Does the choice of action space matter?” in ACM SIG- of optimized trajectories for dynamic quadruped behaviors,” in IEEE
GRAPH/Eurographics Symposium on Computer Animation, 2017, pp. International Conference on Robotics and Automation, 2023, pp. 5092–
1–13. 5098.
[31] Z. Li, X. Cheng, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, [54] S. Gangapurwala, M. Geisert, R. Orsolino, M. Fallon, and I. Havoutis,
and K. Sreenath, “Reinforcement learning for robust parameterized “RLOC: Terrain-aware legged locomotion using reinforcement learning
locomotion control of bipedal robots,” in IEEE International Conference and optimal control,” IEEE Transactions on Robotics, vol. 38, pp. 2908–
on Robotics and Automation, 2021, pp. 2811–2817. 2927, 2022.
[32] D. Kim, G. Berseth, M. Schwartz, and J. Park, “Torque-based deep [55] D. Kang, J. Cheng, M. Zamora, F. Zargarbashi, and S. Coros, “RL
reinforcement learning for task-and-robot agnostic learning on bipedal + Model-Based Control: Using on-demand optimal control to learn
robots using sim-to-real transfer,” IEEE Robotics and Automation Let- versatile legged locomotion,” IEEE Robotics and Automation Letters,
ters, vol. 8, p. 6251–6258, 2023. vol. 8, pp. 6619–6626, 2023.
[33] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne, “Feedback [56] F. Jenelten, R. Grandia, F. Farshidian, and M. Hutter, “TAMOLS:
control for cassie with deep reinforcement learning,” in IEEE/RSJ Terrain-aware motion optimization for legged systems,” IEEE Trans-
International Conference on Intelligent Robots and Systems, 2018, pp. actions on Robotics, vol. 38, pp. 3395–3413, 2022.
1241–1246. [57] X. B. Peng, E. Coumans, T. Zhang, T.-W. E. Lee, J. Tan, and S. Levine,
[34] Z. Xie, P. Clary, J. Dao, P. Morais, J. Hurst, and M. van de Panne, “Learning locomotion skills for Cassie: Iterative design and sim-to-real,” in Conference on Robot Learning, 2020, pp. 317–329.
[35] D. Rodriguez and S. Behnke, “DeepWalk: Omnidirectional bipedal gait by deep reinforcement learning,” in IEEE International Conference on Robotics and Automation, 2021, pp. 3033–3039.
[36] J. Siekmann, S. Valluri, J. Dao, L. Bermillo, H. Duan, A. Fern, and J. W. Hurst, “Learning memory-based control for human-scale bipedal locomotion,” in Robotics: Science and Systems, 2020.
[37] Z. Li, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, “Robust and versatile bipedal jumping control through multi-task reinforcement learning,” in Robotics: Science and Systems, 2023.
[38] J. Siekmann, K. Green, J. Warila, A. Fern, and J. Hurst, “Blind bipedal stair traversal via sim-to-real reinforcement learning,” in Robotics: Science and Systems, 2021.
[39] C. Yang, K. Yuan, W. Merkt, T. Komura, S. Vijayakumar, and Z. Li, “Learning whole-body motor skills for humanoids,” in IEEE-RAS International Conference on Humanoid Robots, 2019, pp. 270–276.
[40] Z. Xie, H. Ling, N. Kim, and M. van de Panne, “ALLSTEPS: Curriculum-driven learning of stepping stone skills,” Computer Graphics Forum, vol. 39, pp. 213–224, 2020.
[41] H. Duan, A. Malik, J. Dao, A. Saxena, K. Green, J. Siekmann, A. Fern, and J. Hurst, “Sim-to-real learning of footstep-constrained bipedal dynamic walking,” in International Conference on Robotics and Automation, 2022, pp. 10428–10434.
[42] B. van Marum, M. Sabatelli, and H. Kasaei, “Learning vision-based bipedal locomotion for challenging terrain,” arXiv preprint arXiv:2309.14594, 2023.
[43] G. A. Castillo, B. Weng, S. Yang, W. Zhang, and A. Hereid, “Template model inspired task space learning for robust bipedal locomotion,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023, pp. 8582–8589.
[44] C. Gaspard, G. Passault, M. Daniel, and O. Ly, “FootstepNet: An efficient actor-critic method for fast on-line bipedal footstep planning and forecasting,” arXiv preprint arXiv:2403.12589, 2024.
[45] K. Green, Y. Godse, J. Dao, R. L. Hatton, A. Fern, and J. Hurst, “Learning spring mass locomotion: Guiding policies with a reduced-order model,” IEEE Robotics and Automation Letters, vol. 6, pp. 3926–3932, 2021.
[46] K. Lobos-Tsunekawa, F. Leiva, and J. Ruiz-del-Solar, “Visual navigation for biped humanoid robots using deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3247–3254, 2018.
[47] J. Li, L. Ye, Y. Cheng, H. Liu, and B. Liang, “Agile and versatile bipedal robot tracking control through reinforcement learning,” arXiv preprint arXiv:2404.08246, 2024.
[48] F. Jenelten, J. He, F. Farshidian, and M. Hutter, “DTC: Deep tracking control,” Science Robotics, vol. 9, p. eadh5401, 2024.
[49] L. Smith, I. Kostrikov, and S. Levine, “Demonstrating a walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning,” Robotics: Science and Systems Demo, vol. 2, p. 4, 2023.
[50] P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, “DayDreamer: World models for physical robot learning,” in Conference on Robot Learning, 2023, pp. 2226–2240.
[51] S. Choi, G. Ji, J. Park, H. Kim, J. Mun, J. H. Lee, and J. Hwangbo, “Learning quadrupedal locomotion on deformable terrain,” Science Robotics, vol. 8, p. eade2256, 2023.
[57] X. B. Peng, E. Coumans, T. Zhang, T.-W. Lee, J. Tan, and S. Levine, “Learning agile robotic locomotion skills by imitating animals,” in Robotics: Science and Systems, 2020.
[58] F. Yin, A. Tang, L. Xu, Y. Cao, Y. Zheng, Z. Zhang, and X. Chen, “Run like a dog: Learning based whole-body control framework for quadruped gait style transfer,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 8508–8514.
[59] Y. Ma, F. Farshidian, T. Miki, J. Lee, and M. Hutter, “Combining learning-based locomotion policy with model-based manipulation for legged mobile manipulators,” IEEE Robotics and Automation Letters, vol. 7, pp. 2377–2384, 2022.
[60] Z. Fu, X. Cheng, and D. Pathak, “Deep whole-body control: Learning a unified policy for manipulation and locomotion,” in Conference on Robot Learning, 2023, pp. 138–149.
[61] P. Arm, M. Mittal, H. Kolvenbach, and M. Hutter, “Pedipulate: Enabling manipulation skills using a quadruped robot’s leg,” in IEEE International Conference on Robotics and Automation, 2024.
[62] I. Radosavovic, T. Xiao, B. Zhang, T. Darrell, J. Malik, and K. Sreenath, “Learning humanoid locomotion with transformers,” arXiv preprint arXiv:2303.03381, 2023.
[63] A. Byravan, J. Humplik, L. Hasenclever, A. Brussee, F. Nori, T. Haarnoja, B. Moran, S. Bohez, F. Sadeghi, B. Vujatovic et al., “NeRF2Real: Sim2real transfer of vision-guided bipedal motion skills using neural radiance fields,” in IEEE International Conference on Robotics and Automation, 2023, pp. 9362–9369.
[64] A. Kumar, Z. Li, J. Zeng, D. Pathak, K. Sreenath, and J. Malik, “Adapting rapid motor adaptation for bipedal robots,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2022, pp. 1161–1168.
[65] R. P. Singh, Z. Xie, P. Gergondet, and F. Kanehiro, “Learning bipedal walking for humanoids with current feedback,” IEEE Access, vol. 11, pp. 82013–82023, 2023.
[66] B. van Marum, M. Sabatelli, and H. Kasaei, “Learning perceptive bipedal locomotion over irregular terrain,” arXiv preprint arXiv:2304.07236, 2023.
[67] J. Dao, H. Duan, and A. Fern, “Sim-to-real learning for humanoid box loco-manipulation,” arXiv preprint arXiv:2310.03191, 2023.
[68] J. Baltes, G. Christmann, and S. Saeedvand, “A deep reinforcement learning algorithm to control a two-wheeled scooter with a humanoid robot,” Engineering Applications of Artificial Intelligence, vol. 126, p. 106941, 2023.
[69] T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, J. Humplik, M. Wulfmeier, S. Tunyasuvunakool, N. Y. Siegel, R. Hafner et al., “Learning agile soccer skills for a bipedal robot with deep reinforcement learning,” Science Robotics, vol. 9, p. eadi8022, 2024.
[70] M. Seo, S. Han, K. Sim, S. H. Bang, C. Gonzalez, L. Sentis, and Y. Zhu, “Deep imitation learning for humanoid loco-manipulation through human teleoperation,” in IEEE-RAS International Conference on Humanoid Robots, 2023, pp. 1–8.
[71] K. N. Kumar, I. Essa, and S. Ha, “Words into action: Learning diverse humanoid robot behaviors using language guided iterative motion refinement,” in Workshop on Language and Robot Learning: Language as Grounding, 2023.
[72] Y. Tong, H. Liu, and Z. Zhang, “Advancements in humanoid robots: A comprehensive review and future prospects,” IEEE/CAA Journal of Automatica Sinica, vol. 11, pp. 301–328, 2024.
[73] A. Dzedzickis, J. Subačiūtė-Žemaitienė, E. Šutinys, U. Samukaitė-Bubnienė, and V. Bučinskas, “Advanced applications of industrial robotics: New trends and possibilities,” Applied Sciences, vol. 12, p. 135, 2021.
[74] M. Yang, E. Yang, R. C. Zante, M. Post, and X. Liu, “Collaborative mobile industrial manipulator: A review of system architecture and applications,” in International Conference on Automation and Computing, 2019, pp. 1–6.
[75] “6+ Hours Live Autonomous Robot Demo,” https://www.youtube.com/
watch?v=Ke468Mv8ldM, Mar. 2024.
[76] G. Bingjing, H. Jianhai, L. Xiangpan, and Y. Lin, “Human–robot
interactive control based on reinforcement learning for gait rehabilitation
training robot,” International Journal of Advanced Robotic Systems,
vol. 16, p. 1729881419839584, 2019.
[77] A. Diodato, M. Brancadoro, G. De Rossi, H. Abidi, D. Dall’Alba,
R. Muradore, G. Ciuti, P. Fiorini, A. Menciassi, and M. Cianchetti,
“Soft robotic manipulator for improving dexterity in minimally invasive
surgery,” Surgical Innovation, vol. 25, pp. 69–76, 2018.
[78] R. Bogue, “Underwater robots: a review of technologies and applica-
tions,” Industrial Robot: An International Journal, vol. 42, pp. 186–191,
2015.
[79] N. Rudin, H. Kolvenbach, V. Tsounis, and M. Hutter, “Cat-like jumping
and landing of legged robots in low gravity using deep reinforcement
learning,” IEEE Transactions on Robotics, vol. 38, pp. 317–328, 2022.
[80] J. Qi, H. Gao, H. Su, L. Han, B. Su, M. Huo, H. Yu, and Z. Deng,
“Reinforcement learning-based stable jump control method for asteroid-
exploration quadruped robots,” Aerospace Science and Technology, vol.
142, p. 108689, 2023.
[81] O. Mubin, C. Bartneck, L. Feijs, H. Hooft van Huysduynen, J. Hu, and
J. Muelver, “Improving speech recognition with the robot interaction
language,” Disruptive Science and Technology, vol. 1, pp. 79–88, 2012.
[82] A. Meduri, M. Khadiv, and L. Righetti, “DeepQ stepper: A framework
for reactive dynamic walking on uneven terrain,” in IEEE International
Conference on Robotics and Automation, 2021, pp. 2099–2105.
[83] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforcement
learning,” in International Conference on Learning Representations,
2016.
[84] L. Liu, M. van de Panne, and K. Yin, “Guided learning of control graphs
for physics-based characters,” ACM Transactions on Graphics, vol. 35,
pp. 1–14, 2016.
[85] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust
region policy optimization,” in International Conference on Machine
Learning, 2015, pp. 1889–1897.
[86] C. Huang, G. Wang, Z. Zhou, R. Zhang, and L. Lin, “Reward-
adaptive reinforcement learning: Dynamic policy gradient optimization
for bipedal locomotion,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 45, pp. 7686–7695, 2023.
[87] S. Dankwa and W. Zheng, “Twin-delayed DDPG: A deep reinforcement
learning technique to model a continuous movement of an intelligent
robot agent,” in International Conference on Vision, Image and Signal Processing, 2019, pp. 1–5.
[88] J. Leng, S. Fan, J. Tang, H. Mou, J. Xue, and Q. Li, “M-A3C: A mean-
asynchronous advantage actor-critic reinforcement learning method for
real-time gait planning of biped robot,” IEEE Access, vol. 10, pp.
76523–76536, 2022.
[89] C. Yu and A. Rosendo, “Multi-modal legged locomotion framework
with automated residual reinforcement learning,” IEEE Robotics and
Automation Letters, vol. 7, pp. 10312–10319, 2022.
[90] O. Aydogmus and M. Yilmaz, “Comparative analysis of reinforcement
learning algorithms for bipedal robot locomotion,” IEEE Access, vol. 11, pp.
7490–7499, 2023.
[91] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[92] C. Tao, J. Xue, Z. Zhang, and Z. Gao, “Parallel deep reinforcement
learning method for gait control of biped robot,” IEEE Transactions on
Circuits and Systems II: Express Briefs, vol. 69, pp. 2802–2806, 2022.
[93] W. Yu, V. C. V. Kumar, G. Turk, and C. K. Liu, “Sim-to-real transfer for
biped locomotion,” in IEEE/RSJ International Conference on Intelligent
Robots and Systems, 2019, pp. 3503–3510.
[94] S. Masuda and K. Takahashi, “Sim-to-real transfer of compliant bipedal
locomotion on torque sensor-less gear-driven humanoid,” in IEEE-RAS
International Conference on Humanoid Robots, 2023, pp. 1–8.
[95] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun,
and M. Hutter, “Learning agile and dynamic motor skills for legged
robots,” Science Robotics, vol. 4, p. eaau5872, 2019.
[96] G. A. Castillo, B. Weng, W. Zhang, and A. Hereid, “Robust feedback motion policy design using reinforcement learning on a 3D Digit bipedal robot,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 5136–5143.