CN118818968A - A quadruped robot motion control method based on deep reinforcement learning - Google Patents
A quadruped robot motion control method based on deep reinforcement learning
- Publication number
- CN118818968A (application number CN202410663847.1A)
- Authority
- CN
- China
- Prior art keywords
- quadruped robot
- joint
- robot
- strategy
- motion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
Description
Technical Field
The present invention belongs to the technical field of robot control and relates to a quadruped robot motion control method based on deep reinforcement learning.
Background Art
With the continuous development of robot control technology, quadruped robots are being applied ever more widely. Compared with wheeled and tracked robots, quadruped robots have more degrees of freedom and use discrete footholds, which gives them a great advantage on complex terrain; they can be widely used in search and rescue, reconnaissance, industrial inspection, and exploration of unknown environments.
However, the many degrees of freedom of a quadruped robot also make motion control very challenging. In recent years, many model-based methods have been applied to quadruped motion control, but such controllers must be carefully designed for each scenario and still struggle to avoid corner cases. In contrast, reinforcement learning can learn a motion controller autonomously through trial and error, and the learned controller can perform well across a variety of scenarios. Such controllers usually have to be trained in a simulator first and then deployed on a real quadruped robot, but because most simulators cannot fully reproduce the complexity of the real environment, they often suffer a considerable performance loss during sim-to-real transfer.
Summary of the Invention
The present invention provides a quadruped robot motion control method based on deep reinforcement learning, which can automatically learn a motion policy in simulation, reduce the gap between simulation and reality, and achieve robust locomotion of the quadruped robot.
The technical solution adopted by the present invention is as follows.
A quadruped robot motion control method based on deep reinforcement learning comprises the following steps:
S1. Establish a model of the quadruped robot, including a dynamics model used to simulate the quadruped robot and an actuator model used to identify and simulate the robot's motor drives; the actuator model is an Empirical Actuator Model (EAM).
S2. Describe the motion process of the quadruped robot as a Markov decision process, design a reward function, and, in the simulation environment built in S1, optimize the robot's motion policy with the Multi-Loss Proximal Policy Optimization (MLPPO) deep reinforcement learning algorithm to train a motion controller.
S3. Deploy the trained motion controller on the quadruped robot.
Further, step S1 specifically includes the following steps:
S11. Establish the dynamics model of the quadruped robot, including the base mass and inertia tensor, the mass and inertia tensor of each joint link, the mounting position and limits of each joint, and the collision model of each joint.
S12. Establish the actuator model of the quadruped robot. The empirical actuator model can be expressed as

$$\tau^{des}_{t} = K_p\big(q^{des}_{t-t_{in}} - q_t\big) - K_d\,\dot{q}_t,\qquad \tau_{t} = \operatorname{clip}\big(\tau^{des}_{t-t_{out}},\,-\tau_m(\dot{q}_t),\,\tau_m(\dot{q}_t)\big)$$

where $q_t$ and $\dot{q}_t$ are the joint position and velocity at time $t$; $t_{in}$ is the input delay of the joint; $q^{des}_{t-t_{in}}$ is the desired joint position issued at time $t-t_{in}$; $K_p$ and $K_d$ are the proportional and derivative gains; $\tau^{des}$ is the desired joint output torque; $t_{out}$ is the output delay of the actuator torque; and $\tau_m$ is the external characteristic curve of the motor, i.e. the curve of the maximum torque the motor can output as a function of motor speed.
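For illustration, a minimal Python sketch of an empirical actuator model of this kind is given below; the class name, the delay-buffer implementation and the piecewise-linear torque-speed curve are assumptions made for the example, not details taken from the patent.

```python
from collections import deque

import numpy as np


class EmpiricalActuatorModel:
    """PD actuator with input/output delays and a torque-speed limit (illustrative sketch)."""

    def __init__(self, kp, kd, t_in_steps, t_out_steps, speed_pts, torque_pts):
        self.kp, self.kd = kp, kd
        # Delay lines: desired positions arrive t_in steps late, torques leave t_out steps late.
        self.q_des_buf = deque(maxlen=t_in_steps + 1)
        self.tau_buf = deque(maxlen=t_out_steps + 1)
        # Sampled torque-speed ("external characteristic") curve, interpolated linearly.
        self.speed_pts = np.asarray(speed_pts)
        self.torque_pts = np.asarray(torque_pts)

    def step(self, q_des, q, dq):
        """Map a desired joint position to an output torque for one simulation step."""
        self.q_des_buf.append(q_des)
        q_des_delayed = self.q_des_buf[0]              # desired position from t - t_in
        tau_des = self.kp * (q_des_delayed - q) - self.kd * dq
        self.tau_buf.append(tau_des)
        tau = self.tau_buf[0]                          # desired torque from t - t_out
        tau_max = np.interp(abs(dq), self.speed_pts, self.torque_pts)
        return float(np.clip(tau, -tau_max, tau_max))  # saturate by the motor curve
```

In training, one such model would be instantiated per joint and called once per simulation step with the policy's desired joint angle and the simulated joint state.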
Further, the model produced in step S11 is described by a Unified Robot Description Format (URDF) file, and the robot model is simulated with Multi-Joint dynamics with Contact (MuJoCo).
Further, step S2 specifically includes the following steps:
S21. Describe the motion process of the quadruped robot as a Markov decision process (MDP), consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a state-transition function $\mathcal{P}$, and a reward function $\mathcal{R}$. At time $t$, the parameterized policy $\pi_\theta$ produces an action $a_t$ from the state history; the environment updates the state $s_{t+1}$ according to the transition function and computes the reward $r_t$. The objective of the MDP is to maximize the expected discounted return $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, where $\mathbb{E}$ denotes the mathematical expectation and $\gamma$ is the reward discount factor.
S22. Collect training data in the simulation environment. At each environment step: read the current environment state, let the policy output the action for the current frame, convert the action into joint torques through the empirical actuator model, run the simulation to obtain the next-frame state, compute the reward from the two consecutive states and the action, and store each state in a buffer.
S23. After a sufficient number of states have been collected, update the policy with Multi-Loss Proximal Policy Optimization (MLPPO). A parameterized policy can be expressed as the conditional probability of an action given a state, $p_\theta(a_t \mid s_t)$, where $\theta$ are the policy parameters. The optimization objective of MLPPO is

$$\min\; L_{ppo} + w_{symmetry} L_{symmetry} + w_{smooth} L_{smooth}$$

where $L_{ppo}$ is the standard PPO loss function,

$$L_{ppo} = -\mathbb{E}_t\Big[\min\big(r_t(\theta) A_t,\ \operatorname{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\, A_t\big)\Big],\qquad r_t(\theta) = \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)},$$

$A_t$ is the advantage at time $t$, $\theta'$ are the parameters of the policy used to collect the data, and $\varepsilon$ is the clipping ratio. $L_{symmetry}$ and $L_{smooth}$ are objective functions designed specifically for the quadruped robot, namely a symmetry loss and a smoothness loss, and $w_{symmetry}$ and $w_{smooth}$ are their weights. These two objectives are defined in terms of the symmetric mappings of states and actions, denoted $M_s$ and $M_a$ respectively.
Further, the state space $\mathcal{S}$ in step S21 includes the robot linear velocity commands $c_x, c_y$, the angular velocity command $c_r$, the 3-D base linear velocity $v$, the 3-D base angular velocity $\omega$, the 12-D joint angles $q$, the 12-D joint angular velocities $\dot{q}$, and the base roll angle $\psi_x$ and pitch angle $\psi_y$.
Further, the action space $\mathcal{A}$ in step S21 consists of the desired angles of the 12 joints.
Further, the reward function $r_t$ in step S21 is a weighted sum of a series of reward terms, including rewards for tracking the commanded velocity, penalties on power and joint actions, and rewards for base posture and stable motion.
Further, step S3 specifically includes the following steps:
S31. Obtain the state quantities required by the policy network from the on-board sensors of the quadruped robot: the robot angular velocity and attitude are obtained from the robot's inertial measurement unit, the joint angles and joint angular velocities from the joint encoders, and the robot linear velocity from a state estimator.
S32. Run the policy network in real time on the quadruped robot's motion controller at a fixed frequency to produce the actions, i.e. the desired joint positions, which are then sent to the joint motors to achieve whole-body control.
Beneficial effects of the present invention:
1. The motion policy can be learned automatically in simulation, the gap between simulation and reality is reduced, and robust locomotion of the quadruped robot is achieved.
2. The empirical actuator model (EAM) is used both to identify the actuators of the real robot and to train the motion policy, narrowing the sim-to-real gap.
3. Accurate tracking of commanded velocities is achieved through reinforcement learning. The multi-loss reinforcement learning framework optimizes the symmetry and smoothness of the policy while maximizing the reward, so the policy can drive the quadruped robot to run at up to 4.2 m/s and achieves a velocity tracking error below 0.07 m/s over a wide command range. The policy also exhibits excellent symmetry, smoothness and visual quality, and exceeds a model-based controller in energy efficiency.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the present invention.
Detailed Description
The present invention is further described below with reference to specific embodiments, but is not limited to these specific embodiments. Those skilled in the art will recognize that the present invention covers all alternatives, modifications and equivalents that may fall within the scope of the claims.
In the description of the present invention, it should be understood that terms indicating orientation or positional relationships, such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "clockwise" and "counterclockwise", are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as limiting the invention. In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features concerned; a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, unless otherwise specified, "a plurality of" means two or more.
In the present invention, unless otherwise expressly specified and limited, the terms "mounted", "connected", "coupled" and "fixed" should be understood broadly: the connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediate medium, or an internal communication between two elements. For those of ordinary skill in the art, the specific meanings of these terms in the present invention can be understood according to the specific circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are not in direct contact but contact each other through a further feature between them. Moreover, a first feature being "on", "above" or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely means that the first feature is at a higher level than the second feature; a first feature being "under", "below" or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely means that the first feature is at a lower level than the second feature.
Referring to FIG. 1, the present invention provides a quadruped robot motion control method based on deep reinforcement learning, with the following specific steps.
S1. Establish a model of the quadruped robot, including a dynamics model used to simulate the quadruped robot and an actuator model used to identify and simulate the robot's motor drives; the actuator model is an Empirical Actuator Model (EAM).
This step specifically includes the following:
S11. Establish the dynamics model of the quadruped robot, including the base mass and inertia tensor, the mass and inertia tensor of each joint link, the mounting position and limits of each joint, and the collision model of each joint. The resulting model is described by a Unified Robot Description Format (URDF) file, and the robot model is simulated with Multi-Joint dynamics with Contact (MuJoCo).
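A minimal sketch of loading and stepping such a model with the MuJoCo Python bindings is shown below; the file name is a placeholder, and in practice actuator and contact properties often need MJCF settings beyond what a plain URDF provides.

```python
import mujoco

# Load the quadruped description (file name is a placeholder; MuJoCo can parse URDF,
# although actuator and contact details are usually refined in MJCF).
model = mujoco.MjModel.from_xml_path("quadruped.urdf")
data = mujoco.MjData(model)

sim_dt = model.opt.timestep          # physics time step, e.g. 0.001 s
for _ in range(1000):
    data.ctrl[:] = 0.0               # joint torques would come from the actuator model
    mujoco.mj_step(model, data)      # advance the contact dynamics by one step

print("simulated", 1000 * sim_dt, "s; base position:", data.qpos[:3])
```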
S12. Establish the actuator model of the quadruped robot. The empirical actuator model can be expressed as

$$\tau^{des}_{t} = K_p\big(q^{des}_{t-t_{in}} - q_t\big) - K_d\,\dot{q}_t,\qquad \tau_{t} = \operatorname{clip}\big(\tau^{des}_{t-t_{out}},\,-\tau_m(\dot{q}_t),\,\tau_m(\dot{q}_t)\big)$$

where $q_t$ and $\dot{q}_t$ are the joint position and velocity at time $t$; $t_{in}$ is the input delay of the joint; $q^{des}_{t-t_{in}}$ is the desired joint position issued at time $t-t_{in}$; $K_p$ and $K_d$ are the proportional and derivative gains; $\tau^{des}$ is the desired joint output torque; $t_{out}$ is the output delay of the actuator torque; and $\tau_m$ is the external characteristic curve of the motor, i.e. the curve of the maximum torque the motor can output as a function of motor speed.
Joint-motor operating data of the quadruped robot are collected at 1000 Hz to build a data set. The delays $t_{in} \in [0, t_{max}]$ and $t_{out} \in [0, t_{max}]$ are swept, $K_p$ and $K_d$ are obtained by linear regression minimizing the mean squared error, and the parameter set with the smallest mean squared error is taken as the identified values of that joint motor.
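The identification described above can be implemented as a grid search over the two delays combined with a least-squares fit of the gains; the sketch below illustrates that procedure, with the array layout and function name chosen for the example.

```python
import numpy as np


def identify_actuator(q_des, q, dq, tau_meas, t_max):
    """Sweep t_in, t_out (in samples at 1000 Hz) and least-squares fit Kp, Kd."""
    n, best = len(q), None
    for t_in in range(t_max + 1):
        for t_out in range(t_max + 1):
            d = t_in + t_out
            # Model: tau[t] ~ Kp*(q_des[t-t_in-t_out] - q[t-t_out]) - Kd*dq[t-t_out]
            e = q_des[: n - d] - q[t_in: n - t_out]
            v = dq[t_in: n - t_out]
            X = np.column_stack([e, -v])
            y = tau_meas[d:]
            theta, *_ = np.linalg.lstsq(X, y, rcond=None)
            mse = float(np.mean((X @ theta - y) ** 2))
            if best is None or mse < best[0]:
                best = (mse, t_in, t_out, float(theta[0]), float(theta[1]))
    mse, t_in, t_out, kp, kd = best
    return {"kp": kp, "kd": kd, "t_in": t_in, "t_out": t_out, "mse": mse}
```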
S2. Describe the motion process of the quadruped robot as a Markov decision process, design a reward function, and, in the simulation environment built in S1, optimize the robot's motion policy with the multi-loss proximal policy optimization algorithm to train a motion controller.
This step specifically includes the following:
S21. Describe the motion process of the quadruped robot as a Markov decision process (MDP), consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a state-transition function $\mathcal{P}$, and a reward function $\mathcal{R}$. At time $t$, the parameterized policy $\pi_\theta$ produces an action $a_t$ from the state history; the environment updates the state $s_{t+1}$ according to the transition function and computes the reward $r_t$. The objective of the MDP is to maximize the expected discounted return $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, where $\mathbb{E}$ denotes the mathematical expectation and $\gamma$ is the reward discount factor. The state space $\mathcal{S}$ includes the robot linear velocity commands $c_x, c_y$, the angular velocity command $c_r$, the 3-D base linear velocity $v$, the 3-D base angular velocity $\omega$, the 12-D joint angles $q$, the 12-D joint angular velocities $\dot{q}$, and the base roll angle $\psi_x$ and pitch angle $\psi_y$. The action space $\mathcal{A}$ consists of the desired angles of the 12 joints.
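As an illustration, the 35-dimensional state described in S21 could be assembled as follows; the ordering of the quantities and the function name are assumptions made for the example.

```python
import numpy as np


def build_observation(cmd_vx, cmd_vy, cmd_yaw_rate, base_lin_vel, base_ang_vel,
                      joint_pos, joint_vel, roll, pitch):
    """Concatenate the state quantities of S21 into a single 35-D vector."""
    return np.concatenate([
        [cmd_vx, cmd_vy, cmd_yaw_rate],   # velocity commands c_x, c_y, c_r
        base_lin_vel,                     # 3-D base linear velocity v
        base_ang_vel,                     # 3-D base angular velocity w
        joint_pos,                        # 12 joint angles q
        joint_vel,                        # 12 joint angular velocities
        [roll, pitch],                    # base roll and pitch
    ]).astype(np.float32)
```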
The reward function $r_t$ of this embodiment is a weighted sum of a series of reward terms, including rewards for tracking the commanded velocity, penalties on power and joint actions, and rewards for base posture and stable motion. The reward terms are listed in Table 1, where $t_i$ is the swing time of the $i$-th leg.
Table 1. Reward function
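Table 1 itself is not reproduced above, so the sketch below only illustrates the general structure of such a weighted-sum reward, assuming the state ordering of the previous sketch; the individual terms, their functional forms and the weights are assumptions, not the values from the table.

```python
import numpy as np


def compute_reward(obs, prev_action, action, torques, joint_vel, weights):
    """Weighted sum of tracking, power and posture terms (illustrative structure only)."""
    cmd_vx, cmd_vy, cmd_yaw = obs[0], obs[1], obs[2]
    vx, vy, yaw_rate = obs[3], obs[4], obs[8]     # base velocities from the state vector
    roll, pitch = obs[-2], obs[-1]

    terms = {
        # Reward tracking of the commanded planar velocity and yaw rate.
        "lin_vel_tracking": np.exp(-((vx - cmd_vx) ** 2 + (vy - cmd_vy) ** 2)),
        "ang_vel_tracking": np.exp(-(yaw_rate - cmd_yaw) ** 2),
        # Penalise mechanical power and abrupt joint actions.
        "power": -np.sum(np.abs(torques * joint_vel)),
        "action_rate": -np.sum((action - prev_action) ** 2),
        # Reward a level base posture.
        "orientation": -(roll ** 2 + pitch ** 2),
    }
    return sum(weights[k] * v for k, v in terms.items())
```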
S22. Collect training data in the simulation environment. At each environment step: read the current environment state, let the policy output the action for the current frame, convert the action into joint torques through the empirical actuator model, run the simulation to obtain the next-frame state, compute the reward from the two consecutive states and the action, and store each state in a buffer.
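A minimal sketch of this data-collection loop is shown below; `env`, `policy` and the actuator objects are placeholders standing in for the MuJoCo simulation, the policy network and the per-joint empirical actuator models.

```python
import numpy as np


def collect_rollout(env, actuators, policy, reward_fn, reward_weights, horizon):
    """Roll the policy out in simulation for `horizon` steps and buffer the transitions."""
    buffer, obs = [], env.reset()
    prev_action = np.zeros(12)
    for _ in range(horizon):
        action, log_prob = policy.sample(obs)              # desired joint angles, log-prob
        torques = np.array([m.step(a, q, dq)               # EAM: action -> joint torque
                            for m, a, q, dq in zip(actuators, action,
                                                   env.joint_pos, env.joint_vel)])
        next_obs = env.step(torques)                       # advance the simulation one step
        reward = reward_fn(next_obs, prev_action, action,
                           torques, env.joint_vel, reward_weights)
        buffer.append((obs, action, log_prob, reward, next_obs))
        obs, prev_action = next_obs, action
    return buffer
```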
S23. After a sufficient number of states have been collected, update the policy with Multi-Loss Proximal Policy Optimization (MLPPO). A parameterized policy can be expressed as the conditional probability of an action given a state, $p_\theta(a_t \mid s_t)$, where $\theta$ are the policy parameters. The optimization objective of MLPPO is

$$\min\; L_{ppo} + w_{symmetry} L_{symmetry} + w_{smooth} L_{smooth}$$

where $L_{ppo}$ is the standard PPO loss function,

$$L_{ppo} = -\mathbb{E}_t\Big[\min\big(r_t(\theta) A_t,\ \operatorname{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\, A_t\big)\Big],\qquad r_t(\theta) = \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)},$$

$A_t$ is the advantage at time $t$, $\theta'$ are the parameters of the policy used to collect the data, and $\varepsilon$ is the clipping ratio. $L_{symmetry}$ and $L_{smooth}$ are objective functions designed specifically for the quadruped robot, namely a symmetry loss and a smoothness loss, and $w_{symmetry}$ and $w_{smooth}$ are their weights. These two objectives are defined in terms of the symmetric mappings of states and actions, denoted $M_s$ and $M_a$ respectively; the mapping relations are listed in Table 2.
Table 2. Symmetric mapping of states and actions
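Table 2 and the exact loss formulas are not reproduced above, so the sketch below only illustrates how the three terms could be combined. The clipped surrogate follows the standard PPO form, while the symmetry and smoothness terms use common formulations (mirror-consistency of actions and a penalty on consecutive action changes) that are assumptions rather than the patent's exact definitions.

```python
import torch


def mlppo_loss(log_prob_new, log_prob_old, advantages,
               actions, mirrored_actions, prev_actions,
               eps=0.2, w_symmetry=1.0, w_smooth=1.0):
    """Combine the clipped PPO surrogate with symmetry and smoothness terms."""
    # Standard clipped PPO surrogate (L_ppo).
    ratio = torch.exp(log_prob_new - log_prob_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    l_ppo = -surrogate.mean()
    # Symmetry: the action for a mirrored state, mapped back by M_a, should match the action.
    l_symmetry = ((actions - mirrored_actions) ** 2).sum(dim=-1).mean()
    # Smoothness: discourage large changes between consecutive actions.
    l_smooth = ((actions - prev_actions) ** 2).sum(dim=-1).mean()
    return l_ppo + w_symmetry * l_symmetry + w_smooth * l_smooth
```

Here `mirrored_actions` stands for $M_a(\pi_\theta(M_s(s_t)))$, i.e. the policy evaluated on the mirrored state and mapped back through the action mapping of Table 2.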
S3. Deploy the trained motion controller on the quadruped robot.
This step specifically includes the following:
S31. Obtain the state quantities required by the policy network from the on-board sensors of the quadruped robot: the robot angular velocity and attitude are obtained from the robot's inertial measurement unit, the joint angles and joint angular velocities from the joint encoders, and the robot linear velocity from a state estimator.
S32. Run the policy network in real time on the quadruped robot's motion controller at a fixed frequency of 100 Hz to produce the actions, i.e. the desired joint positions, which are then sent to the joint motors to achieve whole-body control.
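A minimal sketch of the on-robot control loop of S31-S32 is shown below; the `robot` interface methods are placeholders for the actual sensor, state-estimator and motor APIs of the quadruped.

```python
import time

import numpy as np


def control_loop(robot, policy, rate_hz=100.0):
    """Run policy inference at a fixed rate and stream desired joint positions."""
    period = 1.0 / rate_hz
    while True:
        t0 = time.monotonic()
        # S31: assemble the state from on-board sensing.
        obs = np.concatenate([
            robot.velocity_command(),        # c_x, c_y, c_r from the operator
            robot.estimated_lin_vel(),       # state estimator
            robot.imu_ang_vel(),             # inertial measurement unit
            robot.joint_positions(),         # joint encoders
            robot.joint_velocities(),
            robot.imu_roll_pitch(),
        ]).astype(np.float32)
        # S32: infer the policy and send the desired joint positions to the joint motors.
        q_des = policy(obs)
        robot.send_joint_position_targets(q_des)
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```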
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410663847.1A CN118818968A (en) | 2024-05-27 | 2024-05-27 | A quadruped robot motion control method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410663847.1A CN118818968A (en) | 2024-05-27 | 2024-05-27 | A quadruped robot motion control method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118818968A true CN118818968A (en) | 2024-10-22 |
Family
ID=93067482
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410663847.1A Pending CN118818968A (en) | 2024-05-27 | 2024-05-27 | A quadruped robot motion control method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118818968A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119200619A (en) * | 2024-11-27 | 2024-12-27 | Shandong University | Quadruped robot control method and system based on reinforcement learning and composite model |
CN119511739A (en) * | 2025-01-21 | 2025-02-25 | China Coal Research Institute Co., Ltd. | Training method and device for quadruped robot controller based on reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110666793B (en) | Method for realizing robot square part assembly based on deep reinforcement learning | |
WO2020207219A1 (en) | Non-model robot control method for multi-shaft-hole assembly optimized by environmental prediction | |
CN118818968A (en) | A quadruped robot motion control method based on deep reinforcement learning | |
CN102825603B (en) | Network teleoperation robot system and time delay overcoming method | |
CN116460860B (en) | A model-based offline reinforcement learning control method for robots | |
CN110928189A (en) | Robust control method based on reinforcement learning and Lyapunov function | |
CN111618847A (en) | Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements | |
CN112643670B (en) | Flexible joint control method based on sliding-mode observer | |
CN109782600A (en) | A method for establishing autonomous mobile robot navigation system through virtual environment | |
CN112894821B (en) | Method, device and equipment for dragging and teaching control of collaborative robot based on current method | |
CN110083160A (en) | A kind of method for planning track of robot based on deep learning | |
CN112631128A (en) | Robot assembly skill learning method and system based on multi-mode heterogeneous information fusion | |
CN117826713B (en) | Improved reinforcement learning AGV path planning method | |
CN113070878A (en) | Robot control method based on impulse neural network, robot and storage medium | |
CN114396949A (en) | Mobile robot no-priori map navigation decision-making method based on DDPG | |
CN116587275A (en) | Method and system for intelligent impedance control of manipulator based on deep reinforcement learning | |
CN115421387A (en) | A variable impedance control system and control method based on inverse reinforcement learning | |
CN110039537B (en) | Online self-learning multi-joint motion planning method based on neural network | |
WO2023165174A1 (en) | Method for constructing controller for robot, motion control method and apparatus for robot, and robot | |
CN116736749A (en) | Methods for building controllers for robots and robots | |
CN115303455B (en) | Motion control method, device and equipment for underwater bionic robot and storage medium | |
CN118219275A (en) | A robot compliant interactive control method and related device | |
CN116079745B (en) | Man-machine skill migration method based on geometric perception and rhythmic dynamic motion primitive | |
CN118131628A (en) | Mobile robot tracking control method based on multi-target point information fusion | |
CN117873156A (en) | A method for online trajectory optimization of aircraft based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication |