CN116460860B

CN116460860B - A model-based offline reinforcement learning control method for robots

Info

Publication number: CN116460860B
Application number: CN202310725865.3A
Authority: CN
Inventors: 尚伟伟; 李想; 丛爽
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2023-06-19
Filing date: 2023-06-19
Publication date: 2023-10-20
Anticipated expiration: 2043-06-19
Also published as: CN116460860A

Abstract

The invention discloses a model-based robot offline reinforcement learning control method, and belongs to the field of robot control. The method comprises the following steps: deep learning respectively establishes a depth kinematic model based on jacobian and a depth dynamics model based on Lagrange, which are corresponding to the mechanical arm of the robot, a depth transfer model is established by the depth kinematic model based on jacobian and the depth dynamics model based on Lagrange, the depth transfer model is used for establishing a Markov decision process model of a mechanical arm track tracking task, offline reinforcement learning based on the model is carried out through a Soft Actor-Critic reinforcement learning algorithm to obtain a control strategy, and the mechanical arm is controlled by combining a traditional calculation moment controller. The method greatly reduces the sample complexity of the reinforcement learning control of the robot, improves the precision of the track tracking task, and has stronger generalization and robustness.

Description

A model-based offline reinforcement learning control method for robots

技术领域Technical Field

本发明涉及机器人高精度轨迹跟踪任务领域，尤其涉及一种基于模型的机器人离线强化学习控制方法。The present invention relates to the field of high-precision trajectory tracking tasks of robots, and in particular to a model-based offline reinforcement learning control method for robots.

背景技术Background Art

强化学习算法提供了一个强大的框架来解决顺序决策问题，最近的深度学习技术加速了无模型强化学习算法的发展。然而，这些算法很少直接应用于现实世界的物理系统，尤其是机器人系统，因为它们具有很高的样本复杂度，并且训练过程中的中间策略可能对机器人系统和环境有害。Reinforcement learning algorithms provide a powerful framework to solve sequential decision-making problems, and recent deep learning techniques have accelerated the development of model-free reinforcement learning algorithms. However, these algorithms are rarely directly applied to real-world physical systems, especially robotic systems, because they have high sample complexity and the intermediate policies during training may be harmful to the robotic system and the environment.

相比之下，基于模型的强化学习算法通过学习系统及环境的状态转移模型，进行轨迹模拟与规划，降低了算法的样本复杂度。对于机器人的轨迹跟踪控制任务，传统的基于模型的动力学控制方法，比如，增广PD和计算力矩控制具有更高的跟踪精度和更低的控制能耗等优点。而现有基于模型的机器人控制方法需要先准确的获得机器人的运动学模型和动力学模型，并且控制器的参数需要根据经验手动调整，但随着机器人变得越来越复杂，如何获得准确的机器人对应的运动学模型和动力学模型以及自动得到控制器的参数，进而实现高精度的机器人控制是需要解决的问题。In contrast, the model-based reinforcement learning algorithm simulates and plans trajectories by learning the state transition model of the system and environment, thus reducing the sample complexity of the algorithm. For the robot's trajectory tracking control task, traditional model-based dynamic control methods, such as augmented PD and calculated torque control, have the advantages of higher tracking accuracy and lower control energy consumption. The existing model-based robot control method requires accurate acquisition of the robot's kinematic model and dynamic model first, and the controller parameters need to be manually adjusted based on experience. However, as robots become more and more complex, how to obtain accurate kinematic models and dynamic models corresponding to the robot and automatically obtain the controller parameters to achieve high-precision robot control is a problem that needs to be solved.

有鉴于此，特提出本发明。In view of this, the present invention is proposed.

发明内容Summary of the invention

本发明的目的是提供一种基于模型的机器人离线强化学习控制方法，能通过深度学习获得机器人对应的运动学模型与动力学模型以及自动化得到控制器的参数，并与传统计算力矩控制器结合，实现机器人在关节空间和操作空间的高精度轨迹跟踪任务，很好解决现有技术中存在的上述技术问题。The purpose of the present invention is to provide a model-based offline reinforcement learning control method for a robot, which can obtain the corresponding kinematic model and dynamic model of the robot through deep learning and automatically obtain the parameters of the controller, and combine it with a traditional computational torque controller to achieve high-precision trajectory tracking tasks of the robot in the joint space and operation space, and well solve the above-mentioned technical problems existing in the prior art.

本发明的目的是通过以下技术方案实现的：The objective of the present invention is achieved through the following technical solutions:

一种基于模型的机器人离线强化学习控制方法，其特征在于，用于对作为机器人的机械臂进行控制，包括如下步骤：A model-based offline reinforcement learning control method for a robot, characterized in that it is used to control a mechanical arm as a robot, comprising the following steps:

步骤S1，通过深度学习分别建立对应于机械臂的基于雅可比的深度运动学模型与基于拉格朗日的深度动力学模型；其中，所述深度运动学模型用于预测机械臂末端执行器的位姿和计算机械臂对应的雅可比矩阵；所述深度动力学模型用于预测机械臂的关节角度、角速度与角加速度得出机械臂关节空间的状态变化，并在机械臂进行轨迹跟踪任务时获取计算力矩控制器的控制力矩；Step S1, establishing a Jacobian-based deep kinematic model and a Lagrangian-based deep dynamic model corresponding to the robotic arm through deep learning; wherein the deep kinematic model is used to predict the position and posture of the end effector of the robotic arm and calculate the Jacobian matrix corresponding to the robotic arm; the deep dynamic model is used to predict the joint angle, angular velocity and angular acceleration of the robotic arm to obtain the state change of the joint space of the robotic arm, and obtain the control torque of the torque controller when the robotic arm performs a trajectory tracking task;

步骤S2，建立用有限傅里叶级数描述的随机激励轨迹模型，以随机激励轨迹模型给出的随机激励轨迹作为期望运动轨迹控制所述机械臂，测量并收集所述机械臂的实际运动轨迹作为训练数据集，分别对所述步骤S1建立的基于雅可比的深度运动学模型与基于拉格朗日的深度动力学模型进行训练；Step S2, establishing a random excitation trajectory model described by a finite Fourier series, controlling the robotic arm with the random excitation trajectory given by the random excitation trajectory model as the expected motion trajectory, measuring and collecting the actual motion trajectory of the robotic arm as a training data set, and respectively training the Jacobian-based deep kinematic model and the Lagrangian-based deep dynamics model established in step S1;

步骤S3，建立机械臂轨迹跟踪任务的马尔科夫决策过程模型，所述马尔科夫决策过程模型中的状态转移模型为结合训练好的基于雅可比的深度运动学模型和基于拉格朗日的深度动力学模型构建的模拟所述机械臂运动轨迹的深度转移模型；Step S3, establishing a Markov decision process model for the robot arm trajectory tracking task, wherein the state transition model in the Markov decision process model is a deep transfer model for simulating the robot arm motion trajectory constructed by combining the trained Jacobi-based deep kinematic model and the Lagrangian-based deep dynamics model;

步骤S4，根据所述马尔科夫决策过程模型，通过Soft Actor-Critic强化学习算法离线学习计算力矩控制器的控制参数作为控制策略，收集所述深度转移模型离线与控制策略交互的机械臂模拟运动轨迹数据，对该Soft Actor-Critic强化学习算法的演员网络与评论家网络进行更新，直到得出最优控制策略；Step S4, according to the Markov decision process model, the control parameters of the torque controller are calculated offline through the Soft Actor-Critic reinforcement learning algorithm as the control strategy, the simulated motion trajectory data of the manipulator in which the deep transfer model interacts with the control strategy offline is collected, and the actor network and the critic network of the Soft Actor-Critic reinforcement learning algorithm are updated until the optimal control strategy is obtained;

步骤S5，由计算力矩控制器根据步骤S4得出的最优控制策略计算得出机械臂的具体控制力矩对所述机械臂进行控制。Step S5, the calculation torque controller calculates the specific control torque of the robot arm according to the optimal control strategy obtained in step S4 to control the robot arm.

与现有技术相比，本发明所提供的基于模型的机器人离线强化学习控制方法，其有益效果包括：Compared with the prior art, the model-based robot offline reinforcement learning control method provided by the present invention has the following beneficial effects:

通过深度学习建立对应于作为机器人的机械臂的深度运动学模型与深度动力学模型，配合SoftActor-Critic强化学习算法的离线学习，实现了模型与传统计算力矩控制器的结合，从而控制机械臂在关节空间和操作空间进行高精度轨迹跟踪任务。该方法大大减少了机器人强化学习控制的样本复杂度，提高了轨迹跟踪任务的精度，并且具有较强的泛化性和鲁棒性。Through deep learning, a deep kinematic model and a deep dynamic model corresponding to the robot arm are established. With the offline learning of the SoftActor-Critic reinforcement learning algorithm, the model is combined with the traditional computational torque controller to control the robot arm to perform high-precision trajectory tracking tasks in the joint space and operation space. This method greatly reduces the sample complexity of robot reinforcement learning control, improves the accuracy of trajectory tracking tasks, and has strong generalization and robustness.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本发明实施例的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域的普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他附图。In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings required for use in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For ordinary technicians in this field, other accompanying drawings can be obtained based on these accompanying drawings without paying creative work.

图1为本发明实施例提供的基于模型的机器人离线强化学习控制方法的流程示意图。FIG1 is a schematic flow chart of a model-based offline reinforcement learning control method for a robot according to an embodiment of the present invention.

图2为本发明实施例提供的基于模型的机器人离线强化学习控制方法的具体流程图。FIG2 is a specific flow chart of a model-based robot offline reinforcement learning control method provided in an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

下面结合本发明的具体内容，对本发明实施例中的技术方案进行清楚、完整地描述；显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例，这并不构成对本发明的限制。基于本发明的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明的保护范围。The following is a clear and complete description of the technical solutions in the embodiments of the present invention in combination with the specific content of the present invention; it is obvious that the described embodiments are only part of the embodiments of the present invention, not all of the embodiments, which does not constitute a limitation of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by ordinary technicians in this field without creative work are within the protection scope of the present invention.

先对本文中可能使用的术语进行如下说明：First, the terms that may be used in this article are explained as follows:

术语“和/或”是表示两者任一或两者同时均可实现，例如，X和/或Y表示既包括“X”或“Y”的情况也包括“X和Y”的三种情况。The term “and/or” means that either or both of them can be realized at the same time. For example, X and/or Y means both “X” or “Y” and “X and Y”.

术语“包括”、“包含”、“含有”、“具有”或其它类似语义的描述，应被解释为非排它性的包括。例如：包括某技术特征要素（如原料、组分、成分、载体、剂型、材料、尺寸、零件、部件、机构、装置、步骤、工序、方法、反应条件、加工条件、参数、算法、信号、数据、产品或制品等），应被解释为不仅包括明确列出的某技术特征要素，还可以包括未明确列出的本领域公知的其它技术特征要素。The terms "include", "comprises", "contains", "has" or other descriptions with similar semantics should be interpreted as non-exclusive inclusion. For example, including certain technical feature elements (such as raw materials, components, ingredients, carriers, dosage forms, materials, dimensions, parts, components, mechanisms, devices, steps, procedures, methods, reaction conditions, processing conditions, parameters, algorithms, signals, data, products or products, etc.) should be interpreted as including not only certain technical feature elements explicitly listed, but also other technical feature elements known in the art that are not explicitly listed.

术语“由……组成”表示排除任何未明确列出的技术特征要素。若将该术语用于权利要求中，则该术语将使权利要求成为封闭式，使其不包含除明确列出的技术特征要素以外的技术特征要素，但与其相关的常规杂质除外。如果该术语只是出现在权利要求的某子句中，那么其仅限定在该子句中明确列出的要素，其他子句中所记载的要素并不被排除在整体权利要求之外。The term "consisting of..." means excluding any technical feature elements not explicitly listed. If this term is used in a claim, it will make the claim closed, so that it does not contain technical feature elements other than the technical feature elements explicitly listed, except for the conventional impurities related to them. If this term only appears in a clause of a claim, it only limits the elements explicitly listed in the clause, and the elements recorded in other clauses are not excluded from the overall claim.

另有明确的规定或限定外，术语“安装”、“相连”、“连接”、“固定”等术语应做广义理解，例如：可以是固定连接，也可以是可拆卸连接，或一体地连接；可以是机械连接，也可以是电连接；可以是直接相连，也可以通过中间媒介间接相连，可以是两个元件内部的连通。对于本领域的普通技术人员而言，可以根据具体情况理解上述术语在本文中的具体含义。Unless otherwise specified or limited, the terms "installed", "connected", "connected", "fixed" and the like should be understood in a broad sense, for example: it can be a fixed connection, a detachable connection, or an integral connection; it can be a mechanical connection or an electrical connection; it can be a direct connection, or it can be an indirect connection through an intermediate medium, or it can be the internal communication of two components. For ordinary technicians in this field, the specific meanings of the above terms in this article can be understood according to specific circumstances.

术语“中心”、“纵向”、“横向”、“长度”、“宽度”、“厚度”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”“内”、“外”、“顺时针”、“逆时针”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述和简化描述，而不是明示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本文的限制。The orientation or position relationship indicated by terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", "outside", "clockwise", "counterclockwise", etc. are based on the orientation or position relationship shown in the drawings and are only for the convenience and simplification of description, and do not explicitly or implicitly indicate that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore should not be understood as a limitation of this document.

面对本发明所提供的基于模型的机器人离线强化学习控制方法进行详细描述。本发明实施例中未作详细描述的内容属于本领域专业技术人员公知的现有技术。本发明实施例中未注明具体条件者，按照本领域常规条件或制造商建议的条件进行。本发明实施例中所用试剂或仪器未注明生产厂商者，均为可以通过市售购买获得的常规产品。The model-based robot offline reinforcement learning control method provided by the present invention is described in detail. The contents not described in detail in the embodiments of the present invention belong to the prior art known to professional and technical personnel in the field. If the specific conditions are not specified in the embodiments of the present invention, they are carried out according to the conventional conditions in the field or the conditions recommended by the manufacturer. The reagents or instruments used in the embodiments of the present invention that do not indicate the manufacturer are all conventional products that can be purchased commercially.

如图1、图2所示，本发明实施例提供一种基于模型的机器人离线强化学习控制方法，用于对作为机器人的机械臂进行控制，包括如下步骤：As shown in FIG. 1 and FIG. 2 , an embodiment of the present invention provides a model-based offline reinforcement learning control method for a robot, which is used to control a robot arm as a robot, and includes the following steps:

优选的，上述方法的步骤S1中，按以下方式通过深度学习建立对应于机械臂的基于雅可比的深度运动学模型，包括：Preferably, in step S1 of the above method, a Jacobian-based deep kinematic model corresponding to the robotic arm is established through deep learning in the following manner, including:

确定所述机械臂的正向运动学模型为：；其中，是机械臂末端执行器在操作空间中的位姿，是机械臂操作空间的自由度；是机械臂的关节角度，是机械臂关节空间的自由度；The forward kinematics model of the robotic arm is determined as: ;in, is the position of the end effector of the robot arm in the operating space, is the degree of freedom of the robot's operating space; is the joint angle of the robot arm, is the degree of freedom of the robot joint space;

机械臂的关节与机械臂末端执行器之间的速度关系用运动学方程f对于时间的微分表示为；其中，是机械臂末端执行器在操作空间中位姿相对于时间的导数；是机械臂实际对应的雅可比矩阵，是机械臂操作空间的自由度，是机械臂关节空间的自由度；是机械臂关节角速度；The velocity relationship between the joints of the robot and the end effector of the robot is expressed by the differential of the kinematic equation f with respect to time: ;in, It is the derivative of the position of the end effector of the robot arm in the operating space with respect to time; is the Jacobian matrix actually corresponding to the robotic arm, is the degree of freedom of the robot's operating space, is the degree of freedom of the robot joint space; is the angular velocity of the robot joint;

确定所述机械臂末端执行器在操作空间中的位姿为，其中，p是机械臂末端执行器的位置；是表示机械臂末端执行器的姿态的四元数，，其中表示该四元数的旋转轴，、和分别为旋转轴的三个分量，表示绕旋转轴旋转的角度，满足约束；四元数对于时间的导数与操作空间中机械臂末端执行器的角速度ω之间的关系为，即该关系为：Determine the position and posture of the end effector of the robotic arm in the operating space as , where p is the position of the end effector of the robot arm; is the quaternion representing the posture of the end effector of the robot arm, ,in represents the rotation axis of the quaternion, , and are the three components of the rotation axis, represents the angle of rotation around the axis of rotation, Satisfy constraints ; Quaternion The relationship between the time derivative and the angular velocity ω of the end effector of the manipulator in the operating space is: , that is, the relationship is:

； ;

上述关系式中，、和分别为机械臂末端执行器的角速度在x轴、y轴和z轴的分量；In the above relationship, , and are the components of the angular velocity of the end effector of the robot arm in the x-axis, y-axis and z-axis respectively;

建立全连接的深度神经网络来学习机械臂的正向运动学模型得出机械臂对应的深度运动学模型为：，其中，为深度运动学模型的网络参数；是深度学习得出的运动学方程；q为机械臂的关节角度；A fully connected deep neural network is established to learn the forward kinematics model of the robotic arm, and the corresponding deep kinematics model of the robotic arm is obtained as follows: ,in, are the network parameters of the deep kinematics model; is the kinematic equation derived from deep learning; q is the joint angle of the robot arm;

对所述深度运动学模型中每一层的导数应用链式法则，迭代计算出深度神经网络的输出相对于输入的导数来恢复并学习雅可比矩阵，所述雅可比矩阵通过下述的计算后表示为：Applying the chain rule to the derivatives of each layer in the deep kinematic model, iteratively calculating the derivatives of the output of the deep neural network with respect to the input to recover and learn the Jacobian matrix, the Jacobian matrix After the following calculation, it is expressed as:

； ;

其中，为深度运动学模型的网络参数；为深度学习得出的运动学方程对于关节角度的偏导；为深度运动学模型的第i层网络层，，；，，为深度运动学模型的应用于深度神经网络前一层网络层输入的权重；为偏差；和分别为深度神经网络的非线性激活函数及其导数。in, are the network parameters of the deep kinematics model; The partial derivatives of the kinematic equations derived from deep learning for the joint angles; is the i-th network layer of the deep kinematics model, , ; , , The weights of the deep kinematics model applied to the input of the previous layer of the deep neural network; for deviation; and They are the nonlinear activation function and its derivative of the deep neural network respectively.

优选的，上述方法中，所述深度运动学模型的训练数据集，其中，q为机械臂的关节角度；为机械臂的关节角速度、x为机械臂末端执行器在操作空间中的位姿；为机械臂末端执行器在操作空间中位姿相对于时间的导数；Preferably, in the above method, the training data set of the deep kinematic model , where q is the joint angle of the robot arm; is the joint angular velocity of the robot arm, x is the position of the end effector of the robot arm in the operation space; is the derivative of the position of the end effector of the robot arm in the operating space with respect to time;

所述深度运动学模型训练时的损失函数为：The loss function during the training of the deep kinematics model is:

； ;

其中，为深度运动学模型的网络参数；为深度运动学模型的训练数据集；为训练数据集的大小；是雅可比矩阵的广义逆矩阵；为训练数据集的大小；第一个损失项是机械臂操作空间的实际位姿与预测位姿的均方误差；第二个损失项是机械臂操作空间的实际速度与预测速度的均方误差；最后一个损失项是通过拟合机械臂关节角速度使深度神经网络更准确的雅可比矩阵。in, are the network parameters of the deep kinematics model; It is a training dataset for deep kinematics models; is the size of the training dataset; is the generalized inverse of the Jacobian matrix; is the size of the training dataset; the first loss term is the mean square error between the actual position and predicted position of the robot in the operating space; the second loss term is the mean square error between the actual speed of the robot's operating space and the predicted speed; the last loss term It is the Jacobian matrix that makes the deep neural network more accurate by fitting the angular velocity of the robot arm joints.

优选的，上述方法的步骤S1中，按以下方式通过深度学习建立对应于机械臂的基于拉格朗日的深度动力学模型，包括：Preferably, in step S1 of the above method, a Lagrangian-based deep dynamics model corresponding to the robotic arm is established through deep learning in the following manner, including:

确定所述机械臂的正向动力学模型和逆向动力学模型分别为：The forward dynamics model and the inverse dynamics model of the robot arm are determined as follows:

； ;

其中，分别是机械臂的关节角度、关节角速度和关节角加速度；是作用在机械臂关节上的力和力矩；in, They are the joint angle, joint angular velocity and joint angular acceleration of the robot arm; are the forces and moments acting on the joints of the robot arm;

将机械臂的广义坐标选择为机械臂的关节角度，定义拉格朗日函数为，其中，为机械臂的动能；为机械臂的势能；是机械臂的质量矩阵；上标T表示转置矩阵；The generalized coordinates of the robot are selected as the joint angles of the robot, and the Lagrangian function is defined as ,in, is the kinetic energy of the robotic arm; is the potential energy of the robot arm; is the mass matrix of the robot; the superscript T indicates the transposed matrix;

为保证机械臂的质量矩阵的正定对称性，对质量矩阵进行Chelesky分解得到，其中，为非负对角线的下三角矩阵，上标T表示转置矩阵，利用深度神经网络学习非负对角线的下三角矩阵和机械臂的势能来拟合拉格朗日函数，结合欧拉-拉格朗日方程，其中，F为广义力和力矩，得出组成机械臂对应的深度动力学模型的深度正向动力学模型与深度逆向动力学模型分别为：To ensure the mass matrix of the robot arm The positive definite symmetry of the mass matrix Perform Chelesky decomposition to obtain ,in, is a non-negative diagonal lower triangular matrix, the superscript T represents the transposed matrix, and the non-negative diagonal lower triangular matrix is learned using a deep neural network and the potential energy of the robotic arm To fit the Lagrangian function , combined with the Euler-Lagrange equation , where F is the generalized force and torque, and the deep forward dynamics model and deep inverse dynamics model that constitute the deep dynamics model corresponding to the robotic arm are obtained as follows:

； ;

其中，、和分别为机械臂的关节角度、关节角速度和关节角加速度；为机械臂的质量矩阵；为机械臂的质量矩阵对于时间的导数；为科里奥利力和向心力；为包含重力和弹簧弹力的保守力；为机械臂的关节输出力矩；为机械臂的关节摩擦力，该机械臂的关节摩擦力通过引入的由库仑摩擦、粘滞摩擦与Stribeck摩擦力组成的机械臂的关节先验摩擦力模型得出，该机械臂的关节先验摩擦力模型为：in, , and They are the joint angle, joint angular velocity and joint angular acceleration of the robot arm respectively; is the mass matrix of the robot; is the derivative of the mass matrix of the robot with respect to time; are the Coriolis force and the centripetal force; is a conservative force including gravity and spring force; Output torque for the joints of the robot arm; is the joint friction of the robot arm, which is obtained by introducing a priori friction model of the robot arm composed of Coulomb friction, viscous friction and Stribeck friction. The priori friction model of the robot arm is:

； ;

其中，为库仑摩擦力；为粘滞摩擦系数；为最大静摩擦力；υ与δ分别是Stribeck摩擦力模型的相关系数；为机械臂的第i个关节的角速度。in, is the Coulomb friction force; is the viscous friction coefficient; is the maximum static friction; υ and δ are the correlation coefficients of the Stribeck friction model respectively; is the angular velocity of the i-th joint of the robot.

优选的，上述方法中，所述深度动力学模型的训练数据集，其中，q为机械臂的关节角度；为机械臂的关节角速度；为机械臂的关节角加速度；为机械臂的关节力矩；Preferably, in the above method, the training data set of the deep dynamics model is , where q is the joint angle of the robot arm; is the joint angular velocity of the robot arm; is the joint angular acceleration of the robot arm; is the joint torque of the robot arm;

所述深度动力学模型的损失函数为：The loss function of the deep dynamics model is:

； ;

其中，为深度动力学模型的训练数据集；为深度动力学模型的训练数据集的大小；为机械臂关节角加速度的回归损失，、分别为机械臂的关节角加速度在t时刻的预测值与实际值；为关节力矩的回归损失，、分别为机械臂关节力矩在t时刻的预测值与实际值；是通过数值积分得到的多步预测损失，为预测的总步数，、分别为机械臂关节角度与关节角速度的在t+1时刻的实际值，和分别为机械臂关节角度与关节角速度在t+1时刻的预测值。in, It is a training dataset for deep dynamics models; is the size of the training dataset for the deep dynamics model; is the angular acceleration of the robot joint Regression loss, , are the predicted value and actual value of the joint angular acceleration of the robot arm at time t; is the joint torque Regression loss, , are the predicted value and actual value of the joint torque of the robot arm at time t respectively; is the multi-step prediction loss obtained by numerical integration, is the total number of steps predicted, , are the actual values of the joint angle and joint angular velocity of the robot arm at time t+1, and They are the predicted values of the robot arm joint angle and joint angular velocity at time t+1 respectively.

优选的，上述方法的步骤S2中，按以下方式建立用有限傅里叶级数描述的随机激励轨迹模型，包括：Preferably, in step S2 of the above method, a random excitation trajectory model described by a finite Fourier series is established in the following manner, including:

对于t时刻机械臂的关节i，随机激励轨迹模型定义为：For joint i of the robot at time t, the random excitation trajectory model Defined as:

； ;

其中,和分别为余弦的振幅和正弦的振幅；是余弦的频率和正弦的频率；是机械臂关节角度的偏移量，正弦的振幅、余弦的振幅和频率与关节角度的偏移量在保证机械臂关节角度、关节角速度和关节角加速度都在安全的范围内随机选择；为傅里叶级数的个数，在1到3中随机选择，为累加时的变量，t为t时刻的时间值。in, and are the amplitude of cosine and the amplitude of sine respectively; is the frequency of the cosine and the frequency of the sine; is the offset of the joint angle of the robot arm. The amplitude of the sine, the amplitude and frequency of the cosine and the offset of the joint angle are randomly selected to ensure that the joint angle, joint angular velocity and joint angular acceleration of the robot arm are within a safe range. is the number of Fourier series, randomly selected from 1 to 3, is the variable during accumulation, and t is the time value at time t.

优选的，上述方法的步骤S3中，按以下方式建立机械臂轨迹跟踪任务的马尔科夫决策过程模型，所述马尔科夫决策过程模型的状态转移模型为结合训练好的基于雅可比的深度运动学模型和基于拉格朗日的深度动力学模型构建的模拟所述机械臂运动轨迹的深度转移模型，包括：Preferably, in step S3 of the above method, a Markov decision process model of the robot arm trajectory tracking task is established in the following manner, wherein the state transition model of the Markov decision process model is a deep transition model for simulating the robot arm motion trajectory constructed by combining a trained Jacobi-based deep kinematic model and a Lagrangian-based deep dynamics model, including:

将机械臂的轨迹跟踪任务建模为有限时间折扣离散马尔科夫决策过程；其中，为状态空间，为t时刻的状态；为动作空间；为t时刻的动作；是作为状态转移模型的深度转移模型；为奖励函数；为折扣因子；Modeling the robot's trajectory tracking task as a finite-time discounted discrete Markov decision process ;in, is the state space, is the state at time t; is the action space; is the action at time t; It is a deep transition model as a state transition model; is the reward function; is the discount factor;

按以下方式结合训练好的基于雅可比的深度运动学模型和基于拉格朗日的深度动力学模型构建模拟所述机械臂运动轨迹的深度转移模型，包括：A deep transfer model simulating the motion trajectory of the robotic arm is constructed by combining the trained Jacobian-based deep kinematic model and the Lagrangian-based deep dynamics model in the following manner, including:

根据机械臂在t时刻的关节角度与关节角速度与机械臂的关节力矩，利用四阶龙格-库塔数值积分法结合基于拉格朗日的深度动力学模型，得到机械臂在t+1时刻的关节角度与关节角速度的预测值；According to the joint angle and joint angular velocity of the robot at time t Joint torque with the robot arm The predicted values of the joint angle and joint angular velocity of the manipulator at time t+1 are obtained by using the fourth-order Runge-Kutta numerical integration method combined with the Lagrangian-based deep dynamics model. ;

利用基于雅可比的深度运动学模型以机械臂在t+1时刻的关节角度与关节角速度的预测值为输入得到t+1时刻的机械臂的末端执行器的位姿和速度的预测值；The Jacobian-based deep kinematic model is used to predict the joint angles and joint angular velocities of the robot at time t+1. The predicted values of the position and velocity of the end effector of the robot arm at time t+1 are obtained as input ;

根据得到t+1时刻的机械臂的关节角度、关节角速度与末端执行器的位姿和速度的预测值，结合t+1时刻机械臂的期望轨迹计算出完成轨迹跟踪任务所需的误差值，组成t+1时刻机械臂轨迹跟踪任务的状态；According to the predicted values of the joint angle, joint angular velocity of the robot arm and the position and velocity of the end effector at time t+1 , combined with the expected trajectory of the robot at time t+1, the error value required to complete the trajectory tracking task is calculated to form the state of the robot trajectory tracking task at time t+1 ;

利用t+1时刻机械臂轨迹跟踪任务的状态构建深度转移模型，该深度转移模型中和分别为机械臂轨迹跟踪任务t时刻的状态和t+1时刻的状态，为t时刻控制策略输出的动作。Use the robot arm trajectory at time t+1 to track the status of the task Building a deep transfer model , in this deep transfer model and are the states of the robot arm trajectory tracking task at time t and time t+1, respectively. is the action output by the control strategy at time t.

优选的，上述方法的步骤S4中，按以下方式根据所述马尔科夫决策过程模型，通过Soft Actor-Critic强化学习算法离线学习计算力矩控制器的控制参数作为控制策略，收集所述深度转移模型离线与控制策略交互的机械臂模拟运动轨迹数据，对该Soft Actor-Critic强化学习算法的演员网络与评论家网络进行更新，直到得出最优控制策略，包括：Preferably, in step S4 of the above method, the control parameters of the torque controller are calculated offline by the Soft Actor-Critic reinforcement learning algorithm according to the Markov decision process model as the control strategy, the simulated motion trajectory data of the manipulator in which the deep transfer model interacts offline with the control strategy is collected, and the actor network and the critic network of the Soft Actor-Critic reinforcement learning algorithm are updated until the optimal control strategy is obtained, including:

根据机械臂在关节空间进行的轨迹跟踪任务与操作空间进行的轨迹跟踪任务分别设置所述马尔科夫决策过程模型的状态空间和动作空间，通过Soft Actor-Critic强化学习算法进行基于模型的离线强化学习，输出计算力矩控制器的控制参数作为控制策略；The state space and action space of the Markov decision process model are respectively set according to the trajectory tracking task performed by the robot arm in the joint space and the trajectory tracking task performed in the operation space, and the model-based offline reinforcement learning is performed through the Soft Actor-Critic reinforcement learning algorithm, and the control parameters of the torque controller are output and calculated as the control strategy;

收集所述深度转移模型离线与控制策略交互的机械臂模拟运动轨迹数据，对该Soft Actor-Critic强化学习算法的演员网络与评论家网络进行更新，直到得出最优控制策略。Collect the simulated motion trajectory data of the robot arm that interacts with the control strategy offline , the actor network and critic network of the Soft Actor-Critic reinforcement learning algorithm are updated until the optimal control strategy is obtained.

优选的，上述方法中，按以下方式根据机械臂在关节空间进行的轨迹跟踪任务与操作空间进行的轨迹跟踪任务分别设置所述马尔科夫决策过程模型的状态空间和动作空间，包括：Preferably, in the above method, the state space and action space of the Markov decision process model are respectively set according to the trajectory tracking task performed by the robot arm in the joint space and the trajectory tracking task performed in the operation space in the following manner, including:

设置机械臂关节空间进行轨迹跟踪任务在t时刻的状态为：Set the state of the robot arm joint space for trajectory tracking task at time t for:

； ;

其中，、分别为机械臂的关节角度误差和关节角速度误差；为机械臂的关节角度误差的累积值；、和分别为机械臂期望轨迹的关节角度、关节角速度和关节角加速度；in, , They are the joint angle error and joint angular velocity error of the robot arm respectively; is the accumulated value of the joint angle error of the robot arm; , and are the joint angle, joint angular velocity, and joint angular acceleration of the desired trajectory of the robot arm, respectively;

则机械臂关节空间进行轨迹跟踪任务所用的动作设计为如下形式的计算力矩控制器：The action used by the robot arm joint space to perform trajectory tracking tasks The computational torque controller is designed as follows:

； ;

其中，和分别为机械臂的关节角度和关节角速度；为机械臂的质量矩阵；为科里奥利力和向心力；为包含重力和弹簧弹力的保守力；为机械臂的关节摩擦力；、和分别为机械臂的关节角度误差、关节角速度误差和关节角度误差的累积值；、分别为计算力矩控制器的控制参数；in, and are the joint angles and joint angular velocities of the robot arm, respectively; is the mass matrix of the robot; are the Coriolis force and the centripetal force; is a conservative force including gravity and spring force; is the joint friction of the robot arm; , and They are the joint angle error, joint angular velocity error and cumulative value of joint angle error of the robot arm respectively; , They are respectively used to calculate the control parameters of the torque controller;

设置机械臂操作空间的轨迹跟踪任务在t时刻的状态为：Set the state of the trajectory tracking task in the robot operation space at time t for:

； ;

其中，、分别为机械臂末端执行器的位姿误差和速度误差向量；、分别为机械臂末端执行器的位置误差和线速度误差；、分别为机械臂末端执行器的姿态误差和角速度误差；为机械臂末端执行器位姿误差的累积值；、和分别为机械臂末端执行器期望轨迹的位姿、速度和加速度；和分别为机械臂末端执行器的位置和线速度；、和分别为机械臂末端执行器期望轨迹的位置、线速度和线加速度；为表示机械臂末端执行器的姿态的四元数，和分别为该四元数的旋转轴和绕旋转轴旋转的角度；为机械臂末端执行器期望轨迹的姿态的四元数，和分别为该四元数旋转轴和绕旋转轴旋转的角度；和分别为机械臂末端执行器期望轨迹的姿态的四元数对于时间的一阶导数与二阶导数；和分别机械臂末端执行器的角速度与期望轨迹的角速度；上述各参数的下标t表示该参数是t时刻的参数；in, , are the posture error and velocity error vector of the end effector of the robot arm respectively; , They are the position error and linear velocity error of the end effector of the robot arm respectively; , They are the attitude error and angular velocity error of the end effector of the robot arm respectively; is the accumulated value of the position error of the end effector of the robot arm; , and are the position, velocity and acceleration of the desired trajectory of the robot end effector respectively; and are the position and linear velocity of the end effector of the robot arm, respectively; , and are the position, linear velocity and linear acceleration of the desired trajectory of the robot end effector, respectively; is the quaternion representing the posture of the end effector of the robot arm, and are the rotation axis of the quaternion and the angle of rotation around the rotation axis respectively; is the quaternion of the desired trajectory of the robot end effector, and are the rotation axis of the quaternion and the angle of rotation around the rotation axis respectively; and are the first-order derivative and the second-order derivative of the quaternion of the desired trajectory of the robot end effector with respect to time; and The angular velocity of the end effector of the robot arm and the angular velocity of the desired trajectory respectively; the subscript t of the above parameters indicates that the parameter is the parameter at time t;

则机械臂操作空间进行轨迹跟踪任务所用的动作设计为如下形式的基于加速度的操作空间计算力矩控制器：The action used by the robot to perform trajectory tracking tasks in the operating space is The acceleration-based operating space calculation torque controller is designed as follows:

； ;

其中，和分别为机械臂的关节角度和关节角速度；为机械臂的质量矩阵；为科里奥利力和向心力；为包含重力和弹簧弹力的保守力；为机械臂的关节摩擦力；为操作空间的机械臂参考加速度，、分别为机械臂末端执行器的位姿误差和速度误差向量；为深度运动学模型的网络参数；为机械臂雅可比矩阵的广义逆矩阵；、分别为计算力矩控制器的控制参数，选择零空间的控制力矩的形式为：in, and are the joint angles and joint angular velocities of the robot arm, respectively; is the mass matrix of the robot; are the Coriolis force and the centripetal force; is a conservative force including gravity and spring force; is the joint friction of the robot arm; is the reference acceleration of the manipulator in the operating space, , are the posture error and velocity error vector of the end effector of the robot arm respectively; are the network parameters of the deep kinematics model; is the generalized inverse matrix of the Jacobian matrix of the robot; , Calculate the control parameters of the torque controller and select the control torque of the null space The form is:

； ;

其中，和分别为机械臂的关节角度和关节角速度；为机械臂关节角度的初始值；为单位矩阵；为深度运动学模型的网络参数；为机械臂的雅可比矩阵的广义逆矩阵；为机械臂的雅可比矩阵；、分别为计算力矩控制器的控制参数；in, and are the joint angles and joint angular velocities of the robot arm, respectively; is the initial value of the robot arm joint angle; is the identity matrix; are the network parameters of the deep kinematics model; is the generalized inverse matrix of the Jacobian matrix of the robot; is the Jacobian matrix of the robot; , They are respectively used to calculate the control parameters of the torque controller;

设置的奖励函数为分段奖励函数，该分段奖励函数为：The reward function is set as a segmented reward function, which is:

； ;

其中，和分别为不同控制精度的权重，含项用于使策略探索误差快速减小的动作，含项用于使策略学习提高精度的动作；β为调整不同权重所占比重的值，β取值为0.75；、和在机械臂关节空间的轨迹跟踪任务中分别为机械臂的关节角度误差、关节角速度误差与关节角度误差的累积值，在机械臂操作空间的轨迹跟踪任务中分别为机械臂末端执行器的位姿误差、速度误差与位姿误差的累积值；in, and are the weights of different control accuracies, including The term is used to quickly reduce the strategy exploration error, including The term is used to make the strategy learn actions that improve accuracy; β is the value for adjusting the proportion of different weights, and the value of β is 0.75; , and In the trajectory tracking task of the robot arm joint space, they are the joint angle error, joint angular velocity error and the cumulative value of the joint angle error of the robot arm; in the trajectory tracking task of the robot arm operation space, they are the posture error, velocity error and cumulative value of the posture error of the robot arm end effector;

按以下方式对该Soft Actor-Critic强化学习算法的演员网络与评论家网络进行更新，包括：The actor network and critic network of the Soft Actor-Critic reinforcement learning algorithm are updated as follows:

所述Soft Actor-Critic强化学习算法通过评论家网络拟合状态-动作值函数来对策略进行评估，通过以下最小化贝尔曼方程的误差进行更新：，其中，为t+1时刻机械臂轨迹跟踪任务的状态，为t+1时刻控制策略输出的动作；为t时刻的状态-动作值函数；为以控制策略为条件下的期望值；t表示t时刻；为折扣因子，上标t表示折扣因子的t次方；为奖励函数；为t+1时刻的状态和动作为条件下的期望值；为t+1时刻的状态-动作值函数；The Soft Actor-Critic reinforcement learning algorithm fits the state-action value function through the critic network To evaluate the strategy, update it by minimizing the error of the Bellman equation as follows: ,in, is the state of the robot trajectory tracking task at time t+1, The action output by the control strategy at time t+1; is the state-action value function at time t; To control strategy is the expected value under the condition; t represents the time t; is the discount factor, and the superscript t represents the tth power of the discount factor; is the reward function; is the expected value under the condition of the state and action at time t+1; is the state-action value function at time t+1;

通过以下最小化策略的KL散度对所述Soft Actor-Critic强化学习算法的演员网络进行更新：The actor network of the Soft Actor-Critic reinforcement learning algorithm is updated by minimizing the KL divergence of the policy as follows:

； ;

其中，为在控制策略分布中的采样；Π为演员网络表示的控制策略的分布；为最小化策略的KL散度；为t时刻机械臂轨迹跟踪任务的状态；是表示机械臂末端执行器的姿态的四元数；为更新前的控制策略；为更新前的状态-动作值函数；是用于归一化分布的配分函数；in, is a sample in the distribution of control strategies; Π is the distribution of control strategies represented by the actor network; To minimize the KL divergence of the strategy; is the state of the robot trajectory tracking task at time t; is the quaternion representing the posture of the end effector of the robot arm; is the control strategy before updating; is the state-action value function before updating; is the partition function used to normalize the distribution;

所述Soft Actor-Critic强化学习算法得出的最优控制策略是在最大化期望奖励的同时引入最大熵目标，该最优控制策略为：The optimal control strategy derived from the Soft Actor-Critic reinforcement learning algorithm is to introduce the maximum entropy objective while maximizing the expected reward. for:

； ;

其中，T为机械臂运动轨迹的长度；为在t时刻状态和动作为条件下的期望值；为t时刻机械臂轨迹跟踪任务的状态；为t时刻控制策略输出的动作；为奖励函数；为折扣因子；为控制策略的熵；为熵的正则化系数，该正则化系数采用自适应更新的方式调整熵在目标函数中所占比重来控制策略的随机性。Where T is the length of the robot arm's motion trajectory; is the expected value under the condition of state and action at time t; is the state of the robot trajectory tracking task at time t; is the action of the control strategy output at time t; is the reward function; is the discount factor; is the entropy of the control strategy; is the regularization coefficient of entropy, which uses adaptive updating to adjust the proportion of entropy in the objective function to control the randomness of the strategy.

优选的，上述方法的步骤S5中，由计算力矩控制器根据步骤S4得出的最优控制策略输出的计算力矩控制器的控制参数，结合计算力矩控制器计算得到机械臂的具体控制力矩。Preferably, in step S5 of the above method, the control parameter of the calculated torque controller output by the calculated torque controller according to the optimal control strategy obtained in step S4 is , combined with the calculated torque controller, the specific control torque of the robotic arm is calculated.

本发明实施例的控制方法中，通过引入先验知识使得深度学习得到的深度运动学模型和深度动力学模型均为灰盒模型，提高了模型的泛化性并获得可解释的深度运动学模型和深度动力学模型，用于模拟机器人的运动轨迹与优化机器人的控制力矩。本发明大大减少了机器人强化学习控制的样本复杂度，提高了轨迹跟踪的精度，并且具有较强的泛化性和鲁棒性。In the control method of the embodiment of the present invention, by introducing prior knowledge, the deep kinematic model and deep dynamic model obtained by deep learning are both gray box models, which improves the generalization of the model and obtains interpretable deep kinematic model and deep dynamic model, which are used to simulate the robot's motion trajectory and optimize the robot's control torque. The present invention greatly reduces the sample complexity of robot reinforcement learning control, improves the accuracy of trajectory tracking, and has strong generalization and robustness.

为了更加清晰地展现出本发明所提供的技术方案及所产生的技术效果，下面以具体实施例对本发明实施例所提供的基于模型的机器人离线强化学习控制方法进行详细描述。In order to more clearly demonstrate the technical solution and technical effects provided by the present invention, the model-based robot offline reinforcement learning control method provided by the embodiment of the present invention is described in detail with specific examples below.

实施例1Example 1

如图1、图2所示，本发明实施例提供一种基于模型的机器人离线强化学习控制方法，通过深度学习分别建立对应于作为机器人的机械臂的深度运动学模型与深度动力学模型，实现与传统计算力矩控制器的结合，从而完成机械臂在关节空间和操作空间的高精度轨迹跟踪任务。该方法包括以下步骤：As shown in Figures 1 and 2, an embodiment of the present invention provides a model-based offline reinforcement learning control method for a robot, which establishes a deep kinematic model and a deep dynamic model corresponding to a robot arm as a robot through deep learning, and realizes the combination with a traditional computational torque controller, thereby completing the high-precision trajectory tracking task of the robot arm in the joint space and the operation space. The method includes the following steps:

首先，通过深度学习建立对应于机械臂的基于雅可比的深度运动学模型，包括：First, a Jacobian-based deep kinematic model corresponding to the robotic arm is established through deep learning, including:

机械臂的正向运动学模型为：The forward kinematics model of the robot arm is:

； ;

其中，是机械臂末端执行器在操作空间中的位置和姿态，即位姿，是机械臂的关节角度，是机械臂操作空间的自由度，是机械臂关节空间的自由度；in, is the position and posture of the end effector of the robot arm in the operating space, that is, the posture, is the joint angle of the robot arm, is the degree of freedom of the robot's operating space, is the degree of freedom of the robot joint space;

运动学方程f对于时间的微分描述了机械臂的关节与机械臂的末端执行器之间的速度关系为：The kinematic equation f, differentiated with respect to time, describes the velocity relationship between the joints of the robot and the end effector of the robot as follows:

； ;

其中，是机械臂末端执行器在操作空间中位姿相对于时间的导数；是机械臂实际对应的雅可比矩阵，是机械臂操作空间的自由度，是机械臂关节空间的自由度；是机械臂关节角速度。使用数值鲁棒的四元数来表示末端执行器的姿态；四元数表示为，并且满足约束，其中表示该四元数的旋转轴，、和分别为旋转轴的三个分量，表示绕旋转轴旋转的角度。因此在操作空间中机械臂末端执行器的位姿可以表示为，其中p是机械臂末端执行器的位置。四元数对于时间的导数与操作空间角速度ω之间的关系为，即：in, It is the derivative of the position of the end effector of the robot arm in the operating space with respect to time; is the Jacobian matrix actually corresponding to the robotic arm, is the degree of freedom of the robot's operating space, is the degree of freedom of the robot joint space; is the angular velocity of the robot joint. Using numerically robust quaternions To represent the posture of the end effector; the quaternion is expressed as , and satisfy the constraints ,in represents the rotation axis of the quaternion, , and are the three components of the rotation axis, represents the angle of rotation around the rotation axis. Therefore, the position of the end effector of the robot arm in the operation space can be expressed as , where p is the position of the end effector of the robot arm. Quaternion The relationship between the time derivative and the angular velocity ω of the operating space is: ,Right now:

； ;

建立全连接的深度神经网络来学习机械臂的正向运动学模型，得出机械臂对应的深度运动学模型为：，其中，为深度运动学模型的网络参数；是深度学习得出的运动学方程；q为机械臂的关节角度；A fully connected deep neural network is established to learn the forward kinematics model of the robotic arm, and the corresponding deep kinematics model of the robotic arm is obtained as follows: ,in, are the network parameters of the deep kinematics model; is the kinematic equation derived from deep learning; q is the joint angle of the robot arm;

对得出的深度运动学模型的每一层的导数应用链式法则，迭代计算出深度神经网络的输出相对于输入的导数来恢复并学习雅可比矩阵。学习得到的雅可比矩阵通过如下的计算后表示为：The chain rule is applied to the derivatives of each layer of the obtained deep kinematic model, and the derivatives of the output of the deep neural network with respect to the input are iteratively calculated to recover and learn the Jacobian matrix. The learned Jacobian matrix After the following calculation, it is expressed as:

； ;

上述深度运动学模型的训练数据集，该训练数据集由机械臂的关节角度q、关节角速度、末端执行器在操作空间中的位姿x以及位姿相对于时间的导数组成。Training dataset for the above deep kinematics model , the training data set consists of the joint angle q and joint angular velocity of the robot , the position x of the end effector in the operating space and the derivative of the position with respect to time composition.

具体的，上述深度运动学模型的训练数据集是通过先建立用有限傅里叶级数描述的随机激励轨迹模型，由随机激励轨迹模型给出的随机激励轨迹作为期望运动轨迹控制所述机械臂，测量并收集所述机械臂的实际运动轨迹得到的训练数据集。Specifically, the training data set of the above-mentioned deep kinematic model is obtained by first establishing a random excitation trajectory model described by a finite Fourier series, using the random excitation trajectory given by the random excitation trajectory model as the expected motion trajectory to control the robotic arm, and measuring and collecting the actual motion trajectory of the robotic arm.

优选的，按以下方式建立用有限傅里叶级数描述的随机激励轨迹模型，包括：Preferably, a random excitation trajectory model described by a finite Fourier series is established in the following manner, including:

； ;

设置该深度运动学模型训练时的损失函数为：The loss function for training the deep kinematics model is set as:

； ;

其次，通过深度学习建立对应于机械臂的深度动力学模型，包括：Secondly, a deep dynamics model corresponding to the robotic arm is established through deep learning, including:

机械臂的正向动力学模型和逆向动力学模型分别表示为：The forward dynamics model and inverse dynamics model of the robot arm are expressed as follows:

； ;

其中，分别是机械臂的关节角度、角速度和角加速度，为作用在机械臂关节上的力和力矩。in, They are the joint angle, angular velocity and angular acceleration of the robot arm, are the forces and moments acting on the joints of the robot arm.

基于拉格朗日力学推导出机械臂的动力学模型从而建立深度动力学模型。The dynamic model of the robotic arm is derived based on Lagrangian mechanics to establish a deep dynamic model.

拉格朗日函数定义为，其中为机械臂的动能，为机械臂的势能，是机械臂的质量矩阵；上标T表示转置矩阵；The Lagrangian function is defined as ,in is the kinetic energy of the robot arm, is the potential energy of the robot arm, is the mass matrix of the robot; the superscript T indicates the transposed matrix;

为保证机械臂的质量矩阵的正定对称性，对质量矩阵进行Chelesky分解得到，其中，为非负对角线的下三角矩阵，上标T表示转置矩阵；To ensure the mass matrix of the robot arm The positive definite symmetry of the mass matrix Perform Chelesky decomposition to obtain ,in, is a lower triangular matrix with non-negative diagonal, and the superscript T indicates the transposed matrix;

利用深度神经网络学习非负对角线的下三角矩阵和机械臂的势能来拟合拉格朗日函数，结合欧拉-拉格朗日方程，其中，F为广义力和力矩，得出组成机械臂对应的深度动力学模型的深度正向动力学模型与深度逆向动力学模型分别为：Learning non-negative diagonal lower triangular matrices using deep neural networks and the potential energy of the robotic arm To fit the Lagrangian function , combined with the Euler-Lagrange equation , where F is the generalized force and torque, and the deep forward dynamics model and deep inverse dynamics model that constitute the deep dynamics model corresponding to the robotic arm are obtained as follows:

； ;

上述深度动力学模型的训练数据集，该训练数据集由机械臂的关节角度q、关节角速度、关节角加速度与关节力矩组成。Training dataset for the above deep dynamics model , the training data set consists of the joint angle q and joint angular velocity of the robot , joint angular acceleration and joint torque composition.

具体的，上述深度动力学模型的训练数据集是通过先建立用有限傅里叶级数描述的随机激励轨迹模型，由随机激励轨迹模型给出的随机激励轨迹作为期望运动轨迹控制所述机械臂，测量并收集所述机械臂的实际运动轨迹得到的训练数据集。建立用有限傅里叶级数描述的随机激励轨迹模型的方式与前述深度运动学模型中提及的相同，在此不再重复。Specifically, the training data set of the above-mentioned deep dynamics model is obtained by first establishing a random excitation trajectory model described by a finite Fourier series, controlling the robot arm using the random excitation trajectory given by the random excitation trajectory model as the expected motion trajectory, and measuring and collecting the actual motion trajectory of the robot arm. The method of establishing the random excitation trajectory model described by a finite Fourier series is the same as that mentioned in the above-mentioned deep kinematics model, and will not be repeated here.

设置该深度动力学模型的损失函数为：The loss function of the deep dynamics model is set as:

； ;

其中，为深度动力学模型的训练数据集；为深度动力学模型的训练数据集的大小；为机械臂关节角加速度的回归损失，、分别为机械臂的关节角加速度在t时刻的预测值与实际值；为关节力矩的回归损失，、分别为机械臂关节力矩在t时刻的预测值与实际值；是通过数值积分得到的多步预测损失，该损失项可以为后续强化学习中的策略优化过程提高深度动力学模型多步预测准确性为预测的总步数，、分别为机械臂关节角度与关节角速度的在t+1时刻的实际值，和分别为机械臂关节角度与关节角速度在t+1时刻的预测值。in, It is a training dataset for deep dynamics models; is the size of the training dataset for the deep dynamics model; is the angular acceleration of the robot joint Regression loss, , are the predicted value and actual value of the joint angular acceleration of the robot arm at time t; is the joint torque Regression loss, , are the predicted value and actual value of the joint torque of the robot arm at time t respectively; is the multi-step prediction loss obtained by numerical integration. This loss term can improve the multi-step prediction accuracy of the deep dynamics model for the subsequent strategy optimization process in reinforcement learning. is the total number of steps predicted, , are the actual values of the joint angle and joint angular velocity of the robot arm at time t+1, and They are the predicted values of the robot arm joint angle and joint angular velocity at time t+1 respectively.

然后建立机械臂轨迹跟踪任务的马尔科夫决策过程模型，该马尔科夫决策过程模型中的状态转移模型为结合训练好的基于雅可比的深度运动学模型和基于拉格朗日的深度动力学模型构建的模拟所述机械臂运动轨迹的深度转移模型，包括：Then, a Markov decision process model of the robot arm trajectory tracking task is established. The state transition model in the Markov decision process model is a deep transition model for simulating the robot arm motion trajectory constructed by combining the trained Jacobi-based deep kinematic model and the Lagrangian-based deep dynamics model, including:

将机械臂的轨迹跟踪任务建模成一个有限时间折扣离散马尔科夫决策过程；其中，为状态空间，为动作空间，为t时刻的状态，为t时刻的动作，是作为状态转移模型的深度转移模型，为奖励函数，为折扣因子；The robot trajectory tracking task is modeled as a finite-time discounted discrete Markov decision process. ;in, is the state space, is the action space, is the state at time t, is the action at time t, It is a deep transition model as a state transition model. is the reward function, is the discount factor;

按如下方式结合训练好的基于雅可比的深度运动学模型和基于拉格朗日的深度动力学模型构建模拟所述机械臂运动轨迹的深度转移模型，包括：A deep transfer model simulating the motion trajectory of the robotic arm is constructed by combining the trained Jacobi-based deep kinematic model and the Lagrangian-based deep dynamics model as follows, including:

上述的深度转移模型用于模拟机械臂的运动轨迹，为基于模型的强化学习方法提供离线交互数据，来优化控制策略。The above-mentioned deep transfer model is used to simulate the motion trajectory of the robotic arm and provide offline interaction data for the model-based reinforcement learning method to optimize the control strategy.

最后是根据机械臂轨迹跟踪任务的马尔科夫决策过程模型，基于模型的离线强化学习方法，包括：Finally, according to the Markov decision process model of the robot arm trajectory tracking task, the model-based offline reinforcement learning method includes:

分别设置马尔科夫决策过程模型的状态空间、动作空间和奖励函数，结合SoftActor-Critic（SAC)强化学习方法，实现基于模型的离线强化学习方法得出计算力矩控制器的控制参数作为优化控制策略。The state space, action space and reward function of the Markov decision process model are set respectively, and the Soft Actor-Critic (SAC) reinforcement learning method is combined to implement a model-based offline reinforcement learning method to obtain the control parameters of the computational torque controller as the optimization control strategy.

设置机械臂关节空间的轨迹跟踪任务在t时刻的状态为：Set the state of the trajectory tracking task in the robot joint space at time t to:

； ;

其中，、分别为机械臂的关节角度误差和关节角速度误差；为机械臂的关节角度误差的累积值；、和分别为机械臂期望轨迹的关节角度、关节角速度和关节角加速度。in, , They are the joint angle error and joint angular velocity error of the robot arm respectively; is the accumulated value of the joint angle error of the robot arm; , and are the joint angles, joint angular velocities, and joint angular accelerations of the desired trajectory of the robot arm, respectively.

用计算力矩控制器来解决机械臂关节空间的轨迹跟踪任务的动作，该计算力矩控制器的形式为：The computational torque controller is used to solve the motion of the trajectory tracking task in the joint space of the manipulator. The form of the computational torque controller is:

； ;

设置机械臂操作空间的轨迹跟踪任务在t时刻的状态为：Set the state of the trajectory tracking task of the robot operation space at time t to:

； ;

其中，、分别为机械臂末端执行器的位姿误差和速度误差向量；、分别为机械臂末端执行器的位置误差和线速度误差；、分别为机械臂末端执行器的姿态误差和角速度误差；为机械臂末端执行器位姿误差的累积值；、和分别为机械臂末端执行器期望轨迹的位姿、速度和加速度；和分别为机械臂末端执行器的位置和线速度；、和分别为机械臂末端执行器期望轨迹的位置、线速度和线加速度；为表示机械臂末端执行器的姿态的四元数，和分别为该四元数的旋转轴和绕旋转轴旋转的角度；为机械臂末端执行器期望轨迹的姿态的四元数，和分别为该四元数旋转轴和绕旋转轴旋转的角度；和分别为机械臂末端执行器期望轨迹的姿态的四元数对于时间的一阶导数与二阶导数；和分别机械臂末端执行器的角速度与期望轨迹的角速度；上述各参数的下标t表示该参数是t时刻的参数。in, , are the posture error and velocity error vector of the end effector of the robot arm respectively; , They are the position error and linear velocity error of the end effector of the robot arm respectively; , They are the attitude error and angular velocity error of the end effector of the robot arm respectively; is the accumulated value of the position and posture error of the end effector of the robot arm; , and are the position, velocity and acceleration of the desired trajectory of the robot end effector respectively; and are the position and linear velocity of the end effector of the robot arm, respectively; , and are the position, linear velocity and linear acceleration of the desired trajectory of the robot end effector, respectively; is the quaternion representing the posture of the end effector of the robot arm, and are the rotation axis of the quaternion and the angle of rotation around the rotation axis respectively; is the quaternion of the desired trajectory of the robot end effector, and are the rotation axis of the quaternion and the angle of rotation around the rotation axis respectively; and are the first-order derivative and the second-order derivative of the quaternion of the desired trajectory of the robot end effector with respect to time; and The angular velocity of the end effector of the robot arm and the angular velocity of the desired trajectory respectively; the subscript t of the above parameters indicates that the parameter is the parameter at time t.

机械臂操作空间进行轨迹跟踪任务所用的动作设计为如下形式的基于加速度的操作空间计算力矩控制器：The action design used by the robot arm to perform trajectory tracking tasks in the operating space is an acceleration-based operating space calculation torque controller in the following form:

； ;

与关节空间的轨迹跟踪任务设置相同，策略学习输出计算力矩控制器的控制参数而不是具体的关节控制力矩。Similar to the trajectory tracking task setting in joint space, the policy learning output calculates the control parameters of the torque controller instead of the specific joint control torque.

； ;

其中，和分别为不同控制精度的权重，含项用于使策略探索误差快速减小的动作，当误差较大时主要由含项提供奖励值，使得策略探索误差快速减小的动作；含项用于使策略学习提高精度的动作，当误差较小时主要由含项提供奖励值，使得策略学习进一步提高精度的动作；β为调整不同权重所占比重的值，β取值为0.75，通过β调整不同权重所占比重，最终完成对期望轨迹的高精度跟踪任务；、和在机械臂关节空间的轨迹跟踪任务中分别为机械臂的关节角度误差、关节角速度误差与关节角度误差的累积值，在机械臂操作空间的轨迹跟踪任务中分别为机械臂末端执行器的位姿误差、速度误差与位姿误差的累积值。in, and are the weights of different control accuracies, including The term is used to quickly reduce the strategy exploration error. When the error is large, it is mainly composed of The item provides a reward value, which enables the strategy to explore actions with rapidly reduced errors; The term is used to make the strategy learn actions that improve accuracy. When the error is small, it is mainly composed of The term provides a reward value, which enables the strategy to learn actions that further improve accuracy; β is the value for adjusting the proportion of different weights, and β is set to 0.75. By adjusting the proportion of different weights through β, the high-precision tracking task of the desired trajectory is finally completed; , and In the trajectory tracking task of the robot arm joint space, they are the joint angle error, joint angular velocity error and the cumulative value of the joint angle error of the robot arm; in the trajectory tracking task of the robot arm operation space, they are the posture error, velocity error and the cumulative value of the posture error of the robot arm end effector.

使用Soft Actor-Critic强化学习算法，即SAC算法作为本发明的策略优化算法，策略输出计算力矩控制器的控制参数，结合计算力矩控制器计算得到最终的控制力矩，依据最终的控制力矩对机械臂进行控制。The Soft Actor-Critic reinforcement learning algorithm, namely the SAC algorithm, is used as the strategy optimization algorithm of the present invention, and the strategy output calculates the control parameters of the torque controller , combined with the calculated torque controller, the final control torque is calculated, and the robot arm is controlled according to the final control torque.

SAC算法在最大化期望奖励的同时引入最大熵目标，用于平衡探索与优化，提高策略的性能和鲁棒性，其最优策略为：The SAC algorithm introduces the maximum entropy objective while maximizing the expected reward to balance exploration and optimization and improve the performance and robustness of the strategy. The optimal strategy is:

； ;

SAC算法使用一个评论家网络来拟合状态-动作值函数，用于对策略进行评估，并通过以下最小化贝尔曼方程的误差进行更新：，其中，为t+1时刻机械臂轨迹跟踪任务的状态，为t+1时刻控制策略输出的动作；为t时刻的状态-动作值函数；为以控制策略为条件下的期望值；t表示t时刻；为折扣因子，上标t表示折扣因子的t次方；为奖励函数；为t+1时刻的状态和动作为条件下的期望值；为t+1时刻的状态-动作值函数。The SAC algorithm uses a critic network to fit the state-action value function , which is used to evaluate the policy and is updated by minimizing the error of the Bellman equation as follows: ,in, is the state of the robot trajectory tracking task at time t+1, The action output by the control strategy at time t+1; is the state-action value function at time t; To control strategy is the expected value under the condition; t represents the time t; is the discount factor, and the superscript t represents the tth power of the discount factor; is the reward function; is the expected value under the condition of the state and action at time t+1; is the state-action value function at time t+1.

SAC算法的演员网络通过以下最小化策略的KL散度进行更新：The actor network of the SAC algorithm is updated by minimizing the KL divergence of the following strategies:

； ;

其中，为在控制策略分布中的采样；Π为演员网络表示的控制策略的分布；为最小化策略的KL散度；为t时刻机械臂轨迹跟踪任务的状态；是表示机械臂末端执行器的姿态的四元数；为更新前的控制策略；为更新前的状态-动作值函数；是用于归一化分布的配分函数。in, is a sample in the distribution of control strategies; Π is the distribution of control strategies represented by the actor network; To minimize the KL divergence of the strategy; is the state of the robot trajectory tracking task at time t; is the quaternion representing the posture of the end effector of the robot arm; is the control strategy before updating; is the state-action value function before updating; is the partition function used to normalize the distribution.

具体的方法实施过程为：The specific implementation process of the method is:

首先，收集机械臂实际的运动轨迹数据进行深度学习分别建立对应于机械臂的深度运动学模型和深度动力学模型学习；深度运动学模型用于预测机械臂末端执行器的位姿和计算机械臂对应的雅可比矩阵；深度动力学模型用于预测机械臂的关节角度、角速度与角加速度得出机械臂关节空间的状态变化，并在机械臂进行轨迹跟踪任务时获取计算力矩控制器的控制力矩；First, the actual motion trajectory data of the robot is collected for deep learning to establish a deep kinematic model and a deep dynamic model corresponding to the robot. The deep kinematic model is used to predict the position and posture of the end effector of the robot and calculate the Jacobian matrix corresponding to the robot. The deep dynamic model is used to predict the joint angle, angular velocity and angular acceleration of the robot to obtain the state change of the joint space of the robot, and obtain the control torque of the torque controller when the robot performs trajectory tracking tasks.

然后，建立用有限傅里叶级数描述的随机激励轨迹模型，以随机激励轨迹模型给出的随机激励轨迹作为期望运动轨迹控制所述机械臂，测量并收集所述机械臂的实际运动轨迹作为训练数据集，分别对所述步骤S1建立的基于雅可比的深度运动学模型与基于拉格朗日的深度动力学模型进行训练；Then, a random excitation trajectory model described by a finite Fourier series is established, the random excitation trajectory given by the random excitation trajectory model is used as the expected motion trajectory to control the robotic arm, the actual motion trajectory of the robotic arm is measured and collected as a training data set, and the Jacobian-based deep kinematic model and the Lagrangian-based deep dynamics model established in step S1 are trained respectively;

其次，建立机械臂轨迹跟踪任务的马尔科夫决策过程模型，所述马尔科夫决策过程模型中的状态转移模型为结合训练好的基于雅可比的深度运动学模型和基于拉格朗日的深度动力学模型构建的模拟所述机械臂运动轨迹的深度转移模型；Secondly, a Markov decision process model of the robot arm trajectory tracking task is established, wherein the state transition model in the Markov decision process model is a deep transfer model for simulating the robot arm motion trajectory constructed by combining the trained Jacobi-based deep kinematic model and the Lagrangian-based deep dynamics model;

之后，根据所述马尔科夫决策过程模型，通过Soft Actor-Critic强化学习算法离线学习计算力矩控制器的控制参数作为控制策略，收集所述深度转移模型离线与控制策略交互的机械臂模拟运动轨迹数据，对该Soft Actor-Critic强化学习算法的演员网络与评论家网络进行更新，直到得出最优控制策略；Afterwards, according to the Markov decision process model, the control parameters of the torque controller are calculated offline through the Soft Actor-Critic reinforcement learning algorithm as the control strategy, the simulated motion trajectory data of the manipulator that interacts with the control strategy offline in the deep transfer model is collected, and the actor network and the critic network of the Soft Actor-Critic reinforcement learning algorithm are updated until the optimal control strategy is obtained;

最后，由计算力矩控制器根据步骤S4得出的最优控制策略计算得出机械臂的具体控制力矩对所述机械臂进行控制。Finally, the calculated torque controller calculates the specific control torque of the robot arm according to the optimal control strategy obtained in step S4 to control the robot arm.

与已有技术相比，本发明的有益效果体现在：Compared with the prior art, the beneficial effects of the present invention are as follows:

(1)本发明的控制方法，是一种机器人的基于雅可比的深度运动学模型和基于拉格朗日的深度动力学模型的学习方法。与传统方法不同的是，该方法引入了先验知识，如：深度运动学模型是引入了雅克比模型对于机械臂速度关系的先验知识；深度动力学模型是引入了动力学模型的结构（包括先验摩擦力模型）与机械臂系统能量守恒和运动方程约束等物理约束的先验知识，而不是特定机器人自身的信息，通过数据驱动的网络模型来学习对应于机械臂的深度运动学模型与深度动力学模型，无需复杂的建模过程，具有较高的模型精度与泛化性。深度运动学模型用于实时预测机械臂末端执行器的位姿和计算雅可比矩阵。深度动力学模型以非监督的方式学习模型的各个分量，用于预测机器人关节空间的状态变化。深度运动学模型与深度动力学模型组合成的深度转移模型可用于模拟机器人的运动轨迹，后续用于优化控制策略，并与传统计算力矩控制律相结合，以达到更高的控制精度。(1) The control method of the present invention is a learning method for a robot based on a Jacobian deep kinematic model and a Lagrangian deep dynamic model. Different from the traditional method, this method introduces prior knowledge, such as: the deep kinematic model introduces the prior knowledge of the Jacobian model for the velocity relationship of the robot arm; the deep dynamic model introduces the prior knowledge of the structure of the dynamic model (including the prior friction model) and the physical constraints such as the energy conservation and motion equation constraints of the robot arm system, rather than the information of the specific robot itself. The deep kinematic model and deep dynamic model corresponding to the robot arm are learned through a data-driven network model, without the need for a complex modeling process, and have high model accuracy and generalization. The deep kinematic model is used to predict the position and posture of the end effector of the robot arm in real time and calculate the Jacobian matrix. The deep dynamic model learns each component of the model in an unsupervised manner to predict the state change of the robot joint space. The deep transfer model composed of the deep kinematic model and the deep dynamic model can be used to simulate the motion trajectory of the robot, and then used to optimize the control strategy, and combined with the traditional computational torque control law to achieve higher control accuracy.

(2)本发明的控制方法，是一种适用于机器人高精度轨迹跟踪的基于模型的离线强化学习方法。该方法能够在关节空间和操作空间中完成高精度轨迹跟踪任务。通过对状态空间、动作空间和奖励函数的设计，并结合传统的计算力矩控制器，实现了离线轨迹跟踪策略的快速收敛与高精度性能表现，同时保证了控制策略的稳定性和安全性。(2) The control method of the present invention is a model-based offline reinforcement learning method suitable for high-precision trajectory tracking of robots. The method can complete high-precision trajectory tracking tasks in joint space and operation space. By designing the state space, action space and reward function, and combining with the traditional computational torque controller, the rapid convergence and high-precision performance of the offline trajectory tracking strategy are achieved, while ensuring the stability and safety of the control strategy.

本领域普通技术人员可以理解：实现上述实施例方法中的全部或部分流程是可以通过程序来指令相关的硬件来完成，所述的程序可存储于一计算机可读取存储介质中，该程序在执行时，可包括如上述各方法的实施例的流程。其中，所述的存储介质可为磁碟、光盘、只读存储记忆体（Read-Only Memory，ROM）或随机存储记忆体（Random Access Memory，RAM）等。Those skilled in the art can understand that all or part of the processes in the above-mentioned embodiments can be implemented by instructing related hardware through a program, and the program can be stored in a computer-readable storage medium, and when the program is executed, it can include the processes of the embodiments of the above-mentioned methods. The storage medium can be a disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM), etc.

以上所述，仅为本发明较佳的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明披露的技术范围内，可轻易想到的变化或替换，都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应该以权利要求书的保护范围为准。本文背景技术部分公开的信息仅仅旨在加深对本发明的总体背景技术的理解，而不应当被视为承认或以任何形式暗示该信息构成已为本领域技术人员所公知的现有技术。The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can be easily thought of by a technician familiar with the technical field within the technical scope disclosed in the present invention should be included in the protection scope of the present invention. Therefore, the protection scope of the present invention should be based on the protection scope of the claims. The information disclosed in the background technology section of this article is only intended to deepen the understanding of the overall background technology of the present invention, and should not be regarded as an admission or in any form that the information constitutes prior art known to those skilled in the art.

Claims

1. A model-based offline reinforcement learning control method for a robot, characterized in that it is used to control a mechanical arm as a robot, comprising the following steps:

Step S1, establishing a Jacobian-based deep kinematic model and a Lagrangian-based deep dynamic model corresponding to the robotic arm through deep learning; wherein the deep kinematic model is used to predict the position and posture of the end effector of the robotic arm and calculate the Jacobian matrix corresponding to the robotic arm; the deep dynamic model is used to predict the joint angle, angular velocity and angular acceleration of the robotic arm to obtain the state change of the joint space of the robotic arm, and obtain the control torque of the torque controller when the robotic arm performs a trajectory tracking task;

Step S2, establishing a random excitation trajectory model described by a finite Fourier series, controlling the robotic arm with the random excitation trajectory given by the random excitation trajectory model as the expected motion trajectory, measuring and collecting the actual motion trajectory of the robotic arm as a training data set, and respectively training the Jacobian-based deep kinematic model and the Lagrangian-based deep dynamics model established in step S1;

Step S3, establishing a Markov decision process model for the robot arm trajectory tracking task, wherein the state transition model in the Markov decision process model is a deep transfer model for simulating the robot arm motion trajectory constructed by combining the trained Jacobi-based deep kinematic model and the Lagrangian-based deep dynamics model;

Step S4, according to the Markov decision process model, the control parameters of the torque controller are calculated offline through the Soft Actor-Critic reinforcement learning algorithm as the control strategy, the simulated motion trajectory data of the manipulator in which the deep transfer model interacts with the control strategy offline is collected, and the actor network and the critic network of the Soft Actor-Critic reinforcement learning algorithm are updated until the optimal control strategy is obtained;

Step S5, the calculation torque controller calculates the specific control torque of the robot arm according to the optimal control strategy obtained in step S4 to control the robot arm.

2. The model-based offline reinforcement learning control method for robots according to claim 1, characterized in that, in step S1, a Jacobian-based deep kinematic model corresponding to the robot arm is established by deep learning in the following manner, including:

The forward kinematics model of the robotic arm is determined as: ;in, is the position of the end effector of the robot arm in the operating space, is the degree of freedom of the robot's operating space; is the joint angle of the robot arm, is the degree of freedom of the robot joint space;

The velocity relationship between the joints of the robot and the end effector of the robot is expressed by the differential of the kinematic equation f with respect to time: ;in, It is the derivative of the position of the end effector of the robot arm in the operating space with respect to time; is the Jacobian matrix actually corresponding to the robotic arm, is the degree of freedom of the robot's operating space, is the degree of freedom of the robot joint space; is the angular velocity of the robot joint;

Determine the position and posture of the end effector of the robotic arm in the operating space as , where p is the position of the end effector of the robot arm; is the quaternion representing the posture of the end effector of the robot arm, ,in represents the rotation axis of the quaternion, , and are the three components of the rotation axis, represents the angle of rotation around the axis of rotation, Satisfy constraints ; Quaternion The relationship between the time derivative and the angular velocity ω of the end effector of the manipulator in the operating space is for , that is, the relationship for:

;

in, , and are the components of the angular velocity of the end effector of the robot arm in the x-axis, y-axis and z-axis respectively;

A fully connected deep neural network is established to learn the forward kinematics model of the robotic arm, and the corresponding deep kinematics model of the robotic arm is obtained as follows: ,in, are the network parameters of the deep kinematics model; is the kinematic equation derived from deep learning; q is the joint angle of the robot arm;

The chain rule is applied to the derivatives of each layer in the deep kinematic model, and the derivatives of the output of the deep neural network with respect to the input are iteratively calculated to recover and learn the Jacobian matrix. The learned Jacobian matrix After the following calculation, it is expressed as:

;

in, are the network parameters of the deep kinematics model; The partial derivatives of the kinematic equations derived from deep learning for the joint angles; is the i-th network layer of the deep kinematics model, , ; , , The weights of the deep kinematics model applied to the input of the previous layer of the deep neural network; for deviation; and They are the nonlinear activation function and its derivative of the deep neural network respectively.

3. The model-based offline reinforcement learning control method for robots according to claim 2, characterized in that the training data set of the deep kinematic model , where q is the joint angle of the robot arm; is the joint angular velocity of the robot arm, x is the position of the end effector of the robot arm in the operation space; is the derivative of the position of the end effector of the robot arm in the operating space with respect to time;

The loss function during the training of the deep kinematics model is:

;

in, are the network parameters of the deep kinematics model; It is a training dataset for deep kinematics models; is the size of the training dataset; is the generalized inverse of the Jacobian matrix; the first loss term is the mean square error between the actual position and predicted position of the robot in the operating space; the second loss term is the mean square error between the actual speed of the robot's operating space and the predicted speed; the last loss term It is the Jacobian matrix that makes the deep neural network more accurate by fitting the angular velocity of the robot arm joints.

4. The model-based offline reinforcement learning control method for robots according to any one of claims 1 to 3, characterized in that, in step S1, a Lagrangian-based deep dynamics model corresponding to the robot arm is established by deep learning in the following manner, including:

The forward dynamics model and the inverse dynamics model of the robot arm are determined as follows:

;

in, They are the joint angle, joint angular velocity and joint angular acceleration of the robot arm; is the torque acting on the joint of the robot arm;

The generalized coordinates of the robot are selected as the joint angles of the robot, and the Lagrangian function is defined as ,in, is the kinetic energy of the robotic arm; is the potential energy of the robot arm; is the mass matrix of the robot; the superscript T indicates the transposed matrix;

To ensure the mass matrix of the robot arm The positive definite symmetry of the mass matrix Perform Chelesky decomposition to obtain ,in, is a non-negative diagonal lower triangular matrix, the superscript T represents the transposed matrix, and the non-negative diagonal lower triangular matrix is learned using a deep neural network and the potential energy of the robotic arm To fit the Lagrangian function , combined with the Euler-Lagrange equation , where F is the generalized force and torque, and the deep forward dynamics model and deep inverse dynamics model that constitute the deep dynamics model corresponding to the robotic arm are obtained as follows:

;

in, , and They are the joint angle, joint angular velocity and joint angular acceleration of the robot arm respectively; is the mass matrix of the robot; is the derivative of the mass matrix of the robot with respect to time; are the Coriolis force and the centripetal force; is a conservative force including gravity and spring force; Output torque for the joints of the robot arm; is the predicted value of the joint friction of the robot arm. The joint friction of the robot arm is obtained by introducing a priori friction model of the robot arm composed of Coulomb friction, viscous friction and Stribeck friction. The priori friction model of the joint of the robot arm is:

;

in, is the Coulomb friction force; is the viscous friction coefficient; is the maximum static friction; υ and δ are the correlation coefficients of the Stribeck friction model respectively; is the angular velocity of the i-th joint of the robot.

5. The model-based offline reinforcement learning control method for robots according to claim 4, characterized in that the training data set of the deep dynamics model , where q is the joint angle of the robot arm; is the joint angular velocity of the robot arm; is the joint angular acceleration of the robot arm; is the joint torque of the robot arm;

The loss function of the deep dynamics model is:

;

in, It is a training dataset for deep dynamics models; is the size of the training dataset for the deep dynamics model; is the angular acceleration of the robot joint Regression loss, , are the predicted value and actual value of the joint angular acceleration of the robot arm at time t; is the joint torque Regression loss, , are the predicted value and actual value of the joint torque of the robot arm at time t respectively; is the multi-step prediction loss obtained by numerical integration, is the total number of steps predicted, , are the actual values of the joint angle and joint angular velocity of the robot arm at time t+1, and They are the predicted values of the robot arm joint angle and joint angular velocity at time t+1 respectively.

6. The model-based offline reinforcement learning control method for robots according to any one of claims 1 to 3, characterized in that, in step S2, a random excitation trajectory model described by a finite Fourier series is established in the following manner, including:

For joint i of the robot at time t, the random excitation trajectory model Defined as:

;

in, and are the amplitude of cosine and the amplitude of sine respectively; is the frequency of the cosine and the frequency of the sine; is the offset of the joint angle of the robot arm. The amplitude of the sine, the amplitude and frequency of the cosine and the offset of the joint angle are randomly selected to ensure that the joint angle, joint angular velocity and joint angular acceleration of the robot arm are within a safe range. is the number of Fourier series, randomly selected from 1 to 3, is the variable during accumulation, and t is the time value at time t.

7. The model-based offline reinforcement learning control method for robots according to any one of claims 1 to 3, characterized in that, in the step S3, a Markov decision process model of the robot trajectory tracking task is established in the following manner, and the state transition model of the Markov decision process model is a deep transition model for simulating the motion trajectory of the robot arm constructed by combining a trained Jacobi-based deep kinematic model and a Lagrangian-based deep dynamics model, comprising:

Modeling the robot's trajectory tracking task as a finite-time discounted discrete Markov decision process ;in, is the state space, is the state at time t; is the action space; is the action at time t; It is a deep transition model as a state transition model; is the reward function; is the discount factor;

A deep transfer model simulating the motion trajectory of the robotic arm is constructed by combining the trained Jacobian-based deep kinematic model and the Lagrangian-based deep dynamics model in the following manner, including:

According to the joint angle and joint angular velocity of the robot at time t Joint torque with the robot arm The predicted values of the joint angle and joint angular velocity of the manipulator at time t+1 are obtained by using the fourth-order Runge-Kutta numerical integration method combined with the Lagrangian-based deep dynamics model. ;

The Jacobian-based deep kinematic model is used to predict the joint angles and joint angular velocities of the robot at time t+1. The predicted values of the position and velocity of the end effector of the robot arm at time t+1 are obtained as input ;

According to the predicted values of the joint angle, joint angular velocity of the robot arm and the position and velocity of the end effector at time t+1 , combined with the expected trajectory of the robot at time t+1, the error value required to complete the trajectory tracking task is calculated to form the state of the robot trajectory tracking task at time t+1 ;

Use the robot arm trajectory at time t+1 to track the status of the task Building a deep transfer model , in this deep transfer model and are the states of the robot arm trajectory tracking task at time t and time t+1, respectively. is the action output by the control strategy at time t.

8. The model-based offline reinforcement learning control method for robots according to any one of claims 1 to 3, characterized in that in the step S4, the control parameters of the torque controller are calculated offline by the Soft Actor-Critic reinforcement learning algorithm as the control strategy according to the Markov decision process model, the simulated motion trajectory data of the robot arm that interacts with the control strategy offline by the deep transfer model is collected, and the actor network and the critic network of the Soft Actor-Critic reinforcement learning algorithm are updated until the optimal control strategy is obtained, including:

The state space and action space of the Markov decision process model are respectively set according to the trajectory tracking task performed by the robot arm in the joint space and the trajectory tracking task performed in the operation space, and the model-based offline reinforcement learning is performed through the Soft Actor-Critic reinforcement learning algorithm, and the control parameters of the torque controller are output and calculated as the control strategy;

Collect the simulated motion trajectory data of the robot arm that interacts with the control strategy offline , the actor network and critic network of the Soft Actor-Critic reinforcement learning algorithm are updated until the optimal control strategy is obtained.

9. The model-based offline reinforcement learning control method for robots according to claim 8 is characterized in that, in the method, the state space and action space of the Markov decision process model are respectively set according to the trajectory tracking task performed by the robot arm in the joint space and the trajectory tracking task performed in the operation space in the following manner, including:

Set the state of the robot arm joint space for trajectory tracking task at time t for:

;

in, , They are the joint angle error and joint angular velocity error of the robot arm respectively; is the accumulated value of the joint angle error of the robot arm; , and are the joint angle, joint angular velocity, and joint angular acceleration of the desired trajectory of the robot arm, respectively;

The action used by the robot arm joint space to perform trajectory tracking tasks The computational torque controller is designed as follows:

;

in, and are the joint angles and joint angular velocities of the robot arm, respectively; is the mass matrix of the robot; are the Coriolis force and the centripetal force; is a conservative force including gravity and spring force; is the joint friction of the robot arm; , and They are the joint angle error, joint angular velocity error and cumulative value of joint angle error of the robot arm respectively; , They are respectively used to calculate the control parameters of the torque controller;

Set the state of the trajectory tracking task in the robot operation space at time t for:

;

in, , are the posture error and velocity error vector of the end effector of the robot arm respectively; , They are the position error and linear velocity error of the end effector of the robot arm respectively; , They are the attitude error and angular velocity error of the end effector of the robot arm respectively; is the accumulated value of the position error of the end effector of the robot arm; , and are the position, velocity and acceleration of the desired trajectory of the robot end effector respectively; and are the position and linear velocity of the end effector of the robot arm, respectively; , and are the position, linear velocity and linear acceleration of the desired trajectory of the robot end effector, respectively; is the quaternion representing the posture of the end effector of the robot arm, and are the rotation axis of the quaternion and the angle of rotation around the rotation axis respectively; is the quaternion of the desired trajectory of the robot end effector, and are the rotation axis of the quaternion and the angle of rotation around the rotation axis respectively; and are the first-order derivative and the second-order derivative of the quaternion of the desired trajectory of the robot end effector with respect to time; and The angular velocity of the end effector of the robot arm and the angular velocity of the desired trajectory respectively; the subscript t of the above parameters indicates that the parameter is the parameter at time t;

The action used by the robot to perform trajectory tracking tasks in the operating space is The acceleration-based operating space calculation torque controller is designed as follows:

;

in, and are the joint angles and joint angular velocities of the robot arm, respectively; is the mass matrix of the robot; are the Coriolis force and the centripetal force; is a conservative force including gravity and spring force; is the joint friction of the robot arm; is the reference acceleration of the manipulator in the operating space, , are the posture error and velocity error vector of the end effector of the robot arm respectively; are the network parameters of the deep kinematics model; is the generalized inverse matrix of the Jacobian matrix of the robot; , Calculate the control parameters of the torque controller and select the control torque of the null space The form is:

;

in, and are the joint angles and joint angular velocities of the robot arm, respectively; is the initial value of the robot arm joint angle; is the identity matrix; are the network parameters of the deep kinematics model; is the generalized inverse matrix of the Jacobian matrix of the robot; is the Jacobian matrix of the robot; , They are respectively used to calculate the control parameters of the torque controller;

The reward function is set as a segmented reward function, which is:

;

in, and are the weights of different control accuracies, including The term is used to quickly reduce the strategy exploration error, including The term is used to make the strategy learn actions that improve accuracy; β is the value for adjusting the proportion of different weights, and the value of β is 0.75; , and In the trajectory tracking task of the robot arm joint space, they are the joint angle error, joint angular velocity error and the cumulative value of the joint angle error of the robot arm; in the trajectory tracking task of the robot arm operation space, they are the posture error, velocity error and cumulative value of the posture error of the robot arm end effector;

The actor network and critic network of the Soft Actor-Critic reinforcement learning algorithm are updated as follows:

The Soft Actor-Critic reinforcement learning algorithm fits the state-action value function through the critic network To evaluate the strategy, update it by minimizing the error of the Bellman equation as follows: ,in, is the state of the robot trajectory tracking task at time t+1, The action output by the control strategy at time t+1; is the state-action value function at time t; To control strategy is the expected value under the condition; t represents the time t; is the discount factor, the superscript t represents the tth power of the discount factor, and t is the time value at time t; is the reward function; is the expected value under the condition of the state and action at time t+1; is the state-action value function at time t+1;

The actor network of the Soft Actor-Critic reinforcement learning algorithm is updated by minimizing the KL divergence of the policy as follows:

;

in, is a sample in the distribution of control strategies; Π is the distribution of control strategies represented by the actor network; To minimize the KL divergence of the strategy; is the state of the robot trajectory tracking task at time t; is the quaternion representing the posture of the end effector of the robot arm; is the control strategy before updating; is the state-action value function before updating; is the partition function used to normalize the distribution;

The optimal control strategy derived from the Soft Actor-Critic reinforcement learning algorithm is to introduce the maximum entropy objective while maximizing the expected reward. for:

;

Where T is the length of the robot's motion trajectory; is the expected value under the condition of state and action at time t; is the state of the robot trajectory tracking task at time t; is the action of the control strategy output at time t; is the reward function; is the discount factor at time t; is the entropy of the control strategy; is the regularization coefficient of entropy, which uses adaptive updating to adjust the proportion of entropy in the objective function to control the randomness of the strategy.

10. The model-based robot offline reinforcement learning control method according to claim 8, characterized in that in step S5, the control parameters of the calculated torque controller output by the calculated torque controller according to the optimal control strategy obtained in step S4 are , combined with the calculated torque controller, the specific control torque of the robotic arm is calculated.