CN102402712B

CN102402712B - A Neural Network-Based Initialization Method for Robot Reinforcement Learning

Info

Publication number: CN102402712B
Application number: CN201110255530.7A
Authority: CN
Inventors: 李贻斌; 宋勇; 李彩虹; 李彬; 荣学文
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2011-08-31
Filing date: 2011-08-31
Publication date: 2014-03-05
Anticipated expiration: 2031-08-31
Also published as: CN102402712A

Abstract

The invention proposes a neural network-based initialization method for robot reinforcement learning. The neural network has the same topological structure as the robot workspace, and each neuron corresponds to a discrete state of the state space. First, the neural network is evolved according to the known part of the environmental information until it reaches an equilibrium state. At this time, the output value of each neuron represents the maximum cumulative return that the corresponding state can obtain by following the optimal strategy. Then define the initial value of the Q function as the immediate return of the current state plus the maximum discounted cumulative return that the subsequent state can obtain by following the optimal strategy. The neural network maps the known environmental information into the initial value of the Q function, thereby integrating the prior knowledge into the robot learning system and improving the learning ability of the robot in the initial stage of reinforcement learning. Compared with the traditional Q learning algorithm, the present invention can effectively Improve the learning efficiency in the initial stage and speed up the algorithm convergence speed.

Description

A Neural Network-Based Initialization Method for Robot Reinforcement Learning

技术领域 technical field

本发明涉及将先验知识融入到移动机器人学习系统中，对机器人强化学习过程中Q值初始化的方法，属于机器学习技术领域。The invention relates to a method for integrating prior knowledge into a learning system of a mobile robot and for initializing a Q value in the robot reinforcement learning process, and belongs to the technical field of machine learning.

背景技术 Background technique

随着机器人应用领域的不断拓展，机器人所面临的任务也越来越复杂，尽管很多情况下研究人员可以对机器人可能执行的重复行为进行预编程，但为实现整体的期望行为而进行行为设计变得越来越困难，设计人员往往不可能事先对机器人的所有行为做出合理的预测。因此，能够感知环境的自治机器人必须能够通过与环境的交互在线学习获得新的行为，使得机器人能够根据特定的任务选择能达到目标的最优动作。With the continuous expansion of the application field of robots, the tasks faced by robots are becoming more and more complex. Although researchers can pre-program the repetitive behaviors that robots may perform in many cases, the behavior design changes to achieve the overall desired behavior It is becoming more and more difficult for designers to make reasonable predictions about all the behaviors of robots in advance. Therefore, an autonomous robot capable of perceiving the environment must be able to learn new behaviors online by interacting with the environment, so that the robot can choose the optimal action to achieve the goal according to a specific task.

强化学习利用类似于人类思维中的试错(trial-and-error)的方法来发现最优行为策略，目前已经在机器人行为学习方面展现出了良好的学习性能。Q学习算法是求解信息不完全Markov决策问题的一种强化学习方法，根据环境状态和上一步学习获得的立即回报，修改从状态到动作的映射策略，以使行为从环境中获得的累积回报值最大，从而获得最优行为策略。标准Q学习算法一般将Q值初始化为Q或随机数，机器人没有对环境的先验知识，学习的初始阶段只能随机地选择动作，因此，在复杂环境中算法收敛速度较慢。为了提高算法收敛速度，研究人员提出了许多改进Q学习的方法，提高算法学习效率，改善学习性能。Reinforcement learning uses a trial-and-error method similar to human thinking to discover the optimal behavior strategy, and has shown good learning performance in robot behavior learning. The Q-learning algorithm is a reinforcement learning method for solving Markov decision-making problems with incomplete information. According to the environment state and the immediate return obtained from the previous step of learning, the mapping strategy from the state to the action is modified so that the cumulative return value obtained by the behavior from the environment The maximum, so as to obtain the optimal behavior policy. The standard Q-learning algorithm generally initializes the Q value to Q or a random number. The robot has no prior knowledge of the environment, and the initial stage of learning can only choose actions randomly. Therefore, the algorithm converges slowly in complex environments. In order to increase the convergence speed of the algorithm, researchers have proposed many methods to improve Q-learning, improve the learning efficiency of the algorithm, and improve the learning performance.

通常情况下，加速Q学习收敛速度的方法主要包括两个方面：一种方法是设计合适的回报函数，另一种方法是合理初始化Q函数。目前，研究人员已经提出了许多改进的Q学习算法，使机器人在强化学习的过程中能够获得更加有效的回报，主要包括：关联Q学习算法、惰性Q学习算法、贝叶斯Q学习算法等。其主要目的就是将对于机器人有价值的隐含信息融入到回报函数中，从而加速算法收敛速度。关联Q学习将当前回报与过去时刻的立即回报进行比较，选择回报值更大的动作，通过关联回报方法能够改善系统的学习能力，减少获得最优值所需要的迭代步数。惰性Q学习的目标是提供一种预测状态立即回报的方法，学习过程中利用信息延迟原则，在必要的情况下对新的目标进行预测，动作比较器检查每一种情形的期望回报，然后选择期望回报最大的动作执行。贝叶斯Q学习利用概率分布描述机器人状态-动作对Q值的不确定性估计，学习过程中需要考虑前一时刻Q值的分布，并利用机器人学习到的经验对先前的分布进行更新，利用贝叶斯变量表示当前状态的最大累积回报，贝叶斯方法从本质上改进了Q学习的探索策略，改善了Q学习的性能。Usually, the method to accelerate the convergence speed of Q-learning mainly includes two aspects: one method is to design an appropriate reward function, and the other method is to initialize the Q function reasonably. At present, researchers have proposed many improved Q-learning algorithms to enable robots to obtain more effective returns in the process of reinforcement learning, mainly including: associative Q-learning algorithm, lazy Q-learning algorithm, Bayesian Q-learning algorithm, etc. Its main purpose is to integrate the implicit information valuable to the robot into the reward function, thereby accelerating the convergence speed of the algorithm. Associative Q-learning compares the current reward with the immediate reward of the past moment, and selects the action with a greater reward value. The learning ability of the system can be improved through the method of associating rewards, and the number of iteration steps required to obtain the optimal value can be reduced. The goal of lazy Q-learning is to provide a method to predict the immediate reward of the state. The principle of information delay is used in the learning process to predict new targets if necessary. The action comparator checks the expected reward of each situation and chooses Action execution with maximum expected reward. Bayesian Q-learning uses the probability distribution to describe the uncertainty estimation of the robot state-action on the Q value. In the learning process, the distribution of the Q value at the previous moment needs to be considered, and the previous distribution is updated using the experience learned by the robot. The Bayesian variable represents the maximum cumulative return of the current state, and the Bayesian method essentially improves the exploration strategy of Q-learning and improves the performance of Q-learning.

由于标准强化学习中强化信号都是由状态值函数计算得到的标量值，无法将人的知识形态和行为模式融入到学习系统中。而在机器人学习过程中，人往往具有相关领域的经验和知识，因此，在学习过程中将人的认知和智能以强化信号的形式反馈给机器人，能够减小状态空间维数，加快算法收敛速度。针对标准强化学习在人机交互过程中存在的问题，Thomaz等在机器人强化学习过程中由人实时地给出外部强化信号，人根据自身经验调整训练行为，引导机器人进行前瞻性探索。Arsenio提出了一种对训练数据进行在线、自动标注的学习策略，在人机交互过程中通过触发特定的事件获得训练数据，从而将施教者嵌入到强化学习的反馈回路。Mirza等提出了基于交互历史的体系结构，机器人能够利用与人进行社会性交互的历史经验进行强化学习，使机器人能够在与人进行的简单游戏中逐渐获得合适的行为。Since the reinforcement signals in standard reinforcement learning are all scalar values calculated by the state value function, it is impossible to integrate human knowledge and behavior patterns into the learning system. In the process of robot learning, people often have experience and knowledge in related fields. Therefore, feeding back human cognition and intelligence to the robot in the form of strengthened signals during the learning process can reduce the dimension of the state space and speed up the convergence of the algorithm. speed. In view of the problems existing in the human-computer interaction process of standard reinforcement learning, Thomaz et al. gave external reinforcement signals in real time during the robot reinforcement learning process, and humans adjusted the training behavior according to their own experience to guide the robot to perform forward-looking exploration. Arsenio proposed a learning strategy for online and automatic labeling of training data. During the human-computer interaction process, specific events are triggered to obtain training data, thereby embedding the instructor into the feedback loop of reinforcement learning. Mirza et al. proposed an architecture based on interaction history. Robots can use the historical experience of social interaction with humans for reinforcement learning, enabling robots to gradually acquire appropriate behaviors in simple games with humans.

另一种改善Q学习算法性能的方法就是将先验知识融入到学习系统中，对Q值进行初始化。目前，对Q值进行初始化的方法主要包括近似函数法、模糊规则法、势函数法等。近似函数法利用神经网络等智能系统逼近最优值函数，将先验知识映射成为回报函数值，使机器人在整个状态空间的子集上进行学习，从而能够加快算法收敛速度。模糊规则法根据初始环境信息建立模糊规则库，然后利用模糊逻辑对Q值进行初始化。利用这种方法建立的模糊规则都是根据环境信息人为设定的，往往不能客观地反映机器人的环境状态，造成算法不稳定。势函数法在整个状态空间定义相应的状态势函数，每一点势能值对应于状态空间中某一离散状态值，然后利用状态势函数对Q值进行初始化，学习系统的Q值可以表示为初始值加上每次迭代的改变量。在机器人的各种行为当中，机器人必须遵守一系列的行为准则，机器人通过认知与交互作用涌现出相应的行为与智能，机器人强化学习Q值初始化就是要将先验知识映射成为相应的机器人行为。因此，如何获得先验知识的规则化表达形式，特别是实现领域专家的经验与常识知识的机器推理，将人的认知和智能转化为机器的计算和推理的人机智能融合技术是机器人行为学习急需解决的问题。Another way to improve the performance of Q-learning algorithm is to integrate prior knowledge into the learning system and initialize the Q-value. At present, the methods for initializing Q value mainly include approximate function method, fuzzy rule method, potential function method and so on. The approximate function method uses intelligent systems such as neural networks to approximate the optimal value function, maps prior knowledge into reward function values, and enables the robot to learn on a subset of the entire state space, thereby speeding up the convergence of the algorithm. The fuzzy rule method establishes a fuzzy rule base according to the initial environmental information, and then uses fuzzy logic to initialize the Q value. The fuzzy rules established by this method are all artificially set according to the environmental information, which often cannot objectively reflect the environmental state of the robot, resulting in the instability of the algorithm. The potential function method defines the corresponding state potential function in the entire state space. Each potential energy value corresponds to a discrete state value in the state space, and then uses the state potential function to initialize the Q value. The Q value of the learning system can be expressed as the initial value Plus the amount of change for each iteration. Among the various behaviors of the robot, the robot must abide by a series of codes of conduct. The robot emerges with corresponding behavior and intelligence through cognition and interaction. The Q value initialization of the robot reinforcement learning is to map the prior knowledge into the corresponding robot behavior. . Therefore, how to obtain the regular expression form of prior knowledge, especially the machine reasoning of field experts' experience and common sense knowledge, and the fusion technology of man-machine intelligence that transforms human cognition and intelligence into machine calculation and reasoning are robot behaviors. Learn about urgent problems to solve.

发明内容 Contents of the invention

针对现有机器人强化学习技术的研究现状及存在的不足，本发明提出一种能够有效提高初始阶段的学习效率、加快收敛速度的基于神经网络的机器人强化学习初始化方法，该方法通过Q值初始化能够将先验知识融入到学习系统中，对机器人初始阶段的学习进行优化，从而为机器人提供一个较好的学习基础。Aiming at the research status and existing deficiencies of the existing robot reinforcement learning technology, the present invention proposes a neural network-based robot reinforcement learning initialization method that can effectively improve the learning efficiency in the initial stage and accelerate the convergence speed. The method can be initialized by Q value Integrating the prior knowledge into the learning system optimizes the learning of the initial stage of the robot, thus providing a better learning basis for the robot.

本发明的基于神经网络的机器人强化学习初始化方法，神经网络与机器人工作空间具有相同的拓扑结构，每一个神经元对应于状态空间中的一个离散的状态。首先根据已知的部分环境信息对神经网络进行演化，直到达到平衡状态，这时每个神经元输出值就代表其所对应状态能够获得的最大累积回报，然后将当前状态的立即回报加上后继状态遵循最优策略获得的最大折算累积回报(最大累积回报乘以折算因子)，即可对所有状态-动作对的Q(s，a)设定合理的初始值，通过Q值初始化能够将先验知识融入到学习系统中，对机器人初始阶段的学习进行优化，从而为机器人提供一个较好的学习基础；具体包括以下步骤：In the neural network-based robot reinforcement learning initialization method of the present invention, the neural network has the same topology as the robot workspace, and each neuron corresponds to a discrete state in the state space. First, the neural network is evolved according to the known partial environmental information until it reaches an equilibrium state. At this time, the output value of each neuron represents the maximum cumulative return that its corresponding state can obtain, and then the immediate return of the current state is added to the subsequent The state follows the optimal strategy to obtain the maximum discounted cumulative return (the maximum cumulative return multiplied by the conversion factor), which can set a reasonable initial value for Q(s, a) of all state-action pairs. Through the initialization of the Q value, the previous Integrate empirical knowledge into the learning system, and optimize the learning of the robot in the initial stage, so as to provide a better learning basis for the robot; specifically, the following steps are included:

(1)建立神经网络模型(1) Establish a neural network model

神经网络与机器人工作的结构空间具有相同的拓扑结构，每个神经元只与其局部邻域内的神经元相连接，连接形式都相同，连接权都相等，神经元之间信息的传播是双向的，神经网络具有高度并行的体系结构，每个神经元对应于机器人工作空间的一个离散的状态，整个神经网络由N×N个神经元组成二维拓扑结构，神经网络在演化过程中根据每一个离散状态的输入更新其邻域内的状态，直到神经网络达到平衡状态，达到平衡状态时，神经网络中神经元输出值就形成一个单峰值的曲面，曲面上每一点的值就表示所对应状态能够获得的最大累积回报；The neural network has the same topology as the structural space where the robot works. Each neuron is only connected to the neurons in its local neighborhood. The connection forms are the same, the connection weights are equal, and the information transmission between neurons is bidirectional. The neural network has a highly parallel architecture, and each neuron corresponds to a discrete state of the robot's workspace. The entire neural network consists of N×N neurons in a two-dimensional topology, and the neural network evolves according to each discrete The input of the state updates the state in its neighborhood until the neural network reaches the equilibrium state. When the equilibrium state is reached, the output value of the neuron in the neural network forms a single-peak surface, and the value of each point on the surface indicates that the corresponding state can be obtained. The maximum cumulative return of ;

(2)设计回报函数(2) Design return function

在学习过程中，机器人能够在4个方向上移动，在任意状态选择上下左右4个动作，机器人根据当前状态选择动作，如果该动作使机器人到达目标则获得的立即回报为1，如果机器人与障碍物或其他机器人发生碰撞则获得立即回报为-0.2，如果机器人在自由空间移动则获得的立即回报为-0.1；During the learning process, the robot can move in 4 directions. In any state, choose 4 actions: up, down, left, and right. The robot chooses an action according to the current state. If the action makes the robot reach the goal, the immediate reward is 1. If the robot collides with objects or other robots, the immediate reward is -0.2, and if the robot is moving in free space, the immediate reward is -0.1;

(3)计算最大累积回报初始值(3) Calculate the initial value of the maximum cumulative return

神经网络达到平衡状态时，定义每个神经元相对应的状态的最大累积回报V^* _Init(s_i)等于该神经元输出值x_i，其关系公式如下：When the neural network reaches the equilibrium state, define the maximum cumulative reward V ^* _Init (s _i ) corresponding to the state of each neuron is equal to the output value x _i of the neuron, and the relationship formula is as follows:

${V V}_{Init Init}^{* *} (({s the s}_{i i})) &LeftArrow; &LeftArrow; {x x}_{i i},,$

公式中，x_i为神经网络达到平衡状态时第i个神经元的输出值，V^* _Init(s_i)为从状态s_i出发遵循最优策略所能获得的最大累积回报；In the formula, x _i is the output value of the i-th neuron when the neural network reaches the equilibrium state, and V ^* _Init (s _i ) is the maximum cumulative return that can be obtained from the state s _i following the optimal strategy;

(4)Q值初始化(4) Q value initialization

Q(s_i，a)的初始值定义为在状态s_i下选择动作a所获得的立即回报r加上后继状态的最大折算累积回报：The initial value of Q(s _i , a) is defined as the immediate reward r obtained by choosing action a in state s _i plus the maximum discounted cumulative reward of the subsequent state:

${Q Q}_{Init Init} (({s the s}_{i i},, a a)) = = r r + + γ γ {V V}_{Init Init}^{* *} (({s the s}_{j j}))$

公式中，s_j为机器人在状态s_i下选择动作a所产生的后继状态，Q_Init(s_i，a)为状态-动作对(s_i，a)的初始Q值；γ为折算因子，选择γ＝0.95；In the formula, s _j is the subsequent state generated by the robot choosing action a in state s _i , Q _Init (s _i , a) is the initial Q value of the state-action pair (s _i , a); γ is the conversion factor, Select gamma = 0.95;

(5)基于神经网络的机器人强化学习步骤(5) Neural Network-Based Robot Reinforcement Learning Steps

(a)神经网络根据初始化环境信息进行演化，直到达到平衡状态；(a) The neural network evolves according to the initial environment information until it reaches an equilibrium state;

(b)状态s_i获得的最大累积回报的初始值定义为神经元输出值x_i，关系公式如下：(b) The initial value of the maximum cumulative return obtained by state s _i is defined as the neuron output value x _i , and the relationship formula is as follows:

${V V}_{Init Init}^{* *} (({s the s}_{i i})) &LeftArrow; &LeftArrow; {x x}_{i i}$

(c)按照如下规则初始化Q值： $Q_{Init} (s_{i}, a) = r + γ V_{Init}^{*} (s_{j}),$ (c) Initialize the Q value according to the following rules: $Q_{Init} ({the s}_{i}, a) = r + γ V_{Init}^{*} ({the s}_{j}),$

(d)观察当前状态s_t；(d) observe the current state s _t ;

(e)继续在复杂环境中探索，在当前状态s_t下选择一个动作a_t并执行，环境状态更新为新的状态s’_t，并接收立即回报r_t；(e) Continue to explore in the complex environment, select and execute an action a _t in the current state s _t , update the environment state to a new state s' _t , and receive an immediate reward r _t ;

(f)观察新状态s’_t；(f) observe the new state s'_t;

(g)根据以下公式更新表项Q(s_t，a_t)值：(g) Update the value of table item Q(s _t , a _t ) according to the following formula:

${Q Q}_{t t} (({s the s}_{t t},, {a a}_{t t})) = = ((11 - - {α α}_{t t})) {Q Q}_{t t - - 11} (({s the s}_{t t},, {a a}_{t t})) + + {α α}_{t t} (({r r}_{t t} + + γ γ \underset{{a a}_{t t}^{' '}}{arg arg} max max {Q Q}_{t t - - 11} (({s the s}_{t t}^{' '},, {a a}_{t t}^{' '}))))$

公式中，α_t为学习率，取值范围为(0，1)，通常取值为0.5，并随学习过程衰减；Q_t-1(s_t，a_t)和Q_t-1(s’_t，a’_t)分别为状态-动作对(s_t，a_t)和(s’_t，a’_t)在t-1时刻的取值，a’_t为在新状态s’_t下选择的动作；In the formula, α _t is the learning rate, the value range is (0, 1), usually 0.5, and decays with the learning process; Q _t-1 (s _t , a _t ) and Q _t-1 (s' _t , a' _t ) are the values of the state-action pair (s _t , a _t ) and (s' _t , a' _t ) at time t-1 respectively, and a' _t is the value selected under the new state s' _t Actions;

(h)判断机器人是否已经到达目标或者学习系统已经达到设定的最大学习次数，设定的最大学习次数应该保证学习系统在最大学习次数内收敛，如果两者满足其一，则学习结束，否则返回到步骤(d)继续学习。(h) Judging whether the robot has reached the target or the learning system has reached the set maximum number of learning times, the set maximum number of learning times should ensure that the learning system converges within the maximum number of learning times, if the two meet one of them, then the learning ends, otherwise Return to step (d) to continue learning.

本发明通过神经网络将已知环境信息映射成为Q函数初始值，从而将先验知识融入到机器人学习系统中，提高了机器人在强化学习初始阶段的学习能力，与传统Q学习算法相比，能够有效提高初始阶段的学习效率，加快算法收敛速度。The invention maps the known environmental information into the initial value of the Q function through the neural network, thereby integrating prior knowledge into the robot learning system, improving the learning ability of the robot in the initial stage of reinforcement learning, and compared with the traditional Q learning algorithm, it can Effectively improve the learning efficiency in the initial stage and accelerate the convergence speed of the algorithm.

附图说明 Description of drawings

图1是第i个神经元邻域结构示意图。Figure 1 is a schematic diagram of the i-th neuron neighborhood structure.

图2是机器人目标点邻域内神经元输出值示意图。Fig. 2 is a schematic diagram of neuron output values in the neighborhood of the target point of the robot.

图3是最大累积回报初始值V*_Init(s，a)示意图。Figure 3 is a schematic diagram of the initial value V* _Init (s, a) of the maximum cumulative return.

图4是神经网络达到平衡状态时神经元输出值示意图。Fig. 4 is a schematic diagram of neuron output values when the neural network reaches an equilibrium state.

图5是现有Q学习获得的机器人规划路径示意图。FIG. 5 is a schematic diagram of a robot planning path obtained by existing Q-learning.

图6是现有Q学习算法收敛过程示意图。Fig. 6 is a schematic diagram of the convergence process of the existing Q-learning algorithm.

图7是本发明的机器人规划路径示意图。Fig. 7 is a schematic diagram of the robot planning path of the present invention.

图8是本发明的学习收敛过程示意图。Fig. 8 is a schematic diagram of the learning convergence process of the present invention.

具体实施方式 Detailed ways

本发明基于神经网络对机器人强化学习进行初始化，神经网络与机器人工作空间具有相同的拓扑结构，神经网络达到平衡状态时，神经元输出值表示对应状态的最大累积回报，利用当前状态的立即回报与后继状态的最大折算累积回报获得Q函数的初始值。通过Q值初始化能够将先验知识融入到学习系统中，对机器人初始阶段的学习进行优化，从而为机器人提供一个较好的学习基础；具体包括以下步骤：The invention initializes the reinforcement learning of the robot based on the neural network. The neural network and the robot workspace have the same topological structure. When the neural network reaches the equilibrium state, the output value of the neuron represents the maximum cumulative return of the corresponding state, and the immediate return of the current state and the The maximum discounted cumulative reward of the successor state obtains the initial value of the Q function. The prior knowledge can be integrated into the learning system through the Q value initialization, and the learning of the initial stage of the robot can be optimized, so as to provide a better learning basis for the robot; specifically, the following steps are included:

1神经网络模型1 Neural Network Model

神经网络与机器人工作空间具有相同的拓扑结构，每个神经元对应于机器人工作空间的一个离散状态。所有神经元都只与其局部邻域内的神经元相连接，并且其连接形式都相同，整个神经网络由N×N个神经元组成二维拓扑结构。神经网络具有高度并行的体系结构，所有连接权都相等，神经元之间信息的传播是双向的。神经网络在演化过程中根据每一个离散状态的输入更新其邻域内的状态，整个神经网络可以看作一个离散时间动力学系统。The neural network has the same topology as the robot workspace, and each neuron corresponds to a discrete state of the robot workspace. All neurons are only connected to neurons in their local neighborhood, and their connection forms are the same. The whole neural network consists of N×N neurons to form a two-dimensional topology. The neural network has a highly parallel architecture, all connection weights are equal, and the propagation of information between neurons is bidirectional. During the evolution process of the neural network, the state in its neighborhood is updated according to the input of each discrete state, and the whole neural network can be regarded as a discrete-time dynamical system.

神经网络在演化过程中，根据目标点和障碍物位置信息在神经网络拓扑结构中的映射产生神经网络的外部输入，障碍物区域对应的神经元具有负的外部输入，目标点神经元具有正的外部输入。神经网络根据外部输入进行演化，目标点位置正的神经元输出值能够通过神经元的局部连接逐渐衰减地传播到整个状态空间，直到达到平衡状态。S型激活函数保证了目标点位置神经元具有全局最大的正神经元输出值，障碍物区域神经元的输出值则被抑制为零。神经网络达到平衡以后，所有神经元输出值就构成了一个单峰值的曲面，曲面上每一个点的值就代表其所对应状态可获得的最大累积回报。During the evolution of the neural network, the external input of the neural network is generated according to the mapping of the target point and obstacle position information in the neural network topology. The neurons corresponding to the obstacle area have negative external inputs, and the target point neurons have positive external inputs. external input. The neural network evolves according to the external input, and the output value of the neuron at the target point position can be gradually attenuated and propagated to the entire state space through the local connection of the neuron until it reaches an equilibrium state. The S-type activation function ensures that the neuron at the target point has the global maximum positive neuron output value, and the output value of the neuron in the obstacle area is suppressed to zero. After the neural network reaches equilibrium, the output values of all neurons form a single-peak surface, and the value of each point on the surface represents the maximum cumulative return available for its corresponding state.

假设机器人工作空间由20×20个方格组成，神经网络与机器人工作空间具有相同的拓扑结构，也包含20×20个神经元，每个神经元对应于工作空间的一个离散状态。每个神经元都只与其局部邻域内的神经元相连接，其中第i个神经元与其邻域内神经元的连接形式如图1所示。整个神经网络由20×20个神经元组成二维拓扑结构。神经网络具有高度并行的体系结构，所有连接权都相等。在神经网络演化过程中所有神经元既是输入神经元又是输出神经元，神经元之间信息的传播是双向的，整个神经网络可以看作一个离散时间动力学系统。Assuming that the robot workspace consists of 20×20 squares, the neural network has the same topology as the robot workspace, and also contains 20×20 neurons, and each neuron corresponds to a discrete state of the workspace. Each neuron is only connected to neurons in its local neighborhood, and the connection form of the i-th neuron to neurons in its neighborhood is shown in Figure 1. The whole neural network consists of 20×20 neurons to form a two-dimensional topology. Neural networks have a highly parallel architecture where all connections are equally weighted. In the process of neural network evolution, all neurons are both input neurons and output neurons, and the transmission of information between neurons is bidirectional. The entire neural network can be regarded as a discrete-time dynamic system.

神经网络的第i个神经元对应于结构空间的第i个离散状态，则第i个神经元离散时间动力学方程为：The i-th neuron of the neural network corresponds to the i-th discrete state of the structure space, then the discrete-time dynamic equation of the i-th neuron is:

${x x}_{i i} ((t t + + 11)) = = \{\begin{matrix} 11 & ifi ifi = = {i i}_{* *} \\ f f (({Σ Σ}_{j j = = 11}^{N N} {w w}_{ij ij} {x x}_{j j} ((t t)) + + {I I}_{i i} ((t t)))) & otherwise otherwise \end{matrix}$

公式中，i_*为目标神经元索引值，x_i(t)为第i个神经元在t时刻的输出值，N为第i个神经元邻域内的神经元个数，I_i(t)为第i个神经元在t时刻的外部输入，f为激活函数，w_ij为第j个神经元到第i个神经元的连接权，计算公式如下式所示：In the formula, i _* is the index value of the target neuron, x _i (t) is the output value of the i-th neuron at time t, N is the number of neurons in the neighborhood of the i-th neuron, I _i (t) is the external input of the i-th neuron at time t, f is the activation function, w _ij is the connection weight from the j-th neuron to the i-th neuron, and the calculation formula is as follows:

${w w}_{ij ij} = = \{\begin{matrix} {e e}^{- - η η {| | i i - - j j | |}^{22}} & if if | | i i - - j j | | \leq \leq r r \\ 00 & if if | | i i - - j j | | > > r r \end{matrix}$

公式中，|i-j|为结构空间中向量x_i和x_j之间的Euclidian距离，由于每个神经元只与其局部邻域内神经元连接，r取值为1，为了保证神经网络达到平衡状态时神经元输出形成单峰值的曲面，η取值范围为(1，2)，显然w_ij＝w_ji，即w_ij为对称的；神经元激活函数选择S型函数，定义如下：In the formula, |ij| is the Euclidian distance between the vectors x _i and x _j in the structural space. Since each neuron is only connected to neurons in its local neighborhood, r takes a value of 1. In order to ensure that the neural network reaches an equilibrium state The neuron output forms a single-peak surface, and the value range of η is (1, 2), obviously w _ij =w _ji , that is, w _ij is symmetrical; the neuron activation function selects a S-type function, which is defined as follows:

$f f ((x x)) = = \{\begin{matrix} 00 & ifx ifx \leq \leq 00 \\ kx x & if if 00 < < x x < < 11 \\ 11 & ifx ifx &GreaterEqual; &Greater Equal; 11 \end{matrix}$

公式中，k为线性模型的斜率，取值范围为(0，1)，f(x)保证了目标点位置正的神经元输出值能够逐渐衰减地传播到整个状态空间，且目标点具有全局最大的正神经元输出值，障碍物区域神经元的输出值则被抑制为零；第i个神经元的外部输入由目标点和障碍物位置信息在神经网络拓扑结构中的映射产生的，定义如下：In the formula, k is the slope of the linear model, and the value range is (0, 1). f(x) ensures that the output value of the neuron at the positive position of the target point can be gradually attenuated and propagated to the entire state space, and the target point has a global The largest positive neuron output value, the output value of the neuron in the obstacle area is suppressed to zero; the external input of the i-th neuron is generated by the mapping of the target point and obstacle position information in the neural network topology, defined as follows:

${I I}_{i i} ((t t)) = = \{\begin{matrix} V V & if if {x x}_{i i} ((t t)) = = t t arg arg et et \\ - - V V & if if {x x}_{i i} ((t t)) = = obstacle obstacle \\ 00 & otherwise otherwise \end{matrix}$

公式中，V为较大的常数，为了保证目标点神经元具有全局最大的神经元输出值，而障碍物区域神经元具有全局最小的神经元输出值，V的值应该大于神经元的输入总和，取值范围为大于4的实数。In the formula, V is a relatively large constant. In order to ensure that the target point neuron has the global maximum neuron output value, and the obstacle area neuron has the global minimum neuron output value, the value of V should be greater than the sum of neuron input , the value range is a real number greater than 4.

2回报函数设计2 Return function design

在学习过程中，机器人能够在4个方向上移动，在任意状态可以选择上下左右4个动作，机器人根据当前状态选择动作，如果该动作使机器人到达目标则获得的立即回报为1，如果机器人与障碍物或其他机器人发生碰撞则获得立即回报为-0.2，如果机器人在自由空间移动则获得的立即回报为-0.1。During the learning process, the robot can move in 4 directions. In any state, you can choose 4 actions: up, down, left, and right. The robot chooses an action according to the current state. If the action makes the robot reach the goal, the immediate reward is 1. If the robot and An immediate reward of -0.2 is obtained for collisions with obstacles or other robots, and -0.1 if the robot is moving in free space.

3计算最大累积回报初始值3 Calculate the initial value of the maximum cumulative return

根据目标点和障碍物位置信息在神经网络拓扑结构中的映射产生神经网络的外部输入，障碍物区域对应的神经元具有负的外部输入，目标点神经元具有正的外部输入。神经网络根据外部输入进行演化，目标点位置正的神经元输出值能够通过神经元的局部连接逐渐衰减地传播到整个状态空间，直到达到平衡状态。S型激活函数保证了目标点位置神经元具有全局最大的正神经元输出值，障碍物区域神经元的输出值则被抑制为零。神经网络达到平衡状态以后，所有神经元输出值就构成了一个单峰值的曲面，如图2所示，曲面上每一个点的值就代表其所对应状态可获得的最大累积回报。According to the mapping of the target point and obstacle position information in the neural network topology, the external input of the neural network is generated. The neuron corresponding to the obstacle area has a negative external input, and the target point neuron has a positive external input. The neural network evolves according to the external input, and the output value of the neuron at the target point position can be gradually attenuated and propagated to the entire state space through the local connection of the neuron until it reaches an equilibrium state. The S-type activation function ensures that the neuron at the target point has the global maximum positive neuron output value, and the output value of the neuron in the obstacle area is suppressed to zero. After the neural network reaches the equilibrium state, the output values of all neurons form a single-peak surface, as shown in Figure 2, the value of each point on the surface represents the maximum cumulative return available for its corresponding state.

机器人从任意初始状态s_t出发获得的累积回报定义如下：The cumulative reward obtained by the robot starting from any initial state s _t is defined as follows:

${V V}^{π π} (({s the s}_{t t})) = = {r r}_{t t} + + γ γ {r r}_{t t + + 11} + + {γ γ}^{22} {r r}_{t t + + 22} + + Λ Λ = = {Σ Σ}_{i i = = 00}^{\infty \infty} {γ γ}^{i i} {r r}_{t t + + i i}$

上式中，π为控制策略，r为获得的立即回报序列，γ为折算因子，取值范围为(0，1)，这里选择γ＝0.95；则机器人从状态s出发遵循最优策略所获得的最大累积回报V^*(s)计算如下：In the above formula, π is the control strategy, r is the obtained immediate reward sequence, γ is the conversion factor, and the value range is (0, 1). Here, γ=0.95 is selected; then the robot starts from the state s and follows the optimal strategy to obtain The maximum cumulative return V ^* (s) for is calculated as follows:

${V V}^{* *} ((s the s)) = = \underset{π π}{arg arg max max} {V V}^{π π} ((s the s)),, ((&ForAll; &ForAll; s the s))$

${V V}_{Init Init}^{* *} (({s the s}_{i i})) &LeftArrow; &LeftArrow; {x x}_{i i} . .$

公式中，x_i为神经网络达到平衡状态时第i个神经元的输出值，V^* _Init(s_i)为从状态s_i出发遵循最优策略所能获得的最大累积回报。In the formula, _xi is the output value of the i-th neuron when the neural network reaches the equilibrium state, and V ^* _Init (s _i ) is the maximum cumulative reward that can be obtained from the state s _i following the optimal strategy.

4基于神经网络的机器人强化学习4 Neural Network-Based Reinforcement Learning for Robots

4.1传统Q学习算法4.1 Traditional Q-learning algorithm

在马尔科夫决策过程中，机器人通过传感器感知周围环境获知当前状态，并选择当前要执行的动作，环境响应该动作并给出立即回报，并产生后继状态。机器人强化学习的任务就是获得一个最优策略使得机器人从当前状态出发获得最大的折算累积回报。机器人从任意初始状态遵循任意策略π获得的累积回报定义为：In the Markov decision-making process, the robot perceives the surrounding environment through sensors to know the current state, and chooses the action to be performed currently. The environment responds to the action and gives an immediate reward, and generates a subsequent state. The task of robot reinforcement learning is to obtain an optimal strategy to enable the robot to obtain the maximum discounted cumulative return from the current state. The cumulative reward obtained by a robot following any policy π from any initial state is defined as:

${V V}^{π π} (({s the s}_{t t})) &equiv; &equiv; {r r}_{t t} + + γ γ {r r}_{t t + + 11} + + {γ γ}^{22} {r r}_{t t + + 22} + + Λ Λ &equiv; &equiv; {Σ Σ}_{i i = = 00}^{\infty \infty} {γ γ}^{i i} {r r}_{t t + + i i}$

公式中，r_t为t时刻的立即回报，γ为折算因子，取值范围为(0，1)，这里选择γ＝0.95。In the formula, r _t is the immediate return at time t, γ is the conversion factor, and the value range is (0, 1). Here, γ=0.95 is selected.

机器人从状态s出发能够获得最大累积回报的最优策略π*定义如下：The optimal strategy π* for the robot to obtain the maximum cumulative reward starting from the state s is defined as follows:

${π π}^{* *} &equiv; &equiv; \underset{π π}{arg arg max max} {V V}^{π π} ((s the s)),, ((&ForAll; &ForAll; s the s))$

机器人从状态s出发遵循最优策略π*所能够获得的最大累积回报定义为V*(s)，则Q函数的值为当前状态的立即回报加上后继状态的最大折算累积回报，计算公式如下：The maximum cumulative return that the robot can obtain from the state s following the optimal strategy π* is defined as V*(s), then the value of the Q function is the immediate return of the current state plus the maximum discounted cumulative return of the subsequent state, and the calculation formula is as follows :

Q(s，a)≡(1-α_t)Q(s，a)+α_t(r(s，a)+γV*(s′))Q(s, a)≡(1-α _t )Q(s, a)+α _t (r(s, a)+γV*(s′))

公式中，α_t为学习率，取值范围为(0，1)，通常选择α_t初始值为0.5，并随学习次数衰减；V*(s’)与Q(s’，a’)关系式如下：In the formula, α _t is the learning rate, and the value range is (0, 1). Usually, the initial value of α _t is 0.5, and it decays with the number of learning times; the relationship between V*(s') and Q(s', a') The formula is as follows:

$V V * * (({s the s}^{' '})) = = \underset{{a a}^{' '}}{max max Q Q (({s the s}^{' '},, {a a}^{' '}))}$

则Q(s_t，a_t)按照如下规则更新：Then Q(s _t , a _t ) is updated according to the following rules:

${Q Q}_{t t} (({s the s}_{t t},, {a a}_{t t})) = = ((11 - - {α α}_{t t})) {Q Q}_{t t - - 11} (({s the s}_{t t},, {a a}_{t t})) + + {α α}_{t t} (({r r}_{t t} + + \underset{{a a}_{t t}^{' '}}{γ γ arg arg max max} {Q Q}_{t t - - 11} (({s the s}_{t t}^{' '},, {a a}_{t t}^{' '}))))$

公式中，Q_t-1(s_t，a_t)和Q_t-1(s’_t，a’_t)分别为状态-动作对(s_t，a_t)和(s’_t，a’_t)在t-1时刻的取值,a’_t为在新状态s’_t下选择的动作。In the formula, Q _t-1 (s _t , a _t ) and Q _t-1 (s' _t , a' _t ) are state-action pairs (s _t , a _t ) and (s' _t , a' _t ) at time t-1, a' _t is the action selected in the new state s' _t .

4.2Q值初始化4.2 Q value initialization

根据已知环境信息对神经网络进行演化，直到达到平衡状态，这时定义每个离散状态可获得的最大累积回报等于其所对应神经元的输出值。然后将从当前状态执行选定的动作获得的立即回报加上后继状态遵循最优策略获得的最大折算累积回报，即可对所有状态-动作对的Q(s_i，a)设置合理的初始值。Q(s_i，a)的初始值计算公式如下：The neural network is evolved according to the known environmental information until it reaches an equilibrium state. At this time, it is defined that the maximum cumulative reward that can be obtained for each discrete state is equal to the output value of its corresponding neuron. Then add the immediate reward obtained by executing the selected action from the current state to the maximum discounted cumulative reward obtained by the subsequent state following the optimal strategy, and then set a reasonable initial value for Q(s _i , a) of all state-action pairs . The calculation formula of the initial value of Q(s _i , a) is as follows:

公式中，r为在状态s_i下选择动作a获得的立即回报，γ为折算因子，取值范围为(0，1)，这里选择γ＝0.95；s_j为机器人在状态s_i下选择动作a所产生的后继状态，Q_Init(s_i，a)为状态-动作对(s_i，a)的初始Q值；In the formula, r is the immediate reward obtained by choosing action a in state s _i , γ is the conversion factor, and the value range is (0, 1), here choose γ=0.95; s _j is the action selected by the robot in state s _i The subsequent state generated by a, Q _Init (s _i , a) is the initial Q value of the state-action pair (s _i , a);

4.3本发明的基于神经网络的Q学习算法4.3 Q learning algorithm based on neural network of the present invention

(1)神经网络根据初始化环境信息进行演化，直到达到平衡状态。(1) The neural network evolves according to the initial environment information until it reaches an equilibrium state.

(2)利用神经元输出值x_i对状态s_i可获得的最大累积回报进行初始化，关系公式如下：(2) Use the neuron output value _xi to initialize the maximum cumulative reward that can be obtained from state _si , and the relationship formula is as follows:

(3)按照如下规则初始化Q值：(3) Initialize the Q value according to the following rules:

Q_Init(s_i，a)＝r+γV_Init*(s_j)Q _Init (s _i , a)=r+γV _Init *(s _j )

(4)观察当前状态s_t。(4) Observe the current state s _t .

(5)继续在复杂环境中探索，在当前状态s_t下选择一个动作a_t并执行，环境状态更新为新的状态s’_t，并接收立即回报r_t。(5) Continue to explore in the complex environment, select and execute an action a _t in the current state s _t , update the environment state to a new state s' _t , and receive an immediate reward r _t .

(6)观察新状态s’_t。(6) Observe the new state s' _t .

(7)根据以下公式更新表项Q(s_t，a_t)值：(7) Update the value of table item Q(s _t , a _t ) according to the following formula:

(8)判断机器人是否已经到达目标或者学习系统已经达到设定的最大学习次数(设定的最大学习次数能够保证学习系统在最大学习次数内收敛，在本发明的实验环境中最大学习次数设置为300)，如果两者满足其一，则学习结束，否则返回到步骤(4)继续学习。(8) Judging whether the robot has reached the target or the learning system has reached the maximum number of studies set (the maximum number of studies set can ensure that the learning system converges in the maximum number of studies, in the experimental environment of the present invention, the maximum number of studies is set to 300), if the two meet one of them, then the learning ends, otherwise return to step (4) to continue learning.

为了说明机器人强化学习Q值初始化过程，选择机器人目标点邻域进行演示。神经网络达到平衡状态时，邻域内神经元输出值如图3中节点中数值所示，每个节点对应于一个离散状态，每个状态的最大累积回报等于该状态神经元的输出值，红色节点表示目标状态，灰色节点表示障碍物。每个箭头代表一个动作，若机器人导向目标状态G则获得的立即回报为1，若与障碍物或其他机器人发生碰撞则获得的立即回报为-0.2，若机器人在自由空间移动则获得的立即回报为-0.1。γ为折算因子，选择γ＝0.95，根据Q值初始化公式可获得Q函数的初始值，每个状态-动作对的初始化Q值如图4中箭头代表数值所示。初始化完成以后，机器人在任意初始状态下都能够选择恰当的动作，当机器人面临较复杂的环境时，在学习的初始阶段就具有一定的目的性，而不是完全随机地选择动作，从而加快算法收敛速度。In order to illustrate the Q value initialization process of robot reinforcement learning, the robot target point neighborhood is selected for demonstration. When the neural network reaches an equilibrium state, the output values of the neurons in the neighborhood are shown in the nodes in Figure 3. Each node corresponds to a discrete state, and the maximum cumulative return of each state is equal to the output value of the state neurons. The red nodes represents the target state, and gray nodes represent obstacles. Each arrow represents an action. If the robot is directed to the goal state G, the immediate reward obtained is 1, if it collides with obstacles or other robots, the immediate reward obtained is -0.2, and if the robot moves in free space, the immediate reward obtained is is -0.1. γ is the conversion factor, select γ=0.95, and the initial value of the Q function can be obtained according to the Q value initialization formula. The initial Q value of each state-action pair is shown in Figure 4 as the arrow represents the value. After the initialization is completed, the robot can choose the appropriate action in any initial state. When the robot faces a more complex environment, it has a certain purpose in the initial stage of learning, instead of choosing actions completely randomly, thereby speeding up the algorithm convergence. speed.

在实验室所建立的移动机器人环境建模和探索软件平台上，进行了仿真实验。图5为现有的机器人强化学习方法获得的机器人规划路径；图6为现有的机器人强化学习算法收敛过程。学习算法经过145次学习以后开始收敛，在学习的初始阶段(如前20次学习)机器人在最大迭代次数内基本都不能到达目标点。这是由于Q值被初始化为0，使得机器人没有任何先验知识，只能随机地选择动作，从而导致学习初始阶段效率较低，算法收敛速度较慢。On the mobile robot environment modeling and exploration software platform established in the laboratory, the simulation experiment was carried out. Fig. 5 is the robot planning path obtained by the existing robot reinforcement learning method; Fig. 6 is the convergence process of the existing robot reinforcement learning algorithm. The learning algorithm starts to converge after 145 times of learning. In the initial stage of learning (such as the first 20 times of learning), the robot basically cannot reach the target point within the maximum number of iterations. This is because the Q value is initialized to 0, so that the robot does not have any prior knowledge and can only choose actions randomly, resulting in low efficiency in the initial stage of learning and slow algorithm convergence.

图7为本发明的机器人规划路径；图8为本发明收敛过程。学习算法在经过76次学习以后开始收敛，而且机器人在学习的初始阶段也基本都能在最大迭代次数之内到达目标点，本发明有效提高了机器人初始阶段的学习效率，明显加快了学习过程的收敛速度。Fig. 7 is the robot planning path of the present invention; Fig. 8 is the convergence process of the present invention. The learning algorithm starts to converge after 76 times of learning, and the robot can basically reach the target point within the maximum number of iterations in the initial stage of learning. The invention effectively improves the learning efficiency of the robot in the initial stage, and obviously accelerates the learning process. convergence speed.

Claims

1. the robot reinforced learning initialization method based on neural network, neural network has identical topological structure with robot working space, each neuron is corresponding to a discrete state in state space, first according to known component environment information, neural network is developed, until reach equilibrium state, at this moment each neuron output value just represents the cumulative maximum return that its corresponding states can obtain, then the return immediately of current state is added to follow-up state follows the maximum conversion accumulation return that optimal strategy obtains, can be to the right Q (s of all states-move, a) set rational initial value, by Q value initialization, priori can be dissolved in learning system, study to the robot initial stage is optimized, thereby for robot provides a good learning foundation, specifically comprise the following steps:

(1) set up neural network model

Neural network has identical topological structure with the structure space of robot work, each neuron is only connected with the neuron in its local neighborhood, type of attachment is all identical, connection weight all equates, between neuron, the propagation of information is two-way, neural network has the architecture of highly-parallel, each neuron is corresponding to one of robot working space discrete state, whole neural network forms two dimensional topology by N * N neuron, neural network is upgraded the state in its neighborhood according to the input of each discrete state in evolutionary process, until neural network reaches equilibrium state, while reaching equilibrium state, in neural network, neuron output value just forms a single-peaked curved surface, on curved surface, the value of every bit just represents the cumulative maximum return that institute's corresponding states can obtain,

(2) design return function

In learning process, robot can move up 4 sides, can select action according to current state, in free position, can both in 4 actions of advancing, retreat, turn left, turn right, select a suitable action, if this action makes robot arrive target, the return immediately obtaining is 1, if robot and barrier or other robot bump, obtain return immediately for-0.2, if robot moves at free space, the return immediately obtaining is-0.1;

(3) calculate cumulative maximum return initial value

When neural network reaches equilibrium state, define the cumulative maximum return V of the corresponding state of each neuron ^* _init(s _i) equal this neuron output value x _i, its relation formula is as follows:

V_{Init}^{*} (s_{i}) &LeftArrow; x_{i},

In formula, xi is neural network i neuronic output valve while reaching equilibrium state, V ^* _init(s _i) be from state s _iset out and follow the obtainable cumulative maximum return of optimal strategy institute;

(4) Q value initialization

Q(s _i, initial value a) is defined as at state s _ilower selection action a obtains returns the maximum conversion accumulation return that r adds follow-up state immediately:

Q_{Init} (s_{i}, a) = r + {γV}_{Init}^{*} (s_{j})

In formula, s _jfor robot is at state s _ithe follow-up state that lower selection action a produces, Q _init(s _i, be a) that state-action is to (s _i, initial Q value a); γ is commutation factor, selects γ=0.95;

(5) the robot intensified learning step based on neural network

(a) neural network develops according to initialization context information, until reach equilibrium state;

(b) state s _ithe initial value of the cumulative maximum return obtaining is defined as neuron output value x _i, relation formula is as follows:

V_{Init}^{*} (s_{i}) &LeftArrow; x_{i}

(c) according to following regular initialization Q value:

Q_{Init} (s_{i}, a) = r + {γV}_{Init}^{*} (s_{j}),

(d) observe current state s _t;

(e) continue to explore in complex environment, at current state s _taction a of lower selection _tand carry out, ambient condition is more

New is new state s' _t, and r is returned in reception immediately _t;

(f) observe new state s' _t;

(g) according to following formula, upgrade list item Q (s _t, a _t) value:

Q_{t} (s_{t}, a_{t}) = (1 - α_{t}) Q_{t - 1} (s_{t}, a_{t}) + α_{t} (r_{t} + γ \underset{a_{t}^{'}}{\arg \max} Q_{t - 1} (s_{t}^{'}, a_{t}^{'}))

In formula, α _tfor learning rate, value is 0.5, and decays with learning process; Q _t-1(s _t, a _t) and Q _t-1(s' _t, a' _t) state of being respectively-action is to (s _t, a _t) and (s' _t, a' _t) at t-1 value constantly, a' _tfor at new state s' _tthe action of lower selection;

(h) judge whether robot has arrived the maximum study number of times that target or learning system have reached setting, the maximum study number of times of setting should guarantee that learning system restrains in maximum study number of times, if both meet one, study finishes, otherwise turns back to step (d) continue studying.