CN114707359A - A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning - Google Patents
A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning
- Publication number
- CN114707359A (application number CN202210487160.8A)
- Authority
- CN
- China
- Prior art keywords
- quantile
- vehicle
- network
- value
- distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F30/00—Computer-aided design [CAD]
        - G06F30/20—Design optimisation, verification or simulation
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N3/00—Computing arrangements based on biological models
        - G06N3/02—Neural networks
          - G06N3/08—Learning methods
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F2111/00—Details relating to CAD techniques
        - G06F2111/08—Probabilistic or stochastic CAD
Abstract
Description
Technical Field
The invention belongs to the field of autonomous vehicles, and relates to a decision planning method for autonomous vehicles based on value distribution reinforcement learning.
Background Art
Autonomous driving technology has advanced rapidly in recent years, but safety has become one of its key problems. Safety is an important factor hindering the commercialization of autonomous vehicles and has been a research hotspot in recent years. The decision planning module, acting as the "brain" of an autonomous vehicle, has a major impact on its safety; in particular, how to make autonomous and safe decisions in complex urban scenarios such as intersections has been widely studied.
The decision planning module of an autonomous vehicle generates the optimal driving behavior according to the current state of the environment so that the driving task can be completed safely. Existing decision planning methods fall mainly into three categories: rule-based, optimization-based, and learning-based. Rule-based methods are only applicable to specific scenarios, and optimization-based methods perform poorly in terms of real-time operation. Learning-based methods have therefore been widely studied in academia and industry in recent years, and reinforcement learning in particular has been widely applied to the decision planning problem of autonomous vehicles; thanks to its real-time performance and scene adaptability, reinforcement-learning-based decision planning can accomplish driving tasks well. However, the driving environment faced by autonomous vehicles is increasingly complex: incomplete perception caused by severe weather or occlusion by buildings, together with the behavioral uncertainty of surrounding traffic participants, poses great challenges to safety, and traditional reinforcement learning algorithms can no longer meet the safety requirements of autonomous vehicles.
Because traditional reinforcement learning selects the optimal action by maximizing the expected value of the return, the distributional information of the return is largely discarded, and the influence of the uncertainty inherent in the environment on the decision policy cannot be taken into account. A new reinforcement learning algorithm that handles the uncertainty in the environment is therefore urgently needed to improve the safety of autonomous-vehicle decision planning.
Summary of the Invention
In view of this, the purpose of the present invention is to provide a decision planning method for autonomous vehicles based on value distribution reinforcement learning, which can improve the safety and stability of the decision planning policy of an autonomous vehicle in an uncertain environment.
To achieve the above purpose, the present invention provides the following technical solution:
A decision planning method for autonomous vehicles based on value distribution reinforcement learning, specifically comprising the following steps:
S1: constructing an unsignalized intersection scenario that accounts for uncertainty;
S2: constructing a fully parameterized quantile function (FQF) network model as the autonomous-vehicle control model;
S3: based on the state-action return distribution learned by the fully parameterized quantile function (FQF) model, introducing the conditional value at risk (CVaR) to generate risk-aware driving behavior.
Further, in step S1, constructing the unsignalized intersection scenario that accounts for uncertainty specifically includes: establishing an occlusion model, determining the surrounding-vehicle model, and establishing the distribution of surrounding-vehicle types.
Further, in step S1, establishing the occlusion model specifically includes: considering the occlusion on both sides of the intersection, analyzing the relative positions of a surrounding vehicle, the ego vehicle, and the intersection center, and computing from this geometric relationship the critical distance d at which a surrounding vehicle can be observed by the ego vehicle, which serves as the criterion for judging whether a surrounding vehicle is occluded; here l is the width of each lane, d′ is the distance from the front of the ego vehicle to the center of the intersection, a further parameter gives the distance from the road boundary to the occluding object, and d is the distance from the front of a surrounding vehicle to the center of the intersection.
Further, in step S1, determining the surrounding-vehicle model specifically includes: so that surrounding vehicles can react to active changes in the environment, specifying that in the simulation environment the behavior of the surrounding vehicles is controlled by the Intelligent Driver Model, in which a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration parameter, d_target is the desired longitudinal distance, d_0 is the minimum longitudinal distance, T_0 is the minimum collision time of the vehicle, and Δv is the relative speed with respect to the preceding vehicle.
Further, in step S1, establishing the distribution of surrounding-vehicle types specifically includes: specifying that in the simulation environment the surrounding vehicles are of three types, Aggressive, Conservative, and Normal, and that at every time step a vehicle of each type is added to the environment with a certain probability; the surrounding-vehicle type space is {Aggressive, Conservative, Normal}.
Further, in step S2, constructing the fully parameterized quantile function model specifically includes the following steps:
S21: constructing a fraction proposal network: taking the state information as the network input and outputting the optimal quantile fractions τ corresponding to each state-action pair;
S22: constructing a quantile value network: taking the optimal fractions produced by the fraction proposal network as the input of the quantile value network and mapping them to the quantile function values corresponding to each fraction in the current state;
S23: constructing the state space S: taking the positions, speeds, and heading angles of the surrounding vehicles together with the position, speed, and heading angle of the ego vehicle as the state information observable by the ego vehicle; the value distribution reinforcement learning performs the subsequent decision planning on the basis of the ego vehicle's observations;
S24: constructing the action space A: the action space is defined as the set of actions the ego vehicle can execute and constitutes the output of the value distribution reinforcement learning network; here the action space of the ego vehicle comprises three discrete actions, namely accelerate, cruise, and decelerate, where the specific accelerations of the accelerate and decelerate actions are computed by the Intelligent Driver Model;
S25: designing the reward function, where the total reward equals the sum of three parts: the collision reward R_collision, the task-completion reward R_success, and the timeout reward R_timeout;
S26: according to the current state S_t, executing the action A_t, and adding the training data (S_t, A_t, R_t, S_{t+1}) obtained after the ego vehicle executes the action to the experience pool;
S27: fitting the return distribution;
S28: updating the fraction proposal network: updating the fraction proposal network by minimizing the 1-Wasserstein distance so as to determine the optimal quantile fractions τ and bring the fitted distribution closer to the true distribution;
S29: updating the quantile value network: the update objective of the quantile value network is to minimize the quantile regression Huber loss so that the output of the quantile value network approaches the target value as closely as possible; the quantile value network is updated by gradient descent.
Further, step S27 specifically includes: fitting the distribution of the return by a weighted mixture of N Dirac functions, where N is the number of quantile fractions, τ_i are the fractions generated by the fraction proposal network, satisfying τ_{i-1} < τ_i with τ_0 = 0 and τ_N = 1, and δ_{θ_i(s,a)} is the Dirac function located at the parameter θ_i for the current state-action pair (s, a).
Further, step S28 specifically includes the following steps:
S281: computing the 1-Wasserstein distance between the fitted distribution and the true return distribution, where N is the number of quantile fractions, ω denotes the network parameters, and the quantile function values corresponding to the fractions enter the distance;
S282: since the true quantile function cannot actually be obtained, the quantile value function with quantile-network parameters ω_2 is used as the true quantile value function in the current state;
S283: to avoid computing the 1-Wasserstein distance directly, gradient descent is applied to the parameters ω_1 of the fraction proposal network so as to minimize the 1-Wasserstein distance;
S284: computing the expected return of the fully parameterized quantile function.
Further, step S29 specifically includes the following steps:
S291: solving the temporal-difference equation, where δ_{ij} is the TD error, r_t is the reward at the current time step, γ is the discount factor, Z is the return distribution at the current time step, and Z′ is the return distribution at the next time step;
S292: computing the quantile regression Huber loss, in which the Huber loss function is applied with threshold κ;
S293: updating the quantile value network by stochastic gradient descent on a loss built from the TD errors at time t.
Further, step S3 specifically includes the following steps:
S31: based on the return distribution information obtained from the fully parameterized quantile function (FQF) model in step S2, computing the conditional value at risk (CVaR) corresponding to each distribution, where the value at risk (VaR) is defined on the return distribution Z, α is the cumulative probability, and the return R is a random variable;
S32: selecting the optimal action: taking the maximization of the CVaR value as the objective, the optimal risk-sensitive behavior is selected, where the selected action is the optimal action in the current state s_t, Z is the return distribution, and α is the cumulative probability.
The beneficial effects of the present invention are as follows:
1) The present invention designs a simulation training environment for an unsignalized intersection that simultaneously accounts for the incomplete perception caused by occlusion in the environment and the behavioral uncertainty of surrounding traffic participants, so that the scenario better matches real driving scenes.
2) The present invention designs a decision planning method based on value distribution reinforcement learning that uses a fully parameterized quantile function (FQF) to fit the value distribution more accurately, providing more accurate distribution information for the subsequent generation of risk-aware decision behavior.
3) The present invention designs a behavior generation method based on the conditional value at risk (CVaR), which generates risk-aware driving behavior from the obtained return distribution information while taking the uncertainty in the environment into account.
Other advantages, objectives, and features of the present invention will be set forth to some extent in the following description and, to some extent, will be apparent to those skilled in the art from a study of what follows, or may be learned from practice of the present invention. The objectives and other advantages of the present invention may be realized and attained by the following description.
Brief Description of the Drawings
To make the objectives, technical solution, and advantages of the present invention clearer, the present invention is described below in detail with reference to the accompanying drawings, in which:
Figure 1 is the overall logical framework of the decision planning method for autonomous vehicles based on value distribution reinforcement learning of the present invention;
Figure 2 is the logical framework for constructing the simulation training environment;
Figure 3 is the network structure of the fully parameterized quantile function (FQF).
Detailed Description of Embodiments
The embodiments of the present invention are described below through specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the drawings provided in the following embodiments only illustrate the basic idea of the present invention in a schematic way, and the following embodiments and the features in the embodiments can be combined with one another provided there is no conflict.
Referring to Figures 1 to 3, the present invention provides a decision planning method for autonomous vehicles based on value distribution reinforcement learning. Considering the uncertainty present in real driving environments, a simulation training environment for an unsignalized intersection is established that accounts for both occlusion and different driver types. At the same time, considering the safety requirements of autonomous-vehicle decision planning, a method based on value distribution reinforcement learning is proposed: the true distribution of the return is fitted with a fully parameterized quantile function (FQF), and the conditional value at risk (CVaR) is then applied to the obtained distribution information to generate risk-aware driving behavior and improve the autonomous vehicle's ability to handle uncertainty in the environment. The method specifically includes the following steps:
Step S1: constructing the unsignalized intersection simulation training scenario, as shown in Figure 2, which specifically includes the following steps:
S11: establishing the occlusion model: considering the occlusion on both sides of the intersection, the relative positions of a surrounding vehicle, the ego vehicle, and the intersection center are analyzed, and the critical distance d at which a surrounding vehicle can be observed by the ego vehicle is computed from the geometric relationship; this serves as the criterion for judging whether a surrounding vehicle is occluded. Here l is the width of each lane, d′ is the distance from the front of the ego vehicle to the center of the intersection, a further parameter gives the distance from the road boundary to the occluding object, and d is the distance from the front of a surrounding vehicle to the center of the intersection.
S12: determining the surrounding-vehicle model: so that surrounding vehicles can react to changes in the environment, it is specified that in the simulation environment the behavior of the surrounding vehicles is controlled by the Intelligent Driver Model, in which a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration parameter, d_target is the desired longitudinal distance, d_0 is the minimum longitudinal distance, T_0 is the minimum collision time of the vehicle, and Δv is the relative speed with respect to the preceding vehicle.
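A sketch of the standard Intelligent Driver Model form consistent with the parameters listed above; here m is read as the acceleration exponent, d denotes the current gap to the preceding vehicle, and b (the comfortable deceleration) is an assumed parameter not named in the list:

```latex
a = a_{\max}\left[1-\left(\frac{v}{v_{\mathrm{target}}}\right)^{m}-\left(\frac{d_{\mathrm{target}}}{d}\right)^{2}\right],
\qquad
d_{\mathrm{target}} = d_{0} + T_{0}\,v + \frac{v\,\Delta v}{2\sqrt{a_{\max}\,b}}
```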
S13: establishing the distribution of surrounding-vehicle types: so that the ego vehicle can make different decisions for different driver types, it is specified that in the simulation environment the surrounding vehicles are of three types, Aggressive, Conservative, and Normal; at every time step, each type is added to the environment with probability P_aggressive = 0.2, P_conservative = 0.3, and P_normal = 0.5, respectively, so the surrounding-vehicle type space is {Aggressive, Conservative, Normal}.
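As an illustration of this spawning rule, the following minimal sketch samples the driver type of a newly added surrounding vehicle with the probabilities stated above; since the probabilities sum to one, they are read here as a categorical distribution, and the function name is hypothetical:

```python
import random

# Type distribution of surrounding vehicles (probabilities from the description above).
TYPE_PROBS = {"aggressive": 0.2, "conservative": 0.3, "normal": 0.5}

def sample_vehicle_type(rng: random.Random) -> str:
    """Sample the driver type of the vehicle spawned at the current time step."""
    types, probs = zip(*TYPE_PROBS.items())
    return rng.choices(types, weights=probs, k=1)[0]

rng = random.Random(0)
print([sample_vehicle_type(rng) for _ in range(5)])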
S14: initializing the environment: the initial speeds, positions, and target speeds of the surrounding vehicles are initialized randomly.
S2: constructing the fully parameterized quantile function (FQF) model as the autonomous-vehicle control model, as shown in Figure 3, which specifically includes the following steps:
S21: constructing the fraction proposal network: the state information is taken as the network input, and the optimal quantile fractions τ corresponding to each state-action pair are output.
S22: constructing the quantile value network: the optimal fractions produced by the fraction proposal network are taken as the input of the quantile value network and mapped to the quantile function values corresponding to each fraction in the current state.
S23: constructing the state space S: the positions, speeds, and heading angles of the surrounding vehicles together with the position, speed, and heading angle of the ego vehicle are taken as the state information observable by the ego vehicle, and the value distribution reinforcement learning performs the subsequent decision planning on the basis of these observations. Here i = 0 denotes the ego vehicle, i ∈ [1, N] denotes the surrounding vehicles, x_i and y_i denote the lateral and longitudinal positions of a vehicle, v_xi and v_yi denote its lateral and longitudinal speeds, and the last component denotes its heading angle.
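One plausible form of the state vector consistent with the variables defined above (the heading-angle symbol φ_i is assumed here):

```latex
S = \left[\, x_{0}, y_{0}, v_{x0}, v_{y0}, \varphi_{0},\;
             x_{1}, y_{1}, v_{x1}, v_{y1}, \varphi_{1},\; \dots,\;
             x_{N}, y_{N}, v_{xN}, v_{yN}, \varphi_{N} \,\right]
```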
S24: constructing the action space A: the action space is defined as the set of actions the ego vehicle can execute and constitutes the output of the value distribution reinforcement learning network. Here the action space of the ego vehicle comprises acceleration, cruising, and deceleration, where the specific accelerations of the acceleration and deceleration actions are computed by the Intelligent Driver Model, in which a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration parameter, d_target is the desired longitudinal distance, d_0 is the minimum longitudinal distance, T_0 is the minimum collision time of the vehicle, and Δv is the relative speed with respect to the preceding vehicle; the acceleration range is a ∈ [-3, 1] m/s².
S25: designing the reward function: the reward function is the sum of three parts, namely safety R_collision, success rate R_success, and efficiency R_timeout, that is:
R = R_collision + R_success + R_timeout
The first term R_collision is the collision reward, requiring that the ego vehicle must not collide with surrounding vehicles;
R_collision = -10
The second term R_success is the reward for completing the task, requiring that the ego vehicle reaches the target location without collision;
R_success = 10
The third term R_timeout is the timeout reward, requiring that the ego vehicle must not exceed the specified maximum number of steps per episode.
R_timeout = -10
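Putting the three terms together, the reward computation for one step can be sketched as follows (the boolean step-outcome flags are hypothetical names):

```python
def compute_reward(collided: bool, reached_goal: bool, timed_out: bool) -> float:
    """Total reward R = R_collision + R_success + R_timeout for one episode step."""
    r_collision = -10.0 if collided else 0.0   # ego must not collide with surrounding vehicles
    r_success = 10.0 if reached_goal else 0.0  # ego reaches the target location without collision
    r_timeout = -10.0 if timed_out else 0.0    # episode exceeds the maximum number of steps
    return r_collision + r_success + r_timeout
```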
S26: according to the current state S_t, the action A_t is executed, and the training data (S_t, A_t, R_t, S_{t+1}) obtained after the ego vehicle executes the action are added to the experience pool.
S27: fitting the return distribution: the distribution of the return is fitted by a weighted mixture of N Dirac functions, where N is the number of quantile fractions, τ_i are the fractions generated by the fraction proposal network, satisfying τ_{i-1} < τ_i with τ_0 = 0 and τ_N = 1, and δ_{θ_i(s,a)} is the Dirac function located at the parameter θ_i for the current state-action pair (s, a).
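Following the standard FQF formulation and the symbols defined above, the fitted return distribution can be written as:

```latex
Z_{\theta,\tau}(s,a) \;=\; \sum_{i=0}^{N-1}\left(\tau_{i+1}-\tau_{i}\right)\delta_{\theta_{i}(s,a)}
```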
S28: updating the fraction proposal network: the fraction proposal network is updated by minimizing the 1-Wasserstein distance so as to determine the optimal fractions τ and bring the fitted distribution closer to the true distribution. The specific operations are as follows:
S281: the 1-Wasserstein distance between the fitted distribution and the true return distribution is computed, where N is the number of quantile fractions, ω denotes the network parameters, and the quantile function values corresponding to the fractions enter the distance.
S282: since the true quantile function cannot actually be obtained, the quantile value function with quantile-network parameters ω_2 is used as the true quantile value function in the current state.
S283: to avoid computing the 1-Wasserstein distance directly, gradient descent is applied to the parameters ω_1 of the fraction proposal network so as to minimize the 1-Wasserstein distance, where the quantile function value corresponding to the fraction τ_i is evaluated with the quantile-value-network parameters ω_2.
S284: the expected return of the fully parameterized quantile function is computed from the N fractions τ_i and the corresponding quantile function values given by the quantile value network with parameters ω_2.
S29: updating the quantile value network: the update objective of the quantile value network is to minimize the quantile regression Huber loss so that the output of the quantile value network approaches the target value as closely as possible; after the loss function is obtained, the quantile value network is updated by gradient descent. The specific operations are as follows:
S291: solving the temporal-difference equation, where r_t is the reward at the current time step, γ is the discount factor, ω_1 denotes the network parameters, the quantile function values corresponding to the fractions τ_i enter the target, Z is the return distribution at the current time step, and Z′ is the return distribution at the next time step.
S292: computing the quantile regression Huber loss, where δ_ij is the TD error and κ is the threshold of the Huber loss.
S293: updating the quantile value network by stochastic gradient descent, where N is the number of quantile fractions, the quantile regression Huber loss is evaluated on the TD errors at time t, κ is the threshold, and τ_i are the fractions.
S3: based on the return distribution obtained in step S2, the conditional value at risk (CVaR) is introduced to generate risk-aware driving behavior, which specifically includes the following steps:
S31: based on the return distribution information obtained in step S2, the conditional value at risk (CVaR) corresponding to each distribution is computed, where the value at risk (VaR) is defined on the return distribution Z, α is the cumulative probability, and the return R is a random variable.
S32: selecting the optimal action: taking the maximization of the CVaR value as the objective, the optimal risk-sensitive behavior is selected, where the selected action is the optimal action in the current state s_t, Z is the return distribution, and α is the cumulative probability; in this notation the selected action is a*_t = argmax_a CVaR_α(Z(s_t, a)).
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention can be modified or replaced by equivalents without departing from the purpose and scope of the technical solution, all of which shall be covered by the scope of the claims of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210487160.8A CN114707359B (en) | 2022-05-06 | 2022-05-06 | Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210487160.8A CN114707359B (en) | 2022-05-06 | 2022-05-06 | Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114707359A | 2022-07-05
CN114707359B CN114707359B (en) | 2025-03-21 |
Family
ID=82176207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210487160.8A Active CN114707359B (en) | 2022-05-06 | 2022-05-06 | Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114707359B (en) |
- 2022-05-06: CN application CN202210487160.8A, patent CN114707359B (en), status Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716562A (en) * | 2019-09-25 | 2020-01-21 | 南京航空航天大学 | Decision-making method for multi-lane driving of driverless cars based on reinforcement learning |
CN110562258A (en) * | 2019-09-30 | 2019-12-13 | 驭势科技(北京)有限公司 | Method for vehicle automatic lane change decision, vehicle-mounted equipment and storage medium |
WO2021213616A1 (en) * | 2020-04-20 | 2021-10-28 | Volvo Truck Corporation | Tactical decision-making through reinforcement learning with uncertainty estimation |
Non-Patent Citations (8)
Title |
---|
CARL-JOHAN HOEL: "Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation", 2020 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 8 January 2021 (2021-01-08), pages 1563 - 1569 * |
DEREK YANG等: "Fully parameterized quantile function for distributional reinforcement learning", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), vol. 32, 31 December 2019 (2019-12-31), pages 1 - 10 * |
JULIAN BERNHARD等: "Addressing Inherent Uncertainty: Risk-Sensitive Behavior Generation for Automated Driving using Distributional Reinforcement Learning", 2019 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 29 August 2019 (2019-08-29), pages 1 - 9 * |
LOUIS RUWAID等: "A study of the exploration/exploitation trade-off in reinforcement learning: Applied to autonomous driving", COMPUTER AND INFORMATION SCIENCES, 29 July 2019 (2019-07-29), pages 1 - 49 * |
XIAO LIN等: "Decision Making through Occluded Intersections for Autonomous Driving", 2019 IEEE INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC), 28 November 2019 (2019-11-28), pages 2449 - 2455 * |
XIAOLIN TANG等: "Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic", IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, vol. 71, no. 5, 22 February 2022 (2022-02-22), pages 4706, XP011908845, DOI: 10.1109/TVT.2022.3151651 * |
XIAOLIN TANG等: "Uncertainty-Aware Decision-Making for Autonomous Driving at Uncontrolled Intersections", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTA TION SYSTEMS, vol. 24, no. 9, 30 September 2023 (2023-09-30), pages 9725 - 9735 * |
YANG KAI et al.: "Research on Safe Decision-Making Methods for Autonomous Driving at Unsignalized Intersections", Journal of Mechanical Engineering, 11 March 2024 (2024-03-11), pages 1 - 13 *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117208019A (en) * | 2023-11-08 | 2023-12-12 | 北京理工大学前沿技术研究院 | Longitudinal decision method and system under perceived occlusion based on value distribution reinforcement learning |
CN117208019B (en) * | 2023-11-08 | 2024-04-05 | 北京理工大学前沿技术研究院 | Longitudinal decision-making method and system under perceived occlusion based on value distribution reinforcement learning |
CN118212808A (en) * | 2024-02-02 | 2024-06-18 | 长安大学 | Method, system and equipment for planning traffic decision of signalless intersection |
CN118323163A (en) * | 2024-04-30 | 2024-07-12 | 北京理工大学前沿技术研究院 | Automatic driving decision method and system considering shielding uncertainty |
CN118323163B (en) * | 2024-04-30 | 2025-03-18 | 北京理工大学前沿技术研究院 | Autonomous driving decision-making method and system considering occlusion uncertainty |
CN118747519A (en) * | 2024-06-06 | 2024-10-08 | 中国电子科技集团有限公司电子科学研究院 | A risk-adaptive navigation algorithm for unmanned boats based on distributed reinforcement learning |
CN118747519B (en) * | 2024-06-06 | 2025-02-11 | 中国电子科技集团有限公司电子科学研究院 | A risk-adaptive navigation algorithm for unmanned boats based on distributed reinforcement learning |
CN119377624A (en) * | 2024-12-26 | 2025-01-28 | 杭州衡泰技术股份有限公司 | A strategy evaluation system and risk control method based on value distribution environment model |
Also Published As
Publication number | Publication date |
---|---|
CN114707359B (en) | 2025-03-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Research on autonomous driving decision-making strategies based deep reinforcement learning | |
CN114707359A (en) | A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning | |
CN110969848B (en) | Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes | |
CN111222630B (en) | A Learning Method for Autonomous Driving Rules Based on Deep Reinforcement Learning | |
CN114013443B (en) | Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning | |
CN111833597B (en) | Autonomous decision making in traffic situations with planning control | |
CN113071487B (en) | Automatic driving vehicle control method and device and cloud equipment | |
Makantasis et al. | Deep reinforcement‐learning‐based driving policy for autonomous road vehicles | |
CN115257745A (en) | A lane change decision control method for autonomous driving based on rule fusion reinforcement learning | |
CN110843789A (en) | Vehicle lane change intention prediction method based on time sequence convolution network | |
CN115214672A (en) | A human-like decision-making, planning and control method for autonomous driving considering workshop interaction | |
CN115257746A (en) | Uncertainty-considered decision control method for lane change of automatic driving automobile | |
US11613269B2 (en) | Learning safety and human-centered constraints in autonomous vehicles | |
CN107479547A (en) | Decision tree behaviour decision making algorithm based on learning from instruction | |
CN116612636B (en) | Signal lamp cooperative control method based on multi-agent reinforcement learning | |
Pan et al. | Research on the behavior decision of connected and autonomous vehicle at the unsignalized intersection | |
CN115303297A (en) | End-to-end autonomous driving control method and device in urban scenarios based on attention mechanism and graphical model reinforcement learning | |
CN110646007B (en) | A Vehicle Driving Method Based on Formal Representation | |
Ren et al. | Self-learned intelligence for integrated decision and control of automated vehicles at signalized intersections | |
CN116572993A (en) | Intelligent vehicle risk sensitive sequential behavior decision method, device and equipment | |
CN117734715A (en) | Automatic driving control method, system, equipment and storage medium based on reinforcement learning | |
Fabiani et al. | A mixed-logical-dynamical model for automated driving on highways | |
El Hamdani et al. | A Markov decision process model for a reinforcement learning-based autonomous pedestrian crossing protocol | |
CN118917179A (en) | Multi-mode reinforcement learning vehicle decision-making planning method with compensation feedback | |
CN118583187A (en) | Path optimization selection method and system based on time-sharing planning and radar-vision fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |