CN114707359A - A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning - Google Patents
A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning
- Publication number
- CN114707359A (application number CN202210487160.8A)
- Authority
- CN
- China
- Prior art keywords
- quantile
- vehicle
- network
- value
- distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F30/00—Computer-aided design [CAD]
        - G06F30/20—Design optimisation, verification or simulation
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N3/00—Computing arrangements based on biological models
        - G06N3/02—Neural networks
          - G06N3/08—Learning methods
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F2111/00—Details relating to CAD techniques
        - G06F2111/08—Probabilistic or stochastic CAD
Abstract
Description
Technical Field
The invention belongs to the field of autonomous vehicles, and relates to a decision planning method for autonomous vehicles based on value distribution reinforcement learning.
Background Art
Autonomous driving technology has advanced rapidly in recent years, but safety has become one of its key problems. Safety is an important factor hindering the commercialization of autonomous vehicles and has been a research hotspot in recent years. The decision planning module, acting as the "brain" of an autonomous vehicle, has a major impact on its safety; in particular, how to make autonomous and safe decisions in complex urban scenarios such as intersections has been widely studied.
The decision planning module of an autonomous vehicle generates the optimal driving behavior according to the current state of the environment so that the driving task can be completed safely. Existing decision planning methods fall mainly into three categories: rule-based, optimization-based, and learning-based. Rule-based methods are only applicable to specific scenarios, and optimization-based methods perform poorly in terms of real-time operation. Learning-based methods have therefore been widely studied in academia and industry in recent years, and reinforcement learning in particular has been widely applied to the decision planning problem of autonomous vehicles; thanks to its real-time performance and scene adaptability, reinforcement-learning-based decision planning can accomplish driving tasks well. However, the driving environment faced by autonomous vehicles is increasingly complex: incomplete perception caused by severe weather or occlusion by buildings, together with the behavioral uncertainty of surrounding traffic participants, poses great challenges to safety, and traditional reinforcement learning algorithms can no longer meet the safety requirements of autonomous vehicles.
Because traditional reinforcement learning selects the optimal action by maximizing the expected value of the return, the distributional information of the return is largely discarded, and the influence of the uncertainty inherent in the environment on the decision policy cannot be taken into account. A new reinforcement learning algorithm that handles the uncertainty in the environment is therefore urgently needed to improve the safety of autonomous-vehicle decision planning.
Summary of the Invention
In view of this, the purpose of the present invention is to provide a decision planning method for autonomous vehicles based on value distribution reinforcement learning, which can improve the safety and stability of the decision planning policy of an autonomous vehicle in an uncertain environment.
To achieve the above purpose, the present invention provides the following technical solution:
A decision planning method for autonomous vehicles based on value distribution reinforcement learning, specifically comprising the following steps:
S1: constructing an unsignalized intersection scenario that accounts for uncertainty;
S2: constructing a fully parameterized quantile function (FQF) network model as the autonomous-vehicle control model;
S3: based on the state-action return distribution learned by the fully parameterized quantile function (FQF) model, introducing the conditional value at risk (CVaR) to generate risk-aware driving behavior.
Further, in step S1, constructing the unsignalized intersection scenario that accounts for uncertainty specifically includes: establishing an occlusion model, determining the surrounding-vehicle model, and establishing the distribution of surrounding-vehicle types.
Further, in step S1, establishing the occlusion model specifically includes: considering the occlusion on both sides of the intersection, analyzing the relative positions of a surrounding vehicle, the ego vehicle, and the intersection center, and computing from this geometric relationship the critical distance d at which a surrounding vehicle can be observed by the ego vehicle, which serves as the criterion for judging whether a surrounding vehicle is occluded; here l is the width of each lane, d′ is the distance from the front of the ego vehicle to the center of the intersection, a further parameter gives the distance from the road boundary to the occluding object, and d is the distance from the front of a surrounding vehicle to the center of the intersection.
Further, in step S1, determining the surrounding-vehicle model specifically includes: so that surrounding vehicles can react to active changes in the environment, specifying that in the simulation environment the behavior of the surrounding vehicles is controlled by the Intelligent Driver Model, in which a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration parameter, d_target is the desired longitudinal distance, d_0 is the minimum longitudinal distance, T_0 is the minimum collision time of the vehicle, and Δv is the relative speed with respect to the preceding vehicle.
Further, in step S1, establishing the distribution of surrounding-vehicle types specifically includes: specifying that in the simulation environment the surrounding vehicles are of three types, Aggressive, Conservative, and Normal, and that at every time step a vehicle of each type is added to the environment with a certain probability; the surrounding-vehicle type space is {Aggressive, Conservative, Normal}.
Further, in step S2, constructing the fully parameterized quantile function model specifically includes the following steps:
S21: constructing a fraction proposal network: taking the state information as the network input and outputting the optimal quantile fractions τ corresponding to each state-action pair;
S22: constructing a quantile value network: taking the optimal fractions produced by the fraction proposal network as the input of the quantile value network and mapping them to the quantile function values corresponding to each fraction in the current state;
S23: constructing the state space S: taking the positions, speeds, and heading angles of the surrounding vehicles together with the position, speed, and heading angle of the ego vehicle as the state information observable by the ego vehicle; the value distribution reinforcement learning performs the subsequent decision planning on the basis of the ego vehicle's observations;
S24: constructing the action space A: the action space is defined as the set of actions the ego vehicle can execute and constitutes the output of the value distribution reinforcement learning network; here the action space of the ego vehicle comprises three discrete actions, namely accelerate, cruise, and decelerate, where the specific accelerations of the accelerate and decelerate actions are computed by the Intelligent Driver Model;
S25: designing the reward function, where the total reward equals the sum of three parts: the collision reward R_collision, the task-completion reward R_success, and the timeout reward R_timeout;
S26: according to the current state S_t, executing the action A_t, and adding the training data (S_t, A_t, R_t, S_{t+1}) obtained after the ego vehicle executes the action to the experience pool;
S27: fitting the return distribution;
S28: updating the fraction proposal network: updating the fraction proposal network by minimizing the 1-Wasserstein distance so as to determine the optimal quantile fractions τ and bring the fitted distribution closer to the true distribution;
S29: updating the quantile value network: the update objective of the quantile value network is to minimize the quantile regression Huber loss so that the output of the quantile value network approaches the target value as closely as possible; the quantile value network is updated by gradient descent.
Further, step S27 specifically includes: fitting the distribution of the return by a weighted mixture of N Dirac functions, where N is the number of quantile fractions, τ_i are the fractions generated by the fraction proposal network, satisfying τ_{i-1} < τ_i with τ_0 = 0 and τ_N = 1, and δ_{θ_i(s,a)} is the Dirac function located at the parameter θ_i for the current state-action pair (s, a).
Further, step S28 specifically includes the following steps:
S281: computing the 1-Wasserstein distance between the fitted distribution and the true return distribution, where N is the number of quantile fractions, ω denotes the network parameters, and the quantile function values corresponding to the fractions enter the distance;
S282: since the true quantile function cannot actually be obtained, the quantile value function with quantile-network parameters ω_2 is used as the true quantile value function in the current state;
S283: to avoid computing the 1-Wasserstein distance directly, gradient descent is applied to the parameters ω_1 of the fraction proposal network so as to minimize the 1-Wasserstein distance;
S284: computing the expected return of the fully parameterized quantile function.
Further, step S29 specifically includes the following steps:
S291: solving the temporal-difference equation, where δ_{ij} is the TD error, r_t is the reward at the current time step, γ is the discount factor, Z is the return distribution at the current time step, and Z′ is the return distribution at the next time step;
S292: computing the quantile regression Huber loss, in which the Huber loss function is applied with threshold κ;
S293: updating the quantile value network by stochastic gradient descent on a loss built from the TD errors at time t.
Further, step S3 specifically includes the following steps:
S31: based on the return distribution information obtained from the fully parameterized quantile function (FQF) model in step S2, computing the conditional value at risk (CVaR) corresponding to each distribution, where the value at risk (VaR) is defined on the return distribution Z, α is the cumulative probability, and the return R is a random variable;
S32: selecting the optimal action: taking the maximization of the CVaR value as the objective, the optimal risk-sensitive behavior is selected, where the selected action is the optimal action in the current state s_t, Z is the return distribution, and α is the cumulative probability.
The beneficial effects of the present invention are as follows:
1) The present invention designs a simulation training environment for an unsignalized intersection that simultaneously accounts for the incomplete perception caused by occlusion in the environment and the behavioral uncertainty of surrounding traffic participants, so that the scenario better matches real driving scenes.
2) The present invention designs a decision planning method based on value distribution reinforcement learning that uses a fully parameterized quantile function (FQF) to fit the value distribution more accurately, providing more accurate distribution information for the subsequent generation of risk-aware decision behavior.
3) The present invention designs a behavior generation method based on the conditional value at risk (CVaR), which generates risk-aware driving behavior from the obtained return distribution information while taking the uncertainty in the environment into account.
Other advantages, objectives, and features of the present invention will be set forth to some extent in the following description and, to some extent, will be apparent to those skilled in the art from a study of what follows, or may be learned from practice of the present invention. The objectives and other advantages of the present invention may be realized and attained by the following description.
Brief Description of the Drawings
To make the objectives, technical solution, and advantages of the present invention clearer, the present invention is described below in detail with reference to the accompanying drawings, in which:
Figure 1 is the overall logical framework of the decision planning method for autonomous vehicles based on value distribution reinforcement learning of the present invention;
Figure 2 is the logical framework for constructing the simulation training environment;
Figure 3 is the network structure of the fully parameterized quantile function (FQF).
Detailed Description of Embodiments
The embodiments of the present invention are described below through specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the drawings provided in the following embodiments only illustrate the basic idea of the present invention in a schematic way, and the following embodiments and the features in the embodiments can be combined with one another provided there is no conflict.
Referring to Figures 1 to 3, the present invention provides a decision planning method for autonomous vehicles based on value distribution reinforcement learning. Considering the uncertainty present in real driving environments, a simulation training environment for an unsignalized intersection is established that accounts for both occlusion and different driver types. At the same time, considering the safety requirements of autonomous-vehicle decision planning, a method based on value distribution reinforcement learning is proposed: the true distribution of the return is fitted with a fully parameterized quantile function (FQF), and the conditional value at risk (CVaR) is then applied to the obtained distribution information to generate risk-aware driving behavior and improve the autonomous vehicle's ability to handle uncertainty in the environment. The method specifically includes the following steps:
Step S1: constructing the unsignalized intersection simulation training scenario, as shown in Figure 2, which specifically includes the following steps:
S11: establishing the occlusion model: considering the occlusion on both sides of the intersection, the relative positions of a surrounding vehicle, the ego vehicle, and the intersection center are analyzed, and the critical distance d at which a surrounding vehicle can be observed by the ego vehicle is computed from the geometric relationship; this serves as the criterion for judging whether a surrounding vehicle is occluded. Here l is the width of each lane, d′ is the distance from the front of the ego vehicle to the center of the intersection, a further parameter gives the distance from the road boundary to the occluding object, and d is the distance from the front of a surrounding vehicle to the center of the intersection.
S12: determining the surrounding-vehicle model: so that surrounding vehicles can react to changes in the environment, it is specified that in the simulation environment the behavior of the surrounding vehicles is controlled by the Intelligent Driver Model, in which a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration parameter, d_target is the desired longitudinal distance, d_0 is the minimum longitudinal distance, T_0 is the minimum collision time of the vehicle, and Δv is the relative speed with respect to the preceding vehicle.
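A sketch of the standard Intelligent Driver Model form consistent with the parameters listed above; here m is read as the acceleration exponent, d denotes the current gap to the preceding vehicle, and b (the comfortable deceleration) is an assumed parameter not named in the list:

```latex
a = a_{\max}\left[1-\left(\frac{v}{v_{\mathrm{target}}}\right)^{m}-\left(\frac{d_{\mathrm{target}}}{d}\right)^{2}\right],
\qquad
d_{\mathrm{target}} = d_{0} + T_{0}\,v + \frac{v\,\Delta v}{2\sqrt{a_{\max}\,b}}
```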
S13: establishing the distribution of surrounding-vehicle types: so that the ego vehicle can make different decisions for different driver types, it is specified that in the simulation environment the surrounding vehicles are of three types, Aggressive, Conservative, and Normal; at every time step, each type is added to the environment with probability P_aggressive = 0.2, P_conservative = 0.3, and P_normal = 0.5, respectively, so the surrounding-vehicle type space is {Aggressive, Conservative, Normal}.
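As an illustration of this spawning rule, the following minimal sketch samples the driver type of a newly added surrounding vehicle with the probabilities stated above; since the probabilities sum to one, they are read here as a categorical distribution, and the function name is hypothetical:

```python
import random

# Type distribution of surrounding vehicles (probabilities from the description above).
TYPE_PROBS = {"aggressive": 0.2, "conservative": 0.3, "normal": 0.5}

def sample_vehicle_type(rng: random.Random) -> str:
    """Sample the driver type of the vehicle spawned at the current time step."""
    types, probs = zip(*TYPE_PROBS.items())
    return rng.choices(types, weights=probs, k=1)[0]

rng = random.Random(0)
print([sample_vehicle_type(rng) for _ in range(5)])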
S14: initializing the environment: the initial speeds, positions, and target speeds of the surrounding vehicles are initialized randomly.
S2: constructing the fully parameterized quantile function (FQF) model as the autonomous-vehicle control model, as shown in Figure 3, which specifically includes the following steps:
S21: constructing the fraction proposal network: the state information is taken as the network input, and the optimal quantile fractions τ corresponding to each state-action pair are output.
S22: constructing the quantile value network: the optimal fractions produced by the fraction proposal network are taken as the input of the quantile value network and mapped to the quantile function values corresponding to each fraction in the current state.
S23: constructing the state space S: the positions, speeds, and heading angles of the surrounding vehicles together with the position, speed, and heading angle of the ego vehicle are taken as the state information observable by the ego vehicle, and the value distribution reinforcement learning performs the subsequent decision planning on the basis of these observations. Here i = 0 denotes the ego vehicle, i ∈ [1, N] denotes the surrounding vehicles, x_i and y_i denote the lateral and longitudinal positions of a vehicle, v_xi and v_yi denote its lateral and longitudinal speeds, and the last component denotes its heading angle.
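One plausible form of the state vector consistent with the variables defined above (the heading-angle symbol φ_i is assumed here):

```latex
S = \left[\, x_{0}, y_{0}, v_{x0}, v_{y0}, \varphi_{0},\;
             x_{1}, y_{1}, v_{x1}, v_{y1}, \varphi_{1},\; \dots,\;
             x_{N}, y_{N}, v_{xN}, v_{yN}, \varphi_{N} \,\right]
```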
S24: constructing the action space A: the action space is defined as the set of actions the ego vehicle can execute and constitutes the output of the value distribution reinforcement learning network. Here the action space of the ego vehicle comprises acceleration, cruising, and deceleration, where the specific accelerations of the acceleration and deceleration actions are computed by the Intelligent Driver Model, in which a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration parameter, d_target is the desired longitudinal distance, d_0 is the minimum longitudinal distance, T_0 is the minimum collision time of the vehicle, and Δv is the relative speed with respect to the preceding vehicle; the acceleration range is a ∈ [-3, 1] m/s².
S25: designing the reward function: the reward function is the sum of three parts, namely safety R_collision, success rate R_success, and efficiency R_timeout, that is:
R = R_collision + R_success + R_timeout
The first term R_collision is the collision reward, requiring that the ego vehicle must not collide with surrounding vehicles;
R_collision = -10
The second term R_success is the reward for completing the task, requiring that the ego vehicle reaches the target location without collision;
R_success = 10
The third term R_timeout is the timeout reward, requiring that the ego vehicle must not exceed the specified maximum number of steps per episode.
R_timeout = -10
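Putting the three terms together, the reward computation for one step can be sketched as follows (the boolean step-outcome flags are hypothetical names):

```python
def compute_reward(collided: bool, reached_goal: bool, timed_out: bool) -> float:
    """Total reward R = R_collision + R_success + R_timeout for one episode step."""
    r_collision = -10.0 if collided else 0.0   # ego must not collide with surrounding vehicles
    r_success = 10.0 if reached_goal else 0.0  # ego reaches the target location without collision
    r_timeout = -10.0 if timed_out else 0.0    # episode exceeds the maximum number of steps
    return r_collision + r_success + r_timeout
```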
S26: according to the current state S_t, the action A_t is executed, and the training data (S_t, A_t, R_t, S_{t+1}) obtained after the ego vehicle executes the action are added to the experience pool.
S27: fitting the return distribution: the distribution of the return is fitted by a weighted mixture of N Dirac functions, where N is the number of quantile fractions, τ_i are the fractions generated by the fraction proposal network, satisfying τ_{i-1} < τ_i with τ_0 = 0 and τ_N = 1, and δ_{θ_i(s,a)} is the Dirac function located at the parameter θ_i for the current state-action pair (s, a).
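Following the standard FQF formulation and the symbols defined above, the fitted return distribution can be written as:

```latex
Z_{\theta,\tau}(s,a) \;=\; \sum_{i=0}^{N-1}\left(\tau_{i+1}-\tau_{i}\right)\delta_{\theta_{i}(s,a)}
```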
S28: updating the fraction proposal network: the fraction proposal network is updated by minimizing the 1-Wasserstein distance so as to determine the optimal fractions τ and bring the fitted distribution closer to the true distribution. The specific operations are as follows:
S281: the 1-Wasserstein distance between the fitted distribution and the true return distribution is computed, where N is the number of quantile fractions, ω denotes the network parameters, and the quantile function values corresponding to the fractions enter the distance.
S282: since the true quantile function cannot actually be obtained, the quantile value function with quantile-network parameters ω_2 is used as the true quantile value function in the current state.
S283: to avoid computing the 1-Wasserstein distance directly, gradient descent is applied to the parameters ω_1 of the fraction proposal network so as to minimize the 1-Wasserstein distance, where the quantile function value corresponding to the fraction τ_i is evaluated with the quantile-value-network parameters ω_2.
S284: the expected return of the fully parameterized quantile function is computed from the N fractions τ_i and the corresponding quantile function values given by the quantile value network with parameters ω_2.
S29: updating the quantile value network: the update objective of the quantile value network is to minimize the quantile regression Huber loss so that the output of the quantile value network approaches the target value as closely as possible; after the loss function is obtained, the quantile value network is updated by gradient descent. The specific operations are as follows:
S291: solving the temporal-difference equation, where r_t is the reward at the current time step, γ is the discount factor, ω_1 denotes the network parameters, the quantile function values corresponding to the fractions τ_i enter the target, Z is the return distribution at the current time step, and Z′ is the return distribution at the next time step.
S292: computing the quantile regression Huber loss, where δ_ij is the TD error and κ is the threshold of the Huber loss.
S293: updating the quantile value network by stochastic gradient descent, where N is the number of quantile fractions, the quantile regression Huber loss is evaluated on the TD errors at time t, κ is the threshold, and τ_i are the fractions.
S3: based on the return distribution obtained in step S2, the conditional value at risk (CVaR) is introduced to generate risk-aware driving behavior, which specifically includes the following steps:
S31: based on the return distribution information obtained in step S2, the conditional value at risk (CVaR) corresponding to each distribution is computed, where the value at risk (VaR) is defined on the return distribution Z, α is the cumulative probability, and the return R is a random variable.
S32: selecting the optimal action: taking the maximization of the CVaR value as the objective, the optimal risk-sensitive behavior is selected, where the selected action is the optimal action in the current state s_t, Z is the return distribution, and α is the cumulative probability; in this notation the selected action is a*_t = argmax_a CVaR_α(Z(s_t, a)).
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention can be modified or replaced by equivalents without departing from the purpose and scope of the technical solution, all of which shall be covered by the scope of the claims of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210487160.8A CN114707359B (en) | 2022-05-06 | 2022-05-06 | Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210487160.8A CN114707359B (en) | 2022-05-06 | 2022-05-06 | Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114707359A | 2022-07-05
CN114707359B CN114707359B (en) | 2025-03-21 |
Family
ID=82176207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210487160.8A Active CN114707359B (en) | 2022-05-06 | 2022-05-06 | Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114707359B (en) |
- 2022-05-06: CN application CN202210487160.8A, patent CN114707359B (en), status Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110716562A (en) * | 2019-09-25 | 2020-01-21 | 南京航空航天大学 | Decision-making method for multi-lane driving of driverless cars based on reinforcement learning |
CN110562258A (en) * | 2019-09-30 | 2019-12-13 | 驭势科技(北京)有限公司 | Method for vehicle automatic lane change decision, vehicle-mounted equipment and storage medium |
WO2021213616A1 (en) * | 2020-04-20 | 2021-10-28 | Volvo Truck Corporation | Tactical decision-making through reinforcement learning with uncertainty estimation |
Non-Patent Citations (8)
Title |
---|
CARL-JOHAN HOEL: "Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation", 2020 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 8 January 2021 (2021-01-08), pages 1563 - 1569 * |
DEREK YANG等: "Fully parameterized quantile function for distributional reinforcement learning", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), vol. 32, 31 December 2019 (2019-12-31), pages 1 - 10 * |
JULIAN BERNHARD等: "Addressing Inherent Uncertainty: Risk-Sensitive Behavior Generation for Automated Driving using Distributional Reinforcement Learning", 2019 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 29 August 2019 (2019-08-29), pages 1 - 9 * |
LOUIS RUWAID等: "A study of the exploration/exploitation trade-off in reinforcement learning: Applied to autonomous driving", COMPUTER AND INFORMATION SCIENCES, 29 July 2019 (2019-07-29), pages 1 - 49 * |
XIAO LIN等: "Decision Making through Occluded Intersections for Autonomous Driving", 2019 IEEE INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC), 28 November 2019 (2019-11-28), pages 2449 - 2455 * |
XIAOLIN TANG等: "Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic", IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, vol. 71, no. 5, 22 February 2022 (2022-02-22), pages 4706, XP011908845, DOI: 10.1109/TVT.2022.3151651 * |
XIAOLIN TANG等: "Uncertainty-Aware Decision-Making for Autonomous Driving at Uncontrolled Intersections", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTA TION SYSTEMS, vol. 24, no. 9, 30 September 2023 (2023-09-30), pages 9725 - 9735 * |
YANG KAI et al.: "Research on Safe Decision-Making Methods for Autonomous Driving at Unsignalized Intersections", Journal of Mechanical Engineering, 11 March 2024 (2024-03-11), pages 1 - 13 *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117208019A (en) * | 2023-11-08 | 2023-12-12 | 北京理工大学前沿技术研究院 | Longitudinal decision method and system under perceived occlusion based on value distribution reinforcement learning |
CN117208019B (en) * | 2023-11-08 | 2024-04-05 | 北京理工大学前沿技术研究院 | Longitudinal decision-making method and system under perceived occlusion based on value distribution reinforcement learning |
CN118212808A (en) * | 2024-02-02 | 2024-06-18 | 长安大学 | Method, system and equipment for planning traffic decision of signalless intersection |
CN118323163A (en) * | 2024-04-30 | 2024-07-12 | 北京理工大学前沿技术研究院 | Automatic driving decision method and system considering shielding uncertainty |
CN118323163B (en) * | 2024-04-30 | 2025-03-18 | 北京理工大学前沿技术研究院 | Autonomous driving decision-making method and system considering occlusion uncertainty |
CN118747519A (en) * | 2024-06-06 | 2024-10-08 | 中国电子科技集团有限公司电子科学研究院 | A risk-adaptive navigation algorithm for unmanned boats based on distributed reinforcement learning |
CN118747519B (en) * | 2024-06-06 | 2025-02-11 | 中国电子科技集团有限公司电子科学研究院 | A risk-adaptive navigation algorithm for unmanned boats based on distributed reinforcement learning |
CN119377624A (en) * | 2024-12-26 | 2025-01-28 | 杭州衡泰技术股份有限公司 | A strategy evaluation system and risk control method based on value distribution environment model |
Also Published As
Publication number | Publication date |
---|---|
CN114707359B (en) | 2025-03-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Research on autonomous driving decision-making strategies based deep reinforcement learning | |
CN114707359A (en) | A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning | |
CN110969848B (en) | Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes | |
CN111222630B (en) | A Learning Method for Autonomous Driving Rules Based on Deep Reinforcement Learning | |
CN114013443B (en) | Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning | |
CN111833597B (en) | Autonomous decision making in traffic situations with planning control | |
CN113071487B (en) | Automatic driving vehicle control method and device and cloud equipment | |
Makantasis et al. | Deep reinforcement‐learning‐based driving policy for autonomous road vehicles | |
CN115257745A (en) | A lane change decision control method for autonomous driving based on rule fusion reinforcement learning | |
CN110843789A (en) | Vehicle lane change intention prediction method based on time sequence convolution network | |
CN115214672A (en) | A human-like decision-making, planning and control method for autonomous driving considering workshop interaction | |
CN115257746A (en) | Uncertainty-considered decision control method for lane change of automatic driving automobile | |
US11613269B2 (en) | Learning safety and human-centered constraints in autonomous vehicles | |
CN107479547A (en) | Decision tree behaviour decision making algorithm based on learning from instruction | |
CN116612636B (en) | Signal lamp cooperative control method based on multi-agent reinforcement learning | |
Pan et al. | Research on the behavior decision of connected and autonomous vehicle at the unsignalized intersection | |
CN115303297A (en) | End-to-end autonomous driving control method and device in urban scenarios based on attention mechanism and graphical model reinforcement learning | |
CN110646007B (en) | A Vehicle Driving Method Based on Formal Representation | |
Ren et al. | Self-learned intelligence for integrated decision and control of automated vehicles at signalized intersections | |
CN116572993A (en) | Intelligent vehicle risk sensitive sequential behavior decision method, device and equipment | |
CN117734715A (en) | Automatic driving control method, system, equipment and storage medium based on reinforcement learning | |
Fabiani et al. | A mixed-logical-dynamical model for automated driving on highways | |
El Hamdani et al. | A Markov decision process model for a reinforcement learning-based autonomous pedestrian crossing protocol | |
CN118917179A (en) | Multi-mode reinforcement learning vehicle decision-making planning method with compensation feedback | |
CN118583187A (en) | Path optimization selection method and system based on time-sharing planning and radar-vision fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |