
CN114707359A - A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning - Google Patents


Info

Publication number
CN114707359A
Authority
CN
China
Prior art keywords
quantile
vehicle
network
value
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210487160.8A
Other languages
Chinese (zh)
Other versions
CN114707359B (en)
Inventor
唐小林
钟桂川
杨凯
陈永力
邓忠伟
彭颖
胡晓松
李佳承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202210487160.8A
Publication of CN114707359A
Application granted
Publication of CN114707359B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00: Computer-aided design [CAD]
    • G06F 30/20: Design optimisation, verification or simulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2111/00: Details relating to CAD techniques
    • G06F 2111/08: Probabilistic or stochastic CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention relates to a decision planning method for autonomous vehicles based on value distribution reinforcement learning and belongs to the field of autonomous vehicles. The method comprises the following steps: S1: construct an unsignalized intersection scenario that accounts for uncertainty; S2: construct a fully parameterized quantile function model as the autonomous vehicle control model; S3: based on the state-action return distribution learned by the fully parameterized quantile function model, introduce the conditional value at risk to generate risk-aware driving behavior. By using value distribution reinforcement learning, the method improves the safety and stability of the decision planning strategy of an autonomous vehicle in an uncertain environment.

Description

A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning

Technical Field

The invention belongs to the field of autonomous vehicles and relates to a decision planning method for autonomous vehicles based on value distribution reinforcement learning.

Background Art

Autonomous driving technology has developed rapidly in recent years, but safety has become a key issue for the technology. Safety is an important factor hindering the commercialization of autonomous vehicles and has been a research hotspot in recent years. The decision planning module, as the "brain" of an autonomous vehicle, has a considerable impact on its safety; how to make autonomous and safe decisions, especially in complex urban scenes such as intersections, has therefore been widely studied in recent years.

The decision planning module of an autonomous vehicle generates the optimal driving behavior from the current environmental state so that the driving task is completed safely. Existing decision planning methods fall mainly into three categories: rule-based, optimization-based, and learning-based. Rule-based methods are only applicable to specific scenarios, and optimization-based methods perform poorly in real time. Learning-based methods have therefore been widely studied in academia and industry in recent years, and reinforcement learning in particular has been extensively applied to the decision planning problem of autonomous vehicles. Thanks to its real-time performance and adaptability to different scenes, reinforcement-learning-based decision planning can accomplish driving tasks well. However, the driving environment faced by autonomous vehicles is increasingly complex: incomplete perception caused by bad weather or building occlusion, together with the behavioral uncertainty of surrounding traffic participants, poses great challenges to safety, and traditional reinforcement learning algorithms can no longer meet the safety requirements of autonomous vehicles.

Because traditional reinforcement learning selects the optimal action by maximizing the expected value of the return, most of the distributional information of the return is lost, so the influence of the uncertainty inherent in the environment on the decision policy cannot be considered. A new reinforcement learning algorithm is therefore urgently needed to handle the uncertainty in the environment and improve the safety of autonomous vehicle decision planning.

Summary of the Invention

In view of this, the purpose of the present invention is to provide a decision planning method for autonomous vehicles based on value distribution reinforcement learning, which can improve the safety and stability of the decision planning strategy of an autonomous vehicle in an uncertain environment.

To achieve the above object, the present invention provides the following technical solution:

A decision planning method for autonomous vehicles based on value distribution reinforcement learning, comprising the following steps:

S1: construct an unsignalized intersection scenario that accounts for uncertainty;

S2: construct a fully parameterized quantile function (FQF) network model as the autonomous vehicle control model;

S3: based on the state-action return distribution learned by the fully parameterized quantile function (FQF) model, introduce the conditional value at risk (CVaR) to generate risk-aware driving behavior.

Further, in step S1, constructing the unsignalized intersection scenario that accounts for uncertainty specifically includes: establishing an occlusion model, determining the surrounding-vehicle model, and establishing the distribution of surrounding-vehicle types.

Further, in step S1, establishing the occlusion model specifically includes: considering the occlusions on both sides of the intersection and analyzing the relative positions of the surrounding vehicles, the ego vehicle, and the intersection center, the critical distance d at which a surrounding vehicle can be observed by the ego vehicle is computed from the geometric relationship and used as the criterion for judging whether a surrounding vehicle is occluded:

[critical-distance formula, given as an image in the original]

where l is the width of each lane, d' is the distance from the front of the ego vehicle to the intersection center point, the distance from the road boundary to the occluding object is denoted by a symbol given only as an image in the original, and d is the distance from the front of a surrounding vehicle to the intersection center point.

Further, in step S1, determining the surrounding-vehicle model specifically includes: so that the surrounding vehicles can react to active changes in the environment, it is specified that in the simulation environment the behavior of the surrounding vehicles is controlled by the Intelligent Driver Model (IDM):

[IDM acceleration and desired-gap formulas, given as images in the original]

where a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration exponent, d_target is the desired longitudinal gap, d_0 is the minimum longitudinal gap, T_0 is the minimum time to collision of the vehicle, and Δv is the relative speed with respect to the preceding vehicle.

Further, in step S1, establishing the distribution of surrounding-vehicle types specifically includes: it is specified that in the simulation environment the surrounding vehicles comprise three types, aggressive, conservative, and normal; at every time step, vehicles of each type are added to the environment with a certain probability, and the surrounding-vehicle type space is {Aggressive, Conservative, Normal} (the formal type-space expression is given as an image in the original).

Further, in step S2, constructing the fully parameterized quantile function model specifically includes the following steps:

S21: construct the fraction proposal network: take the state information as the network input and output the optimal quantile fractions τ corresponding to each state-action pair;

S22: construct the quantile value network: take the optimal fractions obtained from the fraction proposal network as the input of the quantile value network and map them to the quantile function value of each fraction in the current state;

S23: construct the state space S: take the positions, speeds, and heading angles of the surrounding vehicles together with the position, speed, and heading angle of the ego vehicle as the state information observable by the ego vehicle; value distribution reinforcement learning performs the next decision planning step based on the ego vehicle's observations;

S24: construct the action space A: the action space is defined as the set of actions the ego vehicle can execute and constitutes the output of the value distribution reinforcement learning network; here the action space of the ego vehicle comprises three discrete actions, accelerate, cruise, and decelerate, where the specific accelerations of the accelerate and decelerate actions are computed by the Intelligent Driver Model;

S25: design the reward function, the total reward being the sum of three parts: the collision reward R_collision, the task-completion reward R_success, and the timeout reward R_timeout;

S26: according to the current state S_t, execute the action A_t and add the training data (S_t, A_t, R_t, S_{t+1}) obtained after the ego vehicle executes the action to the experience replay buffer;

S27: fit the return distribution;

S28: update the fraction proposal network: update the fraction proposal network by minimizing the 1-Wasserstein distance so as to determine the optimal fractions τ and make the fitted distribution closer to the true distribution;

S29: update the quantile value network: the update objective of the quantile value network is to minimize the quantile regression Huber loss so that the output of the quantile value network approaches the target value as closely as possible; the quantile value network is updated by gradient descent.

Further, step S27 specifically includes: fitting the distribution of the return by a weighted mixture of N Dirac functions:

Z_{θ,τ}(s, a) = Σ_{i=0}^{N-1} (τ_{i+1} - τ_i) δ_{θ_i(s,a)}

where N is the number of fractions, τ_i are the fractions generated by the fraction proposal network, satisfying τ_{i-1} < τ_i with τ_0 = 0 and τ_N = 1, and δ_{θ_i(s,a)} is the Dirac function at the quantile value θ_i(s, a) for the current state-action pair (s, a).

Further, step S28 specifically includes the following steps:

S281: the 1-Wasserstein distance is:

W_1(Z, Z_ω) = Σ_{i=0}^{N-1} ∫_{τ_i}^{τ_{i+1}} | F_Z^{-1}(u) - F_Z^{-1}(τ̂_i) | du

where N is the number of fractions, ω denotes the network parameters, F_Z^{-1}(τ̂_i) is the quantile function value at the fraction midpoint τ̂_i, and τ̂_i = (τ_i + τ_{i+1}) / 2;

S282: since the true quantile function F_Z^{-1} cannot actually be obtained, the quantile value function F_{Z,ω2}^{-1}, parameterized by the quantile value network parameters ω_2, is used as the true quantile value function in the current state;

S283: to avoid computing the 1-Wasserstein distance directly, the parameters ω_1 of the fraction proposal network are updated by gradient descent to minimize the 1-Wasserstein distance, using

∂W_1 / ∂τ_i = 2 F_{Z,ω2}^{-1}(τ_i) - F_{Z,ω2}^{-1}(τ̂_i) - F_{Z,ω2}^{-1}(τ̂_{i-1}), for 0 < i < N;

S284: the expected return of the fully parameterized quantile function is:

Q(s, a) = Σ_{i=0}^{N-1} (τ_{i+1} - τ_i) F_{Z,ω2}^{-1}(τ̂_i)(s, a).

Further, step S29 specifically includes the following steps:

S291: compute the temporal-difference (TD) errors:

δ_ij = r_t + γ F_{Z',ω2}^{-1}(τ̂_i)(s_{t+1}, a*) - F_{Z,ω2}^{-1}(τ̂_j)(s_t, a_t)

where δ_ij is the TD error, r_t is the reward at the current time step, γ is the discount factor, Z is the return distribution at the current time step, and Z' is the return distribution at the next time step;

S292: compute the quantile regression Huber loss:

ρ^κ_{τ̂}(δ_ij) = | τ̂ - I{δ_ij < 0} | · L_κ(δ_ij) / κ

L_κ(δ) = (1/2) δ² if |δ| ≤ κ, and κ (|δ| - κ/2) otherwise

where ρ^κ_{τ̂} is the quantile regression Huber loss, L_κ is the Huber loss function, and κ is the threshold;

S293: update the quantile value network with stochastic gradient descent on the loss

L = (1/N) Σ_{i=0}^{N-1} Σ_{j=0}^{N-1} ρ^κ_{τ̂_j}(δ^t_ij)

where δ^t_ij is the TD error at time t.

Further, step S3 specifically includes the following steps:

S31: based on the return distribution information obtained by the fully parameterized quantile function (FQF) model in step S2, the conditional value at risk (CVaR) corresponding to each distribution is computed as:

CVaR_α(Z) = E[ R | R ≤ VaR_α(Z) ]

where the value at risk is VaR_α(Z) = F_Z^{-1}(α) = inf{ r : P(R ≤ r) ≥ α }, Z is the distribution of the return, α is the cumulative probability, and R is the return, a random variable;

S32: select the optimal action: taking maximization of the CVaR value as the objective, select the optimal risk-sensitive behavior:

a_t* = argmax_a CVaR_α( Z(s_t, a) )

where a_t* is the optimal action selected in the current state s_t, Z is the distribution of the return, and α is the cumulative probability.

The beneficial effects of the present invention are as follows:

1) The present invention designs a simulation training environment for an unsignalized intersection that simultaneously considers the incomplete perception caused by occlusion in the environment and the behavioral uncertainty of surrounding traffic participants, making the scenario closer to a real driving scene.

2) The present invention designs a decision planning method based on value distribution reinforcement learning; a fully parameterized quantile function (FQF) is used to fit the value distribution more accurately, providing more accurate distribution information for the subsequent generation of risk-aware decision behavior.

3) The present invention designs a behavior generation method based on the conditional value at risk (CVaR), which uses the obtained return distribution information and accounts for the uncertainty in the environment to generate risk-aware driving behavior.

Other advantages, objects, and features of the present invention will be set forth to some extent in the following description and, to some extent, will be apparent to those skilled in the art from a study of what follows, or may be learned from the practice of the present invention. The objects and other advantages of the present invention can be realized and obtained through the following description.

Description of the Drawings

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings, in which:

Fig. 1 is the overall logical framework of the decision planning method for autonomous vehicles based on value distribution reinforcement learning according to the present invention;

Fig. 2 is the logical framework for constructing the simulation training environment;

Fig. 3 is the network structure of the fully parameterized quantile function (FQF).

Detailed Description of the Embodiments

The embodiments of the present invention are described below through specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the figures provided in the following embodiments only illustrate the basic idea of the present invention in a schematic way, and the following embodiments and the features in the embodiments can be combined with each other without conflict.

Referring to Figs. 1 to 3, the present invention provides a decision planning method for autonomous vehicles based on value distribution reinforcement learning. Considering the uncertainty present in real driving environments, a simulation training environment for an unsignalized intersection is established that simultaneously considers occlusion and different driver types. At the same time, considering the safety requirements of autonomous vehicle decision planning, a method based on value distribution reinforcement learning is proposed: the true distribution of the return is fitted by a fully parameterized quantile function (FQF), and the conditional value at risk (CVaR) is then applied to the obtained distribution information to generate risk-aware driving behavior and improve the autonomous vehicle's ability to handle uncertainty in the environment. The method specifically includes the following steps:

Step S1: construct the simulation training scenario of an unsignalized intersection, as shown in Fig. 2, which specifically includes the following steps:

S11: establish the occlusion model: considering the occlusions on both sides of the intersection and analyzing the relative positions of the surrounding vehicles, the ego vehicle, and the intersection center, the critical distance d at which a surrounding vehicle can be observed by the ego vehicle is computed from the geometric relationship and used as the criterion for judging whether a surrounding vehicle is occluded:

[critical-distance formula, given as an image in the original]

where l is the width of each lane, d' is the distance from the front of the ego vehicle to the intersection center point, the distance from the road boundary to the occluding object is denoted by a symbol given only as an image in the original, and d is the distance from the front of a surrounding vehicle to the intersection center point.

S12: determine the surrounding-vehicle model: so that the surrounding vehicles can react to changes in the environment, it is specified that in the simulation environment the behavior of the surrounding vehicles is controlled by the Intelligent Driver Model (IDM):

[IDM acceleration and desired-gap formulas, given as images in the original]

where a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration exponent, d_target is the desired longitudinal gap, d_0 is the minimum longitudinal gap, T_0 is the minimum time to collision of the vehicle, and Δv is the relative speed with respect to the preceding vehicle.
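For illustration only, the following Python sketch evaluates an IDM-style acceleration from the variables listed above; since the patent gives the formulas only as images, the exact functional form used below (the common IDM variant with a comfortable-deceleration parameter b) and all numerical defaults are assumptions, not transcriptions of the patent.

import math

def idm_acceleration(v, gap, dv,
                     a_max=1.0, b=3.0, v_target=10.0,
                     m=4, d0=2.0, T0=1.5):
    """Standard IDM acceleration (assumed variant; the patent's formulas are images only).

    v        : ego longitudinal speed [m/s]
    gap      : distance d to the preceding vehicle [m]
    dv       : speed difference to the preceding vehicle, v - v_lead [m/s]
    a_max, b : maximum acceleration / comfortable deceleration [m/s^2]
    v_target : desired longitudinal speed [m/s]
    m        : acceleration exponent
    d0, T0   : minimum gap [m] and time headway [s]
    """
    # desired gap: d_target = d0 + T0*v + v*dv / (2*sqrt(a_max*b))
    d_target = d0 + T0 * v + v * dv / (2.0 * math.sqrt(a_max * b))
    a = a_max * (1.0 - (v / v_target) ** m - (d_target / max(gap, 1e-3)) ** 2)
    # the patent limits the ego acceleration command to [-3, 1] m/s^2
    return max(-3.0, min(1.0, a))

print(idm_acceleration(v=8.0, gap=25.0, dv=2.0))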

S13: establish the distribution of surrounding-vehicle types: so that the ego vehicle can make different decisions for different driver types, it is specified that in the simulation environment the surrounding vehicles comprise three types, aggressive, conservative, and normal; at every time step, vehicles of each type are added to the environment with probabilities P_aggressive = 0.2, P_conservative = 0.3, and P_normal = 0.5, and the surrounding-vehicle type space is {Aggressive, Conservative, Normal} (the formal type-space expression is given as an image in the original).
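A minimal sketch of how a newly spawned surrounding vehicle could be assigned one of the three driver types with the stated probabilities (the names and helper below are illustrative, not from the patent):

import random

DRIVER_TYPES = ["aggressive", "conservative", "normal"]
TYPE_PROBS = [0.2, 0.3, 0.5]   # P_aggressive, P_conservative, P_normal

def sample_driver_type(rng=random):
    # draw one driver type per spawned vehicle at each time step
    return rng.choices(DRIVER_TYPES, weights=TYPE_PROBS, k=1)[0]

print([sample_driver_type() for _ in range(5)])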

S14: initialize the environment: randomly initialize the initial speeds, positions, and target speeds of the surrounding vehicles.

S2: construct the fully parameterized quantile function (FQF) model as the autonomous vehicle control model, as shown in Fig. 3, which specifically includes the following steps:

S21: construct the fraction proposal network: take the state information as the network input and output the optimal quantile fractions τ corresponding to each state-action pair.

S22: construct the quantile value network: take the optimal fractions obtained from the fraction proposal network as the input of the quantile value network and map them to the quantile function value of each fraction in the current state.
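The two sub-networks can be sketched roughly as follows in PyTorch; the layer sizes, the upstream state embedding, and the cosine embedding of the fractions are illustrative assumptions, not specifications taken from the patent.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class FractionProposalNet(nn.Module):
    """Maps a state embedding to N+1 monotonically increasing fractions tau_0..tau_N."""
    def __init__(self, embed_dim=64, n_fractions=32):
        super().__init__()
        self.fc = nn.Linear(embed_dim, n_fractions)

    def forward(self, state_embed):
        logits = self.fc(state_embed)                        # (B, N)
        probs = torch.softmax(logits, dim=-1)
        taus = torch.cumsum(probs, dim=-1)                   # tau_1..tau_N in (0, 1]
        taus = torch.cat([torch.zeros_like(taus[:, :1]), taus], dim=-1)  # prepend tau_0 = 0
        tau_hats = 0.5 * (taus[:, :-1] + taus[:, 1:])        # midpoints fed to the value net
        return taus, tau_hats

class QuantileValueNet(nn.Module):
    """Maps (state embedding, fractions) to one quantile value per fraction and action."""
    def __init__(self, embed_dim=64, n_actions=3, n_cos=64):
        super().__init__()
        self.n_cos = n_cos
        self.cos_embed = nn.Linear(n_cos, embed_dim)
        self.head = nn.Linear(embed_dim, n_actions)

    def forward(self, state_embed, tau_hats):
        # cosine embedding of the fractions, as in IQN/FQF-style networks (assumed here)
        i = torch.arange(1, self.n_cos + 1, device=tau_hats.device).float()
        cos = torch.cos(tau_hats.unsqueeze(-1) * i * math.pi)       # (B, N, n_cos)
        phi = F.relu(self.cos_embed(cos))                           # (B, N, embed_dim)
        x = state_embed.unsqueeze(1) * phi                          # combine state and fraction features
        return self.head(x)                                         # (B, N, n_actions) quantile values

The state embedding itself (e.g. a small MLP over the observation vector) is assumed to be produced upstream of both modules.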

S23: construct the state space S: take the positions, speeds, and heading angles of the surrounding vehicles together with the position, speed, and heading angle of the ego vehicle as the state information observable by the ego vehicle; value distribution reinforcement learning performs the next decision planning step based on the ego vehicle's observations:

S = { (x_i, y_i, v_xi, v_yi, ψ_i), i = 0, 1, ..., N }

where i = 0 denotes the ego vehicle, i ∈ [1, N] denotes the surrounding vehicles, x_i and y_i are the lateral and longitudinal positions of a vehicle, v_xi and v_yi are its lateral and longitudinal speeds, and ψ_i is its heading angle.

S24: construct the action space A: the action space is defined as the set of actions the ego vehicle can execute and constitutes the output of the value distribution reinforcement learning network; here the action space of the ego vehicle comprises accelerate, cruise, and decelerate, where the specific accelerations of the accelerate and decelerate actions are computed by the Intelligent Driver Model:

[IDM acceleration and desired-gap formulas, given as images in the original]

where a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration exponent, d_target is the desired longitudinal gap, d_0 is the minimum longitudinal gap, T_0 is the minimum time to collision of the vehicle, and Δv is the relative speed with respect to the preceding vehicle; the acceleration range is a ∈ [-3, 1] m/s².

S25: design the reward function: the reward function considers the sum of three parts, safety R_collision, success rate R_success, and efficiency R_timeout, i.e.:

R = R_collision + R_success + R_timeout

The first term R_collision is the collision reward, which requires that the ego vehicle must not collide with surrounding vehicles:

R_collision = -10

The second term R_success is the task-completion reward, which requires that the ego vehicle reaches the target location without collision:

R_success = 10

The third term R_timeout is the timeout reward, which requires that the ego vehicle does not exceed the specified maximum number of steps per episode:

R_timeout = -10
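The sparse reward can be written directly from the three terms above; the Boolean flag names are illustrative.

def reward(collided: bool, reached_goal: bool, timed_out: bool) -> float:
    r_collision = -10.0 if collided else 0.0      # R_collision
    r_success   =  10.0 if reached_goal else 0.0  # R_success
    r_timeout   = -10.0 if timed_out else 0.0     # R_timeout
    return r_collision + r_success + r_timeout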

S26: according to the current state S_t, execute the action A_t and add the training data (S_t, A_t, R_t, S_{t+1}) obtained after the ego vehicle executes the action to the experience replay buffer.

S27: fit the return distribution: fit the distribution of the return by a weighted mixture of N Dirac functions:

Z_{θ,τ}(s, a) = Σ_{i=0}^{N-1} (τ_{i+1} - τ_i) δ_{θ_i(s,a)}

where N is the number of fractions, τ_i are the fractions generated by the fraction proposal network, satisfying τ_{i-1} < τ_i with τ_0 = 0 and τ_N = 1, and δ_{θ_i(s,a)} is the Dirac function at the quantile value θ_i(s, a) for the current state-action pair (s, a).

S28: update the fraction proposal network: update the fraction proposal network by minimizing the 1-Wasserstein distance so as to determine the optimal fractions τ and make the fitted distribution closer to the true distribution. The specific operations are as follows:

S281: the 1-Wasserstein distance is:

W_1(Z, Z_ω) = Σ_{i=0}^{N-1} ∫_{τ_i}^{τ_{i+1}} | F_Z^{-1}(u) - F_Z^{-1}(τ̂_i) | du

where N is the number of fractions, ω denotes the network parameters, F_Z^{-1}(τ̂_i) is the quantile function value at the fraction midpoint τ̂_i, and τ̂_i = (τ_i + τ_{i+1}) / 2.

S282: since the true quantile function F_Z^{-1} cannot actually be obtained, the quantile value function F_{Z,ω2}^{-1}, parameterized by the quantile value network parameters ω_2, is used as the true quantile value function in the current state.

S283: to avoid computing the 1-Wasserstein distance directly, the parameters ω_1 of the fraction proposal network are updated by gradient descent to minimize the 1-Wasserstein distance, using

∂W_1 / ∂τ_i = 2 F_{Z,ω2}^{-1}(τ_i) - F_{Z,ω2}^{-1}(τ̂_i) - F_{Z,ω2}^{-1}(τ̂_{i-1}), for 0 < i < N

where F_{Z,ω2}^{-1}(τ_i) is the quantile function value at the fraction τ_i, τ̂_i = (τ_i + τ_{i+1}) / 2, and ω_2 are the quantile value network parameters.
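A sketch of how this fraction update is typically implemented in FQF-style code: the gradient of W_1 with respect to each interior fraction is computed from quantile values at τ_i and at the neighbouring midpoints, so W_1 itself is never evaluated. This follows the published FQF formulation and is an assumption here, not a transcription of the patent's image formulas.

import torch

def fraction_w1_grad(tau, tau_hat, quantile_at_tau, quantile_at_tau_hat):
    """Gradient dW1/dtau_i for the interior fractions tau_1..tau_{N-1}.

    tau                : (B, N+1) fractions, tau[:, 0] = 0, tau[:, -1] = 1
    tau_hat            : (B, N)   fraction midpoints
    quantile_at_tau    : (B, N+1) quantile values F^{-1}_{omega2}(tau_i)
    quantile_at_tau_hat: (B, N)   quantile values F^{-1}_{omega2}(tau_hat_i)
    """
    # dW1/dtau_i = 2*F^{-1}(tau_i) - F^{-1}(tau_hat_i) - F^{-1}(tau_hat_{i-1}), i = 1..N-1
    grad = (2.0 * quantile_at_tau[:, 1:-1]
            - quantile_at_tau_hat[:, 1:]
            - quantile_at_tau_hat[:, :-1])
    return grad.detach()   # treated as a constant when backpropagating into the proposal net

# the proposal-network parameters omega_1 are then stepped against this gradient, e.g.
# loss = (fraction_w1_grad(...) * tau[:, 1:-1]).sum(dim=-1).mean(); loss.backward(); optimizer.step()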

S284: the expected return of the fully parameterized quantile function is:

Q(s, a) = Σ_{i=0}^{N-1} (τ_{i+1} - τ_i) F_{Z,ω2}^{-1}(τ̂_i)(s, a)

where N is the number of fractions, F_{Z,ω2}^{-1}(τ̂_i) is the quantile function value at the fraction midpoint τ̂_i, τ̂_i = (τ_i + τ_{i+1}) / 2, and ω_2 are the quantile value network parameters.

S29: update the quantile value network: the update objective of the quantile value network is to minimize the quantile regression Huber loss so that the output of the quantile value network approaches the target value as closely as possible; once the loss function is obtained, the quantile value network is updated by gradient descent. The specific operations are as follows:

S291: compute the temporal-difference (TD) errors:

δ_ij = r_t + γ F_{Z',ω2}^{-1}(τ̂_i)(s_{t+1}, a*) - F_{Z,ω2}^{-1}(τ̂_j)(s_t, a_t)

where r_t is the reward at the current time step, γ is the discount factor, ω_1 are the fraction proposal network parameters, F_{Z,ω2}^{-1}(τ̂_i) is the quantile function value at the fraction midpoint τ̂_i with τ̂_i = (τ_i + τ_{i+1}) / 2, Z is the return distribution at the current time step, and Z' is the return distribution at the next time step.

S292: compute the quantile regression Huber loss:

ρ^κ_{τ̂}(δ_ij) = | τ̂ - I{δ_ij < 0} | · L_κ(δ_ij) / κ

where the Huber loss is L_κ(δ) = (1/2) δ² if |δ| ≤ κ, and κ (|δ| - κ/2) otherwise, δ_ij is the TD error, and κ is the threshold.
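A compact PyTorch sketch of this quantile-regression Huber loss for a matrix of TD errors δ_ij; κ = 1 and the exact reduction over i and j are assumptions rather than patent specifications.

import torch
import torch.nn.functional as F

def quantile_huber_loss(td_errors, tau_hat, kappa=1.0):
    """td_errors: (B, N, N) matrix delta_ij (i: target quantiles, j: current quantiles);
    tau_hat: (B, N) fraction midpoints for the current quantiles."""
    huber = F.huber_loss(td_errors, torch.zeros_like(td_errors),
                         delta=kappa, reduction="none")          # L_kappa(delta_ij)
    tau = tau_hat.unsqueeze(1)                                    # (B, 1, N), broadcast over i
    weight = torch.abs(tau - (td_errors.detach() < 0).float())   # |tau_hat_j - 1{delta_ij < 0}|
    # rho_tau(delta) = weight * L_kappa(delta) / kappa; average over targets i, sum over quantiles j
    return (weight * huber / kappa).mean(dim=1).sum(dim=1).mean()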

S293: update the quantile value network with stochastic gradient descent on the loss

L = (1/N) Σ_{i=0}^{N-1} Σ_{j=0}^{N-1} ρ^κ_{τ̂_j}(δ^t_ij)

where N is the number of fractions, ρ^κ_{τ̂_j} is the quantile regression Huber loss, δ^t_ij is the TD error at time t, κ is the threshold, and τ̂_j = (τ_j + τ_{j+1}) / 2 is the fraction midpoint.

S3: based on the return distribution obtained in step S2, introduce the conditional value at risk (CVaR) to generate risk-aware driving behavior, which specifically includes the following steps:

S31: based on the return distribution information obtained in step S2, compute the conditional value at risk (CVaR) corresponding to each distribution:

CVaR_α(Z) = E[ R | R ≤ VaR_α(Z) ]

where the value at risk is VaR_α(Z) = F_Z^{-1}(α) = inf{ r : P(R ≤ r) ≥ α }, Z is the distribution of the return, α is the cumulative probability, and R is the return, a random variable.
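Because the quantile values are available at known fractions, CVaR_α can be approximated by the fraction-weighted average of the quantile values lying in the lower α tail; a minimal sketch (the handling of the bin containing VaR_α is an assumed approximation, not a formula from the patent):

import numpy as np

def cvar_from_quantiles(taus, thetas, alpha=0.2):
    """taus: N+1 sorted fractions with taus[0]=0, taus[-1]=1; thetas: N quantile values.
    Returns the weighted mean of the return below the alpha-quantile (VaR_alpha)."""
    weights = np.diff(taus)                        # tau_{i+1} - tau_i
    upper = np.minimum(np.cumsum(weights), alpha)  # bin mass clipped at alpha
    lower = np.concatenate(([0.0], upper[:-1]))
    tail_mass = upper - lower                      # mass of each bin inside the alpha tail
    return float(np.sum(tail_mass * thetas) / alpha)

print(cvar_from_quantiles(np.array([0, .25, .5, .75, 1.]),
                          np.array([-4., -1., 1., 3.]), alpha=0.25))   # mean of the worst quartile: -4.0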

S32: select the optimal action: taking maximization of the CVaR value as the objective, select the optimal risk-sensitive behavior:

a_t* = argmax_a CVaR_α( Z(s_t, a) )

where a_t* is the optimal action selected in the current state s_t, Z is the distribution of the return, and α is the cumulative probability.
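Action selection then evaluates this tail average for each discrete action and takes the maximizer; a self-contained sketch with illustrative quantile values (all numbers are made up for the example):

import numpy as np

def select_risk_aware_action(taus, thetas_per_action, alpha=0.2):
    """taus: N+1 sorted fractions; thetas_per_action: (num_actions, N) quantile values,
    one row per discrete action (accelerate / cruise / decelerate)."""
    weights = np.diff(taus)
    upper = np.minimum(np.cumsum(weights), alpha)
    tail_mass = upper - np.concatenate(([0.0], upper[:-1]))      # mass inside the alpha tail
    cvars = thetas_per_action @ tail_mass / alpha                 # CVaR_alpha per action
    return int(np.argmax(cvars))                                  # a* = argmax_a CVaR_alpha(Z(s_t, a))

taus = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
thetas = np.array([[-6.0, -1.0, 2.0, 8.0],    # accelerate: high upside, heavy left tail
                   [-1.0,  0.0, 1.0, 2.0],    # cruise
                   [-0.5,  0.0, 0.5, 1.0]])   # decelerate: safest worst case
print(select_risk_aware_action(taus, thetas, alpha=0.25))   # picks the action with the best worst-case tail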

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention can be modified or equivalently replaced without departing from the purpose and scope of the technical solution, and all such modifications shall be covered by the scope of the claims of the present invention.

Claims (9)

1. A decision planning method for autonomous vehicles based on value distribution reinforcement learning, characterized by comprising the following steps:
S1: constructing an unsignalized intersection scenario that accounts for uncertainty;
S2: constructing a fully parameterized quantile function model as the autonomous vehicle control model;
S3: based on the state-action return distribution learned by the fully parameterized quantile function model, introducing the conditional value at risk to generate risk-aware driving behavior.
2. The decision planning method for autonomous vehicles according to claim 1, characterized in that in step S1, constructing the unsignalized intersection scenario that accounts for uncertainty specifically comprises: establishing an occlusion model, determining the surrounding-vehicle model, and establishing the distribution of surrounding-vehicle types.
3. The decision planning method for autonomous vehicles according to claim 2, characterized in that in step S1, establishing the occlusion model specifically comprises: considering the occlusions on both sides of the intersection and analyzing the relative positions of the surrounding vehicles, the ego vehicle, and the intersection center, computing from the geometric relationship the critical distance d at which a surrounding vehicle can be observed by the ego vehicle, and using it as the criterion for judging whether a surrounding vehicle is occluded:
[critical-distance formula, given as an image in the original]
wherein l is the width of each lane, d' is the distance from the front of the ego vehicle to the intersection center point, the distance from the road boundary to the occluding object is denoted by a symbol given only as an image in the original, and d is the distance from the front of a surrounding vehicle to the intersection center point.
4. The decision planning method for autonomous vehicles according to claim 2, characterized in that determining the surrounding-vehicle model in step S1 specifically comprises: the behavior of the surrounding vehicles is controlled by the Intelligent Driver Model:
[IDM acceleration and desired-gap formulas, given as images in the original]
wherein a is the acceleration, a_max is the maximum acceleration, v is the longitudinal speed of the vehicle, v_target is the desired longitudinal speed, m is the acceleration exponent, d_target is the desired longitudinal gap, d_0 is the minimum longitudinal gap, T_0 is the minimum time to collision of the vehicle, and Δv is the relative speed with respect to the preceding vehicle.
5. The decision planning method for autonomous vehicles according to claim 1, characterized in that in step S2, constructing the fully parameterized quantile function model specifically comprises the following steps:
S21: constructing the fraction proposal network: taking the state information as the network input and outputting the optimal quantile fractions τ corresponding to each state-action pair;
S22: constructing the quantile value network: taking the optimal fractions obtained from the fraction proposal network as the input of the quantile value network and mapping them to the quantile function value of each fraction in the current state;
S23: constructing the state space S: taking the positions, speeds, and heading angles of the surrounding vehicles together with the position, speed, and heading angle of the ego vehicle as the state information observable by the ego vehicle, value distribution reinforcement learning performing the next decision planning step based on the ego vehicle's observations;
S24: constructing the action space A: the action space is defined as the set of actions the ego vehicle can execute and is the output of the value distribution reinforcement learning network, the action space of the ego vehicle comprising three discrete actions, accelerate, cruise, and decelerate, wherein the specific accelerations of the accelerate and decelerate actions are computed by the Intelligent Driver Model;
S25: designing the reward function, the total reward being equal to the sum of three parts: the collision reward R_collision, the task-completion reward R_success, and the timeout reward R_timeout;
S26: according to the current state S_t, executing the action A_t and adding the training data (S_t, A_t, R_t, S_{t+1}) obtained after the ego vehicle executes the action to the experience replay buffer;
S27: fitting the return distribution;
S28: updating the fraction proposal network: updating the fraction proposal network by minimizing the 1-Wasserstein distance to determine the optimal fractions τ so that the fitted distribution is closer to the true distribution;
S29: updating the quantile value network: the update objective of the quantile value network is to minimize the quantile regression Huber loss so that the output of the quantile value network approaches the target value as closely as possible, the quantile value network being updated by gradient descent.
6. The decision planning method for autonomous vehicles according to claim 5, characterized in that step S27 specifically comprises: fitting the distribution of the return by a weighted mixture of N Dirac functions:
Z_{θ,τ}(s, a) = Σ_{i=0}^{N-1} (τ_{i+1} - τ_i) δ_{θ_i(s,a)}
wherein N is the number of fractions, τ_i are the fractions generated by the fraction proposal network, satisfying τ_{i-1} < τ_i with τ_0 = 0 and τ_N = 1, and δ_{θ_i(s,a)} is the Dirac function at the quantile value θ_i(s, a) in the current state (s, a).
7. The decision planning method for autonomous vehicles according to claim 6, characterized in that step S28 specifically comprises the following steps:
S281: the 1-Wasserstein distance is:
W_1(Z, Z_ω) = Σ_{i=0}^{N-1} ∫_{τ_i}^{τ_{i+1}} | F_Z^{-1}(u) - F_Z^{-1}(τ̂_i) | du
wherein N is the number of fractions, ω denotes the network parameters, F_Z^{-1}(τ̂_i) is the quantile function value at the fraction midpoint τ̂_i, and τ̂_i = (τ_i + τ_{i+1}) / 2;
S282: using the quantile value function F_{Z,ω2}^{-1} with the quantile value network parameters ω_2 as the true quantile value function in the current state;
S283: updating the parameters ω_1 of the fraction proposal network by gradient descent to minimize the 1-Wasserstein distance, using
∂W_1 / ∂τ_i = 2 F_{Z,ω2}^{-1}(τ_i) - F_{Z,ω2}^{-1}(τ̂_i) - F_{Z,ω2}^{-1}(τ̂_{i-1}), for 0 < i < N;
S284: the expected return of the fully parameterized quantile function is:
Q(s, a) = Σ_{i=0}^{N-1} (τ_{i+1} - τ_i) F_{Z,ω2}^{-1}(τ̂_i)(s, a).
8. The decision planning method for autonomous vehicles according to claim 7, characterized in that step S29 specifically comprises the following steps:
S291: computing the temporal-difference errors:
δ_ij = r_t + γ F_{Z',ω2}^{-1}(τ̂_i)(s_{t+1}, a*) - F_{Z,ω2}^{-1}(τ̂_j)(s_t, a_t)
wherein δ_ij is the TD error, r_t is the reward at the current time step, γ is the discount factor, Z is the return distribution at the current time step, and Z' is the return distribution at the next time step;
S292: computing the quantile regression Huber loss:
ρ^κ_{τ̂}(δ_ij) = | τ̂ - I{δ_ij < 0} | · L_κ(δ_ij) / κ
L_κ(δ) = (1/2) δ² if |δ| ≤ κ, and κ (|δ| - κ/2) otherwise
wherein ρ^κ_{τ̂} is the quantile regression Huber loss, L_κ is the Huber loss function, and κ is the threshold;
S293: updating the quantile value network with stochastic gradient descent on the loss
L = (1/N) Σ_{i=0}^{N-1} Σ_{j=0}^{N-1} ρ^κ_{τ̂_j}(δ^t_ij)
wherein δ^t_ij is the TD error at time t.
9. The decision planning method for autonomous vehicles according to claim 1, characterized in that step S3 specifically comprises the following steps:
S31: based on the return distribution information obtained by the fully parameterized quantile function model in step S2, computing the conditional value at risk (CVaR) corresponding to each distribution as:
CVaR_α(Z) = E[ R | R ≤ VaR_α(Z) ]
wherein the value at risk is VaR_α(Z) = F_Z^{-1}(α) = inf{ r : P(R ≤ r) ≥ α }, Z is the distribution of the return, α is the cumulative probability, and R is the return;
S32: selecting the optimal action, taking maximization of the CVaR value as the objective, and selecting the optimal risk-sensitive behavior:
a_t* = argmax_a CVaR_α( Z(s_t, a) )
wherein a_t* is the optimal action selected in the current state s_t, Z is the distribution of the return, and α is the cumulative probability.
CN202210487160.8A 2022-05-06 2022-05-06 Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning Active CN114707359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210487160.8A CN114707359B (en) 2022-05-06 2022-05-06 Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210487160.8A CN114707359B (en) 2022-05-06 2022-05-06 Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning

Publications (2)

Publication Number Publication Date
CN114707359A true CN114707359A (en) 2022-07-05
CN114707359B CN114707359B (en) 2025-03-21

Family

ID=82176207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210487160.8A Active CN114707359B (en) 2022-05-06 2022-05-06 Decision-making planning method for autonomous driving vehicles based on value distribution reinforcement learning

Country Status (1)

Country Link
CN (1) CN114707359B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117208019A (en) * 2023-11-08 2023-12-12 北京理工大学前沿技术研究院 Longitudinal decision method and system under perceived occlusion based on value distribution reinforcement learning
CN118212808A (en) * 2024-02-02 2024-06-18 长安大学 Method, system and equipment for planning traffic decision of signalless intersection
CN118323163A (en) * 2024-04-30 2024-07-12 北京理工大学前沿技术研究院 Automatic driving decision method and system considering shielding uncertainty
CN118747519A (en) * 2024-06-06 2024-10-08 中国电子科技集团有限公司电子科学研究院 A risk-adaptive navigation algorithm for unmanned boats based on distributed reinforcement learning
CN119377624A (en) * 2024-12-26 2025-01-28 杭州衡泰技术股份有限公司 A strategy evaluation system and risk control method based on value distribution environment model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110562258A (en) * 2019-09-30 2019-12-13 驭势科技(北京)有限公司 Method for vehicle automatic lane change decision, vehicle-mounted equipment and storage medium
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of driverless cars based on reinforcement learning
WO2021213616A1 (en) * 2020-04-20 2021-10-28 Volvo Truck Corporation Tactical decision-making through reinforcement learning with uncertainty estimation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of driverless cars based on reinforcement learning
CN110562258A (en) * 2019-09-30 2019-12-13 驭势科技(北京)有限公司 Method for vehicle automatic lane change decision, vehicle-mounted equipment and storage medium
WO2021213616A1 (en) * 2020-04-20 2021-10-28 Volvo Truck Corporation Tactical decision-making through reinforcement learning with uncertainty estimation

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CARL-JOHAN HOEL: "Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation", 2020 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 8 January 2021 (2021-01-08), pages 1563 - 1569 *
DEREK YANG等: "Fully parameterized quantile function for distributional reinforcement learning", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), vol. 32, 31 December 2019 (2019-12-31), pages 1 - 10 *
JULIAN BERNHARD等: "Addressing Inherent Uncertainty: Risk-Sensitive Behavior Generation for Automated Driving using Distributional Reinforcement Learning", 2019 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 29 August 2019 (2019-08-29), pages 1 - 9 *
LOUIS RUWAID等: "A study of the exploration/exploitation trade-off in reinforcement learning: Applied to autonomous driving", COMPUTER AND INFORMATION SCIENCES, 29 July 2019 (2019-07-29), pages 1 - 49 *
XIAO LIN等: "Decision Making through Occluded Intersections for Autonomous Driving", 2019 IEEE INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC), 28 November 2019 (2019-11-28), pages 2449 - 2455 *
XIAOLIN TANG等: "Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic", IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, vol. 71, no. 5, 22 February 2022 (2022-02-22), pages 4706, XP011908845, DOI: 10.1109/TVT.2022.3151651 *
XIAOLIN TANG等: "Uncertainty-Aware Decision-Making for Autonomous Driving at Uncontrolled Intersections", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTA TION SYSTEMS, vol. 24, no. 9, 30 September 2023 (2023-09-30), pages 9725 - 9735 *
YANG KAI et al.: "Research on Safe Decision-Making Methods for Autonomous Driving at Unsignalized Intersections", Journal of Mechanical Engineering (机械工程学报), 11 March 2024 (2024-03-11), pages 1 - 13 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117208019A (en) * 2023-11-08 2023-12-12 北京理工大学前沿技术研究院 Longitudinal decision method and system under perceived occlusion based on value distribution reinforcement learning
CN117208019B (en) * 2023-11-08 2024-04-05 北京理工大学前沿技术研究院 Longitudinal decision-making method and system under perceived occlusion based on value distribution reinforcement learning
CN118212808A (en) * 2024-02-02 2024-06-18 长安大学 Method, system and equipment for planning traffic decision of signalless intersection
CN118323163A (en) * 2024-04-30 2024-07-12 北京理工大学前沿技术研究院 Automatic driving decision method and system considering shielding uncertainty
CN118323163B (en) * 2024-04-30 2025-03-18 北京理工大学前沿技术研究院 Autonomous driving decision-making method and system considering occlusion uncertainty
CN118747519A (en) * 2024-06-06 2024-10-08 中国电子科技集团有限公司电子科学研究院 A risk-adaptive navigation algorithm for unmanned boats based on distributed reinforcement learning
CN118747519B (en) * 2024-06-06 2025-02-11 中国电子科技集团有限公司电子科学研究院 A risk-adaptive navigation algorithm for unmanned boats based on distributed reinforcement learning
CN119377624A (en) * 2024-12-26 2025-01-28 杭州衡泰技术股份有限公司 A strategy evaluation system and risk control method based on value distribution environment model

Also Published As

Publication number Publication date
CN114707359B (en) 2025-03-21

Similar Documents

Publication Publication Date Title
Wang et al. Research on autonomous driving decision-making strategies based deep reinforcement learning
CN114707359A (en) A Decision Planning Method for Autonomous Vehicles Based on Value Distribution Reinforcement Learning
CN110969848B (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN111222630B (en) A Learning Method for Autonomous Driving Rules Based on Deep Reinforcement Learning
CN114013443B (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN111833597B (en) Autonomous decision making in traffic situations with planning control
CN113071487B (en) Automatic driving vehicle control method and device and cloud equipment
Makantasis et al. Deep reinforcement‐learning‐based driving policy for autonomous road vehicles
CN115257745A (en) A lane change decision control method for autonomous driving based on rule fusion reinforcement learning
CN110843789A (en) Vehicle lane change intention prediction method based on time sequence convolution network
CN115214672A (en) A human-like decision-making, planning and control method for autonomous driving considering workshop interaction
CN115257746A (en) Uncertainty-considered decision control method for lane change of automatic driving automobile
US11613269B2 (en) Learning safety and human-centered constraints in autonomous vehicles
CN107479547A (en) Decision tree behaviour decision making algorithm based on learning from instruction
CN116612636B (en) Signal lamp cooperative control method based on multi-agent reinforcement learning
Pan et al. Research on the behavior decision of connected and autonomous vehicle at the unsignalized intersection
CN115303297A (en) End-to-end autonomous driving control method and device in urban scenarios based on attention mechanism and graphical model reinforcement learning
CN110646007B (en) A Vehicle Driving Method Based on Formal Representation
Ren et al. Self-learned intelligence for integrated decision and control of automated vehicles at signalized intersections
CN116572993A (en) Intelligent vehicle risk sensitive sequential behavior decision method, device and equipment
CN117734715A (en) Automatic driving control method, system, equipment and storage medium based on reinforcement learning
Fabiani et al. A mixed-logical-dynamical model for automated driving on highways
El Hamdani et al. A Markov decision process model for a reinforcement learning-based autonomous pedestrian crossing protocol
CN118917179A (en) Multi-mode reinforcement learning vehicle decision-making planning method with compensation feedback
CN118583187A (en) Path optimization selection method and system based on time-sharing planning and radar-vision fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant