
CN115257819A - Decision-making method for safe driving of large-scale commercial vehicle in urban low-speed environment - Google Patents

Decision-making method for safe driving of large-scale commercial vehicle in urban low-speed environment

Info

Publication number
CN115257819A
CN115257819A (application number CN202211070514.5A)
Authority
CN
China
Prior art keywords
sub
vehicle
driving
decision
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211070514.5A
Other languages
Chinese (zh)
Other versions
CN115257819B
Inventor
李旭
胡玮明
胡悦
胡锦超
陆红伟
徐启敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202211070514.5A
Publication of CN115257819A
Application granted
Publication of CN115257819B
Legal status: Active
Anticipated expiration

Classifications

    • B60W60/001 Planning or execution of driving tasks (drive control systems specially adapted for autonomous road vehicles)
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • G06F17/15 Correlation function computation including computation of convolution operations
    • B60W2050/0001 Details of the control system
    • B60W2050/0028 Mathematical models, e.g. for simulation
    • B60W2300/125 Heavy duty trucks
    • B60W2420/403 Image sensing, e.g. optical camera
    • B60W2420/408 Radar; Laser, e.g. lidar
    • B60W2556/10 Historical data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Transportation (AREA)
  • Computational Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Analysis (AREA)
  • Automation & Control Theory (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mechanical Engineering (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a decision-making method for the safe driving of large commercial vehicles in an urban low-speed environment. First, the safe driving behavior of human drivers in the urban traffic environment is collected to build a safe driving behavior dataset. Second, a multi-head-attention-based safe driving decision-making model for commercial vehicles is constructed. The model contains two sub-networks: a double deep Q-network and a generative adversarial imitation learning network. The double deep Q-network learns safe driving strategies in edge scenarios such as dangerous and conflict scenarios in an unsupervised manner, while the generative adversarial imitation learning sub-network imitates the safe driving behavior of human drivers under different driving conditions and operating conditions. Finally, the safe driving decision-making model is trained to obtain driving strategies for different driving conditions and operating conditions. The proposed method can imitate the safe driving behavior of human drivers and takes into account the influence of factors such as visual blind spots and sudden obstacles on driving safety.

Description

Decision-making method for safe driving of large commercial vehicles in an urban low-speed environment

Technical Field

The invention relates to a driving decision-making method for commercial vehicles, in particular to a method for making safe driving decisions for large commercial vehicles in an urban low-speed environment, and belongs to the technical field of automotive safety.

Background Art

In urban traffic environments, road traffic accidents caused by the driver's blind spots account for the highest proportion, and the vehicles at fault in these accidents are mostly large commercial vehicles such as heavy goods vehicles, large passenger cars, and truck-trailer combinations. Unlike passenger cars, large commercial vehicles are characterized by a large body, long wheelbase, and high driving position, and many static and dynamic visual blind spots exist around the vehicle body, for example in front of the cab, near the right front wheel, and below the right rear-view mirror. When a commercial vehicle turns, especially when it turns right, it can easily collide with or even run over pedestrians and non-motorized vehicles inside these blind spots, which is where the most serious safety accidents occur. In addition, compared with the relatively closed highway scenario, the urban traffic environment with mixed motorized and non-motorized traffic contains more types and larger numbers of traffic participants, and commercial vehicles frequently encounter sudden obstacles, making it more dangerous. Therefore, how to improve the driving safety of commercial vehicles in an open urban traffic environment with interference from multiple traffic targets is a key problem that urgently needs to be solved and is also the focus of ensuring urban road traffic safety.

At present, actively developing autonomous driving technology has become a widely recognized means, both domestically and internationally, of ensuring vehicle operating safety. As a key link in achieving high-quality autonomous driving, driving decision-making determines the rationality and safety of the autonomous operation of commercial vehicles. If the driver can be warned of danger 1.5 seconds before a traffic accident and provided with a reliable and effective safe driving strategy, the frequency of traffic accidents caused by factors such as visual blind spots and sudden obstacles can be greatly reduced. Therefore, studying safe driving decision-making methods for large commercial vehicles plays an important role in ensuring their driving safety.

Many patents and papers have studied collision-avoidance driving decision-making, but mainly for passenger vehicles. Compared with passenger vehicles, commercial vehicles have larger visual blind spots and longer braking distances and braking times, so collision-avoidance decision-making methods designed for passenger vehicles cannot be applied directly to commercial vehicles. On the other hand, some patents have studied safe driving decision-making for commercial vehicles, such as a highly human-like safe driving decision-making method for autonomous commercial vehicles (application number 202210158758.2) and a deep-learning-based lane-change decision-making method for large commercial vehicles (publication number CN113954837A), but these decision-making methods are all oriented to highway scenarios.

Unlike highway scenarios with few types of traffic participants, the urban traffic environment is open, contains interference from many traffic targets, and mixes motorized and non-motorized traffic. In particular, factors such as vehicle blind spots and sudden obstacles pose greater challenges to the safe driving of commercial vehicles in urban traffic. Therefore, safe driving decision-making methods for commercial vehicles designed for highway scenarios cannot be applied directly to the open, interference-rich urban traffic environment.

In general, for the open urban traffic environment with interference from multiple traffic targets, existing methods can hardly meet the requirements of commercial vehicles for safe driving decision-making. There is still a lack of safe driving decision-making methods that can provide concrete driving suggestions such as driving actions and driving paths, and in particular a lack of research on safe driving decision-making for large commercial vehicles that considers the influence of visual blind spots and sudden obstacles.

Summary of the Invention

Purpose of the invention: in order to realize safe driving decision-making for large commercial vehicles in an urban low-speed environment and ensure driving safety, the present invention proposes a safe driving decision-making method for large commercial vehicles in an urban low-speed environment, aimed at autonomous commercial vehicles such as heavy goods vehicles and heavy trucks. The method comprehensively considers the influence of factors such as visual blind spots, sudden obstacles, and different operating conditions on driving safety, can imitate the safe driving behavior of human drivers, and provides more reasonable and safer driving strategies for autonomous commercial vehicles, thereby effectively ensuring their driving safety. In addition, the method does not need to consider complex vehicle dynamics equations or body parameters, the calculation is simple and clear, the safe driving strategy of an autonomous commercial vehicle can be output in real time, and the sensors used are inexpensive, which facilitates large-scale deployment.

Technical solution: to achieve the purpose of the invention, the technical solution adopted by the present invention is a safe driving decision-making method for large commercial vehicles in an urban low-speed environment. First, the safe driving behavior of human drivers in the urban traffic environment is collected to construct a safe driving behavior dataset. Second, a multi-head-attention-based safe driving decision-making model for commercial vehicles is constructed. The model contains two sub-networks: a double deep Q-network (DDQN) and a generative adversarial imitation learning (GAIL) network. The DDQN learns safe driving strategies in edge scenarios such as dangerous and conflict scenarios in an unsupervised manner, while the GAIL sub-network imitates safe driving behavior under different driving conditions and operating conditions. Finally, the safe driving decision-making model is trained to obtain driving strategies for different driving conditions and operating conditions, realizing high-level decision output for the safe driving behavior of commercial vehicles. The method specifically includes the following steps:

Step 1: Collect the safe driving behavior of human drivers in the urban traffic environment

In order to achieve driving decisions comparable to those of human drivers, the present invention collects safe driving behavior under different driving conditions and operating conditions through real road tests and driving simulation, and then constructs a dataset characterizing the safe driving behavior of human drivers. This includes the following four sub-steps:

Sub-step 1: Build a synchronized multi-dimensional target information acquisition system using millimeter-wave radar, a 128-line lidar, vision sensors, a BeiDou positioning sensor, and an inertial sensor.

Sub-step 2: In a real urban environment, several drivers in turn drive a commercial vehicle equipped with the synchronized multi-dimensional target information acquisition system. Data related to various driving behaviors, such as lane changing, lane keeping, car following, and acceleration and deceleration, are collected and processed to obtain multi-source heterogeneous descriptions of each driving behavior, for example the distances to obstacles in different directions measured by the radar or vision sensors, the position, velocity, acceleration, and yaw rate measured by the BeiDou and inertial sensors, and the steering wheel angle measured by on-board sensors.

Sub-step 3: In order to imitate safe driving behavior in edge scenarios such as dangerous and conflict scenarios, a virtual urban scene based on hardware-in-the-loop simulation is built. The constructed urban traffic scenarios include the following three categories:

(1) While the vehicle is driving, a traffic participant approaches laterally in front of the vehicle (i.e., a sudden obstacle);

(2) While the vehicle is turning, a stationary traffic participant is present in the vehicle's visual blind spot;

(3) While the vehicle is turning, a moving traffic participant is present in the vehicle's visual blind spot.

These traffic scenarios contain multiple road network structures (straight roads, curves, and intersections) and multiple types of traffic participants (commercial vehicles, passenger cars, non-motorized vehicles, and pedestrians).

Several drivers drive the commercial vehicle in the virtual scene through real controls (steering wheel, accelerator, and brake pedal), and information such as the ego vehicle's lateral and longitudinal position, lateral and longitudinal velocity, lateral and longitudinal acceleration, and its relative distance and relative velocity to surrounding traffic participants is collected.

Sub-step 4: Based on the data collected in the real urban environment and in the driving simulation environment, a driving behavior dataset for learning safe driving decisions is constructed, which can be expressed as:

X = {(s_1, a_1), (s_2, a_2), ..., (s_n, a_n)}    (1)

where X is the set of state-action pairs, i.e., the constructed dataset characterizing the safe driving behavior of human drivers; (s_j, a_j) is the state-action pair at time j, where s_j is the state at time j and a_j is the action taken by the human driver given state s_j; and n is the number of state-action pairs in the dataset.
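For illustration, a minimal sketch of how the dataset X of equation (1) could be organized in code is given below. The container type and the field names (ego_state, relative_states, grid) are assumptions made for this sketch and are not specified by the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class State:
    """One state s_j: ego motion, relative motion of nearby participants, occupancy grid."""
    ego_state: np.ndarray        # S_1(t): [p_x, p_y, v_x, v_y, a_x, a_y, theta_s]
    relative_states: np.ndarray  # S_2(t): one row per surrounding traffic participant
    grid: np.ndarray             # S_3(t): binary "presence" occupancy grid

# The dataset X of equation (1): n expert "state-action" pairs (s_j, a_j),
# where a_j indexes one of the six discrete actions of equation (5).
DrivingDataset = List[Tuple[State, int]]

def add_sample(dataset: DrivingDataset, state: State, action: int) -> None:
    """Append one recorded human driving decision to the dataset."""
    dataset.append((state, action))
```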

Step 2: Construct a multi-head-attention-based safe driving decision-making model for commercial vehicles

In order to realize safe driving decision-making for large commercial vehicles in the urban low-speed environment, the present invention comprehensively considers the influence of visual blind spots, sudden obstacles, operating conditions, and other factors on driving safety and establishes a safe driving decision-making model for commercial vehicles. Since deep reinforcement learning combines the perception ability of deep learning with the decision-making ability of reinforcement learning and explores the traffic environment in an unsupervised manner, the present invention uses deep reinforcement learning to learn safe driving strategies in edge scenarios such as dangerous and conflict scenarios. In addition, since imitation learning can reproduce demonstrated behavior, the present invention uses imitation learning to imitate the safe driving behavior of human drivers under different driving conditions and operating conditions. The constructed safe driving decision-making model therefore consists of two parts, described as follows:

Sub-step 1: Define the basic parameters of the safe driving decision-making model

First, the safe driving decision-making problem in the urban low-speed environment is formulated as a finite Markov decision process. Then the basic parameters of the safe driving decision-making model are defined.

(1) Define the state space

In order to describe the motion states of the ego vehicle and nearby traffic participants, the present invention constructs the state space from time-series data and an occupancy grid map, as follows:

S_t = [S_1(t), S_2(t), S_3(t)]    (2)

where S_t is the state space at time t, S_1(t) and S_2(t) are the parts of the state space derived from time-series data at time t, and S_3(t) is the part derived from the occupancy grid map at time t.

First, the motion state of the ego vehicle is described by its continuous position, velocity, acceleration, and heading angle:

S_1(t) = [p_x, p_y, v_x, v_y, a_x, a_y, θ_s]    (3)

where p_x and p_y are the lateral and longitudinal position of the ego vehicle in meters, v_x and v_y are its lateral and longitudinal velocity in meters per second, a_x and a_y are its lateral and longitudinal acceleration in meters per second squared, and θ_s is its heading angle in degrees.

Next, the motion state of the surrounding traffic participants is described by their relative motion with respect to the ego vehicle:

S_2(t) = [Δd_i, Δv_i, a_i]    (4)

where Δd_i, Δv_i, and a_i are the relative distance, relative velocity, and acceleration between the ego vehicle and the i-th traffic participant, in meters, meters per second, and meters per second squared, respectively.

Existing state space definitions often use a fixed encoding, i.e., the number of surrounding traffic participants considered is fixed. In real urban traffic scenes, however, the number and positions of the traffic participants around the commercial vehicle change constantly, and lateral collisions caused by sudden obstacles and visual blind spots require special consideration. Although fixed encoding can provide an effective state representation, it considers only a limited number of traffic participants (using the minimum amount of information needed to represent the scene) and therefore cannot accurately and comprehensively describe the influence of all surrounding traffic participants on the driving safety of the commercial vehicle.

Finally, in order to describe the relative positions of the ego vehicle and the surrounding traffic participants more intuitively and improve the reliability and effectiveness of decision-making, the present invention rasterizes the road area into a number of a×b grid cells and abstracts the road area and vehicle targets into a grid map, namely the "presence" grid map S_3(t) that describes the relative positional relationships, where a is the length of a grid cell and b is its width.

The "presence" grid map contains four attributes: the grid coordinates, whether a vehicle is present, the category of the corresponding vehicle, and the distance to the left and right lane lines. A cell with no traffic participant is set to "0" and a cell containing a traffic participant is set to "1"; the positions of these cells relative to the cell occupied by the ego vehicle describe the relative spacing between the two vehicles.
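As a minimal sketch, the three-part state S_t = [S_1(t), S_2(t), S_3(t)] might be assembled as follows; the grid dimensions, cell size, and the mapping of relative positions to grid cells are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

def build_state(ego, participants, grid_shape=(20, 8), cell_len=2.5, cell_wid=1.75):
    """Assemble S_t from ego motion, relative motion, and a binary 'presence' grid.

    ego:          dict with p_x, p_y, v_x, v_y, a_x, a_y, theta_s
    participants: list of dicts with rel_x, rel_y, rel_dist, rel_speed, accel
    """
    # S_1(t): continuous ego motion state, equation (3)
    s1 = np.array([ego["p_x"], ego["p_y"], ego["v_x"], ego["v_y"],
                   ego["a_x"], ego["a_y"], ego["theta_s"]], dtype=np.float32)

    # S_2(t): relative motion of each surrounding participant, equation (4);
    # the number of rows varies with the number of participants.
    s2 = np.array([[p["rel_dist"], p["rel_speed"], p["accel"]] for p in participants],
                  dtype=np.float32).reshape(-1, 3)

    # S_3(t): binary "presence" grid of a x b cells centered on the ego vehicle,
    # 1 where a traffic participant occupies a cell and 0 otherwise.
    s3 = np.zeros(grid_shape, dtype=np.float32)
    rows, cols = grid_shape
    for p in participants:
        r = int(p["rel_x"] / cell_len) + rows // 2
        c = int(p["rel_y"] / cell_wid) + cols // 2
        if 0 <= r < rows and 0 <= c < cols:
            s3[r, c] = 1.0
    return s1, s2, s3
```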

(2) Define the action space

The action space is defined by lateral and longitudinal driving actions:

A_t = [a_left, a_straight, a_right, a_accel, a_cons, a_decel]    (5)

where A_t is the action space at time t; a_left, a_straight, and a_right denote turning left, going straight, and turning right; and a_accel, a_cons, and a_decel denote accelerating, keeping a constant speed, and decelerating.
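The six discrete actions of equation (5) can be represented as a small enumeration; the integer ordering used here is an assumption of this sketch.

```python
from enum import IntEnum

class DrivingAction(IntEnum):
    """Discrete action space A_t of equation (5)."""
    TURN_LEFT = 0       # a_left
    GO_STRAIGHT = 1     # a_straight
    TURN_RIGHT = 2      # a_right
    ACCELERATE = 3      # a_accel
    CONSTANT_SPEED = 4  # a_cons
    DECELERATE = 5      # a_decel
```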

(3) Define the reward function

R_t = r_1 + r_2 + r_3    (6)

where R_t is the reward function at time t, and r_1, r_2, and r_3 are the forward, rearward, and lateral collision-avoidance reward functions given by equations (7), (8), and (9), respectively.

[Equations (7)-(9): the forward, rearward, and lateral collision-avoidance rewards r_1, r_2, and r_3 are piecewise functions of TTC and TTC_thr, RTTC and RTTC_thr, and x_lat and x_min, weighted by β_1, β_2, and β_3; the original equation images are not reproduced here.]

where TTC is the time to collision between the ego vehicle and the obstacle ahead, obtained by dividing the distance between them by their relative velocity; TTC_thr is the forward time-to-collision threshold; RTTC is the rearward time to collision and RTTC_thr its threshold, all in seconds; x_lat is the distance between the ego vehicle and traffic participants on either side and x_min is the minimum lateral safety distance, both in meters; and β_1, β_2, and β_3 are the weight coefficients of the forward, rearward, and lateral collision-avoidance reward functions.
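Because the piecewise forms of equations (7) to (9) are not reproduced above, the sketch below only illustrates one plausible reading of them: each term penalizes the vehicle when its safety margin (TTC, RTTC, or lateral clearance) drops below the corresponding threshold, scaled by β_1, β_2, and β_3. The penalty shapes and the default threshold values are assumptions.

```python
def total_reward(ttc, rttc, x_lat, ttc_thr=3.0, rttc_thr=3.0, x_min=1.5,
                 beta1=1.0, beta2=0.5, beta3=1.0):
    """Illustrative total reward R_t = r_1 + r_2 + r_3 (equation (6)).

    ttc, rttc : forward / rearward time to collision, in seconds
    x_lat     : lateral distance to traffic participants on either side, in meters
    """
    # r_1: forward collision avoidance, penalize TTC below its threshold
    r1 = -beta1 * (ttc_thr - ttc) / ttc_thr if ttc < ttc_thr else 0.0
    # r_2: rearward collision avoidance, penalize RTTC below its threshold
    r2 = -beta2 * (rttc_thr - rttc) / rttc_thr if rttc < rttc_thr else 0.0
    # r_3: lateral collision avoidance, penalize clearance below the minimum safe distance
    r3 = -beta3 * (x_min - x_lat) / x_min if x_lat < x_min else 0.0
    return r1 + r2 + r3
```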

Sub-step 2: Construct the decision sub-network based on the double deep Q-network

The double deep Q-network (DDQN) improves data efficiency by using an experience replay buffer, avoids parameter oscillation and divergence, and reduces the negative learning effects caused by overestimation in Q-learning. The present invention therefore uses a DDQN to learn safe driving strategies in edge scenarios.

Unlike processing a state space of fixed dimension, processing feature information that covers all surrounding traffic participants requires stronger feature extraction capability. Since the attention mechanism can capture richer feature information (the dependencies between the ego vehicle and each surrounding traffic participant), the present invention designs a policy network based on the multi-head attention mechanism. In addition, since the driving decision depends only on the motion states of the ego vehicle and the surrounding traffic participants and should not be affected by the order in which the traffic participants appear in the state space, the present invention uses the positional encoding method of Vaswani, Ashish, et al., "Attention is all you need," Advances in Neural Information Processing Systems, 2017, to build permutation invariance into the decision sub-network.

The attention layer can be expressed as:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (10)

where MultiHead(Q, K, V) is the multi-head attention value; Q is the query vector and K the key vector, both of dimension d_k; V is the value vector of dimension d_v; W^O is a parameter matrix to be learned; and head_h is the h-th attention head (h = 2 in the present invention), computed as follows:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (11)

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (12)

where Attention(Q, K, V) is the output attention matrix, and W_i^Q, W_i^K, and W_i^V are parameter matrices to be learned.
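For concreteness, a small NumPy implementation of equations (10) to (12) with h = 2 heads is sketched below; the token count, dimensions, and random projection matrices are placeholders chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Equation (12): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    """Equations (10)-(11) with h = len(W_q) heads (h = 2 in the patent)."""
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: 5 input tokens of dimension 64, two heads of dimension 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))
W_q = [rng.normal(size=(64, 32)) for _ in range(2)]
W_k = [rng.normal(size=(64, 32)) for _ in range(2)]
W_v = [rng.normal(size=(64, 32)) for _ in range(2)]
W_o = rng.normal(size=(64, 64))
out = multi_head(x, x, x, W_q, W_k, W_v, W_o)  # shape (5, 64)
```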

The decision sub-network based on the double deep Q-network is constructed as follows.

First, the state space S_t is fed into encoder 1, encoder 2, and encoder 3. Encoder 1 consists of two fully connected layers and outputs the encoding of the ego vehicle's motion state. Encoder 2 has the same structure as encoder 1 and outputs the encoding of the relative motion states. Encoder 3 consists of two convolutional layers and outputs the encoding of the occupancy grid map.

Each fully connected layer has 64 neurons with the Tanh activation function; each convolutional layer uses 3×3 kernels with a stride of 2.

Next, the multi-head attention mechanism is used to analyze the dependencies between the ego vehicle and the surrounding traffic participants, so that the decision sub-network can attend to traffic participants that suddenly approach the ego vehicle or whose paths conflict with it, while handling variable input sizes and building permutation invariance into the sub-network. The outputs of encoders 1, 2, and 3 are all connected to the multi-head attention module, which outputs the attention matrix. The attention matrix is then connected to decoder 1, which consists of a single fully connected layer.

This fully connected layer has 64 neurons with the Sigmoid activation function.
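A PyTorch sketch of the described encoder-attention-decoder structure follows. The 64-unit fully connected layers with Tanh, the 3×3 stride-2 convolutions, the two-head attention module, and the Sigmoid decoder layer follow the text; the input dimensions, convolution channel counts, the pooling step, and the final linear layer that maps the decoded features to six action values are assumptions added so the sketch runs end to end.

```python
import torch
import torch.nn as nn

class AttentionPolicyNet(nn.Module):
    """Encoders 1-3 -> two-head attention -> decoder 1 -> Q-values (output head assumed)."""

    def __init__(self, ego_dim=7, rel_dim=3, grid_ch=1, n_actions=6, d=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Linear(ego_dim, d), nn.Tanh(),
                                  nn.Linear(d, d), nn.Tanh())        # ego motion encoding
        self.enc2 = nn.Sequential(nn.Linear(rel_dim, d), nn.Tanh(),
                                  nn.Linear(d, d), nn.Tanh())        # relative motion encoding
        self.enc3 = nn.Sequential(nn.Conv2d(grid_ch, 16, 3, stride=2), nn.Tanh(),
                                  nn.Conv2d(16, 16, 3, stride=2), nn.Tanh(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(16, d))                  # occupancy grid encoding
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=2, batch_first=True)
        self.dec1 = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())     # decoder 1
        self.q_head = nn.Linear(d, n_actions)                        # assumed output layer

    def forward(self, ego, rel, grid):
        # ego: (B, 7); rel: (B, N, 3) with a variable number N of participants; grid: (B, 1, H, W)
        tokens = torch.cat([self.enc1(ego).unsqueeze(1),
                            self.enc2(rel),
                            self.enc3(grid).unsqueeze(1)], dim=1)    # (B, N+2, d)
        ego_query = tokens[:, :1, :]                                 # ego token attends to all tokens
        ctx, _ = self.attn(ego_query, tokens, tokens)
        return self.q_head(self.dec1(ctx.squeeze(1)))                # (B, n_actions)

# Example: a batch of 2 scenes with 4 surrounding participants and a 20x8 grid.
net = AttentionPolicyNet()
q_values = net(torch.randn(2, 7), torch.randn(2, 4, 3), torch.randn(2, 1, 20, 8))
```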

Sub-step 3: Construct the decision sub-network based on generative adversarial imitation learning

In the complex urban traffic environment, which is open and subject to interference from many traffic targets, it is difficult to construct an accurate and comprehensive reward function; in particular, it is difficult to quantify the influence of uncertainties (such as sudden obstacles or traffic participants inside visual blind spots) on driving safety. In order to reduce the influence of the uncertainty of the traffic environment and operating conditions on safe driving decisions and improve the effectiveness and reliability of the decisions, the present invention uses a generative adversarial imitation learning (GAIL) sub-network to learn the driving strategies contained in the driving behavior dataset and its generalized samples, and thereby imitate safe driving behavior under different driving conditions and operating conditions. The GAIL sub-network consists of a generator and a discriminator, each built with a deep neural network, as follows:

(1) Build the generator

The generator network is constructed as follows. The input of the generator is the state space and its output is the probability of each action in the action space, f = π(·|s; θ), where θ denotes the parameters of the generator network. The state space is passed through fully connected layers FC_1 and FC_2 to obtain feature F_1, and through fully connected layers FC_3 and FC_4 to obtain feature F_2. At the same time, the state space is passed through convolutional layers C_1 and C_2 to obtain feature F_3. Features F_1, F_2, and F_3 are then passed through a merge layer, fully connected layer FC_5, and a Softmax activation to obtain the output f = π(·|s; θ).

Fully connected layers FC_1 through FC_5 each have 64 neurons; convolutional layer C_1 uses a 3×3 kernel with a stride of 2, and convolutional layer C_2 uses a 3×3 kernel with a stride of 1.

(2) Build the discriminator

The discriminator network is constructed as follows. The input of the discriminator is the state space and its output is a six-dimensional vector parameterized by the discriminator network parameters φ. The state space is passed through fully connected layers FC_6 and FC_7 to obtain feature F_4, and through fully connected layers FC_8 and FC_9 to obtain feature F_5. At the same time, the state space is passed through convolutional layers C_3 and C_4 to obtain feature F_6. Features F_4, F_5, and F_6 are then passed through a merge layer, fully connected layer FC_10, and a Sigmoid activation to obtain the output.

Fully connected layers FC_6 through FC_10 each have 64 neurons; convolutional layer C_3 uses a 3×3 kernel with a stride of 2, and convolutional layer C_4 uses a 3×3 kernel with a stride of 1.
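A PyTorch sketch of the generator and discriminator follows. The two fully connected branches, the convolutional branch (3×3 kernels with stride 2 then 1), the 64-unit layers, and the Softmax and Sigmoid outputs follow the text; the assignment of the ego, relative-motion, and grid parts of the state to the branches, the branch activations, the channel counts, and the pooling of the variable number of participants are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class _ThreeBranch(nn.Module):
    """Two fully connected branches plus one convolutional branch, followed by a merge."""

    def __init__(self, ego_dim=7, rel_dim=3, d=64):
        super().__init__()
        self.fc_a = nn.Sequential(nn.Linear(ego_dim, d), nn.ReLU(),
                                  nn.Linear(d, d), nn.ReLU())               # FC_1-FC_2 / FC_6-FC_7
        self.fc_b = nn.Sequential(nn.Linear(rel_dim, d), nn.ReLU(),
                                  nn.Linear(d, d), nn.ReLU())               # FC_3-FC_4 / FC_8-FC_9
        self.conv = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2), nn.ReLU(),  # C_1 / C_3
                                  nn.Conv2d(8, 8, 3, stride=1), nn.ReLU(),  # C_2 / C_4
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(8, d))
        self.out_dim = 3 * d

    def forward(self, ego, rel, grid):
        # rel is averaged over participants so the branch accepts a variable number of them
        return torch.cat([self.fc_a(ego), self.fc_b(rel.mean(dim=1)), self.conv(grid)], dim=-1)

class Generator(nn.Module):
    """f = pi(.|s; theta): probabilities of the six discrete actions (Softmax output)."""

    def __init__(self, n_actions=6, d=64):
        super().__init__()
        self.trunk = _ThreeBranch(d=d)
        self.head = nn.Sequential(nn.Linear(self.trunk.out_dim, d), nn.ReLU(),   # merge + FC_5
                                  nn.Linear(d, n_actions), nn.Softmax(dim=-1))

    def forward(self, ego, rel, grid):
        return self.head(self.trunk(ego, rel, grid))

class Discriminator(nn.Module):
    """Six-dimensional Sigmoid output parameterized by phi (one score per action)."""

    def __init__(self, n_actions=6, d=64):
        super().__init__()
        self.trunk = _ThreeBranch(d=d)
        self.head = nn.Sequential(nn.Linear(self.trunk.out_dim, d), nn.ReLU(),   # merge + FC_10
                                  nn.Linear(d, n_actions), nn.Sigmoid())

    def forward(self, ego, rel, grid):
        return self.head(self.trunk(ego, rel, grid))
```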

Step 3: Train the safe driving decision-making model for commercial vehicles

First, the decision sub-network based on generative adversarial imitation learning is trained. The goal of the GAIL sub-network is to learn a generator network such that the discriminator cannot distinguish the driving actions produced by the generator from the actions in the driving behavior dataset. This includes the following sub-steps:

Sub-step 1: On the driving behavior dataset X defined in equation (1), initialize the generator network parameters θ_0 and the discriminator network parameters ω_0;

Sub-step 2: Perform L iterations, each consisting of sub-steps 2.1 and 2.2, specifically:

Sub-step 2.1: Update the discriminator parameters ω_i → ω_{i+1} using the gradient given by equation (13):

E_{(s,a)∼π_{θ_i}} [∇_ω log D_ω(s, a)] + E_{(s,a)∼X} [∇_ω log(1 − D_ω(s, a))]    (13)

where ∇_ω denotes the gradient, with respect to the parameters ω, of the neural network loss function;

Sub-step 2.2: Set the reward function from the output of the updated discriminator and update the generator parameters θ_i → θ_{i+1} using the trust region policy optimization algorithm.
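A simplified training sketch for sub-steps 2.1 and 2.2 is given below, reusing the Generator and Discriminator sketched earlier. The discriminator update uses the usual binary adversarial objective, and the generator update is shown as a plain policy-gradient step with the surrogate reward -log D that is commonly used in generative adversarial imitation learning; the patent itself specifies trust region policy optimization, which is not reproduced here, and the batch layout is an assumption.

```python
import torch
import torch.nn.functional as F

def gail_step(gen, disc, gen_opt, disc_opt, expert_batch, policy_batch):
    """One iteration of sub-steps 2.1-2.2 (simplified: TRPO replaced by a policy-gradient step).

    Each batch is a tuple (ego, rel, grid, actions); expert data come from the dataset X,
    policy data from rollouts of the current generator policy.
    """
    e_ego, e_rel, e_grid, e_act = expert_batch
    p_ego, p_rel, p_grid, p_act = policy_batch

    # Sub-step 2.1: update the discriminator parameters omega.
    # D(s)[a] plays the role of D(s, a); following the GAIL objective, pairs produced by
    # the generator are pushed toward 1 and expert pairs toward 0.
    d_policy = disc(p_ego, p_rel, p_grid).gather(1, p_act.unsqueeze(1))
    d_expert = disc(e_ego, e_rel, e_grid).gather(1, e_act.unsqueeze(1))
    d_loss = F.binary_cross_entropy(d_policy, torch.ones_like(d_policy)) + \
             F.binary_cross_entropy(d_expert, torch.zeros_like(d_expert))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

    # Sub-step 2.2: update the generator parameters theta with the surrogate reward
    # r(s, a) = -log D(s, a): actions the discriminator cannot tell from expert behavior
    # (small D) receive a high reward. The patent uses TRPO; a REINFORCE step stands in here.
    with torch.no_grad():
        reward = -torch.log(disc(p_ego, p_rel, p_grid).gather(1, p_act.unsqueeze(1)) + 1e-8)
    log_pi = torch.log(gen(p_ego, p_rel, p_grid).gather(1, p_act.unsqueeze(1)) + 1e-8)
    g_loss = -(reward * log_pi).mean()
    gen_opt.zero_grad(); g_loss.backward(); gen_opt.step()
    return d_loss.item(), g_loss.item()
```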

Then, building on the training results above, the DDQN-based decision sub-network is constructed and trained, which includes the following sub-steps:

Sub-step 3: Initialize the experience replay buffer D with capacity N;

Sub-step 4: Initialize the Q-value of each action to a random value;

Sub-step 5: Perform M iterations, each consisting of sub-steps 5.1 and 5.2, specifically:

Sub-step 5.1: Initialize the state s_0 and the policy parameters φ_0;

Sub-step 5.2: Perform T iterations, each consisting of sub-steps 5.21 to 5.27, specifically:

Sub-step 5.21: With probability ε, select a random driving action;

Sub-step 5.22: Otherwise, select a_t = argmax_a Q*(φ(s_t), a; θ);

where Q*(·) is the optimal action-value function and a_t is the action at time t;

Sub-step 5.23: Execute action a_t and obtain the reward r_t at time t and the state s_{t+1} at time t+1;

Sub-step 5.24: Store the sample (φ_t, a_t, r_t, φ_{t+1}) in the experience replay buffer D;

Sub-step 5.25: Randomly sample a mini-batch (φ_j, a_j, r_j, φ_{j+1}) from the experience replay buffer D;

Sub-step 5.26: Compute the iteration target using the following formula:

y_j = r_j + γ Q(φ_{j+1}, argmax_{a'} Q(φ_{j+1}, a'; θ_t); θ_t^-)    (14)

where θ_t^- denotes the weights of the target network at time t, γ is the discount factor, argmax(·) selects the argument that maximizes the objective, y_j is the iteration target, and p(s, a) denotes the action distribution;

Sub-step 5.27: Perform gradient descent on (y_j − Q(φ_j, a_j; θ))^2 using the following formula:

∇_{θ_i} L_i(θ_i) = E_{(s,a)∼p(·); s'} [ (y_j − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]    (15)

where ∇_{θ_i} denotes the gradient of the neural network loss function with respect to the parameters θ_i; ε is the probability of selecting a random action under the ε-greedy exploration strategy; θ_i are the parameters at iteration i; L_i(θ_i) is the loss function at iteration i; Q(s, a; θ_i) is the action-value function of the target network; and a' ranges over all possible actions in the state s'.
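The double-Q update of sub-steps 5.21 to 5.27 can be sketched as follows, reusing the AttentionPolicyNet from the earlier sketch as the Q-network. The replay-buffer handling, batch shapes, and hyperparameter values are illustrative assumptions; the key lines are the ε-greedy action selection and the double-Q target of equation (14).

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

def select_action(q_net, state, epsilon, n_actions=6):
    """Sub-steps 5.21-5.22: epsilon-greedy selection over the six driving actions."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(*state).argmax(dim=1).item())

def ddqn_update(q_net, target_net, optimizer, replay: deque, batch_size=32, gamma=0.99):
    """Sub-steps 5.25-5.27: sample a mini-batch and apply the double-Q target of equation (14)."""
    batch = random.sample(list(replay), batch_size)
    states, actions, rewards, next_states = zip(*batch)
    a = torch.tensor(actions).unsqueeze(1)
    r = torch.tensor(rewards, dtype=torch.float32)

    # Each stored state is a tuple (ego, rel, grid) with a leading batch dimension of 1;
    # this simple stacking assumes the same number of surrounding participants per sample.
    def stack(group):
        return tuple(torch.cat(parts, dim=0) for parts in zip(*group))

    s, s_next = stack(states), stack(next_states)

    q = q_net(*s).gather(1, a).squeeze(1)                        # Q(phi_j, a_j; theta)
    with torch.no_grad():
        best_next = q_net(*s_next).argmax(dim=1, keepdim=True)   # argmax_a' Q(phi_{j+1}, a'; theta)
        y = r + gamma * target_net(*s_next).gather(1, best_next).squeeze(1)  # equation (14)

    loss = F.mse_loss(q, y)                                      # gradient descent on (y_j - Q)^2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```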

Once the safe driving decision-making model of the commercial vehicle has been trained, the state space information collected by the sensors is fed into the model, which outputs high-level driving decisions such as turning, going straight, and accelerating or decelerating in real time, effectively ensuring the safe operation of commercial vehicles in the urban low-speed environment.
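At deployment, the trained Q-network can be queried once per control cycle roughly as follows; the sensor-to-tensor conversion and the action labels reuse the earlier sketches and are assumptions.

```python
import torch

ACTION_LABELS = ["turn left", "go straight", "turn right",
                 "accelerate", "keep speed", "decelerate"]

def decide(q_net, ego, rel, grid):
    """Map the current sensed state to a high-level driving decision."""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(ego, dtype=torch.float32).unsqueeze(0),
                         torch.as_tensor(rel, dtype=torch.float32).unsqueeze(0),
                         torch.as_tensor(grid, dtype=torch.float32).unsqueeze(0).unsqueeze(0))
    return ACTION_LABELS[int(q_values.argmax(dim=1))]
```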

Beneficial effects: compared with general driving decision-making methods, the method proposed by the present invention is more effective and reliable, specifically:

(1) The proposed method can imitate the safe driving behavior of human drivers and provide more reasonable and safer driving strategies for commercial vehicles in urban low-speed environments, realizing highly human-like safe driving decision-making for large commercial vehicles and effectively ensuring driving safety.

(2) The proposed method comprehensively considers the influence of visual blind spots, sudden obstacles, different operating conditions, and other factors on driving safety, and performs strategy learning and training in both normal driving scenarios and edge scenarios, further improving the effectiveness and reliability of driving decisions.

(3) The proposed method introduces a multi-head attention mechanism, takes into account the dynamic interaction between the ego vehicle and the surrounding traffic participants, and can handle safe driving decisions with variable inputs (a dynamically changing number of surrounding traffic participants).

(4) The proposed method does not need to consider complex vehicle dynamics equations or body parameters; the calculation is simple and clear, the safe driving decision strategy of a large commercial vehicle can be output in real time, and the sensors used are inexpensive, which facilitates large-scale deployment.

Brief Description of the Drawings

Fig. 1 is the technical roadmap of the present invention;

Fig. 2 is a schematic diagram of the policy network structure based on the multi-head attention mechanism designed by the present invention;

Fig. 3 is a schematic diagram of the generator network structure designed by the present invention;

Fig. 4 is a schematic diagram of the discriminator network structure designed by the present invention.

Detailed Description of the Embodiments

The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.

Aiming at the open urban traffic environment with interference from multiple traffic targets, the present invention proposes a highly human-like safe driving decision-making method for large commercial vehicles. First, the safe driving behavior of human drivers in the urban traffic environment is collected to construct a safe driving behavior dataset. Second, a multi-head-attention-based safe driving decision-making model for commercial vehicles is constructed. The model contains two sub-networks: a double deep Q-network and a generative adversarial imitation learning network. The DDQN learns safe driving strategies in edge scenarios such as dangerous and conflict scenarios in an unsupervised manner, while the GAIL sub-network imitates safe driving behavior under different driving conditions and operating conditions. Finally, the safe driving decision-making model is trained to obtain driving strategies for different driving conditions and operating conditions, realizing high-level decision output for the safe driving behavior of commercial vehicles. The proposed method can imitate the safe driving behavior of human drivers, considers the influence of factors such as visual blind spots and sudden obstacles on driving safety, provides more reasonable and safer driving strategies for large commercial vehicles, and realizes safe driving decision-making for commercial vehicles in the urban traffic environment. The technical roadmap of the present invention is shown in Fig. 1, and the specific steps are as follows:

Step 1: Collect the safe driving behavior of human drivers in the urban traffic environment

In order to achieve driving decisions comparable to those of human drivers, the present invention collects safe driving behavior under different driving conditions and operating conditions through real road tests and driving simulation, and then constructs a dataset characterizing the safe driving behavior of human drivers. This includes the following five sub-steps:

Sub-step 1: Build a synchronized multi-dimensional target information acquisition system using millimeter-wave radar, a 128-line lidar, vision sensors, a BeiDou positioning sensor, and an inertial sensor.

Sub-step 2: In a real urban environment, several drivers in turn drive a commercial vehicle equipped with the synchronized multi-dimensional target information acquisition system.

Sub-step 3: Collect and process data related to various driving behaviors, such as lane changing, lane keeping, car following, and acceleration and deceleration, to obtain multi-source heterogeneous descriptions of each driving behavior, for example the distances to obstacles in different directions measured by the radar or vision sensors, the position, velocity, acceleration, and yaw rate measured by the BeiDou and inertial sensors, and the steering wheel angle measured by on-board sensors.

Sub-step 4: In order to imitate safe driving behavior in edge scenarios such as dangerous and conflict scenarios, a virtual urban scene based on hardware-in-the-loop simulation is built. The constructed urban traffic scenarios include the following three categories:

(1) While the vehicle is driving, a traffic participant approaches laterally in front of the vehicle (i.e., a sudden obstacle);

(2) While the vehicle is turning, a stationary traffic participant is present in the vehicle's visual blind spot;

(3) While the vehicle is turning, a moving traffic participant is present in the vehicle's visual blind spot.

These traffic scenarios contain multiple road network structures (straight roads, curves, and intersections) and multiple types of traffic participants (commercial vehicles, passenger cars, non-motorized vehicles, and pedestrians).

Several drivers drive the commercial vehicle in the virtual scene through real controls (steering wheel, accelerator, and brake pedal), and information such as the ego vehicle's lateral and longitudinal position, lateral and longitudinal velocity, lateral and longitudinal acceleration, and its relative distance and relative velocity to surrounding traffic participants is collected.

Sub-step 5: Based on the data collected in the real urban environment and in the driving simulation environment, a driving behavior dataset for learning safe driving decisions is constructed, which can be expressed as:

X = {(s_1, a_1), (s_2, a_2), ..., (s_n, a_n)}    (1)

where X is the set of state-action pairs, i.e., the constructed dataset characterizing the safe driving behavior of human drivers; (s_j, a_j) is the state-action pair at time j, where s_j is the state at time j and a_j is the action taken by the human driver given state s_j; and n is the number of state-action pairs in the dataset.

步骤二:构建基于多头注意力的营运车辆安全驾驶决策模型Step 2: Construct a decision-making model for safe driving of commercial vehicles based on multi-head attention

为了实现城市低速环境下的大型营运车辆安全驾驶决策,本发明综合考虑视觉盲区、突遇障碍物、行驶工况等因素对行车安全的影响,建立营运车辆安全驾驶决策模型。考虑到深度强化学习将深度学习的感知能力和强化学习的决策能力相结合,通过无监督学习的方式对交通环境进行探索,本发明利用深度强化学习对危险场景、冲突场景等边缘场景下的安全驾驶策略进行学习。此外,考虑到模仿学习具有仿效榜样的能力,本发明利用模仿学习模拟人类驾驶员在不同驾驶条件和行驶工况下的安全驾驶行为。因此,构建的安全驾驶决策模型由两部分组成,具体描述如下:In order to realize the safe driving decision-making of large-scale commercial vehicles in low-speed urban environments, the present invention comprehensively considers the influence of factors such as visual blind spots, unexpected obstacles, and driving conditions on driving safety, and establishes a safe driving decision-making model for commercial vehicles. Considering that deep reinforcement learning combines the perception ability of deep learning with the decision-making ability of reinforcement learning, and explores the traffic environment through unsupervised learning, the present invention uses deep reinforcement learning to improve safety in edge scenarios such as dangerous scenes and conflict scenes. Learn driving strategies. In addition, considering that imitation learning has the ability to imitate models, the present invention uses imitation learning to simulate the safe driving behavior of human drivers under different driving conditions and driving conditions. Therefore, the constructed safe driving decision-making model consists of two parts, which are described in detail as follows:

子步骤1:定义安全驾驶决策模型的基本参数Sub-step 1: Define the basic parameters of the safe driving decision model

首先,将城市低速环境下的安全驾驶决策问题转化为有限马尔科夫决策过程。其次,定义安全驾驶决策模型的基本参数。First, the safe driving decision-making problem in urban low-speed environment is transformed into a finite Markov decision process. Second, the basic parameters of the safe driving decision-making model are defined.

(1)定义状态空间(1) Define the state space

为了描述自车和附近交通参与者的运动状态,本发明利用时间序列数据和占据栅格图构建状态空间。具体描述如下:In order to describe the motion state of the self-vehicle and nearby traffic participants, the present invention utilizes time series data and occupancy grid graphs to construct a state space. The specific description is as follows:

St=[S1(t),S2(t),S3(t)] (2)S t = [S 1 (t), S 2 (t), S 3 (t)] (2)

式中,St表示t时刻的状态空间,S1(t)和S2(t)表示t时刻与时间序列数据相关的状态空间,S3(t)表示t时刻与占据栅格图相关的状态空间。In the formula, S t represents the state space at time t, S 1 (t) and S 2 (t) represent the state space related to the time series data at time t, and S 3 (t) represents the state space related to the occupancy grid map at time t state space.

首先,利用连续位置、速度、加速度和航向角信息描述自车的运动状态:First, the ego vehicle's motion state is described using continuous position, velocity, acceleration, and heading angle information:

S1(t)=[px,py,vx,vy,ax,ays] (3)S 1 (t)=[p x ,p y ,v x ,v y ,a x ,a ys ] (3)

式中,px,py分别表示自车的横向位置和纵向位置,单位为米,vx,vy分别表示自车的横向速度和纵向速度,单位为米每秒,ax,ay分别表示自车的横向加速度和纵向加速度,单位为米每二次方秒,θs表示自车的航向角,单位为度。In the formula, p x , p y represent the lateral position and longitudinal position of the self-vehicle respectively in meters, v x , v y represent the lateral speed and longitudinal speed of the self-vehicle respectively in meters per second, a x , a y Respectively represent the lateral acceleration and longitudinal acceleration of the own vehicle, the unit is meter per square second, θ s represents the heading angle of the own vehicle, the unit is degree.

其次,利用自车与周围交通参与者的相对运动状态信息描述周围交通参与者的运动状态:Secondly, use the relative motion state information of the self-vehicle and the surrounding traffic participants to describe the motion state of the surrounding traffic participants:

Figure BDA0003829935270000121
Figure BDA0003829935270000121

式中,

Figure BDA0003829935270000122
分别表示自车与第i个交通参与者的相对距离、相对速度和加速度,单位分别为米、米每秒和米每二次方秒。In the formula,
Figure BDA0003829935270000122
respectively represent the relative distance, relative speed and acceleration between the self-vehicle and the i-th traffic participant, and the units are meters, meters per second and meters per square second.

Existing state-space definitions commonly use a fixed encoding, i.e. the number of surrounding traffic participants considered is fixed. In real urban traffic scenes, however, the number and positions of the traffic participants around the commercial vehicle change continuously, and lateral collisions caused by suddenly appearing obstacles and visual blind spots require particular attention. Although fixed encoding can provide an effective state representation, it considers only a limited number of traffic participants (the minimum amount of information needed to represent the scene) and therefore cannot accurately and comprehensively describe the influence of all surrounding traffic participants on the driving safety of the commercial vehicle.

Finally, to describe the relative positions of the ego vehicle and the surrounding traffic participants more intuitively and to improve the reliability and effectiveness of the decisions, the present invention rasterizes the road area into a number of a×b grid cells and abstracts the road area and vehicle targets into a grid map, namely the "existence" grid map S_3(t) used to describe relative positions, where a denotes the length and b the width of a grid cell.

The "existence" grid map contains four attributes: the grid coordinates, whether a vehicle is present, the class of that vehicle, and the distances to the left and right lane lines. A cell containing no traffic participant is set to "0" and a cell containing a traffic participant is set to "1"; the positions of such cells relative to the cell occupied by the ego vehicle describe the relative spacing between the two vehicles.
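A minimal sketch of how such an "existence" grid could be rasterized is given below. The participant position format and the single binary channel are assumptions; the vehicle class and lane-line distances described above would occupy additional channels.

```python
import numpy as np

def existence_grid(road_length_m, road_width_m, cell_len_a, cell_wid_b, participants):
    """Rasterize the road into a x b cells; 1 = cell occupied by a traffic participant, 0 = free.

    `participants` is assumed to be an iterable of (x, y) positions in road coordinates.
    """
    rows = int(np.ceil(road_length_m / cell_len_a))
    cols = int(np.ceil(road_width_m / cell_wid_b))
    grid = np.zeros((rows, cols), dtype=np.int8)
    for x, y in participants:
        r = min(int(x // cell_len_a), rows - 1)   # clamp to the grid boundary
        c = min(int(y // cell_wid_b), cols - 1)
        grid[r, c] = 1
    return grid
```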

(2) Define the action space

The action space is defined by lateral and longitudinal driving actions:

A_t = [a_left, a_straight, a_right, a_accel, a_cons, a_decel]   (5)

where A_t denotes the action space at time t, a_left, a_straight and a_right denote turning left, going straight and turning right, and a_accel, a_cons and a_decel denote accelerating, keeping a constant speed and decelerating.
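The six discrete actions map naturally onto an enumeration; the integer indices below are an assumed ordering used only for illustration.

```python
from enum import IntEnum

class DrivingAction(IntEnum):
    """Discrete action space A_t: three lateral and three longitudinal actions."""
    LEFT = 0        # a_left
    STRAIGHT = 1    # a_straight
    RIGHT = 2       # a_right
    ACCELERATE = 3  # a_accel
    CONSTANT = 4    # a_cons
    DECELERATE = 5  # a_decel
```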

(3) Define the reward function

R_t = r_1 + r_2 + r_3   (6)

where R_t denotes the reward function at time t, and r_1, r_2 and r_3 denote the forward, rearward and lateral collision-avoidance reward functions, obtained from equations (7), (8) and (9).

(Equations (7), (8) and (9), given as images in the original, define r_1, r_2 and r_3 as piecewise functions of TTC, RTTC and x_lat with respect to their thresholds, weighted by β_1, β_2 and β_3.)

where TTC denotes the time to collision between the ego vehicle and the obstacle ahead, obtained by dividing the distance between them by their relative speed; TTC_thr denotes the time-to-collision threshold; RTTC denotes the rearward time to collision and RTTC_thr its threshold, all in seconds; x_lat denotes the distance between the ego vehicle and the traffic participants on either side and x_min the minimum lateral safety distance, both in meters; and β_1, β_2 and β_3 denote the weight coefficients of the forward, rearward and lateral collision-avoidance reward functions.
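For illustration, the reward logic can be sketched as below. Since the piecewise forms of (7), (8) and (9) are given only as images in the source, the step-shaped penalties and all default thresholds and weights here are assumptions.

```python
def time_to_collision(gap_m, closing_speed_mps):
    """TTC: distance to the leading obstacle divided by the closing (relative) speed."""
    return float("inf") if closing_speed_mps <= 0 else gap_m / closing_speed_mps

def collision_avoidance_reward(ttc, rttc, x_lat,
                               ttc_thr=3.0, rttc_thr=3.0, x_min=1.0,
                               beta1=1.0, beta2=1.0, beta3=1.0):
    """R_t = r1 + r2 + r3: a penalty activates when a safety margin drops below its threshold.

    The exact piecewise shapes, thresholds and weights are illustrative assumptions.
    """
    r1 = -beta1 if ttc < ttc_thr else 0.0     # forward collision avoidance
    r2 = -beta2 if rttc < rttc_thr else 0.0   # rearward collision avoidance
    r3 = -beta3 if x_lat < x_min else 0.0     # lateral collision avoidance
    return r1 + r2 + r3
```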

Sub-step 2: Construct the DDQN-based decision sub-network

The Double Deep Q Network (DDQN) improves data efficiency through an experience replay pool, avoids parameter oscillation or divergence, and reduces the negative learning effects caused by over-estimation in Q-learning. The present invention therefore uses a deep double-Q network to learn safe driving strategies in edge scenarios.

Unlike handling a state space of fixed dimension, handling the feature information of all surrounding traffic participants requires stronger feature-extraction capability. Since the attention mechanism can capture richer feature information (the dependencies between the ego vehicle and each surrounding traffic participant), the present invention designs a policy network based on multi-head attention. Moreover, because the driving decision depends only on the motion states of the ego vehicle and the surrounding traffic participants and should not be affected by the order of the participants in the state space, the present invention uses the positional encoding method (Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017) to build permutation invariance into the decision sub-network.

The attention layer can be expressed as:

MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O   (10)

where MultiHead(Q, K, V) denotes the multi-head attention value; Q denotes the query vector and K the key vector, both of dimension d_k; V denotes the value vector of dimension d_v; W^O denotes a parameter matrix to be learned; and head_h denotes the h-th attention head (h = 2 in the present invention), computed as:

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)   (11)

Attention(Q, K, V) = softmax(QK^T / √d_k)·V   (12)

where Attention(Q, K, V) denotes the output attention matrix, and W_i^Q, W_i^K and W_i^V denote parameter matrices to be learned.

The decision sub-network based on the deep double-Q network is constructed as shown in Fig. 2 and described below.

First, the state space S_t is connected to encoder 1, encoder 2 and encoder 3. Encoder 1 consists of two fully connected layers and outputs the ego-vehicle motion-state encoding. Encoder 2 has the same structure as encoder 1 and outputs the relative motion-state encoding. Encoder 3 consists of two convolutional layers and outputs the occupancy-grid-map encoding.

Each fully connected layer has 64 neurons with a Tanh activation function; each convolutional layer uses 3×3 kernels with a stride of 2.

Second, the multi-head attention mechanism is used to analyze the dependencies between the ego vehicle and the surrounding traffic participants, so that the decision sub-network can attend to traffic participants that suddenly approach the ego vehicle or whose paths conflict with it, while supporting variable input sizes and building permutation invariance into the decision sub-network. The outputs of encoder 1, encoder 2 and encoder 3 are all connected to the multi-head attention module, which outputs the attention matrix. Third, the output attention matrix is connected to decoder 1, which consists of a single fully connected layer.

This fully connected layer has 64 neurons with a Sigmoid activation function.
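As a concrete illustration of how the three encoders, the two-head attention module and the decoder could be wired together, a minimal PyTorch sketch is given below. The feature dimensions, the pooling of the grid branch and the final linear head mapping the decoded feature to the six action values are assumptions; only the layer counts, 64-unit widths, Tanh/Sigmoid activations, 3×3 stride-2 convolutions and h = 2 heads come from the text.

```python
import torch
import torch.nn as nn

class DecisionSubnetwork(nn.Module):
    """DDQN decision sub-network: three state encoders, a 2-head attention block, a decoder."""
    def __init__(self, ego_dim=7, rel_dim=3, grid_ch=1, d_model=64, n_actions=6):
        super().__init__()
        self.enc_ego = nn.Sequential(nn.Linear(ego_dim, 64), nn.Tanh(),
                                     nn.Linear(64, d_model), nn.Tanh())
        self.enc_rel = nn.Sequential(nn.Linear(rel_dim, 64), nn.Tanh(),
                                     nn.Linear(64, d_model), nn.Tanh())
        self.enc_grid = nn.Sequential(nn.Conv2d(grid_ch, 16, 3, stride=2), nn.Tanh(),
                                      nn.Conv2d(16, 16, 3, stride=2), nn.Tanh(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(16, d_model), nn.Tanh())
        self.attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(d_model, 64), nn.Sigmoid())
        self.q_head = nn.Linear(64, n_actions)  # assumed output head to the 6 Q-values

    def forward(self, s1, s2, grid):
        # s1: (B, ego_dim); s2: (B, N, rel_dim) with N participants; grid: (B, 1, H, W) float
        e1 = self.enc_ego(s1).unsqueeze(1)        # (B, 1, d) ego token
        e2 = self.enc_rel(s2)                     # (B, N, d) participant tokens
        e3 = self.enc_grid(grid).unsqueeze(1)     # (B, 1, d) grid token
        tokens = torch.cat([e1, e2, e3], dim=1)   # variable-length token set
        ctx, _ = self.attn(e1, tokens, tokens)    # ego token attends to all tokens
        return self.q_head(self.decoder(ctx.squeeze(1)))  # (B, n_actions)
```

Querying with the ego token over the full token set keeps the output independent of how many participants are present in a given frame, which is the property the text asks for.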

Sub-step 3: Construct the decision sub-network based on generative adversarial imitation learning

In an open, complex urban traffic environment with interference from many traffic objects, it is difficult to construct an accurate and comprehensive reward function, and in particular to quantitatively describe the influence of various uncertainties (such as suddenly appearing obstacles or traffic participants in visual blind spots) on driving safety. To reduce the influence of traffic-environment and operating-condition uncertainty on safe driving decisions and to improve their effectiveness and reliability, the present invention uses a generative adversarial imitation learning (GAIL) sub-network to learn the driving strategies contained in the driving-behavior data set and its generalized samples, and thus to imitate safe driving behavior under different driving conditions and operating conditions. The GAIL sub-network consists of a generator and a discriminator, each built with a deep neural network, as described below:

(1) Construct the generator

The generator network is constructed as shown in Fig. 3. Its input is the state space and its output is the probability f = π(·|s; θ) of each action in the action space, where θ denotes the parameters of the generator network. First, the state space is connected in turn to fully connected layers FC_1 and FC_2 to obtain feature F_1, and to fully connected layers FC_3 and FC_4 to obtain feature F_2. In parallel, the state space is connected to convolutional layers C_1 and C_2 to obtain feature F_3. Features F_1, F_2 and F_3 are then connected in turn to a merge layer, fully connected layer FC_5 and a Softmax activation to produce the output f = π(·|s; θ).

Fully connected layers FC_1 to FC_5 each have 64 neurons; convolutional layer C_1 uses a 3×3 kernel with stride 2, and convolutional layer C_2 uses a 3×3 kernel with stride 1.
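A minimal PyTorch sketch of the generator branch structure follows. How the state space is split across the two fully connected branches and the convolutional branch, the hidden activations, and the final projection from FC_5 to the six-way Softmax are assumptions, since the text leaves them implicit.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """GAIL generator (policy): two fully connected branches plus one convolutional branch."""
    def __init__(self, vec_dim=7, rel_dim=3, grid_ch=1, n_actions=6):
        super().__init__()
        self.fc12 = nn.Sequential(nn.Linear(vec_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
        self.fc34 = nn.Sequential(nn.Linear(rel_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
        self.conv = nn.Sequential(nn.Conv2d(grid_ch, 16, 3, stride=2), nn.ReLU(),
                                  nn.Conv2d(16, 16, 3, stride=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))
        # merge layer -> FC_5 (64 units) -> assumed projection to the 6 actions
        self.head = nn.Sequential(nn.Linear(64 * 3, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, s1, s2_pooled, grid):
        f1 = self.fc12(s1)          # feature F1
        f2 = self.fc34(s2_pooled)   # feature F2 (pooled relative-state vector, assumed fixed-size)
        f3 = self.conv(grid)        # feature F3
        logits = self.head(torch.cat([f1, f2, f3], dim=-1))
        return torch.softmax(logits, dim=-1)  # f = pi(.|s; theta)
```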

(2) Construct the discriminator

The discriminator network is constructed as shown in Fig. 4. Its input is the state space and its output is a six-dimensional vector D_φ(·|s), where φ denotes the parameters of the discriminator network. First, the state space is connected in turn to fully connected layers FC_6 and FC_7 to obtain feature F_4, and to fully connected layers FC_8 and FC_9 to obtain feature F_5. In parallel, the state space is connected to convolutional layers C_3 and C_4 to obtain feature F_6. Features F_4, F_5 and F_6 are then connected in turn to a merge layer, fully connected layer FC_10 and a Sigmoid activation to produce the output D_φ(·|s).

Fully connected layers FC_6 to FC_10 each have 64 neurons; convolutional layer C_3 uses a 3×3 kernel with stride 2, and convolutional layer C_4 uses a 3×3 kernel with stride 1.
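A matching sketch of the discriminator is given below, under the same assumptions as the generator sketch above; the text specifies a six-dimensional Sigmoid output, so one score per discrete action is produced, and the projection after FC_10 is assumed.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """GAIL discriminator: mirrors the generator's three branches but ends in a Sigmoid."""
    def __init__(self, vec_dim=7, rel_dim=3, grid_ch=1, n_actions=6):
        super().__init__()
        self.fc67 = nn.Sequential(nn.Linear(vec_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
        self.fc89 = nn.Sequential(nn.Linear(rel_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
        self.conv = nn.Sequential(nn.Conv2d(grid_ch, 16, 3, stride=2), nn.ReLU(),
                                  nn.Conv2d(16, 16, 3, stride=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))
        self.fc10 = nn.Sequential(nn.Linear(64 * 3, 64), nn.ReLU())
        self.out = nn.Linear(64, n_actions)  # assumed projection to the 6-dimensional output

    def forward(self, s1, s2_pooled, grid):
        merged = torch.cat([self.fc67(s1), self.fc89(s2_pooled), self.conv(grid)], dim=-1)
        return torch.sigmoid(self.out(self.fc10(merged)))  # per-action "expert" probability
```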

Step 3: Train the safe driving decision-making model for commercial vehicles

First, the decision sub-network based on generative adversarial imitation learning is trained. Its objective is to learn a generator network such that the discriminator cannot distinguish the driving actions produced by the generator from the actions in the driving-behavior data set. The training comprises the following sub-steps:

Sub-step 1: On the driving-behavior data set X, initialize the generator network parameters θ_0 and the discriminator network parameters ω_0;

Sub-step 2: Perform L iterations, each consisting of sub-steps 2.1 and 2.2, as follows:

Sub-step 2.1: Update the discriminator parameters ω_i → ω_{i+1} using the gradient formula of equation (13):

Ê_{π_{θ_i}}[∇_ω log(D_ω(s, a))] + Ê_{X}[∇_ω log(1 − D_ω(s, a))]   (13)

where ∇_ω denotes the gradient of the neural-network loss function with respect to the parameters ω, the first expectation is taken over state–action pairs generated by the current policy π_{θ_i}, and the second over the driving-behavior data set X;

Sub-step 2.2: Set the reward function obtained from the updated discriminator and update the generator parameters θ_i → θ_{i+1} using the trust region policy optimization algorithm.
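The adversarial update can be illustrated as follows. The binary cross-entropy form of the discriminator update and the −log(1 − D) imitation reward are standard GAIL choices used here as assumptions, since equation (13) and the reward of sub-step 2.2 appear only as images in the source; the TRPO generator update itself is not reproduced, and the batch format (state tensors plus integer action indices) is assumed.

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, opt, expert_batch, policy_batch):
    """One update of the discriminator: push expert pairs toward 1, generated pairs toward 0."""
    (s_e, a_e), (s_p, a_p) = expert_batch, policy_batch
    d_expert = disc(*s_e).gather(1, a_e.view(-1, 1))   # D(s, a) for expert pairs
    d_policy = disc(*s_p).gather(1, a_p.view(-1, 1))   # D(s, a) for generator pairs
    loss = F.binary_cross_entropy(d_expert, torch.ones_like(d_expert)) + \
           F.binary_cross_entropy(d_policy, torch.zeros_like(d_policy))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def imitation_reward(disc, state, action):
    """Reward fed to the policy update; -log(1 - D) is one common GAIL convention (assumed)."""
    with torch.no_grad():
        d = disc(*state).gather(1, action.view(-1, 1)).clamp(1e-6, 1 - 1e-6)
    return -torch.log(1.0 - d)
```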

Next, building on the training results above, the DDQN-based decision sub-network is trained, which comprises the following sub-steps:

Sub-step 3: Initialize the experience replay pool D with capacity N;

Sub-step 4: Initialize the Q value of each action to a random value;

Sub-step 5: Perform M iterations, each consisting of sub-steps 5.1 and 5.2, as follows:

Sub-step 5.1: Initialize the state s_0 and the policy parameters φ_0;

Sub-step 5.2: Perform T iterations, each consisting of sub-steps 5.21 to 5.27, as follows:

Sub-step 5.21: With probability ε, select a random driving action;

Sub-step 5.22: Otherwise select a_t = argmax_a Q*(φ(s_t), a; θ);

where Q*(·) denotes the optimal action-value function and a_t the action at time t;

Sub-step 5.23: Execute action a_t and obtain the reward r_t at time t and the state s_{t+1} at time t+1;

Sub-step 5.24: Store the transition (φ_t, a_t, r_t, φ_{t+1}) in the experience replay pool D;

Sub-step 5.25: Randomly sample a mini-batch of transitions (φ_j, a_j, r_j, φ_{j+1}) from the experience replay pool D;

Sub-step 5.26: Compute the iteration target using:

y_j = r_j + γ Q(φ_{j+1}, argmax_{a′} Q(φ_{j+1}, a′; θ); θ_t^-)   (14)

where θ_t^- denotes the weights of the target network at time t, γ denotes the discount factor, argmax(·) denotes the variable that maximizes the objective, y_j denotes the iteration target, and p(s, a) denotes the action distribution;

Sub-step 5.27: Perform gradient descent on (y_j − Q(φ_j, a_j; θ))² using:

∇_{θ_i} L_i(θ_i) = E_{(s,a)∼p(·)}[(y_j − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i)]   (15)

where ∇_{θ_i} denotes the gradient of the neural-network loss function with respect to the parameters θ_i; ε denotes the probability of selecting a random action under the ε-greedy exploration strategy; θ_i denotes the parameters at iteration i; L_i(θ_i) denotes the loss function at iteration i; Q(s, a; θ_i) denotes the action-value function of the target network; and a′ ranges over all possible actions in state s′.
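The Double-DQN target of sub-step 5.26 and the squared-error objective of sub-step 5.27 can be sketched as follows; the batch layout and the (s1, s2, grid) state format follow the earlier sketches and are assumptions.

```python
import torch

def ddqn_target(q_net, target_net, rewards, next_states, gamma=0.99, done=None):
    """Double-DQN target: the online network selects the next action, the target network evaluates it."""
    with torch.no_grad():
        a_star = q_net(*next_states).argmax(dim=1, keepdim=True)         # argmax_a Q(s', a; theta)
        next_q = target_net(*next_states).gather(1, a_star).squeeze(1)   # Q(s', a*; theta^-)
        if done is not None:
            next_q = next_q * (1.0 - done)   # no bootstrap past terminal states
        return rewards + gamma * next_q      # y_j

def ddqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared error between y_j and Q(s_j, a_j; theta), minimized by gradient descent."""
    states, actions, rewards, next_states, done = batch
    q_sa = q_net(*states).gather(1, actions.view(-1, 1)).squeeze(1)
    y = ddqn_target(q_net, target_net, rewards, next_states, gamma, done)
    return torch.mean((y - q_sa) ** 2)
```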

Once the safe driving decision-making model for commercial vehicles has been trained, the state-space information collected by the sensors is fed into the model, which outputs high-level driving decisions such as turning, going straight, accelerating and decelerating in real time, thereby effectively safeguarding the operation of commercial vehicles in low-speed urban environments.
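Once trained, the model can be queried frame by frame; a minimal usage sketch, assuming the PyTorch decision sub-network illustrated earlier, is:

```python
import torch

def decide(model, s1, s2, grid):
    """Run the trained decision model on one sensor frame and return the greedy action index."""
    model.eval()
    with torch.no_grad():
        q_values = model(s1.unsqueeze(0), s2.unsqueeze(0), grid.unsqueeze(0))
    return int(q_values.argmax(dim=1).item())  # index into [left, straight, right, accel, cons, decel]
```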

Claims (1)

1. A safe driving decision-making method for large commercial vehicles in low-speed urban environments. First, the safe driving behavior of human drivers in urban traffic environments is collected to build a safe-driving-behavior data set. Second, a multi-head-attention-based safe driving decision-making model for commercial vehicles is constructed; the model contains two sub-networks, a deep double-Q network and a generative adversarial imitation learning sub-network, where the deep double-Q network learns, in an unsupervised manner, safe driving strategies in edge scenarios such as dangerous and conflict scenarios, and the generative adversarial imitation learning sub-network imitates the safe driving behavior of human drivers under different driving conditions and operating conditions. Finally, the safe driving decision-making model is trained to obtain driving strategies under different driving conditions and operating conditions, realizing high-level decision outputs for the safe driving behavior of commercial vehicles. The method is characterized by:

Step 1: Collect the safe driving behavior of human drivers in urban traffic environments

Safe driving behavior under different driving conditions and operating conditions is collected through real road tests and driving simulation, and a data set characterizing the safe driving behavior of human drivers is constructed. This comprises the following four sub-steps:

Sub-step 1: Build a synchronous multi-dimensional target-information acquisition system using millimeter-wave radar, 128-beam lidar, vision sensors, BeiDou sensors and inertial sensors;

Sub-step 2: In a real urban environment, several drivers in turn drive a commercial vehicle equipped with the acquisition system; data on driving behaviors such as lane changing, lane keeping, car following and acceleration/deceleration are collected and processed to obtain multi-source heterogeneous descriptions of each behavior, including obstacle distances in several directions measured by radar or vision sensors, the position, velocity, acceleration and yaw rate measured by the BeiDou and inertial sensors, and the steering-wheel angle measured by on-board sensors;

Sub-step 3: To imitate safe driving behavior in edge scenarios such as dangerous and conflict scenarios, build virtual urban scenes based on hardware-in-the-loop simulation. The constructed urban traffic scenes cover three categories:

(1) while the vehicle is driving, a laterally approaching traffic participant appears ahead of it, i.e. a suddenly encountered obstacle;
(2) while the vehicle is turning, a stationary traffic participant is present in its visual blind spot;
(3) while the vehicle is turning, a moving traffic participant is present in its visual blind spot;

These scenes contain a variety of road-network structures, including straight roads, curves and intersections, and several classes of traffic participants, including commercial vehicles, passenger cars, non-motorized vehicles and pedestrians;

Several drivers drive the commercial vehicle in the virtual scenes through a real controller with a steering wheel, accelerator and brake pedals, while the ego vehicle's lateral and longitudinal position, velocity and acceleration and its relative distance and relative speed to the surrounding traffic participants are recorded;

Sub-step 4: From the data collected in the real urban environment and the driving-simulation environment, construct the driving-behavior data set used for safe-driving decision learning, expressed as:

X = {(s_1, a_1), (s_2, a_2), …, (s_n, a_n)}   (1)

where X denotes the set of state–action pairs, i.e. the constructed data set characterizing the safe driving behavior of human drivers; (s_j, a_j) denotes the "state–action" pair at time j, with s_j the state and a_j the action taken by the human driver given s_j; and n denotes the number of "state–action" pairs in the database;

Step 2: Construct the multi-head-attention-based safe driving decision-making model for commercial vehicles

Deep reinforcement learning is used to learn safe driving strategies in edge scenarios such as dangerous and conflict scenarios; in addition, because imitation learning can reproduce the behavior of an exemplar, imitation learning is used to imitate the safe driving behavior of human drivers under different driving conditions and operating conditions. The constructed safe driving decision-making model therefore consists of two parts, described as follows:

Sub-step 1: Define the basic parameters of the safe driving decision-making model

First, the safe driving decision-making problem in the low-speed urban environment is formulated as a finite Markov decision process; second, the basic parameters of the model are defined;

(1) Define the state space

The state space is constructed from time-series data and an occupancy grid map to describe the motion states of the ego vehicle and nearby traffic participants:

S_t = [S_1(t), S_2(t), S_3(t)]   (2)

where S_t denotes the state space at time t, S_1(t) the state space of the ego vehicle related to the time-series data, S_2(t) the state space of the surrounding traffic participants related to the time-series data, and S_3(t) the state space related to the occupancy grid map;

First, the motion state of the ego vehicle is described by its continuous position, velocity, acceleration and heading angle:

S_1(t) = [p_x, p_y, v_x, v_y, a_x, a_y, θ_s]   (3)

where p_x and p_y denote the lateral and longitudinal positions of the ego vehicle in meters, v_x and v_y its lateral and longitudinal velocities in meters per second, a_x and a_y its lateral and longitudinal accelerations in meters per second squared, and θ_s its heading angle in degrees;

Second, the motion states of the surrounding traffic participants are described by their relative motion with respect to the ego vehicle:

S_2(t) = [Δd_i, Δv_i, a_i]   (4)

where Δd_i, Δv_i and a_i denote the relative distance, relative speed and acceleration between the ego vehicle and the i-th traffic participant, in meters, meters per second and meters per second squared;

Finally, the road area is rasterized into a number of p×q grid cells, and the road area and vehicle targets are abstracted into a grid map, namely the "existence" grid map S_3(t) used to describe relative positions, where p denotes the length and q the width of a grid cell;

The "existence" grid map contains four attributes: the grid coordinates, whether a vehicle is present, the class of that vehicle, and the distances to the left and right lane lines. A cell containing no traffic participant is set to "0" and a cell containing a traffic participant is set to "1"; the positions of such cells relative to the cell occupied by the ego vehicle describe the relative spacing between the two vehicles;

(2) Define the action space

The action space is defined by lateral and longitudinal driving actions:

A_t = [a_left, a_straight, a_right, a_accel, a_cons, a_decel]   (5)

where A_t denotes the action space at time t, a_left, a_straight and a_right denote turning left, going straight and turning right, and a_accel, a_cons and a_decel denote accelerating, keeping a constant speed and decelerating;

(3) Define the reward function

R_t = r_1 + r_2 + r_3   (6)

where R_t denotes the reward function at time t, and r_1, r_2 and r_3 denote the forward, rearward and lateral collision-avoidance reward functions, obtained from equations (7), (8) and (9);

(Equations (7), (8) and (9), given as images in the original, define r_1, r_2 and r_3 as piecewise functions of TTC, RTTC and x_lat with respect to their thresholds, weighted by β_1, β_2 and β_3.)

where TTC denotes the time to collision between the ego vehicle and the obstacle ahead, obtained by dividing the distance between them by their relative speed; TTC_thr denotes the time-to-collision threshold; RTTC denotes the rearward time to collision and RTTC_thr its threshold, all in seconds; x_lat denotes the distance between the ego vehicle and the traffic participants on either side and x_min the minimum lateral safety distance, both in meters; and β_1, β_2 and β_3 denote the weight coefficients of the forward, rearward and lateral collision-avoidance reward functions;

Sub-step 2: Construct the decision sub-network based on the deep double-Q network

The deep double-Q network is used to learn safe driving strategies in edge scenarios. A policy network based on multi-head attention is designed, and positional encoding is used to build permutation invariance into the decision sub-network;

The attention layer can be expressed as:

MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O   (10)

where MultiHead(Q, K, V) denotes the multi-head attention value; Q denotes the query vector and K the key vector, both of dimension d_k; V denotes the value vector of dimension d_v; W^O denotes a parameter matrix to be learned; and head_h denotes the h-th attention head (h = 2 in the present invention), computed as:

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)   (11)

Attention(Q, K, V) = softmax(QK^T / √d_k)·V   (12)

where Attention(Q, K, V) denotes the output attention matrix, and W_i^Q, W_i^K and W_i^V denote parameter matrices to be learned;

The decision sub-network is constructed as follows. First, the state space S_t is connected to encoder 1, encoder 2 and encoder 3; encoder 1 consists of two fully connected layers and outputs the ego-vehicle motion-state encoding; encoder 2 has the same structure as encoder 1 and outputs the relative motion-state encoding; encoder 3 consists of two convolutional layers and outputs the occupancy-grid-map encoding; each fully connected layer has 64 neurons with a Tanh activation, and each convolutional layer uses 3×3 kernels with a stride of 2. Second, the multi-head attention mechanism is used to analyze the dependencies between the ego vehicle and the surrounding traffic participants, so that the decision sub-network can attend to participants that suddenly approach the ego vehicle or whose paths conflict with it, while supporting variable input sizes and building permutation invariance into the sub-network; the outputs of encoders 1, 2 and 3 are all connected to the multi-head attention module, which outputs the attention matrix. Third, the output attention matrix is connected to decoder 1, which consists of a single fully connected layer with 64 neurons and a Sigmoid activation;

Sub-step 3: Construct the decision sub-network based on generative adversarial imitation learning

The generative adversarial imitation learning sub-network is used to learn the driving strategies contained in the driving-behavior data set and its generalized samples, and thus to imitate the safe driving behavior of human drivers under different driving conditions and operating conditions; it consists of a generator and a discriminator, each built with a deep neural network, as follows:

(1) Construct the generator

The generator network is constructed as shown in Fig. 3. Its input is the state space and its output is the probability f = π(·|s; θ) of each action in the action space, where θ denotes the parameters of the generator network. First, the state space is connected in turn to fully connected layers FC_1 and FC_2 to obtain feature F_1, and to fully connected layers FC_3 and FC_4 to obtain feature F_2; in parallel, the state space is connected to convolutional layers C_1 and C_2 to obtain feature F_3; features F_1, F_2 and F_3 are then connected in turn to a merge layer, fully connected layer FC_5 and a Softmax activation to produce the output f = π(·|s; θ); fully connected layers FC_1 to FC_5 each have 64 neurons, convolutional layer C_1 uses a 3×3 kernel with stride 2, and convolutional layer C_2 uses a 3×3 kernel with stride 1;

(2) Construct the discriminator

The discriminator network's input is the state space and its output is a six-dimensional vector D_φ(·|s), where φ denotes the parameters of the discriminator network. First, the state space is connected in turn to fully connected layers FC_6 and FC_7 to obtain feature F_4, and to fully connected layers FC_8 and FC_9 to obtain feature F_5; in parallel, the state space is connected to convolutional layers C_3 and C_4 to obtain feature F_6; features F_4, F_5 and F_6 are then connected in turn to a merge layer, fully connected layer FC_10 and a Sigmoid activation to produce the output D_φ(·|s); fully connected layers FC_6 to FC_10 each have 64 neurons, convolutional layer C_3 uses a 3×3 kernel with stride 2, and convolutional layer C_4 uses a 3×3 kernel with stride 1;

Step 3: Train the safe driving decision-making model for commercial vehicles

First, the decision sub-network based on generative adversarial imitation learning is trained; its objective is to learn a generator network such that the discriminator cannot distinguish the driving actions produced by the generator from the actions in the driving-behavior data set. This comprises the following sub-steps:

Sub-step 1: On the driving-behavior data set X, initialize the generator network parameters θ_0 and the discriminator network parameters ω_0;

Sub-step 2: Perform L iterations, each consisting of sub-steps 2.1 and 2.2, as follows:

Sub-step 2.1: Update the discriminator parameters ω_i → ω_{i+1} using the gradient formula of equation (13):

Ê_{π_{θ_i}}[∇_ω log(D_ω(s, a))] + Ê_{X}[∇_ω log(1 − D_ω(s, a))]   (13)

where ∇_ω denotes the gradient of the neural-network loss function with respect to the parameters ω;

Sub-step 2.2: Set the reward function obtained from the updated discriminator and update the generator parameters θ_i → θ_{i+1} using the trust region policy optimization algorithm;

Next, building on the training results above, the DDQN-based decision sub-network is trained, which comprises the following sub-steps:

Sub-step 3: Initialize the experience replay pool D with capacity N;

Sub-step 4: Initialize the Q value of each action to a random value;

Sub-step 5: Perform M iterations, each consisting of sub-steps 5.1 and 5.2, as follows:

Sub-step 5.1: Initialize the state s_0 and the policy parameters φ_0;

Sub-step 5.2: Perform T iterations, each consisting of sub-steps 5.21 to 5.27, as follows:

Sub-step 5.21: With probability ε, select a random driving action;

Sub-step 5.22: Otherwise select a_t = argmax_a Q*(φ(s_t), a; θ), where Q*(·) denotes the optimal action-value function and a_t the action at time t;

Sub-step 5.23: Execute action a_t and obtain the reward r_t at time t and the state s_{t+1} at time t+1;

Sub-step 5.24: Store the transition (φ_t, a_t, r_t, φ_{t+1}) in the experience replay pool D;

Sub-step 5.25: Randomly sample a mini-batch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D;

Sub-step 5.26: Compute the iteration target using:

y_j = r_j + γ Q(φ_{j+1}, argmax_{a′} Q(φ_{j+1}, a′; θ); θ_t^-)   (14)

where θ_t^- denotes the weights of the target network at time t, γ the discount factor, argmax(·) the variable that maximizes the objective, y_j the iteration target, and p(s, a) the action distribution;

Sub-step 5.27: Perform gradient descent on (y_j − Q(φ_j, a_j; θ))² using:

∇_{θ_i} L_i(θ_i) = E_{(s,a)∼p(·)}[(y_j − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i)]   (15)

where ∇_{θ_i} denotes the gradient of the neural-network loss function with respect to the parameters θ_i; ε denotes the probability of selecting a random action under the ε-greedy exploration strategy; θ_i denotes the parameters at iteration i; L_i(θ_i) denotes the loss function at iteration i; Q(s, a; θ_i) denotes the action-value function of the target network; and a′ ranges over all possible actions in state s′;

Once the safe driving decision-making model for commercial vehicles has been trained, the state-space information collected by the sensors is fed into the model, which outputs high-level driving decisions such as turning, going straight, accelerating and decelerating in real time, thereby effectively safeguarding the operation of commercial vehicles in low-speed urban environments.
CN202211070514.5A 2022-09-02 2022-09-02 Safe driving decision-making method for large commercial vehicles in urban low-speed environment Active CN115257819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211070514.5A CN115257819B (en) 2022-09-02 2022-09-02 Safe driving decision-making method for large commercial vehicles in urban low-speed environment


Publications (2)

Publication Number Publication Date
CN115257819A true CN115257819A (en) 2022-11-01
CN115257819B CN115257819B (en) 2024-12-24

Family

ID=83755148


Country Status (1)

Country Link
CN (1) CN115257819B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning
CN113762512A (en) * 2021-11-10 2021-12-07 北京航空航天大学杭州创新研究院 Distributed model training method, system and related device
CN114407931A (en) * 2022-02-21 2022-04-29 东南大学 A highly human-like decision-making method for safe driving of autonomous commercial vehicles

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731690A (en) * 2022-11-18 2023-03-03 北京理工大学 A decision-making method for unmanned bus clusters based on graph neural network reinforcement learning
CN115731690B (en) * 2022-11-18 2023-11-28 北京理工大学 A decision-making method for unmanned bus clusters based on graph neural network reinforcement learning
WO2024198773A1 (en) * 2023-03-27 2024-10-03 华为技术有限公司 Neural network, self-driving method, and apparatus
CN117048365A (en) * 2023-10-12 2023-11-14 江西五十铃汽车有限公司 Automobile torque control method, system, storage medium and equipment
CN117048365B (en) * 2023-10-12 2024-01-26 江西五十铃汽车有限公司 Automobile torque control method, system, storage medium and equipment
CN117246345A (en) * 2023-11-06 2023-12-19 镁佳(武汉)科技有限公司 A generative vehicle control method, device, equipment and medium
CN118397581A (en) * 2024-04-07 2024-07-26 湘江实验室 Intelligent driving method based on brain-like perception and related equipment
CN118770181A (en) * 2024-06-14 2024-10-15 昆明理工大学 A real-time energy control method for plug-in hybrid electric vehicles based on weighted double Q learning

Also Published As

Publication number Publication date
CN115257819B (en) 2024-12-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant