CN114701517A - Multi-target complex traffic scene automatic driving solution based on reinforcement learning - Google Patents
Multi-target complex traffic scene automatic driving solution based on reinforcement learning
- Publication number
- CN114701517A (Application No. CN202210370991.7A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- training
- reinforcement learning
- collision
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000002787 reinforcement Effects 0.000 title claims abstract description 76
- 238000012549 training Methods 0.000 claims abstract description 117
- 238000000034 method Methods 0.000 claims abstract description 58
- 230000009471 action Effects 0.000 claims abstract description 36
- 230000008569 process Effects 0.000 claims abstract description 22
- 230000015654 memory Effects 0.000 claims abstract description 3
- 230000007613 environmental effect Effects 0.000 claims description 53
- 230000006870 function Effects 0.000 claims description 16
- 238000004088 simulation Methods 0.000 claims description 12
- 230000006399 behavior Effects 0.000 claims description 6
- 238000012937 correction Methods 0.000 claims description 4
- 230000007547 defect Effects 0.000 claims description 4
- 230000000694 effects Effects 0.000 claims description 4
- 238000002922 simulated annealing Methods 0.000 claims description 4
- 230000004083 survival effect Effects 0.000 claims description 4
- 230000008447 perception Effects 0.000 abstract description 5
- 238000004422 calculation algorithm Methods 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 5
- 238000013461 design Methods 0.000 description 4
- 230000006403 short-term memory Effects 0.000 description 4
- 230000007774 longterm Effects 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 238000007670 refining Methods 0.000 description 2
- 238000005070 sampling Methods 0.000 description 2
- 206010039203 Road traffic accident Diseases 0.000 description 1
- 230000001133 acceleration Effects 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 238000000137 annealing Methods 0.000 description 1
- 230000003542 behavioural effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000005096 rolling process Methods 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0015—Planning or execution of driving tasks specially adapted for safety
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W40/00—Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mechanical Engineering (AREA)
- Transportation (AREA)
- Automation & Control Theory (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Traffic Control Systems (AREA)
Abstract
Description
Technical Field
The invention relates to a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes. It belongs to the technical field of autonomous driving and in particular concerns a general modeling and training scheme, based on deep reinforcement learning, for autonomous driving algorithms in multi-target complex traffic scenes.
Background Art
With the rapid development of the intelligent vehicle industry and the continuous maturing of autonomous driving technology, driverless operation has become the trend of future vehicle development. Autonomous driving systems deployed today reach Level 3 (L3): the vehicle drives itself while the driver supervises, and the driver must take over in an emergency. One important reason why fully driverless operation has not yet been achieved is that current rule-based decision-making methods cannot handle a sufficient range of traffic scenarios, leaving considerable safety risks.
Mainstream autonomous driving technology can be divided into three modules: perception, decision-making, and control, with the decision-making module serving as the core of the intelligent system. Current autonomous driving decision-making technology falls into two broad categories, rule-based and learning-based. Rule-based decision-making techniques include general decision models, finite state machine models, decision tree models, and knowledge-reasoning models. Learning-based decision-making techniques are mainly built on deep learning and reinforcement learning.
Rule-based decision-making is still what is used in practice, but it exposes more and more problems. A rule-based system can hardly enumerate every possible scenario, and scenarios that were not considered can easily lead to traffic accidents. In addition, designing a rule system is costly in both labor and system complexity, and maintaining and upgrading it is particularly cumbersome. There is therefore a strong desire to develop and refine other approaches, and data-driven deep reinforcement learning is one such direction. However, today's applications of deep reinforcement learning to autonomous driving mostly target a single specific scenario, such as overtaking, lane changing, or lane keeping, and are not very general. Furthermore, deep neural networks are not yet fully interpretable, suffer from catastrophic forgetting, and can produce unexpected unsafe actions. Reinforcement learning itself also has generalization and stability problems. Only by giving deep reinforcement learning a practically meaningful role in autonomous driving decision and control can the problems of poor generality, weak generalization, and unimproved safety be alleviated to some extent.
Summary of the Invention
This summary introduces concepts in a simplified form that are described in detail in the detailed description that follows. It is not intended to identify key or essential features of the claimed technical solution, nor to limit the scope of the claimed technical solution.
In view of the problems and shortcomings of the prior art, the purpose of the present invention is to provide a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes. Through an anthropomorphic observation design and a reward function design based on reward reshaping, a single model can achieve better overall performance across multiple types of traffic scenes with diverse environment-vehicle policies. In addition, to improve training speed and generalization, the invention proposes a time-varying reward training method. To enhance safety, the invention proposes safety-guarantee methods such as an LSTM-based dangerous action recognizer and knowledge-based safety filtering, so as to solve the problems raised in the background art above.
To achieve the above object, the present invention provides the following technical solution.
The invention discloses a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes, comprising the following steps:
Step 1: prepare a simulator environment and complex driving scenes for autonomous driving simulation;
Step 2: add the environmental feature information required for training the reinforcement learning model to the observation space as environment observation information, including ego-vehicle information, other-vehicle information, and road information, and compute key feature information from it, including the time-to-collision with the preceding vehicle in each lane, the time difference to collision between the ego vehicle and the collision point, and the variance of the heading angle between the ego vehicle and the waypoints ahead;
Step 3: set up the reward framework required for training the reinforcement learning model;
Step 4: train the reinforcement learning model with the time-varying training method, then continue training with the meta-model across different traffic scenes; every fixed number of iteration rounds, adjust the reward weights according to the agent's driving performance and the types of collisions that occur, repeating the weight adjustment several times before ending training;
Step 5: after outputting the trained reinforcement learning model, build a dangerous action recognizer and a rule constrainer that judge the danger level of the scene from the environment observation information in order to limit or adjust the planning quantities output by the reinforcement learning model, and keep adding and refining rules manually by observing their effect in the simulation environment.
Further, in Step 2, the environmental feature information required for training the reinforcement learning model is added to the observation space. The ego-vehicle information includes the ego speed, the ego steering wheel angle, the index of the lane the ego vehicle occupies, the distance between the ego center point and the centerline of its lane, the heading-angle deviations between the ego vehicle and the fifteen waypoints ahead of its position, and the relative positions of the preview points 4 and 8 meters ahead. The other-vehicle information includes the distance to the nearest vehicle in each lane, the time-to-collision with the preceding vehicle in each lane, and the relative speed between the nearest other vehicle in the ego lane and the ego vehicle. The road information includes the lateral distance between the ego vehicle and the lane center and the waypoint heading error.
Further, the reward framework in Step 3 includes an environment reward, a speed reward, a collision penalty, and a lane-center deviation penalty. The environment reward is the ego survival time, i.e., the time the ego vehicle travels from the start until a collision occurs. The speed reward is the ego driving speed, measured as distance traveled per second. The collision penalty is applied when the ego vehicle leaves its route, drives out of the road boundary, or collides with an environment vehicle. The lane-center deviation penalty is the absolute value of the distance between the vehicle center and the lane centerline.
Further, the specific steps of training the reinforcement learning model in Step 4 with the time-varying training method combined with the meta-model are:
Step 4.1: initialize the reinforcement learning model and train it on each scene in turn for a certain number of rounds to obtain a meta-model;
Step 4.2: train the meta-model obtained in Step 4.1 on the selected scene with the time-varying training method, adjusting the reward weights according to the defects in the agent's behavior;
Step 4.3: set the scenes to all simple scenes without intersections and repeat the training process of Step 4.2 to improve performance in simple intersection-free scenes;
Step 4.4: set the scenes to all scenes containing intersections and repeat the training process of Step 4.2 to improve performance in intersection scenes;
Step 4.5: set the scenes to those containing roundabouts and multi-directional traffic and repeat the training process of Step 4.2 to improve performance in roundabout and multi-directional traffic scenes;
Step 4.6: continue training on the remaining scenes until the process ends.
Further, the specific steps of training the reinforcement learning model with the time-varying training method in Step 4 are:
Step 4.2.1: set the hyperparameters of the reinforcement learning model;
Step 4.2.2: set the reward function to the basic reward so that the agent learns lane keeping, and start iterative training;
Step 4.2.3: increase the weights of the lane-center deviation penalty and the collision penalty, and continue iterative training;
Step 4.2.4: further increase the collision penalty and continue iterative training;
Step 4.2.5: add new scenes to the original scene dataset, add the speed reward, and increase the weights of the lane-center deviation penalty and the collision penalty until the iterations end.
Further, the dangerous action recognizer in Step 5 predicts the danger level from the actions output by the reinforcement learning model and the environment observation information, and takes actions such as emergency avoidance and emergency adjustment according to the danger level. The dangerous action recognizer involves a sample collection phase and a training phase. The specific steps of the sample collection phase are:
Step 5.1.1: prepare multiple types of scenes and select one scene to start training;
Step 5.1.2: initialize a PPO policy model and start training the policy model on the selected scene;
Step 5.1.3: record the trajectory of the current episode while it runs;
Step 5.1.4: when a collision occurs, collect the 10 steps before the collision as a negative sample, and randomly collect any 10 consecutive steps of the current trajectory as a positive sample;
Step 5.1.5: train until the number of steps reaches the set total, then select the next scene and repeat;
Step 5.1.6: finish once all scenes have been collected.
Further, the dangerous action recognizer in Step 5 is a dangerous action recognizer model built on a long short-term memory network. The specific steps of the training phase are:
Step 5.2.1: for each group of collected sample data, generate data groups with a sliding window as the model input, using the label of the last step of each group as the model's target label;
Step 5.2.2: use the Adam optimizer and adjust its learning rate with cosine simulated annealing;
Step 5.2.3: use the mean squared error loss as the training loss function and compute the mean squared error between the model output and the target label;
Step 5.2.4: set the relevant model parameters and the number of training rounds, and complete training.
Further, the rule constrainer in Step 5 consists of rules written from human knowledge and empirical statistics from simulation experiments and is used to restrict the ego vehicle's behavior in certain specific situations. The rule constrainer mainly includes knowledge rules for closest-distance protection, knowledge rules at intersections, knowledge rules before sharp bends, and a correction rule for long dwelling when the ego vehicle has no nearby vehicles. Different scenes are judged from the environment observation information to decide the ego speed-limit rules.
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention provides a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes. The method can handle all traffic scenes with a single reinforcement-learning modeling scheme for autonomous driving, has good generality, and achieves good multi-objective and generalization performance. The comprehensive reinforcement learning model is built on the traditional reinforcement learning framework: environment perception information and feature quantities extracted with human knowledge form the observation space, and, according to the evaluation metrics, lane keeping, travel distance, and collision avoidance are set as the rewards and penalties of the agent vehicle in the reinforcement learning algorithm. During model training, the meta-learning idea is combined with a time-varying training strategy; each stage sets different reward weights and training sets to correct the behavioral defects the agent formed in previous training stages and to improve its performance in its weaker scenes, which raises training speed and the generalization of the applied policy. In addition, to further guarantee safety, a dangerous action recognizer based on a long short-term memory (LSTM) network and a rule constrainer based on a body of human knowledge are proposed. Samples are taken from the environment to train the dangerous action recognizer so that the vehicle can identify dangerous actions and dangerous scenes, and rule constraints designed for specific situations limit the output actions. This greatly improves safety, reduces the number of collisions, and handles special emergencies, thereby ensuring the vehicle's driving safety.
Description of the Drawings
The accompanying drawings, which form part of this application, are provided to give a further understanding of the application and to make its other features, objects, and advantages more apparent. The drawings of the exemplary embodiments and their descriptions explain the application and do not unduly limit it.
In the drawings:
Fig. 1 is a schematic flowchart of the overall reinforcement-learning autonomous driving solution for multi-target complex traffic scenes according to the present invention;
Fig. 2 is a schematic diagram of the steps of the reinforcement-learning autonomous driving solution for multi-target complex traffic scenes according to the present invention;
Fig. 3 is a schematic flowchart of training the reinforcement learning model with the time-varying training method and the meta-model according to the present invention;
Fig. 4 is a schematic diagram of the framework that combines end-to-end control with the rule constrainer to control autonomous driving according to the present invention;
Fig. 5 is a schematic diagram of the reinforcement learning policy network based on proximal policy optimization according to the present invention;
Fig. 6 is a schematic structural diagram of the dangerous action recognizer model based on a long short-term memory (LSTM) neural network according to the present invention;
Fig. 7 is a schematic flowchart of the sampling phase of the dangerous action recognizer according to the present invention;
Fig. 8 is a block diagram, in computer-language form, of the sampling algorithm of the dangerous action recognizer according to the present invention.
Detailed Description of the Embodiments
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the disclosure. The drawings and embodiments of the present disclosure are for illustration only and are not intended to limit its protection scope.
It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings. The embodiments of this disclosure and the features of the embodiments may be combined with each other where they do not conflict.
The present invention discloses a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes, which is described in detail below with reference to the drawings and embodiments. Referring to Figs. 1 and 2, it mainly includes the following steps:
Step 1: prepare a simulator environment and complex driving scenes for autonomous driving simulation;
Step 2: add the environmental feature information required for training the reinforcement learning model to the observation space as environment observation information, including ego-vehicle information, other-vehicle information, and road information, and compute key feature information from it, including the time-to-collision with the preceding vehicle in each lane, the time difference to collision between the ego vehicle and the collision point, and the variance of the heading angle between the ego vehicle and the waypoints ahead;
Step 3: set up the reward framework required for training the reinforcement learning model;
Step 4: train the reinforcement learning model with the time-varying training method, then continue training with the meta-model across different traffic scenes; every fixed number of iteration rounds, adjust the reward weights according to the agent's driving performance and the types of collisions that occur, repeating the weight adjustment several times before ending training;
Step 5: after outputting the trained reinforcement learning model, build a dangerous action recognizer and a rule constrainer that judge the danger level of the scene from the environment observation information in order to limit or adjust the planning quantities output by the reinforcement learning model, and keep adding and refining rules manually by observing their effect in the simulation environment.
The multiple targets of the present invention are accurate driving, fast driving, safe driving, and high robustness and generalization of the algorithm. Specifically, when the same number of simulations is run in a single traffic scene up to the maximum travel distance, driving speed is reflected in keeping the ego vehicle's average speed as high as possible. Driving safety is reflected in as few collisions with environment vehicles and as few route departures as possible, with the ego vehicle traveling as far as possible. Driving accuracy is reflected in keeping the average distance between the ego center point and the road centerline as small as possible. High robustness and generalization are reflected in a single algorithm achieving good results across multiple complex traffic scenes as well as on map scenes it has not been trained on.
Complex traffic scenes refer to the map scenes used for training in the simulator, which cover multiple road types and traffic conditions. The map types include simple roads, sharp bends, intersections, roundabouts, merges, diverges, and mixed road scenes, nine map types in total, and different scenes contain environment vehicles of different densities. Under the rules set by the simulated environment, the trajectories and driving policies of the environment vehicles are partly random; the policy types can be divided into conservative, moderate, and aggressive, and under each policy the operating parameters of the environment vehicles still retain some randomness.
In Step 2, the environmental feature information required for training the reinforcement learning model is added to the observation space as environment observation information, and key feature quantities are computed from it. In this reinforcement learning model, the input observation space contains environment perception information such as ego-vehicle information, other-vehicle information, and road information, together with observation features extracted from the environment information; the output actions include throttle opening, brake control, and steering wheel angle control.
Specifically, the environment observation information includes ego-vehicle information, other-vehicle information, and road information. The ego-vehicle information includes the ego speed, the ego steering wheel angle, the index of the lane the ego vehicle occupies, the distance between the ego center point and the centerline of its lane, the heading-angle deviations between the ego vehicle and the fifteen waypoints ahead of its position, and the relative positions of the preview points 4 and 8 meters ahead, 2+1+1+15+4 = 19 dimensions in total. The other-vehicle information includes the distance to the nearest vehicle in each lane, the time-to-collision (TTC) with the preceding vehicle in each lane, and the relative speed between the nearest other vehicle in the ego lane and the ego vehicle, 6*3 = 18 dimensions in total; when the ego vehicle approaches an intersection, vehicles heading toward the intersection within fifty meters are queried, i.e., the relative position, relative heading, relative speed, and time difference to collision of the 5 vehicles closest to the ego vehicle that might collide with it, 5*5 = 25 dimensions in total. The road information includes the lateral distance from the ego vehicle to the lane center and the waypoint heading error, where the waypoint heading error is the distance from a waypoint ahead to the line along the ego heading. Starting from the waypoint at the ego position, 15 waypoints ahead are selected and the heading error is computed for each, with waypoints dense near the ego vehicle and sparse farther away; the selected waypoint offsets are [0, 1, 2, 3, 5, 7, 10, 13, 17, 21, 25, 30, 35, 42, 50], where the i-th value denotes the i-th waypoint ahead and also the distance of that waypoint from the agent vehicle.
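As a minimal sketch of how such an observation vector could be assembled, the Python fragment below computes heading errors at the listed waypoint offsets, a simple per-lane time-to-collision, and concatenates the ego, other-vehicle, and road blocks. The dictionary field names, the simulator interface, and the TTC formula are illustrative assumptions; the patent does not specify its data structures.

```python
import numpy as np

# Waypoint offsets for the heading-error features (dense near the ego vehicle, sparse far away).
WAYPOINT_OFFSETS = [0, 1, 2, 3, 5, 7, 10, 13, 17, 21, 25, 30, 35, 42, 50]

def heading_errors(ego_heading, ego_xy, waypoints):
    """Angle between the ego heading and the bearing to each selected waypoint ahead."""
    errors = []
    for i in WAYPOINT_OFFSETS:
        wp = np.asarray(waypoints[i], dtype=np.float64)
        bearing = np.arctan2(wp[1] - ego_xy[1], wp[0] - ego_xy[0])
        errors.append((bearing - ego_heading + np.pi) % (2 * np.pi) - np.pi)  # wrap to [-pi, pi)
    return np.asarray(errors, dtype=np.float32)

def time_to_collision(gap, ego_speed, lead_speed, eps=1e-3):
    """Per-lane TTC with the preceding vehicle: gap over closing speed (large value if not closing)."""
    closing = ego_speed - lead_speed
    return gap / closing if closing > eps else 1e3

def build_observation(ego, lanes, road):
    """Concatenate the ego, other-vehicle, and road blocks into one observation vector."""
    ego_block = np.concatenate([
        np.asarray([ego["speed"], ego["steering_angle"],
                    ego["lane_index"], ego["dist_to_lane_center"]], dtype=np.float32),
        heading_errors(ego["heading"], ego["xy"], ego["waypoints"]),
        np.ravel(np.asarray(ego["preview_points"], dtype=np.float32)),  # 4 m / 8 m preview points
    ])
    other_feats = []
    for lane in lanes:  # nearest gap, TTC with the lead vehicle, and relative speed, per lane
        other_feats += [lane["nearest_gap"],
                        time_to_collision(lane["nearest_gap"], ego["speed"], lane["lead_speed"]),
                        lane["relative_speed"]]
    road_block = np.asarray([road["lateral_offset"], road["waypoint_heading_error"]], dtype=np.float32)
    return np.concatenate([ego_block,
                           np.asarray(other_feats, dtype=np.float32),
                           road_block])
```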
The extracted key feature information mainly concerns collision-avoidance vehicles. The three other vehicles closest to the ego position are selected as collision-avoidance vehicles, and their information is computed in the ego coordinate frame. The key feature information therefore includes the relative position of each collision-avoidance vehicle, its absolute speed, the relative heading angle between the ego vehicle and the collision-avoidance vehicle, and the time difference to collision (TDTC) with respect to the collision point, where the TDTC is computed by the following formula:
In Step 3, the reward framework required for training the reinforcement learning model is set up. It includes an environment reward, a speed reward, a collision penalty, and a lane-center deviation penalty. The environment reward is the ego survival time, i.e., the time the ego vehicle travels from the start until a collision occurs; its value gradually increases from 1 to 4 and then increases step by step again starting from 1, and in the simulation environment a reward is given at every simulation step as long as the ego vehicle is still alive. The speed reward is the ego driving speed, measured as distance traveled per second. The lane-center deviation penalty is the absolute value of the distance between the vehicle center and the centerline. The collision penalty covers three cases: when the ego vehicle leaves its route, drives out of the road boundary, or collides with an environment vehicle, a corresponding penalty of constant magnitude 5 is given, whose weight increases with the number of iterations; that is, each occurrence yields a penalty of -5. The coefficients of the respective terms are 1, 0.5, 0.1, and 1.2, and these coefficients are adjusted during the training process.
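A minimal sketch of a per-step reward computed as the weighted sum described above is given below. The weights shown are the coefficients 1, 0.5, 0.1, and 1.2 stated in this paragraph, mapped to the four terms in the order they are introduced, which is an assumption; the step-information field names are likewise illustrative.

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    env: float = 1.0        # environment (survival) reward
    speed: float = 0.5      # driving-speed reward
    lane: float = 0.1       # lane-center deviation penalty
    collision: float = 1.2  # off-route / off-road / vehicle-collision penalty

def step_reward(info: dict, w: RewardWeights) -> float:
    """Weighted sum of the four reward terms for one simulation step (illustrative)."""
    r_env = info["survival_bonus"]              # grows while the ego vehicle stays alive
    r_speed = info["distance_this_second"]      # meters traveled per second
    p_lane = -abs(info["dist_to_lane_center"])  # lane-center deviation penalty
    p_collision = -5.0 if (info["off_route"] or info["off_road"] or info["hit_vehicle"]) else 0.0
    return w.env * r_env + w.speed * r_speed + w.lane * p_lane + w.collision * p_collision
```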
Referring to Fig. 3, training the reinforcement learning model with the time-varying training method combined with the meta-model improves the vehicle's performance in certain special scenes, such as intersections and roundabouts. The training steps are given below (a schematic code sketch follows the list):
Step 4.1: initialize the reinforcement learning model and train it on each scene in turn for a certain number of rounds to obtain a meta-model;
Step 4.2: train the meta-model obtained in Step 4.1 on the selected scene with the time-varying training method, adjusting the reward weights according to the defects in the agent's behavior;
Step 4.3: set the scenes to all simple scenes without intersections and repeat the training process of Step 4.2 to improve performance in simple intersection-free scenes;
Step 4.4: set the scenes to all scenes containing intersections and repeat the training process of Step 4.2 to improve performance in intersection scenes;
Step 4.5: set the scenes to those containing roundabouts and multi-directional traffic and repeat the training process of Step 4.2 to improve performance in roundabout and multi-directional traffic scenes;
Step 4.6: continue training on the remaining scenes until the process ends.
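A minimal sketch of how the scene curriculum of Steps 4.1 to 4.6 could be organized is given below; the scene tags and the `train` and `adjust_reward_weights` helpers are hypothetical stand-ins for the trainer and simulator, not the disclosed implementation.

```python
def meta_curriculum_training(agent, all_scenes, rounds_per_scene, rounds_per_stage):
    """Illustrative staged training: meta-model first, then scene groups of rising difficulty."""
    # Step 4.1: cycle through every scene once to obtain a meta-model.
    for scene in all_scenes:
        agent.train([scene], rounds=rounds_per_scene)

    # Steps 4.3-4.6: time-varying training on progressively harder scene groups.
    stages = [
        [s for s in all_scenes if "intersection" not in s.tags and "roundabout" not in s.tags],
        [s for s in all_scenes if "intersection" in s.tags],
        [s for s in all_scenes if "roundabout" in s.tags or "multi_direction" in s.tags],
        all_scenes,  # remaining / mixed scenes
    ]
    for scenes in stages:
        for _ in range(rounds_per_stage):
            stats = agent.train(scenes, rounds=1)                   # Step 4.2: time-varying training
            agent.adjust_reward_weights(stats.collision_breakdown)  # raise weights on observed defects
    return agent
```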
Specifically, in Step 4.2, the process of training the reinforcement learning model with the time-varying training method is:
Step 4.2.1: set the hyperparameters of the reinforcement learning model;
Step 4.2.2: set the reward function to the basic reward so that the agent learns lane keeping, and start iterative training;
Step 4.2.3: increase the weights of the lane-center deviation penalty and the collision penalty, and continue iterative training;
Step 4.2.4: further increase the collision penalty and continue iterative training;
Step 4.2.5: add new scenes to the original scene dataset, add the speed reward, and increase the weights of the lane-center deviation penalty and the collision penalty until the iterations end.
The hyperparameters of the reinforcement learning model are set as shown in the following table:
First, the reward function is set to the basic function so that the agent learns lane keeping; it is expressed as 1.0 × environment reward + 0.1 × lane-center deviation penalty + 1.2 × collision penalty, and the model is trained iteratively for 460 rounds starting from round 0. Then the weights of the lane-center deviation penalty and the collision penalty are increased, giving 1.0 × environment reward + 0.4 × lane-center deviation penalty + 1.6 × collision penalty, and training continues from round 461 to round 768. The collision penalty weight is increased again, giving 1.0 × environment reward + 0.4 × lane-center deviation penalty + 1.8 × collision penalty, and training continues from round 769 to round 1152. Finally, 3 new scenes are added to the original scene dataset, including 2 all_loop scenes and 1 mix_loop scene, a speed reward is added (the faster the vehicle drives, the larger the reward), and the weights of the lane-center deviation penalty and the collision penalty are increased, giving 1.0 × environment reward + 0.4 × speed reward + 0.56 × lane-center deviation penalty + 2.9 × collision penalty; training continues from round 1153 to round 1400, after which the training process ends.
On top of traditional reinforcement learning, the reinforcement learning model is trained with a time-varying strategy, producing an agent model with higher robustness and generalization. The time-varying strategy is a staged learning method: the agent's task objectives are divided into sub-tasks by importance and learned in different stages, and in each stage the learning task is reinforced hierarchically by adjusting the reward weights. The agent is first trained to acquire lane-keeping ability, then collision-avoidance ability, and finally high-speed driving ability. The training process of the intelligent vehicle model consists of multiple iterations; in each iteration the reward weights are adjusted according to the previous simulation results. Specifically, each iteration records the collisions that occurred in the previous simulation and their types, and the reward proportions are adjusted according to the share of each collision type among all collisions.
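The round-indexed weight schedule stated above can be transcribed directly, as in the sketch below; the phase boundaries and weights are those given in the preceding paragraph, while the data layout itself is an assumed convention.

```python
# (last round of the phase, reward weights): environment, speed, lane-center deviation, collision.
REWARD_SCHEDULE = [
    (460,  dict(env=1.0, speed=0.0,  lane=0.1,  collision=1.2)),  # learn lane keeping
    (768,  dict(env=1.0, speed=0.0,  lane=0.4,  collision=1.6)),  # stronger lane/collision penalties
    (1152, dict(env=1.0, speed=0.0,  lane=0.4,  collision=1.8)),  # stronger collision penalty
    (1400, dict(env=1.0, speed=0.4,  lane=0.56, collision=2.9)),  # new scenes + speed reward
]

def weights_for_round(training_round: int) -> dict:
    """Return the reward weights in force at a given training round (illustrative)."""
    for last_round, weights in REWARD_SCHEDULE:
        if training_round <= last_round:
            return weights
    return REWARD_SCHEDULE[-1][1]
```

For instance, `weights_for_round(900)` returns the phase-3 weights used between rounds 769 and 1152.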
In Step 5, the dangerous action recognizer involves two phases, sample collection and training. Referring to Figs. 7 and 8, the sample collection process is as follows:
Step 5.1.1: prepare multiple types of scenes and select one scene to start training;
Step 5.1.2: initialize a PPO policy model and start training the policy model on the selected scene;
Step 5.1.3: record the trajectory of the current episode while it runs;
Step 5.1.4: when a collision occurs, collect the 10 steps before the collision as a negative sample, and randomly collect any 10 consecutive steps of the current trajectory as a positive sample;
Step 5.1.5: train until the number of steps reaches the set total, then select the next scene and repeat;
Step 5.1.6: finish once all scenes have been collected.
Specifically, 7 types of scenes are used here, and one of them is selected to start training. The PPO policy model is initialized and the policy is trained from scratch on the selected scene, with the total number of steps set to 400,000. During each episode the trajectory is recorded, including the observation-action pairs of every step (s1, a1, s2, a2, ...) and the runtime reward values (r1, r2, ...). If a collision occurs, the observation-action-reward tuples of the 10 steps before the collision are recorded and collected as a negative sample; with the trajectory length of that episode denoted m, a random number k is drawn from the interval [1, m-19], and the observation-action-reward tuples of the 10 steps starting from step k are collected as a positive sample. Ten rolling averages are computed from the reward values of each collected sample group and used as the training labels of that group, calculated as follows:
where i denotes the i-th scene and j denotes the j-th step within the 10-step sample. The observation-action-label tuples are stored as the training dataset, and training runs until the number of steps reaches the set total of 400,000. The next scene is then selected and Steps 5.1.2 to 5.1.5 are repeated until all scenes have been collected; the total number of collected samples is about 100,000.
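A minimal sketch of the sample-collection loop is shown below, assuming a gym-style environment and policy interface that the patent does not specify; the 10-step window, the random index range [1, m-19], and the collision-triggered negative window follow the description above, while the rolling-average labels are not computed here.

```python
import random

WINDOW = 10  # steps per collected sample

def collect_samples(env, policy, total_steps=400_000):
    """Collect negative (pre-collision) and positive (ordinary) 10-step samples (illustrative)."""
    samples, steps_done = [], 0
    while steps_done < total_steps:
        trajectory = []                                   # (obs, action, reward) tuples of this episode
        obs, done, collided = env.reset(), False, False
        while not done:
            action = policy.act(obs)
            next_obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            obs, collided = next_obs, info.get("collision", False)
            steps_done += 1
        m = len(trajectory)
        if collided and m >= 2 * WINDOW:
            samples.append(("negative", trajectory[-WINDOW:]))              # 10 steps before the collision
            k = random.randint(1, m - 19)                                   # random earlier 10-step window
            samples.append(("positive", trajectory[k - 1:k - 1 + WINDOW]))
    return samples
```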
Referring to Figs. 5 and 6, the dangerous action recognizer model is built on a long short-term memory (LSTM) network. The model contains one LSTM layer and two fully connected layers; the LSTM output passes through the first fully connected layer, a ReLU activation function, and then the second fully connected layer. The training process of the dangerous action recognizer is as follows:
Step 5.2.1: for each group of collected sample data, generate data groups with a sliding window as the model input, using the label of the last step of each group as the model's target label;
Step 5.2.2: use the Adam optimizer and adjust its learning rate with cosine simulated annealing;
Step 5.2.3: use the mean squared error loss as the training loss function and compute the mean squared error between the model output and the target label;
Step 5.2.4: set the relevant model parameters and the number of training rounds, and complete training.
Specifically, each collected sample group contains 10 steps of observation-action-label tuples. The observation is 44-dimensional and the action is 3-dimensional; after concatenating observation and action into a 47-dimensional vector, a sliding window of size 5 is applied to the 10-step data to generate 6 data groups of length 5, each 5-step group serving as a model input with the label of the last step of the group as the target label. The Adam optimizer is used with its learning rate adjusted by cosine simulated annealing, and the mean squared error between the model output and the target label is used as the training loss. The model parameters are set as shown in the following table, and training ends after 100 rounds.
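One possible PyTorch realization of the recognizer described above is sketched below: one LSTM layer followed by two fully connected layers with a ReLU in between, 5-step windows of 47-dimensional inputs, MSE loss, and Adam with cosine annealing. The hidden sizes, learning rate, and full-batch training loop are placeholder assumptions, since the parameter table is not reproduced in this text.

```python
import torch
import torch.nn as nn

class DangerRecognizer(nn.Module):
    """LSTM plus two fully connected layers; scores the danger of a 5-step observation-action window."""
    def __init__(self, input_dim=47, hidden_dim=64, fc_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, fc_dim)
        self.fc2 = nn.Linear(fc_dim, 1)

    def forward(self, x):                      # x: (batch, 5, 47)
        out, _ = self.lstm(x)
        h = torch.relu(self.fc1(out[:, -1]))   # last time step -> FC -> ReLU -> FC
        return self.fc2(h).squeeze(-1)

def train_recognizer(windows, labels, epochs=100, lr=1e-3):
    """windows: (N, 5, 47) float tensor; labels: (N,) float tensor. Illustrative training loop."""
    model = DangerRecognizer()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(windows), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
    return model
```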
The dangerous action recognizer improves the vehicle's safety: it predicts the danger level from the actions output by the reinforcement learning model and the environment observation information, and takes actions such as emergency avoidance and emergency adjustment according to that level. The recognizer involves two phases, sample collection and training. In the sample collection phase, a large number of safe and dangerous samples are gathered while progressively training the policy in the simulator environment, and sample labels are computed in preparation for training the recognizer. The samples must be collected across multiple scenes and policies of different quality, and the numbers of positive and negative samples must be balanced. In the training phase, the collected sample data are used for training until it is finished.
Referring to Fig. 4, the rule constrainer in Step 5 is written from human knowledge and empirical statistics from simulation experiments and is used to restrict the ego vehicle's behavior in certain specific situations. It mainly includes knowledge rules for closest-distance protection, knowledge rules at intersections, knowledge rules before sharp bends, and a correction rule for long dwelling when the ego vehicle has no nearby vehicles. Different scenes are judged from the environment observation data to decide the ego speed-limit rules. Examples of the rule constrainer are given below.
In the rule constrainer, the first part consists mainly of knowledge rules based on closest-distance protection. The three environment vehicles closest to the ego vehicle are first selected; when an environment vehicle is less than 20 meters from the ego vehicle, the TDTC conditions are evaluated, the danger level is judged from the TDTC and the environment vehicle's speed, and the maximum ego speed is limited accordingly. The specific rules are as follows (a code transcription of a subset is sketched after the list):
1. -2 < TDTC < 1, d/v_cv < 1, v_cv ≥ 30: limit the maximum ego speed to 5, where d is the distance between the ego vehicle and the conflicting vehicle and v_cv is the conflicting vehicle's speed;
2. 0 < TDTC < 1.2: limit the maximum ego speed to 2;
3. -1.2 < TDTC < 0: limit the maximum ego speed to 0.1 × the planned speed;
4. 1.2 ≤ TDTC < 3: limit the maximum ego speed to 15 + 20(TDTC - 1.2)/1.8;
5. -1.8 < TDTC ≤ -1.2, v_cv > 20: limit the maximum ego speed to 0.1 + 1.4(TDTC - 1.2)/0.6;
6. -3 < TDTC ≤ -1.8, v_cv > 20: limit the maximum ego speed to 1.4 + 6(TDTC - 1.8)/1.2;
7. -3 < TDTC ≤ -1.2, v_cv ≤ 20: limit the maximum ego speed to 5 + 15(TDTC - 1.2)/1.8;
8. 3 ≤ TDTC < 7: limit the maximum ego speed to 30 + 20(TDTC - 3)/4;
9. -7 < TDTC ≤ 3: limit the maximum ego speed to 30 + 20(TDTC - 3)/4;
10. d < 4, v_cv > 5: limit the maximum ego speed to 0;
11. d < 5, v_cv > 13: limit the maximum ego speed to 2;
12. d < 7, v_cv > 20: limit the maximum ego speed to 5.
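As an illustration, a subset of the closest-distance rules above can be transcribed as an ordered list of guards that clamp the planned speed; the "first matching rule wins" policy and the fallback to the planned speed are assumptions about how the rules are combined.

```python
def closest_distance_speed_cap(tdtc, d, v_cv, planned_speed):
    """Return a maximum ego speed from a few of the closest-distance rules (illustrative subset)."""
    rules = [
        (lambda: v_cv >= 30 and -2 < tdtc < 1 and d / v_cv < 1, lambda: 5.0),                         # rule 1
        (lambda: 0 < tdtc < 1.2,                                lambda: 2.0),                         # rule 2
        (lambda: -1.2 < tdtc < 0,                               lambda: 0.1 * planned_speed),         # rule 3
        (lambda: 1.2 <= tdtc < 3,                               lambda: 15 + 20 * (tdtc - 1.2) / 1.8),# rule 4
        (lambda: d < 4 and v_cv > 5,                            lambda: 0.0),                         # rule 10
    ]
    for condition, cap in rules:
        if condition():
            return cap()
    return planned_speed  # no rule fires: keep the planned speed
```

For example, `closest_distance_speed_cap(2.0, 12.0, 10.0, 40.0)` falls under rule 4 and returns 15 + 20 × 0.8/1.8 ≈ 23.9.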
The second part of the rule constrainer consists of knowledge rules at intersections. These rules perform collision avoidance with vehicles traveling in the lateral direction while the ego vehicle passes through an intersection: rectangular judgment regions of different sizes are set in front of the ego vehicle, each region is checked for environment vehicles, and the ego speed is limited by different amounts depending on the size of the region in which an environment vehicle is detected. In addition, the size of the rectangular region is corrected according to whether the ego vehicle is turning, which is obtained from the heading-angle deviation between the ego vehicle and the waypoints ahead. The specific rules are as follows (a sketch of the region check follows the list):
1. A non-stationary environment vehicle is present within a longitudinal distance of 5.5 in front of the ego vehicle and a lateral distance of 10 on either side, and it satisfies one of TDTC < 8 or v_cv < 4: limit the maximum ego speed to 0;
2. A non-stationary environment vehicle is present within a longitudinal distance of 7 in front of the ego vehicle and a lateral distance of 10 on either side, and it satisfies one of TDTC < 8 or v_cv < 10: limit the maximum ego speed to 5;
3. A non-stationary environment vehicle is present within a longitudinal distance of 9 in front of the ego vehicle and a lateral distance of 10 on either side, and it satisfies one of TDTC < 8 or v_cv < 12: limit the maximum ego speed to 7;
4. The variance of the heading angle between the ego vehicle and the waypoints ahead is greater than 0.16, i.e., the ego vehicle is turning, a non-stationary environment vehicle is present within a longitudinal distance of 9 in front of the ego vehicle and a lateral distance of 9 on either side, and it satisfies one of TDTC < 8 or v_cv < 4: limit the maximum ego speed to 0;
5. The variance of the heading angle between the ego vehicle and the waypoints ahead is greater than 0.16, i.e., the ego vehicle is turning, a non-stationary environment vehicle is present within a longitudinal distance of 10.5 in front of the ego vehicle and a lateral distance of 9 on either side, and it satisfies one of TDTC < 8 or v_cv < 4: limit the maximum ego speed to 7.
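The rectangular judgment regions can be checked with simple ego-frame coordinate comparisons, as sketched below; the vehicle representation and the combination of the base and turning regions by taking the smallest applicable cap are assumptions.

```python
def intersection_speed_cap(env_vehicles, ego_is_turning):
    """Apply the intersection region rules; return a speed cap, or None if no rule fires (illustrative)."""
    # (longitudinal reach, lateral half-width, TDTC bound, speed bound, resulting cap), as listed above.
    regions = [(5.5, 10.0, 8.0, 4.0, 0.0),
               (7.0, 10.0, 8.0, 10.0, 5.0),
               (9.0, 10.0, 8.0, 12.0, 7.0)]
    if ego_is_turning:  # heading-angle variance > 0.16
        regions += [(9.0, 9.0, 8.0, 4.0, 0.0), (10.5, 9.0, 8.0, 4.0, 7.0)]
    caps = []
    for lon, lat, tdtc_max, v_max, cap in regions:
        for v in env_vehicles:  # v: dict with ego-frame x (forward), y (left), speed, tdtc, stationary flag
            inside = 0.0 <= v["x"] <= lon and abs(v["y"]) <= lat
            if inside and not v["stationary"] and (v["tdtc"] < tdtc_max or v["speed"] < v_max):
                caps.append(cap)
    return min(caps) if caps else None
```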
The third part of the rule constrainer is the correction rule for long dwelling when the ego vehicle has no nearby vehicles, which is:
1. No environment vehicle is within a range of 30 of the ego vehicle, and all conflicting vehicles satisfy v_cv < 5 and |TDTC| > 9: accelerate the ego vehicle manually by setting the minimum throttle control value to 0.3.
The fourth part of the rule constrainer consists of knowledge rules before sharp bends, which limit the speed according to the variance V_h of the heading angle between the ego vehicle and the waypoints ahead. The rules are as follows (a sketch follows the list):
1. V_h > 0.33: limit the maximum ego speed to 8;
2. 0.25 < V_h ≤ 0.33: limit the maximum ego speed to 8 + 10(0.3 - V_h)/0.08;
3. 0.2 < V_h ≤ 0.25: limit the maximum ego speed to 14 + 10(0.25 - V_h)/0.05;
4. 0.15 < V_h ≤ 0.2: limit the maximum ego speed to 22 + 20(0.2 - V_h)/0.05;
5. 0.1 < V_h ≤ 0.15: limit the maximum ego speed to 30 + 20(0.15 - V_h)/0.05;
6. 0.03 < V_h ≤ 0.1: limit the maximum ego speed to 40 + 20(0.1 - V_h)/0.07;
7. 0.01 < V_h ≤ 0.03: limit the maximum ego speed to 60 + 20(0.03 - V_h)/0.02.
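The sharp-bend rules form a piecewise function of the heading-angle variance V_h and can be transcribed directly, as in the sketch below; leaving values of V_h at or below 0.01 uncapped is an assumption.

```python
def sharp_bend_speed_cap(v_h):
    """Maximum ego speed before a sharp bend as a piecewise function of the heading variance V_h."""
    if v_h > 0.33:
        return 8.0
    if 0.25 < v_h <= 0.33:
        return 8 + 10 * (0.3 - v_h) / 0.08
    if 0.2 < v_h <= 0.25:
        return 14 + 10 * (0.25 - v_h) / 0.05
    if 0.15 < v_h <= 0.2:
        return 22 + 20 * (0.2 - v_h) / 0.05
    if 0.1 < v_h <= 0.15:
        return 30 + 20 * (0.15 - v_h) / 0.05
    if 0.03 < v_h <= 0.1:
        return 40 + 20 * (0.1 - v_h) / 0.07
    if 0.01 < v_h <= 0.03:
        return 60 + 20 * (0.03 - v_h) / 0.02
    return None  # V_h <= 0.01: no sharp-bend limit
```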
The main role of the rule constrainer above is to judge whether the environment is dangerous and to bound the magnitude of the actions output by reinforcement learning. This is mainly reflected in limits on the planned driving speed and, in the control part, in braking behavior; it also provides auxiliary acceleration when the road ahead is safe but the speed output by the reinforcement learning model is too low, ensuring both driving safety and efficiency.
In summary, the present invention provides a solution for safe reinforcement learning in multi-target scenes. The technology can be applied to fields such as intelligent vehicle driver assistance and driverless driving. Compared with traditional fully end-to-end schemes and fully rule-based schemes, it offers a new hybrid approach that combines the advantages of both to achieve highly safe, intelligent, and efficient driving in complex and varied scenes; the technology therefore has high promotion value.
The above description covers only some preferred embodiments of the present disclosure and explains the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the technical features above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example technical solutions formed by replacing the above features with technical features of similar functions disclosed in (but not limited to) the embodiments of the present disclosure.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210370991.7A CN114701517A (en) | 2022-04-07 | 2022-04-07 | Multi-target complex traffic scene automatic driving solution based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210370991.7A CN114701517A (en) | 2022-04-07 | 2022-04-07 | Multi-target complex traffic scene automatic driving solution based on reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114701517A true CN114701517A (en) | 2022-07-05 |
Family
ID=82171815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210370991.7A Pending CN114701517A (en) | 2022-04-07 | 2022-04-07 | Multi-target complex traffic scene automatic driving solution based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114701517A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116361472A (en) * | 2023-05-02 | 2023-06-30 | 周维 | Public opinion big data analysis system for social network comment hot events |
- 2022
  - 2022-04-07: Application CN202210370991.7A filed in China; published as CN114701517A (status: Pending)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116361472A (en) * | 2023-05-02 | 2023-06-30 | 周维 | Public opinion big data analysis system for social network comment hot events |
CN116361472B (en) * | 2023-05-02 | 2024-05-03 | 脉讯在线(北京)信息技术有限公司 | Method for analyzing public opinion big data of social network comment hot event |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111222630B (en) | A Learning Method for Autonomous Driving Rules Based on Deep Reinforcement Learning | |
CN110745136B (en) | A driving adaptive control method | |
CA3065617C (en) | Method for predicting car-following behavior under apollo platform | |
WO2022052406A1 (en) | Automatic driving training method, apparatus and device, and medium | |
CN115056798B (en) | Automatic driving vehicle lane change behavior vehicle-road collaborative decision algorithm based on Bayesian game | |
CN114312830B (en) | Intelligent vehicle coupling decision model and method considering dangerous driving conditions | |
CN114153213A (en) | A deep reinforcement learning intelligent vehicle behavior decision-making method based on path planning | |
CN113581182B (en) | Automatic driving vehicle lane change track planning method and system based on reinforcement learning | |
CN115062202B (en) | Prediction method, device, equipment and storage medium for driving behavior intention and track | |
Gu et al. | Integrated eco-driving automation of intelligent vehicles in multi-lane scenario via model-accelerated reinforcement learning | |
CN114919578A (en) | Intelligent vehicle behavior decision method, planning method, system and storage medium | |
CN113901718A (en) | Deep reinforcement learning-based driving collision avoidance optimization method in following state | |
CN112249008A (en) | Unmanned automobile early warning method aiming at complex dynamic environment | |
Lodhi et al. | Autonomous vehicular overtaking maneuver: A survey and taxonomy | |
CN115257789A (en) | Decision-making method for side anti-collision driving of commercial vehicle in urban low-speed environment | |
CN117227755A (en) | Automatic driving decision method and system based on reinforcement learning under complex traffic scene | |
CN117197784A (en) | Automatic driving behavior decision and model training method, system, equipment and medium | |
CN116639124A (en) | Automatic driving vehicle lane changing method based on double-layer deep reinforcement learning | |
CN115169951A (en) | Multi-feature-fused automatic driving course reinforcement learning training method | |
CN111368465A (en) | Unmanned decision-making method based on ID3 decision tree | |
Fan et al. | Deep reinforcement learning based integrated eco-driving strategy for connected and automated electric vehicles in complex urban scenarios | |
CN114701517A (en) | Multi-target complex traffic scene automatic driving solution based on reinforcement learning | |
CN118212808B (en) | Method, system and equipment for planning traffic decision of signalless intersection | |
CN114117944B (en) | Model updating method, device, equipment and readable storage medium | |
CN114954498A (en) | Reinforcement learning lane changing behavior planning method and system based on imitation learning initialization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||