CN114701517A - Multi-target complex traffic scene automatic driving solution based on reinforcement learning - Google Patents
Multi-target complex traffic scene automatic driving solution based on reinforcement learning
- Publication number
- CN114701517A (Application No. CN202210370991.7A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- training
- reinforcement learning
- collision
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000002787 reinforcement Effects 0.000 title claims abstract description 76
- 238000012549 training Methods 0.000 claims abstract description 117
- 238000000034 method Methods 0.000 claims abstract description 58
- 230000009471 action Effects 0.000 claims abstract description 36
- 230000008569 process Effects 0.000 claims abstract description 22
- 230000015654 memory Effects 0.000 claims abstract description 3
- 230000007613 environmental effect Effects 0.000 claims description 53
- 230000006870 function Effects 0.000 claims description 16
- 238000004088 simulation Methods 0.000 claims description 12
- 230000006399 behavior Effects 0.000 claims description 6
- 238000012937 correction Methods 0.000 claims description 4
- 230000007547 defect Effects 0.000 claims description 4
- 230000000694 effects Effects 0.000 claims description 4
- 238000002922 simulated annealing Methods 0.000 claims description 4
- 230000004083 survival effect Effects 0.000 claims description 4
- 230000008447 perception Effects 0.000 abstract description 5
- 238000004422 calculation algorithm Methods 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 5
- 238000013461 design Methods 0.000 description 4
- 230000006403 short-term memory Effects 0.000 description 4
- 230000007774 longterm Effects 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 238000007670 refining Methods 0.000 description 2
- 238000005070 sampling Methods 0.000 description 2
- 206010039203 Road traffic accident Diseases 0.000 description 1
- 230000001133 acceleration Effects 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 238000000137 annealing Methods 0.000 description 1
- 230000003542 behavioural effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000005096 rolling process Methods 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0015—Planning or execution of driving tasks specially adapted for safety
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W40/00—Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mechanical Engineering (AREA)
- Transportation (AREA)
- Automation & Control Theory (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Traffic Control Systems (AREA)
Abstract
Description
Technical Field
The invention relates to a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes. It belongs to the technical field of autonomous driving and in particular concerns a general modeling and training scheme, based on deep reinforcement learning, for autonomous driving algorithms in multi-target complex traffic scenes.
Background Art
With the rapid development of the intelligent vehicle industry and the continuous maturing of autonomous driving technology, driverless operation has become the trend of future vehicle development. Autonomous driving systems deployed today reach Level 3 (L3): the vehicle drives itself while the driver supervises, and the driver must take over in an emergency. One important reason why fully driverless operation has not yet been achieved is that current rule-based decision-making methods cannot handle a sufficient range of traffic scenarios, leaving considerable safety risks.
Mainstream autonomous driving technology can be divided into three modules: perception, decision-making, and control, with the decision-making module serving as the core of the intelligent system. Current autonomous driving decision-making technology falls into two broad categories, rule-based and learning-based. Rule-based decision-making techniques include general decision models, finite state machine models, decision tree models, and knowledge-reasoning models. Learning-based decision-making techniques are mainly built on deep learning and reinforcement learning.
Rule-based decision-making is still what is used in practice, but it exposes more and more problems. A rule-based system can hardly enumerate every possible scenario, and scenarios that were not considered can easily lead to traffic accidents. In addition, designing a rule system is costly in both labor and system complexity, and maintaining and upgrading it is particularly cumbersome. There is therefore a strong desire to develop and refine other approaches, and data-driven deep reinforcement learning is one such direction. However, today's applications of deep reinforcement learning to autonomous driving mostly target a single specific scenario, such as overtaking, lane changing, or lane keeping, and are not very general. Furthermore, deep neural networks are not yet fully interpretable, suffer from catastrophic forgetting, and can produce unexpected unsafe actions. Reinforcement learning itself also has generalization and stability problems. Only by giving deep reinforcement learning a practically meaningful role in autonomous driving decision and control can the problems of poor generality, weak generalization, and unimproved safety be alleviated to some extent.
Summary of the Invention
This summary introduces concepts in a simplified form that are described in detail in the detailed description that follows. It is not intended to identify key or essential features of the claimed technical solution, nor to limit the scope of the claimed technical solution.
In view of the problems and shortcomings of the prior art, the purpose of the present invention is to provide a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes. Through an anthropomorphic observation design and a reward function design based on reward reshaping, a single model can achieve better overall performance across multiple types of traffic scenes with diverse environment-vehicle policies. In addition, to improve training speed and generalization, the invention proposes a time-varying reward training method. To enhance safety, the invention proposes safety-guarantee methods such as an LSTM-based dangerous action recognizer and knowledge-based safety filtering, so as to solve the problems raised in the background art above.
To achieve the above object, the present invention provides the following technical solution.
The invention discloses a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes, comprising the following steps:
Step 1: prepare a simulator environment and complex driving scenes for autonomous driving simulation;
Step 2: add the environmental feature information required for training the reinforcement learning model to the observation space as environment observation information, including ego-vehicle information, other-vehicle information, and road information, and compute key feature information from it, including the time-to-collision with the preceding vehicle in each lane, the time difference to collision between the ego vehicle and the collision point, and the variance of the heading angle between the ego vehicle and the waypoints ahead;
Step 3: set up the reward framework required for training the reinforcement learning model;
Step 4: train the reinforcement learning model with the time-varying training method, then continue training with the meta-model across different traffic scenes; every fixed number of iteration rounds, adjust the reward weights according to the agent's driving performance and the types of collisions that occur, repeating the weight adjustment several times before ending training;
Step 5: after outputting the trained reinforcement learning model, build a dangerous action recognizer and a rule constrainer that judge the danger level of the scene from the environment observation information in order to limit or adjust the planning quantities output by the reinforcement learning model, and keep adding and refining rules manually by observing their effect in the simulation environment.
Further, in Step 2, the environmental feature information required for training the reinforcement learning model is added to the observation space. The ego-vehicle information includes the ego speed, the ego steering wheel angle, the index of the lane the ego vehicle occupies, the distance between the ego center point and the centerline of its lane, the heading-angle deviations between the ego vehicle and the fifteen waypoints ahead of its position, and the relative positions of the preview points 4 and 8 meters ahead. The other-vehicle information includes the distance to the nearest vehicle in each lane, the time-to-collision with the preceding vehicle in each lane, and the relative speed between the nearest other vehicle in the ego lane and the ego vehicle. The road information includes the lateral distance between the ego vehicle and the lane center and the waypoint heading error.
Further, the reward framework in Step 3 includes an environment reward, a speed reward, a collision penalty, and a lane-center deviation penalty. The environment reward is the ego survival time, i.e., the time the ego vehicle travels from the start until a collision occurs. The speed reward is the ego driving speed, measured as distance traveled per second. The collision penalty is applied when the ego vehicle leaves its route, drives out of the road boundary, or collides with an environment vehicle. The lane-center deviation penalty is the absolute value of the distance between the vehicle center and the lane centerline.
Further, the specific steps of training the reinforcement learning model in Step 4 with the time-varying training method combined with the meta-model are:
Step 4.1: initialize the reinforcement learning model and train it on each scene in turn for a certain number of rounds to obtain a meta-model;
Step 4.2: train the meta-model obtained in Step 4.1 on the selected scene with the time-varying training method, adjusting the reward weights according to the defects in the agent's behavior;
Step 4.3: set the scenes to all simple scenes without intersections and repeat the training process of Step 4.2 to improve performance in simple intersection-free scenes;
Step 4.4: set the scenes to all scenes containing intersections and repeat the training process of Step 4.2 to improve performance in intersection scenes;
Step 4.5: set the scenes to those containing roundabouts and multi-directional traffic and repeat the training process of Step 4.2 to improve performance in roundabout and multi-directional traffic scenes;
Step 4.6: continue training on the remaining scenes until the process ends.
Further, the specific steps of training the reinforcement learning model with the time-varying training method in Step 4 are:
Step 4.2.1: set the hyperparameters of the reinforcement learning model;
Step 4.2.2: set the reward function to the basic reward so that the agent learns lane keeping, and start iterative training;
Step 4.2.3: increase the weights of the lane-center deviation penalty and the collision penalty, and continue iterative training;
Step 4.2.4: further increase the collision penalty and continue iterative training;
Step 4.2.5: add new scenes to the original scene dataset, add the speed reward, and increase the weights of the lane-center deviation penalty and the collision penalty until the iterations end.
Further, the dangerous action recognizer in Step 5 predicts the danger level from the actions output by the reinforcement learning model and the environment observation information, and takes actions such as emergency avoidance and emergency adjustment according to the danger level. The dangerous action recognizer involves a sample collection phase and a training phase. The specific steps of the sample collection phase are:
Step 5.1.1: prepare multiple types of scenes and select one scene to start training;
Step 5.1.2: initialize a PPO policy model and start training the policy model on the selected scene;
Step 5.1.3: record the trajectory of the current episode while it runs;
Step 5.1.4: when a collision occurs, collect the 10 steps before the collision as a negative sample, and randomly collect any 10 consecutive steps of the current trajectory as a positive sample;
Step 5.1.5: train until the number of steps reaches the set total, then select the next scene and repeat;
Step 5.1.6: finish once all scenes have been collected.
Further, the dangerous action recognizer in Step 5 is a dangerous action recognizer model built on a long short-term memory network. The specific steps of the training phase are:
Step 5.2.1: for each group of collected sample data, generate data groups with a sliding window as the model input, using the label of the last step of each group as the model's target label;
Step 5.2.2: use the Adam optimizer and adjust its learning rate with cosine simulated annealing;
Step 5.2.3: use the mean squared error loss as the training loss function and compute the mean squared error between the model output and the target label;
Step 5.2.4: set the relevant model parameters and the number of training rounds, and complete training.
Further, the rule constrainer in Step 5 consists of rules written from human knowledge and empirical statistics from simulation experiments and is used to restrict the ego vehicle's behavior in certain specific situations. The rule constrainer mainly includes knowledge rules for closest-distance protection, knowledge rules at intersections, knowledge rules before sharp bends, and a correction rule for long dwelling when the ego vehicle has no nearby vehicles. Different scenes are judged from the environment observation information to decide the ego speed-limit rules.
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention provides a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes. The method can handle all traffic scenes with a single reinforcement-learning modeling scheme for autonomous driving, has good generality, and achieves good multi-objective and generalization performance. The comprehensive reinforcement learning model is built on the traditional reinforcement learning framework: environment perception information and feature quantities extracted with human knowledge form the observation space, and, according to the evaluation metrics, lane keeping, travel distance, and collision avoidance are set as the rewards and penalties of the agent vehicle in the reinforcement learning algorithm. During model training, the meta-learning idea is combined with a time-varying training strategy; each stage sets different reward weights and training sets to correct the behavioral defects the agent formed in previous training stages and to improve its performance in its weaker scenes, which raises training speed and the generalization of the applied policy. In addition, to further guarantee safety, a dangerous action recognizer based on a long short-term memory (LSTM) network and a rule constrainer based on a body of human knowledge are proposed. Samples are taken from the environment to train the dangerous action recognizer so that the vehicle can identify dangerous actions and dangerous scenes, and rule constraints designed for specific situations limit the output actions. This greatly improves safety, reduces the number of collisions, and handles special emergencies, thereby ensuring the vehicle's driving safety.
Description of the Drawings
The accompanying drawings, which form part of this application, are provided to give a further understanding of the application and to make its other features, objects, and advantages more apparent. The drawings of the exemplary embodiments and their descriptions explain the application and do not unduly limit it.
In the drawings:
Fig. 1 is a schematic flowchart of the overall reinforcement-learning autonomous driving solution for multi-target complex traffic scenes according to the present invention;
Fig. 2 is a schematic diagram of the steps of the reinforcement-learning autonomous driving solution for multi-target complex traffic scenes according to the present invention;
Fig. 3 is a schematic flowchart of training the reinforcement learning model with the time-varying training method and the meta-model according to the present invention;
Fig. 4 is a schematic diagram of the framework that combines end-to-end control with the rule constrainer to control autonomous driving according to the present invention;
Fig. 5 is a schematic diagram of the reinforcement learning policy network based on proximal policy optimization according to the present invention;
Fig. 6 is a schematic structural diagram of the dangerous action recognizer model based on a long short-term memory (LSTM) neural network according to the present invention;
Fig. 7 is a schematic flowchart of the sampling phase of the dangerous action recognizer according to the present invention;
Fig. 8 is a block diagram, in computer-language form, of the sampling algorithm of the dangerous action recognizer according to the present invention.
Detailed Description of the Embodiments
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the disclosure. The drawings and embodiments of the present disclosure are for illustration only and are not intended to limit its protection scope.
It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings. The embodiments of this disclosure and the features of the embodiments may be combined with each other where they do not conflict.
The present invention discloses a reinforcement-learning-based autonomous driving solution for multi-target complex traffic scenes, which is described in detail below with reference to the drawings and embodiments. Referring to Figs. 1 and 2, it mainly includes the following steps:
Step 1: prepare a simulator environment and complex driving scenes for autonomous driving simulation;
Step 2: add the environmental feature information required for training the reinforcement learning model to the observation space as environment observation information, including ego-vehicle information, other-vehicle information, and road information, and compute key feature information from it, including the time-to-collision with the preceding vehicle in each lane, the time difference to collision between the ego vehicle and the collision point, and the variance of the heading angle between the ego vehicle and the waypoints ahead;
Step 3: set up the reward framework required for training the reinforcement learning model;
Step 4: train the reinforcement learning model with the time-varying training method, then continue training with the meta-model across different traffic scenes; every fixed number of iteration rounds, adjust the reward weights according to the agent's driving performance and the types of collisions that occur, repeating the weight adjustment several times before ending training;
Step 5: after outputting the trained reinforcement learning model, build a dangerous action recognizer and a rule constrainer that judge the danger level of the scene from the environment observation information in order to limit or adjust the planning quantities output by the reinforcement learning model, and keep adding and refining rules manually by observing their effect in the simulation environment.
The multiple targets of the present invention are accurate driving, fast driving, safe driving, and high robustness and generalization of the algorithm. Specifically, when the same number of simulations is run in a single traffic scene up to the maximum travel distance, driving speed is reflected in keeping the ego vehicle's average speed as high as possible. Driving safety is reflected in as few collisions with environment vehicles and as few route departures as possible, with the ego vehicle traveling as far as possible. Driving accuracy is reflected in keeping the average distance between the ego center point and the road centerline as small as possible. High robustness and generalization are reflected in a single algorithm achieving good results across multiple complex traffic scenes as well as on map scenes it has not been trained on.
Complex traffic scenes refer to the map scenes used for training in the simulator, which cover multiple road types and traffic conditions. The map types include simple roads, sharp bends, intersections, roundabouts, merges, diverges, and mixed road scenes, nine map types in total, and different scenes contain environment vehicles of different densities. Under the rules set by the simulated environment, the trajectories and driving policies of the environment vehicles are partly random; the policy types can be divided into conservative, moderate, and aggressive, and under each policy the operating parameters of the environment vehicles still retain some randomness.
In Step 2, the environmental feature information required for training the reinforcement learning model is added to the observation space as environment observation information, and key feature quantities are computed from it. In this reinforcement learning model, the input observation space contains environment perception information such as ego-vehicle information, other-vehicle information, and road information, together with observation features extracted from the environment information; the output actions include throttle opening, brake control, and steering wheel angle control.
Specifically, the environment observation information includes ego-vehicle information, other-vehicle information, and road information. The ego-vehicle information includes the ego speed, the ego steering wheel angle, the index of the lane the ego vehicle occupies, the distance between the ego center point and the centerline of its lane, the heading-angle deviations between the ego vehicle and the fifteen waypoints ahead of its position, and the relative positions of the preview points 4 and 8 meters ahead, 2+1+1+15+4 = 19 dimensions in total. The other-vehicle information includes the distance to the nearest vehicle in each lane, the time-to-collision (TTC) with the preceding vehicle in each lane, and the relative speed between the nearest other vehicle in the ego lane and the ego vehicle, 6*3 = 18 dimensions in total; when the ego vehicle approaches an intersection, vehicles heading toward the intersection within fifty meters are queried, i.e., the relative position, relative heading, relative speed, and time difference to collision of the 5 vehicles closest to the ego vehicle that might collide with it, 5*5 = 25 dimensions in total. The road information includes the lateral distance from the ego vehicle to the lane center and the waypoint heading error, where the waypoint heading error is the distance from a waypoint ahead to the line along the ego heading. Starting from the waypoint at the ego position, 15 waypoints ahead are selected and the heading error is computed for each, with waypoints dense near the ego vehicle and sparse farther away; the selected waypoint offsets are [0, 1, 2, 3, 5, 7, 10, 13, 17, 21, 25, 30, 35, 42, 50], where the i-th value denotes the i-th waypoint ahead and also the distance of that waypoint from the agent vehicle.
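As a minimal sketch of how such an observation vector could be assembled, the Python fragment below computes heading errors at the listed waypoint offsets, a simple per-lane time-to-collision, and concatenates the ego, other-vehicle, and road blocks. The dictionary field names, the simulator interface, and the TTC formula are illustrative assumptions; the patent does not specify its data structures.

```python
import numpy as np

# Waypoint offsets for the heading-error features (dense near the ego vehicle, sparse far away).
WAYPOINT_OFFSETS = [0, 1, 2, 3, 5, 7, 10, 13, 17, 21, 25, 30, 35, 42, 50]

def heading_errors(ego_heading, ego_xy, waypoints):
    """Angle between the ego heading and the bearing to each selected waypoint ahead."""
    errors = []
    for i in WAYPOINT_OFFSETS:
        wp = np.asarray(waypoints[i], dtype=np.float64)
        bearing = np.arctan2(wp[1] - ego_xy[1], wp[0] - ego_xy[0])
        errors.append((bearing - ego_heading + np.pi) % (2 * np.pi) - np.pi)  # wrap to [-pi, pi)
    return np.asarray(errors, dtype=np.float32)

def time_to_collision(gap, ego_speed, lead_speed, eps=1e-3):
    """Per-lane TTC with the preceding vehicle: gap over closing speed (large value if not closing)."""
    closing = ego_speed - lead_speed
    return gap / closing if closing > eps else 1e3

def build_observation(ego, lanes, road):
    """Concatenate the ego, other-vehicle, and road blocks into one observation vector."""
    ego_block = np.concatenate([
        np.asarray([ego["speed"], ego["steering_angle"],
                    ego["lane_index"], ego["dist_to_lane_center"]], dtype=np.float32),
        heading_errors(ego["heading"], ego["xy"], ego["waypoints"]),
        np.ravel(np.asarray(ego["preview_points"], dtype=np.float32)),  # 4 m / 8 m preview points
    ])
    other_feats = []
    for lane in lanes:  # nearest gap, TTC with the lead vehicle, and relative speed, per lane
        other_feats += [lane["nearest_gap"],
                        time_to_collision(lane["nearest_gap"], ego["speed"], lane["lead_speed"]),
                        lane["relative_speed"]]
    road_block = np.asarray([road["lateral_offset"], road["waypoint_heading_error"]], dtype=np.float32)
    return np.concatenate([ego_block,
                           np.asarray(other_feats, dtype=np.float32),
                           road_block])
```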
The extracted key feature information mainly concerns collision-avoidance vehicles. The three other vehicles closest to the ego position are selected as collision-avoidance vehicles, and their information is computed in the ego coordinate frame. The key feature information therefore includes the relative position of each collision-avoidance vehicle, its absolute speed, the relative heading angle between the ego vehicle and the collision-avoidance vehicle, and the time difference to collision (TDTC) with respect to the collision point, where the TDTC is computed by the following formula:
In Step 3, the reward framework required for training the reinforcement learning model is set up. It includes an environment reward, a speed reward, a collision penalty, and a lane-center deviation penalty. The environment reward is the ego survival time, i.e., the time the ego vehicle travels from the start until a collision occurs; its value gradually increases from 1 to 4 and then increases step by step again starting from 1, and in the simulation environment a reward is given at every simulation step as long as the ego vehicle is still alive. The speed reward is the ego driving speed, measured as distance traveled per second. The lane-center deviation penalty is the absolute value of the distance between the vehicle center and the centerline. The collision penalty covers three cases: when the ego vehicle leaves its route, drives out of the road boundary, or collides with an environment vehicle, a corresponding penalty of constant magnitude 5 is given, whose weight increases with the number of iterations; that is, each occurrence yields a penalty of -5. The coefficients of the respective terms are 1, 0.5, 0.1, and 1.2, and these coefficients are adjusted during the training process.
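A minimal sketch of a per-step reward computed as the weighted sum described above is given below. The weights shown are the coefficients 1, 0.5, 0.1, and 1.2 stated in this paragraph, mapped to the four terms in the order they are introduced, which is an assumption; the step-information field names are likewise illustrative.

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    env: float = 1.0        # environment (survival) reward
    speed: float = 0.5      # driving-speed reward
    lane: float = 0.1       # lane-center deviation penalty
    collision: float = 1.2  # off-route / off-road / vehicle-collision penalty

def step_reward(info: dict, w: RewardWeights) -> float:
    """Weighted sum of the four reward terms for one simulation step (illustrative)."""
    r_env = info["survival_bonus"]              # grows while the ego vehicle stays alive
    r_speed = info["distance_this_second"]      # meters traveled per second
    p_lane = -abs(info["dist_to_lane_center"])  # lane-center deviation penalty
    p_collision = -5.0 if (info["off_route"] or info["off_road"] or info["hit_vehicle"]) else 0.0
    return w.env * r_env + w.speed * r_speed + w.lane * p_lane + w.collision * p_collision
```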
Referring to Fig. 3, training the reinforcement learning model with the time-varying training method combined with the meta-model improves the vehicle's performance in certain special scenes, such as intersections and roundabouts. The training steps are given below (a schematic code sketch follows the list):
Step 4.1: initialize the reinforcement learning model and train it on each scene in turn for a certain number of rounds to obtain a meta-model;
Step 4.2: train the meta-model obtained in Step 4.1 on the selected scene with the time-varying training method, adjusting the reward weights according to the defects in the agent's behavior;
Step 4.3: set the scenes to all simple scenes without intersections and repeat the training process of Step 4.2 to improve performance in simple intersection-free scenes;
Step 4.4: set the scenes to all scenes containing intersections and repeat the training process of Step 4.2 to improve performance in intersection scenes;
Step 4.5: set the scenes to those containing roundabouts and multi-directional traffic and repeat the training process of Step 4.2 to improve performance in roundabout and multi-directional traffic scenes;
Step 4.6: continue training on the remaining scenes until the process ends.
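A minimal sketch of how the scene curriculum of Steps 4.1 to 4.6 could be organized is given below; the scene tags and the `train` and `adjust_reward_weights` helpers are hypothetical stand-ins for the trainer and simulator, not the disclosed implementation.

```python
def meta_curriculum_training(agent, all_scenes, rounds_per_scene, rounds_per_stage):
    """Illustrative staged training: meta-model first, then scene groups of rising difficulty."""
    # Step 4.1: cycle through every scene once to obtain a meta-model.
    for scene in all_scenes:
        agent.train([scene], rounds=rounds_per_scene)

    # Steps 4.3-4.6: time-varying training on progressively harder scene groups.
    stages = [
        [s for s in all_scenes if "intersection" not in s.tags and "roundabout" not in s.tags],
        [s for s in all_scenes if "intersection" in s.tags],
        [s for s in all_scenes if "roundabout" in s.tags or "multi_direction" in s.tags],
        all_scenes,  # remaining / mixed scenes
    ]
    for scenes in stages:
        for _ in range(rounds_per_stage):
            stats = agent.train(scenes, rounds=1)                   # Step 4.2: time-varying training
            agent.adjust_reward_weights(stats.collision_breakdown)  # raise weights on observed defects
    return agent
```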
Specifically, in Step 4.2, the process of training the reinforcement learning model with the time-varying training method is:
Step 4.2.1: set the hyperparameters of the reinforcement learning model;
Step 4.2.2: set the reward function to the basic reward so that the agent learns lane keeping, and start iterative training;
Step 4.2.3: increase the weights of the lane-center deviation penalty and the collision penalty, and continue iterative training;
Step 4.2.4: further increase the collision penalty and continue iterative training;
Step 4.2.5: add new scenes to the original scene dataset, add the speed reward, and increase the weights of the lane-center deviation penalty and the collision penalty until the iterations end.
The hyperparameters of the reinforcement learning model are set as shown in the following table:
First, the reward function is set to the basic function so that the agent learns lane keeping; it is expressed as 1.0 × environment reward + 0.1 × lane-center deviation penalty + 1.2 × collision penalty, and the model is trained iteratively for 460 rounds starting from round 0. Then the weights of the lane-center deviation penalty and the collision penalty are increased, giving 1.0 × environment reward + 0.4 × lane-center deviation penalty + 1.6 × collision penalty, and training continues from round 461 to round 768. The collision penalty weight is increased again, giving 1.0 × environment reward + 0.4 × lane-center deviation penalty + 1.8 × collision penalty, and training continues from round 769 to round 1152. Finally, 3 new scenes are added to the original scene dataset, including 2 all_loop scenes and 1 mix_loop scene, a speed reward is added (the faster the vehicle drives, the larger the reward), and the weights of the lane-center deviation penalty and the collision penalty are increased, giving 1.0 × environment reward + 0.4 × speed reward + 0.56 × lane-center deviation penalty + 2.9 × collision penalty; training continues from round 1153 to round 1400, after which the training process ends.
On top of traditional reinforcement learning, the reinforcement learning model is trained with a time-varying strategy, producing an agent model with higher robustness and generalization. The time-varying strategy is a staged learning method: the agent's task objectives are divided into sub-tasks by importance and learned in different stages, and in each stage the learning task is reinforced hierarchically by adjusting the reward weights. The agent is first trained to acquire lane-keeping ability, then collision-avoidance ability, and finally high-speed driving ability. The training process of the intelligent vehicle model consists of multiple iterations; in each iteration the reward weights are adjusted according to the previous simulation results. Specifically, each iteration records the collisions that occurred in the previous simulation and their types, and the reward proportions are adjusted according to the share of each collision type among all collisions.
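The round-indexed weight schedule stated above can be transcribed directly, as in the sketch below; the phase boundaries and weights are those given in the preceding paragraph, while the data layout itself is an assumed convention.

```python
# (last round of the phase, reward weights): environment, speed, lane-center deviation, collision.
REWARD_SCHEDULE = [
    (460,  dict(env=1.0, speed=0.0,  lane=0.1,  collision=1.2)),  # learn lane keeping
    (768,  dict(env=1.0, speed=0.0,  lane=0.4,  collision=1.6)),  # stronger lane/collision penalties
    (1152, dict(env=1.0, speed=0.0,  lane=0.4,  collision=1.8)),  # stronger collision penalty
    (1400, dict(env=1.0, speed=0.4,  lane=0.56, collision=2.9)),  # new scenes + speed reward
]

def weights_for_round(training_round: int) -> dict:
    """Return the reward weights in force at a given training round (illustrative)."""
    for last_round, weights in REWARD_SCHEDULE:
        if training_round <= last_round:
            return weights
    return REWARD_SCHEDULE[-1][1]
```

For instance, `weights_for_round(900)` returns the phase-3 weights used between rounds 769 and 1152.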
In Step 5, the dangerous action recognizer involves two phases, sample collection and training. Referring to Figs. 7 and 8, the sample collection process is as follows:
Step 5.1.1: prepare multiple types of scenes and select one scene to start training;
Step 5.1.2: initialize a PPO policy model and start training the policy model on the selected scene;
Step 5.1.3: record the trajectory of the current episode while it runs;
Step 5.1.4: when a collision occurs, collect the 10 steps before the collision as a negative sample, and randomly collect any 10 consecutive steps of the current trajectory as a positive sample;
Step 5.1.5: train until the number of steps reaches the set total, then select the next scene and repeat;
Step 5.1.6: finish once all scenes have been collected.
Specifically, 7 types of scenes are used here, and one of them is selected to start training. The PPO policy model is initialized and the policy is trained from scratch on the selected scene, with the total number of steps set to 400,000. During each episode the trajectory is recorded, including the observation-action pairs of every step (s1, a1, s2, a2, ...) and the runtime reward values (r1, r2, ...). If a collision occurs, the observation-action-reward tuples of the 10 steps before the collision are recorded and collected as a negative sample; with the trajectory length of that episode denoted m, a random number k is drawn from the interval [1, m-19], and the observation-action-reward tuples of the 10 steps starting from step k are collected as a positive sample. Ten rolling averages are computed from the reward values of each collected sample group and used as the training labels of that group, calculated as follows:
where i denotes the i-th scene and j denotes the j-th step within the 10-step sample. The observation-action-label tuples are stored as the training dataset, and training runs until the number of steps reaches the set total of 400,000. The next scene is then selected and Steps 5.1.2 to 5.1.5 are repeated until all scenes have been collected; the total number of collected samples is about 100,000.
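A minimal sketch of the sample-collection loop is shown below, assuming a gym-style environment and policy interface that the patent does not specify; the 10-step window, the random index range [1, m-19], and the collision-triggered negative window follow the description above, while the rolling-average labels are not computed here.

```python
import random

WINDOW = 10  # steps per collected sample

def collect_samples(env, policy, total_steps=400_000):
    """Collect negative (pre-collision) and positive (ordinary) 10-step samples (illustrative)."""
    samples, steps_done = [], 0
    while steps_done < total_steps:
        trajectory = []                                   # (obs, action, reward) tuples of this episode
        obs, done, collided = env.reset(), False, False
        while not done:
            action = policy.act(obs)
            next_obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, reward))
            obs, collided = next_obs, info.get("collision", False)
            steps_done += 1
        m = len(trajectory)
        if collided and m >= 2 * WINDOW:
            samples.append(("negative", trajectory[-WINDOW:]))              # 10 steps before the collision
            k = random.randint(1, m - 19)                                   # random earlier 10-step window
            samples.append(("positive", trajectory[k - 1:k - 1 + WINDOW]))
    return samples
```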
Referring to Figs. 5 and 6, the dangerous action recognizer model is built on a long short-term memory (LSTM) network. The model contains one LSTM layer and two fully connected layers; the LSTM output passes through the first fully connected layer, a ReLU activation function, and then the second fully connected layer. The training process of the dangerous action recognizer is as follows:
Step 5.2.1: for each group of collected sample data, generate data groups with a sliding window as the model input, using the label of the last step of each group as the model's target label;
Step 5.2.2: use the Adam optimizer and adjust its learning rate with cosine simulated annealing;
Step 5.2.3: use the mean squared error loss as the training loss function and compute the mean squared error between the model output and the target label;
Step 5.2.4: set the relevant model parameters and the number of training rounds, and complete training.
Specifically, each collected sample group contains 10 steps of observation-action-label tuples. The observation is 44-dimensional and the action is 3-dimensional; after concatenating observation and action into a 47-dimensional vector, a sliding window of size 5 is applied to the 10-step data to generate 6 data groups of length 5, each 5-step group serving as a model input with the label of the last step of the group as the target label. The Adam optimizer is used with its learning rate adjusted by cosine simulated annealing, and the mean squared error between the model output and the target label is used as the training loss. The model parameters are set as shown in the following table, and training ends after 100 rounds.
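One possible PyTorch realization of the recognizer described above is sketched below: one LSTM layer followed by two fully connected layers with a ReLU in between, 5-step windows of 47-dimensional inputs, MSE loss, and Adam with cosine annealing. The hidden sizes, learning rate, and full-batch training loop are placeholder assumptions, since the parameter table is not reproduced in this text.

```python
import torch
import torch.nn as nn

class DangerRecognizer(nn.Module):
    """LSTM plus two fully connected layers; scores the danger of a 5-step observation-action window."""
    def __init__(self, input_dim=47, hidden_dim=64, fc_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, fc_dim)
        self.fc2 = nn.Linear(fc_dim, 1)

    def forward(self, x):                      # x: (batch, 5, 47)
        out, _ = self.lstm(x)
        h = torch.relu(self.fc1(out[:, -1]))   # last time step -> FC -> ReLU -> FC
        return self.fc2(h).squeeze(-1)

def train_recognizer(windows, labels, epochs=100, lr=1e-3):
    """windows: (N, 5, 47) float tensor; labels: (N,) float tensor. Illustrative training loop."""
    model = DangerRecognizer()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(windows), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
    return model
```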
The dangerous action recognizer improves the vehicle's safety: it predicts the danger level from the actions output by the reinforcement learning model and the environment observation information, and takes actions such as emergency avoidance and emergency adjustment according to that level. The recognizer involves two phases, sample collection and training. In the sample collection phase, a large number of safe and dangerous samples are gathered while progressively training the policy in the simulator environment, and sample labels are computed in preparation for training the recognizer. The samples must be collected across multiple scenes and policies of different quality, and the numbers of positive and negative samples must be balanced. In the training phase, the collected sample data are used for training until it is finished.
Referring to Fig. 4, the rule constrainer in Step 5 is written from human knowledge and empirical statistics from simulation experiments and is used to restrict the ego vehicle's behavior in certain specific situations. It mainly includes knowledge rules for closest-distance protection, knowledge rules at intersections, knowledge rules before sharp bends, and a correction rule for long dwelling when the ego vehicle has no nearby vehicles. Different scenes are judged from the environment observation data to decide the ego speed-limit rules. Examples of the rule constrainer are given below.
In the rule constrainer, the first part consists mainly of knowledge rules based on closest-distance protection. The three environment vehicles closest to the ego vehicle are first selected; when an environment vehicle is less than 20 meters from the ego vehicle, the TDTC conditions are evaluated, the danger level is judged from the TDTC and the environment vehicle's speed, and the maximum ego speed is limited accordingly. The specific rules are as follows (a code transcription of a subset is sketched after the list):
1. -2 < TDTC < 1, d/v_cv < 1, v_cv ≥ 30: limit the maximum ego speed to 5, where d is the distance between the ego vehicle and the conflicting vehicle and v_cv is the conflicting vehicle's speed;
2. 0 < TDTC < 1.2: limit the maximum ego speed to 2;
3. -1.2 < TDTC < 0: limit the maximum ego speed to 0.1 × the planned speed;
4. 1.2 ≤ TDTC < 3: limit the maximum ego speed to 15 + 20(TDTC - 1.2)/1.8;
5. -1.8 < TDTC ≤ -1.2, v_cv > 20: limit the maximum ego speed to 0.1 + 1.4(TDTC - 1.2)/0.6;
6. -3 < TDTC ≤ -1.8, v_cv > 20: limit the maximum ego speed to 1.4 + 6(TDTC - 1.8)/1.2;
7. -3 < TDTC ≤ -1.2, v_cv ≤ 20: limit the maximum ego speed to 5 + 15(TDTC - 1.2)/1.8;
8. 3 ≤ TDTC < 7: limit the maximum ego speed to 30 + 20(TDTC - 3)/4;
9. -7 < TDTC ≤ 3: limit the maximum ego speed to 30 + 20(TDTC - 3)/4;
10. d < 4, v_cv > 5: limit the maximum ego speed to 0;
11. d < 5, v_cv > 13: limit the maximum ego speed to 2;
12. d < 7, v_cv > 20: limit the maximum ego speed to 5.
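As an illustration, a subset of the closest-distance rules above can be transcribed as an ordered list of guards that clamp the planned speed; the "first matching rule wins" policy and the fallback to the planned speed are assumptions about how the rules are combined.

```python
def closest_distance_speed_cap(tdtc, d, v_cv, planned_speed):
    """Return a maximum ego speed from a few of the closest-distance rules (illustrative subset)."""
    rules = [
        (lambda: v_cv >= 30 and -2 < tdtc < 1 and d / v_cv < 1, lambda: 5.0),                         # rule 1
        (lambda: 0 < tdtc < 1.2,                                lambda: 2.0),                         # rule 2
        (lambda: -1.2 < tdtc < 0,                               lambda: 0.1 * planned_speed),         # rule 3
        (lambda: 1.2 <= tdtc < 3,                               lambda: 15 + 20 * (tdtc - 1.2) / 1.8),# rule 4
        (lambda: d < 4 and v_cv > 5,                            lambda: 0.0),                         # rule 10
    ]
    for condition, cap in rules:
        if condition():
            return cap()
    return planned_speed  # no rule fires: keep the planned speed
```

For example, `closest_distance_speed_cap(2.0, 12.0, 10.0, 40.0)` falls under rule 4 and returns 15 + 20 × 0.8/1.8 ≈ 23.9.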
The second part of the rule constrainer consists of knowledge rules at intersections. These rules perform collision avoidance with vehicles traveling in the lateral direction while the ego vehicle passes through an intersection: rectangular judgment regions of different sizes are set in front of the ego vehicle, each region is checked for environment vehicles, and the ego speed is limited by different amounts depending on the size of the region in which an environment vehicle is detected. In addition, the size of the rectangular region is corrected according to whether the ego vehicle is turning, which is obtained from the heading-angle deviation between the ego vehicle and the waypoints ahead. The specific rules are as follows (a sketch of the region check follows the list):
1. A non-stationary environment vehicle is present within a longitudinal distance of 5.5 in front of the ego vehicle and a lateral distance of 10 on either side, and it satisfies one of TDTC < 8 or v_cv < 4: limit the maximum ego speed to 0;
2. A non-stationary environment vehicle is present within a longitudinal distance of 7 in front of the ego vehicle and a lateral distance of 10 on either side, and it satisfies one of TDTC < 8 or v_cv < 10: limit the maximum ego speed to 5;
3. A non-stationary environment vehicle is present within a longitudinal distance of 9 in front of the ego vehicle and a lateral distance of 10 on either side, and it satisfies one of TDTC < 8 or v_cv < 12: limit the maximum ego speed to 7;
4. The variance of the heading angle between the ego vehicle and the waypoints ahead is greater than 0.16, i.e., the ego vehicle is turning, a non-stationary environment vehicle is present within a longitudinal distance of 9 in front of the ego vehicle and a lateral distance of 9 on either side, and it satisfies one of TDTC < 8 or v_cv < 4: limit the maximum ego speed to 0;
5. The variance of the heading angle between the ego vehicle and the waypoints ahead is greater than 0.16, i.e., the ego vehicle is turning, a non-stationary environment vehicle is present within a longitudinal distance of 10.5 in front of the ego vehicle and a lateral distance of 9 on either side, and it satisfies one of TDTC < 8 or v_cv < 4: limit the maximum ego speed to 7.
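The rectangular judgment regions can be checked with simple ego-frame coordinate comparisons, as sketched below; the vehicle representation and the combination of the base and turning regions by taking the smallest applicable cap are assumptions.

```python
def intersection_speed_cap(env_vehicles, ego_is_turning):
    """Apply the intersection region rules; return a speed cap, or None if no rule fires (illustrative)."""
    # (longitudinal reach, lateral half-width, TDTC bound, speed bound, resulting cap), as listed above.
    regions = [(5.5, 10.0, 8.0, 4.0, 0.0),
               (7.0, 10.0, 8.0, 10.0, 5.0),
               (9.0, 10.0, 8.0, 12.0, 7.0)]
    if ego_is_turning:  # heading-angle variance > 0.16
        regions += [(9.0, 9.0, 8.0, 4.0, 0.0), (10.5, 9.0, 8.0, 4.0, 7.0)]
    caps = []
    for lon, lat, tdtc_max, v_max, cap in regions:
        for v in env_vehicles:  # v: dict with ego-frame x (forward), y (left), speed, tdtc, stationary flag
            inside = 0.0 <= v["x"] <= lon and abs(v["y"]) <= lat
            if inside and not v["stationary"] and (v["tdtc"] < tdtc_max or v["speed"] < v_max):
                caps.append(cap)
    return min(caps) if caps else None
```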
The third part of the rule constrainer is the correction rule for long dwelling when the ego vehicle has no nearby vehicles, which is:
1. No environment vehicle is within a range of 30 of the ego vehicle, and all conflicting vehicles satisfy v_cv < 5 and |TDTC| > 9: accelerate the ego vehicle manually by setting the minimum throttle control value to 0.3.
The fourth part of the rule constrainer consists of knowledge rules before sharp bends, which limit the speed according to the variance V_h of the heading angle between the ego vehicle and the waypoints ahead. The rules are as follows (a sketch follows the list):
1. V_h > 0.33: limit the maximum ego speed to 8;
2. 0.25 < V_h ≤ 0.33: limit the maximum ego speed to 8 + 10(0.3 - V_h)/0.08;
3. 0.2 < V_h ≤ 0.25: limit the maximum ego speed to 14 + 10(0.25 - V_h)/0.05;
4. 0.15 < V_h ≤ 0.2: limit the maximum ego speed to 22 + 20(0.2 - V_h)/0.05;
5. 0.1 < V_h ≤ 0.15: limit the maximum ego speed to 30 + 20(0.15 - V_h)/0.05;
6. 0.03 < V_h ≤ 0.1: limit the maximum ego speed to 40 + 20(0.1 - V_h)/0.07;
7. 0.01 < V_h ≤ 0.03: limit the maximum ego speed to 60 + 20(0.03 - V_h)/0.02.
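The sharp-bend rules form a piecewise function of the heading-angle variance V_h and can be transcribed directly, as in the sketch below; leaving values of V_h at or below 0.01 uncapped is an assumption.

```python
def sharp_bend_speed_cap(v_h):
    """Maximum ego speed before a sharp bend as a piecewise function of the heading variance V_h."""
    if v_h > 0.33:
        return 8.0
    if 0.25 < v_h <= 0.33:
        return 8 + 10 * (0.3 - v_h) / 0.08
    if 0.2 < v_h <= 0.25:
        return 14 + 10 * (0.25 - v_h) / 0.05
    if 0.15 < v_h <= 0.2:
        return 22 + 20 * (0.2 - v_h) / 0.05
    if 0.1 < v_h <= 0.15:
        return 30 + 20 * (0.15 - v_h) / 0.05
    if 0.03 < v_h <= 0.1:
        return 40 + 20 * (0.1 - v_h) / 0.07
    if 0.01 < v_h <= 0.03:
        return 60 + 20 * (0.03 - v_h) / 0.02
    return None  # V_h <= 0.01: no sharp-bend limit
```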
The main role of the rule constrainer above is to judge whether the environment is dangerous and to bound the magnitude of the actions output by reinforcement learning. This is mainly reflected in limits on the planned driving speed and, in the control part, in braking behavior; it also provides auxiliary acceleration when the road ahead is safe but the speed output by the reinforcement learning model is too low, ensuring both driving safety and efficiency.
In summary, the present invention provides a solution for safe reinforcement learning in multi-target scenes. The technology can be applied to fields such as intelligent vehicle driver assistance and driverless driving. Compared with traditional fully end-to-end schemes and fully rule-based schemes, it offers a new hybrid approach that combines the advantages of both to achieve highly safe, intelligent, and efficient driving in complex and varied scenes; the technology therefore has high promotion value.
The above description covers only some preferred embodiments of the present disclosure and explains the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the technical features above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example technical solutions formed by replacing the above features with technical features of similar functions disclosed in (but not limited to) the embodiments of the present disclosure.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210370991.7A CN114701517A (en) | 2022-04-07 | 2022-04-07 | Multi-target complex traffic scene automatic driving solution based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210370991.7A CN114701517A (en) | 2022-04-07 | 2022-04-07 | Multi-target complex traffic scene automatic driving solution based on reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114701517A true CN114701517A (en) | 2022-07-05 |
Family
ID=82171815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210370991.7A Pending CN114701517A (en) | 2022-04-07 | 2022-04-07 | Multi-target complex traffic scene automatic driving solution based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114701517A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116361472A (en) * | 2023-05-02 | 2023-06-30 | 周维 | Public opinion big data analysis system for social network comment hot events |
- 2022
  - 2022-04-07: Application CN202210370991.7A filed in China; published as CN114701517A (status: Pending)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116361472A (en) * | 2023-05-02 | 2023-06-30 | 周维 | Public opinion big data analysis system for social network comment hot events |
CN116361472B (en) * | 2023-05-02 | 2024-05-03 | 脉讯在线(北京)信息技术有限公司 | Method for analyzing public opinion big data of social network comment hot event |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111222630B (en) | A Learning Method for Autonomous Driving Rules Based on Deep Reinforcement Learning | |
CN110745136B (en) | A driving adaptive control method | |
CA3065617C (en) | Method for predicting car-following behavior under apollo platform | |
WO2022052406A1 (en) | Automatic driving training method, apparatus and device, and medium | |
CN115056798B (en) | Automatic driving vehicle lane change behavior vehicle-road collaborative decision algorithm based on Bayesian game | |
CN114312830B (en) | Intelligent vehicle coupling decision model and method considering dangerous driving conditions | |
CN114153213A (en) | A deep reinforcement learning intelligent vehicle behavior decision-making method based on path planning | |
CN113581182B (en) | Automatic driving vehicle lane change track planning method and system based on reinforcement learning | |
CN115062202B (en) | Prediction method, device, equipment and storage medium for driving behavior intention and track | |
Gu et al. | Integrated eco-driving automation of intelligent vehicles in multi-lane scenario via model-accelerated reinforcement learning | |
CN114919578A (en) | Intelligent vehicle behavior decision method, planning method, system and storage medium | |
CN113901718A (en) | Deep reinforcement learning-based driving collision avoidance optimization method in following state | |
CN112249008A (en) | Unmanned automobile early warning method aiming at complex dynamic environment | |
Lodhi et al. | Autonomous vehicular overtaking maneuver: A survey and taxonomy | |
CN115257789A (en) | Decision-making method for side anti-collision driving of commercial vehicle in urban low-speed environment | |
CN117227755A (en) | Automatic driving decision method and system based on reinforcement learning under complex traffic scene | |
CN117197784A (en) | Automatic driving behavior decision and model training method, system, equipment and medium | |
CN116639124A (en) | Automatic driving vehicle lane changing method based on double-layer deep reinforcement learning | |
CN115169951A (en) | Multi-feature-fused automatic driving course reinforcement learning training method | |
CN111368465A (en) | Unmanned decision-making method based on ID3 decision tree | |
Fan et al. | Deep reinforcement learning based integrated eco-driving strategy for connected and automated electric vehicles in complex urban scenarios | |
CN114701517A (en) | Multi-target complex traffic scene automatic driving solution based on reinforcement learning | |
CN118212808B (en) | Method, system and equipment for planning traffic decision of signalless intersection | |
CN114117944B (en) | Model updating method, device, equipment and readable storage medium | |
CN114954498A (en) | Reinforcement learning lane changing behavior planning method and system based on imitation learning initialization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||