
CN114580539A - A vehicle driving strategy processing method and device - Google Patents


Info

Publication number
CN114580539A
Authority
CN
China
Prior art keywords
value
strategy
probability
driving
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210212061.9A
Other languages
Chinese (zh)
Other versions
CN114580539B (en)
Inventor
徐鑫
张亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Kunpeng Jiangsu Technology Co Ltd
Original Assignee
Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority to CN202210212061.9A
Priority claimed from CN202210212061.9A
Publication of CN114580539A
Application granted
Publication of CN114580539B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/06 - Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 - Operations research, analysis or management
    • G06Q10/0637 - Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 - Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 - Business processes related to the transportation industry

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Strategic Management (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Educational Administration (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Primary Health Care (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a vehicle driving strategy processing method and device, relating to the field of intelligent driving. A specific implementation of the method includes: collecting the driving trajectories of a plurality of vehicles in different states, and grouping the driving trajectories according to the strategy used in the first state of each driving trajectory, to obtain multiple driving trajectory data sets; constructing a value function for each driving trajectory data set to calculate the confidence of each value function, and then calculating a probability activation threshold based on the confidence of each value function; taking the probability that the function value of each value function is greater than a preset value as an evaluation error, and constructing a probability activation function based on the evaluation errors of all the value functions and the probability activation threshold; and adjusting the network parameters of a decision model based on the multiple driving trajectory data sets and the probability activation function, to obtain an adjusted decision model. In this implementation, the driving trajectories are segmented and a probability activation function is constructed, so as to ensure the reliability of the decision model under limited-data conditions.

Description

A vehicle driving strategy processing method and device

Technical Field

The present invention relates to the field of intelligent driving, and in particular to a vehicle driving strategy processing method and device.

Background Art

Self-learning decision-making methods evaluate an autonomous driving strategy from the feedback that the unmanned vehicle obtained in different states and actions in historical data, and adjust the strategy on the basis of that evaluation to adapt to the environment. The uncertainty of the environment state therefore acts directly on the strategy return and drives the strategy adjustment. This approach makes no prior assumptions about the environment model, and the solution process itself places no requirement on the dimensionality of the state space, so it is promising for solving autonomous driving strategies in high-dimensional, uncertain scenarios.

In the course of implementing the present invention, the inventors found that insufficient training data leads to unreliable self-learning strategies. Existing approaches usually assume that the training data can cover all scenarios, but in a high-dimensional, uncertain environment, complete coverage of the scenarios requires a very large amount of training data. The behavior of an insufficiently trained self-learning decision-making method cannot be guaranteed when it encounters a new scenario, which makes it difficult to apply self-learning decision-making methods to autonomous driving in practice.

Summary of the Invention

In view of this, embodiments of the present invention provide a vehicle driving strategy processing method and device, which can at least address the problem in the prior art that insufficient training data leads to insufficiently trained self-learning decision-making methods that are unsuitable for autonomous driving scenarios.

To achieve the above object, according to one aspect of the embodiments of the present invention, a vehicle driving strategy processing method is provided, including:

collecting the driving trajectories of a plurality of vehicles in different states, and grouping the driving trajectories according to the strategy used in the first state of each driving trajectory, to obtain a plurality of driving trajectory data sets;

constructing a value function for each driving trajectory data set to calculate the confidence of each value function, and then calculating a probability activation threshold based on the confidence of each value function;

taking the probability that the function value of each value function is greater than a preset value as an evaluation error, and constructing a probability activation function based on the evaluation errors of all the value functions and the probability activation threshold;

adjusting the network parameters of a decision model based on the plurality of driving trajectory data sets and the probability activation function, to obtain an adjusted decision model, where the adjusted decision model is used to plan the driving strategy according to the current state and driving trajectory of the vehicle while the vehicle is driving.

Optionally, before the grouping of the driving trajectories according to the strategy used in the first state of each driving trajectory, the method further includes:

dividing each driving trajectory into a plurality of driving trajectories of a preset length by means of a sliding window of the preset length.

Optionally, the method further includes: counting the number of driving trajectories in each driving trajectory data set, determining a target driving trajectory data set whose number of driving trajectories is less than or equal to a preset threshold, and setting the evaluation error of the target driving trajectory data set to 0.

Optionally, the strategy is one of a rule strategy and a self-learning strategy, and the calculating of the probability activation threshold based on the confidence of each value function includes:

calculating the probability value at which the expected performance of the self-learning strategy relative to the rule strategy is maximized, and taking the probability value as the value range of the probability activation threshold, where the expected performance denotes the confidence difference.

Optionally, the planning of the driving strategy according to the current state and driving trajectory of the vehicle includes:

in response to a query of the current vehicle state in the state space returning that the state does not exist, determining that the vehicle driving strategy is the rule strategy;

randomly generating a variable and, while the vehicle continues to drive, estimating the rule-strategy performance value by collecting the driving trajectories of the vehicle after the rule strategy is used; if the variable is detected to be greater than the sum of the rule-strategy performance value and 1, triggering a strategy update operation.

To achieve the above object, according to another aspect of the embodiments of the present invention, a vehicle driving strategy processing device is provided, including:

a collection and grouping module, configured to collect the driving trajectories of a plurality of vehicles in different states, and group the driving trajectories according to the strategy used in the first state of each driving trajectory, to obtain a plurality of driving trajectory data sets;

a confidence calculation module, configured to construct a value function for each driving trajectory data set to calculate the confidence of each value function, and then calculate a probability activation threshold based on the confidence of each value function;

a function construction module, configured to take the probability that the function value of each value function is greater than a preset value as an evaluation error, and construct a probability activation function based on the evaluation errors of all the value functions and the probability activation threshold;

a parameter adjustment module, configured to adjust the network parameters of a decision model based on the plurality of driving trajectory data sets and the probability activation function, to obtain an adjusted decision model, where the adjusted decision model is used to plan the driving strategy according to the current state and driving trajectory of the vehicle while the vehicle is driving.

Optionally, the collection and grouping module is further configured to:

divide each driving trajectory into a plurality of driving trajectories of a preset length by means of a sliding window of the preset length.

Optionally, the function construction module is configured to:

count the number of driving trajectories in each driving trajectory data set, determine a target driving trajectory data set whose number of driving trajectories is less than or equal to a preset threshold, and set the evaluation error of the target driving trajectory data set to 0.

Optionally, the strategy is one of a rule strategy and a self-learning strategy, and the confidence calculation module is configured to:

calculate the probability value at which the expected performance of the self-learning strategy relative to the rule strategy is maximized, and take the probability value as the value range of the probability activation threshold, where the expected performance denotes the confidence difference.

Optionally, the device further includes a strategy planning module, configured to:

in response to a query of the current vehicle state in the state space returning that the state does not exist, determine that the vehicle driving strategy is the rule strategy;

randomly generate a variable and, while the vehicle continues to drive, estimate the rule-strategy performance value by collecting the driving trajectories of the vehicle after the rule strategy is used; if the variable is detected to be greater than the sum of the rule-strategy performance value and 1, trigger a strategy update operation.

To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device for vehicle driving strategy processing is provided.

The electronic device of the embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any one of the vehicle driving strategy processing methods described above.

To achieve the above object, according to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided, on which a computer program is stored; when the program is executed by a processor, it implements any one of the vehicle driving strategy processing methods described above.

According to the solution provided by the present invention, an embodiment of the above invention has the following advantages or beneficial effects: each driving trajectory is segmented to increase the amount of training data as much as possible, and a probability activation function is constructed to guarantee that activating the self-learning strategy reliably improves on the performance of the rule strategy, thereby ensuring the reliability of the decision model under limited-data conditions.

Further effects of the above non-conventional alternatives will be described below in conjunction with the specific embodiments.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the present invention and do not constitute an undue limitation of the present invention. In the drawings:

Fig. 1 is a schematic diagram of the main flow of a vehicle driving strategy processing method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the relationship between states, strategies and actions;

Fig. 3 is a schematic diagram of the influence of the probability activation threshold on hybrid decision-making;

Fig. 4 is a schematic diagram of the main modules of a vehicle driving strategy processing device according to an embodiment of the present invention;

Fig. 5 is a diagram of an exemplary system architecture to which an embodiment of the present invention can be applied;

Fig. 6 is a schematic structural diagram of a computer system suitable for implementing a mobile device or a server according to an embodiment of the present invention.

Detailed Description of the Embodiments

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments of the present invention are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present invention. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.

Referring to Fig. 1, a main flowchart of a vehicle driving strategy processing method provided by an embodiment of the present invention is shown, including the following steps:

S101: collecting the driving trajectories of a plurality of vehicles in different states, and grouping the driving trajectories according to the strategy used in the first state of each driving trajectory, to obtain a plurality of driving trajectory data sets;

S102: constructing a value function for each driving trajectory data set to calculate the confidence of each value function, and then calculating a probability activation threshold based on the confidence of each value function;

S103: taking the probability that the function value of each value function is greater than a preset value as an evaluation error, and constructing a probability activation function based on the evaluation errors of all the value functions and the probability activation threshold;

S104: adjusting the network parameters of a decision model based on the plurality of driving trajectory data sets and the probability activation function, to obtain an adjusted decision model, where the adjusted decision model is used to plan the driving strategy according to the current state and driving trajectory of the vehicle while the vehicle is driving.

In the above embodiment, with regard to step S101: real-world scenes are high-dimensional and highly uncertain, so it is very difficult for an unmanned vehicle to sample all scenes sufficiently. According to statistics, merely validating a single driving strategy in real driving scenarios requires about 14 billion kilometers of driving data, and the amount of data required for strategy training is even larger.

Since a moving vehicle interacts with both the surrounding vehicles and the road, the driving strategy decision must jointly consider the motion state of the ego vehicle, the motion states of the surrounding vehicles and lane passability. The state is the driving state of the vehicle and includes: 1) the current motion state of the ego vehicle, such as speed and acceleration; 2) the current motion states of the surrounding vehicles, such as the longitudinal distance, lateral distance and relative speed with respect to the ego vehicle (if there is no surrounding vehicle, the distance is set to infinity and the relative speed to 0); and 3) lane passability information, such as the left-lane and right-lane marking information, where a passable lane is marked 1 and an impassable lane (for example an oncoming lane, a non-motorized lane or a curb) is marked 0.
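
In code, such a state can be packed into a fixed-length feature vector. The following sketch is for illustration only; the field names, the fixed number of observed surrounding vehicles and the use of float("inf") for missing neighbours are assumptions made here, not details taken from the patent.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SurroundingVehicle:
    longitudinal_distance: float = float("inf")  # no surrounding vehicle: distance set to infinity
    lateral_distance: float = float("inf")
    relative_speed: float = 0.0                  # no surrounding vehicle: relative speed set to 0

@dataclass
class DrivingState:
    ego_speed: float
    ego_acceleration: float
    neighbours: List[SurroundingVehicle] = field(default_factory=list)
    left_lane_passable: int = 0    # 1 = passable, 0 = not passable (oncoming lane, non-motorized lane, curb)
    right_lane_passable: int = 0

    def to_vector(self, max_neighbours: int = 4) -> List[float]:
        """Flatten the driving state into a fixed-length feature vector for the decision model."""
        vec = [self.ego_speed, self.ego_acceleration]
        padded = self.neighbours[:max_neighbours]
        padded += [SurroundingVehicle()] * (max_neighbours - len(padded))
        for nb in padded:
            vec += [nb.longitudinal_distance, nb.lateral_distance, nb.relative_speed]
        vec += [float(self.left_lane_passable), float(self.right_lane_passable)]
        return vec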

For example, when walking through a maze a person may always prefer to turn left; this is a behavior pattern. A vehicle likewise follows certain behavior patterns while driving, such as making a U-turn, changing lanes, turning left, turning right or going straight, and these correspond to different strategies, denoted π, to which the value function is closely related. When the vehicle keeps running under strategy π, the value function indicates the expected return obtainable in the current state s and its relationship to the successor states.

For example, as shown in Fig. 2, hollow circles represent states and solid circles represent actions. In state s the vehicle faces three choices, and what drives the vehicle to choose among the three actions is the strategy π, which represents a probability distribution, that is, the probability of selecting each of the available actions. After action a is selected, the vehicle transitions with probability p to one of several possible new states s', and during the transition it also obtains the corresponding reward r from the environment.

While a vehicle drives on real roads, the length of its driving trajectory is usually not fixed, so the decision model uses a sliding window of a preset length to collect the required trajectories τ_π(s) and thereby obtain driving trajectories in different states, as shown in the following formula:

τ_π(s_i) := {s_i, a_i, ..., s_k, a_k}

k = min(i + H - 1, n)

where s is a state, Z_s is the set of terminal states, and ω_π is the trajectory driven by the vehicle under strategy π. As described for Fig. 2, each trajectory may consist of multiple states and strategies, so the trajectories are divided according to the action (that is, the strategy) used in the first state of the trajectory, yielding two sub-data sets (a rule-strategy data set and a self-learning-strategy data set), as follows:

D_b(s) := {τ(s_1 = s, a_1 = π_b(s_1))}

D_r1(s) := {τ(s_1 = s, a_1 = π_r1(s_1))}

where D_b(s) contains the trajectory data in which, from running state s, the first action uses the rule strategy, and D_r1(s) contains the trajectory data in which, from running state s, the first action uses the self-learning strategy. For example, with 2 states and 2 strategies in total, 4 driving trajectory data sets D(s, a) are obtained: data set 1 (state 1, strategy 1), data set 2 (state 1, strategy 2), data set 3 (state 2, strategy 1) and data set 4 (state 2, strategy 2).
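
A minimal sketch of the sliding-window segmentation and the grouping by the strategy of the first action might look as follows. The representation of a trajectory as a list of (state, action, reward) steps and the policy_of() labelling function are assumptions made for this example, not data structures prescribed by the patent.

from collections import defaultdict

def segment_trajectory(trajectory, H):
    """Cut one recorded trajectory into windows tau_pi(s_i) = {s_i, a_i, ..., s_k, a_k}, k = min(i+H-1, n)."""
    n = len(trajectory)
    windows = []
    for i in range(n):
        k = min(i + H - 1, n - 1)
        windows.append(trajectory[i:k + 1])
    return windows

def group_by_first_strategy(trajectories, H, policy_of):
    """Group windows by (first state, strategy of the first action), giving the data sets D(s, a)."""
    datasets = defaultdict(list)            # key: (state key, "rule" or "self_learning")
    for traj in trajectories:
        for window in segment_trajectory(traj, H):
            s1, a1, _ = window[0]
            datasets[(tuple(s1), policy_of(a1))].append(window)
    return datasets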

For step S102, after the driving trajectory data sets D(s, a) have been generated, the evaluation error of each driving trajectory data set is calculated; the evaluation error denotes the deviation of the estimated value from the true value. For example, if the estimated probability is 90% but the true value is 1, the deviation is 10%.

Define the value function Q_π(s, u_π), which represents the expected performance of the vehicle over a future period when it uses state s and strategy π. When the training data are limited, however, the estimate of the value function contains an error; if a value function containing such an error is used directly to adjust and train the driving strategy model in the decision model, the strategy may be used incorrectly. In fact, the true value function exists only as a concept and cannot be observed directly. When the vehicle motion state can be sampled sufficiently, the deviation between the estimated and true values of the value function is small, so a function can be constructed to express this deviation. The estimation method used here, based on Monte Carlo sampling, requires sufficient sampling of the trajectories under every state and every strategy; however, because of the strong uncertainty of the environment, its high dimensionality and the insufficient amount of data, sufficient sampling cannot be guaranteed everywhere, so this method is used to obtain the confidence of the value function.

This solution preferably uses the Lindeberg-Feller theorem to estimate the confidence of the value function of each driving trajectory data set. The main idea is that the mean of a large number of mutually independent random variables, after appropriate standardization, converges in distribution to a normal distribution, as follows:

Q̂_π(s) = (1 / n_d) · Σ_{i=1..n_d} G_π^(i)(s)

P(Q̃_π(s) ≤ x) ≈ Φ(√n_d · (x - Q̂_π(s)) / σ)

σ² = Var(G_π(s))

where Φ(·) is the cumulative distribution function of the standard normal distribution and G_π(s) is the discounted return of a trajectory. Through the above formula, the distribution of the true value of the value function is described as a cumulative distribution function, and as the number of samples increases, the estimated value gradually approaches the true value; the confidence of the value function can also be obtained from this formula by integration. Define the true value function Q̃_π(s, u_π), and let κ(Q(s, u), λ) denote the probability that the true value function Q̃_π(s, u_π) is greater than λ; κ(Q(s, u), λ) is taken as the evaluation error. Here, κ(Q(s_t, u_b), λ) is calculated from the rule-strategy data set D(s_t, a_b), and κ(Q(s_t, u_r1), λ) is calculated from the self-learning-strategy data set D(s_t, a_r1).
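
Under the normal approximation above, the evaluation error κ(Q(s, u), λ), that is, the probability that the true value exceeds λ, can be estimated from the discounted returns of a data set. The following is a sketch under stated assumptions (sample mean and sample variance of the returns, a plain normal approximation, and the n_thres = 30 cut-off mentioned later in the text); it is not a reproduction of the patent's exact estimator.

import math
from statistics import mean, pvariance

def discounted_return(window, gamma=0.95):
    """G_pi(s) for one trajectory window given as a list of (state, action, reward) steps."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(window))

def kappa(dataset, lam, gamma=0.95, n_thres=30):
    """Estimate kappa(Q, lambda) = P(true value > lambda); set to 0 when the data set is too small."""
    n_d = len(dataset)
    if n_d <= n_thres:
        return 0.0                                    # insufficient sampling: evaluation error set to 0
    returns = [discounted_return(w, gamma) for w in dataset]
    q_hat = mean(returns)
    sigma = math.sqrt(pvariance(returns)) or 1e-8     # sigma^2 = Var(G_pi(s))
    z = (lam - q_hat) * math.sqrt(n_d) / sigma
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2)))   # 1 - Phi(z)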

For step S103, after the evaluation error of each driving trajectory data set has been generated, an activation function that takes the value-function error into account must first be constructed; in this solution it is called the probability activation function. Considering that in practical applications the self-learning strategy is updated iteratively during training, the estimate of the true self-learning value function obtained from the trajectory data of the self-learning strategy is conservative, that is, the calculated value of the probability activation function is larger than the true value; however, this does not undermine the reliability with which hybrid decision-making improves strategy performance.

Design of the probability activation function for hybrid decision-making: in the case of limited data, the expected performance of a strategy may not equal the value function Q(s, u_π) estimated by the self-learning model. When the unmanned vehicle drives in some scenarios, the following situation can arise:

Q(s, u_r1) > Q(s, u_b)

E[G_πr1(s)] < E[G_πb(s)]

In the above formulas, for the rule strategy π_b and the self-learning strategy π_r1, if the probability activation function is activated by comparing the self-learning value function Q(s, u_r1) with the rule-strategy value function Q(s, u_b), the self-learning strategy actually performs worse than the rule strategy; the self-learning strategy is then falsely activated. Since the value function cannot be estimated accurately, the probability activation function is expressed through the expectation that the self-learning strategy outperforms the rule strategy:

P( Q̃(s, u_r1) > Q̃(s, u_b) ) ≥ c_thres

where P(·) denotes the probability of an event and c_thres is the probability activation threshold of the hybrid decision, which has to be derived from the theoretically computed confidence. Based on the formulas above, the probability activation function is related to the sampled data sets through the evaluation errors κ(Q(s, u), λ) computed on them; when the number of samples is insufficient, that is, when n_d ≤ n_thres = 30, κ(Q(s, u), λ) is set to 0. The probability activation threshold c_thres is designed so that, given the current observations and training data, the expected improvement of the self-learning strategy over the rule strategy is maximized; combining this design principle with the confidence estimate of the value functions yields the value of the threshold.

Ideally, the expected performance of each of the two strategies would be a constant value; however, because of the estimation error, the expected performance of the two cannot be estimated accurately. Designing the activation function in probabilistic form therefore introduces the strategy-evaluation process into the activation function, absorbs the evaluation error caused by insufficient training, avoids false activation of the self-learning strategy caused by insufficient sampling under limited data, and achieves highly reliable decision-making under limited-data conditions.
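
Putting the pieces together, the probability activation function can be sketched as a comparison of the two estimated value distributions against the threshold c_thres. The normal-approximation comparison of the two sample means below is an assumption chosen for illustration; the patent itself only fixes the behavior at the boundaries (never activate below the sample-count threshold, activate only when the probability of improvement reaches c_thres).

import math
from statistics import mean, pvariance

def prob_self_learning_better(returns_rule, returns_self):
    """Approximate P(true value of the self-learning strategy > true value of the rule strategy)."""
    diff = mean(returns_self) - mean(returns_rule)
    var = pvariance(returns_self) / len(returns_self) + pvariance(returns_rule) / len(returns_rule)
    z = diff / math.sqrt(var + 1e-12)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))          # Phi(z)

def probability_activation(returns_rule, returns_self, c_thres=0.5, n_thres=30):
    """Return True if the self-learning strategy may be activated in this state."""
    if len(returns_rule) <= n_thres or len(returns_self) <= n_thres:
        return False                                          # insufficient sampling: keep the rule strategy
    return prob_self_learning_better(returns_rule, returns_self) >= c_thres

The inputs are the discounted returns of the trajectories in D(s, a_b) and D(s, a_r1), for example computed with discounted_return from the earlier sketch.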

As shown in Fig. 3, the probability activation threshold c_thres realizes the transition from the rule strategy to a fully self-learning strategy. Assume the threshold takes values between 0 and 1. When the threshold is 0, the probability activation condition always holds, so the self-learning strategy is always used to drive the vehicle; when the threshold is 1, the condition never holds, so the rule-based strategy is always used. For values between 0 and 1, the hybrid decision framework is used for driving; at a threshold of 0.5 the expected performance improvement is highest, which guarantees the reliability of the improvement of the hybrid strategy over the rule strategy. A higher threshold makes the strategy improvement more conservative, while a lower threshold may make the strategy improvement unreliable.

For step S104, after the probability activation threshold c_thres has been determined under limited-data conditions, the decision model is designed on the basis of c_thres; this solution builds the decision model with reference to the deep Q-learning algorithm. The Q value is computed iteratively with the Bellman equation, whose update rule is shown below, where α is the learning rate:

Q(s_k, a_k) ← Q(s_k, a_k) + α[ r(s_{k+1}) + γ·max_a Q(s_{k+1}, a) - Q(s_k, a_k) ]

In the above formula, the Q function is updated according to the reward produced by the state and the action strategy. In the formulation of the autonomous driving decision problem, however, the state space is usually continuous, so a particular state cannot be visited repeatedly. For this reason, the storage and update of the Q function are carried out by a neural network; this framework is known as the deep Q-learning method, and its update rule is as follows:

L(θ) = ( Q(s_k, a_k; θ) - Q⁺(s_k, a_k) )²

Q⁺(s_k, a_k) = r(s_{k+1}) + γ·max_a Q(s_{k+1}, a; θ⁻)

where θ and θ⁻ denote the Q-network parameters currently being adjusted and the historically stored Q-network parameters, respectively. θ⁻ is synchronized with θ after every n_up = 100 iterations, which makes the network update more stable. (Q(s_k, a_k; θ) - Q⁺(s_k, a_k))² is the training error, and the value of Q⁺(s_k, a_k) is computed with the Bellman equation. Since the value of the Q function is updated on the basis of the data set D(s, a), the data sampling determines the quality of the strategy update. In this algorithm, the self-learning strategy update is divided into two stages: a rule-strategy verification stage and a self-learning exploration stage.
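
The tabular Bellman update and the deep Q-learning target described above can be illustrated with a short sketch. The network interface (a callable q(s, a, params)) and the use of a plain squared-error loss are assumptions made for this example; any standard deep Q-learning implementation follows the same pattern.

def bellman_update(Q, s_k, a_k, r_next, s_next, actions, alpha=0.1, gamma=0.95):
    """Tabular form: Q(s_k,a_k) <- Q(s_k,a_k) + alpha*[r(s_{k+1}) + gamma*max_a Q(s_{k+1},a) - Q(s_k,a_k)]."""
    target = r_next + gamma * max(Q.get((s_next, a), 0.0) for a in actions)
    Q[(s_k, a_k)] = Q.get((s_k, a_k), 0.0) + alpha * (target - Q.get((s_k, a_k), 0.0))

def dqn_training_error(q, theta, theta_minus, s_k, a_k, r_next, s_next, actions, gamma=0.95):
    """Deep form: Q+(s_k,a_k) = r(s_{k+1}) + gamma*max_a Q(s_{k+1},a; theta-), loss = (Q(s_k,a_k; theta) - Q+)^2."""
    q_plus = r_next + gamma * max(q(s_next, a, theta_minus) for a in actions)
    return (q(s_k, a_k, theta) - q_plus) ** 2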

The following algorithm describes the self-learning exploration and strategy generation process, where Q̂(s_1, a_b) denotes the strategy performance estimated from the dynamically updated data set D(s_1, a_b) and n_d(D(s, a_b)) denotes the number of trajectories in the data set:

i = 0;
ω = 0;
D(s, a_b) = 0;
D(s, a) = 0;
while i ≤ total number of training iterations do
    i = i + 1;
    if s_t ∈ Z_s then
        randomly sample ξ ← U(0, 1);
        if n_d(D(s, a_b)) ≥ n_thres and ξ > 1 + Q̂(s, a_b) then
            generate an exploration action a_explore;
            interact with the environment using action a_explore;
        else
            interact with the environment using action a_b;
        extract τ = {s_1, a_1, r_1, ..., s_H, a_H, r_H} from ω;
        if a_1 = π_b(s_1) then
            add trajectory τ to D(s_1, a_b);
        else
            add trajectory τ to D(s_1, a = a_1);
        delete (s_1, a_1) from ω
    else
        for j = 1 to n_ω do
            extract τ = {s_j, a_j, ..., s_k, a_k} from ω;
            if a_j = π_b(s_j) then
                add trajectory τ to D(s_j, a_b);
            else
                add trajectory τ to D(s_j, a = a_j);
        ω = 0;
        for j = 1 to H do
            interact with the environment using action a_b

Specifically, in the first stage the unmanned vehicle drives using only the rule strategy in order to obtain updates of Q(s, a_b); in the second stage the self-learning strategy actively tries actions that differ from the rule strategy in order to obtain a better strategy. Switching between the two stages is achieved by suppressing the self-learning exploration range. In the algorithm framework proposed here, two suppression conditions limit the exploration range: 1) when the rule strategy has not been verified for a sufficiently long time, that is, when n_d(D(s, a_b)) < n_thres, no strategy exploration is carried out; and 2) when the rule strategy already performs well enough, no strategy exploration is carried out. By contrast, the exploration scheme of a conventional self-learning algorithm is usually independent of other strategies.

Since the exploration range usually determines the efficiency of strategy exploration, and a larger exploration range means lower exploration efficiency, the method proposed in this solution also helps to improve exploration efficiency. In addition, this solution preferably uses the ε-greedy method for strategy exploration; other strategy exploration methods are equally feasible in this solution.

The first suppression condition is set because the probability activation function must first estimate the value function of the rule strategy; when the confidence of the rule-strategy value-function estimate has not yet been obtained, the system cannot judge whether the current self-learning strategy can provide an effective improvement, so when an unfamiliar state is encountered, rule-based driving takes priority in order to guarantee driving reliability. The second suppression condition is implemented by randomly generating a variable ξ ~ U(0, 1) once the first condition is satisfied, that is, ξ follows a uniform distribution between 0 and 1. When this variable satisfies the constraint ξ > 1 + Q(s, a_b), the system explores actively.

In this way, the self-learning exploration probability is -Q(s, a_b). This probability is related to the rule-strategy value function: the larger the rule-strategy value function, the weaker the motivation to explore. Under a reward function oriented to probability constraints, the Q value represents the probability that the strategy satisfies the constraint, so this approach means that the more likely the rule-based strategy is to lead to danger, the more motivated the system is to explore actively. The method can therefore improve exploration efficiency while guaranteeing reliability.
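
The two exploration-suppression conditions can be combined into a small gate. The sketch below assumes Q values in [-1, 0], so that -Q(s, a_b) is a valid exploration probability, matching the statement above that the exploration probability equals -Q(s, a_b); the function name and signature are illustrative.

import random

def should_explore(n_d_rule, q_rule, n_thres=30):
    """Explore only if the rule strategy has been verified long enough and xi > 1 + Q(s, a_b).

    With xi ~ U(0, 1) and Q(s, a_b) in [-1, 0], the exploration probability is -Q(s, a_b):
    the worse the rule strategy looks, the more motivated the system is to explore.
    """
    if n_d_rule < n_thres:            # condition 1: rule strategy not verified for long enough
        return False
    xi = random.uniform(0.0, 1.0)
    return xi > 1.0 + q_rule          # condition 2: triggered with probability -Q(s, a_b)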

In addition, since the state space is continuous, the system uses a fixed sampling window to obtain the data set corresponding to a given state and to count the corresponding number of samples. The scale of the sampling window is b·(max(D(s)) - min(D(s))), b ∈ (0, 1); the value of b has to be chosen from experience, and 0.1 is preferred in this solution.
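
A one-dimensional sketch of this fixed sampling window is shown below, assuming the data sets are keyed by a scalar state value; the keying scheme is an assumption made for this example.

def select_window(datasets, state_value, b=0.1):
    """Collect the trajectories whose state key lies inside the sampling window around state_value."""
    keys = sorted(datasets.keys())
    if not keys:
        return []
    width = b * (max(keys) - min(keys))       # window scale: b * (max(D(s)) - min(D(s)))
    selected = []
    for key in keys:
        if abs(key - state_value) <= width / 2:
            selected.extend(datasets[key])
    return selected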

When the unmanned vehicle visits a certain state s_t while driving, the decision model under limited-data conditions is generated according to the probability activation function:

u(s_t) = π_r1(s_t),  if n_d(D(s_t, a_b)) ≥ n_thres and P( Q̃(s_t, u_r1) > Q̃(s_t, u_b) ) ≥ c_thres

u(s_t) = π_b(s_t),   otherwise

This decision model can then be used to plan the driving strategy according to the current state and driving trajectory of the vehicle.
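
At run time the adjusted decision model reduces to a simple selection rule built from the pieces above. This sketch is illustrative: the policy objects pi_rule and pi_self, the data-set lookup and the reuse of probability_activation from the earlier sketch are assumptions made here.

def plan_driving_action(s_t, pi_rule, pi_self, lookup_returns, is_activated,
                        c_thres=0.5, n_thres=30):
    """Hybrid decision under limited data: use the self-learning strategy only when it is
    sufficiently sampled and its probability of improving on the rule strategy reaches c_thres."""
    returns_rule, returns_self = lookup_returns(s_t)   # discounted returns of D(s_t, a_b) and D(s_t, a_r1)
    if is_activated(returns_rule, returns_self, c_thres=c_thres, n_thres=n_thres):
        return pi_self(s_t)                            # self-learning strategy
    return pi_rule(s_t)                                # fall back to the rule strategy

# Example: plan_driving_action(s_t, pi_rule, pi_self, lookup_returns, probability_activation)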

Under the condition of limited training data, the method provided by the embodiment of the present invention can still guarantee that the strategy is determined effectively:

1. A sliding window of a preset length is used to divide each driving trajectory into multiple driving trajectories of the preset length, so as to increase the amount of training data as much as possible;

2. A probability activation function is constructed to guarantee that, when the self-learning strategy is activated, it reliably improves on the performance of the rule strategy, thereby achieving a reliable decision model with limited data;

3. A self-learning decision system based on the deep Q-learning framework is designed; it takes the rule strategy into account during strategy search and also considers the probability activation function's requirement on the confidence of the strategy value function, so that strategy exploration by the self-learning model is suppressed when the confidence of the rule-strategy value function does not yet meet the requirement or when the rule strategy already performs well.

Referring to Fig. 4, a schematic diagram of the main modules of a vehicle driving strategy processing device 400 provided by an embodiment of the present invention is shown, including:

a collection and grouping module 401, configured to collect the driving trajectories of a plurality of vehicles in different states, and group the driving trajectories according to the strategy used in the first state of each driving trajectory, to obtain a plurality of driving trajectory data sets;

a confidence calculation module 402, configured to construct a value function for each driving trajectory data set to calculate the confidence of each value function, and then calculate a probability activation threshold based on the confidence of each value function;

a function construction module 403, configured to take the probability that the function value of each value function is greater than a preset value as an evaluation error, and construct a probability activation function based on the evaluation errors of all the value functions and the probability activation threshold;

a parameter adjustment module 404, configured to adjust the network parameters of a decision model based on the plurality of driving trajectory data sets and the probability activation function, to obtain an adjusted decision model, where the adjusted decision model is used to plan the driving strategy according to the current state and driving trajectory of the vehicle while the vehicle is driving.

In the device of the embodiment of the present invention, the collection and grouping module 401 is further configured to:

divide each driving trajectory into a plurality of driving trajectories of a preset length by means of a sliding window of the preset length.

In the device of the embodiment of the present invention, the function construction module 403 is configured to:

count the number of driving trajectories in each driving trajectory data set, determine a target driving trajectory data set whose number of driving trajectories is less than or equal to a preset threshold, and set the evaluation error of the target driving trajectory data set to 0.

In the device of the embodiment of the present invention, the strategy is one of a rule strategy and a self-learning strategy, and the confidence calculation module 402 is configured to:

calculate the probability value at which the expected performance of the self-learning strategy relative to the rule strategy is maximized, and take the probability value as the value range of the probability activation threshold, where the expected performance denotes the confidence difference.

The device of the embodiment of the present invention further includes a strategy planning module, configured to:

in response to a query of the current vehicle state in the state space returning that the state does not exist, determine that the vehicle driving strategy is the rule strategy;

randomly generate a variable and, while the vehicle continues to drive, estimate the rule-strategy performance value by collecting the driving trajectories of the vehicle after the rule strategy is used; if the variable is detected to be greater than the sum of the rule-strategy performance value and 1, trigger a strategy update operation.

In addition, the specific implementation of the device in the embodiment of the present invention has already been described in detail in the method above, so the repeated content is not described again here.

Fig. 5 shows an exemplary system architecture 500 to which an embodiment of the present invention can be applied, including terminal devices 501, 502 and 503, a network 504 and a server 505 (merely an example).

The terminal devices 501, 502 and 503 may be various electronic devices that have a display screen and support web browsing, with various communication client applications installed; a user may use the terminal devices 501, 502 and 503 to interact with the server 505 through the network 504, to receive or send messages and the like.

The network 504 is the medium that provides a communication link between the terminal devices 501, 502, 503 and the server 505. The network 504 may include various connection types, such as wired or wireless communication links or optical fiber cables.

The server 505 may be a server that provides various services. It should be noted that the method provided by the embodiment of the present invention is generally executed by the server 505, and accordingly the device is generally arranged in the server 505. It should be understood that the numbers of terminal devices, networks and servers in Fig. 5 are merely illustrative; there may be any number of terminal devices, networks and servers according to implementation needs.

Referring now to Fig. 6, a schematic structural diagram of a computer system 600 suitable for implementing a terminal device according to an embodiment of the present invention is shown. The terminal device shown in Fig. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602 and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse and the like; an output portion 607 including a cathode-ray tube (CRT), a liquid-crystal display (LCD) and the like, and a loudspeaker; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage portion 608 as needed.

In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the disclosed embodiments of the present invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609 and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-described functions defined in the system of the present invention are performed.

It should be noted that the computer-readable medium shown in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, a processor may be described as including a collection and grouping module, a confidence calculation module, a function construction module, and a parameter adjustment module. The names of these modules do not, in some cases, limit the modules themselves; for example, the parameter adjustment module may also be described as a "module for adjusting parameters and using the model".
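As an illustration only (not part of the claimed embodiments), the four modules named above could be organized roughly as follows. All class names, method names, and the simple inverse-variance confidence score are hypothetical placeholders; the disclosure does not prescribe a concrete API.

```python
from collections import defaultdict

class VehicleDrivingStrategyProcessor:
    """Hypothetical sketch of the four processor modules described above."""

    def collect_and_group(self, trajectories):
        # Collection and grouping module: bucket trajectories by the strategy
        # used in the first state of each trajectory.
        groups = defaultdict(list)
        for traj in trajectories:
            groups[traj["first_state_strategy"]].append(traj)
        return dict(groups)

    def confidence(self, value_estimates):
        # Confidence calculation module: a simple inverse-variance score stands
        # in for the (unspecified) confidence of a fitted value function.
        mean = sum(value_estimates) / len(value_estimates)
        var = sum((v - mean) ** 2 for v in value_estimates) / len(value_estimates)
        return 1.0 / (1.0 + var)

    def probability_activation(self, errors, threshold):
        # Function construction module: gate each per-group evaluation error
        # by the probability activation threshold.
        return [1.0 if e > threshold else 0.0 for e in errors]

    def adjust(self, decision_model, groups, activation):
        # Parameter adjustment module: placeholder for tuning the decision
        # model's network parameters using the activation weights.
        raise NotImplementedError("training loop depends on the decision model")
```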

As another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the device, the device is caused to execute the vehicle driving strategy processing method of the present solution.

The specific embodiments described above do not limit the protection scope of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A vehicle driving strategy processing method, characterized by comprising:
collecting driving trajectories of a plurality of vehicles in different states, and grouping the driving trajectories according to the strategy used in the first state of each driving trajectory to obtain a plurality of driving trajectory data sets;
constructing a value function for each driving trajectory data set to calculate the confidence of each value function, and further calculating a probability activation threshold based on the confidence of each value function;
taking the probability that the function value of each value function is greater than a preset value as an evaluation error, and constructing a probability activation function based on the evaluation errors of all the value functions and the probability activation threshold;
adjusting network parameters of a decision model based on the plurality of driving trajectory data sets and the probability activation function to obtain an adjusted decision model, wherein the adjusted decision model is used for planning the driving strategy according to the current state and the driving trajectory of the vehicle while the vehicle is driving.
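A minimal, hypothetical sketch of the quantities in claim 1: the evaluation error is read as the empirical probability that a group's value-function outputs exceed the preset value, and a sigmoid-style gate stands in for the probability activation function. The Gaussian sample data, the sigmoid form, and the sharpness parameter are assumptions, not taken from the disclosure.

```python
import numpy as np

def evaluation_error(values, preset):
    """Empirical probability that a value function's outputs exceed the preset value."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values > preset))

def probability_activation(errors, threshold, sharpness=10.0):
    """Map per-group evaluation errors to activation weights gated by the threshold.
    The sigmoid form is an assumption; the claim only requires errors + threshold."""
    errors = np.asarray(errors, dtype=float)
    return 1.0 / (1.0 + np.exp(-sharpness * (errors - threshold)))

# Example: three trajectory groups, preset value 0.0
group_values = [np.random.randn(100) + 0.2,
                np.random.randn(100) - 0.1,
                np.random.randn(100) + 0.5]
errors = [evaluation_error(v, preset=0.0) for v in group_values]
weights = probability_activation(errors, threshold=0.5)
```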
2. The method of claim 1, further comprising, prior to said grouping the driving trajectories according to the strategy used in the first state of each driving trajectory:
dividing each driving trajectory into a plurality of driving trajectories of a preset length by using a sliding window of the preset length.
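For illustration only, one way to realize the sliding-window split of claim 2; the window length, step size, and toy trajectory are assumed values, not taken from the disclosure.

```python
def sliding_window_split(trajectory, window_len, step=1):
    """Split one trajectory (a sequence of states) into fixed-length sub-trajectories."""
    return [trajectory[i:i + window_len]
            for i in range(0, len(trajectory) - window_len + 1, step)]

# Example: a 6-state trajectory cut into windows of length 4
print(sliding_window_split("s0 s1 s2 s3 s4 s5".split(), window_len=4))
# [['s0', 's1', 's2', 's3'], ['s1', 's2', 's3', 's4'], ['s2', 's3', 's4', 's5']]
```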
3. The method of claim 1 or 2, further comprising:
counting the number of driving trajectories in each driving trajectory data set, determining a target driving trajectory data set in which the number of driving trajectories is less than or equal to a preset threshold, and setting the evaluation error of the target driving trajectory data set to 0.
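A small illustrative helper for claim 3; the minimum-count threshold is an assumed parameter.

```python
def zero_small_groups(groups, errors, min_count=50):
    """Set the evaluation error to 0 for any trajectory group that is too small to trust."""
    return [0.0 if len(group) <= min_count else error
            for group, error in zip(groups, errors)]
```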
4. The method of claim 1 or 2, wherein the strategy comprises one of a rule strategy and a self-learning strategy, and wherein calculating the probability activation threshold based on the confidence of each value function comprises:
calculating the probability value at which the performance expectation of the self-learning strategy relative to the rule strategy is maximum, and taking the probability value as the value range of the probability activation threshold, wherein the performance expectation represents a confidence difference.
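One hypothetical reading of claim 4: scan candidate probabilities and keep the one that maximizes an expectation of the self-learning strategy's confidence advantage over the rule strategy. The mean-variance form of the expectation, the grid search, and the risk_aversion parameter are assumptions introduced only to make the sketch concrete.

```python
import numpy as np

def activation_threshold(conf_self, conf_rule, risk_aversion=1.0):
    """Probability of trusting the self-learning strategy that maximizes a
    mean-variance model of its expected confidence advantage (an assumption)."""
    diff = np.asarray(conf_self, dtype=float) - np.asarray(conf_rule, dtype=float)
    candidates = np.linspace(0.0, 1.0, 101)
    expectation = candidates * diff.mean() - 0.5 * risk_aversion * candidates**2 * diff.var()
    return float(candidates[np.argmax(expectation)])
```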
5. The method of claim 1, wherein planning the driving strategy according to the current state and the driving trajectory of the vehicle comprises:
in response to the current state of the vehicle not existing in the state space, determining that the vehicle driving strategy is the rule strategy;
randomly generating a variable, estimating a rule strategy representation value by collecting driving trajectories of the vehicle using the rule strategy while the vehicle continues to drive, and triggering a strategy update if the variable is detected to be greater than the sum of the rule strategy representation value and 1.
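A hypothetical sketch of the fallback-and-update logic in claim 5; the uniform random draw, its range, and the mean-return estimate of the representation value are assumptions, not taken from the disclosure.

```python
import random

def choose_strategy(state, state_space):
    """Fall back to the rule strategy when the current state is outside the known state space."""
    return "rule" if state not in state_space else "self_learning"

def should_update_strategy(rule_trajectory_returns):
    """Estimate the rule-strategy representation value from collected rule-strategy
    trajectories and trigger an update when a random draw exceeds it plus 1."""
    representation_value = sum(rule_trajectory_returns) / max(len(rule_trajectory_returns), 1)
    variable = random.uniform(0.0, 2.0)  # range of the random variable is an assumption
    return variable > representation_value + 1
```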
6. A vehicle driving strategy processing apparatus, characterized by comprising:
a collection and grouping module configured to collect driving trajectories of a plurality of vehicles in different states, and group the driving trajectories according to the strategy used in the first state of each driving trajectory to obtain a plurality of driving trajectory data sets;
a confidence calculation module configured to construct a value function for each driving trajectory data set to calculate the confidence of each value function, and further calculate a probability activation threshold based on the confidence of each value function;
a function construction module configured to take the probability that the function value of each value function is greater than a preset value as an evaluation error, and construct a probability activation function based on the evaluation errors of all the value functions and the probability activation threshold;
a parameter adjustment module configured to adjust network parameters of a decision model based on the plurality of driving trajectory data sets and the probability activation function to obtain an adjusted decision model, wherein the adjusted decision model is used for planning the driving strategy according to the current state and the driving trajectory of the vehicle while the vehicle is driving.
7. The apparatus of claim 6, wherein the function construction module is further configured to:
count the number of driving trajectories in each driving trajectory data set, determine a target driving trajectory data set in which the number of driving trajectories is less than or equal to a preset threshold, and set the evaluation error of the target driving trajectory data set to 0.
8. The apparatus of claim 6, wherein the strategy comprises one of a rule strategy and a self-learning strategy, and wherein the confidence calculation module is configured to:
calculate the probability value at which the performance expectation of the self-learning strategy relative to the rule strategy is maximum, and take the probability value as the value range of the probability activation threshold, wherein the performance expectation represents a confidence difference.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 5.
10. A computer-readable medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 5.
CN202210212061.9A 2022-03-04 Vehicle driving strategy processing method and device Active CN114580539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210212061.9A CN114580539B (en) 2022-03-04 Vehicle driving strategy processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210212061.9A CN114580539B (en) 2022-03-04 Vehicle driving strategy processing method and device

Publications (2)

Publication Number Publication Date
CN114580539A true CN114580539A (en) 2022-06-03
CN114580539B CN114580539B (en) 2025-04-18



Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210110484A1 (en) * 2017-08-16 2021-04-15 Mobileye Vision Technologies Ltd. Navigation Based on Liability Constraints
US20210011461A1 (en) * 2018-03-12 2021-01-14 Virginia Tech Intellectual Properties, Inc. Intelligent distribution of data for robotic and autonomous systems
US20200150672A1 (en) * 2018-11-13 2020-05-14 Qualcomm Incorporated Hybrid reinforcement learning for autonomous driving
US20200249674A1 (en) * 2019-02-05 2020-08-06 Nvidia Corporation Combined prediction and path planning for autonomous objects using neural networks
CN113474231A (en) * 2019-02-05 2021-10-01 辉达公司 Combined prediction and path planning for autonomous objects using neural networks
US20200363800A1 (en) * 2019-05-13 2020-11-19 Great Wall Motor Company Limited Decision Making Methods and Systems for Automated Vehicle
US20200363813A1 (en) * 2019-05-15 2020-11-19 Baidu Usa Llc Online agent using reinforcement learning to plan an open space trajectory for autonomous vehicles
CN110989577A (en) * 2019-11-15 2020-04-10 深圳先进技术研究院 Automatic driving decision method and automatic driving device of vehicle
CN112498334A (en) * 2020-12-15 2021-03-16 清华大学 Robust energy management method and system for intelligent network-connected hybrid electric vehicle
CN113386790A (en) * 2021-06-09 2021-09-14 扬州大学 Automatic driving decision-making method for cross-sea bridge road condition
CN113682312A (en) * 2021-09-23 2021-11-23 中汽创智科技有限公司 Autonomous lane changing method and system integrating deep reinforcement learning
CN113879323A (en) * 2021-10-26 2022-01-04 清华大学 Reliable learning autonomous driving decision-making method, system, storage medium and device
CN114021840A (en) * 2021-11-12 2022-02-08 京东鲲鹏(江苏)科技有限公司 Lane-changing strategy generation method and apparatus, computer storage medium, and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Lei et al.: "Research on Obstacle Avoidance Strategies Applied to Intelligent Logistics Vehicles", Highway Traffic Science and Technology (Applied Technology Edition), vol. 14, no. 04, 15 April 2018 (2018-04-15), pages 309 - 313 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994313A (en) * 2023-03-22 2023-04-21 清华大学 Crowd movement modeling method and device based on visiting location clustering
CN117312804A (en) * 2023-11-29 2023-12-29 中国信息通信研究院 Intelligent data perception monitoring method and system
CN117312804B (en) * 2023-11-29 2024-02-13 中国信息通信研究院 Intelligent data perception monitoring method and system

Similar Documents

Publication Publication Date Title
WO2022121510A1 (en) Stochastic policy gradient-based traffic signal control method and system, and electronic device
WO2020034903A1 (en) Smart navigation method and system based on topological map
CN113851006B (en) Intersection real-time traffic state estimation method and system based on multi-source data fusion
CN106250515B (en) A missing path recovery method based on historical data
CN117395726B (en) A path planning-based mobile edge computing service migration method
CN112099345A (en) Fuzzy tracking control method, system and medium based on input hysteresis
Li et al. Hybrid approach for variable speed limit implementation and application to mixed traffic conditions with connected autonomous vehicles
Obayya et al. Artificial intelligence for traffic prediction and estimation in intelligent cyber-physical transportation systems
Wang et al. Modeling crossing behaviors of E-bikes at intersection with deep maximum entropy inverse reinforcement learning using drone-based video data
Hua et al. Safety-oriented dynamic speed harmonization of mixed traffic flow in nonrecurrent congestion
CN103200041A (en) Prediction method of delay and disruption tolerant network node encountering probability based on historical data
CN114580539A (en) A vehicle driving strategy processing method and device
CN118158148A (en) Industrial control system communication path selection method, device, equipment and medium
CN118784547A (en) A routing optimization method based on graph neural network and deep reinforcement learning
Fan et al. Prediction of road congestion diffusion based on dynamic Bayesian networks
CN117272370B (en) Method, system, electronic equipment and medium for recommending privacy protection of next interest point
CN114580539B (en) Vehicle driving strategy processing method and device
Strnad et al. Numerical optimal control method for shockwaves reduction at stationary bottlenecks
CN116320000A (en) Collaborative caching method, device, electronic device and storage medium
CN114608595A (en) A method and device for unmanned vehicle path planning
Aljeri et al. Performance Evaluation of Communication Lifetime Prediction Model for Autonomous Vehicular Networks
Niu et al. Stackelberg driver model for continual policy improvement in scenario-based closed-loop autonomous driving
Hu et al. A multi‐agent deep reinforcement learning approach for traffic signal coordination
ABDELLAH et al. Performance Estimation in V2X Networks Using Deep Learning-Based M-Estimator Loss Functions in the Presence of Outliers. Symmetry, 2021, vol. 13, no 11
CN118424321B (en) Track planning method, device and equipment for automatic driving vehicle and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant