CN113805572B

CN113805572B - Motion planning methods and devices

Info

Publication number: CN113805572B
Application number: CN202010471732.4A
Authority: CN
Inventors: 王志涛; 庄雨铮; 刘武龙; 古强
Original assignee: Huawei Technologies Co Ltd
Current assignee: Shenzhen Yinwang Intelligent Technology Co ltd
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2023-12-15
Anticipated expiration: 2040-05-29
Also published as: WO2021238303A1; CN113805572A

Abstract

This application relates to the field of artificial intelligence, specifically to the field of autonomous driving, and provides a method and device for motion planning. The method includes: obtaining driving environment information, which includes location information of dynamic obstacles; and characterizing the status of the driving environment information. Input the trained reinforcement learning network model and obtain the prediction time domain output by the reinforcement learning network model. The prediction time domain represents the duration or number of steps for predicting the motion trajectory of dynamic obstacles; use the prediction time domain for motion planning. The prediction time domain is obtained through reinforcement learning, which can dynamically change with changes in the driving environment, allowing the autonomous vehicle to flexibly respond to dynamic obstacles during the interaction between the autonomous vehicle and dynamic obstacles.

Description

Motion planning methods and devices

技术领域Technical field

本申请涉及人工智能领域，具体涉及一种运动规划的方法与装置。This application relates to the field of artificial intelligence, and specifically to a method and device for motion planning.

背景技术Background technique

自动驾驶实现的关键技术包括感知定位、规划决策、执行控制。其中，规划决策包括运动规划(motion planning)，运动规划是在遵循道路交通规则的前提下，将自动驾驶车辆从当前位置导航到目的地的一种方法。Key technologies for autonomous driving include perception and positioning, planning and decision-making, and execution control. Among them, planning decisions include motion planning, which is a method to navigate an autonomous vehicle from its current location to its destination while complying with road traffic rules.

在实际开放道路场景下，自动驾驶要处理的场景非常繁杂，尤其在动态的交通场景中，即存在动态障碍物(行人或车辆)(也可称为其它交通参与者)的交通场景中，自动驾驶车辆在与动态障碍物交互过程中存在博弈行为，这种场景下，要求自动驾驶车辆可以灵活应对动态障碍物。In actual open road scenarios, the scenarios to be processed by autonomous driving are very complex, especially in dynamic traffic scenarios, that is, in traffic scenarios where there are dynamic obstacles (pedestrians or vehicles) (also called other traffic participants), automatic driving There is gaming behavior when driving vehicles interact with dynamic obstacles. In this scenario, autonomous vehicles are required to be able to flexibly deal with dynamic obstacles.

目前，运动规划的方案缺乏在与动态障碍物交互过程中灵活应对动态障碍物的能力。Currently, motion planning solutions lack the ability to flexibly respond to dynamic obstacles during interaction with dynamic obstacles.

发明内容Contents of the invention

本申请提供一种运动规划的方法与装置，可以实现在自动驾驶车辆与动态障碍物交互过程中使得自动驾驶车辆可以灵活应对动态障碍物。This application provides a motion planning method and device, which can enable the autonomous vehicle to flexibly respond to dynamic obstacles during the interaction between the autonomous vehicle and dynamic obstacles.

第一方面，提供一种运动规划的方法，所述方法包括：获取驾驶环境信息，所述驾驶环境信息包括动态障碍物的位置信息；将所述驾驶环境信息的状态表征输入训练后的强化学习网络模型，获取所述强化学习网络模型输出的预测时域，所述预测时域表示对所述动态障碍物进行运动轨迹预测的时长或者步数；利用所述预测时域进行运动规划。In a first aspect, a method of motion planning is provided. The method includes: obtaining driving environment information, where the driving environment information includes position information of dynamic obstacles; inputting the state representation of the driving environment information into the trained reinforcement learning The network model obtains the prediction time domain output by the reinforcement learning network model, where the prediction time domain represents the duration or number of steps for predicting the motion trajectory of the dynamic obstacle; and uses the prediction time domain to perform motion planning.

该强化学习网络模型的输入为驾驶环境信息，该强化学习网络模型的输出为预测时域。换句话说，强化学习算法中的状态(state)为驾驶环境信息，动作(action)为预测时域。本申请实施例中的强化学习网络模型也可以称为预测时域策略网络。The input of the reinforcement learning network model is driving environment information, and the output of the reinforcement learning network model is the prediction time domain. In other words, the state in the reinforcement learning algorithm is the driving environment information, and the action is the prediction time domain. The reinforcement learning network model in the embodiment of this application can also be called a predictive time domain policy network.

通过采用强化学习方法，根据驾驶环境信息实时确定预测时域，使得预测时域不是固定的，而是可以随驾驶环境的变换而动态变化的，从而基于该预测时域进行运动规划，可以实现在自动驾驶车辆与动态障碍物交互过程中使得自动驾驶车辆可以灵活应对动态障碍物。By using the reinforcement learning method, the prediction time domain is determined in real time based on the driving environment information, so that the prediction time domain is not fixed, but can dynamically change with the change of the driving environment. Therefore, motion planning based on the prediction time domain can be realized in The interaction between autonomous vehicles and dynamic obstacles enables autonomous vehicles to flexibly deal with dynamic obstacles.

自动驾驶车辆按照基于采用强化学习方法获得的预测时域进行运动规划得到的运动轨迹进行行驶，可以实现在与动态障碍物交互过程中动态调整驾驶风格。驾驶风格表示驾驶行为是激进的还是保守的。The autonomous vehicle drives according to the motion trajectory obtained by motion planning based on the predicted time domain obtained using the reinforcement learning method, which can dynamically adjust the driving style during interaction with dynamic obstacles. Driving style indicates whether driving behavior is aggressive or conservative.

现有技术中，预测时域是固定的，可以视为，自动驾驶车辆的驾驶风格是固定的，而交通场景复杂多变，如果自动驾驶车辆的驾驶风格固定，难以兼顾通信效率与行驶安全。In the existing technology, the prediction time domain is fixed. It can be considered that the driving style of autonomous vehicles is fixed, and the traffic scenes are complex and changeable. If the driving style of autonomous vehicles is fixed, it is difficult to balance communication efficiency and driving safety.

在本申请中，预测时域是通过强化学习得到的，则该预测时域的大小不是固定的，而是随驾驶环境的改变而动态改变的，也就是说，针对动态障碍物不同的移动状态，该预测时域可以是不同的。因此，在本申请中，随着自动驾驶车辆的驾驶环境的改变，预测时域可大可小，对应的自动驾驶车辆的驾驶风格可保守可激进，从而可以实现在与动态障碍物交互过程中动态调整驾驶风格。In this application, the prediction time domain is obtained through reinforcement learning, and the size of the prediction time domain is not fixed, but changes dynamically with changes in the driving environment. That is to say, for the different movement states of dynamic obstacles , the prediction time domain can be different. Therefore, in this application, as the driving environment of the self-driving vehicle changes, the prediction time domain can be large or small, and the corresponding driving style of the self-driving vehicle can be conservative or aggressive, so that it can be realized during the interaction with dynamic obstacles. Dynamically adjust your driving style.

结合第一方面，在一种可能的实现方式中，所述利用所述预测时域进行运动规划，包括：将所述预测时域作为超参数，对所述动态障碍物的运动轨迹进行预测；根据所述驾驶环境信息中包括的静态障碍物的位置信息，以及所预测的所述动态障碍物的运动轨迹，规划自动驾驶车辆的运动轨迹。In conjunction with the first aspect, in one possible implementation, using the prediction time domain for motion planning includes: using the prediction time domain as a hyperparameter to predict the motion trajectory of the dynamic obstacle; According to the position information of the static obstacles included in the driving environment information and the predicted movement trajectory of the dynamic obstacles, the movement trajectory of the autonomous vehicle is planned.

结合第一方面，在一种可能的实现方式中，还包括：控制自动驾驶车辆按照所述运动规划得到的运动轨迹进行行驶。In conjunction with the first aspect, a possible implementation further includes: controlling the autonomous vehicle to drive according to the motion trajectory obtained by the motion planning.

第二方面，提供一种数据处理的方法，所述方法包括：根据强化学习网络模型与自动驾驶的驾驶环境交互获得的数据，获得所述强化学习网络模型的训练数据；利用所述训练数据，对所述强化学习网络模型进行强化学习的训练，以获得训练后的所述强化学习网络模型，其中，所述强化学习网络模型的输入为驾驶环境信息，所述强化学习网络模型的输出为预测时域，所述预测时域表示对自动驾驶的动态障碍物进行运动轨迹预测的时长或者步数。In a second aspect, a data processing method is provided. The method includes: obtaining training data for the reinforcement learning network model based on data obtained by interacting with the driving environment of autonomous driving; using the training data, Perform reinforcement learning training on the reinforcement learning network model to obtain the trained reinforcement learning network model, wherein the input of the reinforcement learning network model is driving environment information, and the output of the reinforcement learning network model is prediction Time domain, the prediction time domain represents the duration or number of steps for predicting the motion trajectory of dynamic obstacles in autonomous driving.

该强化学习网络模型的输入为驾驶环境信息，强化学习网络模型的输出为预测时域。The input of the reinforcement learning network model is driving environment information, and the output of the reinforcement learning network model is the prediction time domain.

将采用本申请提供的数据处理的方法训练得到的强化学习网络模型应用于自动驾驶，可以在运动规划的过程中，根据驾驶环境确定较为合适的预测时域，基于该预测时域进行运动规划，可以实现在自动驾驶车辆与动态障碍物交互过程中使得自动驾驶车辆可以灵活应对动态障碍物。Applying the reinforcement learning network model trained using the data processing method provided by this application to autonomous driving, a more appropriate prediction time domain can be determined based on the driving environment during the motion planning process, and motion planning can be performed based on this prediction time domain. It can be realized that the autonomous vehicle can flexibly respond to dynamic obstacles during the interaction between the autonomous vehicle and dynamic obstacles.

结合第二方面，在一种可能的实现方式中，所述根据强化学习网络模型与自动驾驶的驾驶环境交互获得的数据，获得所述强化学习网络模型的训练数据，包括：通过如下步骤获得所述训练数据中的一组样本<状态s，动作a，奖励r>。Combined with the second aspect, in a possible implementation, obtaining the training data of the reinforcement learning network model based on the data obtained by interacting with the driving environment of autonomous driving includes: obtaining the training data through the following steps Describe a set of samples in the training data <state s, action a, reward r>.

获取驾驶环境信息，将所述驾驶环境信息作为所述状态s，所述驾驶环境信息包括动态障碍物的位置信息；将所述状态s输入待训练的强化学习网络模型，获取所述强化学习网络模型输出的预测时域，将所述预测时域作为所述动作a，其中，所述预测时域表示对所述动态障碍物进行运动轨迹预测的时长或者步数；利用所述预测时域进行运动规划，获得自动驾驶车辆的运动轨迹；通过控制所述自动驾驶车辆按照所述自动驾驶车辆的运动轨迹进行行驶，获得所述奖励r。Obtain driving environment information, and use the driving environment information as the state s, which includes the location information of dynamic obstacles; input the state s into the reinforcement learning network model to be trained, and obtain the reinforcement learning network The prediction time domain of the model output is used as the action a, where the prediction time domain represents the duration or number of steps for predicting the motion trajectory of the dynamic obstacle; use the prediction time domain to perform Motion planning is used to obtain the movement trajectory of the self-driving vehicle; by controlling the self-driving vehicle to drive according to the movement trajectory of the self-driving vehicle, the reward r is obtained.

结合第二方面，在一种可能的实现方式中，所述获得所述奖励r，包括：根据回报函数，计算所述奖励r，其中，所述回报函数考虑了下列任一种或多种因素：驾驶安全性、自动驾驶车辆的通行效率、其他交通参与者的通行效率。In conjunction with the second aspect, in a possible implementation, obtaining the reward r includes: calculating the reward r according to a reward function, wherein the reward function takes into account any one or more of the following factors : Driving safety, traffic efficiency of autonomous vehicles, and traffic efficiency of other traffic participants.

第三方面，提供一种数据处理的装置，所述装置包括获取单元、预测单元与规划单元。In a third aspect, a data processing device is provided. The device includes an acquisition unit, a prediction unit and a planning unit.

所述获取单元用于获取驾驶环境信息，所述驾驶环境信息包括动态障碍物的位置信息。所述预测单元，用于将所述驾驶环境信息的状态表征输入训练后的强化学习网络模型，获取所述强化学习网络模型输出的预测时域，所述预测时域表示对所述动态障碍物进行运动轨迹预测的时长或者步数。所述规划单元，用于利用所述预测时域进行运动规划。The acquisition unit is used to acquire driving environment information, where the driving environment information includes location information of dynamic obstacles. The prediction unit is used to input the state representation of the driving environment information into the trained reinforcement learning network model, and obtain the prediction time domain output by the reinforcement learning network model. The prediction time domain represents the response to the dynamic obstacle. The duration or number of steps for motion trajectory prediction. The planning unit is used for motion planning using the prediction time domain.

结合第三方面，在一种可能的实现方式中，所述规划单元用于：将所述预测时域作为超参数，对所述动态障碍物的运动轨迹进行预测；根据所述驾驶环境信息中包括的静态障碍物的位置信息，以及所预测的所述动态障碍物的运动轨迹，规划自动驾驶车辆的运动轨迹。Combined with the third aspect, in a possible implementation, the planning unit is configured to: use the prediction time domain as a hyperparameter to predict the motion trajectory of the dynamic obstacle; and according to the driving environment information The included position information of static obstacles and the predicted movement trajectory of the dynamic obstacles are used to plan the movement trajectory of the autonomous vehicle.

结合第三方面，在一种可能的实现方式中，所述装置还包括控制单元，用于控制自动驾驶车辆按照所述运动规划得到的运动轨迹进行行驶。In conjunction with the third aspect, in a possible implementation, the device further includes a control unit configured to control the autonomous vehicle to drive according to the motion trajectory obtained by the motion planning.

第四方面，提供一种数据处理的装置，所述装置包括获取单元与训练单元。A fourth aspect provides a data processing device, which includes an acquisition unit and a training unit.

所述获取单元用于根据强化学习网络模型与自动驾驶的驾驶环境交互获得的数据，获得所述强化学习网络模型的训练数据。所述训练单元用于利用所述训练数据，对所述强化学习网络模型进行强化学习的训练，以获得训练后的所述强化学习网络模型。其中，所述强化学习网络模型的输入为驾驶环境信息，所述强化学习网络模型的输出为预测时域，所述预测时域表示对自动驾驶的动态障碍物进行运动轨迹预测的时长或者步数。The acquisition unit is configured to obtain training data of the reinforcement learning network model based on data obtained through interaction between the reinforcement learning network model and the driving environment of autonomous driving. The training unit is configured to use the training data to perform reinforcement learning training on the reinforcement learning network model to obtain the trained reinforcement learning network model. Wherein, the input of the reinforcement learning network model is driving environment information, and the output of the reinforcement learning network model is the prediction time domain. The prediction time domain represents the duration or number of steps for predicting the motion trajectory of dynamic obstacles in autonomous driving. .

结合第四方面，在一种可能的实现方式中，所述根获取单元用于，通过如下步骤获得所述训练数据中的一组样本<状态s，动作a，奖励r>。Combined with the fourth aspect, in a possible implementation, the root acquisition unit is used to obtain a set of samples <state s, action a, reward r> in the training data through the following steps.

获取驾驶环境信息，将所述驾驶环境信息作为所述状态s，所述驾驶环境信息包括动态障碍物的位置信息。将所述状态s输入待训练的强化学习网络模型，获取所述强化学习网络模型输出的预测时域，将所述预测时域作为所述动作a，其中，所述预测时域表示对所述动态障碍物进行运动轨迹预测的时长或者步数。利用所述预测时域进行运动规划，获得自动驾驶车辆的运动轨迹。通过控制所述自动驾驶车辆按照所述自动驾驶车辆的运动轨迹进行行驶，获得所述奖励r。Driving environment information is obtained and used as the state s, where the driving environment information includes position information of dynamic obstacles. Input the state s into the reinforcement learning network model to be trained, obtain the prediction time domain output by the reinforcement learning network model, and use the prediction time domain as the action a, where the prediction time domain represents the relationship between the The duration or number of steps for dynamic obstacle motion trajectory prediction. The prediction time domain is used for motion planning to obtain the motion trajectory of the autonomous vehicle. The reward r is obtained by controlling the self-driving vehicle to drive according to the motion trajectory of the self-driving vehicle.

结合第四方面，在一种可能的实现方式中，所述获取单元用于，根据回报函数，计算所述奖励r，其中，所述回报函数考虑了下列任一种或多种因素：驾驶安全性、自动驾驶车辆的通行效率、其他交通参与者的通行效率。In conjunction with the fourth aspect, in a possible implementation, the acquisition unit is configured to calculate the reward r according to a reward function, wherein the reward function takes into account any one or more of the following factors: driving safety characteristics, the traffic efficiency of autonomous vehicles, and the traffic efficiency of other traffic participants.

第五方面，提供一种自动驾驶车辆，包括第三方面提供的数据处理的装置。A fifth aspect provides an autonomous vehicle, including the data processing device provided in the third aspect.

结合第四方面，在一种可能的实现方式中，所述自动驾驶车辆还包括第四方面提供的数据处理的装置。In conjunction with the fourth aspect, in a possible implementation manner, the autonomous vehicle further includes the data processing device provided in the fourth aspect.

第六方面，提供一种数据处理的装置，该装置包括：存储器，用于存储程序；处理器，用于执行存储器存储的程序，当存储器存储的程序被执行时，处理器用于执行上述第一方面或第二方面中的方法。In a sixth aspect, a data processing device is provided. The device includes: a memory for storing a program; a processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor is configured to execute the above-mentioned first step. aspect or methods in the second aspect.

第七方面，提供一种计算机可读介质，该计算机可读介质存储用于设备执行的程序代码，该程序代码包括用于执行上述第一方面或第二方面中的方法。In a seventh aspect, a computer-readable medium is provided. The computer-readable medium stores program code for device execution. The program code includes a method for executing the above-mentioned first or second aspect.

第八方面，提供一种包含指令的计算机程序产品，当该计算机程序产品在计算机上运行时，使得计算机执行上述第一方面或第二方面中的方法。An eighth aspect provides a computer program product containing instructions, which when the computer program product is run on a computer, causes the computer to execute the method in the first aspect or the second aspect.

第九方面，提供一种芯片，所述芯片包括处理器与数据接口，所述处理器通过所述数据接口读取存储器上存储的指令，执行上述第一方面或第二方面中的方法。A ninth aspect provides a chip. The chip includes a processor and a data interface. The processor reads instructions stored in a memory through the data interface and executes the method in the first aspect or the second aspect.

可选地，作为一种实现方式，所述芯片还可以包括存储器，所述存储器中存储有指令，所述处理器用于执行所述存储器上存储的指令，当所述指令被执行时，所述处理器用于执行上述第一方面或第二方面中的方法。Optionally, as an implementation manner, the chip may further include a memory, in which instructions are stored, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the The processor is configured to execute the method in the above first or second aspect.

基于上述描述，在本申请中，预测时域是通过强化学习得到的，则该预测时域的大小不是固定的，而是随驾驶环境的改变而动态改变的，也就是说，针对动态障碍物不同的移动状态，该预测时域可以是不同的。因此，在本申请中，随着自动驾驶车辆的驾驶环境的改变，预测时域可大可小，对应的自动驾驶车辆的驾驶风格可保守可激进，从而可以实现在与动态障碍物交互过程中动态调整驾驶风格。Based on the above description, in this application, the prediction time domain is obtained through reinforcement learning, so the size of the prediction time domain is not fixed, but changes dynamically with changes in the driving environment. That is to say, for dynamic obstacles The prediction time domain may be different for different movement states. Therefore, in this application, as the driving environment of the self-driving vehicle changes, the prediction time domain can be large or small, and the corresponding driving style of the self-driving vehicle can be conservative or aggressive, so that it can be realized during the interaction with dynamic obstacles. Dynamically adjust your driving style.

附图说明Description of drawings

图1是自动驾驶系统的示意性框图。Figure 1 is a schematic block diagram of an autonomous driving system.

图2是自动驾驶的场景示意图。Figure 2 is a schematic diagram of an autonomous driving scenario.

图3是强化学习的原理示意图。Figure 3 is a schematic diagram of the principle of reinforcement learning.

图4是本申请实施例提供的运动规划的方法的示意性流程图。Figure 4 is a schematic flowchart of a motion planning method provided by an embodiment of the present application.

图5是本申请实施例提供的运动规划的方法的另一示意性流程图。FIG. 5 is another schematic flowchart of the motion planning method provided by the embodiment of the present application.

图6是本申请实施例提供的训练强化学习网络模型的方法的示意性流程图。Figure 6 is a schematic flow chart of a method for training a reinforcement learning network model provided by an embodiment of the present application.

图7是图6中步骤S610的示意性流程图。FIG. 7 is a schematic flow chart of step S610 in FIG. 6 .

图8是自动驾驶的另一场景示意图。Figure 8 is a schematic diagram of another scenario of autonomous driving.

图9是本申请实施例提供的数据处理的装置的示意性框图。Figure 9 is a schematic block diagram of a data processing device provided by an embodiment of the present application.

图10是本申请实施例提供的数据处理的装置的另一示意性框图。Figure 10 is another schematic block diagram of a data processing device provided by an embodiment of the present application.

图11是本申请实施例提供的数据处理的装置的又一示意性框图。Figure 11 is another schematic block diagram of a data processing device provided by an embodiment of the present application.

图12是本申请实施例提供的数据处理的装置的再一示意性框图。Figure 12 is another schematic block diagram of a data processing device provided by an embodiment of the present application.

图13是本申请实施例提供的一种芯片硬件结构示意图。Figure 13 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.

具体实施方式Detailed ways

随着智能驾驶的到来，智能汽车(intelligent vehicles)成为各大厂商重点研究的目标。智能汽车根据传感器输入的各种参数等生成期望的路径，并将相应的控制量提供给后续的控制器。智能驾驶也称为自动驾驶。自动驾驶的关键技术包括感知定位、决策规划、执行控制。作为示例，如图1所示，自动驾驶系统可以包括感知模块110、决策规划模块120与执行控制模块130。With the advent of smart driving, intelligent vehicles have become the focus of research by major manufacturers. The smart car generates the desired path based on various parameters input by the sensor, and provides the corresponding control volume to the subsequent controller. Intelligent driving is also called autonomous driving. Key technologies for autonomous driving include perception and positioning, decision-making and planning, and execution control. As an example, as shown in FIG. 1 , the automatic driving system may include a perception module 110 , a decision planning module 120 and an execution control module 130 .

下面对自动驾驶系统中的环境感知模块110、决策规划模块120与执行控制模块130进行过示例性地描述。The environment perception module 110, the decision planning module 120 and the execution control module 130 in the autonomous driving system are exemplarily described below.

环境感知模块110负责采集环境信息，例如，其他车辆、行人等障碍物信息，道路上交通标志、红绿灯等交通规则信息。The environment sensing module 110 is responsible for collecting environmental information, such as information about obstacles such as other vehicles and pedestrians, and traffic rule information such as traffic signs and traffic lights on the road.

决策规划模块120负责的决策规划可以分为如下三个层次。The decision planning responsible for the decision planning module 120 can be divided into the following three levels.

1)全局路径规划(route planning)，指的是，在收到一个目的地信息后，结合地图信息和本车的当前位置信息与姿态信息，生成一条最优的全局路径，作为后续局部路径规划的参考与引导。这里的“最优”可以指路径最短、时间最快或必须经过指定点等条件。1) Global path planning (route planning) refers to, after receiving a destination information, combining the map information and the current location information and attitude information of the vehicle to generate an optimal global path as subsequent local path planning. reference and guidance. The "optimal" here can refer to conditions such as the shortest path, the fastest time, or must pass through a specified point.

常见的全局路径规划算法包括Dijkstra、A-Star算法，以及在这两种算法基础上的多种改进。Common global path planning algorithms include Dijkstra and A-Star algorithms, as well as various improvements based on these two algorithms.

2)行为决策层(behavioral layer)，指的是，在接收到全局路径后，根据从环境感知模块110得到的环境信息，以及本车当前的行驶路径等信息，作出具体的行为决策(例如，变道超车、跟车行驶、让行、停车、进出站等)。2) Behavioral layer (behavioral layer) refers to making specific behavioral decisions (for example, after receiving the global path) based on the environmental information obtained from the environment sensing module 110 and the current driving path of the vehicle. Changing lanes to overtake, following, yielding, parking, entering and exiting the station, etc.).

常见的行为决策层的算法包括：有限状态机、决策树、基于规则的推理模型等。Common behavioral decision-making layer algorithms include: finite state machines, decision trees, rule-based reasoning models, etc.

3)运动规划(motion planning)，指的是，根据行为决策层作出的具体的行为决策，生成一条满足各种约束条件(例如，安全性、车辆本身的动力学约束等)的运动轨迹，该运动轨迹作为执行控制模块130的输入决定车辆的行驶路径。3) Motion planning refers to generating a motion trajectory that satisfies various constraints (such as safety, dynamic constraints of the vehicle itself, etc.) based on specific behavioral decisions made by the behavioral decision-making layer. The motion trajectory is used as an input to the execution control module 130 to determine the driving path of the vehicle.

执行控制模块130负责，根据决策规划模块120输出的运动轨迹，控制车辆的行驶路径。The execution control module 130 is responsible for controlling the driving path of the vehicle according to the motion trajectory output by the decision planning module 120 .

在实际开放道路场景下，自动驾驶要处理的场景非常繁杂，包括：空旷的道路场景、与行人、障碍物共用道路的场景、空旷的十字路口场景、繁忙的十字路口场景、违反交通规则的行人/车辆场景、正常行驶的车辆/行人场景等。例如，在如图2所示的动态交通场景中，具有其它交通参与者：行人与移动的其它车辆，对自动驾驶车辆来说，行人与移动的其它车辆是动态障碍物。自动驾驶车辆在与动态障碍物交互过程中存在博弈行为。因此，在动态交通场景中，要求自动驾驶车辆可以灵活应对动态障碍物。In actual open road scenes, autonomous driving has to deal with very complex scenes, including: empty road scenes, scenes sharing the road with pedestrians and obstacles, empty intersection scenes, busy intersection scenes, and pedestrians violating traffic rules. /Vehicle scenes, normal driving vehicle/pedestrian scenes, etc. For example, in the dynamic traffic scene shown in Figure 2, there are other traffic participants: pedestrians and other moving vehicles. For autonomous vehicles, pedestrians and other moving vehicles are dynamic obstacles. There is gaming behavior in the interaction between autonomous vehicles and dynamic obstacles. Therefore, in dynamic traffic scenarios, autonomous vehicles are required to flexibly deal with dynamic obstacles.

目前，运动规划的主要实现方式有基于搜索(例如，A*类算法)、采样(例如，RRT类算法)、参数化轨迹(例如，Reeds-Shepp曲线)以及优化(例如，基于Frenet坐标系)的解决方案，这些解决方案缺乏在与动态障碍物交互过程中灵活应对动态障碍物的能力。At present, the main implementation methods of motion planning are based on search (for example, A* class algorithm), sampling (for example, RRT class algorithm), parameterized trajectory (for example, Reeds-Shepp curve) and optimization (for example, based on Frenet coordinate system) solutions that lack the ability to flexibly respond to dynamic obstacles during interaction with them.

针对上述问题，本申请提供一种运动规划的方法，可以使得自动驾驶车辆在与动态障碍物交互过程中可以灵活应对动态障碍物。In response to the above problems, this application provides a motion planning method that can enable autonomous vehicles to flexibly respond to dynamic obstacles during interaction with dynamic obstacles.

为了更好地理解本申请实施例，下面先描述本申请实施例涉及的强化学习。In order to better understand the embodiments of the present application, the reinforcement learning involved in the embodiments of the present application is first described below.

强化学习(reinforcement learning，RL)用于描述和解决智能体(agent)在与环境的交互过程中通过学习策略以达成回报最大化或实现特定目标的问题。强化学习的常见模型是马尔可夫决策过程(markov decision process，MDP)。MDP是一种分析决策问题的数学模型。强化学习是智能体(agent)以“试错”的方式进行学习，通过动作(action)与环境进行交互获得的奖励(reward)指导行为，目标是使智能体获得最大的奖励。强化学习中由环境提供的强化信号(即奖励)对产生动作的好坏作一种评价，而不是告诉强化学习系统如何去产生正确的动作。由于外部环境提供的信息很少，智能体必须靠自身的经历进行学习。通过这种方式，智能体在行动-评价(即奖励)的环境中获得知识，改进行动方案以适应环境。常见的强化学习算法有Q-learning，policy gradient，actor-critic等。Reinforcement learning (RL) is used to describe and solve problems in which an agent learns strategies to maximize returns or achieve specific goals during its interaction with the environment. A common model of reinforcement learning is the Markov decision process (MDP). MDP is a mathematical model for analyzing decision-making problems. Reinforcement learning is where an agent learns in a "trial and error" manner, and the reward obtained through interaction with the environment guides behavior. The goal is to enable the agent to obtain the maximum reward. In reinforcement learning, the reinforcement signal (i.e., reward) provided by the environment evaluates the quality of the generated action, rather than telling the reinforcement learning system how to generate the correct action. Since the external environment provides little information, the agent must rely on its own experience to learn. In this way, the agent acquires knowledge in an action-evaluation (i.e., reward) environment and improves its action plan to adapt to the environment. Common reinforcement learning algorithms include Q-learning, policy gradient, actor-critic, etc.

如图3所示，强化学习主要包含五个元素：智能体(agent)、环境(environment)、状态(state)、动作(action)与奖励(reward)，其中，智能体的输入为状态，输出为动作。强化学习的训练过程为：通过智能体与环境进行多次交互，获得每次交互的动作、状态、奖励；将这多组(动作，状态，奖励)作为训练数据，对智能体进行一次训练。采用上述过程，对智能体进行下一轮次训练，直至满足收敛条件。As shown in Figure 3, reinforcement learning mainly contains five elements: agent, environment, state, action and reward. Among them, the input of the agent is the state, and the output for action. The training process of reinforcement learning is: through multiple interactions between the agent and the environment, the actions, states, and rewards of each interaction are obtained; these multiple groups (actions, states, rewards) are used as training data to train the agent once. Using the above process, the agent is trained for the next round until the convergence conditions are met.

作为示例，获得一次交互的动作、状态、奖励的过程如图3所示，将环境当前状态s0输入至智能体，获得智能体输出的动作a0，根据环境在动作a0作用下的相关性能指标计算本次交互的奖励r0，至此，获得本次交互的动作a0、动作a0与奖励r0。记录本次交互的动作a0、动作a0与奖励r0，以备后续用来训练智能体。还记录环境在动作a0作用下的下一个状态s1，以便实现智能体与环境的下一次交互。As an example, the process of obtaining the actions, status, and rewards of an interaction is shown in Figure 3. The current state of the environment s0 is input to the agent, and the action a0 output by the agent is obtained. The calculation is based on the relevant performance indicators of the environment under the action a0. The reward r0 for this interaction. At this point, the action a0, action a0 and reward r0 for this interaction are obtained. Record the action a0, action a0 and reward r0 of this interaction for subsequent use in training the agent. The next state s1 of the environment under the action a0 is also recorded in order to realize the next interaction between the agent and the environment.

下面将结合附图，对本申请中的技术方案进行描述。The technical solutions in this application will be described below with reference to the accompanying drawings.

图4为本申请实施例提供的一种运动规划的方法400的示意性流程图。以自动驾驶系统如图1为例，该方法300可以由决策规划模块120执行。如图4所示，该方法400包括步骤S410、S420、S430。Figure 4 is a schematic flow chart of a motion planning method 400 provided by an embodiment of the present application. Taking the automatic driving system shown in Figure 1 as an example, the method 300 can be executed by the decision planning module 120. As shown in Figure 4, the method 400 includes steps S410, S420, and S430.

S410，获取驾驶环境信息。S410, obtain driving environment information.

该驾驶环境信息包括动态障碍物的位置信息。动态障碍物表示驾驶环境中行人、车辆等各种运动的障碍物。动态障碍物也可以称为动态交通参与者。例如，动态障碍物包括其它行驶的车辆或行人。The driving environment information includes location information of dynamic obstacles. Dynamic obstacles represent various moving obstacles such as pedestrians and vehicles in the driving environment. Dynamic obstacles can also be called dynamic traffic participants. Dynamic obstacles include, for example, other moving vehicles or pedestrians.

例如，驾驶环境信息还可以包括道路结构信息、静态障碍物的位置信息、自动驾驶车辆的位置信息等。其中，道路结构信息包括道路上交通标志、红绿灯等交通规则信息等。For example, the driving environment information may also include road structure information, location information of static obstacles, location information of autonomous vehicles, etc. Among them, road structure information includes traffic rules information such as traffic signs and traffic lights on the road.

获取驾驶环境信息的方法可以为，根据自动驾驶车辆上的各个传感器采集的信息获取驾驶环境信息。本申请对获取驾驶环境信息的方式不作限定。The method of obtaining the driving environment information may be to obtain the driving environment information based on the information collected by various sensors on the autonomous vehicle. This application does not limit the method of obtaining driving environment information.

S420，将驾驶环境信息的状态表征输入训练后的强化学习网络模型，获取强化学习网络模型输出的预测时域，该预测时域表示对动态障碍物进行运动轨迹预测的时长或者步数。S420: Input the state representation of the driving environment information into the trained reinforcement learning network model, and obtain the prediction time domain output by the reinforcement learning network model. The prediction time domain represents the duration or number of steps for predicting the motion trajectory of the dynamic obstacle.

本申请实施例中的强化学习网络模型表示强化学习方法中的智能体(如图3所示)。The reinforcement learning network model in the embodiment of this application represents the agent in the reinforcement learning method (as shown in Figure 3).

需要说明的是，驾驶环境信息的状态表征表示对驾驶环境信息进行处理后的数据。实际应用中，可以根据强化学习算法中对状态的定义来确定对驾驶环境信息的处理方式。It should be noted that the state representation of the driving environment information represents data after processing the driving environment information. In practical applications, the processing method of driving environment information can be determined based on the definition of state in the reinforcement learning algorithm.

实际应用中，可以根据应用需求设计强化学习算法中状态的定义。本申请对此不作限定。In practical applications, the definition of the state in the reinforcement learning algorithm can be designed according to the application requirements. This application does not limit this.

本申请实施例中提及的预测时域表示对动态障碍物进行运动轨迹预测的时长或者步数。The prediction time domain mentioned in the embodiment of this application represents the duration or number of steps for predicting the motion trajectory of dynamic obstacles.

作为一个示例，假设将预测时域定位是预测的时长，例如，预测时域为5，表示，对动态障碍物进行运动轨迹预测的时长为5个时间单位。该时间单位可以预设。As an example, assume that the prediction time domain is positioned as the prediction duration. For example, the prediction time domain is 5, which means that the duration for predicting the motion trajectory of dynamic obstacles is 5 time units. This time unit can be preset.

作为另一个示例，假设将预测时域定位是预测的步数，例如，预测时域为5，表示，对动态障碍物进行运动轨迹预测的步数为5个单位步长。该单位步长可以预设。As another example, assume that the prediction time domain positioning is the number of prediction steps. For example, the prediction time domain is 5, which means that the number of steps for predicting the motion trajectory of a dynamic obstacle is 5 unit steps. The unit step size can be preset.

本申请实施例中的预测时域还可以表述为是，用于规划动态障碍物的运动轨迹的规划器的预测时域。The prediction time domain in the embodiment of the present application can also be expressed as the prediction time domain of the planner used to plan the movement trajectory of dynamic obstacles.

需要说明的是，本申请实施例提供的运动规划的方法400(以及下文将描述的方法500)中采用的强化学习网络模型为已经训练好的模型，具体地，是以基于驾驶环境预测预测时域为训练目标训练好的模型。关于强化学习网络模型的训练方法，下文将结合图6进行描述，这里暂不详述。It should be noted that the reinforcement learning network model used in the motion planning method 400 (and the method 500 to be described below) provided by the embodiment of the present application is an already trained model. Specifically, the prediction time is based on the driving environment prediction. The domain is a model trained for the training target. Regarding the training method of the reinforcement learning network model, it will be described below in conjunction with Figure 6, and will not be described in detail here.

S430，利用该预测时域进行运动规划。S430: Use the prediction time domain to perform motion planning.

例如，利用预测时域进行运动规划的流程包括如下步骤：For example, the process of using prediction time domain for motion planning includes the following steps:

1)将步骤S420中得到的预测时域作为超参数，对动态障碍物的运动轨迹进行预测；1) Use the prediction time domain obtained in step S420 as a hyperparameter to predict the motion trajectory of the dynamic obstacle;

2)根据驾驶环境信息中的静态障碍物的位置信息，以及所预测的动态障碍物的运动轨迹，利用规划算法进行规划自动驾驶车辆的运动轨迹。2) Based on the position information of static obstacles in the driving environment information and the predicted movement trajectories of dynamic obstacles, use the planning algorithm to plan the movement trajectory of the autonomous vehicle.

需要说明的是，根据动态障碍物的运动轨迹预测的时长或步数(即本申请实施例中的预测时域)对自动驾驶车辆进行运动规划的方法可参考现有技术，本文对此不作详述。It should be noted that the method of motion planning for an autonomous vehicle based on the predicted duration or number of steps of the motion trajectory of the dynamic obstacle (i.e., the prediction time domain in the embodiment of the present application) can refer to the existing technology, and this article will not elaborate on this. narrate.

应理解，自动驾驶车辆可以按照步骤S430中得到的自动驾驶车辆的运动轨迹进行行驶，直至驾驶任务完成。It should be understood that the self-driving vehicle can drive according to the motion trajectory of the self-driving vehicle obtained in step S430 until the driving task is completed.

例如，自动驾驶车辆按照步骤S430中得到的自动驾驶车辆的运动轨迹行驶C1步，若驾驶任务未完成，则基于更新后的驾驶环境重新获取新的状态，继续执行步骤S420与步骤S430，并按照步骤S430中得到的自动驾驶车辆的运动轨迹行驶C2步，若驾驶任务未完成，继续循环上述操作，若驾驶任务完成则自动驾驶结束。其中，涉及的C1与C2的取值可以预设或根据驾驶环境实时确定。C1与C2可以相同，也可以不同。For example, the self-driving vehicle drives step C1 according to the motion trajectory of the self-driving vehicle obtained in step S430. If the driving task is not completed, a new state is reacquired based on the updated driving environment, steps S420 and S430 are continued, and the following steps are performed: The motion trajectory of the autonomous vehicle obtained in step S430 travels for step C2. If the driving task is not completed, the above operation continues to be looped. If the driving task is completed, the autonomous driving ends. Among them, the values of C1 and C2 involved can be preset or determined in real time according to the driving environment. C1 and C2 can be the same or different.

以C1与C2相同且取值为10为例，则自动驾驶车辆可以按照步骤S430中得到的自动驾驶车辆的运动轨迹行驶10个单位步长。单位步长可以预设的。Taking C1 and C2 as the same and taking a value of 10 as an example, the self-driving vehicle can travel for 10 unit steps according to the motion trajectory of the self-driving vehicle obtained in step S430. The unit step size can be preset.

例如，自动驾驶车辆按照基于采用强化学习方法获得的预测时域进行运动规划得到的运动轨迹进行行驶，可以实现在与动态障碍物交互过程中动态调整驾驶风格。For example, an autonomous vehicle drives according to a motion trajectory obtained by motion planning based on the predicted time domain obtained using reinforcement learning methods, which can dynamically adjust the driving style during interaction with dynamic obstacles.

驾驶风格表示驾驶行为是激进的还是保守的。Driving style indicates whether driving behavior is aggressive or conservative.

例如，在预测时域较大的情况下，可以将对应的驾驶风格视为是保守的；在预测时域较小的情况下，可以将对应的驾驶风格视为是激进的。For example, when the prediction time domain is large, the corresponding driving style can be regarded as conservative; when the prediction time domain is small, the corresponding driving style can be regarded as aggressive.

下面结合图5描述本申请实施例提供的运动规划的方法的一个例子。An example of the motion planning method provided by the embodiment of the present application is described below with reference to FIG. 5 .

图5为本申请实施例提供的一种运动规划的方法500的示意性流程图。Figure 5 is a schematic flow chart of a motion planning method 500 provided by an embodiment of the present application.

S510，获取驾驶环境信息。S510, obtain driving environment information.

该驾驶环境信息包括动态障碍物的位置信息。The driving environment information includes location information of dynamic obstacles.

驾驶环境信息还可以包括道路结构信息、静态障碍物的位置信息、自动驾驶车辆的位置信息等。Driving environment information can also include road structure information, location information of static obstacles, location information of autonomous vehicles, etc.

S520，将步骤S510获取的驾驶环境信息的状态表征输入训练后的强化学习网络模型，获得该强化学习网络模型输出的预测时域。S520: Input the state representation of the driving environment information obtained in step S510 into the trained reinforcement learning network model to obtain the prediction time domain output by the reinforcement learning network model.

S530，根据步骤S520中得到的预测时域，对自动驾驶车辆进行运动规划，获得自动驾驶车辆的规划轨迹。S530: Perform motion planning on the self-driving vehicle based on the prediction time domain obtained in step S520, and obtain the planned trajectory of the self-driving vehicle.

步骤S530可以包括如下两个步骤：Step S530 may include the following two steps:

1)将步骤S520中得到的预测时域作为超参数，对动态障碍物的运动轨迹进行预测；1) Use the prediction time domain obtained in step S520 as a hyperparameter to predict the motion trajectory of the dynamic obstacle;

S540，控制自动驾驶车辆按照步骤S530中获得的自动驾驶车辆的运动轨迹行驶C步，或者说，执行步骤S530中获得的自动驾驶车辆的运动轨迹的前C步，C为正整数。S540: Control the self-driving vehicle to drive C steps according to the movement trajectory of the self-driving vehicle obtained in step S530, or in other words, execute the first C steps of the movement trajectory of the self-driving vehicle obtained in step S530, where C is a positive integer.

S550，判断驾驶任务是否完成，若是，则自动驾驶操作结束，若否，转到步骤S510。S550, determine whether the driving task is completed. If so, the automatic driving operation ends. If not, go to step S510.

本申请实施例提供的运动规划的方法，通过采用强化学习方法，根据驾驶环境信息实时确定预测时域，使得预测时域不是固定的，而是可以随驾驶环境的变换而动态变化的，从而基于该预测时域进行运动规划，可以实现在自动驾驶车辆与动态障碍物交互过程中使得自动驾驶车辆可以灵活应对动态障碍物。The motion planning method provided by the embodiment of the present application uses a reinforcement learning method to determine the prediction time domain in real time based on the driving environment information, so that the prediction time domain is not fixed, but can dynamically change with the change of the driving environment, thus based on Motion planning in this prediction time domain can enable the autonomous vehicle to flexibly respond to dynamic obstacles during the interaction between the autonomous vehicle and dynamic obstacles.

例如，将本申请实施例提供的运动规划的方法应用于自动驾驶，可以实现在与动态障碍物交互过程中动态调整驾驶风格。For example, applying the motion planning method provided by the embodiments of this application to autonomous driving can dynamically adjust the driving style during interaction with dynamic obstacles.

图6为本申请实施例提供的一种数据处理的方法600的示意性流程图。例如，该方法600可应用于训练得到方法400与方法500中采用的强化学习网络模型。该方法600包括如下步骤。Figure 6 is a schematic flow chart of a data processing method 600 provided by an embodiment of the present application. For example, the method 600 can be applied to train the reinforcement learning network model used in the methods 400 and 500 . The method 600 includes the following steps.

S610，根据强化学习网络模型与自动驾驶的驾驶环境交互获得的数据，获得强化学习网络模型的训练数据。该强化学习网络模型的输入为驾驶环境信息，强化学习网络模型的输出为预测时域，预测时域表示对自动驾驶的动态障碍物进行运动轨迹预测的时长或者步数。S610: Obtain training data for the reinforcement learning network model based on the data obtained from the interaction between the reinforcement learning network model and the driving environment of the autonomous driving. The input of the reinforcement learning network model is driving environment information, and the output of the reinforcement learning network model is the prediction time domain. The prediction time domain represents the duration or number of steps for predicting the motion trajectory of dynamic obstacles in autonomous driving.

S620，利用该训练数据，对强化学习网络模型进行强化学习的训练，以获得训练后的强化学习网络模型。S620: Use the training data to perform reinforcement learning training on the reinforcement learning network model to obtain a trained reinforcement learning network model.

本申请实施例中的强化学习网络模型表示强化学习方法中的智能体(如图3所示)。该强化学习网络模型的训练数据包括多组样本，每组样本可以表示为<状态s，动作a，奖励r>。关于状态s，动作a、奖励r的含义参见前文结合图3的描述，这里不再赘述。The reinforcement learning network model in the embodiment of this application represents the agent in the reinforcement learning method (as shown in Figure 3). The training data of the reinforcement learning network model includes multiple groups of samples, and each group of samples can be expressed as <state s, action a, reward r>. Regarding the meaning of state s, action a, and reward r, please refer to the previous description in conjunction with Figure 3, and will not be repeated here.

如图7所示，在本申请实施例中，步骤S610包括：通过如下步骤S611至步骤S614，获得该强化学习网络模型的训练数据中的一组样本<状态s，动作a，奖励r>。As shown in Figure 7, in the embodiment of the present application, step S610 includes: obtaining a set of samples <state s, action a, reward r> in the training data of the reinforcement learning network model through the following steps S611 to step S614.

S611，获取驾驶环境信息，将驾驶环境信息作为该状态s。S611, obtain the driving environment information and use the driving environment information as the state s.

例如，该驾驶环境信息还可以包括道路结构信息、静态障碍物的位置信息、自动驾驶车辆的位置信息等。For example, the driving environment information may also include road structure information, location information of static obstacles, location information of autonomous vehicles, etc.

S612，将该状态s输入待训练的强化学习网络模型，获取强化学习网络模型输出的预测时域，将预测时域作为该动作a，其中，预测时域表示对动态障碍物进行运动轨迹预测的时长或者步数。S612, input the state s into the reinforcement learning network model to be trained, obtain the prediction time domain output by the reinforcement learning network model, and use the prediction time domain as the action a, where the prediction time domain represents the prediction of the motion trajectory of the dynamic obstacle. Duration or number of steps.

S613，利用预测时域进行运动规划，获得自动驾驶车辆的运动轨迹。S613, use the prediction time domain for motion planning to obtain the motion trajectory of the autonomous vehicle.

步骤S613可以包括如下两个步骤：Step S613 may include the following two steps:

1)将步骤S612中得到的预测时域作为超参数，对动态障碍物的运动轨迹进行预测；1) Use the prediction time domain obtained in step S612 as a hyperparameter to predict the motion trajectory of the dynamic obstacle;

S614，通过控制自动驾驶车辆按照自动驾驶车辆的运动轨迹进行行驶，获得该奖励r。S614, obtain the reward r by controlling the autonomous vehicle to drive according to the motion trajectory of the autonomous vehicle.

例如，通过控制自动驾驶车辆按照自动驾驶车辆的运动轨迹进行行驶，获得更新后的驾驶环境信息，基于更新后的驾驶环境信息计算得到奖励r。其中，基于更新后的驾驶环境信息获得奖励r的策略可以根据应用需求确定，本申请对此不作限定。For example, by controlling the autonomous vehicle to drive according to the motion trajectory of the autonomous vehicle, updated driving environment information is obtained, and the reward r is calculated based on the updated driving environment information. Among them, the strategy for obtaining reward r based on the updated driving environment information can be determined according to application requirements, and this application does not limit this.

应理解，通过循环执行多轮步骤S611至步骤S614，可得到多组样本<状态s，动作a，奖励r>。其中，在每次执行下一轮步骤S611至步骤S614之前，强化学习网络模型会根据上一轮步骤S614获得的奖励更新状态s与动作a之间的映射关系。It should be understood that by executing multiple rounds of steps S611 to S614 in a loop, multiple groups of samples <state s, action a, reward r> can be obtained. Among them, before each execution of the next round of steps S611 to step S614, the reinforcement learning network model will update the mapping relationship between the state s and the action a based on the reward obtained in the previous round of step S614.

将这多组样本作为训练数据，对强化学习网络模型进行一次训练。继续采用上述过程，对强化学习网络模型进行下一轮次训练，直至满足模型收敛条件，则获得训练好的强化学习网络模型。Use these sets of samples as training data to train the reinforcement learning network model. Continue to use the above process to train the reinforcement learning network model for the next round until the model convergence conditions are met, and then the trained reinforcement learning network model is obtained.

可选地，在本实施例的步骤S614中，可以通过代价函数，计算得到奖励r。Optionally, in step S614 of this embodiment, the reward r can be calculated through the cost function.

该代价函数可以根据应用需求进行设计。The cost function can be designed according to application requirements.

可选地，该代价函数可以是根据自动驾驶车辆与其它车辆之间的博弈行为确定的。Optionally, the cost function may be determined based on the game behavior between the autonomous vehicle and other vehicles.

作为示例，设计该代价函数的考虑因素包括下列中任一种或多种：As examples, considerations in designing this cost function include any one or more of the following:

驾驶安全性、自动驾驶车辆的通行效率、其他交通参与者(例如，其它车辆)的通行效率。Driving safety, traffic efficiency of autonomous vehicles, and traffic efficiency of other traffic participants (e.g., other vehicles).

作为一个示例，奖励r根据如下分段函数获得，该分段函数可以称为代价函数：As an example, the reward r is obtained according to the following piecewise function, which can be called the cost function:

该分段函数中的第一段“-0.5×time_step”是用于鼓励自动驾驶车辆尽快完成驾驶任务，是出于自动驾驶车辆的通行效率的考虑。其中，Time_step表示驾驶任务的计时信息。The first section "-0.5×time_step" in this piecewise function is used to encourage autonomous vehicles to complete the driving task as soon as possible, out of consideration for the traffic efficiency of autonomous vehicles. Among them, Time_step represents the timing information of the driving task.

该分段函数中的第二段“-10”用于惩罚碰撞行为，是出于安全性的考虑。The second section "-10" in this piecewise function is used to punish collision behavior for safety reasons.

该分段函数中的第三段“10”用于对完成驾驶任务进行奖励。The third segment "10" in this piecewise function is used to reward completing the driving task.

该分段函数中的第四段“5”用于对其它车辆通过窄道进行奖励，使得强化学习算法不仅考虑自动驾驶车辆的行驶效率，还考虑其它车辆的行驶效率，是出于鼓励兼顾其它车辆的通行效率的考虑。The fourth segment "5" in this piecewise function is used to reward other vehicles for passing through the narrow road, so that the reinforcement learning algorithm not only considers the driving efficiency of the self-driving vehicle, but also considers the driving efficiency of other vehicles. This is to encourage consideration of other vehicles. Consideration of vehicle traffic efficiency.

将本申请实施例提供的方法600训练得到的强化学习网络模型应用于自动驾驶，可以在运动规划的过程中，根据驾驶环境确定较为合适的预测时域，基于该预测时域进行运动规划，可以实现在自动驾驶车辆与动态障碍物交互过程中使得自动驾驶车辆可以灵活应对动态障碍物。When the reinforcement learning network model trained by the method 600 provided in the embodiment of the present application is applied to autonomous driving, a more appropriate prediction time domain can be determined according to the driving environment during the motion planning process, and motion planning can be performed based on the prediction time domain. This enables the autonomous vehicle to flexibly respond to dynamic obstacles during the interaction between the autonomous vehicle and dynamic obstacles.

下面描述一个将本申请实施例提供的方法应用于如图8所示的窄道会车场景的例子。The following describes an example of applying the method provided by the embodiment of the present application to the narrow road meeting scene as shown in Figure 8.

图8所示的窄道会车场景的驾驶任务是，自动驾驶车辆与其它车辆(移动的)期望通过窄道，两车在不考虑路权的情况下行驶，自动驾驶车辆根据对方车辆的形式行为对自身的行驶行为进行调整。The driving task of the narrow road meeting scene shown in Figure 8 is that the self-driving vehicle and other vehicles (moving) want to pass through the narrow road. The two vehicles drive without considering the right of way. The self-driving vehicle follows the form of the other vehicle. Behavior adjusts its own driving behavior.

步骤1)，获取强化学习算法中的状态。Step 1), obtain the state in the reinforcement learning algorithm.

例如，通过激光雷达，获取二维可行区域信息与不可行区域信息。例如，将这些区域信息(包括二维可行区域信息与不可行区域信息)表征为84×84投影矩阵。For example, through lidar, two-dimensional feasible area information and infeasible area information are obtained. For example, these area information (including two-dimensional feasible area information and infeasible area information) are represented as an 84×84 projection matrix.

例如，为了使得强化学习网络模型能够对自动驾驶车辆与其他车辆的运动具有描述能力，可以将历史投影矩阵中间隔为5的最近4帧投影矩阵按照当前车辆坐标系进行坐标变换，将得到的投影矩阵序列作为强化学习网络模型的输入。For example, in order to enable the reinforcement learning network model to describe the movements of autonomous vehicles and other vehicles, the projection matrices of the last 4 frames with an interval of 5 in the historical projection matrix can be transformed according to the current vehicle coordinate system, and the resulting projection A sequence of matrices as input to a reinforcement learning network model.

步骤2)，将步骤1)获取的状态，即矩阵序列，输入强化学习网络模型，获得规划算法对动态障碍物的预测时域。Step 2), input the state obtained in step 1), that is, the matrix sequence, into the reinforcement learning network model to obtain the prediction time domain of the dynamic obstacles by the planning algorithm.

例如，强化学习网络模型的网络结构可以采用ACKTR算法。该ACKTR算法为Actor-Critic框架下的策略梯度算法。该ACKTR算法包括策略网络与值网络。For example, the network structure of the reinforcement learning network model can adopt the ACKTR algorithm. The ACKTR algorithm is a policy gradient algorithm under the Actor-Critic framework. The ACKTR algorithm includes policy network and value network.

例如，为了处理矩阵输入，可以设计包含卷积层与全连接层的值网络与策略网络模型。将步骤1)中得到的矩阵序列作为该强化学习网络模型的输入。将策略网络的输出值设计为规划算法对动态障碍物的预测时域。关于预测时域的说明参见前文，这里不再赘述。For example, in order to process matrix inputs, value network and policy network models containing convolutional layers and fully connected layers can be designed. Use the matrix sequence obtained in step 1) as the input of the reinforcement learning network model. The output value of the policy network is designed as the prediction time domain of the planning algorithm for dynamic obstacles. For the description of the prediction time domain, please refer to the previous article and will not be repeated here.

步骤3)，以步骤2)中得到的预测时域作为超参数，利用匀速预测模型对动态的其他车辆进行该时域步长的轨迹预测。Step 3), use the prediction time domain obtained in step 2) as a hyperparameter, and use the uniform speed prediction model to predict the trajectory of other dynamic vehicles in this time domain step.

基于静态障碍物以及对动态障碍物的轨迹预测，例如，采用多项式规划算法进行运动规划。多项式算法是一种基于采样的规划算法，该算法在结构化道路的Frenet坐标系下进行规划，首先对偏离车道中心线的横向距离以及纵向期望速度进行采样，之后通过五次多项式拟合，生成备选轨迹集合，最后根据规划器的代价函数对轨迹进行优选，输出最优轨迹，完成运动规划。Based on static obstacles and trajectory prediction of dynamic obstacles, for example, polynomial programming algorithm is used for motion planning. The polynomial algorithm is a planning algorithm based on sampling. The algorithm performs planning under the Frenet coordinate system of structured roads. It first samples the lateral distance deviated from the lane centerline and the longitudinal expected speed, and then uses a fifth-order polynomial fitting to generate A set of alternative trajectories, and finally the trajectories are optimized according to the cost function of the planner, the optimal trajectory is output, and motion planning is completed.

应理解，自动驾驶车辆可以按照步骤3)中得到的自动驾驶车辆的运动轨迹进行行驶，直至驾驶任务完成。It should be understood that the self-driving vehicle can drive according to the motion trajectory of the self-driving vehicle obtained in step 3) until the driving task is completed.

例如，自动驾驶车辆按照步骤3)中得到的自动驾驶车辆的运动轨迹行驶若干步，若驾驶任务未完成，则继续执行步骤1)至步骤3)，并按照步骤3)中得到的自动驾驶车辆的运动轨迹行驶若干步，若驾驶任务未完成，继续循环上述操作，若驾驶任务完成则自动驾驶任务结束。For example, the self-driving vehicle drives several steps according to the motion trajectory of the self-driving vehicle obtained in step 3). If the driving task is not completed, continue to perform steps 1) to step 3), and follow the self-driving vehicle motion trajectory obtained in step 3). The motion trajectory is driven for several steps. If the driving task is not completed, the above operation continues to be cycled. If the driving task is completed, the automatic driving task ends.

在结合图8描述的例子中涉及的强化学习网络模型可以采用上文实施例中的方法600训练得到。具体描述详见上文，这里不再赘述。The reinforcement learning network model involved in the example described in conjunction with Figure 8 can be trained using the method 600 in the above embodiment. The specific description is detailed above and will not be repeated here.

上述可知，本申请实施例，通过采用强化学习方法，根据驾驶环境信息实时确定预测时域，使得预测时域不是固定的，而是可以随驾驶环境的变换而动态变化的，从而基于该预测时域进行运动规划，可以实现在自动驾驶车辆与动态障碍物交互过程中使得自动驾驶车辆可以灵活应对动态障碍物。As can be seen from the above, the embodiment of the present application uses a reinforcement learning method to determine the prediction time domain in real time based on the driving environment information, so that the prediction time domain is not fixed, but can dynamically change with the change of the driving environment, so that based on the prediction time domain Motion planning in the domain can enable the autonomous vehicle to flexibly respond to dynamic obstacles during the interaction between the autonomous vehicle and dynamic obstacles.

本文中描述的各个实施例可以为独立的方案，也可以根据内在逻辑进行组合，这些方案都落入本申请的保护范围中。Each embodiment described in this article can be an independent solution or can be combined according to the internal logic. These solutions all fall within the protection scope of this application.

上文描述了本申请提供的方法实施例，下文将描述本申请提供的装置实施例。应理解，装置实施例的描述与方法实施例的描述相互对应，因此，未详细描述的内容可以参见上文方法实施例，为了简洁，这里不再赘述。The method embodiments provided by this application are described above, and the device embodiments provided by this application will be described below. It should be understood that the description of the device embodiments corresponds to the description of the method embodiments. Therefore, for content that is not described in detail, please refer to the above method embodiments. For the sake of brevity, they will not be described again here.

如图9所示，本申请实施例还提供一种数据处理的装置900，该装置900包括环境感知模块910、运动规划模块920、车辆控制模块930。As shown in FIG. 9 , this embodiment of the present application also provides a data processing device 900 . The device 900 includes an environment sensing module 910 , a motion planning module 920 , and a vehicle control module 930 .

环境感知模块910，用于获取驾驶环境信息，并向该驾驶环境信息传递给运动规划模块920。The environment sensing module 910 is used to obtain driving environment information and transfer the driving environment information to the motion planning module 920.

例如，环境感知模块910用于根据车辆上各个传感器所采集的信息，获取驾驶环境信息。For example, the environment sensing module 910 is used to obtain driving environment information based on information collected by various sensors on the vehicle.

运动规划模块920，用于从环境感知模块910接收驾驶环境信息，并采用强化学习网络模型获得动态障碍物的预测时域，并基于该预测时域进行运动规划，获得自动驾驶车辆的运动轨迹，并将该运动轨迹对应的规划控制信息传递给车辆控制模块930。The motion planning module 920 is used to receive driving environment information from the environment sensing module 910, and use a reinforcement learning network model to obtain the prediction time domain of dynamic obstacles, and perform motion planning based on the prediction time domain to obtain the motion trajectory of the autonomous vehicle, And the planning control information corresponding to the motion trajectory is transferred to the vehicle control module 930.

例如，运动规划模块920用于执行上文方法实施例提供的方法400中的步骤S420与步骤S430。For example, the motion planning module 920 is used to execute steps S420 and S430 in the method 400 provided in the above method embodiment.

车辆控制模块930，用于从运动规划模块920接收规划控制信息，并控制车辆依据规划控制信息对应的动作指令信息控制车辆完成驾驶任务。The vehicle control module 930 is configured to receive planning control information from the motion planning module 920 and control the vehicle to complete the driving task according to the action instruction information corresponding to the planning control information.

本申请实施例提供的装置900可以设置在自动驾驶车辆上。The device 900 provided in the embodiment of this application can be installed on an autonomous vehicle.

如图10所示，本申请实施例还提供一种运动规划的装置1000，装置1000用于执行上文方法实施例中的方法400或方法500。装置1000包括获取单元1010、预测单元1020与规划单元1030。As shown in Figure 10, this embodiment of the present application also provides a motion planning device 1000. The device 1000 is configured to execute method 400 or method 500 in the above method embodiment. The device 1000 includes an acquisition unit 1010, a prediction unit 1020 and a planning unit 1030.

获取单元1010用于获取驾驶环境信息，驾驶环境信息包括动态障碍物的位置信息。The acquisition unit 1010 is used to acquire driving environment information, which includes location information of dynamic obstacles.

预测单元1020用于将驾驶环境信息的状态表征输入训练后的强化学习网络模型，获取强化学习网络模型输出的预测时域，预测时域表示对动态障碍物进行运动轨迹预测的时长或者步数。The prediction unit 1020 is used to input the state representation of the driving environment information into the trained reinforcement learning network model, and obtain the prediction time domain output by the reinforcement learning network model. The prediction time domain represents the duration or number of steps for predicting the motion trajectory of the dynamic obstacle.

规划单元1030用于利用预测时域进行运动规划。The planning unit 1030 is used to perform motion planning using the prediction time domain.

例如，规划单元1030利用预测时域进行运动规划的操作包括如下步骤。For example, the operation of the planning unit 1030 to perform motion planning using the prediction time domain includes the following steps.

将预测时域作为超参数，对动态障碍物的运动轨迹进行预测；根据驾驶环境信息中包括的静态障碍物的位置信息，以及所预测的动态障碍物的运动轨迹，规划自动驾驶车辆的运动轨迹。Use the prediction time domain as a hyperparameter to predict the movement trajectory of dynamic obstacles; plan the movement trajectory of the autonomous vehicle based on the position information of static obstacles included in the driving environment information and the predicted movement trajectory of dynamic obstacles. .

如图10所示，该装置1000还可以包括控制单元1040，用于控制自动驾驶车辆按照运动规划得到的运动轨迹进行行驶。As shown in Figure 10, the device 1000 may also include a control unit 1040 for controlling the autonomous vehicle to drive according to the movement trajectory obtained by the movement planning.

例如，预测单元1020、规划单元1030与控制单元1040可以通过处理器实现。获取单元1010可以通过通信接口实现。For example, the prediction unit 1020, the planning unit 1030 and the control unit 1040 may be implemented by a processor. The acquisition unit 1010 can be implemented through a communication interface.

如图11所示，本申请实施例还提供一种数据处理的装置1100，装置1100用于执行上文方法实施例中的方法600。装置1100包括获取单元1110与训练单元1120。As shown in Figure 11, this embodiment of the present application also provides a data processing device 1100. The device 1100 is used to execute the method 600 in the above method embodiment. The device 1100 includes an acquisition unit 1110 and a training unit 1120.

获取单元1110用于根据强化学习网络模型与自动驾驶的驾驶环境交互获得的数据，获得强化学习网络模型的训练数据。The acquisition unit 1110 is configured to obtain training data for the reinforcement learning network model based on data obtained through interaction between the reinforcement learning network model and the driving environment of the autonomous driving.

训练单元1120用于利用训练数据，对强化学习网络模型进行强化学习的训练，以获得训练后的强化学习网络模型。其中，强化学习网络模型的输入为驾驶环境信息，强化学习网络模型的输出为预测时域，预测时域表示对自动驾驶的动态障碍物进行运动轨迹预测的时长或者步数。The training unit 1120 is used to perform reinforcement learning training on the reinforcement learning network model using training data to obtain a trained reinforcement learning network model. Among them, the input of the reinforcement learning network model is driving environment information, and the output of the reinforcement learning network model is the prediction time domain. The prediction time domain represents the duration or number of steps for predicting the motion trajectory of dynamic obstacles in autonomous driving.

例如，获取单元1110用于通过如图7所示的步骤S611至步骤S614获得训练数据中的一组样本<状态s，动作a，奖励r>。参见上文描述，这里不再赘述。For example, the acquisition unit 1110 is used to obtain a set of samples <state s, action a, reward r> in the training data through steps S611 to S614 as shown in FIG. 7 . See the above description and will not go into details here.

如图12所示，本申请实施例还提供一种数据处理的装置1200。该装置1200包括处理器1210，处理器1210与存储器1220耦合，存储器1220用于存储计算机程序或指令，处理器1210用于执行存储器1220存储的计算机程序或指令，使得上文方法实施例中的方法被执行。As shown in Figure 12, this embodiment of the present application also provides a data processing device 1200. The device 1200 includes a processor 1210. The processor 1210 is coupled to a memory 1220. The memory 1220 is used to store computer programs or instructions. The processor 1210 is used to execute the computer programs or instructions stored in the memory 1220, so that the method in the above method embodiment be executed.

可选地，如图12所示，该装置1200还可以包括存储器1220。Optionally, as shown in Figure 12, the device 1200 may also include a memory 1220.

可选地，如图12所示，该装置1200还可以包括数据接口1230，数据接口1230用于与外界进行数据的传输。Optionally, as shown in Figure 12, the device 1200 may also include a data interface 1230, which is used to transmit data with the outside world.

可选地，作为一种方案，该装置1200用于实现上文实施例中的方法400。Optionally, as a solution, the device 1200 is used to implement the method 400 in the above embodiment.

可选地，作为另一种方案，该装置1200用于实现上文实施例中的方法500。Optionally, as another solution, the device 1200 is used to implement the method 500 in the above embodiment.

可选地，作为又一种方案，该装置1200用于实现上文实施例中的方法600。Optionally, as another solution, the device 1200 is used to implement the method 600 in the above embodiment.

本申请实施例还提供一种自动驾驶车辆，包括如图9所示的数据处理的装置900或如图10所示的数据处理的装置1000。An embodiment of the present application also provides an autonomous vehicle, including a data processing device 900 as shown in FIG. 9 or a data processing device 1000 as shown in FIG. 10 .

可选地，该自动驾驶车辆还包括如图11所示的数据处理的装置1100。Optionally, the autonomous vehicle further includes a data processing device 1100 as shown in Figure 11.

本申请实施例还提供一种自动驾驶车辆，包括如图12所示的数据处理的装置1200。An embodiment of the present application also provides an autonomous vehicle, including a data processing device 1200 as shown in Figure 12.

本申请实施例还提供一种计算机可读介质，该计算机可读介质存储用于设备执行的程序代码，该程序代码包括用于执行上述实施例的方法。Embodiments of the present application also provide a computer-readable medium that stores program code for device execution, and the program code includes the method for executing the above embodiments.

本申请实施例还提供一种包含指令的计算机程序产品，当该计算机程序产品在计算机上运行时，使得计算机执行上述实施例的方法。Embodiments of the present application also provide a computer program product containing instructions, which when the computer program product is run on a computer, causes the computer to execute the method of the above embodiments.

本申请实施例还提供一种芯片，该芯片包括处理器与数据接口，处理器通过数据接口读取存储器上存储的指令，执行上述实施例的方法。An embodiment of the present application also provides a chip. The chip includes a processor and a data interface. The processor reads instructions stored in the memory through the data interface and executes the method of the above embodiment.

可选地，作为一种实现方式，该芯片还可以包括存储器，存储器中存储有指令，处理器用于执行存储器上存储的指令，当指令被执行时，处理器用于执行上述实施例中的方法。Optionally, as an implementation manner, the chip may also include a memory, in which instructions are stored. The processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is configured to execute the method in the above embodiment.

图13为本申请实施例提供的一种芯片硬件结构，该芯片上包括神经网络处理器1300。该芯片可以被设置在如下任一种或多种装置中：Figure 13 is a chip hardware structure provided by an embodiment of the present application. The chip includes a neural network processor 1300. The chip can be installed in any one or more of the following devices:

如图9所示的装置900、如图10所示的装置1000、如图11中所示的装置1100、如图12所示的装置1200。The device 900 shown in FIG. 9 , the device 1000 shown in FIG. 10 , the device 1100 shown in FIG. 11 , and the device 1200 shown in FIG. 12 .

上文方法实施例中的方法400、500或600均可在如图13所示的芯片中得以实现。Methods 400, 500 or 600 in the above method embodiments can all be implemented in the chip as shown in Figure 13.

神经网络处理器1300作为协处理器挂载到主处理器(Host CPU)上，由主CPU分配任务。神经网络处理器1300的核心部分为运算电路1303，控制器1304控制运算电路1303获取存储器(权重存储器1302或输入存储器1301)中的数据并进行运算。The neural network processor 1300 is mounted on the main processor (Host CPU) as a co-processor, and the main CPU allocates tasks. The core part of the neural network processor 1300 is the arithmetic circuit 1303. The controller 1304 controls the arithmetic circuit 1303 to obtain data in the memory (weight memory 1302 or input memory 1301) and perform operations.

在一些实现中，运算电路1303内部包括多个处理单元(process engine，PE)。在一些实现中，运算电路1303是二维脉动阵列。运算电路1303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中，运算电路1303是通用的矩阵处理器。In some implementations, the computing circuit 1303 includes multiple processing units (process engines, PEs) internally. In some implementations, arithmetic circuit 1303 is a two-dimensional systolic array. The arithmetic circuit 1303 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1303 is a general-purpose matrix processor.

举例来说，假设有输入矩阵A，权重矩阵B，输出矩阵C。运算电路1303从权重存储器1302中取矩阵B相应的数据，并缓存在运算电路1303中每一个PE上。运算电路1303从输入存储器1301中取矩阵A数据与矩阵B进行矩阵运算，得到的矩阵的部分结果或最终结果，保存在累加器(accumulator)1308中。For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit 1303 obtains the corresponding data of matrix B from the weight memory 1302 and caches it on each PE in the operation circuit 1303. The operation circuit 1303 obtains the matrix A data from the input memory 1301 and performs matrix operation on the matrix B, and stores the partial result or final result of the matrix in an accumulator (accumulator) 1308.

向量计算单元1307可以对运算电路1303的输出做进一步处理，如向量乘，向量加，指数运算，对数运算，大小比较等等。例如，向量计算单元1307可以用于神经网络中非卷积/非FC层的网络计算，如池化(pooling)，批归一化(batch normalization)，局部响应归一化(local response normalization)等。The vector calculation unit 1307 can further process the output of the operation circuit 1303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. For example, the vector calculation unit 1307 can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc. .

在一些实现种，向量计算单元能1307将经处理的输出的向量存储到统一存储器(也可称为统一缓存器)1306。例如，向量计算单元1307可以将非线性函数应用到运算电路1303的输出，例如累加值的向量，用以生成激活值。在一些实现中，向量计算单元1307生成归一化的值、合并值，或二者均有。在一些实现中，处理过的输出的向量能够用作到运算电路1303的激活输入，例如用于在神经网络中的后续层中的使用。In some implementations, the vector calculation unit can 1307 store the processed output vector into a unified memory (which may also be referred to as a unified buffer) 1306 . For example, the vector calculation unit 1307 may apply a nonlinear function to the output of the operation circuit 1303, such as a vector of accumulated values, to generate an activation value. In some implementations, vector calculation unit 1307 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1303, such as for use in a subsequent layer in a neural network.

上文方法实施例中的方法400、500或600可以由1303或1307执行。Method 400, 500 or 600 in the above method embodiment may be executed by 1303 or 1307.

统一存储器1306用于存放输入数据以及输出数据。The unified memory 1306 is used to store input data and output data.

可以通过存储单元访问控制器1305(direct memory access controller，DMAC)将外部存储器中的输入数据搬运到输入存储器1301和/或统一存储器1306、将外部存储器中的权重数据存入权重存储器1302，以及将统一存储器1306中的数据存入外部存储器。The input data in the external memory can be transferred to the input memory 1301 and/or the unified memory 1306 through the storage unit access controller 1305 (direct memory access controller, DMAC), the weight data in the external memory can be stored in the weight memory 1302, and the weight data in the external memory can be stored in the weight memory 1302. The data in unified memory 1306 is stored in external memory.

总线接口单元(bus interface unit，BIU)1310，用于通过总线实现主CPU、DMAC和取指存储器1309之间进行交互。A bus interface unit (BIU) 1310 is used to implement interaction between the main CPU, the DMAC and the fetch memory 1309 through the bus.

与控制器1304连接的取指存储器(instruction fetch buffer)1309，用于存储控制器1304使用的指令；An instruction fetch buffer 1309 connected to the controller 1304 is used to store instructions used by the controller 1304;

控制器1304，用于调用指存储器1309中缓存的指令，实现控制该运算加速器的工作过程。The controller 1304 is used to call instructions cached in the memory 1309 to control the working process of the computing accelerator.

一般地，统一存储器1306，输入存储器1301，权重存储器1302以及取指存储器1309均为片上(On-Chip)存储器，外部存储器为该NPU外部的存储器，该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random accessmemory，DDR SDRAM)、高带宽存储器(high bandwidth memory，HBM)或其他可读可写的存储器。Generally, the unified memory 1306, the input memory 1301, the weight memory 1302 and the instruction memory 1309 are all on-chip memories, and the external memory is a memory external to the NPU. The external memory can be double data rate synchronous dynamic random access. Memory (double data rate synchronous dynamic random access memory, DDR SDRAM), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.

除非另有定义，本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中在本申请的说明书中所使用的术语只是为了描述具体的实施例的目的，不是旨在于限制本申请。Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing specific embodiments only and is not intended to limit the application.

需要说明的是，本文中涉及的第一、第二、第三或第四等各种数字编号仅为描述方便进行的区分，并不用来限制本申请实施例的范围。It should be noted that the first, second, third or fourth numerical numbers involved in this article are only for convenience of description and are not used to limit the scope of the embodiments of the present application.

本领域普通技术人员可以意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。Those of ordinary skill in the art will appreciate that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented with electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each specific application, but such implementations should not be considered beyond the scope of this application.

所属领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统、装置和单元的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that for the convenience and simplicity of description, the specific working processes of the systems, devices and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be described again here.

在本申请所提供的几个实施例中，应该理解到，所揭露的系统、装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented. On the other hand, the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。In addition, each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.

所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：通用串行总线闪存盘(USB flash disk，UFD)(UFD也可以简称为U盘或者优盘)、移动硬盘、只读存储器(read-only memory，ROM)、随机存取存储器(randomaccess memory，RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application is essentially or the part that contributes to the existing technology or the part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium, including Several instructions are used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in various embodiments of this application. The aforementioned storage media include: Universal Serial Bus flash disk (UFD) (UFD can also be referred to as U disk or USB flash drive), mobile hard disk, read-only memory (ROM), random access Various media that can store program code, such as random access memory (RAM), magnetic disks, or optical disks.

以上所述，仅为本申请的具体实施方式，但本申请的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本申请揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本申请的保护范围之内。因此，本申请的保护范围应以所述权利要求的保护范围为准。The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the present application. should be covered by the protection scope of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

1. A method of motion planning, comprising:

acquiring driving environment information, wherein the driving environment information comprises position information of dynamic obstacles;

inputting the state representation of the driving environment information into a trained reinforcement learning network model, and obtaining a prediction time domain output by the reinforcement learning network model, wherein the prediction time domain represents the duration or the number of steps of motion trail prediction of the dynamic obstacle, the prediction time domain is determined in real time by the reinforcement learning network model according to the driving environment information, the driving environment information is the state information of the reinforcement learning network model, and the prediction time domain is the action information of the reinforcement learning network model;

and performing motion planning by using the prediction time domain.

2. The method of claim 1, wherein said utilizing said prediction horizon for motion planning comprises:

Taking the prediction time domain as a super parameter, and predicting the motion trail of the dynamic obstacle;

and planning the movement track of the automatic driving vehicle according to the position information of the static obstacle and the predicted movement track of the dynamic obstacle, which are included in the driving environment information.

3. The method according to claim 1 or 2, further comprising:

and controlling the automatic driving vehicle to run according to the motion trail obtained by the motion planning.

4. A method of data processing, comprising:

obtaining training data of the reinforcement learning network model according to data obtained by interaction between the reinforcement learning network model and an automatic driving environment;

training the reinforcement learning network model by utilizing the training data to obtain the trained reinforcement learning network model,

the method comprises the steps that input of a reinforcement learning network model is driving environment information, output of the reinforcement learning network model is prediction time domain, the prediction time domain represents duration or step number of motion track prediction of an automatic driving dynamic obstacle, the prediction time domain is determined in real time by the reinforcement learning network model according to the driving environment information, the driving environment information is state information of the reinforcement learning network model, and the prediction time domain is action information of the reinforcement learning network model.

5. The method of claim 4, wherein the obtaining training data for the reinforcement learning network model based on data obtained from interaction of the reinforcement learning network model with a driving environment of an autopilot comprises:

a set of samples < state s, action a, reward r > in the training data is obtained by:

acquiring driving environment information, wherein the driving environment information is used as the state s, and the driving environment information comprises the position information of dynamic obstacles;

inputting the state s into a reinforcement learning network model to be trained, obtaining a prediction time domain output by the reinforcement learning network model, and taking the prediction time domain as the action a, wherein the prediction time domain represents the duration or the step number of motion trail prediction of the dynamic obstacle;

performing motion planning by utilizing the prediction horizon to obtain a motion trail of the automatic driving vehicle;

and obtaining the rewards r by controlling the automatic driving vehicle to run according to the movement track of the automatic driving vehicle.

6. The method of claim 5, wherein the obtaining the prize r comprises:

calculating the reward r according to a reward function, wherein the reward function takes into account any one or more of the following factors:

Driving safety, traffic efficiency of an automatically driven vehicle, traffic efficiency of other traffic participants.

7. An apparatus for motion planning, comprising:

an acquisition unit configured to acquire driving environment information including position information of a dynamic obstacle;

the prediction unit is used for inputting the state representation of the driving environment information into the trained reinforcement learning network model, obtaining a prediction time domain output by the reinforcement learning network model, wherein the prediction time domain represents the duration or the step number of the motion track prediction of the dynamic obstacle, the prediction time domain is determined in real time by the reinforcement learning network model according to the driving environment information, the driving environment information is the state information of the reinforcement learning network model, and the prediction time domain is the action information of the reinforcement learning network model;

and the planning unit is used for planning the motion by utilizing the prediction time domain.

8. The apparatus of claim 7, wherein the planning unit is configured to:

9. The apparatus according to claim 7 or 8, further comprising:

and the control unit is used for controlling the automatic driving vehicle to run according to the motion track obtained by the motion planning.

10. An apparatus for data processing, comprising:

the system comprises an acquisition unit, a control unit and a control unit, wherein the acquisition unit is used for acquiring training data of a reinforcement learning network model according to data obtained by interaction between the reinforcement learning network model and an automatic driving environment;

a training unit for training the reinforcement learning network model by using the training data to obtain the reinforcement learning network model after training,

11. The apparatus according to claim 10, wherein the acquisition unit is configured to obtain the set of samples < state s, action a, reward r > in the training data by:

12. The apparatus of claim 11, wherein the obtaining unit is configured to calculate the reward r according to a reward function, wherein the reward function takes into account any one or more of the following factors:

13. An autonomous vehicle, comprising:

an apparatus for motion planning according to any of claims 7-9.

14. The autonomous vehicle of claim 13, further comprising:

apparatus for data processing according to any of claims 10 to 12.

15. An apparatus for data processing, comprising:

a memory for storing executable instructions;

a processor for invoking and executing said executable instructions in said memory to perform the method of any of claims 1-6.

16. A computer readable storage medium, characterized in that it has stored therein program instructions which, when executed by a processor, implement the method of any of claims 1 to 6.

17. A computer program product, characterized in that it comprises a computer program code for implementing the method according to any of claims 1 to 6 when said computer program code is run on a computer.