WO2022052406A1 - Automatic driving training method, apparatus and device, and medium - Google Patents
Automatic driving training method, apparatus and device, and medium
- Publication number
- WO2022052406A1 (PCT/CN2021/073449)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- automatic driving
- policy
- structured noise
- network
- historical data
- Prior art date
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0238—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
- G05D1/024—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0223—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0246—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
- G05D1/0253—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting relative motion information from a plurality of images taken successively, e.g. visual odometry, optical flow
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0257—Control of position or course in two dimensions specially adapted to land vehicles using a radar
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
- G05D1/0278—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle using satellite positioning signals, e.g. GPS
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B9/00—Simulators for teaching or training purposes
- G09B9/02—Simulators for teaching or training purposes for teaching control of vehicles or other craft
- G09B9/04—Simulators for teaching or training purposes for teaching control of vehicles or other craft for teaching control of land vehicles
Definitions
- the present application relates to the technical field of automatic driving, and in particular, to an automatic driving training method, device, equipment and medium.
- FIG. 1 is a schematic diagram of a control architecture of an automatic driving vehicle provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of a modular method in the prior art provided by this application. The automatic driving system is decomposed into several independent but interrelated modules, such as Perception, Localization, Planning and Control modules, which have good interpretability and allow the problem module to be located quickly when the system fails. It is a conventional method widely used in the industry at this stage.
- FIG. 3 is a schematic diagram of an end-to-end method in the prior art provided by this application.
- the end-to-end method regards the automatic driving problem as a machine learning problem, and directly optimizes the entire "sensor data processing - generating control commands - executing commands" pipeline.
- the end-to-end method is simple to build and has achieved rapid development in the field of autonomous driving, but the method itself is also a "black box" with poor interpretability.
- FIG. 4 is a schematic diagram of an Open-loop imitation learning method in the prior art provided by this application.
- the Open-loop imitation learning method learns to drive autonomously in a supervised manner by imitating the behavior of human drivers, emphasizing a kind of "predictive ability".
- FIG. 5 is a schematic diagram of a Closed-loop reinforcement learning method in the prior art provided by this application.
- the Closed-loop reinforcement learning method uses the Markov Decision Process (MDP) to explore and improve automatic driving strategies from scratch, emphasizing a kind of "driving ability".
- Reinforcement learning (RL) is a type of machine learning method that has developed rapidly in recent years, in which the agent-environment interaction mechanism and the sequential decision-making mechanism are close to the process of human learning; it is therefore also regarded as a key step towards Artificial General Intelligence (AGI).
- the deep reinforcement learning (DRL) algorithm, which combines deep learning (DL), can automatically learn abstract representations of large-scale input data and offers better decision-making performance; it has been widely used in video games, mechanical control, advertising recommendation, financial transactions, urban transportation and other fields.
- When DRL is applied to autonomous driving problems, it does not require domain expert knowledge, nor does it need to build models. It has a wide range of adaptability and can cope with changing and complex road environments.
- However, when DRL-based autonomous vehicles learn autonomous driving from scratch, selecting poor actions in the sequential decision-making process leads to a large training variance, which manifests as unstable driving of the vehicle, and even accidents such as running out of the lane and collisions.
- the existing research results show that, compared with the modular method and the Open-loop imitation learning method, the DRL-based autonomous driving training has the worst stability and is very sensitive to changes in the environment and weather.
- the purpose of the present application is to provide an automatic driving training method, device, equipment and medium, which can improve the stability of automatic driving training, thereby reducing the probability of occurrence of dangerous accidents. Its specific plan is as follows:
- an automatic driving training method including:
- the structured noise is the structured noise determined based on historical data
- the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information
- the policy network parameters are updated using the policy gradient algorithm.
- the automatic driving training method further includes:
- the corresponding pre-training data is stored in the playback buffer, and the data stored in the playback buffer is used as the historical data.
- the updating of the evaluation network parameters through back-propagation operation based on the reward includes:
- the use of the policy gradient algorithm to update the policy network parameters includes:
- the policy gradient operation is performed using the value function of the evaluation network and the current policy of the policy network, and the policy network parameters are updated.
- the automatic driving training method further includes:
- the structured noise is precomputed.
- the pre-calculating the structured noise includes:
- the structured noise corresponding to the minibatch is calculated using all the Gaussian factors.
- the pre-calculating the structured noise includes:
- an automatic driving training device comprising:
- the data acquisition module is used to acquire the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
- an action determination module configured to use the traffic environment state and the structured noise to determine a corresponding execution action through a policy network
- an action control module configured to control the autonomous driving vehicle to execute the execution action
- a strategy evaluation module configured to evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward
- an evaluation network update module for updating the evaluation network parameters through back-propagation operation based on the return
- the policy network update module is used to update the policy network parameters using the policy gradient algorithm.
- an automatic driving training device including a processor and a memory; wherein,
- the memory for storing computer programs
- the processor is configured to execute the computer program to implement the aforementioned automatic driving training method.
- the present application discloses a computer-readable storage medium for storing a computer program, wherein when the computer program is executed by a processor, the aforementioned automatic driving training method is implemented.
- the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; a corresponding execution action is then determined through the policy network using the traffic environment state and the structured noise; the autonomous driving vehicle is then controlled to perform the execution action, and the policy of the policy network is evaluated by the evaluation network according to the execution action to obtain a corresponding reward; the evaluation network parameters are then updated through a back-propagation operation based on the reward, and the policy network parameters are updated using the policy gradient algorithm.
- in this way, structured noise based on historical data is introduced during automatic driving training, and since the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
- FIG. 1 is a schematic diagram of a control architecture of an autonomous driving vehicle provided by the present application
- FIG. 2 is a schematic diagram of a modular method in the prior art
- FIG. 3 is a schematic diagram of an end-to-end method in the prior art
- FIG. 4 is a schematic diagram of an imitation learning method of Open-loop in the prior art
- FIG. 5 is a schematic diagram of a closed-loop reinforcement learning method in the prior art
- FIG. 10 is a schematic structural diagram of an automatic driving training device disclosed in the application.
- FIG. 11 is a structural diagram of an automatic driving training device disclosed in this application.
- When DRL is applied to autonomous driving problems, it does not require domain expert knowledge, nor does it need to build models. It has a wide range of adaptability and can cope with changing and complex road environments.
- However, when DRL-based autonomous vehicles learn autonomous driving from scratch, selecting poor actions in the sequential decision-making process leads to a large training variance, which manifests as unstable driving of the vehicle, and even accidents such as running out of the lane and collisions.
- the existing research results show that, compared with the modular method and the Open-loop imitation learning method, the DRL-based autonomous driving training has the worst stability and is very sensitive to changes in the environment and weather. Therefore, the present application provides an automatic driving training solution, which can improve the stability of automatic driving training, thereby reducing the probability of occurrence of dangerous accidents.
- an embodiment of the present application discloses an automatic driving training method, including:
- Step S11 Obtain the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information.
- the sequential decision-making process of the DRL-based automatic driving system is: the automatic driving vehicle (i.e., the agent) observes the state S_t of the environment at time t, such as the position, speed and acceleration of itself and other traffic participants, traffic lights, road topology features and other information; uses a nonlinear neural network (NN) to represent the policy π_θ; and selects a vehicle action a_t, such as acceleration/deceleration, steering, lane change, braking, etc.
- the environment calculates the reward r_{t+1} according to the action a_t taken by the autonomous driving vehicle, combined with set benchmarks such as the average driving speed of the autonomous driving vehicle, distance from the lane center, running a red light, and collisions, and enters a new state S_{t+1}.
- the self-driving vehicle adjusts the policy π_θ according to the reward r_{t+1} obtained, and enters the next decision-making step in combination with the new state S_{t+1}.
- DRL-based autonomous driving research applications mostly use algorithms that can deal with continuous action spaces, such as Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO) and the more recent Proximal Policy Optimization (PPO).
- DRL and structured noise can be fused to make automatic driving decisions.
- this embodiment can use the DDPG algorithm with high sample efficiency and computational efficiency.
- Alternatively, the Asynchronous Advantage Actor-Critic algorithm (A3C), the Twin Delayed Deep Deterministic Policy Gradient algorithm (TD3), or the Soft Actor-Critic algorithm (SAC) may also be used.
- this embodiment can acquire traffic environment state data collected by vehicle sensors.
- the driving environment status includes, for example, weather data
- the driving environment status can be obtained with the help of on-board sensor devices such as cameras, GPS (Global Positioning System), IMU (Inertial Measurement Unit), millimeter-wave radar and LiDAR, and includes information such as traffic lights, road topology, and the positions and running states of the autonomous driving vehicle and other traffic participants
- the traffic environment status in this embodiment includes not only the raw image data obtained directly by the camera, but also data processed by deep learning models, such as depth maps and semantic segmentation maps produced by RefineNet.
- Step S12 Determine a corresponding execution action by using the traffic environment state and the structured noise through a policy network.
- the policy network (Actor Net) selects the action a_t based on the policy function π_θ(a|s, z), where θ is the network parameter of Actor Net.
- Step S13 Control the automatic driving vehicle to execute the execution action.
- Step S14 Evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward.
- the evaluation network (Critic Net) evaluates Actor Net's policy based on the value function Q_ω(s, a, z) according to the action a_t performed by the autonomous vehicle, and obtains the reward r_{t+1} given by the traffic environment, where ω is the network parameter of Critic Net.
- the value function Q ⁇ (s, a, z) is transformed from the preset reward function.
- the reward function r t for studying the automatic driving problem can also be designed in advance.
- the reward function of the autonomous driving vehicle can be designed in different forms .
- the reward function can be designed as:
- v is the driving speed of the autonomous vehicle
- v ref is the reference speed set according to the road speed limit
- ⁇ is a coefficient set manually.
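- As an illustration only (the reward formula itself appears as an image in the original and is not reproduced here), a speed-tracking reward of the kind described above, extended with penalty terms for the benchmarks mentioned earlier (lane departure, collision), could be sketched as follows; the coefficient name xi, the penalty values and the function signature are assumptions, not taken from this application:

```python
def reward(v, v_ref, off_lane, collided, xi=0.1):
    """Illustrative sketch of a reward: encourage driving near the reference speed
    v_ref set from the road speed limit, and penalize lane departure and collisions.
    All weights here are assumed values, not the patent's."""
    r = -xi * abs(v - v_ref)   # closer to the reference speed -> higher (less negative) reward
    if off_lane:
        r -= 1.0               # assumed penalty for leaving the lane center
    if collided:
        r -= 10.0              # assumed large penalty for a collision
    return r
```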
- the value function can be calculated by the reward function in the form:
- γ ∈ (0, 1] is the discount factor.
- structured noise is introduced, and the corresponding value function is Q_ω(s, a, z), where E denotes the expectation operation.
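- The value function itself is not reproduced as a formula in this text; under the usual discounted-return definition it takes the following standard form (a reconstruction from the quantities defined above, with the structured noise z added to the conditioning as described):

```latex
Q_{\omega}(s,a,z) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k+1} \,\middle|\, s_{t}=s,\; a_{t}=a,\; z_{t}=z\right],\qquad \gamma\in(0,1].
```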
- Step S15 Update evaluation network parameters through back-propagation operation based on the reward.
- a back-propagation operation for the evaluation network loss function is performed based on the reward, and the evaluation network parameters are updated in a single step. Specifically, through the back-propagation operation, the loss function of the evaluation network is minimized, and the network parameter ω is updated in a single step.
- the evaluation network loss function is:
- y_t = r_{t+1} + γQ′_ω(s_{t+1}, a_{t+1}, z_{t+1}).
- Q′_ω(s_{t+1}, a_{t+1}, z_{t+1}) and Q_ω(s_t, a_t, z_t) are the value functions of the target network and the prediction network, respectively.
- N is the number of samples collected, and γ ∈ (0, 1] is a discount factor.
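- The evaluation network loss itself appears as an image in the original; with the target value y_t and the N collected samples defined above, the standard DDPG-style mean-squared TD error would read (a reconstruction, not the literal expression of this application):

```latex
L(\omega) \;=\; \frac{1}{N}\sum_{t=1}^{N}\bigl(y_{t}-Q_{\omega}(s_{t},a_{t},z_{t})\bigr)^{2},
\qquad y_{t}=r_{t+1}+\gamma\, Q'_{\omega}(s_{t+1},a_{t+1},z_{t+1}).
```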
- the target network and the prediction network are neural networks designed based on the DQN (i.e., Deep Q-Network, a deep value-function neural network) algorithm.
- Step S16 Use the policy gradient algorithm to update the policy network parameters.
- this embodiment may use the value function of the evaluation network and the current strategy of the strategy network to perform a strategy gradient operation, and update the strategy network parameters.
- this embodiment updates the network parameter θ of Actor Net through the following policy gradient:
- J(θ) is the objective function of the policy gradient method, usually expressed in some form of reward; the first gradient term is obtained by differentiating the value function of Critic Net with respect to the action a, and the second is obtained by differentiating the policy of Actor Net at the current step.
- the task of the policy gradient method is to maximize the objective function, which is achieved by gradient ascent.
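- The gradient expression itself is likewise not reproduced here; the standard deterministic policy gradient used by DDPG, extended with the structured noise z as in the value function above, would read (a reconstruction consistent with the description of the two derivative terms):

```latex
\nabla_{\theta}J(\theta)\;\approx\;\frac{1}{N}\sum_{t=1}^{N}
\nabla_{a}Q_{\omega}(s,a,z)\big|_{s=s_{t},\,a=\pi_{\theta}(s_{t},z_{t})}\;
\nabla_{\theta}\pi_{\theta}(s,z)\big|_{s=s_{t},\,z=z_{t}}.
```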
- FIG. 7 is a schematic diagram of an automatic driving training disclosed in the present application.
- the DDPG algorithm is used to train vehicles for autonomous driving.
- the DDPG algorithm is a typical Actor-Critic reinforcement learning algorithm.
- the policy network (Actor Net) updates the policy according to the value function fed back by the evaluation network (Critic Net), while Critic Net trains the value function and uses the temporal-difference (TD) method for single-step updates.
- Critic Net includes a target network (Target Net) and a prediction network (Pred Net) designed based on the DQN algorithm, and the value functions of the two networks are used when the network parameters are updated.
- the Actor Net and Critic Net work together to maximize the cumulative reward for the actions chosen by the agent.
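- Putting the pieces of FIG. 7 together, a single update step of the described Actor-Critic training with structured noise might look like the following sketch; this is a minimal PyTorch-style illustration in which the network sizes, learning rates and the use of the Adam optimizer are assumptions, not the implementation of this application:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NOISE_DIM = 32, 2, 8      # assumed dimensions of s, a and z
GAMMA = 0.99                                      # discount factor in (0, 1]

# Actor Net pi_theta(s, z) -> a and Critic Net Q_omega(s, a, z) -> scalar value
actor = nn.Sequential(nn.Linear(STATE_DIM + NOISE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + NOISE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
critic_target = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + NOISE_DIM, 64), nn.ReLU(),
                              nn.Linear(64, 1))
critic_target.load_state_dict(critic.state_dict())   # Target Net starts as a copy of Pred Net

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update_step(s, a, r, s_next, z, z_next):
    """One single-step (TD) update of Critic Net followed by a policy-gradient update of Actor Net."""
    # Critic Net: minimize (y_t - Q(s_t, a_t, z_t))^2 through back-propagation
    with torch.no_grad():
        a_next = actor(torch.cat([s_next, z_next], dim=-1))
        y = r + GAMMA * critic_target(torch.cat([s_next, a_next, z_next], dim=-1))
    q = critic(torch.cat([s, a, z], dim=-1))
    critic_loss = ((y - q) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor Net: gradient ascent on Q(s, pi_theta(s, z), z), i.e. minimize its negative
    a_pred = actor(torch.cat([s, z], dim=-1))
    actor_loss = -critic(torch.cat([s, a_pred, z], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with random tensors standing in for a sampled minibatch of N = 16 transitions:
N = 16
update_step(torch.randn(N, STATE_DIM), torch.randn(N, ACTION_DIM), torch.randn(N, 1),
            torch.randn(N, STATE_DIM), torch.randn(N, NOISE_DIM), torch.randn(N, NOISE_DIM))
```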
- the embodiment of the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; a corresponding execution action is then determined through the policy network using the traffic environment state and the structured noise; the autonomous driving vehicle is then controlled to perform the execution action, and the policy of the policy network is evaluated by the evaluation network according to the execution action to obtain a corresponding reward; the evaluation network parameters are then updated through a back-propagation operation based on the reward, and the policy network parameters are updated using the policy gradient algorithm.
- in this way, structured noise based on historical data is introduced during automatic driving training, and since the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
- an embodiment of the present application discloses a specific automatic driving training method, including:
- Step S21 Use the DQN algorithm to pre-train the autonomous vehicle.
- Step S22 Store the corresponding pre-training data in the playback buffer, and use the data stored in the playback buffer as the historical data.
- the classical DQN algorithm is used to pre-train the vehicle for automatic driving, and the playback buffer data B is accumulated.
- in the classic DQN method, two neural networks with the same structure but different parameters are constructed, namely the target network (Target Net), which updates its parameters at a certain interval, and the prediction network (Pred Net), which updates its parameters at every step.
- the action space of the autonomous vehicle at each time t is [a_{t1}, a_{t2}, a_{t3}], which represent "lane change to the left", "lane change to the right" and "keep current lane".
- Both Target Net and Pred Net use a simple 3-layer neural network with only one hidden layer in the middle.
- the traffic environment state S_t collected by the vehicle sensor device is input, the target value Q_target and the predicted value Q_pred are computed, and the action a_t corresponding to the largest Q_pred is selected as the driving action of the autonomous vehicle.
- the network parameters are updated to minimize the loss function using the RMSProp optimizer, and the self-driving vehicle is continuously pre-trained until sufficient playback buffer data B is accumulated.
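- A minimal sketch of this DQN pre-training stage is given below; it is written with PyTorch for illustration, and the layer sizes, buffer capacity and hyperparameters are assumptions rather than values specified by this application (the environment-interaction loop that fills the buffer is omitted):

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 32, 3          # 3 discrete actions: change lane left / right, keep lane
GAMMA = 0.99
replay_buffer = deque(maxlen=100_000) # playback buffer B of (s, a, r, s_next) tensor tuples

def make_net():
    # Simple 3-layer network with a single hidden layer, used for both Target Net and Pred Net
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

pred_net, target_net = make_net(), make_net()
target_net.load_state_dict(pred_net.state_dict())                # Target Net updated only at intervals
optimizer = torch.optim.RMSprop(pred_net.parameters(), lr=1e-3)  # RMSProp optimizer

def select_action(state):
    """Pick the driving action with the largest predicted value Q_pred."""
    with torch.no_grad():
        return int(pred_net(state).argmax())

def dqn_update(batch_size=32):
    """Single-step update of Pred Net towards the target value Q_target."""
    if len(replay_buffer) < batch_size:
        return
    s, a, r, s_next = map(torch.stack, zip(*random.sample(replay_buffer, batch_size)))
    q_pred = pred_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_target = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = torch.nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```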
- Step S23 Calculate the structured noise.
- a preset number of data items can be randomly extracted from the historical data to obtain a corresponding minibatch (i.e., a small batch of data); the Gaussian factor of each piece of historical data in the minibatch is calculated; and the structured noise corresponding to the minibatch is calculated using all the Gaussian factors.
- this embodiment can also randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of the historical data in each of the minibatches, and then use all the Gaussian factors corresponding to each minibatch to calculate the structured noise corresponding to that minibatch.
- multiple structured noises can be calculated by using multiple minibatches, so that during automatic driving training, different structured noises can be used for training to improve the robustness of automatic driving.
- the Gaussian factor of each sampled historical data c_n is a Gaussian over the latent variable z, namely N(μ_n, σ_n), where N represents the Gaussian distribution;
- the Gaussian factor of the historical data c_n is parameterized by a neural network f applied to c_n, whose outputs give μ_n and σ_n, where φ is the parameter of the neural network f;
- the latent variable z is computed to obtain a probabilistic representation, namely the structured noise;
- the distribution of the structured noise z given all the sampled historical data c_{1:N} is obtained by accumulating the Gaussian factors.
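- The following is a minimal sketch of how such Gaussian factors could be accumulated into the structured noise z; the combination rule shown (a precision-weighted product of independent Gaussians, the usual way to accumulate Gaussian factors) and the shape of the encoder network f are assumptions for illustration, not the literal formulation of this application:

```python
import torch
import torch.nn as nn

CONTEXT_DIM, LATENT_DIM = 38, 8               # assumed sizes of one historical record c_n and of z
f = nn.Sequential(nn.Linear(CONTEXT_DIM, 64), nn.ReLU(),
                  nn.Linear(64, 2 * LATENT_DIM))  # network f (parameter phi) outputs (mu_n, raw sigma_n)

def structured_noise(minibatch):
    """minibatch: tensor of shape (N, CONTEXT_DIM) holding the sampled historical data c_1..c_N.
    Returns one sample z of the structured noise obtained by accumulating the Gaussian factors."""
    out = f(minibatch)
    mu_n = out[:, :LATENT_DIM]
    sigma_n = torch.nn.functional.softplus(out[:, LATENT_DIM:]) + 1e-6   # keep std positive
    prec = 1.0 / sigma_n.pow(2)                 # precision of each factor N(mu_n, sigma_n^2)
    var = 1.0 / prec.sum(dim=0)                 # variance of the accumulated (product) Gaussian
    mu = var * (prec * mu_n).sum(dim=0)         # mean of the accumulated Gaussian
    return mu + var.sqrt() * torch.randn(LATENT_DIM)   # reparameterized sample of z

# Usage: z = structured_noise(torch.randn(64, CONTEXT_DIM))  # 64 randomly drawn historical records
```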
- the structured noise may be pre-calculated.
- a minibatch may be extracted from historical data when obtaining the traffic environment state at the current moment, and the structured noise corresponding to the current moment may be calculated.
- Step S24 Obtain the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information.
- this embodiment can acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is a pre-calculated fixed value, and the structured noise used at each moment is the same.
- this embodiment can acquire the traffic state at the current moment and the corresponding structured noise; wherein the structured noise acquired at the current moment is obtained from a plurality of pre-calculated structured noises.
- the structured noise corresponding to the current moment may be obtained cyclically from a plurality of the pre-calculated structured noises. For example, if 100 structured noises are pre-calculated, the structured noise corresponding to the current moment can be obtained cyclically from the 100 structured noises.
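- A trivial sketch of this cyclic selection follows; the list of pre-computed noises and the step index are assumed inputs:

```python
def noise_for_step(precomputed_noises, t):
    """Cyclically select the structured noise for time step t from the pre-computed noises."""
    return precomputed_noises[t % len(precomputed_noises)]
```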
- the specific process of obtaining the structured noise corresponding to the current moment may include: randomly extracting a preset number of data items from the historical data in real time to obtain a corresponding minibatch, then calculating the Gaussian factor of each piece of historical data in the minibatch, and calculating the structured noise corresponding to the minibatch using all the Gaussian factors.
- Step S25 Determine a corresponding execution action by using the traffic environment state and the structured noise through a policy network.
- Step S26 Control the automatic driving vehicle to execute the execution action.
- Step S27 Evaluate the strategy of the strategy network through the evaluation network according to the execution action, and obtain a corresponding reward.
- the evaluation network inherits the pre-trained target network and prediction network, thereby improving the efficiency of automatic driving training.
- Step S28 Update evaluation network parameters through back-propagation operation based on the reward.
- Step S29 Use the policy gradient algorithm to update the policy network parameters.
- the present application provides an automatic driving decision-making method based on the fusion of DRL and structured noise.
- environmental state information is obtained through the vehicle sensor device, and historical data is sampled from the playback buffer (Replay Buffer).
- structured noise is introduced into the policy function and value function to solve the robustness problem of DRL-based automatic driving sequence decision-making, and to avoid the dangerous situation of unstable driving and even causing accidents when the automatic driving vehicle faces a complex environment.
- an embodiment of the present application discloses a specific automatic driving training method, including: (1) acquiring the traffic environment state S_t collected by the vehicle sensor device; (2) designing the reward function r_t for the automatic driving problem under study; (3) using the classical DQN algorithm to pre-train the vehicle for autonomous driving and accumulating the playback buffer data B; (4) sampling historical data c from the playback buffer B and using the Gaussian factors to calculate the probabilistically represented latent variable z, i.e., the structured noise; (5) combining the structured noise z and using the DDPG algorithm to train the vehicle to drive automatically.
- an embodiment of the present application discloses an automatic driving training device, including:
- the data acquisition module 11 is used to acquire the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
- an action determination module 12 configured to use the traffic environment state and the structured noise to determine a corresponding execution action through a policy network
- an action control module 13 configured to control the autonomous driving vehicle to execute the execution action
- the strategy evaluation module 14 is configured to evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward;
- an evaluation network update module 15 configured to update the evaluation network parameters through back-propagation operation based on the return;
- the policy network updating module 16 is used for updating the parameters of the policy network by using the policy gradient algorithm.
- the embodiment of the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; a corresponding execution action is then determined through the policy network using the traffic environment state and the structured noise; the autonomous driving vehicle is then controlled to perform the execution action, and the policy of the policy network is evaluated by the evaluation network according to the execution action to obtain a corresponding reward; the evaluation network parameters are then updated through a back-propagation operation based on the reward, and
- the policy network parameters are updated using the policy gradient algorithm.
- in this way, structured noise based on historical data is introduced during automatic driving training, and since the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
- the device further includes a pre-training module for pre-training the self-driving vehicle by using the DQN algorithm; storing the corresponding pre-training data in a playback buffer, and using the data stored in the playback buffer as the historical data.
- the evaluation network updating module 15 is specifically configured to perform a back-propagation operation on the evaluation network loss function based on the reward, and update the evaluation network parameters in a single step.
- the policy network update module 16 is specifically configured to perform policy gradient operation by using the value function of the evaluation network and the current policy of the policy network to update the policy network parameters.
- the apparatus further includes a structured noise calculation module for pre-calculating the structured noise.
- the structured noise calculation module is specifically used to randomly extract a preset number of data items from the historical data to obtain a corresponding minibatch, calculate the Gaussian factor of each piece of the historical data in the minibatch, and calculate the structured noise corresponding to the minibatch using all the Gaussian factors.
- the structured noise calculation module is specifically configured to randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of the historical data in each of the minibatches, and then calculate the structured noise corresponding to each minibatch using all the Gaussian factors corresponding to that minibatch.
- an embodiment of the present application discloses an automatic driving training device, including a processor 21 and a memory 22; wherein, the memory 22 is used to store a computer program, and the processor 21 is used to execute the computer program to implement the automatic driving training method disclosed in the foregoing embodiments.
- the embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program implements the automatic driving training method disclosed in the foregoing embodiments when the computer program is executed by the processor.
- the steps of a method or algorithm described in connection with the embodiments disclosed herein may be directly implemented in hardware, a software module executed by a processor, or a combination of the two.
- the software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Remote Sensing (AREA)
- Radar, Positioning & Navigation (AREA)
- General Physics & Mathematics (AREA)
- Aviation & Aerospace Engineering (AREA)
- Automation & Control Theory (AREA)
- Electromagnetism (AREA)
- Theoretical Computer Science (AREA)
- Educational Technology (AREA)
- Optics & Photonics (AREA)
- Educational Administration (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Traffic Control Systems (AREA)
Abstract
An automatic driving training method, apparatus and device and a medium. The method comprises: acquiring a traffic environment state of a current moment and corresponding structured noise (S11), the structured noise being determined on the basis of historical data, the historical data being data saved during pretraining of an autonomous vehicle, and the historical data including historical action information and historical traffic environment state information; determining a corresponding execution action using the traffic environment state and the structured noise and by means of a policy network (S12); controlling the autonomous vehicle to perform the execution action (S13); evaluating a policy of the policy network according to the execution action and by means of an evaluation network to obtain a corresponding return (S14); updating parameters of the evaluation network on the basis of the return and by means of a back propagation operation (S15); and updating parameters of the policy network using a policy gradient algorithm (S16). The method can improve the stability of automatic driving training, thereby reducing the probability of occurrence of dangerous accidents.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 8, 2020, with application number 202010934770.9 and the invention title "An automatic driving training method, apparatus, device and medium", the entire contents of which are incorporated by reference in this application.
The present application relates to the technical field of automatic driving, and in particular, to an automatic driving training method, apparatus, device and medium.
In modern urban traffic, the number of motor vehicles is increasing day by day, road congestion is serious, and traffic accidents are frequent. Studies have shown that each person wastes up to 3 years of time in a lifetime due to traffic congestion, and 90% of traffic accidents are caused by human operating mistakes or errors. In order to minimize the harm caused by human factors, attention has turned to the field of autonomous driving. According to the driver's degree of participation while the vehicle is running, automatic driving is divided into six levels from low to high, Level-0 to Level-5, namely human driving, assisted driving, partial automatic driving, conditional automatic driving, highly automatic driving and fully automatic driving. At present, mainstream autonomous driving companies or projects generally reach Level-3. Autonomous driving is a very complex integrated technology, covering hardware devices such as on-board sensors, data processors and controllers, and it requires modern mobile communication and network technologies as support to realize information transmission and sharing among traffic participants such as vehicles, pedestrians and non-motor vehicles, to complete functions such as sensing and perception, decision planning and control execution in complex environments, and to realize automatic acceleration/deceleration, steering, overtaking, braking and other operations of the vehicle to ensure driving safety. Referring to FIG. 1, FIG. 1 is a schematic diagram of a control architecture of an automatic driving vehicle provided by an embodiment of the present application.
Computer simulation of automatic driving systems based on a simulator environment is a basic key technology for testing and experimenting with autonomous driving vehicles; it can effectively ensure the safety of autonomous driving vehicles and accelerate autonomous driving research and application. Existing automatic driving simulations are mainly divided into two categories, namely the Modular Pipeline and the End-to-End Pipeline. Referring to FIG. 2, FIG. 2 is a schematic diagram of a modular method in the prior art provided by this application. The automatic driving system is decomposed into several independent but interrelated modules, such as Perception, Localization, Planning and Control modules, which have good interpretability and allow the problem module to be located quickly when the system fails; it is a conventional method widely used in the industry at this stage. However, modular construction and maintenance of the system is difficult, and the system is not easy to update when facing new complex scenarios. Referring to FIG. 3, FIG. 3 is a schematic diagram of an end-to-end method in the prior art provided by this application. The end-to-end method regards the automatic driving problem as a machine learning problem and directly optimizes the entire "sensor data processing - generating control commands - executing commands" pipeline. The end-to-end method is simple to build and has developed rapidly in the field of autonomous driving, but the method itself is a "black box" with poor interpretability. End-to-end methods also come in two forms, namely the Open-loop imitation learning method and the Closed-loop reinforcement learning method. Referring to FIG. 4, FIG. 4 is a schematic diagram of an Open-loop imitation learning method in the prior art provided by this application. The Open-loop imitation learning method learns to drive autonomously in a supervised manner by imitating the behavior of human drivers, emphasizing a kind of "predictive ability". FIG. 5 is a schematic diagram of a Closed-loop reinforcement learning method in the prior art provided by this application. The Closed-loop reinforcement learning method uses the Markov Decision Process (MDP) to explore and improve automatic driving strategies from scratch, emphasizing a kind of "driving ability".
Reinforcement Learning (RL) is a type of machine learning method that has developed rapidly in recent years; its agent-environment interaction mechanism and sequential decision-making mechanism are close to the process of human learning, so it is also regarded as a key step towards "Artificial General Intelligence (AGI)". The deep reinforcement learning (DRL) algorithm, which combines deep learning (DL), can automatically learn abstract representations of large-scale input data and offers better decision-making performance; it has been widely used in video games, mechanical control, advertising recommendation, financial transactions, urban transportation and other fields.
When DRL is applied to autonomous driving problems, it does not require domain expert knowledge, nor does it need to build a model; it has wide adaptability and can cope with changing and complex road environments. However, when DRL-based autonomous vehicles learn autonomous driving from scratch, selecting poor actions in the sequential decision-making process leads to a large training variance, which manifests as unstable driving of the vehicle, and even accidents such as running out of the lane or collisions. Existing research results show that, compared with the modular method and the Open-loop imitation learning method, DRL-based autonomous driving training has the worst stability and is very sensitive to changes in the environment and weather.
SUMMARY OF THE INVENTION
In view of this, the purpose of the present application is to provide an automatic driving training method, apparatus, device and medium, which can improve the stability of automatic driving training, thereby reducing the probability of dangerous accidents. The specific scheme is as follows:
In a first aspect, the present application discloses an automatic driving training method, including:
obtaining the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved in the process of pre-training an autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
determining a corresponding execution action through a policy network using the traffic environment state and the structured noise;
controlling the autonomous driving vehicle to perform the execution action;
evaluating the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward;
updating evaluation network parameters through a back-propagation operation based on the reward;
updating policy network parameters using a policy gradient algorithm.
Optionally, the automatic driving training method further includes:
pre-training the autonomous driving vehicle using the DQN algorithm;
storing the corresponding pre-training data in a playback buffer, and using the data stored in the playback buffer as the historical data.
Optionally, the updating of the evaluation network parameters through a back-propagation operation based on the reward includes:
performing a back-propagation operation on the evaluation network loss function based on the reward, and updating the evaluation network parameters in a single step.
Optionally, the updating of the policy network parameters using the policy gradient algorithm includes:
performing a policy gradient operation using the value function of the evaluation network and the current policy of the policy network, and updating the policy network parameters.
Optionally, the automatic driving training method further includes:
pre-calculating the structured noise.
Optionally, the pre-calculating of the structured noise includes:
randomly extracting a preset number of data items from the historical data to obtain a corresponding minibatch;
calculating the Gaussian factor of each piece of the historical data in the minibatch;
calculating the structured noise corresponding to the minibatch using all the Gaussian factors.
Optionally, the pre-calculating of the structured noise includes:
randomly extracting data from the historical data to obtain multiple minibatches;
calculating the Gaussian factor of each piece of the historical data in each of the minibatches, and then calculating the structured noise corresponding to each minibatch using all the Gaussian factors corresponding to that minibatch.
In a second aspect, the present application discloses an automatic driving training apparatus, including:
a data acquisition module, configured to obtain the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved in the process of pre-training an autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
an action determination module, configured to determine a corresponding execution action through a policy network using the traffic environment state and the structured noise;
an action control module, configured to control the autonomous driving vehicle to perform the execution action;
a policy evaluation module, configured to evaluate the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward;
an evaluation network update module, configured to update the evaluation network parameters through a back-propagation operation based on the reward;
a policy network update module, configured to update the policy network parameters using a policy gradient algorithm.
In a third aspect, the present application discloses an automatic driving training device, including a processor and a memory; wherein,
the memory is used to store a computer program;
the processor is configured to execute the computer program to implement the aforementioned automatic driving training method.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program, wherein when the computer program is executed by a processor, the aforementioned automatic driving training method is implemented.
It can be seen that the present application obtains the traffic environment state at the current moment and the corresponding structured noise, wherein the structured noise is structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; a corresponding execution action is then determined through the policy network using the traffic environment state and the structured noise; the autonomous driving vehicle is then controlled to perform the execution action, and the policy of the policy network is evaluated by the evaluation network according to the execution action to obtain a corresponding reward; the evaluation network parameters are then updated through a back-propagation operation based on the reward, and the policy network parameters are updated using the policy gradient algorithm. In this way, structured noise based on historical data is introduced in the training process of automatic driving, and since the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of an autonomous vehicle control architecture provided by the present application;
FIG. 2 is a schematic diagram of a modular method in the prior art;
FIG. 3 is a schematic diagram of an end-to-end method in the prior art;
FIG. 4 is a schematic diagram of an open-loop imitation learning method in the prior art;
FIG. 5 is a schematic diagram of a closed-loop reinforcement learning method in the prior art;
FIG. 6 is a flowchart of an automatic driving training method disclosed in the present application;
FIG. 7 is a schematic diagram of automatic driving training disclosed in the present application;
FIG. 8 is a flowchart of a specific automatic driving training method disclosed in the present application;
FIG. 9 is a flowchart of a specific automatic driving training method disclosed in the present application;
FIG. 10 is a schematic structural diagram of an automatic driving training apparatus disclosed in the present application;
FIG. 11 is a structural diagram of an automatic driving training device disclosed in the present application.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
When DRL is applied to autonomous driving, it requires neither domain expert knowledge nor an explicit model, has broad adaptability, and can cope with constantly changing, complex road environments. However, a DRL-based autonomous vehicle learns to drive from scratch, and steps in the sequential decision-making process where poor actions are selected lead to a large training variance, which manifests as unstable driving and even accidents such as running out of the lane or collisions. Existing research shows that, compared with the modular method and the open-loop imitation learning method, DRL-based autonomous driving training has the worst stability and is very sensitive to changes in environment and weather. For this reason, the present application provides an automatic driving training solution that can improve the stability of automatic driving training and thereby reduce the probability of dangerous accidents.
Referring to FIG. 6, an embodiment of the present application discloses an automatic driving training method, including:
Step S11: Acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information.
That is, the traffic environment state S_t at the current moment and the corresponding structured noise z_t are obtained.
It should be pointed out that the sequential decision-making process of a DRL-based automatic driving system is as follows: at time t, the autonomous vehicle (i.e., the agent) observes the state S_t of its environment, such as dynamic information including the position, speed and acceleration of itself and other traffic participants, traffic lights, and road topology features; a nonlinear neural network (NN) is used to represent the policy π_θ, and a vehicle action a_t is selected, such as acceleration/deceleration, steering, lane change or braking. At the next time t+1, the environment calculates the reward r_{t+1} according to the action a_t taken by the autonomous vehicle, combined with set benchmarks such as the average driving speed of the autonomous vehicle, the distance from the lane center, running a red light and collisions, and enters a new state S_{t+1}. The autonomous vehicle adjusts the policy π_θ according to the obtained reward r_{t+1} and, combined with the new state S_{t+1}, enters the next decision-making process. By making sequential decisions through the interaction between the autonomous vehicle and the environment, the optimal policy is learned, the autonomous vehicle obtains the maximum cumulative reward, and smooth, safe driving is achieved. Existing DRL-based autonomous driving research mostly uses algorithms that can handle continuous action spaces, such as the Deep Deterministic Policy Gradient (DDPG) algorithm, the Trust Region Policy Optimization (TRPO) algorithm and the Proximal Policy Optimization (PPO) algorithm. In this embodiment, DRL may be fused with structured noise to make automatic driving decisions. Considering the continuity of the state space and action space of the automatic driving problem, this embodiment may use the DDPG algorithm, which has high sample efficiency and computational efficiency. In some other embodiments, the Asynchronous Advantage Actor-Critic (A3C) algorithm, the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm, or the Soft Actor-Critic (SAC) algorithm may also be used.
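Purely for illustration, the sequential decision loop described above can be sketched as follows; the env and policy interfaces are hypothetical placeholders and are not part of the present application:

```python
# Minimal sketch of the DRL sequential decision loop (assumed interfaces).
def run_episode(env, policy, structured_noise):
    state = env.reset()                                   # initial traffic environment state S_t
    done = False
    while not done:
        action = policy.select(state, structured_noise)   # a_t = pi_theta(s, z)
        next_state, reward, done = env.step(action)       # environment returns r_{t+1} and S_{t+1}
        policy.update(state, action, reward, next_state)  # adjust pi_theta using the reward
        state = next_state
```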
In a specific implementation, this embodiment may acquire the traffic environment state data collected by vehicle sensors. Specifically, the driving environment state may be obtained by means of on-board sensor devices such as cameras, GPS (Global Positioning System), IMU (Inertial Measurement Unit), millimeter-wave radar and LiDAR, including weather data, traffic lights, traffic topology information, and the positions and running states of the autonomous vehicle and other traffic participants. Moreover, the traffic environment state in this embodiment includes not only the raw image data obtained directly by the camera, but also depth maps and semantic segmentation maps obtained through deep learning models such as RefineNet. For the autonomous vehicle, the state information that can be obtained directly includes: the driving speed v and lateral speed u of the vehicle; the steering angle δ of the steering wheel; the distance deviation ΔL between the vehicle center and the road center line; and the distances Δx_i, i = 1 to 4, between the vehicle and the nearest traffic participants in the four directions.
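As a minimal sketch of how the directly measurable quantities listed above could be packed into the state fed to the policy network (the field names and the flattening scheme below are assumptions, not specified by the application):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EgoState:
    v: float                                  # driving speed
    u: float                                  # lateral speed
    delta: float                              # steering wheel angle
    delta_L: float                            # deviation from the lane center line
    dx: Tuple[float, float, float, float]     # distances to the nearest participants in four directions

    def as_vector(self) -> list:
        # Flatten into the numeric state vector consumed by the policy network.
        return [self.v, self.u, self.delta, self.delta_L, *self.dx]
```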
Step S12: Determine a corresponding execution action through a policy network by using the traffic environment state and the structured noise.
In a specific implementation, the Actor Net (policy network) selects an action a_t based on the policy function π_θ(a|s,z), and the autonomous vehicle completes the corresponding action, such as "change lanes to the left", where θ is the network parameter of the Actor Net, s represents the traffic environment state, and z represents the structured noise.
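A minimal sketch of such an Actor Net, assuming the structured noise z is simply concatenated to the state before the forward pass; the layer sizes, activations and the concatenation scheme are assumptions rather than details given by the application:

```python
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Deterministic policy pi_theta(s, z) -> a (illustrative sketch)."""
    def __init__(self, state_dim: int, noise_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bounded continuous actions
        )

    def forward(self, state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Concatenate the traffic environment state with the structured noise.
        return self.net(torch.cat([state, z], dim=-1))
```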
Step S13: Control the autonomous vehicle to perform the execution action.
Step S14: Evaluate the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward.
In a specific implementation, the Critic Net (evaluation network) evaluates the policy of the Actor Net based on the value function Q_ω(s,a,z) according to the action a_t performed by the autonomous vehicle, and obtains the reward r_{t+1} given by the traffic environment, where ω is the network parameter of the Critic Net.
The value function Q_ω(s,a,z) is obtained by transformation from a preset reward function.
It should be noted that, in the embodiment of the present application, the reward function r_t for the studied automatic driving problem may also be designed in advance. Considering the specific scenario of the automatic driving simulation and evaluation indicators such as the average driving speed of the autonomous vehicle, the distance from the lane center, the duration of traffic disruption, and whether the vehicle crosses a line, runs a red light or collides, the reward function of the autonomous vehicle can be designed in different forms. Taking the simulation scenario of a vehicle lane change as an example, the reward function can be designed according to factors such as whether the lane change succeeds, whether traffic is disrupted, and whether a collision occurs; in the designed reward, v is the driving speed of the autonomous vehicle, v_ref is the reference speed set according to the road speed limit, and λ is a manually set coefficient.
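The exact piecewise reward is given by the application's original equation and is not reproduced here; purely as an illustration of the kind of lane-change reward shaping described above, a sketch might look like the following, where every threshold and penalty value is an assumption:

```python
def lane_change_reward(v, v_ref, lam, collided, off_lane, lane_change_success):
    """Illustrative reward sketch: speed tracking plus event-based terms (assumed values)."""
    if collided or off_lane:
        return -1.0                              # heavy penalty for a dangerous outcome
    reward = -lam * abs(v - v_ref) / v_ref       # encourage driving near the reference speed
    if lane_change_success:
        reward += 1.0                            # bonus for completing the lane change
    return reward
```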
Moreover, the value function can be calculated from the reward function as an expected discounted sum of rewards, where γ∈(0,1] is the discount factor and E denotes the expectation operation. Since structured noise is introduced in this embodiment, the corresponding value function is Q_ω(s,a,z).
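The original formula is not reproduced above; assuming the standard expected discounted return used with actor-critic methods such as DDPG, the value function would take a form such as:

```latex
Q_{\omega}(s, a, z) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a,\; z\right], \qquad \gamma \in (0, 1].
```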
Step S15: Update evaluation network parameters through a back-propagation operation based on the reward.
In a specific implementation, a back-propagation operation with respect to the evaluation network loss function is performed based on the reward, and the evaluation network parameters are updated in a single step. Specifically, the evaluation network loss function is minimized through the back-propagation operation, and the network parameter ω is updated in a single step. The evaluation network loss function takes the mean-squared temporal-difference form L(ω) = (1/N) Σ_t (y_t − Q_ω(s_t, a_t, z_t))², where y_t = r_{t+1} + γQ′_ω(s_{t+1}, a_{t+1}, z_{t+1}). Here Q′_ω(s_{t+1}, a_{t+1}, z_{t+1}) and Q_ω(s_t, a_t, z_t) are the value functions of the target network and the prediction network, respectively, N is the number of collected samples, and γ∈(0,1] is the discount factor. The target network and the prediction network are neural networks designed based on the DQN (Deep Q-Network, i.e., deep value-function neural network) algorithm.
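A minimal PyTorch-style sketch of this single-step critic update, assuming the critic takes (state, action, structured noise) as input and that target_actor and target_critic are separately maintained copies; all names and the batch layout are hypothetical:

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, optimizer, batch, gamma=0.99):
    """One TD-style update of the evaluation (critic) network parameters omega."""
    s, a, r, s_next, z = batch                                # minibatch of N transitions plus structured noise
    with torch.no_grad():
        a_next = target_actor(s_next, z)                      # a_{t+1} from the target policy
        y = r + gamma * target_critic(s_next, a_next, z)      # TD target y_t
    q = critic(s, a, z)                                       # Q_omega(s_t, a_t, z_t)
    loss = F.mse_loss(q, y)                                   # mean-squared TD error over the minibatch
    optimizer.zero_grad()
    loss.backward()                                           # back-propagation operation
    optimizer.step()                                          # single-step update of omega
    return loss.item()
```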
Step S16: Update policy network parameters by using a policy gradient algorithm.
In a specific implementation, this embodiment may perform a policy gradient operation by using the value function of the evaluation network and the current policy of the policy network, so as to update the policy network parameters.
Specifically, this embodiment updates the network parameter θ of the Actor Net through the policy gradient ∇_θ J(θ) = E[∇_a Q_ω(s,a,z)|_{a=π_θ(s,z)} · ∇_θ π_θ(s,z)], where J(θ) is the objective function of the policy gradient method, usually expressed in some form of the reward; ∇_a Q_ω(s,a,z) is obtained by differentiating the value function of the Critic Net with respect to the action a, and ∇_θ π_θ(s,z) is obtained by differentiating the policy of the Actor Net at the current step. The task of the policy gradient method is to maximize the objective function, which is achieved by gradient ascent. After the policy gradient is obtained from the above formula, the network parameter θ is updated through θ ← θ + α·∇_θ J(θ), where α is a fixed step-size parameter.
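A matching actor-update sketch under the same assumptions as the critic sketch above; gradient ascent on J(θ) is implemented as gradient descent on its negative, and all names are hypothetical:

```python
def actor_update(actor, critic, optimizer, batch):
    """One policy-gradient update of the policy network parameters theta."""
    s, _, _, _, z = batch
    a = actor(s, z)                        # a = pi_theta(s, z)
    actor_loss = -critic(s, a, z).mean()   # maximize Q by minimizing its negative
    optimizer.zero_grad()
    actor_loss.backward()                  # autograd yields grad_a Q * grad_theta pi
    optimizer.step()                       # theta <- theta + alpha * policy gradient
    return actor_loss.item()
```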
The above steps S11 to S15 are repeated until the automatic driving ends.
For example, referring to FIG. 7, FIG. 7 is a schematic diagram of automatic driving training disclosed in the present application. Combined with the structured noise z, the DDPG algorithm is used to train the vehicle for autonomous driving. The DDPG algorithm is a typical Actor-Critic reinforcement learning algorithm, in which the policy network (Actor Net) updates the policy according to the value function fed back by the evaluation network (Critic Net), while the Critic Net trains the value function and performs single-step updates using the temporal-difference (TD) method. In addition, the Critic Net includes a target network (Target Net) and a prediction network (Pred Net) designed based on the DQN algorithm, and the value functions of both networks are used when the network parameters are updated. The Actor Net and the Critic Net work together so that the actions selected by the agent obtain the maximum cumulative reward.
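To make the Target Net / Pred Net relationship concrete, the following sketch shows one common way such target networks can be maintained; the soft-update rule with coefficient tau is an assumption, since the application does not specify how the target parameters are refreshed:

```python
import copy
import torch

def make_target(net: torch.nn.Module) -> torch.nn.Module:
    """Create a target network as a frozen copy of the prediction network."""
    target = copy.deepcopy(net)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 0.005):
    """Slowly track the prediction network: theta_target <- tau*theta + (1-tau)*theta_target."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)
```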
It can be seen that the embodiment of the present application acquires the traffic environment state at the current moment and the corresponding structured noise, where the structured noise is determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information; then determines a corresponding execution action through a policy network by using the traffic environment state and the structured noise; then controls the autonomous vehicle to perform the execution action and evaluates the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward; and then updates evaluation network parameters through a back-propagation operation based on the reward and updates policy network parameters by using a policy gradient algorithm. In this way, structured noise based on historical data is introduced into the automatic driving training process, and because the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
Referring to FIG. 8, an embodiment of the present application discloses a specific automatic driving training method, including:
Step S21: Pre-train the autonomous vehicle by using the DQN algorithm.
Step S22: Store the corresponding pre-training data in a replay buffer, and use the data stored in the replay buffer as the historical data.
In a specific implementation, the classical DQN algorithm is used to pre-train the vehicle for autonomous driving and to accumulate replay buffer data B. Using the classical DQN method, two neural networks with the same structure but different parameters are constructed: a target network (Target Net) whose parameters are updated at fixed intervals and a prediction network (Pred Net) whose parameters are updated at every step. Taking the simulation scenario of a vehicle lane change as an example, the action space of the autonomous vehicle at each time t is [a_t1, a_t2, a_t3], representing "change lanes to the left", "change lanes to the right" and "keep the current lane", respectively. Both the Target Net and the Pred Net use a simple 3-layer neural network containing only one hidden layer. The traffic environment state S_t collected by the vehicle sensor devices is input, the target value Q_target and the predicted value Q_pred are calculated, and the action a_t corresponding to the largest Q_pred is selected as the driving action of the autonomous vehicle. According to the designed reward function, the reward r_{t+1} is obtained, a new traffic environment state S_{t+1} is entered, and the learning experience c_t = (s_t, a_t, r_t, s_{t+1}) is stored in the replay buffer. The RMSProp optimizer is used to update the network parameters to minimize the loss function, and the autonomous vehicle is continuously pre-trained until sufficient replay buffer data B has been accumulated.
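An illustrative sketch of this pre-training loop, assuming a discrete three-action head and greedy selection on Q_pred as described; the network sizes, the env interface and the buffer capacity are assumptions:

```python
from collections import deque
import torch
import torch.nn as nn

class PredNet(nn.Module):
    """Simple 3-layer Q-network with one hidden layer (sketch)."""
    def __init__(self, state_dim: int, n_actions: int = 3, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def pretrain(env, pred_net, steps: int, buffer_size: int = 100_000):
    """Accumulate replay buffer data B by acting greedily on Q_pred."""
    buffer = deque(maxlen=buffer_size)
    s = env.reset()
    for _ in range(steps):
        with torch.no_grad():
            q_pred = pred_net(torch.as_tensor(s, dtype=torch.float32))
        a = int(q_pred.argmax())              # action with the largest Q_pred
        s_next, r, done = env.step(a)
        buffer.append((s, a, r, s_next))      # learning experience c_t
        s = env.reset() if done else s_next
    return buffer
```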
Step S23: Calculate the structured noise.
In one specific implementation, this embodiment may randomly extract a preset number of pieces of data from the historical data to obtain a corresponding minibatch; calculate the Gaussian factor of each piece of the historical data in the minibatch; and calculate the structured noise corresponding to the minibatch by using all the Gaussian factors.
In another specific implementation, this embodiment may randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of historical data in each minibatch, and then calculate the structured noise corresponding to each minibatch by using all the Gaussian factors corresponding to that minibatch.
That is, multiple structured noises can be calculated by using multiple minibatches, so that different structured noises can be used during automatic driving training to improve the robustness of automatic driving.
Specifically, a minibatch b_i ~ B can be randomly drawn from the replay buffer B, where the minibatch b_i contains N pieces of historical data c_{1:N} = (s_n, a_n, r_n, s_{n+1}), n = 1 to N. The Gaussian factor of each piece of historical data is then calculated: the Gaussian factor of each sampled piece of historical data c_n is Ψ_φ(z|c_n) = N(μ_n, σ_n), where N denotes the Gaussian distribution. The Gaussian factor of the historical data c_n is computed with a neural network (NN) whose outputs give the mean μ_n and the variance σ_n, and φ is the parameter of the neural network f. In this way, the latent variable represented as a probability, namely the structured noise, is obtained. The structured noise of each sampled minibatch b_i is z ~ q_φ(z|c_{1:N}), where q_φ(z|c_{1:N}) is obtained by multiplying together the Gaussian factors Ψ_φ(z|c_n) of the individual pieces of historical data c_n, i.e., q_φ(z|c_{1:N}) ∝ ∏_{n=1}^{N} Ψ_φ(z|c_n).
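A sketch of this Gaussian-factor computation, assuming diagonal Gaussians and the usual closed-form product of independent Gaussian factors; the encoder architecture and the precision-weighted combination are assumptions consistent with, but not mandated by, the description above:

```python
import torch
import torch.nn as nn

class GaussianFactorEncoder(nn.Module):
    """f_phi: maps one transition c_n = (s_n, a_n, r_n, s_{n+1}) to (mu_n, sigma_n)."""
    def __init__(self, c_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(c_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_sigma = nn.Linear(hidden, z_dim)

    def forward(self, c):
        h = self.body(c)
        return self.mu(h), self.log_sigma(h).exp()

def structured_noise(encoder, minibatch):
    """Combine per-transition Gaussian factors by product and sample z ~ q_phi(z | c_1:N)."""
    mu, sigma = encoder(minibatch)                       # shapes: (N, z_dim)
    prec = 1.0 / sigma.pow(2)                            # precision of each factor
    var = 1.0 / prec.sum(dim=0)                          # product-of-Gaussians variance
    mean = var * (prec * mu).sum(dim=0)                  # product-of-Gaussians mean
    return mean + var.sqrt() * torch.randn_like(mean)    # reparameterized sample of z
```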
That is, in this embodiment the structured noise may be pre-calculated. In some other embodiments, when the traffic environment state at the current moment is acquired, a minibatch may be extracted from the historical data to calculate the structured noise corresponding to the current moment.
Step S24: Acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information.
In one specific implementation, this embodiment may acquire the traffic environment state at the current moment and the corresponding structured noise, where the structured noise is a pre-calculated fixed value and the same structured noise is used at every moment.
In another specific implementation, this embodiment may acquire the traffic state at the current moment and the corresponding structured noise, where the structured noise acquired at the current moment is one structured noise taken from multiple pre-calculated structured noises. Specifically, the structured noise corresponding to the current moment may be taken cyclically from the multiple pre-calculated structured noises. For example, if 100 structured noises are pre-calculated, the structured noise corresponding to the current moment can be taken from the 100 structured noises in a cyclic manner. Of course, in some other embodiments, the specific process of acquiring the structured noise corresponding to the current moment may include: randomly extracting a preset number of pieces of data from the historical data in real time to obtain a corresponding minibatch, calculating the Gaussian factor of each piece of historical data in the minibatch, and calculating the structured noise corresponding to the minibatch by using all the Gaussian factors.
It can be understood that training with different structured noises can improve the robustness of automatic driving.
Step S25: Determine a corresponding execution action through the policy network by using the traffic environment state and the structured noise.
Step S26: Control the autonomous vehicle to perform the execution action.
Step S27: Evaluate the policy of the policy network through the evaluation network according to the execution action to obtain a corresponding reward.
In a specific implementation, the evaluation network inherits the target network and the prediction network obtained from pre-training, thereby improving the efficiency of automatic driving training.
Step S28: Update evaluation network parameters through a back-propagation operation based on the reward.
Step S29: Update policy network parameters by using the policy gradient algorithm.
That is, the present application provides an automatic driving decision-making method that fuses DRL with structured noise. On an automatic driving simulation platform, environment state information is obtained through vehicle sensor devices, historical data is sampled from the replay buffer, and structured noise is introduced into the policy function and the value function by means of the Gaussian factor algorithm, which addresses the robustness problem of DRL-based automatic driving sequential decision-making and avoids dangerous situations in which the autonomous vehicle drives unstably or even causes accidents when facing a complex environment. For example, referring to FIG. 9, an embodiment of the present application discloses a specific automatic driving training method, including: (1) acquiring the traffic environment state S_t collected by the vehicle sensor devices; (2) designing the reward function r_t for the studied automatic driving problem; (3) pre-training the vehicle for autonomous driving by using the classical DQN algorithm and accumulating replay buffer data B; (4) sampling historical data c from the replay buffer B and using Gaussian factors to calculate the latent variable z represented as a probability, i.e., the structured noise; (5) training the vehicle for autonomous driving by using the DDPG algorithm combined with the structured noise z.
Referring to FIG. 10, an embodiment of the present application discloses an automatic driving training apparatus, including:
a data acquisition module 11, configured to acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information;
an action determination module 12, configured to determine a corresponding execution action through a policy network by using the traffic environment state and the structured noise;
an action control module 13, configured to control the autonomous vehicle to perform the execution action;
a policy evaluation module 14, configured to evaluate the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward;
an evaluation network update module 15, configured to update evaluation network parameters through a back-propagation operation based on the reward;
a policy network update module 16, configured to update policy network parameters by using a policy gradient algorithm.
It can be seen that the embodiment of the present application acquires the traffic environment state at the current moment and the corresponding structured noise, where the structured noise is determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information; then determines a corresponding execution action through a policy network by using the traffic environment state and the structured noise; then controls the autonomous vehicle to perform the execution action and evaluates the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward; and then updates evaluation network parameters through a back-propagation operation based on the reward and updates policy network parameters by using a policy gradient algorithm. In this way, structured noise based on historical data is introduced into the automatic driving training process, and because the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
The apparatus further includes a pre-training module, configured to pre-train the autonomous vehicle by using the DQN algorithm, store the corresponding pre-training data in a replay buffer, and use the data stored in the replay buffer as the historical data.
The evaluation network update module 15 is specifically configured to perform, based on the reward, a back-propagation operation with respect to the evaluation network loss function, and to update the evaluation network parameters in a single step.
The policy network update module 16 is specifically configured to perform a policy gradient operation by using the value function of the evaluation network and the current policy of the policy network, so as to update the policy network parameters.
The apparatus further includes a structured noise calculation module, configured to pre-calculate the structured noise.
In one specific implementation, the structured noise calculation module is specifically configured to randomly extract a preset number of pieces of data from the historical data to obtain a corresponding minibatch, calculate the Gaussian factor of each piece of historical data in the minibatch, and calculate the structured noise corresponding to the minibatch by using all the Gaussian factors.
In another specific implementation, the structured noise calculation module is specifically configured to randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of historical data in each minibatch, and then calculate the structured noise corresponding to each minibatch by using all the Gaussian factors corresponding to that minibatch.
Referring to FIG. 11, an embodiment of the present application discloses an automatic driving training device, including a processor 21 and a memory 22, where the memory 22 is configured to store a computer program, and the processor 21 is configured to execute the computer program to implement the automatic driving training method disclosed in the foregoing embodiments.
For the specific process of the above automatic driving training method, reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.
Further, an embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, where the computer program, when executed by a processor, implements the automatic driving training method disclosed in the foregoing embodiments.
For the specific process of the above automatic driving training method, reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.
The embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments; for the same or similar parts among the embodiments, reference may be made to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and for relevant details reference may be made to the description of the method.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The automatic driving training method, apparatus, device and medium provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation of the present application.
Claims (10)
- An automatic driving training method, comprising: acquiring a traffic environment state at a current moment and corresponding structured noise, wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of an autonomous vehicle, and the historical data comprises historical action information and historical traffic environment state information; determining a corresponding execution action through a policy network by using the traffic environment state and the structured noise; controlling the autonomous vehicle to perform the execution action; evaluating a policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward; updating evaluation network parameters through a back-propagation operation based on the reward; and updating policy network parameters by using a policy gradient algorithm.
- The automatic driving training method according to claim 1, further comprising: pre-training the autonomous vehicle by using a DQN algorithm; and storing corresponding pre-training data in a replay buffer, and using the data stored in the replay buffer as the historical data.
- The automatic driving training method according to claim 1, wherein the updating evaluation network parameters through a back-propagation operation based on the reward comprises: performing, based on the reward, a back-propagation operation with respect to an evaluation network loss function, and updating the evaluation network parameters in a single step.
- The automatic driving training method according to claim 1, wherein the updating policy network parameters by using a policy gradient algorithm comprises: performing a policy gradient operation by using a value function of the evaluation network and a current policy of the policy network, and updating the policy network parameters.
- The automatic driving training method according to any one of claims 1 to 4, further comprising: pre-calculating the structured noise.
- The automatic driving training method according to claim 5, wherein the pre-calculating the structured noise comprises: randomly extracting a preset number of pieces of data from the historical data to obtain a corresponding minibatch; calculating a Gaussian factor of each piece of the historical data in the minibatch; and calculating the structured noise corresponding to the minibatch by using all the Gaussian factors.
- The automatic driving training method according to claim 5, wherein the pre-calculating the structured noise comprises: randomly extracting data from the historical data to obtain multiple minibatches; and calculating a Gaussian factor of each piece of the historical data in each minibatch, and then calculating the structured noise corresponding to each minibatch by using all the Gaussian factors corresponding to that minibatch.
- An automatic driving training apparatus, comprising: a data acquisition module, configured to acquire a traffic environment state at a current moment and corresponding structured noise, wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of an autonomous vehicle, and the historical data comprises historical action information and historical traffic environment state information; an action determination module, configured to determine a corresponding execution action through a policy network by using the traffic environment state and the structured noise; an action control module, configured to control the autonomous vehicle to perform the execution action; a policy evaluation module, configured to evaluate a policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward; an evaluation network update module, configured to update evaluation network parameters through a back-propagation operation based on the reward; and a policy network update module, configured to update policy network parameters by using a policy gradient algorithm.
- An automatic driving training device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program to implement the automatic driving training method according to any one of claims 1 to 7.
- A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the automatic driving training method according to any one of claims 1 to 7.