WO2022052406A1 - Automatic driving training method, apparatus and device, and medium - Google Patents
Automatic driving training method, apparatus and device, and medium
- Publication number
- WO2022052406A1 (PCT/CN2021/073449)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- automatic driving
- policy
- structured noise
- network
- historical data
- Prior art date
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0238—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
- G05D1/024—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0223—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0246—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
- G05D1/0253—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting relative motion information from a plurality of images taken successively, e.g. visual odometry, optical flow
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0257—Control of position or course in two dimensions specially adapted to land vehicles using a radar
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
- G05D1/0278—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle using satellite positioning signals, e.g. GPS
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B9/00—Simulators for teaching or training purposes
- G09B9/02—Simulators for teaching or training purposes for teaching control of vehicles or other craft
- G09B9/04—Simulators for teaching or training purposes for teaching control of vehicles or other craft for teaching control of land vehicles
Definitions
- the present application relates to the technical field of automatic driving, and in particular, to an automatic driving training method, device, equipment and medium.
- FIG. 1 is a schematic diagram of a control architecture of an automatic driving vehicle provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of a modular method in the prior art provided by this application. The automatic driving system is decomposed into several independent but interrelated modules, such as Perception, Localization, Planning and Control modules, which have good interpretability and allow the problem module to be located quickly when the system fails. It is a conventional method widely used in the industry at this stage.
- FIG. 3 is a schematic diagram of an end-to-end method in the prior art provided by this application.
- the end-to-end method regards the automatic driving problem as a machine learning problem, and directly optimizes the entire "sensor data processing - generating control commands - executing commands" pipeline.
- the end-to-end method is simple to build and has achieved rapid development in the field of autonomous driving, but the method itself is also a "black box" with poor interpretability.
- FIG. 4 is a schematic diagram of an Open-loop imitation learning method in the prior art provided by this application.
- the Open-loop imitation learning method learns to drive autonomously in a supervised manner by imitating the behavior of human drivers, emphasizing a kind of "predictive ability".
- FIG. 5 is a schematic diagram of a Closed-loop reinforcement learning method in the prior art provided by this application.
- the Closed-loop reinforcement learning method uses the Markov Decision Process (MDP) to explore and improve automatic driving strategies from scratch, emphasizing a kind of "driving ability".
- Reinforcement learning (RL) is a type of machine learning method that has developed rapidly in recent years, in which the agent-environment interaction mechanism and the sequential decision-making mechanism are close to the process of human learning; it is therefore also regarded as a key step towards Artificial General Intelligence (AGI).
- the deep reinforcement learning (DRL) algorithm, which combines deep learning (DL), can automatically learn abstract representations of large-scale input data and offers better decision-making performance; it has been widely used in video games, mechanical control, advertising recommendation, financial transactions, urban transportation and other fields.
- When DRL is applied to autonomous driving problems, it does not require domain expert knowledge, nor does it need to build models. It has a wide range of adaptability and can cope with changing and complex road environments.
- However, when DRL-based autonomous vehicles learn autonomous driving from scratch, selecting poor actions in the sequential decision-making process leads to a large training variance, which manifests as unstable driving of the vehicle, and even accidents such as running out of the lane and collisions.
- the existing research results show that, compared with the modular method and the Open-loop imitation learning method, the DRL-based autonomous driving training has the worst stability and is very sensitive to changes in the environment and weather.
- the purpose of the present application is to provide an automatic driving training method, device, equipment and medium, which can improve the stability of automatic driving training, thereby reducing the probability of occurrence of dangerous accidents. Its specific plan is as follows:
- an automatic driving training method including:
- the structured noise is the structured noise determined based on historical data
- the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information
- the policy network parameters are updated using the policy gradient algorithm.
- the automatic driving training method further includes:
- the corresponding pre-training data is stored in the playback buffer, and the data stored in the playback buffer is used as the historical data.
- the updating of the evaluation network parameters through back-propagation operation based on the reward includes:
- the use of the policy gradient algorithm to update the policy network parameters includes:
- the policy gradient operation is performed using the value function of the evaluation network and the current policy of the policy network, and the policy network parameters are updated.
- the automatic driving training method further includes:
- the structured noise is precomputed.
- the pre-calculating the structured noise includes:
- the structured noise corresponding to the minibatch is calculated using all the Gaussian factors.
- the pre-calculating the structured noise includes:
- an automatic driving training device comprising:
- the data acquisition module is used to acquire the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
- an action determination module configured to use the traffic environment state and the structured noise to determine a corresponding execution action through a policy network
- an action control module configured to control the autonomous driving vehicle to execute the execution action
- a strategy evaluation module configured to evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward
- an evaluation network update module for updating the evaluation network parameters through back-propagation operation based on the return
- the policy network update module is used to update the policy network parameters using the policy gradient algorithm.
- an automatic driving training device including a processor and a memory; wherein,
- the memory for storing computer programs
- the processor is configured to execute the computer program to implement the aforementioned automatic driving training method.
- the present application discloses a computer-readable storage medium for storing a computer program, wherein when the computer program is executed by a processor, the aforementioned automatic driving training method is implemented.
- the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; a corresponding execution action is then determined through the policy network using the traffic environment state and the structured noise; the autonomous driving vehicle is then controlled to perform the execution action, and the policy of the policy network is evaluated by the evaluation network according to the execution action to obtain a corresponding reward; the evaluation network parameters are then updated through a back-propagation operation based on the reward, and the policy network parameters are updated using the policy gradient algorithm.
- in this way, structured noise based on historical data is introduced during automatic driving training, and since the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
- FIG. 1 is a schematic diagram of a control architecture of an autonomous driving vehicle provided by the present application
- FIG. 2 is a schematic diagram of a modular method in the prior art
- FIG. 3 is a schematic diagram of an end-to-end method in the prior art
- FIG. 4 is a schematic diagram of an imitation learning method of Open-loop in the prior art
- FIG. 5 is a schematic diagram of a closed-loop reinforcement learning method in the prior art
- FIG. 10 is a schematic structural diagram of an automatic driving training device disclosed in the application.
- FIG. 11 is a structural diagram of an automatic driving training device disclosed in this application.
- When DRL is applied to autonomous driving problems, it does not require domain expert knowledge, nor does it need to build models. It has a wide range of adaptability and can cope with changing and complex road environments.
- However, when DRL-based autonomous vehicles learn autonomous driving from scratch, selecting poor actions in the sequential decision-making process leads to a large training variance, which manifests as unstable driving of the vehicle, and even accidents such as running out of the lane and collisions.
- the existing research results show that, compared with the modular method and the Open-loop imitation learning method, the DRL-based autonomous driving training has the worst stability and is very sensitive to changes in the environment and weather. Therefore, the present application provides an automatic driving training solution, which can improve the stability of automatic driving training, thereby reducing the probability of occurrence of dangerous accidents.
- an embodiment of the present application discloses an automatic driving training method, including:
- Step S11 Obtain the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information.
- the sequential decision-making process of the DRL-based automatic driving system is: the automatic driving vehicle (i.e., the agent) observes the state S_t of the environment at time t, such as the position, speed and acceleration of itself and other traffic participants, traffic lights, road topology features and other information; uses a nonlinear neural network (NN) to represent the policy π_θ; and selects a vehicle action a_t, such as acceleration/deceleration, steering, lane change, braking, etc.
- the environment calculates the reward r_{t+1} according to the action a_t taken by the autonomous driving vehicle, combined with set benchmarks such as the average driving speed of the autonomous driving vehicle, distance from the lane center, running a red light, and collisions, and enters a new state S_{t+1}.
- the self-driving vehicle adjusts the policy π_θ according to the reward r_{t+1} obtained, and enters the next decision-making step in combination with the new state S_{t+1}.
- DRL-based autonomous driving research applications mostly use algorithms that can deal with continuous action spaces, such as Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO) and the more recent Proximal Policy Optimization (PPO).
- DRL and structured noise can be fused to make automatic driving decisions.
- this embodiment can use the DDPG algorithm with high sample efficiency and computational efficiency.
- Alternatively, the Asynchronous Advantage Actor-Critic algorithm (A3C), the Twin Delayed Deep Deterministic Policy Gradient algorithm (TD3), or the Soft Actor-Critic algorithm (SAC) may also be used.
- this embodiment can acquire traffic environment state data collected by vehicle sensors.
- the driving environment status includes, for example, weather data
- the driving environment status can be obtained with the help of on-board sensor devices such as cameras, GPS (Global Positioning System), IMU (Inertial Measurement Unit), millimeter-wave radar and LiDAR, and includes information such as traffic lights, road topology, and the positions and running states of the autonomous driving vehicle and other traffic participants
- the traffic environment status in this embodiment includes not only the raw image data obtained directly by the camera, but also data processed by deep learning models, such as depth maps and semantic segmentation maps produced by RefineNet.
- Step S12 Determine a corresponding execution action by using the traffic environment state and the structured noise through a policy network.
- the policy network (Actor Net) selects the action a_t based on the policy function π_θ(a|s, z), where θ is the network parameter of Actor Net.
- Step S13 Control the automatic driving vehicle to execute the execution action.
- Step S14 Evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward.
- the evaluation network (Critic Net) evaluates Actor Net's policy based on the value function Q_ω(s, a, z) according to the action a_t performed by the autonomous vehicle, and obtains the reward r_{t+1} given by the traffic environment, where ω is the network parameter of Critic Net.
- the value function Q ⁇ (s, a, z) is transformed from the preset reward function.
- the reward function r t for studying the automatic driving problem can also be designed in advance.
- the reward function of the autonomous driving vehicle can be designed in different forms .
- the reward function can be designed as:
- v is the driving speed of the autonomous vehicle
- v ref is the reference speed set according to the road speed limit
- ⁇ is a coefficient set manually.
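- As an illustration only (the reward formula itself appears as an image in the original and is not reproduced here), a speed-tracking reward of the kind described above, extended with penalty terms for the benchmarks mentioned earlier (lane departure, collision), could be sketched as follows; the coefficient name xi, the penalty values and the function signature are assumptions, not taken from this application:

```python
def reward(v, v_ref, off_lane, collided, xi=0.1):
    """Illustrative sketch of a reward: encourage driving near the reference speed
    v_ref set from the road speed limit, and penalize lane departure and collisions.
    All weights here are assumed values, not the patent's."""
    r = -xi * abs(v - v_ref)   # closer to the reference speed -> higher (less negative) reward
    if off_lane:
        r -= 1.0               # assumed penalty for leaving the lane center
    if collided:
        r -= 10.0              # assumed large penalty for a collision
    return r
```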
- the value function can be calculated by the reward function in the form:
- γ ∈ (0, 1] is the discount factor.
- structured noise is introduced, and the corresponding value function is Q_ω(s, a, z), where E denotes the expectation operation.
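- The value function itself is not reproduced as a formula in this text; under the usual discounted-return definition it takes the following standard form (a reconstruction from the quantities defined above, with the structured noise z added to the conditioning as described):

```latex
Q_{\omega}(s,a,z) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k+1} \,\middle|\, s_{t}=s,\; a_{t}=a,\; z_{t}=z\right],\qquad \gamma\in(0,1].
```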
- Step S15 Update evaluation network parameters through back-propagation operation based on the reward.
- a back-propagation operation for the evaluation network loss function is performed based on the reward, and the evaluation network parameters are updated in a single step. Specifically, through the back-propagation operation, the loss function of the evaluation network is minimized, and the network parameter ω is updated in a single step.
- the evaluation network loss function is:
- y_t = r_{t+1} + γQ′_ω(s_{t+1}, a_{t+1}, z_{t+1}).
- Q′_ω(s_{t+1}, a_{t+1}, z_{t+1}) and Q_ω(s_t, a_t, z_t) are the value functions of the target network and the prediction network, respectively.
- N is the number of samples collected, and γ ∈ (0, 1] is a discount factor.
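- The evaluation network loss itself appears as an image in the original; with the target value y_t and the N collected samples defined above, the standard DDPG-style mean-squared TD error would read (a reconstruction, not the literal expression of this application):

```latex
L(\omega) \;=\; \frac{1}{N}\sum_{t=1}^{N}\bigl(y_{t}-Q_{\omega}(s_{t},a_{t},z_{t})\bigr)^{2},
\qquad y_{t}=r_{t+1}+\gamma\, Q'_{\omega}(s_{t+1},a_{t+1},z_{t+1}).
```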
- the target network and the prediction network are neural networks designed based on the DQN (i.e., Deep Q-Network, a deep value-function neural network) algorithm.
- Step S16 Use the policy gradient algorithm to update the policy network parameters.
- this embodiment may use the value function of the evaluation network and the current strategy of the strategy network to perform a strategy gradient operation, and update the strategy network parameters.
- this embodiment updates the network parameter θ of Actor Net through the following policy gradient:
- J(θ) is the objective function of the policy gradient method, usually expressed in some form of reward; the first gradient term is obtained by differentiating the value function of Critic Net with respect to the action a, and the second is obtained by differentiating the policy of Actor Net at the current step.
- the task of the policy gradient method is to maximize the objective function, which is achieved by gradient ascent.
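- The gradient expression itself is likewise not reproduced here; the standard deterministic policy gradient used by DDPG, extended with the structured noise z as in the value function above, would read (a reconstruction consistent with the description of the two derivative terms):

```latex
\nabla_{\theta}J(\theta)\;\approx\;\frac{1}{N}\sum_{t=1}^{N}
\nabla_{a}Q_{\omega}(s,a,z)\big|_{s=s_{t},\,a=\pi_{\theta}(s_{t},z_{t})}\;
\nabla_{\theta}\pi_{\theta}(s,z)\big|_{s=s_{t},\,z=z_{t}}.
```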
- FIG. 7 is a schematic diagram of an automatic driving training disclosed in the present application.
- the DDPG algorithm is used to train vehicles for autonomous driving.
- the DDPG algorithm is a typical Actor-Critic reinforcement learning algorithm.
- the policy network (Actor Net) updates the policy according to the value function fed back by the evaluation network (Critic Net), while Critic Net trains the value function and uses the temporal-difference (TD) method for single-step updates.
- Critic Net includes a target network (Target Net) and a prediction network (Pred Net) designed based on the DQN algorithm, and the value functions of the two networks are used when the network parameters are updated.
- the Actor Net and Critic Net work together to maximize the cumulative reward for the actions chosen by the agent.
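- Putting the pieces of FIG. 7 together, a single update step of the described Actor-Critic training with structured noise might look like the following sketch; this is a minimal PyTorch-style illustration in which the network sizes, learning rates and the use of the Adam optimizer are assumptions, not the implementation of this application:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NOISE_DIM = 32, 2, 8      # assumed dimensions of s, a and z
GAMMA = 0.99                                      # discount factor in (0, 1]

# Actor Net pi_theta(s, z) -> a and Critic Net Q_omega(s, a, z) -> scalar value
actor = nn.Sequential(nn.Linear(STATE_DIM + NOISE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + NOISE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
critic_target = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + NOISE_DIM, 64), nn.ReLU(),
                              nn.Linear(64, 1))
critic_target.load_state_dict(critic.state_dict())   # Target Net starts as a copy of Pred Net

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update_step(s, a, r, s_next, z, z_next):
    """One single-step (TD) update of Critic Net followed by a policy-gradient update of Actor Net."""
    # Critic Net: minimize (y_t - Q(s_t, a_t, z_t))^2 through back-propagation
    with torch.no_grad():
        a_next = actor(torch.cat([s_next, z_next], dim=-1))
        y = r + GAMMA * critic_target(torch.cat([s_next, a_next, z_next], dim=-1))
    q = critic(torch.cat([s, a, z], dim=-1))
    critic_loss = ((y - q) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor Net: gradient ascent on Q(s, pi_theta(s, z), z), i.e. minimize its negative
    a_pred = actor(torch.cat([s, z], dim=-1))
    actor_loss = -critic(torch.cat([s, a_pred, z], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with random tensors standing in for a sampled minibatch of N = 16 transitions:
N = 16
update_step(torch.randn(N, STATE_DIM), torch.randn(N, ACTION_DIM), torch.randn(N, 1),
            torch.randn(N, STATE_DIM), torch.randn(N, NOISE_DIM), torch.randn(N, NOISE_DIM))
```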
- the embodiment of the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; a corresponding execution action is then determined through the policy network using the traffic environment state and the structured noise; the autonomous driving vehicle is then controlled to perform the execution action, and the policy of the policy network is evaluated by the evaluation network according to the execution action to obtain a corresponding reward; the evaluation network parameters are then updated through a back-propagation operation based on the reward, and the policy network parameters are updated using the policy gradient algorithm.
- in this way, structured noise based on historical data is introduced during automatic driving training, and since the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
- an embodiment of the present application discloses a specific automatic driving training method, including:
- Step S21 Use the DQN algorithm to pre-train the autonomous vehicle.
- Step S22 Store the corresponding pre-training data in the playback buffer, and use the data stored in the playback buffer as the historical data.
- the classical DQN algorithm is used to pre-train the vehicle for automatic driving, and the playback buffer data B is accumulated.
- in the classic DQN method, two neural networks with the same structure but different parameters are constructed, namely the target network (Target Net), which updates its parameters at a certain interval, and the prediction network (Pred Net), which updates its parameters at every step.
- the action space of the autonomous vehicle at each time t is [a_{t1}, a_{t2}, a_{t3}], which represent "lane change to the left", "lane change to the right" and "keep current lane".
- Both Target Net and Pred Net use a simple 3-layer neural network with only one hidden layer in the middle.
- the traffic environment state S_t collected by the vehicle sensor device is input, the target value Q_target and the predicted value Q_pred are computed, and the action a_t corresponding to the largest Q_pred is selected as the driving action of the autonomous vehicle.
- the network parameters are updated to minimize the loss function using the RMSProp optimizer, and the self-driving vehicle is continuously pre-trained until sufficient playback buffer data B is accumulated.
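- A minimal sketch of this DQN pre-training stage is given below; it is written with PyTorch for illustration, and the layer sizes, buffer capacity and hyperparameters are assumptions rather than values specified by this application (the environment-interaction loop that fills the buffer is omitted):

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 32, 3          # 3 discrete actions: change lane left / right, keep lane
GAMMA = 0.99
replay_buffer = deque(maxlen=100_000) # playback buffer B of (s, a, r, s_next) tensor tuples

def make_net():
    # Simple 3-layer network with a single hidden layer, used for both Target Net and Pred Net
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

pred_net, target_net = make_net(), make_net()
target_net.load_state_dict(pred_net.state_dict())                # Target Net updated only at intervals
optimizer = torch.optim.RMSprop(pred_net.parameters(), lr=1e-3)  # RMSProp optimizer

def select_action(state):
    """Pick the driving action with the largest predicted value Q_pred."""
    with torch.no_grad():
        return int(pred_net(state).argmax())

def dqn_update(batch_size=32):
    """Single-step update of Pred Net towards the target value Q_target."""
    if len(replay_buffer) < batch_size:
        return
    s, a, r, s_next = map(torch.stack, zip(*random.sample(replay_buffer, batch_size)))
    q_pred = pred_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_target = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = torch.nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```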
- Step S23 Calculate the structured noise.
- a preset number of data items can be randomly extracted from the historical data to obtain a corresponding minibatch (i.e., a small batch of data); the Gaussian factor of each piece of historical data in the minibatch is calculated; and the structured noise corresponding to the minibatch is calculated using all the Gaussian factors.
- this embodiment can also randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of the historical data in each of the minibatches, and then use all the Gaussian factors corresponding to each minibatch to calculate the structured noise corresponding to that minibatch.
- multiple structured noises can be calculated by using multiple minibatches, so that during automatic driving training, different structured noises can be used for training to improve the robustness of automatic driving.
- the Gaussian factor of each sampled historical data c_n is a Gaussian over the latent variable z, namely N(μ_n, σ_n), where N represents the Gaussian distribution;
- the Gaussian factor of the historical data c_n is parameterized by a neural network f applied to c_n, whose outputs give μ_n and σ_n, where φ is the parameter of the neural network f;
- the latent variable z is computed to obtain a probabilistic representation, namely the structured noise;
- the distribution of the structured noise z given all the sampled historical data c_{1:N} is obtained by accumulating the Gaussian factors.
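- The following is a minimal sketch of how such Gaussian factors could be accumulated into the structured noise z; the combination rule shown (a precision-weighted product of independent Gaussians, the usual way to accumulate Gaussian factors) and the shape of the encoder network f are assumptions for illustration, not the literal formulation of this application:

```python
import torch
import torch.nn as nn

CONTEXT_DIM, LATENT_DIM = 38, 8               # assumed sizes of one historical record c_n and of z
f = nn.Sequential(nn.Linear(CONTEXT_DIM, 64), nn.ReLU(),
                  nn.Linear(64, 2 * LATENT_DIM))  # network f (parameter phi) outputs (mu_n, raw sigma_n)

def structured_noise(minibatch):
    """minibatch: tensor of shape (N, CONTEXT_DIM) holding the sampled historical data c_1..c_N.
    Returns one sample z of the structured noise obtained by accumulating the Gaussian factors."""
    out = f(minibatch)
    mu_n = out[:, :LATENT_DIM]
    sigma_n = torch.nn.functional.softplus(out[:, LATENT_DIM:]) + 1e-6   # keep std positive
    prec = 1.0 / sigma_n.pow(2)                 # precision of each factor N(mu_n, sigma_n^2)
    var = 1.0 / prec.sum(dim=0)                 # variance of the accumulated (product) Gaussian
    mu = var * (prec * mu_n).sum(dim=0)         # mean of the accumulated Gaussian
    return mu + var.sqrt() * torch.randn(LATENT_DIM)   # reparameterized sample of z

# Usage: z = structured_noise(torch.randn(64, CONTEXT_DIM))  # 64 randomly drawn historical records
```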
- the structured noise may be pre-calculated.
- a minibatch may be extracted from historical data when obtaining the traffic environment state at the current moment, and the structured noise corresponding to the current moment may be calculated.
- Step S24 Obtain the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information.
- this embodiment can acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is a pre-calculated fixed value, and the structured noise used at each moment is the same.
- this embodiment can acquire the traffic state at the current moment and the corresponding structured noise; wherein the structured noise acquired at the current moment is obtained from a plurality of pre-calculated structured noises.
- the structured noise corresponding to the current moment may be obtained cyclically from a plurality of the pre-calculated structured noises. For example, if 100 structured noises are pre-calculated, the structured noise corresponding to the current moment can be obtained cyclically from the 100 structured noises.
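- A trivial sketch of this cyclic selection follows; the list of pre-computed noises and the step index are assumed inputs:

```python
def noise_for_step(precomputed_noises, t):
    """Cyclically select the structured noise for time step t from the pre-computed noises."""
    return precomputed_noises[t % len(precomputed_noises)]
```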
- the specific process of obtaining the structured noise corresponding to the current moment may include: randomly extracting a preset number of data items from the historical data in real time to obtain a corresponding minibatch, then calculating the Gaussian factor of each piece of historical data in the minibatch, and calculating the structured noise corresponding to the minibatch using all the Gaussian factors.
- Step S25 Determine a corresponding execution action by using the traffic environment state and the structured noise through a policy network.
- Step S26 Control the automatic driving vehicle to execute the execution action.
- Step S27 Evaluate the strategy of the strategy network through the evaluation network according to the execution action, and obtain a corresponding reward.
- the evaluation network inherits the pre-trained target network and prediction network, thereby improving the efficiency of automatic driving training.
- Step S28 Update evaluation network parameters through back-propagation operation based on the reward.
- Step S29 Use the policy gradient algorithm to update the policy network parameters.
- the present application provides an automatic driving decision-making method based on the fusion of DRL and structured noise.
- environmental state information is obtained through the vehicle sensor device, and historical data is sampled from the playback buffer (Replay Buffer).
- structured noise is introduced into the policy function and value function to solve the robustness problem of DRL-based automatic driving sequence decision-making, and to avoid the dangerous situation of unstable driving and even causing accidents when the automatic driving vehicle faces a complex environment.
- an embodiment of the present application discloses a specific automatic driving training method, including: (1) acquiring the traffic environment state S_t collected by the vehicle sensor device; (2) designing the reward function r_t for the automatic driving problem under study; (3) using the classical DQN algorithm to pre-train the vehicle for autonomous driving and accumulating the playback buffer data B; (4) sampling historical data c from the playback buffer B and using the Gaussian factors to calculate the probabilistically represented latent variable z, i.e., the structured noise; (5) combining the structured noise z and using the DDPG algorithm to train the vehicle to drive automatically.
- an embodiment of the present application discloses an automatic driving training device, including:
- the data acquisition module 11 is used to acquire the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
- an action determination module 12 configured to use the traffic environment state and the structured noise to determine a corresponding execution action through a policy network
- an action control module 13 configured to control the autonomous driving vehicle to execute the execution action
- the strategy evaluation module 14 is configured to evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward;
- an evaluation network update module 15 configured to update the evaluation network parameters through back-propagation operation based on the return;
- the policy network updating module 16 is used for updating the parameters of the policy network by using the policy gradient algorithm.
- the embodiment of the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; a corresponding execution action is then determined through the policy network using the traffic environment state and the structured noise; the autonomous driving vehicle is then controlled to perform the execution action, and the policy of the policy network is evaluated by the evaluation network according to the execution action to obtain a corresponding reward; the evaluation network parameters are then updated through a back-propagation operation based on the reward, and
- the policy network parameters are updated using the policy gradient algorithm.
- in this way, structured noise based on historical data is introduced during automatic driving training, and since the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
- the device further includes a pre-training module for pre-training the self-driving vehicle by using the DQN algorithm; storing the corresponding pre-training data in a playback buffer, and using the data stored in the playback buffer as the historical data.
- the evaluation network updating module 15 is specifically configured to perform a back-propagation operation on the evaluation network loss function based on the reward, and update the evaluation network parameters in a single step.
- the policy network update module 16 is specifically configured to perform policy gradient operation by using the value function of the evaluation network and the current policy of the policy network to update the policy network parameters.
- the apparatus further includes a structured noise calculation module for pre-calculating the structured noise.
- the structured noise calculation module is specifically used to randomly extract a preset number of data items from the historical data to obtain a corresponding minibatch, calculate the Gaussian factor of each piece of the historical data in the minibatch, and calculate the structured noise corresponding to the minibatch using all the Gaussian factors.
- the structured noise calculation module is specifically configured to randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of the historical data in each of the minibatches, and then calculate the structured noise corresponding to each minibatch using all the Gaussian factors corresponding to that minibatch.
- an embodiment of the present application discloses an automatic driving training device, including a processor 21 and a memory 22; wherein, the memory 22 is used to store a computer program, and the processor 21 is used to execute the computer program to implement the automatic driving training method disclosed in the foregoing embodiments.
- the embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program implements the automatic driving training method disclosed in the foregoing embodiments when the computer program is executed by the processor.
- the steps of a method or algorithm described in connection with the embodiments disclosed herein may be directly implemented in hardware, a software module executed by a processor, or a combination of the two.
- the software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Remote Sensing (AREA)
- Radar, Positioning & Navigation (AREA)
- General Physics & Mathematics (AREA)
- Aviation & Aerospace Engineering (AREA)
- Automation & Control Theory (AREA)
- Electromagnetism (AREA)
- Theoretical Computer Science (AREA)
- Educational Technology (AREA)
- Optics & Photonics (AREA)
- Educational Administration (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Traffic Control Systems (AREA)
Abstract
An automatic driving training method, apparatus and device and a medium. The method comprises: acquiring a traffic environment state of a current moment and corresponding structured noise (S11), the structured noise being determined on the basis of historical data, the historical data being data saved during pretraining of an autonomous vehicle, and the historical data including historical action information and historical traffic environment state information; determining a corresponding execution action using the traffic environment state and the structured noise and by means of a policy network (S12); controlling the autonomous vehicle to perform the execution action (S13); evaluating a policy of the policy network according to the execution action and by means of an evaluation network to obtain a corresponding return (S14); updating parameters of the evaluation network on the basis of the return and by means of a back propagation operation (S15); and updating parameters of the policy network using a policy gradient algorithm (S16). The method can improve the stability of automatic driving training, thereby reducing the probability of occurrence of dangerous accidents.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 8, 2020, with application number 202010934770.9 and the invention title "An automatic driving training method, apparatus, device and medium", the entire contents of which are incorporated by reference in this application.
The present application relates to the technical field of automatic driving, and in particular, to an automatic driving training method, apparatus, device and medium.
In modern urban traffic, the number of motor vehicles is increasing day by day, road congestion is serious, and traffic accidents are frequent. Studies have shown that each person wastes up to 3 years of time in a lifetime due to traffic congestion, and 90% of traffic accidents are caused by human operating mistakes or errors. In order to minimize the harm caused by human factors, attention has turned to the field of autonomous driving. According to the driver's degree of participation while the vehicle is running, automatic driving is divided into six levels from low to high, Level-0 to Level-5, namely human driving, assisted driving, partial automatic driving, conditional automatic driving, highly automatic driving and fully automatic driving. At present, mainstream autonomous driving companies or projects generally reach Level-3. Autonomous driving is a very complex integrated technology, covering hardware devices such as on-board sensors, data processors and controllers, and it requires modern mobile communication and network technologies as support to realize information transmission and sharing among traffic participants such as vehicles, pedestrians and non-motor vehicles, to complete functions such as sensing and perception, decision planning and control execution in complex environments, and to realize automatic acceleration/deceleration, steering, overtaking, braking and other operations of the vehicle to ensure driving safety. Referring to FIG. 1, FIG. 1 is a schematic diagram of a control architecture of an automatic driving vehicle provided by an embodiment of the present application.
Computer simulation of automatic driving systems based on a simulator environment is a basic key technology for testing and experimenting with autonomous driving vehicles; it can effectively ensure the safety of autonomous driving vehicles and accelerate autonomous driving research and application. Existing automatic driving simulations are mainly divided into two categories, namely the Modular Pipeline and the End-to-End Pipeline. Referring to FIG. 2, FIG. 2 is a schematic diagram of a modular method in the prior art provided by this application. The automatic driving system is decomposed into several independent but interrelated modules, such as Perception, Localization, Planning and Control modules, which have good interpretability and allow the problem module to be located quickly when the system fails; it is a conventional method widely used in the industry at this stage. However, modular construction and maintenance of the system is difficult, and the system is not easy to update when facing new complex scenarios. Referring to FIG. 3, FIG. 3 is a schematic diagram of an end-to-end method in the prior art provided by this application. The end-to-end method regards the automatic driving problem as a machine learning problem and directly optimizes the entire "sensor data processing - generating control commands - executing commands" pipeline. The end-to-end method is simple to build and has developed rapidly in the field of autonomous driving, but the method itself is a "black box" with poor interpretability. End-to-end methods also come in two forms, namely the Open-loop imitation learning method and the Closed-loop reinforcement learning method. Referring to FIG. 4, FIG. 4 is a schematic diagram of an Open-loop imitation learning method in the prior art provided by this application. The Open-loop imitation learning method learns to drive autonomously in a supervised manner by imitating the behavior of human drivers, emphasizing a kind of "predictive ability". FIG. 5 is a schematic diagram of a Closed-loop reinforcement learning method in the prior art provided by this application. The Closed-loop reinforcement learning method uses the Markov Decision Process (MDP) to explore and improve automatic driving strategies from scratch, emphasizing a kind of "driving ability".
Reinforcement Learning (RL) is a type of machine learning method that has developed rapidly in recent years; its agent-environment interaction mechanism and sequential decision-making mechanism are close to the process of human learning, so it is also regarded as a key step towards "Artificial General Intelligence (AGI)". The deep reinforcement learning (DRL) algorithm, which combines deep learning (DL), can automatically learn abstract representations of large-scale input data and offers better decision-making performance; it has been widely used in video games, mechanical control, advertising recommendation, financial transactions, urban transportation and other fields.
When DRL is applied to autonomous driving problems, it does not require domain expert knowledge, nor does it need to build a model; it has wide adaptability and can cope with changing and complex road environments. However, when DRL-based autonomous vehicles learn autonomous driving from scratch, selecting poor actions in the sequential decision-making process leads to a large training variance, which manifests as unstable driving of the vehicle, and even accidents such as running out of the lane or collisions. Existing research results show that, compared with the modular method and the Open-loop imitation learning method, DRL-based autonomous driving training has the worst stability and is very sensitive to changes in the environment and weather.
SUMMARY OF THE INVENTION
In view of this, the purpose of the present application is to provide an automatic driving training method, apparatus, device and medium, which can improve the stability of automatic driving training, thereby reducing the probability of dangerous accidents. The specific scheme is as follows:
In a first aspect, the present application discloses an automatic driving training method, including:
obtaining the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved in the process of pre-training an autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
determining a corresponding execution action through a policy network using the traffic environment state and the structured noise;
controlling the autonomous driving vehicle to perform the execution action;
evaluating the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward;
updating evaluation network parameters through a back-propagation operation based on the reward;
updating policy network parameters using a policy gradient algorithm.
Optionally, the automatic driving training method further includes:
pre-training the autonomous driving vehicle using the DQN algorithm;
storing the corresponding pre-training data in a playback buffer, and using the data stored in the playback buffer as the historical data.
Optionally, the updating of the evaluation network parameters through a back-propagation operation based on the reward includes:
performing a back-propagation operation on the evaluation network loss function based on the reward, and updating the evaluation network parameters in a single step.
Optionally, the updating of the policy network parameters using the policy gradient algorithm includes:
performing a policy gradient operation using the value function of the evaluation network and the current policy of the policy network, and updating the policy network parameters.
Optionally, the automatic driving training method further includes:
pre-calculating the structured noise.
Optionally, the pre-calculating of the structured noise includes:
randomly extracting a preset number of data items from the historical data to obtain a corresponding minibatch;
calculating the Gaussian factor of each piece of the historical data in the minibatch;
calculating the structured noise corresponding to the minibatch using all the Gaussian factors.
Optionally, the pre-calculating of the structured noise includes:
randomly extracting data from the historical data to obtain multiple minibatches;
calculating the Gaussian factor of each piece of the historical data in each of the minibatches, and then calculating the structured noise corresponding to each minibatch using all the Gaussian factors corresponding to that minibatch.
In a second aspect, the present application discloses an automatic driving training apparatus, including:
a data acquisition module, configured to obtain the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved in the process of pre-training an autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
an action determination module, configured to determine a corresponding execution action through a policy network using the traffic environment state and the structured noise;
an action control module, configured to control the autonomous driving vehicle to perform the execution action;
a policy evaluation module, configured to evaluate the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward;
an evaluation network update module, configured to update the evaluation network parameters through a back-propagation operation based on the reward;
a policy network update module, configured to update the policy network parameters using a policy gradient algorithm.
In a third aspect, the present application discloses an automatic driving training device, including a processor and a memory; wherein,
the memory is used to store a computer program;
the processor is configured to execute the computer program to implement the aforementioned automatic driving training method.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program, wherein when the computer program is executed by a processor, the aforementioned automatic driving training method is implemented.
It can be seen that the present application obtains the traffic environment state at the current moment and the corresponding structured noise, wherein the structured noise is structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; a corresponding execution action is then determined through the policy network using the traffic environment state and the structured noise; the autonomous driving vehicle is then controlled to perform the execution action, and the policy of the policy network is evaluated by the evaluation network according to the execution action to obtain a corresponding reward; the evaluation network parameters are then updated through a back-propagation operation based on the reward, and the policy network parameters are updated using the policy gradient algorithm. In this way, structured noise based on historical data is introduced in the training process of automatic driving, and since the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of an autonomous vehicle control architecture provided by the present application;
FIG. 2 is a schematic diagram of a modular method in the prior art;
FIG. 3 is a schematic diagram of an end-to-end method in the prior art;
FIG. 4 is a schematic diagram of an open-loop imitation learning method in the prior art;
FIG. 5 is a schematic diagram of a closed-loop reinforcement learning method in the prior art;
FIG. 6 is a flowchart of an automatic driving training method disclosed in the present application;
FIG. 7 is a schematic diagram of automatic driving training disclosed in the present application;
FIG. 8 is a flowchart of a specific automatic driving training method disclosed in the present application;
FIG. 9 is a flowchart of a specific automatic driving training method disclosed in the present application;
FIG. 10 is a schematic structural diagram of an automatic driving training apparatus disclosed in the present application;
FIG. 11 is a structural diagram of an automatic driving training device disclosed in the present application.
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
When DRL is applied to autonomous driving, it requires neither domain expert knowledge nor an explicit model, has broad adaptability, and can cope with constantly changing, complex road environments. However, a DRL-based autonomous vehicle learns to drive from scratch, and steps in the sequential decision-making process where poor actions are selected lead to a large training variance, which manifests as unstable driving and even accidents such as running out of the lane or collisions. Existing research shows that, compared with the modular method and the open-loop imitation learning method, DRL-based autonomous driving training has the worst stability and is very sensitive to changes in environment and weather. For this reason, the present application provides an automatic driving training solution that can improve the stability of automatic driving training and thereby reduce the probability of dangerous accidents.
Referring to FIG. 6, an embodiment of the present application discloses an automatic driving training method, including:
Step S11: Acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information.
That is, the traffic environment state S_t at the current moment and the corresponding structured noise z_t are obtained.
It should be pointed out that the sequential decision-making process of a DRL-based automatic driving system is as follows: at time t, the autonomous vehicle (i.e., the agent) observes the state S_t of its environment, such as dynamic information including the position, speed and acceleration of itself and other traffic participants, traffic lights, and road topology features; a nonlinear neural network (NN) is used to represent the policy π_θ, and a vehicle action a_t is selected, such as acceleration/deceleration, steering, lane change or braking. At the next time t+1, the environment calculates the reward r_{t+1} according to the action a_t taken by the autonomous vehicle, combined with set benchmarks such as the average driving speed of the autonomous vehicle, the distance from the lane center, running a red light and collisions, and enters a new state S_{t+1}. The autonomous vehicle adjusts the policy π_θ according to the obtained reward r_{t+1} and, combined with the new state S_{t+1}, enters the next decision-making process. By making sequential decisions through the interaction between the autonomous vehicle and the environment, the optimal policy is learned, the autonomous vehicle obtains the maximum cumulative reward, and smooth, safe driving is achieved. Existing DRL-based autonomous driving research mostly uses algorithms that can handle continuous action spaces, such as the Deep Deterministic Policy Gradient (DDPG) algorithm, the Trust Region Policy Optimization (TRPO) algorithm and the Proximal Policy Optimization (PPO) algorithm. In this embodiment, DRL may be fused with structured noise to make automatic driving decisions. Considering the continuity of the state space and action space of the automatic driving problem, this embodiment may use the DDPG algorithm, which has high sample efficiency and computational efficiency. In some other embodiments, the Asynchronous Advantage Actor-Critic (A3C) algorithm, the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm, or the Soft Actor-Critic (SAC) algorithm may also be used.
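Purely for illustration, the sequential decision loop described above can be sketched as follows; the env and policy interfaces are hypothetical placeholders and are not part of the present application:

```python
# Minimal sketch of the DRL sequential decision loop (assumed interfaces).
def run_episode(env, policy, structured_noise):
    state = env.reset()                                   # initial traffic environment state S_t
    done = False
    while not done:
        action = policy.select(state, structured_noise)   # a_t = pi_theta(s, z)
        next_state, reward, done = env.step(action)       # environment returns r_{t+1} and S_{t+1}
        policy.update(state, action, reward, next_state)  # adjust pi_theta using the reward
        state = next_state
```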
In a specific implementation, this embodiment may acquire the traffic environment state data collected by vehicle sensors. Specifically, the driving environment state may be obtained by means of on-board sensor devices such as cameras, GPS (Global Positioning System), IMU (Inertial Measurement Unit), millimeter-wave radar and LiDAR, including weather data, traffic lights, traffic topology information, and the positions and running states of the autonomous vehicle and other traffic participants. Moreover, the traffic environment state in this embodiment includes not only the raw image data obtained directly by the camera, but also depth maps and semantic segmentation maps obtained through deep learning models such as RefineNet. For the autonomous vehicle, the state information that can be obtained directly includes: the driving speed v and lateral speed u of the vehicle; the steering angle δ of the steering wheel; the distance deviation ΔL between the vehicle center and the road center line; and the distances Δx_i, i = 1 to 4, between the vehicle and the nearest traffic participants in the four directions.
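As a minimal sketch of how the directly measurable quantities listed above could be packed into the state fed to the policy network (the field names and the flattening scheme below are assumptions, not specified by the application):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class EgoState:
    v: float                                  # driving speed
    u: float                                  # lateral speed
    delta: float                              # steering wheel angle
    delta_L: float                            # deviation from the lane center line
    dx: Tuple[float, float, float, float]     # distances to the nearest participants in four directions

    def as_vector(self) -> list:
        # Flatten into the numeric state vector consumed by the policy network.
        return [self.v, self.u, self.delta, self.delta_L, *self.dx]
```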
Step S12: Determine a corresponding execution action through a policy network by using the traffic environment state and the structured noise.
In a specific implementation, the Actor Net (policy network) selects an action a_t based on the policy function π_θ(a|s,z), and the autonomous vehicle completes the corresponding action, such as "change lanes to the left", where θ is the network parameter of the Actor Net, s represents the traffic environment state, and z represents the structured noise.
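A minimal sketch of such an Actor Net, assuming the structured noise z is simply concatenated to the state before the forward pass; the layer sizes, activations and the concatenation scheme are assumptions rather than details given by the application:

```python
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Deterministic policy pi_theta(s, z) -> a (illustrative sketch)."""
    def __init__(self, state_dim: int, noise_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bounded continuous actions
        )

    def forward(self, state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Concatenate the traffic environment state with the structured noise.
        return self.net(torch.cat([state, z], dim=-1))
```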
Step S13: Control the autonomous vehicle to perform the execution action.
Step S14: Evaluate the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward.
In a specific implementation, the Critic Net (evaluation network) evaluates the policy of the Actor Net based on the value function Q_ω(s,a,z) according to the action a_t performed by the autonomous vehicle, and obtains the reward r_{t+1} given by the traffic environment, where ω is the network parameter of the Critic Net.
The value function Q_ω(s,a,z) is obtained by transformation from a preset reward function.
It should be noted that, in the embodiment of the present application, the reward function r_t for the studied automatic driving problem may also be designed in advance. Considering the specific scenario of the automatic driving simulation and evaluation indicators such as the average driving speed of the autonomous vehicle, the distance from the lane center, the duration of traffic disruption, and whether the vehicle crosses a line, runs a red light or collides, the reward function of the autonomous vehicle can be designed in different forms. Taking the simulation scenario of a vehicle lane change as an example, the reward function can be designed according to factors such as whether the lane change succeeds, whether traffic is disrupted, and whether a collision occurs; in the designed reward, v is the driving speed of the autonomous vehicle, v_ref is the reference speed set according to the road speed limit, and λ is a manually set coefficient.
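The exact piecewise reward is given by the application's original equation and is not reproduced here; purely as an illustration of the kind of lane-change reward shaping described above, a sketch might look like the following, where every threshold and penalty value is an assumption:

```python
def lane_change_reward(v, v_ref, lam, collided, off_lane, lane_change_success):
    """Illustrative reward sketch: speed tracking plus event-based terms (assumed values)."""
    if collided or off_lane:
        return -1.0                              # heavy penalty for a dangerous outcome
    reward = -lam * abs(v - v_ref) / v_ref       # encourage driving near the reference speed
    if lane_change_success:
        reward += 1.0                            # bonus for completing the lane change
    return reward
```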
Moreover, the value function can be calculated from the reward function as an expected discounted sum of rewards, where γ∈(0,1] is the discount factor and E denotes the expectation operation. Since structured noise is introduced in this embodiment, the corresponding value function is Q_ω(s,a,z).
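The original formula is not reproduced above; assuming the standard expected discounted return used with actor-critic methods such as DDPG, the value function would take a form such as:

```latex
Q_{\omega}(s, a, z) \;=\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a,\; z\right], \qquad \gamma \in (0, 1].
```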
Step S15: Update evaluation network parameters through a back-propagation operation based on the reward.
In a specific implementation, a back-propagation operation with respect to the evaluation network loss function is performed based on the reward, and the evaluation network parameters are updated in a single step. Specifically, the evaluation network loss function is minimized through the back-propagation operation, and the network parameter ω is updated in a single step. The evaluation network loss function takes the mean-squared temporal-difference form L(ω) = (1/N) Σ_t (y_t − Q_ω(s_t, a_t, z_t))², where y_t = r_{t+1} + γQ′_ω(s_{t+1}, a_{t+1}, z_{t+1}). Here Q′_ω(s_{t+1}, a_{t+1}, z_{t+1}) and Q_ω(s_t, a_t, z_t) are the value functions of the target network and the prediction network, respectively, N is the number of collected samples, and γ∈(0,1] is the discount factor. The target network and the prediction network are neural networks designed based on the DQN (Deep Q-Network, i.e., deep value-function neural network) algorithm.
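A minimal PyTorch-style sketch of this single-step critic update, assuming the critic takes (state, action, structured noise) as input and that target_actor and target_critic are separately maintained copies; all names and the batch layout are hypothetical:

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, optimizer, batch, gamma=0.99):
    """One TD-style update of the evaluation (critic) network parameters omega."""
    s, a, r, s_next, z = batch                                # minibatch of N transitions plus structured noise
    with torch.no_grad():
        a_next = target_actor(s_next, z)                      # a_{t+1} from the target policy
        y = r + gamma * target_critic(s_next, a_next, z)      # TD target y_t
    q = critic(s, a, z)                                       # Q_omega(s_t, a_t, z_t)
    loss = F.mse_loss(q, y)                                   # mean-squared TD error over the minibatch
    optimizer.zero_grad()
    loss.backward()                                           # back-propagation operation
    optimizer.step()                                          # single-step update of omega
    return loss.item()
```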
Step S16: Update policy network parameters by using a policy gradient algorithm.
In a specific implementation, this embodiment may perform a policy gradient operation by using the value function of the evaluation network and the current policy of the policy network, so as to update the policy network parameters.
Specifically, this embodiment updates the network parameter θ of the Actor Net through the policy gradient ∇_θ J(θ) = E[∇_a Q_ω(s,a,z)|_{a=π_θ(s,z)} · ∇_θ π_θ(s,z)], where J(θ) is the objective function of the policy gradient method, usually expressed in some form of the reward; ∇_a Q_ω(s,a,z) is obtained by differentiating the value function of the Critic Net with respect to the action a, and ∇_θ π_θ(s,z) is obtained by differentiating the policy of the Actor Net at the current step. The task of the policy gradient method is to maximize the objective function, which is achieved by gradient ascent. After the policy gradient is obtained from the above formula, the network parameter θ is updated through θ ← θ + α·∇_θ J(θ), where α is a fixed step-size parameter.
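A matching actor-update sketch under the same assumptions as the critic sketch above; gradient ascent on J(θ) is implemented as gradient descent on its negative, and all names are hypothetical:

```python
def actor_update(actor, critic, optimizer, batch):
    """One policy-gradient update of the policy network parameters theta."""
    s, _, _, _, z = batch
    a = actor(s, z)                        # a = pi_theta(s, z)
    actor_loss = -critic(s, a, z).mean()   # maximize Q by minimizing its negative
    optimizer.zero_grad()
    actor_loss.backward()                  # autograd yields grad_a Q * grad_theta pi
    optimizer.step()                       # theta <- theta + alpha * policy gradient
    return actor_loss.item()
```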
The above steps S11 to S15 are repeated until the automatic driving ends.
For example, referring to FIG. 7, FIG. 7 is a schematic diagram of automatic driving training disclosed in the present application. Combined with the structured noise z, the DDPG algorithm is used to train the vehicle for autonomous driving. The DDPG algorithm is a typical Actor-Critic reinforcement learning algorithm, in which the policy network (Actor Net) updates the policy according to the value function fed back by the evaluation network (Critic Net), while the Critic Net trains the value function and performs single-step updates using the temporal-difference (TD) method. In addition, the Critic Net includes a target network (Target Net) and a prediction network (Pred Net) designed based on the DQN algorithm, and the value functions of both networks are used when the network parameters are updated. The Actor Net and the Critic Net work together so that the actions selected by the agent obtain the maximum cumulative reward.
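To make the Target Net / Pred Net relationship concrete, the following sketch shows one common way such target networks can be maintained; the soft-update rule with coefficient tau is an assumption, since the application does not specify how the target parameters are refreshed:

```python
import copy
import torch

def make_target(net: torch.nn.Module) -> torch.nn.Module:
    """Create a target network as a frozen copy of the prediction network."""
    target = copy.deepcopy(net)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 0.005):
    """Slowly track the prediction network: theta_target <- tau*theta + (1-tau)*theta_target."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)
```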
It can be seen that the embodiment of the present application acquires the traffic environment state at the current moment and the corresponding structured noise, where the structured noise is determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information; then determines a corresponding execution action through a policy network by using the traffic environment state and the structured noise; then controls the autonomous vehicle to perform the execution action and evaluates the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward; and then updates evaluation network parameters through a back-propagation operation based on the reward and updates policy network parameters by using a policy gradient algorithm. In this way, structured noise based on historical data is introduced into the automatic driving training process, and because the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
Referring to FIG. 8, an embodiment of the present application discloses a specific automatic driving training method, including:
Step S21: Pre-train the autonomous vehicle by using the DQN algorithm.
Step S22: Store the corresponding pre-training data in a replay buffer, and use the data stored in the replay buffer as the historical data.
In a specific implementation, the classical DQN algorithm is used to pre-train the vehicle for autonomous driving and to accumulate replay buffer data B. Using the classical DQN method, two neural networks with the same structure but different parameters are constructed: a target network (Target Net) whose parameters are updated at fixed intervals and a prediction network (Pred Net) whose parameters are updated at every step. Taking the simulation scenario of a vehicle lane change as an example, the action space of the autonomous vehicle at each time t is [a_t1, a_t2, a_t3], representing "change lanes to the left", "change lanes to the right" and "keep the current lane", respectively. Both the Target Net and the Pred Net use a simple 3-layer neural network containing only one hidden layer. The traffic environment state S_t collected by the vehicle sensor devices is input, the target value Q_target and the predicted value Q_pred are calculated, and the action a_t corresponding to the largest Q_pred is selected as the driving action of the autonomous vehicle. According to the designed reward function, the reward r_{t+1} is obtained, a new traffic environment state S_{t+1} is entered, and the learning experience c_t = (s_t, a_t, r_t, s_{t+1}) is stored in the replay buffer. The RMSProp optimizer is used to update the network parameters to minimize the loss function, and the autonomous vehicle is continuously pre-trained until sufficient replay buffer data B has been accumulated.
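An illustrative sketch of this pre-training loop, assuming a discrete three-action head and greedy selection on Q_pred as described; the network sizes, the env interface and the buffer capacity are assumptions:

```python
from collections import deque
import torch
import torch.nn as nn

class PredNet(nn.Module):
    """Simple 3-layer Q-network with one hidden layer (sketch)."""
    def __init__(self, state_dim: int, n_actions: int = 3, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def pretrain(env, pred_net, steps: int, buffer_size: int = 100_000):
    """Accumulate replay buffer data B by acting greedily on Q_pred."""
    buffer = deque(maxlen=buffer_size)
    s = env.reset()
    for _ in range(steps):
        with torch.no_grad():
            q_pred = pred_net(torch.as_tensor(s, dtype=torch.float32))
        a = int(q_pred.argmax())              # action with the largest Q_pred
        s_next, r, done = env.step(a)
        buffer.append((s, a, r, s_next))      # learning experience c_t
        s = env.reset() if done else s_next
    return buffer
```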
Step S23: Calculate the structured noise.
In one specific implementation, this embodiment may randomly extract a preset number of pieces of data from the historical data to obtain a corresponding minibatch; calculate the Gaussian factor of each piece of the historical data in the minibatch; and calculate the structured noise corresponding to the minibatch by using all the Gaussian factors.
In another specific implementation, this embodiment may randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of historical data in each minibatch, and then calculate the structured noise corresponding to each minibatch by using all the Gaussian factors corresponding to that minibatch.
That is, multiple structured noises can be calculated by using multiple minibatches, so that different structured noises can be used during automatic driving training to improve the robustness of automatic driving.
Specifically, a minibatch b_i ~ B can be randomly drawn from the replay buffer B, where the minibatch b_i contains N pieces of historical data c_{1:N} = (s_n, a_n, r_n, s_{n+1}), n = 1 to N. The Gaussian factor of each piece of historical data is then calculated: the Gaussian factor of each sampled piece of historical data c_n is Ψ_φ(z|c_n) = N(μ_n, σ_n), where N denotes the Gaussian distribution. The Gaussian factor of the historical data c_n is computed with a neural network (NN) whose outputs give the mean μ_n and the variance σ_n, and φ is the parameter of the neural network f. In this way, the latent variable represented as a probability, namely the structured noise, is obtained. The structured noise of each sampled minibatch b_i is z ~ q_φ(z|c_{1:N}), where q_φ(z|c_{1:N}) is obtained by multiplying together the Gaussian factors Ψ_φ(z|c_n) of the individual pieces of historical data c_n, i.e., q_φ(z|c_{1:N}) ∝ ∏_{n=1}^{N} Ψ_φ(z|c_n).
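A sketch of this Gaussian-factor computation, assuming diagonal Gaussians and the usual closed-form product of independent Gaussian factors; the encoder architecture and the precision-weighted combination are assumptions consistent with, but not mandated by, the description above:

```python
import torch
import torch.nn as nn

class GaussianFactorEncoder(nn.Module):
    """f_phi: maps one transition c_n = (s_n, a_n, r_n, s_{n+1}) to (mu_n, sigma_n)."""
    def __init__(self, c_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(c_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_sigma = nn.Linear(hidden, z_dim)

    def forward(self, c):
        h = self.body(c)
        return self.mu(h), self.log_sigma(h).exp()

def structured_noise(encoder, minibatch):
    """Combine per-transition Gaussian factors by product and sample z ~ q_phi(z | c_1:N)."""
    mu, sigma = encoder(minibatch)                       # shapes: (N, z_dim)
    prec = 1.0 / sigma.pow(2)                            # precision of each factor
    var = 1.0 / prec.sum(dim=0)                          # product-of-Gaussians variance
    mean = var * (prec * mu).sum(dim=0)                  # product-of-Gaussians mean
    return mean + var.sqrt() * torch.randn_like(mean)    # reparameterized sample of z
```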
That is, in this embodiment the structured noise may be pre-calculated. In some other embodiments, when the traffic environment state at the current moment is acquired, a minibatch may be extracted from the historical data to calculate the structured noise corresponding to the current moment.
Step S24: Acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information.
In one specific implementation, this embodiment may acquire the traffic environment state at the current moment and the corresponding structured noise, where the structured noise is a pre-calculated fixed value and the same structured noise is used at every moment.
In another specific implementation, this embodiment may acquire the traffic state at the current moment and the corresponding structured noise, where the structured noise acquired at the current moment is one structured noise taken from multiple pre-calculated structured noises. Specifically, the structured noise corresponding to the current moment may be taken cyclically from the multiple pre-calculated structured noises. For example, if 100 structured noises are pre-calculated, the structured noise corresponding to the current moment can be taken from the 100 structured noises in a cyclic manner. Of course, in some other embodiments, the specific process of acquiring the structured noise corresponding to the current moment may include: randomly extracting a preset number of pieces of data from the historical data in real time to obtain a corresponding minibatch, calculating the Gaussian factor of each piece of historical data in the minibatch, and calculating the structured noise corresponding to the minibatch by using all the Gaussian factors.
It can be understood that training with different structured noises can improve the robustness of automatic driving.
Step S25: Determine a corresponding execution action through the policy network by using the traffic environment state and the structured noise.
Step S26: Control the autonomous vehicle to perform the execution action.
Step S27: Evaluate the policy of the policy network through the evaluation network according to the execution action to obtain a corresponding reward.
In a specific implementation, the evaluation network inherits the target network and the prediction network obtained from pre-training, thereby improving the efficiency of automatic driving training.
Step S28: Update evaluation network parameters through a back-propagation operation based on the reward.
Step S29: Update policy network parameters by using the policy gradient algorithm.
That is, the present application provides an automatic driving decision-making method that fuses DRL with structured noise. On an automatic driving simulation platform, environment state information is obtained through vehicle sensor devices, historical data is sampled from the replay buffer, and structured noise is introduced into the policy function and the value function by means of the Gaussian factor algorithm, which addresses the robustness problem of DRL-based automatic driving sequential decision-making and avoids dangerous situations in which the autonomous vehicle drives unstably or even causes accidents when facing a complex environment. For example, referring to FIG. 9, an embodiment of the present application discloses a specific automatic driving training method, including: (1) acquiring the traffic environment state S_t collected by the vehicle sensor devices; (2) designing the reward function r_t for the studied automatic driving problem; (3) pre-training the vehicle for autonomous driving by using the classical DQN algorithm and accumulating replay buffer data B; (4) sampling historical data c from the replay buffer B and using Gaussian factors to calculate the latent variable z represented as a probability, i.e., the structured noise; (5) training the vehicle for autonomous driving by using the DDPG algorithm combined with the structured noise z.
Referring to FIG. 10, an embodiment of the present application discloses an automatic driving training apparatus, including:
a data acquisition module 11, configured to acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information;
an action determination module 12, configured to determine a corresponding execution action through a policy network by using the traffic environment state and the structured noise;
an action control module 13, configured to control the autonomous vehicle to perform the execution action;
a policy evaluation module 14, configured to evaluate the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward;
an evaluation network update module 15, configured to update evaluation network parameters through a back-propagation operation based on the reward;
a policy network update module 16, configured to update policy network parameters by using a policy gradient algorithm.
It can be seen that the embodiment of the present application acquires the traffic environment state at the current moment and the corresponding structured noise, where the structured noise is determined based on historical data, the historical data is data saved during pre-training of the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information; then determines a corresponding execution action through a policy network by using the traffic environment state and the structured noise; then controls the autonomous vehicle to perform the execution action and evaluates the policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward; and then updates evaluation network parameters through a back-propagation operation based on the reward and updates policy network parameters by using a policy gradient algorithm. In this way, structured noise based on historical data is introduced into the automatic driving training process, and because the historical data includes historical action information and historical traffic environment state information, the stability of automatic driving training can be improved, thereby reducing the probability of dangerous accidents.
The apparatus further includes a pre-training module, configured to pre-train the autonomous vehicle by using the DQN algorithm, store the corresponding pre-training data in a replay buffer, and use the data stored in the replay buffer as the historical data.
The evaluation network update module 15 is specifically configured to perform, based on the reward, a back-propagation operation with respect to the evaluation network loss function, and to update the evaluation network parameters in a single step.
The policy network update module 16 is specifically configured to perform a policy gradient operation by using the value function of the evaluation network and the current policy of the policy network, so as to update the policy network parameters.
The apparatus further includes a structured noise calculation module, configured to pre-calculate the structured noise.
In one specific implementation, the structured noise calculation module is specifically configured to randomly extract a preset number of pieces of data from the historical data to obtain a corresponding minibatch, calculate the Gaussian factor of each piece of historical data in the minibatch, and calculate the structured noise corresponding to the minibatch by using all the Gaussian factors.
In another specific implementation, the structured noise calculation module is specifically configured to randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of historical data in each minibatch, and then calculate the structured noise corresponding to each minibatch by using all the Gaussian factors corresponding to that minibatch.
Referring to FIG. 11, an embodiment of the present application discloses an automatic driving training device, including a processor 21 and a memory 22, where the memory 22 is configured to store a computer program, and the processor 21 is configured to execute the computer program to implement the automatic driving training method disclosed in the foregoing embodiments.
For the specific process of the above automatic driving training method, reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.
Further, an embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, where the computer program, when executed by a processor, implements the automatic driving training method disclosed in the foregoing embodiments.
For the specific process of the above automatic driving training method, reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.
The embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments; for the same or similar parts among the embodiments, reference may be made to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and for relevant details reference may be made to the description of the method.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The automatic driving training method, apparatus, device and medium provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation of the present application.
Claims (10)
- An automatic driving training method, comprising: acquiring a traffic environment state at a current moment and corresponding structured noise, wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of an autonomous vehicle, and the historical data comprises historical action information and historical traffic environment state information; determining a corresponding execution action through a policy network by using the traffic environment state and the structured noise; controlling the autonomous vehicle to perform the execution action; evaluating a policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward; updating evaluation network parameters through a back-propagation operation based on the reward; and updating policy network parameters by using a policy gradient algorithm.
- The automatic driving training method according to claim 1, further comprising: pre-training the autonomous vehicle by using a DQN algorithm; and storing corresponding pre-training data in a replay buffer, and using the data stored in the replay buffer as the historical data.
- The automatic driving training method according to claim 1, wherein the updating evaluation network parameters through a back-propagation operation based on the reward comprises: performing, based on the reward, a back-propagation operation with respect to an evaluation network loss function, and updating the evaluation network parameters in a single step.
- The automatic driving training method according to claim 1, wherein the updating policy network parameters by using a policy gradient algorithm comprises: performing a policy gradient operation by using a value function of the evaluation network and a current policy of the policy network, and updating the policy network parameters.
- The automatic driving training method according to any one of claims 1 to 4, further comprising: pre-calculating the structured noise.
- The automatic driving training method according to claim 5, wherein the pre-calculating the structured noise comprises: randomly extracting a preset number of pieces of data from the historical data to obtain a corresponding minibatch; calculating a Gaussian factor of each piece of the historical data in the minibatch; and calculating the structured noise corresponding to the minibatch by using all the Gaussian factors.
- The automatic driving training method according to claim 5, wherein the pre-calculating the structured noise comprises: randomly extracting data from the historical data to obtain multiple minibatches; and calculating a Gaussian factor of each piece of the historical data in each minibatch, and then calculating the structured noise corresponding to each minibatch by using all the Gaussian factors corresponding to that minibatch.
- An automatic driving training apparatus, comprising: a data acquisition module, configured to acquire a traffic environment state at a current moment and corresponding structured noise, wherein the structured noise is structured noise determined based on historical data, the historical data is data saved during pre-training of an autonomous vehicle, and the historical data comprises historical action information and historical traffic environment state information; an action determination module, configured to determine a corresponding execution action through a policy network by using the traffic environment state and the structured noise; an action control module, configured to control the autonomous vehicle to perform the execution action; a policy evaluation module, configured to evaluate a policy of the policy network through an evaluation network according to the execution action to obtain a corresponding reward; an evaluation network update module, configured to update evaluation network parameters through a back-propagation operation based on the reward; and a policy network update module, configured to update policy network parameters by using a policy gradient algorithm.
- An automatic driving training device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program to implement the automatic driving training method according to any one of claims 1 to 7.
- A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the automatic driving training method according to any one of claims 1 to 7.