
CN117496707A - Intersection decision-making method based on multi-agent deep reinforcement learning - Google Patents

Intersection decision-making method based on multi-agent deep reinforcement learning

Info

Publication number
CN117496707A
CN117496707A (application CN202311451381.0A)
Authority
CN
China
Prior art keywords
agent
vehicle
network
action
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311451381.0A
Other languages
Chinese (zh)
Inventor
杜煜
江安旎
赵世昕
王翊伟
陈泽宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN202311451381.0A priority Critical patent/CN117496707A/en
Publication of CN117496707A publication Critical patent/CN117496707A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G08: SIGNALLING
    • G08G: TRAFFIC CONTROL SYSTEMS
    • G08G1/00: Traffic control systems for road vehicles
    • G08G1/01: Detecting movement of traffic to be counted or controlled
    • G08G1/0104: Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0108: Measuring and analyzing of parameters relative to traffic conditions based on the source of data
    • G08G1/0116: Measuring and analyzing of parameters relative to traffic conditions based on the source of data from roadside infrastructure, e.g. beacons
    • G08G1/0125: Traffic data processing
    • G08G1/0129: Traffic data processing for creating historical data or processing based on historical data
    • G08G1/0137: Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G08G1/09: Arrangements for giving variable traffic instructions
    • G08G1/0962: Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0967: Systems involving transmission of highway information, e.g. weather, speed limits
    • G08G1/096708: Systems involving transmission of highway information, e.g. weather, speed limits where the received information might be used to generate an automatic action on the vehicle control
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Atmospheric Sciences (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides an intersection decision-making method based on multi-agent deep reinforcement learning. Each agent interacts with the environment to obtain observable agent information and local environment information; the overall state information is passed to a model for learning, and all agents are trained centrally. Exploratory actions are taken according to the environment, the agents produce the corresponding decisions, and the decisions are finally passed down to the low-level control of the agent vehicles. Each time the current reward is obtained, a new round of training is carried out using the current reward and the next state, thereby completing the construction of the model. By improving the deep neural network model, the method optimizes the action-exploration process, reduces the exploration of invalid actions, strengthens the robustness of the decision model, improves the model's performance in complex scenes, and lowers the training difficulty. A reward function with multiple constraints increases the generalization capability of the method in complex scenes.

Description

Intersection decision method based on multi-agent deep reinforcement learning
Technical Field
The invention relates to the technical field of unmanned driving, in particular to an intersection decision method based on multi-agent deep reinforcement learning.
Background
The automobile has become an indispensable means of transportation in daily life, and with the sharp increase in the number of motor vehicles, urban congestion and the probability of traffic accidents have also risen. With the development of artificial intelligence, unmanned driving has become a research hot spot. Compared with human driving, unmanned driving technology can improve traffic efficiency, reduce the incidence of traffic accidents, and save human resources. At present, unmanned technology is at the driver-assistance stage: most automobiles on the market carry related functions and the public has gradually come to accept them, but realizing high-level unmanned driving remains a great challenge.
Currently, high-level or fully unmanned vehicles are limited to trial operation in specific scenes and cannot be well deployed on everyday urban roads. In a daily urban scene such as a crossroad, the environment is complex and highly random; an unmanned vehicle must predict every social vehicle that may conflict with its planned path, interact with those vehicles continuously, and make effective and safe decisions. Moreover, because the number of vehicles in an intersection scene is large and their trajectories differ, the traffic efficiency of the intersection is also very important. Existing methods are mainly based on a single agent and handle only a single scene. Since crossroad traffic is complex and traffic flow is heavy, a single agent can only consider the decision control of a single vehicle, which easily leads to congestion and collisions. Facing heavy traffic, single-agent vehicles lack collaboration and cannot adapt flexibly to scene changes, so achieving efficient and safe passage of unmanned vehicles through intersection scenes has long been a hot spot of academic and engineering research. How to improve the coordination of the actions of multiple agent vehicles at an intersection is therefore a problem to be solved.
Disclosure of Invention
The invention provides an intersection decision method based on multi-agent deep reinforcement learning, which solves the problem in the prior art of poor coordination among the actions of multiple agent vehicles at an intersection.
To achieve the above object, the invention provides an intersection decision method based on multi-agent deep reinforcement learning, comprising: constructing a signal-free intersection scene and designing a state space for the multi-agent vehicles and an action space for each agent vehicle; constructing a reward function from the state space of the multi-agent vehicles and the action space of each agent vehicle; observing the state space of the multi-agent vehicles and the action space of each agent vehicle, and learning the observations (the environment information and the relative position and relative speed of the current social vehicles with respect to the multi-agent vehicles) with a deep neural network, so that invalid data are filtered out and complete multi-agent state information is obtained; and, according to the current state space of the multi-agent vehicles, the action space of each agent vehicle and the current reward function, combining the next state space of the multi-agent vehicles, the next action space of each agent vehicle and the noise-bearing action expectation, and feeding them into a learning network with two dual Critic networks, so as to learn the actions of the multi-agent vehicles at the current signal-free intersection. One dual Critic network is used to feed iteration information to the other.
As a preferred aspect of the present invention, designing the state space of the multi-agent vehicles and the action space of each agent vehicle includes: defining the state space of each agent vehicle and its dimension, covering the social vehicles observed by the agent vehicle and the state features of each such vehicle; performing a Cartesian product operation on the state spaces of the individual agent vehicles to obtain the state space of the multi-agent vehicles; and defining the action space of each agent vehicle, which comprises three actions: acceleration, deceleration and steering angle.
As a preferred aspect of the above technical solution, after the state space of the multi-agent vehicles is obtained, optimizing the state space further includes: the current state, position state and speed state of the multi-agent vehicles are fed into two LSTM layers; each LSTM layer processes the observation data through a fully connected layer and outputs a fixed-size vector to the corresponding Actor network or Critic network, so that the Actor and Critic networks can learn stably despite their different update frequencies.
Preferably, after the Actor network receives the output vector, it passes the vector to a Softmax layer; the Softmax layer performs a vector mapping, and the valid actions of the multi-agent vehicles are obtained from the action probabilities produced by the mapping.
As a preferred aspect of the above technical solution, constructing the reward function from the state space of the multi-agent vehicles and the action space of each agent vehicle includes building it from a collision estimate, a speed estimate, a headway estimate and an arrival-cost estimate, each with a corresponding weighting scalar:

r_{i,t} = w_c r_c + w_s r_s + w_h r_h + w_e r_e

where r_{i,t} is the reward of the i-th agent at time step t, and w_c, w_s, w_h, w_e are the weighting scalars of the collision estimate r_c, the speed estimate r_s, the headway estimate r_h and the arrival-cost estimate r_e, respectively. The arrival-cost estimate stops the agent vehicle from waiting blindly at the intersection and avoids deadlock.
Preferably, the TD3 model is an improved MATD3 decision algorithm; specifically, the TD3 model is a dual network consisting of a learning network, which learns the actions of the multi-agent vehicles, and a delayed-update network. In the learning network, the current multi-agent state space is fed into an Actor network; the Actor network combines it with the multi-agent state space obtained through exploration and feeds the result into the error-update part of the learning network, where the dual Critic network learns the actions of the multi-agent vehicles from a first target value and the iteration information output by the delayed-update network. In the delayed-update network, the next multi-agent state space is fed into an Actor network, which applies smoothing regularization to the multi-agent vehicle actions of the next state space; the result is fed into the dual Critic network of the delayed-update network, which outputs the iteration information to the learning network.
Preferably, the dual Critic network in the error update processes the reward function to obtain two first target values; the smaller one is taken and used when receiving the iteration information.
Preferably, the dual Critic network in the error update learns the actions of the multi-agent vehicles by receiving the iteration information output by the delayed-update network, which includes: storing the reward in the iteration information, the state of the next round of the multi-agent vehicles, and the state and actions of the current multi-agent vehicles into an experience replay pool; sampling the experience replay pool to obtain the optimal action policy and target estimate of the multi-agent vehicles; and updating the policy parameters in the Softmax layer and the action expectation in the dual Critic network of the delayed update, where the action expectation is derived from the reward function.
The technical solution of the invention provides an intersection decision method based on multi-agent deep reinforcement learning: a signal-free intersection scene is constructed, and a state space for the multi-agent vehicles and an action space for each agent vehicle are designed; a reward function is built from the state space of the multi-agent vehicles and the action space of each agent vehicle; the environment information and the relative position and speed of the current social vehicles with respect to the multi-agent vehicles are processed to obtain complete multi-agent state information; and, according to the current state space of the multi-agent vehicles, the action space of each agent vehicle and the current reward function, the next state space of the multi-agent vehicles, the next action space of each agent vehicle and the noise-bearing action expectation are combined and fed into a TD3 model with two dual Critic networks, so that the actions of the multi-agent vehicles at the current signal-free intersection are learned.
By improving the deep neural network model, the method optimizes the action-exploration process, reduces the exploration of invalid actions, and strengthens the robustness of the decision model. By introducing an LSTM network that presents the observation vectors in order of decreasing distance, the nearest agent is guaranteed the greatest influence on the final state, which improves the model's performance in complex scenes and lowers the training difficulty. By designing a multi-constraint reward function tailored to the diversity of scenes, the actions of each agent vehicle can be predicted in different scenes, improving the generalization capability of the method in complex scenes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the prior art descriptions, and it is obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of the technical scheme provided by the embodiment of the invention.
Fig. 2 is a detailed flowchart of the technical solution of the present invention provided in the embodiment of the present invention.
Fig. 3 is a schematic diagram of a scene of intersection without signal lamps according to an embodiment of the present invention.
Fig. 4 is a frame diagram of a TD3 algorithm according to an embodiment of the present invention.
Fig. 5 is a deep neural network model diagram provided in an embodiment of the present invention.
Fig. 6 is a frame diagram of an LSTM input environment update information part to a Critic network and an Actor network, respectively, in an improved MATD3 model according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First, the flow of the invention is briefly described; a flow chart is shown in Fig. 1.
s101, constructing a traffic light-free intersection scene, and designing a state space based on multiple intelligent vehicles and an action space of each intelligent vehicle.
Specifically, designing the state space of the multi-agent vehicles and the action space of each agent vehicle includes: defining the state space of each agent vehicle and its dimension, covering the social vehicles observed by the agent vehicle and their state features; performing a Cartesian product operation on the state spaces of the individual agent vehicles to obtain the state space of the multi-agent vehicles; and defining the action space of each agent vehicle, which comprises three actions: acceleration, deceleration and steering angle.
S102, constructing a reward function according to the state space of the multi-agent vehicle and the action space of each agent vehicle.
Constructing the reward function from the state space of the multi-agent vehicles and the action space of each agent vehicle includes building it from a collision estimate, a speed estimate, a headway estimate and an arrival-cost estimate, each with a corresponding weighting scalar:

r_{i,t} = w_c r_c + w_s r_s + w_h r_h + w_e r_e

where r_{i,t} is the reward of the i-th agent at time step t, and w_c, w_s, w_h, w_e are the weighting scalars of the collision estimate r_c, the speed estimate r_s, the headway estimate r_h and the arrival-cost estimate r_e, respectively. The arrival-cost estimate stops the agent vehicle from waiting blindly at the intersection and avoids deadlock.
S103, processing the environment information and the relative position and relative speed information of the current social vehicles with respect to the multi-agent vehicles to obtain complete multi-agent state information.
Further, after the state space of the multi-agent vehicles is constructed, an LSTM network is used to optimize it and complete the environment information. Specifically, the current state, position state and speed state of the multi-agent vehicles are fed into two LSTM layers; each LSTM layer processes the observation data through a fully connected layer, and the distilled fixed-size output vector containing the multi-agent environment information is fed into the corresponding Actor network or Critic network, so that the Actor and Critic networks can stably learn the multi-agent actions despite their different update frequencies. The implicit encoding in the output vector carries information about the complete environment.
S104, the TD3 model learns the actions of the multi-agent vehicles at the current signal-free intersection.
The TD3 model is an improved MATD3 decision algorithm; specifically, it is a dual network consisting of a learning network, which learns the actions of the multi-agent vehicles, and a delayed-update network.
In the learning network, the current multi-agent state space is fed into an Actor network; the Actor network combines it with the multi-agent state space obtained by exploring the valid actions of the multi-agent vehicles, and feeds the result into the error-update part of the learning network, where the dual Critic network learns the actions of the multi-agent vehicles from a first target value and the iteration information output by the delayed-update network.
The Actor network receives the output vector, updates it, and passes it to the Softmax layer; the Softmax layer performs a vector mapping, and the probabilities of the different actions of the multi-agent vehicles obtained from the mapping yield the valid actions of the multi-agent vehicles.
The dual Critic network in the error update stores the reward in the iteration information, the state of the next round of the multi-agent vehicles, and the state and actions of the current multi-agent vehicles into an experience replay pool. The experience replay pool is sampled to obtain the optimal action policy and target estimate of the multi-agent vehicles.
The policy parameters in the Softmax layer and the action expectation in the dual Critic network of the delayed update are then updated, where the action expectation is derived from the reward function.
In the delayed-update network, the next multi-agent state space is fed into an Actor network, which applies smoothing regularization to the multi-agent vehicle actions of the next state space; the result is fed into the dual Critic network of the delayed-update network, which outputs iteration information to the learning network. The dual Critic network in the error update processes the reward function to obtain two first target values; the smaller one is taken and used when receiving the iteration information.
The technical scheme of the invention will now be described with reference to a specific embodiment:
step 201, constructing a scene of the intersection without signal lamps.
Specifically, the scene is modeled on an urban road intersection in daily life, as shown in Fig. 3. In the signal-free intersection scene, vehicles with different trajectory plans arrive at each entrance, and conflict points exist between trajectories inside the intersection, so the agent vehicle must continuously interact with the other observable social vehicles along its planned trajectory in order to pass through the intersection quickly and safely.
Step 202, designing a state space based on multiple agents.
Specifically, the state space of each agent vehicle i is defined as S_i, and the dimension of each state space is determined by the number of social vehicles observed by the current agent vehicle and the state features W of each observed vehicle, where the state features include:
observe: whether the current agent vehicle observes the social vehicle, represented by a binary variable;
x_l: the lateral position of the observed vehicle relative to the agent vehicle;
y_l: the longitudinal position of the observed vehicle relative to the agent vehicle;
v_x: the lateral velocity of the observed vehicle relative to the agent vehicle;
v_y: the longitudinal velocity of the observed vehicle relative to the agent vehicle.
The state space of the multiple agents is obtained by a Cartesian product operation over the state spaces of the individual agents, written S = S_1 + S_2 + S_3 + … + S_N, which together forms the local observations currently available to the agents.
Here S_{1~N} are the state spaces of the individual agents, S is the multi-agent state space, and N is the number of agent vehicles in the scene.
Step 203, designing an action space of multiple agents.
Specifically, the action space of each agent vehicle i is defined as A_i and comprises three actions: acceleration, deceleration and steering angle. The action space of the multiple agents is then defined as:

A = A_1 + A_2 + A_3 + … + A_N

where A_{1~N} are the action spaces of the individual agents, A is the multi-agent action space, and N is the number of agent vehicles in the scene. A rough sketch of these state and action definitions follows.
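As a rough illustration of steps 202 and 203, the sketch below lays out one agent's state matrix and the joint spaces; the number of observable social vehicles N_OBS and the zero-padding of unobserved slots are assumptions, not values fixed by the invention.

```python
import numpy as np

N_OBS = 5  # assumed number of social vehicles each agent can observe
W = 5      # state features per vehicle: [observe, x_l, y_l, v_x, v_y]
ACTIONS = ("accelerate", "decelerate", "steer")  # per-agent action set

def agent_state(observed):
    """Build one agent's state S_i as an (N_OBS, W) array.

    `observed` is a list of (x_l, y_l, v_x, v_y) tuples measured relative
    to the agent vehicle; unobserved slots stay zero with observe = 0.
    """
    s = np.zeros((N_OBS, W), dtype=np.float32)
    for row, (x_l, y_l, v_x, v_y) in enumerate(observed[:N_OBS]):
        s[row] = (1.0, x_l, y_l, v_x, v_y)  # observe flag set to 1
    return s

def joint_state(per_agent_states):
    """Joint state S = S_1 + S_2 + ... + S_N, stacked over the N agents."""
    return np.stack(per_agent_states)  # shape (N, N_OBS, W)
```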
Step 204, constructing a reward function of multiple constraint conditions.
Because the scenes faced by multi-agent vehicles are complex, a single reward term limits the generalization capability of the model; a reward function with multiple constraints is therefore proposed. The reward of the i-th agent at time step t is defined as:

r_{i,t} = w_c r_c + w_s r_s + w_h r_h + w_e r_e

where w_c, w_s, w_h, w_e are the weighting scalars of the collision estimate r_c, the speed estimate r_s, the headway estimate r_h and the destination-arrival cost estimate r_e, respectively.
Since safety is the most important criterion, w_c is given priority over the other weights. The four performance indicators are defined as follows:
Collision estimate: if a collision occurs, r_c is set to -100; otherwise r_c = 0.
Speed estimate: r_s is a function of the current speed v_t of the agent vehicle and the set maximum speed v_max, which is 50 km/h.
Headway estimate: d_headway is the headway and t_h is a predefined headway threshold. The agent vehicle is penalized when the headway is smaller than t_h, and is rewarded only when the headway exceeds t_h.
Destination-arrival cost estimate: this term prevents the agent vehicle from waiting blindly at the intersection and avoids deadlock; a corresponding penalty or reward is given according to the waiting predicted for the current vehicle along its route.
A sketch of this reward function follows.
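As an illustration only, the following is a minimal Python sketch of this multi-constraint reward. The weight values, the proportional speed term, the ±1 headway term and the sign of the arrival-cost term are assumptions; the text above fixes only the -100 collision penalty, the 50 km/h maximum speed and the headway-threshold logic.

```python
# Minimal sketch of the multi-constraint reward (step 204).
# Weights and the exact shapes of r_s and r_e are assumed, not specified.
V_MAX = 50.0 / 3.6  # 50 km/h, converted to m/s

def reward(collided, v_t, d_headway, t_h, expected_wait,
           w_c=1.0, w_s=0.3, w_h=0.3, w_e=0.2):
    r_c = -100.0 if collided else 0.0        # collision estimate (fixed)
    r_s = min(v_t, V_MAX) / V_MAX            # speed estimate (assumed shape)
    r_h = 1.0 if d_headway > t_h else -1.0   # headway estimate (penalty below t_h)
    r_e = -expected_wait                     # arrival-cost estimate (assumed sign)
    return w_c * r_c + w_s * r_s + w_h * r_h + w_e * r_e
```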
Step 205, completing the agent vehicle's picture of its surrounding environment using an LSTM.
In existing deep learning algorithms, only a fixed number of nearby agent vehicles are considered, which easily creates partial-observation problems and degrades the model's performance.
The invention introduces an LSTM network to take all agent vehicles into account. An LSTM is a kind of recurrent neural network, typically used to infer an output from an ordered input sequence of indefinite length.
Specifically, as shown in Fig. 6, the LSTM distills n individual observation vectors into a fixed-size output vector whose implicit encoding carries the key information about the complete environment, thereby completing the environment information available to the agent vehicle. Here each observation vector is the specific information of the vehicle's state space (step 202); k indexes the different agent vehicles, and i indexes the information items at different moments within the observation vector.
The LSTM is fed the observation vectors in descending order of distance, ensuring that the nearest agent has the greatest impact on the final state. To reduce training complexity, a single LSTM layer is used to process the observed state information. Because the Actor and Critic networks of the TD3 algorithm update at different frequencies, two separate LSTM networks are used, one for each, which improves the stability of model learning. A sketch of this encoder follows.
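A minimal PyTorch sketch of this arrangement is given below; the feature and hidden dimensions are assumptions. Two independent encoders are instantiated because the Actor and Critic networks update at different frequencies.

```python
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Single-layer LSTM that distills a variable number of per-vehicle
    observation vectors into one fixed-size vector (step 205)."""

    def __init__(self, feat_dim=5, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=1,
                            batch_first=True)

    def forward(self, obs_seq):
        # obs_seq: (batch, n_vehicles, feat_dim), sorted by decreasing
        # distance so the nearest vehicle is processed last and has the
        # greatest influence on the final hidden state.
        _, (h_n, _) = self.lstm(obs_seq)
        return h_n[-1]  # fixed-size implicit encoding of the environment

actor_encoder = ObservationEncoder()   # feeds the Actor network
critic_encoder = ObservationEncoder()  # feeds the Critic networks
```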
Step 206, constructing the intersection unmanned-driving decision model based on multi-agent deep reinforcement learning from the MATD3 decision algorithm.
Specifically, after the multi-agent state space designed in step 202 has been completed by step 205, it is combined with the multi-agent action space designed in step 203 and the deep reinforcement learning algorithm TD3 to obtain the MATD3 multi-agent reinforcement learning algorithm.
Specifically, the algorithm shown in Fig. 4 is based on an Actor-Critic framework and adopts a dual-network architecture: the upper part is the learning network that learns the actions of the multi-agent vehicles, and the lower part is the delayed-update network that outputs iteration information according to the multi-agent vehicle actions at the next moment.
Referring to the learning network in Fig. 4, the agent vehicle obtains the state-space information S from the simulation environment completed in step 205, and the Actor network processes it to output the corresponding action A at that moment. In this network, the dual Critic network evaluates the action value of the multi-agent vehicles in the current state from the reward r returned by the reward function of step 204 for the output action, computes two target values (TD-error 1, TD-error 2), and takes the smaller one:

y = r + γ min_{i=1,2} Q_{θ'_i}(s', a')

where y is the target value (TD-error 1, TD-error 2); r is the reward obtained by taking the current action in the current state, per step 204; γ is the discount factor applied to future rewards; s' is the next state; a' is the target action; i indexes the two Critic networks; and Q_{θ'_i}(s', a') is the output of the target Critic networks (Q1 and Q2 in Fig. 4), i.e. the action-value estimate for the next state s' and target action a'. Taking the minimum over the dual network mitigates the overestimation problem to the greatest extent.
While the dual Critic network in the learning network outputs its estimates, the learning network also iterates continuously according to the output of the delayed-update network. Specifically:
the delayed-update network, given the next state s' of the multi-agent vehicles predicted by the learning network, processes it with the target Actor network and outputs the corresponding action A' for that moment. The action and the next state are fed into the dual Critic network of this network, which outputs the delayed-update target value and the action reward of the next state.
Specifically, the MATD3 algorithm is a deterministic-policy algorithm, and deterministic policies tend to overfit: when the dual Critic network of the delayed-update network is updated, the learning target of a deterministic policy is easily affected by function-approximation error, enlarging the target-estimation error and making the estimate inaccurate. To solve this, target-policy smoothing regularization is used: a small amount of noise is added to the target action and the mean is taken over a mini-batch to approximate the action expectation:

Y = r + γ Q'(s', μ'(s'|θ') + ε | θ^{Q'}),  ε ~ clip(N(0, σ), -c, c)

where Y is the target value used to update the Critic networks in the delayed-update network; r is the reward obtained per step 204 by taking the current action in the current state s; γ is the discount factor applied to future rewards; Q' is the estimate of the target network; μ'(s'|θ') is the target action a_i generated by the target policy; s' is the next state; ε is noise that introduces randomness into the target action to increase exploration; σ is the standard deviation of the noise; and Q' is the estimate output by the dual Critic network in the delayed-update network of Fig. 4, with r the reward output by the target Critic network. A sketch of this target computation follows.
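A minimal sketch of this smoothed, clipped double-Q target; γ, σ, c and the network interfaces are assumptions.

```python
import torch

def delayed_update_target(r, s_next, actor_target, critic1_target,
                          critic2_target, gamma=0.99, sigma=0.2, c=0.5):
    """Y = r + γ · min(Q'_1, Q'_2)(s', μ'(s') + ε), ε ~ clip(N(0, σ), -c, c)."""
    with torch.no_grad():
        a_next = actor_target(s_next)                  # target action μ'(s')
        noise = (torch.randn_like(a_next) * sigma).clamp(-c, c)
        a_next = a_next + noise                        # smoothed target action
        q1 = critic1_target(s_next, a_next)
        q2 = critic2_target(s_next, a_next)
        return r + gamma * torch.min(q1, q2)           # take the smaller estimate
```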
Because the multi-agent state space has a high dimension, feeding it directly into the model would increase the computation; a deep neural network front-end is therefore proposed to encode the state before it is input to the Actor and Critic networks. As shown in Fig. 5, the observed state is divided into three groups: the current state, the position state and the speed state. Each of the three sub-state vectors is encoded by a fully connected (FC) layer, where the current-state FC layer has 32 neurons and the position-state and speed-state FC layers have 64 neurons each; the three encoded results are concatenated into one vector, which is passed through an FC layer of 128 neurons, and the final result is passed to the Actor and Critic networks, as sketched below.
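A sketch of this front-end with the stated neuron counts; the input sizes of the three sub-states are assumptions.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Fig. 5 front-end: three sub-states encoded separately, concatenated,
    and compressed to a 128-dim vector for the Actor and Critic networks.
    The input dimensions of the sub-states are assumptions."""

    def __init__(self, cur_dim=8, pos_dim=16, vel_dim=16):
        super().__init__()
        self.fc_cur = nn.Linear(cur_dim, 32)   # current state: 32 neurons
        self.fc_pos = nn.Linear(pos_dim, 64)   # position state: 64 neurons
        self.fc_vel = nn.Linear(vel_dim, 64)   # speed state: 64 neurons
        self.fc_out = nn.Linear(32 + 64 + 64, 128)

    def forward(self, cur, pos, vel):
        z = torch.cat([torch.relu(self.fc_cur(cur)),
                       torch.relu(self.fc_pos(pos)),
                       torch.relu(self.fc_vel(vel))], dim=-1)
        return torch.relu(self.fc_out(z))  # passed to Actor and Critic
```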
The logits l_i generated by the Actor network during action exploration are passed to the Softmax layer, which converts them into probabilities; the sampling of each action is defined as:

a_i ~ π_{θ_i}(· | s_i) = softmax([l_1, l_2, l_3, l_4, l_5])

where π_{θ_i} is the policy with parameters θ_i that generates the action distribution in the given state s_i; the softmax function maps the input vector into a probability distribution, ensuring that all probability values sum to 1; and [l_1, l_2, l_3, l_4, l_5] is the vector input to the softmax function.
However, the generated actions include both invalid and valid actions, and under a stochastic policy invalid actions may be sampled, reducing the robustness of the model. A method of masking out invalid actions is therefore proposed: a mask is added over the invalid actions so that only valid actions are sampled. The logits of invalid actions are replaced with -1e8; after the softmax layer, the probability of an invalid action is then close to 0, so its exploration is avoided.
The invalid actions specifically considered are: attempting to accelerate when the speed of the agent vehicle has reached the preset maximum; and attempting to increase the steering angle when the angular velocity of the agent vehicle has reached the preset maximum. A sketch of this mask follows.
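A minimal sketch of the invalid-action mask; which logit index corresponds to which action, and the limit variables, are assumptions.

```python
import torch

def masked_action_probs(logits, speed, steer_angle, v_max, steer_max,
                        idx_accelerate=0, idx_steer_more=2):
    """Replace invalid-action logits with -1e8 so that, after softmax,
    their sampling probability is effectively 0 (step 206)."""
    masked = logits.clone()
    if speed >= v_max:            # cannot accelerate past the preset maximum speed
        masked[idx_accelerate] = -1e8
    if steer_angle >= steer_max:  # cannot increase the steering angle further
        masked[idx_steer_more] = -1e8
    return torch.softmax(masked, dim=-1)
```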
Step 207, performing action prediction with the complete intersection unmanned-driving decision model based on multi-agent deep reinforcement learning.
The complete multi-agent deep reinforcement learning-based unmanned decision model of the intersection is shown in fig. 6.
Initialize the multi-agent state space S = S_1 + S_2 + S_3 + … + S_N, where S_{1~N} are the state spaces of the individual agents, S is the multi-agent state space, and N is the number of agent vehicles in the scene.
Initialize the multi-agent action space A = A_1 + A_2 + A_3 + … + A_N, where A_{1~N} are the action spaces of the individual agents, A is the multi-agent action space, and N is the number of agent vehicles in the scene.
Initialize the parameters of the policy network π_φ.
Initialize the parameters of the value networks Q_{θ_1} and Q_{θ_2}.
Initialize the random parameters θ_1, θ_2, φ, where θ_1 and θ_2 are the random parameters of the two value networks and φ is the random parameter of the policy network; these are the parameters of the networks optimized in the invention (the delayed-update part of Fig. 4).
Initialize the target networks θ'_1 ← θ_1, θ'_2 ← θ_2. The target network is the collective term for the policy target network and the value target network: the policy target network follows the TD3 algorithm framework proposed by the invention, and the value target network is driven by the reward function of the invention.
Initialize an experience replay pool B.
According to the current environment state input, explore actions through the policy network.
Through action exploration, obtain the reward return value and the state of the next round, and store the reward return value, the current state and the action into the experience replay pool.
Sample the experience replay pool to obtain the optimal policy and target estimate.
Update the value networks.
Update the policy network. A condensed sketch of this training loop follows.
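A condensed sketch of the loop above, assuming simple env / agent / buffer interfaces; none of these names are fixed by the invention, and a plain list works as the replay pool.

```python
import random

def train(env, agents, buffer, episodes=1000, batch_size=256, policy_delay=2):
    """Centralized training loop: explore, store, sample, update critics
    every step and the actor/targets at a delayed frequency."""
    for _ in range(episodes):
        state = env.reset()
        done, step = False, 0
        while not done:
            actions = [agent.act(state) for agent in agents]  # policy exploration
            next_state, rewards, done = env.step(actions)
            buffer.append((state, actions, rewards, next_state))
            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)     # replay sampling
                for agent in agents:
                    agent.update_critics(batch)               # value networks
                    if step % policy_delay == 0:
                        agent.update_actor_and_targets(batch) # delayed update
            state, step = next_state, step + 1
```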
In this model, the Actor network is updated by maximizing the cumulative expected return, which must be evaluated with the Critic network. If the Critic network is unstable, the Actor network will also oscillate, so the Critic network is updated more frequently than the Actor network; that is, the Actor network is updated only after the Critic network has become more stable. Furthermore, the two introduced LSTM networks feed environment-update information to the Critic and Actor networks at their respective frequencies, ensuring the stability of both.
Each agent interacts with the environment; the observable agent information and local environment information are gathered centrally, and the overall state information is passed to the decision model for learning. All agents are trained centrally: actions are explored according to the environment, the corresponding decisions are produced, and the decisions are finally passed to the low-level control of the agent vehicles to control acceleration, deceleration and steering angle. The model gives a reward each round, and a new round of training is carried out with the current reward and the next state, completing the construction of the intersection unmanned-driving model based on multi-agent deep reinforcement learning.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (8)

1. An intersection decision-making method based on multi-agent deep reinforcement learning is characterized by comprising the following steps:
constructing a scene of an intersection without signal lamps, and designing a state space based on multiple intelligent vehicles and an action space of each intelligent vehicle;
constructing a reward function according to the state space of the multi-agent vehicle and the action space of each agent vehicle;
processing the environment information and the relative position information and the relative speed information of the current social vehicle relative to the multi-agent vehicle to obtain complete multi-agent state information;
according to the current state space of the multi-agent vehicles, the action space of each agent vehicle and the current reward function, respectively inputting a TD3 model with two dual Critic networks by combining the next state space of the multi-agent vehicles, the next action space of each agent vehicle and the noise-bearing action expectation, so as to learn the actions of the multi-agent vehicles at the current signal-free intersection;
wherein one critic dual network is used to input iteration information to another critic dual network.
2. The method of claim 1, wherein the designing is based on a state space of the multi-agent vehicle and a motion space of the respective agent vehicle, comprising:
defining the state space of each agent vehicle and its dimension, which covers the social vehicles observed by the agent vehicle and the state features of those vehicles;
performing a Cartesian product operation on the state spaces of the individual agent vehicles to obtain the state space of the multi-agent vehicles;
defining the action space of each agent vehicle, which comprises three actions: acceleration, deceleration and steering angle.
3. The method of claim 2, wherein processing the environmental information and the relative position information of the current social vehicle with respect to the multi-agent vehicle, relative velocity information, to obtain complete multi-agent status information, comprises:
the current state, the position state and the speed state of the multi-agent vehicles are fed into two LSTM layers; each LSTM layer processes the observation data through a fully connected layer, and fixed-size output vectors containing the multi-agent surrounding-environment information are fed into the corresponding Actor network or Critic network, so that the Actor and Critic networks can stably learn the actions of the multi-agents despite their different update frequencies;
wherein the implicit coding in the output vector includes information about the complete environment.
4. The method of claim 3, wherein the Actor network receives the output vector, updates the output vector, transmits the output vector to a Softmax layer, performs vector mapping on the Softmax layer, and obtains the effective actions of the multi-agent vehicle according to the probabilities of different actions of the multi-agent vehicle obtained by the mapping result.
5. The method of claim 1, wherein constructing a bonus function from the state space of the multi-agent vehicle and the motion space of the respective agent vehicle comprises:
constructing a reward function according to the collision estimation, the speed estimation, the headway estimation and the arrival cost estimation and the corresponding weighting scalar:
r_{i,t} = w_c r_c + w_s r_s + w_h r_h + w_e r_e

wherein r_{i,t} is the reward of the i-th agent at time step t, and w_c, w_s, w_h, w_e are the weighting scalars of the collision estimate r_c, the speed estimate r_s, the headway estimate r_h and the arrival-cost estimate r_e, respectively;
the arrival-cost estimate stops the agent vehicle from waiting blindly at the intersection and avoids deadlock.
6. The method according to claim 1, wherein the TD3 model is a modified MATD3 decision algorithm, in particular, the TD3 model is a dual network consisting of a learning network for learning the actions of the multi-agent vehicle and a delayed update network;
the learning network is composed as follows: the current multi-agent state space is fed into an Actor network; the Actor network combines it with the multi-agent state space obtained through exploration and feeds the result into the error-update part of the learning network, where the dual Critic network learns the actions of the multi-agent vehicles from a first target value and the iteration information output by the delayed-update network;
the delayed-update network is composed as follows: the next multi-agent state space is fed into an Actor network, which applies smoothing regularization to the multi-agent vehicle actions of the next state space; the result is fed into the dual Critic network of the delayed-update network, which outputs the iteration information to the learning network.
7. The method of claim 6, wherein the dual Critic network in the error update performs data processing on the bonus function to obtain two first target values, and wherein the smaller first target value is used for receiving the iteration information.
8. The method of claim 6, wherein the dual Critic network in the error update learns the actions of a multi-agent vehicle by receiving the iteration information output by the delay update network, comprising:
storing rewards in the iteration information, the state of the next round of multi-agent vehicle, the state and actions of the current multi-agent vehicle into an experience playback pool;
sampling the experience playback pool to obtain an optimal action strategy and target estimation of the multi-agent vehicle;
updating the policy parameters in a Softmax layer, and updating the action expectation in the dual Critic network of the delayed update; wherein the action expectation is derived from the reward function.
CN202311451381.0A 2023-11-02 2023-11-02 Intersection decision-making method based on multi-agent deep reinforcement learning Pending CN117496707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311451381.0A CN117496707A (en) 2023-11-02 2023-11-02 Intersection decision-making method based on multi-agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311451381.0A CN117496707A (en) 2023-11-02 2023-11-02 Intersection decision-making method based on multi-agent deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN117496707A true CN117496707A (en) 2024-02-02

Family

ID=89684223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311451381.0A Pending CN117496707A (en) 2023-11-02 2023-11-02 Intersection decision-making method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN117496707A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118918720A (en) * 2024-07-16 2024-11-08 北京联合大学 Multi-agent unmanned decision method and system for intersection scene without signal lamp
CN120386386A (en) * 2025-06-27 2025-07-29 西北工业大学 A multi-agent collaborative interactive decision-making and control method in an interactive scenario
CN121257130A (en) * 2025-12-05 2026-01-02 青岛海信网络科技股份有限公司 Optimization methods, devices, equipment and media for traffic control scheme generation models



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination