
CN118918720A - Multi-agent unmanned decision method and system for intersection scene without signal lamp - Google Patents

Multi-agent unmanned decision method and system for intersection scene without signal lamp Download PDF

Info

Publication number
CN118918720A
CN118918720A (application CN202410948308.2A)
Authority
CN
China
Prior art keywords
agent
decision
experience
ego
unmanned driving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410948308.2A
Other languages
Chinese (zh)
Inventor
杜煜
张昊
赵世昕
吕和君
原颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN202410948308.2A priority Critical patent/CN118918720A/en
Publication of CN118918720A publication Critical patent/CN118918720A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0967Systems involving transmission of highway information, e.g. weather, speed limits
    • G08G1/096708Systems involving transmission of highway information, e.g. weather, speed limits where the received information might be used to generate an automatic action on the vehicle control
    • G08G1/096725Systems involving transmission of highway information, e.g. weather, speed limits where the received information might be used to generate an automatic action on the vehicle control where the received information generates an automatic action on the vehicle control
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/16Anti-collision systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30Services specially adapted for particular environments, situations or purposes
    • H04W4/40Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/44Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Atmospheric Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract


The present invention discloses a multi-agent unmanned driving decision-making method and system for non-signalized intersection scenes. The method comprises: setting the parameter space of the multiple agents based on the unmanned driving scene of a non-signalized intersection; designing a dynamic noise mechanism for the agents so that they accumulate experience in the parameter space; designing a sampling rule for the experience replay pool so that the agents effectively learn from the accumulated experience; constructing a decision model based on the parameter space, the dynamic noise mechanism, and the sampling rule; and generating the multi-agent unmanned driving decision method with the decision model. The invention enables unmanned vehicles to pass through complex non-signalized intersections. The method mitigates the low learning efficiency caused by the complexity of multi-agent reinforcement learning models, improves the stability and robustness of the finally trained decision policy, and reduces the collision rate of unmanned vehicles at non-signalized intersections.

Description

Multi-agent unmanned decision method and system for intersection scene without signal lamp
Technical Field
The invention relates to the field of unmanned driving, and in particular to a multi-agent unmanned driving decision method and system for signal-lamp-free intersection scenes.
Background
Single-agent reinforcement learning algorithms, such as Q-learning, DQN, DDPG, and PPO, are widely applied in the unmanned driving field, including decision-making at unmanned intersections. However, such algorithms train only one agent and cannot handle more complex multi-agent strategies involving coordination and competition. Compared with multi-agent reinforcement learning, single-agent reinforcement learning therefore has clear limitations: it typically assumes that the driving behavior of other vehicles is fixed and follows certain rules, whereas in the real world the drivers of other vehicles may adjust their decisions based on observed driving behavior, which introduces uncertainty and unpredictability.
For complex decision-making scenarios involving multiple vehicles, such as intersections, it is worth considering multi-agent deep reinforcement learning to achieve better decision results. In multi-agent reinforcement learning, multiple agents (unmanned vehicles) learn simultaneously in a shared environment and constantly adjust their strategies; this makes the environment non-stationary and complicates the learning process, but it better matches complex real-world scenes. At present, research on multi-agent deep reinforcement learning decision algorithms for non-signalized intersection scenes is still relatively scarce, so improving a multi-agent deep reinforcement learning algorithm and applying it to unmanned decision-making at such intersections has broad research prospects.
Among multi-agent reinforcement learning schemes, fully centralized learning scales poorly to large numbers of agents, and some agents may learn degenerate, passive strategies; fully independent learning faces the problem of environmental non-stationarity. Centralized training with decentralized execution is more feasible, and the problem then becomes how to train an independent policy for each agent from a global perspective.
Disclosure of Invention
To solve the above technical problems, the invention designs a multi-agent unmanned driving decision method based on an importance-sampling module and a dynamic noise mechanism. For experiences stored in the experience replay pool, the sampling frequency of each sample is determined by an importance weight, making learning more efficient. A dynamic noise mechanism is introduced so that the fixed action noise decreases as training rounds increase, letting the agents rely more and more on their learned strategies to complete the task. Together, these improve the learning efficiency and robustness of the multi-agent reinforcement learning decision model, raise the decision efficiency and passing success rate of unmanned vehicles facing complex dynamic non-signalized intersections, and mitigate collisions between vehicles.
In order to achieve the above purpose, the invention provides a multi-agent unmanned decision method for a scene of a no-signal lamp intersection, comprising the following steps:
Setting a parameter space of multiple agents based on an unmanned scene of the signalless intersection;
designing a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in the parameter space;
the sampling rule of the experience playback pool is designed for the multi-agent, so that the multi-agent effectively learns accumulated experiences;
constructing a decision model based on the parameter space, the dynamic noise mechanism and the sampling rule;
and generating a multi-agent unmanned decision method by using the decision model.
Preferably, the parameter space includes: a state space and an action space;
the state space is defined as follows:
S = (V_ego, V_1, V_2, …, V_n, D_1, D_2, …, D_n, Dest_ego)
Wherein S represents the state space; V_ego denotes the speed of the ego vehicle; V_1, V_2, …, V_n denote the speeds of the remaining vehicles; D_1, D_2, …, D_n denote the relative distances of the remaining vehicles from the ego vehicle; Dest_ego represents the distance of the destination from the ego vehicle;
The action space is defined as follows:
A = (Throttle_ego, Brake_ego, Steer_ego)
Wherein A represents the action space; Throttle_ego represents the throttle; Brake_ego represents the brake; Steer_ego denotes the steering angle; the action space of each agent vehicle in the scene contains these parameters.
Preferably, dynamic noise is introduced on the actions output by each agent's policy network, increasing the randomness of agent behavior; the noise decreases correspondingly as the number of training rounds increases, encouraging the agents to explore more initially and, as experience accumulates, to rely increasingly on their learned strategies.
Preferably, the noise value is calculated from the initial noise, the final noise, and the percentage of remaining training rounds; specifically, the difference between the initial noise value and the final noise value is multiplied by the percentage of remaining training rounds and added to the final noise value.
Preferably, the sampling rule includes: experience that leads to substantial changes in agent behavior or near optimal solutions is given higher priority; the sampling strategy is to select experience samples to be replayed according to the importance of experience, the decision model is more concentrated on samples with larger influence on the learning process during training, the efficiency and the speed of the learning process are improved, and the decision model is converged to a better strategy more quickly.
Preferably, the training method of the decision model comprises the following steps: after initializing network parameters of each intelligent agent, the intelligent agent explores the environment based on the dynamic noise mechanism and accumulates experience; and after the experience stored in the experience playback pool exceeds a set threshold, the intelligent agent samples based on the sampling rule of the experience playback pool, and trains a decision model.
Preferably, the reward function of the decision model comprises: local rewards and global rewards;
The local reward evaluates each vehicle traveling from its start point to its target point and passing through the intersection safely and quickly, using vehicle speed and elapsed time as judging criteria; a reward is given according to the distance to the target point, and a penalty is given for conflicts between vehicles; whether a vehicle has reached the target point is judged by its distance to that point, and a task-completion reward is given when it arrives smoothly;
if all vehicles safely reach the expected target point, a global reward is given, and task cooperation among the vehicles is promoted.
The invention also provides a multi-agent unmanned decision system facing the scene of the intersection without signal lamps, which is used for realizing the method and comprises the following steps: the system comprises a space design module, a noise design module, a rule design module, a model construction module and a decision generation module;
the space design module is used for setting a parameter space of multiple intelligent agents based on an unmanned scene of the signalless intersection;
the noise design module is used for designing a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in the parameter space;
the rule design module is used for designing a sampling rule of the experience playback pool for the multi-agent, so that the multi-agent effectively learns accumulated experiences;
the model construction module is used for constructing a decision model based on the parameter space, the dynamic noise mechanism and the sampling rule;
The decision generation module is used for generating a multi-agent unmanned decision method by utilizing the decision model.
Compared with the prior art, the invention has the following beneficial effects:
The invention enables unmanned vehicles to pass through complex non-signalized intersections. The method not only mitigates the low learning efficiency caused by the complexity of multi-agent reinforcement learning models, but also improves the stability and robustness of the finally trained decision strategy, reducing the collision rate of unmanned vehicles at non-signalized intersections and improving the vehicle passing rate.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the embodiments are briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a scenario simulation of a signalless intersection in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a sampling rule according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a decision model according to an embodiment of the present invention;
Fig. 5 is a general flow chart of a decision making method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, a flow chart of a method of the present embodiment includes the steps of:
s1, setting a parameter space of multiple agents based on an unmanned scene of a signalless intersection.
In this embodiment, the agents may be regarded as unmanned vehicles themselves or algorithm models representing such vehicles, i.e., multiple agents.
The method of this embodiment improves on the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. Each agent in MADDPG adopts the same Actor-Critic architecture as DDPG, and each experience sample in the replay pool has the form (s, a_1, a_2, …, a_N, r_1, …, r_N, s', d).
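For illustration only (not part of the patent text), the joint experience replay pool storing tuples of the form (s, a_1, …, a_N, r_1, …, r_N, s', d) might be sketched in Python as follows; the class name and default capacity are assumptions:

```python
import random
from collections import deque


class MultiAgentReplayBuffer:
    """Joint replay buffer storing (s, a_1..a_N, r_1..r_N, s', d) tuples."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, actions, rewards, next_state, done):
        # One transition covers all N agents' actions and rewards.
        self.buffer.append((state, tuple(actions), tuple(rewards), next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; the importance-based rule of step S3 refines this.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```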
Constructing the unmanned decision model requires setting a corresponding parameter space according to the unmanned scene, and in this embodiment, the parameter space includes: a state space and an action space. In the training stage, the state space input of each agent contains all other agent participants in the scene; meanwhile, continuous action output is adopted, the problem of continuous behavior space in the unmanned decision problem is solved, and the scene is shown in figure 2. The method comprises the following specific steps:
s1.1, designing an intelligent vehicle state space. The scene is exemplified by a typical two-way single-lane signalless intersection in a multi-agent scene, and the state space needs to contain state data of all vehicles acquired by vehicles in the scene. Specifically, taking any one of the intelligent vehicles in the scene as an example, the state space is defined as follows:
S = (V_ego, V_1, V_2, …, V_n, D_1, D_2, …, D_n, Dest_ego)
Wherein S represents the state space; V_ego denotes the speed of the ego vehicle; V_1, V_2, …, V_n denote the speeds of the remaining vehicles; D_1, D_2, …, D_n denote the relative distances of the remaining vehicles from the ego vehicle; Dest_ego represents the distance of the destination from the ego vehicle.
S1.2, designing an intelligent vehicle action space. An action space is a collection of all actions that an agent may take, the agent's behavior being defined by the action space, the correct definition of the action space facilitating the learning process, enabling the agent to efficiently explore and utilize experience to achieve a given goal.
The decision model provided in this embodiment is more suitable for solving the problem of continuous behavior space in the unmanned decision problem, and the design of the action space is continuous, taking any one of the intelligent vehicles ego in the scene as an example, and the action space is defined as follows:
A = (Throttle_ego, Brake_ego, Steer_ego)
Wherein A represents the action space; Throttle_ego represents the throttle; Brake_ego represents the brake; Steer_ego denotes the steering angle; the action space of each agent vehicle in the scene contains these parameters.
The three control signals are all continuous values, so that the intelligent vehicle can accelerate, decelerate and turn, and further smoothly pass through the intersection to reach the destination.
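As an illustrative sketch (not part of the patent), the state and action vectors defined above could be assembled as follows; the [0, 1] throttle/brake and [-1, 1] steering bounds are assumptions, since the patent only states that the signals are continuous:

```python
import numpy as np


def build_state(v_ego, v_others, d_others, dest_ego):
    """Flatten S = (V_ego, V_1..V_n, D_1..D_n, Dest_ego) into one vector."""
    return np.concatenate(([v_ego], v_others, d_others, [dest_ego]))


def clip_action(throttle, brake, steer):
    """Clamp A = (Throttle, Brake, Steer) to continuous ranges.

    The [0, 1] and [-1, 1] bounds are illustrative assumptions.
    """
    return np.array([np.clip(throttle, 0.0, 1.0),
                     np.clip(brake, 0.0, 1.0),
                     np.clip(steer, -1.0, 1.0)])
```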
S2, a dynamic noise mechanism is designed for the multi-agent, so that the multi-agent accumulates experience in a parameter space.
To encourage the agents to explore the environment, noise is added to their actions in this embodiment, increasing the randomness of agent behavior. The noise changes dynamically, decreasing as the number of training rounds increases.
The noise value is calculated from the initial noise, the final noise, and the percentage of remaining training rounds, as shown in the formula:
Explr_remain = (exploration_eps − ep_i) / exploration_eps
Wherein Explr_remain represents the percentage of remaining training rounds; exploration_eps represents the total number of training rounds; ep_i represents the current round number.
The noise introduced in the current round equals the final noise value plus the difference between the initial and final noise values multiplied by the percentage of remaining training rounds:
Noise = noise_final + (noise_init − noise_final) × Explr_remain
Wherein Noise represents the noise introduced in the current round; noise_init represents the initial noise value; noise_final represents the final noise value.
As the exploration period shrinks, the noise value gradually transitions from the initial value to the final value: the agent is encouraged to explore more at the start and, as experience accumulates, relies increasingly on its learned strategies.
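The decay schedule above can be sketched in a few lines of Python (the default initial and final noise values are assumptions, not given in the patent):

```python
def dynamic_noise(ep_i, exploration_eps, noise_init=1.0, noise_final=0.05):
    """Noise = noise_final + (noise_init - noise_final) * Explr_remain,
    where Explr_remain = (exploration_eps - ep_i) / exploration_eps."""
    explr_remain = max(0.0, (exploration_eps - ep_i) / exploration_eps)
    return noise_final + (noise_init - noise_final) * explr_remain
```

At episode 0 the full initial noise is applied; once ep_i reaches exploration_eps, only the final noise remains, so the agent acts almost entirely on its learned policy.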
S3, designing a sampling rule of an experience playback pool for the multiple agents, so that the multiple agents effectively learn accumulated experiences.
Under the scene of multi-agent reinforcement learning, the instability of the dynamic change of the environment causes slow convergence speed in the training of the decision model, and the optimal strategy is difficult to learn; in this embodiment, empirical importance sampling is introduced in the design to improve model learning efficiency and convergence rate. Experience that causes the agent behavior to change substantially or near optimal solutions is given higher priority by designing sampling rules for the multi-agent experience playback pool, as shown in fig. 3.
The sampling strategy selects experience samples to replay according to the importance of the experience, where priority is measured by the TD error or reward prediction error: the larger the error, the less accurately the model predicts that experience, so its priority is higher and it requires further learning.
The prediction error δ_t can be expressed by the following formula:
δ_t = |r_t + γ·V_{t+1} − Q(s_t, a_t)|
Where r_t denotes the currently obtained reward; γ represents the discount factor; V_{t+1} denotes the value of the next state calculated using the target network; Q(s_t, a_t) represents the value of performing action a_t in the current state; specifically, a_t = π(s_t) is the action selected in the current state, π represents the policy the current agent has learned, and s_t represents the current state.
Specifically, the next-state value V_{t+1} computed by the target network is expressed as:
V_{t+1} = Q'(s_{t+1}, π'(s_{t+1}))
Wherein Q' is the target Q network; π' is the target policy network; s_{t+1} denotes the next state.
The priority weight prior_i of an experience sample is defined as follows:
prior_i = (δ_i + ε)^α
where ε is a small constant used to ensure the priority is never zero; α is a hyper-parameter controlling the priority weight, used to prevent high-priority experiences from being over-sampled and biasing training.
The probability that each experience sample is selected is calculated from its priority weight, defined as follows:
p_i = prior_i / Σ_k prior_k
Where p_i represents the probability that experience sample i is selected.
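For illustration, the priority weighting and sampling probabilities described above can be sketched as follows; the ε and α defaults are assumptions (the patent does not fix their values):

```python
import numpy as np


def priority_probs(td_errors, eps=1e-3, alpha=0.6):
    """prior_i = (|delta_i| + eps)**alpha, normalized to p_i = prior_i / sum_k prior_k."""
    prior = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha
    return prior / prior.sum()


def sample_batch(probs, batch_size, rng=None):
    """Draw distinct sample indices according to the priority probabilities."""
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.choice(len(probs), size=batch_size, p=probs, replace=False)
```

Samples with larger TD errors receive higher probabilities, so the replay focuses on the transitions the model predicts worst.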
S4, constructing a decision model based on the parameter space, the dynamic noise mechanism, and the sampling rule.
The decision model is shown in fig. 4. When the decision model samples experience during training, it focuses on the samples with greater influence on the learning process, uses the samples in the experience buffer more effectively, improves the efficiency and speed of learning, and helps the reinforcement learning algorithm converge to a better strategy more quickly.
In this embodiment, the reward functions of the decision model include local rewards and global rewards. Specifically, the local rewards are carried out by taking the vehicle from the starting point to the target point, safely and quickly passing through the intersection, and taking the vehicle speed and the time as rewards judging standards; giving a certain rewards for the distance from the target point, and giving punishment for conflicts occurring between vehicles; judging whether the vehicle reaches the target point according to the distance between the vehicle and the target point, and giving rewards for completing the task if the vehicle reaches the target point smoothly; if all vehicles safely reach the expected target point, a global reward is given, and task cooperation among the vehicles is promoted.
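A minimal sketch of the local and global reward structure described above is given below; all coefficients, the goal radius, and the bonus magnitude are assumptions for illustration, as the patent does not specify numeric values:

```python
def local_reward(speed, dist_to_goal, prev_dist_to_goal, collided, goal_radius=2.0):
    """Illustrative local reward: progress toward the target plus a speed term,
    a collision penalty, and a task-completion bonus (coefficients assumed)."""
    r = 0.1 * speed + (prev_dist_to_goal - dist_to_goal)  # speed and progress terms
    if collided:
        r -= 10.0                                          # punish vehicle conflicts
    if dist_to_goal < goal_radius:
        r += 20.0                                          # reward reaching the target
    return r


def global_reward(all_reached, bonus=50.0):
    """Shared bonus given only when every vehicle safely reaches its target point,
    promoting task cooperation among the vehicles."""
    return bonus if all_reached else 0.0
```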
The training method of the decision model is as follows: after the network parameters of each agent are initialized, the agents explore the environment based on the dynamic noise mechanism and accumulate experience; after the experience stored in the replay pool exceeds a set threshold, the agents sample according to the importance-based sampling rule of the replay pool and train the decision model. The specific steps are as follows:
initializing the Actor network of every agent vehicle and its parameters ω_i;
initializing the Critic network of every agent vehicle and its parameters θ_i;
initializing the target Actor network of every agent vehicle and its parameters ω'_i;
initializing the target Critic network of every agent vehicle and its parameters θ'_i;
initializing the experience replay buffer D;
Training a multi-agent unmanned decision model based on a dynamic noise mechanism and importance sampling:
According to the current environment state, each agent vehicle selects an action through its policy network to explore the environment and obtains an initial observation;
each time step, the following steps are performed:
Each intelligent vehicle obtains an action a according to the current observation value and the Actor network thereof.
And executing the actions of all the intelligent vehicles, acquiring the next observed value s' and the rewards r, and judging whether the passing task of all the vehicles is finished or not.
The observation s, the actions a_1, …, a_N, the rewards r_1, …, r_N, the next observation s', and the done flag d are stored in the experience replay pool (experience replay buffer) D:
D ← (s, a_1, a_2, …, a_N, r_1, …, r_N, s', d)
The importance weights of the experience samples in the replay pool are updated, along with the sampling probability p_i of each sample. Once the replay pool holds a sufficient amount of experience, each agent vehicle performs the following steps:
A batch of experiences (s, a_1, a_2, …, a_N, r_1, …, r_N, s', d) is sampled from the experience replay buffer.
The target Q value and the actual Q value of each agent are calculated from the sampled experience and the current Critic network.
The Critic loss is computed as the mean squared error between them.
The Critic network is updated by gradient descent.
The Actor loss is calculated from the current Actor and Critic networks.
The Actor network is updated by gradient descent.
The target Actor network parameters ω'_i of each agent vehicle are updated using a soft update policy:
ω'_i = τ·ω_i + (1 − τ)·ω'_i
where τ is a scaling factor between 0 and 1 representing the extent to which the current network parameters affect the update of the target network parameters.
The target Critic network parameters θ'_i of each agent vehicle are updated using the same soft update policy:
θ'_i = τ·θ_i + (1 − τ)·θ'_i
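The soft update applied to both target networks can be sketched as a single element-wise function; plain Python lists stand in for the network parameter tensors:

```python
def soft_update(target_params, source_params, tau=0.01):
    """theta'_i <- tau * theta_i + (1 - tau) * theta'_i, element-wise.

    tau is a scaling factor in (0, 1) controlling how strongly the current
    network parameters pull the target network parameters toward them.
    """
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

With a small τ, the target networks track the learned networks slowly, which stabilizes the TD targets used in the Critic loss.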
and evaluating the performance of the decision model according to the convergence speed of the training model and the collision probability, the passing success rate and the rewarding value of the intelligent vehicle in the test scene.
Specifically, the collision rate is defined as the ratio of the number of collisions of the vehicle per 100 trains:
wherein CollisionCounts represents the total number of collisions between vehicles during training; totalNumber is equal to 100.
The passing success rate is defined as the proportion of times that all intelligent vehicles pass through the crossroad smoothly without collision and successfully reach the target point in every 100 times of training:
Wherein SuccessCoounts denotes the total number of times that no collision occurs between vehicles during the training process and all vehicles smoothly pass through the intersection and reach the destination; totalNumber is equal to 100.
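The two evaluation metrics reduce to simple ratios over a 100-episode window, as sketched below:

```python
def collision_rate(collision_counts, total_number=100):
    """Collisions per 100 training episodes."""
    return collision_counts / total_number


def success_rate(success_counts, total_number=100):
    """Episodes (per 100) in which all vehicles cross without collision
    and reach their target points."""
    return success_counts / total_number
```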
S5, generating a multi-agent unmanned driving decision method using the decision model.
The flow of the method for generating multi-agent unmanned decision by using the decision model is shown in fig. 5.
Example two
The embodiment also provides a multi-agent unmanned decision system facing the scene of the intersection without signal lamps, which comprises: the system comprises a space design module, a noise design module, a rule design module, a model construction module and a decision generation module; the space design module is used for setting a parameter space of multiple intelligent agents based on an unmanned scene of the signalless intersection; the noise design module is used for designing a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in a parameter space; the rule design module is used for designing sampling rules of the experience playback pool for the multiple agents, so that the multiple agents effectively learn accumulated experiences; the model construction module is used for constructing a decision model based on the parameter space, the dynamic noise mechanism and the sampling rule; the decision generation module is used for generating a multi-agent unmanned decision method by utilizing the decision model.
In the following, in connection with the present embodiment, how the present invention solves the technical problems in actual operation will be described in detail.
Firstly, a space design module is used for setting a parameter space of multiple agents based on an unmanned scene of a signalless intersection. In this embodiment, the parameter space includes: a state space and an action space;
the state space is defined as follows:
S = (V_ego, V_1, V_2, …, V_n, D_1, D_2, …, D_n, Dest_ego)
Wherein S represents the state space; V_ego denotes the speed of the ego vehicle; V_1, V_2, …, V_n denote the speeds of the remaining vehicles; D_1, D_2, …, D_n denote the relative distances of the remaining vehicles from the ego vehicle; Dest_ego represents the distance of the destination from the ego vehicle;
The action space is defined as follows:
A = (Throttle_ego, Brake_ego, Steer_ego)
Wherein A represents the action space; Throttle_ego represents the throttle; Brake_ego represents the brake; Steer_ego denotes the steering angle; the action space of each agent vehicle in the scene contains these parameters.
The noise design module introduces dynamic noise on the actions output by each agent's policy network, increasing the randomness of agent behavior; the noise decreases correspondingly as the number of training rounds increases, encouraging the agents to explore more initially and, as experience accumulates, to rely increasingly on their learned strategies;
The noise value is calculated from the initial noise, the final noise, and the percentage of training rounds remaining: the difference between the initial and final noise values is multiplied by the percentage of remaining training rounds and added to the final noise value.
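The linear annealing rule just described can be sketched as follows (the initial and final noise values of 0.5 and 0.05 are illustrative, not taken from the embodiment):

```python
def dynamic_noise(initial_noise, final_noise, episode, total_episodes):
    """Linear annealing: final + (initial - final) * fraction of rounds left."""
    remaining = max(0.0, (total_episodes - episode) / total_episodes)
    return final_noise + (initial_noise - final_noise) * remaining

# Full noise at the first round, final noise once training ends.
print(dynamic_noise(0.5, 0.05, 0, 100))    # 0.5
print(dynamic_noise(0.5, 0.05, 50, 100))   # 0.275
print(dynamic_noise(0.5, 0.05, 100, 100))  # 0.05
```

The schedule interpolates linearly from exploration-heavy to exploitation-heavy behavior over the training run.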
The rule design module designs the sampling rules of the experience replay pool for the multi-agent, so that the multi-agent effectively learns the accumulated experience. The sampling rules give higher priority to experiences that cause substantial changes in agent behavior or that are close to the optimal solution. The sampling strategy selects which experience samples to replay according to their importance, so that during training the decision model concentrates on the samples with the greatest influence on learning; this improves the efficiency and speed of the learning process and lets the decision model converge more quickly to a better strategy.
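A minimal sketch of such priority-weighted sampling is shown below. The priority exponent `alpha` and the capacity are assumed hyper-parameters, and the priority metric (e.g. TD error magnitude) is chosen by the caller; a production implementation would typically use a sum-tree rather than a linear scan:

```python
import random

class PrioritizedReplayBuffer:
    """Priority-weighted experience replay (sketch).

    Transitions with larger priority (e.g. larger TD error, signalling a
    big behaviour change or proximity to the optimal solution) are
    replayed more often.
    """

    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priority skews sampling
        self.buffer = []
        self.priorities = []

    def add(self, transition, priority):
        if len(self.buffer) >= self.capacity:   # drop oldest when full
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority ** self.alpha)

    def sample(self, batch_size):
        # Importance-weighted draw: high-priority samples come up more often.
        k = min(batch_size, len(self.buffer))
        return random.choices(self.buffer, weights=self.priorities, k=k)

buf = PrioritizedReplayBuffer(capacity=100)
buf.add(("s", "a", 1.0, "s2"), priority=2.5)
buf.add(("s", "a", 0.0, "s2"), priority=0.1)
print(len(buf.sample(2)))  # 2
```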
The training process of the decision model is as follows: after the network parameters of each agent are initialized, the agents explore the environment based on the dynamic noise mechanism and accumulate experience; once the experience stored in the experience replay pool exceeds a set threshold, the agents sample according to the sampling rules of the replay pool and train the decision model.
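The training loop above can be sketched end-to-end with placeholder agents and environment. All interfaces here (`ToyEnv`, `ToyAgent`, the episode length, and the hyper-parameter defaults) are illustrative assumptions; the embodiment's actual networks and driving simulator are not shown:

```python
import random

class ToyEnv:
    """Stand-in environment: 2 agents, fixed 5-step episodes (illustrative)."""
    def reset(self):
        self.t = 0
        return [0.0, 0.0]             # one observation per agent
    def step(self, actions):
        self.t += 1
        return [0.0, 0.0], [1.0, 1.0], self.t >= 5

class ToyAgent:
    """Stand-in agent: Gaussian exploration noise, no real learning."""
    def act(self, state, noise):
        return random.gauss(0.0, noise)
    def update(self, batch):
        pass                          # gradient step would go here

def train(agents, env, pool, episodes=10, warmup=8, batch_size=4,
          noise_init=0.5, noise_final=0.05):
    for ep in range(episodes):
        # Anneal exploration noise as training rounds progress.
        noise = noise_final + (noise_init - noise_final) * (episodes - ep) / episodes
        states, done = env.reset(), False
        while not done:
            actions = [a.act(s, noise) for a, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)
            pool.append((states, actions, rewards, next_states, done))
            states = next_states
            if len(pool) > warmup:    # learn only past the experience threshold
                batch = random.sample(pool, min(batch_size, len(pool)))
                for a in agents:
                    a.update(batch)

pool = []
train([ToyAgent(), ToyAgent()], ToyEnv(), pool)
print(len(pool))  # 50 transitions: 10 episodes x 5 steps
```

The key structural points match the embodiment: exploration is noise-driven, experience accumulates first, and updates begin only after the replay pool passes the warm-up threshold.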
The reward function of the decision model includes local rewards and global rewards.

The local reward is judged by how safely and quickly a vehicle travels through the intersection from its start point to its target point, with vehicle speed and elapsed time as the reward criteria; a reward is also given according to the distance to the target point, while conflicts between vehicles are penalized. Whether a vehicle has reached the target point is judged from its distance to the target, and a task-completion reward is given on successful arrival. If all vehicles safely reach their expected target points, a global reward is given, promoting task cooperation among the vehicles.
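A possible shaping of these reward terms is sketched below. All weights and bonus/penalty magnitudes are illustrative assumptions; the patent does not specify numeric values:

```python
def local_reward(speed, dist_to_goal, time_elapsed, collided, reached,
                 w_speed=0.1, w_dist=0.05, w_time=0.01,
                 collision_penalty=-10.0, goal_bonus=5.0):
    """Per-vehicle reward: favour speed and progress, punish conflicts."""
    r = w_speed * speed - w_dist * dist_to_goal - w_time * time_elapsed
    if collided:
        r += collision_penalty      # conflict between vehicles
    if reached:
        r += goal_bonus             # task-completion reward at the target point
    return r

def global_reward(all_reached, bonus=10.0):
    """Team reward, granted only when every vehicle arrives safely."""
    return bonus if all_reached else 0.0

# A vehicle that arrives at its goal quickly and without conflict:
print(local_reward(8.0, 0.0, 12.0, False, True))  # about 5.68
print(global_reward(True))                        # 10.0
```

Splitting the signal this way lets each agent optimise its own crossing while the shared global bonus aligns the team toward everyone arriving safely.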
Finally, the decision generation module generates the multi-agent unmanned driving decision method using the decision model.
The above embodiment merely describes a preferred implementation of the present invention, and the scope of the invention is not limited thereto; modifications and improvements made by those skilled in the art without departing from the spirit of the present invention all fall within the scope of the invention as defined by the appended claims.

Claims (8)

1. A multi-agent unmanned driving decision-making method for signalless-intersection scenarios, characterized in that the steps include:

based on the unmanned driving scenario at an unsignalized intersection, setting the parameter space of the multi-agent;

designing a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in the parameter space;

designing sampling rules for the experience replay pool of the multi-agent, so that the multi-agent effectively learns the accumulated experience;

constructing a decision model based on the parameter space, the dynamic noise mechanism and the sampling rules;

using the decision model to generate the multi-agent unmanned driving decision-making method.

2. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that the parameter space includes a state space and an action space;

the state space is defined as follows:

S = (V_ego, V_1, V_2, ..., V_n, D_1, D_2, ..., D_n, Dest_ego)

where S denotes the state space; V_ego denotes the speed of the ego vehicle; V_1, V_2, ..., V_n denote the speeds of the other vehicles; D_1, D_2, ..., D_n denote the relative distances of the other vehicles from the ego vehicle; and Dest_ego denotes the distance from the ego vehicle to its destination;

the action space is defined as follows:

A = (Throttle_ego, Brake_ego, Steer_ego)

where A denotes the action space; Throttle_ego denotes the throttle; Brake_ego denotes the brake; and Steer_ego denotes the steering angle; the action space of every agent vehicle in the scene contains these parameters.

3. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that dynamic noise is introduced on the actions output by the agent policy network, increasing the randomness of agent behavior, and the noise decreases correspondingly as the number of training rounds increases; the agent is encouraged to explore more at the beginning and, by accumulating experience, gradually makes more use of its learned strategy.

4. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 3, characterized in that the noise value is calculated from the initial noise, the final noise and the percentage of remaining training rounds; specifically, the difference between the initial and final noise values is multiplied by the percentage of remaining training rounds and added to the final noise value.

5. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that the sampling rules include: giving higher priority to experiences that cause substantial changes in agent behavior or that are close to the optimal solution; the sampling strategy selects the experience samples to be replayed according to their importance, so that during training the decision model focuses on the samples with greater influence on the learning process, improving the efficiency and speed of learning and making the decision model converge more quickly to a better strategy.

6. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that the training method of the decision model includes: after the network parameters of each agent are initialized, the agents explore the environment based on the dynamic noise mechanism and accumulate experience; once the experience stored in the experience replay pool exceeds a set threshold, the agents sample according to the sampling rules of the experience replay pool and train the decision model.

7. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that the reward function of the decision model includes local rewards and global rewards;

the local reward is judged by the vehicle travelling safely and quickly through the intersection from the start point to the target point, with vehicle speed and elapsed time as the reward criteria; a reward is given according to the distance to the target point, while conflicts between vehicles are penalized; whether the vehicle has reached the target point is judged from its distance to the target point, and a task-completion reward is given on successful arrival;

if all vehicles safely reach their expected target points, a global reward is given to promote task cooperation among the vehicles.

8. A multi-agent unmanned driving decision system for signalless-intersection scenarios, the system being used to implement the method of any one of claims 1-7, characterized in that it comprises: a space design module, a noise design module, a rule design module, a model construction module and a decision generation module;

the space design module is used to set the parameter space of the multi-agent based on the unmanned driving scenario at an unsignalized intersection;

the noise design module is used to design a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in the parameter space;

the rule design module is used to design the sampling rules of the experience replay pool for the multi-agent, so that the multi-agent effectively learns the accumulated experience;

the model construction module is used to construct a decision model based on the parameter space, the dynamic noise mechanism and the sampling rules;

the decision generation module is used to generate the multi-agent unmanned driving decision method using the decision model.
CN202410948308.2A 2024-07-16 2024-07-16 Multi-agent unmanned decision method and system for intersection scene without signal lamp Pending CN118918720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410948308.2A CN118918720A (en) 2024-07-16 2024-07-16 Multi-agent unmanned decision method and system for intersection scene without signal lamp


Publications (1)

Publication Number Publication Date
CN118918720A true CN118918720A (en) 2024-11-08

Family

ID=93313776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410948308.2A Pending CN118918720A (en) 2024-07-16 2024-07-16 Multi-agent unmanned decision method and system for intersection scene without signal lamp

Country Status (1)

Country Link
CN (1) CN118918720A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257016A (en) * 2021-06-21 2021-08-13 腾讯科技(深圳)有限公司 Traffic signal control method and device and readable storage medium
CN115294784A (en) * 2022-06-21 2022-11-04 中国科学院自动化研究所 Multi-intersection traffic signal control method, device, electronic device and storage medium
CN116176606A (en) * 2023-02-22 2023-05-30 中国船舶集团有限公司第七〇九研究所 Intelligent body reinforcement learning method and device for controlling vehicle driving
CN117496707A (en) * 2023-11-02 2024-02-02 北京联合大学 Intersection decision-making method based on multi-agent deep reinforcement learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119889073A (en) * 2025-03-27 2025-04-25 吉林大学 Intelligent vehicle decision method for signal-free crossroad
CN119889073B (en) * 2025-03-27 2025-06-27 吉林大学 A method for intelligent vehicle decision making at an unsignalized intersection

Similar Documents

Publication Publication Date Title
CN112099496B (en) Automatic driving training method, device, equipment and medium
CN111696370B (en) Traffic light control method based on heuristic deep Q network
Liang et al. A deep reinforcement learning network for traffic light cycle control
CN111898211B (en) Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof
CN113223305B (en) Multi-intersection traffic light control method, system and storage medium based on reinforcement learning
CN114852105A (en) Method and system for planning track change of automatic driving vehicle
CN112550314B (en) Embedded optimization control method suitable for unmanned driving, its driving control module and automatic driving control system
CN113276852A (en) Unmanned lane keeping method based on maximum entropy reinforcement learning framework
Yuan et al. Prioritized experience replay-based deep q learning: Multiple-reward architecture for highway driving decision making
CN119169818B (en) Intersection entrance road mixed vehicle team collaborative guiding control method
CN117275228B (en) Urban road network traffic signal timing optimization control method
CN114463997A (en) A method and system for cooperative vehicle control at an intersection without signal lights
CN114074680A (en) Vehicle lane change behavior decision method and system based on deep reinforcement learning
CN117808652A (en) A bus scheduling method based on multi-agent path planning
CN118545093A (en) TD3 autonomous driving decision-making method based on multiple experience replay pools
CN118918720A (en) Multi-agent unmanned decision method and system for intersection scene without signal lamp
Huang Reinforcement learning based adaptive control method for traffic lights in intelligent transportation
CN114360290A (en) Method for selecting vehicle group lanes in front of intersection based on reinforcement learning
CN114613170B (en) Traffic signal lamp intersection coordination control method based on reinforcement learning
Luo et al. Researches on intelligent traffic signal control based on deep reinforcement learning
CN114937506A (en) Epidemic situation prevention and control-oriented bus transit reinforcement learning speed control method
CN120396986A (en) A decision-making planning method for autonomous vehicles based on deep reinforcement learning
CN119472677A (en) A TD3 map-free navigation method based on dynamic window guidance
CN116343516B (en) A method for intersection management based on intelligent connected vehicles
CN114627640B (en) Dynamic evolution method of intelligent network-connected automobile driving strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20241108