
CN118918720A - Multi-agent unmanned decision method and system for intersection scene without signal lamp - Google Patents

Multi-agent unmanned decision method and system for intersection scene without signal lamp Download PDF

Info

Publication number
CN118918720A
CN118918720A (application CN202410948308.2A)
Authority
CN
China
Prior art keywords
agent
decision
experience
ego
unmanned driving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410948308.2A
Other languages
Chinese (zh)
Inventor
杜煜
张昊
赵世昕
吕和君
原颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Union University
Original Assignee
Beijing Union University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN202410948308.2A priority Critical patent/CN118918720A/en
Publication of CN118918720A publication Critical patent/CN118918720A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/09Arrangements for giving variable traffic instructions
    • G08G1/0962Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0967Systems involving transmission of highway information, e.g. weather, speed limits
    • G08G1/096708Systems involving transmission of highway information, e.g. weather, speed limits where the received information might be used to generate an automatic action on the vehicle control
    • G08G1/096725Systems involving transmission of highway information, e.g. weather, speed limits where the received information might be used to generate an automatic action on the vehicle control where the received information generates an automatic action on the vehicle control
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/16Anti-collision systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30Services specially adapted for particular environments, situations or purposes
    • H04W4/40Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/44Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Atmospheric Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract


The present invention discloses a multi-agent unmanned driving decision-making method and system for non-signalized intersection scenes. The method comprises: setting the parameter space of the multiple agents based on the unmanned driving scene of a non-signalized intersection; designing a dynamic noise mechanism for the agents so that they accumulate experience in the parameter space; designing a sampling rule for the experience replay pool so that the agents effectively learn from the accumulated experience; constructing a decision model based on the parameter space, the dynamic noise mechanism, and the sampling rule; and generating the multi-agent unmanned driving decision method with the decision model. The invention enables unmanned vehicles to pass through complex non-signalized intersections. The method mitigates the low learning efficiency caused by the complexity of multi-agent reinforcement learning models, improves the stability and robustness of the finally trained decision policy, and reduces the collision rate of unmanned vehicles at non-signalized intersections.

Description

Multi-agent unmanned decision method and system for intersection scene without signal lamp
Technical Field
The invention relates to the field of unmanned driving, and in particular to a multi-agent unmanned driving decision method and system for signal-lamp-free intersection scenes.
Background
Single-agent reinforcement learning algorithms, such as Q-learning, DQN, DDPG, and PPO, are widely applied in the unmanned driving field, including decision-making at unmanned intersections. However, such algorithms train only one agent and cannot handle more complex multi-agent strategies involving coordination and competition. Compared with multi-agent reinforcement learning, single-agent reinforcement learning therefore has clear limitations: it typically assumes that the driving behavior of other vehicles is fixed and follows certain rules, whereas in the real world the drivers of other vehicles may adjust their decisions based on observed driving behavior, which introduces uncertainty and unpredictability.
For complex decision-making scenarios involving multiple vehicles, such as intersections, it is worth considering multi-agent deep reinforcement learning to achieve better decision results. In multi-agent reinforcement learning, multiple agents (unmanned vehicles) learn simultaneously in a shared environment and constantly adjust their strategies; this makes the environment non-stationary and complicates the learning process, but it better matches complex real-world scenes. At present, research on multi-agent deep reinforcement learning decision algorithms for non-signalized intersection scenes is still relatively scarce, so improving a multi-agent deep reinforcement learning algorithm and applying it to unmanned decision-making at such intersections has broad research prospects.
Among multi-agent reinforcement learning schemes, fully centralized learning scales poorly to large numbers of agents, and some agents may learn degenerate, passive strategies; fully independent learning faces the problem of environmental non-stationarity. Centralized training with decentralized execution is more feasible, and the problem then becomes how to train an independent policy for each agent from a global perspective.
Disclosure of Invention
To solve the above technical problems, the invention designs a multi-agent unmanned driving decision method based on an importance-sampling module and a dynamic noise mechanism. For experiences stored in the experience replay pool, the sampling frequency of each sample is determined by an importance weight, making learning more efficient. A dynamic noise mechanism is introduced so that the fixed action noise decreases as training rounds increase, letting the agents rely more and more on their learned strategies to complete the task. Together, these improve the learning efficiency and robustness of the multi-agent reinforcement learning decision model, raise the decision efficiency and passing success rate of unmanned vehicles facing complex dynamic non-signalized intersections, and mitigate collisions between vehicles.
In order to achieve the above purpose, the invention provides a multi-agent unmanned decision method for a scene of a no-signal lamp intersection, comprising the following steps:
Setting a parameter space of multiple agents based on an unmanned scene of the signalless intersection;
designing a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in the parameter space;
the sampling rule of the experience playback pool is designed for the multi-agent, so that the multi-agent effectively learns accumulated experiences;
constructing a decision model based on the parameter space, the dynamic noise mechanism and the sampling rule;
and generating a multi-agent unmanned decision method by using the decision model.
Preferably, the parameter space includes: a state space and an action space;
the state space is defined as follows:
S = (V_ego, V_1, V_2, …, V_n, D_1, D_2, …, D_n, Dest_ego)
Wherein S represents the state space; V_ego denotes the speed of the ego vehicle; V_1, V_2, …, V_n denote the speeds of the remaining vehicles; D_1, D_2, …, D_n denote the relative distances of the remaining vehicles from the ego vehicle; Dest_ego represents the distance of the destination from the ego vehicle;
The action space is defined as follows:
A = (Throttle_ego, Brake_ego, Steer_ego)
Wherein A represents the action space; Throttle_ego represents the throttle; Brake_ego represents the brake; Steer_ego denotes the steering angle; the action space of each agent vehicle in the scene contains these parameters.
Preferably, dynamic noise is introduced on the actions output by each agent's policy network, increasing the randomness of agent behavior; the noise decreases correspondingly as the number of training rounds increases, encouraging the agents to explore more initially and, as experience accumulates, to rely increasingly on their learned strategies.
Preferably, the noise value is calculated from the initial noise, the final noise, and the percentage of remaining training rounds; specifically, the difference between the initial noise value and the final noise value is multiplied by the percentage of remaining training rounds and added to the final noise value.
Preferably, the sampling rule includes: experience that leads to substantial changes in agent behavior or near optimal solutions is given higher priority; the sampling strategy is to select experience samples to be replayed according to the importance of experience, the decision model is more concentrated on samples with larger influence on the learning process during training, the efficiency and the speed of the learning process are improved, and the decision model is converged to a better strategy more quickly.
Preferably, the training method of the decision model comprises the following steps: after initializing network parameters of each intelligent agent, the intelligent agent explores the environment based on the dynamic noise mechanism and accumulates experience; and after the experience stored in the experience playback pool exceeds a set threshold, the intelligent agent samples based on the sampling rule of the experience playback pool, and trains a decision model.
Preferably, the reward function of the decision model comprises: local rewards and global rewards;
The local reward evaluates each vehicle traveling from its start point to its target point and passing through the intersection safely and quickly, using vehicle speed and elapsed time as judging criteria; a reward is given according to the distance to the target point, and a penalty is given for conflicts between vehicles; whether a vehicle has reached the target point is judged by its distance to that point, and a task-completion reward is given when it arrives smoothly;
if all vehicles safely reach the expected target point, a global reward is given, and task cooperation among the vehicles is promoted.
The invention also provides a multi-agent unmanned decision system facing the scene of the intersection without signal lamps, which is used for realizing the method and comprises the following steps: the system comprises a space design module, a noise design module, a rule design module, a model construction module and a decision generation module;
the space design module is used for setting a parameter space of multiple intelligent agents based on an unmanned scene of the signalless intersection;
the noise design module is used for designing a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in the parameter space;
the rule design module is used for designing a sampling rule of the experience playback pool for the multi-agent, so that the multi-agent effectively learns accumulated experiences;
the model construction module is used for constructing a decision model based on the parameter space, the dynamic noise mechanism and the sampling rule;
The decision generation module is used for generating a multi-agent unmanned decision method by utilizing the decision model.
Compared with the prior art, the invention has the following beneficial effects:
The invention enables unmanned vehicles to pass through complex non-signalized intersections. The method not only mitigates the low learning efficiency caused by the complexity of multi-agent reinforcement learning models, but also improves the stability and robustness of the finally trained decision strategy, reducing the collision rate of unmanned vehicles at non-signalized intersections and improving the vehicle passing rate.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the embodiments are briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a scenario simulation of a signalless intersection in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a sampling rule according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a decision model according to an embodiment of the present invention;
Fig. 5 is a general flow chart of a decision making method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, a flow chart of a method of the present embodiment includes the steps of:
s1, setting a parameter space of multiple agents based on an unmanned scene of a signalless intersection.
In this embodiment, the agents may be regarded as unmanned vehicles themselves or algorithm models representing such vehicles, i.e., multiple agents.
The method of this embodiment improves on the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. Each agent in MADDPG adopts the same Actor-Critic architecture as DDPG, and each experience sample in the replay pool has the form (s, a_1, a_2, …, a_N, r_1, …, r_N, s', d).
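For illustration only (not part of the patent text), the joint experience replay pool storing tuples of the form (s, a_1, …, a_N, r_1, …, r_N, s', d) might be sketched in Python as follows; the class name and default capacity are assumptions:

```python
import random
from collections import deque


class MultiAgentReplayBuffer:
    """Joint replay buffer storing (s, a_1..a_N, r_1..r_N, s', d) tuples."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, actions, rewards, next_state, done):
        # One transition covers all N agents' actions and rewards.
        self.buffer.append((state, tuple(actions), tuple(rewards), next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; the importance-based rule of step S3 refines this.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```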
Constructing the unmanned decision model requires setting a corresponding parameter space according to the unmanned scene, and in this embodiment, the parameter space includes: a state space and an action space. In the training stage, the state space input of each agent contains all other agent participants in the scene; meanwhile, continuous action output is adopted, the problem of continuous behavior space in the unmanned decision problem is solved, and the scene is shown in figure 2. The method comprises the following specific steps:
s1.1, designing an intelligent vehicle state space. The scene is exemplified by a typical two-way single-lane signalless intersection in a multi-agent scene, and the state space needs to contain state data of all vehicles acquired by vehicles in the scene. Specifically, taking any one of the intelligent vehicles in the scene as an example, the state space is defined as follows:
S = (V_ego, V_1, V_2, …, V_n, D_1, D_2, …, D_n, Dest_ego)
Wherein S represents the state space; V_ego denotes the speed of the ego vehicle; V_1, V_2, …, V_n denote the speeds of the remaining vehicles; D_1, D_2, …, D_n denote the relative distances of the remaining vehicles from the ego vehicle; Dest_ego represents the distance of the destination from the ego vehicle.
S1.2, designing an intelligent vehicle action space. An action space is a collection of all actions that an agent may take, the agent's behavior being defined by the action space, the correct definition of the action space facilitating the learning process, enabling the agent to efficiently explore and utilize experience to achieve a given goal.
The decision model provided in this embodiment is more suitable for solving the problem of continuous behavior space in the unmanned decision problem, and the design of the action space is continuous, taking any one of the intelligent vehicles ego in the scene as an example, and the action space is defined as follows:
A = (Throttle_ego, Brake_ego, Steer_ego)
Wherein A represents the action space; Throttle_ego represents the throttle; Brake_ego represents the brake; Steer_ego denotes the steering angle; the action space of each agent vehicle in the scene contains these parameters.
The three control signals are all continuous values, so that the intelligent vehicle can accelerate, decelerate and turn, and further smoothly pass through the intersection to reach the destination.
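As an illustrative sketch (not part of the patent), the state and action vectors defined above could be assembled as follows; the [0, 1] throttle/brake and [-1, 1] steering bounds are assumptions, since the patent only states that the signals are continuous:

```python
import numpy as np


def build_state(v_ego, v_others, d_others, dest_ego):
    """Flatten S = (V_ego, V_1..V_n, D_1..D_n, Dest_ego) into one vector."""
    return np.concatenate(([v_ego], v_others, d_others, [dest_ego]))


def clip_action(throttle, brake, steer):
    """Clamp A = (Throttle, Brake, Steer) to continuous ranges.

    The [0, 1] and [-1, 1] bounds are illustrative assumptions.
    """
    return np.array([np.clip(throttle, 0.0, 1.0),
                     np.clip(brake, 0.0, 1.0),
                     np.clip(steer, -1.0, 1.0)])
```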
S2, a dynamic noise mechanism is designed for the multi-agent, so that the multi-agent accumulates experience in a parameter space.
To encourage the agents to explore the environment, noise is added to their actions in this embodiment, increasing the randomness of agent behavior. The noise changes dynamically, decreasing as the number of training rounds increases.
The noise value is calculated from the initial noise, the final noise, and the percentage of remaining training rounds, as shown in the formula:
Explr_remain = (exploration_eps − ep_i) / exploration_eps
Wherein Explr_remain represents the percentage of remaining training rounds; exploration_eps represents the total number of training rounds; ep_i represents the current round number.
The noise introduced in the current round equals the final noise value plus the difference between the initial and final noise values multiplied by the percentage of remaining training rounds:
Noise = noise_final + (noise_init − noise_final) × Explr_remain
Wherein Noise represents the noise introduced in the current round; noise_init represents the initial noise value; noise_final represents the final noise value.
As the exploration period shrinks, the noise value gradually transitions from the initial value to the final value: the agent is encouraged to explore more at the start and, as experience accumulates, relies increasingly on its learned strategies.
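The decay schedule above can be sketched in a few lines of Python (the default initial and final noise values are assumptions, not given in the patent):

```python
def dynamic_noise(ep_i, exploration_eps, noise_init=1.0, noise_final=0.05):
    """Noise = noise_final + (noise_init - noise_final) * Explr_remain,
    where Explr_remain = (exploration_eps - ep_i) / exploration_eps."""
    explr_remain = max(0.0, (exploration_eps - ep_i) / exploration_eps)
    return noise_final + (noise_init - noise_final) * explr_remain
```

At episode 0 the full initial noise is applied; once ep_i reaches exploration_eps, only the final noise remains, so the agent acts almost entirely on its learned policy.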
S3, designing a sampling rule of an experience playback pool for the multiple agents, so that the multiple agents effectively learn accumulated experiences.
Under the scene of multi-agent reinforcement learning, the instability of the dynamic change of the environment causes slow convergence speed in the training of the decision model, and the optimal strategy is difficult to learn; in this embodiment, empirical importance sampling is introduced in the design to improve model learning efficiency and convergence rate. Experience that causes the agent behavior to change substantially or near optimal solutions is given higher priority by designing sampling rules for the multi-agent experience playback pool, as shown in fig. 3.
The sampling strategy selects experience samples to replay according to the importance of the experience, where priority is measured by the TD error or reward prediction error: the larger the error, the less accurately the model predicts that experience, so its priority is higher and it requires further learning.
The prediction error δ_t can be expressed by the following formula:
δ_t = |r_t + γ·V_{t+1} − Q(s_t, a_t)|
Where r_t denotes the currently obtained reward; γ represents the discount factor; V_{t+1} denotes the value of the next state calculated using the target network; Q(s_t, a_t) represents the value of performing action a_t in the current state; specifically, a_t = π(s_t) is the action selected in the current state, π represents the policy the current agent has learned, and s_t represents the current state.
Specifically, the next-state value V_{t+1} computed by the target network is expressed as:
V_{t+1} = Q'(s_{t+1}, π'(s_{t+1}))
Wherein Q' is the target Q network; π' is the target policy network; s_{t+1} denotes the next state.
The priority weight prior_i of an experience sample is defined as follows:
prior_i = (δ_i + ε)^α
where ε is a small constant used to ensure the priority is never zero; α is a hyper-parameter controlling the priority weight, used to prevent high-priority experiences from being over-sampled and biasing training.
The probability that each experience sample is selected is calculated from its priority weight, defined as follows:
p_i = prior_i / Σ_k prior_k
Where p_i represents the probability that experience sample i is selected.
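For illustration, the priority weighting and sampling probabilities described above can be sketched as follows; the ε and α defaults are assumptions (the patent does not fix their values):

```python
import numpy as np


def priority_probs(td_errors, eps=1e-3, alpha=0.6):
    """prior_i = (|delta_i| + eps)**alpha, normalized to p_i = prior_i / sum_k prior_k."""
    prior = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha
    return prior / prior.sum()


def sample_batch(probs, batch_size, rng=None):
    """Draw distinct sample indices according to the priority probabilities."""
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.choice(len(probs), size=batch_size, p=probs, replace=False)
```

Samples with larger TD errors receive higher probabilities, so the replay focuses on the transitions the model predicts worst.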
S4, constructing a decision model based on the parameter space, the dynamic noise mechanism, and the sampling rule.
The decision model is shown in fig. 4. When the decision model samples experience during training, it focuses on the samples with greater influence on the learning process, uses the samples in the experience buffer more effectively, improves the efficiency and speed of learning, and helps the reinforcement learning algorithm converge to a better strategy more quickly.
In this embodiment, the reward functions of the decision model include local rewards and global rewards. Specifically, the local rewards are carried out by taking the vehicle from the starting point to the target point, safely and quickly passing through the intersection, and taking the vehicle speed and the time as rewards judging standards; giving a certain rewards for the distance from the target point, and giving punishment for conflicts occurring between vehicles; judging whether the vehicle reaches the target point according to the distance between the vehicle and the target point, and giving rewards for completing the task if the vehicle reaches the target point smoothly; if all vehicles safely reach the expected target point, a global reward is given, and task cooperation among the vehicles is promoted.
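A minimal sketch of the local and global reward structure described above is given below; all coefficients, the goal radius, and the bonus magnitude are assumptions for illustration, as the patent does not specify numeric values:

```python
def local_reward(speed, dist_to_goal, prev_dist_to_goal, collided, goal_radius=2.0):
    """Illustrative local reward: progress toward the target plus a speed term,
    a collision penalty, and a task-completion bonus (coefficients assumed)."""
    r = 0.1 * speed + (prev_dist_to_goal - dist_to_goal)  # speed and progress terms
    if collided:
        r -= 10.0                                          # punish vehicle conflicts
    if dist_to_goal < goal_radius:
        r += 20.0                                          # reward reaching the target
    return r


def global_reward(all_reached, bonus=50.0):
    """Shared bonus given only when every vehicle safely reaches its target point,
    promoting task cooperation among the vehicles."""
    return bonus if all_reached else 0.0
```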
The training method of the decision model is as follows: after the network parameters of each agent are initialized, the agents explore the environment based on the dynamic noise mechanism and accumulate experience; after the experience stored in the replay pool exceeds a set threshold, the agents sample according to the importance-based sampling rule of the replay pool and train the decision model. The specific steps are as follows:
initializing the Actor network of every agent vehicle and its parameters ω_i;
initializing the Critic network of every agent vehicle and its parameters θ_i;
initializing the target Actor network of every agent vehicle and its parameters ω'_i;
initializing the target Critic network of every agent vehicle and its parameters θ'_i;
initializing the experience replay buffer D;
Training a multi-agent unmanned decision model based on a dynamic noise mechanism and importance sampling:
According to the current environment state, each agent vehicle selects an action through its policy network to explore the environment and obtains an initial observation;
each time step, the following steps are performed:
Each intelligent vehicle obtains an action a according to the current observation value and the Actor network thereof.
And executing the actions of all the intelligent vehicles, acquiring the next observed value s' and the rewards r, and judging whether the passing task of all the vehicles is finished or not.
The observation s, the actions a_1, …, a_N, the rewards r_1, …, r_N, the next observation s', and the done flag d are stored in the experience replay pool (experience replay buffer) D:
D ← (s, a_1, a_2, …, a_N, r_1, …, r_N, s', d)
The importance weights of the experience samples in the replay pool are updated, along with the sampling probability p_i of each sample. Once the replay pool holds a sufficient amount of experience, each agent vehicle performs the following steps:
A batch of experiences (s, a_1, a_2, …, a_N, r_1, …, r_N, s', d) is sampled from the experience replay buffer.
The target Q value and the actual Q value of each agent are calculated from the sampled experience and the current Critic network.
The Critic loss is computed as the mean squared error between them.
The Critic network is updated by gradient descent.
The Actor loss is calculated from the current Actor and Critic networks.
The Actor network is updated by gradient descent.
The target Actor network parameters ω'_i of each agent vehicle are updated using a soft update policy:
ω'_i = τ·ω_i + (1 − τ)·ω'_i
where τ is a scaling factor between 0 and 1 representing the extent to which the current network parameters affect the update of the target network parameters.
The target Critic network parameters θ'_i of each agent vehicle are updated using the same soft update policy:
θ'_i = τ·θ_i + (1 − τ)·θ'_i
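The soft update applied to both target networks can be sketched as a single element-wise function; plain Python lists stand in for the network parameter tensors:

```python
def soft_update(target_params, source_params, tau=0.01):
    """theta'_i <- tau * theta_i + (1 - tau) * theta'_i, element-wise.

    tau is a scaling factor in (0, 1) controlling how strongly the current
    network parameters pull the target network parameters toward them.
    """
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

With a small τ, the target networks track the learned networks slowly, which stabilizes the TD targets used in the Critic loss.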
and evaluating the performance of the decision model according to the convergence speed of the training model and the collision probability, the passing success rate and the rewarding value of the intelligent vehicle in the test scene.
Specifically, the collision rate is defined as the ratio of the number of collisions of the vehicle per 100 trains:
wherein CollisionCounts represents the total number of collisions between vehicles during training; totalNumber is equal to 100.
The passing success rate is defined as the proportion of times that all intelligent vehicles pass through the crossroad smoothly without collision and successfully reach the target point in every 100 times of training:
Wherein SuccessCoounts denotes the total number of times that no collision occurs between vehicles during the training process and all vehicles smoothly pass through the intersection and reach the destination; totalNumber is equal to 100.
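The two evaluation metrics reduce to simple ratios over a 100-episode window, as sketched below:

```python
def collision_rate(collision_counts, total_number=100):
    """Collisions per 100 training episodes."""
    return collision_counts / total_number


def success_rate(success_counts, total_number=100):
    """Episodes (per 100) in which all vehicles cross without collision
    and reach their target points."""
    return success_counts / total_number
```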
S5, generating a multi-agent unmanned driving decision method using the decision model.
The flow of the method for generating multi-agent unmanned decision by using the decision model is shown in fig. 5.
Example two
The embodiment also provides a multi-agent unmanned decision system facing the scene of the intersection without signal lamps, which comprises: the system comprises a space design module, a noise design module, a rule design module, a model construction module and a decision generation module; the space design module is used for setting a parameter space of multiple intelligent agents based on an unmanned scene of the signalless intersection; the noise design module is used for designing a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in a parameter space; the rule design module is used for designing sampling rules of the experience playback pool for the multiple agents, so that the multiple agents effectively learn accumulated experiences; the model construction module is used for constructing a decision model based on the parameter space, the dynamic noise mechanism and the sampling rule; the decision generation module is used for generating a multi-agent unmanned decision method by utilizing the decision model.
In the following, in connection with the present embodiment, how the present invention solves the technical problems in actual operation will be described in detail.
Firstly, a space design module is used for setting a parameter space of multiple agents based on an unmanned scene of a signalless intersection. In this embodiment, the parameter space includes: a state space and an action space;
the state space is defined as follows:
S = (V_ego, V_1, V_2, …, V_n, D_1, D_2, …, D_n, Dest_ego)
Wherein S represents the state space; V_ego denotes the speed of the ego vehicle; V_1, V_2, …, V_n denote the speeds of the remaining vehicles; D_1, D_2, …, D_n denote the relative distances of the remaining vehicles from the ego vehicle; Dest_ego represents the distance of the destination from the ego vehicle;
The action space is defined as follows:
A = (Throttle_ego, Brake_ego, Steer_ego)
Wherein A represents the action space; Throttle_ego represents the throttle; Brake_ego represents the brake; Steer_ego denotes the steering angle; the action space of each agent vehicle in the scene contains these parameters.
The noise design module introduces dynamic noise on the actions output by each agent's policy network, increasing the randomness of agent behavior; the noise decreases correspondingly as the number of training rounds increases, encouraging the agents to explore more initially and, as experience accumulates, to rely increasingly on their learned strategies;
The noise value is calculated from the initial noise, the final noise, and the percentage of training rounds remaining: the difference between the initial and final noise values is multiplied by the percentage of remaining training rounds and added to the final noise value.
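The linear annealing rule just described can be sketched as follows (the initial and final noise values of 0.5 and 0.05 are illustrative, not taken from the embodiment):

```python
def dynamic_noise(initial_noise, final_noise, episode, total_episodes):
    """Linear annealing: final + (initial - final) * fraction of rounds left."""
    remaining = max(0.0, (total_episodes - episode) / total_episodes)
    return final_noise + (initial_noise - final_noise) * remaining

# Full noise at the first round, final noise once training ends.
print(dynamic_noise(0.5, 0.05, 0, 100))    # 0.5
print(dynamic_noise(0.5, 0.05, 50, 100))   # 0.275
print(dynamic_noise(0.5, 0.05, 100, 100))  # 0.05
```

The schedule interpolates linearly from exploration-heavy to exploitation-heavy behavior over the training run.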
The rule design module designs the sampling rules of the experience replay pool for the multi-agent, so that the multi-agent effectively learns the accumulated experience. The sampling rules give higher priority to experiences that cause substantial changes in agent behavior or that are close to the optimal solution. The sampling strategy selects which experience samples to replay according to their importance, so that during training the decision model concentrates on the samples with the greatest influence on learning; this improves the efficiency and speed of the learning process and lets the decision model converge more quickly to a better strategy.
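A minimal sketch of such priority-weighted sampling is shown below. The priority exponent `alpha` and the capacity are assumed hyper-parameters, and the priority metric (e.g. TD error magnitude) is chosen by the caller; a production implementation would typically use a sum-tree rather than a linear scan:

```python
import random

class PrioritizedReplayBuffer:
    """Priority-weighted experience replay (sketch).

    Transitions with larger priority (e.g. larger TD error, signalling a
    big behaviour change or proximity to the optimal solution) are
    replayed more often.
    """

    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priority skews sampling
        self.buffer = []
        self.priorities = []

    def add(self, transition, priority):
        if len(self.buffer) >= self.capacity:   # drop oldest when full
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority ** self.alpha)

    def sample(self, batch_size):
        # Importance-weighted draw: high-priority samples come up more often.
        k = min(batch_size, len(self.buffer))
        return random.choices(self.buffer, weights=self.priorities, k=k)

buf = PrioritizedReplayBuffer(capacity=100)
buf.add(("s", "a", 1.0, "s2"), priority=2.5)
buf.add(("s", "a", 0.0, "s2"), priority=0.1)
print(len(buf.sample(2)))  # 2
```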
The training process of the decision model is as follows: after the network parameters of each agent are initialized, the agents explore the environment based on the dynamic noise mechanism and accumulate experience; once the experience stored in the experience replay pool exceeds a set threshold, the agents sample according to the sampling rules of the replay pool and train the decision model.
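The training loop above can be sketched end-to-end with placeholder agents and environment. All interfaces here (`ToyEnv`, `ToyAgent`, the episode length, and the hyper-parameter defaults) are illustrative assumptions; the embodiment's actual networks and driving simulator are not shown:

```python
import random

class ToyEnv:
    """Stand-in environment: 2 agents, fixed 5-step episodes (illustrative)."""
    def reset(self):
        self.t = 0
        return [0.0, 0.0]             # one observation per agent
    def step(self, actions):
        self.t += 1
        return [0.0, 0.0], [1.0, 1.0], self.t >= 5

class ToyAgent:
    """Stand-in agent: Gaussian exploration noise, no real learning."""
    def act(self, state, noise):
        return random.gauss(0.0, noise)
    def update(self, batch):
        pass                          # gradient step would go here

def train(agents, env, pool, episodes=10, warmup=8, batch_size=4,
          noise_init=0.5, noise_final=0.05):
    for ep in range(episodes):
        # Anneal exploration noise as training rounds progress.
        noise = noise_final + (noise_init - noise_final) * (episodes - ep) / episodes
        states, done = env.reset(), False
        while not done:
            actions = [a.act(s, noise) for a, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)
            pool.append((states, actions, rewards, next_states, done))
            states = next_states
            if len(pool) > warmup:    # learn only past the experience threshold
                batch = random.sample(pool, min(batch_size, len(pool)))
                for a in agents:
                    a.update(batch)

pool = []
train([ToyAgent(), ToyAgent()], ToyEnv(), pool)
print(len(pool))  # 50 transitions: 10 episodes x 5 steps
```

The key structural points match the embodiment: exploration is noise-driven, experience accumulates first, and updates begin only after the replay pool passes the warm-up threshold.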
The reward function of the decision model includes local rewards and global rewards.

The local reward is judged by how safely and quickly a vehicle travels through the intersection from its start point to its target point, with vehicle speed and elapsed time as the reward criteria; a reward is also given according to the distance to the target point, while conflicts between vehicles are penalized. Whether a vehicle has reached the target point is judged from its distance to the target, and a task-completion reward is given on successful arrival. If all vehicles safely reach their expected target points, a global reward is given, promoting task cooperation among the vehicles.
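A possible shaping of these reward terms is sketched below. All weights and bonus/penalty magnitudes are illustrative assumptions; the patent does not specify numeric values:

```python
def local_reward(speed, dist_to_goal, time_elapsed, collided, reached,
                 w_speed=0.1, w_dist=0.05, w_time=0.01,
                 collision_penalty=-10.0, goal_bonus=5.0):
    """Per-vehicle reward: favour speed and progress, punish conflicts."""
    r = w_speed * speed - w_dist * dist_to_goal - w_time * time_elapsed
    if collided:
        r += collision_penalty      # conflict between vehicles
    if reached:
        r += goal_bonus             # task-completion reward at the target point
    return r

def global_reward(all_reached, bonus=10.0):
    """Team reward, granted only when every vehicle arrives safely."""
    return bonus if all_reached else 0.0

# A vehicle that arrives at its goal quickly and without conflict:
print(local_reward(8.0, 0.0, 12.0, False, True))  # about 5.68
print(global_reward(True))                        # 10.0
```

Splitting the signal this way lets each agent optimise its own crossing while the shared global bonus aligns the team toward everyone arriving safely.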
Finally, the decision generation module generates the multi-agent unmanned driving decision method using the decision model.
The above embodiment merely describes a preferred implementation of the present invention, and the scope of the invention is not limited thereto; modifications and improvements made by those skilled in the art without departing from the spirit of the present invention all fall within the scope of the invention as defined by the appended claims.

Claims (8)

1. A multi-agent unmanned driving decision-making method for signalless-intersection scenarios, characterized in that the steps include:

based on the unmanned driving scenario at an unsignalized intersection, setting the parameter space of the multi-agent;

designing a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in the parameter space;

designing sampling rules for the experience replay pool of the multi-agent, so that the multi-agent effectively learns the accumulated experience;

constructing a decision model based on the parameter space, the dynamic noise mechanism and the sampling rules;

using the decision model to generate the multi-agent unmanned driving decision-making method.

2. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that the parameter space includes a state space and an action space;

the state space is defined as follows:

S = (V_ego, V_1, V_2, ..., V_n, D_1, D_2, ..., D_n, Dest_ego)

where S denotes the state space; V_ego denotes the speed of the ego vehicle; V_1, V_2, ..., V_n denote the speeds of the other vehicles; D_1, D_2, ..., D_n denote the relative distances of the other vehicles from the ego vehicle; and Dest_ego denotes the distance from the ego vehicle to its destination;

the action space is defined as follows:

A = (Throttle_ego, Brake_ego, Steer_ego)

where A denotes the action space; Throttle_ego denotes the throttle; Brake_ego denotes the brake; and Steer_ego denotes the steering angle; the action space of every agent vehicle in the scene contains these parameters.

3. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that dynamic noise is introduced on the actions output by the agent policy network, increasing the randomness of agent behavior, and the noise decreases correspondingly as the number of training rounds increases; the agent is encouraged to explore more at the beginning and, by accumulating experience, gradually makes more use of its learned strategy.

4. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 3, characterized in that the noise value is calculated from the initial noise, the final noise and the percentage of remaining training rounds; specifically, the difference between the initial and final noise values is multiplied by the percentage of remaining training rounds and added to the final noise value.

5. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that the sampling rules include: giving higher priority to experiences that cause substantial changes in agent behavior or that are close to the optimal solution; the sampling strategy selects the experience samples to be replayed according to their importance, so that during training the decision model focuses on the samples with greater influence on the learning process, improving the efficiency and speed of learning and making the decision model converge more quickly to a better strategy.

6. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that the training method of the decision model includes: after the network parameters of each agent are initialized, the agents explore the environment based on the dynamic noise mechanism and accumulate experience; once the experience stored in the experience replay pool exceeds a set threshold, the agents sample according to the sampling rules of the experience replay pool and train the decision model.

7. The multi-agent unmanned driving decision-making method for signalless-intersection scenarios according to claim 1, characterized in that the reward function of the decision model includes local rewards and global rewards;

the local reward is judged by the vehicle travelling safely and quickly through the intersection from the start point to the target point, with vehicle speed and elapsed time as the reward criteria; a reward is given according to the distance to the target point, while conflicts between vehicles are penalized; whether the vehicle has reached the target point is judged from its distance to the target point, and a task-completion reward is given on successful arrival;

if all vehicles safely reach their expected target points, a global reward is given to promote task cooperation among the vehicles.

8. A multi-agent unmanned driving decision system for signalless-intersection scenarios, the system being used to implement the method of any one of claims 1-7, characterized in that it comprises: a space design module, a noise design module, a rule design module, a model construction module and a decision generation module;

the space design module is used to set the parameter space of the multi-agent based on the unmanned driving scenario at an unsignalized intersection;

the noise design module is used to design a dynamic noise mechanism for the multi-agent, so that the multi-agent accumulates experience in the parameter space;

the rule design module is used to design the sampling rules of the experience replay pool for the multi-agent, so that the multi-agent effectively learns the accumulated experience;

the model construction module is used to construct a decision model based on the parameter space, the dynamic noise mechanism and the sampling rules;

the decision generation module is used to generate the multi-agent unmanned driving decision method using the decision model.
CN202410948308.2A 2024-07-16 2024-07-16 Multi-agent unmanned decision method and system for intersection scene without signal lamp Pending CN118918720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410948308.2A CN118918720A (en) 2024-07-16 2024-07-16 Multi-agent unmanned decision method and system for intersection scene without signal lamp


Publications (1)

Publication Number Publication Date
CN118918720A true CN118918720A (en) 2024-11-08

Family

ID=93313776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410948308.2A Pending CN118918720A (en) 2024-07-16 2024-07-16 Multi-agent unmanned decision method and system for intersection scene without signal lamp

Country Status (1)

Country Link
CN (1) CN118918720A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257016A (en) * 2021-06-21 2021-08-13 腾讯科技(深圳)有限公司 Traffic signal control method and device and readable storage medium
CN115294784A (en) * 2022-06-21 2022-11-04 中国科学院自动化研究所 Multi-intersection traffic signal control method, device, electronic device and storage medium
CN116176606A (en) * 2023-02-22 2023-05-30 中国船舶集团有限公司第七〇九研究所 Intelligent body reinforcement learning method and device for controlling vehicle driving
CN117496707A (en) * 2023-11-02 2024-02-02 北京联合大学 Intersection decision-making method based on multi-agent deep reinforcement learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119889073A (en) * 2025-03-27 2025-04-25 吉林大学 Intelligent vehicle decision method for signal-free crossroad
CN119889073B (en) * 2025-03-27 2025-06-27 吉林大学 A method for intelligent vehicle decision making at an unsignalized intersection

Similar Documents

Publication Publication Date Title
CN112099496B (en) Automatic driving training method, device, equipment and medium
CN111696370B (en) Traffic light control method based on heuristic deep Q network
Liang et al. A deep reinforcement learning network for traffic light cycle control
CN111898211B (en) Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof
CN113223305B (en) Multi-intersection traffic light control method, system and storage medium based on reinforcement learning
CN114852105A (en) Method and system for planning track change of automatic driving vehicle
CN112550314B (en) Embedded optimization control method suitable for unmanned driving, its driving control module and automatic driving control system
CN113276852A (en) Unmanned lane keeping method based on maximum entropy reinforcement learning framework
Yuan et al. Prioritized experience replay-based deep q learning: Multiple-reward architecture for highway driving decision making
CN119169818B (en) Intersection entrance road mixed vehicle team collaborative guiding control method
CN117275228B (en) Urban road network traffic signal timing optimization control method
CN114463997A (en) A method and system for cooperative vehicle control at an intersection without signal lights
CN114074680A (en) Vehicle lane change behavior decision method and system based on deep reinforcement learning
CN117808652A (en) A bus scheduling method based on multi-agent path planning
CN118545093A (en) TD3 autonomous driving decision-making method based on multiple experience replay pools
CN118918720A (en) Multi-agent unmanned decision method and system for intersection scene without signal lamp
Huang Reinforcement learning based adaptive control method for traffic lights in intelligent transportation
CN114360290A (en) Method for selecting vehicle group lanes in front of intersection based on reinforcement learning
CN114613170B (en) Traffic signal lamp intersection coordination control method based on reinforcement learning
Luo et al. Researches on intelligent traffic signal control based on deep reinforcement learning
CN114937506A (en) Epidemic situation prevention and control-oriented bus transit reinforcement learning speed control method
CN120396986A (en) A decision-making planning method for autonomous vehicles based on deep reinforcement learning
CN119472677A (en) A TD3 map-free navigation method based on dynamic window guidance
CN116343516B (en) A method for intersection management based on intelligent connected vehicles
CN114627640B (en) Dynamic evolution method of intelligent network-connected automobile driving strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20241108