CN117908542A - Multi-agent path planning method and system based on logistics storage environment - Google Patents
- Publication number: CN117908542A
- Application number: CN202410063880.0A
- Authority: CN
- Country: China
- Prior art keywords: agent, information, strategy, path planning, visual field
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention belongs to the technical field of multi-agent path planning, and discloses a multi-agent path planning method and system based on a logistics storage environment. The method comprises the following steps: based on a plurality of agents in a logistics storage environment, acquiring the partial observation information of each agent at the current time step; and, based on the partial observation information, performing path planning with a trained collaborative optimization policy network model to obtain each agent's action at the current time step. The invention discloses, in particular, a multi-agent path planning scheme based on evolutionary reinforcement learning for the logistics storage environment, which can plan a path for each of many agents in a high-complexity environment and addresses the prior-art problem of low planning quality at large scale.
Description
Technical Field
The invention belongs to the technical field of multi-agent path planning, and particularly relates to a multi-agent path planning method and system based on a logistics storage environment.
Background
Multi-agent path finding (MAPF) is the problem of planning paths for multiple agents from different start positions to their respective target positions; the key constraint is that the agents reach their targets without colliding with one another, while both the speed and the quality of path planning are ensured.
In an intelligent warehouse logistics system, multi-agent path planning can effectively relieve the shortage of logistics labor and improve the efficiency of the collaborative optimization strategy. However, in a logistics storage environment, multi-agent path planning is often large-scale, with high environment complexity and a large number of agents; planning paths for multiple agents in a large-scale logistics storage environment is therefore a challenging task for realizing the efficient operation of an intelligent logistics system.
Current multi-agent path planning algorithms can be divided into two categories according to their planning mode: centralized planning methods and distributed execution methods.
The centralized planning method plans paths for all agents with a central controller, on the premise that the central controller holds the complete map information together with the start points, end points, and other information of all agents. In general, the currently mainstream algorithms optimize performance and applicability by introducing priority planning, large neighborhood search, and elaborate heuristic functions; representative algorithms include A*-based search algorithms, conflict-based search algorithms, increasing cost tree search algorithms, protocol-based algorithms, and so on. The centralized planning algorithm is the classical and commonly used MAPF approach and can solve most path planning problems. However, all of the above algorithms require the central planner to hold complete information in order to plan an optimal path; as environment complexity and the number of agents increase, the computational cost grows dramatically. Hence, as the technology advances, distributed execution methods are becoming more popular.
The distributed execution method is mainly based on reinforcement learning, on the premise that each agent only knows its own position and the positions of obstacles within its field of view (i.e., within a certain range). Each agent continuously interacts with the environment according to its current policy, obtains the next state and the action reward from the environment, and simultaneously computes and updates its policy; to maximize the cumulative reward, the agent gradually learns to reach its target position while cooperatively avoiding obstacles, thereby completing the multi-agent path planning task. Such algorithms scale to high-density environments and efficiently solve the real-time re-planning problem of multi-agent paths; representative algorithms include the expert-demonstration PRIMAL algorithm, the communication-improved DHC algorithm, the hierarchical-planning HELSA algorithm, and so on. However, existing distributed planning methods generally adopt a decentralized arrangement and train their policies on random environments; the quality of their multi-agent planning results in a warehouse environment is low and difficult to compare with traditional centralized planning methods.
Disclosure of Invention
The invention aims to provide a multi-agent path planning method and system based on a logistics storage environment, so as to solve one or more of the above technical problems. In particular, the technical scheme of the invention discloses a multi-agent path planning scheme based on evolutionary reinforcement learning for the logistics storage environment, which can plan a path for each of many agents in a high-complexity environment and addresses the prior-art problem of low planning quality at large scale.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The invention provides a multi-agent path planning method based on a logistics storage environment, which comprises the following steps:
Based on a plurality of intelligent agents in a logistics storage environment, acquiring partial observation information of each intelligent agent at the current moment;
Based on the partial observation information, performing path planning by using a trained collaborative optimization strategy network model to obtain the action of each agent at the current moment;
wherein,
In the step of acquiring the partial observation information of each agent at the current time step: for a selected agent, the partial observation information comprises in-view information and a local guide vector. The in-view information comprises a binary matrix of shelf positions within the field of view, a binary matrix of other agents' positions within the field of view, a binary matrix of intersection positions within the field of view, and a four-channel binary matrix of heuristic information. The four-channel heuristic information indicates, for each grid cell in the selected agent's field of view, whether moving one cell up, down, left, or right brings the agent closer to its goal, and is computed from a warehouse grid map derived from the logistics storage environment. The local guide vector consists of a unit direction vector pointing toward the goal and the Manhattan distance to the goal;
and when training the collaborative optimization policy network model, a combined training strategy of reinforcement learning and evolutionary learning is adopted.
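As a concrete illustration of the local guide vector described in the claim above, the following is a minimal sketch; the function name, the 2-D coordinate convention, and the use of NumPy are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def local_guide_vector(pos, goal):
    """Local guide vector as described in the claim: a unit direction
    vector pointing from the agent's position toward its goal, plus the
    Manhattan distance to the goal."""
    delta = np.asarray(goal, dtype=float) - np.asarray(pos, dtype=float)
    dist = float(np.abs(delta).sum())          # Manhattan distance to goal
    norm = np.linalg.norm(delta)               # Euclidean length for the unit vector
    unit = delta / norm if norm > 0 else np.zeros_like(delta)
    return np.array([unit[0], unit[1], dist])
```

For example, an agent at (0, 0) with goal (3, 4) gets the unit direction (0.6, 0.8) and Manhattan distance 7.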
The method of the invention is further improved in that the collaborative optimization policy network model comprises:
the observation encoder module, which takes an agent's partial observation information as input, extracts features separately from the in-view information and the local guide vector in the partial observation information, concatenates the extracted feature matrices, and outputs an encoded information vector;
the GRU network, which takes the hidden value information of the previous time step and the encoded information vector as input, combines the current state information with the historical information, and outputs the information used for communication;
the communication module, which takes the communication information output by the GRU network as input, communicates with other adjacent agents through a graph neural network using a self-attention mechanism, and outputs the hidden value information at the current time step; each agent communicates with at most the two nearest adjacent agents within its field of view at each time step;
and the Q network module, which takes the hidden value information output by the communication module as input and outputs the agent's action.
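The encoder, GRU, communication, and Q-network pipeline above can be sketched end to end with toy NumPy stand-ins. All layer sizes, weight initializations, and the single-query attention formulation are illustrative assumptions, not the patent's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
R, H = 9, 32        # field-of-view size and hidden width (assumed values)

def encode(fov, guide, W_f, W_g):
    """Observation encoder: separate features for the 7xRxR field-of-view
    tensor and the 3-d local guide vector, concatenated into one vector."""
    return np.concatenate([np.tanh(W_f @ fov.ravel()), np.tanh(W_g @ guide)])

def gru_cell(x, h, Wz, Wr, Wh):
    """Minimal GRU cell: combines the encoded observation x with the
    previous time step's hidden value information h."""
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = sig(Wz @ np.concatenate([x, h]))           # update gate
    r = sig(Wr @ np.concatenate([x, h]))           # reset gate
    cand = np.tanh(Wh @ np.concatenate([x, r * h]))
    return (1 - z) * h + z * cand

def communicate(own, neighbours):
    """Self-attention over the agent's own message and its (at most two)
    nearest neighbours' messages; returns updated hidden value information."""
    M = np.stack([own] + neighbours)
    w = np.exp(M @ own / np.sqrt(len(own)))        # attention scores from self
    return (w / w.sum()) @ M

# One decision step with random stand-in weights.
W_f = rng.normal(0, 0.1, (H, 7 * R * R))
W_g = rng.normal(0, 0.1, (H, 3))
Wz, Wr, Wh = (rng.normal(0, 0.1, (H, 3 * H)) for _ in range(3))
W_q = rng.normal(0, 0.1, (5, H))                   # Q head over the 5 actions

x = encode(rng.integers(0, 2, (7, R, R)), np.array([0.6, 0.8, 7.0]), W_f, W_g)
msg = gru_cell(x, np.zeros(H), Wz, Wr, Wh)         # initial hidden value is 0
h_new = communicate(msg, [np.zeros(H), np.zeros(H)])
action = int(np.argmax(W_q @ h_new))               # greedy action from the Q module
```

The point of the sketch is the data flow: encoded observation into the GRU, GRU output exchanged via attention, and the attended hidden value scored by the Q head.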
A further improvement of the method of the invention is that, in the collaborative optimization policy network model,
the hidden value information at the initial time step is 0; in addition, if an agent finds no adjacent agent within its field of view, no communication occurs and the information intended for communication is used directly as the hidden value information.
The method of the invention is further improved in that the training step of the collaborative optimization strategy network model comprises the following steps:
The D3QN algorithm is adopted for reinforcement learning training of the collaborative optimization policy network model. Using the Ape-X architecture, multiple actors generate experience in parallel with a greedy strategy according to the current collaborative optimization policy and store it together in a global experience pool; a learner holding an evaluation network Q(o_t, a_t; θ) and a target network Q(o_t, a_t; θ′) then samples experience from the global experience pool and updates the collaborative optimization policy by computing the n-step TD error, with the calculation expression
L(θ) = Huber([r_t + γ·r_{t+1} + … + γ^n·Q(o_{t+1}, argmax_a Q(o_{t+1}, a; θ); θ′)] − Q(o_t, a_t; θ));
where Huber(·) is the Huber loss function; r_t is the reward value obtained by the agent at time step t; γ ∈ [0, 1] is the discount factor trading off current rewards against future rewards; o_t is the agent's partial observation at time step t; o_{t+1} is the agent's partial observation at time step t+1; and a_t is the action performed by the agent at time step t;
the error is minimized by gradient descent, continuously optimizing the learner's evaluation network; finally, after a set number of training iterations, the actors' collaborative optimization policy and the learner's target network are updated.
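A minimal sketch of the n-step TD error with the Huber loss described above, written to mirror the loss L(θ): the evaluation network picks the greedy bootstrap action and the target network scores it. Single-sample (non-batched) quantities and the function names are assumptions.

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * (a - 0.5 * delta))

def n_step_td_loss(rewards, q_eval_next, q_target_next, q_eval_taken, gamma=0.99):
    """n-step TD error with a Double-DQN target: the evaluation network
    chooses argmax_a Q(o', a; theta), the target network evaluates it."""
    n = len(rewards)
    ret = sum(gamma ** k * r for k, r in enumerate(rewards))   # discounted n-step return
    a_star = int(np.argmax(q_eval_next))                       # greedy bootstrap action
    target = ret + gamma ** n * q_target_next[a_star]
    return float(huber(np.array(target - q_eval_taken)))
```

For example, with rewards [1, 1], γ = 0.5, bootstrap Q-values [0, 2, 1] (evaluation) and [0, 4, 0] (target), and a current estimate of 2.0, the target is 1.5 + 0.25·4 = 2.5 and the Huber loss of the 0.5 error is 0.125.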
While training the policy with the D3QN algorithm, multiple evolvers execute the following steps in parallel: first, randomly generate warehouse-environment path planning tasks for evaluation and perform policy evaluation; then, take the current policy obtained with the D3QN algorithm as the parent policy and complete policy evolution with an evolutionary algorithm to generate a dominant offspring policy; finally, modify the original collaborative optimization policy by soft update.
The method is further improved in that the step of performing reinforcement learning training on the collaborative optimization policy network model with the D3QN algorithm specifically comprises:
1) Generate a logistics storage environment map for the logistics storage environment, and design each agent's partial observation space, action space, and reward function; wherein,
the step of generating the logistics storage environment map comprises: designing a two-dimensional warehouse grid map according to the logistics storage environment, and setting the placement positions of the shelves in the map according to the length, width, and shelf density of the warehouse grid map, to obtain the logistics storage environment map;
the step of designing the partial observation space comprises: in the partially observable environment, each agent is assumed to observe only the environment within a field of view of size R × R, where R denotes the field-of-view scale; at time step t, the in-view information observed by agent A_i is represented by a 7 × R × R binary matrix f_t^i, consisting of a binary matrix of shelf positions within the field of view, a binary matrix of other agents' positions within the field of view, a binary matrix of intersection positions within the field of view, and a four-channel binary matrix of heuristic information; at time step t, agent A_i also obtains a local guide vector composed of a unit direction vector pointing toward the goal and the Manhattan distance to the goal;
the step of designing the action space comprises: in a 4-connected grid environment, agent A_i can take one of 5 discrete actions at time step t, namely moving one grid cell in one of the four basic directions (up, down, left, right) or remaining stationary; when agent A_i attempts an invalid action, its current action is replaced with remaining stationary;
the step of designing the reward function comprises: at time step t, after agent A_i takes action a_t^i, it obtains a reward r_t^i composed of the following terms:
where r_move is the movement penalty; r_stay is the stationary penalty; r_collide is the collision penalty; and r_reach is the cooperative goal-reaching reward;
2) Define the partial observation of agent A_i at time step t as o_t^i, and the hidden value information obtained after the observation encoder module and the communication module as h_t^i, which is used for action selection. The Q-network action-value function of agent A_i is composed of a value function V^i(h_t^i; p) and an advantage function A^i(h_t^i, a_t^i; q), denoted as
Q^i(h_t^i, a_t^i; θ, p, q) = V^i(h_t^i; p) + A^i(h_t^i, a_t^i; q) − (1/|𝒜|)·Σ_{a′∈𝒜} A^i(h_t^i, a′; q);
where θ is the parameter of the overall Q network, p and q are the network parameters of the value function and the advantage function respectively, and 𝒜 is the set of all possible actions that agent A_i may take at time step t;
an evaluation network Q(o_t, a_t; θ) and a target network Q(o_t, a_t; θ′) are set up in the algorithm, two Q networks that eliminate the maximization bias: the evaluation network determines the action at the next time step, and the target network determines the estimate of the return, expressed as
y_t = r_t + γ·Q(o_{t+1}, argmax_a Q(o_{t+1}, a; θ); θ′).
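The dueling decomposition above, with the mean-centred advantage that keeps V and A identifiable, can be illustrated in a few lines (the function name is an assumption):

```python
import numpy as np

def dueling_q(v, advantages):
    """Dueling aggregation: Q(o, a) = V(o) + A(o, a) - mean_a' A(o, a').
    Subtracting the mean advantage makes the V/A split identifiable."""
    a = np.asarray(advantages, dtype=float)
    return v + a - a.mean()
```

For example, V = 1 with advantages [1, 2, 3] yields Q-values [0, 1, 2]; the greedy action is unchanged by the mean subtraction.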
A further improvement of the method of the present invention is that, during the step of training the policy with the D3QN algorithm, multiple evolvers execute the following in parallel:
1) The step of performing policy evaluation comprises: if an agent successfully reaches its goal within the prescribed maximum number of steps t_max, it obtains an evaluation value of +e_reach; if an agent collides c times in a task, it obtains an evaluation value of −e_collide; letting the Manhattan distance from an agent's start point to its goal be x, and step be the number of steps at which the agent first reaches the goal (if the agent fails to reach the goal within the prescribed maximum number of steps t_max, step is taken to be t_max), the agent obtains a path-length evaluation value e_length computed from x and step.
Under the guidance of the collaborative optimization policy π, agent A_i performs task planning, and the total evaluation value it finally obtains is expressed as
e_i(π) = r_sign·e_reach − e_collide + e_length;
where r_sign is the success flag;
the evaluation value of the collaborative optimization policy π is
E[π] = (1/(n_env·|A|))·Σ_{env=1}^{n_env} Σ_{i∈A} e_i(π);
where n_env indicates that the collaborative optimization policy π is evaluated on multi-agent path planning tasks in n_env storage environments; A is the set of agents participating in one multi-agent path planning task; and |A| is the number of agents participating in one multi-agent path planning task;
2) The single-round evolution process of an evolver is as follows. First, randomly generate multi-agent path planning tasks in n_env storage environments for policy evaluation. Then, obtain the collaborative optimization policy π(θ_0), whose evaluation network parameter θ_0 results from reinforcement learning training, and evaluate it to obtain the initial parent policy evaluation value E[π(θ_0)]. Next, iteratively mutate the parent policy π(θ_0) up to a set maximum number of iterations: letting the parent policy parameter of generation y be θ_{y−1}, use random noise vectors ε_{y,z} ~ N(0, σ²) to generate N pseudo-offspring policy parameters, denoted θ_{y,z} = θ_{y−1} + ε_{y,z}, z = 1, 2, …, N. Evaluate each pseudo-offspring policy, obtaining evaluation values E_z = E[π(θ_{y,z})]. Compute the approximate gradient, expressed as
g_y = (1/(N·σ))·Σ_{z=1}^{N} E_z·ε_{y,z},
and obtain the mutated policy parameter
θ_y = θ_{y−1} + α·g_y;
where α is the learning rate of the evolutionary algorithm and σ is its mutation rate. Evaluate the mutated offspring policy π(θ_y): if its evaluation value E[π(θ_y)] exceeds the initial parent policy evaluation value E[π(θ_0)], perform policy evolution by soft update, end this evolver's evolution round, and begin a new round of policy evolution; if no dominant offspring policy is found by mutation within the maximum number of iterations, no soft update is performed and the evolver starts over.
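The mutation, evaluation, and update loop described in this step can be sketched as follows. The evaluation function, population size, and hyperparameter values are illustrative assumptions; the update follows the score-weighted-noise gradient estimate described above.

```python
import numpy as np

def evolve(theta, evaluate, alpha=0.1, sigma=0.05, n_children=8, seed=0):
    """One evolution step: perturb the parent policy parameters with
    Gaussian noise, evaluate each pseudo-offspring, and move the parent
    along the approximate gradient (1/(N*sigma)) * sum_z E_z * eps_z."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=(n_children, theta.size))
    scores = np.array([evaluate(theta + e) for e in eps])   # E_z for each child
    grad = (scores @ eps) / (n_children * sigma)            # approximate gradient
    return theta + alpha * grad

# Toy use with a linear score E(theta) = theta: the estimate reduces to
# sum_z eps_z**2 / (N*sigma), which is positive, so theta moves uphill.
theta1 = evolve(np.array([0.0]), evaluate=lambda t: float(t[0]))
```

In the patent's scheme the `evaluate` callback would run the warehouse path planning tasks and return E[π(θ)]; here it is a stand-in objective.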
A further improvement of the method of the invention is that, in the step of modifying the original collaborative optimization policy by soft update,
the dominant offspring policy is blended with the currently trained evaluation network and target network, expressed as
θ = τθ + (1 − τ)θ_i;
θ′ = τθ′ + (1 − τ)θ_i;
where θ is the parameter of the evaluation network; θ′ is the parameter of the target network; and θ_i is the parameter of the dominant offspring policy.
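The soft update above is a one-line blend of parameters; a minimal sketch (the function name and the value of τ are assumptions):

```python
def soft_update(theta, theta_child, tau=0.95):
    """Soft update: theta <- tau*theta + (1 - tau)*theta_child, blending the
    dominant offspring parameters into the current network parameters."""
    return tau * theta + (1.0 - tau) * theta_child
```

Applying the same rule with θ′ in place of θ gives the target-network update; a τ close to 1 keeps the trained networks stable while still absorbing the offspring's improvement.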
In a second aspect of the present invention, a multi-agent path planning system based on a logistics storage environment is provided, comprising:
the data acquisition module, which, for a plurality of agents in a logistics storage environment, acquires the partial observation information of each agent at the current time step;
the path planning module, which, based on the partial observation information, performs path planning with the trained collaborative optimization policy network model to obtain each agent's action at the current time step;
wherein,
in the step of acquiring the partial observation information of each agent at the current time step: for a selected agent, the partial observation information comprises in-view information and a local guide vector; the in-view information comprises a binary matrix of shelf positions within the field of view, a binary matrix of other agents' positions within the field of view, a binary matrix of intersection positions within the field of view, and a four-channel binary matrix of heuristic information; the four-channel heuristic information indicates, for each grid cell in the selected agent's field of view, whether moving one cell up, down, left, or right brings the agent closer to its goal, and is computed from a warehouse grid map derived from the logistics storage environment; the local guide vector consists of a unit direction vector pointing toward the goal and the Manhattan distance to the goal;
and when training the collaborative optimization policy network model, a combined training strategy of reinforcement learning and evolutionary learning is adopted.
In a third aspect of the present invention, there is provided an electronic apparatus comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor, and the instructions enable the at least one processor to perform the multi-agent path planning method based on a logistics storage environment according to any one of the first aspect of the present invention.
In a fourth aspect of the present invention, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the multi-agent path planning method based on a logistics storage environment according to any one of the first aspect of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
The method provided by the invention is a multi-agent path planning method based on evolutionary reinforcement learning: path planning is performed with a collaborative optimization policy network model based on the acquired information, and the collaborative optimization policy network model is constructed based on reinforcement learning. On this basis, through a quantitative policy evaluation mechanism, policy evaluation and self-evolution are completed with parallel evolutionary learning during reinforcement learning training, which accelerates policy training and yields higher-quality planning results. In summary, the technical scheme disclosed by the invention can realize path planning for each agent in a high-complexity, multi-agent environment; while improving the quality of the planned paths, it can effectively accelerate the training of the system's collaborative optimization policy and improve the operational efficiency of the warehouse logistics system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it will be apparent to those of ordinary skill in the art that the drawings in the following description show only some embodiments of the invention, and that other drawings may be derived from them without inventive effort.
FIG. 1 is a flow diagram of a multi-agent path planning method based on a logistics storage environment in an embodiment of the present invention;
FIG. 2 is a flow chart of a multi-agent path planning method based on a logistics storage environment in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a grid map of a logistics storage environment in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a collaborative optimization strategy in an embodiment of the invention;
FIG. 5 is a schematic diagram of an observation encoder module in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a communication module in an embodiment of the invention;
FIG. 7 is a schematic diagram of a collaborative optimization strategy evolution flow in an embodiment of the present invention;
FIG. 8 is a schematic diagram of a multi-agent path planning system based on a logistics storage environment according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the attached drawing figures:
Referring to FIG. 1, the multi-agent path planning method based on a logistics storage environment provided by the embodiment of the invention includes the following steps:
Step 1: based on a plurality of agents in a logistics storage environment, acquire the partial observation information of each agent at the current time step;
Step 2: based on the partial observation information obtained in Step 1, perform path planning with the trained collaborative optimization policy network model to obtain each agent's action at the current time step;
wherein,
for a selected agent, the partial observation information comprises in-view information and a local guide vector; the in-view information comprises a binary matrix of shelf positions within the field of view, a binary matrix of other agents' positions within the field of view, a binary matrix of intersection positions within the field of view, and a four-channel binary matrix of heuristic information; the four-channel heuristic information indicates, for each grid cell in the selected agent's field of view, whether moving one cell up, down, left, or right brings the agent closer to its goal; the grid cells are determined from a warehouse grid map derived from the logistics storage environment;
the local guide vector consists of a unit direction vector pointing toward the goal and the Manhattan distance to the goal;
and when training the collaborative optimization policy network model, a combined training strategy of reinforcement learning and evolutionary learning is adopted.
In the multi-agent path planning method provided by the embodiment of the invention, path planning is performed for all agents with the same trained collaborative optimization policy network model, based on each agent's acquired partial observation information. The collaborative optimization policy network model is constructed based on reinforcement learning; through a quantitative policy evaluation mechanism, policy evaluation and self-evolution are completed with parallel evolutionary learning during reinforcement learning training, which accelerates policy training and ultimately yields higher-quality planning results.
In a further preferred technical solution of the embodiment of the present invention, the collaborative optimization policy network model comprises:
the observation encoder module, which takes an agent's partial observation information as input, extracts features separately from the in-view information and the local guide vector in the partial observation information, concatenates the extracted feature matrices, and outputs an encoded information vector;
the GRU network, which takes as input the hidden value information of the previous time step and the encoded information vector output by the observation encoder module, combines the current state information with the historical information, and outputs the information used for communication; further illustratively, the hidden value information at the initial time step may be 0;
the communication module, which takes the communication information output by the GRU network as input, communicates with other adjacent agents through a graph neural network using a self-attention mechanism, and outputs the hidden value information at the current time step; further illustratively, each agent communicates with at most the two nearest adjacent agents within its field of view at each time step;
and the Q network module, which takes the hidden value information output by the communication module as input and outputs the agent's action at that time step.
Further, the training step of the collaborative optimization strategy network model comprises the following steps:
The D3QN algorithm is adopted for reinforcement learning training of the collaborative optimization policy network model. Using the Ape-X architecture, multiple actors generate experience in parallel with a greedy strategy according to the current collaborative optimization policy and store it together in a global experience pool; a learner holding an evaluation network Q(o_t, a_t; θ) and a target network Q(o_t, a_t; θ′) then samples experience from the global experience pool and updates the collaborative optimization policy by computing the n-step TD error, with the calculation expression
L(θ) = Huber([r_t + γ·r_{t+1} + … + γ^n·Q(o_{t+1}, argmax_a Q(o_{t+1}, a; θ); θ′)] − Q(o_t, a_t; θ));
where Huber(·) is the Huber loss function; r_t is the reward value obtained by the agent at time step t; γ ∈ [0, 1] is the discount factor trading off current rewards against future rewards; o_t is the agent's partial observation at time step t; o_{t+1} is the agent's partial observation at time step t+1; and a_t is the action performed by the agent at time step t.
The error is minimized by gradient descent, continuously optimizing the learner's evaluation network; finally, after a set number of training iterations, the actors' collaborative optimization policy and the learner's target network are updated.
While training the policy with the D3QN algorithm, multiple evolvers execute the following steps in parallel: first, randomly generate warehouse-environment path planning tasks for evaluation and perform policy evaluation; then, take the current policy obtained with the D3QN algorithm as the parent policy and complete policy evolution with an evolutionary algorithm to generate a dominant offspring policy; finally, modify the original collaborative optimization policy by soft update.
In a preferred technical scheme of the embodiment of the invention, a collaborative optimization strategy network model based on reinforcement learning is constructed, and the evaluation and evolution of the collaborative optimization strategy are realized by utilizing evolutionary learning, so that the quality of multi-agent path planning can be effectively improved, and the training speed of the collaborative optimization strategy can be accelerated.
Referring to fig. 2, a specific exemplary embodiment of the present invention provides a method for planning a multi-agent path in a large-scale storage environment based on evolutionary reinforcement learning, which includes the following steps:
Step S101, describing the warehouse multi-agent path planning problem environment under a partially observable Markov decision framework; specifically, a logistics storage environment map is generated by combining the characteristics of the logistics storage environment, and the partial observation space, action space and reward function of each agent A_i are designed; wherein,
1) The step of generating the logistics storage environment map comprises: according to the large-scale, structured and highly repetitive characteristics of the logistics storage environment, a two-dimensional storage grid map is designed; the length l and width w of the storage grid map and the shelf density d (0 < d < 0.6) are set, and the placement positions of the shelves in the map are set according to these three items of information. As shown in fig. 3, in the specific embodiment of the invention, for a medium-to-high-density map the shelves are densely arranged, with several shelves in one or two rows arranged in sequence as a group; for a medium-to-low-density map the shelves are arranged in sequence in groups of 4 or 1, finally forming the configurable structured storage grid map.
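As a hedged illustration, the map-generation step above can be sketched as follows; the grouping rule used here (two shelf rows per group, one-cell aisles, a free border) is a simplified assumption rather than the patent's exact layout algorithm:

```python
import numpy as np

def make_warehouse_map(length, width, density, group_rows=2):
    """Sketch of a structured warehouse grid map: 0 = free cell, 1 = shelf.

    Shelf rows are placed in repeating groups separated by one-cell aisles
    until the requested shelf density is reached; the border is kept free.
    """
    assert 0 < density < 0.6, "shelf density d must satisfy 0 < d < 0.6"
    grid = np.zeros((width, length), dtype=np.int8)
    target = int(density * length * width)   # number of shelf cells to place
    placed, row = 0, 1
    while placed < target and row < width - 1:
        for r in range(row, min(row + group_rows, width - 1)):
            for c in range(1, length - 1):
                if placed >= target:
                    break
                grid[r, c] = 1
                placed += 1
        row += group_rows + 1                # leave a one-cell aisle between groups
    return grid

grid = make_warehouse_map(length=20, width=20, density=0.3)
```

Varying `group_rows` gives layouts resembling the one-row vs two-row groupings described above.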
2) The step of designing the partial observation space comprises: in the partially observable environment, it is assumed that an agent can only observe the environment within a field of view (FOV) of size R × R, where R denotes the field-of-view scale and can be set freely. At time t, the field-of-view information observable by agent A_i is represented by a 7 × R × R binary matrix f_t^i, consisting of a binary matrix of shelf positions within the FOV, a binary matrix of other agents' positions within the FOV, a binary matrix of intersection positions within the FOV, and binary matrices of four-channel heuristic information; the four-channel heuristic information indicates, for each grid position of agent A_i within the FOV, whether moving one grid up, down, left or right brings it closer to the end point; illustratively, the element is 1 if closer and 0 if farther. At the same time, to further guide the path, agent A_i also obtains at time t a local guide vector composed of a unit direction vector pointing to the end point and the Manhattan distance to the end point. In summary, the partial observation of agent A_i at time t consists of two parts: the FOV information f_t^i and the local guide vector.
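A rough sketch of the observation described above follows; the channel ordering, the intersection channel left at zero, and all function names are assumptions made for illustration only:

```python
import numpy as np

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def build_observation(grid, positions, i, goals, R=5):
    """Illustrative 7xRxR observation for agent i. Assumed channel layout:
    0 shelves, 1 other agents, 2 intersections (left zero here for brevity),
    3-6 the four-direction 'closer to goal' heuristic. Also returns the
    local guide vector (unit direction to goal, Manhattan distance)."""
    half = R // 2
    x, y = positions[i]
    gx, gy = goals[i]
    fov = np.zeros((7, R, R), dtype=np.float32)
    h, w = grid.shape
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            r, c = x + dr, y + dc
            if not (0 <= r < h and 0 <= c < w):
                continue
            fov[0, dr + half, dc + half] = grid[r, c]          # shelf channel
            base = manhattan((r, c), (gx, gy))
            for ch, (mr, mc) in enumerate([(-1, 0), (1, 0), (0, -1), (0, 1)]):
                # 1 if stepping up/down/left/right from this cell moves
                # closer to agent i's goal, else 0
                fov[3 + ch, dr + half, dc + half] = float(
                    manhattan((r + mr, c + mc), (gx, gy)) < base)
    for j, (r, c) in enumerate(positions):
        if j != i and abs(r - x) <= half and abs(c - y) <= half:
            fov[1, r - x + half, c - y + half] = 1.0           # other agents
    diff = np.array([gx - x, gy - y], dtype=np.float32)
    dist = manhattan((x, y), (gx, gy))
    unit = diff / np.linalg.norm(diff) if dist > 0 else np.zeros(2, np.float32)
    guide = np.concatenate([unit, [dist]]).astype(np.float32)  # local guide vector
    return fov, guide

grid = np.zeros((10, 10), dtype=np.int8)
grid[4, 5] = 1
fov, guide = build_observation(grid, positions=[(4, 4), (4, 6)], i=0,
                               goals=[(4, 9), (0, 0)], R=5)
```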
3) The step of designing the action space comprises: in a 4-connected grid environment, agent A_i may take one of 5 discrete actions at time t, i.e. moving one cell in one of the four basic directions (up, down, left, right) or remaining stationary; further, to prevent erroneous behavior (agent A_i taking an action at time t that collides with a shelf or another agent, or moves out of bounds), the current action of agent A_i is replaced with remaining stationary whenever it performs an erroneous action.
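A minimal sketch of this invalid-action replacement rule, assuming the (hypothetical) action encoding 0 = stay, 1 = up, 2 = down, 3 = left, 4 = right:

```python
def resolve_action(grid, positions, i, action):
    """Replace an invalid move (into a shelf, another agent, or out of
    bounds) with 'stay' (action 0), as described above. Returns the
    resolved action and the resulting position of agent i."""
    moves = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # stay/up/down/left/right
    r, c = positions[i]
    dr, dc = moves[action]
    nr, nc = r + dr, c + dc
    out_of_bounds = not (0 <= nr < len(grid) and 0 <= nc < len(grid[0]))
    blocked = (not out_of_bounds) and grid[nr][nc] == 1
    occupied = any(j != i and positions[j] == (nr, nc)
                   for j in range(len(positions)))
    if out_of_bounds or blocked or occupied:
        return 0, (r, c)   # erroneous action replaced with remaining stationary
    return action, (nr, nc)

grid = [[0, 0, 0],
        [0, 1, 0],   # shelf in the centre
        [0, 0, 0]]
```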
4) The step of designing the reward function comprises: in order to achieve high-quality paths and cooperation among the agents, the reward function consists of a movement penalty, a stationary penalty, a collision penalty and a cooperation achievement reward. Illustratively, before reaching the end point, an agent receives a small movement penalty r_move after each movement, and a small stationary penalty r_stay for remaining stationary; if the agent collides (with a shelf or another agent) or moves out of bounds after an action, it receives a larger collision penalty r_collide; only when all agents have reached their respective end points do all agents receive a cooperation achievement reward r_reach;
Thus, at time t, the reward r_t^i obtained by agent A_i after taking its action is,
r_t^i = r_move + r_stay + r_collide + r_reach;
Specifically and optionally, the movement penalty r_move = −0.075, the stationary penalty r_stay = −0.075, the collision penalty r_collide = −0.5, and the cooperation achievement reward r_reach = 3.
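With those example values, the per-step reward can be computed as in this sketch; the `moved`/`collided`/`all_reached` flags are hypothetical inputs standing in for the environment's state:

```python
def step_reward(moved, collided, all_reached,
                r_move=-0.075, r_stay=-0.075, r_collide=-0.5, r_reach=3.0):
    """Per-step reward for one agent, using the example values above:
    a small penalty for moving or staying, a larger penalty on collision,
    and the cooperation reward granted only when all agents have arrived."""
    reward = r_move if moved else r_stay
    if collided:
        reward += r_collide
    if all_reached:
        reward += r_reach
    return reward
```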
Step S102, constructing a collaborative optimization strategy network model based on reinforcement learning; wherein,
Referring to fig. 4, in the embodiment of the present invention, the overall construction of the reinforcement-learning-based collaborative optimization strategy is as follows: partial observations are encoded to extract feature information, communication learning between multiple agents is then performed through a graph neural network, and finally action selection is completed with the D3QN algorithm according to the communicated information; the difficulty of the training environment is continuously increased through curriculum learning, so that the strategy is trained step by step, which ultimately improves the strategy's efficiency in the logistics storage environment. The important modules of the collaborative optimization strategy are described in turn below: the observation encoder module, the communication module and the Q network module.
Referring to fig. 5, in order to extract and encode the features of the observations, in an embodiment of the present invention the observation encoder module extracts the features of the local guide vector using a multi-layer perceptron network, extracts the features of the FOV information f_t^i using a ViT (Vision Transformer) network with a multi-head attention mechanism, and concatenates the extracted features to obtain the encoded information vector.
Further illustratively, for given feature information X, its multi-head attention encoding is,
MSA = [SA_1(X); SA_2(X); …; SA_l(X)] W^O;
where each attention head is SA_i(X) = softmax(Q_i(X) K_i(X)^T / √d) V_i(X);
Wherein softmax(·) is the exponential normalization function; d is the dimension of the feature information X; l is the number of heads of the multi-head attention encoding; W^O is a weight matrix; Q_i(X), K_i(X), V_i(X) are the query, key and value vectors corresponding to the feature information X, computed respectively as Q_i(X) = X W_i^Q, K_i(X) = X W_i^K, V_i(X) = X W_i^V, where W_i^Q, W_i^K, W_i^V are the weight matrices corresponding to the query, key and value;
Finally, the two pieces of extracted information are combined to obtain intermediate information usable for communication; that is, for agent A_i at time t, the intermediate information can be obtained through the observation encoder module.
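The multi-head self-attention encoding above can be illustrated with a minimal numpy sketch; the random matrices stand in for learned weights, and the token count, head count and dimensions are arbitrary:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, heads, rng):
    """Minimal sketch of the MSA formula above: l heads of scaled
    dot-product self-attention, concatenated and projected by W_O.
    Scaling uses sqrt(d), the dimension of X, as in the text."""
    n, d = X.shape
    dh = d // heads
    outs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.standard_normal((d, dh)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d))      # attention weights per head
        outs.append(A @ V)                     # SA_i(X)
    WO = rng.standard_normal((heads * dh, d))
    return np.concatenate(outs, axis=1) @ WO   # [SA_1;...;SA_l] W_O

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # 5 tokens of dimension 8
out = multi_head_self_attention(X, heads=2, rng=rng)
```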
Referring to fig. 4, in the embodiment of the present invention, the intermediate information obtained by the observation encoder module is passed through a gated recurrent unit (GRU) network, which combines it with the hidden value of the previous moment and outputs the information used for communication.
In the embodiment of the invention, a graph structure is constructed at each moment, wherein each agent is regarded as a node and is connected only with a limited number of adjacent nodes; communication is then carried out using the self-attention mechanism.
Referring to fig. 6, at time t, agent A_i communicates with at most the two nearest agents A_j and A_k within its FOV. That is, from the communication information m_t^i of each agent, the query, key and value vectors are computed via the weight matrices W_q, W_k, W_v, i.e.
q_t^i = W_q m_t^i, k_t^i = W_k m_t^i, v_t^i = W_v m_t^i;
Then, the degree of similarity between the query q_t^i and each neighbouring key k_t^j is computed,
α_{ij} = softmax_j(q_t^i · k_t^j / √d_K);
Where d_K is the dimension of the key vector.
Finally, through weighted summation, the hidden value information h_t^i of agent A_i at time t is output, namely,
h_t^i = W_o Σ_j α_{ij} v_t^j;
Wherein W_o is a weight matrix.
Further illustratively, if an agent does not find any adjacent agent within its field of view, no communication occurs, and its communication information is used directly as its hidden value information.
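The communication step can be sketched as follows; attending over the agent's own message alongside its two nearest neighbours, and the use of random placeholder weights, are assumptions of this illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def communicate(messages, positions, fov_radius, Wq, Wk, Wv, Wo):
    """Sketch of the graph communication above: each agent attends over
    its (at most) two nearest neighbours within the FOV; an agent with no
    neighbour keeps its own message as its hidden value."""
    n = len(messages)
    hidden = []
    for i in range(n):
        # Manhattan-distance neighbours inside the field of view
        dists = [(abs(positions[i][0] - positions[j][0]) +
                  abs(positions[i][1] - positions[j][1]), j)
                 for j in range(n) if j != i]
        nbrs = [j for dist, j in sorted(dists) if dist <= fov_radius][:2]
        if not nbrs:
            hidden.append(messages[i])     # no communication occurs
            continue
        q = Wq @ messages[i]
        group = [i] + nbrs                 # attend over self and neighbours
        ks = np.stack([Wk @ messages[j] for j in group])
        vs = np.stack([Wv @ messages[j] for j in group])
        alpha = softmax(ks @ q / np.sqrt(len(q)))   # similarity weights
        hidden.append(Wo @ (alpha @ vs))            # weighted summation
    return np.stack(hidden)

rng = np.random.default_rng(1)
d = 4
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) for _ in range(4))
messages = rng.standard_normal((3, d))
positions = [(0, 0), (0, 1), (50, 50)]   # third agent is isolated
h = communicate(messages, positions, fov_radius=4, Wq=Wq, Wk=Wk, Wv=Wv, Wo=Wo)
```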
Because in the multi-agent path planning problem each agent has a different target, an independent learning algorithm is more suitable than a centralized multi-agent reinforcement learning algorithm; that is, all agents share one action-selecting Q network. In the embodiment of the invention, the D3QN algorithm is selected to learn the collaborative optimization strategy.
In a specific example of the embodiment of the invention, the partial observation of agent A_i at time t is defined as o_t^i, and the hidden value information h_t^i obtained after the encoder module and the communication module is used for action selection. Let the Q network (action value function) of agent A_i be Q^i(o_t^i, a_t^i; θ, p, q); it is composed of a value function V(o_t^i; θ, p) and an advantage function A(o_t^i, a_t^i; θ, q), denoted as,
Q^i(o_t^i, a_t^i; θ, p, q) = V(o_t^i; θ, p) + A(o_t^i, a_t^i; θ, q) − (1/|𝒜|) Σ_{a′∈𝒜} A(o_t^i, a′; θ, q);
Wherein θ is the parameter of the overall Q network, and p and q are the network parameters of the value function and the advantage function, respectively; 𝒜 is the set of all possible actions that agent A_i may take at time t.
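The dueling decomposition above corresponds to the following minimal sketch, with random placeholder linear heads standing in for the learned value and advantage streams:

```python
import numpy as np

def dueling_q(hidden, Wv, Wa):
    """Dueling sketch: Q(o,a) = V(o) + A(o,a) - mean_a' A(o,a').
    Wv and Wa are placeholder linear heads for the value and
    advantage streams; `hidden` is the agent's hidden value vector."""
    V = Wv @ hidden                      # scalar state value, shape (1,)
    A = Wa @ hidden                      # one advantage per action
    return V + A - A.mean()              # identifiable Q values

rng = np.random.default_rng(2)
hidden = rng.standard_normal(8)
Wv = rng.standard_normal((1, 8))
Wa = rng.standard_normal((5, 8))         # 5 discrete actions
q = dueling_q(hidden, Wv, Wa)
```

Subtracting the mean advantage makes the decomposition unique: the mean of the resulting Q values equals the state value V.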
In order to eliminate maximization bias, two Q networks are set up in the algorithm: an evaluation network Q(o_t, a_t; θ) and a target network Q(o_t, a_t; θ′), wherein the evaluation network determines the action of the next moment and the target network determines the estimate of the return, i.e.
y_t = r_t + γ Q(o_{t+1}, argmax_a Q(o_{t+1}, a; θ); θ′);
Wherein γ ∈ [0,1] is a discount factor trading off current rewards against future rewards.
To accelerate the reinforcement learning training of the D3QN algorithm, the invention utilizes the Ape-X architecture, in which 20 participants generate experiences in parallel using a greedy strategy according to the current collaborative optimization strategy and store them together in a global experience pool; a learner (possessing an evaluation network and a target network) then selects experiences from the global experience pool to update the collaborative optimization strategy.
The updating mode is to calculate n steps of TD errors,
L(θ) = Huber([r_t + γr_{t+1} + … + γ^n Q(o_{t+1}, argmax_a Q(o_{t+1}, a; θ); θ′)] − Q(o_t, a_t; θ));
Wherein Huber(·) is the Huber loss function; r_t is the reward obtained by the agent at time t; γ ∈ [0,1] is a discount factor trading off current rewards against future rewards; o_t is the partial observation of the agent at time t; o_{t+1} is the partial observation of the agent at time t+1; a_t is the action performed by the agent at time t.
The error is minimized by gradient descent, continuously optimizing the learner's evaluation network. Finally, after a certain number of training steps, the participants' collaborative optimization strategy and the learner's target network are updated.
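A small numeric sketch of the n-step TD target and Huber loss described above; the reward sequence and Q values are made-up inputs, and the evaluation network selects the bootstrap action while the target network scores it:

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(x)
    return 0.5 * x * x if a <= delta else delta * (a - 0.5 * delta)

def n_step_td_loss(rewards, q_eval_next, q_targ_next, q_eval_now, gamma=0.99):
    """Sketch of the n-step TD loss above: discounted reward sum plus a
    double-Q bootstrap (argmax from the evaluation network, value from
    the target network), compared against the current Q estimate."""
    n = len(rewards)
    ret = sum(gamma ** k * rewards[k] for k in range(n))
    a_star = int(np.argmax(q_eval_next))           # argmax_a Q(o'; θ)
    target = ret + gamma ** n * q_targ_next[a_star]
    return huber(target - q_eval_now)

loss = n_step_td_loss(
    rewards=[-0.075, -0.075, 2.925],               # example reward values
    q_eval_next=np.array([0.1, 0.4, 0.2, 0.0, 0.3]),
    q_targ_next=np.array([0.0, 0.5, 0.1, 0.2, 0.2]),
    q_eval_now=1.0,
)
```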
Step S103, evaluating and evolving the collaborative optimization strategy using evolutionary learning.
In the embodiment of the invention, a quantitative strategy evaluation mechanism in the logistics storage environment is established using practical indexes, and during the reinforcement learning training process, strategy evaluation and self-evolution are completed through parallel evolutionary learning, thereby accelerating strategy training and obtaining better planning results. The specific flow is as follows: while the strategy is trained with the D3QN algorithm, multiple evolvers complete the following steps in parallel: first, randomly generating warehouse-environment path planning tasks for evaluation, and performing strategy evaluation; then, taking the current strategy obtained with the D3QN algorithm as the parent strategy, and completing strategy evolution with an evolutionary algorithm to generate a dominant offspring strategy; finally, changing the original collaborative optimization strategy by way of soft update.
Further, for policy evaluation and policy evolution, the specific process includes:
(1) Policy evaluation: for the multi-agent path planning problem, the evaluation indexes include success rate, collision rate, average number of steps and the like; for a single multi-agent path planning task, the quality of the path planning is evaluated against these indexes.
In the embodiment of the invention, if an agent successfully reaches its end point within the specified maximum number of steps t_max, it obtains an evaluation value of +e_reach (e_reach = 2); if an agent collides c times in a task (with a shelf or another agent), it obtains an evaluation value of −e_collide, with e_collide = 0.5c; meanwhile, in order to further evaluate the path length, assume that the Manhattan distance from an agent's start point to its end point is x, and the number of steps for the agent to first reach the end point is step; if an agent fails to reach the end point within the specified maximum number of steps t_max, its step count step is taken as the maximum number of steps t_max; from x and step, the agent obtains a path-length evaluation value e_length.
For example, for a warehouse map in which a set A of agents performs path planning, one agent A_i performs task planning under the guidance of the collaborative optimization strategy π, and finally a total evaluation value is obtained, expressed as,
e_i(π) = r_sign · e_reach − e_collide + e_length;
Wherein r_sign is the success flag, i.e. 1 if the agent successfully reaches its end point within the specified maximum number of steps t_max, and 0 otherwise.
The evaluation value of the collaborative optimization strategy π is,
E[π] = (1/n_env) Σ_{env=1}^{n_env} (1/|A|) Σ_{A_i∈A} e_i(π);
Wherein n_env indicates that the collaborative optimization strategy π is evaluated on multi-agent path planning tasks in n_env storage environments; A is the set of agents participating in a single multi-agent path planning task; |A| is the number of agents participating in a single multi-agent path planning task.
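The per-agent evaluation can be sketched as below; since the exact form of e_length is not reproduced in the text, the ratio x/step (1.0 for a shortest path) is used purely as an illustrative stand-in:

```python
def episode_evaluation(reached, collisions, x, step, t_max,
                       e_reach=2.0, c_weight=0.5):
    """Illustrative per-agent evaluation value. e_reach and the collision
    weight follow the example values above; the path-length term x/step
    is an assumed stand-in for the unspecified e_length formula."""
    r_sign = 1 if reached else 0
    step = step if reached else t_max    # failed agents count t_max steps
    e_collide = c_weight * collisions
    e_length = x / step if step > 0 else 0.0
    return r_sign * e_reach - e_collide + e_length
```

The strategy-level evaluation E[π] would then average these values over all agents and over the n_env sampled environments.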
(2) Policy evolution: as shown in fig. 7, in the embodiment of the present invention there are 5 evolvers running in parallel for uninterrupted policy evolution. One evolution round of an evolver proceeds as follows: first, to ensure fair evaluation, multi-agent path planning tasks in n_env storage environments are randomly generated for evaluating strategies; then, the collaborative optimization strategy π(θ_0) with evaluation network parameters θ_0 obtained from reinforcement learning training is taken, and the strategy is evaluated to obtain the initial parent strategy evaluation value E[π(θ_0)]; then, at most 10 generations of iterative mutation are performed on the original parent strategy π(θ_0); illustratively, for a collaborative optimization strategy π(θ), a change in the strategy is a change in the network parameters θ; therefore, assuming the parent strategy parameters of the y-th generation are θ_{y-1}, N pseudo-offspring strategy parameters can be generated with random noise vectors ε_{y,z} ~ N(0, σ²) (z = 1, 2, …, N),
θ_{y,z} = θ_{y-1} + ε_{y,z} (z = 1, 2, …, N);
Evaluating each pseudo-offspring strategy yields an evaluation value, expressed as E_z = E[π(θ_{y,z})];
The approximate gradient is then calculated, expressed as,
g_y ≈ (1/(Nσ)) Σ_{z=1}^{N} E_z ε_{y,z};
Thus, the mutated strategy parameters can be obtained, θ_y = θ_{y-1} + α g_y; wherein α is the learning rate of the evolutionary algorithm, and σ is the mutation rate of the evolutionary algorithm;
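One mutation generation can be sketched in the evolution-strategies style implied by the formulas above; the toy evaluation function, child count, learning rate and mutation rate are arbitrary choices for illustration:

```python
import numpy as np

def evolve_step(theta, evaluate, n_children=8, sigma=0.1, alpha=0.05, rng=None):
    """One generation of the evolution step above: perturb the parent
    parameters with Gaussian noise, evaluate each pseudo-offspring, and
    move along the estimated gradient g = (1/(N*sigma)) * sum(E_z * eps_z)."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((n_children, theta.size))
    scores = np.array([evaluate(theta + sigma * e) for e in eps])
    grad = (eps * scores[:, None]).sum(axis=0) / (n_children * sigma)
    return theta + alpha * grad

# toy evaluation function: higher is better, optimum at theta = [1, 1]
evaluate = lambda th: -float(np.sum((th - 1.0) ** 2))
theta_next = evolve_step(np.zeros(2), evaluate, rng=np.random.default_rng(0))
```

In the scheme above, the resulting offspring would then be re-evaluated against the parent before any soft update of the trained networks.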
The mutated offspring strategy π(θ_y) is then evaluated; if its evaluation value E[π(θ_y)] is larger than the initial parent strategy evaluation value E[π(θ_0)], the strategy is to some extent better than the original strategy.
The evolution of the strategy can then be performed using soft update, i.e. the dominant offspring strategy is blended into the currently trained evaluation network and target network,
θ ← (1 − τ)θ + τθ_i, θ′ ← (1 − τ)θ′ + τθ_i;
Wherein θ is the parameter of the evaluation network; θ′ is the parameter of the target network; θ_i is the parameter of the dominant offspring strategy; and τ denotes a small soft-update coefficient.
At the same time, the evolver's current evolution round ends, and a new round of strategy evolution begins. If no dominant offspring strategy is found by mutation within 10 generations, no soft update occurs, and the evolution process restarts as described above.
In summary, the embodiment of the invention describes the warehouse multi-agent path planning problem environment under a partially observable Markov decision framework, constructs and learns the collaborative optimization strategy using the attention mechanism, the graph neural network and the D3QN algorithm, and builds strategy evaluation and strategy evolution methods on an evolutionary reinforcement learning framework. This speeds up strategy training and improves the multi-agent path planning indexes (success rate, collision rate and average number of steps). More specifically, owing to the complexity of the large-scale path planning problem, existing traditional methods suffer from huge computational cost and repeated re-planning, making them hard to solve and low in planning quality. The invention is based on a reinforcement learning method and, by constructing a decentralized collaborative optimization strategy, outputs actions in real time according to the local observations at the current moment. The method can be extended to high-density storage environments, effectively solves the problem of real-time re-planning of multi-agent paths, and obtains better planning results. As a specific example, the multi-agent path planning simulation results are shown in table 1: under the same high-density storage environment test set, with the maximum number of steps set to 256, the DHC method (a classical multi-agent path planning method) and the method of the present invention are compared in average number of steps and success rate under the same number of training rounds. The average number of steps refers to the average number of steps required for all agents in a test to successfully reach their end points; the success rate refers to the ratio of the number of test cases in which all agents successfully reach their end points to the total number of test cases.
Compared with the DHC method, the technical scheme of the embodiment of the invention reduces the average number of steps and improves the task success rate, demonstrating that the scheme of the invention can effectively improve strategy performance and the quality of multi-agent path planning in the warehouse environment.
TABLE 1 comparison Table of simulation results for path planning of multiple agents
The following are device embodiments of the present invention that may be used to perform method embodiments of the present invention. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present invention.
Referring to fig. 8, in still another embodiment of the present invention, a multi-agent path planning system based on a logistics storage environment is provided, including:
The data acquisition module is used for acquiring partial observation information of each intelligent agent at the current moment based on a plurality of intelligent agents in the logistics storage environment;
the path planning module is used for carrying out path planning by utilizing the trained collaborative optimization strategy network model based on the part of observation information to obtain the action of each intelligent agent at the current moment;
wherein,
In the step of obtaining the partial observation information of each agent at the current moment, for a selected agent, the partial observation information comprises information in the visual field and a local guide vector; the information in the visual field comprises a binary matrix of shelf positions in the visual field, a binary matrix of other agents' positions in the visual field, a binary matrix of intersection positions in the visual field and binary matrices of four-channel heuristic information; the four-channel heuristic information indicates, for each grid position of the selected agent in the visual field, whether moving one grid up, down, left or right approaches the end point, and is acquired according to a storage grid map obtained based on the logistics storage environment; the local guide vector consists of a unit direction vector pointing to the end point and the Manhattan distance to the end point;
And when the trained collaborative optimization strategy network model is trained, training strategies of reinforcement learning and evolution learning are adopted.
In yet another embodiment of the present invention, a computer device is provided that includes a processor and a memory for storing a computer program including program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, etc., which is the computing and control core of the terminal, adapted to implement one or more instructions, in particular to load and execute one or more instructions in a computer storage medium to implement the corresponding method flow or corresponding function; the processor provided by the embodiment of the invention can be used for the operation of the multi-agent path planning method.
In yet another embodiment of the present invention, a storage medium, specifically a computer readable storage medium (Memory), is a Memory device in a computer device, for storing a program and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the multi-agent path planning method described in the above embodiments with respect to a logistics storage environment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.
Claims (10)
1. The multi-agent path planning method based on the logistics storage environment is characterized by comprising the following steps of:
Based on a plurality of intelligent agents in a logistics storage environment, acquiring partial observation information of each intelligent agent at the current moment;
Based on the partial observation information, performing path planning by using a trained collaborative optimization strategy network model to obtain the action of each agent at the current moment;
wherein,
In the step of obtaining the partial observation information of each agent at the current moment, for a selected agent, the partial observation information comprises information in the visual field and a local guide vector; the information in the visual field comprises a binary matrix of shelf positions in the visual field, a binary matrix of other agents' positions in the visual field, a binary matrix of intersection positions in the visual field and binary matrices of four-channel heuristic information; the four-channel heuristic information indicates, for each grid position of the selected agent in the visual field, whether moving one grid up, down, left or right approaches the end point, and is acquired according to a storage grid map obtained based on the logistics storage environment; the local guide vector consists of a unit direction vector pointing to the end point and the Manhattan distance to the end point;
And when the trained collaborative optimization strategy network model is trained, training strategies of reinforcement learning and evolution learning are adopted.
2. The multi-agent path planning method of claim 1, wherein the collaborative optimization strategy network model comprises:
the observation value encoder module is used for inputting the partial observation information of the agent, respectively extracting features of the in-visual-field information and the local guide vector in the partial observation information, concatenating the extracted features, and outputting an encoded information vector;
The GRU network is used for inputting the hidden value information and the coded information vector at the previous moment, and outputting information for communication after combining the current state information and the historical information;
The communication module is used for inputting the information for communication output by the GRU network, communicating with other adjacent agents through the graph neural network using a self-attention mechanism, and outputting the hidden value information of the current moment; the agent communicates with at most the two nearest adjacent agents in the visual field at each moment;
and the Q network module is used for inputting the hidden value information output by the communication module and outputting the action of the intelligent agent.
3. The multi-agent path planning method of claim 2 wherein, in the collaborative optimization strategy network model,
The hidden value information at the initial moment is 0; in addition, if an agent does not find an adjacent agent in its visual field, no communication occurs, and the information used for communication is taken as the hidden value information.
4. The multi-agent path planning method of claim 2, wherein the training step of the collaborative optimization strategy network model comprises:
Adopting a D3QN algorithm to perform reinforcement learning training on the collaborative optimization strategy network model; the method comprises the steps that an Ape-X architecture is utilized, a plurality of participants adopt greedy strategies to generate experiences in parallel according to the current collaborative optimization strategy, and the experiences are stored together in a global experience pool; then a learner with an evaluation network Q(o_t, a_t; θ) and a target network Q(o_t, a_t; θ′) selects experiences from the global experience pool and updates the collaborative optimization strategy by calculating the n-step TD error, whose expression is,
L(θ) = Huber([r_t + γr_{t+1} + … + γ^n Q(o_{t+1}, argmax_a Q(o_{t+1}, a; θ); θ′)] − Q(o_t, a_t; θ));
Wherein Huber(·) is the Huber loss function; r_t is the reward obtained by the agent at time t; γ ∈ [0,1] is a discount factor trading off current rewards and future rewards; o_t is the partial observation of the agent at time t; o_{t+1} is the partial observation of the agent at time t+1; a_t is the action performed by the agent at time t;
solving the error by using a gradient descent method to minimize the error and continuously optimizing an evaluation network of a learner; finally, updating the cooperative optimization strategy of the participants and the target network of the learner after a certain training times;
In training the strategy using the D3QN algorithm, the steps performed by a plurality of evolvers in parallel include: first, randomly generating warehouse-environment path planning tasks for evaluation, and performing policy evaluation; then, taking the current strategy obtained with the D3QN algorithm as the parent strategy, and completing strategy evolution with an evolutionary algorithm to generate a dominant offspring strategy; finally, changing the original collaborative optimization strategy by way of soft update.
5. The multi-agent path planning method according to claim 4, wherein the step of performing reinforcement learning training on the collaborative optimization strategy network model by using the D3QN algorithm specifically comprises:
1) Generating a logistics storage environment map from the logistics storage environment, and designing the partial observation space, action space and reward function of each agent; wherein,
The step of generating the logistics storage environment map comprises: designing a two-dimensional storage grid map according to the logistics storage environment, and setting the placement positions of shelves in the map according to the length, width and shelf density of the storage grid map, to obtain the logistics storage environment map;
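The map-generation step above admits a minimal sketch; the uniform random shelf placement and the free border cells are assumptions, since the patent specifies only length, width and shelf density as inputs.

```python
import random

def generate_warehouse_map(height, width, shelf_density, seed=0):
    """Build a 2-D warehouse grid: 0 = free cell, 1 = shelf.
    Shelves are placed on interior cells only, so the border stays traversable."""
    rng = random.Random(seed)
    grid = [[0] * width for _ in range(height)]
    interior = [(r, c) for r in range(1, height - 1) for c in range(1, width - 1)]
    n_shelves = int(len(interior) * shelf_density)
    for r, c in rng.sample(interior, n_shelves):
        grid[r][c] = 1
    return grid
```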
The step of designing the partial observation space includes: in the partially observable environment, each agent is assumed to observe only the environment within a field of view of size R×R, where R denotes the field-of-view scale; at time t, the field-of-view information observed by agent A_i is represented by a 7×R×R binary tensor f_t^i, consisting of a binary matrix of shelf positions in the field of view, a binary matrix of other agents' positions in the field of view, a binary matrix of intersection positions in the field of view, and a binary matrix of four-channel heuristic information; at time t, agent A_i also obtains a unit direction vector pointing to its endpoint and the Manhattan distance to the endpoint, which together form a local guide vector;
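A sketch of assembling the 7×R×R observation and the local guide vector described above; the channel ordering, the intersection test (a free cell with at least three free neighbours) and the L2-normalised direction vector are assumptions not fixed by the patent.

```python
import numpy as np

def build_observation(grid, positions, goals, agent_id, R):
    """Stack the 7 x R x R local observation of one agent:
    channel 0: shelves in view; channel 1: other agents in view;
    channel 2: intersections (free cells with >= 3 free neighbours, an assumption);
    channels 3-6: heuristic bits -- does moving up/down/left/right shrink
    the Manhattan distance to this agent's goal?
    Also returns the local guide vector (unit direction to goal, Manhattan distance)."""
    H, W = len(grid), len(grid[0])
    r0, c0 = positions[agent_id]
    gr, gc = goals[agent_id]
    obs = np.zeros((7, R, R), dtype=np.int8)
    half = R // 2
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
    for i in range(R):
        for j in range(R):
            r, c = r0 - half + i, c0 - half + j
            if not (0 <= r < H and 0 <= c < W):
                continue                                 # outside the map
            obs[0, i, j] = grid[r][c]
            if grid[r][c] == 0:
                free_nb = sum(0 <= r + dr < H and 0 <= c + dc < W
                              and grid[r + dr][c + dc] == 0 for dr, dc in moves)
                obs[2, i, j] = int(free_nb >= 3)
            for k, (dr, dc) in enumerate(moves):
                closer = abs(r + dr - gr) + abs(c + dc - gc) < abs(r - gr) + abs(c - gc)
                obs[3 + k, i, j] = int(closer)
    for aid, (r, c) in enumerate(positions):
        if aid != agent_id and abs(r - r0) <= half and abs(c - c0) <= half:
            obs[1, r - r0 + half, c - c0 + half] = 1
    dist = abs(gr - r0) + abs(gc - c0)
    direction = np.array([gr - r0, gc - c0], dtype=float)
    unit = direction / np.linalg.norm(direction) if dist else direction
    return obs, np.concatenate([unit, [dist]])
```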
The step of designing the action space includes: in a 4-connected grid environment, agent A_i can take one of 5 discrete actions a_t^i at time t: moving one grid cell in one of the four basic directions (up, down, left and right) or remaining stationary; when agent A_i performs an invalid action, the agent's current action is replaced with remaining stationary;
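The action-space rule above, including the replacement of invalid actions with staying put, can be sketched as follows; the integer action encoding is an assumption for illustration.

```python
def apply_action(grid, pos, action):
    """5 discrete actions: 0 = stay, 1..4 = up/down/left/right one cell.
    An invalid move (off-grid or into a shelf) is replaced by staying put."""
    deltas = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}
    dr, dc = deltas[action]
    r, c = pos[0] + dr, pos[1] + dc
    H, W = len(grid), len(grid[0])
    if 0 <= r < H and 0 <= c < W and grid[r][c] == 0:
        return (r, c)
    return pos    # invalid action: remain stationary
```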
The step of designing the reward function includes: at time t, the reward r_t^i obtained by agent A_i for taking action a_t^i is,
r_t^i = r_move + r_stay + r_collide + r_reach;
Where r_move is the movement penalty; r_stay is the stationary penalty; r_collide is the collision penalty; r_reach is the goal-reaching reward;
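A sketch of the composite reward above; the numeric values are illustrative placeholders, and each term is assumed to be active only when the corresponding event occurs in the step, which the patent does not spell out.

```python
def step_reward(moved, collided, reached,
                r_move=-0.05, r_stay=-0.075, r_collide=-0.5, r_reach=3.0):
    """Per-step reward r_t = r_move + r_stay + r_collide + r_reach, with each
    term contributing only when its event occurred (placeholder magnitudes)."""
    r = r_move if moved else r_stay      # exactly one of the two per step
    if collided:
        r += r_collide
    if reached:
        r += r_reach
    return r
```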
2) Define the partial observation of agent A_i at time t as o_t^i; the hidden value information obtained after it passes through the observation encoder module and the communication module is h_t^i, which is used for action selection; suppose the Q-network action value function of agent A_i is Q_i(h_t^i, a_t^i; θ), composed of two parts, a value function V_i(h_t^i; θ, p) and an advantage function A_i(h_t^i, a_t^i; θ, q), denoted as,
Q_i(h_t^i, a_t^i; θ) = V_i(h_t^i; θ, p) + A_i(h_t^i, a_t^i; θ, q) − (1/|𝒜|)·Σ_{a'∈𝒜} A_i(h_t^i, a'; θ, q);
Wherein θ is a parameter of the overall Q network, and p and q are the network parameters of the value function and the advantage function respectively; 𝒜 is the set of all possible actions that agent A_i can take at time t;
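The value/advantage composition of the Q network described above reduces to a one-line aggregation; subtracting the mean advantage is the common D3QN convention for making the decomposition identifiable, assumed here since the patent's formula did not survive extraction in this record.

```python
import numpy as np

def dueling_q(v, adv):
    """Dueling aggregation: Q(h, a) = V(h) + A(h, a) - mean_a' A(h, a')."""
    adv = np.asarray(adv, dtype=float)
    return v + adv - adv.mean()
```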
setting up two Q networks in the algorithm, an evaluation network Q(h_t^i, a_t^i; θ) and a target network Q(h_t^i, a_t^i; θ'), to eliminate maximization bias, wherein the evaluation network is used to determine the action at the next moment and the target network is used to determine the estimate of the return, expressed as,
y_t^i = r_t^i + γ·Q(h_{t+1}^i, argmax_a Q(h_{t+1}^i, a; θ); θ').
6. The multi-agent path planning method of claim 5, wherein, in the step of training the strategy using the D3QN algorithm with a plurality of evolutionary entities executing in parallel,
1) The step of performing policy evaluation includes: if an agent successfully reaches its endpoint within a prescribed maximum number of steps t_max, it obtains an evaluation value of +e_reach; if an agent collides c times in a task, it obtains an evaluation value of −e_collide; assuming that the Manhattan distance from an agent's start point to its endpoint is x and the number of steps at which the agent first reaches the endpoint is step, where step is taken as the maximum number of steps t_max if the agent fails to reach the endpoint within t_max, the agent obtains an evaluation value e_length determined by x and step;
Under the guidance of the collaborative optimization strategy π, agent A_i performs task planning, and the final total evaluation value obtained is expressed as,
e_i(π) = r_sign · e_reach − e_collide + e_length;
Wherein r_sign is a success indicator;
The evaluation value of the collaborative optimization strategy π is,
E(π) = (1/n_env)·Σ_{k=1}^{n_env} (1/|A|)·Σ_{A_i∈A} e_i(π);
Wherein n_env denotes that the collaborative optimization strategy π is evaluated on multi-agent path planning tasks in n_env storage environments; A is the set of agents participating in one multi-agent path planning task; |A| is the number of agents participating in one multi-agent path planning task;
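The per-agent scoring above can be sketched as follows; the exact form of e_length is left unspecified in this record, so the path-efficiency ratio x/step used here, the per-collision scaling and the numeric weights are all assumptions.

```python
def episode_score(reached, collisions, manhattan_dist, steps,
                  e_reach=1.0, e_collide=0.1):
    """Per-agent evaluation e_i = r_sign * e_reach - collisions * e_collide + e_length.
    e_length is modelled as the path-efficiency ratio manhattan_dist / steps
    (an assumption); r_sign is 1 on success, 0 otherwise."""
    r_sign = 1.0 if reached else 0.0
    e_length = manhattan_dist / steps if steps else 0.0
    return r_sign * e_reach - collisions * e_collide + e_length
```

Averaging this score over all agents and over n_env randomly generated environments gives the strategy-level evaluation value E(π).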
2) In completing policy evolution, one evolution round of an evolutionary entity proceeds as follows: firstly, multi-agent path planning tasks in n_env storage environments are randomly generated for evaluating strategies; then, the collaborative optimization strategy π(θ_0) with evaluation network parameter θ_0 obtained from reinforcement learning training is fetched and evaluated, giving the initial parent strategy evaluation value E[π(θ_0)]; next, the original parent strategy π(θ_0) is iteratively mutated up to a set maximum number of iterations; assuming that the parent strategy parameter of the y-th generation is θ_{y−1}, N pseudo-offspring strategy parameters are generated using random noise vectors ε_{y,z} ~ N(0, σ²), denoted as θ_{y,z} = θ_{y−1} + ε_{y,z}, z = 1, 2, …, N; each pseudo-offspring strategy is evaluated, the obtained evaluation value being denoted E_z = E[π(θ_{y,z})]; the approximate gradient is calculated, expressed as,
g_y = (1/(N·σ²))·Σ_{z=1}^{N} E_z·ε_{y,z},
and the mutated strategy parameter is obtained as,
θ_y = θ_{y−1} + α·g_y,
wherein α is the learning rate of the evolutionary algorithm and σ is the mutation rate of the evolutionary algorithm; the mutated offspring strategy π(θ_y) is evaluated, and if its evaluation value E[π(θ_y)] is larger than the initial parent strategy evaluation value E[π(θ_0)], policy evolution is carried out via soft update, the current evolution round ends, and a new round of policy evolution begins; if no dominant offspring strategy is found by mutation within the maximum number of iterations, no soft update is performed and the evolution round restarts.
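One mutation generation of the evolutionary entity can be sketched in the OpenAI-ES style: sample Gaussian perturbations, score each pseudo-offspring, and move the parameters along the fitness-weighted average of the noise. The gradient normalisation by N·σ² is an assumption, as are the default hyperparameters.

```python
import numpy as np

def es_generation(theta, fitness, n_children=50, sigma=0.1, alpha=0.05, seed=0):
    """One evolution-strategy generation: perturb theta with eps ~ N(0, sigma^2 I),
    evaluate each pseudo-offspring, and take a fitness-weighted gradient step."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    eps = rng.normal(0.0, sigma, size=(n_children, theta.size))
    scores = np.array([fitness(theta + e) for e in eps])
    # approximate gradient g = (1 / (N * sigma^2)) * sum_z E_z * eps_z
    grad = (eps * scores[:, None]).sum(axis=0) / (n_children * sigma**2)
    return theta + alpha * grad
```

Running repeated generations on a toy quadratic fitness moves the parameters toward its maximum, mirroring how repeated mutation rounds would push the policy toward higher evaluation values.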
7. The multi-agent path planning method of claim 6 wherein, in the step of changing the original collaborative optimization strategy by means of soft update,
The soft update combines the dominant offspring strategy with the currently trained evaluation network and target network, expressed as,
θ = τ·θ + (1−τ)·θ_i
θ' = τ·θ' + (1−τ)·θ_i;
Wherein θ is a parameter of the evaluation network; θ' is a parameter of the target network; θ_i is a parameter of the dominant offspring strategy; τ is the soft update coefficient.
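The soft update above is a simple convex blend of parameter vectors; sketched here with a placeholder default for τ, which the patent does not fix.

```python
import numpy as np

def soft_update(theta, theta_child, tau=0.95):
    """Blend a dominant offspring's parameters into the current network:
    theta <- tau * theta + (1 - tau) * theta_child."""
    theta = np.asarray(theta, dtype=float)
    theta_child = np.asarray(theta_child, dtype=float)
    return tau * theta + (1.0 - tau) * theta_child
```

The same blend is applied to both the evaluation network θ and the target network θ', so the evolved parameters enter training gradually rather than overwriting it.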
8. A multi-agent path planning system based on a logistics storage environment, characterized by comprising:
The data acquisition module is used for acquiring the partial observation information of each agent at the current moment, based on a plurality of agents in the logistics storage environment;
the path planning module is used for performing path planning with the trained collaborative optimization strategy network model based on the partial observation information, to obtain the action of each agent at the current moment;
wherein,
In the step of obtaining the partial observation information of each agent at the current moment, for a selected agent, the partial observation information comprises information within the field of view and a local guide vector; the information within the field of view comprises a binary matrix of shelf positions in the field of view, a binary matrix of other agents' positions in the field of view, a binary matrix of intersection positions in the field of view, and a binary matrix of four-channel heuristic information; the four-channel heuristic information indicates, for each grid position within the selected agent's field of view, whether moving one grid up, down, left or right brings it closer to the endpoint, and is obtained from the storage grid map derived from the logistics storage environment; the local guide vector consists of a unit direction vector pointing to the endpoint and the Manhattan distance to the endpoint;
And the trained collaborative optimization strategy network model is trained with a strategy combining reinforcement learning and evolutionary learning.
9. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the multi-agent path planning method based on a logistics storage environment of any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the multi-agent path planning method based on a logistics storage environment of any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410063880.0A CN117908542A (en) | 2024-01-16 | 2024-01-16 | Multi-agent path planning method and system based on logistics storage environment |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117908542A true CN117908542A (en) | 2024-04-19 |
Family
ID=90697216
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410063880.0A Pending CN117908542A (en) | 2024-01-16 | 2024-01-16 | Multi-agent path planning method and system based on logistics storage environment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117908542A (en) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118670400A (en) * | 2024-08-22 | 2024-09-20 | 西南交通大学 | Multi-agent route planning method and device based on artificial potential field and PPO |
| CN118670400B (en) * | 2024-08-22 | 2024-10-29 | 西南交通大学 | Multi-agent route planning method and device based on artificial potential field and PPO |
| CN119124191A (en) * | 2024-08-28 | 2024-12-13 | 北京理工大学 | A robot motion planning method and device inspired by spatial relation memory |
| CN119124191B (en) * | 2024-08-28 | 2025-09-30 | 北京理工大学 | A robot motion planning method and device inspired by spatial relation memory |
| CN119623933A (en) * | 2024-11-13 | 2025-03-14 | 广州里工实业有限公司 | Multi-agent collaborative scheduling method and system based on maximum entropy reinforcement learning |
| CN119809074A (en) * | 2024-12-19 | 2025-04-11 | 北京天地万佳物流有限公司 | Logistics rapid distribution analysis system based on artificial intelligence |
| CN119826853A (en) * | 2025-01-02 | 2025-04-15 | 河海大学 | Intelligent warehouse AGV group path planning method based on communication cooperation |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN117908542A (en) | Multi-agent path planning method and system based on logistics storage environment | |
| Jaafra et al. | Reinforcement learning for neural architecture search: A review | |
| Powell | Perspectives of approximate dynamic programming | |
| Taiji et al. | Dynamics of internal models in game players | |
| Clayton et al. | Machine teaching in hierarchical genetic reinforcement learning: Curriculum design of reward functions for swarm shepherding | |
| Kumar et al. | Socio-inspired optimization metaheuristics: a review | |
| CN116796964A (en) | A method based on generative adversarial imitation learning to solve job shop scheduling problems | |
| CN114143882B (en) | Self-organizing method and system of multi-agent system based on enhanced organizational control | |
| Mondal et al. | A survey of reinforcement learning techniques: strategies, recent development, and future directions | |
| CN118014013A (en) | A method and system for fast searching game based on deep reinforcement learning guided by prior strategies | |
| Liu et al. | Boosting reinforcement learning via hierarchical game playing with state relay | |
| CN118036469A (en) | A method, device, medium and product for optimizing search and planning strategies for unmanned systems in dynamic scenarios | |
| Tong et al. | Enhancing rolling horizon evolution with policy and value networks | |
| Kumar et al. | A new hybrid multi-agent-based particle swarm optimisation technique | |
| CN116917903A (en) | Devices and methods for automated reward shaping | |
| CN119623563A (en) | A self-learning method, device, equipment and medium for non-adversarial tasks | |
| CN118410855A (en) | Unmanned vehicle target searching method, unmanned vehicle target searching device, unmanned vehicle target searching medium and unmanned vehicle target searching product | |
| Köle et al. | Heuristics for car setup optimisation in torcs | |
| Espinós Longa et al. | Swarm intelligence in cooperative environments: n-step dynamic tree search algorithm overview | |
| Tan et al. | Learning, fast and slow: A goal-directed memory-based approach for dynamic environments | |
| Zhang et al. | Design of Data-Driven Learning Path Based on Knowledge Graph and Tracing Model | |
| Kong et al. | Solving Sparse Reward Tasks Using Self-Balancing Exploration and Exploitation | |
| Nguyen et al. | A broad-persistent advising approach for deep interactive reinforcement learning in robotic environments | |
| Srinivasaiah et al. | Reinforcement learning strategies using Monte-Carlo to solve the blackjack problem. | |
| Fei et al. | A new noise network and gradient parallelisation‐based asynchronous advantage actor‐critic algorithm |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||