Single-piece job shop scheduling method based on Deep Q-network (DQN) deep reinforcement learning
Technical Field
The invention belongs to the technical field of production planning, is used for optimizing the production plan of a job shop, and particularly relates to a single-piece job shop scheduling method based on Deep Q-network (DQN) deep reinforcement learning.
Background
In the intelligent manufacturing environment, "perception, interconnection and intelligence of everything" has become a new goal for the development of the manufacturing industry. A workshop generates a large amount of data during actual production, and these data contain a great deal of scheduling knowledge; traditional meta-heuristic algorithms cannot use these data effectively, so the data are wasted. Under a modern manufacturing mode in which intelligence and informatization develop rapidly, research on intelligent scheduling algorithms with self-perception, data analysis and self-decision capabilities is of great significance.
Traditional scheduling algorithms mainly comprise heuristic-rule algorithms and meta-heuristic scheduling algorithms. Heuristic rules are approximate scheduling algorithms that have been widely used in practice because of advantages such as insensitivity to the NP-hard nature of the problem and good real-time performance; however, research has shown that the global performance of a single dispatching rule depends on the characteristics of the system, the operating-condition parameters and the production goals, and a scheduling rule that was previously effective may no longer be applicable when the state of the manufacturing system changes. Meta-heuristic algorithms perform iterative optimization through an evolution or move mechanism and produce a brand-new scheduling scheme; they have the advantage of high solution accuracy and can obtain a better rescheduling scheme, but their search time is long, they cannot meet real-time requirements, and they are not suitable when dynamic events occur frequently.
In the production process, the job shop scheduling problem (JSP) of a single-piece job shop needs to be handled, but existing job shop scheduling algorithms cannot rapidly handle dynamic events in the job shop and are of poor practicality. Therefore, how to overcome the above technical problems and shortcomings is a problem to be solved.
Disclosure of Invention
The invention aims to provide a single-objective, single-piece job shop real-time scheduling method based on Deep Q-network (DQN) deep reinforcement learning.
The purpose of the invention is to realize disjunctive-graph modeling of the single-piece job shop scheduling problem, scheduling decisions based on a convolutional neural network, and scheduling-decision optimization based on the DQN deep reinforcement learning algorithm, thereby optimizing the performance index of minimizing the maximum completion time in job shop scheduling.
In order to achieve the above aim, the technical scheme of the invention provides a single-piece job shop scheduling method based on Deep Q-network deep reinforcement learning, which comprises the following steps:
(1) modeling the job shop scheduling environment with a disjunctive graph, converting the scheduling decision problem into a sequential decision problem, establishing a Markov decision process (MDP) five-tuple model ε = {S, A, S′, γ, R}, and solving the model with deep reinforcement learning, taking the minimized maximum completion time as the performance evaluation index of the model;
(2) extracting the current state s ∈ S from the disjunctive-graph environment of job shop scheduling;
the state s is formed by a three-channel matrix, the three channels being a processing time matrix, a machine matrix and an operation start time matrix respectively;
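For illustration only, a minimal sketch of assembling such a three-channel state is given below; the NumPy layout, the function name and the handling of the machine channel are assumptions of the sketch, not the exact implementation of the invention.

```python
import numpy as np

def build_state(proc_time, machine_id, start_time):
    """Assemble the three-channel state s: processing time, machine, start time.

    proc_time  : n x m array of operation processing times (0 once scheduled)
    machine_id : n x m array of machine numbers of the operations
    start_time : n x m array of operation start times (initialized to zero)
    """
    t_max = proc_time.max()
    t = proc_time / t_max if t_max > 0 else proc_time        # 0-1 by maximum processing time
    st_max = start_time.max()
    st = start_time / st_max if st_max > 0 else start_time   # 0-1 by current maximum start time
    return np.stack([t, machine_id, st]).astype(np.float32)  # shape: 3 x n x m
```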
(3) constructing an action value function Q(s, a) and a target value function;
the action value function Q(s, a; θ) and the target value function Q(s, a; θ⁻) are fitted with convolutional neural networks, and the two value functions are initialized as Q(s, a; θ₀) and Q(s, a; θ₀⁻);
where θ is the set of internal weights w and bias values b of the convolutional neural network of the action value function, and θ⁻ is the set of internal weights w⁻ and bias values b⁻ of the convolutional neural network of the target value function;
(4) adopting 18 heuristic scheduling rules as the reinforcement-learning agent actions a, where a ∈ A;
a task set is established for each machine according to the workpieces to be processed, the global state is input into the convolutional neural network, and the output of the convolutional neural network is converted into a probability for each heuristic rule with a softmax function; when action selection is performed according to the probability P, in order to ensure that the scheduling policy can converge while retaining the ability to jump out of local optima, uncertainty is added to the current decision: the action selection mechanism combines selecting the action with the maximum probability, a = a_max(P), with selecting an action according to a uniform probability distribution, a = a_random(P);
the action selection mechanism comprises a manually set hyper-parameter β and a randomly generated number d, d ∈ (0, 1): when d ≤ β, the heuristic scheduling rule with the maximum probability is selected; when d > β, a heuristic scheduling rule is selected according to the uniform probability distribution; workpieces are then selected from the task set according to the chosen rule and arranged on the corresponding machine for processing, i.e.
a = a_max(P) if d ≤ β, and a = a_random(P) if d > β;
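A minimal sketch of this selection mechanism is shown below; the softmax over the 18 network outputs follows the description above, while the function name and the random-number interface are assumptions.

```python
import numpy as np

def select_rule(net_output, beta, rng=np.random.default_rng()):
    """Choose one of the 18 heuristic scheduling rules from the network output."""
    # softmax converts the network output into the probability P of each rule
    p = np.exp(net_output - net_output.max())
    p /= p.sum()
    d = rng.random()                       # randomly generated d in (0, 1)
    if d <= beta:
        return int(np.argmax(p))           # a_max(P): rule with the maximum probability
    return int(rng.integers(len(p)))       # a_random(P): rule chosen uniformly at random
```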
(5) designing a reward function R(t) to evaluate the whole scheduling decision, and updating the weight parameters θ of the action value function with the DQN algorithm so as to update the scheduling decision;
in the reward function design, the single performance index of minimizing the maximum completion time is considered; because different scheduling problems yield different minimum maximum completion times, in order to evaluate the immediate behavior in a unified manner, the optimization of the minimum maximum completion time is converted into the optimization of machine utilization through the following formula:
U_k(t) = Σ_{O_{i,j} ∈ M_k} P_{i,j} / C_k(t)
where M_k is the k-th machine, C_k is the time at which machine M_k finishes its assigned operations, P_{i,j} is the processing time of the j-th operation of workpiece J_i, and the sum runs over the operations already scheduled on machine M_k;
the immediate reward is defined as:
r(t) = U_k(t) − U_k(t−1)
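As a simple illustration of this reward, the sketch below computes the machine utilization from the operations already scheduled on a machine and takes the difference between two consecutive decision steps; the data layout and function names are assumptions.

```python
def machine_utilization(proc_times_on_k, completion_time_k):
    """U_k(t): sum of processing times scheduled on machine M_k divided by C_k(t)."""
    if completion_time_k == 0:
        return 0.0
    return sum(proc_times_on_k) / completion_time_k

def immediate_reward(u_now, u_prev):
    """r(t) = U_k(t) - U_k(t-1)."""
    return u_now - u_prev
```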
the expected target value is calculated as:
Y_t^DQN = r_t + γ · max_a′ Q(s′, a′; θ⁻)
the parameter θ of the action value function is updated by stochastic gradient descent minimizing the squared loss function, where the loss function is calculated as:
L_DQN = (Y_t^DQN − Q(s, a; θ_t))²
the update formula of the parameter θ is:
θ_{t+1} = θ_t + α · (Y_t^DQN − Q(s, a; θ_t)) · ∇_θ Q(s, a; θ_t)
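A PyTorch-style sketch of this update step is given below (standard DQN form; the optimizer choice, batch layout and function names are assumptions of the sketch).

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma):
    """One stochastic-gradient step minimizing L_DQN = (Y_t - Q(s, a; theta))^2."""
    s, a, r, s_next = batch                                   # tensors sampled from the memory pool
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values  # Y_t = r + gamma * max_a' Q(s', a'; theta^-)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a; theta)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()                                           # gradient of the squared loss w.r.t. theta
    optimizer.step()                                          # theta moves along the negative gradient
    return loss.item()
```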
(6) performing the state transition P(s′|s, a), and storing the state s, the action a, the reward r and the next state s′ in a memory pool;
executing action a in state s, updating the workshop environment, extracting the new state s′, and storing s, a, r and s′ in the memory pool in sequence;
(7) updating the network parameters of the target value function;
when the memory pool is full, the agent starts to update the network parameters of the target value function with the historical data: every C steps, batch_size records of historical data are randomly sampled from the memory pool, and the parameter θ⁻ of the target value function is updated according to the following formula:
θ⁻ = θ
that is, the weights of the action value network are copied into the target value network.
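An illustrative sketch of the memory pool and of copying θ into θ⁻ follows; the class and function names are assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity memory pool storing (s, a, r, s') transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def is_full(self):
        return len(self.buffer) == self.buffer.maxlen

def sync_target(q_net, target_net):
    """theta^- <- theta: copy the action-value network weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```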
further, in the step (1), the state S is obtained from the environment, the state is input into the agent, the action a currently executed by the agent is output, the global state S is changed, the reward R is obtained, the process is repeated until all the machines complete the processing tasks, and the environment of job shop scheduling is established according to the disjunctive graph.
Further, in the step (1), the disjunctive graph model is represented by a triple G = (N; C, E);
where N is the set of all operation nodes, including a virtual start node start and a virtual end node end;
C is the set of directed conjunctive arcs connecting adjacent operations determined by the process route of the same part;
E is the set of disjunctive arcs between operations that can be processed on the same machine tool.
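For illustration, the triple G = (N; C, E) could be held as plain Python lists as in the sketch below; the instance format (a list of (machine, time) operations per job) and the function name are assumptions.

```python
from collections import defaultdict

def build_disjunctive_graph(jobs):
    """jobs: list of jobs, each a list of (machine, processing_time) operations.

    Returns the node set N (plus virtual 'start'/'end' nodes), the conjunctive
    arcs C following each job's process route, and the disjunctive arcs E
    between operations that share a machine.
    """
    nodes, conj, by_machine = ['start', 'end'], [], defaultdict(list)
    for i, ops in enumerate(jobs):
        prev = 'start'
        for j, (machine, _) in enumerate(ops):
            node = (i, j)                      # operation j of job i
            nodes.append(node)
            conj.append((prev, node))          # directed arc along the process route
            by_machine[machine].append(node)
            prev = node
        conj.append((prev, 'end'))
    disj = [(u, v) for ops in by_machine.values()
            for k, u in enumerate(ops) for v in ops[k + 1:]]  # same-machine pairs
    return nodes, conj, disj
```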
Further, in the step (2), the processing time matrix is a matrix formed by the processing time of each operation, and each element in the matrix is normalized to 0-1 by the maximum processing time.
Further, in the step (2), the machine matrix is a matrix composed of the machine numbers of the operations.
Further, in the step (2), the operation start time matrix is a matrix formed by the start time of each operation; each element is normalized to 0-1 by the current maximum start time, and the matrix is initialized to zero.
Further, in the step (2), the processing time matrix, the machine matrix and the operation start time matrix are all of size n × m, where n is the number of workpieces and m is the number of operations.
Further, in the step (3), the convolutional neural network comprises two convolutional layers and two fully connected layers, and the input size is 3 × n × m;
the first convolutional layer contains 9 convolution kernels of size 3 × 3, with stride 1 and edge padding 2;
the second convolutional layer contains 18 convolution kernels of size 3 × 3, with stride 1 and edge padding 2;
the third layer is a fully connected layer with 18 × (n+4) × (m+4) input nodes and 18 × (n+4) output nodes;
the fourth layer is a fully connected layer with 18 × (n+4) input nodes and 18 output nodes.
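A PyTorch sketch of this network is given below; the layer sizes follow the description above (with padding 2, each convolution enlarges the feature map by 2 in each dimension, giving the 18 × (n+4) × (m+4) flattened size), while the ReLU activations and the class name are assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Two convolutional layers + two fully connected layers, input 3 x n x m."""
    def __init__(self, n, m, n_actions=18):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 9, kernel_size=3, stride=1, padding=2)   # -> 9 x (n+2) x (m+2)
        self.conv2 = nn.Conv2d(9, 18, kernel_size=3, stride=1, padding=2)  # -> 18 x (n+4) x (m+4)
        self.fc1 = nn.Linear(18 * (n + 4) * (m + 4), 18 * (n + 4))
        self.fc2 = nn.Linear(18 * (n + 4), n_actions)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = x.flatten(start_dim=1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                  # one value per heuristic rule
```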
The invention has the beneficial effects that:
1. The invention considers the performance index of minimizing the maximum completion time in job shop scheduling, uses the disjunctive graph to convert the scheduling decision problem into a sequential decision problem, establishes an MDP model of job shop scheduling and solves it with the DQN algorithm; a better scheduling arrangement can be obtained through training, and the trained agent can quickly solve new job shop scheduling problems of the same scale.
2. The invention solves the single-piece job shop scheduling problem with a Deep Q-network (DQN) deep reinforcement learning algorithm that selects heuristic scheduling rules according to the input state features, which better matches the scheduling decision process of an actual order-responsive production and manufacturing system.
3. The invention uses a convolutional neural network model to fit the action value function and the target value function, inputs the feature data of the processing state of the manufacturing system into the model and, combined with the reinforcement-learning reward and punishment mechanism, selects the optimal heuristic rule as the strategy at each scheduling decision.
4. The Deep Q-network (DQN) deep reinforcement learning algorithm provided by the invention has strong real-time performance and high flexibility.
Drawings
FIG. 1 is a flowchart of the single-objective, single-piece job shop real-time scheduling method based on Deep Q-network (DQN) deep reinforcement learning according to the present invention.
Fig. 2 is the disjunctive graph of ft06.
FIG. 3 is a transition diagram of the scheduling state of the present invention.
Fig. 4 is a graph of the training data for ft06.
Fig. 5 is a Gantt chart for ft06.
Detailed Description
The invention will be described in more detail below by means of specific embodiments with reference to the attached drawings.
The invention relates to a job shop scheduling method based on Deep Q-network (DQN) deep reinforcement learning, the flow of which is shown in Fig. 1; the method comprises the following steps:
(1) taking the ft06 benchmark instance as the specific embodiment and the minimized maximum completion time as the performance evaluation index, where the information of the ft06 benchmark instance is shown in Table 1; a disjunctive graph is used to establish the scheduling environment of the instance, as shown in Fig. 2;
Table 1. The ft06 benchmark instance
(2) Extracting state features from the disjunctive graph as the input of the convolutional neural network;
principles of state design: from the perspective of the task objective, the selected state features should be related to the scheduling objective, otherwise feature redundancy results; the state features should describe the main characteristics and changes of the scheduling environment, including global and local information; and the state features should have a certain compatibility across different scheduling problems. The state features are numerical representations of the state attributes; the state attributes should be easy to calculate and normalize so as to ensure a uniform scale. Extracting the state from the job shop environment yields three state matrices, namely the processing time matrix, the machine matrix and the operation start time matrix;
the processing time matrix is formed by the processing times of the operations and is normalized to 0-1; if an operation has been assigned to a machine, its value is set to zero;
the machine matrix is composed of the machine numbers of the operations, and an element likewise changes to zero once its operation has been assigned to a machine;
the operation start time matrix is formed by the start times of the operations, is normalized to 0-1 and is initialized to zero;
the processing time matrix, the machine matrix and the operation start time matrix are all of size n × m, where n is the number of workpieces and m is the number of operations; the transition of ft06 from the initial state s0 to state s3 is shown in Fig. 3;
(3) constructing an action value function Q(s, a) and a target value function;
the action value function Q(s, a; θ) and the target value function Q(s, a; θ⁻) are fitted with convolutional neural networks, and the two value functions are initialized as Q(s, a; θ₀) and Q(s, a; θ₀⁻);
where θ is the set of internal weights w and bias values b of the convolutional neural network of the action value function, and θ⁻ is the set of internal weights w⁻ and bias values b⁻ of the convolutional neural network of the target value function;
the convolutional neural network comprises two convolutional layers and two fully connected layers, and the input size is 3 × n × m;
the first convolutional layer contains 9 convolution kernels of size 3 × 3, with stride 1 and edge padding 2;
the second convolutional layer contains 18 convolution kernels of size 3 × 3, with stride 1 and edge padding 2;
the third layer is a fully connected layer with 18 × (n+4) × (m+4) input nodes and 18 × (n+4) output nodes;
the fourth layer is a fully connected layer with 18 × (n+4) input nodes and 18 output nodes, equal to the number of actions;
(4) adopting 18 heuristic scheduling rules as the reinforcement-learning agent actions a, where a ∈ A;
a task set is established for each machine according to the workpieces to be processed, the global state is input into the convolutional neural network, and the output of the convolutional neural network is converted into a probability for each heuristic rule with a softmax function; when action selection is performed according to the probability P, in order to ensure that the scheduling policy can converge while retaining the ability to jump out of local optima, uncertainty is added to the current decision: the action selection mechanism combines selecting the action with the maximum probability, a = a_max(P), with selecting an action according to a uniform probability distribution, a = a_random(P);
the action selection mechanism comprises a manually set hyper-parameter β and a randomly generated number d, d ∈ (0, 1): when d ≤ β, the heuristic scheduling rule with the maximum probability is selected; when d > β, a heuristic scheduling rule is selected according to the uniform probability distribution; workpieces are then selected from the task set according to the chosen rule and arranged on the corresponding machine for processing, i.e.
a = a_max(P) if d ≤ β, and a = a_random(P) if d > β;
where β = 0.95;
the 18 heuristic rules are shown in Table 2:
Table 2. The 18 heuristic scheduling rules

| No. | Rule | Description |
| --- | --- | --- |
| 1 | SPT | Select the workpiece whose current operation has the shortest processing time |
| 2 | LPT | Select the workpiece whose current operation has the longest processing time |
| 3 | SPT+SSO | Select the workpiece with the shortest sum of current operation and subsequent operation processing times |
| 4 | LPT+LSO | Select the workpiece with the longest sum of current operation and subsequent operation processing times |
| 5 | SPT*TWK | Select the workpiece with the smallest product of operation processing time and total processing time |
| 6 | LPT*TWK | Select the workpiece with the largest product of operation processing time and total processing time |
| 7 | SPT/TWK | Select the workpiece with the smallest ratio of operation processing time to total processing time |
| 8 | LPT/TWK | Select the workpiece with the largest ratio of operation processing time to total processing time |
| 9 | SPT*TWKR | Select the workpiece with the smallest product of operation processing time and remaining processing time |
| 10 | LPT*TWKR | Select the workpiece with the largest product of operation processing time and remaining processing time |
| 11 | SPT/TWKR | Select the workpiece with the smallest ratio of operation processing time to remaining processing time |
| 12 | LPT/TWKR | Select the workpiece with the largest ratio of operation processing time to remaining processing time |
| 13 | SRM | Select the workpiece with the shortest remaining processing time excluding the operation under consideration |
| 14 | LRM | Select the workpiece with the longest remaining processing time excluding the operation under consideration |
| 15 | SRPT | Select the workpiece with the shortest remaining processing time |
| 16 | LRPT | Select the workpiece with the longest remaining processing time |
| 17 | SSO | Select the workpiece with the shortest subsequent operation processing time |
| 18 | LSO | Select the workpiece with the longest subsequent operation processing time |
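For illustration, a few of the rules in Table 2 could be written as selection functions over a machine's task set; the job representation and field names are assumptions of the sketch.

```python
def spt(task_set):
    """SPT: workpiece whose current operation has the shortest processing time."""
    return min(task_set, key=lambda job: job['op_time'])

def lpt_over_twkr(task_set):
    """LPT/TWKR: workpiece with the largest ratio of operation time to remaining work."""
    return max(task_set, key=lambda job: job['op_time'] / job['remaining_time'])

def srpt(task_set):
    """SRPT: workpiece with the shortest remaining processing time."""
    return min(task_set, key=lambda job: job['remaining_time'])
```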
(5) Designing a reward function R(t) to evaluate the whole scheduling decision, and updating the weight parameters θ of the action value function with the DQN algorithm so as to update the scheduling decision;
in the reward function design, the single performance index of minimizing the maximum completion time is considered; because different scheduling problems yield different minimum maximum completion times, in order to evaluate the immediate behavior in a unified manner, the optimization of the minimum maximum completion time is converted into the optimization of machine utilization through the following formula:
U_k(t) = Σ_{O_{i,j} ∈ M_k} P_{i,j} / C_k(t)
where M_k is the k-th machine, C_k is the time at which machine M_k finishes its assigned operations, P_{i,j} is the processing time of the j-th operation of workpiece J_i, and the sum runs over the operations already scheduled on machine M_k;
the immediate reward is defined as:
r(t) = U_k(t) − U_k(t−1)
the expected target value is calculated as:
Y_t^DQN = r_t + γ · max_a′ Q(s′, a′; θ⁻)
where the discount factor γ = 0.95;
the parameter θ of the action value function is updated by stochastic gradient descent minimizing the squared loss function, where the loss function is calculated as:
L_DQN = (Y_t^DQN − Q(s, a; θ_t))²
the update formula of the parameter θ is:
θ_{t+1} = θ_t + α · (Y_t^DQN − Q(s, a; θ_t)) · ∇_θ Q(s, a; θ_t)
where the learning rate α = 0.0001;
(6) performing the state transition P(s′|s, a), and storing the state s, the action a, the reward r and the next state s′ in a memory pool, where the capacity N of the memory pool is 2000;
executing action a in state s, updating the workshop environment, extracting the new state s′, and storing s, a, r and s′ in the memory pool in sequence;
(7) updating the network parameters of the target value function;
when the memory pool is full, the agent starts to update the network parameters of the target value function with the historical data: every 200 steps, 128 records of historical data are randomly sampled from the memory pool, and the parameter θ⁻ of the target value function is updated according to:
θ⁻ = θ
and the number of training episodes is 3500 in this embodiment; the overall training procedure is assembled in the sketch below.
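Putting the previous sketches together, one possible training loop under the hyper-parameters of this embodiment (β = 0.95, γ = 0.95, α = 0.0001, memory capacity 2000, batch size 128, target-network update every 200 steps, 3500 episodes) is outlined below; the environment interface (env.reset / env.step), the optimizer choice and the helper to_tensors, assumed to stack the sampled transitions into tensors, are illustrative assumptions rather than the exact implementation of the invention.

```python
import torch

def train(env, q_net, target_net, episodes=3500, alpha=1e-4, gamma=0.95,
          beta=0.95, capacity=2000, batch_size=128, sync_every=200):
    optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)
    memory, step = ReplayMemory(capacity), 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:                                          # until all machines finish
            with torch.no_grad():
                out = q_net(s.unsqueeze(0))[0].numpy()
            a = select_rule(out, beta)                           # pick one of the 18 rules
            s_next, r, done = env.step(a)                        # dispatch a workpiece by rule a
            memory.push(s, a, r, s_next)
            if memory.is_full():
                dqn_update(q_net, target_net, optimizer,
                           to_tensors(memory.sample(batch_size)), gamma)
                if step % sync_every == 0:
                    sync_target(q_net, target_net)               # theta^- <- theta every 200 steps
            s, step = s_next, step + 1
```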
The foregoing is a further detailed description of the invention in conjunction with specific embodiments, and the specific implementation of the invention shall not be considered limited to these descriptions. Those skilled in the art to which the invention pertains can make several simple deductions or substitutions without departing from the concept of the invention, and all of these shall be regarded as falling within the protection scope of the invention.