
CN115167150B - Batch process two-dimensional off-orbit strategy staggered Q learning optimal tracking control method with unknown system dynamics - Google Patents


Info

Publication number: CN115167150B
Application number: CN202210973113.4A
Authority: CN (China)
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN115167150A
Inventors: 施惠元, 高维, 解俊朋, 苏成利, 姜雪莹, 李平, 李娟, 郑尚磊
Current and original assignee: Liaoning Shihua University
Application filed by Liaoning Shihua University
Priority to CN202210973113.4A; published as CN115167150A; granted as CN115167150B


Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05B — CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 — Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 — Adaptive control systems, electric
    • G05B 13/04 — Adaptive control systems, electric, involving the use of models or simulators
    • G05B 13/042 — Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 — Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02 — Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract


A two-dimensional off-policy interleaved Q-learning optimal tracking control method for batch processes with unknown system dynamics belongs to the field of industrial process control. The specific steps are as follows. Step 1: establish a nonlinear state-space equation for a batch process with unknown dynamics; Step 2: extend the tracking error into the performance index as a state variable and construct the performance index of the two-dimensional nonlinear system; Step 3: define the expressions of the two-dimensional optimal value function and the Q-function from the relationship between the performance index and the value function; Step 4: introduce the interleaved Q-iteration algorithm; Step 5: construct a model network to obtain the initial weights of the neural network; Step 6: construct a critic network and an actor network to obtain the final control strategy. This method resolves the over-reliance on system models in traditional industrial processes; at the same time, even when the initial parameters of the system are unknown, the industrial process can proceed smoothly, greatly improving production efficiency and reducing computational cost.

Description

Two-dimensional off-policy interleaved Q-learning optimal tracking control method for batch processes with unknown system dynamics

Technical Field

The present invention belongs to the technical field of industrial process control, and in particular relates to a two-dimensional off-policy interleaved Q-learning optimal tracking control method for batch processes with unknown system dynamics.

Background Art

At a time of product upgrades and sharply increasing demand, industrial processes are numerous and highly complex. To meet the varied needs of more users, industrial production is gradually trending toward high-quality, small-scale production. Batch processes, with their many advantages, are therefore used to solve a variety of problems in industrial production, and have become one of the active research topics in the control field. Compared with continuous processes, batch processes are fast, low-cost, repetitive, diverse, and multi-stage. However, most batch-process control designs rely on a model of the controlled process to design the system's controller.

In practical applications, however, a batch process is not the simple production process one might imagine: it involves knowledge from many disciplines and contains numerous internal and external factors. Owing to the diversity of batch processes, a variety of difficulties inevitably arise during operation. Moreover, excessive reliance on a process model during operation inevitably degrades the model's performance over time, making it harder to establish an accurate system model of the batch process and significantly reducing product accuracy. In view of the above, a two-dimensional off-policy interleaved Q-learning optimal tracking control method for batch processes with unknown dynamics is designed to study the control of batch processes without relying on a process model or on the system's initial parameters.

Summary of the Invention

For batch processes with unknown system dynamics, the present invention proposes a two-dimensional off-policy interleaved Q-learning optimal tracking control method. The method effectively addresses the problem that the system cannot be modeled accurately, reduces the system's dependence on a model, and learns continuously using only data along the time and batch directions; even when the initial parameters of the system are unknown, the optimal control strategy can still be obtained, improving the control and tracking performance of the system and accelerating convergence.

The present invention is achieved through the following technical solution:

The present invention proposes a two-dimensional off-policy interleaved Q-learning optimal tracking control method for batch processes with unknown system dynamics. First, a nonlinear state-space equation of a batch process with unknown dynamics is established. Second, the tracking error is extended into the performance index as a state variable to construct the performance index of the two-dimensional nonlinear system. From the relationship between the performance index and the value function, the expressions of the two-dimensional optimal value function and the Q-function are defined, and the interleaved Q-iteration algorithm is introduced. Then a model network, a critic network, and an actor network are constructed; in this process, the weights of each layer of the neural networks are continuously learned and updated using data along the batch direction, and finally the optimal control strategy is found. The invention effectively addresses the problem that the system cannot be modeled accurately and greatly reduces over-reliance on the system model; at the same time, by adopting the Q-iteration learning method, the optimal control strategy can be obtained quickly and accurately even when the initial parameters of the system are unknown, improving the optimal performance of the system and accelerating convergence.

Step 1: Establish the nonlinear state-space equation of a batch process with unknown dynamics;

The batch process with unknown system dynamics is represented by a nonlinear affine state-space equation of the following form:

y(t+1,k) = f(y(t,k)) + g(y(t,k))u(t,k)  (1)

where t denotes the time direction, k the batch direction, y(t,k) ∈ R^n the system state, u(t,k) ∈ R^m the system control input, f(y(t,k)) ∈ R^n the f-function of y(t,k), and g(y(t,k)) ∈ R^{n×m} the g-function of y(t,k); R denotes the real field, and n and m the appropriate dimensions;
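As a concrete illustration (not part of the patent), the affine dynamics (1) can be rolled out one batch at a time. The particular f, g, and feedback policy below are hypothetical placeholders, since the patent treats the true dynamics as unknown:

```python
import numpy as np

def f(y):
    # Hypothetical drift term f(y); the patent leaves f unspecified.
    return 0.9 * y + 0.05 * np.sin(y)

def g(y):
    # Hypothetical input gain g(y), here with n = m = 1.
    return np.ones_like(y)

def run_batch(y0, policy, T):
    """Roll out batch k of y(t+1,k) = f(y(t,k)) + g(y(t,k)) u(t,k)."""
    ys = [np.asarray(y0, dtype=float)]
    for t in range(T):
        u = policy(ys[-1], t)
        ys.append(f(ys[-1]) + g(ys[-1]) * u)
    return np.stack(ys)

# One batch under an illustrative stabilizing feedback u = -0.2 y.
traj = run_batch(np.array([1.0]), lambda y, t: -0.2 * y, T=10)
```

Repeating such rollouts over k = 1, 2, … with the policy updated between batches is exactly the two-dimensional (time × batch) data-collection pattern the method relies on.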

Step 2: Extend the tracking error into the performance index as a state variable and construct the performance index of the two-dimensional nonlinear system;

The performance index of the two-dimensional nonlinear system is defined as:

where T is the prediction horizon, y_r(t,k) denotes the desired state of the system, u(t,k−1) denotes the control input of the system at time t of batch k−1, and R denotes a weighting matrix of appropriate dimension for the control input;

Let the extended state be defined with Q₁ = CᵀQC, where Q₁ denotes a weighting matrix of the dimension of the extended state and I denotes an identity matrix of appropriate dimension; the performance index of the two-dimensional nonlinear system can then be redefined as:

Meanwhile, the extended state of the system at time t+1 of batch k can be expressed as:

where the matrix relating the system's desired setpoints has appropriate dimension, and 0 denotes a zero matrix of appropriate dimension;

Step 3: Define the expressions of the two-dimensional optimal value function and the Q-function from the relationship between the performance index and the value function; according to formula (3), the two-dimensional optimal value function and the two-dimensional optimal Q-function are defined as follows:

where

The optimal control strategy is solved by minimizing the right-hand side of formula (5), yielding the optimal value function. By the necessary condition for optimality, the optimal control strategy u*(t,k) can be obtained by differentiating with respect to u(t,k); therefore,

When the control strategy u(t,k) reaches the optimum u*(t,k), the optimal value function is equivalent to the optimal Q-function, i.e.:

Step 4: Introduce the interleaved Q-iteration algorithm;

A Q-iteration derivation algorithm is designed; according to formula (8), the optimal Q-function can also be expressed in the following form:

The optimal control strategy can then be described as:

To find the optimal solutions of (9) and (10), first set the initial value Q₀(·) = 0; the control strategy u₀(t,k) is then expressed as:

Meanwhile, the Q-function can be updated as:

For i = 1, 2, …, the control strategy and the Q-function complete the iteration;

and
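The iteration above can be sketched on a toy tabular problem: start from Q₀ ≡ 0, improve the policy greedily, and back up the Q-function until it stops changing. The scalar dynamics, grids, and discount factor below are illustrative assumptions, not the patent's system:

```python
import numpy as np

# Coarse grids for a scalar toy system x' = 0.8x + u (assumed for illustration).
xs = np.linspace(-1, 1, 21)
us = np.linspace(-1, 1, 21)
X, U = np.meshgrid(xs, us, indexing="ij")

# Stage cost mirroring a Q1-weighted state term plus an R-weighted input term.
C = 6.0 * X**2 + 1.0 * U**2
# Nearest-grid index of the successor state for every (x, u) pair.
nxt = np.abs(np.clip(0.8 * X + U, -1, 1)[:, :, None] - xs[None, None, :]).argmin(axis=2)

Q = np.zeros_like(C)           # Q0(.) = 0: no admissible initial policy is needed
for i in range(1000):
    V = Q.min(axis=1)          # greedy (policy-improvement) value
    Q_new = C + 0.95 * V[nxt]  # Q backup with an assumed discount factor 0.95
    if np.max(np.abs(Q_new - Q)) < 1e-8:  # stop once the iteration has converged
        Q = Q_new
        break
    Q = Q_new
```

The convergence test here plays the same role as the ψ_Q stopping condition in the implementation steps later in the description.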

Step 5: Construct the model network to obtain the initial weights of the neural network;

In practical applications, this result is uncertain and an accurate value is difficult to obtain; therefore, a neural network is used to approximate the dynamics of system (4) and to compute the result;

For the model network, a three-layer neural network is used to identify the nonlinear system, with the indicated numbers of neuron nodes in the input, hidden, and output layers; let the weight matrix between the input layer and the hidden layer be W_m1, and the weight matrix between the hidden layer and the output layer be W_m2;

Assuming the extended state of the nonlinear system and the control input u(t,k) of the system are known, the output of the model network can be expressed as:

where

The error function for training the model network is defined as:

The objective function to be minimized is:

where E_m(t,k) is the squared error of the model network;

The weights are updated by a gradient-based adaptive method, namely:

where W_m1(j+1) denotes the weight matrix between the input layer and the hidden layer of the model network at iteration j+1, W_m1(j) the same matrix at iteration j, W_m2(j+1) the weight matrix between the hidden layer and the output layer at iteration j+1, W_m2(j) the same matrix at iteration j, α_m > 0 the learning rate of the model network, and j the inner-loop iteration count of the model network;

Once training is complete, the weights of the model network are held fixed;
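A minimal sketch of the model-network identification in Step 5: a three-layer tanh network trained by a gradient rule of this form. The layer sizes, learning rate, and the linear dynamics used to generate training data are all assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model network: input z = [x; u] -> hidden (Wm1) -> output x_next (Wm2).
n_in, n_hid, n_out = 3, 8, 2
Wm1 = rng.uniform(-0.1, 0.1, (n_hid, n_in))
Wm2 = rng.uniform(-0.1, 0.1, (n_out, n_hid))

def forward(z):
    h = np.tanh(Wm1 @ z)
    return Wm2 @ h, h

def train_step(z, target, lr=0.02):
    """One gradient step on Em = 0.5 * em' em (gradient-based adaptive update)."""
    global Wm1, Wm2
    yhat, h = forward(z)
    e = yhat - target                                   # em(t,k)
    Wm2 -= lr * np.outer(e, h)                          # dEm/dWm2
    Wm1 -= lr * np.outer((Wm2.T @ e) * (1 - h**2), z)   # backprop through tanh
    return 0.5 * float(e @ e)                           # Em(t,k)

# Assumed true dynamics x' = A x + B u, used only to generate samples.
A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([0.1, 0.2])
losses = []
for _ in range(3000):
    x = rng.uniform(-1, 1, 2)
    u = rng.uniform(-1, 1)
    losses.append(train_step(np.concatenate([x, [u]]), A @ x + B * u))
```

As in the patent's procedure, training would stop once the error e_m falls below a small threshold, after which the weights are frozen.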

Step 6: Construct the critic network and the actor network to obtain the final control strategy;

The critic network is used to approximate the Q-function; the output of the three-layer neural network used as the critic is expressed as follows:

where the weight matrix between the input layer and the hidden layer is W_c1, and the weight matrix between the hidden layer and the output layer is W_c2;

Based on formula (14), at the (i+1)-th iteration of the Q-function, the target value function is obtained from:

The prediction error is defined as:

The objective function to be minimized is:

where E_c(i+1)(t,k) is the squared error of the critic network at the (i+1)-th iteration;

Based on the gradient-based adaptive rule, the weight matrices can be updated as follows:

In the weight-matrix update, W_c1(j+1) denotes the weight matrix between the input layer and the hidden layer of the critic network at iteration j+1, W_c1(j) the same matrix at iteration j, W_c2(j+1) the weight matrix between the hidden layer and the output layer at iteration j+1, W_c2(j) the same matrix at iteration j, α_c > 0 the learning rate of the critic network, and j the inner-loop count of weight updates in the critic network;

The actor network takes the system state as input to approximate the control strategy; an adaptive linear neuron network with a three-layer structure is chosen for the actor network;

where W_a1 denotes the weight matrix between the input layer and the hidden layer, and W_a2 the weight matrix between the hidden layer and the output layer;

The prediction error of the actor network is defined as:

To minimize the objective function, the squared error of the actor network is defined as:

where E_a(t,k) is the squared error of the actor network;

The gradient descent algorithm is used to update the weight matrices in the actor network;

where W_a1(j+1) denotes the weight matrix between the input layer and the hidden layer of the actor network at iteration j+1, W_a1(j) the same matrix at iteration j, W_a2(j+1) the weight matrix between the hidden layer and the output layer at iteration j+1, W_a2(j) the same matrix at iteration j, and α_a > 0 and j are the learning rate and inner-loop iteration count of the actor network, respectively.
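The critic and actor updates of Step 6 can be sketched as one interleaved sweep. The network sizes, the tanh activations, the finite-difference gradient dQ/du (the patent uses analytic gradients through the critic), and the first-iteration target built from Q₀ ≡ 0 are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

n_x, n_h = 2, 6
Wc1 = rng.uniform(-0.1, 0.1, (n_h, n_x + 1))  # critic: input [x; u] -> hidden
Wc2 = rng.uniform(-0.1, 0.1, (1, n_h))        # critic: hidden -> Q-value
Wa1 = rng.uniform(-0.1, 0.1, (n_h, n_x))      # actor: input x -> hidden
Wa2 = rng.uniform(-0.1, 0.1, (1, n_h))        # actor: hidden -> u

def critic(x, u):
    h = np.tanh(Wc1 @ np.concatenate([x, u]))
    return float(Wc2 @ h), h

def actor(x):
    h = np.tanh(Wa1 @ x)
    return (Wa2 @ h).ravel(), h

def critic_step(x, u, target, lr=0.02):
    """Gradient step on Ec = 0.5 * ec^2 (critic weight update)."""
    global Wc1, Wc2
    q, h = critic(x, u)
    e = q - target                                      # ec(t,k)
    Wc2 -= lr * e * h[None, :]
    Wc1 -= lr * e * np.outer(Wc2.ravel() * (1 - h**2), np.concatenate([x, u]))

def actor_step(x, lr=0.02, eps=1e-4):
    """Descend the critic's Q-value along dQ/du (finite-difference sketch)."""
    global Wa1, Wa2
    u, ha = actor(x)
    qp, _ = critic(x, u + eps)
    qm, _ = critic(x, u - eps)
    dq_du = (qp - qm) / (2 * eps)
    Wa2 -= lr * dq_du * ha[None, :]
    Wa1 -= lr * dq_du * np.outer(Wa2.ravel() * (1 - ha**2), x)

for _ in range(500):
    x = rng.uniform(-1, 1, n_x)
    u, _ = actor(x)
    target = float(x @ x) + float(u @ u)   # first-iteration target with Q0 = 0
    critic_step(x, u, target)
    actor_step(x)
```

In the full method, the target would instead be bootstrapped from the previous Q-iterate, and the critic/actor sweeps alternate until the Q-function converges.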

Therefore, the final control strategy can be obtained as:

where the corresponding result is:

In summary,

The steps for implementing the two-dimensional off-policy interleaved Q-learning optimal tracking control method are as follows:

1. Collect the system data x(t,k) under the behavior control strategy u(t,k) and store it.

2. Initialize the weights W_m1 and W_m2 of the neural network.

3. Train the model network:

(1) Using the measured data, train the weights W_m1 and W_m2 until e_m(t,k) ≤ ψ_m(t,k), where ψ_m(t,k) is a small positive number.

(2) Obtain the trained weights W_m1(j+1) and W_m2(j+1), and set k = 0.

4. Interleaved iteration:

(1) Set the initial value of the Q-function, Q₀(·) = 0; with iteration index i = 0, compute the control strategy u₀(t,k).

(2) Compute the required quantities and u(t,k−1) from formulas (19) and (20), and update the weights W_c1 and W_c2 once using formula (23), thereby training the critic network.

(3) Obtain the approximation from formula (24), and update the weights W_a1 and W_a2 once using the adaptive gradient descent algorithm to train the actor network.

(4) If the stopping condition with ψ_Q > 0 is not satisfied, return to step 4(2); if it is satisfied, the final control strategy is obtained.

5. Set t = t + 1 and return to step 4 to continue.

The advantages and effects of the present invention are as follows:

Aimed at the difficulty of modeling batch processes with unknown system dynamics and at complex characteristics such as unknown initial system parameters, the present invention proposes a two-dimensional off-policy interleaved Q-learning optimal tracking control method for batch processes with unknown system dynamics. The invention solves for the optimal control strategy using only data measured along the batch and time directions of the injection molding process, unlike previous approaches that model such batch processes precisely and unlike the traditional one-dimensional paradigm; within the two-dimensional framework, the two-dimensional off-policy interleaved Q-learning algorithm realizes optimal tracking control of the batch process. While establishing the state-space model, the actual output tracking error and the incremental deviation are introduced, providing a multi-degree-of-freedom design method for the subsequent controller and guaranteeing its tracking performance. No precise modeling process is required, which greatly reduces the system's model dependence and the computational cost. When solving for the controller gain, the interconnection between the time direction and the batch direction is taken into account, yielding a more accurate control strategy while accelerating convergence and enhancing optimality. Moreover, the control algorithm obtains the optimal value without requiring an initially admissible control strategy, speeding up industrial production and improving its efficiency.

Brief Description of the Drawings

Figure 1 shows the weight-training results of the model network;

Figure 2 shows the weight-testing results of the model network;

Figure 3 shows the weight results of the critic network over different batches;

Figure 4 shows the weight results of the actor network over different batches;

Figure 5 shows the output tracking curves under the two-dimensional off-policy interleaved Q-learning algorithm;

Figure 6 shows the control input curves under the two-dimensional off-policy interleaved Q-learning algorithm;

Detailed Description of the Embodiments

To further illustrate the present invention, it is described in detail below with reference to the accompanying drawings and an example, which should not be construed as limiting the protection scope of the invention.

Embodiment 1:

In the injection molding process, the injection stage is the first step of the entire molding cycle, timed from the moment the mold closes and injection begins until the mold cavity is roughly 95% filled, generally about 1–5 seconds. Although the injection time is short, it is one of the important factors affecting product quality. The more reasonable the control of the injection time, the better the melt fills the mold cavity, which is of great significance for improving product quality and reducing process error. This embodiment therefore studies approximately optimal tracking control of the injection velocity during the injection stage of injection molding based on the two-dimensional off-policy interleaved Q-learning algorithm.

After collecting a large amount of experimental data, the open-loop dynamics between injection velocity (IV) and valve opening (VO) in the actual injection molding process can be described as follows:

where IV(t+1,k) is the injection velocity of batch k at time t+1; IV(t,k) that at time t; IV(t−1,k) that at time t−1; VO(t,k) is the valve opening of batch k at time t; and VO(t−1,k) that at time t−1.

Define:

Then, according to formulas (29) and (30), the state-space model of the injection stage can be established:

where ρ(t,k) = 0.001x₁(t,k) + 0.6.

After repeated experiments, the parameters of the injection stage were determined as Q₁ = Q₂ = diag[6,6] and R = 1. However, in an actual batch production process the initial parameters of the system are also difficult to obtain. Therefore, in the following study, a two-dimensional off-policy interleaved Q-learning algorithm is proposed to design the optimal controller for the batch process.

First, a model network with 7 input-layer nodes, 20 hidden-layer nodes, and 6 output-layer nodes is established. Its initial weights are chosen arbitrarily between −0.1 and 0.1, and its learning rate is set. Next, 1000 sets of data are selected to train the model network; after training, the weights of this network are held fixed. Second, for the critic network, a neural network with 6 input-layer nodes, 10 hidden-layer nodes, and 1 output-layer node is selected; its learning rate is set, and its initial weights are likewise chosen arbitrarily between −0.1 and 0.1. Finally, an actor network with 6 input-layer nodes, 10 hidden-layer nodes, and 1 output-layer node is designed; its learning rate is set, and its initial weights are chosen arbitrarily between −0.1 and 0.1.

Figures 1 and 2 show the weight-training and weight-testing results of the model network, respectively. In simulating the model network, the training-sample ratio was set to 0.8; that is, Figure 1 shows the training results obtained from the first 80% of the weight samples, and Figure 2 shows the results of testing the trained weights of Figure 1 on the remaining 20% of the samples. As both figures show, for the 100 randomly selected sets of weight data, the predicted and measured weight curves essentially coincide in both the training and testing plots, achieving good results.

Figures 3 and 4 show the weight curves between the hidden layer and the output layer of the critic network and the actor network, respectively, for different batches. To demonstrate the control effect, the 4th, 7th, 10th, and 20th batches are selected as representatives. As the batch number increases, the training effect of each weight curve improves, and by the 10th batch each weight curve has already reached its optimal value, showing that the proposed algorithm has a very good optimization effect. In subsequent batches the weights also reach their optimal values, demonstrating the effectiveness and feasibility of the algorithm.

Figures 5 and 6 show the batch-process output and control input curves, with the output setpoint of the two-dimensional system equal to 40 mm/s. As Figure 5 shows, at the 4th batch the output of the batch process is far from the setpoint required by industrial production, but as learning proceeds the error between the output and the setpoint becomes smaller and smaller. This indicates that the tracking performance of the system keeps improving, as does the effect of the weight training. After the 10th batch the system output reaches the setpoint, showing that the control algorithm is feasible and effective.

In summary, the present invention takes the control design of a batch process with unknown system dynamics as an example to verify the effectiveness and feasibility of the proposed control method. The invention solves for the optimal control strategy using only the data measured along the batch and time directions of the injection molding process, unlike previous approaches that model such batch processes precisely. Unlike the traditional one-dimensional paradigm, no precise modeling process is required, which greatly reduces the system's model dependence and the computational cost. When solving for the controller gain, the interconnection between the time direction and the batch direction is taken into account, yielding a more accurate control strategy while accelerating convergence and enhancing optimality; moreover, the control algorithm obtains the optimal value without requiring an initially admissible control strategy, speeding up industrial production and improving its efficiency. The simulation results for the injection stage show that as the batch number increases, the final control strategy converges to the optimal value, and the control and tracking performance gradually improve. The proposed method thus provides a new design scheme for the control of batch processes with unknown system dynamics, reducing system modeling and computation costs while guaranteeing tracking-control performance.

The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (1)

1. The batch process two-dimensional off-orbit strategy staggered Q learning optimal tracking control method with unknown system dynamics is characterized by comprising the following specific steps:
Step one: establishing a batch process nonlinear state space equation with unknown dynamics;
The batch process with unknown system dynamics is represented by a nonlinear state space equation, which takes the following form:
y(t+1,k) = f(y(t,k)) + g(y(t,k))u(t,k) (1)
Where t represents the time direction, k represents the batch direction, y(t,k) ∈ R^n represents the system state, u(t,k) ∈ R^m represents the system control input, f(y(t,k)) ∈ R^n represents the f function of y(t,k), g(y(t,k)) ∈ R^(n×m) represents the g function of y(t,k), R represents the real matrix space, and n and m represent the appropriate dimensions;
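The control-affine structure of equation (1) can be illustrated with a minimal scalar sketch; the drift f, input gain g, and feedback law below are hypothetical stand-ins, not the patent's injection-molding plant:

```python
import math

# Illustrative scalar instance of the control-affine dynamics of equation (1):
# y(t+1, k) = f(y(t, k)) + g(y(t, k)) * u(t, k).
# f and g below are hypothetical choices for demonstration only.
def f(y):
    return 0.9 * y + 0.1 * math.sin(y)   # hypothetical drift term

def g(y):
    return 1.0 + 0.05 * math.cos(y)      # hypothetical input gain

def step(y, u):
    """One transition along the time direction t within a batch k."""
    return f(y) + g(y) * u

def simulate_batch(y0, T, feedback):
    """Roll one batch forward for T time steps under a state-feedback law."""
    traj = [y0]
    for _ in range(T):
        traj.append(step(traj[-1], feedback(traj[-1])))
    return traj

traj = simulate_batch(1.0, 20, lambda y: -0.5 * y)
```

Each batch restarts from its initial state, while the learning described in the later steps carries information across the batch direction k.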
Step two: the tracking error is used as a state variable to be expanded into a performance index, and the performance index of the two-dimensional nonlinear system is constructed;
The performance index of the two-dimensional nonlinear system is defined as follows:
Wherein T is the prediction horizon, y r (t, k) represents the desired state of the system, u (t, k-1) represents the control input at time t of batch k-1 of the system, and R represents a weighting matrix of corresponding dimension for the control input;
Let Q 1 = C^T Q C, where Q 1 is the weighting matrix of corresponding dimension for the extended state, and I represents an identity matrix of appropriate dimension; the performance index of the two-dimensional nonlinear system can then be redefined as:
Meanwhile, the extended state of the system at time t+1 of batch k can be expressed as:
Wherein y r (t+1, k) = θ y r (t, k), θ represents a matrix of appropriate dimension relating the desired setpoints of the system, and 0 represents a zero matrix of appropriate dimension;
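As a numerical illustration of the quadratic performance index of step two, the sketch below sums a tracking-error cost and a weighted control-effort cost over one batch in scalar form; Q1, R, and the sample trajectory are illustrative values, with the 40 mm/s setpoint taken from the simulation section:

```python
# Scalar illustration of the two-dimensional performance index of step two:
# each stage adds an extended-state (tracking-error) cost plus a weighted
# control-effort cost. Q1, R, and the trajectory values are illustrative.
Q1, R = 1.0, 0.1
y_ref = 40.0                      # setpoint, as in the simulation (40 mm/s)

def stage_cost(y, u):
    e = y_ref - y                 # tracking error used as the state variable
    return Q1 * e * e + R * u * u

def performance_index(ys, us):
    """Sum of stage costs over one batch (prediction horizon T = len(us))."""
    return sum(stage_cost(y, u) for y, u in zip(ys, us))

J = performance_index([38.0, 39.5, 40.0], [1.0, 0.5, 0.0])
```

As the output approaches the setpoint and the control effort shrinks, the later stage costs vanish, which is the behavior the optimal strategy minimizes.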
Step three: according to the relation between the performance index and the value function, defining an expression of a two-dimensional optimal value function and a Q function;
According to formula (3), a two-dimensional optimal value function and a two-dimensional optimal Q function are defined as follows:
Wherein,
The optimal control strategy produces the optimal value function by minimizing the right-hand side of equation (5); based on the optimality condition, the optimal control strategy u* (t, k) can be obtained by differentiating with respect to u (t, k); thus,
When the control strategy u (t, k) reaches the optimal value u * (t, k), the optimal value function is equivalent to the optimal Q function, i.e.:
Step four: introducing an interleaving Q iteration algorithm;
A Q-iteration algorithm is designed; according to formula (8), the optimal Q function can also be expressed in the following form:
The optimal control strategy can be described as:
To find the optimal solution in equations (9) and (10), first, an initial value Q 0 (·) =0 is set, and then the control strategy u 0 (t, k) is expressed as:
while the Q function may be updated as:
For i = 1, 2, …, the control strategy u i (t, k) and the Q function are updated until the iteration is completed;
And
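A scalar linear-quadratic sketch of this iteration, assuming tracking-error dynamics e(t+1) = a·e + b·u with illustrative a, b and stage cost q·e² + r·u²: starting from Q₀(·) = 0, alternately minimizing over u and updating the Q function reduces to the recursion below, and no initial admissible control strategy is needed:

```python
# Scalar linear-quadratic sketch of the Q-iteration of step four.
# Assumed error dynamics e(t+1) = a*e + b*u with stage cost q*e^2 + r*u^2;
# a, b, q, r are illustrative values, not the patent's plant.
a, b, q, r = 0.9, 1.0, 1.0, 1.0

p = 0.0                          # value-function coefficient: V_i(e) = p_i * e^2
for i in range(200):             # Q_0(.) = 0, so no admissible initial policy
    # minimising Q_i over u and substituting back gives the Riccati recursion
    p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)

gain = (a * b * p) / (r + b * b * p)     # converged policy u(e) = -gain * e
residual = abs(p - (q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)))
```

The iterates increase monotonically from zero toward the fixed point of the recursion, mirroring the claim that the iteration converges without an initial admissible policy.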
Step five: constructing a model network to obtain initial weights of the neural network;
In the practical application of the present invention, the system dynamics term is uncertain and it is difficult to obtain an accurate value; therefore, a neural network is used to approximate the dynamics of the system (4) and to calculate the required result;
For the model network, a three-layer neural network is used to identify the nonlinear system, where the input layer, the hidden layer, and the output layer contain the respective numbers of neuron nodes; let the weight matrix between the input layer and the hidden layer be W m1, and the weight matrix between the hidden layer and the output layer be W m2;
Assuming the extended state of the nonlinear system and the control input u (t, k) of the system are known, the output of the model network can be expressed as:
Wherein,
The error function for training the model network is defined as:
The objective function to be minimized is:
Wherein E m (t, k) is the squared error of the model network;
The weights are updated using a gradient-based adaptive method, namely:
Wherein W m1 (j+1) represents the weight matrix between the model network input layer and the hidden layer at iteration j+1, W m1 (j) represents the same matrix at iteration j, W m2 (j+1) represents the weight matrix between the model network hidden layer and the output layer at iteration j+1, W m2 (j) represents the same matrix at iteration j, alpha m > 0 represents the learning rate of the model network, and j represents the number of internal loops in the model network;
Once training is completed, the weights of the model network remain unchanged;
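A minimal sketch of this model-network training, assuming a one-dimensional toy plant: a three-layer network with one tanh hidden layer is trained by the gradient rule W ← W − α·∂E/∂W on the squared identification error. The network sizes, learning rate, and target map are illustrative, not the patent's values:

```python
import math, random

# Sketch of the model network of step five: a three-layer network (tanh
# hidden layer) trained by gradient descent on the squared model error to
# identify a hypothetical one-dimensional map. All numbers are illustrative.
random.seed(0)
H = 8                                             # hidden-layer neuron count
Wm1 = [random.uniform(-1, 1) for _ in range(H)]   # input -> hidden weights
Wm2 = [random.uniform(-1, 1) for _ in range(H)]   # hidden -> output weights
alpha_m = 0.02                                    # model-network learning rate

def forward(x):
    h = [math.tanh(w * x) for w in Wm1]
    return sum(w2 * hj for w2, hj in zip(Wm2, h)), h

def train_step(x, target):
    """One gradient-based update of both weight matrices."""
    yhat, h = forward(x)
    err = yhat - target                  # model-network prediction error
    for j in range(H):
        dh = (1 - h[j] ** 2) * x         # d tanh(Wm1[j]*x) / d Wm1[j]
        Wm1[j] -= alpha_m * err * Wm2[j] * dh
        Wm2[j] -= alpha_m * err * h[j]
    return 0.5 * err * err               # E_m, the squared error

plant = lambda x: 0.9 * x + 0.1 * math.sin(x)     # hypothetical unknown map
data = [random.uniform(-2, 2) for _ in range(200)]
losses = [sum(train_step(x, plant(x)) for x in data) for _ in range(50)]
```

Once the identification error settles, the model-network weights are frozen, matching the statement that the weights remain unchanged after training.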
Step six: constructing an evaluation network and a behavior network to obtain a final control strategy;
The evaluation network is used to approximate the Q function; the output of the three-layer neural network used for the evaluation network is expressed as follows:
Wherein, the weight matrix between the input layer and the hidden layer is W c1, and the weight matrix between the hidden layer and the output layer is W c2;
Based on equation (14), when the Q function iterates to the (i+1)-th step, the expected value function can be obtained by:
The prediction error is defined as:
The objective function to be minimized is:
Wherein E c(i+1) (t, k) is the squared error of the evaluation network at iteration i+1;
Based on the gradient-based adaptive rule, the weight matrix may be updated as follows:
In the weight matrix update, W c1 (j+1) represents the weight matrix between the input layer and the hidden layer of the evaluation network at iteration j+1, W c1 (j) represents the same matrix at iteration j, W c2 (j+1) represents the weight matrix between the hidden layer and the output layer of the evaluation network at iteration j+1, W c2 (j) represents the same matrix at iteration j, alpha c > 0 is the learning rate of the evaluation network, and j is the number of internal loops of the weight update in the evaluation network;
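A scalar stand-in for this evaluation-network update, assuming a Q function that is linear in its weights, Q(e,u) = w1·e² + 2·w2·e·u + w3·u²: each outer Q-iteration fits the next Q function by gradient descent on the prediction error against a frozen bootstrapped target. The plant parameters, learning rate, and iteration counts are illustrative, and the quadratic parameterization replaces the patent's three-layer network:

```python
import random

# Scalar stand-in for the critic update of step six: linear-in-weights
# quadratic Q, frozen-target inner fit per outer Q-iteration.
# a, b, q, r, alpha_c, and the loop counts are illustrative choices.
random.seed(1)
a, b, q, r = 0.9, 1.0, 1.0, 1.0
alpha_c = 0.005
w = [0.0, 0.0, 0.0]                      # Q_0(.) = 0

def Q(weights, e, u):
    return weights[0] * e * e + 2 * weights[1] * e * u + weights[2] * u * u

for i in range(30):                      # outer Q-iterations
    w_old = list(w)                      # freeze Q_i as the target
    def u_star(e):                       # greedy action of the frozen Q_i
        return -w_old[1] * e / w_old[2] if w_old[2] > 1e-9 else 0.0
    for _ in range(4000):                # inner gradient fit of Q_{i+1}
        e = random.uniform(-2, 2)
        u = random.uniform(-2, 2)
        e_next = a * e + b * u           # sampled error dynamics
        target = q * e * e + r * u * u + Q(w_old, e_next, u_star(e_next))
        err = Q(w, e, u) - target        # critic prediction error
        w[0] -= alpha_c * err * e * e    # gradient-descent weight updates
        w[1] -= alpha_c * err * 2 * e * u
        w[2] -= alpha_c * err * u * u
```

Because each inner target is exactly representable by the quadratic features, the inner fit converges and the outer loop reproduces the Riccati-style recursion of step four.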
The execution network takes the system state as input to approximate the control strategy; an adaptive linear neuron network with a three-layer structure is selected for the execution network;
Wherein W a1 represents the weight matrix of the input layer and the hidden layer, and W a2 represents the weight matrix of the hidden layer and the output layer;
The prediction error of the behavior network is defined as:
To minimize the objective function, the squared error of the behavior network is defined as:
Wherein E a (t, k) is the squared error of the behavior network;
A gradient descent algorithm is adopted to update the weight matrices in the behavior network;
Wherein W a1 (j+1) represents the weight matrix between the behavior network input layer and the hidden layer at iteration j+1, W a1 (j) represents the same matrix at iteration j, W a2 (j+1) represents the weight matrix between the behavior network hidden layer and the output layer at iteration j+1, W a2 (j) represents the same matrix at iteration j, and alpha a > 0 and j are respectively the learning rate and the number of internal loops of the behavior network;
thus, a final control strategy may be obtained as:
Wherein, the result of (2) is:
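As a minimal stand-in for the behavior-network update above, the sketch below trains a one-parameter linear actor u(e) = −K·e by gradient descent on the squared difference between its output and the greedy action of a fixed quadratic Q function; the critic weights w1, w2, w3 are assumed converged values and purely illustrative:

```python
import random

# Scalar stand-in for the behavior (action) network of step six: a
# one-parameter actor is fitted to the greedy action of a given quadratic
# Q(e, u) = w1*e^2 + 2*w2*e*u + w3*u^2. All numbers are illustrative.
random.seed(2)
w1, w2, w3 = 2.202, 1.336, 2.484         # assumed converged critic weights
greedy = lambda e: -(w2 / w3) * e        # argmin over u of the quadratic Q

K = 0.0                                  # actor parameter, u_a(e) = -K*e
alpha_a = 0.05                           # behavior-network learning rate
for _ in range(2000):
    e = random.uniform(-2.0, 2.0)
    err = (-K * e) - greedy(e)           # behavior-network prediction error
    K += alpha_a * err * e               # descent on 0.5*err^2 (d err/dK = -e)
```

The actor parameter converges to the greedy gain w2/w3, so the trained behavior network reproduces the minimizing control strategy of the critic.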
CN202210973113.4A 2022-08-15 2022-08-15 Batch process two-dimensional off-orbit strategy staggered Q learning optimal tracking control method with unknown system dynamics Active CN115167150B (en)


Publications (2)

Publication Number Publication Date
CN115167150A CN115167150A (en) 2022-10-11
CN115167150B true CN115167150B (en) 2024-07-05





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant