Parallel probability variation soft measurement modeling method for streaming big data
Technical Field
The invention belongs to the field of industrial process control and soft measurement, and relates to a parallel probability variation soft measurement modeling method for streaming big data.
Background
In industrial processes, soft measurement models are widely used to predict key process variables that are difficult to measure online due to harsh measurement environments, expensive measurement instruments, and large time lags. In recent years, data-driven soft measurement modeling, which builds models from data collected during process operation without relying on process mechanism knowledge, has been greatly favored by researchers and practitioners. Compared with mechanism modeling, the data-driven approach reflects the actual running state of the process more faithfully, and the established model is more reliable. However, once a soft measurement model is put into practical use, its prediction performance tends to degrade gradually owing to factors such as process state changes, catalyst deactivation, raw material variation, and instrument drift. The soft measurement model therefore needs to be updated adaptively and continuously, so that it can track process changes, ensure timely and effective monitoring of the process data, and allow the control strategy to be adjusted in time. Researchers have proposed different update strategies for this purpose, such as recursive models, sliding window models, and just-in-time learning models.
With the development of information technology, industrial processes have also entered the big data era. Sensor data volumes are growing sharply, data are generated ever faster, and the state information carried by the data changes dynamically over time, forming a new data form: streaming big data. Streaming big data is an ordered sequence of real-time, continuous, dynamically changing data. When analyzing streaming big data, the full data stream cannot be stored; informally, if memory is compared to a reservoir, batch big data is the water already in the reservoir, while streaming big data is water flowing into it at a rate too large for the memory to hold. Streaming computing therefore no longer stores the stream, but analyzes incoming data directly in memory in real time. Traditional adaptive soft measurement methods for industrial processes require part or all of the historical data to be stored before updating, which severely limits them under the real-time characteristics of actual streaming big data. For the industrial streaming big data soft measurement scenario, the present method combines a probabilistic variational supervised factor analysis approach to alleviate the overfitting problem, and introduces a parallel computing strategy to improve model updating efficiency.
Disclosure of Invention
Aiming at the current industrial big data scenario, the invention provides a parallel probabilistic variational soft measurement modeling method for streaming big data. It combines streaming variational inference with supervised factor analysis, introduces the symmetric relative entropy to decide the prior selection, and further applies a parallel computing strategy to improve model updating efficiency, thereby realizing adaptive soft measurement of industrial streaming big data.
Aiming at the problems in the prior art, the specific technical scheme of the invention is as follows: a parallel probability variation soft measurement modeling method for streaming big data comprises the following steps:
(1) initializing the prior hyper-parameters a, b and ρ and the variational hyper-parameters λ and τ, and collecting training data F_nm = [X, Y]^T from the historical industrial process, where F_nm ∈ R^(N×M), N denotes the number of samples, M the number of process variables, and R the set of real numbers;
(2) dividing the training data F into Z data blocks, and calculating variation hyper-parameters lambda and tau according to the following formula:
wherein <t_n> and Σ_t denote the mean and the variance of the latent variable t, respectively, τ_m denotes the variance of the noise, <W_m> the expectation of the loading matrix W_m, <μ_m> the expectation of the mean μ_m, F_mn the training data, and I the identity matrix; L is the portion of this update computed in parallel.
wherein <W_m> and Σ_W denote the mean and the variance of the loading matrix W_m, respectively, <t_n> denotes the expectation of the latent variable t, and diag<α> denotes the diagonal matrix formed from α; S is the portion of this update computed in parallel.
wherein â and b̂ denote the updated parameters of α;
wherein <μ_m> and Σ_μ denote the mean and the variance of μ_m, respectively; H is the portion of this update computed in parallel.
wherein Q is the portion of the τ_m update computed in parallel.
(3) integrating the variational hyper-parameters λ and τ over the Z data blocks of step (2), and continuing to update the parameters until the variational lower bound converges or the number of iterations reaches its maximum, thereby obtaining the posterior distribution q(Θ); the bound is given by the following formula:
wherein E_q(Θ) denotes the expectation with respect to q(Θ), ln p(F, Θ) denotes the log-likelihood of the joint probability distribution, and ln q(Θ) denotes the log of the variational probability distribution;
(4) when a new process variable X_new arrives, its latent factor can be obtained by the following formula:
wherein λ_t denotes the expectation of the latent factor, τ_x denotes the variance of the noise on x, <W_x> the expectation of the loading matrix on x, and <μ_x> the expectation of the mean on x;
the soft measurement prediction result is then:
wherein <W_y> denotes the expectation of the loading matrix on y, and <μ_y> the expectation of the mean on y;
(5) when the output of the quality variable Y_new is obtained, new training data F_new = (X_new, Y_new) are formed; the posterior distribution q(Θ) obtained in step (3) is taken as the current prior distribution, the new training data are again divided into Z data blocks, and L, S, Q and H are computed through the following parallel strategy to update the parameters t, W, μ and τ, where the update formulas of the parameters W and μ become:
here, λ* denotes the posterior distribution computed from the new training data F_new.
(6) integrating the parameters obtained from the Z data blocks and continuing to update them until the variational lower bound under the updating mode converges or the number of iterations reaches its maximum; the bound is:
(7) the symmetric relative entropy between the old and new distributions is calculated by:
when the result KL(old, new) is smaller than the set threshold SKL_ts, the parameters are updated through steps (5) and (6); otherwise, the prior of the parameter λ is re-initialized;
(8) repeating steps (4) to (7) whenever a new data set is obtained, thereby realizing adaptive parallel soft measurement.
Compared with the prior art, the invention has the following beneficial effects. On the basis of the original variational supervised factor analysis model, a streaming updating method and the symmetric relative entropy are introduced: the posterior distribution of the model parameters is updated in real time as the actual streaming big data change, and the selection of the prior distribution is decided accordingly, so that adaptive updating of the model is realized, and the updating efficiency is further improved by the parallel computing strategy. Compared with other traditional methods, for streaming big data scenarios with high real-time requirements the method improves model performance to a certain extent and, through parallel computing, improves the model updating efficiency, thereby achieving adaptive soft measurement oriented to streaming big data.
Drawings
FIG. 1 is a flow chart of a parallel probabilistic variational soft measurement model;
FIG. 2 is a graph of the predicted output of the just-in-time learning probabilistic variational soft measurement model;
FIG. 3 is a graph of the predicted output of the sliding window probabilistic variational soft measurement model;
FIG. 4 is a graph of the predicted output of the parallel probabilistic variational soft measurement model.
Detailed Description
The parallel probability variation soft measurement modeling method for streaming big data of the invention is further detailed below in conjunction with a specific embodiment.
A parallel probability variation soft measurement modeling method for streaming big data is disclosed, wherein the flow of the whole parallel probabilistic variational soft measurement model is shown in FIG. 1, and the steps are as follows:
(1) initializing the prior hyper-parameters a, b and ρ and the variational hyper-parameters λ and τ, and collecting training data F_nm = [X, Y]^T from the historical industrial process, where F_nm ∈ R^(N×M), N denotes the number of samples, M the number of process variables, and R the set of real numbers;
(2) dividing the training data F into Z data blocks, and calculating variation hyper-parameters lambda and tau according to the following formula:
wherein <t_n> and Σ_t denote the mean and the variance of the latent variable t, respectively, τ_m denotes the variance of the noise, <W_m> the expectation of the loading matrix W_m, <μ_m> the expectation of the mean μ_m, F_mn the training data, and I the identity matrix; L is the portion of this update computed in parallel.
wherein <W_m> and Σ_W denote the mean and the variance of the loading matrix W_m, respectively, <t_n> denotes the expectation of the latent variable t, and diag<α> denotes the diagonal matrix formed from α; S is the portion of this update computed in parallel.
wherein â and b̂ denote the updated parameters of α;
wherein <μ_m> and Σ_μ denote the mean and the variance of μ_m, respectively; H is the portion of this update computed in parallel.
wherein Q is the portion of the τ_m update computed in parallel.
Because the parameter calculation parts in L, S, H and Q are mutually independent, the computational efficiency of the model can be further improved by a parallel strategy; moreover, the results of the parallel and non-parallel computations are theoretically identical. The parallel computing strategy is therefore adopted here to further improve the updating efficiency of the model. It is particularly effective for streaming big data scenarios with high real-time requirements and improves the model performance to a certain extent.
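As a concrete illustration of this independence, a minimal Python sketch is given below: a generic per-block sufficient statistic (here the sum of x_n x_n^T over a block, standing in for one block's contribution to L, S, H or Q) is computed for each of the Z blocks in a thread pool and the results are summed, giving exactly the same value as the serial computation. The statistic, block count, and function names are illustrative, not the invention's actual update terms.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_statistic(block):
    # per-block contribution (placeholder for one block's share of L/S/H/Q)
    return block.T @ block

def parallel_statistic(data, z_blocks, workers=4):
    # split the data into Z blocks, evaluate each block independently in
    # parallel, then integrate (sum) the per-block results
    blocks = np.array_split(data, z_blocks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(block_statistic, blocks))
    return sum(parts)

rng = np.random.default_rng(0)
F = rng.standard_normal((1000, 5))
# parallel and serial results agree, as the text asserts
assert np.allclose(parallel_statistic(F, 8), F.T @ F)
```

Only the block-wise evaluation is parallelized; the integration step that follows is a cheap elementwise sum, which matches the integrate-then-update structure of steps (3) and (6).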
(3) integrating the variational hyper-parameters λ and τ over the Z data blocks of step (2), and continuing to update the parameters until the variational lower bound converges or the number of iterations reaches its maximum, thereby obtaining the posterior distribution q(Θ); the bound is given by the following formula:
wherein E_q(Θ) denotes the expectation with respect to q(Θ), ln p(F, Θ) denotes the log-likelihood of the joint probability distribution, and ln q(Θ) denotes the log of the variational probability distribution;
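The iterate-until-converged rule of step (3) can be sketched generically as follows; `update_fn` and `elbo_fn` are placeholders standing in for the variational parameter updates and the bound E_q[ln p(F, Θ)] − E_q[ln q(Θ)], and the toy contraction used in the check is purely illustrative.

```python
def run_until_converged(update_fn, elbo_fn, state, tol=1e-6, max_iter=200):
    # iterate the variational updates until the bound stops improving
    # (change below tol) or the iteration limit is reached
    prev = -float("inf")
    for _ in range(max_iter):
        state = update_fn(state)
        bound = elbo_fn(state)
        if abs(bound - prev) < tol:
            break
        prev = bound
    return state

# toy check: a halving update with bound -|x| converges toward 0
final = run_until_converged(lambda x: x / 2, lambda x: -abs(x), 1.0)
assert abs(final) < 1e-3
```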
(4) when a new process variable X_new arrives, its latent factor can be obtained by the following formula:
wherein λ_t denotes the expectation of the latent factor, τ_x denotes the variance of the noise on x, <W_x> the expectation of the loading matrix on x, and <μ_x> the expectation of the mean on x;
the soft measurement prediction result is then:
wherein <W_y> denotes the expectation of the loading matrix on y, and <μ_y> the expectation of the mean on y;
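Since the exact expressions appear only as figures in the source, the sketch below uses the standard supervised factor-analysis forms as a stand-in: the latent posterior covariance Σ_t and mean λ_t follow the usual Gaussian identities, and the prediction is <W_y> λ_t + <μ_y>. These formulas are an assumption, not the invention's verbatim equations.

```python
import numpy as np

def predict(x_new, W_x, mu_x, W_y, mu_y, tau_x):
    # assumed standard FA posterior over the latent factor t given x_new
    k = W_x.shape[1]
    sigma_t = np.linalg.inv(np.eye(k) + tau_x * W_x.T @ W_x)
    lambda_t = tau_x * sigma_t @ W_x.T @ (x_new - mu_x)
    # soft-measurement prediction: <W_y> lambda_t + <mu_y>
    return W_y @ lambda_t + mu_y
```

With identity loadings, zero means and unit noise precision, the posterior shrinks the observation by half, which gives a quick sanity check on the algebra.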
(5) when the output of the quality variable Y_new is obtained, new training data F_new = (X_new, Y_new) are formed; the posterior distribution q(Θ) obtained in step (3) is taken as the current prior distribution, the new training data are again divided into Z data blocks, and L, S, Q and H are computed through the following parallel strategy to update the parameters t, W, μ and τ, where the update formulas of the parameters W and μ become:
here, λ* denotes the posterior distribution computed from the new training data F_new.
(6) integrating the parameters obtained from the Z data blocks and continuing to update them until the variational lower bound under the updating mode converges or the number of iterations reaches its maximum; the bound is:
(7) the symmetric relative entropy between the old and new distributions is calculated by:
when the result KL(old, new) is smaller than the set threshold SKL_ts, the parameters are updated through steps (5) and (6); otherwise, the prior of the parameter λ is re-initialized. The main purpose of this is to update the posterior distribution of the model parameters in real time according to the changes in the actual streaming big data and to decide the selection of the prior distribution, thereby realizing adaptive updating of the model.
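For reference, a minimal sketch of the symmetric relative entropy KL(old, new) + KL(new, old) is shown below for diagonal-Gaussian posteriors, using the standard closed form of the Gaussian KL divergence; treating the old and new parameter posteriors as diagonal Gaussians is an illustrative assumption.

```python
import numpy as np

def kl_gauss(m0, v0, m1, v1):
    # KL(N(m0, v0) || N(m1, v1)), summed over independent dimensions
    return 0.5 * np.sum(np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def symmetric_kl(m0, v0, m1, v1):
    # symmetric relative entropy between the old and new distributions
    return kl_gauss(m0, v0, m1, v1) + kl_gauss(m1, v1, m0, v0)

# identical old and new distributions give zero divergence
assert symmetric_kl(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3)) == 0.0
```

The value is then compared against the threshold SKL_ts to decide between the normal update of steps (5)-(6) and re-initializing the prior of λ.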
(8) repeating steps (4) to (7) whenever a new training set is obtained, thereby realizing adaptive parallel soft measurement.
Furthermore, the root mean square error (RMSE) is used to quantitatively evaluate the prediction performance:

RMSE = sqrt( (1/Nt) * Σ_{i=1}^{Nt} (y_i − ŷ_i)^2 )

wherein y_i is the true value of the output variable, ŷ_i is the predicted output of the model, and Nt denotes the number of online test samples.
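The RMSE just defined can be computed directly:

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean square error over the Nt online test samples
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# errors of 0, 0 and 2 give sqrt(4/3)
assert np.isclose(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), np.sqrt(4 / 3))
```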
Examples
The performance of the parallel probabilistic variational model is described below in connection with a specific methanation unit example in a synthetic ammonia process. The main function of the methanation furnace unit is to convert CO and CO2 into methane, which is then transferred and recycled. In this unit, the goal is to minimize the CO and CO2 content in the process gas. Therefore, the first and most important task is to measure the residual CO and CO2 at the outlet of the unit as the key quality variable. Here, 10 process variables, including pressures, temperatures, flows and levels, are taken as inputs for the soft measurement modeling.
For this procedure, 95000 samples were collected at consecutive equal intervals. The first 5000 samples constitute the original training data set, and the remaining 90000 samples serve as test samples.
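This split can be reproduced with a short helper that also yields the stream of 100-sample mini-batches consumed by the method; the function name and layout are illustrative.

```python
import numpy as np

def make_stream(data, n_train=5000, batch=100):
    # first n_train samples form the original training set; the rest are
    # consumed as a stream of mini-batches of `batch` samples each
    train, rest = data[:n_train], data[n_train:]
    batches = [rest[i:i + batch] for i in range(0, len(rest), batch)]
    return train, batches

samples = np.arange(95000)
train, batches = make_stream(samples)
# 5000 initial training samples, 900 streamed mini-batches of 100
assert len(train) == 5000 and len(batches) == 900 and len(batches[0]) == 100
```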
In order to verify that the adaptive soft measurement method of the invention tracks state changes, it is compared with a just-in-time learning probabilistic variational model and a sliding window probabilistic variational model, as shown in FIGS. 2, 3 and 4, respectively. For the just-in-time learning model, the number of samples for local modeling is the same as the original training data set (5000 samples); for the sliding window model, the window size is set to 5000 and the window step to 100; for the method of the invention, each mini-batch X_new has a size of 100. In FIG. 2, although the just-in-time learning probabilistic variational soft measurement model can track the overall trend, there is a large deviation; moreover, its performance after the state transition is unstable and worse than that of the sliding window method. As can be seen from FIG. 3, the sliding window probabilistic variational soft measurement model can roughly track state changes but does not perform well after the second state switch: the fluctuation and error are large at the beginning, although the result gradually stabilizes and the subsequent predictions are good. In contrast, FIG. 4 shows that the method of the invention further improves the adaptability of the soft measurement model, with outputs closer to the true values. From FIGS. 1 to 4 it can be seen that the prediction error of the parallel probabilistic variational soft measurement model of the invention is smaller and the tracking effect is better. Further, Table 1 gives the detailed prediction results of the three adaptive soft measurement models: on the one hand, in terms of RMSE the method of the invention has the lowest error, 0.0397; on the other hand, in terms of update time the parallel probabilistic variational soft measurement model has an absolute advantage. In summary, as can be seen from FIGS. 2 to 4 and Table 1, the parallel probabilistic variational method of the invention not only has better tracking and fitting effects, but also has the highest updating efficiency, and is more suitable for streaming big data scenarios with high real-time requirements.
TABLE 1 prediction effect and computation time of three adaptive methods