Parallel probability variation soft measurement modeling method for streaming big data
Technical Field
The invention belongs to the field of industrial process control and soft measurement, and relates to a parallel probability variation soft measurement modeling method for streaming big data.
Background
In industrial processes, soft measurement models are widely used to predict key process variables that are difficult to measure online due to harsh measurement environments, expensive measurement instruments, and large time lags. In recent years, data-driven soft measurement modeling, which builds models from data collected during process operation without relying on process mechanism knowledge, has been greatly favored by researchers and practitioners. Compared with mechanism modeling, the data-driven approach reflects the actual running state of the process more faithfully, and the established model is more reliable. However, once a soft measurement model is put into practical use, its prediction performance tends to degrade gradually owing to factors such as process state changes, catalyst deactivation, raw material variation, and instrument drift. The soft measurement model therefore needs to be updated adaptively and continuously, so that it can track process changes, ensure timely and effective monitoring of the process data, and allow the control strategy to be adjusted in time. Researchers have proposed different update strategies for this purpose, such as recursive models, sliding window models, and just-in-time learning models.
With the development of information technology, industrial processes have also entered the big data era. Sensor data volumes are growing sharply, data are generated ever faster, and the state information carried by the data changes dynamically over time, forming a new data form: streaming big data. Streaming big data is an ordered sequence of real-time, continuous, dynamically changing data. When analyzing streaming big data, the full data stream cannot be stored; informally, if memory is compared to a reservoir, batch big data is the water already in the reservoir, while streaming big data is water flowing into it at a rate too large for the memory to hold. Streaming computing therefore no longer stores the stream, but analyzes incoming data directly in memory in real time. Traditional adaptive soft measurement methods for industrial processes require part or all of the historical data to be stored before updating, which severely limits them under the real-time characteristics of actual streaming big data. For the industrial streaming big data soft measurement scenario, the present method combines a probabilistic variational supervised factor analysis approach to alleviate the overfitting problem, and introduces a parallel computing strategy to improve model updating efficiency.
Disclosure of Invention
Aiming at the current industrial big data scenario, the invention provides a parallel probabilistic variational soft measurement modeling method for streaming big data. It combines streaming variational inference with supervised factor analysis, introduces the symmetric relative entropy to decide the prior selection, and further applies a parallel computing strategy to improve model updating efficiency, thereby realizing adaptive soft measurement of industrial streaming big data.
Aiming at the problems in the prior art, the specific technical scheme of the invention is as follows: a parallel probability variation soft measurement modeling method for streaming big data comprises the following steps:
(1) initializing the prior hyper-parameters a, b and ρ and the variational hyper-parameters λ and τ, and collecting training data F_nm = [X, Y]^T from the historical industrial process, where F_nm ∈ R^(N×M), N denotes the number of samples, M the number of process variables, and R the set of real numbers;
(2) dividing the training data F into Z data blocks, and calculating variation hyper-parameters lambda and tau according to the following formula:
wherein <t_n> and Σ_t denote the mean and the variance of the latent variable t, respectively, τ_m denotes the variance of the noise, <W_m> the expectation of the loading matrix W_m, <μ_m> the expectation of the mean μ_m, F_mn the training data, and I the identity matrix; L is the portion of this update computed in parallel.
wherein <W_m> and Σ_W denote the mean and the variance of the loading matrix W_m, respectively, <t_n> denotes the expectation of the latent variable t, and diag<α> denotes the diagonal matrix formed from α; S is the portion of this update computed in parallel.
wherein â and b̂ denote the updated parameters of α;
wherein <μ_m> and Σ_μ denote the mean and the variance of μ_m, respectively; H is the portion of this update computed in parallel.
wherein Q is the portion of the τ_m update computed in parallel.
(3) integrating the variational hyper-parameters λ and τ over the Z data blocks of step (2), and continuing to update the parameters until the variational lower bound converges or the number of iterations reaches its maximum, thereby obtaining the posterior distribution q(Θ); the bound is given by the following formula:
wherein E_q(Θ) denotes the expectation with respect to q(Θ), ln p(F, Θ) denotes the log-likelihood of the joint probability distribution, and ln q(Θ) denotes the log of the variational probability distribution;
(4) when a new process variable X_new arrives, its latent factor can be obtained by the following formula:
wherein λ_t denotes the expectation of the latent factor, τ_x denotes the variance of the noise on x, <W_x> the expectation of the loading matrix on x, and <μ_x> the expectation of the mean on x;
the soft measurement prediction result is then:
wherein <W_y> denotes the expectation of the loading matrix on y, and <μ_y> the expectation of the mean on y;
(5) when the output of the quality variable Y_new is obtained, new training data F_new = (X_new, Y_new) are formed; the posterior distribution q(Θ) obtained in step (3) is taken as the current prior distribution, the new training data are again divided into Z data blocks, and L, S, Q and H are computed through the following parallel strategy to update the parameters t, W, μ and τ, where the update formulas of the parameters W and μ become:
here, λ* denotes the posterior distribution computed from the new training data F_new.
(6) integrating the parameters obtained from the Z data blocks and continuing to update them until the variational lower bound under the updating mode converges or the number of iterations reaches its maximum; the bound is:
(7) the symmetric relative entropy between the old and new distributions is calculated by:
when the result KL(old, new) is smaller than the set threshold SKL_ts, the parameters are updated through steps (5) and (6); otherwise, the prior of the parameter λ is re-initialized;
(8) repeating steps (4) to (7) whenever a new data set is obtained, thereby realizing adaptive parallel soft measurement.
Compared with the prior art, the invention has the following beneficial effects. On the basis of the original variational supervised factor analysis model, a streaming updating method and the symmetric relative entropy are introduced: the posterior distribution of the model parameters is updated in real time as the actual streaming big data change, and the selection of the prior distribution is decided accordingly, so that adaptive updating of the model is realized, and the updating efficiency is further improved by the parallel computing strategy. Compared with other traditional methods, for streaming big data scenarios with high real-time requirements the method improves model performance to a certain extent and, through parallel computing, improves the model updating efficiency, thereby achieving adaptive soft measurement oriented to streaming big data.
Drawings
FIG. 1 is a flow chart of a parallel probabilistic variational soft measurement model;
FIG. 2 is a graph of the predicted output of the just-in-time learning probabilistic variational soft measurement model;
FIG. 3 is a graph of the predicted output of the sliding window probabilistic variational soft measurement model;
FIG. 4 is a graph of the predicted output of the parallel probabilistic variational soft measurement model.
Detailed Description
The parallel probability variation soft measurement modeling method for streaming big data of the invention is further detailed below in conjunction with a specific embodiment.
A parallel probability variation soft measurement modeling method for streaming big data is disclosed, wherein the flow of the whole parallel probabilistic variational soft measurement model is shown in FIG. 1, and the steps are as follows:
(1) initializing the prior hyper-parameters a, b and ρ and the variational hyper-parameters λ and τ, and collecting training data F_nm = [X, Y]^T from the historical industrial process, where F_nm ∈ R^(N×M), N denotes the number of samples, M the number of process variables, and R the set of real numbers;
(2) dividing the training data F into Z data blocks, and calculating variation hyper-parameters lambda and tau according to the following formula:
wherein <t_n> and Σ_t denote the mean and the variance of the latent variable t, respectively, τ_m denotes the variance of the noise, <W_m> the expectation of the loading matrix W_m, <μ_m> the expectation of the mean μ_m, F_mn the training data, and I the identity matrix; L is the portion of this update computed in parallel.
wherein <W_m> and Σ_W denote the mean and the variance of the loading matrix W_m, respectively, <t_n> denotes the expectation of the latent variable t, and diag<α> denotes the diagonal matrix formed from α; S is the portion of this update computed in parallel.
wherein â and b̂ denote the updated parameters of α;
wherein <μ_m> and Σ_μ denote the mean and the variance of μ_m, respectively; H is the portion of this update computed in parallel.
wherein Q is the portion of the τ_m update computed in parallel.
Because the parameter calculation parts in L, S, H and Q are mutually independent, the computational efficiency of the model can be further improved by a parallel strategy; moreover, the results of the parallel and non-parallel computations are theoretically identical. The parallel computing strategy is therefore adopted here to further improve the updating efficiency of the model. It is particularly effective for streaming big data scenarios with high real-time requirements and improves the model performance to a certain extent.
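As a concrete illustration of this independence, a minimal Python sketch is given below: a generic per-block sufficient statistic (here the sum of x_n x_n^T over a block, standing in for one block's contribution to L, S, H or Q) is computed for each of the Z blocks in a thread pool and the results are summed, giving exactly the same value as the serial computation. The statistic, block count, and function names are illustrative, not the invention's actual update terms.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_statistic(block):
    # per-block contribution (placeholder for one block's share of L/S/H/Q)
    return block.T @ block

def parallel_statistic(data, z_blocks, workers=4):
    # split the data into Z blocks, evaluate each block independently in
    # parallel, then integrate (sum) the per-block results
    blocks = np.array_split(data, z_blocks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(block_statistic, blocks))
    return sum(parts)

rng = np.random.default_rng(0)
F = rng.standard_normal((1000, 5))
# parallel and serial results agree, as the text asserts
assert np.allclose(parallel_statistic(F, 8), F.T @ F)
```

Only the block-wise evaluation is parallelized; the integration step that follows is a cheap elementwise sum, which matches the integrate-then-update structure of steps (3) and (6).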
(3) integrating the variational hyper-parameters λ and τ over the Z data blocks of step (2), and continuing to update the parameters until the variational lower bound converges or the number of iterations reaches its maximum, thereby obtaining the posterior distribution q(Θ); the bound is given by the following formula:
wherein E_q(Θ) denotes the expectation with respect to q(Θ), ln p(F, Θ) denotes the log-likelihood of the joint probability distribution, and ln q(Θ) denotes the log of the variational probability distribution;
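The iterate-until-converged rule of step (3) can be sketched generically as follows; `update_fn` and `elbo_fn` are placeholders standing in for the variational parameter updates and the bound E_q[ln p(F, Θ)] − E_q[ln q(Θ)], and the toy contraction used in the check is purely illustrative.

```python
def run_until_converged(update_fn, elbo_fn, state, tol=1e-6, max_iter=200):
    # iterate the variational updates until the bound stops improving
    # (change below tol) or the iteration limit is reached
    prev = -float("inf")
    for _ in range(max_iter):
        state = update_fn(state)
        bound = elbo_fn(state)
        if abs(bound - prev) < tol:
            break
        prev = bound
    return state

# toy check: a halving update with bound -|x| converges toward 0
final = run_until_converged(lambda x: x / 2, lambda x: -abs(x), 1.0)
assert abs(final) < 1e-3
```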
(4) when a new process variable X_new arrives, its latent factor can be obtained by the following formula:
wherein λ_t denotes the expectation of the latent factor, τ_x denotes the variance of the noise on x, <W_x> the expectation of the loading matrix on x, and <μ_x> the expectation of the mean on x;
the soft measurement prediction result is then:
wherein <W_y> denotes the expectation of the loading matrix on y, and <μ_y> the expectation of the mean on y;
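Since the exact expressions appear only as figures in the source, the sketch below uses the standard supervised factor-analysis forms as a stand-in: the latent posterior covariance Σ_t and mean λ_t follow the usual Gaussian identities, and the prediction is <W_y> λ_t + <μ_y>. These formulas are an assumption, not the invention's verbatim equations.

```python
import numpy as np

def predict(x_new, W_x, mu_x, W_y, mu_y, tau_x):
    # assumed standard FA posterior over the latent factor t given x_new
    k = W_x.shape[1]
    sigma_t = np.linalg.inv(np.eye(k) + tau_x * W_x.T @ W_x)
    lambda_t = tau_x * sigma_t @ W_x.T @ (x_new - mu_x)
    # soft-measurement prediction: <W_y> lambda_t + <mu_y>
    return W_y @ lambda_t + mu_y
```

With identity loadings, zero means and unit noise precision, the posterior shrinks the observation by half, which gives a quick sanity check on the algebra.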
(5) when the output of the quality variable Y_new is obtained, new training data F_new = (X_new, Y_new) are formed; the posterior distribution q(Θ) obtained in step (3) is taken as the current prior distribution, the new training data are again divided into Z data blocks, and L, S, Q and H are computed through the following parallel strategy to update the parameters t, W, μ and τ, where the update formulas of the parameters W and μ become:
here, λ* denotes the posterior distribution computed from the new training data F_new.
(6) integrating the parameters obtained from the Z data blocks and continuing to update them until the variational lower bound under the updating mode converges or the number of iterations reaches its maximum; the bound is:
(7) the symmetric relative entropy between the old and new distributions is calculated by:
when the result KL(old, new) is smaller than the set threshold SKL_ts, the parameters are updated through steps (5) and (6); otherwise, the prior of the parameter λ is re-initialized. The main purpose of this is to update the posterior distribution of the model parameters in real time according to the changes in the actual streaming big data and to decide the selection of the prior distribution, thereby realizing adaptive updating of the model.
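For reference, a minimal sketch of the symmetric relative entropy KL(old, new) + KL(new, old) is shown below for diagonal-Gaussian posteriors, using the standard closed form of the Gaussian KL divergence; treating the old and new parameter posteriors as diagonal Gaussians is an illustrative assumption.

```python
import numpy as np

def kl_gauss(m0, v0, m1, v1):
    # KL(N(m0, v0) || N(m1, v1)), summed over independent dimensions
    return 0.5 * np.sum(np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def symmetric_kl(m0, v0, m1, v1):
    # symmetric relative entropy between the old and new distributions
    return kl_gauss(m0, v0, m1, v1) + kl_gauss(m1, v1, m0, v0)

# identical old and new distributions give zero divergence
assert symmetric_kl(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3)) == 0.0
```

The value is then compared against the threshold SKL_ts to decide between the normal update of steps (5)-(6) and re-initializing the prior of λ.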
(8) repeating steps (4) to (7) whenever a new training set is obtained, thereby realizing adaptive parallel soft measurement.
Furthermore, the root mean square error (RMSE) is used to quantitatively evaluate the prediction performance:

RMSE = sqrt( (1/Nt) * Σ_{i=1}^{Nt} (y_i − ŷ_i)^2 )

wherein y_i is the true value of the output variable, ŷ_i is the predicted output of the model, and Nt denotes the number of online test samples.
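The RMSE just defined can be computed directly:

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean square error over the Nt online test samples
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# errors of 0, 0 and 2 give sqrt(4/3)
assert np.isclose(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), np.sqrt(4 / 3))
```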
Examples
The performance of the parallel probabilistic variational model is described below in connection with a specific methanation unit example in a synthetic ammonia process. The main function of the methanation furnace unit is to convert CO and CO2 into methane, which is then transferred and recycled. In this unit, the goal is to minimize the CO and CO2 content in the process gas. Therefore, the first and most important task is to measure the residual CO and CO2 at the outlet of the unit as the key quality variable. Here, 10 process variables, including pressures, temperatures, flows and levels, are taken as inputs for the soft measurement modeling.
For this procedure, 95000 samples were collected at consecutive equal intervals. The first 5000 samples constitute the original training data set, and the remaining 90000 samples serve as test samples.
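This split can be reproduced with a short helper that also yields the stream of 100-sample mini-batches consumed by the method; the function name and layout are illustrative.

```python
import numpy as np

def make_stream(data, n_train=5000, batch=100):
    # first n_train samples form the original training set; the rest are
    # consumed as a stream of mini-batches of `batch` samples each
    train, rest = data[:n_train], data[n_train:]
    batches = [rest[i:i + batch] for i in range(0, len(rest), batch)]
    return train, batches

samples = np.arange(95000)
train, batches = make_stream(samples)
# 5000 initial training samples, 900 streamed mini-batches of 100
assert len(train) == 5000 and len(batches) == 900 and len(batches[0]) == 100
```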
In order to verify that the adaptive soft measurement method of the invention tracks state changes, it is compared with a just-in-time learning probabilistic variational model and a sliding window probabilistic variational model, as shown in FIGS. 2, 3 and 4, respectively. For the just-in-time learning model, the number of samples for local modeling is the same as the original training data set (5000 samples); for the sliding window model, the window size is set to 5000 and the window step to 100; for the method of the invention, each mini-batch X_new has a size of 100. In FIG. 2, although the just-in-time learning probabilistic variational soft measurement model can track the overall trend, there is a large deviation; moreover, its performance after the state transition is unstable and worse than that of the sliding window method. As can be seen from FIG. 3, the sliding window probabilistic variational soft measurement model can roughly track state changes but does not perform well after the second state switch: the fluctuation and error are large at the beginning, although the result gradually stabilizes and the subsequent predictions are good. In contrast, FIG. 4 shows that the method of the invention further improves the adaptability of the soft measurement model, with outputs closer to the true values. From FIGS. 1 to 4 it can be seen that the prediction error of the parallel probabilistic variational soft measurement model of the invention is smaller and the tracking effect is better. Further, Table 1 gives the detailed prediction results of the three adaptive soft measurement models: on the one hand, in terms of RMSE the method of the invention has the lowest error, 0.0397; on the other hand, in terms of update time the parallel probabilistic variational soft measurement model has an absolute advantage. In summary, as can be seen from FIGS. 2 to 4 and Table 1, the parallel probabilistic variational method of the invention not only has better tracking and fitting effects, but also has the highest updating efficiency, and is more suitable for streaming big data scenarios with high real-time requirements.
TABLE 1 prediction effect and computation time of three adaptive methods