
CN106874651A - Indoor air data preprocessing method based on locally weighted regression

Info

Publication number: CN106874651A
Application number: CN201710020701.5A
Authority: CN
Inventors: 孙贺江; 徐崇; 刘俊杰
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2017-01-12
Filing date: 2017-01-12
Publication date: 2017-06-20

Abstract

The present invention relates to the preprocessing of time-varying air parameter data. It uses a simple but effective method to preprocess time-varying indoor air data, including filling short-duration data gaps and removing abnormal jump values, while ensuring that large data changes caused by changes in human behavior are not identified as outliers; finally, the zero offset is corrected. The technical scheme is an indoor air data preprocessing method based on locally weighted regression: first, short-duration data gaps are filled so that no zero-valued gaps remain anywhere in the data; next, abnormal jump values are removed; once no needle-shaped jump points remain, the zero offset is corrected by substituting the processed data into the calibration curve. The invention is mainly applied to the preprocessing of time-varying air parameter data.

Description

Indoor Air Data Preprocessing Method Based on Local Weighted Regression

Technical field

This algorithm fills data gaps in time-varying air parameters (temperature, humidity, formaldehyde concentration, PM2.5 concentration, carbon dioxide concentration, etc.), removes abnormal jump values from the data, and corrects the zero offset of the data. It belongs to the field of domain-specific data preprocessing. Specifically, it concerns an indoor air data preprocessing method based on locally weighted regression.

Background

Existing data preprocessing techniques range from simple to complex; simple methods are rarely effective, while effective methods tend to be complicated [1]. The data handled by this technique are indoor air data. First, such data change slowly over time overall but carry varying degrees of random noise at every moment (see Figure 1). Second, because the hardware system has a data-gap alarm function, data gaps are guaranteed to be short. Third, a calibration curve for correcting the zero offset is already available. Therefore, rather than gap filling or zero-offset correction, the core of this technique is removing jump outliers from the data while ensuring that large data changes related to changes in human behavior are not identified as outliers and removed.

Among methods for removing outliers, the most common is to classify the data directly with a C4.5 decision tree [2], but that algorithm tends to classify large data changes caused by changes in human behavior as outliers together with the abnormal jump values. The CD (Curve Description) method has also been used to classify outliers [3]; it uses the amount and rate of change between adjacent values as thresholds. For the problem addressed by this patent, however, it has defects similar to those of the decision tree method and is more complex to implement. Noise data filters are also used abroad to identify and remove outliers, typically the Ensemble Filter (EF) [4] and the Iterative-Partitioning Filter (IPF) [5]. Both are well known but relatively complex and require several additional parameters to be set [1], which is unnecessary for the problem this technique addresses.

Summary of the invention

To overcome the deficiencies of the prior art, the present invention aims to preprocess time-varying indoor air data with a simple but effective method, including filling short-duration data gaps and removing abnormal jump values, while ensuring that large data changes related to changes in human behavior are not identified as outliers; finally, the zero offset is corrected. The technical scheme adopted is an indoor air data preprocessing method based on locally weighted regression: first, short-duration data gaps are filled so that no zero-valued gaps remain anywhere in the data; next, abnormal jump values are removed; once no needle-shaped jump points remain, the zero offset is corrected by substituting the processed data into the calibration curve.

The specific step for removing abnormal jump values is to fit the meaningful information with a fitted curve that does not follow needle-shaped data jumps or any high-frequency noise. Locally weighted regression (Local Weight Regression) is chosen to fit the useful information; the fitted curve is then subtracted from the original data curve to obtain the noise curve, which removes the interference of useful information with jump-value removal.

The specific steps of locally weighted regression are as follows. A certain number of reference points on the horizontal axis divide the whole data set into equal parts, and a local regression is computed centered on each of these points. When the regression parameters are solved by least squares, data points farther from the center point receive smaller weights. This yields the regression values at the reference points, which are then connected by interpolation; linear interpolation is sufficient here.

Further, over the training data points, the following quantity is to be minimized:

$$\sum_i w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \tag{1}$$

Here $i$ is the index of the training data points; $x$ is the time value on the time axis; $y$ is the target value; $\theta$ is the coefficient vector of the regression equation, and since quadratic regression is used, $\theta$ is a three-dimensional vector; $w$ is the Gaussian weight, expressed as:

$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right) \tag{2}$$

Here $x$ without a superscript is the selected reference point on the horizontal axis, and $\tau$ is the bandwidth; the larger $\tau$ is, the stronger the smoothing of the local regression, because points farther from the reference point keep an appreciable weight.
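The Gaussian weight $w^{(i)} = \exp(-(x^{(i)} - x)^2 / (2\tau^2))$ can be sketched in a few lines. This is a minimal illustration in Python, not the patent's own implementation; the function name `gaussian_weights` and the sample values are assumptions chosen for the example.

```python
import math


def gaussian_weights(xs, x_ref, tau):
    """Gaussian weight of every sample time in xs relative to the reference point x_ref.

    tau is the bandwidth: a larger tau makes the weights decay more slowly
    with distance from x_ref. (Hypothetical helper, for illustration only.)
    """
    return [math.exp(-((xi - x_ref) ** 2) / (2.0 * tau ** 2)) for xi in xs]


# Illustrative sample times and reference point (assumed values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
weights = gaussian_weights(xs, x_ref=2.0, tau=1.0)
wide_weights = gaussian_weights(xs, x_ref=2.0, tau=3.0)
```

The sample at the reference point itself always gets weight 1, and the weights are symmetric around it; with the wider bandwidth, the samples at the edges keep more influence on the fit.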

Compared with ordinary least squares, locally weighted regression places a Gaussian weight in front of each squared residual term. A quadratic regression curve is solved for each reference point, and the curve parameters differ from one reference point to another. For any reference point $x$:

$$\theta = (X^T W X)^{-1} X^T W y \tag{3}$$

Here $X$ is the matrix with $m$ rows built from $1$, $x^{(i)}$, and $(x^{(i)})^2$, called the design matrix, where $m$ is the number of training data points. $X$ is written as:

$$X = \begin{pmatrix} 1 & x^{(1)} & (x^{(1)})^2 \\ \vdots & \vdots & \vdots \\ 1 & x^{(i)} & (x^{(i)})^2 \\ \vdots & \vdots & \vdots \\ 1 & x^{(m)} & (x^{(m)})^2 \end{pmatrix} \tag{4}$$

$W$ is a diagonal matrix of order $m$, written $\mathrm{diag}(w^{(1)} \dots w^{(i)} \dots w^{(m)})$; $y$ is the $m$-dimensional column vector of target values, written $(y^{(1)} \dots y^{(i)} \dots y^{(m)})^T$. The resulting $\theta$ is a three-element column vector whose elements, from top to bottom, are the coefficients of the constant, linear, and quadratic terms of the quadratic regression curve. For each reference point $x_{ck}^{(j)}$, substituting it back into the regression curve gives its regression value $y_{ck}^{(j)}$, where $j$ is the index of the reference points; this forms a regression point $(x_{ck}^{(j)}, y_{ck}^{(j)})$.
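As a rough sketch of the weighted least-squares solve $\theta = (X^T W X)^{-1} X^T W y$ for a single reference point, the following Python code accumulates the 3×3 normal equations directly and solves them by Gaussian elimination. The function names (`solve_linear_system`, `local_quadratic_fit`, `regression_value`) and the test data are hypothetical; the source does not specify how the system is solved.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ t = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    t = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * t[c] for c in range(r + 1, n))
        t[r] = (m[r][n] - s) / m[r][r]
    return t


def local_quadratic_fit(xs, ys, x_ref, tau):
    """Return [constant, linear, quadratic] coefficients for one reference point."""
    xtwx = [[0.0] * 3 for _ in range(3)]
    xtwy = [0.0] * 3
    for xi, yi in zip(xs, ys):
        w = math.exp(-((xi - x_ref) ** 2) / (2.0 * tau ** 2))  # Gaussian weight
        row = [1.0, xi, xi * xi]  # one row of the design matrix X
        for p in range(3):
            xtwy[p] += w * row[p] * yi
            for q in range(3):
                xtwx[p][q] += w * row[p] * row[q]
    return solve_linear_system(xtwx, xtwy)


def regression_value(theta, x_ref):
    """Evaluate the local quadratic curve at the reference point."""
    return theta[0] + theta[1] * x_ref + theta[2] * x_ref ** 2


# Exactly quadratic data (assumed values), so the fit should recover the coefficients.
xs = [float(i) for i in range(11)]
ys = [2.0 + 0.5 * x - 0.1 * x * x for x in xs]
theta = local_quadratic_fit(xs, ys, x_ref=5.0, tau=3.0)
y_ck = regression_value(theta, 5.0)
```

For long time axes the powers of the raw time values make the normal equations poorly conditioned; centering the time values on the reference point before building the rows is a common numerical convenience, though it is not something the source describes.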

Linear interpolation between adjacent regression points then gives the regression curve for the whole data curve.
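Joining the regression points by linear interpolation might look like the sketch below. It assumes the reference points are sorted in ascending order and holds the end values constant outside the first and last reference point; that boundary handling and the name `interpolate_regression_points` are assumptions, since the source does not say what happens at the edges.

```python
from bisect import bisect_right


def interpolate_regression_points(x_ck, y_ck, xs):
    """Evaluate the piecewise-linear curve through (x_ck[j], y_ck[j]) at each x in xs.

    x_ck must be sorted ascending. Outside [x_ck[0], x_ck[-1]] the nearest
    end value is used (an assumed boundary rule).
    """
    curve = []
    for x in xs:
        if x <= x_ck[0]:
            curve.append(y_ck[0])
        elif x >= x_ck[-1]:
            curve.append(y_ck[-1])
        else:
            j = bisect_right(x_ck, x)  # x_ck[j - 1] <= x < x_ck[j]
            x0, x1 = x_ck[j - 1], x_ck[j]
            t = (x - x0) / (x1 - x0)
            curve.append(y_ck[j - 1] + t * (y_ck[j] - y_ck[j - 1]))
    return curve


# Three illustrative regression points (assumed values).
x_ck = [0.0, 10.0, 20.0]
y_ck = [1.0, 3.0, 2.0]
curve = interpolate_regression_points(x_ck, y_ck, [0.0, 5.0, 10.0, 15.0, 20.0])
```

Evaluating the curve at every sample time of the original series gives a fitted value per sample, which is what the residual step needs.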

The flow for identifying and removing jump values with locally weighted regression is:

a. Apply locally weighted regression to the original data curve to generate the fitted curve;

b. Subtract the fitted curve from the original data to obtain the residual curve;

c. Compute the mean and standard deviation of the residual curve;

d. Traverse all residual data and use the Pauta criterion (3σ rule) to select all data that exceed the limit;

e. Obtain the indices of the data selected in d, and replace the original data at those indices with values interpolated between the normal data on both sides of the jump, so as to smooth the curve.

The features and beneficial effects of the present invention are:

The invention is simple in principle, fast to compute, and effective. It works well on indoor air quality (IAQ) time series: it effectively separates the noise from the original data, and by analyzing the characteristics of the noise it removes jump outliers while ensuring that large data changes related to changes in human behavior are not identified as outliers and removed.

Description of the drawings:

Figure 1: Indoor air data measured by sensors in an office. The horizontal axis is time, in units of 1 second; the vertical axis is the corresponding value. Panels 1-2 and 1-3 both show needle-shaped data jumps.

Figure 2: Measured formaldehyde time series with data gaps and data jumps.

Figure 3: Data smoothing flow chart.

Figure 4: Measured formaldehyde time series with the data gaps filled by interpolation.

Figure 5: Fitted curve of the measured formaldehyde time series after locally weighted regression of the curve in Figure 1.

Figure 6: Residual curve of the formaldehyde time series. The data jumps are contained in it, and it carries no useful information.

Figure 7: Measured formaldehyde time series with the data jumps smoothed. All useful information has been preserved.

Detailed description

To avoid algorithmic complexity and error amplification, indoor air data that may contain short-duration gaps are preprocessed in the following order: first, the short-duration gaps are filled so that no zero-valued gaps remain anywhere in the data; next, abnormal jump values are removed; once no needle-shaped jump points remain, the zero offset is corrected by substituting the processed data into the calibration curve.

1) In the raw data, every missing value has already been replaced by 0.

First, select a time series of n values, where n should be an easily divisible number, such as 2000, so that jump values can be removed effectively. Then fill the missing values in the raw data: given the data properties mentioned in (3), interpolation is chosen here, and the interpolated results replace the 0 values at the corresponding positions.
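A minimal sketch of the gap-filling step, assuming that gaps are marked by 0 as described and that linear interpolation between the nearest valid neighbours is acceptable. The function name `fill_gaps` and the handling of gaps at the very start or end of the series (copying the nearest valid value) are assumptions not stated in the source.

```python
def fill_gaps(values, missing=0.0):
    """Replace every run of `missing` values by linear interpolation.

    Gaps at the start or end of the series have only one valid neighbour,
    so they are filled with that neighbour's value (an assumed rule).
    """
    if all(v == missing for v in values):
        raise ValueError("no valid samples to interpolate from")
    filled = list(values)
    n = len(values)
    i = 0
    while i < n:
        if values[i] != missing:
            i += 1
            continue
        start = i
        while i < n and values[i] == missing:
            i += 1
        left, right = start - 1, i  # indices of the valid neighbours
        for k in range(start, i):
            if left < 0:
                filled[k] = values[right]
            elif right >= n:
                filled[k] = values[left]
            else:
                t = (k - left) / (right - left)
                filled[k] = values[left] + t * (values[right] - values[left])
    return filled


# Illustrative formaldehyde readings with a two-sample gap (assumed values).
raw = [0.070, 0.068, 0.0, 0.0, 0.062, 0.060]
filled = fill_gaps(raw)
```

Note that this treats any genuine reading of exactly 0 as a gap, which matches the convention of the raw data described above but would need care for parameters that can legitimately be zero.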

2) Once the data contain no 0 values, jump-value removal begins.

Note that large data changes caused by changes in human behavior must not be removed. Taking measured formaldehyde data that contain data gaps, data jumps, and large changes caused by human behavior as an example (see Figure 2), data jumps turn out to differ greatly from changes caused by human behavior:

At the moment corresponding to the dashed curve on the right side of Figure 2, the office door was opened, and part of the experimental formaldehyde gas from the laboratory across the hall flowed into the office, raising the indoor formaldehyde concentration. This is a typical change in human behavior, and such a change in formaldehyde concentration must not be identified as an outlier and removed. The part inside the solid red circle on the left is missing data, with value 0.

Here it is not appropriate to remove outliers from the data directly. On the one hand, the overall trend of the data changes slowly, from 0.07 on the left to 0.047 on the right; on the other hand, the curve inside the dashed red circle on the right is easily mistaken for an outlier. Both are meaningful information in the data curve and need to be retained. A suitable strategy is therefore to fit the meaningful information with a fitted curve that does not follow needle-shaped data jumps or any high-frequency noise. Locally weighted regression (Local Weight Regression) is chosen here to fit the useful information; subtracting the fitted curve from the original data curve gives the noise curve, which removes the interference of useful information with jump-value removal.

3) Principle of locally weighted regression

Locally weighted regression is an improved version of ordinary linear regression that can overcome the latter's tendency to underfit or overfit. A certain number of reference points on the horizontal axis divide the whole data set into equal parts, and a local regression is computed centered on each of these points. When the regression parameters are solved by least squares, data points farther from the center point receive smaller weights. This yields the regression values at the reference points, which are then connected by interpolation; linear interpolation is sufficient here.

Over the training data points, the following quantity is to be minimized:

$$\sum_i w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \tag{1}$$

Here $i$ is the index of the training data points; $x$ is the feature value, which in this document is the time value on the time axis; $y$ is the target value, which in this document is the formaldehyde concentration; $\theta$ is the coefficient vector of the regression equation, and since this method uses quadratic regression, $\theta$ is a three-dimensional vector; $w$ is the Gaussian weight, expressed as:

$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right) \tag{2}$$

Here $x$ without a superscript is the selected reference point on the horizontal axis, and $\tau$ is the bandwidth; the larger $\tau$ is, the stronger the smoothing of the local regression, because points farther from the reference point keep an appreciable weight.

Compared with ordinary least squares, locally weighted regression places a Gaussian weight in front of each squared residual term. This greatly weakens the influence of data far from the reference point on the fit, which is what makes the regression local. A quadratic regression curve is solved for each reference point, and the curve parameters differ from one reference point to another. For any reference point $x$:

$$\theta = (X^T W X)^{-1} X^T W y \tag{3}$$

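Here $X$ is the matrix with $m$ rows built from $1$, $x^{(i)}$, and $(x^{(i)})^2$, called the design matrix, where $m$ is the number of training data points. $X$ is written as follows (a quadratic regression curve is sought here, so the matrix has only three columns; for a regression curve of degree $n$, the matrix has $n+1$ columns):

$$X = \begin{pmatrix} 1 & x^{(1)} & (x^{(1)})^2 \\ \vdots & \vdots & \vdots \\ 1 & x^{(i)} & (x^{(i)})^2 \\ \vdots & \vdots & \vdots \\ 1 & x^{(m)} & (x^{(m)})^2 \end{pmatrix} \tag{4}$$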

$W$ is a diagonal matrix of order $m$, written $\mathrm{diag}(w^{(1)} \dots w^{(i)} \dots w^{(m)})$; $y$ is the $m$-dimensional column vector of target values, written $(y^{(1)} \dots y^{(i)} \dots y^{(m)})^T$. The resulting $\theta$ is a three-element column vector whose elements, from top to bottom, are the coefficients of the constant, linear, and quadratic terms of the quadratic regression curve. For each reference point $x_{ck}^{(j)}$, substituting it back into the regression curve gives its regression value $y_{ck}^{(j)}$, where $j$ is the index of the reference points; this forms a regression point $(x_{ck}^{(j)}, y_{ck}^{(j)})$.
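For readers who want the intermediate step, the closed form for $\theta$ follows from the weighted sum of squares by a standard argument. The sketch below writes the objective in matrix form; the symbol $J(\theta)$ is introduced here only for the derivation, and $x^{(i)}$ inside $\theta^T x^{(i)}$ is read as the feature vector $(1, x^{(i)}, (x^{(i)})^2)^T$, i.e. the $i$-th row of $X$.

```latex
\begin{aligned}
J(\theta) &= \sum_i w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
           = (y - X\theta)^T W (y - X\theta), \\
\nabla_\theta J(\theta) &= -2\, X^T W (y - X\theta) = 0
  \;\Longrightarrow\; X^T W X\, \theta = X^T W y, \\
\theta &= (X^T W X)^{-1} X^T W y .
\end{aligned}
```

The last step assumes $X^T W X$ is invertible, which holds when at least three distinct time values carry non-zero weight.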

Linear interpolation between adjacent regression points then gives a fairly accurate regression curve for the whole data curve, and such a curve retains all the useful information.

4) Flow for identifying and removing jump values with locally weighted regression

a. Apply locally weighted regression to the original data curve to generate the fitted curve.

b. Subtract the fitted curve from the original data to obtain the residual curve.

c. Compute the mean and standard deviation of the residual curve.

d. Traverse all residual data and use the Pauta criterion (3σ rule) to select all data that exceed the limit. The Pauta criterion states that all data lying outside three standard deviations of the mean are rejected; that is, every x satisfying |x-μ|>3σ is removed. Because the amount of data processed at one time is large, far above the criterion's lower limit of 100 data points, the comparatively simple Pauta criterion is adopted here.

e. Obtain the indices of the data selected in d, and replace the original data at those indices (the jump data) with values interpolated between the normal data on both sides of the jump, so as to smooth the curve.
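Steps b through e can be sketched as follows, taking the fitted curve as an input. This is an illustrative Python sketch rather than the patent's code: the function name `smooth_jumps`, the use of the population standard deviation, and the rule for jumps at the ends of the series (copying the nearest normal value) are all assumptions.

```python
import statistics


def smooth_jumps(raw, fitted, k=3.0):
    """Flag residuals farther than k standard deviations from their mean
    and replace the flagged raw samples by linear interpolation between
    the nearest unflagged neighbours.

    Returns (cleaned_data, flagged_indices).
    """
    residuals = [r - f for r, f in zip(raw, fitted)]       # step b
    mu = statistics.mean(residuals)                        # step c
    sigma = statistics.pstdev(residuals)                   # population std (assumed)
    flagged = [abs(r - mu) > k * sigma for r in residuals] # step d
    cleaned = list(raw)
    n = len(raw)
    i = 0
    while i < n:                                           # step e
        if not flagged[i]:
            i += 1
            continue
        start = i
        while i < n and flagged[i]:
            i += 1
        left, right = start - 1, i  # nearest normal samples on both sides
        for j in range(start, i):
            if left >= 0 and right < n:
                t = (j - left) / (right - left)
                cleaned[j] = raw[left] + t * (raw[right] - raw[left])
            elif left >= 0:
                cleaned[j] = raw[left]
            elif right < n:
                cleaned[j] = raw[right]
    return cleaned, [idx for idx, f in enumerate(flagged) if f]


# Flat fitted curve with small alternating noise and one needle-shaped jump (assumed values).
fitted = [0.06] * 30
raw = [0.06 + (0.001 if i % 2 == 0 else -0.001) for i in range(30)]
raw[12] = 0.09
cleaned, jump_indices = smooth_jumps(raw, fitted)
```

Because the threshold is applied to the residuals rather than to the raw data, a slow rise in the fitted curve, such as the one caused by the opened office door, does not by itself push samples over the limit.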

5) After jump values have been identified and removed by locally weighted regression, the processed data are substituted into the calibrated regression equation to obtain the complete preprocessed result data.
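The source does not give the form of the calibration equation. Purely as an illustration, the sketch below assumes a linear calibration curve with a hypothetical slope and intercept; the name `apply_calibration` and the coefficient values are invented for the example.

```python
def apply_calibration(values, slope, intercept):
    """Map processed sensor readings through a linear calibration curve.

    The linear form and the coefficients are assumptions; a real sensor's
    calibration curve would come from its own calibration experiment.
    """
    return [slope * v + intercept for v in values]


# Hypothetical processed readings and calibration coefficients.
processed = [0.060, 0.061, 0.059]
calibrated = apply_calibration(processed, slope=1.05, intercept=-0.002)
```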

1) Naming of the custom function, its input variables, and its output variable.

There are two input variables: the "raw data", imported from an Excel spreadsheet as a column vector, and the "number of division intervals along the time dimension".

There is one output variable: the "result data after the abnormal-data smoothing algorithm", which is written back to the Excel spreadsheet.

2) The flow for smoothing outliers in indoor air data based on locally weighted regression (the original data must not contain zero-valued gaps) is shown in Figure 3.

Take the gap-filled measured formaldehyde time series as an example (see Figure 4): the raw data vector contains 1500 values, and the "number of division intervals along the time dimension" is set to 50. Figure 5 shows the locally weighted regression fit of the Figure 4 curve; all useful information is retained. Figure 6 shows the residual curve, also called the noise curve, obtained by subtracting the Figure 5 curve from the Figure 4 curve; the needle-shaped data jumps have been separated from the original data. The residual curve is tested against the Pauta criterion, the jump data that fail the test are removed and replaced by interpolation, and the result is added back to the fitted curve of Figure 5 to obtain the data curve with jump values smoothed (see Figure 7).

References:

[1] Salvador García, Julian Luengo, Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowledge-Based Systems, 2016; 98:1-29.

[2] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., 1993.

[3] Hao Zhou; Lifeng Qiao; Yi Jiang; Hejiang Sun; Qingyan Chen. Recognition of air-conditioner operation from indoor air temperature and relative humidity by a data mining approach. Energy and Buildings, 2016; 111:233-241.

[4] C.E. Brodley, M.A. Friedl, Identifying mislabeled training data, J. Artif. Intell. Res. 1999; 11:131-167.

[5] T.M. Khoshgoftaar, P. Rebours, Improving software quality prediction by noise filtering techniques, J. Comput. Sci. Technol. 2007; 22:387-396.

Claims

1. An indoor air data preprocessing method based on locally weighted regression, characterized in that short-duration data gaps are filled first, ensuring that no zero-valued gaps remain anywhere in the data; abnormal jump values are then removed; and once it is ensured that no needle-shaped jump points remain, the zero offset is corrected, that is, the processed data are substituted into the calibration curve.

2. The indoor air data preprocessing method based on locally weighted regression according to claim 1, characterized in that the specific step for removing abnormal jump values is: a fitted curve is used to fit the meaningful information while not fitting needle-shaped data jumps or any high-frequency noise; specifically, locally weighted regression (Local Weight Regression) is used to fit the useful information, and the fitted curve is then subtracted from the original data curve to obtain the noise curve, which removes the interference of useful information with jump-value removal.

3. The indoor air data preprocessing method based on locally weighted regression according to claim 2, characterized in that the specific steps of locally weighted regression are: a certain number of reference points on the horizontal axis divide the whole data set into equal parts, and a local regression is computed centered on each of these points; when the regression parameters are solved by least squares, data points farther from the center point receive smaller weights; the regression values at these points are thus obtained and then connected by interpolation, linear interpolation being used here;

Further, over the training data points, the following quantity is to be minimized:

$$\sum_i w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \tag{1}$$

where $i$ is the index of the training data points; $x$ is the time value on the time axis; $y$ is the target value; $\theta$ is the coefficient vector of the regression equation, and since quadratic regression is used, $\theta$ is a three-dimensional vector; $w$ is the Gaussian weight, expressed as:

$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right) \tag{2}$$

where $x$ without a superscript is the selected reference point on the horizontal axis, and $\tau$ is the bandwidth; the larger $\tau$ is, the stronger the smoothing of the local regression;

Locally weighted regression places a Gaussian weight in front of each squared residual term; a quadratic regression curve is solved for each reference point, and the curve parameters differ from one reference point to another; for any reference point $x$:

$$\theta = (X^T W X)^{-1} X^T W y \tag{3}$$

where $X$ is the matrix with $m$ rows built from $1$, $x^{(i)}$, and $(x^{(i)})^2$, called the design matrix, $m$ being the number of training data points, and $X$ is written as:

$$X = \begin{pmatrix} 1 & x^{(1)} & (x^{(1)})^2 \\ \vdots & \vdots & \vdots \\ 1 & x^{(i)} & (x^{(i)})^2 \\ \vdots & \vdots & \vdots \\ 1 & x^{(m)} & (x^{(m)})^2 \end{pmatrix} \tag{4}$$

$W$ is a diagonal matrix of order $m$, written $\mathrm{diag}(w^{(1)} \dots w^{(i)} \dots w^{(m)})$; $y$ is the $m$-dimensional column vector of target values, written $(y^{(1)} \dots y^{(i)} \dots y^{(m)})^T$; the resulting $\theta$ is a three-element column vector whose elements, from top to bottom, are the coefficients of the constant, linear, and quadratic terms of the quadratic regression curve; for each reference point $x_{ck}^{(j)}$, substituting it back into the regression curve gives its regression value $y_{ck}^{(j)}$, where $j$ is the index of the reference points, thus forming a regression point $(x_{ck}^{(j)}, y_{ck}^{(j)})$;

Linear interpolation between adjacent regression points then gives a fairly accurate regression curve for the whole data curve.

4. The indoor air data preprocessing method based on locally weighted regression according to claim 2, characterized in that the flow for identifying and removing jump values with locally weighted regression is:

a. Apply locally weighted regression to the original data curve to generate the fitted curve;

b. Subtract the fitted curve from the original data to obtain the residual curve;

c. Compute the mean and standard deviation of the residual curve;

d. Traverse all residual data and use the Pauta criterion (3σ rule) to select all data that exceed the limit;

e. Obtain the indices of the data selected in d, and replace the original data at those indices with values interpolated between the normal data on both sides of the jump, so as to smooth the curve.