CN104123448B

CN104123448B - Multi-data-stream anomaly detection method based on context

Info

Publication number: CN104123448B
Application number: CN201410335201.7A
Authority: CN
Inventors: 徐建; 李涛; 张宏; 张琨; 朱保平; 衷宜; 陈龙
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2014-07-14
Filing date: 2014-07-14
Publication date: 2017-05-17
Anticipated expiration: 2034-07-14
Also published as: CN104123448A

Abstract

The present invention provides a context-based multi-data stream anomaly detection method, comprising the following steps: step 1, multi-data stream acquisition and snapshot generation; step 2, multi-data stream snapshot anomaly quantification; step 3, data stream anomaly quantification; step 4 , data stream anomaly identification. The purpose of the detection method provided by the present invention is to take the node anomaly detection of the isomorphic distributed computing system as the research background, and take the data flow monitored by the computing node as the research object, to provide a context information and single data comprehensively considering multiple data flows An anomaly detection method for historical behavior information of streams, with a high detection rate.

Description

Context-based anomaly detection method for multiple data streams

技术领域technical field

本发明属于异常检测技术，特别是一种融合多数据流的上下文信息和单数据流的历史行为信息的异常检测方法。The invention belongs to anomaly detection technology, in particular to an anomaly detection method that integrates context information of multiple data streams and historical behavior information of a single data stream.

背景技术Background technique

数据流异常检测是数据流挖掘研究中的一个重要的方向。异常指的是在数据集中与众不同的数据，这些数据并不是由于随机偏差产生的，而是产生于完全不同的机制。由于查找数据流中的异常在网络攻击监测，信用卡诈骗、计算系统性能分析等领域具有非常广泛的应用，数据流异常检测方法是当前研究的热点之一，对数据流上异常行为的检测和挖掘的研究已经受到了学术界和工业界的共同关注。Data stream anomaly detection is an important direction in data stream mining research. Anomalies refer to data that are out of the ordinary in a data set, that are not due to random bias, but are generated by an entirely different mechanism. Since finding anomalies in data streams has a wide range of applications in the fields of network attack monitoring, credit card fraud, and computing system performance analysis, data stream anomaly detection methods are one of the current research hotspots. The detection and mining of abnormal behaviors on data streams The research has attracted the attention of academia and industry.

在诸如分布式计算等现实应用中，数据流管理系统需要同时接收多条数据流，并且各条数据流之间往往并非完全独立，而是存在相关性，例如在证券交易系统中，处于同一市场的股票经常会出现相同或相似的升降趋势，在道路交通网络中，不同路段的车流量也将具有一定的关联性。对于相互关联的数据流来说，一旦发现它们之间的相关性被破坏，则可断定在这些数据流中存在有异常情况。基于这一思路，研究者探讨通过监测多数据流间相关性来检测异常的方法。现有的数据流异常检测方法大致可以划分为基于网格的数据流异常检测，基于密度的异常检测和基于距离的异常检测。In real-world applications such as distributed computing, the data stream management system needs to receive multiple data streams at the same time, and the data streams are often not completely independent, but correlated, for example, in the securities trading system, in the same market Stocks often have the same or similar ups and downs. In the road traffic network, the traffic flow of different road sections will also have a certain correlation. For interrelated data streams, once the correlation between them is found to be destroyed, it can be concluded that there are abnormalities in these data streams. Based on this idea, the researchers explore the method of detecting anomalies by monitoring the correlation between multiple data streams. Existing data flow anomaly detection methods can be roughly divided into grid-based data flow anomaly detection, density-based anomaly detection and distance-based anomaly detection.

基于网格的数据流异常检测是把整个数据空间分割成为相互独立，大小一致的很多网格，人为地设定一个支持度，当网格中所包含的数据元素的支持度超过或者等于了事先设定的支持度大小时，就从所有的维度中选出一维，并按照这一维度将网格动态的分为两个完全独立的子网格。当子网格的支持度也达到或超过阈值时，同样的分割操作也会在子网格上进行。Park和Lee等在提出了一种实时的数据流异常检测方法，该网格聚类方法不需要计算数据对象之间的距离，只需要按照事先确定的网格大小，直接把数据放入相应的网格，因此可以实现实时的增量聚类。每次聚类完毕之后只需要保存每个类的特征信息，并计算所有类的异常度，按照由大到小的顺序进行排序，把Top-k异常度最大的类划分为最终的异常类。(Park N H,Lee W S.Statistical grid-based clustering over datastreams[J]. ACM SIGMOD Record,2004,33(1):32-37.)上述异常检测方法要么采用top-p方式把异常量化值最高的p个数据流作为异常，要么把异常量化值超过预定义阈值的数据流作为异常，这些方法在实际应用过程中存在问题：1)阈值难于设定。阈值的合理设定需要非常熟悉应用程序的底层机制，这对于一般应用者而言，难度太大；2)异常的数目一直在变化。某个时刻可能存在超过p个数据流是异常的，采用top-p方式会错过这些真实存在的异常。因此，本发明中采用一种无指导的学习方法自动获取动态变化的异常检测阈值，能更好地适应异常频繁改变的场景。Grid-based data flow anomaly detection is to divide the entire data space into many grids that are independent of each other and of the same size, and artificially set a support degree. When the support degree of the data elements contained in the grid exceeds or is equal to the prior When the support degree is set, one dimension is selected from all dimensions, and the grid is dynamically divided into two completely independent sub-grids according to this dimension. When the support of the sub-grid also reaches or exceeds the threshold, the same segmentation operation will also be performed on the sub-grid. Park and Lee et al. proposed a real-time data flow anomaly detection method. The grid clustering method does not need to calculate the distance between data objects, but only needs to directly put the data into the corresponding grid according to the grid size determined in advance. grid, so real-time incremental clustering can be achieved. After each clustering is completed, only the feature information of each class needs to be saved, and the abnormality of all classes is calculated, sorted in descending order, and the top-k class with the largest abnormality is divided into the final abnormal class. (Park N H, Lee W S. Statistical grid-based clustering over datastreams[J]. ACM SIGMOD Record, 2004, 33(1): 32-37.) The above anomaly detection methods either use the top-p method to maximize the anomaly quantification value The p data streams are regarded as abnormal, or the data stream whose abnormal quantitative value exceeds the predefined threshold is regarded as abnormal. These methods have problems in the actual application process: 1) The threshold is difficult to set. Reasonable setting of the threshold needs to be very familiar with the underlying mechanism of the application, which is too difficult for ordinary users; 2) The number of exceptions is always changing. It is abnormal that there may be more than p data streams at a certain moment, and these real abnormalities will be missed by using the top-p method. Therefore, an unguided learning method is adopted in the present invention to automatically acquire dynamically changing anomaly detection thresholds, which can better adapt to scenes with frequently changing anomalies.

基于密度的异常检测的基本思想是利用某一邻域内样本的密度来确定异常。LOF算法是基于密度的异常检测的代表性算法(Breunig M M,Kriegel H P,Ng R T,etal.LOF:identifying density-based local outliers[C]//ACM Sigmod Record.ACM,2000,29(2):93-104.)。该算法是一种基于局部密度的异常检测算法，能够较为准确的在密度分布不均匀的数据集合中发现异常数据对象。但是LOF算法并不适合直接用于数据流的异常检测，因为其时间复杂度较大，如果每得到一个新的数据对象都需要对所有数据对象的异常度重新进行计算，其代价是不可容忍的。因此，Pokrajac和Lazarevic等人对已有的静态LOF算法做出了改进，提出了动态的增量LOF算法(Pokrajac D,Lazarevic A,LateckiL J.Incremental local outlier detection for data streams[C]//ComputationalIntelligence and Data Mining,2007.CIDM2007.IEEE Symposium on.IEEE,2007:504-515.)。增量LOF算法的核心思想就是当一个新的数据对象到来的时候，并不重新计算所有数据对象特征信息的值，而是只对受到新输入数据对象影响的那一部分数据对象的各个特征信息值进行更新。增量LOF算法在接收到一个新输入的数据对象时，其主要操作分为两个步骤:对于新输入的数据对象，计算其所需的特征信息值；对于受到新输入对象影响密度发生变化的邻居结点，挨个更新其特征信息值，对于没有受到影响的数据对象，不重新计算。采用这一策略之后，动态增量LOF算法在能够达到和重复执行静态LOF算法相当效果的同时，却大大降低了算法执行的时间复杂度，使得其适用于针对数据流的异常检测。然而，LOF算法并没有考虑不同维度值域的差异，可能导致部分维度的影响力显著大于其他维度；另外，其时间复杂度对于离线检测来说是可以接受的，但对实时检测来说还不实用。本发明针对LOF算法的上述两个局限性，提出的算法的时间复杂度为O(n)，与数据流个数呈线性增加关系，能满足实时应用需要。The basic idea of density-based anomaly detection is to use the density of samples in a certain neighborhood to identify anomalies. The LOF algorithm is a representative algorithm for density-based anomaly detection (Breunig M M, Kriegel HP, Ng R T, et al. LOF: identifying density-based local outliers [C] // ACM Sigmod Record. ACM, 2000, 29 (2): 93-104.). This algorithm is an anomaly detection algorithm based on local density, which can accurately find abnormal data objects in data sets with uneven density distribution. However, the LOF algorithm is not suitable for direct use in anomaly detection of data streams, because of its large time complexity. If every time a new data object is obtained, the anomaly degree of all data objects needs to be recalculated, and the cost is intolerable. . Therefore, Pokrajac and Lazarevic et al. improved the existing static LOF algorithm and proposed a dynamic incremental LOF algorithm (Pokrajac D, Lazarevic A, LateckiL J.Incremental local outlier detection for data streams[C]//ComputationalIntelligence and Data Mining, 2007. CIDM2007. IEEE Symposium on. IEEE, 2007:504-515.). The core idea of the incremental LOF algorithm is that when a new data object arrives, it does not recalculate the value of the characteristic information of all data objects, but only calculates the value of each characteristic information of the part of the data object affected by the new input data object. to update. When the incremental LOF algorithm receives a new input data object, its main operation is divided into two steps: for the newly input data object, calculate the required feature information value; for the density change affected by the new input object Neighbor nodes update their characteristic information values one by one, and do not recalculate for unaffected data objects. After adopting this strategy, the dynamic incremental LOF algorithm can achieve the same effect as the repeated execution of the static LOF algorithm, but it greatly reduces the time complexity of algorithm execution, making it suitable for anomaly detection for data streams. However, the LOF algorithm does not consider the difference in the value range of different dimensions, which may cause the influence of some dimensions to be significantly greater than others; in addition, its time complexity is acceptable for offline detection, but not for real-time detection. practical. The present invention aims at the above two limitations of the LOF algorithm, and the time complexity of the proposed algorithm is O(n), which is linearly increasing with the number of data streams, and can meet the needs of real-time applications.

基于距离的异常检测是由Knorr和Ng等提出的(Knorr E M,Ng R T,TucakovV.Distance-based outliers:algorithms and applications[J].The VLDB Journal—The International Journal on Very Large Data Bases,2000,8(3-4):237-253.)。基于距离的异常定义：数据集S中一个对象O称为DB(p,D)-outlier，如果它满足下列性质：数据集S中至少p*100％的对象与O的距离大于距离D。简单的说，基于距离的异常点就是那些没有“足够多”的邻居的对象。Angiulli等人也提出了基于距离的数据流异常检测算法Storm，包括exact-storm和approx-Storm，前者为精确的算法，后者则是以中心极限定理为保证的近似算法(Angiulli F,Fassetti F.Detecting distance-based outliers in streamsof data[C]//Proceedings of the sixteenth ACM conference on Conference oninformation and knowledge management.ACM,2007:811-820.)。Storm算法采用了基于计数的滑动窗口模型，并根据数据流中数据对象到达的时间先后顺序，将某个特定数据对象的邻居划分为前驱邻居和后续邻居。事先定义两个阈值K和R，分别表示邻居个数和距离，如果数据流中某个输入数据对象在R的距离范围之内邻居个数小于K个，则该对象就为异常数据。算法中定义了一类特殊的数据对象safe inliers，无论数据流随着时间变化如何演化，该类数据对象在整个数据窗口中都不会变为异常数据对象，即邻居个数始终大于K个。在此假设的基础之上，算法对所有非safe inliers数据对象都采用R-Tree来查找每个数据对象的邻居，以提高检索的效率。上述算法，仅考虑数据流的历史信息，忽视了数据流的上下文信息，比如在同构的计算系统中，计算节点的数据流往往表现出相似性，本发明综合考虑的多数据流的上下文信息和单个数据流上的历史信息来确定某个数据流是否是异常数据流，有更高的准确性。Distance-based anomaly detection was proposed by Knorr and Ng (Knorr E M, Ng R T, TucakovV. Distance-based outliers: algorithms and applications[J]. The VLDB Journal—The International Journal on Very Large Data Bases, 2000,8 (3-4):237-253.). Distance-based anomaly definition: An object O in a data set S is called DB(p, D)-outlier if it satisfies the following properties: at least p*100% of the objects in the data set S are farther than the distance D from O. Simply put, distance-based outliers are objects that do not have "enough" neighbors. Angiulli et al. also proposed a distance-based data flow anomaly detection algorithm Storm, including exact-storm and approx-Storm, the former is an accurate algorithm, and the latter is an approximate algorithm guaranteed by the central limit theorem (Angiulli F, Fassetti F .Detecting distance-based outliers in streams of data[C]//Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM,2007:811-820.). The Storm algorithm adopts a count-based sliding window model, and divides the neighbors of a specific data object into predecessor neighbors and follow-up neighbors according to the chronological order in which data objects arrive in the data stream. Two thresholds K and R are defined in advance, which represent the number of neighbors and the distance respectively. If the number of neighbors of an input data object in the data stream is less than K within the distance range of R, the object is abnormal data. A special class of data objects safe inliers is defined in the algorithm. No matter how the data flow evolves over time, this type of data objects will not become abnormal data objects in the entire data window, that is, the number of neighbors is always greater than K. Based on this assumption, the algorithm uses R-Tree to find the neighbors of each data object for all non-safe inliers data objects, so as to improve the retrieval efficiency. The above algorithm only considers the historical information of the data stream, and ignores the context information of the data stream. For example, in an isomorphic computing system, the data streams of the computing nodes often show similarity. The present invention comprehensively considers the context information of multiple data streams Using historical information on a single data flow to determine whether a certain data flow is an abnormal data flow has higher accuracy.

发明内容Contents of the invention

为了克服现有技术存在的不足，本发明提供一种采用同构分布式计算系统的节点异常检测方法，以计算节点监测的数据流为研究对象的综合考虑多数据流的上下文信息和单数据流的历史行为信息的异常检测方法。In order to overcome the shortcomings of the existing technology, the present invention provides a node anomaly detection method using an isomorphic distributed computing system, which takes the data stream monitored by the computing node as the research object and comprehensively considers the context information of multiple data streams and single data stream An anomaly detection method for historical behavior information.

一种基于上下文的多数据流异常检测方法，包括以下步骤：A context-based multi-data stream anomaly detection method, comprising the following steps:

步骤1，多数据流获取和快照生成，过程如下：Step 1, multi-stream acquisition and snapshot generation, the process is as follows:

给定一个由n个同构的计算节点构成的分布式计算系统，记录时刻t同步获取每个计算节点的观测值，根据观测值计算每个计算节点的数据流，所有计算节点在t时刻的数据流记为快照；Given a distributed computing system composed of n isomorphic computing nodes, the observation value of each computing node is obtained synchronously at recording time t, and the data flow of each computing node is calculated according to the observation value. The data flow is recorded as a snapshot;

步骤2，多数据流快照异常量化，过程如下：Step 2, abnormal quantification of multi-stream snapshots, the process is as follows:

将时刻t的快照构造矩阵M用于描述时刻t各个计算节点的瞬时行为；对矩阵M的每个列向量逐一采用0-1归一化方式计算归一化值，得到新的矩阵M'；计算归一化后的每个列向量的均值，得到偏差矩阵M″；计算每个计算节点在时刻t产生的数据流快照的异常量化值；Use the snapshot matrix M at time t to describe the instantaneous behavior of each computing node at time t; use 0-1 normalization method to calculate the normalized value for each column vector of matrix M one by one, and obtain a new matrix M'; Calculate the mean value of each column vector after normalization to obtain the deviation matrix M″; calculate the abnormal quantization value of the data flow snapshot generated by each computing node at time t;

步骤3，数据流异常量化，过程如下：Step 3, data flow abnormal quantification, the process is as follows:

给定任意的数据流，计算时刻t之前的所有快照异常量化值对于当前时刻t的影响力，综合考虑多数据流快照异常量化结果和单个数据流中历史数据对于异常量化结果的影响力，得出数据流的异常量化结果值；Given any data stream, calculate the influence of all snapshot abnormal quantization values before time t on the current time t, and comprehensively consider the influence of multi-data stream snapshot abnormal quantification results and historical data in a single data stream on abnormal quantification results, and get The abnormal quantization result value of the outgoing data stream;

步骤4，数据流异常识别，过程如下：Step 4, data flow abnormal identification, the process is as follows:

对异常量化所得的结果按照从小到大的方式进行排序；计算排序后异常量化值的中位值、最大值、最小值和最大偏差值，并检测数据流。Sort the results of abnormal quantification in ascending order; calculate the median value, maximum value, minimum value, and maximum deviation value of the sorted abnormal quantified values, and detect the data flow.

本发明与现有技术相比，其优点在于：(1)综合考虑多数据流的上下文信息和单数据流的历史信息量化数据流的异常，提高了检测精度；(2)异常识别阈值采用无指导的学习方法自动获取，无需计算节点异常的领域知识，降低了识别参数设定的难度；(3)算法具有较低的时间复杂度，与数据流个数n呈线性增长关系。Compared with the prior art, the present invention has the advantages of: (1) comprehensively considering the context information of multiple data streams and the historical information of a single data stream to quantify the abnormality of the data stream, and improving the detection accuracy; The guided learning method is automatically obtained, without the need to calculate the domain knowledge of node abnormalities, which reduces the difficulty of identifying parameter settings; (3) The algorithm has a low time complexity, which has a linear growth relationship with the number n of data streams.

下面结合附图对本发明作进一步详细描述。The present invention will be described in further detail below in conjunction with the accompanying drawings.

附图说明Description of drawings

图1是本发明一种基于上下文的多数据流异常检测方法的流程图；Fig. 1 is the flow chart of a kind of context-based multi-data flow anomaly detection method of the present invention;

图2是多数据流的上下文、快照生成示意图；Fig. 2 is a schematic diagram of context and snapshot generation of multiple data streams;

图3是数据流快照异常量化的流程图；Fig. 3 is a flow chart of abnormal quantification of data flow snapshot;

图4是数据流异常检测的流程图；Fig. 4 is a flow chart of data stream anomaly detection;

具体实施方式detailed description

结合图1，一种基于上下文的多数据流异常检测方法，包括以下步骤：Combined with Figure 1, a context-based multi-data stream anomaly detection method includes the following steps:

步骤1.1，给定一个由n个同构的计算节点构成的分布式计算系统；Step 1.1, given a distributed computing system composed of n isomorphic computing nodes;

步骤1.2，同步记录t时刻第i个计算节点的观测值其中表示第i个计算节点在时刻t的观测值的第m个观测维度上的分量，1≤i≤n，每个分量表示一种感兴趣的节点度量，同构的计算节点的度量是完全相同的，m的值由总的度量数目决定；Step 1.2, synchronously record the observation value of the i-th computing node at time t in Indicates the component of the i-th computing node on the m-th observation dimension of the observed value at time t, 1≤i≤n, each component represents a node metric of interest, and the metrics of isomorphic computing nodes are exactly the same , the value of m is determined by the total number of metrics;

步骤1.3，构成计算节点i对应的数据流S_i＝{s_i1,s_i2,s_i3,...,s_it}，S_i是一个有序但无限的数据观测序列；Step 1.3, construct the data flow S _i corresponding to computing node i = {s _i1 , s _i2 , s _i3 ,..., s _it }, S _i is an ordered but infinite data observation sequence;

步骤1.4，n个计算节点的数据流构成数据流集S＝{S₁,S₂,...S₃,...S_i,...,S_n}；Step 1.4, the data streams of n computing nodes form a data stream set S={S ₁ , S ₂ ,...S ₃ ,...S _i ,...,S _n };

步骤1.5，记录t时刻的数据流集S的快照S^t＝[s_it|s_it∈S_i,S_i∈S,1≤i≤n]，数据流集S的快照S^t是由时刻t在每个计算节点上获取的观测值构成的；因此根据快照的定义，数据流集S又可以表示成快照集，即S＝{S¹,S²,...,S^t,...}。Step 1.5, record the snapshot S t of the data flow set S at time ^t = [s _it |s _it ∈ S _i , S _i ∈ S, 1≤i≤n], the snapshot S ^t of the data flow set S is determined by the time t It is composed of observations obtained on each computing node; therefore, according to the definition of snapshot, the data flow set S can be expressed as a snapshot set, that is, S={S ¹ ,S ² ,...,S ^t ,... }.

步骤2.1，给定时刻t的快照S^t，构造n×m矩阵用于描述时刻t各个计算节点的瞬时行为；Step 2.1, given the snapshot S ^t at time t, construct an n×m matrix Used to describe the instantaneous behavior of each computing node at time t;

步骤2.2，对矩阵M_t的任意列向量1≤d≤m，表示时刻t第i个计算节点的观测值，采用0-1归一化方式计算得到其中表示第d个观测度量在n个观测值中的最小值，表示第d个观测度量在n个观测值中的最大值，进而得到新矩阵 Step 2.2, for any column vector of matrix M _t 1≤d≤m, Indicates the observation value of the i-th computing node at time t, calculated by 0-1 normalization in Indicates the minimum value of the dth observation metric among n observations, Indicates the maximum value of the dth observation metric among the n observations, and then a new matrix is obtained

步骤2.3，对于矩阵M_t'的任意列向量计算任意列向量的均值得到偏差矩阵其中 Step 2.3, for any column vector of matrix M _t ' compute an arbitrary column vector mean of get the bias matrix in

步骤2.4，计算每个计算节点在时刻t产生的数据流快照的异常量化值N_it，具体步骤为：Step 2.4, calculate the abnormal quantitative value N _it of the data flow snapshot generated by each computing node at time t, the specific steps are:

步骤2.4.1，初始化列向量对于任意的观测维度k，1≤k≤m，计算偏差平方则t时刻第k个观测维度的方差 Step 2.4.1, initialize the column vector For any observation dimension k, 1≤k≤m, calculate the square of the deviation Then the variance of the kth observation dimension at time t

步骤2.4.2，初始化用于存放每个观测维度的异常量化值的列向量其中表示t时刻第k个观测维度的熵，表示取留一方差情形下第k个观测维度的熵， M_t″(f,k)是矩阵M_t″第f行第k列的值；因此有 Step 2.4.2, initialize the column vector used to store the abnormal quantization value of each observation dimension in Indicates the entropy of the kth observation dimension at time t, means take-and-leave variance The entropy of the kth observation dimension in the case, M _t ″(f, k) is the value of the fth row and kth column of the matrix M _t ″; therefore,

步骤2.4.3，计算时刻t第i个计算节点的数据流快照异常量化值 Step 2.4.3, calculate the abnormal quantitative value of the data flow snapshot of the ith computing node at time t

步骤3，数据流异常量化，由于数据流中普遍存在瞬时波动和阶段迁移现象，仅通过数据流快照识别计算节点异常，会导致较高的误警率；为了缓解这一个问题，在对数据流快照异常量化的基础上，考虑数据流历史数据中展示的集体行为进一步对候选的异常数据流进行量化，并集成数据流快照异常量化结果和单个数据流中历史数据对于异常量化结果的影响力，产生最终的异常量化结果，具体过程如下：Step 3: Quantify data flow anomalies. Due to the common phenomenon of instantaneous fluctuations and phase transitions in data flow, identifying computing node anomalies only through data flow snapshots will lead to a high false alarm rate; in order to alleviate this problem, in the data flow On the basis of snapshot anomaly quantification, consider the collective behavior displayed in the historical data stream data to further quantify the candidate anomalous data streams, and integrate the abnormal quantification results of data stream snapshots and the influence of historical data in a single data stream on the abnormal quantification results. Generate the final anomaly quantification result, the specific process is as follows:

步骤3.1，给定任意的数据流S_i，记录时刻t之前的所有快照异常量化值{N_i0,N_i1,...,N_i(t-1)}，计算这t个快照异常量化值对于当前时刻t的影响力I_it，其中U_t表示影响力衰减函数，本发明中选取指数衰减函数控制影响力衰减程度，即U_t＝e^-λkt，其中λ>0是衰减快慢阈值，因此有：Step 3.1, given any data stream S _i , record all snapshot anomaly quantization values {N _i0 , N _i1 ,...,N _i(t-1) } before time t, and calculate the t snapshot anomaly quantization values For the influence I _it at the current moment t, Among them, U _t represents the influence decay function. In the present invention, an exponential decay function is selected to control the degree of influence decay, that is, U _t =e ^-λkt , where λ>0 is the decay speed threshold, so there are:

I_it＝N_i(t-1)e^-λ+N_i(t-2)e^-2λ+N_i(t-1)e^-3λ+...I _it ＝N _i(t-1) e ^-λ +N _i(t-2) e ^-2λ +N _i(t-1) e ^-3λ +...

＝e^-λ(N_i(_t-1)+e^-λ(N_i(t-2)+e^-λ(N_i(t-3)+...；=e ^-λ (N _i ( _t-1 )+e ^-λ (N _i(t-2) +e ^-λ (N _i(t-3) +...;

＝e^-λ(N_i(t-1)+I_i(t-1))＝e ^-λ (N _i(t-1) +I _i(t-1) )

步骤3.2，综合考虑多数据流快照异常量化结果和单个数据流中历史数据对于异常量化结果的影响力，得出数据流S_i的异常量化结果值N_i＝N_it+I_it。In step 3.2, comprehensively consider the abnormal quantization results of multiple data stream snapshots and the influence of historical data in a single data stream on the abnormal quantification results, and obtain the abnormal quantization result value N _i =N _it +I _it of the data stream S _i .

步骤4.1，给定数据流集S＝{S₁,S₂,...,S_n}对应的异常量化结果序列N＝{N₁,N₂,...,N_n}，对异常量化结果序列N按照从小到大的方式进行排序，得到新的序列N'＝{N′₁,N'₂,...,N'_n}，并记录下标映射关系v＝R(u)，1≤u,v≤n，表示新序列的第u个异常量化结果对应于原序列的第v个异常量化结果；Step 4.1, given the abnormal quantization result sequence N={N ₁ ,N ₂ ,...,N _n } corresponding to the data stream set S={S ₁ ,S ₂ ,...,S _n }, quantify the abnormal The result sequence N is sorted from small to large, and a new sequence N'={N' ₁ ,N' ₂ ,...,N' _n } is obtained, and the subscript mapping relationship v=R(u) is recorded, 1≤u, v≤n, indicating that the uth abnormal quantization result of the new sequence corresponds to the vth abnormal quantization result of the original sequence;

步骤4.2，设异常量化结果的中位值的下标为mIdx，有计算异常量化结果的中位值N_median＝N'_mIdx，异常量化结果的最小值N_min＝N′₁，异常量化结果的最大值N_max＝N'_n和最大异常量化值分量d_upper；Step 4.2, set the subscript of the median value of the abnormal quantization result as mIdx, and have Calculate the median value N _median =N' _mIdx of the abnormal quantization result, the minimum value N _min =N' ₁ of the abnormal quantization result, the maximum value N _max =N' _n of the abnormal quantization result and the maximum abnormal quantization value component d _upper ;

步骤4.3，设定idx＝mIdx+1；Step 4.3, set idx=mIdx+1;

步骤4.4，若N′_idx>max(2(N_median-N_min),N_min+d_upper)，其中max表示求2个数最大值的函数，当数据流无异常时，N_median-N_min应当近似等于N_max-N_min的一半，所以若大于2(N_median-N_min)时，表示数据流S_R(idx)是一个异常数据流，产生异常告警；否则，数据流S_R(idx)是一个正常数据流；Step 4.4, if N′ _idx >max(2(N _median -N _min ),N _min +d _upper ), where max represents the function to find the maximum value of 2 numbers, when the data flow is normal, N _median -N _min It should be approximately equal to half of N _max -N _min , so if it is greater than 2(N _median -N _min ), it means that the data flow S _R(idx) is an abnormal data flow, and an abnormal alarm is generated; otherwise, the data flow S _{R(idx )} is a normal data stream;

若表示在数据流抖动的情形下，存在抖动数据流与无抖动数据流之间异常量化的最大偏差值，当数据流存在抖动，数据流的异常量化结果大于N_min+d_upper时，则数据流S_R(idx)是一个异常数据流，产生异常告警；否则，数据流S_R(idx)是一个正常数据流；like Indicates that in the case of data flow jitter, there is the maximum deviation value of abnormal quantization between the jitter data flow and the non-jitter data flow. When there is jitter in the data flow and the abnormal quantization result of the data flow is greater than N _min +d _upper , the data flow S _R(idx) is an abnormal data flow, which generates an abnormal alarm; otherwise, the data flow S _R(idx) is a normal data flow;

步骤4.5，若idx<n，则idx←idx+1，跳转到步骤4.4，否则，步骤4结束。Step 4.5, if idx<n, then idx←idx+1, jump to step 4.4, otherwise, end step 4.

利用本发明的方法进行计算得到的数据与现有Storm算法和LOF算法进行比较，可以得出本发明的算法在准确率、召回率和综合评价指标上均优于对比算法。实验数据来源于16个计算节点的监测数据，每个计算节点观测维度为10维，分别是CPU使用率、内存使用率、内存页换入次数、内存页换出次数、Disk读次数、Disk写次数、Disk读字节数、Disk写字节数、网卡接收字节数、网卡发送字节数等，实验中注入了2种类型的异常，分别为内存泄露和CPU泄露结果，注入持续时间为1000s，实验结果如表1所示。Comparing the data calculated by the method of the present invention with the existing Storm algorithm and the LOF algorithm, it can be concluded that the algorithm of the present invention is superior to the comparison algorithm in terms of accuracy, recall and comprehensive evaluation indicators. The experimental data comes from the monitoring data of 16 computing nodes. The observation dimension of each computing node is 10 dimensions, which are CPU usage rate, memory usage rate, memory page swapping times, memory page swapping times, Disk read times, and Disk write times. The number of times, the number of bytes read by Disk, the number of bytes written by Disk, the number of bytes received by the NIC, the number of bytes sent by the NIC, etc. In the experiment, two types of exceptions were injected, namely memory leak and CPU leak results, and the injection duration was 1000s, the experimental results are shown in Table 1.

表1不同算法的实验结果比较Table 1 Comparison of experimental results of different algorithms

。 .

Claims

1. a kind of multiple data stream method for detecting abnormality based on context, it is characterised in that comprise the following steps：

Step 1, multiple data stream is obtained and snapshot is generated, and process is as follows：

A distributed computing system being made up of the calculate node of n isomorphism is given, record moment t synchronously obtains each calculating The observation of node, according to observation the data flow of each calculate node is calculated, and all calculate nodes are remembered in the data flow of t For snapshot；

Step 2, multiple data stream snapshot quantifies extremely, and process is as follows：

The snapshot structural matrix M of moment t is used to describe the transient behavior of each calculate node of moment t；Each row to matrix M Vector calculates normalized value using 0-1 normalization mode one by one, obtains new matrix M'；Calculate each column vector after normalization Average, obtain deviation matrix M "；Calculate the abnormal quantized value of the data flow snapshot that each calculate node is produced in moment t；

Step 3, data flow anomaly quantifies, and process is as follows：

Arbitrary data flow is given, all snapshots exception quantized value before moment t is calculated for the influence power of current time t, Consider multiple data stream snapshot exception quantized result and individual traffic in historical data for the impact of abnormal quantized result Power, show that the abnormal of data flow quantifies end value；

Step 4, data flow anomaly identification, process is as follows：

Result obtained by abnormal quantization is ranked up according to mode from small to large；Calculate the middle position of abnormal quantized value after sequence Value, maximum, minimum of a value and maximum deflection difference value, and detection data stream, specially：

Step 4.1, data-oriented adfluxion S={ S₁,S₂,...,S_nCorresponding abnormal quantized result sequence N₁,N₂,...,N_n, it is right Abnormal quantized result sequence is ranked up according to mode from small to large, obtains new sequence N '₁,N'₂,...,N'_n, and record Subscript mapping relations v=R (u), 1≤u, v≤n represent the v of u-th abnormal quantized result corresponding to former sequence of new sequence Individual abnormal quantized result；

Step 4.2, if being designated as mIdx under the median of abnormal quantized result, hasCalculate abnormal quantized result Median N_median=N'_mIdx, minimum of a value N of abnormal quantized result_min=N '₁, maximum N of abnormal quantized result_max=N'_n With maximum exception quantized value component d_upper；

Step 4.3, sets idx=mIdx+1；

Step 4.4, if N '_idx＞ max (2 (N_median-N_min),N_min+d_upper), represent data flow S_R(idx)It is an abnormal data Stream, produces abnormality alarming；IfRepresent in the case of data dither flow there is shake data flow and non-jitter number According to the maximum deflection difference value quantified extremely between stream, shake when data flow is present, the abnormal quantized result of data flow is more than N_min+ d_upperWhen, then data flow S_R(idx)It is an abnormal data stream, produces abnormality alarming；

Step 4.5, if idx is ＜ n, idx=idx+1, jumps to step 4.4, and otherwise, step 4 terminates.

2. the multiple data stream method for detecting abnormality based on context according to claim 1, it is characterised in that step 1 is most Obtain and comprising the following steps that snapshot is generated according to stream：

Step 1.1, gives a distributed computing system being made up of the calculate node of n isomorphism, and each calculate node has h Identical tolerance；

Step 1.2, the observation of i-th calculate node of synchronous recording tWhereinRepresent i-th Component of the calculate node in m-th observation dimension of the observation of moment t, 1≤i≤n, 1≤m≤h；

Step 1.3, constitutes corresponding data flow S of calculate node i_i={ s_i1,s_i2,s_i3,...,s_it}；

Step 1.4, the data flow of n calculate node constitutes data adfluxion S={ S₁,S₂,...S₃,...S_i,...,S_n}；

Step 1.5, records the snapshot S of data adfluxion S of t^t=[s_it|s_it∈S_i,S_i∈S,1≤i≤n]。

3. the multiple data stream method for detecting abnormality based on context according to claim 1, it is characterised in that step 2 is most According to comprising the following steps that stream snapshot quantifies extremely：

Step 2.1, the snapshot S of given time t^t, construct n × m matrixes

Step 2.2, to matrix M_tAny column vector1≤d≤m, is calculated using 0-1 normalization modeAnd then obtain new matrix

Step 2.3, calculates any column vectorAverageObtain deviation matrixWherein

Step 2.4, calculates the abnormal quantized value N of the data flow snapshot that each calculate node is produced in moment t_it, concretely comprise the following steps：

Step 2.4.1, initializes column vectorWherein1≤k≤m, calculates k-th observation dimension Variance

Step 2.4.2, calculates the data flow snapshot exception quantized value component of each calculate nodeWherein1≤f≤n, M_t" (f, k) be matrix The value of M " f rows kth row；

Step 2.4.3, calculates the data flow snapshot exception quantized value of i-th calculate node of moment t

4. the multiple data stream method for detecting abnormality based on context according to claim 3, it is characterised in that the number of step 3 Quantify according to throat floater, step is as follows：

Step 3.1, gives arbitrary data flow S_i, record all snapshots exception quantized value { N before moment t_i0,N_i1,..., N_i(t-1), this t snapshot exception quantized value is calculated for influence power I of current time t_it,Wherein U_t =e^-λktFor influence power attenuation function, wherein λ ＞ 0 are decay speed threshold values；

Step 3.2, consider multiple data stream snapshot exception quantized result and individual traffic in historical data for abnormal amount Change the influence power of result, draw data flow S_iAbnormal quantify end value N_i=N_it+I_it。