CN112562771B

CN112562771B - Disk anomaly detection method based on neighborhood partition and isolation reconstruction

Info

Publication number: CN112562771B
Application number: CN202011564817.3A
Authority: CN
Inventors: 高欣; 查森; 贾欣; 李康生; 刘治宇; 任昺; 张光耀; 黄子健
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2022-07-26
Anticipated expiration: 2040-12-25
Also published as: CN112562771A

Abstract

An embodiment of the present invention provides a disk anomaly detection method based on neighborhood partitioning and isolation reconstruction, comprising: collecting disk SMART information, selecting the effective disk feature attributes to form a data set, and applying exponential smoothing to obtain a disk training set; randomly sampling the training set multiple times to obtain multiple sub-training sets, constructing in each subset a disk feature isolation region around each point whose radius is the distance from that point to its nearest point, and treating test points that belong to no region as global anomalies; for a test point that is not a global anomaly, taking the ratio of the radii of the regions of its two successive nearest neighbors as the pre-reconstruction anomaly value of the test point in that region; rebuilding the regions after including the test point, and taking the ratio of the radius of the test point's region before and after reconstruction as its post-reconstruction anomaly value in that region; and combining the pre- and post-reconstruction anomaly values over all regions containing the test point to obtain an anomaly score. The technical solution provided by the embodiment can effectively improve the recall of anomalous disks.

Description

A Disk Anomaly Detection Method Based on Neighborhood Partitioning and Isolation Reconstruction

[Technical Field]

The invention relates to anomaly detection methods in the field of machine learning, and in particular to a disk anomaly detection method based on neighborhood partitioning and isolation reconstruction.

[Background]

Disks are currently the most widely used medium for storing computer data, and their operating condition bears directly on the safety of the stored data. A data center typically holds hundreds or thousands of disks, which greatly increases the likelihood of a system failure. Data centers therefore need mechanisms for detecting disk anomalies, so that data is not irreversibly damaged or lost.

The disk anomaly detection method in common use today is threshold detection based on SMART data. By issuing detection commands inside the disk, it monitors and records the operating condition of the disk hardware and compares the readings with preset safety values set by the manufacturer. If an attribute exceeds, or is about to exceed, the safe range, the host's monitoring hardware or software automatically alerts the user and performs minor automatic repairs to protect the data on the disk. However, the success rate of SMART warnings alone is low, only 3% to 10%, which falls short of practical requirements, so the method needs further improvement.

A disk has dozens of SMART attributes. Analyzing and training on them means processing a large amount of data, and the attributes are related to one another. Machine learning can learn from such large volumes of data and discover the relationships among the attributes on its own, processing the data by building a suitable learning model. A large amount of data also allows the model to be refined continuously, which improves the accuracy of disk failure prediction and so achieves the goal of detecting disk failures.

When machine learning is applied to disk anomaly detection, the distribution of normal and anomalous disk data is extremely imbalanced: anomalous samples are far fewer than normal samples, or are absent altogether. With very few anomalous samples, supervised methods cannot solve the problem effectively, so unsupervised algorithms are considered instead, among which the isolation forest algorithm performs well. Isolation forest repeatedly cuts the data space, and each subspace produced by earlier cuts, with a random hyperplane until every subspace contains a single data point or a preset termination condition is reached. The algorithm can work with normal samples only and handles classification under extremely skewed sample distributions fairly effectively, but it cannot effectively detect local anomalies or enclosed anomalies. The isolation-based nearest-neighbor method handles these problems well on low-dimensional data, but it struggles to detect anomalies in high-dimensional spaces and to identify local and enclosed anomalies under varying density conditions. It is therefore necessary to combine the number and positions of the isolation regions in which a test point lies, so as to locate the test point more precisely and identify anomalous disks effectively.

[Summary of the Invention]

In view of this, an embodiment of the present invention provides a disk anomaly detection method based on neighborhood partitioning and isolation reconstruction, to solve the problems of accurately locating samples under varying density conditions and of detecting special anomalies within a region.

The method proposed by the embodiment comprises:

collecting disk SMART information, selecting effective disk attribute features to form a data set, and applying exponential smoothing to obtain a stable disk training set;

randomly sampling the disk data set multiple times to obtain multiple sub-training sets, computing with the Euclidean distance the distance from each point in a subset to its nearest point, constructing a disk isolation region with that distance as its radius, and treating test points that belong to no region as global anomalies;

for a test point that is not a global anomaly, finding every training point whose region contains it together with that training point's nearest training point, and taking the ratio of the radii of the regions of these two points as the pre-reconstruction anomaly measure of the test point in that region;

rebuilding the regions after including the test point, and taking the ratio of the radius of the test point's region after reconstruction to its radius before reconstruction as the post-reconstruction anomaly measure of the test point in that region;

combining the two measures to obtain the reconstruction score of the test point in one region, taking the reciprocal of the sum of the reconstruction scores over all regions containing the test point as the isolation score, and taking the average of the isolation scores over the multiple subsets as the anomaly score of the test point.

In the above method, the step of collecting disk SMART information and building the smoothed training set is specified as follows: disk SMART attribute data is collected, and the SMART attributes that have no missing values and change meaningfully over time are selected. The collected data forms the data set, and exponential smoothing turns each SMART attribute into a sequence that can be used to build the model. The exponential smoothing formula is:

$$S_t = \alpha \cdot Y_t + (1-\alpha) \cdot S_{t-1}$$

where t is time, Y_t is the actual value of the t-th data point, and S_t is the smoothed value of the first t data points, computed recursively from the actual value at time t and the smoothed value of the first t-1 data points. The window width is fixed at k, k ∈ [1,5], set according to the actual situation; a smaller k gives weaker smoothing but higher sensitivity to new changes in the data. The parameter α, α ∈ [0,1], controls how quickly older observations decay: the closer α is to 1, the closer the smoothed value is to the data value at the current time.

In the above method, the step of randomly sampling the disk data set and constructing disk isolation regions is specified as follows: simple random sampling is performed on the training set D multiple times to obtain sub-training sets S_i of sample size ψ, where i is an integer with 1 ≤ i ≤ t, and t, the number of subsets, is chosen according to the actual situation. In each sub-training set S_i, distances between points are computed with the Euclidean distance. Each training point a is taken as a region center, and the distance τ(a) from a to its nearest training point η_a is taken as the region radius to construct a disk isolation region that isolates a from the other training points in the subset, where a, η_a ∈ S_i. The radius distance τ(a) of point a is defined as:

$$\tau(a) = \lVert a - \eta_a \rVert$$

For each sub-training set S_i, let c ∈ S_i be the training point nearest to the test point x. The test point x is a global anomaly if and only if τ(c) < τ(x), where τ(x) and τ(c) are the radius distances of x and c, respectively.

In the above method, the step of computing the pre-reconstruction anomaly measure for a test point that is not a global anomaly is specified as follows: for each sub-training set S_i, let b be any training point in S_i, and let B(b) denote the hypersphere centered at b with radius τ(b); any training point y in B(b) then satisfies ||y - b|| < τ(b). For a test point x that is not a global anomaly, let c be the training point nearest to x and η_c the training point nearest to c, with c, η_c ∈ S_i. B(η_c) and B(c) are the hypersphere regions centered at η_c and c with radii τ(η_c) and τ(c), respectively. The ratio of the radii of B(η_c) and B(c),

$$\frac{\tau(\eta_c)}{\tau(c)}$$

is taken as the pre-reconstruction anomaly measure of the test point x in this region. The larger the ratio, the smaller the relative difference between the radii of B(η_c) and B(c) and the less anomalous the test point; conversely, the smaller the ratio, the more anomalous the test point.

In the above method, the step of rebuilding the regions after including the test point is specified as follows: for each test point x, let c ∈ S_i be the training point of a region in which x lies. After x is included, the regions of the original training points are rebuilt according to the nearest-neighbor principle, and the radius of the rebuilt region of c is denoted τ(c)′. The ratio of the region's radius after reconstruction to its radius before reconstruction,

$$\frac{\tau(c)'}{\tau(c)}$$

is the post-reconstruction anomaly measure of the test point in this region. The larger the ratio, the smaller the relative difference between the radius after reconstruction and the radius before it, and the less anomalous the test point; conversely, the smaller the ratio, the more anomalous the test point.

In the above method, the step of combining the two measures into an anomaly score is specified as follows: for each test point x, let c ∈ S_i be the training point of a region in which x lies. The reconstruction score R(x) of x in that region combines the pre-reconstruction and post-reconstruction anomaly measures, where τ(c)′ is the radius of the isolation region centered at c after reconstruction. The reciprocal of the sum of the reconstruction scores over all regions containing the test point is the isolation score A(x):

$$A(x) = \frac{1}{\sum_{j=1}^{k} R_j(x)}$$

where k is the number of regions in which the test point x lies and R_j(x) denotes the reconstruction score of x in the j-th such region. The normal-sample data set is sampled multiple times to obtain sub-training sets {S_1, S_2, ..., S_t}, where t, the number of subsets, is chosen according to the actual situation. The isolation score of x is computed in each subset S_i (1 ≤ i ≤ t), and the anomaly score of x is the average of these isolation scores:

$$\frac{1}{t}\sum_{i=1}^{t} A_i(x)$$

where A_i(x) is the isolation score of the test point x in the i-th subset. The anomaly score measures how anomalous x is: the larger the score, the more anomalous the test point x; the smaller the score, the less anomalous it is.

[Brief Description of the Drawings]

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of the disk anomaly detection method based on neighborhood partitioning and isolation reconstruction proposed by an embodiment of the present invention;

FIG. 2 is a schematic diagram of how the disk anomaly detection method based on neighborhood partitioning and isolation reconstruction proposed by an embodiment of the present invention computes the isolation score of a test point.

[Detailed Description]

For a better understanding of the technical solutions of the present invention, the embodiments are described in detail below with reference to the drawings.

It should be clear that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a nearest-neighbor anomaly detection method based on an edge sample density measure. FIG. 1 is a schematic flowchart of the method, which comprises the following steps:

Step 101: collect disk SMART information, select effective disk attribute features to form a data set, and apply exponential smoothing to the data set to obtain a stable disk training set.

Specifically, disk SMART attribute data is collected, and the SMART attributes that have no missing values and change meaningfully over time are selected. For each of the following attributes, both the current value and the original (raw) value are used: low-level data read error rate, disk read/write throughput performance, spindle spin-up time, start/stop count, remapped sector count, seek error rate, seek performance, accumulated power-on time, spindle spin-up retry count, power-on cycle count, serial port downshift error count, I/O error detection and correction, uncorrectable errors, command timeout, high-fly writes, airflow temperature, shock error rate, power-off retract count, head load/unload count, temperature, programming error block count, current pending sector count, offline uncorrectable sector count, Ultra access check error rate, head flying time/transfer error rate, total LBAs written, and total LBAs read. The collected data forms the data set, and exponential smoothing turns each SMART attribute into a sequence that can be used to build the model. The exponential smoothing formula is:

$$S_t = \alpha \cdot Y_t + (1-\alpha) \cdot S_{t-1}$$

where t is time, Y_t is the actual value of the t-th data point, and S_t is the smoothed value of the first t data points, computed recursively from the actual value at time t and the smoothed value of the first t-1 data points. The window width is fixed at k, k ∈ [1,5], set according to the actual situation; a smaller k gives weaker smoothing but higher sensitivity to new changes in the data. The parameter α, α ∈ [0,1], controls how quickly older observations decay: a large α assigns lower weights to more distant observations, and the closer α is to 1, the closer the smoothed value is to the data value at the current time.
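The smoothing recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `exponential_smoothing`, the seeding choice S_1 = Y_1, and the sample values are assumptions, and the sketch does not model the window width k.

```python
def exponential_smoothing(values, alpha):
    """Return the series S_t = alpha * Y_t + (1 - alpha) * S_{t-1}."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    smoothed = []
    for y in values:
        if not smoothed:
            # Assumption: seed the recursion with the first observation (S_1 = Y_1).
            smoothed.append(float(y))
        else:
            smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical readings of one SMART attribute on five consecutive days.
raw = [10, 10, 12, 30, 11]
print(exponential_smoothing(raw, alpha=0.5))  # [10.0, 10.0, 11.0, 20.5, 15.75]
```

With α = 0.5 the spike at the fourth reading is damped to 20.5; with α = 1 the output equals the input, which matches the statement that a larger α tracks the current value more closely.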

Step 102: randomly sample the disk data set multiple times to obtain multiple sub-training sets, compute with the Euclidean distance the distance from each point in a subset to its nearest point, construct a disk isolation region with that distance as its radius, and treat test points that belong to no region as global anomalies.

Specifically, simple random sampling is performed on the training set D multiple times to obtain sub-training sets S_i of sample size ψ, where i is an integer with 1 ≤ i ≤ t, and t, the number of subsets, is chosen according to the actual situation. In each sub-training set S_i, distances between points are computed with the Euclidean distance. Each training point a is taken as a region center, and the distance τ(a) from a to its nearest training point η_a is taken as the region radius to construct a disk isolation region that isolates a from the other training points in the subset, where a, η_a ∈ S_i. The radius distance τ(a) of point a is defined as:

$$\tau(a) = \lVert a - \eta_a \rVert$$

For each sub-training set S_i, let c ∈ S_i be the training point nearest to the test point x. The test point x is a global anomaly if and only if τ(c) < τ(x), where τ(x) and τ(c) are the radius distances of x and c, respectively; this comparison is the dividing line that determines whether x is a global anomaly.
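The sampling, radius, and global-anomaly rules of step 102 can be sketched as follows. The function names are hypothetical, and the sketch assumes that the radius distance τ(x) of a test point is its distance to the nearest training point c, by analogy with the definition of τ(a).

```python
import math
import random


def draw_subsets(training_set, psi, t, seed=0):
    """Draw t simple random samples of size psi (without replacement)."""
    rng = random.Random(seed)
    return [rng.sample(training_set, psi) for _ in range(t)]


def region_radius(i, subset):
    """tau(a): Euclidean distance from subset[i] to its nearest other training point."""
    return min(math.dist(subset[i], q) for k, q in enumerate(subset) if k != i)


def is_global_anomaly(x, subset):
    """Apply the rule tau(c) < tau(x), with c the training point nearest to x."""
    c = min(range(len(subset)), key=lambda k: math.dist(x, subset[k]))
    tau_x = math.dist(x, subset[c])  # assumption: tau(x) is the distance from x to c
    tau_c = region_radius(c, subset)
    return tau_c < tau_x


# Hypothetical two-dimensional sub-training set: the corners of a unit square.
subset = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print([region_radius(i, subset) for i in range(len(subset))])  # [1.0, 1.0, 1.0, 1.0]
print(is_global_anomaly((0.4, 0.1), subset))  # False
print(is_global_anomaly((3.0, 3.0), subset))  # True
```

The point (3.0, 3.0) lies outside every region, so it is flagged as a global anomaly; (0.4, 0.1) falls inside the region of (0.0, 0.0) and moves on to the scoring steps.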

Step 103: for a test point that is not a global anomaly, find every training point whose region contains it together with that training point's nearest training point, and take the ratio of the radii of the regions of these two points as the pre-reconstruction anomaly measure of the test point in that region.

Specifically, for each sub-training set S_i, let b be any training point in S_i, and let B(b) denote the hypersphere centered at b with radius τ(b). Any training point y in B(b) then satisfies:

$$\lVert y - b \rVert < \tau(b)$$

For a test point x that is not a global anomaly, let c be the training point nearest to x and η_c the training point nearest to c, with c, η_c ∈ S_i. B(η_c) and B(c) are the hypersphere regions centered at η_c and c with radii τ(η_c) and τ(c), respectively. The ratio of the radii of B(η_c) and B(c),

$$\frac{\tau(\eta_c)}{\tau(c)}$$

is the anomaly measure of the training point c relative to its neighborhood, and it is taken as the pre-reconstruction anomaly measure of the test point x in this region. The larger the ratio, the smaller the relative difference between the radii of B(η_c) and B(c) and the less anomalous the test point; conversely, the smaller the ratio, the larger the relative difference between the radii of B(η_c) and B(c) and the more anomalous the test point.
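A minimal sketch of the pre-reconstruction measure follows. It assumes that the ratio is τ(η_c)/τ(c), which is how the phrase "ratio of the radii of B(η_c) and B(c)" reads, and it evaluates the measure only for the training point c nearest to x; the function names and data are hypothetical.

```python
import math


def nearest_neighbor(i, subset):
    """Index of, and distance to, the training point closest to subset[i]."""
    j = min((k for k in range(len(subset)) if k != i),
            key=lambda k: math.dist(subset[i], subset[k]))
    return j, math.dist(subset[i], subset[j])


def pre_reconstruction_measure(x, subset):
    """Assumed form of the measure: tau(eta_c) / tau(c), c being nearest to x."""
    c = min(range(len(subset)), key=lambda k: math.dist(x, subset[k]))
    eta_c, tau_c = nearest_neighbor(c, subset)
    _, tau_eta_c = nearest_neighbor(eta_c, subset)
    return tau_eta_c / tau_c


# Hypothetical subset: two close points and one isolated point on a line.
subset = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
print(pre_reconstruction_measure((0.2, 0.0), subset))  # 1.0
print(pre_reconstruction_measure((3.5, 0.0), subset))  # 1/3, about 0.333
```

The test point near the isolated training point (4.0, 0.0) gets the smaller ratio, which the text reads as a higher degree of anomaly.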

Step 104: rebuild the regions after including the test point, and take the ratio of the radius of the test point's region after reconstruction to its radius before reconstruction as the post-reconstruction anomaly measure of the test point in that region.

Specifically, for each test point x, let c ∈ S_i be the training point of a region in which x lies. After x is included, the regions of the original training points are rebuilt according to the nearest-neighbor principle, and the radius of the rebuilt region of c is denoted τ(c)′. The ratio of the region's radius after reconstruction to its radius before reconstruction,

$$\frac{\tau(c)'}{\tau(c)}$$

is the post-reconstruction anomaly measure of the test point in this region. The larger the ratio, the smaller the relative difference between the radius after reconstruction and the radius before it, and the less anomalous the test point; conversely, the smaller the ratio, the more anomalous the test point.
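The reconstruction step can be sketched as below, assuming the measure is τ(c)′/τ(c) for every training point c whose region contains x. "Rebuilding" is implemented here by recomputing the nearest-neighbor distance of c in the subset extended with x; the function names and data are hypothetical.

```python
import math


def radius(i, points):
    """Distance from points[i] to its nearest other point."""
    return min(math.dist(points[i], q) for k, q in enumerate(points) if k != i)


def post_reconstruction_measures(x, subset):
    """Map each training index c with x inside B(c) to tau(c)' / tau(c)."""
    extended = list(subset) + [x]  # training points plus the test point
    measures = {}
    for i, c in enumerate(subset):
        tau_c = radius(i, subset)
        if math.dist(x, c) < tau_c:  # x lies inside the region of c
            tau_c_new = radius(i, extended)  # radius after rebuilding with x
            measures[i] = tau_c_new / tau_c
    return measures


# Hypothetical subset: two close points and one isolated point on a line.
subset = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
print(post_reconstruction_measures((3.5, 0.0), subset))  # {2: 0.1666...}
print(post_reconstruction_measures((0.9, 0.0), subset))  # keys 0 and 1
```

Because adding a point can only shorten a nearest-neighbor distance, the ratio never exceeds 1 in this sketch.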

Step 105: combine the two measures to obtain the reconstruction score of the test point in one region, take the reciprocal of the sum of the reconstruction scores over all regions containing the test point as the isolation score, and take the average of the isolation scores over the multiple subsets as the anomaly score of the test point.

Specifically, for each test point x, let c ∈ S_i be the training point of a region in which x lies. The reconstruction score R(x) of x in that region combines the pre-reconstruction and post-reconstruction anomaly measures. The isolation score of x in a subset is the reciprocal of the sum of its reconstruction scores over all regions containing it, and the anomaly score of x is the average of the isolation scores A_i(x) over the subsets, where A_i(x) is the isolation score of the test point x in the i-th subset. The anomaly score measures how anomalous x is: the larger the score, the more anomalous the test point x; the smaller the score, the less anomalous it is.
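The aggregation in step 105 can be illustrated as follows. Because the formula for the reconstruction score R(x) is not reproduced in this text, the sketch takes the per-region reconstruction scores as given inputs; the function names and numbers are hypothetical, and global anomalies (test points that lie in no region) are outside its scope.

```python
def isolation_score(reconstruction_scores):
    """A(x): reciprocal of the sum of x's reconstruction scores in one subset."""
    total = sum(reconstruction_scores)
    if total <= 0:
        # A point that lies in no region is a global anomaly and is not scored here.
        raise ValueError("at least one positive reconstruction score is required")
    return 1.0 / total


def anomaly_score(scores_per_subset):
    """Average of the isolation scores A_i(x) over all t subsets."""
    isolation = [isolation_score(scores) for scores in scores_per_subset]
    return sum(isolation) / len(isolation)


# Hypothetical reconstruction scores of one test point in t = 3 subsets;
# the inner lists hold one score per region that contains the point.
scores = [[0.5, 0.5], [0.25], [2.0]]
print([isolation_score(s) for s in scores])  # [1.0, 4.0, 0.5]
print(anomaly_score(scores))                 # about 1.833
```

Small reconstruction scores, or membership in few regions, push the isolation score up, in line with the statement that a larger anomaly score means a more anomalous test point.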

FIG. 2 is a schematic diagram of how the method computes the isolation score of a test point, where O_m is a test point, m ∈ [1,5], R_m is the region radius of O_m before reconstruction, r_m is the region radius of O_m after reconstruction, and x_n is a test point, n ∈ [1,3]. Disk anomalies are predicted from the relative position of the region in which x_n lies and from the degree to which the isolation regions change before and after x_n is introduced.

Table 1 lists, for five public data sets used as classification tasks, the name, number of samples, dimensionality, and proportion of anomalous samples of each data set, together with the recall of two methods: the disk anomaly detection method based on neighborhood partitioning and isolation reconstruction (NPIR) given in the embodiment of the present invention, and the comparison method, the isolation-based nearest-neighbor algorithm (iNNE). As Table 1 shows, the recall improvement of the proposed method over the comparison method is largest on the Disk data set, at 8.1 percentage points. The method proposed in the embodiment takes full account of the distribution characteristics of edge samples and can effectively solve the problem of detecting local anomalies in the neighborhood of edge samples.

Table 1

| Data set | Number of samples | Dimensions | Anomalous samples (%) | iNNE recall (%) | NPIR recall (%) |
|---|---|---|---|---|---|
| Skin | 9521 | 3 | 3.7 | 78.2 | 83.4 |
| Glass | 214 | 9 | 4.2 | 66.6 | 72.6 |
| Waveform | 3505 | 21 | 4.6 | 61.1 | 68.3 |
| Disk | 35578 | 54 | 1.1 | 63.4 | 71.5 |
| Har | 5195 | 561 | 10.1 | 53.2 | 60.8 |
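The recall gains quoted for Table 1 can be checked with a few lines of arithmetic. The dictionary below simply restates the iNNE and NPIR recall columns of the table; the variable names are illustrative.

```python
# (iNNE recall, NPIR recall) in percent, copied from Table 1.
recall = {
    "Skin": (78.2, 83.4),
    "Glass": (66.6, 72.6),
    "Waveform": (61.1, 68.3),
    "Disk": (63.4, 71.5),
    "Har": (53.2, 60.8),
}

# Improvement in percentage points, rounded to one decimal place.
improvement = {name: round(npir - inne, 1) for name, (inne, npir) in recall.items()}
best = max(improvement, key=improvement.get)

print(improvement)
print(best, improvement[best])  # Disk 8.1
```

The largest gain is on the Disk data set, 8.1 percentage points, consistent with the figure stated in the text.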

综上所述，本发明实施例具有以下有益效果：To sum up, the embodiments of the present invention have the following beneficial effects:

本发明实施的技术方案中，收集磁盘SMART信息并筛选出有效的磁盘特征属性组成数据集，对其进行指数平滑处理得到磁盘训练集；多次随机采样训练集获得多个子训练集，在子集中以各点距其最近点的距离为半径构建磁盘特征隔离区域，将不属于任何区域的测试点作为全局异常；对于非全局异常的测试点，将其连续两个近邻点所在区域半径比作为该测试点在此区域的前异常值；包含测试点后重新构建区域，将测试点所处区域重构前后的半径比作为该测试点在此区域的后异常值；结合测试点所处所有区域的前后异常值得到异常分数。本发明实施例提供的技术方案，充分考虑了在不同密度条件下对磁盘异常样本的判定情况，能有效解决在磁盘样本高维特征分布空间内对异常磁盘判定的问题。In the technical scheme implemented by the present invention, the SMART information of the disk is collected and the effective disk characteristic attributes are selected to form a data set, and the disk training set is obtained by performing exponential smoothing processing on it; the training set is randomly sampled for multiple times to obtain a plurality of sub-training sets. The disk feature isolation area is constructed with the distance between each point and its closest point as the radius, and the test point that does not belong to any area is regarded as the global anomaly; for the non-global anomaly test point, the ratio of the radius of the area where two consecutive adjacent points are located is used as the The former abnormal value of the test point in this area; the area is reconstructed after including the test point, and the radius ratio of the area where the test point is located before and after reconstruction is taken as the back abnormal value of the test point in this area; Before and after outliers get an outlier score. The technical solutions provided by the embodiments of the present invention fully consider the determination of abnormal disk samples under different density conditions, and can effectively solve the problem of abnormal disk determination in the high-dimensional feature distribution space of disk samples.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A disk anomaly detection method based on neighborhood partitioning and isolation reconstruction, characterized in that the method comprises the following steps:

(1) collecting disk SMART information, screening out effective disk attribute features to form a data set, and applying exponential smoothing to the data set to obtain a stable disk training set, specifically: collecting disk SMART attribute data and screening out the SMART attributes that have no missing values and change effectively over time, the screened SMART attributes comprising both the current value and the original value of each of the following: underlying data read error rate, disk read/write throughput performance, spindle spin-up time, start-stop count, remapped sector count, seek error rate, seek performance, accumulated power-on time, spindle spin-up retry count, power-on cycle count, serial port slow-down error count, I/O error detection and correction, uncorrectable errors, command timeout, high fly write, airflow temperature, shock error rate, power-down return count, head load/unload count, temperature, program error block count, current to-be-mapped sector count, offline uncorrectable sector count, Ultra access check error rate, head flight time/transmission error rate, total number of LBA writes, and total number of LBA reads; using the collected data as the data set, the SMART attributes are converted by the exponential smoothing method into sequences that can be used to build the model, the exponential smoothing formula being defined as follows:

$$S_t = \alpha \cdot Y_t + (1-\alpha) \cdot S_{t-1}$$

wherein t is time, Y_t is the actual value of the t-th data point, and S_t is the smoothed value of the first t data points, computed recursively from the actual value at time t and the smoothed value of the first t−1 data points; the window width is fixed at k, k ∈ [1, 5]; the parameter α is the exponential smoothing coefficient, α ∈ [0, 1];
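As an illustration of the recursion above, a minimal Python sketch follows. The function name `exponential_smoothing` and the initialization S_1 = Y_1 are assumptions made for the example: the claim does not state the initial value, and the window width k is not modelled here.

```python
def exponential_smoothing(values, alpha):
    """Apply S_t = alpha * Y_t + (1 - alpha) * S_{t-1} to one SMART series."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    smoothed = []
    for y in values:
        if not smoothed:
            s = float(y)  # assumption: the first smoothed value equals Y_1
        else:
            s = alpha * y + (1.0 - alpha) * smoothed[-1]
        smoothed.append(s)
    return smoothed


# One attribute of one disk, sampled at three successive times.
print(exponential_smoothing([10, 20, 30], alpha=0.5))  # [10.0, 15.0, 22.5]
```

A larger α makes the smoothed series follow the most recent reading more closely; a smaller α weights the history more heavily.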

(2) randomly sampling the disk data set multiple times to obtain multiple sub-training sets, computing by Euclidean distance the distance between each point in a subset and its nearest point in that subset, constructing a disk isolation region with that distance as the radius, and treating test points that belong to no region as global anomalies, specifically: performing simple random sampling on the training set D multiple times to obtain a plurality of sub-training sets S_i of sample size φ, where i is an integer, 1 ≤ i ≤ t, and t is the number of subsets, for which a suitable value may be chosen according to actual conditions; in each sub-training set S_i, computing the distance between points by Euclidean distance, taking each training point a as the center of a region and the distance from a to its nearest training point η_a as the region radius, thereby constructing a disk isolation region that isolates point a from the other training points in the subset, where a, η_a ∈ S_i; the radial distance τ(a) of point a is defined as follows:

$$\tau(a) = \lVert a - \eta_a \rVert$$

for each sub-training set S_i, let c be the training point nearest to the test point x, c ∈ S_i; the test point x is a global anomaly if and only if τ(c) < τ(x), where τ(x) and τ(c) are the radial distances of point x and point c, respectively,

is the boundary that determines whether x is a global anomaly;
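The region construction and the global-anomaly test of step (2) might be sketched as follows for a single sub-training set. The helper names `region_radii` and `is_global_anomaly` are hypothetical, points are represented as tuples, and τ(x) is taken to be the distance from x to its nearest training point c, which is an interpretation of the definition of τ rather than something the claim spells out.

```python
import math


def region_radii(subset):
    """Return {a: tau(a)}, where tau(a) is the Euclidean distance from a
    to its nearest other training point in the same sub-training set."""
    radii = {}
    for i, a in enumerate(subset):
        others = subset[:i] + subset[i + 1:]
        radii[a] = min(math.dist(a, o) for o in others)
    return radii


def is_global_anomaly(x, subset, radii):
    """Apply the criterion tau(c) < tau(x), with c the training point nearest x."""
    c = min(subset, key=lambda p: math.dist(x, p))
    tau_x = math.dist(x, c)  # assumption: tau(x) = ||x - c||
    return radii[c] < tau_x


subset = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
radii = region_radii(subset)
print(is_global_anomaly((10.0, 0.0), subset, radii))  # True
print(is_global_anomaly((0.5, 0.0), subset, radii))   # False
```

Duplicate points in a subset would produce a radius of zero, so a practical implementation would likely need to deduplicate the sample first.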

(3) for a test point that is not a global anomaly, finding every training point whose region contains the test point together with the nearest training point of each such training point, and taking the ratio of the radii of the regions of the corresponding two points as the anomaly metric value of the test point before reconstruction in that region, specifically: for each sub-training set S_i, let b be a training point in S_i, and denote by B(b) the hypersphere centered at b with radius τ(b); any point y in B(b) then satisfies:

$$y : \lVert y - b \rVert < \tau(b)$$

for a test point that is not a global anomaly, let c be the training point nearest to the test point x and η_c the training point nearest to c, η_c ∈ S_i; B(η_c) and B(c) are the hypersphere regions centered at η_c and c with radii τ(η_c) and τ(c), respectively; the ratio of the radii of B(η_c) and B(c),

$$\frac{\tau(\eta_c)}{\tau(c)}$$

is the anomaly measure of training point c relative to its neighborhood, and this ratio is taken as the anomaly metric value of the test point x before reconstruction in this region;
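A minimal sketch of the pre-reconstruction metric for one sub-training set is given below. It assumes the ratio is taken as τ(η_c)/τ(c), following the order in which the two regions are named in the text; the function name `pre_reconstruction_metric` is hypothetical, and the caller is assumed to have already excluded global anomalies.

```python
import math


def pre_reconstruction_metric(x, subset):
    """Ratio of the region radius of eta_c to that of c (assumed ordering),
    where c is the training point nearest x and eta_c is nearest to c."""

    def tau(p):
        # Radius of the isolation region centered at training point p.
        return min(math.dist(p, q) for q in subset if q != p)

    c = min(subset, key=lambda p: math.dist(x, p))
    eta_c = min((q for q in subset if q != c), key=lambda q: math.dist(c, q))
    return tau(eta_c) / tau(c)


subset = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
# x lies inside the large, sparse region around (5, 0).
print(pre_reconstruction_metric((4.5, 0.0), subset))  # 0.25
```

Under this ordering, a small value means that c sits in a much sparser region than its own nearest neighbor.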

(4) rebuilding the regions after the test point is included, and taking the ratio of the radius of the region in which the test point lies after reconstruction to that before reconstruction as the anomaly metric value of the test point after reconstruction in that region, specifically: for each test point x, let c ∈ S_i be a training point whose region contains x; after the test point x is included, the original training points rebuild their regions according to the nearest-neighbor principle, and the radius of the region of point c after reconstruction is denoted τ(c)′; the ratio of the radius after reconstruction to the radius before reconstruction of the region in which the test point x lies,

$$\frac{\tau(c)'}{\tau(c)}$$

is the anomaly metric value of the test point after reconstruction in that region;
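The post-reconstruction metric could be sketched as follows. The sketch assumes that, once x is added, the radius of c becomes the smaller of its old radius and its distance to x, which is what the nearest-neighbor rule gives when a single point is added; `post_reconstruction_metric` is a hypothetical name.

```python
import math


def post_reconstruction_metric(x, c, subset):
    """Return tau(c)' / tau(c) for a training point c whose region contains x."""
    tau_before = min(math.dist(c, q) for q in subset if q != c)
    # After x joins the subset, c's nearest neighbor is either the old one or x.
    tau_after = min(tau_before, math.dist(c, x))
    return tau_after / tau_before


subset = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(post_reconstruction_metric((4.5, 0.0), (5.0, 0.0), subset))  # 0.125
```

Since x lies inside the region of c, the rebuilt radius is never larger than the original one, so the ratio stays at or below 1.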

(5) combining the two metric values to obtain the reconstruction score of the test point in one region, taking the reciprocal of the sum of the reconstruction scores over all regions in which the test point lies as the isolation score, and taking the mean of the isolation scores over the multiple subsets as the anomaly score of the test point, specifically: for each test point x, let c ∈ S_i be a training point whose region contains x; the reconstruction score r(x) is then defined as follows:

wherein τ(c)′ is the radius of the isolation region centered at c after reconstruction; the reciprocal of the sum of the reconstruction scores over all regions in which the test point lies is taken as the isolation score, so the isolation score A(x) is defined as follows:

wherein k is the number of regions in which the test point x lies; the normal-sample data set is sampled multiple times to obtain multiple sub-training sets S_1, S_2, ..., S_t, where t is the number of subsets, for which a suitable value may be chosen according to actual conditions; the isolation score of the test point x is computed in each subset S_i (1 ≤ i ≤ t), and the anomaly score A(x) of the test point x is defined as follows:

wherein A_i(x) is the isolation score of the test point x in the i-th subset; the anomaly score measures the degree of abnormality of x: the larger the anomaly score, the higher the degree of abnormality of the test point x; conversely, the smaller the anomaly score, the lower the degree of abnormality of the test point x.
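The aggregation in step (5) can be outlined as below. Because the formula for r(x) is not reproduced in this text, the sketch takes the combination rule as a caller-supplied function `combine` rather than guessing it; `isolation_score`, `anomaly_score`, and the product rule used in the usage lines are all illustrative assumptions.

```python
from statistics import mean


def isolation_score(metric_pairs, combine):
    """Reciprocal of the summed reconstruction scores within one subset.

    metric_pairs: one (pre, post) pair per region that contains x.
    combine: caller-supplied rule giving r(x) from (pre, post); the actual
             rule is not specified in this text.
    """
    total = sum(combine(pre, post) for pre, post in metric_pairs)
    return 1.0 / total  # raises ZeroDivisionError if every r(x) is zero


def anomaly_score(per_subset_pairs, combine):
    """Mean of the isolation scores A_i(x) over the t sub-training sets."""
    return mean(isolation_score(pairs, combine) for pairs in per_subset_pairs)


# Hypothetical combination rule, chosen only to make the example runnable.
product = lambda pre, post: pre * post
pairs = [[(0.5, 0.5)], [(1.0, 0.5), (1.0, 0.5)]]  # two subsets: k = 1 and k = 2
print(anomaly_score(pairs, product))  # 2.5
```

Test points already flagged as global anomalies in a subset lie in no region there, so they would need to be scored separately before this aggregation is applied.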