CN112562771B - Disk anomaly detection method based on neighborhood partition and isolation reconstruction - Google Patents
- Publication number
- CN112562771B (application number CN202011564817.3A)
- Authority
- CN
- China
- Prior art keywords
- point
- power
- value
- test point
- current value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C29/00—Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
- G11C29/04—Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
- G11C29/08—Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
- G11C29/12—Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
Landscapes
- Digital Magnetic Recording (AREA)
- Debugging And Monitoring (AREA)
Abstract
An embodiment of the present invention provides a disk anomaly detection method based on neighborhood partition and isolation reconstruction. The method collects disk SMART information, selects the valid disk feature attributes to form a data set, and applies exponential smoothing to obtain a disk training set. The training set is randomly sampled multiple times to obtain multiple sub-training sets. In each subset, a disk feature isolation region is built around each point, with the distance from that point to its nearest point as the radius, and a test point that belongs to no region is treated as a global anomaly. For a test point that is not a global anomaly, the ratio of the region radii of its two successive nearest neighbors is taken as the pre-reconstruction anomaly value of the test point in that region. The regions are then rebuilt with the test point included, and the ratio of the radius of the test point's region after and before reconstruction is taken as its post-reconstruction anomaly value in that region. The anomaly score is obtained by combining the pre- and post-reconstruction anomaly values over all regions in which the test point lies. The technical solution provided by the embodiment can effectively improve the recall rate for abnormal disks.
Description
[Technical Field]
The present invention relates to anomaly detection methods in the field of machine learning, and in particular to a disk anomaly detection method based on neighborhood partition and isolation reconstruction.
[Background]
Disks are currently the most widely used medium for computer data storage, and the operating condition of a disk directly affects the safety of the stored data. A data center typically holds hundreds or thousands of disks, which greatly increases the likelihood of a system failure. Data centers therefore need mechanisms that detect disk anomalies so that data is not irreversibly damaged or lost.
The commonly used disk anomaly detection method is threshold detection based on SMART data. By issuing detection commands inside the disk, it monitors and records the operating condition of the disk hardware and compares it with the preset safety values defined by the manufacturer. If certain attributes exceed, or are about to exceed, the safe range of the preset values, the monitoring hardware or software of the host automatically alerts the user and performs minor automatic repairs to protect the disk data. However, the success rate of early warning by SMART alone is low, only 3%-10%, which does not meet practical requirements, so the method needs further improvement.
A disk has dozens of SMART attributes. Analyzing and training on these attributes requires processing a large amount of data, and the attributes are related to one another. Machine learning can learn from this large amount of data and discover the relationships among the attributes on its own, processing the data by building a corresponding learning model. A large amount of data also allows the model to be optimized continuously, which improves the accuracy of disk failure prediction and thus serves the purpose of detecting disk failures.
When machine learning is applied to disk anomaly detection, the distribution of normal and abnormal disk data is extremely imbalanced: abnormal samples are far fewer than normal samples, or there are none at all. With very little abnormal data, methods based on supervised algorithms cannot solve the problem effectively, so unsupervised algorithms are considered instead, among which the isolation forest algorithm performs comparatively well. Isolation forest repeatedly cuts the data space, and each subspace produced by a cut, with a random hyperplane until every subspace contains only one data point or a preset termination condition is reached. The algorithm can use normal samples only and handles classification under extremely imbalanced data fairly effectively, but it cannot effectively detect local anomalies and enclosed anomalies. The isolation-based nearest neighbor method handles these cases well on low-dimensional data, but it has difficulty detecting anomalies in high-dimensional space and difficulty identifying local and enclosed anomalies under varying density. It is therefore necessary to combine the number and the position of the isolation regions in which a test point lies, so that the test point can be located more precisely and abnormal disks can be identified effectively.
[Summary of the Invention]
In view of this, an embodiment of the present invention provides a disk anomaly detection method based on neighborhood partition and isolation reconstruction, to solve the problems of locating samples precisely under varying density and of detecting special anomalies within a region.
The disk anomaly detection method based on neighborhood partition and isolation reconstruction provided by the embodiment includes:
collecting disk SMART information, selecting the valid disk attribute features to form a data set, and applying exponential smoothing to obtain a stable disk training set;
randomly sampling the disk data set multiple times to obtain multiple sub-training sets, computing with the Euclidean distance the distance from each point in a subset to its nearest point, building a disk isolation region with that distance as the radius, and treating a test point that belongs to no region as a global anomaly;
for a test point that is not a global anomaly, finding every training point whose region contains the test point together with the nearest training point of that training point, and taking the ratio of the region radii of these two points as the pre-reconstruction anomaly measure of the test point in that region;
rebuilding the regions with the test point included, and taking the ratio of the region radius after reconstruction to the region radius before reconstruction as the post-reconstruction anomaly measure of the test point in that region;
combining the two measures to obtain the reconstruction score of the test point in one region, taking the reciprocal of the sum of the reconstruction scores over all regions in which the test point lies as the isolation score, and taking the mean of the isolation scores over the multiple subsets as the anomaly score of the test point.
In the above method, the step of collecting disk SMART information, selecting the valid disk attribute features to form a data set, and applying exponential smoothing to obtain a stable disk training set is described as follows. Disk SMART attribute data is collected, and the SMART attributes that have no missing values and vary meaningfully over time are selected. The collected data forms the data set, and exponential smoothing turns each SMART attribute into a series that can be used to build the model. The exponential smoothing formula is defined as:
$$S_t = \alpha \cdot Y_t + (1-\alpha) \cdot S_{t-1}$$
where $t$ is time, $Y_t$ is the actual value of the $t$-th data point, and $S_t$ is the smoothed value of the first $t$ data points, computed recursively from the actual value at time $t$ and the smoothed value of the first $t-1$ data points. The window width is fixed at $k$, $k \in [1,5]$, and can be set according to the actual situation; a smaller $k$ gives a weaker smoothing effect but higher sensitivity to new changes in the data. The parameter $\alpha \in [0,1]$ controls how quickly older observations decay: the closer $\alpha$ is to 1, the closer the smoothed value is to the data value at the current time.
In the above method, the step of randomly sampling the disk data set multiple times to obtain multiple sub-training sets, computing the nearest-point distances with the Euclidean distance, building the disk isolation regions, and treating a test point that belongs to no region as a global anomaly is described as follows. Simple random sampling is performed multiple times on the training set $D$ to obtain multiple sub-training sets $S_i$ of sample size $\psi$, where $i$ is an integer, $1 \le i \le t$, and $t$ is the number of subsets, chosen according to the actual situation. In each sub-training set $S_i$, the distances between points are computed with the Euclidean distance. Each training point $a$ is taken as a region center, and the distance $\tau(a)$ from $a$ to its nearest training point $\eta_a$ is taken as the region radius to build a disk isolation region that isolates $a$ from the other training points in the subset, where $a, \eta_a \in S_i$. The radius distance $\tau(a)$ of point $a$ is defined as:
$$\tau(a) = \lVert a - \eta_a \rVert$$
For each sub-training set $S_i$, let $c \in S_i$ be the training point nearest to the test point $x$. The test point $x$ is a global anomaly if and only if $\tau(c) < \tau(x)$, where $\tau(x)$ and $\tau(c)$ are the radius distances of points $x$ and $c$ respectively.
In the above method, the step of finding, for a test point that is not a global anomaly, every training point whose region contains it together with the nearest training point of that training point, and taking the ratio of the region radii of these two points as the pre-reconstruction anomaly measure, is described as follows. For each sub-training set $S_i$, let $b$ be any training point in $S_i$, and denote by $B(b)$ the hypersphere centered at $b$ with radius $\tau(b)$, so that $B(b) = \{\, y : \lVert y - b \rVert < \tau(b) \,\}$. For a test point that is not a global anomaly, let $c$ be the training point nearest to the test point $x$ and $\eta_c$ the training point nearest to $c$, with $c, \eta_c \in S_i$. $B(\eta_c)$ and $B(c)$ are the hypersphere regions centered at $\eta_c$ and $c$ with radii $\tau(\eta_c)$ and $\tau(c)$ respectively. The ratio $\tau(\eta_c)/\tau(c)$ is taken as the pre-reconstruction anomaly measure of the test point $x$ in this region. The larger $\tau(\eta_c)/\tau(c)$ is, the smaller the relative difference between the radii of $B(\eta_c)$ and $B(c)$ and the lower the degree of anomaly of the test point; conversely, the smaller the ratio, the higher the degree of anomaly.
In the above method, the step of rebuilding the regions with the test point included and taking the ratio of the region radius after reconstruction to the region radius before reconstruction as the post-reconstruction anomaly measure is described as follows. For each test point $x$, let $c \in S_i$ be a training point whose region contains $x$. After $x$ is included, the original training points rebuild their regions according to the nearest-neighbor principle, and the region radius of point $c$ after reconstruction is denoted $\tau(c)'$. The ratio $\tau(c)'/\tau(c)$ of the region radius after reconstruction to the region radius before reconstruction is the post-reconstruction anomaly measure of the test point in this region. The larger $\tau(c)'/\tau(c)$ is, the smaller the relative difference between the radii after and before reconstruction and the lower the degree of anomaly of the test point; conversely, the smaller the ratio, the higher the degree of anomaly.
In the above method, the step of combining the two measures into the reconstruction score of the test point in one region, taking the reciprocal of the sum of the reconstruction scores over all regions in which the test point lies as the isolation score, and taking the mean of the isolation scores over the multiple subsets as the anomaly score is described as follows. For each test point $x$, let $c \in S_i$ be a training point whose region contains $x$. The reconstruction score $R(x)$ combines the pre-reconstruction and post-reconstruction measures; its defining formula is an image in the source and is not reproduced in this text. In that formula, $\tau(c)'$ is the radius of the isolation region centered at $c$ after reconstruction. Taking the reciprocal of the sum of the reconstruction scores over all regions in which the test point lies, the isolation score $A(x)$ is defined as:
$$A(x) = \frac{1}{\sum_{j=1}^{k} R_j(x)}$$
where $k$ is the number of regions in which the test point $x$ lies and $R_j(x)$ is the reconstruction score of $x$ in the $j$-th such region. The normal sample data set is sampled multiple times to obtain the sub-training sets $\{S_1, S_2, \ldots, S_t\}$, where $t$ is the number of subsets, chosen according to the actual situation. The isolation score of $x$ is computed in each subset $S_i$ ($1 \le i \le t$), and the anomaly score $\bar{A}(x)$ of the test point $x$ is defined as:
$$\bar{A}(x) = \frac{1}{t} \sum_{i=1}^{t} A_i(x)$$
where $A_i(x)$ is the isolation score of the test point $x$ in the $i$-th subset. The anomaly score measures the degree of anomaly of $x$: the larger $\bar{A}(x)$ is, the more anomalous the test point $x$; conversely, the smaller $\bar{A}(x)$ is, the less anomalous it is.
[Brief Description of the Drawings]
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the disk anomaly detection method based on neighborhood partition and isolation reconstruction provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of how the disk anomaly detection method based on neighborhood partition and isolation reconstruction provided by an embodiment of the present invention computes the isolation score of a test point.
[Detailed Description]
For a better understanding of the technical solutions of the present invention, the embodiments are described in detail below with reference to the drawings.
The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a nearest neighbor anomaly detection method based on an edge sample density measure. Fig. 1 is a schematic flowchart of this method, which includes the following steps:
Step 101: collect disk SMART information, select the valid disk attribute features to form a data set, and apply exponential smoothing to obtain a stable disk training set.
Specifically, disk SMART attribute data is collected, and the SMART attributes that have no missing values and vary meaningfully over time are selected. Each of the following attributes is used in both its current value and its raw value: raw read error rate, disk read/write throughput performance, spindle spin-up time, start/stop count, reallocated sector count, seek error rate, seek performance, accumulated power-on time, spindle spin-up retry count, power cycle count, serial port downshift error count, I/O error detection and correction, uncorrectable errors, command timeout, high-fly writes, airflow temperature, shock error rate, power-off retract count, head load/unload count, temperature, program-error block count, current pending sector count, offline uncorrectable sector count, Ultra access check error rate, head flying time/transfer error rate, total LBAs written, and total LBAs read. The collected data forms the data set, and exponential smoothing turns each SMART attribute into a series that can be used to build the model. The exponential smoothing formula is defined as:
$$S_t = \alpha \cdot Y_t + (1-\alpha) \cdot S_{t-1}$$
where $t$ is time, $Y_t$ is the actual value of the $t$-th data point, and $S_t$ is the smoothed value of the first $t$ data points, computed recursively from the actual value at time $t$ and the smoothed value of the first $t-1$ data points. The window width is fixed at $k$, $k \in [1,5]$, and can be set according to the actual situation; a smaller $k$ gives a weaker smoothing effect but higher sensitivity to new changes in the data. The parameter $\alpha \in [0,1]$ controls how quickly older observations decay: a large $\alpha$ assigns lower weights to more distant observations, and the closer $\alpha$ is to 1, the closer the smoothed value is to the data value at the current time.
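The recursion in Step 101 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `exponential_smoothing` is hypothetical, the recursion is seeded with the first observation (the text does not say how $S$ is initialized), and the window width $k$ is ignored because it does not appear in the formula itself.

```python
from typing import List, Sequence


def exponential_smoothing(values: Sequence[float], alpha: float) -> List[float]:
    """Apply S_t = alpha * Y_t + (1 - alpha) * S_{t-1} to one SMART attribute series."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    smoothed: List[float] = []
    for y in values:
        if not smoothed:
            # Assumption: seed the recursion with the first observation (S_1 = Y_1).
            smoothed.append(float(y))
        else:
            smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    # With alpha = 1 the series is returned unchanged; smaller alpha smooths more.
    print(exponential_smoothing([10, 20, 30], alpha=0.5))
```

In practice the function would be applied column by column, once per selected SMART attribute.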
Step 102: randomly sample the disk data set multiple times to obtain multiple sub-training sets, compute with the Euclidean distance the distance from each point in a subset to its nearest point, build a disk isolation region with that distance as the radius, and treat a test point that belongs to no region as a global anomaly.
Specifically, simple random sampling is performed multiple times on the training set $D$ to obtain multiple sub-training sets $S_i$ of sample size $\psi$, where $i$ is an integer, $1 \le i \le t$, and $t$ is the number of subsets, chosen according to the actual situation. In each sub-training set $S_i$, the distances between points are computed with the Euclidean distance. Each training point $a$ is taken as a region center, and the distance $\tau(a)$ from $a$ to its nearest training point $\eta_a$ is taken as the region radius to build a disk isolation region that isolates $a$ from the other training points in the subset, where $a, \eta_a \in S_i$. The radius distance $\tau(a)$ of point $a$ is defined as:
$$\tau(a) = \lVert a - \eta_a \rVert$$
For each sub-training set $S_i$, let $c \in S_i$ be the training point nearest to the test point $x$. The test point $x$ is a global anomaly if and only if $\tau(c) < \tau(x)$, where $\tau(x)$ and $\tau(c)$ are the radius distances of points $x$ and $c$ respectively; this inequality is the boundary that decides whether $x$ is a global anomaly.
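Step 102 can be illustrated with the following sketch in plain Python. All function names (`draw_subsets`, `nearest_neighbor`, `region_radii`, `is_global_anomaly`) are hypothetical, and the sketch assumes that $\tau(x)$ for a test point means the distance from $x$ to its nearest training point $c$, which is how the definition $\tau(a) = \lVert a - \eta_a \rVert$ reads when applied to $x$.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def draw_subsets(training_set: Sequence[Point], psi: int, t: int, seed: int = 0) -> List[List[Point]]:
    """Draw t simple random samples of size psi (without replacement within a sample)."""
    rng = random.Random(seed)
    return [rng.sample(list(training_set), psi) for _ in range(t)]


def nearest_neighbor(p: Point, points: Sequence[Point], skip_index: int = -1) -> Tuple[int, float]:
    """Return (index, Euclidean distance) of the point in `points` closest to p."""
    best_idx, best_dist = -1, math.inf
    for j, q in enumerate(points):
        if j == skip_index:
            continue
        d = math.dist(p, q)
        if d < best_dist:
            best_idx, best_dist = j, d
    return best_idx, best_dist


def region_radii(subset: Sequence[Point]) -> List[float]:
    """tau(a) for every training point a: distance to its nearest other training point."""
    return [nearest_neighbor(a, subset, skip_index=i)[1] for i, a in enumerate(subset)]


def is_global_anomaly(x: Point, subset: Sequence[Point], radii: Sequence[float]) -> bool:
    """x is a global anomaly iff tau(c) < tau(x), with c the training point nearest to x."""
    c_idx, tau_x = nearest_neighbor(x, subset)  # assumption: tau(x) = ||x - c||
    return radii[c_idx] < tau_x


if __name__ == "__main__":
    subset = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    radii = region_radii(subset)
    print(radii, is_global_anomaly((10.0, 0.0), subset, radii))
```

Duplicate training points yield a radius of zero, so a real pipeline would probably need to deduplicate a subset before building regions.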
Step 103: for a test point that is not a global anomaly, find every training point whose region contains the test point together with the nearest training point of that training point, and take the ratio of the region radii of these two points as the pre-reconstruction anomaly measure of the test point in that region.
Specifically, for each sub-training set $S_i$, let $b$ be any training point in $S_i$, and denote by $B(b)$ the hypersphere centered at $b$ with radius $\tau(b)$. Any training point $y$ in $B(b)$ then satisfies:
$$y : \lVert y - b \rVert < \tau(b)$$
For a test point that is not a global anomaly, let $c$ be the training point nearest to the test point $x$ and $\eta_c$ the training point nearest to $c$, with $c, \eta_c \in S_i$. $B(\eta_c)$ and $B(c)$ are the hypersphere regions centered at $\eta_c$ and $c$ with radii $\tau(\eta_c)$ and $\tau(c)$ respectively. The ratio of the radii of $B(\eta_c)$ and $B(c)$ is the anomaly measure of the training point $c$ relative to its neighborhood, and $\tau(\eta_c)/\tau(c)$ is taken as the pre-reconstruction anomaly measure of the test point $x$ in this region. The larger $\tau(\eta_c)/\tau(c)$ is, the smaller the relative difference between the radii of $B(\eta_c)$ and $B(c)$ and the lower the degree of anomaly of the test point; conversely, the smaller $\tau(\eta_c)/\tau(c)$ is, the larger the relative difference between the radii and the higher the degree of anomaly.
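A small sketch of the pre-reconstruction measure follows, assuming it is the ratio $\tau(\eta_c)/\tau(c)$ described in the text and that $c$ is the single training point nearest to $x$. The helper names `nn_index`, `tau`, and `pre_reconstruction_measure` are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def nn_index(p: Point, points: Sequence[Point], skip: int = -1) -> int:
    """Index of the point in `points` nearest to p, ignoring position `skip`."""
    candidates = [j for j in range(len(points)) if j != skip]
    return min(candidates, key=lambda j: math.dist(p, points[j]))


def tau(i: int, subset: Sequence[Point]) -> float:
    """Region radius of training point i: distance to its nearest other training point."""
    return math.dist(subset[i], subset[nn_index(subset[i], subset, skip=i)])


def pre_reconstruction_measure(x: Point, subset: Sequence[Point]) -> float:
    """Ratio tau(eta_c) / tau(c), where c is nearest to x and eta_c is nearest to c."""
    c = nn_index(x, subset)
    eta_c = nn_index(subset[c], subset, skip=c)
    return tau(eta_c, subset) / tau(c, subset)


if __name__ == "__main__":
    subset = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    # A test point next to the sparse point (5, 0) gets a small ratio,
    # which the text reads as a higher degree of anomaly.
    print(pre_reconstruction_measure((4.5, 0.0), subset))
```

A test point near the two tightly packed points instead yields a ratio of 1, which the text reads as a low degree of anomaly.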
Step 104: rebuild the regions with the test point included, and take the ratio of the region radius after reconstruction to the region radius before reconstruction as the post-reconstruction anomaly measure of the test point in that region.
Specifically, for each test point $x$, let $c \in S_i$ be a training point whose region contains $x$. After $x$ is included, the original training points rebuild their regions according to the nearest-neighbor principle, and the region radius of point $c$ after reconstruction is denoted $\tau(c)'$. The ratio $\tau(c)'/\tau(c)$ of the region radius after reconstruction to the region radius before reconstruction is the post-reconstruction anomaly measure of the test point in this region. The larger $\tau(c)'/\tau(c)$ is, the smaller the relative difference between the radii after and before reconstruction and the lower the degree of anomaly of the test point; conversely, the smaller the ratio, the higher the degree of anomaly.
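The reconstruction step can be sketched as below. The sketch relies on one observation that follows from the nearest-neighbor definition of the radius: once $x$ is added, the nearest neighbor of $c$ is either its old nearest neighbor or $x$ itself, so $\tau(c)' = \min(\tau(c), \lVert c - x \rVert)$. The function name `post_reconstruction_measure` is hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def post_reconstruction_measure(x: Point, c_index: int, subset: Sequence[Point]) -> float:
    """Ratio tau(c)' / tau(c) for the training point at c_index after x joins the subset."""
    c = subset[c_index]
    # Radius before reconstruction: distance from c to its nearest other training point.
    tau_before = min(math.dist(c, q) for j, q in enumerate(subset) if j != c_index)
    # Radius after reconstruction: x may become the new nearest neighbor of c.
    tau_after = min(tau_before, math.dist(c, x))
    return tau_after / tau_before


if __name__ == "__main__":
    subset = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    # x lies inside the region of (5, 0) and shrinks its radius from 4 to 2.
    print(post_reconstruction_measure((3.0, 0.0), 2, subset))
```

The ratio never exceeds 1, and it equals 1 whenever $x$ lies outside the region of $c$, so only regions that actually contain $x$ contribute information.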
步骤105,结合两次度量值得到该测试点在一个区域内的重构分数,将测试点所处所有区域的重构分数之和的倒数作为隔离分数,将多个子集中隔离分数的平均值作为测试点异常分数。In
Specifically, for each test point x, let c ∈ S_i be the training point of a region in which x lies. The reconstruction score R(x) of x in that region is defined from the two measures above (the defining formula is missing from this text), where τ(c)′ is the radius of the isolation region centered at c after reconstruction.

The isolation score A(x) is the reciprocal of the sum of the reconstruction scores over all regions in which the test point lies:

$$A(x) = \frac{1}{\sum_{j=1}^{k} R_j(x)}$$

where k is the number of regions in which test point x lies and R_j(x) denotes the reconstruction score of x in the j-th of those regions.

The normal-sample dataset is sampled multiple times to obtain several sub-training sets {S_1, S_2, ..., S_t}, where t is the number of subsets and can be chosen to suit the application. The isolation score of test point x is computed in each subset S_i (1 ≤ i ≤ t), and the anomaly score of x is the average of these isolation scores:

$$\frac{1}{t}\sum_{i=1}^{t} A_i(x)$$

where A_i(x) is the isolation score of test point x in the i-th subset. The anomaly score measures how anomalous x is: the larger the score, the higher the degree of anomaly of test point x; conversely, the smaller the score, the lower the degree of anomaly.
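The aggregation in step 105 can be sketched as below. Because the formula for the per-region reconstruction score is not available in this text, the sketch takes the reconstruction scores as given inputs and only shows the two aggregation steps: the reciprocal of their sum, and the average over subsets. Returning `math.inf` when a test point lies in no region is an assumption made here so that global anomalies rank highest; the function names are hypothetical.

```python
import math

def isolation_score(reconstruction_scores):
    """A(x): reciprocal of the sum of the reconstruction scores of x
    over the k regions in which x lies (one score per region)."""
    total = sum(reconstruction_scores)
    if total == 0:
        # Assumption: no containing region (or zero total) is treated
        # as maximally anomalous.
        return math.inf
    return 1.0 / total

def anomaly_score(isolation_scores):
    """Average of the isolation scores A_i(x) over the t sub-training sets."""
    return sum(isolation_scores) / len(isolation_scores)

# Hypothetical reconstruction scores of one test point in three subsets.
per_subset = [[0.5, 0.25, 0.25], [0.5], [0.25, 0.25]]
scores = [isolation_score(r) for r in per_subset]
print(scores, anomaly_score(scores))
```

A test point covered by many regions with high reconstruction scores accumulates a large sum and therefore a small isolation score, which matches the rule that a larger anomaly score indicates a more anomalous point.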
Figure 2 is a schematic diagram of how the method computes the isolation score of a test point. O_m (m ∈ [1,5]) are the points on which the isolation regions are centered, R_m is the radius of the region of O_m before reconstruction, r_m is the radius of the region of O_m after reconstruction, and x_n (n ∈ [1,3]) are the test points. Disk anomalies are predicted from the relative position of the regions in which x_n lies and from how much the isolation regions change when x_n is introduced.
Table 1 gives the results of applying the disk anomaly detection method based on neighborhood partition and isolation reconstruction (NPIR) of this embodiment to classification tasks on five public datasets. For each dataset it lists the name, number of samples, dimensionality and proportion of anomalous samples, together with the recall of the two methods compared. The comparison method in this embodiment is the isolation-based nearest-neighbor algorithm (iNNE). As Table 1 shows, the largest recall improvement of the proposed method over the comparison method is on the Disk dataset, where it reaches 8.1%. The proposed method takes the distribution characteristics of edge samples into account and can effectively handle local anomaly detection in the neighborhood of edge samples.
Table 1
In summary, the embodiments of the present invention have the following beneficial effects:
In the technical solution implemented by the present invention, disk SMART information is collected, the effective disk feature attributes are selected to form a dataset, and exponential smoothing is applied to it to obtain the disk training set. The training set is randomly sampled multiple times to obtain several sub-training sets. In each subset, disk feature isolation regions are built using the distance from each point to its nearest point as the radius, and a test point that belongs to no region is treated as a global anomaly. For a test point that is not a global anomaly, the ratio of the radii of the regions of its two successive nearest neighbors is taken as the pre-reconstruction anomaly value of the test point in that region. The regions are then rebuilt with the test point included, and the ratio of the radii of the region in which the test point lies after and before reconstruction is taken as the post-reconstruction anomaly value of the test point in that region. The anomaly score is obtained by combining the pre- and post-reconstruction anomaly values over all regions in which the test point lies. The technical solution provided by the embodiments of the present invention accounts for how abnormal disk samples are judged under different density conditions, and can effectively address the identification of abnormal disks in the high-dimensional feature distribution space of disk samples.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011564817.3A CN112562771B (en) | 2020-12-25 | 2020-12-25 | Disk anomaly detection method based on neighborhood partition and isolation reconstruction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112562771A (en) | 2021-03-26 |
CN112562771B (en) | 2022-07-26 |
Family ID=75032520
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011564817.3A Active CN112562771B (en) | 2020-12-25 | 2020-12-25 | Disk anomaly detection method based on neighborhood partition and isolation reconstruction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112562771B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115344468A (en) * | 2022-02-23 | 2022-11-15 | 中国银联股份有限公司 | Method and computer system for monitoring abnormal events |
CN114816814B (en) * | 2022-03-23 | 2025-04-11 | 北京邮电大学 | A two-layer dynamic weighted disk anomaly detection method based on transfer learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107292350A (en) * | 2017-08-04 | 2017-10-24 | 电子科技大学 | The method for detecting abnormality of large-scale data |
CN108986869A (en) * | 2018-07-26 | 2018-12-11 | 南京群顶科技有限公司 | A kind of disk failure detection method predicted using multi-model |
CN109460791A (en) * | 2018-11-14 | 2019-03-12 | 北京邮电大学 | A kind of arest neighbors method for detecting abnormality based on edge samples Density Metric |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR3082963A1 (en) * | 2018-06-22 | 2019-12-27 | Amadeus S.A.S. | SYSTEM AND METHOD FOR EVALUATING AND DEPLOYING NON-SUPERVISED OR SEMI-SUPERVISED AUTOMATIC LEARNING MODELS |
US11468273B2 (en) * | 2018-09-20 | 2022-10-11 | Cable Television Laboratories, Inc. | Systems and methods for detecting and classifying anomalous features in one-dimensional data |
Non-Patent Citations (1)
Title |
---|
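Li Xinpeng et al. Disk failure prediction model based on an adaptive weighted Bagging-GBDT algorithm under imbalanced datasets. Microelectronics & Computer. 2020, (No. 03). * |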
Also Published As
Publication number | Publication date |
---|---|
CN112562771A (en) | 2021-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108228377B (en) | SMART threshold value optimization method for disk fault detection | |
CN111966569B (en) | Hard disk health evaluation method and device and computer readable storage medium | |
CN107392320A (en) | A kind of method that hard disk failure is predicted using machine learning | |
WO2017129032A1 (en) | Disk failure prediction method and apparatus | |
CN112562771B (en) | Disk anomaly detection method based on neighborhood partition and isolation reconstruction | |
CN113656228B (en) | A disk fault detection method, device, computer equipment and storage medium | |
CN119179598B (en) | Solid state disk fault prediction method and system based on artificial intelligence | |
CN112951311B (en) | Hard disk fault prediction method and system based on variable weight random forest | |
CN111581072A (en) | Disk failure prediction method based on SMART and performance log | |
CN110119344B (en) | Hard disk health state analysis method based on S.M.A.R.T. parameters | |
CN110164501A (en) | A kind of hard disk detection method, device, storage medium and equipment | |
CN111881000A (en) | Fault prediction method, device, equipment and machine readable medium | |
Shen et al. | Hard disk drive failure prediction for mobile edge computing based on an LSTM recurrent neural network | |
US20220121985A1 (en) | Machine Learning Supplemented Storage Device Calibration | |
Huang et al. | Characterizing disk health degradation and proactively protecting against disk failures for reliable storage systems | |
CN116820339A (en) | Method and device for determining disk status, storage medium and electronic device | |
CN114756420A (en) | Fault prediction method and related device | |
Zhou et al. | A proactive failure tolerant mechanism for SSDs storage systems based on unsupervised learning | |
KR100924694B1 (en) | Defect Prediction and Processing Method of Hard Disk Drive Using Hierarchical Clustering and Curve Pit | |
Gao et al. | Incremental prediction model of disk failures based on the density metric of edge samples | |
CN115378000A (en) | Evaluation Method of Distribution Network Operation Status Based on Interval Type II Fuzzy Cluster Analysis | |
Agarwal et al. | Discovering rules from disk events for predicting hard drive failures | |
Chen et al. | SSD drive failure prediction on Alibaba data center using machine learning | |
CN118939487A (en) | A hard disk failure prediction method for industrial edge computer room based on IIoT | |
CN110673997B (en) | Disk Failure Prediction Method and Device |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |