CN108415810B - Hard disk state monitoring method and device

Info

Publication number: CN108415810B
Application number: CN201810212464.7A
Authority: CN
Inventors: 包卫东; 朱晓敏; 王吉; 周文; 张耀鸿; 陈超; 马力; 张国良; 陈俊杰; 杨骋; 吴冠霖; 韩浩然
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2018-03-15
Filing date: 2018-03-15
Publication date: 2021-05-11
Anticipated expiration: 2038-03-15
Also published as: CN108415810A

Abstract

The invention discloses a hard disk state monitoring method and device. The method includes: periodically acquiring different SMART attribute values of a monitored hard disk; and, after each acquisition, performing the following operations: obtaining the attribute integration of the hard disk from the currently acquired SMART attribute values; inputting the attribute integration into a recurrent neural network that incorporates gated recurrent units, and taking the hidden layer state of the recurrent neural network as the output; and monitoring, based on information obtained from the output of the recurrent neural network, whether the hard disk is about to fail. With the invention, deviations from the normal state of a hard disk drive can be traced back to an earlier period, which helps improve the failure detection rate or the failure prediction capability.

Description

Hard disk state monitoring method and device

Technical field

The invention relates to the technical field of hard disk failure monitoring, and in particular to a hard disk state monitoring method and device.

Background

In the era of big data, data centers equipped with large storage systems play an important role in storing and processing data. However, the complexity of these systems makes IT equipment failure a serious problem, and hard disks are the most commonly failing component. Although the failure of a single hard disk may be relatively rare, aggregating thousands of hard disks magnifies the probability of failure, so that failure events become the norm rather than the exception in data center storage systems. Given the huge financial losses caused by data loss and service interruption, hard disk reliability is one of the top concerns of data center administrators.

Today, almost all hard disks are equipped with Self-Monitoring, Analysis and Reporting Technology (SMART) to detect and report various drive reliability indicators. Studies have shown that SMART data can signal an impending hard disk failure. Some hard disk manufacturers even use it as a failure prediction model built into their products. However, the built-in model provides only a basic threshold-based assessment, and its failure prediction capability is rather weak. To improve failure prediction, researchers have proposed statistical and machine learning methods based on SMART data. Although these methods perform well in terms of failure detection rate and false alarm rate, some key challenges remain unsolved:

A failing hard disk usually goes through a process of deterioration from healthy to failed. However, most existing methods predict failure from a single timestamp of SMART data and ignore the degradation process over time. Some methods based on Markov models and simple recurrent neural networks (RNNs) attempt to capture temporal dependencies in SMART data. However, limited by inherent problems of these models, such as vanishing and exploding gradients in RNNs, these methods can capture only short-term temporal dependencies spanning a few days.

However, according to the observation and analysis of the inventors, deviations from the normal state of some hard disk drives can be traced back tens of days or even months, which far exceeds the capability of these methods.

Summary of the invention

In view of this, the purpose of the invention is to provide a hard disk state monitoring method and device that can trace deviations from the normal state of a hard disk drive back to an earlier period, thereby helping to improve the failure detection rate or the failure prediction capability.

To this end, the invention provides a hard disk state monitoring method, comprising:

periodically acquiring different SMART attribute values of a monitored hard disk, and performing the following operations after each acquisition:

obtaining the attribute integration of the hard disk from the currently acquired SMART attribute values;

inputting the attribute integration into a recurrent neural network that incorporates gated recurrent units, and taking the hidden layer state of the recurrent neural network as the output;

monitoring, based on information obtained from the output of the recurrent neural network, whether the hard disk is about to enter a failure state.

Obtaining the attribute integration of the hard disk from the currently acquired SMART attribute values specifically comprises:

for the SMART vector $s_t \in \mathbb{R}^{d_s}$ composed of the currently acquired SMART attribute values of the hard disk on day t, the resulting attribute integration, denoted $v_t \in \mathbb{R}^{d_v}$, is calculated according to Expression 1 below:

$v_t = \mathrm{ReLU}(W_V s_t + b_v)$ (Expression 1)

where $W_V \in \mathbb{R}^{d_v \times d_s}$ is the weight matrix of the SMART attribute values and $b_v \in \mathbb{R}^{d_v}$ is the bias vector. ReLU is the activation function defined as $\mathrm{ReLU}(x) = x^{+} = \max(0, x)$, where max is applied element-wise; $W_V$ and $b_v$ are parameters learned in advance during the training process; $\mathbb{R}^{d_s}$ denotes the set of real vectors of dimension $d_s$; $\mathbb{R}^{d_v}$ denotes the set of real vectors of dimension $d_v$; $d_s$ is the number of SMART attribute values and $d_v$ is the number of attribute integration values.
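As an illustration of Expression 1, the following minimal Python sketch computes the attribute integration for one day. The function names and the toy values of W_V, b_v and s_t are hypothetical and chosen only for the example; in the method itself W_V and b_v come from training.

```python
def relu(x):
    """Element-wise ReLU(x) = max(0, x) on a list of floats."""
    return [max(0.0, xi) for xi in x]


def attribute_integration(W_V, b_v, s_t):
    """Expression 1: v_t = ReLU(W_V s_t + b_v).

    W_V is a d_v x d_s nested list, b_v has length d_v, s_t has length d_s.
    """
    pre_activation = [
        sum(w * s for w, s in zip(row, s_t)) + b
        for row, b in zip(W_V, b_v)
    ]
    return relu(pre_activation)


# Hypothetical toy values: d_s = 3 SMART attributes, d_v = 2 integrated features.
W_V = [[0.5, -1.0, 0.0],
       [0.0, 0.25, 0.25]]
b_v = [0.0, -1.0]
s_t = [2.0, 3.0, 4.0]

v_t = attribute_integration(W_V, b_v, s_t)
print(v_t)
```

The first component is clipped to zero because its pre-activation value is negative, which is the only nonlinearity in this step.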

The gated recurrent unit in the recurrent neural network includes a gating unit and a recurrent unit, wherein:

within a gated recurrent unit, the gating unit controls the information flow of the recurrent unit so that the recurrent unit captures dependencies over long time scales; and

the gating unit includes a reset gate and an update gate, which allow the recurrent unit to keep its existing content or to update that content.

The relationship between the input and output of the recurrent neural network is implemented by the recurrence defined by the following four expressions:

$r_t = \mathrm{sigmoid}(W_r v_t + U_r h_{t-1})$ (Expression 2)

$z_t = \mathrm{sigmoid}(W_z v_t + U_z h_{t-1})$ (Expression 3)

$h'_t = \tanh(W v_t + U (r_t \odot h_{t-1}))$ (Expression 4)

$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot h'_t$ (Expression 5)

where $\odot$ is element-wise multiplication; $r_t$ is the reset gate, $h'_t$ is the candidate state, $z_t$ is the update gate, and $h_t$ is the current hidden layer state of the recurrent neural network, that is, its current output; $h_{t-1}$ is the hidden layer state obtained at the previous time step; the parameters $W_r$, $U_r$, $W_z$, $U_z$, $W$ and $U$ are weight parameters learned in advance during the training process.
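The recurrence in Expressions 2 to 5 can be sketched as a single time step in plain Python. This is only an illustrative sketch: the helper names (`matvec`, `add`, `gru_step`) and the all-zero toy weights are assumptions made for the example, not part of the patent, and real weights would be obtained by training.

```python
import math


def matvec(M, x):
    """Multiply a matrix (nested list) by a vector (list)."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


def sigmoid(x):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-xi)) for xi in x]


def gru_step(W_r, U_r, W_z, U_z, W, U, v_t, h_prev):
    """One step of Expressions 2 to 5; returns the new hidden state h_t."""
    r_t = sigmoid(add(matvec(W_r, v_t), matvec(U_r, h_prev)))   # Expression 2
    z_t = sigmoid(add(matvec(W_z, v_t), matvec(U_z, h_prev)))   # Expression 3
    gated = [r * h for r, h in zip(r_t, h_prev)]                # r_t (.) h_{t-1}
    h_cand = [math.tanh(x)
              for x in add(matvec(W, v_t), matvec(U, gated))]   # Expression 4
    return [z * hp + (1.0 - z) * hc
            for z, hp, hc in zip(z_t, h_prev, h_cand)]          # Expression 5


# Hypothetical toy setup: all weights zero, so both gates equal 0.5
# and the candidate state is tanh(0) = 0.
zeros = [[0.0, 0.0], [0.0, 0.0]]
h_t = gru_step(zeros, zeros, zeros, zeros, zeros, zeros,
               v_t=[1.0, 2.0], h_prev=[1.0, -2.0])
print(h_t)
```

With zero weights the update gate is 0.5 everywhere, so the new hidden state is simply half of the previous one, which makes the interpolation in Expression 5 easy to check by hand.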

Monitoring, based on information obtained from the output of the recurrent neural network, whether the hard disk is about to enter a failure state specifically comprises:

generating an attention distribution vector from the hidden layer state output by the recurrent neural network, wherein the attention distribution vector serves as the weight vector of the current hidden layer state and reflects the difference between the current hidden layer state and the healthy state of the hard disk; and

determining whether the hard disk is about to enter a failure state by monitoring the magnitude of the weight values in the weight vector.

Further, before periodically acquiring the different SMART attribute values of the monitored hard disk, the method also comprises:

training the recurrent neural network using SMART data from healthy hard disks and failed hard disks.

The invention also provides a hard disk state monitoring device, comprising:

a feature integration module, configured to periodically acquire different SMART attribute values of a monitored hard disk and, after each acquisition, obtain the attribute integration of the hard disk from the currently acquired SMART attribute values;

a temporal dependency extraction module, configured to input the attribute integration into a recurrent neural network that incorporates gated recurrent units, and to take the hidden layer state of the recurrent neural network as the output;

a failure information monitoring module, configured to monitor, based on information obtained from the output of the recurrent neural network, whether the hard disk is about to enter a failure state.

Further, the device also comprises:

a training module, configured to train the recurrent neural network using SMART data from healthy hard disks and failed hard disks.

In the technical solution of the invention, to capture long-term temporal dependencies in SMART data, a gated recurrent unit (GRU) is introduced on top of the existing simple RNN, which avoids the vanishing and exploding gradient problems that arise when processing long time series. Deviations from the normal state of a hard disk drive can therefore be traced back to an earlier period, which helps improve the failure detection rate or the failure prediction capability.

Further, the technical solution of the invention also provides an attention mechanism that generates an attention distribution vector for the hidden layer states output by the recurrent neural network, reflecting the difference between the current hidden layer state and the healthy state of the hard disk. A higher attention weight indicates a more important role, so analyzing the attention distribution shows which past days have the greatest influence on the current state of the hard disk. This automatically reveals the degradation process of the hard disk and helps trace the cause of a hard disk failure.

Brief description of the drawings

FIG. 1 is a flowchart of a hard disk state monitoring method according to an embodiment of the invention;

FIG. 2 is a diagram of the internal connection structure of a gated recurrent unit according to an embodiment of the invention;

FIG. 3 is a schematic diagram of the internal structure of a hard disk state monitoring device according to an embodiment of the invention.

Detailed description

To make the objectives, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

Embodiments of the invention are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should further be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. The term "and/or" as used herein includes any one of, and all combinations of, one or more of the associated listed items.

It should be noted that all uses of "first" and "second" in the embodiments of the invention serve to distinguish two different entities or parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be understood as limiting the embodiments of the invention; subsequent embodiments do not repeat this explanation.

The inventors observed and analyzed changes in the SMART attribute values of hard disk drives and found that the deterioration of SMART attribute values can usually be traced back 15 days or even 40 days. For example, for SMART_197_RAW the change point usually appears within 20 days before the failure, whereas for SMART_7_RAW and SMART_242_RAW it is usually necessary to look back 40 days to find the change point. This is clearly beyond the capability of Markov models or simple RNNs. A high-performance prediction model needs a method that can extract long-term temporal dependencies.

To solve the above problem, the technical solution of the invention introduces a gated recurrent unit (GRU) on top of the existing simple RNN in order to capture long-term temporal dependencies in SMART data, which avoids the vanishing and exploding gradient problems that arise when processing long time series. Deviations from the normal state of a hard disk drive can therefore be traced back to an earlier period, which helps improve the failure detection rate or the failure prediction capability.

The technical solutions of the embodiments of the invention are described in detail below with reference to the accompanying drawings.

The hard disk state monitoring method provided by an embodiment of the invention monitors the failure state of a hard disk periodically: different SMART attribute values of the monitored hard disk are acquired periodically, for example once a day, and after each acquisition the acquired SMART attribute values are used to monitor whether the hard disk is about to enter a failure state.

The specific procedure for monitoring the hard disk state after one acquisition of the different SMART attribute values of the monitored hard disk is shown in FIG. 1 and includes the following steps:

Step S101: acquire the different SMART attribute values of the monitored hard disk and obtain the current attribute integration representation of the hard disk.

Specifically, the attribute integration representation of the hard disk is obtained from the currently acquired SMART attribute values of the hard disk; for example, the attribute integration representation of the hard disk on day t can be obtained from the different SMART attribute values of the monitored hard disk acquired on day t.

For example, for the SMART vector $s_t \in \mathbb{R}^{d_s}$ composed of the currently acquired SMART attribute values on day t, the resulting attribute integration, denoted $v_t \in \mathbb{R}^{d_v}$, can be calculated according to Expression 1:

$v_t = \mathrm{ReLU}(W_V s_t + b_v)$ (Expression 1)

where $W_V \in \mathbb{R}^{d_v \times d_s}$ is the weight matrix of the SMART attribute values, and $\mathbb{R}^{d_s}$ denotes the set of real vectors of dimension $d_s$.

Step S102: input the attribute integration obtained above into the recurrent neural network that incorporates gated recurrent units, and take the hidden layer state of the recurrent neural network as the output.

Specifically, the recurrent neural network of this embodiment includes a plurality of gated recurrent units. When the attribute integration obtained above is input into the recurrent neural network, each vector element of the attribute integration serves as the input of a respective gated recurrent unit.

The gated recurrent unit in the recurrent neural network of this embodiment includes a gating unit and a recurrent unit. Within a gated recurrent unit, the gating unit controls the information flow of the recurrent unit so that the recurrent unit can capture dependencies over long time scales. A gating unit includes a reset gate and an update gate, which allow the recurrent unit to keep its existing content or to update that content. FIG. 2 shows the internal connection structure of the gated recurrent unit.

A recurrent neural network (RNN) maintains a recurrent hidden layer state that is updated at each time step from the current input and the previous hidden layer state. For the recurrent neural network that incorporates gated recurrent units, the relationship between the input $v_t$ and the output $h_t$ can be implemented by the recurrence defined by the following four expressions:

$r_t = \mathrm{sigmoid}(W_r v_t + U_r h_{t-1})$ (Expression 2)

$z_t = \mathrm{sigmoid}(W_z v_t + U_z h_{t-1})$ (Expression 3)

$h'_t = \tanh(W v_t + U (r_t \odot h_{t-1}))$ (Expression 4)

$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot h'_t$ (Expression 5)

where $\odot$ is element-wise multiplication; the parameters $W_r$, $U_r$, $W_z$, $U_z$, $W$ and $U$ are weight parameters learned in advance during the training process; the sigmoid function maps any real value into the range [0, 1]; the tanh function maps any real value into the range [-1, 1].

$r_t$ is the reset gate, $h'_t$ is the candidate state, $z_t$ is the update gate, and $h_t$ is the current hidden layer state of the recurrent neural network (the hidden layer state on day t), that is, its current output; $h_{t-1}$ is the hidden layer state obtained at the previous time step (the hidden layer state on day t-1).

When the reset gate $r_t$ is close to 0, the candidate state $h'_t$ can forget the previous hidden layer state and reset to the current input; the update gate $z_t$ controls how much information flows in from the previous hidden layer state $h_{t-1}$ and from the candidate state $h'_t$.
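The gate behavior described above follows directly from Expressions 4 and 5. As a short worked check, substituting the limiting gate values gives:

```latex
\begin{aligned}
z_t \to 1 &:\quad h_t = 1 \odot h_{t-1} + (1 - 1) \odot h'_t = h_{t-1} \\
z_t \to 0 &:\quad h_t = 0 \odot h_{t-1} + (1 - 0) \odot h'_t = h'_t \\
r_t \to 0 &:\quad h'_t = \tanh\bigl(W v_t + U (0 \odot h_{t-1})\bigr) = \tanh(W v_t)
\end{aligned}
```

So an update gate near 1 carries the previous hidden state forward unchanged, an update gate near 0 replaces it with the candidate state, and a reset gate near 0 makes the candidate state depend only on the current input $v_t$.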

Step S103: monitor, based on information obtained from the output of the recurrent neural network, whether the hard disk is about to enter a failure state.

In this step, long-term information about the SMART data of the hard disk can be obtained from the output of the recurrent neural network. Based on this long-term information, deviations from the normal state of the hard disk drive can be traced back to an earlier period, so that an impending failure state can be detected earlier, which improves the failure detection rate or the failure prediction capability.

In addition, a preferred technical solution can be adopted in this step: on the basis of the long-term SMART information reflected by the hidden layer state output by the recurrent neural network, an attention mechanism is designed that automatically focuses on the degradation process of the hard disk. The attention mechanism can show which information has the greatest influence on failure prediction and thus provides failure tracing and diagnosis. It helps administrators trace back to specific days and identify the cause of a failure.

Specifically, an attention distribution vector can be generated from the hidden layer state output by the recurrent neural network. The attention distribution vector serves as the weight vector of the current hidden layer state and reflects the difference between the current hidden layer state and the healthy state of the hard disk. Whether the hard disk is about to enter a failure state is determined by monitoring the magnitude of the weight values in the weight vector.

Specifically, the attention distribution vector $a_t$ can be generated from the hidden layer state $h_t$ output by the recurrent neural network according to Expression 6.

In Expression 6, i is a natural number whose summation range is [t-T+1, t], and T is the size of the time window of the sequence input to the recurrent neural network. $u_t$ is a position-based representation obtained by transforming the hidden layer state $h_t$ through the tanh activation function, as shown in Expression 7 below. The health state vector can be regarded as a high-order representation of the characteristics of a healthy hard disk and can be learned in advance during the training process. Expression 6 compares the health state vector with the current hidden layer state and yields a weight for the difference between them.

$u_t = \tanh(W_a h_t + b_a)$ (Expression 7)

where $W_a \in \mathbb{R}^{d_r \times d_r}$ is a real matrix of dimension $d_r \times d_r$ and $b_a \in \mathbb{R}^{d_r}$ is a real vector of dimension $d_r$, both of which are parameters learned in advance during the training process; $d_r$ is the number of gated recurrent units in the recurrent neural network.

After the attention distribution vector $a_t$ is obtained, the hidden layer state with attention weights can be obtained according to Expression 8.
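The attention step can be sketched as follows. Only Expression 7 is taken from the text; Expressions 6 and 8 are not reproduced here, so this sketch assumes a common form of attention: a softmax over the dot products between each $u_i$ and the health state vector for the weights, and a weighted sum of the hidden states for the attended state. The names `u_health`, `attention_weights` and `attended_state`, and all toy values, are hypothetical.

```python
import math


def position_repr(W_a, b_a, h):
    """Expression 7: u = tanh(W_a h + b_a)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(W_a, b_a)]


def attention_weights(W_a, b_a, u_health, hidden_states):
    """ASSUMED form of Expression 6: softmax over scores u_i . u_health
    for the T hidden states in the window [t-T+1, t]."""
    scores = [
        sum(a * b for a, b in zip(position_repr(W_a, b_a, h), u_health))
        for h in hidden_states
    ]
    top = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attended_state(weights, hidden_states):
    """ASSUMED form of Expression 8: attention-weighted sum of hidden states."""
    size = len(hidden_states[0])
    return [sum(w * h[k] for w, h in zip(weights, hidden_states))
            for k in range(size)]


# Hypothetical toy values: d_r = 2, window T = 3.
W_a = [[1.0, 0.0], [0.0, 1.0]]
b_a = [0.0, 0.0]
u_health = [1.0, 0.0]
window = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]

a_t = attention_weights(W_a, b_a, u_health, window)
context = attended_state(a_t, window)
print(a_t, context)
```

Under these assumptions the weights sum to 1 over the window, and the day whose representation scores highest against the health state vector receives the largest weight, which is what allows individual days to be singled out.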

With the attention mechanism, the model can focus on the part of the sequence that carries the most failure information, and can therefore often make better assessments and predictions. More importantly, a higher attention weight indicates a more important role, so analyzing the attention distribution shows which past days have the greatest influence on the current state of the hard disk. It automatically reveals the degradation process of the hard disk and helps trace the cause of a hard disk failure.

In practice, a training process can be carried out before step S101, that is, before the different SMART attribute values of the monitored hard disk are periodically acquired. The training process includes training the recurrent neural network with SMART data from healthy hard disks and failed hard disks to obtain the parameters of the recurrent neural network described above; the training process can also include learning other parameters:

Specifically, during training, SMART data from healthy hard disks and failed hard disks can be used for computation and validation to determine the parameters $W_r$, $U_r$, $W_z$, $U_z$, $W$ and $U$ of the recurrent neural network. $W_V$ and $b_v$ can be obtained at the same time, as can the parameters of the attention mechanism, including $W_a$ and $b_a$. The training method can be gradient descent or another method well known to those skilled in the art, and is not described further here.

Based on the above method, an embodiment of the invention provides a hard disk state monitoring device whose internal structure is shown in FIG. 3 and which includes a feature integration module 301, a temporal dependency extraction module 302, and a failure information monitoring module 303.

The feature integration module 301 is configured to periodically acquire different SMART attribute values of the monitored hard disk and, after each acquisition, obtain the attribute integration of the hard disk from the currently acquired SMART attribute values. Specifically, the feature integration module 301 periodically acquires the different SMART attribute values of the monitored hard disk; for the SMART vector $s_t$ composed of the currently acquired SMART attribute values of the hard disk on day t, the attribute integration can be calculated according to Expression 1 above.

The temporal dependency extraction module 302 is configured to input the attribute integration obtained by the feature integration module 301 into the recurrent neural network that incorporates gated recurrent units, and to take the hidden layer state of the recurrent neural network as the output. The relationship between the input and output of the recurrent neural network is implemented by the recurrence defined by Expressions 2, 3, 4 and 5 above.

The failure information monitoring module 303 is configured to monitor, based on information obtained from the output of the recurrent neural network, whether the hard disk is about to enter a failure state. Specifically, the failure information monitoring module 303 generates an attention distribution vector from the hidden layer state output by the recurrent neural network; the attention distribution vector serves as the weight vector of the current hidden layer state and reflects the difference between the current hidden layer state and the healthy state of the hard disk; whether the hard disk is about to enter a failure state is determined by monitoring the magnitude of the weight values in the weight vector. The failure information monitoring module 303 can calculate the attention distribution vector according to Expressions 6 and 7 above.
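The way modules 301, 302 and 303 cooperate on each periodic acquisition can be sketched as a small pipeline. This is a structural sketch only: the class name `HardDiskMonitor`, the zero initial hidden state, and the three stand-in callables are assumptions for illustration, and a real device would plug in Expression 1, the GRU recurrence, and the attention-based check in their place.

```python
from collections import deque


class HardDiskMonitor:
    """Chains the three modules for one monitored hard disk."""

    def __init__(self, integrate, recur, assess, window, hidden_size):
        self.integrate = integrate            # module 301: s_t -> v_t
        self.recur = recur                    # module 302: (v_t, h_prev) -> h_t
        self.assess = assess                  # module 303: hidden states -> bool
        self.h = [0.0] * hidden_size          # ASSUMPTION: zero initial state
        self.history = deque(maxlen=window)   # last T hidden states

    def observe(self, s_t):
        """Process one periodic SMART acquisition; True means failure expected."""
        v_t = self.integrate(s_t)
        self.h = self.recur(v_t, self.h)
        self.history.append(self.h)
        return self.assess(list(self.history))


# Hypothetical stand-ins so the sketch runs end to end.
monitor = HardDiskMonitor(
    integrate=lambda s: [max(0.0, x) for x in s],
    recur=lambda v, h: [0.5 * a + 0.5 * b for a, b in zip(v, h)],
    assess=lambda states: states[-1][0] > 1.0,
    window=2,
    hidden_size=1,
)

flags = [monitor.observe(s_t) for s_t in ([0.0], [4.0], [4.0])]
print(flags)
```

Passing the three stages in as callables keeps the sketch independent of any particular implementation of the expressions, and the bounded deque mirrors the time window T over which the hidden states are examined.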

进一步，本发明实施例提供的一种硬盘状态监控装置还可包括：训练模块304。Further, the device for monitoring the state of a hard disk provided by the embodiment of the present invention may further include: a training module 304 .

训练模块304用于使用健康硬盘和故障硬盘的SMART数据对上述的递归神经网络进行训练，即使用健康硬盘和故障硬盘的SMART数据训练上述的递归神经网络，确定递归神经网络中的参数W_r、U_r、W_z、U_z、W、U。The training module 304 is used to train the above-mentioned recurrent neural network using the SMART data of the healthy hard disk and the faulty hard disk, that is, use the SMART data of the healthy hard disk and the faulty hard disk to train the above-mentioned recurrent neural network, and determine the parameters W _r , _{Ur, Wz, Uz} _, _W , U.

While training the recurrent neural network, the training module 304 may also learn the parameters W_V and b_v, as well as the parameters W_a and b_a of the attention mechanism.

Further, the technical solution of the present invention also includes an attention mechanism that generates an attention distribution vector for the hidden-layer states output by the recurrent neural network, reflecting the difference between the current hidden-layer state and the healthy state of the hard disk. In the attention distribution, a higher attention weight indicates a more important role, so analyzing the distribution shows which past days have the greatest influence on the current state of the hard disk. This automatically reveals the degradation process of the hard disk and helps trace the cause of a hard disk failure.

Those skilled in the art will appreciate that the present invention includes devices for performing one or more of the operations described in this application. These devices may be specially designed and manufactured for the required purposes, or may include known devices in general-purpose computers. These devices have computer programs stored in them that are selectively activated or reconfigured. Such a computer program may be stored in a device-readable (for example, computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus. The computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable medium includes any medium in which a device (for example, a computer) stores or transmits information in a readable form.

Those skilled in the art will understand that computer program instructions can be used to implement each block of these structural diagrams, block diagrams, and/or flow diagrams, as well as combinations of blocks in them. Those skilled in the art will also understand that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing means, so that the processor executes the scheme specified in the block or blocks of the structural diagrams, block diagrams, and/or flow diagrams disclosed in the present invention.

Those skilled in the art will understand that the steps, measures, and solutions in the various operations, methods, and processes discussed in the present invention may be alternated, modified, combined, or deleted. Further, other steps, measures, and solutions in the various operations, methods, and processes discussed in the present invention may also be alternated, modified, rearranged, decomposed, combined, or deleted. Further, steps, measures, and solutions in the prior art that are found in the various operations, methods, and processes disclosed in the present invention may also be alternated, modified, rearranged, decomposed, combined, or deleted.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of the present invention, technical features of the above embodiment or of different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of the invention as described above that are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A hard disk state monitoring method, comprising the following steps:

periodically acquiring different Self-Monitoring, Analysis and Reporting Technology (SMART) attribute values of the monitored hard disk; and, each time the SMART attribute values of the hard disk are acquired, performing the following operations:

obtaining attribute integration of the hard disk according to the SMART attribute value of the hard disk which is obtained currently;

inputting the attribute integration into a recurrent neural network that incorporates a gated recurrent unit, and taking the hidden-layer state of the recurrent neural network as the output;

monitoring whether the hard disk is about to fail according to information obtained from the output of the recurrent neural network, by: generating an attention distribution vector according to the hidden-layer state output by the recurrent neural network, wherein the attention distribution vector serves as a weight vector of the current hidden-layer state and reflects the difference between the current hidden-layer state and the healthy state of the hard disk; and determining whether the hard disk is about to enter a fault state by monitoring the magnitude of the weight values in the weight vector.

2. The method according to claim 1, wherein the obtaining of the attribute integration of the hard disk according to the SMART attribute value of the hard disk currently obtained specifically comprises:

from the SMART vector $s_t \in \mathbb{R}^{d_s}$ composed of the currently acquired SMART attribute values of the hard disk on day $t$, the attribute integration, denoted $v_t \in \mathbb{R}^{d_v}$, is calculated according to the following Expression 1:

$v_t = \mathrm{ReLU}(W_V s_t + b_v)$ (Expression 1)

wherein $W_V \in \mathbb{R}^{d_v \times d_s}$ is the weight matrix of the SMART attribute values and $b_v \in \mathbb{R}^{d_v}$ is a bias vector; ReLU is an activation function defined as $\mathrm{ReLU}(x) = x^{+} = \max(0, x)$, where max is an element-wise operation; $W_V$ and $b_v$ are parameters learned in advance during training; $\mathbb{R}^{d_s}$ denotes the set of real vectors of dimension $d_s$; $\mathbb{R}^{d_v}$ denotes the set of real vectors of dimension $d_v$; $d_s$ is the number of SMART attribute values, and $d_v$ is the number of attribute integration values.

3. The method of claim 2, wherein the gated recurrent unit in the recurrent neural network comprises a gating unit and a recurrent unit; wherein,

the gating unit is configured to control the information flow of the recurrent unit so that the recurrent unit captures long-time-scale dependencies; wherein,

the gating unit includes a reset gate and an update gate, which allow the recurrent unit to retain its existing content or to update the content on the basis of the existing content.

4. The method according to claim 3, wherein the relationship between the input and the output of the recurrent neural network is implemented by a recursive algorithm of the following four expressions:

$r_t = \mathrm{sigmoid}(W_r v_t + U_r h_{t-1})$ (Expression 2)

$z_t = \mathrm{sigmoid}(W_z v_t + U_z h_{t-1})$ (Expression 3)

$h'_t = \tanh(W v_t + U(r_t \odot h_{t-1}))$ (Expression 4)

$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot h'_t$ (Expression 5)

wherein $\odot$ denotes element-wise multiplication; $r_t$ denotes the reset gate, $h'_t$ denotes the candidate state, $z_t$ denotes the update gate, and $h_t$ denotes the current hidden-layer state of the recurrent neural network, namely its current output; $h_{t-1}$ denotes the hidden-layer state obtained at the previous time step of the recurrent neural network; and the parameters $W_r$, $U_r$, $W_z$, $U_z$, $W$, and $U$ are learned in advance during training.

5. The method of any of claims 1-4, further comprising, prior to periodically obtaining different SMART attribute values for the monitored hard disk:

training the recurrent neural network using SMART data of healthy and failed hard disks.

6. A hard disk state monitoring device, comprising:

a feature integration module, configured to periodically acquire different SMART attribute values of the monitored hard disk and, each time the SMART attribute values of the hard disk are acquired, obtain the attribute integration of the hard disk according to the currently acquired SMART attribute values;

a time-dependency extraction module, configured to input the attribute integration into a recurrent neural network that incorporates a gated recurrent unit and to take the hidden-layer state of the recurrent neural network as the output;

a fault information monitoring module, configured to monitor whether the hard disk is about to enter a fault state according to information obtained from the output of the recurrent neural network, by: generating an attention distribution vector according to the hidden-layer state output by the recurrent neural network, wherein the attention distribution vector serves as a weight vector of the current hidden-layer state and reflects the difference between the current hidden-layer state and the healthy state of the hard disk; and determining whether the hard disk is about to enter a fault state by monitoring the magnitude of the weight values in the weight vector.

7. The apparatus of claim 6, wherein the gated recurrent unit in the recurrent neural network comprises a gating unit and a recurrent unit; wherein,

the gating unit in the gated recurrent unit is configured to control the information flow of the recurrent unit in the gated recurrent unit so that the recurrent unit captures long-time-scale dependencies; wherein,

8. The apparatus of any of claims 6-7, further comprising:

a training module, configured to train the recurrent neural network using SMART data of healthy hard disks and failed hard disks.