CN106980468A - Method and device for triggering RAID array reconstruction

Info

Publication number: CN106980468A
Application number: CN201710124832.8A
Authority: CN
Inventors: 上官应兰; 张学东
Original assignee: Macrosan Technologies Co Ltd
Current assignee: Macrosan Technologies Co Ltd
Priority date: 2017-03-03
Filing date: 2017-03-03
Publication date: 2017-07-25

Abstract

The application provides a method and a device for triggering RAID array reconstruction. The method is applied to a disk subsystem of a storage device and comprises the following steps: issuing IO read/write instructions to each physical disk; calculating the average response time of each physical disk from the response times of the IO read/write instructions that the disk returns within a preset statistical period; judging, for each non-failed physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to its model; marking each physical disk whose average response time reaches that threshold as a failed physical disk; and, if the failed physical disk belongs to a RAID array, notifying that RAID array to rebuild the failed physical disk. This scheme can effectively improve the accuracy of identifying failed physical disks.

Description

Method and device for triggering RAID array reconstruction

Technical field

The present application relates to the field of computer communication, and in particular to a method and a device for triggering RAID array reconstruction.

Background

A RAID array (Redundant Array of Independent Disks) is a technology that combines multiple independent disks (physical disks) in different ways to form a disk group (a logical disk), thereby providing higher storage performance and data reliability than a single disk.

In the field of computer communication, RAID array technology is commonly used to provide redundancy protection for data on disks. When data is written, it is split across multiple member disks according to the RAID array algorithm. Depending on its level, a RAID array can tolerate the failure or offlining of one or more disks. When a disk IO error or an offline disk is detected, a dedicated hot spare disk or a global hot spare disk can be used for rebuilding, which restores the data redundancy of the RAID array.

However, existing methods for triggering RAID array reconstruction consider only disk IO errors and offline disks. They do not consider the case where an aging disk responds more slowly and interrupts service. How to trigger RAID array reconstruction when a disk responds slowly has therefore become an urgent problem.

Summary of the invention

In view of this, the present application provides a method and a device for triggering RAID array reconstruction, so as to improve the accuracy of identifying failed physical disks.

Specifically, the present application is implemented through the following technical solutions:

According to a first aspect of the present application, a method for triggering RAID array reconstruction is provided. The method is applied to a disk subsystem of a storage device, the storage device includes at least one RAID array, and the RAID array includes several physical disks. The method includes:

issuing IO read/write instructions to each physical disk according to the IO read/write requests of each relevant subsystem;

calculating the average response time of each physical disk based on the response times of the IO read/write instructions returned by that physical disk within a preset statistical period;

judging, for each non-failed physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to its model, wherein physical disks of different models have different disk abnormal response time thresholds;

marking each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a failed physical disk and, if the failed physical disk belongs to a RAID array, notifying the RAID array to which it belongs to rebuild the failed physical disk.

According to a second aspect of the present application, a device for triggering RAID array reconstruction is provided. The device is applied to a disk subsystem of a storage device, the storage device includes at least one RAID array, and the RAID array includes several physical disks. The device includes:

an issuing unit, configured to issue IO read/write instructions to each physical disk according to the IO read/write requests of each relevant subsystem;

a calculation unit, configured to calculate the average response time of each physical disk based on the response times of the IO read/write instructions returned by that physical disk within a preset statistical period;

a judging unit, configured to judge, for each non-failed physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to its model, wherein physical disks of different models have different disk abnormal response time thresholds;

a marking unit, configured to mark each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a failed physical disk and, if the failed physical disk belongs to a RAID array, to notify the RAID array to which it belongs to rebuild the failed physical disk.

In the method for triggering RAID array reconstruction proposed in this application, on the one hand, the disk subsystem can use the average response time of each physical disk to mark a non-failed physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a failed physical disk, and can notify the RAID array to which that disk belongs to rebuild it. Reconstruction of a RAID array is thus triggered by the response time of the IO read/write instructions of its physical disks.

On the other hand, because the average response time of each physical disk is compared with the disk abnormal response time threshold corresponding to its model, judging whether a disk's average response time is abnormal takes into account the IO issued by all services on that disk, such as IO issued by the RAID array and IO issued by disk detection tasks. This effectively improves the accuracy with which the disk subsystem marks failed physical disks.

Description of drawings

FIG. 1 is a flow chart of a method for triggering RAID array reconstruction according to an exemplary embodiment of the present application;

FIG. 2 is a hardware structure diagram of the equipment hosting a device for triggering RAID array reconstruction according to an exemplary embodiment of the present application;

FIG. 3 is a block diagram of a device for triggering RAID array reconstruction according to an exemplary embodiment of the present application.

Detailed description

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.

The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "said", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, and so on may be used in this application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".

A RAID array is a technology that combines multiple independent disks (physical disks) in different ways to form a disk group (a logical disk), thereby providing higher storage performance and data reliability than a single disk.

In related methods for triggering RAID array reconstruction, when the RAID subsystem receives an IO read/write error returned by a member disk and determines that the error is unrecoverable, it can mark the member disk as failed and trigger reconstruction of the RAID array to which the member disk belongs. In addition, when the RAID subsystem receives a notification that a member disk is offline, it can also trigger reconstruction of the RAID array to which the offline member disk belongs.

During reconstruction, a hot spare disk can be used to rebuild the failed or offline disk. The RAID subsystem can calculate the data of each corresponding stripe on the hot spare disk according to the RAID array algorithm, restoring the redundancy of the RAID array to which the failed disk belongs.
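As a rough illustration of how stripe data can be recomputed for a hot spare, the sketch below assumes a single-parity layout in the style of RAID 5, where the missing block of a stripe is the XOR of the surviving blocks. The application does not name a specific RAID algorithm, so the XOR scheme and the function name `rebuild_block` are assumptions made only for this example.

```python
from functools import reduce
from typing import Sequence


def rebuild_block(surviving_blocks: Sequence[bytes]) -> bytes:
    """Recompute the one missing block of a stripe as the XOR of the others.

    Assumption: single-parity (RAID 5 style) striping; other RAID levels
    use different algorithms.
    """
    if not surviving_blocks:
        raise ValueError("need at least one surviving block")
    length = len(surviving_blocks[0])
    if any(len(block) != length for block in surviving_blocks):
        raise ValueError("all blocks in a stripe must have the same length")
    # XOR the bytes at the same offset across all surviving blocks.
    return bytes(
        reduce(lambda x, y: x ^ y, column) for column in zip(*surviving_blocks)
    )


# Example stripe: two data blocks and the parity computed from them.
d0 = bytes([0x0F, 0xA0])
d1 = bytes([0xF0, 0x0A])
parity = rebuild_block([d0, d1])

# If the disk holding d1 is replaced, its block can be recomputed
# from the remaining data block and the parity block.
recovered_d1 = rebuild_block([d0, parity])
```

The same function serves for computing parity and for recovering a lost block because XOR is its own inverse.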

Because a disk is a combined mechanical and electronic device, it is affected by factors such as component aging and the environment. In practice, a disk may return no IO errors yet respond more slowly. When an upper-layer application reads or writes the RAID array containing that disk, IO on the slow disk returns later than IO on the other disks, and the application's performance fluctuates or its IO times out. Specifically, when a LUN (logical unit number) created on the RAID array is assigned to a front-end application server for sustained reads and writes, the LUN's performance may fluctuate greatly, and IO timeouts may even interrupt service. Yet when developers investigate the fluctuation, they find that the RAID array status is normal, the disk status is normal, and no member disk has returned an IO read/write error. Further investigation shows that, although the member disks of the RAID array have the same interface and the same rotational speed, the IO read/write responses returned by some member disks take significantly longer than those of the other member disks in the array. After a member disk with a long IO read/write response time is pulled and replaced with a hot spare disk, the performance of the RAID array and of the LUN returns to normal.

In summary, a long IO read/write response time on a member disk seriously affects the performance of the RAID array to which the disk belongs and of the LUNs created on that array, causing performance fluctuations; in extreme cases, IO timeouts may interrupt service. However, existing methods for triggering RAID array reconstruction consider only disk IO errors and offline disks, and do not consider the case where an aging disk responds more slowly and interrupts service.

This application proposes a method for triggering RAID array reconstruction. The disk subsystem of the storage device can issue IO read/write instructions to each physical disk according to the IO read/write requests of each relevant subsystem, and can calculate the average response time of each physical disk based on the response times of the IO read/write instructions returned by that disk within a preset statistical period. The disk subsystem can judge, for each non-failed physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to the model of that disk. The disk subsystem can mark each physical disk whose average response time reaches the threshold corresponding to its model as a failed physical disk and, if the failed physical disk belongs to a RAID array, notify the RAID array to which it belongs to rebuild the failed physical disk.

Because the disk subsystem can use the average response time of each physical disk to mark a non-failed physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a failed physical disk, and can notify the RAID array to which that disk belongs to rebuild it, reconstruction of a RAID array is triggered by the response time of the IO read/write instructions of its physical disks.

Referring to FIG. 1, FIG. 1 is a flow chart of a method for triggering RAID array reconstruction according to an exemplary embodiment of the present application. The method is applied to a disk subsystem of a storage device, the storage device includes at least one RAID array, and the RAID array includes several physical disks. The method includes:

Step 101: issuing IO read/write instructions to each physical disk according to the IO read/write requests of each relevant subsystem;

Step 102: calculating the average response time of each physical disk based on the response times of the IO read/write instructions returned by that physical disk within a preset statistical period;

Step 103: judging, for each non-failed physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to its model, wherein physical disks of different models have different disk abnormal response time thresholds;

Step 104: marking each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a failed physical disk and, if the failed physical disk belongs to a RAID array, notifying the RAID array to which it belongs to rebuild the failed physical disk.

The RAID subsystem mentioned above manages the RAID arrays in the storage device. For example, its functions may include splitting the received IO read/write instructions according to the RAID array algorithm and issuing the split instructions to the disk subsystem. Its functions may also include, when a RAID array is rebuilt, calculating the data of each corresponding stripe on the hot spare disk based on the algorithm of that RAID array and restoring the data redundancy of the array. The RAID subsystem has other functions as well, and its functions are not specifically limited here.

The disk subsystem mentioned above mainly manages all the physical disks in the storage device. It acts as a "lower-layer" system that serves "upper-layer" systems such as the RAID subsystem. For example, the disk subsystem can scan the physical disks in the storage device and notify the RAID subsystem of the status of each scanned disk. In practice the disk subsystem has other functions as well, and its functions are not specifically limited here.

The RAID array mentioned above is a disk group formed by combining multiple physical disks in the storage device according to a RAID array algorithm. The physical disks in a RAID array may also be called member disks. The number of physical disks that a RAID array can rebuild simultaneously depends on its level and its implementation. For example, conventional RAID5 supports rebuilding one physical disk at a time, whereas some vendors' implementations support rebuilding several at once.

The response time of a physical disk is the time from when the disk subsystem issues an IO read/write instruction to the physical disk until the physical disk returns the corresponding IO read/write response.

The average response time of a physical disk is, within the preset statistical period, the accumulated response time of the disk's completed IO read/write instructions divided by the accumulated number of those completed instructions. If the accumulated number of completed IO read/write instructions for a physical disk is zero, its average response time is treated as zero.

The disk abnormal response time threshold is used to judge whether the response time of a physical disk is abnormal. When the average response time of a physical disk reaches (is greater than or equal to) this threshold, the physical disk is considered abnormal.

It should be noted that the disk abnormal response time threshold may be set by developers based on actual test data, or entered manually by a user through an interactive interface. If the threshold is set too high, physical disks whose IO read/write instructions take too long may not be detected accurately, so the performance of the RAID array and of the LUNs created on it cannot be restored well. If the threshold is set too low, many physical disks may be detected as abnormal, including disks whose average response time is actually normal, which causes frequent RAID array reconstruction and degrades the performance of the storage device. In practice, developers can therefore set the threshold according to actual conditions. For example, they may set it to the product of the average response time of physical disks of the same model and a preset abnormal response time weighting value. This is only an illustrative description of how the threshold may be set and is not a specific limitation.

In the embodiments of this application, the disk subsystem uses the average response time of each physical disk to mark a non-failed physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a failed physical disk, and notifies the RAID array to which that disk belongs to rebuild it. Reconstruction of a RAID array is thus triggered by the response time of the IO read/write instructions of its physical disks.

In addition, because the average response time of each physical disk is compared with the disk abnormal response time threshold corresponding to its model, judging whether a disk's average response time is abnormal takes into account the IO issued by all services on that disk, such as IO issued by the RAID array and IO issued by disk detection tasks. This effectively improves the accuracy with which the disk subsystem marks failed physical disks.

The method for triggering RAID array reconstruction is described in detail below.

During implementation, within a preset statistical period, the disk subsystem can issue IO read/write instructions to each physical disk according to the IO read/write requests of each relevant subsystem. After receiving an IO read/write instruction and processing it, each physical disk can return an IO read/write response to the disk subsystem.

Based on the response times of the IO read/write instructions returned by each physical disk within the preset statistical period, the disk subsystem can tally, for each physical disk, the response times of all completed IO read/write instructions and the number of completed instructions in that period. At the end of the statistical period, it calculates the average response time of each physical disk.

The calculation of the average response time is described in detail below, taking one physical disk as an example.

During implementation, the disk subsystem starts timing when it issues an IO read/write instruction to a physical disk and stops timing when it receives the response to that instruction from the disk, which gives the response time of that instruction on that disk. The disk subsystem can add this response time to the accumulated response time of the disk's completed IO read/write instructions in the current period, and increase the disk's count of completed IO read/write instructions by 1.
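The per-instruction timing and accumulation described above can be sketched as follows. This is a minimal sketch, not the application's implementation: the names `DiskPeriodStats` and `timed_io` are hypothetical, and the instruction is modeled as a blocking call for simplicity, whereas a real disk subsystem would typically complete IO asynchronously.

```python
import time
from dataclasses import dataclass


@dataclass
class DiskPeriodStats:
    """Accumulators for one physical disk within the current statistical period."""
    total_response_time: float = 0.0  # summed response time, in seconds
    completed_count: int = 0          # number of completed IO instructions

    def record(self, response_time: float) -> None:
        self.total_response_time += response_time
        self.completed_count += 1


def timed_io(stats, issue_instruction, clock=time.monotonic):
    """Issue one IO instruction, time it, and fold the result into stats.

    issue_instruction is a hypothetical callable that returns once the
    disk has responded; clock is injectable so the sketch can be tested.
    """
    start = clock()                    # timing starts when the instruction is issued
    result = issue_instruction()
    stats.record(clock() - start)      # timing ends when the response arrives
    return result


# Usage with a fake clock: two instructions taking 0.5 s and 0.25 s.
ticks = iter([10.0, 10.5, 20.0, 20.25])
stats = DiskPeriodStats()
timed_io(stats, lambda: "ok", clock=lambda: next(ticks))
timed_io(stats, lambda: "ok", clock=lambda: next(ticks))
```

A monotonic clock is used by default so that wall-clock adjustments cannot produce negative response times.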

At the end of the current statistical period, the disk subsystem can calculate the physical disk's average response time for that period. During implementation, the disk subsystem can divide the accumulated response time of the disk's IO read/write instructions by the accumulated number of the disk's completed IO read/write instructions to obtain the disk's average response time. If the accumulated number of completed IO read/write instructions for a physical disk in the current statistical period is 0, the disk's average response time is treated as zero.
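The end-of-period calculation, including the zero-count rule, can be written as a short function. The function name and the example figures are illustrative assumptions; only the division and the zero handling come from the description.

```python
def average_response_time(total_response_time: float, completed_count: int) -> float:
    """Average response time of one disk for one statistical period.

    A disk that completed no IO instructions in the period is treated
    as having an average response time of zero.
    """
    if completed_count == 0:
        return 0.0
    return total_response_time / completed_count


# Hypothetical per-disk totals for one period: (summed seconds, completed count).
period_totals = {"disk0": (0.120, 40), "disk1": (0.0, 0)}
averages = {
    name: average_response_time(total, count)
    for name, (total, count) in period_totals.items()
}
```

Here `disk0` averages 3 ms per instruction, while the idle `disk1` is reported as zero rather than raising a division error.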

The average response time of the other physical disks is calculated in the same way and is not described again here.

At the end of the current statistical period, the disk subsystem can also calculate the disk abnormal response time threshold corresponding to each model of physical disk.

In an optional implementation, the disk subsystem can calculate, for each model, the product of the average response time of the physical disks of that model and a preset abnormal response time weighting value, and use the product as the disk abnormal response time threshold for that model.

The calculation of the disk abnormal response time threshold for one model of physical disk is described in detail below.

During implementation, the disk subsystem can sum the current-period average response times of the non-failed physical disks of that model whose average response time is not zero, and can count those disks. It then divides the summed average response time by the number of such disks to obtain the average response time for that model.

The disk subsystem can multiply the calculated average response time for that model by the preset abnormal response time weighting value to obtain the disk abnormal response time threshold for that model.
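The per-model threshold calculation can be sketched as below. The function name, the tuple layout, and the millisecond figures are assumptions made for illustration; the weighting value of 2.0 corresponds to the 200% example given in the description.

```python
from collections import defaultdict


def model_thresholds(disks, weight=2.0):
    """Return {model: disk abnormal response time threshold}.

    disks is an iterable of (model, average_response_time, is_failed).
    Only non-failed disks with a non-zero average contribute to the
    per-model average, which is then multiplied by the weighting value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model, average, is_failed in disks:
        if is_failed or average == 0:
            continue
        sums[model] += average
        counts[model] += 1
    return {model: sums[model] / counts[model] * weight for model in sums}


# Hypothetical averages in milliseconds for one statistical period.
disks = [
    ("M1", 4.0, False),
    ("M1", 6.0, False),
    ("M1", 0.0, False),   # idle disk, excluded from the model average
    ("M1", 50.0, True),   # already failed, excluded
    ("M2", 10.0, False),
]
thresholds = model_thresholds(disks)
```

For model `M1` the two contributing disks average 5 ms, so the threshold is 10 ms; for `M2` it is 20 ms. A model with no contributing disks simply has no entry in the result.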

The abnormal response time weighting value is used to judge whether a disk's average response time is abnormal. It is usually set by the user according to actual conditions and is expressed as a percentage, for example 200%. The weighting value is not specifically limited here.

The average response time of the other models of physical disk is calculated in the same way as for the model described above and is not described again here.

In the embodiments of this application, the disk subsystem can obtain the average response time of each physical disk and the disk abnormal response time threshold of each model of physical disk by the methods above. The disk subsystem can then judge, for each non-failed physical disk, whether its average response time reaches (is greater than or equal to) the disk abnormal response time threshold corresponding to its model.

The disk subsystem can mark each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a failed physical disk and, if the failed physical disk belongs to a RAID array, notify the RAID array to which it belongs to rebuild the failed physical disk.
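The judging and marking steps can be sketched as one pass over the disks. The `Disk` record, the function `mark_slow_disks`, and the `notify_rebuild` callback are hypothetical names; how the disk subsystem actually notifies a RAID array is not specified in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Disk:
    name: str
    model: str
    average_response_time: float
    raid_array: Optional[str] = None  # None if the disk is not a member disk
    failed: bool = False


def mark_slow_disks(disks, thresholds, notify_rebuild):
    """Mark disks whose average reaches their model's threshold as failed.

    notify_rebuild(array, disk_name) stands in for the notification
    sent to the RAID array that owns a newly failed member disk.
    """
    newly_failed = []
    for disk in disks:
        if disk.failed:
            continue                      # only non-failed disks are judged
        threshold = thresholds.get(disk.model)
        if threshold is None:
            continue                      # no threshold known for this model
        if disk.average_response_time >= threshold:  # "reaches" means >=
            disk.failed = True
            newly_failed.append(disk.name)
            if disk.raid_array is not None:
                notify_rebuild(disk.raid_array, disk.name)
    return newly_failed


notifications = []
disks = [
    Disk("a", "M1", 4.0, raid_array="raid0"),
    Disk("b", "M1", 10.0, raid_array="raid0"),  # equals the threshold
    Disk("c", "M1", 12.0),                      # slow, but not in an array
    Disk("d", "M1", 100.0, raid_array="raid0", failed=True),
]
newly_failed = mark_slow_disks(
    disks, {"M1": 10.0}, lambda array, name: notifications.append((array, name))
)
```

In this run disks `b` and `c` are marked as failed, but only `b` produces a rebuild notification because `c` does not belong to a RAID array.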

为了提高磁盘子系统检测故障物理磁盘的精准性，避免将临时性异常响应的物理磁盘标记为故障物理磁盘，磁盘子系统可以不立即标记该物理磁盘为故障物理磁盘，而是记录该物理磁盘在连续若干个周期内的判断结果，如果该物理磁盘在连续若干个周期内的均被确定为平均响应时间达到与其型号对应的磁盘异常响应时间阈值的物理磁盘，则将该物理磁盘标记为故障物理磁盘，如果所述故障物理磁盘属于RAID阵列，则触发该物理磁盘所属的RAID阵列进行重建。In order to improve the accuracy of the disk subsystem in detecting a failed physical disk and avoid marking a physical disk with a temporary abnormal response as a failed physical disk, the disk subsystem may not immediately mark the physical disk as a failed physical disk, but record the physical disk in the Judgment results in several consecutive cycles, if the physical disk is determined to be a physical disk whose average response time reaches the abnormal response time threshold of the disk corresponding to its model in several consecutive cycles, the physical disk will be marked as a faulty physical disk If the faulty physical disk belongs to a RAID array, trigger the reconstruction of the RAID array to which the physical disk belongs.

在一种可选的实现方式中，为增加检测故障物理磁盘的准确性与实用性，上述连续若干个统计周期，可为“相对连续”的若干统计周期。In an optional implementation manner, in order to increase the accuracy and practicability of detecting a faulty physical disk, the above-mentioned several consecutive statistical cycles may be "relatively continuous" several statistical cycles.

When marking, the disk subsystem may record a sustained-cycle count for each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model. If, after several statistical periods, the sustained-cycle count of any such physical disk reaches a preset sustained-cycle threshold, that physical disk is marked as a faulty physical disk.

When recording, for each physical disk whose average response time has reached the disk abnormal response time threshold corresponding to its model, the disk subsystem proceeds as follows at the end of the next statistical period. If the physical disk is again determined to have reached the threshold, its sustained-cycle count is increased and recorded. If the physical disk is not determined to have reached the threshold, its sustained-cycle count is decreased and recorded. The initial value of a physical disk's sustained-cycle count is zero.

For example, when a physical disk is first determined to have an average response time that reaches the disk abnormal response time threshold corresponding to its model, its sustained-cycle count may be set to 1. At the end of the next statistical period, if the physical disk is again determined to have reached the threshold, its sustained-cycle count is incremented by 1; if not, its sustained-cycle count is decremented by 1. If the count falls to zero, the sustained-cycle count of that physical disk is no longer recorded.
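The "relatively consecutive" bookkeeping described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `update_relative`, the dictionary `counts`, and the parameter `limit` (standing in for the preset sustained-cycle threshold) are hypothetical names, not taken from the application.

```python
def update_relative(counts, abnormal_now, limit):
    """Update sustained-cycle counts at the end of one statistical period.

    counts:       dict disk_name -> sustained-cycle count
                  (a disk that is absent has a count of zero).
    abnormal_now: set of disks judged abnormal in this period.
    limit:        preset sustained-cycle threshold (hypothetical name).
    Returns the sorted list of disks whose count has reached the limit.
    """
    # Abnormal in this period: increment (first occurrence becomes 1).
    for disk in abnormal_now:
        counts[disk] = counts.get(disk, 0) + 1
    # Tracked but normal in this period: decrement.
    for disk in list(counts):
        if disk not in abnormal_now:
            counts[disk] -= 1
            if counts[disk] <= 0:
                del counts[disk]  # count fell to zero: stop recording
    return sorted(d for d, c in counts.items() if c >= limit)
```

With `limit=3`, a disk that is abnormal, abnormal, normal, abnormal, abnormal accumulates counts of 1, 2, 1, 2, 3 and is reported after the fifth period, so a single normal period delays the marking rather than cancelling it.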

In another optional implementation, the disk subsystem may instead mark faulty physical disks based on several "absolutely consecutive" statistical periods.

In this implementation, the disk subsystem may likewise record a sustained-cycle count for each physical disk found to have an average response time that reaches the disk abnormal response time threshold corresponding to its model. If, after several statistical periods, the sustained-cycle count of any such physical disk reaches the preset sustained-cycle threshold, that physical disk is marked as a faulty physical disk.

When recording, for each such physical disk, the disk subsystem proceeds as follows at the end of the next statistical period. If the physical disk is again determined to have an average response time that reaches the disk abnormal response time threshold corresponding to its model, its sustained-cycle count is increased and recorded. If the physical disk is not found to have reached the threshold, its sustained-cycle count is set to zero. The initial value of the sustained-cycle count is zero.
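The "absolutely consecutive" variant differs only in what happens after a normal period: the count is reset rather than decremented. The following is an illustrative sketch; `update_absolute`, `counts`, and `limit` are hypothetical names chosen for the example.

```python
def update_absolute(counts, abnormal_now, limit):
    """Update sustained-cycle counts, requiring strictly consecutive
    abnormal periods.

    counts:       dict disk_name -> sustained-cycle count.
    abnormal_now: set of disks judged abnormal in this period.
    limit:        preset sustained-cycle threshold (hypothetical name).
    Returns the sorted list of disks whose count has reached the limit.
    """
    # Abnormal in this period: increment.
    for disk in abnormal_now:
        counts[disk] = counts.get(disk, 0) + 1
    # Normal in this period: any accumulated run is discarded.
    for disk in counts:
        if disk not in abnormal_now:
            counts[disk] = 0
    return sorted(d for d, c in counts.items() if c >= limit)
```

With `limit=3`, the sequence abnormal, abnormal, normal, abnormal, abnormal gives counts of 1, 2, 0, 1, 2, so the disk is not reported until a third abnormal period follows without interruption.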

It should also be noted that, after receiving the disk subsystem's notification to rebuild the faulty physical disk, the RAID array may determine whether it currently satisfies the rebuild conditions. If it does, the RAID array rebuilds the faulty physical disk. If it does not, the faulty physical disk is not rebuilt until the RAID array satisfies the rebuild conditions, at which point the rebuild is triggered.

The rebuild conditions may include, for example: the health state of the RAID array meets the rebuild requirements; a hot spare disk is available; and rebuilding the faulty physical disk would not exceed the number of physical disks that the RAID array to which it belongs supports rebuilding simultaneously. In practice, developers may set the rebuild conditions according to the actual situation. The conditions above are only examples and are not specifically limiting.
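One way the example rebuild conditions could be checked is sketched below. Everything in this sketch is an assumption made for illustration: the class `RaidArray`, its fields, and the deferral list `pending` do not appear in the application, which leaves the concrete conditions to the developer.

```python
from dataclasses import dataclass, field


@dataclass
class RaidArray:
    # All fields are hypothetical stand-ins for the example conditions.
    health_ok: bool                 # health state meets rebuild requirements
    hot_spares: int                 # available hot spare disks
    rebuilding: int                 # disks currently being rebuilt
    max_concurrent_rebuilds: int    # simultaneous rebuilds supported
    pending: list = field(default_factory=list)

    def can_rebuild(self):
        """True when all three example rebuild conditions hold."""
        return (
            self.health_ok
            and self.hot_spares > 0
            and self.rebuilding + 1 <= self.max_concurrent_rebuilds
        )

    def on_fault_notice(self, disk):
        """Start a rebuild if the conditions hold; otherwise defer it."""
        if self.can_rebuild():
            self.hot_spares -= 1
            self.rebuilding += 1
            return "rebuilding"
        self.pending.append(disk)   # retried once the conditions are met
        return "deferred"
```

For instance, an array with one hot spare and a limit of one concurrent rebuild would start rebuilding the first faulty disk and defer a second notification until the conditions are satisfied again.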

After the disk subsystem completes the judgments and processing for the current period, it may clear, for each physical disk, the accumulated response time of completed IO read/write instructions and the count of completed IO read/write instructions for the current statistical period, so that both values can be collected afresh in the next statistical period.
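The two per-disk statistics and their reset at the end of a period can be sketched as follows. The class name `DiskStats`, its method names, and the choice to report an average of zero for a disk that completed no IO in the period are assumptions made for this example.

```python
class DiskStats:
    """Per-disk statistics for one statistical period (illustrative)."""

    def __init__(self):
        self.total_ms = 0.0   # accumulated response time of completed IO
        self.completed = 0    # number of completed IO instructions

    def record(self, response_ms):
        """Account for one completed IO read/write instruction."""
        self.total_ms += response_ms
        self.completed += 1

    def average(self):
        """Accumulated response time divided by completed count.
        Assumption: a disk with no completed IO reports zero."""
        return self.total_ms / self.completed if self.completed else 0.0

    def reset(self):
        """Clear both values so the next period starts afresh."""
        self.total_ms = 0.0
        self.completed = 0
```

Resetting at the end of each period keeps the average tied to recent behavior instead of diluting a newly slow disk with its earlier history.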

First, based on the average response time of each physical disk, the disk subsystem can mark a non-faulty physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a faulty physical disk, and notify the RAID array to which that disk belongs to rebuild it. A rebuild of the RAID array is thereby triggered based on the response time of the physical disk's IO read/write instructions.

In addition, because the disk subsystem can trigger a rebuild of the RAID array only after the same physical disk has been detected as abnormal over several consecutive periods, faulty member disks are detected more accurately, and a member disk with only a temporary abnormal response is not mistakenly treated as faulty.

Corresponding to the foregoing embodiments of the method for triggering RAID array reconstruction, the application also provides embodiments of a device for triggering RAID array reconstruction.

Embodiments of the device for triggering RAID array reconstruction may be applied to a storage device. The device embodiments may be implemented in software, in hardware, or in a combination of software and hardware. Taking a software implementation as an example, the device in the logical sense is formed when the processor of the storage device on which it resides reads the corresponding computer program instructions from non-volatile memory into memory and runs them. At the hardware level, FIG. 2 shows a hardware structure diagram of the storage device on which the device for triggering RAID array reconstruction resides. In addition to the processor, memory, network egress interface, and non-volatile memory shown in FIG. 2, the storage device may include other hardware depending on its actual function, which is not described further here.

Refer to FIG. 3, which is a block diagram of a device for triggering RAID array reconstruction according to an exemplary embodiment of the application. The device is applied to a disk subsystem of a storage device, the storage device includes at least one RAID array, and the RAID array includes several physical disks. The device includes:

an issuing unit 310, configured to issue IO read/write instructions to each physical disk according to the IO read/write requests of the relevant subsystems;

a calculation unit 320, configured to calculate the average response time of each physical disk based on the response times of the IO read/write instructions returned by each physical disk within a preset statistical period;

a judging unit 330, configured to determine, for each non-faulty physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to its model, wherein different models of physical disk have different disk abnormal response time thresholds; and

a marking unit 340, configured to mark a physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a faulty physical disk, and, if the faulty physical disk belongs to a RAID array, to notify the RAID array to which the faulty physical disk belongs to rebuild the faulty physical disk.

In an optional implementation, the disk abnormal response time threshold corresponding to a model is the product of the average response time of physical disks of that model and a preset abnormal response time weighting value.

The judging unit 330 is specifically configured to: count, for each model, the non-faulty physical disks whose average response time is not zero; accumulate, for each model, the average response times of the non-faulty physical disks of that model; divide, for each model, the accumulated average response time by the count of non-faulty physical disks of that model whose average response time is not zero, to obtain the average response time for that model of physical disk; calculate the product of the average response time for each model and the preset abnormal response time weighting value, to obtain the disk abnormal response time threshold for that model; and determine whether the average response time of each non-faulty physical disk reaches the disk abnormal response time threshold corresponding to its model.

In another optional implementation, the calculation unit 320 is specifically configured to: accumulate, for each physical disk, the response times of the IO read/write instructions completed in the preset statistical period; count, for each physical disk, the IO read/write instructions completed in the preset statistical period; and divide the accumulated response time of each physical disk by its counted number of IO read/write instructions, to obtain the average response time of each physical disk.

In another optional implementation, the marking unit 340 is specifically configured to record a sustained-cycle count for each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model, and, if after several statistical periods the sustained-cycle count of any such physical disk reaches a preset sustained-cycle threshold, to mark that physical disk as a faulty physical disk.

In another optional implementation, the marking unit 340 is further configured to proceed as follows, at the end of the next statistical period, for each physical disk whose average response time has reached the disk abnormal response time threshold corresponding to its model: if the physical disk is again determined to have reached the threshold, increase its sustained-cycle count and record it; if the physical disk is not determined to have reached the threshold, decrease its sustained-cycle count and record it. The initial value of a physical disk's sustained-cycle count is zero.

For the implementation of the functions and roles of each unit in the above device, refer to the implementation of the corresponding steps in the above method; the details are not repeated here.

Because the device embodiments basically correspond to the method embodiments, the relevant parts of the description of the method embodiments apply. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this application. A person of ordinary skill in the art can understand and implement it without creative effort.

The above are only preferred embodiments of the application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within its scope of protection.

Claims

1. A method for triggering RAID array reconstruction, characterized in that the method is applied to a disk subsystem of a storage device, the storage device includes at least one RAID array, and the RAID array includes several physical disks, the method comprising:

issuing IO read/write instructions to each physical disk according to the IO read/write requests of the relevant subsystems;

calculating the average response time of each physical disk based on the response times of the IO read/write instructions returned by each physical disk within a preset statistical period;

determining, for each non-faulty physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to its model, wherein different models of physical disk have different disk abnormal response time thresholds; and

marking a physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a faulty physical disk, and, if the faulty physical disk belongs to a RAID array, notifying the RAID array to which the faulty physical disk belongs to rebuild the faulty physical disk.

2. The method according to claim 1, wherein the disk abnormal response time threshold corresponding to a model is the product of the average response time of physical disks of that model and a preset abnormal response time weighting value;

and wherein determining, for each non-faulty physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to its model comprises:

counting, for each model, the non-faulty physical disks whose average response time is not zero;

accumulating, for each model, the average response times of the non-faulty physical disks of that model;

dividing, for each model, the accumulated average response time by the count of non-faulty physical disks of that model whose average response time is not zero, to obtain the average response time for that model of physical disk;

calculating the product of the average response time for each model of physical disk and the preset abnormal response time weighting value, to obtain the disk abnormal response time threshold for that model; and

determining whether the average response time of each non-faulty physical disk reaches the disk abnormal response time threshold corresponding to its model.

3. The method according to claim 1, characterized in that calculating the average response time of each physical disk based on the response times of the IO read/write instructions returned by each physical disk within a preset statistical period comprises:

accumulating, for each physical disk, the response times of the IO read/write instructions completed in the preset statistical period;

counting, for each physical disk, the IO read/write instructions completed in the preset statistical period; and

dividing the accumulated response time of each physical disk by its counted number of IO read/write instructions, to obtain the average response time of each physical disk.

4. The method according to claim 1, wherein marking a physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a faulty physical disk comprises:

recording a sustained-cycle count for each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model; and

if, after several statistical periods, the sustained-cycle count of any physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model reaches a preset sustained-cycle threshold, marking that physical disk as a faulty physical disk.

5. The method according to claim 4, wherein recording a sustained-cycle count for each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model comprises:

for each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model, at the end of the next statistical period: if the physical disk is again determined to have an average response time that reaches the disk abnormal response time threshold corresponding to its model, increasing the sustained-cycle count of the physical disk and recording it; and if the physical disk is not determined to have an average response time that reaches the disk abnormal response time threshold corresponding to its model, decreasing the sustained-cycle count of the physical disk and recording it; wherein the initial value of the sustained-cycle count of a physical disk is zero.

6. A device for triggering RAID array reconstruction, characterized in that the device is applied to a disk subsystem of a storage device, the storage device includes at least one RAID array, and the RAID array includes several physical disks, the device comprising:

an issuing unit, configured to issue IO read/write instructions to each physical disk according to the IO read/write requests of the relevant subsystems;

a calculation unit, configured to calculate the average response time of each physical disk based on the response times of the IO read/write instructions returned by each physical disk within a preset statistical period;

a judging unit, configured to determine, for each non-faulty physical disk, whether its average response time reaches the disk abnormal response time threshold corresponding to its model, wherein different models of physical disk have different disk abnormal response time thresholds; and

a marking unit, configured to mark a physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model as a faulty physical disk, and, if the faulty physical disk belongs to a RAID array, to notify the RAID array to which the faulty physical disk belongs to rebuild the faulty physical disk.

7. The device according to claim 6, wherein the disk abnormal response time threshold corresponding to a model is the product of the average response time of physical disks of that model and a preset abnormal response time weighting value;

the judging unit is specifically configured to: count, for each model, the non-faulty physical disks whose average response time is not zero; accumulate, for each model, the average response times of the non-faulty physical disks of that model; divide, for each model, the accumulated average response time by the count of non-faulty physical disks of that model whose average response time is not zero, to obtain the average response time for that model of physical disk; calculate the product of the average response time for each model of physical disk and the preset abnormal response time weighting value, to obtain the disk abnormal response time threshold for that model; and determine whether the average response time of each non-faulty physical disk reaches the disk abnormal response time threshold corresponding to its model.

8. The device according to claim 6, wherein the calculation unit is specifically configured to: accumulate, for each physical disk, the response times of the IO read/write instructions completed in the preset statistical period; count, for each physical disk, the IO read/write instructions completed in the preset statistical period; and divide the accumulated response time of each physical disk by its counted number of IO read/write instructions, to obtain the average response time of each physical disk.

9. The device according to claim 6, wherein the marking unit is specifically configured to record a sustained-cycle count for each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model, and, if after several statistical periods the sustained-cycle count of any such physical disk reaches a preset sustained-cycle threshold, to mark that physical disk as a faulty physical disk.

10. The device according to claim 9, wherein the marking unit is further configured to, for each physical disk whose average response time reaches the disk abnormal response time threshold corresponding to its model, at the end of the next statistical period: if the physical disk is again determined to have an average response time that reaches the disk abnormal response time threshold corresponding to its model, increase the sustained-cycle count of the physical disk and record it; and if the physical disk is not determined to have an average response time that reaches the disk abnormal response time threshold corresponding to its model, decrease the sustained-cycle count of the physical disk and record it; wherein the initial value of the sustained-cycle count of a physical disk is zero.