CN118277154B - Hard disk fault recovery method and computing device
- Publication number
- CN118277154B (application CN202410223671.8A)
- Authority
- CN
- China
- Prior art keywords
- hard disk
- fault recovery
- target hard
- recovery operation
- card
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1441—Resetting or repowering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2053—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C29/00—Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
- G11C29/70—Masking faults in memories by using spares or by reconfiguring
- G11C29/78—Masking faults in memories by using spares or by reconfiguring using programmable devices
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
Embodiments of the present application provide a hard disk fault recovery method and a computing device. The method is applied to a computing device that includes a management controller, a redundant array of independent disks (RAID) card, a backplane, and at least one hard disk. The backplane includes a programmable logic device, and the RAID card is connected to each hard disk through the backplane. When the RAID card detects that a target hard disk has failed, the RAID card sends a fault recovery instruction for the target hard disk to the management controller. The fault recovery instruction indicates that a fault recovery operation needs to be performed on the target hard disk, which is any one of the at least one hard disk. The management controller responds to the fault recovery instruction by sending a control instruction to the programmable logic device, and the programmable logic device responds to the control instruction by performing the fault recovery operation on the target hard disk. This can reduce fault recovery time and improve hard disk fault recovery efficiency.
Description
Technical Field
The present application relates to the field of computing devices, and in particular, to a method for recovering a hard disk failure and a computing device.
Background
A RAID card is one of the core components of a computing device and typically connects multiple hard disks. A hard disk is a storage medium, and medium errors and hard disk faults can occur over long-term use. When a hard disk fails, the RAID card usually raises a hard disk fault alarm through the management controller of the computing device to prompt operation and maintenance personnel to handle it. That is, at present a failed hard disk must be replaced manually. Because various factors can prevent the hard disk from being replaced immediately after it fails, fault recovery is time-consuming and hard disk fault recovery efficiency is low.
Disclosure of Invention
Embodiments of the present application provide a hard disk fault recovery method and a computing device, which can reduce fault recovery time and improve hard disk fault recovery efficiency.
In a first aspect, an embodiment of the present application provides a hard disk fault recovery method applied to a computing device. The computing device includes a management controller, a redundant array of independent disks (RAID) card, a backplane, and at least one hard disk. The backplane includes a programmable logic device, and the RAID card is connected to each hard disk through the backplane. The method includes: when the RAID card detects that a target hard disk has failed, the RAID card sends a fault recovery instruction for the target hard disk to the management controller, where the fault recovery instruction indicates that a fault recovery operation needs to be performed on the target hard disk, and the target hard disk is any one of the at least one hard disk; the management controller sends a control instruction to the programmable logic device in response to the fault recovery instruction; and the programmable logic device performs the fault recovery operation on the target hard disk in response to the control instruction.
In this implementation, through interaction among the RAID card, the management controller, and the programmable logic device, the target hard disk can be recovered automatically when it fails, without relying on manual handling. This can reduce fault recovery time to a certain extent and improve hard disk fault recovery efficiency.
The programmable logic device is used to control the power sequencing of the slot in which the target hard disk is located. The programmable logic device performing the fault recovery operation on the target hard disk in response to the control instruction includes: in response to the control instruction, the programmable logic device performs a power-down operation on the target hard disk and, after the power-down operation, performs a power-up operation on the target hard disk.
In this implementation, the programmable logic device can quickly recover the hard disk from the fault by powering the target hard disk down and then powering it up again.
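The power-down-then-power-up sequence can be illustrated with a minimal sketch. The class `SlotPowerController`, its methods, and the `settle_seconds` delay are hypothetical names introduced only for illustration; the patent text does not specify how the programmable logic device is programmed or how long it waits between the two operations.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SlotPowerController:
    """Hypothetical stand-in for the backplane's programmable logic device."""
    powered: dict = field(default_factory=dict)  # slot number -> power state
    log: list = field(default_factory=list)      # ordered record of operations

    def power_off(self, slot: int) -> None:
        self.powered[slot] = False
        self.log.append(("off", slot))

    def power_on(self, slot: int) -> None:
        self.powered[slot] = True
        self.log.append(("on", slot))

    def power_cycle(self, slot: int, settle_seconds: float = 0.0) -> None:
        # Power the slot down first, wait, then power it up again.
        # The settle delay is an assumption, not taken from the source.
        self.power_off(slot)
        time.sleep(settle_seconds)
        self.power_on(slot)
```

Calling `power_cycle(3)` on a fresh controller records a power-off followed by a power-on for slot 3 and leaves the slot powered.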
The RAID card sending the fault recovery instruction for the target hard disk to the management controller includes: the RAID card determines the time interval between the current fault time of the target hard disk and the time at which the fault recovery operation was last performed on the target hard disk; and if the time interval is greater than an interval threshold, the RAID card sends the fault recovery instruction for the target hard disk to the management controller.
If the time interval is greater than the interval threshold, the RAID card sending the fault recovery instruction for the target hard disk to the management controller includes:
if the time interval is greater than the interval threshold, the RAID card counts the number of times the fault recovery operation has been performed on the target hard disk;
if the number of times is less than a count threshold, the RAID card sends the fault recovery instruction for the target hard disk to the management controller.
In this implementation, introducing the interval threshold and the count threshold prevents a hard disk that keeps failing from remaining in an endless automatic fault recovery loop that never resolves the fault.
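The two-threshold check can be sketched as follows. The names `RecoveryHistory` and `should_request_recovery` are hypothetical, and the handling of a disk that has never been recovered before (treated here as passing the interval check) is an assumption, since the text does not describe that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryHistory:
    """Per-disk bookkeeping; field names are illustrative assumptions."""
    last_recovery_time: Optional[float] = None  # None: never recovered before
    recovery_count: int = 0


def should_request_recovery(history: RecoveryHistory,
                            fault_time: float,
                            interval_threshold: float,
                            count_threshold: int) -> bool:
    """Return True if the RAID card should send a fault recovery instruction."""
    if history.last_recovery_time is not None:
        interval = fault_time - history.last_recovery_time
        # The interval must be strictly greater than the interval threshold.
        if interval <= interval_threshold:
            return False
    # The number of recoveries so far must be less than the count threshold.
    return history.recovery_count < count_threshold
```

A disk that fails again shortly after its last recovery, or that has already been recovered as many times as the count threshold allows, is not sent for another automatic recovery attempt.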
The method further includes:
the RAID card receives a fault recovery notification for the target hard disk sent by the management controller, where the fault recovery notification is used to notify the RAID card that the fault recovery operation has been performed on the target hard disk.
In this implementation, sending the fault recovery notification to the RAID card allows the RAID card to interact with the target hard disk promptly and determine whether the fault recovery succeeded.
The method further includes:
the RAID card interacts again with the target hard disk on which the fault recovery operation has been performed;
if the RAID card interacts successfully with the target hard disk on which the fault recovery operation has been performed, the RAID card re-enables that target hard disk to execute services.
In this implementation, a successful interaction between the RAID card and the target hard disk after the fault recovery operation means that the target hard disk has recovered and can execute the corresponding services. Interacting with the target hard disk after the fault recovery operation can therefore ensure, to a certain extent, the success rate of fault recovery of the target hard disk.
The RAID card interacting again with the target hard disk on which the fault recovery operation has been performed includes:
the RAID card establishes a link connection with the target hard disk on which the fault recovery operation has been performed;
the RAID card obtains, over the link connection, hard disk association information of that target hard disk, and performs abnormality diagnosis on the target hard disk based on the hard disk association information to obtain a diagnosis result;
if the diagnosis result indicates that the target hard disk is in a normal state after the fault recovery operation, the RAID card obtains the historical RAID group to which the target hard disk belonged before the fault recovery operation was performed;
the RAID card processes the target hard disk on which the fault recovery operation has been performed according to the RAID group level of the historical RAID group.
The RAID card processing the target hard disk on which the fault recovery operation has been performed according to the RAID group level of the historical RAID group includes:
if the RAID group level of the historical RAID group belongs to a first level, the RAID card clears the marking information of the target hard disk, marks the target hard disk whose marking information has been cleared as being in a normal state, and determines that the interaction with the target hard disk succeeded;
if the RAID group level of the historical RAID group belongs to a second level, the RAID card marks the target hard disk as being in a normal state, performs a rebuild operation on the target hard disk and the member disks of the historical RAID group other than the target hard disk to obtain a new RAID group, synchronizes the data of the new RAID group to the target hard disk, and determines that the interaction with the target hard disk succeeded.
In this implementation, the interaction consists of three processes: establishing a link connection with the target hard disk after the fault recovery operation, performing abnormality diagnosis on it based on the hard disk association information, and processing it according to the RAID group level of the historical RAID group. Together these determine more accurately whether the fault recovery of the target hard disk succeeded and can improve, to a certain extent, the success rate of the fault recovery, so that the recovered target hard disk can be used for subsequent service processing.
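The three interaction processes can be sketched as a single function that records the steps it takes. Everything here is illustrative: `LevelClass`, `reinteract`, and the step names are hypothetical, the link and diagnosis checks are passed in as plain callables, and treating a failed link or an abnormal diagnosis result as an interaction failure is an assumption rather than something the text states.

```python
from enum import Enum
from typing import Callable, List


class LevelClass(Enum):
    FIRST = "first"    # handled by clearing marks and marking the disk normal
    SECOND = "second"  # handled by marking normal, rebuilding, and syncing data


def reinteract(link_up: Callable[[], bool],
               diagnose_normal: Callable[[], bool],
               level_class: LevelClass) -> List[str]:
    """Return the ordered steps taken; the last entry is 'success' or 'failure'."""
    steps: List[str] = []
    # Process 1: establish a link connection with the recovered disk.
    if not link_up():
        return steps + ["failure"]
    steps.append("link_established")
    # Process 2: abnormality diagnosis based on hard disk association information.
    if not diagnose_normal():
        return steps + ["failure"]
    steps.append("diagnosed_normal")
    # Process 3: handle the disk according to the historical RAID group level.
    if level_class is LevelClass.FIRST:
        steps += ["clear_marks", "mark_normal"]
    else:
        steps += ["mark_normal", "rebuild_group", "sync_data"]
    return steps + ["success"]
```

A result ending in `"failure"` corresponds to the case in which the RAID card would send the fault recovery instruction again.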
The method further includes:
if the interaction with the target hard disk on which the fault recovery operation has been performed fails, the RAID card sends the fault recovery instruction for the target hard disk to the management controller again.
In a second aspect, an embodiment of the present application provides a hard disk fault recovery method applied to a management controller, where the method includes:
receiving a fault recovery instruction for a target hard disk sent by a RAID card, where the fault recovery instruction is sent when the RAID card detects a fault of the target hard disk, and the fault recovery instruction indicates that a fault recovery operation needs to be performed on the target hard disk;
sending a control instruction to a programmable logic device in response to the fault recovery instruction, where the control instruction instructs the programmable logic device to perform the fault recovery operation on the target hard disk.
In this implementation, through interaction among the RAID card, the management controller, and the programmable logic device, the target hard disk can be recovered automatically when it fails, without relying on manual handling. This can reduce fault recovery time to a certain extent and improve hard disk fault recovery efficiency.
In a third aspect, an embodiment of the present application provides a hard disk fault recovery method applied to a RAID card, where the method includes:
when a fault of a target hard disk is detected, sending a fault recovery instruction for the target hard disk to a management controller, where the fault recovery instruction indicates that a fault recovery operation needs to be performed on the target hard disk, the fault recovery instruction triggers the management controller to send a control instruction to a programmable logic device, and the control instruction instructs the programmable logic device to perform the fault recovery operation on the target hard disk.
In this implementation, the RAID card sends the fault recovery instruction for the target hard disk to the management controller, which triggers the management controller to send the control instruction to the programmable logic device. Through this interaction among the RAID card, the management controller, and the programmable logic device, the target hard disk can be recovered automatically when it fails, without relying on manual handling. This can reduce fault recovery time to a certain extent and improve hard disk fault recovery efficiency.
In a fourth aspect, an embodiment of the present application provides a hard disk fault recovery method applied to a programmable logic device, where the method includes:
receiving a control instruction sent by a management controller, where the control instruction is sent when the management controller receives a fault recovery instruction for a target hard disk from a RAID card, the fault recovery instruction is sent when the RAID card detects a fault of the target hard disk, and the fault recovery instruction indicates that a fault recovery operation needs to be performed on the target hard disk; and performing the fault recovery operation on the target hard disk in response to the control instruction.
In this implementation, the RAID card sends the fault recovery instruction for the target hard disk to the management controller, which triggers the management controller to send the control instruction to the programmable logic device. Through this interaction among the RAID card, the management controller, and the programmable logic device, the target hard disk can be recovered automatically when it fails, without relying on manual handling. This can reduce fault recovery time to a certain extent and improve hard disk fault recovery efficiency.
In a fifth aspect, an embodiment of the present application provides a computing device. The computing device includes a management controller, a RAID card, a backplane, and at least one hard disk. The backplane includes a programmable logic device, the RAID card is connected to each hard disk through the backplane, and the computing device is configured to execute the hard disk fault recovery method described above.
In a sixth aspect, an embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program includes program instructions that when executed cause the above-mentioned method for recovering from a hard disk failure to be implemented.
In a seventh aspect, embodiments of the present application provide a computer program product comprising a computer program or instructions which, when run on a computing device, cause the computing device to perform a method of recovering from a failure of a hard disk as described above.
Drawings
FIG. 1 is a block diagram of a computing device according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a fault recovery system of a hard disk according to an embodiment of the present application;
FIG. 3 is a block diagram of another hard disk failure recovery system according to an embodiment of the present application;
FIG. 4 is a flow chart of a fault recovery method of a hard disk according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a method for recovering a hard disk failure according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of a method for recovering a hard disk failure according to an embodiment of the present application;
FIG. 7 is a flow chart of a fault recovery method of a hard disk according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of a method for recovering a hard disk failure according to an embodiment of the present application;
FIG. 9 is a flow chart of a fault recovery method for a hard disk according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application.
FIG. 1 is a block diagram of a computing device according to an embodiment of the present application. The computing device may be a server. The computing device includes a RAID (Redundant Array of Independent Disks) card 101, a processor 102, a management controller 103, a motherboard 104, a backplane 105, and at least one hard disk 106. The backplane 105 includes a plurality of slots, and a hard disk 106 may be inserted into the corresponding slot of the backplane 105. The hard disk 106 is the primary storage device in the computing device and may be a solid state disk, a mechanical disk, or the like. A solid state disk stores data in flash memory chips, and a mechanical disk stores data on magnetic platters. A mechanical hard disk typically consists of multiple platters and a read/write head assembly. The platters are usually made of aluminum alloy or glass, and the read/write head assembly consists of a magnetic head and a motor. The platter surface is covered with a very thin magnetic coating for recording data, and the read/write head reads and writes data on the platter. The hard disk also includes a controller that is responsible for moving the head to the correct position, rotating the platters, and transferring data to the computing device.
The RAID card 101 is installed in a PCIe (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) slot of the motherboard 104 and is electrically connected to the motherboard 104. Upstream, the RAID card 101 connects to the processor 102 through the PCIe protocol; downstream, it connects to the backplane 105 and the hard disks 106 through the SAS (Serial Attached SCSI) protocol. The RAID card 101 may establish a connection with at least one hard disk 106, and the at least one hard disk 106 may form a RAID group under the RAID card 101. A RAID group formed in this way may also be called a VD (Virtual Drive). RAID groups fall into two classes: RAID groups of non-redundant RAID levels, such as RAID0, and RAID groups of redundant RAID levels, such as RAID1, RAID5, and RAID6. In a RAID group of a redundant RAID level, a redundant backup mechanism exists among the member disks, so that when a member disk is no longer present, the data of the whole RAID group can still be read normally from the member disks that remain present. In a RAID group of a non-redundant RAID level, no redundant backup mechanism exists among the member disks, and once a member disk is no longer present, the data of the whole RAID group cannot be read normally. In addition, each hard disk 106 may exist as a single-disk JBOD (Just a Bunch Of Disks), that is, a single bare disk under the RAID card 101 that does not belong to any RAID group.
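The distinction between the two classes of RAID group can be captured in a small sketch. The function names are hypothetical, and only the levels named in the text (RAID0, RAID1, RAID5, RAID6) are classified; any other level raises an error rather than being guessed.

```python
REDUNDANT_LEVELS = {"RAID1", "RAID5", "RAID6"}  # examples given in the text
NON_REDUNDANT_LEVELS = {"RAID0"}                # example given in the text


def is_redundant(level: str) -> bool:
    """Classify a RAID level as redundant or non-redundant."""
    name = level.upper()
    if name in REDUNDANT_LEVELS:
        return True
    if name in NON_REDUNDANT_LEVELS:
        return False
    raise ValueError(f"unclassified RAID level: {level}")


def readable_after_one_member_lost(level: str) -> bool:
    """A redundant group can still be read after one member disk is lost."""
    return is_redundant(level)
```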
The processor 102 may be a central processing unit (CPU). In a computer architecture, the CPU is the core hardware unit that controls and allocates all hardware resources of the computer (such as memory and input/output units) and performs general-purpose operations. The CPU is the computing and control core of the computing device and the final execution unit for information processing and program execution.
The management controller 103, as the out-of-band management core component of the computing device, may interact with the RAID card 101. The interaction channel between the management controller 103 and the RAID card 101 may be an IIC (Inter-Integrated Circuit) channel or an MCTP (Management Component Transport Protocol) channel. Accordingly, the management controller 103 may interact with the RAID card 101 through IIC signals or MCTP signals.
The RAID card 101 supports many functions and a large amount of related information, and it shields the information of the hard disks under it. Information shielding means that the management controller 103 cannot directly obtain the information and status of the hard disks under the RAID card 101; instead, the management controller 103 must query the information and status of any hard disk 106 through the RAID card 101, and the RAID card 101 reports the query results for the hard disk 106 to the management controller 103. In one implementation, for the management controller 103 to query information about the RAID card 101 or about the hard disks 106 under it, a LIB (library) specific to the RAID card 101 is generally required. The LIB is developed and defined by each RAID card manufacturer. The LIB defines the application programming interfaces (APIs) used for interaction between the RAID card 101 and the management controller 103, including at least one of an API for information query, an API for information setting, an API for creation, and an API for information deletion. In the embodiments of the present application, two further APIs may be defined in the LIB in advance: a notification API, through which the management controller sends a fault recovery notification to the RAID card to notify it that the fault recovery operation has been performed on the failed hard disk, and an information providing API, through which the RAID card provides fault information of a hard disk to the management controller. The management controller 103 may call the corresponding API as needed to interact with the RAID card.
For example, when the management controller 103 needs to query information about a hard disk under the RAID card 101, it calls the API for information query and sends a query instruction through the IIC or MCTP channel to query the information of a particular hard disk 106 under the RAID card 101.
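The query and notification APIs can be pictured with a minimal sketch. Because each LIB is vendor-defined, every name below (`RaidLib`, `query_disk_info`, `notify_recovery_done`, the command strings, and the `fake_channel` transport) is a hypothetical placeholder and does not reflect any real vendor library; the transport simply stands in for an IIC or MCTP channel.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RaidLib:
    """Hypothetical wrapper around a vendor-defined RAID card LIB."""
    send: Callable[[str, dict], dict]  # stand-in for the IIC or MCTP channel

    def query_disk_info(self, slot: int) -> dict:
        # Information query API: ask the RAID card about one hard disk.
        return self.send("query", {"slot": slot})

    def notify_recovery_done(self, slot: int) -> dict:
        # Notification API: tell the RAID card that recovery was performed.
        return self.send("notify_recovery", {"slot": slot})


def fake_channel(disk_states: Dict[int, str]) -> Callable[[str, dict], dict]:
    """Build an in-memory transport that answers as a RAID card might."""
    def send(command: str, payload: dict) -> dict:
        slot = payload["slot"]
        if command == "query":
            return {"slot": slot, "state": disk_states[slot]}
        if command == "notify_recovery":
            return {"slot": slot, "ack": True}
        raise ValueError(f"unknown command: {command}")
    return send
```

The management controller side would hold a `RaidLib` instance and call the method matching its need, which mirrors the idea that it cannot read hard disk status directly and must go through the RAID card.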
It should be noted that different manufacturers may use different names for the management controller. Typical names include BMC (baseboard management controller), iLO (Integrated Lights-Out, an integrated remote management port), iDRAC (Integrated Dell Remote Access Controller, a remote control card), HDM (Hardware Device Management), and IMM (Integrated Management Module). Whether it is called iLO, iDRAC, HDM, or IMM, it is understood to be a management controller in the embodiments of the present application.
The backplane 105 includes a programmable logic device, which may be a CPLD (Complex Programmable Logic Device). The programmable logic device of the backplane 105 may be used to control the power sequencing of the whole backplane 105 and of the slot in which each hard disk 106 is located. It may also control the LED of the slot in which each hard disk 106 is located, where different LED states represent different hard disk states. For example, the LED states include a locate state, a normal (active) state, and a fault state: the locate state indicates that the hard disk is being located, the normal state indicates that the hard disk is working normally, and the fault state indicates that the hard disk has failed.
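The three LED states can be written down as a small enumeration. The names `SlotLed`, `LED_MEANING`, and `led_for_disk` are hypothetical, and the rule that the locate state takes priority over the other two is an assumption made only so the example has a single answer.

```python
from enum import Enum


class SlotLed(Enum):
    LOCATE = "locate"
    ACTIVE = "active"
    FAULT = "fault"


LED_MEANING = {
    SlotLed.LOCATE: "the hard disk is being located",
    SlotLed.ACTIVE: "the hard disk is working normally",
    SlotLed.FAULT: "the hard disk has failed",
}


def led_for_disk(is_faulty: bool, locating: bool = False) -> SlotLed:
    """Pick the LED state for a slot; locate-first priority is an assumption."""
    if locating:
        return SlotLed.LOCATE
    return SlotLed.FAULT if is_faulty else SlotLed.ACTIVE
```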
The programmable logic device of the backplane 105 may communicate with the RAID card 101 through SGPIO (Serial General Purpose Input/Output) signals, which connect the system motherboard to external devices. In one implementation, the RAID card 101 may control the LED states of the backplane 105 by sending SGPIO signals. The backplane 105 may also communicate with the management controller 103 through IIC signals. In the embodiments of the present application, the management controller 103 may control the backplane 105 by sending IIC signals, and may control the programmable logic device to power down and power up the slot of each hard disk under the RAID card 101.
Currently, when the RAID card 101 detects that a hard disk 106 has failed, it sets the failed hard disk 106 to the Fail state and passes the fault information of the hard disk to the management controller 103. The management controller 103 generates alarm information that prompts operation and maintenance personnel to handle the failed hard disk 106, and the failed hard disk then waits to be replaced manually. This approach consumes considerable labor and time. Moreover, because the hard disk has already failed, a larger abnormality may arise if operation and maintenance personnel cannot replace it in time. For example, if one hard disk in a RAID group (for example, RAID 1/5) fails, the RAID group is merely degraded and can still process foreground and background I/O normally, but if multiple hard disks in the RAID group fail, the RAID group itself fails and can no longer process foreground and background I/O normally.
Therefore, to improve on this scenario, the embodiment of the present application provides a hard disk fault recovery scheme. After the RAID card finds that a hard disk has failed, the scheme performs an automatic fault recovery operation on the failed hard disk through interaction among the RAID card, the management controller and the programmable logic device, thereby attempting to recover the hard disk and increasing the likelihood that the hard disk can continue to serve normally. This approach can reduce labor and time costs, improve hard disk fault recovery efficiency, avoid to a certain extent the situation in which a failed hard disk causes a larger abnormality, and reduce the hard disk failure rate. For ease of understanding, please refer to fig. 2, which is a schematic diagram of a fault recovery system for a hard disk according to an embodiment of the present application. In fig. 2, the fault recovery system for the hard disk mainly includes a RAID card, a management controller, a programmable logic device of a backplane, and at least one hard disk. When one of the at least one hard disk fails, the interaction flow among the RAID card, the management controller and the programmable logic device is as follows:
① When the RAID card detects that the hard disk has failed, the RAID card may set the hard disk to a fail state.
② The RAID card notifies the management controller that a fault recovery operation is to be performed on the failed hard disk. In one implementation, the RAID card notifies the management controller that a power-down and then power-up operation is to be performed on the failed hard disk.
③ After receiving the notification from the RAID card, the management controller notifies the programmable logic device of the backplane to perform the fault recovery operation on the failed hard disk.
④ The management controller sends a fault recovery notification to the RAID card, the fault recovery notification being used to notify that the fault recovery operation has been performed on the hard disk.
⑤ The RAID card interacts again, according to the fault recovery notification, with the hard disk on which the fault recovery operation has been performed.
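The five-step interaction above can be illustrated with a minimal Python sketch. All class names, method names and the string "recovery_performed" are hypothetical and chosen only for illustration; real firmware would exchange IIC or MCTP messages rather than make in-process calls, and the re-interaction of step ⑤ is reduced here to a caller-supplied callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Disk:
    slot: int
    state: str = "normal"  # "normal" or "fail"


@dataclass
class ProgrammableLogicDevice:
    """Stand-in for the backplane programmable logic device (hypothetical API)."""
    log: List[str] = field(default_factory=list)

    def recover(self, slot: int) -> None:
        # Step 3: the fault recovery operation is applied to the slot.
        self.log.append(f"recover slot {slot}")


@dataclass
class ManagementController:
    pld: ProgrammableLogicDevice

    def request_recovery(self, slot: int) -> str:
        # Step 3: forward the request to the programmable logic device.
        self.pld.recover(slot)
        # Step 4: report back that the operation has been performed.
        return "recovery_performed"


@dataclass
class RaidCard:
    controller: ManagementController

    def handle_fault(self, disk: Disk, reinteract: Callable[[Disk], bool]) -> bool:
        disk.state = "fail"                                    # step 1
        notice = self.controller.request_recovery(disk.slot)   # steps 2 to 4
        if notice == "recovery_performed" and reinteract(disk):  # step 5
            disk.state = "normal"
            return True
        return False
```

For example, `RaidCard(ManagementController(ProgrammableLogicDevice())).handle_fault(Disk(slot=3), lambda d: True)` returns `True` and leaves the disk in the "normal" state, while a callback returning `False` leaves it in the "fail" state.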
To implement the interaction flow among the RAID card, the management controller and the programmable logic device, a plurality of modules are introduced in the embodiment of the present application. Please refer to fig. 3, which is a block diagram of another fault recovery system for a hard disk provided in the embodiment of the present application. In fig. 3, the RAID card includes a FW (Firmware) module, which runs on the RAID card and controls the RAID card to perform its normal operations. The FW module includes a hard disk monitoring module, a hard disk automatic recovery module and a hard disk interaction module; the management controller includes a control module; and the programmable logic device of the backplane includes a fault recovery module. Each module is described below:
(1) FW module
① Hard disk monitoring module
The hard disk monitoring module is mainly responsible for monitoring whether a hard disk is abnormal. When a hard disk fails or responds abnormally, the hard disk monitoring module detects the hard disk failure in time and triggers the hard disk automatic recovery module.
② Automatic hard disk recovery module
The hard disk automatic recovery module is mainly responsible for executing the automatic fault recovery flow on the failed hard disk:
1. Determining the time interval since the automatic fault recovery operation was last performed on the failed hard disk. If the time interval is smaller than an interval threshold, the RAID card is triggered to kick the disk and the management controller is notified to raise an alarm, prompting maintenance personnel to replace the hard disk.
2. Counting the number of times the fault recovery operation has been performed on the failed hard disk before the current automatic fault recovery flow. In one implementation, the fault recovery operation includes a power-down and power-up operation, in which case the number of power-down and power-up operations already performed on the failed hard disk is counted. If the number of times is greater than a times threshold Y, the RAID card is triggered to kick the disk and the management controller is notified to raise an alarm, prompting maintenance personnel to replace the hard disk.
3. When the time interval is greater than the interval threshold and/or the number of times is smaller than or equal to the times threshold, notifying the control module in the management controller that the fault recovery operation needs to be performed on the failed hard disk. In one implementation, the FW module in the RAID card interacts with the control module through an IIC or MCTP channel, mainly based on APIs in a LIB library customized by the RAID card manufacturer.
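The gating logic of the hard disk automatic recovery module can be sketched as follows. This is a minimal illustration under stated assumptions, not the firmware's actual code: the function and enum names are hypothetical, the sketch requires both checks to pass before recovery is requested (as in the flow of fig. 6), an interval exactly equal to the threshold is treated as not passing, and a disk that has never been recovered (interval `None`) is assumed to pass the interval check.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    REQUEST_RECOVERY = "request_recovery"  # notify the control module
    KICK_AND_ALARM = "kick_and_alarm"      # kick the disk, raise an alarm


def decide(interval_s: Optional[float],
           recovery_count: int,
           interval_threshold_s: float,
           count_threshold: int) -> Decision:
    """Decide whether to attempt another automatic fault recovery.

    interval_s is the time since the last recovery operation on this disk,
    or None if no recovery has been performed yet (assumption).
    """
    # Check 1: recoveries that recur too quickly indicate a bad disk.
    if interval_s is not None and interval_s <= interval_threshold_s:
        return Decision.KICK_AND_ALARM
    # Check 2: too many recoveries already performed (threshold Y).
    if recovery_count > count_threshold:
        return Decision.KICK_AND_ALARM
    return Decision.REQUEST_RECOVERY
```

With a one-day interval threshold and Y = 5, a disk that failed again one hour after its last recovery is kicked, whereas a disk that failed a week later with five prior recoveries is given another attempt.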
③ Hard disk interaction module
The hard disk interaction module is mainly responsible for interacting with the hard disk after the fault recovery operation (such as the power-down and power-up operation) has been performed. The interaction mainly includes at least one of disk discovery and link establishment, hard disk diagnosis, re-marking the hard disk as normal, reconstructing the RAID group, and the like.
1. Disk discovery and link establishment: the RAID card again performs initialization actions such as disk discovery, link establishment and rate negotiation. Disk discovery and link establishment means that the RAID card searches again for the hard disk on which the fault recovery operation has been performed and establishes a link connection with it. Rate negotiation means that the RAID card and the hard disk on which the fault recovery operation has been performed negotiate the rate at which data is read from and written to the hard disk.
2. Hard disk diagnosis, which includes the following. A. Reading hard disk attribute information. The hard disk attribute information includes S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) information, which indicates the health state of the hard disk: a hard disk that adopts S.M.A.R.T. technology can, through monitoring instructions on the hard disk and monitoring software on the host, analyze the operating conditions of the magnetic head and the platter and compare them with the history record and preset safe values. When the operating condition of the magnetic head or the platter exceeds its preset safe value range, the hard disk is considered abnormal. B. Checking log information. The log information may include logs, which may contain error reporting information of the hard disk. If error reporting information is present, the hard disk is considered abnormal; if not, the hard disk is considered normal. C. Executing a Short DST (Disk Self Test, short self-test) and determining from the test result whether the hard disk is abnormal. Short DST is a general command for diagnosing hard disks in the SAS protocol. The short self-test checks only the primary components, such as the read/write heads and the platters, to ensure that they all operate within acceptable parameters; when any one of them fails, the test generates an error code, at which point the hard disk can be determined to be abnormal.
3. Re-marking the hard disk as normal: if the hard disk on which the fault recovery operation has been performed originally was a member disk of a RAID group of a first level, where the first level is a non-redundant RAID level, then, because the fault recovery operation has been performed on the hard disk, marking information is added to the hard disk under the RAID card. The marking information includes, for example, a Foreign state, which indicates that the hard disk on which the fault recovery operation has been performed is a foreign-configuration hard disk. The RAID card needs to clear the marking information of the hard disk and re-mark the hard disk as normal.
4. Reconstructing the RAID group: if the hard disk on which the fault recovery operation has been performed originally was a member disk of a RAID group of a second level, where the second level is a redundant RAID level such as RAID1/5/6, the RAID card needs to clear the marking information of the hard disk and re-mark the hard disk as normal. After the hard disk is marked as normal, the RAID card needs to perform a reconstruction operation on this hard disk and the other hard disks of the original RAID group to obtain a new RAID group, and to synchronize the data of the new RAID group to the hard disk on which the fault recovery operation has been performed, so as to restore the RAID group to the normal state.
5. Determining whether the interaction flow between the RAID card and the hard disk on which the fault recovery operation has been performed is successful. If the interaction is successful, the hard disk is re-enabled and processes services normally (including foreground/background IO). Successful interaction means that disk discovery, link establishment and rate negotiation are completed, the abnormality diagnosis of the hard disk passes (that is, the diagnosis result indicates that the hard disk on which the fault recovery operation has been performed is in a normal state), the hard disk is marked as normal, and the RAID group is reconstructed. If the interaction fails, the fault recovery operation is to be attempted again on the failed hard disk: the number of times the fault recovery operation has been performed is incremented by one, and the flow jumps back to the hard disk automatic recovery module.
(2) Control module
The control module is mainly responsible for notifying the programmable logic device of the backplane to perform the fault recovery operation on the failed hard disk. For example, if the fault recovery operation includes a power-down and power-up operation, the control module notifies the programmable logic device to perform the power-down and power-up operation on the failed hard disk. In addition, the control module is responsible for notifying the firmware module of the RAID card of the completion status of the fault recovery operation on the failed hard disk, for example the completion status of the power-down and power-up operation. When the RAID card notifies the management controller that the fault recovery operation needs to be performed on the failed hard disk, the management controller transmits a control instruction to the programmable logic device of the backplane through the IIC channel, where the control instruction instructs the fault recovery module in the programmable logic device to perform the fault recovery operation on the failed hard disk.
(3) Fault recovery module
The fault recovery module is mainly responsible for controlling the fault recovery operation on the slot in which a specified hard disk is located. In one implementation, the fault recovery operation includes a power-down and power-up operation, and the fault recovery module is mainly responsible for performing the power-down and power-up operation on the slot in which the specified hard disk is located. When a control instruction for the specified hard disk sent by the control module is received, the fault recovery module in the programmable logic device on the backplane performs the fault recovery operation on the slot in which the specified hard disk is located, where the specified hard disk may be the failed hard disk.
In summary, in the embodiment of the present application, by introducing corresponding modules in the RAID card, the management controller, and the programmable logic device, automatic failure recovery of a failed hard disk can be better completed through interaction between the modules, so that failure recovery time and labor cost can be reduced to a certain extent, failure rate of the hard disk can be reduced, and failure recovery efficiency can be improved.
The following describes a method for recovering a hard disk failure according to the embodiment of the present application.
Fig. 4 is a schematic flow chart of a fault recovery method for a hard disk according to an embodiment of the present application. The fault recovery method of the hard disk can be cooperatively executed by each device in the fault recovery system of the hard disk. The fault recovery method of the hard disk provided by the embodiment of the application can comprise the following steps S401-S403:
S401, when the RAID card detects the fault of the target hard disk, the RAID card sends a fault recovery instruction aiming at the target hard disk to the management controller. Accordingly, the management controller may receive a failure recovery instruction for the target hard disk sent by the RAID card. The target hard disk may be any of the hard disks of the computing devices described above.
Wherein the target hard disk failure may include a target hard disk response anomaly. In one implementation, the RAID card may send an interaction signal to the management controller through the information providing API, the interaction signal carrying the failure recovery instruction. Wherein the interactive signal may include an IIC signal or an MCTP signal.
S402, the management controller responds to the fault recovery instruction and sends a control instruction to the programmable logic device. Correspondingly, the programmable logic device receives a control instruction sent by the management controller.
The programmable logic device can be a CPLD, and the control instruction can be used for instructing the programmable logic device to execute fault recovery operation on the target hard disk. In one implementation, the management controller may send an IIC signal to the programmable logic device, the IIC signal including control instructions.
S403, the programmable logic device responds to the control instruction to execute fault recovery operation on the target hard disk.
In one implementation, the programmable logic device may be configured to control the power supply timing of the slot in which the target hard disk is located, and the control instruction may instruct that a power-down operation and then a power-up operation be performed on that slot. In response to the control instruction, the programmable logic device performs the power-down operation on the target hard disk and, after the power-down operation, performs the power-up operation on the target hard disk, thereby implementing the fault recovery operation on the target hard disk.
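The power-down-then-power-up ordering can be modelled with a short Python sketch. The `SlotPower` class, its method names and the `off_seconds` hold time are hypothetical; the text does not specify how long a slot stays unpowered, and a real programmable logic device would drive power-enable signals in hardware rather than run software like this.

```python
import time
from typing import Dict, List, Tuple


class SlotPower:
    """Toy model of per-slot power control on a backplane (hypothetical)."""

    def __init__(self, slot_count: int) -> None:
        # Every slot starts powered.
        self.powered: Dict[int, bool] = {s: True for s in range(slot_count)}
        self.events: List[Tuple[str, int]] = []

    def power_cycle(self, slot: int, off_seconds: float = 0.0) -> None:
        """Power the slot down, wait, then power it up again."""
        if slot not in self.powered:
            raise ValueError(f"unknown slot {slot}")
        self.powered[slot] = False          # power-down operation
        self.events.append(("down", slot))
        time.sleep(off_seconds)             # assumed hold time while off
        self.powered[slot] = True           # power-up operation
        self.events.append(("up", slot))
```

Only the addressed slot is cycled; the other slots remain powered throughout, which matches the per-slot control described above.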
In the embodiment of the application, when the RAID card detects the fault of the target hard disk, the RAID card sends a fault recovery instruction aiming at the target hard disk to the management controller, the management controller responds to the fault recovery instruction and sends a control instruction to the programmable logic device, and the programmable logic device responds to the control instruction and executes fault recovery operation on the target hard disk. Therefore, the embodiment of the application can automatically recover the fault of the target hard disk through interaction among the RAID card, the management controller and the programmable logic device when the target hard disk fails, and realizes the recovery of the hard disk without relying on manual processing, thereby reducing the recovery time of the fault to a certain extent and improving the recovery efficiency of the fault of the hard disk.
Fig. 5 is a schematic flow chart of a fault recovery method for a hard disk according to an embodiment of the present application. The fault recovery method of the hard disk can be cooperatively executed by each device in the fault recovery system for the hard disk, and the fault recovery method of the hard disk can comprise the following steps S501-S505:
S501, when the RAID card detects the fault of the target hard disk, the RAID card sends a fault recovery instruction for the target hard disk to the management controller. Accordingly, the management controller may receive the fault recovery instruction for the target hard disk sent by the RAID card.
S502, the management controller responds to the fault recovery instruction and sends a control instruction to the programmable logic device. Correspondingly, the programmable logic device receives a control instruction sent by the management controller.
S503, the programmable logic device responds to the control instruction to execute fault recovery operation on the target hard disk.
S504, the management controller sends a fault recovery notification for the target hard disk to the RAID card, wherein the fault recovery notification is used for notifying that the fault recovery operation is executed on the target hard disk.
In one implementation, the management controller may call a notification API corresponding to the RAID card to send a fault recovery signal to the RAID card, where the fault recovery signal carries the fault recovery notification. The fault recovery signal may include an IIC signal or an MCTP signal.
It should be understood that the fault recovery notification may be sent after S502 is performed. Alternatively, it may be sent after S503 is performed and the management controller has received a prompt message from the programmable logic device, where the prompt message indicates that the fault recovery operation has been performed on the target hard disk and triggers the management controller to send the fault recovery notification for the target hard disk to the RAID card.
S505, in response to the fault recovery notification, the RAID card interacts again with the target hard disk on which the fault recovery operation has been performed. If the interaction with the target hard disk is successful, the target hard disk is re-enabled to execute services. If the interaction with the target hard disk fails, the fault recovery operation is performed on the target hard disk again.
The interaction with the target hard disk after the fault recovery operation includes the following. ① Establishing a link connection with the target hard disk after the fault recovery operation. ② Acquiring, based on the link connection, hard disk related information of the target hard disk after the fault recovery operation, and performing abnormality diagnosis processing on the target hard disk based on the hard disk related information to obtain a diagnosis result. ③ Because the fault recovery operation has been performed on the target hard disk, marking information is added to the target hard disk under the RAID card; the marking information indicates that the target hard disk after the fault recovery operation is in a Foreign state, that is, that it is a foreign-configuration hard disk. If the diagnosis result indicates that the target hard disk after the fault recovery operation is in a normal state, the historical redundant array of independent disks (RAID) group to which the target hard disk belonged before the fault recovery operation is acquired, and the target hard disk is processed according to the RAID group level of the historical RAID group. If the RAID group level of the historical RAID group is the first level (a non-redundant RAID level), the marking information of the target hard disk after the fault recovery operation is cleared and the target hard disk whose marking information has been cleared is marked as normal; when the clearing of the marking information is completed, the interaction with the target hard disk after the fault recovery operation is determined to be successful.
If the RAID group level of the historical RAID group is the second level, where the second level is a redundant RAID level (for example, the historical RAID group is RAID1/2 or the like), the marking information of the target hard disk after the fault recovery operation is cleared and the target hard disk whose marking information has been cleared is marked as normal. A reconstruction operation is then performed on the target hard disk and the member disks of the historical RAID group other than the target hard disk to obtain a new RAID group, and the data of the new RAID group is synchronized to the target hard disk after the fault recovery operation. When this synchronization is completed, the interaction with the target hard disk after the fault recovery operation is determined to be successful. Illustratively, the member disks of the historical RAID group include hard disk 1, hard disk 2 and the failed target hard disk; after the fault recovery operation is performed on the failed target hard disk, a reconstruction operation may be performed on hard disk 1, hard disk 2 and the target hard disk after the fault recovery operation to obtain a new RAID group.
The marking information of the target hard disk after the fault recovery operation may be cleared in either of the following modes. Mode one: the user chooses to import the foreign configuration of the target hard disk. Importing the foreign configuration restores the target hard disk to the state it was in before it was unplugged and re-plugged, or before the power-down and power-up operation was performed, so that the marking information of the target hard disk is cleared directly and the target hard disk is restored directly as a member disk of the historical RAID group. Mode two: the RAID information on the target hard disk after the fault recovery operation is cleared, so that the target hard disk becomes an unconfigured hard disk that can be reused to create a new RAID group, which likewise completes the clearing of the marking information of the target hard disk.
It should be understood that if the link connection is not established with the target hard disk after the fault recovery operation is performed, or the diagnosis result indicates that the target hard disk after the fault recovery operation is performed is in an abnormal state, or the marking information of the target hard disk after the fault recovery operation is not cleared successfully, or the target hard disk after the fault recovery operation is not marked as a normal state, or the reconstruction operation between the target hard disk after the fault recovery operation and a member disk in the historical RAID group except for the target hard disk is failed, the interaction failure with the target hard disk after the fault recovery operation is determined.
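The success and failure conditions for the re-interaction can be summarised in a small Python sketch. The `StepOutcomes` fields and the function name are hypothetical labels for the steps described above; each field would in practice be the result of a firmware action rather than a flag set by the caller.

```python
from dataclasses import dataclass


@dataclass
class StepOutcomes:
    """Result of each re-interaction step (hypothetical field names)."""
    link_established: bool = True
    diagnosis_normal: bool = True
    mark_cleared: bool = True
    marked_normal: bool = True
    rebuild_and_sync_ok: bool = True  # only relevant for redundant levels


def interaction_succeeds(outcomes: StepOutcomes, redundant_level: bool) -> bool:
    """Return True if the re-interaction with the recovered disk succeeded.

    For a non-redundant historical RAID group (first level), success requires
    the link, the diagnosis, and the clearing and re-marking steps.
    For a redundant group (second level), the reconstruction and data
    synchronization must also complete.
    """
    common_ok = (outcomes.link_established
                 and outcomes.diagnosis_normal
                 and outcomes.mark_cleared
                 and outcomes.marked_normal)
    if not common_ok:
        return False
    if redundant_level:
        return outcomes.rebuild_and_sync_ok
    return True
```

A failed reconstruction therefore counts as an interaction failure only when the historical RAID group is of a redundant level; for a non-redundant group that step does not apply.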
In one implementation, the hard disk related information may include at least one of basic attribute information, log information and test information. Acquiring, based on the link connection, the hard disk related information of the target hard disk after the fault recovery operation, and performing abnormality diagnosis processing on the target hard disk based on the hard disk related information to obtain the diagnosis result, includes at least one of the following. (1) The hard disk related information includes basic attribute information. The basic attribute information of the target hard disk after the fault recovery operation is read based on the link connection, where the basic attribute information may include S.M.A.R.T. information and may be used to indicate the state of the target hard disk after the fault recovery operation. If the basic attribute information indicates that the target hard disk is in a healthy state, a diagnosis result that the target hard disk after the fault recovery operation is in a normal state is obtained, where the healthy state means that the operating conditions of the magnetic head and the platter do not exceed their respective safe value ranges. If the basic attribute information indicates that the target hard disk is in an unhealthy state, a diagnosis result that the target hard disk after the fault recovery operation is in an abnormal state is obtained.
(2) The hard disk related information includes log information. The log information for the target hard disk is acquired based on the link connection, where the log information includes logs. If the log information contains no error reporting information about the target hard disk after the fault recovery operation, a diagnosis result that the target hard disk is in a normal state is obtained; if the log information contains such error reporting information, a diagnosis result that the target hard disk is in an abnormal state is obtained. (3) The hard disk related information includes test information. The test information produced when the short DST test is executed on the target hard disk after the fault recovery operation is acquired based on the link connection. The short DST is a general command for diagnosing hard disks in the SAS protocol. If the test information includes error information of the target hard disk, a diagnosis result that the target hard disk after the fault recovery operation is in an abnormal state is obtained; if the test information does not include error information of the target hard disk, a diagnosis result that the target hard disk is in a normal state is obtained.
It should be understood that, in the embodiment of the present application, at least two of the above (1), (2) and (3) may be executed. Illustratively, (1), (2) and (3) may be executed in sequence. When the basic attribute information indicates that the target hard disk after the fault recovery operation is in a healthy state, the log information contains no error reporting information about the target hard disk, and the test information contains no error information of the target hard disk, a diagnosis result that the target hard disk after the fault recovery operation is in a normal state is obtained; otherwise, a diagnosis result that the target hard disk is in an abnormal state is obtained.
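The way the three checks combine into one diagnosis result can be sketched as follows. The function name and parameters are hypothetical; each argument stands for information the RAID card would already have read over the link connection, and a check that was not executed is passed as `None`.

```python
from typing import List, Optional


def diagnose(smart_healthy: Optional[bool] = None,
             log_errors: Optional[List[str]] = None,
             dst_errors: Optional[List[str]] = None) -> str:
    """Combine the executed checks into "normal" or "abnormal".

    smart_healthy: result of check (1), basic attribute information.
    log_errors:    error entries found by check (2), log information.
    dst_errors:    error entries reported by check (3), the short DST.
    None means the corresponding check was not executed.
    """
    verdicts: List[bool] = []
    if smart_healthy is not None:
        verdicts.append(smart_healthy)
    if log_errors is not None:
        verdicts.append(len(log_errors) == 0)
    if dst_errors is not None:
        verdicts.append(len(dst_errors) == 0)
    if not verdicts:
        raise ValueError("at least one check must be executed")
    # Normal only if every executed check passed; otherwise abnormal.
    return "normal" if all(verdicts) else "abnormal"
```

Any single failing check is enough to yield an abnormal result, which mirrors the "otherwise" branch in the paragraph above.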
In the embodiment of the application, when the RAID card detects the fault of the target hard disk, the RAID card sends a fault recovery instruction for the target hard disk to the management controller, the management controller responds to the fault recovery instruction and sends a control instruction to the programmable logic device, and the programmable logic device responds to the control instruction and executes the fault recovery operation on the target hard disk. Therefore, when the target hard disk fails, the embodiment of the application can automatically recover the target hard disk through interaction among the RAID card, the management controller and the programmable logic device, without relying on manual processing, thereby reducing the fault recovery time to a certain extent and improving the hard disk fault recovery efficiency. In addition, the management controller sends a fault recovery notification for the target hard disk to the RAID card, the fault recovery notification being used to notify that the fault recovery operation has been performed on the target hard disk. In response to the fault recovery notification, the RAID card interacts again with the target hard disk after the fault recovery operation and, if the interaction is successful, re-enables the target hard disk to execute services. Interacting with the target hard disk after the fault recovery operation thus helps to guarantee, to a certain extent, the success rate of the fault recovery and, in turn, the execution of services.
Fig. 6 is a schematic flow chart of a fault recovery method for a hard disk according to an embodiment of the present application. The method for recovering the fault of the hard disk may be cooperatively executed by each device in the fault recovery system for the hard disk, in this embodiment, the fault recovery operation includes a power-down and power-up operation, and the method for recovering the fault of the hard disk may include the following steps S601 to S614:
S601, the target hard disk responds abnormally. The response abnormality includes: (1) the RAID card sends a command to the target hard disk and the target hard disk does not respond; and (2) querying the capacity of the target hard disk fails.
S602, the RAID card recognizes that the response of the target hard disk is abnormal, determines that the target hard disk fails, and triggers a failure recovery process for the target hard disk, namely, step S603 is executed.
S603, determining the time interval between the current fault time of the target hard disk and the time when the target hard disk executes the power-down and power-up operation last time, and judging whether the time interval is larger than an interval threshold value.
In the embodiment of the application, the interval threshold value can be set when the same target hard disk is continuously triggered twice to execute the power-down and power-up operation of the target hard disk. When the RAID card detects a failure of the target hard disk, a time interval between the current failure time of the target hard disk and the time when the target hard disk performs the last power-on operation may be determined, if the time interval is greater than the interval threshold, a failure recovery instruction for the target hard disk is sent to the management controller, that is, S604 is executed, and if the time interval is less than or equal to the interval threshold, failure information of the target hard disk needs to be sent to the management controller, that is, S613 is executed. The interval threshold may be set according to requirements, for example, the interval threshold may be 1 day, 1 week, or the like, which is not limited in the embodiment of the present application. Illustratively, the current failure time of the target hard disk is 15 days in X years Y months, the time when the target hard disk last executes the power-down and power-up operation is 01 days in X years Y months, the time interval between the current failure time of the target hard disk and the time when the target hard disk last executes the power-down and power-up operation is 14 days, and if the interval threshold is 6 days, the time interval is greater than the interval threshold, then step S604 may be executed.
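The interval check in the example above can be reproduced with Python's standard `datetime` module. The function name is hypothetical, and the concrete year and month (2024, January) are placeholders standing in for "year X, month Y"; only the day numbers and the 6-day threshold come from the text.

```python
from datetime import date, timedelta


def interval_exceeds(current_fault: date,
                     last_recovery: date,
                     threshold: timedelta) -> bool:
    """True if the time since the last power cycle is greater than the threshold."""
    return (current_fault - last_recovery) > threshold


# Placeholder year and month; days 01 and 15 are from the worked example.
last_recovery = date(2024, 1, 1)
current_fault = date(2024, 1, 15)

interval = current_fault - last_recovery           # 14 days
proceed_to_s604 = interval_exceeds(current_fault, last_recovery,
                                   timedelta(days=6))
```

Here `interval.days` is 14 and `proceed_to_s604` is `True`. An interval exactly equal to the threshold yields `False`, corresponding to the branch that goes to S613.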
S604, if the time interval is greater than the interval threshold, the RAID card acquires the number of times the power-down and power-up operation has been executed on the currently faulty target hard disk.
S605, the RAID card judges whether the number of times is greater than a count threshold. The count threshold may be set according to requirements, for example 5 times or 10 times, which is not limited in the embodiment of the present application.
S606, if the number of times is less than or equal to the count threshold, the RAID card sends a fault recovery instruction to the management controller, where the fault recovery instruction indicates that the power-down and power-up operation needs to be executed on the target hard disk.
S607, the management controller responds to the fault recovery instruction and sends a control instruction to the programmable logic device, where the control instruction instructs the programmable logic device to execute the power-down and power-up operation on the faulty target hard disk. Correspondingly, the programmable logic device receives the control instruction sent by the management controller.
S608, the management controller sends a failure recovery notification for the target hard disk to the RAID card, where the failure recovery notification is used to notify that a power-down and power-up operation has been performed on the target hard disk.
S609, the RAID card responds to the fault recovery notification and interacts again with the target hard disk after the power-down and power-up operation is executed.
For the specific implementation of S609, refer to the implementation of S505; details are not repeated here.
S610, the RAID card judges whether the interaction with the target hard disk after the power-down and power-up operation is executed is successful.
S611, if the interaction is successful, the RAID card re-enables the target hard disk, after the power-down and power-up operation, for service processing. In one implementation, re-enabling the target hard disk after the power-down and power-up operation for service processing includes re-enabling that target hard disk for IO services.
S612, if the interaction fails, the power-down and power-up operation is attempted on the target hard disk again, that is, S604 to S614 are executed again.
S613, if the time interval is less than or equal to the interval threshold, or the number of times is greater than the count threshold, the RAID card performs disk-kicking processing on the faulty target hard disk and sends fault information of the target hard disk to the management controller.
S614, the management controller generates alarm information based on the fault information, wherein the alarm information is used for prompting operation and maintenance personnel to replace the target hard disk.
In the embodiment of the application, an interval threshold and a count threshold are introduced into the fault recovery process of the target hard disk. The RAID card first judges whether the time interval between the current fault time of the target hard disk and the time when the power-down and power-up operation was last executed on the target hard disk is greater than the interval threshold. If the time interval is greater than the interval threshold, the RAID card judges whether the number of executed power-down and power-up operations is greater than the count threshold. If the number of times is less than or equal to the count threshold, the RAID card sends a fault recovery instruction to the management controller, where the fault recovery instruction indicates that the power-down and power-up operation needs to be executed on the target hard disk. The management controller responds to the fault recovery instruction by sending a control instruction to the programmable logic device, where the control instruction instructs the programmable logic device to execute the power-down and power-up operation on the faulty target hard disk. The management controller then sends the RAID card a fault recovery notification for the target hard disk, which notifies that the power-down and power-up operation has been executed on the target hard disk. In response to the fault recovery notification, the RAID card interacts again with the target hard disk after the power-down and power-up operation; if the interaction succeeds, the RAID card re-enables the target hard disk for service processing. Therefore, when the target hard disk fails, it can be recovered automatically, without relying on manual processing, so that the fault recovery time can be reduced to a certain extent and the hard disk fault recovery efficiency is improved.
If the time interval is less than or equal to the interval threshold, or the number of times is greater than the count threshold, the RAID card performs disk-kicking processing on the faulty target hard disk and sends fault information of the target hard disk to the management controller; the management controller generates alarm information based on the fault information, where the alarm information prompts operation and maintenance personnel to replace the target hard disk. Introducing the interval threshold and the count threshold avoids the situation in which a hard disk that keeps failing stays in the automatic fault recovery process indefinitely.
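The threshold checks of S603 to S606 and S613 can be sketched as a small decision function. This is only an illustrative sketch, not the patented implementation: the names `Action` and `decide_action`, the calendar dates, and the treatment of a disk that has never been power-cycled are assumptions that do not appear in the text.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    POWER_CYCLE = "send fault recovery instruction"  # corresponds to S606
    KICK_AND_ALARM = "kick disk and report fault"     # corresponds to S613


def decide_action(
    fault_time: datetime,
    last_power_cycle: Optional[datetime],
    power_cycle_count: int,
    interval_threshold: timedelta,
    count_threshold: int,
) -> Action:
    """Gate an automatic power cycle on the interval and count thresholds."""
    # S603: a disk that fails again too soon after its last power cycle is
    # not retried. A disk with no earlier power cycle passes this check
    # (an assumption; the text does not cover this case).
    if last_power_cycle is not None:
        if fault_time - last_power_cycle <= interval_threshold:
            return Action.KICK_AND_ALARM
    # S604 and S605: stop retrying once the count exceeds the threshold.
    if power_cycle_count > count_threshold:
        return Action.KICK_AND_ALARM
    return Action.POWER_CYCLE


# Worked example from the text: fault on day 15, last power cycle on day 01,
# interval threshold of 6 days. The year and month are placeholders.
action = decide_action(
    fault_time=datetime(2024, 1, 15),
    last_power_cycle=datetime(2024, 1, 1),
    power_cycle_count=2,
    interval_threshold=timedelta(days=6),
    count_threshold=5,
)
print(action.name)  # POWER_CYCLE
```

With an interval of 14 days against a 6-day threshold and 2 earlier power cycles against a threshold of 5, both checks pass and the sketch selects the power-cycle branch.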
Fig. 7 is a schematic flow chart of a fault recovery method for a hard disk according to an embodiment of the present application. The hard disk failure recovery method may be performed by a RAID card, and may include the following steps S701 to S702:
S701, when a fault of the target hard disk is detected, sending a fault recovery instruction for the target hard disk to the management controller, where the fault recovery instruction indicates that a fault recovery operation needs to be executed on the target hard disk, the fault recovery instruction is used for triggering the management controller to send a control instruction to the programmable logic device, and the control instruction instructs the programmable logic device to execute the fault recovery operation on the target hard disk.
The programmable logic device is used for controlling the power supply time sequence of the slot position of the target hard disk, and the fault recovery operation comprises the steps of executing the power-down operation on the target hard disk and executing the power-up operation on the target hard disk after executing the power-down operation.
In one implementation, a time interval between a current failure time of the target hard disk and a time when the target hard disk last performed a failure recovery operation is determined, and if the time interval is greater than an interval threshold, a failure recovery instruction for the target hard disk is sent to the management controller. If the time interval is smaller than or equal to the interval threshold, the RAID card performs kicking processing on the fault target hard disk, and sends fault information of the target hard disk to the management controller, and the management controller generates alarm information based on the fault information.
In another implementation, the number of times the fault recovery operation has been executed on the target hard disk is counted, and if the number of times is less than a count threshold, a fault recovery instruction for the target hard disk is sent to the management controller. If the number of times is greater than or equal to the count threshold, the RAID card performs disk-kicking processing on the faulty target hard disk and sends fault information of the target hard disk to the management controller, and the management controller generates alarm information based on the fault information.
In another implementation mode, a time interval between the current fault time of the target hard disk and the time when the fault recovery operation is executed last time by the target hard disk is determined, if the time interval is larger than an interval threshold value, the number of times of executing the fault recovery operation on the target hard disk is counted, and if the number of times is smaller than the number threshold value, a fault recovery instruction aiming at the target hard disk is sent to the management controller. If the time interval is smaller than or equal to the interval threshold value and/or the frequency is larger than or equal to the frequency threshold value, the RAID card performs kicking processing on the fault target hard disk, and sends fault information of the target hard disk to the management controller, and the management controller generates alarm information based on the fault information.
S702, receiving a fault recovery notification sent by the management controller for the target hard disk, wherein the fault recovery notification is used for notifying that the fault recovery operation is executed on the target hard disk.
After step S702, the RAID card interacts again with the target hard disk after the fault recovery operation is executed. If this interaction succeeds, the RAID card re-enables the target hard disk to execute services. If this interaction fails, the fault recovery operation needs to be attempted on the target hard disk again, that is, a fault recovery instruction for the target hard disk is sent to the management controller again. The re-interaction between the RAID card and the target hard disk after the fault recovery operation includes: establishing a link connection with the target hard disk after the fault recovery operation is executed; acquiring, based on the link connection, hard disk related information of that target hard disk; performing abnormality diagnosis processing on that target hard disk according to the hard disk related information to obtain a diagnosis result; if the diagnosis result indicates that the target hard disk is in a normal state, acquiring the historical RAID group to which the target hard disk belonged before the fault recovery operation was executed; and processing the target hard disk, by the RAID card, according to the RAID group level of the historical RAID group.
If the RAID group level of the historical RAID group belongs to the second level, the RAID card marks the target hard disk after the fault recovery operation as being in a normal state, performs a reconstruction operation on that target hard disk and the member disks of the historical RAID group other than the target hard disk to obtain a new RAID group, and synchronizes the data of the new RAID group to the target hard disk. When the data synchronization is complete, the interaction with the target hard disk after the fault recovery operation is determined to be successful.
In the embodiment of the application, when the fault of the target hard disk is detected, a fault recovery instruction aiming at the target hard disk is sent to the management controller, the fault recovery instruction indicates that the fault recovery operation needs to be executed on the target hard disk, the fault recovery instruction is used for triggering the management controller to send a control instruction to the programmable logic device, the control instruction is used for indicating the programmable logic device to execute the fault recovery operation on the target hard disk, and the fault recovery notification aiming at the target hard disk and sent by the management controller is received, wherein the fault recovery notification is used for notifying that the fault recovery operation has been executed on the target hard disk. Therefore, the embodiment of the application can automatically recover the hard disk fault through interaction among the RAID card, the management controller and the programmable logic device when the hard disk is in fault, and realizes the recovery of the hard disk fault without relying on manual processing, thereby reducing the fault recovery time to a certain extent and improving the recovery efficiency of the hard disk fault.
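The re-interaction performed by the RAID card after step S702 can be sketched as follows. This is a simplified model under stated assumptions: `LevelClass`, `Disk`, `RaidGroup`, and `reinteract` are hypothetical names, the `healthy` flag stands in for the abnormality diagnosis result, and the log entries stand in for the real link setup, reconstruction, and data synchronization. The first-level branch follows the wording of claim 1.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class LevelClass(Enum):
    FIRST = 1   # branch that only clears the disk's marking information
    SECOND = 2  # branch that rebuilds the group and synchronizes data


@dataclass
class Disk:
    name: str
    healthy: bool  # stand-in for the abnormality diagnosis result
    marks: List[str] = field(default_factory=list)
    state: str = "failed"


@dataclass
class RaidGroup:
    level_class: LevelClass
    members: List[Disk]


def reinteract(disk: Disk, history: RaidGroup, log: List[str]) -> bool:
    """Return True when re-interaction with the power-cycled disk succeeds."""
    log.append("link established")
    if not disk.healthy:  # the diagnosis reports an abnormal state
        log.append("diagnosis: abnormal")
        return False
    if history.level_class is LevelClass.FIRST:
        disk.marks.clear()
        disk.state = "normal"
        log.append("marks cleared")
        return True
    # Second level: mark normal, rebuild with the other members, then sync.
    disk.state = "normal"
    others = [m for m in history.members if m is not disk]
    new_group = RaidGroup(history.level_class, others + [disk])
    log.append(f"rebuilt group with {len(new_group.members)} members")
    log.append("data synchronized")
    return True


target = Disk("slot3", healthy=True, marks=["failed"])
group = RaidGroup(LevelClass.SECOND, [Disk("slot1", True), Disk("slot2", True), target])
events: List[str] = []
ok = reinteract(target, group, events)
```

A `False` return corresponds to the failed-interaction case, in which the RAID card would send the fault recovery instruction to the management controller again.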
Fig. 8 is a schematic flow chart of a method for recovering a hard disk failure according to an embodiment of the present application. The fault recovery method of the hard disk may be performed by a programmable logic device, and may include the steps of S801 to S802:
S801, receiving a control instruction sent by the management controller, wherein the control instruction is sent when the management controller receives a fault recovery instruction for a target hard disk sent by the RAID card.
S802, responding to the control instruction, and executing fault recovery operation on the target hard disk.
In one implementation, the fault recovery operation includes a power-down and power-up operation, and a specific implementation of S802 may be to perform the power-down operation on the target hard disk in response to the control instruction, and perform the power-up operation on the target hard disk after performing the power-down operation on the target hard disk.
In the embodiment of the application, the control instruction sent by the management controller is received, the control instruction is sent when the management controller receives the fault recovery instruction for the target hard disk sent by the RAID card, the fault recovery operation is executed on the target hard disk in response to the control instruction, and the fault recovery of the target hard disk is realized through the control instruction without relying on manual processing, so that the fault recovery time can be reduced to a certain extent, and the fault recovery efficiency of the target hard disk is improved.
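The power-down and power-up sequence of S802 can be illustrated with a toy model of the programmable logic device. The class `SlotPowerController`, its method names, and the `off_seconds` hold-off parameter are assumptions made for illustration; the text does not specify how the device drives the slot power rails or how long the slot stays powered down.

```python
import time
from typing import Dict, List


class SlotPowerController:
    """Toy stand-in for the programmable logic device on the backplane."""

    def __init__(self, slots: List[int]) -> None:
        self.powered: Dict[int, bool] = {slot: True for slot in slots}
        self.events: List[str] = []

    def power_down(self, slot: int) -> None:
        self.powered[slot] = False
        self.events.append(f"slot {slot}: power down")

    def power_up(self, slot: int) -> None:
        self.powered[slot] = True
        self.events.append(f"slot {slot}: power up")

    def handle_control_instruction(self, slot: int, off_seconds: float = 0.0) -> None:
        """S802: power the slot down, wait, then power it up again."""
        if slot not in self.powered:
            raise KeyError(f"unknown slot {slot}")
        self.power_down(slot)
        time.sleep(off_seconds)  # hold-off time is an assumed parameter
        self.power_up(slot)


pld = SlotPowerController([0, 1])
pld.handle_control_instruction(1)
print(pld.events)  # ['slot 1: power down', 'slot 1: power up']
```

Only the slot of the target hard disk is cycled; the other slots stay powered throughout.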
Fig. 9 is a schematic flow chart of a method for recovering a hard disk failure according to an embodiment of the present application. The method for recovering the fault of the hard disk can be executed by a management controller and can comprise the following steps:
S901, receiving a fault recovery instruction aiming at a target hard disk, which is sent by a RAID card, wherein the fault recovery instruction indicates that a fault recovery operation needs to be executed on the target hard disk.
S902, responding to the fault recovery instruction and sending a control instruction to the programmable logic device. The control instruction is used for instructing the programmable logic device to execute fault recovery operation on the target hard disk.
Optionally, S903, sending a fault recovery notification for the target hard disk to the RAID card, so that the RAID card, in response to the fault recovery notification, interacts again with the target hard disk after the fault recovery operation is executed, where the fault recovery notification is used to notify that the fault recovery operation has been executed on the target hard disk.
In a specific implementation, S903 may be performed after S902 is performed. Of course, S903 may also be performed after receiving the hint information sent by the programmable logic device, where the hint information is used to indicate that the failure recovery operation has been performed on the target hard disk.
In the embodiment of the application, the management controller receives a fault recovery instruction for a target hard disk sent by the RAID card, where the fault recovery instruction indicates that a fault recovery operation needs to be executed on the target hard disk. In response to the fault recovery instruction, the management controller sends a control instruction to the programmable logic device, where the control instruction instructs the programmable logic device to execute the fault recovery operation on the target hard disk. The management controller also sends the RAID card a fault recovery notification for the target hard disk, which notifies that the fault recovery operation has been executed on the target hard disk. The failed target hard disk is thus recovered automatically through the control instruction, without relying on manual processing, so that the fault recovery time can be reduced to a certain extent and the hard disk fault recovery efficiency is improved.
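The management controller flow of S901 to S903, including the two options for when S903 is executed, can be sketched as below. The class `ManagementController`, the callback parameters, and the `wait_for_prompt` switch are hypothetical names introduced for illustration; the real message formats and transports between the three components are not specified in the text.

```python
from typing import Callable, List


class ManagementController:
    """Toy model of the management controller steps S901 to S903."""

    def __init__(
        self,
        send_control_instruction: Callable[[int], bool],
        notify_raid_card: Callable[[int], None],
        wait_for_prompt: bool = False,
    ) -> None:
        # send_control_instruction returns True when the programmable logic
        # device reports (prompt information) that the operation was executed.
        self.send_control_instruction = send_control_instruction
        self.notify_raid_card = notify_raid_card
        self.wait_for_prompt = wait_for_prompt

    def on_fault_recovery_instruction(self, slot: int) -> bool:
        """S901: handle a fault recovery instruction from the RAID card."""
        prompted = self.send_control_instruction(slot)  # S902
        if self.wait_for_prompt and not prompted:
            return False  # no prompt information, so no notification yet
        self.notify_raid_card(slot)  # S903
        return True


notified: List[int] = []
controller = ManagementController(
    send_control_instruction=lambda slot: True,  # pretend the power cycle ran
    notify_raid_card=notified.append,
    wait_for_prompt=True,
)
controller.on_fault_recovery_instruction(3)
print(notified)  # [3]
```

Setting `wait_for_prompt=False` models the variant in which S903 is executed directly after S902, without waiting for prompt information from the programmable logic device.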
Those skilled in the art will understand that, for convenience and brevity, the specific working process of the system, apparatus and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The same or similar parts of the embodiments of the present application may be referred to each other. In the embodiments of the present application and their respective implementations, unless otherwise specified or logically conflicting, terms and descriptions are consistent across different embodiments and implementations and may be cited mutually, and technical features of different embodiments and implementations may be combined according to their inherent logical relationship to form a new embodiment or implementation. The above embodiments of the present application do not limit the protection scope of the embodiments of the present application.
The foregoing is merely a specific implementation of the embodiment of the present application, but the protection scope of the embodiment of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope of the embodiment of the present application shall be covered by the protection scope of the embodiment of the present application.
Claims (7)
1. A fault recovery method for a hard disk, characterized in that the method is applied to a computing device, the computing device comprises a management controller, a redundant array of independent disks (RAID) card, a backplane, and at least one hard disk, the backplane comprises a programmable logic device, and the RAID card is connected to each hard disk through the backplane, the method comprising:
when the RAID card detects a fault of a target hard disk, the RAID card sends a fault recovery instruction for the target hard disk to the management controller, wherein the fault recovery instruction indicates that a fault recovery operation needs to be executed on the target hard disk;
the management controller responds to the fault recovery instruction and sends a control instruction to the programmable logic device;
the programmable logic device responds to the control instruction and executes the fault recovery operation on the target hard disk;
the RAID card interacts again with the target hard disk after the fault recovery operation is executed;
if the RAID card successfully interacts with the target hard disk after the fault recovery operation is executed, the RAID card re-enables the target hard disk after the fault recovery operation to execute services;
wherein the RAID card interacting again with the target hard disk after the fault recovery operation is executed comprises:
the RAID card establishes a link connection with the target hard disk after the fault recovery operation is executed;
the RAID card acquires, based on the link connection, hard disk related information of the target hard disk after the fault recovery operation is executed, and performs abnormality diagnosis processing on the target hard disk after the fault recovery operation based on the hard disk related information to obtain a diagnosis result;
if the diagnosis result indicates that the target hard disk after the fault recovery operation is executed is in a normal state, the RAID card acquires the historical RAID group of the target hard disk before the fault recovery operation was executed;
if the RAID group level of the historical RAID group belongs to a first level, the RAID card clears the marking information of the target hard disk after the fault recovery operation is executed, marks the target hard disk whose marking information has been cleared as being in a normal state, and determines that the interaction with the target hard disk after the fault recovery operation is executed is successful;
if the RAID group level of the historical RAID group belongs to a second level, the RAID card marks the target hard disk after the fault recovery operation is executed as being in a normal state, and performs a reconstruction operation on the target hard disk after the fault recovery operation and the member disks of the historical RAID group other than the target hard disk to obtain a new RAID group; and the RAID card synchronizes the data of the new RAID group to the target hard disk after the fault recovery operation is executed, and determines that the interaction with the target hard disk after the fault recovery operation is executed is successful.
2. The method of claim 1, wherein the programmable logic device is configured to control a power supply timing sequence of the slot where the target hard disk is located, and the programmable logic device executing the fault recovery operation on the target hard disk in response to the control instruction comprises:
the programmable logic device, in response to the control instruction, executes a power-down operation on the target hard disk and, after executing the power-down operation, executes a power-up operation on the target hard disk.
3. The method of claim 1, wherein the RAID card sending a fault recovery instruction for the target hard disk to the management controller comprises:
the RAID card determines the time interval between the current fault time of the target hard disk and the time when the fault recovery operation was last executed on the target hard disk;
if the time interval is greater than an interval threshold, the RAID card sends the fault recovery instruction for the target hard disk to the management controller.
4. The method of claim 3, wherein the RAID card sending the fault recovery instruction for the target hard disk to the management controller if the time interval is greater than the interval threshold comprises:
if the time interval is greater than the interval threshold, the RAID card counts the number of times the fault recovery operation has been executed on the target hard disk;
if the number of times is less than a count threshold, the RAID card sends the fault recovery instruction for the target hard disk to the management controller.
5. The method of claim 1, wherein the method further comprises:
the RAID card receives a fault recovery notification for the target hard disk sent by the management controller, wherein the fault recovery notification is used for notifying that the fault recovery operation has been executed on the target hard disk.
6. A fault recovery method for a hard disk, applied to a management controller, comprising:
receiving a fault recovery instruction for a target hard disk sent by a redundant array of independent disks (RAID) card, wherein the fault recovery instruction is sent when the RAID card detects a fault of the target hard disk, and the fault recovery instruction indicates that a fault recovery operation needs to be executed on the target hard disk;
in response to the fault recovery instruction, sending a control instruction to a programmable logic device, wherein the control instruction instructs the programmable logic device to execute the fault recovery operation on the target hard disk, so that the RAID card interacts again with the target hard disk after the fault recovery operation is executed and, after that interaction succeeds, re-enables the target hard disk after the fault recovery operation to execute services;
wherein the RAID card interacting again with the target hard disk after the fault recovery operation is executed comprises:
the RAID card establishes a link connection with the target hard disk after the fault recovery operation is executed;
the RAID card acquires, based on the link connection, hard disk related information of the target hard disk after the fault recovery operation is executed, and performs abnormality diagnosis processing on the target hard disk after the fault recovery operation based on the hard disk related information to obtain a diagnosis result;
if the diagnosis result indicates that the target hard disk after the fault recovery operation is executed is in a normal state, the RAID card acquires the historical RAID group of the target hard disk before the fault recovery operation was executed;
if the RAID group level of the historical RAID group belongs to a first level, the RAID card clears the marking information of the target hard disk after the fault recovery operation is executed, marks the target hard disk whose marking information has been cleared as being in a normal state, and determines that the interaction with the target hard disk after the fault recovery operation is executed is successful;
if the RAID group level of the historical RAID group belongs to a second level, the RAID card marks the target hard disk after the fault recovery operation is executed as being in a normal state, and performs a reconstruction operation on the target hard disk after the fault recovery operation and the member disks of the historical RAID group other than the target hard disk to obtain a new RAID group; and the RAID card synchronizes the data of the new RAID group to the target hard disk after the fault recovery operation is executed, and determines that the interaction with the target hard disk after the fault recovery operation is executed is successful.
7. A computing device configured to perform the method of recovering from a failure of a hard disk as recited in any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410223671.8A CN118277154B (en) | 2024-02-28 | 2024-02-28 | Hard disk fault recovery method and computing device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118277154A (en) | 2024-07-02 |
CN118277154B (en) | 2025-02-21 |
Family
ID=91646133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410223671.8A Active CN118277154B (en) | 2024-02-28 | 2024-02-28 | Hard disk fault recovery method and computing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118277154B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109284207A (en) * | 2018-08-30 | 2019-01-29 | 紫光华山信息技术有限公司 | Hard disc failure processing method, device, server and computer-readable medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7313721B2 (en) * | 2004-06-21 | 2007-12-25 | Dot Hill Systems Corporation | Apparatus and method for performing a preemptive reconstruct of a fault-tolerant RAID array |
CN101625586A (en) * | 2008-07-09 | 2010-01-13 | 联想(北京)有限公司 | Method, equipment and computer for managing energy conservation of storage device |
US10719399B2 (en) * | 2018-01-08 | 2020-07-21 | International Business Machines Corporation | System combining efficient reliable storage and deduplication |
CN109359016A (en) * | 2018-09-27 | 2019-02-19 | 郑州云海信息技术有限公司 | A kind of hard disk alarm method and device |
WO2021082011A1 (en) * | 2019-11-01 | 2021-05-06 | 华为技术有限公司 | Data reconstruction method and apparatus applied to disk array system, and computing device |
CN115061641B (en) * | 2022-08-16 | 2022-11-25 | 新华三信息技术有限公司 | Disk fault processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||