CN119621394A - Fault recovery method, device, equipment and storage medium for hard disk in server - Google Patents

Info

Publication number
CN119621394A
Authority
CN
China
Prior art keywords
hard disk
fault
disk
failure
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411692331.6A
Other languages
Chinese (zh)
Inventor
陈志文
罗青松
黄洪
胡远明
秦晓宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningchang Information Technology Hangzhou Co ltd
Nettrix Information Industry Beijing Co Ltd
Original Assignee
Ningchang Information Technology Hangzhou Co ltd
Nettrix Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningchang Information Technology Hangzhou Co ltd and Nettrix Information Industry Beijing Co Ltd
Priority to CN202411692331.6A
Publication of CN119621394A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793: Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a fault recovery method, device, equipment and storage medium for a hard disk in a server, which are used to improve the accuracy of hard disk fault recovery. The method comprises: obtaining state information of each hard disk in a server, wherein the state information represents the current state of the hard disk; determining the hard disk fault type of a first hard disk according to the state information of the first hard disk, wherein the first hard disk is any one of the hard disks in the server; determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by using a preset correspondence between hard disk fault types and fault recovery strategies; and recovering the fault of the first hard disk by using the target fault recovery strategy.

Description

Method, device, equipment and storage medium for recovering hard disk failure in server
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recovering a hard disk failure in a server.
Background
With the rapid development of the internet, the demand for servers in various industries is increasing, and the need for server stability is becoming more urgent. Industries such as public services and finance in particular place extremely high requirements on the stable operation of servers.
The hard disk, as the storage medium, is one of the three major components of a server. If the hard disk fails, key data may be lost and the stable operation of the server may be affected, which reduces the stability of the server. Therefore, when a hard disk fails, the hard disk failure in the server needs to be recovered.
In the prior art, a RAID card supports configuring a hot spare disk, and the hot spare disk is used to rebuild the failed disk array. However, this method cannot cover all fault scenarios, and this single processing method cannot take the state of the whole machine into account to adaptively adjust the recovery strategy. For example, if the hard disk is still in a failure state after a firmware upgrade, there is no corresponding processing scheme. The accuracy of hard disk failure recovery is therefore low.
Disclosure of Invention
The invention provides a fault recovery method of a hard disk in a server, which is used for analyzing the fault type of the hard disk through comprehensive diagnosis, triggering corresponding fault recovery strategies according to different fault types, covering various fault scenes and improving the accuracy rate of fault recovery of the hard disk.
In a first aspect, the present application provides a method for recovering a hard disk failure in a server, where the method includes:
Acquiring state information of each hard disk in a server, wherein the state information is used for representing the current state of the hard disk;
Determining a hard disk fault type of a first hard disk according to state information of the first hard disk, wherein the first hard disk is any one of all the hard disks in the server;
determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by utilizing a preset corresponding relation between the hard disk fault type and the fault recovery strategy;
and recovering the fault of the first hard disk by utilizing the target fault recovery strategy.
According to the method, the hard disk fault type of the hard disk is determined according to the state information of the hard disk, then the target fault recovery strategy corresponding to the hard disk fault type of the hard disk is determined by utilizing the preset corresponding relation between the hard disk fault type and the fault recovery strategy, and the fault of the hard disk is recovered by utilizing the target fault recovery strategy.
In one possible implementation, the hard disk failure types include a hard disk failure and an array card failure, where the hard disk failure indicates that a hard disk has failed, and the array card failure indicates that an array card in the server has failed.
According to the method, the hard disk fault types comprise a plurality of fault types, so that faults are not all handled by a single processing mode, and the accuracy of hard disk fault recovery is improved.
In one possible implementation manner, the determining the fault type of the first hard disk according to the state information of the first hard disk includes:
if the state information of the first hard disk is a fault state, determining that the hard disk fault type of the first hard disk is a hard disk fault;
And if the state information of the first hard disk is empty, determining that the hard disk fault type of the first hard disk is an array card fault.
According to the method, the fault type of the hard disk is determined through the state information of the hard disk, and the accuracy of the determined fault type is ensured.
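The classification rule above can be illustrated with a minimal Python sketch. The string `"fault"` and the use of `None` or an empty string for "state information is empty" are assumptions made for illustration; the application does not specify how the state information is encoded.

```python
from enum import Enum
from typing import Optional


class FaultType(Enum):
    DISK_FAULT = "hard disk fault"
    ARRAY_CARD_FAULT = "array card fault"


def classify_fault(status: Optional[str]) -> Optional[FaultType]:
    """Map one disk's state information to a hard disk fault type.

    Assumed encoding: None or "" means the state information is empty,
    and the string "fault" means the disk reports a fault state.
    """
    if status is None or status == "":
        return FaultType.ARRAY_CARD_FAULT
    if status == "fault":
        return FaultType.DISK_FAULT
    return None  # any other state: no fault type is assigned in this sketch
```

Returning `None` for other states is a choice of this sketch; the application only defines the two branches shown.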
In one possible implementation manner, if the hard disk failure type is the hard disk failure;
recovering the failure of the first hard disk by using the target failure recovery strategy, including:
Determining a target logical disk corresponding to the first hard disk by using a preset corresponding relation between the hard disk and the logical disk, wherein the logical disk is a hard disk array formed by a plurality of hard disks, and the target logical disk is the logical disk where the first hard disk is located;
Determining whether a hot standby disk exists in the target logical disk or not by utilizing function identifiers of all hard disks in the target logical disk, wherein the function identifiers are used for representing functions of the hard disks in the target logical disk;
if yes, replacing the first hard disk by using the hot spare disk, and recovering the data in the first hard disk into the hot spare disk;
if not, setting the idle hard disk in the server as the hot spare disk, replacing the first hard disk by using the hot spare disk, and recovering the data in the first hard disk into the hot spare disk.
According to the method, through the hot standby disk in the target logical disk or the idle hard disk is set as the hot standby disk, the hot standby disk is started, and the data in the first hard disk is restored to the hot standby disk, so that the hot standby disk can replace the first hard disk to continue working, and the hard disk fault can be timely solved when the hard disk fault type is the hard disk fault.
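As a rough sketch of the hot spare selection described above, the following Python fragment looks for a hot spare among the function identifiers of the target logical disk and otherwise promotes an idle disk. The `Disk` class, the identifier values `"hot_spare"` and `"data"`, and the function name are hypothetical; the data rebuild itself is left to the array card and is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional

HOT_SPARE = "hot_spare"  # hypothetical function identifier values
DATA = "data"


@dataclass
class Disk:
    slot: int
    function_id: Optional[str] = None  # None: idle, not in any logical disk


def pick_hot_spare(logical_disk: List[Disk], idle_disks: List[Disk]) -> Optional[Disk]:
    """Return the disk that should take over from the failed disk, if any."""
    for disk in logical_disk:
        if disk.function_id == HOT_SPARE:
            return disk  # a hot spare already exists in the target logical disk
    if idle_disks:
        spare = idle_disks.pop(0)
        spare.function_id = HOT_SPARE  # set the idle disk as the hot spare
        logical_disk.append(spare)
        return spare
    return None  # no hot spare and no idle disk available
```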
In one possible embodiment, the method further comprises:
If the idle hard disk does not exist in the server, notifying a user to replace the first hard disk;
When the first hard disk is determined to be replaced, judging whether the configuration of the replaced first hard disk meets the specified condition or not;
if yes, determining the replaced first hard disk as the hot spare disk, replacing the first hard disk by the hot spare disk, and recovering backup data corresponding to the first hard disk into the hot spare disk;
If not, returning to the step of notifying the user to replace the first hard disk until the backup data corresponding to the first hard disk is restored to the hot standby disk.
The method ensures that a corresponding fault recovery mode is available to recover the hard disk fault even when no idle hard disk exists in the server, which further improves the accuracy of fault recovery.
In one possible implementation manner, if the hard disk failure type is the array card failure;
the recovering the failure of the first hard disk by using the target failure recovery strategy includes:
Based on the fault log reported by the server, determining whether a high-speed serial computer expansion bus (PCIE) uncorrectable error UCE alarm exists in the fault log;
if yes, processing the fault of the first hard disk based on the current temperature of the array card;
if not, acquiring the current link state of the serial communication bus I2C link where the array card is located, and performing fault recovery on the first hard disk based on the current link state of the I2C link.
According to the method, fault recovery is carried out through the current temperature of the array card or the I2C link where the array card is located, so that the fault of the hard disk can be recovered when the hard disk fault type is the fault of the array card, and the accuracy rate of the fault recovery of the hard disk is further improved.
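The branch selection for an array card fault might look like the sketch below. The log format is an assumption: a line is treated as a PCIE UCE alarm if it mentions both "PCIE" and "UCE", and the returned branch names are hypothetical labels.

```python
from typing import Iterable


def choose_array_card_branch(fault_log: Iterable[str]) -> str:
    """Pick the recovery branch for an array card fault.

    Assumed log format: a PCIE UCE alarm line contains both "PCIE" and
    "UCE" (case-insensitive). Real server fault logs may differ.
    """
    has_uce = any(
        "PCIE" in line.upper() and "UCE" in line.upper() for line in fault_log
    )
    return "check_temperature" if has_uce else "check_i2c_link"
```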
In one possible implementation manner, the processing the fault of the first hard disk based on the current temperature of the array card includes:
if the current temperature is greater than the specified temperature, determining a target fan rotating speed corresponding to the current temperature of the array card by utilizing a preset corresponding relation between the temperature and the fan rotating speed, and adjusting the rotating speed of the fan corresponding to the array card to the target fan rotating speed until the current temperature is not greater than the specified temperature, and then returning to the step of determining, based on the fault log reported by the server, whether a high-speed serial computer expansion bus (PCIE) uncorrectable error UCE alarm exists in the fault log;
If the current temperature is not greater than the specified temperature, reminding a user to recover from faults;
the performing fault recovery on the first hard disk based on the current link state of the I2C link includes:
And if the current state of the I2C link is an abnormal state, resetting the I2C link.
According to the method, the temperature of the array card is brought back to the normal range by adjusting the rotating speed of the fan, or the I2C link where the array card is located is restored to the normal state by resetting the I2C link, so that the fault can be resolved when the hard disk fault type is the array card fault.
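The temperature branch can be sketched as a table lookup plus a bounded loop. The table values, the threshold of 75 degrees C, the callbacks `read_temp` and `set_fan`, and the `max_rounds` limit are all assumptions for illustration; the application only states that a preset correspondence between temperature and fan rotating speed exists.

```python
import bisect
from typing import Callable

# Hypothetical table: (lower temperature bound in deg C, fan duty in percent).
FAN_TABLE = [(0, 30), (60, 50), (75, 70), (85, 100)]
SPECIFIED_TEMP_C = 75  # assumed value of the specified temperature


def target_fan_speed(temp_c: float) -> int:
    """Return the fan duty for the highest table bound not above temp_c."""
    bounds = [bound for bound, _ in FAN_TABLE]
    index = max(bisect.bisect_right(bounds, temp_c) - 1, 0)
    return FAN_TABLE[index][1]


def cool_array_card(read_temp: Callable[[], float],
                    set_fan: Callable[[int], None],
                    max_rounds: int = 10) -> bool:
    """Raise the fan speed until the card is no hotter than the threshold.

    Returns True once the temperature is back in range, so the caller can
    re-check the fault log for a PCIE UCE alarm; returns False if the
    (assumed) round limit is reached first.
    """
    for _ in range(max_rounds):
        temp = read_temp()
        if temp <= SPECIFIED_TEMP_C:
            return True
        set_fan(target_fan_speed(temp))
    return False
```

The round limit is a safeguard added in this sketch so the loop always terminates; the application describes the loop only as running until the temperature is no longer above the specified temperature.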
In one possible embodiment, the method further comprises:
if the current state of the I2C link is a normal state and the state information of the first hard disk is null, acquiring fault information in an array card log in the server, wherein the array card log is used for storing the fault information reported by the array card;
Determining fault recovery indication information according to the fault information;
and sending the fault recovery indication information to user terminal equipment so as to facilitate the user to carry out fault recovery based on the fault recovery indication information.
The method ensures that when the current state of the I2C link is a normal state and the state information of the first hard disk is empty, corresponding fault recovery indication information exists, and informs a user to carry out fault recovery based on the fault recovery indication information. So as to ensure that faults of different types can be solved in time.
In one possible embodiment, the method further comprises:
And if the fault information is null, determining the hard disk fault type as a hard disk fault, and returning to execute the step of determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by utilizing the preset corresponding relation between the hard disk fault type and the fault recovery strategy.
According to the method, when the fault information is empty, the hard disk fault type is determined to be the hard disk fault, and the fault recovery is carried out by utilizing the fault recovery strategy corresponding to the hard disk fault, so that the fault can be solved in time.
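The decisions that follow the I2C link check can be summarized in a small sketch. The action names, the `"normal"` link state string, and the wording of the indication message are hypothetical; the branch taken when the disk state becomes readable again is not described in the application and is marked as an assumption in the code.

```python
from typing import List, Optional, Tuple


def follow_up(i2c_state: str, disk_status: Optional[str],
              card_log: List[str]) -> Tuple[str, Optional[str]]:
    """Decide the next action after checking the I2C link.

    Returns (action, payload); the action names are hypothetical labels.
    """
    if i2c_state != "normal":
        return ("reset_i2c", None)
    if disk_status:
        # Assumption: state information is readable again, nothing to do.
        return ("none", None)
    if card_log:
        # Fault information exists: build an indication for the user.
        return ("notify_user", "Array card reported: " + "; ".join(card_log))
    # Fault information is empty: treat as a hard disk fault and go back
    # to the strategy lookup step.
    return ("recover_as_disk_fault", None)
```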
In a second aspect, the present application provides a fault recovery apparatus for a hard disk in a server, the apparatus comprising:
The acquisition module is used for acquiring state information of each hard disk in a server, wherein the state information is used for representing the current state of the hard disk;
The fault type determining module is used for determining the hard disk fault type of a first hard disk according to the state information of the first hard disk, wherein the first hard disk is any one of all the hard disks in the server;
the fault recovery strategy determining module is used for determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by utilizing the preset corresponding relation between the hard disk fault type and the fault recovery strategy;
and the fault recovery module is used for recovering the fault of the first hard disk by utilizing the target fault recovery strategy.
In one possible implementation, the hard disk failure types include a hard disk failure and an array card failure, where the hard disk failure indicates that a hard disk has failed, and the array card failure indicates that an array card in the server has failed.
In a possible implementation manner, the fault type determining module is specifically configured to:
if the state information of the first hard disk is a fault state, determining that the hard disk fault type of the first hard disk is a hard disk fault;
And if the state information of the first hard disk is empty, determining that the hard disk fault type of the first hard disk is an array card fault.
In one possible implementation manner, if the hard disk failure type is the hard disk failure;
The fault recovery module is specifically configured to:
Determining a target logical disk corresponding to the first hard disk by using a preset corresponding relation between the hard disk and the logical disk, wherein the logical disk is a hard disk array formed by a plurality of hard disks, and the target logical disk is the logical disk where the first hard disk is located;
Determining whether a hot standby disk exists in the target logical disk or not by utilizing function identifiers of all hard disks in the target logical disk, wherein the function identifiers are used for representing functions of the hard disks in the target logical disk;
if yes, replacing the first hard disk by using the hot spare disk, and recovering the data in the first hard disk into the hot spare disk;
if not, setting the idle hard disk in the server as the hot spare disk, replacing the first hard disk by using the hot spare disk, and recovering the data in the first hard disk into the hot spare disk.
In one possible embodiment, the fault recovery module is further configured to:
If the idle hard disk does not exist in the server, notifying a user to replace the first hard disk;
When the first hard disk is determined to be replaced, judging whether the configuration of the replaced first hard disk meets the specified condition or not;
if yes, determining the replaced first hard disk as the hot spare disk, replacing the first hard disk by the hot spare disk, and recovering backup data corresponding to the first hard disk into the hot spare disk;
If not, returning to the step of notifying the user to replace the first hard disk until the backup data corresponding to the first hard disk is restored to the hot standby disk.
In one possible implementation manner, if the hard disk failure type is the array card failure;
The fault recovery module is specifically configured to:
Based on the fault log reported by the server, determining whether a high-speed serial computer expansion bus (PCIE) uncorrectable error UCE alarm exists in the fault log;
if yes, processing the fault of the first hard disk based on the current temperature of the array card;
if not, acquiring the current link state of the serial communication bus I2C link where the array card is located, and performing fault recovery on the first hard disk based on the current link state of the I2C link.
In one possible embodiment, the fault recovery module is further configured to:
if the current temperature is greater than the specified temperature, determining a target fan rotating speed corresponding to the current temperature of the array card by utilizing a preset corresponding relation between the temperature and the fan rotating speed, and adjusting the rotating speed of the fan corresponding to the array card to the target fan rotating speed until the current temperature is not greater than the specified temperature, and then returning to the step of determining, based on the fault log reported by the server, whether a high-speed serial computer expansion bus (PCIE) uncorrectable error UCE alarm exists in the fault log;
If the current temperature is not greater than the specified temperature, reminding a user to recover from faults;
the performing fault recovery on the first hard disk based on the current link state of the I2C link includes:
And if the current state of the I2C link is an abnormal state, resetting the I2C link.
In one possible embodiment, the fault recovery module is further configured to:
if the current state of the I2C link is a normal state and the state information of the first hard disk is null, acquiring fault information in an array card log in the server, wherein the array card log is used for storing the fault information reported by the array card;
Determining fault recovery indication information according to the fault information;
and sending the fault recovery indication information to user terminal equipment so as to facilitate the user to carry out fault recovery based on the fault recovery indication information.
In one possible embodiment, the fault recovery module is further configured to:
And if the fault information is null, determining the hard disk fault type as a hard disk fault, and returning to execute the step of determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by utilizing the preset corresponding relation between the hard disk fault type and the fault recovery strategy.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements steps in a method for recovering a hard disk failure in the server when the processor executes the computer program.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described method for recovering from a hard disk failure in a server according to the present application.
In a fifth aspect, an embodiment of the present application provides a computer program product, including a computer program, where the computer program is stored in a computer readable storage medium, and when a processor of a memory access device reads the computer program from the computer readable storage medium, the processor executes the computer program, so that the memory access device executes steps in a method for recovering a hard disk failure in a server according to the present application.
For the technical effects that may be achieved by each of the third to fifth aspects, refer to the technical effects that may be achieved by the corresponding possible implementations of the first aspect; the description is not repeated here.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a method for recovering a hard disk failure in a server according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of fault recovery corresponding to a hard disk fault type being a hard disk fault according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of fault recovery corresponding to a hard disk fault type being an array card fault according to an embodiment of the present application;
Fig. 4 is a second schematic flow chart of a method for recovering a hard disk failure in a server according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a fault recovery device for a hard disk in a server according to an embodiment of the present application;
Fig. 6 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings. The specific method of operation in the method embodiment may also be applied to the device embodiment or the system embodiment.
In the description of the present application, "plurality" is understood as "at least two". "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A alone, both A and B, and B alone. "A and B are connected" may indicate that A and B are directly connected or that A and B are connected through C. In addition, in the description of the present application, the words "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance or order.
In the prior art, a RAID card supports configuring a hot spare disk, and the hot spare disk is used to rebuild the failed disk array. However, this method cannot cover all fault scenarios, and this single processing method cannot take the state of the whole machine into account to adaptively adjust the recovery strategy. For example, if the hard disk is still in a failure state after a firmware upgrade, there is no corresponding processing scheme. The accuracy of hard disk failure recovery is therefore low.
In this regard, the embodiment of the application provides a method for recovering the fault of the hard disk in the server, which comprises the steps of determining the fault type of the hard disk according to the state information of the hard disk, determining a target fault recovery strategy corresponding to the fault type of the hard disk by utilizing the preset corresponding relation between the fault type of the hard disk and the fault recovery strategy, and recovering the fault of the hard disk by utilizing the target fault recovery strategy.
The following describes a method for recovering a hard disk failure in a server according to an embodiment of the present application with reference to the accompanying drawings, as shown in fig. 1, which is a schematic flow chart of the method for recovering a hard disk failure in a server, and specifically may include the following steps:
Step 101, acquiring state information of each hard disk in a server, wherein the state information is used for representing the current state of the hard disk;
The state information in the embodiment of the application comprises a normal state, a fault state, a working state, an off-line state and the like.
Step 102, determining a hard disk fault type of a first hard disk according to state information of the first hard disk, wherein the first hard disk is any one of all hard disks in the server;
The hard disk fault type in the embodiment of the application comprises a hard disk fault and an array card fault, wherein the hard disk fault is used for indicating that a hard disk breaks down, and the array card fault is used for indicating that an array card in the server breaks down.
In one possible implementation, step 102 may be embodied as:
if the state information of the first hard disk is a fault state, determining that the hard disk fault type of the first hard disk is a hard disk fault;
and if the state information of the first hard disk is empty, determining that the hard disk fault type of the first hard disk is an array card fault.
In the embodiment of the application, the state information of the first hard disk being empty means that the state information of the first hard disk cannot be acquired. When the acquired state information of the first hard disk is the fault state or is empty, a hard disk fault log is reported, and the hard disk fault log is ended once the fault of the hard disk is recovered.
Step 103, determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by utilizing the preset corresponding relation between the hard disk fault type and the fault recovery strategy;
Step 104, recovering the fault of the first hard disk by utilizing the target fault recovery strategy.
Next, the fault recovery method in step 104 will be described. First, the fault recovery mode used when the hard disk fault type is a hard disk fault is introduced. As shown in fig. 2, which is a flow chart of the corresponding fault recovery when the hard disk fault type is a hard disk fault, the method may specifically include the following steps:
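Steps 101 to 104 can be sketched as a lookup in a preset table that maps fault types to recovery strategies. The table keys, the two placeholder strategy functions, and the status strings are hypothetical; real strategies would carry out the recovery flows of the embodiments rather than return a description.

```python
from typing import Callable, Dict, List, Optional


def recover_disk_fault(slot: int) -> str:
    return f"slot {slot}: rebuild onto hot spare"   # placeholder strategy


def recover_array_card_fault(slot: int) -> str:
    return f"slot {slot}: diagnose array card"      # placeholder strategy


# Preset correspondence between hard disk fault type and recovery strategy.
STRATEGIES: Dict[str, Callable[[int], str]] = {
    "disk_fault": recover_disk_fault,
    "array_card_fault": recover_array_card_fault,
}


def fault_type_of(status: Optional[str]) -> Optional[str]:
    """Assumed encoding: empty status means the state could not be read."""
    if not status:
        return "array_card_fault"
    return "disk_fault" if status == "fault" else None


def run_recovery(statuses: Dict[int, Optional[str]]) -> List[str]:
    """Apply steps 101 to 104 to every disk, keyed by slot number."""
    actions = []
    for slot, status in sorted(statuses.items()):  # step 101: collected states
        fault_type = fault_type_of(status)          # step 102
        if fault_type is None:
            continue                                # no fault type assigned
        strategy = STRATEGIES[fault_type]           # step 103
        actions.append(strategy(slot))              # step 104
    return actions
```

Keeping the correspondence in a table means a new fault type only needs a new entry and a new strategy function, without changing the dispatch loop.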
Step 201, determining a target logical disk corresponding to the first hard disk by utilizing a preset corresponding relation between the hard disk and the logical disk, wherein the logical disk is a hard disk array formed by a plurality of hard disks, and the target logical disk is the logical disk where the first hard disk is located;
Step 202, determining whether a hot standby disk exists in the target logical disk or not by using function identifiers of all hard disks in the target logical disk, if so, executing step 203, and if not, executing step 204, wherein the function identifiers are used for representing functions of the hard disks in the target logical disk;
For example, function identifier A is used to represent a hot spare disk, function identifier B is used to represent a spare disk, and so on. In the embodiment of the application, each hard disk in the logical disk has a corresponding function identifier, and whether a hot spare disk exists in the target logical disk can be determined through the function identifier of each hard disk.
Step 203, replacing the first hard disk with the hot spare disk, and recovering the data in the first hard disk into the hot spare disk;
In the embodiment of the application, recovering the data in the first hard disk into the hot spare disk means that the first hard disk is replaced by the hot spare disk and the data of the first hard disk is rebuilt on the hot spare disk. This step is the logical disk reconstruction step; it belongs to the prior art, is not an inventive point of the present application, and is not described in detail herein.
Step 204, judging whether an idle hard disk exists in the server, if yes, executing step 205, and if not, executing step 206;
Step 205, setting an idle hard disk in the server as the hot spare disk, replacing the first hard disk by the hot spare disk, and recovering the data in the first hard disk into the hot spare disk.
In one possible implementation manner, before setting the idle hard disk in the server as the hot standby disk, it is determined that the type of the idle hard disk is the same as the type of the first hard disk, and the hard disk capacity of the idle hard disk is not less than the capacity of the target logical disk.
In one possible implementation, the idle hard disk in the server is set as the hot spare disk, and the implementation may be that the idle hard disk is added to the target logical disk, and a function identifier of the idle hard disk is set as a function identifier corresponding to the hot spare disk.
Step 206, notifying a user to replace the first hard disk;
In the embodiment of the application, the message for notifying the user comprises the slot corresponding to the first hard disk in the server, so that the user can correctly replace the first hard disk.
Step 207, when it is determined that the replacement of the first hard disk is completed, judging whether the configuration of the replaced first hard disk meets the specified condition, if so, executing step 208, and if not, returning to execute step 206;
In the embodiment of the application, whether the hard disk is replaced or a new disk is inserted is detected by polling the backboard through a BMC (Baseboard Management Controller in the server and a server baseboard management controller). The method comprises the steps of obtaining state information of a first hard disk, comparing the state of the hard disk with the state information of the hard disk obtained last time, if the state information of the hard disk is the same as the state information of the hard disk obtained last time, determining that the first hard disk is not replaced, and if the state information of the hard disk is not the same, determining that the first hard disk is replaced.
Step 208, determining the replaced first hard disk as the hot spare disk, replacing the first hard disk with the hot spare disk, and recovering the backup data corresponding to the first hard disk into the hot spare disk.
In the embodiment of the application, if the backup data of the first hard disk exists in the hard disks of the target logical disk other than the first hard disk and the hot spare disk, the replaced first hard disk is determined to be the hot spare disk, and the backup data in those other hard disks is copied to the hot spare disk.
The specified condition in the embodiment of the present application is that the type of the replaced first hard disk is the same as the type of the first hard disk, and the capacity of the replaced first hard disk is not less than the capacity of the target logical disk.
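The loop formed by steps 206 to 208 can be sketched as below. The sketch assumes that each detected replacement is delivered as a `(disk type, capacity)` pair and that `notify` sends the slot message to the user; both are hypothetical stand-ins, as are the function names.

```python
from typing import Callable, Iterable, Optional, Tuple

DiskInfo = Tuple[str, int]  # hypothetical record: (disk type, capacity in GB)


def meets_specified_condition(new_disk: DiskInfo, failed_type: str,
                              logical_capacity_gb: int) -> bool:
    """Same type as the failed disk, capacity not less than the logical disk."""
    disk_type, capacity_gb = new_disk
    return disk_type == failed_type and capacity_gb >= logical_capacity_gb


def replace_until_valid(inserted: Iterable[DiskInfo], failed_type: str,
                        logical_capacity_gb: int, slot: int,
                        notify: Callable[[str], None]) -> Optional[DiskInfo]:
    disks = iter(inserted)
    while True:
        notify(f"Replace the hard disk in slot {slot}")   # step 206
        new_disk = next(disks, None)                      # step 207: replacement detected
        if new_disk is None:
            return None                                   # no further replacement observed
        if meets_specified_condition(new_disk, failed_type, logical_capacity_gb):
            return new_disk                               # step 208: use it as the hot spare


messages = []
accepted = replace_until_valid([("SATA", 960), ("SAS", 480), ("SAS", 1920)],
                               "SAS", 960, 3, messages.append)
```

In this run the first two disks are rejected (wrong type, then too small), so the user is notified three times before the third disk is accepted.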
Having introduced the fault recovery method for the case where the hard disk fault type is a hard disk fault, the following introduces the fault recovery method for the case where the hard disk fault type is an array card fault in the embodiment of the application. Fig. 3 is a flow chart of the fault recovery method corresponding to an array card fault, which specifically includes the following steps:
Step 301, determining, based on the fault log reported by the server, whether the fault log contains an uncorrectable error (UCE) alarm of the high-speed serial computer expansion bus PCIE; if so, executing step 302, and if not, executing step 303;
Step 302, processing the fault of the first hard disk based on the current temperature of the array card;
In one embodiment, step 302 may be implemented as follows: if the current temperature of the array card is greater than the specified temperature, the target fan rotation speed corresponding to the current temperature is determined by using a preset correspondence between temperature and fan rotation speed, and the rotation speed of the fan corresponding to the array card is adjusted to the target rotation speed; once the current temperature is no longer greater than the specified temperature, the method returns to the step of determining, based on the fault log reported by the server, whether a PCIE UCE alarm exists in the fault log. If the current temperature is not greater than the specified temperature, the user is prompted to perform fault recovery.
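The temperature-driven fan adjustment can be sketched as a table lookup followed by a bounded loop. The breakpoints in `FAN_CURVE`, the use of a duty-cycle percentage rather than an RPM value, and the callables `read_temp` and `set_fan` are all assumptions chosen for illustration; the application only states that a preset correspondence between temperature and fan speed exists.

```python
from bisect import bisect_right
from typing import Callable

# Hypothetical preset: at or above each temperature (deg C), use the paired fan duty (%).
FAN_CURVE = [(0, 30), (60, 50), (75, 70), (85, 100)]


def target_fan_duty(temp_c: float) -> int:
    """Look up the fan duty for the highest breakpoint not above temp_c."""
    temps = [t for t, _ in FAN_CURVE]
    idx = bisect_right(temps, temp_c) - 1
    return FAN_CURVE[max(idx, 0)][1]


def cool_down(read_temp: Callable[[], float], set_fan: Callable[[int], None],
              specified_temp_c: float, max_rounds: int = 20) -> bool:
    """Raise the fan speed until the temperature is not greater than the limit.
    Returns True when the caller may go back and re-check the fault log."""
    for _ in range(max_rounds):
        temp = read_temp()
        if temp <= specified_temp_c:
            return True
        set_fan(target_fan_duty(temp))
    return False


_readings = iter([90, 80, 70])
applied = []
cooled = cool_down(lambda: next(_readings), applied.append, specified_temp_c=75)
```

The `max_rounds` bound is a design choice of the sketch so that a sensor that never cools down cannot stall the recovery flow indefinitely.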
The following describes how the user performs fault recovery:
First, the user restarts the server and checks whether the PCIE UCE alarm has been cleared. If so, the procedure ends. If not, the user reseats the cable and the riser card and checks again whether the PCIE UCE alarm has been cleared; if it has not, the user replaces the cable and the riser card and checks again; if the alarm still persists, the user replaces the PCIE card.
Step 303, obtaining the current link state of the serial communication bus I2C link where the array card is located, and performing fault recovery on the first hard disk based on the current link state of the serial communication bus I2C link.
In one embodiment, step 303 may be implemented by resetting the I2C link if the current state of the I2C link is an abnormal state.
In the embodiment of the application, the resetting of the I2C link is realized by adopting a PCA9548 chip. Specifically, the RESET signal pin of the PCA9548 chip is connected to a BMC GPIO (General purpose input/output port) in the server, and the BMC realizes the RESET function of the I2C link by pulling down the RESET pin level and then immediately restoring the RESET pin level.
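The reset pulse described above can be sketched as follows. The callable `write_gpio` is a hypothetical stand-in for the BMC's GPIO driver, and the hold time is an assumed placeholder; the actual minimum reset pulse width should be taken from the multiplexer's datasheet.

```python
import time
from typing import Callable


def pulse_reset(write_gpio: Callable[[int], None], hold_s: float = 0.001) -> None:
    """Drive the RESET pin low, hold briefly, then restore the high level."""
    write_gpio(0)              # pull the RESET pin level down
    try:
        time.sleep(hold_s)     # assumed hold time, not taken from the application
    finally:
        write_gpio(1)          # always restore the level so the bus is not held in reset


levels = []
pulse_reset(levels.append, hold_s=0)
```

Restoring the level in a `finally` clause is a defensive choice of the sketch: if the wait is interrupted, the I2C multiplexer is still released from reset.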
In order to cover various fault scenarios, in one embodiment, if the current state of the I2C link is a normal state and the state information of the first hard disk is empty, the fault information in the array card log in the server is obtained, where the array card log is used for storing the fault information reported by the array card. Fault recovery indication information is then determined according to the fault information and sent to the user terminal device, so that the user can perform fault recovery based on the fault recovery indication information.
In the embodiment of the application, the fault recovery indication information corresponding to the fault information is determined by using a preset correspondence between fault information and fault recovery indication information. Table 1 shows this correspondence:
Table 1:
Fault information | Fault recovery indication information
BBU failure | Replace the BBU
Cache failure | Replace the cache
Array card failure | Replace the array card
... | ...
For example, if the fault information is a BBU (Battery Backup Unit) failure, the user is notified to replace the BBU. If the fault information is a cache failure, the user is notified to replace the cache. If the fault information is an array card failure, the user is notified to replace the array card.
In one possible implementation manner, if the fault information is null, determining the hard disk fault type as a hard disk fault, and then returning to execute the step of determining the target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by using the preset correspondence between the hard disk fault type and the fault recovery strategy.
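The lookup in Table 1, together with the fallback for empty fault information, can be sketched as a dictionary query. The key strings, the returned tuple and the function name are illustrative assumptions; only the three listed rows of Table 1 are included, and any other fault information deliberately maps to no indication rather than a guessed one.

```python
from typing import Optional, Tuple

# The three rows listed in Table 1; the key spellings are assumed for this sketch.
RECOVERY_INDICATIONS = {
    "BBU failure": "Replace the BBU",
    "Cache failure": "Replace the cache",
    "Array card failure": "Replace the array card",
}


def resolve_array_card_log(fault_info: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (fault type, indication). Empty fault information means the array
    card reported nothing, so the fault is re-classified as a hard disk fault."""
    if not fault_info:
        return ("hard disk fault", None)
    return ("array card fault", RECOVERY_INDICATIONS.get(fault_info))


bbu_result = resolve_array_card_log("BBU failure")
empty_result = resolve_array_card_log(None)
```

When the first element is "hard disk fault", the caller would go back to selecting the recovery strategy for that fault type, as the paragraph above describes.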
The following describes a method for recovering a hard disk failure in a server according to an embodiment of the present application with reference to fig. 4, which specifically includes the following steps:
Step 401, obtaining state information of each hard disk in a server, wherein the state information is used for representing the current state of the hard disk;
Step 402, if the status information of the first hard disk is a failure status, executing step 404 after determining that the hard disk failure type of the first hard disk is a hard disk failure;
Step 403, if the status information of the first hard disk is empty, executing step 408 after determining that the hard disk failure type of the first hard disk is an array card failure;
Step 404, determining a target logical disk corresponding to the first hard disk by using a preset corresponding relation between the hard disk and the logical disk, wherein the logical disk is a hard disk array composed of a plurality of hard disks, and the target logical disk is the logical disk where the first hard disk is located;
Step 405, determining whether a hot spare disk exists in the target logical disk by using the function identifier of each hard disk in the target logical disk; if so, executing step 406, otherwise, executing step 407;
Wherein the function identifier is used for representing the function of the hard disk in the target logical disk;
Step 406, replacing the first hard disk with the hot spare disk, and recovering the data in the first hard disk to the hot spare disk;
Step 407, setting an idle hard disk in the server as the hot spare disk, replacing the first hard disk with the hot spare disk, and recovering data in the first hard disk into the hot spare disk;
Step 408, determining whether a PCIE UCE alarm exists in the fault log based on the fault log reported by the server, if yes, executing step 409, and if not, executing step 412;
Step 409, judging whether the current temperature of the array card is greater than the specified temperature, if yes, executing step 410, and if not, executing step 411;
Step 410, determining a target fan rotating speed corresponding to the current temperature of the array card by utilizing the preset corresponding relation between the temperature and the fan rotating speed, adjusting the rotating speed of the fan corresponding to the array card to the target rotating speed until the current temperature is not greater than the designated temperature, and returning to execute step 408;
Step 411, reminding a user of fault recovery;
Step 412, if the current state of the I2C link is an abnormal state, resetting the I2C link.
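The branching of steps 401 to 412 can be summarized in a small decision function. This is a sketch under assumptions: the status string "fault", the use of `None` for empty state information, the boolean inputs and the action strings are all invented for the example, and the actual recovery actions are reduced to labels.

```python
from typing import List, Optional


def plan_recovery(status: Optional[str], *, has_hot_spare: bool = False,
                  pcie_uce_alarm: bool = False, over_temperature: bool = False,
                  i2c_abnormal: bool = False) -> List[str]:
    """Return the actions chosen by the flow of fig. 4 for one hard disk."""
    if status == "fault":                                   # step 402: hard disk fault
        if has_hot_spare:                                   # step 405
            return ["rebuild onto existing hot spare"]      # step 406
        return ["set idle disk as hot spare",               # step 407
                "rebuild onto hot spare"]
    if status is None:                                      # step 403: array card fault
        if pcie_uce_alarm:                                  # step 408
            if over_temperature:                            # step 409
                return ["raise fan speed", "re-check fault log"]   # step 410
            return ["prompt user to recover"]               # step 411
        if i2c_abnormal:                                    # step 412
            return ["reset I2C link"]
    return []                                               # nothing to do in this sketch


hot_spare_plan = plan_recovery("fault", has_hot_spare=True)
overheat_plan = plan_recovery(None, pcie_uce_alarm=True, over_temperature=True)
```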
Based on the same inventive concept, the present application also provides a device for recovering a hard disk failure in a server, referring to fig. 5, the device 500 includes:
The obtaining module 501 is configured to obtain status information of each hard disk in the server, where the status information is used to represent a current status of the hard disk;
the fault type determining module 502 is configured to determine a hard disk fault type of a first hard disk according to status information of the first hard disk, where the first hard disk is any one of the hard disks in the server;
A failure recovery policy determining module 503, configured to determine a target failure recovery policy corresponding to the hard disk failure type of the first hard disk by using a preset correspondence between the hard disk failure type and the failure recovery policy;
and the fault recovery module 504 is configured to recover a fault of the first hard disk by using the target fault recovery policy.
In one possible implementation, the hard disk failure types include a hard disk failure and an array card failure, where the hard disk failure is used to indicate that a hard disk is failed, and the array card failure is used to indicate that an array card in the server is failed.
In one possible implementation manner, the fault type determining module 502 is specifically configured to:
if the state information of the first hard disk is a fault state, determining that the hard disk fault type of the first hard disk is a hard disk fault;
And if the state information of the first hard disk is empty, determining that the hard disk fault type of the first hard disk is an array card fault.
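The division of work between the fault type determining module, the strategy determining module and the recovery module can be sketched as a classifier plus a strategy table. The enum, the status strings and the two placeholder strategy functions are assumptions for illustration, not the application's implementation.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class FaultType(Enum):
    HARD_DISK = "hard disk fault"
    ARRAY_CARD = "array card fault"


def determine_fault_type(status: Optional[str]) -> Optional[FaultType]:
    """Fault state means hard disk fault; empty state means array card fault."""
    if status == "fault":
        return FaultType.HARD_DISK
    if status is None or status == "":
        return FaultType.ARRAY_CARD
    return None  # any other status is treated as healthy in this sketch


def recover_hard_disk(slot: int) -> str:
    return f"hot-spare rebuild for slot {slot}"      # placeholder action


def recover_array_card(slot: int) -> str:
    return f"array card diagnosis for slot {slot}"   # placeholder action


# Preset correspondence between fault type and recovery strategy.
STRATEGIES: Dict[FaultType, Callable[[int], str]] = {
    FaultType.HARD_DISK: recover_hard_disk,
    FaultType.ARRAY_CARD: recover_array_card,
}


def recover(slot: int, status: Optional[str]) -> Optional[str]:
    fault_type = determine_fault_type(status)
    if fault_type is None:
        return None
    return STRATEGIES[fault_type](slot)
```

Keeping the correspondence in a table means a further fault type could be supported by adding one entry, without changing the dispatch code.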
In one possible implementation manner, if the hard disk failure type is the hard disk failure;
the fault recovery module 504 is specifically configured to:
Determining a target logical disk corresponding to the first hard disk by using a preset corresponding relation between the hard disk and the logical disk, wherein the logical disk is a hard disk array formed by a plurality of hard disks, and the target logical disk is the logical disk where the first hard disk is located;
Determining whether a hot spare disk exists in the target logical disk by using the function identifiers of all hard disks in the target logical disk, wherein the function identifiers are used for representing functions of the hard disks in the target logical disk;
if yes, starting the hot spare disk, and recovering the data in the first hard disk to the hot spare disk;
If not, setting the idle hard disk in the server as the hot spare disk, starting the hot spare disk, and recovering the data in the first hard disk into the hot spare disk.
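Locating the target logical disk and checking its members for a hot spare can be sketched with two dictionaries. The slot names, the logical disk names and the role strings are hypothetical; the application only states that a preset correspondence between hard disks and logical disks and a per-disk function identifier exist.

```python
from typing import Dict, Optional

# Hypothetical preset correspondence between hard disks and logical disks.
DISK_TO_LOGICAL: Dict[str, str] = {"slot0": "ld0", "slot3": "ld1"}

# Hypothetical function identifiers of the hard disks in each logical disk.
ROLES_BY_LOGICAL: Dict[str, Dict[str, str]] = {
    "ld0": {"slot0": "member", "slot1": "member", "slot2": "hot_spare"},
    "ld1": {"slot3": "member", "slot4": "member"},
}


def find_hot_spare(failed_slot: str,
                   disk_to_logical: Dict[str, str],
                   roles_by_logical: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Return the slot of a hot spare in the failed disk's logical disk, if any."""
    target_logical = disk_to_logical[failed_slot]
    for slot, role in roles_by_logical[target_logical].items():
        if role == "hot_spare" and slot != failed_slot:
            return slot
    return None  # no hot spare: fall back to an idle disk in the server


spare_for_slot0 = find_hot_spare("slot0", DISK_TO_LOGICAL, ROLES_BY_LOGICAL)
spare_for_slot3 = find_hot_spare("slot3", DISK_TO_LOGICAL, ROLES_BY_LOGICAL)
```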
In one possible implementation, the fault recovery module 504 is further configured to:
If the idle hard disk does not exist in the server, notifying a user to replace the first hard disk;
When the first hard disk is determined to be replaced, judging whether the configuration of the replaced first hard disk meets the specified condition or not;
If yes, determining the replaced first hard disk as the hot spare disk, and restoring the backup data corresponding to the first hard disk to the hot spare disk;
If not, returning to the step of notifying the user to replace the first hard disk until the backup data corresponding to the first hard disk is restored to the hot spare disk.
In one possible implementation manner, if the hard disk failure type is the array card failure;
the fault recovery module 504 is specifically configured to:
Based on the fault log reported by the server, determining whether a high-speed serial computer expansion bus (PCIE) uncorrectable error UCE alarm exists in the fault log;
if yes, processing the fault of the first hard disk based on the current temperature of the array card;
If not, acquiring the current link state of the serial communication bus I2C link where the array card is located, and recovering the fault of the first hard disk based on the current link state of the serial communication bus I2C link.
In one possible implementation, the fault recovery module 504 is further configured to:
If the current temperature is greater than the specified temperature, determining a target fan rotation speed corresponding to the current temperature of the array card by using a preset correspondence between temperature and fan rotation speed, adjusting the rotation speed of the fan corresponding to the array card to the target rotation speed until the current temperature is not greater than the specified temperature, and then returning to the step of determining, based on the fault log reported by the server, whether a PCIE UCE alarm exists in the fault log;
If the current temperature is not greater than the specified temperature, prompting the user to perform fault recovery;
the performing fault recovery on the first hard disk based on the current link state of the I2C link includes:
And if the current state of the I2C link is an abnormal state, resetting the I2C link.
In one possible implementation, the fault recovery module 504 is further configured to:
if the current state of the I2C link is a normal state and the state information of the first hard disk is null, acquiring fault information in an array card log in the server, wherein the array card log is used for storing the fault information reported by the array card;
Determining fault recovery indication information according to the fault information;
and sending the fault recovery indication information to user terminal equipment so as to facilitate the user to carry out fault recovery based on the fault recovery indication information.
In one possible implementation, the fault recovery module 504 is further configured to:
And if the fault information is null, determining the hard disk fault type as a hard disk fault, and returning to execute the step of determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by utilizing the preset corresponding relation between the hard disk fault type and the fault recovery strategy.
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, where the electronic device may implement the function of the foregoing fault recovery apparatus for a hard disk in a server, and referring to fig. 6, the electronic device includes:
At least one processor 601 and a memory 602 connected to the at least one processor 601. The specific connection medium between the processor 601 and the memory 602 is not limited in the embodiment of the present application; in fig. 6, the processor 601 and the memory 602 are connected through a bus 600 as an example. The bus 600 is shown as a bold line in fig. 6, and the manner in which the other components are connected is illustrated schematically and not by way of limitation. The bus 600 may be divided into an address bus, a data bus, a control bus, and so on; for convenience of representation it is drawn as a single thick line in fig. 6, but this does not mean that there is only one bus or one type of bus. Alternatively, the processor 601 may be referred to as a controller, and the name is not limited.
In the embodiment of the present application, the memory 602 stores instructions executable by the at least one processor 601, and the at least one processor 601 may execute the fault recovery method of the hard disk in the server by executing the instructions stored in the memory 602. The processor 601 may implement the functions of the various modules in the apparatus shown in fig. 5.
The processor 601 is the control center of the device. It can connect the various parts of the whole device through various interfaces and lines and, by running or executing the instructions stored in the memory 602 and calling the data stored in the memory 602, perform the various functions of the device and process data, thereby monitoring the device as a whole.
In one possible design, processor 601 may include one or more processing units, and processor 601 may integrate an application processor and a modem processor, wherein the application processor primarily processes operating systems, user interfaces, application programs, and the like, and the modem processor primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 601. In some embodiments, processor 601 and memory 602 may be implemented on the same chip, or they may be implemented separately on separate chips in some embodiments.
The processor 601 may be a general purpose processor, such as a central processing unit (CPU), a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the application. The general purpose processor may be a microprocessor or any conventional processor. The steps of the method for recovering from a hard disk failure in the server disclosed in the embodiments of the application may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 602 is a non-volatile computer readable storage medium that can be used to store non-volatile software programs, non-volatile computer executable programs, and modules. The memory 602 may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, magnetic disk, optical disk, and the like. The memory 602 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 602 in embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
By programming the processor 601, the code corresponding to the method for recovering from a hard disk failure in the server described in the foregoing embodiments may be burned into a chip, so that the chip can execute, at run time, the steps of the method for recovering from a hard disk failure in the server in the embodiment shown in fig. 1. How to design and program the processor 601 is well known to those skilled in the art and is not described in detail here.
The embodiment of the application also provides a computer readable storage medium which stores the computer executable instructions to be executed by the processor, including the program to be executed by the processor.
In some possible embodiments, aspects of the method for recovering from a failure of a hard disk in a server provided by the present application may also be implemented in a form of a program product, which includes program code for causing an electronic device to execute the steps of the method for recovering from a failure of a hard disk in a server according to various exemplary embodiments of the present application described above when the program product is run on the electronic device.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (12)

1. A method for recovering a hard disk failure in a server, the method comprising:
Acquiring state information of each hard disk in a server, wherein the state information is used for representing the current state of the hard disk;
Determining a hard disk fault type of a first hard disk according to state information of the first hard disk, wherein the first hard disk is any one of all the hard disks in the server;
determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by utilizing a preset corresponding relation between the hard disk fault type and the fault recovery strategy;
and recovering the fault of the first hard disk by utilizing the target fault recovery strategy.
2. The method of claim 1, wherein the hard disk failure type comprises a hard disk failure and an array card failure, wherein the hard disk failure is to indicate a hard disk failure and the array card failure is to indicate an array card failure in the server.
3. The method of claim 2, wherein determining the failure type of the first hard disk based on the status information of the first hard disk comprises:
if the state information of the first hard disk is a fault state, determining that the hard disk fault type of the first hard disk is a hard disk fault;
And if the state information of the first hard disk is empty, determining that the hard disk fault type of the first hard disk is an array card fault.
4. The method of claim 2, wherein if the hard disk failure type is the hard disk failure;
recovering the failure of the first hard disk by using the target failure recovery strategy, including:
Determining a target logical disk corresponding to the first hard disk by using a preset corresponding relation between the hard disk and the logical disk, wherein the logical disk is a hard disk array formed by a plurality of hard disks, and the target logical disk is the logical disk where the first hard disk is located;
Determining whether a hot spare disk exists in the target logical disk by using the function identifiers of all hard disks in the target logical disk, wherein the function identifiers are used for representing functions of the hard disks in the target logical disk;
if yes, replacing the first hard disk by using the hot spare disk, and recovering the data in the first hard disk into the hot spare disk;
if not, setting the idle hard disk in the server as the hot spare disk, replacing the first hard disk by using the hot spare disk, and recovering the data in the first hard disk into the hot spare disk.
5. The method according to claim 4, wherein the method further comprises:
If the idle hard disk does not exist in the server, notifying a user to replace the first hard disk;
When the first hard disk is determined to be replaced, judging whether the configuration of the replaced first hard disk meets the specified condition or not;
if yes, determining the replaced first hard disk as the hot spare disk, replacing the first hard disk by the hot spare disk, and recovering backup data corresponding to the first hard disk into the hot spare disk;
If not, returning to the step of notifying the user to replace the first hard disk until the backup data corresponding to the first hard disk is restored to the hot spare disk.
6. The method of claim 2, wherein if the hard disk failure type is the array card failure;
the recovering the failure of the first hard disk by using the target failure recovery strategy includes:
Based on the fault log reported by the server, determining whether a high-speed serial computer expansion bus (PCIE) uncorrectable error UCE alarm exists in the fault log;
if yes, processing the fault of the first hard disk based on the current temperature of the array card;
if not, acquiring the current link state of the serial communication bus I2C link where the array card is located, and performing fault recovery on the first hard disk based on the current link state of the I2C link.
7. The method of claim 6, wherein the processing the failure of the first hard disk based on the current temperature of the array card comprises:
if the current temperature is greater than the specified temperature, determining a target fan rotation speed corresponding to the current temperature of the array card by using a preset correspondence between temperature and fan rotation speed, adjusting the rotation speed of the fan corresponding to the array card to the target rotation speed until the current temperature is not greater than the specified temperature, and then returning to the step of determining, based on the fault log reported by the server, whether a high-speed serial computer expansion bus PCIE uncorrectable error UCE alarm exists in the fault log;
If the current temperature is not greater than the specified temperature, prompting the user to perform fault recovery;
the performing fault recovery on the first hard disk based on the current link state of the I2C link includes:
And if the current state of the I2C link is an abnormal state, resetting the I2C link.
8. The method of claim 7, wherein the method further comprises:
if the current state of the I2C link is a normal state and the state information of the first hard disk is null, acquiring fault information in an array card log in the server, wherein the array card log is used for storing the fault information reported by the array card;
Determining fault recovery indication information according to the fault information;
and sending the fault recovery indication information to user terminal equipment so as to facilitate the user to carry out fault recovery based on the fault recovery indication information.
9. The method of claim 8, wherein the method further comprises:
And if the fault information is null, determining the hard disk fault type as a hard disk fault, and returning to execute the step of determining a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by utilizing the preset corresponding relation between the hard disk fault type and the fault recovery strategy.
10. A device for recovering from a failure of a hard disk in a server, the device comprising:
An acquisition module, configured to acquire state information of each hard disk in a server, wherein the state information is used for representing the current state of the hard disk;
A fault type determining module, configured to determine the hard disk fault type of a first hard disk according to the state information of the first hard disk, wherein the first hard disk is any one of the hard disks in the server;
A fault recovery strategy determining module, configured to determine a target fault recovery strategy corresponding to the hard disk fault type of the first hard disk by using the preset correspondence between the hard disk fault type and the fault recovery strategy;
and a fault recovery module, configured to recover the fault of the first hard disk by using the target fault recovery strategy.
11. An electronic device, comprising:
A memory for storing a computer program;
A processor for implementing the method of any of claims 1-9 when executing a computer program stored on the memory.
12. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-9.
CN202411692331.6A 2024-11-22 2024-11-22 Fault recovery method, device, equipment and storage medium for hard disk in server Pending CN119621394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411692331.6A CN119621394A (en) 2024-11-22 2024-11-22 Fault recovery method, device, equipment and storage medium for hard disk in server

Publications (1)

Publication Number Publication Date
CN119621394A true CN119621394A (en) 2025-03-14

Family

ID=94907177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411692331.6A Pending CN119621394A (en) 2024-11-22 2024-11-22 Fault recovery method, device, equipment and storage medium for hard disk in server

Country Status (1)

Country Link
CN (1) CN119621394A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120658858A (en) * 2025-07-17 2025-09-16 浙江零跑科技股份有限公司 Video stream fault recovery method, system, equipment and storage medium
CN120523689A (en) * 2025-07-22 2025-08-22 苏州元脑智能科技有限公司 Disk abnormality processing method, device, equipment and storage medium
CN120523689B (en) * 2025-07-22 2025-10-10 苏州元脑智能科技有限公司 Disk exception handling method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN119621394A (en) Fault recovery method, device, equipment and storage medium for hard disk in server
CN106682162B (en) Log management method and device
CN111858468B (en) Method, system, terminal and storage medium for verifying metadata of distributed file system
CN113626256A (en) Virtual machine disk data backup method, device, terminal and storage medium
CN112015599A (en) Method and apparatus for error recovery
CN111897686A (en) Method, device, electronic device and storage medium for processing hard disk failure of server cluster
CN105607973A (en) Method, device and system for processing equipment failures in virtual machine system
CN114116280A (en) Interactive BMC self-recovery method, system, terminal and storage medium
CN120353632A (en) Memory fault repairing method, device, equipment, medium and computer program product
CN115705261A (en) Memory fault repairing method, CPU, OS, BIOS and server
CN105224416B (en) Restorative procedure and related electronic device
CN117170921A (en) Equipment correctable error handling methods, devices, computer equipment and storage media
CN112068935A (en) Method, device and equipment for monitoring deployment of Kubernetes program
CN110968456B (en) Method and device for processing fault disk in distributed storage system
CN119440900B (en) Storage system control method and device
CN119847808B (en) Abnormality diagnosis method, apparatus, medium and product for basic input/output system
CN118132118B (en) Firmware upgrading method and device
CN111221681A (en) Memory repairing method and device
CN120255970A (en) Baseboard management controller startup method, computer equipment, medium and product
CN110795155B (en) System starting method and device, electronic equipment and storage medium
CN117950900A (en) A memory error processing method and computing device
CN115484267A (en) Multi-cluster deployment processing method and device, electronic equipment and storage medium
CN116401118A (en) Method and device for monitoring Samba of file sharing service
CN115934395A (en) Fault injection method and device for solid state disk, computer equipment and storage medium
CN115114097A (en) Hard disk injection medium error test method, system, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination