CN117389781B - Abnormality detection and recovery method and system for server equipment, server and medium - Google Patents
- Publication number
- CN117389781B; CN202311352889.5A
- Authority
- CN
- China
- Prior art keywords
- equipment
- server
- abnormality
- hardware
- exception
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an abnormality detection and recovery method and system for server equipment, a server, and a medium. In the method, equipment abnormality detection and recovery logic set according to the architecture type of the server is loaded onto a baseboard management controller of the server. When the server is powered on and started, the baseboard management controller acquires a current start-up stage identifier, performs hardware equipment abnormality detection according to a plurality of equipment abnormality checkpoints corresponding to that identifier, and executes preset abnormality recovery logic when abnormal hardware equipment exists. After the server has started, the baseboard management controller performs hardware equipment abnormality detection according to a plurality of equipment abnormality checkpoints corresponding to the operating system running stage, and executes preset abnormality recovery logic when the hardware equipment is abnormal. The invention can actively detect and automatically repair equipment abnormalities in both the system start-up stage and the running stage, which ensures continuous and stable operation of the server product, effectively improves the fault tolerance of the product, and thereby meets the high-performance application requirements of the server.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, a system, a server, and a storage medium for detecting and recovering an abnormality of a server device.
Background
With the continuous development of computer technology, servers of various architecture types are widely used across industries to provide simple, fast, efficient, and secure computing services. However, a server of any architecture type may lose hardware devices while it is starting up or after it has entered the operating system (OS), for example because a driver is not installed or because of hardware power management, and this directly affects normal use of the server. The fault tolerance of a server product has therefore become an important index for measuring product performance.
However, when a hardware device becomes abnormal during start-up or operation of an existing server, the server only generates a prompt, and an administrator must repair the device manually, either by installing a driver or by restarting the server. Server operation and maintenance is inevitably interrupted, and in extreme cases the system is paralyzed. In other words, the fault tolerance of current server products cannot truly meet the high-performance application requirements of servers.
Disclosure of Invention
The invention aims to provide an abnormality detection and recovery method for server equipment. Based on the architecture type of the server, corresponding device abnormality checkpoints are preset for each start-up stage and for the system running stage, each checkpoint is monitored in real time by a baseboard management controller, and the corresponding abnormality recovery logic is executed automatically when a device abnormality is detected. This solves the problems that device abnormalities in existing servers must still be repaired manually, cannot be repaired automatically, and leave the server with low fault tolerance. Device abnormalities in the start-up stage and the system running stage can be actively detected and repaired automatically and in time, which ensures continuous and stable operation of the server product, effectively improves the fault tolerance of the product, and thereby meets the high-performance application requirements of the server.
In order to achieve the above object, it is necessary to provide an abnormality detection and recovery method, system, server and storage medium for a server device.
In a first aspect, an embodiment of the present invention provides an anomaly detection and recovery method for a server device, where the method includes the following steps:
Setting corresponding equipment abnormality detection and recovery logic according to the architecture type of a server, and loading the equipment abnormality detection and recovery logic to a baseboard management controller of the server;
Responding to the power-on starting of a server, acquiring a current starting stage identifier through the baseboard management controller, detecting hardware equipment abnormality according to a plurality of equipment abnormality checkpoints corresponding to the current starting stage identifier, and executing preset abnormality recovery logic when abnormal hardware equipment exists;
And responding to the completion of starting of the server, detecting hardware equipment abnormality through the baseboard management controller according to a plurality of equipment abnormality checkpoints corresponding to the operating system operation stage, and executing preset abnormality recovery logic when the hardware equipment abnormality exists.
Further, the device anomaly detection and recovery logic includes a plurality of device anomaly checkpoints at each start-up stage and operating system operation stage, inspection levels corresponding to each device anomaly checkpoint, and anomaly recovery logic corresponding to each inspection level.
Further, the step of setting the equipment abnormality detection and recovery logic includes:
Respectively obtaining information on the hardware devices to be loaded in each start-up stage and in the operating system running stage of the server;
According to the hardware device information, performing operation independence and device dependency analysis on each hardware device to obtain a plurality of device abnormality checkpoints for each start-up stage and for the operating system running stage;
Setting a corresponding inspection level for each device abnormality checkpoint according to the global importance of the hardware device corresponding to that checkpoint;
And presetting corresponding abnormality recovery logic according to the inspection level of each device abnormality checkpoint.
Further, the inspection levels include an A-level inspection level, a B-level inspection level, and a C-level inspection level;
The step of setting a corresponding inspection level for each device abnormality checkpoint according to the global importance of the hardware device corresponding to that checkpoint includes:
When the hardware device is a functional device necessary for system operation, setting the corresponding inspection level to the A-level inspection level;
When the hardware device is an auxiliary functional device for system operation, setting the corresponding inspection level to the B-level inspection level;
And when the hardware device is a functional device whose abnormality still allows the system to continue running, setting the corresponding inspection level to the C-level inspection level.
Further, the step of detecting hardware device abnormality according to the device abnormality check point includes:
Acquiring a status register value of the corresponding hardware device through the baseboard management controller according to the device abnormality checkpoint;
And judging whether the corresponding hardware device is an abnormal hardware device according to the enable bit value of the status register value.
Further, the step of obtaining, by the baseboard management controller, a status register value of the corresponding hardware device according to the device exception check point includes:
and sending a status register reading signal to the corresponding hardware equipment through the baseboard management controller, and analyzing the status register response signal when receiving the corresponding status register response signal to obtain the status register value.
Further, the anomaly recovery logic comprises a first anomaly recovery logic, a second anomaly recovery logic and a third anomaly recovery logic;
the step of executing the preset abnormality recovery logic when the hardware device abnormality exists comprises the following steps:
obtaining an inspection grade corresponding to the abnormal hardware equipment;
When the inspection level is the A-level inspection level, executing first abnormality recovery logic on the abnormal hardware device; the first abnormality recovery logic is used to directly perform a corresponding cold start operation on the abnormal hardware device, and to restart the server when the cold start operation fails to recover the device;
When the inspection level is the B-level inspection level, executing second abnormality recovery logic on the abnormal hardware device; the second abnormality recovery logic is used to perform a corresponding hot start operation on the abnormal hardware device, to perform a corresponding cold start operation when the hot start operation fails to recover the device, and to restart the server when the cold start operation fails to recover the device;
When the inspection level is the C-level inspection level, executing third abnormality recovery logic on the abnormal hardware device; the third abnormality recovery logic is used to perform a corresponding driver repair operation on the abnormal hardware device, to perform a corresponding hot start operation when the driver repair operation fails to recover the device, to perform a corresponding cold start operation when the hot start operation fails to recover the device, and to restart the server when the cold start operation fails to recover the device.
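The three recovery chains above differ only in where they start, so they can be sketched as ordered lists that are walked until one action succeeds. The following is a minimal illustration, not the patented implementation: the action names and the `try_action` hook are hypothetical and stand in for whatever the baseboard management controller actually invokes.

```python
from typing import Callable, Dict, List

# Escalation order per inspection level, as described in the text.
# The action names are hypothetical labels for illustration only.
ESCALATION: Dict[str, List[str]] = {
    "A": ["cold_start", "server_restart"],
    "B": ["hot_start", "cold_start", "server_restart"],
    "C": ["driver_repair", "hot_start", "cold_start", "server_restart"],
}


def recover(level: str, try_action: Callable[[str], bool]) -> List[str]:
    """Walk the escalation chain for `level` and stop at the first success.

    `try_action(name)` is a caller-supplied hook (an assumption of this
    sketch) that performs the action and returns True if the device is
    healthy afterwards. The list of attempted actions is returned.
    """
    attempted: List[str] = []
    for action in ESCALATION[level]:
        attempted.append(action)
        if try_action(action):
            break
    return attempted
```

For example, a C-level device that is fixed by a hot start would attempt `driver_repair` and then `hot_start`, and would never reach the cold start or server restart.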
In a second aspect, an embodiment of the present invention provides an anomaly detection and recovery system for a server device, including:
the preprocessing module is used for setting corresponding equipment abnormality detection and recovery logic according to the architecture type of the server and loading the equipment abnormality detection and recovery logic to a baseboard management controller of the server;
The starting exception handling module is used for responding to the power-on starting of the server, acquiring a current starting stage through the baseboard management controller, detecting hardware equipment exception according to a plurality of equipment exception check points corresponding to the current starting stage, and executing preset exception recovery logic when abnormal hardware equipment exists;
and the operation exception handling module is used for responding to the completion of the starting of the server, carrying out hardware equipment exception detection through the baseboard management controller according to a plurality of equipment exception check points corresponding to the operation stage of the operating system, and executing preset exception recovery logic when the hardware equipment exception exists.
In a third aspect, an embodiment of the present invention further provides a server, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the above method when executing the computer program.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method.
The application provides an abnormality detection and recovery method and system for server equipment, a server, and a storage medium. In the method, corresponding device abnormality detection and recovery logic is set according to the architecture type of the server and loaded onto the baseboard management controller of the server. When the server starts, the baseboard management controller acquires the current start-up stage identifier, performs hardware device abnormality detection according to a plurality of device abnormality checkpoints corresponding to that identifier, and executes preset abnormality recovery logic when an abnormal hardware device exists. After the server has started, the baseboard management controller performs hardware device abnormality detection according to a plurality of device abnormality checkpoints corresponding to the operating system running stage, and executes preset abnormality recovery logic when a hardware device is abnormal. Compared with the prior art, the method can actively detect device abnormalities in the start-up stage and the system running stage and repair them automatically and in time. This ensures continuous and stable operation of the server product, effectively improves the fault tolerance of the product, saves a large amount of manual repair cost, and meets the high-performance application requirements of the server.
Drawings
FIG. 1 is a flowchart of an anomaly detection and recovery method for a server device according to an embodiment of the present invention;
FIG. 2 is a flowchart of the first abnormality recovery logic executed for a device abnormality at the A-level inspection level in an embodiment of the present invention;
FIG. 3 is a flowchart of the second abnormality recovery logic executed for a device abnormality at the B-level inspection level in an embodiment of the present invention;
FIG. 4 is a flowchart of the third abnormality recovery logic executed for a device abnormality at the C-level inspection level in an embodiment of the present invention;
FIG. 5 is a schematic diagram of an abnormality detection and recovery system for a server device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram showing the internal structure of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantageous effects of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples, and it is apparent that the examples described below are part of the examples of the present application, which are provided for illustration only and are not intended to limit the scope of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The abnormality detection and recovery method for server equipment provided by the invention addresses the current situation in which device abnormalities occurring during start-up or operation of an existing server can only be resolved by human intervention, so that the server runs unstably, device abnormalities are not recovered in time, and the fault tolerance of the product is insufficient. Based on the architecture type of the server, corresponding device abnormality checkpoints are preset for each start-up stage and for the system running stage, and the checkpoints of each stage are monitored in real time by the baseboard management controller. When a device abnormality is detected, the corresponding abnormality recovery logic is automatically matched and executed according to the inspection level of the checkpoint concerned. With this mechanism, device abnormalities in the start-up stage and the system running stage can be actively detected, a recovery strategy suited to the type of abnormality can be matched automatically, and recovery can be performed in time, which ensures continuous and stable operation of the server product, effectively improves the fault tolerance of the product, and thereby meets the high-performance application requirements of the server.
It should be noted that existing servers fall into three architecture types: X86 (Complex Instruction Set Computing) servers, EPIC (Explicitly Parallel Instruction Computing) servers, and RISC (Reduced Instruction Set Computing) servers, which include Power servers. The method of the present invention is applicable to servers of all of these architecture types. Applications on servers of different architectures differ only in how the start-up stages are divided and in the hardware devices loaded in each start-up stage and involved in the system running stage, together with the dependencies among those devices; these differences affect the selection and setting of the specific checkpoints and of the corresponding abnormality recovery logic. To facilitate understanding of the method of the present invention, the following embodiments take a Power architecture server as an example and describe the abnormality detection and recovery method for server equipment in detail.
In one embodiment, as shown in fig. 1, there is provided an anomaly detection and recovery method of a server device, including the steps of:
S11, setting corresponding device abnormality detection and recovery logic according to the architecture type of the server, and loading the device abnormality detection and recovery logic onto the baseboard management controller of the server. The architecture types of the server are the three types described above. Servers of different architecture types have different start-up stages and load different hardware devices for operation, so the hardware abnormalities to be detected also differ; overall, however, they can be abstracted as device abnormalities in a system start-up inspection stage and in an operating system running inspection stage. To ensure that abnormality detection is comprehensive and effective and that abnormality recovery is timely, this embodiment preferably subdivides the checkpoints of the system start-up inspection stage according to the major start-up stages of the actual server start-up process, and presets different recovery strategies according to the actual impact of an abnormality at each checkpoint, so as to perform effective real-time abnormality detection and recovery during server start-up and system operation. Specifically, the device abnormality detection and recovery logic includes a plurality of device abnormality checkpoints for each start-up stage and for the operating system running stage, an inspection level corresponding to each device abnormality checkpoint, and abnormality recovery logic corresponding to each inspection level;
In practical applications, the device abnormality checkpoints, the inspection levels, and the corresponding abnormality recovery logic for each start-up stage and for the operating system running stage can in principle be set as required. However, to balance checkpoint coverage against system performance overhead, and timeliness against rationality of abnormality recovery, this embodiment preferably selects a sufficient set of checkpoints and sets the recovery strategies by analyzing the operation independence, device dependency, and global importance of the hardware devices involved. Specifically, the step of setting the device abnormality detection and recovery logic includes:
Respectively obtaining information on the hardware devices to be loaded in each start-up stage and in the operating system running stage of the server; the hardware device information can be understood as the sets of hardware devices, obtained by analysis, that need to be loaded and used in each start-up stage and in the operating system running stage. For a Power server, for example, the information can be divided into the set of hardware devices in the small core start-up stage, the set in the big core HostBoot stage, the set in the SkiBoot stage, the set in the Linux kernel start-up stage, and the set in the operating system (OS) running stage after start-up is complete. It should be noted that hardware device information for servers of other architecture types can be obtained in a similar way, which is not described in detail here;
According to the hardware device information, performing operation independence and device dependency analysis on each hardware device to obtain a plurality of device abnormality checkpoints for each start-up stage and for the operating system running stage; the operation independence and device dependency analysis can be understood as analyzing whether a hardware device operates completely independently within the whole system or whether its operation depends on one or more other hardware devices. For example, the GPIO device in the Power server is the interface a chip uses to send control signals; its function is independent, so it can be assigned an independent checkpoint. As another example, the CMN device is the high-speed communication network of the whole system; if the CMN device has a problem, communication across the whole system is affected, so it can also serve as an independent checkpoint. Checkpoints for the other stages can be determined similarly and are not listed here;
Setting a corresponding inspection level for each device abnormality checkpoint according to the global importance of the hardware device corresponding to that checkpoint; the inspection level can be understood as an evaluation and classification of the importance of each checkpoint. To make the abnormality recovery strategy more effective and more conducive to stable system operation, while keeping the abnormality recovery logic simple and efficient, this embodiment preferably classifies the checkpoints into three inspection levels (A-level, B-level, and C-level) according to the global importance of the corresponding hardware device in system operation, and sets a uniform abnormality recovery strategy for all checkpoints of the same inspection level. Specifically, the step of setting a corresponding inspection level for each device abnormality checkpoint according to the global importance of the corresponding hardware device includes:
When the hardware device is a functional device necessary for system operation, setting the corresponding inspection level to the A-level inspection level; a functional device necessary for system operation can be understood as a hardware device without which the system cannot run normally. For example, the CMN device in the Power server is the high-speed communication backbone of the whole system and directly affects communication across the system, so it has global importance and can be assigned the A-level inspection level. The GPIO device, which issues basic commands, is the interface a chip uses to send control signals and is part of the basic communication infrastructure of the whole system, so it also has global importance and can be assigned the A-level inspection level;
When the hardware device is an auxiliary functional device for system operation, setting the corresponding inspection level to the B-level inspection level; an auxiliary functional device for system operation can be understood as a device that provides auxiliary services for the running system, including storage service devices, display service devices, and log service devices. For example, the UART in the Power server provides log output; a UART problem does not stop the system from running, and system log information can still be viewed with other debugging tools, so the UART checkpoint can be assigned the B-level inspection level. The Flash device is a storage device; if it becomes abnormal, data storage can still be provided by a device such as an external USB flash drive, so the Flash checkpoint can be assigned the B-level inspection level;
When the hardware device is a functional device whose abnormality still allows the system to continue running, setting the corresponding inspection level to the C-level inspection level; such a device can be understood as one whose abnormality neither stops the system nor affects normal functional use. For example, the low-speed communication devices I2C (Inter-Integrated Circuit) and SPI (Serial Peripheral Interface) in the Power server are generally used for auxiliary communication, and the system can still continue to run when they have a problem, so the checkpoints of hardware devices that provide low-speed communication services can be assigned the C-level inspection level;
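The classification rule described above amounts to a lookup from a device's role in the system to its inspection level. A minimal sketch follows; the `DeviceRole` names are hypothetical labels introduced for illustration, and the example devices are the Power server devices named in the text.

```python
from enum import Enum


class DeviceRole(Enum):
    """Hypothetical labels for the three kinds of device described above."""
    ESSENTIAL = "essential"        # required for the system to run
    AUXILIARY = "auxiliary"        # storage, display, or log services
    NON_BLOCKING = "non_blocking"  # system keeps running if it fails


LEVEL_BY_ROLE = {
    DeviceRole.ESSENTIAL: "A",
    DeviceRole.AUXILIARY: "B",
    DeviceRole.NON_BLOCKING: "C",
}

# Roles taken from the Power server examples given in the text.
EXAMPLE_ROLES = {
    "CMN": DeviceRole.ESSENTIAL,
    "GPIO": DeviceRole.ESSENTIAL,
    "UART": DeviceRole.AUXILIARY,
    "Flash": DeviceRole.AUXILIARY,
    "I2C": DeviceRole.NON_BLOCKING,
    "SPI": DeviceRole.NON_BLOCKING,
}


def inspection_level(device: str) -> str:
    """Return the inspection level for one of the example devices."""
    return LEVEL_BY_ROLE[EXAMPLE_ROLES[device]]
```

Keeping the role and the level as separate mappings means a device can be reclassified for another architecture without touching the level rule itself.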
Based on the above method steps, the device abnormality checkpoints and corresponding inspection levels for each start-up stage and for the operating system running stage can be compiled as shown in Tables 1-2:
TABLE 1 Device abnormality checkpoints and corresponding inspection levels set for the small core start-up stage
No. | Checkpoint name | Inspection level |
1 | Flash checkpoint | B |
2 | CMN700 checkpoint | A |
3 | GPIO checkpoint | A |
4 | I2C checkpoint | C |
5 | UART checkpoint | B |
6 | SPI checkpoint | C |
TABLE 2 Device abnormality checkpoints and corresponding inspection levels for the big core HostBoot stage, SkiBoot stage, Linux stage, and OS running stage
No. | Checkpoint name | Inspection level |
1 | Basic register checkpoint | A |
2 | Shared register checkpoint | B |
3 | GPIO checkpoint | A |
4 | Flash checkpoint | C |
5 | MBox checkpoint | C |
6 | UART checkpoint | C |
7 | CMN checkpoint | A |
8 | DDR checkpoint | A |
9 | Watchdog checkpoint | C |
10 | ESPI checkpoint | C |
11 | SSD checkpoint | C |
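Tables 1-2 can be held as a per-stage lookup that preserves the inspection order. The sketch below is one possible encoding under stated assumptions: the stage keys (`small_core`, `hostboot`, and so on) are hypothetical identifiers, and the four later stages share one list because Table 2 covers them together.

```python
from typing import Dict, List, Tuple

Checkpoint = Tuple[str, str]  # (checkpoint name, inspection level)

# Table 1: small core start-up stage.
SMALL_CORE: List[Checkpoint] = [
    ("Flash", "B"), ("CMN700", "A"), ("GPIO", "A"),
    ("I2C", "C"), ("UART", "B"), ("SPI", "C"),
]

# Table 2: HostBoot, SkiBoot, Linux, and OS running stages.
LATER_STAGES: List[Checkpoint] = [
    ("Basic register", "A"), ("Shared register", "B"), ("GPIO", "A"),
    ("Flash", "C"), ("MBox", "C"), ("UART", "C"), ("CMN", "A"),
    ("DDR", "A"), ("Watchdog", "C"), ("ESPI", "C"), ("SSD", "C"),
]

# Stage keys are hypothetical names chosen for this sketch.
CHECKPOINTS: Dict[str, List[Checkpoint]] = {"small_core": SMALL_CORE}
for stage in ("hostboot", "skiboot", "linux_kernel", "os_running"):
    CHECKPOINTS[stage] = LATER_STAGES


def checkpoints_for(stage: str) -> List[Checkpoint]:
    """Return a copy of the ordered checkpoint list for a stage."""
    return list(CHECKPOINTS[stage])
```

Note that the same device can carry different levels in different stages: Flash and UART are B-level in Table 1 but C-level in Table 2.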
It should be noted that, the device exception checkpoints and corresponding check levels given in the foregoing tables 1-2 for each start-up phase and system operation phase of the server are only exemplary descriptions for the Power architecture type server, and similar tables may be obtained for other architecture type servers with reference to the foregoing method steps, which are not described in detail herein;
Presetting corresponding abnormality recovery logic according to the inspection level of each device abnormality checkpoint; the abnormality recovery logic can be understood as the abnormality recovery strategy corresponding to each inspection level, such as a cold start, a hot start, or reinstallation of a driver.
Through the above method steps, device abnormality detection and recovery logic can be obtained for a server of the architecture type to be optimized. The logic can then be loaded onto the baseboard management controller (BMC, Baseboard Management Controller) of the server, so that the baseboard management controller actively detects device abnormalities and automatically performs abnormality recovery according to the start-up stage and running condition of the server.
S12, in response to power-on start-up of the server, acquiring a current start-up stage identifier through the baseboard management controller, performing hardware device abnormality detection according to a plurality of device abnormality checkpoints corresponding to the current start-up stage identifier, and executing preset abnormality recovery logic when an abnormal hardware device exists. The current start-up stage identifier can be understood as the stage that the server start-up process has currently reached, and it can be obtained through IPMI (Intelligent Platform Management Interface). According to the acquired identifier (small core stage, HostBoot stage, SkiBoot stage, or Linux kernel stage), the baseboard management controller matches the loaded device abnormality detection and recovery logic to obtain the device abnormality checkpoints to be inspected. After the status of the hardware device corresponding to each checkpoint has been inspected in the preset inspection order, the baseboard management controller matches each detected abnormal hardware device to its preset inspection level and executes the corresponding abnormality recovery logic to return the abnormal device to normal;
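The control flow of step S12 can be sketched as two passes: inspect every checkpoint of the current stage in order, then run recovery for each abnormal device according to its level. This is an illustrative sketch only; `is_healthy` and `recover` are hypothetical caller-supplied hooks, and obtaining the stage identifier over IPMI is outside its scope.

```python
from typing import Callable, Dict, List, Tuple

Checkpoint = Tuple[str, str]  # (device name, inspection level)


def handle_stage(
    stage_id: str,
    plan: Dict[str, List[Checkpoint]],
    is_healthy: Callable[[str], bool],
    recover: Callable[[str, str], None],
) -> List[Checkpoint]:
    """Inspect all checkpoints of `stage_id`, then recover abnormal devices.

    `plan` maps a stage identifier to its ordered checkpoints. `is_healthy`
    and `recover` are hypothetical hooks standing in for the status check
    and the level-specific recovery logic.
    """
    # Pass 1: inspect every checkpoint in the preset order.
    abnormal = [cp for cp in plan[stage_id] if not is_healthy(cp[0])]
    # Pass 2: execute recovery matched to each abnormal device's level.
    for name, level in abnormal:
        recover(name, level)
    return abnormal
```

Separating inspection from recovery follows the order given in the text, where recovery is matched only after the status of every checkpoint has been inspected.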
In principle, the method used to detect hardware device abnormality at a device abnormality checkpoint can be chosen as required; however, considering that during server start-up the state is written into a global state array after each sub-step (istep) is completed, and in order to ensure efficient and accurate hardware device state detection, this embodiment preferably determines the hardware state of a device by checking the state register value (the corresponding position in the global state array) of the hardware device corresponding to each device abnormality checkpoint; specifically, the step of detecting hardware device abnormality according to the device abnormality checkpoint includes:
Acquiring the state register value of the corresponding hardware device through the baseboard management controller according to the device abnormality checkpoint; the state register of a hardware device has a system-unique address, and the state register can be operated by reading and writing that unique address; the meaning of each bit in the state register is fixed at design time and written into the device specification (for example, the state register of an I2C device is a 32-bit register whose bit 0 is the enable bit, indicating whether the device is enabled: 0 represents inactive and 1 represents active);
In practical application, in order to ensure that the state register value is acquired accurately, this embodiment preferably reads the state register by means of atomic-operation signal communication; specifically, the step of acquiring, through the baseboard management controller, the state register value of the corresponding hardware device according to the device abnormality checkpoint includes:
Sending a state register read signal to the corresponding hardware device through the baseboard management controller and, when the corresponding state register response signal is received, parsing the response signal to obtain the state register value; the state register read signal can be understood as a signal carrying the register information of a specific device; after receiving this signal, the corresponding hardware device replies with a response signal; when the baseboard management controller receives the state register response signal, it parses the signal to obtain the corresponding state register value, then locates the relevant state bit and reads the corresponding state information;
Judging whether the corresponding hardware device is an abnormal hardware device according to the start bit value of the state register value; as described above, the start bit value takes one of two values, 0 or 1, and the state bit is expected to be 1 once the device has finished loading; therefore, when the start bit value of the actually acquired state register value is 0, the state is abnormal and the device is marked as an abnormal hardware device; otherwise, when the start bit value is 1, the state is normal and the device can be used; the BMC monitoring service can obtain the state register of each hardware device by traversal; the register records the device state, and if a device is not working normally, the values of the bits corresponding to its start state are abnormal;
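The two detection steps above (read the state register through a request and response exchange, then judge the start bit) can be sketched as follows. The frame layout, the opcodes `READ_REQ` and `READ_RESP`, the `transport` callable and the use of a lock to keep each exchange indivisible are assumptions made purely for illustration; the text does not define the signal format or how atomicity is achieved.

```python
import struct
import threading

READ_REQ = 0x01    # hypothetical opcode for "read status register"
READ_RESP = 0x81   # hypothetical opcode for the device's reply
ENABLE_BIT = 0     # bit 0, as in the I2C example in the text
_link_lock = threading.Lock()


def read_status_register(transport, reg_addr):
    """Send a read signal for reg_addr and parse the response signal.

    transport: callable(bytes) -> bytes, a hypothetical link to the device.
    """
    request = struct.pack(">BI", READ_REQ, reg_addr)
    with _link_lock:  # keep request and response together as one exchange
        frame = transport(request)
    opcode, addr, value = struct.unpack(">BII", frame)
    if opcode != READ_RESP or addr != reg_addr:
        raise ValueError("unexpected status register response")
    return value


def is_abnormal(status_value, bit=ENABLE_BIT):
    """A device is abnormal when its start (enable) bit reads 0."""
    return ((status_value >> bit) & 1) == 0
```

Checking that the response echoes the requested register address is one simple way to avoid attributing a value to the wrong device; a real controller would follow whatever framing its hardware interface defines.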
After a hardware device abnormality is detected from the state register value of the hardware device corresponding to a checkpoint, the corresponding exception recovery logic can be selected according to the preset inspection level, so that device abnormalities of every inspection level can be recovered promptly and efficiently with minimal impact on the operation of the whole system; specifically, the exception recovery logic includes a first exception recovery logic, a second exception recovery logic, and a third exception recovery logic; correspondingly, the step of executing the preset exception recovery logic when an abnormal hardware device exists includes:
Obtaining the inspection level corresponding to the abnormal hardware device;
When the inspection level is the level A inspection level, executing the first exception recovery logic on the abnormal hardware device; as shown in fig. 2, the first exception recovery logic is used for directly executing the corresponding cold start operation on the abnormal hardware device, and restarting the server when the device cannot be recovered by the cold start operation; the cold start operation can be understood as powering the device down and then powering it up again, so that it is initialized and run according to its power-up sequence; this start is slower and consumes a large amount of system time and resources; because the device is powered up again, the operation can in principle resolve every problem that is not a device quality problem (physical damage); that is, if the cold start operation does not resolve the problem, the whole system is completely powered down and all devices are powered up again, and if the problem still persists, replacement of the abnormal hardware device should be considered;
When the inspection level is the level B inspection level, executing the second exception recovery logic on the abnormal hardware device; as shown in fig. 3, the second exception recovery logic is used for executing the corresponding hot start operation on the abnormal hardware device, executing the corresponding cold start operation when the device cannot be recovered by the hot start operation, and restarting the server when the device cannot be recovered by the cold start operation; the hot start operation can be understood as a quick start, that is, without powering the device down, the reset pin of the hardware device is driven through the GPIO module in the chip to reset the device; because the memory is not powered down, the register data of each hardware device stored in the memory is not lost and can be used directly after the reset; compared with the cold start operation, the hot start operation consumes a moderate amount of system time and resources, starts faster, and can repair most abnormalities;
When the inspection level is the level C inspection level, executing the third exception recovery logic on the abnormal hardware device; as shown in fig. 4, the third exception recovery logic is used for executing the corresponding driver repair operation on the abnormal hardware device, executing the corresponding hot start operation when the device cannot be recovered by the driver repair operation, executing the corresponding cold start operation when the device cannot be recovered by the hot start operation, and restarting the server when the device cannot be recovered by the cold start operation; the driver repair operation can be understood as uninstalling the software driver and reinstalling it to perform a software repair, that is, the driver is reinstalled by calling the driver interface that the device provides in the system; this operation is the lightest and consumes the least system time and resources;
By setting the above exception recovery logic, device abnormalities that may occur in each starting stage of the system can be resolved automatically, so that an abnormality of any device in a starting stage is recovered automatically according to its inspection level without manual repair, which improves the starting efficiency and running effect of the server product.
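The three exception recovery logics described above form escalation chains of increasing cost, which can be sketched as follows. The step names, the `actions` hooks and the `is_healthy` probe are hypothetical placeholders; the sketch only illustrates the order in which the text says the operations are tried for each inspection level.

```python
# Ordered recovery steps per inspection level, lightest step first.
ESCALATION = {
    "A": ["cold_start", "server_restart"],
    "B": ["hot_start", "cold_start", "server_restart"],
    "C": ["driver_repair", "hot_start", "cold_start", "server_restart"],
}


def run_recovery(device, level, actions, is_healthy):
    """Try each step for the level until the device reports healthy.

    actions:    dict step name -> callable(device), hypothetical hooks
    is_healthy: callable(device) -> bool, a hypothetical state probe
    Returns the list of steps tried and whether the device recovered.
    """
    tried = []
    for step in ESCALATION[level]:
        actions[step](device)
        tried.append(step)
        if is_healthy(device):
            return tried, True
    # Nothing helped; the text suggests considering device replacement.
    return tried, False
```

Expressing each logic as an ordered list makes it visible that level C is the level B chain with a driver repair placed in front, and level B is the level A chain with a hot start placed in front.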
S13, in response to completion of starting of the server, performing hardware device abnormality detection through the baseboard management controller according to the plurality of device abnormality checkpoints corresponding to the operating system operation stage, and executing the preset exception recovery logic when a hardware device abnormality exists;
The operating system operation stage can be understood as the operation and maintenance period of the server after start-up has completed and the operating system (OS) is running; device abnormality detection at this stage is still performed by the baseboard management controller, and the device abnormality checkpoints to be checked and their corresponding inspection levels are shown in table 2; for the implementation of hardware device abnormality detection and of exception recovery at each device abnormality checkpoint of the operating system operation stage, reference may be made to the description of the server starting stage in step S12 above, which is not repeated here; it should be noted that adding active detection and automatic repair management in the running stage of the server ensures higher system running stability and effectively improves the fault tolerance of the server product.
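For the running stage, one possible arrangement is for the controller to traverse the operating system stage checkpoints periodically. The text does not say how often detection runs, so the polling interval, the `rounds` limit and every function name below are assumptions made only for this sketch.

```python
import time


def monitor(checkpoints, read_status, recover,
            interval_s=5.0, rounds=None, sleep=time.sleep):
    """Poll OS-stage checkpoints and recover devices whose bit 0 reads 0.

    checkpoints: list of (device, level) pairs
    read_status: callable(device) -> int, hypothetical register read
    recover:     callable(device, level), hypothetical recovery hook
    rounds:      number of polling passes; None means run indefinitely
    """
    done = 0
    handled = []
    while rounds is None or done < rounds:
        for device, level in checkpoints:
            if (read_status(device) & 1) == 0:
                recover(device, level)
                handled.append(device)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)  # wait before the next traversal
    return handled
```

Passing `sleep` and `rounds` as parameters keeps the loop easy to exercise in a test without waiting in real time.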
According to the embodiment of the application, corresponding device abnormality detection and recovery logic is set according to the architecture type of the server and loaded to the baseboard management controller of the server; when the server is started, the current starting stage identifier is acquired through the baseboard management controller, hardware device abnormality detection is performed according to the plurality of device abnormality checkpoints corresponding to the current starting stage identifier, and the preset exception recovery logic is executed when an abnormal hardware device exists; after the server has finished starting, hardware device abnormality detection is performed through the baseboard management controller according to the plurality of device abnormality checkpoints corresponding to the operating system operation stage, and the preset exception recovery logic is executed when a hardware device abnormality exists. With this technical scheme, the checkpoints are set with full regard to the principles of balancing system performance overhead against abnormality recovery and of timely and reasonable abnormality recovery, and hardware device abnormalities in every stage are actively detected and automatically recovered in a targeted and timely manner by the baseboard management controller, so that continuous and stable operation of the server product can be ensured, its fault tolerance is effectively improved, labor costs are saved, and the server has high application value.
Although the steps in the flowcharts described above are shown in order as indicated by arrows, these steps are not necessarily executed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders.
In one embodiment, as shown in fig. 5, there is provided an anomaly detection and recovery system of a server apparatus, the system including:
The preprocessing module 1 is used for setting corresponding equipment abnormality detection and recovery logic according to the architecture type of a server, and loading the equipment abnormality detection and recovery logic to a baseboard management controller of the server;
The starting exception handling module 2 is used for responding to the power-on starting of the server, acquiring a current starting stage through the baseboard management controller, detecting hardware equipment exception according to a plurality of equipment exception check points corresponding to the current starting stage, and executing preset exception recovery logic when abnormal hardware equipment exists;
The operation exception handling module 3 is used for responding to the completion of the starting of the server, detecting the exception of the hardware equipment through the baseboard management controller according to a plurality of equipment exception check points corresponding to the operation stage of the operating system, and executing preset exception recovery logic when the exception of the hardware equipment exists.
For the specific limitations of the abnormality detection and recovery system of a server device, reference may be made to the limitations of the abnormality detection and recovery method of a server device above; the corresponding technical effects are likewise obtained and are not repeated here. The modules in the abnormality detection and recovery system of a server device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
Fig. 6 shows an internal structural diagram of a computer device in one embodiment, which may specifically be a terminal or a server. As shown in fig. 6, the computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the abnormality detection and recovery method of a server device. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen; the input device of the computer device may be a touch layer covering the display screen, a key, trackball or touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse or the like.
It will be appreciated by those of ordinary skill in the art that the architecture shown in fig. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution may be applied; a particular computer device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.
In one embodiment, a server is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, implements the steps of the above method.
In summary, the abnormality detection and recovery method and system for a server device, the server and the medium provided by the embodiments of the present invention implement the following technical scheme: corresponding device abnormality detection and recovery logic is set according to the architecture type of the server and loaded to the baseboard management controller of the server; when the server is started, the current starting stage identifier is acquired through the baseboard management controller, hardware device abnormality detection is performed according to the plurality of device abnormality checkpoints corresponding to the current starting stage identifier, and the preset abnormality recovery logic is executed when an abnormal hardware device exists; after the server has finished starting, hardware device abnormality detection is performed according to the plurality of device abnormality checkpoints corresponding to the operating system operation stage, and the preset abnormality recovery logic is executed when a hardware device abnormality exists.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the corresponding parts of the description of the method embodiments. It should be noted that the technical features of the foregoing embodiments may be combined arbitrarily; for brevity, not all possible combinations are described, but as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The foregoing examples represent only a few preferred embodiments of the present application; they are described in relative detail, but are not to be construed as limiting the scope of the application. It should be noted that modifications and substitutions can be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and substitutions should also be considered to be within the scope of the present application. Therefore, the protection scope of this patent is subject to the protection scope of the claims.
Claims (9)
1. An anomaly detection and recovery method for a server device, the method comprising the steps of:
Setting corresponding equipment abnormality detection and recovery logic according to the architecture type of a server, and loading the equipment abnormality detection and recovery logic to a baseboard management controller of the server; the equipment abnormality detection and recovery logic is obtained by carrying out independence, operation dependence and global importance analysis setting on hardware equipment to be loaded in each starting stage and operating system operation stage of the server; the equipment abnormality detection and recovery logic comprises a plurality of equipment abnormality checkpoints in each starting stage and operating system operation stage, inspection levels corresponding to the equipment abnormality checkpoints and abnormality recovery logic corresponding to the inspection levels;
Responding to the power-on starting of a server, acquiring a current starting stage identifier through the baseboard management controller, detecting hardware equipment abnormality according to a plurality of equipment abnormality checkpoints corresponding to the current starting stage identifier, and executing preset abnormality recovery logic when abnormal hardware equipment exists;
And responding to the completion of starting of the server, detecting hardware equipment abnormality through the baseboard management controller according to a plurality of equipment abnormality checkpoints corresponding to the operating system operation stage, and executing preset abnormality recovery logic when the hardware equipment abnormality exists.
2. The anomaly detection and recovery method for a server device of claim 1, wherein the step of setting the anomaly detection and recovery logic for the device comprises:
Respectively obtaining the hardware equipment information to be loaded in each starting stage and the operating system operation stage of the server;
According to the hardware equipment information, carrying out operation independence and equipment dependence analysis on each hardware equipment to obtain a plurality of equipment abnormal check points of each starting stage and operating system operation stage;
setting corresponding check grades for the abnormal check points of each device according to the global importance of the hardware devices corresponding to the abnormal check points of each device;
And presetting corresponding abnormal recovery logic according to the inspection level of each equipment abnormal check point.
3. The abnormality detection and recovery method of a server apparatus according to claim 1, wherein the inspection levels include a level a inspection level, a level B inspection level, and a level C inspection level;
the step of setting the corresponding check level for each device exception check point according to the global importance of the hardware device corresponding to each device exception check point comprises the following steps:
When the hardware equipment is necessary functional equipment for system operation, setting the corresponding inspection level as an A-level inspection level;
When the hardware equipment is auxiliary function equipment for system operation, setting the corresponding inspection level as a B-level inspection level;
And when the hardware equipment is functional equipment whose abnormality does not prevent the system from continuing to operate, setting the corresponding inspection level as a C-level inspection level.
4. The method for detecting and recovering an abnormality of a server device according to claim 1, wherein the step of detecting an abnormality of a hardware device according to a device abnormality checkpoint comprises:
Acquiring a state register value of the corresponding hardware device through the baseboard management controller according to the device abnormal check point;
and judging whether the corresponding hardware device is abnormal hardware device or not according to the starting bit value of the state register value.
5. The anomaly detection and recovery method of a server device of claim 4, wherein the step of obtaining, by the baseboard management controller, a status register value of a corresponding hardware device according to the device anomaly checkpoint comprises:
and sending a status register reading signal to the corresponding hardware equipment through the baseboard management controller, and analyzing the status register response signal when receiving the corresponding status register response signal to obtain the status register value.
6. The anomaly detection and recovery method of a server device of claim 3, wherein the anomaly recovery logic comprises first, second, and third anomaly recovery logic;
the step of executing the preset abnormality recovery logic when the hardware device abnormality exists comprises the following steps:
obtaining an inspection grade corresponding to the abnormal hardware equipment;
When the inspection level is a level A inspection level, executing first exception recovery logic on the exception hardware device; the first exception recovery logic is used for directly executing corresponding cold start operation on the exception hardware equipment, and restarting the server when the cold start operation cannot be recovered;
When the inspection level is a B-level inspection level, executing second exception recovery logic on the exception hardware device; the second abnormal recovery logic is used for executing corresponding hot start operation on the abnormal hardware equipment, executing corresponding cold start operation when the hot start operation cannot be recovered, and restarting the server when the cold start operation cannot be recovered;
When the inspection level is a C-level inspection level, executing a third exception recovery logic on the exception hardware device; the third exception recovery logic is configured to execute a corresponding driver repair operation on the exception hardware device, execute a corresponding warm boot operation when the driver repair operation is executed and cannot be recovered, execute a corresponding cold boot operation when the warm boot operation is executed and restart the server when the cold boot operation is executed and cannot be recovered.
7. An anomaly detection and recovery system for a server device, the system comprising:
The preprocessing module is used for setting corresponding equipment abnormality detection and recovery logic according to the architecture type of the server and loading the equipment abnormality detection and recovery logic to a baseboard management controller of the server; the equipment abnormality detection and recovery logic is obtained by carrying out independence, operation dependence and global importance analysis setting on hardware equipment to be loaded in each starting stage and operating system operation stage of the server; the equipment abnormality detection and recovery logic comprises a plurality of equipment abnormality checkpoints in each starting stage and operating system operation stage, inspection levels corresponding to the equipment abnormality checkpoints and abnormality recovery logic corresponding to the inspection levels;
The starting exception handling module is used for responding to the power-on starting of the server, acquiring a current starting stage through the baseboard management controller, detecting hardware equipment exception according to a plurality of equipment exception check points corresponding to the current starting stage, and executing preset exception recovery logic when abnormal hardware equipment exists;
and the operation exception handling module is used for responding to the completion of the starting of the server, carrying out hardware equipment exception detection through the baseboard management controller according to a plurality of equipment exception check points corresponding to the operation stage of the operating system, and executing preset exception recovery logic when the hardware equipment exception exists.
8. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311352889.5A CN117389781B (en) | 2023-10-18 | 2023-10-18 | Abnormality detection and recovery method and system for server equipment, server and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117389781A (en) | 2024-01-12 |
CN117389781B (en) | 2024-06-04 |
Family
ID=89464382
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311352889.5A Active CN117389781B (en) | 2023-10-18 | 2023-10-18 | Abnormality detection and recovery method and system for server equipment, server and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117389781B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118642881B (en) * | 2024-08-13 | 2024-11-08 | 苏州元脑智能科技有限公司 | Server abnormality processing method, program product, device and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010211819A (en) * | 2010-04-26 | 2010-09-24 | Hitachi Ltd | Failure recovery method |
CN104424084A (en) * | 2013-08-27 | 2015-03-18 | 鸿富锦精密电子(天津)有限公司 | System error information detection system and method for server |
CN106557392A (en) * | 2015-09-29 | 2017-04-05 | 鸿富锦精密工业(深圳)有限公司 | Server failure detection means and method |
CN111708652A (en) * | 2020-05-20 | 2020-09-25 | 新华三技术有限公司 | Fault repairing method and device |
CN112131043A (en) * | 2020-08-27 | 2020-12-25 | 苏州浪潮智能科技有限公司 | A kind of abnormal detection and recovery method and device of basic input output system |
CN113849230A (en) * | 2021-08-30 | 2021-12-28 | 浪潮电子信息产业股份有限公司 | Server starting method and device, electronic equipment and readable storage medium |
CN115827330A (en) * | 2023-02-13 | 2023-03-21 | 西安超越申泰信息科技有限公司 | Method for realizing self-repair of server based on BMC (baseboard management controller) |
CN116010141A (en) * | 2022-12-15 | 2023-04-25 | 浪潮(山东)计算机科技有限公司 | Method, device and medium for positioning starting abnormality of multipath server |
CN116775145A (en) * | 2023-05-04 | 2023-09-19 | 合芯科技(苏州)有限公司 | Method, device, equipment and storage medium for starting and recovering server |
CN116775141A (en) * | 2023-07-07 | 2023-09-19 | 联想(北京)有限公司 | Abnormality detection method, abnormality detection device, computer device, and storage medium |
Non-Patent Citations (1)
Title |
---|
PC服务器故障预测分析及维护处理;来风刚;李济伟;董耀众;宋瑞华;李伟良;;电子技术与软件工程(第01期);115-116 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||