
CN119201535B - Server fault information control method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN119201535B
Authority
CN
China
Prior art keywords
target
description information
candidate
server
detection result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411706757.2A
Other languages
Chinese (zh)
Other versions
CN119201535A (en
Inventor
徐胜军
曾裕文
程超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202411706757.2A
Publication of CN119201535A
Application granted
Publication of CN119201535B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766: Error or fault reporting or storing
    • G06F11/0781: Error filtering or prioritizing based on a policy defined by the user or on a policy defined by a hardware/software module, e.g. according to a severity level
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751: Error or fault detection not based on redundancy

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An embodiment of the present application provides a method and device for controlling server fault information, a storage medium, and an electronic device. The method comprises: querying the candidate number of candidate description information that a register in a server has accumulated as of a current time, and extracting the initial time at which the first description information among the candidate description information was stored in the register; detecting whether a first matching condition is satisfied between the candidate number and a target number threshold to obtain a first detection result, and detecting whether a second matching condition is satisfied between a duration threshold and a target duration between the current time and the initial time to obtain a second detection result; based on the first detection result and the second detection result, screening target description information to be deleted from the candidate description information, and determining whether to transmit updated description information stored in the register to a controller in the server; and deleting the target description information and, if it is determined that the updated description information is to be transmitted, transmitting the updated description information to the controller.

Description

Control method and device for server fault information, storage medium and electronic equipment
Technical Field
Embodiments of the present application relate to the field of computers, and in particular to a control method and device for server fault information, a storage medium, and electronic equipment.
Background
A server is provided with bus ports, which may be used for, but are not limited to, connecting bus devices. Bus ports and bus devices may fail during operation of the server, for example with a CE.
In the related art, PCIe CEs are handled as follows. On Intel CPUs, an AER (Advanced Error Reporting) register is set to identify a threshold for the number of CEs, and error information is written into the AER register once the number of PCIe CEs accumulates to that threshold. On AMD CPUs, only a threshold for reporting CEs to the BMC and the OS is set. This threshold applies jointly to all devices on the server and is often large, so errors cannot be reported in time.
It can be appreciated that, in the related art, this way of handling PCIe CEs may let CEs accumulate indefinitely and possibly evolve into UCEs (Uncorrectable Errors). It is also impossible to distinguish which error information needs to be reported, and reporting cannot be performed in time, so the running stability of the server is low.
Disclosure of Invention
Embodiments of the application provide a control method and device for server fault information, a storage medium, and electronic equipment, which at least solve the problem of low running stability of a server in the related art.
According to one embodiment of the application, a control method for server fault information is provided. The method is applied to target firmware in a server and comprises: querying the candidate number of candidate description information accumulated by a register in the server at the current time, and extracting the initial time at which the first description information among the candidate description information was stored into the register, wherein the initial time is earlier than the current time, and the register is used for storing description information of faults occurring on a bus port on the server and description information of faults occurring on a bus device connected to the bus port; detecting whether a first matching condition is met between the candidate number and a target number threshold to obtain a first detection result, and detecting whether a second matching condition is met between a duration threshold and a target duration between the current time and the initial time to obtain a second detection result, wherein the target number threshold is determined by the target firmware according to the degree to which faults of the bus port and the bus device affect the running performance of the server; screening target description information to be deleted from the candidate description information according to the first detection result and the second detection result, and determining whether to transmit update description information stored in the register to a controller in the server, wherein the update description information is description information stored in the register after the target description information, and the candidate description information comprises the update description information; and deleting the target description information and, in the case that it is determined to transmit the update description information to the controller, transmitting the update description information to the controller.
In an exemplary embodiment, the server includes an adjustment device, where the adjustment device is connected to the target firmware, and before the detecting whether the candidate number and the target number threshold meet the first matching condition and obtaining the first detection result, the method further includes receiving a target adjustment request, where the target adjustment request is used to request to adjust a number threshold corresponding to the number of description information stored in the register to the target number threshold, and sending the target adjustment request to the adjustment device, where the adjustment device is used to execute the target adjustment request.
In one exemplary embodiment, the server includes a target interface, and receiving the target adjustment request includes: detecting a first editing operation performed on a first field of the target interface and a second editing operation performed on a second field of the target interface, wherein the first field is used for setting the degree to which faults of the bus port and the bus device affect the running performance of the server, the second field is used for setting the number threshold of description information stored in the register that corresponds to that degree of influence, the first editing operation sets the degree of influence to a target degree of influence, and the second editing operation adjusts the number threshold of description information stored in the register that corresponds to the target degree of influence to the target number threshold; and generating the target adjustment request corresponding to the first editing operation and the second editing operation, and transmitting the target adjustment request to the target firmware.
In an exemplary embodiment, after the sending the target adjustment request to the adjustment device, the method further includes defining a threshold identifier in a setting file, where a value of the threshold identifier is used to identify a number threshold corresponding to the number of description information stored in the register, and calling an entry function to adjust the value of the threshold identifier to the target number threshold.
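The threshold adjustment described above can be pictured with a minimal sketch. It assumes the setting file can be modeled as an in-memory key/value mapping; the key name `ce_count_threshold`, its default value, and the function `set_threshold` (standing in for the "entry function") are hypothetical names chosen for illustration and do not come from the patent.

```python
# Hypothetical model of the setting file: identifier -> value.
# The default of 100 is an assumption for illustration only.
settings = {"ce_count_threshold": 100}


def set_threshold(current: dict, target_threshold: int) -> dict:
    """Stand-in for the 'entry function': set the threshold identifier."""
    if target_threshold <= 0:
        raise ValueError("threshold must be a positive integer")
    updated = dict(current)  # copy so the caller's mapping is not mutated
    updated["ce_count_threshold"] = target_threshold
    return updated


# Adjust the number threshold to a target number threshold of 20.
new_settings = set_threshold(settings, 20)
```

Returning a copy rather than mutating in place is a design choice of this sketch, not something the text prescribes.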
In an exemplary embodiment, the server comprises a clock chip connected to the target firmware, and the clock chip is used for recording the current time of the server. Before querying the candidate number of candidate description information accumulated by the register in the server at the current time, the method further comprises: detecting a target address of the clock chip, wherein the target address is used for extracting the time recorded in the clock chip; and extracting the time recorded in the clock chip as the current time by accessing the target address.
In an exemplary embodiment, querying the candidate number of candidate description information accumulated by the register in the server at the current time includes the following steps, wherein the register records description information of faults occurring on N bus ports and on the bus devices connected to them, and N is a positive integer: detecting N groups of port description information for the N bus ports, wherein the i-th group of port description information indicates the i-th bus port among the N bus ports, and i is a positive integer less than or equal to N; and invoking a target interrupt to poll, according to the N groups of port description information, the number of pieces of candidate description information accumulated at the current time for the N bus ports and the bus devices connected to them, thereby obtaining the candidate number.
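As a rough illustration of reading the time through a target address, the sketch below assumes the clock chip exposes second, minute, and hour in three consecutive byte registers encoded as packed BCD, which is common for RTC chips but is not stated in the text. The accessor `read_byte`, the register layout, and the fake chip contents are all hypothetical.

```python
def bcd_to_int(value: int) -> int:
    """Decode one packed-BCD byte, e.g. 0x59 -> 59."""
    return (value >> 4) * 10 + (value & 0x0F)


def read_current_time(read_byte, target_address: int) -> tuple:
    """Read (hour, minute, second) starting at `target_address`.

    `read_byte` is a hypothetical accessor mapping an address to a byte.
    The layout (second, minute, hour at consecutive addresses) is assumed.
    """
    second = bcd_to_int(read_byte(target_address))
    minute = bcd_to_int(read_byte(target_address + 1))
    hour = bcd_to_int(read_byte(target_address + 2))
    return hour, minute, second


# Fake clock chip for illustration: 13:45:30 stored as BCD bytes.
fake_chip = {0x00: 0x30, 0x01: 0x45, 0x02: 0x13}
current = read_current_time(fake_chip.__getitem__, 0x00)
```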
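The polling step can be sketched as a loop over the N port descriptions that sums the per-port counts. This is a simplified model: `read_error_count` is a hypothetical callback standing in for whatever the target interrupt handler reads for each port and its attached device, and the port names and counts are made up for the example.

```python
def poll_candidate_count(port_descriptions, read_error_count) -> int:
    """Sum the accumulated description-information count over all ports.

    `port_descriptions` holds one entry per bus port (N entries in total);
    `read_error_count` is a hypothetical per-port reader.
    """
    total = 0
    for port in port_descriptions:
        total += read_error_count(port)
    return total


# Illustrative data: three ports with made-up accumulated counts.
ports = ["port0", "port1", "port2"]
fake_counts = {"port0": 3, "port1": 0, "port2": 5}
candidate_count = poll_candidate_count(ports, fake_counts.get)
```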
In an exemplary embodiment, the detecting whether the first matching condition is met between the candidate number and the target number threshold value, and obtaining a first detection result includes detecting whether the candidate number is greater than or equal to the target number threshold value, determining that the first detection result is used for indicating that the first matching condition is met between the candidate number and the target number threshold value when the candidate number is greater than or equal to the target number threshold value, and determining that the first detection result is used for indicating that the first matching condition is not met between the candidate number and the target number threshold value when the candidate number is less than the target number threshold value.
In an exemplary embodiment, the detecting whether a second matching condition is met between a target duration and a duration threshold between the current time and the initial time, and obtaining a second detection result includes detecting whether the target duration is less than or equal to the duration threshold, determining that the second detection result is used for indicating that the second matching condition is met between the target duration and the duration threshold when the target duration is less than or equal to the duration threshold, and determining that the second detection result is used for indicating that the second matching condition is not met between the target duration and the duration threshold when the target duration is greater than the duration threshold.
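The two matching conditions reduce to two comparisons, sketched below. Times are assumed to be in seconds, and the 24-hour duration threshold is only an example value suggested by the caption of Fig. 5; the function and variable names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # example duration threshold (assumption)


def first_condition_met(candidate_count: int, count_threshold: int) -> bool:
    """First matching condition: candidate number >= target number threshold."""
    return candidate_count >= count_threshold


def second_condition_met(current_time: float, initial_time: float,
                         duration_threshold: float) -> bool:
    """Second matching condition: target duration <= duration threshold."""
    target_duration = current_time - initial_time
    return target_duration <= duration_threshold


first_result = first_condition_met(candidate_count=12, count_threshold=10)
second_result = second_condition_met(current_time=3600.0, initial_time=0.0,
                                     duration_threshold=SECONDS_PER_DAY)
```

Both comparisons are inclusive, so a count exactly equal to the threshold, or a duration exactly equal to the duration threshold, satisfies the respective condition.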
In an exemplary embodiment, screening the target description information to be deleted from the candidate description information according to the first detection result and the second detection result includes: in the case that the first detection result indicates that the first matching condition is met between the candidate number and the target number threshold, and the second detection result indicates that the second matching condition is met between the target duration and the duration threshold, screening a quantity of description information equal to the target number threshold from the candidate description information as the target description information; and in the case that the first detection result indicates that the first matching condition is not met between the candidate number and the target number threshold, and/or the second detection result indicates that the second matching condition is not met between the target duration and the duration threshold, determining all of the candidate description information as the target description information.
In one exemplary embodiment, the determining whether to transmit the update description information stored in the register to a controller in the server includes determining to transmit the update description information to the controller if the first detection result is used to indicate that the first matching condition is satisfied between the candidate number and the target number threshold and the second detection result is used to indicate that the second matching condition is satisfied between the target duration and the duration threshold, determining to not transmit the update description information to the controller if the first detection result is used to indicate that the first matching condition is not satisfied between the candidate number and the target number threshold and/or the second detection result is used to indicate that the second matching condition is not satisfied between the target duration and the duration threshold.
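One literal reading of the screening and transmission rules can be sketched as follows. It assumes the candidate entries are held in storage order, that "description information of the target number threshold" means the first `count_threshold` entries, and that the update description information is whatever was stored after them. The function name and the choice to return an empty update list in the non-matching case are assumptions of this sketch.

```python
def screen(candidates: list, count_threshold: int,
           first_ok: bool, second_ok: bool) -> tuple:
    """Return (target, update, transmit) for one evaluation.

    target   -- entries to delete from the register
    update   -- entries stored after the target entries
    transmit -- whether the update entries go to the controller
    """
    if first_ok and second_ok:
        target = candidates[:count_threshold]
        update = candidates[count_threshold:]
        return target, update, True
    # Either condition failed: every candidate is deleted, nothing is sent.
    return list(candidates), [], False


entries = ["e1", "e2", "e3", "e4", "e5"]
both_met = screen(entries, 3, first_ok=True, second_ok=True)
one_failed = screen(entries, 3, first_ok=True, second_ok=False)
```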
In an exemplary embodiment, the controller includes a data table and the server includes a record function, and the transmitting the update description information to the controller includes calling the record function to record the update description information into the data table one by one and transmitting the update description information to the controller one by one.
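The one-by-one recording and transmission can be sketched with a plain list standing in for the controller's data table and a callback standing in for the transport to the controller. Both stand-ins, and the name `record_and_transmit`, are hypothetical; the text does not specify how the record function or the transport is implemented.

```python
def record_and_transmit(update_entries, data_table: list, transmit) -> int:
    """Record each entry in the data table, then transmit it, in order."""
    for entry in update_entries:
        data_table.append(entry)  # record one entry
        transmit(entry)           # send the same entry to the controller
    return len(update_entries)


data_table = []   # stand-in for the controller's data table
transmitted = []  # collects what the hypothetical transport would send
count = record_and_transmit(["e4", "e5"], data_table, transmitted.append)
```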
According to another embodiment of the present application, there is provided a control apparatus for server fault information, the apparatus being applied to target firmware in a server, the apparatus including: a first processing module configured to query a register in the server for a candidate number of candidate description information accumulated at a current time, and extract an initial time for storing first description information in the candidate description information into the register, wherein the initial time is earlier than the current time, and the register is configured to store description information of a fault occurring at a bus port on the server and description information of a fault occurring at a bus device to which the bus port is connected; a first detection module for detecting whether a first matching condition is satisfied between the candidate number and a target number threshold value, obtaining a first detection result, and detecting whether a second matching condition is satisfied between a target duration between the current time and the initial time and a duration threshold value, obtaining a second detection result, wherein the target number threshold value is determined by the target firmware according to the degree of influence of faults of the bus port and the bus device on the operation performance of the server; a second processing module for screening target description information to be deleted from the candidate description information according to the first detection result and the second detection result, and determining whether update description information stored in the register is transmitted to a controller in the server, wherein the update description information is description information stored in the register after the target description information, and the candidate description information comprises the update description information; and a third processing module for deleting the target description information, and transmitting the update description information to the controller in case it is determined to transmit the update description information to the controller.
According to a further embodiment of the application, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to a further embodiment of the application, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
According to the application, the bus port on the server and the bus device connected to it may fail during operation. The target number threshold is determined by the target firmware according to the degree to which faults of the bus port and its connected bus device affect the running performance of the server; in other words, it is a threshold set independently for that bus port and the bus device connected to it. Depending on whether the first matching condition is met between the candidate number and the target number threshold, and whether the second matching condition is met between the duration threshold and the target duration between the current time and the initial time, the description information to be deleted is automatically screened out of the description information stored in the register, and it is determined whether to transmit the update description information stored in the register (for example, the description information not screened for deletion) to the controller. The update description information is transmitted to the controller when it is determined that it should be. In this way, when a large number of faults occur, operation and maintenance personnel can promptly receive and handle the description information of the faults that need attention, and the description information stored in the register is cleaned up in time, which prevents excessive accumulated description information in the register from affecting the operation of the server. The problem of low running stability of the server is thereby solved, and the running stability of the server is improved.
Drawings
Fig. 1 is a hardware configuration block diagram of a server apparatus of a control method of server failure information according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of controlling server failure information according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative target interface according to an embodiment of the application;
FIG. 4 is a schematic diagram of threshold cleaning and reporting of description information of an alternative server failure according to the present embodiment;
FIG. 5 is a flowchart of an alternative 24-hour quantitative clear PCIe CE debug in accordance with an embodiment of the present application;
Fig. 6 is a block diagram of a control apparatus of server failure information according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The terms involved in the embodiments of the present invention are explained first as follows:
IPMI: Intelligent Platform Management Interface. IPMI works across different operating systems, firmware, and hardware platforms, and can intelligently monitor, control, and automatically report the operating status of a large number of servers, so as to reduce the cost of the server system.
BMC: Baseboard Management Controller.
BIOS: Basic Input/Output System, a set of programs fixed in a ROM (Read-Only Memory) chip on the computer's main board. It holds the computer's most important basic input and output programs, the power-on self-test program, and the system startup program, and it can read and write specific system setting information.
AMD: Advanced Micro Devices, a semiconductor company.
Intel: Intel Corporation, a semiconductor company.
PCI Express: a high-speed serial computer expansion bus standard, formally Peripheral Component Interconnect Express and commonly abbreviated as PCIe or PCI-E. It is a standard connection type for devices inside a computer.
CE: Corrected Error, an error that can be corrected.
RAS: Reliability, Availability, and Serviceability. RAS as a whole serves to ensure that the entire system operates reliably for as long as possible without going offline and has a sufficiently powerful fault tolerance mechanism. It is an integral part of application environments such as large data centers, network centers such as stock exchanges, telecommunication rooms, and the database centers of banks.
APEI: ACPI Platform Error Interfaces, a unified, efficient interface between hardware and upper-layer software for communicating error messages.
RTC: Real-Time Clock/Calendar Chip, a clock chip. It is a high-performance, low-power real-time clock circuit with RAM (Random Access Memory) that keeps the year, month, date, day of the week, hour, minute, and second, and has a leap-year compensation function.
FW First: Firmware First, meaning firmware handles errors with priority; that is, errors that occur are first responded to and handled by the BIOS.
OS First: Operating System First, meaning the operating system handles errors with priority; that is, errors that occur are first responded to and handled by the OS.
Runtime protocol: in the BIOS, a protocol that may be invoked while the machine is in the OS phase.
CE and UCE errors: the most common features within the scope of PCIe RAS include error detection, error correction, error reporting, and hot-plug support. Error correction: an error correction code (ECC) is an additional set of bits appended during data storage or transmission to detect and correct certain types of errors. When an error is detected and its location is known, it can be repaired automatically using the ECC information without retransmitting the data. Error reporting: the two most common error types are UCE and CE. A UCE (Uncorrectable Error) cannot be corrected automatically by hardware; when one is detected, the hardware reports it to the operating system or device driver via AER. A UCE may be caused by hardware failure, severe signal degradation, and so on, and usually requires user intervention, such as replacing the failed component or restarting the system. A CE (Correctable Error) can be corrected automatically at the hardware level; when one is detected, the hardware attempts to fix it and reports it to the operating system or device driver via AER. CEs are typically caused by noise, transient interference, and the like.
The method embodiments provided in the embodiments of the present application may be executed in a server apparatus or similar computing device. Taking the operation on the server device as an example, fig. 1 is a hardware block diagram of the server device according to a control method of server failure information according to an embodiment of the present application. As shown in fig. 1, the server device may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like processing means) and a memory 104 for storing data, wherein the server device may further include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those of ordinary skill in the art that the architecture shown in fig. 1 is merely illustrative and is not intended to limit the architecture of the server apparatus described above. For example, the server device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a control method of server fault information in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located with respect to the processor 102, which may be connected to the server device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a server device. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
This embodiment provides a control method for server fault information. Fig. 2 is a flowchart of the control method according to an embodiment of the present application. The method is applied to target firmware in a server and, as shown in Fig. 2, includes the following steps:
Step S202, inquiring the candidate quantity of the candidate descriptive information accumulated by a register in a server on the current time, and extracting initial time for storing the first descriptive information in the candidate descriptive information into the register, wherein the initial time is earlier than the current time, and the register is used for storing the descriptive information of faults occurring at a bus port on the server and the descriptive information of faults occurring at a bus device connected with the bus port;
Step S204, whether a first matching condition is met between the candidate quantity and a target quantity threshold value is detected, a first detection result is obtained, whether a second matching condition is met between a target duration between the current time and the initial time and a duration threshold value is detected, and a second detection result is obtained, wherein the target quantity threshold value is determined by the target firmware according to the influence degree of faults of the bus port and the bus equipment on the operation performance of the server;
Step S206, screening target description information to be deleted from the candidate description information according to the first detection result and the second detection result, and determining whether to transmit update description information stored in the register to a controller in the server, wherein the update description information is the description information stored in the register after the target description information, and the candidate description information comprises the update description information;
step S208, deleting the target description information, and transmitting the update description information to the controller in the case that it is determined to transmit the update description information to the controller.
A bus port on the server and the bus device connected to it may fail during operation. Through the above steps, the target number threshold is determined by the target firmware according to the degree of influence that faults of the bus port and its connected bus device have on the operating performance of the server; that is, the number threshold is set independently for the bus port and the bus device connected to it. According to whether the first matching condition is satisfied between the candidate number and the target number threshold, and whether the second matching condition is satisfied between the duration threshold and the target duration between the current time and the initial time, the description information to be deleted is automatically screened from the description information stored in the register, and it is determined whether to transmit the update description information stored in the register (for example, the description information not screened for deletion) to the controller; the update description information is transmitted once that determination is made. In this way, when a large number of faults occur, operation and maintenance personnel can promptly receive and handle the description information of the faults that need attention, while the description information stored in the register is cleaned up in time, so that excessive accumulation in the register does not affect the operation of the server. The problem of low operating stability of the server is thereby addressed, and the operating stability of the server is improved.
In the technical solution provided in step S202, the bus port may include, but is not limited to, a port conforming to a target bus protocol, and the bus device may, but is not limited to, conform to the same bus protocol as the bus port. For example, the target bus protocol may include, but is not limited to, the PCIe protocol, the I2C (Inter-Integrated Circuit, a two-wire serial bus) protocol, and the like; the bus port may include, but is not limited to, a PCIe port; and the bus device may include, but is not limited to, a PCIe device such as a network card.
Optionally, in this embodiment, the target firmware may include, but is not limited to, a management module of the system, for example the BIOS, and may be used to manage and control the description information of faults occurring at a bus port on the server and at the bus device connected to the bus port.
Alternatively, in the present embodiment, the description information of the fault may be, but is not limited to, information indicating the type of fault occurring to the bus port and the bus device, the bus port occurring to the fault, the bus device occurring to the fault, and the error information occurring to the fault, etc.
Optionally, in this embodiment, the faults occurring at the bus port and at the bus device connected to the bus port may include, but are not limited to, CE (correctable error) and UCE (uncorrectable error) faults, and each fault may, but is not limited to, correspond to one piece of description information. For example, where the target bus protocol is the PCIe protocol, the bus port is a PCIe port, and the bus device is a PCIe device, the description information may include, but is not limited to, description information of CE faults occurring at the PCIe port and the PCIe device.
Alternatively, in this embodiment, the initial time may be, but is not limited to, a time for storing the first description information in the candidate description information in a register, the current time and the initial time may be, but is not limited to, represented by year, month, day, time, minute, second, etc., for example, the initial time may be, but is not limited to, represented by a series of specific time parameters, such as, for example, stored and used in the form of T0D (day), T0H (hour), T0M (minute), T0S (second). The current time may be represented, but is not limited to, by a series of specific time parameters, for example stored and used in the form of T2D (days), T2H (hours), T2M (minutes), T2S (seconds).
In an exemplary embodiment, the server includes a clock chip connected to the target firmware, and the clock chip is configured to record the current time of the server. Before the candidate number of pieces of candidate description information accumulated in the register in the server at the current time is queried, the method may include, but is not limited to: detecting a target address of the clock chip, where the target address is used to extract the time recorded in the clock chip; and extracting the time recorded in the clock chip as the current time by accessing the target address.
Optionally, in this embodiment, the clock chip may be, but is not limited to being, used to record the current time of the server in real time. The target address may include, but is not limited to, an access address for the current time of the server recorded in the clock chip, for example the address of a register or storage unit inside the clock chip that stores and updates the time data; the current time of the server is extracted by accessing the target address.
Optionally, in this embodiment, the clock chip may, but is not limited to, continue to track time accurately on a standby power supply when the system is powered off, so that the time is recorded in real time.
According to the embodiment of the application, the current time is read through the target address provided by the clock chip (RTC) to identify when the time length threshold is reached, so that the description information of the fault is periodically and quantitatively cleaned, and the operation and maintenance efficiency of the server is improved.
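As a rough illustration of reading the time through a target address, the sketch below simulates a clock chip as a small register map and decodes the day, hour, minute, and second fields from it. The register offsets and the BCD encoding follow the common PC CMOS RTC layout, which is an assumption here because the text does not name the chip or its address map; `FakeRtc`, `RtcTime`, and `read_current_time` are hypothetical names.

```python
from typing import Dict, NamedTuple

# Register offsets of the common PC CMOS RTC layout (assumed; the text
# does not name the clock chip or its address map).
REG_SECONDS, REG_MINUTES, REG_HOURS, REG_DAY = 0x00, 0x02, 0x04, 0x07


class RtcTime(NamedTuple):
    day: int
    hour: int
    minute: int
    second: int


class FakeRtc:
    """Hypothetical stand-in for the clock chip: address -> BCD byte."""

    def __init__(self, registers: Dict[int, int]) -> None:
        self._registers = registers

    def read(self, address: int) -> int:
        return self._registers[address]


def bcd_to_int(value: int) -> int:
    """Decode one packed-BCD byte, e.g. 0x18 -> 18."""
    return (value >> 4) * 10 + (value & 0x0F)


def read_current_time(rtc: FakeRtc) -> RtcTime:
    """Read the day/hour/minute/second fields through their addresses."""
    return RtcTime(
        day=bcd_to_int(rtc.read(REG_DAY)),
        hour=bcd_to_int(rtc.read(REG_HOURS)),
        minute=bcd_to_int(rtc.read(REG_MINUTES)),
        second=bcd_to_int(rtc.read(REG_SECONDS)),
    )


rtc = FakeRtc({REG_DAY: 0x12, REG_HOURS: 0x18, REG_MINUTES: 0x30, REG_SECONDS: 0x05})
now = read_current_time(rtc)
```

In this sketch `now` holds the decoded fields that play the role of T2D, T2H, T2M, and T2S; real firmware would access the chip through platform I/O rather than a dictionary.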
In one exemplary embodiment, the candidate number of pieces of candidate description information accumulated in the register in the server at the current time may be queried by, but is not limited to, the following steps when the bus port includes N bus ports, where the register is used to record description information of faults occurring at the N bus ports and at the bus devices connected to each of the N bus ports, and N is a positive integer: detecting N sets of port description information of the N bus ports, where the i-th set of port description information indicates the i-th bus port of the N bus ports, and i is a positive integer less than or equal to N; calling a target interrupt to poll, according to the N sets of port description information, the number of pieces of description information accumulated in the register at the current time for each of the N bus ports and the bus device connected to it, to obtain N numbers; and summing the N numbers to obtain the candidate number.
Optionally, in this embodiment, the port description information may include, but is not limited to, an identifier of the bus port, for example the BDF (Bus Number, Device Number, Function Number) of the bus port.
Optionally, in this embodiment, the i-th number of the N numbers may be, but is not limited to, 0 or greater than 0, where the i-th number is the number of pieces of description information stored in the register for faults occurring at the i-th bus port and at the i-th bus device connected to it, and the N bus ports include the i-th bus port.
Optionally, in this embodiment, the target interrupt may be, but is not limited to being, used to issue an interrupt signal, for example an SMI interrupt, when a fault (for example a PCIe CE fault) is detected. For example, an SMI interrupt may be, but is not limited to being, triggered when the BIOS detects a PCIe CE error in the AER register via the SMM protocol.
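The per-port polling described above can be sketched as follows, assuming the candidate number is the sum of the per-port counts. The `register` dictionary is a hypothetical stand-in for the error register, keyed by each port's BDF; the function name and the sample error strings are illustrative only.

```python
from typing import Dict, List, Tuple

# A BDF (bus, device, function) tuple identifies one bus port.
Bdf = Tuple[int, int, int]


def count_candidates(ports: List[Bdf], register: Dict[Bdf, List[str]]) -> int:
    """Poll each port and sum the per-port counts of stored descriptions.

    `register` is a hypothetical model of the error register: it maps a
    port's BDF to the fault descriptions recorded for that port and the
    device behind it. A port with no recorded fault contributes 0.
    """
    per_port_counts = [len(register.get(bdf, [])) for bdf in ports]
    return sum(per_port_counts)


ports = [(0, 1, 0), (0, 2, 0), (0, 3, 0)]
register = {
    (0, 1, 0): ["CE: receiver error", "CE: bad TLP"],
    (0, 3, 0): ["CE: replay timer timeout"],
}
candidate_number = count_candidates(ports, register)
```

Here the second port has no recorded fault, so its count is 0 and the candidate number comes from the other two ports.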
In the solution provided in step S204, the influence degree may be, but is not limited to, calculated according to the following formula (1):
$$R = \frac{S_1 - S_2}{S_1} \times 100\% \quad \text{formula (1)}$$
Wherein R is the degree of influence, S_1 is the operating performance of the server when the bus port and the bus device connected to the bus port have not failed, and S_2 is the operating performance of the server after the bus port and the bus device connected to the bus port have failed.
As an alternative example, the operational performance of the server may be determined, but is not limited to, by the processing speed of the server, the accuracy of the data transmitted by the server, the speed at which the server responds to the request, and so forth.
Optionally, in this embodiment, different degrees of influence of faults of the bus port and the bus device on the operating performance of the server may correspond to different number thresholds. For example, when the degree of influence is 20%, the corresponding threshold is 200, and when the degree of influence is 40%, the corresponding threshold is 600.
Optionally, in this embodiment, the first detection result may, but is not limited to, including that the first matching condition is satisfied between the candidate number and the target number threshold, or that the first matching condition is not satisfied between the candidate number and the target number threshold, and the second detection result may, but is not limited to, including that the second matching condition is satisfied between the target duration and the duration threshold between the current time and the initial time, or that the second matching condition is not satisfied between the target duration and the duration threshold between the current time and the initial time.
In an exemplary embodiment, the server includes an adjustment device connected to the target firmware. Before detecting whether the first matching condition is satisfied between the candidate number and the target number threshold, the method may include, but is not limited to: receiving a target adjustment request, where the target adjustment request is used to request that the number threshold corresponding to the number of pieces of description information stored in the register be adjusted to the target number threshold; and sending the target adjustment request to the adjustment device, where the adjustment device is used to execute the target adjustment request.
Optionally, in this embodiment, a fault diagnosis driver, for example an RAS fault diagnosis driver, may be, but is not limited to being, deployed on the adjustment device, and the adjustment device may, but is not limited to, execute the target adjustment request through the fault diagnosis driver.
Alternatively, in this embodiment, the target adjustment request may be, but not limited to, a request to adjust a quantity threshold corresponding to the quantity of the description information stored in the register from an initial quantity threshold to a target quantity threshold, for example, the initial quantity threshold may be greater than the target quantity threshold, or the initial quantity threshold may be less than the target quantity threshold.
In an exemplary embodiment, the server includes a target interface, and the target adjustment request may be received in, but not limited to, the following manner: detecting a first editing operation performed in a first field on the target interface and a second editing operation performed in a second field on the target interface, wherein the first field is used to set the degree of influence of faults of the bus port and the bus device on the operating performance of the server, the second field is used to set the number threshold, for the description information stored in the register, that corresponds to that degree of influence, the first editing operation is used to set the degree of influence to a target degree of influence, and the second editing operation is used to adjust the number threshold corresponding to the target degree of influence to the target number threshold; and generating the target adjustment request corresponding to the first editing operation and the second editing operation, and transmitting the target adjustment request to the target firmware.
Alternatively, in this embodiment, the influence degree and the number threshold may have a correspondence relationship, and the user may edit the influence degree and the number threshold corresponding to the influence degree on the target interface, for example, by default, the number threshold corresponding to the influence degree of 10% is 100, the number threshold corresponding to the influence degree of 20% is 300, and the user may adjust the threshold corresponding to the influence degree of 10% to 150, but not limited thereto.
Fig. 3 is a schematic diagram of an optional target interface according to an embodiment of the present application. As shown in Fig. 3, a user may perform a first editing operation on a first field (e.g., the degree of influence) on the target interface; for example, a list on the target interface shows the degrees of influence selectable by the user, such as 10%, 20%, 30%, ..., 100%. The user may perform a second editing operation on a second field (e.g., the number threshold) on the target interface; for example, a list on the target interface shows the number thresholds that the user may select for a degree of influence, such as 100, 300, 500, etc.
For example, the user may edit, but is not limited to, the degree of influence to 10% and the number threshold to 100 corresponding to 10%. In this case, after the confirmation operation is clicked on the target interface, target adjustment requests corresponding to the first editing operation and the second editing operation are generated, and the target adjustment requests are transmitted to the target firmware. It is understood that the target adjustment request may be, but is not limited to, a request to set the quantity threshold to the quantity threshold 100 corresponding to a degree of influence of 10%.
It should be noted that, because of the correspondence between the degree of influence and the number threshold, after the degree of influence is edited, the number threshold is automatically set, according to the correspondence, to the value matching the new degree of influence. The user can also edit the first field (degree of influence) and the second field (number threshold) independently on the target interface according to specific requirements. For example, the user edits the degree of influence on the target interface and selects 10%, so that the number threshold is correspondingly 100; the user can then adjust the number threshold, for example raising or lowering it from 100.
According to the embodiment of the present application, a selectable PCIe CE overrun threshold is exposed on Setup (equivalent to the target interface), and the threshold is set by time and quantity according to customer requirements. Allowing the user to edit the degree of influence and the number threshold lets operation and maintenance personnel adjust the threshold according to historical error data and the current system condition, which avoids unnecessary alarms and error handling, saves operation and maintenance time and cost, and improves the configuration flexibility and intelligence of the system.
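The correspondence between the degree of influence and the number threshold, together with the user's independent edit of the second field, can be sketched as below. The default table uses the example values given earlier (10% maps to 100, 20% maps to 300); the request layout and the name `build_adjustment_request` are hypothetical, since the text does not define the format of the target adjustment request.

```python
from typing import Dict, Optional

# Default correspondence from the example in the text:
# degree of influence (percent) -> number threshold.
DEFAULT_THRESHOLDS: Dict[int, int] = {10: 100, 20: 300}


def build_adjustment_request(
    degree_percent: int,
    edited_threshold: Optional[int] = None,
    table: Dict[int, int] = DEFAULT_THRESHOLDS,
) -> Dict[str, int]:
    """Combine the two editing operations into one adjustment request.

    If the user only picks a degree of influence, the threshold follows
    the table; an explicit edit of the second field overrides it.
    """
    threshold = table[degree_percent] if edited_threshold is None else edited_threshold
    # The dictionary layout is an assumption made for illustration.
    return {"influence_degree": degree_percent, "quantity_threshold": threshold}


default_request = build_adjustment_request(10)
edited_request = build_adjustment_request(10, edited_threshold=150)
```

The first call models selecting 10% and accepting the matching threshold of 100; the second models the user raising that threshold to 150.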
In one exemplary embodiment, the target adjustment request may be sent to the adjustment device after, but not limited to, defining a threshold identifier in a setting file, where a value of the threshold identifier is used to identify a number threshold corresponding to the number of description information stored in the register, and invoking an entry function to adjust the value of the threshold identifier to the target number threshold.
Alternatively, in this embodiment, the value of the threshold identifier may be, but not limited to, the same as the number threshold corresponding to the number of description information stored in the register. When the number threshold is updated, the value of the threshold identifier may be updated by, but is not limited to, invoking an entry function, it being understood that the value of the threshold identifier may be dynamically changing.
In one exemplary embodiment, whether the first matching condition is satisfied between the candidate number and the target number threshold may be detected, and the first detection result obtained, in, but not limited to, the following manner: detecting whether the candidate number is greater than or equal to the target number threshold; if the candidate number is greater than or equal to the target number threshold, determining that the first detection result indicates that the first matching condition is satisfied between the candidate number and the target number threshold; and if the candidate number is less than the target number threshold, determining that the first detection result indicates that the first matching condition is not satisfied between the candidate number and the target number threshold.
Optionally, in this embodiment, the target number threshold may change dynamically across detection periods, where the length of a detection period equals the duration threshold. For example, if the target number threshold in the first detection period is smaller than that in the second detection period, the same candidate number may satisfy the first matching condition with the target number threshold in the first detection period but not in the second detection period.
In one exemplary embodiment, whether the second matching condition is satisfied between the duration threshold and the target duration between the current time and the initial time may be detected, and the second detection result obtained, in, but not limited to, the following manner: detecting whether the target duration is less than or equal to the duration threshold; if the target duration is less than or equal to the duration threshold, determining that the second detection result indicates that the second matching condition is satisfied between the target duration and the duration threshold; and if the target duration is greater than the duration threshold, determining that the second detection result indicates that the second matching condition is not satisfied between the target duration and the duration threshold.
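The two matching conditions reduce to a pair of comparisons, sketched here with hypothetical function names. The boundary cases follow the text: the first condition uses greater than or equal to, and the second uses less than or equal to.

```python
def first_condition_met(candidate_number: int, number_threshold: int) -> bool:
    """First matching condition: the candidate number has reached the threshold."""
    return candidate_number >= number_threshold


def second_condition_met(target_duration_hours: float,
                         duration_threshold_hours: float) -> bool:
    """Second matching condition: the elapsed time is still within the window."""
    return target_duration_hours <= duration_threshold_hours


# 300 descriptions accumulated 18 hours after the first one,
# checked against a threshold of 200 and a 24-hour window.
first_result = first_condition_met(300, 200)
second_result = second_condition_met(18, 24)
```

Both results are true in this example, which corresponds to the case where descriptions beyond the threshold are reported.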
Optionally, in this embodiment, the current time and the time at which the description information of a fault was first recorded may be normalized by the following formula (2); that is, both times are converted into hours and then subtracted to obtain the duration between them, in hours.
$$T = (\mathrm{T2D} \times 24 + \mathrm{T2H}) - (\mathrm{T0D} \times 24 + \mathrm{T0H}) \quad \text{formula (2)}$$
Wherein T represents a time length between the current time and the initial time, T2D represents a number of days portion of the current time, T2H represents a number of hours portion of the current time, T0D represents a number of days portion of the time when the description information of the failure is first recorded, and T0H represents a number of hours portion of the time when the description information of the failure is first recorded.
Alternatively, in the present embodiment, in the case of calculating the duration between the current time and the initial time, the calculation may be performed by converting the duration between the current time and the initial time into minutes or seconds, but is not limited thereto.
By converting the current time and the time for recording the fault for the first time onto the same time scale, the time length between the current time and the time for recording the description information of the fault for the first time is accurately calculated, and the timeliness and effectiveness of management of the description information of the fault are ensured.
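A minimal sketch of the hour normalization, assuming formula (2) converts the day part of each time to hours, adds the hour part, and subtracts the two totals. The function name is hypothetical, and minutes and seconds are ignored, as in the hour-based variant described above.

```python
def elapsed_hours(t2d: int, t2h: int, t0d: int, t0h: int) -> int:
    """Duration in hours between the initial time (T0) and the current time (T2).

    Each time is normalized to hours as day * 24 + hour before subtracting.
    This assumes both days fall in the same month; a month rollover would
    need the month and year fields as well.
    """
    return (t2d * 24 + t2h) - (t0d * 24 + t0h)


same_day = elapsed_hours(t2d=5, t2h=20, t0d=5, t0h=2)
next_day = elapsed_hours(t2d=6, t2h=3, t0d=5, t0h=2)
```

With a 24-hour duration threshold, `same_day` (18 hours) is still inside the window, while `next_day` (25 hours) is past it.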
In the solution provided in step S206, the description information of the fault generated by the bus port and the bus device may be written into the register in real time, and the number of the update description information stored in the register may be, but is not limited to, 0 or the number of the update description information stored in the register is greater than 0 at the current time.
In one exemplary embodiment, the target description information to be deleted may be screened from the candidate description information according to the first detection result and the second detection result in, but not limited to, the following manner: when the first detection result indicates that the first matching condition is satisfied between the candidate number and the target number threshold, and the second detection result indicates that the second matching condition is satisfied between the target duration and the duration threshold, a number of pieces of description information equal to the target number threshold is screened from the candidate description information as the target description information; when the first detection result indicates that the first matching condition is not satisfied between the candidate number and the target number threshold, and/or the second detection result indicates that the second matching condition is not satisfied between the target duration and the duration threshold, the candidate description information is determined as the target description information.
Alternatively, in the present embodiment, in the case where the first detection result satisfies the first matching condition and the second detection result satisfies the second matching condition, it may be indicated that the number of pieces of description information of the failure accumulated within the time period threshold has exceeded the target number threshold. At this time, the description information of the target number threshold is selected from the candidate description information as the target description information.
In the case where the first detection result does not satisfy the first matching condition and/or the second detection result does not satisfy the second matching condition, it may be indicated that the number of pieces of description information of the failure accumulated within the time period threshold has not reached the target number threshold, and in such a case, all the pieces of candidate description information are determined as target description information.
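The screening rule can be sketched as a function that splits the candidate list into the part to delete and the part to report. It assumes the candidates are held in storage order, so the earliest entries are the first ones screened; the name `screen` and the sample data are hypothetical.

```python
from typing import List, Tuple


def screen(candidates: List[str], first_met: bool, second_met: bool,
           number_threshold: int) -> Tuple[List[str], List[str]]:
    """Return (target descriptions to delete, update descriptions).

    Candidates are assumed to be in the order they were stored.
    """
    if first_met and second_met:
        # Delete as many entries as the threshold; the later ones are updates.
        return candidates[:number_threshold], candidates[number_threshold:]
    # Otherwise every candidate is a target and nothing is reported.
    return list(candidates), []


candidates = [f"desc-{i}" for i in range(1, 6)]
to_delete, update = screen(candidates, True, True, number_threshold=3)
```

With five candidates and a threshold of three, the first three are screened for deletion and the last two become the update description information.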
In one exemplary embodiment, whether to transmit the update description information stored in the register to the controller in the server may be determined in, but not limited to, the following manner: when the first detection result indicates that the first matching condition is satisfied between the candidate number and the target number threshold, and the second detection result indicates that the second matching condition is satisfied between the target duration and the duration threshold, determining to transmit the update description information to the controller; and when the first detection result indicates that the first matching condition is not satisfied between the candidate number and the target number threshold, and/or the second detection result indicates that the second matching condition is not satisfied between the target duration and the duration threshold, determining not to transmit the update description information to the controller.
Optionally, in this embodiment, it may be determined, but is not limited to, to transmit the update description information to the controller when the first detection result indicates that the first matching condition is satisfied between the candidate number and the target number threshold and the second detection result indicates that the second matching condition is satisfied between the target duration and the duration threshold. Fig. 4 is a schematic diagram of optional threshold-based clearing and reporting of server fault description information according to this embodiment. As shown in Fig. 4, by way of example and not limitation, the number threshold is 200, the candidate number of pieces of candidate description information recorded in the register at the current time is 400, the duration threshold is 24 hours, and the target firmware is the BIOS. Because the candidate number is greater than the number threshold and the duration between the current time and the initial time is less than or equal to 24 hours, the BIOS screens out description information 1 to description information 200 as the description information to be deleted, and determines to transmit description information 201 to description information 400 stored in the register to the controller. It can be understood that the update description information comprises description information 201 to description information 400.
In the technical solution provided in the above step S208, in the case that it is determined to transmit updated description information to the controller, the threshold number of description information stored in the register is deleted, and description information exceeding the threshold number is immediately transmitted to the controller, and in the case that it is determined not to transmit updated description information to the controller, all the description information stored in the register is deleted.
For example, when the duration threshold is 24 hours and the number threshold is 200, if 300 pieces of error information have accumulated in the register by the 18th hour, the 201st to 300th pieces in the register are immediately transmitted to the controller, and the 1st to 200th pieces are deleted.
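The worked example can be traced with a small simulation in which the register and the controller are plain lists. This is only a sketch: `control_step` is a hypothetical name, and because the text does not say whether reported entries are also removed from the register, the sketch deletes only the target description information and leaves the reported entries in place.

```python
from typing import List


def control_step(register: List[str], controller_log: List[str],
                 hours_since_first: float, number_threshold: int = 200,
                 duration_threshold_hours: float = 24) -> None:
    """Apply one deletion and reporting pass to `register`, in place."""
    over_count = len(register) >= number_threshold
    within_window = hours_since_first <= duration_threshold_hours
    if over_count and within_window:
        # Report the entries stored after the first `number_threshold` ones,
        # then delete the first `number_threshold` entries.
        controller_log.extend(register[number_threshold:])
        del register[:number_threshold]
    else:
        # Conditions not both met: delete everything, report nothing.
        register.clear()


register = [f"error-{i}" for i in range(1, 301)]
controller_log: List[str] = []
control_step(register, controller_log, hours_since_first=18)
```

After the call, `controller_log` holds error-201 through error-300, and error-1 through error-200 are no longer in the register.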
In the related art, PCIe CE processing on servers needs a CE filtering mechanism that lets customers perceive CE reports dynamically and selectively; simply disabling PCIe CE reporting leaves customers unable to perceive the risk on some hardware links. The technical solution of the present application may be used for, but is not limited to, server products of the AMD platform with the x86 architecture (or other server products). By reading the RTC clock and applying a threshold that is set by time and quantity according to customer requirements, it avoids the risk of machine downtime caused by the accumulation of a large number of PCIe CEs and by CE storms, realizes dynamic processing of PCIe CEs within a given period, greatly reduces the risk of CE storms, lets operation and maintenance personnel selectively perceive serious machine abnormalities in time, and reduces wasted effort by the customer's operation and maintenance personnel.
In an exemplary embodiment, the controller includes a data table and the server includes a record function, and the update description information may be transmitted to the controller, but is not limited to, by calling the record function to record the update description information into the data table, and transmitting the update description information to the controller, one by one.
Optionally, in this embodiment, the controller may, but is not limited to, detect faults occurring on the bus port and the bus device according to the received update description information, and repair corresponding faults in time, where the controller may, but is not limited to, include a BMC, an OS, and the like.
Optionally, in this embodiment, the method further includes detecting, when it is detected that the number of candidate descriptors stored in the register at the current time is greater than or equal to a number threshold, whether the number of candidate descriptors stored in the register at the current time is greater than or equal to an upper limit, where the upper limit is greater than the target number threshold, and the upper limit is a threshold corresponding to a case where a degree of influence of a fault occurring in the bus port and a bus device connected to the bus port on the operation performance of the server is greater than or equal to a degree of influence threshold;
in the case where it is detected that the number of candidate descriptors is greater than or equal to an upper limit value, the descriptors stored to the register after the target descriptors are determined as updated descriptors, and the updated descriptors are transmitted to the controller, and the descriptors of the upper limit value are deleted.
According to the embodiment of the application, under the condition that the number of the candidate descriptive information stored in the register at the current time is greater than or equal to the upper limit value, the fact that CE storm is likely to happen at present can be indicated, a large number of faults are generated in a very short time, under the condition that the CE storm is likely to happen, all descriptive information stored in the register after the target descriptive information can be directly transmitted to the controller, the situation that more time is wasted for detecting whether the error information needing to be reported is needed or not for many times, the situation that more time is wasted for screening the error information needing to be reported from the candidate descriptive information is avoided, and the error information is directly reported when CE storm is detected, the efficiency of reporting the error information to the controller is improved, and operation and maintenance personnel can conveniently and rapidly process the CE storm.
In order to better understand the control procedure of the control method of server failure information in the embodiment of the present application, the following explanation and description of the control procedure of the control method of server failure information in the embodiment of the present application are applicable to, but not limited to, the embodiment of the present application.
Under the condition that a CPU can normally write PCIe CEs into an AER register, the technical scheme of the application can be realized by, but is not limited to, executing the following steps of 1) setting a threshold value for the quantity of PCIe CEs by the BIOS (equivalent to target firmware) to identify that the PCIe CEs are reported to the BMC and the OS when the CEs accumulate to the threshold value in 24 hours, 2) reading the current time according to a clock chip (RTC) address to identify when the 24-hour duration is reached, and 3) executing a mechanism for cleaning the PCIe CEs on time according to the time acquired from the RTC.
FIG. 5 is a flowchart of an optional procedure for quantitatively clearing PCIe CEs within 24 hours according to an embodiment of the present application. As shown in FIG. 5, the procedure may be understood as, but is not limited to, one in which the target bus protocol includes the PCIe protocol, the bus port includes a PCIe port, the bus device includes a PCIe device, and the fault includes a CE fault.
After a PCIe CE occurs, the BIOS (corresponding to the target firmware) invokes the RAS fault diagnosis driver of the AMD platform (corresponding to the adjusting device) to handle the PCIe CE. At this point, a threshold is set on Setup (corresponding to the target interface) to specify how many CE errors can be cleared within 24 hours (corresponding to the duration threshold, which may also be 12 hours, 30 minutes, or the like; the present application does not limit this). The threshold identifier displayed on Setup is defined in the sd file (corresponding to the setting file), and the threshold is then initialized and assigned in the PCIe CE processing entry function. In this process, the number of CE errors to be ignored within 24 hours can be chosen as required and associated with the threshold on Setup, so that the number of errors cleared is selectable.
The BIOS then calls an SMI interrupt (equivalent to the target interrupt) to process the CE errors in the AER register. When the CE error handling mode in AER is FW First, that is, when handling is not triggered by the device driver layer, the BIOS calls the SMI interrupt through the SMM protocol to poll according to the BDFs of the PCIe Root Ports and confirm whether errors have occurred on the corresponding PCIe ports; when the number of PCIe CE errors exceeds the threshold, the BIOS passes the errors to the OS through the SMI and synchronizes them to the BMC. When PCIe CE errors occur, the BIOS stores them in a list through the runtime protocol and accumulates a count, so that the number of PCIe CE errors recorded in the AER register of the CPU under the current configuration can be confirmed. When the BIOS starts polling for PCIe CE errors, it records the current time as T0D, T0H, T0M, T0S. When the number of PCIe CE errors exceeds the threshold, the BIOS reads the current time from the address of the clock chip, accurate to the day of the month, hour, minute, and second, and records it as T2D, T2H, T2M, T2S. The BIOS then calculates whether (T2D+T2H) - (T0D+T0H), with the day fields converted to hours, exceeds 24 hours, and thereby confirms whether the CE errors need to be reported.
After all PCIe CEs have been polled, the BIOS triggers the CE log record function. If the value of (T2D+T2H) - (T0D+T0H) does not exceed 24 hours, the BIOS records the CE errors, in the number exceeding the CE threshold, one by one into the BERT table of the OS (corresponding to the data table) and reports them one by one to the BMC, which records them in its log system; the threshold number of PCIe CE errors is then subtracted from the AER register, and the remaining PCIe CEs enter the next round of accumulation. If the value of (T2D+T2H) - (T0D+T0H) exceeds 24 hours, the BIOS does not report CE errors to the OS or the BMC; it clears the PCIe CE errors in AER and starts the next round of CE accumulation. Likewise, after the 24-hour cycle count is complete, the BIOS assigns T2D, T2H, T2M, T2S to T0D, T0H, T0M, T0S to start a new cycle count.
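The round described above can be illustrated with a minimal sketch. It is not the BIOS implementation: the names `elapsed_hours` and `handle_round`, the list standing in for the AER register contents, and the assumption that both RTC readings fall in the same month are hypothetical. The sketch also assumes one reading of the text, namely that the first threshold-many errors are removed from the register and the errors stored after them are the ones reported and carried into the next round, and that the day fields are converted to hours by multiplying by 24.

```python
from typing import List, Tuple

HOURS_PER_DAY = 24


def elapsed_hours(t0: Tuple[int, int], t2: Tuple[int, int]) -> int:
    """Whole hours between two (day-of-month, hour) readings.

    Assumes both readings fall in the same month (hypothetical simplification).
    """
    t0d, t0h = t0
    t2d, t2h = t2
    return (t2d * HOURS_PER_DAY + t2h) - (t0d * HOURS_PER_DAY + t0h)


def handle_round(
    errors: List[str],
    threshold: int,
    t0: Tuple[int, int],
    t2: Tuple[int, int],
    window_hours: int = 24,
) -> Tuple[List[str], List[str]]:
    """Return (reported, remaining) for one accumulation round."""
    if len(errors) < threshold:
        # Threshold not reached: nothing is reported, accumulation continues.
        return [], list(errors)
    if elapsed_hours(t0, t2) <= window_hours:
        # Threshold reached inside the window: drop the first `threshold`
        # errors, report the ones stored after them, and carry those over.
        return list(errors[threshold:]), list(errors[threshold:])
    # Window expired: clear everything without reporting.
    return [], []
```

For example, with five accumulated errors, a threshold of 3, and readings 23 hours apart, the last two errors are reported and carried over; with readings 25 hours apart, nothing is reported and the list is cleared.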
The application mainly optimizes the existing PCIe CE error handling mechanism: by reading the time from the RTC clock (equivalent to the clock chip) to determine the time interval, PCIe CE errors are cleared on schedule and in a set quantity, which brings the following three benefits:
(1) When a machine reports a large number of PCIe CE errors, the errors can be reported within a certain time so that operation and maintenance personnel become aware of them and can investigate the cause promptly, which greatly reduces the risk of a CE storm;
(2) The technical solution lets the customer selectively perceive the more serious errors and check promptly whether a machine link is abnormal, which saves considerable operation and maintenance effort and reduces the risk of a serious machine abnormality;
(3) The technical solution lets the customer selectively perceive the PCIe CE errors reported within one day and regulates PCIe CE reporting dynamically and intelligently, which helps improve the competitiveness of the machine.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the method according to the embodiments of the present application.
This embodiment also provides an apparatus for controlling server fault information. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 6 is a block diagram of a control apparatus for server fault information according to an embodiment of the present application. As shown in Fig. 6, the apparatus is applied to target firmware in a server and comprises:
A first processing module 602, configured to query the candidate number of pieces of candidate description information accumulated by a register in the server at the current time, and extract the initial time at which the first piece of the candidate description information was stored in the register, where the initial time is earlier than the current time, and the register is configured to store description information of a fault occurring at a bus port on the server and description information of a fault occurring at a bus device connected to the bus port;
A first detection module 604, configured to detect whether a first matching condition is satisfied between the candidate number and a target number threshold, to obtain a first detection result, and detect whether a second matching condition is satisfied between a duration threshold and a target duration between the current time and the initial time, to obtain a second detection result, where the target number threshold is determined by the target firmware according to a degree of influence of faults occurring in the bus port and the bus device on an operation performance of the server;
A second processing module 606, configured to screen target description information to be deleted from the candidate description information according to the first detection result and the second detection result, and determine whether to transmit update description information stored in the register to a controller in the server, where the update description information is description information stored in the register after the target description information, and the candidate description information includes the update description information;
And a third processing module 608, configured to delete the target description information, and transmit the update description information to the controller if it is determined to transmit the update description information to the controller.
With the above apparatus, faults may occur while the bus port on the server and the bus device connected to the bus port are running. The target number threshold is determined by the target firmware according to the degree of influence of faults of the bus port and the connected bus device on the operation performance of the server; in other words, a number threshold is set specifically for the bus port and the bus device connected to it. According to whether the first matching condition is satisfied between the candidate number and the target number threshold, and whether the second matching condition is satisfied between the duration threshold and the target duration between the current time and the initial time, the description information to be deleted is automatically screened from the description information stored in the register, and it is determined whether to transmit the update description information stored in the register (for example, the description information not screened for deletion) to the controller; the update description information is transmitted to the controller when it is determined to do so. In this way, when a large number of faults occur, operation and maintenance personnel can receive and handle in time the description information of the faults that need attention, and the description information stored in the register is cleared in time, so that server operation is not affected by excessive accumulated description information in the register. The problem of low server operation stability is thereby solved, and the operation stability of the server is improved.
In one exemplary embodiment, the server includes an adjustment device, the adjustment device being connected to the target firmware, the apparatus further comprising:
A receiving module, configured to receive a target adjustment request before detecting whether the first matching condition is satisfied between the candidate number and the target number threshold to obtain the first detection result, where the target adjustment request is used to request that the number threshold corresponding to the number of pieces of description information stored in the register be adjusted to the target number threshold;
A sending module, configured to send the target adjustment request to the adjustment device, where the adjustment device is used to execute the target adjustment request.
In one exemplary embodiment, the server includes a target interface, and the receiving module includes:
A first detecting unit, configured to detect a first editing operation performed on a first field of the target interface and a second editing operation performed on a second field of the target interface, where the first field is used to set the degree of influence of faults occurring in the bus port and the bus device on the operation performance of the server, the second field is used to set the number threshold for the number of pieces of description information stored in the register corresponding to that degree of influence, the first editing operation is used to set the degree of influence to a target degree of influence, and the second editing operation is used to adjust the number threshold corresponding to the target degree of influence to the target number threshold;
The first processing unit is used for generating the target adjustment request corresponding to the first editing operation and the second editing operation and transmitting the target adjustment request to the target firmware.
In one exemplary embodiment, the apparatus further comprises:
a defining module, configured to define a threshold identifier in a setting file after the target adjustment request is sent to the adjustment device, where a value of the threshold identifier is used to identify a number threshold corresponding to a number of description information stored in the register;
And the adjusting module is used for calling an entry function to adjust the value of the threshold identifier to the target quantity threshold.
In an exemplary embodiment, the server includes a clock chip, the clock chip is connected to the target firmware, the clock chip is used for recording the current time of the server, and the apparatus includes:
A second detection module, configured to detect a target address of the clock chip before the candidate number of pieces of candidate description information accumulated by the register in the server at the current time is queried, where the target address is used for extracting the time recorded in the clock chip;
and the extraction module is used for accessing the target address and extracting the time recorded in the clock chip as the current time.
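As an illustration of extracting the time through the clock chip address, the following sketch decodes day, hour, minute, and second values from BCD-encoded registers. The application does not specify the clock chip; the register offsets assume the widely used MC146818-compatible CMOS RTC layout, and `read_register` is a hypothetical callback standing in for whatever access mechanism the firmware uses.

```python
from typing import Callable, Tuple

# Register offsets of the MC146818-compatible RTC layout (assumed chip).
RTC_SECONDS = 0x00
RTC_MINUTES = 0x02
RTC_HOURS = 0x04
RTC_DAY_OF_MONTH = 0x07


def bcd_to_int(value: int) -> int:
    """Convert one packed-BCD byte (e.g. 0x59) to its integer value (59)."""
    return (value >> 4) * 10 + (value & 0x0F)


def read_current_time(read_register: Callable[[int], int]) -> Tuple[int, int, int, int]:
    """Return (day, hour, minute, second) read through `read_register`.

    `read_register` is a hypothetical accessor taking a register offset and
    returning the raw byte; BCD mode is assumed.
    """
    day = bcd_to_int(read_register(RTC_DAY_OF_MONTH))
    hour = bcd_to_int(read_register(RTC_HOURS))
    minute = bcd_to_int(read_register(RTC_MINUTES))
    second = bcd_to_int(read_register(RTC_SECONDS))
    return day, hour, minute, second
```

Passing the accessor in as a parameter keeps the decoding logic testable without hardware, for example by backing it with a dictionary of register values.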
In one exemplary embodiment, the first processing module includes:
In the case that the bus ports include N bus ports, the candidate number of the candidate description information recorded by the register at the current time is queried by the following units, where the register is used to record the description information of faults occurring in the N bus ports and in the bus device connected to each of the N bus ports, and N is a positive integer:
The second detection unit is used for detecting N groups of port description information of the N bus ports, wherein the ith group of port description information in the N groups of port description information is used for indicating the ith bus port in the N bus ports, and i is a positive integer less than or equal to N;
A calling unit, configured to call the target interrupt to poll, according to the N groups of port description information, the number of pieces of description information accumulated in the register at the current time for each of the N bus ports and the bus device connected to it, so as to obtain N numbers;
An execution unit, configured to sum the N numbers to obtain the candidate number.
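The polling and summation performed by these units can be sketched as follows. This is an illustration only: the `Bdf` tuple as the port description information, the function name `count_candidates`, and the `read_error_count` callback (standing in for the interrupt-driven poll of one port and its attached device) are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical port description: (bus, device, function) of one bus port.
Bdf = Tuple[int, int, int]


def count_candidates(
    ports: Sequence[Bdf],
    read_error_count: Callable[[Bdf], int],
) -> int:
    """Poll each of the N ports and sum the N per-port counts."""
    per_port: List[int] = [read_error_count(bdf) for bdf in ports]
    return sum(per_port)
```

With three ports whose polled counts are 2, 0, and 5, the candidate number is 7.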
In one exemplary embodiment, the first detection module includes:
A third detection unit configured to detect whether the candidate number is greater than or equal to the target number threshold;
The first determining unit is used for determining that the first detection result is used for indicating that the first matching condition is met between the candidate quantity and the target quantity threshold value when the candidate quantity is detected to be larger than or equal to the target quantity threshold value, and determining that the first detection result is used for indicating that the first matching condition is not met between the candidate quantity and the target quantity threshold value when the candidate quantity is detected to be smaller than the target quantity threshold value.
In an exemplary embodiment, the first detection module further includes:
A fourth detection unit, configured to detect whether the target duration is less than or equal to the duration threshold;
A second determining unit, configured to determine that the second detection result is used for indicating that the second matching condition is satisfied between the target duration and the duration threshold when the target duration is detected to be less than or equal to the duration threshold, and determine that the second detection result is used for indicating that the second matching condition is not satisfied between the target duration and the duration threshold when the target duration is detected to be greater than the duration threshold.
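The two detections reduce to simple comparisons. In the sketch below, the first matching condition holds when the candidate number is greater than or equal to the target number threshold, and the second matching condition holds when the target duration is less than or equal to the duration threshold, as in the method. The function names and the use of seconds as the time unit are assumptions made for illustration.

```python
def first_detection(candidate_number: int, target_number_threshold: int) -> bool:
    """True when the first matching condition is satisfied."""
    return candidate_number >= target_number_threshold


def second_detection(
    current_time_s: int,
    initial_time_s: int,
    duration_threshold_s: int,
) -> bool:
    """True when the second matching condition is satisfied.

    The target duration is the time elapsed between the initial time and
    the current time (seconds are an assumed unit).
    """
    target_duration_s = current_time_s - initial_time_s
    return target_duration_s <= duration_threshold_s
```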
In one exemplary embodiment, the second processing module includes:
A screening unit, configured to screen, from the candidate description information, a number of pieces of description information equal to the target number threshold as the target description information, in a case where the first detection result is used to indicate that the first matching condition is satisfied between the candidate number and the target number threshold, and the second detection result is used to indicate that the second matching condition is satisfied between the target duration and the duration threshold;
A third determining unit, configured to determine the candidate description information as the target description information when the first detection result is used to indicate that the first matching condition is not satisfied between the candidate number and the target number threshold, and/or the second detection result is used to indicate that the second matching condition is not satisfied between the target duration and the duration threshold.
In an exemplary embodiment, the second processing module further includes:
A fourth determining unit, configured to determine to transmit the update description information to the controller, when the first detection result is used to indicate that the first matching condition is satisfied between the candidate number and the target number threshold, and the second detection result is used to indicate that the second matching condition is satisfied between the target duration and the duration threshold;
A fifth determining unit, configured to determine not to transmit the update description information to the controller, if the first detection result is used to indicate that the first matching condition is not satisfied between the candidate number and the target number threshold, and/or the second detection result is used to indicate that the second matching condition is not satisfied between the target duration and the duration threshold.
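The screening and transmission decisions made by these units can be sketched together. The function name `screen_and_decide` and the list of strings standing in for description information are hypothetical; the sketch also assumes that the candidates are held in storage order and that the target description information is the earliest threshold-many entries, which is consistent with the update description information being stored after the target description information.

```python
from typing import List, Tuple


def screen_and_decide(
    candidates: List[str],
    first_ok: bool,
    second_ok: bool,
    target_number_threshold: int,
) -> Tuple[List[str], List[str]]:
    """Return (to_delete, to_transmit) for the current candidates."""
    if first_ok and second_ok:
        # Both conditions met: delete threshold-many entries and transmit
        # the entries stored after them (the update description information).
        to_delete = candidates[:target_number_threshold]
        to_transmit = candidates[target_number_threshold:]
        return to_delete, to_transmit
    # Either condition not met: every candidate is deleted, nothing is sent.
    return list(candidates), []
```

Returning both lists from one function keeps the two decisions consistent, since they depend on the same pair of detection results.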
In one exemplary embodiment, the controller includes a data table, the server includes a record function, and the third processing module includes:
And the second processing unit is used for calling the recording function to record the update description information into the data table one by one and transmitting the update description information to the controller one by one.
It should be noted that each of the above modules may be implemented by software or hardware, and the latter may be implemented by, but not limited to, the above modules all being located in the same processor, or each of the above modules being located in different processors in any combination.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media in which a computer program may be stored.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.
Embodiments of the application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
Embodiments of the present application also provide a computer program comprising computer instructions stored in a computer readable storage medium, a processor of a computer device reading the computer instructions from the computer readable storage medium, the processor executing the computer instructions to cause the computer device to perform the steps of any of the method embodiments described above.
For specific examples of this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations, which are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented on a general purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. They may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases the steps shown or described may be performed in an order different from that shown or described herein. Alternatively, they may be fabricated separately into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description covers only the preferred embodiments of the present application and is not intended to limit it; those skilled in the art can make various modifications and variations to the present application. Any modification, equivalent replacement, improvement, or the like made within the principle of the present application shall fall within the protection scope of the present application.

Claims (15)

1. A method for controlling server fault information, characterized in that
the method is applied to target firmware in a server, the method comprising:
Querying the candidate quantity of candidate description information accumulated by a register in the server at the current time, and extracting the initial time at which the first piece of the candidate description information was stored in the register, wherein the initial time is earlier than the current time, and the register is used for storing description information of faults occurring at a bus port on the server and description information of faults occurring at a bus device connected with the bus port;
Detecting whether a first matching condition is met between the candidate quantity and a target quantity threshold value to obtain a first detection result, and detecting whether a second matching condition is met between a target duration between the current time and the initial time and a duration threshold value to obtain a second detection result, wherein the target quantity threshold value is determined by the target firmware according to the influence degree of faults of the bus port and the bus equipment on the running performance of the server;
Screening target description information to be deleted from the candidate description information according to the first detection result and the second detection result, and determining whether to transmit update description information stored in the register to a controller in the server, wherein the update description information is the description information stored in the register after the target description information, and the candidate description information comprises the update description information;
Deleting the target description information and transmitting the update description information to the controller in the case that the update description information is determined to be transmitted to the controller.
2. The method of claim 1, characterized in that
The server comprises an adjusting device connected with the target firmware, and before the detecting whether the first matching condition is met between the candidate quantity and the target quantity threshold value to obtain the first detection result, the method further comprises:
Receiving a target adjustment request, wherein the target adjustment request is used for requesting to adjust a quantity threshold corresponding to the quantity of the description information stored in the register to the target quantity threshold;
and sending the target adjustment request to the adjustment device, wherein the adjustment device is used for executing the target adjustment request.
3. The method of claim 2, characterized in that
The server includes a target interface, and the receiving a target adjustment request includes:
Detecting a first editing operation performed on a first field of the target interface, and detecting a second editing operation performed on a second field of the target interface, wherein the first field is used for setting the influence degree of faults of the bus port and the bus device on the running performance of the server, the second field is used for setting a quantity threshold value for the quantity of description information stored in the register corresponding to that influence degree, the first editing operation is used for setting the influence degree as a target influence degree, and the second editing operation is used for adjusting the quantity threshold value corresponding to the target influence degree to the target quantity threshold value;
and generating the target adjustment request corresponding to the first editing operation and the second editing operation, and transmitting the target adjustment request to the target firmware.
4. The method of claim 2, characterized in that
After said sending said target adjustment request to said adjustment device, said method further comprises:
Defining a threshold identifier in a setting file, wherein the value of the threshold identifier is used for identifying a quantity threshold corresponding to the quantity of the descriptive information stored in the register;
And calling an entry function to adjust the value of the threshold identifier to the target quantity threshold.
5. The method of claim 1, characterized in that
The server comprises a clock chip connected with the target firmware, the clock chip is used for recording the current time of the server, and before the querying the candidate quantity of the candidate description information accumulated by the register in the server at the current time, the method further comprises:
detecting a target address of the clock chip, wherein the target address is used for extracting time recorded in the clock chip;
and extracting the time recorded in the clock chip as the current time by accessing the target address.
6. The method of claim 1, characterized in that
The querying the candidate quantity of the candidate description information accumulated by the register in the server at the current time comprises:
in the case that the bus ports include N bus ports, querying the candidate quantity of the candidate description information recorded by the register at the current time through the following steps, wherein the register is used to record the description information of faults occurring in the N bus ports and in the bus device connected to each of the N bus ports, and N is a positive integer:
Detecting N groups of port description information of the N bus ports, wherein the i-th group of port description information in the N groups of port description information is used for indicating the i-th bus port in the N bus ports, and i is a positive integer less than or equal to N;
calling a target interrupt to poll, according to the N groups of port description information, the number of pieces of description information accumulated in the register at the current time for each of the N bus ports and the bus device connected to it, so as to obtain N numbers;
And summing the N numbers to obtain the candidate quantity.
7. The method of claim 1, characterized in that
The detecting whether the first matching condition is met between the candidate quantity and the target quantity threshold value to obtain a first detection result comprises:
Detecting whether the candidate number is greater than or equal to the target number threshold;
determining that the first detection result is used for indicating that the first matching condition is met between the candidate number and the target number threshold value when the candidate number is detected to be larger than or equal to the target number threshold value; and in the case that the candidate quantity is detected to be smaller than the target quantity threshold value, determining that the first detection result is used for indicating that the first matching condition is not met between the candidate quantity and the target quantity threshold value.
8. The method of claim 1, characterized in that
The detecting whether the second matching condition is met between the duration threshold and the target duration between the current time and the initial time to obtain a second detection result comprises:
detecting whether the target duration is less than or equal to the duration threshold;
Determining that the second detection result is used for indicating that the second matching condition is met between the target duration and the duration threshold value under the condition that the target duration is detected to be smaller than or equal to the duration threshold value; and under the condition that the target time length is detected to be larger than the time length threshold value, determining that the second detection result is used for indicating that the second matching condition is not met between the target time length and the time length threshold value.
9. The method of claim 1, characterized in that
The screening the target description information to be deleted from the candidate description information according to the first detection result and the second detection result includes:
screening, from the candidate description information, a number of pieces of description information equal to the target quantity threshold value as the target description information under the condition that the first detection result is used for indicating that the first matching condition is met between the candidate quantity and the target quantity threshold value and the second detection result is used for indicating that the second matching condition is met between the target duration and the duration threshold value;
And determining the candidate description information as the target description information under the condition that the first detection result is used for indicating that the first matching condition is not met between the candidate quantity and the target quantity threshold value, and/or the second detection result is used for indicating that the second matching condition is not met between the target duration and the duration threshold value.
10. The method of claim 1, characterized in that,
The determining whether to transmit the update description information stored in the register to a controller in the server includes:
determining to transmit the update description information to the controller in the case where the first detection result indicates that the first matching condition is met between the candidate quantity and the target quantity threshold, and the second detection result indicates that the second matching condition is met between the target duration and the duration threshold;
and determining not to transmit the update description information to the controller in the case where the first detection result indicates that the first matching condition is not met between the candidate quantity and the target quantity threshold, and/or the second detection result indicates that the second matching condition is not met between the target duration and the duration threshold.
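The transmission decision in this claim reduces to a logical AND of the two detection results. The sketch below enumerates all four combinations; the function and variable names are hypothetical and chosen only for illustration.

```python
from itertools import product


def should_transmit_updates(first_condition_met: bool,
                            second_condition_met: bool) -> bool:
    """Transmit only when both matching conditions are met."""
    return first_condition_met and second_condition_met


# Decision table keyed by (first result, second result).
decision_table = {
    (first, second): should_transmit_updates(first, second)
    for first, second in product((True, False), repeat=2)
}
```

Only the `(True, True)` entry of `decision_table` maps to a decision to transmit; the other three combinations correspond to the "and/or" branch of the claim.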
11. The method of claim 1, characterized in that,
The controller includes a data table, the server includes a recording function, and the transmitting the update description information to the controller includes:
calling the recording function to record the update description information into the data table one by one, so as to transmit the update description information to the controller one by one.
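One way to picture the entry-by-entry recording in this claim is sketched below. The `Controller` class, the `record_entry` function and the list-based data table are hypothetical stand-ins; the claim does not specify the actual interface between the server and the controller.

```python
from typing import Iterable, List


class Controller:
    """Hypothetical stand-in for the controller and its data table."""

    def __init__(self) -> None:
        self.data_table: List[str] = []


def record_entry(controller: Controller, entry: str) -> None:
    """Hypothetical recording function: writes one entry into the data table."""
    controller.data_table.append(entry)


def transmit_updates(controller: Controller, updates: Iterable[str]) -> int:
    """Record the update entries one by one and return how many were recorded."""
    count = 0
    for entry in updates:
        # One call per entry, so the controller receives them in stored order.
        record_entry(controller, entry)
        count += 1
    return count
```

Calling `transmit_updates(Controller(), ["e4", "e5"])` in this sketch records the two entries in order and returns 2.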
12. A control apparatus for server failure information, the apparatus being applied to target firmware in a server, the apparatus comprising:
the first processing module is used for querying, at the current time, the candidate quantity of candidate description information accumulated in a register in the server, and extracting an initial time at which the first piece of description information among the candidate description information was stored into the register, wherein the initial time is earlier than the current time, and the register is used for storing description information of faults occurring at a bus port on the server and description information of faults occurring at bus equipment connected with the bus port;
the first detection module is used for detecting whether a first matching condition is met between the candidate quantity and a target quantity threshold value to obtain a first detection result, and detecting whether a second matching condition is met between a target duration between the current time and the initial time and a duration threshold value to obtain a second detection result, wherein the target quantity threshold value is determined by the target firmware according to the influence degree of faults of the bus port and the bus equipment on the operation performance of the server;
The second processing module is used for screening target description information to be deleted from the candidate description information according to the first detection result and the second detection result, and determining whether to transmit update description information stored in the register to a controller in the server, wherein the update description information is the description information stored in the register after the target description information, and the candidate description information comprises the update description information;
and the third processing module is used for deleting the target description information and transmitting the update description information to the controller under the condition that the update description information is determined to be transmitted to the controller.
13. A computer-readable storage medium, characterized in that,
the computer-readable storage medium has stored therein a computer program, wherein the computer program, when executed by a processor, implements the steps of the method as claimed in any one of claims 1 to 11.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that,
The processor, when executing the computer program, implements the steps of the method as claimed in any one of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that,
the computer program, when executed by a processor, implements the steps of the method as claimed in any one of claims 1 to 11.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411706757.2A CN119201535B (en) 2024-11-26 2024-11-26 Server fault information control method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN119201535A CN119201535A (en) 2024-12-27
CN119201535B CN119201535B (en) 2025-03-18

Family

ID=94075562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411706757.2A Active CN119201535B (en) 2024-11-26 2024-11-26 Server fault information control method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN119201535B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115941438A (en) * 2022-11-08 2023-04-07 苏州浪潮智能科技有限公司 Method and device for processing fault information, storage medium and electronic device
CN115964218A (en) * 2022-12-28 2023-04-14 新华三信息技术有限公司 Method and device for identifying fault of high-speed serial computer expansion bus equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050102488A1 (en) * 2003-11-07 2005-05-12 Bullis George A. Firmware description language for accessing firmware registers
CN117389790B (en) * 2023-12-13 2024-02-23 苏州元脑智能科技有限公司 Firmware detection system, method, storage medium and server capable of recovering faults

Also Published As

Publication number Publication date
CN119201535A (en) 2024-12-27

Similar Documents

Publication Publication Date Title
TWI229796B (en) Method and system to implement a system event log for system manageability
US8977905B2 (en) Method and system for detecting abnormality of network processor
CN110594180A (en) Control method and system of server heat dissipation controller
CN113176963B (en) PCIe fault self-repairing method, device, equipment and readable storage medium
CN107729213B (en) Background task monitoring method and device
CN113590429A (en) Server fault diagnosis method and device and electronic equipment
US20200305300A1 (en) Method for remotely clearing abnormal status of racks applied in data center
CN114610567A (en) Container monitoring method, network device and storage medium
CN108932007A (en) Method for acquiring time stamp and computer device
CN102375775B (en) A kind of computer system with detection system unrecoverable error indication signal
CN116319618A (en) Switch operation control method, device, system, equipment and storage medium
US10754722B1 (en) Method for remotely clearing abnormal status of racks applied in data center
US10842041B2 (en) Method for remotely clearing abnormal status of racks applied in data center
CN119201535B (en) Server fault information control method and device, storage medium and electronic device
CN113946448B (en) A server cluster timing management method, device and electronic equipment
CN107026759A (en) The firmware and its development approach of a kind of remote management BBU modules based on BMC
JP6040894B2 (en) Log generation apparatus and log generation method
CN118656245A (en) A method, device, electronic device and medium for handling server exceptions
CN116015425B (en) Optical module control method and device, storage medium and electronic device
TW202026879A (en) Method for remotely clearing abnormal status of racks applied in data center
CN113742166A (en) Log recording method, device and system for server system device
CN111414267A (en) Far-end eliminating method for abnormal state of cabinet applied to data center
CN111414274A (en) Remote exclusion method for abnormal state of cabinets in data centers
CN112052147A (en) Monitoring method, electronic device and storage medium
CN111416721A (en) Far-end eliminating method for abnormal state of cabinet applied to data center

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant