CN120803802B - Error reporting methods and electronic devices for processor platforms - Google Patents

Error reporting methods and electronic devices for processor platforms

Info

Publication number
CN120803802B
Authority
CN
China
Prior art keywords
error
processor platform
data
flag bit
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202511311814.1A
Other languages
Chinese (zh)
Other versions
CN120803802A (en
Inventor
田卓
王一鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IEIT Systems Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN202511311814.1A
Publication of CN120803802A
Application granted
Publication of CN120803802B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

This application discloses a method and electronic device for error reporting on a processor platform, relating to the field of computer hardware technology. The method includes acquiring multiple error identifiers of the processor platform, which characterize the types of faults occurring on the platform. Based on historical data and real-time operating parameters, a predictive model is constructed. This model improves the predictive maintenance capabilities of the processor platform. Then, based on the error identifiers and the predictive model, the types of faults that the processor platform may experience in the future are determined. Therefore, this method solves the technical problem in related technologies where it is difficult to provide early warnings of potential faults in processor platforms. It achieves the technical effect of timely detection of impending faults in the processor platform, allowing for timely handling based on the prediction results, and avoiding missed opportunities to take preventative measures before faults occur.

Description

Error reporting method of processor platform and electronic equipment
Technical Field
The present application relates to the field of computer hardware, and in particular, to a method for reporting errors in a processor platform and an electronic device.
Background
In modern data centers and high-performance computing environments, the reliability, availability, and serviceability (RAS) of a system are critical to ensuring business continuity and reducing operation and maintenance costs. On current server platforms, the BIOS (Basic Input/Output System) is the core component for hardware initialization and management, and its RAS error reporting mechanism relies mainly on APEI (ACPI Platform Error Interface) and other standard interfaces to record and report error information. However, the RAS mechanism of existing BIOS implementations focuses on recording and reporting errors that have already occurred. Its capacity for in-depth analysis of historical error data and for predicting potential faults is limited, so the opportunity to take precautions before a fault occurs is missed.
Disclosure of Invention
The application provides an error reporting method of a processor platform and an electronic device, which at least solve the technical problem in the related art that it is difficult for a processor platform to give early warning of potential faults.
The application provides an error reporting method of a processor platform, which comprises: obtaining a plurality of error identifications of the processor platform, wherein the error identifications represent information about fault types of the processor platform; constructing a prediction model according to historical data and real-time operation parameters, wherein the historical data and the real-time operation parameters at least comprise error logs of a basic input/output system of the processor platform, hardware operation parameters, and environment monitoring data; determining, according to the plurality of error identifications and the prediction model, the fault types that the processor platform will generate in a future period; and reporting the fault types to the processor platform and issuing an early warning.
The application also provides an electronic device, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for implementing the steps of any one of the above error reporting methods of the processor platform when executing the computer program.
The error reporting method acquires a plurality of error identifications of the processor platform. These identifications cover the types and characteristics of the platform's errors and provide basic data for subsequent error analysis and prediction. A prediction model is constructed from the historical data and the real-time operation parameters, which improves the predictive maintenance capability of the processor platform, and the fault types that the processor platform will generate in a future period are then determined from the error identifications and the prediction model. This solves the technical problem in the related art that it is difficult for a processor platform to give early warning of potential faults: impending faults can be detected in time and handled promptly according to the prediction result, so the opportunity to take preventive measures before a fault occurs is not missed.
Drawings
For a clearer description of embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a block diagram of the hardware architecture of a server device for an error reporting method of a processor platform according to an embodiment of the present application;
FIG. 2 is a flowchart of an error reporting method of a processor platform according to an embodiment of the present application;
FIG. 3 is a flowchart of another error reporting method of a processor platform according to an embodiment of the present application.
Wherein the above figures include the following reference numerals:
102: processor; 104: memory; 106: transmission device; 108: input/output device.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.
It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.
The specific application environment architecture or specific hardware architecture upon which execution of the error reporting method of the processor platform depends is described herein.
The method embodiments provided in the embodiments of the present application may be executed in a server apparatus or similar computing device. Taking the example of running on a server device, fig. 1 is a hardware block diagram of a server device of a method for reporting errors of a processor platform according to an embodiment of the present application. As shown in fig. 1, the server device may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a central processing unit CPU, a microprocessor MCU, a programmable logic device FPGA, etc.) and a memory 104 for storing data, where the server device may further include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those of ordinary skill in the art that the architecture shown in fig. 1 is merely illustrative and is not intended to limit the architecture of the server apparatus described above. For example, the server device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The error reporting method is mainly applied to environments with strict requirements on RAS (Reliability Availability Serviceability), such as a data center server, a high-performance computing system, an embedded computing platform and the like. Specifically, the application environment architecture it relies on is as follows:
In a large-scale server cluster, the method can work with the BIOS and BMC on multiple nodes to detect and report faults across nodes, improving the stability and the operation and maintenance efficiency of the whole cluster.
In a virtualized data center, the method can reach through the virtualization layer to monitor physical hardware errors and issue early warnings in real time, ensuring continuous operation of virtual machines and services.
The technical scheme of the application is particularly aimed at the following hardware architecture:
BMC with an integrated VGA controller: the BMC serves as the hardware management controller, and its integrated VGA controller is key to efficient data interaction. The VGA controller of the BMC provides an independent video memory space that the BIOS can access and write error data to directly, greatly reducing data transmission delay.
TLS and AES-256 encryption hardware: to ensure data security and compliance, the method relies on hardware that supports TLS (Transport Layer Security) and AES-256 (Advanced Encryption Standard with a 256-bit key) encryption. TLS encrypts data during network transmission, while AES-256 encrypts data at rest; together they form a double line of defense for data protection.
AGESA (AMD Generic Encapsulated Software Architecture) microcode-compatible hardware: the hardware should support the AGESA microcode of the platform server, which is the core component for low-level initialization and configuration of the processor. AGESA provides a software foundation that is tightly integrated with the processor's error detection and reporting mechanism.
In the application, the BMC in the platform server integrates a VGA controller and has an independent video memory address space. The video memory address of the BMC is mapped into the system memory address space through the Hardware Abstraction Layer (HAL) configuration of the BIOS, so that the BIOS can access it directly.
Some technical terms involved are as follows:
Reliability, availability, and serviceability (RAS) describe the ability of a system to keep functioning properly during operation. Together, these three concepts form the basis for evaluating and designing high-performance, mission-critical systems. RAS emphasizes the ability of the system to operate without failure for a predetermined period (reliability), to continue operating or recover quickly when a failure occurs (availability), and to be maintained and upgraded without being shut down (serviceability). RAS is key to ensuring business continuity and data integrity in data centers, cloud computing environments, and enterprise applications. Through comprehensive optimization of hardware design, firmware functions, management software, and service strategies, RAS aims to maximize system uptime, minimize the impact of faults, support rapid repair and upgrade, and reduce planned and unplanned downtime.
The Basic Input/Output System (BIOS) is the low-level firmware that runs when a computer starts. It is responsible for tasks such as the hardware self-test, initializing peripheral devices, and loading the operating system boot program. The BIOS acts as a bridge between the computer hardware and the operating system, ensuring that hardware devices are properly identified and used by the operating system. During startup, the BIOS checks the hardware configuration and sets system parameters, which is important for stable operation and efficient startup.
The Baseboard Management Controller (BMC) is a management chip independent of the host processor, responsible for remote monitoring and management of the system. It is a microprocessor embedded on the server motherboard that operates independently of the main CPU. The BMC collects information such as temperature, voltage, and fan speed through various sensors and hardware interfaces, and supports remote control of the server over a network. The BMC is a core component of server RAS (reliability, availability, serviceability) capability: it monitors system health in real time and performs functions such as remote power on and off, fault early warning, and logging, thereby improving the availability and serviceability of the system and reducing the monitoring load on administrators.
The Machine Check Architecture (MCA) is a key mechanism in the processor for detecting and reporting hardware errors; it is mainly used to identify and record hardware faults that occur at runtime. The design goal of MCA is to collect as much detailed error information as possible when a hardware failure occurs so that the operating system or BIOS can take appropriate corrective action or perform failure analysis, thereby improving the reliability, availability, and serviceability (RAS) of the system.
The Boot Error Record Table (BERT) is a table in the BIOS dedicated to recording errors that occur during system boot. It acts as a system log for the initialization phase, where it can capture many hardware-level and firmware-level errors. The BERT table helps system administrators and engineers quickly diagnose startup problems, provides early indications of hardware faults, and facilitates maintenance and debugging. It usually contains information such as the error code, error type, time of occurrence, and location of the error, which is important for troubleshooting.
Transport Layer Security (TLS) is a protocol that provides data encryption and authentication in network communications.
The embodiment of the application provides an error reporting method of a processor platform, which is described in detail by combining an execution flow of the error reporting method of the processor platform.
According to the error reporting method of the processor platform provided by the application, as shown in fig. 2, the error reporting method comprises the following steps:
Step S1: a plurality of error identifications of the processor platform are obtained, the error identifications representing information about fault types of the processor platform.
Specifically, while the processor platform is running, error identifications from the MCA mechanism are continuously monitored and collected. Through the MCA record registers, each error identification carries not only the error type but also detailed information such as the time of occurrence, CPU number, memory channel number, and error address, providing basic data for subsequent error classification and fault prediction. The error identification is derived from the acquired hardware error information and its error type. Hardware error information includes, but is not limited to, memory errors, CPU errors, and bus errors. Errors that occur during startup of the processor platform can be obtained from the BERT table records. Real-time, comprehensive collection of error identifications makes it possible to track the health of the processor platform promptly and provides detailed data for subsequent intelligent analysis and early warning, improving both the response speed of the processor platform and the accuracy of fault detection.
Step S2: a prediction model is constructed according to historical data and real-time operation parameters, wherein the historical data and the real-time operation parameters at least comprise error logs of the basic input/output system of the processor platform, hardware operation parameters, and environment monitoring data.
Specifically, when constructing the prediction model, the error logs recorded by the BIOS, including but not limited to BERT table records and MCA error logs, are first analyzed in depth, and features such as the type, frequency, and scope of impact of the errors are extracted from them. Meanwhile, hardware operation parameters (including but not limited to real-time parameters such as CPU temperature, memory voltage, and fan speed) and environment monitoring data (such as machine-room temperature, humidity, and power supply state) are continuously monitored and supplied. These multi-source data are integrated and fed into a machine learning model, which is trained to identify potential failure modes and early-warning signals. By combining historical data with real-time parameters, the prediction model can predict hardware faults that may occur in the future based on the current hardware state and operating environment, and issue an early warning 36/72 hours in advance. This reduces the risk of unplanned shutdown and improves the availability and the operation and maintenance efficiency of the processor platform.
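The feature-extraction part of step S2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `LogEntry` and `LogFeatures` types, the four-way error-type encoding, the 32-CPU limit, and the function name `extract_features` are all assumptions introduced here. It counts errors per type inside a time window (frequency) and the number of distinct CPUs involved (scope of impact).

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_ERROR_TYPES 4u  /* hypothetical encoding: 0=memory, 1=CPU, 2=bus, 3=other */

typedef struct {
    uint64_t timestamp;   /* seconds since an arbitrary epoch */
    uint8_t  error_type;  /* one of the hypothetical type codes above */
    uint32_t cpu_id;      /* CPU that reported the error */
} LogEntry;

typedef struct {
    uint32_t count[NUM_ERROR_TYPES];  /* errors per type inside the window */
    uint32_t cpus_affected;           /* distinct CPUs (ids 0..31) inside the window */
} LogFeatures;

/* Summarize the log entries whose age relative to `now` is at most `window`. */
static LogFeatures extract_features(const LogEntry *log, size_t n,
                                    uint64_t now, uint64_t window)
{
    LogFeatures f = {{0}, 0};
    uint32_t cpu_mask = 0;

    for (size_t i = 0; i < n; i++) {
        /* Skip entries from the future or older than the window. */
        if (log[i].timestamp > now || now - log[i].timestamp > window)
            continue;
        if (log[i].error_type < NUM_ERROR_TYPES)
            f.count[log[i].error_type]++;
        if (log[i].cpu_id < 32u)
            cpu_mask |= 1u << log[i].cpu_id;
    }
    /* Count the set bits: one bit per distinct CPU seen. */
    for (; cpu_mask != 0; cpu_mask &= cpu_mask - 1u)
        f.cpus_affected++;
    return f;
}
```

A feature vector like this, joined with the hardware and environment readings, is the kind of input a trained model would consume; the model itself is outside the scope of this sketch.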
Step S3: the fault types that the processor platform will generate in a future period are determined according to the plurality of error identifications and the prediction model.
Specifically, when a new hardware error occurs and a new error identification is detected, the error identification data is input into the trained prediction model. By analyzing the data, the model identifies likely failure modes and outputs a prediction of the fault types for a future period (for example, the coming week), including the specific fault type and the expected probability of occurrence. The current hardware operation parameters of the processor platform can be input into the prediction model at the same time, so that the current operating condition and the predicted faults are evaluated together and more targeted maintenance measures are obtained. Combining real-time error data with the prediction model allows the maintenance strategy to be adjusted dynamically so that high-risk faults are handled first, which shortens fault response time and improves the focus and efficiency of maintenance.
Step S4: the fault types are reported to the processor platform and an early warning is issued.
Specifically, once the prediction model determines the fault types that may occur in the future, the BIOS is immediately controlled to report them to the BMC through the optimized data interaction mechanism (efficient transmission based on the video memory of the VGA controller). The BMC is then controlled to forward the early warning information to the server management system or to notify operation and maintenance personnel so that preventive measures can be taken, such as an automatic restart and repair or isolation of memory chips.
With this error reporting method, a plurality of error identifications of the processor platform are acquired. These identifications cover the types and characteristics of the platform's errors and provide basic data for subsequent error analysis and prediction. A prediction model is constructed from the historical data and the real-time operation parameters, which improves the predictive maintenance capability of the processor platform, and the fault types that may occur on the processor platform in a future period are then determined from the error identifications and the prediction model. This solves the technical problem in the related art that it is difficult for a processor platform to give early warning of potential faults: impending faults can be detected in time and handled promptly according to the prediction result, so the opportunity to take preventive measures before a fault occurs is not missed.
In some optional embodiments, the method further comprises: initializing the basic input/output system and acquiring read permission for the video memory space; acquiring the address and capacity of the video memory space according to the read permission; and dividing off a buffer of preset capacity within the video memory space. Completing the video memory configuration during the BIOS initialization stage improves the efficiency of error data reporting: the data interaction time between the BIOS and the BMC falls from the traditional 300 ms to below 50 ms, a reduction of about 83%, and system startup time is shortened by 15% on average, reducing unnecessary delay.
First, address mapping is obtained: in the BIOS initialization stage, the VGA controller information of the BMC is read through the PCI configuration space to obtain the video memory base address and the video memory size.
Second, permissions are set: a Memory Mapping Register (MMR) of the processor platform is set to give the BIOS read-write permission to the video memory area.
Third, the buffer is allocated: a fixed-size ring buffer (for example, 16 KB) is divided off in the BMC video memory to store error data. The buffer structure is as follows:
#include <stdint.h>

typedef struct {
    uint32_t head;         // write pointer
    uint32_t tail;         // read pointer
    uint8_t  data[16384];  // data storage area
    uint8_t  flags;        // status flag
} BiosBmcSharedBuffer;
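How the head and tail pointers might drive such a ring buffer can be sketched as below. The patent gives only the structure layout, so the conventions here are assumptions: `head` and `tail` are byte offsets into `data`, one byte is kept free to distinguish full from empty, and the helper names (`ring_used`, `ring_free`, `ring_write`, `ring_read`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 16384u

typedef struct {
    uint32_t head;                 /* write offset into data[] */
    uint32_t tail;                 /* read offset into data[] */
    uint8_t  data[RING_CAPACITY];  /* data storage area */
    uint8_t  flags;                /* status flag */
} BiosBmcSharedBuffer;

/* Number of bytes written but not yet read. */
static uint32_t ring_used(const BiosBmcSharedBuffer *rb)
{
    return (rb->head + RING_CAPACITY - rb->tail) % RING_CAPACITY;
}

/* Free space; one byte stays unused so head == tail always means "empty". */
static uint32_t ring_free(const BiosBmcSharedBuffer *rb)
{
    return RING_CAPACITY - 1u - ring_used(rb);
}

/* Producer side: copy len bytes in, wrapping at the end. Returns 0 or -1 if full. */
static int ring_write(BiosBmcSharedBuffer *rb, const void *src, uint32_t len)
{
    const uint8_t *p = src;
    if (len > ring_free(rb))
        return -1;
    for (uint32_t i = 0; i < len; i++) {
        rb->data[rb->head] = p[i];
        rb->head = (rb->head + 1u) % RING_CAPACITY;
    }
    return 0;
}

/* Consumer side: copy len bytes out. Returns 0, or -1 if fewer bytes are stored. */
static int ring_read(BiosBmcSharedBuffer *rb, void *dst, uint32_t len)
{
    uint8_t *p = dst;
    if (len > ring_used(rb))
        return -1;
    for (uint32_t i = 0; i < len; i++) {
        p[i] = rb->data[rb->tail];
        rb->tail = (rb->tail + 1u) % RING_CAPACITY;
    }
    return 0;
}
```

In this arrangement the writer only advances `head` and the reader only advances `tail`, which is what lets two parties share the buffer without a lock. Real firmware would additionally need memory barriers and uncached access to the video memory window.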
PCI (Peripheral Component Interconnect) is a high-speed bus standard for connecting computer hardware. Originally developed by Intel, it was designed to provide faster data transfer and better performance than the earlier ISA bus. The PCI bus greatly advanced the standardization and compatibility of computer hardware, allowing peripherals from different manufacturers (for example, graphics cards, sound cards, and network cards) to communicate with the motherboard through a unified interface. Each PCI device has a configuration space, a region used to store and read the device's configuration information. The configuration space contains the device identification, function information, status registers, and so on, which the operating system or BIOS needs in order to initialize and configure the device. During startup, the BIOS or operating system identifies and configures PCI devices by accessing the PCI configuration space. For example, the VGA controller information of the BMC is read, and the video memory base address and size are obtained, by accessing specific registers in the PCI configuration space. The configuration space is typically accessed through the PCI configuration registers, which allow the host to read and write a device's configuration information. The host writes the device's bus number, device number, function number, and register offset to the PCI configuration address register (PCI Configuration Address Register), and then either reads the PCI configuration data register (PCI Configuration Data Register) to obtain the configuration data or writes a new configuration value to it to update the device state.
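The address-register encoding described above can be sketched as follows, assuming the legacy x86 configuration mechanism (address port 0xCF8, data port 0xCFC), which the patent does not name explicitly. The helper names are hypothetical, and the sketch only computes values; the privileged port I/O is left out so the code runs anywhere.

```c
#include <stdint.h>

/* Legacy x86 PCI configuration ports (assumed; not named in the source text). */
#define PCI_CONFIG_ADDRESS 0xCF8u
#define PCI_CONFIG_DATA    0xCFCu

/* Value written to the configuration address register:
 * bit 31 enable, bits 23:16 bus, 15:11 device, 10:8 function,
 * bits 7:2 dword-aligned register offset. */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev,
                                   uint8_t func, uint8_t offset)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1Fu) << 11)
         | ((uint32_t)(func & 0x07u) << 8)
         | (uint32_t)(offset & 0xFCu);
}

/* Base address held in a 32-bit memory BAR (the low 4 bits are type flags). */
static uint32_t pci_mem_bar_base(uint32_t bar)
{
    return bar & ~0xFu;
}

/* Region size, decoded from the value read back after writing 0xFFFFFFFF
 * to the BAR: clear the flag bits, invert, add one. */
static uint32_t pci_mem_bar_size(uint32_t readback)
{
    return ~(readback & ~0xFu) + 1u;
}
```

For example, reading the first BAR (offset 0x10) of bus 1, device 2, function 3 would start by writing `pci_config_address(1, 2, 3, 0x10)` to the address port; the base and size helpers then decode what comes back from the data port.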
In some alternative embodiments, obtaining the plurality of error identifications of the processor platform includes obtaining a plurality of error information of the processor platform and constructing an error structure based on the plurality of error information and a preset order.
Specifically, error detection is performed first: the MCA mechanism of the processor is controlled to detect hardware errors and notify the BIOS through an interrupt. Data packaging is then performed: the BIOS is controlled to read the error information from the MCA registers and assemble it into an error structure in a preset order. The error structure is as follows:
#include <stdint.h>

typedef struct {
    uint8_t  error_type;      // MCA error type
    uint8_t  severity;        // severity level
    uint32_t cpu_id;          // CPU number
    uint32_t bank_id;         // memory channel number
    uint64_t error_address;   // error address
    uint32_t mca_record[16];  // MCA record register values
    uint64_t timestamp;       // timestamp
    uint32_t crc32;           // CRC check code
} McaIrqData;
When new error information is detected as input to the basic input/output system, the error information is converted into an error identification according to the error structure and a preset format. Constructing a structured error identification simplifies data interaction between the BIOS and the BMC, reduces the complexity of data transmission, and ensures the reliability and integrity of transmission. Error-reporting latency falls from an average of 10 seconds to within 1 second, an improvement of 90%.
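Filling the `crc32` field of such a record might look like the sketch below. The patent does not say which CRC variant is used, so the standard CRC-32 (reflected polynomial 0xEDB88320) is an assumption, as are the helper names `mca_init`, `mca_seal`, and `mca_verify`. Hashing the in-memory struct, padding included, is a simplification; the record is zeroed first so the padding bytes are deterministic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  error_type;      /* MCA error type */
    uint8_t  severity;        /* severity level */
    uint32_t cpu_id;          /* CPU number */
    uint32_t bank_id;         /* memory channel number */
    uint64_t error_address;   /* error address */
    uint32_t mca_record[16];  /* MCA record register values */
    uint64_t timestamp;       /* timestamp */
    uint32_t crc32;           /* CRC check code */
} McaIrqData;

/* Bitwise CRC-32 (assumed variant: reflected polynomial 0xEDB88320). */
static uint32_t crc32_ieee(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Zero the whole record, padding included, before the fields are filled in. */
static void mca_init(McaIrqData *rec)
{
    memset(rec, 0, sizeof *rec);
}

/* Store the CRC of every byte that precedes the crc32 field. */
static void mca_seal(McaIrqData *rec)
{
    rec->crc32 = crc32_ieee(rec, offsetof(McaIrqData, crc32));
}

/* Recompute the CRC and compare; returns 1 if the record is intact. */
static int mca_verify(const McaIrqData *rec)
{
    return rec->crc32 == crc32_ieee(rec, offsetof(McaIrqData, crc32));
}
```

The producer would call `mca_init`, fill the fields from the MCA registers, and call `mca_seal` last; any later change to a covered field makes `mca_verify` fail. A production format would more likely serialize the fields explicitly so that the checksum does not depend on compiler padding.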
In some alternative embodiments, converting the error information into an error identification according to the error structure includes:
The write pointer of the buffer of the processor platform is obtained; information may be stored to the data storage area only after the write pointer has been obtained.
The error structure is written into the data storage area of the buffer at the position given by the write pointer, and the error data check value (CRC) in the data storage area is updated according to the error structure. The data storage area temporarily stores the error structure; the VGA video memory space of the BMC thus serves as a staging medium for information passed between the BIOS and the BMC. Specifically, after detecting a hardware error (such as an MCA error or a memory error), the BIOS packages the error data in the preset structure format, calculates a CRC check code, and writes the packaged data together with the check code into the designated area of the BMC's VGA video memory. When the data is later read, the video memory can be accessed in parallel through a dedicated hardware interface to obtain the error data. This replaces the traditional serial IPMI communication mode and improves data interaction efficiency: a single BIOS-BMC interaction falls from 300 ms with traditional IPMI to below 50 ms, a reduction of about 83%. Because the interaction latency between the BIOS and the BMC is lower, platform startup time falls by about 15% on average, from 45 seconds to 38 seconds. To ensure compatibility, the platform also retains IPMI communication as a fallback scheme and automatically switches to IPMI mode when video memory interaction is abnormal.
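As a quick check of the percentages quoted above, taking the stated figures at face value:

```latex
\[
\frac{300\,\mathrm{ms} - 50\,\mathrm{ms}}{300\,\mathrm{ms}} \approx 83.3\%,
\qquad
\frac{45\,\mathrm{s} - 38\,\mathrm{s}}{45\,\mathrm{s}} \approx 15.6\%.
\]
```

Because the interaction time is described as below 50 ms, 83% is a lower bound on the reduction in interaction time.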
According to a preset period, a plurality of READY FLAG bits of a buffer area are acquired, whether the READY FLAG bits are target READY FLAG bits (ready_flag=1) is judged, updated error data check values in a data storage area are read under the condition that the READY FLAG bits are the target READY FLAG bits, whether data in an error structure body are normal or not is determined according to the updated error data check values, the CRC is the unique check value of an error data packet, the BIOS sends the data packet to the BMC, and the BMC takes the CRC check values as check evidence to ensure that the data is not damaged.
And under the condition that the data in the error structural body is determined to be normal, analyzing the normal error structural body according to a preset format to obtain an error mark. The predetermined format is the structure of the erroneous structure. By setting the ready flag bit and verifying the CRC, the reliability of error reporting is obviously enhanced, and the fault detection accuracy of the system can be maintained to be more than 90% even under a high-load environment, so that the overall stability of the system is enhanced.
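The write, check and poll sequence described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the video memory region is simulated with a `bytearray`, and the header layout, the use of CRC-32 via `zlib.crc32`, and the names `bios_write` and `bmc_poll` are all assumptions, since the source does not specify the buffer layout or the CRC variant.

```python
import struct
import zlib

# Hypothetical buffer layout (not specified in the source):
#   byte 0      ready_flag
#   byte 1      ack_flag
#   bytes 2-3   payload length (little-endian uint16)
#   bytes 4-7   CRC-32 of the payload (little-endian uint32)
#   bytes 8..   data storage area
HEADER = struct.Struct("<BBHI")
DATA_OFFSET = HEADER.size  # 8 bytes


def bios_write(buf: bytearray, payload: bytes) -> None:
    """BIOS side: store the error structure, update the CRC, set ready_flag=1."""
    if len(payload) > len(buf) - DATA_OFFSET:
        raise ValueError("payload larger than data storage area")
    write_ptr = DATA_OFFSET  # write pointer into the data storage area
    buf[write_ptr:write_ptr + len(payload)] = payload
    crc = zlib.crc32(payload)
    # The header, including ready_flag, is written only after the payload.
    HEADER.pack_into(buf, 0, 1, 0, len(payload), crc)


def bmc_poll(buf: bytearray):
    """BMC side: one polling step; return the payload if ready and intact."""
    ready, _ack, length, crc = HEADER.unpack_from(buf, 0)
    if ready != 1:
        return None  # nothing to read yet
    payload = bytes(buf[DATA_OFFSET:DATA_OFFSET + length])
    if zlib.crc32(payload) != crc:
        return None  # corrupted data; a real system might fall back to IPMI
    return payload
```

In this sketch the BMC would call `bmc_poll` once per preset period and parse the returned bytes according to the preset format only when the result is not `None`.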
The MCA mechanism described above monitors various possible hardware failure points at the processor core level, including but not limited to memory systems (e.g., memory controllers, data paths, and caches), I/O subsystems, system buses, power management units, clock units, and the like. When an error is detected, the MCA triggers an interrupt (Machine Check Interrupt, MCI) and saves the error information in a set of special registers called Machine Check Registers (MCR).
The MCA mechanism can report multiple types of errors, which fall into two broad categories, uncorrectable errors (Uncorrectable Error, UCE) and correctable errors (Correctable Error, CE):
Uncorrectable errors (UCEs): such errors typically mean that a hardware component is permanently damaged and the system needs to respond immediately, for example by shutting down the affected CPU core or restarting the entire system.
Correctable errors (CEs): in contrast, CEs are typically transient and can be repaired by the error correction capability of the hardware itself, such as the error correction of ECC memory. Nevertheless, frequently occurring CEs may indicate that the hardware is about to develop UCEs. The error information in the present application may be of either of the two error types described above, and targeted processing is performed according to the detected error type.
In some alternative embodiments, the method further comprises:
After the error identification is obtained, outputting a confirmation flag bit (ack_flag=1) to the basic input output system;
and when the basic input output system receives the confirmation flag bit, clearing the ready flag bit and the confirmation flag bit of the buffer of the processor platform.
Specifically, according to the error structure, the BMC is controlled to parse the error data and remap it to the memory of the buffer in the BMC to obtain the error identification. A confirmation flag bit is then output to the basic input output system, the fault type is obtained according to the error identification and the prediction model, and the data is reported. Introducing the confirmation flag bit provides two-way communication confirmation between the BIOS and the BMC and a timely flag-clearing mechanism, which avoids redundant storage of data, improves buffer utilization, and reduces the possibility of data collisions.
In some optional embodiments, the method further comprises: before the confirmation flag bit is output to the basic input output system, when the ready flag bit is the target ready flag bit, acquiring the confirmation flag bit according to a preset period and judging whether it is the target preset flag bit; and when the confirmation flag bit is the target preset flag bit, clearing the ready flag bit and the confirmation flag bit of the buffer of the processor platform.
Specifically, after the BIOS sets the ready flag bit, it polls the confirmation flag bit. Once the confirmation flag bit is observed, the interaction is complete, and the ready flag bit and the confirmation flag bit are cleared to release the buffer space for new error data. This ensures dynamic use of the buffer and a quick response from the platform, reduces idle waiting time, and improves the real-time performance of the platform.
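The two-way handshake described above can be sketched as a pair of polling steps. This is only an illustration under stated assumptions: the `Flags` class and the function names `bmc_acknowledge` and `bios_poll_ack` are hypothetical, and the flags are modelled as plain integers rather than bytes in video memory.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    """Hypothetical in-memory stand-in for the two flag bits of the buffer."""
    ready_flag: int = 0
    ack_flag: int = 0


def bmc_acknowledge(flags: Flags) -> bool:
    """BMC side: after parsing a ready record, set ack_flag=1."""
    if flags.ready_flag == 1 and flags.ack_flag == 0:
        flags.ack_flag = 1
        return True
    return False


def bios_poll_ack(flags: Flags) -> bool:
    """BIOS side: one polling step; clear both flags once the ack is seen."""
    if flags.ready_flag == 1 and flags.ack_flag == 1:
        flags.ready_flag = 0
        flags.ack_flag = 0
        return True  # buffer released for the next error record
    return False
```

The BIOS would call `bios_poll_ack` once per preset period after setting `ready_flag`, and write new error data only after the call returns `True`.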
In some alternative embodiments, the method further comprises:
A historical error log is obtained, the data in it is cleaned, and features are extracted to obtain a target historical error log. The extracted data includes at least the error type and the occurrence frequency, and may also include the influence range, such as a single-DIMM failure or a cross-CPU-node failure. The historical error log includes, but is not limited to, BERT table records and MCA error logs.
A machine learning model is trained according to the target historical error log and the error category to obtain an error classification model, wherein the machine learning model includes, but is not limited to, a random forest or an LSTM algorithm;
and the error identification is classified according to the error classification model.
Specifically, when a new error occurs (on detection of an error, the MCA mechanism is controlled to generate an interrupt that notifies the BIOS), the error data is input into the trained model, which outputs the error type. Through in-depth analysis of historical data and machine-learning-driven model training, new errors can be classified automatically and accurately, which improves the accuracy and efficiency of error identification. Intelligent error classification reduces the time an administrator needs to locate critical issues by 60%, from an average of 30 minutes to 12 minutes.
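The cleaning and feature-extraction step that precedes model training can be sketched as follows. The record fields `type` and `scope` and the function name `extract_features` are assumptions made for illustration; the source does not define the log schema, and the training of the random forest or LSTM model itself is outside the scope of this sketch.

```python
from collections import Counter


def extract_features(raw_log):
    """Clean log records and count occurrences per (error type, influence range).

    Each record is assumed to be a dict with optional 'type' and 'scope' keys.
    Records without an error type are dropped during cleaning.
    """
    counts = Counter()
    for rec in raw_log:
        err_type = (rec.get("type") or "").strip().upper()
        scope = (rec.get("scope") or "").strip().lower()
        if not err_type:
            continue  # unusable record
        counts[(err_type, scope or "unknown")] += 1
    return [
        {"type": t, "scope": s, "frequency": n}
        for (t, s), n in sorted(counts.items())
    ]
```

The resulting rows (error type, influence range, occurrence frequency) correspond to the features the text names and could serve as training input for a classifier.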
In some alternative embodiments, the method further comprises:
Determining the priorities of the plurality of error identifications according to target characteristics of the plurality of error identifications, wherein the target characteristics comprise the influence range of the error identifications and the load state of the processor platform.
The error priority is adjusted dynamically according to factors such as the influence range of the error (for example, a single-DIMM fault or a cross-CPU-node fault) and the current load state of the platform (for example, CPU occupancy and memory utilization). If the current platform load is high, even a slight hardware error may quickly deteriorate into a severe fault, so the model automatically raises the priority of such errors to ensure a timely response. Dynamic adjustment lets errors that may significantly affect overall performance be handled first, which improves the timeliness of fault response and helps maintain system stability. In high-load scenarios in particular, the mechanism can effectively prevent an initially minor error from causing a system crash, further enhancing the availability and reliability of the platform. For example, when the system CPU occupancy reaches 85%, slight errors (such as single-DIMM correctable errors) can be promoted to high-priority processing, which avoids potential fault escalation and keeps the system running stably under heavy load.
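A minimal sketch of such a load-aware priority rule is shown below. Only the 85% CPU occupancy threshold and the promotion of single-DIMM correctable errors come from the text; the scoring weights, the scope labels, and the function name `error_priority` are assumptions chosen for illustration.

```python
def error_priority(severity: str, scope: str, cpu_load: float) -> str:
    """Return 'high', 'medium' or 'low' for an error.

    severity: 'CE' (correctable) or 'UCE' (uncorrectable)
    scope:    'single_dimm' or 'cross_node' (hypothetical labels)
    cpu_load: CPU occupancy in the range 0.0 to 1.0
    """
    score = {"CE": 1, "UCE": 3}[severity]  # assumed base weights
    if scope == "cross_node":
        score += 1  # wider influence range
    if cpu_load >= 0.85:
        score += 2  # high load: minor errors are promoted
    if score >= 3:
        return "high"
    if score == 2:
        return "medium"
    return "low"
```

With these weights, a single-DIMM correctable error is low priority on a lightly loaded platform and high priority once CPU occupancy reaches 85%, matching the example in the text.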
In some alternative embodiments, constructing a predictive model based on historical data and real-time operating parameters includes:
Historical data and real-time operation parameters are obtained, wherein the historical data and the real-time operation parameters comprise error logs of the basic input output system, hardware operation parameters, microcode data of the processor platform, and environment monitoring data. The BIOS error logs include, but are not limited to, BERT table records, MCA error logs, and memory self-test results. The hardware operation parameters include, but are not limited to, CPU temperature, memory voltage, and fan speed. The AGESA microcode data includes, but is not limited to, low-level processor error information such as cache errors and bus errors. The environment monitoring data includes, but is not limited to, machine room temperature, humidity, and power supply state.
A machine learning model is trained according to the historical data and the real-time operation parameters to obtain the prediction model, wherein the machine learning model comprises at least one of a random forest or a long short-term memory (LSTM) network. The prediction model obtained in this way has an early-warning accuracy of more than 90% for memory faults and can issue a warning 72 hours in advance. Preventive maintenance can be carried out in time according to the prediction result, which reduces unplanned downtime by 50% and lowers the MTTR (mean time to repair) from 4 hours to 2.4 hours. Intelligent error classification also reduces the time an administrator needs to locate critical problems by 60%, from 30 minutes to 12 minutes.
In terms of hardware, the processor platform of the application supports AGESA.2.0 and above microcode. In terms of software, it supports the video memory interaction interface with mainstream BMC firmware (such as AST2500 and iKVM) while retaining an IPMI 2.0 compatibility mode. In terms of data security, TLS-encrypted transmission and AES-256 storage encryption ensure the confidentiality of error data, and blockchain-based evidence storage ensures that the data cannot be tampered with. In terms of compliance support, the platform meets requirements such as ISO 27001 and PCI-DSS, and the auditability of error reports reaches enterprise-level standards. In terms of audit trail, the whole process of generating, transmitting, and storing error reports can be traced through the blockchain-based evidence storage, which satisfies compliance audit requirements.
To optimize error reporting, a multi-level error reporting strategy is introduced. When the BIOS detects a hardware error, it first performs a preliminary analysis and decides, according to the severity and influence range of the error, whether the error must be reported to the BMC immediately or can be handled at the BIOS level. For minor, correctable errors, the BIOS may attempt a local repair, for example through the PPR (Post Package Repair) mechanism. For severe, uncorrectable errors, the BIOS triggers reporting immediately so that the BMC and the upper-level management platform can respond in time.
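The routing decision of this multi-level strategy can be sketched as a small function. The rule that only narrowly scoped correctable errors are repaired locally, the scope labels, and the name `route_error` are assumptions; the source states only that severity and influence range drive the decision.

```python
def route_error(correctable: bool, scope: str) -> str:
    """Decide where an error is handled: 'local_repair' or 'report_to_bmc'.

    scope is a hypothetical label such as 'single_dimm' or 'cross_node'.
    """
    if not correctable:
        return "report_to_bmc"  # severe errors are reported immediately
    if scope == "single_dimm":
        return "local_repair"   # e.g. attempt a repair such as PPR in the BIOS
    return "report_to_bmc"      # correctable but wide-ranging: escalate
```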
Users may define the trigger conditions and strategy for error early warning according to their own service requirements and the operating environment of the platform. Through the management interfaces of the BIOS and the BMC, a user can set specific error types and parameter thresholds (for example, CPU temperature above 80 °C or memory error frequency above 10 per hour). When the platform detects an error that meets these conditions, it automatically starts the early-warning process and takes the corresponding preventive measures. Providing user interfaces for custom settings improves the configurability of the system and the user experience, so that non-specialists can also manage the RAS features of the system easily.
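User-defined thresholds of this kind can be represented as simple rules evaluated against each monitoring sample. The sketch below is illustrative only: the `WarningRule` type, the metric names, and `triggered_rules` are hypothetical, while the two threshold values are the examples given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarningRule:
    """One user-defined early-warning condition (hypothetical schema)."""
    metric: str
    threshold: float


def triggered_rules(rules, sample):
    """Return the metrics whose sampled value exceeds the configured threshold."""
    return [
        r.metric
        for r in rules
        if sample.get(r.metric, float("-inf")) > r.threshold
    ]


# Example thresholds taken from the text: 80 degrees C, 10 errors per hour.
RULES = [
    WarningRule("cpu_temp_c", 80.0),
    WarningRule("mem_errors_per_hour", 10.0),
]
```

A platform service could evaluate `triggered_rules(RULES, sample)` on each sample and start the early-warning process whenever the returned list is not empty.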
In order to enable those skilled in the art to more clearly understand the technical solution of the present application, the following describes in detail the implementation procedure of the error reporting method of the processor platform according to the present application with reference to specific embodiments.
Examples
As shown in FIG. 3, the BIOS is controlled to detect hardware errors of the platform. A plurality of error information of the processor platform is obtained, an error structure is constructed from the error information in a preset order, the error data is packaged, and a new CRC check code is calculated.
The BIOS is controlled to write the error structure into the VGA video memory and to set the ready flag bit of the buffer. In detail, the write pointer of the buffer of the processor platform is acquired, the error structure is written into the data storage area of the buffer according to the write pointer, and the error data check value in the data storage area is updated according to the error structure; the data storage area temporarily stores the error structure. The BMC is controlled to monitor the ready flag bit: it acquires the ready flag bit of the buffer according to a preset period and judges whether it is the target ready flag bit. When the ready flag bit is 1, the BMC reads the video memory data, including the updated error data check value in the data storage area, and verifies the CRC check code to determine whether the data in the error structure is normal. After verification passes, the BMC processes the error data, that is, it parses the normal error structure according to the preset format to obtain the error identification. After the error identification is obtained, the confirmation flag bit is set to 1.
While the error identification is being obtained, the BIOS polls whether the confirmation flag bit is 1. That is, when the ready flag bit is 1, the confirmation flag bit is acquired according to a preset period and it is judged whether it is the target preset flag bit. When both the confirmation flag bit and the ready flag bit are 1, the ready flag bit and the confirmation flag bit of the buffer of the processor platform are cleared (set to 0).
At the same time, the fault types that the processor platform may generate in a future period are determined according to the obtained error identification and the prediction model, and the fault types are reported to the processor platform together with an early warning.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment.
The embodiment of the application also provides an error reporting device of the processor platform. The device comprises an acquisition module, a construction module, a determination module and a reporting module. The acquisition module is used for acquiring a plurality of error identifications of the processor platform, the error identifications being used for representing information of fault types of the processor platform. The construction module is used for constructing a prediction model according to historical data and real-time operation parameters, the historical data and the real-time operation parameters at least comprising error logs of a basic input and output system of the processor platform, hardware operation parameters and environment monitoring data. The determination module is used for determining the fault types generated by the processor platform in a future period according to the plurality of error identifications and the prediction model. The reporting module is used for reporting the fault types to the processor platform and carrying out early warning.
The description of the features in the embodiment corresponding to the error reporting device of the processor platform may refer to the related description of the embodiment corresponding to the error reporting method of the processor platform, which is not described herein in detail.
The embodiment of the application also provides an electronic device comprising a memory and a processor, the memory storing a computer program, the processor being arranged to run the computer program to perform the steps of the error reporting method embodiment of any of the processor platforms described above.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform, when run, the steps of an error reporting method embodiment of any of the processor platforms described above.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium in which a computer program may be stored.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the error reporting method embodiment of any of the processor platforms described above.
Embodiments of the present application also provide another computer program product, including a non-volatile computer readable storage medium, where the non-volatile computer readable storage medium stores a computer program, where the computer program when executed by a processor implements the steps of any of the foregoing error reporting method embodiments of the processor platform.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The error reporting method of the processor platform and the electronic equipment provided by the application are described in detail. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

Claims (8)

1. An error reporting method for a processor platform, comprising:
Acquiring a plurality of error identifications of a processor platform, wherein the error identifications are used for representing information of fault types of the processor platform, a basic input and output system of the processor platform comprises a video memory space, and the video memory space comprises a buffer zone; wherein the acquiring comprises: acquiring a plurality of error information of the processor platform; constructing an error structure body according to the plurality of error information and a preset sequence; and, under the condition that new error information is detected to be input into the basic input and output system, converting the error information into an error identification according to the error structure body and a preset format, which comprises: obtaining a write pointer of the buffer zone of the processor platform; writing the error structure body into a data storage zone of the buffer zone according to the write pointer; and updating an error data check value in the data storage zone according to the error structure body, wherein the data storage zone is used for temporarily storing the error structure body;
Determining whether the data in the error structure body is normal or not according to the updated error data check value, and analyzing the normal error structure body according to the preset format under the condition of determining that the data in the error structure body is normal to obtain the error identification;
Acquiring a historical error log, cleaning data in the historical error log, and extracting features to obtain a target historical error log, wherein the extracted data at least comprises an error type, an influence range and an occurrence frequency;
Constructing a prediction model according to historical data and real-time operation parameters, wherein the historical data and the real-time operation parameters at least comprise error logs, hardware operation parameters and environment monitoring data of a basic input and output system of the processor platform;
determining a type of fault generated by the processor platform in a future period of time according to a plurality of the error identifications and the prediction model;
and reporting the fault type to a processor platform and carrying out early warning.
2. The error reporting method of claim 1, wherein the obtaining a plurality of error identifications of a processor platform comprises:
acquiring a plurality of error information of the processor platform;
constructing an error structure body according to a plurality of error information and a preset sequence;
and under the condition that the input of new error information into the basic input and output system is detected, converting the error information into error identification according to an error structural body and a preset format.
3. The error reporting method of claim 2, wherein the method further comprises:
initializing the basic input/output system;
acquiring a video memory space reading authority, and acquiring an address and a capacity of the video memory space according to the reading authority;
And dividing a buffer area with preset capacity in the video memory space.
4. The error reporting method of claim 1, wherein the method further comprises:
After the error identification is obtained, outputting a confirmation flag bit to the basic input/output system;
and under the condition that the basic input and output system receives the confirmation flag bit, clearing the ready flag bit and the confirmation flag bit of the buffer zone of the processor platform.
5. The error reporting method of claim 4, further comprising:
Before outputting a confirmation flag bit to the basic input and output system, acquiring the confirmation flag bit according to a preset period under the condition that the ready flag bit is a target ready flag bit, judging whether the confirmation flag bit is a target preset flag bit, and clearing the ready flag bit and the confirmation flag bit of the buffer zone of the processor platform under the condition that the confirmation flag bit is the target preset flag bit.
6. The error reporting method of claim 1, wherein the method further comprises:
and determining priorities of a plurality of error identifications according to target characteristics of the error identifications, wherein the target characteristics comprise the influence range of the error identifications and the load state of the processor platform.
7. The error reporting method of claim 1, wherein the constructing a predictive model based on historical data and real-time operating parameters comprises:
Acquiring the historical data and the real-time operation parameters, wherein the historical data and the real-time operation parameters comprise error logs of the basic input and output system, the hardware operation parameters, microcode data of the processor platform and the environment monitoring data;
training a machine learning model according to the historical data and the real-time operation parameters to obtain the prediction model, wherein the machine learning model comprises at least one of a random forest or a long short-term memory network.
8. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the error reporting method of the processor platform according to any one of claims 1 to 7 when executing the computer program.
CN202511311814.1A 2025-09-15 2025-09-15 Error reporting methods and electronic devices for processor platforms Active CN120803802B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511311814.1A CN120803802B (en) 2025-09-15 2025-09-15 Error reporting methods and electronic devices for processor platforms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511311814.1A CN120803802B (en) 2025-09-15 2025-09-15 Error reporting methods and electronic devices for processor platforms

Publications (2)

Publication Number Publication Date
CN120803802A CN120803802A (en) 2025-10-17
CN120803802B true CN120803802B (en) 2025-12-02

Family

ID=97322500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511311814.1A Active CN120803802B (en) 2025-09-15 2025-09-15 Error reporting methods and electronic devices for processor platforms

Country Status (1)

Country Link
CN (1) CN120803802B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119201525A (en) * 2024-09-20 2024-12-27 浪潮计算机科技有限公司 Error analysis method, system, electronic device and medium
CN119883843A (en) * 2024-12-30 2025-04-25 北京天地汇云科技有限公司 Fault prediction method and device and baseboard management controller

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011100269A (en) * 2009-11-05 2011-05-19 Renesas Electronics Corp Cache system
CN117493065B (en) * 2023-12-28 2024-03-22 苏州元脑智能科技有限公司 Method and device for processing processor information, storage medium and electronic equipment
CN120086053B (en) * 2025-05-06 2025-08-01 苏州元脑智能科技有限公司 Server and processor error processing method, device, medium and program product
CN120474904B (en) * 2025-07-11 2025-09-26 苏州元脑智能科技有限公司 Switch chip fault analysis method and device

Also Published As

Publication number Publication date
CN120803802A (en) 2025-10-17

Similar Documents

Publication Publication Date Title
CN114328102B (en) Equipment state monitoring method, equipment state monitoring device, equipment and computer readable storage medium
US11163623B2 (en) Serializing machine check exceptions for predictive failure analysis
TWI229796B (en) Method and system to implement a system event log for system manageability
WO2024230401A1 (en) Baseboard management controller system operation method and apparatus, device, and non-volatile readable storage medium
CN119883843A (en) Fault prediction method and device and baseboard management controller
CN120560897B (en) Server memory management system and cluster system
CN120353632A (en) Memory fault repairing method, device, equipment, medium and computer program product
CN120086053A (en) Server and processor error handling method, device, medium and program product
CN117312094A (en) A server hardware monitoring and collection method based on time series analysis algorithm
US20140201566A1 (en) Automatic computer storage medium diagnostics
CN109634796A (en) A kind of method for diagnosing faults of computer, apparatus and system
JP5440073B2 (en) Information processing apparatus, information processing apparatus control method, and control program
CN114422395A (en) Link diagnosis method and device
EP3534259B1 (en) Computer and method for storing state and event log relevant for fault diagnosis
CN120803802B (en) Error reporting methods and electronic devices for processor platforms
CN115543665A (en) Memory reliability evaluation method and device and storage medium
CN119046050B (en) Fault reporting method, electronic equipment, medium and computer program product
CN119690754A (en) A memory fault processing method, device, medium and server
CN119271474A (en) Server self-check control method, device, equipment and storage medium
CN100369009C (en) Monitoring system and method using system management interrupt signal
JP2013109722A (en) Computer, computer system and failure information management method
CN117724916A (en) Safe memory bank abnormality detection method
CN118747165A (en) Method, device, computer equipment and storage medium for reading log data
CN118916200A (en) Abnormality positioning method, device, equipment and medium
CN117591351A (en) Training method of disk failure detection model and disk failure detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant