CN118550747A - PCIe fatal error quick positioning method, system, electronic equipment and medium - Google Patents

Info

Publication number
CN118550747A
Authority
CN
China
Prior art keywords
error
pcie
information
fatal
configuration space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410544375.8A
Other languages
Chinese (zh)
Inventor
朱金平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410544375.8A
Publication of CN118550747A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/221 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test buses, lines or interfaces, e.g. stuck-at or open line faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The disclosure provides a method, system, electronic device and medium for quickly locating PCIe fatal errors. It belongs to the field of PCIe error locating and addresses two problems: traditional methods cannot effectively locate PCIe fatal errors, and the resulting troubleshooting progress cannot meet production or user requirements. The method comprises: when an error occurs in a PCIe device, reporting exception information that carries an error type to the PCIe root node of the CPU; when the error type is an unrecoverable fatal error, determining whether the advanced error reporting function is enabled; when the advanced error reporting function is not enabled, sending a query command to the PCIe device according to the BIOS settings to obtain information about the PCIe device; and storing the information about the PCIe device in a baseboard management controller.

Description

A method, system, electronic device and medium for quickly locating a PCIe fatal error

Technical Field

The present disclosure relates to the field of PCIe error locating, and in particular to a method, system, electronic device and medium for quickly locating a PCIe fatal error.

Background Art

In a server, most internal data transfers use PCIe (Peripheral Component Interconnect Express) signals. During PCIe transfers, the very high link rate and abnormal handling of information at a port often cause PCIe link errors.

A fatal error requires a system restart to reset the link, which causes data loss. As a result, for fatal errors that appear only after long periods of operation or only intermittently, neither the location nor the cause of the error can be determined effectively. Handling becomes complicated, and troubleshooting progress cannot meet production or user requirements.

Therefore, a method is urgently needed that can quickly locate the PCIe link on which a fatal error occurred and determine the cause of the error.

Summary of the Invention

To overcome the problems in the related art, the present disclosure provides a method, system, electronic device and medium for quickly locating a PCIe fatal error. The technical solution of the present disclosure is as follows:

According to a first aspect of the embodiments of the present disclosure, a method for quickly locating a PCIe fatal error is provided, the method comprising:

when an error occurs in a PCIe device, reporting exception information to the PCIe root node of the CPU; the exception information carries an error type; the error types include: recoverable error, unrecoverable non-fatal error and unrecoverable fatal error;

when the error type is an unrecoverable fatal error, determining whether the advanced error reporting function is enabled;

when the advanced error reporting function is not enabled, sending a query command to the PCIe device according to the BIOS settings to obtain information about the PCIe device; the information includes one or more of the following: identification information of the erroring device, the error cause, all error-related configuration space register information in the configuration space of the PCIe device, and the configuration space register information of the upper-layer device to which the PCIe device belongs;

storing the information about the PCIe device in a baseboard management controller.

Optionally, sending a query command to the PCIe device according to the BIOS settings to obtain information about the PCIe device includes:

sending, according to the BIOS settings, a first query command to the erroring device over the PCIe link; the erroring device is the PCIe device that reported the exception information;

when the erroring device responds to the first query command, querying the configuration space registers of the erroring device to obtain the error cause and all error-related configuration space register information in the configuration space of the erroring device;

when the erroring device does not respond to the first query command, determining the upper-layer device to which the erroring device belongs;

sending a second query command to that upper-layer device;

when the upper-layer device responds to the second query command, querying the configuration space registers of the upper-layer device to obtain the error cause of the erroring device and the configuration space register information of the upper-layer device.
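
The two-stage query above can be sketched roughly as follows. This is only an illustration under stated assumptions: `locate_fatal_error`, `QueryResult`, the `read_config` callback, the `upstream_of` map and the device addresses are all hypothetical names invented for this sketch, and the source does not specify how a query command is actually issued.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical callback: reads a device's error-related registers and
# returns None when the device does not respond to the query command.
ConfigReader = Callable[[str], Optional[Dict[str, int]]]


@dataclass
class QueryResult:
    error_device: str                  # device that reported the exception
    answered_by: Optional[str]         # device whose registers were read
    registers: Dict[str, int] = field(default_factory=dict)


def locate_fatal_error(error_device: str,
                       upstream_of: Dict[str, str],
                       read_config: ConfigReader) -> QueryResult:
    """Query the erroring device first; fall back to its upper-layer device."""
    regs = read_config(error_device)            # first query command
    if regs is not None:
        return QueryResult(error_device, error_device, regs)

    parent = upstream_of.get(error_device)      # determine the upper-layer device
    if parent is not None:
        regs = read_config(parent)              # second query command
        if regs is not None:
            return QueryResult(error_device, parent, regs)

    return QueryResult(error_device, None)      # nothing responded


# Made-up topology: the endpoint is silent, its upstream port still answers.
topology = {"0000:3b:00.0": "0000:3a:00.0"}
fake_registers = {"0000:3a:00.0": {"device_status": 0x0004}}
result = locate_fatal_error("0000:3b:00.0", topology, fake_registers.get)
```

Passing the reader as a callback keeps the fallback logic separate from whatever mechanism the firmware really uses to reach configuration space.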

Optionally, storing the information about the PCIe device in the baseboard management controller includes:

collecting target information transferred between the CPU and the PCIe device;

collecting, from the log information of the PCIe device, target log information related to the running state of the PCIe device;

storing the information about the PCIe device, the target information and the target log information in the baseboard management controller.

Optionally, before reporting the exception information to the PCIe root node of the CPU, the method further includes: detecting, as required by the PCIe specification, whether an error has occurred in the running state of the PCIe device;

when an error occurs in the PCIe device, recording the data associated with the exception information in a configuration space register of the PCIe device; this data includes the error type;

generating, from this data, the exception information carrying the error type.

Optionally, when the error type is an unrecoverable fatal error and the advanced error reporting function is enabled, the method further includes:

obtaining, from the uncorrectable error status register, the error cause contained in the data associated with the exception information; this data is uploaded automatically by the PCIe device to the uncorrectable error status register and the error status severity register, and it also includes all error-related configuration space register information in the configuration space of the advanced error reporting function.

Optionally, the method further includes:

in response to the reporting of the exception information, triggering the interrupt event corresponding to the error type;

after storing the information about the PCIe device in the baseboard management controller, the method further includes:

obtaining a preset handling scheme; the preset handling scheme specifies the handling method for each error type; the handling methods include: masking the error, resetting the device and restarting the system;

determining a target handling method according to the preset handling scheme and the error type carried by the exception information;

handling the error that occurred in the PCIe device according to the target handling method;

releasing the interrupt event after the error in the PCIe device has been handled.
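
As a rough sketch of the handling sequence above, the preset handling scheme can be pictured as a lookup table from error type to handling method. The pairing used here (mask, reset, restart) is an assumption made for illustration; the source lists the three methods but does not say which error type maps to which. `ErrorType`, `Action`, `PRESET_SCHEME` and `handle_error` are hypothetical names.

```python
from enum import Enum
from typing import Dict, List


class ErrorType(Enum):
    RECOVERABLE = "recoverable"
    NON_FATAL = "unrecoverable non-fatal"
    FATAL = "unrecoverable fatal"


class Action(Enum):
    MASK_ERROR = "mask the error"
    RESET_DEVICE = "reset the device"
    RESTART_SYSTEM = "restart the system"


# Assumed pairing, for illustration only: the source does not fix it.
PRESET_SCHEME: Dict[ErrorType, Action] = {
    ErrorType.RECOVERABLE: Action.MASK_ERROR,
    ErrorType.NON_FATAL: Action.RESET_DEVICE,
    ErrorType.FATAL: Action.RESTART_SYSTEM,
}


def handle_error(error_type: ErrorType,
                 scheme: Dict[ErrorType, Action] = PRESET_SCHEME) -> List[str]:
    """Return the ordered steps taken for one reported error."""
    steps = [f"trigger interrupt for {error_type.value} error"]
    if error_type is ErrorType.FATAL:
        # Information is saved before any action that could destroy it.
        steps.append("store device information in the BMC")
    steps.append(scheme[error_type].value)   # target handling method
    steps.append("release interrupt")
    return steps


fatal_steps = handle_error(ErrorType.FATAL)
```

The point of the ordering is that the device information reaches the baseboard management controller before a reset or restart can erase it.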

Optionally, the method further includes:

releasing the interrupt event when the error type is the recoverable error or the unrecoverable non-fatal error.

According to a second aspect of the embodiments of the present disclosure, a system for quickly locating a PCIe fatal error is provided, the system comprising: a PCIe device, a CPU, a BIOS and a baseboard management controller; the PCIe device is configured to report exception information to the PCIe root node of the CPU when it detects that an error has occurred in itself; the exception information carries an error type; the error types include: recoverable error, unrecoverable non-fatal error and unrecoverable fatal error;

the system is configured to determine whether the advanced error reporting function is enabled when the error type is an unrecoverable fatal error;

the system is configured to send a query command to the PCIe device according to the BIOS settings when the advanced error reporting function is not enabled, to obtain information about the PCIe device; the information includes one or more of the following: identification information of the erroring device, the error cause, all error-related configuration space register information in the configuration space of the PCIe device, and the configuration space register information of the upper-layer device to which the PCIe device belongs;

the baseboard management controller is configured to store the information about the PCIe device.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for quickly locating a PCIe fatal error described in the first aspect.

According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the method for quickly locating a PCIe fatal error described in the first aspect.

According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method for quickly locating a PCIe fatal error described in the first aspect.

In the embodiments of the present disclosure, the exception information reported by a PCIe device carries the error type to which it belongs. By classifying errors, the system can take handling measures that match the severity and type of each error, which protects system stability and data integrity. When the error type is an unrecoverable fatal error and the advanced error reporting function is not enabled, the present disclosure obtains, according to the BIOS settings, information about the PCIe device that reported the exception information: a query command retrieves the error records and the problem cause held in the configuration space registers of the erroring device, so that the location and cause of the error can be determined quickly. Analyzing the error cause also makes it possible to avoid the problem in later designs, optimize those designs and reduce the recurrence of similar problems.

Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present disclosure; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the steps of a method for quickly locating a PCIe fatal error according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of error locating according to an embodiment of the present disclosure;

FIG. 3 is a flowchart of problem handling according to an embodiment of the present disclosure;

FIG. 4 is a hardware architecture diagram of a system for quickly locating a PCIe fatal error according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present disclosure.

The terms "first", "second" and the like in the specification and claims of the present disclosure distinguish similar objects and do not describe a specific order or sequence. Terms used in this way are interchangeable where appropriate, so that the embodiments of the present disclosure can be implemented in orders other than those illustrated or described here. Objects distinguished by "first", "second" and the like are generally of one type, and their number is not limited; for example, there may be one first object or several. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.

In the related art, for an unrecoverable fatal error the system resets the PCIe link by restarting. This causes data loss, so the cause of the PCIe device error cannot be obtained. Consequently, fatal errors that appear only after long periods of operation or only intermittently cannot be repaired quickly enough to meet production or user requirements.

To solve the above technical problem, the present disclosure proposes a method and system for quickly locating a PCIe fatal error. The method makes it possible to quickly locate the PCIe device on which a fatal error occurred and to obtain the cause of that error.

First, for ease of understanding, some terms that appear in the description of the embodiments of the present disclosure are explained as follows:

BIOS (Basic Input Output System): an industry-standard firmware interface whose main function is to provide the lowest-level, most direct hardware setup and control for the computer.

PCIe: a high-speed serial expansion bus standard used to connect peripheral devices in a computer system, such as graphics cards, network adapters and storage controllers.

PCIe root node: the starting point of the PCIe tree topology, which communicates with the rest of the system on behalf of the CPU.

BMC (Baseboard Management Controller): performs out-of-band management of the server and provides a range of monitoring and control functions.

The method for quickly locating a PCIe fatal error provided by the embodiments of the present disclosure is applied to a system for quickly locating a PCIe fatal error. The system includes: a PCIe device, a CPU, a BIOS and a baseboard management controller, where the CPU includes the PCIe root node.

FIG. 1 is a schematic diagram of the steps of a method for quickly locating a PCIe fatal error according to an embodiment of the present disclosure. As shown in FIG. 1, the method may include the following steps:

Step S11: when an error occurs in a PCIe device, report exception information to the PCIe root node of the CPU; the exception information carries an error type; the error types include: recoverable error, unrecoverable non-fatal error and unrecoverable fatal error.

The PCIe specification contains requirements and guidelines for error reporting. They define how PCIe devices should report the various types of errors and what error information they should provide.

PCIe devices usually detect errors automatically through hardware and firmware. These errors may involve transmission problems, power problems, configuration errors and so on. Automatic detection helps improve system stability and reliability and speeds up diagnosis and resolution.

As required by the PCIe specification, a PCIe device detects errors automatically and records them in its configuration space registers. The device selects, from the preset error types, the type to which the current error belongs and reports an error message carrying that type to a register of the PCIe root node. After receiving the error message, the PCIe root node generates the interrupt event that corresponds to the error type carried in the message. The error types include: recoverable errors, unrecoverable non-fatal errors and unrecoverable fatal errors. The system can handle recoverable errors and unrecoverable non-fatal errors by itself; because no system restart is involved, the data related to the PCIe device error is not lost. For unrecoverable fatal errors, the related art resets the PCIe link by restarting the system, which causes some data loss.

In an optional embodiment, before the exception information is reported to the PCIe root node of the CPU, whether an error has occurred in the running state of the PCIe device is detected as required by the PCIe specification. When an error occurs in the PCIe device, the data associated with the exception information is recorded in a configuration space register of the PCIe device; this data includes the error type. The exception information carrying the error type is then generated from this data.

The configuration space registers of a PCIe device are a set of registers that store device configuration and identification information and occupy a fixed address range for each PCIe device. They hold important information about the device, such as the device ID, vendor ID, interrupt information and resource allocation. They can also hold detailed information about errors, exceptions or warnings that the device has encountered.
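
To make the register layout concrete, the following minimal sketch parses the first fields of a standard configuration header from a raw byte buffer. The offsets used (Vendor ID at 0x00, Device ID at 0x02, Command at 0x04, Status at 0x06, Capabilities Pointer at 0x34, Status bit 4 as the capabilities-list flag) follow the standard PCI header; the function name `parse_config_header` and the sample values are made up for illustration and are not part of the source.

```python
import struct


def parse_config_header(config: bytes) -> dict:
    """Decode a few identification fields from a 64-byte standard header."""
    if len(config) < 0x40:
        raise ValueError("need at least the 64-byte standard header")

    # Four little-endian 16-bit fields starting at offset 0x00.
    vendor_id, device_id, command, status = struct.unpack_from("<HHHH", config, 0x00)

    # Status bit 4 says whether a capability list exists; the low two bits
    # of the pointer are reserved and are masked off.
    has_capabilities = bool(status & 0x0010)
    cap_ptr = (config[0x34] & 0xFC) if has_capabilities else None

    return {
        "vendor_id": vendor_id,
        "device_id": device_id,
        "command": command,
        "status": status,
        "capabilities_pointer": cap_ptr,
    }


# Made-up sample header; on a real system the bytes would come from the
# platform's configuration space access mechanism.
sample = bytearray(0x40)
struct.pack_into("<HHHH", sample, 0x00, 0xABCD, 0x1234, 0x0006, 0x0010)
sample[0x34] = 0x40
header = parse_config_header(bytes(sample))
```

Error-related registers such as the PCI Express capability and the advanced error reporting capability are reached by following the capability lists that start from this header.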

As required by the specification, a PCIe device can detect errors automatically through hardware and firmware, covering items such as transmission rate and link state. When an error occurs in the PCIe device, the data associated with the exception information is recorded in a configuration space register of the device. The device determines the error type according to its settings, so that the recorded data includes the error type, and then generates the exception information carrying that error type from the recorded data.
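
One place where a device records the detected error class is the Device Status register of its PCI Express capability, whose low bits flag correctable, non-fatal, fatal and unsupported-request errors. The sketch below maps a Device Status value to the most severe class it reports. The bit positions follow the PCIe specification; the function name `most_severe_error` and the returned labels, which reuse this document's three error types, are assumptions made for illustration.

```python
from typing import Optional

# Device Status register bits (PCI Express capability, offset 0x0A).
CORRECTABLE_DETECTED = 1 << 0
NON_FATAL_DETECTED = 1 << 1
FATAL_DETECTED = 1 << 2
UNSUPPORTED_REQUEST_DETECTED = 1 << 3  # informational, not a severity class


def most_severe_error(device_status: int) -> Optional[str]:
    """Return the most severe error class flagged in Device Status."""
    if device_status & FATAL_DETECTED:
        return "unrecoverable fatal"
    if device_status & NON_FATAL_DETECTED:
        return "unrecoverable non-fatal"
    if device_status & CORRECTABLE_DETECTED:
        return "recoverable"
    return None  # no error bit is set


# A device that has seen both a correctable and a fatal error.
error_type = most_severe_error(CORRECTABLE_DETECTED | FATAL_DETECTED)
```

Checking the fatal bit first matters because several bits can be set at once, and the handling path is chosen by the worst error present.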

With the embodiments of the present disclosure, a PCIe device detects errors automatically without manual intervention, which improves system reliability and stability. The error information is recorded in the configuration space registers of the device, which simplifies later problem analysis and troubleshooting. Because the generated exception information carries the error type, measures matching that type can be taken when the exception information reaches the PCIe root node, which makes error locating and identification more efficient.

Step S12: when the error type is an unrecoverable fatal error, determine whether the advanced error reporting function is enabled.

The advanced error reporting (AER) function is a standard mechanism defined by the PCIe specification for reporting errors to the host system in detail. In the related art, PCIe error reporting is implemented through AER, which provides a unified way to report the errors that may occur in PCIe devices, including but not limited to transmission errors, data check errors and link layer errors.

In an optional embodiment, when the error type is an unrecoverable fatal error and the advanced error reporting function is enabled, the error cause contained in the data associated with the exception information can be obtained from the uncorrectable error status register. This data is uploaded automatically by the PCIe device to the uncorrectable error status register and the error status severity register, and it also includes all error-related configuration space register information in the configuration space of the advanced error reporting function.

When the error type is an unrecoverable fatal error and the advanced error reporting function is enabled, the exception information of the PCIe device can be reported directly through the advanced error reporting function.

With the advanced error reporting function, the PCIe device automatically uploads the specific error cause to the uncorrectable error status register and the error status severity register. In this case the problem cause can be read directly from the uncorrectable error status register in the configuration space of the advanced error reporting function, and all error-related configuration space register information in that configuration space is saved. This register information covers more than the specific cause of the error; it can also include the error type, timestamp, device state, error code, the context in which the error occurred and other related information.
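
As an illustration of reading the problem cause from these registers, the sketch below decodes an Uncorrectable Error Status value against the matching severity register. The bit names follow the AER capability defined by the PCIe specification, where the severity register is called the Uncorrectable Error Severity register; the function name `decode_uncorrectable` and the sample values are made up for this sketch.

```python
from typing import Dict, List, Tuple

# Bit positions in the AER Uncorrectable Error Status register. The
# severity register uses the same positions: 1 = fatal, 0 = non-fatal.
UNCORRECTABLE_ERROR_BITS: Dict[int, str] = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_uncorrectable(status: int, severity: int) -> Tuple[List[str], List[str]]:
    """Split the errors flagged in `status` into (fatal, non_fatal) names."""
    fatal: List[str] = []
    non_fatal: List[str] = []
    for bit, name in sorted(UNCORRECTABLE_ERROR_BITS.items()):
        if status & (1 << bit):
            (fatal if severity & (1 << bit) else non_fatal).append(name)
    return fatal, non_fatal


# Made-up register values: two errors logged, only one of them marked fatal.
status_value = (1 << 18) | (1 << 14)
severity_value = (1 << 18) | (1 << 4)
fatal_causes, non_fatal_causes = decode_uncorrectable(status_value, severity_value)
```

Saving the raw register values alongside the decoded names keeps the stored record useful even if the bit table is later extended.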

采用本公开实施例的技术方案,不仅需要从不可纠正错误状态寄存器中获取问题原因,还需要获取高级错误报告功能对应的配置空间中所有错误配置空间寄存器信息,得到其他相关信息,如错误上下文、设备状态等;有助于深入了解问题发生的背景和环境,从而更有效地进行故障排除和分析。记录所有错误信息也有助于审计和追溯。通过保存错误信息内容,设备制造商和系统管理员可以分析设备的错误模式和趋势,从中获取宝贵的反馈信息。By adopting the technical solution of the embodiment of the present disclosure, it is necessary not only to obtain the cause of the problem from the uncorrectable error status register, but also to obtain all the error configuration space register information in the configuration space corresponding to the advanced error reporting function, and obtain other relevant information, such as error context, device status, etc.; this helps to gain a deeper understanding of the background and environment in which the problem occurred, so as to perform troubleshooting and analysis more effectively. Recording all error information also helps with auditing and tracing. By saving the content of the error information, device manufacturers and system administrators can analyze the error patterns and trends of the device and obtain valuable feedback information from them.

After the PCIe root node receives the abnormal information reported by the PCIe device, it determines the error type of the abnormal information. When the error type is an unrecoverable fatal error, the PCIe root node generates an interrupt event for the fatal error and, with the system in the interrupted state, collects the relevant information of the PCIe device that reported the error message.

Before obtaining error-related information from the PCIe device that reported the abnormal information, the present disclosure first determines whether the advanced error reporting function is enabled. If it is enabled, the advanced error reporting function is used to report the error-related information of the PCIe device; if it is not enabled, the error-related information of the PCIe device is reported according to the BIOS settings.
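The branching described above can be sketched as a small selection function. This is only an illustration under stated assumptions: the enum members and the returned path labels are hypothetical names chosen for this sketch, not identifiers from the disclosure.

```python
from enum import Enum


class ErrorType(Enum):
    RECOVERABLE = "recoverable"
    NON_FATAL = "unrecoverable non-fatal"
    FATAL = "unrecoverable fatal"


def choose_collection_path(error_type: ErrorType, aer_enabled: bool) -> str:
    """Pick how error-related information is gathered for a reported error."""
    if error_type is not ErrorType.FATAL:
        # Only fatal errors trigger information collection.
        return "no_collection"
    # Fatal error: prefer the AER registers, otherwise fall back to the
    # BIOS-configured query command.
    return "aer_registers" if aer_enabled else "bios_query"
```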

Step S13: When the advanced error reporting function is not enabled, send a query command to the PCIe device according to the BIOS settings to obtain information about the PCIe device. The information includes one or more of the following: identification information of the faulty device, the error cause, all error-related configuration space register information in the configuration space of the PCIe device, and configuration space register information of the upper-level device to which the PCIe device belongs.

The BIOS is firmware in a computer system, stored in non-volatile memory on the motherboard. It provides an interface through which the operating system and applications communicate and interact with the computer's hardware. It contains a fixed set of programs and data that manage the initialization, configuration, and control of hardware devices and provide basic input/output functions such as access to the keyboard, display, and storage devices.

When the advanced error reporting function is not enabled, the conventional error reporting method cannot be used. The present disclosure proposes that, in this case, the error-related information of the PCIe device is reported according to the BIOS settings.

According to the BIOS settings, a query command is sent to the PCIe device that reported the error message to obtain information about the PCIe device. The information includes one or more of the following: the error cause, all error-related configuration space register information in the configuration space of the PCIe device, and configuration space register information of the upper-level device to which the PCIe device belongs.

The identification information of a PCIe device usually takes the form of a vendor ID and a device ID. This identification information is held in the configuration space of the PCIe device and is read by firmware (such as the BIOS) when the device is initialized.
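As a small illustration, the vendor ID and device ID occupy the first four bytes of the configuration space as two little-endian 16-bit fields. The sketch below, with a hypothetical function name, extracts them from a configuration space buffer and treats an all-ones vendor ID as "no response", which is the value conventionally returned when no device answers a configuration read.

```python
import struct
from typing import Dict, Optional


def parse_identification(config: bytes) -> Optional[Dict[str, str]]:
    """Extract the vendor ID and device ID from a configuration space buffer."""
    if len(config) < 4:
        return None
    vendor_id, device_id = struct.unpack_from("<HH", config, 0)
    if vendor_id == 0xFFFF:
        # All ones: conventionally means no device responded to the read.
        return None
    return {"vendor_id": f"{vendor_id:04x}", "device_id": f"{device_id:04x}"}
```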

The error cause is the direct cause of the error in the PCIe device; knowing it allows the problem to be located and resolved quickly. With the error cause in hand, engineers can promptly understand what went wrong and take corresponding measures to repair the fault or optimize the design, reducing recurrence of the problem.

All error-related configuration space register information in the configuration space of the PCIe device is obtained in order to gain a more complete picture of the status and configuration of the faulty device. The register information contains the device's configuration and status data, which helps engineers analyze the cause and scope of the problem more accurately. It can provide the device's error records and abnormal states, helping to locate the faulty device and the cause of the problem. It can also be used to reproduce and verify the problem and for subsequent problem tracking and analysis. An error in a PCIe device may cause the device to hang, so that it cannot respond to the query command; in that case the error-related information of the PCIe device can be obtained from the configuration space register information of the upper-level device to which the PCIe device belongs.

In an optional embodiment, step S13 may include steps S131 to S135.

Step S131: According to the BIOS settings, send a first query command to the faulty device over the PCIe link. The faulty device is the PCIe device that reported the abnormal information.

A PCIe link is the communication path between two devices connected on the PCIe bus. It consists of a pair of interconnected transmitters and receivers and carries data and control information between the devices.

Based on the BIOS settings, the faulty device to be queried is determined and its PCIe address is obtained; the PCIe controller and its registers are then used to establish a communication link with the faulty device.

When the advanced error reporting function is not enabled, the system sends the first query command to the faulty device over the PCIe link. This can be done by loading the query command data into the transmit buffer of the PCIe controller, setting the target device address of the PCIe controller to the PCIe address of the faulty device, and then triggering the PCIe controller to send the command. The query command travels over the PCIe link to the configuration space registers of the faulty device in order to obtain the error record information and the cause of the problem. The faulty device is the PCIe device that reported the abnormal information.

After receiving the query command, the faulty device processes it according to the command content.

Step S132: When the faulty device responds to the first query command, query the configuration space registers of the faulty device to obtain the error cause of the faulty device and all error-related configuration space register information in its configuration space.

When the faulty device is able to respond to the first query command, it accesses its own configuration space registers according to the received command and sends the register information back to the system. The system parses and processes the received configuration space register information, extracts the error record information and the cause of the problem, and records them. This yields the error cause of the faulty device and all error-related configuration space register information in its configuration space.

Step S133: When the faulty device does not respond to the first query command, determine the upper-level device to which the faulty device belongs.

If the faulty device has hung because of the error, it cannot respond to the first query command. In that case, the upper-level device that manages the faulty device is determined from the PCIe topology or other related information; using the PCIe address of the upper-level device and the related commands, the target device address of the PCIe controller is set to the address of the upper-level device, and a second query command is sent to it. The upper-level device is the device that manages the faulty device and sits one level above it in the hierarchy.

When the faulty device cannot respond, information about it is obtained by querying the upper-level device that manages it; the information about the faulty device held in the upper-level device's configuration space registers is coarser. If the upper-level device cannot respond either, a query command is sent to the device one level further up, and the information about the faulty device is obtained from that device's configuration space registers. The higher the device is in the hierarchy, the coarser its information about the faulty device, so the configuration space registers of an upper-level device are consulted only when the faulty device does not respond to the query command.

Step S134: Send a second query command to the upper-level device to which the faulty device belongs.

The second query command is sent to the upper-level device that manages the faulty device only when the faulty device cannot respond to the first query command.

Step S135: When the upper-level device to which the faulty device belongs responds to the second query command, query the configuration space registers of that upper-level device to obtain the error cause of the faulty device and the configuration space register information of the upper-level device.

According to the command content, the upper-level device looks up the data related to the faulty device in its configuration space registers and reads the error cause of the faulty device and other related information from them. The identification information of the faulty device, the error cause that was read, and the configuration space register information of the upper-level device are then uploaded over the PCIe link or by another communication method.

With the embodiments of the present disclosure, when the faulty device can respond, information is obtained directly from its configuration space registers, and the faulty device itself reports the error cause and register information. When the faulty device cannot respond, information about it is obtained from the configuration space registers of the upper-level device that manages it, and the upper-level device reports the error cause of the faulty device together with its own register information. Obtaining the register information of both the faulty device and the upper-level device provides more complete error information and helps engineers understand the background and context of the problem, so that it can be analyzed and resolved more effectively. Sending the second query command to the upper-level device when the faulty device cannot respond avoids losing or overlooking error information because the faulty device is unresponsive, and ensures that the problem is properly recorded and handled. By collecting and processing this information to track the problem and understand its cause, similar problems can be avoided in later designs and the design can be optimized. This improves the stability and reliability of the system and reduces the impact of similar problems on system operation.
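Steps S131 to S135, including the further escalation when the upper-level device is also unresponsive, can be sketched as a walk up the topology. This is a minimal sketch under stated assumptions: `parent_of` is a hypothetical mapping from a device address to the address of its upper-level device, and `query` is a hypothetical callable that returns register data or `None` when the device does not respond. Neither is defined by the disclosure.

```python
from typing import Callable, Dict, Optional


def collect_with_fallback(
    faulty_device: str,
    parent_of: Dict[str, Optional[str]],
    query: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Query the faulty device first; on no response, move up one level at a time."""
    target: Optional[str] = faulty_device
    levels_up = 0
    visited = set()
    while target is not None and target not in visited:
        visited.add(target)  # guard against a malformed, cyclic topology
        registers = query(target)
        if registers is not None:
            return {
                "faulty_device": faulty_device,
                "answered_by": target,
                "levels_up": levels_up,  # larger value means coarser information
                "registers": registers,
            }
        target = parent_of.get(target)
        levels_up += 1
    return None  # nothing on the path responded
```

Recording `levels_up` alongside the data keeps track of how coarse the information is, since the description notes that registers further up the hierarchy say less about the faulty device.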

Step S14: Store the information of the PCIe device in the baseboard management controller (BMC).

When the PCIe root node receives the reported abnormal information, the corresponding interrupt event is triggered according to the error type carried in the abnormal information. With the system in the interrupted state, the information related to the error in the PCIe device is collected and stored in the baseboard management controller, which avoids data loss caused by a system restart.

In an optional embodiment, storing the information of the PCIe device in the baseboard management controller includes: collecting the target information transmitted between the CPU and the PCIe device; collecting, from the log information of the PCIe device, the target log information related to the operating status of the PCIe device; and storing the information of the PCIe device, the target information, and the target log information in the baseboard management controller. The target information transmitted between the CPU and the PCIe device refers to the information exchanged between them, including data exchange, instruction transfer, and status information transfer. Collecting this information shows how the CPU and the device were communicating, including the data transfer speed, the execution of instructions, and the status of the device.

In a system, the faulty device may exhibit various abnormal conditions, such as errors, failures, and degraded performance. To understand the operating status of the faulty device and the possible causes of the problem, the target log information related to the operating status of the PCIe device can be collected. The log information may include the faulty device's operation records, error messages, warning messages, and performance indicators.

After the relevant information has been queried from the configuration space registers of the faulty device, or the information about the faulty device has been queried from the configuration space registers of the upper-level device, the target information transmitted between the CPU and the PCIe device is also collected, along with the target log information related to the operating status of the PCIe device from the device's log information. The information of the PCIe device, the target information, and the target log information are then stored in the baseboard management controller.
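One way to picture the stored record is as a single bundle of the three kinds of information. The sketch below is illustrative only: the record layout, the field names, and the `send` callable that stands in for the transport to the BMC are all assumptions, since the disclosure does not specify a format or interface.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, List


@dataclass
class FatalErrorRecord:
    device_info: dict        # error cause, register information, identification
    target_info: List[str]   # information exchanged between CPU and device
    target_logs: List[str]   # log entries related to the device's operating status
    collected_at: float = field(default_factory=time.time)


def store_to_bmc(record: FatalErrorRecord, send: Callable[[bytes], None]) -> int:
    """Serialize the record and hand it to a caller-supplied BMC transport.

    Returns the number of bytes handed over.
    """
    payload = json.dumps(asdict(record), sort_keys=True).encode("utf-8")
    send(payload)
    return len(payload)
```

Passing the transport in as a callable keeps the sketch independent of how a given platform actually delivers data to its BMC.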

Information collection can be implemented in software or in hardware. For example, a logging function can be set up in the system to record the communication between the CPU and the device, or a hardware monitoring device can capture and record that communication data in real time.

With the embodiments of the present disclosure, collecting the target information transmitted between the CPU and the PCIe device, together with the target log information related to the operating status of the PCIe device from its log information, helps engineers analyze and locate problems and find the causes of communication failures, performance bottlenecks, or other abnormal conditions.

With the embodiments of the present disclosure, capturing the configuration space register information of the PCIe device provides the device's detailed configuration and status data, so the problem can be located quickly. Uploading the configuration space register information of the faulty device to the BMC for recording ensures that both the problem state and the problem cause are preserved, so that in later analysis engineers can quickly reproduce and handle the problem from the recorded information, improving the efficiency of problem handling. Collecting and processing the register information of the faulty device reveals why the problem occurred, so that it can be avoided and the design optimized in later work. This helps reduce the occurrence of similar problems and improves the stability and reliability of the system. In addition, when the advanced error reporting function is not enabled, obtaining the error cause and related register information of the PCIe device according to the BIOS settings still allows fatal PCIe device problems to be located and resolved.

In an optional embodiment, the method for quickly locating PCIe fatal errors further includes the following:

In response to the reporting of the abnormal information, an interrupt event corresponding to the error type is triggered according to the error type. After the information of the PCIe device is stored in the baseboard management controller, the method further includes: obtaining a preset processing scheme, which includes the handling method corresponding to each error type, the handling methods including masking the error, resetting the device, and restarting the system; determining a target handling method according to the preset processing scheme and the error type carried in the abnormal information; handling the error that occurred in the PCIe device according to the target handling method; and releasing the interrupt event after the error in the PCIe device has been handled.

The advantage of interrupt events is that they provide an asynchronous communication mechanism that lets a device notify the processor or host system when a specific task completes or a specific event occurs.

In response to the reporting of the abnormal information, an interrupt event corresponding to the error type is triggered, rather than restarting the system to repair the abnormality as soon as the report is received. A different interrupt event is triggered for each error type. For example, when the error type is an unrecoverable fatal error, the error cause and the corresponding register information of the PCIe device that produced the fatal error are obtained.

After the interrupt event is triggered, the relevant information of the PCIe device is collected and stored in the baseboard management controller. Recovery measures are taken for the PCIe device only after this has ensured that the error cause and register information of the PCIe device will not be lost in a restart.

First, the preset processing scheme is obtained. The preset processing scheme includes the handling method corresponding to each error type; the handling methods include masking the error, resetting the device, and restarting the system.

Masking the error: according to the preset processing scheme, the system can be configured to mask the fatal error, that is, to ignore the error temporarily so that the system can continue to run normally without being affected by it.

Resetting the device: if the preset processing scheme includes a device reset step, the system can reset the device in which the fatal error occurred in an attempt to restore it to a normal working state.

Restarting the system: in some cases a fatal error may prevent the whole system from running normally, and the preset processing scheme may then include a system restart step that restarts the system in an attempt to clear the error.

In an optional embodiment, when the error type is a recoverable error or an unrecoverable non-fatal error, the interrupt event is released.

The system can handle recoverable errors and unrecoverable non-fatal errors on its own, and doing so does not cause a system restart that would lose the data produced while the problem occurred. Information therefore needs to be collected only for the error cause and corresponding register information of a PCIe device that produced an unrecoverable fatal error; if the error type of the PCIe device is a recoverable error or an unrecoverable non-fatal error, the interrupt event is released directly. With this embodiment, for recoverable errors or unrecoverable non-fatal errors the system can end processing directly without any further complex operations. This saves time and resources and improves the efficiency of problem handling, and letting the system handle these errors on its own improves its stability and reliability.
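The handling logic described above can be sketched as a lookup in a preset scheme followed by releasing the interrupt. This is a hedged illustration: which handling method a scheme assigns to a fatal error is a deployment choice that the disclosure leaves open, and the enum names, the `scheme` mapping, and the `actions` callables are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, List


class ErrorType(Enum):
    RECOVERABLE = "recoverable"
    NON_FATAL = "unrecoverable non-fatal"
    FATAL = "unrecoverable fatal"


class Handling(Enum):
    MASK_ERROR = "mask error"
    RESET_DEVICE = "reset device"
    RESTART_SYSTEM = "restart system"


def process_error(
    error_type: ErrorType,
    scheme: Dict[ErrorType, Handling],
    actions: Dict[Handling, Callable[[], None]],
) -> List[str]:
    """Apply the preset handling for a fatal error, then release the interrupt.

    Recoverable and non-fatal errors skip straight to releasing the interrupt.
    """
    steps: List[str] = []
    if error_type is ErrorType.FATAL:
        handling = scheme[error_type]   # target handling method from the scheme
        actions[handling]()             # caller-supplied implementation
        steps.append(handling.value)
    steps.append("release interrupt")
    return steps
```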

With this embodiment, when abnormal information is received, the interrupt event promptly alerts the system that an error has occurred. After the interrupt event is triggered, storing the error cause of the faulty device and the corresponding register information in the baseboard management controller ensures that the error cause and related information are not lost and remain available for later analysis and handling.

In an optional embodiment, the problem reporting logic can be integrated into the PCIe device chip. When an error occurs, it is not reported immediately; instead, the abnormal information is reported automatically to the CPU at the next power-on, and the CPU forwards it to the BMC for maintenance purposes.

Figure 2 is an error location flowchart according to an embodiment of the present disclosure. As shown in Figure 2: a device on the PCIe link becomes abnormal; the abnormality is automatically reported, according to the protocol specification, to the device status register of the PCIe root node port; the PCIe root node port generates the interrupt event corresponding to the error type and passes it to the system; software obtains the log information for the problem; the relevant information is recorded in the BMC; and the error in the PCIe device is then handled according to the preset processing scheme.

Figure 3 is a problem handling flowchart according to an embodiment of the present disclosure. As shown in Figure 3: a device on the PCIe link becomes abnormal, and the abnormality is automatically reported, according to the protocol specification, to the device status register of the PCIe root node port; the PCIe root node port generates the interrupt event corresponding to the error type and passes it to the system; it is determined whether the error type is an unrecoverable fatal error; if the error type is a recoverable error or an unrecoverable non-fatal error, processing ends directly; if the error type is an unrecoverable fatal error, it is determined whether the advanced error reporting function is enabled; if it is enabled, the cause of the problem is obtained through the advanced error reporting function; if it is not enabled, the first query command is sent to the faulty device; it is determined whether the faulty device responds to the first query command; if it responds, the faulty device reports the error cause itself; if it does not respond, the second query command is sent to the upper-level device to which the faulty device belongs, and the upper-level device reports the error cause; the error cause of the faulty device is sent to the baseboard management controller for recording; the log information related to the faulty device is collected and sent to the baseboard management controller for recording; and the system handles the fatal error according to the preset processing scheme.

Figure 4 is a hardware architecture diagram of a system for quickly locating PCIe fatal errors according to an embodiment of the present disclosure. The CPU contains the PCIe root node interface, which communicates with the PCIe devices in the topology and passes PCIe information to the CPU for processing. The BIOS, as the basic low-level configuration, is loaded into the CPU to run and performs the server's initial, lowest-level setup. The baseboard management controller manages the server as an out-of-band management system, and the server CPU sends some log information to the BMC.

It should be noted that, for simplicity of description, the method embodiments are presented as a series of combined actions. Those skilled in the art will understand, however, that the embodiments of the present disclosure are not limited by the order of actions described, because according to the embodiments some steps may be performed in a different order or simultaneously. Those skilled in the art will also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present disclosure.

As shown in Figure 4, the system for quickly locating PCIe fatal errors includes a PCIe device, a CPU, a BIOS, and a baseboard management controller;

the PCIe device is configured to report abnormal information to the PCIe root node of the CPU when it detects that an error has occurred in itself; the abnormal information carries an error type; the error types include recoverable errors, unrecoverable non-fatal errors, and unrecoverable fatal errors;

the system for quickly locating PCIe fatal errors is configured to determine whether the advanced error reporting function is enabled when the error type is an unrecoverable fatal error;

the system for quickly locating PCIe fatal errors is configured, when the advanced error reporting function is not enabled, to send a query command to the PCIe device according to the BIOS settings to obtain information about the PCIe device; the information includes one or more of the following: identification information of the faulty device, the error cause, all error-related configuration space register information in the configuration space of the PCIe device, and configuration space register information of the upper-level device to which the PCIe device belongs;

the baseboard management controller is configured to store the information of the PCIe device.

Optionally, the PCIe fatal error rapid-locating system is further configured to:

send, according to the BIOS setting, a first query command to the error device over the PCIe link; the error device is the PCIe device that reported the exception information;

when the error device responds to the first query command, query the configuration space registers of the error device to obtain the error cause of the error device and the information of all error-related configuration space registers in the configuration space of the error device;

when the error device does not respond to the first query command, determine the upper-layer device to which the error device belongs;

send a second query command to the upper-layer device to which the error device belongs;

when the upper-layer device to which the error device belongs responds to the second query command, query the configuration space registers of that upper-layer device to obtain the error cause of the error device and the configuration space register information of that upper-layer device.
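The two-step query with fallback to the upper-layer device can be sketched as follows. The helper names `parent_of` and `read_config`, the result dictionary layout, and the convention that `read_config` returns `None` for a device that does not respond are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

Registers = Dict[str, int]


def locate_error(
    error_device: str,
    parent_of: Callable[[str], Optional[str]],
    read_config: Callable[[str], Optional[Registers]],
) -> Optional[dict]:
    """Query the error device first, then its upper-layer device."""
    # First query: the device that reported the error.
    regs = read_config(error_device)
    if regs is not None:
        return {"source": "error_device", "device": error_device, "registers": regs}

    # No response: determine the upper-layer device and send the second query.
    parent = parent_of(error_device)
    if parent is None:
        return None
    regs = read_config(parent)
    if regs is not None:
        return {
            "source": "upper_layer_device",
            "device": parent,
            "error_device": error_device,
            "registers": regs,
        }
    # Neither device responded; the text does not describe this case.
    return None
```

For example, with a dictionary-backed topology and register store, `locate_error("0000:3b:00.0", topology.get, registers.get)` returns the upper-layer device's registers when only that device is present in `registers`.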

Optionally, the PCIe fatal error rapid-locating system is further configured to:

collect target information transmitted between the CPU and the PCIe device;

collect, from the log information of the PCIe device, target log information related to the running status of the PCIe device;

store the information of the PCIe device, the target information, and the target log information in the baseboard management controller.
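As a rough illustration of the three items that are stored together, the sketch below bundles them into one JSON record. The record layout and field names are hypothetical, and the actual transport to the baseboard management controller is not specified in the text, so it is left out.

```python
import json
from typing import Iterable, Mapping


def build_bmc_record(
    device_info: Mapping[str, object],
    target_info: Mapping[str, object],
    target_logs: Iterable[str],
) -> str:
    """Serialize the collected data into one record (hypothetical layout)."""
    record = {
        "device_info": dict(device_info),   # information obtained by querying
        "target_info": dict(target_info),   # data passed between CPU and device
        "target_logs": list(target_logs),   # running-status log entries
    }
    # sort_keys gives a stable byte-for-byte representation for the same input.
    return json.dumps(record, sort_keys=True)
```

Keeping the three parts in a single record means that whoever later reads it from the controller sees the register dump, the exchanged data, and the logs for the same incident side by side.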

Optionally, the PCIe device is configured to detect, according to the requirements of the PCIe specification, whether an error has occurred in the running state of the PCIe device;

when an error occurs in the PCIe device, record the relevant data corresponding to the exception information in the configuration space registers of the PCIe device; the relevant data corresponding to the exception information includes the error type;

generate, according to the relevant data corresponding to the exception information, the exception information carrying the error type.

Optionally, the PCIe fatal error rapid-locating system is further configured to:

when the error type is an unrecoverable fatal error and the advanced error reporting function is enabled, obtain, from the uncorrectable error status register, the error cause contained in the relevant data corresponding to the exception information; the relevant data corresponding to the exception information is data that the PCIe device automatically writes to the uncorrectable error status register and the error status severity register; the relevant data corresponding to the exception information further includes: the information of all error-related configuration space registers in the configuration space corresponding to the advanced error reporting function.
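When advanced error reporting is enabled, the error cause can be read off the uncorrectable error status register. The sketch below decodes a status value using the bit positions defined for the AER Uncorrectable Error Status register in the PCI Express Base Specification; treating a set bit in the severity register as "fatal" also follows that specification. The function name and the way the raw register values are obtained are assumptions, and the text itself does not prescribe this decoding.

```python
# Bit positions of the AER Uncorrectable Error Status register.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_fatal_causes(status: int, severity: int) -> list:
    """List the logged errors whose severity bit marks them as fatal."""
    fatal = status & severity  # logged AND configured as fatal
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_BITS.items())
        if fatal & (1 << bit)
    ]
```

For instance, a status value with bits 14 and 18 set and a severity value with only bit 18 set yields `["Malformed TLP"]`, because the completion timeout is logged but classified as non-fatal.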

Optionally, the PCIe root node of the CPU is configured to:

trigger, according to the error type, an interrupt event corresponding to the error type;

After the information of the PCIe device is stored in the baseboard management controller, the PCIe fatal error rapid-locating system is further configured to:

obtain a preset handling scheme; the preset handling scheme includes the handling method corresponding to each error type; the handling methods include: masking the error, resetting the device, and restarting the system;

determine a target handling method according to the preset handling scheme and the error type carried in the exception information;

handle the error that occurred in the PCIe device according to the target handling method;

release the interrupt event after the error that occurred in the PCIe device has been handled.

Optionally, the PCIe root node of the CPU is further configured to:

release the interrupt event when the error type is the recoverable error or the unrecoverable non-fatal error.
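The preset handling scheme amounts to a lookup from error type to handling method, followed by release of the interrupt event. The sketch below shows one possible shape. The text only lists the three handling methods; which error type is paired with which method is an assumption made here, as are all identifier names.

```python
# Hypothetical pairing of error types to handling methods; the text names the
# three methods but does not say which error type maps to which.
PRESET_SCHEME = {
    "recoverable": "mask_error",
    "unrecoverable_non_fatal": "reset_device",
    "unrecoverable_fatal": "restart_system",
}


def plan_handling(error_type: str, scheme: dict = PRESET_SCHEME) -> list:
    """Return the ordered steps: apply the target method, then release."""
    if error_type not in scheme:
        raise ValueError(f"no handling method configured for {error_type!r}")
    target_method = scheme[error_type]
    # The interrupt event is released only after the error has been handled.
    return [target_method, "release_interrupt"]
```

Passing a different `scheme` dictionary models a platform whose preset scheme differs from the default used in this sketch.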

It should be noted that the system embodiment is similar to the method embodiment and is therefore described only briefly; for related details, refer to the method embodiment.

An embodiment of the present disclosure further provides an electronic device. Referring to FIG. 5, FIG. 5 is a schematic diagram of the electronic device provided by an embodiment of the present disclosure. As shown in FIG. 5, the electronic device 100 includes a memory 110 and a processor 120. The memory 110 and the processor 120 are communicatively connected via a bus. The memory 110 stores a computer program that can run on the processor 120 to implement the steps of the method for quickly locating a PCIe fatal error disclosed in the embodiments of the present disclosure.

An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the method for quickly locating a PCIe fatal error disclosed in the embodiments of the present disclosure are implemented.

An embodiment of the present disclosure further provides a computer program product including a computer program. When the computer program is executed by a processor of a computer device, it can perform the steps of the method for quickly locating a PCIe fatal error disclosed in the embodiments of the present disclosure.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.

Those skilled in the art should understand that the embodiments of the present disclosure may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The embodiments of the present disclosure are described with reference to flowcharts and/or block diagrams of the method, apparatus, electronic device, and computer program product according to the embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operational steps is performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although some embodiments of the present disclosure have been described, those skilled in the art may make further changes and modifications to these embodiments once they learn the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present disclosure.

The method, system, electronic device, and medium for quickly locating a PCIe fatal error provided by the present disclosure have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is intended only to help in understanding the method of the present disclosure and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present disclosure, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (10)

1. A method for quickly locating a PCIe fatal error, characterized in that the method comprises: when an error occurs in a PCIe device, reporting exception information to the PCIe root node of a CPU, the exception information carrying an error type, and the error type including: recoverable error, unrecoverable non-fatal error, and unrecoverable fatal error; when the error type is an unrecoverable fatal error, determining whether the advanced error reporting function is enabled; when the advanced error reporting function is not enabled, sending a query command to the PCIe device according to the BIOS setting to obtain information of the PCIe device, the information including at least one of the following: identification information of the error device, the error cause, information of all error-related configuration space registers in the configuration space of the PCIe device, and configuration space register information of the upper-layer device to which the PCIe device belongs; and storing the information of the PCIe device in a baseboard management controller.

2. The method according to claim 1, characterized in that sending a query command to the PCIe device according to the BIOS setting to obtain the information of the PCIe device comprises: sending, according to the BIOS setting, a first query command to the error device over the PCIe link, the error device being the PCIe device that reported the exception information; when the error device responds to the first query command, querying the configuration space registers of the error device to obtain the error cause of the error device and the information of all error-related configuration space registers in the configuration space of the error device; when the error device does not respond to the first query command, determining the upper-layer device to which the error device belongs; sending a second query command to the upper-layer device to which the error device belongs; and when the upper-layer device to which the error device belongs responds to the second query command, querying the configuration space registers of that upper-layer device to obtain the error cause of the error device and the configuration space register information of that upper-layer device.

3. The method according to claim 1, characterized in that storing the information of the PCIe device in the baseboard management controller comprises: collecting target information transmitted between the CPU and the PCIe device; collecting, from the log information of the PCIe device, target log information related to the running status of the PCIe device; and storing the information of the PCIe device, the target information, and the target log information in the baseboard management controller.

4. The method according to claim 1, characterized in that, before reporting the exception information to the PCIe root node of the CPU, the method further comprises: detecting, according to the requirements of the PCIe specification, whether an error has occurred in the running state of the PCIe device; when an error occurs in the PCIe device, recording the relevant data corresponding to the exception information in the configuration space registers of the PCIe device, the relevant data corresponding to the exception information including the error type; and generating, according to the relevant data corresponding to the exception information, the exception information carrying the error type.

5. The method according to claim 1, characterized in that, when the error type is an unrecoverable fatal error and the advanced error reporting function is enabled, the method further comprises: obtaining, from the uncorrectable error status register, the error cause contained in the relevant data corresponding to the exception information; the relevant data corresponding to the exception information is data that the PCIe device automatically writes to the uncorrectable error status register and the error status severity register; the relevant data corresponding to the exception information further includes: the information of all error-related configuration space registers in the configuration space corresponding to the advanced error reporting function.

6. The method according to claim 1, characterized in that the method further comprises: in response to the reporting of the exception information, triggering, according to the error type, an interrupt event corresponding to the error type; and, after the information of the PCIe device is stored in the baseboard management controller, the method further comprises: obtaining a preset handling scheme, the preset handling scheme including the handling method corresponding to each error type, and the handling methods including: masking the error, resetting the device, and restarting the system; determining a target handling method according to the preset handling scheme and the error type carried in the exception information; handling the error that occurred in the PCIe device according to the target handling method; and releasing the interrupt event after the error that occurred in the PCIe device has been handled.

7. The method according to claim 6, characterized in that the method further comprises: releasing the interrupt event when the error type is the recoverable error or the unrecoverable non-fatal error.

8. A PCIe fatal error rapid-locating system, characterized in that the PCIe fatal error rapid-locating system comprises: a PCIe device, a CPU, a BIOS, and a baseboard management controller; the PCIe device is configured to report exception information to the PCIe root node of the CPU when it detects that an error has occurred in itself, the exception information carrying an error type, and the error type including: recoverable error, unrecoverable non-fatal error, and unrecoverable fatal error; the PCIe fatal error rapid-locating system is configured to determine whether the advanced error reporting function is enabled when the error type is an unrecoverable fatal error; the PCIe fatal error rapid-locating system is configured to, when the advanced error reporting function is not enabled, send a query command to the PCIe device according to the BIOS setting to obtain information of the PCIe device, the information including at least one of the following: identification information of the error device, the error cause, information of all error-related configuration space registers in the configuration space of the PCIe device, and configuration space register information of the upper-layer device to which the PCIe device belongs; and the baseboard management controller is configured to store the information of the PCIe device.

9. An electronic device, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for quickly locating a PCIe fatal error according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the method for quickly locating a PCIe fatal error according to any one of claims 1 to 7.
CN202410544375.8A 2024-04-30 2024-04-30 PCIe fatal error quick positioning method, system, electronic equipment and medium Pending CN118550747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410544375.8A CN118550747A (en) 2024-04-30 2024-04-30 PCIe fatal error quick positioning method, system, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410544375.8A CN118550747A (en) 2024-04-30 2024-04-30 PCIe fatal error quick positioning method, system, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN118550747A true CN118550747A (en) 2024-08-27

Family

ID=92452159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410544375.8A Pending CN118550747A (en) 2024-04-30 2024-04-30 PCIe fatal error quick positioning method, system, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN118550747A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118708396A (en) * 2024-08-30 2024-09-27 苏州元脑智能科技有限公司 Error information processing method, device, medium and program product
CN118885359A (en) * 2024-09-27 2024-11-01 苏州元脑智能科技有限公司 Extension device status detection method, server and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination