CN117667467A

CN117667467A - A method for dealing with memory failures and related equipment

Info

Publication number: CN117667467A
Application number: CN202211025578.3A
Authority: CN
Inventors: 张飞
Original assignee: XFusion Digital Technologies Co Ltd
Current assignee: XFusion Digital Technologies Co Ltd
Priority date: 2022-08-25
Filing date: 2022-08-25
Publication date: 2024-03-08
Also published as: WO2024041093A1; EP4524746A1

Abstract

This application discloses a method for handling memory faults and related equipment, applied in the storage field. The method includes: obtaining uncorrectable error information of a first memory, where the uncorrectable error information includes a fault address; obtaining first target data corresponding to the fault address from a second memory and writing the first target data to the fault address of the first memory, where the first memory and the second memory are mirror memories of each other; and verifying second target data located at the fault address of the first memory, where the second target data is the data at the fault address of the first memory after the first target data has been written to it. If the second target data is verified to be faulty data, the fault address of the first memory is marked as an address to be isolated, and a page soft isolation operation is performed on the address to be isolated. Because the page soft isolation operation is performed only on the fault address, the other memory space in the first memory remains usable and the memory mirroring mode does not need to be released.

Description

A method for dealing with memory failures and related equipment

Technical field

Embodiments of the present application relate to the field of storage, and in particular to a method for handling memory faults and related equipment.

Background

Most current computing devices use a memory mirroring mode, in which the memory spaces of two independent physical memory channels are configured as backups of each other. The two memory spaces have the same address space and store the same data at the same addresses. One memory channel is set as the primary channel and the other as the backup channel. This ensures that if the memory on the primary channel fails, correct data can still be obtained from the memory device on the backup channel, so the computing device keeps operating normally.

After the computing device enables memory mirroring mode, if the memory controller detects an uncorrectable error (UCE) fault in memory data on the primary channel during operation, the memory controller reads the data from the same address in the memory space corresponding to the backup channel and uses it to repair the faulty memory address in the memory space corresponding to the primary channel. If the data cannot be repaired, the UCE fault is considered a memory hard-failure UCE. In that case the memory controller releases the mirroring relationship between the memory space corresponding to the primary channel and the memory space corresponding to the backup channel, and uses only the memory space corresponding to the backup channel.

However, releasing the mirroring relationship between the memory space corresponding to the primary channel and the memory space corresponding to the backup channel makes not only the faulty memory address but the entire memory space corresponding to the primary channel unavailable.

Summary of the invention

This application provides a method for handling memory faults and related equipment, applied in the storage field. The method can pinpoint the exact location of a memory fault without releasing the memory mirroring mode.

In a first aspect, a method for handling memory faults is provided, including:

Obtaining uncorrectable error information of a first memory, where the uncorrectable error information includes a fault address.

Obtaining first target data corresponding to the fault address from a second memory, and writing the first target data to the fault address of the first memory, where the first memory and the second memory are mirror memories of each other.

Verifying second target data located at the fault address of the first memory and, if the second target data is verified to be faulty data, marking the fault address of the first memory as an address to be isolated.

Performing a page soft isolation operation on the address to be isolated.
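
The four steps of the first aspect can be sketched as a small simulation. This is only an illustration under stated assumptions, not the claimed implementation: the class `MirroredMemory`, the `stuck_addresses` set that models hard-failed cells, the comparison against the mirror copy that stands in for the controller's verification, and the 4096-byte page size are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size for the example


@dataclass
class MirroredMemory:
    """Toy model: two dicts stand in for the first memory and its mirror."""
    first: dict = field(default_factory=dict)
    second: dict = field(default_factory=dict)
    stuck_addresses: set = field(default_factory=set)  # hypothetical hard-failed cells
    to_isolate: set = field(default_factory=set)       # addresses marked "to be isolated"
    offlined_pages: set = field(default_factory=set)   # page numbers taken out of use

    def write_first(self, addr, value):
        # A hard-failed cell ignores the write and keeps its corrupt content.
        if addr not in self.stuck_addresses:
            self.first[addr] = value

    def verify_first(self, addr):
        # Stand-in for the controller's check; real hardware would use check bits.
        return self.first.get(addr) == self.second.get(addr)


def handle_uce(mem, fault_addr):
    """Run the steps of the first aspect for one reported fault address."""
    first_target = mem.second[fault_addr]       # read the good copy from the mirror
    mem.write_first(fault_addr, first_target)   # write it to the fault address
    if mem.verify_first(fault_addr):            # second target data verifies correctly
        return "repaired"
    mem.to_isolate.add(fault_addr)              # mark as address to be isolated
    mem.offlined_pages.add(fault_addr // PAGE_SIZE)  # page soft isolation
    return "isolated"
```

In this sketch the mirroring relationship is never torn down: a fault that survives the write-back costs one page, and every other address in `first` stays in use.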

In the embodiments of this application, the fault address of a hard-failure UCE fault is marked as an address to be isolated, and a page soft isolation operation is performed on it. This isolates the fault address precisely without affecting other memory addresses, so the mirroring relationship between the first memory and the second memory does not need to be released. The memory space in the first memory other than the fault address can still be read and written normally. This avoids the situation in which healthy memory space in the first memory becomes unusable once the mirroring relationship is released, improves the utilization of the first memory, narrows the scope of the adverse impact of a memory hard-failure UCE, avoids wasting memory resources, and greatly reduces the probability that the memory mirroring mode is released.

In a possible implementation of the first aspect, the fault address of the first memory is marked as the address to be isolated based on a to-be-isolated identifier.

In the embodiments of this application, the to-be-isolated identifier can mark the fault address of the first memory as the address to be isolated in a variety of ways, which reflects the diversity and flexibility of the solution.

In a possible implementation of the first aspect, the page soft isolation operation is performed on the address to be isolated through the operating system (OS).

In the embodiments of this application, having the computing device perform the page soft isolation operation on the address to be isolated through the OS is a concrete implementation of the solution. It does not affect the other, healthy memory space of the first memory, and the mirroring relationship between the first memory and the second memory does not need to be released.

In a possible implementation of the first aspect, a generic hardware error source (GHES) table is generated, and the GHES table includes the fault address and the corresponding to-be-isolated identifier.

In the embodiments of this application, because the GHES table includes the fault address and the corresponding to-be-isolated identifier, the fault address that corresponds to the to-be-isolated identifier can be looked up in the GHES table and determined to be the address to be isolated. This is a concrete implementation of the solution and reflects its reliability.

In a possible implementation of the first aspect, the fault level of the fault address in the GHES table is correctable.

In the embodiments of this application, this indicates that the fault at the fault address does not adversely affect the operation of the computing device.

In a possible implementation of the first aspect, the OS performs the page soft isolation operation on the fault address in the GHES table whose fault level is correctable and which corresponds to the to-be-isolated identifier.

In the embodiments of this application, this is a concrete implementation of the solution and reflects its reliability.
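
The selection rule described above (fault level correctable and to-be-isolated identifier present) can be sketched as a filter over error records. The `ErrorRecord` type, its field names, and the level strings are hypothetical simplifications for illustration; they are not the layout of a real GHES table or the format used in the application.

```python
from dataclasses import dataclass

LEVEL_CORRECTABLE = "correctable"   # illustrative labels, not a firmware encoding
LEVEL_FATAL = "fatal"


@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical, simplified stand-in for one entry of the GHES table."""
    fault_address: int
    fault_level: str
    to_be_isolated: bool


def addresses_to_soft_isolate(records):
    """Return the fault addresses on which the OS should run page soft isolation."""
    return [
        r.fault_address
        for r in records
        if r.fault_level == LEVEL_CORRECTABLE and r.to_be_isolated
    ]
```

Reporting the fault as correctable while flagging it for isolation lets the OS treat the event as non-disruptive and still retire the affected page.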

In a second aspect, a computing device is provided. The computing device includes a processor, a first memory, a second memory, and a memory controller. The first memory and the second memory are connected to the memory controller, the memory controller is connected to the processor, and the first memory and the second memory are mirror memories of each other.

The memory controller is configured to obtain uncorrectable error information of the first memory, where the uncorrectable error information includes a fault address.

The memory controller is further configured to obtain first target data corresponding to the fault address from the second memory, and to write the first target data to the fault address of the first memory.

The memory controller is further configured to verify second target data located at the fault address of the first memory.

The processor is configured to mark the fault address of the first memory as an address to be isolated if the second target data is verified to be faulty data.

The processor is further configured to perform a page soft isolation operation on the address to be isolated.

In the embodiments of this application, the processor is specifically configured to mark the fault address of the first memory as the address to be isolated based on a to-be-isolated identifier.

In a possible implementation of the second aspect, the processor generates a GHES table, and the GHES table includes the fault address and the corresponding to-be-isolated identifier.

In a possible implementation of the second aspect, the fault level of the fault address in the GHES table is correctable.

In a possible implementation of the second aspect, the processor is specifically configured to perform, through the OS, the page soft isolation operation on the fault address in the GHES table whose fault level is correctable and which corresponds to the to-be-isolated identifier.

For the beneficial effects of the second aspect, refer to the first aspect; they are not repeated here.

In a third aspect, another computing device is provided, which may include a processor coupled to a memory, where the memory is used to store instructions and the processor is used to execute the instructions in the memory so that the computing device performs the method described in the first aspect of this application or in any possible implementation of the first aspect.

In a fourth aspect, another computing device is provided, including a processor configured to execute a computer program (or computer-executable instructions) stored in a memory. When the computer program (or computer-executable instructions) is executed, the method in the first aspect or in any possible implementation of the first aspect is performed.

In one possible implementation, the processor and the memory are integrated together.

In another possible implementation, the memory is located outside the computing device.

The computing device also includes a communication interface, which the computing device uses to communicate with other devices, for example to send or receive data and/or signals. For example, the communication interface may be a transceiver, a circuit, a bus, a module, or another type of communication interface.

In a fifth aspect, a computer-readable storage medium is provided, including computer-readable instructions. When the computer-readable instructions are run on a computer, the method of the first aspect of this application or of any possible implementation of the first aspect is carried out.

In a sixth aspect, a computer program product is provided, including computer-readable instructions. When the computer-readable instructions are run on a computer, the method of the first aspect of this application or of any possible implementation of the first aspect is carried out.

Description of drawings

Figure 1 is a schematic diagram of the processing flow of the mirror scrub mechanism;

Figure 2 is a schematic architectural diagram of a computing device provided by an embodiment of the present application;

Figure 3 is a schematic flowchart of a method for handling memory faults provided by an embodiment of the present application;

Figure 4 is a schematic diagram of determining a hard-failure UCE provided by an embodiment of the present application;

Figure 5 is a schematic diagram of the GHES table provided by an embodiment of the present application;

Figure 6 is a schematic structural diagram of a computing device provided by an embodiment of the present application;

Figure 7 is another schematic structural diagram of a computing device provided by an embodiment of the present application.

Detailed description

Embodiments of the present application provide a method for handling memory faults and related equipment, applied in the storage field. The method can pinpoint the exact location of a memory fault without releasing the memory mirroring mode.

The terms "first", "second", and so on in the description, the claims, and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely the way in which objects with the same attributes are distinguished when the embodiments of this application are described. Furthermore, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to such a process, method, product, or device.

The embodiments of this application draw on a good deal of background about memory mirroring mode. To make the solutions of the embodiments easier to understand, the terms and concepts that the embodiments may involve are introduced first.

Address space: The address space is the range of addresses assigned to memory. Each physical storage unit (one byte) is assigned a number, which is usually called "addressing" in the sense of assigning addresses. The purpose of assigning a number to a storage unit is to make it easy to find so that data can be read and written, which is "addressing" in the sense of locating a unit (for this reason the address space is sometimes also called the addressing space).

Memory mirroring mode: The memory spaces of two independent physical memory devices are configured as backups of each other, and this backup relationship is called mirroring. The computing device allocates exactly the same address space as the memory space for the two mirrored memories (for example, the address space corresponding to each of the two memories is 4G). The channel that connects one of the memories to the memory controller is the primary channel, and the channel that connects the other memory to the memory controller is the backup channel. When data is written, two copies of the memory data are made and written into the memory space corresponding to the primary channel and the memory space corresponding to the backup channel respectively. When data is read, it is read mainly from the memory space corresponding to the primary channel. When an error occurs in the memory space of the primary channel, the computing device reads the data from the memory space corresponding to the backup channel. In this way, even if the memory space corresponding to the primary channel fails, the correct data stored in the memory space corresponding to the backup channel can still be obtained, and the computing device continues to run normally.
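
The write-both, read-primary-first behavior of mirroring mode can be illustrated with a toy model. The class `MirrorPair` and the `bad_primary` set, which marks addresses that fail on the primary channel, are hypothetical names introduced only for this sketch.

```python
class MirrorPair:
    """Toy mirrored memory: writes go to both channels, reads prefer the primary."""

    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.bad_primary = set()  # hypothetical: addresses that fail on the primary

    def write(self, addr, value):
        # Two copies of the data are written, one per channel.
        self.primary[addr] = value
        self.backup[addr] = value

    def read(self, addr):
        # Fall back to the backup channel only when the primary copy is bad.
        if addr in self.bad_primary:
            return self.backup[addr], "backup"
        return self.primary[addr], "primary"
```

A read of a healthy address returns the value tagged `"primary"`; once the address is added to `bad_primary`, the same read returns the same value tagged `"backup"`.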

Mirror scrub: In memory mirroring mode, when data is read and the memory controller cannot correct the erroneous data in the memory space corresponding to the primary channel through its own error correction capability (that is, a UCE fault has occurred in the memory space corresponding to the primary channel), the memory controller issues a data read request to the backup channel. If the read succeeds (no error is returned), the memory controller passes the correct data to the module that requested the read and then writes the correct data back to the primary channel memory in an attempt to correct the erroneous data there. After the write-back completes, the data written back to the primary channel is read again and verified. This action is called mirror scrub.

Figure 1 is a schematic diagram of the processing flow of the mirror scrub mechanism. When a processor core of the central processing unit (CPU) reads data, the memory controller of the CPU first reads first data from the memory device corresponding to the primary channel. If the Rank that stores the data in the memory device corresponding to the primary channel is faulty (manufacturers refer to a 64-bit set of memory as a Rank), the first data obtained is found to be erroneous when verified. The memory controller then re-reads the first data from the memory device corresponding to the primary channel. If the first data is still erroneous and cannot be corrected, the memory controller reads correct second data from the memory device corresponding to the backup channel, where the second data has the same address as the first data. The memory controller returns the correct second data to the core to complete the read, and also writes the correct second data back to the address of the first data in the memory device corresponding to the primary channel.

Hard-failure UCE: After the computing device enables memory mirroring mode, if the memory controller finds through mirror scrub during operation that, after the correct data from the backup channel has been written back to the primary channel, verification still fails and the error cannot be corrected, the fault is considered a memory hard-failure UCE.
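
The mirror scrub flow and the hard-failure classification can be sketched as one function. The callback signatures (`read_primary`, `write_primary`, `read_backup`) and the outcome strings are assumptions made for the example; real controllers implement this in hardware.

```python
def mirror_scrub(read_primary, write_primary, read_backup, addr):
    """Toy mirror scrub for one address.

    read_primary(addr) -> (value, ok), where ok is the verification result
    write_primary(addr, value) -> None
    read_backup(addr) -> value
    Returns (value_for_the_reader, outcome).
    """
    value, ok = read_primary(addr)
    if ok:
        return value, "clean"
    value, ok = read_primary(addr)      # re-read the primary channel once
    if ok:
        return value, "clean_on_retry"
    good = read_backup(addr)            # correct copy from the backup channel
    write_primary(addr, good)           # write it back to the primary channel
    _, ok = read_primary(addr)          # read again to verify the write-back
    return good, ("repaired" if ok else "hard_failure_uce")
```

The reader always receives the good value from the backup channel; only the outcome distinguishes a transient error that the write-back fixed from a hard failure that persists.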

Mirror failover: The mirroring relationship between memory channels configured in mirroring mode is released. After the mirroring relationship is released, the memory controller uses only the memory on the backup channel and no longer uses the memory on the primary channel.

Page: In an operating system (OS), a Page is a fixed-length contiguous block of virtual memory. It is the smallest unit of data used for memory management in the operating system, and it is mapped to a contiguous block of physical memory of the same length. The Page size is usually determined by the processor architecture. Pages in an OS generally have a uniform size, typically 4096 bytes.

Page soft isolation (soft Page offline): This function copies the contents of the Page to be isolated elsewhere (or simply discards the contents of the Page if they are not needed), after which the original Page is removed from the operating system's memory management and is no longer used. This function does not kill or otherwise affect any application.
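
As one concrete illustration, assuming the OS is Linux (the application does not name an OS), kernels built with memory-failure support expose a sysfs file, `/sys/devices/system/memory/soft_offline_page`, to which a physical address can be written to request a soft offline of the containing page. The helper names below are hypothetical, and `soft_offline` needs root privileges and kernel support to succeed, so treat this as a sketch rather than a tested procedure.

```python
PAGE_SIZE = 4096  # the typical page size given in the text

SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


def page_base(phys_addr, page_size=PAGE_SIZE):
    """Round a physical address down to the start of the page that contains it."""
    return phys_addr - (phys_addr % page_size)


def soft_offline_request(phys_addr):
    """Return the hex string that would be written to the sysfs file."""
    return hex(page_base(phys_addr))


def soft_offline(phys_addr, path=SOFT_OFFLINE_PATH):
    """Ask the kernel to soft-offline the page (requires root and kernel support)."""
    with open(path, "w") as f:
        f.write(soft_offline_request(phys_addr))
```

For a fault address of `0x12345678`, `soft_offline_request` yields `0x12345000`, the start of the 4096-byte page that would be taken out of use.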

System management interrupt (SMI): An interrupt that is triggered by hardware and handled by the basic input output system (BIOS). The hardware can trigger it through a corresponding instruction. Once it is triggered, the CPU enters system management mode (SMM), the OS execution flow is suspended, and the SMI interrupt service routine registered in the BIOS is executed.

System control interrupt (SCI): An interrupt that is triggered by the BIOS and then handled by the OS, usually after the BIOS has finished handling the related SMI. After the SCI is triggered, it is handled by the SCI interrupt service routine registered in the OS kernel. This interrupt is used for communication between the BIOS and the OS.

Before the embodiments of this application are introduced, the way in which a hard-failure UCE is currently handled in memory mirroring mode is briefly described to make the embodiments easier to follow.

When a memory hard-failure UCE occurs in a current computing device, the memory controller marks the fault type as a mirror failover error and triggers an SMI. The BIOS responds to the SMI and executes the corresponding interrupt service routine, which carries out the mirror failover operation. After the mirror failover operation, the memory controller no longer accesses the memory on the primary channel when it performs read and write operations, and uses only the memory on the backup channel.

A channel carries at least one Rank of physical memory (manufacturers refer to a 64-bit set of memory as a Rank), and the capacity of one Rank of physical memory may reach about 16 GB and will keep growing as technology develops. When a memory hard-failure UCE occurs in a single bit of a Rank, performing the mirror failover operation to release the mirroring relationship makes not only the faulty memory location but also all the remaining healthy memory on the primary channel unavailable. This widens the scope of the adverse impact of a memory hard-failure UCE, wastes memory resources, and greatly increases the probability that the memory mirroring mode is released.
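
A short calculation shows the scale of the difference, using the figures quoted in the text: a Rank of about 16 GB (taken here as 16 GiB, an assumption about the exact size) and a typical page of 4096 bytes.

```python
RANK_BYTES = 16 * 2**30   # about 16 GB per Rank, taken as 16 GiB for the example
PAGE_BYTES = 4096         # typical page size

# Number of pages in one Rank: failover gives up all of them,
# whereas isolating the faulty page gives up exactly one.
pages_per_rank = RANK_BYTES // PAGE_BYTES
lost_fraction_with_page_isolation = PAGE_BYTES / RANK_BYTES

print(pages_per_rank)                               # 4194304
print(f"{lost_fraction_with_page_isolation:.2e}")   # 2.38e-07
```

Under these assumptions, mirror failover takes at least about four million pages out of service because of a fault that affects one of them.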

To solve the above problems, embodiments of the present application provide a method for handling memory faults and related equipment, applied in the storage field. After the memory fault on the primary channel is determined to be a hard-failure UCE, the fault address is obtained through the BIOS and set as an address to be isolated, and the soft Page offline operation is performed on the address to be isolated through the OS. This isolates the fault address precisely without affecting other memory addresses, so the mirroring relationship does not need to be released. The memory on the primary channel other than the fault address can still be read and written normally, which avoids healthy memory on the primary channel becoming unusable after the mirroring relationship is released and improves the utilization of the memory on the primary channel.

First, to make the subsequent embodiments easier to understand, the architecture of a computing device to which the method for handling memory faults provided by the embodiments of this application is applied is briefly described as an example. Refer to Figure 2, which is a schematic architectural diagram of a computing device provided by an embodiment of the present application. The hardware of the computing device specifically includes:

At least one CPU, a first memory, a second memory, and a memory controller. The first memory and the second memory are connected to the memory controller, and the memory controller is connected to the processor. The memory controller supports memory mirroring mode as well as error detection and error correction for memory data. The first memory is accessed through a first channel and the second memory is accessed through a second channel, and the memory controller configures the first memory and the second memory in mirroring mode, that is, the first memory and the second memory are mirror memories of each other. The BIOS and the OS can run on the CPU. The BIOS, as software, can be stored in the CPU or in a memory that is independent of the CPU, and the OS, as software, can likewise be stored in a memory. By invoking the BIOS in the memory, the CPU can initialize and configure the hardware of the computing device and, during operation, collect status information about some of its component modules by responding to interrupts, for example information about a fault in the first memory or the second memory. The BIOS can also exchange information with the OS, that is, the CPU can invoke the OS to perform operations based on information from the BIOS.

It should be noted that Figure 2 is used only as an example for the embodiments of this application and does not substantively limit this application. In practice, the memory controller may be a chip independent of the CPU as shown in Figure 2, or it may be integrated into the CPU, into the north bridge, or into the south bridge. This is not limited here.

示例性的，计算设备具体执行处理内存故障的方法包括：获取第一内存的不可纠正错误信息，该不可纠正错误信息包括故障地址，然后从第二内存中获取与故障地址对应的第一目标数据，并将第一目标数据写入第一内存的故障地址，其中，第一内存与第二内存互为镜像内存，校验位于第一内存的故障地址的第二目标数据，该第二目标数据是将第一内存的故障地址写入第一目标数据后，第一内存的故障地址中的数据，若该第二目标数据校验为故障数据，标记第一内存的故障地址为待隔离地址，并对待隔离地址执行页面软隔离操作。Exemplarily, the computing device specifically executes a method for handling a memory failure including: obtaining uncorrectable error information of the first memory, where the uncorrectable error information includes the fault address, and then obtaining the first target data corresponding to the fault address from the second memory. , and writes the first target data to the fault address of the first memory, where the first memory and the second memory are mirror memories of each other, and verifies the second target data located at the fault address of the first memory. The second target data After the fault address of the first memory is written into the first target data, the data in the fault address of the first memory. If the second target data is verified as fault data, the fault address of the first memory is marked as the address to be isolated. And perform page soft isolation operation on the address to be isolated.

For a better understanding of the embodiments of the present application, the method for handling memory faults provided by the embodiments is described in detail below with reference to the accompanying drawings. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems. Refer to Figure 3, which is a schematic flowchart of the method for handling a memory fault provided by an embodiment of the present application. The method includes:

301. Obtain the uncorrectable error information of the first memory.

The computing device obtains the uncorrectable error information of the first memory, where the uncorrectable error information includes the fault address.

For ease of understanding, the example of Figure 3 is described using the computing device of Figure 2. When an uncorrectable error (UCE) fault occurs in the first memory, the memory controller of the computing device obtains the uncorrectable error information of the first memory, which includes the fault address. The memory controller supports memory mirroring mode, memory data verification, error type identification, fault address recording, mirror scrub data write-back, and SMI interrupt triggering. For example, the memory controller reads data and its check bits from the first memory, then computes a new set of check bits from the data it has read. If the newly generated check bits differ from the check bits read from the first memory, the memory controller determines that the data read is erroneous and that the address in the first memory storing the data is the fault address.
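The read-time check described above can be sketched as follows. This is only an illustration under stated assumptions, not the controller's actual logic: real memory controllers use hardware ECC, and `zlib.crc32` merely stands in for the check bits here. The names `ToyMemory`, `UceInfo`, and `read_checked` are hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class UceInfo:
    """Hypothetical uncorrectable error record carrying the fault address."""
    fault_address: int


class ToyMemory:
    """Toy memory: each address holds (data, check code stored with the data)."""

    def __init__(self) -> None:
        self.cells: Dict[int, Tuple[bytes, int]] = {}

    def write(self, addr: int, data: bytes) -> None:
        self.cells[addr] = (data, zlib.crc32(data))

    def corrupt(self, addr: int, bad_data: bytes) -> None:
        # Simulate a fault: the data changes but the stored check code does not.
        _, stored_code = self.cells[addr]
        self.cells[addr] = (bad_data, stored_code)


def read_checked(mem: ToyMemory, addr: int) -> Tuple[Optional[bytes], Optional[UceInfo]]:
    """Read data, regenerate its check code, and compare with the stored one."""
    data, stored_code = mem.cells[addr]
    if zlib.crc32(data) == stored_code:
        return data, None                       # data is correct
    return None, UceInfo(fault_address=addr)    # mismatch: record fault address
```

A mismatch yields no data and a record naming the fault address, which is the information the subsequent repair step works from.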

After obtaining the uncorrectable error information of the first memory, the computing device repairs the data at the fault address in the first memory, as described in step 302:

302. Obtain the first target data corresponding to the fault address from the second memory, and write the first target data to the fault address of the first memory, where the first memory and the second memory are mirror memories of each other.

For example, by calling the BIOS, the CPU of the computing device configures the first memory and the second memory under the memory controller in mirroring mode, that is, the first memory and the second memory are mirror memories of each other. Once mirroring mode is enabled, the same address in the first memory and in the second memory stores the same data. The memory controller can use the mirror scrub mechanism to write the correct data from the second memory back to the fault address of the first memory. Specifically, after obtaining the uncorrectable error information of the first memory, the memory controller obtains the first target data corresponding to the fault address from the second memory: it reads the first target data and its check bits from the address in the second memory that is the same as the fault address. If the check bits newly generated from the first target data match the check bits read, the first target data is determined to be correct. The memory controller then writes the first target data to the fault address in the first memory to repair the data at that address.

303. Verify the second target data located at the fault address of the first memory. If the second target data is verified to be faulty data, mark the fault address of the first memory as an address to be isolated.

After writing the first target data obtained from the second memory to the fault address of the first memory, the memory controller reads the data at the fault address of the first memory again. This data is the second target data, which may or may not be identical to the first target data. The memory controller also reads the check code of the second target data and then verifies the second target data. If the newly generated check code matches the check code read from the first memory, the second target data is the same as the first target data, that is, the data at the fault address of the first memory has been restored to the correct data. If the newly generated check code differs from the check code read from the first memory, the second target data differs from the first target data, that is, the second target data is faulty data, and a hard-failure UCE fault can be considered to have occurred at the fault address of the first memory. In this case, because the second target data is verified to be faulty data, the fault address of the first memory is marked as an address to be isolated.
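Steps 302 and 303 can be sketched together as a write-back followed by a re-read check. This is a minimal model under assumptions not found in the original: `ToyChannel` and `scrub_and_verify` are hypothetical names, `zlib.crc32` stands in for the hardware check code, and a hard failure is modeled as a cell that stores zeros regardless of what is written.

```python
import zlib
from typing import Dict, Set, Tuple


class ToyChannel:
    """Hypothetical memory channel mapping address -> (data, check code)."""

    def __init__(self, stuck_addresses: Set[int] = frozenset()) -> None:
        self.cells: Dict[int, Tuple[bytes, int]] = {}
        self.stuck_addresses = set(stuck_addresses)

    def write(self, addr: int, data: bytes) -> None:
        code = zlib.crc32(data)
        if addr in self.stuck_addresses:
            # Model a hard failure: the cell stores zeros whatever is written.
            data = bytes(len(data))
        self.cells[addr] = (data, code)

    def read(self, addr: int) -> Tuple[bytes, bool]:
        """Return the stored data and whether its check code still matches."""
        data, code = self.cells[addr]
        return data, zlib.crc32(data) == code


def scrub_and_verify(first: ToyChannel, second: ToyChannel, fault_addr: int) -> bool:
    """Return True if the fault address must be marked as to be isolated."""
    first_target, ok = second.read(fault_addr)
    if not ok:
        raise RuntimeError("mirror copy failed its check; outside this sketch")
    first.write(fault_addr, first_target)          # step 302: write-back
    _second_target, ok = first.read(fault_addr)    # step 303: re-read and check
    return not ok
```

A transient (soft) error is cleared by the write-back and returns `False`; a stuck cell fails the re-read check and returns `True`, which corresponds to the hard-failure case that gets marked for isolation.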

In a possible implementation, the computing device marks the fault address of the first memory as an address to be isolated by means of a to-be-isolated identifier. For example, the to-be-isolated identifier may be a character, a number, a character string, a word, a combination of words, and so on; this is not limited here.

In a possible implementation, the computing device generates a generic hardware error source (GHES) table, which includes the fault address and the corresponding to-be-isolated identifier.

Optionally, the fault level of the fault address in the GHES table is correctable. For example, the fault level of the fault address is set to corrected in the GHES table, where corrected indicates that the fault is correctable (the correct first target data can still be obtained from the second memory through the backup channel, that is, the second channel, so from the point of view of the computing device the fault at this address is correctable). This reflects that the fault at this address does not adversely affect the operation of the computing device.

The computing device then performs a page soft isolation operation on the address to be isolated, as described in step 304:

304. Perform a page soft isolation operation on the address to be isolated.

Specifically, the computing device performs a soft page offline operation on the address to be isolated. The address to be isolated is removed from the memory management system of the operating system and is no longer used, which achieves soft isolation of the fault address. There is therefore no need to release the mirroring relationship between the first memory and the second memory, and the memory space in the first memory other than the address to be isolated is neither removed nor otherwise affected.
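As a user-space illustration of soft page offline, the sketch below assumes a Linux system, which the original does not specify. Linux kernels built with memory-failure support expose `/sys/devices/system/memory/soft_offline_page`, and writing a physical address to it asks the kernel to soft-offline the page containing that address. The helper names and the 4 KiB page size are assumptions, and in the scheme described here the OS would perform the operation itself rather than through a script.

```python
from pathlib import Path

PAGE_SIZE = 4096  # assumption: 4 KiB pages
SOFT_OFFLINE_PATH = "/sys/devices/system/memory/soft_offline_page"


def page_base(phys_addr: int, page_size: int = PAGE_SIZE) -> int:
    """Round a physical address down to the start of its page."""
    return phys_addr & ~(page_size - 1)


def soft_offline(phys_addr: int, sysfs_path: str = SOFT_OFFLINE_PATH) -> str:
    """Write the page-aligned address to the soft-offline interface.

    Needs root privileges on a real system; sysfs_path is a parameter so
    the function can be exercised against an ordinary file.
    """
    payload = hex(page_base(phys_addr))
    Path(sysfs_path).write_text(payload)
    return payload
```

Only the page containing the fault address is taken out of use; every other page of the first memory stays available, which is why the mirroring relationship does not have to be released.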

In a possible implementation, the computing device calls the OS to perform the page soft isolation operation on each fault address in the GHES table whose fault level is correctable and which carries the to-be-isolated identifier. This embodiment shows a specific implementation of the solution and reflects its reliability.

In the embodiments of the present application, the uncorrectable error information of the first memory is obtained, where the uncorrectable error information includes the fault address; the first target data corresponding to the fault address is obtained from the second memory and written to the fault address of the first memory, where the first memory and the second memory are mirror memories of each other; and the second target data located at the fault address of the first memory is verified. If the second target data is verified to be faulty data, the fault address of the first memory is marked as an address to be isolated, and a page soft isolation operation is performed on it. The fault address is thereby isolated precisely, without affecting other memory addresses, so the mirroring relationship between the first memory and the second memory does not need to be released. The memory space in the first memory other than the fault address continues to support reads and writes normally, which avoids the healthy memory space in the first memory becoming unusable after the mirroring relationship is released. This improves the utilization of the first memory, narrows the scope of the adverse effects of hard-failure UCEs, avoids wasting memory resources, and greatly reduces the probability that memory mirroring mode is released.

For a better understanding of the embodiment shown in Figure 3, a specific application scenario of the computing device of Figure 2 is described below as an example. Refer to Figure 4, which is a schematic diagram of determining a hard-failure UCE provided by an embodiment of the present application.

The computing device uses the BIOS to configure the memory connected to the memory controller in mirroring mode. For example, in Figure 2, the first memory is set as the memory space under the first channel and the second memory as the memory space under the second channel; the first channel can then serve as the primary channel and the second channel as the backup channel.

When the memory controller writes data for storage, it writes the same data through the first channel and the second channel simultaneously into the mirrored first memory and second memory, thereby backing up the memory data. When the memory controller reads data, it reads from the first memory through the first channel and verifies the data on every read. Specifically, the memory controller also reads the check code corresponding to the data from the first memory and generates a new check code from the data it has read. If the newly generated check code matches the check code read, the data is considered correct, free of errors and faults, and is sent to the processor. If the newly generated check code does not match the check code read, the memory controller determines that the data read is faulty, that is, a UCE error is detected in the first memory that the memory controller cannot correct. The memory controller determines that the address in the first memory storing the faulty data is the fault address, and obtains the uncorrectable error information of the first memory, which includes the fault address. The memory controller then uses the mirror scrub mechanism to read, through the second channel, the corresponding correct backup data (the first target data) from the address in the second memory that is the same as the fault address, and writes the first target data back to the fault address in the first memory. Next, the memory controller reads the second target data, that is, the data at the fault address after the write-back, from the first memory through the first channel and verifies it in the same way as described above, which is not repeated here. If the second target data is verified to be correct, the fault that just occurred can be marked as mirror corrected, identifying it as a correctable fault. If the second target data is still faulty after verification, the fault can be marked as mirror failover error, identifying it as a hard-failure UCE.

Optionally, a hard-failure UCE or a correctable fault may also be identified by other numbers, characters, character strings, Chinese characters, words, combinations of words, combinations of numbers and characters, and so on. In practice, other forms of identification may also be used; this is not limited here.

The memory controller also triggers an SMI interrupt, and the CPU executes the SMI interrupt service routine in the BIOS. For example, as shown in Figure 4, after the memory controller triggers the SMI interrupt, the CPU enters SMM mode and executes the SMI interrupt service routine in the BIOS to obtain the fault address whose fault is marked as mirror failover error, and sets it as an address to be isolated.

In a possible implementation, the CPU uses the BIOS to set a corresponding identifier for the fault address. The identifier indicates that the fault address needs to be isolated, that is, that the fault address is an address to be isolated.

Optionally, the CPU generates a GHES table through the BIOS. The GHES table may include the fault address and an identifier indicating that the fault address is to be isolated. For example, in Figure 4, the SMI interrupt service routine generates a corresponding GHES table for the fault address and sets the fault level of the fault address to corrected in the table, where corrected indicates that the fault is correctable (the correct first target data can still be obtained from the second memory through the backup channel, that is, the second channel, so from the point of view of the computing device the fault at this address is correctable). The corresponding flag is set to error threshold exceeded, which serves as the identifier indicating that the fault address is to be isolated.

Refer to Figure 5, which is a schematic diagram of the GHES table provided by an embodiment of the present application. The fault addresses are 0xa2 and 0x3b, both fault levels are corrected, and both flags are error threshold exceeded, which means that addresses 0xa2 and 0x3b are both addresses to be isolated. Optionally, the flag may instead be soft Page offline, indicating that a soft Page offline operation needs to be performed on the fault address, that is, identifying the fault address as an address to be isolated. Other numbers, characters, character strings, Chinese characters, words, combinations of words, or combinations of numbers and characters may also be used to indicate that the fault address is to be isolated. In practice, other forms of identification may also be used; this is not limited here.
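The Figure 5 table and the selection of addresses to be isolated can be sketched as follows. This is a simplified stand-in: an actual ACPI GHES error record has many more fields, and `GhesEntry`, `addresses_to_isolate`, and `FIGURE_5_TABLE` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GhesEntry:
    """Simplified row of the GHES table: address, fault level, and flag."""
    fault_address: int
    severity: str
    flag: str


def addresses_to_isolate(table: List[GhesEntry]) -> List[int]:
    """Select addresses whose level is corrected and whose flag requests isolation."""
    return [
        entry.fault_address
        for entry in table
        if entry.severity == "corrected" and entry.flag == "error threshold exceeded"
    ]


# The two rows described for Figure 5.
FIGURE_5_TABLE = [
    GhesEntry(0xA2, "corrected", "error threshold exceeded"),
    GhesEntry(0x3B, "corrected", "error threshold exceeded"),
]
```

With the Figure 5 rows, both 0xa2 and 0x3b are selected; an entry with a different flag or fault level would be skipped.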

In the embodiments of the present application, the fault address is set as an address to be isolated by generating the GHES table, so that the OS can subsequently obtain the address to be isolated. This reduces data transmission between the BIOS and the OS and improves efficiency.

In addition, it can be understood that a corresponding identifier may be set for the fault address in other ways to mark it as an address to be isolated. For example, a register may be defined to indicate the status of the faulty address, with one bit of the register set to 1 to indicate that the fault address is an address to be isolated; or a corresponding identifier may be set in another table to indicate that the fault address is an address to be isolated. In practice, this can also be implemented in other ways; this is not limited here.
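The register-based alternative amounts to setting and testing a single bit. A minimal sketch follows, in which the bit position and the function names are assumptions; the original does not define a register layout.

```python
TO_BE_ISOLATED_BIT = 0  # hypothetical bit position within the status register


def mark_to_be_isolated(status: int) -> int:
    """Return the register value with the to-be-isolated bit set to 1."""
    return status | (1 << TO_BE_ISOLATED_BIT)


def is_to_be_isolated(status: int) -> bool:
    """Report whether the to-be-isolated bit is set."""
    return bool((status >> TO_BE_ISOLATED_BIT) & 1)
```

The other bits of the register are left unchanged, so they remain free to carry other status information about the faulty address.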

The SMI interrupt service routine of the BIOS then reports an SCI interrupt to the OS, which triggers the CPU to execute the corresponding SCI interrupt service routine in the OS to obtain the address to be isolated and perform the soft Page offline operation on it. For example, in Figure 4, after the GHES table is generated, the SCI interrupt is triggered; the CPU then runs the SCI interrupt service routine in the OS to obtain from the GHES table the fault address whose flag is error threshold exceeded, that is, the address to be isolated, and performs the soft Page offline operation. The fault address is thereby soft-isolated and no longer supports memory reads and writes.
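The hand-off from the BIOS to the OS can be sketched as two handlers connected by the table. The interrupts themselves are not modeled: `smi_handler`, `sci_handler`, and `handle_hard_uce` are hypothetical names, and the offline operation is passed in as a callback rather than being a real OS call.

```python
from typing import Callable, Dict, List, Union

Record = Dict[str, Union[int, str]]


def smi_handler(failover_addresses: List[int]) -> List[Record]:
    """BIOS side: build GHES-like records for mirror failover errors."""
    return [
        {"address": addr, "severity": "corrected", "flag": "error threshold exceeded"}
        for addr in failover_addresses
    ]


def sci_handler(ghes_table: List[Record], offline_page: Callable[[int], None]) -> None:
    """OS side: soft-offline every address flagged as to be isolated."""
    for record in ghes_table:
        if (record["severity"] == "corrected"
                and record["flag"] == "error threshold exceeded"):
            offline_page(record["address"])


def handle_hard_uce(failover_addresses: List[int],
                    offline_page: Callable[[int], None]) -> None:
    table = smi_handler(failover_addresses)   # SMI: BIOS fills in the table
    sci_handler(table, offline_page)          # SCI: OS reads it and isolates
```

Because the table is the only thing passed between the two handlers, the BIOS never needs to call into the OS memory manager directly.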

Alternatively, in a possible implementation, the CPU runs the BIOS to send the OS a message carrying the address to be isolated, and the OS obtains the address to be isolated from that message and performs the soft Page offline operation on it. Optionally, the message may carry both the address to be isolated and the corresponding identifier; this is not limited here. In this embodiment, the address to be isolated is obtained from a message sent by the BIOS, which adds a further implementation and application scenario and increases the flexibility of the solution.

In the embodiments of the present application, the BIOS sets the fault address of a hard-failure UCE as an address to be isolated that requires a soft isolation operation, and the soft Page offline operation of the OS then isolates the fault address precisely, eliminating the impact of the fault. No mirror failover operation is needed to release the mirroring mode of the memory, and other memory addresses are unaffected, so the mirroring relationship between the first memory and the second memory does not need to be released. The memory space in the first memory other than the fault address continues to support reads and writes normally, which avoids the healthy memory space in the first memory becoming unusable after the mirroring relationship is released. This improves the utilization of the first memory, narrows the scope of the adverse effects of hard-failure UCEs, avoids wasting memory resources, and greatly reduces the probability that memory mirroring mode is released.

It should be noted that Figure 4 is only an example for understanding the embodiments of the present application and does not substantially limit the solution. The solution can also be implemented in other ways; this is not limited here.

The method for handling memory faults provided by the embodiments of the present application has been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method and its core ideas. A person of ordinary skill in the art may, based on the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

As shown in Figure 6, an embodiment of the present application further provides a computing device applied in the storage field. Figure 6 is a schematic structural diagram of the computing device provided by an embodiment of the present application. In a possible implementation, the computing device may include modules or units in one-to-one correspondence with the methods/operations/steps/actions of Figure 3 in the above method embodiment; each unit may be a hardware circuit, software, or a hardware circuit combined with software. In a possible implementation, the computing device may include a processor 601, a memory controller 602, a first memory 603, and a second memory 604. The first memory 603 and the second memory 604 are connected to the memory controller 602, the memory controller 602 is connected to the processor 601, and the first memory 603 and the second memory 604 are mirror memories of each other. The memory controller 602 may be configured to perform the following steps of the above method embodiment: obtaining the uncorrectable error information of the first memory 603, where the uncorrectable error information includes the fault address; obtaining the first target data corresponding to the fault address from the second memory 604 and writing the first target data to the fault address of the first memory 603; and verifying the second target data located at the fault address of the first memory 603. The processor 601 may be configured to perform the following steps of the above method embodiment: if the second target data is faulty data, marking the fault address of the first memory 603 as an address to be isolated, and performing a page soft isolation operation on the address to be isolated.

In other possible designs, the processor 601 and the memory controller 602 may perform, in one-to-one correspondence, the methods/operations/steps/actions of the computing device in the various possible implementations of the above method embodiment.

In a possible design, the processor 601 is specifically configured to mark the fault address of the first memory 603 as an address to be isolated by means of a to-be-isolated identifier.

In a possible design, the processor 601 is specifically configured to perform the page soft isolation operation on the address to be isolated through the OS.

In a possible design, the processor 601 is specifically configured to generate a GHES table, which includes the fault address and the corresponding to-be-isolated identifier.

In a possible design, the fault level of the fault address in the GHES table is correctable.

In a possible design, the processor 601 is specifically configured to perform, through the OS, the page soft isolation operation on each fault address in the GHES table whose fault level is correctable and which carries the to-be-isolated identifier.

For the beneficial effects of the computing device in the various designs described above, refer to the beneficial effects of the corresponding implementations in the method embodiments of Figure 3 and Figure 4; they are not repeated here.

It should be noted that the information exchange between the modules/units of the computing device in the embodiment corresponding to Figure 6, their execution processes, and related content are based on the same concept as the method embodiment corresponding to Figure 3. For details, refer to the description of the method embodiments above; they are not repeated here.

In addition, the functional modules or units in the embodiments of the present application may be integrated into one processor, and the integrated modules or units may be implemented in the form of hardware or in the form of software functional modules.

A computing device provided by an embodiment of the present application is described next. Refer to Figure 7, which is another schematic structural diagram of the computing device provided by an embodiment of the present application. The computing device may be, for example, an electronic device such as a server or a computer. Specifically, the computing device 700 includes a CPU 701 and at least one memory 702, where the memory 702 may be transient storage or persistent storage. The programs stored in the memory 702 may include one or more modules (not shown in the figure); for example, the BIOS and/or OS programs are stored in the memory 702, and each module may include a series of instruction operations for the computing device 700. Furthermore, the CPU 701 may be configured to communicate with the memory 702 and execute, on the computing device 700, the series of instruction operations in the memory 702.

In this embodiment of this application, the CPU 701 is configured to execute the method in the embodiment corresponding to FIG. 3. For example, the CPU 701 may be configured to: obtain uncorrectable error information of the first memory, where the uncorrectable error information includes a fault address; obtain, from the second memory, first target data corresponding to the fault address, and write the first target data to the fault address of the first memory, where the first memory and the second memory are mirror memories of each other; verify second target data located at the fault address of the first memory, where the second target data is the data held at the fault address of the first memory after the first target data has been written to that address; and, if the second target data is verified to be faulty data, mark the fault address of the first memory as an address to be isolated and perform a page soft isolation operation on the address to be isolated. In this way, the fault address is isolated precisely and the impact of the fault is completely eliminated, without performing a mirror failover operation to release the mirror mode between the first memory and the second memory. The memory space in the first memory other than the fault address therefore continues to support normal reads and writes. This prevents the healthy memory space in the first memory from becoming unusable after the mirror relationship is released, increases the likelihood that the first memory remains in use, narrows the scope adversely affected by hard-failure UCEs, avoids wasting memory resources, and greatly reduces the probability that the memory mirror mode is released.
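The flow described above can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not the implementation of this application: the names `MirroredMemory`, `ErrorRecord`, `handle_uce`, and `soft_isolate`, the 4096-byte page size, and the use of `None` to stand for data that fails verification are all hypothetical. In a real system the verification would be performed by hardware, and the record would be reported to the OS through a GHES table rather than a Python object.

```python
from dataclasses import dataclass, field
from typing import Optional

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


@dataclass
class MirroredMemory:
    """Toy model: two mirrored memories plus a set of hard-failed cells in the first memory."""
    first: dict = field(default_factory=dict)
    second: dict = field(default_factory=dict)
    hard_failed: set = field(default_factory=set)

    def write_first(self, addr: int, value: int) -> None:
        # A hard-failed cell does not retain what is written to it.
        self.first[addr] = None if addr in self.hard_failed else value

    def verify_first(self, addr: int) -> Optional[int]:
        # Stand-in for the hardware check: None means the data failed verification.
        return self.first.get(addr)


@dataclass
class ErrorRecord:
    """Hypothetical stand-in for one entry reported to the OS."""
    fault_address: int
    severity: str          # reported as "correctable" so that the OS does not treat it as fatal
    to_be_isolated: bool   # the "identification to be isolated"


def handle_uce(mem: MirroredMemory, fault_address: int) -> Optional[ErrorRecord]:
    # Fetch the first target data from the mirror and write it back to the fault address.
    first_target = mem.second[fault_address]
    mem.write_first(fault_address, first_target)
    # Verify the second target data, i.e. what the fault address holds after the write-back.
    second_target = mem.verify_first(fault_address)
    if second_target is not None:
        return None  # the write-back repaired the data; nothing to isolate
    # Still faulty: mark the fault address as an address to be isolated.
    return ErrorRecord(fault_address, "correctable", True)


def soft_isolate(record: Optional[ErrorRecord], isolated_pages: set) -> None:
    # OS side: isolate only the page containing a marked, correctable-level fault address.
    if record and record.severity == "correctable" and record.to_be_isolated:
        isolated_pages.add(record.fault_address // PAGE_SIZE)
```

In this model a transient error is repaired by the write-back and produces no record, whereas an address in `hard_failed` yields a record and only its page is taken out of use; the mirror relationship itself is never released.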

Embodiments of this application further provide a computer-readable storage medium that includes computer-readable instructions. When the computer-readable instructions are run on a computer, the computer is caused to execute any one of the implementations shown in the foregoing method embodiments.

Embodiments of this application further provide a computer program product. The computer program product includes a computer program or instructions. When the computer program or instructions are run on a computer, the computer is caused to execute any one of the implementations shown in the foregoing method embodiments.

This application further provides a chip or a chip system. The chip may include a processor. The chip may further include a memory (or storage module) and/or a transceiver (or communication module), or the chip may be coupled to a memory (or storage module) and/or a transceiver (or communication module). The transceiver (or communication module) may be used to support wired and/or wireless communication by the chip, and the memory (or storage module) may be used to store a program or a set of instructions. By invoking the program or the set of instructions, the processor can implement the operations performed by the terminal or the network device in the foregoing method embodiments or in any possible implementation of the method embodiments. The chip system may consist of the foregoing chip, or may include the foregoing chip together with other discrete devices, such as a memory (or storage module) and/or a transceiver (or communication module).

It should also be noted that the apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in this application, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines.

From the foregoing description of the embodiments, a person skilled in the art can clearly understand that this application may be implemented by software plus the necessary general-purpose hardware, and may of course also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can readily be implemented by corresponding hardware, and the specific hardware structure used to implement the same function may take various forms, such as an analog circuit, a digital circuit, or a dedicated circuit. For this application, however, a software implementation is the preferred embodiment in most cases. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to execute the methods of the embodiments of this application.

Claims

1. A method for handling a memory failure, comprising:

obtaining uncorrectable error information of a first memory, wherein the uncorrectable error information comprises a fault address;

acquiring first target data corresponding to the fault address from a second memory, and writing the first target data into the fault address of the first memory, wherein the first memory and the second memory are mirror memories of each other;

verifying second target data located at the fault address of the first memory, and marking the fault address of the first memory as an address to be isolated if the second target data is verified to be fault data; and

performing a page soft isolation operation on the address to be isolated.

2. The method of claim 1, wherein marking the fault address of the first memory as an address to be isolated comprises:

marking the fault address of the first memory as the address to be isolated based on an identification to be isolated.

3. The method of claim 1 or 2, wherein performing a page soft isolation operation on the address to be isolated comprises:

performing the page soft isolation operation on the address to be isolated through an operating system (OS).

4. The method of claim 3, wherein marking the fault address of the first memory as the address to be isolated based on an identification to be isolated comprises:

generating a generic hardware error source (GHES) table, wherein the GHES table comprises the fault address and the corresponding identification to be isolated.

5. The method of claim 4, wherein the fault level of the fault address in the GHES table is correctable.

6. The method of claim 5, wherein performing a page soft isolation operation on the address to be isolated comprises:

performing, through the OS, the page soft isolation operation on the fault address that, in the GHES table, has a fault level of correctable and corresponds to the identification to be isolated.

7. A computing device, comprising a processor, a first memory, a second memory and a memory controller, wherein the first memory and the second memory are connected to the memory controller, the memory controller is connected to the processor, and the first memory and the second memory are mirror memories of each other;

the memory controller is configured to obtain uncorrectable error information of the first memory, wherein the uncorrectable error information comprises a fault address;

the memory controller is further configured to obtain first target data corresponding to the fault address from the second memory, and write the first target data into the fault address of the first memory;

the memory controller is further configured to verify second target data located at the fault address of the first memory;

the processor is configured to mark the fault address of the first memory as an address to be isolated if the second target data is verified to be fault data; and

the processor is further configured to perform a page soft isolation operation on the address to be isolated.

8. The computing device of claim 7, wherein the processor is specifically configured to mark the fault address of the first memory as the address to be isolated based on an identification to be isolated.

9. The computing device of claim 8, wherein the processor is configured to generate a generic hardware error source (GHES) table, wherein the GHES table comprises the fault address and the corresponding identification to be isolated.

10. The computing device of claim 9, wherein the processor is specifically configured to perform, through the OS, the page soft isolation operation on the fault address that, in the GHES table, corresponds to the identification to be isolated and has a fault level of correctable.