CN117806855A - Memory error processing method and device - Google Patents
- Publication number
- CN117806855A (application CN202211172016.1A)
- Authority
- CN
- China
- Prior art keywords
- memory
- computer system
- target
- memory area
- data migration
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/073—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present application provides a memory error handling method applied to a computer system that includes a memory. The method includes: when it is determined that data migration and memory isolation need to be performed on a target memory area of the memory in which a correctable error has occurred, obtaining several performance indicators of the computer system within the current time interval, and determining from those indicators whether the computer system is in an idle state; and, when the computer system is in the idle state, performing data migration and memory isolation on the target memory area. Because data migration and memory isolation are performed on the target memory area only once the computer system has been determined to be idle, this avoids impairing the computer system's efficient execution of other services.
Description
Technical field
The present application relates to the field of computer technology, and in particular to a memory error processing method and device.
Background
Random access memory (RAM), usually referred to simply as memory, is one of the key components of a computer system. When a correctable error (corrected error, CE) occurs in the memory, it can be corrected with various error correction algorithms, including error checking and correction (ECC), and various techniques, including adaptive double device data correction (ADDDC), can be used to perform data migration and memory isolation on the memory area in which the CE occurred.
Performing data migration and memory isolation on a memory area usually consumes a large share of the computer system's resources, which may prevent the computer system from efficiently executing the other services it is currently running.
Summary of the invention
The embodiments of the present application provide at least a memory error processing method and device. When data migration and memory isolation need to be performed on a target memory area in which a CE has occurred, whether the computer system is in an idle state can be judged from several performance indicators of the computer system within the current time interval, and data migration and memory isolation are performed on the target memory area only when the computer system is determined to be idle. This avoids impairing the computer system's efficient execution of other services as a result of the data migration and memory isolation.
In a first aspect, a memory error handling method is provided, which is applied to a computer system including a memory. The method includes: when data migration and memory isolation need to be performed on a target memory area of the memory in which a correctable error (CE) has occurred, first obtaining several performance indicators of the computer system within the current time interval, and determining from those indicators whether the computer system is in an idle state; and, when the computer system is determined to be in the idle state, performing data migration and memory isolation on the target memory area.
In this way, when data migration and memory isolation need to be performed on a target memory area in which a CE has occurred, whether the computer system is idle can be judged from several performance indicators of the computer system within the current time interval, and the data migration and memory isolation are performed only when the computer system is determined to be idle. This avoids impairing the computer system's efficient execution of other services as a result of the data migration and memory isolation.
In a possible implementation, the several performance indicators may include, but are not limited to, any one or more of the following: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that relies on the computer system and is in a busy state is located in the same non-uniform memory access (NUMA) node as the target memory area.
In a possible implementation, the method further includes: obtaining memory error information of the computer system; determining, from the memory error information, the target memory area of the memory in which the CE occurred and the CE mode; and determining, according to the CE mode, whether data migration and memory isolation need to be performed on the target memory area. In this implementation, because not every CE makes it likely that the memory area will go on to suffer an uncorrectable error (uncorrected error, UCE), not every CE is treated as a condition requiring data migration and memory isolation of the memory area in which it occurred. This avoids the other problems that frequent data migration and isolation of such memory areas would cause.
In a possible implementation, determining according to the CE mode whether data migration and memory isolation need to be performed on the target memory area includes: when the CE mode belongs to several preconfigured target CE modes, determining that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, determining according to the CE mode whether data migration and memory isolation need to be performed on the target memory area includes: when the CE mode belongs to several preconfigured target CE modes, incrementing by 1 the count of CEs belonging to the target CE modes that have occurred in the target memory area; and, when the incremented count reaches a preset threshold, determining that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the several target CE modes include at least one of the following CE modes: row CE, column CE, and bank CE.
In a second aspect, a memory error handling device is provided, which is deployed in a computer system including a memory. The device includes: an indicator acquisition module, configured to obtain several performance indicators of the computer system within the current time interval when data migration and memory isolation need to be performed on a target memory area of the memory in which a correctable error (CE) has occurred; a state judgment module, configured to determine from the several performance indicators whether the computer system is in an idle state, and to trigger an isolation processing module when the computer system is in the idle state; and the isolation processing module, configured to perform data migration and memory isolation on the target memory area when triggered by the state judgment module.
In a possible implementation, the several performance indicators include any one or more of the following: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that relies on the computer system and is in a busy state is located in the same NUMA node as the target memory area.
In a possible implementation, the device further includes: an information acquisition module, configured to obtain memory error information of the computer system; and a fault analysis module, configured to determine, from the memory error information, the target memory area of the memory in which the CE occurred and the CE mode, and to determine, according to the CE mode, whether data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the fault analysis module is specifically configured to determine that data migration and memory isolation need to be performed on the target memory area when the CE mode belongs to several preconfigured target CE modes.
In a possible implementation, the fault analysis module is specifically configured to: when the CE mode belongs to several preconfigured target CE modes, increment by 1 the count of CEs belonging to the target CE modes that have occurred in the target memory area; and, when the incremented count reaches a preset threshold, determine that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the several target CE modes include at least one of the following CE modes: row CE, column CE, and bank CE.
In a third aspect, an embodiment of the present application provides a computing device including a memory and a processor. The memory stores executable code, and the processor executes the executable code to implement the method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer system including a memory and a processor. The memory stores executable code, and the processor executes the executable code to implement the method provided in the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed in a computer, it causes the computer to implement the method provided in the first aspect.
In a sixth aspect, an embodiment of the present application provides a computer program or computer program product. The computer program or computer program product includes instructions which, when executed, implement the method provided in the first aspect.
In a seventh aspect, an embodiment of the present application provides a chip including at least one processor and an interface. The at least one processor determines program instructions or data through the interface, and is configured to execute the program instructions to implement the method provided in the first aspect.
It can be understood that, for the beneficial effects of the second to seventh aspects, reference may be made to the relevant description of the first aspect, which is not repeated here.
Brief description of the drawings
Figure 1 is a first schematic structural diagram of a computer system provided in an embodiment of the present application;
Figure 2 is a flow chart of a memory error handling method provided in an embodiment of the present application;
Figure 3 is a second schematic structural diagram of a computer system provided in an embodiment of the present application;
Figure 4 is a third schematic structural diagram of a computer system provided in an embodiment of the present application;
Figure 5 is a schematic structural diagram of a memory error processing device provided in an embodiment of the present application;
Figure 6 is a schematic diagram of a computing device provided in an embodiment of the present application.
Detailed description
The technical solution of the present application is described in further detail below with reference to the accompanying drawings and embodiments.
Errors that occur in the memory of a computer system can usually be divided into two types: CE and UCE. A CE can usually be corrected with various error correction algorithms, including ECC. A UCE may cause other problems because the services executed by the computer system cannot correctly access the memory area in which the UCE occurred, and may even directly cause the computer system to stop running.
A memory area in which a UCE occurs has often experienced several CEs belonging to specific patterns beforehand. Analysis of a limited data set found that row CEs account for about 17%, column CEs for about 15.3%, and bank CEs for about 15.7%; the probability that a row CE occurs first and a UCE follows is about 25%, the probability that a column CE occurs first and a UCE follows is about 23.9%, and the probability that a bank CE occurs first and a UCE follows is about 22.6%. Based on these findings, it can be concluded that after several CEs belonging to specific patterns have occurred in a memory area, for example CEs belonging to CE modes such as row CE, column CE, and bank CE, a UCE may subsequently occur in that memory area. It is therefore worth considering, once several CEs belonging to specific patterns are found in a memory area, performing data migration and memory isolation on that memory area, so that the services executed by the computer system can correctly access the data originally stored there and no longer access the memory area itself. This reduces the frequency of UCEs in the memory and improves the availability of the computer system.
For example, adaptive double device data correction (ADDDC) technology can be used to perform data migration and memory isolation on a memory area. Refer to the computer system shown in Figure 1. The processor and the basic input output system (BIOS) of the computer system can each be implemented as corresponding firmware, and the processor can connect to several dual inline memory modules (DIMMs) through its memory controller, for example to two DIMMs, DIMM0 and DIMM1, through a single memory channel. A single DIMM may include, for example, two ranks, rank0 and rank1; a single rank may include, for example, 18 chips, chip 00 to chip 17, with chip 17 serving as a redundant chip; and a single chip may include n+1 logical banks, bank 0 to bank n. Suppose that bank n of chip 00 in rank0 of DIMM0 has experienced CEs and, based on certain rules, is determined to require data migration and memory isolation. ADDDC technology can then be used, for example, to migrate the data stored in bank n of chip 00 in rank0 of DIMM0 to bank n of chip 17 in rank0 of DIMM1 and to bank n of chip 17 in rank0 of DIMM0, and to isolate bank n of chip 00 in rank0 of DIMM0. The data migrated to bank n of chip 17 in rank0 of DIMM1, together with the data migrated to bank n of chip 17 in rank0 of DIMM0, can be used to recover the data originally stored in bank n of chip 00 in rank0 of DIMM0.
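The addressing in the ADDDC example above can be sketched as follows. This is only an illustration of which banks are involved, assuming the layout of the example (chip 17 as the redundant chip, DIMM0 and DIMM1 on one channel). The names `BankAddress`, `SPARE_CHIP`, and `migration_targets` are hypothetical, and real ADDDC is carried out by the memory controller and firmware, not by code like this.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BankAddress:
    """Hypothetical address of one logical bank: DIMM, rank, chip, bank."""
    dimm: int
    rank: int
    chip: int
    bank: int


# Assumption taken from the example: chip 17 is the redundant chip of a rank.
SPARE_CHIP = 17


def migration_targets(failed: BankAddress, channel_dimms=(0, 1)) -> list:
    """Return the spare-chip banks that receive the failed bank's data.

    Mirrors the example: the same rank and bank index on the redundant chip
    of every DIMM on the channel.
    """
    if failed.chip == SPARE_CHIP:
        raise ValueError("the redundant chip has no spare to migrate to")
    return [replace(failed, dimm=d, chip=SPARE_CHIP) for d in channel_dimms]


failed = BankAddress(dimm=0, rank=0, chip=0, bank=5)
targets = migration_targets(failed)
for t in targets:
    print(t)
```

After the copy, the failed bank would be marked as isolated so that later accesses are served from the two target banks.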
Although the foregoing describes, by way of example, using ADDDC technology to perform data migration and memory isolation on a logical bank in which CEs occur, it can be understood that other technologies may also be used to perform data migration and memory isolation on a memory area in which CEs occur, for example adaptive double device data correction-multiple region (ADDDC-MR), adaptive data correction-single region (ADC-SR), and adaptive double device error correction (ADDEC).
Although the foregoing describes, by way of example, performing data migration and memory isolation on a bank in which CEs occur, it can be understood that the memory area in which CEs occur may also be a rank, a chip, a row of a bank, a column of a bank, and so on.
Performing data migration and memory isolation on a memory area heavily consumes the resources of the computer system, which may impair the computer system's efficient execution of other services. A limited number of experiments found that performing data migration and memory isolation on a memory area with ADDDC technology significantly affects storage bandwidth, forwarding bandwidth, and the data processing latency of the processor: the maximum data input latency reached 710 ms, the maximum data output latency reached 63 ms, processor performance dropped by about 1%, and the sharp rise in processor occupancy lasted about 10 ms. It may even cause virtual machines that rely on the computer system to reset, cause database input/output errors, and lead to other problems.
In view of the above problems, the embodiments of the present application provide a memory error processing method and device. When data migration and memory isolation need to be performed on a target memory area in which a CE has occurred, whether the computer system is in an idle state can be judged from several performance indicators of the computer system within the current time interval, and the data migration and memory isolation are performed on the target memory area only when the computer system is determined to be idle. This avoids impairing the computer system's efficient execution of other services as a result of the data migration and memory isolation.
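The core idea, deferring migration and isolation until the system is idle, can be sketched as a small polling loop. This is a minimal sketch under stated assumptions: the text only says the work is done when the system is found idle in the current time interval, so re-checking once per interval and giving up after `max_checks` intervals are assumptions, and `isolate_when_idle`, `is_idle`, and `migrate_and_isolate` are hypothetical names.

```python
import time
from typing import Callable


def isolate_when_idle(
    is_idle: Callable[[], bool],
    migrate_and_isolate: Callable[[], None],
    interval_s: float = 1.0,
    max_checks: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check idleness once per interval; act only in an idle interval.

    Returns True if migration and isolation were performed, False if the
    system never appeared idle within max_checks intervals.
    """
    for _ in range(max_checks):
        if is_idle():
            migrate_and_isolate()
            return True
        sleep(interval_s)  # busy: wait for the next time interval
    return False


# Usage: the system is busy for two intervals, then idle.
states = iter([False, False, True])
calls = []
done = isolate_when_idle(
    lambda: next(states),
    lambda: calls.append("migrated"),
    sleep=lambda _s: None,  # skip real sleeping in this demonstration
)
print(done, calls)
```

Passing `sleep` as a parameter keeps the sketch testable without waiting for real time to elapse.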
For example, Figure 2 is a flow chart of a memory error handling method provided in an embodiment of the present application. The method can be executed by a processor or by a computing device/computer system that includes a processor; more specifically, the processor, or the computing device/computer system that includes it, can execute computer programs/instructions to implement the method steps shown in Figure 2. The computing device/computer system may include, but is not limited to, a server, a switch, a router, a base station controller, a terminal, or a computing accelerator card. The server may typically be an integrated (all-in-one) machine, or it may adopt a layered cloud architecture implemented on the basis of a baseboard management controller (BMC). As shown in Figure 2, the method may include, but is not limited to, some or all of the following steps S200 to S210.
Step S200: obtain memory error information of the computer system.
When an error occurs in the memory of the computer system, the BIOS of the computer system can, for example, obtain the corresponding memory error information through the memory controller of the processor. As shown in Figure 3, when the computer system is a server using the layered cloud architecture, the BIOS of the computer system may also send the memory error information to the BMC of the computer system. As shown in Figure 4, when the computer system is not a server using the layered cloud architecture, the BIOS of the computer system may instead send the memory error information to a system management unit of the computer system. The system management unit may be the operating system (OS) deployed in the computer system, or more specifically a functional module of that OS (for example, a fault analysis module); alternatively, the system management unit may be firmware in the computer system other than the deployed OS.
Step S202: determine, from the memory error information, the target memory area of the computer system's memory in which the CE occurred and the CE mode of that CE.
When the computer system includes a BMC, the BMC can, for example, determine from the memory error information the target memory area in which the CE occurred and the CE mode of that CE. When the computer system does not include a BMC, the system management unit of the computer system can, for example, make this determination instead. Specifically, feature analysis can be performed on the memory error information to determine whether the CEs that occurred in the target memory area match a given CE mode; alternatively, machine learning can be used to analyze the memory error information together with other data related to the operating state of the memory, so as to determine the CE mode more accurately. CE modes may include row CE, column CE, bank CE, chip CE, rank CE, and so on.
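The feature-analysis option in step S202 can be illustrated with a simple rule-based classifier. The source does not define how a CE mode is derived from the error records, so the rules below are assumptions: errors confined to one row are treated as a row CE, errors confined to one column as a column CE, and errors spread over several rows and columns of one bank as a bank CE. `CellError`, `classify_ce_pattern`, and the labels "cell CE" and "multi-bank" are hypothetical.

```python
from typing import Iterable, NamedTuple


class CellError(NamedTuple):
    """Hypothetical record of one corrected error inside a chip."""
    bank: int
    row: int
    column: int


def classify_ce_pattern(errors: Iterable[CellError]) -> str:
    """Classify the distinct error locations of one chip into a CE mode."""
    errs = set(errors)
    if not errs:
        raise ValueError("no error records to classify")
    banks = {e.bank for e in errs}
    rows = {e.row for e in errs}
    columns = {e.column for e in errs}
    if len(banks) > 1:
        return "multi-bank"  # chip- or rank-level analysis is out of scope here
    if len(errs) == 1:
        return "cell CE"     # a single location shows no row/column pattern
    if len(rows) == 1:
        return "row CE"      # one row, several columns
    if len(columns) == 1:
        return "column CE"   # one column, several rows
    return "bank CE"         # several rows and columns within one bank


print(classify_ce_pattern([CellError(3, 10, 1), CellError(3, 10, 7)]))
print(classify_ce_pattern([CellError(3, 10, 1), CellError(3, 42, 1)]))
print(classify_ce_pattern([CellError(3, 10, 1), CellError(3, 42, 7)]))
```

A machine-learning classifier, the other option the text mentions, would replace these fixed rules with a model trained on error records and memory operating data.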
步骤S204,根据CE模式确定是否需要对目标内存区域执行数据迁移和内存隔离。Step S204: Determine whether data migration and memory isolation need to be performed on the target memory area according to the CE mode.
当计算机系统包括BMC时,例如可以由该计算机系统的BMC实现根据步骤S202确定的CE模式确定是否需要对目标内存区域执行数据迁移和内存隔离。当计算机系统不包括BMC时,例如可以由该计算机系统的系统管理单元实现根据步骤S202确定的CE模式确定是否需要对目标内存区域执行数据迁移和内存隔离。When the computer system includes a BMC, for example, the BMC of the computer system may determine whether data migration and memory isolation need to be performed on the target memory area according to the CE mode determined in step S202. When the computer system does not include a BMC, for example, the system management unit of the computer system may determine whether data migration and memory isolation need to be performed on the target memory area according to the CE mode determined in step S202.
在一种可能的实施方式中,当步骤S202中确定的CE模式属于预先配置的若干目标CE模式时,步骤S204中可以确定需要对目标内存区域执行数据迁移和内存隔离;反之,当步骤S202中确定的CE模式不属于预先配置的若干目标CE模式时,步骤S204中可以确定无需对目标内存区域执行数据迁移和内存隔离。In a possible implementation, when the CE mode determined in step S202 belongs to several pre-configured target CE modes, it can be determined in step S204 that data migration and memory isolation need to be performed on the target memory area; conversely, when the CE mode determined in step S202 does not belong to several pre-configured target CE modes, it can be determined in step S204 that data migration and memory isolation do not need to be performed on the target memory area.
在一种可能的实施方式中,当步骤S202中确定的CE模式属于预先配置的若干目标CE模式时,步骤S204中可以将目标内存区域发生的属于若干目标CE模式的CE的频次加1,如果执行加1操作后的频次达到预设阈值,则确定需要对目标内存区域执行数据迁移和内存隔离;反之,如果执行加1操作后的频次并未达到预设阈值,则确定无需对目标内存区域执行数据迁移和内存隔离。In a possible implementation, when the CE mode determined in step S202 belongs to several preconfigured target CE modes, in step S204, the frequency of CEs belonging to several target CE modes occurring in the target memory area may be increased by 1, if If the frequency of adding 1 operations reaches the preset threshold, it is determined that data migration and memory isolation need to be performed on the target memory area; conversely, if the frequency of adding 1 operations does not reach the preset threshold, it is determined that there is no need to perform data migration and memory isolation on the target memory area. Perform data migration and memory isolation.
The aforementioned target CE modes may include, but are not limited to, row CE, column CE, and bank CE.
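The two variants of step S204 described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class name `CeTracker`, the string labels for CE modes, and the region identifiers are hypothetical, and a threshold of 1 is used here to reproduce the first variant (isolate as soon as a target CE mode is seen).

```python
from collections import defaultdict

# Assumed labels for the preconfigured target CE modes (row CE, column CE, bank CE).
TARGET_CE_MODES = {"row", "column", "bank"}


class CeTracker:
    """Counts target-mode CEs per memory region (hypothetical helper)."""

    def __init__(self, threshold=1):
        # threshold=1 behaves like the first variant: any target-mode CE triggers isolation.
        self.threshold = threshold
        self.counts = defaultdict(int)

    def needs_isolation(self, region, ce_mode):
        """Return True if migration and isolation are needed for `region`."""
        if ce_mode not in TARGET_CE_MODES:
            return False  # non-target CE modes never trigger isolation
        self.counts[region] += 1  # the "increment by 1" operation of step S204
        return self.counts[region] >= self.threshold
```

With `CeTracker(threshold=3)`, the third target-mode CE reported for the same region is the first one for which `needs_isolation` returns `True`.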
When step S204 determines that data migration and memory isolation need to be performed on the target memory area, the method proceeds to the following step S206: obtaining several performance indicators of the computer system within the current time interval.
Step S208: determine, according to the performance indicators, whether the computer system is in an idle state.
Step S208 may be implemented by the system management unit of the computer system.
The aforementioned performance indicators may include, but are not limited to, any one or more of the following: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located on the same NUMA node as the target memory area. Here, memory bandwidth is the product of the bus width, the bus frequency, and the number of data packets exchanged per clock cycle; forwarding bandwidth is the amount of data that can be transmitted on a line per unit time, measured in bps (bits per second); and storage bandwidth is the amount of data accessed by the storage device per unit time, that is, the number of bits or bytes read from or written to the storage device per unit time.
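The memory bandwidth definition above can be checked with a short calculation. The function name and the example figures below are assumptions chosen for illustration (a 64-bit bus clocked at 1600 MHz with two transfers per clock cycle); they are not values taken from this application.

```python
def memory_bandwidth_bits_per_s(bus_width_bits, bus_frequency_hz, transfers_per_clock):
    """Memory bandwidth as bus width x bus frequency x transfers per clock cycle."""
    return bus_width_bits * bus_frequency_hz * transfers_per_clock


# Hypothetical example: 64-bit bus, 1600 MHz clock, 2 transfers per clock cycle.
bits_per_s = memory_bandwidth_bits_per_s(64, 1_600_000_000, 2)
gigabytes_per_s = bits_per_s / 8 / 1e9  # convert bits/s to GB/s
print(f"{gigabytes_per_s:.1f} GB/s")
```

For these assumed figures the product is 204.8 Gbit/s, or 25.6 GB/s.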
In a possible implementation, when the computer system is running in user mode and the busy virtual machine that depends on the computer system is located on a different NUMA node from the target memory area, a business score may further be determined, based on preconfigured business rules, for each of the remaining performance indicators within the current time interval. The business scores are then weighted and summed to obtain a total score, and whether the computer system is in an idle state is determined based on the total score.
In a possible implementation, when the computer system is running in user mode, the busy virtual machine that depends on the computer system is located on a different NUMA node from the target memory area, and the processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and other performance indicators are each less than their corresponding preset reference values, it is determined that the computer system is in an idle state.
It should be noted that the computer system may not contain any virtual machine in a busy state. In that case, the performance indicators obtained for the current time interval may not include whether a busy virtual machine that depends on the computer system is located on the same NUMA node as the target memory area.
In summary, when the computer system is in an idle state, it should be running in user mode, and any busy virtual machine that depends on the computer system should be located on a different NUMA node from the target memory area. In addition, the processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and other indicators should have relatively small values. This ensures that the computer system has sufficient resources to perform data migration and memory isolation on the target memory area, so that doing so does not impair the efficient execution of the other services the computer system needs to run.
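The two idle-state checks described above (weighted total score, and every indicator below its reference value) can be sketched as follows. All names, weights, and thresholds are hypothetical. The text does not say in which direction the total score is compared, so the sketch assumes that a higher score means a heavier load and that the system is idle when the total does not exceed a threshold. A missing busy-VM indicator (no busy virtual machine exists) is represented as `None` and treated as "not on the same NUMA node".

```python
def passes_preconditions(user_mode, busy_vm_on_same_numa):
    """Common gate: user mode, and no busy VM on the target region's NUMA node."""
    return bool(user_mode) and not busy_vm_on_same_numa


def is_idle_by_score(user_mode, busy_vm_on_same_numa, scores, weights, idle_threshold):
    """Variant 1: weighted sum of per-indicator business scores.

    Assumption: higher score = busier; idle when the total is <= idle_threshold.
    """
    if not passes_preconditions(user_mode, busy_vm_on_same_numa):
        return False
    total = sum(weights[name] * scores[name] for name in scores)
    return total <= idle_threshold


def is_idle_by_reference(user_mode, busy_vm_on_same_numa, metrics, reference):
    """Variant 2: every indicator must be below its preset reference value."""
    if not passes_preconditions(user_mode, busy_vm_on_same_numa):
        return False
    return all(metrics[name] < reference[name] for name in reference)
```

In both variants the two boolean indicators act as hard preconditions, and only the remaining numeric indicators are scored or compared.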
When step S208 determines, based on the performance indicators of the computer system within the current time interval, that the computer system is not in an idle state, steps S206 and S208 may be repeated periodically at the corresponding time interval until the computer system is determined to be in an idle state, at which point the following step S210 is performed.
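The periodic repetition of steps S206 and S208, followed by step S210 once the system is idle, amounts to a polling loop. The sketch below is one way to write it; the function name and the three callbacks are hypothetical placeholders for whatever the system management unit actually uses, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time


def isolate_when_idle(get_metrics, is_idle, migrate_and_isolate,
                      interval_s=1.0, sleep=time.sleep):
    """Poll until the system is idle, then migrate and isolate once."""
    while True:
        metrics = get_metrics()        # step S206: sample the current interval
        if is_idle(metrics):           # step S208: idle-state decision
            migrate_and_isolate()      # step S210: data migration and memory isolation
            return
        sleep(interval_s)              # wait for the next time interval
```

Deferring the work this way trades some delay in isolating the faulty region for less interference with running services, which is the motivation stated in the text.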
Step S210: perform data migration and memory isolation on the target memory area.
For example, the system management unit of the computer system may, through the BIOS of the computer system, trigger the processor of the computer system to perform data migration and memory isolation on the target memory area. As described above, ADDDC technology may be used to perform data migration and memory isolation on the target memory area. Other technologies may also be used, such as adaptive double device data correction-multiple region (ADDDC-MR), adaptive data correction-single region (ADC-SR), and adaptive double device error correction (ADDEC).
Based on the same concept as the foregoing method embodiments, an embodiment of the present application further provides a memory error processing apparatus, which is deployed in a computer system that includes a memory. As shown in FIG. 5, the memory error processing apparatus 50 includes: an indicator acquisition module 501, configured to obtain several performance indicators of the computer system within the current time interval when data migration and memory isolation need to be performed on a target memory area of the memory in which a CE has occurred; a state judgment module 503, configured to determine, according to the performance indicators, whether the computer system is in an idle state, and to trigger an isolation processing module when the computer system is in an idle state; and the isolation processing module 505, configured to perform data migration and memory isolation on the target memory area when triggered by the state judgment module.
In a possible implementation, the performance indicators include any one or more of the following: whether the computer system is running in user mode, processor occupancy, memory bandwidth, forwarding bandwidth, storage bandwidth, and whether a virtual machine that depends on the computer system and is in a busy state is located on the same non-uniform memory access (NUMA) node as the target memory area.
In a possible implementation, the apparatus further includes: an information acquisition module 507, configured to obtain memory error information of the computer system; and a fault analysis module 509, configured to determine, according to the memory error information, the target memory area of the memory in which a CE has occurred and the CE mode, and to determine, according to the CE mode, whether data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the fault analysis module 509 is configured to determine, when the CE mode is one of several preconfigured target CE modes, that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the fault analysis module 509 is configured to: when the CE mode is one of several preconfigured target CE modes, increment by 1 the frequency of CEs that occur in the target memory area and belong to the target CE modes; and when the incremented frequency reaches a preset threshold, determine that data migration and memory isolation need to be performed on the target memory area.
In a possible implementation, the target CE modes include at least one of the following CE modes: row CE, column CE, and bank CE.
The memory error processing apparatus 50 according to the embodiments of the present application may correspondingly perform the method described in the embodiments of the present application. The foregoing operations and the other operations and/or functions performed by the modules of the memory error processing apparatus 50 respectively implement the corresponding procedures of the method in FIG. 2; for brevity, they are not described again here.
The indicator acquisition module 501, the state judgment module 503, the isolation processing module 505, the information acquisition module 507, and the fault analysis module 509 included in the memory error processing apparatus 50 according to the embodiments of the present application may be implemented in software or in hardware. The implementation of the indicator acquisition module 501 is described next as an example. The state judgment module 503, the isolation processing module 505, the information acquisition module 507, and the fault analysis module 509 may be implemented in a similar way.
As an example of a software functional module, the indicator acquisition module 501 may include code running on a computing instance. The computing instance may be one of a physical host (computing device), a virtual machine, or a container.
As an example of a hardware functional module, the indicator acquisition module 501 may be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Based on the same concept as the foregoing method embodiments, an embodiment of the present application further provides a computing device and a computer system. The computing device or computer system includes at least a processor and a memory, the memory stores a program, and when the processor executes the program, the units or modules that perform the steps of the method shown in FIG. 2 can be implemented.
FIG. 6 is a schematic structural diagram of a computing device provided in an embodiment of the present application.
As shown in FIG. 6, the computing device 600 includes at least one processor 601, a memory 602, and a communication interface 603. The processor 601, the memory 602, and the communication interface 603 are communicatively connected, either by wired means (for example, a bus) or by wireless means. The communication interface 603 is used to receive data (for example, write data) sent by other devices. The memory 602 stores computer instructions, and the processor 601 executes the computer instructions to perform the method in the foregoing method embodiments.
It should be understood that, in the embodiments of the present application, the processor 601 may include a central processing unit (CPU), and may also include other general-purpose processors, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, and the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 602 may include read-only memory and random access memory, and provides instructions and data to the processor 601. The memory 602 may also include non-volatile random access memory.
The memory 602 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct Rambus random access memory (direct Rambus RAM, DR RAM).
It should be understood that the computing device 600 according to the embodiments of the present application can perform the method shown in FIG. 2. For a detailed description of the method, see above; for brevity, it is not repeated here.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the method described above is implemented.
An embodiment of the present application provides a chip. The chip includes at least one processor and an interface. The at least one processor determines program instructions or data through the interface, and is configured to execute the program instructions to implement the method described above.
An embodiment of the present application provides a computer program or computer program product that includes instructions. When the instructions are executed, they cause a computer to perform the method described above.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or in software depends on the specific application and design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of both. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present application in detail. It should be understood that the foregoing are merely specific embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.
Claims (15)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211172016.1A CN117806855A (en) | 2022-09-26 | 2022-09-26 | Memory error processing method and device |
PCT/CN2023/101096 WO2024066500A1 (en) | 2022-09-26 | 2023-06-19 | Memory error processing method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211172016.1A CN117806855A (en) | 2022-09-26 | 2022-09-26 | Memory error processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117806855A true CN117806855A (en) | 2024-04-02 |
Family
ID=90418696
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211172016.1A Pending CN117806855A (en) | 2022-09-26 | 2022-09-26 | Memory error processing method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117806855A (en) |
WO (1) | WO2024066500A1 (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4734003B2 (en) * | 2005-03-17 | 2011-07-27 | 富士通株式会社 | Soft error correction method, memory control device, and memory system |
CN104077375B (en) * | 2014-06-24 | 2017-09-12 | 华为技术有限公司 | The processing method and node of a kind of wrong catalogue of CC NUMA systems interior joint |
US9812222B2 (en) * | 2015-04-20 | 2017-11-07 | Qualcomm Incorporated | Method and apparatus for in-system management and repair of semi-conductor memory failure |
CN112231128B (en) * | 2020-09-11 | 2024-06-21 | 中科可控信息产业有限公司 | Memory error processing method, device, computer equipment and storage medium |
CN113868001B (en) * | 2021-09-10 | 2023-08-08 | 苏州浪潮智能科技有限公司 | Method, system and computer storage medium for checking memory repair results |
CN115016963A (en) * | 2022-05-06 | 2022-09-06 | 阿里巴巴(中国)有限公司 | Memory page isolation method, memory monitoring system and computer readable storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118245290A (en) * | 2024-05-24 | 2024-06-25 | 浪潮云信息技术股份公司 | System and method for rapidly detecting unrecoverable errors in operating system memory |
CN118245290B (en) * | 2024-05-24 | 2024-08-13 | 浪潮云信息技术股份公司 | System and method for rapidly detecting unrecoverable errors in operating system memory |
Also Published As
Publication number | Publication date |
---|---|
WO2024066500A1 (en) | 2024-04-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||