
CN112596916A - Dual-core lock step error recovery system and method - Google Patents


Info

Publication number
CN112596916A
Authority
CN
China
Prior art keywords
core
controller
lockstep
snapshot
storage unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110232537.0A
Other languages
Chinese (zh)
Inventor
樊崇斌
魏斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lichi Semiconductor Co ltd
Original Assignee
Shanghai Lichi Semiconductor Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Lichi Semiconductor Co ltd
Priority to CN202110232537.0A
Publication of CN112596916A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/524 Deadlock detection or avoidance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/142 Reconfiguring to eliminate the error

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a dual-core lockstep error recovery system and method, relating to the technical field of CPU architecture, which solve the technical problem that a system cannot be quickly restored to a safe state when a lockstep error occurs under a dual-core lockstep mechanism. The key point of the technical solution is that a controller generates corresponding snapshots according to the operating states of a first core and a second core and stores the snapshots in a storage unit. When a lockstep error occurs between the first core and the second core, the controller extracts a snapshot from the storage unit and provides it to the first core and the second core, so that both cores can quickly recover to a safe state without relying on other programs. The recovery process is simple and the recovery time is short.

Figure 202110232537

Description

Dual-core lockstep error recovery system and method

Technical Field

The present disclosure relates to the technical field of CPU architecture, and in particular to a dual-core lockstep error recovery system and method.

Background

In some fields, such as the automotive and civil aviation industries, components are subject to high functional safety requirements. The CPU is the "brain" of these components, and its functional safety characteristics play an important role in whether those requirements can be met. How to handle an error quickly and reliably when a functional failure occurs, and then reliably resume program execution, is an important topic in functional safety research.

Dual-core lockstep is a common method of enhancing the functional safety of a chip. Two cores in dual-core lockstep read the same data and execute the same instructions, while a detection unit monitors the state of both cores in real time. If the states of the two cores are inconsistent (referred to here as a lockstep error, which usually means that at least one core has experienced an error), a warning is triggered, for example by raising an interrupt or asserting a signal. The lockstep mechanism greatly enhances the functional safety of the CPU.
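The comparison performed by the detection unit can be illustrated with a minimal software sketch. The `CoreState` fields, the per-cycle trace representation and the function names are assumptions made for illustration only; the patent does not specify which signals the detection unit compares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreState:
    """Hypothetical per-cycle state of one core (fields are illustrative)."""
    pc: int
    registers: tuple


def lockstep_error(first: CoreState, second: CoreState) -> bool:
    """Return True when the states of the two cores are inconsistent."""
    return first != second


def monitor(trace_a, trace_b):
    """Compare two per-cycle state traces.

    Returns the index of the first cycle in which the cores diverge,
    or None if they stay consistent for the whole trace.
    """
    for cycle, (a, b) in enumerate(zip(trace_a, trace_b)):
        if lockstep_error(a, b):
            return cycle
    return None
```

In real hardware the comparison happens every clock cycle on selected core outputs; the trace-based loop here only models that behavior.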

In addition to enhancing functional safety, and to further improve system availability, the system should try to restore normal program operation when a lockstep error occurs. Recovery from a lockstep error is generally performed in one of the following ways: (1) software handles the error in the lockstep error interrupt handler; (2) the alarm signal directly triggers a reset circuit that restarts the entire system.

Currently known recovery mechanisms for CPU lockstep errors have one of two drawbacks. Some depend heavily on software, for example by assuming that the CPU can reliably execute the interrupt handler that handles the lockstep error. Others require a long recovery time, for example when the lockstep error signal directly triggers a restart of the entire system; in that case a large amount of program code must be re-executed before execution is finally restored.

Summary of the Invention

The present disclosure provides a dual-core lockstep error recovery system and method whose technical purpose is to restore the system to a safe state quickly when a lockstep error occurs under the dual-core lockstep mechanism.

The above technical purpose of the present disclosure is achieved through the following technical solution:

A dual-core lockstep error recovery system comprises a processor, the processor comprising:

a first core;

a second core, the second core and the first core being in dual-core lockstep mode;

a first monitoring unit, which monitors the first core and the second core to determine whether a lockstep error occurs and, if a lockstep error occurs, triggers a first lockstep error signal and sends the first lockstep error signal to the first core, the second core and a controller;

wherein the first core and the second core, upon receiving the first lockstep error signal, stop the currently running instruction and extract a snapshot from a storage unit through the controller for error recovery;

the controller, which, when it has not received the first lockstep error signal, generates a corresponding snapshot according to the operating states of the first core and the second core and stores the snapshot in the storage unit, and which, when it receives the first lockstep error signal, extracts the snapshot from the storage unit and provides it to the first core and the second core for error recovery; and

the storage unit, which stores the snapshot.
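The division of labor among the cores, the controller and the storage unit can be sketched in software as follows. This is a behavioral model under stated assumptions, not the patented circuit: the `Core` and `Controller` classes, the dictionary used as core state and the `tick` method are all hypothetical names, and a snapshot is modeled as a pair of deep copies, one per core.

```python
import copy


class Core:
    """Hypothetical core whose whole state is a small dictionary."""

    def __init__(self):
        self.state = {"pc": 0, "regs": [0, 0]}


class Controller:
    """Behavioral model of the controller described above."""

    def __init__(self, first, second):
        self.cores = (first, second)
        self.storage = []  # stands in for the storage unit

    def tick(self, lockstep_error_signal: bool):
        if not lockstep_error_signal:
            # No error signal: generate a snapshot of both cores and store it.
            snapshot = tuple(copy.deepcopy(c.state) for c in self.cores)
            self.storage.append(snapshot)
        elif self.storage:
            # Error signal: extract the snapshot and hand it back to the cores.
            snapshot = self.storage.pop()
            for core, saved in zip(self.cores, snapshot):
                core.state = copy.deepcopy(saved)
```

Because the snapshot was taken while the cores still agreed, restoring it brings both cores back to the same known-good state.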

Further, the system also includes a second monitoring unit, which monitors the controller for errors and, if an error occurs, triggers an error signal and sends the error signal to the processor or to an external system.

Further, the controller includes a first controller and a second controller, the first controller and the second controller being in lockstep mode;

the second monitoring unit is further configured to monitor the first controller and the second controller to determine whether a lockstep error occurs between them and, if a lockstep error occurs, to trigger a second lockstep error signal and send the second lockstep error signal to the processor or an external system.

Further, the storage unit includes a first storage unit and a second storage unit, each of which is a stack structure;

the first controller is configured to generate a first snapshot of the first core and store the first snapshot in the first storage unit; the second controller is configured to generate a second snapshot of the second core and store the second snapshot in the second storage unit.

Further, when the first monitoring unit detects a lockstep error between the first core and the second core, the first controller and the second controller generate no further snapshots within a preset time threshold, and generate snapshots again only after the preset time threshold has elapsed.

Further, the controller is also configured to:

monitor the popping of snapshots from the storage unit and, if the number of times a given snapshot is popped consecutively exceeds a preset threshold, notify the processor or an external system; and

monitor the number of snapshots in the storage unit and, if the storage unit is empty, notify the processor or an external system.

A method for error recovery using any of the dual-core lockstep error recovery systems described above comprises:

generating corresponding snapshots according to the operating states of the first core and the second core, and storing the snapshots;

monitoring the first core and the second core to determine whether a lockstep error occurs; and

when a lockstep error is detected between the first core and the second core, providing the snapshot to the first core and the second core for error recovery.

The beneficial effects of the present disclosure are as follows. In the dual-core lockstep error recovery system and method described in the present disclosure, the controller generates corresponding snapshots according to the operating states of the first core and the second core and stores the snapshots in the storage unit. When a lockstep error occurs between the first core and the second core, the controller extracts a snapshot from the storage unit and provides it to the first core and the second core, so that both cores can quickly recover to a safe state without relying on other programs. The recovery process is simple and the recovery time is short.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the dual-core lockstep error recovery system according to the present invention;

FIG. 2 is a schematic diagram of Embodiment 1 of the dual-core lockstep error recovery system according to the present invention;

FIG. 3 is a schematic diagram of Embodiment 2 of the dual-core lockstep error recovery system according to the present invention;

FIG. 4 is a flowchart of the dual-core lockstep error recovery method according to the present invention.

Detailed Description

The technical solutions of the present disclosure are described in detail below with reference to the accompanying drawings. In the description of the present disclosure, the terms "first" and "second" are used only to distinguish different components. They are not to be understood as indicating or implying relative importance, or as implicitly specifying the number of the technical features referred to.

FIG. 1 is a schematic diagram of the dual-core lockstep error recovery system according to the present invention. As shown in FIG. 1, the system includes a processor, which in turn includes a first core, a second core, a first monitoring unit, a controller and a storage unit. The first core and the second core are in dual-core lockstep mode. The first monitoring unit monitors the first core and the second core to determine whether a lockstep error occurs; if one occurs, it triggers a first lockstep error signal and sends it to the first core, the second core and the controller. On receiving the first lockstep error signal, the first core and the second core stop the currently running instruction and extract a snapshot from the storage unit through the controller for error recovery.

When the controller has not received the first lockstep error signal, it generates a corresponding snapshot according to the operating states of the first core and the second core and stores the snapshot in the storage unit. When the controller receives the first lockstep error signal, it extracts the snapshot from the storage unit and provides it to the first core and the second core for error recovery. The storage unit is a stack structure used to store snapshots.

As a specific embodiment, the controller generates snapshots of the first core and the second core and pushes them onto the snapshot stack (that is, the storage unit). When the number of snapshots exceeds the maximum the stack can hold, the snapshot at the bottom of the stack is replaced by the snapshot above it.
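The overflow behavior described here, in which the oldest snapshot at the bottom of the stack is lost once capacity is exceeded, can be sketched with a bounded double-ended queue. The `SnapshotStack` class name and the integer capacity parameter are assumptions for illustration; the patent does not state a stack depth.

```python
from collections import deque


class SnapshotStack:
    """Bounded stack: pushing onto a full stack discards the bottom entry."""

    def __init__(self, capacity: int):
        # A deque with maxlen drops items from the opposite end on append,
        # which models the bottom snapshot being overwritten.
        self._items = deque(maxlen=capacity)

    def push(self, snapshot):
        self._items.append(snapshot)

    def pop(self):
        """Remove and return the snapshot at the top of the stack."""
        return self._items.pop()

    def __len__(self):
        return len(self._items)
```

With a capacity of 2, pushing three snapshots leaves only the two most recent; the first one can no longer be restored.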

For popping (extracting one snapshot from the top of the stack, with the remaining snapshots each moving one position toward the top), the controller should provide a pop interface to both cores and ensure that the pop takes place at a suitable time, for example after the current transfer on the bus has completed.

Specifically, snapshot push and pop can be tied to function calls. A snapshot is pushed at function entry after the necessary initialization has completed. At function exit, the snapshot is popped or invalidated, since reaching the exit shows that the function executed without a lockstep error. Software can thus set multiple snapshots, which makes it convenient to protect function calls, especially nested calls, and to recover them from faults.
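Tying snapshots to function entry and exit can be sketched with a decorator. This is a software analogy only: the `protected` decorator, the module-level `snapshot_stack` list and the example functions are hypothetical, and the "snapshot" here is just the function name and arguments rather than real core state.

```python
import functools

snapshot_stack = []  # stands in for the hardware snapshot stack
depth_seen = []      # records stack depth inside each call, for illustration


def protected(func):
    """Push a snapshot at function entry and pop it at function exit."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        snapshot_stack.append((func.__name__, args))  # push at entry
        result = func(*args, **kwargs)
        # Reaching this point models reaching the function exit without a
        # lockstep error, so the snapshot is no longer needed.
        snapshot_stack.pop()
        return result

    return wrapper


@protected
def inner(x):
    depth_seen.append(len(snapshot_stack))
    return x + 1


@protected
def outer(x):
    depth_seen.append(len(snapshot_stack))
    return inner(x) * 2
```

Calling `outer(1)` pushes one snapshot for `outer` and a second for the nested `inner`, so a fault inside `inner` would only need to roll back to the start of `inner`.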

FIG. 2 is a schematic diagram of Embodiment 1 of the dual-core lockstep error recovery system according to the present invention. In this embodiment, the system further includes a second monitoring unit, which monitors the controller for errors and, if an error occurs, triggers an error signal and sends it to the processor or an external system.

In this application, an external system refers to a control system outside the processor. If the controller fails, the processor itself or another external controller can reset the processor's entire circuit, initiate a system restart, and so on.

FIG. 3 is a schematic diagram of Embodiment 2 of the dual-core lockstep error recovery system according to the present invention. In this embodiment, the controller includes a first controller and a second controller, which are also in lockstep mode. The storage unit likewise includes a first storage unit and a second storage unit, each of which is a stack structure. The first controller generates a first snapshot of the first core and stores it in the first storage unit; the second controller generates a second snapshot of the second core and stores it in the second storage unit.

When a lockstep error occurs between the first core and the second core, the first core extracts the first snapshot from the first storage unit through the first controller for error recovery, and at the same time the second core extracts the second snapshot from the second storage unit through the second controller for error recovery. The two controllers provide push and pop interfaces to their respective cores.

As a specific embodiment, when the first monitoring unit detects a lockstep error between the first core and the second core, the first controller and the second controller generate no further snapshots within a preset time threshold, and generate snapshots again only after the preset time threshold has elapsed.

There is a delay between the occurrence of a lockstep error in the first core and the second core and the moment the first monitoring unit detects it and triggers the first lockstep error signal. To avoid pushing a snapshot during that interval, the push should be delayed by a certain number of clock cycles and performed only after confirming that no lockstep error occurred within those cycles.
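The delayed push can be modeled as a pending queue that commits a snapshot only after a fixed number of error-free cycles. The `DelayedPusher` class, the `delay_cycles` parameter and the choice to discard all pending snapshots on an error are assumptions made for this sketch; the patent only states that the push is delayed and conditioned on the absence of lockstep errors.

```python
class DelayedPusher:
    """Commit a snapshot to the stack only after error-free delay cycles."""

    def __init__(self, delay_cycles: int):
        self.delay_cycles = delay_cycles
        self.pending = []  # entries of [snapshot, cycles_remaining]
        self.stack = []    # committed snapshots

    def request_push(self, snapshot):
        self.pending.append([snapshot, self.delay_cycles])

    def tick(self, lockstep_error: bool):
        """Advance one clock cycle."""
        if lockstep_error:
            # Pending snapshots may have been captured after the fault
            # occurred but before it was detected, so none are committed.
            self.pending.clear()
            return
        still_waiting = []
        for entry in self.pending:
            entry[1] -= 1
            if entry[1] <= 0:
                self.stack.append(entry[0])
            else:
                still_waiting.append(entry)
        self.pending = still_waiting
```

The delay would in practice be chosen to be at least the detection latency of the monitoring unit, so that a committed snapshot is known to predate any undetected fault.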

As a specific embodiment, the controller also monitors the popping of snapshots from the storage unit. If the number of times a given snapshot is popped consecutively exceeds a preset threshold, the controller notifies the processor or an external system so that it can handle the situation. In addition, the controller monitors the number of snapshots in the storage unit and, if the storage unit is empty, notifies the processor or an external system.
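One possible reading of this monitoring is sketched below: recovery keeps restoring the same snapshot (the re-executed code pushes it again and then fails again), which suggests a persistent rather than transient fault. The `PopMonitor` class, the use of snapshot identifiers, the notification strings and the decision to report an empty stack when a recovery pop finds nothing are all assumptions for illustration.

```python
class PopMonitor:
    """Track consecutive pops of the same snapshot and empty-stack events."""

    def __init__(self, max_consecutive_pops: int):
        self.max_consecutive_pops = max_consecutive_pops
        self.stack = []
        self.last_id = None
        self.count = 0
        self.notifications = []  # stands in for signals to the processor

    def push(self, snapshot_id):
        self.stack.append(snapshot_id)

    def pop_for_recovery(self):
        if not self.stack:
            # Nothing to restore from: escalate.
            self.notifications.append("empty")
            return None
        snapshot_id = self.stack.pop()
        if snapshot_id == self.last_id:
            self.count += 1
        else:
            self.last_id, self.count = snapshot_id, 1
        if self.count > self.max_consecutive_pops:
            # The same snapshot keeps being restored: escalate.
            self.notifications.append(f"repeated:{snapshot_id}")
        return snapshot_id
```

With a threshold of 2, the third consecutive recovery from the same snapshot produces a notification, at which point a stronger measure such as a reset could be taken by the processor or external system.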

FIG. 4 is a flowchart of the dual-core lockstep error recovery method according to the present invention. As shown in FIG. 4, the method includes the following steps. Step S100: generate corresponding snapshots according to the operating states of the first core and the second core, and store the snapshots.

Step S101: monitor the first core and the second core to determine whether a lockstep error occurs.

Step S102: when a lockstep error is detected between the first core and the second core, provide the snapshot to the first core and the second core for error recovery.

In step S102, when the first monitoring unit detects a lockstep error between the first core and the second core, it triggers the first lockstep error signal and sends it to the first core and the second core. On receiving the first lockstep error signal, the first core and the second core stop the currently running instruction and pop a snapshot from the snapshot stack. They then use that snapshot to restore the execution context of both cores in a safe manner (for example, only after bus accesses initiated by the processor have completed) and resume operation.
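Steps S100 to S102 can be put together in a small simulation. This is a toy model under stated assumptions: both "cores" run the same accumulation loop, a single transient fault is injected into the second core by flipping a bit, and the function name `run_with_recovery` and its parameters are hypothetical. Bus-transfer completion and signal timing are not modeled.

```python
def run_with_recovery(n_steps: int, fault_at=None):
    """Simulate two lockstep cores with snapshot-based recovery.

    Each core adds its step counter to an accumulator on every step.
    Returns (final accumulator value, number of recoveries performed).
    """
    a = {"step": 0, "acc": 0}
    b = {"step": 0, "acc": 0}
    snapshots = []
    recoveries = 0
    while a["step"] < n_steps:
        # S100: the cores agree at this point, so capture a snapshot.
        snapshots.append(dict(a))
        # Both cores execute the same instruction.
        for core in (a, b):
            core["acc"] += core["step"]
            core["step"] += 1
        # Inject one transient fault into the second core, if requested.
        if fault_at is not None and a["step"] == fault_at:
            b["acc"] ^= 0x10
            fault_at = None
        # S101: monitor the cores for a lockstep error.
        if a != b:
            # S102: pop the snapshot and restore both cores from it.
            saved = snapshots.pop()
            a, b = dict(saved), dict(saved)
            recoveries += 1
    return a["acc"], recoveries
```

The run with an injected fault produces the same final result as the fault-free run, at the cost of re-executing one step, which is the behavior the method aims for.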

The above are exemplary embodiments of the present disclosure. The scope of protection of the present disclosure is defined by the claims and their equivalents.

Claims (7)

1. A dual-core lockstep error recovery system, characterized in that it comprises a processor, the processor comprising:
a first core;
a second core, the second core and the first core being in dual-core lockstep mode;
a first monitoring unit, which monitors the first core and the second core to determine whether a lockstep error occurs and, if a lockstep error occurs, triggers a first lockstep error signal and sends the first lockstep error signal to the first core, the second core and a controller;
wherein the first core and the second core, upon receiving the first lockstep error signal, stop the currently running instruction and extract a snapshot from a storage unit through the controller for error recovery;
the controller, which, when it has not received the first lockstep error signal, generates a corresponding snapshot according to the operating states of the first core and the second core and stores the snapshot in the storage unit, and which, when it receives the first lockstep error signal, extracts the snapshot from the storage unit and provides it to the first core and the second core for error recovery; and
the storage unit, which stores the snapshot.

2. The dual-core lockstep error recovery system according to claim 1, characterized in that the system further comprises a second monitoring unit, which monitors the controller for errors and, if an error occurs, triggers an error signal and sends the error signal to the processor or an external system.

3. The dual-core lockstep error recovery system according to claim 2, characterized in that the controller further comprises a first controller and a second controller, the first controller and the second controller being in lockstep mode;
the second monitoring unit is further configured to monitor the first controller and the second controller to determine whether a lockstep error occurs between them and, if a lockstep error occurs, to trigger a second lockstep error signal and send the second lockstep error signal to the processor or an external system.

4. The dual-core lockstep error recovery system according to claim 3, characterized in that the storage unit further comprises a first storage unit and a second storage unit, each of which is a stack structure;
the first controller is configured to generate a first snapshot of the first core and store the first snapshot in the first storage unit; the second controller is configured to generate a second snapshot of the second core and store the second snapshot in the second storage unit.

5. The dual-core lockstep error recovery system according to claim 4, characterized in that, when the first monitoring unit detects a lockstep error between the first core and the second core, the first controller and the second controller generate no further snapshots within a preset time threshold, and generate snapshots again only after the preset time threshold has elapsed.

6. The dual-core lockstep error recovery system according to claim 1, characterized in that the controller is further configured to:
monitor the popping of snapshots from the storage unit and, if the number of times a given snapshot is popped consecutively exceeds a preset threshold, notify the processor or an external system; and
monitor the number of snapshots in the storage unit and, if the storage unit is empty, notify the processor or an external system.

7. A method for error recovery using the dual-core lockstep error recovery system according to any one of claims 1 to 6, characterized in that it comprises:
generating corresponding snapshots according to the operating states of the first core and the second core, and storing the snapshots;
monitoring the first core and the second core to determine whether a lockstep error occurs; and
when a lockstep error is detected between the first core and the second core, providing the snapshot to the first core and the second core for error recovery.
CN202110232537.0A 2021-03-03 2021-03-03 Dual-core lock step error recovery system and method Pending CN112596916A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110232537.0A CN112596916A (en) 2021-03-03 2021-03-03 Dual-core lock step error recovery system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110232537.0A CN112596916A (en) 2021-03-03 2021-03-03 Dual-core lock step error recovery system and method

Publications (1)

Publication Number Publication Date
CN112596916A true CN112596916A (en) 2021-04-02

Family

ID=75210144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110232537.0A Pending CN112596916A (en) 2021-03-03 2021-03-03 Dual-core lock step error recovery system and method

Country Status (1)

Country Link
CN (1) CN112596916A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091380A (en) * 2021-11-18 2022-02-25 上海励驰半导体有限公司 An anomaly detection method and system based on lockstep design
CN116125847A (en) * 2022-09-23 2023-05-16 中国航空无线电电子研究所 An embedded system high-integrity clock circuit monitoring system and method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5692121A (en) * 1995-04-14 1997-11-25 International Business Machines Corporation Recovery unit for mirrored processors
CN107531250A (en) * 2015-04-20 2018-01-02 奥托立夫开发公司 Vehicle safety electronic control system
CN108052420A (en) * 2018-01-08 2018-05-18 哈尔滨工业大学 Zynq-7000-based dual-core ARM processor anti-single event upset protection method
US10078565B1 (en) * 2016-06-16 2018-09-18 Xilinx, Inc. Error recovery for redundant processing circuits
CN110140112A (en) * 2017-01-19 2019-08-16 高通股份有限公司 The periodical non-invasive diagnostic of lock-step system
US20190303260A1 (en) * 2018-03-29 2019-10-03 Arm Ltd. Device, system and process for redundant processor error detection
CN112015599A (en) * 2019-05-31 2020-12-01 华为技术有限公司 Method and apparatus for error recovery
CN112416609A (en) * 2021-01-22 2021-02-26 南京芯驰半导体科技有限公司 Mode configuration method and device of dual-core mode

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5692121A (en) * 1995-04-14 1997-11-25 International Business Machines Corporation Recovery unit for mirrored processors
CN107531250A (en) * 2015-04-20 2018-01-02 奥托立夫开发公司 Vehicle safety electronic control system
US10078565B1 (en) * 2016-06-16 2018-09-18 Xilinx, Inc. Error recovery for redundant processing circuits
CN110140112A (en) * 2017-01-19 2019-08-16 高通股份有限公司 The periodical non-invasive diagnostic of lock-step system
CN108052420A (en) * 2018-01-08 2018-05-18 哈尔滨工业大学 Zynq-7000-based dual-core ARM processor anti-single event upset protection method
US20190303260A1 (en) * 2018-03-29 2019-10-03 Arm Ltd. Device, system and process for redundant processor error detection
CN112041821A (en) * 2018-03-29 2020-12-04 Arm有限公司 Apparatus, system, and process for redundant processor error detection
CN112015599A (en) * 2019-05-31 2020-12-01 华为技术有限公司 Method and apparatus for error recovery
CN112416609A (en) * 2021-01-22 2021-02-26 南京芯驰半导体科技有限公司 Mode configuration method and device of dual-core mode

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CARLES HERNANDEZ 等: ""Timely Error Detection for Effective Recovery in Light-Lockstep Automotive Systems"", 《IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS》 *
CARLES HERNANDEZ等: ""Low-cost checkpointing in automotive safety-relevant systems"", 《2015 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE)》 *
孙越等: ""面向商用APSoC器件的双核锁步机制"", 《科技创新导报》 *
王真等: ""高性能处理器的差错校正技术"", 《计算机研究与发展》 *

Similar Documents

Publication Publication Date Title
US11068360B2 (en) Error recovery method and apparatus based on a lockup mechanism
US9823983B2 (en) Electronic fault detection unit
US7966528B2 (en) Watchdog mechanism with fault escalation
RU2520399C2 (en) Microcomputer and operation method thereof
CN101799776A (en) Fault processing method of multi-core processor, multi-core processor and communication device
US20150006978A1 (en) Processor system
CN112596916A (en) Dual-core lock step error recovery system and method
US6553512B1 (en) Method and apparatus for resolving CPU deadlocks
KR20160034939A (en) Robust hardware/software error recovery system
US7966527B2 (en) Watchdog mechanism with fault recovery
EP3321814B1 (en) Method and apparatus for handling outstanding interconnect transactions
CN115904793B (en) Memory transfer method, system and chip based on multi-core heterogeneous system
US9274909B2 (en) Method and apparatus for error management of an integrated circuit system
CN110832459B (en) Vehicle control device
JPH08287030A (en) Apparatus and method for automatic restart of multi-system computer system
US11354182B1 (en) Internal watchdog two stage extension
US20250004831A1 (en) System and method for operating a hardware watchdog timer in a data processing unit
JP2025080743A (en) Cross-domain access method, device, equipment, and medium
CN118193128A (en) Operating system hot backup method, device, equipment and medium based on virtual machine
CN103617094A (en) Transient fault tolerant system of multi-core processor
CN108932113A (en) A kind of disk management method, device, equipment and readable storage medium storing program for executing
JPH03180948A (en) Fault restoring system in multihost system
EP2176757A2 (en) Watchdog mechanism with fault recovery
JPH06324897A (en) Error recovery system for logical unit
JPS60120463A (en) Multiprocessor system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210402)