WO2021093931A1

WO2021093931A1 - Fault detection system

Info

Publication number: WO2021093931A1
Application number: PCT/EP2019/080874
Authority: WO
Inventors: Tariq Kurd; Andrew Dellow; Mark Bowen HILL; Zhitong XU; Qibiao ZHONG
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2019-11-11
Filing date: 2019-11-11
Publication date: 2021-05-20
Also published as: EP4055481A1

Abstract

A fault detection system for detecting faults in processing a set of instructions. The fault detection system comprises a first processing module and a second processing module. The first processing module processes instructions of a set of instructions, thereby obtaining a first set of data. The second processing module processes the instructions, thereby obtaining a second set of data. The fault detection system compares the first set of data with the second set of data. In response to determining a mismatch therebetween, it determines a processing fault, and outputs a fault signal indicative of the determined processing fault. Processing the instructions by the first processing module and by the second processing module comprises one or more of the following: processing, by the second processing module, each instruction with a variable delay relative to processing of the respective instruction by the first processing module, processing, by the first processing module, the instructions in a first processing order, and processing, by the second processing module, the instructions in a second processing order which differs from the first processing order, and processing, by the first processing module, each instruction from the set of instructions by performing a first calculation, and processing, by the second processing module, each instruction from the set of instructions by performing a second calculation which differs from the first calculation. The system is particularly efficient and secure.

Description

FAULT DETECTION SYSTEM

BACKGROUND

Faults can occur in processing systems such as processor chips. These faults can lead to undesirable behaviour of the processing system. For example, a fault in a processor system can lead to an incorrect instruction being run which can give an unexpected result. Such behaviour is undesirable.

Faults can be introduced accidentally or deliberately. An example of an accidental fault is one caused by a high energy particle or beam, such as a cosmic ray, that strikes a part of the processing system causing it to malfunction. Variations in characteristics or performance of a component device of a processing system, for example due to small feature size, can introduce faults into the processing system.

Examples of deliberate fault injection are those caused by laser probing (such as heating up at least a portion of the processing system using a laser) and introducing glitches in the power supply to the processing system. Deliberate fault injection can be used to try to get unexpected behaviour in the processing system. If a processing system runs incorrect instructions, an attacker can use this to try and compromise the processing system. This can help an attacker gain control of the processing system and/or reveal details of the operation of the processing system.

It is known to defend against fault injection by providing two identical copies of a CPU core. Both copies are run in lock-step with one another. If a fault is injected into a core, the result of one core will differ from the result of the other core, and the fault can be detected. The provision of two identical copies of a core results in duplication of instruction processing, doubles the area and power needed and can make the system more vulnerable to power analysis attacks.

It is desirable to provide a more efficient fault detection system.

SUMMARY OF THE INVENTION

There is provided a fault detection system for detecting faults in processing a set of instructions, the fault detection system comprising: a first processing module and a second processing module; the first processing module being configured to process instructions of a set of instructions, thereby obtaining a first set of data; and the second processing module being configured to: process the instructions, thereby obtaining a second set of data; the fault detection system being configured to: compare the first set of data with the second set of data, in response to determining a mismatch therebetween, determine a processing fault, and output a fault signal indicative of the determined processing fault, wherein processing the instructions by the first processing module and by the second processing module comprises one or more of the following: processing, by the second processing module, each instruction with a variable delay relative to processing of the respective instruction by the first processing module, processing, by the first processing module, the instructions in a first processing order, and processing, by the second processing module, the instructions in a second processing order which differs from the first processing order, and processing, by the first processing module, each instruction from the set of instructions by performing a first calculation, and processing, by the second processing module, each instruction from the set of instructions by performing a second calculation which differs from the first calculation.

This approach enables processing at the second processing module to occur out of sync with processing at the first processing module. This can increase security by making it more difficult for an attacker to analyse the processing at one or both of the processing modules.

The first processing module may be configured to save the first set of data in a memory accessible to the second processing module. This permits the second processing module to access the first set of data, for example at a time convenient to the second processing module. The system may comprise memory logic associated with the memory, the memory logic being configured to maintain an age value of the first set of data and to restrict access by the second processing module to the first set of data in dependence on the age value. The age value can usefully identify the delay for an instruction.
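One possible reading of the age-based access restriction can be sketched as follows. The sketch assumes that an entry only becomes readable once its age reaches a per-entry minimum; the class name `AgedBuffer`, its methods and the `min_age` parameter are hypothetical and are not taken from the disclosure.

```python
class AgedBuffer:
    """Toy memory whose entries become readable only once they are old enough.

    Assumption: "restricting access in dependence on the age value" is
    modelled here as "readable when age >= min_age".
    """

    def __init__(self):
        # Each entry is [age_in_cycles, min_age, data], oldest first.
        self._entries = []

    def write(self, data, min_age):
        """Called by the first processing module to save a result."""
        self._entries.append([0, min_age, data])

    def tick(self):
        """Advance one cycle: every stored entry gets one cycle older."""
        for entry in self._entries:
            entry[0] += 1

    def read(self):
        """Called by the second processing module.

        Returns the oldest entry's data if it is old enough, else None.
        """
        if self._entries and self._entries[0][0] >= self._entries[0][1]:
            return self._entries.pop(0)[2]
        return None


buf = AgedBuffer()
buf.write("ADD x3, x1, x2", min_age=2)
first_try = buf.read()   # too young, access is refused
buf.tick()
buf.tick()
second_try = buf.read()  # old enough, access is granted
```

Choosing a different `min_age` for each entry is one way in which the delay for each instruction could be made to vary.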

The instructions may form an ordered instruction stream and the first processing module may be configured to process the instructions out-of-order and the second processing module may be configured to process the instructions in order. Suitably the second processing module processes the instructions in order so that the order of the instructions being checked is as expected for a program being run.

The first set of data can comprise one or more of: the instruction for processing; one or more operands of a calculation to be carried out at the first processing module in processing the instruction; one or more results of the calculation carried out at the first processing module; whether the instruction is a load instruction or a store instruction; a load address from which load data is loaded or a store address at which store data is stored; load data or store data; exception information; interrupt information; debug information; bus error information; and a cyclic redundancy check (CRC) value.

These items, separately and in combination, can usefully reveal that a fault has occurred in instruction processing, for example by comprising or relating to a value that may change in the event of a fault occurring.

The second set of data comprises one or more of: one or more operands of a further calculation to be carried out at the second processing module in processing the instruction; one or more results of the further calculation carried out at the second processing module; whether the instruction is a load instruction or a store instruction; a further load address from which load data is loaded or a further store address at which store data is stored; further store data; further exception information; and a further CRC value.

The CRC value and the further CRC value may be determined in dependence on one or more of: the instruction; and the load address and the further load address, respectively; or the store address and the further store address, respectively. This enables a check to be made of the instruction or the combination of the instruction and the address, whilst permitting a saving in memory needed to store the check value. The CRC value and/or the further CRC value may be determined in dependence on any other item or combination of items in the first set of data or in the second set of data. Instead of the CRC value and/or the further CRC value, a parity check value or further parity check value may be used. A parity check value or further parity check value may be used together with the CRC value or the further CRC value.

The second processing module may be further configured to monitor a transaction over a bus coupled to the first processing module, and to store one or more of: whether the transaction relates to a load instruction or a store instruction; a load address in the transaction from which load data is loaded or a store address in the transaction at which store data is stored; load data or store data, as sent over the bus; and bus error information determined from the bus. Monitoring bus transactions can enable detection of faults occurring on the bus or relating to data transferred over the bus.

The fault detection system may be further configured: for a store instruction, to compare one or more of: the store address with the further store address, the store data with the further store data, the store address and the store data with, respectively, the store address in the monitored transaction and the store data as sent over the bus, and the bus error information provided by the first processing module with the bus error information determined from the bus; and for a load instruction, to compare one or more of: the load address with the further load address; the load address and the load data with, respectively, the load address in the monitored transaction and the load data as sent over the bus, and the bus error information provided by the first processing module with the bus error information determined from the bus.

These comparisons permit detection of faults that occur before the relevant address is sent over the bus, and/or detection of faults that occur after the relevant address is sent over the bus.

Where the instruction comprises an integer DIVIDE-REMAINDER calculation, a ÷ b, the fault detection system may be configured to verify a divide and remainder result of the calculation, d and r, by performing, at the second processing module, a MULTIPLY and an ADD calculation, and checking one or other of: (i) d * b + r = a; or (ii) d * b = a - r. This approach enables the second processing module to perform the check calculation over a number of cycles that does not exceed the number of cycles taken by the first processing module, and/or to perform the check calculation using a simpler architecture. Thus this approach can reduce the chance that the second processing module will stall the first processing module.

The instruction may comprise a floating point calculation and the fault detection system may be configured to verify a result of the calculation by checking that a result obtained by the second processing module is within a predetermined range of an expected result. This approach can permit useful checks to be made where the instructions comprise a floating point calculation. The instruction may comprise a SQUARE ROOT calculation, s = √a, the second processing module may be configured to calculate m = s² and the fault detection system may be configured to verify the result of the calculation by checking that m is within the predetermined range of a; or the instruction may comprise a DIVIDE calculation, d = a ÷ b, the second processing module may be configured to calculate m = b * d and the fault detection system may be configured to verify the result of the calculation by checking that m is within the predetermined range of a. This approach enables the second processing module to perform the check calculation over a number of cycles that does not exceed the number of cycles taken by the first processing module, and/or to perform the check calculation using a simpler architecture.

The predetermined range may be determined in respect of a plurality of rounding modes. The predetermined range may differ for different rounding modes.

The second processing module may be configured to process the instructions in a predetermined number of processor cycles. This approach can avoid the need to identify situations where the number of processing cycles taken by the second processing module is lower than a maximum number of cycles taken by the second processing module.

The first processing module may be provided with or have access to a first set of registers for recording state data relating to processing the set of instructions, and the second processing module may be provided with or have access to a second set of registers for recording a portion of the state data, where the second set of registers corresponds to a subset of the first set of registers. This approach permits the second processing module to perform effective fault checks by referencing an architectural state smaller than the whole architectural state of the first processing module.

There is provided a method of detecting faults in a system for processing a set of instructions, the system comprising a first processing module and a second processing module, the method comprising: at the first processing module: processing instructions of a set of instructions, thereby obtaining a first set of data; and at the second processing module: processing the instructions, thereby obtaining a second set of data; the method further comprising: comparing the first set of data with the second set of data, in response to determining a mismatch therebetween, determining a processing fault, and outputting a fault signal indicative of the determined processing fault; wherein processing the instructions at the first processing module and at the second processing module comprises one or more of the following: processing, at the second processing module, each instruction with a variable delay relative to processing of the respective instruction at the first processing module, processing, at the first processing module, the instructions in a first processing order, and processing, at the second processing module, the instructions in a second processing order which differs from the first processing order, and processing, at the first processing module, each instruction from the set of instructions by performing a first calculation, and processing, at the second processing module, each instruction from the set of instructions by performing a second calculation which differs from the first calculation.

This approach enables processing at the second processing module to occur out of sync with processing at the first processing module. This can increase security by increasing the difficulty for an attacker to analyse the processing at one or both of the processing modules.

BRIEF DESCRIPTION OF THE FIGURES

The present invention will now be described by way of example with reference to the accompanying drawings.

In the drawings:

Figure 1 shows an example system comprising a main core and a shadow core;

Figure 2 shows an example reorder buffer; and

Figure 3 shows an example of a portion of a shadow core.

DETAILED DESCRIPTION OF THE INVENTION

Processor security is an important consideration when developing processor cores and processing modules. Typically processing modules will contain secret information, such as secret encryption/decryption keys. It is desirable to provide secure systems which can prevent or reduce the number of instances of an attacker gaining access to the secret information. In general, it is desirable to prevent an attacker from being able to influence or control the operation of a processing module without such influence or control being detected. Once it is detected, steps can be taken to reassert authorised control and limit the effect of the attack or other fault.

For example, where an accidental fault occurs it may be desirable to try to correct that fault and to carry on with the processing. An alternative is to stop the processing on fault detection. Where a fault is deliberately introduced, processing can be stopped, the fault reviewed and the processing core or module reset to clear the fault.

The present disclosure relates to a method for checking execution correctness to verify that a processor core or processing module is running correct code, and has not been subverted by fault attacks.

The present techniques comprise using two cores or processing modules. Suitably the two cores are different from one another. This can make power analysis attacks less likely to succeed. A main core - a first processing module - comprises typical core architecture. Instructions can be executed in a standard way by the main core. Results of the instruction execution are stored in one or more registers or memories. That is, the main core can report information about all completed instructions.

Results can be written to a reorder buffer or other memory.

A checker core or shadow core - a second processing module - is a simplified version of the main core. The shadow core can access the results of the main core, for example by accessing the reorder buffer. The shadow core re-executes the same instructions as the main core and checks that the result of the main core is as expected. That is, the shadow core can check architectural state updates from the main core. Where a result is found to be incorrect, for example where the result of executing an instruction at the shadow core does not match the result of executing that instruction at the main core, the shadow core can flag to the system that a fault has occurred. The system is then able to take remedial action, for example by stopping the processing and resetting the system.

The shadow core is smaller and simpler than the main core, and has a different power profile. The shadow core can execute a variable delay behind the main core. That is, the timing of the two cores or processing modules is separated. The delay between processing an instruction at the first processing module and processing that instruction at the second processing module may comprise a number of cycles between starting execution of the instruction at the first processing module and starting execution of the instruction at the second processing module. The delay may comprise a number of cycles between completing execution of an instruction at the first processing module and completing execution of the instruction at the second processing module. The delay may change with time. The delay may change between instructions. A delay in respect of a first instruction may be different from a delay in respect of a second instruction. The change in delay can thus be termed a variable delay. Executing instructions at the shadow core with a variable delay compared to the execution of those instructions at the main core can help disrupt predictions or analysis of the processing at each of the cores. Thus, this can enhance the security of the system. Security systems or monitors, such as a security monitor comprising a shadow core, help prevent attackers from interfering in the operation of a core and inserting their own code for execution at that core. Such a security system should make it difficult for an attacker to effectively attack both a main core and a shadow core.

Where a shadow core executes instructions with a fixed delay after execution of those instructions at a main core, the same fault can be injected in the main and shadow cores at a given delay. Where the given delay matches the fixed delay, an attacker can derive useful information about the instructions being executed at the cores.

Where a shadow core executes instructions with a fixed delay after execution of those instructions at a main core, power analysis of the two cores can reveal useful information about the operation of the cores. For example, where large numbers are multiplied with one another, a power spike can occur. This power spike can be doubled where both cores do the same calculation (a fixed delay can be filtered out from power analysis results).

Use of a variable delay, as in the present techniques, prevents such attacks from revealing the same amount of information. Where there is a variable delay between the cores in the timing of the execution of instructions, the same fault will not have the same effect on both cores and power analysis results from each core will not correlate well with one another. This helps obfuscate the working of the cores to an external observer such as an attacker.

The operation of the cores can be obfuscated by writing back the instructions in a different order between the cores. The main core can be configured to execute a set of instructions in a first order, and the shadow core can be configured to execute the set of instructions in a different, second order. In this way, any power spikes that occur from executing an instruction will differ in timing between the cores, helping to decorrelate main core operation from shadow core operation. Thus configuring the main and shadow cores to execute instructions in a different order from one another can change the power dissipation profile and make it harder for an attacker to attempt to reverse engineer the instructions being executed.

The operation of the cores can be obfuscated by performing different calculations at each of the cores. For example, the main core can determine the result of an instruction using one calculation and the shadow core can determine the result of the same instruction using a different calculation. The performing of different calculations can lead to differences in power spike intensities and/or timing (since a different calculation may consume a different amount of power and/or execute over a different number of cycles).

The techniques of using a variable delay, processing instructions in a different order and performing different calculations when executing instructions can be used separately or together in any desired combination.

The delay in instruction execution between the main core and the shadow core can be between 1 and 5 cycles. Typically the delay will be 3 cycles or fewer. Preferably the delay is 1 or 2 cycles. The delay may be up to 3 cycles on average. The delay may be 2 cycles on average. For example, the mean delay for instructions in a set of instructions can be 2 cycles.

The variable delay introduced at the shadow core means that even if an attacker knows how to inject a particular fault at the main core (for example by shining a laser at a portion of the main core) the attacker will not know how to replicate that fault at the shadow core - thus the attack can be detected due to a difference in the results of the processing at the main core and the shadow core.

An example of a system comprising a main core and a shadow core will now be described with reference to the figures. The system is shown generally in figure 1 at 100. The system comprises a main core 110 and security monitors 120. The shadow core 130 is provided at the security monitors, together with a reorder buffer 132 and a load/store checker 134. The main core comprises an interrupt controller 112, a security violation module 114, a memory protection unit (MPU) 116 and a return address stack (RAS) (or call address stack) 118. The MPU 116 can implement physical memory protection checking, described in more detail below.

The main core executes instructions and reports the results into the reorder buffer module, which is a completed instruction queue. This is similar to a full trace output from the core. It reports details such as the address (or program counter (PC)), instruction encoding, writeback data, and load/store unit (LSU) address (if applicable).

The RAS 118 can be used to help defeat code reuse attacks (CRAs) or “return oriented programming”. An example of such attacks is to attempt to exploit a jump to a return address within programming such as by diverting program flow to a different return address from within a function. The return address stack can be used to check that the return address on the stack matches the return address to which the program flow is diverted, i.e. that the program returns to the expected address. This can be achieved by inserting a marker instruction at the start of a function - for example adding zero to a given register - then the system can determine an address from which program flow occurs and therefore the address to which the program should return.
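The return address check described above can be illustrated with a minimal software model. This is only a sketch: the class `ReturnAddressStack` and its method names are hypothetical, and the detection of function entry (for example via a marker instruction) is assumed to have already happened by the time `on_call` is invoked.

```python
class ReturnAddressStack:
    """Toy model of a return address stack used to detect diverted returns."""

    def __init__(self):
        self._stack = []

    def on_call(self, return_address):
        """Record the address to which the called function should return."""
        self._stack.append(return_address)

    def on_return(self, target):
        """Check a return against the recorded address.

        Returns True if the program returns to the expected address and
        False (a fault) if the target differs or no call was recorded.
        """
        if not self._stack:
            return False
        return self._stack.pop() == target


ras = ReturnAddressStack()
ras.on_call(0x1004)                    # function called from 0x1000
normal_return = ras.on_return(0x1004)  # matches the recorded address

ras.on_call(0x2008)
diverted_return = ras.on_return(0x6000)  # program flow was diverted
```

In this model a `False` result corresponds to the condition under which a fault would be flagged.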

The system 100 comprises an instruction bus (I-bus) 140, a data bus (D-bus) 142 and a system bus (S-bus) 144. The D-bus is suitably used for accessing local memory. The S-bus is suitably used for off-chip access, and is likely to be slower than the D-bus. The security monitors 120, for example the shadow core 130, can monitor transactions on the I-bus, the D-bus and the S-bus. In some implementations, the shadow core is configured to obtain instructions from the main core, and so need not access the I-bus.

Examples of checks performed by the shadow core will now be described.

Checking arithmetic results

The shadow core contains a copy of much of the architectural state of the main core. The shadow core need not contain a copy of the whole architectural state of the main core. The shadow core suitably comprises or can access a copy of all of the integer registers so that if the main core executes (e.g.) an ADD instruction, the shadow core will execute the same ADD instruction after a delay. The shadow core will check the result of the ADD instruction executed at the shadow core with the result of the ADD instruction executed at the main core. The shadow core may check that the PC (address) of the instruction executed at the main core matches the PC of the instruction executed at the shadow core. The shadow core may check that any exception information associated with the instruction executed at the main core matches any exception information of the instruction executed at the shadow core.
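The re-execution and comparison described above can be modelled in software as follows. This is a simplified sketch under several assumptions that are not taken from the disclosure: 32-bit integer registers, a PC that advances by 4 for every instruction, support for ADD and SUB only, and a hypothetical `Completed` record standing in for a reorder buffer entry.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # assumption: 32-bit integer registers


@dataclass
class Completed:
    """Hypothetical reorder-buffer entry reported by the main core."""
    pc: int
    op: str
    rd: int
    rs1: int
    rs2: int
    result: int


class ShadowCore:
    """Toy shadow core holding its own copy of the integer registers."""

    def __init__(self, num_regs=32):
        self.regs = [0] * num_regs
        self.expected_pc = 0

    def check(self, entry):
        """Re-execute one entry. Returns True on a match, False on a fault."""
        # Check that the PC reported by the main core is the one expected.
        if entry.pc != self.expected_pc:
            return False
        a, b = self.regs[entry.rs1], self.regs[entry.rs2]
        if entry.op == "ADD":
            local = (a + b) & MASK32
        elif entry.op == "SUB":
            local = (a - b) & MASK32
        else:
            raise ValueError(f"unsupported op in this sketch: {entry.op}")
        # Compare the locally computed result with the reported result.
        if local != entry.result:
            return False
        # Only update the shadow state once the result has been verified.
        self.regs[entry.rd] = local
        self.expected_pc = entry.pc + 4
        return True


shadow = ShadowCore()
shadow.regs[1], shadow.regs[2] = 5, 7
good = shadow.check(Completed(pc=0, op="ADD", rd=3, rs1=1, rs2=2, result=12))
# The main core reports 16, but the shadow core computes 12 + 5 = 17.
bad = shadow.check(Completed(pc=4, op="ADD", rd=4, rs1=3, rs2=1, result=16))
```

Exception information could be compared in the same way by adding a further field to the entry and to the comparison.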

In principle, the shadow core can be configured to check any aspect of the architectural state, so the system may therefore be configured such that the shadow core has access to a copy of the whole architectural state. However, since the shadow core operates with a variable delay compared to the main core, the state of all of the registers is not necessarily accurate at the shadow core - an example of this is when an interrupt occurs which can alter the architectural state. In such cases, the shadow core can be configured to accept the state as reported by the main core.

Checking load/store data and addresses

The security monitors 120 comprise a load/store checker 134. The load/store checker can be provided at the shadow core 130. The load/store checker module checks that load/store data reported by the main core is actually seen on one of the memory busses. The load/store checker module can comprise a table of recent load/store bus requests, including whether the bus request relates to a load or a store, an address from which the load is to be read (i.e. a read address), an address at which the store is to be made (i.e. a write address), load data (data read from the specified address), store data (data to be written at the specified address), bus error information, and so on.

To save memory space (and hence area), a check parameter can be stored instead of one or more data fields. The check can comprise a cyclic redundancy check (CRC) remainder.
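As an illustration of storing a check parameter in place of full data fields, the sketch below folds an instruction encoding and an address into a single 32-bit CRC. The use of the CRC-32 polynomial provided by Python's `zlib` module, the field widths and the function name `check_value` are assumptions made for the example; the disclosure does not specify a particular polynomial or layout.

```python
import struct
import zlib


def check_value(instruction, address):
    """Return a 32-bit CRC over a 32-bit instruction and a 64-bit address."""
    # "<IQ" packs a little-endian unsigned 32-bit value followed by an
    # unsigned 64-bit value, with no padding (12 bytes in total).
    return zlib.crc32(struct.pack("<IQ", instruction, address))


# Value stored when the main core reports the access.
stored = check_value(0x00A50533, 0x80001000)

# Value recomputed later from what the shadow core determines locally.
recomputed_ok = check_value(0x00A50533, 0x80001000)
recomputed_fault = check_value(0x00A50533, 0x80001004)  # one address bit flipped

match = stored == recomputed_ok
mismatch = stored != recomputed_fault
```

Here 12 bytes of instruction and address are represented by a 4-byte check value, at the cost of no longer being able to tell which field differed when a mismatch occurs.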

Suitably the table is large enough so that it cannot overflow. In such cases flow control need not be implemented. This can enable a simpler configuration of table access logic to be implemented. In other implementations, for example where a smaller table is used, flow control can be implemented. Whether or not to implement flow control can be determined based on a trade-off between table size and code complexity.

When the main core reports a load/store to the shadow core, the shadow core can perform one or more of the following actions: i) the shadow core checks the address reported by the main core (for example the address passed by the main core to the shadow core via the reorder buffer) against an address calculated locally at or by the shadow core. ii) the shadow core checks store data as seen on a bus or reported by the main core against store data determined locally at or by the shadow core. iii) the shadow core looks up the load/store in the load/store checker to identify whether it is a load or a store, and compares the type of transaction to the type of transaction determined from the data sent over the bus, data reported by the main core and/or an instruction executed at the main core.

It is possible to determine a fault by carrying out any one of these checks, where that fault causes a difference in the relevant result. Not all faults will cause all of these three checks to report that a fault has occurred. Thus, improved fault checking can be obtained by performing more than one of these checks, in any combination.
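The three checks above can be combined as in the following sketch, which returns the names of the checks that failed. The `Access` record, the function `check_access` and the use of a dictionary keyed by address as the table of recent bus requests are all simplifying assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """Hypothetical record of one load/store access."""
    is_store: bool
    address: int
    data: int


def check_access(reported, local_address, local_data, bus_table):
    """Return the names of the failed checks (an empty list means no fault).

    reported      -- the access as reported by the main core
    local_address -- the address calculated locally at the shadow core
    local_data    -- the store data determined locally at the shadow core
    bus_table     -- recent bus requests, keyed by address (an assumption)
    """
    failed = []
    # (i) Reported address against the locally calculated address.
    if reported.address != local_address:
        failed.append("address")
    # (ii) Reported store data against the locally determined store data.
    if reported.is_store and reported.data != local_data:
        failed.append("store data")
    # (iii) Transaction type against the type seen on the bus.
    seen = bus_table.get(reported.address)
    if seen is None or seen.is_store != reported.is_store:
        failed.append("type")
    return failed


bus = {0x1000: Access(is_store=True, address=0x1000, data=42)}
no_fault = check_access(Access(True, 0x1000, 42), 0x1000, 42, bus)
data_fault = check_access(Access(True, 0x1000, 43), 0x1000, 42, bus)
```

Because each check is reported separately, the example also shows how a fault can be caught by one check while passing the others.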

The shadow core is suitably configured to check that the result from the main core (for example as passed to the shadow core via the reorder buffer) matches the result as seen on the bus. This checking method gives an independent path for load/store data and the addresses from the bus pins on the core into the shadow core, so that all transactions can be checked against the result reported by the main core.

Checking results of iterative instructions

If instruction processing at the shadow core falls behind the main core, then there is a risk that the reorder buffer will fill up since the main core may write data to the reorder buffer more quickly than the shadow core can access that data (and mark the relevant data as having been accessed). In this case the main core may stall whilst waiting for space in the reorder buffer in which to write data. It is generally undesirable for the shadow core to stall the main core. Whether or not the shadow core will stall the main core can change in light of one or more of (i) the variable delay between the main core and the shadow core, (ii) the order in which instructions are processed at the main and shadow cores, and (iii) the calculations performed at each of the cores when executing an instruction. The shadow core should be configured such that execution of instructions at the shadow core can keep up with execution of instructions at the main core.

It can be possible for the reorder buffer to fill up, but ideally the shadow core will process the output from the reorder buffer every x cycles. Suitably x = 1, and so the shadow core will process output from the reorder buffer every cycle. In some implementations x = 2, and the shadow core is configured to complete processing of the reorder buffer output every 2 cycles. In some implementations, it may be known that an instruction will take 2 cycles in the shadow core. In such cases, an indication can be provided that the instruction needs an extra cycle in the reorder buffer if a stall is required.
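The effect of x on stalls at the main core can be illustrated with a small cycle-level model. The model assumes that the main core tries to write one completed instruction per cycle, that the shadow core removes one entry every x cycles, and that removal happens before the write within a cycle; the function name `count_stalls` and these timing assumptions are illustrative only.

```python
def count_stalls(num_instructions, capacity, x):
    """Count cycles in which the main core waits for reorder-buffer space."""
    occupancy = 0  # entries currently held in the reorder buffer
    written = 0    # instructions the main core has reported so far
    stalls = 0
    cycle = 0
    while written < num_instructions:
        cycle += 1
        # The shadow core consumes one entry every x cycles.
        if cycle % x == 0 and occupancy > 0:
            occupancy -= 1
        # The main core writes one entry per cycle if there is space.
        if occupancy < capacity:
            occupancy += 1
            written += 1
        else:
            stalls += 1
    return stalls


stalls_x1 = count_stalls(num_instructions=100, capacity=4, x=1)
stalls_x2 = count_stalls(num_instructions=100, capacity=4, x=2)
```

Under these assumptions the main core never stalls when x = 1, whereas with x = 2 the buffer eventually fills and the main core has to wait.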

It is possible to minimise or avoid such stalls at the main core by reducing the number of cycles taken at the shadow core when executing an instruction. For example, stalls can be avoided where the shadow core completes execution of an instruction in a single cycle.

Suitably the shadow core is configured to have a fixed latency. This can mean that the shadow core is always able to keep up with the main core. Where it is known or can be determined that the output of the reorder buffer can be taken and used in a fixed amount of time, this can simplify flow control. In such cases, there is no need for feedback from the shadow core to report how long execution of an instruction has taken.

Flow control may also be simplified where the shadow core is configured to execute instructions over differing numbers of cycles. For instance, where the shadow core is configured to execute all instructions in either 1 cycle or 2 cycles, the system can assume that the execution of each instruction will take 2 cycles (in general: the largest number of cycles that the shadow core takes to process an instruction). In such cases there is no need to feed back the number of cycles taken for an instruction.

However, this can pose a problem for iterative instructions. The shadow core cannot realistically calculate a divide or remainder result in one cycle. Therefore the shadow core can be configured to check the divide-remainder result by performing a different calculation. In this example the calculation can be reversed.

For example, where the main core performs: a ÷ b = d rem r, the shadow core is configured to receive both the divide and the remainder result from the main core (d and r). The shadow core can be configured to check that d * b + r = a, where a and b are read from the shadow core register file, and d and r are received from the main core.

The check performed by the shadow core can be optimised so that the check is: d * b = a - r so that the multiply and add are done in parallel, and can reuse the existing multiplier and adder logic.
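By way of illustration only, the reverse check for an integer divide-remainder instruction can be sketched as follows. The function name check_divide_remainder is hypothetical and does not appear in the described implementation. The sketch reproduces only the identity d * b = a - r; it does not check that the remainder lies in a valid range.

```python
def check_divide_remainder(a: int, b: int, d: int, r: int) -> bool:
    """Return True if the main-core result (d, r) satisfies d * b = a - r."""
    # One multiply and one subtract; in hardware these could run in parallel.
    product = d * b
    difference = a - r
    return product == difference


# Fault-free case: 17 / 5 = 3 rem 2.
ok = check_divide_remainder(17, 5, 3, 2)
# A single flipped bit in the quotient (3 becomes 2) breaks the identity.
faulted = check_divide_remainder(17, 5, 3 ^ 1, 2)
```

In this sketch a and b stand for the operands read from the shadow core register file, and d and r stand for the values received from the main core.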

For iterative floating point instructions a range check may be performed on the result of the reverse calculation in the shadow core. For floating point square root: s = √a; m = s², and for floating point divide: d = a ÷ b; m = b * d.

In both cases m ≈ a within a mathematically computed bound across all rounding modes. Note that in some cases an ‘inexact flag’ may not be set on the floating point result, and in these cases the check must be exact, otherwise a fault condition can be raised. The mathematically computed bound may be determined by calculating in advance a result with different rounding modes to obtain a likely bound. An estimate may be made of a bound slightly larger than would be expected to occur, e.g. on a next most significant digit to the possibly rounded digit.

Such a determined bound may be stored in a memory at or accessible to the shadow core, for example in a look up table. The determined bound may be stored in dependence on an instruction type.
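A minimal sketch of such a range check is given below, assuming double-precision values and a bound expressed in units in the last place (ULPs) of the operand a. The table BOUND_ULPS, its value of 4 ULPs, the instruction type names and the function check_reverse are all assumptions made for illustration; the bound used in a real implementation would be derived as described above. The sketch relies on math.ulp, which is available from Python 3.9.

```python
import math

# Hypothetical look-up table: permitted deviation per instruction type, in ULPs of a.
BOUND_ULPS = {"fsqrt": 4, "fdiv": 4}


def check_reverse(kind, a, result, b=None, inexact=True):
    """Check a main-core floating point result by reversing the calculation."""
    if kind == "fsqrt":
        m = result * result      # m = s squared
    elif kind == "fdiv":
        m = b * result           # m = b * d
    else:
        raise ValueError("unsupported instruction type")
    if not inexact:
        # No inexact flag on the result: the reverse calculation must match exactly.
        return m == a
    return abs(m - a) <= BOUND_ULPS[kind] * math.ulp(a)
```

For example, check_reverse("fsqrt", 2.0, math.sqrt(2.0)) passes the range check, whereas a corrupted result such as 1.5 falls outside the bound and would be reported as a fault.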

Checking that the instruction was correctly fetched

Since, in the present implementation, the shadow core does not monitor the I-bus, the shadow core cannot check that the instruction was correctly fetched from memory. Corruption of the instruction is checked for with a CRC check, but the shadow core trusts that the instruction itself is correct.

A different scheme may be used to check for faults. When an instruction fetch request is sent to the bus, the address of the instruction fetch is looped back to the core with the fetched instructions (so the address forms part of the bus response). When the response is received, an instruction fetch unit checks the address returned over the bus against the original request address sent out on the bus. This approach will detect any faults introduced on the bus, but it will not detect if the original request address had a fault. To do this, a joint CRC may be taken of the instruction and its fetch PC before the instruction is sent over the bus, for example when the instructions are queued up for sending over the bus. When the instructions are issued to the execution pipeline, the CRC (of the instruction and the PC) can be checked against a further CRC determined using the fetched instruction and the locally generated PC.
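The joint CRC scheme can be sketched as follows. This is an illustration under stated assumptions rather than the described implementation: zlib.crc32 is used as a stand-in because the CRC polynomial is not specified here, instructions and PCs are assumed to be 32 bits wide, and the names joint_crc, fetch and issue_check are hypothetical.

```python
import zlib


def joint_crc(instruction: int, pc: int) -> int:
    """CRC taken jointly over a 32-bit instruction and its 32-bit fetch PC."""
    data = instruction.to_bytes(4, "little") + pc.to_bytes(4, "little")
    return zlib.crc32(data)


def fetch(memory: dict, request_pc: int):
    """Fetch side: return the instruction and the joint CRC, but not the PC."""
    instruction = memory[request_pc]
    return instruction, joint_crc(instruction, request_pc)


def issue_check(instruction: int, crc: int, local_pc: int) -> bool:
    """Issue side: recompute the CRC with the locally generated PC and compare."""
    return joint_crc(instruction, local_pc) == crc
```

If the request address is corrupted so that a fetch intended for one address is made from another, the CRC recomputed with the locally generated PC no longer matches, even where both addresses hold the same instruction encoding.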

Therefore, because the CRC is checked against the further CRC determined from the local PC in the execution pipe, it can be determined that the original fetch address was correct. The instruction fetch unit may issue only the joint CRC (i.e. the CRC of the instructions and PC) and instruction. The instruction fetch unit does not need to save the PC of unissued instructions, saving significant area.

Different order of writebacks to improve SPA resistance

The order of register file writebacks is different between the main core and the shadow core to improve simple power analysis (SPA) resistance. In the described implementation this can be achieved by writing back load data out of order in the main core and in-order in the shadow core. This is not a strict requirement: any reordering of register file writebacks between the main core and the shadow core will help improve SPA resistance.

Physical memory protection checking

The MPU 116 can define regions of memory as being executable or non-executable. Up to 20 memory regions can be defined in some examples. The main core may be configured to check memory accesses using the MPU, which can for example check memory accesses using the RISC-V Physical Memory Protection (PMP) standard.

The MPU may be used to determine whether a particular memory region is executable or not, i.e. whether instructions may execute from that memory region. The MPU may be used to determine whether a particular memory region has read permission or not, i.e. whether a load instruction can read data from that memory region.

For example, the shadow core may attempt to execute instructions at address ADDR_1 which try to read data from address ADDR_2. The MPU may be used to check whether the memory region comprising ADDR_1 has execution permissions and/or whether the memory region comprising ADDR_2 has read permissions.
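The permission checks can be sketched as follows, on the basis that an instruction address needs execute permission and a load address needs read permission. The Region type, the example region layout and the function names are assumptions made for illustration and do not reflect the register format of the MPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int
    size: int
    read: bool
    execute: bool

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


def find_region(regions, addr):
    """Return the first region containing addr, or None if no region matches."""
    for region in regions:
        if region.contains(addr):
            return region
    return None


def may_execute(regions, addr) -> bool:
    region = find_region(regions, addr)
    return region is not None and region.execute


def may_read(regions, addr) -> bool:
    region = find_region(regions, addr)
    return region is not None and region.read


# Hypothetical layout: one executable code region and one readable data region.
REGIONS = [
    Region(base=0x0000, size=0x1000, read=False, execute=True),
    Region(base=0x2000, size=0x1000, read=True, execute=False),
]
```

With this layout, a load instruction located at 0x0100 that reads from 0x2000 passes both checks, whereas an attempt to execute from 0x2000 is refused.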

The MPU contains a lot of architectural state, so it is undesirable to have a duplicate copy of the MPU state in the shadow core; this can take up more memory and hence area than is necessary. The MPU may have 4 ports:

1. between the MPU and the remainder of the main core: instruction fetch address check;

2. between the MPU and the remainder of the main core: load/store address;

3. between the MPU and the shadow core: completed PC check; and

4. between the MPU and the shadow core: load/store address.

There are differences between the two checks for the instruction PC. In the main core the fetch address which is issued to the bus is always 32-bit aligned (in the present implementation), even though the instruction stream may mix 16-bit and 32-bit (2-byte and 4-byte) instructions. Hence, an instruction may start at a 16-bit boundary. In the shadow core the PC check is exact, so if a 32-bit instruction crosses two MPU regions the shadow core will detect it, whereas the main core will not (in the described example). In other implementations, it is possible for the main core to be configured to comprise a more accurate check that can determine whether an instruction crosses an MPU boundary.
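The difference between the two PC checks can be illustrated with the following sketch. It is one possible reading of the behaviour described above, not the described implementation: the main core check is modelled as testing each 32-bit aligned fetch word separately, and the shadow core check as requiring every byte of the instruction to fall within a single executable region. The region tuples (base, limit, executable) and all function names are hypothetical.

```python
def region_of(regions, addr):
    """Return the index of the region containing addr, or None."""
    for index, (base, limit, _executable) in enumerate(regions):
        if base <= addr < limit:
            return index
    return None


def is_executable(regions, addr):
    index = region_of(regions, addr)
    return index is not None and regions[index][2]


def main_core_fetch_check(regions, pc, length):
    # Checks only the 32-bit aligned fetch addresses that cover the instruction.
    first_word = pc & ~0x3
    last_word = (pc + length - 1) & ~0x3
    return all(is_executable(regions, word)
               for word in range(first_word, last_word + 4, 4))


def shadow_core_pc_check(regions, pc, length):
    # Exact check: the first and last byte must lie in the same executable region.
    first = region_of(regions, pc)
    last = region_of(regions, pc + length - 1)
    return first is not None and first == last and regions[first][2]


# Two adjacent executable regions with a boundary at 0x1000.
REGIONS = [(0x0000, 0x1000, True), (0x1000, 0x2000, True)]
```

For a 4-byte instruction starting at 0x0FFE, both aligned fetch words are individually executable, so the modelled main core check passes, while the exact check reports that the instruction spans two regions.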

For most blocks it can be sufficient for a check to comprise a single parity bit for each byte. This enables detection of an odd number of errors per byte, but a fault leading to an even number of errors would not be detected by this method. Use of a CRC remainder value can enhance the accuracy of the check. A CRC check can detect any number of bit errors, depending on the polynomial used for the CRC check. In the present system a CRC remainder value is used to detect 1, 2, 3, or 4 single bit errors in that value.
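The limitation of a per-byte parity bit can be seen in the following sketch, which compares parity with a small CRC on a byte in which two bits have been flipped. The 8-bit polynomial 0x07 and the function names are assumptions made for illustration; the polynomial and CRC width used in the present system are not specified here.

```python
def byte_parity(byte: int) -> int:
    """Even/odd parity bit of one byte: 1 if an odd number of bits is set."""
    return bin(byte & 0xFF).count("1") & 1


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 with an assumed polynomial, initial value 0, MSB first."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


original = 0b10110010
corrupted = original ^ 0b00000011   # two bit flips in the same byte

parity_detects = byte_parity(original) != byte_parity(corrupted)
crc_detects = crc8(bytes([original])) != crc8(bytes([corrupted]))
```

Here parity_detects is False because an even number of flips leaves the parity bit unchanged, while crc_detects is True.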

Thus, suitably all MPU registers are protected with CRC checks instead of parity checks. A CRC check takes up more memory space than a parity check (since the CRC value takes up more memory space than a parity check value), but still takes up significantly less space than storing the full values of the register entries. This is considered an acceptable trade-off since the MPU plays an important role in the system.

The techniques described herein may be useful in various security CPUs, for example in HiMiDeerSVxxx security CPUs which form part of the Huawei product line.

The following description will describe some features of the present techniques in more detail.

Reorder buffer

Reference is made to figure 2, showing an example of a reorder buffer for interfacing between one processing module (e.g. a main core) and another processing module (e.g. a shadow core). The reorder buffer 132 comprises common entries fields 210 for storing common entries, an mcause/ccause field 220 for storing mcause and ccause data, and a writeback data field 230 for storing writeback data. Mcause comprises code representing an exception or interrupt that has occurred. Ccause suitably provides additional information relating to the exception or interrupt. For example, mcause can comprise information relating to the general type of fault, e.g. whether the fault is an instruction access fault. Ccause can comprise information such as specific information relating to the fault, such as a specific type of MPU fault, e.g. that the instruction access fault is a bus error.

The main core completes instructions and writes 4 data words to a register. The written data words comprise completed instruction information. The common entries comprise:

- instructions (which may be stored in an obfuscated manner);

- PCs;

- a mode of a group of modes (e.g. debug mode, machine mode, user mode, and so on); and

- extra data (this field has various uses, e.g. storing LSU address or a remainder result for a divide instruction).

The common entry fields may be updated at the same time.

The mcause/ccause field and the writeback data field can be updated after a delay. The mcause/ccause field (or fields: a separate field may be provided for mcause and for ccause data) holds exception information (bus error). Data can only be written to this field after a transaction has been sent over a bus and a response received since it is only at that stage that the error information will be known. If an exception or interrupt was taken, the exact type of exception or interrupt may be stored as part of the mcause/ccause field. The writeback data field holds writeback data. The writeback data field may comprise data in an obfuscated form. For example data can be obfuscated by performing bit-swapping and/or any other suitable method. The data may be de-obfuscated before use and re-obfuscated before being written back. Store data can be written straight away since it will be known. Load data will not be known until the load data is returned, so there will be a delay in writing the load data.

Thus it will be appreciated that the data written by the main core to the reorder buffer, and the timing of the writing of such data, will depend on the type of the instruction being executed at the main core.

If the instruction is an ADD instruction, it is possible to write all relevant data to the reorder buffer at once since the data is known straight away. The main core may then continue to write to the reorder buffer in respect of other instructions.

Once all fields relevant to a particular instruction have been filled in, an age counter can start counting. The age counter suitably counts the number of cycles that the instruction data is to be stored in the buffer. Preferably it is the entry into the reorder buffer of the last data item in respect of an instruction that starts the age counter running. The age counter can be used to determine the variable delay with which to begin processing of an instruction at the shadow core.

To support out-of-order writeback in the main core, the fields of the reorder buffer can be separated out. In at least some implementations, instructions are always allocated in-order in the buffer. As discussed, exception information (mcause / ccause) and/or the writeback data (load data) may be added later.

The main core is able to write back the results from load instructions out-of-order, and may be configured to write the load data into the reorder buffer and also the main core register file after the bus response has been received. The writes to the reorder buffer and/or the main core register file need not be in program order. In some implementations, a second write port on the register is provided for writing load data.

Once an instruction has been written into the reorder buffer, and all data fields completed, the instructions should remain there for (typically) 2 cycles before being issued to the shadow core. Holding the instructions for 2 cycles helps ensure that register writes happen a minimum of 3 cycles apart between the main core and shadow core. This delay can reduce the chance that the different writes are equally affected by the same injected fault. Therefore the reorder buffer logic counts the age of each completed entry, for example using the age counter, and will only present the oldest entry as valid once it has been in the buffer for long enough.

The delay in the reorder buffer can be randomised. Typically the delay will be 2 cycles, but a random range can be applied. In the present implementation delays of 2-4 cycles are possible. Longer delays cost more in terms of area needed (since the reorder buffer will need to be larger to store more entries). A good trade-off can be achieved by having a typical delay of 2 cycles and occasional delays of 1 cycle.
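The interaction between the age counter, the in-order release of entries and the randomised delay can be sketched with the following toy model. The class ReorderBuffer, its method names, the dictionary layout of an entry and the use of a seeded pseudo-random generator are all assumptions made for illustration; a hardware implementation would differ.

```python
import random
from collections import deque


class ReorderBuffer:
    """Toy model: entries leave in order, only after ageing for their delay."""

    def __init__(self, min_delay=2, max_delay=4, seed=None):
        self.entries = deque()
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.rng = random.Random(seed)

    def allocate(self, name, complete=True):
        # Entries are allocated in program order; the delay is drawn per entry.
        entry = {
            "name": name,
            "complete": complete,
            "age": 0,
            "delay": self.rng.randint(self.min_delay, self.max_delay),
        }
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        # Called when the last outstanding field (e.g. load data) is written.
        entry["complete"] = True

    def tick(self):
        # The age counter only runs once every field of the entry is filled in.
        for entry in self.entries:
            if entry["complete"]:
                entry["age"] += 1

    def pop_ready(self):
        # Only the oldest entry may be presented to the shadow core.
        if self.entries:
            head = self.entries[0]
            if head["complete"] and head["age"] >= head["delay"]:
                return self.entries.popleft()
        return None
```

In this model an ADD entry is complete on allocation and becomes available once its delay has elapsed, whereas a load entry does not start ageing until its load data has been written.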

Shadow core

Reference is now made to figure 3, showing an example of a portion of a shadow core 130. Figure 3 shows some internal features of the shadow core.

The shadow core comprises most of the execution logic of the main core (for example a branch unit 302, an arithmetic logic unit 304, a multiplexer 306, control and system registers (CSR) 308, and so on). The illustrated shadow core does not include any instruction fetch logic - all instructions come from the main core. A separate mechanism is used to ensure that the correctly associated PC and instruction are returned by the main core (see elsewhere herein).

The shadow core comprises its own PC register 310, integer registers 312 and copies of many of the main CSRs - but crucially not all CSRs to save area. The shadow core comprises enough CSRs to re-execute the instruction stream (e.g. the shadow core must know the exception vector MEPC to know which PC to jump to on exception/interrupt). Some CSRs cannot be correctly implemented in the shadow core, for example MIP (machine-mode interrupt pending). Because the shadow core executes after a variable delay from the main core it is not possible to know which interrupts were pending at the time that the main core read the MIP CSR. Therefore in this case the main core result is trusted.

In its simplest form the shadow core executes instructions at the head of the reorder buffer 132. The shadow core is configured to update its own PC and register writeback data. The shadow core is configured to compare the local PC and writeback data against the PC and writeback data from the main core.
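A minimal sketch of this re-execute-and-compare step is given below for a single ADD instruction. The entry layout (pc, rd, rs1, rs2, writeback), the 32-bit register width, the fixed 4-byte instruction length and the function name shadow_step are assumptions made for illustration only.

```python
def shadow_step(state, entry):
    """Re-execute one ADD entry and compare against the main-core record.

    Returns True if a mismatch (processing fault) is detected.
    """
    regs = state["regs"]
    # Recompute the result from the shadow core's own register file.
    local_result = (regs[entry["rs1"]] + regs[entry["rs2"]]) & 0xFFFFFFFF
    local_pc = state["pc"]
    fault = local_pc != entry["pc"] or local_result != entry["writeback"]
    # Update the shadow core's own register file and PC.
    regs[entry["rd"]] = local_result
    state["pc"] = local_pc + 4
    return fault


state = {"pc": 0x100, "regs": [0] * 32}
state["regs"][1] = 5
state["regs"][2] = 7

good = {"pc": 0x100, "rd": 3, "rs1": 1, "rs2": 2, "writeback": 12}
bad = {"pc": 0x104, "rd": 4, "rs1": 1, "rs2": 2, "writeback": 13}

first_fault = shadow_step(state, good)
second_fault = shadow_step(state, bad)
```

The first entry matches the locally computed PC and result, so no fault is raised; the second entry carries a corrupted writeback value, so the comparison fails.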

In figure 3, the dashed arrows show where the main core writeback data is used in the shadow core. In one implementation the multiplier output is too late for the shadow core to write back the result to the register file, so the main core result is written back instead. The main core result may then be checked against the shadow core result on the next cycle. This advantageously avoids introducing a one cycle delay at the shadow core. In other implementations, the shadow core may be configured to write back the multiplier result to the register file.

Different embodiments could have the shadow core execute over a single cycle, or many cycles. However the shadow core as described has no flow control - the shadow core always accepts the instruction at the head of the reorder buffer. In alternative implementations, flow control may be provided for selecting which instruction in the reorder buffer to process at the shadow core.

Load/store checker

Reference is made to the description of the load/store checker elsewhere herein.

The following provides further information.

For store instructions the fault detection system, for example the shadow core, is configured to:

- check the address reported by the main core against the locally calculated shadow core address;

- check the store data reported by the main core (for example in the ‘extra data’ field of the common entries fields in the reorder buffer) against the locally calculated store data;

- send all details of the store to the load/store checker which is configured to check that:

  o a store issued to that address with the correct data was seen on the memory bus; and

  o bus error information matches between the reported information from the main core and what was seen on the bus.

For load instructions the fault detection system, for example the shadow core, is configured to:

- send all details of the load to the load/store checker which is configured to check that:

  o a load issued to that address with the correct data was seen on the memory bus; and

  o bus error information matches between the reported information from the main core and what was seen on the bus.
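The store and load checks listed above can be sketched as follows. The BusAccess record, the fault strings and the function names are hypothetical. The load address comparison against a locally calculated address is taken from the claims rather than from the list above, and load data is compared only against the bus because the shadow core cannot recompute it locally.

```python
from typing import NamedTuple, Optional


class BusAccess(NamedTuple):
    is_store: bool
    address: int
    data: int
    bus_error: bool


def check_bus(reported: BusAccess, observed: Optional[BusAccess]) -> list:
    """Compare what the main core reported with what was seen on the bus."""
    if observed is None or observed.is_store != reported.is_store:
        return ["no matching bus transaction"]
    faults = []
    if observed.address != reported.address:
        faults.append("bus address mismatch")
    if observed.data != reported.data:
        faults.append("bus data mismatch")
    if observed.bus_error != reported.bus_error:
        faults.append("bus error mismatch")
    return faults


def check_store(reported, shadow_address, shadow_data, observed):
    """Store: compare address and data locally, then against the bus."""
    faults = []
    if reported.address != shadow_address:
        faults.append("store address mismatch")
    if reported.data != shadow_data:
        faults.append("store data mismatch")
    return faults + check_bus(reported, observed)


def check_load(reported, shadow_address, observed):
    """Load: compare the address locally, then address and data against the bus."""
    faults = []
    if reported.address != shadow_address:
        faults.append("load address mismatch")
    return faults + check_bus(reported, observed)
```

An empty list indicates that no mismatch was found; any non-empty list would correspond to a fault signal being raised.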

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A fault detection system for detecting faults in processing a set of instructions, the fault detection system comprising: a first processing module and a second processing module; the first processing module being configured to process instructions of a set of instructions, thereby obtaining a first set of data; and the second processing module being configured to: process the instructions, thereby obtaining a second set of data; the fault detection system being configured to: compare the first set of data with the second set of data, in response to determining a mismatch therebetween, determine a processing fault, and output a fault signal indicative of the determined processing fault, wherein processing the instructions by the first processing module and by the second processing module comprises one or more of the following: processing, by the second processing module, each instruction with a variable delay relative to processing of the respective instruction by the first processing module, processing, by the first processing module, the instructions in a first processing order, and processing, by the second processing module, the instructions in a second processing order which differs from the first processing order, and processing, by the first processing module, each instruction from the set of instructions by performing a first calculation, and processing, by the second processing module, each instruction from the set of instructions by performing a second calculation which differs from the first calculation.

2. A fault detection system according to claim 1, in which the first processing module is configured to save the first set of data in a memory accessible to the second processing module.

3. A fault detection system according to claim 2, further comprising memory logic associated with the memory, the memory logic being configured to maintain an age value of the first set of data and to restrict access by the second processing module to the first set of data in dependence on the age value.

4. A fault detection system according to any preceding claim, in which the instructions form an ordered instruction stream and the first processing module is configured to process the instructions out-of-order and the second processing module is configured to process the instructions in order.

5. A fault detection system according to any preceding claim, in which the first set of data comprises one or more of: the instruction for processing; one or more operands of a calculation to be carried out at the first processing module in processing the instruction; one or more results of the calculation carried out at the first processing module; whether the instruction is a load instruction or a store instruction; a load address from which load data is loaded or a store address at which store data is stored; load data or store data; exception information; interrupt information; debug information; bus error information; and a cyclic redundancy check (CRC) value.

6. A fault detection system according to any preceding claim, in which the second set of data comprises one or more of: one or more operands of a further calculation to be carried out at the second processing module in processing the instruction; one or more results of the further calculation carried out at the second processing module; whether the instruction is a load instruction or a store instruction; a further load address from which load data is loaded or a further store address at which store data is stored; further store data; further exception information; and a further CRC value.

7. A fault detection system according to claim 5 or claim 6, in which the CRC value and the further CRC value are determined in dependence on one or more of: the instruction; and the load address and the further load address, respectively; or the store address and the further store address, respectively.

8. A fault detection system according to any preceding claim, in which the second processing module is further configured to monitor a transaction over a bus coupled to the first processing module, and to store one or more of: whether the transaction relates to a load instruction or a store instruction; a load address in the transaction from which load data is loaded or a store address in the transaction at which store data is stored; load data or store data, as sent over the bus; and bus error information determined from the bus.

9. A fault detection system according to any of claims 5 to 8, in which the fault detection system is further configured: for a store instruction, to compare one or more of: the store address with the further store address, the store data with the further store data, the store address and the store data with, respectively, the store address in the monitored transaction and the store data as sent over the bus, and the bus error information provided by the first processing module with the bus error information determined from the bus; and for a load instruction, to compare one or more of: the load address with the further load address; the load address and the load data with, respectively, the load address in the monitored transaction and the load data as sent over the bus, and the bus error information provided by the first processing module with the bus error information determined from the bus.

10. A fault detection system according to any preceding claim, in which, where the instruction comprises an integer DIVIDE-REMAINDER calculation, a ÷ b, the fault detection system is configured to verify a divide and remainder result of the calculation, d and r, by performing, at the second processing module, a MULTIPLY and an ADD calculation, and checking one or other of:

(i) d * b + r = a; or

(ii) d * b = a - r.

11. A fault detection system according to any preceding claim, in which the instruction comprises a floating point calculation and the fault detection system is configured to verify a result of the calculation by checking that a result obtained by the second processing module is within a predetermined range of an expected result.

12. A fault detection system according to claim 11, in which the instruction comprises a SQUARE ROOT calculation, s = √a, the second processing module is configured to calculate m = s² and the fault detection system is configured to verify the result of the calculation by checking that m is within the predetermined range of a; or in which the instruction comprises a DIVIDE calculation, d = a ÷ b, the second processing module is configured to calculate m = b * d and the fault detection system is configured to verify the result of the calculation by checking that m is within the predetermined range of a.

13. A fault detection system according to any preceding claim, in which the second processing module is configured to process the instructions in a predetermined number of processor cycles.

14. A fault detection system according to any preceding claim, in which the first processing module is provided with or has access to a first set of registers for recording state data relating to processing the set of instructions, and the second processing module is provided with or has access to a second set of registers for recording a portion of the state data, where the second set of registers corresponds to a subset of the first set of registers.

15. A method of detecting faults in a system for processing a set of instructions, the system comprising a first processing module and a second processing module, the method comprising: at the first processing module: processing instructions of a set of instructions, thereby obtaining a first set of data; and at the second processing module: processing the instructions, thereby obtaining a second set of data; the method further comprising: comparing the first set of data with the second set of data, in response to determining a mismatch therebetween, determining a processing fault, and outputting a fault signal indicative of the determined processing fault; wherein processing the instructions at the first processing module and at the second processing module comprises one or more of the following: processing, at the second processing module, each instruction with a variable delay relative to processing of the respective instruction at the first processing module, processing, at the first processing module, the instructions in a first processing order, and processing, at the second processing module, the instructions in a second processing order which differs from the first processing order, and processing, at the first processing module, each instruction from the set of instructions by performing a first calculation, and processing, at the second processing module, each instruction from the set of instructions by performing a second calculation which differs from the first calculation.