
CN102591763A - System and method for detecting faults of integral processor on basis of determinacy replay - Google Patents

System and method for detecting faults of integral processor on basis of determinacy replay

Info

Publication number
CN102591763A
CN102591763A CN2011104606426A CN201110460642A CN102591763A CN 102591763 A CN102591763 A CN 102591763A CN 2011104606426 A CN2011104606426 A CN 2011104606426A CN 201110460642 A CN201110460642 A CN 201110460642A CN 102591763 A CN102591763 A CN 102591763A
Authority
CN
China
Prior art keywords
processor
xor
result
processor core
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104606426A
Other languages
Chinese (zh)
Other versions
CN102591763B (en)
Inventor
李磊
陈云霁
孙国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Loongson Technology Corp Ltd
Original Assignee
Loongson Technology Corp Ltd
Application filed by Loongson Technology Corp Ltd
Priority to CN201110460642.6A (patent CN102591763B)
Publication of CN102591763A
Application granted
Publication of CN102591763B
Legal status: Active

Landscapes

  • Hardware Redundancy (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a system and a method for whole-processor fault detection based on deterministic replay. The interactions between the processor cores under test in a multi-core processor are recorded during execution, the execution is deterministically replayed on a redundant group of processor cores, and the two execution results are compared to detect faults in either group of processor cores. Because a whole group of processor cores is replayed each time, faults in the processor cores and faults in the non-core components (such as the second-level cache, the network-on-chip (NoC), and the memory controller) can both be detected. This overcomes the inability of the prior art to detect faults in non-core components and ensures the reliability of the processor chip.

Description

A whole-processor fault detection system and method based on deterministic replay
Technical field
The present invention relates to the field of processor chip design and testing, and in particular to a whole-processor fault detection system and method based on deterministic replay.
Background
Growing transistor counts, ever lower chip voltages, and ever smaller process technologies make processors increasingly prone to faults. In general, a fault means that one or more logic gates produce a wrong result while the processor (or processor chip) is in use. A fault may be caused by a manufacturing defect, by the processor reaching the end of its service life prematurely, by external radiation striking the processor, and so on.
Existing scan chain techniques can detect faults in a processor according to a fault model, but only faults covered by existing fault models. With the rapid development of processors and the adoption of various low-power techniques and high-speed interfaces, more and more fault models are being proposed. Even if scan chain testing reaches 100% coverage for some fault models, faults under other fault models, or faults that fit no fixed fault model, still cannot be detected. In addition, faults that occur only after testing is complete, such as a processor reaching the end of its service life prematurely, cannot be detected either.
Because a fault makes some of the processor's logic produce wrong results, it is likely to cause a program to execute incorrectly on the processor. An incorrect execution result greatly reduces the reliability of the processor and can even crash the whole system. Detecting as many of the faults that occur in a processor as possible has therefore become an increasingly important problem for modern processors.
Existing methods include redundancy-based detection for processor cores, which detects the faults inside a processor core that testing missed. Specifically, each processor core is paired with a redundant processor core, the two cores execute the same program simultaneously, and the two execution results are compared; if the results differ, one of the two cores must have a fault. Although these methods can detect all faults inside processor cores, they cannot detect faults in non-core components such as the network-on-chip, the L2 cache, and the memory controller. In a mature processor chip, these non-core components occupy more than 50% of the chip area, so a method that can detect faults across the entire processor is urgently needed.
Summary of the invention
The object of the present invention is to provide a whole-processor fault detection system and method based on deterministic replay, which detects faults in the entire processor through deterministic replay, so that faults anywhere in the processor can be detected and the reliability of the whole processor chip is ensured.
To achieve this object, the invention provides a whole-processor fault detection system based on deterministic replay, comprising a multi-core processor under test and a redundant comparison multi-core processor;
The system further comprises a recording module; a plurality of XOR modules-1, one for each processor core in the multi-core processor under test; a replay module; a plurality of XOR modules-2, one for each processor core in the redundant comparison multi-core processor; and a plurality of comparison modules, wherein:
The recording module is configured to record the interactions between all processor cores under test while the multi-core processor under test executes a parallel program, and to transmit the recorded interaction information out;
Each XOR module-1 is configured, while the multi-core processor under test executes the parallel program, to collect the execution of every instruction on its processor core, record the result of each instruction, and fold all the results together by XOR, obtaining XOR result-1 and transmitting it out;
The replay module is configured to deterministically replay the parallel program on the redundant comparison multi-core processor according to the inter-core interaction information recorded by the recording module;
Each XOR module-2 is configured, while the redundant comparison multi-core processor deterministically replays the parallel program according to the interaction information from the replay module, to collect the execution of every instruction on its processor core, record the result of each instruction, and fold all the results together by XOR, obtaining XOR result-2 and transmitting it to the comparison module;
The comparison module is configured to read in XOR result-1 recorded by XOR module-1 and compare it with XOR result-2 from the replayed execution, and to determine from the comparison whether a fault occurred or was triggered in the multi-core processor under test and/or the redundant comparison multi-core processor during the two executions.
Preferably, in the whole-processor fault detection system based on deterministic replay, the non-core components either form, together with the group of processor cores under test, the multi-core processor under test, which carries out the complete execution of the parallel program's instructions; or form, together with the redundant group of processor cores used for comparison, the redundant comparison multi-core processor, which carries out the complete execution of the parallel program's instructions;
wherein the group of processor cores under test and the redundant group of processor cores used for comparison contain the same number of processor cores.
Preferably, the non-core components include the L2 cache, the network-on-chip, the memory controller, and the like.
Preferably, each XOR module-1 is integrated inside its corresponding processor core under test;
each XOR module-2 is integrated inside its corresponding redundant processor core in the redundant comparison multi-core processor.
Preferably, the interaction information comprises a time order and an execution order;
the recording module comprises a sampling module;
the sampling module is configured to record the time order between processor cores, obtained by instruction window sampling once per sampling period, and to record the execution order between conflicting operations that fall within two sampling periods of each other.
Preferably, the sampling period is 512 clock cycles.
The recording module comprises a plurality of CAM record modules, one for each processor core under test, each storing the 1024 memory access instructions most recently committed by its processor core. When a memory access instruction is executed, the CAM record modules of the other processor cores are checked for a memory access instruction that conflicts with it; if one is found, the execution order between the processor cores is recorded; otherwise nothing is recorded.
Preferably, the replay module comprises a plurality of fetch-pause registers, one for each processor core of the redundant comparison multi-core processor, which monitor the number of memory access instructions each processor core has executed so far, so that the parallel program is deterministically replayed according to the time order and the execution order.
Preferably, XOR module-1 contains a 32-bit result register-1 with an initial value of 0; each time an instruction is executed, the result of that instruction is XORed into result register-1;
every 1024 executed instructions, XOR module-1 exports the value of result register-1 out of the multi-core processor under test and reinitializes result register-1;
XOR module-2 contains a 32-bit result register-2 with an initial value of 0; each time an instruction is executed, the result of that instruction is XORed into result register-2;
every 1024 executed instructions, XOR module-2 exports the value of result register-2 to the comparison module and reinitializes result register-2.
To achieve the object of the invention, a whole-processor fault detection method based on deterministic replay is also provided, comprising the following steps:
Step S100: execute a parallel program on the multi-core processor under test, recording the interaction information between the processor cores of the multi-core processor under test and recording XOR result-1 of the executed instructions;
Step S200: execute the parallel program on the redundant comparison multi-core processor, deterministically replaying its execution according to the inter-core interaction information recorded in step S100, while recording XOR result-2 of the execution;
Step S300: compare XOR result-1 and XOR result-2 of the two executions to detect whether a fault occurred or was triggered.
Preferably, in step S300, the detection comprises the following:
if the comparison results differ, it is determined that one of the two groups of processor cores had a fault during the current execution of the parallel program; otherwise, it is determined that neither group of processor cores had a fault during the current execution.
Beneficial effects of the invention: the whole-processor fault detection system and method based on deterministic replay records the interactions between the processor cores under test during execution, deterministically replays the execution on a redundant group of processor cores, and compares the results of the two executions to detect whether either group of processor cores has a fault. Because a whole group of processor cores is replayed each time, both faults in the processor cores and faults in the non-core components (such as the L2 cache, the network-on-chip, and the memory controller) can be detected. This effectively remedies the inability of the prior art to detect faults in non-core components and ensures the reliability of the processor chip.
Description of drawings
Fig. 1 is a schematic structural diagram of the whole-processor fault detection system based on deterministic replay according to the present invention.
Detailed description of the embodiments
To make the object, technical solution, and advantages of the present invention clearer, the implementation of the whole-processor fault detection system and method based on deterministic replay is described in further detail below with reference to the accompanying drawing. It should be understood that the specific embodiments described herein are intended only to explain the invention, not to limit it.
The system and method are described below by way of a specific embodiment.
As shown in Fig. 1, the whole-processor fault detection system based on deterministic replay of this embodiment comprises a group of processor cores under test and a redundant group of processor cores used for comparison, where the two groups contain the same number of processor cores.
The system of this embodiment also comprises non-core components;
the non-core components and the group of processor cores under test form the multi-core processor under test, which carries out the complete execution of the parallel program's instructions; alternatively, the non-core components and the redundant group of processor cores form the redundant comparison multi-core processor, which carries out the complete execution of the parallel program's instructions.
Preferably, as one possible implementation, the non-core components include, but are not limited to, the L2 cache, the network-on-chip, and the memory controller.
The system of this embodiment further comprises a recording module and a plurality of XOR modules-1, one for each processor core in the multi-core processor under test, wherein:
The recording module is configured, while the multi-core processor under test formed by the group of processor cores under test executes a parallel program, to record the interactions between all processor cores under test, and to transmit the recorded interaction information out through the external interface (not shown) of the multi-core processor under test, where it is recorded as an interaction log;
Each XOR module-1 is configured, while the multi-core processor under test executes the parallel program, to collect the execution of every instruction on its processor core, record the result of each instruction, and fold all the results together by XOR to obtain XOR result-1, which is periodically transmitted out through the external interface (not shown) of the multi-core processor under test and recorded as an XOR log.
Preferably, as one possible implementation, each XOR module-1 is integrated inside its corresponding processor core under test.
The system further comprises a replay module, a plurality of XOR modules-2, one for each processor core in the redundant comparison multi-core processor, and a plurality of comparison modules, wherein:
The replay module is configured to deterministically replay the parallel program on the redundant comparison multi-core processor according to the inter-core interaction information recorded by the recording module.
Each XOR module-2 is configured, while the redundant comparison multi-core processor deterministically replays the parallel program according to the interaction information from the replay module, to collect the execution of every instruction on its processor core, record the result of each instruction, and fold all the results together by XOR to obtain XOR result-2, which is periodically transmitted to the comparison module through an internal interface (not shown) of the processor chip.
Preferably, as one possible implementation, each XOR module-2 is integrated inside its corresponding redundant processor core in the redundant comparison multi-core processor.
The comparison module is configured to read in XOR result-1 recorded by XOR module-1 and compare it with XOR result-2 from the replayed execution, and to determine from the comparison whether a fault occurred or was triggered in the multi-core processor under test and/or the redundant comparison multi-core processor during the two executions. That is, if XOR result-1 differs from XOR result-2, it is determined that one or more faults occurred or were triggered; if the two are identical, it is determined that no fault occurred or was triggered. Faults in the multi-core processor under test can thus be detected.
In this embodiment, the multi-core processor under test consists of a group of processor cores under test together with non-core components such as the L2 cache, the network-on-chip, and the memory controller. While a parallel program executes on this multi-core processor, the system detects whether a fault occurs or is triggered inside it.
The multi-core processor under test comprises a group of several processor cores to be checked for faults. When a parallel program executes, these processor cores exchange information with one another through non-core components such as the L2 cache, the network-on-chip, and the memory controller (as shown in Fig. 1).
The system of this embodiment also comprises a redundant group of processor cores used for comparison, with the same number of cores as the multi-core processor under test. Together with the non-core components such as the L2 cache, the network-on-chip, and the memory controller, it forms the redundant comparison multi-core processor. Like the group of processor cores under test, the redundant comparison multi-core processor uses the same L2 cache, network-on-chip, memory controller, and other non-core components for the interactions between its processor cores (as shown in Fig. 1). The same parallel program is deterministically replayed on the redundant comparison processor, and XOR result-2 of that execution is compared with XOR result-1 of the execution on the multi-core processor under test. If they differ, it is determined that one of the two groups of processor cores produced or triggered one or more faults while executing the parallel program.
For a parallel program, different executions may produce different results, so this fault detection method requires deterministic replay; that is, deterministic replay guarantees that repeated executions of a parallel program produce the same execution result.
Preferably, as one possible implementation, in this embodiment the interaction information comprises a time order and an execution order;
the recording module comprises a sampling module;
the sampling module is configured to record the time order between processor cores, obtained by instruction window sampling once per sampling period, and to record the execution order between conflicting operations that fall within two sampling periods of each other.
As one possible implementation, the inter-core interaction that the recording module records for the group of processor cores under test is the order in which different processors access the same memory address. For example, suppose two memory access operations u and v execute on different processor cores. If they access the same address and at least one of them is a write, this embodiment calls u and v two conflicting memory access instructions. The order in which they access that address, that is, the order of u and v at execution time, must be recorded as interaction information.
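The conflict definition above can be written as a small predicate. This is a minimal Python sketch for illustration only; the `MemoryAccess` type and its field names are assumptions and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryAccess:
    core: int        # index of the issuing processor core (hypothetical field)
    address: int     # memory address touched by the access
    is_write: bool   # True for a store, False for a load


def conflicts(u: MemoryAccess, v: MemoryAccess) -> bool:
    """Return True when u and v run on different cores, touch the same
    address, and at least one of them is a write."""
    return (
        u.core != v.core
        and u.address == v.address
        and (u.is_write or v.is_write)
    )
```

Two reads of the same address do not conflict under this definition, so their relative order does not need to be logged.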
In this embodiment, as one possible implementation, the XOR results are periodically transmitted out of the group of processor cores under test through the existing external interface (not shown) of the processor chip, and the recorded results and order relations are periodically fed into the redundant comparison processor group.
Preferably, as one possible implementation, the sampling period is 512 clock cycles.
Preferably, as one possible implementation, the recording module also comprises a plurality of CAM (content-addressable memory) record modules, one in each processor core under test, each storing the 1024 memory access instructions most recently committed by its processor core. When a memory access instruction is executed, the CAM record modules of the other processor cores are checked for a conflicting memory access instruction; if one is found, the execution order between the processor cores is recorded; otherwise nothing is recorded.
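The CAM lookup described above can be modeled in software as a bounded history per core. The sketch below is only an illustration under stated assumptions: `AccessHistory`, `record_access`, and the log format are hypothetical names, the hardware CAM is replaced by a linear scan over a `deque`, and only the most recent conflicting entry per core is logged, which is a simplification.

```python
from collections import deque


class AccessHistory:
    """Software stand-in for one core's CAM record module."""

    def __init__(self, capacity=1024):
        # Each entry is (sequence number, address, is_write); the oldest
        # entry is dropped automatically once capacity is reached.
        self.entries = deque(maxlen=capacity)
        self.committed = 0

    def commit(self, address, is_write):
        self.committed += 1
        self.entries.append((self.committed, address, is_write))

    def latest_conflict(self, address, is_write):
        """Sequence number of the newest stored access that conflicts with
        the given access, or None if there is none."""
        for seq, addr, was_write in reversed(self.entries):
            if addr == address and (was_write or is_write):
                return seq
        return None


def record_access(histories, core, address, is_write, order_log):
    """Check the other cores' histories, log any execution-order pair as
    ((earlier core, earlier seq), (this core, this seq)), then commit."""
    seq = histories[core].committed + 1
    for other, history in enumerate(histories):
        if other == core:
            continue
        earlier = history.latest_conflict(address, is_write)
        if earlier is not None:
            order_log.append(((other, earlier), (core, seq)))
    histories[core].commit(address, is_write)
```

For example, if core 0 writes address 0x40 and core 1 then reads it, the log gains one pair stating that core 0's first access precedes core 1's first access; accesses with no conflicting entry in any other history leave the log unchanged.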
Through PC sampling (instruction window sampling), the sampling module divides each processor core's memory access operations into blocks of 512 cycles and records the time order between processor cores every 512 cycles.
As one possible implementation, PC sampling every 512 cycles yields all the instructions each processor core committed between two consecutive samples.
By recording, at each PC sample, the instructions each processor core has committed, this embodiment obtains an order relation for the conflicting operations that arise while the parallel program executes on the multi-core processor under test. For example, any memory access instruction u committed between the 3rd and 4th samples precedes any memory access instruction v committed between the 5th and 6th samples, because at the 4th sample u has already been committed and v has not yet started.
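The sampling-based time order in the example above can be sketched as follows. The sketch assumes that each sample stores a cumulative per-core count of committed instructions; that representation and the function names are assumptions made for illustration.

```python
from bisect import bisect_left


def sampling_interval(commit_counts, n):
    """commit_counts[k] is the cumulative number of instructions a core had
    committed at sample k. Return i such that the core's n-th instruction
    (1-based) was committed between sample i and sample i + 1."""
    return bisect_left(commit_counts, n) - 1


def time_ordered(interval_u, interval_v):
    """u is known to precede v from sampling alone when u was committed
    before a sample at which v had not yet started, that is, when the two
    intervals are at least two apart."""
    return interval_v - interval_u >= 2
```

With the example from the text, `time_ordered(3, 5)` holds, while two instructions in adjacent intervals (3 and 4) are not ordered by sampling alone and must rely on the execution order instead.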
The order relation recorded through sampling (called the time order in this embodiment) covers the order of all conflicting operations whose commit times are more than two sampling periods apart.
As one possible implementation, besides the time order, this embodiment also records the order of conflicting operations whose commit times are within two sampling periods of each other (called the execution order in this embodiment).
Because the 1024 most recent memory access instructions include all instructions within the two most recent sampling periods, the order of all conflicting operations whose commit times are within two sampling periods of each other is also recorded.
In this way, the recording module records the order relations between all conflicting memory access instructions (both the time order and the execution order). During deterministic replay, as long as these order relations are preserved, the two executions should produce the same result in the absence of faults.
Preferably, as one possible implementation in this embodiment, the replay module comprises a plurality of fetch-pause registers, one for each processor core of the redundant comparison multi-core processor, which monitor the number of memory access instructions each processor core has executed so far, so that the parallel program is deterministically replayed according to the order relations (both the time order and the execution order).
As one possible implementation, the recording module of this embodiment records two kinds of information: the time order and the execution order. For the time order, consider an instruction u whose commit time in the execution on the multi-core processor under test fell between the i-th and (i+1)-th samples. When u is about to be fetched, the replay checks whether all instructions that were committed before the (i-1)-th sample in the execution on the multi-core processor under test have been committed. If not, fetching of u is paused until all of those instructions have been committed; otherwise fetching proceeds. When the time order is recorded, this embodiment records the number of instructions committed at each sample; by monitoring the number of memory access instructions each processor core has executed so far, the replay can reproduce the time order recorded during the first execution.
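The fetch-pause rule for the time order can be expressed as a single check. This is a hedged sketch, assuming the recorded log stores, for every sample k, the cumulative number of instructions each core had committed (`recorded_counts[k][core]`); the function name and data layout are hypothetical.

```python
def may_fetch_by_time_order(interval, replay_committed, recorded_counts):
    """Decide whether an instruction recorded between sample `interval` and
    sample `interval + 1` may be fetched during replay.

    replay_committed[c] is the number of instructions core c has committed
    so far in the replay; recorded_counts[k][c] is the number core c had
    committed at sample k in the recorded execution."""
    barrier = interval - 1
    if barrier < 0:
        return True  # nothing was committed before the first sample
    required = recorded_counts[barrier]
    return all(done >= need for done, need in zip(replay_committed, required))
```

A core whose next instruction fails this check would hold its fetch stage and re-evaluate the check as the other cores make progress.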
For the execution order, if a pair of conflicting memory accesses was recorded with instruction u executing before instruction v, then in the replayed execution the execution of v (that is, its fetch) must wait until u has finished (that is, committed). For example, suppose the 1000th memory access instruction of processor core A precedes the 1300th memory access instruction of processor core B. During replay, when processor core B is about to fetch its 1300th memory access instruction, it checks whether processor core A has completed its 1000th memory access instruction; if not, processor core B pauses fetching until processor core A has completed its 1000th memory access instruction.
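The execution-order check in the example above can be sketched as a lookup in a dependency table. The table layout and the name `may_fetch_by_execution_order` are assumptions for illustration; the patent describes the behavior, not a data structure.

```python
def may_fetch_by_execution_order(core, access_index, completed, dependencies):
    """Return True if `core` may fetch its `access_index`-th memory access.

    dependencies maps (core, access_index) to a list of
    (other_core, other_index) pairs that were recorded as executing first;
    completed[c] is the number of memory accesses core c has completed so
    far in the replay."""
    for other_core, other_index in dependencies.get((core, access_index), []):
        if completed[other_core] < other_index:
            return False  # the earlier conflicting access has not finished
    return True
```

Using core A = 0 and core B = 1, the recorded pair from the text becomes `{(1, 1300): [(0, 1000)]}`: core B is held at its 1300th access while core A has completed fewer than 1000 accesses.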
Once all the recorded time-order and execution-order relations have been replayed, the replayed execution produces the same result as the execution on the multi-core processor under test (in the absence of faults).
As one possible implementation, executing an instruction modifies one 32-bit register or one 32-bit memory location. In this embodiment, XOR module-1 contains a 32-bit result register-1, preferably with an initial value of 0. Each time an instruction is executed, its result (that is, the value of the modified 32-bit register or 32-bit memory location) is XORed into result register-1. Every 1024 executed instructions, XOR module-1 exports the value of result register-1 out of the multi-core processor under test and reinitializes result register-1. In other words, the results of every 1024 instructions are stored as 32 bits of data through lossy compression.
Similarly, as one possible implementation, XOR module-2, like XOR module-1, contains a 32-bit result register-2, preferably with an initial value of 0. Each time an instruction is executed, its result (that is, the value of the modified 32-bit register or 32-bit memory location) is XORed into result register-2. Every 1024 executed instructions, XOR module-2 exports the value of result register-2 to the comparison module and reinitializes result register-2.
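The result register behavior described above amounts to an XOR fold over fixed windows of instruction results. The following Python sketch models it in software; the class name `XorSignature` and the `exported` list standing in for the export interface are assumptions.

```python
class XorSignature:
    """Software model of one XOR module: a 32-bit register that folds in
    every instruction result and is exported once per window."""

    MASK = 0xFFFFFFFF  # keep the register at 32 bits

    def __init__(self, window=1024):
        self.window = window
        self.register = 0      # initial value 0, as in the text
        self.count = 0
        self.exported = []     # stands in for the export interface

    def retire(self, result):
        """Fold one instruction result into the register; export and
        reinitialize after every `window` instructions."""
        self.register ^= result & self.MASK
        self.count += 1
        if self.count == self.window:
            self.exported.append(self.register)
            self.register = 0
            self.count = 0
```

Because the fold is lossy, some corruptions cancel out: if the same bit is flipped in two results within one window, the exported value is unchanged. The sketch therefore illustrates the mechanism only and does not show that every fault is caught.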
As one possible implementation, during the replayed execution the comparison module collects the value of result register-2 from XOR module-2 every 1024 executed instructions, collects the corresponding value of result register-1 transmitted in from outside (from the execution on the multi-core processor under test), and compares the two 32-bit values. If they are equal, it is determined that no fault was produced or triggered; if they are not equal, one of the two groups of processor cores must have produced or triggered a fault.
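The per-window comparison can be sketched as a walk over the two streams of exported 32-bit values. The function name and the use of plain lists are assumptions; the sketch also assumes both executions exported the same number of windows.

```python
def first_mismatch(signatures_1, signatures_2):
    """Compare the exported values of result register-1 and result
    register-2 window by window. Return the index of the first window whose
    values differ, or None if every compared window matches."""
    for window, (value_1, value_2) in enumerate(zip(signatures_1, signatures_2)):
        if value_1 != value_2:
            return window
    return None
```

A non-None result means a fault was produced or triggered in one of the two core groups somewhere within that window of 1024 instructions; the comparison alone does not say which group.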
Preferably, in the embodiment of the invention,,, can highly effective realization detect relatively only with the value that compares one 32 result register through 1024 instructions of every execution.
Correspondingly, the embodiment of the invention also provides a kind of processor total failure detection method of resetting based on determinacy, may further comprise the steps:
Step S100, concurrent program is carried out in detecting polycaryon processor, notes the mutual transmission information between each processor core in the said detection polycaryon processor, and notes the XOR result-1 that instruction is carried out;
Step S200, said concurrent program is carried out in than polycaryon processor in redundancy ratio, according to the mutual transmission information between the processor core of step S100 record, carries out the execution that determinacy is reset said concurrent program, notes the XOR result-2 of execution simultaneously;
Step S300 will compare with XOR result-2 to twice execution XOR result-1, detects whether fault generating or triggering are arranged.
Preferably, if comparison result is different, then judge wherein in the execution of current concurrent program, to have fault in one group of processor core; Otherwise, judge that two groups of processor cores all do not have fault in current the execution.
The procedure of this comprehensive processor fault detection method based on deterministic replay is the same as the working process of the comprehensive processor fault detection system based on deterministic replay of the embodiments of the invention, and is therefore not described again in detail here.
The comprehensive processor fault detection system and method based on deterministic replay of the invention record the information interaction among the processor cores during the first execution in the processor under detection, deterministically replay that original execution in a redundant processor core group, and compare the results of the two executions to detect whether either processor core group has a fault. Because a whole group of processor cores is replayed each time, both faults in the processor cores and faults in non-core components (such as the L2 cache, the network-on-chip, and the memory controller) can be detected. This effectively remedies the inability of the prior art to detect faults in non-core components and ensures the reliability of the multi-core processor.
Finally, it should be noted that those skilled in the art can obviously make various changes and modifications to the invention without departing from its spirit and scope. Thus, if such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (10)

1. A comprehensive processor fault detection system based on deterministic replay, comprising a detected multi-core processor and a redundant comparison multi-core processor, characterized in that
it further comprises a recording module; a plurality of XOR modules-1 corresponding to the respective processor cores in the detected multi-core processor; a replay module; a plurality of XOR modules-2 corresponding to the respective processor cores in the redundant comparison multi-core processor; and a plurality of comparison modules, wherein:
the recording module is configured to record the information interaction among all processor cores to be detected while the detected multi-core processor executes a parallel program, and to transmit the recorded interaction information out;
each XOR module-1 is configured, while the detected multi-core processor executes the parallel program, to collect the execution of each instruction by each processor core in the detected multi-core processor, record the result of each executed instruction, process all results by XOR, and transmit out the resulting XOR result-1;
the replay module is configured to deterministically replay the execution of the parallel program in the redundant comparison multi-core processor according to the inter-core interaction information recorded by the recording module;
each XOR module-2 is configured, while the redundant comparison multi-core processor deterministically replays the parallel program according to the interaction information from the replay module, to collect the execution of each instruction by each processor core in the redundant comparison multi-core processor, record the result of each executed instruction, process all results by XOR, and transmit the resulting XOR result-2 to the comparison module;
each comparison module is configured to read in XOR result-1 recorded by XOR module-1, compare XOR result-1 with XOR result-2 of the replay execution, and determine from the comparison whether a fault occurred or was triggered in the detected multi-core processor and/or the redundant comparison multi-core processor during the two executions.
2. The comprehensive processor fault detection system based on deterministic replay according to claim 1, characterized in that a non-core device and the group of processor cores to be detected form the detected multi-core processor, which completes the full execution of the instructions of the parallel program; or the non-core device and the group of redundant processor cores used for comparison form the redundant comparison multi-core processor, which completes the full execution of the instructions of the parallel program;
wherein the group of processor cores to be detected and the group of redundant processor cores used for comparison contain the same number of processor cores.
3. The comprehensive processor fault detection system based on deterministic replay according to claim 2, characterized in that the non-core device comprises an L2 cache, a network-on-chip, and a memory controller.
4. The comprehensive processor fault detection system based on deterministic replay according to any one of claims 1 to 3, characterized in that:
each XOR module-1 is integrated inside the corresponding processor core to be detected;
each XOR module-2 is integrated inside the corresponding redundant processor core used for comparison in the redundant comparison multi-core processor.
5. The comprehensive processor fault detection system based on deterministic replay according to any one of claims 1 to 3, characterized in that the interaction information comprises a time-order relation and an execution-order relation;
the recording module comprises a sampling module;
the sampling module is configured to record the time-order relation among processor cores obtained by sampling the executed instruction windows according to a sampling period, and to record the execution-order relation between conflicting operations that occur within twice the sampling period.
6. The comprehensive processor fault detection system based on deterministic replay according to claim 5, characterized in that the sampling period is 512 clock cycles;
the recording module comprises a plurality of CAM storage record modules, one corresponding to each processor core to be detected, each configured to store the 1024 memory-access instructions most recently committed by the corresponding processor core; when a memory-access instruction is executed, the CAM storage record modules of the other processor cores are checked to determine whether they hold a memory-access instruction that conflicts with it; if so, the execution-order relation between the processor cores is recorded; otherwise, nothing is recorded.
7. The comprehensive processor fault detection system based on deterministic replay according to claim 6, characterized in that the replay module comprises a plurality of fetch-stall registers corresponding to the respective processor cores of the redundant comparison multi-core processor, configured to monitor the number of memory-access instructions each processor core has currently executed, and to complete the deterministic replay of the parallel program according to the time-order relation and the execution-order relation.
8. The comprehensive processor fault detection system based on deterministic replay according to any one of claims 1 to 3, characterized in that XOR module-1 is provided with a 32-bit result register-1 whose initial value is 0; each time an instruction is executed, the result of that instruction is XORed into the existing value of result register-1;
every 1024 instructions, XOR module-1 outputs the value of result register-1 to the outside of the detected multi-core processor and reinitializes result register-1;
XOR module-2 is provided with a 32-bit result register-2 whose initial value is 0; each time an instruction is executed, the result of that instruction is XORed into the existing value of result register-2;
every 1024 instructions, XOR module-2 outputs the value of result register-2 to the comparison module and reinitializes result register-2.
9. A comprehensive processor fault detection method based on deterministic replay, characterized by comprising the following steps:
Step S100: a parallel program is executed in the detected multi-core processor; the interaction information exchanged among the processor cores of the detected multi-core processor is recorded, together with XOR result-1 of the executed instructions;
Step S200: the parallel program is executed in the redundant comparison multi-core processor; according to the inter-core interaction information recorded in step S100, the execution of the parallel program is deterministically replayed, and XOR result-2 of this execution is recorded at the same time;
Step S300: XOR result-1 and XOR result-2 of the two executions are compared to detect whether a fault has been produced or triggered.
10. The comprehensive processor fault detection method based on deterministic replay according to claim 9, characterized in that, in step S300, the detection comprises the following step:
if the two results differ, it is determined that a fault exists in one of the two groups of processor cores during the current execution of the parallel program; otherwise, it is determined that neither group of processor cores had a fault during the current execution.
CN201110460642.6A 2011-12-31 2011-12-31 System and method for detecting faults of integral processor on basis of determinacy replay Active CN102591763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110460642.6A CN102591763B (en) 2011-12-31 2011-12-31 System and method for detecting faults of integral processor on basis of determinacy replay

Publications (2)

Publication Number Publication Date
CN102591763A true CN102591763A (en) 2012-07-18
CN102591763B CN102591763B (en) 2015-03-04

Family

ID=46480459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110460642.6A Active CN102591763B (en) 2011-12-31 2011-12-31 System and method for detecting faults of integral processor on basis of determinacy replay

Country Status (1)

Country Link
CN (1) CN102591763B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0283193A2 (en) * 1987-03-18 1988-09-21 AT&T Corp. Method of spare capacity use for fault detection in a multiprocessor system
CN1159630A (en) * 1995-07-13 1997-09-17 富士通株式会社 information processing system
CN1729456A (en) * 2002-12-19 2006-02-01 英特尔公司 On-die mechanism for high-reliability processor
CN101689233A (en) * 2007-07-05 2010-03-31 Nxp股份有限公司 Standby operation of a resonant power converter

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970512A (en) * 2014-05-21 2014-08-06 龙芯中科技术有限公司 Multi-core processor and parallel replay method thereof
CN103970512B (en) * 2014-05-21 2016-09-14 龙芯中科技术有限公司 Polycaryon processor and parallel playback method thereof
CN109716302A (en) * 2016-08-17 2019-05-03 西门子移动有限公司 Method and apparatus for redundant data processing
US11334451B2 (en) 2016-08-17 2022-05-17 Siemens Mobility GmbH Method and apparatus for redundant data processing in which there is no checking for determining whether respective transformations are linked to a correct processor core
CN109716302B (en) * 2016-08-17 2022-10-18 西门子交通有限公司 Method and apparatus for redundant data processing
CN106681911A (en) * 2016-12-08 2017-05-17 浙江大学 Method for achieving deterministic replay function which supports fault injection
CN106681911B (en) * 2016-12-08 2019-05-14 浙江大学 A kind of implementation method of certainty playback that supporting direct fault location
CN109582512A (en) * 2017-09-28 2019-04-05 通用汽车环球科技运作有限责任公司 For testing the method and system of the component of parallel computation unit
CN109582512B (en) * 2017-09-28 2022-06-21 通用汽车环球科技运作有限责任公司 Method and system for testing components of a parallel computing device
CN111311476A (en) * 2018-12-11 2020-06-19 快图有限公司 Multi-processor neural network processing equipment
CN111381147A (en) * 2018-12-29 2020-07-07 北京灵汐科技有限公司 Many-core chip testing method, many-core chip testing device and many-core chip testing equipment
CN111381147B (en) * 2018-12-29 2022-03-01 北京灵汐科技有限公司 Many-core chip testing method, many-core chip testing device and many-core chip testing equipment

Also Published As

Publication number Publication date
CN102591763B (en) 2015-03-04

Similar Documents

Publication Publication Date Title
Khan et al. Detecting and mitigating data-dependent DRAM failures by exploiting current memory content
CN107357666B (en) Multi-core parallel system processing method based on hardware protection
US10042700B2 (en) Integral post package repair
US9952963B2 (en) System on chip and corresponding monitoring method
CN111611120B (en) On-chip multi-core processor Cache consistency protocol verification method, system and medium
CN102591763A (en) System and method for detecting faults of integral processor on basis of determinacy replay
US20110307741A1 (en) Non-intrusive debugging framework for parallel software based on super multi-core framework
CN103488563A (en) Data race detection method and device for parallel programs and multi-core processing system
CN103593271A (en) Method and device for chip tracking debugging of system on chip
CN103440196A (en) Resource problem detection method for novel operation system
CN100446129C (en) Method and system for memory fault testing
CN102053886A (en) Memory detection method under non-uniform memory access environment
Lin et al. Quick error detection tests with fast runtimes for effective post-silicon validation and debug
US7206979B1 (en) Method and apparatus for at-speed diagnostics of embedded memories
JP2008517370A (en) Method for monitoring cache coherence of a data processing system and a processing unit
Li et al. Concurrent autonomous self-test for uncore components in system-on-chips
CN116048887A (en) Chip verification method, device, system, electronic equipment and storage medium
DeOrio et al. Post-silicon verification for cache coherence
US7954012B2 (en) Hierarchical debug information collection
CN108735267A (en) Data processing
CN112380127B (en) Test case regression method, device, equipment and storage medium
Chung et al. A built-in repair analyzer with optimal repair rate for word-oriented memories
US9251023B2 (en) Implementing automated memory address recording in constrained random test generation for verification of processor hardware designs
Carretero et al. Hardware/software-based diagnosis of load-store queues using expandable activity logs
CN111858361A (en) An Atomicity Violation Defect Detection Method Based on Prediction and Parallel Verification Strategy

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP03 Change of name, title or address

Address after: 100095 Building 2, Longxin Industrial Park, Zhongguancun environmental protection technology demonstration park, Haidian District, Beijing

Patentee after: Loongson Zhongke Technology Co.,Ltd.

Address before: 100190 No. 10 South Road, Zhongguancun Academy of Sciences, Haidian District, Beijing

Patentee before: LOONGSON TECHNOLOGY Corp.,Ltd.