TR-LPH-2011-002
April 2011
This research was, in part, funded by the U.S. Government. The views and conclusions contained in this
document are those of the authors and should not be interpreted as representing the official policies, either
expressed or implied, of the U.S. Government.
Survey of Error and Fault Detection Mechanisms
Abstract
This report describes diverse error detection mechanisms that can be utilized within a
resilient system to protect applications against various types of errors and faults, both hard
and soft. These detection mechanisms have different overhead costs in terms of energy,
performance, and area, and also differ in their error coverage, complexity, and programmer
effort.
In order to achieve the highest efficiency in designing and running a resilient computer
system, one must understand the trade-offs among the aforementioned metrics for each
detection mechanism and choose the most efficient option for a given running environment.
To accomplish such a goal, we first enumerate many error detection techniques previously
suggested in the literature.
1 Introduction
Error detection mechanisms form the basis of an error resilient system as any fault during
operation needs to be detected first before the system can take a corrective action to tolerate
it. Myriad error detection techniques have been proposed in the literature, each offering a
different trade-off in terms of energy, performance, area, coverage, complexity, and
programmer effort; however, no single technique is optimal for all parts of a complex
computer system, all conditions of a large variety of applications, or all operating scenarios. Thus,
adaptability and tunability become crucial aspects of an error-resilient system with high efficiency.
In that respect, we must fully understand each error detection technique, in the context of a
specific system, to choose the best option for a given operating scenario and application.
Detection mechanisms proposed thus far can be classified in three different ways as shown in
Table 1: based on type of redundancy, placement in the system hierarchy, or detection coverage.
Type of redundancy can be space-redundant, where hardware is replicated, or time-redundant,
where software code is replicated. Not all techniques utilize redundancy, however, so
redundancy type alone does not cover all available error detection mechanisms. All techniques,
whether redundant or not, are covered by a categorization based on placement in the system
hierarchy or detection coverage. Placement of
detection mechanisms can be at the circuit, architecture, software system, or application levels or
involve a combination of these levels in a hybrid approach. Finally, these detection techniques
cover hard, soft or both types of errors.
In short, this report lists the detection techniques that can be applied to the Echelon system
and provides a qualitative trade-off analysis, which will help achieve tunable and adaptable
resiliency within the Echelon system.

Table 1: Classification of error detection mechanisms

Criterion            Categories
Redundancy type      Space-redundant; Time-redundant
System hierarchy     Circuit-level; Architecture-level; Software system; Application-level; Hybrid
Detection coverage   Hard errors; Intermittent errors; Transient errors

The rest of the paper is organized as follows: Section 2
explains the failure mechanisms we assume for the errors. Section 3, Section 4, and Section 5
explain and compare various error detection techniques for memory, compute, and system,
respectively. Then concluding remarks will be given in Section 6. Note that this report includes
tables summarizing and comparing the different techniques. These tables contain overhead
numbers as reported in the research papers describing the mechanisms. The overhead numbers
are not in the context of Echelon. We will evaluate the mechanisms for Echelon in the future.
2 Failure Mechanisms
In this report, we describe existing error detection techniques for both hard and soft errors. The
failure mechanisms for hard errors are permanent stuck-at faults that occur in the field, undetected
manufacturing or design flaws, or degradation-dependent faults that initially look like transient
errors but become permanent under further degradation. This type of error causes permanent
removal of a component and may trigger reconfiguration of the system. Note that we do not
cover design errors that can be detected by traditional testing methods such as boundary scan
chain or built-in self-test (BIST). We also exclude timing errors that can be detected by techniques
like Razor [1] from our discussion.
The failure mechanisms for soft errors can be classified into two types. First, energetic particle
strikes cause hole-electron pairs to be generated, effectively injecting a momentary (< 1ns) pulse
of current into a circuit node. This results in a single event upset (SEU), which we refer to as
a transient error. This type of failure mechanism is also applicable in the case of supply noise
briefly affecting a circuit’s voltage level. Second, variations introduced during manufacture
and runtime can cause temporary timing violations along the critical paths of the logic. They are
referred to as intermittent errors and they are becoming more serious as we push the margin
with techniques like dynamic voltage and frequency scaling (DVFS) to achieve higher efficiency.
While intermittent errors are actually the result of hard faults, they are often treated as soft errors
because of the difficulty of systematically reproducing the conditions that trigger an error and
Figure 1: The memory hierarchy (register file, load/store queue, core-to-L1 bus, L1 cache, L1-L2
MSHR, L1-to-L2 bus, L2 cache, memory controller, and read/write queue) with uniform ECC; gray
denotes storage and interconnections dedicated to redundant information. Note that ECC is applied
at a finer granularity in a register file and an L1 cache but that L2 and main memory have ECC
per data line. Though not shown, lower storage levels such as Flash memory based disks or disk
caches and hard-disk drives also have uniform ECC, but at a coarser granularity; e.g., 4kB data
blocks in NAND Flash memory.
Table 2: ECC storage array overheads [11].

Data bits   SEC-DED check bits (overhead)   SNC-DND check bits (overhead)   DEC-TED check bits (overhead)
16          6 (38%)                         12 (75%)                        11 (69%)
32          7 (22%)                         12 (38%)                        13 (41%)
64          8 (13%)                         14 (22%)                        15 (23%)
128         9 (7%)                          16 (13%)                        17 (13%)
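The SEC-DED column of Table 2 follows directly from the Hamming bound: single-error correction
over d data bits needs the smallest r with 2^r >= d + r + 1, and one extra overall parity bit
upgrades SEC to SEC-DED. A short sketch reproducing those check-bit counts:

```python
def sec_ded_check_bits(data_bits):
    """Check bits for a SEC-DED (extended Hamming) code over a word of data_bits."""
    r = 1
    while 2 ** r < data_bits + r + 1:  # Hamming bound for single-error correction
        r += 1
    return r + 1  # one extra overall parity bit upgrades SEC to SEC-DED

for d in (16, 32, 64, 128):
    c = sec_ded_check_bits(d)
    print(f"{d} data bits: {c} check bits, {100 * c / d:.1f}% overhead")
```

The computed overheads (37.5%, 21.9%, 12.5%, 7.0%) match the rounded SEC-DED figures in
Table 2, and the sketch makes the trend visible: check-bit overhead shrinks as the protected
word grows, which is why wider ECC words are cheaper per bit.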
caches to provide higher error correction capabilities. The AMD Athlon [15] and Opteron [16]
processors, as well as the DEC Alpha 21264 [17], interleave eight 8-bit SEC-DED codes for every
64-byte cache line to tolerate more errors per line at a cost of 12.5% additional overhead.
Recent research on low-power caches uses strong multi-bit error correction capabilities to
tolerate failures due to reduced margin. This includes low-VCC caches as well as reduced-refresh-
rate embedded DRAM caches. Word disabling and bit fix [18] trade off cache capacity for reliability
in low-VCC operation. These techniques result in 50% and 25% capacity reductions, respectively.
Multi-bit Segmented ECC (MS-ECC) [19] uses Orthogonal Latin Square Codes (OLSC) [20] that
can tolerate both faulty bits in low-VCC and soft errors, sacrificing 50% of cache capacity. Abella
et al. [21] study performance predictability of low-VCC cache designs using subblock disabling.
Wilkerson et al. [22] suggest Hi-ECC, a technique that incorporates multi-bit error-correcting
codes to reduce refresh rate of embedded DRAM caches. Hi-ECC implements a fast decoder for
common-case single-bit-error correction and a slow decoder for uncommon-case multi-bit-error
correction.
Figure 2: Baseline chipkill correct DRAM configuration, with ranks of eighteen x4 DRAM devices
behind the memory controller (gray DRAMs are dedicated to ECC storage).
Table 3: Comparison of circuit-level detection techniques. Overheads quoted from papers and
not yet brought into Echelon's context.

Technique (category)                         Mechanism                              Error coverage  Performance overhead                     Power overhead                   Area overhead
Hazucha et al. [29] (hardening)              Gate resizing                          High            None                                     30-40%                           40%
Lunardini et al. [30] (hardening)            Redundant transistors                  High            -50% (faster due to bigger transistors)  400%                             100%
Mohanram et al. [31] (hardening heuristics)  Partial duplication                    Configurable    None                                     Depends on error coverage        5%
Rao et al. [32] (hardening heuristics)       Gate resizing and flip-flop selection  Low             None                                     Minimal                          Depends on error coverage
Zoellin et al. [33] (hardening heuristics)   Selective gate resizing                Configurable    None                                     Depends on error coverage        Depends on error coverage (10-50%)
Ndai et al. [34] (circuit monitoring)        Current mirror                         High            Configurable                             Depends on performance overhead  Unknown
Narsale et al. [35] (circuit monitoring)     Supply rail monitor                    High            Negligible                               20%                              20%
discuss each technique with respect to its cost and types of errors it covers.
4.2 Architecture-level Techniques
In order to detect errors, some degree of redundancy must be introduced in the architecture.
Code-based techniques operate by providing a redundant representation of numbers with the
property that certain errors can be detected and sometimes corrected through the analysis and
handling of the resulting erroneous number. Code-based techniques offer several distinct advan-
tages over alternative strategies for protecting computation—they run concurrently, generally
detecting and reporting errors online with minimal latency, and they operate through selective
redundancy, requiring only a fractional increase in area to provide error coverage. The amount of
error coverage provided by a code-based technique is rarely complete, but is often quantifiable
and may be tuned to the system requirements to provide low-cost error detection for a target
failure rate. When code-based techniques are not applicable, or require too much custom design,
execution redundancy is the most common architectural alternative. Through the use of redun-
dant execution at the module level, errors can be detected with very high coverage and little
design cost. Execution redundancy usually has high fixed overhead close to 100%, either in space
or time.
While the regular structure of memory arrays enables efficient protection through parity-based
codes or communication codes, these error-correcting codes are not ideally suited for arithmetic
operations. AN codes and residue codes are the most well known examples of error codes which
are designed to detect and correct errors which occur during the processing of integer arithmetic
operations. AN codes, also known as linear residue codes, product codes, and residue-class codes [36],
represent a given integer N by its product with a constant A. Therefore, the addition of two
numbers N1 + N2 can be checked by testing the equality of Equation 1. Variants which work
under other operations of interest exist [37]; error detection (and perhaps correction) is applied at
the functional unit granularity, as there is no separability between the coded circuitry and the
circuitry which performs the original operation.
A ∗ N1 + A ∗ N2 =? A ∗ (N1 + N2)    (1)
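Equation 1 can be checked mechanically. The sketch below uses A = 3 purely for illustration
(real designs choose A to balance cost against coverage); any error that breaks either the
equality or the divisibility of the coded sum by A is exposed:

```python
A = 3  # illustrative check constant; real designs choose A for cost and coverage

def an_encode(n):
    """Represent integer n by its product with A (an AN code)."""
    return A * n

def an_check_add(c1, c2, c_sum):
    """Equation 1: the coded sum must equal c1 + c2 and remain a multiple of A."""
    return c1 + c2 == c_sum and c_sum % A == 0

n1, n2 = 7, 9
good = an_encode(n1) + an_encode(n2)          # adder output, fault-free
assert an_check_add(an_encode(n1), an_encode(n2), good)
assert not an_check_add(an_encode(n1), an_encode(n2), good + 1)  # corrupted sum detected
```

Note that the check here must examine the full-width coded result, which is exactly the
non-separability the text describes: the code and the operation share one datapath.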
A class of arithmetic error codes called residue codes is largely equivalent to AN codes, but has
significant practical implementation advantages [36]. Figure 3 shows an overview of the error
detection process using residue codes. Most arithmetic operations can be checked by testing the
equality of Equation 2, where | N | A = N mod A and ⊕ is the operation of interest. If both sides
of Equation 2 are equal, it is likely that no error has occurred; if they are not equal, then
some error has occurred. Residue codes are more flexible than AN codes (a single residue checker
can detect errors in numerous operations) and provide separability between the circuitry that
performs the original computation and the circuitry that checks it. This separation simplifies
implementation, reduces the intrusiveness of designs, and can make it easier to detect errors
without impacting the delay of the original circuit.
| N1 ⊕ N2 |A =? | | N1 |A ⊕ | N2 |A |A    (2)
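For illustration, Equation 2 with a (hypothetical) check modulus A = 3 can be sketched as
follows; note that the checker operates only on small residues, separate from the wide datapath:

```python
from operator import add, mul

def residue_check(n1, n2, result, op, A=3):
    """Equation 2: compare the residue of the result with the checker's small computation."""
    return result % A == op(n1 % A, n2 % A) % A

# Fault-free operations pass the check; one checker covers several operations.
assert residue_check(1234, 5678, 1234 + 5678, add)
assert residue_check(1234, 5678, 1234 * 5678, mul)

# A single-bit flip changes the value by 2**k, never a multiple of 3, so it is caught.
assert not residue_check(1234, 5678, (1234 + 5678) ^ 0b100, add)
```

With A = 3 every single-bit error in the result is detected (since 2^k mod 3 is never 0),
while multi-bit errors whose net effect is a multiple of A escape; this is the sense in which
code-based coverage is quantifiable rather than complete.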
Arithmetic error control codes are especially useful because they are preserved under
arithmetic operations. Non-arithmetic error codes are not, but they may still be applied to
integer operations through a process called check prediction. Parity codes [38], checksum codes [39],
and Berger codes [40] have all been successfully applied to protect computer arithmetic.
There is no known direct check prediction or error coding method for floating-point arithmetic.
However, more intrusive methods of error detection exist which use residue checking or Berger
check prediction in a piecewise fashion within the floating-point unit to provide resiliency [39].
These methods, while intrusive and requiring custom design, report low area overheads.
Code-based techniques are cost-effective; however, they require custom design and are relatively
inflexible for covering errors in a variety of hardware structures. In the general case, a viable
option is to replicate the execution of some logic and compare the results. At the architecture
level, hardware components can be replicated at varying granularity, from a single module to an
entire core. DIVA [41] suggests that a simple checker module can be used to detect errors in
execution. Further analyses show that the hardware costs are modest and that performance
degradation is low. While this is a promising design point for complex, control-intensive modern
superscalar processors, the method is not applicable to compute-intensive architectures: the
main computational engine is in essence as simple as the suggested checker, so the overall
scheme closely resembles full hardware replication of a large portion of the processor, as done in
the lockstepped IBM G5 processor [42].
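At its core, module-level redundant execution reduces to running the same computation twice
and comparing the results. The sketch below (with a hypothetical fault-injection hook standing
in for a transient hardware fault) shows the detection principle shared by checker modules,
lockstepping, and full replication:

```python
def redundant_execute(fn, args, inject_fault=None):
    """Run fn twice and compare the results; a mismatch signals a detected error."""
    primary = fn(*args)
    shadow = fn(*args)                     # the redundant (checker) execution
    if inject_fault is not None:           # model a transient fault hitting the primary
        primary = inject_fault(primary)
    if primary != shadow:
        raise RuntimeError("redundant results disagree: error detected")
    return primary

assert redundant_execute(lambda a, b: a * b, (6, 7)) == 42
detected = False
try:
    redundant_execute(lambda a, b: a * b, (6, 7), inject_fault=lambda r: r ^ 1)
except RuntimeError:
    detected = True
assert detected
```

The comparison gives very high coverage at little design cost, but the fixed overhead is close
to 100% in space (two copies of the hardware) or time (two executions), as the text notes.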
Table 4: Comparison of architecture-level detection techniques. Overheads quoted from papers
and not yet brought into Echelon's context.

Technique (category)                              Mechanism                  Error types              Performance overhead  Energy overhead     Area overhead
SITR [43] (pipeline manipulation)                 Time redundancy            Transient, intermittent  Low                   Low                 Low
RazorII [44] (pipeline manipulation)              Detecting circuitry        Transient, intermittent  Low                   Low                 Low
DIVA [41] (hardware replication)                  Partial replication        All                      Low                   Low                 Low
Lockstepped pipeline [42] (hardware replication)  Full replication           Hard, transient          Very high (> 2.0X)    Very high (> 2.0X)  High
Redundant multithreading (RMT) [45, 46, 47, 48]   Time redundancy            Transient                High (1.5 - 2.0X)     High (1.5 - 2.0X)   None
Chip multiprocessor RMT [50, 51]                  Space and time redundancy  Hard, transient          Modest (< 1.5X)       Modest (< 1.5X)     None
Table 5: Comparison of detection techniques in software systems. Overheads quoted from papers
and not yet brought into Echelon's context.

                      TIMA [55]        ED4I [54]    SWIFT [56]       SWAT [61]  Shoestring [62]
Replication           Instruction      Instruction  Instruction      None       Some instructions
Control flow check    Yes              No           Yes              Implicit   Implicit
Error types           Hard, transient  All          Hard, transient  All        All
Error coverage        High             High         High             Low        Low
Performance overhead  200%             80%          41%              5-14%      16%
Energy overhead       Roughly the same as the performance overhead
4.4.1 Algorithm-Based Fault Tolerance (ABFT)
A less systematic approach to software fault detection, which still relies on specific knowledge of
the algorithm and program, is to have the programmer annotate the code with assertions and
invariants [70, 71, 72]. Although it is difficult to analyze the effectiveness of this technique in the
general case, it has been shown to provide high error-coverage at very low cost.
An interesting specific case of an assertion is to specify a few sanity checks and make sure
the result of the computation is reasonable. An example might be to check whether energy is
conserved in a physical system simulation. This technique is very simple to implement, does not
degrade performance, and is often extremely effective. In fact, it is probably the most common
technique employed by users when running programs on cluster machines and grids [73].
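As a concrete sketch of such a sanity check, a (hypothetical) particle simulation can verify
after each step that total mechanical energy stays within a small tolerance of its initial
value; all names below are illustrative, not from any specific code:

```python
def total_energy(velocities, heights, mass=1.0, g=9.8):
    """Total mechanical energy (kinetic + potential) of equal-mass particles."""
    kinetic = sum(0.5 * mass * v * v for v in velocities)
    potential = sum(mass * g * h for h in heights)
    return kinetic + potential

def energy_sane(e_initial, e_now, rel_tol=1e-6):
    """Sanity check: flag a likely error if total energy drifts beyond the tolerance."""
    return abs(e_now - e_initial) <= rel_tol * abs(e_initial)

e0 = total_energy([1.0, 2.0], [0.5, 0.5])
assert energy_sane(e0, e0)             # a clean step conserves energy
assert not energy_sane(e0, 1.01 * e0)  # a 1% drift trips the check
```

The check costs a handful of operations per step, far less than the computation it guards,
which is why such assertions are so common in practice.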
As in the case of ABFT, when the programmer knows these techniques will be effective, they
are most likely the least costly and can be used without employing any hardware methods.
cost-effective way to explore the spatial and temporal dimensions of the design space. Hybrid
techniques can also provide the flexibility to dynamically trade off reliability and performance
to best suit an application's needs.
Relax [74] uses try/catch-like semantics to provide reliability through a cooperative hardware-
software approach. Relax relies on low-latency hardware error detection capabilities, while
software handles state preservation and restoration. The programmer uses the Relax framework
to declare a block of instructions as "relaxed". It is the obligation of the compiler to ensure that a
relaxed code block can be re-executed or discarded upon a failure. As a result, hardware can relax
the safety margin (e.g., frequency or voltage) to improve performance or save energy, and the
programmer can tune which blocks of code are relaxed and how recovery is done.
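The recovery contract can be sketched with ordinary exception semantics: a relaxed block is
made re-executable, so on a detected hardware error the runtime discards its partial results
and simply runs it again. This is an illustrative sketch of the semantics, not the Relax ISA:

```python
class HardwareError(Exception):
    """Stands in for a low-latency hardware error-detection signal."""

def run_relaxed(block, retries=3):
    """Re-execute a re-executable 'relaxed' block until it completes without error."""
    for _ in range(retries):
        try:
            return block()
        except HardwareError:
            continue  # discard partial results; live state is preserved outside the block
    raise RuntimeError("persistent failure: fall back to safe margins")

faults = iter([True, False])  # first attempt hit by an error, second attempt is clean
def block():
    if next(faults):
        raise HardwareError()
    return 42

assert run_relaxed(block) == 42
```

The division of labor mirrors the text: detection is cheap and fast in hardware, while the
compiler guarantees that re-execution of the block is always safe.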
FaulTM [75] is another research project that uses transactional semantics for reliability.
FaulTM uses hardware transactional memory with lazy conflict detection and lazy data versioning
to provide hybrid hardware-software fault tolerance. The programmer declares a vulnerable
block (similar to transactional memories and Relax), and lazy transactional memory in hardware
provides state preservation and restoration of the user-defined block. FaulTM duplicates a
vulnerable block across two different cores for reliable execution.
CRAFT [76] is a hybrid approach which combines the software-only approach of replicated
instructions and checks [56] with some time redundant multithreading-style hardware support in
order to achieve higher error coverage and slightly improved performance [46, 51]. By taking
a hybrid approach, CRAFT achieves better reliability and performance than the software-only
approach while requiring less additional area than time redundant multithreading. Performance
is still degraded to a large degree compared to aggressive hardware-based resiliency approaches,
however.
Argus [77] also takes a hybrid approach for control protection. The Argus compiler generates a
static control/data flow graph, and this information is inserted into the instruction stream as
basic block signatures. At runtime, hardware modules generate the dynamic control/data flow
graph and compare it against the static information passed from the compiler. While this
provides an economical way of protecting control, computation must also be protected in order
to avoid silent data corruption. Argus employs previously suggested techniques such as modulo
checkers for protecting the ALU and multiplier/divider.
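The comparison Argus performs can be sketched as follows: the compiler tags each basic block
with a signature and records the signatures of its legal successors, and a runtime check flags
any transfer of control outside that set. The data below is illustrative, not the actual
Argus signature format:

```python
# Compiler output (illustrative): a signature per basic block, and for each block
# the signatures of its legal successors in the static control-flow graph.
BLOCK_SIG = {"entry": 0xA1, "loop": 0xB2, "exit": 0xC3}
LEGAL_NEXT = {"entry": {0xB2}, "loop": {0xB2, 0xC3}, "exit": set()}

def check_edge(current, taken):
    """Hardware-style check of a dynamic control-flow edge against static information."""
    return BLOCK_SIG[taken] in LEGAL_NEXT[current]

assert check_edge("entry", "loop")
assert check_edge("loop", "exit")
assert not check_edge("entry", "exit")  # illegal transfer: control-flow error detected
```

Because the static information is compact and the check is a set-membership test, the runtime
hardware stays small; this is what makes the approach economical for control protection.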
5.1 Detection at the Core Level
One implementation of detection at the core level is utilized in the IBM z990 processor [78].
It integrates multiple hardware techniques discussed in Sections 3 and 4 to create a fault-
tolerant core with the most efficient detection mechanisms for different parts of the system.
Overall, the techniques used in the IBM z990 are ECC, parity, retry (re-execution), mirroring
(hardware duplication), checkpointing, and rollback.
In IBM z990, ECC and parity are the main choice of detection mechanisms for the components
of the memory hierarchy. The main memory is protected by 2-bit symbols. Furthermore, there is
extra ECC to detect hard and soft errors on the address lines. L2 caches are again protected by
ECC, which allows purging, cleaning, and/or invalidation of data as necessary. Moreover, the
system keeps track of persistent errors in the cache, and if one exists, it shuts down the cache line
causing the error. Simultaneously, the L2 pipeline is checked for errors by parity bits placed
in each stage of the pipeline. In case of repeating errors, the system turns off the entire core. Other
SRAMs and register files are similarly protected by ECC and parity. Finally, the memory address
and command interface is covered by parity with re-execution of the memory command in cases
of failure.
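The parity-plus-re-execution pattern on the memory command interface can be sketched as
follows; the noisy-channel fault model is illustrative, not a model of the z990 interface:

```python
def parity(word):
    """Even-parity bit over a word."""
    return bin(word).count("1") & 1

def issue_command(word, channel):
    """Send a command with its parity bit; re-execute on a detected parity error."""
    while True:
        received, pbit = channel(word, parity(word))
        if parity(received) == pbit:   # receiver-side parity check
            return received
        # mismatch: error detected, retry the memory command

flips = iter([1 << 5, 0])              # first transfer hit by a bit flip, retry is clean
def noisy_channel(word, pbit):
    return word ^ next(flips), pbit

assert issue_command(0xDEADBEEF, noisy_channel) == 0xDEADBEEF
```

A single parity bit catches any odd number of bit errors per transfer; combined with retry,
that is sufficient for the transient faults this interface is designed to ride out.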
The datapath and the surrounding logic also benefit from ECC and parity; however, other
more suitable techniques exist for these parts of the IBM z990. Logic in the pipeline is mirrored
and checkpointed. The results of the duplicated hardware are compared against each other, and
the core returns to the checkpointed state if the results do not match. Similarly, fetch data bus, I/O
buses, and the store address stack are protected by parity with recovery through checkpointing
and rollback. If the error persists, the entire core is turned off. Furthermore, to create an even
more robust system, the checkpoint arrays themselves are protected by ECC as a second layer of
protection. Control signals in the pipeline are protected by ECC in each stage, and the propagation
of this ECC data is checked with parity bits. I/O operations are also covered by parity with
re-execution on errors. Finally, ECC is recommended for off-chip address and control signals in
SMPs.
In [79, 80], network communications are protected by having strong error detection on all data
paths and structures. ECC is used to protect memories and data paths. Network packet transfers
are protected with cyclic redundancy checks (CRC). The network provides a 16-bit packet CRC,
which protects up to 64 bytes of data and the associated headers (768 bits max). The receiving link
checks the CRC as a packet arrives, returning an error if it is incorrect. The CRC is also checked
as a packet leaves each device, and as it transitions from the router to the NIC, enabling detection
of errors occurring within the router core. Furthermore, many of these paths and structures also
have error correction.
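A bit-serial CRC-16 in the style of such packet checks can be sketched as follows; the CCITT
polynomial 0x1021 used here is an illustrative choice, as this report does not specify the
polynomial used in [79, 80]:

```python
def crc16(data, poly=0x1021, crc=0xFFFF):
    """Bit-serial CRC-16; the sender appends the CRC, the receiver recomputes and compares."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

packet = bytes(range(64))              # up to 64 bytes of data plus headers
sent_crc = crc16(packet)

corrupted = bytearray(packet)
corrupted[10] ^= 0x04                  # single-bit error in flight
assert crc16(packet) == sent_crc            # clean packet passes at the receiving link
assert crc16(bytes(corrupted)) != sent_crc  # corrupted packet is rejected
```

Any CRC with more than one polynomial term detects all single-bit errors, and a 16-bit CRC
detects all burst errors up to 16 bits long, which suits link-level transfer checking well.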
When an unrecoverable hardware error is detected, some form of notification is always
generated. For errors in data payloads, the errors are reported directly to the client that requested
the communication. This is usually done in the form of completion events where the error
indication is included as part of that event. For severe errors that might affect the operation of the
network, the operating system is also informed via an interrupt.
Errors in control information are more problematic because the client information cannot be
trusted when such an error occurs, so a direct report to the client is not always possible. Instead,
the communication is usually dropped at the point of this kind of error. However, every
communication on the network is tracked in various ways that always include hardware timeout mechanisms
to report if the communication has been lost. These timeouts are also reported via the completion
events. Again, severe errors are reported to the operating system via an interrupt if they are
detected in hardware directly associated with a node. If the error is detected in hardware not
associated with a particular node, the error is reported to the independent supervisory system
(which uses a separate network and processors).
In addition, any such errors, either in payload or control information, are always reported at
the point of occurrence, usually to the supervisory system. This reporting channel is intended to
be used for maintenance purposes.
Node failures are usually detected by closely monitoring node health [81]. The monitoring is
accomplished by requiring an operating system thread on every node to increment a heartbeat
counter that is checked by the independent supervisory system. This thread also verifies that all
of the cores of a node are functional, at least with the ability to schedule and run the heartbeat
thread. When the supervisory system detects a lack of heartbeat, a failure event is generated.
Other nodes in the system may subscribe for that event so that they are notified of any particular
node failure.
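The heartbeat scheme can be sketched with logical ticks standing in for the supervisory
system's polling interval; all names here are illustrative:

```python
class Node:
    """A node whose OS thread increments a heartbeat counter each period."""
    def __init__(self):
        self.heartbeat = 0
        self.alive = True
    def os_thread_tick(self):
        if self.alive:
            self.heartbeat += 1

def supervisor_poll(node, last_seen):
    """Supervisory check: no counter progress since the last poll means a failure event."""
    if node.heartbeat == last_seen:
        return last_seen, "node-failure-event"
    return node.heartbeat, None

node, last = Node(), 0
node.os_thread_tick()
last, event = supervisor_poll(node, last)
assert event is None                   # healthy node: the counter advanced

node.alive = False                     # node hangs; the heartbeat thread stops running
node.os_thread_tick()
_, event = supervisor_poll(node, last)
assert event == "node-failure-event"   # supervisory system generates a failure event
```

Because the heartbeat thread must be scheduled to run, a stalled counter implicates the whole
node (OS, scheduler, cores), not just the counter itself, which is what makes this simple
check an effective liveness test.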
In addition, the job launch and control system maintains a control tree of communication
connections between the nodes in a job. If any of these connections fail, the nodes at the far end of
the connection are considered down. This causes the entire job to be torn down in [81]. However,
it has been suggested that we could optionally trigger a notification to the job and a reconstruction
of the control tree. The receipt of the above node failure notifications by the job launch and
control system from the supervisory system can also optionally trigger this notification and
reconstruction.
6 Conclusion
In this report, we enumerate diverse existing error detection mechanisms for memory, compute,
and system. The error detection mechanisms are further classified based on their redundancy type,
placement in the system hierarchy, and error type coverage. As a qualitative trade-off analysis,
techniques in each category are explained in detail and compared to one another where applicable.
It is shown that different techniques have different trade-offs in terms of performance, energy,
and area. This analysis should provide important insight for achieving efficient resiliency
within the Echelon system.
Acknowledgements
This research was funded in part by DARPA contract HR0011-10-9-0008. The authors also wish
to gratefully acknowledge the input of Jinsuk Chung, Evgeni Krimer, Karthik Shankar, and
Jongwook Sohn for helping formulate the ideas contained in this report.
References
[1] D. Ernst, N. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner
et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” 2003.
[Online]. Available: http://www.eecs.umich.edu/∼taustin/papers/MICRO36-Razor.pdf
[2] R. W. Hamming, “Error correcting and error detecting codes,” Bell System Technical J., vol. 29,
pp. 147–160, Apr. 1950.
[4] S. Lin and D. J. C. Jr., Error Control Coding: Fundamentals and Applications. Prentice-Hall, Inc.,
Englewood Cliffs, NJ, 1983.
[5] C. L. Chen and M. Y. Hsiao, “Error-correcting codes for semiconductor memory applications:
A state-of-the-art review,” IBM J. Research and Development, vol. 28, no. 2, pp. 124–134, Mar.
1984.
[6] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” J. Soc. for Industrial
and Applied Math., vol. 8, pp. 300–304, Jun. 1960.
[7] A. Hocquenghem, “Codes correcteurs d’erreurs,” Chiffres (Paris), vol. 2, pp. 147–156, 1959.
[8] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error correcting binary group codes,”
Information and Control, vol. 3, pp. 68–79, 1960.
[9] J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J. C. Hoe, “Multi-bit error tolerant caches
using two-dimensional error coding,” in Proc. the 40th IEEE/ACM Int’l Symp. Microarchitecture
(MICRO), Dec. 2007.
[10] D. Strukov, “The area and latency tradeoffs of binary bit-parallel BCH decoders for prospec-
tive nanoelectronic memories,” in Proc. Asilomar Conf. Signals Systems and Computers, October
2006.
[11] C. Slayman, “Cache and memory error detection, correction, and reduction techniques for
terrestrial servers and workstations,” IEEE Trans. Device and Materials Reliability, vol. 5, pp.
397– 404, Sep. 2005. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=
&arnumber=1545899
[13] J. M. Tendler, J. S. Dodson, J. S. F. Jr., H. Le, and B. Sinharoy, “POWER4 system microarchitec-
ture,” IBM J. Research and Development, vol. 46, no. 1, pp. 5–25, Jan. 2002.
[14] J. Wuu, D. Weiss, C. Morganti, and M. Dreesen, “The asynchronous 24MB on-chip level-3
cache for a dual-core Itanium-family processor,” in Proc. the Int’l Solid-State Circuits Conf.
(ISSCC), Feb. 2005.
[15] J. Huynh, White Paper: The AMD Athlon MP Processor with 512KB L2 Cache, May 2003.
[16] C. N. Keltcher, K. J. McGrath, A. Ahmed, and P. Conway, “The AMD Opteron processor for
multiprocessor servers,” IEEE Micro, vol. 23, no. 2, pp. 66–76, Mar.-Apr. 2003.
[17] D. E. Corporation, Alpha 21264 Microprocessor Hardware Reference Manual, Jul. 1999.
[19] Z. Chisti, A. R. Alameldeen, C. Wilkerson, W. Wu, and S.-L. Lu, “Improving cache lifetime
reliability at ultra-low voltages,” in Proc. the 42nd IEEE/ACM Int’l Symp. Microarchitecture
(MICRO), Dec. 2009.
[20] M. Y. Hsiao, D. C. Bossen, and R. T. Chien, “Orthogonal Latin square codes,” IBM Journal of
Research and Development, vol. 14, no. 4, pp. 390–394, Jul. 1970.
[21] J. Abella, J. Carretero, P. Chaparro, X. Vera, and A. Gonzalez, “Low Vccmin fault-tolerant
cache with highly predictable performance,” in Proc. the 42nd IEEE/ACM Int’l Symp. Microar-
chitecture (MICRO), Dec. 2009.
[22] C. Wilkerson, A. R. Alameldeen, Z. Chishti, W. Wu, D. Somasekhar, and S.-L. Lu, “Reducing
cache power with low-cost, multi-bit error correcting codes,” in Proc. the Ann. Int’l Symp.
Computer Architecture (ISCA), Jun. 2010.
[23] B. Schroeder, E. Pinheiro, and W. Weber, “DRAM errors in the wild: a large-scale
field study,” in Proceedings of the eleventh international joint conference on Measurement and
modeling of computer systems. ACM, 2009, pp. 193–204. [Online]. Available: http:
//research.google.com/pubs/archive/35162.pdf
[24] T. J. Dell, “A white paper on the benefits of chipkill-correct ECC for PC server main memory,”
IBM Microelectronics Division, Nov. 1997.
[25] AMD, “BIOS and kernel developer’s guide for AMD NPT family 0Fh processors,” Jul. 2007.
[Online]. Available: http://support.amd.com/us/Processor TechDocs/32559.pdf
[26] C. L. Chen, “Symbol error correcting codes for memory applications,” in Proc. the 26th Ann.
Int’l Symp. Fault-Tolerant Computing (FTCS), Jun. 1996.
[28] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust system design with built-in
soft-error resilience,” Computer, vol. 38, no. 2, pp. 43–52, 2005.
[31] K. Mohanram and N. Touba, “Cost-effective approach for reducing soft error failure rate in
logic circuits,” International Test Conference, 2003. Proceedings. ITC 2003., pp. 893–901, 2003.
[32] R. Rao, D. Blaauw, and D. Sylvester, “Soft error reduction in combinational logic using
gate resizing and flipflop selection,” in Computer-Aided Design, 2006. ICCAD ’06. IEEE/ACM
International Conference on, 2006, pp. 502 –509.
[33] C. G. Zoellin, H.-J. Wunderlich, I. Polian, and B. Becker, “Selective hardening in early design
steps,” European Test Symposium, IEEE, vol. 0, pp. 185–190, 2008.
[34] P. Ndai, A. Agarwal, Q. Chen, and K. Roy, “A soft error monitor using switching current
detection,” in 2005 IEEE International Conference on Computer Design: VLSI in Computers and
Processors, 2005. ICCD 2005. Proceedings, 2005, pp. 185–190.
[35] A. Narsale and M. Huang, “Variation-tolerant hierarchical voltage monitoring circuit for soft
error detection,” in Quality of Electronic Design, 2009. ISQED 2009. Quality Electronic Design.
IEEE, 2009, pp. 799–805.
[36] T. R. N. Rao, Error Coding for Arithmetic Processors. Orlando, FL, USA: Academic Press, Inc.,
1974.
[37] I. Proudler, “Idempotent AN codes,” in Signal Processing Applications of Finite Field Mathematics,
IEE Colloquium on, Jun. 1989, pp. 8/1 –8/5.
[39] J.-C. Lo, “Reliable floating-point arithmetic algorithms for error-coded operands,” IEEE
Transactions on Computers, vol. 43, no. 4, pp. 400–412, Apr. 1994.
[40] J. Lo, S. Thanawastien, and T. Rao, “Concurrent error detection in arithmetic and logical
operations using Berger codes,” in Proceedings of the 9th Symposium on Computer Arithmetic, Sep.
1989, pp. 233–240.
[41] T. M. Austin, “DIVA: a reliable substrate for deep submicron microarchitecture design,” in
MICRO 32: Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitec-
ture, 1999, pp. 196–207.
[44] S. Das, C. Tokunaga, S. Pant, W. Ma, S. Kalaiselvan, K. Lai, D. Bull, and D. Blaauw,
“RazorII: In situ error detection and correction for PVT and SER tolerance,” IEEE
Journal of Solid-State Circuits, vol. 44, no. 1, pp. 32–48, 2009. [Online]. Available:
http://www.ece.ncsu.edu/asic/ece733/2009/docs/RazorII.pdf
[45] N. Saxena and E. McCluskey, “Dependable adaptive computing systems: the ROAR project,” in
Systems, Man, and Cybernetics, 1998. 1998 IEEE International Conference on, vol. 3. IEEE, 1998,
pp. 2172–2177.
[46] S. Reinhardt and S. Mukherjee, “Transient fault detection via simultaneous multithreading,”
ACM SIGARCH Computer Architecture News, vol. 28, no. 2, pp. 25–36, 2000.
[48] J. Ray, J. C. Hoe, and B. Falsafi, “Dual use of superscalar datapath for transient-fault detec-
tion and recovery,” in Proceedings of the 34th annual ACM/IEEE international symposium on
Microarchitecture, ser. MICRO 34. Washington, DC, USA: IEEE Computer Society, 2001, pp.
214–224.
[53] J. H. Wensley, M. W. Green, K. N. Levitt, and R. E. Shostak, “The design, analysis, and
verification of the SIFT fault tolerant system,” in ICSE ’76: Proceedings of the 2nd international
conference on Software engineering, 1976, pp. 458–469.
[54] N. Oh, S. Mitra, and E. J. McCluskey, “ED4I: Error detection by diverse data and duplicated
instructions,” IEEE Trans. Comput., vol. 51, no. 2, pp. 180–199, 2002.
[57] J. Ohlsson and M. Rimen, “Implicit signature checking,” in Fault-Tolerant Computing, 1995.
FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on, Jun. 1995, pp. 218–227.
[60] R. Venkatasubramanian, J. Hayes, and B. Murray, “Low-cost on-line fault detection using
control flow assertions,” in On-Line Testing Symposium, 2003. IOLTS 2003. 9th IEEE, 2003, pp.
137–143.
[61] M. Li, P. Ramachandran, S. Sahoo, S. Adve, V. Adve, and Y. Zhou, “SWAT: An Error Resilient
System,” in the Fourth Workshop on Silicon Errors in Logic-System Effects (SELSE-IV). Citeseer,
2008. [Online]. Available: http://rsim.cs.illinois.edu/Pubs/08SELSE-Li.pdf
[62] S. Feng, S. Gupta, A. Ansari, and S. Mahlke, “Shoestring: probabilistic soft error reliability
on the cheap,” ACM SIGARCH Computer Architecture News, vol. 38, no. 1, pp. 385–396, 2010.
[Online]. Available: http://www.eecs.umich.edu/~shoe/papers/sfeng-asplos10.pdf
[64] A. Al-Yamani, N. Oh, and E. McCluskey, “Performance evaluation of checksum-based ABFT,”
in 16th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’01),
San Francisco, California, USA, October 2001.
[65] K. H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,”
IEEE Trans. Comput., vol. C-33, pp. 518–528, 1984.
[67] A. L. N. Reddy and P. Banerjee, “Algorithm-based fault detection for signal processing
applications,” IEEE Trans. Comput., vol. 39, no. 10, pp. 1304–1308, 1990.
[68] J.-Y. Jou and J. A. Abraham, “Fault-tolerant FFT networks,” IEEE Trans. Comput., vol. 37,
no. 5, pp. 548–561, 1988.
[69] A. Mishra and P. Banerjee, “An algorithm-based error detection scheme for the multigrid
method,” IEEE Trans. Comput., vol. 52, no. 9, pp. 1089–1099, 2003.
[70] D. M. Andrews, “Using executable assertions for testing and fault tolerance,” in 9th Fault-Tolerant
Computing Symposium (FTCS-9), Madison, Wisconsin, USA, June 1979.
[71] A. Mahmood, D. J. Lu, and E. J. McCluskey, “Concurrent fault detection using a watchdog
processor and assertions,” in 1983 International Test Conference, Philadelphia, Pennsylvania,
USA, October 1983, pp. 622–628.
[72] M. Z. Rela, H. Madeira, and J. G. Silva, “Experimental evaluation of the fail-silent behaviour
in programs with consistency checks,” in FTCS ’96: Proceedings of the Twenty-Sixth Annual
International Symposium on Fault-Tolerant Computing, 1996, p. 394.
[73] J. M. Wozniak, A. Striegel, D. Salyers, and J. A. Izaguirre, “GIPSE: Streamlining the manage-
ment of simulation on the grid,” in ANSS ’05: Proceedings of the 38th Annual Symposium on
Simulation, 2005, pp. 130–137.
[75] G. Yalcin, O. Unsal, I. Hur, A. Cristal, and M. Valero, “FaulTM: Fault-tolerance using hardware
transactional memory,” in Proc. Workshop on Parallel Execution of Sequential Programs on
Multi-Core Architecture (PESPMA), 2010.
[76] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee, “Design
and evaluation of hybrid fault-detection systems,” in ISCA ’05: Proceedings of the 32nd Annual
International Symposium on Computer Architecture. Madison, Wisconsin, USA: IEEE Computer
Society, 2005, pp. 148–159.
[77] A. Meixner, M. Bauer, and D. Sorin, “Argus: Low-cost, comprehensive error detection in
simple cores,” in Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International
Symposium on. IEEE, 2007, pp. 210–222.
[78] P. Meaney, S. Swaney, P. Sanda, and L. Spainhower, “IBM z990 soft error detection and
recovery,” Device and Materials Reliability, IEEE Transactions on, vol. 5, no. 3, pp. 419–427, Sep.
2005.
[79] R. Alverson, D. Roweth, and L. Kaplan, “The Gemini System Interconnect,” in High Perfor-
mance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on. IEEE, 2010, pp. 83–87.
[80] F. Godfrey, “Resiliency features in the next generation Cray Gemini network,” in Cray User
Group (CUG) 2010, 2010.
[81] “ASCI Red Storm system overview and design specification,” Internal Report, Sandia
National Labs, 2003.