Software redundancy refers to all additional software installed
in a system that would not be needed for a fault-free computer.
Software redundancy plays a major role in most fault-tolerant
computers. Even computers that recover from failures
mainly by hardware means use software to control their
recovery and decision-making processes. The amount of software
used depends on the recovery system design, which in turn
depends on the type of error or malfunction that is expected.
Different schemes suit different errors. Some errors are handled
most efficiently by hardware alone, others need only software,
but most call for a mixture of the two.
For a functional system, i.e., one without hardware design
faults, errors can be classified into two varieties: (1) software
design errors and (2) hardware malfunctions.
The first category can be corrected mainly by means of
software, because it is extremely difficult to design hardware
that corrects programmers’ errors. Software methods, though,
are often used to correct hardware faults as well, especially
transient ones. The reduction and correction of software design
errors can be accomplished through the techniques outlined
below.
Computers may be designed to detect several software
errors.14,15 Examples include the use of illegal instructions
(i.e., instructions that do not exist), the use of privileged
instructions when the system has not been authorized to
process them, and address violations. The last of these refers to
reading or writing into locations beyond usable memory.
These limits can often be set physically on the hardware.
Computers capable of detecting these errors raise interrupts,
which allow the programmer to handle them. The interrupts
route the program to specific locations in memory. The
programmer, knowing these locations, can then add code there
that branches to specific subroutines, each of which handles
one error in a specified manner.
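The interrupt mechanism described above can be imitated in software. The following is only a sketch, not the scheme of any particular machine: a Python exception stands in for the hardware interrupt, and the names Trap, TrapError, MEMORY_LIMIT, TRAP_VECTOR, read_word, and run_guarded are all hypothetical.

```python
from enum import Enum, auto


class Trap(Enum):
    ILLEGAL_INSTRUCTION = auto()
    PRIVILEGE_VIOLATION = auto()
    ADDRESS_VIOLATION = auto()


MEMORY_LIMIT = 1024  # assumed size of usable memory, in words


class TrapError(Exception):
    """Stands in for the interrupt raised by error-detecting hardware."""

    def __init__(self, trap, detail):
        super().__init__(f"{trap.name}: {detail}")
        self.trap = trap
        self.detail = detail


def read_word(memory, address):
    """Simulated bounds check: a bad address causes a trap."""
    if not 0 <= address < MEMORY_LIMIT:
        raise TrapError(Trap.ADDRESS_VIOLATION, address)
    return memory.get(address, 0)


# Programmer-supplied subroutines, one per detectable error.
def on_address_violation(detail):
    return f"address {detail} is outside usable memory"


def on_illegal_instruction(detail):
    return f"opcode {detail} does not exist"


def on_privilege_violation(detail):
    return f"privileged opcode {detail} refused"


# The "known locations": each trap is routed to its own handler.
TRAP_VECTOR = {
    Trap.ADDRESS_VIOLATION: on_address_violation,
    Trap.ILLEGAL_INSTRUCTION: on_illegal_instruction,
    Trap.PRIVILEGE_VIOLATION: on_privilege_violation,
}


def run_guarded(operation, *args):
    """Run an operation; on a trap, branch to the handler in the vector."""
    try:
        return operation(*args)
    except TrapError as err:
        return TRAP_VECTOR[err.trap](err.detail)
```

Calling run_guarded(read_word, {5: 42}, 5) returns the stored word, while an address of 5000 is routed to on_address_violation instead of halting the program.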
Software recovery from software errors can be accomplished
via several methods. As mentioned before, parallel
programming, in which alternative methods are used to determine
a correct solution, can be used when an incorrect solution
can be identified. Some less sophisticated systems print out
diagnostics so that the user can correct the program off-line,
away from the machine. This should only be a last resort for a
fault-tolerant machine. Nevertheless, a computer should always
keep a log of all errors incurred, memory size permitting.
© 2003 by Béla Lipták
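As an illustration of recovery by alternative methods combined with an error log, the sketch below tries each alternate in turn and accepts the first result that passes a check. It assumes that an incorrect solution can be recognized by such a check. The function names, the square-root example, and the log size of 100 entries are hypothetical choices made for the illustration.

```python
import math
from collections import deque

# Bounded log: when memory for the log is full, the oldest entry is dropped.
ERROR_LOG = deque(maxlen=100)


def run_with_alternates(alternates, acceptable, x):
    """Try each alternate in order; return the first acceptable result."""
    for method in alternates:
        try:
            result = method(x)
        except Exception as exc:
            ERROR_LOG.append((method.__name__, x, repr(exc)))
            continue
        if acceptable(x, result):
            return result
        ERROR_LOG.append((method.__name__, x, f"rejected result {result!r}"))
    raise RuntimeError("all alternates failed")


def faulty_sqrt(x):
    return x / 2  # deliberate software design error


def newton_sqrt(x):
    guess = x if x > 1 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_ok(x, result):
    """Acceptance check: squaring the result must give back the input."""
    return math.isclose(result * result, x, rel_tol=1e-9)
```

With run_with_alternates([faulty_sqrt, newton_sqrt], sqrt_ok, 9.0), the first method's result is rejected and logged, and the second method supplies the answer.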
Preventive measures in software consist mainly of the use of
redundant storage. Hardware failures
often result in a garbling or a loss of data or instructions that
are read from memory. If hardware techniques such as coding
cannot recover the correct bit pattern, those words will
become permanently lost. Therefore, it is important to at least
duplicate all necessary program and data storage so that it
can be retrieved if one copy is destroyed. In addition, special
measures should be taken so that critical programs such as
error recovery programs are placed in nonvolatile storage,
such as read-only memory. Critical data should likewise be placed
in nondestructive readout memories. An example of such a
memory is a plated-wire memory.
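Duplicated storage can be sketched in a few lines. The example below is an assumption-laden illustration rather than a prescribed design: it keeps two copies of each record, uses a CRC-32 checksum (a choice made here, not one named in the text) to tell an intact copy from a garbled one, and rewrites the good copy over the bad one on every read. The class name DuplicatedStore is hypothetical.

```python
import zlib


class DuplicatedStore:
    """Keeps two copies of every record, each with its own CRC-32."""

    def __init__(self):
        self.copies = [{}, {}]

    def write(self, key, data):
        for copy in self.copies:
            copy[key] = (bytes(data), zlib.crc32(data))

    def read(self, key):
        for copy in self.copies:
            record = copy.get(key)
            if record is not None and zlib.crc32(record[0]) == record[1]:
                # Intact copy found: rewrite it everywhere to restore
                # full redundancy before returning the data.
                for target in self.copies:
                    target[key] = record
                return record[0]
        raise LookupError(f"no intact copy of {key!r}")
```

If one copy of a record is garbled, read still returns the original bytes and repairs the damaged copy. Only when both copies are lost does the read fail.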
The second task of the software in fault tolerance is to
detect and diagnose errors. Software error-detection techniques
for software errors often can be used to detect transient
hardware faults. This is important, since “a relatively large
number of malfunctions are intermittent in nature rather than
solid failures.”9 Time-redundant processes, i.e., repeated trials,
should be used to recover from them.
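Time redundancy amounts to repeating an operation until it succeeds or a trial limit is reached. The sketch below assumes a limit of three trials and uses a simulated device, FlakyReader, that fails a fixed number of times before returning its value. Both are hypothetical and serve only to make the example deterministic.

```python
class TransientFault(Exception):
    """An error that may disappear if the operation is repeated."""


class SolidFault(Exception):
    """An error that persisted through every trial."""


class FlakyReader:
    """Simulated device that fails a fixed number of times, then works."""

    def __init__(self, failures, value):
        self.failures_left = failures
        self.value = value

    def read(self):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TransientFault("read error")
        return self.value


def retry(operation, max_trials=3):
    """Repeat the operation; return its result and the trial that succeeded."""
    for trial in range(1, max_trials + 1):
        try:
            return operation(), trial
        except TransientFault:
            continue
    # Every trial failed, so the fault is treated as solid, not transient.
    raise SolidFault(f"still failing after {max_trials} trials")
```

A reader that fails twice succeeds on the third trial. One that keeps failing is reported as a solid fault, which is the point at which diagnostics or reconfiguration would take over.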
Software detection techniques do not localize the sources
of the errors. Therefore, diagnostic test programs are frequently
implemented to locate the module or modules
responsible. These programs often test the extent of the faults
at the time of failure or perform periodic tests to determine
malfunctions before they manifest themselves as errors during
program execution. Almost every computer system uses
some form of diagnostic routines to locate faults. In a fault-tolerant
system, the system itself initiates these tests and
interprets their results, as opposed to the outside insertion of
test programs by operators in other systems.
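A self-initiated diagnostic of the kind described above can be sketched as a pattern test run over each memory module, with the system itself interpreting the results. Everything in the example is an assumption made for illustration: the MemoryModule class, its simulated stuck-at-0 bit, and the choice of the alternating patterns 0x55 and 0xAA.

```python
PATTERNS = (0x55, 0xAA)  # alternating bits, so each bit is tested as 0 and 1


class MemoryModule:
    """Simulated 8-bit memory module, optionally with one bit stuck at 0."""

    def __init__(self, name, size=8, stuck_bit=None):
        self.name = name
        self.cells = [0] * size
        self.stuck_bit = stuck_bit

    def write(self, addr, value):
        if self.stuck_bit is not None:
            value &= ~(1 << self.stuck_bit)  # the faulty bit never sets
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return self.cells[addr]


def module_passes(module):
    """Write each pattern to each cell, read it back, restore the contents."""
    for pattern in PATTERNS:
        for addr in range(len(module.cells)):
            saved = module.read(addr)
            module.write(addr, pattern)
            ok = module.read(addr) == pattern
            module.write(addr, saved)
            if not ok:
                return False
    return True


def run_diagnostics(modules):
    """Return the names of the modules that failed the pattern test."""
    return [m.name for m in modules if not module_passes(m)]
```

Run periodically, run_diagnostics names the faulty module (for example, a bank with a stuck bit) before the fault shows up as an error during program execution.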