Dependability Assessment of Operating Systems in Multi-Core Architectures
The dependability of a service is directly influenced by the reliability provided by the lower system layers, such as the operating system and hardware. Due to device and voltage scaling, and the increasing complexity of digital systems, transient errors are forecast to be a problem for all future digital systems [4]. When we consider large-scale machines, with many processors and cores, this phenomenon is further exacerbated. Therefore, it is mandatory to evaluate the behavior of systems under such conditions.

In this context, understanding how errors manifest and how they are handled by the platform operating system layer gives an indication of the reliability of the system and of how such errors affect user-level applications.

The robustness and fault sensitivity of operating systems have been extensively studied [2, 3, 1]. There are several challenges in developing a fault injection framework for multi-core systems: (i) it is likely that we have more than one error at a time, possibly occurring in different processors/cores; (ii) there is no direct control over where the fault injector is going to be executed or where the workload is being placed by the operating system scheduler; (iii) a fault that is injected in one processor can manifest in another processor. These issues make it difficult to accurately measure dependability metrics, such as crash latency. The use of standard techniques, such as performance counters, is not trivial, given that we would have to configure the performance registers in all processors at the same time. This scenario poses a new set of questions that have not been addressed by the current approaches: Is the error behavior different with errors occurring in multiple processors/cores? How do latent errors influence the system? Do errors propagate between different processors or cores?

In addressing these questions, this work describes the development of an experimental environment that enables

∗ Sponsored by CAPES/Brazil

In this study we leverage the NFTAPE fault injection testing environment [1]. The current NFTAPE implementation for Operating System (OS) assessment consists of two machines: (i) the target machine, which hosts the operating system under study and runs the process manager application that performs the actual fault injection; and (ii) a control host machine (different from the target), which controls the experiment by issuing commands to the target machine, via the process manager, and collecting data about the injections.

Conducting an OS injection campaign involves two phases: (i) setting up the injection, and (ii) performing the injection itself, i.e., executing instructions that corrupt a specific value in memory or registers. Figure 1 depicts how the injection works. The control host issues a request to inject a fault to the process manager. The process manager invokes an injector program, which is a user-level program. This process receives the injection parameters and passes them to the NFTAPE kernel module. The kernel module writes the parameters into kernel data structures added specially for fault injection. It injects faults by using breakpoint registers. When the module receives the address of the target data/instruction for the injection, it writes this value into the processor debug register. When the target data/instruction at this address is accessed/executed, the kernel executes the breakpoint handler. The breakpoint handler is instrumented with fault injection instructions, which are used to corrupt the value according to a given fault mask.

[Figure 1: injection flow. Request from control host -> process manager -> invoke injector -> injector -> kernel module; example parameters: address 0xC0123456, mask 0x00000010; the kernel module calls write_debugReg(0xC0123456) on an address in kernel text.]
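The two phases described above (arming a breakpoint on the target address, then corrupting the value when that address is accessed) can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not NFTAPE code: the names InjectionParams and ToyKernel are hypothetical, the kernel is reduced to a dictionary plus a single debug register, and the fault mask is assumed to be applied as a bitwise XOR (a bit flip), which the text does not specify. The address and mask values are the ones shown in Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit words, as suggested by the 32-bit address in Figure 1


@dataclass
class InjectionParams:
    """Hypothetical container for the parameters passed down by the injector."""
    address: int  # address of the target data/instruction
    mask: int     # fault mask: set bits mark the bits to corrupt


@dataclass
class ToyKernel:
    """Stand-in for the target kernel: a memory map plus one debug register."""
    memory: Dict[int, int] = field(default_factory=dict)
    debug_reg: Optional[int] = None
    params: Optional[InjectionParams] = None
    injected: bool = False

    def setup_injection(self, params: InjectionParams) -> None:
        # Phase (i): store the parameters and arm the breakpoint.
        self.params = params
        self.debug_reg = params.address

    def breakpoint_handler(self) -> None:
        # Phase (ii): corrupt the target value. XOR is an assumed fault model.
        addr = self.params.address
        self.memory[addr] = (self.memory[addr] ^ self.params.mask) & WORD_MASK
        self.injected = True
        self.debug_reg = None  # disarm so the fault is injected only once

    def read(self, addr: int) -> int:
        # Simplification: the handler runs before the value is returned,
        # which does not model real breakpoint timing.
        if self.debug_reg == addr:
            self.breakpoint_handler()
        return self.memory[addr]


kernel = ToyKernel(memory={0xC0123456: 0x00000001})
kernel.setup_injection(InjectionParams(address=0xC0123456, mask=0x00000010))
corrupted = kernel.read(0xC0123456)  # 0x00000001 ^ 0x00000010 = 0x00000011
```

Disarming the debug register inside the handler keeps the sketch to a single injection per request. Whether the real tool re-arms the breakpoint is not stated in the text.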
2.1 Moving to Multiprocessor Systems