Dependability Assessment of Operating Systems in Multi-Core Architectures

Gabriela Jacques-Silva∗, Zbigniew Kalbarczyk, Ravishankar K. Iyer

Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign, IL 61801
{gjsilva,kalbar,rkiyer}@crhc.uiuc.edu
∗ Sponsored by CAPES/Brazil

1 Introduction

The dependability of a service is directly influenced by the reliability provided by the lower system layers, such as the operating system and hardware. Due to device and voltage scaling, and the increasing complexity of digital systems, transient errors are forecast to be a problem for all future digital systems [4]. When we consider large-scale machines, with many processors and cores, this phenomenon is further exacerbated. Therefore, it is mandatory to evaluate the behavior of systems under such conditions.

In this context, understanding the way errors manifest and are handled by the platform operating system layer gives an indication of the reliability of the system and also of how these errors affect user-level applications.

The robustness and fault sensitivity of operating systems have been extensively studied [2, 3, 1]. There are several challenges in developing a fault injection framework for multi-core systems: (i) it is likely that we have more than one error at a time, possibly occurring in different processors/cores; (ii) there is no direct control over where the fault injector is going to be executed or where the workload is being placed by the operating system scheduler; (iii) a fault that is injected in one processor can manifest in another processor. These issues make it difficult to accurately measure dependability metrics, such as crash latency. The use of standard techniques, such as performance counters, is not trivial, given that we would have to configure the performance registers in all processors at the same time. This scenario poses a new set of questions that have not been addressed by the current approaches: Is the error behavior different with errors occurring in multiple processors/cores? How do latent errors influence the system? Do errors propagate between different processors or cores?

In addressing these questions, this work describes the development of an experimental environment that enables fault injection based evaluation of operating systems, particularly Linux, for multi-core and multiprocessor architectures.

2 NFTAPE Operating System Injection

In this study we leverage the NFTAPE fault injection testing environment [1]. The current NFTAPE implementation for Operating System (OS) assessment consists of two machines: (i) the target machine, which hosts the operating system under study and runs the process manager application that performs the actual fault injection; and (ii) a control host machine (different from the target), which controls the experiment by issuing commands to the target machine, via the process manager, and collecting data about the injections.

Conducting an OS injection campaign involves two phases: (i) setting up the injection, and (ii) performing the injection itself, i.e., executing instructions that corrupt a specific value in memory or registers. Figure 1 depicts how the injection works. The control host issues a request to inject a fault to the process manager. The process manager invokes an injector program, which is a user-level program. This process receives the injection parameters and passes them to the NFTAPE kernel module. The kernel module writes the parameters into kernel data structures added specially for fault injection. It injects faults by using breakpoint registers. When the module receives the address of the target data/instruction for the injection, it writes this value into the processor debug register. When the target data/instruction at this address is accessed/executed, the kernel executes the breakpoint handler. The breakpoint handler is instrumented with fault injection instructions, which are used to corrupt the value according to a given fault mask.

Figure 1. Target machine procedure [diagram: a request from the control host reaches the process manager, which invokes the injector with address 0xC0123456 and mask 0x00000010; the injector passes these to the kernel module, which calls write_debugReg(0xC0123456) for a location in the kernel text]
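The injection path of Section 2 can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not the NFTAPE implementation: the names `flip_bits` and `SimulatedDebugTarget` are hypothetical, memory is modeled as a Python dictionary, the fault mask is assumed to be applied with XOR (consistent with a bit-flip fault model), and the fault is assumed to be injected once, on the first access to the watched address. The address and mask are the ones shown in Figure 1; the stored values are made up for illustration.

```python
def flip_bits(value, fault_mask, width=32):
    """Flip every bit of `value` selected by `fault_mask` (assumed XOR semantics)."""
    return (value ^ fault_mask) & ((1 << width) - 1)


class SimulatedDebugTarget:
    """Toy model: one debug register watching a dict that stands in for memory."""

    def __init__(self, memory):
        self.memory = memory
        self.debug_reg = None    # address currently watched, if any
        self.fault_mask = 0
        self.activated = False   # becomes True once the fault has been injected

    def write_debug_reg(self, address, fault_mask):
        # Stands in for the kernel module storing the injection parameters.
        self.debug_reg = address
        self.fault_mask = fault_mask

    def access(self, address):
        # Stands in for the breakpoint handler: the first access to the watched
        # address corrupts the stored value (simplifying assumption).
        if address == self.debug_reg and not self.activated:
            self.memory[address] = flip_bits(self.memory[address], self.fault_mask)
            self.activated = True
        return self.memory[address]


# Hypothetical memory contents; address and mask taken from Figure 1.
memory = {0xC0123456: 0x0000FFFF, 0xC0123460: 0x00000001}
target = SimulatedDebugTarget(memory)
target.write_debug_reg(0xC0123456, 0x00000010)
corrupted = target.access(0xC0123456)  # bit 4 of the stored value is flipped
```

Tracking `activated` mirrors the kind of fine-grained data (whether a fault was activated) that debug registers make available; accesses to other addresses leave memory untouched.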
2.1 Moving to Multiprocessor Systems

Current injection tools typically make use of debug registers to obtain fine-grained data about injection experiments (e.g., whether a fault is activated). In a multi-core architecture, each CPU/core has its own register set. Therefore, we have to find a way to set the debug registers in a specific CPU and enable the fault injection triggering. We have to select and force multiple CPUs to trigger a fault. Since we want to emulate the occurrence of multiple faults, we have to set different breakpoint addresses in different CPUs. This means that we need to force the execution of the injection process in different CPUs. That way, when we invoke the kernel module, it sets the registers in the CPU that is executing the injection process.

To enforce fault triggering in multiple processors, we take advantage of a set of system calls provided by the Linux kernel. Linux allows a process to set and get a CPU affinity mask. This affinity mask determines which CPUs a process is allowed to run on. The affinity is also passed along to any forked child.

Figure 2 shows how we take advantage of such system calls. When the process manager receives an injection request from the control host, it instantiates a core dispatcher. Since the core dispatcher is forked by the process manager, they execute in the same CPU. If the injection is targeting processors 0 and 2, we add both processors to the core dispatcher queue. The core dispatcher then forks an injection process in CPU0. This process sets the appropriate debug registers in CPU0 by accessing the kernel module. After the injector is forked, the core dispatcher changes its CPU affinity and relinquishes the current processor, forcing itself to be executed in another CPU (CPU2 in this case). In its next run, the core dispatcher is running on CPU2, and it then dispatches the second injector process. This process sets the appropriate debug registers in CPU2. To ensure that a fault is not injected while the debug registers are being set, we add a global kernel variable that is set only after all the configuration has taken place. If a breakpoint exception occurs while this variable is not set, it is ignored by the fault injection instrumentation.

Figure 2. Forcing breakpoint in different CPUs [diagram: a request from the control host reaches the process manager; the core dispatcher, whose queue holds CPUs 2 and 0, dispatches injector processes into the operating system scheduler queues of CPU0 and CPU2; the system shown has four CPUs, CPU0 to CPU3, each with its own scheduler queue]

2.2 Fault model

The fault model that is currently being considered for the multi-core injection is single/multiple bit flips. However, here we assume that they can happen more than once during a single experiment run. The current targets for injections in the OS are the following:

Kernel Text - injection into kernel instructions. Since in multiprocessor architectures the kernel code section is shared, we can have two types of injections, depending on the fault location: (i) a fault that occurs in memory and corrupts an instruction, making the fault visible to all the processors in the system; (ii) a fault that occurs on the pipeline, bus or in the instruction cache of a particular processor, affecting one processor only.

Kernel Stack - injection into a valid kernel stack range of a process running on the target CPUs.

Global Kernel Data - injection into a global kernel data structure, which is visible to all the processors in the system.

3 Future work

This fast abstract describes a first step towards a framework to evaluate operating system behavior in multi-core environments. In this scenario, we have to consider the possibility that more than one error can occur in a short period of time. We aim at verifying whether the failure behavior of operating systems is different when running in environments that are susceptible to higher rates of transient errors, such as multi-core architectures. Other issues to be addressed are the execution and distribution of workload in different processors, and the collection of data for dependability metrics, such as crash latency. Such a metric is important for estimating error propagation.

References

[1] W. Gu, Z. Kalbarczyk, and R. K. Iyer. Error sensitivity of the Linux kernel executing on PowerPC G4 and Pentium 4 processors. In Proc. of DSN 2004, Washington, DC, USA, 2004.
[2] T. Jarboui, J. Arlat, Y. Crouzet, K. Kanoun, and T. Marteau. Analysis of the effects of real and injected software faults: Linux as a case study. In Proc. of PRDC2002, pages 51–58, 16-18 Dec. 2002.
[3] P. Koopman and J. DeVale. The exception handling effectiveness of POSIX operating systems. IEEE Trans. Softw. Eng., 26(9):837–848, 2000.
[4] N. J. Wang and S. J. Patel. ReStore: Symptom-based soft error detection in microprocessors. IEEE Trans. Dependable Secur. Comput., 3(3):188–201, 2006.
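As an illustration of the dispatch sequence described in Section 2.1, the following user-space sketch uses the Linux CPU affinity calls exposed by Python's `os` module. It is not the authors' tool: the function name `dispatch_injectors` is hypothetical, the dispatcher here pins itself to each target CPU before forking (a slight simplification of the fork-then-migrate order in the text), and the step where a real injector would pass the address and mask to the kernel module is only a comment. Each child reports the affinity mask it inherited, so the caller can confirm where it was allowed to run.

```python
import os


def dispatch_injectors(target_cpus):
    """Fork one injector child per target CPU.

    Returns {cpu: set of CPUs the child forked for that cpu was allowed to run on}.
    """
    original = os.sched_getaffinity(0)
    observed = {}
    try:
        for cpu in target_cpus:
            # Restrict the dispatcher to the target CPU, then relinquish the
            # processor so that it next runs there.
            os.sched_setaffinity(0, {cpu})
            os.sched_yield()

            read_fd, write_fd = os.pipe()
            pid = os.fork()
            if pid == 0:
                # Injector child: it inherits the dispatcher's affinity mask.
                try:
                    os.close(read_fd)
                    inherited = sorted(os.sched_getaffinity(0))
                    # A real injector would configure the debug registers of
                    # this CPU through the kernel module here (not modeled).
                    os.write(write_fd, ",".join(map(str, inherited)).encode())
                finally:
                    os._exit(0)

            # Dispatcher: collect the child's report and reap it.
            os.close(write_fd)
            with os.fdopen(read_fd) as pipe:
                report = pipe.read()
            os.waitpid(pid, 0)
            observed[cpu] = {int(c) for c in report.split(",")}
    finally:
        # Restore the dispatcher's original affinity mask.
        os.sched_setaffinity(0, original)
    return observed
```

On a Linux machine where CPUs 0 and 2 are both available to the process, `dispatch_injectors([0, 2])` mirrors the example in the text, and each child should report a mask containing only its own target CPU. The global kernel variable that arms the injection only after all registers are configured has no counterpart in this sketch.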
