An Analysis of The AUTOSAROS Timing Protection Mechanism
An Analysis of The AUTOSAROS Timing Protection Mechanism
Abstract from each other in the space (access to memories and de-
vices) and time (access to the CPU) domains.
The in-vehicle embedded system market is evolving to- The operating system kernel specified in the AU-
ward a large improvement of the industrialization of the TOSAR OS standard [2] extends the set of services pro-
embedded software. One of the technical consequences vided by the OSEK/VDX OS standard (widely used in the
of this evolution is the mandatory integration of protec- automotive domain) with several protection mechanisms.
tion mechanisms in the embedded operating system ker- We have realized an evaluation of this mechanism and re-
nels to support the design of multi-suppliers multi-critical port our main conclusions in this paper.
component-based embedded software. In this paper, we The paper is organized as follow: in section 2, we de-
evaluate such a mechanism: the timing protection mecha- scribe the kind of faults that the mechanism shall be able
nism proposed in the AUTOSAR OS standard. This eval- to detect and confine; in section 3, we briefly describe
uation shows that the present version of the mechanism is similar mechanisms used in comparable RTOS; in section
not fully adapted to multi-critical systems because it does 4, we describe the mechanism; in section 5, we compare
not handle soft/non real-time applications. the mechanism with the other proposals; in section 6, we
summarizes the results of a set of simulation that high-
light some limitations of the mechanism ; in section 7, we
1. Introduction propose and evaluate two techniques for configuring the
mechanism; in section 8, we conclude.
Two emerging families of standards are deeply impact-
ing the in-vehicle embedded system market: AUTOSAR 2. Fault, error and failure in the time domain
and ISO 26262.
The AUTOSAR (AUTomotive Open System ARchi- A timing protection mechanism is useful because faults
tecture) family of standards specify the organization of (occurring at design time or at run time) introduce latent
modern in-vehicle embedded software systems so as to errors in the system, some of which can result in failures
ease up the design of multi-supplier systems, leverage the in the time domain. We adopt the following definition of
re-use of standardized off the shelf components and in- a failure in the time domain: a service fails in the time
crease the overall flexibility of the design process of these domain if its response time exceeds its deadline.
systems. Several faults can be sources of failures in the time do-
The ISO26262 family of standards define the de- main.
pendability requirements of in-vehicle embedded systems A first category consists of the faults that occur at run-
through the concept of Automotive Safety Integrity Level time and target hardware components. Let us take some
(ASIL, ranging from A, the least critical, to D, the most exemples. A failing sensor can adopt a babbling idiot be-
critical). havior and continuously send interruption requests to the
These on-going works illustrate the evolution of the CPU. As a consequence, the associated interrupt service
in-vehicle embedded system market toward the industri- routine is activated at a higher frequency than expected.
alization of the software. Modern embedded software are A bit flip hitting the program counter or the instruction
made up of components designed by different suppliers, register or a variable involved in the control structure of
involved in functions of different criticality levels. This the running program can lead the active job to execute an
evolution has important technical consequences on the ba- unwanted sequence of instructions (among other effects).
sic software components, especially the operating system As a consequence, the execution time of the job targeted
kernel that must be able to insulate the components one by the fault can exceed its expected WCET. Usually, faults
∗ Work partly supported by project SCARLET financed by ANR of this category produce errors in several domain (for in-
(Agence National de la Recherche, programme PREDIT). stance, a corruption of the instruction register could also
First, the timing protection mechanism of AUTOSAR n ≤ 18, all tasks have different periods). The possible
OS prevents the propagation of faults in the time domain, values are chosen with an algorithm based on [12] (in our
in the sense that a misbehaving job cannot lengthen the re- study T = 2u2 × 3u3 × 5u5 with u2 , u3 ∈ {0, 1, 2} and
sponse time of other jobs in the system more than an off- u5 ∈ {0, 1} in order to obtain a sufficient and representa-
line computed bound (see for instance [3] or [13] to find tive diversity of values). The WCET is set by one of the
algorithms that compute upper bounds on the response three algorithms presented in [4] (UScaling, UFitting and
times of the jobs in a system scheduled according to AU- UUniFast). The deadlines are set so that the ratio Di /Ti
TOSAR OS scheduling policy based on the knowledge of is constant. The priorities are set according to the dead-
their minimal inter-arrival time, their maximal execution line monotonic algorithm (known to be optimal for this
time and their worst-case locking time). As highlighted type of system as long as Di /Ti ≤ 1). Thus, the be-
in [15] and contrary to the RTOS cited in section 3, this havior of the generator is controlled by setting three pa-
does not provide a true insulation of jobs in the time do- rameters: n (number of tasks), U (system utilization, i.e.
n
main: the behavior of a job still interfere with the behavior i=1 Ci /Ti ) and D/T (hardness of the real-time con-
P
of the other jobs in the system. straints).
Second, in the case of the activation of an error in the Table 2 shows the maximal value, minimal value, mean
time domain (over-activation or over-run), an AUTOSAR value and standard deviation obtained for Ci and Ci /Di
OS kernel kills the faulty job. The solution is rather strict, on 50 runs of the generator configured with n = 10, U =
especially when it targets non hard real-time applications. 0.9 and Di /Ti ∈ {1, 0.75, 0.67}.
On the contrary, RTOS based on partition scheduling (AR-
INC 653, COS or DEOS for instance) offer a better sup- Di /Ti max min m σ
port: when the execution time of a job exceeds its ex- Ti * 180 1 34,35 49,09
pected limit, the reaction is local to the partition, i.e. the
application. A hard real-time application will certainly 1 0.3300 0.0004 0.0889 0.0721
Ci /Di 0.75 0.4400 0.0006 0.1185 0.0961
kill the job and trigger a recovery function, whereas a
0.67 0.4925 0.0007 0.1327 0.1076
soft or non real-time application can give the job a sec-
ond chance to complete (this is especially efficient if the
Table 2. Behavior of the generator config-
RTOS is capable of exploiting the slack time existing in
ured with n = 10, U = 0.9 and Di /Ti ∈
the schedule as in DEOS). This feature may cause prob-
{1, 0.75, 0.67}.
lems in the emerging use-case of multi-ASIL ECU. In the
follow-up, we investigate this problem. We first show that
a naïve configuration of the timing protection mechanism
The simulator is based on the TrueTime Mat-
can lead to an important decrease of the QoS of the system lab/Simulink toolbox [9], with a few modifications to im-
in some circumstances. Then, we propose and evaluate plement the simulation of the timing protection mecha-
some alternative solutions. nism. Concerning the simulation of faults, we have cho-
sen to only consider WCET underestimation because this
6. Evaluation of the mechanism type of fault is the most likely to happen (in the context
of the automotive domain). It is impossible to simulate
The mechanism is implemented in a simulation tool. the system in its real lifetime, thus the apparition rate of
The tool has two parts : the architecture generator and the fault has to be artificially augmented. As we want to stress
simulator. the timing protection mechanism, we also have to increase
The architecture generator generates task sets. Each the importance of the fault (i.e. the difference between the
set is composed of n tasks where each task τi is charac- estimated WCET and the effective WCET). Thus, the ex-
terized by four parameters: its period Ti , its WCET Ci , ecution time of each job of task τi is chosen in an inter-
its deadline Di and its priority Pi . Every Ti time units, val [cmin Ci , cmax Ci ] according to a uniform distribution
the task creates a job that must access the CPU during (where cmin and cmax are parameters of the simulation).
Ci time units within the next Di time units. The period Other distributions have been tested but the results were
of each task is randomly chosen in a list of 18 values (if less significant and the uniform distribution is more sim-
ple to handle. 7. Improvements
In the first experiment, the generator is configured with
n = 10, U = 0.9 and Di /Ti = 0.67. The simulator is The timing protection mechanism proposed by AU-
configured with cmin = 0.95 and cmax = 1.15: actual TOSAR OS does not use a slack stealing mechanism.
execution times can be lower or greater than the estimated Thus, it cannot use dynamic slack time, i.e. time reserved
WCET. Each simulation is stopped after the execution of for some critical jobs but not used. Nevertheless, if prop-
5000 jobs. 50 architectures have been generated and simu- erly configured, it can use static slack time, i.e. time not
lated with and without timing protection. The timing pro- reserved by critical jobs. In this section, we expose how
tection mechanism is configured with TIMEFRAME=Ti this can be achieved by using the results of Bini and co-
and EXECUTIONBUDGET=Ci (we consider indepen- authors on sensitivity analysis of fixed priority preemptive
dant tasks, thus we are not interested in the locking time systems [6], and we measure by simulation the benefits
budgets). When the timing protection is active, we mea- obtained from this approach.
sure the number of jobs killed before their termination.
When the timing protection is inactive, we measure the 7.1. C-space computation for fixed priority preemptive
number of response time violation. The results are given systems
on figure 2. The number of jobs killed in each run is plot- In [5], Bini and co-authors give a necessary and suf-
ted with the circle line (left scale). The number of job ficient schedulability condition for fixed priority preemp-
missing their deadline is plotted with the solid line (right tive systems where tasks are independent and jobs release
scale). The dashed lines represent the mean values. The times are unknown (the notation assumes that tasks are
results show that the vast majority of the faults does not in- ordered by decreasing priority) :
duce a failure. More precisely, the mean number of dead-
line violation per simulation is 1.48, whereas the mean Theorem 1 [5] A periodic independent task set T is
number of overrun is 3887. Most of the time, there is schedulable on a FPP scheduling if and only if
enough slack time available in the system to cancel these i−1
faults. Table 3 summarizes the results obtained for differ- X t
∀i ∈ [1, n], ∃t ∈ schedPi , Ci + Cj ≤ t
ent values of Di /Ti . As expected, the number of deadline Tj
j=1
violation decreases when the constraint is relaxed.
where schedPi is a set of scheduling points defined as
schedPi = Pi−1 (Di ), and Pi is defined as follows
(
P0 (t) = {t} j k
Pi (t) = Pi−1 Tti Ti ∪ Pi−1 (t)
B = (1 + λ) . C (1)
Di /Ti max min m σ
1 2566 752 1650 536
Overruns 0.75 3361 1544 2380 400
0.67 3378 1644 2420 86
1 2 0 0.04 0.28
Dead. miss 0.75 11 0 1.28 2.69
0.67 13 0 1.48 2.89
where λ ≥ 0 is a factor which increase the execution observe as expected that the better results are obtained for
budget of each task proportionally to its WCET. the higher values of Di /Ti , because these configurations
have more static slack time available. When Di /Ti = 1,
The factor λ can be computed as follow [6]: the average number of overruns drops to 1650 instead of
t − ni . C i 3887 with implicit budgets.
λ = min max (2)
i=1,...,n t∈schedPi ni . C i
l m l m l m 7.3. Multi-critical systems
where ni = t t t
and The benefits obtained from the proportional distribu-
T1 , T2 , . . . , Ti−1 , 1
Ci = {Cj }j=1,...,i . tion of static slack times into tasks budgets come to the
cost of a delay in the detection of faults that cannot be can-
Let us apply this approach to our example. The esti- celled naturally by the system (e.g. hardware faults). For
mated WCET are C1 = 4 and C2 = 6 (point S4 on fig 3). the most critical tasks whose WCET estimation is safe,
We obtain λ = 5/14. Thus B1 and B2 can be chosen re- this is a problem. To alleviate this problem, we propose to
spectively equal to 5.43 and 8.14. The budget of each task take into account the ASIL level of each task in the com-
integrates a part of the static slack time that can be used to putation of λ.
cancel overrun without jeopardizing the system. On the one hand, the WCET estimations for non
We have run again our simulations with the same pa- critical tasks are not safe, thus their execution budgets
rameters, but this time, the EXECUTIONBUDGET pa- shall be large, to allow cancellation of overruns. On
rameters are set as exposed above, so as to integrate a part the other hand, the WCET estimations for critical tasks
of the static slack time. The results are given in figure 4 are safe, thus their execution budgets shall stick to the
and table 4. estimation, to allow the fault detection and recovery
mechanism to be triggered as early as possible. To reflect
this situation, each task τi is associated with a weight wi
used to compute the execution budget (the more critical
the task, the smaller the weight). The sensitivity analysis
can be reformulated to take this new parameter into
account [6].
8. Conclusions