
A Comparison of Scheduling Latency in Linux, PREEMPT RT, and LITMUS^RT

Felipe Cerqueira and Björn B. Brandenburg
Max Planck Institute for Software Systems (MPI-SWS)

Abstract

Scheduling latency under Linux and its principal real-time variant, the PREEMPT RT patch, is typically measured using cyclictest, a tracing tool that treats the kernel as a black box and directly reports scheduling latency. LITMUS^RT, a real-time extension of Linux focused on algorithmic improvements, is typically evaluated using Feather-Trace, a fine-grained tracing mechanism that produces a comprehensive overhead profile suitable for overhead-aware schedulability analysis. This difference in tracing tools and output has to date prevented a direct comparison. This paper reports on a port of cyclictest to LITMUS^RT and a case study comparing scheduling latency on a 16-core Intel platform. The main conclusions are: (i) LITMUS^RT introduces only minor overhead itself, but (ii) it also inherits mainline Linux's severe limitations in the presence of I/O-bound background tasks.

1 Introduction

Real-time tasks are usually activated in response to external events (e.g., when a sensor triggers) or by periodic timer expirations (e.g., once every millisecond). At the kernel level, both types of activations require the following sequence of steps to be carried out:

1. the processor transfers control to an interrupt handler to react to the device or timer interrupt;
2. the interrupt handler identifies the task waiting for the event and resumes it, which causes the task to be added to the ready queue;
3. the scheduler is invoked to check whether the resumed real-time task should be scheduled immediately (and, if so, on which processor); and finally,
4. if the resumed real-time task has higher priority than the currently executing task, then it is dispatched, which requires a context switch.

In theory, the highest-priority real-time task should be scheduled immediately when its activating event occurs, but in practice, Step 1 is delayed if interrupts are temporarily masked by critical sections in the kernel (in Linux, via the local_irq_disable() interface), Steps 2-4 are delayed by cache misses, contention for memory bandwidth, and (in multiprocessor systems) lock contention, Step 3 is further delayed if preemptions are temporarily disabled by critical sections in the kernel (in Linux, via the preempt_disable() interface), and Step 4 generally causes a TLB flush (on platforms without a tagged TLB), which causes additional delays. Thus, even for the highest-priority task, there is always a delay between the activating event and the instant when the task starts executing. This delay, called scheduling latency, affects the response times of all tasks and imposes a lower bound on the deadlines that can be supported by the system. Therefore, it is essential to consider scheduling latency when determining whether a system can provide the desired temporal guarantees. The focus of this paper is an empirical evaluation of scheduling latency in Linux and two of its real-time variants, PREEMPT RT and LITMUS^RT, using cyclictest, a latency benchmark.

PREEMPT RT and cyclictest. The importance of scheduling latency has made it a standard metric for the evaluation of real-time operating systems in general, and Linux and its real-time extensions in particular. Concerning the latter, the PREEMPT RT patch, the de facto standard real-time variant of Linux, specifically aims to improve scheduling latency by reducing the number and the length of critical sections in the kernel that mask interrupts or disable preemptions [21, 27]. The efficacy of these changes is commonly quantified using cyclictest, a scheduling latency benchmark originally created by Thomas Gleixner and currently maintained by Clark Williams [2].

A key feature of cyclictest is that it is easy to use, and as a result it has been widely adopted as the universal benchmark of real-time performance under Linux. For instance, it has been applied to evaluate different hardware and software platforms (e.g., [15, 25, 26, 29]) as well as various aspects of the kernel (e.g., [16, 22, 24, 31]); the cyclictest approach has even been extended to continuously monitor the scheduling latency of real applications in a production environment [20]. In short, low scheduling latency as reported by cyclictest can be considered the gold standard of real-time performance in the Linux real-time community.

LITMUS^RT. In contrast to the PREEMPT RT project, which has been primarily driven by industry concerns, the Linux Testbed for Multiprocessor Scheduling in Real-Time Systems [1, 10, 13, 14] is a primarily algorithms-oriented real-time extension of Linux. While PREEMPT RT aggressively optimizes scheduling latency, LITMUS^RT facilitates the implementation and evaluation of novel scheduling and locking policies, which it does by introducing a scheduler plugin interface.
That is, the PREEMPT RT patch reengineers core parts of the kernel to avoid delaying Steps 1 and 3, whereas LITMUS^RT primarily modularizes the scheduling logic that is invoked in Step 3, and leaves other aspects of the kernel unchanged.

LITMUS^RT has served its purpose well and has enabled a number of studies exploring the tradeoff between system overheads and analytical temporal guarantees across a diverse range of scheduling approaches (e.g., see [5, 7, 14, 19, 23]; a full list is provided online [1]). The key component of LITMUS^RT that enables such studies is Feather-Trace [11], a light-weight event tracing toolkit for x86 platforms that is used to collect fine-grained measurements of kernel overheads. With Feather-Trace, it is possible to extract detailed overhead profiles, which can then be incorporated into overhead-aware schedulability analysis to formally validate timing requirements under consideration of system overheads (this process is discussed in detail in [10, Ch. 4]).

The overhead characteristics of LITMUS^RT's scheduling policies are well documented and have been evaluated in detail in prior work [5, 10]. However, because cyclictest cannot be used (without modifications) to evaluate LITMUS^RT (see Sec. 2), and because Feather-Trace and cyclictest produce fundamentally different outputs (see Sec. 4), to date it has unfortunately not been possible to directly compare LITMUS^RT with Linux and the PREEMPT RT patch.

In this paper, we present a comprehensive experimental evaluation of scheduling latency under LITMUS^RT based on data obtained with a ported version of cyclictest. By comparing LITMUS^RT against stock Linux and PREEMPT RT, we seek to address two questions concerning the real-time capabilities of the current LITMUS^RT implementation:

1. Does the LITMUS^RT scheduling framework introduce a significant overhead in terms of scheduling latency? And:
2. To what extent is LITMUS^RT affected by high scheduling latencies due to not (yet) incorporating the improvements of the PREEMPT RT patch?

To answer the first question, we compared LITMUS^RT against the underlying Linux version; to answer the second question, we compared LITMUS^RT against the latest stable version of the PREEMPT RT patch.

The rest of the paper is organized as follows. Sec. 2 reviews how cyclictest operates and describes a faithful port of cyclictest to LITMUS^RT. In Sec. 3, we present our experimental setup and discuss our results. In Sec. 4, we compare cyclictest with Feather-Trace and remark on some of the advantages and limitations of the cyclictest benchmark. In Sec. 5, we conclude and mention future work directions.

2 Porting cyclictest to LITMUS^RT

While cyclictest is a remarkably versatile utility, it cannot be applied to LITMUS^RT out of the box, since LITMUS^RT provides its own, non-standard system call and userspace API, which must be used to configure a real-time task's scheduling parameters. In this section, we discuss the (few) changes that were required to port cyclictest to LITMUS^RT.

Since the validity of the measurements depends on a correct and unbiased application of cyclictest, we begin with a review of cyclictest in Sec. 2.1 and explain how its scheduling parameters are configured under stock Linux and PREEMPT RT in Sec. 2.2. In Sec. 2.3, we review LITMUS^RT's task model and show how cyclictest was mapped to it. Finally, Sec. 2.4 briefly discusses a simple, but unexpected problem with the resolution of Linux's one-shot timers that we encountered during the experiments.

2.1 An Overview of cyclictest

The execution of cyclictest can be divided into three phases. During the initialization phase, the program creates a configurable number of threads (according to the specified parameters), which are then admitted as real-time tasks. The processor affinity mask is also set, which enables migration to be restricted. After that, each thread starts a periodic (i.e., cyclic) execution phase, during which cyclictest executes the main measurement loop. An iteration (i.e., one test cycle) starts when the thread's associated one-shot timer expires. (cyclictest supports a large number of options and can be configured to use different timer APIs; we focus herein on the tool's default behavior.) Once the thread resumes, a sample of scheduling latency is recorded as the difference between the current time and the instant when the timer should have fired. The timer is then rearmed to start a new iteration and the thread suspends. After a configurable duration, the thread is demoted to best-effort status and exits, and the recorded scheduling latency samples are written to disk.

The cyclic phase, during which the measurements are collected, uses only standard userspace libraries (for example, POSIX APIs to set up timers, synchronize threads, etc.) and does not rely on scheduler-specific functionality. Since LITMUS^RT maintains compatibility with most userspace APIs, only the code pertaining to task admission and exit must be adapted, i.e., it suffices to replace sched_setscheduler() system calls with LITMUS^RT's library functions for task admission. Importantly, cyclictest's core measurement loop does not have to be changed, which helps to avoid the inadvertent introduction of any bias.
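To make the measurement loop described above concrete, the following is a minimal sketch (our illustration, not cyclictest's actual source): the thread sleeps until an absolute deadline and then records by how much its wake-up overshot that deadline. It assumes cyclictest's default one-shot timer mode and CLOCK_MONOTONIC.

    /* Minimal cyclictest-style measurement loop (illustrative sketch). */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define NSEC_PER_SEC 1000000000L

    static void timespec_add_us(struct timespec *t, long us)
    {
        t->tv_nsec += us * 1000L;
        while (t->tv_nsec >= NSEC_PER_SEC) {
            t->tv_nsec -= NSEC_PER_SEC;
            t->tv_sec++;
        }
    }

    static int64_t timespec_diff_ns(struct timespec a, struct timespec b)
    {
        return (int64_t)(a.tv_sec - b.tv_sec) * NSEC_PER_SEC
            + (a.tv_nsec - b.tv_nsec);
    }

    int main(void)
    {
        struct timespec next, now;
        const long interval_us = 1000; /* I = 1000 us, as in Sec. 3 */
        int i;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (i = 0; i < 10000; i++) {
            /* arm the one-shot timer: wake up at absolute time 'next' */
            timespec_add_us(&next, interval_us);
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            /* latency sample: actual minus intended wake-up time */
            clock_gettime(CLOCK_MONOTONIC, &now);
            printf("%lld\n", (long long)timespec_diff_ns(now, next));
        }
        return 0;
    }

Because only standard POSIX calls appear in the loop, code of this kind runs unmodified under SCHED_OTHER, SCHED_FIFO, and, once admitted via the LITMUS^RT API, under the LITMUS^RT plugins.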
The required modifications, which amount to only 18 lines of new code, are discussed in Sec. 2.3 below. To illustrate how LITMUS^RT's interface differs from stock Linux's interface, we first review how cyclictest is configured as a real-time task in Linux (either with or without the PREEMPT RT patch).

2.2 Setting Task Parameters under Linux

In accordance with the POSIX standard, Linux implements a fixed-priority real-time scheduler (with 99 distinct priority levels). Tasks are dispatched in order of decreasing priority, and ties in priority are broken according to one of two policies: under the SCHED_RR policy, tasks of equal priority alternate using a simple round-robin scheme, and, under the SCHED_FIFO policy, multiple tasks with the same priority are simply scheduled in FIFO order with respect to the time at which they were enqueued into the ready queue. By default, tasks are allowed to migrate among all processors in Linux, but it is possible to restrict migrations using processor affinity masks, and cyclictest optionally does so.
Fig. 1 summarizes the relevant code from cyclictest that is used to admit a thread as a SCHED_FIFO real-time task under Linux. First, the thread defines the CPU affinity mask, which is assigned with pthread_setaffinity_np() (lines 6-13 in Fig. 1). To attain real-time priority, the thread calls the sched_setscheduler() system call with the desired scheduling policy and priority as parameters (lines 17-20). Finally, after the measurement phase, the thread transitions back to non-real-time status by reverting to the SCHED_OTHER best-effort policy (lines 24-25).

     1  struct thread_param *par;
     2
     3  /* par contains the cyclictest configuration
     4   * and thread parameters */
     5
     6  if (par->cpu != -1) {
     7      CPU_ZERO(&mask);
     8      CPU_SET(par->cpu, &mask);
     9      thread = pthread_self();
    10      if (pthread_setaffinity_np(thread,
    11              sizeof(mask), &mask) == -1)
    12          warn("Could not set CPU affinity");
    13  }
    14
    15  /* ... */
    16
    17  memset(&schedp, 0, sizeof(schedp));
    18  schedp.sched_priority = par->prio;
    19  if (setscheduler(0, par->policy, &schedp))
    20      fatal("Failed to set priority\n");
    21
    22  /* measurement phase */
    23
    24  schedp.sched_priority = 0;
    25  sched_setscheduler(0, SCHED_OTHER, &schedp);

Figure 1: Task admission in Linux (original cyclictest).

In the experiments discussed in Sec. 3, cyclictest was configured to spawn one thread per core and to use the SCHED_FIFO policy with the maximum priority of 99. Further, processor affinity masks were assigned to fix each measurement thread to a dedicated core. Since there is only one task per core, the choice of tie-breaking policy is irrelevant.

Executing the initialization sequence depicted in Fig. 1 under LITMUS^RT would not result in an error (LITMUS^RT does not disable SCHED_FIFO); however, it would also not achieve the desired effect because, for historical reasons, real-time tasks must use a different API (which also allows specifying more explicit and detailed parameters) to attain real-time status in LITMUS^RT. We thus adapted cyclictest to use LITMUS^RT's API to create an analogous setup.

2.3 Setting Task Parameters under LITMUS^RT

LITMUS^RT implements the sporadic task model [28], in which real-time tasks are modeled as a sequence of recurrent jobs and defined by a tuple T_i = (e_i, d_i, p_i), where e_i denotes the worst-case execution time (WCET) of a single job, d_i the relative deadline, and p_i the minimum inter-arrival time (or period). Under LITMUS^RT's event-driven scheduling policies, the parameter e_i is optional and used only for budget enforcement (if enabled). The parameter d_i, however, is required for scheduling plugins based on the earliest-deadline first (EDF) policy, and the parameter p_i is always required to correctly identify individual jobs. In LITMUS^RT, all task parameters are expressed in nanosecond granularity since this is the granularity internally used by the kernel.

As mentioned in Sec. 2.1, each thread in cyclictest executes in a loop, alternating between resuming, measuring latency, and suspending. The wake-up timer is armed periodically, according to a configurable interval I (in microseconds) defined as a parameter. (In fact, cyclictest allows two parameters: i, the timer interval of the first thread, and d, an increment which is added to the interval of each consecutive thread. For example, if i = 1000 and d = 100, cyclictest launches threads with intervals I ∈ {1000, 1100, 1200, ...} µs. For simplicity, we assume a single interval I. By default, and as employed in our experiments, cyclictest uses i = 1000 µs and d = 500 µs.) cyclictest's periodic pattern of execution exactly matches the assumptions underlying the sporadic task model and can thus be trivially expressed with parameters d_i = p_i = I. To avoid having to estimate the per-job (i.e., per-iteration) execution cost of cyclictest, we set e_i to a dummy value of 1 ns and disable budget enforcement, so that each thread can always execute without being throttled by LITMUS^RT.

The code necessary to realize admission of a cyclictest thread as a real-time task under LITMUS^RT is shown in Fig. 2. The first step is defining the task parameters (in particular, e_i, d_i, and p_i) in lines 3-8 and initializing the userspace interface with init_rt_thread(). Budget enforcement is disabled (line 7) and the maximum possible priority is assigned (line 8). LITMUS^RT's fixed-priority plugin currently supports 512 distinct priorities; the priority field is ignored by EDF-based plugins.

Next, if a partitioned scheduler is used, a scheduling approach where each task is statically assigned to a core, the task must specify its assigned processor in the rt_task structure (line 13) and perform this migration (line 14), which is accomplished by calling be_migrate_to(). (The function be_migrate_to() is currently implemented as a wrapper around Linux's processor affinity mask API, but could be extended to incorporate LITMUS^RT-specific functionality in the future; the "be_" prefix stems from the fact that it may be called only by best-effort tasks.) Otherwise, if a global scheduler is used, a scheduling approach where tasks may migrate freely, a processor assignment is not required (and is ignored by the kernel if provided).
     1  struct rt_task rtt;  /* LITMUS^RT API */
     2
     3  init_rt_task_param(&rtt);
     4  rtt.exec_cost = 1;
     5  rtt.period = par->interval * 1000;
     6  rtt.relative_deadline = par->interval * 1000;
     7  rtt.budget_policy = NO_ENFORCEMENT;
     8  rtt.priority = LITMUS_HIGHEST_PRIORITY;
     9
    10  init_rt_thread();
    11
    12  if (par->cpu != -1) {
    13      rtt.cpu = par->cpu;
    14      if (be_migrate_to(par->cpu) < 0)
    15          fatal("Could not set CPU affinity");
    16  }
    17
    18  if (set_rt_task_param(gettid(), &rtt) < 0)
    19      fatal("Failed to set rt_param.");
    20
    21  if (task_mode(LITMUS_RT_TASK) != 0)
    22      fatal("failed to change task mode.\n");
    23
    24  /* measurement phase */
    25
    26  task_mode(BACKGROUND_TASK);

Figure 2: Task admission in LITMUS^RT (modified cyclictest).

The task parameters are stored in the process control block (and validated by the kernel) with the system call set_rt_task_param() in line 18. Finally, task_mode(LITMUS_RT_TASK) is called in line 21, which causes the thread to be admitted to the set of real-time tasks. After the measurement phase, task_mode(BACKGROUND_TASK) is used to give up real-time privileges and return to SCHED_OTHER status.

With these changes in place, cyclictest is provisioned under LITMUS^RT equivalently to the configuration executed under Linux (both with and without the PREEMPT RT patch). This ensures a fair and unbiased comparison.

2.4 The Effect of Timer Resolution on nanosleep()

Despite our efforts to establish a level playing field, we initially observed unexpectedly large scheduling latencies under LITMUS^RT in comparison with SCHED_FIFO, even in an otherwise idle system. This was eventually tracked down to a systematic 50 µs delay of timer interrupts, which was caused by the fact that Linux subjects nanosleep() system calls to timer coalescing to reduce the frequency of wake-ups. As this feature is undesirable for real-time tasks, it is circumvented for SCHED_FIFO and SCHED_RR tasks. A similar exception was introduced for LITMUS^RT, which resolved the discrepancy in expected and observed latencies.

It should be noted that LITMUS^RT provides its own API for periodic job activations, and that this API has never been subject to timer coalescing, as it does not use the nanosleep functionality. The issue arose in our experiments only because we chose to not modify the way in which cyclictest triggers its periodic activations (since we did not want to change the actual measuring code in any way).
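The systematic 50 µs figure is consistent with Linux's default per-thread "timer slack" of 50 µs, which is the knob that governs this coalescing for non-real-time threads (real-time scheduling classes are exempt). As a brief illustration (ours, not part of the modifications described above) of how a regular thread can inspect and minimize its own slack via the standard prctl() interface:

    /* Query and minimize this thread's timer slack (Linux >= 2.6.28). */
    #include <sys/prctl.h>
    #include <stdio.h>

    int main(void)
    {
        printf("default slack: %d ns\n", prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0));
        /* request 1 ns of slack, i.e., effectively no wake-up coalescing */
        if (prctl(PR_SET_TIMERSLACK, 1UL, 0, 0, 0) != 0)
            perror("prctl(PR_SET_TIMERSLACK)");
        printf("current slack: %d ns\n", prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0));
        return 0;
    }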
3 Experiments

We conducted experiments with cyclictest to evaluate the scheduling latency experienced by real-time tasks under LITMUS^RT in comparison with an unmodified Linux kernel. The results were further compared with latencies as observed under Linux with the PREEMPT RT patch. Our testing environment consisted of a 16-core Intel Xeon X7550 2.0 GHz platform with 1 TiB RAM. Features that lead to unpredictability such as hardware multithreading, frequency scaling, and deep sleep states were disabled for all kernels, along with every kernel configuration option associated with tracing or debugging. Background services such as cron were disabled to the extent possible, with the notable exception of the remote login server sshd, for obvious reasons.

We used cyclictest to sample scheduling latency under six different kernel and scheduling policy combinations. Under LITMUS^RT, which is currently still based on Linux 3.0, we focused our analysis on a subset of the event-driven scheduler plugins: the partitioned EDF plugin (PSN-EDF), the global EDF plugin (GSN-EDF), and the partitioned fixed-priority plugin (P-FP). (The "S" and "N" in the plugin names PSN-EDF and GSN-EDF refer to support for predictable suspensions and non-preemptive sections; see [8, 10]. These algorithmic details are irrelevant in the context of this paper.) We did not evaluate LITMUS^RT's Pfair [4] plugin, which implements the PD^2 [3] scheduling policy, since PD^2 is a quantum-driven policy and hence not optimized to achieve low scheduling latency. (Under a quantum-driven scheduler, worst-case scheduling latencies cannot be lower than the quantum length, which in the current version of LITMUS^RT is tied to Linux's scheduler tick and thus is at least 1 ms.)

We further evaluated SCHED_FIFO in three Linux kernels: Linux 3.0 (the stock version, without any patches), Linux 3.8.13 (again, without patches), and Linux 3.8.13 with the PREEMPT RT patch. Though we compare scheduling latencies of two different underlying versions of Linux, both considered versions exhibit a similar latency profile (for which we provide supporting data in Sec. 3.5), so our comparison of LITMUS^RT and PREEMPT RT is valid despite the difference in base kernel versions.

For each scheduling policy and kernel, we varied the set of background tasks to assess scheduling latency in three scenarios: (i) a system with no background workload, (ii) a system with a cache-intensive, CPU-bound background workload, and (iii) a system with an interrupt-intensive, I/O-bound background workload. Of these three scenarios, scenario (i) is clearly the best-case scenario, whereas scenario (iii) puts severe stress onto the system. Scenario (ii) matches the background workload that has been used in prior LITMUS^RT studies (e.g., see [5, 6, 10, 12]).

cyclictest was executed with standard SMP parameters (one thread per processor), with periods in the range of I ∈ {1000 µs, 1500 µs, 2000 µs, ...} and the -m flag enabled, which locks memory pages with mlockall() to prevent page faults.
The result of each execution is a histogram of observed scheduling latencies, where the x-axis represents the measured delay and the y-axis the absolute frequency of the corresponding value, plotted on a log scale. Samples were grouped in buckets of size 1 µs. Each test ran for 20 minutes, generating around 5.85 million samples per configuration.
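As a minimal sketch of this bucketing step (our code, assuming latency samples have already been collected in nanoseconds):

    /* Accumulate latency samples into 1 us-wide histogram buckets. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BUCKETS 1400  /* e.g., 0..1400 us, cf. the x-axis of Fig. 5 */

    static uint64_t histogram[NUM_BUCKETS];

    static void record_sample(int64_t latency_ns)
    {
        int64_t bucket = latency_ns / 1000;  /* bin size = 1 us */
        if (bucket < 0)
            bucket = 0;
        if (bucket >= NUM_BUCKETS)
            bucket = NUM_BUCKETS - 1;  /* clamp outliers into the last bin */
        histogram[bucket]++;
    }

    int main(void)
    {
        int i;
        record_sample(2500);   /* 2.5 us -> bucket 2 */
        record_sample(3100);   /* 3.1 us -> bucket 3 */
        record_sample(75000);  /* 75 us  -> bucket 75 */
        for (i = 0; i < NUM_BUCKETS; i++)
            if (histogram[i])
                printf("%d us: %llu\n", i, (unsigned long long)histogram[i]);
        return 0;
    }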
The outcome of the experiments is depicted in Figs. 3-7 and analyzed in the following sections. In Secs. 3.1-3.3, we first discuss the differences and similarities in scheduling latency incurred under LITMUS^RT's P-FP plugin, Linux 3.0, and Linux 3.8.13 (both with and without the PREEMPT RT patch), and then in Sec. 3.4 we compare the results of the three considered LITMUS^RT scheduler plugins with each other. Finally, Sec. 3.5 compares Linux 3.0 and Linux 3.8.13.

3.1 Idle System

As a first step, we evaluated the latency of the system without background tasks under the P-FP plugin (on LITMUS^RT's Linux 3.0 base), stock Linux 3.0, and Linux 3.8.13 both with and without the PREEMPT RT patch. An idle system represents a best-case scenario: non-timer-related interrupts are rare, kernel code and data are likely to remain cached between scheduler activations, and code segments within the kernel that disable interrupts have only few inputs to process.

As can be seen in Fig. 3, the maximum observed scheduling latency was below 20 µs under each of the four schedulers (insets (a)-(d)), and even below 12 µs under the PREEMPT RT configuration. The maximum observed scheduling latency under stock Linux 3.8.13 is somewhat higher than under both LITMUS^RT and stock Linux 3.0. As high-latency samples occur only rarely, we ascribe this difference to random chance; with a longer sampling duration, latencies of this magnitude would likely be detected under LITMUS^RT and Linux 3.0, too. All three Linux variants exhibited comparable average and median latencies, close to 2.8 µs and 2.6 µs, respectively. Scheduling latencies under LITMUS^RT's P-FP scheduler were slightly higher, with an average and median of roughly 3.4 µs and 3.1 µs, respectively.

Considering the overall shape of the histograms, all four schedulers exhibit similar trends. Slight differences are visible in PREEMPT RT's histogram, which resembles the results for the other Linux versions in the initial 0-5 µs interval, but lacks higher latency samples. This suggests that PREEMPT RT avoids outliers even in best-case scenarios.

Overall, as there are not many sources of latency in the absence of a background workload (such as the disabling of interrupts and contention at the hardware level), the observed scheduling latencies are suitably low under each tested kernel. While the LITMUS^RT patch does appear to increase latencies slightly on average, it does not substantially alter the underlying latency profile of Linux, as other factors dominate. From this, we conclude that the LITMUS^RT framework does not inherently introduce undue complexity and overheads.

Next, we discuss the effects of increased processor and cache contention.

3.2 CPU-bound Background Workload

Kernel overheads in general, and thus also the scheduling latency experienced by real-time tasks, vary depending on the contention for limited hardware resources such as memory bandwidth and shared caches. A cache-intensive, CPU-bound background workload can thus be expected to result in worsened latencies, as cache misses and contention in the memory hierarchy are more likely to occur. To evaluate such a scenario, we executed cyclictest along with CPU-bound background tasks. For each processor, we instantiated a task consisting of a tight loop accessing random memory locations to generate cache misses and contention. Each such task was configured to have a working set of 20 MiB, to exceed the size of each processor's exclusive L2 cache, which has a capacity of 18 MiB. This causes significant cache pressure. Fig. 4 shows the recorded latencies for P-FP, Linux 3.0, and Linux 3.8.13 (with and without PREEMPT RT).
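The paper does not list the source of this background task, so the following sketch (ours) merely illustrates one way to realize a tight loop with the stated 20 MiB working set and random access pattern:

    /* CPU-bound cache thrasher (illustrative): touch random locations in
     * a 20 MiB working set so that accesses routinely miss in an 18 MiB
     * cache. One instance would be pinned to each of the 16 cores. */
    #include <stdint.h>
    #include <stdlib.h>

    #define WSS (20u * 1024u * 1024u)  /* 20 MiB working set */

    int main(void)
    {
        volatile uint8_t *ws = malloc(WSS);
        uint64_t x = 88172645463325252ULL;  /* xorshift64 PRNG state */

        if (!ws)
            return 1;
        for (;;) {
            x ^= x << 13;  /* cheap pseudo-random index generator */
            x ^= x >> 7;
            x ^= x << 17;
            ws[x % WSS]++;  /* read-modify-write a random byte */
        }
    }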
A pairwise comparison between the same policies in Fig. 3 and Fig. 4 illustrates how scheduling latencies are impacted by cache and memory issues. As expected, the average and maximum latencies under P-FP, Linux 3.0, and stock Linux 3.8.13, depicted in insets (a), (b), and (c), respectively, increased noticeably. While average and median scheduling latencies increased by only one to two microseconds in absolute terms, the increase in relative terms is significantly higher, exceeding 45 percent in the case of average latency under both LITMUS^RT and Linux 3.0. Most significant is the increase in observed maximum latency, which reached roughly 48 µs, 73 µs, and 65 µs under LITMUS^RT, Linux 3.0, and Linux 3.8.13, respectively. This shows that even a modest, compute-only background workload can significantly impact observable latencies in mainline Linux.

Interestingly, the advantages of PREEMPT RT (inset (d)) become more apparent in this scenario: with the PREEMPT RT patch, Linux was able to maintain low latencies despite the increased load, both in terms of average as well as maximum latency (3.4 µs and 17.42 µs, respectively). The corresponding stock kernel incurred significantly worse latencies (see the longer tail in Fig. 4(c)).

Comparing the distribution of samples, it can be seen that the observed scheduling latency under LITMUS^RT's P-FP plugin follows a slightly different pattern than either Linux 3.0 or Linux 3.8.13. In particular, LITMUS^RT's distribution appears to be wider and heavier, with a less rapid decrease in the frequency of samples in the range of 1 µs-40 µs. This explains LITMUS^RT's slightly higher average and median scheduling latencies, which are about 1 µs higher than under either Linux 3.0 or Linux 3.8.13. However, note that LITMUS^RT, Linux 3.0, and Linux 3.8.13 are all similarly subject to long tails, which indicates that the observed maximum latencies are caused by factors unrelated to LITMUS^RT (i.e., they are caused by issues in the underlying Linux kernel, which the PREEMPT RT patch addresses).
Figure 3: Histograms of observed scheduling latency in an otherwise idle system (log-scale y-axis, 1 µs bins, ~5.85 million samples each). (a) LITMUS^RT with the P-FP scheduler plugin: min=1.96 µs, max=15.13 µs, avg=3.45 µs, median=3.10 µs, stdev=1.03 µs. (b) Linux 3.0: min=1.87 µs, max=13.89 µs, avg=2.89 µs, median=2.77 µs, stdev=0.51 µs. (c) Linux 3.8.13 without the PREEMPT RT patch: min=1.52 µs, max=19.73 µs, avg=2.89 µs, median=2.58 µs, stdev=0.69 µs. (d) Linux 3.8.13 with the PREEMPT RT patch: min=1.55 µs, max=11.20 µs, avg=2.74 µs, median=2.57 µs, stdev=0.42 µs.

Nonetheless, the histograms reveal that, in the average case, LITMUS^RT adds measurable (but not excessive) additional overhead. We suspect two primary sources for this additional overhead. First, LITMUS^RT's scheduling path needs to acquire (at least) one additional spin lock compared to stock Linux, which is especially costly in the presence of high cache and memory-bandwidth contention. This additional spin lock acquisition stems from the fact that LITMUS^RT's scheduling state is not protected by Linux's runqueue locks; however, Linux's runqueue locks must still be acquired prior to invoking LITMUS^RT's scheduling framework. And second, the increased average-case overheads might be due to a lack of low-level optimizations in LITMUS^RT (in comparison with the mature codebase of Linux). Given that LITMUS^RT is primarily a research-oriented project focused on algorithmic real-time scheduling issues, a certain lack of low-level tuning is not surprising.

As was already briefly mentioned, the CPU-bound background workload matches the setup that has been used in prior LITMUS^RT-based studies (e.g., [5, 6, 10, 12]). As is apparent when comparing Fig. 3(a) with Fig. 4(a), our data confirms that the CPU-bound workload generates sufficient memory and cache pressure to magnify kernel overheads. Conversely, conducting overhead experiments without a cache-intensive background workload does not yield an accurate picture of kernel overheads. Next, we discuss the impact of interrupt-intensive background workloads.

3.3 I/O-bound Background Workload

Interrupts are challenging from a latency point of view since interrupt handlers typically disable interrupts temporarily and may carry out significant processing, both of which directly affect scheduling latency. It should be noted that Linux has long supported split interrupt handling (e.g., see [9]), wherein interrupt handlers are split into a (short) "top half" and a (typically longer) "bottom half": only the top half is executed in the (hard) interrupt context, while the bottom half is queued for later processing. However, in stock Linux, bottom halves still effectively have higher priority than regular real-time tasks, in the sense that the execution of bottom halves is not under the control of the regular SCHED_FIFO process scheduler and thus may negatively affect scheduling latencies. (Bottom halves are processed by so-called softirqs, which in stock Linux are invoked from interrupt and exception return paths.) Further, bottom halves may still disable interrupts and preemptions for prolonged times.

Considerable effort has been invested by the developers of the PREEMPT RT patch to address these very issues. This is accomplished by forcing bottom half processing to take place in kernel threads (which can be scheduled such that they do not delay high-priority real-time tasks), and by identifying and breaking up code segments that disable interrupts and preemptions for prolonged durations. In contrast, since LITMUS^RT is currently based on stock Linux, and since the focus of LITMUS^RT is the exploration and evaluation of new scheduling policies (and not the reengineering of the underlying Linux kernel), no such improvements are present in LITMUS^RT. A key motivation for our experiments was to determine to which extent LITMUS^RT is penalized by the absence of such improvements.
Figure 4: Histograms of observed scheduling latency in the presence of a CPU-bound background workload (log-scale y-axis, 1 µs bins). (a) LITMUS^RT with the P-FP scheduler plugin: min=2.10 µs, max=47.59 µs, avg=5.17 µs, median=4.37 µs, stdev=2.75 µs. (b) Linux 3.0: min=2.04 µs, max=72.73 µs, avg=4.22 µs, median=3.86 µs, stdev=1.37 µs. (c) Linux 3.8.13 without the PREEMPT RT patch: min=2.14 µs, max=64.47 µs, avg=4.02 µs, median=3.67 µs, stdev=1.20 µs. (d) Linux 3.8.13 with the PREEMPT RT patch: min=1.73 µs, max=17.42 µs, avg=3.40 µs, median=3.02 µs, stdev=1.12 µs.

We explored the impact of interrupt-intensive workloads on scheduling latency with I/O-bound background tasks that generate a large number of interrupts, system calls, and scheduler invocations. To simulate such workloads, we used a combination of the following three tools.

1. hackbench, a standard stress test for the Linux scheduler [30]. Under the employed default settings, it creates 400 processes that exchange tokens via (local) sockets, thus causing frequent system calls and scheduler invocations (due to blocking reads).

2. Bonnie++, a standard file system and hard disk stress test [17]. Bonnie++ tests file creation, random and sequential file access, and file system metadata access. We configured Bonnie++ to use direct I/O, which circumvents Linux's page cache (and thus results in increased disk activity). This workload results in a large number of system calls, disk interrupts, and scheduler invocations (due to blocking reads and writes).

3. wget, a common utility to download files via HTTP. We used wget to download a 600 MiB file from a web server on the local network in a loop. The downloaded file was immediately discarded by writing it to /dev/null to avoid stalling on disk I/O. One such download-and-discard loop was launched for each of the 16 cores. This workload generates a large number of network interrupts as Ethernet packets are received at the maximum rate sustained by the network (with 1 Gbit/s links in our lab).

In combination, these three workloads cause considerable stress for the entire kernel, and can be expected to frequently trigger code paths that inherently have to disable interrupts and preemptions. While the three tools may not reflect any particular real-world application, the chosen combination of stress sources is useful to consider because it approaches a worst-case scenario with regard to background activity. The resulting distributions of observed scheduling latency under P-FP, Linux 3.0, and Linux 3.8.13 with and without the PREEMPT RT patch are depicted in Fig. 5.

Scheduling latencies are severely affected by the I/O-bound background workload under LITMUS^RT, Linux 3.0, and stock Linux 3.8.13 alike. The corresponding histograms, shown in insets (a)-(c) of Fig. 5, respectively, exhibit a long, dense tail. Note that the x-axis in Fig. 5 uses a different scale than Figs. 3 and 4: scheduling latencies in excess of 5 ms were observed in this scenario, two orders of magnitude worse than in the previous ones. Scheduling latencies in this range clearly limit these kernels to hosting applications that are not particularly latency-sensitive.

In contrast, Linux 3.8.13 with the PREEMPT RT patch maintained much lower scheduling latencies, on the order of tens of microseconds, despite the stress placed upon the system, as can be seen in Fig. 5(d). Nonetheless, the maximum observed scheduling latency did increase to 44 µs, which shows that, even with the PREEMPT RT patch, non-negligible latencies arise given harsh workloads.
Figure 5: Histograms of observed scheduling latency in the presence of an I/O-bound background workload (log-scale y-axis, 1 µs bins; note the wider x-axis range). (a) LITMUS^RT with the P-FP scheduler plugin: min=1.89 µs, max=3956.48 µs, avg=6.60 µs, median=5.17 µs, stdev=12.76 µs. (b) Linux 3.0: min=1.85 µs, max=4300.43 µs, avg=6.39 µs, median=4.98 µs, stdev=13.25 µs. (c) Linux 3.8.13 without the PREEMPT RT patch: min=1.85 µs, max=5464.07 µs, avg=6.23 µs, median=4.60 µs, stdev=15.91 µs. (d) Linux 3.8.13 with the PREEMPT RT patch: min=1.47 µs, max=44.16 µs, avg=4.12 µs, median=4.07 µs, stdev=0.99 µs.

without the PREEMPT RT patch in the presence of only 3.4 Scheduling Latency of LITMUSRT Plugins
CPU-bound workloads, which is apparent when comparing
In the preceding sections, we have focused on LITMUSRT s
Fig. 4(c) with Fig. 5(d). Remarkably, the average and median
P-FP plugin, since it implements the same scheduling policy
scheduling latency under PREEMPT RT worsened by less
as SCHED FIFO (albeit with a larger number of priorities
than 0.7s with the introduction of the I/O-bound workload.
and support for additional real-time locking protocols) and
thus allows for the most direct comparison. We also in-
Finally, we also ran two variations of the I/O-bound work- vestigated how scheduling latency varies among the three
load with varying degrees of disk activity. First, we disabled evaluated LITMUSRT scheduler plugins. Fig. 6 compares
bonnie++ altogether, which brought down the maximum the P-FP, PSN-EDF and GSN-EDF plugins in LITMUSRT ,
observed latencies under Linux 3.0, Linux 3.8.13 (without under each of the three considered background workloads.
the PREEMPT RT patch), and LITMUSRT to around 550s, Comparing insets (g), (h), and (i), it is apparent that the
which is still too high for practical purposes, but shows that three plugins are equally subject to high scheduling latencies
the extreme outliers are caused by disk-related code. And (approaching 4ms) in the case of the I/O-bound background
second, we tried launching an instance of bonnie++ on each workload. This is not surprising, since the long tail of high
core, which brought the disk I/O subsystem to its knees and scheduling latencies is caused by the design of the underlying
caused latency spikes in the range of 80200 milliseconds (!) Linux kernel, and thus independent of the choice of plugin.
under the three non-PREEMPT RT kernels. Remarkably, the Further, comparing Fig. 6(a) with Fig. 6(b), and Fig. 6(d)
maximum observed scheduling latency under PREEMPT RT with Fig. 6(e), it is apparent that the PSN-EDF and P-FP plu-
remained below 50s even in this case. gins yield near-identical scheduling latency distributions,
despite the difference in implemented scheduling policy.
Overall, our experiment asserts the importance of This, however, is expected since the tests run only one real-
PREEMPT RT in turning Linux into a viable real-time plat- time task per processor; the real-time scheduler is hence not
form. Given the huge differences in maximum observed stressed and the cost of the scheduling operation is so small
latency, LITMUSRT would be substantially improved if it compared to other sources of latency that any differences
incorporated PREEMPT RT. Though this will require con- between fixed-priority and EDF scheduling disappear in the
siderable engineering effort (both patches modify in part the noise. Differences emerge only for higher task counts [10].
same code regions), there are no fundamental obstacles to However, looking at Fig. 6(f) and Fig. 6(i), it is apparent
rebasing LITMUSRT on top of the PREEMPT RT patch. that the scheduling latency is noticeably higher under GSN-
Figure 6: Histograms of observed scheduling latency under LITMUS^RT with the PSN-EDF, P-FP, and GSN-EDF plugins, under each of the three considered background workloads (log-scale y-axis, 1 µs bins). (a) PSN-EDF (idle): min=1.76 µs, max=26.17 µs, avg=3.45 µs, median=2.87 µs, stdev=1.24 µs. (b) P-FP (idle): min=1.96 µs, max=15.13 µs, avg=3.45 µs, median=3.10 µs, stdev=1.03 µs. (c) GSN-EDF (idle): min=1.59 µs, max=14.34 µs, avg=3.06 µs, median=2.56 µs, stdev=1.18 µs. (d) PSN-EDF (CPU-bound): min=2.40 µs, max=73.27 µs, avg=5.14 µs, median=4.21 µs, stdev=2.95 µs. (e) P-FP (CPU-bound): min=2.10 µs, max=47.59 µs, avg=5.17 µs, median=4.37 µs, stdev=2.75 µs. (f) GSN-EDF (CPU-bound): min=1.91 µs, max=60.20 µs, avg=5.81 µs, median=5.39 µs, stdev=2.51 µs. (g) PSN-EDF (I/O-bound): min=1.98 µs, max=3874.99 µs, avg=6.56 µs, median=5.11 µs, stdev=12.66 µs. (h) P-FP (I/O-bound): min=1.89 µs, max=3956.48 µs, avg=6.60 µs, median=5.17 µs, stdev=12.76 µs. (i) GSN-EDF (I/O-bound): min=2.26 µs, max=3905.79 µs, avg=10.95 µs, median=7.38 µs, stdev=14.11 µs.

EDF in the average case, which is due to its more complex underlying (unpatched) Linux versions. For ease of compari-
implementation. Issues such as contention caused by coarse- son, the results are repeated in Fig. 7.
grained locking, extra bookkeeping, and cache-coherence A comparison of inset (a)-(c) with insets (d)-(f) shows
delays when accessing shared structures increase both the that, though the observed maxima vary (for example, from
median and average observed scheduling latencies. 13.89s to 19.73s in the scenario without background
While this shows that LITMUSRT s implementation of tasks), the shapes of the distributions are largely similar. Fur-
global scheduling incurs higher overheads, there is little ther, there are no substantial differences in the average and
reason to employ global scheduling when the number of median latencies of the two kernel versions. This indicates
tasks does not exceed the number of available cores (which that no significant improvements concerning latency and pre-
is the case in the considered cyclictest setup). If the number emptivity have been incorporated since Linux 3.0. Therefore,
of tasks actually exceeds the number of available coresthat a direct comparison between the LITMUSRT patch and the
is, if the scheduling problem is not entirely trivialthen PREEMPT RT patch is valid.
other factors such as the impact of interference from higher- This concludes the discussion of our experimental results.
priority tasks or a need for bounded tardiness [18] can make Next, we briefly discuss how the presented cyclictest experi-
minor differences in scheduling latency a secondary concern, ments differ from the overhead and latency tracing typically
with only little impact on overall temporal correctness. used to evaluate LITMUSRT .
3.5 Linux 3.0 vs. Linux 3.8
4 Limitations of cyclictest
In this paper, we compared the latency of LITMUSRT and
Linux with the PREEMPT RT patch using the latest ver- As discussed in Sec. 1, LITMUSRT is normally evaluated
sions of each patch, which are based on Linux 3.0 and Linux using Feather-Trace, not cyclictest. While cyclictest is a very
3.8.13, respectively. As already discussed in the preceding useful tool to assess and compare different kernel versions
sections, to verify that comparing the two patches is valid (e.g., it can be used to test whether a proposed patch has
despite the difference in the underlying kernel version, we a negative impact on scheduling latency), it also has some
also measured the scheduling latencies exhibited by the two limitations if used as the sole metric for estimating a systems
Figure 7: Histograms of observed scheduling latency under Linux 3.0 and Linux 3.8.13, under each of the three considered background workloads (log-scale y-axis, 1 µs bins). (a) Linux 3.0 (idle): min=1.87 µs, max=13.89 µs, avg=2.89 µs, median=2.77 µs, stdev=0.51 µs. (b) Linux 3.0 (CPU-bound): min=2.04 µs, max=72.73 µs, avg=4.22 µs, median=3.86 µs, stdev=1.37 µs. (c) Linux 3.0 (I/O-bound): min=1.85 µs, max=4300.43 µs, avg=6.39 µs, median=4.98 µs, stdev=13.25 µs. (d) Linux 3.8.13 (idle): min=1.52 µs, max=19.73 µs, avg=2.89 µs, median=2.58 µs, stdev=0.69 µs. (e) Linux 3.8.13 (CPU-bound): min=2.14 µs, max=64.47 µs, avg=4.02 µs, median=3.67 µs, stdev=1.20 µs. (f) Linux 3.8.13 (I/O-bound): min=1.85 µs, max=5464.07 µs, avg=6.23 µs, median=4.60 µs, stdev=15.91 µs.

4 Limitations of cyclictest

As discussed in Sec. 1, LITMUS^RT is normally evaluated using Feather-Trace, not cyclictest. While cyclictest is a very useful tool to assess and compare different kernel versions (e.g., it can be used to test whether a proposed patch has a negative impact on scheduling latency), it also has some limitations if used as the sole metric for estimating a system's capability to provide temporal guarantees.

The primary advantage of cyclictest is that it provides an easy-to-interpret metric that reflects various sources of unpredictability as a single, opaque measure. That is, it treats the kernel and the underlying hardware as a black box and reports the actual cumulative impact of system overheads and hardware capabilities on real-time tasks. For application developers, this is convenient as it requires neither post-tracing analysis nor a detailed understanding of the kernel. In contrast, Feather-Trace yields a large number of (non-human-readable) event timestamps that require matching, filtering, post-processing, and a statistical evaluation. The resulting overhead profile is primarily intended for integration into schedulability analysis and is less suitable to direct interpretation. However, while cyclictest is arguably more convenient, LITMUS^RT's Feather-Trace approach provides a more complete picture since it yields the data required to assess the impact of kernel overheads on tasks other than the highest-priority task, as we explain next.

The main feature of Feather-Trace is that it integrates many tracepoints in the kernel, which can be used to collect fine-grained overheads. By measuring and considering the various sources of delay individually, a detailed analysis of the worst-case cumulative delay can be carried out.

For example, for a task other than the highest-priority task, the cumulative delay incurred depends on the worst-case scheduling latency and the delays due to preemptions by higher-priority tasks, which in turn depend on context-switching overheads, scheduling overheads in the presence of potentially many ready tasks, and so on. With Feather-Trace in LITMUS^RT, it is possible to measure all these aspects individually, and then account for them during schedulability analysis (see [10, Ch. 3] for a comprehensive introduction to overhead accounting), such that the observed worst-case overheads are fully reflected in the derived temporal guarantees for all tasks (and not just the highest-priority task).
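To illustrate, in deliberately simplified form and with our own notation (see [10, Ch. 3] for the precise accounting rules), how such per-source measurements enter a schedulability test: each job's WCET can be inflated by the measured scheduler and context-switch costs, while the measured event latency can be treated as release jitter, e.g.,

    e_i' = e_i + 2(Δ_sch + Δ_cxs),        J_i = Δ_ev

where Δ_sch and Δ_cxs bound the cost of one scheduler invocation and one context switch, respectively, and Δ_ev bounds the event (i.e., scheduling) latency. The inflated parameters e_i' and J_i are then used in place of the originals in a standard analysis, so that every task's guarantee, not just the highest-priority task's, reflects the measured overheads.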
As another example, consider how tasks are resumed under partitioned schedulers such as the P-FP plugin (or SCHED_FIFO with appropriate processor affinity masks). If a real-time task resumes on a remote processor (i.e., any processor other than its assigned partition), an inter-processor interrupt (IPI) must be sent to its assigned processor to trigger the scheduler. IPIs are of course not delivered and processed instantaneously in a real system and thus affect scheduling latency if they arise. When scheduling cyclictest on hardware platforms with processor-local timers (such as local APIC timers in modern x86 systems), however, such IPIs are not required because the interrupt signaling the expiry of cyclictest's one-shot timer is handled locally. If we simply execute cyclictest under PSN-EDF, P-FP, or SCHED_FIFO with appropriate processor affinity masks to determine the worst-case latency, it will never trace the impact of such IPIs, even though an actual real-time application that is triggered by interrupts from devices other than timers (e.g., a sensor) would actually be subject to IPI delays. In contrast, in the methodology used to evaluate LITMUS^RT (see [10, Ch. 4]), Feather-Trace is used to measure IPI latencies, which are then correctly accounted for in the schedulability analysis to reflect the worst-case task-activation delay.

In summary, it is impossible to derive from cyclictest-based experiments how real-time tasks other than the highest-priority task are affected by overheads, because overhead-aware schedulability analysis is fundamentally required to make temporal guarantees for all tasks. Such an analysis is made possible by Feather-Trace's ability to extract specific overheads. While obtaining measurements in a fine-grained manner is more involved than simply running cyclictest, Feather-Trace's fine-grained measurement approach provides a flexibility that is not achievable with coarse-grained approaches such as cyclictest. This, of course, does not diminish cyclictest's value as a quick assessment and debugging aid, but it should not be mistaken to provide a general measure of a system's real-time capability; it can only show the lack of such capability under certain circumstances, for instance, by exposing scheduling latencies in excess of 5 ms in the presence of I/O-bound background tasks.

5 Conclusion and Future Work

We presented an empirical evaluation of scheduling latency under LITMUS^RT using cyclictest. We ported cyclictest to LITMUS^RT's native API and collected samples of scheduling latency under several of its event-driven scheduler plugins, in three system configurations (an idle system, a system with CPU-bound background tasks, and a system with I/O-bound background tasks). For the purpose of comparison, we repeated the same measurements under Linux 3.0, Linux 3.8.13, and Linux 3.8.13 with the PREEMPT RT patch using the original, unmodified cyclictest version.

The results obtained from an idle system and in the presence of CPU-bound background tasks showed that while LITMUS^RT introduces some additional overheads, the difference is minor in absolute terms and manifests only in the average and median scheduling latencies. Importantly, LITMUS^RT was not observed to affect the maximum scheduling latencies negatively, which is due to the fact that other factors in mainline Linux have a much larger impact on worst-case delays. We conclude from these observations that LITMUS^RT does not impose inherently impractical overheads. Further, we believe that the observed minor increase in average and median scheduling latency is not fundamental, but caused by a lack of low-level optimizations that could be rectified with additional engineering effort.

However, our data also documents that LITMUS^RT inherits mainline Linux's weaknesses in the presence of I/O-bound background tasks. Again, LITMUS^RT did not increase the observed maximum scheduling latency, but the latency profile of the underlying Linux 3.0 kernel renders it unfit for serious (hard) real-time applications. Further, our experiments confirmed that this is still the case with the more recent mainline Linux version 3.8.13. It would thus be highly desirable to combine LITMUS^RT's algorithmic improvements with the increased responsiveness under load achieved by the PREEMPT RT patch, which remains as future work.

References

[1] The LITMUS^RT project. http://www.litmus-rt.org.
[2] cyclictest - RTwiki. Real-Time Linux Wiki. https://rt.wiki.kernel.org/index.php/Cyclictest.
[3] J. Anderson and A. Srinivasan. Mixed Pfair/ERfair scheduling of synchronous periodic tasks. In Proc. of the 13th Euromicro Conference on Real-Time Systems, pages 76-85. IEEE, 2001.
[4] S. Baruah, N. Cohen, C. Plaxton, and D. Varvel. Proportionate progress: A notion of fairness in resource allocation. Algorithmica, 15(6):600-625, 1996.
[5] A. Bastoni, B. Brandenburg, and J. Anderson. An empirical comparison of global, partitioned, and clustered multiprocessor EDF schedulers. In Proc. of the 31st Real-Time Systems Symposium, pages 14-24, 2010.
[6] A. Bastoni, B. Brandenburg, and J. Anderson. Is semi-partitioned scheduling practical? In Proc. of the 23rd Euromicro Conference on Real-Time Systems, pages 125-135, 2011.
[7] A. Block. Adaptive multiprocessor real-time systems. PhD thesis, University of North Carolina at Chapel Hill, 2008.
[8] A. Block, H. Leontyev, B. Brandenburg, and J. Anderson. A flexible real-time locking protocol for multiprocessors. In Proc. of the 13th IEEE Conference on Embedded and Real-Time Computing Systems and Applications, pages 47-57, 2007.
[9] D. Bovet and M. Cesati. Understanding The Linux Kernel. O'Reilly & Associates Inc, third edition, 2005.
[10] B. Brandenburg. Scheduling and locking in multiprocessor real-time operating systems. PhD thesis, The University of North Carolina at Chapel Hill, 2011.
[11] B. Brandenburg and J. Anderson. Feather-Trace: A light-weight event tracing toolkit. In Proc. of the Workshop on Operating Systems Platforms for Embedded Real-Time Applications, pages 61-70, 2007.
[12] B. Brandenburg and J. Anderson. A comparison of the M-PCP, D-PCP, and FMLP on LITMUS^RT. In Proc. of the 12th Intl. Conference on Principles of Distributed Systems, pages 105-124, 2008.
[13] B. Brandenburg, A. Block, J. Calandrino, U. Devi, H. Leontyev, and J. Anderson. LITMUS^RT: a status report. 9th Real-Time Linux Workshop, 2007.
[14] J. Calandrino, H. Leontyev, A. Block, U. Devi, and J. Anderson. LITMUS^RT: A testbed for empirically comparing real-time multiprocessor schedulers. In Proc. of the 27th IEEE Real-Time Systems Symposium, pages 111-123, 2006.
[15] G. Chanteperdrix and R. Cochran. The ARM fast context switch extension for Linux. Real Time Linux Workshop, 2009.
[16] R. Cochran, C. Marinescu, and C. Riesch. Synchronizing the Linux system time to a PTP hardware clock. In Proc. of the 2011 Intl. IEEE Symposium on Precision Clock Synchronization for Measurement Control and Communication, pages 87-92, 2011.
[17] R. Coker. bonnie++ - program to test hard drive performance. Linux manual page.
[18] U. Devi. Soft real-time scheduling on multiprocessors. PhD thesis, Chapel Hill, NC, USA, 2006.
[19] G. Elliott and J. Anderson. Globally scheduled real-time multiprocessor systems with GPUs. Real-Time Systems, 48(1):34-74, 2012.
[20] C. Emde. Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems. 12th Real-Time Linux Workshop, 2010.
[21] L. Fu and R. Schwebel. RT PREEMPT HOWTO. Real-Time Linux Wiki. https://rt.wiki.kernel.org/index.php/RT_PREEMPT_HOWTO.
[22] L. Henriques. Threaded IRQs on Linux PREEMPT-RT. In Proc. of the 5th Intl. Workshop on Operating Systems Platforms for Embedded Real-Time Applications, pages 23-32, 2009.
[23] C. Kenna, J. Herman, B. Brandenburg, A. Mills, and J. Anderson. Soft real-time on multiprocessors: are analysis-based schedulers really worth it? In Proc. of the 32nd Real-Time Systems Symposium, pages 93-103, 2011.
[24] J. Kiszka. Towards Linux as a real-time hypervisor. In Proc. of the 11th Real-Time Linux Workshop, pages 205-214, 2009.
[25] K. Koolwal. Investigating latency effects of the Linux real-time preemption patches (PREEMPT RT) on AMD's GEODE LX platform. In Proc. of the 11th Real-Time Linux Workshop, pages 131-146, 2009.
[26] A. Lackorzynski, J. Danisevskis, J. Nordholz, and M. Peter. Real-time performance of L4Linux. In Proc. of the 13th Real-Time Linux Workshop, pages 117-124, 2011.
[27] P. McKenney. A realtime preemption overview. LWN, 2005. http://lwn.net/Articles/146861/.
[28] A. Mok. Fundamental design problems of distributed systems for the hard-real-time environment. PhD thesis, 1983.
[29] M. Traut. Real-time CORBA performance on Linux-RT PREEMPT. 9th Real-Time Linux Workshop, 2007.
[30] C. Williams and D. Sommerseth. hackbench - scheduler benchmark/stress test. Linux manual page.
[31] B. Zuo, K. Chen, A. Liang, H. Guan, J. Zhang, R. Ma, and H. Yang. Performance tuning towards a KVM-based low latency virtualization system. In Proc. of the 2nd International Conference on Information Engineering and Computer Science, pages 1-4. IEEE, 2010.
