Intelligent Resource Scheduling for Co-located Latency-critical Services:
A Multi-Model Collaborative Learning Approach

Lei Liu (Beihang University), Xinglei Dou (ICT, CAS), Yuetao Chen (ICT, CAS); Sys-Inventor Lab
Corresponding author (PI): lei.liu@zoho.com; liulei2010@buaa.edu.cn
https://www.usenix.org/conference/fast23/presentation/liu

This paper is included in the Proceedings of the 21st USENIX Conference on File and Storage Technologies (FAST '23), February 21-23, 2023, Santa Clara, CA, USA. ISBN 978-1-939133-32-8.
Abstract

Latency-critical services have been widely deployed in cloud environments. For cost-efficiency, multiple services are usually co-located on a server. Thus, run-time resource scheduling becomes the pivot for QoS control in these complicated co-location cases. However, the scheduling exploration space enlarges rapidly with the increasing server resources, making schedulers unable to provide ideal solutions quickly and efficiently. More importantly, we observe that there are "resource cliffs" in the scheduling exploration space. They affect the exploration efficiency and always lead to severe QoS fluctuations in previous schedulers. To address these problems, we propose a novel ML-based intelligent scheduler – OSML. It learns the correlation between architectural hints (e.g., IPC, cache misses, memory footprint, etc.), scheduling solutions and the QoS demands. OSML employs multiple ML models to work collaboratively to predict QoS variations, shepherd the scheduling, and recover from QoS violations in complicated co-location cases. OSML can intelligently avoid resource cliffs during scheduling and reach an optimal solution much faster than previous approaches for co-located applications. Experimental results show that OSML supports higher loads and meets QoS targets with lower scheduling overheads and shorter convergence time than previous studies.

1 Introduction

Cloud applications are shifting from monolithic architectures to loosely-coupled designs, consisting of many latency-critical (LC) services (e.g., some microservices, interactive services) with strict QoS requirements [18,19,52,53]. Many cloud providers, including Amazon, Alibaba, Facebook, Google, and LinkedIn, employ this loosely-coupled design to improve productivity, scalability, functionality, and reliability of their cloud systems [2,3,18,52].

QoS-driven resource scheduling faces more challenges in this era. The cost-efficiency policy drives providers to co-locate as many applications as possible on a server. However, these co-located services exhibit diverse behaviors across the storage hierarchy, including multiple interactive resources such as CPU cores, cache, memory/IO bandwidth, and main memory banks. These behaviors also can be drastically different from demand to demand and change within seconds. Moreover, with the increasing number of cores, more threads contend for the shared LLC (last-level cache) and memory bandwidth. Notably, these shared resources interact with each other [32,33,45,80]. All these issues make resource scheduling for co-located LC services more complicated and time-consuming. Moreover, in reality, end-users keep increasing demands for quick responses from the cloud [15,47,53]. According to Amazon's estimation, even if end-users experience a 1-second delay, they tend to abort their transactions, translating to a $1.6 billion loss annually [1]. Briefly, unprecedented challenges are posed for resource scheduling mechanisms [8,10,31,47,52].

Some previous studies [17,31,33,45,61] design clustering approaches to allocate LLC, or LLC together with main memory bandwidth, to cores for scheduling single-threaded applications. Yet, they are not suitable in cloud environments, as cloud services often have many concurrent threads and strict QoS constraints (i.e., latency-critical). Alternatively, the existing representative studies either use heuristic algorithms – increasing/decreasing one resource at a time and observing the performance variations [10] – or use learning-based algorithms (e.g., Bayesian optimization [46]) in a relatively straightforward manner. The studies in [10,46] show that scheduling five co-located interactive services to meet certain QoS constraints can take more than 20 seconds on average. Existing schedulers still have room for improvement in scheduling convergence time, intelligence, and how to schedule complicated interactive resources (e.g., parallel computing units and complex memory/storage hierarchies) in a timely fashion. Moreover, the existing schedulers cannot easily avoid "resource cliffs", i.e., cases where decreasing a resource only slightly during scheduling leads to a significant QoS slowdown. To the best of our knowledge, our community has been expecting new directions on developing resource-scheduling mechanisms to handle co-located LC services [16,30,31,45].

To this end, we design OSML, a novel machine learning (ML) based resource scheduler for LC services on large-scale servers. Using ML models significantly improves scheduling exploration efficiency for multiple co-located cloud services and can handle the complicated resource sharing and under-/over-provision cases in a timely manner. ML has achieved tremendous success in improving speech recognition [54], benefiting image recognition [25], and helping the machine beat the human champion at Go [13,24,51]. In OSML, we make progress in leveraging ML for resource scheduling, and we make the following contributions.

(1) Investigation in RCliff for Multiple Resources during Scheduling. In the context of cloud environments, we study the resource cliff (RCliff, i.e., reducing a resource only slightly leads to a significant QoS slowdown) for computing and cache resources. More importantly, we show that RCliffs commonly exist in many widely used LC services and challenge existing schedulers (Sec.3.3). Furthermore, we show ML can be an ideal approach that benefits scheduling (Sec.4.4).
(2) Collaborative ML Models for Intelligent Scheduling. OSML is an ML-based scheduler that intelligently schedules multiple interactive resources to meet co-located services' QoS targets. OSML learns the correlation between architectural hints (e.g., IPC, cache misses, memory footprint, etc.), optimal scheduling solutions, and the QoS demands. It employs MLP models to avoid RCliffs intelligently, thus avoiding the sudden QoS slowdown often incurred by the RCliffs in prior schedulers; it predicts the QoS variations and resource margins, and then delivers appropriate resource allocations. It leverages an enhanced DQN to shepherd the allocations and recover from the QoS violation and resource over-provision cases. Moreover, as OSML's models are lightweight and their functions are clearly defined, it is easy to locate the problems and debug them.

(3) An Open-sourced Data Set for Low-overhead ML. We have collected the performance traces for widely deployed LC services (in Table 1), covering 62,720,264 resource allocation cases that contain around 2-billion samples (Sec.4). These data have a rich set of information, e.g., the RCliffs for multiple resources and the interactions between workload features and the mainstream architectures. Our models can be trained and generalized with these data and then used on new platforms with low-overhead transfer learning. People can study the data set and train their models without a long period for data collection.

(4) Real Implementation and Detailed Comparisons. We implement OSML based on the latest Linux. OSML is designed as a co-worker of the OS scheduler located between the OS kernel and the user layer. We compare OSML with the most related open-source studies [10,46] and show the advantages.

OSML captures the applications' online behaviors, forwards them to the ML models running on CPU or GPU, and schedules resources accordingly. Compared with [10,46], OSML takes 36~55% less time to meet the QoS targets, and supports 10~50% higher loads in 58% of cases where Moses, Img-dnn, and Xapian can be co-scheduled in our experiments. Its ML models have low run-time overheads. The OSML project and its ecological construction receive support from industry and academia; a version for research will be open-sourced via https://github.com/Sys-Inventor-Lab/AI4System-OSML.

Table 1: Latency-critical (LC) services, including micro-/interactive services [18,19,52,68,46]. The max load – max RPS – is with the 99th percentile tail latency QoS target [10,18,46,52].

LC service | Domain | RPS (Requests Per Second)
Img-dnn [62] | Image recognition | 2000, 3000, 4000, 5000, 6000 (Max)
Masstree [62] | Key-value store | 3000, 3400, 3800, 4200, 4600
Memcached [65] | Key-value store | 256k, 512k, 768k, 1024k, 1280k
MongoDB [64] | Persistent database | 1000, 3000, 5000, 7000, 9000
Moses [62] | RT translation | 2200, 2400, 2600, 2800, 3000
Nginx [66] | Web server | 60k, 120k, 180k, 240k, 300k
Specjbb [62] | Java middleware | 7000, 9000, 11000, 13000, 15000
Sphinx [62] | Speech recognition | 1, 4, 8, 12, 16
Xapian [62] | Online search | 3600, 4400, 5200, 6000, 6800
Login [68] | Login | 300, 600, 900, 1200, 1500
Ads [68,52] | Online renting ads | 10, 100, 1000

Table 2: Our platform specification vs. a server in 2010~14 [80].

Conf. / Servers | Our Platform | Server (2010s)
CPU Model | Intel Xeon E5-2697 v4 | Intel i7-860
Logical Processor Cores | 36 Cores (18 phy. cores) | 8 Cores (4 phy. cores)
Processor Speed | 2.3GHz | 2.8GHz
Main Memory / Channel / BW | 256GB, 2400MHz DDR4 / 4 Channels / 76.8GB/s | 8GB, 1600MHz DDR3 / 2 Channels / 25.6GB/s
Private L1 & L2 Cache Size | 32KB and 256KB | 32KB and 256KB
Shared L3 Cache Size | 45MB - 20 ways | 8MB - 16 ways
Disk | 1TB, 7200 RPM, HD | 500GB, 5400 RPM, HD
GPU | NVIDIA GP104 [GTX 1080], 8GB Memory | N/A

2 Background and Motivation

The cloud environment has a trend towards a new model [3,18,52], in which cloud applications comprise numerous distributed LC services (i.e., micro/interactive services), such as key-value storing, database serving, and business applications serving [18,19]. Table 1 includes some widely used ones, and they form a significant fraction of cloud applications [18]. These services have different features and resource demands.

In terms of the datacenter servers, at present, new servers can have an increased number of cores, larger LLC capacity, larger main memory capacity, and higher bandwidth, and the resource scheduling exploration space becomes much larger than ever before as a result. Table 2 compares two typical servers used at different times. On the one hand, although modern servers can have more cores and memory resources than ever before, they are not fully exploited in today's cloud environments. For instance, in Google's datacenter, the CPU utilization is about 45~53% and memory utilization ranges from 25~77% during 25 days, while Alibaba's cluster exhibits a lower and unstable trend, i.e., 18~40% for CPU and 42~60% for memory in 12 hours [32,49], indicating that a lot of resources are wasted. On the other hand, the larger resource scheduling exploration space, which consists of more diverse resources, prohibits the schedulers from achieving the optimal solution quickly. Additionally, cloud applications can have dozens of concurrent threads [10,46]. When several cloud applications run on a server, they share and contend for resources across multiple resource layers – cores, LLC, memory bandwidth/banks. Previous studies show they may incur severe performance degradation and unpredictable QoS violations, and propose scheduling approaches at the architecture [9,23,44], OS [31,45,50,80], and user level [10,37,38]. Yet, do they perform ideally for scheduling co-located LC services on modern datacenter servers?

3 Resource Scheduling for LC Services

To answer the above question, we study the LC services (Table 1) that are widely deployed in cloud environments.

3.1 Understanding the LC Services - Resource Cliff

We study how sensitive these LC services are to the critical resources, e.g., the number of cores and LLC capacity, on a commercial platform ("our platform" in Table 2).
Figure 1: The resource scheduling exploration space for cores and LLC ways. All services here are with 36 threads. These figures show the sensitivity to resource allocation under different policies. Each column/row represents a specific number of LLC ways/cores allocated to an application. Each cell denotes the LC service's response latency under the given number of cores and LLC ways. The red line highlights the RCliff (which can be obtained by selecting the knee solution [69]). The green cells show allocation policies that bring better performance (low response latency). The OAA (Optimal Allocation Area) is also illustrated for each LC service. We test all of the LC services in Table 1, and we find the RCliff and OAA always exist, though the RPS varies [79]. We only show several of them for the sake of saving space.

For Moses, as illustrated in Figure 1-a, with an increasing number of cores, more threads are mapped on them simultaneously. Meanwhile, for a specific number of cores, more LLC ways can benefit performance. Thus, we observe that the response latency is low when computing and LLC resources are ample (i.e., below 10ms in the green area). The overall trends are also observed for other LC services.

However, we observe the cliff phenomenon for these services. In Figure 1-a, in the cases where 6 cores are allocated to Moses, the response latency increases significantly from 34ms to 4644ms if merely one LLC way is reduced (i.e., from 10 ways to 9 ways). Similar phenomena also happen in cases where computing resources are reduced. As slight resource re-allocations bring a significant performance slowdown, we denote this phenomenon as the Resource Cliff (RCliff). It is defined as the resource allocation cases that could incur the most significant performance slowdown if resources (e.g., core, cache) are deprived in a fine-grain way in the scheduling exploration space. Take Moses as an example: on the RCliff (denoted by the red box in Figure 1-a), there would be a sharp performance slowdown if only one core or one LLC way (or both) is deprived. Figure 1-b and c show RCliffs for Img-dnn and MongoDB, respectively. They exhibit computing-sensitive features and have an RCliff only for cores. From another angle, RCliff means that a little more resources will bring significant performance improvement. Figure 1-a shows that Moses exhibits an RCliff for both cores and LLC. Moreover, we test the services in Table 1 across various RPS and find that RCliffs always exist, though the RCliffs vary (8.8% on average) according to different RPS.

The underlying reason for the cache cliff is locality; for the core cliff, the fundamental reason lies in queuing theory – the latency will increase drastically when the request arrival rate exceeds the available cores. RCliff alerts the scheduler not to allocate resources close to it, because it is "dangerous to fall off the cliff" and incurs a significant performance slowdown, i.e., even a slight resource reduction can incur a severe slowdown. Notably, in Figure 1, we highlight each LC service's Optimal Allocation Area (OAA) in the scheduling exploration space, defined as the ideal number of allocated cores and LLC ways to bring an acceptable QoS. More resources than the OAA cannot deliver more significant performance, but fewer resources lead to the danger of falling off the RCliff. The OAA is the goal that schedulers should achieve.

Figure 2: OAA exists regardless of the number of concurrent threads.

3.2 Is OAA Sensitive to the Number of Threads?

In practice, an LC service may have many threads for higher performance. Therefore, we come up with the question: is the OAA sensitive to the number of threads, i.e., if someone starts more threads, will the OAA change?

To answer this question, for a specific LC service, we start different numbers of threads and map them across different numbers of cores (the number of threads can be larger than the number of cores). Through the experiments, we observe: (i) More threads do not necessarily bring more benefits. Take Moses as an example: when more threads are started (e.g., 20/28/36) and mapped across different numbers of cores, the overall response latency can be higher (as illustrated in Figure 2). The underlying reason lies in more memory contention in the memory hierarchy and more context switch overheads, leading to a higher response latency [20,36]. (ii) The OAA is not sensitive to the number of concurrent threads. For Moses in Figure 2, even if 20/28/36 threads are mapped to 10~25 cores, the cases with around 8/9 cores always perform ideally. Other LC services in Table 1 also show a similar phenomenon, though their OAAs are different for different applications. In practice, if the QoS for a specific LC service is satisfied, LLC ways should be allocated as sparingly as possible, saving LLC space for other applications. Similarly, we also try to allocate fewer cores to save computing resources. Here, we conclude that the OAA is not sensitive to the number of threads. We should further reveal: how do the existing schedulers perform in front of OAAs and RCliffs?
Table 3: The input parameters for ML models.

Feature | Description | Models
IPC | Instructions per clock | A/A'/B/B'/C
Cache Misses | LLC misses per second | A/A'/B/B'/C
MBL | Local memory bandwidth | A/A'/B/B'/C
CPU Usage | The sum of each core's utilization | A/A'/B/B'/C
Virt. Memory | Virtual memory in use by an app | A/A'/B/B'
Res. Memory | Resident memory in use by an app | A/A'/B/B'
Allocated Cores | The number of allocated cores | A/A'/B/B'/C
Allocated Cache | The capacity of allocated cache | A/A'/B/B'/C
Core Frequency | Core frequency during run time | A/A'/B/B'/C
QoS Slowdown | Percentage of QoS slowdown | B
Expected Cores | Expected cores after deprivation | B'
Expected Cache | Expected cache after deprivation | B'
Cores used by N. | Cores used by Neighbors | A'/B/B'
Cache used by N. | Cache capacity used by Neighbors | A'/B/B'
MBL used by N. | Memory BW used by Neighbors | A'/B/B'
Resp. Latency | Average latency of a LC service | C

3.3 Issues the Existing Schedulers May Meet

We find three main shortcomings in the existing schedulers when dealing with OAAs and RCliffs. (1) Entangling with RCliffs. Many schedulers often employ heuristic scheduling algorithms, i.e., they increase/reduce resources until the monitor alerts that the system performance is suffering a significant change (e.g., a severe slowdown). Yet, these approaches could incur unpredictable latency spiking. For example, if the current resource allocation for an LC service is in the base of the RCliff (i.e., the yellow area in Figure 1-a/b/c), the scheduler has to try to achieve the OAA. However, as the scheduler doesn't know the "location" of the OAA in the exploration space, it has to increase resources step by step in a fine-grain way; thus the entire scheduling process from the base of the RCliff will incur very high response latency. For another example, if the current resource allocation is on the RCliff or close to the RCliff, a slight resource reduction for any purpose could incur a sudden and sharp performance drop for LC services. The previous efforts [10,32,50,53] find there would be about hundreds/thousands of times of latency jitter, indicating the QoS cannot be guaranteed during these periods. Thus, RCliffs should not be neglected when designing a scheduler. (2) Unable to accurately and simultaneously schedule a combination of multiple interactive resources (e.g., cores, LLC ways) to achieve OAAs with low overheads. Prior studies [10,31,32,45] show that the core computing ability, cache hierarchy, and memory bandwidth are interactive factors for resource scheduling. Solely considering a single dimension in scheduling often leads to sub-optimal QoS. However, the existing schedulers using heuristic or model-based algorithms usually schedule one resource dimension at a time and bring high overheads on scheduling multiple interactive resources. For example, PARTIES [10] takes around 20~30 seconds on average (up to 60 seconds in the worst cases) to find ideal allocations when 3~6 LC services are co-running. The efforts in [16,41,42] also show the inefficiency of heuristics due to the high overheads on scheduling various resources with complex configurations. (3) Unable to provide accurate QoS predictions. Therefore, the scheduler can hardly balance the global QoS and resource allocations across all co-located applications, leading to QoS violations or resource over-provision.

An ideal scheduler should avoid the RCliff and quickly achieve the OAA from any position in the scheduling space. We claim it is time to design a new scheduler, and using ML can be a good approach to handle such complicated cases with low overheads.

4 Leveraging ML for Scheduling

We design a new resource scheduler – OSML. It differs from the previous studies in the following ways. We divide the resource scheduling for co-located services into several routines and design ML models to handle them, respectively. These models work collaboratively in OSML to perform scheduling. OSML uses data-driven static ML models (Model-A/B) to predict the OAA/RCliff and balance the QoS and resource allocations among co-located LC services, and uses a reinforcement learning model (Model-C) to shepherd the allocations dynamically. Moreover, we collect extensive real traces for widely deployed LC services.

4.1 Model-A: Aiming OAA

Model-A Description. The neural network used in Model-A is a 3-layer multi-layer perceptron (MLP); each layer is a set of nonlinear functions of a weighted sum of all outputs that are fully connected from the prior one [21,24]. There are 40 neurons in each hidden layer. There is a dropout layer with a loss rate of 30% behind each fully connected layer to prevent overfitting. For each LC service, the inputs of the MLP include 9 items in Table 3, and the outputs include the OAA for multiple interactive resources, the OAA bandwidth (bandwidth requirement for OAA), and the RCliff. Model-A has a shadow – A', which has the same MLP structure and 12 input parameters (Table 3), providing solutions when multiple LC services are running together.
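To make the structure above concrete, the following is a minimal sketch of a Model-A-style network in TensorFlow/Keras, using the hyper-parameters stated in this section and in Table 4 (3 hidden layers of 40 neurons, 30% dropout behind each fully connected layer, ReLU activation, MSE loss, Adam optimizer). The 5-dimensional output layout (OAA cores, OAA LLC ways, OAA bandwidth, RCliff cores, RCliff LLC ways) is our assumption about how the predicted quantities could be packed into one output vector; it is not taken from the released code.

```python
import tensorflow as tf

# A sketch of a Model-A-style MLP (not the authors' released implementation).
# Hyper-parameters follow the text: 3 hidden layers, 40 neurons each, 30% dropout
# behind each fully connected layer, ReLU, MSE loss, Adam optimizer.
# The output layout is an assumption:
# <OAA cores, OAA LLC ways, OAA bandwidth, RCliff cores, RCliff LLC ways>.
def build_model_a(num_features: int = 9, num_outputs: int = 5) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(40, activation="relu", input_shape=(num_features,)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(40, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(40, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_outputs),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

Model-A's shadow A' would differ only in its input width (12 features, Table 3).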
Model-A Training. Collecting training data is an offline job. We have collected the performance traces that involve the parameters in Table 3 for the LC services in Table 1 on "our platform" in Table 2. The parameters are normalized into [0,1] according to the function: Normalized_Feature = (Feature − Min) / (Max − Min), where Feature is the original value, and Max and Min are predefined according to different metrics.
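As a small illustration of this normalization step, the sketch below applies the stated min-max formula; the per-feature bounds shown are placeholders, since the paper only says that Min and Max are predefined per metric.

```python
# Hypothetical per-feature bounds for min-max normalization into [0,1];
# the real bounds are predefined per metric and are not listed in the paper.
FEATURE_BOUNDS = {
    "IPC": (0.0, 4.0),
    "Cache Misses": (0.0, 1e9),
    "MBL": (0.0, 76.8),   # GB/s, e.g., the platform's peak bandwidth in Table 2
}

def normalize(feature: str, value: float) -> float:
    lo, hi = FEATURE_BOUNDS[feature]
    return (value - lo) / (hi - lo)
```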
For each LC service with every common RPS demand, we sweep from 36 threads to 1 thread across LLC allocation policies ranging from 1 to 20 ways, map the threads onto a certain number of cores, and collect the performance trace data accordingly. In each case, we label the corresponding OAA, RCliff and OAA bandwidth. For example, Figure 3 shows a data collection case where 8 threads are mapped onto 7 cores with 4 LLC ways. We feed the LC services with diverse RPS (Table 1), covering most of the common cases. Moreover, to train Model-A's shadow (A'), we map LC services onto the remaining resources in the above process and get the traces for co-location cases. Note that the resources are partitioned among applications. We test comprehensive co-location cases for the LC services in Table 1, and find the LC services' RCliffs vary from 2.8% to 38.9%, and OAAs vary from 5.6% to 36.1%, respectively, when multiple LC services are co-running. Finally, we collect 43,299,135 samples (data tuples), covering 1,308,296 allocation cases with different numbers of cores, LLC ways, and bandwidth allocations. A large amount of traces may lead to higher accuracy for ML models. The workload characteristics are converted to traces consisting of hardware parameters used for fitting and training the MLP to provide predictions.

Figure 3: Model-A data collection.
Figure 4: Model-B training.
Figure 5: Model-C in a nutshell.

4.2 Model-B: Balancing QoS and Resources

Model-B Description. Model-B employs an MLP with the same structure as Model-A' plus one more input item, i.e., QoS slowdown (Table 3). Model-B outputs the resources that a service can be deprived of under an allowable QoS slowdown. As the computing units and memory resources can be fungible [10], Model-B's outputs include three policies, i.e., <cores, LLC ways>, <cores dominated, LLC ways> and <cores, LLC ways dominated>, respectively. The tuple items are the number of cores and LLC ways deprived and reallocated to others within the allowable QoS slowdown. The term "cores dominated" indicates the policy using more cores to trade for LLC ways, and vice versa. The allowable QoS slowdown is determined according to the user requirement or the LC services' priority and is controlled by OSML's central logic. We denote Model-B's outputs as B-Points.

Model-B trades QoS for resources. For example, when an LC service (E) comes to a server that already has 4 co-located services, OSML enables Model-A' to obtain <n+, m+>, which denotes that at least n more cores and m more LLC ways should be provided to meet E's QoS. Then, OSML enables Model-B and uses the allowable QoS slowdown as an input to infer B-Points for obtaining resources from other co-located services. B-Points include the "can be deprived" resources from E's neighbors within the allowable QoS slowdown. Finally, OSML finds the best solution to match <n+, m+> with B-Points, which has a minimal impact on the co-located applications' current allocation state. Detailed logic is in Algo._1. Besides, we design Model-B' (a shadow of Model-B) to predict how much QoS slowdown will be suffered if a certain amount of resources is deprived of a specific service. Model-B' has an MLP with the same structure as Model-A' plus the items that indicate the remaining cache ways and cores after deprivation.

Model-B Training. For training Model-B and B', we reduce the allocated resources for a specific LC service from its OAA in fine-grain steps, as illustrated in Figure 4. The reduction has three angles, i.e., horizontal, oblique, and vertical; that is, B-Points include <cores dominated, LLC ways>, <cores, LLC ways>, and <cores, LLC ways dominated>, respectively. For each fine-grain resource reduction step, we collect the corresponding QoS slowdowns and then label them as less than or equal to (≤) 5%, 10%, 15%, and so on, respectively. Examples are illustrated in Figure 4, which shows the B-Points with the corresponding QoS slowdown. We collect the training data set for every LC service in Table 1. The data set contains 65,998,227 data tuples, covering 549,987 cases.

Model-B Function. We design a new loss function:

L = (1/n) × Σ_{t=1}^{n} [ y_t / (y_t + c) ] × (s_t − y_t)²,

in which s_t is the prediction output value of Model-B, y_t is the labeled value in practice, and c is a constant that is infinitely close to zero. We multiply the squared difference between s_t and y_t by y_t / (y_t + c) to avoid adjusting the weights during backpropagation in the cases where y_t = 0 and thus y_t / (y_t + c) = 0, which arise from non-existent cases (we label the non-existent cases as 0, i.e., y_t = 0, indicating that we don't find a resource-QoS trading policy in the data collection process).
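A direct transcription of this loss into TensorFlow might look like the sketch below (our own rendering, not the released code); the weight y_t / (y_t + c) zeroes out the contribution of tuples labeled 0, exactly as described above.

```python
import tensorflow as tf

# Weighted MSE for Model-B: L = (1/n) * sum_t [y_t / (y_t + c)] * (s_t - y_t)^2.
# c is a tiny constant; labels y_t = 0 mark non-existent trading policies and get
# zero weight, so they do not move the weights during backpropagation.
def model_b_loss(y_true, y_pred, c=1e-12):
    weight = y_true / (y_true + c)
    return tf.reduce_mean(weight * tf.square(y_pred - y_true))
```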
to others with the allowable QoS slowdown. The term “cores non-existent cases (we label the non-existent cases as 0, i.e.,
dominated” indicates the policy using more cores to trade the yt = 0, indicating we don’t find a resource-QoS trading policy
LLC ways, and vice versa. The allowable QoS slowdown in the data collection process).
is determined according to the user requirement or the LC
services’ priority and controlled by the OSML’s central logic. 4.3 Model-C: Handling the Changes On the Fly
We denote Model-B’s outputs as B-Points. Model-C Description. Model-C corrects the resource
Model-B trades QoS for resources. For example, when an under-/over-provision and conducts online training. Fig-
LC service (E) comes to a server that already has 4 co-located ure 5 shows the Model-C in a nutshell. Model-C’s core
services, OSML enables Model-A’ to obtain <n+, m+>, which component is an enhanced Deep Q-Network (DQN) [43],
denotes at least n more cores and m more LLC ways should consisting of two neural networks, i.e., Policy Network
be provided to meet E’s QoS. Then, OSML enables Model-B and Target Network. The Policy and Target Network em-
and uses the allowable QoS slowdown as an input to infer B- ploy the 3-layer MLP, and each hidden layer has 30 neu-
Points for obtaining resources from other co-located services. rons. Policy Network’s inputs consist of the parameters
B-Points include the “can be deprived” resources from E’s in Table 3, and the outputs are resource scheduling ac-
neighbors with the allowable QoS slowdown. Finally, OSML tions (e.g., reducing/increasing a specific number of cores
finds the best solution to match <n+, m+> with B-Points, or LLC ways) and the corresponding expectations (defined
which has a minimal impact on the co-located applications’ as Q(action)). These actions numbered 0~48 are defined
current allocation state. Detailed logic is in Algo._1. Besides, as Action_Function:{< m, n > |m ∈ [−3, 3], n ∈ [−3, 3]}, in
we design Model-B’ (a shadow of Model-B) to predict how which a positive m denotes allocating m more cores (i.e., add
much QoS slowdown will suffer if a certain amount of re- operation) for an application and a negative m means depriv-
sources is deprived of a specific service. Model-B’ has an ing it of m cores (i.e., sub operation); n indicates the actions
MLP with the same structure in Model-A’ plus the items that on LLC ways. Figure 5 illustrates Model-C’s logic. The
indicate the remaining cache ways and cores after deprivation. scheduling action with the maximum expectation value (i.e.,

USENIX Association 21st USENIX Conference on File and Storage Technologies 157
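For reference, the 49 discrete actions can be enumerated as in the sketch below; the exact index ordering is an assumption, since the paper only states that the actions are numbered 0~48 and cover all <m, n> pairs with m, n ∈ [−3, 3].

```python
# Enumerate Model-C's action space: each action adjusts cores by m and LLC ways by n.
ACTIONS = [(m, n) for m in range(-3, 4) for n in range(-3, 4)]   # 49 actions, indices 0..48

def decode_action(index: int):
    """Map an action index to a <delta_cores, delta_llc_ways> pair."""
    return ACTIONS[index]
```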
Figure 5 illustrates Model-C's logic. The scheduling action with the maximum expectation value (i.e., the action towards the best solution) will be selected in ① and executed in ②. In ③, Model-C will get the Reward value according to the Reward Function. Then, the tuple <Status, Action, Reward, Status'> will be saved in the Experience Pool in ④, which will be used during online training. The terms Status and Status' denote the system's status described by the parameters in Table 3 before and after the Action is taken. Model-C can quickly reach the ideal solutions in practice (about 2 or 3 actions). Please note that in ①, Model-C might randomly select an Action instead of the best Action with a 5% chance. By doing so, OSML avoids falling into a local optimum [43]. These random actions have a 44% chance of causing QoS violations when Model-C reduces resources. But OSML can handle the QoS violations by withdrawing the action (line 9 in Algo._3).

Model-C's Reward Function. The reward function of Model-C is defined as follows:

R_t = log(1 + Latency_{t−1} − Latency_t) − (ΔCoreNum + ΔCacheWay), if Latency_{t−1} > Latency_t
R_t = −log(1 + Latency_t − Latency_{t−1}) − (ΔCoreNum + ΔCacheWay), if Latency_{t−1} < Latency_t
R_t = −(ΔCoreNum + ΔCacheWay), if Latency_{t−1} = Latency_t

where Latency_{t−1} and Latency_t denote the latency in the previous and current status, respectively; ΔCoreNum and ΔCacheWay represent the changes in the number of cores and LLC ways, respectively. This function gives a higher reward and expectation to an Action that brings less resource usage and lower latency. Thus, Model-C can allocate appropriate resources. Algo._2 and 3 show the logic of using Model-C in detail.
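The piecewise definition above translates directly into code; the following sketch assumes the latency values of consecutive observations are compared in the same unit.

```python
import math

# Model-C's reward: lower latency and smaller resource usage yield a higher reward.
def reward(prev_latency: float, cur_latency: float,
           delta_core_num: int, delta_cache_way: int) -> float:
    resource_cost = delta_core_num + delta_cache_way
    if prev_latency > cur_latency:
        return math.log(1 + prev_latency - cur_latency) - resource_cost
    if prev_latency < cur_latency:
        return -math.log(1 + cur_latency - prev_latency) - resource_cost
    return -resource_cost
```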
Offline Training. A training data tuple includes Status, Status', Action and Reward, which denote the current status of an LC service, the status after the actions are conducted (e.g., reduce several cores or allocate more LLC ways), and the reward calculated using the above functions, respectively. To create the training data set for Model-C, we resort to the data set used in Model-A training. The process is as follows. Two tuples in the Model-A training data set are selected to denote Status and Status', and we further get the differences in resource allocations between the two statuses (i.e., the actions that cause the status shifting). Then, we use the reward function to get the reward accordingly. These 4 values form a specific tuple in the Model-C training data set. In practice, as there are a large number of data tuples in the Model-A training data set, it is impossible to try every pair of tuples in the data set; we only select pairs of tuples from resource allocation policies that differ by at most 3 cores or 3 LLC ways. Moreover, we also collect the training data in the cases where cache way sharing happens and preserve them in the experience pool. Therefore, Model-C can work in resource-sharing cases. To sum up, we have 1,521,549,190 tuples in the Model-C training data set.

Online Training. Model-C collects online traces. The training flow is in the right part of Figure 5. Model-C randomly selects some data tuples (200 by default) from the Experience Pool. For each tuple, Model-C uses the Policy Network to get the Action's expectation value (i.e., Q(Action) [43]) with the Status. In Model-C, the target of the Action's expectation value is the Reward observed plus the weighted best expectation value of the next status (i.e., Status'). As illustrated in Figure 5, Model-C uses the Target Network to obtain the expectation values of Status' for the actions in Action_Function and then finds the best one, i.e., Max(Q(Action')). We design a new Loss Function based on MSE: (Reward + γ·Max(Q(Action')) − Q(Action))². It helps the Policy Network predict closer to the target. The Policy Network is updated during online training. The Target Network's weights are synchronized periodically with the Policy Network's weights. Doing so enables the Target Network to provide stable predictions for the best expectation value of Status' within a predefined number of time steps, thus improving the stability of the training and prediction.
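One online training step can be sketched as follows, assuming policy_net and target_net are Keras models that output one Q value per action (49 actions) and that the batch tensors are sampled from the Experience Pool. The value of γ (the weight on the next status' best expectation) and the optimizer settings beyond the RMSProp choice in Table 4 are assumptions.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop()

# One DQN update: target = Reward + gamma * max_a' Q_target(Status', a'),
# loss = (target - Q_policy(Status, Action))^2, averaged over the sampled batch.
def train_step(policy_net, target_net, status, action, reward, next_status, gamma=0.9):
    next_q = target_net(next_status)                           # [batch, 49]
    target = reward + gamma * tf.reduce_max(next_q, axis=1)    # [batch]
    with tf.GradientTape() as tape:
        q_all = policy_net(status)                                        # [batch, 49]
        q_taken = tf.reduce_sum(q_all * tf.one_hot(action, 49), axis=1)   # [batch]
        loss = tf.reduce_mean(tf.square(target - q_taken))
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return loss
```

The Target Network's weights would then be copied from the Policy Network at a fixed period, as described above.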
Table 4: The summary of ML models in OSML.

ML | Model | Features | Size | Loss Function | Gradient Descent | Activation Function
A | MLP | 9 | 144 KB | Mean Square Error (MSE) | Adam Optimizer | ReLU
A' | MLP | 12 | 155 KB | MSE | Adam Optimizer | ReLU
B | MLP | 13 | 110 KB | Modified MSE | Adam Optimizer | ReLU
B' | MLP | 14 | 106 KB | MSE | Adam Optimizer | ReLU
C | DQN | 8 | 141 KB | Modified MSE | RMSProp | ReLU

4.4 Discussions on the Design of ML Models

(1) Why use MLPs? Table 4 characterizes the ML models used in OSML. We employ three-layered MLPs in Model-A and B because they can fit continuous functions with arbitrary precision given a sufficient number of neurons in each layer [67], and we can use extensive training data to improve the accuracy of MLPs for predicting OAAs and RCliffs. Moreover, after offline training, using MLPs brings negligible run-time overheads to OSML. (2) Why do we need the three models? We divide OSML's scheduling logic into three parts, which the three models cover, respectively. The models work in different scheduling phases, and no single model can handle all cases. For example, Model-A predicts the RCliffs and OAAs; Model-B predicts the QoS variations and resource margins in co-location cases. The DQN in Model-C learns online to shepherd the scheduling results from Model-A/B. They are necessary and work cooperatively to cover the main scheduling cases. Moreover, they are easier to generalize than other approaches, e.g., a table lookup approach (Sec.6.4). Why not only use the online learning Model-C? Model-C uses a DQN that depends on the start points. It starts with Model-A/B's outputs to avoid exploring the whole (large) scheduling space. Without the approximate OAA provided by Model-A for many unseen cases, only using Model-C will incur more scheduling actions (overheads). (3) Insights on generalization for unseen apps and new servers. (i) We use "hold-out cross validation", i.e., the training data (70% of the whole data set) excludes the testing data (30%) for each LC service. (ii) We train models with extensive representative traces from many typical services (e.g., memory/CPU intensive, random memory access, etc. [10,46]), fitting the correlation between architectural hints for mainstream features (e.g., IPC, cache misses, memory footprint, CPU/LLC utilization and MBL), the OAA, and the QoS demands. For instance, the Spearman correlation coefficient is 0.571, 0.499, and -0.457 between the OAA and cache misses, MBL, and IPC, respectively. On different platforms or for new/unseen applications, these numbers might vary; however, this correlation trend does not change, enabling OSML to generalize to other situations. Even for errors caused by deviations from the training dataset for unseen applications/new platforms, Model-C can learn online and correct them. (iii) Using transfer learning, collecting new traces on a new server for several hours will make OSML work well for it (refer to Sec.6.4). (iv) OSML is a long-term project open to the community; we continue adding new traces collected from new applications and new servers to the data set to enhance the models' performance for new cases.

Figure 6: The overview of the system design of OSML.
Figure 7: OSML's central logic.
5 OSML: System Design

5.1 The Central Control Logic

The overview of OSML is in Figure 6. OSML is a per-node scheduler. The central controller of OSML coordinates the ML models, manages the data/control flow, and reports the scheduling results. Figure 7 shows its overall control logic.

Allocating Resources for LC services. Algo._1 shows how OSML uses Model-A and B in practice. Figure 7 highlights its operations. For a newly coming LC service, the central controller calls Model-A via the interface modelA_oaa_rcliff() to get the OAA and RCliff. Suppose the current idle resources are not sufficient to satisfy the new LC service. In that case, OSML will enable Model-B through the interface modelB_trade_qos_res() to deprive some resources of other LC services within the allowable QoS slowdown (controlled by the upper-level scheduler) and then allocate them to the new one. The upper-level scheduler manages the cluster consisting of nodes using OSML. In the depriving process for a specific LC service, OSML reduces its allocated resources and gets close to the RCliff, but it will not easily fall off the RCliff unless expressly permitted (refer to Algo._4).

Dynamic Adjusting. Figure 7 shows the dynamic adjusting of Algo._2 and 3, in which Model-C plays the dominant role. During the run time, OSML monitors each LC service's QoS status every second. If a QoS violation is detected, the central controller will enable Algo._2 and call Model-C to allocate more resources to achieve the ideal QoS. The interface is modelC_upsize(). If OSML finds an LC service is over-provisioned (i.e., wasting resources), Algo._3 will be used to reclaim them, and Model-C will be called via modelC_downsize().

Moreover, Algo._4 will enable resource sharing (the default scheduling is to do hard partitioning of cores/LLC ways) if all of the co-located LC services are close to their RCliffs and the upper scheduler still wants to increase loads on this server. Model-A and B work cooperatively to accomplish this goal in Algo._4. In practice, to minimize the adverse effects, resource sharing usually happens between only two applications. Note that Algo._4 might incur resource sharing over the RCliff, and thus may bring higher response latency for one or more LC services. OSML will report the potential QoS slowdown to the upper scheduler and ask for decisions. If the slowdown is not allowed, the corresponding actions will not be conducted.

Bandwidth Scheduling. OSML partitions the overall bandwidth for each co-located LC service according to the ratio BW_j / Σ_i BW_i, where BW_j is an LC service's OAA bandwidth requirement, which is obtained from Model-A. Note that such scheduling needs the Memory Bandwidth Allocation (MBA) mechanism [4,5] in the CPU.
5.2 Implementation

OSML learns whether the LC services have met their QoS targets. OSML monitors the run-time parameters for each co-located LC service using performance counters every second (default). If the observation period is too short, other factors (e.g., cache data evicted from the cache hierarchy, context switches) may interfere with the sampling results. Moreover, OSML performs well with other interval settings and allows configuration flexibility (e.g., 1.5 or 2 seconds).

We design OSML to work cooperatively with the OS (Figure 6). As the kernel space lacks support for ML libraries, OSML lies in the user space and exchanges information with the OS kernel. OSML is implemented using Python and C. It employs Intel CAT technology [4] to control the cache way allocations, and it supports dynamic adjusting. OSML uses Linux's taskset and MBA [5] to allocate specific cores and bandwidth to an LC service. OSML captures the online performance parameters by using the pqos tool [4] and the PMU [5]. The ML models are based on TensorFlow [6] with version 2.0.4, and can be run on either CPU or GPU.
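To illustrate how such an allocation can be applied from user space with the tools named above, here is a hedged sketch using taskset (core affinity) and the pqos utility (CAT way masks, MBA throttling, and core-to-COS association). The COS id, core list, way mask, and MBA percentage are illustrative values, and this is not the authors' actual code path.

```python
import subprocess

# Apply one resource-allocation decision to a running LC service (illustrative only).
def apply_allocation(pid: int, cores: str, cos: int, way_mask: int, mba_pct: int) -> None:
    # Pin all threads of the service to the chosen cores.
    subprocess.run(["taskset", "-a", "-cp", cores, str(pid)], check=True)
    # Program the LLC way mask and the MBA throttling level for this class of service.
    subprocess.run(["pqos", "-e", f"llc:{cos}={hex(way_mask)}"], check=True)
    subprocess.run(["pqos", "-e", f"mba:{cos}={mba_pct}"], check=True)
    # Associate the chosen cores with the class of service.
    subprocess.run(["pqos", "-a", f"llc:{cos}={cores}"], check=True)

# Example: apply_allocation(pid=12345, cores="0-7", cos=1, way_mask=0x3ff, mba_pct=60)
```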
6 Evaluations

6.1 Methodology

We evaluate the per-node OSML performance on our platform in Table 2. Details on the LC services are in Table 1. The metrics involve the QoS (similar to [10], the QoS target of each application is the 99th percentile latency at the knee of the latency-RPS curve; latency higher than the QoS target is a violation) and the Effective Machine Utilization (EMU) [10] (the max aggregated load of all co-located LC services) – higher is better. We first evaluate the scenarios where LC services run at constant loads, with loads from 10% to 100%. Then, we explore workload churn. We inject applications with loads from 20% to 100% of their respective max load. Furthermore, to evaluate the generalization of OSML, we employ some new/unseen applications that are not in Table 1 and a new platform in our experiments. If an allocation in which all applications meet their QoS cannot be found after 3 mins, we signal that the scheduler cannot deliver QoS for that configuration.
ORACLE. We obtain these results by exhaustive offline sam-
cations meet their QoS cannot be found after 3 mins, we signal
pling and find the best allocation policy. It indicates the
that the scheduler cannot deliver QoS for that configuration.
ceiling that the schedulers try to achieve.
6.2 OSML Effectiveness We show the effectiveness of OSML as follow.
We compare OSML with the most related approaches in (1) OSML exhibits a shorter scheduling convergence time.
[10,46] based on the latest open-source version. Using ML models, OSML achieves OAA quickly and effi-
PARTIES [10]. It makes incremental adjustments in one- ciently handles cases with diverse loads. Figure 8-a shows the
dimension resource at a time until QoS is satisfied – “trial distributions of the scheduling results of 104 loads for OSML,

160 21st USENIX Conference on File and Storage Technologies USENIX Association
Every dot represents a specific workload that contains 3 co-located LC services with different RPS. We launch the applications in turn and use a scheduler to handle QoS violations until all applications meet their QoS targets. The x-axis shows the convergence time; the y-axis denotes the achieved EMU. Generally, OSML can achieve the same EMU with a shorter convergence time for a specific load. Figure 8-b shows the violin plots of convergence time for these loads. On average, OSML takes 20.9 seconds to converge, while PARTIES and CLITE take 32.7 and 46.3 seconds, respectively. OSML converges 1.56X and 2.22X faster than PARTIES and CLITE. OSML performs stably – the convergence time ranges from 5.3s (best case) to 80.0s (worst case). By contrast, the convergence time in PARTIES ranges from 5.5s to 111.1s, and CLITE is from 14.0s to 140.6s. OSML converges faster mainly because the start point in the scheduling space provided by Model-A is close to the OAA. PARTIES and CLITE take a longer convergence time, indicating that they require high scheduling overheads in cloud environments. In the cloud, jobs come and go frequently; thus, scheduling occurs frequently, and a longer scheduling convergence time often leads to unstable/low QoS.

Figure 8: (a) The performance distributions for 104 loads that OSML, PARTIES and CLITE can all converge. (b) Violin plots of convergence time for loads in (a).

We further analyze how these schedulers work in detail. Figure 9-a/b/c show the actions used in OSML's, PARTIES's, and CLITE's scheduling process for case A in Figure 8. This case includes Moses, Img-dnn, and Xapian with 40%, 60%, and 50% of their maximum loads. For this load, PARTIES, CLITE and OSML take 14.5 seconds, 72.6 seconds and 8.2 seconds to converge, respectively. Figure 9 highlights scheduling actions using solid red lines to represent increasing resources and blue dotted lines to denote reducing resources. Figure 9-a shows PARTIES takes 7 actions for scheduling cores and 1 action for cache ways. It schedules in a fine-grained way by increasing/decreasing one resource at a time. CLITE relies on the sampling points in the scheduling exploration space. Figure 9-b shows CLITE repeats sampling until the "expected improvement" in CLITE drops below a certain threshold. CLITE only performs five scheduling actions according to its latest open-source version, but it takes the longest convergence time (72.6 seconds). The underlying reason is that CLITE's sampling/scheduling doesn't have clear targets. In practice, the improper resource partitions/allocations during sampling lead to the accumulation of requests, and the requests cannot be handled due to resource under-provision. Therefore, it brings a significant increase in response latency. Moreover, due to the early termination of CLITE's scheduling process, CLITE cannot schedule resources to handle QoS violations in a timely manner, leading to a long convergence time. Figure 9-c shows OSML achieves the OAA for each LC service with 5 actions. Compared with prior schedulers, OSML has clear aims and schedules multiple resources simultaneously to achieve them. It has the shortest convergence time – 8.2 seconds.

Moreover, as the scheduling is fast, OSML often supports more loads. Figure 10 shows OSML's results on scheduling the three LC services – Moses, Img-dnn, and Xapian. For a specific scheduling phase, by using ML to achieve the OAA, OSML supports 10~50% higher loads than PARTIES and CLITE in the highlighted cells in Figure 10-d, accounting for 58% of the cases that can be scheduled. All schedulers perform better than the Unmanaged allocation (Figure 10-a), as they reduce the resource contention.

(2) Compared with PARTIES and CLITE, OSML consumes fewer resources to support the same loads while meeting the QoS targets. On average, for the 104 loads in Figure 8-a, OSML uses 34 cores and 16 LLC ways; by contrast, PARTIES and CLITE exhaust all of the platform's 36 cores and 20 LLC ways. As illustrated in Figure 9-a, PARTIES partitions the LLC ways and cores equally for each LC service at the beginning; once it meets the QoS target (using 8 actions), it stops. Thus, PARTIES drops the opportunities to explore alternative better solutions (i.e., using fewer cores or cache ways to meet identical QoS targets). PARTIES allocates all cores and LLC ways in the end. CLITE also uses all cores and cache ways, as shown in Figure 9-b. By contrast, OSML schedules according to applications' resource requirements instead of using all resources. Figure 9-c shows that using Model-A, OSML achieves each LC service's OAA (the optimal solution) after 5 actions. OSML detects and reclaims over-provided resources using Model-C. For example, the last action in Figure 9-c reclaims 3 cores and 2 LLC ways from Xapian. Finally, OSML saves 3 cores and 9 LLC ways. As OSML is designed for LC services that are executed for a long period, saving resources means saving budgets for cloud providers.

(3) Using ML models, OSML provides solutions for sharing some cores and LLC ways among LC services, therefore supporting higher loads. PARTIES and CLITE don't support resource sharing in their original designs. Using Algo._4, OSML lists some potential resource sharing solutions, and then enables Model-B' to predict the QoS slowdown for each case. The sharing solution with a relatively lower QoS slowdown is selected. For more details refer to Figure 7. Figure 9-d shows how OSML shares resources for the highlighted case B in Figure 10-d. OSML enables Model-C to add resources for Moses in Algo._2 and uses Algo._4 to share 2 CPU cores with Xapian. Finally, the QoS is met. By enabling resource sharing, OSML can support higher loads than PARTIES and CLITE, and can even be close to ORACLE in Figure 10-e.
Without OSML, however, the "trial and error" approach has to try to share cores/cache ways in a fine-grain way among applications, and then observe the real-time latency to make a further decision, inevitably incurring higher scheduling overheads and bringing a sharp QoS slowdown if falling off the RCliff.

Figure 9: Resource usage comparisons for OSML, PARTIES, and CLITE.
Figure 10: Co-location of Moses, Img-dnn and Xapian. The heatmap values are the percentage of the third LC service's (i.e., Xapian) achieved max load without QoS violations in these cases. The x- and y-axis denote the first and second app's fraction of their max loads (with QoS target), respectively. A cross means the QoS target cannot be met. The related studies [10,46] use heat maps to show their effectiveness, so we also use heat maps for comparisons in this work.
Figure 11: Throughput distribution.
Figure 13: Highlighted scheduling traces in the scheduling space for all schedulers from time point 180 to 228 in Figure 12. Each circle denotes a specific scheduling policy conducted by a specific scheduler. The number in a circle denotes the sequence of these scheduling actions during the scheduling phase.

(4) OSML promptly handles resource under-/over-provision and QoS violations using Model-C. Based on Model-A/B's results, Model-C shepherds and adjusts the allocations with several actions for each application in our experiments and converges more quickly than previous approaches. More experiments on dynamic, complicated cases can be found in Sec.6.3. Can we only use Model-C or only use Model-A/B? Enabling the three models is necessary for OSML. For the case in Figure 9-c, when OSML uses the three models collaboratively for scheduling, it takes 8.2s and 5 actions to achieve the OAA for all applications. By contrast, if only enabling Model-C, it takes 18.5s and 13 actions. This is because Model-A/B can at least provide an approximate OAA for scheduling, thus reducing the convergence time. Just using Model-A/B may lead to 4-core errors for an unseen LC service (Table 5). So, Model-C is needed to correct the errors. We cannot disable any models in OSML in practice.

(5) OSML performs well in various cases. Figure 11 shows the EMU distribution for converged loads among 302 loads (each has 3 apps with diverse RPS). EMU reflects the system throughput [10,32]. OSML can produce scheduling results for 285 loads; PARTIES and CLITE can work for 260 and 148 loads, respectively. OSML's distribution in Figure 11 is wider than PARTIES's and CLITE's. This indicates that OSML works for more loads, including those with a high EMU (e.g., 130-180%). Even for the loads that OSML, PARTIES, and CLITE can all converge, OSML can converge faster (Figure 8).
Table 5: ML Models’ performance on average. The units of errors
are the number of cores/LLC ways. For Model B’, the unit is the
percentage of QoS slowdown.
Errors for Err on new
Error unseen LC platforms
ML Outputs services (TL) MSE Over-
heads
Core LLC Core LLC Core LLC
RCliff 0.589 0.288 1.266 0.198 2.142 0.542
A 0.0025 0.20s
OAA 0.580 0.360 1.276 0.191 2.004 0.865
RCliff 1.072 0.815 3.403 1.835 0.772 0.411
A’ 0.0135 0.20s
OAA 1.072 0.814 3.404 1.835 0.790 0.413
B-Points 0.612 0.053 4.012 0.167 2.320 0.969
B-Points,Core 0.314 0.048 3.434 0.937 2.250 0.815
B dominated 0.0012 0.18s
B-Points,Cache 0.093 0.462 0.789 0.783 1.868 1.519
dominated
B’ QoS reduction 7.87% 8.33% 11.28% 0.0035 0.19s
Scheduling
C actions
0.908 0.782 0.844 0.841 1.390 1.801 0.7051 0.20s

for Img-dnn from 180 to 228. During this phase, PARTIES


does not add resources for Img-dnn; but it add more resources
for Specjbb and Xapian as they are with higher latency. Img-
dnn’s response latency keeps increasing. CLITE samples
several scheduling policies in the scheduling space, but does
not converge and thus incurs QoS fluctuations. By contrast,
OSML’s Model-C can handle QoS violations of Img-dnn even
the load increases. Moreover, as mentioned before, OSML
Figure 12: OSML’s performance in reality with varying loads. saves resources and thus it can serve more workloads. For ex-
source contentions among them. In Figure 12-b, PARTIES ample, as shown in Figure 12, Mysql (an unseen workload in
aids the QoS violations step by step. During the scheduling, training) comes at time point 180; OSML allocates the saved
Moses always has high latency until it ends at 80 seconds. cores to it without sharing or depriving other LC services of
CLITE’s scheduling relies on sampling and Bayesian opti- resources.
6.4 New/Unseen Apps and New Platforms

Generalization. Based on our data set, the ML models can be well trained. The error will be at most 4 cores for unseen applications (Table 5). We evaluate OSML using unseen applications that are not used in training, i.e., Silo [62], Shore [62], Mysql [70], Redis [77], and Node.js [78]. They exhibit diverse computing/memory patterns. We evaluate the convergence time of OSML with 3 groups of workloads. Each group has 15 workloads, and each workload has 3 LC services. The workloads in Group 1 have 1 unseen application; the workloads in Groups 2 and 3 have 2 and all unseen applications, respectively. OSML takes 24.6s, 29.3s, and 31.0s, on average, to converge for the 3 groups of workloads. OSML performs well even for unseen cases, showing good generalization. PARTIES uses 34.9s, 43.6s, and 42.8s; CLITE uses 58.5s, 60.6s, and 46.5s for these applications. Note that PARTIES's and CLITE's scheduling does not need prior knowledge of applications, so their performance does not depend on whether an application is seen or unseen.

For new platforms, we use fine-tuning in transfer learning (TL). We freeze the first hidden layer of the MLPs; we retrain the last two hidden layers and the output layer using the traces collected on two new platforms (with CPU Xeon Gold 6240M and E5-2630 v4, respectively).
For each LC service, based on our data set, collecting new traces on a new platform for several hours will be sufficient (the more allocation cases covered, the better). The model updating can be accomplished in hours, and the time consumption will be shorter if multiple machines are used in parallel. Table 5 shows the average values of the ML models' quality. The new models' prediction errors are slightly higher than those of the previous models on the original platforms, but OSML still handles them well. By contrast, if we used a table lookup approach instead, we would have to use additional memory to store the data tuples, e.g., 60GB would be wasted for the current data set to replace Model-A. More importantly, this approach is difficult to generalize to new/unseen applications or platforms, as their traces and the corresponding OAAs do not exist in the current data set.
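A minimal sketch of this fine-tuning scheme is shown below. The three-hidden-layer MLP, the layer widths, and the Keras-style training calls are our illustrative assumptions rather than OSML's exact implementation.

    import tensorflow as tf

    def build_mlp(n_inputs, n_outputs):
        # A three-hidden-layer MLP; the widths are illustrative placeholders.
        return tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(n_inputs,)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(n_outputs),
        ])

    def fine_tune(model, new_traces, new_labels, epochs=10):
        """Adapt a trained model to a new platform: freeze the first hidden layer,
        then retrain the last two hidden layers and the output layer on new traces."""
        model.layers[0].trainable = False      # the first hidden layer stays fixed
        for layer in model.layers[1:]:
            layer.trainable = True             # these layers adapt to the new hardware
        model.compile(optimizer="adam", loss="mse")
        model.fit(new_traces, new_labels, epochs=epochs, batch_size=256)
        return model

The intuition is that the frozen layer keeps the lower-level feature mapping while the retrained layers adapt to the new hardware, which is why a few hours of traces from the new platform suffice, as noted above.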
Overheads. OSML takes 0.2s on a single core in each scheduling interval (e.g., 1 second by default in Sec. 5.2): 0.01s for the ML models and 0.19s for online monitoring. As our models are lightweight (OSML uses only one core), running them on CPU or GPU incurs a similar overhead. If the models are on GPU, it takes an extra 0.03s to receive results from the GPU; OSML does not monopolize the GPU. Generally, the overhead does not hinder overall performance. In the cloud, applications' behaviors may change every second due to the diversity of user demands, so OSML plays a critical role during the entire run time. For building models, using our current data set on the platform in Table 2, it takes 3.3 mins, 5 mins, and 8.3 hours to train Model-A, B, and C for one epoch (all training samples are used to train once), respectively. We train the models for ten epochs. Training can be accelerated using the popular multi-GPU training technology. Doing so is practical in datacenters, and training time will not impede practice.
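For clarity, the per-interval control flow behind these numbers can be sketched as follows; monitor, models.infer, and apply_actions are hypothetical placeholders for OSML's monitoring, inference, and allocation steps, not its real interfaces.

    import time

    SCHED_INTERVAL = 1.0  # seconds; the default interval mentioned above

    def scheduling_loop(services, monitor, models, apply_actions):
        # Illustrative per-interval loop; the three callbacks are placeholders.
        while True:
            start = time.time()
            status = monitor(services)        # online monitoring (~0.19s): counters, latencies, footprints
            actions = models.infer(status)    # multi-model inference (~0.01s)
            apply_actions(actions)            # adjust cores/LLC ways for each service
            time.sleep(max(0.0, SCHED_INTERVAL - (time.time() - start)))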
7 Related Work and Our Novelty

ML for System Optimizations. The work in [55] employs DNN to optimize the buffer size for database systems. The efforts in [22,56,34] leverage ML to optimize computer architecture or resource management in the network to meet various workloads. The studies in [9,39] use ML to manage on-chip hardware resources. CALOREE [41] can learn key control parameters to meet latency requirements with minimal energy in complex environments. The studies in [26,31,58,59] optimize OS components with learned rules or propose insights on designing a new learned OS. In OSML, we design an intelligent multi-model collaborative learning approach, providing better co-location solutions that meet QoS targets for LC services faster and more stably than the latest work.

ML for Scheduling. Decima [35] designs cluster-level data processing job scheduling using RL. Resource Central [12] builds a system that contains the historical resource utilization information of the workloads used in Azure and employs ML to predict resource management for VMs. The study in [40] uses RL to predict which subsets of operations in a TensorFlow graph should run on the available devices. Paragon [14] classifies and learns workload interference. Quasar [15] determines jobs' resource preferences on clusters. Sinan [74] uses ML models to determine the performance dependencies between microservices in clusters. These are cluster schedulers [14,15,74]. By contrast, OSML deeply studies scheduling in co-location cases. Selecta [72] predicts near-optimal configurations of computing and storage resources for analytics workloads based on profiling data. CLITE [46] uses Bayesian optimization for scheduling on-node resources. The work in [48] applies ML to predict the end-to-end tail latency of LC service workflows. Twig [63] uses RL to characterize tail latency for energy-efficient task management. CuttleSys [76] leverages data mining to identify suitable core and cache configurations for co-scheduled applications. For complicated co-location cases, OSML can avoid RCliffs and quickly achieve the ideal allocations (OAA) for multiple interactive resources simultaneously for LC services. Moreover, OSML performs well in generalization.

Resource Partitioning. PARTIES [10] partitions cache, main memory, I/O, network, disk bandwidth, etc., to provide QoS for co-located services. The studies in [17,28,57,71] design new LLC partitioning/sharing policies. The efforts in [23,27,44,45,73] show that cooperative partitioning of the LLC, memory banks, and channels outperforms one-level memory partitioning. However, cooperative partitioning policies need to be carefully designed [29,30,37], and [16,32] show that heuristic resource scheduling approaches can be ineffective in many QoS-constrained cases. [7,11] study the "performance cliff" on cache for SPECCPU 2006 applications and Memcached. Caladan [75] does not involve cache optimizations, so core/cache cliffs cannot be avoided, causing QoS fluctuations in some cases. By contrast, OSML is the first work that profoundly explores the cache cliff and core cliff simultaneously (i.e., RCliff) for many widely used LC services in co-location cases. OSML is a representative work that uses ML to guide the partitioning of multiple resources in co-location cases, and it is cost-effective in new cloud environments.

8 Conclusion

We present OSML, an ML-based resource scheduler for co-located LC services. We learn that straightforwardly using a simple ML model might not handle all of the processes during the scheduling. Therefore, using multiple ML models cooperatively in a pipelined way can be an ideal approach. More importantly, we advocate that the new solution, i.e., leveraging ML to enhance resource scheduling, could have immense potential for OS design. In a world where co-location and sharing are a fundamental reality, our solution should grow in importance and benefit our community.

Acknowledgement

We thank the reviewers and our shepherd Young-ri Choi for the invaluable comments. This work is supported by the Key-Area R&D Program of Guangdong (No.2021B0101310002), NSFC (No.62072432). L. Liu thanks Hecheng's arriving. X. Dou and Y. Chen are students in Sys-Inventor Lab led by L. Liu.
164 21st USENIX Conference on File and Storage Technologies USENIX Association
References

[1] "How 1s could cost amazon $1.6 billion in sales." https://www.fastcompany.com/1825005/how-one-second-could-cost-amazon-16-billion-sales

[2] "Microservices workshop: Why, what, and how to get there," http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference

[3] "State of the Cloud Report," http://www.righscale.com/lp/state-of-the-cloud. Accessed: 2019-01-28

[4] "Improving real-time performance by utilizing cache allocation technology," https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/cache-allocation-technology-white-paper.pdf, Intel Corporation, April, 2015

[5] "Intel 64 and IA-32 Architectures Software Developer's Manual," https://software.intel.com/en-us/articles/intel-sdm, Intel Corporation, October, 2016

[6] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, "TensorFlow: A System for Large-Scale Machine Learning," in OSDI, 2016

[7] Nathan Beckmann, Daniel Sanchez, "Talus: A Simple Way to Remove Cliffs in Cache Performance," in HPCA, 2015

[8] Daniel S. Berger, Benjamin Berg, Timothy Zhu, Siddhartha Sen, Mor Harchol-Balter, "RobinHood: Tail Latency Aware Caching – Dynamic Reallocation from Cache-Rich to Cache-Poor," in OSDI, 2018

[9] Ramazan Bitirgen, Engin Ipek, Jose F. Martinez, "Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach," in Micro, 2008

[10] Shuang Chen, Christina Delimitrou, José F. Martínez, "PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services," in ASPLOS, 2019

[11] Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, Sachin Katti, "Cliffhanger: Scaling Performance Cliffs in Web Memory Caches," in NSDI, 2016

[12] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, Ricardo Bianchini, "Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms," in SOSP, 2017

[13] Jeff Dean, David A. Patterson, Cliff Young, "A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution," in IEEE Micro, 2018

[14] Christina Delimitrou, Christos Kozyrakis, "QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon," in ACM TOCS, 2013

[15] Christina Delimitrou, Christos Kozyrakis, "Quasar: Resource-Efficient and QoS-Aware Cluster Management," in ASPLOS, 2014

[16] Yi Ding, Nikita Mishra, Henry Hoffmann, "Generative and Multi-phase Learning for Computer Systems Optimization," in ISCA, 2019

[17] Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, Harshad Kasture, Xiaosong Ma, Daniel Sanchez, "KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores," in HPCA, 2018

[18] Yu Gan and Christina Delimitrou, "The Architectural Implications of Cloud Microservices," in IEEE Computer Architecture Letters, 2018

[19] Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, Christina Delimitrou, "Leveraging Deep Learning to Improve Performance Predictability in Cloud Microservices with Seer," in ACM SIGOPS Operating Systems Review, 2019

[20] Mark D. Hill, Michael R. Marty, "Amdahl's Law in the Multicore Era," in IEEE Computers, 2008

[21] Kurt Hornik, "Approximation Capabilities of Multilayer Feedforward Networks," in Neural Networks, 1991

[22] Engin Ipek, Onur Mutlu, José F. Martínez, Rich Caruana, "Self-Optimizing Memory Controllers: A Reinforcement Learning Approach," in ISCA, 2008

[23] Min Kyu Jeong, Doe Hyun Yoon, Dam Sunwoo, Michael Sullivan, Ikhwan Lee, Mattan Erez, "Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems," in HPCA, 2012

[24] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon, "In-Datacenter Performance Analysis of a Tensor Processing Unit," in ISCA, 2017

[25] Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, 2012

[26] Yanjing Li, Onur Mutlu, Subhasish Mitra, "Operating System Scheduling for Efficient Online Self-Test in Robust Systems," in ICCAD, 2009

[27] Lei Liu, et al., "A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore Systems," in PACT, 2012

[28] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, P. Sadayappan, "Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems," in HPCA, 2008

[29] Fang Liu, Yan Solihin, "Studying the Impact of Hardware Prefetching and Bandwidth Partitioning in Chip-Multiprocessors," in Sigmetrics, 2011

[30] Seung-Hwan Lim, Jae-Seok Huh, Youngjae Kim, Galen M. Shipman, Chita R. Das, "D-Factor: A Quantitative Model of Application Slowdown in Multi-Resource Shared Systems," in Sigmetrics, 2012

[31] Lei Liu, et al., "Rethinking Memory Management in Modern Operating System: Horizontal, Vertical or Random?" in IEEE Trans. on Computers, 2016

[32] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, Christos Kozyrakis, "Heracles: Improving Resource Efficiency at Scale," in ISCA, 2015

[33] Lei Liu, et al., "Hierarchical Hybrid Memory Management in OS for Tiered Memory Systems," in IEEE Trans. on Parallel and Distributed Systems, 2019

[34] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, Srikanth Kandula, "Resource Management with Deep Reinforcement Learning," in HotNets-XV, 2016

[35] Hongzi Mao, Malte Schwarzkopf, Shaileshh B. Venkatakrishnan, Zili Meng, Mohammad Alizadeh, "Learning Scheduling Algorithms for Data Processing Clusters," in SIGCOMM, 2019

[36] Yashwant Marathe, Nagendra Gulur, Jee Ho Ryoo, Shuang Song, and Lizy K. John, "CSALT: Context Switch Aware Large TLB," in Micro, 2017

[37] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, Mary Lou Soffa, "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," in Micro, 2011

[38] Jason Mars, Lingjia Tang, Mary Lou Soffa, "Directly Characterizing Cross Core Interference Through Contention Synthesis," in HiPEAC, 2011

[39] Jose F. Martinez, Engin Ipek, "Dynamic Multicore Resource Management: A Machine Learning Approach," in IEEE Micro 29 (5): 8-17, 2009

[40] Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, Jeff Dean, "Learning Device Placement in TensorFlow Computations," in arXiv:1706.04972

[41] Nikita Mishra, Connor Imes, John D. Lafferty, Henry Hoffmann, "CALOREE: Learning Control for Predictable Latency and Low Energy," in ASPLOS, 2018
[42] Nikita Mishra, Harper Zhang, John Lafferty, Henry Hoffmann, "A Probabilistic Graphical Model-based Approach for Minimizing Energy Under Performance Constraints," in ASPLOS, 2015

[43] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis, "Human-level control through deep reinforcement learning," in Nature 518 (7540): 529-533, 2015

[44] Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning," in Micro, 2011

[45] Jinsu Park, Seongbeom Park, Woongki Baek, "CoPart: Coordinated Partitioning of Last-Level Cache and Memory Bandwidth for Fairness-Aware Workload Consolidation on Commodity Servers," in EuroSys, 2019

[46] Tirthak Patel, Devesh Tiwari, "CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers," in HPCA, 2020

[47] Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ousterhout, "Arachne: Core-Aware Thread Management," in OSDI, 2018

[48] Joy Rahman, Palden Lama, "Predicting the End-to-End Tail Latency of Containerized Microservices in the Cloud," in IC2E, 2019

[49] Yizhou Shan, Yutong Huang, Yilun Chen, Yiying Zhang, "LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation," in OSDI, 2018

[50] Prateek Sharma, Ahmed Ali-Eldin, Prashant Shenoy, "Resource Deflation: A New Approach For Transient Resource Reclamation," in EuroSys, 2019

[51] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis, "Mastering the game of Go with deep neural networks and tree search," in Nature, 529 (7587), 2016

[52] Akshitha Sriraman, Abhishek Dhanotia, Thomas F. Wenisch, "SoftSKU: Optimizing Server Architectures for Microservice Diversity @Scale," in ISCA, 2019

[53] Akshitha Sriraman, Thomas F. Wenisch, "µTune: Auto-Tuned Threading for OLDI Microservices," in OSDI, 2018

[54] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, "Going deeper with convolutions," in CVPR, 2015

[55] Jian Tan, Tieying Zhang, Feifei Li, Jie Chen, Qixing Zheng, Ping Zhang, Honglin Qiao, Yue Shi, Wei Cao, Rui Zhang, "iBTune: Individualized Buffer Tuning for Large-scale Cloud Databases," in VLDB, 2019

[56] Stephen J. Tarsa, Rangeen Basu Roy Chowdhury, Julien Sebot, Gautham Chinya, Jayesh Gaur, Karthik Sankaranarayanan, Chit-Kwan Lin, Robert Chappell, Ronak Singhal, Hong Wang, "Post-Silicon CPU Adaptations Made Practical Using Machine Learning," in ISCA, 2019

[57] Xiaodong Wang, Shuang Chen, Jeff Setter, Jose F. Martínez, "SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support," in HPCA, 2017

[58] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee, "Nimble Page Management for Tiered Memory Systems," in ASPLOS, 2019

[59] Yiying Zhang, Yutong Huang, "Learned Operating Systems," in ACM SIGOPS Operating Systems Review, 2019

[60] Zhijia Zhao, Bo Wu, Xipeng Shen, "Challenging the "Embarrassingly Sequential": Parallelizing Finite State Machine-based Computations through Principled Speculation," in ASPLOS, 2014

[61] Xiaoya Xiang, Chen Ding, Hao Luo, Bin Bao, "HOTL: A higher order theory of locality," in ASPLOS, 2013

[62] Harshad Kasture, Daniel Sanchez, "Tailbench: a benchmark suite and evaluation methodology for latency-critical applications," in IISWC, 2016

[63] Rajiv Nishtala, Vinicius Petrucci, Paul Carpenter, Magnus Sjalander, "Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services," in HPCA, 2020

[64] MongoDB official website. http://www.mongodb.com

[65] Memcached official website. https://memcached.org

[66] NGINX official website. http://nginx.org

[67] https://en.wikipedia.org/wiki/Universal_approximation_theorem

[68] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, Christina Delimitrou, "An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems," in ASPLOS, 2019

[69] Kalyanmoy Deb, Shivam Gupta, "Understanding Knee Points in Bicriteria Problems and Their Implications as Preferred Solution Principles," in Engineering Optimization, 43 (11), 2011

[70] www.mysql.com

[71] Harshad Kasture, Daniel Sanchez, "Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads," in ASPLOS, 2014

[72] Ana Klimovic, Heiner Litz, Christos Kozyrakis, "Selecta: Learning Heterogeneous Cloud Storage Configuration for Data Analytics," in USENIX ATC, 2018

[73] Harshad Kasture, Xu Ji, Nosayba El-Sayed, Xiaosong Ma, Daniel Sanchez, "Improving Datacenter Efficiency Through Partitioning-Aware Scheduling," in PACT, 2017

[74] Yanqi Zhang, Weizhe Hua, Zhuangzhuang Zhou, G. Edward Suh, Christina Delimitrou, "Sinan: ML-Based and QoS-Aware Resource Management for Cloud Microservices," in ASPLOS, 2021

[75] Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, Adam Belay, "Caladan: Mitigating Interference at Microsecond Timescales," in OSDI, 2020

[76] Neeraj Kulkarni, Gonzalo Gonzalez-Pumariega, Amulya Khurana, Christine A. Shoemaker, Christina Delimitrou, David H. Albonesi, "CuttleSys: Data-driven Resource Management for Interactive Services on Reconfigurable Multicores," in Micro, 2020

[77] Redis official website. https://redis.io/

[78] Node.js official website. https://nodejs.org/en/

[79] Lei Liu, "QoS-Aware Resources Scheduling for Microservices: A Multi-Model Collaborative Learning-based Approach," in arXiv:1911.13208v2, 2019

[80] Lei Liu, et al., "Going Vertical in Memory Management: Handling Multiplicity by Multi-policy," in ISCA, 2014
166 21st USENIX Conference on File and Storage Technologies USENIX Association

You might also like