Table 1: Latency-critical (LC) services, including micro-/interactive services [18,19,52,68,46]. The max load - max RPS - is with the 99th percentile tail latency QoS target [10,18,46,52].

LC service | Domain | RPS (Requests Per Second)
Img-dnn [62] | Image recognition | 2000, 3000, 4000, 5000, 6000 (Max)
Masstree [62] | Key-value store | 3000, 3400, 3800, 4200, 4600
Memcached [65] | Key-value store | 256k, 512k, 768k, 1024k, 1280k
MongoDB [64] | Persistent database | 1000, 3000, 5000, 7000, 9000
Moses [62] | RT translation | 2200, 2400, 2600, 2800, 3000
Nginx [66] | Web server | 60k, 120k, 180k, 240k, 300k
Specjbb [62] | Java middleware | 7000, 9000, 11000, 13000, 15000
Sphinx [62] | Speech recognition | 1, 4, 8, 12, 16
Xapian [62] | Online search | 3600, 4400, 5200, 6000, 6800
Login [68] | Login | 300, 600, 900, 1200, 1500
Ads [68,52] | Online renting ads | 10, 100, 1000

Table 2: Our platform specification vs. a server in 2010~14 [80].

Conf. / Servers | Our Platform | Server (2010s)
CPU Model | Intel Xeon E5-2697 v4 | Intel i7-860
Logical Processor Cores | 36 Cores (18 phy. cores) | 8 Cores (4 phy. cores)
Processor Speed | 2.3GHz | 2.8GHz
Main Memory / Channel / BW | 256GB, 2400MHz DDR4 / 4 Channels / 76.8GB/s | 8GB, 1600MHz DDR3 / 2 Channels / 25.6GB/s
Private L1 & L2 Cache Size | 32KB and 256KB | 32KB and 256KB
Shared L3 Cache Size | 45MB - 20 ways | 8MB - 16 ways
Disk | 1TB, 7200 RPM, HD | 500GB, 5400 RPM, HD
GPU | NVIDIA GP104 [GTX 1080], 8GB Memory | N/A
exist in many widely used LC services and challenge existing schedulers (Sec.3.3). Furthermore, we show ML can be an ideal approach that benefits scheduling (Sec.4.4).

(2) Collaborative ML Models for Intelligent Scheduling. OSML is an ML-based scheduler that intelligently schedules multiple interactive resources to meet co-located services' QoS targets. OSML learns the correlation between architectural hints (e.g., IPC, cache misses, memory footprint, etc.), optimal scheduling solutions, and the QoS demands. It employs MLP models to avoid RCliffs intelligently, thus avoiding the sudden QoS slowdown often incurred by the RCliffs in prior schedulers; it predicts the QoS variations and resource margins, and then delivers appropriate resource allocations. It leverages an enhanced DQN to shepherd the allocations and recover from the QoS violation and resource over-provision cases. Moreover, as OSML's models are lightweight and their functions are clearly defined, it is easy to locate the problems and debug them.

(3) An Open-sourced Data Set for Low-overhead ML. We have collected the performance traces for widely deployed LC services (in Table 1), covering 62,720,264 resource allocation cases that contain around 2-billion samples (Sec.4). These data have a rich set of information, e.g., the RCliffs for multiple resources; the interactions between workload features and the mainstream architectures. Our models can be trained and generalized with these data and then used on new platforms with low-overhead transfer learning. People can study the data set and train their models without a long period for data collection.

(4) Real Implementation and Detailed Comparisons. We implement OSML based on latest Linux. OSML is designed as a co-worker of the OS scheduler located between the OS kernel and the user layer. We compare OSML with the most related open-source studies [10,46] and show the advantages. OSML captures the applications' online behaviors, forwards them to the ML models running on CPU or GPU, and schedules resources accordingly. Compared with [10,46], OSML takes 36~55% less time to meet the QoS targets; and supports 10~50% higher loads in 58% cases where Moses, Img-dnn, and Xapian can be co-scheduled in our experiments. Its ML models have low run-time overheads. OSML project and its ecological construction receive support from industry and academia; a version for research will be open-sourced via https://github.com/Sys-Inventor-Lab/AI4System-OSML.

2 Background and Motivation

The cloud environment has a trend towards a new model [3,18,52], in which cloud applications comprise numerous distributed LC services (i.e., micro/interactive services), such as key-value storing, database serving, and business applications serving [18,19]. Table 1 includes some widely used ones, and they form a significant fraction of cloud applications [18]. These services have different features and resource demands.

In terms of the datacenter servers, at present, new servers can have an increased number of cores, larger LLC capacity, larger main memory capacity, higher bandwidth, and the resource scheduling exploration space becomes much larger than ever before as a result. Table 2 compares the two typical servers used at different times. On the one hand, although modern servers can have more cores and memory resources than ever before, they are not fully exploited in today's cloud environments. For instance, in Google's datacenter, the CPU utilization is about 45~53% and memory utilization ranges from 25~77% during 25 days, while Alibaba's cluster exhibits a lower and unstable trend, i.e., 18~40% for CPU and 42~60% for memory in 12 hours [32,49], indicating that a lot of resources are wasted. On the other hand, the larger resource scheduling exploration space, which consists of more diverse resources, prohibits the schedulers from achieving the optimal solution quickly. Additionally, cloud applications can have dozens of concurrent threads [10,46]. When several cloud applications run on a server, they share and contend resources across multiple resource layers – cores, LLC, memory bandwidth/banks. Previous studies show they may incur severe performance degradation and unpredictable QoS violations, and propose the scheduling approaches at architecture [9,23,44], OS [31,45,50,80], and user-level [10,37,38]. Yet, do they perform ideally for scheduling co-located LC services on modern datacenter servers?

3 Resource Scheduling for LC Services

To answer the above question, we study the LC services (Table 1) that are widely deployed in cloud environments.

3.1 Understanding the LC Services - Resource Cliff

We study how sensitive these LC services are to the critical resources, e.g., the number of cores and LLC capacity,
Figure 1: The resource scheduling exploration space for cores and LLC ways. All services here are with 36 threads. These figures show the sensitivity to resource allocation under different policies. Each col./row represents a specific number of LLC ways/cores allocated to an application. Each cell denotes the LC service's response latency under the given number of cores and LLC ways. The Redline highlights the RCliff (which can be obtained by selecting the knee solution [69]). The green cells show allocation policies that bring better performance (low response latency). OAA (Optimal Allocation Area) is also illustrated for each LC service. We test all of the LC services in Table 1 and find the RCliff and OAA always exist, though the RPS varies [79]. We only show several of them to save space.
on a commercial platform (“our platform” in Table 2). For
Moses, as illustrated in Figure 1-a, with the increasing number
of cores, more threads are mapped on them simultaneously.
Meanwhile, for a specific amount of cores, more LLC ways
can benefit performance. Thus, we observe the response la-
tency is low when computing and LLC resources are ample
(i.e., below 10ms in the area within green color). The overall
trends are also observed from other LC services.

However, we observe the Cliff phenomenon for these services. In Figure 1-a, in the cases where 6 cores are allocated to Moses, the response latency is increased significantly from 34ms to 4644ms if merely one LLC way is reduced (i.e., from 10 ways to 9 ways). Similar phenomena also happen in cases where computing resources are reduced. As slight resource re-allocations bring a significant performance slowdown, we denote this phenomenon as Resource Cliff (RCliff). It is defined as the resource allocation cases that could incur the most significant performance slowdown if resources (e.g., core, cache) are deprived via a fine-grain way in the scheduling exploration space. Take Moses as an example, on the RCliff (denoted by the red box in Figure 1-a), there would be a sharp performance slowdown if only one core or one LLC way (or both) is deprived. Figure 1-b and c show RCliffs for Img-dnn and MongoDB, respectively. They exhibit computing-sensitive features and have RCliff only for cores. From another angle, RCliff means that a little bit more resources will bring significant performance improvement. Figure 1-a shows that Moses exhibits RCliff for both core and LLC. Moreover, we test the services in Table 1 across various RPS and find the RCliffs always exist, though the RCliffs vary (8.8% on average) according to different RPS.

The underlying reason for the cache cliff is locality; for the core cliff, the fundamental reason is on queuing theory - the latency will increase drastically when the request arrival rate exceeds the available cores. RCliff alerts the scheduler not to allocate resources close to it because it is "dangerous to fall off the cliff" and incurs a significant performance slowdown, i.e., even a slight resource reduction can incur a severe slowdown. Notably, in Figure 1, we highlight each LC service's Optimal Allocation Area (OAA) in the scheduling exploration space, defined as the ideal number of allocated cores and LLC ways to bring an acceptable QoS. More resources than OAA cannot deliver more significant performance, but fewer resources lead to the danger of falling off the RCliff. OAA is the goal that schedulers should achieve.
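For illustration only, the sketch below (our own, not OSML's released code) shows one way RCliff edges and a candidate OAA could be located in a measured cores x LLC-ways latency grid such as Figure 1; the grid layout, the QoS target, the 10x "cliff" threshold, and the tie-breaking rule are all assumptions.

```python
# A minimal sketch (not from the OSML release): given a measured latency grid
# latency[(cores, ways)] for one LC service and its QoS target, flag RCliff
# allocations (one-step deprivation causes a drastic slowdown) and pick a
# candidate OAA as the smallest allocation that still meets the QoS target.
def find_rcliff_and_oaa(latency, qos_target, cliff_factor=10.0):
    rcliff = set()
    for (cores, ways), lat in latency.items():
        for dc, dw in ((1, 0), (0, 1), (1, 1)):        # deprive one core/way/both
            smaller = (cores - dc, ways - dw)
            if smaller in latency and latency[smaller] > cliff_factor * lat:
                rcliff.add((cores, ways))              # sits on the cliff edge
    feasible = [(c + w, (c, w)) for (c, w), lat in latency.items()
                if lat <= qos_target and (c, w) not in rcliff]
    oaa = min(feasible)[1] if feasible else None       # fewest total resources
    return rcliff, oaa

# Toy grid loosely shaped like Figure 1-a (values are illustrative only).
grid = {(6, 10): 34.0, (6, 9): 4644.0, (7, 10): 30.0, (8, 12): 10.0}
print(find_rcliff_and_oaa(grid, qos_target=40.0))      # ({(6, 10)}, (7, 10))
```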
Figure 2: OAA exists regardless of the num. of concurrent threads.

3.2 Is OAA Sensitive to the Number of Threads?

In practice, an LC service may have many threads for higher performance. Therefore, we come up with the question: is the OAA sensitive to the number of threads, i.e., if someone starts more threads, will the OAA change?

To answer this question, for a specific LC service, we start a different number of threads and map them across a different number of cores (the num. of threads can be larger than the num. of cores). Through the experiments, we observe - (i) More threads do not necessarily bring more benefits. Take Moses as an example, when more threads are started (e.g., 20/28/36) and mapped across a different number of cores, the overall response latency can be higher (as illustrated in Figure 2). The underlying reason lies in more memory contentions at memory hierarchy and more context switch overheads, leading to a higher response latency [20,36]. (ii) The OAA is not sensitive to the number of concurrent threads. For Moses in Figure 2, even if 20/28/36 threads are mapped to 10~25 cores, around 8/9-core cases always perform ideally. Other LC services in Table 1 also show a similar phenomenon, though their OAAs are different for different applications. In practice, if the QoS for a specific LC service is satisfied, LLC ways should be allocated as few as possible, saving LLC space for other applications. Similarly, we also try to allocate fewer cores for saving computing resources. Here, we conclude that the OAA is not sensitive to the number of threads. We should further reveal: how do the existing schedulers perform in front of OAAs and RCliffs?

3.3 Issues the Existing Schedulers May Meet

We find three main shortcomings in the existing schedulers when dealing with OAAs and RCliffs. (1) Entangling with RCliffs. Many schedulers often employ heuristic scheduling algorithms, i.e., they increase/reduce resources until the monitor alerts that the system performance is suffering a significant change (e.g., a severe slowdown). Yet, these approaches could incur unpredictable latency spiking. For example, if the current resource allocation for an LC service is in the base of RCliff (i.e., the yellow color area in Figure 1-a/b/c), the scheduler has to try to achieve OAA. However, as the scheduler doesn't know the "location" of OAA in the exploration space, it has to increase resources step by step in a fine-grain way, thus the entire scheduling process from the base of the RCliff will incur very high response latency. For another example, if the current resource allocation is on the RCliff or close to RCliff, a slight resource reduction for any purpose could incur a sudden and sharp performance drop for LC services. The previous efforts [10,32,50,53] find there would be about hundreds/thousands of times latency jitter, indicating the QoS cannot be guaranteed during these periods. Thus, RCliffs should not be neglected when designing a scheduler. (2) Unable to accurately and simultaneously schedule a combination of multiple interactive resources (e.g., cores, LLC ways) to achieve OAAs in low overheads. Prior studies [10,31,32,45] show that the core computing ability, cache hierarchy, and memory bandwidth are interactive factors for resource scheduling. Solely considering a single dimension in scheduling often leads to sub-optimal QoS. However, the existing schedulers using heuristic or model-based algorithms usually schedule one dimension resource at a time and bring high overheads on scheduling multiple interactive resources. For example, PARTIES [10] takes around 20~30 seconds on average (up to 60 seconds in the worst cases) to find ideal allocations when 3~6 LC services are co-running. The efforts in [16,41,42] also show the heuristics inefficiency due to the high overheads on scheduling various resources with complex configurations. (3) Unable to provide accurate QoS predictions. Therefore, the scheduler can hardly balance the global QoS and resource allocations across all co-located applications, leading to QoS violations or resource over-provision.

An ideal scheduler should avoid the RCliff and quickly achieve the OAA from any positions in the scheduling space. We claim it is time to design a new scheduler, and using ML can be a good approach to handle such complicated cases with low overheads.

Table 3: The input parameters for ML models.

Feature | Description | Models
IPC | Instructions per clock | A/A'/B/B'/C
Cache Misses | LLC misses per second | A/A'/B/B'/C
MBL | Local memory bandwidth | A/A'/B/B'/C
CPU Usage | The sum of each core's utilization | A/A'/B/B'/C
Virt. Memory | Virtual memory in use by an app | A/A'/B/B'
Res. Memory | Resident memory in use by an app | A/A'/B/B'
Allocated Cores | The number of allocated cores | A/A'/B/B'/C
Allocated Cache | The capacity of allocated cache | A/A'/B/B'/C
Core Frequency | Core frequency during run time | A/A'/B/B'/C
QoS Slowdown | Percentage of QoS slowdown | B
Expected Cores | Expected cores after deprivation | B'
Expected Cache | Expected cache after deprivation | B'
Cores used by N. | Cores used by Neighbors | A'/B/B'
Cache used by N. | Cache capacity used by Neighbors | A'/B/B'
MBL used by N. | Memory BW used by Neighbors | A'/B/B'
Resp. Latency | Average latency of a LC service | C

4 Leveraging ML for Scheduling

We design a new resource scheduler - OSML. It differs from the previous studies in the following ways. We divide the resource scheduling for co-located services into several routines and design ML models to handle them, respectively. These models work collaboratively in OSML to perform scheduling. OSML uses data-driven static ML models (Model-A/B) to predict the OAA/RCliff, and balances the QoS and resource allocations among co-located LC services, and uses the reinforcement learning model (Model-C) to shepherd the allocations dynamically. Moreover, we collect extensive real traces for widely deployed LC services.

4.1 Model-A: Aiming OAA

Model-A Description. The neural network used in Model-A is a 3-layer multi-layer perceptron (MLP); each layer is a set of nonlinear functions of a weighted sum of all outputs that are fully connected from the prior one [21,24]. There are 40 neurons in each hidden layer. There is a dropout layer with a loss rate of 30% behind each fully connected layer to prevent overfitting. For each LC service, the inputs of the MLP include 9 items in Table 3 and the outputs include the OAA for multiple interactive resources, OAA bandwidth (bandwidth requirement for OAA), and the RCliff. Model-A has a shadow – A', which has the same MLP structure and 12 input parameters (Table 3), providing solutions when multiple LC services are running together.

Model-A Training. Collecting training data is an offline job. We have collected the performance traces that involve the parameters in Table 3 for the LC services in Table 1 on "our platform" in Table 2. The parameters are normalized into [0,1] according to the function: Normalized_Feature = (Feature − Min)/(Max − Min). Feature is the original value; Max and Min are predefined according to different metrics.

For each LC service with every common RPS demand, we sweep 36 threads to 1 thread across LLC allocation policies ranging from 1 to 20 ways and map the threads on a certain number of cores and collect the performance trace data accordingly. In each case, we label the corresponding OAA, RCliff and OAA bandwidth. For example, Figure 3 shows a data collection case where 8 threads are mapped onto 7 cores with 4 LLC ways. We feed the LC services with diverse RPS (Table 1), covering most of the common cases.
Figure 3: Model-A data collection.
Figure 4: Model-B training.
Figure 5: Model-C in a nutshell.
Moreover, to train Model-A's shadow (A'), we map LC services on the remaining resources in the above process and get the traces for co-location cases. Note that the resources are partitioned among applications. We test comprehensive co-location cases for LC services in Table 1, and find the LC services' RCliffs vary from 2.8% to 38.9%, and OAAs vary from 5.6% to 36.1%, respectively, when multiple LC services are co-running. Finally, we collect 43,299,135 samples (data tuples), covering 1,308,296 allocation cases with different numbers of cores, LLC ways, and bandwidth allocations. A large amount of traces may lead to higher accuracy for ML models. The workload characteristics are converted to traces consisting of hardware parameters used for fitting and training MLP to provide predictions.
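To make the Model-A setup concrete, the sketch below builds a small MLP with 40-neuron hidden layers, 30% dropout, ReLU, MSE loss, and Adam (matching Table 4) in Keras; the paper's models run on TensorFlow [6]. The number of hidden layers implied by "3-layer", the 5-value output head, and the min-max ranges are our own placeholders, not the released OSML code.

```python
# A minimal sketch of a Model-A-style MLP, assuming TensorFlow/Keras.
# 9 inputs (Table 3); an output head for <OAA cores, OAA ways, OAA bandwidth,
# RCliff cores, RCliff ways> is assumed for illustration.
import numpy as np
import tensorflow as tf

def normalize(feature, f_min, f_max):
    """Min-max normalization into [0,1], as described in Model-A Training."""
    return (feature - f_min) / (f_max - f_min)

model_a = tf.keras.Sequential([
    tf.keras.layers.Dense(40, activation="relu", input_shape=(9,)),
    tf.keras.layers.Dropout(0.3),               # 30% dropout behind each FC layer
    tf.keras.layers.Dense(40, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(5),                   # hypothetical 5-value output head
])
model_a.compile(optimizer="adam", loss="mse")   # MSE + Adam, as in Table 4

# Toy usage: one normalized sample (feature values are placeholders).
x = np.array([[normalize(1.2, 0.0, 4.0)] + [0.5] * 8], dtype=np.float32)
print(model_a.predict(x).shape)                 # -> (1, 5)
```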
4.2 Model-B: Balancing QoS and Resources

Model-B Description. Model-B employs an MLP with the same structure in Model-A' plus one more input item, i.e., QoS slowdown (Table 3). Model-B outputs the resources that a service can be deprived of under allowable QoS slowdown. As the computing units and memory resource can be fungible [10], Model-B's outputs include three policies, i.e., <cores, LLC ways>, <cores dominated, LLC ways> and <cores, LLC ways dominated>, respectively. The tuple items are the number of cores, LLC ways deprived and reallocated to others with the allowable QoS slowdown. The term "cores dominated" indicates the policy using more cores to trade the LLC ways, and vice versa. The allowable QoS slowdown is determined according to the user requirement or the LC services' priority and controlled by the OSML's central logic. We denote Model-B's outputs as B-Points.

Model-B trades QoS for resources. For example, when an LC service (E) comes to a server that already has 4 co-located services, OSML enables Model-A' to obtain <n+, m+>, which denotes at least n more cores and m more LLC ways should be provided to meet E's QoS. Then, OSML enables Model-B and uses the allowable QoS slowdown as an input to infer B-Points for obtaining resources from other co-located services. B-Points include the "can be deprived" resources from E's neighbors with the allowable QoS slowdown. Finally, OSML finds the best solution to match <n+, m+> with B-Points, which has a minimal impact on the co-located applications' current allocation state. Detailed logic is in Algo._1. Besides, we design Model-B' (a shadow of Model-B) to predict how much QoS slowdown will suffer if a certain amount of resources is deprived of a specific service. Model-B' has an MLP with the same structure in Model-A' plus the items that indicate the remaining cache ways and cores after deprivation.

Model-B Training. For training Model-B and B', we reduce the allocated resources for a specific LC service from its OAA by fine-grain approaches, as illustrated in Figure 4. The reduction has three angles, i.e., horizontal, oblique, and vertical, i.e., B-Points include <cores dominated, LLC ways>, <cores, LLC ways>, <cores, LLC ways dominated>, respectively. For each fine-grain resource reduction step, we collect the corresponding QoS slowdowns and then label them as less equal to (≤) 5%, 10%, 15%, and so on, respectively. Examples are illustrated in Figure 4, which shows the B-Points with the corresponding QoS slowdown. We collect the training data set for every LC service in Table 1. The data set contains 65,998,227 data tuples, covering 549,987 cases.

Model-B Function. We design a new loss function:

L = (1/n) Σ_{t=1}^{n} [y_t / (y_t + c)] · (s_t − y_t)²,

in which s_t is the prediction output value of Model-B, y_t is the labeled value in practice, and 'c' is a constant that is infinitely close to zero. We multiply the difference between s_t and y_t by y_t/(y_t + c) to avoid adjusting the weights during backpropagation in the cases where y_t = 0 and y_t/(y_t + c) = 0 caused by some non-existent cases (we label the non-existent cases as 0, i.e., y_t = 0, indicating we don't find a resource-QoS trading policy in the data collection process).
is determined according to the user requirement or the LC
services’ priority and controlled by the OSML’s central logic. 4.3 Model-C: Handling the Changes On the Fly
We denote Model-B’s outputs as B-Points. Model-C Description. Model-C corrects the resource
Model-B trades QoS for resources. For example, when an under-/over-provision and conducts online training. Fig-
LC service (E) comes to a server that already has 4 co-located ure 5 shows the Model-C in a nutshell. Model-C’s core
services, OSML enables Model-A’ to obtain <n+, m+>, which component is an enhanced Deep Q-Network (DQN) [43],
denotes at least n more cores and m more LLC ways should consisting of two neural networks, i.e., Policy Network
be provided to meet E’s QoS. Then, OSML enables Model-B and Target Network. The Policy and Target Network em-
and uses the allowable QoS slowdown as an input to infer B- ploy the 3-layer MLP, and each hidden layer has 30 neu-
Points for obtaining resources from other co-located services. rons. Policy Network’s inputs consist of the parameters
B-Points include the “can be deprived” resources from E’s in Table 3, and the outputs are resource scheduling ac-
neighbors with the allowable QoS slowdown. Finally, OSML tions (e.g., reducing/increasing a specific number of cores
finds the best solution to match <n+, m+> with B-Points, or LLC ways) and the corresponding expectations (defined
which has a minimal impact on the co-located applications’ as Q(action)). These actions numbered 0~48 are defined
current allocation state. Detailed logic is in Algo._1. Besides, as Action_Function:{< m, n > |m ∈ [−3, 3], n ∈ [−3, 3]}, in
we design Model-B’ (a shadow of Model-B) to predict how which a positive m denotes allocating m more cores (i.e., add
much QoS slowdown will suffer if a certain amount of re- operation) for an application and a negative m means depriv-
sources is deprived of a specific service. Model-B’ has an ing it of m cores (i.e., sub operation); n indicates the actions
MLP with the same structure in Model-A’ plus the items that on LLC ways. Figure 5 illustrates Model-C’s logic. The
indicate the remaining cache ways and cores after deprivation. scheduling action with the maximum expectation value (i.e.,
Figure 5 illustrates Model-C's logic. The scheduling action with the maximum expectation value (i.e., the action towards the best solution) will be selected in ① and executed in ②. In ③, Model-C will get the Reward value according to the Reward Function. Then, the tuple <Status, Action, Reward, Status'> will be saved in the Experience Pool in ④, which will be used during online training. The terms Status and Status' denote system's status described by the parameters in Table 3 before and after the Action is taken. Model-C can quickly have the ideal solutions in practice (about 2 or 3 actions). Please note that in ①, Model-C might randomly select an Action instead of the best Action with a 5% chance. By doing so, OSML avoids falling into a local optimum [43]. These random actions have a 44% chance of causing QoS violations when Model-C reduces resources. But OSML can handle the QoS violations by withdrawing the action (line 9 in Algo._3).
Model-C's Reward Function. The reward function of Model-C is defined as follows:

If Latency_{t−1} > Latency_t: R_t = log(1 + Latency_{t−1} − Latency_t) − (∆CoreNum + ∆CacheWay)
If Latency_{t−1} < Latency_t: R_t = −log(1 + Latency_t − Latency_{t−1}) − (∆CoreNum + ∆CacheWay)
If Latency_{t−1} = Latency_t: R_t = −(∆CoreNum + ∆CacheWay),

where Latency_{t−1} and Latency_t denote the latency in previous and current status, respectively; ∆CoreNum and ∆CacheWay represent the changes in the number of cores and LLC ways, respectively. This function gives higher reward and expectation to the Action that brings less resource usage and lower latency. Thus, Model-C can allocate appropriate resources. Algo._2 and 3 show the logic on using Model-C in detail.
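The piecewise reward above translates directly into code; a minimal sketch of our own follows, with the natural logarithm assumed since the paper does not state the log base.

```python
# Sketch of the Model-C reward: reward latency reductions (log-scaled) and
# penalize extra resource usage; the log base is assumed to be natural log.
import math

def reward(prev_latency, cur_latency, delta_cores, delta_ways):
    resource_cost = delta_cores + delta_ways           # ∆CoreNum + ∆CacheWay
    if prev_latency > cur_latency:                      # latency improved
        return math.log(1 + prev_latency - cur_latency) - resource_cost
    if prev_latency < cur_latency:                      # latency got worse
        return -math.log(1 + cur_latency - prev_latency) - resource_cost
    return -resource_cost                                # latency unchanged

# Toy usage: latency drops from 40ms to 30ms after adding 1 core and 0 ways.
print(round(reward(40.0, 30.0, 1, 0), 3))                # log(11) - 1 ≈ 1.398
```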
Offline Training. A training data tuple includes Status, Status', Action and Reward, which denote the current status of a LC service, the status after these actions are conducted (e.g., reduce several cores or allocate more LLC ways) and the reward calculated using the above functions, respectively. To create the training data set for Model-C, we resort to the data set used in Model-A training. The process is as follows. Two tuples in Model-A training data set are selected to denote Status and Status', and we further get the differences of the resource allocations between the two status (i.e., the actions that cause the status shifting). Then, we use the reward function to have the reward accordingly. These 4 values form a specific tuple in Model-C training data set. In practice, as there are a large number of data tuples in Model-A training data set, it is impossible to try every pair of tuples in the data set; we only select two tuples from resource allocation policies that have less than or equal to 3 cores, or 3 LLC ways differences. Moreover, we also collect the training data in the cases where cache way sharing happens and preserve them in the experience pool. Therefore, Model-C can work in resource-sharing cases. To sum up, we have 1,521,549,190 tuples in Model-C training data set.
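The pairing rule described above (only allocations within 3 cores or 3 LLC ways of each other become a transition) could be implemented as sketched below; the record layout and helper names are hypothetical, not the paper's data format.

```python
# Sketch (assumed record layout) of building Model-C offline tuples from the
# Model-A trace: pair allocation records whose core/way difference is <= 3,
# derive the implied Action, and score it with the reward function.
import math
from itertools import combinations

def reward_fn(prev_lat, cur_lat, d_cores, d_ways):
    cost = d_cores + d_ways
    if prev_lat == cur_lat:
        return -cost
    sign = 1.0 if prev_lat > cur_lat else -1.0
    return sign * math.log(1 + abs(prev_lat - cur_lat)) - cost

def build_offline_tuples(records):
    """records: list of dicts with 'cores', 'ways', 'latency', 'status' keys."""
    tuples = []
    for a, b in combinations(records, 2):
        d_cores = b["cores"] - a["cores"]
        d_ways = b["ways"] - a["ways"]
        if abs(d_cores) > 3 or abs(d_ways) > 3:         # pairing rule from Sec. 4.3
            continue
        r = reward_fn(a["latency"], b["latency"], d_cores, d_ways)
        tuples.append((a["status"], (d_cores, d_ways), r, b["status"]))
    return tuples

# Toy usage with two placeholder records.
recs = [{"cores": 6, "ways": 10, "latency": 34.0, "status": "s0"},
        {"cores": 7, "ways": 10, "latency": 30.0, "status": "s1"}]
print(build_offline_tuples(recs))
```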
Online Training. Model-C collects online traces. The training flow is in the right part of Figure 5. Model-C randomly selects some data tuples (200 by default) from the Experience Pool. For each tuple, Model-C uses the Policy Network to get the Action's expectation value (i.e., Q(Action) [43]) with the Status. In Model-C, the target of the Action's expectation value is the Reward observed plus the weighted best expectation value of the next status (i.e., Status'). As illustrated in Figure 5, Model-C uses the Target Network to have the expectation values of Status' for the actions in Action_Function and then finds the best one, i.e., Max(Q(Action')). We design a new Loss Function based on MSE: (Reward + γ·Max(Q(Action')) − Q(Action))². It helps the Policy Network predict closer to the target. The Policy Network is updated during online training. The Target Network's weights are synchronized periodically with the Policy Network's weights. Doing so enables the Target Network to provide stable predictions for the best expectation value of Status' within a predefined number of time steps, thus improving the stability of the training and prediction.
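One way the sampled-minibatch update described above could look in TensorFlow is sketched below; the network shapes, the γ value, and the synchronization period are placeholders rather than OSML's actual settings.

```python
# Sketch of one DQN update step (assumed implementation, TensorFlow/Keras):
# loss = (Reward + gamma * max_a' Q_target(Status', a') - Q_policy(Status, Action))^2
import numpy as np
import tensorflow as tf

def make_net(n_features=8, n_actions=49):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(30, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(30, activation="relu"),   # 30 neurons per hidden layer
        tf.keras.layers.Dense(n_actions),               # Q(action) for actions 0..48
    ])

policy_net, target_net = make_net(), make_net()
optimizer = tf.keras.optimizers.RMSprop()               # RMSProp, as in Table 4

def train_step(status, action, rew, next_status, gamma=0.9):   # gamma assumed
    target = rew + gamma * tf.reduce_max(target_net(next_status), axis=1)
    with tf.GradientTape() as tape:
        q = tf.gather(policy_net(status), action, batch_dims=1)
        loss = tf.reduce_mean(tf.square(target - q))
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    return loss

# Toy usage with random placeholder states; periodically the Target Network
# weights are copied: target_net.set_weights(policy_net.get_weights())
s = np.random.rand(4, 8).astype(np.float32)
s2 = np.random.rand(4, 8).astype(np.float32)
print(float(train_step(s, np.array([0, 24, 48, 3]),
                       np.array([1.0, 0.5, -0.2, 0.0], dtype=np.float32), s2)))
```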
Table 4: The Summary of ML models in OSML.

ML | Model | Features | Model Size | Loss Function | Gradient Descent | Activation Function
A | MLP | 9 | 144 KB | Mean Square Error (MSE) | Adam Optimizer | ReLU
A' | MLP | 12 | 155 KB | MSE | Adam Optimizer | ReLU
B | MLP | 13 | 110 KB | Modified MSE | Adam Optimizer | ReLU
B' | MLP | 14 | 106 KB | MSE | Adam Optimizer | ReLU
C | DQN | 8 | 141 KB | Modified MSE | RMSProp | ReLU

4.4 Discussions on the design of ML Models

(1) Why using MLPs. Table 4 characterizes the ML models used in OSML. We employ three-layered MLPs in Model-A and B, because they can fit continuous functions with an arbitrary precision given a sufficient number of neurons in each layer [67], and we can use extensive training data to improve the accuracy of MLPs for predicting OAAs and RCliffs. Moreover, after offline training, using MLPs brings negligible run-time overheads to OSML. (2) Why do we need the Three models? We divide the OSML's scheduling logic into three parts, which the three models cover, respectively. Models work in different scheduling phases, and no single model can handle all cases. For example, model-A predicts the RCliffs and OAAs; Model-B predicts the QoS variations and resource margins in co-location cases. DQN in Model-C learns online to shepherd the scheduling results from Model-A/B. They are necessary and work cooperatively to cover the main scheduling cases. Moreover, they are easier to generalize than other approaches, e.g., a table lookup approach (Sec.6.4). Why Not Only use the online learning Model-C? Model-C uses DQN that depends on the start points. It starts with Model-A/B's outputs to avoid exploring the whole (large) scheduling space. Without the approximate OAA provided by Model-A for many unseen cases, only using Model-C will incur more scheduling actions (overheads). (3) Insights on Generalization for Unseen apps and New servers. (i) We use "hold-out cross validation", i.e., the training data (70% of the whole data set) excludes the testing data (30%) for each LC service. (ii) We train models with extensive representative traces from many
[5]. The ML models are based on TensorFlow [6] with the
version 2.0.4, and can be run on either CPU or GPU.
6 Evaluations

6.1 Methodology

We evaluate the per-node OSML performance on our platform in Table 2. Details on LC services are in Table 1. The metrics involve the QoS (similar to [10], the QoS target of each application is the 99th percentile latency of the knee of the latency-RPS curve. Latency higher than the QoS target is a violation.); Effective Machine Utilization (EMU) [10] (the max aggregated load of all co-located LC services) – higher is better. We first evaluate the scenarios where LC services run at constant loads, and the loads are from 10% - 100%. Then, we explore workload churn. We inject applications with loads from 20% - 100% of their respective max load. Furthermore, to evaluate the generalization of OSML, we employ some new/unseen applications that are not in Table 1 and the new platform in our experiments. If an allocation in which all applications meet their QoS cannot be found after 3 mins, we signal that the scheduler cannot deliver QoS for that configuration.
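As a concrete reading of these metric definitions, the helper below checks QoS violations against per-service targets and reports the aggregated load as an EMU-style number; the data layout and field names are hypothetical and not from the OSML artifact.

```python
# Sketch (assumed data layout): QoS target = p99 latency at the knee of the
# latency-RPS curve; a configuration is acceptable when every co-located
# service is under its target, and its EMU-style score is the aggregated load.
def meets_qos(services):
    """services: list of dicts with 'p99_ms', 'qos_target_ms', 'load_fraction'."""
    return all(s["p99_ms"] <= s["qos_target_ms"] for s in services)

def aggregated_load(services):
    return sum(s["load_fraction"] for s in services)

colocated = [
    {"p99_ms": 9.5, "qos_target_ms": 10.0, "load_fraction": 0.4},   # e.g., Moses
    {"p99_ms": 4.1, "qos_target_ms": 5.0, "load_fraction": 0.6},    # e.g., Img-dnn
    {"p99_ms": 7.9, "qos_target_ms": 8.0, "load_fraction": 0.5},    # e.g., Xapian
]
if meets_qos(colocated):
    print("EMU-style aggregated load:", aggregated_load(colocated))  # 1.5
```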
6.2 OSML Effectiveness

We compare OSML with the most related approaches in [10,46] based on the latest open-source version.

PARTIES [10]. It makes incremental adjustments in one-dimension resource at a time until QoS is satisfied – "trial and error" – for all of the applications. The core mechanism in [10] is like an FSM [60].

CLITE [46]. It conducts various allocation policies and samples each of them; it then feeds the sampling results – the QoS and run-time parameters for resources – to a Bayesian optimizer to predict the next scheduling policy.

Unmanaged Allocation (baseline). This policy doesn't control the allocation policies on cores, LLC, and other shared resources for co-located LC services. This policy relies on the original OS schedulers.

ORACLE. We obtain these results by exhaustive offline sampling and find the best allocation policy. It indicates the ceiling that the schedulers try to achieve.

We show the effectiveness of OSML as follows.

(1) OSML exhibits a shorter scheduling convergence time. Using ML models, OSML achieves OAA quickly and efficiently handles cases with diverse loads. Figure 8-a shows the distributions of the scheduling results of 104 loads for OSML,
PARTIES and CLITE, respectively. Every dot represents a
specific workload that contains 3 co-located LC services with
different RPS. We launch the applications in turn and use a
scheduler to handle QoS violations until all applications meet
their QoS targets. The x-axis shows the convergence time;
the y-axis denotes the achieved EMU. Generally, OSML can
achieve the same EMU with a shorter convergence time for
a specific load. Figure 8-b shows the violin plots of conver-
gence time for these loads. On average, OSML takes 20.9 seconds to converge, while PARTIES and CLITE take 32.7 and 46.3 seconds, respectively. OSML converges 1.56X and 2.22X faster than PARTIES and CLITE. OSML performs stably – the convergence time ranges from 5.3s (best case) to 80.0s (worst case). By contrast, the convergence time in PARTIES ranges from 5.5s to 111.1s, and CLITE is from 14.0s to 140.6s. OSML converges faster mainly because the start point in the scheduling space provided by Model-A is close to OAA. PARTIES and CLITE take a longer convergence time, indicating that they require high scheduling overheads in cloud environments. In Cloud, jobs come and go frequently; thus, scheduling occurs frequently, and longer scheduling convergence time often leads to unstable/low QoS.

Figure 8: (a) The performance distributions for 104 loads that OSML, PARTIES and CLITE can all converge. (b) Violin plots of convergence time for loads in (a).

We further analyze how these schedulers work in detail. Figure 9-a/b/c show the actions used in OSML, PARTIES, and CLITE's scheduling process for case A in Figure 8. This case includes Moses, Img-dnn, and Xapian with 40%, 60%, and 50% of their maximum loads. For this load, PARTIES, CLITE and OSML take 14.5 seconds, 72.6 seconds and 8.2 seconds to converge, respectively. Figure 9 highlights scheduling actions using solid red lines to represent increasing resources and blue dotted lines to denote reducing resources. Figure 9-a shows PARTIES takes 7 actions for scheduling cores and 1 action for cache ways. It schedules in a fine-grained way by increasing/decreasing one resource at a time. CLITE relies on the sampling points in the scheduling exploration space. Figure 9-b shows CLITE repeats sampling until the "expected improvement" in CLITE drops below a certain threshold. CLITE only performs five scheduling actions according to its latest open-source version; but it takes the longest convergence time (72.6 seconds). The underlying reason is that CLITE's sampling/scheduling doesn't have clear targets. In practice, the improper resource partitions/allocations during sampling lead to the accumulation of requests, and the requests cannot be handled due to resource under-provision. Therefore, it brings a significant increase in response latency. Moreover, due to the early termination of CLITE's scheduling process, CLITE cannot schedule resources to handle QoS violations in a timely manner, leading to a long convergence time. Figure 9-c shows OSML achieves OAA for each LC service with 5 actions. Compared with prior schedulers, OSML has clear aims and schedules multiple resources simultaneously to achieve them. It has the shortest convergence time – 8.2 seconds.

Moreover, as the scheduling is fast, OSML often supports more loads. Figure 10 shows the OSML's results on scheduling the three LC services – Moses, Img-dnn, and Xapian. For a specific scheduling phase, by using ML to achieve OAA, OSML supports 10~50% higher loads than PARTIES and CLITE in highlighted cells in Figure 10-d, accounting for 58% of the cases that can be scheduled. All schedulers perform better than the Unmanaged (Figure 10-a), as they reduce the resource contentions.

(2) Compared with PARTIES and CLITE, OSML consumes fewer resources to support the same loads to meet the QoS targets. On average, for the 104 loads in Figure 8-a, OSML uses 34 cores and 16 LLC ways; by contrast, PARTIES and CLITE exhaust all the platform's 36 cores and 20 LLC ways. As illustrated in Figure 9-a, PARTIES partitions the LLC ways and cores equally for each LC service at the beginning; once it meets the QoS target (using 8 actions), it stops. Thus, PARTIES drops the opportunities to explore alternative better solutions (i.e., using fewer cores or cache ways to meet identical QoS targets). PARTIES allocates all cores and LLC ways finally. CLITE also uses all cores and cache ways shown in Figure 9-b. By contrast, OSML schedules according to applications' resource requirements instead of using all resources. Figure 9-c shows that using Model-A, OSML achieves each LC service's OAA (the optimal solution) after 5 actions. OSML detects and reclaims over-provided resources using Model-C. For example, the last action in Figure 9-c reclaims 3 cores and 2 LLC ways from Xapian. Finally, OSML saves 3 cores and 9 LLC ways. As OSML is designed for LC services that are executed for a long period, saving resources means saving budgets for cloud providers.

(3) Using ML models, OSML provides solutions for sharing some cores and LLC ways among LC services, therefore supporting higher loads. PARTIES and CLITE don't show resource sharing in the original design. Using Algo._4, OSML lists some potential resource sharing solutions, and then enables Model-B' to predict the QoS slowdown for each case. The sharing solution with a relatively lower QoS slowdown is selected. More details refer to Figure 7. Figure 9-d shows how OSML shares resources for the highlighted case B in Figure 10-d. OSML enables Model-C to add resources for Moses in Algo._2 and uses Algo._4 to share 2 CPU cores with Xapian. Finally, the QoS is met. By enabling resource sharing, OSML can support higher loads than PARTIES and CLITE, and can even be close to ORACLE in Figure 10-e. If not
Figure 9: Resource usage comparisons for OSML, PARTIES, and CLITE.
Table 5: ML Models' performance on average. The units of errors are the number of cores/LLC ways. For Model B', the unit is the percentage of QoS slowdown.

ML | Outputs | Error (Core / LLC) | Errors for unseen LC services (Core / LLC) | Err on new platforms (TL) (Core / LLC) | MSE | Overheads
A | RCliff | 0.589 / 0.288 | 1.266 / 0.198 | 2.142 / 0.542 | 0.0025 | 0.20s
A | OAA | 0.580 / 0.360 | 1.276 / 0.191 | 2.004 / 0.865 | 0.0025 | 0.20s
A' | RCliff | 1.072 / 0.815 | 3.403 / 1.835 | 0.772 / 0.411 | 0.0135 | 0.20s
A' | OAA | 1.072 / 0.814 | 3.404 / 1.835 | 0.790 / 0.413 | 0.0135 | 0.20s
B | B-Points | 0.612 / 0.053 | 4.012 / 0.167 | 2.320 / 0.969 | 0.0012 | 0.18s
B | B-Points, Core dominated | 0.314 / 0.048 | 3.434 / 0.937 | 2.250 / 0.815 | 0.0012 | 0.18s
B | B-Points, Cache dominated | 0.093 / 0.462 | 0.789 / 0.783 | 1.868 / 1.519 | 0.0012 | 0.18s
B' | QoS reduction | 7.87% | 8.33% | 11.28% | 0.0035 | 0.19s
C | Scheduling actions | 0.908 / 0.782 | 0.844 / 0.841 | 1.390 / 1.801 | 0.7051 | 0.20s
several hours will be sufficient (covering the more allocation cases, the better). The model updating can be accomplished in hours. The time consumption will be shorter if using multiple machines in parallel. Table 5 shows the average values of ML models' quality. The new models' prediction errors are slightly higher than the previous models on the original platforms, but OSML still handles them well. By contrast, if we use a table lookup approach instead, we have to use additional memory to store the data tuples, e.g., 60GB will be wasted for the current data set to replace Model-A. More importantly, this approach is difficult to generalize for new/unseen applications or platforms, as their traces and the corresponding OAAs don't exist in the current data set.

Overheads. OSML takes 0.2s of a single core for each time during the interval setting (e.g., 1 second by default in Sec.5.2): 0.01s for ML model and 0.19s for online monitoring. As our models are light-weighted (OSML uses only one core), running them on CPU and GPU has a similar overhead. If models are on GPU, it takes an extra 0.03s for receiving results from GPU. OSML doesn't monopolize GPU. Generally, the overhead doesn't hinder the overall performance. In the cloud, applications' behaviors may change every second due to the diversity of user demands. Thus, OSML plays a critical role during the entire run time. For building models, using our current data set on the platform in Table 2, it takes 3.3 mins, 5 mins, and 8.3 hours to train Model-A, B, and C for one epoch (all training samples are used to train once), respectively. We train models for ten epochs. Training can be accelerated using the popular Multi-GPU training technology. Doing so is practical in datacenters, and training time will not impede practice.

7 Related Work and Our Novelty

ML for System Optimizations. The work in [55] employs DNN to optimize the buffer size for the database systems. The efforts in [22,56,34] leverage ML to optimize computer architecture or resource management in the network to meet various workloads. The studies in [9,39] use ML to manage on-chip hardware resources. CALOREE [41] can learn key control parameters to meet latency requirements with minimal energy in complex environments. The studies in [26,31,58,59] optimize the OS components with learned rules or propose insights on designing new learned OS. In OSML, we design an intelligent multi-model collaborative learning approach, providing better co-location solutions to meet QoS targets for LC services faster than the latest work stably.

ML for Scheduling. Decima [35] designs cluster-level data processing job scheduling using RL. Resource Central [12] builds a system that contains the historical resource utilization information of the workloads used in Azure and employs ML to predict resource management for VMs. The study in [40] uses RL to predict which subsets of operations in a TensorFlow graph should run on the available devices. Paragon [14] classifies and learns workload interference. Quasar [15] determines jobs' resource preferences on clusters. Sinan [74] uses ML models to determine the performance dependencies between microservices in clusters. They are cluster schedulers [14,15,74]. By contrast, OSML deeply studies scheduling in co-location cases. Selecta [72] predicts near-optimal configurations of computing and storage resources for analytics workloads based on profiling data. CLITE [46] uses Bayesian optimization for scheduling on-node resources. The work in [48] applies ML to predict the end-to-end tail latency of LC service workflows. Twig [63] uses RL to characterize tail latency for energy-efficient task management. CuttleSys [76] leverages data mining to identify suitable core and cache configurations for co-scheduled applications. For complicated co-location cases, OSML can avoid RCliffs and quickly achieve the ideal allocations (OAA) for multiple interactive resources simultaneously for LC services. Moreover, OSML performs well in generalization.

Resource Partitioning. PARTIES [10] partitions cache, main memory, I/O, network, disk bandwidth, etc., to provide QoS for co-located services. The studies in [17,28,57,71] design some new LLC partitioning/sharing policies. The efforts in [23,27,44,45,73] show that considering cooperative partitioning on LLC, memory banks and channels outperforms one-level memory partitioning. However, the cooperative partitioning policies need to be carefully designed [29,30,37], and [16,32] show the heuristic resource scheduling approach could be ineffective in many QoS-constrained cases. [7,11] study the "performance cliff" on cache for SPECCPU 2006 applications and Memcached. Caladan [75] doesn't involve cache optimizations, and core/cache cliffs cannot be avoided, causing QoS fluctuations in some cases. By contrast, OSML is the first work that profoundly explores cache cliff and core cliff simultaneously (i.e., RCliff) for many widely used LC services in co-location cases. OSML is a representative work using ML to guide the multiple resources partitioning in co-location cases; OSML is cost-effective in new cloud environments.

8 Conclusion

We present OSML, an ML-based resource scheduler for co-located LC services. We learn that straightforwardly using a simple ML model might not handle all of the processes during the scheduling. Therefore, using multiple ML models cooperatively in a pipe-lined way can be an ideal approach. More importantly, we advocate the new solution, i.e., leveraging ML to enhance resource scheduling, could have an immense potential for OS design. In a world where co-location and sharing are a fundamental reality, our solution should grow in importance and benefit our community.

Acknowledgement

We thank the reviewers and our shepherd Young-ri Choi for the invaluable comments. This work is supported by the Key-Area R&D Program of Guangdong (No.2021B0101310002), NSFC (No.62072432). L. Liu thanks Hecheng's arriving. X. Dou and Y. Chen are students in Sys-Inventor Lab led by L. Liu.
References

[1] "How 1s could cost amazon $1.6 billion in sales." https://www.fastcompany.com/1825005/how-one-second-could-cost-amazon-16-billion-sales
[2] "Microservices workshop: Why, what, and how to get there," http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference
[3] "State of the Cloud Report," http://www.righscale.com/lp/state-of-the-cloud. Accessed: 2019-01-28
[4] "Improving real-time performance by utilizing cache allocation technology," https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/cache-allocation-technology-white-paper.pdf, Intel Corporation, April, 2015
[5] "Intel 64 and IA-32 Architectures Software Developer's Manual," https://software.intel.com/en-us/articles/intel-sdm, Intel Corporation, October, 2016
[6] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, "TensorFlow: A System for Large-Scale Machine Learning," in OSDI, 2016
[7] Nathan Beckmann, Daniel Sanchez, "Talus: A Simple Way to Remove Cliffs in Cache Performance," in HPCA, 2015
[8] Daniel S. Berger, Benjamin Berg, Timothy Zhu, Siddhartha Sen, Mor Harchol-Balter, "RobinHood: Tail Latency Aware Caching – Dynamic Reallocation from Cache-Rich to Cache-Poor," in OSDI, 2018
[9] Ramazan Bitirgen, Engin Ipek, Jose F. Martinez, "Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach," in Micro, 2008
[10] Shuang Chen, Christina Delimitrou, José F. Martínez, "PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services," in ASPLOS, 2019
[11] Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, Sachin Katti, "Cliffhanger: Scaling Performance Cliffs in Web Memory Caches," in NSDI, 2016
[12] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, Ricardo Bianchini, "Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms," in SOSP, 2017
[13] Jeff Dean, David A. Patterson, Cliff Young, "A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution," in IEEE Micro, 2018
[14] Christina Delimitrou, Christos Kozyrakis, "QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon," in ACM TOCS, 2013
[15] Christina Delimitrou, Christos Kozyrakis, "Quasar: Resource-Efficient and QoS-Aware Cluster Management," in ASPLOS, 2014
[16] Yi Ding, Nikita Mishra, Henry Hoffmann, "Generative and Multi-phase Learning for Computer Systems Optimization," in ISCA, 2019
[17] Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, Harshad Kasture, Xiaosong Ma, Daniel Sanchez, "KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores," in HPCA, 2018
[18] Yu Gan and Christina Delimitrou, "The Architectural Implications of Cloud Microservices," in IEEE Computer Architecture Letters, 2018
[19] Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, Christina Delimitrou, "Leveraging Deep Learning to Improve Performance Predictability in Cloud Microservices with Seer," in ACM SIGOPS Operating Systems Review, 2019
[20] Mark D. Hill, Michael R. Marty, "Amdahl's Law in the Multicore Era," in IEEE Computers, 2008
[21] Kurt Hornik, "Approximation Capabilities of Multilayer Feedforward Networks," in Neural Networks, 1991
[22] Engin Ipek, Onur Mutlu, José F. Martínez, Rich Caruana, "Self-Optimizing Memory Controllers: A Reinforcement Learning Approach," in ISCA, 2008
[23] Min Kyu Jeong, Doe Hyun Yoon, Dam Sunwoo, Michael Sullivan, Ikhwan Lee, Mattan Erez, "Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems," in HPCA, 2012
[24] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon, "In-Datacenter Performance Analysis of a Tensor Processing Unit," in ISCA, 2017
[25] Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, 2012
[26] Yanjing Li, Onur Mutlu, Subhasish Mitra, "Operating System Scheduling for Efficient Online Self-Test in Robust Systems," in ICCAD, 2009
[27] Lei Liu, et al., "A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore Systems," in PACT, 2012
[28] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, P. Sadayappan, "Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems," in HPCA, 2008
[29] Fang Liu, Yan Solihin, "Studying the Impact of Hardware Prefetching and Bandwidth Partitioning in Chip-Multiprocessors," in Sigmetrics, 2011
[30] Seung-Hwan Lim, Jae-Seok Huh, Yougjae Kim, Galen M. Shipman, Chita R. Das, "D-Factor: A Quantitative Model of Application Slowdown in Multi-Resource Shared Systems," in Sigmetrics, 2012
[31] Lei Liu, et al., "Rethinking Memory Management in Modern Operating System: Horizontal, Vertical or Random?" in IEEE Trans. on Computers, 2016
[32] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, Christos Kozyrakis, "Heracles: Improving Resource Efficiency at Scale," in ISCA, 2015
[33] Lei Liu, et al., "Hierarchical Hybrid Memory Management in OS for Tiered Memory Systems," in IEEE Trans. on Parallel and Distributed Systems, 2019
[34] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, Srikanth Kandula, "Resource Management with Deep Reinforcement Learning," in HotNets-XV, 2016
[35] Hongzi Mao, Malte Schwarzkopf, Shaileshh B. Venkatakrishnan, Zili Meng, Mohammad Alizadeh, "Learning Scheduling Algorithms for Data Processing Clusters," in SIGCOMM, 2019
[36] Yashwant Marathe, Nagendra Gulur, Jee Ho Ryoo, Shuang Song, and Lizy K. John, "CSALT: Context Switch Aware Large TLB," in Micro, 2017
[37] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, Mary Lou Soffa, "Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations," in Micro, 2011
[38] Jason Mars, Lingjia Tang, Mary Lou Soffa, "Directly Characterizing Cross Core Interference Through Contention Synthesis," in HiPEAC, 2011
[39] Jose F. Martinez, Engin Ipek, "Dynamic multicore resource management: A machine learning approach," in IEEE Micro 29(5): 8-17, 2009
[40] Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, Jeff Dean, "Learning Device Placement in Tensorflow Computations," in arXiv 1706.04972
[41] Nikita Mishra, Connor Imes, John D. Lafferty, Henry Hoffmann, "CALOREE: Learning Control for Predictable Latency and Low Energy," in ASPLOS, 2018
[42] Nikita Mishra, Harper Zhang, John Lafferty, Henry Hoffmann, "A Probabilistic Graphical Model-based Approach for Minimizing Energy Under Performance Constraints," in ASPLOS, 2015
[43] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis, "Human-level control through deep reinforcement learning," in Nature 518 (7540): 529-533, 2015
[44] Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning," in Micro, 2011
[45] Jinsu Park, Seongbeom Park, Woongki Baek, "CoPart: Coordinated Partitioning of Last-Level Cache and Memory Bandwidth for Fairness-Aware Workload Consolidation on Commodity Servers," in EuroSys, 2019
[46] Tirthak Patel, Devesh Tiwari, "CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers," in HPCA, 2020
[47] Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ousterhout, "Arachne: Core-Aware Thread Management," in OSDI, 2018
[48] Joy Rahman, Palden Lama, "Predicting the End-to-End Tail Latency of Containerized Microservices in the Cloud," in IC2E, 2019
[49] Yizhou Shan, Yutong Huang, Yilun Chen, Yiying Zhang, "LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation," in OSDI, 2018
[50] Prateek Sharma, Ahmed Ali-Eldin, Prashant Shenoy, "Resource Deflation: A New Approach For Transient Resource Reclamation," in EuroSys, 2019
[51] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis, "Mastering the game of Go with deep neural networks and tree search," in Nature, 529 (7587), 2016
[52] Akshitha Sriraman, Abhishek Dhanotia, Thomas F. Wenisch, "SoftSKU: Optimizing Server Architectures for Microservice Diversity @Scale," in ISCA, 2019
[53] Akshitha Sriraman, Thomas F. Wenisch, "µTune: Auto-Tuned Threading for OLDI Microservices," in OSDI, 2018
[54] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, "Going deeper with convolutions," in CVPR, 2015
[55] Jian Tan, Tieying Zhang, Feifei Li, Jie Chen, Qixing Zheng, Ping Zhang, Honglin Qiao, Yue Shi, Wei Cao, Rui Zhang, "iBTune: Individualized Buffer Tuning for Large-scale Cloud Databases," in VLDB, 2019
[56] Stephen J. Tarsa, Rangeen Basu Roy Chowdhury, Julien Sebot, Gautham Chinya, Jayesh Gaur, Karthik Sankaranarayanan, Chit-Kwan Lin, Robert Chappell, Ronak Singhal, Hong Wang, "Post-Silicon CPU Adaptations Made Practical Using Machine Learning," in ISCA, 2019
[57] Xiaodong Wang, Shuang Chen, Jeff Setter, Jose F. Martínez, "SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support," in HPCA, 2017
[58] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee, "Nimble Page Management for Tiered Memory Systems," in ASPLOS, 2019
[59] Yiying Zhang, Yutong Huang, "Learned Operating Systems," in ACM SIGOPS Operating Systems Review, 2019
[60] Zhijia Zhao, Bo Wu, Xipeng Shen, "Challenging the "Embarrassingly Sequential": Parallelizing Finite State Machine-based Computations through Principled Speculation," in ASPLOS, 2014
[61] Xiaoya Xiang, Chen Ding, Hao Luo, Bin Bao, "HOTL: A higher order theory of locality," in ASPLOS, 2013
[62] Harshad Kasture, Daniel Sanchez, "Tailbench: a benchmark suite and evaluation methodology for latency-critical applications," in IISWC, 2016
[63] Rajiv Nishtala, Vinicius Petrucci, Paul Carpenter, Magnus Sjalander, "Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services," in HPCA, 2020
[64] MongoDB official website. http://www.mongodb.com
[65] Memcached official website. https://memcached.org
[66] NGINX official website. http://nginx.org
[67] https://en.wikipedia.org/wiki/Universal_approximation_theorem
[68] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, Christina Delimitrou, "An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems," in ASPLOS, 2019
[69] Kalyanmoy Deb, Shivam Gupta, "Understanding Knee Points in Bicriteria Problems and Their Implications as Preferred Solution Principles," in Engineering Optimization, 43 (11), 2011
[70] www.mysql.com
[71] Harshad Kasture, Daniel Sanchez, "Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads," in ASPLOS, 2014
[72] Ana Klimovic, Heiner Litz, Christos Kozyrakis, "Selecta: Learning Heterogeneous Cloud Storage Configuration for Data Analytics," in USENIX ATC, 2018
[73] Harshad Kasture, Xu Ji, Nosayba El-Sayed, Xiaosong Ma, Daniel Sanchez, "Improving Datacenter Efficiency Through Partitioning-Aware Scheduling," in PACT, 2017
[74] Yanqi Zhang, Weizhe Hua, Zhuangzhuang Zhou, G. Edward Suh, Christina Delimitrou, "Sinan: ML-Based and QoS-Aware Resource Management for Cloud Microservices," in ASPLOS, 2021
[75] Joshua Fried, Zhenyuan Ruan, Amy Ousterhout, Adam Belay, "Caladan: Mitigating Interference at Microsecond Timescales," in OSDI, 2020
[76] Neeraj Kulkarni, Gonzalo Gonzalez-Pumariega, Amulya Khurana, Christine A. Shoemaker, Christina Delimitrou, David H. Albonesi, "Cuttlesys: Data-driven Resource Management for Interactive Services on Reconfigurable Multicores," in Micro, 2020
[77] Redis official website. https://redis.io/
[78] Node.js official website. https://nodejs.org/en/
[79] Lei Liu, "QoS-Aware Resources Scheduling for Microservices: A Multi-Model Collaborative Learning-based Approach," in arXiv:1911.13208v2, 2019
[80] Lei Liu, et al., "Going Vertical in Memory Management: Handling Multiplicity by Multi-policy," in ISCA, 2014