arXiv:2109.12663v3 [cs.PF] 12 Jun 2022

Springer Nature 2021 LaTeX template

WCFS: A new framework for analyzing multiserver systems

Isaac Grosof1*, Mor Harchol-Balter1 and Alan Scheller-Wolf2

1 Computer Science Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA.
2 Tepper School of Business, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA.

*Corresponding author(s). E-mail(s): igrosof@cs.cmu.edu;
Contributing authors: harchol@cs.cmu.edu; awolf@andrew.cmu.edu;

Abstract

Multiserver queueing systems are found at the core of a wide variety of practical systems. Many important multiserver models have a previously-unexplained similarity: identical mean response time behavior is empirically observed in the heavy-traffic limit. We explain this similarity for the first time. We do so by introducing the work-conserving finite-skip (WCFS) framework, which encompasses a broad class of important models. This class includes the heterogeneous M/G/k, the limited processor sharing policy for the M/G/1, the threshold parallelism model, and the multiserver-job model under a novel scheduling algorithm. We prove that for all WCFS models, scaled mean response time E[T](1 − ρ) converges to the same value, E[S^2]/(2E[S]), in the heavy-traffic limit, which is also the heavy-traffic limit for the M/G/1/FCFS. Moreover, we prove additively tight bounds on mean response time for the WCFS class, which hold for all loads ρ. For each of the four models mentioned above, our bounds are the first known bounds on mean response time.

Keywords: queueing, response time, bounds, heavy traffic, multiserver, M/G/k, scheduling

1 Introduction

Consider the following four queueing models, which are each important, practical models, but which seem very different.
We will refer to these models throughout the paper as our four motivating models:

• Heterogeneous M/G/k: A k-server system where servers run at different speeds. Jobs are held at a central queue and served in First-Come-First-Served (FCFS) order when servers become available. If multiple servers are vacant, a server assignment policy such as Fastest Server First is applied.
• Limited processor sharing: A single-server system where, if at least k jobs are present, the k earliest arrivals each receive an equal fraction of service. If fewer than k jobs are present, the server is split equally among all jobs.
• Threshold parallelism: A multiserver system where jobs can run on any number of servers up to some threshold, with perfect speedup. We consider FCFS service, where each job is allocated a number of servers equal to its threshold, as long as servers are available. The final job served may be allocated fewer servers than its threshold.
• Multiserver-jobs under the ServerFilling policy: A multiserver system where the jobs are called "multiserver jobs," because each job requires a fixed number of servers, which it holds concurrently throughout its service. We examine a service policy called ServerFilling, which always fills all of the servers if enough jobs are available.

We define these models in more detail in Section 3.

We will show that, while our four motivating models appear quite different, their mean response times, E[T], are very similar, especially in the heavy-traffic limit. Specifically, we will show that their behavior in the heavy-traffic limit is identical to that of the M/G/1/FCFS model, and in fact the mean response time of each of these disparate models differs only by an additive constant from that of M/G/1/FCFS for all loads, a much stronger result than convergence in heavy traffic. The similarity of these models is illustrated by Fig.
1, which shows mean response time, E[T], scaled by a factor of 1 − ρ, to help illustrate the asymptotic behavior in the ρ → 1 limit. Observe that in each of our models of interest, as well as in the M/G/1 and the M/G/4, E[T](1 − ρ) converges to E[S^2]/(2E[S]), the mean of the equilibrium (excess) distribution, where S denotes the job size distribution and ρ = λE[S] < 1 is the system load.

This similarity is striking – to see just how notable it is, consider a variety of alternative models and policies shown in Fig. 2. For these alternative models, scaled mean response time either does not converge at all, or converges to a different limit entirely. This contrast poses an intriguing question: Why do our four motivating models converge to M/G/1/FCFS in heavy traffic? To put it another way, we ask what crucial property our four motivating models share that is not shared by the alternative models in Fig. 2.

Fig. 1: Scaled mean response time of our four motivating models (Heterogeneous M/G/k, Limited Processor Sharing, Threshold Parallelism, and FCFS Multiserver-job under ServerFilling), as well as the related M/G/4 and M/G/1 models. Our four motivating models will be further defined in Section 3. In each case, the job size distribution S is distributed as Hyperexp(µ1 = 2, µ2 = 2/3, p1 = 1/2). The black line is E[T](1 − ρ) = E[S^2]/(2E[S]), the heavy-traffic behavior of M/G/1/FCFS and each of our models of interest. 10^9 arrivals simulated. ρ ∈ [0, 0.96] to ensure accurate results.

To answer this question, we define the "work-conserving finite-skip" (WCFS) framework, which applies to a broad class of models. The WCFS class contains our four motivating queueing models, as well as many others.
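As a concrete check of the limit shown in Fig. 1, the heavy-traffic constant E[S^2]/(2E[S]) for the hyperexponential size distribution in the figure caption can be computed in closed form. The following is our own small computation, not part of the paper's analysis:

```python
from fractions import Fraction

def hyperexp_moments(rates, probs):
    """First two moments of a hyperexponential distribution: a mixture of
    exponentials where a job is Exp(mu_i) with probability p_i.
    For Exp(mu): E[X] = 1/mu and E[X^2] = 2/mu^2."""
    m1 = sum(p / mu for mu, p in zip(rates, probs))
    m2 = sum(2 * p / mu**2 for mu, p in zip(rates, probs))
    return m1, m2

# Parameters from the Fig. 1 caption: mu1 = 2, mu2 = 2/3, p1 = p2 = 1/2.
m1, m2 = hyperexp_moments([Fraction(2), Fraction(2, 3)],
                          [Fraction(1, 2), Fraction(1, 2)])
limit = m2 / (2 * m1)
print(m1, m2, limit)  # 1 5/2 5/4
```

That is, E[S] = 1, E[S^2] = 5/2, so scaled mean response time should converge to 5/4 = 1.25 for every curve in Fig. 1.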
We demonstrate that for any model in the WCFS class (which we call a "WCFS model"), if the job size distribution S has bounded expected remaining size, then its scaled mean response time converges to the same heavy-traffic limit as the M/G/1/FCFS. Specifically, we prove that

Theorem 1. For any model π ∈ WCFS with bounded expected remaining size¹,

    lim_{ρ→1} E[T^π](1 − ρ) = E[S^2] / (2E[S]).

Theorem 1 follows from an even stronger result: We prove that the difference in mean response time between any WCFS model and M/G/1/FCFS is bounded by an explicit additive constant, that may depend on the specific WCFS model.

¹ This assumption is defined in Section 2.3.

Fig. 2: Scaled mean response time of alternative models and policies (Threshold Parallelism Inelastic First, Threshold Parallelism Elastic First, M/G/4/SRPT, and the Multiserver-job model under FCFS, MaxWeight, Least Servers First, and Most Servers First). All of these models and policies will be explained in Section 6. S ∼ Hyperexp(µ1 = 2, µ2 = 2/3, p1 = 1/2). Black line is E[T](1 − ρ) = E[S^2]/(2E[S]). 10^9 arrivals simulated, ρ ∈ [0, 0.96] to ensure accurate results, except MaxWeight and M/G/4/SRPT: 10^10 arrivals, ρ ∈ [0, 0.99].

Theorem 2. For any model π ∈ WCFS with bounded expected remaining size,

    E[T^π] ≤ ρ/(1 − ρ) · E[S^2]/(2E[S]) + c^π_upper
    E[T^π] ≥ ρ/(1 − ρ) · E[S^2]/(2E[S]) + c^π_lower

for explicit constants c^π_upper and c^π_lower not dependent on load ρ.

Theorem 2 not only implies Theorem 1, it also guarantees rapid convergence of scaled mean response time to the heavy-traffic limit specified in Theorem 1.

In summary, this paper makes the following contributions:

• We define the WCFS framework and our bounded expected remaining size assumption.
(Section 2)
• We prove that each of the four motivating models is a WCFS model. (Section 3)
• We discuss prior work on WCFS models. (Section 4)
• We prove that all WCFS models with bounded expected remaining size have the same scaled mean response time as M/G/1/FCFS, and mean response time within an additive constant of M/G/1/FCFS. (Section 5)
• We empirically validate our results, contrasting heavy-traffic behavior of WCFS models and non-WCFS models. (Section 6)

Fig. 3: Diagram of a Finite-Skip Model

2 The WCFS Framework and WCFS Models

In Sections 2.1 and 2.2, we define the WCFS framework and resulting class of models. In Section 2.3, we define our "bounded expected remaining size" assumption. In Section 2.4, we define a few more concepts that will be used in the paper.

Job sizes are sampled i.i.d. from a job size distribution. Once sampled, job sizes are fixed: we assume preempt-resume service if a job is preempted while in service. Intuitively, the size of a job represents the amount of work associated with the job. Size will be defined in more detail in Section 2.1.2.

2.1 WCFS Framework and WCFS Models

The WCFS framework applies to the class of models with Poisson arrivals at rate λ, which satisfy the following properties:
1. Finite skip (Section 2.1.1),
2. Work conserving (Section 2.1.2),
3. Non-idling (Section 2.1.3).

2.1.1 Finite skip

We first define the finite-skip property, which defines the class of finite-skip models. Consider the jobs in the system in arrival order. Associated with each finite-skip model, there is a finite parameter n. We partition the jobs in the system into two sets: the (up to) n jobs which arrived longest ago, which we call the front, and all other jobs, which we call the queue. The finite-skip property specifies that, among all of the jobs in the system, the server(s) only serve jobs in the front.
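This front/queue partition can be sketched directly. The following is a minimal illustration of our own (the function and representation are not part of the framework's formal definition):

```python
def partition(arrival_times, n):
    """Split jobs (identified here by arrival time) into the front -- the
    (up to) n jobs that arrived longest ago -- and the queue (all others).
    Under the finite-skip property, only front jobs may receive service."""
    ordered = sorted(arrival_times)  # arrival order
    return ordered[:n], ordered[n:]

front, queue = partition([0.3, 0.1, 0.7, 0.5], n=2)
print(front, queue)  # [0.1, 0.3] [0.5, 0.7]
```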
In particular, no jobs beyond the first n jobs in arrival order receive any service. Fig. 3 shows a generic finite-skip model.

Definition 1. We call the front full if at least n jobs are present in the system, and therefore exactly n jobs are at the front.

The intuition behind the term "finite skip" comes from imagining moving through the jobs in the system in arrival order, skipping over some jobs and serving others. In a finite-skip model only the first n jobs can be served, so only finitely many jobs can be skipped.

2.1.2 Work conserving

Now, we will specify what we mean by "work conserving," which is a different concept here than in previous work. First, we normalize the total system capacity to 1, regardless of the number of servers in the system. For instance, in a homogeneous k-server system, we think of each server as serving jobs at rate 1/k. Whenever a job is in service, it receives some fraction of the system's total service capacity, which we call the job's service rate. Let B(t) ≤ 1 denote the total service rate of all jobs in service at time t, and let B be the stationary total service rate, assuming for now such a quantity exists.

We define a job's age at time t to be the total amount of service the job has received up to time t: a job's age increases at a rate equal to the job's service rate whenever the job is in service. Each job has a property called its size. When the job's age reaches its size, the job completes. In particular, we assume that every job j has a size sj and a class cj drawn i.i.d. from some general joint distribution. Let (S, C) be the random variables denoting a job's size and class pair. A job's class is static information known to the scheduler, while a job's size is unknown to the scheduler. For instance, in the threshold parallelism model defined in Section 3.3, a job's parallelism threshold is its class.

Definition 2.
We call the system maximally busy if the entire capacity of the system is in use, namely if the total service rate of jobs in service is 1. We define a finite-skip model to be work conserving if whenever the front is full, the system is also maximally busy. In other words, a finite-skip model is work-conserving if, whenever there are at least n jobs in the system, the total service rate is 1.

Now that we have defined a job's size, we can also define the load of the system: ρ = λE[S]. Load ρ is the time-average service rate, or equivalently the time-average fraction of capacity in use. Specifically, ρ = E[B]. We assume ρ < 1 to ensure stability.

2.1.3 Non-idling

We also assume that the total service rate B(t) is bounded away from zero whenever a job is present. Specifically, whenever a job is present, we assume that B(t) ≥ b_inf, for some constant b_inf > 0. This assumption is key to bounding mean response time under low load. For an example, see the batch-processing system in Section 2.2.

2.2 Examples and non-examples

To clarify which models fit within the WCFS framework, we give several examples, both positive and negative.

• M/G/k/FCFS: This is a WCFS model with n = k.
• M/G/∞: This model is not finite skip. All jobs are in service, regardless of the number of jobs in the system: there is no finite bound on the number of jobs in service.
• M/G/k/SRPT: In this model, the k jobs with smallest remaining size are served at rate 1/k. This model is not finite skip because the jobs with smallest remaining size can be arbitrarily far back in the arrival ordering.
• Multiserver-job model: Consider a multiserver system with k = 2 servers, where each job requires either 1 or 2 servers. Let the front size n = 2. If jobs are served in FCFS order, with head-of-the-line blocking (HOLB), this policy is finite-skip, but not work-conserving.
If the front consists of a job requiring 1 server followed by a job requiring 2 servers, under HOLB the system will only utilize one server. In this case, the front is full, because n = 2 jobs are present in the system, but the system is not maximally busy. In contrast, consider a service policy which serves a 2-server job if either of the jobs in the front is a 2-server job, and otherwise serves each of the 1-server jobs at the front. This policy is a special case of the ServerFilling policy, depicted in Fig. 1 and defined in general in Section 3.4.2. This policy is finite-skip and work-conserving.
• Batch-processing M/G/k: If there are at least k jobs present, the oldest k jobs in the system are each served at rate 1/k. Otherwise, no service occurs. This model is finite-skip and work-conserving, but is not non-idling. To see why the non-idling property is necessary for our main results, specifically Theorem 2, one can show that in the λ → 0 limit, response times grow arbitrarily large in the batch-processing M/G/k. To rule out systems where E[T] diverges in the λ → 0 limit, we assume the non-idling property.
• Red and Blue M/G/k: Imagine an M/G/k with red and blue jobs. Only one color of jobs is allowed to be in service at a time. To determine which jobs to serve, the scheduler counts off jobs in arrival order until it finds k red jobs or k blue jobs, and serves all k of the appropriate color (if fewer than k jobs are found for both colors, the system serves the more populous color). This scheduling policy is WCFS with n = 2k − 1.

2.3 Bounded expected remaining size: Finite rem_sup

At a given point in time, let the state of a job j consist of its class cj and its age aj. Within our WCFS framework, we allow service to be based on the states of the jobs in the front, but not on the number or states of jobs in the queue. A key assumption we make is that jobs have bounded expected remaining size from an arbitrary state.
Let Sc be the job size distribution for jobs of class c ∈ C. We define rem_sup(S, C) to be the supremum of the expected remaining sizes of jobs, taken over all states:

    rem_sup(S, C) := sup_{c∈C, a∈R+} E[Sc − a | Sc > a].

When size S is independent of class C, or when a model has no class information, we simply write rem_sup(S). In this paper, we focus on job size distributions for which rem_sup(S, C) is finite.

To better understand the finite rem_sup(S, C) assumption, let's walk through a couple of examples. In all of these examples, let's suppose that the class information is independent of the job size distribution S, so we can simply write rem_sup(S). Consider a job size distribution S that is hyperexponential:

    S = Exp(µ1) w.p. p1
        Exp(µ2) w.p. p2
        Exp(µ3) w.p. p3

For all ages a, the expected remaining size is bounded:

    E[S − a | S > a] ≤ 1/min(µ1, µ2, µ3) = rem_sup(S).

More generally, an arbitrary phase-type job size distribution S′ must have finite rem_sup. On the other hand, Pareto job size distributions do not have finite rem_sup. Let S′′ ∼ Pareto(α = 3, x_min = 1), which has finite first and second moments. Then

    E[S′′ − a | S′′ > a] = a/2  ∀a ≥ 1,
    lim_{a→∞} E[S′′ − a | S′′ > a] = ∞,
    rem_sup(S′′) = sup_a E[S′′ − a | S′′ > a] = ∞.

In general, finite rem_sup roughly corresponds to service time having an exponential or sub-exponential tail, though there are some subtleties. For instance, a Weibull distribution with P(S ≥ a) = e^(−a^k) for some k < 1 has infinite rem_sup, while for k ≥ 1, rem_sup is finite.

As a final example, suppose the WCFS scheduling policy is a known-size policy, such as a policy which serves the job with least remaining size among the n jobs in the front, at rate 1. Because we require that service is based only on the age and class of a job, we model this situation by saying that a job's class is its original size.
In this case, S = C, and the distribution Sx is simply the constant x. As a result, rem_sup(S, C) = sup(S). Therefore, in a known-size setting, rem_sup is finite only if S is bounded.

2.4 Work, Number, Response Time

Let the work in the system be defined as the sum of the remaining sizes of all jobs in the system. Let W(t) be the total work in the system at time t. Let WQ(t) and WF(t) be the work in the queue and the work at the front, respectively, at time t. (We will generally use the subscripts Q and F to denote the queue and the front.) Let W, WQ, and WF denote the corresponding time-stationary random variables. Recall from Section 2.1.2 that B(t) is the total service rate at time t. Note that (d/dt)W(t) = −B(t), except at arrival instants.

Let N(t) be the number of jobs in the system at time t. Note that NF(t) = n whenever N(t) ≥ n, because the front is full, and NF(t) = N(t) otherwise. Let T be a random variable denoting a job's time-stationary response time: the time from when a job arrives to when it completes.

3 Important WCFS Models

Here we define in more detail the four motivating models mentioned in the introduction and depicted in Fig. 1, and show that each is a WCFS model.

3.1 Heterogeneous M/G/k

The heterogeneous M/G/k/FCFS models multiserver systems where servers have different speeds. This scenario commonly arises in datacenters, which are often composed of servers with a wide variety of different types of hardware [1, 2]. In the mobile device setting, the big.LITTLE architecture employs heterogeneous processors to improve battery life [3].

Let each server i have speed vi > 0, scaled so that Σi vi = 1. While a job is being served by server i, the job's age increases at a rate of vi. If there are multiple servers idle when a job arrives, a server is chosen according to an arbitrary server assignment policy.
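One concrete assignment policy is Fastest Server First, which is used for the heterogeneous M/G/4 shown in Fig. 1. A minimal sketch of our own (the function and data layout are illustrative, not from the paper):

```python
def fastest_server_first(speeds, idle):
    """Fastest Server First: among the currently idle servers, assign an
    arriving job to the server with the highest speed v_i.
    `speeds` maps server index -> speed; `idle` is the set of idle servers."""
    return max(idle, key=lambda i: speeds[i])

# Server speeds from the heterogeneous M/G/4 example of Fig. 1, summing to 1.
speeds = [0.4, 0.3, 0.2, 0.1]
print(fastest_server_first(speeds, idle={1, 2, 3}))  # 1
```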
Jobs may also be migrated between servers when a job completes. We only assume that jobs are served in FCFS order, and that no job is left waiting while a server is idle. Under these assumptions, all assignment policies fit within the WCFS framework. As an example, in Fig. 1 we show the scaled mean response time of a heterogeneous M/G/4 with server speeds 0.4, 0.3, 0.2, 0.1, and the Preemptive Fastest Server First assignment policy.

3.1.1 Heterogeneous M/G/k is a WCFS model

To show that the heterogeneous M/G/k is a WCFS model, we must verify the three properties from Sections 2.1.1 to 2.1.3.

Finite skip: Jobs enter service in FCFS order. As a result, the jobs in service are exactly the (up to) k oldest jobs in the system. The model is finite skip with parameter n = k.

Work conserving: The system has total capacity Σi vi = 1. Whenever at least k jobs are present in the system, all servers are occupied, and the total service rate is 1. In other words, whenever the front is full, the system is maximally busy.

Positive service rate when nonempty: If a job is present, the job will be in service on some server. The system will therefore have minimum service rate b_inf ≥ v_min, where v_min = min_i vi.

3.2 Limited Processor Sharing

The Processor Sharing policy for the M/G/1 is of great theoretical interest, and has been extensively studied [4]. However, in real systems, running too many jobs at once causes significant overhead. A natural remedy is to utilize a policy known as Limited Processor Sharing (LPS) [5–8]. The LPS policy is parameterized by a Multi-Programming Level (MPL) k. If at most k jobs are present in the system, then the policy is equivalent to Processor Sharing, serving all jobs at an equal rate, with total service rate 1. When more than k jobs are present, the k oldest jobs in FCFS order are each served at rate 1/k. LPS is a WCFS model with n = k.
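The LPS service rates described above can be sketched as a small function (our own illustration; `n_jobs` denotes the current number of jobs in the system):

```python
def lps_rate(n_jobs, k):
    """Per-job service rate under Limited Processor Sharing with MPL k.
    With at most k jobs present, all jobs share the server equally
    (ordinary Processor Sharing); with more than k present, only the
    k oldest jobs are served, each at rate 1/k. The total service rate
    is 1 whenever any job is present, so LPS is non-idling with b_inf = 1."""
    if n_jobs == 0:
        return 0.0
    return 1.0 / min(n_jobs, k)

print(lps_rate(2, k=4))  # 0.5  (Processor Sharing among 2 jobs)
print(lps_rate(9, k=4))  # 0.25 (only the 4 oldest jobs, each at rate 1/4)
```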
As an example, in Fig. 1 we show the scaled mean response time of an LPS system with MPL 4.

3.2.1 LPS is a WCFS model

To show that LPS is a WCFS model, we must verify the three properties from Sections 2.1.1 to 2.1.3.

Finite skip: Jobs enter service in FCFS order: the jobs in service are exactly the (up to) k oldest jobs in the system. The model is finite skip with parameter n = k.

Work conserving: Whenever at least k jobs are present in the system, the k oldest jobs are each served at rate 1/k, so the total service rate is 1. In other words, whenever the front is full, the system is maximally busy.

Positive service rate when nonempty: Whenever any job is present, the total service rate is 1, so b_inf = 1.

3.3 Threshold Parallelism

In modern datacenters, it is increasingly common for jobs to be parallelizable across a variety of different numbers of servers, where the level of parallelism is chosen by the scheduler [9, 10]. Under Threshold Parallelism, a job j has two characteristics: its size sj and its parallelism threshold ℓj, where ℓj is some number of servers. Job j may be parallelized across up to ℓj servers, with linear speedup. The pair (sj, ℓj) is sampled i.i.d. from some joint distribution (S, L). Note that ℓj is the class of job j.

Let k be the total number of servers. Note that ℓj ∈ [1, k]. If a job j is served on q ≤ ℓj servers, then it receives service rate q/k and will complete after k·sj/q time in service. The number of servers a job is allocated can change over time, correspondingly changing its service rate.

We focus on the FCFS service policy. Under this policy, jobs are placed into service in arrival order until their total parallelism thresholds sum to at least k, or all jobs are in service.
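A minimal sketch of this FCFS allocation rule (our own illustration; we assume, as this model specifies, that the final job admitted into service receives whatever servers remain, up to its threshold):

```python
def fcfs_threshold_allocation(thresholds, k):
    """Allocate k servers to jobs under Threshold Parallelism with FCFS
    service. Jobs are admitted in arrival order; each receives its full
    parallelism threshold while servers remain, and the final job in
    service may receive only the leftover servers.
    Returns the per-job server allocations, in arrival order."""
    alloc = []
    free = k
    for l in thresholds:  # parallelism thresholds in arrival order
        if free == 0:
            break
        give = min(l, free)
        alloc.append(give)
        free -= give
    return alloc

# k = 4 servers; jobs with thresholds 1, 4, 2 in arrival order.
# The second job gets only the 3 remaining servers; the third job waits.
print(fcfs_threshold_allocation([1, 4, 2], k=4))  # [1, 3]
```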
Each job j other than the final job in service is served by ℓj servers. The final job in service is served by the remaining servers. Under FCFS service, Threshold Parallelism fits the WCFS framework with n = k.

As an example, in Fig. 1 we show the scaled mean response time of a Threshold Parallelism model where the joint distribution (S, L) is (Exp(2), 1) with probability 1/2, and (Exp(2/3), 4) with probability 1/2, and with FCFS service. As a comparison, in Fig. 2, we show Threshold Parallelism models with the same joint distribution (S, L), but with different service policies: "Elastic First," prioritizing jobs with L = 1, and "Inelastic First," prioritizing jobs with L = 4. These policies do not fit within the WCFS framework, because a job may skip over an arbitrary number of jobs.

3.3.1 Threshold Parallelism with FCFS service is a WCFS model

Finite skip: The jobs in service are the initial set of jobs in arrival order whose parallelism thresholds sum to at least k. This initial set can contain at most k jobs, because every job has parallelism threshold at least 1. As a result, the model is finite skip with parameter n = k.

Work conserving: Whenever jobs are present in the system whose parallelism thresholds sum to at least k, all servers are occupied, and the system is maximally busy. In particular, whenever k jobs are present, the system must be maximally busy.

Positive service rate when nonempty: If a job is present in the system, at least one server must be occupied, and so the service rate is at least 1/k. Hence b_inf ≥ 1/k.

3.4 Multiserver-jobs under the ServerFilling policy

First, we will describe the multiserver-job setting. Then we will specify the ServerFilling policy.

3.4.1 Multiserver-Job Setting

In cloud computing systems [11] and in supercomputing systems [12–14], jobs commonly require an exact number of servers for the entire time the job is in service. To illustrate, in Fig.
4 we show the distribution of the number of CPUs requested by the jobs in Google's recently published trace of its "Borg" computation cluster [15, 16]. The distribution is highly variable, with jobs requesting anywhere from 1 to 100,000 normalized CPUs².

² The data was published in a scaled form [15]. We rescale the data so the smallest job in the trace uses one normalized CPU.

Fig. 4: The distribution of number of CPUs requested in Google's recently published Borg trace [15]. Number of CPUs is normalized to the size of the smallest request observed, not an absolute value.

The Multiserver-Job (MSJ) model is a natural model for these computing systems. In an MSJ model, a job j has two requirements: a number of servers vj and an amount of time xj, which are sampled i.i.d. from some joint distribution (V, X). If job j requires vj servers, then it can only be served when exactly vj servers are allocated to it. The job will complete after xj time in service. Let a job j's size be defined as

    sj = vj·xj/k,    S = V·X/k.

There are a wide variety of possible service policies for placing jobs at open servers, including FCFS, MaxWeight, Most Servers First, and many others. (We formally define these policies in Section 6.) As examples, in Fig. 2, we show the scaled mean response time of Multiserver-Job models under a variety of service policies, where the joint distribution (V, X) is (1, Exp(1/2)) with probability 1/2, and (4, Exp(2/3)) with probability 1/2.

Unfortunately, no existing policies fit within the WCFS framework – all existing policies, including those shown in Fig. 2, are either non-finite-skip, such as Most Servers First, or non-work-conserving, such as FCFS. Correspondingly, in Fig.
2, we see that no existing policy has its scaled mean response time converge to the same limit as M/G/1/FCFS. We therefore define a novel service policy called ServerFilling which yields a WCFS model. The scaled mean response time of this service policy is depicted in Fig. 1, with the same joint distribution (V, X) as the policies shown in Fig. 2.

3.4.2 ServerFilling

For simplicity, we initially define the ServerFilling policy for the common situation in computer systems where all jobs require a number of servers which is a power of 2 (V is always a power of 2), and where k is also a power of 2. We discuss generalizations in Section 3.4.4.

First, ServerFilling designates a candidate set M, consisting of the minimal prefix (i.e. initial subset) of the jobs in the system in arrival order which collectively require at least k servers. If all jobs in the system collectively require fewer than k servers, then all are served. Note that |M| ≤ k because all jobs require at least 1 server. For instance, if k = 8 and the jobs in the system require [1, 2, 1, 1, 4, 2, 2, 1] servers, in arrival order (reading from left to right), then M would consist of the first 5 jobs: [1, 2, 1, 1, 4], which collectively require 9 servers.

Next, the jobs in M are ordered by their server requirements vj, from largest to smallest, with ties broken by arrival order. Jobs are placed into service in that order until no more servers are available. In our example, jobs requiring 4, 2, 1, and 1 server(s) would be placed into service.

To show that ServerFilling fits within WCFS with n = k, we must show that ServerFilling always utilizes all k servers if at least k jobs are in the system.

Lemma 1. Let M be a set of jobs such that Σ_{j∈M} vj ≥ k, where each vj = 2^i for some i and k = 2^{i′} for some i′. Label the jobs m1, m2, . . . in decreasing order of server requirement: v_{m1} ≥ v_{m2} ≥ . . ..
Then there exists some index ℓ ≤ |M| such that

    Σ_{j=1}^{ℓ} v_{mj} = k.

Proof. Let req(z) count the number of servers required by the first z jobs in this ordering:

    req(z) = Σ_{j=1}^{z} v_{mj}.

We want to show that req(ℓ) = k for some ℓ. To do so, it suffices to prove that:

    There exists no index ℓ′ such that both req(ℓ′) < k and req(ℓ′ + 1) > k.    (1)

Equation (1) states that req(z) cannot cross from below k to above k without exactly equalling k. Because req(0) = 0 and req(|M|) ≥ k, req(ℓ) must exactly equal k for some ℓ.

To prove (1), let us examine the quantity k − req(z), the number of remaining servers after z jobs have been placed in service. Because all vj s are powers of 2, k − req(z) carries an important property:

    k − req(z) is divisible by v_{m_{z+1}} for all z.    (2)

We write a|b to indicate that a divides b. We will prove (2) inductively. For z = 0, k − req(0) = k. Because k is a power of 2, and v_{m1} is a power of 2 no greater than k, the base case holds. Next, assume that (2) holds for some index z, meaning that v_{m_{z+1}} | (k − req(z)). Note that req(z + 1) = req(z) + v_{m_{z+1}}. As a result, v_{m_{z+1}} | (k − req(z + 1)). Now, note that v_{m_{z+2}} | v_{m_{z+1}}, because both are powers of 2, and v_{m_{z+2}} ≤ v_{m_{z+1}}. As a result, v_{m_{z+2}} | (k − req(z + 1)), completing the proof of (2).

Now, we are ready to prove (1). Assume for contradiction that there does exist such an ℓ′. Then k − req(ℓ′) > 0, and k − req(ℓ′ + 1) < 0. Because req(ℓ′ + 1) = req(ℓ′) + v_{m_{ℓ′+1}}, we therefore know that v_{m_{ℓ′+1}} > k − req(ℓ′). But from (2), we know that v_{m_{ℓ′+1}} divides k − req(ℓ′), which is a contradiction. □

3.4.3 ServerFilling for the Multiserver-Job system is a WCFS policy

Finite skip: The jobs in service are a subset of the candidate set M, the initial set of jobs in arrival order whose server requirements vj sum to at least k. This initial set must contain at most k jobs, because every job requires at least 1 server.
As a result, the model is finite skip with parameter n = k.

Work conserving: By Lemma 1, whenever jobs are present in the system whose server requirements vj sum to at least k, all servers are occupied, and the system is maximally busy. Thus, whenever k jobs are present, the system must be maximally busy.

Positive service rate when nonempty: If a job is present in the system, at least one server must be occupied, and so the service rate is at least 1/k. Hence b_inf ≥ 1/k.

3.4.4 Generalizations of ServerFilling

The ServerFilling policy can be generalized, as long as all server requirements divide k. We describe the corresponding scheduling policy, which we call DivisorFilling, in Section A. DivisorFilling is the most general possible WCFS policy for the MSJ setting. If some server requirement does not divide k, then no policy fits within the WCFS framework, because the system is not work conserving if all jobs present require that non-divisible number of servers and more than n jobs are present.

4 Prior Work

4.1 M/G/k

4.1.1 Fixed k

In this regime, the best known bounds on response time either require much stronger assumptions on the job size distribution S than we assume [17], or prove much weaker bounds on mean response time [18, 19].

A paper by Loulou [17] bounds mean work in system in the M/G/k to within an additive gap, under the strong assumption that the job size distribution S is bounded. While the paper mostly focuses on the overload regime (ρ > 1), their equations (9) and (10) apply in our setting (ρ < 1) as well. They couple the multiserver system with a single-server system on the same arrival sequence. They show that

    0 ≤ W^{M/G/k}(t) − W^{M/G/1}(t) ≤ k · max_{1≤i≤A(t)} Si,

where A(t) is the number of jobs that have arrived by time t. In the case of a bounded job size distribution S, one can therefore show that

    0 ≤ W^{M/G/k}(t) − W^{M/G/1}(t) ≤ k sup(S).
(3)

One could then use this workload bound to prove a bound on mean response time in the M/G/k. These bounds are comparable to those in our Lemma 3 when S is bounded, but our bounds require a much weaker assumption on the job size distribution S.

Köllerström [18] proves convergence of queueing time to an exponential distribution in the GI/GI/k. Specialized to the M/G/k, the result states that in the ρ → 1 limit, $T_Q^{M/G/k}$ converges to an exponential distribution with mean

$$\frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]} - \frac{1}{\lambda} = E[T_Q^{M/G/1}] - \frac{1}{\lambda}.$$

Köllerström [19] improves upon [18] by characterizing the rate of convergence, and thereby derives explicit moment bounds. However, unlike prior single-server results [20], these bounds are quite weak. Specialized to the M/G/k, Köllerström [19]'s bounds state that

$$E[T_Q^{M/G/k}] - E[T_Q^{M/G/1}] \ge -\frac{c_{lower}}{(1-\rho)^{1/2}} \qquad (4)$$

$$E[T_Q^{M/G/k}] - E[T_Q^{M/G/1}] \le \frac{c_{higher}}{1-\rho} \qquad (5)$$

for constants $c_{lower}, c_{higher}$ not dependent on ρ.

The $\Theta(\frac{1}{1-\rho})$ scaling in (5) is especially poor: this bound is too weak to give any explicit bound on the convergence rate of $E[T_Q^{M/G/k}](1-\rho)$ to the previously established limit of $\frac{E[S^2]}{2E[S]}$. Our bounds are tighter in that they are constants not depending on ρ, but we assume S has finite $\mathrm{rem}_{\sup}$, while Köllerström [19] merely assumes that S has finite second moment.

4.1.2 Scaling k

Recent work has focused on regimes where both ρ and k scale asymptotically, such as the Halfin-Whitt regime. These results are not directly comparable to ours; they indicate that the limiting behavior in the Halfin-Whitt regime depends in a complex way on the job size distribution S [21–23].

Turning to the more general case of scaling k, in work currently under submission, Goldberg and Li [24] prove the first bounds on $E[T_Q]$ that scale as $\frac{c}{1-\rho}$ for an explicit constant c and arbitrary joint scaling of k and ρ. Unfortunately, the constant c is enormous, scaling as $10^{450} E[S^3]$.
In contrast, we focus on the regime of fixed k, and prove tight and explicit bounds on mean response time. Goldberg and Li [24] also provide a highly detailed literature review on bounds on $E[T_Q]$ and related measures in the M/G/k and related models.

4.2 Heterogeneous M/G/k

4.2.1 Heterogeneous M/M/k

Much of the previous work on multiserver models with heterogeneous service rates has focused on the much simpler M/M/k setting, where jobs are memoryless [25–28]. In this model, one can analyze the preemptive Fastest-Server-First policy to derive a natural lower bound on the mean response time of any server assignment policy. One can similarly analyze the preemptive Slowest-Server-First policy to derive an upper bound. These two policies each lead to a single-dimensional birth-death Markov chain, allowing for straightforward analysis [26]. One can think of our bounds as essentially extending these bounds for the M/M/k to the much more complex setting of the M/G/k.

4.2.2 Heterogeneous M/H_m/k

Van Harten and Sleptchenko [29] primarily study a homogeneous multiserver setting with hyperexponential job sizes. However, in their conclusion, they mention that their methods could be extended to a setting with heterogeneous servers, but at the cost of making their Markov chain grow exponentially. This exponential blowup seems inevitable when applying exact Markovian methods to a heterogeneous setting with differentiated jobs.

4.2.3 M/(M+G)/2 Model

Another intermediate model is the M/(M+G)/2 model of Boxma et al. [30]. In this model, jobs are not differentiated. Instead, the service time distribution depends entirely on the server: server 1, the first server to be used, has an exponential service time distribution, while server 2 has a general service time distribution. Boxma et al.
[30] derive an implicit expression for the Laplace-Stieltjes transform of response time in this setting, which they are only able to make explicit when the general service time distribution has rational transform. Subsequent work has fully solved the M/(M+G)/2 model, under both FCFS service and related service disciplines [31–33].

Our results are not directly applicable to the M/(M+G)/2 setting, because the servers have different distributions of service time, not just different speeds. However, the slow progress on this two-server model illustrates the immense difficulty of solving even the simplest heterogeneous multiserver models. In contrast, our WCFS framework handles both differentiated jobs and an arbitrary number of servers with no additional effort.

4.3 Limited Processor Sharing

The Limited Processor Sharing policy has been studied by a wide variety of authors [5–8, 34–36], but none bound mean response time for all loads ρ.

4.3.1 Asymptotic Regimes

A series of papers by Zhang, Dai and Zwart [7, 34, 35] derive the strongest known results on Limited Processor Sharing in a variety of asymptotic regimes. These authors derive the measure-valued fluid limit [34], the diffusion limit [35] and a steady-state approximation [7]. The most comparable of their results to our work is their steady-state approximation. When specialized to mean response time in the M/G/1/LPS, their approximation states that

$$E[T] \approx \frac{E[S]}{1-\rho}\left(1-\rho^k\right) + \frac{E[S^2]}{2E[S]}\,\frac{\rho^k}{1-\rho}.$$

They prove that this approximation is accurate in the heavy-traffic limit; they do not provide specific error bounds, but empirically show that the approximation performs well at all loads ρ [7]. Our results therefore complement their results by proving concrete error bounds.
4.3.2 State-dependent Server Speed

To model the behavior of databases, Gupta and Harchol-Balter [8] introduce a variant of the Limited Processor Sharing model, where the total server speed is a function of the number of jobs in service. In their setting, server speed increases to a peak, and then slowly declines as more jobs enter service. They derive a two-moment approximation for mean response time, and use it to derive a heuristic policy for choosing the Multi-Programming Level (MPL). While this two-moment approximation is not known to be tight, it indicates that the optimal MPL for minimizing mean response time may be significantly larger than the service-rate-maximizing MPL, if job size variability is large and load is not too high.

Using our WCFS framework, it is possible to derive bounds on mean response time for the state-dependent server speed setting. For MPL parameters less than or equal to the service-rate-maximizing MPL, both our upper and lower bounds apply, while if the MPL parameter is greater than the service-rate-maximizing MPL, only our upper bounds apply, because the system only partially fulfills our definition of work conservation.

Subsequently, Telek and Van Houdt [6] derive the Laplace-Stieltjes transform of response time in the LPS model with state-dependent server speed, under phase-type job sizes. Unfortunately, the transform takes the form of a complicated matrix equation, making it difficult to derive general insights across general job size distributions. Instead, the authors numerically invert the Laplace transform for a handful of specific distributions to derive empirical insights.

4.4 Threshold Parallelism

Jobs with "speedup functions" are common in Machine Learning and other highly parallel computing settings. A job's speedup function specifies the degree to which it can be parallelized.
In [37–39], the authors study optimal allocation policies of servers to jobs when the arriving jobs have different speedup functions. In many cases, a job's speedup function takes the form of a "threshold" function: the job receives perfect (linear) speedup up to some threshold number of servers, and receives no additional speedup beyond that number of servers. We refer to this as the Threshold Parallelism model.

While understanding response time in systems where jobs have speedup functions is generally intractable, Berg et al. [39] were able to approximately analyze response time in the case where every job is either "inelastic," with parallelism threshold 1, or "elastic," with parallelism threshold k. They also assume that inelastic jobs have size distributed as Exp(μ_I), and elastic jobs have size distributed as Exp(μ_E), with sizes unknown to the scheduler. They focus on two preemptive-priority service policies for this setting: Inelastic First (IF) and Elastic First (EF). In this setting, they approximate the mean response time of EF and IF within 1% error by using a combination of the Busy-Period Transitions technique and Matrix-Analytic methods to evaluate their multidimensional Markov chain. The Threshold Parallelism model in our paper is far broader than that in the prior literature, and our bounds are tighter in the heavy-traffic limit.

4.5 Multiserver Jobs

The Multiserver-Job model has been extensively studied, in both practical [12–14] and theoretical settings [11, 40–46]. It captures the common scenario in datacenters and supercomputing where each job requires a fixed number of servers in order to run. Characterizing the stability region of policies in this model is already a challenging problem, and prior to our bound on ServerFilling, there were no bounds on mean response time for any scheduling policy.
4.5.1 FCFS Scheduling

The most natural policy is FCFS, where the oldest jobs are placed into service until a job requires more servers than remain, at which point the queue is blocked. The FCFS policy can therefore leave a large number of servers idle even when many jobs are present. As a result, FCFS does not in general achieve an optimal stability region. Even worse, deriving the stability region of FCFS is an open problem, which has only been solved in a few special cases [40, 41].

One technique that may be useful for characterizing this stability region is the saturated system approach [47, 48]. The saturated system is a system in which additional jobs are always available, so the front is always full; only the composition of jobs in the front varies. The completion rate of the saturated system exactly matches the stability region of the equivalent open system, under a wide variety of arrival processes. Unfortunately, analyzing the general Multiserver-Job FCFS saturated system seems intractable.

Given the difficulty of proving results under FCFS scheduling, finding policies with better theoretical guarantees, such as ServerFilling, is desirable.

4.5.2 MaxWeight Scheduling

One natural throughput-optimal policy is the MaxWeight policy [11]. Here jobs are divided into classes based on their server requirements, with $N_i(t)$ denoting the number of jobs requiring i servers in the system at time t. Let the set Z(t) denote all possible packings of jobs at time t onto servers. Let z ∈ Z(t) be a particular packing, where $z_i$ denotes the number of jobs requiring i servers that are served by packing z. The MaxWeight service policy picks the packing z which maximizes

$$\max_{z \in Z(t)} \sum_i N_i(t) z_i.$$

For example, if there are many jobs requiring 3 servers, we want to pick a packing that serves many 3-server jobs.
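To make the computational burden concrete, here is a minimal brute-force version of the packing optimization that MaxWeight solves at every arrival and departure (an illustrative sketch of our own, not the implementation in [11]; the exhaustive search is exponential in k, which is exactly why the policy is expensive):

```python
from itertools import product

def maxweight_packing(counts, k):
    """counts[i] = number of queued jobs requiring i servers.
    Return the feasible packing z maximizing sum_i counts[i] * z[i],
    found by exhaustive search over all candidate packings."""
    sizes = sorted(counts)
    ranges = [range(min(counts[i], k // i) + 1) for i in sizes]
    best, best_weight = None, -1
    for z in product(*ranges):
        used = sum(i * zi for i, zi in zip(sizes, z))
        if used <= k:
            weight = sum(counts[i] * zi for i, zi in zip(sizes, z))
            if weight > best_weight:
                best, best_weight = dict(zip(sizes, z)), weight
    return best

# k = 4 servers; five 3-server jobs and two 1-server jobs queued.
# The heavy 3-server class dominates the weight, so the chosen packing
# serves one 3-server job plus one 1-server job.
print(maxweight_packing({1: 2, 3: 5}, k=4))
```

ServerFilling avoids this search entirely: under its divisibility assumptions, sorting the candidate set by server requirement suffices.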
While MaxWeight is throughput optimal, it is very computationally intensive to implement, requiring the scheduler to solve an NP-hard optimization problem whenever a job arrives or departs. For comparison, ServerFilling is also throughput-optimal given our assumptions on the server requirements V, but it is far simpler computationally, requiring approximately linear time as a function of k. Moreover, no bounds on mean response time are known for MaxWeight, due in part to its high complexity.

4.5.3 Nonpreemptive Scheduling

In certain practical settings, such as supercomputing, a nonpreemptive service policy is preferred. In such settings, a backfilling policy such as EASY backfilling or conservative backfilling is often used [12–14]. These policies start by serving jobs in FCFS order, until a job is reached that requires more servers than remain. At this point, jobs further back in the queue that require fewer servers are scheduled, but only if they will not delay older jobs, based on user-provided service time upper bounds. While these policies are popular in practice, little is known about them theoretically, including their response time characteristics.

Finding any nonpreemptive throughput-optimal policy is a challenging problem. Several such policies have been designed [43, 44, 46], typically by slowly shifting between different server configurations to alleviate overhead. Because such policies can have very large renewal times, many jobs can back up while the system is in a low-efficiency configuration, which can empirically lead to very high mean response times. However, no theoretical mean response time analysis exists for any policy in the Multiserver-Job setting.
As a result, there is no good baseline policy against which to compare novel policies. Our bounds on the mean response time of ServerFilling can serve as such a baseline, albeit in the more permissive setting of preemptive scheduling.

5 Theorems and Proofs

We perform a heavy traffic analysis within our WCFS framework, assuming finite $\mathrm{rem}_{\sup}(S, C)$. Specifically, we prove that the scaled mean response time of any WCFS model converges to the same constant as an M/G/1/FCFS:

Theorem 1 (Heavy Traffic response time). For any model π ∈ WCFS, if $\mathrm{rem}_{\sup}(S, C)$ is finite,

$$\lim_{\rho \to 1} E[T^\pi](1-\rho) = \frac{E[S^2]}{2E[S]}.$$

To prove Theorem 1, we prove a stronger theorem, tightly and explicitly bounding $E[T^\pi]$ up to an additive constant, for any π ∈ WCFS.

Theorem 2 (Explicit response time bounds). For any model π ∈ WCFS, if $\mathrm{rem}_{\sup}(S, C)$ is finite,

$$E[T^\pi] \le \frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]} + c^\pi_{upper},$$

$$E[T^\pi] \ge \frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]} + c^\pi_{lower},$$

for explicit constants $c^\pi_{upper}$ and $c^\pi_{lower}$ not dependent on load ρ.

Proof deferred to Section 5.1. □

From Theorem 2, Theorem 1 follows via a simple rearrangement:

$$\frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]} = \frac{1}{1-\rho}\frac{E[S^2]}{2E[S]} - \frac{E[S^2]}{2E[S]}.$$

Theorem 2 also implies rapid convergence of scaled mean response time to its limiting constant for any WCFS policy:

Corollary 1. For any model π ∈ WCFS, if $\mathrm{rem}_{\sup}(S, C)$ is finite,

$$E[T^\pi](1-\rho) = \frac{E[S^2]}{2E[S]} + O(1-\rho).$$

5.1 Outline of Proof of Theorem 2

We will prove Theorem 2 with

$$c^\pi_{upper} = (n-1)\,\mathrm{rem}_{\sup}(S, C) + \frac{nE[S]}{b_{\inf}}, \qquad c^\pi_{lower} = -(n-1)\,\mathrm{rem}_{\sup}(S, C) + E[S],$$

where n denotes the size of the front, and where $b_{\inf}$ is defined in Section 2.1.3. Our goal is simply to prove the bounds in Theorem 2 for some constants $c^\pi_{upper}, c^\pi_{lower}$ independent of ρ; we have made no effort to optimize these constants, leaving that to future work.
Specifically, for three of our four motivating models, the $nE[S]/b_{\inf}$ term scales as O(n²). For these models this term is unnecessarily loose, and could easily be lowered to an O(n) bound by using a more detailed view.

Our approach is to split response time T into two pieces, queueing time $T_Q$ and front time $T_F$, and bound the expectation of each separately. We first bound $E[T_Q]$, which forms the bulk of our proof. The two key ideas come from the intuition that a WCFS model behaves like a FCFS M/G/1 system. In Lemma 2, we prove that $E[T_Q] = E[W] + c$, for some constant c; in a WCFS model, jobs progress through the system in essentially FCFS order, and as ρ → 1, work is completed essentially at rate 1. In Lemma 3, we prove that $E[W] = E[W^{M/G/1}] + c$, for some constant c. The key idea here is that in a WCFS model, if W is large, work arrives and completes in exactly the same way as in an M/G/1. Likewise, if the front is not full, then W cannot be large. In Lemma 4, we combine Lemmas 2 and 3 to prove that $E[T_Q] = E[T_Q^{M/G/1}] + c$ for some constant c.

In Lemma 5, we prove that work W is indeed stationary with finite mean. This is a technical lemma that rules out pathological scenarios, which is necessary because our WCFS class of models is very general. Lemma 5 is used by both Lemmas 2 and 3. Finally, in Lemma 6, we bound $E[T_F]$, utilizing Little's law. Combining Lemmas 4 and 6 proves Theorem 2.

5.2 Two Views

At several steps in our proof of Theorem 2, we will make use of two different views of the queueing system, corresponding to two different state descriptors:

Omniscient view: In the omniscient view, the state descriptor consists of the remaining size and class of all jobs in the system; we sample jobs' sizes and classes when the jobs enter the system. For a given system state, work is a deterministic quantity.

Limited view: In the limited view, the state descriptor consists of the age and class of the jobs in the front, and the number of jobs in the queue.
We sample jobs' classes when they enter the front, and determine whether jobs complete according to the hazard rate of the job size distribution, as the job ages. For a given system state, work is a random variable.

We will make it clear which view of the system we are using in each step of the proof. Generally, the omniscient view is useful when analyzing total work in the system, and the limited view is useful when analyzing work at the front.

5.3 Lemma 2: E[T_Q] and E[W]

First, we prove that mean queueing time and mean work are similar:

Lemma 2 (Queueing time and work). For any model π ∈ WCFS, if $\mathrm{rem}_{\sup}(S, C)$ is finite,

$$E[W] - (n-1)\,\mathrm{rem}_{\sup}(S, C) \le E[T_Q] \le E[W].$$

Proof Start by writing time in queue $T_Q$ in terms of work in system. Let us consider the omniscient view of the system, so work W is a deterministic quantity given the system state. Consider an arbitrary tagged job j. When j arrives, let $W^A(j)$ be the amount of work j sees in the system. Let $W_F^F(j)$ be the amount of work j sees in the front, other than j itself, when j leaves the queue and enters the front. In $W_F^F$, the subscript F indicates that we are looking at the amount of work at the front, and the superscript F indicates that we are looking at the moment when j enters the front.

Because the model is finite-skip, jobs move from the queue to the front in arrival order, so all of the $W^A(j)$ work that was in the system when j arrived is either complete or in the front when j enters the front. As a result, the amount of work which is completed while j is in the queue is exactly $W^A(j) - W_F^F(j)$. Note that if j enters the front upon arrival to the system, $W^A(j) = W_F^F(j)$, and no work is completed while j is in the queue.

While j is in the queue, the front must be full; the system must be maximally busy during this time, completing work at rate 1.
Job j is in the queue for $T_Q(j)$ time, so the system must complete $T_Q(j)$ work during that time. We can therefore conclude that $W^A(j) - W_F^F(j) = T_Q(j)$.

Because j is an arbitrary job, we can write $W_F^F(j)$ as $W_F^F$, a random variable over all jobs that pass through the system. Likewise, $T_Q(j)$ is simply $T_Q$. Because Poisson arrivals see time averages, $W^A(j) \sim W$, the time-stationary amount of work in the system. Combining these equivalencies, we find that

$$W - W_F^F = T_Q. \qquad (6)$$

Note that W is time-stationary, while $W_F^F$ and $T_Q$ are event-stationary. To rigorously demonstrate (6), we need to prove that the system converges to a stationary distribution, which we prove in Lemma 5.

To give bounds on $W_F^F$, we switch to the limited view of the system, where the state of the front consists of the classes and ages of the jobs at the front. We have two simple bounds on $W_F^F$. First, $W_F^F \ge 0$. Next, because $W_F^F(j)$ is the work of at most n − 1 jobs, namely the jobs at the front when a given job enters the front, we know that $E[W_F^F] \le (n-1)\,\mathrm{rem}_{\sup}(S, C)$.

Combining these bounds with (6), we can bound $E[T_Q]$ in terms of E[W]:

$$E[W] - (n-1)\,\mathrm{rem}_{\sup}(S, C) \le E[T_Q] \le E[W]. \qquad \square$$

5.4 Lemma 3: Bounding E[W]

Lemma 3 (Work bounds). For any model π ∈ WCFS, if $\mathrm{rem}_{\sup}(S, C)$ is finite,

$$\frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]} \le E[W] \le \frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]} + (n-1)\,\mathrm{rem}_{\sup}(S, C).$$

Proof Consider the stationary random variable W² in the omniscient view, so work is a deterministic quantity at a given time on a given sample path. W² evolves in two ways: continuous decrease as work is completed, and stochastic jumps as jobs arrive. Because W² is a stationary random variable, the expected rates of decrease and increase must be equal, due to the rate conservation law [49] with respect to W².
To calculate the expected rate of decrease, note that, ignoring moments when jobs arrive, $\frac{d}{dt}W(t) = -B(t)$ by definition, where B(t) is the total service rate of the system at time t. As a result, $\frac{d}{dt}W(t)^2 = -2W(t)B(t)$, ignoring arrival epochs. This expected rate of decrease is a well-defined random variable, because the system converges to stationarity. Thus the expected rate of decrease of W² is 2E[WB].

To calculate the expected rate of increase, let t⁻ be the time just before a job arrives to the system. When the job arrives, W² increases from $W(t^-)^2$ to $(W(t^-) + S)^2$, a change of $2W(t^-)S + S^2$. Note that $W(t^-)$ is distributed as W, by PASTA. Note also that W and S are independent, because S is sampled i.i.d. As a result, the expected increase per arrival is $2E[W]E[S] + E[S^2]$. Arrivals occur at rate λ. As a result, the expected rate of increase is $2\lambda E[W]E[S] + \lambda E[S^2]$.

To show that these rates are equal, we must show that the rates are finite. This follows from the fact that E[W] is finite, which we prove in Lemma 5. As a result, the rates of increase and decrease of W² are equal:

$$2E[WB] = 2\lambda E[W]E[S] + \lambda E[S^2]$$
$$E[WB] = \lambda E[W]E[S] + \frac{\lambda}{2} E[S^2]$$
$$E[WB] = \rho E[W] + \frac{\lambda}{2} E[S^2]$$
$$E[W] - E[W(1-B)] = \rho E[W] + \frac{\lambda}{2} E[S^2]$$
$$E[W](1-\rho) = E[W(1-B)] + \frac{\lambda}{2} E[S^2]$$
$$E[W] = \frac{E[W(1-B)]}{1-\rho} + \frac{\lambda E[S^2]}{2(1-\rho)} \qquad (7)$$

Now, we merely need to bound E[W(1−B)]. We do so by switching to the limited view. Note that

$$E[W(1-B)] = E[W(1-B)\mathbb{1}\{B = 1\}] + E[W(1-B)\mathbb{1}\{B < 1\}] = E[W(1-B)\mathbb{1}\{B < 1\}].$$

Because the model is work-conserving, if B < 1, the front is not full, and there are at most n − 1 jobs in the system.
Taking expectations over the future randomness of these jobs, at any time t for which B(t) < 1,

$$E[W(t)] \le (n-1)\,\mathrm{rem}_{\sup}(S, C).$$

Therefore,

$$E[W(1-B)\mathbb{1}\{B < 1\}] \le (n-1)\,\mathrm{rem}_{\sup}(S, C)\,E[(1-B)\mathbb{1}\{B < 1\}] = (n-1)\,\mathrm{rem}_{\sup}(S, C)\,E[1-B],$$

and hence

$$E[W(1-B)] \le (n-1)\,\mathrm{rem}_{\sup}(S, C)(1-\rho).$$

Substituting this into (7), our equation for E[W], we find that

$$E[W] \le \frac{\lambda E[S^2]}{2(1-\rho)} + (n-1)\,\mathrm{rem}_{\sup}(S, C).$$

Dropping the first term of (7), we also get a lower bound:

$$E[W] \ge \frac{\lambda E[S^2]}{2(1-\rho)}. \qquad \square$$

One might alternatively try to prove Lemma 3 via a coupling argument, by coupling the WCFS system to an M/G/1 with the same arrival process. Unfortunately, this proof strategy does not succeed, for a subtle reason. One can show that the difference in work between the two systems during an interval when the WCFS system has a full front is bounded by the amount of work in the WCFS system at the beginning of the interval. This is analogous to the many-jobs-interval argument used by Grosof et al. [50] to analyze relevant work in the M/G/k/SRPT. The key difference is that in the WCFS setting, we consider total work, not relevant work, meaning that job sizes are not bounded. As a result, while the expected work at the beginning of a full-front interval is bounded, the realization of that work may be arbitrarily large. A coupling argument would therefore need to bound the relative length of full-front intervals started by different amounts of work, to prove a time-average bound on the gap between E[W] and $E[W^{M/G/1}]$. This seems intractable, given the generality of WCFS policies.

Instead, by using a rate-conservation approach, formalized by Palm calculus, we directly connect the small expected amount of work in a WCFS system with a non-full front to a small expected difference in work between the two systems. We therefore prove Lemma 3 while avoiding all of the complications of a coupling-based argument.
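As a numerical sanity check on Lemma 3 (an illustrative sketch of our own, not the paper's simulator), the following event-driven simulation estimates E[W] in an M/M/4 with servers of speed 1/4, a WCFS model with n = k = 4. For exponential job sizes, the remaining size is memoryless, so $\mathrm{rem}_{\sup}(S, C) = E[S]$, and Lemma 3 predicts $\lambda E[S^2]/(2(1-\rho)) \le E[W] \le \lambda E[S^2]/(2(1-\rho)) + 3E[S]$:

```python
import random

def mgk_mean_work(lam, k, num_arrivals, seed=4):
    """Event-driven simulation of an M/M/k in the WCFS setup: k servers
    of speed 1/k, FCFS, the k oldest jobs in service. Job sizes ~ Exp(1).
    Returns the time-average total work in system, E[W]."""
    rng = random.Random(seed)
    INF = float("inf")
    jobs = []                      # remaining sizes, in arrival order
    t = area = 0.0
    arrivals = 0
    next_arr = rng.expovariate(lam)
    while True:
        served = jobs[:k]
        dt_complete = min((k * r for r in served), default=INF)
        dt_arrive = next_arr - t if arrivals < num_arrivals else INF
        dt = min(dt_complete, dt_arrive)
        if dt == INF:
            break                  # all arrivals processed, system empty
        dt = max(dt, 0.0)
        area += sum(jobs) * dt     # accumulate the integral of W(t)
        for i in range(len(served)):
            jobs[i] -= dt / k      # each served job progresses at rate 1/k
        t += dt
        if dt_arrive <= dt_complete and arrivals < num_arrivals:
            jobs.append(rng.expovariate(1.0))
            arrivals += 1
            next_arr = t + rng.expovariate(lam)
        else:
            jobs = [r for r in jobs if r > 1e-12]   # remove completed jobs
    return area / t

lam, k = 0.7, 4                    # rho = 0.7, E[S] = 1, E[S^2] = 2
est = mgk_mean_work(lam, k, 200_000)
lower = lam * 2 / (2 * (1 - 0.7))  # lam E[S^2] / (2(1 - rho)), about 2.33
upper = lower + (k - 1) * 1.0      # plus (n - 1) rem_sup = 3 E[S]
print(lower <= est <= upper)
```

The estimate lands well inside the Lemma 3 window, consistent with the bound being additively loose by at most (n − 1)E[S] here.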
5.5 Lemma 4: Bounding E[T_Q]

Now, we can bound $E[T_Q]$ by combining Lemmas 2 and 3:

Lemma 4 (Queueing time bounds). For any model π ∈ WCFS, if $\mathrm{rem}_{\sup}(S, C)$ is finite,

$$E[T_Q^\pi] \le \frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]} + (n-1)\,\mathrm{rem}_{\sup}(S, C),$$

$$E[T_Q^\pi] \ge \frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]} - (n-1)\,\mathrm{rem}_{\sup}(S, C).$$

5.6 Lemma 5: Finite E[W]

Lemma 5 (Finite mean work). For any model π ∈ WCFS, if $\mathrm{rem}_{\sup}(S, C)$ is finite, then for any load ρ < 1, W is a well-defined stationary random variable and E[W] is finite.

Proof Recall that $W = W_F + W_Q$; we first focus on $W_F$. There are at most n jobs in the front at any time. In the limited view, each job has expected remaining size at most $\mathrm{rem}_{\sup}(S, C)$, so $E[W_F] \le n\,\mathrm{rem}_{\sup}(S, C)$. As for the stationarity of the state of the front, this follows from two assumptions we made in Section 2.3. First, we assumed that the service policy depends only on the state of the front. Second, the front must empty and thereby undergo renewals, because the service rate B(t) is at least $b_{\inf}$ whenever the system is nonempty. As a result, $W_F$ is stationary.

We now turn to $W_Q$. To prove that $W_Q$ is stationary and well-defined with finite mean, we will apply the "inventory process" results of Sigman and Yao [51], and Scheller-Wolf [52]'s refinement of those results. We upper bound $W_Q$ by $\mathcal{W}$, which we will write as an inventory process:

$$\mathcal{W} := W \mathbb{1}\{W_Q > 0\}.$$

Here we will use the omniscient view, so $\mathcal{W}(t)$ is a specific value. By proving that $\mathcal{W}$ is stationary and well-defined with finite mean, we also show the same is true of $W_Q$. Because $W_Q = (\mathcal{W} - W_F)^+$, the stationarity of $\mathcal{W}$ also implies the stationarity of $W_Q$, given the stationarity of $W_F$.

To write $\mathcal{W}$ as an inventory process as in [51], we must define a process X(t) with stationary and ergodic increments, such that $\mathcal{W}(t) = X(t) + L(t)$, where

$$L(t) := \sup_{0 \le s \le t} \left(-\min\{0, X(s)\}\right).$$
Here X(t) represents the potential workload process, and L(t) corrects for the fact that the queue can empty. We will apply [52, Theorem 2.2.1], for the special case of the first moment. By Remarks 1 and 3, for the first moment of an inventory process, it suffices to show:

• Negative drift: There exists an amount of work w < ∞ and a drift rate δ > 0 such that, conditioned on $\mathcal{W}(t) \ge w$,

$$\lim_{\epsilon \to 0} \frac{E_{\mathcal{F}_t}[X(t+\epsilon) - X(t)]}{\epsilon} \le -\delta,$$

where $\mathcal{F}_t$ is the filtration defined by the behavior of the system up to time t.

• Finite second moment of positive jumps: There exists a constant $k_1 < \infty$ such that

$$\lim_{\epsilon \to 0} \frac{E_{\mathcal{F}_t}\left[\left((X(t+\epsilon) - X(t))^+\right)^2\right]}{\epsilon} \le k_1.$$

Now, we define the potential workload process X(t) based on W(t) and $W_Q(t)$. During intervals when $W_Q(t) = 0$, X(t) is constant. If $t_0$ is the beginning of an interval where $W_Q(t) > 0$, X(t) jumps up by $W(t_0^+)$ at time $t_0$. During an interval where $W_Q(t) > 0$, X(t) mimics W(t): X(t) rises by S when a job arrives, and decreases at rate 1. If $t_1$ is the end of an interval where $W_Q(t) > 0$, X(t) jumps down by $W(t_1^-)$ at time $t_1$.

By construction, X(t) generates $\mathcal{W}(t)$ as an inventory process. For example, let $t_1$ be the end of an interval where $W_Q(t) > 0$. Assume that the desired relationship between X(t) and $\mathcal{W}(t)$ holds up to time $t_1^-$. In particular, $\mathcal{W}(t_1^-) = W(t_1^-)$. Then $\mathcal{W}(t_1^+) = 0$, as desired.

Next, we show that X(t) has stationary and ergodic increments. X(t) has two types of increments. First, Poisson arrivals cause increments sampled i.i.d. from S, which are clearly stationary and ergodic. Second, the beginnings and ends of intervals where $W_Q(t) = 0$ cause increments equal to $W_F(t)$. These increments are stationary and ergodic because the state of the front, and $W_F$ in particular, are stationary. Thus, X(t) has stationary and ergodic increments.

To demonstrate negative drift, let w be an arbitrary nonzero amount of work.
Whenever $\mathcal{W}(t) \ge w$, X(t) has two types of increments: jumps of size S occurring at rate λ, and continuous decrease at rate 1. As a result, the drift of X(t) is ρ − 1 < 0.

To demonstrate finite second moment of positive jumps, note that X(t) has two kinds of positive jumps: jumps of size S, when $W_Q(t) > 0$, and jumps of size W(t), at the beginning of a $W_Q > 0$ interval. Switching back to the limited view, note that the latter kind of jump consists of the remaining sizes of at most n jobs. These remaining sizes are distributed as $R(a, c) \sim [S_c - a \mid S_c > a]$ for some age a and class c. It therefore suffices to show that there exists a constant r such that for all a, c, $E[R(a, c)^2] \le r < \infty$.

To do so, we will write $R(a, c)_e$, the excess of the remaining size distribution, as a mixture of remaining size distributions for different ages. Note that for any distribution Y, the excess $Y_e$ is equivalent to $Y_e \sim [Y - Y_e \mid Y > Y_e]$. This holds because the forward and backward renewal times are distributed identically [36, Chapter 23]. By applying this construction with Y = R(a, c), we find that

$$R(a, c)_e \sim [R(a, c) - R(a, c)_e \mid R(a, c) > R(a, c)_e] = [S_c - (a + R(a, c)_e) \mid S_c > a + R(a, c)_e].$$

As a result, $a + R(a, c)_e$ is the desired age distribution.

For any age a′, $E[R(a', c)] \le \mathrm{rem}_{\sup}(S, C)$. Because $R(a, c)_e$ can be written as a mixture of remaining size distributions, $E[R(a, c)_e] \le \mathrm{rem}_{\sup}(S, C)$, which is finite by assumption. We can now bound $E[R(a, c)^2]$:

$$E[R(a, c)_e] = \frac{E[R(a, c)^2]}{2E[R(a, c)]},$$

$$E[R(a, c)^2] = 2E[R(a, c)]\,E[R(a, c)_e] \le 2\,\mathrm{rem}_{\sup}(S, C)^2.$$

Thus, the requirements of [52, Theorem 2.2.1] are satisfied, so both $\mathcal{W}$ and $W_Q$ are stationary and well-defined, and have finite mean. □

5.7 Lemma 6: Bounding E[T_F]

Lemma 6 (Front time bounds).
For any model π ∈ WCFS,

$$E[S] \le E[T_F] \le \frac{nE[S]}{b_{\inf}}.$$

Proof First, to prove that $E[T_F] \ge E[S]$, note that even if a job receives service at the maximum possible rate of 1 for the entire time it is in the front, the job will complete in time S. As a result, $E[T_F] \ge E[S]$.

To prove the upper bound, recall that by the non-idling assumption from Section 2.1.3, in all states of the front s where $N_F(s) \ge 1$, the service rate $B(s) \ge b_{\inf}$. Because $N_F(s) \le n$, we can bound the ratio $B(s)/N_F(s)$ in all states with $N_F(s) \ge 1$:

$$\frac{B(s)}{N_F(s)} \ge \frac{b_{\inf}}{n}.$$

Therefore, in all states,

$$B(s) \ge \frac{b_{\inf}}{n} N_F(s).$$

In expectation, the same must hold:

$$E[B] \ge \frac{b_{\inf}}{n} E[N_F].$$

Note that E[B] = ρ and $E[N_F] = \lambda E[T_F]$ by Little's Law. Thus,

$$\rho \ge \frac{b_{\inf}}{n} \lambda E[T_F], \qquad E[T_F] \le \frac{n\rho}{\lambda b_{\inf}} = \frac{nE[S]}{b_{\inf}}. \qquad \square$$

Note that Lemma 6 proves a relatively weak bound on $E[T_F]$, because we have only made the weak assumption that $b_{\inf}$ is positive. In many models, one can prove a stronger bound on $E[T_F]$ by using more information about the model's dynamics when the front is not full.

From Lemma 4 and Lemma 6, Theorem 2 follows immediately, with explicit formulas for $c^\pi_{upper}$ and $c^\pi_{lower}$.

[Figure 5: plot of ∆π versus load ρ for the WCFS models M/G/4, M/G/1, Heterogeneous M/G/k, Limited Processor Sharing, Threshold Parallelism FCFS, and Multiserver-job ServerFilling.]

Fig. 5: ∆π for WCFS models. Job size distribution S is hyperexponential: Exp(2) w.p. 1/2, Exp(2/3) otherwise. 10⁹ arrivals simulated. ρ > 0.96 omitted due to the large amount of random noise under high load. Specific settings: Heterogeneous M/G/k with speeds [0.4, 0.3, 0.2, 0.1]. Limited Processor Sharing with Multi-Programming Level 4. Threshold Parallelism FCFS with joint random variable (S, L) of (Exp(2), 1) w.p. 1/2, (Exp(2/3), 4) otherwise. Multiserver-job ServerFilling with joint random variable (V, X) of (1, Exp(1/2)) w.p. 1/2, (4, Exp(2/3)) otherwise.
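A miniature version of the experiment behind Fig. 5's M/G/1 curve can be sketched as follows (our own hypothetical sketch, not the paper's simulation code). It uses the Lindley recursion with the caption's hyperexponential job size distribution to estimate the difference $\Delta^\pi = E[T^\pi] - \frac{\rho}{1-\rho}\frac{E[S^2]}{2E[S]}$ plotted in the figure; for the M/G/1, this difference should be E[S] = 1 at every load:

```python
import random

def delta_mg1(lam, num_jobs, seed=5):
    """Lindley-recursion estimate of Delta for the M/G/1/FCFS under the
    hyperexponential size distribution of Fig. 5: Exp(2) w.p. 1/2,
    Exp(2/3) otherwise, so E[S] = 1 and E[S^2] = 2.5."""
    rng = random.Random(seed)
    w = total_resp = 0.0
    for _ in range(num_jobs):
        s = rng.expovariate(2.0) if rng.random() < 0.5 else rng.expovariate(2 / 3)
        total_resp += w + s                        # FCFS response = wait + size
        w = max(0.0, w + s - rng.expovariate(lam)) # work seen by next arrival
    ES, ES2 = 1.0, 2.5
    rho = lam * ES
    return total_resp / num_jobs - (rho / (1 - rho)) * ES2 / (2 * ES)

# The estimate stays near E[S] = 1 regardless of load, as Theorem 2 predicts
for lam in (0.5, 0.8):
    print(round(delta_mg1(lam, 10**6), 1))
```

The same harness, with the queue discipline swapped out, underlies each curve of Fig. 5; only the M/G/1 case admits this one-line recursion.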
6 Empirical Comparison: WCFS and non-WCFS

We have proven tight bounds on mean response time for all WCFS policies. To quantify the tightness of our bounds, we define the mean response time difference ∆π for a given policy π:

∆π = E[T π] − (ρ/(1 − ρ)) · E[S²]/(2E[S]) = E[T π] − E[TQ^{M/G/1}].

For instance, ∆^{M/G/1} = E[S]. This definition is useful because we have shown in Theorem 2 that for any load ρ, ∆π ∈ [cπlower, cπupper], for constants cπlower, cπupper not dependent on ρ, but potentially depending on the model π.

To investigate the behavior of ∆π, we turn to simulation. We simulate both WCFS models, to confirm our results, and non-WCFS models, to show that non-WCFS models typically do not have constant ∆π in the ρ → 1 limit.

In Fig. 5, we simulate WCFS models: our four motivating models from Section 3, as well as the simpler M/G/k and M/G/1 models. In each case, we find that ∆π remains bounded quite close to 0, meaning that Theorem 2 holds with constants close to 0.

Fig. 6: ∆π for non-WCFS models: Threshold Parallelism Inelastic First, Threshold Parallelism Elastic First, M/G/4/SRPT, Multiserver-job FCFS, Multiserver-job MaxWeight, Multiserver-job Least Servers First, and Multiserver-job Most Servers First. Same job sizes and specific settings as in Fig. 5. Same number of arrivals and range of ρ, except MaxWeight: 10¹⁰ arrivals, ρ ∈ [0, 0.99].

In Fig. 5, we see that for some models, ∆π increases with ρ, while for others, ∆π decreases with ρ. Intuitively, this depends on which jobs tend to be prioritized as ρ → 1. Policies which serve many jobs at once, such as the M/G/4 and Limited Processor Sharing systems, typically have ∆π decrease as ρ → 1, because they allow small and large jobs to share service.
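The M/G/1 baseline subtracted in the definition of ∆π is just the Pollaczek–Khinchine mean queueing time. As a concrete sketch (function and variable names are ours, not from the paper), the moments of the Fig. 5 job size distribution, and hence the baseline E[TQ^{M/G/1}], can be computed exactly:

```python
from fractions import Fraction as F

def hyperexp_moments(branches):
    """First and second moments of a hyperexponential mixture.

    branches: list of (probability, rate) pairs; branch i is Exp(rate),
    with mean 1/rate and second moment 2/rate^2.
    """
    m1 = sum(p / r for p, r in branches)
    m2 = sum(2 * p / r**2 for p, r in branches)
    return m1, m2

def mg1_fcfs_queueing_time(rho, m1, m2):
    """Pollaczek-Khinchine mean queueing time: rho/(1-rho) * E[S^2]/(2 E[S])."""
    return rho / (1 - rho) * m2 / (2 * m1)

# Job size distribution of Fig. 5: Exp(2) w.p. 1/2, Exp(2/3) otherwise.
m1, m2 = hyperexp_moments([(F(1, 2), F(2)), (F(1, 2), F(2, 3))])
print(m1, m2)  # prints: 1 5/2
print(mg1_fcfs_queueing_time(F(4, 5), m1, m2))  # at rho = 0.8: prints 5
```

For this distribution E[S] = 1 and E[S²] = 5/2, so, for example, the baseline queueing time at ρ = 0.8 is 5.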
As a result, small jobs can complete faster than in an M/G/1, lowering ∆π if ρ is large enough that many jobs are typically in the system. In contrast, policies which reorder large jobs ahead of small jobs typically have ∆π increase as ρ → 1, by the same principle. For example, Multiserver-job ServerFilling prioritizes jobs in the front which require 4 servers. In the setting depicted in Fig. 5, such jobs have mean size 3/2, compared to the overall mean size E[S] = 1.

In all of the settings simulated in Fig. 5, ∆π > 0. This is merely a coincidence, not a general rule, as can be seen in Fig. 7b. Regardless of the different reordering behavior of these WCFS policies, ∆π does not diverge as ρ → 1, as predicted by Theorem 2.

In contrast, in Fig. 6, we simulate several non-WCFS models, which we depicted earlier in Fig. 2. These models are:

• Threshold Parallelism Inelastic First: This is the Threshold Parallelism model from Section 3.3, but rather than serving jobs in FCFS order, we prioritize jobs j with smaller parallelism threshold pj [37].
• Threshold Parallelism Elastic First: This is the Threshold Parallelism model from Section 3.3, but we prioritize jobs j with larger parallelism threshold pj.
• M/G/k/SRPT: This is an M/G/k, where each of the k servers runs at speed 1/k, and we prioritize jobs of least remaining size.
• Multiserver-job FCFS: This is the Multiserver-job model from Section 3.4, but we serve jobs in FCFS order. If the next job to be served doesn't "fit" in the remaining servers, those servers remain idle until other jobs complete, freeing enough servers for the job to fit.
• Multiserver-job Least Servers First: This is the Multiserver-job model from Section 3.4, but we prioritize jobs j with smaller server requirements vj. Again, if the next job doesn't fit, the remaining servers remain idle until the job can fit.
• Multiserver-job Most Servers First: This is the Multiserver-job model from Section 3.4, but we prioritize jobs j with larger server requirements vj.
• Multiserver-job MaxWeight: This is the Multiserver-job model from Section 3.4, but we serve jobs according to the "MaxWeight" policy, which we describe in Section 4.5.2.

In all cases, prioritization is preemptive. Our empirical results in Fig. 6 indicate that for these non-WCFS policies, ∆π diverges as ρ → 1. Specifically, for Threshold Parallelism Elastic First, Multiserver-job FCFS, Multiserver-job Least Servers First, and Multiserver-job Most Servers First, ∆π appears to diverge in the positive direction. For Threshold Parallelism Inelastic First and M/G/k/SRPT, ∆π appears to diverge in the negative direction. Note the expanded scale of Fig. 6 as compared to Fig. 5. For Multiserver-job MaxWeight, we performed additional simulation, which indicated that ∆π diverged in the negative direction as ρ → 1.

Next, we explore the behavior of ∆π for WCFS models, as we vary the front size n and the job size distribution S. First, in Fig. 7a, we investigate the effects of varying front size n on ∆π for the Multiserver-job model with our ServerFilling policy; under this model, the front size n is equal to the number of servers k. In this setting, the difference ∆π empirically grows approximately linearly with the number of servers k, and is nearly constant as ρ → 1. This matches the behavior of our bounds proven in Theorem 2, which expand linearly with n. Our simulations indicate that other WCFS policies similarly experience linear relationships between n and ∆π.

In Fig. 7b, we investigate the effects of varying the job size distribution S on ∆π in the Heterogeneous M/G/k, where the job size distribution S is parameterized by a real value x. Each S is a hyperexponential distribution with E[S] = 1.
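The caption of Fig. 7b gives this family explicitly: S is Exp(1/x) with probability 1/(2x), and Exp((2x − 1)/x) otherwise. As a quick sketch (helper name ours, not from the paper), one can verify E[S] = 1 and the C² values quoted there:

```python
from fractions import Fraction as F

def family_moments(x):
    """E[S] and E[S^2] for the Fig. 7b family: S ~ Exp(1/x) with
    probability 1/(2x), else Exp((2x - 1)/x). Exp(r) has mean 1/r
    and second moment 2/r^2."""
    p = F(1, 2 * x)
    r1, r2 = F(1, x), F(2 * x - 1, x)
    m1 = p / r1 + (1 - p) / r2
    m2 = 2 * p / r1**2 + 2 * (1 - p) / r2**2
    return m1, m2

for x in [1, 2, 4, 8]:
    m1, m2 = family_moments(x)
    c2 = m2 / m1**2 - 1  # squared coefficient of variation C^2
    print(x, m1, float(c2))
```

This reproduces E[S] = 1 for every x, and C² ≈ 1, 1.67, 3.57, 7.53 for x = 1, 2, 4, 8, matching the values in the caption of Fig. 7b.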
At large ages a, the remaining size distributions [S − a | S > a] of these job size distributions converge to Exp(1/x), the exponential branch with larger mean. From this, it is straightforward to show that remsup(S) = x.

Fig. 7: ∆π under WCFS models with varying conditions. Up to 10⁹ arrivals simulated. (a) Varying front size n: Multiserver-job ServerFilling with k ∈ {2, 4, 8, 16}. S distributed Exp(1). Server requirement V distributed uniformly over all integer powers of 2 up to k. (b) Varying job size distributions: Heterogeneous M/G/4 with speeds [0.4, 0.3, 0.2, 0.1]. S distributed hyperexponentially: Exp(1/x) with probability 1/(2x), else Exp((2x − 1)/x), for x ∈ {1, 2, 4, 8}. E[S] = 1, C² ≈ [1, 1.67, 3.57, 7.53].

In Fig. 7b, we see that as x increases, ∆π at loads near 1 falls roughly linearly in x. However, for each specific x, it does not appear that ∆π diverges to positive or negative infinity. For instance, consider the curve for x = 8: as ρ → 1, ∆π converges to a value near −3, rather than diverging. Broadly, Fig. 7b matches the behavior of our bounds proven in Theorem 2, which expand linearly with remsup(S), which here is x. We have empirically found that other WCFS policies similarly experience linear relations between remsup(S) and ∆π for hyperexponential job size distributions S, and we believe that similar behavior will occur for other common job size distributions.

7 Conclusion

We introduce the work-conserving finite-skip (WCFS) framework, and use it to analyze many important queueing models which have eluded analysis thus far. We prove that the scaled mean response time E[T π](1 − ρ) of any WCFS model π converges in heavy traffic to the same limit as M/G/1/FCFS.
Moreover, we prove that the additive gap ∆π = E[T π] − E[TQ^{M/G/1}] remains bounded by explicit constants at all loads ρ, proving rapid convergence to the heavy traffic limit.

A possible direction for future work would be to tighten the explicit constants on ∆π. Doing so will likely require use of more detailed properties of the WCFS models being analyzed, but seems quite doable.

This paper considers models which are finite-skip and work-conserving relative to the FCFS service ordering. Another interesting direction would be to investigate policies which are "finite-skip" relative to other base service orderings. Hopefully, one could prove bounds on the mean response time of models in this new class relative to an M/G/1 operating under the base service ordering.

Finally, one could try to characterize other metrics of response time for WCFS policies, such as tail metrics. One possible approach would be to generalize the rate-conservation technique used in Lemma 3.

References

[1] Nathuji, R., Isci, C., Gorbatov, E.: Exploiting platform heterogeneity for power efficient data centers. In: Fourth International Conference on Autonomic Computing (ICAC'07), pp. 5–5 (2007)
[2] Mars, J., Tang, L., Hundt, R.: Heterogeneity in "homogeneous" warehouse-scale computers: A performance opportunity. IEEE Computer Architecture Letters 10(2), 29–32 (2011)
[3] Cho, H.-D., Engineer, P.D.P., Chung, K., Kim, T.: Benefits of the big.LITTLE architecture. EETimes, Feb (2012)
[4] Yashkov, S., Yashkova, A.: Processor sharing: A survey of the mathematical theory. Automation and Remote Control 68(9), 1662–1731 (2007)
[5] Nuyens, M., Van Der Weij, W.: Monotonicity in the limited processor sharing queue. resource 4, 7 (2008)
[6] Telek, M., Van Houdt, B.: Response time distribution of a class of limited processor sharing queues. SIGMETRICS Perform. Eval. Rev.
45(3), 143–155 (2018)
[7] Zhang, J., Zwart, B.: Steady state approximations of limited processor sharing queues in heavy traffic. Queueing Systems 60(3), 227–246 (2008)
[8] Gupta, V., Harchol-Balter, M.: Self-adaptive admission control policies for resource-sharing systems. SIGMETRICS Perform. Eval. Rev. 37(1), 311–322 (2009)
[9] Delimitrou, C., Kozyrakis, C.: Quasar: Resource-efficient and QoS-aware cluster management. In: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS '14, pp. 127–144 (2014)
[10] Peng, Y., Bao, Y., Chen, Y., Wu, C., Guo, C.: Optimus: An efficient dynamic resource scheduler for deep learning clusters. In: Proceedings of the Thirteenth EuroSys Conference. EuroSys '18 (2018)
[11] Maguluri, S.T., Srikant, R., Ying, L.: Stochastic models of load balancing and scheduling in cloud computing clusters. In: 2012 Proceedings IEEE Infocom, pp. 702–710. IEEE, Orlando, FL, USA (2012)
[12] Feitelson, D.G., Rudolph, L., Schwiegelshohn, U.: Parallel job scheduling—a status report. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp. 1–16. Springer, New York, NY, USA (2004)
[13] Srinivasan, S., Kettimuthu, R., Subramani, V., Sadayappan, P.: Characterization of backfilling strategies for parallel job scheduling. In: Proceedings. International Conference on Parallel Processing Workshop, pp. 514–519 (2002)
[14] Carastan-Santos, D., De Camargo, R.Y., Trystram, D., Zrigui, S.: One can only gain by replacing easy backfilling: A simple scheduling policies case study. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 1–10 (2019)
[15] Tirmazi, M., Barker, A., Deng, N., Haque, M.E., Qin, Z.G., Hand, S., Harchol-Balter, M., Wilkes, J.: Borg: The next generation. In: Proceedings of the Fifteenth European Conference on Computer Systems.
EuroSys '20 (2020)
[16] Grosof, I., Harchol-Balter, M., Scheller-Wolf, A.: Stability for two-class multiserver-job systems. arXiv preprint arXiv:2010.00631 (2020)
[17] Loulou, R.: Multi-channel queues in heavy traffic. Journal of Applied Probability 10(4), 769–777 (1973)
[18] Köllerström, J.: Heavy traffic theory for queues with several servers. I. Journal of Applied Probability 11(3), 544–552 (1974)
[19] Köllerström, J.: Heavy traffic theory for queues with several servers. II. Journal of Applied Probability 16(2), 393–401 (1979)
[20] Kingman, J.: Some inequalities for the queue GI/G/1. Biometrika 49(3/4), 315–324 (1962)
[21] Gamarnik, D., Momčilović, P.: Steady-state analysis of a multiserver queue in the Halfin–Whitt regime. Advances in Applied Probability 40(2), 548–577 (2008)
[22] Aghajani, R., Ramanan, K.: The limit of stationary distributions of many-server queues in the Halfin–Whitt regime. Mathematics of Operations Research 45(3), 1016–1055 (2020)
[23] Dai, J., Dieker, A., Gao, X.: Validity of heavy-traffic steady-state approximations in many-server queues with abandonment. Queueing Systems 78(1), 1–29 (2014)
[24] Goldberg, D.A., Li, Y.: Simple and explicit bounds for multi-server queues with universal 1/(1-rho) scaling. arXiv preprint arXiv:1706.04628 (2017)
[25] Efrosinin, D.V., Rykov, V.V.: On performance characteristics for queueing systems with heterogeneous servers. Automation and Remote Control 69(1), 61–75 (2008)
[26] Alves, F., Yehia, H., Pedrosa, L., Cruz, F., Kerbache, L.: Upper bounds on performance measures of heterogeneous M/M/c queues. Mathematical Problems in Engineering 2011 (2011)
[27] Efrosinin, D., Stepanova, N., Sztrik, J., Plank, A.: Approximations in performance analysis of a controllable queueing system with heterogeneous servers.
Mathematics 8(10) (2020)
[28] Lin, W., Kumar, P.: Optimal control of a queueing system with two heterogeneous servers. IEEE Transactions on Automatic Control 29(8), 696–703 (1984)
[29] Van Harten, A., Sleptchenko, A.: On Markovian multi-class, multi-server queueing. Queueing Systems 43(4), 307–328 (2003)
[30] Boxma, O.J., Deng, Q., Zwart, A.P.: Waiting-time asymptotics for the M/G/2 queue with heterogeneous servers. Queueing Systems 40(1), 5–31 (2002)
[31] Keaogile, T., Fatai Adewole, A., Ramasamy, S.: Geo(λ)/Geo(µ)+G/2 queues with heterogeneous servers operating under FCFS queue discipline. Am. J. Appl. Math. Stat. 3(2), 54–58 (2015)
[32] Sani, S., Daman, O.A.: The M/G/2 queue with heterogeneous servers under a controlled service discipline: Stationary performance analysis. IAENG International Journal of Applied Mathematics 45(1) (2015)
[33] Ramasamy, S., Daman, O.A., Sani, S.: An M/G/2 queue where customers are served subject to a minimum violation of FCFS queue discipline. European Journal of Operational Research 240(1), 140–146 (2015)
[34] Zhang, J., Dai, J.G., Zwart, B.: Law of large number limits of limited processor-sharing queues. Mathematics of Operations Research 34(4), 937–970 (2009)
[35] Zhang, J., Dai, J.G., Zwart, B.: Diffusion limits of limited processor sharing queues. The Annals of Applied Probability 21(2), 745–799 (2011)
[36] Harchol-Balter, M.: Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, Cambridge, England (2013)
[37] Berg, B., Dorsman, J.-P., Harchol-Balter, M.: Towards optimality in parallel scheduling. Proc. ACM Meas. Anal. Comput. Syst. 1(2) (2017)
[38] Berg, B., Harchol-Balter, M.: Optimal scheduling of parallel jobs with unknown service requirements. In: Handbook of Research on Methodologies and Applications of Supercomputing, pp. 18–40.
IGI Global, Hershey, PA, USA (2021)
[39] Berg, B., Harchol-Balter, M., Moseley, B., Wang, W., Whitehouse, J.: Optimal resource allocation for elastic and inelastic jobs. In: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures. SPAA '20, pp. 75–87 (2020)
[40] Brill, P.H., Green, L.: Queues in which customers receive simultaneous service from a random number of servers: A system point approach. Management Science 30(1), 51–68 (1984)
[41] Rumyantsev, A., Morozov, E.: Stability criterion of a multiserver model with simultaneous service. Annals of Operations Research 252(1), 29–39 (2017)
[42] Hong, Y., Wang, W.: Sharp zero-queueing bounds for multi-server jobs (2021)
[43] Ghaderi, J.: Randomized algorithms for scheduling VMs in the cloud. In: IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pp. 1–9 (2016)
[44] Psychas, K., Ghaderi, J.: Randomized algorithms for scheduling multi-resource jobs in the cloud. IEEE/ACM Transactions on Networking 26(5), 2202–2215 (2018)
[45] Psychas, K., Ghaderi, J.: On non-preemptive VM scheduling in the cloud. Proceedings of the ACM on Measurement and Analysis of Computing Systems 1(2), 35:1–35:29 (2017)
[46] Maguluri, S.T., Srikant, R.: Scheduling jobs with unknown duration in clouds. IEEE/ACM Transactions on Networking 22(6), 1938–1951 (2014)
[47] Baccelli, F., Foss, S.: On the saturation rule for the stability of queues. Journal of Applied Probability 32(2), 494–507 (1995)
[48] Foss, S., Konstantopoulos, T.: An overview of some stochastic stability methods. Journal of the Operations Research Society of Japan 47(4), 275–303 (2004)
[49] Miyazawa, M.: Rate conservation laws: a survey. Queueing Systems 15(1), 1–58 (1994)
[50] Grosof, I., Scully, Z., Harchol-Balter, M.: SRPT for multiserver systems.
Performance Evaluation 127-128, 154–175 (2018)
[51] Sigman, K., Yao, D.D.: Finite moments for inventory processes. The Annals of Applied Probability, 765–778 (1994)
[52] Scheller-Wolf, A.: Finite moment conditions for stationary content processes with applications to fluid models and queues. PhD thesis, Columbia University (1996)

Appendix A DivisorFilling

The DivisorFilling policy is a Multiserver-job service policy which assumes that all server requirements vj divide the total number of servers k. The DivisorFilling policy is a WCFS policy with front size n = k, as we will show. The finite-skip property is straightforward; the main difficulty is showing work conservation.

We first define the DivisorFilling policy. DivisorFilling is a preemptive policy, in that when a job completes, the set of jobs in service may change, removing partially-complete jobs from service. The DivisorFilling policy is defined recursively: the policy's behavior for larger k is defined based on its behavior for smaller k. In particular, we will prove work conservation inductively. Let M be the set of jobs at the front. To define DivisorFilling, we split into three cases:

• M contains at least k/6 jobs with server requirement vj = 1.
• k = 2^a 3^b for some integers a, b, and M contains < k/6 jobs with vj = 1.
• k has a prime factor p ≥ 5 and M contains < k/6 jobs with vj = 1.

A.1 At least k/6 jobs requiring 1 server

First, assume that M contains at least k/6 jobs requiring 1 server. Just as in the ServerFilling policy, label the jobs f1, f2, . . . in decreasing order of server requirement. Let i∗ be the length of the longest prefix of these jobs that fits in the k servers:

i∗ = max{i : vf1 + vf2 + · · · + vfi ≤ k}.

In this case, the DivisorFilling policy serves jobs f1, . . . , fi∗, as well as any jobs requiring 1 server that fit in the remaining servers. Specifically, writing sumi∗ := vf1 + · · · + vfi∗, DivisorFilling serves k − sumi∗ additional jobs that require 1 server, or all jobs requiring 1 server if fewer are available.

A.1.1 Work conservation

We want to show that if M contains k jobs, DivisorFilling serves jobs requiring a total of k servers in this case. Because we have at least k/6 jobs requiring 1 server, it suffices to show that sumi∗ ≥ 5k/6; the remaining servers are then filled by the jobs requiring 1 server.

First, note that sumk ≥ k, because there are k jobs, each requiring at least 1 server. So if sumi∗ < k, then i∗ < k and job fi∗+1 exists. Next, note that k − sumi∗ < vfi∗+1, because the (i∗ + 1)st job does not fit in service. If vfi∗+1 ≤ k/6, then k − sumi∗ < k/6 and we are done. Otherwise, because the labels f1, f2, . . . are in decreasing order of server requirement, all of vf1, . . . , vfi∗+1 exceed k/6. Because each requirement divides k, each must equal one of k, k/2, k/3, k/4, k/5. We enumerate all such sequences. Note that if k is not divisible by all of {2, 3, 4, 5}, some entries will not apply; this only tightens the resulting bound on k − sumi∗ for such k. We list i∗ requirements if sumi∗ = k, and i∗ + 1 otherwise. We write gi∗ as a shorthand for k − sumi∗.

Sequence → gi∗:
k → 0
k/2, k/2 → 0
k/2, k/3, k/3 → k/6
k/2, k/4, k/4 → 0
k/2, k/4, k/5, k/5 → k/20
k/2, k/5, k/5, k/5 → k/10
k/3, k/3, k/3 → 0
k/3, k/3, k/4, k/4 → k/12
k/3, k/3, k/5, k/5 → 2k/15
k/3, k/4, k/4, k/4 → k/6
k/3, k/4, k/5, k/5, k/5 → k/60
k/3, k/5, k/5, k/5, k/5 → k/15
k/4, k/4, k/4, k/4 → 0
k/4, k/4, k/4, k/5, k/5 → k/20
k/4, k/4, k/5, k/5, k/5 → k/10
k/4, k/5, k/5, k/5, k/5 → 3k/20
k/5, k/5, k/5, k/5, k/5 → 0

In all cases, k − sumi∗ ≤ k/6. As a result, DivisorFilling is work-conserving in this case.

A.2 k = 2^a 3^b

Suppose that k is of the form 2^a 3^b, for some integers a and b, and that the number of jobs in M that require 1 server is less than k/6.
Let M2 be the set of jobs in M requiring an even number of servers, and let Mr be the remaining jobs that require more than 1 server:

M2 := {j | j ∈ M, vj is even}
Mr := {j | j ∈ M, vj is odd, vj > 1}

Note that because 2 and 3 are the only prime factors of k, all jobs in Mr have server requirements divisible by 3. How we now schedule is based on which is larger: 2|M2| or 3|Mr|. In the case of a tie, either would be fine, so we arbitrarily select M2.

If 2|M2| is larger, we will only serve jobs from among M2. To do so, imagine that we combine pairs of servers, reducing k by a factor of 2, and reducing the server requirement of every job in M2 by a factor of 2. We now compute which jobs from M2 DivisorFilling would serve in this simplified subproblem; DivisorFilling serves the corresponding jobs. If 3|Mr| is larger, we do the same, except that we combine triples of servers.

A.2.1 Work conservation

If at least k jobs are present, we will show that this process fills all of the servers. Because there are < k/6 jobs requiring 1 server, |M2| + |Mr| ≥ 5k/6. As a result, either 2|M2| ≥ k or 3|Mr| ≥ k: if both failed, we would have |M2| + |Mr| < k/2 + k/3 = 5k/6, a contradiction. Consider the case where 2|M2| ≥ k. The constructed subproblem has k/2 servers and |M2| ≥ k/2 jobs, so by induction DivisorFilling fills all of the servers in the subproblem. That property carries over to the main problem. The case where 3|Mr| ≥ k is analogous.

A.3 k has a prime factor p ≥ 5

Finally, suppose that k has a prime factor p ≥ 5, and that M contains < k/6 jobs requiring 1 server. Specifically, let p be k's largest prime factor. Let us form the set Mp, consisting of the jobs in M whose server requirements are multiples of p, and Mr, consisting of jobs which require more than 1 server, but not a multiple of p. As in Section A.2, if |Mp| ≥ k/p, we can recurse by combining groups of p servers to fill all k servers. Otherwise, we turn to Mr.
Note that all jobs in Mr have server requirements which are divisors of k/p, because their requirements are divisors of k which are not multiples of p. If |Mr| ≥ k/p, let us apply the DivisorFilling policy to an arbitrary subset of Mr of size k/p. By induction, DivisorFilling finds a subset of these jobs requiring exactly k/p servers. Let us extract this subset from Mr, creating Mr(1). We repeat this process until we have extracted p subsets, or until |Mr(i)| < k/p for some i. DivisorFilling serves the extracted subsets.

A.3.1 Work conservation

We must show that the extraction procedure always successfully extracts p subsets, if |M| = k. In the extraction case, note that |Mp| < k/p ≤ k/5, and that there are ≤ k/6 jobs requiring 1 server. Mr consists of the remaining jobs, so |Mr| ≥ k − k/6 − k/5 = 19k/30. Note also that every job in Mr requires at least 2 servers, so at most k/(2p) jobs are extracted at each step. To prove that p subsets can be extracted, we must show that at least k/p jobs remain after p − 1 subsets have been extracted:

|Mr(p−1)| ≥ 19k/30 − (p − 1) · k/(2p) = 19k/30 − k/2 + k/(2p) = 2k/15 + k/(2p).

To prove that |Mr(p−1)| ≥ k/p, we just need to show that 2k/15 ≥ k/(2p). But p ≥ 5, so 2k/15 > k/10 ≥ k/(2p). Thus, we can always extract p disjoint subsets of jobs, each requiring a total of k/p servers, from Mr. Combining these subsets fills all k servers, as desired.
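To close, the prefix-greedy selection of Section A.1 and the case enumeration of Section A.1.1 are easy to check mechanically. The sketch below (function and variable names are ours, not from the paper) verifies, for k = 60 (divisible by 2, 3, 4, 5), that the gap k − sumi∗ never exceeds k/6 over all overflowing sequences of requirements drawn from {k, k/2, k/3, k/4, k/5}:

```python
from itertools import combinations_with_replacement

def serve_case_a1(reqs, k):
    """Prefix-greedy selection from Appendix A.1 (sketch; names ours).

    reqs: server requirements at the front, with at least k/6 entries
    equal to 1 and every entry dividing k. Serves the longest prefix of
    the jobs, sorted by decreasing requirement, that fits in k servers,
    then fills the leftover servers with 1-server jobs.
    """
    big = sorted((v for v in reqs if v > 1), reverse=True)
    ones = reqs.count(1)
    total = 0
    for v in big:
        if total + v > k:
            break
        total += v
    return total + min(k - total, ones)

# Verify the enumeration of Section A.1.1 for k = 60: for any multiset
# of requirements drawn from {k, k/2, k/3, k/4, k/5} whose sum exceeds
# k, the prefix-greedy gap k - sum_{i*} is at most k/6 = 10.
k = 60
parts = [60, 30, 20, 15, 12]
worst = 0
for size in range(1, 7):
    for seq in combinations_with_replacement(parts, size):
        if sum(seq) <= k:
            continue  # everything fits; no next job left out
        total = 0
        for v in sorted(seq, reverse=True):
            if total + v > k:
                break
            total += v
        worst = max(worst, k - total)
assert worst <= k // 6
print(worst)  # prints 10: the bound k/6 is attained
```

The worst gap of exactly k/6 = 10 is attained, for example, by the sequence (k/2, k/3, k/3), matching the table in Section A.1.1.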