
    Mor Harchol-Balter

    We consider optimal job scheduling where each job consists of multiple tasks, each of unknown duration, with precedence constraints between tasks. A job is not considered complete until all of its tasks are complete. Traditional heuristics, such as favoring the job of shortest expected remaining processing time, are suboptimal in this setting. Furthermore, even if we know which job to run, it is not obvious which task within that job to serve. In this paper, we characterize the optimal policy for a class of such scheduling problems and show that the policy is simple to compute.
    Modern data centers serve workloads which can exploit parallelism. When a job parallelizes across multiple servers it completes more quickly. However, it is unclear how to share a limited number of servers between many parallelizable jobs. In this paper we consider a typical scenario where a data center composed of N servers will be tasked with completing a set of M parallelizable jobs. Typically, M is much smaller than N. In our scenario, each job consists of some amount of inherent work which we refer to as a job's size. We assume that job sizes are known up front to the system, and each job can utilize any number of servers at any moment in time. These assumptions are reasonable for many parallelizable workloads such as training neural networks using TensorFlow [2]. Our goal in this paper is to allocate servers to jobs so as to minimize the mean slowdown across all jobs, where the slowdown of a job is the job's completion time divided by its running time if given exclusive access to all N servers. Slowdown measures how a job was interfered with by other jobs in the system, and is often the metric of interest in the theoretical parallel scheduling literature (where it is also called stretch), as well as the HPC community (where it is called expansion factor).
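The slowdown metric defined above is straightforward to compute once a job's completion time and its exclusive-access running time are known. A minimal sketch with hypothetical job records (the values are illustrative, not from the paper):

```python
def slowdown(completion_time: float, exclusive_runtime: float) -> float:
    """Slowdown (a.k.a. stretch, or expansion factor) of a single job:
    completion time divided by running time with exclusive access to all N servers."""
    return completion_time / exclusive_runtime

def mean_slowdown(jobs: list[tuple[float, float]]) -> float:
    """Mean slowdown over (completion_time, exclusive_runtime) pairs."""
    return sum(slowdown(c, r) for c, r in jobs) / len(jobs)

# Example: the first job finished in 10s but would need only 2s running alone.
jobs = [(10.0, 2.0), (6.0, 3.0)]
print(mean_slowdown(jobs))  # (5.0 + 2.0) / 2 = 3.5
```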
    This short paper contains an approximate analysis for the M/G/1/SRPT queue under alternating periods of overload and low load. The result in this paper along with several other results on systems under transient overload are contained in our recent technical report [2].
    New computing and communications paradigms will result in traffic loads in information server systems that fluctuate over much broader ranges of time scales than current systems. In addition, these fluctuation time scales may only be indirectly known or even be unknown. However, we should still be able to accurately design and manage such systems. This paper addresses this issue: we consider an M/M/1 queueing system operating in a random environment (denoted M/M/1(R)) that alternates between HIGH and LOW phases, where the load in the HIGH phase is higher than in the LOW phase. Previous work on the performance characteristics of M/M/1(R) systems established fundamental properties of the shape of performance curves. In this paper, we extend monotonicity results to include convexity and concavity properties, provide a partial answer to an open problem on stochastic ordering, develop new computational techniques, and include boundary cases and various degenerate M/M/1(R) systems. The ba...
    This document examines five performance questions which are repeatedly asked by practitioners in industry: (i) My system utilization is very low, so why are job delays so high? (ii) What should I do to lower job delays? (iii) How can I favor short jobs if I don't know which jobs are short? (iv) If some jobs are more important than others, how do I negotiate importance versus size? (v) How do answers change when dealing with a closed-loop system, rather than an open system? All these questions have simple answers through queueing theory. This short paper elaborates on the questions and their answers. To keep things readable, our tone is purposely informal throughout. For more formal statements of these questions and answers, please see [14].
    Scheduling to minimize mean response time in an M/G/1 queue is a classic problem. The problem is usually addressed in one of two scenarios. In the perfect-information scenario, the scheduler knows each job's exact size, or service requirement. In the zero-information scenario, the scheduler knows only each job's size distribution. The well-known shortest remaining processing time (SRPT) policy is optimal in the perfect-information scenario, and the more complex Gittins policy is optimal in the zero-information scenario. In real systems the scheduler often has partial but incomplete information about each job's size. We introduce a new job model, that of multistage jobs, to capture this partial-information scenario. A multistage job consists of a sequence of stages, where both the sequence of stages and stage sizes are unknown, but the scheduler always knows which stage of a job is in progress. We give an optimal algorithm for scheduling multistage jobs in an M/G/1 queue ...
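The SRPT rule mentioned above is easiest to illustrate in the simplest setting: a batch of jobs all present at time 0, where remaining size equals size and SRPT reduces to running jobs shortest-first. The job sizes below are illustrative assumptions:

```python
def mean_response_time(sizes: list[float]) -> float:
    """Mean completion time when jobs run back-to-back in the given order."""
    t, total = 0.0, 0.0
    for s in sizes:
        t += s        # this job completes at the running total of work
        total += t
    return total / len(sizes)

batch = [5.0, 1.0, 3.0]
print(mean_response_time(batch))          # FCFS order: (5 + 6 + 9) / 3
print(mean_response_time(sorted(batch)))  # SRPT order: (1 + 4 + 9) / 3
```

Serving shortest-first never increases mean response time relative to any other order, which is the intuition behind SRPT's optimality in the perfect-information scenario.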
    Multiserver queueing systems are found at the core of a wide variety of practical systems. Unfortunately, existing tools for analyzing multiserver models have major limitations: Techniques for exact analysis often struggle with high-dimensional models, while techniques for deriving bounds are often too specialized to handle realistic system features, such as variable service rates of jobs. New techniques are needed to handle these complex, important, high-dimensional models. In this paper we introduce the work-conserving finite-skip class of models. This class includes many important models, such as the heterogeneous M/G/k, the limited processor sharing policy for the M/G/1, the threshold parallelism model, and the multiserver-job model under a simple scheduling policy. We prove upper and lower bounds on mean response time for any model in the work-conserving finite-skip class. Our bounds are separated by an additive constant, giving a strong characterization of mean response time a...
    Faster storage media, faster interconnection networks, and improvements in systems software have significantly mitigated the effect of I/O bottlenecks in HPC applications. Even so, applications that read and write data in small chunks are limited by the ability of both the hardware and the software to handle such workloads efficiently. Often, scientific applications partition their output using one file per process. This is a problem on HPC computers with hundreds of thousands of cores and will only worsen with exascale computers, which will be an order of magnitude larger. To avoid wasting time creating output files on such machines, scientific applications are forced to use libraries that combine multiple I/O streams into a single file. For many applications where output is produced out-of-order, this must be followed by a costly, massive data sorting operation. DeltaFS allows applications to write to an arbitrarily large number of files, while also guaranteeing efficient data acc...
    We consider how to best schedule reparative downtime for a customer-facing online service that is vulnerable to cyber attacks such as malware infections. These infections can cause performance degradation (i.e., a slower service rate) and facilitate data theft, both of which have monetary repercussions. Infections may go undetected and can only be removed by time-consuming cleanup procedures, which require temporarily taking the service offline. From a security-oriented perspective, cleanups should be undertaken as frequently as possible. From a performance-oriented perspective, frequent cleanups are desirable because they maintain faster service, but they are simultaneously undesirable because they lead to more frequent downtimes and subsequent loss of revenue. We ask when and how often cleanups should happen. In order to analyze various downtime scheduling policies, we combine queueing-theoretic techniques with a revenue model to capture the problem’s tradeoffs. Unlike classical repair problems, this problem necessitates the analysis of a quasi-birth-death Markov chain, tracking the number of customer requests in the system and the (possibly unknown) infection state. We adapt a recent analytic technique, Clearing Analysis on Phases (CAP), to determine the exact steady-state distribution of the underlying Markov chain, which we then use to compute revenue rates and make recommendations. Prior work on downtime scheduling under cyber attacks relies on heuristic approaches, with our work being the first to address this problem analytically.
    We consider the age-old problem of job placement in a distributed server system. Jobs (tasks) arrive according to a Poisson process and must each be dispatched to exactly one of several host machines for processing. We assume for simplicity that these host machines are identical and that there is no cost for dispatching jobs to hosts. The rule for assigning jobs to host machines is known as the task assignment policy. In this paper we consider the particular model of a distributed server system in which jobs are not preemptible, i.e., each job is run-to-completion (no timesharing between jobs). Our model is motivated by batch job schedulers like LoadLeveler, LSF, PBS, and NQS, which typically only support run-to-completion [11]. The processing requirements of the jobs are assumed to be i.i.d. according to some distribution G, which we typically assume to be heavy-tailed (to be defined shortly). We assume that the processing requirement of a job is not known at the time the job arrives (although the distribution G could be deduced after many observations). We will use the terms processing requirement, service demand, and size interchangeably.
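Heavy-tailed job size distributions of the kind assumed above are often modeled with the Bounded Pareto family in this literature. The sketch below draws samples via the inverse CDF; the parameter values are illustrative assumptions, not values from the paper:

```python
import random

def bounded_pareto(k: float, p: float, alpha: float) -> float:
    """Inverse-CDF sample from a Bounded Pareto distribution on [k, p)
    with shape parameter alpha (smaller alpha means a heavier tail)."""
    u = random.random()
    return k / (1.0 - u * (1.0 - (k / p) ** alpha)) ** (1.0 / alpha)

random.seed(0)
sizes = [bounded_pareto(k=1.0, p=1e6, alpha=1.1) for _ in range(100_000)]
# Heavy tails: the total work is dominated by a small fraction of huge jobs.
print(max(sizes) / (sum(sizes) / len(sizes)))
```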
    We consider a service provider facing a continuum of delay-sensitive strategic customers. The service provider maximizes revenue by charging customers for the privilege of joining an M/G/1 queue and assigning them service priorities. Each customer has a valuation for the service, with a waiting cost per unit time that is proportional to their valuation; customer types are drawn from a continuous distribution and are unobservable to the service provider. We illustrate how to find revenue-maximizing incentive-compatible priority pricing menus, where the firm charges higher prices for higher queueing priority. We show that our proposed priority pricing scheme is optimal across all incentive-compatible pricing policies whenever the customer valuation distribution is regular. We compute the resulting price menus and priority allocations in closed form when customer valuations are drawn from Exponential, Uniform, or Pareto distributions. We find revenues in closed form for the special cas...
    We consider the social welfare model of Naor [20] and revenue-maximization model of Chen and Frank [7], where a single class of delay-sensitive customers seek service from a server with an observable queue, under state dependent pricing. It is known that in this setting both revenue and social welfare can be maximized by a threshold policy, whereby customers are barred from entry once the queue length reaches a certain threshold. However, no explicit expression for this threshold has been found. This paper presents the first derivation of the optimal threshold in closed form, and a surprisingly simple formula for the (maximum) revenue under this optimal threshold. Utilizing properties of the Lambert W function, we also provide explicit scaling results of the optimal threshold as the customer valuation grows. Finally, we present a generalization of our results, allowing for settings with multiple servers.
    In this paper we consider server farms with a setup cost. This model is common in manufacturing systems and data centers, where there is a cost to turn servers on. Setup costs always take the form of a time delay, and sometimes there is additionally a power penalty, as in the case of data centers. Any server can be either
    We consider a distributed server system and ask which policy should be used for assigning jobs (tasks) to hosts. In our server, jobs are not preemptible. Also, a job's service demand is not known a priori. We are particularly concerned with the case where the workload is heavy-tailed, as is characteristic of many empirically measured computer workloads. We analyze several natural task assignment policies and propose a new one, TAGS (Task Assignment based on Guessing Size). The TAGS algorithm is counterintuitive in many respects, including load unbalancing, non-work-conserving behavior, and fairness. We find that under heavy-tailed workloads, TAGS can outperform all task assignment policies known to us by several orders of magnitude with respect to both mean response time and mean slowdown, provided the system load is not too high. We also introduce a new practical performance metric for distributed servers called server expansion. Under the server expansion metric, TAGS significantl...
    For most computer systems, even short periods of overload degrade performance significantly. The number of jobs in the system quickly grows, often exceeding the capacity of the system within just seconds, and response times explode. In this paper we investigate ...
    AutoScale: Dynamic, Robust Capacity Management for Multi-Tier Data Centers. Anshul Gandhi, Mor Harchol-Balter, Ram Raghunathan, and Michael Kozuch. April 2012. CMU-CS-12-109. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213. ...
    We consider the problem of task assignment in a distributed system (such as a distributed Web server) in which task sizes are drawn from a heavy-tailed distribution. Many task assignment algorithms are based on the heuristic that balancing the load at the server hosts will result in optimal performance. We show this conventional wisdom is less true when the task size distribution is heavy-tailed (as is the case for Web file sizes). We introduce a new task assignment policy, called Size Interval Task Assignment with Variable Load (SITA-V). ...
    We examine the question of whether to employ the first-come-first-served (FCFS) discipline or the processor-sharing (PS) discipline at the hosts in a distributed server system. We are interested in the case in which service times are drawn from a heavy-tailed distribution, and so have very high variability. Traditional wisdom when task sizes are highly variable would prefer the PS discipline, because it allows small tasks to avoid being delayed behind large tasks in a queue. However, we show that system performance can actually be significantly ...
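The intuition above can be made concrete with the standard M/G/1 mean response time formulas (textbook results, not derivations from this paper): under FCFS, the Pollaczek-Khinchine formula gives E[T] = E[S] + λE[S²]/(2(1−ρ)), which grows with the second moment of the service time, while under PS, E[T] = E[S]/(1−ρ) is insensitive to variability. A minimal sketch with illustrative parameter values:

```python
def mg1_fcfs_mean_response(lam: float, es: float, es2: float) -> float:
    """M/G/1-FCFS mean response time via Pollaczek-Khinchine:
    E[T] = E[S] + lam * E[S^2] / (2 * (1 - rho)), rho = lam * E[S]."""
    rho = lam * es
    return es + lam * es2 / (2.0 * (1.0 - rho))

def mg1_ps_mean_response(lam: float, es: float) -> float:
    """M/G/1-PS mean response time, insensitive to the service distribution:
    E[T] = E[S] / (1 - rho)."""
    rho = lam * es
    return es / (1.0 - rho)

lam, es = 0.5, 1.0             # arrival rate and mean job size; rho = 0.5
for es2 in (1.0, 2.0, 100.0):  # second moment: low, exponential, highly variable
    print(mg1_fcfs_mean_response(lam, es, es2), mg1_ps_mean_response(lam, es))
```

With exponential service (E[S²] = 2 here) the two coincide, as expected for an M/M/1 queue; as variability grows, FCFS response time grows linearly in E[S²] while PS is unaffected.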

    And 161 more