Eitan Frachtenberg

    There are many choices to make when evaluating the performance of a complex system. In the context of parallel job scheduling, one must decide what workload to use and what measurements to take. These decisions sometimes have subtle implications that are easy to overlook. In this paper we document numerous pitfalls one may fall into, with the hope of providing at least some help in avoiding them. Along the way, we also identify topics that could benefit from additional research.
    Historically, Markovian predictors have been very successful in predicting branch outcomes. In this work we propose a hybrid scheme that employs two Prediction by Partial Matching (PPM) Markovian predictors, one that predicts based on local branch histories and one based on global branch histories. The two independent predictions are combined using a neural network. On the CBP-2 traces the proposed scheme achieves over twice the prediction accuracy of the gshare predictor.
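    To make the scheme above concrete, here is a minimal Python sketch in its spirit: two PPM-style predictors, one keyed on per-branch (local) history and one on global history, with their votes combined by a single-layer perceptron standing in for the neural network. The table organization, history lengths, and training rule are illustrative assumptions, not the paper's implementation.

```python
class PPMPredictor:
    """Prediction by Partial Matching over a branch-history string.

    Tries the longest recorded context first and falls back to
    progressively shorter contexts (the defining PPM behavior).
    """
    def __init__(self, max_order=8):
        self.max_order = max_order
        # One {context-string: [not-taken count, taken count]} table per order.
        self.tables = [dict() for _ in range(max_order + 1)]

    def predict(self, history):
        for order in range(self.max_order, -1, -1):
            ctx = history[-order:] if order else ""
            counts = self.tables[order].get(ctx)
            if counts and counts[0] != counts[1]:
                return 1 if counts[1] > counts[0] else 0
        return 1  # default: predict taken

    def update(self, history, outcome):
        for order in range(self.max_order + 1):
            ctx = history[-order:] if order else ""
            counts = self.tables[order].setdefault(ctx, [0, 0])
            counts[outcome] += 1


class HybridPredictor:
    """Combines local- and global-history PPM votes with a perceptron."""
    def __init__(self):
        self.local = PPMPredictor()
        self.glob = PPMPredictor()
        self.global_hist = ""
        self.local_hist = {}            # per-branch (per-PC) history
        self.weights = [0.0, 0.0, 0.0]  # bias, local vote, global vote

    def predict_and_train(self, pc, outcome):
        lh = self.local_hist.get(pc, "")
        votes = [1.0,
                 1.0 if self.local.predict(lh) else -1.0,
                 1.0 if self.glob.predict(self.global_hist) else -1.0]
        y = sum(w * v for w, v in zip(self.weights, votes))
        prediction = 1 if y >= 0 else 0
        # Simple perceptron rule: adjust weights only on a misprediction.
        if prediction != outcome:
            target = 1.0 if outcome else -1.0
            self.weights = [w + target * v for w, v in zip(self.weights, votes)]
        # Train both component predictors and advance the histories.
        self.local.update(lh, outcome)
        self.glob.update(self.global_hist, outcome)
        self.local_hist[pc] = (lh + str(outcome))[-16:]
        self.global_hist = (self.global_hist + str(outcome))[-16:]
        return prediction
```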
    Although workstation clusters are a common platform for high-performance computing (HPC), they remain more difficult to manage than sequential systems or even symmetric multiprocessors. Furthermore, as cluster sizes increase, the quality of the resource-management subsystem—essentially, all of the code that runs on a cluster other than the applications—increasingly impacts application efficiency. In this paper, we present STORM, a resource-management framework designed for scalability and performance. The key innovation behind STORM is a software architecture that enables resource management to exploit low-level network features. As a result of this HPC-application-like design, STORM is orders of magnitude faster than the best reported results in the literature on two sample resource-management functions: job launching and process scheduling.
    The efficient implementation of collective communication patterns in a parallel machine is a challenging design effort that requires the solution of many problems. In this paper we present an in-depth description of how the Quadrics network supports both hardware- and software-based collectives. We describe the main features of the two building blocks of this network, a network interface that can perform zero-copy user-level communication and a wormhole routing switch. We also focus our attention on the routing and flow control algorithms, deadlock avoidance, and on how the processing nodes are integrated in a global, virtual shared memory. Experimental results conducted on a 64-node AlphaServer cluster indicate that the time to complete the hardware-based barrier synchronization on the whole network is as low as 6 μs, with very good scalability. Good latency and scalability are also achieved with the software-based synchronization, which takes about 15 μs. With the broadcast, similar per...
    Many scientific and high-performance computing applications consist of multiple processes running on different processors that communicate frequently. Because of their synchronization needs, these applications can suffer severe performance penalties if their processes are not all coscheduled to run together. Two common approaches to coscheduling jobs are batch scheduling, wherein nodes are dedicated for the duration of the run, and gang scheduling, wherein time slicing is coordinated across processors. Both work well when jobs are load-balanced and make use of the entire parallel machine. However, these conditions are rarely met and most realistic workloads consequently suffer from both internal and external fragmentation, in which resources and processors are left idle because jobs cannot be packed with perfect efficiency. This situation leads to reduced utilization and suboptimal performance. Flexible CoScheduling (FCS) addresses this problem by monitoring each job’s computatio...
    A common trend in the design of large-scale clusters is to use a high-performance data network to integrate the processing nodes in a single parallel computer. In these systems the performance of the interconnect can be a limiting factor for the input/output (I/O), which is traditionally bottlenecked by the disk bandwidth. In this paper we present an experimental analysis, on a 64-node AlphaServer cluster based on the Quadrics network (QsNET), of the behavior of the interconnect under I/O traffic, and the influence of the placement of the I/O servers on the overall performance. The effects of using dedicated I/O nodes or overlapping I/O and computation on the I/O nodes are also analyzed. In addition, we evaluate how background I/O traffic interferes with other parallel applications running concurrently. Our experimental results show that a correct placement of the I/O servers can provide up to 20% increase in the available I/O bandwidth. Moreover, some important guidelines for ap...
    The Quadrics interconnection network (QsNet) contributes two novel innovations to the field of high-performance interconnects: (1) integration of the virtual-address spaces of individual nodes into a single, global, virtual-address space and (2) network fault tolerance via link-level and end-to-end protocols that can detect faults and automatically re-transmit packets. QsNet achieves these feats by extending the native operating system in the nodes with a network operating system and specialized hardware support in the network interface. As these and other important features of QsNet can be found in the InfiniBand specification, QsNet can be viewed as a precursor to InfiniBand.
    Although clusters are a popular form of high-performance computing, they remain more difficult to manage than sequential systems—or even symmetric multiprocessors. In this paper, we identify a small set of primitive mechanisms that are sufficiently general to be used as building blocks to solve a variety of resource-management problems. We then present STORM, a resource-management environment that embodies these mechanisms in a scalable, low-overhead, and efficient implementation. The key innovation behind STORM is a modular software architecture that reduces all resource management functionality to a small number of highly scalable mechanisms. These mechanisms simplify the integration of resource management with low-level network features. As a result of this design, STORM can launch large, parallel applications an order of magnitude faster than the best time reported in the literature and can gang-schedule a parallel application as fast as the node OS can schedule a sequential app...
    Scientific codes spend a considerable part of their run time executing collective communication operations. Such operations can also be critical for efficient resource management in large-scale machines. Therefore, scalable collective communication is a key factor to achieve good performance in large-scale parallel computers. In this paper we describe the performance and scalability of some common collective communication patterns on the ASCI Q machine. Experimental results conducted on a 1024-node/4096-processor segment show that the network is fast and scalable. The network is able to barrier-synchronize in a few tens of μs, perform a broadcast with an aggregate bandwidth of more than 100 GB/s, and sustain heavy hot-spot traffic with a limited performance degradation.
    Fine-grained parallel applications require all their processes to run simultaneously on distinct processors to achieve good efficiency. This is typically accomplished by space slicing, wherein nodes are dedicated for the duration of the run, or by gang scheduling, wherein time slicing is coordinated across processors. Both schemes suffer from fragmentation, where processors are left idle because jobs cannot be packed with perfect efficiency. Obviously, this leads to reduced utilization and sub-optimal performance. Flexible coscheduling (FCS) solves this problem by monitoring each job's granularity and communication activity, and using gang scheduling only for those jobs that require it. Processes from other jobs, which can be scheduled without any constraints, are used as filler to reduce fragmentation. In addition, inefficiencies due to load imbalance and hardware heterogeneity are also reduced because the classification is done on a per-process basis. FCS has been fully imple...
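    The heart of FCS is its per-process classification, sketched minimally below in Python. The measurement fields and thresholds are hypothetical, and the decision rule is a simplification; the three-way split loosely follows the paper's terminology (coscheduled, frustrated, don't-care) but is not its implementation, which lives inside the scheduler and communication library.

```python
from dataclasses import dataclass

@dataclass
class ProcessStats:
    avg_compute_us: float  # mean compute interval between communication events
    avg_wait_us: float     # mean time spent blocked waiting for peers

def classify(stats: ProcessStats,
             fine_grain_us: float = 500.0,
             wait_ratio: float = 0.5) -> str:
    """Classify one process for the scheduler.

    'CS' -- fine-grained and synchronizing effectively: keep it
            gang-scheduled (coscheduled) with its peers.
    'F'  -- fine-grained but spending much of its slice blocked
            (e.g., load imbalance): "frustrated", a candidate to yield.
    'DC' -- coarse-grained: "don't care", schedulable without
            constraints and usable as fragmentation filler.
    """
    if stats.avg_compute_us > fine_grain_us:
        return "DC"
    if stats.avg_wait_us > wait_ratio * stats.avg_compute_us:
        return "F"
    return "CS"

# A process computing for 200 us between messages and rarely blocking
# needs coscheduling:
print(classify(ProcessStats(avg_compute_us=200.0, avg_wait_us=40.0)))  # CS
```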
    Using multiple independent networks (also known as rails) is an emerging technique to overcome bandwidth limitations and enhance fault tolerance of current high-performance parallel computers. In this paper we present and analyze various algorithms to allocate multiple communication rails, including static and dynamic allocation schemes. An analytical lower bound on the number of rails required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various algorithms in terms of bandwidth and latency. We show that striping messages over multiple rails can substantially reduce network latency, depending on average message size, network load, and allocation scheme.
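    As a rough illustration of striping, the Python sketch below splits one message into equal chunks and sends one chunk per rail; the rail.send(chunk, offset=...) interface is hypothetical. Because each rail incurs a per-message startup cost, striping pays off only above some message size, which is consistent with the dependence on average message size noted above.

```python
def stripe(message: bytes, rails: list) -> None:
    """Send one message as len(rails) chunks, one chunk per rail;
    the receiver reassembles the chunks by their byte offsets."""
    k = len(rails)
    chunk = (len(message) + k - 1) // k  # ceiling division
    for i, rail in enumerate(rails):
        part = message[i * chunk:(i + 1) * chunk]
        if part:  # trailing rails may get nothing for short messages
            rail.send(part, offset=i * chunk)  # hypothetical rail API
```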
    Using multiple independent networks (also known as rails) is an emerging technique to overcome bandwidth limitations and enhance fault-tolerance of current high-performance clusters. This report presents the limitations and performance of static rail-allocation approaches, where each rail is pre-assigned a direction for communication. An analytical lower bound on the number of networks required for rail allocation is shown. We present an extensive experimental comparison of the behavior of various allocation schemes in terms of bandwidth and latency, compared to static rail allocation. We also compare the ability of static and dynamic rail-allocation mechanisms to stripe messages over multiple rails. Scalability issues of static and dynamic rail allocation are also compared. We find that not only does static rail allocation necessarily consume many resources, it also performs poorly compared to dynamic rail-allocation schemes in all the tested aspects.
    Fine-grained parallel applications require all their processes to run simultaneously on distinct processors to make good progress. This is typically achieved by space slicing with variable partitioning, in which nodes are dedicated for the duration of the run, or by gang scheduling, in which time slicing is coordinated across processors. The problem is that both schemes suffer from fragmentation, where processors are left idle because jobs cannot be packed with 100% efficiency. Naturally, this leads to reduced utilization and sub-optimal performance. Flexible coscheduling (FCS) solves this problem by monitoring each job’s granularity and communication activity, and using gang scheduling only for those jobs that really need it. Processes from other jobs, which can be scheduled without any constraints, are used as filler to reduce fragmentation. In addition, inefficiencies due to load imbalance and hardware heterogeneity are also reduced, because the classification is done...
    Genetic and Evolutionary Algorithms (GEAs) rely on operators such as mutation and recombination to introduce variation to the genotypes. Because of their crucial role and effect on GEA performance, several studies have attempted to model and quantify the variation induced by different operators on various genotypic representations and GEAs. One metric of particular interest is the locality of genetic operators and representations, or how sensitive the phenotype is to small changes in genotype. Consequently, there is a considerable body of empirical work on the effects that different representations have on locality, with an emphasis on several popular representations, such as Gray encoding, and popular variation operators, such as single-bit mutation and single-point crossover. Here, we compute and prove tight upper and lower bounds on locality. We first precisely define our locality metrics for the single-point mutation and generic crossover operators by reformulating Rothlauf’s sem...
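    For a flavor of what such a locality computation looks like, the Python sketch below totals, over every genotype and every single-bit mutation of it, the induced phenotypic distance, and compares standard binary and reflected Gray decodings. The aggregate metric shown is illustrative; the paper's precise definitions (following Rothlauf) may weigh distances differently.

```python
def gray_decode(g: int) -> int:
    """Invert the reflected Gray code g = b ^ (b >> 1)."""
    b = g
    mask = g >> 1
    while mask:
        b ^= mask
        mask >>= 1
    return b

def locality(decode, n: int) -> int:
    """Total phenotypic distance over all single-bit mutations of all
    n-bit genotypes (smaller totals mean higher locality)."""
    return sum(abs(decode(g) - decode(g ^ (1 << i)))
               for g in range(1 << n)
               for i in range(n))

n = 4
print("binary:", locality(lambda g: g, n))  # 240 for n = 4: each flip of
                                            # bit i moves the phenotype 2**i
print("gray:  ", locality(gray_decode, n))
```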
    The current work introduces a method for predicting Memcached throughput on single-core and multi-core processors. The method is based on traces collected from a full system simulator running Memcached. A series of microarchitectural simulators consume these traces and the results are used to produce a CPI model composed of a baseline issue rate, cache miss rates, and branch misprediction rate. Simple queuing models are used to produce throughput predictions with accuracy in the range of 8% to 17%.
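    A minimal Python sketch of such a two-stage model, with made-up parameter values: CPI is assembled additively from a baseline plus cache-miss and branch-misprediction stall terms, and a utilization-capped queueing step converts per-request service time into throughput. The constants and the M/M/1-style step are illustrative assumptions, not the paper's calibrated model.

```python
def cpi(base_cpi, miss_rates, miss_penalties, mispred_rate, mispred_penalty):
    """Cycles per instruction from additive stall components."""
    stalls = sum(r * p for r, p in zip(miss_rates, miss_penalties))
    return base_cpi + stalls + mispred_rate * mispred_penalty

# Illustrative numbers: per-instruction miss rates for three cache
# levels, their penalties in cycles, and a branch-misprediction term.
c = cpi(base_cpi=0.6,
        miss_rates=[0.02, 0.005, 0.001],
        miss_penalties=[10, 40, 200],
        mispred_rate=0.01, mispred_penalty=15)

freq_hz = 2.0e9            # assumed core clock
insns_per_request = 5.0e4  # assumed path length of one Memcached request
service_time = c * insns_per_request / freq_hz  # seconds per request

# Cap the server at utilization rho (M/M/1-style stability margin);
# sustainable per-core throughput is then rho / service_time.
rho = 0.9
print(f"CPI = {c:.2f}, per-core throughput ~ {rho / service_time:,.0f} req/s")
```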
    Two major phenomena shaped the U.S. news for most of 2020: the COVID-19 pandemic and a new civil rights movement. In this article, we examine the intersection of these events and their effects on the tech work landscape.
    Contents (excerpt): ... and Burton Smith; A Scalable Multi-Discipline, Multiple-Processor Scheduling Framework for IRIX, James M. Barton and Nawaf Bitar; Scheduling to ..., Kelvin K. Yue and David J. Lilja; A Microeconomic Scheduler for Parallel Computers, Ion Stoica, Hussein Abdel-Wahab ...
    Commodity parallel computers are no longer a technology predicted for some indistinct future: they are becoming ubiquitous. In the absence of significant advances in clock speed, chip-multiprocessors (CMPs) and symmetric multithreading (SMT) are the modern workhorses that keep Moore’s Law relevant. On the software side, we are starting to observe the adaptation of some codes to the new commodity parallel hardware. While in the past, only complex professional codes ran on parallel computers, the commoditization of parallel computers is opening the door for many desktop applications to benefit from parallelization. We expect this software trend to continue, since the only apparent way of obtaining additional performance from the hardware will be through parallelization. Based on the premise that the average desktop workload is growing more parallel and complex, this paper asks the question: Are current desktop operating systems appropriate for these trends? Specifically, we are...
    Computer Science researchers rely on peer-reviewed conferences to publish their work and to receive feedback. The impact of these peer-reviewed papers on researchers’ careers can hardly be overstated. Yet conference organizers can make inconsistent choices for their review process, even in the same subfield. These choices are rarely reviewed critically, and when they are, the emphasis centers on the effects on the technical program, not the authors. In particular, the effects of conference policies on author experience and diversity are still not well understood. To help address this knowledge gap, this paper presents a cross-sectional study of 56 conferences from one large subfield of computer science, namely computer systems. We introduce a large author survey (n = 918), representing 809 unique papers. The goal of this paper is to expose this data and present an initial analysis of its findings. We primarily focus on quantitative comparisons between different survey questions and ...
    Key-value stores are a vital component in many scale-out enterprises, including social networks, online retail, and risk analysis. Accordingly, they are receiving increased attention from the research community in an effort to improve their performance, scalability, reliability, cost, and power consumption. To be effective, such efforts require a detailed understanding of realistic key-value workloads. And yet little is known about these workloads outside of the companies that operate them. This paper aims to address this gap. To this end, we have collected detailed traces from Facebook's Memcached deployment, arguably the world's largest. The traces capture over 284 billion requests from five different Memcached use cases over several days. We analyze the workloads from multiple angles, including: request composition, size, and rate; cache efficacy; temporal patterns; and application use cases. We also propose a simple model of the most representative trace to enable the ge...
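    The sort of analysis described runs naturally over a request log. Below is an illustrative Python sketch assuming a hypothetical CSV trace with (timestamp, op, key, size) fields; it tallies request composition and value sizes and simulates an LRU cache to estimate hit rate. The actual traces and their schema differ.

```python
import csv
from collections import Counter, OrderedDict

def analyze(trace_path: str, cache_items: int = 100_000):
    ops = Counter()        # request composition by operation type
    sizes = []             # value sizes seen in the trace
    cache = OrderedDict()  # LRU cache simulation over keys
    hits = gets = 0
    with open(trace_path) as f:
        for _ts, op, key, size in csv.reader(f):
            ops[op] += 1
            sizes.append(int(size))
            if op == "GET":
                gets += 1
                if key in cache:
                    hits += 1
                    cache.move_to_end(key)  # refresh recency
                else:
                    cache[key] = True       # admit on miss
                    if len(cache) > cache_items:
                        cache.popitem(last=False)  # evict LRU key
    print("request composition:", dict(ops))
    print("mean value size:", sum(sizes) / max(len(sizes), 1))
    print("GET hit rate:", hits / max(gets, 1))
```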
