THE PDL PACKET • spring 2018 • http://www.pdl.cmu.edu/
An informal newsletter on PDL activities and events from academia's premier storage systems research center, devoted to advancing the state of the art in storage and information infrastructures.

CONTENTS
DeltaFS ... 1
Director's Letter ... 2
Year in Review ... 4
Recent Publications ... 5
PDL News & Awards ... 8
3Sigma ... 12
Defenses & Proposals ... 14
Alumni News ... 18
New PDL Faculty & Staff ... 19

PDL CONSORTIUM MEMBERS
Alibaba Group, Broadcom Ltd., Dell EMC, Facebook, Google, Hewlett Packard Enterprise, Hitachi Ltd., IBM Research, Intel Corporation, Micron, Microsoft Research, MongoDB, NetApp Inc., Oracle Corporation, Salesforce, Samsung Information Systems America, Seagate Technology, Toshiba, Two Sigma, Veritas, Western Digital

MASSIVE INDEXED DIRECTORIES IN DELTAFS
by Qing Zheng, George Amvrosiadis & the DeltaFS Group

Faster storage media, faster interconnection networks, and improvements in systems software have significantly mitigated the effect of I/O bottlenecks in HPC applications. Even so, applications that read and write data in small chunks are limited by the ability of both the hardware and the software to handle such workloads efficiently. Often, scientific applications partition their output using one file per process. This is a problem on HPC computers with hundreds of thousands of cores and will only worsen with exascale computers, which will be an order of magnitude larger. To avoid wasting time creating output files on such machines, scientific applications are forced to use libraries that combine multiple I/O streams into a single file. For many applications where output is produced out-of-order, this must be followed by a costly, massive data sorting operation. DeltaFS allows applications to write to an arbitrarily large number of files, while also guaranteeing efficient data access without requiring sorting.

The first challenge when handling an arbitrarily large number of files is dealing with the resulting metadata load. We manage this using the transient and serverless DeltaFS file system [1]. The transient property of DeltaFS allows each program that uses it to individually control the amount of computing resources dedicated to the file system, effectively scaling metadata performance under application control. When combined with DeltaFS's serverless nature, file system design and provisioning decisions are decoupled from the overall design of the HPC platform. As a result, applications that create one file for each process are no longer tied to the platform storage system's ability to handle metadata-heavy workloads. The HPC platform can also provide scalable file creation rates without requiring a fundamental redesign of the platform's storage system.

The second challenge is guaranteeing both fast writing and reading for workloads that consist primarily of small I/O transfers. This work was inspired by interactions with cosmologists seeking to explore the trajectories of the highest energy particles in an astrophysics simulation using the VPIC plasma simulation code [2]. VPIC is a highly-optimized particle simulation code developed at Los Alamos National Laboratory (LANL).
Each VPIC simulation proceeds in timesteps, and each process represents a bounding box in the physical simulation space that particles move through. Every few timesteps the simulation stops, and each process creates a file and writes the data for the particles that are currently located within its bounding box. This is the default, file-per-process ... continued on page 11

[Figure 1: DeltaFS in-situ indexing of particle data in an Indexed Massive Directory. The figure contrasts traditional file-per-process output (O(1M) processes writing O(1M) files, with a trajectory query searching O(1TB)) with DeltaFS file-per-particle output (O(1T) indexed subfiles, with a trajectory query searching O(1MB)). While indexed particle data are exposed as one DeltaFS subfile per particle, they are stored as indexed log objects in the underlying storage.]

FROM THE DIRECTOR'S CHAIR
GREG GANGER

Hello from fabulous Pittsburgh!

25 years! This past fall, we celebrated 25 years of the Parallel Data Lab. Started by Garth after he defended his PhD dissertation on RAID at UC-Berkeley, PDL has seen growth and success that I can't imagine he imagined... from the early days of exploring new disk array approaches to today's broad agenda of large-scale storage and data center infrastructure research... from a handful of core CMU researchers and industry participants to a vibrant community of scores of CMU researchers and 20 sponsor companies. Amazing.

It has been another great year for the Parallel Data Lab, and I'll highlight some of the research activities and successes below. Others, including graduations, publications, awards, etc., can be found throughout the newsletter. But, I can't not start with the biggest PDL news item of this 25th anniversary year: Garth has graduated ;). More seriously, 25 years after founding PDL, including guiding/nurturing it into a large research center with sustained success (25 years!), Garth decided to move back to Canada and take the reins (as President and CEO) of the new Vector Institute for AI. We wish him huge success with this new endeavor! Garth has been an academic role model, a mentor, and a friend to me and many others... we will miss him greatly, and he knows that we will always have a place for him at PDL events.

Because it overlaps in area with Vector, I'll start my highlighting of PDL activities with our continuing work at the intersection of machine learning (ML) and systems. We continue to explore new approaches to system support for large-scale machine learning, especially aspects of how ML systems should adapt and be adapted in cloud computing environments. Beyond our earlier focus on challenges around dynamic resource availability and time-varying resource interference, we continue to explore challenges related to training models over geo-distributed data, training very large models, and how edge resources should be shared among inference applications using DNNs for video stream processing. We are also exploring how ML can be applied to make systems better, including even ML systems ;). Indeed, much of PDL's expansive database systems research activity centers on embedding automation in DBMSs. With an eye toward simplifying administration and improving performance robustness, a number of aspects of Andy's overall vision of a self-driving database system are being explored and realized. To embody them, and other ideas, a new open source DBMS called Peloton has been created and is being continuously enhanced.
There also continue to be cool results and papers on better exploitation of NVM in databases, improved concurrency control mechanisms, and range query filtering. I thoroughly enjoy watching (and participating) in the great energy that Andy has infused into database systems research at CMU. Of course, PDL continues to have a big focus on storage systems research at various levels. At the high end, PDL’s long-standing focus on metadata scaling for scalable storage has led to continued research into benefits of and approaches to allowing important applications to manage their own namespaces and metadata for periods of time. In addition to bypassing traditional metadata bottlenecks 2 THE PARALLEL DATA LABORATORY School of Computer Science Department of ECE Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213-3891 voice 412•268•6716 fax 412•268•3010 PUBLISHER Greg Ganger EDITOR Joan Digney The PDL Packet is published once per year to update members of the PDL Consortium. A pdf version resides in the Publications section of the PDL Web pages and may be freely distributed. Contributions are welcome. THE PDL LOGO Skibo Castle and the lands that comprise its estate are located in the Kyle of Sutherland in the northeastern part of Scotland. Both ‘Skibo’ and ‘Sutherland’ are names whose roots are from Old Norse, the language spoken by the Vikings who began washing ashore regularly in the late ninth century. The word ‘Skibo’ fascinates etymologists, who are unable to agree on its original meaning. All agree that ‘bo’ is the Old Norse for ‘land’ or ‘place,’ but they argue whether ‘ski’ means ‘ships’ or ‘peace’ or ‘fairy hill.’ Although the earliest version of Skibo seems to be lost in the mists of time, it was most likely some kind of fortified building erected by the Norsemen. The present-day castle was built by a bishop of the Roman Catholic Church. Andrew Carnegie, after making his fortune, bought it in 1898 to serve as his summer home. In 1980, his daughter, Margaret, donated Skibo to a trust that later sold the estate. It is presently being run as a luxury hotel. THE PDL PACKET PAR ALLEL D ATA L ABO R ATORY FACULTY Greg Ganger (PDL Director) 412•268•1297 ganger@ece.cmu.edu George Amvrosiadis Seth Copen Goldstein David Andersen Mor Harchol-Balter Lujo Bauer Gauri Joshi Nathan Beckmann Todd Mowry Daniel Berger Onur Mutlu Chuck Cranor Priya Narasimhan Lorrie Cranor David O’Hallaron Christos Faloutsos Andy Pavlo Kayvon Fatahalian Majd Sakr Rajeev Gandhi M. Satyanarayanan Saugata Ghose Srinivasan Seshan Phil Gibbons Rashmi Vinayak Garth Gibson Hui Zhang STAFF MEMBERS Bill Courtright, 412•268•5485 (PDL Executive Director) wcourtright@cmu.edu Karen Lindenfelser, 412•268•6716 (PDL Administrative Manager) karen@ece.cmu.edu Jason Boles Joan Digney Chad Dougherty Mitch Franzos Alex Glikson Charlene Zang VISITING RESEARCHERS / POST DOCS Rachata Ausavarungnirun Hyeontaek Lim Kazuhiro Saito GRADUATE STUDENTS Abutalib Aghayev Conglong Li Joy Arulraj Kunmin Li Ben Blum Yang Li V. 
Parvathi Bhogaraju Yixin Luo Amirali Boroumand Lin Ma Sol Boucher Diptesh Majumdar Christopher Canel Ankur Mallick Dominic Chen Charles McGuffey Haoxian Chen Prashanth Menon Malhar Chaudhari Yuqing Miao Andrew Chung Wenqi Mou Chris Fallin Pooja Nilangekar Pratik Fegade Yiqun Ouyang Ziqiang Feng Jun Woo Park Samarth Gupta Aurick Qiao Aaron Harlap Souptik Sen Kevin Hsieh Sivaprasad Sudhir Fan Hu Aaron Tian Abhilasha Jain Dana Van Aken Saksham Jain Nandita Vijaykumar Angela Jiang Haoran Wang Ellango Jothimurugesan Jianyu Wang Saurabh Arun Kadekodi Justin Wang Anuj Kalia Ziqi Wang Rajat Kateja Jinliang Wei Jin Kyu Kim Daniel Wong Thomas Kim Lin Xiao Vamshi Konagari Hao Zhang Jack Kosaian Huanchen Zhang Marcel Kost Qing Zheng Giulio Zhou Michael Kuchnik FRO M THE D I RE CTO R’ S C H A I R entirely during the heaviest periods of activity, this approach promises opportunities for efficient in-situ index creation to enable fast queries for subsequent analysis activities. At the lower end, we continue to explore how software systems should be changed to maximize the value from NVM storage, including addressing read-write performance asymmetry and providing storage management features (e.g., page-level checksums, dedup, etc.) without yielding load/store efficiency. We’re excited about continuing to work with PDL companies on understanding where storage hardware is (and should be) going and how it should be exploited in systems. PDL continues to explore questions of resource scheduling for cloud computing, which grows in complexity as the breadth of application and resource types grow. Our cluster scheduling research continues to explore how job runtime estimates can be automatically generated and exploited to achieve greater efficiency. Our most recent work explores more robust ways of exploiting imperfectly-estimated runtime information, finding that providing full distributions of likely runtimes (e.g., based on history of “similar” jobs) works quite well for real-world workloads as reflected in real cluster traces. We are also exploring scheduling for adaptively-sized “virtual clusters” within public clouds, which introduces new questions about which machine types to allocate, how to pack them, and how aggressively to release them. I continue to be excited about the growth and evolution of the storage systems and cloud classes created and led by PDL faculty — their popularity is at an all-time high again this year. These project-intensive classes prepare 100s of MS students to be designers and developers for future infrastructure systems. They build FTLs that store real data (in a simulated NAND Flash SSD), hybrid cloud file systems that work, cluster schedulers, efficient ML model training apps, etc. It’s really rewarding for us and for them. In addition to our lectures and the projects, these classes each feature 3-5 corporate guest lecturers (thank you, PDL Consortium members!) bringing insight on real-world solutions, trends, and futures. Many other ongoing PDL projects are also producing cool results. For example, to help our (and others’) file systems research, we have developed a new file system aging suite, called Geriatrix. Our key-value store research continues to expose new approaches to indexing and remote value access. This newsletter and the PDL website offer more details and additional research highlights. I’m always overwhelmed by the accomplishments of the PDL students and staff, and it’s a pleasure to work with them. 
As always, their accomplishments point at great things to come. The CMU fence displays a farewell message to Garth. SPRING 2018 3 Y E AR I N REVIEW May 2018 ™ 20th annual Spring Visit Day. ™ Qing Zheng and Michael Kuchnik will be interning with LANL this summer. April 2018 ™ Andy Pavlo receive the 2018 Joel & Ruth Spira Teaching Award. ™ Lorrie Cranor received the IAPP Leadership Award. ™ Srinivasan Seshan was appointed Head of he Computer Science Dept. at CMU. ™ Michael Kuchnik received an NDSEG Fellowship for his work on machine learning in HPC systems. ™ Huanchen Zhang proposed his PhD research “Towards SpaceEfficient High-Performance InMemory Search Structures.” ™ Jun Woo Park presented “3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty” at EuroSys ’18 in Porto, Portugal. ™ Charles McGuffey delivered his speaking skills talk on “Designing Algorithms to Tolerate Processor Faults.” ™ Qing Zheng gave his speaking skills talk “Light-Weight In-Situ Indexing For Scientific Workloads.” March 2018 ™ Andy Pavlo wins Google Faculty Research Award for his research on automatic database management systems. ™ Anuj Kalia proposed his thesis research “Efficient Networked Systems for Datacenter Fabrics with RPCs.” ™ Nathan Beckmann presented “LHD: Improving Cache Hit Rate by Maximizing Hit Density” at NSDI ‘18 in Renton, WA. ™ Rajat Kateja presented “Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM” at NVMW ‘18 in San Diego, CA. ™ Rachata Ausavarungnirun presented “MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency” at ASPLOS’18 in Williamsburg, VA. ™ ASPLOS’18 in Williamsburg, VA. computing systems.” Onur, who is now at ETH Zurich, was chosen for “contributions to computer architecture research, especially in memory systems.” ™ Joy Arulraj proposed his PhD research “The Design & Implementation of a Non-Volatile Memory Database Management System.” ™ Dana Van Aken gave her speaking skills talk on “Automatic Database Management System Tuning Through Large-scale Machine Learning.” February 2018 ™ Lorrie Cranor awarded FORE Systems Chair of Computer Science. ™ Qing Zheng gave his speaking skills talk on “Light-weight In-situ Analysis with Frugal Resource Usage.” ™ Rachata Ausavarungnirun presented “Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes” and Vivek Seshadri presented Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology” at MICRO ‘17 in Cambridge, MA. ™ Timothy Zhu presented “Workload Compactor: Reducing Datacenter Cost while Providing Tail Latency SLO Guarantees” at SoCC’17 in Santa Clara, CA. ™ 25th annual PDL Retreat. ™ Lorrie Cranor wins top SIGCHI Award, given to individuals who promote the application of human-computer interaction research to pressing social needs. ™ Six posters were presented at the 1st SysML Conference at Stanford U. on various work related to creating more efficient systems for machine learning. ™ Yixin Luo successfully defended his PhD dissertation on “Architectural Techniques for Improving NAND Flash Memory Reliability.” ™ Andy Pavlo awarded a Sloan Fellowship to continue his work on the study of database management systems, specifically main memory systems, non-relational systems (NoSQL), transaction processing systems (NewSQL) and large-scale data analytics. December 2017 Greg Ganger and PDL alums Hugo Patterson (Datrium) and Jiri Schindler (HPE) enjoy social time at the PDL Retreat. 
• Mor Harchol-Balter and Onur Mutlu were made Fellows of the ACM. Mor was selected "for contributions to performance modeling and analysis of distributed computing systems."

November 2017
• Qing Zheng presented "Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory" at PDSW-DISCS '17 in Denver, CO.

October 2017

September 2017
• Garth Gibson to lead new Vector Institute for AI in Toronto.
• Hongyi Xin delivered his speaking skills talk on "Improving DNA Read Mapping with Error-resilient Seeds."

continued on page 32

RECENT PUBLICATIONS

Geriatrix: Aging what you see and what you don't see. A file system aging approach for modern storage systems
Saurabh Kadekodi, Vaishnavh Nagarajan, Gregory R. Ganger & Garth A. Gibson
2018 USENIX Annual Technical Conference (ATC '18). July 11–13, 2018, Boston, MA.

File system performance on modern primary storage devices (Flash-based SSDs) is greatly affected by aging of the free space, much more so than were mechanical disk drives. We introduce Geriatrix, a simple-to-use, profile-driven file system aging tool that induces target levels of fragmentation in both allocated files (what you see) and remaining free space (what you don't see), unlike previous approaches that focus on just the former. This paper describes and evaluates the effectiveness of Geriatrix, showing that it recreates both fragmentation effects better than previous approaches. Using Geriatrix, we show that measurements presented in many recent file systems papers are higher than should be expected, by up to 30% on mechanical (HDD) and up to 75% on Flash (SSD) disks. Worse, in some cases, the performance rank ordering of file system designs being compared is different from the published results. Geriatrix will be released as open source software with eight built-in aging profiles, in the hopes that it can address the need created by the increased performance impact of file system aging in modern SSD-based storage.

[Figure: Aging impact (throughput, MB/s) on Ext4 atop SSD and HDD. The three bars for each device represent the FS freshly formatted (unaged), aged with Geriatrix, and aged with Impressions. Although relatively small differences are seen with the HDD, aging has a big impact on FS performance on the SSD. Although their file fragmentation levels are similar, the higher free space fragmentation produced by Geriatrix induces larger throughput reductions than for Impressions.]

A Case for Packing and Indexing in Cloud File Systems
Saurabh Kadekodi, Bin Fan, Adit Madan, Garth A. Gibson & Gregory R. Ganger
10th USENIX Workshop on Hot Topics in Cloud Computing. July 9, 2018, Boston, MA.

Tiny objects are the bane of highly scalable cloud object stores. Not only do tiny objects cause massive slowdowns, but they also incur tremendously high costs due to current operation-based pricing models. For example, in Amazon S3's current pricing scheme, uploading 1GB data by issuing tiny (4KB) PUT requests (at 0.0005 cents each) is approximately 57x more expensive than storing that same 1GB for a month. To address this problem, we propose client-side packing of files into gigabyte-sized blobs with embedded indices to identify each file's location.
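To make the pricing argument and the packing idea concrete, the Python sketch below works through the arithmetic and packs a few small files into one blob with an embedded index. The per-GB-month storage price and the blob format here are assumptions for illustration only, not the paper's design.

```python
# Back-of-the-envelope cost check using the abstract's per-PUT price, plus a toy packer.
puts_per_gb = (1 << 30) // (4 << 10)    # 262,144 PUTs of 4KB each to upload 1GB
put_cost = puts_per_gb * 0.000005       # 0.0005 cents per PUT, as quoted in the abstract
store_cost = 0.023                      # assumed ~$0.023 per GB-month (roughly S3 standard)
print(f"{put_cost / store_cost:.0f}x")  # ~57x, consistent with the abstract

def pack(files):
    """files: dict of name -> bytes. Returns one large blob plus an embedded index."""
    blob, index, off = bytearray(), {}, 0
    for name, data in files.items():
        index[name] = (off, len(data))  # record where each small file lives in the blob
        blob += data
        off += len(data)
    return bytes(blob), index           # one PUT for the blob; the index rides along with it

def read(blob, index, name):
    off, length = index[name]
    return blob[off:off + length]       # would be a ranged GET against the blob in a real store

blob, idx = pack({"a.json": b"{}", "b.log": b"x" * 4096})
assert read(blob, idx, "b.log") == b"x" * 4096
```

In a real cloud file system the index would be persisted alongside the blob and consulted on reads, which is the role the embedded indices play in the proposal above.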
Experiments with a packing implementation in Alluxio (an open-source distributed file system) illustrate the potential benefits, such as simultaneously increasing file creation throughput by up to 61x and decreasing cost by over 99.99%. SOAP: One Clean Analysis of All Age-Based Scheduling Policies Ziv Scully, Mor Harchol-Balter & Alan Scheller-Wolf Proceedings of ACM SIGMETRICS 2018 Conference on Measurement and Modeling of Computer Systems Los Angeles, CA, June 2018. We consider an extremely broad class of M/G/1 scheduling policies called SOAP: Schedule Ordered by Agebased Priority. The SOAP policies include almost all scheduling policies in the literature as well as an infinite number of variants which have never been analyzed, or maybe not even conceived. SOAP policies range from classic policies, like first-come, first-serve (FCFS), foreground-background (FB), class-based priority, and shortest remaining processing time (SRPT); to much more complicated scheduling rules, such as the famously complex Gittins index policy and other policies in which a job’s priority changes arbitrarily with its age. While the response time of policies in the former category is well understood, policies in the latter category have resisted response time analysis. We present a universal analysis of all SOAP policies, deriving the mean and Laplace-Stieltjes transform of response time. Towards Optimality in Parallel Job Scheduling Ben Berg, Jan-Pieter Dorsman & Mor Harchol-Balter Proceedings of ACM SIGMETRICS 2018 Conference on Measurement and Modeling of Computer Systems Los Angeles, CA, June 2018. To keep pace with Moore’s law, chip designers have focused on increasing the number of cores per chip rather than single core performance. In turn, modern jobs are often designed to run on any number of cores. However, to effectively leverage these multi-core chips, one must address the question of how many cores to assign to each job. Given that jobs receive sublinear speedups from additional cores, there is an obvious tradeoff: allocating more cores to an individual job reduces the job’s runtime, but in turn decreases continued on page 6 5 R E CE N T PU BLICATIONS continued from page 5 the efficiency of the overall system. We ask how the system should schedule jobs across cores so as to minimize the mean response time over a stream of incoming jobs. To answer this question, we develop an analytical model of jobs running on a multi-core machine. We prove that EQUI, a policy which continuously divides cores evenly across jobs, is optimal when all jobs follow a single speedup curve and have exponentially distributed sizes. EQUI requires jobs to change their level of parallelization while they run. Since this is not possible for all workloads, we consider a class of “fixed-width” policies, which choose a single level of parallelization, k, to use for all jobs. We prove that, surprisingly, it is possible to achieve EQUI’s performance without requiring jobs to change their levels of parallelization by using the optimal fixed level of parallelization, k*. We also show how to analytically derive the optimal k* as a function of the system load, the speedup curve, and the job size distribution. In the case where jobs may follow different speedup curves, finding a good scheduling policy is even more challenging. In particular, we find that policies like EQUI which performed well in the case of a single speedup function now perform poorly. 
We propose a very simple policy, GREEDY*, which performs near-optimally when compared to the numerically-derived optimal policy.

3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty
Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch & Gregory R. Ganger
EuroSys '18, April 23–26, 2018, Porto, Portugal. Supersedes CMU-PDL-17-107, Nov. 2017.

The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. But, existing schedulers use single-point estimates (e.g., mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles with 8–23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime histories and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma greatly outperforms a state-of-the-art scheduler that uses point estimates from a state-of-the-art predictor; in fact, the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best effort jobs.

LHD: Improving Cache Hit Rate by Maximizing Hit Density
Nathan Beckmann, Haoxian Chen & Asaf Cidon
15th USENIX Symposium on Networked Systems Design and Implementation (NSDI '18). April 9–11, 2018, Renton, WA.

Cloud application performance is heavily reliant on the hit rate of datacenter key-value caches. Key-value caches typically use least recently used (LRU) as their eviction policy, but LRU's hit rate is far from optimal under real workloads. Prior research has proposed many eviction policies that improve on LRU, but these policies make restrictive assumptions that hurt their hit rate, and they can be difficult to implement efficiently.

[Figure: Relative cache size needed to match LHD's hit rate on different traces. LHD requires roughly one-fourth of LRU's capacity, and roughly half of that of prior eviction policies.]

We introduce least hit density (LHD), a novel eviction policy for key-value caches. LHD predicts each object's expected hits-per-space-consumed (hit density), filtering objects that contribute little to the cache's hit rate. Unlike prior eviction policies, LHD does not rely on heuristics, but rather rigorously models objects' behavior using conditional probability to adapt its behavior in real time. To make LHD practical, we design and implement RankCache, an efficient key-value cache based on memcached. We evaluate RankCache and LHD on commercial memcached and enterprise storage traces, where LHD consistently achieves better hit rates than prior policies. LHD requires much less space than prior policies to match their hit rate, on average 8X less than LRU and 2–3X less than recently proposed policies.
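To make the hit-density idea concrete, the toy Python sketch below ranks cached objects by expected hits per unit of space-time and evicts the lowest ranked one. The per-object estimates are made-up stand-ins, not LHD's actual conditional-probability predictor.

```python
# Toy illustration of "hit density": expected hits per byte of space occupied over time.
def hit_density(expected_hits, size_bytes, expected_lifetime_s):
    return expected_hits / (size_bytes * expected_lifetime_s)

cache = {
    # name: (expected future hits, size in bytes, expected remaining lifetime in seconds)
    "thumb.jpg":   (5.0,    20_000,    60.0),
    "session:42":  (0.2,       200,   600.0),
    "video.chunk": (1.0, 4_000_000,    30.0),
}

# Evict whichever object contributes the least hit density, rather than the least recently used.
victim = min(cache, key=lambda k: hit_density(*cache[k]))
print("evict:", victim)   # the large, rarely re-referenced video chunk loses despite being recent
```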
Moreover, RankCache requires no synchronization in the common case, improving request throughput at 16 threads by 8X over LRU and by 2X over CLOCK.

RECENT PUBLICATIONS, continued

Tributary: Spot-dancing for Elastic Services with Latency SLOs
Aaron Harlap, Andrew Chung, Alexey Tumanov, Gregory R. Ganger & Phillip B. Gibbons
Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-102, Jan. 2018.

The Tributary elastic control system embraces the uncertain nature of transient cloud resources, such as AWS spot instances, to manage elastic services with latency SLOs more robustly and more cost-effectively. Such resources are available at lower cost, but with the proviso that they can be preempted en masse, making them risky to rely upon for business-critical services. Tributary creates models of preemption likelihood and exploits the partial independence among different resource offerings, selecting collections of resource allocations that will satisfy SLO requirements and adjusting them over time as client workloads change. Although Tributary's collections are often larger than required in the absence of preemptions, they are cheaper because of both lower spot costs and partial refunds for preempted resources. At the same time, the often-larger sets allow unexpected workload bursts to be handled without SLO violation. Over a range of web service workloads, we find that Tributary reduces cost for achieving a given SLO by 81–86% compared to traditional scaling on non-preemptible resources and by 47–62% compared to the high-risk approach of the same scaling with spot resources.

MLtuner: System Support for Automatic Machine Learning Tuning
Henggang Cui, Gregory R. Ganger & Phillip B. Gibbons
arXiv:1803.07445v1 [cs.LG], 20 Mar 2018.

MLtuner automatically tunes settings for training tunables, such as the learning rate, the momentum, the mini-batch size, and the data staleness bound, that have a significant impact on large-scale machine learning (ML) performance. Traditionally, these tunables are set manually, which is unsurprisingly error prone and difficult to do without extensive domain knowledge. MLtuner uses efficient snapshotting, branching, and optimization-guided online trial-and-error to find good initial settings as well as to re-tune settings during execution. Experiments show that MLtuner can robustly find and re-tune tunable settings for a variety of ML applications, including image classification (for 3 models and 2 datasets), video classification, and matrix factorization. Compared to state-of-the-art ML auto-tuning approaches, MLtuner is more robust for large problems and over an order of magnitude faster.

Addressing the Long-Lineage Bottleneck in Apache Spark
Haoran Wang, Jinliang Wei & Garth Gibson
Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-101, January 2018.

Apache Spark employs lazy evaluation [11, 6]; that is, in Spark, a dataset is represented as a Resilient Distributed Dataset (RDD), and a single-threaded application (driver) program simply describes transformations (RDD to RDD), referred to as lineage [7, 12], without performing distributed computation until output is requested. The lineage traces computation and dependency back to external (and assumed durable) data sources, allowing Spark to opportunistically cache intermediate RDDs, because it can recompute everything from external data sources.
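As a small illustration of the lazy evaluation and lineage growth described above, the hypothetical PySpark snippet below uses a placeholder input path and a placeholder update step; it is not code from the technical report.

```python
# Hypothetical PySpark sketch: transformations are recorded lazily, and only an action runs them.
from pyspark import SparkContext

sc = SparkContext("local[2]", "lineage-demo")
ratings = sc.textFile("/tmp/ratings.csv")              # external, durable data source (placeholder path)
parsed = ratings.map(lambda line: line.split(","))     # transformation only: nothing executes yet
bad = parsed.filter(lambda fields: len(fields) != 3)   # still lazy; the lineage keeps growing

print(bad.count())   # the action forces the driver to turn the lineage into a DAG for the workers

# An iterative job re-derives its dataset each iteration, so lineage depth grows with iterations:
model = parsed
for _ in range(100):
    model = model.map(lambda fields: fields)           # stand-in for one convergent update step
# The lineage is now >100 transformations deep; constructing and broadcasting that DAG on every
# action is the bottleneck described above. sc.setCheckpointDir("/tmp/ckpt") followed by
# model.checkpoint() would truncate the lineage at the cost of a write to stable storage.
print(model.count())
```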
To initiate computation on worker machines, the driver process constructs a directed acyclic graph (DAG) representing computation and dependency according to the requested RDD’s lineage. Then the driver broadcasts this DAG to all involved workers requesting they execute their portion of the result RDD. When a requested RDD has a long lineage, as one would expect from iterative convergent or streaming applications [9, 15], constructing and broadcasting computational dependencies can become a significant bottleneck. For example, when solving matrix factorization using Gemulla’s iterative convergent The Tributary Architecture. SPRING 2018 continued on page 20 7 A W AR D S & OTHER PDL NE W S April 2018 Andy Pavlo Receives 2018 Joel & Ruth Spira Teaching Award The School of Computer Science honored outstanding faculty and staff members April 5 during the annual Founder’s Day ceremony in Rashid Auditorium. It was the seventh year for the event and was hosted by Dean Andrew Moore. Andy Pavlo, Assistant Professor in the Computer Science Department (CSD), was the winner of the Joel and Ruth Spira Teaching Award, sponsored by Lutron Electronics Co. of Coopersburg, Pa., in honor of the company’s founders and the inventor of the electronic dimmer switch. -- CMU SCS news, April 5, 2018 is given annually to individuals who demonstrate an “ongoing commitment to furthering privacy policy, promoting recognition of privacy issues and advancing the growth and visibility of the privacy profession.” Cranor helped develop and is now co-director of CMU’s MSIT-Privacy Engineering master’s degree program as well as director of the CyLab Usable Privacy and Security Laboratory. --CMU Piper, April 5, 2018 April 2018 Welcome Baby Nora! Pete and Laura Losi, and Grandma Karen Lindenfelser are thrilled to announce Nora Grace joined big sister Layla Anne and big cousin Landon Thomas to become a family of four (five if you count, Rudy, the granddog). Nora was born Friday the 13th at 11:50 am at 7 lbs 19.5 inches. April 2018 Lorrie Cranor Receives IAPP Leadership Award Lorrie Cranor has received the 2018 Leadership Award from The International Association of Privacy Professionals (IAPP). Cranor, a professor in the Institute for Software Research and the Department of Engineering and Public Policy, accepted the award at the IAPP’s Global Privacy Summit on March 27. “Lorrie Cranor, for 20 years, has been a leading voice and a leader in the privacy field,” said IAPP President and CEO Trevor Hughes. “She developed some of the earliest privacy enhancing technologies, she developed a groundbreaking program at Carnegie Mellon University to create future generations of privacy engineers, and she has been a steadfast supporter, participant and leader of the field of privacy for that entire time. Her merits as recipient for our privacy leadership award are unimpeachable. She’s as great a person as we have in our world.” The IAPP Leadership Award 8 turn to full-time teaching and research. “We are all excited about Srini Seshan’s new role as head of CSD,” said School of Computer Science Dean Andrew Moore. “He is an outstanding researcher and teacher, and I’m confident that his expanded role in leadership will help the department reach even greater heights.” Seshan joined the CSD faculty in 2000, and served as the department’s associate head for graduate education from 2011 to 2015. His research focuses on improving the design, performance and security of computer networks, including wireless and mobile networks. 
He earned his bachelor’s, master’s and doctoral degrees in computer science at the University of California, Berkeley. He worked as a research staff member at IBM’s T.J. Watson Research Center for five years before joining Carnegie Mellon. --CMU Piper, April 5, 2018 March 2018 Andy Pavlo Wins Google Faculty Research Award April 2018 Srinivasan Seshan Appointed Head of CSD Srinivasan Seshan has been appointed head of the Computer Science Department (CSD), effective July 1. He succeeds Frank Pfenning, who will re- The CMU Database Group and the PDL are pleased to announce that Prof. Andy Pavlo has won a 2018 Google Faculty Research Award. This award was for his research on automatic database management systems. Andy was one of a total 14 faculty members at Carnegie Mellon University selected for this award. The Google Faculty Research Awards is an annual open call for proposals on computer science and related topics such as machine learning, machine perception, natural language processing, and quantum computing. Grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google researchers and engineers. continued on page 9 THE PDL PACKET A W A RD S & O THE R PD L N E W S continued from page 8 This round received 1033 proposals covering 46 countries and over 360 universities from which 152 were chosen to fund. The subject areas that received the most support this year were human computer interaction, machine learning, machine perception, and systems. -- Google and CMU Database Group News, March 20, 2018 February 2018 Lorrie Cranor Wins Top SIGCHI Award Lorrie Cranor, a professor in the Institute for Software Research and the Department of Engineering and Public Policy, is this year’s recipient of the Social Impact Award from the Association for Computing Machinery Special Interest Group on Computer Human Interaction (SIGCHI). The Social Impact Award is given to mid-level or senior individuals who promote the application of humancomputer interaction research to pressing social needs and includes an honorarium of $5,000, the opportunity to give a talk about the awarded work at the CHI conference, and lifetime invitations to the annual SIGCHI award banquet. “Lorrie’s work has had a huge impact on the ability of non-technical users to protect their security and privacy through her user-centered approach to security and privacy research and development of numerous tools and technologies,” said Blase Ur, who prepared Lorrie’s nomination. Ur is a former Ph.D. student of Lorrie’s, and is now an assistant professor at the University of Chicago. In addition to Ur, three former students from Cranor’s CyLab Usable Privacy and Security Lab – Michelle SPRING 2018 Mazurek, Florian Schaub and Yang Wang – supported Lorrie’s nomination. “All four of us are currently assistant professors, spread out across the United States,” said Ur, who received his doctorate degree in 2016. “In addition to this impact on end users, the four of us who jointly nominated her have also benefitted greatly from her mentorship.” A full summary of this year’s SIGCHI award recipients can be found on the organization’s website. -- info from Cylab News, Daniel Tkacik, Feb. 23, 2018 very snuggly addition to their family. Sebastian Alexander Andersen-Fuchs was born December 11, 2017, at 11:47 am at 8lb 8oz and 21” long. Mom and baby are healthy, and Aria is very excited to be a big sister. 
February 2018 Andy Pavlo Awarded a Sloan Fellowship December 2017 Mor Harchol-Balter and Onur Mutlu Fellows of the ACM “The Sloan Research Fellows represent the very best science has to offer,” said Sloan President Adam Falk. “The brightest minds, tackling the hardest problems, and succeeding brilliantly — fellows are quite literally the future of 21st century science.” Andrew Pavlo, an assistant professor of computer science, specializes in the study of database management systems, specifically main memory systems, non-relational systems (NoSQL), transaction processing systems (NewSQL) and large-scale data analytics. He is a member of the Database Group and the Parallel Data Laboratory. He joined the Computer Science Department in 2013 after earning a Ph.D. in computer science at Brown University. He won the 2014 Jim Gray Doctoral Dissertation Award from the Association for Computing Machinery’s (ACM) Special Interest Group on the Management of Data. -- Carnegie Mellon University News, Feb. 15, 2018 Congratulations to Mor (Professor of CS) and Onur (adjunct Professor of ECE), who have been made Fellows of the ACM. From the ACM website: “To be selected as a Fellow is to join our most renowned member grade and an elite group that represents less than 1 percent of ACM’s overall membership,” explains ACM President Vicki L. Hanson. “The Fellows program allows us to shine a light on landmark contributions to computing, as well as the men and women whose hard work, dedication, and inspiration are responsible for groundbreaking work that improves our lives in so many ways.” Mor was selected “for contributions to performance modeling and analysis of distributed computing systems.” Onur, who is now at ETH Zurich, was chosen for “contributions to computer architecture research, especially in memory systems.” --with info from www.acm.org December 2017 Welcome Baby Sebastian! In not-unexpected news, David, Erica and big sister Aria are delighted to announce the arrival of a squirmy and continued on page 10 9 A W AR D S & OTHER PDL NE W S continued from page 9 November 2017 Welcome Baby Will! Kevin Hsieh and his wife would like share the news of their new baby! Will was born on November 15, 2017 at 11:15am (not a typo...). He was born at 6lb 7oz and 20’’ long. Since then, he has been growing very well and keeping his family busy. October 2017 Welcome Baby Jonas! Jason & Chien-Chiao Boles are excited to announce the arrival of their son Jonas at 7:42pm, October 18th. Jonas was born a few weeks early — a surprise for us all. Everyone is doing well so far. October 2017 Lorrie Cranor Awarded FORE Systems Chair of Computer Science We are very pleased to announce that, in addition to a long list of accomplishments, which has included a term as the Chief Technologist of the Federal Trade Commission, Lorrie Cranor has been made the FORE Systems Professor of Computer Science and Engineering & Public Policy at CMU. 10 Lorrie provided information that “the founders of FORE Systems, Inc. established the FORE Systems Professorship in 1995 to support a faculty member in the School of Computer Science. The company’s name is an acronym formed by the initials of the founders’ first names. Before it was acquired by Great Britain’s Marconi in 1998, FORE created technology that allows computer networks to link and transfer information at a rapid speed. Ericsson purchased much of Marconi in 2006.” The chair was previously held by CMU University Professor Emeritus, Edmund M. Clarke. 
September 2017 Garth Gibson to Lead New Vector Institute for AI in Toronto In January of 2 0 1 8 , P D L’s founder, Garth Gibson, became President and CEO of the Vector Institute for AI in Toronto. Vector’s website states that “Vector will be a leader in the transformative field of artificial intelligence, excelling in machine and deep learning — an area of scientific, academic, and commercial endeavour that will shape our world over the next generation.” Frank Pfenning, Head of the Department of Computer Science, notes that “this is a tremendous opportunity for Garth, but we will sorely miss him in the multiple roles he plays in the department and school: Professor (and all that this entails), Co-Director of the MCDS program, and Associate Dean for Masters Programs in SCS.” We are sad to see him go and will miss him greatly, but the opportunities presented here for world level innovation are tremendous and we wish him all the best. June 2017 Satya Honored for Creation of Andrew File System The Association for Computing Machinery has named the developers of CMU’s pioneering Andrew File System (AFS) the recipients of its prestigious 2016 Software System Award. AFS was the first distributed file system designed for tens of thousands of machines, and pioneered the use of scalable, secure and ubiquitous access to shared file data. To achieve the goal of providing a common shared file system used by large networks of people, AFS introduced novel approaches to caching, security, management and administration. The award recipients, including CS Professor Mahadev Satyanarayanan, built the Andrew File System in the 1980s while working as a team at the Information Technology Center (ITC) — a partnership between Carnegie Mellon and IBM. The ACM Software System Award is presented to an institution or individuals recognized for developing a software system that has had a lasting influence, reflected in contributions to concepts, in commercial acceptance, or both. AFS is still in use today as both an open-source system and as a file system in commercial applications. It has also inspired several cloud-based storage applications. Many universities integrated AFS before it was introduced as a commercial application. -- Byron Spice, The Piper, June 1, 2017 THE PDL PACKET M ASSIVE INDE X E D D I RE CTO RI E S I N D E LT A -F S continued from page 1 SPRING 2018 4096 Query Time (sec) 1024 Baseline 256 DeltaFS 64 16 4 1 245x 665x 532x 625x 992x 2221x 4049x 5112x 496 992 1984 3968 7936 16368 32736 49104 0.25 0.0625 0.015625 Simulation Size (M Particles) (a) Query time 15 Output Size ( TiB ) To improve the performance of applications with small I/O access patterns similar to VPIC, we propose an Indexed Massive Directory — a new technique for indexing data in-situ as it is written to storage. In-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files with the goal of efficiently recalling data written to the same file without requiring any timeconsuming data post-processing steps to reorganize it. This greatly improves the readback performance of applications, at the price of small overheads associated with partitioning and indexing the data during writing. We achieve this through a memory-efficient indexing mechanism for reordering and indexing data, and a log-structured storage layout to pack small writes into large log objects, all while ensuring compute node resources are used frugally. 
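The toy Python sketch below illustrates that write path at a very small scale: records are hash-partitioned, appended to packed in-memory logs, and indexed as they arrive, so one particle's trajectory can be read back without a global sort. The record format, partitioning scheme, and in-memory index here are assumptions for illustration; the real system persists indexed log objects in the underlying storage.

```python
# Toy sketch of in-situ partitioning and indexing of per-particle records (illustrative only).
import struct
from collections import defaultdict

RECORD = struct.Struct("<q4f")   # particle id plus a few fields (~40B per record in the real format)

class ToyIndexedDirectory:
    def __init__(self, num_partitions=4):
        self.logs = [bytearray() for _ in range(num_partitions)]            # packed log objects
        self.index = [defaultdict(list) for _ in range(num_partitions)]     # particle id -> offsets

    def append(self, pid, values):
        part = hash(pid) % len(self.logs)       # partition (shuffle) step performed as data is written
        off = len(self.logs[part])
        self.logs[part] += RECORD.pack(pid, *values)
        self.index[part][pid].append(off)       # tiny index entry, flushed alongside the log in practice

    def trajectory(self, pid):
        part = hash(pid) % len(self.logs)
        for off in self.index[part][pid]:       # touch only this particle's records
            yield RECORD.unpack_from(self.logs[part], off)

d = ToyIndexedDirectory()
for step in range(3):
    for pid in range(1000):
        d.append(pid, (float(step), 0.0, 0.0, 0.0))
print(list(d.trajectory(42)))                   # three records recovered with no global sort
```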
mode of VPIC. For each timestep, 40 bytes of data is produced per particle representing the particle's spatial location, velocity, energy, etc. We refer to the entire particle data written at the same timestep as a frame, because frame data is often used by domain scientists to construct false-color movies of the simulation state over time. Large-scale VPIC simulations have been conducted with up to trillions of particles, generating terabytes of data for each frame. Domain scientists are often interested in a tiny subset of particles with specific characteristics, such as high energy, that is not known until the simulation ends. All data for each such particle is gathered for further analysis, such as visualizing its trajectory through space over time. Unfortunately, particle data within a frame is written out of order, since output order depends on the particles' spatial location. Therefore, in order to locate individual particles' data over time, all output data must be sorted before they can be analyzed.

For scientists working with VPIC, it would be significantly easier programmatically to create a separate file for each particle, and append a 40-byte data record on each timestep. This would reduce analysis queries to sequentially reading the contents of a tiny number of particle files. Attempting to do this in today's parallel file systems, however, would be disastrous for performance. Expecting existing HPC storage stacks and file systems to adapt to scientific needs such as this one, however, is lunacy. Parallel file systems are designed to be long-running, robust services that work across applications. They are typically kernel resident, mainly developed to manage the hardware, and primarily optimized for large sequential data access. DeltaFS aims to provide this file-per-particle representation to applications, while ensuring that storage hardware is utilized to its full performance potential. A comparison of the file-per-process (current state-of-the-art) and file-per-particle (DeltaFS) representations is shown in Figure 1.

We evaluated the efficiency of the Indexed Massive Directory on LANL's Trinity hardware (Figure 2). By applying in-situ partial sorting of VPIC's particle output, we demonstrated over 5000x speedup in reading a single particle's trajectory from a 48-billion particle simulation output using only a single CPU core, compared to post-processing the entire dataset (10TiB) using the same amount of CPU cores as the original simulation. This speedup increases with simulation scale, while the total memory used for partial sort is fixed at 3% of the memory available to the simulation code. The cost of this read acceleration is the increased work in the in-situ pipeline and the additional storage capacity dedicated to storing the indexes. These results are encouraging, as they indicate that the output write buffering stage of the software-defined storage stack can be leveraged for one or more forms of efficient in-situ analysis, and can be applied to more kinds of query workloads. For more information, please see [3] or visit our project page at www.pdl.cmu.edu/DeltaFS/

[Figure 2(b): Output size (TiB) vs. simulation size (M particles), Baseline vs. DeltaFS; DeltaFS output is roughly 108% of the baseline across scales.]
Baseline 160 DeltaFS 120 1.29x 80 1.56x 9.63x 4.78x 2.42x 496 992 1984 40 0 3968 7936 Simulation Size (M Particles) (c) Frame write time Figure 2: Results from real VPIC simulation runs with and without DeltaFS at L ANL Trinity computer. References [1] Zheng, Q., Ren, K., Gibson, G., Settlemyer, B. W., and Grider, G. DeltaFS: Exascale file systems scale better without dedicated servers. In Proceedings of the 10th Parallel Data Storage Workshop (PDSW 15), pp. 1–6. [2] Byna, S., Sisneros, R., Chadalavada, K., and Koziol, Q. Tuning parallel I/O on blue waters for writing 10 trillion particles. In Cray User Group (CUG) (2015). [3] Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth Gibson, Chuck Cranor, Brad Settlemyer, Gary Grider, Fan Guo. Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory. PDSW-DISCS 2017, Denver, CO, November 2017. 11 3S IGMA Jun Woo Park, Greg Ganger and PDL 3Sigma Group 3Sigma: Distribution-Based Cluster Scheduling for Runtime Uncertainty Knowledge of the runtimes of these pending jobs has been identified as a powerful building block for modern cluster schedulers. With it, a scheduler can pack jobs more aggressively in a cluster’s resource assignment plan, for instance by allowing a latency-sensitive best-effort job to run before a highpriority batch job provided that the priority job will still meet its deadline. Runtime knowledge allows a scheduler to determine whether it is better to start a job immediately on suboptimal machine types with worse expected performance, wait for the jobs currently occupying the preferred machines to finish, or to preempt them. Exploiting job runtime knowledge leads to better, more robust scheduler decisions than relying on hard-coded assumptions. In most cases, the job runtime estimates are based on previous runtimes observed for similar jobs (e.g., from the same user or by the same periodic job script). When such estimates are accurate, the schedulers relying on them outperform those using other approaches. However, we find that estimate errors, while expected in large, multiuse clusters, cover an unexpectedly larger range. Applying a state-of-the12 20 SLO Miss(%) Modern cluster schedulers face a daunting task. Modern clusters support a diverse mix of activities, including exploratory analytics, software development and test, scheduled content generation, and customer-facing services [2]. Pending work is typically mapped to heterogeneous resources to satisfy deadlines for business-critical jobs, minimize delays for interactive best-effort jobs, maximize efficiency, and so on. Cluster schedulers are expected to make that happen. profiles as compared to having perfect estimates. 25 15 10 5 0 3Sigma Point Point PerfEst RealEst Prio Figure 1: Comparison of 3Sigma with three other scheduling approaches w.r.t. SLO (deadline) miss rate, for a mix of SLO and best effort jobs derived from the Google cluster trace [2] on a 256-node cluster. 3Sigma, despite estimating runtime distributions online with imperfect knowledge of job classification, approaches the performance of a hypothetical scheduler using perfect runtime estimates (PointPerfEst). Full historical runtime distributions and mis-estimation handling helps 3Sigma outperform PointRealEst, a stateof-the-art point-estimate-based scheduler. The value of exploiting runtime information, when done well, is confirmed by comparison to a conventional priority-based approach (Prio). 
art ML-based predictor [1] to three real-world traces, including the wellstudied Google cluster trace [2] and new traces from data analysis clusters used at a hedge fund and a scientific site, shows good estimates in general (e.g., 77–92% within a factor of two of the actual runtime and most much closer). Unfortunately, 8–23% are not within that range, and some are off by an order of magnitude or more. Thus, a significant percentage of runtime estimates will be well outside the error ranges previously reported. Worse, we find that schedulers relying on runtime estimates cope poorly with such error profiles. Comparing the middle two bars of Fig. 1 shows one example of how much worse a state-of-the-art scheduler does with real estimate error Our 3Sigma cluster scheduling system uses all of the relevant runtime history for each job rather than just a point estimate derived from it. Instead, it uses expected runtime distributions (e.g., the histogram of observed runtimes), taking advantage of the much richer information (e.g., variance, possible multi-modal behaviors, etc.) to make more robust decisions. The first bar of Fig. 1 illustrates 3Sigma’s efficacy. By considering the range of possible runtimes for a job, and their likelihoods, 3Sigma can explicitly consider the various potential outcomes from each possible plan and select a plan based on optimizing the expected outcome. For example, the predicted distribution for one job might have low variance, indicating that the scheduler can be aggressive in packing it in, whereas another job’s high variance might suggest that it should be scheduled early (relative to its deadline). 3Sigma similarly exploits the runtime distribution to adaptively address the problem of point over-estimates, which may suggest that the scheduler will avoid scheduling a job based on the likelihood of missing its deadline. In application, 3Sigma replaces the scheduling component of a cluster manager (e.g. YARN). The cluster manager remains responsible for job and resource life-cycle management. Job requests are received asynchronously by 3Sigma from the cluster manager (Step 1 of Fig. 2). As is typical for such systems, the specification of the request includes a number of attributes, such as (1) the name of the job to be run, (2) the type of job to be run (e.g. MapReduce), (3) the user submitting the job, and (4) a specification of the resources requested. continued on page 13 THE PDL PACKET 3SIGMA continued from page 12 4. Measured runtime 1. Job submission 2. Job submission + distribution 3σSched Scheduling Option Generator SortV Resources John Expert selector Cluster Manager Feature history 3σPredict Time Optimization Compiler Optimization Solver 3. Job placement Figure 2: End-to-end system integration The role of the predictor component 3σPredict is to provide the core scheduler with a probability distribution of the execution time of the submitted job. 3σPredict does this by maintaining a history of previously executed jobs, identifying a set of jobs that, based on their attributes, are similar to the current job and deriving the runtime distribution the selected jobs’ historical runtimes (Step 2 of Fig. 2). Given a distribution of expected job runtimes and request specifications, the core scheduler, 3σSched decides which jobs to place on which resources and when. The scheduler evaluates the expected utility of each option and the expected resource consumption and availability over the scheduling horizon. 
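As a minimal illustration of valuing one placement option against a runtime distribution rather than a point estimate, consider the Python sketch below; the utility function, histogram, and deadline are invented for illustration and are much simpler than 3σSched's actual valuations and solver.

```python
# Toy sketch: expected utility of starting an SLO job at a given time, using a runtime histogram.
def expected_utility(runtime_hist, start_time, deadline, value=1.0):
    """runtime_hist: list of (runtime_seconds, probability) pairs drawn from similar past jobs."""
    return sum(p * (value if start_time + r <= deadline else 0.0) for r, p in runtime_hist)

history = [(600, 0.5), (1200, 0.3), (4000, 0.2)]   # multi-modal runtimes of "similar" historical jobs

print(expected_utility(history, start_time=0,    deadline=3600))   # start now on a busy machine: 0.8
print(expected_utility(history, start_time=2500, deadline=3600))   # wait for a preferred machine: 0.5
# A single point estimate (e.g., the 1460s mean) hides the 20% chance of a 4000s run that can never
# meet this deadline; the full distribution exposes that risk and lets the scheduler plan around it.
```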
Valuations and computed resource capacity are then compiled into an optimization problem, which is solved by an external solver. 3σSched translates the solution into an updated schedule and submits the schedule to the cluster manager (Step 3 of Fig. 2). On completion, the job’s actual runtime is recorded by 3σPredict (along with the attribute information from the job) and incorporated into the job history for future predictions (Step 4 of Fig. 2). Full system and simulation experiSPRING 2018 ments with production-derived workloads demonstrate 3Sigma’s effectiveness. Using its imperfect but automatically-generated history-based runtime distributions, 3Sigma outperforms both a state-of-the-art point-estimatebased scheduler and a priority-based (runtime-unaware) scheduler, especially for mixes of deadline-oriented jobs and latency-sensitive jobs on heterogeneous resources. 3Sigma simultaneously provides higher (1) SLO attainment for deadline-oriented jobs and (2) cluster goodput (utilization). Our evaluation of 3Sigma, yielded five key takeaways. First, 3Sigma achieves significant improvement over the stateof-the-art in SLO miss rate, best-effort job goodput, and best-effort latency in a fully-integrated real cluster deployment, approaching the performance of the unrealistic PointPerfEst in SLO miss rate and BE latency. Second, all of the 3σSched component features are important, as seen via a piecewise benefit attribution. Third, estimated distributions are beneficial in scheduling even if they are somewhat inaccurate, and such inaccuracies are better handled by distribution-based scheduling than point-estimate-based scheduling. In fact, experiments with trace-derived workloads both on a real 256-node cluster and in simulation demonstrate that 3Sigma’s distribution-based scheduling greatly outperforms a state-of-the-art pointestimate scheduler, approaching the performance of a hypothetical scheduler operating with perfect runtime estimates. Fourth, 3Sigma performs well (i.e., comparably to PointPerfEst) under a variety of conditions, such as varying cluster load, relative SLO job deadlines, and prediction inaccuracy. Fifth, we show that the 3Sigma components (3σPredict and 3σSched) can scale to >10000 nodes. Overall, we see that 3Sigma robustly exploits runtime distributions to improve SLO attainment and best-effort performance, dealing gracefully with the complex runtime variations seen in real cluster environments. For more information, please see [3] or visit www.pdl.cmu.edu/TetriSched/ References [1] Alexey Tumanov, Angela Jiang, Jun Woo Park, Michael A. Kozuch, and Gregory R. Ganger. 2016. JamaisVu: Robust Scheduling with AutoEstimated Job Runtimes. Technical Report CMU-PDL-16-104. Carnegie Mellon University. [2] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. 2012. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proc. of the 3nd ACM Symposium on Cloud Computing (SOCC ’12). [3] Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch, Gregory R. Ganger. 3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty. EuroSys ’18, April 23–26, 2018, Porto, Portugal. 13 D E FE N SES & PROPOSA L S DISSERTATION ABSTRACT: Architectural Techniques for Improving NAND Flash Memory Reliability Yixin Luo Carnegie Mellon University, SCS PhD Defense — February 9, 2018 Raw bit errors are common in NAND flash memory and will increase in the future. 
These errors reduce flash reliability and limit the lifetime of a flash memory device. This dissertation improves flash reliability with a multitude of low-cost architectural techniques. We show that NAND flash memory reliability can be improved at low cost and with low performance overhead by deploying various architectural techniques that are aware of higher-level application behavior and underlying flash device characteristics. This dissertation analyzes flash error characteristics and workload behavior through rigorous experimental characterization and designs new flash controller algorithms that use the insights gained from our analysis to improve flash reliability at low cost. We investigate four novel directions. (1) We propose a new technique called WARM that improves flash lifetime by 12.9 times by managing flash retention differently for write-hot data and write-cold data. (2) We propose a new framework that learns an online flash channel model for each chip and enables four new flash controller algorithms to improve flash write endurance by up to 69.9%. (3) We identify three new error characteristics in 3D NAND flash memory through comprehensive experimental characterization of real 3D NAND chips, and propose four new techniques that mitigate these new errors and improve 3D NAND raw bit error rate by up to 66.9%. (4) We propose a new technique called HeatWatch that improves 3D NAND lifetime by 3.85 times by utilizing the self-healing effect to mitigate retention errors in 3D NAND.

Industry guests and CMU folks boarding the bus to head to Bedford Springs for the PDL Retreat.

Greg Ganger, PDL alum Michael Abd-El-Malek (Google), and Bill Courtright enjoy social time at the PDL Retreat.

DISSERTATION ABSTRACT: Fast Storage for File System Metadata

Kai Ren
Carnegie Mellon University, SCS
PhD Defense — August 8, 2017

In an era of big data, the rapid growth of data that many companies and organizations produce and manage continues to drive efforts to improve the scalability of storage systems. The number of objects present in storage systems continues to grow, making metadata management critical to the overall performance of file systems. Many modern parallel applications are shifting toward shorter durations and larger degrees of parallelism. Such trends cause storage systems to experience more diverse, metadata-intensive workloads. The goal of this dissertation is to improve metadata management in both local and distributed file systems. The dissertation focuses on two aspects. One is to improve the out-of-core representation of file system metadata by exploring the use of log-structured multi-level approaches to provide a unified and efficient representation for different types of secondary storage devices (e.g., traditional hard disks and solid state disks). We have designed and implemented TableFS and its improved version SlimFS, which are 50% to 10x faster than traditional Linux file systems. The other aspect is to demonstrate that such a representation can also be flexibly integrated with many namespace distribution mechanisms to scale the metadata performance of distributed file systems, and to provide better support for a variety of big data applications in data center environments. Our distributed metadata middleware IndexFS can help improve metadata performance for PVFS, Lustre and HDFS by scaling to as many as 128 metadata servers.
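To give a flavor of the log-structured representation described above, the sketch below stores file system metadata as key-value pairs keyed by (parent inode, entry name) so that a directory's entries sort together and can be read with a range scan. It is an illustrative toy under assumed names and key formats, not the actual TableFS/SlimFS or IndexFS code:

```python
# Minimal sketch of storing file system metadata in an ordered key-value store
# (illustrative only): keys are "(parent inode)/(name)", values are attributes.
import json

class MetadataKV:
    def __init__(self):
        self.store = {}            # stands in for an LSM-tree (e.g., LevelDB)
        self.next_ino = 2          # inode 1 is the root directory

    def _key(self, parent_ino, name):
        return f"{parent_ino:016x}/{name}"

    def create(self, parent_ino, name, is_dir=False):
        ino = self.next_ino
        self.next_ino += 1
        attrs = {"ino": ino, "is_dir": is_dir, "size": 0}
        self.store[self._key(parent_ino, name)] = json.dumps(attrs)
        return ino

    def lookup(self, parent_ino, name):
        val = self.store.get(self._key(parent_ino, name))
        return json.loads(val) if val is not None else None

    def readdir(self, parent_ino):
        prefix = f"{parent_ino:016x}/"
        # In an LSM-tree this is a range scan over the directory's key prefix.
        return sorted(k[len(prefix):] for k in self.store if k.startswith(prefix))

fs = MetadataKV()
d = fs.create(1, "data", is_dir=True)
fs.create(d, "run.log")
print(fs.lookup(1, "data"), fs.readdir(d))
```

Because small metadata mutations become sequential writes into the log-structured store, this layout suits both hard disks and SSDs, which is the property the dissertation exploits.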
DISSERTATION ABSTRACT: Enabling Data-Driven Optimization of Quality of Experience in Internet Applications

Junchen Jiang
Carnegie Mellon University, SCS
PhD Defense — June 23, 2017

Today's Internet has become an eyeball economy dominated by applications such as video streaming and VoIP. With most applications relying on user engagement to generate revenues, maintaining high user-perceived Quality of Experience (QoE) has become crucial to ensure high user engagement. For instance, one short buffering interruption leads to 39% less time spent watching videos and causes significant revenue losses for ad-based video sites. Despite increasing expectations for high QoE, existing approaches fall short of achieving the QoE needed by today's applications. They either require costly re-architecting of the network core, or use suboptimal endpoint-based protocols to react to dynamic Internet performance based on limited knowledge of the network.

In this thesis, I present a new approach, which is inspired by the recent success of data-driven approaches in many fields of computing. I will demonstrate that data-driven techniques can improve Internet QoE by utilizing a centralized real-time view of performance across millions of endpoints (clients). I will focus on two fundamental challenges unique to this data-driven approach: the need for expressive models to capture complex factors affecting QoE, and the need for scalable platforms to make real-time decisions with fresh data from geo-distributed clients. Our solutions address these challenges in practice by integrating several domain-specific insights in networked applications with machine learning algorithms and systems, and achieve better QoE than using many standard machine learning solutions. I will present end-to-end systems that yield substantial QoE improvement and higher user engagement for video streaming and VoIP. Two of my projects, CFA and VIA, have been used in industry by Conviva and Skype, companies that specialize in QoE optimization for video streaming and VoIP, respectively.

Shinya Matsumoto (Hitachi) talks about his company's research on "Risk-aware Data Replication against Widespread Disasters" at the PDL retreat industry poster session.

DISSERTATION ABSTRACT: Understanding and Improving the Latency of DRAM-Based Memory Systems

Kevin K. Chang
Carnegie Mellon University, ECE
PhD Defense — May 5, 2017

Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts and the emergence of increasingly more data-intensive and latency-critical applications further stress the importance of providing low-latency memory accesses.

In this dissertation, we identify three main problems that contribute significantly to long latency of DRAM accesses. To address these problems, we present a series of new techniques. Our new techniques significantly improve both system performance and energy efficiency.
We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency tradeoff to improve energy efficiency.

First, while bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and high energy consumption. This dissertation introduces a new DRAM design, Low-cost Inter-linked SubArrays (LISA), which provides fast and energy-efficient bulk data movement across subarrays in a DRAM chip. We show that the LISA substrate is very powerful and versatile by demonstrating that it efficiently enables several new architectural mechanisms, including low-latency data copying, reduced DRAM access latency for frequently-accessed data, and reduced preparation latency for subsequent accesses to a DRAM bank.

Saurabh Kadekodi discusses his research on "Aging Gracefully with Geriatrix: A File System Aging Suite" at a PDL retreat poster session.

Second, DRAM needs to be periodically refreshed to prevent data loss due to leakage. Unfortunately, while DRAM is being refreshed, a part of it becomes unavailable to serve memory requests, which degrades system performance. To address this refresh interference problem, we propose two access-refresh parallelization techniques that enable more overlapping of accesses with refreshes inside DRAM, at the cost of very modest changes to the memory controllers and DRAM chips. These two techniques together achieve performance close to an idealized system that does not require refresh.

Third, we find, for the first time, that there is significant latency variation in accessing different cells of a single DRAM chip due to the irregularity in the DRAM manufacturing process. As a result, some DRAM cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation and use a fixed latency value based on the slowest cell across all DRAM chips. To exploit latency variation within the DRAM chip, we experimentally characterize and understand the behavior of the variation that exists in real commodity DRAM chips. Based on our characterization, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism to reduce DRAM latency by categorizing the DRAM cells into fast and slow regions, and accessing the fast regions with a reduced latency, thereby improving system performance significantly. Our extensive experimental characterization and analysis of latency variation in DRAM chips can also enable development of other new techniques to improve performance or reliability.

Jiri Schindler (HPE), Bruce Wilson (Broadcom) and Rajat Kateja discuss PDL research at a retreat poster session.

Fourth, this dissertation, for the first time, develops an understanding of the latency behavior due to another important factor—supply voltage, which significantly impacts DRAM performance, energy consumption, and reliability. We take an experimental approach to understanding and exploiting the behavior of modern DRAM chips under different supply voltage values. Our detailed characterization of real commodity DRAM chips demonstrates that memory access latency reduces with increasing supply voltage.
Based on our characterization, we propose Voltron, a new mechanism that improves system energy efficiency by dynamically adjusting the DRAM supply voltage based on a performance model. Our extensive experimental data on the relationship between DRAM supply voltage, latency, and reliability can further enable the development of other new mechanisms that improve latency, energy efficiency, or reliability.

The key conclusion of this dissertation is that augmenting the DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips, together lead to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and detailed experimental data on real commodity DRAM chips presented in this dissertation will enable the development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.

THESIS PROPOSAL: Towards Space-Efficient High-Performance In-Memory Search Structures

Huanchen Zhang, SCS
April 30, 2018

This thesis seeks to address the challenge of building space-efficient yet high-performance in-memory search structures, including indexes and filters, to allow more efficient use of memory in OLTP databases. We show that we can achieve this goal by first designing fast static structures that leverage succinct data structures to approach the information-theoretic optimum in space, and then using the "hybrid index" architecture to obtain dynamicity with bounded and modest cost in space and performance.

To obtain space-efficient yet high-performance static data structures, we first introduce the Dynamic-to-Static rules that present a systematic way to convert existing dynamic structures to smaller immutable versions. We then present the Fast Succinct Trie (FST) and its application, the Succinct Range Filter (SuRF), to show how to leverage theories on succinct data structures to build static search structures that consume space close to the information-theoretic minimum while performing comparably to uncompressed indexes. To support dynamic operations such as inserts, deletes, and updates, we introduce the dual-stage hybrid index architecture that preserves the space efficiency brought by a compressed static index, while amortizing its performance overhead on dynamic operations by applying modifications in batches. In the proposed work, we seek opportunities to further shrink the size of in-memory indexes by co-designing the indexes with the in-memory tuple storage. We also propose to complete the hybrid index work by extending the techniques to support concurrent indexes.

Bill Bolosky (Microsoft Research) talks about his company's work on exciting new projects at the PDL retreat industry poster session.

THESIS PROPOSAL: Efficient Networked Systems for Datacenter Fabrics with RPCs

Anuj Kalia, SCS
March 23, 2018

Datacenter networks have changed radically in recent years. Their bandwidth and latency have improved by orders of magnitude, and advanced network devices such as NICs with Remote Direct Memory Access (RDMA) capabilities and programmable switches have been deployed. The conventional wisdom is that to best use fast datacenter networks, distributed systems must be redesigned to offload processing from server CPUs to network devices. In this dissertation, we show that conventional, non-offloaded designs offer
better or comparable performance for a wide range of datacenter workloads, including key-value stores, distributed transactions, and highly-available replicated services. We present the following principle: The physical limitations of networks must inform the design of high-performance distributed systems. Offloaded designs often require more network round trips than conventional CPU-based designs, and therefore have fundamentally higher latency. Since they require more network packets, they also have lower throughput. Realizing the benefits of this principle requires fast networking software for CPUs. To this end, we undertake a detailed exploration of datacenter network capabilities, CPU-NIC interaction over the system bus, and NIC hardware architecture. We use insights from this study to create high-performance remote procedure call implementations for use in distributed systems with active end host CPUs. We demonstrate the effectiveness of this principle through the design and evaluation of four distributed in-memory systems: a key-value cache, a networked sequencer, an online transaction processing system, and a state machine replication system. We show that our designs often simultaneously outperform the competition in performance, scalability, and simplicity.

THESIS PROPOSAL: Design & Implementation of a Non-Volatile Memory Database Management System

Joy Arulraj, SCS
December 7, 2017

For the first time in 25 years, a new non-volatile memory (NVM) category is being created that is two orders of magnitude faster than current durable storage media. This will fundamentally change the dichotomy between volatile memory and durable storage in DB systems. The new NVM devices are almost as fast as DRAM, but all writes to them are potentially persistent even after power loss. Existing DB systems are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many components of legacy database systems are unnecessary and will degrade the performance of data intensive applications.

This dissertation explores the implications of NVM for database systems. It presents the design and implementation of Peloton, a new database system tailored specifically for NVM. We focus on three aspects of a database system: (1) logging and recovery, (2) storage management, and (3) indexing. Our primary contribution in this dissertation is the design of a new logging and recovery protocol, called write-behind logging, that improves the availability of the system by more than two orders of magnitude compared to the ubiquitous write-ahead logging protocol. Besides improving availability, we found that write-behind logging improves the space utilization of the NVM device and extends its lifetime. Second, we propose a new storage engine architecture that leverages the durability and byte-addressability properties of NVM to avoid unnecessary data duplication. Third, the dissertation presents the design of a latch-free range index tailored for NVM that supports near-instantaneous recovery without requiring special-purpose recovery code.

Joan Digney and Garth Gibson celebrate 25 years of PDL research and retreats.

THESIS PROPOSAL: STRADS: A New Distributed Framework for Scheduled Model-Parallel Machine Learning

Jin Kyu Kim, SCS
May 15, 2017

Machine learning (ML) methods are used to analyze data collected from various sources.
As the problem size grows, we turn to distributed parallel computation to complete ML training in a reasonable amount of time. However, naive parallelization of ML algorithms often hurts the effectiveness of parameter updates due to the dependency structure among model parameters, and a subset of model parameters often bottlenecks the completion of ML algorithms due to their uneven convergence rates. In this proposal, I propose two efforts: 1) STRADS, which improves training speed by an order of magnitude, and 2) STRADS-AP, which makes parallel ML programming easier.

In STRADS, I will first present the scheduled model-parallel approach with two specific scheduling schemes: 1) model parameter dependency checking to avoid updating dependent parameters concurrently; and 2) parameter prioritization to give more update chances to the parameters far from their convergence points. To efficiently run the scheduled model-parallel approach in a distributed system, I implement a prototype framework called STRADS. STRADS improves the parameter update throughput by pipelining iterations and overlapping update computations with network communication for parameter synchronization. With ML scheduling and system optimizations, STRADS improves ML training time by an order of magnitude. However, these performance gains come at the cost of extra programming burden when writing ML schedules. In STRADS-AP, I will present a high-level programming library and a system infrastructure that automates ML scheduling. The STRADS-AP library consists of three programming constructs: 1) a set of distributed data structures (DDS); 2) a set of functional-style operators; and 3) an imperative-style loop operator. Once an ML programmer writes an ML program using the STRADS-AP library APIs, the STRADS-AP runtime automatically parallelizes the user program over a cluster, ensuring data consistency.

Yixin Luo and Michael Kuchnik, ready to discuss their research on "Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives" and "Machine Learning Based Feature Tracking in HPC Simulations" at a PDL retreat poster session.

Dana Van Aken presents her research on "Automatic Database Management System Tuning Through Large-scale Machine Learning" at the PDL retreat.

THESIS PROPOSAL: Novel Computational Techniques for Mapping Next-Generation Sequencing Reads

Hongyi Xin, SCS
May 31, 2017

DNA read mapping is an important problem in Bioinformatics. With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of many medical and genetic applications critically depends on computational methods to process the enormous amount of sequence data quickly and accurately. However, due to the repetitive nature of the human genome and limitations of the sequencing technology, current read mapping methods still fall short of achieving both high performance and high sensitivity. In this proposal, I break down the DNA read mapping problem into four subproblems: intelligent seed extraction, efficient filtration of incorrect seed locations, high-performance extension, and accurate and efficient read cloud mapping.
I provide novel computational techniques for each subproblem, including: 1) a novel seed selection algorithm that optimally divides a read into low-frequency seeds; 2) a novel SIMD-friendly bit-parallel filtering algorithm that quickly estimates if two strings are highly similar; 3) a generalization of a state-of-the-art approximate string matching algorithm that measures genetic similarities with more realistic metrics; and 4) a novel mapping strategy that utilizes the characteristics of a new sequencing technology, read cloud sequencing, to map NGS reads with higher accuracy and efficiency.

ALUMNI NEWS

Hugo Patterson (Ph.D., ECE '98)

We are pleased to pass on the news that Datrium (www.datrium.com/), where Hugo is a co-founder, won Gold in Search Storage's 2017 Product of the Year. "Datrium impresses judges and wins top honors with its DVX storage architecture, designed to sidestep latency and deliver performance and speed at scale." http://bit.ly/2Cl2mAR

Hugo received his Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, where he was a charter student in the PDL. He was advised by Garth Gibson and his Ph.D. research focused on informed prefetching and caching. He was named a distinguished alumnus of the PDL in 2007.

Ted Wong (Ph.D., CS '04)

Ted joined 23andMe (www.23andme.com) as a Senior Software Engineer with the Machine Learning Engineering group back in January 2018, and reports he is incredibly happy to be there. He also wants to mention that they are hiring! 23andMe is a personal genomics and biotechnology company based in Mountain View, California. The company is named for the 23 pairs of chromosomes in a normal human cell.

NEW PDL FACULTY & STAFF

Gauri Joshi

The PDL would like to welcome Gauri Joshi to our family! Gauri is an Assistant Professor at CMU in the Department of Electrical and Computer Engineering. She is interested in stochastic modeling and analysis that provides sharp insights into the design of computing systems. Her favorite tools include probability, queueing, coding theory and machine learning.

Until August 2017 Gauri was a Research Staff Member at IBM T. J. Watson in Yorktown Heights, NY. In June 2016 she completed her PhD at MIT, working with Prof. Gregory Wornell and Prof. Emina Soljanin. Before that, Gauri spent five years at IIT Bombay, where she completed a dual degree (B.Tech + M.Tech) in Electrical Engineering. She also spent several summers interning at Google, Bell Labs, and Qualcomm.

Currently, Gauri is working on several projects. These include one on Distributed Machine Learning. In large-scale machine learning, training is performed by running stochastic gradient descent (SGD) in a distributed fashion using a central parameter server and multiple servers (learners). Using asynchronous methods to alleviate the problem of stragglers, the research goal is to design a distributed SGD algorithm that strikes the best trade-off between the training time and errors in the trained model. Her project on Straggler Replication in Parallel Computing develops insights into the best relaunching time, and the number of replicas to relaunch, to reduce latency without a significant increase in computing costs in jobs with hundreds of parallel tasks, where the slowest task becomes the bottleneck. Unlike traditional file transfer where only total delay matters, Streaming Communication requires fast and in-order delivery of individual packets to the user.
This project analyzes the trade-off between throughput and the in-order delivery delay, and in particular how it is affected by the frequency of feedback to the source, and proposes a simple combination of repetition and greedy linear coding that achieves a close-to-optimal throughput-delay trade-off.

Rashmi Vinayak

We would also like to welcome Rashmi Vinayak! Rashmi is an assistant professor in the Computer Science Department at Carnegie Mellon University. She received her PhD in the EECS department at UC Berkeley in 2016, and was a postdoctoral researcher at AMPLab/RISELab and BLISS. Her dissertation received the Eli Jury Award 2016 from the EECS department at UC Berkeley for outstanding achievement in the area of systems, communications, control, or signal processing. Rashmi is the recipient of the IEEE Data Storage Best Paper and Best Student Paper Awards for the years 2011/2012. She is also a recipient of the Facebook Fellowship 2012-13, the Microsoft Research PhD Fellowship 2013-15, and the Google Anita Borg Memorial Scholarship 2015-16. Her research interests lie in building high-performance and resource-efficient big data systems based on theoretical foundations.

A recent project has focused on storage and caching, particularly on fault tolerance, scalability, load balancing, and reducing latency in large-scale distributed data storage and caching systems. She and her colleagues designed coding-theory-based solutions that were shown to be provably optimal. They also built systems and evaluated them on Facebook's data-analytics cluster and on Amazon EC2, showing significant benefits over the state-of-the-art. The solutions are now a part of Apache Hadoop 3.0 and are also being considered by several companies such as NetApp and Cisco.

Rashmi is also interested in machine learning: the research focus here has been on the generalization performance of a class of learning algorithms that are widely used for ranking. She collaborated on designing an algorithm building on top of Multiple Additive Regression Trees, and through empirical evaluation on real-world datasets showed significant improvement over classification, regression, and ranking tasks. This new algorithm is now deployed in production in Microsoft's data-analysis toolbox which powers the Azure Machine Learning product.

Alex Glikson

Alex Glikson joined the Computer Science Department as a staff engineer, after spending the last 14 years at IBM Research in Israel, where he has been leading a number of research and development projects in the area of systems management and cloud infrastructure. Alex is interested in resource and workload management in cloud computing environments, recently focusing on 'Function-as-a-Service' platforms, infrastructure for Deep Learning workloads, and the combination of the two.

RECENT PUBLICATIONS

continued from page 7

algorithm [3], and taking tens of data passes to converge, each data pass is slowed down by 30-40% relative to the prior pass, so the eighth data pass is 8.5X slower than the first. The current practice to avoid such a performance penalty is to frequently checkpoint to a durable storage device, which truncates lineage size. Checkpointing as a performance speedup is difficult for a programmer to anticipate and fundamentally contradicts Spark's philosophy that the working set should stay in memory and not be replicated across the network. Since Spark caches intermediate RDDs, one solution is to cache constructed DAGs and broadcast only new DAG elements.
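The following toy sketch illustrates the caching idea just described (the function names are invented for illustration; this is not Spark's implementation): workers remember which DAG elements they have already received, so each iteration only the newly added elements are broadcast.

```python
# Illustrative sketch (not Spark's code) of incremental DAG shipping: cache the
# set of already-broadcast lineage/DAG node IDs and send only the new nodes.
def diff_dag(full_dag, shipped_ids):
    """Return only the DAG nodes the workers have not seen yet."""
    return {nid: node for nid, node in full_dag.items() if nid not in shipped_ids}

def broadcast_new_nodes(full_dag, shipped_ids, send):
    delta = diff_dag(full_dag, shipped_ids)
    send(delta)                      # e.g., a broadcast RPC to all workers
    shipped_ids.update(delta)        # remember which node IDs have been shipped
    return len(delta)

shipped = set()
dag = {0: "load", 1: "map"}
print(broadcast_new_nodes(dag, shipped, send=lambda d: None))   # ships 2 new nodes
dag[2] = "reduce"                                               # next iteration grows the DAG
print(broadcast_new_nodes(dag, shipped, send=lambda d: None))   # ships only 1 new node
```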
Our experiments show that with this optimization, per-iteration execution time is almost independent of growing lineage size and comparable to the execution time provided by optimal checkpointing. On 10 machines using 240 cores in total, without checkpointing we observed a 3.4X speedup when solving matrix factorization and a 10X speedup for a streaming application provided in the Spark distribution.

3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning

Hyeontaek Lim, David G. Andersen & Michael Kaminsky

arXiv:1802.07389v1 [cs.LG], 21 Feb 2018.

The performance and efficiency of distributed machine learning (ML) depends significantly on how long it takes for nodes to exchange state changes. Overly-aggressive attempts to reduce communication often sacrifice final model accuracy and necessitate additional ML techniques to compensate for this loss, limiting their generality. Some attempts to reduce communication incur high computation overhead, which makes their performance benefits visible only over slow networks.

Point-to-point tensor compression for two example layers in 3LC: (a) gradient pushes from workers to servers; (b) model pulls from servers to workers.

We present 3LC, a lossy compression scheme for state change traffic that strikes a balance between multiple goals: traffic reduction, accuracy, computation overhead, and generality. It combines three new techniques—3-value quantization with sparsity multiplication, quartic encoding, and zero-run encoding—to leverage the strengths of quantization and sparsification techniques and avoid their drawbacks. It achieves a data compression ratio of up to 39–107X, almost the same test accuracy of trained models, and high compression speed. Distributed ML frameworks can employ 3LC without modifications to existing ML algorithms. Our experiments show that 3LC reduces the wall-clock training time of ResNet-110–based image classifiers for CIFAR-10 on a 10-GPU cluster by up to 16–23X compared to TensorFlow's baseline design.

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Mohammad Sadrosadati, Amirhossein Mirhosseini, Seyed Borna Ehsani, Hamid Sarbazi-Azad, Mario Drumond, Babak Falsafi, Rachata Ausavarungnirun & Onur Mutlu

ASPLOS '18, March 24–28, 2018, Williamsburg, VA, USA.

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working-set within each interval.
The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach & Onur Mutlu

ASPLOS '18, March 24–28, 2018, Williamsburg, VA, USA.

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, have made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate.

We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.

Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.

MASK design overview.

Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability

Maciej Besta, Syed Minhaj Hassan, Sudhakar Yalamanchili, Rachata Ausavarungnirun, Onur Mutlu & Torsten Hoefler

ASPLOS '18, March 24–28, 2018, Williamsburg, US.

Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area efficiency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.

Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach & Onur Mutlu

Proc. of the International Symposium on Microarchitecture (MICRO), Cambridge, MA, October 2017.

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging.
These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page.

Page allocation and coalescing behavior of GPU memory managers: (a) state-of-the-art, (b) Mosaic. 1: The GPU memory manager allocates base pages from both Applications 1 and 2. 2: As a result, the memory manager cannot coalesce the base pages into a large page without first migrating some of the base pages, which would incur a high latency. 3: Mosaic uses Contiguity-Conserving Allocation (CoCoA) — a memory allocator which provides a soft guarantee that all of the base pages within the same large page range belong to only a single application — and 4: In-Place Coalescer, a page size selection mechanism that merges base pages into a large page immediately after allocation.

In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide.

In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages. We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes.
Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7% on average, respectively, coming within 6.8% and 15.4% of the performance of an ideal TLB where all TLB requests are hits.

Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory

Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth Gibson, Chuck Cranor, Brad Settlemyer, Gary Grider & Fan Guo

PDSW-DISCS 2017: 2nd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, held in conjunction with SC17, Denver, CO, Nov. 2017.

In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient indexing mechanism for reordering and indexing writes, and a log-structured storage layout to pack small data into large log objects, all while ensuring compute node resources are used frugally. We demonstrate the efficiency of this indexing mechanism through VPIC, a plasma simulation code that scales to trillions of particles. With the Indexed Massive Directory, we modify VPIC to create a file for each particle to receive writes of that particle's simulation output data. Dynamically indexing the directory's underlying storage keyed on particle filename allows us to achieve a 5000x speedup for a single particle trajectory query, which requires reading all data for a single particle. This speedup increases with application scale, while the overhead remains stable at 3% of the available memory.

Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons & Todd C. Mowry

Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory).

To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus.

DRAM cell and sense amplifier.

Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that the large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.

Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content

Samira Khan, Chris Wilkerson, Zhe Wang, Alaa R. Alameldeen, Donghyuk Lee & Onur Mutlu

Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve the reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigate the failing cells using higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires the knowledge of DRAM internals specific to each DRAM chip.
As the internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge. In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. As detecting failures with runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., the write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle. Our evaluation shows that, compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65-74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core system and a 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.

Bigger, Longer, Fewer: What Do Cluster Jobs Look Like Outside Google?

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson, Elisabeth Baseman & Nathan DeBardeleben

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-104, October 2017.

In the last 5 years, a set of job scheduler logs released by Google has been used in more than 400 publications as the token cloud workload. While this is an invaluable trace, we think it is crucial that researchers evaluate their work under other workloads as well, to ensure the generality of their techniques. To aid them in this process, we analyze three new traces consisting of job scheduler logs from one private and two HPC clusters. We further release the two HPC traces, which we expect to be of interest to the community due to their unique characteristics. The new traces represent clusters 0.3-3 times the size of the Google cluster in terms of CPU cores, and cover a 3-60 times longer time span.

Greg opens the 25th PDL Retreat at the Bedford Springs Resort.

This paper presents an analysis of the differences and similarities between all aforementioned traces. We discuss a variety of aspects: job characteristics, workload heterogeneity, resource utilization, and failure rates. More importantly, we review assumptions from the literature that were originally derived from the Google trace, and verify whether they hold true when the new traces are considered. For those assumptions that are violated, we examine affected work from the literature. Finally, we demonstrate the importance of dataset plurality in job scheduling research by evaluating the performance of JVuPredict, the job runtime estimate module of the TetriSched scheduler, using all four traces.
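As a rough sketch of the kind of cross-trace comparison this analysis performs (the file paths and column names below are invented for illustration and do not correspond to the released traces' actual schema), per-trace job-duration summaries can be computed directly from scheduler logs:

```python
# Hypothetical sketch: compare job-duration distributions across scheduler traces.
import csv
import statistics

def job_durations(path):
    """Read a scheduler log with per-job start/end timestamps (seconds)."""
    durations = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            durations.append(float(row["end_time"]) - float(row["start_time"]))
    return durations

def summarize(name, durations):
    durations = sorted(durations)
    p50 = durations[len(durations) // 2]
    p99 = durations[int(len(durations) * 0.99)]
    print(f"{name}: n={len(durations)} median={p50:.0f}s p99={p99:.0f}s "
          f"mean={statistics.mean(durations):.0f}s")

for name, path in [("google", "google_trace.csv"), ("hpc_a", "hpc_a_trace.csv")]:
    summarize(name, job_durations(path))
```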
Workload Compactor: Reducing Datacenter Cost while Providing Tail Latency SLO Guarantees

Timothy Zhu, Michael A. Kozuch & Mor Harchol-Balter

ACM Symposium on Cloud Computing (SoCC '17), Santa Clara, Oct 2017.

Service providers want to reduce datacenter costs by consolidating workloads onto fewer servers. At the same time, customers have performance goals, such as meeting tail latency Service Level Objectives (SLOs). Consolidating workloads while meeting tail latency goals is challenging, especially since workloads in production environments are often bursty. To limit the congestion when consolidating workloads, customers and service providers often agree upon rate limits. Ideally, rate limits are chosen to maximize the number of workloads that can be co-located while meeting each workload's SLO. In reality, neither the service provider nor customer knows how to choose rate limits. Customers end up selecting rate limits on their own in some ad hoc fashion, and service providers are left to optimize given the chosen rate limits.

Token bucket rate limiters control the rate and burstiness of a stream of requests. When a request arrives at the rate limiter, tokens are used (i.e., removed) from the token bucket to allow the request to proceed. If the bucket is empty, the request must queue and wait until there are enough tokens. Tokens are added to the bucket at a constant rate r up to a maximum capacity as specified by the bucket size b. Thus, the token bucket rate limiter limits the workload to a maximum instantaneous burst of size b and an average rate r.

This paper describes Workload Compactor, a new system that uses workload traces to automatically choose rate limits simultaneously with selecting onto which server to place workloads. Our system meets customer tail latency SLOs while minimizing datacenter resource costs. Our experiments show that by optimizing the choice of rate limits, Workload Compactor reduces the number of required servers by 30-60% as compared to state-of-the-art approaches.

Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo & Onur Mutlu

Proceedings of the IEEE, Volume 105, Issue 9, Sept. 2017.

NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: 1) effective process technology scaling; and 2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to 1) fewer electrons in the flash memory cell floating gate to represent the data; and 2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells.

In this article, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement.
We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including 1) cell-to-cell interference mitigation; 2) optimal multi-level cell sensing; 3) error correction using state-of-the-art algorithms and methods; and 4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve into the future.

A Better Model for Job Redundancy: Decoupling Server Slowdown and Job Size

Kristen Gardner, Mor Harchol-Balter, Alan Scheller-Wolf & Benny Van Houdt

Transactions on Networking, September 2017.

Recent computer systems research has proposed using redundant requests to reduce latency. The idea is to replicate a request so that it joins the queue at multiple servers. The request is considered complete as soon as any one of its copies completes. Redundancy allows us to overcome server-side variability – the fact that a server might be temporarily slow due to factors such as background load, network interrupts, and garbage collection – to reduce response time. In the past few years, queueing theorists have begun to study redundancy, first via approximations, and, more recently, via exact analysis. Unfortunately, for analytical tractability, most existing theoretical analysis has assumed an Independent Runtimes (IR) model, wherein the replicas of a job each experience independent runtimes (service times) at different servers. The IR model is unrealistic and has led to theoretical results which can be at odds with computer systems implementation results.

This paper introduces a much more realistic model of redundancy. Our model decouples the inherent job size (X) from the server-side slowdown (S), where we track both S and X for each job. Analysis within the S&X model is, of course, much more difficult. Nevertheless, we design a dispatching policy, Redundant-to-Idle-Queue (RIQ), which is both analytically tractable within the S&X model and has provably excellent performance.

Utility-Based Hybrid Memory Management

Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang & Onur Mutlu

In Proc. of the IEEE Cluster Conference (CLUSTER), Honolulu, HI, September 2017.

Conceptual example showing that the MLP of a page influences how much effect its migration to fast memory has on the application stall time.

While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy.
Scheduling for Efficiency and Fairness in Systems with Redundancy
Kristen Gardner, Mor Harchol-Balter, Esa Hyytiä & Rhonda Righter
Performance Evaluation, July 2017.

Server-side variability—the idea that the same job can take longer to run on one server than another due to server-dependent factors—is an increasingly important concern in many queueing systems. One strategy for overcoming server-side variability to achieve low response time is redundancy, under which jobs create copies of themselves and send these copies to multiple different servers, waiting for only one copy to complete service. Most of the existing theoretical work on redundancy has focused on developing bounds, approximations, and exact analysis to study the response time gains offered by redundancy. However, response time is not the only important metric in redundancy systems: in addition to providing low overall response time, the system should also be fair in the sense that no job class should have a worse mean response time in the system with redundancy than it did in the system before redundancy is allowed. In this paper we use scheduling to address the simultaneous goals of (1) achieving low response time and (2) maintaining fairness across job classes. We develop new exact analysis for per-class response time under First-Come First-Served (FCFS) scheduling for a general type of system structure; our analysis shows that FCFS can be unfair in that it can hurt non-redundant jobs. We then introduce the Least Redundant First (LRF) scheduling policy, which we prove is optimal with respect to overall system response time, but which can be unfair in that it can hurt the jobs that become redundant. Finally, we introduce the Primaries First (PF) scheduling policy, which is provably fair and also achieves excellent overall mean response time.
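The three policies differ only in how a server orders the job copies waiting in its queue. The sketch below expresses those orderings as sort keys; this is our own illustration of the policies' definitions, not an implementation from the paper.

from dataclasses import dataclass

@dataclass
class Copy:
    job_id: int
    arrival: float      # time this copy joined the server's queue
    n_copies: int       # how many servers hold a copy of the job (1 = non-redundant)
    is_primary: bool    # True for the job's designated primary copy

def fcfs_key(c: Copy):
    # First-Come First-Served: order purely by arrival time.
    return (c.arrival,)

def lrf_key(c: Copy):
    # Least Redundant First: jobs with fewer copies go first; ties broken by arrival.
    return (c.n_copies, c.arrival)

def pf_key(c: Copy):
    # Primaries First: primary copies before redundant extras; ties broken by arrival.
    return (not c.is_primary, c.arrival)

queue = [Copy(1, 0.0, 3, False), Copy(2, 0.1, 1, True), Copy(3, 0.2, 3, True)]
for name, key in [("FCFS", fcfs_key), ("LRF", lrf_key), ("PF", pf_key)]:
    print(name, [c.job_id for c in sorted(queue, key=key)])
# FCFS serves [1, 2, 3]; LRF serves [2, 1, 3]; PF serves [2, 3, 1].

The toy queue already shows the fairness tension: FCFS lets a redundant extra copy (job 1) delay a non-redundant job (job 2), LRF pushes all copies of heavily redundant jobs to the back, and PF delays only the redundant extras while keeping every job's primary copy in arrival order.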
Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM
Rajat Kateja, Anirudh Badam, Sriram Govindan, Bikash Sharma & Greg Ganger
ISCA ’17, June 24-28, 2017, Toronto, ON, Canada.

Figure: Flow chart describing Viyojit’s implementation for tracking dirty pages and enforcing the dirty budget.

Non-Volatile Memories (NVMs) can significantly improve the performance of data-intensive applications. A popular form of NVM is battery-backed DRAM, which is available and in use today, with DRAM’s latency and without the endurance problems of emerging NVM technologies. Modern servers can be provisioned with up to 4 TB of DRAM, and provisioning battery backup to write out such large memories is hard because of the large battery sizes and the added hardware and cooling costs. We present Viyojit, a system that exploits the skew in write working sets of applications to provision substantially smaller batteries while still ensuring durability for the entire DRAM capacity. Viyojit achieves this by bounding the number of dirty pages in DRAM based on the provisioned battery capacity and proactively writing out infrequently written pages to an SSD. Even for write-heavy workloads with less skew than we observe in analysis of real data center traces, Viyojit reduces the required battery capacity to 11% of the original size, with a performance overhead of 7-25%. Thus, Viyojit frees battery-backed DRAM from stunted growth of battery capacities and enables servers with terabytes of battery-backed DRAM.

Litz: An Elastic Framework for High-Performance Distributed Machine Learning
Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson & Eric P. Xing
Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-17-103, June 2017.

Machine Learning (ML) is becoming an increasingly popular application in the cloud and datacenters, inspiring a growing number of distributed frameworks optimized for it. These frameworks leverage the specific properties of ML algorithms to achieve orders of magnitude performance improvements over generic data processing frameworks like Hadoop or Spark. However, they also tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of the multi-tenant environments in which they run. Furthermore, the programming models provided by these frameworks tend to be restrictive, narrowing their applicability even within the sphere of ML workloads. Motivated by these trends, we present Litz, a distributed ML framework that achieves both elasticity and generality without giving up the performance of more specialized frameworks. Litz uses a programming model based on scheduling micro-tasks with parameter server access which enables applications to implement key distributed ML techniques that have recently been introduced. Furthermore, we believe that the union of ML and elasticity presents new opportunities for job scheduling due to dynamic resource usage of ML algorithms. We give examples of ML properties which give rise to such resource usage patterns and suggest ways to exploit them to improve resource utilization in multi-tenant environments. To evaluate Litz, we implement two popular ML applications that vary dramatically in terms of their structure and run-time behavior—they are typically implemented by different ML frameworks tuned for each. We show that Litz achieves competitive performance with the state of the art while providing low-overhead elasticity and exposing the underlying dynamic resource usage of ML applications.
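Before moving on, a minimal sketch of the dirty-budget idea behind Viyojit, described two abstracts above: bound the number of dirty DRAM pages to what the battery can flush, and proactively clean the least-recently-written pages to an SSD. These are not Viyojit's actual data structures; the class, names, and eviction order are illustrative stand-ins.

from collections import OrderedDict

class DirtyBudget:
    """Keep the number of dirty DRAM pages within what the battery can flush on power failure."""
    def __init__(self, budget_pages, flush_to_ssd):
        self.budget = budget_pages          # dirty pages the provisioned battery can cover
        self.dirty = OrderedDict()          # page -> None, ordered by most recent write
        self.flush_to_ssd = flush_to_ssd    # callback that cleans a page by writing it to SSD

    def record_write(self, page):
        # Mark the page dirty and move it to the most-recently-written position.
        self.dirty.pop(page, None)
        self.dirty[page] = None
        # Enforce the budget: proactively clean the least-recently-written pages.
        while len(self.dirty) > self.budget:
            victim, _ = self.dirty.popitem(last=False)
            self.flush_to_ssd(victim)       # once written back, the battery need not cover it

cleaned = []
db = DirtyBudget(budget_pages=2, flush_to_ssd=cleaned.append)
for p in [1, 2, 1, 3, 4]:
    db.record_write(p)
print(sorted(db.dirty), cleaned)   # pages [3, 4] remain dirty; pages [2, 1] were cleaned early

The budget, rather than the total DRAM size, is what the battery must be provisioned for, which is how the system decouples battery capacity from DRAM capacity.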
Workload Analysis and Caching Strategies for Search Advertising Systems
Conglong Li, David G. Andersen, Qiang Fu, Sameh Elnikety & Yuxiong He
SoCC ’17, September 24-27, 2017, Santa Clara, CA, USA.

Figure: Simplified workflow of how the Bing advertising system serves ads to users: candidate selection narrows millions of ads in the pool down to thousands, the scoring step narrows those down to the tens of ads that enter the auction, and the proposed cache sits in front of the scoring step (serving hits directly, inserting or refreshing entries on misses).

Search advertising depends on accurate predictions of user behavior and interest, accomplished today using complex and computationally expensive machine learning algorithms that estimate the potential revenue gain of thousands of candidate advertisements per search query. The accuracy of this estimation is important for revenue, but the cost of these computations represents a substantial expense, e.g., 10% to 30% of the total gross revenue. Caching the results of previous computations is a potential path to reducing this expense, but traditional domain-agnostic and revenue-agnostic approaches to do so result in substantial revenue loss. This paper presents three domain-specific caching mechanisms that successfully optimize for both factors. Simulations on a trace from the Bing advertising system show that a traditional cache can reduce cost by up to 27.7% but has negative revenue impact as bad as −14.1%. On the other hand, the proposed mechanisms can reduce cost by up to 20.6% while capping revenue impact between −1.3% and 0%. Based on Microsoft’s earnings release for FY16 Q4, the traditional cache would reduce the net profit of Bing Ads by $84.9 to $166.1 million in the quarter, while our proposed cache could increase the net profit by $11.1 to $71.5 million.
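To see where both the cost savings and the revenue risk come from, here is a sketch of the kind of traditional, domain-agnostic cache the paper compares against: memoize the expensive scoring result for a query and reuse it for a while. The class, TTL policy, and names are illustrative assumptions, not the paper's mechanisms, which layer advertising-specific admission and refresh decisions on top of this basic idea.

import time

class AdScoreCache:
    """Domain-agnostic baseline: reuse a recent scoring result for an identical query."""
    def __init__(self, score_fn, ttl_seconds=60.0):
        self.score_fn = score_fn        # expensive ML scoring of candidate ads
        self.ttl = ttl_seconds          # staleness bound; stale scores risk lost revenue
        self.entries = {}               # query -> (timestamp, scored ads)

    def get_ads(self, query, candidates):
        now = time.monotonic()
        hit = self.entries.get(query)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]               # hit: scoring cost avoided, result may be stale
        scored = self.score_fn(query, candidates)   # miss: pay the full scoring cost
        self.entries[query] = (now, scored)
        return scored

scorer_calls = []
def expensive_score(query, candidates):
    scorer_calls.append(query)
    return sorted(candidates)           # stand-in for the ML model's ranked ads

cache = AdScoreCache(expensive_score, ttl_seconds=60.0)
cache.get_ads("running shoes", ["adA", "adB"])
cache.get_ads("running shoes", ["adA", "adB"])
print(len(scorer_calls))                # 1: the second identical query was served from cache

Every avoided call is saved cost, but every reused entry risks showing ads whose value has shifted, which is the revenue loss the paper's domain-specific mechanisms are designed to cap.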
Cachier: Edge-caching for Recognition Applications
Utsav Drolia, Katherine Guo, Jiaqi Tan, Rajeev Gandhi & Priya Narasimhan
The 37th IEEE International Conference on Distributed Computing Systems (ICDCS 2017), June 5-8, 2017, Atlanta, GA, USA.

Recognition and perception-based mobile applications, such as image recognition, are on the rise. These applications recognize the user’s surroundings and augment them with information and/or media. These applications are latency-sensitive. They have a soft real-time nature: late results are potentially meaningless. On the one hand, given the compute-intensive nature of the tasks performed by such applications, execution is typically offloaded to the cloud. On the other hand, offloading such applications to the cloud incurs network latency, which can increase the user-perceived latency. Consequently, edge computing has been proposed to let devices offload intensive tasks to edge servers instead of the cloud, to reduce latency. In this paper, we propose a different model for using edge servers. We propose to use the edge as a specialized cache for recognition applications and formulate the expected latency for such a cache. We show that using an edge server like a typical web cache, for recognition applications, can lead to higher latencies. We propose Cachier, a system that uses the caching model along with novel optimizations to minimize latency by adaptively balancing load between the edge and the cloud, by leveraging spatiotemporal locality of requests, using offline analysis of applications, and online estimates of network conditions. We evaluate Cachier for image-recognition applications and show that our techniques yield 3x speed-up in responsiveness, and perform accurately over a range of operating conditions. To the best of our knowledge, this is the first work that models edge servers as caches for compute-intensive recognition applications, and Cachier is the first system that uses this model to minimize latency for these applications.
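The expected-latency formulation referenced above can be sketched as follows: treat the edge as a cache whose "hit" is a confident local recognition, and compare the expected completion time of trying the edge first against offloading directly to the cloud. The functions, constants, and decision rule below are illustrative placeholders, not the paper's exact model.

def expected_latency_edge(p_hit, t_lookup, t_edge_compute, t_network, t_cloud_compute):
    """Expected latency when requests are tried against the edge cache first.
    A miss pays the edge work *and* the full cloud round trip."""
    miss_penalty = t_network + t_cloud_compute
    return t_lookup + t_edge_compute + (1.0 - p_hit) * miss_penalty

def expected_latency_cloud(t_network, t_cloud_compute):
    """Expected latency when everything is offloaded directly to the cloud."""
    return t_network + t_cloud_compute

# The edge only wins while its recognition hit rate is high enough; with spatiotemporal
# locality (nearby users asking about the same objects), p_hit tends to be high.
for p_hit in (0.2, 0.6, 0.9):
    edge = expected_latency_edge(p_hit, t_lookup=5, t_edge_compute=30,
                                 t_network=80, t_cloud_compute=40)
    cloud = expected_latency_cloud(t_network=80, t_cloud_compute=40)
    print(p_hit, round(edge, 1), cloud)   # illustrative latencies in milliseconds

With the made-up numbers above, the edge is slower than the cloud at a 20% hit rate and faster at 60% and 90%, which mirrors the paper's observation that a naive edge cache can hurt: when misses are common, each one pays both the edge and the cloud cost, so the system must adapt what the edge handles to measured hit rates and network conditions.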
Carpool: A Bufferless On-Chip Network Supporting Adaptive Multicast and Hotspot Alleviation
Xiyue Xiang, Wentao Shi, Saugata Ghose, Lu Peng, Onur Mutlu & Nian-Feng Tzeng
In Proc. of the International Conference on Supercomputing (ICS), Chicago, IL, June 2017.

Modern chip multiprocessors (CMPs) employ on-chip networks to enable communication between the individual cores. Operations such as coherence and synchronization generate a significant amount of the on-chip network traffic, and often create network requests that have one-to-many (i.e., a core multicasting a message to several cores) or many-to-one (i.e., several cores sending the same message to a common hotspot destination core) flows. As the number of cores in a CMP increases, one-to-many and many-to-one flows result in greater congestion on the network. To alleviate this congestion, prior work provides hardware support for efficient one-to-many and many-to-one flows in buffered on-chip networks. Unfortunately, this hardware support cannot be used in bufferless on-chip networks, which are shown to have lower hardware complexity and higher energy efficiency than buffered networks, and thus are likely a good fit for large-scale CMPs. We propose Carpool, the first bufferless on-chip network optimized for one-to-many (i.e., multicast) and many-to-one (i.e., hotspot) traffic. Carpool is based on three key ideas: it (1) adaptively forks multicast flit replicas; (2) merges hotspot flits; and (3) employs a novel parallel port allocation mechanism within its routers, which reduces the router critical path latency by 5.7% over a bufferless network router without multicast support. We evaluate Carpool using synthetic traffic workloads that emulate the range of rates at which multithreaded applications inject multicast and hotspot requests due to coherence and synchronization. Our evaluation shows that for an 8×8 mesh network, Carpool reduces the average packet latency by 43.1% and power consumption by 8.3% over a bufferless network without multicast or hotspot support. We also find that Carpool reduces the average packet latency by 26.4% and power consumption by 50.5% over a buffered network with multicast support, while consuming 63.5% less area for each router.

Automatic Database Management System Tuning Through Large-scale Machine Learning
Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon & Bohan Zhang
ACM SIGMOD International Conference on Management of Data, May 14-19, 2017, Chicago, IL, USA.

Figure: Motivating examples – Figs. (a) to (c) show 99th-percentile latency measurements for the YCSB workload running on MySQL (v5.6) using different configuration settings, illustrating (a) dependencies between knobs (buffer pool size versus log file size), (b) continuous settings (buffer pool size), and (c) non-reusable configurations across workloads. Fig. (d) shows the number of tunable knobs provided in MySQL and Postgres releases over time.

Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration “knobs” that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized (i.e., two DBMSs use a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be suboptimal for another). Worse, information about the effects of the knobs typically comes only from (expensive) experience. To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune and tested it on three DBMSs. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert.
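A highly simplified sketch of the three-step pipeline described above: here the "workload mapping" is a nearest-neighbor match on observed metrics, and the "recommendation" simply reuses the best configuration seen for the matched workload. OtterTune itself uses factor analysis, Lasso-based knob ranking, and Gaussian Process regression, so treat this purely as an illustration of the data flow; all names and numbers are invented.

import math

# Repository of previously observed workloads: metrics vector -> best known knob settings.
history = [
    {"metrics": [0.9, 0.2, 120.0], "best_knobs": {"buffer_pool_mb": 2048, "log_file_mb": 512}},
    {"metrics": [0.1, 0.8, 900.0], "best_knobs": {"buffer_pool_mb": 8192, "log_file_mb": 1024}},
]

IMPORTANT_METRICS = [0, 1]   # step 1 (stand-in): keep only the most informative metrics

def distance(a, b):
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in IMPORTANT_METRICS))

def recommend(observed_metrics):
    # Step 2: map the unseen workload to the most similar previously seen workload.
    nearest = min(history, key=lambda h: distance(h["metrics"], observed_metrics))
    # Step 3 (stand-in): transfer that workload's best known configuration.
    return dict(nearest["best_knobs"])

print(recommend([0.85, 0.25, 200.0]))   # -> {'buffer_pool_mb': 2048, 'log_file_mb': 512}

The value of the mapping step is that experience gathered on one tenant's workload can seed the search for a new, similar workload instead of starting the expensive trial-and-error from scratch.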
Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms
Kevin K. Chang, A. Giray Yağlıkçı, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O’Connor, Hasan Hassan & Onur Mutlu
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

The energy consumption of DRAM is a critical concern in modern computing systems. Improvements in manufacturing process technology have allowed DRAM vendors to lower the DRAM supply voltage conservatively, which reduces some of the DRAM energy consumption. We would like to reduce the DRAM supply voltage more aggressively, to further reduce energy. Aggressive supply voltage reduction requires a thorough understanding of the effect voltage scaling has on DRAM access latency and DRAM reliability. In this paper, we take a comprehensive approach to understanding and exploiting the latency and reliability characteristics of modern DRAM when the supply voltage is lowered below the nominal voltage level specified by DRAM standards. Using an FPGA-based testing platform, we perform an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured recently by three major DRAM vendors. We find that reducing the supply voltage below a certain point introduces bit errors in the data, and we comprehensively characterize the behavior of these errors. We discover that these errors can be avoided by increasing the latency of three major DRAM operations (activation, restoration, and precharge). We perform detailed DRAM circuit simulations to validate and explain our experimental findings. We also characterize the various relationships between reduced supply voltage and error locations, stored data patterns, DRAM temperature, and data retention. Based on our observations, we propose a new DRAM energy reduction mechanism, called Voltron. The key idea of Voltron is to use a performance model to determine by how much we can reduce the supply voltage without introducing errors and without exceeding a user-specified threshold for performance loss. Our evaluations show that Voltron reduces the average DRAM and system energy consumption by 10.5% and 7.3%, respectively, while limiting the average system performance loss to only 1.8%, for a variety of memory-intensive quad-core workloads. We also show that Voltron significantly outperforms prior dynamic voltage and frequency scaling mechanisms for DRAM.

Efficient Redundancy Techniques for Latency Reduction in Cloud Systems
Gauri Joshi, Emina Soljanin & Gregory Wornell
ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), Vol. 2, No. 2, May 2017.

In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers and reduce latency. But adding redundancy may result in higher cost of computing resources, as well as an increase in queueing delay due to higher traffic load. This work helps in understanding when and how redundancy gives a cost-efficient reduction in latency. For a general task service time distribution, we compare different redundancy strategies in terms of the number of redundant tasks and the time when they are issued and canceled. We get the insight that the log-concavity of the task service time creates a dichotomy of when adding redundancy helps. If the service time distribution is log-convex (i.e., log of the tail probability is convex), then adding maximum redundancy reduces both latency and cost. And if it is log-concave (i.e., log of the tail probability is concave), then less redundancy, and early cancellation of redundant tasks, is more effective. Using these insights, we design a general redundancy strategy that achieves a good latency-cost trade-off for an arbitrary service time distribution. This work also generalizes and extends some results in the analysis of fork-join queues.
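The log-concavity dichotomy above can be sanity-checked with a small Monte Carlo experiment: issue r copies of a task, take the earliest finisher, and compare latency and total server-time cost for a heavy-tailed versus a light-tailed service time. This ignores queueing entirely, so it only illustrates the intuition, not the paper's analysis; the distributions and numbers are our own choices.

import random

def latency_and_cost(sample_service, r, trials=100_000, seed=1):
    """Mean latency of the fastest of r copies, and mean total server time consumed,
    assuming the remaining copies are canceled the instant the first one finishes."""
    rng = random.Random(seed)
    lat = cost = 0.0
    for _ in range(trials):
        first = min(sample_service(rng) for _ in range(r))
        lat += first
        cost += r * first          # every copy runs until the winner completes
    return lat / trials, cost / trials

# Heavy tail (log-convex-like): usually fast, occasionally very slow.
hyperexp = lambda rng: rng.expovariate(10.0) if rng.random() < 0.9 else rng.expovariate(0.2)
# Light tail (log-concave-like): a mandatory minimum amount of work plus a small extra.
shifted = lambda rng: 1.0 + rng.expovariate(2.0)

for name, dist in [("heavy-tailed", hyperexp), ("light-tailed", shifted)]:
    for r in (1, 2, 4):
        print(name, r, [round(x, 3) for x in latency_and_cost(dist, r)])

In the heavy-tailed case both latency and cost fall as r grows, because a single straggler rarely determines the minimum; in the light-tailed case extra copies mostly burn server time for a modest latency gain, matching the dichotomy described above.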
Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last
Prashanth Menon, Todd C. Mowry & Andrew Pavlo
Proceedings of the VLDB Endowment, Vol. 11, No. 1, 2017.

In-memory database management systems (DBMSs) are a key component of modern on-line analytic processing (OLAP) applications, since they provide low-latency access to large volumes of data. Because disk accesses are no longer the principal bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance. Recent systems have revived an older technique of using just-in-time (JIT) compilation to execute queries as native code instead of interpreting a plan. The state-of-the-art in query compilation is to fuse operators together in a query plan to minimize materialization overhead by passing tuples efficiently between operators. Our empirical analysis shows, however, that more tactful materialization yields better performance. We present a query processing model called “relaxed operator fusion” that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized. This allows the DBMS to take advantage of inter-tuple parallelism inherent in the plan using a combination of prefetching and SIMD vectorization to support faster query execution on data sets that exceed the size of CPU-level caches. Our evaluation shows that our approach reduces the execution time of OLAP queries by up to 2.2X and achieves up to 1.8X better performance compared to other in-memory DBMSs.

EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding
K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica & Kannan Ramchandran
12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), November 2-4, 2016, Savannah, GA.

Data-intensive clusters and object stores are increasingly relying on in-memory object caching to meet the I/O performance demands. These systems routinely face the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. In this paper, we explore an alternative approach using erasure coding. EC-Cache is a load-balanced, low-latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. EC-Cache employs erasure coding by: (i) splitting and erasure coding individual objects during writes, and (ii) late binding, wherein obtaining any k out of (k + r) splits of an object is sufficient, during reads. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.
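A toy version of the EC-Cache write/read path just described, using k data splits plus a single XOR parity split (r = 1) so that the whole thing fits in a few lines. EC-Cache itself uses Reed-Solomon-style codes with configurable k and r; the point here is only the split-then-read-any-k ("late binding") structure, and the padding/rstrip handling is a simplification.

def encode(obj: bytes, k: int):
    """Split an object into k equal splits plus one XOR parity split (r = 1)."""
    obj = obj.ljust(-(-len(obj) // k) * k, b"\0")       # pad to a multiple of k
    size = len(obj) // k
    splits = [bytearray(obj[i * size:(i + 1) * size]) for i in range(k)]
    parity = bytearray(size)
    for s in splits:
        for i, byte in enumerate(s):
            parity[i] ^= byte
    return splits + [parity]                            # k + 1 splits, stored on different servers

def decode(received: dict, k: int) -> bytes:
    """Late binding: reconstruct from any k of the k+1 splits (dict keys are split IDs)."""
    size = len(next(iter(received.values())))
    if all(i in received for i in range(k)):
        data = received
    else:                                               # one data split is slow or lost: rebuild it
        missing = next(i for i in range(k) if i not in received)
        rebuilt = bytearray(size)
        for s in received.values():
            for i, byte in enumerate(s):
                rebuilt[i] ^= byte
        data = dict(received)
        data[missing] = rebuilt
    return b"".join(bytes(data[i]) for i in range(k)).rstrip(b"\0")

splits = encode(b"hello erasure-coded cache", k=4)
print(decode({0: splits[0], 2: splits[2], 3: splits[3], 4: splits[4]}, k=4))  # split 1 never arrived

Because a read can proceed with whichever k splits arrive first, a single overloaded or failed server no longer dictates the read latency, which is where the load-balancing and tail-latency benefits come from.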
Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms
Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri & Onur Mutlu
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

Figure: Design-induced variation due to row organization, showing high-error and low-error regions along a 512-cell bitline relative to the wordline driver and local sense amplifiers, for (a) a conceptual bitline and (b) the open bitline scheme.

Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of variation, such as manufacturing-process- or temperature-induced variation. We empirically demonstrate a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the DRAM chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation. In particular, we show that in most real DRAM chips, cells closer to the peripheral structures can be accessed much faster than cells that are farther. We call this phenomenon design-induced variation in DRAM. Our goals are to i) understand design-induced variation that exists in real, state-of-the-art DRAM chips, ii) exploit it to develop low-cost mechanisms that can dynamically find and use the lowest latency at which to operate a DRAM chip reliably, and, thus, iii) improve overall system performance while ensuring reliable system operation. To this end, we first experimentally demonstrate and analyze design-induced variation in modern DRAM devices by testing and characterizing 96 DIMMs (768 DRAM chips). Our characterization identifies DRAM regions that are vulnerable to errors, if operated at lower latency, and finds consistency in their locations across a given DRAM chip generation, due to design-induced variation. Based on our extensive experimental analysis, we develop two mechanisms that reliably reduce DRAM latency. First, DIVA Profiling uses runtime profiling to dynamically identify the lowest DRAM latency that does not introduce failures. DIVA Profiling exploits design-induced variation and periodically profiles only the vulnerable regions to determine the lowest DRAM latency at low cost. It is the first mechanism to dynamically determine the lowest latency that can be used to operate DRAM reliably. DIVA Profiling reduces the latency of read/write requests by 35.1%/57.8%, respectively, at 55°C. Our second mechanism, DIVA Shuffling, shuffles data such that values stored in vulnerable regions are mapped to multiple error-correcting code (ECC) codewords. As a result, DIVA Shuffling can correct 26% more multi-bit errors than conventional ECC. Combined together, our two mechanisms reduce read/write latency by 40.0%/60.5%, which translates to an overall system performance improvement of 14.7%/13.7%/13.8% (in 2-/4-/8-core systems) across a variety of workloads, while ensuring reliable operation.
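The profiling loop behind DIVA Profiling can be sketched as follows: because design-induced variation makes the error-prone regions predictable, it is enough to test only those vulnerable regions when searching for the lowest safe timing. The test routine, region names, and timing values below are placeholders; real profiling runs against the memory controller and hardware, not a Python function.

def diva_profile(vulnerable_regions, test_region, candidate_timings):
    """Return the lowest candidate timing at which every vulnerable region passes.
    test_region(region, timing) should return True if the region read back without errors."""
    for timing in sorted(candidate_timings):            # try the most aggressive timings first
        if all(test_region(region, timing) for region in vulnerable_regions):
            return timing                               # lowest latency with zero observed errors
    return max(candidate_timings)                       # fall back to the standard (safe) timing

# Illustrative stand-in: pretend each region has a minimum timing below which it fails.
region_threshold = {"bank0/rows 496-511": 11.5, "bank1/rows 496-511": 10.0}
passes = lambda region, timing: timing >= region_threshold[region]
print(diva_profile(region_threshold.keys(), passes, candidate_timings=[7.5, 10.0, 12.5, 15.0]))  # -> 12.5

Profiling only the vulnerable regions, rather than every row, is what keeps the runtime cost of this search low, and DIVA Shuffling handles the residual risk by spreading the vulnerable bits across multiple ECC codewords.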
YEAR IN REVIEW (continued from page 4)

• Conglong Li presented “Workload Analysis and Caching Strategies for Search Advertising Systems” at SoCC ’17 in Santa Clara, CA.

August 2017

• Kai Ren successfully defended his PhD thesis on “Fast Storage for File System Metadata.”
• Souptik Sen interned with LinkedIn’s Data group in Sunnyvale, working with Venkatesh Iyer and Subbu Sanka on a data tooling library in Scala which converts generic parameterized Hive queries to Spark to create an optimized workflow on LinkedIn’s advertising data pipeline.
• Saurabh Kadekodi interned with Alluxio, Inc. in California, working on packing and indexing in cloud file systems.
• Aaron Harlap interned with Microsoft Research in Seattle, WA, working on “Scaling up Distributed DNN Training.”
• Qing Zheng interned with LANL in Los Alamos, NM, working on exascale file systems.
• Charles McGuffey interned with Google in Sunnyvale, CA, working on cache partitioning systems for Google infrastructure.
• Jinliang Wei interned with Saeed Maleki, Madan Musuvathi and Todd Mytkowicz at Microsoft Research in Redmond, WA, working on parallelizing and scaling out stochastic gradient descent with sequential semantics.

June 2017

• M. Satyanarayanan and colleagues were honored for the creation of the Andrew File System.
• Junchen Jiang successfully defended his PhD dissertation “Enabling Data-Driven Optimization of Quality of Experience in Internet.”
• Rajat Kateja presented “Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM” at ISCA ’17 in Toronto, ON, Canada.
• Utsav Drolia presented “Cachier: Edge-caching for Recognition Applications” at ICDCS ’17 in Atlanta, GA.

May 2017

• Hongyi Xin proposed his dissertation research “Novel Computational Techniques for Mapping Next-Generation Sequencing Reads.”
• Kevin K. Chang successfully defended his PhD research on “Understanding and Improving the Latency of DRAM-Based Memory System.”
• Jin Kyu Kim proposed his PhD research “STRADS: A New Distributed Framework for Scheduled Model-Parallel Machine Learning.”
• Dana Van Aken presented “Automatic Database Management System Tuning Through Large-scale Machine Learning” at SIGMOD ’17 in Chicago, IL.
• 19th annual PDL Spring Visit Day.

2017 PDL Workshop and Retreat.