Paul Gratz

    Industry is moving towards large-scale hardware systems which bundle processor cores, memories, accelerators, etc. via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively easy to integrate into one larger sophisticated system. However, the benefits of this approach come at the cost of new security challenges, especially when integrating chiplets that come from untrusted, or not fully trusted, third-party vendors. In this work, we explore these challenges for modern interposer-based systems of cache-coherent, multi-core chiplets. First, we present basic coherence-oriented hardware Trojan attacks that pose a significant threat to chiplet-based designs and demonstrate how these basic attacks can be orchestrated to pose a significant threat to interposer-based sy...
    Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed to DRAM. The different characteristics of the two memory classes point towards the design of hybrid memory systems containing multiple classes of main memory. In the iterative and incremental development of new architectures, the timeliness of simulation completion is critical to project progression. Hence, a highly efficient simulation method is needed to evaluate the performance of different hybrid memory system designs. Design exploration for hybrid memory systems is challenging, because it requires emulation of the full system stack, including the OS, memory controller, and interconnect. Moreover, benchmark applications for memory performance tests typically have much larger working sets, thus requiring an even longer simulation warm-up period. In this paper, we propose an FPGA-based hybrid memory system emulation platform. We target mobile computing systems, which are sensitive to energy consumption and are likely to adopt NVM for its power efficiency. The focus of our platform is on the design of the hybrid memory system, so we leverage the on-board hard IP ARM processors to enhance simulation performance while improving the accuracy of results. Thus, users can implement their data placement/migration policies with the FPGA logic elements and evaluate new designs quickly and effectively. Results show that our emulation platform provides a speedup of 9280x in simulation time compared to its software counterpart, gem5.
    As core counts increase, lock acquisition and release become even more critical because they lie on the critical path of shared memory applications. In this paper, we show that many applications exhibit regular and repeating lock sharing patterns. Based on this observation, we introduce SpecLock, an efficient hardware mechanism which speculates on the lock acquisition pattern between cores. Upon the release of a lock, the cache line containing the lock is speculatively forwarded to the next consumer of the lock. This forwarding action is performed via a specialized prefetch request and does not require coherence protocol modification. Further, the lock is not speculatively acquired; only the cache line containing the lock variable is placed in the private cache of the predicted consumer. Speculative forwarding serves to hide the remote core's lock acquisition latency. SpecLock is distributed and all predictions are made locally at each core. We show that SpecLock captures 87% of predictable lock patterns correctly and improves performance by an average of 10% with 64 cores. SpecLock incurs a negligible overhead, with a 75% area reduction compared to past work. Compared to two state-of-the-art methods, SpecLock provides a speedup of 8% and 4%, respectively.
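    As a rough illustration of the prediction idea described above, the following Python sketch keeps a small per-lock history of acquirer cores and, on a release, names the core whose private cache should receive the lock's cache line. The names here (SpecLockPredictor, the forwarding message) are hypothetical; the actual mechanism is a small per-core hardware table, not software.

```python
# Illustrative sketch of SpecLock-style next-consumer prediction (names are hypothetical).
from collections import defaultdict

class SpecLockPredictor:
    """Predicts the next consumer core of a lock from its recent sharing pattern."""

    def __init__(self, history_len=2):
        self.history = defaultdict(list)   # lock address -> recent acquirer cores
        self.pattern = {}                  # (lock, releasing core) -> predicted next core
        self.history_len = history_len

    def record_acquire(self, lock_addr, core_id):
        hist = self.history[lock_addr]
        if hist:
            # Learn "after core X held this lock, core Y acquired it next".
            self.pattern[(lock_addr, hist[-1])] = core_id
        hist.append(core_id)
        if len(hist) > self.history_len:
            hist.pop(0)

    def on_release(self, lock_addr, releasing_core):
        """On release, speculatively forward the lock's line via a prefetch only;
        the lock itself is never speculatively acquired."""
        target = self.pattern.get((lock_addr, releasing_core))
        if target is not None and target != releasing_core:
            return f"prefetch line {hex(lock_addr)} into private cache of core {target}"
        return None

# Example: cores 0 and 1 alternate on the same lock.
pred = SpecLockPredictor()
pred.record_acquire(0x1000, 0)
pred.record_acquire(0x1000, 1)
pred.record_acquire(0x1000, 0)
print(pred.on_release(0x1000, releasing_core=0))   # predicts core 1 as next consumer
```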
    Key-value (KV) stores have been widely deployed in a variety of scale-out enterprise applications such as online retail, big data analytics, social networks, etc. Key-Value SSDs (KVSSDs) provide a key-value interface directly from the device, aiming at lowering software overhead and reducing I/O amplification for such applications. In this paper, we present KVRAID, a high-performance, write-efficient erasure coding management scheme for emerging key-value SSDs. The core innovation of KVRAID is to use logical-to-physical key conversion to efficiently pack similar-size KV objects and dynamically manage the membership of erasure coding groups. This design enables packing multiple user objects into a single physical object to reduce object amplification compared to prior works. By applying an out-of-place update technique, KVRAID can significantly reduce I/O amplification compared to state-of-the-art designs. Our experiments show that KVRAID outperforms a state-of-the-art software KV-store with block RAID by 28x in terms of insert throughput, and significantly reduces CPU utilization, tail latency and write amplification. Compared to state-of-the-art erasure coding management for KV devices, KVRAID reduces object amplification by ~2.6x compared to StripeFinder, and reduces I/O amplification by ~9.6x compared to KVMD and StripeFinder for update-intensive workloads.
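    The packing idea can be made concrete with a short sketch: a logical-to-physical key map lets several similar-size user objects share one physical object, and an out-of-place update only changes the map rather than rewriting data in place. The class and parameter names below (PackedKVStore, pack_target) are illustrative assumptions, not the paper's interfaces.

```python
# Hedged sketch of KV object packing behind a logical-to-physical key map.
class PackedKVStore:
    def __init__(self, pack_target=4):
        self.pack_target = pack_target      # user objects per physical object (illustrative)
        self.l2p = {}                       # logical key -> (physical key, index)
        self.physical = {}                  # physical key -> list of (key, value)
        self.pending = []                   # buffered small objects awaiting packing
        self.next_pkey = 0

    def put(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) >= self.pack_target:
            self._flush()

    def _flush(self):
        pkey = f"P{self.next_pkey}"
        self.next_pkey += 1
        self.physical[pkey] = list(self.pending)   # one device write for several objects
        for idx, (k, _) in enumerate(self.pending):
            self.l2p[k] = (pkey, idx)              # out-of-place update: remap, don't rewrite
        self.pending.clear()

    def get(self, key):
        pkey, idx = self.l2p[key]
        return self.physical[pkey][idx][1]

store = PackedKVStore()
for i in range(8):
    store.put(f"user{i}", f"value{i}")
print(store.get("user5"), "| physical objects written:", len(store.physical))
```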
    Shared-memory, multi-threaded applications often require programmers to insert thread synchronization primitives (i.e., locks, barriers, and condition variables) in critical sections to synchronize data access between processes. Scaling performance requires balanced per-thread workloads with little time spent in critical sections. In practice, however, threads often waste time waiting to acquire locks/barriers, leading to thread imbalance and poor performance scaling. Moreover, critical sections often stall data prefetchers that mitigate the effects of waiting by ensuring data is preloaded in core caches when the critical section is done. This paper introduces a pure hardware technique to enable safe data prefetching beyond synchronization points in chip multiprocessors (CMPs). We show that successful prefetching beyond synchronization points requires overcoming two significant challenges in existing techniques. First, typical prefetchers are designed to trigger prefetches based on current misses. Unlike a core running a single-threaded application, a multi-threaded core stalled on a synchronization point does not produce new references to trigger the prefetcher. Second, even if a prefetch were correctly directed to read beyond a synchronization point, it would likely prefetch shared data from another core before this data has been written. This prefetch would be considered "accurate" but highly undesirable because it would lead to three extra "ping-pong" movements due to coherence, costing more latency and energy than without prefetching. We develop a new data prefetcher, Synchronization-aware B-Fetch (SB-Fetch), built as an extension to a previous single-threaded data prefetcher. SB-Fetch addresses both issues for shared-memory, multi-threaded workloads. The novelty in SB-Fetch is that it explicitly issues prefetches for data beyond synchronization points and it distinguishes between data likely and unlikely to incur cache coherence overhead. These two features are directly synergistic since blindly prefetching beyond synchronization is likely to incur coherence penalties. No prior work includes both features. SB-Fetch is evaluated using a representative set of benchmarks from Parsec [4], Rodinia [7], and Parboil [39]. SB-Fetch improves execution time by 12.3% over baseline and 4% over best-of-class prefetching.
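    A minimal sketch of the two ingredients named above: issuing prefetches while a core waits at a synchronization point, and skipping lines likely to be written first by another core. The inputs (a predicted post-synchronization address stream and a set of lines expected to be produced remotely) are hypothetical stand-ins for hardware state, not SB-Fetch's actual structures.

```python
# Hedged sketch: prefetch beyond a synchronization point while avoiding coherence ping-pong.
def prefetch_beyond_sync(post_sync_stream, written_by_others, max_prefetches=8):
    """post_sync_stream: addresses the core is predicted to touch after the sync point.
    written_by_others: lines another core is expected to modify before the sync completes."""
    issued = []
    for addr in post_sync_stream:
        if addr in written_by_others:
            continue                       # prefetching now would steal a dirty line too early
        issued.append(addr)
        if len(issued) >= max_prefetches:
            break
    return issued

# Example: while stalled at a barrier, prefetch only data unlikely to incur coherence traffic.
stream = [0x2000, 0x2040, 0x3000, 0x2080]
shared_dirty = {0x3000}                    # predicted to be produced by another thread
print([hex(a) for a in prefetch_beyond_sync(stream, shared_dirty)])
```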
    While public-key cryptography is essential for secure communications, the energy cost of even the most efficient algorithms based on Elliptic Curve Cryptography (ECC) is prohibitive on many ultra-low-energy devices such as sensor-network nodes and identification tags. Although an abundance of hardware acceleration techniques for ECC have been proposed in the literature, little research has focused on understanding the energy benefits of these techniques. Therefore, we evaluate the energy cost of ECC on several different hardware/software configurations across a range of security levels. Our work comprehensively explores implementations of both GF(p) and GF(2^m) ECC, demonstrating that GF(2^m) provides a 1.31x to 2.11x improvement in energy efficiency over GF(p) on an extended RISC processor. We also show that including a 4KB instruction cache in our system can reduce the energy cost of ECC by as much as 30%. Furthermore, our GF(2^m) coprocessor achieves a 2.8x to 3.61x improvement in energy efficiency compared to instruction set extensions and significantly outperforms prior work.
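    The arithmetic behind the GF(2^m) advantage is that field addition is XOR and multiplication is a carry-less multiply followed by reduction modulo an irreducible polynomial, which maps to cheap shift-and-XOR hardware. Below is a small Python sketch of GF(2^m) multiplication, demonstrated in GF(2^8) with a well-known irreducible polynomial purely for illustration; it is not the paper's coprocessor design.

```python
# Sketch of GF(2^m) multiplication: carry-less multiply with polynomial reduction.
def gf2m_mul(a, b, modulus, m):
    """Multiply a*b in GF(2^m) defined by the irreducible polynomial `modulus`."""
    result = 0
    while b:
        if b & 1:
            result ^= a                    # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1
        if a >> m:                         # reduce whenever the degree reaches m
            a ^= modulus
    return result

# Example in GF(2^8) with the polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).
print(hex(gf2m_mul(0x57, 0x83, 0x11B, 8)))   # 0xc1, a standard GF(2^8) test vector
```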
    This paper presents a control-theoretic approach to optimize the energy consumption of integrated CPU and GPU subsystems for graphics applications. It achieves this via dynamic management of the CPU and GPU frequencies. To this end, we first model the interaction between the GPU and CPU as a queuing system. Second, we formulate a multi-input, multi-output (MIMO) state-space closed-loop controller to ensure robustness and stability. We evaluate this controller on an Intel Baytrail-based Android platform. Experimental evaluations show energy savings of 17.4% in the CPU-GPU subsystem with a low performance impact of 0.9%.
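    For intuition only, the sketch below shows a toy proportional control loop over the CPU-GPU frame queue: when the queue fills, the producing CPU can slow down while the consuming GPU speeds up, and vice versa. The paper's actual controller is a MIMO state-space design; the gain, frequency bounds, and target occupancy here are made-up values.

```python
# Toy proportional controller over the CPU->GPU frame queue (illustrative only).
def control_step(queue_occupancy, target_occupancy, f_cpu, f_gpu,
                 k=0.1, f_min=0.4, f_max=2.0):
    error = queue_occupancy - target_occupancy
    # Queue filling up -> CPU is outpacing the GPU: slow the CPU, speed up the GPU.
    f_cpu = min(f_max, max(f_min, f_cpu - k * error))
    f_gpu = min(f_max, max(f_min, f_gpu + k * error))
    return f_cpu, f_gpu

f_cpu, f_gpu = 1.6, 1.2
for occ in [5, 6, 7, 4, 3]:                 # observed queued frames per control interval
    f_cpu, f_gpu = control_step(occ, target_occupancy=4, f_cpu=f_cpu, f_gpu=f_gpu)
    print(f"CPU {f_cpu:.2f} GHz, GPU {f_gpu:.2f} GHz")
```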
    Hardware prefetching is an effective technique for hiding cache miss latencies in modern processor designs. Prefetcher performance can be characterized by two main metrics that are generally at odds with one another: coverage, the fraction of baseline cache misses which the prefetcher brings into the cache; and accuracy, the fraction of prefetches which are ultimately used. An overly aggressive prefetcher may improve coverage at the cost of reduced accuracy. Thus, performance may be harmed by this over-aggressiveness because many resources are wasted, including cache capacity and bandwidth. An ideal prefetcher would have both high coverage and accuracy. In this paper, we introduce Perceptron-based Prefetch Filtering (PPF) as a way to increase the coverage of the prefetches generated by an underlying prefetcher without negatively impacting accuracy. PPF enables more aggressive tuning of the underlying prefetcher, leading to increased coverage by filtering out the growing numbers of inaccurate prefetches such an aggressive tuning implies. We also explore a range of features to use to train PPF's perceptron layer to identify inaccurate prefetches. PPF improves performance on a memory-intensive subset of the SPEC CPU 2017 benchmarks by 3.78% for a single-core configuration, and by 11.4% for a 4-core configuration, compared to the underlying prefetcher alone.
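    A hedged sketch of the perceptron-filtering idea: each candidate prefetch indexes a few small weight tables with simple features, the weights are summed, and the prefetch is dropped when the sum falls below a threshold; the tables are trained on whether issued prefetches turned out useful. The feature choice, table sizes, and thresholds below are placeholders, not PPF's tuned configuration.

```python
# Illustrative perceptron prefetch filter in the spirit of PPF (parameters are placeholders).
class PerceptronFilter:
    def __init__(self, table_size=256, threshold=0, train_margin=4, w_max=15):
        self.tables = [[0] * table_size for _ in range(3)]   # one weight table per feature
        self.table_size = table_size
        self.threshold = threshold
        self.train_margin = train_margin
        self.w_max = w_max

    def _indices(self, pc, addr, delta):
        feats = (pc, addr >> 6, delta)                        # example features only
        return [f % self.table_size for f in feats]

    def predict(self, pc, addr, delta):
        s = sum(t[i] for t, i in zip(self.tables, self._indices(pc, addr, delta)))
        return s >= self.threshold, s                         # (keep prefetch?, confidence)

    def train(self, pc, addr, delta, useful):
        keep, s = self.predict(pc, addr, delta)
        if keep != useful or abs(s) < self.train_margin:      # train on mispredict or low confidence
            step = 1 if useful else -1
            for t, i in zip(self.tables, self._indices(pc, addr, delta)):
                t[i] = max(-self.w_max, min(self.w_max, t[i] + step))

ppf = PerceptronFilter()
for _ in range(8):
    ppf.train(pc=0x400123, addr=0x80000, delta=64, useful=True)
print(ppf.predict(pc=0x400123, addr=0x80000, delta=64))
```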
    Graphics processing units (GPUs) deploy a large register file (RF) to achieve high compute throughput. This RF, however, consumes a large portion of the total dynamic power in the GPU. Additionally, the RF banks and operand collectors (OCs) are designed with a limited number of ports, causing access serialization and negatively impacting performance. In this work, we introduce CMRC, a coalescing-aware RF organization that takes advantage of the frequent narrow-width data present in general-purpose applications to increase performance and reduce energy for GPGPUs. CMRC is a low-cost, comprehensive approach to register coalescing capable of combining narrow-width read and write accesses from the same or different warp instructions into fewer accesses, reducing port contention and access pressure. On general-purpose applications, CMRC reduces RF accesses by 31.8%, achieves a performance speedup of 16.5%, and reduces overall GPU energy by 32.2% on average, outperforming best-of-class prior work by ~1.8x without requiring compiler support.
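    To make the coalescing intuition concrete, the sketch below greedily packs narrow-width register reads into shared bank accesses; the widths, bank size, and packing rule are simplified assumptions rather than the CMRC microarchitecture.

```python
# Illustrative greedy coalescing of narrow-width register reads into bank accesses.
def coalesce_reads(requests, bank_width=32):
    """requests: list of (register, value_bit_width). Returns groups; each group is one bank access."""
    groups, current, used = [], [], 0
    for reg, width in sorted(requests, key=lambda r: r[1]):
        if used + width <= bank_width:
            current.append(reg)
            used += width
        else:
            groups.append(current)
            current, used = [reg], width
    if current:
        groups.append(current)
    return groups

# Four narrow operands that would normally take four reads fit in two bank accesses.
print(coalesce_reads([("r1", 8), ("r2", 16), ("r3", 8), ("r4", 16)]))
```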
    To achieve high compute performance, graphics processing units (GPUs) provide a large register file and a large number of execution units. However, these design components consume a large portion of the total dynamic power in the GPU, particularly for general-purpose applications. In this paper, we present a low-cost gating scheme to reduce dynamic power consumption in the register file and execution units without impacting performance. The proposed scheme dynamically exploits the frequently occurring zero data values within and across registers in order to gate off register file reads and writes as well as execution units. We find that on general-purpose applications from Rodinia, our low-cost gating scheme can reduce register file reads and writes on average by 35% and 40%, respectively. The register file and execution unit dynamic power are reduced on average by 19% and 13%, respectively. The reduction in total GPU dynamic power ranges from 3% to 19%, with 8% on average, with no performance loss.
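    A small software illustration of the gating idea: a per-register zero flag stands in for the hardware zero-detection logic, so known-zero reads and writes never touch the register banks. Names and structure here are illustrative only, not the paper's design.

```python
# Illustrative zero-value gating: a per-register zero flag replaces full bank accesses.
class ZeroGatedRegfile:
    def __init__(self):
        self.regs = {}
        self.is_zero = {}                   # one bit per register instead of a full read
        self.reads_gated = 0
        self.writes_gated = 0

    def write(self, reg, value):
        if value == 0:
            self.is_zero[reg] = True        # record the zero; skip the array write
            self.writes_gated += 1
        else:
            self.is_zero[reg] = False
            self.regs[reg] = value

    def read(self, reg):
        if self.is_zero.get(reg, False):
            self.reads_gated += 1           # no bank access needed, the value is known zero
            return 0
        return self.regs.get(reg, 0)

rf = ZeroGatedRegfile()
rf.write("r1", 0)
rf.write("r2", 7)
print(rf.read("r1") + rf.read("r2"), "| gated reads:", rf.reads_gated, "gated writes:", rf.writes_gated)
```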
    Key–value (KV) software has proven useful to a wide variety of applications including analytics, time-series databases, and distributed file systems. To satisfy the requirements of diverse workloads, KV stores have been carefully tailored to best match the performance characteristics of underlying solid-state block devices. Emerging KV storage devices are a promising technology for both simplifying the KV software stack and improving the performance of persistent storage-based applications. However, while providing fast, predictable put and get operations, existing KV storage devices do not natively support range queries, which are critical to all three types of applications described above. In this article, we present KVRangeDB, a software layer that enables processing range queries for existing hash-based KV solid-state disks (KVSSDs). As an effort to adapt to the performance characteristics of emerging KVSSDs, KVRangeDB implements a log-structured merge tree key index that reduces comp...
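    The layering can be sketched as an ordered key index kept beside a hash-style device, so a range query becomes an index scan plus point lookups. In the sketch below, a sorted Python list stands in for the LSM-tree index and a dict stands in for the KVSSD, purely for illustration.

```python
# Hedged sketch: range queries over a hash-style KV device via a separate ordered key index.
import bisect

class RangeIndexedKV:
    def __init__(self):
        self.device = {}                    # stands in for the hash-based KVSSD
        self.index = []                     # sorted user keys (the LSM-tree index in the paper)

    def put(self, key, value):
        if key not in self.device:
            bisect.insort(self.index, key)
        self.device[key] = value

    def range_query(self, lo, hi):
        start = bisect.bisect_left(self.index, lo)
        end = bisect.bisect_right(self.index, hi)
        return [(k, self.device[k]) for k in self.index[start:end]]

db = RangeIndexedKV()
for k in ["user:05", "user:01", "user:09", "user:03"]:
    db.put(k, f"record-{k}")
print(db.range_query("user:02", "user:08"))   # ordered scan over an unordered device
```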
    Application virtual memory footprints are growing rapidly in all systems, from servers down to smartphones. To address this growing demand, system integrators are incorporating ever larger amounts of main memory, warranting a rethinking of memory management. In current systems, applications produce page fault exceptions whenever they access virtual memory regions that are not backed by a physical page. As application memory footprints grow, they induce more and more minor page faults. Handling each minor page fault can take a few thousand CPU cycles and blocks the application until the OS kernel finds a free physical frame. These page faults can be detrimental to performance when they occur frequently and are spread across the application runtime. Specifically, lazy allocation-induced minor page faults are increasingly impacting application performance. Our evaluation of several workloads indicates an overhead due to minor page faults as high as 29% of execution time....
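    The lazy-allocation effect described above is observable from user space: minor-fault counters barely move when memory is allocated and jump when each page is first touched, because physical frames are only assigned on first touch. The snippet below uses getrusage (Unix-only); exact counts depend on the allocator, OS, and page size, so treat it as a rough demonstration rather than a measurement methodology.

```python
# Rough demonstration of lazy allocation: first touch of each page costs a minor page fault.
import resource

def minor_faults():
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

PAGE = 4096
NBYTES = 64 * 1024 * 1024                  # 64 MiB

base = minor_faults()
buf = bytearray(NBYTES)                    # virtual allocation; physical frames still lazy
after_alloc = minor_faults()

for off in range(0, NBYTES, PAGE):         # first touch of each page triggers a minor fault
    buf[off] = 1
after_touch = minor_faults()

print(f"faults during allocation:  {after_alloc - base}")
print(f"faults during first touch: {after_touch - after_alloc}")
```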
    Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of current mobile applications. Recently emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed to DRAM. The different characteristics of the distinct memory classes pose a new challenge for memory system design. Ideally, pages should be placed or migrated between the two types of memories according to the data objects’ access properties. Prior system software approaches exploit program information from the OS, but at the cost of high software latency incurred by the related kernel processes. Hardware approaches can avoid these latencies; however, hardware’s vision is constrained to a short time window of recent memory requests, due to the limited on-chip reso...
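    As an illustration of the kind of placement/migration policy such a system targets, the sketch below counts per-page accesses over an epoch, promotes hot NVM-resident pages into a limited DRAM budget, and demotes the coldest resident pages to make room. The hotness threshold, epoch, and DRAM capacity are placeholder values, not a proposal from the paper.

```python
# Illustrative hotness-driven migration planning for a DRAM+NVM hybrid memory.
from collections import Counter

def plan_migrations(access_counts, in_dram, dram_capacity, hot_threshold=8):
    """access_counts: Counter of page -> accesses this epoch. Returns (promote, demote)."""
    hot = sorted((p for p, c in access_counts.items()
                  if c >= hot_threshold and p not in in_dram),
                 key=lambda p: -access_counts[p])
    cold_resident = sorted(in_dram, key=lambda p: access_counts.get(p, 0))
    promote, demote = [], []
    free = dram_capacity - len(in_dram)
    for page in hot:
        if free > 0:
            free -= 1
        elif cold_resident and access_counts.get(cold_resident[0], 0) < access_counts[page]:
            demote.append(cold_resident.pop(0))    # swap a colder resident page out to NVM
        else:
            break
        promote.append(page)
    return promote, demote

counts = Counter({"A": 20, "B": 12, "C": 1, "D": 9})
promote, demote = plan_migrations(counts, in_dram={"C"}, dram_capacity=2)
print("promote to DRAM:", promote, "| demote to NVM:", demote)
```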
    Packet classification methods rely upon matching packet content/headers against pre-defined rules, which are generated by network applications and their configurations. With the rapid development of network technology and fast-growing network applications, users seek more enhanced, secure, and diverse network services. Hence, it becomes critical to improve the performance of arbitrary matching operations. This article presents SIMD-Matcher, an efficient Single Instruction Multiple Data (SIMD) and cache-friendly arbitrary matching framework. To further improve arbitrary matching performance, SIMD-Matcher adopts a trie node with a fixed high fanout and a varying span for each node, depending on the data distribution. The trie node layout leverages caches and modern processor features such as SIMD instructions. To support arbitrary matching, we first interpret arbitrary rules into three fields: value, mask, and priority. Second, to support insertion of randomly positioned wildcards...
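    The (value, mask, priority) rule interpretation mentioned above can be shown in a few lines: a header matches a rule when its masked bits equal the rule's value, and the highest-priority match wins. The real framework batches these comparisons with SIMD over a wide-fanout trie; the plain loop below only illustrates the matching semantics, with made-up example rules.

```python
# Minimal sketch of (value, mask, priority) rule matching for packet classification.
def classify(header, rules):
    """rules: list of (value, mask, priority, action). Returns the highest-priority match."""
    best = None
    for value, mask, priority, action in rules:
        if header & mask == value and (best is None or priority > best[0]):
            best = (priority, action)
    return best[1] if best else "default"

rules = [
    (0x0A000000, 0xFF000000, 1, "allow 10.0.0.0/8"),
    (0x0A010200, 0xFFFFFF00, 5, "drop  10.1.2.0/24"),   # more specific, higher priority
]
print(classify(0x0A010203, rules))    # the most specific (highest-priority) rule wins
```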
    A method, system and computer program product for dynamically composing processor cores to form logical processors. Processor cores are composable in that they are dynamically allocated to form a logical processor to handle a change in the operating status. Once a change in the operating status is detected, a mechanism may be triggered to recompose one or more processor cores into a logical processor to handle the change. An analysis may be performed as to how the one or more processor cores should be recomposed to handle the change in the operating status. After the analysis, the one or more processor cores are recomposed into the logical processor to handle the change. By dynamically allocating processor cores to handle changes in the operating status, performance and power efficiency are improved.

    And 77 more