Algorithms and Architectures for Parallel Processing
Cross-platform power/performance prediction is becoming increasingly important due to the rapid development and variety of software and hardware architectures in an era of heterogeneous multi-core. However, accurate power/performance prediction faces an obstacle caused by the large gap between architectures, which is often overcome by laborious and time-consuming fine-grained program profiling on the target platform. To overcome these problems, this paper introduces $CP^3$, a hierarchical Cross-platform Power/Performance Prediction framework, which focuses on utilizing architecture differences to migrate built models to target platforms. The core of $CP^3$ is a three-step hierarchical transfer learning approach: hierarchical division, partial transfer learning, and model fusion. $CP^3$ first builds a power/performance model on the source platform, then rebuilds it with the reduced training data on the target platform, and finally ob...
A series of novel phosphors Ca9La(PO4)5[(Si1-xGexO4)]F2:0.15Dy3+ with x = 0, 0.25, 0.50, 0.75 and 1 (CLPS1-xGxF:0.15Dy3+) with apatite-type structure (space group P63/m) has been synthesized. The crystal structure and electronic structure have been studied by powder X-ray diffraction (PXRD) in combination with atomic-scale density functional theory (DFT) calculations. A combination of Raman spectroscopy, nuclear magnetic resonance (29Si NMR) spectroscopy and the evolution of the photoluminescent properties of Dy3+ ions was used to characterize the local crystal structure features. Based on the substitution of [SiO4] by [GeO4] tetrahedra in the apatite-type structure, the differences in the thermal stability and decay curves of the CLPS1-xGxF:0.15Dy3+ phosphors were attributed to the crystal chemical stability of the as-prepared compounds, which is also confirmed by the values of polyhedral distortion, bond valence sum (BVS) and DFT calculations. The present study provides new insights into the synthetic methodology and fundamental characterization for the design of new phosphors combined with theoretical methods.
Single-phase CaSr2(PO4)2:Dy3+,Li+ phosphors were prepared using the high-temperature solid-state method in air. To characterize the luminescence properties of the synthesized phosphors, powder X-ray diffraction (XRD) patterns, scanning electron microscopy (SEM) images, photoluminescence spectra, and concentration-dependent emission spectra were measured. The results showed that the CaSr2(PO4)2:Dy3+,Li+ phosphors exhibited white luminescence, and the emission spectra of the phosphors consisted of two sharp peaks at ≈486 and ≈578 nm (the most intense one). The optimum concentration of Dy3+ doping was determined to be 0.06 mol.%. On the basis of Dexter's theory, the mechanism of energy transfer between the Dy3+ ions was determined to be dipole–dipole interactions. The results of the temperature-dependent luminescence confirmed that the as-prepared phosphors are promising UV-convertible material ...
2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS), 2020
Serverless computing, also known as “Function as a Service (FaaS)”, is emerging as an event-driven paradigm of cloud computing. In the FaaS model, applications are programmed in the form of functions that are executed and managed separately. Functions are triggered by cloud users and are provisioned dynamically through containers or virtual machines (VMs). The startup delays of containers or VMs usually lead to rather high response latency for cloud users. Moreover, the communication between different functions generally relies on virtual net devices or shared memory, and may cause extremely high performance overhead. In this paper, we propose Unikernel-as-a-Function (UaaF), a much more lightweight approach to serverless computing. Applications are abstracted as a combination of different functions, and each function is built as a unikernel in which the function is linked with a specified minimum-sized library operating system (LibOS). UaaF offers extremely low startup latency to execute functions, and an efficient communication model to speed up inter-function interactions. We exploit a new hardware technique (namely VMFUNC) to invoke functions in other unikernels seamlessly (much like inter-process communication), without suffering the performance penalty of VM Exits. We implement our proof-of-concept prototype based on KVM and deploy UaaF in three unikernels (MirageOS, IncludeOS, and Solo5). Experimental results show that UaaF can significantly reduce the startup latency and memory usage of serverless cloud applications. Moreover, the VMFUNC-based communication model can also significantly improve the performance of function invocations between different unikernels.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021
The advent of Persistent Memory (PM) necessitates an evolution of Remote Direct Memory Access (RDMA) technologies for supporting remote data persistence. Previous software-based solutions require remote CPU intervention and postpone the visibility of remote persistence. In this paper, we design several hardware-supported RDMA primitives to flush data from the volatile cache of RDMA Network Interface Cards (RNICs) to the PM. We also propose durable RPCs based on the proposed RDMA Flush primitives to support remote data persistence and fast failure recovery. We emulate the performance of RDMA Flush primitives through other RDMA primitives, and compare our proposals with several state-of-the-art RPCs in a real testbed equipped with PM and InfiniBand networks. Experimental results show that our proposals can improve the throughput of RPCs by up to 90%, and reduce the 99th percentile latency by up to 49%. The experimental studies also provide instructive guidelines for designing RDMA-based distributed PM systems.
2021 IEEE International Conference on Cluster Computing (CLUSTER), 2021
Hybrid memories exacerbate the asymmetry of memory access latencies in Non-Uniform Memory Access (NUMA) systems due to the vast performance gap between DRAM and Nonvolatile Memory (NVM). Since most graph processing systems have not considered the memory heterogeneity of NUMA nodes, they have sub-optimal performance due to improper data placement and access strategies. This paper proposes HNGraph, a graph processing framework for hybrid-memory-based NUMA systems. It mainly focuses on performance improvement by reducing random accesses to both local and remote NVM nodes. First, HNGraph assembles most random memory accesses in DRAM by exploiting a degree-aware partitioning strategy, which distributes high-degree and low-degree vertices to DRAM and NVM nodes, respectively. Second, we propose an adaptive graph processing model, which uses a hybrid inter-node communication mechanism to adapt to the asymmetric access latency between NVM and DRAM nodes. In DRAM nodes, we exploit a message passing communication model for remote random NVM updates. In NVM nodes, we use shared memory primitives to access remote DRAM directly. We evaluate the performance of HNGraph using different graph algorithms on typical datasets. Experimental results show that HNGraph can improve the application performance by 43.8% and 30.6% on average compared with the state-of-the-art graph processing systems GBBS and Polymer, respectively.
The upper elevational range limit of tree species (including treeline and non-treeline species) is generally considered to result from either carbon limitation or sink limitation. Some evidence also suggests that the treeline might reflect preferential carbon allocation to NSC storage at the expense of growth. How might the importance of these potential mechanisms be determined? We used an elevational gradient to examine light-saturated photosynthesis (Asat) and NSC concentrations in plant tissues of three different functional types of tree species. We also examined the effects of 4 consecutive years of in situ defoliation on growth and NSCs at the upper elevational range limit. Declining temperature with increasing elevation did not reduce Asat in any of the species. We found NSC increased with elevation in major storage tissues (e.g., roots and twigs) but not in leaves. The defoliation showed that C storage took priority over growth. Such preferential carbon allocation, directly caused by growth decline, always existed in the deciduous tree species. In the evergreen tree species, however, growth decline resulting from preferential carbon allocation to storage was only detected in 2017 and then disappeared as the intensity of defoliation increased. Our results showed that trees prioritized sustaining stores of C over allocation to growth, regardless of the trees' C or sink limitations. At the cold range limits, the prioritized carbon allocation to storage in deciduous tree species was in response to low temperature stress, while in evergreen tree species, the prioritization of carbon allocation was only a transient physiological response to defoliation disturbances.
Proceedings of the International Conference on Supercomputing, 2017
Non-Volatile Memory (NVM) has recently emerged for its nonvolatility, high density and energy efficiency. Hybrid memory systems composed of DRAM and NVM have the best of both worlds, because NVM can offer larger capacity and near-zero standby power consumption while DRAM provides higher performance. Many studies have advocated using DRAM as a cache for NVM. However, how to manage the DRAM cache effectively and efficiently remains an open problem. In this paper, we propose a novel Hardware/Software Cooperative Caching (HSCC) mechanism that organizes NVM and DRAM in a flat address space while logically supporting a cache/memory hierarchy. HSCC maintains the NVM-to-DRAM address mapping and tracks the access counts of NVM pages through a moderate extension to page tables and TLBs. It significantly simplifies the hardware design and offers several optimization opportunities for cache management in software layers. We thus propose utility-based cache filtering policies to improve the efficiency of the DRAM cache. Experimental results show that HSCC improves system performance by up to 9.6X (77.2% on average) and reduces energy consumption by 34.3% on average, compared to a hardware-assisted DRAM/NVM memory system. HSCC also presents 15.4% and 14.5% performance improvement against a flat-addressable memory architecture and a Row Buffer Locality Aware (RBLA) caching policy for hybrid memories, respectively.
In great contrast to the energy-intensive Haber-Bosch process, the electrochemical nitrogen reduction reaction offers green and sustainable ammonia production at ambient reaction conditions. The demand for low-consumption and high-performance non-noble-metal catalysts is still a hot spot in the field of electrocatalysts. Herein, we demonstrate that a single Fe atom loaded on anatase TiO2(001) shows well-balanced activity for N2 fixation and NH3 dissociation. In addition, it is active in catalyzing nitrogen reduction, with the initial hydrogenation of the distal N as the potential-determining step (1.27 eV). Considering that TiO2 is a model photocatalyst, a single Fe atom can promote electron-hole separation to enhance the photocatalytic performance. Thus, Fe/TiO2(001) has been identified as a potential catalyst for photo-electrochemical ammonia synthesis.