ABSTRACT In this paper, we present the first general-purpose GPU thermal management design that combines both hardware architecture and OS scheduler changes. Our techniques schedule thread blocks from multiple computational kernels in spatial, temporal, and spatio-temporal ways depending on the thermal state of the system. We reduce the computation slowdown by 60% on average relative to state-of-the-art techniques while meeting the thermal constraints. We also extend our work to multiple GPGPU cards and show improvements of 44% on average relative to existing techniques.
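As a rough illustration of the mode-selection logic described above, here is a minimal sketch in Python; the thresholds, the per-SM temperature inputs, and the function names are all hypothetical, and the paper's actual design is a hardware/OS co-design rather than a standalone policy function:

```python
# Illustrative sketch of a thermally guided scheduling decision for thread
# blocks from two concurrent kernels. All names, thresholds, and the
# temperature model are hypothetical.

T_CRIT = 85.0   # thermal limit (deg C), hypothetical
T_SAFE = 75.0   # temperature below which no throttling is needed, hypothetical

def choose_schedule(sm_temps):
    """Pick a scheduling mode from per-SM temperatures."""
    hottest = max(sm_temps)
    if hottest < T_SAFE:
        # All SMs cool: co-run both kernels on every SM.
        return "spatio-temporal"
    if hottest < T_CRIT:
        # Some SMs warm: partition SMs between kernels so hot units
        # receive the less power-intensive kernel.
        return "spatial"
    # Near the limit: time-multiplex kernels and idle hot SMs to cool.
    return "temporal"

if __name__ == "__main__":
    for temps in ([70, 71, 69, 72], [78, 80, 74, 76], [86, 83, 84, 85]):
        print(temps, "->", choose_schedule(temps))
```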
Abstract—As GPUs are quickly evolving in complexity, tuning numerical libraries for them is becoming more challenging. We present an autotuning approach in the area of dense linear algebra (DLA) libraries for GPUs. The MAGMA library is used to demonstrate the techniques and ...
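The core autotuning loop can be sketched as a plain empirical search: enumerate candidate kernel parameters, time each variant on the target hardware, and keep the fastest. The sketch below uses a cache-blocked transpose as a stand-in for a parameterized GPU kernel; it only shows the shape of the search, not MAGMA's actual tuning infrastructure, and the candidate values are illustrative:

```python
# Minimal empirical autotuning loop: time each parameter candidate on the
# target and keep the fastest. The cache-blocked transpose is a stand-in for
# a parameterized GPU kernel; candidate values are illustrative.
import time
import numpy as np

def blocked_transpose(A, nb):
    """Transpose A tile by tile; nb is the tunable blocking factor."""
    n = A.shape[0]
    out = np.empty_like(A)
    for i in range(0, n, nb):
        for j in range(0, n, nb):
            out[j:j+nb, i:i+nb] = A[i:i+nb, j:j+nb].T
    return out

def autotune(n=2048, candidates=(16, 32, 64, 128, 256)):
    A = np.random.rand(n, n)
    best = None
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_transpose(A, nb)
        dt = time.perf_counter() - t0
        print(f"nb={nb:4d}: {dt*1e3:7.1f} ms")
        if best is None or dt < best[1]:
            best = (nb, dt)
    return best[0]

if __name__ == "__main__":
    print("best blocking factor:", autotune())
```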
Abstract. We present a Cholesky factorization for multicore with GPU accelerators. The challenges in developing scalable high performance algorithms for these emerging systems stem from their heterogeneity, massive parallelism, and the huge gap between the GPUs' compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithm features two ...
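To make the division of labor concrete, here is a minimal NumPy sketch of a right-looking blocked Cholesky, the algorithmic skeleton such hybrid codes build on: the small panel factorization is the kind of latency-bound work suited to the CPU, while the large trailing-matrix update carries almost all the flops and maps to the GPU. The CPU/GPU split is indicated only in comments; the block size and names are illustrative:

```python
# Right-looking blocked Cholesky. The panel step is small and sequential
# (CPU-suited); the trailing-matrix update holds the bulk of the flops
# (GPU-suited). Both run in NumPy here; the split is indicated in comments.
import numpy as np

def blocked_cholesky(A, nb=64):
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel: factor the diagonal block (CPU-suited).
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # Triangular solve forming the column panel.
            L_kk = A[k:e, k:e]
            A[e:, k:e] = np.linalg.solve(L_kk, A[e:, k:e].T).T
            # Trailing-matrix update: the O(n^3) bulk (GPU-suited).
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)

if __name__ == "__main__":
    n = 300
    M = np.random.rand(n, n)
    A = M @ M.T + n * np.eye(n)   # symmetric positive definite test matrix
    L = blocked_cholesky(A)
    print("max error:", np.max(np.abs(L @ L.T - A)))
```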
Abstract GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms such as linear solvers and eigenvalue solvers on symmetric matrices.
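The irregularity comes from exploiting symmetry: only one triangle of the matrix is read, and each off-diagonal block must contribute to two different pieces of the result. A minimal NumPy sketch of that blocked access pattern follows (the block size and storage convention are illustrative, not the paper's GPU kernel):

```python
# Blocked symmetric matrix-vector product: only the lower triangle of A is
# read, and each off-diagonal block contributes twice (A_ij @ x_j and
# A_ij^T @ x_i). On a GPU this halves memory traffic but creates the
# irregular access pattern being optimized. Block size is illustrative.
import numpy as np

def blocked_symv(A_lower, x, nb=64):
    n = A_lower.shape[0]
    y = np.zeros(n)
    for i in range(0, n, nb):
        ie = min(i + nb, n)
        # Diagonal block: symmetrize the stored lower triangle.
        D = np.tril(A_lower[i:ie, i:ie])
        D = D + np.tril(D, -1).T
        y[i:ie] += D @ x[i:ie]
        for j in range(0, i, nb):
            je = min(j + nb, n)
            B = A_lower[i:ie, j:je]   # read once from lower storage...
            y[i:ie] += B @ x[j:je]    # ...used for the lower contribution
            y[j:je] += B.T @ x[i:ie]  # ...and for the mirrored upper one
    return y

if __name__ == "__main__":
    n = 200
    A = np.random.rand(n, n); A = (A + A.T) / 2
    x = np.random.rand(n)
    print("max error:", np.max(np.abs(blocked_symv(np.tril(A), x) - A @ x)))
```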
Abstract In this work we propose a joint energy, thermal, and cooling management technique (JETC) that significantly reduces per-server cooling and memory energy costs. Our analysis shows that decoupling the optimization of CPU and memory cooling energy from the optimization of memory energy leads to suboptimal solutions, due to the thermal dependencies between CPU and memory and the non-linearity of cooling energy.
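A toy numerical model makes the coupling argument concrete: cooling power grows roughly with the cube of fan speed, and CPU exhaust heat raises memory temperature, so a memory state that looks slightly worse in isolation can win jointly by letting the fan slow down. All constants, states, and the thermal model below are invented for illustration and are not JETC's actual models:

```python
# Toy model: decoupled vs. joint optimization of cooling and memory energy.
# Cooling power ~ fan^3 (fan law); CPU exhaust heats the DIMMs, so the memory
# power state affects the required fan speed. All constants are invented.
import itertools

T_AMB, T_MAX = 25.0, 60.0
P_CPU = 40.0                                   # fixed CPU power (W)
MEM_STATES = {"fast": (20.0, 1.00), "low": (8.0, 1.15)}
# state -> (memory power in W, relative memory energy per unit of work)

def temps(fan, p_mem):
    t_cpu = T_AMB + 0.6 * P_CPU / fan          # airflow lowers thermal resistance
    t_mem = T_AMB + (0.5 * P_CPU + 1.5 * p_mem) / fan  # CPU heat recirculates
    return t_cpu, t_mem

def feasible(fan, state):
    return max(temps(fan, MEM_STATES[state][0])) <= T_MAX

def energy(fan, state):
    return 1.2 * fan ** 3 + 10.0 * MEM_STATES[state][1]  # cooling + memory

fans = [f / 10 for f in range(8, 41)]          # candidate fan speeds

# Decoupled: the memory controller first picks its locally cheapest state,
# then the cooling controller picks the slowest fan that stays under T_MAX.
s_dec = min(MEM_STATES, key=lambda s: MEM_STATES[s][1])
f_dec = min(f for f in fans if feasible(f, s_dec))

# Joint: search both knobs together.
f_j, s_j = min(((f, s) for f, s in itertools.product(fans, MEM_STATES)
                if feasible(f, s)), key=lambda fs: energy(*fs))

print(f"decoupled: mem={s_dec}, fan={f_dec:.1f}, E={energy(f_dec, s_dec):.2f}")
print(f"joint:     mem={s_j}, fan={f_j:.1f}, E={energy(f_j, s_j):.2f}")
```

With these made-up constants the decoupled controllers settle on the "fast" memory state and a faster fan, while the joint search finds that the "low" state's lighter heat load pays for its small memory-energy penalty through cubic fan savings.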
Abstract: Tuning numerical libraries has become more difficult over time, as systems get more sophisticated. In particular, modern multicore machines make the behaviour of algorithms hard to forecast and model. In this paper, we tackle the issue of tuning a dense QR factorization on multicore architectures. We show that it is hard to rely on a model, which motivates us to design a fully empirical approach. We exhibit a few strong empirical properties that enable us to efficiently prune the search space.
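The flavor of such pruning can be sketched as follows. Suppose, as hypothetical stand-ins for the paper's observed properties, that the inner blocking ib must divide the outer tile size nb and that execution time is unimodal in nb for fixed ib; the search can then skip invalid points and stop each sweep early. The timing function below is synthetic, only standing in for benchmarking the real tiled QR:

```python
# Pruned empirical search over (nb, ib) instead of timing the full Cartesian
# product. Property 1: ib must divide nb. Property 2: time is unimodal in nb
# for fixed ib, so each sweep stops once timings degrade. The "measurement"
# is a synthetic stand-in for running the real benchmark.

def measure(nb, ib):
    """Synthetic 'execution time' standing in for a real tiled-QR run."""
    return (nb - 200) ** 2 / 1e4 + (ib - 40) ** 2 / 1e3 + 1.0

def pruned_search(nbs=range(40, 401, 40), ibs=range(8, 81, 8)):
    trials = 0
    best = (float("inf"), None)
    for ib in ibs:
        prev = float("inf")
        for nb in nbs:
            if nb % ib != 0:          # property 1: ib must divide nb
                continue
            t = measure(nb, ib)
            trials += 1
            if t < best[0]:
                best = (t, (nb, ib))
            if t > prev:              # property 2: unimodal in nb -> stop
                break
            prev = t
    return best, trials

if __name__ == "__main__":
    (t, (nb, ib)), trials = pruned_search()
    full = len(range(40, 401, 40)) * len(range(8, 81, 8))
    print(f"best nb={nb}, ib={ib} (t={t:.3f}) after {trials} timings "
          f"instead of {full}")
```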
Abstract We present an improved matrix-matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels to make more efficient use of Fermi's new architectural features, most notably its extended memory hierarchy and memory sizes.
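The structure of such a tiled GEMM can be sketched at a high level: each thread block computes one BM × BN tile of C, marching along the K dimension in BK-wide steps and staging sub-tiles of A and B into shared memory before accumulating. The NumPy sketch below mirrors only that loop structure; local arrays stand in for shared memory and registers, and the tile sizes are illustrative rather than the tuned Fermi values:

```python
# Structural sketch of a tiled GEMM: each "thread block" owns one BM x BN
# tile of C and marches over K in BK-wide steps, staging sub-tiles of A and B
# into fast memory before the inner product. Tile sizes are illustrative.
import numpy as np

def tiled_gemm(A, B, BM=64, BN=64, BK=16):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, BM):              # one (i, j) pair per "thread block"
        for j in range(0, N, BN):
            acc = np.zeros((min(BM, M - i), min(BN, N - j)))  # "registers"
            for k in range(0, K, BK):
                As = A[i:i+BM, k:k+BK]     # stage A tile into "shared memory"
                Bs = B[k:k+BK, j:j+BN]     # stage B tile into "shared memory"
                acc += As @ Bs             # accumulate from the staged tiles
            C[i:i+BM, j:j+BN] = acc
    return C

if __name__ == "__main__":
    A = np.random.rand(200, 150); B = np.random.rand(150, 170)
    print("max error:", np.max(np.abs(tiled_gemm(A, B) - A @ B)))
```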
Abstract Dense linear algebra (DLA) is one of the most important software layers in high performance computing. It is also important for its wide use in other application domains such as machine learning, gaming, speech processing, image processing, etc. The introduction of new machines from vendors gives us opportunities to optimize DLA libraries for them and thus exploit their power. Unfortunately, the optimization process is not always straightforward.
We present a Hessenberg reduction (HR) algorithm for hybrid systems of homogeneous multicore with GPU accelerators that can exceed 25× the performance of the corresponding LAPACK algorithm running on current homogeneous multicores. This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the system's hybrid components.
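For reference, the computation being accelerated is the Householder reduction to upper Hessenberg form; a plain (unblocked) NumPy version is sketched below. The two bulk updates per step are exactly the memory-bound, matrix-vector-heavy work that the hybrid algorithm maps to the GPU; the blocked LAPACK-style variant batches these updates, which this sketch does not attempt:

```python
# Reference (unblocked) Householder reduction to upper Hessenberg form.
# Each step applies a similarity transform P H P with P = I - 2 v v^T,
# zeroing one column below the subdiagonal while preserving eigenvalues.
import numpy as np

def hessenberg(A):
    H = A.astype(float)
    n = H.shape[0]
    for k in range(n - 2):
        x = H[k+1:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        norm = np.linalg.norm(v)
        if norm == 0.0:
            continue
        v /= norm
        # Apply P from the left: the bulk, memory-bound update (GPU-suited).
        H[k+1:, k:] -= 2.0 * np.outer(v, v @ H[k+1:, k:])
        # Apply P from the right, completing the similarity transform.
        H[:, k+1:] -= 2.0 * np.outer(H[:, k+1:] @ v, v)
    return H

if __name__ == "__main__":
    A = np.random.rand(6, 6)
    H = hessenberg(A)
    print("below subdiagonal ~ 0:", np.allclose(np.tril(H, -2), 0))
    print("eigenvalues match:", np.allclose(
        np.sort(np.linalg.eigvals(A)), np.sort(np.linalg.eigvals(H))))
```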
The goal of the Matrix Algebra on GPU and Multicore Architectures (MAGMA) project is to create a new generation of linear algebra libraries that achieve the fastest possible time to an accurate solution on hybrid/heterogeneous architectures, starting with current multicore + multi-GPU systems.
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs.
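A quick flop count shows why optimized BLAS dominates DLA performance: in a blocked Cholesky of an n × n matrix, for example, the non-BLAS panel work is O(n·nb²) out of n³/3 total flops, so the Level-3 BLAS share approaches 100% as n grows. A back-of-the-envelope sketch (leading-order formulas only, illustrative numbers):

```python
# Why BLAS speed dominates DLA: in blocked Cholesky the panel (POTRF-like)
# work is O(n * nb^2) of the n^3/3 total flops; everything else is Level-3
# BLAS (TRSM/SYRK/GEMM). Leading-order flop formulas only.
def blas_share(n, nb):
    panel = (n // nb) * (nb ** 3 / 3.0)   # non-BLAS panel flops
    total = n ** 3 / 3.0                  # whole factorization
    return 1.0 - panel / total

for n in (1024, 4096, 16384):
    print(f"n={n:6d}, nb=128: {100 * blas_share(n, 128):.2f}% of flops in BLAS")
```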
Recent activities of major chip manufacturers, such as Intel, AMD, IBM and NVIDIA, make it more evident than ever that future designs of microprocessors and large HPC systems will be hybrid/heterogeneous in nature, relying on the integration (in varying proportions) of two major types of components:
Abstract Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators.
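The solver pipeline these efforts target follows the standard two-phase pattern: factor once at O(n³) cost, the part worth offloading to GPUs, then solve by two O(n²) triangular substitutions. Below is a minimal sketch using SciPy's LAPACK-backed routines as stand-ins for the tuned hybrid implementations described here:

```python
# Standard dense-solver pipeline: factor once (O(n^3), the kernel worth
# accelerating), then solve with two O(n^2) triangular substitutions.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n = 500
A = np.random.rand(n, n) + n * np.eye(n)   # well-conditioned test system
b = np.random.rand(n)

lu, piv = lu_factor(A)       # LU with partial pivoting: the O(n^3) phase
x = lu_solve((lu, piv), b)   # forward/back substitution: O(n^2)
print("residual:", np.linalg.norm(A @ x - b))
```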