Papers by Stanimire Tomov

The CEED co-design center is developing a number of kernels, bake-off/benchmark problems (BPs) and mini-applications (miniapps) that capture the unique requirements of high-order finite element algorithms, and is using them to influence standards, best practices and vendors to generate more portable, better-performing code across diverse hardware designs. This is a main activity in CEED's Hardware, Applications and Finite Element thrusts. CEED's miniapps, a.k.a. CEEDlings, are small, standalone applications that serve as surrogates for sub-components of interest from the selected first wave of applications (the MARBL/LLNLApp and ExaSMR projects). The BPs are even smaller standalone programs that solve a simple physical problem, including discretization, linear and/or nonlinear solve, and potentially a relatively small number of time steps. A BP may be a proxy/surrogate for part of a miniapp or another problem of general interest. Both miniapps and BPs will be used extensively in the project to evaluate and compare the performance of algorithms in specific contexts. The parts of the miniapps and BPs that play an important role in their overall performance will be identified and separated out as standalone kernels. These kernels, and their interaction inside BPs, miniapps and the actual applications, will be the main focus of our optimization efforts. All three standalone pieces of software (miniapps, BPs, and kernels) will serve as a basis for engaging and collaborating with hardware vendors and software technology (ST) projects. These partners can start at the bottom of the hierarchy, e.g., with the kernels, and easily transition to BPs and miniapps, as all three families are produced in the same CEED environment. In this milestone we identified the initial kernels, bake-off problems and miniapps, and delivered software and documentation for them through the new CEED website, http://ceed.exascaleproject.org. We specifically identified the Nekbone miniapp for the ExaSMR application and developed a new miniapp, Laghos, for the MARBL/LLNLApp application. We also formulated four initial benchmark problems (BP1-BP4) with clearly defined performance metrics and performed an initial bake-off comparison between the Nek5000 (spectral element) and MFEM (high-order finite element) technologies. As part of the milestone, we also engaged vendors (AMD) and ST projects (MPICH, STRUMPACK). In this document we report some details and results from these R&D activities, as well as additional project-wide activities performed in Q3 of FY17, including: making CEED software available on GitHub, developing the project website, and preparing for CEED's first annual meeting.

The main goal of this milestone was to help CEED-enabled ECP applications, including ExaSMR, MARBL, ExaWind and ExaAM, improve their performance and capabilities on GPU systems like Summit and Lassen/Sierra. In addition, the CEED team also worked to: add and improve support for additional hardware and programming models in the CEED software components; release the next version of the CEED software stack, CEED-3.0; and demonstrate performance of libParanumal kernels in libCEED, Nek and MFEM. These additional tasks contributed directly to the main CEED-MS34 goal and will also play an important role in CEED's future milestones. The specific tasks addressed in this milestone were:
• Add/improve support for additional hardware and programming models in the CEED software components;
• Demonstrate performance of libParanumal kernels in libCEED, Nek and MFEM;
• Publicly release CEED-3.0, including new releases of many CEED software components;
• Work with CEED applications to improve their GPU performance and capabilities.
All new developments were included in the CEED-3.0 release and integrated with applications to improve their GPU performance and capabilities. The artifacts delivered include a number of software releases (CEED-3.0, libCEED-0.6, MFEM-4.1, NekRS-20.0, hipMAGMA-1.0, Laghos-3.0, Remhos-1.0 and GSLIB-1.0.6), performance improvements in applications, and CEED software tuned for various architectures through a number of backends, freely available in the CEED repositories on GitHub. See the CEED website, http://ceed.exascaleproject.org, and the CEED GitHub organization, http://github.com/ceed, for more details.

Numerical methods in sparse linear algebra typically rely on a fast and efficient matrix-vector product, as this is usually the backbone of iterative algorithms for solving eigenvalue problems or linear systems. Given the large diversity in the characteristics of high-performance computer architectures, it is a challenge to derive a cross-platform efficient storage format along with fast matrix-vector kernels. Recently, attention has focused on the SELL-C-σ format, a sliced ELLPACK format enhanced by row sorting to reduce the fill-in when padding rows with zeros. In this paper we propose an additional modification resulting in the padded sliced ELLPACK (SELL-P) format, for which we develop a sparse matrix-vector CUDA kernel that is able to efficiently exploit the computing power of NVIDIA GPUs. We show that the kernel we developed outperforms straightforward implementations for the widespread CSR and ELLPACK formats, and is highly competitive with the implementations in the highly optimized cuSPARSE library.
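To give a flavor of why a sliced, padded layout maps well to GPUs, here is a minimal CUDA sketch of an SpMV kernel for a SELL-C style format. It assumes the slice height equals the thread-block size, values and column indices stored column-major within each slice, and padded entries carrying a zero value with a valid (e.g., zero) column index; the array names and layout details are illustrative, not the kernel from the paper.

    // SpMV for a sliced-ELLPACK format: one thread per row, one block per
    // slice. Column-major storage within a slice makes consecutive threads
    // read consecutive memory locations (coalesced access).
    __global__ void sellc_spmv(int num_rows, const double *vals,
                               const int *colind, const int *slice_ptr,
                               const double *x, double *y)
    {
        int slice_height = blockDim.x;
        int row = blockIdx.x * slice_height + threadIdx.x;
        if (row >= num_rows) return;
        int offset = slice_ptr[blockIdx.x];      // first nonzero of this slice
        int width  = (slice_ptr[blockIdx.x + 1] - offset) / slice_height;
        double dot = 0.0;
        for (int j = 0; j < width; ++j) {
            int idx = offset + j * slice_height + threadIdx.x;
            dot += vals[idx] * x[colind[idx]];   // padded entries contribute 0
        }
        y[row] = dot;
    }

Launched as sellc_spmv<<<num_slices, slice_height>>>(...); sorting rows by length within windows of σ rows (the σ in SELL-C-σ) keeps the padded width, and hence the wasted zero arithmetic, small.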

In this paper we unveil some energy-efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigensolver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a single sparse matrix-matrix product (SpMM) that achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6× performance improvement over the GPU's SpMV, and that the GPU-accelerated LOBPCG based on this kernel is 3 to 5× faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers.
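Energy comparisons of this kind rest on sampling board power during the runs. As a rough, hedged sketch (not the measurement harness used in the paper), instantaneous GPU power can be polled with NVML and integrated into an energy estimate:

    // Sample GPU board power with NVML and integrate it into energy (joules).
    // Compile with nvcc and link with -lnvidia-ml; error checks omitted.
    #include <nvml.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);
        const double dt = 0.01;                   // 10 ms sampling period
        double joules = 0.0;
        for (int s = 0; s < 500; ++s) {           // ~5 s window around a kernel
            unsigned int mw = 0;
            nvmlDeviceGetPowerUsage(dev, &mw);    // board power in milliwatts
            joules += (mw / 1000.0) * dt;         // accumulate P * dt
            usleep((useconds_t)(dt * 1e6));
        }
        printf("estimated energy over the window: %.1f J\n", joules);
        nvmlShutdown();
        return 0;
    }

In a real experiment the sampling loop runs in a separate thread while the solver executes, and the CPU side is measured with its own counters (e.g., RAPL).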

IEEE International Conference on High Performance Computing, Data, and Analytics, Apr 12, 2015
This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver, the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), starting from low-level GPU data structures and kernels and going up to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone not only for block-Krylov methods, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications. Compared to non-blocked algorithms, we show that the performance speedup factor of SpMM- vs. SpMV-based algorithms is up to six on GPUs like NVIDIA's K40. In particular, a typical SpMV performance range in double precision is 20 to 25 GFlop/s, while the SpMM is in the range of 100 to 120 GFlop/s. Compared to highly optimized CPU implementations, e.g., the SpMM from MKL on two eight-core Intel Xeon E5-2690s, our kernel is 3 to 5× faster on a K40 GPU. For comparison to other computational loads, the same GPU-to-CPU performance acceleration is observed for the SpMV product, as well as for dense linear algebra, e.g., matrix-matrix multiplication and factorizations like LU, QR, and Cholesky. Thus, the modeled GPU (vs. CPU) acceleration for the entire solver is also 3 to 5×. In practice though, currently available CPU implementations are much slower due to missed optimization opportunities, as we show.
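The blocking that gives SpMM its advantage can be shown in a self-contained toy: each CSR matrix entry is loaded once and reused for all k vectors, instead of being re-read by k separate SpMV sweeps. The matrix (a 1D Laplacian), the fixed block size NUM_VECS, and all names below are illustrative assumptions, not the MAGMA kernel; error checks are omitted.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    #define NUM_VECS 4   // block size k, fixed at compile time for register reuse

    __global__ void csr_spmm(int n, const int *rowptr, const int *colind,
                             const double *vals, const double *X, double *Y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n) return;
        double acc[NUM_VECS] = {0.0};
        for (int j = rowptr[row]; j < rowptr[row + 1]; ++j) {
            double a = vals[j];                  // one load of the matrix entry...
            int c = colind[j];
            for (int v = 0; v < NUM_VECS; ++v)
                acc[v] += a * X[v * n + c];      // ...reused for all k vectors
        }
        for (int v = 0; v < NUM_VECS; ++v)
            Y[v * n + row] = acc[v];
    }

    int main() {
        const int n = 1 << 10;
        std::vector<int> rowptr(n + 1), colind;
        std::vector<double> vals;
        for (int i = 0; i < n; ++i) {            // 1D Laplacian stencil rows
            rowptr[i] = (int)vals.size();
            if (i > 0)     { colind.push_back(i - 1); vals.push_back(-1.0); }
            colind.push_back(i); vals.push_back(2.0);
            if (i < n - 1) { colind.push_back(i + 1); vals.push_back(-1.0); }
        }
        rowptr[n] = (int)vals.size();
        std::vector<double> X(NUM_VECS * n, 1.0), Y(NUM_VECS * n);
        int *d_rp, *d_ci; double *d_v, *d_x, *d_y;
        cudaMalloc(&d_rp, (n + 1) * sizeof(int));
        cudaMalloc(&d_ci, colind.size() * sizeof(int));
        cudaMalloc(&d_v, vals.size() * sizeof(double));
        cudaMalloc(&d_x, X.size() * sizeof(double));
        cudaMalloc(&d_y, Y.size() * sizeof(double));
        cudaMemcpy(d_rp, rowptr.data(), (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_ci, colind.data(), colind.size() * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_v, vals.data(), vals.size() * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(d_x, X.data(), X.size() * sizeof(double), cudaMemcpyHostToDevice);
        csr_spmm<<<(n + 255) / 256, 256>>>(n, d_rp, d_ci, d_v, d_x, d_y);
        cudaMemcpy(Y.data(), d_y, Y.size() * sizeof(double), cudaMemcpyDeviceToHost);
        printf("Y[0][0] = %g (boundary row: 2*1 - 1*1 = 1)\n", Y[0]);
        return 0;
    }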
High performance is difficult to obtain using existing libraries, especially for many independent computations where each computation is very small. However, by using our framework to batch computations, together with application-specific optimizations, we demonstrate results close to peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration (compared to not using it) due to the possible reuse of data loaded in fast memory.
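The reordering-plus-GEMM idea can be made concrete with a toy batched kernel: a contraction D(i,l,e) = Σ_j A(i,j,e) B(j,l) over elements e becomes a batch of small GEMMs, one per element. The sketch below uses illustrative names and layouts, not the production kernel; the strides are there so that passing strideB = 0 shares one B across the whole batch, which matches the FEM case where every element applies the same reference-element matrix.

    // One thread-block tile per (i, l) tile and batch entry e (blockIdx.z).
    // All matrices column-major; sizes assumed to fit the launch grid.
    __global__ void batched_gemm(int m, int n, int k,
                                 const double *A, int lda, long strideA,
                                 const double *B, int ldb, long strideB,
                                 double *C, int ldc, long strideC)
    {
        const double *Ae = A + blockIdx.z * strideA;  // element e's slice of A
        const double *Be = B + blockIdx.z * strideB;  // strideB = 0 => shared B
        double       *Ce = C + blockIdx.z * strideC;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int l = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= m || l >= n) return;
        double acc = 0.0;
        for (int j = 0; j < k; ++j)
            acc += Ae[i + j * lda] * Be[j + l * ldb];
        Ce[i + l * ldc] = acc;
    }

A launch such as batched_gemm<<<dim3((m+15)/16, (n+15)/16, num_elements), dim3(16,16)>>>(...) processes the whole batch in one kernel call, which is where the win over many tiny individual GEMM calls comes from.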
Procedia Computer Science, 2012
This paper explores the need for asynchronous iteration algorithms as smoothers in multigrid methods. The hardware target for the new algorithms is top-of-the-line, highly parallel hybrid architectures: multicore-based systems enhanced with GPGPUs. These architectures are the most likely candidates for future high-end supercomputers. To pave the road for their efficient use, we must resolve challenges related to the fact that data movement, not floating-point operations, is the bottleneck to performance. Our work is in this direction: we designed block-asynchronous multigrid smoothers that perform more flops in order to reduce synchronization, and hence data movement. We show that the extra flops are done for "free," while synchronization is reduced and the convergence properties of multigrid with classical smoothers like Gauss-Seidel can be preserved.
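A minimal CUDA sketch of the block-asynchronous idea (illustrative, not the paper's smoother): each thread block performs several Jacobi sweeps over its own rows, synchronizing only within the block, while values owned by other blocks are read from global memory and may be stale. The extra local sweeps are the "free" flops that substitute for global synchronization.

    // Block-asynchronous Jacobi: local_iters sweeps per kernel launch, with
    // block-local synchronization only. CSR storage; diag holds diag(A).
    // The in-place update of x is intentionally non-deterministic across blocks.
    __global__ void block_async_jacobi(int num_rows, int local_iters,
                                       const int *rowptr, const int *colind,
                                       const double *vals, const double *diag,
                                       const double *b, double *x)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        for (int it = 0; it < local_iters; ++it) {
            if (row < num_rows) {
                double sigma = 0.0;
                for (int j = rowptr[row]; j < rowptr[row + 1]; ++j)
                    if (colind[j] != row)
                        sigma += vals[j] * x[colind[j]];  // possibly stale values
                x[row] = (b[row] - sigma) / diag[row];
            }
            __syncthreads();   // synchronize the "subdomain" (this block) only
        }
    }

The host relaunches the kernel for each global sweep; convergence requires the usual asynchronous-iteration conditions on the matrix (e.g., sufficient diagonal dominance).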

As general-purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing for their raw performance advantages over CPUs, fault tolerance has become more of a concern than when they were used exclusively for graphics applications. The pairing of GPUs with CPUs to form hybrid computing systems for better flexibility and performance creates a massive amount of computation with a higher chance of being affected by transient errors: soft errors that silently modify data, causing errors to pass unnoticed. This is despite the fact that the newest Fermi generation of GPUs from NVIDIA is equipped with error-correcting units to protect its memories. The problem is particularly serious for applications that employ numerical linear algebra, since large sections of data are often modified between steps, and therefore even a single error could eventually propagate into a large area of the result. In order to protect dense linear algebra computations on such hybrid systems, we developed an algorithm that is resilient to soft errors. We chose the right-looking Householder QR factorization as a demonstration of our algorithm on a hybrid system that features both GPUs and CPUs. Algorithm-based fault tolerance (ABFT) is used to protect from errors in the trailing matrix and the right factor, while a checkpointing method is used to ensure the left factor is error-free. This work is based on a previous study of fault tolerance in matrix factorizations. Our contributions include: (1) a stable multiple-error checkpointing and recovery mechanism for the left factor, which is also scalable in performance in the hybrid execution environment and does not cause severe performance degradation; (2) optimized Givens rotation utilities on the GPU to efficiently reduce an upper Hessenberg matrix to upper triangular form; and (3) a recovery algorithm based on QR update inside a hybrid system. Experimental results show that our fault-tolerant QR factorization can successfully detect and correct data altered by soft errors in both the left and right factors, and we observe a decreasing percentage of overhead as the matrix size grows.
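The ABFT part of such schemes can be illustrated with a host-side toy, under assumed simplifications (a plain GEMM-style trailing-matrix update instead of the QR panel update, detection only): a checksum row of column sums is appended to C and carried through the update, so a corrupted entry later shows up as a checksum mismatch in its column; a second, row-wise checksum would locate the row and enable correction.

    #include <cstdio>
    #include <cmath>
    #include <vector>

    int main() {
        const int n = 4;                       // small square update, illustrative
        std::vector<double> A(n*n), B(n*n), C((n+1)*n);  // C: n data rows + checksum
        for (int i = 0; i < n*n; ++i) { A[i] = i % 3 + 1; B[i] = i % 5 + 1; }
        for (int j = 0; j < n; ++j) {          // column-major, leading dim n+1
            for (int i = 0; i < n; ++i) C[i + j*(n+1)] = 1.0;
            C[n + j*(n+1)] = n;                // initial checksum: column sums
        }
        // Encoded update C -= A*B: the checksum row is updated with the column
        // sums of A, preserving "row n == sum of rows 0..n-1" in exact arithmetic.
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) {
                double bkj = B[k + j*n], colsumA = 0.0;
                for (int i = 0; i < n; ++i) {
                    C[i + j*(n+1)] -= A[i + k*n] * bkj;
                    colsumA += A[i + k*n];
                }
                C[n + j*(n+1)] -= colsumA * bkj;
            }
        C[2 + 1*(n+1)] += 42.0;                // inject a soft error
        for (int j = 0; j < n; ++j) {          // detection: verify column sums
            double sum = 0.0;
            for (int i = 0; i < n; ++i) sum += C[i + j*(n+1)];
            double r = sum - C[n + j*(n+1)];
            if (std::fabs(r) > 1e-9)
                printf("column %d corrupted, residual %g\n", j, r);
        }
        return 0;
    }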
arXiv (Cornell University), Sep 10, 2021
In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.

International Journal of High Performance Computing Applications, Jun 8, 2021
Efficient exploitation of exascale architectures requires rethinking the numerical algorithms used in many large-scale applications. These architectures favor algorithms that expose ultra-fine-grain parallelism and maximize the ratio of floating-point operations to energy-intensive data movement. One of the few viable approaches to achieving high efficiency in the area of PDE discretizations on unstructured grids is to use matrix-free / partially assembled high-order finite element methods, since these methods can increase the accuracy and/or lower the computational time due to reduced data motion. In this paper we provide an overview of the research and development activities in the Center for Efficient Exascale Discretizations (CEED), a co-design center in the Exascale Computing Project that is focused on the development of next-generation discretization software and algorithms to enable a wide range of finite element applications to run efficiently on future hardware. CEED is a research partnership involving more than 30 computational scientists from two US national labs and five universities, including members of the Nek5000, MFEM, MAGMA and PETSc projects. We discuss the CEED co-design activities based on targeted benchmarks, miniapps and discretization libraries, and our work on performance optimizations for large-scale GPU architectures. We also provide a broad overview of research and development activities in areas such as unstructured adaptive mesh refinement algorithms, matrix-free linear solvers, and high-order data visualization, and list examples of collaborations with several ECP and external applications.
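The matrix-free point can be illustrated with sum factorization on a single 2D element: applying the operator as two small 1D contractions touches O(p^2) data and does O(p^3) flops, instead of storing and applying a (p^2 × p^2) assembled element matrix at O(p^4) cost. A host-side toy sketch, assuming for simplicity the same number of basis functions and quadrature points p in each direction, with B the 1D basis evaluation matrix:

    // v = (B ⊗ B) u on one element, computed as two 1D contractions.
    // All arrays are p*p, column-major; tmp is scratch of the same size.
    void apply_2d_sumfact(int p, const double *B,
                          const double *u, double *v, double *tmp)
    {
        // tmp(q1, j2) = sum_{j1} B(q1, j1) * u(j1, j2)
        for (int j2 = 0; j2 < p; ++j2)
            for (int q1 = 0; q1 < p; ++q1) {
                double s = 0.0;
                for (int j1 = 0; j1 < p; ++j1)
                    s += B[q1 + j1 * p] * u[j1 + j2 * p];
                tmp[q1 + j2 * p] = s;
            }
        // v(q1, q2) = sum_{j2} B(q2, j2) * tmp(q1, j2)
        for (int q2 = 0; q2 < p; ++q2)
            for (int q1 = 0; q1 < p; ++q1) {
                double s = 0.0;
                for (int j2 = 0; j2 < p; ++j2)
                    s += B[q2 + j2 * p] * tmp[q1 + j2 * p];
                v[q1 + q2 * p] = s;
            }
    }

Each contraction is itself a small GEMM applied per element, which is why batched GEMM performance is central to the CEED software stack.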

In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). For this purpose, we developed a set of asynchronous iteration algorithms in CUDA and compared them with a parallel implementation of synchronous relaxation methods on CPU-based systems. For a set of test matrices taken from the University of Florida Matrix Collection we monitor the convergence behavior, the average iteration time and the total time to solution. Analyzing the results, we observe that even for our most basic asynchronous relaxation scheme, despite its (expected) lower convergence rate compared to Gauss-Seidel relaxation, the asynchronous iteration running on GPUs is still able to provide solution approximations of a certain accuracy in considerably shorter time than Gauss-Seidel running on CPUs. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes to the highly parallel GPU architectures. Further, by enhancing the most basic asynchronous approach with hybrid schemes (using multiple iterations within the "subdomain" handled by a GPU thread block, Jacobi-like asynchronous updates across the "boundaries", and tuning of various parameters), we manage not only to recover the loss of global convergence but often to accelerate convergence by up to two times (compared to the effective but difficult-to-parallelize Gauss-Seidel type of schemes), while keeping the execution time of a global iteration practically the same. This shows the high potential of asynchronous methods, not only as stand-alone numerical solvers for linear systems of equations fulfilling certain convergence conditions, but more importantly as smoothers in multigrid methods. Due to the explosion of parallelism in today's architecture designs, the significance of and need for asynchronous methods such as the ones described in this work are expected to grow.
IEEE International Conference on High Performance Computing, Data, and Analytics, 2006
By using a combination of 32-bit and 64-bit floating-point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here applies not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), and the Cell BE processor. Results on modern processor architectures and the Cell BE are presented.
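The core mixed-precision pattern is iterative refinement: factor and solve in 32-bit, then recover 64-bit accuracy with cheap correction steps. A self-contained toy sketch follows; a naive dense solver stands in for the tuned FP32 factorization, and while a real code factors once and reuses the factors, this toy refactors on every call for brevity.

    #include <cstdio>
    #include <cmath>
    #include <vector>

    // Solve A z = b in FP32 via Gaussian elimination with partial pivoting.
    // A and b are taken by value so the caller's copies survive.
    static void solve_fp32(int n, std::vector<float> A, std::vector<float> b,
                           std::vector<double> &z)
    {
        for (int k = 0; k < n; ++k) {
            int p = k;                                   // partial pivoting
            for (int i = k + 1; i < n; ++i)
                if (std::fabs(A[i + k*n]) > std::fabs(A[p + k*n])) p = i;
            for (int j = 0; j < n; ++j) std::swap(A[k + j*n], A[p + j*n]);
            std::swap(b[k], b[p]);
            for (int i = k + 1; i < n; ++i) {
                float m = A[i + k*n] / A[k + k*n];
                for (int j = k; j < n; ++j) A[i + j*n] -= m * A[k + j*n];
                b[i] -= m * b[k];
            }
        }
        for (int i = n - 1; i >= 0; --i) {               // back substitution
            float s = b[i];
            for (int j = i + 1; j < n; ++j) s -= A[i + j*n] * (float)z[j];
            z[i] = s / A[i + i*n];
        }
    }

    int main() {
        const int n = 3;
        std::vector<double> A = {4,1,0, 1,4,1, 0,1,4};   // column-major, SPD
        std::vector<double> b = {1,2,3}, x(n, 0.0);
        std::vector<float> A32(A.begin(), A.end());      // FP32 copy of A
        for (int it = 0; it < 5; ++it) {
            std::vector<double> r(n);                    // residual in FP64
            for (int i = 0; i < n; ++i) {
                r[i] = b[i];
                for (int j = 0; j < n; ++j) r[i] -= A[i + j*n] * x[j];
            }
            std::vector<float> r32(r.begin(), r.end());
            std::vector<double> z(n, 0.0);
            solve_fp32(n, A32, r32, z);                  // correction in FP32
            double nrm = 0.0;
            for (int i = 0; i < n; ++i) { x[i] += z[i]; nrm += r[i]*r[i]; }
            printf("iter %d  ||r||_2 = %.3e\n", it, std::sqrt(nrm));
        }
        return 0;
    }

The expensive O(n^3) work runs at FP32 speed, while each FP64 correction step costs only O(n^2); the residual norm printed each iteration shrinks toward double-precision accuracy for reasonably conditioned systems.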

This technical report describes our findings regarding performance optimizations of the tensor contraction kernels used in BLAST, a high-order FE hydrodynamics research code developed at LLNL, on various modern architectures. Our approach considers and shows ways to organize the contractions, their vectorization, data storage formats, read/write patterns, and parametrization as related to batched execution and parallelism in general. An autotuning framework is designed and used to empirically find the best-performing tensor kernels by exploring the large search space that results from the techniques described. We analyze these kernels to show the trade-offs between the various implementations, how different tensor kernel implementations perform on different architectures, and which tuning parameters can have a significant impact on performance. In all cases, we organize the tensor contractions to minimize their communication by considering index reordering that enables their execution as highly efficient batched matrix-matrix multiplications (GEMMs). We derive a performance model and bound for the maximum performance that can be achieved under the maximum data-reuse scenario, and show that our implementations achieve 90+% of these theoretically derived peaks on advanced multicore x86 CPU, ARM, GPU, and Xeon Phi architectures. These results significantly outperform what is available today in vendor libraries. In particular, we show average performance speedups of 1.3× and 2× compared to Intel MKL on two 10-core Haswell CPUs and a KNL Xeon Phi, respectively, and 3× compared to NVIDIA CUBLAS on the latest P100 NVIDIA GPU.
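A bound of this roofline flavor, in a generic form under the maximum data-reuse assumption that every operand moves between memory and the processor exactly once (the report's actual model may differ in detail), is

    P \le \min\left( P_{\mathrm{peak}},\; I \cdot B \right),
    \qquad
    I_{\mathrm{GEMM}} = \frac{2\,m n k}{8\,(m n + m k + n k)}
    \ \ \text{flops/byte (FP64)},

where B is the attainable memory bandwidth and I the arithmetic intensity. For small square GEMMs with m = n = k = 8 this gives I = 1024 / 1536 ≈ 0.67 flops/byte, so such batches sit firmly in the bandwidth-bound regime, and data reuse across the batch determines the achievable fraction of peak.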
This document describes an API for Batch Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. We consider extensions beyond the original BLAS standard that specify a programming interface not only for routines with uniformly sized matrices and/or vectors but also for situations where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, as well as other hardware accelerators with floating-point compute capability.
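To make the variable-size case concrete, here is a hedged sketch of what such an entry point can look like, together with a trivial reference implementation. The names, enum, and conventions are assumptions for illustration, not the exact interface proposed in the document.

    // Per-problem sizes, leading dimensions, scalars and pointers: problem p
    // computes C[p] = alpha[p] * op(A[p]) * op(B[p]) + beta[p] * C[p].
    typedef enum { BblasNoTrans, BblasTrans } bblas_trans_t;

    static void dgemm_one(bblas_trans_t tA, bblas_trans_t tB,
                          int m, int n, int k, double alpha,
                          const double *A, int lda,
                          const double *B, int ldb,
                          double beta, double *C, int ldc)
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                double s = 0.0;
                for (int l = 0; l < k; ++l) {
                    double a = (tA == BblasNoTrans) ? A[i + l*lda] : A[l + i*lda];
                    double b = (tB == BblasNoTrans) ? B[l + j*ldb] : B[j + l*ldb];
                    s += a * b;
                }
                C[i + j*ldc] = alpha * s + beta * C[i + j*ldc];
            }
    }

    void bblas_dgemm_batch(const bblas_trans_t *transA, const bblas_trans_t *transB,
                           const int *m, const int *n, const int *k,
                           const double *alpha, const double * const *A, const int *lda,
                                                const double * const *B, const int *ldb,
                           const double *beta,        double * const *C, const int *ldc,
                           int batch_count)
    {
        // Reference semantics: a sequential loop. A tuned backend would group
        // or interleave the problems to keep all cores/SMs busy.
        for (int p = 0; p < batch_count; ++p)
            dgemm_one(transA[p], transB[p], m[p], n[p], k[p], alpha[p],
                      A[p], lda[p], B[p], ldb[p], beta[p], C[p], ldc[p]);
    }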
Kepler is the newest GPU architecture from NVIDIA, and the GTX 680 is the first commercially available graphics card based on that architecture. Matrix multiplication is a canonical computational kernel, and often the main target of initial optimization efforts for a new chip. This article presents preliminary results of automatically tuning matrix multiplication kernels for the Kepler architecture using the GTX 680 card.
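Autotuning of this kind typically parameterizes a single kernel by compile-time constants and then times many instantiations on the target chip. A minimal sketch of the idea (a basic shared-memory SGEMM with one tunable square tile; the actual search space in the paper is richer):

    // Row-major SGEMM, C = A * B, assuming m, n, k are multiples of TILE.
    // The tuner instantiates and benchmarks several TILE values per chip.
    template <int TILE>
    __global__ void sgemm_tiled(int m, int n, int k,
                                const float *A, const float *B, float *C)
    {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < k; t += TILE) {
            As[threadIdx.y][threadIdx.x] = A[row * k + t + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
            __syncthreads();
            for (int j = 0; j < TILE; ++j)
                acc += As[threadIdx.y][j] * Bs[j][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }

    // The "search" is then a sweep over instantiations, e.g.:
    //   sgemm_tiled< 8><<<dim3(n/ 8, m/ 8), dim3( 8,  8)>>>(m, n, k, A, B, C);
    //   sgemm_tiled<16><<<dim3(n/16, m/16), dim3(16, 16)>>>(m, n, k, A, B, C);
    //   sgemm_tiled<32><<<dim3(n/32, m/32), dim3(32, 32)>>>(m, n, k, A, B, C);

Real generators add further parameters (rectangular tiles, per-thread work, prefetching, texture reads), which is what makes exhaustive hand-tuning impractical and autotuning attractive.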

Springer eBooks, 2015
The Generalized Minimum Residual (GMRES) method is a popular Krylov subspace projection method for solving a nonsymmetric linear system of equations. On modern computers, communication is becoming increasingly expensive compared to arithmetic operations, and a communication-avoiding variant (CA-GMRES) may improve the performance of GMRES. To further enhance the performance of CA-GMRES, in this paper we propose two techniques, focusing on the two main computational kernels of CA-GMRES: the tall-skinny QR factorization (TSQR) and the matrix powers kernel (MPK). First, to improve the numerical stability of TSQR based on the Cholesky QR (CholQR) factorization, we use higher-precision arithmetic at carefully selected steps of the factorization. In particular, our mixed-precision CholQR takes the input matrix in the standard 64-bit double precision, but accumulates some of its intermediate results in a software-emulated double-double precision. Compared with the standard CholQR, this mixed-precision CholQR requires about 8.5× more computation but a much smaller increase in communication. Since computation is becoming less expensive relative to communication on newer computers, the relative overhead of the mixed-precision CholQR is decreasing. Our case studies on a GPU demonstrate that using higher-precision arithmetic for this small but critical segment of the algorithm can improve not only the overall numerical stability of CA-GMRES but also, in some cases, its performance. We then study an adaptive scheme to dynamically adjust the step size of the MPK based on static inputs and on performance measurements gathered during the first restart loop of CA-GMRES. Since the optimal step size of the MPK is often much smaller than that of the orthogonalization kernel, the overall performance of CA-GMRES can be improved by using different step sizes for these two kernels. Our performance results on multiple GPUs show that our adaptive scheme can choose a near-optimal step size for the MPK, reducing the total solution time of CA-GMRES.
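CholQR itself is short: form the Gram matrix G = AᵀA (one global reduction, which is what makes it communication-friendly), factor G = RᵀR, and recover Q = A R⁻¹. The FP64-only toy below shows the structure, with an illustrative test matrix; the paper's contribution is carrying selected steps in double-double precision, since forming AᵀA squares the condition number of the problem.

    #include <cstdio>
    #include <cmath>
    #include <vector>

    int main() {
        const int m = 6, n = 3;                          // tall-skinny input
        std::vector<double> A(m*n), G(n*n, 0.0), R(n*n, 0.0), Q(m*n);
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                A[i + j*m] = 1.0 / (1.0 + i + j);        // column-major test matrix
        // 1) Gram matrix G = A^T A: the single (BLAS-3) reduction step
        for (int j = 0; j < n; ++j)
            for (int i = 0; i <= j; ++i) {
                double s = 0.0;
                for (int l = 0; l < m; ++l) s += A[l + i*m] * A[l + j*m];
                G[i + j*n] = G[j + i*n] = s;
            }
        // 2) Cholesky factorization G = R^T R (upper-triangular R); this is
        //    where the paper injects double-double accumulation
        for (int j = 0; j < n; ++j) {
            double d = G[j + j*n];
            for (int l = 0; l < j; ++l) d -= R[l + j*n] * R[l + j*n];
            R[j + j*n] = std::sqrt(d);
            for (int i = j + 1; i < n; ++i) {
                double s = G[j + i*n];
                for (int l = 0; l < j; ++l) s -= R[l + j*n] * R[l + i*n];
                R[j + i*n] = s / R[j + j*n];
            }
        }
        // 3) Q = A R^{-1}, row by row (forward substitution over the columns)
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) {
                double s = A[i + j*m];
                for (int l = 0; l < j; ++l) s -= Q[i + l*m] * R[l + j*n];
                Q[i + j*m] = s / R[j + j*n];
            }
        double err = 0.0;                                // check || Q^T Q - I ||_max
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i) {
                double s = (i == j) ? -1.0 : 0.0;
                for (int l = 0; l < m; ++l) s += Q[l + i*m] * Q[l + j*m];
                err = std::fmax(err, std::fabs(s));
            }
        printf("max |Q^T Q - I| = %.3e\n", err);
        return 0;
    }

Step 1 is one compute-dense reduction and step 3 is BLAS-3 as well, which is why CholQR maps so well to GPUs; the orthogonality error printed at the end grows with the squared condition number, motivating the mixed-precision variant.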

Zenodo (CERN European Organization for Nuclear Research), Apr 12, 2023
The goal of this milestone was to support ECP applications in the preparation and execution of their exascale challenge problem runs. We focused on multi-node scaling on Frontier (both strong and weak scaling) and performed additional developments to help CEED-enhanced applications achieve their planned FOMs. As part of this milestone, we also ported and optimized the CEED software stack, including Nek, MFEM and libCEED, on Aurora and El Capitan early-access hardware, worked on optimizing performance on AMD, NVIDIA, and Intel GPUs, and demonstrated impact in CEED-enabled ECP and external applications. The specific tasks addressed in this milestone were as follows.
• CEED-T25 (ADCD04-95): Multi-node scaling on Frontier
• CEED-T26 (ADCD04-96): Porting and optimizations for Aurora
• CEED-T27 (ADCD04-97): Porting and optimizations for El Capitan
• CEED-T28 (ADCD04-98): Help ECP applications meet their FOMs