The NVIDIA CUBLAS Library
Dr. Volker Weinberg
Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften
volker.weinberg@lrz.de
GPGPU Programming
LRZ, 10.-11. October 2011
Outline

1. Overview
2. Compilation and Initialisation
3. Error Handling
4. CUBLAS Helper Functions
5. BLAS Core Functions
   - BLAS Level 1 Routines
   - BLAS Level 2 and 3 Routines
6. Description of some important BLAS Routines
7. Example Code
8. References
Overview

CUBLAS
- Implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime.
- Self-contained library: no direct interaction with the CUDA driver is necessary.
- Helper functions are provided to (de)allocate host and GPU memory for matrix objects, fill them with data, and transfer them to/from the GPU.
- For maximum compatibility with existing Fortran environments, CUBLAS uses
  - column-major storage
  - 1-based indexing
- The header file cublas.h has to be included.
- Applications need to link against the dynamic CUBLAS library libcublas.so and the CUDA runtime library libcudart.so.
- This talk describes the CUBLAS library shipped with CUDA 3.x (the "legacy CUBLAS API"); starting with CUDA Toolkit 4.0, a slightly different new API is introduced.
Compilation and Initialisation

export CUDA_BASE=/lrz/sys/parallel/cuda/3.2/cuda/
export LD_LIBRARY_PATH=$CUDA_BASE/lib64:$LD_LIBRARY_PATH
g++ -I$CUDA_BASE/include -L$CUDA_BASE/lib64/ -lcudart -lcublas file.c -o file

Compilation and Initialisation @ LRZ

module load cuda
g++ -I$CUDA_BASE/include -L$CUDA_BASE/lib64/ -lcudart -lcublas file.c -o file
Error Handling

- The status of CUBLAS core functions can be retrieved via cublasGetError().
- CUBLAS helper functions return their status directly.
- The type cublasStatus is used for core function status returns.

cublasStatus values:

CUBLAS_STATUS_SUCCESS           operation completed successfully
CUBLAS_STATUS_NOT_INITIALIZED   CUBLAS library not initialized
CUBLAS_STATUS_ALLOC_FAILED      resource allocation failed
CUBLAS_STATUS_INVALID_VALUE     unsupported numerical value was passed to function
CUBLAS_STATUS_ARCH_MISMATCH     function requires an architectural feature absent from the device
CUBLAS_STATUS_MAPPING_ERROR     access to GPU memory space failed
CUBLAS_STATUS_EXECUTION_FAILED  GPU program failed to execute
CUBLAS_STATUS_INTERNAL_ERROR    an internal CUBLAS operation failed
CUBLAS Helper Functions
Initialisation/Release of GPU
cublasInit()
cublasShutdown()
De/Allocation of GPU memory
cublasAlloc()
cublasFree()
Setting/Getting Matrix/Vector values
cublasSetVector()
cublasSetMatrix()
cublasGetVector()
cublasGetMatrix()
Error Handling
cublasGetError()
Asynchronous I/O using CUDA streams
cublasSetKernelStream()
cublasSetVectorAsync()
cublasSetMatrixAsync()
cublasGetVectorAsync()
cublasGetMatrixAsync()
Helper Functions: Initialisation/Release of GPU
Initialisation of GPU
cublasStatus cublasInit(void)
Initialises the CUBLAS library and must be called before any other
CUBLAS API function is invoked. It allocates hardware resources
necessary for accessing the GPU. It attaches CUBLAS to whatever GPU
is currently bound to the host thread from which it was invoked.
Release of GPU
cublasStatus cublasShutdown(void)
Releases CPU-side resources used by the CUBLAS library. The release of
GPU-side resources may be deferred until the application shuts down.
Helper Functions: (De-)Allocation of GPU memory
Allocation of GPU memory
cublasStatus cublasAlloc(int n, int elemSize,
void **devicePtr)
Creates an object in GPU memory space capable of holding an array of n
elements, where each element requires elemSize bytes of storage. If the
function call is successful, a pointer to the object in GPU memory space
is placed in devicePtr.
Deallocation of GPU memory
cublasStatus cublasFree(const void *devicePtr)
Destroys the object in GPU memory space referenced by devicePtr.
Helper Functions: Error Handling
Error Handling
cublasStatus cublasGetError(void)
Returns the last error that occurred on invocation of any of the CUBLAS
core functions. Reading the error status via cublasGetError() resets the
internal error state to CUBLAS_STATUS_SUCCESS.
Helper Functions: Setting/Getting Vector values
Initialising a Vector on the GPU
cublasStatus cublasSetVector (int n, int elemSize,
const void *x, int incx, void *y, int incy)
Copies n elements from a vector x in CPU memory space to a vector y in
GPU memory space. Elements in both vectors are assumed to have a size
of elemSize bytes. Storage spacing between consecutive elements is incx
for the source vector x and incy for the destination vector y. In general, y
points to an object, or part of an object, allocated via cublasAlloc().
Getting Values of a Vector on the GPU
cublasStatus cublasGetVector (int n, int elemSize,
const void *x, int incx, void *y, int incy)
Copies n elements from a vector x in GPU memory space to a vector y in
CPU memory space.
Helper Functions: Setting/Getting Matrix values
Initialising a Matrix on the GPU
cublasStatus cublasSetMatrix (int rows, int cols,
int elemSize, const void *A, int lda, void *B, int ldb)
Copies a tile of rows×cols elements from a matrix A in CPU memory
space to a matrix B in GPU memory space. Each element requires
storage of elemSize bytes. Both matrices are assumed to be stored in
column-major format, with the leading dimension (that is, the number of
rows) of source matrix A provided in lda, and the leading dimension of
destination matrix B provided in ldb. B is a device pointer that points to
an object, or part of an object, that was allocated in GPU memory space
via cublasAlloc().
Getting Values of a Matrix on the GPU
cublasStatus cublasGetMatrix (int rows, int cols,
int elemSize, const void *A, int lda, void *B, int ldb)
Copies a tile of rows×cols elements from a matrix A in GPU memory
space to a matrix B in CPU memory space.
Helper Functions: Asynchronous I/O with CUDA Streams

Setting the CUDA Stream
cublasStatus cublasSetKernelStream (cudaStream_t stream)
Sets the CUBLAS stream in which all subsequent CUBLAS kernel
launches will run. By default, if the CUBLAS stream is not set, all
kernels use the NULL stream.
Setting/Getting Values asynchronously
cublasStatus cublasSetVectorAsync (int n, int elemSize, const void *x,
int incx, void *y, int incy, cudaStream_t stream)
cublasStatus cublasGetVectorAsync (int n, int elemSize, const void *x,
int incx, void *y, int incy, cudaStream_t stream)
cublasStatus cublasSetMatrixAsync (int rows, int cols,
int elemSize, const void *A, int lda, void *B, int ldb, cudaStream_t stream)
cublasStatus cublasGetMatrixAsync (int rows, int cols,
int elemSize, const void *A, int lda, void *B, int ldb, cudaStream_t stream)
BLAS Core Functions

- BLAS Level 1 Routines: vector-vector operations
- BLAS Level 2 Routines: matrix-vector operations
- BLAS Level 3 Routines: matrix-matrix operations

Quick Reference Guide: http://www.netlib.org/blas/blasqr.ps
BLAS Level 1 Routines: Naming Scheme
BLAS 1 routine names have the following structure:
cublas<datatype> <operation> <mod>
The <datatype> field indicates the data type:

S  real, single precision
C  complex, single precision
D  real, double precision
Z  complex, double precision
Some routines and functions can have combined character codes, such as
sc or dz.
The <operation> field, in BLAS level 1, indicates the operation type.
For example, the BLAS level 1 routines ?dot, ?rot, ?swap compute a
vector dot product, vector rotation, and vector swap, respectively.
The <mod> field, if present, provides additional details of the operation.
BLAS level 1 names can have the following characters in the <mod>
field:
c conjugated vector
u unconjugated vector
g Givens rotation
BLAS Level 1 Routines
Group    Data Types             Description
?asum    s, d, sc, dz           Sum of vector magnitudes
?axpy    s, d, c, z             Scalar-vector product
?copy    s, d, c, z             Copy vector
?dot     s, d                   Dot product
?dotc    c, z                   Dot product conjugated
?dotu    c, z                   Dot product unconjugated
?nrm2    s, d, sc, dz           Vector 2-norm (Euclidean norm)
?rot     s, d, c, z, cs, zd     Plane rotation of points
?rotg    s, d, c, z             Givens rotation of points
?rotm    s, d                   Modified plane rotation of points
?rotmg   s, d                   Givens modified plane rotation of points
?scal    s, d, c, z, cs, zd     Vector-scalar product
?swap    s, d, c, z             Vector-vector swap
i?amax   s, d, c, z             Index of the max. abs. value element of a vector
i?amin   s, d, c, z             Index of the min. abs. value element of a vector
BLAS Level 2/3 Routines: Naming Scheme
BLAS 2/3 routine names have the following structure:
cublas<datatype> <matrixtype> <operation>
In BLAS level 2 and 3, <matrixtype> reflects the matrix argument type:
ge  general matrix
gb  general band matrix
sy  symmetric matrix
sp  symmetric matrix (packed storage)
sb  symmetric band matrix
he  Hermitian matrix
hp  Hermitian matrix (packed storage)
hb  Hermitian band matrix
tr  triangular matrix
tp  triangular matrix (packed storage)
tb  triangular band matrix
BLAS level 2 names can have the following characters in the <operation> field:

mv  matrix-vector product
sv  solving a system of linear equations with matrix-vector operations
r   rank-1 update of a matrix
r2  rank-2 update of a matrix
BLAS level 3 names can have the following characters in the <operation> field:

mm   matrix-matrix product
sm   solving a system of linear equations with matrix-matrix operations
rk   rank-k update of a matrix
r2k  rank-2k update of a matrix
BLAS Level 2 Routines
Group   Data Types    Description
?gbmv   s, d, c, z    Matrix-vector product using a general band matrix
?gemv   s, d, c, z    Matrix-vector product using a general matrix
?ger    s, d          Rank-1 update of a general matrix
?gerc   c, z          Rank-1 update of a conjugated general matrix
?geru   c, z          Rank-1 update of a general matrix, unconjugated
?hbmv   c, z          Matrix-vector product using a Hermitian band matrix
?hemv   c, z          Matrix-vector product using a Hermitian matrix
?her    c, z          Rank-1 update of a Hermitian matrix
?her2   c, z          Rank-2 update of a Hermitian matrix
?hpmv   c, z          Matrix-vector product using a Hermitian packed matrix
?hpr    c, z          Rank-1 update of a Hermitian packed matrix
?hpr2   c, z          Rank-2 update of a Hermitian packed matrix
?sbmv   s, d          Matrix-vector product using a symmetric band matrix
?spmv   s, d          Matrix-vector product using a symmetric packed matrix
?spr    s, d          Rank-1 update of a symmetric packed matrix
?spr2   s, d          Rank-2 update of a symmetric packed matrix
?symv   s, d          Matrix-vector product using a symmetric matrix
?syr    s, d          Rank-1 update of a symmetric matrix
?syr2   s, d          Rank-2 update of a symmetric matrix
?tbmv   s, d, c, z    Matrix-vector product using a triangular band matrix
?tbsv   s, d, c, z    Solution of a linear system of equations with a triangular band matrix
?tpmv   s, d, c, z    Matrix-vector product using a triangular packed matrix
?tpsv   s, d, c, z    Solution of a linear system of equations with a triangular packed matrix
?trmv   s, d, c, z    Matrix-vector product using a triangular matrix
?trsv   s, d, c, z    Solution of a linear system of equations with a triangular matrix
BLAS Level 3 Routines
Group   Data Types    Description
?gemm   s, d, c, z    Matrix-matrix product of general matrices
?hemm   c, z          Matrix-matrix product of Hermitian matrices
?herk   c, z          Rank-k update of Hermitian matrices
?her2k  c, z          Rank-2k update of Hermitian matrices
?symm   s, d, c, z    Matrix-matrix product of symmetric matrices
?syrk   s, d, c, z    Rank-k update of symmetric matrices
?syr2k  s, d, c, z    Rank-2k update of symmetric matrices
?trmm   s, d, c, z    Matrix-matrix product of triangular matrices
?trsm   s, d, c, z    Linear matrix-matrix solution for triangular matrices
BLAS Level 1 Function: cublasDaxpy()

void cublasDaxpy (int n, double alpha, const double *x,
int incx, double *y, int incy)

Computes a vector-scalar product and adds the result to a vector:

y ← α ∗ x + y

Input:
n      number of elements in the input vectors
alpha  double-precision scalar multiplier
x      double-precision vector with n elements
incx   storage spacing between elements of x
y      double-precision vector with n elements
incy   storage spacing between elements of y

Output:
y      double-precision result (unchanged if n ≤ 0)
BLAS Level 2 Function: cublasDgemv()

void cublasDgemv (char trans, int m, int n, double alpha,
const double *A, int lda, const double *x,
int incx, double beta, double *y, int incy)

Computes a matrix-vector product using a general m × n matrix op(A) and adds the result to a scalar-vector product:

y ← α ∗ op(A) ∗ x + β ∗ y,  op(A) = A or Aᵀ depending on char trans

Input:
trans  specifies op(A). If trans == 'N' or 'n', op(A) = A; if trans == 'T', 't', 'C', or 'c', op(A) = Aᵀ
m      number of rows of matrix A; m must be at least zero
n      number of columns of matrix A; n must be at least zero
alpha  double-precision scalar multiplier applied to op(A)
A      double-precision array of dimensions (lda, n) if trans == 'N' or 'n', of dimensions (lda, m) otherwise
lda    leading dimension of the two-dimensional array used to store matrix A
x      double-precision array
incx   storage spacing between elements of x; incx must not be zero
beta   double-precision scalar multiplier applied to vector y; if beta is zero, y is not read
y      double-precision array
incy   storage spacing between elements of y; incy must not be zero

Output:
y      updated according to y = α ∗ op(A) ∗ x + β ∗ y
BLAS Level 3 Function: cublasDgemm()

void cublasDgemm (char transa, char transb, int m, int n,
int k, double alpha, const double *A, int lda,
const double *B, int ldb, double beta, double *C, int ldc)

Computes a scalar-matrix-matrix product using general matrices and adds the result to a scalar-matrix product:

C ← α ∗ op(A) ∗ op(B) + β ∗ C,  op(X) = X or Xᵀ depending on transa/transb

(op(A) is an m × k matrix, op(B) a k × n matrix, and C an m × n matrix.)

Input:
transa, transb  specify op(A) and op(B). If transa == 'N' or 'n', op(A) = A; if transa == 'T' or 't', op(A) = Aᵀ (and analogously for transb)
m, n, k         matrix dimensions
alpha, beta     double-precision scalar multipliers
A, B, C         double-precision arrays
lda, ldb, ldc   leading dimensions of the two-dimensional arrays used to store A, B, and C

Output:
C               updated according to C = α ∗ op(A) ∗ op(B) + β ∗ C
Example Code: y = 3 ∗ A ∗ x + 4 ∗ y, A an l × m matrix

#include <cublas.h>
...
int m, l, lda;
double *A, *x, *y;
double *A_gpu, *x_gpu, *y_gpu;
cublasStatus stat;

// allocate memory on the host
// (x needs m elements and y needs l elements to match the transfers below)
A = (double *) calloc(lda * m, sizeof(double));
x = (double *) calloc(m, sizeof(double));
y = (double *) calloc(l, sizeof(double));
// init A, x, y
...
// initialise CUBLAS (required before any other CUBLAS call)
stat = cublasInit();
// allocate memory on GPU
stat = cublasAlloc(lda * m, sizeof(double), (void **) &A_gpu);
stat = cublasAlloc(m, sizeof(double), (void **) &x_gpu);
stat = cublasAlloc(l, sizeof(double), (void **) &y_gpu);
// transfer input data host -> GPU
stat = cublasSetMatrix(l, m, sizeof(double), A, lda, A_gpu, lda);
stat = cublasSetVector(m, sizeof(double), x, 1, x_gpu, 1);
stat = cublasSetVector(l, sizeof(double), y, 1, y_gpu, 1);
// run BLAS Level 2 routine
cublasDgemv('n', l, m, 3., A_gpu, lda, x_gpu, 1, 4., y_gpu, 1);
stat = cublasGetError();
// transfer output data GPU -> host
stat = cublasGetVector(l, sizeof(double), y_gpu, 1, y, 1);
// deallocate GPU memory and shut down
cublasFree(A_gpu); cublasFree(x_gpu); cublasFree(y_gpu);
cublasShutdown();
References

CUDA CUBLAS Library, PG-05326-032 V02, August 2010:
http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUBLAS_Library.pdf

Basic Linear Algebra Subprograms: A Quick Reference Guide, May 11, 1997:
http://www.netlib.org/blas/blasqr.ps