The NVIDIA CUBLAS Library
Dr. Volker Weinberg
Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften
volker.weinberg@lrz.de
GPGPU Programming
LRZ, 10.-11. October 2011
Outline

1. Overview
2. Compilation and Initialisation
3. Error Handling
4. CUBLAS Helper Functions
5. BLAS Core Functions
   - BLAS Level 1 Routines
   - BLAS Level 2 and 3 Routines
6. Description of some important BLAS Routines
7. Example Code
8. References
Overview

CUBLAS
- Implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime.
- Self-contained library: no direct interaction with the CUDA driver is necessary.
- Helper functions are provided to (de)allocate host and GPU memory for matrix objects, fill them with data, and transfer them to/from the GPU.
- For maximum compatibility with existing Fortran environments, CUBLAS uses
  - column-major storage
  - 1-based indexing
- The header file cublas.h has to be included.
- Applications need to link against the dynamic CUBLAS library libcublas.so and the CUDA runtime library libcudart.so.
- This talk describes the CUBLAS library shipped with CUDA 3.x (the "legacy CUBLAS API"); starting with CUDA Toolkit 4.0, a slightly different new API is introduced.
Compilation and Initialisation

export CUDA_BASE=/lrz/sys/parallel/cuda/3.2/cuda/
export LD_LIBRARY_PATH=$CUDA_BASE/lib64:$LD_LIBRARY_PATH
g++ -I$CUDA_BASE/include -L$CUDA_BASE/lib64/ -lcudart -lcublas file.c -o file

Compilation and Initialisation @ LRZ

module load cuda
g++ -I$CUDA_BASE/include -L$CUDA_BASE/lib64/ -lcudart -lcublas file.c -o file
Error Handling

- The status of CUBLAS core functions can be retrieved via cublasGetError().
- CUBLAS helper functions return their status directly.
- The type cublasStatus is used for core function status returns.

cublasStatus values:

CUBLAS_STATUS_SUCCESS           operation completed successfully
CUBLAS_STATUS_NOT_INITIALIZED   CUBLAS library not initialized
CUBLAS_STATUS_ALLOC_FAILED      resource allocation failed
CUBLAS_STATUS_INVALID_VALUE     unsupported numerical value was passed to function
CUBLAS_STATUS_ARCH_MISMATCH     function requires an architectural feature absent from the device
CUBLAS_STATUS_MAPPING_ERROR     access to GPU memory space failed
CUBLAS_STATUS_EXECUTION_FAILED  GPU program failed to execute
CUBLAS_STATUS_INTERNAL_ERROR    an internal CUBLAS operation failed
CUBLAS Helper Functions
Initialisation/Release of GPU
cublasInit()
cublasShutdown()
De/Allocation of GPU memory
cublasAlloc()
cublasFree()
Setting/Getting Matrix/Vector values
cublasSetVector()
cublasSetMatrix()
cublasGetVector()
cublasGetMatrix()
Error Handling
cublasGetError()
Asynchronous I/O using CUDA streams
cublasSetKernelStream()
cublasSetVectorAsync()
cublasSetMatrixAsync()
cublasGetVectorAsync()
cublasGetMatrixAsync()
Helper Functions: Initialisation/Release of GPU
Initialisation of GPU
cublasStatus cublasInit(void)
Initialises the CUBLAS library and must be called before any other
CUBLAS API function is invoked. It allocates hardware resources
necessary for accessing the GPU. It attaches CUBLAS to whatever GPU
is currently bound to the host thread from which it was invoked.
Release of GPU
cublasStatus cublasShutdown(void)
Releases CPU-side resources used by the CUBLAS library. The release of
GPU-side resources may be deferred until the application shuts down.
Helper Functions: (De-)Allocation of GPU memory
Allocation of GPU memory
cublasStatus cublasAlloc(int n, int elemSize,
void **devicePtr)
Creates an object in GPU memory space capable of holding an array of n
elements, where each element requires elemSize bytes of storage. If the
function call is successful, a pointer to the object in GPU memory space
is placed in devicePtr.
Deallocation of GPU memory
cublasStatus cublasFree(const void *devicePtr)
Destroys the object in GPU memory space referenced by devicePtr.
Helper Functions: Error Handling
Error Handling
cublasStatus cublasGetError(void)
Returns the last error that occurred on invocation of any of the CUBLAS
core functions. Reading the error status via cublasGetError() resets the
internal error state to CUBLAS_STATUS_SUCCESS.
Helper Functions: Setting/Getting Vector values
Initialising a Vector on the GPU
cublasStatus cublasSetVector (int n, int elemSize,
const void *x, int incx, void *y, int incy)
Copies n elements from a vector x in CPU memory space to a vector y in
GPU memory space. Elements in both vectors are assumed to have a size
of elemSize bytes. Storage spacing between consecutive elements is incx
for the source vector x and incy for the destination vector y. In general, y
points to an object, or part of an object, allocated via cublasAlloc().
Getting Values of a Vector on the GPU
cublasStatus cublasGetVector (int n, int elemSize,
const void *x, int incx, void *y, int incy)
Copies n elements from a vector x in GPU memory space to a vector y in
CPU memory space.
Helper Functions: Setting/Getting Matrix values
Initialising a Matrix on the GPU
cublasStatus cublasSetMatrix (int rows, int cols,
int elemSize, const void *A, int lda, void *B, int ldb)
Copies a tile of rows×cols elements from a matrix A in CPU memory
space to a matrix B in GPU memory space. Each element requires
storage of elemSize bytes. Both matrices are assumed to be stored in
column-major format, with the leading dimension (that is, the number of
rows) of source matrix A provided in lda, and the leading dimension of
destination matrix B provided in ldb. B is a device pointer that points to
an object, or part of an object, that was allocated in GPU memory space
via cublasAlloc().
Getting Values of a Matrix on the GPU
cublasStatus cublasGetMatrix (int rows, int cols,
int elemSize, const void *A, int lda, void *B, int ldb)
Copies a tile of rows×cols elements from a matrix A in GPU memory
space to a matrix B in CPU memory space.
Helper Functions: Asynchronous I/O with CUDA Streams

Setting the CUDA Stream
cublasStatus cublasSetKernelStream (cudaStream_t stream)
Sets the CUBLAS stream in which all subsequent CUBLAS kernel
launches will run. By default, if the CUBLAS stream is not set, all
kernels use the NULL stream.
Setting/Getting Values asynchronously
cublasStatus cublasSetVectorAsync (int n, int elemSize, const void *x,
int incx, void *y, int incy, cudaStream_t stream)
cublasStatus cublasGetVectorAsync (int n, int elemSize, const void *x,
int incx, void *y, int incy, cudaStream_t stream)
cublasStatus cublasSetMatrixAsync (int rows, int cols,
int elemSize, const void *A, int lda, void *B, int ldb, cudaStream_t stream)
cublasStatus cublasGetMatrixAsync (int rows, int cols,
int elemSize, const void *A, int lda, void *B, int ldb, cudaStream_t stream)
BLAS Core Functions

- BLAS Level 1 Routines: vector-vector operations
- BLAS Level 2 Routines: matrix-vector operations
- BLAS Level 3 Routines: matrix-matrix operations

Quick Reference Guide: http://www.netlib.org/blas/blasqr.ps
BLAS Level 1 Routines: Naming Scheme
BLAS 1 routine names have the following structure:
cublas<datatype> <operation> <mod>
The <datatype> field indicates the data type:

S  real, single precision
C  complex, single precision
D  real, double precision
Z  complex, double precision
Some routines and functions can have combined character codes, such as
sc or dz.
The <operation> field, in BLAS level 1, indicates the operation type.
For example, the BLAS level 1 routines ?dot, ?rot, ?swap compute a
vector dot product, vector rotation, and vector swap, respectively.
The <mod> field, if present, provides additional details of the operation.
BLAS level 1 names can have the following characters in the <mod>
field:
c conjugated vector
u unconjugated vector
g Givens rotation
BLAS Level 1 Routines
Group    Data Types             Description
?asum    s, d, sc, dz           Sum of vector magnitudes
?axpy    s, d, c, z             Scalar-vector product
?copy    s, d, c, z             Copy vector
?dot     s, d                   Dot product
?dotc    c, z                   Dot product conjugated
?dotu    c, z                   Dot product unconjugated
?nrm2    s, d, sc, dz           Vector 2-norm (Euclidean norm)
?rot     s, d, c, z, cs, zd     Plane rotation of points
?rotg    s, d, c, z             Givens rotation of points
?rotm    s, d                   Modified plane rotation of points
?rotmg   s, d                   Givens modified plane rotation of points
?scal    s, d, c, z, cs, zd     Vector-scalar product
?swap    s, d, c, z             Vector-vector swap
i?amax   s, d, c, z             Index of the max. abs. value element of a vector
i?amin   s, d, c, z             Index of the min. abs. value element of a vector
BLAS Level 2/3 Routines: Naming Scheme
BLAS 2/3 routine names have the following structure:
cublas<datatype> <matrixtype> <operation>
In BLAS level 2 and 3, <matrixtype> reflects the matrix argument type:
ge  general matrix
gb  general band matrix
sy  symmetric matrix
sp  symmetric matrix (packed storage)
sb  symmetric band matrix
he  Hermitian matrix
hp  Hermitian matrix (packed storage)
hb  Hermitian band matrix
tr  triangular matrix
tp  triangular matrix (packed storage)
tb  triangular band matrix
BLAS level 2 names can have the following characters in the <operation> field:

mv  matrix-vector product
sv  solving a system of linear equations with matrix-vector operations
r   rank-1 update of a matrix
r2  rank-2 update of a matrix
BLAS level 3 names can have the following characters in the <operation> field:

mm   matrix-matrix product
sm   solving a system of linear equations with matrix-matrix operations
rk   rank-k update of a matrix
r2k  rank-2k update of a matrix
BLAS Level 2 Routines
Group   Data Types    Description
?gbmv   s, d, c, z    Matrix-vector product using a general band matrix
?gemv   s, d, c, z    Matrix-vector product using a general matrix
?ger    s, d          Rank-1 update of a general matrix
?gerc   c, z          Rank-1 update of a conjugated general matrix
?geru   c, z          Rank-1 update of a general matrix, unconjugated
?hbmv   c, z          Matrix-vector product using a Hermitian band matrix
?hemv   c, z          Matrix-vector product using a Hermitian matrix
?her    c, z          Rank-1 update of a Hermitian matrix
?her2   c, z          Rank-2 update of a Hermitian matrix
?hpmv   c, z          Matrix-vector product using a Hermitian packed matrix
?hpr    c, z          Rank-1 update of a Hermitian packed matrix
?hpr2   c, z          Rank-2 update of a Hermitian packed matrix
?sbmv   s, d          Matrix-vector product using a symmetric band matrix
?spmv   s, d          Matrix-vector product using a symmetric packed matrix
?spr    s, d          Rank-1 update of a symmetric packed matrix
?spr2   s, d          Rank-2 update of a symmetric packed matrix
?symv   s, d          Matrix-vector product using a symmetric matrix
?syr    s, d          Rank-1 update of a symmetric matrix
?syr2   s, d          Rank-2 update of a symmetric matrix
?tbmv   s, d, c, z    Matrix-vector product using a triangular band matrix
?tbsv   s, d, c, z    Solution of a linear system of equations with a triangular band matrix
?tpmv   s, d, c, z    Matrix-vector product using a triangular packed matrix
?tpsv   s, d, c, z    Solution of a linear system of equations with a triangular packed matrix
?trmv   s, d, c, z    Matrix-vector product using a triangular matrix
?trsv   s, d, c, z    Solution of a linear system of equations with a triangular matrix
BLAS Level 3 Routines
Group   Data Types    Description
?gemm   s, d, c, z    Matrix-matrix product of general matrices
?hemm   c, z          Matrix-matrix product of Hermitian matrices
?herk   c, z          Rank-k update of Hermitian matrices
?her2k  c, z          Rank-2k update of Hermitian matrices
?symm   s, d, c, z    Matrix-matrix product of symmetric matrices
?syrk   s, d, c, z    Rank-k update of symmetric matrices
?syr2k  s, d, c, z    Rank-2k update of symmetric matrices
?trmm   s, d, c, z    Matrix-matrix product of triangular matrices
?trsm   s, d, c, z    Linear matrix-matrix solution for triangular matrices
BLAS Level 1 Function: cublasDaxpy()

void cublasDaxpy (int n, double alpha, const double *x,
int incx, double *y, int incy)

Computes a vector-scalar product and adds the result to a vector:

y ← α ∗ x + y

Input:
n      number of elements in the input vectors
alpha  double-precision scalar multiplier
x      double-precision vector with n elements
incx   storage spacing between elements of x
y      double-precision vector with n elements
incy   storage spacing between elements of y

Output:
y      double-precision result (unchanged if n ≤ 0)
BLAS Level 2 Function: cublasDgemv()

void cublasDgemv (char trans, int m, int n, double alpha,
const double *A, int lda, const double *x,
int incx, double beta, double *y, int incy)

Computes a matrix-vector product using a general m × n matrix op(A) and adds the result to a scalar-vector product:

y ← α ∗ op(A) ∗ x + β ∗ y,  op(A) = A or Aᵀ depending on char trans

Input:
trans  specifies op(A). If trans == 'N' or 'n', op(A) = A; if trans == 'T', 't', 'C', or 'c', op(A) = Aᵀ
m      number of rows of matrix A; m must be at least zero
n      number of columns of matrix A; n must be at least zero
alpha  double-precision scalar multiplier applied to op(A)
A      double-precision array of dimensions (lda, n) if trans == 'N' or 'n', of dimensions (lda, m) otherwise
lda    leading dimension of the two-dimensional array used to store matrix A
x      double-precision array
incx   storage spacing between elements of x; incx must not be zero
beta   double-precision scalar multiplier applied to vector y; if beta is zero, y is not read
y      double-precision array
incy   storage spacing between elements of y; incy must not be zero

Output:
y      updated according to y = α ∗ op(A) ∗ x + β ∗ y
BLAS Level 3 Function: cublasDgemm()

void cublasDgemm (char transa, char transb, int m, int n,
int k, double alpha, const double *A, int lda,
const double *B, int ldb, double beta, double *C, int ldc)

Computes a scalar-matrix-matrix product using general matrices and adds the result to a scalar-matrix product:

C ← α ∗ op(A) ∗ op(B) + β ∗ C,  op(X) = X or Xᵀ depending on transa/transb

(op(A) is an m × k matrix, op(B) a k × n matrix, and C an m × n matrix.)

Input:
transa, transb  specify op(A) and op(B). If transa == 'N' or 'n', op(A) = A; if transa == 'T' or 't', op(A) = Aᵀ (and analogously for transb)
m, n, k         matrix dimensions
alpha, beta     double-precision scalar multipliers
A, B, C         double-precision arrays
lda, ldb, ldc   leading dimensions of the two-dimensional arrays used to store A, B, and C

Output:
C               updated according to C = α ∗ op(A) ∗ op(B) + β ∗ C
Example Code: y = 3 ∗ A ∗ x + 4 ∗ y, A an l × m matrix

#include <cublas.h>
...
int m, l, lda;
double *A, *x, *y;
double *A_gpu, *x_gpu, *y_gpu;
cublasStatus stat;

// allocate memory on the host
// (x needs m elements and y needs l elements to match the transfers below)
A = (double *) calloc(lda * m, sizeof(double));
x = (double *) calloc(m, sizeof(double));
y = (double *) calloc(l, sizeof(double));
// init A, x, y
...
// initialise CUBLAS (required before any other CUBLAS call)
stat = cublasInit();
// allocate memory on GPU
stat = cublasAlloc(lda * m, sizeof(double), (void **) &A_gpu);
stat = cublasAlloc(m, sizeof(double), (void **) &x_gpu);
stat = cublasAlloc(l, sizeof(double), (void **) &y_gpu);
// transfer input data host -> GPU
stat = cublasSetMatrix(l, m, sizeof(double), A, lda, A_gpu, lda);
stat = cublasSetVector(m, sizeof(double), x, 1, x_gpu, 1);
stat = cublasSetVector(l, sizeof(double), y, 1, y_gpu, 1);
// run BLAS Level 2 routine
cublasDgemv('n', l, m, 3., A_gpu, lda, x_gpu, 1, 4., y_gpu, 1);
stat = cublasGetError();
// transfer output data GPU -> host
stat = cublasGetVector(l, sizeof(double), y_gpu, 1, y, 1);
// deallocate GPU memory and shut down
cublasFree(A_gpu); cublasFree(x_gpu); cublasFree(y_gpu);
cublasShutdown();
References

CUDA CUBLAS Library, PG-05326-032 V02, August 2010:
http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUBLAS_Library.pdf

Basic Linear Algebra Subprograms: A Quick Reference Guide, May 11, 1997:
http://www.netlib.org/blas/blasqr.ps