#Tuning with tensile
See the Tensile tuning wiki for more details of Tensile and the specifics of the tuning process. The staging area we use for the our process is within build directories in the root level of the Tensile clone. This step will produce LibraryLogic files within each for the build subdirectories. ../build-hgemm_asm_full/3_LibraryLogic
.
The updated configurations get placed in the rocBLAS path rocBLAS/library/src/blas3/Tensile/Logic/asm_full
. They can either go in raw if a complete tuning was targeted or the configuration can be merged using the Tensile merging tool, Tensile/Utilities/merge_rocblas_yaml_files.py
.
run ./install -c
in the rocBLAS source directory. See rocBLAS build for more information.
The performance can be measured using rocblas-bench.
Run benchmarks with the following:
cd ${BUILD_DIR}/release/clients/staging
./rocblas-bench -h
The following are examples for running particular gemm and gemv benchmark:
./rocblas-bench -f gemm -r s -m 1024 -n 1024 -k 1024 --transposeB T -v 1
./rocblas-bench -f gemv -m 9216 -n 9216 --lda 9216 --transposeA T
The automation utilities are intended to be a set of tools independent of any formal Tensile components and as such, there is no need to have a Tensile clone to work with them. They can be copied to any working area in which the tuning is being performed. Any provisioning operation of Tensile or rocBLAS which the tuning work-flow depends upon are provided by the utilities.
$ sudo apt install python-pip
$ sudo apt install python3-pip
$ sudo pip install pandas
$ sudo pip3 install pandas
$ sudo pip3 install matplotlib
$ sudo pip install openpyxl
$ sudo apt-get install python3-tk
The automation scripts is a set of utilities which enables the work-flow of the process for performance optimization using Tensile and rocBLAS.
This step is driven by the scripts/provision_tuning.sh script. This script will take a log file traces the rocblas-bench calls extracts the sizes and generates the log file which can be used in tuning. It will also provision Tensile and stage it to run the actual tuning.
The log file can be generated using ROCBLAS_LAYER=2 in rocblas bench. e.g.
$ ROCBLAS_LAYER=2 ./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1600 -n 512 -k 1024 --alpha -1.0 --lda 1600 --ldb 1024 --beta 1.0 --ldc 1600 -i 10
This will produce the calls to rocblas. Example output.
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1600 -n 512 -k 1024 --alpha -1 --lda 1600 --ldb 1024 --beta 1 --ldc 1600 -i 10
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1600 -n 512 -k 1024 --alpha -1 --lda 1600 --ldb 1024 --beta 1 --ldc 1600 -i 10
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1600 -n 512 -k 1024 --alpha -1 --lda 1600 --ldb 1024 --beta 1 --ldc 1600 -i 10
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1600 -n 512 -k 1024 --alpha -1 --lda 1600 --ldb 1024 --beta 1 --ldc 1600 -i 10
transA,transB,M,N,K,alpha,lda,ldb,beta,ldc,rocblas-Gflops,us
N,N,1600,512,1024,-1,1600,1024,1,1600,6641.81,252.6
usage:
./provision_tuning.sh [-w|--working-path <path>] [-z | --size-log <logfile path>] [-b|--branch <branch>] [-c <github commit id>] [-t|--tag <github tag>] [-o|--output <configuration filename>] [-y | --type <data type>] [-l | --library <library/schedule>] [-n] [[-h|--help]
args:
-w|--working-path the working directory
-z | --size-log the path to the log file used to generate the sizes
-b|--branch the github branch id
-c github commit id
-t|--tag github tag
-o|--output output file name
-y | --type the datatype to tune for (hgemm|sgemm|dgemm)
-l | --library the library to use (arcturus|vega10|vega20)
-n if this is enabled the utility will generate the config files without provisioning tensile
-h|--help this help file
Example use case
$ ./tuning/scripts/provision_tuning.sh WORKING_PATH LOG_PATH OUTPUT_SUFFIX.yaml LIBRARY [options]
When this is run the tuning will be provisioned in the directory ./tensile_tuing. The following directories will be generated.
configs: this is where the tuning configurations that were created will be placed
make: this constructs a set of scripts which will execute the tuning process get placed in the Tensile path
scripts: generated scripts with rocblas calls which will be used for testing
sizes: the extracted sizes csv file
tensile: the directory which tensile gets cloned to
After this is executed everything will be provisioned. The tuning can be launched with the following steps:
$ cd tensile_tuning/tensile/Tensile
$ ./doit-all.sh
usage: ./provision_verification.sh [-w|--working-path <path>] [-r <Tensile reference>] [-b|--branch <branch>] [-c | --commit <github commit id>] [-t|--tag <github tag>] [-h|--help]
args:
-h | --help help
-w | --working-path the working path name
-t | --tag a github tag
-b | --branch github branch name
-c | --commit github commit id
-r the Tensile path where the tuning which contains the tuning results
When this utility is ran, it will create the following directories in the working path:
library\exact contains the new logic files library\merge contains the new logic merged into the reference logic file, this gets copied to rocBLAS-verify for testing rocblas\rocBLAS-reference contains the reference build of rocblas rocblas\rocBLAS-verify contains a build of rocblas with the new logic merged in
This will provision and build the reference and the verification version of rocBLAS. After it is provisioned, we can copy over the test scripts and execute the scripts in both clones.
Example:
```bash
$ ./tuning/scripts/provision_verification.sh -w validate -r tensile_tuning/tensile/Tensile
$ cp tensile_tuning/scripts/* validate/rocblas/rocBLAS-reference/build/release/clients/staging
$ cp tensile_tuning/scripts/* validate/rocblas/rocBLAS-validate/build/release/clients/staging
$ pushd validate/rocblas/rocBLAS-reference/build/release/clients/staging
$ ./doit_all.sh
$ popd
$ pushd validate/rocblas/rocBLAS-validate/build/release/clients/staging
$ ./doit_all.sh
$ popd
usage: ./analyze-results.sh [-b|--benchmark-path <benchmark results path>] [-r| --reference-path <reference results path>] [-o|--output <output path>] [-f] [-s] [-z] [-g|--gpu] [-m|--mfma] [-h|--help]
args:
-b|--benchmark-path the benchmark path (rocBLAS with new logic)
-r|--reference-path the reference path (rocBLAS for reference)
-o|--output the output of the analysis
-f frequency which the gpu was set at during validation
-s the size of the datatype
-z the log file used, in order to collect the call_count (-i)
-g|--gpu the gpu used when tuning
-m|--mfma whether mfma instructions were used during validation
-h|--help the help
When this utility is ran, it will generate analysis in the following directories:
final results with the old and new logic compared
new results for the version of rocBLAS with the new logic
reference results for the version of rocBLAS without the new logic
Example:
$ ./tuning/scripts/analyze-results.sh -o analysis -s 2 -f 1301 -r validate/rocblas/rocBLAS-reference/build/release/clients/staging/results -b validate/rocblas/rocBLAS-verify/build/release/clients/staging/results -z sizes.log -g mi50 -m disabled