Sgap: towards efficient sparse tensor algebra compilation for GPU

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

Sparse compilers are a promising solution for sparse tensor algebra optimization. In compiler implementation, reduction in sparse-dense hybrid algebra plays a key role in performance. Although GPUs provide various reduction semantics that can better utilize their parallel computing and memory bandwidth capacity, the central question is how to elevate these flexible reduction semantics to a sparse compilation theory that assumes serial execution. Specifically, we have to tackle two main challenges: (1) parallelism is wasted by adopting a static synchronization granularity, and (2) a static reduction strategy limits exploration of the optimization space. We propose Sgap: segment group and atomic parallelism, to solve these problems. Atomic parallelism captures the flexible reduction semantics to systematically analyze the optimization space of sparse-dense hybrid algebra on GPU. It is a new optimization technique beyond current compiler-based approaches and open-source runtime libraries. Segment group elevates the flexible reduction semantics to suitable levels of abstraction in the sparse compilation theory. It adopts a changeable group size and a user-defined reduction strategy to solve challenges (1) and (2), respectively. Finally, we use GPU sparse matrix-matrix multiplication (SpMM) in the TACO compiler as a use case to demonstrate the effectiveness of segment group in reduction semantics elevation. We achieve up to \(1.2\,\times\) speedup over the original TACO’s SpMM kernels. We also apply new optimization techniques found by atomic parallelism to dgSPARSE, an open-source state-of-the-art SpMM library, and achieve \(1.6\,\times \sim 2.3\,\times\) speedup on the algorithm tuned with atomic parallelism.
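
To make the granularity trade-off concrete, below is a minimal CUDA sketch (not the paper's generated code) of two extremes of GPU reduction semantics for SpMM \(C = A \times B\): per-thread global atomics versus a warp-wide shuffle reduction. The kernel names, the CSR layout, and the suggested launch shapes are illustrative assumptions only.

    // Minimal sketch: SpMM with A in CSR (rowptr, colind, vals), B dense
    // row-major with N columns, and C zero-initialized. Illustrative only.

    // Finest synchronization granularity: every thread resolves its partial
    // product with a global atomic. Launch e.g. grid(M, (N+7)/8), block(32, 8).
    __global__ void spmm_atomic(const int *rowptr, const int *colind,
                                const float *vals, const float *B, float *C,
                                int M, int N) {
      int row = blockIdx.x;                             // one sparse row per block
      int col = blockIdx.y * blockDim.y + threadIdx.y;  // dense column of B
      if (row >= M || col >= N) return;
      for (int p = rowptr[row] + threadIdx.x; p < rowptr[row + 1]; p += blockDim.x)
        atomicAdd(&C[row * N + col], vals[p] * B[colind[p] * N + col]);
    }

    // Coarser granularity: the 32 lanes of a warp cooperate on one (row, col)
    // pair, combine partial sums with shuffles, and issue a single plain store.
    // Launch e.g. grid(M, N), block(32).
    __global__ void spmm_warp_reduce(const int *rowptr, const int *colind,
                                     const float *vals, const float *B, float *C,
                                     int M, int N) {
      int row = blockIdx.x, col = blockIdx.y;
      if (row >= M || col >= N) return;
      float acc = 0.f;
      for (int p = rowptr[row] + threadIdx.x; p < rowptr[row + 1]; p += 32)
        acc += vals[p] * B[colind[p] * N + col];
      for (int off = 16; off > 0; off >>= 1)            // warp-level tree reduction
        acc += __shfl_down_sync(0xffffffff, acc, off);
      if (threadIdx.x == 0) C[row * N + col] = acc;
    }

Which variant, or which intermediate group size between them, performs best depends on the sparsity pattern and the number of dense columns; this is the kind of trade-off that atomic parallelism is meant to analyze systematically and that segment group exposes to the compiler through a changeable group size.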



Notes

  1. We build on commit d0654a8 https://github.com/zhang677/taco/tree/d0654a84137169883973c40a951dfdb89883fd9c.

  2. We do not integrate these macro instructions into TACO, because doing so is straightforward and purely an engineering effort. When testing the kernels, we simply replace the atomicAdd with the new macro instructions; an illustrative sketch of such a substitution follows these notes. We open-source the modified TACO at https://github.com/zhang677/taco/tree/parallelreduction.

  3. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups.

  4. The authors of Senanayake et al. (2020) shared their code with us. We also use a similar code base to test our kernels in Sect. 7.

  5. https://docs.nvidia.com/nsight-compute/NsightCompute/index.html.

  6. We open source the testing code at https://github.com/zhang677/segTACO.

  7. https://github.com/dgSPARSE.

  8. We open source our implementation at https://github.com/dgSPARSE/dgSPARSE-Library/commit/9e3e4c18f40e76b97a805b8a9733258f7e9edeb6.
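
As an illustration of the substitution described in note 2, a macro of the following shape could stand in for a generated atomicAdd call and lower either to the plain global atomic or to a cooperative-groups tile reduction (cf. note 3). This is a hedged sketch: the macro names, the helper function, and the group size of 8 are hypothetical, not TACO's generated code.

    #include <cooperative_groups.h>
    #include <cooperative_groups/reduce.h>   // cg::reduce, requires CUDA 11+
    namespace cg = cooperative_groups;

    // Strategy A: keep the generated code's plain global atomic.
    #define REDUCE_ADD_ATOMIC(dst, val) atomicAdd((dst), (val))

    // Strategy B: let a tile of G threads combine their values first, so only
    // one atomic per tile reaches global memory. Only valid when all G threads
    // of the tile accumulate into the same destination address.
    template <int G>
    __device__ void reduce_add_group(float *dst, float val) {
      auto tile = cg::tiled_partition<G>(cg::this_thread_block());
      float sum = cg::reduce(tile, val, cg::plus<float>());
      if (tile.thread_rank() == 0) atomicAdd(dst, sum);
    }
    #define REDUCE_ADD_GROUP(dst, val) reduce_add_group<8>((dst), (val))

    // Hypothetical use inside a generated kernel body:
    //   REDUCE_ADD_GROUP(&C[row * N + col], partial);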

References

  • Asgari, B., Hadidi, R., Cao, J., Lim, S.-K., Kim, H., et al.: Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 908–920 (2021). IEEE

  • Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1–11 (2009)

  • Bell, N., Dalton, S., Olson, L.N.: Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J. Sci. Comput. 34(4), 123–152 (2012)


  • Bik, A.J., Koanantakool, P., Shpeisman, T., Vasilache, N., Zheng, B., Kjolstad, F.: Compiler support for sparse tensor computations in mlir. arXiv:2202.04305 (2022)

  • Bik, A.J., Wijshoff, H.A.: Compilation techniques for sparse matrix computations. In: Proceedings of the 7th International Conference on Supercomputing, pp. 416–424 (1993)

  • Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al.: Tvm: An automated end-to-end optimizing compiler for deep learning. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594 (2018)

  • Chou, S., Kjolstad, F., Amarasinghe, S.: Format abstraction for sparse tensor algebra compilers. Proc. ACM Program. Lang. 2(OOPSLA), 123:1–123:30 (2018). https://doi.org/10.1145/3276493

  • Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)


  • Dai, G., Huang, G., Yang, S., Yu, Z., Zhang, H., Ding, Y., Xie, Y., Yang, H., Wang, Y.: Heuristic adaptability to input dynamics for spmm on gpus. arXiv:2202.08556 (2022)

  • Feng, S., Hou, B., Jin, H., Lin, W., Shao, J., Lai, R., Ye, Z., Zheng, L., Yu, C.H., Yu, Y., et al.: Tensorir: An abstraction for automatic tensorized program optimization. arXiv:2207.04296 (2022)

  • Guennebaud, G., Jacob, B., et al.: Eigen v3. http://eigen.tuxfamily.org (2010)

  • Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 30 (2017)

  • Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: Eie: efficient inference engine on compressed deep neural network. ACM SIGARCH Comput. Archit. News 44(3), 243–254 (2016)


  • Hidayetoğlu, M., Pearson, C., Mailthody, V.S., Ebrahimi, E., Xiong, J., Nagi, R., Hwu, W.-m.: At-scale sparse deep neural network inference with efficient gpu implementation. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2020). IEEE

  • Hong, C., Sukumaran-Rajam, A., Nisa, I., Singh, K., Sadayappan, P.: Adaptive sparse tiling for sparse matrix multiplication. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp. 300–314 (2019)

  • Huang, G., Dai, G., Wang, Y., Yang, H.: Ge-spmm: general-purpose sparse matrix-matrix multiplication on gpus for graph neural networks. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2020). IEEE

  • Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv:1609.02907 (2016)

  • Kjolstad, F., Ahrens, P., Kamil, S., Amarasinghe, S.: Tensor algebra compilation with workspaces. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 180–192 (2019)

  • Kjolstad, F., Kamil, S., Chou, S., Lugato, D., Amarasinghe, S.: The tensor algebra compiler. Proc. ACM Program. Lang. 1(OOPSLA), 77:1–77:29 (2017). https://doi.org/10.1145/3133901

  • Kjolstad, F.: Sparse tensor algebra compilation. Ph.d. thesis, Massachusetts Institute of Technology, Cambridge, MA (2020). http://tensor-compiler.org/files/kjolstad-phd-thesis-taco-compiler.pdf

  • Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)


  • Kurt, S.E., Raje, S., Sukumaran-Rajam, A., Sadayappan, P.: Sparsity-aware tensor decomposition. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 952–962 (2022). IEEE

  • Lin, C.-Y., Luo, L., Ceze, L.: Accelerating spmm kernel with cache-first edge sampling for graph neural networks. arXiv:2104.10716 (2021)

  • Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814 (2015)

  • Mehrabi, A., Lee, D., Chatterjee, N., Sorin, D.J., Lee, B.C., O’Connor, M.: Learning sparse matrix row permutations for efficient spmm on gpu architectures. In: 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 48–58 (2021). IEEE

  • Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010)

  • Nisa, I., Li, J., Sukumaran-Rajam, A., Vuduc, R., Sadayappan, P.: Load-balanced sparse mttkrp on gpus. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 123–133 (2019). IEEE

  • Popoola, T., Shankar, R., Rift, A., Singh, S., Davis, E.C., Strout, M.M., Olschanowsky, C.: An object-oriented interface to the sparse polyhedral library. In: 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1825–1831 (2021). IEEE

  • Qin, E., Garg, R., Bambhaniya, A., Pellauer, M., Parashar, A., Rajamanickam, S., Hao, C., Krishna, T.: Enabling flexibility for sparse tensor acceleration via heterogeneity. arXiv:2201.08916 (2022)

  • Senanayake, R., Hong, C., Wang, Z., Wilson, A., Chou, S., Kamil, S., Amarasinghe, S., Kjolstad, F.: A sparse iteration space transformation framework for sparse tensor algebra. Proc. ACM Program. Lang. 4(OOPSLA) (2020). https://doi.org/10.1145/3428226

  • Shantharam, M., Srinivasmurthy, S., Raghavan, P.: Characterizing the impact of soft errors on iterative methods in scientific computing. In: Proceedings of the International Conference on Supercomputing, pp. 152–161 (2011)

  • Strout, M.M., Hall, M., Olschanowsky, C.: The sparse polyhedral framework: composing compiler-generated inspector-executor code. Proc. IEEE 106(11), 1921–1934 (2018)


  • Venkat, A., Hall, M., Strout, M.: Loop and data transformations for sparse matrix code. ACM SIGPLAN Not. 50(6), 521–532 (2015)


  • Wang, Z., Wohlwend, J., Lei, T.: Structured pruning of large language models. arXiv:1910.04732 (2019)

  • Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y.: Intel math kernel library. In: High-Performance Computing on the Intel® Xeon Phi\(^{{\rm TM}}\), pp. 167–188. Springer, Cham (2014)

  • Xin, J., Ye, X., Zheng, L., Wang, Q., Huang, Y., Yao, P., Yu, L., Liao, X., Jin, H.: Fast sparse deep neural network inference with flexible spmm optimization space exploration. In: 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2021). IEEE

  • Yang, C., Buluç, A., Owens, J.D.: Design principles for sparse matrix multiplication on the gpu. In: European Conference on Parallel Processing, pp. 672–687 (2018). Springer

  • Ye, Z., Lai, R., Shao, J., Chen, T., Ceze, L.: Sparsetir: composable abstractions for sparse compilation in deep learning

  • Yu, Z., Dai, G., Huang, G., Wang, Y., Yang, H.: Exploiting online locality and reduction parallelism for sampled dense matrix multiplication on gpus. In: 2021 IEEE 39th International Conference on Computer Design (ICCD), pp. 567–574 (2021). IEEE

  • Yuster, R., Zwick, U.: Detecting short directed cycles using rectangular matrix multiplication and dynamic programming. In: SODA, vol. 4, pp. 254–260 (2004). Citeseer


Author information


Corresponding authors

Correspondence to Guohao Dai or Yu Wang.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, G., Zhao, Y., Tao, Y. et al. Sgap: towards efficient sparse tensor algebra compilation for GPU. CCF Trans. HPC 5, 210–227 (2023). https://doi.org/10.1007/s42514-023-00140-4


  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s42514-023-00140-4

Keywords