Sgap: towards efficient sparse tensor algebra compilation for GPU

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

Sparse compilers are a promising solution for sparse tensor algebra optimization. In compiler implementation, reduction in sparse-dense hybrid algebra plays a key role in performance. Although GPUs provide various reduction semantics that can better utilize their parallel computing and memory bandwidth capacity, the central question is how to elevate these flexible reduction semantics to a sparse compilation theory that assumes serial execution. Specifically, we have to tackle two main challenges: (1) parallelism is wasted by adopting a static synchronization granularity, and (2) a static reduction strategy limits exploration of the optimization space. We propose Sgap: segment group and atomic parallelism, to solve these problems. Atomic parallelism captures the flexible reduction semantics to systematically analyze the optimization space of sparse-dense hybrid algebra on GPU. It is a new optimization technique beyond current compiler-based approaches and open-source runtime libraries. Segment group elevates the flexible reduction semantics to suitable levels of abstraction in the sparse compilation theory. It adopts a changeable group size and a user-defined reduction strategy to solve challenges (1) and (2), respectively. Finally, we use GPU sparse matrix-matrix multiplication (SpMM) in the TACO compiler as a use case to demonstrate the effectiveness of segment group in reduction semantics elevation. We achieve up to \(1.2\,\times\) speedup over the original TACO’s SpMM kernels. We also apply new optimization techniques found by atomic parallelism to dgSPARSE, an open-source state-of-the-art SpMM library, and achieve \(1.6\,\times \sim 2.3\,\times\) speedup on the algorithm tuned with atomic parallelism.
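
To make the granularity trade-off concrete, below is a minimal CUDA sketch (not the paper's generated code) of two extremes of GPU reduction semantics for SpMM \(C = A \times B\): per-thread global atomics versus a warp-wide shuffle reduction. The kernel names, the CSR layout, and the suggested launch shapes are illustrative assumptions only.

    // Minimal sketch: SpMM with A in CSR (rowptr, colind, vals), B dense
    // row-major with N columns, and C zero-initialized. Illustrative only.

    // Finest synchronization granularity: every thread resolves its partial
    // product with a global atomic. Launch e.g. grid(M, (N+7)/8), block(32, 8).
    __global__ void spmm_atomic(const int *rowptr, const int *colind,
                                const float *vals, const float *B, float *C,
                                int M, int N) {
      int row = blockIdx.x;                             // one sparse row per block
      int col = blockIdx.y * blockDim.y + threadIdx.y;  // dense column of B
      if (row >= M || col >= N) return;
      for (int p = rowptr[row] + threadIdx.x; p < rowptr[row + 1]; p += blockDim.x)
        atomicAdd(&C[row * N + col], vals[p] * B[colind[p] * N + col]);
    }

    // Coarser granularity: the 32 lanes of a warp cooperate on one (row, col)
    // pair, combine partial sums with shuffles, and issue a single plain store.
    // Launch e.g. grid(M, N), block(32).
    __global__ void spmm_warp_reduce(const int *rowptr, const int *colind,
                                     const float *vals, const float *B, float *C,
                                     int M, int N) {
      int row = blockIdx.x, col = blockIdx.y;
      if (row >= M || col >= N) return;
      float acc = 0.f;
      for (int p = rowptr[row] + threadIdx.x; p < rowptr[row + 1]; p += 32)
        acc += vals[p] * B[colind[p] * N + col];
      for (int off = 16; off > 0; off >>= 1)            // warp-level tree reduction
        acc += __shfl_down_sync(0xffffffff, acc, off);
      if (threadIdx.x == 0) C[row * N + col] = acc;
    }

Which variant, or which intermediate group size between them, performs best depends on the sparsity pattern and the number of dense columns; this is the kind of trade-off that atomic parallelism is meant to analyze systematically and that segment group exposes to the compiler through a changeable group size.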



Notes

  1. We build on commit d0654a8 https://github.com/zhang677/taco/tree/d0654a84137169883973c40a951dfdb89883fd9c.

  2. We do not integrate these macro instructions into TACO, because doing so is straightforward and purely an engineering effort. When testing the kernels, we simply replace the atomicAdd with the new macro instructions; an illustrative sketch of such a substitution follows these notes. We open-source the modified TACO at https://github.com/zhang677/taco/tree/parallelreduction.

  3. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cooperative-groups.

  4. The authors of Senanayake et al. (2020) shared their code with us. We also use a similar code base to test our kernels in Sect. 7.

  5. https://docs.nvidia.com/nsight-compute/NsightCompute/index.html.

  6. We open source the testing code at https://github.com/zhang677/segTACO.

  7. https://github.com/dgSPARSE.

  8. We open source our implementation at https://github.com/dgSPARSE/dgSPARSE-Library/commit/9e3e4c18f40e76b97a805b8a9733258f7e9edeb6.
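
As an illustration of the substitution described in note 2, a macro of the following shape could stand in for a generated atomicAdd call and lower either to the plain global atomic or to a cooperative-groups tile reduction (cf. note 3). This is a hedged sketch: the macro names, the helper function, and the group size of 8 are hypothetical, not TACO's generated code.

    #include <cooperative_groups.h>
    #include <cooperative_groups/reduce.h>   // cg::reduce, requires CUDA 11+
    namespace cg = cooperative_groups;

    // Strategy A: keep the generated code's plain global atomic.
    #define REDUCE_ADD_ATOMIC(dst, val) atomicAdd((dst), (val))

    // Strategy B: let a tile of G threads combine their values first, so only
    // one atomic per tile reaches global memory. Only valid when all G threads
    // of the tile accumulate into the same destination address.
    template <int G>
    __device__ void reduce_add_group(float *dst, float val) {
      auto tile = cg::tiled_partition<G>(cg::this_thread_block());
      float sum = cg::reduce(tile, val, cg::plus<float>());
      if (tile.thread_rank() == 0) atomicAdd(dst, sum);
    }
    #define REDUCE_ADD_GROUP(dst, val) reduce_add_group<8>((dst), (val))

    // Hypothetical use inside a generated kernel body:
    //   REDUCE_ADD_GROUP(&C[row * N + col], partial);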

References

  • Asgari, B., Hadidi, R., Cao, J., Lim, S.-K., Kim, H., et al.: Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 908–920 (2021). IEEE

  • Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1–11 (2009)

  • Bell, N., Dalton, S., Olson, L.N.: Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J. Sci. Comput. 34(4), 123–152 (2012)


  • Bik, A.J., Koanantakool, P., Shpeisman, T., Vasilache, N., Zheng, B., Kjolstad, F.: Compiler support for sparse tensor computations in mlir. arXiv:2202.04305 (2022)

  • Bik, A.J., Wijshoff, H.A.: Compilation techniques for sparse matrix computations. In: Proceedings of the 7th International Conference on Supercomputing, pp. 416–424 (1993)

  • Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al.: Tvm: An automated end-to-end optimizing compiler for deep learning. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594 (2018)

  • Chou, S., Kjolstad, F., Amarasinghe, S.: Format abstraction for sparse tensor algebra compilers. Proc. ACM Program. Lang. 2(OOPSLA), 123:1–123:30 (2018). https://doi.org/10.1145/3276493

  • Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)


  • Dai, G., Huang, G., Yang, S., Yu, Z., Zhang, H., Ding, Y., Xie, Y., Yang, H., Wang, Y.: Heuristic adaptability to input dynamics for spmm on gpus. arXiv:2202.08556 (2022)

  • Feng, S., Hou, B., Jin, H., Lin, W., Shao, J., Lai, R., Ye, Z., Zheng, L., Yu, C.H., Yu, Y., et al.: Tensorir: An abstraction for automatic tensorized program optimization. arXiv:2207.04296 (2022)

  • Guennebaud, G., Jacob, B., et al.: Eigen v3. http://eigen.tuxfamily.org (2010)

  • Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 30 (2017)

  • Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: Eie: efficient inference engine on compressed deep neural network. ACM SIGARCH Comput. Archit. News 44(3), 243–254 (2016)


  • Hidayetoğlu, M., Pearson, C., Mailthody, V.S., Ebrahimi, E., Xiong, J., Nagi, R., Hwu, W.-m.: At-scale sparse deep neural network inference with efficient gpu implementation. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2020). IEEE

  • Hong, C., Sukumaran-Rajam, A., Nisa, I., Singh, K., Sadayappan, P.: Adaptive sparse tiling for sparse matrix multiplication. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp. 300–314 (2019)

  • Huang, G., Dai, G., Wang, Y., Yang, H.: Ge-spmm: general-purpose sparse matrix-matrix multiplication on gpus for graph neural networks. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2020). IEEE

  • Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv:1609.02907 (2016)

  • Kjolstad, F., Ahrens, P., Kamil, S., Amarasinghe, S.: Tensor algebra compilation with workspaces. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 180–192 (2019)

  • Kjolstad, F., Kamil, S., Chou, S., Lugato, D., Amarasinghe, S.: The tensor algebra compiler. Proc. ACM Program. Lang. 1(OOPSLA), 77:1–77:29 (2017). https://doi.org/10.1145/3133901

  • Kjolstad, F.: Sparse tensor algebra compilation. Ph.d. thesis, Massachusetts Institute of Technology, Cambridge, MA (2020). http://tensor-compiler.org/files/kjolstad-phd-thesis-taco-compiler.pdf

  • Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)


  • Kurt, S.E., Raje, S., Sukumaran-Rajam, A., Sadayappan, P.: Sparsity-aware tensor decomposition. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 952–962 (2022). IEEE

  • Lin, C.-Y., Luo, L., Ceze, L.: Accelerating spmm kernel with cache-first edge sampling for graph neural networks. arXiv:2104.10716 (2021)

  • Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814 (2015)

  • Mehrabi, A., Lee, D., Chatterjee, N., Sorin, D.J., Lee, B.C., O’Connor, M.: Learning sparse matrix row permutations for efficient spmm on gpu architectures. In: 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 48–58 (2021). IEEE

  • Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010)

  • Nisa, I., Li, J., Sukumaran-Rajam, A., Vuduc, R., Sadayappan, P.: Load-balanced sparse mttkrp on gpus. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 123–133 (2019). IEEE

  • Popoola, T., Shankar, R., Rift, A., Singh, S., Davis, E.C., Strout, M.M., Olschanowsky, C.: An object-oriented interface to the sparse polyhedral library. In: 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1825–1831 (2021). IEEE

  • Qin, E., Garg, R., Bambhaniya, A., Pellauer, M., Parashar, A., Rajamanickam, S., Hao, C., Krishna, T.: Enabling flexibility for sparse tensor acceleration via heterogeneity. arXiv:2201.08916 (2022)

  • Senanayake, R., Hong, C., Wang, Z., Wilson, A., Chou, S., Kamil, S., Amarasinghe, S., Kjolstad, F.: A sparse iteration space transformation framework for sparse tensor algebra. Proc. ACM Program. Lang. 4(OOPSLA) (2020). https://doi.org/10.1145/3428226

  • Shantharam, M., Srinivasmurthy, S., Raghavan, P.: Characterizing the impact of soft errors on iterative methods in scientific computing. In: Proceedings of the International Conference on Supercomputing, pp. 152–161 (2011)

  • Strout, M.M., Hall, M., Olschanowsky, C.: The sparse polyhedral framework: composing compiler-generated inspector-executor code. Proc. IEEE 106(11), 1921–1934 (2018)


  • Venkat, A., Hall, M., Strout, M.: Loop and data transformations for sparse matrix code. ACM SIGPLAN Not. 50(6), 521–532 (2015)


  • Wang, Z., Wohlwend, J., Lei, T.: Structured pruning of large language models. arXiv:1910.04732 (2019)

  • Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., Wang, Y.: Intel math kernel library. In: High-Performance Computing on the Intel® Xeon Phi\(^{{\rm TM}}\), pp. 167–188. Springer, Cham (2014)

  • Xin, J., Ye, X., Zheng, L., Wang, Q., Huang, Y., Yao, P., Yu, L., Liao, X., Jin, H.: Fast sparse deep neural network inference with flexible spmm optimization space exploration. In: 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2021). IEEE

  • Yang, C., Buluç, A., Owens, J.D.: Design principles for sparse matrix multiplication on the gpu. In: European Conference on Parallel Processing, pp. 672–687 (2018). Springer

  • Ye, Z., Lai, R., Shao, J., Chen, T., Ceze, L.: Sparsetir: composable abstractions for sparse compilation in deep learning

  • Yu, Z., Dai, G., Huang, G., Wang, Y., Yang, H.: Exploiting online locality and reduction parallelism for sampled dense matrix multiplication on gpus. In: 2021 IEEE 39th International Conference on Computer Design (ICCD), pp. 567–574 (2021). IEEE

  • Yuster, R., Zwick, U.: Detecting short directed cycles using rectangular matrix multiplication and dynamic programming. In: SODA, vol. 4, pp. 254–260 (2004). Citeseer


Author information


Corresponding authors

Correspondence to Guohao Dai or Yu Wang.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, G., Zhao, Y., Tao, Y. et al. Sgap: towards efficient sparse tensor algebra compilation for GPU. CCF Trans. HPC 5, 210–227 (2023). https://doi.org/10.1007/s42514-023-00140-4


  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s42514-023-00140-4

Keywords