Abstract
Multithreading is at the core of mainstream heterogeneous programming methods such as CUDA and OpenCL. However, multithreaded parallel programming requires programmers to handle low-level runtime details, making the programming process complex and error-prone. This paper presents NoT, a high-level no-threading programming method. It introduces the association structure, a new language construct that provides a declarative, runtime-free expression of different forms of data parallelism and avoids the use of multithreading. NoT defines a C-like syntax for the association structure, and we implement a compiler and runtime system that use OpenCL as an intermediate language. We demonstrate the effectiveness of our techniques on multiple benchmarks. NoT code is comparable in size to the corresponding serial code and is far smaller than the reference OpenCL code of the benchmarks. The compiler generates efficient OpenCL code, yielding performance competitive with, or equal to, that of the manually optimized benchmark OpenCL code on both a GPU platform and an MIC platform.
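To make the contrast drawn in the abstract concrete, the sketch below shows the same data-parallel vector addition written two ways: as the plain serial C loop a programmer starts from, and as a hand-written OpenCL C kernel in the multithreaded style that NoT is designed to hide. The NoT association-structure syntax itself is defined in the body of the paper and is not reproduced here; the function names (vadd_serial, vadd) and the example as a whole are illustrative only.

/* Serial C: the computation the programmer actually cares about. */
void vadd_serial(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* OpenCL C: the same computation as a multithreaded kernel,
   with one work-item (thread) per array element. */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c,
                   const int n) {
    int i = get_global_id(0);   /* this work-item's global index */
    if (i < n)                  /* guard when the global size is rounded up */
        c[i] = a[i] + b[i];
}

The kernel is only part of the OpenCL program: the host side must also perform platform and device discovery, context and command-queue creation, buffer allocation and transfer, and kernel launch configuration. These are the low-level runtime details the abstract refers to, which the NoT compiler and runtime system are described as handling on the programmer's behalf.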
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2017YFB0202002) and the National Natural Science Foundation of China (Grant No. 61572394).
About this article
Cite this article
Wu, S., Dong, X., Zhang, X. et al. NoT: a high-level no-threading parallel programming method for heterogeneous systems. J Supercomput 75, 3810–3841 (2019). https://doi.org/10.1007/s11227-019-02749-1