Open access

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

Authors: William S. Moses, Ivan R. Ivanov, Johannes Doerfert, Oleksandr Zinenko

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

Pages 119 - 134

https://doi.org/10.1145/3572848.3577475

Published: 21 February 2023

Abstract

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model.

We propose an alternative approach, based on Polygeist/MLIR, that automatically translates programs written in one programming model (CUDA) into another (CPU threads). Our approach includes a representation of parallel constructs that lets conventional compiler transformations apply transparently and without modification, and that enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU, achieving a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, which uses transpiled CUDA PyTorch kernels, outperforms the PyTorch CPU native backend by 2.7×.
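To make the idea of running a CUDA-style kernel on CPU threads concrete, the following is a minimal sketch of one well-known lowering strategy, not the paper's implementation. It assumes a toy kernel in which each block copies its chunk of the input into block-shared memory, waits at a barrier, and then writes the chunk out reversed. Blocks are distributed over a thread pool, the threads of a block become a serial loop, and the barrier is handled by splitting that loop in two. The names `run_block` and `launch` are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_block(block_idx, block_dim, data_in, data_out):
    """Execute one emulated CUDA block serially on a CPU thread."""
    shared = [0.0] * block_dim          # stands in for __shared__ memory
    base = block_idx * block_dim

    # Phase 1: everything each GPU thread does before the barrier.
    for t in range(block_dim):
        shared[t] = data_in[base + t]

    # The barrier becomes the boundary between the two loops: all
    # "threads" have finished phase 1 before any of them starts phase 2.

    # Phase 2: everything each GPU thread does after the barrier.
    for t in range(block_dim):
        data_out[base + t] = shared[block_dim - 1 - t]


def launch(grid_dim, block_dim, data_in):
    """Emulate a kernel launch: blocks are independent, so they run in parallel."""
    data_out = [0.0] * (grid_dim * block_dim)
    with ThreadPoolExecutor() as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda b: run_block(b, block_dim, data_in, data_out),
                      range(grid_dim)))
    return data_out
```

For example, `launch(2, 4, [0, 1, 2, 3, 4, 5, 6, 7])` returns `[3, 2, 1, 0, 7, 6, 5, 4]`: each block of four elements is reversed independently. Python is used here only for readability; a compiler would emit the equivalent loop structure in its own intermediate representation.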

Index Terms

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models

Information & Contributors

Information

Published In

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

February 2023

480 pages

ISBN:9798400700156

DOI:10.1145/3572848

General Chair: Maryam Mehri Dehnavi (University of Toronto)

Program Chairs: Milind Kulkarni (Purdue University), Sriram Krishnamoorthy (Google)

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Department of Energy
Advanced Scientific Computing Research Program
Los Alamos National Laboratory
Exascale Computing Project
United States Air Force Artificial Intelligence Accelerator
Japan Society for the Promotion of Science

Conference

PPoPP '23: The 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

February 25 - March 1, 2023

Montreal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics & Citations

Article Metrics

Total Citations: 7
Total Downloads: 2,122
Downloads (Last 12 months): 1,305
Downloads (Last 6 weeks): 171

Reflects downloads up to 16 Nov 2024

Citations

Cited By

Han R, Chen J, Garg B, Zhou X, Lu J, Young J, Sim J, Kim H (2024). CuPBoP: Making CUDA a Portable Language. ACM Transactions on Design Automation of Electronic Systems 29:4 (1-25). Online publication date: 21-Jun-2024.
https://dl.acm.org/doi/10.1145/3659949

Drescher F, Engelke A, Rodríguez G, Sadayappan P, Sukumaran-Rajam A (2024). Fast Template-Based Code Generation for MLIR. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction (1-12). Online publication date: 17-Feb-2024.
https://dl.acm.org/doi/10.1145/3640537.3641567

Ivanov I, Zinenko O, Domke J, Endo T, Moses W, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). Retargeting and Respecializing GPU Workloads for Performance Portability. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (119-132). Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444828

Pitchanathan A, Cohen A, Zinenko O, Grosser T (2024). Strided Difference Bound Matrices. Computer Aided Verification (279-302). Online publication date: 24-Jul-2024.
https://dl.acm.org/doi/10.1007/978-3-031-65627-9_14

Meyer J, Alpay A, Hack S, Fröning H, Heuveline V (2023). Implementation Techniques for SPMD Kernels on CPUs. Proceedings of the 2023 International Workshop on OpenCL (1-12). Online publication date: 18-Apr-2023.
https://dl.acm.org/doi/10.1145/3585341.3585342

Putra M, Kim I, Gunawi H, Grossman R (2023). CNT: Semi-Automatic Translation from CWL to Nextflow for Genomic Workflows. 2023 IEEE 23rd International Conference on Bioinformatics and Bioengineering (BIBE) (22-27). Online publication date: 4-Dec-2023.
https://doi.org/10.1109/BIBE60311.2023.00012

Wang H, Chen J, Xie C, Liu S, Wang Z, Shen Q, Zhao Y (2023). MLIRSmith: Random Program Generation for Fuzzing MLIR Compiler Infrastructure. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (1555-1566). Online publication date: 11-Sep-2023.
https://doi.org/10.1109/ASE56229.2023.00120
