research-article

Open access

Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories

Authors:

Max Ruttenberg,

Dai Cheol Jung,

Dustin Richmond,

Michael Taylor,

Christopher Batten

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3

Pages 46 - 58

https://doi.org/10.1145/3582016.3582020

Published: 25 March 2023

Abstract

Manycore architectures integrate hundreds of cores on a single chip by using simple cores and simple memory systems usually based on software-managed scratchpad memories (SPMs). However, such architectures are notoriously challenging to program, since the programmers need to manually manage all aspects of data movement and synchronization for both correctness and performance. We argue that this manycore programmability challenge is one of the key barriers to achieving the promise of manycore architectures. At the same time, the dynamic task parallel programming model is enjoying considerable success in addressing the programmability challenge of multi-core processors with tens of complex cores and hardware cache coherence.

Conventional wisdom suggests a work-stealing runtime, which forms the core of most dynamic task parallel programming models, is ill-suited for manycore architectures. In this work, we demonstrate that a work-stealing runtime is not just feasible on manycore architectures with SPMs, but such a runtime can also significantly improve the performance of irregular workloads when executing on these architectures. We also explore three optimizations that allow the runtime to leverage unused SPM space for further performance benefit. Our dynamic task parallel programming framework achieves 1.2–28.5× speedup on workloads that benefit from our techniques, and only induces minimal overhead for workloads that do not.
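To make the scheduling idea concrete, the following is a minimal, single-threaded simulation of generic work stealing. It is not the runtime described in this paper and it does not model scratchpad memories. The names `run_work_stealing` and `fib_children`, the lockstep round structure, and the random victim selection are all assumptions chosen for illustration. Each simulated worker takes the newest task from its own queue, and an idle worker tries to steal the oldest task from a randomly chosen victim.

```python
from collections import deque
import random


def fib_children(n):
    # Hypothetical irregular workload: task n spawns tasks n-1 and n-2,
    # which yields an unbalanced task tree.
    return [n - 1, n - 2] if n >= 2 else []


def run_work_stealing(root_tasks, expand, num_workers=4, seed=0):
    """Simulate work stealing in lockstep rounds (illustrative only).

    expand(task) returns the list of child tasks spawned by `task`.
    Returns (tasks executed per worker, number of successful steals).
    """
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_workers)]
    queues[0].extend(root_tasks)  # all work starts on worker 0
    executed = [0] * num_workers
    steals = 0
    while any(queues):  # stop once every queue is empty
        for w in range(num_workers):
            if queues[w]:
                # The owner takes the newest task from the tail of its queue.
                task = queues[w].pop()
            else:
                # An idle worker picks a random victim and tries to steal.
                victim = rng.randrange(num_workers)
                if victim == w or not queues[victim]:
                    continue  # failed attempt; try again next round
                # The thief takes the oldest task from the head of the queue.
                task = queues[victim].popleft()
                steals += 1
            executed[w] += 1
            queues[w].extend(expand(task))
    return executed, steals


if __name__ == "__main__":
    executed, steals = run_work_stealing([10], fib_children)
    # The task tree rooted at 10 has 177 nodes regardless of the schedule.
    print(sum(executed), executed, steals)
```

Although all work starts on one worker, stealing spreads the unbalanced task tree across the others without any static partitioning, which is the property that makes this style of scheduling attractive for irregular workloads.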

Cited By

Gavioli F, Brilli G, Burgio P, Bertozzi D. (2024). Adaptive Localization for Autonomous Racing Vehicles with Resource-Constrained Embedded Platforms. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-6. Online publication date: 25-Mar-2024. https://doi.org/10.23919/DATE58400.2024.10546748

Jung D, Ruttenberg M, Gao P, Davidson S, Petrisko D, Li K, Kamath A, Cheng L, Xie S, Pan P, Zhao Z, Yue Z, Veluri B, Muralitharan S, Sampson A, Lumsdaine A, Zhang Z, Batten C, Oskin M, Richmond D, Taylor M. (2024). Scalable, Programmable and Dense: The HammerBlade Open-Source RISC-V Manycore. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 770-784. Online publication date: 29-Jun-2024. https://doi.org/10.1109/ISCA59077.2024.00061

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms

Published In

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3

March 2023

820 pages

ISBN: 9781450399180

DOI: 10.1145/3582016

General Chair: Tor M. Aamodt (University of British Columbia, Canada)

Program Chairs: Natalie Enright Jerger (University of Toronto, Canada), Michael Swift (University of Wisconsin-Madison, USA)

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution 4.0 International License.

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS '23: 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3

March 25 - 29, 2023

Vancouver, BC, Canada

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
