DOI: 10.1145/3582016.3582020
research-article
Open access

Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories

Published: 25 March 2023

Abstract

Manycore architectures integrate hundreds of cores on a single chip by using simple cores and simple memory systems usually based on software-managed scratchpad memories (SPMs). However, such architectures are notoriously challenging to program, since the programmers need to manually manage all aspects of data movement and synchronization for both correctness and performance. We argue that this manycore programmability challenge is one of the key barriers to achieving the promise of manycore architectures. At the same time, the dynamic task parallel programming model is enjoying considerable success in addressing the programmability challenge of multi-core processors with tens of complex cores and hardware cache coherence.
Conventional wisdom suggests a work-stealing runtime, which forms the core of most dynamic task parallel programming models, is ill-suited for manycore architectures. In this work, we demonstrate that a work-stealing runtime is not just feasible on manycore architectures with SPMs, but such a runtime can also significantly improve the performance of irregular workloads when executing on these architectures. We also explore three optimizations that allow the runtime to leverage unused SPM space for further performance benefit. Our dynamic task parallel programming framework achieves 1.2–28.5× speedup on workloads that benefit from our techniques, and only induces minimal overhead for workloads that do not.
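For readers unfamiliar with work stealing, the following is a minimal sequential sketch of the general scheduling idea only. It is not the runtime described in the paper: it does not model scratchpad memories, real concurrency, or the three optimizations mentioned above. The names `make_fib_task` and `run_work_stealing` are hypothetical and chosen for illustration. Each simulated worker pops its newest task from its own queue, and an idle worker steals the oldest task from another worker's queue.

```python
from collections import deque


def make_fib_task(n):
    """Return a task that records a leaf value or spawns two child tasks."""
    def task(own_queue, leaves):
        if n < 2:
            leaves.append(n)  # leaf: contributes fib(0) = 0 or fib(1) = 1
        else:
            # Spawn children onto the executing worker's own queue.
            own_queue.append(make_fib_task(n - 1))
            own_queue.append(make_fib_task(n - 2))
    return task


def run_work_stealing(root, num_workers):
    """Round-robin simulation of work stealing (illustrative assumption,
    not the paper's implementation)."""
    queues = [deque() for _ in range(num_workers)]
    executed = [0] * num_workers
    leaves = []
    steals = 0
    queues[0].append(root)

    while any(queues):  # some queue still holds a task
        for wid, own in enumerate(queues):
            if own:
                task = own.pop()  # owner takes its newest task (LIFO)
            else:
                # Idle worker: pick the first non-empty queue as the victim.
                victim = next((q for q in queues if q), None)
                if victim is None:
                    continue  # nothing to steal in this round
                task = victim.popleft()  # thief takes the oldest task (FIFO)
                steals += 1
            task(own, leaves)
            executed[wid] += 1

    return sum(leaves), steals, executed
```

Calling `run_work_stealing(make_fib_task(10), 4)` spreads an irregular task tree across four simulated workers without any static partitioning. The sum of the leaves equals the Fibonacci number, which gives a simple correctness check for the schedule.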


Cited By

  • (2024) Adaptive Localization for Autonomous Racing Vehicles with Resource-Constrained Embedded Platforms. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1-6. DOI: 10.23919/DATE58400.2024.10546748. Online publication date: 25-Mar-2024
  • (2024) Scalable, Programmable and Dense: The HammerBlade Open-Source RISC-V Manycore. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 770-784. DOI: 10.1109/ISCA59077.2024.00061. Online publication date: 29-Jun-2024


      Published In

      ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
      March 2023
      820 pages
      ISBN: 9781450399180
      DOI: 10.1145/3582016
      This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Manycore architecture
      2. fine-grained threading
      3. load-balancing
      4. parallel programming
      5. scratchpad memory

      Conference

      ASPLOS '23

      Acceptance Rates

      Overall Acceptance Rate 535 of 2,713 submissions, 20%

      Article Metrics

      • Downloads (last 12 months): 726
      • Downloads (last 6 weeks): 67
      Reflects downloads up to 14 Nov 2024
