HSCONN: Hardware-Software Co-Optimization of Self-Attention Neural Networks for Large Language Models

Published: 12 June 2024
DOI: 10.1145/3649476.3658709

Abstract

Self-attention models excel in natural language processing and computer vision by capturing contextual information, but they face several challenges, including costly data movement, quadratic computational complexity, and excessive memory accesses. Sparse attention techniques have emerged as a solution; however, their irregular or regular sparsity patterns, coupled with costly data pre-processing, diminish their hardware efficiency. This paper introduces HSCONN, an energy-efficient hardware accelerator for self-attention that mitigates these computational and memory overheads. HSCONN employs dynamic voltage and frequency scaling (DVFS) and exploits dynamic sparsity in matrix multiplication to optimize energy efficiency. The approach combines a row-wise pruning algorithm with independent voltage/frequency islands for the processing elements, exploiting additional sparsity to reduce memory accesses and overall energy consumption. Experiments on natural language processing workloads show speedups of 1952× and 615× and energy reductions of up to 820× and 113× over CPU and GPU architectures, respectively. Compared to A3, SpAtten, and Sanger, HSCONN demonstrates superior speedup (1.71×, 1.25×, 1.47×) and higher energy efficiency (1.5×, 1.7×, 1.4×).
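
Only the abstract is available on this page, so the sketch below is a rough, software-level illustration of what pruning attention scores row by row can look like, assuming a simple top-k keep rule per query row. The function name, the keep_ratio parameter, and the selection criterion are illustrative assumptions rather than HSCONN's published algorithm, and the DVFS/voltage-island side of the design has no software analogue here.

```python
# Hypothetical, software-level sketch of row-wise pruned self-attention.
# NOTE: the top-k keep rule per query row is an illustrative assumption,
# not HSCONN's published algorithm or hardware dataflow.
import torch


def rowwise_pruned_attention(q: torch.Tensor,
                             k: torch.Tensor,
                             v: torch.Tensor,
                             keep_ratio: float = 0.25) -> torch.Tensor:
    """Self-attention that keeps only the highest-scoring keys in each row.

    q, k, v: (seq_len, d) tensors for a single head.
    keep_ratio: assumed hyperparameter, fraction of keys kept per query row.
    """
    d = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5        # (seq, seq) attention scores
    n_keep = max(1, int(keep_ratio * scores.shape[-1]))  # keys retained per row

    # Row-wise pruning: keep the n_keep largest scores in each row and mask the
    # rest to -inf so they vanish after the softmax. A hardware accelerator
    # would skip these pruned entries instead of computing and discarding them.
    keep_idx = torch.topk(scores, n_keep, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, keep_idx, 0.0)

    probs = torch.softmax(scores + mask, dim=-1)          # pruned entries -> 0
    return probs @ v                                      # (seq_len, d) output


# Tiny usage example: one head over a 128-token sequence.
q, k, v = (torch.randn(128, 64) for _ in range(3))
print(rowwise_pruned_attention(q, k, v).shape)  # torch.Size([128, 64])
```

In a software run such as this, the masked entries are still computed before being discarded; per the abstract, the point of a dedicated accelerator like HSCONN is to avoid those computations and the corresponding memory accesses in hardware, which is how the row-wise pruning reduces memory traffic and energy.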

References

[1] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 7432–7439.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. 2020. A3: Accelerating attention mechanisms in neural networks with approximation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 328–341.
[5] Weixiong Jiang, Heng Yu, Jiale Zhang, Jiaxuan Wu, Shaobo Luo, and Yajun Ha. 2020. Optimizing energy efficiency of CNN-based object detection with dynamic voltage and frequency scaling. Journal of Semiconductors 41, 2 (2020), 022406.
[6] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
[7] Liqiang Lu, Yicheng Jin, Hangrui Bi, Zizhang Luo, Peng Li, Tao Wang, and Yun Liang. 2021. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture. 977–991.
[8] Seyed Morteza Nabavinejad, Hassan Hafez-Kolahi, and Sherief Reda. 2019. Coordinated DVFS and precision control for deep neural networks. IEEE Computer Architecture Letters 18, 2 (2019), 136–140.
[9] Seyed Morteza Nabavinejad, Sherief Reda, and Masoumeh Ebrahimi. 2022. Coordinated batching and DVFS for DNN inference on GPU accelerators. IEEE Transactions on Parallel and Distributed Systems 33, 10 (2022), 2496–2508. https://doi.org/10.1109/TPDS.2022.3144614
[10] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).
[11] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics 9 (2021), 53–68.
[12] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[14] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
[15] Hanrui Wang, Zhekai Zhang, and Song Han. 2021. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 97–110.
[16] Qiang Wang and Xiaowen Chu. 2020. GPGPU performance estimation with core and memory frequency scaling. IEEE Transactions on Parallel and Distributed Systems 31, 12 (2020), 2865–2881.
[17] Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2017. Large-scale cloze test dataset created by teachers. arXiv preprint arXiv:1711.03225 (2017).
[18] Zheqi Yu, Pedro Machado, Adnan Zahid, Amir M Abdulghani, Kia Dashtipour, Hadi Heidari, Muhammad A Imran, and Qammer H Abbasi. 2020. Energy and performance trade-off optimization in heterogeneous computing via reinforcement learning. Electronics 9, 11 (2020), 1812.

      Published In

      GLSVLSI '24: Proceedings of the Great Lakes Symposium on VLSI 2024
      June 2024
      797 pages
      ISBN:9798400706059
      DOI:10.1145/3649476

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Domain Specific Accelerator
      2. Hardware and Software Codesign
      3. Large Language Models

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      GLSVLSI '24: Great Lakes Symposium on VLSI 2024
      June 12–14, 2024
      Clearwater, FL, USA

      Acceptance Rates

      Overall Acceptance Rate 312 of 1,156 submissions, 27%
