
Data Structures for Graphics Processing Units (GPUs)

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 July 2025 | Viewed by 2043

Special Issue Editors


Dr. Byunghyun Jang
Guest Editor
Department of Computer and Information Science, University of Mississippi, University, MS 38677, USA
Interests: hardware architecture and compilers for parallel and heterogeneous processors; GPU computing (GPGPU); CPU-GPU heterogeneous computing

Prof. Dr. Juan A. Gómez-Pulido
Guest Editor

Special Issue Information

Dear Colleagues,

This Special Issue will delve into the innovative and rapidly evolving field of data structures tailored to graphics processing units (GPUs). GPUs, originally designed for rendering graphics, have emerged as powerful parallel processors, revolutionizing computational tasks across diverse domains. This Special Issue will explore the development and optimization of data structures that leverage the parallel processing capabilities of GPUs to achieve significant performance enhancements.

Contributors to this Special Issue will present cutting-edge research into a variety of GPU-optimized data structures, including, but not limited to, stacks, queues, trees, graphs, hash tables, and priority queues. The articles will highlight novel approaches to memory management, data access patterns, and algorithmic modifications that harness the massive parallelism of GPUs. Furthermore, this Special Issue will address practical challenges, such as synchronization, load balancing, and efficient data transfer between CPU and GPU memory.
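
To make one of these challenges concrete, the following minimal CUDA sketch (illustrative only, not drawn from any contribution in this Special Issue) overlaps CPU-GPU data transfer with kernel work using pinned host memory and streams; all names in it are hypothetical:

    // Minimal sketch: overlapping host-device copies with kernel execution.
    #include <cuda_runtime.h>

    __global__ void process(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;  // stand-in for real per-element work
    }

    int main() {
        const int n = 1 << 20, chunks = 4, chunk = n / chunks;
        float *h_buf, *d_buf;
        cudaMallocHost(&h_buf, n * sizeof(float));  // pinned memory enables async copies
        cudaMalloc(&d_buf, n * sizeof(float));

        cudaStream_t s[chunks];
        for (int c = 0; c < chunks; ++c) {
            cudaStreamCreate(&s[c]);
            float* h = h_buf + c * chunk;
            float* d = d_buf + c * chunk;
            // Work in stream c overlaps with copies issued in the other streams.
            cudaMemcpyAsync(d, h, chunk * sizeof(float), cudaMemcpyHostToDevice, s[c]);
            process<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d, chunk);
            cudaMemcpyAsync(h, d, chunk * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();
        for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }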

By featuring both theoretical advancements and practical implementations, this Special Issue will bridge the gap between traditional CPU-centric data structures and their GPU-optimized counterparts. Readers will gain insights into the latest techniques for maximizing GPU performance, making this Special Issue an essential resource for researchers and practitioners seeking to exploit the full potential of GPU computing for data-intensive applications.

Dr. Byunghyun Jang
Prof. Dr. Juan A. Gómez-Pulido
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • GPU computing
  • concurrent data structures
  • GPGPU
  • parallel computing

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (3 papers)


Research

30 pages, 1684 KiB  
Article
Efficient GPU Implementation of the McMurchie–Davidson Method for Shell-Based ERI Computations
by Haruto Fujii, Yasuaki Ito, Nobuya Yokogawa, Kanta Suzuki, Satoki Tsuji, Koji Nakano, Victor Parque and Akihiko Kasagi
Appl. Sci. 2025, 15(5), 2572; https://doi.org/10.3390/app15052572 - 27 Feb 2025
Viewed by 217
Abstract
Quantum chemistry offers the formal machinery to derive molecular and physical properties arising from (sub)atomic interactions. However, as molecules of practical interest are largely polyatomic, contemporary approximation schemes such as the Hartree–Fock scheme are computationally expensive due to the large number of electron repulsion integrals (ERIs). Central to the Hartree–Fock method is the efficient computation of ERIs over Gaussian functions (GTO-ERIs). Here, the well-known McMurchie–Davidson method (MD) offers an elegant formalism by incrementally extending Hermite Gaussian functions and auxiliary tabulated functions. Although the MD method offers a high degree of versatility to acceleration schemes through Graphics Processing Units (GPUs), the current GPU implementations limit the practical use of supported values of the azimuthal quantum number. In this paper, we propose a generalized framework capable of computing GTO-ERIs for arbitrary azimuthal quantum numbers, provided that the intermediate terms of the MD method can be stored. Our approach benefits from extending the MD recurrence relations through shells, batches, and triple-buffering of the shared memory, and ordering similar ERIs, thus enabling the effective parallelization and use of GPU resources. Furthermore, our approach proposes four GPU implementation schemes considering the suitable mappings between Gaussian basis and CUDA blocks and threads. Our computational experiments involving the GTO-ERI computations of molecules of interest on an NVIDIA A100 Tensor Core GPU (NVIDIA, Santa Clara, CA, USA) have revealed the merits of the proposed acceleration schemes in terms of computation time, including up to a 72× improvement over our previous GPU implementation and up to a 4500× speedup compared to a naive CPU implementation, highlighting the effectiveness of our method in accelerating ERI computations for both monatomic and polyatomic molecules. Our work has the potential to explore new parallelization schemes of distinct and complex computation paths involved in ERI computation.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
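
As a rough illustration of the triple-buffering idea mentioned in the abstract, the CUDA sketch below keeps only three rotating batches of recurrence values in shared memory instead of the full intermediate tensor. The kernel name, BATCH_SIZE, and the recurrence body are placeholders, not the authors' implementation; the real McMurchie–Davidson relations combine several lower-order terms:

    // Hedged sketch: batch k of the recurrence reads only batches k-1 and k-2,
    // so three shared-memory buffers, rotated modulo 3, suffice.
    #include <cuda_runtime.h>

    #define BATCH_SIZE 128  // assumed batch width; launch with blockDim.x == BATCH_SIZE

    __global__ void md_recurrence_sketch(const double* base_values, double* out, int K) {
        __shared__ double buf[3][BATCH_SIZE];  // three rotating batch buffers
        int t = threadIdx.x;

        buf[0][t] = base_values[t];  // batch 0: tabulated base terms (e.g., Boys function)
        __syncthreads();

        for (int k = 1; k <= K; ++k) {
            // The buffer being overwritten (k mod 3) holds a batch that is no
            // longer needed; only the two newest batches are read.
            double a = buf[(k - 1) % 3][t];
            double b = buf[(k - 1) % 3][(t + 1) % BATCH_SIZE];  // neighboring term
            double c = (k >= 2) ? buf[(k - 2) % 3][t] : 0.0;
            buf[k % 3][t] = 0.5 * a + 0.25 * b + 0.25 * c;      // placeholder body
            __syncthreads();  // make batch k visible before it feeds batch k+1
        }
        out[blockIdx.x * BATCH_SIZE + t] = buf[K % 3][t];  // out sized to grid
    }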
Figures:

Figure 1. Example of the configuration of two basis functions χ1 and χ2 (M = 2): (a) the quartet combinations, and (b) the symmetry-based combinations for Basis-ERIs when M = 2. By considering symmetrical relations, the number of Basis-ERIs can be reduced, as shown by the upper triangular matrix in (b).
Figure 2. Example of the relation between the Basis-ERIs and the GTO-ERIs. The row (column) directions represent the bra (ket) Basis-ERIs. Each cell in the upper triangular matrix corresponds to a single Basis-ERI.
Figure 3. Basic idea behind the definition of shell-based ERIs. The term (ss| implies that a bra consists of two s-shells, and the term |sp) implies that a ket consists of one s-shell and one p-shell. The integral (ss|sp) consists of three GTO-ERIs: [ss|sp_x], [ss|sp_y], and [ss|sp_z].
Figure 4. Basic idea of the dependencies (denoted by arrows) behind computing the values of the corresponding recurrences R using batch concepts when K = 4. The values required by the MD method are highlighted in red.
Figure 5. Basic idea of the computation of R values for each batch using triple-buffering of the shared memory.
Figure 6. Comparison of the required size of shared memory to store R values.
Figure 7. Basic idea of the parallel thread assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BBM.
Figure 8. Basic idea of the parallel thread assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BTM.
Figure 9. Parallel thread assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in SBM.
Figure 10. Basic idea behind the parallel thread assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in STM.
Figure 11. Schematic of the 64-bit key for sorting the Basis-ERIs.
20 pages, 899 KiB  
Article
Boundary-Aware Concurrent Queue: A Fast and Scalable Concurrent FIFO Queue on GPU Environments
by Md. Sabbir Hossain Polak, David A. Troendle and Byunghyun Jang
Appl. Sci. 2025, 15(4), 1834; https://doi.org/10.3390/app15041834 - 11 Feb 2025
Viewed by 387
Abstract
This paper presents Boundary-Aware Concurrent Queue (BACQ), a high-performance queue designed for modern GPUs, which focuses on high concurrency in massively parallel environments. BACQ operates at the warp level, leveraging intra-warp locality to improve throughput. A key to BACQ’s design is its ability to replace conflicting accesses to shared data with independent accesses to private data. It uses a ticket-based system to ensure fair ordering of operations and supports infinite growth of the head and tail across its ring buffer. The leader thread of each warp coordinates enqueue and dequeue operations, broadcasting offsets for intra-warp synchronization. BACQ dynamically adjusts operation priorities based on the queue’s state, especially as it approaches boundary conditions such as overfilling the buffer. It also uses a virtual caching layer for intra-warp communication, reducing memory latency. Rigorous benchmarking results show that BACQ outperforms the BWD (Broker Queue Work Distributor), the fastest known GPU queue, by more than 2× while preserving FIFO semantics. The paper demonstrates BACQ’s superior performance through real-world empirical evaluations.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
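
The warp-level pattern the abstract describes, in which one leader thread issues a single global atomic for the whole warp and broadcasts the claimed offset with a shuffle, can be sketched as follows. This is a simplified reading of the design: the names (RingBuffer, enqueue_warp) are hypothetical, and BACQ's ticket priorities and boundary handling are omitted:

    // Hedged sketch: warp-aggregated enqueue with a leader-thread atomic.
    #include <cstdint>
    #include <cuda_runtime.h>

    struct RingBuffer {
        int*               slots;     // payload storage with `capacity` entries
        unsigned long long tail;      // monotonically growing enqueue counter
        unsigned int       capacity;  // ring size
    };

    __device__ void enqueue_warp(RingBuffer* q, int value) {
        unsigned mask   = __activemask();        // lanes taking part in this enqueue
        int      lane   = threadIdx.x & 31;
        int      leader = __ffs(mask) - 1;       // lowest active lane leads
        int      count  = __popc(mask);          // items this warp enqueues

        unsigned long long base = 0;
        if (lane == leader)                      // one global atomic per warp, not per lane
            base = atomicAdd(&q->tail, (unsigned long long)count);
        base = __shfl_sync(mask, base, leader);  // broadcast the base offset to the warp

        // Each lane's slot: base plus its rank among the active lanes.
        int rank = __popc(mask & ((1u << lane) - 1));
        unsigned long long ticket = base + rank;
        q->slots[ticket % q->capacity] = value;  // the real design first checks the head
                                                 // boundary so the ring cannot overfill
    }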
Figures:

Figure 1. Circular buffer queue structure with a size of N. "Head % queue_size" points to the next dequeue element, and "Tail % queue_size" points to the next enqueue element.
Figure 2. Only four warps run on each SM at a single moment.
Figure 3. Leader thread aggregates the global atomic calls from each warp.
Figure 4. Leader thread is selected from only the active threads (threads in the gray box).
Figure 5. Throughput of CAS and FAA operations in a contended scenario (8 billion simultaneous requests).
Figure 6. Memory hierarchy of the GPU.
Figure 7. Shuffle semantics of CUDA.
Figure 8. BACQ local tail calculation using the virtual cache layer.
Figure 9. Broken FIFO semantics without tickets (buffer size is 2).
Figure 10. FIFO semantics with tickets (buffer size is 2); Index = (Head or Tail) % Buffer Size.
Figure 11. Close-to-empty queue boundary, when head and tail are at the same index.
Figure 12. Close-to-full queue boundary, when head and tail are at the same index.
Figure 13. Fair publication order for an enqueue request.
Figure 14. Time difference between enqueue-only and dequeue-only requests.
Figure 15. Performance of the enqueue–dequeue pair (X = number of requests, Y = BOPS (billion ops/s)).
Figure 16. Performance of the enqueue–dequeue mix (50:50) (X = number of requests, Y = BOPS (billion ops/s)).
Figure 17. Performance of the enqueue–dequeue mix (70:30) (X = number of requests, Y = BOPS (billion ops/s)).
Figure 18. Performance of the enqueue–dequeue mix (40:60) (X = number of requests, Y = BOPS (billion ops/s)).
Figure 19. Comparison of BFS algorithm performance using the broker queue and BACQ (X = number of edges, Y = elapsed time (s)).
21 pages, 6218 KiB  
Article
Multi-GPU Acceleration for Finite Element Analysis in Structural Mechanics
by David Herrero-Pérez and Humberto Martínez-Barberá
Appl. Sci. 2025, 15(3), 1095; https://doi.org/10.3390/app15031095 - 22 Jan 2025
Viewed by 860
Abstract
This work evaluates the computing performance of finite element analysis in structural mechanics using modern multi-GPU systems. Using multiple GPUs for scientific computing allows us to avoid the memory limitations usually encountered when using a single GPU device for many-core computing. We use a GPU-aware MPI approach, implementing a suitable smoothed aggregation multigrid to precondition an iterative distributed conjugate gradient solver for GPU computing. We evaluate the performance and scalability of different models, problem sizes, and computing resources. We take an efficient multi-core implementation as the reference to assess the computing performance of the numerical results. The numerical results show the advantages and limitations of using distributed many-core architectures to address structural mechanics problems.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
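
The GPU-aware MPI pattern underlying such a distributed conjugate gradient solver can be sketched as follows: device pointers are handed directly to MPI calls, so global reductions and subdomain halo exchanges avoid staging through host memory. Function names and structure here are illustrative assumptions, not the authors' code:

    // Hedged sketch of GPU-aware (CUDA-aware) MPI in a distributed CG iteration.
    #include <mpi.h>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Global dot product: each rank holds its slice of the distributed vectors
    // in device memory; partial results are reduced across ranks.
    double distributed_dot(cublasHandle_t handle, const double* d_x, const double* d_y,
                           int local_n, MPI_Comm comm) {
        double local = 0.0, global = 0.0;
        cublasDdot(handle, local_n, d_x, 1, d_y, 1, &local);  // local partial product
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

    // With a CUDA-aware MPI build, halo values move directly between device
    // buffers (d_send/d_recv were allocated with cudaMalloc), no host copy needed.
    void exchange_halo(double* d_send, double* d_recv, int n, int neighbor, MPI_Comm comm) {
        MPI_Sendrecv(d_send, n, MPI_DOUBLE, neighbor, /*sendtag=*/0,
                     d_recv, n, MPI_DOUBLE, neighbor, /*recvtag=*/0,
                     comm, MPI_STATUS_IGNORE);
    }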
Figures:

Figure 1. Multi-GPU node architecture.
Figure 2. Multi-GPU system with four Nvidia TITAN V GPUs.
Figure 3. Simply supported beam experiment with a hole: (a) geometric configuration and boundary conditions, and (b) mesh parameterization and partitioning into four subdomains.
Figure 4. Single-arch dam experiment: (a) geometric configuration, (b) boundary conditions, and (c) mesh parameterization and partitioning into four subdomains.
Figure 5. L-shaped cantilever experiment: (a) geometric configuration, (b) boundary conditions, and (c) mesh parameterization and partitioning into four subdomains.
Figure 6. Simply supported beam experiment: (a) wall-clock time, (b) device memory, and speedup from (c) one and (d) eight MPI processes.
Figure 7. Single-arch dam experiment: (a) wall-clock time, (b) device memory, and speedup from (c) one and (d) eight MPI processes.
Figure 8. L-shaped cantilever experiment: (a) wall-clock time, (b) device memory, and speedup from (c) one and (d) eight MPI processes.
Figure 9. Wall-clock time of the setup and solving stages using GPU computing.