However, all these techniques use the inefficient multi-level RPTs as the baseline. Although it is possible to use Page Walk Caches (PWCs) [7, 9, 10, 21] to replace these sequential pointer-chasing operations with an MMU cache lookup, PWCs do not work effectively for emerging memory-intensive workloads with large memory footprints. In this paper, we propose the Fixed-Size HPT (FS-HPT), which aims to tackle the fundamental limitation of current RPT-based GPU virtual memory. FS-HPT adopts Hashed Page Tables (HPTs) instead of RPTs for the GPU page table. Since HPTs use a hash function to calculate a hash value and directly use this as the table index, the HPT lookup requires only a single memory reference (if there is no hash collision). This fast lookup mechanism of HPTs can eliminate the expensive pointer-chasing memory accesses required by conventional RPTs.

Figure 1: Baseline GPU architecture
Several recent works have proposed HPTs in the CPU domain [50, 51, 54]. State-of-the-art HPT designs [50, 51] reduce address translation overheads by an average of 41%. They elaborately designed the HPTs to effectively manage hash collisions, one of the critical challenges when using HPTs [54]. However, prior HPT designs are primarily optimized for CPUs, so it is not straightforward to simply apply those designs to GPUs. They necessitate adjustments in the page table size to manage hash collisions, which requires migrating Page Table Entries (PTEs) and frequently relies on OS support. The PTE migrations can significantly increase memory allocation time, and frequent OS interventions decrease GPU performance [29]. In addition, the state-of-the-art HPT design issues multiple memory requests simultaneously for a page table walk to exploit memory-level parallelism. This can lead to performance degradation since GPUs are more bandwidth-sensitive than CPUs.

We exploit two key observations to efficiently adopt HPTs in GPU virtual memory subsystems. First, there is an upper bound on the number of PTEs the GPU's page table can hold, because the page table is responsible primarily for maintaining the PTEs of pages currently stored in the GPU memory. Second, the remote mappings stored in the GPU page table account for only a small portion of total entries. Recent GPUs allow direct access to remote pages in other devices [16, 47, 53]. To support direct remote access, the GPU's page table needs to hold the PTEs for both local and remote pages, which increases the load factor as the page table stores more remote mappings. However, we found that only a few remote mappings will be accessed again, because frequently accessed remote mappings become local mappings.

Motivated by these two key observations, the FS-HPT framework employs a fixed-size hashed page table whose size is sufficiently large to mostly avoid hash collisions, thereby eliminating dynamic resizing. FS-HPT uses open addressing to handle hash collisions during PTE allocation and stores the open-addressing steps in a step table to enable fast PTE accesses during address translation. Accessing the step table can incur additional memory references, as the table exists in the device's memory. To address this issue, we repurpose the existing MMU cache (PWCs) as a step cache for storing recently accessed step table entries. We also propose a remote PTE eviction policy to keep the occupancy of the HPT (i.e., load factor) below a target level. With this remote PTE eviction policy, the GPU driver [39] evicts rarely-used remote mappings from the page table and stores them in a victim buffer implemented as a radix tree structure for scalability.

The main contributions of this paper are as follows:

• We propose a novel framework, called FS-HPT, to improve GPU virtual memory. FS-HPT employs a fixed-size hashed page table to enable rapid page table lookup. In most cases, FS-HPT can retrieve page table entries with a single memory reference by using a step table and step cache. To the best of our knowledge, this paper is the first work to study the effectiveness of using HPTs as the GPU's page table.
• We address a limitation regarding remote mappings by employing a remote PTE eviction policy and victim buffer.
• We evaluate the performance of FS-HPT using a detailed GPU simulator. For irregular memory-intensive workloads, FS-HPT and FS-HPT integrated with the state-of-the-art page table walk technique outperform RPTs by an average of 27.8% and 61.7%, respectively.

2 Background and Motivation

2.1 GPU Virtual Memory

Recent GPU systems have adopted virtual memory [2, 36, 43, 44, 47] to share a virtual memory space between GPUs and CPUs. There are two typical ways to enable shared virtual memory in GPU systems. One is sharing a page table between the CPUs and GPUs via I/O Memory Management Unit (IOMMU) hardware [43, 44]. This approach is typically applied to integrated GPUs that share the main memory with CPUs [48, 49]. The second way is to let CPUs and GPUs manage their own page tables separately. The CPU page tables hold the entire translation information, while the GPUs retain only certain parts of the address translation information in their distinct (local) page tables [32]. This organization is beneficial when GPUs have their own distinct memory (VRAM) with a GMMU, since GPUs can then perform address translation without communication with the CPUs (via the IOMMU) [29, 30, 32, 45]. In this paper, we select the latter approach (i.e., address translation using a GMMU and distinct GPU page tables) as the baseline architecture.
… large memory footprints. These workloads frequently incur TLB misses, leading to numerous page table walks. In addition, due to the inherent parallelism of GPUs, these workloads typically generate multiple simultaneous page table walk requests, putting significant pressure on the cache hierarchy [43, 44, 48, 49]. Caching these frequently and simultaneously generated page table walk requests within a small PWC structure is difficult, resulting in more than one memory access per page table walk.

Figure 4: Miss rate of level-2 PWC

Figure 4 shows the miss rates of a level-2 PWC for various benchmarks. Since each entry in the level-4 and level-3 PWCs covers a 512GB and 1GB logical region, respectively, the hit rate for these levels is almost 100%. Therefore, we focus on the miss rate of a level-2 PWC. The benchmarks and simulation environment are described in Section 5.1. We categorize the benchmarks based on their level-2 PWC miss rates. Workloads with a level-2 PWC miss rate exceeding 20% are categorized as irregular, showing an average miss rate of 33.8%. In contrast, workloads with a level-2 PWC miss rate of 20% or less are categorized as regular, with an average miss rate of 0.4%. To measure the impact of PWC misses on performance, we compare the performance of a four-level RPT and an ideal PWC configuration, where every page table walk hits in the PWCs. Figure 5 shows the speedup of an ideal PWC over RPT. Since irregular workloads suffer from frequent PWC misses, the ideal PWC shows a speedup of 29% over RPT. This result demonstrates that reducing PWC misses can significantly enhance overall performance.

Figure 5: Performance of four-level RPT and ideal PWC

One might suggest that scaling TLBs or PWCs could mitigate page table walk overheads in RPTs. However, since TLBs and PWCs serve as SRAM caches, they are hard to scale in terms of latency and area [29, 50, 54]. Additionally, with the upcoming five-level RPTs, it becomes even more challenging to scale them [24, 28, 50, 51, 54].

2.3 Hashed Page Tables

To reduce the substantial overheads associated with address translation using RPTs, previous studies have explored Hashed Page Tables (HPTs). HPTs have been implemented as an alternative page table structure in some commercial CPUs [18–20, 50, 51, 54], since HPTs do not need sequential page table walks.

Figure 3b illustrates the basic structure of HPTs. A hash function takes all bits of the VPN as input, and the output (a hash value) is used as an offset into the page table. We can get the physical address of the PTE for the requested page by adding the hash value to the PTBR value. As HPTs can directly generate the PTE's physical address, they can potentially offer faster address translation compared to RPTs, which may require multiple memory references. Therefore, by using HPTs, we can achieve significant performance gains comparable to the ideal PWCs.
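To make the lookup concrete, the sketch below (ours; the slot layout and names are illustrative rather than a real hardware format) shows how a walker can turn a VPN into a PTE address and use the stored hash key to detect a collision:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t vpn_tag;  /* the hash key stored in the slot */
    uint64_t pte;      /* frame number and permission bits */
    int      valid;
} hpt_slot_t;

/* Basic HPT lookup: a single memory reference when there is no collision. */
hpt_slot_t *hpt_lookup(hpt_slot_t *ptbr, size_t num_slots, uint64_t vpn)
{
    uint64_t hash = vpn % num_slots;  /* hash value, used directly as offset */
    hpt_slot_t *slot = &ptbr[hash];   /* PTE address = PTBR + hash value     */
    if (slot->valid && slot->vpn_tag == vpn)
        return slot;                  /* stored hash key matches the VPN     */
    return NULL;                      /* empty slot, or a collision          */
}
```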
However, due to the inherent nature of hash functions, distinct hash keys (i.e., the inputs of the hash function) may yield identical hash values, known as a hash collision. In a general hash table, a hash key is a unique identifier derived from the processed data. In HPTs, the VPN tag serves as the hash key. For precise translation, we must store the hash key in each hash slot and compare the hash key with the input virtual address's VPN tag. If the hash key and VPN tag do not match, we must handle this collision. Handling hash collisions adds complexity to HPT design, so managing these collisions is one of the key challenges of HPTs.

Hashed page tables in the CPU domain: State-of-the-art HPT techniques [50, 51] have been proposed to resolve the hash collision issue efficiently. Skarlatos et al. [50] proposed an Elastic Cuckoo Page Table (ECPT) that prevents frequent hash collisions by resizing the page table during program execution. The ECPT can operate effectively with OS support and outperforms RPTs. However, this approach requires resizing the page table during program execution, which is very cumbersome for GPUs. Resizing the page table necessitates rehashing all of its entries, leading to migration of the page table entries and requiring frequent support from the OS. These PTE migrations and frequent OS interventions prolong the time it takes to handle PTE allocation within the GPU domain [1].

2.4 Motivation: Why Are HPTs a Good Choice for GPU Page Tables?

Dynamic resizing of the page table is not necessary: The reasons are twofold. First, the local page table primarily holds the (local) mappings for the pages stored in GPU memory, allowing the load factor¹ of the hashed page table to be maintained without resizing the table. A lower load factor reduces the likelihood of hash collisions. As described in Section 2.1, since the CPU's page table maintains any PTEs evicted from the GPU, the GPU's page table only needs to store valid PTEs of the pages currently stored in the device memory. Thus, the number of valid PTEs in the GPU's page table reaches a saturation point when the device memory is almost full, obviating the need for resizing the page table.

Second, the number of live remote mappings in the GPU's local page table becomes saturated, and thereby the page table's load factor does not keep increasing even when it stores remote mappings. Recent GPUs support remote access to host memory or peer GPU memory by storing remote mappings in the GPU's local page table [3, 36]. In particular, recent NVIDIA GPUs [36, 38] use a page access counter to prevent page migration of rarely accessed pages. With the access counter, the GPU driver stores remote mappings in the GPU's page table and migrates only the pages whose access counter value reaches a certain threshold. Even if this mechanism effectively avoids the migration of rarely accessed pages [30], it can increase the load factor of the page table, as the GPU must store remote mappings, causing more hash collisions.

¹ The load factor [14] of a hash table is a metric indicating the ratio of occupied entries to the total number of slots available in the table.
Figure 6: The number of live remote mappings of BFS [34]

Figure 7: Average ratio of live remote PTEs to total page table entries

Figure 8: Overall architecture of the FS-HPT framework
However, we have found that a GPU's page table does not need to store all remote mappings. Figure 6 shows the liveness of remote mappings during the execution of BFS [34]. The "liveness" of a remote mapping is defined as the duration from the initial access (first touch) of the remote mapping to its final access (last touch). We measure liveness using the access counter-based page migration [39]. As shown in the figure, the number of live remote mappings decreases during program execution. This is because frequently accessed remote mappings (hot remote PTEs) quickly become local PTEs as the corresponding pages are migrated to the GPU memory, and the infrequently accessed remote mappings (cold remote PTEs) tend to fall into disuse after their initial period of activity. Figure 7 shows the average proportion of live remote mappings relative to the total number of page table entries for all the benchmarks. The ratio of the live remote mappings ranges from 0% to 20% across the benchmarks (with an average of 0.16%). These results indicate that the number of live remote mappings does not gradually increase in the GPU's page table throughout the execution of the programs.

The page table occupies a small portion of GPU memory: For a 4KB base page size, the associated PTE size is eight bytes [37]. Therefore, GPUs that use RPTs require only 0.2% of total GPU memory to store PTEs for pages that reside in GPU memory. If we adopt HPTs for the GPU's page table, we need a larger page table than when using RPTs to avoid hash collisions. Since the page table size for storing all the mappings of the pages in GPU memory accounts for a small portion of the GPU memory, even if we increase the page table size, the area overhead is negligible. For example, a page table that is 2.5x larger (for a target load factor of 0.4) than RPTs requires only 0.5% of total GPU memory.

3 Fixed-Size Hashed Page Table

3.1 Overview

Key idea - Adopting HPTs as the GPU's page table structure: As described in Section 2.2, RPTs incur significant performance degradation, and they also show poor scalability with increasing address space due to the sequential table lookup process required by their hierarchical structure. HPTs are a viable alternative due to their ability to quickly look up the desired page table entry using a hash function. With the observation that GPUs do not need to resize page tables dynamically, we apply HPTs to the GPU memory system using a fixed-size page table. In this section, we present the Fixed-Size Hashed Page Table (FS-HPT), a novel GPU virtual memory framework that adopts Hashed Page Tables (HPTs) to address the fundamental limitation of the conventional page table structure (i.e., RPTs).

Challenges and solutions: In order to adopt HPTs in GPU architectures, we need to address two key challenges. First, hash collisions could occur even if we set a large page table size, so we have to resolve them efficiently. To handle hash collisions, we adopt open addressing, as it is simple to implement in the GPU driver and GMMU at a low cost [14]. Hash collision handling with the open-addressing scheme does lead to multiple memory accesses due to the sequential probing scheme. We tackle this issue with a step table and step cache (Sections 3.3-3.5). With these, FS-HPT can look up the page table with only one memory access.

Second, remote mappings can increase the load factor of the page table. As described in Section 2.4, live remote mappings occupy a tiny fraction of total PTEs. Thus, if we store only hot remote mappings in the page table, we can reduce hash collisions. To this end, we propose a remote PTE eviction policy, which strategically evicts rarely-used remote mappings from the page table (Section 3.6). However, if an evicted mapping is used in the future, it will cause a costly page fault. Thus, we also introduce the victim buffer to hold evicted remote mappings in the device memory (Section 3.7). The GMMU first looks up the step cache and step table, and if there are no matching entries, it accesses the victim buffer to get the requested remote mappings.

Figure 8 depicts the overall architecture of our FS-HPT framework. The device memory contains the hashed page table, step table, and victim buffer. In the GMMU, we repurpose the remaining PWCs as the step cache and add a hash generator. When an L2 TLB miss occurs, the page table walker first accesses the step cache (a). If there is no entry in the step cache, the walker accesses the step table and fills the step cache (b). After that, the walker looks up the page table with the value derived from the step table entry (c). If there is no matching entry in the step table, the walker accesses the victim buffer to get the remote mappings (d). Finally, if there are no matching mappings in the victim buffer, the GMMU incurs a page fault, and the GPU driver populates the GPU's page table by referencing an HPT Map (e). If there is no space for storing the new PTE, the driver evicts a remote mapping from the page table using a remote PTE LRU counter maintained by the driver (f).
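In code, this walk corresponds roughly to the control flow below (a sketch of steps (a)-(f) only; the extern helpers are hypothetical stand-ins for the GMMU and driver actions):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint8_t step; } step_entry_t;
typedef struct { uint64_t pfn; } pte_t;

extern step_entry_t *step_cache_lookup(uint64_t va);          /* (a) */
extern step_entry_t *step_table_lookup_and_fill(uint64_t va); /* (b) */
extern pte_t *hpt_access(uint64_t va, uint8_t step);          /* (c) */
extern pte_t *victim_buffer_lookup(uint64_t va);              /* (d) */
extern pte_t *driver_handle_fault(uint64_t va);               /* (e), (f) */

pte_t *fs_hpt_walk(uint64_t va)
{
    step_entry_t *se = step_cache_lookup(va);     /* (a) step cache first    */
    if (!se)
        se = step_table_lookup_and_fill(va);      /* (b) then the step table */
    if (se)
        return hpt_access(va, se->step);          /* (c) one HPT access      */
    pte_t *pte = victim_buffer_lookup(va);        /* (d) evicted remote PTEs */
    if (pte)
        return pte;
    return driver_handle_fault(va);  /* (e) HPT Map; (f) remote PTE eviction */
}
```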
Figure 9: HPT entry format

Figure 10: PTE allocation with HPT Map
3.2 HPT Entry Format

The naive implementation of HPTs shown in Figure 3b causes some problems. First, there will be a notable decrease in the spatial locality of the PTEs [50, 54]. This is because allocating a PTE for each HPT entry does not ensure the spatial locality of PTEs at cacheline granularity. Second, since the GPU driver evicts pages at a 2MB Virtual Address Block granularity [1, 27], evicting pages from the GPU incurs 512 lookups in the hashed page table to remove the PTEs for the evicted pages. Finally, it is difficult to support various page sizes simultaneously because the number of address bits used as a VPN tag varies depending on the page size. For example, if a GPU uses 4KB and 2MB page sizes at the same time, the GMMU has to access the page table twice with different VPN tags: VA[47:12] for the 4KB page size and VA[47:21] for the 2MB page size. This dual-access approach is necessary because the page size cannot be determined from a given VA alone.

To address these problems, we propose a new HPT entry format, shown in Figure 9. We cluster the 512 PTEs of contiguous pages in a single hash table entry so that each entry represents a contiguous 2MB memory region. The PTEs in an HPT entry are indexed with a PTE offset obtained from VA[20:12]. This design ensures locality among PTEs and requires only a single HPT lookup during page eviction. In addition, with this HPT entry format, the VPN tag remains the same across all supported page sizes (i.e., 4KB, 64KB, and 2MB [37]). Therefore, the virtual address is translated using only one access to the table, even when various page sizes are supported in the GPU virtual memory.
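Assuming the 8-byte PTEs mentioned in Section 2.4 [37], the proposed entry can be sketched as follows (ours; field names are illustrative):

```c
#include <stdint.h>

/* One HPT entry covers a contiguous, 2MB-aligned virtual region, so one     */
/* lookup serves 4KB, 64KB, and 2MB pages with the same VPN tag (VA[47:21]). */
typedef struct {
    uint64_t vpn_tag;   /* VA[47:21]: identifies the 2MB region         */
    uint64_t pte[512];  /* 512 PTEs for the 4KB pages within the region */
} hpt_entry_t;

static inline uint64_t vpn_tag(uint64_t va)    { return va >> 21; }           /* VA[47:21] */
static inline uint64_t pte_offset(uint64_t va) { return (va >> 12) & 0x1FF; } /* VA[20:12] */
```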
3.3 Hash Collision Resolution

There are several methods to resolve hash collisions, such as chaining [14], open addressing [14], and cuckoo hashing [40]. Among these, we employ the open addressing scheme, which is simple and thus can be implemented effectively in the GMMU and the GPU driver. When inserting a new key-value pair, a hash function calculates an index based on the key. If the entry associated with the calculated index is empty, the key-value pair is inserted. If the entry is already occupied by a different key (indicating a collision), the open addressing scheme seeks an alternative empty entry for the key using a specific probing sequence. In our implementation, the table index of the next candidate entry for the key is calculated using the stride and step. During the iterative probing for an empty entry, the driver increments the step until an empty one is found (new index = hash value + stride × step). The stride is pre-determined by the GPU driver when creating the page table.
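A minimal sketch of this allocation-time probing (ours; the hash function and occupancy tracking are simplified, and the loop assumes a free slot exists):

```c
#include <stdint.h>
#include <stdbool.h>

/* new index = hash value + stride * step; the driver increments step   */
/* until a free slot is found (FS-HPT caps step at eight, Section 3.5). */
uint64_t probe_free_slot(const bool *occupied, uint64_t num_entries,
                         uint64_t hash_value, uint64_t stride)
{
    uint64_t step = 0;
    while (occupied[(hash_value + stride * step) % num_entries])
        step++;
    return (hash_value + stride * step) % num_entries;
}
```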
3.4 PTE Allocation

When the GPU requires data pages that are not currently in GPU memory, it causes a page fault, and the GPU driver on the CPU side allocates the appropriate PTEs and migrates the data pages to the GPU. To deal with PTE allocation, we introduce a software-defined data structure called the HPT Map in the driver. The HPT Map maintains allocation information on the HPT stored in device memory; each HPT entry has a corresponding entry in the HPT Map. As illustrated in Figure 10, an HPT Map entry consists of a VPN tag, a valid bit, and a deleted bit. The VPN tag is used to verify whether the accessed entry is the correct one for the given VA, and the valid bit indicates whether the corresponding HPT entry is valid. The deleted bit indicates that the corresponding HPT entry is no longer valid but is part of a probing sequence. In open addressing, simply deleting an entry and leaving it empty (invalid) can disrupt the continuity of the probing sequence. The deleted bit addresses this challenge. Note that the driver can reuse a deleted entry to allocate a new HPT entry because its valid bit is invalid. In other words, if there is a deleted entry before an invalid entry is found, the driver allocates the new HPT entry in the same place as the deleted entry.

Algorithm 1 describes the PTE allocation process. First, the GPU driver accesses the HPT Map entry pointed to by the index calculated by adding the hash value and the HPT Map base address (line 4). If the accessed entry is valid, the driver compares the new PTE's VPN tag with the tag in the accessed HPT Map entry (lines 5-6). If the tags match, the driver allocates a PTE at the existing HPT entry (line 7). Otherwise, the driver increments the step and accesses the next entry of the HPT Map (line 10). If the deleted bit of the accessed entry is set, the driver increments the step and accesses the next entry of the HPT Map (line 16). To reuse a deleted entry, the driver keeps the address of the first accessed deleted entry (lines 13-14). If the accessed entry is invalid, it means there are no allocated HPT entries for the requested VA. Then, the driver checks whether oversubscription has occurred or whether the number of HPT entries has reached the threshold, to guarantee enough space for the data page and PTEs. If oversubscription occurs, the driver evicts a local HPT entry (lines 18-19). If the number of HPT entries has reached the threshold, the driver evicts a remote entry (lines 20-21). Remote PTE eviction will be described in Section 3.6. Before allocating a new HPT entry, the driver replaces the current address with the deleted address if it exists (lines 23-24). Finally, the driver allocates a new HPT entry and a new PTE (lines 26-27).

Algorithm 1: PTE allocation
…
14:       del_addr ← current_addr    ▷ Save first deleted entry address
15:     end if
16:     step ← step + 1              ▷ Deleted entry
17:   else if not valid then
18:     if oversubscription then
19:       Page eviction
20:     else if current # entries ≥ max. # entries then
21:       Remote PTE eviction
22:     end if
23:     if del_addr ≠ null then
24:       current_addr ← del_addr    ▷ Reuse deleted entry
25:     end if
26:     valid ← true                 ▷ Allocate new HPT entry
27:     Allocate PTE
28:     break
29:   end if
30: end while
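The recovered part of Algorithm 1 corresponds roughly to the C sketch below (our paraphrase; the probe setup from the algorithm's elided opening lines is assumed, and the eviction calls are left as comments):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { uint64_t vpn_tag; bool valid; bool deleted; } hpt_map_entry_t;

size_t allocate_hpt_entry(hpt_map_entry_t *map, size_t num_entries,
                          uint64_t hash, uint64_t stride, uint64_t vpn_tag,
                          size_t cur_entries, size_t max_entries, bool oversub)
{
    size_t del_idx = SIZE_MAX;                 /* first deleted entry seen */
    for (uint64_t step = 0;; step++) {         /* eviction guarantees room */
        size_t idx = (size_t)((hash + stride * step) % num_entries);
        hpt_map_entry_t *e = &map[idx];
        if (e->valid) {
            if (e->vpn_tag == vpn_tag)
                return idx;                    /* line 7: entry exists     */
            /* tag mismatch: keep probing (line 10) */
        } else if (e->deleted) {
            if (del_idx == SIZE_MAX)
                del_idx = idx;                 /* lines 13-14              */
            /* keep probing (line 16) */
        } else {                               /* line 17: invalid entry   */
            if (oversub) {
                /* lines 18-19: page eviction (evict a local HPT entry) */
            } else if (cur_entries >= max_entries) {
                /* lines 20-21: remote PTE eviction (Section 3.6) */
            }
            if (del_idx != SIZE_MAX)
                idx = del_idx;                 /* lines 23-24: reuse slot  */
            map[idx] = (hpt_map_entry_t){ .vpn_tag = vpn_tag, .valid = true };
            return idx;                        /* lines 26-27              */
        }
    }
}
```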
Figure 11: PTE access with step table

3.5 PTE Access with Step Table

Step table: The page table walker can access the HPT with a single memory reference if there are no hash collisions. However, if hash collisions occur, the walker needs to perform a probing sequence, necessitating multiple sequential memory accesses to the HPT. To tackle this issue with open addressing, we employ a step table, which is designed to store the open-addressing step for each HPT entry. When the GPU driver populates the GPU's page table, it also stores the step value in the step table. The step table resides in device memory and is implemented as a hash table.

Figure 11 shows the design of the step table. A step table entry stores a tag (VA[47:25]) and the open-addressing steps for 16 HPT entries in a contiguous 32MB region. We limit the maximum open-addressing step value to eight, so each step table entry comprises 23 bits for the tag and an additional 48 bits (16×3 bits) for the step values of the 16 HPT entries. This setup guarantees that every entry in the step table can represent a 32MB region using just 9 bytes of information. Given the compact size of each entry, we configure the step table's size to achieve a low load factor of 0.01, which notably diminishes the frequency of hash collisions in the step table. Despite maintaining this low load factor, a step table of this size requires only 0.001% of device memory capacity.
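A packing sketch for this 9-byte entry (ours; the struct shows the logical layout, not the exact hardware encoding):

```c
#include <stdint.h>

/* 23-bit tag (VA[47:25]) + 16 x 3-bit steps (48 bits) = 71 bits in 9 bytes. */
typedef struct {
    uint32_t tag;    /* low 23 bits used: VA[47:25]        */
    uint64_t steps;  /* low 48 bits used: 16 x 3-bit steps */
} step_table_entry_t;

/* VA[24:21] selects which of the 16 covered 2MB regions this access hits. */
static inline uint8_t get_step(const step_table_entry_t *e, uint64_t va)
{
    unsigned slot = (unsigned)((va >> 21) & 0xF);
    return (uint8_t)((e->steps >> (3 * slot)) & 0x7);
}
```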
However, when using the step table, the page table walker accesses memory twice for every HPT lookup: once for the step table and once more for the HPT. To address this issue, we repurpose the existing PWCs as a step cache that stores recently accessed step table entries. As illustrated in Figure 11, the step cache is designed as a direct-mapped cache structure. It has 32 entries, each comprising an 18-bit tag derived from VA[47:30] of the virtual address and 16 step values (48 bits = 16×3). Each entry of the step cache is designated for the 16 HPT entries allocated to a 32MB contiguous region. As a result, the page table walker accesses the step cache before accessing the step table. Since the step cache covers a 1GB region, its hit rate is nearly 100%, ensuring that PTEs are mostly retrievable with only a single memory access. It is worth noting that the step cache substantially outperforms PWCs in terms of area efficiency, as it covers a 2MB region using only 3 bits, while a PWC covers a 2MB region using 64 bits.

PTE access: Figure 11 also shows the page table access process. For each page table walk request, the page table walker first looks up the step cache. If there is no matching entry in the step cache, the walker accesses the step table with an index calculated by adding the value in the Step Table Base Register (STBR) and the hash value from VA[47:25], and then fills the step cache. If there is no matching entry in the step table, the GMMU incurs a page fault. Otherwise, if there is a matching entry, the walker accesses the HPT by adding the value in the Page Table Base Register (PTBR), the hash value from VA[47:21], and step × stride. The GPU driver pre-determines the stride value, while the step value is derived from the step cache. Finally, the walker gets the requested PTE from the HPT entry using the PTE offset (VA[20:12]) field.
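Putting the registers and bit fields together, the walker's index arithmetic can be sketched as follows (ours; the modulo hash is a stand-in for the real hash function, and indices are in units of table entries):

```c
#include <stdint.h>

/* Step table index: STBR + hash(VA[47:25]) */
uint64_t step_table_index(uint64_t stbr, uint64_t va, uint64_t st_slots)
{
    return stbr + (va >> 25) % st_slots;
}

/* HPT index: PTBR + hash(VA[47:21]) + step * stride */
uint64_t hpt_index(uint64_t ptbr, uint64_t va, uint64_t hpt_slots,
                   uint64_t step, uint64_t stride)
{
    return ptbr + ((va >> 21) % hpt_slots + step * stride) % hpt_slots;
}

/* PTE offset within the selected HPT entry: VA[20:12] */
uint64_t pte_slot(uint64_t va) { return (va >> 12) & 0x1FF; }
```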
3.6 Remote PTE Eviction Policy

Remote mappings can mitigate page migration overheads and page thrashing issues in discrete GPU systems [16, 30, 47, 53]. However, if the number of remote mappings stored in the GPU's page table increases, its load factor can also keep increasing, leading to frequent hash collisions in the page allocation process. As described in Section 2.4, live remote mappings occupy only a small fraction of the total page table capacity during program execution. Consequently, by storing only the live (or hot) remote mappings in the page table, we can effectively prevent the load factor from continually increasing.

Figure 12 shows the concept of the remote PTE eviction policy we propose to constrain the load factor of FS-HPT. The upper side of the figure shows the page eviction policy current GPU drivers already use [15, 39]. When oversubscription occurs, the GPU driver selects a victim page from the front of the Least Recently Used (LRU) list and evicts it (lines 18-19 in Algorithm 1). The LRU list is updated every time a page fault is handled.
Figure 12: Page replacement and remote PTE eviction
We extend this replacement policy to evict remote mappings from the GPU's page table. The lower side of the figure shows the proposed remote PTE eviction policy. When allocating a new HPT entry, if the load factor of the GPU's page table reaches the designated load factor threshold, the driver selects a victim entry from the front of the LRU list and evicts that entry from the page table. The GPU driver can update the LRU list since the GMMU tracks the access frequency of each remote mapping [16, 47, 53]. Also, equation (1) below shows the calculation used to determine the load factor threshold, where b is a multiplicative factor. For example, if there is no oversubscription, we simply set b to 1.0, and the page table can accommodate all required PTEs. On the other hand, if there is oversubscription, we can adjust b to be greater than 1.0. If b is 1.2, the page table can store the PTEs of pages that cover 120% of the device's memory size. FS-HPT sets the default b to 1.2, but b can be adjusted in some cases. We evaluate the effects of varying b in Section 5.3.

load factor threshold = b × (# max. local PT entries / # total PT entries)    (1)
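In code form, equation (1) and the eviction trigger it implies look roughly like this (our sketch; b defaults to 1.2 as stated above):

```c
/* Equation (1): threshold = b * (# max. local PT entries / # total PT entries). */
double load_factor_threshold(double b, double max_local_entries,
                             double total_entries)
{
    return b * (max_local_entries / total_entries);
}

/* On allocation, evict the LRU remote PTE once occupancy reaches the threshold. */
int must_evict_remote_pte(double occupied_entries, double total_entries,
                          double threshold)
{
    return (occupied_entries / total_entries) >= threshold;
}
```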
3.7 Victim Buffer

With our remote PTE eviction policy, the GPU driver invalidates remote mappings evicted from the GPU's page table. However, if an evicted remote mapping is accessed again, a page fault occurs. In this case, the GPU driver populates the page table with that mapping again. Even if page faults caused by re-accessing previously evicted remote entries are infrequent, when they occur, the GPU undertakes costly page fault handling [1].

To minimize the performance impact of the remote PTE replacement, we propose a victim buffer that maintains the evicted remote mappings in device memory. With the victim buffer, we can prevent page faults caused by re-accessing previously evicted remote mappings. Since the number of remote mappings evicted from the HPT (the primary page table) varies across applications, we design the victim buffer to be more scalable by using a radix tree structure. While this radix-tree design can lead to sequential memory accesses, it offers significant benefits in managing the victim buffer. Furthermore, the victim buffer is rarely accessed because the hot remote mappings mostly remain in the HPT. Therefore, the performance impact of sequential accesses to the buffer is trivial (Section 5.3).
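A lookup over such a radix-tree victim buffer might look like the sketch below (ours; the four-level, 9-bits-per-level layout is an assumption chosen to mirror an RPT-style tree over VA[47:12]):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct vb_node { void *slot[512]; } vb_node_t;  /* 9 bits per level */

/* Walk the tree with VA[47:39], VA[38:30], VA[29:21]; the leaf holds PTEs. */
void *victim_buffer_find(vb_node_t *root, uint64_t va)
{
    vb_node_t *n = root;
    for (int shift = 39; shift >= 21 && n != NULL; shift -= 9)
        n = (vb_node_t *)n->slot[(va >> shift) & 0x1FF];
    return n ? n->slot[(va >> 12) & 0x1FF] : NULL;      /* leaf: VA[20:12] */
}
```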
4 Discussion

4.1 Step Cache-Less Design

Our design uses an MMU cache, called the step cache, to store the open-addressing steps of recently accessed HPT entries. While this design is similar to PWCs, the step cache is more efficient than PWCs, as it stores steps instead of the physical addresses of the PTEs. Even if the step cache is indeed more compact than PWCs, it still experiences scaling issues as the application's memory footprint grows. In fact, FS-HPT can operate without the step cache if the page table size is sufficiently large that hash collisions rarely occur. If hash collisions do not occur in the page allocation process, the step value will be zero for all HPT entries, eliminating the need to read the step value from the step table.

By exploiting this insight, we propose an alternative design, Step Cache-Less FS-HPT. This approach excludes the step cache from the GMMU while using a large HPT. Even if this approach increases the area overhead of the page table, the impact of this overhead is trivial because the page table still occupies only a small portion of the device memory, as discussed in Section 2.4.

To implement the step cache-less design, we need a small modification to FS-HPT. In this design, a hash collision can occur while accessing the HPT. This is because a home HPT entry, whose index is hash value + stride × 0, is always accessed first, as we assume that the step is zero for the given VA. We store a VPN tag in each HPT entry to handle hash collisions. Since there are enough unused bits in a PTE [35, 40, 54], we can store the VPN tag in each HPT entry. If the requested translation is found in the home HPT entry (no hash collision), the page table walker finishes the lookup with only one memory reference. Otherwise, the page table walker reads a step value from the step table and then accesses the page table again using the calculated index. Even if the step cache-less design incurs three memory references per page table walk on a hash collision, it achieves performance similar to the step cache-based design, because hash collisions are rare when using a larger HPT. The impact of the step cache-less design will be discussed in Section 5.4.
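A sketch of the modified lookup (ours; names are illustrative): the home entry at hash value + stride × 0 is checked first, and its stored VPN tag decides whether the step table needs to be read at all:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t vpn_tag; uint64_t pte[512]; } hpt_entry_t;
extern uint8_t read_step_from_step_table(uint64_t va);  /* two extra accesses */

/* Step cache-less walk: one access normally, three on a (rare) collision. */
uint64_t *walk_without_step_cache(hpt_entry_t *hpt, uint64_t slots,
                                  uint64_t stride, uint64_t va)
{
    uint64_t hash = (va >> 21) % slots;            /* VPN tag = VA[47:21]    */
    hpt_entry_t *home = &hpt[hash];                /* assume step == 0       */
    if (home->vpn_tag == (va >> 21))
        return &home->pte[(va >> 12) & 0x1FF];     /* single reference       */
    uint8_t step = read_step_from_step_table(va);  /* collision: fetch step, */
    hpt_entry_t *e = &hpt[(hash + stride * step) % slots];
    return &e->pte[(va >> 12) & 0x1FF];            /* then re-access the HPT */
}
```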
4.2 Implementation Overheads

Hardware overheads: FS-HPT requires a small amount of additional memory space and a few minor changes in the GMMU unit. The hashed page table accounts for 0.5% of device memory, while a corresponding step table with a maximum load factor of 0.01 occupies 0.001% of device memory. The memory space requirements for the victim buffer vary depending on the memory access patterns of the applications, but it can be freely resized since it uses a multi-level radix tree structure. Also, employing the step cache incurs no area overhead compared to the baseline architecture, because we repurpose the existing PWCs as the step cache.

Software overheads: FS-HPT requires small changes to the GPU driver. To support a GPU with 80GB of memory [38], the HPT Map occupies only 200MB of memory space in the host memory. Since the GPU driver spends about 80% of the page fault handling time on page migration and page unmapping (invalidation) from the host page table [1, 27], the time spent searching the HPT Map is negligible. Our remote PTE eviction policy requires maintaining an LRU list of remote mappings. The GPU driver can maintain and update this LRU list by exploiting the existing Access Counter [16, 39, 47].

5 Evaluation

5.1 Experimental Environment

Simulation Setup: To evaluate the proposed technique, we used GPGPU-sim v4.0 [26] configured similarly to an RTX 3070 GPU [38]. The detailed experimental setup is summarized in Table 1.
Table 1: Experimental setup

  Clock frequency    1500MHz
  L1 TLB (private)   32 entries, fully associative, 20 cycles, 32 MSHR entries
  L2 TLB (shared)    1024 entries, 16 ways, 80 cycles, 128 MSHR entries
  L1 cache           128KB per SM, 40 cycles
  L2 cache           4MB, 16 ways, 180 cycles
  Memory             GDDR6, 16 channels, 448GB/s bandwidth
  Structure          4-level radix page table
  Page prefetching   16 consecutive pages (64KB)

Figure 13: Speedup of FS-HPT over four-level RPT

Figure 14: Number of memory references per page table walk
We faithfully extended the simulator to support RPT-based address translation by implementing multilevel TLBs, page table walkers, page walk caches, and a radix page table. In our simulated GMMU, we modeled highly threaded page table walkers [44] and PWCs with 32 entries [7]. We also modeled the behavior of FS-HPT in the simulator by adding the step table, step cache, victim buffer, and remote PTE eviction policy. We do not consider demand paging overhead, in the same way as related works [22, 25, 29], since FS-HPT focuses on reducing page table walk overhead. To determine the size of the page table, we assumed that GPU memory capacity is equal to the working set size of the benchmarks. Based on this assumption, we configured the HPT size to cover a memory space 2.5 times larger than the benchmark's memory footprint. Set up this way, the HPT consumes only 0.5% of the GPU memory. We modeled a prefetching scheme that prefetches 16 consecutive pages to the GPU memory when a page fault occurs [15].

Benchmarks: We selected diverse benchmarks from GraphBIG [34] (BC, BFS, SP, CC, KC), Rodinia [12] (NW, HOT, PF), and Polybench [17] (ATX, GEV, MVT, SYK, 2DC, BIC, COR, LU). The benchmarks exhibiting a level-2 PWC miss rate greater than 20% are categorized as irregular, while those with miss rates lower than 20% fall into the regular category. This classification enables a comprehensive analysis of the proposed technique's efficacy across diverse workloads. The memory footprints of the benchmarks range from 864MB to 2306MB. To mitigate the impractically long simulation times tied to large memory footprints, we constrained the working set size of the benchmarks to between 214MB and 1024MB.

Figure 15: Average page table walk latency.
5.2 Performance Analysis

Performance comparison: Figure 13 shows the relative performance of the four-level RPT, FS-HPT, and ideal PWC. For the irregular benchmarks, FS-HPT achieves a speedup of 15.8-35.8% (with an average of 27.8%) compared to the four-level RPT. This significant performance improvement with FS-HPT is comparable to the ideal PWC, where we assume all page walks involve only a single memory access. Typically, as the PWC miss rate increases, the performance gain also increases. For example, BC and SP, which have relatively low PWC miss rates compared to other irregular workloads, show speedups of 15.8% and 18.3%, respectively. On the other side, ATX, GEV, MVT, and BFS, whose PWC miss rates are about 40%, achieve speedups of 35.4%, 26.6%, 34.9%, and 35.8%, respectively. For the regular workloads, performance is comparable across RPT, FS-HPT, and ideal PWC, because most page walk requests hit in the PWC.

The number of memory references per page table walk: To understand why FS-HPT offers a significant speedup comparable to an ideal PWC, we examine the behavior of the page table walk process. Figure 14 shows the number of memory references per page table walk for each benchmark running with the RPT, FS-HPT, or ideal PWC configuration. For all benchmarks, FS-HPT involves far fewer memory references per page table walk than RPT. For irregular workloads especially, the number of memory references per page table walk for RPT is 1.35 on average, while for FS-HPT it is only 1.01, which is very close to the ideal case.

Since FS-HPT effectively resolves hash collisions with the step table and step cache, it can achieve fewer memory references per page table walk. The step cache achieves an over-99% hit rate for all benchmarks, while conventional PWCs offer only a 65% hit rate on average for irregular benchmarks. This is because an entry of the step cache covers a region 16 times larger (32MB) than an entry of the PWCs. This high hit rate of the step cache makes the average number of memory references per page table walk almost one.

Average page table walk latency: Figure 15 shows the average page table walk latency for each benchmark running on RPT, FS-HPT, and ideal PWC. All results are normalized to RPT. For irregular workloads, FS-HPT shows a 22.2% reduction on average in page table walk latency compared to RPT, which is close to the performance of an ideal PWC (23.5%). Since FS-HPT reduces the number of memory references per page table walk, both 1) the overall queueing latency for a page walk and 2) the memory access latency per page table walk decrease. As a result, the average page table walk latency decreases, reducing the stall time for each SM due to address translation.
Figure 16: Speedup of FS-HPT over RPTs with 150% oversubscription varying multiplicative factor
Figure 17: Speedup of the step cache-less design over RPT and its hash collision rate with varying page table sizes
5.3 Impact of Victim Buffer

We measure the performance of all benchmarks running at 150% oversubscription to evaluate the impact of the victim buffer. For 150% oversubscription, we set the device memory size to 66.6% of each benchmark's working set size to avoid extremely long simulation times. Figure 16 shows the speedup of FS-HPT over RPT both with and without 150% oversubscription, varying the multiplicative factor b (bar graph). As mentioned in Section 3.6, the default b is 1.2. With b = 1.2, for irregular workloads, FS-HPT offers slightly lower performance gains in 150% oversubscription scenarios compared to the case with no oversubscription. The average speedup with b = 1.2 is 21.5%, while the average speedup is 27.8% without oversubscription. The small reduction in performance improvement is mainly due to oversubscription requiring accesses to the victim buffer to retrieve cold remote mappings.

Figure 16 also shows the victim buffer access rate as a proportion of total page table accesses for all workloads (line graph). With b = 1.2, GEV, SYK, BFS, SP, and NW show over 1% victim buffer access rates (1.4-4.4%). Nevertheless, NW, which has the highest victim buffer access rate (4.4%), still shows a speedup of 11% over RPT. These results demonstrate that while the performance gain from using FS-HPT is slightly reduced with oversubscription, it still provides a substantial performance gain compared to RPT.

Sensitivity analysis: We evaluate the performance of FS-HPT under 150% memory oversubscription by varying the multiplicative factor to assess the impact of different load factor thresholds. For each multiplicative factor (1.0, 1.1, 1.2, 1.3, 1.4, and 1.5), FS-HPT achieves average speedups of 11.7%, 17.7%, 21.5%, 25.3%, 26.6%, and 27.8% over RPT, respectively. With b = 1.0, all remote accesses result in victim buffer lookups after the load factor reaches its threshold. Therefore, the average speedup over RPT with b = 1.0 is 11.7%, which is smaller than in the no-oversubscription case. Some benchmarks (NW and KC) show lower performance than RPT with this multiplicative factor because of the high victim buffer access rate. On the other hand, with b = 1.5, the GPU's page table can store all the PTEs of pages in the working set with 150% oversubscription. Therefore, the victim buffer access rate becomes zero, and FS-HPT achieves performance comparable to the no-oversubscription case. Since the probing time for open addressing accounts for a small portion of the long page fault handling times, the impact of increasing the multiplicative factor on performance is trivial. Rather, a high multiplicative factor reduces the number of accesses to the victim buffer, providing near-ideal performance.

5.4 Performance of Step Cache-Less Design

We evaluate the performance of the step cache-less design discussed in Section 4.1. Figure 17 shows the performance of the step cache-less design and its hash collision rate for page table accesses with varying sizes of the page table. Each bar depicts the speedup of FS-HPT with the step cache-less design for a different page table size compared to RPT. The line graph shows the hash collision rate on page table lookups. Starting from the base page table size (0.5% of total memory), which is our default configuration, we gradually increase the size of the page table to 1%, 2%, and then 4% of device memory. As expected, the hash collision rate while accessing the HPT decreases as the page table size increases. With the 0.5% configuration, the performance decreases by an average of 2.2% compared to RPT due to the high collision rate. However, the step cache-less design with page table sizes of 1%, 2%, and 4% of device memory shows an average 10%, 19%, and 24.4% speedup over RPTs for irregular workloads, respectively. These results imply that using a large HPT can achieve near-ideal performance without the need for an SRAM cache in the GMMU.

5.5 Impact of Large Page Size

Figure 18 shows the speedup for benchmarks running on RPT, FS-HPT, and ideal PWC over RPT with a 64KB base page size. For irregular workloads, FS-HPT achieves a speedup of 24.5% over RPT, …
Figure 18: Speedup of FS-HPT over RPT for 64KB page size

Figure 19 (legend): RPT, RPT+NHA, FS-HPT, FS-HPT+NHA, Ideal PWC+NHA

… FS-HPT because our technique focuses on reducing page table walk overheads and can support multiple page sizes.

Reducing page table walk overhead: Shin et al. [48] proposed an efficient page table walk scheduling policy by exploiting SIMD execution characteristics. They scheduled page table walk requests using the "shortest-job-first" principle and "batch" processing. Also, Shin et al. [49] reduced the page table walk overhead by coalescing multiple page table walk requests that fall into a single cacheline. Since FS-HPT ensures the spatial locality of PTEs at cacheline granularity, coalescing page table walk requests is compatible with FS-HPT.

Virtual memory in multi-chip systems: Pratheek et al. [46] tried …
References

[1] Tyler Allen and Rong Ge. 2021. In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated Computing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
[2] AMD 2012. AMD GRAPHICS CORES NEXT (GCN) ARCHITECTURE. AMD. https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
[3] AMD 2015. AMD APP SDK OpenCL Optimization Guide. AMD. https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/programmer-references/AMD_OpenCL_Programming_Optimization_Guide2.pdf
[4] Andrea Arcangeli. 2010. Transparent Hugepage Support. In KVM Forum, Vol. 9.
[5] Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. 2017. MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability. ACM SIGARCH Computer Architecture News 45, 2 (2017), 320–332.
[6] Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J Rossbach, and Onur Mutlu. 2017. Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 136–150.
[7] Thomas W Barr, Alan L Cox, and Scott Rixner. 2010. Translation Caching: Skip, Don't Walk (the Page Table). ACM SIGARCH Computer Architecture News 38, 3 (2010), 48–59.
[8] Trinayan Baruah, Yifan Sun, Saiful A Mojumder, José L Abellán, Yash Ukidave, Ajay Joshi, Norman Rubin, John Kim, and David Kaeli. 2020. Valkyrie: Leveraging Inter-TLB Locality to Enhance GPU Performance. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 455–466.
[9] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha Manne. 2008. Accelerating Two-Dimensional Page Walks for Virtualized Systems. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems. 26–35.
[10] Abhishek Bhattacharjee. 2013. Large-Reach Memory Management Unit Caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. 383–394.
[11] Damla Senol Cali, Konstantinos Kanellopoulos, Joël Lindegger, Zülal Bingöl, Gurpreet S Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika Mansouri Ghiasi, et al. 2022. SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 638–655.
[12] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A Benchmark Suite for Heterogeneous Computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 44–54.
[13] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 257–266.
[14] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2022. Introduction to Algorithms. MIT Press.
[15] Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem. 2019. Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory. In Proceedings of the 46th International Symposium on Computer Architecture. 224–235.
[16] Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem. 2020. Adaptive Page Migration for Irregular Data-Intensive Applications under GPU Memory Oversubscription. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 451–461.
[17] Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-Tuning a High-Level Language Targeted to GPU Codes. In 2012 Innovative Parallel Computing (InPar). IEEE, 1–10.
[18] Jerry Huck and Jim Hays. 1993. Architectural Support for Translation Table Management in Large Address Space Machines. In Proceedings of the 20th Annual International Symposium on Computer Architecture. 39–50.
[19] IBM. 2005. PowerPC Microprocessor Family: The Programming Environments Manual for 64 and 32-Bit Microprocessors.
[20] Intel. 2010. Intel Itanium Architecture Software Developer's Manual. (2010). https://www.intel.sg/content/dam/doc/manual/itanium-architecture-vol-1-2-3-4-reference-set-manual.pdf
[21] Intel. 2023. Intel 64 and IA-32 Architectures Software Developer's Manual. (2023). https://cdrdv2.intel.com/v1/dl/getContent/671200
[22] Aamer Jaleel, Eiman Ebrahimi, and Sam Duncan. 2019. Ducati: High-Performance Address Translation by Extending TLB Reach of GPU-Accelerated Systems. ACM Transactions on Architecture and Code Optimization (TACO) 16, 1 (2019), 1–24.
[23] Jonathan Corbet. 2005. Four-Level Page Tables Merged. (2005). https://lwn.net/Articles/117749/
[24] Jonathan Corbet. 2017. Five-Level Page Tables. (2017). https://lwn.net/Articles/717293/
[25] Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D Hill, Kathryn S McKinley, Mario Nemirovsky, Michael M Swift, and Osman Ünsal. 2015. Redundant Memory Mappings for Fast Access to Large Memories. ACM SIGARCH Computer Architecture News 43, 3S (2015), 66–78.
[26] Mahmoud Khairy, Zhesheng Shen, Tor M Aamodt, and Timothy G Rogers. 2020. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 473–486.
[27] Hyojong Kim, Jaewoong Sim, Prasun Gera, Ramyad Hadidi, and Hyesoon Kim. 2020. Batch-Aware Unified Memory Management in GPUs for Irregular Workloads. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 1357–1370.
[28] Osang Kwon, Yongho Lee, and Seokin Hong. 2022. Pinning Page Structure Entries to Last-Level Cache for Fast Address Translation. IEEE Access 10 (2022), 114552–114565.
[29] Jiwon Lee, Ju Min Lee, Yunho Oh, William J Song, and Won Woo Ro. 2023. SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 1195–1207.
[30] Bingyao Li, Yanan Guo, Yueqi Wang, Aamer Jaleel, Jun Yang, and Xulong Tang. 2023. IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE Invalidations. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 1163–1177.
[31] Bingyao Li, Yueqi Wang, and Xulong Tang. 2023. Orchestrated Scheduling and Partitioning for Improved Address Translation in GPUs. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
[32] Bingyao Li, Jieming Yin, Anup Holey, Youtao Zhang, Jun Yang, and Xulong Tang. 2023. Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 456–470.
[33] Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, et al. 2022. GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 635–654.
[34] Lifeng Nai, Yinglong Xia, Ilie G Tanase, Hyesoon Kim, and Ching-Yung Lin. 2015. GraphBIG: Understanding Graph Computing in the Context of Industrial Solutions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12.
[35] NVIDIA 2016. NVIDIA TESLA P100. NVIDIA. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
[36] NVIDIA 2017. NVIDIA TESLA V100 GPU ARCHITECTURE. NVIDIA. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[37] NVIDIA 2020. Pascal MMU Format Changes. NVIDIA. https://nvidia.github.io/open-gpu-doc/pascal/gp100-mmu-format.pdf
[38] NVIDIA 2021. NVIDIA AMPERE GA102 GPU ARCHITECTURE. NVIDIA. https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
[39] NVIDIA 2024. NVIDIA Linux Open GPU Kernel Module Source. NVIDIA. https://github.com/NVIDIA/open-gpu-kernel-modules
[40] Rasmus Pagh and Flemming Friche Rodler. 2004. Cuckoo Hashing. Journal of Algorithms 51, 2 (2004), 122–144.
[41] Chang Hyun Park, Ilias Vougioukas, Andreas Sandberg, and David Black-Schaffer. 2022. Every Walk's a Hit: Making Page Walks Single-Access Cache Hits. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 128–141.
[42] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 258–269.
[43] Bharath Pichai, Lisa Hsu, and Abhishek Bhattacharjee. 2014. Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces. ACM SIGARCH Computer Architecture News 42, 1 (2014), 743–758.
[44] Jason Power, Mark D Hill, and David A Wood. 2014. Supporting x86-64 Address Translation for 100s of GPU Lanes. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 568–578.
[45] B Pratheek, Neha Jawalkar, and Arkaprava Basu. 2021. Improving GPU Multi-Tenancy with Page Walk Stealing. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 626–639.
[46] B Pratheek, Neha Jawalkar, and Arkaprava Basu. 2022. Designing Virtual Memory System of MCM GPUs. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 404–422.
[47] Nikolay Sakharnykh. 2018. Everything You Need to Know about Unified Memory. https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2018-