CN104049951A - Replaying memory transactions while resolving memory access faults - Google Patents
- Publication number
- CN104049951A (application CN201310752957.7A)
- Authority
- CN
- China
- Prior art keywords
- memory
- ppu
- page
- virtual memory
- cpu
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3861—Recovery, e.g. branch miss-prediction, exception handling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
- G06F12/1027—Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
One embodiment of the present invention is a parallel processing unit (PPU) that includes one or more streaming multiprocessors (SMs) and implements a replay unit per SM. Upon detecting a page fault associated with a memory transaction issued by a particular SM, the corresponding replay unit stalls that SM, but not any unaffected SMs, from issuing new memory transactions. The replay unit then stores the faulting memory transaction and any faulting in-flight memory transactions in a replay buffer. When the page fault is resolved, the replay unit replays the memory transactions in the replay buffer, removing successful memory transactions from the buffer, until all of the stored memory transactions have executed successfully. Advantageously, the overall performance of the PPU is improved compared to a conventional PPU that, upon detecting a page fault, halts the execution of memory transactions across all SMs included in the PPU until the fault is resolved.
Description
Technical Field

The present invention relates generally to computer science and, more specifically, to replaying memory transactions while resolving memory access faults.

Background Art

A typical computer system includes a central processing unit (CPU) and a parallel processing unit (PPU). When a software application executes on the computer system, the CPU and the PPU perform memory operations to store and retrieve data in physical memory locations. Some advanced computer systems implement a unified virtual memory architecture (UVM) common to both the CPU and the PPU. Among other things, the architecture enables the CPU and the PPU to access a physical memory location using a common (e.g., the same) virtual memory address, regardless of whether the physical memory location is within system memory or memory local to the PPU (PPU memory).

Computer systems typically include memory management functionality to facilitate virtual memory and paging operations. During normal operation, an instruction may request access to a virtual address associated with a data page that has been paged out, causing an access fault. In response to the access fault, a conventional processing unit may complete the instructions preceding the faulting instruction and cancel the faulting instruction along with all instructions that began executing after the faulting instruction. At that point, an access fault handler pages in the requested data page and restarts execution beginning with the faulting instruction. In some cases, the access fault handler may require a considerable amount of time to complete relative to typical instruction execution times. In particular, if the computer system implements a unified virtual memory architecture, the access fault handler may execute a lengthy faulting procedure that migrates memory pages between system memory and PPU-local memory.

In a highly parallel, multithreaded advanced PPU, hundreds or thousands of memory transactions, and therefore many address translations, may be outstanding at any given time. Consequently, many memory access faults may be active at any given time. If the PPU were to implement conventional instruction-cancellation fault handling techniques, the PPU would frequently cancel thousands of instructions across all execution units. In addition, the PPU would have to wait for a lengthy access fault handler to load the paged-out data for each faulting instruction within each executing thread. Such latencies would greatly, and often unacceptably, degrade overall system performance.

As the foregoing illustrates, what is needed in the art is a more effective approach to handling access faults involving multithreaded processing units.
Summary of the Invention

One embodiment of the present invention sets forth a computer-implemented method for processing virtual memory transactions associated with a multithreaded processing unit. The method includes receiving a first virtual memory transaction from a first unit; attempting to execute the first virtual memory transaction; detecting a first page fault associated with the first virtual memory transaction; storing the first virtual memory transaction in a replay buffer; triggering a stall condition that prevents the first unit from generating subsequent virtual memory transactions until the first page fault has been resolved; and, once the first page fault has been resolved, re-executing the first virtual memory transaction and at least one other virtual memory transaction stored in the replay buffer.
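For illustration only, the following sketch outlines the claimed sequence in CUDA-style host C++: buffer the faulting transaction, stall only the issuing unit, and replay buffered transactions once the fault is resolved. All names here (ReplayUnit, MemTransaction, tryExecute, and so on) are hypothetical and are not taken from the patent.

    // Hypothetical sketch of the per-SM replay flow described above.
    #include <deque>
    #include <functional>

    struct MemTransaction {
        unsigned long long virtualAddr;  // virtual memory address accessed
        int                accessType;   // e.g., read, write, atomic
    };

    struct ReplayUnit {
        std::deque<MemTransaction> replayBuffer;  // faulting transactions awaiting replay
        bool stalled = false;                     // stall condition for the issuing SM

        // Attempt a transaction; on a page fault, buffer it and stall the SM.
        void issue(const MemTransaction& t,
                   const std::function<bool(const MemTransaction&)>& tryExecute) {
            if (stalled || !tryExecute(t)) {      // page fault detected (or already stalled)
                replayBuffer.push_back(t);        // store the faulting transaction
                stalled = true;                   // only this SM stops issuing new transactions
            }
        }

        // Called once the page fault has been resolved.
        void onFaultResolved(const std::function<bool(const MemTransaction&)>& tryExecute) {
            while (!replayBuffer.empty()) {
                if (tryExecute(replayBuffer.front()))
                    replayBuffer.pop_front();     // remove transactions that now succeed
                else
                    return;                       // another fault: remain stalled until resolved
            }
            stalled = false;                      // all buffered transactions succeeded; resume
        }
    };

Note that other, unaffected SMs have no corresponding stall state and continue issuing transactions, which is the advantage described in the next paragraph.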
One advantage of the disclosed approach is that units included in the multithreaded processing unit that did not contribute to a page fault can continue issuing virtual memory transactions while the page fault is outstanding. In addition, because an affected unit continues to replay the faulting virtual memory transaction and any faulting in-flight virtual memory transactions, those virtual memory transactions are not canceled while the page fault is being resolved. Consequently, the overall performance of the multithreaded processing unit is improved relative to a conventional multithreaded processing unit that, upon a page fault, cancels the virtual memory transactions issued by all units within the processing unit until the page fault is resolved.
Brief Description of the Drawings

So that the manner in which the above-recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to exemplary embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention;

FIG. 2 is a block diagram illustrating a unified virtual memory (UVM) system, according to one embodiment of the present invention;

FIG. 3 is a block diagram illustrating a unified virtual memory (UVM) system configured with a replay unit, according to one embodiment of the present invention;

FIG. 4 is a conceptual diagram illustrating the replay unit of FIG. 3, according to one embodiment of the present invention; and

FIG. 5 is a flow diagram of method steps for managing memory transactions issued by a streaming multiprocessor (SM), according to one embodiment of the present invention.
Detailed Description

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details.

System Overview

FIG. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via an interconnection path that may include a memory bridge 105. Memory bridge 105, which may be, for example, a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, for example, a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via communication path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or second communication path 113 (e.g., a Peripheral Component Interconnect (PCI) Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110, which may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, or the like. A system disk 114 is also connected to I/O bridge 107 and may be configured to store applications, data, and content for use by CPU 102 and parallel processing subsystem 112. System disk 114 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only memory), DVD-ROM (digital versatile disc ROM), Blu-ray, HD-DVD (high-definition DVD), or other magnetic, optical, or solid-state storage devices.
A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including universal serial bus (USB) or other port connections, compact disc (CD) drives, digital versatile disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107. The various communication paths shown in FIG. 1, including the specifically named communication paths 106 and 113, may be implemented using any suitable protocol, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol, and connections between different devices may use different protocols, as is known in the art.

In one embodiment, parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes one or more parallel processing units (PPUs) 202. In another embodiment, parallel processing subsystem 112 incorporates circuitry optimized for general-purpose processing, while preserving the underlying computational architecture, described in greater detail herein. In yet another embodiment, parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining memory bridge 105, CPU 102, and I/O bridge 107 to form a system on chip (SoC). As is well known, many graphics processing units (GPUs) are designed to perform parallel operations and computations and are therefore considered a class of parallel processing unit (PPU).

Any number of PPUs 202 may be included in parallel processing subsystem 112. For instance, multiple PPUs 202 may be provided on a single add-in card, multiple add-in cards may be connected to communication path 113, or one or more PPUs 202 may be integrated into a bridge chip. PPUs 202 in a multi-PPU system may be identical to or different from one another. For instance, different PPUs 202 might have different numbers of processing cores, different amounts of local parallel processing memory, and so on. Where multiple PPUs 202 are present, those PPUs may be operated in parallel to process data at a higher throughput than is possible with a single PPU 202. Systems incorporating one or more PPUs 202 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.

PPU 202 advantageously implements a highly parallel processing architecture. PPU 202 includes a number of general processing clusters (GPCs). Each GPC is capable of executing a large number (e.g., hundreds or thousands) of threads concurrently, where each thread is an instance of a program. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each GPC. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program.

A GPU includes a number of streaming multiprocessors (SMs), where each SM is configured to process one or more thread groups. A series of instructions transmitted to a particular GPC constitutes a thread, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SM is referred to herein as a "warp" or "thread group." As used herein, a "thread group" refers to a group of threads concurrently executing the same program on different input data, with each thread of the group being assigned to a different processing engine within an SM. Additionally, a plurality of related thread groups may be active (in different phases of execution) at the same time within an SM. This collection of thread groups is referred to herein as a "cooperative thread array" ("CTA") or "thread array."

In embodiments of the present invention, it is desirable to use PPU 202 or other processor(s) of a computing system to execute general-purpose computations using thread arrays. Each thread in the thread array is assigned a unique thread identifier ("thread ID") that is accessible to the thread during the thread's execution. The thread ID, which can be defined as a one-dimensional or multi-dimensional numerical value, controls various aspects of the thread's processing behavior. For instance, a thread ID may be used to determine which portion of an input data set a thread is to process and/or to determine which portion of an output data set a thread is to produce or write.
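As an informal illustration of this use of thread IDs (not part of the patent; the kernel below and its names are hypothetical), a CUDA-style thread program can combine its block and thread indices into a one-dimensional thread ID and use that ID to select which element of the input data set it processes:

    // Hypothetical CUDA kernel: each thread derives a 1-D thread ID and uses it to
    // pick the portion of the input it processes and the portion of the output it writes.
    __global__ void scaleArray(const float* in, float* out, int n, float k)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
        if (tid < n)
            out[tid] = k * in[tid];                       // this thread's slice of the data
    }

    // Host-side launch (illustrative): one thread per input element.
    // scaleArray<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);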
In operation, CPU 102 is the master processor of computer system 100, controlling and coordinating the operation of the other system components. In particular, CPU 102 issues commands that control the operation of PPU 202. In one embodiment, communication path 113 is a PCI Express link, in which dedicated lanes are allocated to each PPU 202, as is known in the art. Other communication paths may also be used. PPU 202 advantageously implements a highly parallel processing architecture. PPU 202 may be provided with any amount of local parallel processing memory (PPU memory).

In some embodiments, system memory 104 includes a unified virtual memory (UVM) driver 101. UVM driver 101 includes instructions for performing various tasks related to a unified virtual memory (UVM) that is common to both CPU 102 and PPU 202. Among other things, the architecture enables CPU 102 and PPU 202 to access a physical memory location using a common virtual memory address, regardless of whether the physical memory location is within system memory 104 or memory local to PPU 202 (PPU memory).

It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 might be integrated into a single chip instead of existing as one or more discrete devices. Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.
Unified Virtual Memory System Architecture

FIG. 2 is a block diagram illustrating a unified virtual memory (UVM) system 200, according to one embodiment of the present invention. As shown, unified virtual memory system 200 includes, without limitation, CPU 102, system memory 104, and parallel processing unit (PPU) 202 coupled to a parallel processing unit memory (PPU memory) 204. CPU 102 and system memory 104 are coupled to each other and to PPU 202 via memory bridge 105.

CPU 102 executes threads that may request data stored in system memory 104 or PPU memory 204 via virtual memory addresses. Virtual memory addresses shield threads executing in CPU 102 from knowledge of the internal workings of the memory system. Thus, a thread may only have knowledge of virtual memory addresses and may access data by requesting data via a virtual memory address.

CPU 102 includes a CPU MMU 209, which processes requests from CPU 102 to translate virtual memory addresses to physical memory addresses. Physical memory addresses are required to access data stored in physical memory units such as system memory 104 and PPU memory 204. CPU 102 includes a CPU fault handler 211, which executes steps in response to CPU MMU 209 generating a page fault, in order to make requested data available to CPU 102. CPU fault handler 211 is generally software that resides in system memory 104 and executes on CPU 102, the software being invoked by an interrupt to CPU 102.

System memory 104 stores various memory pages (not shown) that include data for use by threads executing on CPU 102 or PPU 202. As shown, system memory 104 stores a CPU page table 206, which includes mappings between virtual memory addresses and physical memory addresses. System memory 104 also stores a page state directory 210, which acts as a "master page table" for UVM system 200, as is discussed in greater detail below. System memory 104 stores a fault buffer 216, which includes entries written by PPU 202 in order to inform CPU 102 of page faults generated by PPU 202. In some embodiments, system memory 104 includes the unified virtual memory (UVM) driver 101, which includes instructions that, when executed, cause CPU 102 to, among other things, execute commands for remedying a page fault. In alternative embodiments, any combination of the page state directory 210 and one or more command queues 214 may be stored in PPU memory 204. Further, PPU page table 208 may be stored in system memory 104.
In a manner analogous to CPU 102, PPU 202 executes instructions that may request data stored in system memory 104 or PPU memory 204 via virtual memory addresses. PPU 202 includes a PPU MMU 213, which processes requests from PPU 202 to translate virtual memory addresses to physical memory addresses. PPU 202 also includes a copy engine 212, which executes commands stored in a command queue 214 for copying memory pages, modifying data in PPU page table 208, and other commands. A PPU fault handler 215 executes steps in response to a page fault on PPU 202. PPU fault handler 215 can be software running on a processor or a dedicated microcontroller in PPU 202. Alternatively, PPU fault handler 215 can be a combination of software running on CPU 102 and software running on the dedicated microcontroller in PPU 202, communicating with each other. In some embodiments, CPU fault handler 211 and PPU fault handler 215 can be invoked by a fault on either CPU 102 or PPU 202. Command queue 214 may reside in either PPU memory 204 or system memory 104, but is preferably located in system memory 104.

In some embodiments, CPU fault handler 211 and UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in system memory 104 and executes on CPU 102. PPU fault handler 215 may be a separate software program running on a processor or a dedicated microcontroller in PPU 202, or PPU fault handler 215 may be a separate software program running on CPU 102.

In other embodiments, PPU fault handler 215 and UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in system memory 104 and executes on CPU 102. CPU fault handler 211 may be software that resides in system memory 104 and executes on CPU 102.

In other embodiments, CPU fault handler 211, PPU fault handler 215, and UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in system memory 104 and executes on CPU 102.

In some embodiments, as described above, CPU fault handler 211, PPU fault handler 215, and UVM driver 101 may all reside in system memory 104. As shown in FIG. 2, UVM driver 101 resides in system memory 104, while CPU fault handler 211 and PPU fault handler 215 reside in CPU 102.

CPU fault handler 211 and PPU fault handler 215 are responsive to hardware interrupts that may emanate from CPU 102 or PPU 202, such as interrupts resulting from page faults. As further described below, UVM driver 101 includes instructions for performing various tasks related to management of UVM system 200, including, without limitation, remedying page faults and accessing CPU page table 206, page state directory 210, and/or fault buffer 216.

In some embodiments, CPU page table 206 and PPU page table 208 have different formats and contain different information; for example, PPU page table 208 may contain the following while CPU page table 206 does not: atomic disable bit; compression tags; and memory swizzling type.

In a manner analogous to system memory 104, PPU memory 204 stores various memory pages (not shown). As shown, PPU memory 204 also includes PPU page table 208, which includes mappings between virtual memory addresses and physical memory addresses. Alternatively, PPU page table 208 may be stored in system memory 104.
Translating Virtual Memory Addresses

When a thread executing in CPU 102 requests data via a virtual memory address, CPU 102 requests translation of the virtual memory address to a physical memory address from the CPU memory management unit (CPU MMU) 209. In response, CPU MMU 209 attempts to translate the virtual memory address into a physical memory address, which specifies a location in a memory unit, such as system memory 104, that stores the data requested by CPU 102.

To translate a virtual memory address to a physical memory address, CPU MMU 209 performs a lookup operation to determine whether CPU page table 206 includes a mapping associated with the virtual memory address. In addition to a virtual memory address, a request to access data may also indicate a virtual memory address space. Unified virtual memory system 200 may implement multiple virtual memory address spaces, each of which is assigned to one or more threads. Virtual memory addresses are unique within any given virtual memory address space. Further, virtual memory addresses within a given virtual memory address space are consistent across CPU 102 and PPU 202, thereby allowing the same virtual memory address to refer to the same data across CPU 102 and PPU 202. In some embodiments, two virtual memory addresses may refer to the same data, but may not map to the same physical memory address (e.g., CPU 102 and PPU 202 may each have a local read-only copy of the data).

For any given virtual memory address, CPU page table 206 may or may not include a mapping between the virtual memory address and a physical memory address. If CPU page table 206 includes a mapping, then CPU MMU 209 reads that mapping to determine the physical memory address associated with the virtual memory address and provides that physical memory address to CPU 102. However, if CPU page table 206 does not include a mapping associated with the virtual memory address, then CPU MMU 209 is unable to translate the virtual memory address into a physical memory address, and CPU MMU 209 generates a page fault. To remedy the page fault and make the requested data available to CPU 102, a "page fault sequence" is executed. More specifically, CPU 102 reads PSD 210 to determine the current mapping state of the page and then determines the appropriate page fault sequence. The page fault sequence generally maps the memory page associated with the requested virtual memory address or changes the types of accesses permitted (e.g., read access, write access, atomic access). The different types of page fault sequences implemented in UVM system 200 are discussed in greater detail below.
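A minimal sketch of this lookup-or-fault behavior follows (hypothetical types and names; the real MMU is hardware, not host code, and page-table formats are not specified here):

    // Hypothetical sketch of an MMU page-table lookup that either returns a
    // physical address or reports a page fault that triggers a fault sequence.
    #include <cstdint>
    #include <optional>
    #include <unordered_map>

    using VirtualAddr  = std::uint64_t;
    using PhysicalAddr = std::uint64_t;

    struct PageTable {
        // virtual page number -> physical page number
        std::unordered_map<std::uint64_t, std::uint64_t> entries;
    };

    // Returns the physical address, or std::nullopt to signal a page fault.
    std::optional<PhysicalAddr> translate(const PageTable& pt, VirtualAddr va,
                                          std::uint64_t pageSize = 4096)
    {
        std::uint64_t vpn = va / pageSize;      // virtual page number
        std::uint64_t off = va % pageSize;      // offset within the page
        auto it = pt.entries.find(vpn);
        if (it == pt.entries.end())
            return std::nullopt;                // no mapping: page fault
        return it->second * pageSize + off;     // mapped: physical address
    }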
Within UVM system 200, data associated with a given virtual memory address may be stored in system memory 104, in PPU memory 204, or in both system memory 104 and PPU memory 204 as read-only copies of the same data. Further, for any such data, either or both of CPU page table 206 and PPU page table 208 may include a mapping associated with that data. Notably, some data exists for which a mapping exists in one page table but not in the other. However, PSD 210 includes all mappings stored in PPU page table 208, as well as the PPU-relevant mappings stored in CPU page table 206. PSD 210 thus functions as a "master" page table for unified virtual memory system 200. Therefore, when CPU MMU 209 does not find a mapping in CPU page table 206 associated with a particular virtual memory address, CPU 102 reads PSD 210 to determine whether PSD 210 includes a mapping associated with that virtual memory address. Various embodiments of PSD 210 may include different types of information associated with virtual memory addresses in addition to mappings associated with the virtual memory addresses.

When CPU MMU 209 generates a page fault, CPU fault handler 211 executes a sequence of operations for the appropriate page fault sequence to remedy the page fault. Again, during a page fault sequence, CPU 102 reads PSD 210 and executes additional operations in order to change the mappings or permissions within CPU page table 206 and PPU page table 208. Such operations may include reading and/or modifying CPU page table 206, reading and/or modifying page state directory 210, and/or migrating blocks of data referred to as "memory pages" between memory units (e.g., system memory 104 and PPU memory 204).

To determine which operations to execute in a page fault sequence, CPU 102 identifies the memory page associated with the virtual memory address. CPU 102 then reads state information for the memory page from the PSD 210 entry associated with the virtual memory address associated with the memory access request that caused the page fault. Such state information may include, among other things, an ownership state for the memory page associated with the virtual memory address. For any given memory page, several ownership states are possible. For example, a memory page may be "CPU-owned," "PPU-owned," or "CPU-shared." A memory page is considered CPU-owned if CPU 102 can access the memory page via a virtual address without causing a page fault, and if PPU 202 cannot access the memory page via a virtual address without causing a page fault. Preferably, a CPU-owned page resides in system memory 104, but may reside in PPU memory 204. A memory page is considered PPU-owned if PPU 202 can access the page via a virtual address without causing a page fault, and if CPU 102 cannot access the page via a virtual address without causing a page fault. Preferably, a PPU-owned page resides in PPU memory 204, but may reside in system memory 104 when migration from system memory 104 to PPU memory 204 is not done. Finally, a memory page is considered CPU-shared if both CPU 102 and PPU 202 can access the memory page via a virtual address without causing a page fault. A CPU-shared page may reside in either system memory 104 or PPU memory 204.
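Purely as an illustration of these three ownership states (hypothetical names, not the patent's implementation), the access rule, under which a fault is raised whenever the requesting processor lacks access, can be modeled as follows:

    // Hypothetical model of page ownership states and the resulting access rule.
    enum class Ownership { CpuOwned, PpuOwned, CpuShared };
    enum class Processor { Cpu, Ppu };

    // Returns true if 'who' can access a page in state 'own' without causing a page fault.
    bool canAccessWithoutFault(Ownership own, Processor who)
    {
        switch (own) {
            case Ownership::CpuOwned:  return who == Processor::Cpu;
            case Ownership::PpuOwned:  return who == Processor::Ppu;
            case Ownership::CpuShared: return true;   // both processors may access
        }
        return false;
    }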
CPU page table 206 may assign ownership states to memory pages based on a variety of factors, including the usage history of the memory page. The usage history may include information about whether CPU 102 or PPU 202 has recently accessed the memory page and how many times such accesses were made. For example, if UVM system 200 determines, based on the usage history of a given memory page, that the memory page is likely to be used mostly or only by CPU 102, then UVM system 200 may assign an ownership state of "CPU-owned" to that memory page and place the page in system memory 104. Similarly, if UVM system 200 determines, based on the usage history of a given memory page, that the memory page is likely to be used mostly or only by PPU 202, then UVM system 200 may assign an ownership state of "PPU-owned" to that memory page and place the page in PPU memory 204. Finally, if UVM system 200 determines, based on the usage history of a given memory page, that the memory page is likely to be used by both CPU 102 and PPU 202, and that migrating the memory page back and forth between system memory 104 and PPU memory 204 would consume too much time, then UVM system 200 may assign an ownership state of "CPU-shared" to that memory page.

As examples, fault handlers 211 and 215 may implement any or all of the following heuristics for migration (a simplified sketch follows the list below):
(a) on a CPU 102 access to an unmapped page that is mapped to PPU 202 and has not been recently migrated, unmap the faulting page from PPU 202, migrate the page to CPU 102, and map the page to CPU 102;

(b) on a PPU 202 access to an unmapped page that is mapped to CPU 102 and has not been recently migrated, unmap the faulting page from CPU 102, migrate the page to PPU 202, and map the page to PPU 202;

(c) on a CPU 102 access to an unmapped page that is mapped to PPU 202 and has been recently migrated, migrate the faulting page to CPU 102 and map the page on both CPU 102 and PPU 202;

(d) on a PPU 202 access to an unmapped page that is mapped on CPU 102 and has been recently migrated, map the page to both CPU 102 and PPU 202;

(e) on a PPU 202 atomic access to a page that is mapped to both CPU 102 and PPU 202 but not enabled for atomic operations by PPU 202, unmap the page from CPU 102, and map the page to PPU 202 with atomic operations enabled;

(f) on a PPU 202 write access to a page that is mapped on CPU 102 and PPU 202 as copy-on-write (COW), copy the page to PPU 202, thereby making an independent copy of the page, map the new page as read-write on PPU 202, and leave the current page as mapped on CPU 102;

(g) on a PPU 202 read access to a page that is mapped on CPU 102 and PPU 202 as zero-fill-on-demand (ZFOD), allocate a page of physical memory on PPU 202, fill the page with zeros, map the page on PPU 202, and change the page to be unmapped on CPU 102;

(h) on an access by a first PPU 202(1) to an unmapped page that is mapped on a second PPU 202(2) and has not been recently migrated, unmap the faulting page from the second PPU 202(2), migrate the page to the first PPU 202(1), and map the page to the first PPU 202(1); and

(i) on an access by a first PPU 202(1) to an unmapped page that is mapped on a second PPU 202(2) and has been recently migrated, map the faulting page to the first PPU 202(1) and keep the mapping of the page on the second PPU 202(2).

In sum, many heuristic rules are possible, and the scope of the present invention is not limited to these examples.
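The sketch below is purely illustrative; it covers only heuristics (a) through (d) above and omits the atomic, COW, ZFOD, and multi-PPU cases, and all names are hypothetical. It shows how such a per-fault decision might be expressed:

    // Hypothetical dispatch over a subset of the migration heuristics listed above.
    enum class Faulter { Cpu, Ppu };

    struct PageInfo {
        bool mappedOnCpu;
        bool mappedOnPpu;
        bool recentlyMigrated;
    };

    enum class Action { MigrateToCpu, MigrateToPpu, MigrateToCpuMapBoth, MapBoth, None };

    Action chooseMigration(Faulter who, const PageInfo& p)
    {
        if (who == Faulter::Cpu && p.mappedOnPpu && !p.mappedOnCpu)
            return p.recentlyMigrated ? Action::MigrateToCpuMapBoth   // heuristic (c)
                                      : Action::MigrateToCpu;         // heuristic (a)
        if (who == Faulter::Ppu && p.mappedOnCpu && !p.mappedOnPpu)
            return p.recentlyMigrated ? Action::MapBoth               // heuristic (d)
                                      : Action::MigrateToPpu;         // heuristic (b)
        return Action::None;  // remaining cases handled by other heuristics
    }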
Additionally, any migration heuristic may "round up" to include more pages or a larger page size, for example (a small address-alignment sketch follows this list):
(j) on a CPU 102 access to an unmapped page that is mapped to PPU 202 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from PPU 202, migrate the pages to CPU 102, and map the pages to CPU 102 (in a more detailed example: for a 4 kB faulting page, migrate the aligned 64 kB region that includes the 4 kB faulting page);

(k) on a PPU 202 access to an unmapped page that is mapped to CPU 102 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from CPU 102, migrate the pages to PPU 202, and map the pages to PPU 202 (in a more detailed example: for a 4 kB faulting page, migrate the aligned 64 kB region that includes the 4 kB faulting page);

(l) on a CPU 102 access to an unmapped page that is mapped to PPU 202 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from PPU 202, migrate the pages to CPU 102, map the pages to CPU 102, and treat all the migrated pages as one or more larger pages on CPU 102 (in a more detailed example: for a 4 kB faulting page, migrate the aligned 64 kB region that includes the 4 kB faulting page, and treat the aligned 64 kB region as a 64 kB page);

(m) on a PPU 202 access to an unmapped page that is mapped to CPU 102 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from CPU 102, migrate the pages to PPU 202, map the pages to PPU 202, and treat all the migrated pages as one or more larger pages on PPU 202 (in a more detailed example: for a 4 kB faulting page, migrate the aligned 64 kB region that includes the 4 kB faulting page, and treat the aligned 64 kB region as a 64 kB page);

(n) on an access by a first PPU 202(1) to an unmapped page that is mapped to a second PPU 202(2) and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from the second PPU 202(2), migrate the pages to the first PPU 202(1), and map the pages to the first PPU 202(1); and

(o) on an access by a first PPU 202(1) to an unmapped page that is mapped on a second PPU 202(2) and has been recently migrated, map the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, to the first PPU 202(1) and keep the mapping of the pages on the second PPU 202(2).

In sum, many heuristic rules are possible, and the scope of the present invention is not limited to these examples.
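As a minimal illustration of the "aligned 64 kB region" arithmetic mentioned in the examples above (hypothetical helper names, not taken from the patent), the start of the region that contains a 4 kB faulting page can be computed by masking off the low bits of the faulting address:

    #include <cstdint>

    // Hypothetical: compute the 64 KiB-aligned region that contains a faulting address,
    // so that the whole region (sixteen 4 KiB pages) can be migrated together.
    constexpr std::uint64_t kRegionSize = 64 * 1024;   // 64 KiB

    std::uint64_t regionBase(std::uint64_t faultingAddr)
    {
        return faultingAddr & ~(kRegionSize - 1);      // round down to a 64 KiB boundary
    }

    // Example: a fault at 0x123456 falls in the region [0x120000, 0x130000).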
In some embodiments, PSD entries may include transitional state information to ensure proper synchronization between various requests made by units within CPU 102 and PPU 202. For example, a PSD 210 entry may include transitional state information indicating that a particular page is in the process of being transitioned from CPU-owned to PPU-owned. Various units in CPU 102 and PPU 202, such as CPU fault handler 211 and PPU fault handler 215, upon determining that a page is in such a transitional state, may forego portions of a page fault sequence to avoid steps in a page fault sequence already triggered by a prior virtual memory access to the same virtual memory address. As a specific example, if a page fault results in a page being migrated from system memory 104 to PPU memory 204, a different page fault that would cause the same migration is detected and does not cause another page migration. Further, the various units in CPU 102 and PPU 202 may implement atomic operations for proper ordering of operations on PSD 210. For example, for modifications to PSD 210 entries, CPU fault handler 211 or PPU fault handler 215 may issue an atomic compare-and-swap operation to modify the page state of a particular entry in PSD 210. Consequently, the modification can be done without interference from operations of other units.
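A compact sketch of such an atomic compare-and-swap update of a page-state field follows (the state encoding and names below are hypothetical; the actual PSD layout is not specified here):

    #include <atomic>
    #include <cstdint>

    // Hypothetical PSD entry state word; the encoding below is illustrative only.
    enum : std::uint32_t { kCpuOwned = 0, kTransitioning = 1, kPpuOwned = 2 };

    // Atomically move a PSD entry's state from 'expected' to 'desired'. Returns false if
    // another unit changed the entry first, in which case the caller can re-read the PSD
    // and possibly forego part of its page fault sequence.
    bool updatePageState(std::atomic<std::uint32_t>& psdState,
                         std::uint32_t expected, std::uint32_t desired)
    {
        return psdState.compare_exchange_strong(expected, desired);
    }

    // Usage sketch: claim the CPU-owned -> transitioning step before migrating a page.
    // if (updatePageState(entry, kCpuOwned, kTransitioning)) { /* perform migration */ }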
Multiple PSDs 210 may be stored in system memory 104, one for each virtual memory address space. A memory access request generated by either CPU 102 or PPU 202 may therefore include a virtual memory address and also identify the virtual memory address space associated with that virtual memory address.
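Illustratively (hypothetical types; this is not the patent's data layout), a request that carries both an address-space identifier and a virtual address could be routed to the corresponding PSD as follows:

    #include <cstdint>
    #include <unordered_map>

    struct PageStateDirectory;  // per-address-space "master" page table (opaque here)

    struct MemoryAccessRequest {
        std::uint32_t addressSpaceId;   // identifies the virtual memory address space
        std::uint64_t virtualAddr;      // virtual memory address within that space
    };

    // One PSD per virtual address space, keyed by the address-space identifier.
    PageStateDirectory* lookupPsd(
        const std::unordered_map<std::uint32_t, PageStateDirectory*>& psds,
        const MemoryAccessRequest& req)
    {
        auto it = psds.find(req.addressSpaceId);
        return it == psds.end() ? nullptr : it->second;
    }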
Just as CPU 102 may execute memory access requests that include virtual memory addresses (i.e., instructions that include requests to access data via a virtual memory address), PPU 202 may execute similar types of memory access requests. More specifically, PPU 202 includes a plurality of execution units, such as GPCs and SMs, described above in conjunction with FIG. 1, that are configured to execute multiple threads and thread groups. In operation, those threads may request data from memory (e.g., system memory 104 or PPU memory 204) by specifying virtual memory addresses. Just as with CPU 102 and CPU MMU 209, PPU 202 includes the PPU memory management unit (MMU) 213. PPU MMU 213 receives requests for translation of virtual memory addresses from PPU 202 and attempts to provide a translation from PPU page table 208 for the virtual memory addresses.

Similar to CPU page table 206, PPU page table 208 includes mappings between virtual memory addresses and physical memory addresses. As is also the case with CPU page table 206, for any given virtual address, PPU page table 208 may not include a page table entry that maps the virtual memory address to a physical memory address. As with CPU MMU 209, when PPU MMU 213 requests a translation for a virtual memory address from PPU page table 208 and either no mapping exists in PPU page table 208 or the type of access is not allowed by PPU page table 208, PPU MMU 213 generates a page fault. Subsequently, PPU fault handler 215 triggers a page fault sequence. Again, the different types of page fault sequences implemented in UVM system 200 are described in greater detail below.

During a page fault sequence, CPU 102 or PPU 202 may write commands into command queue 214 for execution by copy engine 212. Such an approach frees up CPU 102 or PPU 202 to execute other tasks while copy engine 212 reads and executes the commands stored in command queue 214, and allows all the commands for a fault sequence to be queued at one time, thereby avoiding the monitoring of progress of the fault sequence. Commands executed by copy engine 212 may include, among other things, deleting, creating, or modifying page table entries in PPU page table 208, reading or writing data from system memory 104, and reading or writing data to PPU memory 204.
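For illustration only (hypothetical command set and names; the patent does not define a specific command format), the commands that a fault handler queues for the copy engine might be modeled like this:

    #include <cstdint>
    #include <queue>

    // Hypothetical commands that a fault handler might enqueue for the copy engine.
    enum class CmdOp { CopyPage, WritePpuPte, InvalidatePpuPte };

    struct CopyEngineCommand {
        CmdOp         op;
        std::uint64_t srcPhysAddr;   // source page (for CopyPage)
        std::uint64_t dstPhysAddr;   // destination page (for CopyPage) or new PTE target
        std::uint64_t virtualAddr;   // virtual address whose PTE is written/invalidated
    };

    // All commands for one fault sequence are queued at once; the copy engine then
    // drains the queue while the CPU or PPU is free to perform other work.
    void enqueueFaultSequence(std::queue<CopyEngineCommand>& commandQueue,
                              std::uint64_t srcPage, std::uint64_t dstPage,
                              std::uint64_t va)
    {
        commandQueue.push({CmdOp::InvalidatePpuPte, 0, 0, va});        // unmap old location
        commandQueue.push({CmdOp::CopyPage, srcPage, dstPage, va});    // migrate the page
        commandQueue.push({CmdOp::WritePpuPte, 0, dstPage, va});       // map new location
    }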
Fault buffer 216 stores fault buffer entries that indicate information related to page faults generated by PPU 202. Fault buffer entries may include, for example, the type of access that was attempted (e.g., read, write, or atomic), the virtual memory address for which an attempted access caused a page fault, the virtual address space, and an indication of the unit or thread that caused the page fault. In operation, when PPU 202 causes a page fault, PPU 202 may write a fault buffer entry into fault buffer 216 to inform PPU fault handler 215 about the faulting page and the type of access that caused the fault. PPU fault handler 215 then performs actions to remedy the page fault. Because PPU 202 executes many threads, fault buffer 216 can store multiple faults, where each thread can cause one or more faults due to the pipelined nature of the memory accesses of PPU 202.
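A minimal layout for such an entry could look like the following (hypothetical field names and widths, for illustration only; the patent does not specify a bit-level format):

    #include <cstdint>

    // Hypothetical fault buffer entry capturing the fields enumerated above.
    enum class AccessType : std::uint8_t { Read, Write, Atomic };

    struct FaultBufferEntry {
        AccessType    accessType;       // type of access that faulted
        std::uint32_t addressSpaceId;   // virtual address space of the access
        std::uint64_t virtualAddr;      // faulting virtual memory address
        std::uint32_t faultingUnitId;   // e.g., which SM issued the transaction
        std::uint32_t faultingThreadId; // thread within that unit
    };

    // The PPU appends one such entry to fault buffer 216 for each fault it raises,
    // and the PPU fault handler consumes the entries to drive page fault sequences.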
页故障序列page fault sequence
如上所述,响应于收到关于虚拟存储器地址转译的请求,如果CPU页表206不包含与所请求的虚拟存储器地址相关联的映射或者不许可正被请求的访问的类型,则CPU MMU209产生页故障。类似地,响应于收到关于虚拟存储器地址转译的请求,如果PPU页表208不包含与所请求的虚拟存储器地址相关联的映射或者不许可正被请求的访问的类型,则PPU MMU213产生页故障。当CPU MMU209或PPU MMU213产生页故障时,请求了虚拟存储器地址处的数据的线程停滞(stall),且“本地故障处理程序”——用于CPU102的CPU故障处理程序211或用于PPU202的PPU故障处理程序215——试图通过执行“页故障序列”来修复页故障。如上面所指出的,页故障序列包含使得出故障的单元(即,引发了该页面故障的单元——CPU102或PPU202任一者)能够访问与虚拟存储器地址相关联的数据的一系列操作。在页故障序列结束之后,经由虚拟存储器地址请求了数据的线程继续执行。在一些实施例中,通过允许故障恢复逻辑跟踪与出故障的指令相反的出故障的存储器访问,故障恢复得以简化。As described above, in response to receiving a request for translation of a virtual memory address, CPU MMU 209 generates a page if CPU page table 206 does not contain a mapping associated with the requested virtual memory address or does not grant the type of access being requested. Fault. Similarly, in response to receiving a request for translation of a virtual memory address, PPU MMU 213 generates a page fault if PPU page table 208 does not contain a mapping associated with the requested virtual memory address or does not grant the type of access being requested . When the CPU MMU 209 or the PPU MMU 213 generates a page fault, the thread that requested the data at the virtual memory address stalls, and the "local fault handler"—the CPU fault handler 211 for the CPU 102 or the PPU for the PPU 202 Fault Handler 215 - Attempts to repair page faults by performing a "page fault sequence". As noted above, a page fault sequence includes a series of operations that enable the faulted unit (ie, the unit that caused the page fault—either CPU 102 or PPU 202 ) to access data associated with a virtual memory address. After the page fault sequence ends, the thread that requested the data via the virtual memory address continues to execute. In some embodiments, fault recovery is simplified by allowing the fault recovery logic to track the failing memory access as opposed to the failing instruction.
如果存在任何与页故障相关联的存储器页不得不经历的所有权状态的变化或访问许可的变化,则在页故障序列过程中所执行的操作取决于这些变化。从当前的所有权状态到新的所有权状态的过渡或者访问许可的变化可以是页故障序列的一部分。在一些实例中,将与页故障相关联的存储器页从系统存储器104迁移到PPU存储器204也是页故障序列的一部分。在其他实例中,将与页故障相关联的存储器页从PPU存储器204迁移到系统存储器104也是页故障序列的一部分。本文中较为充分描述的各种启发法可用于配置UVM系统200以改变存储器页所有权状态或者以按照各种操作条件和图案的集合迁移存储器页。下面将更详细描述的是关于下列四种存储器页所有权状态过渡的页故障序列:CPU所有到CPU共享、CPU所有到PPU所有、PPU所有到CPU所有以及PPU所有到CPU共享。If there are any changes in ownership status or changes in access permissions that a memory page associated with a page fault has to undergo, the operations performed during the page fault sequence depend on these changes. A transition from a current ownership state to a new ownership state or a change in access permissions may be part of a page fault sequence. In some examples, migrating memory pages associated with page faults from system memory 104 to PPU memory 204 is also part of the page fault sequence. In other examples, migrating memory pages associated with page faults from PPU memory 204 to system memory 104 is also part of the page fault sequence. Various heuristics described more fully herein may be used to configure UVM system 200 to change memory page ownership states or to migrate memory pages according to various sets of operating conditions and patterns. Described in more detail below is the page fault sequence for the following four memory page ownership state transitions: CPU owned to CPU shared, CPU owned to PPU owned, PPU owned to CPU owned, and PPU owned to CPU shared.
A fault generated by PPU 202 may initiate a transition from CPU-owned to CPU-shared. Prior to the transition, a thread executing in PPU 202 attempts to access data at a virtual memory address that is not mapped in PPU page table 208. The access attempt causes a PPU-based page fault, which in turn causes a fault buffer entry to be written to fault buffer 216. In response, PPU fault handler 215 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading PSD 210, PPU fault handler 215 determines that the current ownership state of the memory page associated with the virtual memory address is CPU-owned. Based on the current ownership state and other factors, such as the usage characteristics of the memory page or the type of memory access, PPU fault handler 215 determines that the new ownership state of the page should be CPU-shared.
To change the ownership state, PPU fault handler 215 writes a new entry in PPU page table 208 that corresponds to the virtual memory address and associates the virtual memory address with the memory page identified via the PSD 210 entry. PPU fault handler 215 also modifies the PSD 210 entry for the memory page to indicate that the ownership state is CPU-shared. In some embodiments, an entry in a translation look-aside buffer (TLB) in PPU 202 is invalidated to account for the case in which the translation to the now-invalid page is cached. At this point, the page fault sequence is complete. The ownership state of the memory page is CPU-shared, meaning that the memory page is accessible to both CPU 102 and PPU 202. Both CPU page table 206 and PPU page table 208 include entries that associate the virtual memory address with the memory page.
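A minimal software model of this particular sequence, assuming simplified page table and TLB containers, might look as follows; the function and type names are placeholders chosen for illustration.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

enum class Ownership { CpuOwned, CpuShared, PpuOwned };

struct PsdEntry {                        // simplified page state directory entry
    std::uint64_t pagePhysicalAddress;   // where the page currently resides
    Ownership ownership;
};

using PageTable = std::unordered_map<std::uint64_t, std::uint64_t>;  // VA -> PA
using Tlb = std::unordered_set<std::uint64_t>;                       // cached VAs

// Hypothetical CPU-owned -> CPU-shared sequence: the page stays in system
// memory and is simply mapped into the PPU page table as well.
void cpuOwnedToCpuShared(std::uint64_t va, PsdEntry& psd,
                         PageTable& ppuPageTable, Tlb& ppuTlb) {
    ppuPageTable[va] = psd.pagePhysicalAddress;  // new PPU page table entry
    psd.ownership = Ownership::CpuShared;        // PSD now reads CPU-shared
    ppuTlb.erase(va);                            // drop any stale cached translation
}
```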
A fault generated by PPU 202 may also initiate a transition from CPU-owned to PPU-owned. Prior to the transition, an operation executing in PPU 202 attempts to access data at a virtual memory address that is not mapped in PPU page table 208. The memory access attempt causes a PPU-based page fault, which in turn causes a fault buffer entry to be written to fault buffer 216. In response, PPU fault handler 215 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading PSD 210, PPU fault handler 215 determines that the current ownership state of the memory page associated with the virtual memory address is CPU-owned. Based on the current ownership state and other factors, such as the usage characteristics of the page or the type of memory access, PPU fault handler 215 determines that the new ownership state of the page should be PPU-owned.
PPU 202 writes a fault buffer entry into fault buffer 216 that indicates that PPU 202 generated the page fault and indicates the virtual memory address associated with the page fault. PPU fault handler 215, executing on CPU 102, reads the fault buffer entry, and in response CPU 102 removes the mapping in CPU page table 206 associated with the virtual memory address that caused the page fault. CPU 102 may flush caches before and/or after removing the mapping. CPU 102 also writes a command into command queue 214 instructing PPU 202 to copy the page from system memory 104 into PPU memory 204. Copy engine 212 in PPU 202 reads the command in command queue 214 and copies the page from system memory 104 into PPU memory 204. PPU 202 writes a page table entry into PPU page table 208 that corresponds to the virtual memory address and associates the virtual memory address with the newly copied memory page in PPU memory 204. The write to PPU page table 208 may be accomplished via PPU 202. Alternatively, CPU 102 may update PPU page table 208. PPU fault handler 215 also modifies the PSD 210 entry for the memory page to indicate that the ownership state is PPU-owned. In some embodiments, entries in TLBs in PPU 202 or CPU 102 may be invalidated to account for the case in which translations are cached. At this point, the page fault sequence is complete. The ownership state of the memory page is PPU-owned, meaning that the memory page is accessible only to PPU 202. Only PPU page table 208 includes an entry that associates the virtual memory address with the memory page.
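The same style of model can be sketched for this CPU-owned to PPU-owned sequence, with the copy engine reduced to a callback that stands in for the command written to the command queue. All names are illustrative assumptions rather than functions defined by the embodiment.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

enum class Ownership { CpuOwned, CpuShared, PpuOwned };

struct PsdEntry { std::uint64_t pagePhysicalAddress; Ownership ownership; };
using PageTable = std::unordered_map<std::uint64_t, std::uint64_t>;  // VA -> PA

// Hypothetical CPU-owned -> PPU-owned sequence: unmap on the CPU side, migrate
// the page into PPU memory, then map the migrated copy for the PPU only.
void cpuOwnedToPpuOwned(std::uint64_t va, PsdEntry& psd,
                        PageTable& cpuPageTable, PageTable& ppuPageTable,
                        const std::function<std::uint64_t(std::uint64_t)>& copyToPpuMemory) {
    cpuPageTable.erase(va);               // remove the CPU-side mapping
                                          // (caches may be flushed around this)

    // Stand-in for the command queued for the copy engine: migrate the page
    // and return its new physical address in PPU memory.
    std::uint64_t ppuPa = copyToPpuMemory(psd.pagePhysicalAddress);

    ppuPageTable[va] = ppuPa;             // map the newly copied page for the PPU
    psd.pagePhysicalAddress = ppuPa;
    psd.ownership = Ownership::PpuOwned;  // only the PPU may access the page now
}
```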
A fault generated by CPU 102 may initiate a transition from PPU-owned to CPU-owned. Prior to the transition, an operation executing in CPU 102 attempts to access data at a virtual memory address that is not mapped in CPU page table 206, which causes a CPU-based page fault. CPU fault handler 211 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading PSD 210, CPU fault handler 211 determines that the current ownership state of the memory page associated with the virtual memory address is PPU-owned. Based on the current ownership state and other factors, such as the usage characteristics of the page or the type of access, CPU fault handler 211 determines that the new ownership state of the page is CPU-owned.
CPU fault handler 211 changes the ownership state associated with the memory page to CPU-owned. CPU fault handler 211 writes a command into command queue 214 to cause copy engine 212 to remove, from PPU page table 208, the entry that associates the virtual memory address with the memory page. Various TLB entries may be invalidated. CPU fault handler 211 also copies the memory page from PPU memory 204 into system memory 104, which may be accomplished via command queue 214 and copy engine 212. CPU fault handler 211 writes a page table entry into CPU page table 206 that associates the virtual memory address with the memory page copied into system memory 104. CPU fault handler 211 also updates PSD 210 to associate the virtual memory address with the newly copied memory page. At this point, the page fault sequence is complete. The ownership state of the memory page is CPU-owned, meaning that the memory page is accessible only to CPU 102. Only CPU page table 206 includes an entry that associates the virtual memory address with the memory page.
A fault generated by CPU 102 may also initiate a transition from PPU-owned to CPU-shared. Prior to the transition, an operation executing in CPU 102 attempts to access data at a virtual memory address that is not mapped in CPU page table 206, which causes a CPU-based page fault. CPU fault handler 211 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading PSD 210, CPU fault handler 211 determines that the current ownership state of the memory page associated with the virtual memory address is PPU-owned. Based on the current ownership state and other factors, such as the usage characteristics of the page, CPU fault handler 211 determines that the new ownership state of the page is CPU-shared.
CPU fault handler 211 changes the ownership state associated with the memory page to CPU-shared. CPU fault handler 211 writes a command into command queue 214 to cause copy engine 212 to remove, from PPU page table 208, the entry that associates the virtual memory address with the memory page. Various TLB entries may be invalidated. CPU fault handler 211 also copies the memory page from PPU memory 204 into system memory 104. The copy operation may be accomplished via command queue 214 and copy engine 212. CPU fault handler 211 then writes a command into command queue 214 to cause copy engine 212 to change the entry in PPU page table 208 such that the virtual memory address is associated with the memory page in system memory 104. CPU fault handler 211 writes a page table entry into CPU page table 206 to associate the virtual memory address with the memory page in system memory 104. CPU fault handler 211 also updates PSD 210 to associate the virtual memory address with the memory page in system memory 104. At this point, the page fault sequence is complete. The ownership state of the page is CPU-shared, and the memory page has been copied into system memory 104. Because CPU page table 206 includes an entry that associates the virtual memory address with the memory page in system memory 104, the page is accessible to CPU 102. Because PPU page table 208 includes an entry that associates the virtual memory address with the memory page in system memory 104, the page is also accessible to PPU 202.
Detailed example of a page fault sequence
In this context, a detailed description of the page fault sequence executed by PPU fault handler 215 for a CPU-owned to CPU-shared transition is now provided to show how atomic operations and transition states may be used to manage the sequence more effectively. The page fault sequence is triggered by a PPU 202 thread attempting to access a virtual address for which no mapping exists in PPU page table 208. When the thread attempts to access data via the virtual memory address, PPU 202 (specifically, a user-level thread) requests a translation from PPU page table 208. A PPU page fault occurs in response because PPU page table 208 does not include a mapping associated with the requested virtual memory address.
After the page fault occurs, the thread is trapped and stalls, and PPU fault handler 215 executes the page fault sequence. PPU fault handler 215 reads PSD 210 to determine which memory page is associated with the virtual memory address and to determine the state of the virtual memory address. PPU fault handler 215 determines, from PSD 210, that the ownership state of the memory page is CPU-owned. Consequently, the data requested by PPU 202 is inaccessible to PPU 202 via the virtual memory address. The state information for the memory page also indicates that the requested data cannot be migrated to PPU memory 204.
Based on the state information obtained from PSD 210, PPU fault handler 215 determines that the new state of the memory page should be CPU-shared. PPU fault handler 215 changes the state to "transitioning to CPU-shared." This state indicates that the page is currently in the process of transitioning to CPU-shared. When PPU fault handler 215 is running on a microcontroller in the memory management unit, the two processors update PSD 210 asynchronously, using an atomic compare-and-swap ("CAS") operation on PSD 210 to change the state to "transitioning to GPU-visible" (CPU-shared).
PPU 202 updates PPU page table 208 to associate the virtual memory address with the memory page. PPU 202 also invalidates the corresponding TLB cache entries. Next, PPU 202 performs another atomic compare-and-swap operation on PSD 210 to change the ownership state associated with the memory page to CPU-shared. Finally, the page fault sequence terminates, and the thread that requested the data via the virtual memory address resumes execution.
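The two compare-and-swap steps in this sequence can be illustrated with std::atomic; the state encoding and function names below are assumptions made for the sketch, not values defined by the embodiment.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative encoding of the ownership/transition state word in a PSD entry.
enum PsdState : std::uint32_t {
    kCpuOwned              = 0,
    kTransitioningToShared = 1,  // "transitioning to CPU-shared"
    kCpuShared             = 2,
};

// First CAS: claim the transition. It succeeds only if the page is still
// CPU-owned, so two agents updating the PSD asynchronously cannot both start
// the same transition.
bool beginTransitionToShared(std::atomic<std::uint32_t>& psdState) {
    std::uint32_t expected = kCpuOwned;
    return psdState.compare_exchange_strong(expected, kTransitioningToShared);
}

// Second CAS: publish the final CPU-shared state once the PPU page table has
// been updated and stale TLB entries have been invalidated.
bool finishTransitionToShared(std::atomic<std::uint32_t>& psdState) {
    std::uint32_t expected = kTransitioningToShared;
    return psdState.compare_exchange_strong(expected, kCpuShared);
}
```

In this kind of model, the intermediate "transitioning" value lets another agent that reads the PSD concurrently see that a transition is already in progress and react accordingly.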
UVM system architecture variations
Various modifications to unified virtual memory system 200 are possible. For example, in some embodiments, after writing a fault buffer entry into fault buffer 216, PPU 202 may trigger a CPU interrupt to cause CPU 102 to read the fault buffer entry in fault buffer 216 and perform whatever operations are appropriate in response to that entry. In other embodiments, CPU 102 may periodically poll fault buffer 216. If CPU 102 finds a fault buffer entry in fault buffer 216, CPU 102 executes a series of operations in response to that entry.
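The polling alternative mentioned here could be modeled with a simple loop on the CPU side. The buffer type and handler callback are placeholders, and the sleep interval is arbitrary; it only illustrates that polling happens periodically rather than on an interrupt.

```cpp
#include <chrono>
#include <deque>
#include <thread>

struct FaultBufferEntry { /* fields as sketched earlier */ };

// Hypothetical CPU-side polling loop: periodically inspect the fault buffer
// and service any entries found, instead of waiting for a PPU interrupt.
// (A real implementation would synchronize access with the producer side.)
void pollFaultBuffer(std::deque<FaultBufferEntry>& faultBuffer,
                     const bool& keepRunning,
                     void (*handleFault)(const FaultBufferEntry&)) {
    using namespace std::chrono_literals;
    while (keepRunning) {
        while (!faultBuffer.empty()) {
            handleFault(faultBuffer.front());  // run the page fault sequence
            faultBuffer.pop_front();
        }
        std::this_thread::sleep_for(1ms);      // arbitrary polling interval
    }
}
```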
In some embodiments, system memory 104, rather than PPU memory 204, stores PPU page table 208. In other embodiments, a single-level or multi-level cache hierarchy, such as a single-level or multi-level translation look-aside buffer (TLB) hierarchy (not shown), may be implemented to cache virtual address translations for CPU page table 206 or PPU page table 208.
In yet other embodiments, PPU 202 may take one or more actions in the event that a thread executing in PPU 202 causes a PPU fault (a "faulting thread"). Those actions include stalling the entire PPU 202, stalling the SM that is executing the faulting thread, stalling PPU MMU 213, stalling only the faulting thread, or stalling one or more levels of TLBs. In some embodiments, after a PPU page fault occurs and unified virtual memory system 200 has executed the page fault sequence, the faulting thread resumes execution and retries the memory access request that caused the page fault. In some embodiments, stalling at the TLB is performed in such a way that it appears to the faulting SM or faulting thread as a long-latency memory access, thereby not requiring the SM to take any special action in response to the fault.
Finally, in other alternative embodiments, UVM driver 101 may include instructions that cause CPU 102 to execute one or more operations for managing UVM system 200 and remedying page faults, such as accessing CPU page table 206, PSD 210, and/or fault buffer 216. In other embodiments, an operating system kernel (not shown) may be configured to manage UVM system 200 and remedy page faults by accessing CPU page table 206, PSD 210, and/or fault buffer 216. In yet other embodiments, the operating system kernel may operate in conjunction with UVM driver 101 to manage UVM system 200 and remedy page faults by accessing CPU page table 206, PSD 210, and/or fault buffer 216.
Stalling and replaying faults
As described above, UVM system 200 typically relies, at least in part, on CPU 102 to remedy memory access faults (i.e., page faults) generated by PPU 202. When a memory access fault occurs, a conventional PPU cancels the faulting memory transaction together with all memory transactions within the PPU that began executing after the faulting memory transaction. The SMs in such a conventional PPU do not resume issuing memory transactions until the memory access fault is resolved. In contrast, to reduce the overall performance degradation associated with a faulting memory transaction, PPU 202 is configured to stall only the SM that issued the faulting memory transaction. While that SM is stalled, PPU 202 executes the "in-flight" memory transactions that the SM issued before the faulting memory transaction. Further, the SM continues to replay the faulting memory transaction, as well as any in-flight memory transactions that did not complete successfully, until all of those memory transactions succeed. Advantageously, SMs that have not caused any unresolved memory access faults continue to issue memory transactions, and PPU 202 continues to execute those memory transactions while UVM system 200 remedies the outstanding memory access faults.
In general, the techniques described herein are illustrative rather than restrictive and may be modified to reflect various implementations without departing from the broader spirit and scope of the invention. For example, an SM is only one of many units that may issue memory transactions. Embodiments of the present invention may include any number and type of execution units in place of, or in combination with, SMs. Further, selectively stalling only particular units within the PPU and remedying memory transactions while memory access faults are resolved may be implemented in any technically feasible manner. For example, the PPU may replay faulting memory transactions issued by certain "replayable" units and discard faulting memory transactions issued by other units.
The selective stalling and replay functionality described herein may be implemented in PPU MMU 213, in a different memory management unit, in a dedicated hardware unit, or in software executing on a programmable hardware unit, in any combination. Further, PPU 202 may be included in any type of computer system. For example, PPU 202 may be included in a computer system that does not implement a unified virtual memory architecture.
PPU with replay units
Figure 3 is a block diagram illustrating the unified virtual memory (UVM) system 200 configured with replay units 350, according to one embodiment of the present invention. PPU 202 includes any number N of streaming multiprocessors (SMs) 310 and N replay units 350, one replay unit 350 per SM 310. For example, if PPU 202 were to include 32 SMs 310(0:31), then PPU 202 would include 32 replay units 350(0:31). Each replay unit 350 enables PPU 202 to stall the corresponding SM 310 while replaying selected memory transactions, without delaying the other SMs 310.
Figure 4 is a conceptual diagram illustrating the replay unit 350(0) of Figure 3, according to one embodiment of the present invention. As shown, replay unit 350(0) includes, without limitation, a transaction multiplexer (transaction mux) 420, a micro translation look-aside buffer (uTLB) 430, an in-flight buffer 440, a fault detector 450, and a replay buffer 460.
In general, the threads executing within SM 310(0) each generate a stream of virtual memory transactions from SM 310(0). After SM 310(0) issues a particular virtual memory transaction, the virtual memory transaction from SM 310(0) passes through transaction mux 420 before reaching uTLB 430 and in-flight buffer 440.
uTLB 430 performs one or more lookup operations to map the virtual memory address of a virtual memory transaction from SM 310(0) to a physical memory address in PPU memory 204. Note that uTLB 430 is configured to cache mappings, which may further be represented in a hierarchy of TLB caches. A page table or global TLB data structure (not shown) is configured to store all mappings across all virtual address spaces associated with a processor complex that includes one or more PPUs 202 and one or more CPUs 102.
Persons skilled in the art will recognize that the lookup operations performed by uTLB 430 can be time-consuming in the event of a cache miss. Accordingly, in-flight buffer 440 queues the virtual memory transactions from SM 310(0) in first-in, first-out order, thereby preserving the context of the virtual memory transactions from SM 310(0) relative to the lookup operations of uTLB 430.
If uTLB 430 successfully processes a virtual memory transaction from SM 310(0), fault detector 450 routes the corresponding physical memory transaction to PPU memory 204. In the physical memory transaction to PPU memory 204, the virtual address included in the virtual memory transaction from SM 310(0) is replaced with the physical address obtained from the uTLB 430 lookup operation.
Conversely, if uTLB 430 is unable to map the virtual address specified by the virtual memory transaction from SM 310(0), or if the virtual address requires a change in the disposition of the targeted page of memory, uTLB 430 generates a memory access fault. Fault detector 450 processes the memory access fault by sending a fault signal to CPU 102 and temporarily preventing SM 310(0) from issuing new virtual memory transactions. Advantageously, fault detector 450 does not stop any other SM 310 included in PPU 202 from issuing new virtual memory transactions.
As part of processing the virtual memory fault, fault detector 450 causes a fault buffer entry to be written to fault buffer 216 of Figure 2. Fault detector 450 also performs a write operation that stores the faulting virtual memory transaction from SM 310(0) in replay buffer 460. In addition, fault detector 450 allows any virtual memory transactions from SM 310(0) that are queued in in-flight buffer 440 to finish executing. If any of those virtual memory transactions also fault, fault detector 450 performs write operations that store the additional faulting virtual memory transactions in replay buffer 460. Optionally, but preferably, fault detector 450 also causes fault buffer entries corresponding to the additional faulting virtual memory transactions to be written to fault buffer 216.
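A compact behavioral model of this fault handling path is sketched below, with the uTLB, fault buffer, and memory path reduced to callbacks. The structure and function names are assumptions introduced for the sketch and do not describe the actual hardware interfaces.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

struct VirtualMemoryTransaction { std::uint64_t virtualAddress; /* ... */ };

// Minimal model of the fault detector's response to a faulting transaction
// from its SM: notify the CPU, stall only this SM, save the faulting
// transaction for replay, and let in-flight transactions finish.
struct FaultDetectorModel {
    bool smStalled = false;
    std::deque<VirtualMemoryTransaction> inFlightBuffer;
    std::vector<VirtualMemoryTransaction> replayBuffer;

    void onFault(const VirtualMemoryTransaction& faulting,
                 std::optional<std::uint64_t> (*translate)(std::uint64_t),
                 void (*writeFaultBufferEntry)(std::uint64_t),
                 void (*routeToPpuMemory)(std::uint64_t)) {
        writeFaultBufferEntry(faulting.virtualAddress);  // notify the fault handler
        smStalled = true;                                // stall only this SM
        replayBuffer.push_back(faulting);                // keep it for replay

        // Drain the in-flight buffer: transactions that translate are routed
        // to memory; transactions that also fault join the replay buffer.
        while (!inFlightBuffer.empty()) {
            VirtualMemoryTransaction t = inFlightBuffer.front();
            inFlightBuffer.pop_front();
            if (auto pa = translate(t.virtualAddress)) {
                routeToPpuMemory(*pa);
            } else {
                writeFaultBufferEntry(t.virtualAddress);
                replayBuffer.push_back(t);
            }
        }
    }
};
```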
PPU fault handler 215 then executes a page fault sequence designed to resolve the memory access faults. Upon resolving one or more memory access faults, CPU 102 sends a replay signal to replay unit 350(0). CPU 102 may generate the replay signal at any time and in any technically feasible manner. Preferably, PPU fault handler 215, included in CPU 102, generates the replay signal, typically via command queue 214. In this fashion, fault resolution procedures, which carry a high overhead cost, may be performed together, improving overall performance. Generating the replay signal via command queue 214 also allows the replay operations to be synchronized with the commands that resolve the faults, pipelining the fault resolution operations and the replay operations and thereby allowing PPU fault handler 215 to operate in a "fire-and-forget" fashion. In alternative embodiments, CPU 102 or PPU 202 may generate the replay signal in any technically feasible manner. For example, PPU 202 may generate the replay signal at predetermined time intervals, resulting in periodic replays at a fixed frequency.
Upon receiving the replay signal, replay unit 350(0) invalidates uTLB 430, and transaction mux 420 routes the faulting virtual memory transactions in replay buffer 460 to uTLB 430. For each of these faulting virtual memory transactions, uTLB 430 attempts to map the virtual memory address to an accessible physical memory address. If uTLB 430 successfully maps a virtual memory transaction included in replay buffer 460, fault detector 450 routes the corresponding physical memory transaction to PPU memory 204. If, however, uTLB 430 is unable to map a particular virtual memory transaction included in replay buffer 460, fault detector 450 performs a write operation that re-requests the virtual memory transaction in replay buffer 460. Note that, because CPU 102 successfully remedies the cause of each particular page fault, the corresponding virtual memory transaction eventually succeeds, a physical memory transaction is generated, and the virtual memory transaction is removed from replay buffer 460.
Replay unit 350(0) continues to re-execute the memory transactions included in replay buffer 460 until replay buffer 460 is empty. After replay unit 350(0) determines that replay buffer 460 is empty, replay unit 350(0) causes SM 310(0) to resume issuing virtual memory transactions from SM 310(0). After SM 310(0) resumes issuing virtual memory transactions, transaction mux 420 routes the virtual memory transactions to uTLB 430 for processing.
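The replay side of the model can be sketched as a single replay pass driven by one replay signal; repeated signals drive successive passes until the replay buffer drains, at which point the SM is released. As before, the callbacks and names are illustrative assumptions.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct VirtualMemoryTransaction { std::uint64_t virtualAddress; /* ... */ };

// Minimal model of one replay pass: invalidate the uTLB, retry every saved
// transaction, re-queue the ones that still fault, and unstall the SM once
// the replay buffer is empty.
void onReplaySignal(std::vector<VirtualMemoryTransaction>& replayBuffer,
                    bool& smStalled,
                    void (*invalidateUtlb)(),
                    std::optional<std::uint64_t> (*translate)(std::uint64_t),
                    void (*routeToPpuMemory)(std::uint64_t)) {
    invalidateUtlb();                        // drop stale translations first
    std::vector<VirtualMemoryTransaction> stillFaulting;
    for (const VirtualMemoryTransaction& t : replayBuffer) {
        if (auto pa = translate(t.virtualAddress)) {
            routeToPpuMemory(*pa);           // success: transaction completes
        } else {
            stillFaulting.push_back(t);      // failure: wait for the next replay
        }
    }
    replayBuffer.swap(stillFaulting);
    if (replayBuffer.empty()) {
        smStalled = false;                   // SM resumes issuing transactions
    }
}
```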
In alternative embodiments, virtual memory transactions may be routed to any physical memory accessible to PPU 202, rather than to PPU memory 204. For example, virtual memory transactions may be routed to shared pages included in system memory 104.
Figure 5 is a flow diagram of method steps for managing memory transactions issued by a streaming multiprocessor (SM), according to one embodiment of the present invention. Although the method steps are described in conjunction with Figures 1-4, persons skilled in the art will understand that any system configured to implement the method steps, in any order, falls within the scope of the present invention.
As shown, a method 500 begins at step 502, where replay unit 350(0) receives a virtual memory transaction from SM 310(0). In response to receiving the virtual memory transaction from SM 310(0), transaction mux 420 included in replay unit 350(0) routes the virtual memory transaction to uTLB 430 and queues the virtual memory transaction in in-flight buffer 440. If, at step 504, uTLB 430 successfully processes the virtual memory transaction, then the method 500 proceeds to step 506. At step 506, fault detector 450 included in replay unit 350(0) routes the corresponding physical memory transaction to PPU memory 204, and the method 500 returns to step 502. Replay unit 350(0) cycles through steps 502-506, receiving and processing virtual memory transactions from SM 310(0), until uTLB 430 is unable to successfully process a virtual memory transaction from SM 310(0).
If, at step 504, uTLB 430 does not successfully map the virtual memory transaction to a physical address, then the method 500 proceeds to step 508. At step 508, fault detector 450 included in replay unit 350(0) sends a fault signal to CPU 102, stalls SM 310(0), and adds the faulting virtual memory transaction to replay buffer 460. At step 510, fault detector 450 processes any virtual memory transactions queued in in-flight buffer 440. Such memory transactions correspond to memory transactions that were issued by SM 310(0) and began executing before the faulting virtual memory transaction. If any of these virtual memory transactions from SM 310(0) also fault, fault detector 450 performs one or more write operations that store the additional faulting virtual memory transactions in replay buffer 460.
At step 512, replay unit 350(0) waits for CPU 102 to signal, via the replay signal, that one or more faults have been resolved. Upon receiving the replay signal, replay unit 350(0) invalidates uTLB 430 and re-executes the virtual memory transactions stored in replay buffer 460. If uTLB 430 successfully maps a virtual memory transaction included in replay buffer 460, replay unit 350(0) routes the corresponding physical memory transaction to PPU memory 204. If, however, uTLB 430 is unable to map a particular virtual memory transaction included in replay buffer 460, fault detector 450 performs a write operation that re-queues the virtual memory transaction in replay buffer 460. If, at step 514, replay unit 350(0) determines that replay buffer 460 is not empty, then the method 500 returns to step 512. Replay unit 350(0) cycles through steps 512-514, re-executing the virtual memory transactions included in replay buffer 460, until replay unit 350(0) determines that replay buffer 460 is empty.
If, at step 514, replay unit 350(0) determines that replay buffer 460 is empty, then the method 500 proceeds to step 516. At step 516, replay unit 350(0) causes SM 310(0) to resume issuing virtual memory transactions from SM 310(0), and the method 500 returns to step 502. Replay unit 350(0) continues to cycle through steps 502-516, receiving and processing virtual memory transactions from SM 310(0).
In sum, a parallel processing unit (PPU) implements a fault handling technique that enables certain streaming multiprocessors (SMs) to continue executing threads while other SMs temporarily stop executing threads. In operation, if a memory access fault attributable to a thread executing on a particular SM occurs, the replay unit corresponding to that SM stalls the particular SM while the computer system resolves the fault. Notably, the replay unit causes the corresponding SM to stop generating additional memory transactions until the faulting memory transaction is resolved. Further, the replay unit queues, in a replay buffer, any in-flight memory transactions issued by the corresponding SM before the fault. Once the fault is resolved, the replay unit causes the memory transactions stored in the replay buffer to be re-executed. After all of the memory transactions stored in the replay unit have executed successfully, the replay unit enables the corresponding SM to resume generating additional memory transactions.
Advantageously, stalling the affected SM while allowing unaffected SMs to continue executing when a memory access fault occurs reduces the execution penalty associated with memory access faults. Notably, because the unaffected SMs continue executing and the faulting memory transactions are stored and replayed, instructions do not need to be cancelled. Moreover, because the computer system performs the fault resolution procedures for the in-flight faulting memory transactions together, overall system performance is improved relative to resolving each fault individually. By contrast, as soon as a memory access fault occurs, a conventional PPU stalls all of the SMs included in the PPU and cancels all subsequent memory transactions generated by the SMs. The SMs included in such a PPU do not resume issuing memory transactions until the memory access fault is resolved. Consequently, compared with a conventional PPU, the performance degradation associated with memory access faults is reduced in a PPU that implements the selective stalling and replay techniques.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer, such as compact disc read-only memory (CD-ROM) disks readable by a CD-ROM drive, flash memory, read-only memory (ROM) chips, or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive, or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention.
The invention has been described above with reference to specific embodiments. Persons of ordinary skill in the art will understand, however, that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Therefore, the scope of the present invention is determined by the claims that follow.