CN117521167B - High-performance heterogeneous secure memory - Google Patents
- Publication number: CN117521167B (application CN202311527819.9A)
- Authority
- CN
- China
- Prior art keywords
- memory
- gpu
- mode
- cpu
- data
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/70—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
- G06F21/78—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data
- G06F21/79—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure storage of data in semiconductor storage media, e.g. directly-addressable memories
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
- G06F13/28—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/70—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
- G06F21/71—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information
- G06F21/72—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information in cryptographic circuits
Abstract
The invention provides a high-performance heterogeneous secure memory comprising a HSMEM transmission engine, a multi-mode protection engine, and an interface. The HSMEM transmission engine establishes a HSMEM transmission channel that realizes high-speed transmission between a CPU and a GPU; the channel is located inside the chip and controls DMA requests to the secure memory. Each memory protection scheme supported by the multi-mode protection engine is called a mode, and a memory block can be protected by different modes. The engine comprises a mode selection module for selecting the mode of a memory block, a data encryption module for data encryption and decryption, and an integrity tree module for maintaining the integrity trees of the different modes. Through the interface, a developer uses a preset instruction to explicitly change the mode of a memory block. The invention realizes high-performance data transmission between heterogeneous secure memories; because less time is spent on CPU-GPU transmission, the overall performance of the application is higher.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a high-performance heterogeneous secure memory.
Background
Through recent development, artificial intelligence technology has been widely applied in many fields, such as autonomous driving, face recognition, and medical imaging. Massive training data and high-value artificial intelligence models have made data privacy a focus of attention. Because of their extensive resource requirements, modern artificial intelligence applications are typically deployed in high-performance data centers; ChatGPT, for example, was trained using more than 10,000 NVIDIA A100 GPUs. The complexity of the software stack and its potential vulnerabilities (e.g., more than 360 million lines of code and more than 3,000 vulnerabilities in the Linux kernel), together with potentially malicious employees, threaten the data privacy of artificial intelligence applications. On the one hand, a platform administrator of the data center may behave maliciously; on the other hand, an existing platform usually has a very deep software stack, which brings a large number of software vulnerabilities and attack surfaces, and once the cloud platform software system is compromised, the security of a machine learning application cannot be guaranteed. Even if training is performed on machines inside the enterprise, the training process may be corrupted by malicious insiders or attackers. Therefore, how to ensure the safety of machine learning applications and protect them from a compromised operating system has become a problem of wide concern in academia and industry.
Trusted Execution Environments (TEEs) are widely used to protect applications from untrusted system software: a TEE ensures that no component outside it can access or modify the data and control flow of the components inside it. There are many ways to build TEEs today, and almost all CPU manufacturers, including Intel, AMD, and ARM, have introduced hardware TEE extensions. Modern artificial intelligence applications, however, rely heavily on heterogeneous accelerators, such as GPUs and NPUs, which are widely used to accelerate their computation. These accelerators are not directly protected by a CPU-side TEE, so the protection a CPU-side TEE offers to artificial intelligence applications is very limited. Therefore, to protect the data of artificial intelligence applications, a trusted execution environment needs to be built on the artificial intelligence accelerator itself, so that a malicious operating system can neither access the accelerator's data nor tamper with its computation. The GPU is the most common accelerator for artificial intelligence applications, and building a GPU TEE is an effective way to protect them.
While High Bandwidth Memory (HBM) is considered secure against physical attack, it is expensive, limited in capacity, and difficult to replace in the event of a failure. Thus, there will still be accelerators using GDDR memory in the future. Even in August 2023, long after GPUs with HBM such as the H100 had come to market, NVIDIA announced the L40S GPU, a data center GPU product that still uses GDDR memory. GDDR is only likely to reach the end of its life once the cost and capacity of future HBM become manageable. In addition, most existing accelerator models, including various GPUs, still use GDDR memory, while HBM cannot solve the RowHammer and RowPress problems; many academic studies therefore build secure memory on accelerators that use traditional DDR or GDDR.
Accelerators, including GPUs, differ from CPUs in terms of secure memory in two respects: 1) the integrity tree structure may differ, and 2) the encryption algorithm may differ. A GPU has multiple memory controllers, and in recently designed GPU secure memory each memory partition has its own secure memory controller. Using multiple memory partitions improves the parallel access capability of the memory. Each memory controller uses the local address within its partition and a counter for AES encryption. In addition, some multi-chip-module GPUs comprise multiple GPU and memory chips connected by inter-chip links whose bandwidth is lower than that of intra-chip links. If a centralized engine were used for all memory partitions, performance would suffer because of the on-chip or inter-chip interconnects between the engine and the memory controllers. Thus, recent GPU secure memory architectures place a protection engine inside each controller, and each memory partition maintains its own integrity tree. In addition, some GPU secure memories use different encryption algorithms, such as AES-XTS. Furthermore, the granularity of memory protection differs between CPU and GPU: GPU cache lines are typically 128 bytes, while CPU cache lines are 64 bytes.
A Trusted Execution Environment (TEE) is used to protect user data and programs and can logically isolate an enclave region from the outside. However, logical isolation alone is not sufficient to defend against physical attacks, so hardware vendors introduce a memory protection engine in the memory controller to protect the data from physical attacks. The strongest protection mechanisms in current products, such as Intel SGX, ensure confidentiality, integrity, and freshness; the architecture is similar to the BMT (Bonsai Merkle Tree). A counter-based integrity tree is used to ensure that the correct counter is used. A memory block is encrypted with a counter value, a key stored on the chip, and the memory address. In addition, to ensure block integrity, a Message Authentication Code (MAC) is calculated and stored. The root of the integrity tree is stored on the chip and can never be tampered with. When a block is loaded onto the chip, the integrity tree is used to obtain the correct counter, and the MAC is fetched to verify the integrity of the block. When a dirty block is evicted from the cache and written back to memory, the counter value is incremented. In addition, there are different memory addressing modes, as shown in fig. 1. Assume there are k memory partitions and the interleaving granularity is g memory blocks. Even for such interleaved memory, it is still guaranteed that any k×g consecutive memory blocks come from different memory partitions. In the basic method of memory protection, the integrity of the counters is protected by the integrity tree, and a counter is then used for the encryption of a memory block, as shown in fig. 3.
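The counter-based protection described above can be sketched in software. The following toy model is not the patented hardware: AES-CTR is replaced by a SHA-256 keystream so the example is stdlib-only, and all function names are illustrative. It shows how a block is encrypted under (key, address, counter) and how the MAC check rejects tampering or a wrong counter:

```python
import hashlib, hmac

KEY = b"on-chip-secret-key"   # never leaves the chip in real hardware

def _keystream(address: int, counter: int, length: int) -> bytes:
    # Stand-in for AES-CTR: a keystream unique to (key, address, counter).
    out, i = b"", 0
    while len(out) < length:
        out += hashlib.sha256(KEY + address.to_bytes(8, "little")
                              + counter.to_bytes(8, "little")
                              + i.to_bytes(4, "little")).digest()
        i += 1
    return out[:length]

def encrypt_block(plaintext: bytes, address: int, counter: int):
    ks = _keystream(address, counter, len(plaintext))
    ct = bytes(p ^ k for p, k in zip(plaintext, ks))
    # The MAC binds the ciphertext to its address and counter.
    mac = hmac.new(KEY, ct + address.to_bytes(8, "little")
                   + counter.to_bytes(8, "little"), hashlib.sha256).digest()
    return ct, mac

def decrypt_block(ct: bytes, mac: bytes, address: int, counter: int) -> bytes:
    expect = hmac.new(KEY, ct + address.to_bytes(8, "little")
                      + counter.to_bytes(8, "little"), hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expect):
        raise ValueError("MAC mismatch: tampered data or wrong counter")
    ks = _keystream(address, counter, len(ct))
    return bytes(c ^ k for c, k in zip(ct, ks))
```

Incrementing the counter on every write-back is what makes the same plaintext encrypt differently over time; decrypting with a stale counter fails the MAC check, which is how replays of old memory contents are detected.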
The NVIDIA H100 GPU contains many security functions that limit access to GPU content by untrusted software; it has an on-chip secure processor, supports multiple types and levels of encryption, and provides hardware-protected memory regions. The confidential computing function of the NVIDIA H100 GPU must be paired with a confidential virtual machine: with one confidential virtual machine protected by a CPU-side TEE, that virtual machine can be associated with several H100 GPUs in secure computing mode, or with several MIG compute instances. The GPU and the confidential virtual machine on the CPU have a mechanism for establishing a secure channel; data transmission between GPU and CPU is encrypted and decrypted in hardware, and the hardware also ensures that an untrusted virtual machine monitor cannot directly access a GPU in secure computing mode. The NVIDIA H100 GPU allows users to verify that they are communicating with an NVIDIA GPU with confidential computing enabled and to verify the security status of the GPU. However, the transmission rate of this technique is low: according to its technical report, the CPU-GPU transmission rate is only 4 GB/s. Moreover, only HBM memory is protected, so the large number of GPUs using conventional memory cannot be protected.
MMT, published at HPCA 2023 ("Efficient Distributed Secure Memory with Migratable Merkle Tree"), extends the scope of the memory protection engine from one node to multiple nodes in order to achieve (nearly) linear secure data transfer speed between distributed enclaves. The memory protection engine already protects the confidentiality, integrity, and freshness of physical memory. By reusing this hardware protection mechanism, and without additional cryptographic operations, an enclave can transfer data directly to another enclave over various connections (e.g., PCIe, RDMA). The solution proposes a Migratable Merkle Tree (MMT) that can transmit data and metadata to remote nodes without software involvement (e.g., re-encryption). One key insight of MMT is that (single-node) hardware memory protection already provides confidentiality, integrity, and freshness guarantees over untrusted DRAM, and these guarantees can be reused for protection over an untrusted network. To this end, the work first expands a single integrity tree into an integrity forest across multiple nodes and removes the limitation that encryption metadata is confined to CPUs. Furthermore, a new protocol, MMT closure delegation, was devised to protect data transmitted over an untrusted network. It can securely transfer the root, nodes, and data of an integrity subtree to a remote node without revealing confidential information or permitting replay attacks. Finally, the work implements a tiny trusted module to manage the enclaves and trusted hardware modules; it hides the details of the hardware implementation and is responsible for the connection between local and remote enclaves. However, this technology only supports transmission between nodes with the same secure memory architecture; in heterogeneous scenarios the secure memory architectures of the CPU and the GPU differ, so this scheme cannot be used for transmission.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a high-performance heterogeneous secure memory.
The high-performance heterogeneous secure memory provided by the invention comprises a HSMEM transmission engine, a multi-mode protection engine, and an interface;
The HSMEM transmission engine establishes a HSMEM transmission channel to realize high-speed transmission between the CPU and the GPU. The HSMEM transmission channel is located inside the chip and controls DMA requests to the secure memory; a DMA request selects the metadata associated with the memory to be transmitted and transmits the data without additional encryption and decryption, encrypting only a preset number of cache blocks per data transmission. After the data is transmitted to the CPU or the GPU, the multi-mode protection engine, located in the memory controller, directly uses the transmitted data under its original protection scheme;
Each memory protection scheme supported by the multi-mode protection engine is called a mode, and a memory block can be protected by different modes. Inside the multi-mode protection engine there is a mode selection module for selecting the mode of a memory block, a data encryption module for encrypting and decrypting data, and an integrity tree module for maintaining the integrity trees of the different modes;
through the interface, a developer uses a preset instruction to explicitly change the mode of a memory block.
Preferably, at initialization the HSMEM transmission engine generates keys with the peer device and builds a conventional secure channel. When data is to be transmitted, the HSMEM transmission engine prepares the metadata that needs to be transmitted with the data; in addition, the secure channel is used to exchange some basic system configuration during initialization;
the HSMEM transmission engine calculates the destination addresses of the metadata corresponding to the data, then controls the DMA engine to copy the metadata to the destination locations, and the GPU uses its own DMA engine to transfer the data.
Preferably, the HSMEM transmission engine includes an initialization module, specifically:
when the system is started or a GPU is plugged in, the CPU and the GPU verify each other, establish a traditional secure channel, and negotiate several shared keys for memory encryption through this channel. In addition, the CPU and the GPU provide each other with memory layout information, including the integrity tree format, the encryption algorithm type, and the addresses of the metadata regions; the GPU also provides the number of memory controllers and its memory addressing scheme to the CPU. After these operations finish, the CPU and the GPU each initialize their own secure memory;
the secure channel also ensures freshness of the transmitted data by using monotonically incremented initialization vectors for AES-GCM encryption.
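The freshness guarantee of the secure channel can be illustrated with a small model. The hardware uses AES-GCM with incrementing IVs; this stdlib-only sketch stands in HMAC for the authenticated cipher (class and method names are hypothetical) and shows how a stale IV is rejected as a replay:

```python
import hashlib, hmac

class SecureChannel:
    """Toy model: freshness via monotonically increasing IVs.  The real
    design uses AES-GCM; here an HMAC tag stands in for authentication."""
    def __init__(self, key: bytes):
        self.key = key
        self.send_iv = 0    # next IV to use when sending
        self.recv_iv = 0    # lowest IV still acceptable on receive

    def seal(self, msg: bytes):
        iv = self.send_iv
        self.send_iv += 1   # IVs only ever move forward
        tag = hmac.new(self.key, iv.to_bytes(12, "big") + msg,
                       hashlib.sha256).digest()
        return iv, msg, tag

    def open(self, iv: int, msg: bytes, tag: bytes) -> bytes:
        expect = hmac.new(self.key, iv.to_bytes(12, "big") + msg,
                          hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expect):
            raise ValueError("authentication failed")
        if iv < self.recv_iv:
            raise ValueError("replayed message")   # stale IV => replay
        self.recv_iv = iv + 1
        return msg
```

An AES-GCM IV is 96 bits, which is why the sketch encodes the counter into 12 bytes; because the receiver only accepts IVs it has not seen, a recorded message replayed later fails the freshness check.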
Preferably, the HSMEM transmission engine includes a meta-information selection module, specifically:
when data is transmitted through the HSMEM transmission channel, the security metadata is transmitted together with the data. For a BMT integrity tree, the counters and the message authentication code (MAC) values are transmitted together, and each transmission must cover at least a whole subtree; the minimum transmission granularity is therefore the memory size covered by a single integrity tree node. A selected subtree must satisfy the condition that all counters in its root node relate to the data to be transmitted; for a subtree root node, the counter in its parent node is called the subtree root counter. The CPU and the GPU are both set to use split counters in their integrity trees: the GPU integrity tree node has a fan-out of 128 with a cache line size of 128 bytes, while the CPU has a fan-out of 64 with a cache line size of 64 bytes. The granularity of CPU-GPU transmission therefore differs from that of GPU-CPU transmission, the covered size being the fan-out multiplied by the cache line size; for CPU-GPU transmission the granularity is 4KB, and only data covered by leaf nodes of the integrity tree can be transmitted;
the HSMEM transmission engine uses the address of the data to be transferred to determine the integrity tree nodes that should be transferred with it: it finds one or more subtrees covering the range of the transferred data, and if some counters in a root node do not belong to the data to be transferred, it continues searching for subtrees at the next lower level of the integrity tree.
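The subtree search can be sketched as a greedy walk down the levels of the tree. This illustrative function (the name and the `max_level` bound are assumptions) uses the CPU parameters of 64-byte lines and fan-out 64, so a level-0 leaf covers 4 KB, and returns the aligned subtrees that exactly cover a transfer range:

```python
def covering_subtrees(start: int, length: int, line: int = 64,
                      fanout: int = 64, max_level: int = 3):
    """Return (level, index) pairs of aligned integrity-tree subtrees that
    exactly cover [start, start+length); a level-0 (leaf) subtree covers
    line * fanout bytes -- 4 KB with the CPU parameters."""
    assert start % line == 0 and length % line == 0
    nodes, pos, end = [], start, start + length
    while pos < end:
        # Largest aligned subtree that starts at pos and fits in the range.
        level = max_level
        while level >= 0:
            span = line * fanout ** (level + 1)
            if pos % span == 0 and pos + span <= end:
                break
            level -= 1
        if level < 0:
            raise ValueError("range below the minimum (leaf) granularity")
        span = line * fanout ** (level + 1)
        nodes.append((level, pos // span))
        pos += span
    return nodes
```

Descending a level whenever a candidate root also covers counters outside the range mirrors the engine's rule that every counter in a transmitted subtree root must belong to the data being transferred.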
Preferably, the mode selection module operates as follows: on the CPU or GPU, for each memory block, the memory controller knows the block's current mode, and when a block with its security metadata is transmitted from the CPU to the GPU it remains protected by the original protection scheme;
the mode selection module is a hardware module in the memory controller that records this information, using one bit to indicate the protection scheme of each memory block. Since the cache line size of the GPU is 128 bytes and that of the CPU is 64 bytes, on the GPU the bitmap size is 1/2^10 of the total memory size and on the CPU it is 1/2^9: for 1GB of GPU memory, the GPU needs 1MB of space for the bitmap, and for 1GB of system memory, the CPU needs 2MB;
two levels of checking are used to reduce the overhead of accessing the bitmaps. On the GPU, each bit of the second-level bitmap represents 128 bytes, and each bit of the first-level bitmap represents 128KB. The first-level bitmap is stored directly in an on-chip buffer inside the memory controller; the second-level bitmap is stored in secure memory and its integrity must be ensured, so a region that always remains in local mode is reserved to store it: on the GPU this region is always in GPU mode, and on the CPU it is always in CPU mode. A dedicated buffer, called the bitmap buffer, is provided for the second-level bitmap; its cache line size is also 128 bytes, so the bits stored in one cache line represent 128KB of memory. When a cache line is replaced from the bitmap buffer, its value is checked: if all bits in the replaced cache line are unset, meaning the whole 128KB is in GPU mode, the corresponding bit in the first-level bitmap is not set, and if any bit in the cache line is set, the corresponding bit in the first-level bitmap is set.
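The two-level lookup can be modeled as follows. This sketch uses hypothetical names and the GPU parameters (one second-level bit per 128-byte block, one first-level bit per 128 KB); a set bit marks CPU mode, an unset bit GPU mode:

```python
class ModeBitmap:
    """Toy two-level mode lookup.  l1 models the on-chip first-level
    bitmap (1 bit per 128 KB); l2 models the in-memory second-level
    bitmap (1 bit per 128-byte block)."""
    BLOCK = 128
    L1_SPAN = 128 * 1024

    def __init__(self, mem_size: int):
        self.l1 = [0] * (mem_size // self.L1_SPAN)
        self.l2 = [0] * (mem_size // self.BLOCK)

    def set_cpu_mode(self, addr: int):
        self.l2[addr // self.BLOCK] = 1
        self.l1[addr // self.L1_SPAN] = 1

    def clear_cpu_mode(self, addr: int):
        self.l2[addr // self.BLOCK] = 0
        per_l1 = self.L1_SPAN // self.BLOCK
        base = (addr // self.L1_SPAN) * per_l1
        if not any(self.l2[base: base + per_l1]):
            self.l1[addr // self.L1_SPAN] = 0   # whole 128 KB back to GPU mode

    def mode(self, addr: int) -> str:
        if not self.l1[addr // self.L1_SPAN]:
            return "GPU"       # fast path: no second-level access needed
        return "CPU" if self.l2[addr // self.BLOCK] else "GPU"
```

The fast path is the point of the design: when a first-level bit is clear, the controller knows the entire 128 KB region is in local mode without touching the second-level bitmap in memory.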
Preferably, in the data encryption module, the CPU and GPU transmit data directly to each other without re-encryption. Encrypting or decrypting a memory block requires an address and a counter, and it is considered unsafe for two blocks to be encrypted with the same encryption key, address, and counter. Therefore, each time a dirty memory block is evicted from the cache and written to memory, its counter is incremented; and since the CPU and GPU use a shared encryption key, a unique address must be maintained for each memory block across the CPU and GPU, so the address used for memory encryption differs from the local physical address on the CPU or GPU;
all memories, including system memory and accelerator memory, are mapped into a single memory space called the encryption address space (EAS) for encryption. Addresses in the EAS are used only for encryption and decryption; the addressing of normal memory accesses remains unchanged. The EAS is implemented by giving each device a starting address, and during data transmission the receiver needs the additional EAS address information of the ciphertext in order to decrypt it;
the data encryption module first obtains the mode from the mode selection module, then obtains the correct counter from the integrity tree module, finds the correct EAS address, and then decrypts the data block;
the CPU uses half of the MAC to validate its own 64-byte cache line.
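A minimal sketch of the encryption address space: each device is assigned a disjoint EAS window, and the EAS address rather than the local physical address feeds the encryption engine, so no two blocks anywhere in the system ever share a (key, address) pair. All names here are illustrative assumptions:

```python
class EncryptionAddressSpace:
    """Toy EAS: every device's memory gets a disjoint window in one flat
    space; the EAS address, not the local physical address, is what the
    memory encryption engine uses."""
    def __init__(self):
        self.windows = {}
        self.next_base = 0

    def register(self, device: str, size: int) -> int:
        base = self.next_base
        self.windows[device] = (base, size)
        self.next_base += size          # windows never overlap
        return base

    def to_eas(self, device: str, phys_addr: int) -> int:
        base, size = self.windows[device]
        if not 0 <= phys_addr < size:
            raise ValueError("address outside the device's memory")
        return base + phys_addr
```

Because the same physical address on two different devices maps to two different EAS addresses, a shared key can be used safely across the CPU and GPU.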
Preferably, the integrity tree module in GPU mode works as follows: the different memory controllers use their own integrity trees and verify their data without interacting with other memory partitions. To allow direct access to transferred data, the GPU supports protection schemes for both CPU mode and GPU mode. Data in CPU mode is encrypted using addresses and counters from the CPU, and when such a memory block is fetched, the memory controller must obtain the correct counter value and address to decrypt it. Because one leaf node of the CPU-mode integrity tree covers blocks from different memory partitions, a dedicated module is set up to verify the counters and deliver the correct counter values to all memory controllers;
this CPU mode engine verifies the counters of the CPU-mode integrity tree. It is an on-chip module connected to the other memory controllers. The transmitted CPU-mode encryption metadata is stored in GPU memory; the CPU mode engine sends requests directly to the memory controllers to read memory, and the metadata is fetched and returned to it. When a memory controller needs a real counter value, it sends a request to the CPU mode engine, which reads the integrity tree nodes stored in GPU memory, verifies the counter, and returns the value to the memory controller.
Preferably, the CPU-mode integrity tree module supports two different modes with the engine at the CPU end. Besides the root node of the CPU-mode counter tree, a memory controller on the CPU reserves room to store the root nodes of GPU-mode subtrees, so two different types of integrity tree are supported directly in a single CPU memory controller. The EAS address of each subtree is stored in the cache together with its counter node. When the mode selection engine determines that a memory block is in GPU mode, the multi-mode protection engine first verifies the counter and the EAS address of the transferred block in GPU mode, then decrypts and verifies the block. To support GPU mode, the CPU maintains several GPU-mode integrity trees;
all these integrity trees have the same shape. When the integrity trees are initialized, the information provided by the GPU indicates its memory addressing mode, so the CPU knows how memory partition addresses are arranged and can correctly use the GPU's addressing scheme to look up the integrity tree corresponding to a memory block.
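The interleaved addressing that the CPU must reproduce can be expressed compactly. Assuming the striping described earlier (k partitions, runs of g consecutive blocks), this hypothetical helper maps a block address to the partition whose integrity tree covers it:

```python
def partition_of(block_addr: int, k: int, g: int) -> int:
    """Memory is striped across k partitions in runs of g consecutive
    blocks, so any k*g consecutive blocks touch every partition."""
    return (block_addr // g) % k
```

The CPU uses the same mapping to decide which of the k GPU-mode integrity trees to search for a given memory block.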
Preferably, the data transmission process of the HSMEM transmission channel is as follows:
Step 1, the software sends an instruction to the GPU through the secure channel mechanism of the GPU TEE;
Step 2, after reading the instruction, the GPU command processor sends a signal to the HSMEM transmission engine at the CPU end. This is implemented as a special interrupt request that does not actually raise an external interrupt, but conveys the signal and its information to the HSMEM transmission engine at the CPU end;
Step 3, the CPU-side HSMEM transmission engine receives the signal, reads the on-chip root node and verifies it, and at the same time sends a request to the memory controller to forbid the system from writing to the address;
Step 4, after selecting the root node, the HSMEM transmission engine sends a request to the memory controller to forbid writes to the region; if some tree nodes in the cache are dirty, they are flushed back to memory. It then reads the root nodes of the subtrees to be transmitted and transmits them through the secure channel;
Step 5, the HSMEM transmission engine ensures the GPU DMA engine is in secure mode, calculates the metadata addresses, and performs the secure memory transfer. The command gives the physical addresses of the range to be transferred, and this special memory transfer bypasses the IOMMU checks. The addresses of the metadata regions were exchanged by the HSMEM transmission engines at system initialization, so the HSMEM transmission engine directly calculates the metadata addresses and controls the DMA engine to fetch the correct metadata; the data, MACs, and integrity tree nodes are transmitted as-is, so no re-encryption is needed;
Step 6, after the metadata has been read from system memory, the GPU sends another special signal to the CPU; this signal does not trigger a software interrupt. The HSMEM transmission engine at the CPU end handles the signal and transmits the root nodes and EAS addresses of the subtrees to the GPU through the secure channel;
Step 7, the HSMEM transmission engine at the CPU end sends a message to the memory controller to cancel the write restriction on the block of memory. Within a transmission channel, HSMEM requires all transmitted data of a CPU-GPU or GPU-CPU transfer to be in the same mode; for GPU-GPU data transmission the MMT transmission mode is adopted, and the data may be protected by different modes.
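The seven steps above can be condensed into one orchestration sketch. Every method name below is a hypothetical stand-in for a hardware action; the recorder object simply logs the order in which the actions fire, which is the property the protocol depends on:

```python
class Log(list):
    """Records the name of every hardware action invoked on it."""
    def __getattr__(self, name):
        def record(*args, **kwargs):
            self.append(name)
        return record

def hsmem_transfer(hw, src, dst, length):
    # Steps 1-2: command via the GPU TEE channel, special signal to CPU engine
    hw.signal_cpu_engine(src, length)
    # Step 3: verify the on-chip root, forbid system writes to the range
    hw.verify_onchip_root()
    hw.write_protect(src, length)
    # Step 4: flush dirty tree nodes, send subtree roots over the secure channel
    hw.flush_dirty_tree_nodes(src, length)
    hw.send_subtree_roots()
    # Step 5: DMA moves ciphertext and metadata as-is -- no re-encryption
    hw.dma_data(src, dst, length)
    hw.dma_metadata(src, length)
    # Step 6: special signal back; roots and EAS addresses go to the GPU
    hw.send_roots_and_eas()
    # Step 7: lift the write restriction
    hw.write_unprotect(src, length)
```

The essential ordering constraint is visible in the sketch: the source range is write-protected before any ciphertext or metadata moves, and the restriction is lifted only after the roots have been handed over.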
Preferably, the memory access flow of the multi-mode data protection engine is:
Step 11, the mode of the currently accessed block is obtained through the mode selection module; if it is CPU mode, the following steps are performed, and if it is GPU mode, the ordinary GPU secure memory access path is used;
Step 12, when such a block is fetched from DRAM, it is verified against the CPU-mode integrity tree, being verified and decrypted by the integrity tree module and the memory encryption module;
Step 13, the decrypted and verified data is placed into the cache for access;
Step 14, after the access finishes, if the block was not modified, the cache line is simply evicted; if the block was modified, the GPU-mode counter is read, the block is re-encrypted, the GPU-mode metadata is updated, and the block is written back. When the GPU needs to encrypt the block and write it to memory, it does not use the EAS address provided by the CPU, but encrypts the block with its own EAS address;
Step 15, if the block is switched to GPU mode, the mode selection module is notified; it updates the corresponding information, marking the current memory block as changed back to GPU mode.
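Steps 11-15 can be illustrated with a toy engine in which "encryption" is a per-mode XOR (real hardware uses AES plus the BMT, and all names here are hypothetical). A CPU-mode block is decrypted with CPU metadata on access; once modified, the write-back re-encrypts it under the GPU's own scheme and flips its mode:

```python
class MultiModeEngine:
    """Toy model of the multi-mode access path (steps 11-15)."""
    KEYS = {"CPU": 0x5A, "GPU": 0xA5}   # stand-ins for per-mode encryption

    def __init__(self):
        self.dram, self.mode, self.cache = {}, {}, {}

    def write_raw(self, addr, plaintext, mode):
        # Simulate a block arriving already protected under `mode`.
        self.dram[addr] = bytes(b ^ self.KEYS[mode] for b in plaintext)
        self.mode[addr] = mode

    def access(self, addr, modify=None):
        mode = self.mode[addr]                                  # step 11
        key = self.KEYS[mode]
        pt = bytes(b ^ key for b in self.dram[addr])            # step 12
        self.cache[addr] = pt                                   # step 13
        if modify is not None:
            # Step 14: dirty write-back re-encrypts under the GPU scheme.
            self.cache[addr] = modify
            self.dram[addr] = bytes(b ^ self.KEYS["GPU"] for b in modify)
            self.mode[addr] = "GPU"                             # step 15
        return pt
```

Read-only accesses leave the block in its original mode, so transferred data never has to be converted up front; the mode flips lazily, on the first modifying access.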
Compared with the prior art, the invention has the following beneficial effects:
(1) High-performance data transmission (without re-encryption) is realized between heterogeneous secure memories; less time is spent on CPU-GPU transmission, so the overall performance of the application is higher;
(2) Multiple secure memory protection modes are supported on a single secure memory device, so transferred data (still under its original protection mode) can be accessed directly by the device without an immediate mode conversion;
(3) On a device with secure memory, a memory block can be switched efficiently between different protection modes and, according to the application's requirements, transparently converted at run time to the protection mode with higher performance, achieving optimal performance on the current device.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is an Intel SGX style memory integrity tree;
FIG. 2 is a memory encryption process;
FIG. 3 is a GPU memory partition;
FIG. 4 is a system architecture;
FIG. 5 is metadata of a selected integrity tree;
FIG. 6 is a multi-mode protection engine.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Example 1
System architecture or scenario for application of the present invention
HSMEM considers a heterogeneous platform for confidential computing. The protected application runs in TEEs on both the CPU side and the GPU side, and the TEEs at both ends use secure memory to protect the data. HSMEM aims at building DRAM security for the GPU TEE; for TEE management, a mechanism like Graviton [OSDI'18] may be used. Only the on-chip components of the CPU and GPU are trusted, so all data leaving the chip must be protected, and the confidentiality, integrity, and freshness of the data must be guaranteed. The present invention contemplates GPUs or other accelerators that use GDDR memory: an attacker may tap the memory bus on the CPU or GPU, or the PCIe bus or other interconnect between the CPU and the GPU, and may also physically issue instructions to read, modify, or replay the data in memory.
The core hardware modules of HSMEM are the HSMEM transfer engine and the multi-mode protection engine. The HSMEM transfer engine establishes a HSMEM transfer channel that enables high-speed transfers between the CPU and the GPU. It is located inside the chip and controls special DMA requests for secure memory. It selects the metadata associated with the memory to be transferred and can transfer the data without additional encryption or decryption; for each data transfer, only a few cache blocks need to be encrypted, providing speeds approaching those of plaintext transfers. After the data arrives at the CPU or GPU, the multi-mode protection engine, located in the memory controller, can directly use the transferred data under its original protection scheme. The engine supports multiple protection schemes; in the present invention, each memory protection scheme supported by the multi-mode protection engine is referred to as a "mode", and a memory block may be protected in different modes. Inside the multi-mode protection engine there is a mode selection module that selects modes for memory blocks, a data encryption module for data encryption and decryption, and an integrity tree module that maintains the integrity trees of the different modes. Furthermore, HSMEM is not limited to BMT-like protection schemes and can be extended to other memory protection schemes with looser security requirements. HSMEM also provides an interface that allows a developer to explicitly change the mode of a memory block using special instructions.
The core module of the present invention is implemented as described below; the overall system architecture is shown in FIG. 4.
The HSMEM transfer engine constructs HSMEM transfer channels for secure Direct Memory Access (DMA). A HSMEM transfer channel enables high-speed data transfer between different devices equipped with secure memory: no re-encryption is needed during transfer, and only the data and its security metadata are transmitted. At initialization, the HSMEM transfer engine negotiates keys with the peer device to construct a conventional secure channel, which becomes part of the HSMEM transfer channel; this secure channel is also used to exchange some basic system configuration during initialization. When data is to be transmitted, the HSMEM transfer engine prepares the metadata that must travel with the data, computes the destination addresses of the metadata corresponding to the data, and controls the DMA engine to copy the metadata to the destination locations. The GPU uses its own DMA engine for the data transfer. The transfer engine comprises an initialization module and a metadata selection module.
The initialization module mutually authenticates the CPU and the GPU and establishes a traditional secure channel when the system boots or the GPU is plugged in. Through the secure channel, the CPU and GPU negotiate multiple shared keys for memory encryption. In addition, the CPU and GPU provide memory layout information to each other: each should provide its integrity tree format, encryption algorithm type, and the address of its metadata area. The GPU should also provide some additional information to the CPU, including the number of memory controllers and the memory addressing scheme. After this is done, the CPU and GPU initialize their secure memories. The secure channel should also guarantee freshness of the transmitted data; this may be achieved by using a monotonically incrementing Initialization Vector (IV) for AES-GCM encryption. It should be noted that maintaining such a secure channel does not place a significant burden on the HSMEM transfer engine, and existing PCIe security enhancements that already use AES-GCM may even be reused.
The metadata selection module transmits certain security metadata together with the data when the data is sent through a HSMEM transfer channel. For BMT, the counters are transmitted along with the MAC values. HSMEM requires that each data transfer cover at least an entire subtree, so the smallest transfer granularity is the memory size covered by a single integrity tree node. The selected subtree must satisfy the condition that every counter in the subtree root node relates to the data to be transmitted. In the present invention, the counter in the parent node is referred to as the subtree root counter of the subtree root node. Here, it is assumed that the CPU and GPU both use split counters in their integrity trees. The GPU's integrity tree node fan-out is 128 and its cache line size is 128 bytes, while the CPU's fan-out is 64 and its cache line size is 64 bytes; consequently, the granularity of CPU-GPU transfers differs from that of GPU-CPU transfers. The size covered by a node is calculated by multiplying the fan-out by the cache line size. For CPU-GPU transfers, the granularity is 4KB, which means only data covered by leaf nodes of the integrity tree is transferred. First, the HSMEM transfer engine uses the address of the data to be transmitted to determine the integrity tree nodes that should accompany the data. It finds one or more subtrees covering the transfer range and must ensure that all data covered by the root node of each selected subtree is data to be transferred. If some counters in a root node do not belong to the data to be transmitted, it continues looking for subtrees at lower levels of the integrity tree, as shown in FIG. 5.
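The coverage arithmetic and the subtree-alignment rule above can be sketched as follows (parameter values are the ones assumed in the text; the function and variable names are illustrative, not from the patent):

```python
def node_coverage(fanout: int, cache_line: int) -> int:
    """Bytes of data covered by a single counter (leaf) node:
    fan-out multiplied by the cache line size."""
    return fanout * cache_line

CPU_FANOUT, CPU_LINE = 64, 64      # CPU split-counter tree (per the text)
GPU_FANOUT, GPU_LINE = 128, 128    # GPU split-counter tree

# CPU -> GPU transfers move whole leaf subtrees, so the granularity is 4 KB;
# one GPU leaf node covers 16 KB.
assert node_coverage(CPU_FANOUT, CPU_LINE) == 4 * 1024
assert node_coverage(GPU_FANOUT, GPU_LINE) == 16 * 1024

def covering_subtrees(start: int, length: int, coverage: int):
    """Split an aligned transfer range into whole subtrees. Every counter
    in each selected subtree root must relate only to transferred data
    (FIG. 5); an unaligned range must descend to a lower tree level."""
    if start % coverage or length % coverage:
        raise ValueError("unaligned range: descend to a lower tree level")
    return [(a, a + coverage) for a in range(start, start + length, coverage)]
```

For example, an 8KB CPU-to-GPU transfer starting at address 0 selects two 4KB leaf subtrees.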
GPU-CPU data transfers use a different granularity. Due to the address interleaving of GPU memory, cache lines at adjacent addresses may belong to different memory controllers. To simplify metadata management for transferred data on the CPU, HSMEM ensures that at least one full integrity subtree per memory controller is transferred and that all memory controllers transfer the same amount of data to the CPU. Thus, the GPU mode integrity trees maintained on the CPU are homogeneous, and at least NR_MemCtrl root nodes must be transmitted. The transfer granularity also depends on the address interleaving granularity: NR_MemInter is the granularity of the memory address interleaving, meaning that adjacent NR_MemInter bytes come from the same memory partition. The replication granularity is max(NR_MemInter, 16KB) × NR_MemCtrl, where 16KB is the memory covered by one GPU leaf node, so that all memory controllers transmit the same amount of data to the CPU. Assuming the memory address interleaving granularity is 2 memory blocks (256 bytes) and there are 16 memory controllers, the replication granularity is 256KB.
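A short sketch of this replication-granularity rule, taking the worked example in the text (256-byte interleave, 16 controllers, 256KB result) as the authoritative reading — each controller must ship at least one whole leaf subtree (16KB) and at least one interleave unit (the function name and parameter names are illustrative):

```python
GPU_LEAF_COVERAGE = 128 * 128  # 16 KB: one GPU leaf node (fan-out 128, 128 B lines)

def gpu_to_cpu_granularity(nr_mem_inter: int, nr_mem_ctrl: int) -> int:
    """Replication granularity for GPU -> CPU transfers: every memory
    controller transfers the same amount, which must cover at least one
    whole integrity subtree (16 KB) and at least one interleave unit."""
    per_ctrl = max(nr_mem_inter, GPU_LEAF_COVERAGE)
    return per_ctrl * nr_mem_ctrl

# Worked example from the text: 256-byte interleave, 16 controllers -> 256 KB.
assert gpu_to_cpu_granularity(256, 16) == 256 * 1024
```

With a coarser interleave (say 32KB per partition), the per-controller amount grows to the interleave unit instead: 32KB × 16 controllers = 512KB.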
The multi-mode protection engine supports the use of multiple memory protection schemes in a device equipped with secure memory; the multi-mode protection engines of both the CPU and GPU support two memory protection schemes. When the CPU or GPU receives data through a HSMEM transfer channel, the data is at first still protected by its original protection scheme: the original mode is used when reading the data, and the block is switched to the local mode when it is written. First, the present solution describes how the GPU reads memory in CPU mode and how the CPU reads memory in GPU mode, and then how a block switches between the two modes. The architecture of the multi-mode protection engine is shown in FIG. 6. The CPU mode integrity tree and MACs on the GPU are referred to as shadow metadata; similarly, the GPU mode security metadata on the CPU is also referred to as shadow metadata.
The mode selection module tracks the current mode of each memory block on the CPU or GPU. When a block with secure metadata is transferred from the CPU to the GPU, it is still protected by its original protection scheme, and vice versa. The mode selection module is a hardware module in the memory controller that records this information, using one bit to indicate the protection scheme (i.e., mode) of each memory block. The cache line size of the GPU is 128 bytes, while the cache line size of the CPU is 64 bytes; hence on the GPU the bitmap size is 1/2^10 of the total memory size, and on the CPU it is 1/2^9. For 1GB of GPU memory, the GPU requires 1MB of space for the bitmap, while for 1GB of system memory the CPU requires 2MB. However, current CPUs and GPUs have memory capacities ranging from tens to hundreds of GB, so storing the whole bitmap on chip is impractical, yet every memory access must be checked against the bitmap. HSMEM uses two levels of checking to reduce the overhead of accessing the bitmap. On the GPU, each bit of the second level bitmap represents 128 bytes, while each bit of the first level bitmap represents 128KB. The first level bitmap is stored directly in an on-chip buffer inside the memory controller; the second level bitmap is stored in secure memory and its integrity must be ensured. HSMEM reserves an area that is always in local mode to store the bitmap.
On the GPU, this region must always be in GPU mode, while on the CPU it must always be in CPU mode. A dedicated cache, referred to as the bitmap cache, is provided for the second level bitmap, with a cache line size of 128 bytes (1024 bits); thus each cache line of the bitmap covers 128KB of memory. When a cache line is replaced from the bitmap cache, its value is checked. If no bit in the replaced cache line is set, meaning the entire 128KB is in GPU mode, the corresponding bit in the first level bitmap is left clear; if any bit is set, the corresponding bit in the first level bitmap must be set. Each memory controller on the GPU has a mode selection module to manage its memory partition. The mode selection module on the CPU is similar, but each bit in the second level bitmap represents 64 bytes and each bit in the first level bitmap represents 64KB.
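The two-level check described above can be sketched as a small model (GPU-side sizes assumed from the text; the class and method names are illustrative, and the summary-bit update is simplified to happen on every clear rather than only on bitmap-cache eviction):

```python
class ModeBitmap:
    """Sketch of HSMEM's two-level mode check on the GPU side: the
    fine-grained second-level bitmap holds one bit per 128-byte block,
    while the on-chip first-level bitmap holds one summary bit per 128 KB
    region. A clear summary bit means the whole region is still in local
    (GPU) mode, so the fine bitmap need not be consulted at all."""

    BLOCK = 128            # bytes covered by one second-level bit
    REGION = 128 * 1024    # bytes covered by one first-level bit

    def __init__(self, mem_size: int):
        self.first = [0] * (mem_size // self.REGION)   # on-chip summary
        self.second = [0] * (mem_size // self.BLOCK)   # in secure memory

    def set_foreign(self, addr: int) -> None:
        """Mark a block as protected by the foreign (CPU) mode."""
        self.second[addr // self.BLOCK] = 1
        self.first[addr // self.REGION] = 1

    def clear_foreign(self, addr: int) -> None:
        """Mark a block back to local mode; clear the summary bit once
        every fine-grained bit in its 128 KB region is zero (in hardware
        this check happens when a bitmap-cache line is evicted)."""
        self.second[addr // self.BLOCK] = 0
        per_region = self.REGION // self.BLOCK
        base = (addr // self.REGION) * per_region
        if not any(self.second[base:base + per_region]):
            self.first[addr // self.REGION] = 0

    def is_foreign(self, addr: int) -> bool:
        if not self.first[addr // self.REGION]:
            return False   # fast path: whole region in local mode
        return bool(self.second[addr // self.BLOCK])
```

Most accesses hit the fast path: only regions that actually received transferred data pay the cost of reading the in-memory bitmap.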
The data encryption module allows the CPU or GPU to transfer data directly to each other without re-encryption. To encrypt or decrypt a memory block, the encryption requires an address and a counter, as shown in FIG. 2. Two blocks are insecure if they are encrypted using the same encryption key, address and counter, because they then share the same one-time pad (OTP). Thus, each time a dirty memory block is evicted from the cache and written to memory, the counter for that block is incremented. Similar to MMT (see related art), if the CPU and GPU use shared encryption keys, HSMEM must also maintain a unique address for each memory block on the CPU or GPU. Memory encryption therefore uses addresses that differ from the local physical addresses: HSMEM maps all memory in the system, including system memory and accelerator memory, into a single space called the Encryption Address Space (EAS). Addresses in the EAS are used only for encryption and decryption; addressing for normal memory access remains unchanged. The EAS can easily be implemented by assigning a start address to each device. During data transfer, the receiving side needs additional address information to decrypt the transmitted ciphertext: the data encryption module first obtains the mode from the mode selection module, then obtains the correct counter from the integrity tree module and finds the correct EAS address, after which the data block can be decrypted. Since the GPU cache line size is 128 bytes, twice the CPU cache line size, two adjacent CPU memory blocks use the same counter in the GPU mode integrity tree on the CPU. On the CPU, when such a memory block is fetched, the counter of the GPU integrity tree is verified and used for decryption on the CPU. EAS addresses are also stored with the nodes of the integrity tree.
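A minimal sketch of why transfer without re-encryption works: the pad depends only on the key, the EAS address, and the counter, so the receiver can decrypt the transferred ciphertext as-is once it knows those inputs. This uses a SHA-256-based toy pad as a stand-in for the real AES counter-mode construction; the key and EAS base addresses are invented for illustration:

```python
import hashlib

def otp(key: bytes, eas_addr: int, counter: int, n: int) -> bytes:
    """Toy stand-in for the AES-CTR one-time pad: derived from the key,
    the EAS address, and the per-block counter, so no (address, counter)
    pair may ever repeat under the same key."""
    seed = key + eas_addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    return hashlib.sha256(seed).digest()[:n]

def xcrypt(data: bytes, key: bytes, eas_addr: int, counter: int) -> bytes:
    pad = otp(key, eas_addr, counter, len(data))
    return bytes(d ^ p for d, p in zip(data, pad))

KEY = b"shared-cpu-gpu-key"          # negotiated at initialization (assumed)
# Each device gets a disjoint EAS start address (assumed values), so the
# shared key never reuses an address across devices.
CPU_EAS_BASE, GPU_EAS_BASE = 0x0000_0000_0000, 0x1000_0000_0000

pt = b"secret cacheline"
ct = xcrypt(pt, KEY, CPU_EAS_BASE + 0x2000, counter=7)
# The receiver decrypts the transferred ciphertext directly, given the
# original EAS address and counter -- no re-encryption on the bus.
assert xcrypt(ct, KEY, CPU_EAS_BASE + 0x2000, counter=7) == pt
# A fresh write (new counter) or a different device's EAS range yields a
# different pad, preventing OTP reuse.
assert xcrypt(pt, KEY, CPU_EAS_BASE + 0x2000, counter=8) != ct
assert xcrypt(pt, KEY, GPU_EAS_BASE + 0x2000, counter=7) != ct
```

The real engine additionally verifies the counter through the integrity tree before using it for decryption, which this sketch elides.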
Although both CPU cache lines share the same GPU MAC, the CPU can still verify one of the cache lines on its own. This is because HSMEM changes the way the GPU's MAC is computed: it is generated by concatenating two MACs, each computed over one of the two 64-byte halves of the GPU cache line. Thus, the CPU can use half of the MAC to verify its own 64-byte cache line.
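The split-MAC construction can be sketched as follows; truncated HMAC-SHA256 stands in for the real MAC algorithm, and the function names are illustrative:

```python
import hashlib
import hmac

def half_mac(key: bytes, half64: bytes) -> bytes:
    """Toy 8-byte MAC over one 64-byte half (HMAC-SHA256 stand-in)."""
    assert len(half64) == 64
    return hmac.new(key, half64, hashlib.sha256).digest()[:8]

def gpu_line_mac(key: bytes, line128: bytes) -> bytes:
    """HSMEM's modified GPU MAC: the concatenation of one MAC per
    64-byte half, so either half verifies independently on the CPU."""
    return half_mac(key, line128[:64]) + half_mac(key, line128[64:])

key = b"mac-key"
line = bytes(range(128))            # one 128-byte GPU cache line
tag = gpu_line_mac(key, line)
# The CPU fetches only the second 64-byte cache line and checks its half.
assert tag[8:] == half_mac(key, line[64:])
```

Without this change, verifying either CPU-sized half would require fetching the full 128-byte GPU line.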
Integrity tree module (GPU mode): a GPU with confidential computing capability naturally supports the GPU mode protection scheme. In the GPU mode integrity tree, each memory controller uses its own integrity tree and can verify its data without interacting with other memory partitions. To access transferred data directly, the GPU must support both the CPU mode and GPU mode protection schemes. Data in CPU mode is encrypted using the address and counter from the CPU, so when such a memory block is fetched, the memory controller must obtain the correct counter value and address to decrypt it. However, in CPU mode one leaf node of the integrity tree covers blocks from different memory partitions; thus there must be a module that verifies the counters and delivers the correct counter values to all memory controllers. HSMEM uses a dedicated unit, the CPU mode protection engine, to verify the counters in the CPU mode integrity tree. The CPU mode engine is an on-chip module connected to the other memory controllers. The transferred CPU mode protection metadata is stored in GPU memory; the CPU mode engine reads it by sending requests directly to the memory controllers, which fetch the metadata and return it. When a memory controller needs the real counter value, it sends a request to the CPU mode engine, which reads the integrity tree nodes stored in GPU memory, verifies the counter, and returns the value to the memory controller.
As previously described, several subtrees are transmitted each time data is transferred to the GPU. The root counters of the subtrees are protected by the secure channel; after a subtree transfer completes, its root counter is stored in the on-chip portion of the accelerator. The multi-mode protection engine logically uses the CPU mode integrity tree to protect all secure memory, and the physical address of a memory block determines which CPU mode counter the block is associated with. When there is no transferred data, there are no active nodes in the CPU mode integrity tree; when data is transferred from the CPU, the corresponding nodes become active, depending on the physical addresses of the transferred data, and the transmitted subtrees are placed directly at the corresponding locations. However, since a transmitted subtree has no valid parent node, its root counter is stored on chip; at most 8 bytes are required per CPU mode root counter, and the multi-mode protection engine need only store the counter value from the parent node. At the same time, the block's address on the CPU, i.e., its EAS address, is transferred with the subtree root counter and stored on chip; thus only one counter and one 64-bit EAS address need be stored per subtree. When a counter is needed, all tree nodes on the path up to the root counter are verified, continuing until the first node cached on chip. In the CPU mode counter cache on the accelerator, each node is stored with the ID of its on-chip root node; when verifying a node in the counter cache, the corresponding EAS address is derived from the EAS address of its root node. The accelerator is not a conventional processor core but has fixed logic for the CPU mode protection scheme, so more on-chip storage may be used.
When the GPU memory controller fetches a memory block, it can request the EAS address and counter value from this unit, then verify and decrypt the fetched block. However, if there are too many subtrees, the on-chip buffer cannot store all their EAS addresses and counters. HSMEM provides two approaches. The first is for the unit to send an interrupt to the GPU command processor, which then launches a special GPU kernel to convert a range of memory to the local GPU mode. The second is that the nodes may be evicted and saved in a block of secure memory; if a node is not found in the cache, the engine fetches it from the protected memory.
Integrity tree module (CPU mode): the CPU-side engine must also support two different modes. In addition to the root node of the CPU mode counter tree, the memory controller on the CPU reserves space to store the root nodes of subtrees in GPU mode; these two types of integrity trees are supported directly in a single memory controller of the CPU. The EAS address of a subtree is also stored with its counter node in the cache. When the mode selection engine determines that a memory block is in GPU mode, the multi-mode protection engine first verifies the GPU mode counter and the EAS address of the transferred block; the block can then be decrypted and verified. To support GPU mode, the CPU maintains several GPU mode integrity trees. Since the memory address interleaving is fixed, the CPU does not need to store EAS addresses for all GPU mode integrity trees. All these integrity trees have the same shape, because the transfer granularity chosen by this scheme guarantees it. At initialization, the information provided by the GPU describes the memory addressing scheme; the CPU therefore knows the memory partition address arrangement and can correctly use the GPU's addressing scheme to find the integrity tree corresponding to a memory block.
The data transfer process over HSMEM transfer channels: when interacting with the CPU, a GPU kernel-space driver (a module in the operating system kernel) or a user-space driver (e.g., the CUDA driver) typically sends commands through a GPU channel, in addition to reading or writing MMIO registers. With a GPU TEE mechanism like Graviton, the integrity and confidentiality of the commands are already protected. When the driver pushes a command through a GPU channel to request a data transfer via the HSMEM transfer engine, the address range of the memory to be transferred must also be provided, and the GPU's address translation hierarchy must be considered. The GPU has its own page tables, and the address used by a GPU kernel is a GPU Virtual Address (VA); the GPU MMU translates it into a GPU Physical Address (PA). Pages residing in GPU memory or system memory are marked in the GPU page table. Current GPUs have two types of GPU channels with different privileges: privileged and non-privileged. A kernel-space driver may use a privileged channel, in which commands may directly use physical addresses; a user-space driver uses a non-privileged channel, whose commands can only use GPU VAs, which are eventually translated to GPU PAs. However, in modern computer systems an IOMMU typically guards any device access to system memory, so the GPU cannot directly observe IO Physical Addresses (IOPAs) in the system address space. We note that modern GPUs use Address Translation Services (ATS); for some supercomputers supporting CPU-GPU NVLink interconnects (e.g., the Summit supercomputer), ATS enables the GPU to access system memory using CPU virtual addresses. With ATS enabled, the device or GPU sends a request to the address translation agent on the CPU side to obtain the IOPA corresponding to a given address; the agent translates the address and returns the IOPA to the GPU.
The GPU caches these translations in an Address Translation Cache (ATC). If an entry is found in the ATC, the IOPA can be used directly to bypass the IOMMU, relieving pressure on the IOTLB inside the IOMMU. HSMEM transfers may use the ATS directly to determine the real address of the transmitted data; since a secure transfer request requires a contiguous transfer range, only the address of the first byte is needed.
Step 1, software sends instructions (commands) to the GPU through the GPU TEE secure channel mechanism;
Step 2, after the command is read, the GPU command processor sends a signal to the HSMEM transfer engine at the CPU end. This can be implemented as a special interrupt request that does not actually cause an external interrupt, but instead passes the signal and information to the HSMEM transfer engine on the CPU side;
Step 3, the HSMEM transfer engine at the CPU end receives the signal, reads the on-chip root node and verifies it, and simultaneously sends a request to the memory controller to prohibit the system from writing to the address;
Step 4, after selecting the root nodes, the HSMEM transfer engine sends a request to the memory controller to prohibit writes to the region; if some tree nodes in the cache are dirty, they are flushed back to memory; the engine then reads the root nodes of the subtrees to be transmitted and sends them through the secure channel;
Step 5, the HSMEM transfer engine ensures that the GPU DMA engine is in secure mode, computes the metadata addresses, and performs the secure memory transfer. The commands give the physical address of the range to be transferred, and this special memory transfer can bypass the IOMMU check; since the address of the metadata area was provided by the HSMEM transfer engine at system initialization, the metadata addresses can be computed directly. The DMA engine is controlled to fetch the correct metadata and transfer the data, MACs, and integrity tree nodes, so these steps require no re-encryption;
Step 6, after the metadata is read from system memory, the GPU sends another special signal to the CPU. This signal does not trigger a software interrupt; it is handled by the HSMEM transfer engine at the CPU end, which sends the root counters and EAS addresses of the subtrees to the GPU through the secure channel;
Step 7, the HSMEM transfer engine at the CPU end sends a message to the memory controller, cancelling the write restriction on the memory blocks. In a transfer channel, for CPU-GPU or GPU-CPU data transfers, HSMEM requires all transmitted data to be in the same mode; for example, if the GPU needs to transmit data to the CPU, all of it must be protected by either GPU mode or CPU mode. For GPU-GPU data transfers, the MMT transfer scheme is adopted, and the data may be protected by different modes, meaning some data may be in CPU mode and some in GPU mode.
Taking the GPU side as an example, when a memory block is in CPU mode, the GPU integrity tree remains unchanged: the counter nodes stay in memory in plaintext, as in the original integrity tree structure. The CPU-side implementation is similar.
Step 1, obtain the mode of the currently accessed block through the mode selection module; if it is CPU mode, perform the following steps; if it is GPU mode, use the ordinary GPU secure memory access path;
Step 2, when such a block is fetched from DRAM, it is checked against the CPU mode integrity tree, being verified and decrypted by the integrity tree module and the memory encryption module;
Step 3, the decrypted and verified data is placed into the cache for access;
Step 4, after the access finishes, if the block was not modified it can be evicted from the cache directly; if it was modified, the GPU mode counter must be read, the block re-encrypted, and metadata such as the GPU mode integrity tree updated before the block is written back. When the GPU needs to encrypt the block and write it to memory, the block is encrypted using the GPU's own EAS address rather than the EAS address provided by the CPU;
Step 5, upon switching to GPU mode, the mode selection module is notified; it updates the corresponding information and marks the current memory block as changed back to GPU mode.
Example 2
Example 2 is a preferred example of example 1.
The invention can be realized on GPUs and CPUs using GDDR memory, supports memory protection modes with an integrity tree, and can defend against physical replay attacks. Meanwhile, a TEE mechanism exists between the CPU and the GPU, and this scheme can be built on top of the existing TEE mechanism to achieve efficient data transfer. In addition, the invention provides a special instruction: a user can write a special GPU kernel that uses this instruction to convert memory actively without bringing the data into the GPU's cache; the instruction sends a request directly to the memory controller, which converts the memory mode directly, and the converted result is not placed in the cache.
Meanwhile, in each memory controller, a cache dedicated to storing integrity tree nodes and data MACs is added to accelerate access to the memory protection metadata. In the implementation, multiple integrity trees may be used within the same memory controller to protect memory, reducing tree depth, similar to MMT.
This embodiment achieves high-performance data transfer between the CPU and the GPU while preserving the confidentiality, integrity and freshness of secure memory, even when the CPU's and GPU's secure memories are heterogeneous.
Example 3
Example 3 is a preferred example of example 1.
HSMEM may support different protection schemes. Examples 1 and 2 mainly use the strongest scheme, which guarantees confidentiality, integrity and freshness simultaneously to defend against physical attacks; this may be achieved by designs based on the Bonsai Merkle Tree (BMT), such as Intel SGX. Some hardware platforms do not provide all of these guarantees: under a physical attack, AMD SEV does not guarantee integrity and freshness, while Intel TDX does not consider physical replay attacks. Thus, on such architectures the integrity tree and MAC may be omitted.
By customizing the multi-mode protection engine, HSMEM can be extended to other protection schemes. In such an architecture, no integrity tree is required, but because the address is still used as the encryption tweak, the receiver must still transmit and record the address during data transfer. For transferred blocks not encrypted under the local EAS address, the multi-mode protection engine needs a mechanism to find the correct EAS address to decrypt the block; HSMEM uses an EAS translation engine for this. For large memory transfers, it suffices to record only the starting original EAS address for decryption. Thus, the EAS translation engine maintains range registers that record the start address, end address, and offset of the data when it is transferred; this information is stored in a range buffer. When the mode selection module finds that a memory block does not belong to the current mode, the memory controller checks whether its address falls within any range in the range buffer. However, the size of the range buffer is limited, so the EAS translation engine also maintains a multi-level translation table that translates local addresses to original EAS addresses at 4KB granularity; the table is organized like a page table and may also support mechanisms similar to large pages. The EAS translation engine periodically scans the bitmap, and if all the memory of a range has switched to local mode, deletes its entry in the range buffer. If the range buffer does not have enough space to store a new range, it evicts an entry and updates the mapping table: the engine checks whether the new range is larger than an existing range in the buffer, and only in that case is the old range evicted.
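The range-buffer-plus-table lookup can be sketched as follows. This is a simplified model under stated assumptions: the class name, the capacity, and the flat dictionary standing in for the multi-level table are all illustrative, and the eviction policy follows the "evict only if the new range is larger" rule above:

```python
class EASTranslator:
    """Sketch of the EAS translation engine: a few range registers
    (start, end, offset) recorded at transfer time, backed by a
    page-table-like map at 4 KB granularity for evicted ranges."""
    PAGE = 4096

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.ranges = []   # list of (start, end, offset)
        self.table = {}    # local 4K page -> original EAS page (flat stand-in)

    def record_transfer(self, local_start: int, length: int,
                        orig_eas_start: int) -> None:
        off = orig_eas_start - local_start
        if len(self.ranges) == self.capacity:
            victim = min(self.ranges, key=lambda r: r[1] - r[0])
            if length <= victim[1] - victim[0]:
                # New range is not larger: it goes to the table instead.
                self._to_table(local_start, length, off)
                return
            self.ranges.remove(victim)
            self._to_table(victim[0], victim[1] - victim[0], victim[2])
        self.ranges.append((local_start, local_start + length, off))

    def _to_table(self, start: int, length: int, off: int) -> None:
        for p in range(start, start + length, self.PAGE):
            self.table[p] = p + off

    def eas_of(self, local_addr: int) -> int:
        """Range registers are checked first; misses fall back to the
        4 KB-granularity mapping table."""
        for s, e, off in self.ranges:
            if s <= local_addr < e:
                return local_addr + off
        page = local_addr - local_addr % self.PAGE
        return self.table[page] + local_addr % self.PAGE
```

Because transfers are typically large and contiguous, most lookups hit a range register and never touch the table.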
Since data is typically transferred in large blocks, large pages can be used even when the range buffer is full and the translation table must be used, reducing overhead. The mapping table may be fronted by a TLB-like buffer and combined with the range buffer, keeping translation overhead limited.
The invention can also be used for other accelerators that do not use HBM memory.
Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the device and the respective modules thereof provided by the invention can be regarded as a hardware component, and the modules for realizing various programs included therein can be regarded as a structure in the hardware component, and the modules for realizing various functions can be regarded as a structure in the hardware component as well as a software program for realizing the method.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.
Claims (7)
1. A high-performance heterogeneous secure memory, characterized by comprising a HSMEM transmission engine, a multi-mode protection engine and an interface;
The HSMEM transmission engine establishes a HSMEM transmission channel to realize high-speed transmission between the CPU and the GPU, the HSMEM transmission engine is positioned inside the CPU and the GPU chip and controls DMA requests for safe memory, the HSMEM transmission engine selects metadata related to the memory to be transmitted and transmits the data without additional data encryption and decryption, a preset number of cache blocks are encrypted for each data transmission, and after the data is transmitted to the CPU or the GPU, the multi-mode protection engine is positioned in a memory controller and directly uses the transmission data of an original protection scheme;
Each memory protection scheme supported by the multi-mode protection engine is called a mode, one memory block is protected through different modes, a mode selection module is arranged in the multi-mode protection engine and used for selecting a mode for the memory block, a data encryption module is arranged for encrypting and decrypting data, and an integrity tree module is arranged for maintaining an integrity tree of the different modes;
The integrity tree module of the GPU mode comprises that different memory controllers use own integrity tree and verify the data without interacting with other memory partitions, the GPU supports a protection scheme of the CPU mode and the GPU mode for directly accessing the transmitted data, the data of the CPU mode is encrypted by using an address and a counter from the CPU, when acquiring the memory block, the memory controller acquires the correct counter value and the address to decrypt the block, and one leaf node of the integrity tree relates to the block from the different memory partitions for supporting the CPU mode, therefore, one module is arranged for verifying the counter and transmitting the correct counter value to all the memory controllers;
The HSMEM transmission engine uses a CPU-mode protection engine within the multi-mode protection engine to verify counters in the CPU-mode integrity tree; the CPU-mode protection engine is an on-chip module connected to the other memory controllers; the metadata for the transmitted CPU-mode encryption protection is stored in GPU memory, and the CPU-mode protection engine obtains it by sending a read request to the memory controller, which fetches the metadata directly and returns it to the CPU-mode protection engine; when a memory controller needs the real counter value, it sends a request to the CPU-mode protection engine, which reads the integrity tree node stored in GPU memory, verifies the counter and returns the value to the memory controller;
The multi-mode protection on the CPU side supports the two different modes: besides the root node of the CPU-mode counter tree, the memory controller on the CPU also reserves space to store the root nodes of GPU-mode subtrees, so these two different types of integrity tree are supported directly in a single CPU memory controller; the EAS addresses of the subtrees are cached together with the counter nodes; when the mode selection engine determines that a memory block is in GPU mode, the multi-mode protection engine first verifies the GPU-mode counter and the EAS address of the transferred block, then decrypts and verifies the block; to support the GPU mode, the CPU maintains several GPU-mode integrity trees;
all these integrity trees have the same shape; when they are initialized, information provided by the GPU describes its memory addressing mode, so that the CPU knows how memory partition addresses are arranged and can correctly use the GPU's addressing scheme to locate the integrity tree corresponding to a memory block;
through the interface, a developer uses preset instructions to change the mode of a memory block.
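The counter-verification flow of claim 1 can be sketched in software. This is an illustrative model, not the patented hardware: the class and method names (`CpuModeProtectionEngine`, `verify_counter`) and the dictionary-based "GPU memory" are all invented for this sketch.

```python
# Sketch: a CPU-mode protection engine that reads integrity-tree nodes
# stored in GPU memory, verifies the counter against a trusted parent
# value, and returns the verified counter to a requesting memory
# controller. All names and the data layout are hypothetical.

class CpuModeProtectionEngine:
    def __init__(self, gpu_memory, parent_counters):
        self.gpu_memory = gpu_memory          # node address -> stored counter
        self.parent_counters = parent_counters  # trusted parent-node counters

    def verify_counter(self, node_addr):
        """Read a tree node from GPU memory, check it against the trusted
        parent counter, and return the verified value to the requester."""
        value = self.gpu_memory[node_addr]
        if value != self.parent_counters[node_addr]:
            raise ValueError(f"counter mismatch at {node_addr:#x}")
        return value

# A memory controller needing the real counter value sends a request:
engine = CpuModeProtectionEngine(
    gpu_memory={0x1000: 7}, parent_counters={0x1000: 7})
assert engine.verify_counter(0x1000) == 7
```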
2. The high-performance heterogeneous secure memory of claim 1, wherein at initialization the HSMEM transmission engine generates keys with the peer device and builds a traditional secure channel; when data is to be transferred, the HSMEM transmission engine prepares the metadata that needs to be transferred with the data; the secure channel is also used to exchange basic system configuration during initialization;
the HSMEM transmission engine calculates the destination address of the metadata corresponding to the data, then controls the DMA engine to copy the metadata to the destination location, while the GPU uses its own DMA engine to transfer the data.
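The metadata destination-address calculation above can be sketched as a simple index computation. The constants (`METADATA_BASE`, `COVERAGE`, `RECORD_SIZE`) are assumptions for illustration; the claim does not fix these values.

```python
# Sketch: map a transferred data address to the address of its metadata
# record, assuming a fixed-size metadata record per covered region.
# All constants below are invented for the example.

METADATA_BASE = 0x8000_0000   # start of the receiver's metadata area (assumed)
COVERAGE = 4096               # bytes of data covered per metadata record
RECORD_SIZE = 64              # bytes per metadata record (assumed)

def metadata_dest(data_addr, data_base=0):
    """Compute the destination address of the metadata for a data address,
    as the HSMEM transmission engine would before programming the DMA."""
    index = (data_addr - data_base) // COVERAGE
    return METADATA_BASE + index * RECORD_SIZE

assert metadata_dest(0) == 0x8000_0000
assert metadata_dest(8192) == 0x8000_0000 + 2 * RECORD_SIZE
```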
3. The high-performance heterogeneous secure memory according to claim 1, wherein the HSMEM transmission engine comprises an initialization module which, at initialization, operates as follows:
when the system boots or the GPU is plugged in, the CPU and the GPU mutually authenticate, establish a traditional secure channel, and negotiate several shared keys for memory encryption over that channel; in addition, the CPU and the GPU exchange memory layout information, including the integrity tree format, the encryption algorithm type and the addresses of the metadata areas; the GPU also provides the CPU with its number of memory controllers and its memory addressing scheme; after these operations finish, the CPU and the GPU each initialize their own secure memory;
the secure channel also ensures freshness of the transmitted data by using monotonically increasing initialization vectors for AES-GCM encryption.
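The freshness guarantee above reduces to a replay check on the receiver side: any message whose IV does not strictly increase is rejected. A minimal sketch, with the AES-GCM cipher itself omitted and all names invented:

```python
# Sketch: receiver-side freshness check for a channel whose sender uses
# monotonically increasing IVs. Real AES-GCM decryption/authentication
# is elided; only the IV-ordering logic of the claim is modeled.

class SecureChannelReceiver:
    def __init__(self):
        self.last_iv = -1  # highest IV accepted so far

    def accept(self, iv, payload):
        """Reject any message whose IV does not strictly increase,
        which defeats replay of previously captured ciphertexts."""
        if iv <= self.last_iv:
            raise ValueError("stale or replayed message")
        self.last_iv = iv
        return payload

rx = SecureChannelReceiver()
assert rx.accept(0, b"config") == b"config"
assert rx.accept(1, b"keys") == b"keys"
try:
    rx.accept(1, b"replay")   # replaying IV 1 is rejected
except ValueError:
    pass
```

Strictly increasing IVs also satisfy AES-GCM's requirement that a (key, IV) pair is never reused.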
4. The high-performance heterogeneous secure memory according to claim 1, wherein the HSMEM transmission engine comprises a meta-information selection module, specifically:
when data is transmitted through the HSMEM transmission channel, the security metadata is transmitted together with the data; for a BMT integrity tree, the counters and the message authentication code (MAC) values are transmitted together; each transfer covers at least one whole subtree, so the minimum transfer granularity is the memory size covered by a single integrity tree node; the selected subtree must satisfy that all counters in its root node relate to the data being transferred, and for a subtree root node, the counter in its parent node is called the subtree root counter; the CPU and the GPU use split counters in their integrity trees; the integrity tree node fan-out of the GPU is 128 and its cache line size is 128 bytes, while the fan-out of the CPU is 64 and its cache line size is 64 bytes, so the granularity of CPU-to-GPU transfers differs from that of GPU-to-CPU transfers; the covered size is the fan-out multiplied by the cache line size, giving a granularity of 4KB for CPU-to-GPU transfers, and only data covered by a leaf node of the integrity tree can be transferred;
the HSMEM transmission engine uses the address of the data to be transferred to determine which integrity tree nodes should be transferred with the data, finding one or more subtrees that cover the range of the transferred data; if some counters in a root node do not belong to the data being transferred, it continues searching for subtrees at lower levels of the integrity tree.
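The granularity arithmetic in claim 4 can be checked directly: a leaf node covers fan-out × cache-line-size bytes. The fan-out and line sizes follow the claim; the helper names are hypothetical.

```python
# Sketch: transfer-granularity calculation and leaf-subtree selection
# for the meta-information selection module. Fan-out and cache-line
# values are taken from the claim; function names are invented.

CPU_FANOUT, CPU_LINE = 64, 64     # CPU tree: 64-ary nodes, 64-byte lines
GPU_FANOUT, GPU_LINE = 128, 128   # GPU tree: 128-ary nodes, 128-byte lines

def leaf_coverage(fanout, line_size):
    """Bytes of data covered by one leaf node of the integrity tree,
    i.e. the minimum transfer granularity."""
    return fanout * line_size

def covering_leaves(start, length, coverage):
    """Index range of the leaf subtrees covering [start, start+length)."""
    first = start // coverage
    last = (start + length - 1) // coverage
    return first, last

# CPU-to-GPU transfers use the CPU tree: 64 * 64 B = 4 KB granularity.
assert leaf_coverage(CPU_FANOUT, CPU_LINE) == 4096
# GPU-to-CPU transfers use the GPU tree: 128 * 128 B = 16 KB granularity.
assert leaf_coverage(GPU_FANOUT, GPU_LINE) == 16384

# A 10 KB transfer starting at offset 0 needs CPU-tree leaves 0..2.
assert covering_leaves(0, 10 * 1024, 4096) == (0, 2)
```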
5. The high-performance heterogeneous secure memory according to claim 1, wherein in the mode selection module, on both the CPU and the GPU, the memory controller knows the current mode of every memory block, and a block transferred from the CPU to the GPU together with its security metadata remains protected by its original protection scheme;
the mode selection module is a hardware module in the memory controller that uses one bit to indicate the protection scheme of each memory block; the cache line size of the GPU is 128 bytes and that of the CPU is 64 bytes, so on the GPU the bitmap size is 1/2^10 of the total memory size and on the CPU it is 1/2^9; for 1GB of GPU memory the GPU needs 1MB of space for the bitmap, and for 1GB of system memory the CPU needs 2MB;
two levels of checking are used to reduce the overhead of accessing the bitmap: on the GPU, each bit of the first-level bitmap represents 128KB and is stored directly in an on-chip buffer inside the memory controller, while each bit of the second-level bitmap represents 128 bytes; the second-level bitmap is stored in secure memory and its integrity must be ensured, so a region that is always in the local mode is reserved to store the bitmap, always GPU mode on the GPU and always CPU mode on the CPU; a dedicated cache, called the bitmap cache, is provided for the second-level bitmap; its cache line size is also 128 bytes, so each cache line holds the bits covering 128KB of memory, and when a cache line is evicted from the bitmap cache its value is checked.
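The two-level bitmap of claim 5 can be sketched with plain byte arrays: a fast on-chip level with one bit per 128KB region, backed by a full bitmap with one bit per 128-byte block. Class and method names are invented; the sizes follow the claim.

```python
# Sketch: two-level mode bitmap (GPU side). One second-level bit per
# 128-byte block (1/2^10 of memory) plus one first-level bit per 128 KB
# region so that lookups in all-local regions skip the in-memory bitmap.

BLOCK = 128                 # bytes tracked per second-level bit
REGION = 128 * 1024         # bytes tracked per first-level bit

class ModeBitmap:
    def __init__(self, mem_size):
        self.l2 = bytearray(mem_size // BLOCK // 8)   # full bitmap (in memory)
        self.l1 = bytearray(mem_size // REGION // 8)  # summary bitmap (on-chip)

    def set_foreign(self, addr):
        """Mark a block as protected by the non-local (CPU) mode."""
        b = addr // BLOCK
        self.l2[b // 8] |= 1 << (b % 8)
        r = addr // REGION
        self.l1[r // 8] |= 1 << (r % 8)

    def is_foreign(self, addr):
        r = addr // REGION
        if not (self.l1[r // 8] >> (r % 8)) & 1:
            return False          # fast path: whole 128 KB region is local
        b = addr // BLOCK
        return bool((self.l2[b // 8] >> (b % 8)) & 1)

bm = ModeBitmap(1 << 30)          # 1 GB of GPU memory...
assert len(bm.l2) == 1 << 20      # ...needs a 1 MB second-level bitmap
bm.set_foreign(0x2000)
assert bm.is_foreign(0x2000) and not bm.is_foreign(0x4000)
```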
6. The high-performance heterogeneous secure memory of claim 1, wherein in the data encryption module the CPU and GPU transmit data directly to each other without re-encryption; encrypting or decrypting a memory block requires an address and a counter, and it is considered unsafe for two blocks to be encrypted with the same encryption key, address and counter; therefore, each time a dirty memory block is evicted from the cache and written to memory, the counter of that memory block is incremented, and if the CPU and GPU use a shared encryption key, a unique address must be maintained for every memory block on the CPU and GPU, so the address used for memory encryption differs from the local physical address on the CPU or GPU;
all memory, including system memory and accelerator memory, is mapped into a single address space called the encryption address space (EAS) for encryption; addresses in the EAS are used only for encryption and decryption, and the addressing of normal memory accesses remains unchanged; the EAS is implemented by providing each device with a starting address, and during data transmission the receiver requires additional information about the EAS address of the ciphertext in order to decrypt the transmission;
the data encryption module first obtains the mode from the mode selection module, then obtains the correct counter from the integrity tree module, finds the correct EAS address, and finally decrypts the data block;
the CPU uses half of the MAC to verify its own 64-byte cache line.
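The EAS construction above amounts to giving each device a disjoint base address, so the address used as an encryption tweak is globally unique even when local physical addresses collide. A minimal sketch; the base values are assumptions, not values from the claim.

```python
# Sketch: encryption address space (EAS). Each device gets a disjoint
# starting address; the EAS address of a block is the device base plus
# its local physical address. The base values below are invented.

EAS_BASE = {"cpu": 0x0000_0000_0000, "gpu0": 0x1000_0000_0000}

def eas_address(device, local_phys_addr):
    """Translate a device-local physical address into the shared EAS;
    this address is used only as the encryption/decryption tweak."""
    return EAS_BASE[device] + local_phys_addr

# The same local physical address maps to distinct EAS addresses,
# so the CPU and GPU can safely share one encryption key.
assert eas_address("cpu", 0x1000) != eas_address("gpu0", 0x1000)
assert eas_address("gpu0", 0x1000) == 0x1000_0000_1000
```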
7. The high-performance heterogeneous secure memory of claim 1, wherein the memory access flow of the multi-mode protection engine is:
step 11, obtain the mode of the currently accessed block through the mode selection module; if it is CPU mode, perform the following steps, and if it is GPU mode, use the ordinary GPU secure-memory access path;
step 12, when the currently accessed block is fetched from DRAM, verify it through the CPU-mode integrity tree, verifying and decrypting it with the integrity tree module and the data encryption module;
step 13, place the decrypted and verified data into the cache for access;
step 14, after the access finishes, if the block was not modified, simply evict it from the cache; if the block was modified, read the GPU-mode counter, re-encrypt the block, update the GPU-mode metadata and write it back; when the GPU encrypts the block and writes it to memory, it does not use the EAS address provided by the CPU but its own EAS address;
and step 15, when the block is switched back to GPU mode, notify the mode selection module, which updates the corresponding information and marks the current memory block as changed back to GPU mode.
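The five steps above can be sketched as two functions over plain dictionary state. The hardware modules of the claims are replaced by simple lookups, and cryptography is elided; every name and the data layout are invented for illustration.

```python
# Sketch: the multi-mode protection engine's access flow (steps 11-15)
# modeled in software. "state" stands in for the mode selection module,
# integrity tree module, cache and DRAM of the claims.

def access_block(addr, state):
    """Steps 11-13: look up the mode, verify CPU-mode blocks via the
    CPU-mode counter, and place the result in the cache."""
    if state["mode"].get(addr, "gpu") == "gpu":      # step 11: GPU mode ->
        return state["memory"][addr]                 # ordinary secure path
    counter = state["cpu_counters"][addr]            # step 12: verified counter
    data = state["memory"][addr]                     # (decryption elided)
    state["cache"][addr] = data                      # step 13: fill cache
    return data

def evict_block(addr, state, dirty):
    """Steps 14-15: drop clean blocks; re-encrypt dirty blocks under the
    GPU's own EAS address and flip the block back to GPU mode."""
    data = state["cache"].pop(addr)
    if dirty:
        state["gpu_counters"][addr] = state["gpu_counters"].get(addr, 0) + 1
        state["memory"][addr] = data                 # write-back (re-encrypted)
        state["mode"][addr] = "gpu"                  # step 15: notify selector

state = {"mode": {0x100: "cpu"}, "memory": {0x100: b"blk"},
         "cpu_counters": {0x100: 3}, "gpu_counters": {}, "cache": {}}
assert access_block(0x100, state) == b"blk"
evict_block(0x100, state, dirty=True)
assert state["mode"][0x100] == "gpu" and state["gpu_counters"][0x100] == 1
```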
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311527819.9A CN117521167B (en) | 2023-11-15 | 2023-11-15 | High-performance heterogeneous secure memory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117521167A CN117521167A (en) | 2024-02-06 |
CN117521167B true CN117521167B (en) | 2025-03-25 |
Family
ID=89741440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311527819.9A Active CN117521167B (en) | 2023-11-15 | 2023-11-15 | High-performance heterogeneous secure memory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117521167B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102365263B1 (en) * | 2020-11-23 | 2022-02-21 | 한국과학기술원 | Efficient Encryption Method and Apparatus for Hardware-based Secure GPU Memory |
CN115470532A (en) * | 2021-06-10 | 2022-12-13 | 华为技术有限公司 | Method and device for defending channel attack on cache side |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110945509B (en) * | 2017-08-03 | 2023-08-11 | Arm有限公司 | Apparatus and method for controlling access to data in a protected memory region |
GB2579849B (en) * | 2018-12-18 | 2021-08-25 | Advanced Risc Mach Ltd | Integrity tree for memory integrity checking |
GB2594062B (en) * | 2020-04-14 | 2022-07-27 | Advanced Risc Mach Ltd | Data integrity check for granule protection data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |