Detailed Description
As described in the background section, with the increasing demand for data processing, the demand on the memory capacity of graphics processing units (GPUs) used to accelerate data operations is also increasing, and GPU memory capacity is becoming a bottleneck that limits performance improvement. Taking the currently popular large language model (Large Language Model, LLM) as an example, when performing mixed-precision training, a single model copy requires about 1260 GB of GPU memory, whereas the memory capacity of a single GPU used for LLM training is typically tens to hundreds of GB; multiple GPUs are therefore required to deploy and train the model together. For example, deploying a single 1260 GB LLM copy requires about sixteen 80 GB GPUs.
However, introducing multiple GPUs for training brings many difficulties. Although model parallelism techniques such as pipeline parallelism (Pipeline Parallelism) and tensor parallelism (Tensor Parallelism) successfully solve the problem of model partitioning, these techniques introduce additional requirements for high interconnect bandwidth and complex requirements for scheduling resources such as computation and communication.
Also, there are factors that make it difficult to expand the physical capacity of the GPU memory.
Since the memory chips of a GPU need to be packaged inside the GPU, taking high bandwidth memory (High Bandwidth Memory, HBM) as an example, only a limited number of HBM stacks (HBM Stacks) can be accommodated in an existing GPU, and the memory capacity of each HBM stack is fixed, so the physical capacity of GPU memory is limited by the available physical space.
Moreover, the pace at which the physical capacity of GPU memory can be expanded is lower than the pace at which memory capacity requirements grow, so the demand for memory capacity always exceeds what the physical GPU memory can supply. To ensure the computing performance of the GPU, the memory capacity requirement is therefore preferably addressed at the design and development stage of the GPU.
Memory compression technology can reduce memory occupation and is widely applied in CPU systems: a compression algorithm is applied to the data stored in memory to reduce its footprint, thereby delaying the performance degradation caused by insufficient memory in the CPU system. Memory compression techniques are equally applicable to GPUs.
An existing GPU memory expansion technology reduces the footprint of data in GPU memory by performing data compression between the GPU's data cache and its memory. However, since data is stored in GPU memory in a preset format, after the data is compressed, part of the memory space occupied by a page may no longer hold any data, so the compressed page data can generate memory fragments. To store the compressed page data in an orderly manner, the memory manager in the GPU applies for memory pages again in a page re-allocation manner to store the compressed page data. However, page reallocation not only occupies a certain amount of bandwidth but also increases memory access latency, thereby reducing the efficiency of data storage and access.
In order to solve the above problems, the present invention provides a data storage method, a data reading method, and a graphics processor, so as to improve the data storage and access efficiency in the GPU memory.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a data storage method according to an embodiment of the present invention. Referring to Fig. 1, the data storage method is applied to a graphics processor and includes the following steps.
And step S110, acquiring page data to be compressed.
Data generated during the running of the program by the GPU cannot be contained entirely in a cache (e.g., an L1 cache) internal to the GPU computing core and needs to be stored in GPU memory for subsequent access. The page data to be compressed may be data generated by a computing core of the GPU and not accommodated in a cache of the GPU computing core.
After acquiring the page data to be compressed, step S120 is continued.
And step S120, compressing the page data to be compressed by utilizing a predetermined target compression algorithm to obtain compressed page data.
The target compression rate of the target compression algorithm is used to determine the amount of data that can be stored in the graphics processor memory.
For the page data to be compressed, compression can be performed using the predetermined target compression algorithm, thereby reducing the capacity that the page data occupies in graphics processor memory and achieving an expansion of the effective memory capacity of the graphics processor.
In some embodiments, to determine the target compression algorithm to improve the compression performance of the page data to be compressed, the following steps are performed before the step S110 is performed.
And step S100, acquiring debug page data.
Before the graphics processor runs the program, it needs to test according to the test data set to analyze the performance of the program. The debug page data is data generated when the graphics processor runs a program according to the test data set. After obtaining the debug page data, step S101 is continuously executed.
And step S101, compressing the debug page data multiple times using a plurality of candidate compression algorithms; for each candidate compression algorithm, determining the compression rate of each compression of the debug page data, and obtaining the average compression rate of the debug page data based on the compression rates determined each time.
The resulting data may vary considerably due to the type of program running on the graphics processor. Different candidate compression algorithms have different compression performances for different data, so that multiple candidate compression algorithms can be utilized to compress the debug page data for multiple times, thereby obtaining the average compression rate of each compression algorithm on the debug page data, and the compression algorithm suitable for the program can be selected according to the requirement subsequently so as to improve the performance of the graphics processor.
The candidate compression algorithm is various common compression algorithms capable of performing data compression, such as Huffman (Huffman) coding, LZ77 (compression algorithm based on sliding window), DEFLATE (compression algorithm combining LZ77 algorithm and Huffman coding), and the like, which are widely applied to file compression and data transmission, and can completely recover original data.
After determining the average compression rate of the debug page data under each compression algorithm, step S102 is continuously executed.
And step S102, selecting the candidate compression algorithm corresponding to the maximum value among the average compression rates as the target compression algorithm, and taking the average compression rate of the target compression algorithm as the target compression rate.
Selecting the candidate compression algorithm corresponding to the maximum value in the average compression rate can further reduce the data storage pressure of the internal memory of the graphic processor, thereby improving the expansion performance of the internal memory capacity of the graphic processor. In some other embodiments, the target compression algorithm may also be selected according to other performance requirements, for example, a compression algorithm with the highest compression speed and an average compression rate capable of meeting the data storage requirement of the graphics processor memory is selected as the target compression algorithm.
In some embodiments, after the step S102 is performed, the target compression rate is saved for use in subsequent steps.
It will be appreciated that in some embodiments, the target compression algorithm may be determined directly. For example, a compression algorithm default used in a graphics processor is used as the target compression algorithm, or a specific compression algorithm is used as the target compression algorithm, so that the steps S100 to S102 are not required to be executed to determine the target compression algorithm, the logic of the data storage method is simplified, and the execution efficiency of the data storage method is improved.
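The selection flow of steps S100 to S102 can be sketched as follows. This is an illustrative Python sketch, not the patented implementation: standard-library codecs stand in for the candidate algorithms (Huffman coding, LZ77, DEFLATE) named above, and all function names are hypothetical.

```python
import bz2
import lzma
import zlib

# Hypothetical candidate set: stdlib codecs stand in for the candidate
# compression algorithms (Huffman, LZ77, DEFLATE) named in the text.
CANDIDATES = {
    "deflate": zlib.compress,
    "bzip2": bz2.compress,
    "lzma": lzma.compress,
}

def average_compression_rate(compress, debug_pages):
    # Compression rate as defined later in the text:
    # uncompressed data amount / compressed data amount.
    rates = [len(page) / len(compress(page)) for page in debug_pages]
    return sum(rates) / len(rates)

def select_target_algorithm(debug_pages):
    # Steps S101/S102: compress the debug page data with every candidate
    # and pick the one with the highest average compression rate.
    averages = {
        name: average_compression_rate(fn, debug_pages)
        for name, fn in CANDIDATES.items()
    }
    target = max(averages, key=averages.get)
    return target, averages[target]
```

As noted above, a different selection criterion (e.g., compression speed subject to a minimum average rate) could be substituted in `select_target_algorithm` without changing the surrounding flow.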
With continued reference to fig. 1, after step S120 is performed, step S130 is continued to be performed.
And step S130, storing, of the compressed page data, the page data whose data amount equals the storable data amount into the graphics processor memory, storing the page data whose data amount exceeds the storable data amount into the applied non-graphics processor memory, and recording the storage address of that page data in the applied non-graphics processor memory.
Wherein the applied non-graphics processor memory is memory applied for in a processor external to the graphics processor, and the storage capacity of the applied non-graphics processor memory is determined based at least on the target compression rate.
The graphics processor memory is global memory on a graphics processor executing the data storage method, and the applied non-graphics processor memory is memory applied on other processors than the graphics processor executing the data storage method, for example, memory of a CPU. In some embodiments, the graphics processor has a direct connection with the non-graphics processor to increase the speed of subsequent access to data stored in the graphics processor memory and the non-graphics processor memory. For example, the graphics processor and non-graphics processor may form a direct connection through high bandwidth interconnect technologies such as NVLink, CXL, etc. to improve the performance of the graphics processor.
Since the compressed page data obtained by compressing the page data to be compressed with the predetermined target compression algorithm may have varying data amounts, directly storing the compressed page data into graphics processor memory could leave some compressed page data whose amount exceeds the storable data amount of the graphics processor memory. This would cause the memory manager in the graphics processor to perform page reallocation on the graphics processor memory, reducing the efficiency with which the graphics processor stores and accesses data.
Meanwhile, each compression algorithm achieves its own compression rate; that is, the predetermined target compression algorithm has a corresponding target compression rate. After the page data to be compressed is compressed according to the target compression algorithm, the data amount of the compressed page data is at least equal to the storable data amount, and may exceed it. It can be understood that, since the data amount of the compressed page data is at least equal to the storable data amount, slicing the compressed page data according to the storable data amount leaves no wasted space in the graphics processor memory.
Therefore, of the compressed page data, the page data whose amount equals the storable data amount is stored in the graphics processor memory, which prevents the memory manager in the graphics processor from performing page reallocation on the graphics processor memory and thereby improves data storage and access efficiency. Further, besides the page data equal to the storable data amount, the compressed page data also includes page data exceeding the storable data amount; this data is stored in the applied non-graphics processor memory so that the compressed page data is stored in its entirety, which guarantees the data storage and access performance of the graphics processor and avoids re-applying for page memory through page reallocation.
In some embodiments, after step S120 is performed to obtain the compressed page data, the data storage method further includes determining a corresponding actual compression rate based on the compressed page data. The actual compressed data amount of the compressed page data can be calculated from the actual compression rate.
Specifically, in some embodiments, the actual compression rate is the ratio of the uncompressed data amount of the page data to be compressed to the actual compressed data amount of the compressed page data. For example, if the uncompressed data amount of the page data to be compressed is 4 KB and the actual compressed data amount of the compressed page data is 2 KB, then the actual compression rate is 4 KB / 2 KB = 2.
After determining the actual compression rate corresponding to the compressed page data, the actual compression rate is compared with the target compression rate.
If the actual compression rate is smaller than the target compression rate, it is determined that the actual compressed data amount is larger than the storable data amount, and step S130 is executed: the page data whose amount equals the storable data amount is stored in the graphics processor memory, the page data whose amount exceeds the storable data amount is stored in the applied non-graphics processor memory, and the storage address of that page data in the applied non-graphics processor memory is recorded.
If the actual compression rate is greater than or equal to the target compression rate, determining that the actual compressed data volume is less than or equal to the storable data volume, which means that there is no page data exceeding the storable data volume, and the storable data volume of the graphics processor memory satisfies the storing of the compressed page data, the compressed page data may be directly stored into the graphics processor memory. Therefore, the compressed page data can be accurately and properly stored completely by judging the actual compression rate and the target compression rate, so that the data storage efficiency is improved.
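The decision just described (compare the actual compression rate with the target compression rate, then either store the whole page in graphics processor memory or split it) can be sketched as follows. This assumes, purely for illustration, that the storable data amount per page is the page size divided by the target compression rate; the function name and parameters are hypothetical.

```python
def split_for_storage(compressed: bytes, page_size: int, target_rate: int):
    # Storable data amount per page, assumed here to be derived from the
    # target compression rate (illustrative assumption, e.g. 4096 // 4 = 1024).
    storable = page_size // target_rate
    actual_rate = page_size / len(compressed)
    if actual_rate >= target_rate:
        # Actual rate meets the target: the compressed page fits entirely
        # in the graphics processor memory slot, nothing overflows.
        return compressed, b""
    # Otherwise graphics processor memory keeps exactly `storable` bytes
    # and the remainder overflows to the applied non-graphics processor memory.
    return compressed[:storable], compressed[storable:]
```

For a 4 KB page and a target compression rate of 4, a 1500-byte compressed page would yield a 1024-byte GPU-resident part and a 476-byte overflow part under this sketch.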
It can be seen that the data storage method provided by the embodiment of the invention is applied to a graphics processor. Page data to be compressed is first acquired and compressed with a predetermined target compression algorithm to obtain compressed page data, where the target compression rate of the target compression algorithm determines the storable data amount in the graphics processor memory. Because the compressed page data obtained after compression may exceed the data amount (the storable data amount) determined by the target compression rate, in the embodiment of the invention, for a given piece of compressed page data, the page data whose amount equals the storable data amount is stored into the graphics processor memory, while the page data whose amount exceeds the storable data amount is stored into the applied non-graphics processor memory and its storage address is recorded. Since the applied non-graphics processor memory is determined at least based on the target compression rate of the target compression algorithm, the compressed page data can be stored completely in one pass.
Therefore, the method not only ensures that the data stored in the graphics processor memory has no gaps, but also avoids the graphics processor re-applying, in a page reallocation manner, for memory to hold the page data in the compressed page data that exceeds the storable data amount, thereby improving the efficiency of data storage in graphics processor memory. When the compressed page data stored in the applied non-graphics processor memory needs to be used later, it can be obtained directly from the recorded storage address. Thus, on the basis of reducing the capacity occupied in graphics processor memory and expanding the effective memory capacity of the graphics processor, the additional overhead of storing compressed page data through page reallocation is avoided, achieving the goal of improving the efficiency of data storage in graphics processor memory.
In some embodiments, after determining the target compression algorithm, the data storage method further comprises the following steps before the step S110 is performed.
The available capacity of the graphics processor memory is obtained. This available capacity is the maximum memory capacity that a program running on the graphics processor can apply for, so that in subsequent steps the capacity usable by the program can be expanded to the maximum extent.
After the step of obtaining the available capacity of the graphics processor memory is performed, the data storage method further comprises determining an application capacity on the non-graphics processor memory based on the available capacity, the applied-for capacity serving as the applied non-graphics processor memory. The application capacity is determined by multiplying the available capacity by the difference between the target compression rate and a first value.
The application capacity of the non-graphics processor memory is the maximum memory capacity that a program running on the graphics processor can apply for on the non-graphics processor memory. To ensure that the applied non-graphics processor memory can accommodate all data amounts exceeding what the target compression rate determines, the application capacity equals the available capacity multiplied by the difference between the target compression rate and a first value, where the first value may be 1. Thus, before the graphics processor memory is fully occupied, the non-graphics processor memory will not cause data storage or access anomalies due to insufficient capacity.
In some specific embodiments, the applied non-graphics processor memory is further determined according to a pre-configured non-graphics processor memory capacity threshold, so as to avoid the excessive capacity occupation of the non-graphics processor memory, which affects the performance of the non-graphics processor.
For example, after the target compression algorithm is determined, suppose its target compression rate is 4 and the actual memory capacity of the graphics processor memory is 40 GB. A memory space of 40 GB is then applied for on the graphics processor memory, and a memory space of 40 GB × (4 - 1) = 120 GB is applied for on the non-graphics processor memory (e.g., CPU memory) to ensure normal running of the program. If, in this example, the operating system or the program is preconfigured with a non-graphics processor memory capacity threshold of 100 GB, then a memory space of 100 GB is applied for on the non-graphics processor memory instead of 120 GB.
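The capacity arithmetic of this example can be condensed into a small helper. The function name is hypothetical; the optional parameter corresponds to the preconfigured non-graphics processor memory capacity threshold described above.

```python
def host_application_capacity(available_gb, target_rate, threshold_gb=None):
    # Application capacity = available capacity x (target rate - first value),
    # with the first value taken as 1, clamped by the optional preconfigured
    # non-graphics processor memory capacity threshold.
    capacity = available_gb * (target_rate - 1)
    if threshold_gb is not None:
        capacity = min(capacity, threshold_gb)
    return capacity
```

With the numbers above, `host_application_capacity(40, 4)` gives 120 GB, and `host_application_capacity(40, 4, 100)` clamps the result to 100 GB.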
In some embodiments, the non-graphics processor memory is page-locked memory, so as to avoid swapping the data stored in the non-graphics processor memory to the hard disk. Because data in page-locked memory is not swapped out to the hard disk, the graphics processor's access performance to data in the non-graphics processor memory can be guaranteed.
In some specific embodiments, after the step of taking the average compression rate of the target compression algorithm as the target compression rate, the following steps are further performed.
A base address of the applied non-graphics processor memory is stored; combined with the storage address, the base address indicates the storage position, in the applied non-graphics processor memory, of the page data exceeding the storable data amount.
After the target compression rate is stored, when the actual compression rate and the target compression rate need to be compared, the stored target compression rate can be read, and the efficiency of a data storage method is improved. And after the base address of the applied non-graphics processor memory is stored, when the compressed page data stored in the applied non-graphics processor memory needs to be accessed, quick and accurate memory access can be performed according to the base address.
In some embodiments, the actual compression rate and the storage address are recorded in a page table entry (Page Table Entry) of the compressed page data, and a page compression flag of the compressed page data is also recorded in the page table entry. The page compression flag indicates whether the page data has been compressed, and the actual compression rate is the ratio of the uncompressed data amount of the page data to be compressed to the actual compressed data amount of the compressed page data.
By recording the page compression flag, the actual compression rate and the storage address, when compressed page data stored in the internal memory of the graphic processor and the applied non-graphic processor respectively are accessed later, the compressed page data can be read based on the page compression flag, the actual compression rate and the storage address, so that the data access performance of the graphic processor is improved.
In some embodiments, in the page table entry of the compressed page data, the page compression flag, the actual compression rate, and the storage address (i.e., the offset address in the applied non-graphics processor memory) occupy 1 bit, 3 bits, and 16 bits, respectively.
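Assuming the three fields are packed contiguously from the low bit (the text specifies only the field widths, not their positions within a real page table entry), they could be packed and unpacked as follows; the layout and function names are illustrative assumptions.

```python
def pack_pte_fields(compressed_flag: int, rate_field: int, offset: int) -> int:
    # Assumed layout (low bit first):
    #   bit 0      : page compression flag (1 bit)
    #   bits 1..3  : actual compression rate field (3 bits)
    #   bits 4..19 : storage offset in the applied memory (16 bits)
    assert compressed_flag in (0, 1)
    assert 0 <= rate_field < (1 << 3)
    assert 0 <= offset < (1 << 16)
    return (offset << 4) | (rate_field << 1) | compressed_flag

def unpack_pte_fields(entry: int):
    # Inverse of pack_pte_fields: recover (flag, rate, offset).
    return entry & 0x1, (entry >> 1) & 0x7, (entry >> 4) & 0xFFFF
```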
To further improve the data access performance of the graphics processor, in some embodiments, after the step S120, before the step of determining the corresponding actual compression rate based on the compressed page data, the data storage method further includes storing the compressed page data into a compressed data cache of the graphics processor.
The compressed data cache can reduce the frequency of access to GPU/CPU DRAM and thereby improve overall system performance. Reading compressed page data from the applied non-graphics processor memory in CPU DRAM requires address indexing, which involves accessing the TLB (Translation Lookaside Buffer) and the page table, as well as obtaining and using the base address of the applied non-graphics processor memory. Caching the compressed page data directly in the compressed data cache reduces the access overhead of going to GPU/CPU DRAM.
When a computing core in the graphics processor accesses the compressed data cache, the access is faster than accessing graphics processor memory, so preferentially storing the compressed page data in the compressed data cache improves the data access performance of the graphics processor. When the compressed data cache cannot accommodate the compressed page data, i.e., the compressed page data needs to be stored into memory, the step of comparing the actual compression rate with the target compression rate continues to be executed, thereby further improving the data storage and reading performance of the graphics processor.
The embodiment of the invention further provides a data reading method applied to the graphics processor. Referring to Fig. 2, the data reading method includes the following steps.
Step S201, a data read request is acquired, the data read request including a data read address.
A program running on the GPU sends a data read request through a computing core of the GPU, so as to obtain data from a cache or memory of the GPU.
After the data reading request is acquired, step S202 is performed.
Step S202, when it is determined that the page data corresponding to the data read address is compressed page data stored in both the graphics processor memory and the applied non-graphics processor memory, acquiring, from the graphics processor memory based on the data read address, the page data whose amount equals the storable data amount, and acquiring, from the applied non-graphics processor memory based on the storage address in the data read request, the page data whose amount exceeds the storable data amount, so as to obtain the compressed page data.
The compressed page data is data stored by the data storage method according to any one of the foregoing embodiments, and the storable data amount is determined by a target compression rate of a target compression algorithm adopted by the data storage method.
If the page data corresponding to the data read address is compressed page data (for example, the page compression flag of its page table entry records that the page has been compressed, and the page table entry records a storage address in the applied non-graphics processor memory), this indicates that the compressed page data corresponding to the data read address is stored in both the graphics processor memory and the applied non-graphics processor memory. At this time, the page data can be read from the graphics processor memory and the non-graphics processor memory respectively to form the compressed page data.
In some specific embodiments, when it is determined that the page table entry of the compressed page data corresponding to the data read address records a storage address of the applied non-graphics processor memory, the method further includes the steps of obtaining a base address of the applied non-graphics processor memory, and obtaining a physical address of the non-graphics processor memory based on the base address and the storage address.
For example, the storage address recorded for the applied non-graphics processor memory is an address offset (offset address) of the compressed page data within that memory. The base address of the applied non-graphics processor memory therefore needs to be obtained, and the physical address in the non-graphics processor memory is computed from the storage address on the basis of the base address; this physical address is the actual physical location where the compressed page data is stored in the non-graphics processor memory.
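The base-plus-offset translation can be sketched as follows. Because the text does not state the granularity of the 16-bit offset, this sketch assumes, purely for illustration, that the offset counts whole pages; the page size and function name are likewise hypothetical.

```python
PAGE_SIZE = 4096  # illustrative page size, not specified in the text

def host_physical_address(base_address: int, storage_offset: int) -> int:
    # Physical location in the applied non-graphics processor memory:
    # base address of the applied region plus the offset recorded in the
    # page table entry (page granularity is an assumption).
    return base_address + storage_offset * PAGE_SIZE
```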
If the page table entry of the compressed page data corresponding to the data read address does not record the storage address of the applied non-graphics processor memory, the compressed page data corresponding to the data read address is only stored in the graphics processor memory, and only the compressed page data corresponding to the data read address is required to be read from the graphics processor memory in the follow-up process.
In some specific embodiments, the step S202 includes the steps of:
Page data whose amount equals the storable data amount is obtained from the graphics processor memory according to the data read address; page data corresponding to the physical address of the non-graphics processor memory, i.e., the data exceeding the storable data amount, is obtained from the applied non-graphics processor memory; and the compressed page data is obtained by combining the page data obtained from the graphics processor memory with the page data obtained from the applied non-graphics processor memory.
The compressed page data is stored partly in the graphics processor memory and partly in the applied non-graphics processor memory: the page data whose amount equals the storable data amount resides in the graphics processor memory, while the page data exceeding the storable data amount resides in the applied non-graphics processor memory. The complete compressed page data therefore has to be read from both memories, so that it can subsequently be decompressed to obtain the decompressed page data.
After the compressed page data is obtained, step S203 is continuously executed.
And step S203, determining a decompression algorithm for the compressed page data based on the target compression algorithm, and obtaining decompressed page data.
The target compression algorithm is a compression algorithm used by the compressed page data, so that a corresponding decompression algorithm can be determined according to a compression mode of the target compression algorithm so as to decompress the compressed page data.
After the decompressed page data is obtained, step S204 is continued.
And step S204, reading data from the decompressed page data.
After the compressed page data is decompressed, the graphics processor can read the data directly and access it as the program requires.
It can be seen that, in the data reading method provided by the embodiment of the present invention, the data read address of the data read request is first obtained. When the corresponding page data is determined to be compressed page data stored in both the graphics processor memory and the applied non-graphics processor memory, the compressed page data was compressed by the predetermined target compression algorithm, with the page data whose amount equals the storable data amount stored in the graphics processor memory and the page data exceeding the storable data amount stored in the applied non-graphics processor memory. The corresponding page data is therefore read from the graphics processor memory and the applied non-graphics processor memory, respectively, according to the data read address and the storage address it contains, yielding the complete compressed page data. The compressed page data is then decompressed using the target compression algorithm to obtain the page data that the graphics processor needs to access, thereby realizing the expansion of the memory capacity of the graphics processor.
In some embodiments, after performing step S201, further comprising:
determining whether the data read address hits in a cache of the graphics processor;
if yes, determining that the page data corresponding to the data read address has not been compressed, and reading the data requested by the data read request from the cache hit by the data read address;
if not, determining that the page data corresponding to the data read address is compressed page data.
The cache of the graphics processor may be a first-level cache (L1 cache), a second-level cache (L2 cache), or the compressed data cache of the graphics processor.
When the data read address hits in the L1 cache or the L2 cache, it is indicated that the page data corresponding to the data read address is not compressed, and the page data can be directly accessed in the L1 cache or the L2 cache, so as to execute the data read request.
The compressed data cache can store a certain amount of compressed page data. When the data read address misses the level one cache and the level two cache but hits the compressed data cache, the compressed page data is already present in the compressed data cache and can be read from it quickly. This avoids indexing the graphics processor memory and the applied non-graphics processor memory by the address of the compressed page data, and removes the need to access the TLB and page table of the graphics processor memory, thereby reducing the access cost of the graphics processor memory and the non-graphics processor memory and improving the data reading efficiency of the graphics processor.
When the compressed data cache is also missed, the physical address in the applied non-graphics processor memory is calculated from the data read address, the storage address, and the base address in the manner described above, and the compressed page data is then read according to that physical address together with the data read address.
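The three-level lookup described above can be sketched as follows. This is a minimal illustrative model, not the patent's hardware implementation: the caches and memories are modeled as dictionaries, and all names (`read_page`, `storage_offsets`, the identity `decompress`) are assumptions for illustration.

```python
def read_page(read_addr, l1_l2, compressed_cache, gpu_mem, cpu_mem,
              base_addr, storage_offsets, decompress):
    # 1) L1/L2 hit: the page was never compressed; return it directly.
    if read_addr in l1_l2:
        return l1_l2[read_addr]
    # 2) Compressed-data cache hit: the whole compressed page is cached,
    #    so no TLB or page-table access is needed.
    if read_addr in compressed_cache:
        return decompress(compressed_cache[read_addr])
    # 3) Full miss: gather the GPU-resident part and the CPU-resident
    #    overflow part, whose physical address is base + recorded offset.
    gpu_part = gpu_mem[read_addr]
    cpu_part = cpu_mem[base_addr + storage_offsets[read_addr]]
    return decompress(gpu_part + cpu_part)
```

A miss on all caches concatenates the two fragments before decompression, mirroring the split storage of one compressed page across both memories.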
The embodiment of the invention also provides a graphics processor. Referring to fig. 3, the graphic processor 1 includes:
The compressor 11 is used for acquiring page data to be compressed, and compressing the page data to be compressed by utilizing a predetermined target compression algorithm to obtain compressed page data, wherein the target compression rate of the target compression algorithm is used for determining the storable data quantity in the memory of the graphics processor;
The memory controller 12 is configured to store, in the graphics processor memory, the page data of the compressed page data whose data amount equals the storable data amount, to store the page data whose data amount exceeds the storable data amount into the applied non-graphics processor memory, and to record the storage address of that page data in the applied non-graphics processor memory;
Wherein the applied non-graphics processor memory is memory applied for in a processor external to the graphics processor, and the storage capacity of the applied non-graphics processor memory is determined based at least on the target compression rate.
Optionally, the memory controller 12 is further configured to, according to the acquired data read address of the data read request, when it is determined that the page data corresponding to the data read address is compressed page data stored in the graphics processor memory 14 and the applied non-graphics processor memory 21, acquire the page data whose data amount equals the storable data amount from the graphics processor memory 14 based on the data read address, and acquire the page data whose data amount exceeds the storable data amount from the applied non-graphics processor memory 21 based on the storage address in the data read request;
The compressor 11 is further configured to determine the decompression algorithm corresponding to the target compression algorithm, decompress the compressed page data with it, and obtain decompressed page data;
the graphics processor 1 further comprises:
and a data reading module, configured to read the requested data from the decompressed page data.
Optionally, the graphics processor 1 further includes:
The compression rate comparison module is used for determining the corresponding actual compression rate based on the compressed page data and comparing the actual compression rate with the target compression rate, wherein the actual compressed data amount of the compressed page data can be calculated from the actual compression rate;
If the actual compression rate is smaller than the target compression rate, determining that the actual compression data amount is larger than the storable data amount, storing page data with the data amount equal to the storable data amount in the graphic processor memory in the compressed page data, storing the page data with the data amount exceeding the storable data amount in the applied non-graphic processor memory, and recording the storage address of the page data in the applied non-graphic processor memory;
And if the actual compression rate is greater than or equal to the target compression rate, determining that the actual compressed data volume is less than or equal to the storable data volume, and storing the compressed page data into the memory of the graphics processor.
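The comparison and split-store rule of the two branches above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: a page of `page_size` bytes compressed exactly at the target rate occupies `page_size / target_rate` bytes (the storable data amount) in GPU memory, and any excess spills into the applied CPU-side memory. All names are assumptions.

```python
def store_compressed_page(compressed, page_size, target_rate,
                          gpu_mem, cpu_mem):
    storable = page_size // target_rate        # bytes reserved per page in GPU memory
    actual_rate = page_size / len(compressed)  # achieved compression rate
    if actual_rate >= target_rate:
        # Actual rate meets the target: the page fits entirely in the
        # GPU memory slot; no overflow address needs to be recorded.
        gpu_mem.append(compressed)
        return None
    # Actual rate below target: keep `storable` bytes on the GPU and
    # record where the overflow landed in the applied CPU-side memory.
    gpu_mem.append(compressed[:storable])
    storage_addr = len(cpu_mem)
    cpu_mem.append(compressed[storable:])
    return storage_addr
```

The returned `storage_addr` plays the role of the recorded storage address: it is what later, combined with the base address, locates the overflow fragment.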
Optionally, the graphics processor 1 further includes:
The performance analysis module is used for acquiring debug page data and compressing it multiple times with a plurality of candidate compression algorithms. For each candidate compression algorithm, it determines the compression rate of each compression pass over the debug page data and obtains the average compression rate of that algorithm from the rates determined in each pass. It then selects the candidate compression algorithm with the maximum average compression rate as the target compression algorithm, and takes the average compression rate of the target compression algorithm as the target compression rate.
The performance analysis module may provide a target compression algorithm and a target compression rate.
For example, the performance analysis module may be executed by:
first, the performance analysis module continuously scans the debug page data that is generated by GPU computation in the current performance analysis stage and temporarily stored in the GPU memory system;
These debug page data will then be continuously fed into the compressor 11, where the compressor 11 integrates various mainstream compression algorithms. At this time, all compression algorithms are used for compression, and the compression rate of each compression algorithm is calculated.
Finally, the compression rates over multiple iterations are averaged per compression algorithm. The compression algorithm with the highest mean compression rate is selected as the compression algorithm actually applied in the method, namely the target compression algorithm, and the corresponding mean compression rate is rounded and then written into the target compression rate register for subsequent use.
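The profiling pass above can be sketched as follows. This is a minimal model under stated assumptions: candidate algorithms are passed in as byte-compressing callables, the compression rate is taken as original size divided by compressed size, and the rounding of the mean rate is shown as truncation; the real module operates on hardware-integrated compressors.

```python
def pick_target_algorithm(debug_batches, candidates):
    """candidates: dict mapping algorithm name -> compress(bytes) -> bytes."""
    best_name, best_mean = None, 0.0
    for name, compress in candidates.items():
        # Compression rate of one pass = original size / compressed size.
        rates = [len(page) / len(compress(page))
                 for batch in debug_batches for page in batch]
        mean = sum(rates) / len(rates)
        # Keep the algorithm with the highest mean compression rate.
        if mean > best_mean:
            best_name, best_mean = name, mean
    # The mean rate is rounded before being written to the target
    # compression rate register (shown here as truncation).
    return best_name, int(best_mean)
```

With two toy candidates that keep half or a quarter of each page, the quarter-keeping candidate wins with a mean rate of x4.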
Optionally, the graphics processor 1 further includes:
The memory application module is used for acquiring the available capacity of the graphics processor memory and determining, based on the available capacity, the application capacity on the non-graphics processor memory as the applied non-graphics processor memory, wherein the application capacity is the storage capacity of the applied non-graphics processor memory and is determined by multiplying the available capacity by the difference between the target compression rate and a first value;
a target compression ratio register 121 for storing the target compression ratio.
The memory application module can be triggered by the performance analysis module to execute the application of the non-graphic processor memory, namely the initialization flow of the capacity expansion system.
The implementation process of the memory application module comprises the following steps:
firstly, detecting and acquiring actual parameters in a current host, including:
1) The available capacity of the GPU DRAM;
2) The available capacity of the CPU DRAM;
3) The upper-limit threshold, configured by the upper-layer user through a system environment variable, of the CPU DRAM capacity that the memory application module may occupy;
Then, taking the actually detected parameters and the data stored in the target compression rate register (the target compression rate) as input, the final initialization parameters of the current GPU are determined:
1) A final target compression rate;
2) The memory size applied on the CPU side.
For example, if the available capacity of the GPU DRAM is 40GB and the finalized target compression rate is x3, the memory size applied for on the CPU side is 80GB (40GB × (3 − 1)); if the available capacity of the GPU DRAM is 50GB and the finalized target compression rate is x4, the memory size applied for on the CPU side is 150GB (50GB × (4 − 1)).
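The sizing arithmetic above can be written as a one-line rule plus the user-configured cap mentioned in the parameter list. The capping behavior when the computed size exceeds the threshold is an assumption for illustration; the function name is hypothetical.

```python
def cpu_allocation_gb(gpu_capacity_gb, target_rate, cpu_cap_gb=None):
    # CPU-side application size = GPU capacity × (target rate − 1).
    size = gpu_capacity_gb * (target_rate - 1)
    # Respect the user-configured upper-limit threshold, if any
    # (assumed policy: clamp the allocation to the threshold).
    if cpu_cap_gb is not None and size > cpu_cap_gb:
        size = cpu_cap_gb
    return size
```

This reproduces the worked figures: 40GB at x3 yields 80GB, and 50GB at x4 yields 150GB.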
Then, page-locked memory (pinned memory) of the calculated size is applied for on the CPU side. Page-locked memory prevents its pages from being swapped out to disk on the CPU side. Meanwhile, the base address (memory head address) of the applied non-graphics processor memory is recorded in a newly added register on the GPU, namely the base address register of the fallback memory;
finally, by writing the designated hardware registers, the compressed data cache, the TLB, and the MMU on the GPU hardware are enabled, and the applied non-graphics processor memory together with the graphics processor memory is used to completely store the compressed page data.
Optionally, the graphics processor 1 further comprises a compressed data buffer 13 for storing the compressed page data.
Optionally, the graphics processor 1 further includes a base address register 122;
The base address register 122 is used for storing a base address of the applied non-graphics processor memory, and the base address is used in combination with the storage address to indicate a storage location of page data exceeding a storable data amount in the applied non-graphics processor memory.
It can be seen that, in the graphics processor 1 provided in the embodiment of the present invention, the computing core 10 generates the page data to be compressed, and the compressor 11 compresses it using the predetermined target compression algorithm to obtain compressed page data, where the target compression rate of the target compression algorithm determines the storable data amount in the graphics processor memory. Because the compressed data obtained after compression may fail to reach the data amount (the storable data amount) determined by the target compression rate of the predetermined target compression algorithm, in the embodiment of the present invention, for one piece of compressed page data, the page data whose data amount equals the storable data amount may be stored in the graphics processor memory 14, while the page data whose data amount exceeds the storable data amount may be stored in the applied non-graphics processor memory 21, with its storage address in the applied non-graphics processor memory 21 recorded. Since the capacity of the applied non-graphics processor memory 21 is determined at least based on the target compression rate of the target compression algorithm, the compressed page data can be stored completely in one pass.
Therefore, the method can ensure that the data stored in the graphics processor memory 14 has no gaps, and can also prevent the graphics processor 1 from having to re-store the compressed page data through page reallocation. By storing the page data exceeding the storable data amount in the applied memory, the data storage efficiency of the graphics processor memory 14 is improved, and when the compressed page data needs to be used later, the part stored in the applied non-graphics processor memory 21 can be acquired directly based on the recorded storage address. In this way, the capacity occupation of the graphics processor memory 14 is reduced, the extra overhead of storing compressed page data through page reallocation is avoided on the basis of expanding the memory capacity of the graphics processor 1, and the goal of improving the data storage efficiency of the graphics processor memory 14 is achieved.
Further, when the computing core in the graphics processor needs to access the compressed page data, the memory controller of the graphics processor first acquires the data read address of the data read request. When the corresponding page data is determined to be compressed page data stored in the graphics processor memory and the applied non-graphics processor memory, then, since the compressed page data was compressed by the predetermined target compression algorithm, with the page data whose data amount equals the storable data amount stored in the graphics processor memory and the page data whose data amount exceeds the storable data amount stored in the applied non-graphics processor memory, the page data needs to be read from the graphics processor memory and the applied non-graphics processor memory respectively according to the data read address and the storage address carried in it. The compressed page data is thereby obtained and then decompressed by the target compression algorithm, yielding the page data to be compressed that the graphics processor needs to access and realizing the memory capacity expansion of the graphics processor.
For convenience in describing the functional implementation of the graphics processor provided in the embodiment of the present invention, please refer to fig. 4, and fig. 4 is a schematic diagram illustrating a process of implementing the data storage method provided in the embodiment of the present invention.
In fig. 4, four pages of data to be compressed, PageA, PageB, PageC, and PageD, are compressed, and the target compression rate is illustrated as x4.
As shown in fig. 4, after each page data to be compressed is compressed by a predetermined target compression algorithm in the compressor, compressed page data is obtained, and each page data to be compressed correspondingly generates an actual compression rate:
the actual compression rate of PageA after compression is x1;
the actual compression rate of PageB after compression is x1.33;
the actual compression rate of PageC after compression is x2;
the actual compression rate of PageD after compression is x4.
Then, for each page of data to be compressed, the actual compression rate after compression is compared with the target compression rate, and each compressed page whose actual compression rate is smaller than the target compression rate x4 is stored into the graphics processor memory and the applied non-graphics processor memory (namely the CPU fallback memory in fig. 4) respectively.
As shown in fig. 4, the actual compression rates of PageA, PageB, and PageC are smaller than the target compression rate, so for each of them the compressed page data whose data amount equals the data amount determined by the target compression rate (the storable data amount) is stored in the GPU memory, and the compressed page data whose data amount exceeds the storable data amount is stored in the CPU fallback memory (the applied non-graphics processor memory).
The actual compression rate of PageD equals the target compression rate, so its compressed page data can be stored directly in the GPU memory.
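The fig. 4 example can be reproduced numerically. Assuming for illustration a 4KB page and target rate x4, each page reserves 1KB (the storable data amount) in GPU memory, and any remainder spills into the applied CPU-side memory; the page size is an assumption not stated in the source.

```python
page_size, target_rate = 4096, 4
storable = page_size // target_rate  # 1024 bytes per page in GPU memory

# Actual compression rates from fig. 4.
actual_rates = {"PageA": 1.0, "PageB": 1.33, "PageC": 2.0, "PageD": 4.0}

placement = {}
for page, rate in actual_rates.items():
    compressed_size = int(page_size / rate)
    gpu_bytes = min(compressed_size, storable)   # GPU-resident portion
    cpu_bytes = compressed_size - gpu_bytes      # CPU-side overflow
    placement[page] = (gpu_bytes, cpu_bytes)
```

PageD (rate x4) fits entirely in its 1KB GPU slot with no overflow, while PageA at rate x1 spills 3KB to the CPU side, matching the figure.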
The foregoing describes several embodiments of the present invention. The various alternatives presented by the various embodiments may be combined and cross-referenced with each other without conflict, thereby extending the range of possible embodiments, all of which are considered embodiments disclosed by the present invention.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention should therefore be determined by the appended claims.