CN117764808B - GPU data processing method, device and storage medium - Google Patents
- Publication number
- CN117764808B CN117764808B CN202311785293.4A CN202311785293A CN117764808B CN 117764808 B CN117764808 B CN 117764808B CN 202311785293 A CN202311785293 A CN 202311785293A CN 117764808 B CN117764808 B CN 117764808B
- Authority
- CN
- China
- Prior art keywords
- data
- index data
- vertex
- cam
- primitive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The present disclosure relates to the field of graphics rendering technology, and in particular to a GPU data processing method, device, and storage medium. The method includes: obtaining input primitive index data, where the primitive index data includes first index data corresponding to each of n vertex data in a primitive; performing a CAM scan on the primitive index data; and, when the number m of free index entries in the CAM is less than n, reallocating corresponding second index data for k of the n vertex data and packing the k second index data to obtain a task. The working mode of the CAM is a non-overwrite mode, which indicates that input index data does not overwrite index data already allocated in the CAM. Compared with the overwritable scheme in the related art, the non-overwrite CAM mechanism provided by the embodiments of the present disclosure can effectively improve the utilization of the subsequent vertex cache space.
Description
Technical Field
The present disclosure relates to the field of graphics rendering technology, and in particular to a data processing method, device, and storage medium for a graphics processing unit (GPU).
Background Art
In the geometry stage of graphics rendering, to avoid redundant Unified Shading Cluster (USC) computation of vertex data, the GPU uses a Content-Addressable Memory (CAM) deduplication mechanism that assigns each distinct vertex datum a unique index number. This index number indicates the location of the USC-computed vertex data in the vertex cache space. The USC-computed vertex data are packed into multiple task data units and stored together in the vertex cache space.
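For illustration only, the deduplication described above can be sketched as follows. The patent describes a hardware mechanism; the class and names below are hypothetical, not part of the disclosure:

```python
# Illustrative sketch of CAM-based vertex deduplication (hypothetical
# names; the real mechanism is a hardware content-addressable memory).
class VertexCAM:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}  # vertex key -> assigned unique index

    def lookup_or_assign(self, vertex_key):
        """Return (index, hit): reuse the index of a previously seen
        vertex, otherwise assign the next free one."""
        if vertex_key in self.table:
            return self.table[vertex_key], True
        if len(self.table) >= self.capacity:
            raise RuntimeError("CAM full")
        idx = len(self.table)
        self.table[vertex_key] = idx
        return idx, False

cam = VertexCAM(capacity=4)
# A primitive stream that reuses vertices 10 and 20:
indices = [cam.lookup_or_assign(v) for v in (10, 20, 30, 20, 10)]
# Duplicate vertices map to the same index, so the USC computes each
# distinct vertex only once.
```

Here the returned index plays the role of the unique index number that locates the computed vertex in the vertex cache space.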
The current CAM deduplication mechanism is too rigid, resulting in low utilization of the vertex cache space and low GPU processing efficiency in the geometry stage of graphics rendering.
Summary of the Invention
In view of this, the present disclosure proposes a GPU data processing method, device, and storage medium. The technical solution includes:
According to one aspect of the present disclosure, a GPU data processing method is provided, the method comprising:
obtaining input primitive index data, where the primitive index data includes first index data corresponding to each of n vertex data in a primitive, n being a positive integer;
performing a CAM scan on the primitive index data, where the CAM scan indicates that corresponding second index data are to be reallocated in sequence for the n vertex data in the CAM;
when the number m of free index entries in the CAM is less than n, reallocating corresponding second index data for k of the n vertex data, and packing the k second index data to obtain a task, where m and k are both positive integers and k is less than or equal to m;
wherein the working mode of the CAM is a non-overwrite mode, and the non-overwrite mode indicates that input index data does not overwrite index data already allocated in the CAM.
In a possible implementation, before obtaining the input primitive index data, the method further includes:
obtaining control data, where the control data indicates a task working mode;
when it is detected that the task working mode is a target granularity mode, setting the working mode of the CAM to the non-overwrite mode.
In another possible implementation, the granularity corresponding to the target granularity mode is greater than a preset value, and this granularity indicates the maximum number of work item instances included in an assembled task.
In another possible implementation, the method further includes:
when it is detected that the task working mode is not the target granularity mode, setting the working mode of the CAM to an overwrite mode, where the overwrite mode indicates support for input index data overwriting index data already allocated in the CAM.
In another possible implementation, the method further includes:
when m is greater than or equal to n, reallocating corresponding second index data for each of the n vertex data.
In another possible implementation, the second index data indicates the position of the corresponding vertex data in the vertex cache space, and after packing the k second index data to obtain the task, the method further includes:
obtaining the packed task;
according to the k second index data in the task, storing the vertex data corresponding to each of the k second index data into the vertex cache space.
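The storing step above can be sketched as follows; the function and names are illustrative assumptions, not part of the claims:

```python
# Illustrative sketch: the k second index data in a packed task name
# the vertex cache slots where the computed vertex data are stored.
def store_task(vertex_cache, task_indices, vertex_data):
    """Place each computed vertex at the slot named by its second index."""
    for idx, data in zip(task_indices, vertex_data):
        vertex_cache[idx] = data
    return vertex_cache

cache = [None] * 8                 # vertex cache space with 8 slots
store_task(cache, task_indices=[5, 7], vertex_data=["vA", "vB"])
# cache slot 5 now holds vA and slot 7 holds vB.
```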
In another possible implementation, the method further includes:
reading the k vertex data from the vertex cache space, and obtaining the second index data corresponding to each of the k vertex data;
after synchronizing the k vertex data with the corresponding k second index data, outputting primitive data, where the primitive data includes at least one of the vertex data.
In another possible implementation, the method further includes:
saving a target count value for each task, where the target count value is the number of second index data in the task that correspond to unused entries in the vertex cache space;
when the primitive data are output, subtracting a usage value from the target count value, where the usage value is the number of vertex data included in the primitive data.
In another possible implementation, the method further includes:
when it is detected that the target count value of the task is zero, sending a cache release command to the vertex cache space, where the cache release command instructs release of the space corresponding to the task in the vertex cache space.
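The counting and release steps above amount to per-task reference counting, which can be sketched as follows (names are hypothetical; the flag stands in for the cache release command):

```python
# Hypothetical sketch of the per-task target count: initialize it to
# the number of not-yet-consumed second index data, subtract the
# vertices used by each output primitive, and release the task's
# cache space when the count reaches zero.
class TaskTracker:
    def __init__(self, num_indices):
        self.count = num_indices   # target count value
        self.released = False

    def on_primitive_output(self, vertices_used):
        """Subtract the usage value; release the space at zero."""
        self.count -= vertices_used
        if self.count == 0:
            self.released = True   # stands in for the release command
        return self.released

task = TaskTracker(num_indices=6)
task.on_primitive_output(3)          # first triangle uses 3 vertices
done = task.on_primitive_output(3)   # second triangle drains the task
```

Once `done` is true, the cache space occupied by the task could be reclaimed immediately, without waiting for its CAM entries to be overwritten.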
According to another aspect of the present disclosure, a GPU data processing device is provided, the device comprising:
an acquisition module, configured to obtain input primitive index data, where the primitive index data includes first index data corresponding to each of n vertex data in a primitive, n being a positive integer;
a scanning module, configured to perform a CAM scan on the primitive index data, where the CAM scan indicates that corresponding second index data are to be reallocated in sequence for the n vertex data in the CAM;
a packing module, configured to, when the number m of free index entries in the CAM is less than n, reallocate corresponding second index data for k of the n vertex data, and pack the k second index data to obtain a task, where m and k are both positive integers and k is less than or equal to m;
wherein the working mode of the CAM is a non-overwrite mode, and the non-overwrite mode indicates that input index data does not overwrite index data already allocated in the CAM.
In a possible implementation, the device further includes a first setting module, configured to:
obtain control data, where the control data indicates a task working mode;
when it is detected that the task working mode is a target granularity mode, set the working mode of the CAM to the non-overwrite mode.
In another possible implementation, the granularity corresponding to the target granularity mode is greater than a preset value, and this granularity indicates the maximum number of work item instances included in an assembled task.
In another possible implementation, the device further includes a second setting module, configured to:
when it is detected that the task working mode is not the target granularity mode, set the working mode of the CAM to an overwrite mode, where the overwrite mode indicates support for input index data overwriting index data already allocated in the CAM.
In another possible implementation, the device further includes an allocation module, configured to:
when m is greater than or equal to n, reallocate corresponding second index data for each of the n vertex data.
In another possible implementation, the second index data indicates the position of the corresponding vertex data in the vertex cache space, and the device further includes a cache module, configured to:
obtain the packed task;
according to the k second index data in the task, store the vertex data corresponding to each of the k second index data into the vertex cache space.
In another possible implementation, the device further includes an output module, configured to:
read the k vertex data from the vertex cache space, and obtain the second index data corresponding to each of the k vertex data;
after synchronizing the k vertex data with the corresponding k second index data, output primitive data, where the primitive data includes at least one of the vertex data.
In another possible implementation, the device further includes a calculation module, configured to:
save a target count value for each task, where the target count value is the number of second index data in the task that correspond to unused entries in the vertex cache space;
when the primitive data are output, subtract a usage value from the target count value, where the usage value is the number of vertex data included in the primitive data.
In another possible implementation, the device further includes a release module, configured to:
when it is detected that the target count value of the task is zero, send a cache release command to the vertex cache space, where the cache release command instructs release of the space corresponding to the task in the vertex cache space.
According to another aspect of the present disclosure, a computing device is provided, the computing device comprising: a processor; and a memory for storing processor-executable instructions;
wherein the processor is configured to:
obtain input primitive index data, where the primitive index data includes first index data corresponding to each of n vertex data in a primitive, n being a positive integer;
perform a CAM scan on the primitive index data, where the CAM scan indicates that corresponding second index data are to be reallocated in sequence for the n vertex data in the CAM;
when the number m of free index entries in the CAM is less than n, reallocate corresponding second index data for k of the n vertex data, and pack the k second index data to obtain a task, where m and k are both positive integers and k is less than or equal to m;
wherein the working mode of the CAM is a non-overwrite mode, and the non-overwrite mode indicates that input index data does not overwrite index data already allocated in the CAM.
According to another aspect of the present disclosure, a non-volatile computer-readable storage medium is provided, on which computer program instructions are stored; when the computer program instructions are executed by a processor, the method provided by the first aspect or any possible implementation of the first aspect is implemented.
In the embodiments of the present disclosure, input primitive index data are obtained, where the primitive index data include first index data corresponding to each of n vertex data in a primitive, n being a positive integer; a CAM scan is performed on the primitive index data, indicating that corresponding second index data are to be reallocated in sequence for the n vertex data in the CAM; and when the number m of free index entries in the CAM is less than n, corresponding second index data are reallocated for k of the n vertex data, and the k second index data are packed to obtain a task. Because the CAM works in the non-overwrite mode, the input index data do not overwrite the index data already allocated in the CAM during reallocation. This avoids the situation in current overwritable schemes where index data must wait to be completely overwritten before resources can be released. Compared with the overwritable scheme in the related art, the non-overwrite CAM mechanism provided by the embodiments of the present disclosure can effectively improve the utilization of the subsequent vertex cache space when vertex data are managed in the geometry stage.
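The core claimed step, packing a task from only k of the n vertices when the CAM has fewer than n free entries, can be sketched as follows; the function and data layout are illustrative assumptions:

```python
# Sketch of the claimed step: with m free index entries and m < n,
# reallocate second index data for only k vertices (k <= m) and pack
# those k indices into one task. (Names are hypothetical.)
def pack_task(free_indices, n_vertices):
    m = len(free_indices)
    k = min(m, n_vertices)          # k <= m, as the claim requires
    second_indices = free_indices[:k]
    return {"indices": second_indices, "size": k}

# A CAM with 2 free entries facing a primitive with 3 vertices (m < n):
task = pack_task(free_indices=[5, 7], n_vertices=3)
# Only k = 2 second indices are allocated and packed; the remaining
# vertex waits for entries to be released.
```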
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
FIG. 1 shows a schematic structural diagram of a computing device provided by an exemplary embodiment of the present disclosure.
FIG. 2 shows a flow chart of a GPU data processing method provided by an exemplary embodiment of the present disclosure.
FIG. 3 shows a flow chart of a GPU data processing method provided by another exemplary embodiment of the present disclosure.
FIG. 4 shows a flow chart of a GPU data processing method provided by another exemplary embodiment of the present disclosure.
FIG. 5 shows a schematic principle diagram of a GPU data processing method provided by an exemplary embodiment of the present disclosure.
FIG. 6 shows a schematic principle diagram of the data caching approach involved in a GPU data processing method provided by another exemplary embodiment of the present disclosure.
FIG. 7 shows a flow chart of a GPU data processing method provided by another exemplary embodiment of the present disclosure.
FIG. 8 shows a schematic structural diagram of a GPU data processing device provided by an exemplary embodiment of the present disclosure.
FIG. 9 is a block diagram of an apparatus for a GPU data processing method according to an exemplary embodiment.
FIG. 10 is a block diagram of an apparatus for a GPU data processing method according to an exemplary embodiment.
DETAILED DESCRIPTION
Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise specified.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred over or advantageous compared to other embodiments.
In addition, numerous specific details are given in the following detailed description in order to better illustrate the present disclosure. Those skilled in the art will understand that the present disclosure can be practiced without certain of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
First, the application scenarios involved in the present disclosure are introduced.
Please refer to FIG. 1, which shows a schematic structural diagram of a computing device provided by an exemplary embodiment of the present disclosure.
The computing device includes a server and/or a terminal. The terminal may be a mobile or fixed terminal, such as a mobile phone, a tablet computer, a laptop computer, or a desktop computer. The server may be a single server, a server cluster composed of several servers, or a cloud computing service center.
The computing device includes a processor 10, a memory 20, and a communication interface 30. Those skilled in the art will appreciate that the structure shown in FIG. 1 does not limit the computing device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Specifically:
The processor 10 is the control center of the computing device and connects the various parts of the entire computing device through various interfaces and lines. By running or executing software programs and/or modules stored in the memory 20 and calling data stored in the memory 20, it performs the various functions of the computing device and processes data, thereby controlling the computing device as a whole. The processor 10 may be implemented by a CPU or by a GPU.
The memory 20 may be used to store software programs and modules. The processor 10 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 20. The memory 20 may mainly include a program storage area and a data storage area: the program storage area may store an operating system 21, an acquisition module 22, a scanning module 23, a packing module 24, an application 25 required for at least one function, and the like; the data storage area may store data created according to the use of the computing device, and the like. The memory 20 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. Accordingly, the memory 20 may further include a memory controller to provide the processor 10 with access to the memory 20.
The processor 10 performs the following function by running the acquisition module 22: obtaining input primitive index data, where the primitive index data includes first index data corresponding to each of n vertex data in a primitive, n being a positive integer. The processor 10 performs the following function by running the scanning module 23: performing a CAM scan on the primitive index data, where the CAM scan indicates that corresponding second index data are to be reallocated in sequence for the n vertex data in the CAM; the working mode of the CAM is a non-overwrite mode, which indicates that input index data does not overwrite index data already allocated in the CAM. The processor 10 performs the following function by running the packing module 24: when the number m of free index entries in the CAM is less than n, reallocating corresponding second index data for k of the n vertex data, and packing the k second index data to obtain a task, where m and k are both positive integers and k is less than or equal to m.
The GPU data processing method provided by the embodiments of the present disclosure is applied to the GPU of a computing device, with the geometry stage of graphics rendering as the application scenario. To avoid redundant USC computation of vertex data, the CAM deduplication mechanism provided by the embodiments of the present disclosure reallocates unique second index data for each distinct vertex datum; the second index data indicates the position of the USC-computed vertex data in the vertex cache space. During the sequential reallocation of index data for different vertex data, the input index data does not overwrite the index data already allocated in the CAM.
It should be noted that the overwritable CAM mode in the related art has the following mechanism: the index data corresponding to the vertex data of task 1 must completely overwrite the index data corresponding to the vertex data of task 0 before the resources occupied in the vertex cache by the vertex data of task 0 can be released. The non-overwrite mechanism provided by the embodiments of the present disclosure avoids this operation: resources can be released without waiting for the index data of task 0 to be completely overwritten, so along the time dimension the vertex cache can be used and released faster. Therefore, compared with the overwritable scheme in the related art, the non-overwrite CAM mechanism provided by the embodiments of the present disclosure can effectively improve the utilization of the subsequent vertex cache space when vertex data are managed in the geometry stage.
下面,采用几个示例性实施例对本公开实施例提供的GPU的数据处理方法进行介绍。The following describes the GPU data processing method provided by the embodiments of the present disclosure using several exemplary embodiments.
请参考图2,其示出了本公开一个示例性实施例提供的GPU的数据处理方法的流程图,本实施例以该方法用于图1所示的计算设备的GPU中来举例说明。该方法包括以下几个步骤。Please refer to Fig. 2, which shows a flow chart of a data processing method of a GPU provided by an exemplary embodiment of the present disclosure, and this embodiment is illustrated by using the method in the GPU of the computing device shown in Fig. 1. The method includes the following steps.
步骤201,获取控制数据,控制数据用于指示任务工作模式。Step 201, obtaining control data, where the control data is used to indicate a task working mode.
GPU获取输入的控制数据,该控制数据用于指示任务工作模式。The GPU obtains input control data, which is used to indicate the task working mode.
其中,任务工作模式对应的粒度用于指示所组装任务包括的工作项实例的最大数量,不同任务工作模式对应的粒度不同。The granularity corresponding to the task working mode is used to indicate the maximum number of work item instances included in the assembled task, and different task working modes correspond to different granularities.
示意性的,任务工作模式可以包括wave32模式和wave128模式(wave是一种自定义的SIMD线程束,wave32模式表示32个工作项实例组装成的并行线程束,wave128模式表示128个工作项实例组装成的并行线程束)。替代地或附加地,任务工作模式也可以包括wave64模式等。例如,可以根据wave32模式或wave128模式分别将32或128个工作项实例组装成一个任务wave32或wave128,其中32和128分别表示不同工作模式对应的粒度。替代地或附加地,任务也可以包括wave64等。Illustratively, the task working mode may include wave32 mode and wave128 mode (wave is a custom SIMD thread bundle, wave32 mode represents a parallel thread bundle composed of 32 work item instances, and wave128 mode represents a parallel thread bundle composed of 128 work item instances). Alternatively or additionally, the task working mode may also include wave64 mode, etc. For example, 32 or 128 work item instances may be assembled into a task wave32 or wave128 according to the wave32 mode or the wave128 mode, respectively, where 32 and 128 represent the granularity corresponding to different working modes, respectively. Alternatively or additionally, the task may also include wave64, etc.
示意性的，如果任务工作模式为wave128模式，则可以确定相应的粒度为128，从第1个工作项实例开始，直至第128个工作项实例为止，累计计数128之后，将128个工作项实例组装成一个任务wave128；如果任务工作模式为wave32模式，则可以确定相应的粒度为32，从第1个工作项实例开始，直至第32个工作项实例为止，累计计数32之后，将32个工作项实例组装成一个任务wave32。Illustratively, if the task working mode is wave128 mode, the corresponding granularity can be determined to be 128: starting from the first work item instance up to the 128th work item instance, once the cumulative count reaches 128, the 128 work item instances are assembled into one task wave128. If the task working mode is wave32 mode, the corresponding granularity can be determined to be 32: starting from the first work item instance up to the 32nd work item instance, once the cumulative count reaches 32, the 32 work item instances are assembled into one task wave32.
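为便于理解，下面用一段示意性的Python草图演示按粒度组装任务的逻辑（仅为示例性草图，并非本公开的实际实现，函数名为假设）。For ease of understanding, the following illustrative Python sketch demonstrates the logic of assembling tasks by granularity (an illustrative sketch only, not the actual implementation of the present disclosure; the function name is a hypothetical assumption).

```python
# Illustrative sketch only: group work-item instances into "wave" tasks
# whose size is bounded by the granularity of the task working mode.
# The helper name `assemble_waves` is a hypothetical assumption.

def assemble_waves(work_items, mode):
    """Assemble work-item instances into tasks of at most `granularity` items."""
    granularity = {"wave32": 32, "wave64": 64, "wave128": 128}[mode]
    return [work_items[i:i + granularity]
            for i in range(0, len(work_items), granularity)]

items = list(range(200))                 # 200 work-item instances
waves = assemble_waves(items, "wave128")
print([len(w) for w in waves])           # [128, 72]: the last wave is partial
```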
步骤202,当检测到任务工作模式是目标粒度模式时,将CAM的工作模式设置为不覆写模式。Step 202: When it is detected that the task working mode is the target granularity mode, the working mode of the CAM is set to a non-overwrite mode.
可选的,目标粒度模式是默认设置的任务工作模式,或者是自定义设置的任务工作模式。Optionally, the target granularity mode is a default task working mode or a custom task working mode.
可选的,目标粒度模式对应的粒度大于预设数值,目标粒度模式对应的粒度用于指示所组装任务包括的工作项实例的最大数量。即目标粒度模式为对应的粒度大于预设数值的任务工作模式。Optionally, the granularity corresponding to the target granularity mode is greater than a preset value, and the granularity corresponding to the target granularity mode is used to indicate the maximum number of work item instances included in the assembled task. That is, the target granularity mode is a task work mode whose corresponding granularity is greater than a preset value.
其中,预设数值可以是默认设置的或者是自定义设置的。示意性的,目标粒度模式也称大粒度模式,比如目标粒度模式为wave128模式。本公开实施例对此不加以限定。The preset value may be a default setting or a custom setting. Schematically, the target granularity mode is also called a large granularity mode, for example, the target granularity mode is a wave128 mode. The embodiments of the present disclosure are not limited to this.
当GPU检测到任务工作模式是目标粒度模式时,将CAM的工作模式设置为不覆写模式,不覆写模式用于指示输入的索引数据不覆写CAM中已分配的索引数据。也就是说,不覆写模式不支持索引数据的数据覆写功能,输入的索引数据不能替换CAM中已分配的索引数据,CAM中已分配的索引数据无需等待被覆写即可释放资源。示意性的,当CAM中不存在空闲的索引数据时,无法为输入的第一索引数据对应的顶点数据重新分配对应的第二索引数据。When the GPU detects that the task working mode is the target granularity mode, the working mode of the CAM is set to the non-overwrite mode, which is used to indicate that the input index data does not overwrite the allocated index data in the CAM. In other words, the non-overwrite mode does not support the data overwrite function of the index data, and the input index data cannot replace the allocated index data in the CAM. The allocated index data in the CAM can release resources without waiting to be overwritten. Schematically, when there is no free index data in the CAM, the corresponding second index data cannot be reallocated for the vertex data corresponding to the input first index data.
可选的，不覆写模式还用于指示一个图元中的所有顶点数据各自对应的第二索引数据均存在顶点缓存空间中的一个任务中。Optionally, the non-overwrite mode is also used to indicate that the second index data corresponding to all vertex data in a primitive all reside in one task in the vertex cache space.
可选的,不覆写模式不支持第一顶点数据与第二顶点数据组成图元的功能,第一顶点数据为第一任务中的至少一个第二索引数据对应的顶点数据,第二顶点数据为第二任务中的至少一个第二索引数据对应的顶点数据,第一任务和第二任务为两个不同的任务。Optionally, the non-overwrite mode does not support the function of forming a primitive with the first vertex data and the second vertex data, the first vertex data is the vertex data corresponding to at least one second index data in the first task, the second vertex data is the vertex data corresponding to at least one second index data in the second task, and the first task and the second task are two different tasks.
可选的,当GPU检测到任务工作模式不是目标粒度模式时,将CAM的工作模式设置为覆写模式,覆写模式用于指示支持输入的索引数据覆写CAM中已分配的索引数据的功能。也就是说,当GPU检测到任务工作模式不是目标粒度模式时,保持相关技术中提供的覆写模式不变。Optionally, when the GPU detects that the task working mode is not the target granularity mode, the working mode of the CAM is set to the overwrite mode, where the overwrite mode is used to indicate a function that supports input index data to overwrite the allocated index data in the CAM. That is, when the GPU detects that the task working mode is not the target granularity mode, the overwrite mode provided in the related art is kept unchanged.
步骤203,获取输入的图元索引数据,图元索引数据包括图元中的n个顶点数据各自对应的第一索引数据。Step 203 , obtaining input primitive index data, where the primitive index data includes first index data corresponding to each of n vertex data in the primitive.
在任务工作模式是目标粒度模式,且CAM的工作模式是不覆写模式时,GPU获取输入的图元索引数据,图元索引数据包括n个顶点数据分配的各自对应的第一索引数据,n个顶点数据为一个或多个图元包括的多个顶点数据,第一索引数据即原始索引数据,n为正整数。其中,顶点数据与第一索引数据存在一一对应的关系。When the task working mode is the target granularity mode and the CAM working mode is the non-overwrite mode, the GPU obtains the input primitive index data, which includes the first index data corresponding to each of the n vertex data, the n vertex data are multiple vertex data included in one or more primitives, the first index data is the original index data, and n is a positive integer. There is a one-to-one correspondence between the vertex data and the first index data.
步骤204,对图元索引数据进行CAM扫描,CAM扫描用于指示在CAM中为n个顶点数据依次重新分配对应的第二索引数据。Step 204 , performing CAM scanning on the primitive index data, where the CAM scanning is used to indicate to sequentially reallocate the corresponding second index data for the n vertex data in the CAM.
在任务工作模式是目标粒度模式,且CAM的工作模式是不覆写模式时,GPU对图元索引数据进行CAM扫描。其中,目标粒度模式对应的粒度大于预设数值,不覆写模式用于指示输入的索引数据不覆写CAM中已分配的索引数据。When the task working mode is the target granularity mode and the CAM working mode is the non-overwrite mode, the GPU performs a CAM scan on the primitive index data, wherein the granularity corresponding to the target granularity mode is greater than a preset value, and the non-overwrite mode is used to indicate that the input index data does not overwrite the allocated index data in the CAM.
CAM扫描用于指示在CAM中为n个顶点数据依次重新分配对应的第二索引数据,第二索引数据用于指示对应的顶点数据在顶点缓存空间中的位置。The CAM scan is used to indicate that the corresponding second index data is sequentially reallocated for n vertex data in the CAM, and the second index data is used to indicate the position of the corresponding vertex data in the vertex cache space.
第二索引数据不同于第一索引数据，可选的，为了有效降低后续处理所需要的缓存空间，第二索引数据所占的比特位小于第一索引数据所占的比特位，比如第一索引数据所占的比特位为32比特，第二索引数据所占的比特位为8比特。本公开实施例对此不加以限定。The second index data is different from the first index data. Optionally, in order to effectively reduce the cache space required for subsequent processing, the second index data occupies fewer bits than the first index data; for example, the first index data occupies 32 bits while the second index data occupies 8 bits. This is not limited in the embodiments of the present disclosure.
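作为上文位宽压缩的一个简单算例（假设一个任务含128个顶点数据，该数值为示例性假设），可以比较32比特第一索引数据与8比特第二索引数据的每任务索引存储量。As a simple worked example of the bit-width reduction above (assuming a task with 128 vertex data, an illustrative assumption), the per-task index storage of 32-bit first index data can be compared with that of 8-bit second index data.

```python
# Simple arithmetic following the passage above; the task size of 128
# vertices is an assumed example value (e.g. one wave128 task).
first_index_bits = 32
second_index_bits = 8
vertices_per_task = 128

first_bytes = vertices_per_task * first_index_bits // 8
second_bytes = vertices_per_task * second_index_bits // 8
print(first_bytes, second_bytes)   # 512 128 -- a 4x reduction in index storage
```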
可选的,当对输入的一个或多个图元中的所有顶点数据各自对应的第一索引数据进行CAM扫描后,判断CAM中空闲的索引数据数量是否小于顶点数据的总数量。其中,CAM中空闲的索引数据数量为m,顶点数据的总数量为输入的第一索引数据的总数量,即顶点数据的总数量为n,n和m均为正整数。若m小于n,也即发现无法一次性为输入的一个或多个图元中的所有顶点数据分配所需的第二索引数据时,执行步骤205。若m大于或等于n,也即发现可以一次性为输入的一个或多个图元中的所有顶点数据分配所需的第二索引数据时,即为n个顶点数据重新分配各自对应的第二索引数据,第二索引数据用于指示对应的顶点数据在顶点缓存空间中的位置,该阶段数据处理完成,对于下一次输入的图元索引数据,按照上述规则执行对图元索引数据进行CAM扫描的步骤。Optionally, after performing a CAM scan on the first index data corresponding to each of all vertex data in one or more input primitives, determine whether the number of idle index data in the CAM is less than the total number of vertex data. Wherein, the number of idle index data in the CAM is m, and the total number of vertex data is the total number of input first index data, that is, the total number of vertex data is n, and both n and m are positive integers. If m is less than n, that is, it is found that it is impossible to allocate the required second index data for all vertex data in the input one or more primitives at one time, execute step 205. If m is greater than or equal to n, that is, it is found that it is possible to allocate the required second index data for all vertex data in the input one or more primitives at one time, the second index data corresponding to each of the n vertex data is reallocated, and the second index data is used to indicate the position of the corresponding vertex data in the vertex cache space. The data processing at this stage is completed, and for the next input primitive index data, the step of performing a CAM scan on the primitive index data is performed according to the above rules.
可选的,为n个顶点数据重新分配各自对应的第二索引数据后,可以将n个第二索引数据进行打包得到任务。Optionally, after the n vertex data are reallocated to their respective corresponding second index data, the n second index data may be packaged to obtain a task.
可选的，GPU按照图元粒度对图元索引数据进行CAM扫描，CAM扫描用于指示依次为每个图元中的所有顶点数据，在CAM中一次性重新分配各自对应的第二索引数据。也就是说，对于一个图元，GPU判断CAM中空闲的索引数据数量是否小于该图元的顶点数据的总数量。若CAM中空闲的索引数据数量大于或等于该图元的顶点数据的总数量，也即发现可以一次性为该图元的所有顶点数据分配所需的第二索引数据时，即为该图元的顶点数据重新分配各自对应的第二索引数据，继续对下一个图元，开始执行判断CAM中空闲的索引数据数量是否小于该图元的顶点数据的总数量的步骤。若CAM中空闲的索引数据数量小于该图元的顶点数据的总数量，也即发现无法一次性为该图元的所有顶点数据分配所需的第二索引数据时，后续可以将CAM中已分配的多个第二索引数据进行打包得到任务，多个第二索引数据包括为其他图元的所有顶点数据已分配的第二索引数据，其他图元为图元索引数据中除了该图元以外的至少一个图元。Optionally, the GPU performs the CAM scan on the primitive index data at primitive granularity, where the CAM scan is used to indicate that, for each primitive in turn, the corresponding second index data are reallocated in the CAM at one time for all vertex data of that primitive. That is, for a given primitive, the GPU determines whether the number of free index data in the CAM is less than the total number of vertex data of that primitive. If the number of free index data in the CAM is greater than or equal to the total number of vertex data of the primitive, that is, the required second index data can be allocated to all vertex data of the primitive at one time, the corresponding second index data are reallocated for the vertex data of the primitive, and the process moves on to the next primitive, again starting from the step of determining whether the number of free index data in the CAM is less than the total number of vertex data of that primitive. If the number of free index data in the CAM is less than the total number of vertex data of the primitive, that is, the required second index data cannot be allocated to all vertex data of the primitive at one time, the multiple second index data already allocated in the CAM can subsequently be packed to obtain a task, where the multiple second index data include the second index data already allocated for all vertex data of the other primitives, and the other primitives are at least one primitive in the primitive index data other than this primitive.
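下面给出上述按图元粒度进行CAM扫描与去重的一个极简Python模型（仅为示意性草图，并非本公开的实际实现；类名、方法名以及第二索引数据的分配方式均为假设）。The following is a minimal Python model of the primitive-granularity CAM scan and deduplication described above (an illustrative sketch only, not the actual implementation of the present disclosure; the class name, method names, and the way second index data are assigned are all assumptions).

```python
# Hypothetical model: the CAM maps a first index to a reallocated second
# index and deduplicates repeated vertices. In non-overwrite mode no entry
# may be evicted, so a primitive whose new vertices do not fit signals the
# caller to pack the already-allocated second indices into a task.

class CAM:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}  # first index -> second index

    def free_slots(self):
        return self.capacity - len(self.table)

    def scan_primitive(self, first_indices):
        """Allocate second indices for one whole primitive, or return None
        if it cannot be allocated in full (non-overwrite: no eviction)."""
        # dict.fromkeys keeps first-seen order while removing duplicates
        new = [i for i in dict.fromkeys(first_indices) if i not in self.table]
        if len(new) > self.free_slots():
            return None  # CAM is full for this primitive: pack a task first
        for i in new:
            self.table[i] = len(self.table)  # next free second index
        return [self.table[i] for i in first_indices]

    def pack_task(self):
        """Pack all allocated second indices into a task and reset the CAM."""
        task = dict(self.table)
        self.table.clear()
        return task

cam = CAM(capacity=4)
print(cam.scan_primitive([10, 11, 12]))  # [0, 1, 2]
print(cam.scan_primitive([11, 12, 13]))  # [1, 2, 3]: shared vertices deduplicated
print(cam.scan_primitive([20, 21, 22]))  # None: CAM full, pack first
cam.pack_task()
print(cam.scan_primitive([20, 21, 22]))  # [0, 1, 2] after the reset
```

该草图也体现了不覆写模式的关键行为：CAM满时输入索引不替换已分配索引，只能先打包成任务再释放。The sketch also reflects the key behavior of the non-overwrite mode: when the CAM is full, input indices do not replace allocated ones; a task must be packed first.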
步骤205,当CAM中空闲的索引数据数量m小于n时,为n个顶点数据中的k个顶点数据重新分配各自对应的第二索引数据,并将k个第二索引数据进行打包得到任务。Step 205 , when the number m of idle index data in the CAM is less than n, reallocate the corresponding second index data to k vertex data among the n vertex data, and pack the k second index data to obtain a task.
当m小于n时，由于CAM的工作模式是不覆写模式，输入的索引数据不覆写CAM中已分配的索引数据，表示CAM中已有的空闲的索引数据不足以分配给输入的一个或多个图元中的所有顶点数据，也就是说无法一次性为输入的所有顶点数据分配所需的第二索引数据，为n个顶点数据中的k个顶点数据重新分配各自对应的第二索引数据，并强制将已分配的k个第二索引数据进行打包得到任务，以便后续进行统一计算处理，其中k为小于或等于m的正整数。When m is less than n, since the working mode of the CAM is the non-overwrite mode and the input index data does not overwrite the allocated index data in the CAM, the free index data remaining in the CAM is insufficient to be allocated to all vertex data of the one or more input primitives; that is, the required second index data cannot be allocated to all the input vertex data at one time. In this case, the GPU reallocates the corresponding second index data for k vertex data among the n vertex data, and forcibly packs the allocated k second index data to obtain a task for subsequent unified calculation and processing, where k is a positive integer less than or equal to m.
可选的,k个顶点数据包括目标图元的所有顶点数据;或者,包括目标图元的所有顶点数据和一个图元的部分顶点数据;或者,包括一个图元的部分顶点数据。其中,目标图元为输入的图元中的一个或至少两个图元,一个图元为输入的图元中的一个图元,输入的图元为输入的图元索引数据对应的多个图元。对应的,k个第二索引数据包括为目标图元的所有顶点数据已分配的第二索引数据;或者,包括为目标图元的所有顶点数据和一个图元的部分顶点数据已分配的第二索引数据;或者,包括为一个图元的部分顶点数据已分配的第二索引数据。Optionally, the k vertex data include all vertex data of the target primitive; or, include all vertex data of the target primitive and part of vertex data of one primitive; or, include part of vertex data of one primitive. The target primitive is one or at least two primitives of the input primitives, a primitive is one primitive of the input primitives, and the input primitives are multiple primitives corresponding to the input primitive index data. Correspondingly, the k second index data include second index data allocated for all vertex data of the target primitive; or, include second index data allocated for all vertex data of the target primitive and part of vertex data of one primitive; or, include second index data allocated for part of vertex data of one primitive.
可选的,不覆写模式还用于指示一个图元中的所有顶点数据各自对应的第二索引数据均存在一个任务中。在该情况下,k个顶点数据包括目标图元的所有顶点数据,k个第二索引数据包括为目标图元的所有顶点数据已分配的第二索引数据。Optionally, the non-overwrite mode is also used to indicate that the second index data corresponding to all vertex data in a primitive are all present in one task. In this case, the k vertex data include all vertex data of the target primitive, and the k second index data include the second index data allocated to all vertex data of the target primitive.
其中,CAM中空闲的索引数据数量为m,CAM中空闲的索引数据数量为CAM中已有的未分配的索引数据的总数量;输入的一个或多个图元中的顶点数据的总数量为n,n和m均为正整数。由于顶点数据与第一索引数据存在一一对应的关系,顶点数据的总数量即为该图元索引数据中的所有第一索引数据的总数量。The number of free index data in the CAM is m, which is the total number of unallocated index data in the CAM; the total number of vertex data in the input one or more primitives is n, and both n and m are positive integers. Since there is a one-to-one correspondence between vertex data and first index data, the total number of vertex data is the total number of all first index data in the primitive index data.
在一个示意性的例子中，在任务工作模式是目标粒度模式，且CAM的工作模式是不覆写模式时，GPU的数据处理方法包括但不限于如下几个步骤，如图3所示：步骤301，GPU获取输入的图元索引数据；步骤302，GPU按照图元的粒度进行CAM扫描；步骤303，GPU判断CAM中空闲的索引数据数量是否小于图元中的顶点数据的总数量；步骤304，当CAM中空闲的索引数据数量小于图元中的顶点数据的总数量时，表示CAM中已有的空闲的索引数据不足以分配给该图元中的所有顶点数据，GPU为图元中的部分顶点数据重新分配各自对应的第二索引数据，并强制将已分配的多个第二索引数据进行打包得到任务；步骤305，当CAM中空闲的索引数据数量大于或等于图元中的顶点数据的总数量时，表示可以一次性为该图元中的所有顶点数据分配所需的第二索引数据，该阶段数据处理完成，对于下一次输入的图元索引数据，按照上述规则执行对图元索引数据进行CAM扫描的步骤。In an illustrative example, when the task working mode is the target granularity mode and the working mode of the CAM is the non-overwrite mode, the data processing method of the GPU includes but is not limited to the following steps, as shown in Figure 3: Step 301, the GPU obtains the input primitive index data; Step 302, the GPU performs CAM scanning according to the granularity of the primitive; Step 303, the GPU determines whether the number of free index data in the CAM is less than the total number of vertex data in the primitive; Step 304, when the number of free index data in the CAM is less than the total number of vertex data in the primitive, it means that the existing free index data in the CAM is insufficient to be allocated to all vertex data in the primitive, and the GPU reallocates the corresponding second index data for part of the vertex data in the primitive, and forces the allocated multiple second index data to be packed to obtain a task; Step 305, when the number of free index data in the CAM is greater than or equal to the total number of vertex data in the primitive, it means that the required second index data can be allocated to all vertex data in the primitive at one time, and the data processing at this stage is completed. For the next input primitive index data, the step of performing CAM scanning on the primitive index data is performed according to the above rules.
在一种可能的实现方式中,上述步骤205之后,还包括但不限于如下几个步骤,如图4所示:In a possible implementation, after the above step 205, the following steps are also included but not limited to, as shown in FIG4 :
步骤401,获取打包的任务。Step 401, obtain the packaged tasks.
GPU获取打包的任务,对任务进行处理得到任务中的k个第二索引数据,k个第二索引数据中的每个第二索引数据唯一指示对应的顶点数据。The GPU obtains the packaged task, processes the task to obtain k second index data in the task, and each second index data in the k second index data uniquely indicates the corresponding vertex data.
步骤402,根据任务中的k个第二索引数据,将k个第二索引数据各自对应的顶点数据存入顶点缓存空间中。Step 402 : according to the k second index data in the task, store the vertex data corresponding to each of the k second index data into the vertex cache space.
可选的,GPU根据任务中的k个第二索引数据,按照任务粒度将k个第二索引数据各自对应的顶点数据存入顶点缓存空间中,第二索引数据用于指示对应的顶点数据在顶点缓存空间中的位置。Optionally, the GPU stores the vertex data corresponding to each of the k second index data in the task into the vertex cache space according to the task granularity, and the second index data is used to indicate the position of the corresponding vertex data in the vertex cache space.
可选的,多个顶点数据是按照任务粒度存储在顶点缓存空间中的,即顶点缓存空间中包括按照任务粒度存储的每个任务对应的多个顶点数据。Optionally, the multiple vertex data are stored in the vertex cache space according to the task granularity, that is, the vertex cache space includes the multiple vertex data corresponding to each task stored according to the task granularity.
其中,一个图元中的所有顶点数据各自对应的第二索引数据均存在顶点缓存空间中的一个任务中。第一顶点数据与第二顶点数据不存在联系,即第一顶点数据与第二顶点数据无法组成图元,第一顶点数据为第一任务中的至少一个第二索引数据对应的顶点数据,第二顶点数据为第二任务中的至少一个第二索引数据对应的顶点数据,第一任务和第二任务为两个不同的任务。Among them, the second index data corresponding to all vertex data in a primitive are all stored in a task in the vertex cache space. The first vertex data and the second vertex data are not related, that is, the first vertex data and the second vertex data cannot form a primitive. The first vertex data is the vertex data corresponding to at least one second index data in the first task, and the second vertex data is the vertex data corresponding to at least one second index data in the second task. The first task and the second task are two different tasks.
步骤403,从顶点缓存空间中读取k个顶点数据,并获取k个顶点数据各自对应的第二索引数据。Step 403, read k vertex data from the vertex cache space, and obtain second index data corresponding to each of the k vertex data.
GPU从顶点缓存空间中读取k个顶点数据,并获取k个顶点数据各自对应的第二索引数据。The GPU reads k vertex data from the vertex buffer space, and obtains second index data corresponding to each of the k vertex data.
步骤404，在将k个顶点数据与对应的k个第二索引数据进行同步处理后，输出图元数据，并保存每个任务的目标计数值。Step 404, after synchronizing the k vertex data with the corresponding k second index data, output the primitive data and save the target count value of each task.
其中,图元数据包括至少一个顶点数据。The graphic element data includes at least one vertex data.
GPU将k个顶点数据与对应的k个第二索引数据进行同步处理，在同步处理完成后，输出一个图元数据，一个图元数据包括至少一个顶点数据，并保存每个任务的目标计数值。其中，目标计数值为任务中的位于顶点缓存空间中未使用的第二索引数据的个数。也即，目标计数值为该任务对应的位于顶点缓存空间中未使用的顶点数据的个数。The GPU synchronizes the k vertex data with the corresponding k second index data. After the synchronization is completed, it outputs one piece of primitive data, which includes at least one vertex data, and saves the target count value of each task. The target count value is the number of second index data of the task that are located in the vertex cache space and have not yet been used; that is, the target count value is the number of unused vertex data in the vertex cache space corresponding to the task.
可选的,目标计数值是按照任务粒度进行存储的,即GPU按照任务粒度保存每个任务的目标计数值。GPU也可以按照其他逻辑保存每个任务的目标计数值。本公开实施例对此不加以限定。Optionally, the target count value is stored according to the task granularity, that is, the GPU saves the target count value of each task according to the task granularity. The GPU may also save the target count value of each task according to other logics. This embodiment of the disclosure is not limited to this.
步骤405，当输出图元数据时，将目标计数值减去使用值，使用值为图元数据所包括的顶点数据的个数。Step 405: when primitive data is output, the usage value is subtracted from the target count value, where the usage value is the number of vertex data included in the primitive data.
每当GPU输出一个图元数据时，将图元数据对应的任务的目标计数值减去使用值，即使用值为该图元数据所包括的顶点数据的个数。Whenever the GPU outputs a piece of primitive data, the usage value is subtracted from the target count value of the task corresponding to that primitive data, where the usage value is the number of vertex data included in the primitive data.
步骤406,当检测到任务的目标计数值为零时,向顶点缓存空间发送缓存释放命令,缓存释放命令用于指示释放顶点缓存空间中任务对应的空间。Step 406: When it is detected that the target count value of the task is zero, a cache release command is sent to the vertex cache space. The cache release command is used to instruct to release the space corresponding to the task in the vertex cache space.
当GPU检测到任务的目标计数值为零时,表示当前任务中的所有位于顶点缓存空间中的第二索引数据已经使用完毕,向顶点缓存空间发送以任务为粒度的缓存释放命令,缓存释放命令用于指示按照任务粒度释放空间,即释放顶点缓存空间中任务对应的空间。When the GPU detects that the target count value of the task is zero, it means that all the second index data in the vertex cache space of the current task has been used up, and a cache release command with task as the granularity is sent to the vertex cache space. The cache release command is used to indicate the release of space at the task granularity, that is, to release the space corresponding to the task in the vertex cache space.
可选的,当GPU检测到任务的目标计数值为零时,可将CAM中该任务对应的k个第二索引数据均重新设置为未分配的索引数据。需要说明的是,将k个第二索引数据进行打包得到任务后,在不需要CAM中该任务对应的k个索引数据的情况下即可进行上述的重新设置,本公开实施例对该重新设置的时机不加以限定。Optionally, when the GPU detects that the target count value of the task is zero, the k second index data corresponding to the task in the CAM can be reset to unallocated index data. It should be noted that after the k second index data are packaged to obtain the task, the above reset can be performed without the k index data corresponding to the task in the CAM, and the disclosed embodiment does not limit the timing of the reset.
可选的,对于n个顶点数据中的未重新分配的n-k个顶点数据,在处理完k个顶点数据对应的任务后(或者在将CAM中该任务对应的k个第二索引数据均重新设置为未分配的索引数据后),对n-k个顶点数据各自对应的第一索引数据继续执行进行CAM扫描的步骤。需要说明的是,对n-k个顶点数据各自对应的第一索引数据继续执行进行CAM扫描的步骤,可类比参考对n个顶点数据各自对应的第一索引数据进行CAM扫描的相关描述,在此不再赘述。Optionally, for the unreallocated n-k vertex data among the n vertex data, after processing the tasks corresponding to the k vertex data (or after resetting the k second index data corresponding to the task in the CAM to unallocated index data), the step of performing CAM scanning on the first index data corresponding to each of the n-k vertex data is continued. It should be noted that the step of performing CAM scanning on the first index data corresponding to each of the n-k vertex data can be analogously referred to the relevant description of performing CAM scanning on the first index data corresponding to each of the n vertex data, which will not be repeated here.
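步骤404至步骤406中按任务粒度进行计数与释放的过程，可以用如下示意性Python草图表达（并非本公开的实际实现，类名、方法名与返回值均为假设）。The per-task counting and release described in steps 404 to 406 can be expressed with the following illustrative Python sketch (not the actual implementation of the present disclosure; the class name, method names, and return values are assumptions).

```python
# Hedged sketch: each task stores a target count (second index data not yet
# used). Emitting primitive data subtracts the usage value, and when the
# count reaches zero the task's whole slice of the vertex cache is released.

class VertexCacheTracker:
    def __init__(self):
        self.counts = {}  # task id -> remaining target count value

    def store_task(self, task_id, second_index_count):
        self.counts[task_id] = second_index_count

    def emit_primitive(self, task_id, usage_value):
        """Subtract the usage value; release the task's space at zero."""
        self.counts[task_id] -= usage_value
        if self.counts[task_id] == 0:
            del self.counts[task_id]  # cache release at task granularity
            return "released"
        return "held"

tracker = VertexCacheTracker()
tracker.store_task(0, 4)              # task 0 holds 4 second index data
print(tracker.emit_primitive(0, 3))   # held: one index still unused
print(tracker.emit_primitive(0, 1))   # released: target count reached zero
```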
在一个示意性的例子中,如图5所示,GPU包括但不限于模块A(包括CAM51)、模块B、模块C、模块D、顶点缓存空间52和片段管线53,在图形渲染的几何阶段,在任务工作模式是目标粒度模式,且CAM51的工作模式是不覆写模式时,模块A获取输入的图元索引数据,对输入的图元索引数据按照图元的粒度进行CAM51扫描去重,当CAM51中空闲的索引数据数量m小于该图元中的顶点数据的总数量n时,模块A为n个顶点数据中的k个顶点数据重新分配各自对应的第二索引数据,强制将k个第二索引数据进行打包得到任务。模块B获取模块A打包的任务后,根据任务中的k个第二索引数据,将对应的k个顶点数据存入顶点缓存空间52中。模块C从顶点缓存空间52中读取k个顶点数据,并从模块A中获取k个顶点数据各自对应的第二索引数据,在将k个顶点数据与对应的k个第二索引数据进行同步处理后,将图元数据发送至模块D,此时模块C中包括保存的每个任务中的目标计数值。模块D获取图元数据,对图元数据进行处理,将处理后的数据发送至片段管线53。其中,模块C每向模块D发送一次图元数据,则将图元数据对应的任务的目标计数值减去相应的使用值。当模块C检测到其中保存的任务的目标计数值为零时,向顶点缓存空间52发送缓存释放命令,释放顶点缓存空间52中任务对应的空间。In an illustrative example, as shown in FIG5 , the GPU includes but is not limited to module A (including CAM51), module B, module C, module D, vertex cache space 52 and fragment pipeline 53. In the geometry stage of graphics rendering, when the task working mode is the target granularity mode and the working mode of CAM51 is the non-overwrite mode, module A obtains the input primitive index data, performs CAM51 scanning and deduplication on the input primitive index data according to the granularity of the primitive, and when the number m of idle index data in CAM51 is less than the total number n of vertex data in the primitive, module A reallocates the second index data corresponding to each of the k vertex data among the n vertex data, and forces the k second index data to be packaged to obtain the task. After module B obtains the task packaged by module A, it stores the corresponding k vertex data in the vertex cache space 52 according to the k second index data in the task. Module C reads k vertex data from the vertex cache space 52, and obtains the second index data corresponding to each of the k vertex data from module A. After synchronizing the k vertex data with the corresponding k second index data, the module C sends the primitive data to module D. At this time, module C includes the target count value in each task saved. 
Module D obtains the primitive data, processes the primitive data, and sends the processed data to the fragment pipeline 53. Each time module C sends primitive data to module D, the corresponding usage value is subtracted from the target count value of the task corresponding to the primitive data. When module C detects that the target count value of a task saved therein is zero, it sends a cache release command to the vertex cache space 52 to release the space corresponding to the task in the vertex cache space 52.
在一个示意性的例子中,如图6所示,CAM51中示例性地示出了4个第二索引数据,分别为索引0、索引1、索引2和索引3,顶点缓存空间52中示例性地示出了任务0,任务0包括索引0、索引1、索引2和索引3,其中索引0对应的顶点数据v0,索引1对应的顶点数据v1,索引2对应的顶点数据v2,索引3对应的顶点数据v3。当检测到任务0的目标计数值为零时,表示当前任务0中的所有位于顶点缓存空间52中的第二索引数据已经使用完毕,向顶点缓存空间52发送缓存释放命令,释放顶点缓存空间52中任务0对应的空间。当接收到打包的任务1时,获取任务1中的4个第二索引数据,分别为索引0、索引1、索引2和索引3,其中索引0对应的顶点数据v4,索引1对应的顶点数据v5,索引2对应的顶点数据v6,索引3对应的顶点数据v7,将4个顶点数据存入顶点缓存空间52。In an illustrative example, as shown in FIG6 , CAM51 exemplarily shows four second index data, namely, index 0, index 1, index 2 and index 3, and vertex cache space 52 exemplarily shows task 0, which includes index 0, index 1, index 2 and index 3, wherein index 0 corresponds to vertex data v0, index 1 corresponds to vertex data v1, index 2 corresponds to vertex data v2, and index 3 corresponds to vertex data v3. When it is detected that the target count value of task 0 is zero, it means that all the second index data in the vertex cache space 52 of the current task 0 has been used up, and a cache release command is sent to the vertex cache space 52 to release the space corresponding to task 0 in the vertex cache space 52. When the packaged task 1 is received, the four second index data in task 1 are obtained, namely, index 0, index 1, index 2 and index 3, wherein index 0 corresponds to vertex data v4, index 1 corresponds to vertex data v5, index 2 corresponds to vertex data v6, and index 3 corresponds to vertex data v7, and the four vertex data are stored in the vertex cache space 52.
需要说明的一点是，在实现其功能时，仅以上述各个功能模块的划分进行举例说明，实际应用中，可以根据实际需要而将上述功能分配由不同的功能模块完成，即将设备的内部结构划分成不同的功能模块，以完成以上描述的全部或者部分功能。It should be noted that, when implementing its functions, the division into the above functional modules is only used as an example. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
需要说明的另一点是,本公开实施例提供的CAM的不覆写模式,使得一个图元中的所有顶点数据各自对应的第二索引数据均存在顶点缓存空间中的一个任务中。一个任务对应的至少一个顶点数据与另一个任务对应的至少一个顶点数据不存在任何联系,无法组成图元。比如,在图6所示的例子中,任务0对应的顶点数据v2和顶点数据v3,与任务1对应的顶点数据v4不存在任何联系,不可以组成图元。Another point that needs to be explained is that the non-overwrite mode of the CAM provided by the embodiment of the present disclosure makes the second index data corresponding to all vertex data in a primitive exist in a task in the vertex cache space. At least one vertex data corresponding to a task has no connection with at least one vertex data corresponding to another task, and cannot form a primitive. For example, in the example shown in Figure 6, the vertex data v2 and vertex data v3 corresponding to task 0 have no connection with the vertex data v4 corresponding to task 1, and cannot form a primitive.
综上所述,本公开实施例提供了一种针对大粒度任务使用的CAM去重机制方案,一方面,通过当检测到任务工作模式是预设的目标粒度模式时,将CAM的工作模式设置为不覆写模式,目标粒度模式对应的粒度大于预设数值,即增大CAM去重扫描粒度,能够有效提高数据处理速度。另一方面,由于CAM的工作模式为不覆写模式,不覆写模式用于指示输入的索引数据不覆写CAM中已分配的索引数据,避免了目前的可覆写方案中索引数据需要等待被完全覆写才可以释放资源的情况,相较于相关技术中的可覆写方案,能够有效提高后续顶点缓存空间的利用率。另一方面,可以根据多个顶点数据各自对应的第二索引数据,将多个顶点数据存入顶点缓存空间中,并且当检测到任务的目标计数值为零时,向顶点缓存空间发送缓存释放命令,缓存释放命令用于指示释放顶点缓存空间中任务对应的空间,针对大粒度任务的处理方式能够提高带宽的同时充分利用顶点缓存空间的资源,更快的写入和释放顶点缓存空间的资源。In summary, the disclosed embodiment provides a CAM deduplication mechanism scheme for large-grained tasks. On the one hand, when it is detected that the task working mode is a preset target granularity mode, the working mode of the CAM is set to a non-overwrite mode, and the granularity corresponding to the target granularity mode is greater than the preset value, that is, the CAM deduplication scanning granularity is increased, which can effectively improve the data processing speed. On the other hand, since the working mode of the CAM is a non-overwrite mode, the non-overwrite mode is used to indicate that the input index data does not overwrite the allocated index data in the CAM, which avoids the situation in which the index data in the current overwriteable scheme needs to wait to be completely overwritten before the resources can be released. Compared with the overwriteable scheme in the related art, it can effectively improve the utilization rate of the subsequent vertex cache space. On the other hand, according to the second index data corresponding to each of the multiple vertex data, multiple vertex data can be stored in the vertex cache space, and when the target count value of the task is detected to be zero, a cache release command is sent to the vertex cache space. The cache release command is used to indicate the release of the space corresponding to the task in the vertex cache space. 
The processing method for large-grained tasks can improve the bandwidth while making full use of the resources of the vertex cache space, and write and release the resources of the vertex cache space faster.
请参考图7,其示出了本公开另一个示例性实施例提供的GPU的数据处理方法的流程图,本实施例以该方法用于图1所示的计算设备的GPU中来举例说明。该方法包括以下几个步骤。Please refer to Fig. 7, which shows a flow chart of a data processing method of a GPU provided by another exemplary embodiment of the present disclosure, and this embodiment is illustrated by using the method in the GPU of the computing device shown in Fig. 1. The method includes the following steps.
步骤701,获取输入的图元索引数据,图元索引数据包括图元中的n个顶点数据各自对应的第一索引数据,n为正整数。Step 701, obtaining input primitive index data, the primitive index data including first index data corresponding to each of n vertex data in the primitive, where n is a positive integer.
步骤702,对图元索引数据进行CAM扫描,CAM扫描用于指示在CAM中为n个顶点数据依次重新分配对应的第二索引数据,CAM的工作模式为不覆写模式,不覆写模式用于指示输入的索引数据不覆写CAM中已分配的索引数据。Step 702, perform CAM scanning on the primitive index data, the CAM scanning is used to indicate that the corresponding second index data is redistributed in sequence for n vertex data in the CAM, and the working mode of the CAM is a non-overwrite mode, which is used to indicate that the input index data does not overwrite the allocated index data in the CAM.
步骤703,当CAM中空闲的索引数据数量m小于n时,为n个顶点数据中的k个顶点数据重新分配各自对应的第二索引数据,并将k个第二索引数据进行打包得到任务,m和k均为正整数,k小于或等于m。Step 703, when the number m of idle index data in the CAM is less than n, reallocate the corresponding second index data for k vertex data among the n vertex data, and pack the k second index data to obtain a task, where m and k are both positive integers, and k is less than or equal to m.
需要说明的是,本实施例中的各个步骤的相关细节可参考上述实施例中的相关描述,在此不再赘述。It should be noted that the relevant details of each step in this embodiment can be referred to the relevant description in the above embodiment, and will not be repeated here.
The following is an apparatus embodiment of the present disclosure. For parts not described in detail here, reference may be made to the technical details disclosed in the method embodiments above.

Refer to FIG. 8, which shows a schematic structural diagram of a GPU data processing apparatus provided by an exemplary embodiment of the present disclosure. The apparatus may be implemented, through software, hardware, or a combination of the two, as all or part of a computing device. The apparatus includes an acquisition module 22, a scanning module 23, and a packing module 24.

The acquisition module 22 is configured to acquire input primitive index data, the primitive index data including first index data corresponding to each of n vertex data in a primitive, where n is a positive integer.

The scanning module 23 is configured to perform a CAM scan on the primitive index data, the CAM scan sequentially reallocating corresponding second index data for the n vertex data in the CAM.

The packing module 24 is configured to, when the number m of free index entries in the CAM is less than n, reallocate corresponding second index data for k of the n vertex data and pack the k second index data into a task, where m and k are both positive integers and k is less than or equal to m.

The CAM operates in a non-overwrite mode, in which incoming index data does not overwrite index data already allocated in the CAM.
In one possible implementation, the apparatus further includes a first setting module configured to:

acquire control data indicating the task working mode; and

when the task working mode is detected to be a target granularity mode, set the working mode of the CAM to the non-overwrite mode.
In another possible implementation, the granularity corresponding to the target granularity mode is greater than a preset value, the granularity indicating the maximum number of work-item instances included in an assembled task.
In another possible implementation, the apparatus further includes a second setting module configured to, when the task working mode is detected not to be the target granularity mode, set the working mode of the CAM to an overwrite mode, in which incoming index data is permitted to overwrite index data already allocated in the CAM.
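Taken together, the two setting modules implement a simple threshold test on the configured granularity. A sketch follows; the threshold constant, function name, and mode strings are hypothetical stand-ins for the control data described above:

```python
# Assumed stand-in for the preset value mentioned in the text.
PRESET_GRANULARITY = 32

def select_cam_mode(granularity: int) -> str:
    """Pick the CAM working mode from the task granularity, i.e. the
    maximum number of work-item instances per assembled task."""
    # Target granularity mode: granularity strictly above the preset value.
    return "non-overwrite" if granularity > PRESET_GRANULARITY else "overwrite"
```

For example, a granularity of 64 selects the non-overwrite mode, while 16 (or exactly the preset value) falls back to the overwrite mode.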
In another possible implementation, the apparatus further includes an allocation module configured to, when m is greater than or equal to n, reallocate corresponding second index data for all n vertex data.
In another possible implementation, the second index data indicates the position of the corresponding vertex data in a vertex cache space, and the apparatus further includes a cache module configured to:

acquire the packed task; and

according to the k second index data in the task, store the vertex data corresponding to each of the k second index data into the vertex cache space.
In another possible implementation, the apparatus further includes an output module configured to:

read the k vertex data from the vertex cache space and acquire the second index data corresponding to each of the k vertex data; and

after synchronizing the k vertex data with the corresponding k second index data, output primitive data including at least one vertex datum.
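The cache module and output module together form a round trip through the vertex cache. A minimal sketch, assuming the second index is usable directly as a cache slot number (the function names and the plain-dict cache are illustrative, not the hardware interface):

```python
def store_task(cache: dict, task: list, vertices: list) -> None:
    # Cache module: write each vertex into the slot named by its second index.
    for second_idx, vertex in zip(task, vertices):
        cache[second_idx] = vertex

def emit_primitive(cache: dict, task: list) -> list:
    # Output module: read the vertices back and pair each with its second
    # index before emitting the primitive data.
    return [(second_idx, cache[second_idx]) for second_idx in task]
```

The pairing step models the synchronization described above: a vertex leaves the cache only alongside the second index that locates it.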
In another possible implementation, the apparatus further includes a calculation module configured to:

store a target count value for each task, the target count value being the number of second index data of the task that occupy the vertex cache space but have not yet been used; and

when primitive data is output, subtract a usage value from the target count value, the usage value being the number of vertex data included in the primitive data.
In another possible implementation, the apparatus further includes a release module configured to, when the target count value of a task is detected to be zero, send a cache release command to the vertex cache space, the command instructing release of the space corresponding to the task in the vertex cache space.
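The calculation and release modules amount to per-task reference counting on the vertex cache space. A sketch under the same assumptions (the `TaskTracker` class and its method names are hypothetical):

```python
class TaskTracker:
    """Per-task countdown: a task's cache space is released once every
    second index it packed has been consumed by output primitive data."""

    def __init__(self):
        self.counts = {}    # task id -> unused second indices remaining
        self.released = []  # task ids whose cache space has been freed

    def add_task(self, task_id, num_second_indices):
        # Target count value: second indices of the task not yet used.
        self.counts[task_id] = num_second_indices

    def on_output(self, task_id, vertices_used):
        # Subtract the usage value (vertices in the emitted primitive data).
        self.counts[task_id] -= vertices_used
        if self.counts[task_id] == 0:
            # Stand-in for sending the cache release command.
            self.released.append(task_id)
```

A task packed with three second indices is released only after outputs account for all three vertices, never earlier.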
It should be noted that the apparatus provided in the above embodiment is described with the above division of functional modules merely as an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.

Regarding the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
An embodiment of the present disclosure further provides a computing device including a processor and a memory for storing processor-executable instructions, wherein the processor is configured to implement the steps performed by the GPU of the computing device in the method embodiments above.

An embodiment of the present disclosure further provides a non-volatile computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the methods of the method embodiments above.
FIG. 9 is a block diagram of an apparatus 900 for a GPU data processing method according to an exemplary embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.

Referring to FIG. 9, the apparatus 900 may include one or more of the following components: a processing component 20, a memory 30, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.

The processing component 20 generally controls the overall operation of the apparatus 900, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 20 may include one or more processors 920 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 20 may include one or more modules that facilitate interaction between the processing component 20 and the other components; for example, it may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 20.
The memory 30 is configured to store various types of data to support operation of the apparatus 900. Examples of such data include instructions for any application or method operating on the apparatus 900, contact data, phone book data, messages, pictures, videos, and the like. The memory 30 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

The power component 906 provides power to the various components of the apparatus 900 and may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 900.
The multimedia component 908 includes a screen that provides an output interface between the apparatus 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the panel; the touch sensors may sense not only the boundary of a touch or swipe action but also its duration and pressure. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera, which may receive external multimedia data when the apparatus 900 is in an operating mode such as a shooting mode or a video mode. Each front or rear camera may be a fixed optical lens system or have focal-length and optical-zoom capability.

The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC) configured to receive external audio signals when the apparatus 900 is in an operating mode such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 30 or transmitted via the communication component 916. In some embodiments, the audio component 910 also includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 20 and peripheral interface modules such as a keyboard, a click wheel, or buttons, which may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.

The sensor component 914 includes one or more sensors that provide status assessments of various aspects of the apparatus 900. For example, the sensor component 914 may detect the open/closed state of the apparatus 900 and the relative positioning of components such as its display and keypad, as well as a change in position of the apparatus 900 or one of its components, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and changes in its temperature. The sensor component 914 may include a proximity sensor configured to detect nearby objects without any physical contact, and an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 916 is configured to facilitate wired or wireless communication between the apparatus 900 and other devices. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 916 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 916 further includes a near-field communication (NFC) module to facilitate short-range communication; the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the above method.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 30 including computer program instructions executable by the processor 920 of the apparatus 900 to complete the above method.
FIG. 10 is a block diagram of an apparatus 1000 for a GPU data processing method according to an exemplary embodiment. For example, the apparatus 1000 may be provided as a server. Referring to FIG. 10, the apparatus 1000 includes a processing component 1022, which in turn includes one or more processors, and memory resources represented by a memory 30 for storing instructions executable by the processing component 1022, such as an application. The application stored in the memory 30 may include one or more modules, each corresponding to a set of instructions. The processing component 1022 is configured to execute the instructions to perform the above method.

The apparatus 1000 may also include a power component 1026 configured to perform power management of the apparatus 1000, a wired or wireless network interface 1050 configured to connect the apparatus 1000 to a network, and an input/output (I/O) interface 1058. The apparatus 1000 may operate based on an operating system stored in the memory 30, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 30 including computer program instructions executable by the processing component 1022 of the apparatus 1000 to complete the above method.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.

The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction-execution device. It may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random-access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (for example, light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.

Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), may be personalized by utilizing state information of the computer-readable program instructions, and may execute those instructions in order to implement various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks therein, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data-processing apparatus to produce a machine, such that the instructions, when executed via the processor of the computer or other programmable data-processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data-processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions may also be loaded onto a computer, another programmable data-processing apparatus, or another device to cause a series of operational steps to be performed thereon so as to produce a computer-implemented process, such that the instructions executed on the computer, other programmable apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or actions, or by combinations of special-purpose hardware and computer instructions.

The embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311785293.4A | 2023-12-22 | 2023-12-22 | GPU data processing method, device and storage medium |

Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN117764808A | 2024-03-26 |
| CN117764808B | 2024-09-17 |
Family
ID=90314019
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |
| CP03 | Change of name, title or address |

Address after: B655, 4th Floor, Building 14, Cuiwei Zhongli, Haidian District, Beijing, 100036; Patentee after: Mole Thread Intelligent Technology (Beijing) Co., Ltd., China. Address before: 209, 2nd Floor, No. 31 Haidian Street, Haidian District, Beijing; Patentee before: Moore Threads Technology Co., Ltd., China.