Disclosure of Invention
In view of the above problems, the present invention is directed to a method for processing shared memory between threads based on SPIR-V, and aims to solve the technical problem of how to translate the instructions that process shared memory in SPIR-V code into GPU instruction sequences with the same semantics after high-level language code has been converted into SPIR-V code.
The invention adopts the following technical scheme:
The method for processing shared memory among threads based on SPIR-V comprises the following steps:
Step S1, reading the source code of a high-level language program into a character buffer, and converting the source code in the character buffer into intermediate code in SPIR-V form;
Step S2, traversing the instruction sequence of the intermediate code; for a variable declaration, if the storage class of the variable is Workgroup, identifying the current variable as a shared memory variable, calculating the space occupied by the variable, recording the variable offset, and finally accumulating the sum of the space occupied by all shared memory variables;
Step S3, calculating the size of the shared memory space to be allocated according to the sum of the occupied space sizes and acquiring a base address, then computing the one-dimensional index of each thread group and calculating the shared memory address of the thread group from the one-dimensional index, the sum of the occupied space sizes, and the base address;
Step S4, traversing the instruction sequence of the intermediate code; for instructions in function declarations and function definitions, if processing of the shared memory is involved, calculating the real address and generating different GPU instructions according to the instruction type.
Further, the specific process of step S2 is as follows:
Step S21, traversing the instruction sequence of the intermediate code, wherein OpVariable instructions are used for variable declarations; for each variable declaration, judging whether the storage class of the OpVariable instruction is Workgroup, and if so, the variable declared by this instruction is a shared memory variable;
Step S22, calculating the space size S_i of the declared variable according to the type referenced by the operands of the OpVariable instruction, wherein if the type declaration is OpTypeVoid, no space is occupied and S_i is 0; if the type declaration is a primitive type, S_i is the specific space size of that primitive type; and if the type declaration is a composite data type, S_i is the specific space size of the whole composite type;
Step S23, calculating the sum of the occupied space sizes of all shared memory variables, S_total = S_1 + S_2 + … + S_N, where N is the number of shared memory variables; in addition, the variable offset offset_k of the k-th shared memory variable is recorded synchronously, where offset_k = S_1 + S_2 + … + S_(k-1), i.e. the sum of the space occupied by all variables declared before the current shared memory variable.
Further, the specific process of step S3 is as follows:
Step S31, calculating the size SmTotalSize of the shared memory space to be allocated, SmTotalSize = S_total × GroupNum, where GroupNum is the number of all thread groups, and allocating a region of size SmTotalSize in the GPU video memory as the shared memory space, the base address of the region being SmBaseAddr;
Step S32, calculating the one-dimensional index GroupIndex of the thread group by generating a segment of GPU instructions, wherein if the index of the thread group in each dimension is i_1, i_2, …, i_d and the maximum thread group size in each dimension is n_1, n_2, …, n_d, the one-dimensional index GroupIndex of the thread group is:
GroupIndex = i_1 + n_1 × (i_2 + n_2 × (i_3 + … + n_(d-2) × (i_(d-1) + n_(d-1) × i_d) … ));
Step S33, calculating the shared memory address of the thread group in the shared memory space, i.e. the start address SmAddr = GroupIndex × S_total + SmBaseAddr.
Further, in step S4, the SPIR-V instructions involved include OpLoad instructions, OpStore instructions, and OpControlBarrier instructions, wherein OpLoad instructions are used for memory reads, OpStore instructions are used for memory writes, and OpControlBarrier instructions are used for controlling synchronization and memory barriers of memory accesses:
For an OpLoad instruction, before generating the corresponding GPU instruction, the real address realAddr = offset_k + SmAddr is calculated; if the pointer operand of the OpLoad instruction comes directly from the return value of an OpVariable instruction, the address read by the OpLoad instruction is realAddr; if the pointer operand of the OpLoad instruction comes from the return value of a SPIR-V chained-access instruction, the address offset is recursively calculated from the base address and indices of the chained-access instruction, and this address offset is added to realAddr to obtain the address read by the OpLoad instruction;
For an OpStore instruction, whether data is written into the shared memory space is judged by whether the storage class of the pointer operand of the OpStore instruction is Workgroup, and the real address is calculated in the same way as for the OpLoad instruction;
For an OpControlBarrier instruction, a corresponding GPU barrier instruction Barrier is generated according to the execution scope and memory scope synchronized by the OpControlBarrier instruction, ensuring correct synchronization of the thread groups.
The method has the following advantages: by converting the source code of the high-level language program into SPIR-V code, traversing the variable declaration instructions, and accurately calculating the total shared memory space used according to the data type of each variable, space can be allocated precisely and memory waste is reduced. At the same time, a segment of GPU instructions is added at the head of the GPU instruction sequence so that threads compute the shared memory address of their thread group before executing any operation, and this address is reused to compute real addresses during subsequent instruction translation, reducing repeated calculation. Because the instructions used in the method are basic instructions available in all GPU instruction sets, the method is portable across hardware platforms.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The method aims to solve the problem of converting the SPIR-V instructions that process shared memory into GPU instruction sequences with the same function after a high-level language has been converted into SPIR-V intermediate code. The invention provides a method for processing shared memory among threads based on SPIR-V. The method first analyzes the variable declarations of the shared memory variables used in the SPIR-V code, then calculates the size of the occupied space, allocates shared memory space in hardware, generates a segment of GPU instructions placed at the head of the instruction sequence to calculate the shared memory address, and finally converts the instructions that read, write, and synchronize the shared memory in the SPIR-V code into GPU instructions with the same semantics. The generated GPU instructions are basic instructions found in all GPU instruction sets, including MUL (multiplication), ADD (addition), LOAD (memory read), and STORE (memory write). In order to illustrate the technical scheme of the invention, the following description is made through specific examples.
As shown in fig. 1, the present embodiment provides a method for processing shared memory between threads based on SPIR-V, comprising the following steps:
Step S1, reading the source code of the high-level language program into a character buffer, and converting the source code in the character buffer into intermediate code in the form of SPIR-V.
The host program first reads the high-level language code from a file or other source into a character buffer, and then converts the source code in the character buffer into intermediate code in the form of SPIR-V. The SPIR-V instructions are arranged in a specified order, so converting the source code into intermediate code in SPIR-V form yields a SPIR-V instruction sequence.
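By way of illustration only, the host-side reading and conversion of step S1 can be sketched in C++ as follows; the function compileToSpirv is a hypothetical placeholder for whichever front end (for example a GLSL or OpenCL C compiler) performs the actual conversion to SPIR-V:

#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical front-end entry point: turns high-level source text into a
// SPIR-V word stream. A real GLSL/OpenCL C front end would sit here; this
// stub is an assumption for illustration only.
std::vector<uint32_t> compileToSpirv(const std::string& source) {
    (void)source;
    return {};  // a real front end would return the SPIR-V binary words
}

std::vector<uint32_t> loadAndCompile(const std::string& path) {
    std::ifstream file(path);             // read the source code from a file
    std::stringstream buffer;             // the "character buffer" of step S1
    buffer << file.rdbuf();
    return compileToSpirv(buffer.str());  // intermediate code in SPIR-V form
}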
Step S2, traversing the instruction sequence of the intermediate code; for a variable declaration, if the storage class of the variable is Workgroup, the current variable is recognized as a shared memory variable, the space it occupies is calculated, the variable offset is recorded, and finally the sum of the space occupied by all shared memory variables is accumulated.
A SPIR-V module consists of a precisely arranged binary data stream in which each word contains 32 bits, following little-endian byte order, and each instruction consists of one or more words. Each instruction in SPIR-V contains an operation code (OpCode) and several operands. The opcode defines the type of instruction: every SPIR-V instruction begins with an opcode indicating its type and function. An opcode is a 16-bit integer that uniquely identifies an operation, and the operands provide the specific data needed for instruction execution. These operands may be immediate values, values produced by other instructions, or references to other resources such as buffers or textures.
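A minimal C++ sketch of such a traversal, assuming only the standard SPIR-V binary layout (a 5-word module header, then instructions whose first word packs the word count in the high 16 bits and the opcode in the low 16 bits):

#include <cstdint>
#include <vector>

// Walk a SPIR-V binary: after the 5-word module header, each instruction's
// first word packs the word count (high 16 bits) and opcode (low 16 bits).
void traverse(const std::vector<uint32_t>& words) {
    size_t pos = 5;                                      // skip module header
    while (pos < words.size()) {
        uint32_t first     = words[pos];
        uint16_t opcode    = first & 0xFFFFu;            // instruction type
        uint16_t wordCount = first >> 16;                // words incl. this one
        if (wordCount == 0) break;                       // malformed stream
        const uint32_t* operands = words.data() + pos + 1; // operand words
        (void)opcode; (void)operands;                    // dispatch on opcode here
        pos += wordCount;                                // next instruction
    }
}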
This step traverses the SPIR-V instruction sequence, calculating and accumulating the sum of the sizes of the space occupied by the shared memory variables. The specific process is as follows:
Step S21, traversing the instruction sequence of the intermediate code, wherein OpVariable instructions are used for variable declarations; for each variable declaration, it is judged whether the storage class of the OpVariable instruction is Workgroup, and if so, the variable declared by this instruction is a shared memory variable.
In intermediate code in the SPIR-V format, OpVariable instructions are used for variable declarations, and the Storage Class of an OpVariable instruction specifies the storage type of the variable. The Storage Class in SPIR-V defines the storage type of a variable and its access scope. Common storage classes include UniformConstant, Input, Uniform, Output, Workgroup, CrossWorkgroup, Private, Function, and Generic. UniformConstant, for example, is used for externally shared read-only variables that can be accessed in all functions of all invocations; it is commonly used for graphics uniform memory and OpenCL constant memory, and may carry an initializer, depending on the client API specification. Therefore, in this step it is determined whether the storage class of an OpVariable instruction in the instruction sequence is Workgroup; if so, the variable declared by the OpVariable instruction is a shared memory variable and step S22 is executed to calculate the space occupied by the shared variable; otherwise, traversal continues to the next OpVariable instruction.
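By way of illustration, the check of step S21 can be sketched as follows; the numeric values are those published in the SPIR-V specification (OpVariable = 59, Workgroup storage class = 4), and the operand layout of OpVariable places the storage class in its third operand word:

#include <cstdint>

// Enum values as defined in the SPIR-V specification (spirv.h).
constexpr uint16_t OpVariableOpcode      = 59;
constexpr uint32_t StorageClassWorkgroup = 4;

// Operand layout of OpVariable: [0] = result type id, [1] = result id,
// [2] = storage class, [3] = optional initializer.
bool isSharedMemoryVariable(uint16_t opcode, const uint32_t* operands) {
    return opcode == OpVariableOpcode &&
           operands[2] == StorageClassWorkgroup;
}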
Step S22, calculating the space size S_i of the declared variable according to the type referenced by the operands of the OpVariable instruction, wherein if the type declaration is OpTypeVoid, no space is occupied and S_i is 0; if the type declaration is a primitive type, S_i is the specific space size of that primitive type; and if the type declaration is a composite data type, S_i is the specific space size of the whole composite type.
The Result Type in the operands of the OpVariable instruction is the return value of an OpTypePointer instruction; the OpTypePointer instruction is used to declare a pointer type, and the Type operand of the OpTypePointer instruction is the type of object that the pointer type points to. The occupied space size S_i of the declared variable is calculated from this type.
SPIR-V provides various instructions for declaring different types. In this embodiment, if the type declaration is OpTypeVoid, no memory is occupied and S_i is 0; if the type declaration is a primitive type, such as OpTypeBool, OpTypeFloat, or OpTypeInt, S_i is the specific footprint of that primitive type; and if the type declaration is a composite data type, such as OpTypeVector, OpTypeMatrix, OpTypeArray, or OpTypeStruct, S_i is the specific footprint of the entire composite type.
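The size computation of step S22 can be sketched as follows; the TypeInfo structure is an illustrative simplification that a real implementation would build from the type declaration instructions, and alignment or padding rules mandated by a client API are deliberately omitted:

#include <cstdint>
#include <vector>

// Simplified view of a SPIR-V type declaration; a real implementation would
// construct this from OpTypeVoid/OpTypeInt/OpTypeVector/... instructions.
struct TypeInfo {
    enum Kind { Void, Scalar, Vector, Matrix, Array, Struct } kind;
    uint32_t scalarBytes  = 0;             // e.g. 4 for a 32-bit int/float
    uint32_t elementCount = 0;             // components / columns / elements
    const TypeInfo* element = nullptr;     // element type (set for composites)
    std::vector<const TypeInfo*> members;  // struct member types
};

// Computes S_i for one declared variable (padding/alignment omitted for
// brevity; a production compiler must honor the client API's layout rules).
uint32_t sizeOf(const TypeInfo& t) {
    switch (t.kind) {
    case TypeInfo::Void:   return 0;              // OpTypeVoid: no space
    case TypeInfo::Scalar: return t.scalarBytes;  // OpTypeBool/Int/Float
    case TypeInfo::Vector:                        // OpTypeVector
    case TypeInfo::Matrix:                        // OpTypeMatrix
    case TypeInfo::Array:                         // OpTypeArray
        return t.elementCount * sizeOf(*t.element);
    case TypeInfo::Struct: {                      // OpTypeStruct: sum members
        uint32_t total = 0;
        for (const TypeInfo* m : t.members) total += sizeOf(*m);
        return total;
    }
    }
    return 0;
}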
Step S23, calculating the sum of the occupied space sizes of all shared memory variables, S_total = S_1 + S_2 + … + S_N, where N is the number of shared memory variables; in addition, the variable offset offset_k of the k-th shared memory variable is recorded synchronously, where offset_k = S_1 + S_2 + … + S_(k-1), i.e. the sum of the space occupied by all variables declared before the current shared memory variable.
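A sketch of the accumulation of step S23; the offsets are simply the running prefix sums of the sizes in declaration order:

#include <cstdint>
#include <vector>

// Given the sizes S_1..S_N of the shared memory variables in declaration
// order, offset_k is the sum of all earlier sizes and the return value
// is S_total, the sum over all N variables.
uint32_t computeOffsets(const std::vector<uint32_t>& sizes,
                        std::vector<uint32_t>& offsets) {
    uint32_t total = 0;
    offsets.clear();
    for (uint32_t s : sizes) {
        offsets.push_back(total);  // offset_k = S_1 + ... + S_(k-1)
        total += s;                // accumulate toward S_total
    }
    return total;                  // S_total
}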
Step S3, calculating the size of the shared memory space to be allocated according to the sum of the occupied space sizes and acquiring a base address, then computing the one-dimensional index of each thread group and calculating the shared memory address of the thread group from the one-dimensional index, the sum of the occupied space sizes, and the base address.
This step is mainly to calculate the one-dimensional index. The process is specifically as follows:
Step S31, calculating the size SmTotalSize of the shared memory space to be allocated, SmTotalSize = S_total × GroupNum, where GroupNum is the number of all thread groups, and allocating a region of size SmTotalSize in the GPU video memory as the shared memory space, the base address of the region being SmBaseAddr.
GroupNum is the number of all thread groups; its value is supplied by the user when the driver API is called. A region of size SmTotalSize is allocated in the GPU video memory as the shared memory space, the base address of this memory region is SmBaseAddr, and the shared memory space is used to store the data of the shared memory variables of the different thread groups.
Step S32, calculating the one-dimensional index GroupIndex of the thread group by generating a segment of GPU instructions, wherein if the index of the thread group in each dimension is i_1, i_2, …, i_d and the maximum thread group size in each dimension is n_1, n_2, …, n_d, the one-dimensional index GroupIndex of the thread group is:
GroupIndex = i_1 + n_1 × (i_2 + n_2 × (i_3 + … + n_(d-2) × (i_(d-1) + n_(d-1) × i_d) … )).
The aim of the invention is to convert source code into SPIR-V intermediate code, i.e. a SPIR-V instruction sequence, and then translate this instruction sequence into GPU instructions; the translated GPU instructions also form an instruction sequence. The instructions included in a GPU instruction set may vary depending on the architecture and implementation of the GPU, but they typically include basic instructions for performing various computational tasks, managing data, and controlling program flow: arithmetic instructions such as add, subtract (sub), multiply (mul), and divide (div), and load and store instructions for memory access, where load reads data from a memory address and store writes data to a memory address. In embodiments of the present invention, the translation of SPIR-V instructions relies primarily on these instructions.
Before translation, a segment of GPU instructions computing the one-dimensional index GroupIndex of the thread group is first generated and placed at the head of the GPU instruction sequence, so that it is executed first. Let i_1, i_2, …, i_d be the index of the thread group in each dimension, and n_1, n_2, …, n_d be the maximum thread group size in each dimension, where d is the number of dimensions. The one-dimensional index GroupIndex of the thread group is:
GroupIndex = i_1 + n_1 × (i_2 + n_2 × (i_3 + … + n_(d-2) × (i_(d-1) + n_(d-1) × i_d) … )).
For example, for common standards such as OpenCL and Vulkan, the maximum dimensionality supported for thread groups is three, in which case GroupIndex = i_1 + i_2 × n_1 + i_3 × n_1 × n_2. Expressed in three-address code form, this is:
R1 = i2 × n1
R2 = i1 + R1          // computes i1 + i2×n1
R3 = n1 × n2
R4 = R3 × i3          // computes i3×n1×n2
GroupIndex = R4 + R2  // GroupIndex = i1 + i2×n1 + i3×n1×n2
Three-address code is a representation used for intermediate code generation and is widely used in compiler design. Each three-address code instruction typically involves at most three addresses: two operands and one result.
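By way of illustration, the following C++ sketch emits such a three-address sequence for an arbitrary dimension count d, using the nested (Horner) form of the GroupIndex formula; the register names R1, R2, … are illustrative only:

#include <cstdio>
#include <vector>

// Emits three-address code computing
// GroupIndex = i_1 + n_1*(i_2 + n_2*(... + n_(d-1)*i_d ...))
// from the innermost term outward. Assumes d >= 1.
int emitGroupIndex(const std::vector<const char*>& i,   // index per dimension
                   const std::vector<const char*>& n) { // group count per dim
    int d = static_cast<int>(i.size());
    int reg = 0;
    std::printf("R%d = %s\n", ++reg, i[d - 1]);         // innermost term i_d
    for (int j = d - 2; j >= 0; --j) {
        std::printf("R%d = %s x R%d\n", reg + 1, n[j], reg);      // n_j * acc
        std::printf("R%d = %s + R%d\n", reg + 2, i[j], reg + 1);  // i_j + ...
        reg += 2;
    }
    return reg;  // register holding GroupIndex
}

int main() {
    emitGroupIndex({"i1", "i2", "i3"}, {"n1", "n2", "n3"});
    // For d = 3 this prints a sequence equivalent to
    // GroupIndex = i1 + i2*n1 + i3*n1*n2, matching the example above.
}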
Step S33, calculating the shared memory address of the thread group in the shared memory space, i.e. the start address SmAddr = GroupIndex × S_total + SmBaseAddr.
Each thread needs to calculate the start address SmAddr of its thread group in the shared memory space before using it; in this embodiment the shared memory regions used by the different thread groups are arranged linearly within the shared memory space, so SmAddr = GroupIndex × S_total + SmBaseAddr.
Accordingly, 2 instructions are added after the instructions computing the one-dimensional index GroupIndex, expressed in three-address code form as follows:
AddrOffset = GroupIndex × S_total
SmAddr = AddrOffset + SmBaseAddr
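As a purely illustrative numerical check (the values are not from the claimed method): if GroupIndex = 5, S_total = 128 bytes, and SmBaseAddr = 0x1000, then AddrOffset = 5 × 128 = 640 and SmAddr = 0x1000 + 640 = 0x1280.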
Step S4, traversing the instruction sequence of the intermediate code; for instructions in function declarations and function definitions, if processing of the shared memory is involved, the real address is calculated and different GPU instructions are generated according to the instruction type.
The SPIR-V instructions handled in this step include OpLoad instructions, OpStore instructions, and OpControlBarrier instructions, wherein OpLoad instructions are used for memory reads, OpStore instructions are used for memory writes, and OpControlBarrier instructions are used for controlling synchronization and memory barriers of memory accesses.
For OpLoad instructions: the OpLoad instruction in SPIR-V is used for memory reads. The OpLoad instruction has a result type operand; it is determined whether the storage class of the result type is Workgroup, and if so, the instruction reads data from the shared memory. The pointer operand of the OpLoad instruction is the address of the memory; from it the originating OpVariable instruction can be traced back, and the previously recorded variable offset offset_k can be obtained. An add instruction is then inserted to calculate the real address realAddr before the GPU instructions for the OpLoad instruction are generated, expressed in three-address code form as follows:
realAddr = offset_k + SmAddr
If the pointer operand of the OpLoad instruction comes directly from the return value of an OpVariable instruction, meaning that no additional memory operation occurs in between, the address read by the OpLoad instruction is realAddr; if the pointer operand of the OpLoad instruction comes from the return value of a SPIR-V chained-access instruction, the address offset is recursively calculated from the base address and indices of the chained-access instruction, and this address offset is added to realAddr to obtain the address read by the OpLoad instruction.
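By way of illustration, the recursive offset calculation can be sketched as follows; the AccessChain structure is a simplification in which each index of a chained-access instruction (such as OpAccessChain) is assumed to have already been resolved to a byte offset using the pointed-to type:

#include <cstdint>
#include <vector>

// Simplified access-chain record: a base (either the OpVariable itself or
// another chain) plus the byte offsets contributed by its indices.
struct AccessChain {
    const AccessChain* base = nullptr;  // nullptr => base is the OpVariable
    std::vector<uint32_t> byteOffsets;  // byte offset of each indexed step
};

// Recursively folds the chain into a single byte offset, which is then
// added to realAddr = offset_k + SmAddr to form the final load address.
uint32_t chainOffset(const AccessChain& c) {
    uint32_t off = c.base ? chainOffset(*c.base) : 0;  // recurse to the root
    for (uint32_t o : c.byteOffsets) off += o;         // accumulate indices
    return off;
}

uint32_t loadAddress(uint32_t realAddr, const AccessChain* chain) {
    return chain ? realAddr + chainOffset(*chain) : realAddr;
}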
For OpStore instructions: the OpStore instruction in SPIR-V is used for memory writes. Whether the OpStore instruction writes data into the shared memory space is judged by whether the storage class of its pointer operand is Workgroup, and the real address is calculated in the same way as for the OpLoad instruction.
For OpControlBarrier instructions: the OpControlBarrier instruction in SPIR-V is used to control synchronization and memory barriers of memory accesses. A corresponding GPU barrier instruction Barrier is generated according to the execution scope and memory scope synchronized by the OpControlBarrier instruction, ensuring correct synchronization of the thread groups.
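A sketch of the barrier translation; the scope value Workgroup = 2 follows the SPIR-V specification, while the target mnemonics (BARRIER, BARRIER.SYS) are hypothetical placeholders for the actual barrier instructions of a concrete GPU:

#include <cstdint>
#include <cstdio>

// Scope values as defined in the SPIR-V specification.
constexpr uint32_t ScopeWorkgroup = 2;

// Illustrative translation of OpControlBarrier: the execution scope and
// memory scope operands (SPIR-V <id>s resolving to scope constants) select
// the flavor of the target GPU's barrier instruction.
void emitBarrier(uint32_t executionScope, uint32_t memoryScope) {
    if (executionScope == ScopeWorkgroup && memoryScope == ScopeWorkgroup) {
        std::printf("BARRIER\n");      // synchronize threads within one group
    } else {
        std::printf("BARRIER.SYS\n");  // hypothetical wider-scope variant
    }
}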
All instructions are traversed according to steps S2 to S4 until translation is complete. As can be seen from the above steps, the present invention is a method for allocating and addressing shared memory and for mapping the instructions that process shared memory in SPIR-V code onto a GPU instruction set: it records the sizes and offsets of all shared memory variables in the SPIR-V code, allocates a shared memory space of the corresponding size, generates instructions at the head of the GPU instruction sequence ensuring that each thread calculates the shared memory address of its thread group before executing any operation, and translates the remaining SPIR-V instructions that process shared memory into GPU instructions with the same function.
In summary, the technical scheme of the invention has the following characteristics:
First, the invention traverses the variable declaration instructions in the SPIR-V code to precisely calculate the size of the shared memory space required by each thread group, records the offset of each variable within the shared memory, and finally calculates the total size of the shared memory space required by all thread groups. Second, after the shared memory space is allocated according to this total size, a segment of basic instructions is generated at the head of the GPU instruction sequence to compute the one-dimensional index of each thread group and, from it, the shared memory address of the thread group, which is then used to compute real addresses when subsequent instructions are translated. Third, when the memory read/write instructions of SPIR-V are translated, if a variable in the shared memory is read or written, an addition instruction is inserted to compute the real address, and the memory barrier OpControlBarrier instructions of SPIR-V are converted into GPU instructions with the same semantics to ensure correct synchronization of the thread groups.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.