Background
Computing devices such as artificial intelligence (AI) chips can provide tremendous computing power, which stems from the large number of hardware cores inside them. One AI chip typically contains multiple programmable multiprocessor clusters, such as stream processor clusters (Stream Processor Cluster, SPC). Each programmable multiprocessor cluster typically includes a plurality of programmable multiprocessors, such as a plurality of compute units (Compute Unit, CU), and each compute unit typically includes a plurality of execution units (Execution Unit, EU, also called execution cores), such as at least one of an integer (INT) core module, a floating point (FP) core module, a tensor core (Tcore) module, and a vector core (Vcore) module. Programmable multiprocessors can support general-purpose computing, scientific computing, and neural network computing by programmatically organizing these various types of execution units.
The tensor core module is a domain-specific architecture (Domain Specific Architecture, DSA) that performs AI operations, handling special operators for AI operations including, but not limited to, tensor data handling, matrix multiply-add (Matrix Multiply and Add, MMA), and convolution (Convolution) operations. In addition to operation-related control information, the tensor core module typically uses a configuration register approach to obtain further tensor data tags (including, but not limited to, address coordinate information for the tensor, size information for the tensor, and boundary zero-padding information). The contents of these registers are not statically configured but are dynamically configured by instructions; they need only be generated by fixed-point operations or simple data movement, and these are typically scalar operations.
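As an illustrative sketch (the field names and layout here are hypothetical, not taken from any actual chip), the tensor data tags described above can be modeled as a small descriptor whose fields are produced purely by scalar fixed-point arithmetic or data movement:

```python
from dataclasses import dataclass

@dataclass
class TensorTag:
    # Hypothetical fields modeling the tensor data tags described above.
    base_address: int   # address coordinate information for the tensor
    height: int         # size information for the tensor
    width: int
    pad_top: int        # boundary zero-padding information
    pad_left: int

def make_tag(base: int, h: int, w: int, pad: int) -> TensorTag:
    # Each field needs only fixed-point (integer) arithmetic or a simple
    # move, which is why a scalar pipeline suffices to generate it.
    return TensorTag(base_address=base, height=h, width=w,
                     pad_top=pad, pad_left=pad)

tag = make_tag(0x1000, 64, 64, 1)
```

The point of the sketch is only that no floating-point or vector resources are involved in producing such a descriptor.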
Single instruction, multiple threads (Single Instruction Multiple Threads, SIMT) instructions are a common type of graphics processing unit (Graphics Processing Unit, GPU) programming instruction. SIMT instructions have been widely used for massively parallel computing tasks such as graphics rendering and AI operations. The GPU hardware executes SIMT instructions through the vector core module to perform the corresponding vector operations. Vector operations include, but are not limited to, floating-point operations, fixed-point operations, logical operations, and the like. The pipeline architecture of the vector core module supports massively parallel computing tasks and includes a number of stages, such as instruction fetch, instruction scheduling, decode, operand fetch, execution, result write-back, and the like. The vector core module often also serves as the host (master) module for the memory module and the tensor core module. As the master module, the vector core module generates operation control information and tensor core configuration information for a slave module (e.g., the tensor core module) based on the SIMT instruction.
The problem with a conventional vector core module sending tensor core instruction configuration information is that, due to the multiplexing requirements of SIMT computing resources, even fixed-point instructions suffer relatively long delays under the influence of floating-point instructions, so the conventional vector core module sends tensor core configuration information relatively slowly. Furthermore, tensor core configuration information needs only fixed-point operations or data movement and involves only a single thread, so the resource utilization of a conventional vector core module running tensor core instructions is very low. In addition, vector core instructions and tensor core instructions share the issue unit and subsequent pipeline, thereby impeding the efficiency of vector core instructions, which have higher performance requirements, and resulting in a gap between actual and desired computing power. In the application scenario where the vector core module is a master and the tensor core module is a slave, how to improve the efficiency of the vector core module is one of many technical issues in the field of AI chips.
Drawings
FIG. 1 is a circuit block diagram of a vector core module according to one embodiment;
FIG. 2 is a schematic diagram of a circuit module of a vector core module according to an embodiment of the invention;
FIG. 3 is a flow chart of a method of operation of a vector core module according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a circuit module of an instruction dispatch unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a circuit module of an instruction dispatch unit according to another embodiment of the present invention;
FIG. 6 is a circuit block diagram of a tensor core instruction arithmetic unit according to an embodiment of the present invention.
Description of the reference numerals
11. The 1 st memory module,
21. The 2 nd memory module,
12. The 1 st tensor core module,
22. The 2 nd tensor core module,
100. The 1 st vector core module,
200. The 2 nd vector core module,
110. The 1 st instruction cache,
210. The 2 nd instruction cache,
120. The 1 st instruction fetch unit,
220. The 2 nd instruction fetch unit,
130. The 1 st instruction issue unit,
241. The 2 nd instruction issue unit,
251. The 3 rd instruction issue unit,
140. The 1 st instruction decode unit,
242. The 2 nd instruction decode unit,
252. The 3 rd instruction decode unit,
150. The 1 st vector register set,
260. The 2 nd vector register set,
160. The 1 st scalar register set,
270. The 2 nd scalar register set,
170. The 1 st vector core arithmetic unit,
243. The 2 nd vector core arithmetic unit,
180. The 1 st active thread detection unit,
280. The 2 nd active thread detection unit,
230. The instruction dispatch unit,
231. The instruction classifier,
232_1. The 1 st thread bundle instruction classifier,
232_2. The 2 nd thread bundle instruction classifier,
232_N. The Nth thread bundle instruction classifier,
233. The vector core instruction polling arbiter,
234. The tensor core instruction polling arbiter,
240. The vector core instruction execution pipeline,
250. The tensor core instruction processing pipeline,
253. The tensor core instruction arithmetic unit,
610. The fixed-point multiplier,
620. The 1 st gater,
630. The 2 nd gater,
650. The 3 rd gater,
640. The fixed-point adder,
a6. The first operand,
b6. The second operand,
c6. The third operand.
Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, that connection may be a direct connection, or an indirect connection via other devices and connections. The terms "first", "second", and the like in the description (including the claims) are used for naming components or distinguishing between different embodiments or ranges, and are not used for limiting the maximum or minimum number of components or their order.
Single instruction, multiple threads (SIMT) is a common type of graphics processor programming instruction. SIMT instructions have been widely used for massively parallel computing tasks such as graphics rendering and artificial intelligence operations. SIMT instructions perform parallel operations with a thread bundle (Warp) as the basic unit; one thread bundle contains multiple threads. For example, a SIMT32 instruction is an instruction containing 32 threads, and a single SIMT32 instruction causes the threads in a thread bundle containing 32 threads to perform the same operation in parallel. The GPU or AI chip can thereby process multiple data points or computing tasks simultaneously, increasing computing efficiency.
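The SIMT32 execution model described above can be sketched as a minimal software model (this is an illustration of the semantics, not a representation of the hardware):

```python
# A minimal software model of a SIMT32 instruction: one instruction,
# 32 threads of a thread bundle (warp) performing the same operation
# on per-thread data in lockstep.
WARP_SIZE = 32

def simt_execute(op, operands_a, operands_b):
    # Each lane (thread) applies the same operation to its own data.
    return [op(a, b) for a, b in zip(operands_a, operands_b)]

a = list(range(WARP_SIZE))   # per-thread operand A: 0, 1, ..., 31
b = [2] * WARP_SIZE          # per-thread operand B: the same value in each lane
result = simt_execute(lambda x, y: x + y, a, b)  # one SIMT add over 32 lanes
```

A single call models a single issued instruction; the loop over lanes stands in for the hardware's parallel thread arithmetic units.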
Fig. 1 is a schematic circuit diagram of a 1 st vector core module 100 according to an embodiment. Based on the actual operating scenario, the 1 st vector core module 100 may act as a master and the 1 st tensor core module 12 may act as a slave. The 1 st tensor core module 12 is used to process special operators of AI operations including, but not limited to, tensor data handling, matrix multiply-add, and convolution operations. In addition to operation-related control information, the 1 st tensor core module 12 typically uses a configuration register approach to obtain further tensor data tags, including but not limited to address coordinate information, size information, and boundary zero-padding information for the tensor. The contents of these registers are not statically configured but are dynamically configured by instructions; they need only be generated by fixed-point operations or simple data movement, and these are typically scalar operations.
The 1 st vector core module 100 executes SIMT instructions to perform the corresponding vector operations. Vector operations include, but are not limited to, floating-point operations, fixed-point operations, logical operations, and the like. The pipeline architecture of the 1 st vector core module 100 supports massively parallel computing tasks and includes a number of stages, such as instruction fetch, instruction scheduling, decode, operand fetch, execution, result write-back, and the like. The 1 st vector core module 100 writes the operation result back to the 1 st memory (Memory) module 11. Operations on the 1 st memory module 11 include, but are not limited to, load (Load), store (Store), atomic operations (Atomic), and the like.
In the application scenario where the 1 st vector core module 100 serves as the host module of the 1 st memory module 11 and the 1 st tensor core module 12, the 1 st vector core module 100 generates, based on the SIMT instruction, the operation control information for the slave modules, the data or operands required by the 1 st memory module 11, and the tensor core configuration information required by the 1 st tensor core module 12. The 1 st vector core module 100 transmits tag information of the relevant tensor data (including, but not limited to, address coordinate information of the tensor, size information of the tensor, boundary zero-padding information, and other information) to the 1 st tensor core module 12 as a slave, using the SIMT pipeline architecture and arithmetic logic unit (ALU) resources.
In detail, the 1 st vector core module 100 shown in fig. 1 includes a 1 st instruction cache 110, a 1 st instruction fetch unit 120, a 1 st instruction issue unit 130, a 1 st instruction decode unit 140, a 1 st vector register set 150, a 1 st scalar register set 160, a 1 st vector core arithmetic unit 170, and a 1 st active thread detection unit 180. The 1 st instruction fetch unit 120 reads instructions from the 1 st instruction cache 110 and passes them one by one to the 1 st instruction issue unit 130, where they await issue and execution. The 1 st vector core module 100 does not distinguish instruction types. Whether an instruction processes only vector data, scalar data, memory data, or the like (collectively referred to herein as a "vector core instruction"), or only configures a tensor tag register or controls the 1 st tensor core module 12 (collectively referred to herein as a "tensor core instruction"), it is passed from the 1 st instruction fetch unit 120 to the 1 st instruction issue unit 130 and issued to the 1 st instruction decode unit 140 in sequence according to program order.
After any instruction is issued from the 1 st instruction issue unit 130, the 1 st instruction decode unit 140 decodes the instruction to obtain the vector core operand type and address information. The 1 st instruction decode unit 140 reads data from the 1 st vector register set 150 and provides the data to the 1 st vector core arithmetic unit 170. The operands may come from the 1 st vector register set 150 (independent data owned by each thread) or from the 1 st scalar register set 160 (data shared by all threads). In addition, the 1 st instruction decode unit 140 also obtains vector core operation control information and passes this information to the 1 st vector core arithmetic unit 170 along with the operands.
The 1 st vector core arithmetic unit 170 performs computation upon receiving the operands and the operation type. The 1 st vector core arithmetic unit 170 includes a plurality of thread arithmetic units to support SIMT instructions operating on a plurality of threads. For example, each of N independent thread arithmetic units executes a corresponding thread. Each thread arithmetic unit contains logical computing resources, floating-point computing resources, fixed-point computing resources, and the like. All the thread arithmetic units can execute corresponding operations simultaneously and generate execution results after a fixed delay. After the 1 st vector core arithmetic unit 170 finishes executing a vector core instruction, it may directly write the execution result of each thread arithmetic unit back into the 1 st vector register set 150. In the configuration where the 1 st vector core module 100 is the master and the 1 st memory module 11 is the slave, the 1 st vector core module 100 directly transmits the execution result of each thread to the 1 st memory module 11, and simultaneously transmits control information of the memory operation.
After the 1 st vector core arithmetic unit 170 finishes processing a tensor core instruction, the 1 st active thread detection unit 180 selects the active thread with the smallest number, that is, selects 1 processing result from the plurality of thread arithmetic units of the 1 st vector core arithmetic unit 170, and writes it back into the 1 st scalar register set 160. In the configuration where the 1 st vector core module 100 is the master and the 1 st tensor core module 12 is the slave, the 1 st active thread detection unit 180 selects the active thread with the smallest number, takes its processing result from the plurality of thread arithmetic units of the 1 st vector core arithmetic unit 170, sends that processing result to the 1 st tensor core module 12, and simultaneously sends control information of the tensor core operation. The number of active threads is typically 1. When the 1 st vector core arithmetic unit 170 processes a tensor core instruction, it does so with only 1 thread arithmetic unit, and the rest of its thread arithmetic units are idle.
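The selection of the lowest-numbered active thread can be sketched as follows (the boolean-mask encoding of active threads is an assumption for illustration; the hardware representation is not detailed in the source):

```python
def select_active_thread(active_mask, thread_results):
    # active_mask: one boolean per thread arithmetic unit.
    # Returns the number and result of the active thread with the
    # smallest number; that result is what gets written back to the
    # scalar register set or sent to the tensor core module.
    for thread_id, active in enumerate(active_mask):
        if active:
            return thread_id, thread_results[thread_id]
    return None, None  # no active thread in the bundle

mask = [False, False, True, True]  # threads 2 and 3 are active
tid, value = select_active_thread(mask, [10, 11, 12, 13])
```

Since typically only one thread is active, the scan degenerates to picking that single thread's result.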
The 1 st vector core module 100 shown in fig. 1 transmits tensor core instruction configuration information with the following problems. First, due to the multiplexing requirements of SIMT computing resources, even fixed-point instructions suffer relatively long delays under the influence of floating-point instructions, resulting in a relatively slow rate of sending tensor core configuration information through the entire pipeline. Furthermore, the tensor core configuration information needs only fixed-point operations or data movement and involves only a single thread arithmetic unit of the 1 st vector core arithmetic unit 170, so the resource utilization of the pipeline of the 1 st vector core module 100 running tensor core instructions is very low. In addition, vector core instructions and tensor core instructions share the 1 st instruction issue unit 130 and the subsequent pipeline, thereby impeding the efficiency of vector core instructions, which have higher performance requirements, and resulting in a gap between actual and expected computing power.
Fig. 2 is a circuit block diagram of a 2 nd vector core module 200 according to an embodiment of the present invention. Based on the actual operating scenario, the 2 nd vector core module 200 may act as a master and the 2 nd tensor core module 22 may act as a slave. For the 2 nd vector core module 200, the 2 nd memory module 21, and the 2 nd tensor core module 22 shown in fig. 2, reference may be made to the relevant descriptions of the 1 st vector core module 100, the 1 st memory module 11, and the 1 st tensor core module 12 shown in fig. 1. In the embodiment shown in fig. 2, the 2 nd vector core module 200 includes a 2 nd instruction cache 210, a 2 nd instruction fetch unit 220, an instruction dispatch unit 230, a vector core instruction execution pipeline 240, a tensor core instruction processing pipeline 250, a 2 nd vector register set 260, a 2 nd scalar register set 270, and a 2 nd active thread detection unit 280. The vector core instruction execution pipeline 240 executes vector core instructions and stores the execution results in the 2 nd memory module 21. The 2 nd vector register set 260 and the 2 nd scalar register set 270 are coupled to the vector core instruction execution pipeline 240. The 2 nd scalar register set 270 is also coupled to the tensor core instruction processing pipeline 250.
After the execution of a vector core instruction is completed, the vector core instruction execution pipeline 240 writes back the vector data corresponding to the vector core instruction to the 2 nd vector register set 260. The 2 nd vector register set 260 provides vector data to the vector core instruction execution pipeline 240. The 2 nd active thread detection unit 280 is coupled to the vector core instruction execution pipeline 240, and detects the execution result of the vector core instruction execution pipeline 240 to determine an active thread. The 2 nd active thread detection unit 280 writes back the scalar data corresponding to the active thread to the 2 nd scalar register set 270. The vector core instruction execution pipeline 240 and the tensor core instruction processing pipeline 250 may access the scalar data of the 2 nd scalar register set 270. For the 2 nd vector register set 260, the 2 nd scalar register set 270, and the 2 nd active thread detection unit 280 shown in fig. 2, reference may be made to the 1 st vector register set 150, the 1 st scalar register set 160, and the 1 st active thread detection unit 180 shown in fig. 1, and details are not repeated herein.
In the embodiment shown in FIG. 2, the vector core instruction execution pipeline 240 includes a 2 nd instruction issue unit 241, a 2 nd instruction decode unit 242, and a 2 nd vector core arithmetic unit 243. The 2 nd instruction issue unit 241 is coupled to the instruction dispatch unit 230 to receive vector core instructions, and issues the vector core instructions. The 2 nd instruction decode unit 242 is coupled to the 2 nd instruction issue unit 241 to receive the vector core instructions, and decodes them to generate operands and operation types. The 2 nd vector core arithmetic unit 243 is coupled to the 2 nd instruction decode unit 242 to receive the operands and operation types, and operates on the operands based on the operation type to generate an execution result to the 2 nd memory module 21. The 2 nd vector core arithmetic unit 243 includes a plurality of thread arithmetic units, each of which performs the operation of a corresponding thread; the plurality of thread arithmetic units generate execution results to the 2 nd memory module 21. For the 2 nd instruction issue unit 241, the 2 nd instruction decode unit 242, and the 2 nd vector core arithmetic unit 243 shown in fig. 2, reference may be made to the 1 st instruction issue unit 130, the 1 st instruction decode unit 140, and the 1 st vector core arithmetic unit 170 shown in fig. 1, and details are not repeated herein.
The 2 nd instruction fetch unit 220 is coupled to the 2 nd instruction cache 210 to fetch at least one thread bundle. For the 2 nd instruction cache 210 and the 2 nd instruction fetch unit 220 shown in fig. 2, reference may be made to the related descriptions of the 1 st instruction cache 110 and the 1 st instruction fetch unit 120 shown in fig. 1, and details are not repeated herein. Compared with the 1 st vector core module 100 of fig. 1, the 2 nd vector core module 200 of fig. 2 adds an instruction dispatch unit 230 and a tensor core instruction processing pipeline 250. The 2 nd instruction fetch unit 220 is coupled to the instruction dispatch unit 230 to provide the thread bundle. The newly added instruction dispatch unit 230 distinguishes tensor core instructions from vector core instructions and distributes them to the corresponding pipelines. In embodiments supporting multiple thread bundles, the instruction dispatch unit 230 may simultaneously dispatch a tensor core instruction and a vector core instruction from different thread bundles to the corresponding pipelines. The added tensor core instruction processing pipeline 250 runs only tensor core instructions. The tensor core instruction processing pipeline 250 processes only scalar data, without redundant vector-scalar conversion operations. The processing of tensor core instructions by the tensor core instruction processing pipeline 250 includes only one or more of fixed-point addition, fixed-point multiplication, fixed-point multiply-add, and data handling, so the delay of the tensor core instruction processing pipeline 250 is much less than that of the vector core instruction execution pipeline 240. Compared with the 1 st vector core module 100 of fig. 1, the 2 nd vector core module 200 of fig. 2 sends tensor core configuration information to the 2 nd tensor core module 22 relatively quickly.
FIG. 3 is a flow chart of a method of operation of a vector core module according to an embodiment of the invention. Referring to fig. 2 and 3, in step S310, the instruction dispatch unit 230 classifies the at least one thread bundle to distinguish the vector core instructions and the tensor core instructions in the at least one thread bundle. The instruction dispatch unit 230 is coupled to the vector core instruction execution pipeline 240 to provide vector core instructions, and is coupled to the tensor core instruction processing pipeline 250 to provide tensor core instructions.
In response to the thread bundle including a vector core instruction, the instruction dispatch unit 230 issues the vector core instruction to the vector core instruction execution pipeline 240 (step S320). The vector core instruction execution pipeline 240 executes the vector core instruction and stores the execution result in the 2 nd memory module 21 (step S330). In response to the thread bundle including the tensor core instruction, the instruction dispatch unit 230 sends the tensor core instruction to the tensor core instruction processing pipeline 250 (step S340). The tensor core instruction processing pipeline 250 processes the tensor core instruction and sends the processing result (e.g., tensor core configuration information) to the 2 nd tensor core module 22 (step S350).
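The dispatch flow of steps S310 through S350 can be sketched as follows (the instruction encoding and the list-based pipeline interfaces are hypothetical, chosen only to make the classification step concrete):

```python
def dispatch(thread_bundle, vector_pipeline, tensor_pipeline):
    # Step S310: classify each instruction of the thread bundle.
    for instr in thread_bundle:
        if instr["kind"] == "tensor":
            # Steps S340/S350: tensor core instructions go to the
            # tensor core instruction processing pipeline.
            tensor_pipeline.append(instr)
        else:
            # Steps S320/S330: vector core instructions go to the
            # vector core instruction execution pipeline.
            vector_pipeline.append(instr)

vec, ten = [], []
bundle = [{"kind": "vector", "op": "fadd"},
          {"kind": "tensor", "op": "set_tag"}]
dispatch(bundle, vec, ten)
```

Each pipeline then consumes its own queue independently, which is what decouples the fast scalar tensor-configuration path from the slower vector path.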
In the embodiment shown in FIG. 2, the tensor core instruction processing pipeline 250 includes a 3 rd instruction issue unit 251, a 3 rd instruction decode unit 252, and a tensor core instruction arithmetic unit 253. The 3 rd instruction issue unit 251 is coupled to the instruction dispatch unit 230 to receive tensor core instructions, and issues the tensor core instructions sequentially in program order. The 3 rd instruction decode unit 252 is coupled to the 3 rd instruction issue unit 251 and the 2 nd scalar register set 270. The 3 rd instruction decode unit 252 decodes tensor core instructions to generate operands and operation types. For example, based on the decoding result, the 3 rd instruction decode unit 252 reads operands from the 2 nd scalar register set 270 for the tensor core instruction arithmetic unit 253. The tensor core instruction arithmetic unit 253 is coupled to the 3 rd instruction decode unit 252 to receive the operands and operation types. The tensor core instruction arithmetic unit 253 processes the operands based on the operation type, generates a processing result (e.g., tensor core configuration information) to the 2 nd tensor core module 22, and simultaneously transmits control information of the tensor core operation.
In summary, the 2 nd vector core module 200 adds a tensor core instruction processing pipeline 250 (a separate pipeline for generating and transmitting tensor core configuration information). The tensor core instruction processing pipeline 250 runs only tensor core instructions, while vector core instructions are executed by the vector core instruction execution pipeline 240. In some embodiments, the 3 rd instruction decode unit 252 and the tensor core instruction arithmetic unit 253 process only scalar data, with no vector-scalar conversion operations, so the tensor core instruction processing pipeline 250 has a much smaller delay than the vector core instruction execution pipeline 240 and sends data to the 2 nd tensor core module 22 relatively quickly. In some embodiments, the tensor core instruction arithmetic unit 253 supports only fixed-point addition, fixed-point multiplication, fixed-point multiply-add, and simple data handling, so the delay of the tensor core instruction processing pipeline 250 is much smaller than that of the vector core instruction execution pipeline 240. Compared with the vector core instruction execution pipeline 240, the tensor core instruction processing pipeline 250 sends tensor core configuration information to the 2 nd tensor core module 22 relatively quickly. In the application scenario where the 2 nd vector core module 200 is the master and the 2 nd tensor core module 22 is the slave, the 2 nd vector core module 200 can thus improve efficiency.
Fig. 4 is a circuit block diagram of the instruction dispatch unit 230 according to an embodiment of the present invention. The instruction dispatch unit 230 of FIG. 4 may serve as one of many implementation examples of the instruction dispatch unit 230 of FIG. 2. For the 2 nd instruction fetch unit 220, the instruction dispatch unit 230, the vector core instruction execution pipeline 240, and the tensor core instruction processing pipeline 250 of FIG. 4, reference may be made to the relevant description of FIG. 2, and details are not repeated herein. In the embodiment shown in FIG. 4, the thread bundle provided by the 2 nd instruction fetch unit 220 is a single thread bundle, and the instruction dispatch unit 230 includes an instruction classifier 231. An input of the instruction classifier 231 is coupled to the 2 nd instruction fetch unit 220 to receive the single thread bundle. The instruction classifier 231 is coupled to the vector core instruction execution pipeline 240 and the tensor core instruction processing pipeline 250, and classifies the instructions of the single thread bundle. In response to the single thread bundle including a vector core instruction, the instruction classifier 231 sends the vector core instruction to the vector core instruction execution pipeline 240. In response to the single thread bundle including a tensor core instruction, the instruction classifier 231 sends the tensor core instruction to the tensor core instruction processing pipeline 250.
Fig. 5 is a circuit block diagram of the instruction dispatch unit 230 according to another embodiment of the present invention. The instruction dispatch unit 230 of FIG. 5 may serve as one of many implementation examples of the instruction dispatch unit 230 of FIG. 2. For the 2 nd instruction fetch unit 220, the instruction dispatch unit 230, the vector core instruction execution pipeline 240, and the tensor core instruction processing pipeline 250 of FIG. 5, reference may be made to the relevant description of FIG. 2, and details are not repeated herein. In the embodiment shown in fig. 5, the thread bundles provided by the 2 nd instruction fetch unit 220 are multiple thread bundles, and the instruction dispatch unit 230 includes a plurality of thread bundle instruction classifiers (e.g., the 1 st thread bundle instruction classifier 232_1, the 2 nd thread bundle instruction classifier 232_2, through the Nth thread bundle instruction classifier 232_N shown in fig. 5), a vector core instruction polling arbiter 233, and a tensor core instruction polling arbiter 234. The input of each of the 1 st thread bundle instruction classifier 232_1 through the Nth thread bundle instruction classifier 232_N is coupled to the 2 nd instruction fetch unit 220 to receive a corresponding one of the thread bundles. Each of the 1 st thread bundle instruction classifier 232_1 through the Nth thread bundle instruction classifier 232_N classifies the corresponding thread bundle to distinguish the vector core instructions and the tensor core instructions in the corresponding thread bundle.
The vector core instruction polling arbiter 233 is coupled to the 1 st thread bundle instruction classifier 232_1 through the Nth thread bundle instruction classifier 232_N, and polls them to acquire vector core instructions. The output of the vector core instruction polling arbiter 233 is coupled to the vector core instruction execution pipeline 240 to provide the vector core instructions. The tensor core instruction polling arbiter 234 is coupled to the 1 st thread bundle instruction classifier 232_1 through the Nth thread bundle instruction classifier 232_N, and polls them to obtain tensor core instructions. The output of the tensor core instruction polling arbiter 234 is coupled to the tensor core instruction processing pipeline 250 to provide the tensor core instructions.
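A polling arbiter of the kind described can be sketched as a software model assuming a simple rotating-priority (round-robin) scheme; the source does not specify the exact arbitration policy, so this is one plausible instance:

```python
class RoundRobinArbiter:
    # Polls N thread bundle instruction classifiers in turn, starting
    # just after the last classifier that was granted, so every
    # classifier with a pending instruction is served fairly.
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.last_granted = num_ports - 1  # so polling begins at port 0

    def grant(self, requests):
        # requests: one boolean per classifier output (pending instruction?).
        for offset in range(1, self.num_ports + 1):
            port = (self.last_granted + offset) % self.num_ports
            if requests[port]:
                self.last_granted = port
                return port
        return None  # no classifier has a pending instruction

arb = RoundRobinArbiter(4)
first = arb.grant([True, False, True, False])
second = arb.grant([True, False, True, False])
```

Two such arbiters operate independently, one over the vector-instruction outputs and one over the tensor-instruction outputs, which is how a vector core instruction and a tensor core instruction from different thread bundles can be dispatched in the same cycle.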
Fig. 6 is a schematic circuit diagram of the tensor core instruction arithmetic unit 253 according to an embodiment of the present invention. The tensor core instruction arithmetic unit 253 shown in fig. 6 can serve as one of many embodiments of the tensor core instruction arithmetic unit 253 shown in fig. 2. For the 3 rd instruction decode unit 252, the tensor core instruction arithmetic unit 253, and the 2 nd tensor core module 22 shown in fig. 6, reference may be made to the related description of fig. 2, and details are not repeated herein. In the embodiment shown in fig. 6, the tensor core instruction arithmetic unit 253 includes a fixed-point multiplier 610, a 1 st gater 620, a 2 nd gater 630, a fixed-point adder 640, and a 3 rd gater 650. A first input of the 3 rd gater 650 is coupled to the 3 rd instruction decode unit 252 to receive the first operand a6. The output of the 3 rd gater 650 is coupled to the 2 nd tensor core module 22 to provide the processing result of the tensor core instruction arithmetic unit 253. A first input of the fixed-point multiplier 610 is coupled to the 3 rd instruction decode unit 252 to receive the first operand a6. A second input of the fixed-point multiplier 610 is coupled to the 3 rd instruction decode unit 252 to receive the second operand b6. The output of the fixed-point multiplier 610 is coupled to a second input of the 3 rd gater 650.
A first input of the 1 st gater 620 is coupled to the 3 rd instruction decode unit 252 to receive the first operand a6. A second input of the 1 st gater 620 is coupled to the output of the fixed-point multiplier 610. Based on the control of the 3 rd instruction decode unit 252, the 1 st gater 620 selects one of the first operand a6 and the output of the fixed-point multiplier 610 to pass to the fixed-point adder 640. A first input of the 2 nd gater 630 is coupled to the 3 rd instruction decode unit 252 to receive the second operand b6. A second input of the 2 nd gater 630 is coupled to the 3 rd instruction decode unit 252 to receive the third operand c6. Based on the control of the 3 rd instruction decode unit 252, the 2 nd gater 630 selects one of the second operand b6 and the third operand c6 to pass to the fixed-point adder 640. A first input of the fixed-point adder 640 is coupled to the output of the 1 st gater 620. A second input of the fixed-point adder 640 is coupled to the output of the 2 nd gater 630. The output of the fixed-point adder 640 is coupled to a third input of the 3 rd gater 650.
Based on the control of the 3 rd instruction decode unit 252, the 3 rd gater 650 selects one of the first operand a6, the output of the fixed-point multiplier 610, and the output of the fixed-point adder 640 to transfer to the 2 nd tensor core module 22. For data movement requirements, the 3 rd gater 650 selects to transfer the first operand a6 directly to the 2 nd tensor core module 22 as the processing result (e.g., tensor core configuration information). For multiplication requirements, the 3 rd gater 650 selects to pass the output "a6·b6" of the fixed-point multiplier 610 to the 2 nd tensor core module 22 as the processing result (e.g., tensor core configuration information). For addition requirements, the 1 st gater 620 selects to pass the first operand a6 to the fixed-point adder 640, the 2 nd gater 630 selects to pass the second operand b6 to the fixed-point adder 640, and the 3 rd gater 650 selects to pass the output "a6+b6" of the fixed-point adder 640 to the 2 nd tensor core module 22 as the processing result (e.g., tensor core configuration information). For multiply-add requirements, the 1 st gater 620 selects to pass the output of the fixed-point multiplier 610 to the fixed-point adder 640, the 2 nd gater 630 selects to pass the third operand c6 to the fixed-point adder 640, and the 3 rd gater 650 selects to pass the output "a6·b6+c6" of the fixed-point adder 640 to the 2 nd tensor core module 22 as the processing result (e.g., tensor core configuration information).
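The datapath of fig. 6 can be modeled as follows. The operation-select encoding is an assumption for illustration, but the four cases mirror the data movement, multiplication, addition, and multiply-add requirements described above:

```python
def tensor_core_alu(op, a6, b6, c6=0):
    # Software model of fixed-point multiplier 610, fixed-point adder 640,
    # and the three gaters; the 3 rd gater decides which result is sent
    # to the tensor core module.
    if op == "move":   # 3 rd gater passes a6 through directly
        return a6
    if op == "mul":    # 3 rd gater passes the multiplier output a6*b6
        return a6 * b6
    if op == "add":    # 1 st gater -> a6, 2 nd gater -> b6, adder output
        return a6 + b6
    if op == "madd":   # 1 st gater -> a6*b6, 2 nd gater -> c6, adder output
        return a6 * b6 + c6
    raise ValueError("unsupported operation")

configuration_word = tensor_core_alu("madd", 3, 4, 5)  # models a6*b6 + c6
```

Because every path involves at most one integer multiply and one integer add, such a unit can complete in far fewer cycles than a full SIMT arithmetic unit, which is the latency argument made above.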
It should be noted that the above embodiments are merely illustrative of the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art will understand that the technical solutions described therein may be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solution to depart from the scope of the technical solutions of the embodiments of the present invention.