
CN120469721B - Vector core module of artificial intelligence chip and its operation method - Google Patents

Vector core module of artificial intelligence chip and its operation method

Info

Publication number
CN120469721B
CN120469721B (application CN202510971173.6A)
Authority
CN
China
Prior art keywords
instruction
core
vector
tensor
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510971173.6A
Other languages
Chinese (zh)
Other versions
CN120469721A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202510971173.6A
Publication of CN120469721A
Application granted
Publication of CN120469721B
Legal status: Active

Landscapes

  • Advance Control (AREA)

Abstract


The present invention provides a vector core module of an artificial intelligence chip and an operating method thereof, for improving the efficiency of the vector core module in an application scenario where "the vector core is the master and the tensor core module is the slave." The vector core module includes a vector core instruction execution pipeline, a tensor core instruction processing pipeline, and an instruction scheduling unit. The instruction scheduling unit performs instruction classification to distinguish vector core instructions and tensor core instructions from thread bundles. In response to the thread bundle including a vector core instruction, the instruction scheduling unit sends the vector core instruction to the vector core instruction execution pipeline. The vector core instruction execution pipeline executes the vector core instruction and stores the execution result in a memory module. In response to the thread bundle including a tensor core instruction, the instruction scheduling unit sends the tensor core instruction to the tensor core instruction processing pipeline. The tensor core instruction processing pipeline processes the tensor core instruction and sends the processing result to the tensor core module.

Description

Vector core module of an artificial intelligence chip and operation method thereof
Technical Field
The present invention relates to an artificial intelligence (AI) chip, and more particularly, to a vector core module of an AI chip and a method of operating the same.
Background
Computing devices such as artificial intelligence chips can provide tremendous computing power. The tremendous computing power of AI chips stems from the large number of hardware cores inside. One AI chip typically contains multiple programmable multiprocessor clusters, such as a stream processor cluster (Stream Processor Cluster, SPC). Each programmable multiprocessor cluster typically includes a plurality of programmable multiprocessors, such as a plurality of Compute Units (CUs), and each Compute Unit typically includes a plurality of Execution units (EUs, or Execution cores), such as at least one of an Integer (INT) core module, a Floating Point (FP) core module, a Tensor core (Tcore) module, and a Vector core (Vcore) module. Programmable multiprocessors can support general purpose computing, scientific computing, and neural network computing by programmatically organizing various types of computing units.
The tensor core module is a domain-specific architecture (DSA) that handles certain special operators of AI operations, including but not limited to tensor data handling, matrix multiply-add (MMA), and convolution operations. In addition to operation-related control information, the tensor core module typically obtains further tensor data tags through configuration registers (including, but not limited to, address coordinate information of a tensor, size information of a tensor, and boundary zero-padding information). The contents of these registers are not statically configured but are dynamically configured by instructions; they need only be generated by fixed-point operations or simple data movement, and are typically scalar operations.
Single instruction, multiple threads (SIMT) instructions are a common type of graphics processing unit (GPU) programming instruction. SIMT instructions have been widely used for massively parallel computing tasks such as graphics rendering and AI operations. The GPU hardware executes SIMT instructions through the vector core module to perform the corresponding vector operations. Vector operations include, but are not limited to, floating-point operations, fixed-point operations, and logical operations. The pipeline architecture of the vector core module supports massively parallel computing tasks. The pipeline architecture comprises a number of stages, such as instruction fetch, instruction scheduling, decode, operand fetch, execution, and result write-back. The vector core module is also often used as the host (master) module of the memory module and the tensor core module. The vector core module, as master, generates operation control information and tensor core configuration information for a slave module (e.g., the tensor core module) based on the SIMT instructions.
A conventional vector core module has the following problems when sending tensor core configuration information. First, because SIMT computing resources are multiplexed, even fixed-point instructions suffer relatively long delays under the influence of floating-point instructions, so a conventional vector core module sends tensor core configuration information relatively slowly. Furthermore, tensor core configuration information requires only fixed-point operations or data movement and operates on only a single thread, so the resource utilization of a conventional vector core module running tensor core instructions is very low. In addition, vector core instructions and tensor core instructions share the issue unit and the subsequent pipeline, which impedes the vector core instructions that have higher performance requirements and results in a gap between actual and expected computing power. In the application scenario where the vector core module is the master and the tensor core module is the slave, how to improve the efficiency of the vector core module is one of many technical issues in the field of AI chips.
Disclosure of Invention
The invention provides a vector core module and an operation method thereof, for improving the efficiency of the vector core module in the application scenario where the vector core module is the master and the tensor core module is the slave.
In an embodiment according to the invention, the vector core module comprises a vector core instruction execution pipeline, a tensor core instruction processing pipeline, and an instruction scheduling unit. The vector core instruction execution pipeline executes the vector core instruction and stores the execution result in the memory module. The tensor core instruction processing pipeline processes the tensor core instruction and sends the processing result to the tensor core module. The instruction dispatch unit is coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline. The instruction scheduling unit classifies instructions of at least one thread bundle to distinguish vector core instructions and tensor core instructions from the at least one thread bundle. In response to the at least one thread bundle including a vector core instruction, the instruction dispatch unit sends the vector core instruction to a vector core instruction execution pipeline. In response to the at least one thread bundle including a tensor core instruction, the instruction dispatch unit sends the tensor core instruction to a tensor core instruction processing pipeline.
In an embodiment according to the invention, the operating method comprises: classifying, by an instruction scheduling unit of a vector core module, instructions of at least one thread bundle to distinguish vector core instructions and tensor core instructions from the at least one thread bundle; in response to the at least one thread bundle including a vector core instruction, sending, by the instruction scheduling unit, the vector core instruction to a vector core instruction execution pipeline of the vector core module, and executing, by the vector core instruction execution pipeline, the vector core instruction and storing the execution result in a memory module; and in response to the at least one thread bundle including a tensor core instruction, sending, by the instruction scheduling unit, the tensor core instruction to a tensor core instruction processing pipeline of the vector core module, and processing, by the tensor core instruction processing pipeline, the tensor core instruction and sending the processing result to the tensor core module.
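The classify-then-dispatch flow of the operating method can be sketched in software. This is a minimal illustrative model, not the hardware implementation: the `kind` field and the queue lists are assumptions standing in for the opcode-based classification and the pipeline interfaces that the text leaves unspecified.

```python
from collections import namedtuple

# Hypothetical instruction record: the patent does not give an encoding,
# so an explicit `kind` field stands in for opcode-based classification.
Instruction = namedtuple("Instruction", ["kind", "payload"])

def dispatch(thread_bundle):
    """Route each instruction of a thread bundle to the pipeline that
    handles it: vector core instructions to the execution pipeline,
    tensor core instructions to the processing pipeline."""
    vector_pipeline = []  # models the vector core instruction execution pipeline
    tensor_pipeline = []  # models the tensor core instruction processing pipeline
    for insn in thread_bundle:
        if insn.kind == "vector":
            vector_pipeline.append(insn)
        elif insn.kind == "tensor":
            tensor_pipeline.append(insn)
        else:
            raise ValueError(f"unknown instruction kind: {insn.kind}")
    return vector_pipeline, tensor_pipeline
```

Because the two pipelines receive disjoint instruction streams, tensor core configuration traffic no longer queues behind long-latency vector work, which is the efficiency gain the method claims.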
Based on the above, the vector core module adds a tensor core instruction processing pipeline (an independent pipeline for generating and transmitting tensor core configuration information). The tensor core instruction processing pipeline runs only tensor core instructions, while vector core instructions are executed by the vector core instruction execution pipeline. In some embodiments, the newly added tensor core instruction processing pipeline processes only scalar data, with no vector-to-scalar conversion operations, so its delay is much smaller than that of the vector core instruction execution pipeline, and forwarding to the tensor core module is relatively fast. In some embodiments, the newly added tensor core instruction processing pipeline supports only fixed-point addition, fixed-point multiplication, fixed-point multiply-add, and simple data handling, so its delay is much smaller than that of the vector core instruction execution pipeline. Compared with the vector core instruction execution pipeline, the tensor core instruction processing pipeline sends tensor core configuration information to the tensor core module relatively quickly. In the application scenario where the vector core module is the master and the tensor core module is the slave, the vector core module of the embodiments of the invention can improve efficiency.
Drawings
FIG. 1 is a circuit block diagram of a vector core module according to one embodiment;
FIG. 2 is a schematic diagram of a circuit module of a vector core module according to an embodiment of the invention;
FIG. 3 is a flow chart of a method of operation of a vector core module according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a circuit module of an instruction dispatch unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a circuit module of an instruction dispatch unit according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of a circuit block diagram of a tensor core instruction operation unit according to an embodiment of the present invention.
Description of the reference numerals
11: 1st memory module,
21: 2nd memory module,
12: 1st tensor core module,
22: 2nd tensor core module,
100: 1st vector core module,
200: 2nd vector core module,
110: 1st instruction cache,
210: 2nd instruction cache,
120: 1st instruction fetch unit,
220: 2nd instruction fetch unit,
130: 1st instruction issue unit,
241: 2nd instruction issue unit,
251: 3rd instruction issue unit,
140: 1st instruction decode unit,
242: 2nd instruction decode unit,
252: 3rd instruction decode unit,
150: 1st vector register set,
260: 2nd vector register set,
160: 1st scalar register set,
270: 2nd scalar register set,
170: 1st vector core arithmetic unit,
243: 2nd vector core arithmetic unit,
180: 1st active thread detection unit,
280: 2nd active thread detection unit,
230: instruction dispatch unit,
231: instruction classifier,
232_1: 1st thread bundle instruction classifier,
232_2: 2nd thread bundle instruction classifier,
232_N: Nth thread bundle instruction classifier,
233: vector core instruction round-robin arbiter,
234: tensor core instruction round-robin arbiter,
240: vector core instruction execution pipeline,
250: tensor core instruction processing pipeline,
253: tensor core instruction arithmetic unit,
610: fixed-point multiplier,
620: 1st gate,
630: 2nd gate,
650: 3rd gate,
640: fixed-point adder,
A6: first operand,
B6: second operand,
C6: third operand.
Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, the connection may be a direct connection, or an indirect connection through other devices and connections. The terms "first", "second", and the like in the specification (including the claims) are used to name components or to distinguish between different embodiments or ranges, and are not used to limit an upper or lower bound on the number of components or the order of the components.
Single instruction, multiple threads is a common type of GPU programming instruction. SIMT instructions have been widely used for massively parallel computing tasks such as graphics rendering and artificial intelligence operations. SIMT instructions perform parallel operations with a thread bundle (warp) as the basic unit, one thread bundle containing multiple threads. For example, a SIMT32 instruction is an instruction spanning 32 threads: a single SIMT32 instruction causes the 32 threads of a thread bundle to perform the same operation in parallel. A GPU or AI chip can thereby process multiple data points or computing tasks simultaneously, increasing computing efficiency.
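As a rough software analogy (sequential, where the hardware is parallel), a SIMT32 instruction applies one operation across all 32 threads of a warp. The function name and operand layout below are illustrative only:

```python
def simt_execute(op, per_thread_operands):
    """Apply the same operation to every thread's operands in a warp.
    Hardware does this across 32 lanes in parallel; a loop models the
    per-thread behavior sequentially."""
    return [op(a, b) for a, b in per_thread_operands]

# A SIMT32-style add: 32 threads, each adding its own pair of values.
warp = [(tid, 1) for tid in range(32)]
results = simt_execute(lambda a, b: a + b, warp)
```

One instruction, 32 results: each thread contributes its own operands but all threads run the same operation, which is the SIMT property the paragraph describes.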
Fig. 1 is a schematic circuit diagram of a 1 st vector core module 100 according to an embodiment. Based on the actual operating scenario, the 1 st vector core module 100 may act as a master and the 1 st tensor core module 12 may act as a slave. The 1 st tensor kernel module 12 is used to process some special operators of AI operations including, but not limited to, tensor data handling, matrix multiply add and convolution operations. In addition to operation-related control information, the 1 st tensor core module 12 typically uses a configuration register approach to obtain further tensor data tags, including but not limited to address coordinate information, size information, and boundary zero-padding information for the tensor. The contents of these registers are not statically configured, but are dynamically configured by instructions, but need only be generated by fixed point operations or simple data movement, and are typically scalar operations.
The 1 st vector core module 100 executes SIMT instructions to perform corresponding vector operations. Vector operations include, but are not limited to, floating point operations, fixed point operations, logical operations, and the like. The pipeline architecture of vector core module 100 supports massively parallel computing tasks. Pipeline architectures include a number of processes such as instruction fetching, instruction scheduling, decoding, fetching operands, performing operations, writing back results, and the like. The 1 st vector core module 100 writes the operation result back to the 1 st Memory (Memory) module 11. Operations to the 1 st memory module 11 include, but are not limited to, load (Load), store (Store), atomic operations (Atomic), and the like.
In the application scenario where the 1st vector core module 100 serves as the host module of the 1st memory module 11 and the 1st tensor core module 12, the 1st vector core module 100 generates, based on SIMT instructions, operation control information for the slave modules, the data or operands required by the 1st memory module 11, and the tensor core configuration information required by the 1st tensor core module 12. Using the SIMT pipeline architecture and arithmetic logic unit (ALU) resources, the 1st vector core module 100 transmits tag information of the relevant tensor data (including, but not limited to, address coordinate information of tensors, size information of tensors, and boundary zero-padding information) to the 1st tensor core module 12 acting as a slave.
In detail, the 1st vector core module 100 shown in fig. 1 includes a 1st instruction cache 110, a 1st instruction fetch unit 120, a 1st instruction issue unit 130, a 1st instruction decode unit 140, a 1st vector register set 150, a 1st scalar register set 160, a 1st vector core arithmetic unit 170, and a 1st active thread detection unit 180. The 1st instruction fetch unit 120 reads enough instructions from the 1st instruction cache 110 and passes them one by one to the 1st instruction issue unit 130 to await issue and execution. The 1st vector core module 100 does not distinguish instruction types. Whether an instruction processes only vector data, scalar data, memory data, or the like (collectively referred to herein as a "vector core instruction"), or only configures a tensor tag register or controls the 1st tensor core module 12 (collectively referred to herein as a "tensor core instruction"), it is passed from the 1st instruction fetch unit 120 to the 1st instruction issue unit 130 and issued to the 1st instruction decode unit 140 in sequence according to program order.
After any instruction is issued from the 1st instruction issue unit 130, the 1st instruction decode unit 140 decodes the instruction to obtain the vector core operand type and address information. The 1st instruction decode unit 140 reads data from the 1st vector register set 150 and passes the data to the 1st vector core arithmetic unit 170. Operands may come from the 1st vector register set 150 (independent data owned by each thread) or from the 1st scalar register set 160 (data shared by the threads). In addition, the 1st instruction decode unit 140 also obtains vector core operation control information and passes that information to the 1st vector core arithmetic unit 170 along with the operands.
The 1 st vector core arithmetic unit 170 performs computation upon receiving the operand and the operation type. The 1 st vector core operation unit 170 includes a plurality of thread operation units to support SIMT instructions operating on a plurality of threads. For example, each of the N independent thread arithmetic units executes a corresponding thread. Each thread arithmetic unit contains logical computing resources, floating point computing resources, fixed point computing resources, and the like. All the thread operation units can execute corresponding operations simultaneously and generate execution results after fixed delay. After the 1 st vector core operation unit 170 finishes executing the vector core instruction, the 1 st vector core operation unit 170 may directly write the execution result of each thread operation unit back into the 1 st vector register set 150. Based on the "1 st vector core module 100 as the master, the 1 st memory module 11 as the slave", the 1 st vector core module 100 directly transmits the execution result of each thread to the 1 st memory module 11, and simultaneously transmits control information of the memory operation.
After the 1st vector core arithmetic unit 170 finishes processing a tensor core instruction, the 1st active thread detection unit 180 selects the active thread with the smallest number, that is, it selects one processing result from the multiple thread arithmetic units of the 1st vector core arithmetic unit 170 and writes it back into the 1st scalar register set 160. Based on "the 1st vector core module 100 as the master and the 1st tensor core module 12 as the slave", the 1st active thread detection unit 180 selects the active thread with the smallest number, picks that one processing result from the thread arithmetic units of the 1st vector core arithmetic unit 170, sends it to the 1st tensor core module 12, and simultaneously sends the control information of the tensor core operation. The number of active threads is typically 1. When the 1st vector core arithmetic unit 170 processes a tensor core instruction, it does so with only one thread arithmetic unit, while the remaining thread arithmetic units of the 1st vector core arithmetic unit 170 sit idle.
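The selection of the lowest-numbered active thread can be modeled as a priority encoder over the warp's execution mask. A sketch, with the mask representation assumed:

```python
def select_active_thread(active_mask):
    """Return the index of the lowest-numbered active thread in a warp,
    as the active thread detection unit does when choosing the single
    scalar result to write back or forward to the tensor core module."""
    for tid, active in enumerate(active_mask):
        if active:
            return tid
    return None  # no thread in the warp is active
```

In hardware this is a priority encoder; only the selected lane's result is consumed, which is why the remaining thread arithmetic units sit idle for tensor core instructions.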
The 1st vector core module 100 shown in fig. 1 has the following problems when transmitting tensor core instruction configuration information. First, because SIMT computing resources are multiplexed, even fixed-point instructions suffer relatively long delays under the influence of floating-point instructions, so the whole pipeline sends tensor core configuration information relatively slowly. Furthermore, the tensor core configuration information requires only fixed-point operations or data movement and operates on only a single thread arithmetic unit of the 1st vector core arithmetic unit 170, so the resource utilization of the pipeline of the 1st vector core module 100 when running tensor core instructions is very low. In addition, the vector core instructions and the tensor core instructions share the 1st instruction issue unit 130 and the subsequent pipeline, which impedes the vector core instructions that have higher performance requirements and results in a gap between actual and expected computing power.
Fig. 2 is a circuit block diagram of a 2nd vector core module 200 according to an embodiment of the present invention. Based on the actual operating scenario, the 2nd vector core module 200 may act as a master and the 2nd tensor core module 22 may act as a slave. For the 2nd vector core module 200, the 2nd memory module 21, and the 2nd tensor core module 22 shown in fig. 2, reference may be made to the relevant descriptions of the 1st vector core module 100, the 1st memory module 11, and the 1st tensor core module 12 shown in fig. 1. In the embodiment shown in FIG. 2, the 2nd vector core module 200 includes a 2nd instruction cache 210, a 2nd instruction fetch unit 220, an instruction dispatch unit 230, a vector core instruction execution pipeline 240, a tensor core instruction processing pipeline 250, a 2nd vector register set 260, a 2nd scalar register set 270, and a 2nd active thread detection unit 280. The vector core instruction execution pipeline 240 executes vector core instructions and stores the execution results in the 2nd memory module 21. The 2nd vector register set 260 and the 2nd scalar register set 270 are coupled to the vector core instruction execution pipeline 240. The 2nd scalar register set 270 is also coupled to the tensor core instruction processing pipeline 250.
After the execution of the vector core instruction is completed, the vector core instruction execution pipeline 240 writes back vector data corresponding to the vector core instruction to the 2 nd vector register group 260. Vector 2 nd register set 260 provides vector data to vector core instruction execution pipeline 240. The active thread detection unit 280 is coupled to the vector core instruction execution pipeline 240. The active thread detection unit 280 detects the execution result of the vector core instruction execution pipeline 240 to determine an active thread. The 2 nd active thread detecting unit 280 writes back scalar data corresponding to the active thread to the 2 nd scalar register set 270. Vector core instruction execution pipeline 240 and tensor core instruction processing pipeline 250 may access scalar data for scalar register group 2 270. The 2 nd vector register set 260, the 2 nd scalar register set 270, and the 2 nd effective thread detection unit 280 shown in fig. 2 can refer to the 1 st vector register set 150, the 1 st scalar register set 160, and the 1 st effective thread detection unit 180 shown in fig. 1 and so forth, and thus are not described in detail herein.
In the embodiment shown in FIG. 2, the vector core instruction execution pipeline 240 includes a 2nd instruction issue unit 241, a 2nd instruction decode unit 242, and a 2nd vector core arithmetic unit 243. The 2nd instruction issue unit 241 is coupled to the instruction dispatch unit 230 to receive vector core instructions, and issues the vector core instructions. The 2nd instruction decode unit 242 is coupled to the 2nd instruction issue unit 241 to receive the vector core instructions, and decodes them to generate operands and operation types. The 2nd vector core arithmetic unit 243 is coupled to the 2nd instruction decode unit 242 to receive the operands and operation types, and operates on the operands based on the operation type to generate execution results for the 2nd memory module 21. The 2nd vector core arithmetic unit 243 includes multiple thread arithmetic units, each of which performs the operation of a corresponding thread; the multiple thread arithmetic units generate the execution results for the 2nd memory module 21. For the 2nd instruction issue unit 241, the 2nd instruction decode unit 242, and the 2nd vector core arithmetic unit 243 shown in fig. 2, reference may be made to the descriptions of the 1st instruction issue unit 130, the 1st instruction decode unit 140, and the 1st vector core arithmetic unit 170 shown in fig. 1, and they are therefore not described again herein.
The 2 nd instruction fetch unit 220 is coupled to the 2 nd instruction cache 210 to fetch at least one thread bundle. The 2 nd instruction cache 210 and the 2 nd instruction fetch unit 220 shown in fig. 2 may refer to the related descriptions of the 1 st instruction cache 110 and the 1 st instruction fetch unit 120 shown in fig. 1 and so on, and thus are not described in detail herein. The vector core module 200 of FIG. 2 adds an instruction dispatch unit 230 and a tensor core instruction processing pipeline 250 as compared to the vector core module 100 of FIG. 1. Instruction 2 fetch unit 220 is coupled to instruction dispatch unit 230 to provide a thread bundle. The newly added instruction dispatch unit 230 distinguishes and distributes tensor core instructions and vector core instructions to the corresponding pipeline. In embodiments supporting multiple thread bundles, instruction dispatch unit 230 may dispatch one tensor core instruction and a vector core instruction from each of the different thread bundles to the corresponding pipeline at the same time. The added tensor core instruction processing pipeline 250 will only run tensor core instructions. The tensor core instruction processing pipeline 250 processes scalar data only, without redundant operations of vector and scalar conversion. The processing of the tensor core instructions by the tensor core instruction processing pipeline 250 includes only one or more of fixed point addition, fixed point multiplication, fixed point multiply addition, and data handling, and thus the delay of the tensor core instruction processing pipeline 250 is much less than the delay of the vector core instruction execution pipeline 240. The vector core module 200 of fig. 2 sends tensor core configuration information to the tensor core module 22 of fig. 2 relatively quickly as compared to the vector core module 100 of fig. 1.
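For the multi-thread-bundle embodiments, the reference numerals list round-robin ("polling") arbiters (233, 234) that would pick which warp's instruction each pipeline accepts next, allowing a vector core instruction and a tensor core instruction from different warps to be dispatched in the same cycle. A software sketch under that assumption — the per-warp queue representation is illustrative:

```python
def round_robin_pick(warp_queues, last_served):
    """Pick the next instruction from the first non-empty per-warp queue
    after `last_served`, wrapping around. One such arbiter would sit in
    front of each pipeline, so the two pipelines can each accept an
    instruction from a different warp in the same cycle."""
    n = len(warp_queues)
    for step in range(1, n + 1):
        idx = (last_served + step) % n
        if warp_queues[idx]:
            return idx, warp_queues[idx].pop(0)
    return None, None  # all queues empty this cycle
```

Starting the scan just past the last-served warp gives every warp a fair turn, so no single thread bundle can starve the others of issue slots.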
FIG. 3 is a flow chart of a method of operation of a vector core module according to an embodiment of the invention. Referring to fig. 2 and 3, in step S310, the instruction scheduling unit 230 classifies the at least one thread bundle to distinguish the vector core instruction and the tensor core instruction from the at least one thread bundle. The instruction dispatch unit 230 is coupled to a vector core instruction execution pipeline 240 to provide vector core instructions. The instruction dispatch unit 230 is coupled to the tensor core instruction processing pipeline 250 to provide tensor core instructions.
In response to the thread bundle including a vector core instruction, the instruction dispatch unit 230 issues the vector core instruction to the vector core instruction execution pipeline 240 (step S320). The vector core instruction execution pipeline 240 executes the vector core instruction and stores the execution result in the 2 nd memory module 21 (step S330). In response to the thread bundle including the tensor core instruction, the instruction dispatch unit 230 sends the tensor core instruction to the tensor core instruction processing pipeline 250 (step S340). The tensor core instruction processing pipeline 250 processes the tensor core instruction and sends the processing result (e.g., tensor core configuration information) to the 2 nd tensor core module 22 (step S350).
In the embodiment shown in FIG. 2, the tensor core instruction processing pipeline 250 includes a 3 rd instruction issue unit 251, a 3 rd instruction decode unit 252, and a tensor core instruction arithmetic unit 253. The 3 rd instruction issue unit 251 is coupled to the instruction dispatch unit 230 for receiving tensor core instructions. The 3 rd instruction transmitting unit 251 sequentially transmits tensor core instructions in program order. The 3 rd instruction decode unit 252 is coupled to the 3 rd instruction issue unit 251 and the 2 nd scalar register set 270. The 3 rd instruction decode unit 252 decodes tensor core instructions to generate operands and operation types. For example, based on the decoding result, the 3 rd instruction decoding unit 252 reads the operand from the 2 nd scalar register group 270 to the tensor core instruction operation unit 253. The tensor core instruction operation unit 253 is coupled to the 3 rd instruction decoding unit 252 to receive operands and operation types. The tensor core instruction operation unit 253 processes the operands based on the operation type, generates a processing result (e.g., tensor core configuration information) to the 2 nd tensor core module 22, and simultaneously transmits control information of the tensor core operation.
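Based on the components listed for FIG. 6 (fixed-point multiplier 610, fixed-point adder 640, gates 620/630/650, operands A6/B6/C6), the tensor core instruction arithmetic unit 253 can be modeled as a gated multiply-add datapath. A sketch under that assumption; the operation names are illustrative:

```python
def tensor_core_alu(op, a6, b6=0, c6=0):
    """Model of a gated fixed-point multiply-add datapath: a fixed-point
    multiplier feeding a fixed-point adder, with gates selecting which
    operands reach which stage. Operand names follow A6, B6, C6."""
    if op == "mov":   # simple data handling: pass the operand through
        return a6
    if op == "add":   # gate bypasses the multiplier
        return a6 + b6
    if op == "mul":   # gate bypasses the adder
        return a6 * b6
    if op == "mad":   # multiplier output is routed into the adder
        return a6 * b6 + c6
    raise ValueError(f"unsupported tensor core operation: {op}")
```

Restricting the unit to these few fixed-point operations and data movement is what keeps the tensor core instruction processing pipeline's delay far below that of the full SIMT execution pipeline.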
In summary, the 2nd vector core module 200 adds a tensor core instruction processing pipeline 250, a separate pipeline dedicated to generating and sending tensor core configuration information. The tensor core instruction processing pipeline 250 runs only tensor core instructions, while vector core instructions are executed by the vector core instruction execution pipeline 240. In some embodiments, the 3rd instruction decode unit 252 and the tensor core instruction operation unit 253 process only scalar data and perform no vector-to-scalar conversion, so the latency of the tensor core instruction processing pipeline 250 is much lower than that of the vector core instruction execution pipeline 240. In some embodiments, the tensor core instruction operation unit 253 supports only fixed-point addition, fixed-point multiplication, fixed-point multiply-add, and simple data movement, which likewise keeps its latency well below that of the vector core instruction execution pipeline 240. The tensor core instruction processing pipeline 250 therefore sends tensor core configuration information to the 2nd tensor core module 22 much sooner than the vector core instruction execution pipeline 240 could. In the application scenario where the 2nd vector core module 200 is the master and the 2nd tensor core module 22 is the slave, this improves the efficiency of the 2nd vector core module 200.
Fig. 4 is a circuit block diagram of the instruction dispatch unit 230 according to an embodiment of the present invention. The instruction dispatch unit 230 of Fig. 4 may serve as one of many implementation examples of the instruction dispatch unit 230 of Fig. 2. For the 2nd instruction fetch unit 220, the instruction dispatch unit 230, the vector core instruction execution pipeline 240, and the tensor core instruction processing pipeline 250 of Fig. 4, refer to the relevant description of Fig. 2; they are not described again here. In the embodiment shown in Fig. 4, the warp provided by the 2nd instruction fetch unit 220 is a single warp, and the instruction dispatch unit 230 includes an instruction classifier 231. The input of the instruction classifier 231 is coupled to the 2nd instruction fetch unit 220 to receive the single warp. The instruction classifier 231 is coupled to the vector core instruction execution pipeline 240 and the tensor core instruction processing pipeline 250, and classifies the instructions of the single warp. In response to the single warp including a vector core instruction, the instruction classifier 231 sends the vector core instruction to the vector core instruction execution pipeline 240. In response to the single warp including a tensor core instruction, the instruction classifier 231 sends the tensor core instruction to the tensor core instruction processing pipeline 250.
Fig. 5 is a circuit block diagram of the instruction dispatch unit 230 according to another embodiment of the present invention. The instruction dispatch unit 230 of Fig. 5 may serve as one of many implementation examples of the instruction dispatch unit 230 of Fig. 2. For the 2nd instruction fetch unit 220, the instruction dispatch unit 230, the vector core instruction execution pipeline 240, and the tensor core instruction processing pipeline 250 of Fig. 5, refer to the relevant description of Fig. 2; they are not described again here. In the embodiment shown in Fig. 5, the warps provided by the 2nd instruction fetch unit 220 are multiple warps, and the instruction dispatch unit 230 includes a plurality of warp instruction classifiers (e.g., the 1st warp instruction classifier 232_1, the 2nd warp instruction classifier 232_2, through the n-th warp instruction classifier 232_n shown in Fig. 5), a vector core instruction poll arbiter 233, and a tensor core instruction poll arbiter 234. The input of each of the 1st warp instruction classifier 232_1 through the n-th warp instruction classifier 232_n is coupled to the 2nd instruction fetch unit 220 to receive a corresponding one of the warps. Each of the 1st warp instruction classifier 232_1 through the n-th warp instruction classifier 232_n classifies the instructions of its corresponding warp to distinguish the vector core instructions and the tensor core instructions within it.
The vector core instruction poll arbiter 233 is coupled to the 1st warp instruction classifier 232_1 through the n-th warp instruction classifier 232_n and polls them to obtain vector core instructions. The output of the vector core instruction poll arbiter 233 is coupled to the vector core instruction execution pipeline 240 to provide the vector core instructions. Likewise, the tensor core instruction poll arbiter 234 is coupled to the 1st warp instruction classifier 232_1 through the n-th warp instruction classifier 232_n and polls them to obtain tensor core instructions. The output of the tensor core instruction poll arbiter 234 is coupled to the tensor core instruction processing pipeline 250 to provide the tensor core instructions.
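A behavioral sketch of such a poll arbiter (233 or 234): it scans the per-warp classifier queues in round-robin order and grants the first non-empty one. The queue-based Python API is an assumption made for illustration; the patent does not prescribe the arbitration policy beyond polling.

```python
class RoundRobinArbiter:
    """Sketch of a poll arbiter over per-warp classifier queues."""
    def __init__(self, queues):
        self.queues = queues   # one pending-instruction queue per classifier
        self.next = 0          # classifier index to poll first

    def grant(self):
        n = len(self.queues)
        for i in range(n):
            idx = (self.next + i) % n
            if self.queues[idx]:
                self.next = (idx + 1) % n  # fairness: restart after the winner
                return idx, self.queues[idx].pop(0)
        return None                        # no classifier has a pending instruction

arb = RoundRobinArbiter([["a"], [], ["b", "c"]])
```

Restarting the scan just past the last winner keeps any single busy warp from starving the others, which matters because both arbiters feed shared pipelines.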
Fig. 6 is a schematic circuit diagram of the tensor core instruction operation unit 253 according to an embodiment of the present invention. The tensor core instruction operation unit 253 of Fig. 6 may serve as one of many implementation examples of the tensor core instruction operation unit 253 of Fig. 2. For the 3rd instruction decode unit 252, the tensor core instruction operation unit 253, and the 2nd tensor core module 22 of Fig. 6, refer to the relevant description of Fig. 2; they are not described again here. In the embodiment shown in Fig. 6, the tensor core instruction operation unit 253 includes a fixed-point multiplier 610, a 1st gate 620, a 2nd gate 630, a fixed-point adder 640, and a 3rd gate 650. A first input of the 3rd gate 650 is coupled to the 3rd instruction decode unit 252 to receive the first operand a6. The output of the 3rd gate 650 is coupled to the 2nd tensor core module 22 to provide the processing result of the tensor core instruction operation unit 253. A first input of the fixed-point multiplier 610 is coupled to the 3rd instruction decode unit 252 to receive the first operand a6, and a second input of the fixed-point multiplier 610 is coupled to the 3rd instruction decode unit 252 to receive the second operand b6. The output of the fixed-point multiplier 610 is coupled to a second input of the 3rd gate 650.
A first input of the 1st gate 620 is coupled to the 3rd instruction decode unit 252 to receive the first operand a6, and a second input of the 1st gate 620 is coupled to the output of the fixed-point multiplier 610. Based on the control of the 3rd instruction decode unit 252, the 1st gate 620 selects one of the first operand a6 and the output of the fixed-point multiplier 610 to pass to the fixed-point adder 640. A first input of the 2nd gate 630 is coupled to the 3rd instruction decode unit 252 to receive the second operand b6, and a second input of the 2nd gate 630 is coupled to the 3rd instruction decode unit 252 to receive the third operand c6. Based on the control of the 3rd instruction decode unit 252, the 2nd gate 630 selects one of the second operand b6 and the third operand c6 to pass to the fixed-point adder 640. A first input of the fixed-point adder 640 is coupled to the output of the 1st gate 620, and a second input of the fixed-point adder 640 is coupled to the output of the 2nd gate 630. The output of the fixed-point adder 640 is coupled to a third input of the 3rd gate 650.
Based on the control of the 3rd instruction decode unit 252, the 3rd gate 650 selects one of the first operand a6, the output of the fixed-point multiplier 610, and the output of the fixed-point adder 640 to pass to the 2nd tensor core module 22. For a data movement requirement, the 3rd gate 650 passes the first operand a6 directly to the 2nd tensor core module 22 as the processing result (e.g., tensor core configuration information). For a multiplication requirement, the 3rd gate 650 passes the output "a6·b6" of the fixed-point multiplier 610 to the 2nd tensor core module 22 as the processing result. For an addition requirement, the 1st gate 620 passes the first operand a6 to the fixed-point adder 640, the 2nd gate 630 passes the second operand b6 to the fixed-point adder 640, and the 3rd gate 650 passes the output "a6+b6" of the fixed-point adder 640 to the 2nd tensor core module 22 as the processing result. For a multiply-add requirement, the 1st gate 620 passes the output of the fixed-point multiplier 610 to the fixed-point adder 640, the 2nd gate 630 passes the third operand c6 to the fixed-point adder 640, and the 3rd gate 650 passes the output "a6·b6+c6" of the fixed-point adder 640 to the 2nd tensor core module 22 as the processing result (e.g., tensor core configuration information).
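The four gate configurations just described reduce to the following behavioral model of Fig. 6. The operation-type names ("move", "mul", "add", "madd") are labels invented here for the four requirement cases; the real unit selects them via the gate control signals from the 3rd instruction decode unit 252.

```python
def tensor_core_alu(op, a6, b6=0, c6=0):
    """Behavioral model of the Fig. 6 datapath (fixed-point/integer only)."""
    mul = a6 * b6                 # fixed-point multiplier 610 (always computes)
    if op == "move":              # 3rd gate 650 passes a6 straight through
        return a6
    if op == "mul":               # 3rd gate 650 selects the multiplier output
        return mul
    if op == "add":               # 1st gate 620 -> a6, 2nd gate 630 -> b6
        return a6 + b6            # fixed-point adder 640
    if op == "madd":              # 1st gate 620 -> mul, 2nd gate 630 -> c6
        return mul + c6           # a6*b6 + c6
    raise ValueError(f"unsupported operation type: {op}")
```

One multiplier, one adder, and three selectors cover all four requirements, which is why the unit stays small and low-latency compared with the full vector datapath.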
It should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in the above embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solution to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (20)

1. A vector core module, characterized in that the vector core module comprises: a vector core instruction execution pipeline, which executes a vector core instruction and stores an execution result in a memory module; a tensor core instruction processing pipeline, which processes a tensor core instruction and sends a processing result to a tensor core module, wherein the processing result is tensor core configuration information, the vector core module is a master, and the tensor core module is a slave; and an instruction dispatch unit, coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline, wherein the instruction dispatch unit performs instruction classification on at least one warp to distinguish the vector core instruction and the tensor core instruction from the at least one warp; in response to the at least one warp including the vector core instruction, the instruction dispatch unit sends the vector core instruction to the vector core instruction execution pipeline; and in response to the at least one warp including the tensor core instruction, the instruction dispatch unit sends the tensor core instruction to the tensor core instruction processing pipeline.

2. The vector core module according to claim 1, wherein the vector core instruction execution pipeline comprises: a first instruction issue unit, coupled to the instruction dispatch unit to receive the vector core instruction, wherein the first instruction issue unit issues the vector core instruction; a first instruction decode unit, coupled to the first instruction issue unit to receive the vector core instruction, wherein the first instruction decode unit decodes the vector core instruction to generate operands and an operation type; and a vector core operation unit, coupled to the first instruction decode unit to receive the operands and the operation type, wherein the vector core operation unit operates on the operands based on the operation type to generate the execution result to the memory module.

3. The vector core module according to claim 2, wherein the vector core operation unit comprises: a plurality of thread operation units, wherein each of the plurality of thread operation units executes the operation of a corresponding thread, and the plurality of thread operation units generate the execution result to the memory module.

4. The vector core module according to claim 1, wherein the tensor core instruction processing pipeline processes only scalar data.

5. The vector core module according to claim 1, wherein the processing of the tensor core instruction by the tensor core instruction processing pipeline includes only one or more of data movement, fixed-point addition, fixed-point multiplication, and fixed-point multiply-add.

6. The vector core module according to claim 1, wherein the tensor core instruction processing pipeline comprises: a second instruction issue unit, coupled to the instruction dispatch unit to receive the tensor core instruction, wherein the second instruction issue unit issues the tensor core instruction; a second instruction decode unit, coupled to the second instruction issue unit, wherein the second instruction decode unit decodes the tensor core instruction to generate operands and an operation type; and a tensor core instruction operation unit, coupled to the second instruction decode unit to receive the operands and the operation type, wherein the tensor core instruction operation unit processes the operands based on the operation type to generate the processing result to the tensor core module.

7. The vector core module according to claim 6, wherein the tensor core instruction operation unit comprises: a first gate, wherein a first input of the first gate is coupled to the second instruction decode unit to receive a first operand of the operands, and an output of the first gate is coupled to the tensor core module to provide the processing result; a fixed-point multiplier, wherein a first input of the fixed-point multiplier is coupled to the second instruction decode unit to receive the first operand, a second input of the fixed-point multiplier is coupled to the second instruction decode unit to receive a second operand of the operands, and an output of the fixed-point multiplier is coupled to a second input of the first gate; a second gate, wherein a first input of the second gate is coupled to the second instruction decode unit to receive the first operand, and a second input of the second gate is coupled to the output of the fixed-point multiplier; a third gate, wherein a first input of the third gate is coupled to the second instruction decode unit to receive the second operand, and a second input of the third gate is coupled to the second instruction decode unit to receive a third operand of the operands; and a fixed-point adder, wherein a first input of the fixed-point adder is coupled to an output of the second gate, a second input of the fixed-point adder is coupled to an output of the third gate, and an output of the fixed-point adder is coupled to a third input of the first gate.

8. The vector core module according to claim 1, wherein the vector core module further comprises: an instruction cache; and an instruction fetch unit, coupled to the instruction cache to fetch the at least one warp, wherein the instruction fetch unit is coupled to the instruction dispatch unit to provide the at least one warp.

9. The vector core module according to claim 8, wherein the at least one warp is a single warp, and the instruction dispatch unit comprises: an instruction classifier, wherein an input of the instruction classifier is coupled to the instruction fetch unit to receive the single warp, the instruction classifier is coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline, and the instruction classifier performs instruction classification on the single warp; in response to the single warp including the vector core instruction, the instruction classifier sends the vector core instruction to the vector core instruction execution pipeline; and in response to the single warp including the tensor core instruction, the instruction classifier sends the tensor core instruction to the tensor core instruction processing pipeline.

10. The vector core module according to claim 8, wherein the at least one warp is a plurality of warps, and the instruction dispatch unit comprises: a plurality of warp instruction classifiers, wherein an input of each of the plurality of warp instruction classifiers is coupled to the instruction fetch unit to receive a corresponding warp of the plurality of warps, and each of the plurality of warp instruction classifiers performs instruction classification on the corresponding warp to distinguish the vector core instruction and the tensor core instruction from the corresponding warp; a vector core instruction poll arbiter, coupled to the plurality of warp instruction classifiers, wherein the vector core instruction poll arbiter polls the plurality of warp instruction classifiers to obtain the vector core instruction, and an output of the vector core instruction poll arbiter is coupled to the vector core instruction execution pipeline to provide the vector core instruction; and a tensor core instruction poll arbiter, coupled to the plurality of warp instruction classifiers, wherein the tensor core instruction poll arbiter polls the plurality of warp instruction classifiers to obtain the tensor core instruction, and an output of the tensor core instruction poll arbiter is coupled to the tensor core instruction processing pipeline to provide the tensor core instruction.

11. The vector core module according to claim 1, wherein the vector core module further comprises: a vector register set, coupled to the vector core instruction execution pipeline to provide vector data, wherein the vector core instruction execution pipeline writes the vector data corresponding to the vector core instruction back to the vector register set after execution of the vector core instruction is completed; a scalar register set, coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline to provide scalar data; and an active thread detection unit, coupled to the vector core instruction execution pipeline, wherein the active thread detection unit examines the execution result to determine active threads, and the active thread detection unit writes the scalar data corresponding to the active threads back to the scalar register set.

12. A method for operating a vector core module, characterized in that the method comprises: performing, by an instruction dispatch unit of the vector core module, instruction classification on at least one warp to distinguish a vector core instruction and a tensor core instruction from the at least one warp, wherein the instruction dispatch unit is coupled to a vector core instruction execution pipeline and a tensor core instruction processing pipeline; in response to the at least one warp including the vector core instruction, sending, by the instruction dispatch unit, the vector core instruction to the vector core instruction execution pipeline of the vector core module; executing, by the vector core instruction execution pipeline, the vector core instruction and storing an execution result in a memory module; in response to the at least one warp including the tensor core instruction, sending, by the instruction dispatch unit, the tensor core instruction to the tensor core instruction processing pipeline of the vector core module; and processing, by the tensor core instruction processing pipeline, the tensor core instruction and sending a processing result to a tensor core module, wherein the processing result is tensor core configuration information, the vector core module is a master, and the tensor core module is a slave.

13. The method according to claim 12, further comprising: issuing the vector core instruction by a first instruction issue unit of the vector core instruction execution pipeline, wherein the first instruction issue unit is coupled to the instruction dispatch unit to receive the vector core instruction; decoding the vector core instruction by a first instruction decode unit of the vector core instruction execution pipeline to generate operands and an operation type, wherein the first instruction decode unit is coupled to the first instruction issue unit to receive the vector core instruction; and operating on the operands based on the operation type by a vector core operation unit of the vector core instruction execution pipeline to generate the execution result to the memory module, wherein the vector core operation unit is coupled to the first instruction decode unit to receive the operands and the operation type.

14. The method according to claim 13, further comprising: executing, by each of a plurality of thread operation units of the vector core operation unit, the operation of a corresponding thread, wherein the plurality of thread operation units generate the execution result to the memory module.

15. The method according to claim 12, wherein the tensor core instruction processing pipeline processes only scalar data.

16. The method according to claim 12, wherein the processing of the tensor core instruction by the tensor core instruction processing pipeline includes only one or more of data movement, fixed-point addition, fixed-point multiplication, and fixed-point multiply-add.

17. The method according to claim 12, further comprising: issuing the tensor core instruction by a second instruction issue unit of the tensor core instruction processing pipeline, wherein the second instruction issue unit is coupled to the instruction dispatch unit to receive the tensor core instruction; decoding the tensor core instruction by a second instruction decode unit of the tensor core instruction processing pipeline to generate operands and an operation type, wherein the second instruction decode unit is coupled to the second instruction issue unit; and processing the operands based on the operation type by a tensor core instruction operation unit of the tensor core instruction processing pipeline to generate the processing result to the tensor core module, wherein the tensor core instruction operation unit is coupled to the second instruction decode unit to receive the operands and the operation type.

18. The method according to claim 12, wherein the at least one warp is a single warp, the method further comprising: fetching, by an instruction fetch unit of the vector core module, the at least one warp from an instruction cache of the vector core module, wherein the instruction fetch unit is coupled to the instruction dispatch unit to provide the at least one warp; performing, by an instruction classifier of the instruction dispatch unit, instruction classification on the single warp, wherein an input of the instruction classifier is coupled to the instruction fetch unit to receive the single warp, and the instruction classifier is coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline; in response to the single warp including the vector core instruction, sending the vector core instruction to the vector core instruction execution pipeline by the instruction classifier; and in response to the single warp including the tensor core instruction, sending the tensor core instruction to the tensor core instruction processing pipeline by the instruction classifier.

19. The method according to claim 12, wherein the at least one warp is a plurality of warps, the method further comprising: fetching, by an instruction fetch unit of the vector core module, the at least one warp from an instruction cache of the vector core module, wherein the instruction fetch unit is coupled to the instruction dispatch unit to provide the at least one warp; receiving, by each of a plurality of warp instruction classifiers of the instruction dispatch unit, a corresponding warp of the plurality of warps; performing, by each of the plurality of warp instruction classifiers, instruction classification on the corresponding warp to distinguish the vector core instruction and the tensor core instruction from the corresponding warp; polling, by a vector core instruction poll arbiter of the instruction dispatch unit, the plurality of warp instruction classifiers to obtain the vector core instruction, wherein an output of the vector core instruction poll arbiter is coupled to the vector core instruction execution pipeline to provide the vector core instruction; and polling, by a tensor core instruction poll arbiter of the instruction dispatch unit, the plurality of warp instruction classifiers to obtain the tensor core instruction, wherein an output of the tensor core instruction poll arbiter is coupled to the tensor core instruction processing pipeline to provide the tensor core instruction.

20. The method according to claim 12, further comprising: writing, by the vector core instruction execution pipeline, the vector data corresponding to the vector core instruction back to a vector register set of the vector core module after execution of the vector core instruction is completed; providing the vector data to the vector core instruction execution pipeline by the vector register set; examining the execution result by an active thread detection unit of the vector core module to determine active threads; writing, by the active thread detection unit, the scalar data corresponding to the active threads back to a scalar register set of the vector core module; and providing the scalar data to the vector core instruction execution pipeline and the tensor core instruction processing pipeline by the scalar register set.
CN202510971173.6A 2025-07-15 2025-07-15 Vector core module of artificial intelligence chip and its operation method Active CN120469721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510971173.6A CN120469721B (en) 2025-07-15 2025-07-15 Vector core module of artificial intelligence chip and its operation method

Publications (2)

Publication Number Publication Date
CN120469721A CN120469721A (en) 2025-08-12
CN120469721B (en) 2025-09-30

Family

ID=96635416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510971173.6A Active CN120469721B (en) 2025-07-15 2025-07-15 Vector core module of artificial intelligence chip and its operation method

Country Status (1)

Country Link
CN (1) CN120469721B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120655494B (en) * 2025-08-19 2025-11-25 上海壁仞科技股份有限公司 Artificial intelligence chip and operation method thereof

Citations (1)

Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
KR102894763B1 (en) * 2019-03-15 2025-12-03 인텔 코포레이션 Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
KR20220078819A (en) * 2020-12-04 2022-06-13 삼성전자주식회사 Method and apparatus for performing deep learning operations
CN114625421A (en) * 2020-12-11 2022-06-14 上海阵量智能科技有限公司 SIMT instruction processing method and device
US12299766B2 (en) * 2021-09-24 2025-05-13 Intel Corporation Providing native support for generic pointers in a graphics processing unit
CN120234045B (en) * 2025-05-29 2025-08-08 上海壁仞科技股份有限公司 Vector computing device

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor

Also Published As

Publication number Publication date
CN120469721A (en) 2025-08-12

Similar Documents

Publication Publication Date Title
TWI628594B (en) User level forks and rendezvous processors, methods, systems, and instructions
CN111310910B (en) Computing device and method
US11360809B2 (en) Multithreaded processor core with hardware-assisted task scheduling
US20080155197A1 (en) Locality optimization in multiprocessor systems
CN114968358B (en) Device and method for configuring cooperative thread warps in vector computing system
JP6502616B2 (en) Processor for batch thread processing, code generator and batch thread processing method
US20080046689A1 (en) Method and apparatus for cooperative multithreading
CN120469721B (en) Vector core module of artificial intelligence chip and its operation method
US9870267B2 (en) Virtual vector processing
US11875425B2 (en) Implementing heterogeneous wavefronts on a graphics processing unit (GPU)
JP2023509813A (en) SIMT command processing method and device
CN120655494B (en) Artificial intelligence chip and operation method thereof
CN117501254A (en) Providing atomicity for complex operations using near-memory computation
US11416261B2 (en) Group load register of a graph streaming processor
US10133578B2 (en) System and method for an asynchronous processor with heterogeneous processors
CN120764601A (en) SIMT-based neural network processor and its task execution method
CN118747088A (en) A signal processing method, device, equipment and medium for multi-threaded instruction emission
CN118747084A (en) Instruction processing method, device and storage medium based on multi-core processor
US11847462B2 (en) Software-based instruction scoreboard for arithmetic logic units
CN109800064B (en) Processor and thread processing method
Sahar et al. An Interactive System Based on First-Class User-Level Threads: A Systematic Review
US12436808B2 (en) CPU tight-coupled accelerator
CN111752614A (en) A processor, instruction execution device and method
JP7004905B2 (en) Arithmetic processing unit and control method of arithmetic processing unit
Feng et al. Programmable Architecture for Thread Level Parallel Computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant