
CN120469721B - Vector core module of artificial intelligence chip and its operation method - Google Patents

Vector core module of artificial intelligence chip and its operation method

Info

Publication number
CN120469721B
CN120469721B (application CN202510971173.6A)
Authority
CN
China
Prior art keywords
instruction
core
vector
tensor
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510971173.6A
Other languages
Chinese (zh)
Other versions
CN120469721A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202510971173.6A
Publication of CN120469721A
Application granted
Publication of CN120469721B
Legal status: Active

Landscapes

  • Advance Control (AREA)

Abstract


The present invention provides a vector core module of an artificial intelligence chip and an operating method thereof, for improving the efficiency of the vector core module in an application scenario where "the vector core is the master and the tensor core module is the slave." The vector core module includes a vector core instruction execution pipeline, a tensor core instruction processing pipeline, and an instruction scheduling unit. The instruction scheduling unit performs instruction classification to distinguish vector core instructions and tensor core instructions from thread bundles. In response to the thread bundle including a vector core instruction, the instruction scheduling unit sends the vector core instruction to the vector core instruction execution pipeline. The vector core instruction execution pipeline executes the vector core instruction and stores the execution result in a memory module. In response to the thread bundle including a tensor core instruction, the instruction scheduling unit sends the tensor core instruction to the tensor core instruction processing pipeline. The tensor core instruction processing pipeline processes the tensor core instruction and sends the processing result to the tensor core module.

Description

Vector core module of an artificial intelligence chip and operation method thereof
Technical Field
The present invention relates to an artificial intelligence (AI) chip, and more particularly, to a vector core module of an AI chip and a method of operating the same.
Background
Computing devices such as artificial intelligence chips can provide tremendous computing power. The tremendous computing power of AI chips stems from the large number of hardware cores inside. One AI chip typically contains multiple programmable multiprocessor clusters, such as a stream processor cluster (Stream Processor Cluster, SPC). Each programmable multiprocessor cluster typically includes a plurality of programmable multiprocessors, such as a plurality of Compute Units (CUs), and each Compute Unit typically includes a plurality of Execution units (EUs, or Execution cores), such as at least one of an Integer (INT) core module, a Floating Point (FP) core module, a Tensor core (Tcore) module, and a Vector core (Vcore) module. Programmable multiprocessors can support general purpose computing, scientific computing, and neural network computing by programmatically organizing various types of computing units.
The tensor core module is a domain-specific architecture (DSA) that handles certain special operators of AI operations, including but not limited to tensor data handling, matrix multiply-add (MMA), and convolution operations. In addition to operation-related control information, the tensor core module typically obtains further tensor data tags through configuration registers (including, but not limited to, address coordinate information of a tensor, size information of a tensor, and boundary zero-padding information). The contents of these registers are not statically configured but are dynamically configured by instructions; they need only be generated by fixed-point operations or simple data movement, and are typically scalar operations.
Single instruction, multiple threads (SIMT) instructions are a common type of graphics processing unit (GPU) programming instruction. SIMT instructions have been widely used for massively parallel computing tasks such as graphics rendering and AI operations. The GPU hardware executes SIMT instructions through the vector core module to perform the corresponding vector operations. Vector operations include, but are not limited to, floating-point operations, fixed-point operations, and logical operations. The pipeline architecture of the vector core module supports massively parallel computing tasks. The pipeline architecture comprises a number of stages, such as instruction fetch, instruction scheduling, decode, operand fetch, execution, and result write-back. The vector core module is also often used as the host (master) module of the memory module and the tensor core module. The vector core module, as master, generates operation control information and tensor core configuration information for a slave module (e.g., the tensor core module) based on the SIMT instructions.
A conventional vector core module has the following problems when sending tensor core configuration information. First, because SIMT computing resources are multiplexed, even fixed-point instructions suffer relatively long delays under the influence of floating-point instructions, so a conventional vector core module sends tensor core configuration information relatively slowly. Furthermore, tensor core configuration information requires only fixed-point operations or data movement and operates on only a single thread, so the resource utilization of a conventional vector core module running tensor core instructions is very low. In addition, vector core instructions and tensor core instructions share the issue unit and the subsequent pipeline, which impedes the vector core instructions that have higher performance requirements and results in a gap between actual and expected computing power. In the application scenario where the vector core module is the master and the tensor core module is the slave, how to improve the efficiency of the vector core module is one of many technical issues in the field of AI chips.
Disclosure of Invention
The invention provides a vector core module and an operation method thereof, for improving the efficiency of the vector core module in the application scenario where the vector core module is the master and the tensor core module is the slave.
In an embodiment according to the invention, the vector core module comprises a vector core instruction execution pipeline, a tensor core instruction processing pipeline, and an instruction scheduling unit. The vector core instruction execution pipeline executes the vector core instruction and stores the execution result in the memory module. The tensor core instruction processing pipeline processes the tensor core instruction and sends the processing result to the tensor core module. The instruction dispatch unit is coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline. The instruction scheduling unit classifies instructions of at least one thread bundle to distinguish vector core instructions and tensor core instructions from the at least one thread bundle. In response to the at least one thread bundle including a vector core instruction, the instruction dispatch unit sends the vector core instruction to a vector core instruction execution pipeline. In response to the at least one thread bundle including a tensor core instruction, the instruction dispatch unit sends the tensor core instruction to a tensor core instruction processing pipeline.
In an embodiment according to the invention, the operating method comprises: classifying, by an instruction scheduling unit of a vector core module, instructions of at least one thread bundle to distinguish vector core instructions and tensor core instructions from the at least one thread bundle; in response to the at least one thread bundle including a vector core instruction, sending, by the instruction scheduling unit, the vector core instruction to a vector core instruction execution pipeline of the vector core module, and executing, by the vector core instruction execution pipeline, the vector core instruction and storing the execution result in a memory module; and in response to the at least one thread bundle including a tensor core instruction, sending, by the instruction scheduling unit, the tensor core instruction to a tensor core instruction processing pipeline of the vector core module, and processing, by the tensor core instruction processing pipeline, the tensor core instruction and sending the processing result to the tensor core module.
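The classify-then-dispatch flow of the operating method can be sketched in software. This is a minimal illustrative model, not the hardware implementation: the `kind` field and the queue lists are assumptions standing in for the opcode-based classification and the pipeline interfaces that the text leaves unspecified.

```python
from collections import namedtuple

# Hypothetical instruction record: the patent does not give an encoding,
# so an explicit `kind` field stands in for opcode-based classification.
Instruction = namedtuple("Instruction", ["kind", "payload"])

def dispatch(thread_bundle):
    """Route each instruction of a thread bundle to the pipeline that
    handles it: vector core instructions to the execution pipeline,
    tensor core instructions to the processing pipeline."""
    vector_pipeline = []  # models the vector core instruction execution pipeline
    tensor_pipeline = []  # models the tensor core instruction processing pipeline
    for insn in thread_bundle:
        if insn.kind == "vector":
            vector_pipeline.append(insn)
        elif insn.kind == "tensor":
            tensor_pipeline.append(insn)
        else:
            raise ValueError(f"unknown instruction kind: {insn.kind}")
    return vector_pipeline, tensor_pipeline
```

Because the two pipelines receive disjoint instruction streams, tensor core configuration traffic no longer queues behind long-latency vector work, which is the efficiency gain the method claims.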
Based on the above, the vector core module adds a tensor core instruction processing pipeline (an independent pipeline for generating and transmitting tensor core configuration information). The tensor core instruction processing pipeline runs only tensor core instructions, while vector core instructions are executed by the vector core instruction execution pipeline. In some embodiments, the newly added tensor core instruction processing pipeline processes only scalar data, with no vector-to-scalar conversion operations, so its delay is much smaller than that of the vector core instruction execution pipeline, and forwarding to the tensor core module is relatively fast. In some embodiments, the newly added tensor core instruction processing pipeline supports only fixed-point addition, fixed-point multiplication, fixed-point multiply-add, and simple data handling, so its delay is much smaller than that of the vector core instruction execution pipeline. Compared with the vector core instruction execution pipeline, the tensor core instruction processing pipeline sends tensor core configuration information to the tensor core module relatively quickly. In the application scenario where the vector core module is the master and the tensor core module is the slave, the vector core module of the embodiments of the invention can improve efficiency.
Drawings
FIG. 1 is a circuit block diagram of a vector core module according to one embodiment;
FIG. 2 is a schematic diagram of a circuit module of a vector core module according to an embodiment of the invention;
FIG. 3 is a flow chart of a method of operation of a vector core module according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a circuit module of an instruction dispatch unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a circuit module of an instruction dispatch unit according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of a circuit block diagram of a tensor core instruction operation unit according to an embodiment of the present invention.
Description of the reference numerals
11: 1st memory module,
21: 2nd memory module,
12: 1st tensor core module,
22: 2nd tensor core module,
100: 1st vector core module,
200: 2nd vector core module,
110: 1st instruction cache,
210: 2nd instruction cache,
120: 1st instruction fetch unit,
220: 2nd instruction fetch unit,
130: 1st instruction issue unit,
241: 2nd instruction issue unit,
251: 3rd instruction issue unit,
140: 1st instruction decode unit,
242: 2nd instruction decode unit,
252: 3rd instruction decode unit,
150: 1st vector register set,
260: 2nd vector register set,
160: 1st scalar register set,
270: 2nd scalar register set,
170: 1st vector core arithmetic unit,
243: 2nd vector core arithmetic unit,
180: 1st active thread detection unit,
280: 2nd active thread detection unit,
230: instruction dispatch unit,
231: instruction classifier,
232_1: 1st thread bundle instruction classifier,
232_2: 2nd thread bundle instruction classifier,
232_N: Nth thread bundle instruction classifier,
233: vector core instruction round-robin arbiter,
234: tensor core instruction round-robin arbiter,
240: vector core instruction execution pipeline,
250: tensor core instruction processing pipeline,
253: tensor core instruction arithmetic unit,
610: fixed-point multiplier,
620: 1st gate,
630: 2nd gate,
650: 3rd gate,
640: fixed-point adder,
A6: first operand,
B6: second operand,
C6: third operand.
Detailed Description
Reference will now be made in detail to the exemplary embodiments of the present invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
The term "coupled" as used throughout this specification (including the claims) may refer to any direct or indirect connection. For example, if a first device is coupled (or connected) to a second device, the connection may be a direct connection, or an indirect connection through other devices and connections. The terms "first", "second", and the like in the specification (including the claims) are used to name components or to distinguish between different embodiments or ranges, and are not used to limit an upper or lower bound on the number of components or the order of the components.
Single instruction, multiple threads is a common type of GPU programming instruction. SIMT instructions have been widely used for massively parallel computing tasks such as graphics rendering and artificial intelligence operations. SIMT instructions perform parallel operations with a thread bundle (warp) as the basic unit, one thread bundle containing multiple threads. For example, a SIMT32 instruction is an instruction spanning 32 threads: a single SIMT32 instruction causes the 32 threads of a thread bundle to perform the same operation in parallel. A GPU or AI chip can thereby process multiple data points or computing tasks simultaneously, increasing computing efficiency.
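As a rough software analogy (sequential, where the hardware is parallel), a SIMT32 instruction applies one operation across all 32 threads of a warp. The function name and operand layout below are illustrative only:

```python
def simt_execute(op, per_thread_operands):
    """Apply the same operation to every thread's operands in a warp.
    Hardware does this across 32 lanes in parallel; a loop models the
    per-thread behavior sequentially."""
    return [op(a, b) for a, b in per_thread_operands]

# A SIMT32-style add: 32 threads, each adding its own pair of values.
warp = [(tid, 1) for tid in range(32)]
results = simt_execute(lambda a, b: a + b, warp)
```

One instruction, 32 results: each thread contributes its own operands but all threads run the same operation, which is the SIMT property the paragraph describes.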
Fig. 1 is a schematic circuit diagram of a 1 st vector core module 100 according to an embodiment. Based on the actual operating scenario, the 1 st vector core module 100 may act as a master and the 1 st tensor core module 12 may act as a slave. The 1 st tensor kernel module 12 is used to process some special operators of AI operations including, but not limited to, tensor data handling, matrix multiply add and convolution operations. In addition to operation-related control information, the 1 st tensor core module 12 typically uses a configuration register approach to obtain further tensor data tags, including but not limited to address coordinate information, size information, and boundary zero-padding information for the tensor. The contents of these registers are not statically configured, but are dynamically configured by instructions, but need only be generated by fixed point operations or simple data movement, and are typically scalar operations.
The 1 st vector core module 100 executes SIMT instructions to perform corresponding vector operations. Vector operations include, but are not limited to, floating point operations, fixed point operations, logical operations, and the like. The pipeline architecture of vector core module 100 supports massively parallel computing tasks. Pipeline architectures include a number of processes such as instruction fetching, instruction scheduling, decoding, fetching operands, performing operations, writing back results, and the like. The 1 st vector core module 100 writes the operation result back to the 1 st Memory (Memory) module 11. Operations to the 1 st memory module 11 include, but are not limited to, load (Load), store (Store), atomic operations (Atomic), and the like.
In the application scenario where the 1st vector core module 100 serves as the host module of the 1st memory module 11 and the 1st tensor core module 12, the 1st vector core module 100 generates, based on SIMT instructions, operation control information for the slave modules, the data or operands required by the 1st memory module 11, and the tensor core configuration information required by the 1st tensor core module 12. Using the SIMT pipeline architecture and arithmetic logic unit (ALU) resources, the 1st vector core module 100 transmits tag information of the relevant tensor data (including, but not limited to, address coordinate information of tensors, size information of tensors, and boundary zero-padding information) to the 1st tensor core module 12 acting as a slave.
In detail, the 1st vector core module 100 shown in fig. 1 includes a 1st instruction cache 110, a 1st instruction fetch unit 120, a 1st instruction issue unit 130, a 1st instruction decode unit 140, a 1st vector register set 150, a 1st scalar register set 160, a 1st vector core arithmetic unit 170, and a 1st active thread detection unit 180. The 1st instruction fetch unit 120 reads enough instructions from the 1st instruction cache 110 and passes them one by one to the 1st instruction issue unit 130 to await issue and execution. The 1st vector core module 100 does not distinguish instruction types. Whether an instruction processes only vector data, scalar data, memory data, or the like (collectively referred to herein as a "vector core instruction"), or only configures a tensor tag register or controls the 1st tensor core module 12 (collectively referred to herein as a "tensor core instruction"), it is passed from the 1st instruction fetch unit 120 to the 1st instruction issue unit 130 and issued to the 1st instruction decode unit 140 in sequence according to program order.
After any instruction is issued from the 1st instruction issue unit 130, the 1st instruction decode unit 140 decodes the instruction to obtain the vector core operand type and address information. The 1st instruction decode unit 140 reads data from the 1st vector register set 150 and passes the data to the 1st vector core arithmetic unit 170. Operands may come from the 1st vector register set 150 (independent data owned by each thread) or from the 1st scalar register set 160 (data shared by the threads). In addition, the 1st instruction decode unit 140 also obtains vector core operation control information and passes that information to the 1st vector core arithmetic unit 170 along with the operands.
The 1 st vector core arithmetic unit 170 performs computation upon receiving the operand and the operation type. The 1 st vector core operation unit 170 includes a plurality of thread operation units to support SIMT instructions operating on a plurality of threads. For example, each of the N independent thread arithmetic units executes a corresponding thread. Each thread arithmetic unit contains logical computing resources, floating point computing resources, fixed point computing resources, and the like. All the thread operation units can execute corresponding operations simultaneously and generate execution results after fixed delay. After the 1 st vector core operation unit 170 finishes executing the vector core instruction, the 1 st vector core operation unit 170 may directly write the execution result of each thread operation unit back into the 1 st vector register set 150. Based on the "1 st vector core module 100 as the master, the 1 st memory module 11 as the slave", the 1 st vector core module 100 directly transmits the execution result of each thread to the 1 st memory module 11, and simultaneously transmits control information of the memory operation.
After the 1st vector core arithmetic unit 170 finishes processing a tensor core instruction, the 1st active thread detection unit 180 selects the active thread with the smallest number, that is, it selects one processing result from the multiple thread arithmetic units of the 1st vector core arithmetic unit 170 and writes it back into the 1st scalar register set 160. Based on "the 1st vector core module 100 as the master and the 1st tensor core module 12 as the slave", the 1st active thread detection unit 180 selects the active thread with the smallest number, picks that one processing result from the thread arithmetic units of the 1st vector core arithmetic unit 170, sends it to the 1st tensor core module 12, and simultaneously sends the control information of the tensor core operation. The number of active threads is typically 1. When the 1st vector core arithmetic unit 170 processes a tensor core instruction, it does so with only one thread arithmetic unit, while the remaining thread arithmetic units of the 1st vector core arithmetic unit 170 sit idle.
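The selection of the lowest-numbered active thread can be modeled as a priority encoder over the warp's execution mask. A sketch, with the mask representation assumed:

```python
def select_active_thread(active_mask):
    """Return the index of the lowest-numbered active thread in a warp,
    as the active thread detection unit does when choosing the single
    scalar result to write back or forward to the tensor core module."""
    for tid, active in enumerate(active_mask):
        if active:
            return tid
    return None  # no thread in the warp is active
```

In hardware this is a priority encoder; only the selected lane's result is consumed, which is why the remaining thread arithmetic units sit idle for tensor core instructions.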
The 1st vector core module 100 shown in fig. 1 has the following problems when transmitting tensor core instruction configuration information. First, because SIMT computing resources are multiplexed, even fixed-point instructions suffer relatively long delays under the influence of floating-point instructions, so the whole pipeline sends tensor core configuration information relatively slowly. Furthermore, the tensor core configuration information requires only fixed-point operations or data movement and operates on only a single thread arithmetic unit of the 1st vector core arithmetic unit 170, so the resource utilization of the pipeline of the 1st vector core module 100 when running tensor core instructions is very low. In addition, the vector core instructions and the tensor core instructions share the 1st instruction issue unit 130 and the subsequent pipeline, which impedes the vector core instructions that have higher performance requirements and results in a gap between actual and expected computing power.
Fig. 2 is a circuit block diagram of a 2nd vector core module 200 according to an embodiment of the present invention. Based on the actual operating scenario, the 2nd vector core module 200 may act as a master and the 2nd tensor core module 22 may act as a slave. For the 2nd vector core module 200, the 2nd memory module 21, and the 2nd tensor core module 22 shown in fig. 2, reference may be made to the relevant descriptions of the 1st vector core module 100, the 1st memory module 11, and the 1st tensor core module 12 shown in fig. 1. In the embodiment shown in FIG. 2, the 2nd vector core module 200 includes a 2nd instruction cache 210, a 2nd instruction fetch unit 220, an instruction dispatch unit 230, a vector core instruction execution pipeline 240, a tensor core instruction processing pipeline 250, a 2nd vector register set 260, a 2nd scalar register set 270, and a 2nd active thread detection unit 280. The vector core instruction execution pipeline 240 executes vector core instructions and stores the execution results in the 2nd memory module 21. The 2nd vector register set 260 and the 2nd scalar register set 270 are coupled to the vector core instruction execution pipeline 240. The 2nd scalar register set 270 is also coupled to the tensor core instruction processing pipeline 250.
After the execution of the vector core instruction is completed, the vector core instruction execution pipeline 240 writes back vector data corresponding to the vector core instruction to the 2 nd vector register group 260. Vector 2 nd register set 260 provides vector data to vector core instruction execution pipeline 240. The active thread detection unit 280 is coupled to the vector core instruction execution pipeline 240. The active thread detection unit 280 detects the execution result of the vector core instruction execution pipeline 240 to determine an active thread. The 2 nd active thread detecting unit 280 writes back scalar data corresponding to the active thread to the 2 nd scalar register set 270. Vector core instruction execution pipeline 240 and tensor core instruction processing pipeline 250 may access scalar data for scalar register group 2 270. The 2 nd vector register set 260, the 2 nd scalar register set 270, and the 2 nd effective thread detection unit 280 shown in fig. 2 can refer to the 1 st vector register set 150, the 1 st scalar register set 160, and the 1 st effective thread detection unit 180 shown in fig. 1 and so forth, and thus are not described in detail herein.
In the embodiment shown in FIG. 2, the vector core instruction execution pipeline 240 includes a 2nd instruction issue unit 241, a 2nd instruction decode unit 242, and a 2nd vector core arithmetic unit 243. The 2nd instruction issue unit 241 is coupled to the instruction dispatch unit 230 to receive vector core instructions, and issues the vector core instructions. The 2nd instruction decode unit 242 is coupled to the 2nd instruction issue unit 241 to receive the vector core instructions, and decodes them to generate operands and operation types. The 2nd vector core arithmetic unit 243 is coupled to the 2nd instruction decode unit 242 to receive the operands and operation types, and operates on the operands based on the operation type to generate execution results for the 2nd memory module 21. The 2nd vector core arithmetic unit 243 includes multiple thread arithmetic units, each of which performs the operation of a corresponding thread; the multiple thread arithmetic units generate the execution results for the 2nd memory module 21. For the 2nd instruction issue unit 241, the 2nd instruction decode unit 242, and the 2nd vector core arithmetic unit 243 shown in fig. 2, reference may be made to the descriptions of the 1st instruction issue unit 130, the 1st instruction decode unit 140, and the 1st vector core arithmetic unit 170 shown in fig. 1, and they are therefore not described again herein.
The 2 nd instruction fetch unit 220 is coupled to the 2 nd instruction cache 210 to fetch at least one thread bundle. The 2 nd instruction cache 210 and the 2 nd instruction fetch unit 220 shown in fig. 2 may refer to the related descriptions of the 1 st instruction cache 110 and the 1 st instruction fetch unit 120 shown in fig. 1 and so on, and thus are not described in detail herein. The vector core module 200 of FIG. 2 adds an instruction dispatch unit 230 and a tensor core instruction processing pipeline 250 as compared to the vector core module 100 of FIG. 1. Instruction 2 fetch unit 220 is coupled to instruction dispatch unit 230 to provide a thread bundle. The newly added instruction dispatch unit 230 distinguishes and distributes tensor core instructions and vector core instructions to the corresponding pipeline. In embodiments supporting multiple thread bundles, instruction dispatch unit 230 may dispatch one tensor core instruction and a vector core instruction from each of the different thread bundles to the corresponding pipeline at the same time. The added tensor core instruction processing pipeline 250 will only run tensor core instructions. The tensor core instruction processing pipeline 250 processes scalar data only, without redundant operations of vector and scalar conversion. The processing of the tensor core instructions by the tensor core instruction processing pipeline 250 includes only one or more of fixed point addition, fixed point multiplication, fixed point multiply addition, and data handling, and thus the delay of the tensor core instruction processing pipeline 250 is much less than the delay of the vector core instruction execution pipeline 240. The vector core module 200 of fig. 2 sends tensor core configuration information to the tensor core module 22 of fig. 2 relatively quickly as compared to the vector core module 100 of fig. 1.
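For the multi-thread-bundle embodiments, the reference numerals list round-robin ("polling") arbiters (233, 234) that would pick which warp's instruction each pipeline accepts next, allowing a vector core instruction and a tensor core instruction from different warps to be dispatched in the same cycle. A software sketch under that assumption — the per-warp queue representation is illustrative:

```python
def round_robin_pick(warp_queues, last_served):
    """Pick the next instruction from the first non-empty per-warp queue
    after `last_served`, wrapping around. One such arbiter would sit in
    front of each pipeline, so the two pipelines can each accept an
    instruction from a different warp in the same cycle."""
    n = len(warp_queues)
    for step in range(1, n + 1):
        idx = (last_served + step) % n
        if warp_queues[idx]:
            return idx, warp_queues[idx].pop(0)
    return None, None  # all queues empty this cycle
```

Starting the scan just past the last-served warp gives every warp a fair turn, so no single thread bundle can starve the others of issue slots.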
FIG. 3 is a flow chart of a method of operation of a vector core module according to an embodiment of the invention. Referring to fig. 2 and 3, in step S310, the instruction scheduling unit 230 classifies the at least one thread bundle to distinguish the vector core instruction and the tensor core instruction from the at least one thread bundle. The instruction dispatch unit 230 is coupled to a vector core instruction execution pipeline 240 to provide vector core instructions. The instruction dispatch unit 230 is coupled to the tensor core instruction processing pipeline 250 to provide tensor core instructions.
In response to the thread bundle including a vector core instruction, the instruction dispatch unit 230 issues the vector core instruction to the vector core instruction execution pipeline 240 (step S320). The vector core instruction execution pipeline 240 executes the vector core instruction and stores the execution result in the 2 nd memory module 21 (step S330). In response to the thread bundle including the tensor core instruction, the instruction dispatch unit 230 sends the tensor core instruction to the tensor core instruction processing pipeline 250 (step S340). The tensor core instruction processing pipeline 250 processes the tensor core instruction and sends the processing result (e.g., tensor core configuration information) to the 2 nd tensor core module 22 (step S350).
In the embodiment shown in FIG. 2, the tensor core instruction processing pipeline 250 includes a 3 rd instruction issue unit 251, a 3 rd instruction decode unit 252, and a tensor core instruction arithmetic unit 253. The 3 rd instruction issue unit 251 is coupled to the instruction dispatch unit 230 for receiving tensor core instructions. The 3 rd instruction transmitting unit 251 sequentially transmits tensor core instructions in program order. The 3 rd instruction decode unit 252 is coupled to the 3 rd instruction issue unit 251 and the 2 nd scalar register set 270. The 3 rd instruction decode unit 252 decodes tensor core instructions to generate operands and operation types. For example, based on the decoding result, the 3 rd instruction decoding unit 252 reads the operand from the 2 nd scalar register group 270 to the tensor core instruction operation unit 253. The tensor core instruction operation unit 253 is coupled to the 3 rd instruction decoding unit 252 to receive operands and operation types. The tensor core instruction operation unit 253 processes the operands based on the operation type, generates a processing result (e.g., tensor core configuration information) to the 2 nd tensor core module 22, and simultaneously transmits control information of the tensor core operation.
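Based on the components listed for FIG. 6 (fixed-point multiplier 610, fixed-point adder 640, gates 620/630/650, operands A6/B6/C6), the tensor core instruction arithmetic unit 253 can be modeled as a gated multiply-add datapath. A sketch under that assumption; the operation names are illustrative:

```python
def tensor_core_alu(op, a6, b6=0, c6=0):
    """Model of a gated fixed-point multiply-add datapath: a fixed-point
    multiplier feeding a fixed-point adder, with gates selecting which
    operands reach which stage. Operand names follow A6, B6, C6."""
    if op == "mov":   # simple data handling: pass the operand through
        return a6
    if op == "add":   # gate bypasses the multiplier
        return a6 + b6
    if op == "mul":   # gate bypasses the adder
        return a6 * b6
    if op == "mad":   # multiplier output is routed into the adder
        return a6 * b6 + c6
    raise ValueError(f"unsupported tensor core operation: {op}")
```

Restricting the unit to these few fixed-point operations and data movement is what keeps the tensor core instruction processing pipeline's delay far below that of the full SIMT execution pipeline.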
In summary, the 2nd vector core module 200 adds a tensor core instruction processing pipeline 250, a separate pipeline dedicated to generating and sending tensor core configuration information. The tensor core instruction processing pipeline 250 runs only tensor core instructions, while vector core instructions are executed by the vector core instruction execution pipeline 240. In some embodiments, the 3rd instruction decode unit 252 and the tensor core instruction operation unit 253 process only scalar data and perform no vector-to-scalar conversion, so the latency of the tensor core instruction processing pipeline 250 is much lower than that of the vector core instruction execution pipeline 240. In some embodiments, the tensor core instruction operation unit 253 supports only fixed-point addition, fixed-point multiplication, fixed-point multiply-add, and simple data movement, which likewise keeps its latency well below that of the vector core instruction execution pipeline 240. The tensor core instruction processing pipeline 250 therefore sends tensor core configuration information to the 2nd tensor core module 22 much sooner than the vector core instruction execution pipeline 240 could. In the application scenario where the 2nd vector core module 200 is the master and the 2nd tensor core module 22 is the slave, this improves the efficiency of the 2nd vector core module 200.
Fig. 4 is a circuit block diagram of the instruction dispatch unit 230 according to an embodiment of the present invention. The instruction dispatch unit 230 of Fig. 4 may serve as one of many implementation examples of the instruction dispatch unit 230 of Fig. 2. For the 2nd instruction fetch unit 220, the instruction dispatch unit 230, the vector core instruction execution pipeline 240, and the tensor core instruction processing pipeline 250 of Fig. 4, refer to the relevant description of Fig. 2; they are not described again here. In the embodiment shown in Fig. 4, the warp provided by the 2nd instruction fetch unit 220 is a single warp, and the instruction dispatch unit 230 includes an instruction classifier 231. The input of the instruction classifier 231 is coupled to the 2nd instruction fetch unit 220 to receive the single warp. The instruction classifier 231 is coupled to the vector core instruction execution pipeline 240 and the tensor core instruction processing pipeline 250, and classifies the instructions of the single warp. In response to the single warp including a vector core instruction, the instruction classifier 231 sends the vector core instruction to the vector core instruction execution pipeline 240. In response to the single warp including a tensor core instruction, the instruction classifier 231 sends the tensor core instruction to the tensor core instruction processing pipeline 250.
Fig. 5 is a circuit block diagram of the instruction dispatch unit 230 according to another embodiment of the present invention. The instruction dispatch unit 230 of Fig. 5 may serve as one of many implementation examples of the instruction dispatch unit 230 of Fig. 2. For the 2nd instruction fetch unit 220, the instruction dispatch unit 230, the vector core instruction execution pipeline 240, and the tensor core instruction processing pipeline 250 of Fig. 5, refer to the relevant description of Fig. 2; they are not described again here. In the embodiment shown in Fig. 5, the warps provided by the 2nd instruction fetch unit 220 are multiple warps, and the instruction dispatch unit 230 includes a plurality of warp instruction classifiers (e.g., the 1st warp instruction classifier 232_1, the 2nd warp instruction classifier 232_2, through the n-th warp instruction classifier 232_n shown in Fig. 5), a vector core instruction poll arbiter 233, and a tensor core instruction poll arbiter 234. The input of each of the 1st warp instruction classifier 232_1 through the n-th warp instruction classifier 232_n is coupled to the 2nd instruction fetch unit 220 to receive a corresponding one of the warps. Each of the 1st warp instruction classifier 232_1 through the n-th warp instruction classifier 232_n classifies the instructions of its corresponding warp to distinguish the vector core instructions and the tensor core instructions within it.
The vector core instruction poll arbiter 233 is coupled to the 1st warp instruction classifier 232_1 through the n-th warp instruction classifier 232_n and polls them to obtain vector core instructions. The output of the vector core instruction poll arbiter 233 is coupled to the vector core instruction execution pipeline 240 to provide the vector core instructions. Likewise, the tensor core instruction poll arbiter 234 is coupled to the 1st warp instruction classifier 232_1 through the n-th warp instruction classifier 232_n and polls them to obtain tensor core instructions. The output of the tensor core instruction poll arbiter 234 is coupled to the tensor core instruction processing pipeline 250 to provide the tensor core instructions.
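A behavioral sketch of such a poll arbiter (233 or 234): it scans the per-warp classifier queues in round-robin order and grants the first non-empty one. The queue-based Python API is an assumption made for illustration; the patent does not prescribe the arbitration policy beyond polling.

```python
class RoundRobinArbiter:
    """Sketch of a poll arbiter over per-warp classifier queues."""
    def __init__(self, queues):
        self.queues = queues   # one pending-instruction queue per classifier
        self.next = 0          # classifier index to poll first

    def grant(self):
        n = len(self.queues)
        for i in range(n):
            idx = (self.next + i) % n
            if self.queues[idx]:
                self.next = (idx + 1) % n  # fairness: restart after the winner
                return idx, self.queues[idx].pop(0)
        return None                        # no classifier has a pending instruction

arb = RoundRobinArbiter([["a"], [], ["b", "c"]])
```

Restarting the scan just past the last winner keeps any single busy warp from starving the others, which matters because both arbiters feed shared pipelines.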
Fig. 6 is a schematic circuit diagram of the tensor core instruction operation unit 253 according to an embodiment of the present invention. The tensor core instruction operation unit 253 of Fig. 6 may serve as one of many implementation examples of the tensor core instruction operation unit 253 of Fig. 2. For the 3rd instruction decode unit 252, the tensor core instruction operation unit 253, and the 2nd tensor core module 22 of Fig. 6, refer to the relevant description of Fig. 2; they are not described again here. In the embodiment shown in Fig. 6, the tensor core instruction operation unit 253 includes a fixed-point multiplier 610, a 1st gate 620, a 2nd gate 630, a fixed-point adder 640, and a 3rd gate 650. A first input of the 3rd gate 650 is coupled to the 3rd instruction decode unit 252 to receive the first operand a6. The output of the 3rd gate 650 is coupled to the 2nd tensor core module 22 to provide the processing result of the tensor core instruction operation unit 253. A first input of the fixed-point multiplier 610 is coupled to the 3rd instruction decode unit 252 to receive the first operand a6, and a second input of the fixed-point multiplier 610 is coupled to the 3rd instruction decode unit 252 to receive the second operand b6. The output of the fixed-point multiplier 610 is coupled to a second input of the 3rd gate 650.
A first input of the 1st gate 620 is coupled to the 3rd instruction decode unit 252 to receive the first operand a6, and a second input of the 1st gate 620 is coupled to the output of the fixed-point multiplier 610. Based on the control of the 3rd instruction decode unit 252, the 1st gate 620 selects one of the first operand a6 and the output of the fixed-point multiplier 610 to pass to the fixed-point adder 640. A first input of the 2nd gate 630 is coupled to the 3rd instruction decode unit 252 to receive the second operand b6, and a second input of the 2nd gate 630 is coupled to the 3rd instruction decode unit 252 to receive the third operand c6. Based on the control of the 3rd instruction decode unit 252, the 2nd gate 630 selects one of the second operand b6 and the third operand c6 to pass to the fixed-point adder 640. A first input of the fixed-point adder 640 is coupled to the output of the 1st gate 620, and a second input of the fixed-point adder 640 is coupled to the output of the 2nd gate 630. The output of the fixed-point adder 640 is coupled to a third input of the 3rd gate 650.
Based on the control of the 3rd instruction decode unit 252, the 3rd gate 650 selects one of the first operand a6, the output of the fixed-point multiplier 610, and the output of the fixed-point adder 640 to pass to the 2nd tensor core module 22. For a data movement requirement, the 3rd gate 650 passes the first operand a6 directly to the 2nd tensor core module 22 as the processing result (e.g., tensor core configuration information). For a multiplication requirement, the 3rd gate 650 passes the output "a6·b6" of the fixed-point multiplier 610 to the 2nd tensor core module 22 as the processing result. For an addition requirement, the 1st gate 620 passes the first operand a6 to the fixed-point adder 640, the 2nd gate 630 passes the second operand b6 to the fixed-point adder 640, and the 3rd gate 650 passes the output "a6+b6" of the fixed-point adder 640 to the 2nd tensor core module 22 as the processing result. For a multiply-add requirement, the 1st gate 620 passes the output of the fixed-point multiplier 610 to the fixed-point adder 640, the 2nd gate 630 passes the third operand c6 to the fixed-point adder 640, and the 3rd gate 650 passes the output "a6·b6+c6" of the fixed-point adder 640 to the 2nd tensor core module 22 as the processing result (e.g., tensor core configuration information).
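The four gate configurations just described reduce to the following behavioral model of Fig. 6. The operation-type names ("move", "mul", "add", "madd") are labels invented here for the four requirement cases; the real unit selects them via the gate control signals from the 3rd instruction decode unit 252.

```python
def tensor_core_alu(op, a6, b6=0, c6=0):
    """Behavioral model of the Fig. 6 datapath (fixed-point/integer only)."""
    mul = a6 * b6                 # fixed-point multiplier 610 (always computes)
    if op == "move":              # 3rd gate 650 passes a6 straight through
        return a6
    if op == "mul":               # 3rd gate 650 selects the multiplier output
        return mul
    if op == "add":               # 1st gate 620 -> a6, 2nd gate 630 -> b6
        return a6 + b6            # fixed-point adder 640
    if op == "madd":              # 1st gate 620 -> mul, 2nd gate 630 -> c6
        return mul + c6           # a6*b6 + c6
    raise ValueError(f"unsupported operation type: {op}")
```

One multiplier, one adder, and three selectors cover all four requirements, which is why the unit stays small and low-latency compared with the full vector datapath.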
It should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in the above embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solution to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (20)

1. A vector core module, characterized in that the vector core module comprises: a vector core instruction execution pipeline, which executes a vector core instruction and stores an execution result in a memory module; a tensor core instruction processing pipeline, which processes a tensor core instruction and sends a processing result to a tensor core module, wherein the processing result is tensor core configuration information, the vector core module is a master, and the tensor core module is a slave; and an instruction dispatch unit, coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline, wherein the instruction dispatch unit performs instruction classification on at least one warp to distinguish the vector core instruction and the tensor core instruction from the at least one warp; in response to the at least one warp including the vector core instruction, the instruction dispatch unit sends the vector core instruction to the vector core instruction execution pipeline; and in response to the at least one warp including the tensor core instruction, the instruction dispatch unit sends the tensor core instruction to the tensor core instruction processing pipeline.

2. The vector core module according to claim 1, wherein the vector core instruction execution pipeline comprises: a first instruction issue unit, coupled to the instruction dispatch unit to receive the vector core instruction, wherein the first instruction issue unit issues the vector core instruction; a first instruction decode unit, coupled to the first instruction issue unit to receive the vector core instruction, wherein the first instruction decode unit decodes the vector core instruction to generate operands and an operation type; and a vector core operation unit, coupled to the first instruction decode unit to receive the operands and the operation type, wherein the vector core operation unit operates on the operands based on the operation type to generate the execution result to the memory module.

3. The vector core module according to claim 2, wherein the vector core operation unit comprises: a plurality of thread operation units, wherein each of the plurality of thread operation units executes the operation of a corresponding thread, and the plurality of thread operation units generate the execution result to the memory module.

4. The vector core module according to claim 1, wherein the tensor core instruction processing pipeline processes only scalar data.

5. The vector core module according to claim 1, wherein the processing of the tensor core instruction by the tensor core instruction processing pipeline includes only one or more of data movement, fixed-point addition, fixed-point multiplication, and fixed-point multiply-add.

6. The vector core module according to claim 1, wherein the tensor core instruction processing pipeline comprises: a second instruction issue unit, coupled to the instruction dispatch unit to receive the tensor core instruction, wherein the second instruction issue unit issues the tensor core instruction; a second instruction decode unit, coupled to the second instruction issue unit, wherein the second instruction decode unit decodes the tensor core instruction to generate operands and an operation type; and a tensor core instruction operation unit, coupled to the second instruction decode unit to receive the operands and the operation type, wherein the tensor core instruction operation unit processes the operands based on the operation type to generate the processing result to the tensor core module.

7. The vector core module according to claim 6, wherein the tensor core instruction operation unit comprises: a first gate, wherein a first input of the first gate is coupled to the second instruction decode unit to receive a first operand of the operands, and an output of the first gate is coupled to the tensor core module to provide the processing result; a fixed-point multiplier, wherein a first input of the fixed-point multiplier is coupled to the second instruction decode unit to receive the first operand, a second input of the fixed-point multiplier is coupled to the second instruction decode unit to receive a second operand of the operands, and an output of the fixed-point multiplier is coupled to a second input of the first gate; a second gate, wherein a first input of the second gate is coupled to the second instruction decode unit to receive the first operand, and a second input of the second gate is coupled to the output of the fixed-point multiplier; a third gate, wherein a first input of the third gate is coupled to the second instruction decode unit to receive the second operand, and a second input of the third gate is coupled to the second instruction decode unit to receive a third operand of the operands; and a fixed-point adder, wherein a first input of the fixed-point adder is coupled to an output of the second gate, a second input of the fixed-point adder is coupled to an output of the third gate, and an output of the fixed-point adder is coupled to a third input of the first gate.

8. The vector core module according to claim 1, wherein the vector core module further comprises: an instruction cache; and an instruction fetch unit, coupled to the instruction cache to fetch the at least one warp, wherein the instruction fetch unit is coupled to the instruction dispatch unit to provide the at least one warp.

9. The vector core module according to claim 8, wherein the at least one warp is a single warp, and the instruction dispatch unit comprises: an instruction classifier, wherein an input of the instruction classifier is coupled to the instruction fetch unit to receive the single warp, the instruction classifier is coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline, and the instruction classifier performs instruction classification on the single warp; in response to the single warp including the vector core instruction, the instruction classifier sends the vector core instruction to the vector core instruction execution pipeline; and in response to the single warp including the tensor core instruction, the instruction classifier sends the tensor core instruction to the tensor core instruction processing pipeline.

10. The vector core module according to claim 8, wherein the at least one warp is a plurality of warps, and the instruction dispatch unit comprises: a plurality of warp instruction classifiers, wherein an input of each of the plurality of warp instruction classifiers is coupled to the instruction fetch unit to receive a corresponding warp of the plurality of warps, and each of the plurality of warp instruction classifiers performs instruction classification on the corresponding warp to distinguish the vector core instruction and the tensor core instruction from the corresponding warp; a vector core instruction poll arbiter, coupled to the plurality of warp instruction classifiers, wherein the vector core instruction poll arbiter polls the plurality of warp instruction classifiers to obtain the vector core instruction, and an output of the vector core instruction poll arbiter is coupled to the vector core instruction execution pipeline to provide the vector core instruction; and a tensor core instruction poll arbiter, coupled to the plurality of warp instruction classifiers, wherein the tensor core instruction poll arbiter polls the plurality of warp instruction classifiers to obtain the tensor core instruction, and an output of the tensor core instruction poll arbiter is coupled to the tensor core instruction processing pipeline to provide the tensor core instruction.

11. The vector core module according to claim 1, wherein the vector core module further comprises: a vector register set, coupled to the vector core instruction execution pipeline to provide vector data, wherein the vector core instruction execution pipeline writes the vector data corresponding to the vector core instruction back to the vector register set after execution of the vector core instruction is completed; a scalar register set, coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline to provide scalar data; and an active thread detection unit, coupled to the vector core instruction execution pipeline, wherein the active thread detection unit examines the execution result to determine active threads, and the active thread detection unit writes the scalar data corresponding to the active threads back to the scalar register set.

12. A method for operating a vector core module, characterized in that the method comprises: performing, by an instruction dispatch unit of the vector core module, instruction classification on at least one warp to distinguish a vector core instruction and a tensor core instruction from the at least one warp, wherein the instruction dispatch unit is coupled to a vector core instruction execution pipeline and a tensor core instruction processing pipeline; in response to the at least one warp including the vector core instruction, sending, by the instruction dispatch unit, the vector core instruction to the vector core instruction execution pipeline of the vector core module; executing, by the vector core instruction execution pipeline, the vector core instruction and storing an execution result in a memory module; in response to the at least one warp including the tensor core instruction, sending, by the instruction dispatch unit, the tensor core instruction to the tensor core instruction processing pipeline of the vector core module; and processing, by the tensor core instruction processing pipeline, the tensor core instruction and sending a processing result to a tensor core module, wherein the processing result is tensor core configuration information, the vector core module is a master, and the tensor core module is a slave.

13. The method according to claim 12, further comprising: issuing the vector core instruction by a first instruction issue unit of the vector core instruction execution pipeline, wherein the first instruction issue unit is coupled to the instruction dispatch unit to receive the vector core instruction; decoding the vector core instruction by a first instruction decode unit of the vector core instruction execution pipeline to generate operands and an operation type, wherein the first instruction decode unit is coupled to the first instruction issue unit to receive the vector core instruction; and operating on the operands based on the operation type by a vector core operation unit of the vector core instruction execution pipeline to generate the execution result to the memory module, wherein the vector core operation unit is coupled to the first instruction decode unit to receive the operands and the operation type.

14. The method according to claim 13, further comprising: executing, by each of a plurality of thread operation units of the vector core operation unit, the operation of a corresponding thread, wherein the plurality of thread operation units generate the execution result to the memory module.

15. The method according to claim 12, wherein the tensor core instruction processing pipeline processes only scalar data.

16. The method according to claim 12, wherein the processing of the tensor core instruction by the tensor core instruction processing pipeline includes only one or more of data movement, fixed-point addition, fixed-point multiplication, and fixed-point multiply-add.

17. The method according to claim 12, further comprising: issuing the tensor core instruction by a second instruction issue unit of the tensor core instruction processing pipeline, wherein the second instruction issue unit is coupled to the instruction dispatch unit to receive the tensor core instruction; decoding the tensor core instruction by a second instruction decode unit of the tensor core instruction processing pipeline to generate operands and an operation type, wherein the second instruction decode unit is coupled to the second instruction issue unit; and processing the operands based on the operation type by a tensor core instruction operation unit of the tensor core instruction processing pipeline to generate the processing result to the tensor core module, wherein the tensor core instruction operation unit is coupled to the second instruction decode unit to receive the operands and the operation type.

18. The method according to claim 12, wherein the at least one warp is a single warp, the method further comprising: fetching, by an instruction fetch unit of the vector core module, the at least one warp from an instruction cache of the vector core module, wherein the instruction fetch unit is coupled to the instruction dispatch unit to provide the at least one warp; performing, by an instruction classifier of the instruction dispatch unit, instruction classification on the single warp, wherein an input of the instruction classifier is coupled to the instruction fetch unit to receive the single warp, and the instruction classifier is coupled to the vector core instruction execution pipeline and the tensor core instruction processing pipeline; in response to the single warp including the vector core instruction, sending the vector core instruction to the vector core instruction execution pipeline by the instruction classifier; and in response to the single warp including the tensor core instruction, sending the tensor core instruction to the tensor core instruction processing pipeline by the instruction classifier.

19. The method according to claim 12, wherein the at least one warp is a plurality of warps, the method further comprising: fetching, by an instruction fetch unit of the vector core module, the at least one warp from an instruction cache of the vector core module, wherein the instruction fetch unit is coupled to the instruction dispatch unit to provide the at least one warp; receiving, by each of a plurality of warp instruction classifiers of the instruction dispatch unit, a corresponding warp of the plurality of warps; performing, by each of the plurality of warp instruction classifiers, instruction classification on the corresponding warp to distinguish the vector core instruction and the tensor core instruction from the corresponding warp; polling, by a vector core instruction poll arbiter of the instruction dispatch unit, the plurality of warp instruction classifiers to obtain the vector core instruction, wherein an output of the vector core instruction poll arbiter is coupled to the vector core instruction execution pipeline to provide the vector core instruction; and polling, by a tensor core instruction poll arbiter of the instruction dispatch unit, the plurality of warp instruction classifiers to obtain the tensor core instruction, wherein an output of the tensor core instruction poll arbiter is coupled to the tensor core instruction processing pipeline to provide the tensor core instruction.

20. The method according to claim 12, further comprising: writing, by the vector core instruction execution pipeline, the vector data corresponding to the vector core instruction back to a vector register set of the vector core module after execution of the vector core instruction is completed; providing the vector data to the vector core instruction execution pipeline by the vector register set; examining the execution result by an active thread detection unit of the vector core module to determine active threads; writing, by the active thread detection unit, the scalar data corresponding to the active threads back to a scalar register set of the vector core module; and providing the scalar data to the vector core instruction execution pipeline and the tensor core instruction processing pipeline by the scalar register set.
CN202510971173.6A 2025-07-15 2025-07-15 Vector core module of artificial intelligence chip and its operation method Active CN120469721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510971173.6A CN120469721B (en) 2025-07-15 2025-07-15 Vector core module of artificial intelligence chip and its operation method

Publications (2)

Publication Number Publication Date
CN120469721A CN120469721A (en) 2025-08-12
CN120469721B (en) 2025-09-30

Family

ID=96635416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510971173.6A Active CN120469721B (en) 2025-07-15 2025-07-15 Vector core module of artificial intelligence chip and its operation method

Country Status (1)

Country Link
CN (1) CN120469721B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120655494B (en) * 2025-08-19 2025-11-25 上海壁仞科技股份有限公司 Artificial intelligence chip and operation method thereof

Citations (1)

Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
KR102894763B1 (en) * 2019-03-15 2025-12-03 인텔 코포레이션 Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
KR20220078819A (en) * 2020-12-04 2022-06-13 삼성전자주식회사 Method and apparatus for performing deep learning operations
CN114625421A (en) * 2020-12-11 2022-06-14 上海阵量智能科技有限公司 SIMT instruction processing method and device
US12299766B2 (en) * 2021-09-24 2025-05-13 Intel Corporation Providing native support for generic pointers in a graphics processing unit
CN120234045B (en) * 2025-05-29 2025-08-08 上海壁仞科技股份有限公司 Vector computing device

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor

Also Published As

Publication number Publication date
CN120469721A (en) 2025-08-12

Similar Documents

Publication Publication Date Title
TWI628594B (en) User level forks and rendezvous processors, methods, systems, and instructions
CN111310910B (en) Computing device and method
US11360809B2 (en) Multithreaded processor core with hardware-assisted task scheduling
US20080155197A1 (en) Locality optimization in multiprocessor systems
CN114968358B (en) Device and method for configuring cooperative thread warps in vector computing system
JP6502616B2 (en) Processor for batch thread processing, code generator and batch thread processing method
US20080046689A1 (en) Method and apparatus for cooperative multithreading
CN120469721B (en) Vector core module of artificial intelligence chip and its operation method
US9870267B2 (en) Virtual vector processing
US11875425B2 (en) Implementing heterogeneous wavefronts on a graphics processing unit (GPU)
JP2023509813A (en) SIMT command processing method and device
CN120655494B (en) Artificial intelligence chip and operation method thereof
CN117501254A (en) Providing atomicity for complex operations using near-memory computation
US11416261B2 (en) Group load register of a graph streaming processor
US10133578B2 (en) System and method for an asynchronous processor with heterogeneous processors
CN120764601A (en) SIMT-based neural network processor and its task execution method
CN118747088A (en) A signal processing method, device, equipment and medium for multi-threaded instruction emission
CN118747084A (en) Instruction processing method, device and storage medium based on multi-core processor
US11847462B2 (en) Software-based instruction scoreboard for arithmetic logic units
CN109800064B (en) Processor and thread processing method
Sahar et al. An Interactive System Based on First-Class User-Level Threads: A Systematic Review
US12436808B2 (en) CPU tight-coupled accelerator
CN111752614A (en) A processor, instruction execution device and method
JP7004905B2 (en) Arithmetic processing unit and control method of arithmetic processing unit
Feng et al. Programmable Architecture for Thread Level Parallel Computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant