CN118349282B - SIMT processor supporting multiple transmissions, instruction processing method and chip - Google Patents
- Publication number
- CN118349282B (application CN202410783914.3A)
- Authority
- CN
- China
- Prior art keywords
- instructions
- operand
- instruction
- groups
- cache
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30047—Prefetch instructions; cache control instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3888—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a multi-issue SIMT processor, an instruction processing method, and a chip.
Background Art
In recent years, instruction processing chips such as GPUs (Graphics Processing Units) and AI (Artificial Intelligence) processors have played a key role in many fields, especially those that require massive parallel computing and high-performance computing, such as artificial intelligence and scientific computing. The SIMT (Single Instruction Multiple Threads) processor is the foundation of such chips. Its core idea is to run a large number of threads simultaneously, with the threads executing the same instruction on different data, to achieve efficient parallel computation.
In the prior art, SIMT processors use a single-issue structure: in each cycle the scheduler selects one instruction for execution, and overall performance is raised by increasing the number of SIMT processors in the instruction processing chip. However, as demand for computing power grows, the single-issue structure can no longer meet the rising execution performance requirements of instruction processing chips.
Summary of the Invention
The present invention provides a multi-issue SIMT processor, an instruction processing method, and a chip, to overcome the defect in the prior art that the single-issue structure can no longer meet the rising execution performance requirements of instruction processing chips, and to give a single SIMT processor multi-issue capability.
The present invention provides a multi-issue SIMT processor comprising at least two register files, an instruction scheduling module, an operand collection module, an instruction dispatch module, and an execution module, wherein:
The instruction scheduling module is connected to the operand collection module and is configured to group multiple thread groups and determine a first instruction for each of at least two groups; the first instructions of the at least two groups correspond one-to-one to the at least two register files;
The operand collection module is connected to the at least two register files and to the instruction dispatch module, and is configured to collect, in parallel from the at least two register files, the operands of all first instructions, and to send all first instructions to the instruction dispatch module;
The instruction dispatch module is connected to the execution module and is configured to determine at most Q second instructions from the cached instructions whose operands have been collected, where Q is determined based on the number of execution units in the execution module; the cached instructions include at least all of the first instructions;
The execution module is configured to execute the at most Q second instructions in parallel.
According to the multi-issue SIMT processor provided by the present invention, the instruction scheduling module grouping the multiple thread groups and determining the first instruction for each of the at least two groups comprises:
Grouping the instructions of the multiple thread groups to obtain at least two groups and the first candidate instructions of each group; the first candidate instructions of the at least two groups correspond one-to-one to the at least two register files;
Based on the register occupancy table and the operand occupancy table of each of the at least two groups, determining in parallel, from the first candidate instructions of each group, the executable instruction group of that group;
Determining the highest-priority instruction in each executable instruction group as the first instruction of the corresponding group.
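The three scheduling steps above can be sketched in software as follows. This is a minimal illustration, not the patented hardware: the `Candidate` record, the modulo mapping from thread group to group, and the convention that a larger `priority` value wins are all assumptions introduced here, and the `executable` flag stands in for the result of the occupancy-table checks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    warp_id: int      # thread group that owns the instruction
    opcode: str
    priority: int     # assumed convention: larger value = higher priority
    executable: bool  # stand-in for the result of the occupancy-table checks

def group_of(warp_id: int, num_groups: int) -> int:
    # Assumed mapping: thread groups are interleaved across the N groups.
    return warp_id % num_groups

def select_first_instructions(candidates, num_groups):
    """Return {group index: first instruction} for every group that has one."""
    selected = {}
    for group in range(num_groups):
        ready = [c for c in candidates
                 if group_of(c.warp_id, num_groups) == group and c.executable]
        if ready:
            # The highest-priority executable instruction becomes the first instruction.
            selected[group] = max(ready, key=lambda c: c.priority)
    return selected

candidates = [
    Candidate(0, "FADD", 3, True),
    Candidate(1, "LD", 5, True),
    Candidate(2, "FMUL", 7, False),  # highest priority, but not executable
    Candidate(3, "IADD", 1, True),
]
firsts = select_first_instructions(candidates, num_groups=2)
```

With two groups, group 0 (thread groups 0 and 2) yields `FADD` because `FMUL` is not executable, and group 1 (thread groups 1 and 3) yields `LD`. In hardware the per-group selections would run in parallel; the loop here is sequential only for readability.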
According to the multi-issue SIMT processor provided by the present invention, determining in parallel the executable instruction group of each of the at least two groups, based on the register occupancy table and the operand occupancy table of each group, comprises:
Based on the register occupancy table of each of the at least two groups, judging whether the first candidate instructions of each group have a register read/write dependency, to obtain a first judgment result for the first candidate instructions of each group;
Based on the operand occupancy table of each of the at least two groups, judging whether a free position remains in the operand cache unit range corresponding to the first candidate instructions of each group, to obtain a second judgment result for the first candidate instructions of each group;
Based on the first judgment result and the second judgment result of each group, determining in parallel the executable instruction group of each of the at least two groups.
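A minimal sketch of the two checks for a single group follows, assuming the register occupancy table can be modeled as a set of busy register names and the operand occupancy table as a list of booleans (one per operand cache unit). The dictionary layout of an instruction and the `cache_range` field are hypothetical names chosen for illustration.

```python
def has_register_dependency(instr, busy_registers):
    """First judgment: does the instruction read or write a register that is still busy?"""
    touched = set(instr["src"]) | {instr["dst"]}
    return bool(touched & busy_registers)

def has_free_slot(instr, operand_occupancy):
    """Second judgment: is any unit in the instruction's operand cache range free?"""
    start, stop = instr["cache_range"]
    return not all(operand_occupancy[start:stop])

def executable_group(candidates, busy_registers, operand_occupancy):
    """Keep only candidates that pass both judgments."""
    return [i for i in candidates
            if not has_register_dependency(i, busy_registers)
            and has_free_slot(i, operand_occupancy)]

busy_registers = {"r4"}
operand_occupancy = [True, True, False, True]  # units 0 and 1 full, unit 2 free, unit 3 full
candidates = [
    {"name": "i0", "src": ["r1", "r2"], "dst": "r3", "cache_range": (0, 2)},  # no free unit
    {"name": "i1", "src": ["r4", "r5"], "dst": "r6", "cache_range": (2, 4)},  # reads busy r4
    {"name": "i2", "src": ["r7"], "dst": "r8", "cache_range": (2, 4)},        # passes both
]
ready = [i["name"] for i in executable_group(candidates, busy_registers, operand_occupancy)]
```

Only `i2` survives both checks. Each group would hold its own pair of tables, so the same function could be evaluated for all groups at once.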
According to the multi-issue SIMT processor provided by the present invention, the operand collection module comprises at least two operand collection units and at least two operand caches;
The at least two operand collection units are connected to a read/write controller, and the at least two operand collection units, the at least two operand caches, and the at least two register files correspond to one another. The at least two operand collection units are configured to determine, in parallel, the read request of each of the first instructions and to send each read request to the read/write controller;
Each read request instructs the read/write controller to determine, in parallel from the at least two register files, the operands of all first instructions, and to write the operands of each first instruction into the corresponding operand cache.
According to the multi-issue SIMT processor provided by the present invention, the at least two operand caches are further connected to an operand interconnect circuit and to at least two read interconnect circuits; the operand interconnect circuit is connected to the execution module, and the at least two read interconnect circuits are connected to the corresponding ones of the at least two register files;
The read/write controller determining, in parallel from the at least two register files, the operands of all first instructions and writing the operands of each first instruction into the corresponding operand cache comprises:
For each read request, the read/write controller determines the operand address in the read request of the first instruction; the operand address includes the read address for the register file, the selection index for the read interconnect circuit, and the distribution select address for the operand cache. Based on the read address for the register file, the operand is read from the corresponding register file. Based on the selection index for the read interconnect circuit and the distribution select address for the operand cache, the cache unit for the operand is determined in the operand cache. The operand is then written into that cache unit.
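The read path above can be pictured with the following sketch. The text does not specify how the three address fields are encoded, so the interpretation here is an assumption: `read_addr` is a row index applied to every SRAM bank of one register file, `select_index` chooses which bank output the read interconnect forwards, and `dist_addr` chooses the operand cache unit that receives the value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperandAddress:
    read_addr: int     # assumed: row index inside each SRAM bank of the register file
    select_index: int  # assumed: bank output chosen by the read interconnect
    dist_addr: int     # assumed: operand cache unit that receives the operand

def service_read_request(register_file, operand_cache, addr):
    """Read one operand from a register file and place it in the operand cache."""
    # Step 1: read every bank at the requested row.
    bank_outputs = [bank[addr.read_addr] for bank in register_file]
    # Step 2: the read interconnect forwards the selected bank output.
    value = bank_outputs[addr.select_index]
    # Step 3: write the operand into the selected cache unit.
    operand_cache[addr.dist_addr] = value
    return value

# One register file modeled as 4 banks with 2 rows each (hypothetical contents).
register_file = [[10, 11], [20, 21], [30, 31], [40, 41]]
operand_cache = [None, None]
value = service_read_request(
    register_file, operand_cache,
    OperandAddress(read_addr=1, select_index=2, dist_addr=0),
)
```

The request reads row 1, selects bank 2, and stores the result in cache unit 0. One such request per first instruction, each against its own register file, would proceed in parallel.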
According to the multi-issue SIMT processor provided by the present invention, the instruction dispatch module determining at most Q second instructions from the cached instructions whose operands have been collected comprises:
Writing all first instructions whose operands have been collected, in parallel, into their respective pending instruction tables; the pending instruction tables correspond one-to-one to the first instructions;
From all cached instructions in all pending instruction tables, determining at most Q second instructions that have no execution conflict and the best priority combination. Execution conflicts include execution unit conflicts and write conflicts: an execution unit conflict means that two cached instructions use the same execution unit, and a write conflict means that the execution results of two cached instructions would be written to the same register file at the same time. The execution units belong to the execution module.
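The selection rule can be illustrated with an exhaustive search over small candidate sets. The text does not define "best priority combination", so this sketch assumes it means the conflict-free subset with the largest priority sum; the field names (`unit`, `dst_file`, `write_cycle`, `priority`) are hypothetical.

```python
from itertools import combinations

def conflicts(a, b):
    """True if two cached instructions have an execution unit conflict or a write conflict."""
    same_unit = a["unit"] == b["unit"]
    same_write = a["dst_file"] == b["dst_file"] and a["write_cycle"] == b["write_cycle"]
    return same_unit or same_write

def pick_second_instructions(cached, q):
    """Return the conflict-free subset of at most q instructions with the largest priority sum."""
    best, best_score = (), 0
    for size in range(1, min(q, len(cached)) + 1):
        for combo in combinations(cached, size):
            if any(conflicts(a, b) for a, b in combinations(combo, 2)):
                continue
            score = sum(i["priority"] for i in combo)
            if score > best_score:  # ties keep the earlier combination
                best, best_score = combo, score
    return list(best)

cached = [
    {"name": "A", "unit": "ALU",    "dst_file": 0, "write_cycle": 4, "priority": 5},
    {"name": "B", "unit": "ALU",    "dst_file": 1, "write_cycle": 6, "priority": 4},
    {"name": "C", "unit": "LSU",    "dst_file": 0, "write_cycle": 4, "priority": 3},
    {"name": "D", "unit": "TENSOR", "dst_file": 1, "write_cycle": 9, "priority": 2},
]
picked_q3 = [i["name"] for i in pick_second_instructions(cached, q=3)]
picked_q2 = [i["name"] for i in pick_second_instructions(cached, q=2)]
```

Here `A` conflicts with `B` (same execution unit) and with `C` (same register file written in the same cycle), so with Q = 3 the best set is `B`, `C`, `D`, even though `A` has the highest individual priority. A hardware arbiter would likely use a cheaper heuristic than brute force; the search is used here only because it makes the rule explicit.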
According to the multi-issue SIMT processor provided by the present invention, the execution module comprises at least two execution units, which correspond to the at most Q second instructions. Each of the at least two execution units is connected to a write interconnect circuit and to the operand interconnect circuit, and the write interconnect circuit is connected to the at least two register files;
For each execution unit, the execution unit that receives a second instruction is configured to determine the corresponding target operand cache based on the second instruction; to obtain the target operands from the target operand cache through the operand interconnect circuit; to execute the second instruction on the target operands and determine an execution result; and to write the execution result, through the write interconnect circuit, into the at least two register files corresponding to the second instruction.
The present invention further provides an instruction processing method applied to the multi-issue SIMT processor described in any of the above, the method comprising:
Using the instruction scheduling module to group multiple thread groups and determine a first instruction for each of at least two groups; the first instructions of the at least two groups correspond one-to-one to at least two register files;
Using the operand collection module to collect, in parallel from the at least two register files, the operands of all first instructions, and to send all first instructions to the instruction dispatch module;
Using the instruction dispatch module to determine at most Q second instructions from the cached instructions whose operands have been collected, where Q is determined based on the number of execution units in the execution module; the cached instructions include at least all of the first instructions;
Using the execution module to execute the at most Q second instructions in parallel.
The present invention further provides a chip comprising at least two multi-issue SIMT processors as described in any of the above.
The present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the instruction processing methods described above.
In the multi-issue SIMT processor, instruction processing method, and chip provided by the present invention, the instruction scheduling module groups multiple thread groups and determines a first instruction for each of at least two groups. The operand collection module collects, in parallel from at least two register files, the operands of all first instructions corresponding to those register files, and sends all first instructions whose operands have been collected to the instruction dispatch module. The instruction dispatch module determines at most Q second instructions from the cached instructions, which include all first instructions, and the execution module executes the at most Q second instructions in parallel. The present invention secures multi-issue capability in three respects: the number of instructions selected per cycle, the grouped storage of operands across register files, and the parallel execution path formed by the instruction scheduling module, operand collection module, instruction dispatch module, and execution module. This improves instruction execution efficiency.
Brief Description of the Drawings
To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of the multi-issue SIMT processor provided by an embodiment of the present invention.
FIG. 2 is a schematic structural diagram of the instruction scheduling module provided by an embodiment of the present invention.
FIG. 3 is a connection diagram of the operand cache collection module provided by an embodiment of the present invention.
FIG. 4 is a connection diagram of the read interconnect circuit provided by an embodiment of the present invention.
FIG. 5 is a schematic structural diagram of the instruction dispatch module provided by an embodiment of the present invention.
FIG. 6 is a connection diagram of the execution module provided by an embodiment of the present invention.
FIG. 7 is a schematic flowchart of the instruction processing method provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
In the prior art, the SIMT processors in instruction processing chips such as GPUs (Graphics Processing Units) and AI (Artificial Intelligence) processors use a single-issue structure. Taking a GPU as an example, the GPU includes multiple interconnected stream processors, and each stream processor includes four interconnected SIMT (Single Instruction Multiple Threads) processors. Each SIMT processor is single-issue: in each cycle the scheduler selects one instruction, which multiple threads execute simultaneously. Scheduling is performed per thread group (warp), where each thread group may include 32 or 64 threads, and switching among thread groups raises the utilization of the arithmetic units. However, as demand for computing power grows, the variety of arithmetic units in a SIMT processor has increased; for example, tensor cores have been added. A single-issue SIMT processor can no longer meet the rising execution performance requirements of instruction processing chips, and the utilization of the arithmetic units within a single SIMT processor is low.
To address the above problems in the prior art, an embodiment of the present invention provides a multi-issue SIMT processor. FIG. 1 is a schematic structural diagram of this processor. As shown in FIG. 1, the multi-issue SIMT processor includes at least two register files, an instruction scheduling module, an operand collection module, an instruction dispatch module, and an execution module.
The instruction scheduling module is connected to the operand collection module and is configured to group multiple thread groups and determine a first instruction for each of at least two groups; the first instructions of the at least two groups correspond one-to-one to the at least two register files.
The operand collection module is connected to the at least two register files and to the instruction dispatch module, and is configured to collect, in parallel from the at least two register files, the operands of all first instructions, and to send all first instructions to the instruction dispatch module.
The instruction dispatch module is connected to the execution module and is configured to determine at most Q second instructions from the cached instructions whose operands have been collected, where Q is determined based on the number of execution units in the execution module; the cached instructions include at least all of the first instructions.
The execution module is configured to execute the at most Q second instructions in parallel.
Specifically, to achieve multi-issue capability, in this embodiment the instruction scheduling module first obtains the instructions of multiple thread groups in each cycle. The instruction scheduling module then divides the thread groups into N groups, where N is an integer greater than 1. The number of instructions in each group is T = W × A / N, where T is the number of instructions per group, W is the number of thread groups (an integer greater than 1), and A is the number of instructions contributed by a single thread group (an integer greater than or equal to 1). In other words, by grouping the thread groups, the instruction scheduling module groups their instructions. One first instruction is then determined from the T instructions of each group, so that N first instructions are determined from the N groups. The N first instructions correspond one-to-one to the N register files, and each belongs to a different thread group. Unlike the prior art, which obtains only one instruction per cycle, this secures multi-issue capability in terms of instruction count. After determining the N first instructions, the instruction scheduling module sends them to the operand collection module in parallel.
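As a worked instance of the relation T = W × A / N, the sketch below computes the per-group instruction count. The concrete values (8 thread groups, 2 instructions each, 2 groups) are hypothetical, and the requirement that W × A divide evenly by N is an assumption made here so that every group receives the same count.

```python
def instructions_per_group(num_thread_groups, instrs_per_thread_group, num_groups):
    """Compute T = W * A / N under the constraints stated in the text."""
    if num_thread_groups <= 1 or num_groups <= 1 or instrs_per_thread_group < 1:
        raise ValueError("W and N must exceed 1, and A must be at least 1")
    total = num_thread_groups * instrs_per_thread_group
    if total % num_groups != 0:
        # Assumption of this sketch: the instructions split evenly across groups.
        raise ValueError("this sketch assumes W * A is divisible by N")
    return total // num_groups

# W = 8 thread groups, A = 2 instructions per thread group, N = 2 groups.
T = instructions_per_group(num_thread_groups=8, instrs_per_thread_group=2, num_groups=2)
```

With these values each group holds T = 8 candidate instructions, and the scheduler picks one first instruction from each group, giving N = 2 first instructions per cycle.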
After receiving the N first instructions, and to ensure that they can later execute efficiently, the operand collection module must collect the operands of each of the N first instructions before they execute; that is, it must determine the data, or the sources of the data, that take part in each operation. To improve collection efficiency and secure multi-issue capability, this embodiment divides the register files into multiple groups, and each register file stores in advance the operands of its corresponding first instruction. The operand collection module can therefore collect the operands of each first instruction from that instruction's register file, and the collections for the individual first instructions proceed in parallel. Once the operands of a first instruction have been collected, the first instruction is sent to the instruction dispatch module to indicate that its operand collection is complete.
After the instruction dispatch module receives the first instructions, its cached instructions fall into one of two cases. In the first case, the instruction dispatch module holds only the N first instructions whose operands were collected in the current cycle. The instruction dispatch module then treats all N first instructions as cached instructions and may select M second instructions from them, where M is an integer less than or equal to N and N is less than or equal to Q. Q is determined by the parallel capability of the execution module, that is, by the number of execution units it contains; Q is an integer greater than or equal to 2, for example 2, 3, or 4. In the second case, the instruction dispatch module holds the N first instructions whose operands were collected in the current cycle together with historical instructions whose operands were collected in historical cycles. The historical cycles are the m cycles preceding the current cycle, where m is an integer greater than or equal to 1. A historical instruction may be an instruction that had a write conflict and/or a lower priority in a historical cycle. In this case, all first instructions together with the historical instructions constitute the cached instructions. The instruction dispatch module selects M second instructions from all cached instructions, where M is an integer less than or equal to Q, and dispatches the M second instructions to the execution module in parallel. The execution module executes the M second instructions in parallel and determines the execution result of each. The instruction scheduling module, operand collection module, instruction dispatch module, and execution module thus form at least two parallel instruction processing paths, which achieves multi-issue capability. In addition, once the execution result of each second instruction has been determined, it can be written into the register file corresponding to that second instruction for storage.
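The second case, in which unselected instructions remain cached and compete again in later cycles, can be sketched as follows. This is a deliberately simplified model: it uses a greedy pick by priority and checks only execution unit conflicts, both of which are assumptions of the sketch rather than the selection rule of the embodiment, and the instruction fields are hypothetical.

```python
def dispatch_cycle(history, new_firsts, q):
    """Pick at most q cached instructions; return (chosen, leftovers for the next cycle)."""
    cached = history + new_firsts  # historical instructions plus this cycle's first instructions
    chosen, used_units = [], set()
    # Greedy simplification: visit by descending priority, skip busy execution units.
    for instr in sorted(cached, key=lambda i: -i["priority"]):
        if len(chosen) < q and instr["unit"] not in used_units:
            chosen.append(instr)
            used_units.add(instr["unit"])
    leftovers = [i for i in cached if i not in chosen]
    return chosen, leftovers

Q = 2
# Cycle 1: two first instructions compete for the same execution unit.
cycle1_new = [
    {"name": "A", "unit": "ALU", "priority": 5},
    {"name": "B", "unit": "ALU", "priority": 3},
]
chosen1, history = dispatch_cycle([], cycle1_new, Q)

# Cycle 2: B is now a historical instruction and competes with the new arrivals.
cycle2_new = [
    {"name": "C", "unit": "LSU", "priority": 4},
    {"name": "D", "unit": "TENSOR", "priority": 1},
]
chosen2, history = dispatch_cycle(history, cycle2_new, Q)
names1 = [i["name"] for i in chosen1]
names2 = [i["name"] for i in chosen2]
```

In cycle 1 only `A` is dispatched (M = 1) because `B` needs the same unit. In cycle 2 the carried-over `B` is dispatched alongside `C` (M = 2), and `D` waits for a later cycle.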
It should be noted that the number of second instructions the instruction dispatch module determines in each cycle is dynamic. For example, with Q = 4, one second instruction may be determined in the first cycle, three in the second cycle, and two in the third cycle, so the average number of second instructions over the three cycles is 2. In this embodiment, the SIMT processor achieves multi-issue in terms of the average number of instructions executed per cycle; it does not necessarily issue multiple instructions in every cycle.
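The arithmetic in this example can be checked directly. The variable names below are illustrative only.

```python
Q = 4                         # upper bound on second instructions per cycle
issued_per_cycle = [1, 3, 2]  # second instructions determined in cycles 1, 2, and 3

# Every cycle must respect the bound M <= Q.
within_bound = all(0 <= n <= Q for n in issued_per_cycle)

# Average issue count over the three cycles: (1 + 3 + 2) / 3.
average_issue = sum(issued_per_cycle) / len(issued_per_cycle)
```

The average is 2 instructions per cycle, which exceeds the single-issue rate of 1 even though the first cycle dispatches only one instruction.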
Optionally, each register file includes at least four single-port SRAMs (Static Random Access Memory); the number of single-port SRAMs in each register file is greater than or equal to 4 and is a power of 2. All operands of a given first instruction are stored in the register file corresponding to that first instruction, and the operands of different first instructions are stored in different register files. For example, as shown in FIG. 1, the two operands of first instruction 0 are both stored in register file 0, in SRAM B0 and SRAM B3 of register file 0 respectively. As another example, the three operands of first instruction 1 are all stored in register file 1, in SRAM B0, SRAM B1, and SRAM B2 of register file 1 respectively.
It should be noted that N can be determined according to the multi-issue capability supported by the SIMT processor. For example, N can be 2 if the SIMT processor supports dual issue, or 4 if it supports quad issue; the embodiment of the present invention does not limit this.
It should be noted that an operand denotes either the source of the data to be operated on or the data itself. For example, in the instruction ADD a, b, both a and b are operands and are themselves the data to be added. As another example, in the instruction ADD ax, bx, ADD is the addition operator and ax and bx are operands that are data sources: ax denotes the ax register, bx denotes the bx register, and the two registers hold the two values that take part in the addition.
In the multi-issue SIMT processor provided by the embodiment of the present invention, the instruction scheduling module groups multiple thread groups and determines the first instruction corresponding to each of at least two groups. The operand collection module collects, in parallel from at least two register files, the operands of all first instructions corresponding to those register files, and sends all first instructions whose operands have been collected to the instruction distribution module. The instruction distribution module determines at most Q second instructions from the cached instructions, which include all first instructions, so that the execution module executes at most Q second instructions in parallel. The present invention secures the multi-issue capability in three respects: the number of instructions, the grouped storage of operands across the register files, and the parallel execution formed by the instruction scheduling module, the operand collection module, the instruction distribution module, and the execution module. This improves instruction execution efficiency.
Further, the instruction scheduling module groups the multiple thread groups and determines the first instruction corresponding to each of the at least two groups through the following steps.
The multiple thread groups are grouped to obtain at least two groups and the first instructions to be tested corresponding to each of the at least two groups; the first instructions to be tested of the at least two groups correspond one-to-one to the at least two register files.
Based on the register occupancy table and the operand occupancy table corresponding to each of the at least two groups, the executable instruction group of each group is determined in parallel from the first instructions to be tested of that group.
The instruction with the highest priority in each executable instruction group is determined as the first instruction of the corresponding group.
Specifically, FIG. 2 is a schematic structural diagram of the instruction scheduling module provided by an embodiment of the present invention. As shown in FIG. 2, the instruction scheduling module is divided into N instruction scheduling units, numbered n = 0, 1, 2, ..., N-1. After the instruction scheduling module obtains the instructions of W thread groups, the N instruction scheduling units divide the W thread groups into N groups; each instruction scheduling unit processes the first instructions to be tested of one group, and each group includes T first instructions to be tested. After grouping, each instruction scheduling unit first checks whether each of its first instructions to be tested can proceed to operand collection. To do so, it obtains its register occupancy table, which tracks the registers currently being written, and its operand occupancy table, which tracks the occupancy of the operand cache. Then, for each first instruction to be tested, the unit uses the instruction information of that instruction, the register occupancy table, and the operand occupancy table to judge whether the instruction is ready for operand collection. If it is ready, the instruction is determined to be an executable instruction. All executable instructions constitute the executable instruction group of the instruction scheduling unit, and the number of executable instructions in the group is less than or equal to the number of first instructions to be tested that the unit obtained from grouping. The executable instructions in the group are then sorted by priority from high to low, and the executable instruction with the highest priority is determined as the first instruction of the instruction scheduling unit. Once all N instruction scheduling units have performed these operations, N first instructions are obtained.
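The per-unit selection described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Instruction` type, the `is_ready` predicate, and the convention that a larger `priority` value means a higher priority are hypothetical, and taking the maximum is used as an equivalent of sorting and picking the head. Returning `None` when no instruction is ready is also an assumption; the text does not describe that case.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Instruction:
    ident: int
    priority: int  # assumption: a larger value means a higher priority


def pick_first_instruction(
    candidates: Iterable[Instruction],
    is_ready: Callable[[Instruction], bool],
) -> Optional[Instruction]:
    """Return the highest-priority ready instruction of one scheduling unit."""
    executable_group = [i for i in candidates if is_ready(i)]
    if not executable_group:
        return None  # assumption: the unit contributes nothing this cycle
    return max(executable_group, key=lambda i: i.priority)


# Two scheduling units (N = 2), each holding its own instructions to be tested.
unit_candidates = [
    [Instruction(0, 3), Instruction(2, 7), Instruction(4, 5)],
    [Instruction(1, 6), Instruction(3, 1)],
]
blocked = {2}  # hypothetical: instruction 2 is not ready for operand collection
first_instructions = [
    pick_first_instruction(group, lambda i: i.ident not in blocked)
    for group in unit_candidates
]
print([i.ident for i in first_instructions])  # [4, 1]
```

Instruction 2 has the highest priority in unit 0 but is not ready, so the unit falls back to instruction 4.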
Optionally, the N instruction scheduling units may group the instructions of the W thread groups in an interleaved manner to obtain N groups of first instructions to be tested. The first instructions to be tested assigned to instruction scheduling unit n are numbered n, n+N, n+2N, ..., n+(T-1)N. For example, when N=2, instruction scheduling unit 0 obtains the even-numbered first instructions to be tested and instruction scheduling unit 1 obtains the odd-numbered ones.
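The interleaved assignment can be illustrated with a short Python sketch. The function name and the choice of eight instructions are assumptions made for the illustration.

```python
def interleave(num_instructions: int, num_units: int) -> list[list[int]]:
    """Unit n receives instruction numbers n, n+N, n+2N, ..."""
    return [
        list(range(n, num_instructions, num_units))
        for n in range(num_units)
    ]


groups = interleave(num_instructions=8, num_units=2)
print(groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With N=2 and T=4, unit 0 receives the even numbers and unit 1 the odd numbers, matching the example in the text.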
Further, determining in parallel the executable instruction group of each of the at least two groups from the first instructions to be tested of that group, based on the register occupancy table and the operand occupancy table corresponding to each of the at least two groups, includes the following steps.
Based on the register occupancy table corresponding to each of the at least two groups, it is judged whether each first instruction to be tested of the group has a register read/write dependency, giving a first judgment result for that first instruction to be tested.
Based on the operand occupancy table corresponding to each of the at least two groups, it is judged whether a free position remains in the operand cache unit range corresponding to each first instruction to be tested of the group, giving a second judgment result for that first instruction to be tested.
Based on the first judgment result and the second judgment result of each first instruction to be tested, the executable instruction group of each of the at least two groups is determined in parallel.
Specifically, after the register occupancy table and the operand occupancy table of a group are obtained, the following is done for each first instruction to be tested in the group. The registers to be read or written are determined from the instruction encoding of the first instruction to be tested, and the read/write status of each such register is looked up in the register occupancy table. From this status it is judged whether the first instruction to be tested has a register read/write dependency, that is, whether the register is in the occupied state. The occupied state means that the register is in use by another instruction that has not yet completed. If the register is occupied, the first judgment result is that a register read/write dependency exists; if it is idle, the first judgment result is that no register read/write dependency exists. At the same time, the operand cache unit range corresponding to the first instruction to be tested is obtained, and the occupancy of the cache units within that range is looked up in the operand occupancy table. If every cache unit in the range is occupied and no free position remains, the second judgment result is that an operand cache conflict exists. If not every cache unit in the range is occupied and a free position remains, the second judgment result is that no operand cache conflict exists. If the first judgment result shows no register read/write dependency and the second judgment result shows no operand cache conflict, the first instruction to be tested is determined to be an executable instruction, and all executable instructions constitute the executable instruction group of the instruction scheduling unit. In all other cases the first instruction to be tested is not selected, which indicates that it is not ready for the subsequent operand cache collection operation.
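The two judgments can be sketched in Python as follows. This is a hedged illustration: modelling the register occupancy table as a set of busy register numbers and the operand occupancy table as a set of occupied cache unit indices is an assumption, as are the `Candidate` type and all function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    ident: int
    registers: frozenset[int]  # registers the instruction reads or writes
    cache_range: range         # operand cache units the instruction may use


def has_register_dependency(instr: Candidate, busy_registers: set[int]) -> bool:
    """First judgment: any needed register is held by an unfinished instruction."""
    return any(r in busy_registers for r in instr.registers)


def has_operand_cache_conflict(instr: Candidate, occupied_units: set[int]) -> bool:
    """Second judgment: every unit in the instruction's range is occupied."""
    return all(u in occupied_units for u in instr.cache_range)


def executable_group(candidates, busy_registers, occupied_units):
    """Keep only the instructions that pass both judgments."""
    return [
        c for c in candidates
        if not has_register_dependency(c, busy_registers)
        and not has_operand_cache_conflict(c, occupied_units)
    ]


busy = {3}          # register 3 is still being written
occupied = {0, 1}   # cache units 0 and 1 are in use
candidates = [
    Candidate(0, frozenset({1, 2}), range(0, 2)),  # no free cache unit
    Candidate(1, frozenset({3, 4}), range(2, 4)),  # depends on register 3
    Candidate(2, frozenset({5}), range(2, 4)),     # passes both judgments
]
ready = executable_group(candidates, busy, occupied)
print([c.ident for c in ready])  # [2]
```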
It should be noted that the operand cache unit range includes at least one cache unit, and the number of cache units in the range is less than the number of cache units in any operand cache of the operand collection module. If the range includes only one cache unit, judging whether a free position remains in the range amounts to judging whether that cache unit is occupied. If the range includes at least two cache units, judging whether a free position remains in the range amounts to judging whether any of those cache units is unoccupied.
Further, the operand collection module includes at least two operand collection units and at least two operand caches.
The at least two operand collection units are connected to a read-write controller. The at least two operand collection units, the at least two operand caches, and the at least two register files correspond to one another. The at least two operand collection units are configured to determine, in parallel, the read request corresponding to each of the first instructions and to send each read request to the read-write controller.
Each read request instructs the read-write controller to determine, in parallel from the at least two register files, the operands of all the first instructions and to write the operands of each first instruction into the corresponding operand cache.
Specifically, after the instruction scheduling module determines the N first instructions, each of the N operand collection units in the operand collection module receives one first instruction. Each operand collection unit generates a read request from the first instruction it received and sends the read request to the read-write controller. According to the read request, the read-write controller reads the at least one operand of the first instruction from the register file corresponding to the operand collection unit and writes the at least one operand into the operand cache corresponding to the operand collection unit, which prepares the data needed to execute the first instruction.
Further, FIG. 3 is a connection diagram of the operand cache collection module provided by an embodiment of the present invention. As shown in FIG. 3, the at least two operand caches are also connected to an operand interconnect circuit and to at least two read interconnect circuits; the operand interconnect circuit is connected to the execution module, and the at least two read interconnect circuits are connected to the at least two register files in correspondence.
The read-write controller determines, in parallel from the at least two register files, the operands of all the first instructions and writes the operands of each first instruction into the corresponding operand cache through the following steps.
For each read request, the read-write controller is configured to: determine the operand address in the read request corresponding to the first instruction, where the operand address includes a read address for the register file, a selection index for the read interconnect circuit, and a distribution selection address for the operand cache; read the operand from the corresponding register file based on the read address; determine the cache unit for the operand in the operand cache based on the selection index and the distribution selection address; and write the operand into that cache unit.
Specifically, after the read-write controller receives the read request of an operand collection unit, it generates the operand address of the first instruction from the read request. The operand address includes a read address for the register file, a selection index for the read interconnect circuit, and a distribution selection address for the operand cache. The read address is the storage address, in the corresponding register file, of the at least one operand of the first instruction. The selection index is used to determine the column of the corresponding operand cache in which the at least one operand is stored. The distribution selection address is used to determine the row of the corresponding operand cache in which the at least one operand is stored. The read-write controller then sends the read address to the corresponding register file and reads the at least one operand of the first instruction from it. At the same time, it sends the selection index to the read interconnect circuit connected to that register file; by controlling the output of the data selector in the read interconnect circuit, this determines the column of the operand cache that stores the at least one operand. In addition, it sends the distribution selection address to the corresponding operand cache to determine the row of the at least one operand in the operand cache. The at least one operand is then stored in the cache unit identified by that row and column.
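The three-part operand address can be sketched as a small data structure. The sketch is a simplification under stated assumptions: the register file is modelled as a flat mapping from read address to value, the selection index is used directly as the column number, and the type and function names are hypothetical rather than taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperandAddress:
    read_address: int         # where the operand is stored in the register file
    select_index: int         # sent to the read interconnect; fixes the column
    distribution_select: int  # sent to the operand cache; fixes the row


def collect_operand(register_file, operand_cache, addr: OperandAddress):
    """Read one operand and store it in the cache unit given by row and column."""
    value = register_file[addr.read_address]
    operand_cache[addr.distribution_select][addr.select_index] = value
    return value


register_file = {0x10: 42, 0x14: 7}                   # hypothetical contents
operand_cache = [[None] * 3 for _ in range(4)]        # 4 rows, 3 columns
collect_operand(register_file, operand_cache, OperandAddress(0x10, 0, 1))
collect_operand(register_file, operand_cache, OperandAddress(0x14, 1, 1))
print(operand_cache[1])  # [42, 7, None]
```

Both operands of the same first instruction land in the same row and in different columns.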
It should be noted that the read interconnect circuit includes multiple data selectors, and the number of data selectors equals the number of columns in the operand cache. The data selectors form multiple interconnect paths between the register file and the operand cache. Under the control of the selection index, each data selector selects one of the at least four single-port SRAMs of the connected register file and outputs its data to the operand cache. The number of columns in the operand cache limits both the maximum number of operands read from the register file and the number of data selectors in the read interconnect circuit. The number of rows in the operand cache equals the number of single-port SRAMs in the register file. Each row of the operand cache can store the at least one operand of one first instruction, so the number of rows limits the number of first instructions whose operands the operand cache can hold.
For example, FIG. 4 is a connection diagram of the read interconnect circuit provided by an embodiment of the present invention. As shown in FIG. 4, take the case in which the register file includes 4 single-port SRAMs, the cache units of the operand cache are arranged as 4×3, the read interconnect circuit is a 4×3 interconnect circuit, and the first instruction has two operands. The read interconnect circuit includes three data selectors, which means that at most three operands can be read from the register file of 4 single-port SRAMs. The operand cache can store the operands of at most four first instructions, with each row storing the operands of one first instruction. To read the first operand X, the read-write controller sends a read address to register file 0, and operand X is read from SRAM B0 using that read address. At the same time, the read-write controller sends a selection index to data selector 0 in the read interconnect circuit, which determines that operand X is to be written to the first column of operand cache 0. In addition, the read-write controller sends a distribution selection address to operand cache 0, which determines that operand X is to be written to the second row of operand cache 0. Operand X is therefore written to the cache unit in the second row and first column of operand cache 0. The second operand Y is read and written in the same way, except that the selection index is sent to data selector 1 in read interconnect circuit 0, so operand Y is written to a different cache unit of operand cache 0 than operand X. The reading and writing of operand Y is not described further here.
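The example can be traced with a small Python model of the data selectors. The placement of operand X (SRAM B0, data selector 0, second row, first column) follows the text. The bank that holds operand Y is not stated, so B2 is an assumption here, as are the function names and the choice to model the two reads as separate steps.

```python
NUM_BANKS = 4    # single-port SRAMs B0..B3 of register file 0
NUM_COLUMNS = 3  # data selectors, one per operand cache column


def data_selector(bank_outputs, select_index):
    """Forward the output of the bank chosen by the selection index."""
    return bank_outputs[select_index]


def collect(cache, row, column, bank_outputs, select_index):
    """Data selector `column` writes its output into cache[row][column]."""
    cache[row][column] = data_selector(bank_outputs, select_index)


operand_cache_0 = [[None] * NUM_COLUMNS for _ in range(NUM_BANKS)]
row = 1  # second row, chosen by the distribution selection address

# Operand X: read from SRAM B0, routed by data selector 0 to the first column.
collect(operand_cache_0, row, 0, ["X", None, None, None], select_index=0)

# Operand Y: routed by data selector 1; bank B2 is an assumption.
collect(operand_cache_0, row, 1, [None, None, "Y", None], select_index=2)

print(operand_cache_0[row])  # ['X', 'Y', None]
```

The third column stays empty because the first instruction in this example has only two operands.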
Further, the instruction distribution module determines at most Q second instructions from the cached instructions whose operands have been collected through the following steps.
All the first instructions whose operands have been collected are written in parallel into their respective to-be-executed instruction tables; the to-be-executed instruction tables correspond one-to-one to the first instructions.
From all cached instructions in all to-be-executed instruction tables, at most Q second instructions are determined that have no execution conflict and the best priority combination. An execution conflict is either an execution unit conflict or a write conflict. An execution unit conflict means that two cached instructions correspond to the same execution unit. A write conflict means that the execution results of two cached instructions are written to the same register file at the same time. The execution units belong to the execution module.
Specifically, FIG. 5 is a schematic structural diagram of the instruction distribution module provided by an embodiment of the present invention. As shown in FIG. 5, after receiving the N first instructions whose operands have been collected, sent by the operand cache units, the instruction distribution module writes the N first instructions into N to-be-executed instruction tables, one first instruction per table. Once written into a to-be-executed instruction table, a first instruction becomes a cached instruction. The instruction selection logic unit in the instruction distribution module then selects, from all cached instructions in the N to-be-executed instruction tables, M second instructions that have the best priority combination and no execution conflict. An execution unit conflict means that two of the cached instructions correspond to the same execution unit; the two instructions cannot use the same execution unit of the execution module at the same time. A write conflict means that the execution results of two of the cached instructions would be written to the same one of the N register files at the same moment. Because the at least four single-port SRAMs of a register file allow only one single-port SRAM to perform one read or write operation at a given moment, two cached instructions cannot write to the same register file simultaneously. If two cached instructions have neither an execution unit conflict nor a write conflict, they have no execution conflict. If two cached instructions have at least one of an execution unit conflict and a write conflict, they have an execution conflict.
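The two conflict tests can be written as simple predicates. This is a sketch under assumptions: the `CachedInstruction` fields, including an explicit write-back cycle, are hypothetical ways of representing the information the text refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedInstruction:
    ident: int
    exec_unit: str    # execution unit the instruction needs, e.g. "int"
    dest_file: int    # register file that receives the result
    write_cycle: int  # cycle in which the result is written back


def execution_unit_conflict(a: CachedInstruction, b: CachedInstruction) -> bool:
    """Both instructions need the same execution unit."""
    return a.exec_unit == b.exec_unit


def write_conflict(a: CachedInstruction, b: CachedInstruction) -> bool:
    """Both results target the same register file in the same cycle."""
    return a.dest_file == b.dest_file and a.write_cycle == b.write_cycle


def execution_conflict(a: CachedInstruction, b: CachedInstruction) -> bool:
    """At least one of the two conflict kinds is present."""
    return execution_unit_conflict(a, b) or write_conflict(a, b)


i0 = CachedInstruction(0, "int", dest_file=0, write_cycle=5)
i1 = CachedInstruction(1, "fp", dest_file=0, write_cycle=5)   # write conflict with i0
i2 = CachedInstruction(2, "fp", dest_file=1, write_cycle=5)   # unit conflict with i1
print(execution_conflict(i0, i1), execution_conflict(i0, i2))  # True False
```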
For example, at most Q second instructions with no execution conflict and the best priority combination can be determined in either of two ways. In the first method, all cached instructions in the N to-be-executed instruction tables are sorted by priority from high to low and traversed in that order. It is first judged whether the second cached instruction has an execution conflict with the first cached instruction. If it does, traversal continues with the third cached instruction, and it is judged whether the first and third cached instructions have an execution conflict. If it does not, the first and second cached instructions form a cached instruction group, traversal continues with the third cached instruction, and it is judged whether the third cached instruction has an execution conflict with either cached instruction in the group. If there is no execution conflict, the third cached instruction is added to the cached instruction group; if there is, the group is left unchanged and traversal continues with the fourth cached instruction. These operations are repeated until the cached instruction group contains M cached instructions, and the M cached instructions in the group are determined as the M second instructions. In the second method, the cached instructions in the to-be-executed instruction tables are traversed in the numbering order of the N tables, and multiple target cached instructions with no execution conflict among them are determined from the N tables; execution conflicts are judged in the same way as in the first method, which is not repeated here. The target cached instructions are then sorted by priority from high to low, and the top M target cached instructions are determined as the M second instructions. The embodiment of the present invention does not limit the method for determining the M second instructions.
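The first method is a greedy pass over the priority-sorted cached instructions, which can be sketched as follows. The `Cached` type, the conflict representation, and the convention that a larger `priority` value means a higher priority are assumptions; the cap `q` plays the role of Q, and the length of the returned list is M.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cached:
    name: str
    priority: int     # assumption: a larger value means a higher priority
    exec_unit: str
    dest_file: int
    write_cycle: int


def conflicts(a: Cached, b: Cached) -> bool:
    """Execution unit conflict or write conflict between two instructions."""
    same_unit = a.exec_unit == b.exec_unit
    same_write = a.dest_file == b.dest_file and a.write_cycle == b.write_cycle
    return same_unit or same_write


def select_second_instructions(cached, q):
    """Greedy selection in priority order, skipping conflicting instructions."""
    selected = []
    for instr in sorted(cached, key=lambda i: i.priority, reverse=True):
        if len(selected) == q:
            break
        if not any(conflicts(instr, s) for s in selected):
            selected.append(instr)
    return selected


cached = [
    Cached("A", 9, "int", 0, 5),
    Cached("B", 8, "int", 1, 5),  # same execution unit as A
    Cached("C", 7, "fp", 0, 5),   # same register file and cycle as A
    Cached("D", 6, "fp", 1, 6),   # compatible with A
    Cached("E", 5, "mem", 1, 6),  # same register file and cycle as D
]
second = select_second_instructions(cached, q=4)
print([i.name for i in second])  # ['A', 'D']
```

Here Q is 4 but only two instructions can issue together, so M is 2 for this cycle; B, C, and E remain cached for later cycles.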
It should be noted that after the M second instructions are determined from all cached instructions, they must be deleted from the N to-be-executed instruction tables so that they are not executed again in subsequent cycles.
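The deletion step can be sketched in a few lines. Representing each to-be-executed instruction table as a list of instruction identifiers is an assumption made for this illustration.

```python
def remove_issued(pending_tables, issued_ids):
    """Delete issued instructions from every to-be-executed instruction table."""
    for table in pending_tables:
        # Slice assignment edits each table in place.
        table[:] = [ident for ident in table if ident not in issued_ids]


# N = 3 tables; instructions 0 and 2 were selected as second instructions.
pending_tables = [[0, 4], [1], [2, 6]]
remove_issued(pending_tables, issued_ids={0, 2})
print(pending_tables)  # [[4], [1], [6]]
```

The instructions that remain are the ones carried into later cycles as historical instructions.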
In addition, the instruction distribution module is also connected to the instruction scheduling module. After determining the M second instructions and sending them to the corresponding execution units among the Q execution units of the execution module, the instruction distribution module can feed instruction distribution information back to the instruction scheduling module.
进一步地,所述执行模块包括至少两个执行单元,所述至少两个执行单元与所述所有第二指令存在对应关系,且所述至少两个执行单元均连接写互联电路和所述操作数互联电路,所述写互联电路连接所述至少两组寄存器堆。Furthermore, the execution module includes at least two execution units, the at least two execution units correspond to all the second instructions, and the at least two execution units are both connected to a write interconnection circuit and the operand interconnection circuit, and the write interconnection circuit connects the at least two groups of register files.
针对各所述执行单元,接收到第二指令的所述执行单元用于基于所述第二指令,确定对应的目标操作数缓存;基于所述操作数互联电路从所述目标操作数缓存中确定目标操作数;基于所述目标操作数执行所述第二指令,确定执行结果,并将所述执行结果通过所述写互联电路写入所述第二指令对应的寄存器堆。For each of the execution units, the execution unit that receives the second instruction is used to determine the corresponding target operand cache based on the second instruction; determine the target operand from the target operand cache based on the operand interconnection circuit; execute the second instruction based on the target operand, determine the execution result, and write the execution result into the register stack corresponding to the second instruction through the write interconnection circuit.
Specifically, the execution module includes Q execution units, which can be understood as the operation units of the SIMT processor. After the instruction distribution module determines the M second instructions, it sends them to M execution units respectively. When M = Q, each execution unit receives one second instruction; when M is less than Q, some execution units receive no second instruction. The execution units that received a second instruction then execute their respective second instructions in parallel. Each such execution unit performs the following steps.
FIG. 6 is a connection diagram of the execution module provided by an embodiment of the present invention. As shown in FIG. 6, the execution unit first uses the second instruction to determine the target operand cache that stores the target operands corresponding to that instruction. It then reads the target operands from the target operand cache through the operand interconnection circuit; there is at least one target operand. After obtaining the target operands, the execution unit can execute the second instruction on them, determine the execution result, and write the result through the write interconnection circuit into the register file corresponding to the second instruction.
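The per-unit sequence described above (locate the target operand cache, read the operands, execute, write back) can be sketched as follows. The sketch models both interconnection circuits as plain dictionary lookups, and the instruction fields `id`, `cache_id`, `rf_id`, `dest`, and `op` are hypothetical names chosen for illustration; the patent does not specify an instruction encoding.

```python
def run_execution_unit(instr, operand_caches, register_files):
    """Execute one second instruction on one execution unit (software model)."""
    # Determine the target operand cache from the instruction itself.
    cache = operand_caches[instr["cache_id"]]
    # Read the target operands (at least one); stands in for the operand interconnect.
    operands = cache[instr["id"]]
    # Execute the instruction on its operands.
    result = instr["op"](*operands)
    # Write the result to the instruction's register file; stands in for the write interconnect.
    register_files[instr["rf_id"]][instr["dest"]] = result
    return result
```

In hardware, every execution unit that received a second instruction would run this sequence concurrently; the function shows a single unit only.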
Optionally, the execution units include, but are not limited to: an integer operation unit, a floating-point operation unit, a memory access unit, and a special operation unit.
An embodiment of the present invention further provides an instruction processing method applied to the multi-issue SIMT processor described in any of the above. FIG. 7 is a flow chart of the instruction processing method provided by an embodiment of the present invention. As shown in FIG. 7, the method includes steps 710 to 740.
Step 710: Use the instruction scheduling module to group multiple thread groups and determine the first instruction corresponding to each of at least two groups; the first instructions corresponding to the at least two groups correspond one-to-one to the at least two groups of register files.
Step 720: Use the operand collection module to collect, in parallel, the operands corresponding to each of the first instructions from the at least two groups of register files, and send all the first instructions to the instruction distribution module.
Step 730: Use the instruction distribution module to determine at most Q second instructions from the cache instructions whose operands have been collected, where Q is determined based on the number of execution units in the execution module; the cache instructions include at least all the first instructions.
Step 740: Use the execution module to execute the at most Q second instructions in parallel.
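Steps 710 to 740 can be illustrated with a small software model of one cycle. This is only a sketch under stated assumptions: thread groups are assigned to groups round-robin, each group contributes the instruction of its first thread group, step 730 simply takes the first Q collected instructions (ignoring priorities, conflicts, and instructions left over from earlier cycles), and the names `process_cycle`, `instr`, `srcs`, and `op` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_cycle(thread_groups, register_files, q):
    n = len(register_files)
    # Step 710: split the thread groups into n groups, one per register file group
    # (round-robin is an assumed policy), and take one first instruction per group.
    groups = [thread_groups[i::n] for i in range(n)]
    first = [(g, grp[0]["instr"]) for g, grp in enumerate(groups) if grp]
    # Step 720: collect each first instruction's operands from its own register file.
    # Modeled sequentially here; the hardware collects from all files in parallel.
    collected = [(instr, [register_files[g][r] for r in instr["srcs"]])
                 for g, instr in first]
    # Step 730: determine at most q second instructions (simplified: the first q).
    second = collected[:q]
    # Step 740: execute the selected instructions in parallel, one worker per unit.
    with ThreadPoolExecutor(max_workers=max(1, len(second))) as pool:
        return list(pool.map(lambda pair: pair[0]["op"](*pair[1]), second))
```

Because each group reads from its own register file, the operand reads in step 720 do not contend with one another, which is what allows more than one instruction to be issued per cycle.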
In the instruction processing method provided by the embodiment of the present invention, the instruction scheduling module groups multiple thread groups and determines the first instruction corresponding to each of at least two groups. The operand collection module collects, in parallel from the at least two groups of register files, the operands of all the first instructions corresponding to those register files, and sends all the first instructions whose operands have been collected to the instruction distribution module. The instruction distribution module determines at most Q second instructions from the cache instructions, which include all the first instructions, so that the execution module executes the at most Q second instructions in parallel. The present invention thus secures multi-issue capability in three respects: the number of instructions, the grouped storage of operands by register file, and the parallel execution formed by the instruction scheduling module, the operand collection module, the instruction distribution module, and the execution module, thereby improving instruction execution efficiency.
An embodiment of the present invention further provides a chip comprising at least two multi-issue SIMT processors as described in any of the above. The multi-issue SIMT processors are interconnected.
Optionally, the chip may include a GPU or an AI processor, which is not limited in the embodiments of the present invention.
In another aspect, the present invention further provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by the multi-issue SIMT processors, the computer is able to perform the instruction processing method provided above, which includes the following steps.
Use the instruction scheduling module to group multiple thread groups and determine the first instruction corresponding to each of at least two groups; the first instructions corresponding to the at least two groups correspond one-to-one to the at least two groups of register files.
Use the operand collection module to collect, in parallel, the operands corresponding to each of the first instructions from the at least two groups of register files, and send all the first instructions to the instruction distribution module.
Use the instruction distribution module to determine at most Q second instructions from the cache instructions whose operands have been collected, where Q is determined based on the number of execution units in the execution module; the cache instructions include at least all the first instructions.
Use the execution module to execute the at most Q second instructions in parallel.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program. When the computer program is executed by the multi-issue SIMT processors, it implements the instruction processing method provided above, which includes the following steps.
Use the instruction scheduling module to group multiple thread groups and determine the first instruction corresponding to each of at least two groups; the first instructions corresponding to the at least two groups correspond one-to-one to the at least two groups of register files.
Use the operand collection module to collect, in parallel, the operands corresponding to each of the first instructions from the at least two groups of register files, and send all the first instructions to the instruction distribution module.
Use the instruction distribution module to determine at most Q second instructions from the cache instructions whose operands have been collected, where Q is determined based on the number of execution units in the execution module; the cache instructions include at least all the first instructions.
Use the execution module to execute the at most Q second instructions in parallel.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the essence of the above technical solution, or the part that contributes to the prior art, can be embodied as a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410783914.3A CN118349282B (en) | 2024-06-18 | 2024-06-18 | SIMT processor supporting multiple transmissions, instruction processing method and chip |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410783914.3A CN118349282B (en) | 2024-06-18 | 2024-06-18 | SIMT processor supporting multiple transmissions, instruction processing method and chip |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118349282A (en) | 2024-07-16 |
CN118349282B (en) | 2024-10-29 |
Family
ID=91814228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410783914.3A Active CN118349282B (en) | 2024-06-18 | 2024-06-18 | SIMT processor supporting multiple transmissions, instruction processing method and chip |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118349282B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117707625A (en) * | 2024-02-05 | 2024-03-15 | 上海登临科技有限公司 | Computing unit, method and corresponding graphics processor supporting instruction multiple |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8639882B2 (en) * | 2011-12-14 | 2014-01-28 | Nvidia Corporation | Methods and apparatus for source operand collector caching |
US11163578B2 (en) * | 2018-02-23 | 2021-11-02 | Intel Corporation | Systems and methods for reducing register bank conflicts based on a software hint bit causing a hardware thread switch |
US11625244B2 (en) * | 2021-06-22 | 2023-04-11 | Intel Corporation | Native support for execution of get exponent, get mantissa, and scale instructions within a graphics processing unit via reuse of fused multiply-add execution unit hardware logic |
CN114461274A (en) * | 2022-01-30 | 2022-05-10 | 上海阵量智能科技有限公司 | Instruction processing apparatus, method, chip, computer device, and storage medium |
CN118193061A (en) * | 2022-12-13 | 2024-06-14 | 沐曦集成电路(上海)有限公司 | Instruction double-transmitting system |
CN117009054B (en) * | 2023-07-27 | 2024-06-28 | 北京登临科技有限公司 | SIMT device, thread group dynamic construction method and processor |
Also Published As
Publication number | Publication date |
---|---|
CN118349282A (en) | 2024-07-16 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US7418576B1 (en) | Prioritized issuing of operation dedicated execution unit tagged instructions from multiple different type threads performing different set of operations | |
US9747132B2 (en) | Multi-core processor using former-stage pipeline portions and latter-stage pipeline portions assigned based on decode results in former-stage pipeline portions | |
CN117472448B (en) | A Sunway many-core processor from core cluster acceleration parallel method, equipment and medium | |
CN111813805A (en) | A data processing method and device | |
CN105468439B (en) | The self-adaptive parallel method of neighbours in radii fixus is traversed under CPU-GPU isomery frame | |
CN114489475B (en) | Distributed storage system and data storage method thereof | |
KR20120017454A (en) | Scheduling Threads by Batch Scheduling | |
CN111078394B (en) | A GPU thread load balancing method and device | |
CN118916178A (en) | Thread bundle scheduling method, device and medium for GPU | |
Song et al. | Rethinking graph data placement for graph neural network training on multiple GPUs | |
CN117873730A (en) | Efficient embedded vector access method and system suitable for multi-GPU environment | |
CN119487486A (en) | Memory controller with command reordering | |
CN114489806B (en) | Processor device, multidimensional data reading and writing method, and computing equipment | |
CN118819641A (en) | A GPGPU instruction execution efficiency optimization method based on lookup table | |
JP7617907B2 (en) | The processor and its internal interrupt controller | |
CN108733585B (en) | Cache system and related method | |
CN118349282B (en) | SIMT processor supporting multiple transmissions, instruction processing method and chip | |
CN116483536B (en) | Data scheduling method, computing chip and electronic equipment | |
US20240169019A1 (en) | PERFORMANCE IN SPARSE MATRIX VECTOR (SpMV) MULTIPLICATION USING ROW SIMILARITY | |
US6732196B2 (en) | Allowing slots belonging to a second slot category to receive I/O access requests belonging to a first and a second access request categories in a round robin fashion | |
CN116841751A (en) | Policy configuration method, device and storage medium for multi-task thread pool | |
CN117215806A (en) | Method for improving cache replacement efficiency | |
CN118733206A (en) | Task scheduling method, device and related products based on multi-core system | |
KR20230059536A (en) | Method and apparatus for process scheduling | |
US20220114093A1 (en) | Balancing Memory-Portion Accesses |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||