Disclosure of Invention
Embodiments of the present disclosure provide at least a command distribution method, a command distributor, a chip, and an electronic device.
In a first aspect, an embodiment of the present disclosure provides a command distribution method, including: determining, from a plurality of register groups, a plurality of first target register groups corresponding to a current processing cycle, where the first target register groups are different from second target register groups determined in at least one most recent historical processing cycle;
determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle; and distributing the commands respectively corresponding to the determined target threads to the corresponding operation units.
In a possible implementation, determining, from the plurality of register groups, the plurality of first target register groups corresponding to the current processing cycle includes: when the current processing cycle is an odd-numbered cycle, determining the odd-numbered register groups among the plurality of register groups as the first target register groups; and when the current processing cycle is an even-numbered cycle, determining the even-numbered register groups among the plurality of register groups as the first target register groups.
In a possible implementation, the method further includes: determining the number of register groups according to the operand count of the operation unit that requires the most operands, and dividing the registers into the plurality of register groups.
In a possible implementation, determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle includes: determining the target threads based on command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.
In a possible implementation, determining the target threads based on the command execution state information includes: determining, from the thread groups respectively corresponding to the plurality of first target register groups, a plurality of candidate threads whose command execution state information indicates a ready state; and determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle.
In a possible implementation, determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle includes: determining the target threads from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads.
In a possible implementation, determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle includes: determining the target threads from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads and the occupancy states of the operation units corresponding to the commands to be distributed.
In a possible implementation, the method further includes: in response to the commands to be distributed of the target threads determined for the current processing cycle including a multi-operand command, that is, a command with more than one operand, the first target register group corresponding to the multi-operand command supplies one operand to the operation unit corresponding to that command in each processing cycle from the current processing cycle to a target processing cycle, where the number of cycles between the target processing cycle and the current processing cycle equals the number of operands of the multi-operand command minus one.
In a possible implementation, the method further includes: in response to the commands to be distributed of the target threads determined for the current processing cycle including a multi-operand command with two operands, for each single-operand command among those commands, determining, in the processing cycle following the current processing cycle, another single-operand command in a ready state for the first target register group to which that single-operand command belongs.
In a possible implementation, the method further includes: in response to the commands to be distributed of the target threads determined for the current processing cycle including a multi-operand command with more than one operand, determining the operand count of the multi-operand command with the most operands; and determining, from the thread group corresponding to a first target register group, a command to be distributed in a ready state for that first target register group, where the operand count of the ready-state command is not more than the number of processing cycles from the cycle following the current processing cycle to the cycle in which the first target register group is scheduled again.
In a possible implementation manner, the method further comprises the steps of obtaining feedback information generated by the operation unit after executing the command, and generating command execution state information corresponding to a thread to which the executed command belongs based on the feedback information.
In a possible implementation manner, the method further comprises grouping the threads currently being executed based on the number of the register groups and the number of the threads currently being executed to obtain thread groups corresponding to each register group respectively.
In a second aspect, an embodiment of the present disclosure provides a command distributor including a scheduler and a distribution interface;
the scheduler is configured to determine, from a plurality of register groups, a plurality of first target register groups corresponding to a current processing cycle, where the first target register groups are different from second target register groups determined in at least one most recent historical processing cycle; and to determine, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle;
the distribution interface is configured to distribute the commands respectively corresponding to the determined target threads to the corresponding operation units.
In a possible implementation, the scheduler is configured, when determining, from the plurality of register groups, the plurality of first target register groups corresponding to the current processing cycle, to:
determine the odd-numbered register groups among the plurality of register groups as the first target register groups when the current processing cycle is an odd-numbered cycle; and
determine the even-numbered register groups among the plurality of register groups as the first target register groups when the current processing cycle is an even-numbered cycle.
In a possible implementation, the scheduler is further configured to:
determine the number of register groups according to the operand count of the operation unit that requires the most operands, and divide the registers into the plurality of register groups.
In a possible implementation, the scheduler is configured, when determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, to:
determine the target threads based on command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.
In a possible implementation, the scheduler is configured, when determining the target threads based on the command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups, to:
determine, from the thread groups respectively corresponding to the plurality of first target register groups, a plurality of candidate threads whose command execution state information indicates a ready state; and
determine, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle.
In a possible implementation, the scheduler is configured, when determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, to:
determine the target threads from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads.
In a possible implementation, the scheduler is configured, when determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, to:
determine the target threads from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads and the occupancy states of the operation units corresponding to the commands to be distributed.
In a possible implementation, the scheduler is configured, when determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, to:
determine the target threads from the plurality of candidate threads based on the command types of the current commands to be distributed respectively corresponding to the plurality of candidate threads and the types of the operation units.
In a possible implementation, the scheduler is further configured to:
in response to the commands to be distributed of the target threads determined for the current processing cycle including a multi-operand command, that is, a command with more than one operand, cause the first target register group corresponding to the multi-operand command to supply one operand to the operation unit corresponding to that command in each processing cycle from the current processing cycle to a target processing cycle;
where the number of cycles between the target processing cycle and the current processing cycle equals the number of operands of the multi-operand command minus one.
In a possible implementation, the scheduler is further configured to:
in response to the commands to be distributed of the target threads determined for the current processing cycle including a multi-operand command with two operands, for each single-operand command among those commands, determine, in the processing cycle following the current processing cycle, another single-operand command in a ready state for the first target register group to which that single-operand command belongs.
In a possible implementation, the scheduler is further configured to:
in response to the commands to be distributed of the target threads determined for the current processing cycle including a multi-operand command with more than one operand, determine the operand count of the multi-operand command with the most operands;
for each other command among those commands whose operand count is less than that maximum, and for each processing cycle from the cycle following the current processing cycle to the cycle before the first target register group is scheduled again, in response to the first target register group of that other command being idle, determine a command to be distributed in a ready state for that first target register group from the thread group corresponding to that first target register group;
where the operand count of the ready-state command is not more than the number of processing cycles from the cycle in which the ready-state command is distributed to the cycle in which the first target register group is scheduled again.
In a possible implementation, the scheduler is further configured to:
obtain feedback information generated by the operation unit after executing a command; and
generate, based on the feedback information, command execution state information corresponding to the thread to which the executed command belongs.
In a possible implementation, the scheduler is further configured to:
group the threads currently being executed based on the number of register groups and the number of threads currently being executed, to obtain the thread group corresponding to each register group.
In a third aspect, an embodiment of the present disclosure further provides a chip including a controller, a command distributor, and an operator;
the controller is used for acquiring commands corresponding to the threads respectively and sending the commands to the command distributor;
the command distributor is configured to distribute the command to the operator based on the command distribution method according to any one of the first aspects;
the operator is configured to read an operand from a target register group corresponding to the command based on the command distributed by the command distributor, and execute the command based on the operand.
In a fourth aspect, an embodiment of the disclosure further provides an electronic device, including the chip described in the third aspect.
In a fifth aspect, embodiments of the present disclosure also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the command distribution method according to any of the first aspects above.
For the effects of the command distributor, the chip, and the electronic device, refer to the description of the command distribution method above; they are not repeated here.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
It has been found through research that a command processing device includes a controller, a command distribution unit, and a plurality of operation units. After the command distribution unit distributes commands to the operation units, the operation units executing different commands obtain read permission for the registers through arbitration. Only after obtaining read permission does an operation unit read the operands required by its command from the registers and then execute the command based on the operands read. Arbitration of read permission for the register file introduces a certain delay, which reduces the throughput of the operation units and in turn leads to low command distribution efficiency and low processing efficiency for individual commands.
In addition, after a command is distributed to an operation unit, if the operands required to execute the command are not ready, the operation unit switches to executing a command of another thread, which requires the command distribution unit to distribute a new command to the operation unit. As a result, the commands distributed to an operation unit may include commands that cannot be executed immediately (they can be executed only after their operands are ready), which reduces both command distribution efficiency and command processing efficiency.
Based on the above study, the present disclosure provides a command distribution method that divides the registers in a register file into a plurality of register groups, with different register groups corresponding to different thread groups. In each processing cycle, a plurality of first target register groups are determined, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined from the thread groups respectively corresponding to those register groups, and the commands respectively corresponding to the determined target threads are distributed to the corresponding operation units. In this way, each register group is accessed by at most one operation unit in each processing cycle, so an operation unit that receives a command can access the corresponding register group directly, without arbitration, to obtain the operands required by the command. This improves both command distribution efficiency and command processing efficiency.
The defects of the above scheme were all identified by the inventors through practice and careful study. Therefore, the discovery of the above problems and the solutions proposed below by the present disclosure should all be regarded as contributions made by the inventors in the course of the present disclosure.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
To facilitate understanding of the present embodiment, a command distribution method disclosed in the embodiments of the present disclosure is first described in detail. The execution body of the command distribution method provided in the embodiments of the present disclosure is generally a command processing device, such as a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), or an artificial intelligence (Artificial Intelligence, AI) chip.
In the embodiments of the present disclosure, an operand of a command is data that needs to be read from an external memory when the command is executed. For example, if the command is to multiply data A by data B, the operands of the command are data A and data B. As another example, if the command is to perform a convolution operation on a feature map M to be processed using a convolution kernel F, the operands of the command are the feature map M and the convolution kernel F.
The command distribution method provided by the embodiment of the present disclosure is described below.
Referring to FIG. 1, a flowchart of a command distribution method according to an embodiment of the present disclosure is shown. The method includes steps S101 to S103, where:
S101, determining, from a plurality of register groups, a plurality of first target register groups corresponding to a current processing cycle, where the first target register groups are different from second target register groups determined in at least one most recent historical processing cycle;
S102, determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle;
S103, distributing the commands respectively corresponding to the determined target threads to the corresponding operation units.
According to the embodiments of the present disclosure, the registers are divided into a plurality of register groups. In each processing cycle, a plurality of first target register groups corresponding to the current processing cycle are determined from the plurality of register groups, and the target threads respectively corresponding to those first target register groups in the current processing cycle are then determined from the corresponding thread groups. After the commands of the target threads are distributed to the operation units, at most one operation unit accesses any given register group in a processing cycle. The operation units that receive commands can therefore access the corresponding register groups directly, without arbitration, to obtain the operands required by the commands, which improves both command distribution efficiency and command processing efficiency.
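By way of illustration only, the flow of S101 to S103 for a single processing cycle may be sketched as follows. This is a minimal sketch under stated assumptions rather than the disclosed implementation: the function `dispatch_cycle`, the policy callables `select_groups` and `pick_thread`, and the `pending` mapping from a thread to its operation unit and command are all hypothetical names introduced for this example.

```python
def dispatch_cycle(cycle, thread_groups, select_groups, pick_thread, pending):
    """Run S101 to S103 once and return a list of (unit, thread, command).

    thread_groups: {register_group_number: [thread ids]} (hypothetical layout)
    pending:       {thread id: (operation unit, command)} (hypothetical layout)
    """
    dispatched = []
    # S101: choose the first target register groups for this cycle.
    for group in select_groups(cycle, sorted(thread_groups)):
        # S102: choose a target thread from the thread group of this register group.
        thread = pick_thread(thread_groups[group])
        if thread is None:
            continue  # this register group has no target thread in this cycle
        # S103: distribute the thread's command to its operation unit.
        unit, command = pending[thread]
        dispatched.append((unit, thread, command))
    return dispatched


# Example policies (assumptions): parity-based group selection, first thread wins.
by_parity = lambda cycle, groups: [g for g in groups if g % 2 == cycle % 2]
first_thread = lambda threads: threads[0] if threads else None

thread_groups = {1: ["t0"], 2: ["t1"], 3: ["t2"]}
pending = {"t0": ("mul", "MUL a b"), "t1": ("add", "ADD c d"), "t2": ("conv", "CONV m f")}

cycle_1 = dispatch_cycle(1, thread_groups, by_parity, first_thread, pending)
cycle_2 = dispatch_cycle(2, thread_groups, by_parity, first_thread, pending)
```

In this sketch, cycle 1 serves register groups 1 and 3 and cycle 2 serves register group 2, so no register group is a target in two consecutive cycles.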
The following describes the steps S101 to S103 in detail.
In S101, for example, a plurality of registers may be divided into a plurality of register groups in advance, and the threads that currently issue commands in the host may likewise be divided into a plurality of thread groups. Each register group corresponds to one thread group, and for each thread group, the operands of the commands generated by the threads in that thread group are stored in registers of the corresponding register group.
Here, for example, the threads currently being executed may be grouped based on the number of register groups and the number of threads currently being executed, to obtain the thread group corresponding to each register group.
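As one hedged illustration of such grouping, the sketch below spreads the active threads evenly across the register groups by position. The modulo assignment and the name `group_threads` are assumptions made for this example; the disclosure only states that grouping depends on the number of register groups and the number of threads.

```python
def group_threads(thread_ids, num_register_groups):
    """Assign each active thread to one register group (hypothetical modulo policy)."""
    if num_register_groups <= 0:
        raise ValueError("num_register_groups must be positive")
    groups = {g: [] for g in range(num_register_groups)}
    for position, thread_id in enumerate(thread_ids):
        # Thread at position p goes to register group p mod n, which keeps
        # the group sizes within one thread of each other.
        groups[position % num_register_groups].append(thread_id)
    return groups


thread_groups = group_threads(["t0", "t1", "t2", "t3", "t4"], 2)
```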
When determining the plurality of first target register groups corresponding to the current processing cycle from the plurality of register groups, for example, the plurality of register groups may be divided into at least two groupings, each grouping including a plurality of register groups. In each processing cycle, the register groups included in one grouping are determined as the first target register groups corresponding to that processing cycle. Over a plurality of processing cycles, the register groups in different groupings are alternately determined as the target register groups of the respective processing cycles.
For example, each register group may be numbered, and a correspondence between register group numbers and processing cycle numbers may be predetermined. For example, in even-numbered processing cycles, the even-numbered register groups are determined as the target register groups of those cycles, and in odd-numbered processing cycles, the odd-numbered register groups are determined as the target register groups of those cycles.
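The odd/even correspondence described above may be sketched as follows. The function name `first_target_groups` and the 1-based numbering of cycles and register groups are assumptions made for illustration.

```python
def first_target_groups(cycle, group_numbers):
    """Return the register groups whose number has the same parity as the cycle."""
    return [g for g in group_numbers if g % 2 == cycle % 2]


groups = [1, 2, 3, 4]
schedule = {cycle: first_target_groups(cycle, groups) for cycle in range(1, 5)}
# Odd cycles select groups 1 and 3; even cycles select groups 2 and 4.
```

Because consecutive cycles have opposite parity, the groups selected in one cycle never overlap those selected in the immediately preceding cycle.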
Here, the number of register groups may be related to the number of operands required by the commands to be executed. For example, if a command to be executed requires at most 2 operands, the number of register groups is 2; if a command to be executed requires at most 3 operands, the number of register groups is 3. In the case where a command to be executed requires at most n operands, that is, where the number of register groups is n, suppose that in the i-th processing cycle the 1st of the n register groups is taken as the target register group. An operation unit that receives a command with n operands in the i-th processing cycle reads one operand from the corresponding first target register group in the i-th processing cycle, and then reads the remaining n-1 operands from that register group in the following (i+1)-th to (i+n-1)-th processing cycles. In those (i+1)-th to (i+n-1)-th processing cycles, the 2nd to n-th register groups are taken in turn as the target register groups, so that different operation units do not access the same register group in the same cycle.
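The conflict-free property of this rotation can be checked with a small simulation. The sketch below is illustrative only and makes simplifying assumptions: a single target register group per cycle chosen as the cycle number modulo n, one command dispatched in every cycle, and one operand read per cycle starting in the dispatch cycle. The names `target_group` and `read_accesses` are hypothetical.

```python
from collections import Counter


def target_group(cycle, num_groups):
    """Hypothetical rotation: group (cycle mod n) is the target group in that cycle."""
    return cycle % num_groups


def read_accesses(num_groups, num_cycles, operands_per_command):
    """Count reads per (cycle, register group) when a command is dispatched every cycle."""
    accesses = Counter()
    for dispatch_cycle in range(num_cycles):
        group = target_group(dispatch_cycle, num_groups)
        # One operand is read per cycle, starting in the dispatch cycle.
        for offset in range(operands_per_command):
            accesses[(dispatch_cycle + offset, group)] += 1
    return accesses


# With n = 3 groups and at most 3 operands, no group is read twice in one cycle.
max_readers = max(read_accesses(3, 12, 3).values())
# With 4 operands but only 3 groups, two operation units collide on a group.
max_readers_too_few_groups = max(read_accesses(3, 12, 4).values())
```

This is why the number of register groups is tied to the largest operand count: a command's reads must finish before its register group becomes a target again.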
For S102, after determining the plurality of first target register groups corresponding to the current processing cycle, for example, a plurality of target threads may be determined from thread groups corresponding to the plurality of first target register groups in at least one of the following manners:
(1) For each first target register group, an order of the threads in the corresponding thread group is determined in a round-robin manner, and according to that order, different threads in the thread group are determined as the target thread of the first target register group in different processing cycles.
(2) For each first target register group, the threads in the corresponding thread group that have a command in the current processing cycle are taken as candidate threads, and the candidate thread with the highest priority is taken as the target thread of the first target register group in the current processing cycle.
(3) The target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined based on command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.
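As a hedged illustration of manner (1), a round-robin picker for a single thread group might look like the sketch below. The class name `RoundRobinPicker` is hypothetical, and the sketch assumes that `pick` is called once each time the corresponding register group is a first target register group.

```python
class RoundRobinPicker:
    """Cycles through the threads of one thread group, one thread per call."""

    def __init__(self, thread_ids):
        if not thread_ids:
            raise ValueError("thread group must not be empty")
        self._threads = list(thread_ids)
        self._next = 0  # index of the thread to return on the next call

    def pick(self):
        thread = self._threads[self._next]
        # Advance and wrap around so every thread gets a turn.
        self._next = (self._next + 1) % len(self._threads)
        return thread


picker = RoundRobinPicker(["t0", "t2", "t4"])
picks = [picker.pick() for _ in range(5)]
```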
Here, for example, for each first target register group, a plurality of candidate threads for which command execution state information is ready may be determined from a thread group corresponding to the first target register group, and a target thread corresponding to each of the plurality of first target register groups in the current processing cycle may be determined from the plurality of candidate threads.
The command execution state information corresponding to any thread includes, for example, whether the command most recently distributed to an operation unit by the thread has been executed, and/or whether the operands required by the thread's command to be distributed are ready.
The operands of a command are ready when, for example, the data generated by the other commands on which the command depends has been stored in the corresponding registers, and/or the operands that need to be read from external memory have been stored in the corresponding registers.
If the command most recently distributed to an operation unit by the thread has been executed, and/or the operands required by the thread's current command to be distributed are ready, the command execution state information of the thread is considered to indicate a ready state, and the thread can be used as a target thread.
In this case, in another embodiment of the present disclosure, the method further includes obtaining feedback information generated by the operation unit after executing the command, and generating command execution state information corresponding to a thread to which the executed command belongs based on the feedback information.
Thus, the command distributor can know the execution condition of each operation unit on the command in real time.
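One possible way to track this state is sketched below, for illustration only. The `ThreadState` fields, the `on_dispatch` and `on_feedback` helpers, and the dictionary layout of the feedback message are hypothetical. The sketch also assumes that a thread is ready only when both conditions hold, whereas the description above allows either condition to be used on its own.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    last_command_done: bool = True   # most recently distributed command has finished
    operands_ready: bool = False     # operands of the next command are in the registers

    @property
    def ready(self):
        # Assumption: both conditions are required for the ready state.
        return self.last_command_done and self.operands_ready


def on_dispatch(states, thread_id):
    """Mark the thread as having a command in flight."""
    states[thread_id].last_command_done = False


def on_feedback(states, feedback):
    """Apply feedback from an operation unit; `feedback` is a hypothetical dict."""
    states[feedback["thread_id"]].last_command_done = feedback["done"]


states = {"t0": ThreadState(operands_ready=True)}
before = states["t0"].ready
on_dispatch(states, "t0")
during = states["t0"].ready
on_feedback(states, {"thread_id": "t0", "done": True})
after = states["t0"].ready
```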
In one possible embodiment, in manner (3) above, the number of target threads determined for a given first target register group may be greater than 1; alternatively, the target thread may be determined from the plurality of threads satisfying the requirement of (3) in combination with the priority of each thread or in a round-robin manner.
It should be noted that, in a given processing cycle, a first target register group may have no target thread; that is, the number of target threads determined may be less than the number of first target register groups.
When determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, any one of the following manners ① to ③ may be used, for example:
① Determining the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads.
② Determining the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads and the occupancy states of the operation units corresponding to the commands to be distributed.
The occupancy state of the arithmetic unit may include, for example, that a specific target thread has been allocated to the arithmetic unit during the current processing cycle, and/or that the number of commands received by the arithmetic unit during the historical processing cycle and not executed reaches a preset number.
Illustratively, the following is performed in order of priority from high to low:
First, at least one command to be distributed with the highest priority is determined according to the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads. Whether that command can be distributed to its corresponding operation unit is then determined based on the occupation state of that operation unit. If it can, the candidate thread corresponding to the command is determined as a target thread; if it cannot, the candidate thread corresponding to the command is not taken as a target thread.
Then, at least one command to be distributed with the next-highest priority is determined from the candidate threads, and whether that command can be distributed to its corresponding operation unit is determined based on the occupation state of that operation unit.
......
Finally, at least one command to be distributed with the lowest priority is determined from the candidate threads, and whether that command can be distributed to its corresponding operation unit is determined based on the occupation state of that operation unit.
Based on the above procedure, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined from the plurality of candidate threads.
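The priority-ordered procedure of ② can be sketched as follows. This is an illustrative sketch only: the `Candidate` record, the function `select_targets`, and the conventions that a larger number means a higher priority and that a register group receives at most one target thread per cycle are assumptions made for the example.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    thread_id: int
    bank: int        # first target register group the thread belongs to
    unit: str        # operation unit required by the pending command
    priority: int    # assumption: larger value means higher priority


def select_targets(candidates, occupied_units):
    """Visit candidates from highest to lowest priority and keep those
    whose operation unit is not occupied and whose bank is still free."""
    occupied = set(occupied_units)
    used_banks = set()
    targets = []
    for cand in sorted(candidates, key=lambda c: -c.priority):
        if cand.unit in occupied or cand.bank in used_banks:
            continue  # cannot be distributed in this cycle
        targets.append(cand)
        occupied.add(cand.unit)      # unit now has a target thread allocated
        used_banks.add(cand.bank)
    return targets
```

For example, if two candidate threads both need the same ALU, only the one with the higher priority becomes a target thread in this cycle, and the other remains a candidate for a later cycle.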
③ Determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the command types of the current commands to be distributed respectively corresponding to the plurality of candidate threads and the types of the operation units.
Here, the types of the arithmetic units are different, and the types of commands that can be processed are also different.
For example, an arithmetic operation unit can process arithmetic operation commands, a write address operation unit can process write address commands, a read address operation unit can process read address commands, and an override function operation unit can process override function commands.
When the target thread is determined, a plurality of target threads which can be respectively matched with the types of the operation units are determined from the candidate threads according to the types of the commands to be distributed corresponding to the candidate threads, and then the current commands to be distributed corresponding to the target threads are distributed to the operation units with the matched types.
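Type matching as in ③ can be sketched as follows. The command-type labels (`"arith"`, `"store"`, `"load"`, `"func"`), the table `UNIT_FOR_COMMAND` and the function `match_by_type` are hypothetical names introduced only for this illustration.

```python
# Hypothetical mapping from command type to the type of operation unit that handles it.
UNIT_FOR_COMMAND = {"arith": "ALU", "store": "ST", "load": "LD", "func": "TFU"}


def match_by_type(candidates, free_units):
    """candidates: list of (thread_id, command_type) in the order to be tried.
    free_units: dict mapping unit type to the number of free units of that type.
    Returns a dict mapping each selected thread_id to its unit type."""
    remaining = dict(free_units)          # do not modify the caller's dict
    assignment = {}
    for thread_id, command_type in candidates:
        unit = UNIT_FOR_COMMAND.get(command_type)
        if unit is not None and remaining.get(unit, 0) > 0:
            remaining[unit] -= 1          # one unit of this type is now taken
            assignment[thread_id] = unit
    return assignment
```

With two ALUs available, a third arithmetic command in the same cycle finds no matching unit and its thread is not selected.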
In another embodiment of the present disclosure, for some commands, the number of operands required in executing the command may be different.
After the command to be distributed corresponding to the target thread is distributed to the operation unit, the operation unit needs at least one period to read the operands corresponding to the command to be distributed from the corresponding register group, wherein the number of the periods for reading the operands is the same as the number of the operands corresponding to the command to be distributed.
Further, in response to the commands to be distributed corresponding to the target threads determined for the current processing cycle including a multi-operand command to be distributed, that is, a command with more than one operand, in each processing cycle from the current processing cycle to a target processing cycle, the first target register group corresponding to the multi-operand command to be distributed distributes one corresponding operand to the operation unit corresponding to that command;
The difference between the cycle number of the target processing cycle and that of the current processing cycle is equal to the number of operands of the multi-operand command to be distributed minus one.
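The relationship between the current processing cycle, the target processing cycle and the number of operands can be checked with a short sketch. The function name `operand_read_cycles` is hypothetical; the sketch assumes one operand is read per processing cycle, as stated above.

```python
def operand_read_cycles(current_cycle: int, num_operands: int) -> list:
    """Return the cycles in which the operands of one command are read,
    assuming one operand per cycle starting at the dispatch cycle."""
    if num_operands < 1:
        raise ValueError("a command needs at least one operand in this sketch")
    target_cycle = current_cycle + num_operands - 1   # difference is num_operands - 1
    return list(range(current_cycle, target_cycle + 1))
```

A two-operand command dispatched in cycle 0 therefore occupies its register group in cycles 0 and 1, and a single-operand command occupies it only in its dispatch cycle.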
For the commands to be distributed with fewer operands, the operation unit can respectively read the operands corresponding to different commands to be distributed from the same target register group in a plurality of periods.
Further, in response to a multi-operand to-be-dispatched command with more than one operand in a to-be-dispatched command corresponding to a target thread determined for a current processing cycle, determining the operand number of the multi-operand to-be-dispatched command with the largest operand number in the multi-operand to-be-dispatched command;
For each other command to be dispatched, in which the number of operands in the command to be dispatched corresponding to the target thread determined for the current processing cycle is less than the maximum number of operands, determining a command to be dispatched in a ready state for the first target register group from the thread group corresponding to the first target register group in response to the first target register group where the other command to be dispatched is idle for each processing cycle from the next processing cycle of the current processing cycle to the processing cycle before the first target register group is scheduled again;
The number of operands of the ready-state command to be distributed is not more than the number of processing cycles from the processing cycle in which the ready-state command to be distributed is distributed to the processing cycle in which the first target register group is scheduled again.
For example, in response to a multi-operand to-be-dispatched command with two operands in a to-be-dispatched command corresponding to a target thread determined for a current processing cycle, for each single-operand to-be-dispatched command existing in the to-be-dispatched command corresponding to the target thread determined for the current processing cycle, another single-operand to-be-dispatched command in a ready state is determined for a first target register group in which the single-operand to-be-dispatched command is located in a next processing cycle of the current processing cycle.
In this way, over a plurality of processing cycles, the operation unit can read the plurality of operands corresponding to the multi-operand command to be distributed from the first target register set corresponding to that command, while operands corresponding to different single-operand commands to be distributed can be read, in those same processing cycles, from the first target register set corresponding to the single-operand commands. The efficiency of data reading is thereby improved while read conflicts on the same target register set are avoided.
In one embodiment, the number N of register groups may be determined according to the number of operands of the operation unit that requires the largest number of operands, so that the N register groups can be scheduled respectively in N consecutive processing cycles. After being scheduled in the i-th processing cycle, the i-th register group is not scheduled again until the other register groups have been scheduled in turn. If the command corresponding to the register group scheduled in the i-th processing cycle requires exactly N operands, that register group can distribute the N operands to the corresponding operation unit over N cycles. If the command requires fewer than N operands, commands whose number of operands matches the number of cycles remaining before the register group is next scheduled can be flexibly scheduled, so as to improve the data reading efficiency.
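The idea of filling the cycles that remain before a register group is scheduled again can be sketched as a simple greedy packing. The function `fill_bank_window`, the notion of a "window" of cycles, and the assumption that ready commands are supplied in the order in which they should be tried are all illustrative choices, not details taken from the embodiment.

```python
def fill_bank_window(window: int, ready_operand_counts: list) -> list:
    """Pack ready commands of one register group into `window` cycles.
    Each command reads one operand per cycle. Returns a list of
    (start_offset, operand_count) pairs for the accepted commands."""
    schedule = []
    offset = 0                      # next free cycle inside the window
    for count in ready_operand_counts:
        if offset >= window:
            break                   # the window is full
        if count <= window - offset:
            schedule.append((offset, count))
            offset += count
    return schedule
```

With a window of two cycles, one two-operand command fills the window, whereas two single-operand commands can share it, which mirrors the case of one single-operand command being followed by another in the next processing cycle.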
Referring to fig. 2 and 3, the embodiment of the present disclosure further provides a command distribution apparatus, and a specific example of command distribution using the same, in which the example includes a command distributor, and 5 arithmetic units connected to the command distributor, the 5 arithmetic units being respectively:
Two arithmetic logic units (Arithmetic and Logic Unit, ALU), which process instructions requiring two operands.
A write address operation unit (ST), which processes instructions requiring two operands.
A read address operation unit (LD), which processes instructions requiring one operand.
An override function operation unit (Tensor Function Unit, TFU), which processes instructions requiring one operand.
The total number of threads is 64, namely thread 0 to thread 63. There are 8 register sets (banks), namely Bank0 to Bank7, and 5 operation units. Each register set is allocated 8 threads.
Each Bank has only one read path. Within one processing cycle, different operation units accessing the same Bank conflict with one another, whereas different operation units accessing different Banks do not.
For an instruction with two operands, the reading of the operands needs to be performed in two cycles in the same Bank.
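In the even cycles, register groups numbered 0, 2, 4, and 6 are taken as the first target register groups, and in the odd cycles, register groups numbered 1, 3, 5, and 7 are taken as the first target register groups. Taking an even cycle as an example: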
The valid and highest priority ALU, ST, LD, TFU instructions are selected from the 8 threads allocated to each even numbered Bank.
From these even numbered banks, the two highest priority ALU instructions are first selected and dispatched.
And under the condition that the bank corresponding to the ALU instruction is occupied, selecting the ST instruction from the rest banks and distributing the ST instruction.
When the bank of ALU and ST instructions is occupied, LD instructions are selected from the rest banks and distributed.
Since ALU and ST instructions are two operands, the operands still need to be read from the same bank in the next cycle, TFU instructions are selected from the remaining banks and distributed in the next processing cycle if the banks of ALU and ST instructions are occupied.
A two-operand instruction distributed in an even cycle needs to continue reading an operand from the same even bank in the next odd cycle, but no bank conflict occurs, because the instructions newly scheduled in that next cycle belong to odd banks. The same applies to the scheduling mode of the odd cycle.
As shown in fig. 3:
A. In the 0th processing cycle, the determined banks are Bank0, Bank2, Bank4 and Bank6.
The command determined for Bank0 is an ALU command, and the operation unit that reads operands from Bank0 is ALU0; the ALU command is distributed to ALU0 in the 0th processing cycle. In the 0th processing cycle and the 1st processing cycle, the operation unit ALU0 reads the first operand alu0_r0 and the second operand alu0_r1 from Bank0, respectively.
The command determined for Bank2 is an ST command, and in the 0th processing cycle, the ST command is distributed to the ST unit. The operation unit that reads operands from Bank2 is ST, and in the 0th processing cycle and the 1st processing cycle, the operation unit ST reads the first operand st_r0 and the second operand st_r1 from Bank2, respectively.
The commands determined for Bank4 are an LD command and a TFU command, and the operation units that read operands from Bank4 are LD and TFU. In the 0th processing cycle, the LD command is distributed to the operation unit LD, which reads the operand corresponding to the LD command from Bank4; in the 1st processing cycle, the TFU command is distributed to the operation unit TFU, which reads the operand corresponding to the TFU command from Bank4.
The command determined for Bank6 is an ALU command, and in the 0th processing cycle, the ALU command is distributed to ALU1. The operation unit that reads operands from Bank6 is ALU1, and in the 0th processing cycle and the 1st processing cycle, the operation unit ALU1 reads the first operand alu1_r0 and the second operand alu1_r1 from Bank6, respectively.
B. In the 1st processing cycle, the determined banks are Bank1, Bank3, Bank5 and Bank7.
The command determined for Bank1 is an ST command, and in the 1st processing cycle, the ST command is distributed to the ST unit. The operation unit that reads operands from Bank1 is ST, and in the 1st processing cycle and the 2nd processing cycle, the operation unit ST reads the first operand st_r0 and the second operand st_r1 from Bank1, respectively.
The command determined for Bank3 is an ALU command, and the operation unit that reads operands from Bank3 is ALU0; the ALU command is distributed to ALU0 in the 1st processing cycle. In the 1st processing cycle and the 2nd processing cycle, the operation unit ALU0 reads the first operand alu0_r0 and the second operand alu0_r1 from Bank3, respectively.
The command determined for Bank5 is an ALU command, and in the 1st processing cycle, the ALU command is distributed to ALU1. The operation unit that reads operands from Bank5 is ALU1, and in the 1st processing cycle and the 2nd processing cycle, the operation unit ALU1 reads the first operand alu1_r0 and the second operand alu1_r1 from Bank5, respectively.
The commands determined for Bank7 are an LD command and a TFU command, and the operation units that read operands from Bank7 are LD and TFU. In the 1st processing cycle, the LD command is distributed to the operation unit LD, which reads the operand corresponding to the LD command from Bank7; in the 2nd processing cycle, the TFU command is distributed to the operation unit TFU, which reads the operand corresponding to the TFU command from Bank7.
The subsequent processing cycles, namely the 3rd processing cycle, the 4th processing cycle, and so on up to the 8th processing cycle, proceed in the same manner. In this way, within any one processing cycle, each register group is accessed by only one operation unit, which avoids the data conflicts caused by a plurality of operation units accessing the same register group in the same processing cycle, improves the command distribution efficiency, and thereby improves the command processing efficiency.
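The absence of bank conflicts in steps A and B can be checked mechanically. The sketch below transcribes the dispatches described for the 0th and 1st processing cycles into a table and counts how many reads fall on each (cycle, bank) pair. The table layout and the names `DISPATCHES` and `bank_reads` are hypothetical, and the LD and TFU commands are each treated as reading a single operand, as steps A and B describe.

```python
from collections import Counter

# (dispatch_cycle, bank, unit, operand_count), transcribed from steps A and B.
DISPATCHES = [
    (0, 0, "ALU0", 2), (0, 2, "ST", 2), (0, 4, "LD", 1), (1, 4, "TFU", 1),
    (0, 6, "ALU1", 2),
    (1, 1, "ST", 2), (1, 3, "ALU0", 2), (1, 5, "ALU1", 2),
    (1, 7, "LD", 1), (2, 7, "TFU", 1),
]


def bank_reads(dispatches):
    """Count reads per (cycle, bank), assuming one operand is read per cycle."""
    reads = Counter()
    for cycle, bank, _unit, operand_count in dispatches:
        for k in range(operand_count):
            reads[(cycle + k, bank)] += 1
    return reads


reads = bank_reads(DISPATCHES)
# A conflict would be any (cycle, bank) pair that is read more than once.
conflicts = [key for key, count in reads.items() if count > 1]
```

Under these assumptions the list `conflicts` is empty, which is the property that the single read path of each Bank requires.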
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Based on the same inventive concept, the embodiment of the present disclosure further provides a command distributor corresponding to the command distribution method, and since the principle of the command distributor in the embodiment of the present disclosure for solving the problem is similar to that of the command distribution method in the embodiment of the present disclosure, the implementation of the command distributor may refer to the implementation of the method, and the repetition is omitted.
Referring to FIG. 4, a schematic diagram of a command distributor according to an embodiment of the present disclosure is provided, where the command distributor includes a scheduler 41 and a distribution interface 42;
The scheduler 41 is configured to determine a plurality of first target register sets corresponding to a current processing cycle from a plurality of register sets, where the first target register sets are different from a second target register set determined by at least one last historical processing cycle;
the dispatch interface 42 is configured to dispatch the commands corresponding to the determined target threads to the corresponding computing units.
In a possible implementation manner, the scheduler 41 is configured, when determining, from a plurality of register sets, a plurality of first target register sets corresponding to a current processing cycle, to:
Determining a register group numbered as an odd number among the plurality of register groups as the first target register group in a case where the current processing cycle is an odd number cycle;
And determining a register group numbered even in the plurality of register groups as the first target register group in the case that the current processing cycle is an even number cycle.
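The parity rule applied by the scheduler can be sketched in a few lines. The function name `first_target_banks` and the default of 8 register groups are assumptions for the illustration; the default matches the specific example with Bank0 to Bank7.

```python
def first_target_banks(cycle: int, num_banks: int = 8) -> list:
    """Return the register groups whose number has the same parity as the cycle."""
    parity = cycle % 2
    return [bank for bank in range(num_banks) if bank % 2 == parity]
```

Because consecutive cycles select disjoint sets of register groups, a two-operand command dispatched in one cycle can finish reading from its register group in the following cycle without competing with the newly selected register groups.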
In a possible implementation, the scheduler 41 is further configured to:
the number of groupings of registers is determined based on the number of operands of the arithmetic unit that requires the greatest number of operands, and the registers are divided into the plurality of register banks.
In a possible implementation manner, the scheduler 41, when determining, from the thread groups respectively corresponding to the plurality of first target register groups, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:
and determining target threads respectively corresponding to the plurality of first target register groups in the current processing period from the thread groups respectively corresponding to the plurality of first target register groups based on the determined command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.
In a possible implementation manner, the scheduler 41, when determining, based on the determined command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from those thread groups, is configured to:
determining a plurality of candidate threads whose command execution state information is the ready state from the thread groups respectively corresponding to the plurality of first target register groups;
and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads.
In a possible implementation manner, the scheduler 41, when determining, from the plurality of candidate threads, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:
and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads.
In a possible implementation manner, the scheduler 41, when determining, from the plurality of candidate threads, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:
and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads and the occupation states of the operation units corresponding to the commands to be distributed.
In a possible implementation manner, the scheduler 41, when determining, from the plurality of candidate threads, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:
and determining target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the command types of the current commands to be distributed respectively corresponding to the plurality of candidate threads and the types of the operation units.
In a possible implementation, the scheduler 41 is further configured to:
Responding to a multi-operand to-be-dispatched command with more than one operand in a to-be-dispatched command corresponding to a target thread determined for a current processing period, and respectively dispatching a corresponding one operand to a corresponding operation unit of the to-be-dispatched command from the current processing period to each processing period of the target processing period by a first target register group corresponding to the multi-operand to-be-dispatched command;
The difference value between the cycle number of the target processing cycle and the current processing cycle is equal to one less than the number of the multiple operands.
In a possible implementation, the scheduler 41 is further configured to:
In response to a multi-operand to-be-dispatched command with two operands in a to-be-dispatched command corresponding to a target thread determined for a current processing cycle, determining another single-operand to-be-dispatched command in a ready state for a first target register group where the single-operand to-be-dispatched command is located in a next processing cycle of the current processing cycle for each single-operand to-be-dispatched command existing in the to-be-dispatched command corresponding to the target thread determined for the current processing cycle.
In a possible embodiment, the scheduler 41 is further configured to:
Responding to a multi-operand to-be-dispatched command with more than one operand in a target thread corresponding to a current processing period, and determining the operand number of the multi-operand to-be-dispatched command with the largest operand number in the multi-operand to-be-dispatched command;
For each other command to be dispatched, in which the number of operands in the command to be dispatched corresponding to the target thread determined for the current processing cycle is less than the maximum number of operands, determining a command to be dispatched in a ready state for the first target register group from the thread group corresponding to the first target register group in response to the first target register group where the other command to be dispatched is idle for each processing cycle from the next processing cycle of the current processing cycle to the processing cycle before the first target register group is scheduled again;
The number of operands of the ready-state command to be distributed is not more than the number of processing cycles from the processing cycle in which the ready-state command to be distributed is distributed to the processing cycle in which the first target register group is scheduled again.
In a possible implementation manner, the scheduler 41 is further configured to obtain feedback information generated by the operation unit after executing the command;
and generating command execution state information corresponding to the thread to which the executed command belongs based on the feedback information.
In a possible implementation, the scheduler 41 is further configured to:
And grouping the threads currently being executed based on the number of the register groups and the number of the threads currently being executed to obtain thread groups corresponding to each register group.
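Grouping the threads by the number of register groups can be sketched as follows. The embodiment does not state how threads are mapped to register groups, so the contiguous mapping used here (the first block of threads to Bank0, the next block to Bank1, and so on) and the function name `group_threads` are assumptions.

```python
def group_threads(thread_ids, num_banks):
    """Split the running threads into one thread group per register group,
    using contiguous blocks of (roughly) equal size."""
    groups = {bank: [] for bank in range(num_banks)}
    per_bank = -(-len(thread_ids) // num_banks)   # ceiling division
    for index, thread_id in enumerate(thread_ids):
        groups[index // per_bank].append(thread_id)
    return groups
```

With 64 threads and 8 register groups, each thread group contains 8 threads, which matches the allocation in the specific example.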
For a description of the processing flow of each module in the command distributor, and the interaction flow between the modules, reference is made to the relevant description in the above method embodiment, and will not be described in detail here.
In addition, the command distributor provided by the embodiment of the present disclosure may be a chip capable of implementing the command distribution method provided by the embodiment of the present disclosure.
The embodiment of the disclosure also provides a chip, as shown in fig. 5, comprising a controller 51, a command distributor 52, and an operator 53;
the controller 51 is configured to obtain commands corresponding to the multiple threads, and send the commands to the command distributor 52;
the command distributor 52 is configured to distribute the command to the arithmetic unit 53 based on a command distribution method provided by any one of the embodiments of the present disclosure;
The operator 53 is configured to read an operand from a first target register group corresponding to the command, and execute the command based on the operand.
For the specific process by which the chip executes commands, reference may be made to the steps of the command distribution method described in the embodiments of the present disclosure, which are not repeated here.
The embodiment of the disclosure also provides electronic equipment, which comprises the chip provided by any embodiment of the disclosure.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the command distribution method described in the method embodiments above. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiments of the present disclosure further provide a computer program product, where the computer program product carries a program code, where instructions included in the program code may be used to perform the steps of the command distribution method described in the foregoing method embodiments, and specifically reference may be made to the foregoing method embodiments, which are not described herein in detail.
Wherein the above-mentioned computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one operation unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present disclosure. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
It should be noted that the foregoing embodiments are merely specific implementations of the disclosure, and are not intended to limit the scope of the disclosure, and although the disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that any modification, variation or substitution of some of the technical features described in the foregoing embodiments may be made or equivalents may be substituted for those within the scope of the disclosure without departing from the spirit and scope of the technical aspects of the embodiments of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.