CN115129369B

CN115129369B - Command distribution method, command distributor, chip and electronic device

Info

Publication number: CN115129369B
Application number: CN202110323622.8A
Authority: CN
Inventors: 王文强; 夏晓旭
Original assignee: Shanghai Power Tensors Intelligent Technology Co Ltd
Current assignee: Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2025-03-28
Anticipated expiration: 2041-03-26
Also published as: WO2022198955A1; CN115129369A

Abstract

The present disclosure provides a command distribution method, a command distributor, a chip, and an electronic device, wherein the command distribution method includes: determining multiple first target register groups corresponding to the current processing cycle from multiple register groups; wherein the first target register group is different from the second target register group determined in at least one recent historical processing cycle; determining the target threads corresponding to the multiple first target register groups in the current processing cycle from the thread groups corresponding to the multiple first target register groups; and distributing commands corresponding to the determined target threads to the corresponding computing units. In the embodiment of the present disclosure, each thread group will be accessed by at most one computing unit in each processing cycle, so the computing units that receive the command do not need to arbitrate, and can directly access the corresponding register group to obtain the operands required for the command, thereby improving the efficiency of command distribution and the efficiency of command processing.

Description

Command distribution method, command distributor, chip and electronic device

Technical Field

The disclosure relates to the field of computer technology, and in particular to a command distribution method, a command distributor, a chip and an electronic device.

Background

A command processing device such as a central processing unit or a graphics processor generally includes a controller, a command distributor connected to the controller, and a plurality of operation units connected to the command distributor. The controller receives commands from the host, performs preliminary processing on them, and sends them to the command distributor, which distributes them to different operation units for execution. As compute-intensive tasks increase, hardware multithreading, a technology that can effectively improve parallel computing capability, is widely used in fields such as image processing, neural networks, and data processing. Hardware multithreading increases computation speed by, among other ways, increasing the number of operation units, maintaining a greater number of threads executing in parallel, increasing the capacity of the register file that stores command operands, and employing higher-bandwidth memory.

The current command distribution mode has the problem of low distribution efficiency.

Disclosure of Invention

The embodiment of the disclosure at least provides a command distribution method, a command distributor, a chip and electronic equipment.

In a first aspect, an embodiment of the present disclosure provides a command distribution method, including: determining, from a plurality of register groups, a plurality of first target register groups corresponding to a current processing cycle, where the first target register groups are different from a second target register group determined in at least one most recent historical processing cycle;

determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle; and distributing the commands respectively corresponding to the determined target threads to the corresponding operation units.

In a possible implementation manner, the determining the plurality of first target register sets corresponding to the current processing period from the plurality of register sets includes determining the register set with the odd number in the plurality of register sets as the first target register set when the current processing period is an odd number period, and determining the register set with the even number in the plurality of register sets as the first target register set when the current processing period is an even number period.

In a possible embodiment, the method further comprises determining the grouping number of the registers according to the number of operands of the operation unit with the largest number of required operands, and dividing the registers into the plurality of register groups.

In a possible implementation manner, the determining the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the thread groups respectively corresponding to the plurality of first target register groups comprises determining the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the thread groups respectively corresponding to the plurality of first target register groups based on the determined command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.

In a possible implementation manner, the determining the target threads corresponding to the plurality of first target register groups in the current processing cycle from the thread groups corresponding to the plurality of first target register groups respectively based on the determined command execution state of each thread in the thread groups corresponding to the plurality of first target register groups respectively comprises determining a plurality of candidate threads with command execution state information in a ready state from the thread groups corresponding to the plurality of first target register groups respectively, and determining the target threads corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads respectively.

In a possible implementation manner, the determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle includes: determining the target threads based on the priorities of the to-be-dispatched commands respectively corresponding to the plurality of candidate threads.

In a possible implementation manner, the determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle includes: determining the target threads based on the priorities of the to-be-dispatched commands respectively corresponding to the plurality of candidate threads and the occupancy states of the operation units corresponding to those commands.

In a possible implementation manner, in response to a multi-operand to-be-dispatched command having more than one operand existing among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, in each processing cycle from the current processing cycle to a target processing cycle, the first target register group corresponding to the multi-operand command supplies one operand to the operation unit corresponding to that command, wherein the difference between the cycle number of the target processing cycle and that of the current processing cycle is equal to the number of operands of the multi-operand command minus one.

In a possible implementation manner, the method further comprises: in response to a multi-operand to-be-dispatched command having two operands existing among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, for each single-operand to-be-dispatched command among those commands, determining, in the next processing cycle after the current processing cycle, another single-operand to-be-dispatched command in a ready state for the first target register group where that single-operand command is located.

In a possible implementation manner, the method further comprises: in response to a multi-operand to-be-dispatched command having more than one operand existing among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, determining the number of operands of the multi-operand command that has the largest number of operands; and determining, from the thread group corresponding to a first target register group, a to-be-dispatched command in a ready state for that first target register group, wherein the number of operands of the ready-state to-be-dispatched command is not more than the number of processing cycles from the next processing cycle after the current processing cycle to the processing cycle in which the first target register group is scheduled again.

In a possible implementation manner, the method further comprises the steps of obtaining feedback information generated by the operation unit after executing the command, and generating command execution state information corresponding to a thread to which the executed command belongs based on the feedback information.

In a possible implementation manner, the method further comprises grouping the threads currently being executed based on the number of the register groups and the number of the threads currently being executed to obtain thread groups corresponding to each register group respectively.

In a second aspect, embodiments of the present disclosure provide a command dispatcher comprising a scheduler, and a dispatch interface;

the scheduler is used for determining, from a plurality of register groups, a plurality of first target register groups corresponding to the current processing cycle, wherein the first target register groups are different from a second target register group determined in at least one most recent historical processing cycle; and for determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle;

the distributing interface is used for distributing the determined commands respectively corresponding to the target threads to the corresponding operation units.

In a possible implementation manner, the scheduler is configured, when determining, from a plurality of register sets, a plurality of first target register sets corresponding to a current processing cycle, to:

Determining a register group numbered as an odd number among the plurality of register groups as the first target register group in a case where the current processing cycle is an odd number cycle;

And determining a register group numbered even in the plurality of register groups as the first target register group in the case that the current processing cycle is an even number cycle.

In a possible implementation manner, the scheduler is further configured to:

the number of groupings of registers is determined based on the number of operands of the arithmetic unit that requires the greatest number of operands, and the registers are divided into the plurality of register banks.

In a possible implementation manner, the scheduler is configured, when determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, to:

and determining target threads respectively corresponding to the plurality of first target register groups in the current processing period from the thread groups respectively corresponding to the plurality of first target register groups based on the determined command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.

In a possible implementation manner, the scheduler is configured, when determining, based on the determined command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, to:

determine, from the thread groups respectively corresponding to the plurality of first target register groups, a plurality of candidate threads whose command execution state information indicates a ready state; and

determine, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle.

In a possible implementation manner, the scheduler is configured, when determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, to:

determine the target threads based on the priorities of the to-be-dispatched commands respectively corresponding to the plurality of candidate threads;

or determine the target threads based on the priorities of the to-be-dispatched commands respectively corresponding to the plurality of candidate threads and the occupancy states of the operation units corresponding to those commands;

or determine the target threads based on the command types of the current to-be-dispatched commands respectively corresponding to the plurality of candidate threads and the types of the operation units.

In a possible implementation manner, the scheduler is further configured to:

in response to a multi-operand to-be-dispatched command having more than one operand existing among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, cause the first target register group corresponding to the multi-operand command to supply one operand to the operation unit corresponding to that command in each processing cycle from the current processing cycle to a target processing cycle;

wherein the difference between the cycle number of the target processing cycle and that of the current processing cycle is equal to the number of operands of the multi-operand command minus one.

In a possible implementation manner, the scheduler is further configured to:

in response to a multi-operand to-be-dispatched command having two operands existing among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, for each single-operand to-be-dispatched command among those commands, determine, in the next processing cycle after the current processing cycle, another single-operand to-be-dispatched command in a ready state for the first target register group where that single-operand command is located.

In a possible embodiment, the scheduler is further configured to:

in response to a multi-operand to-be-dispatched command having more than one operand existing among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, determine the number of operands of the multi-operand command that has the largest number of operands;

for each other to-be-dispatched command, among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, whose number of operands is less than that maximum number, and for each processing cycle from the next processing cycle after the current processing cycle to the processing cycle before the first target register group is scheduled again, in response to the first target register group where the other to-be-dispatched command is located being idle, determine a to-be-dispatched command in a ready state for that first target register group from the thread group corresponding to it;

wherein the number of operands of the ready-state to-be-dispatched command is not more than the number of processing cycles from the processing cycle in which the ready-state command is dispatched to the processing cycle in which the first target register group is scheduled again.

In a possible implementation manner, the scheduler is further used for acquiring feedback information generated by the operation unit after executing the command;

and generating command execution state information corresponding to the thread to which the executed command belongs based on the feedback information.

In a possible implementation manner, the scheduler is further configured to:

And grouping the threads currently being executed based on the number of the register groups and the number of the threads currently being executed to obtain thread groups corresponding to each register group.

In a third aspect, embodiments of the present disclosure further provide a chip, comprising a controller, a command distributor, and an operator;

the controller is used for acquiring commands corresponding to the threads respectively and sending the commands to the command distributor;

the command distributor is configured to distribute the command to the operator based on the command distribution method according to any one of the first aspects;

the operator is configured to read an operand from a target register group corresponding to the command based on the command distributed by the command distributor, and execute the command based on the operand.

In a fourth aspect, an embodiment of the disclosure further provides an electronic device, including the chip described in the third aspect.

In a fifth aspect, embodiments of the present disclosure also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the command distribution method according to any of the first aspects above.

The description of the effects of the command distributor, the chip and the electronic device is referred to the description of the command distributing method, and is not repeated here.

The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.

Drawings

To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required for the embodiments are briefly described below. The drawings are incorporated in and constitute a part of this specification; they show embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other related drawings from them without inventive effort.

FIG. 1 illustrates a flow chart of a command distribution method provided by an embodiment of the present disclosure;

FIG. 2 illustrates a specific example of a command distribution device provided by an embodiment of the present disclosure;

FIG. 3 illustrates a schematic diagram of a specific example of command distribution by the command distribution device provided by an embodiment of the present disclosure;

FIG. 4 illustrates a schematic diagram of a command distributor provided by an embodiment of the present disclosure;

FIG. 5 illustrates a schematic structural diagram of a chip provided by an embodiment of the present disclosure.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.

Research has found that a command processing apparatus includes a controller, a command distribution unit, and a plurality of operation units. After the command distribution unit distributes commands to the operation units, the operation units executing different commands obtain read access to the registers through arbitration; once read access is obtained, an operation unit reads the operands required by its command from the registers and then executes the command based on the read operands. Arbitrating read access to the register file introduces a certain delay, which reduces the throughput of the operation units and in turn leads to low command distribution efficiency and low processing efficiency for individual commands.

In addition, after a command is distributed to an operation unit, if the operands required to execute it are not ready, the operation unit switches to executing a command of another thread, which requires the command distribution unit to distribute a new command to that operation unit. As a result, the commands distributed to an operation unit may include commands that cannot be executed immediately (they can be executed only after their operands are ready), which reduces both command distribution efficiency and command processing efficiency.

Based on the above study, the present disclosure provides a command distribution method that divides the registers in a register file into a plurality of register groups, with different register groups corresponding to different thread groups. In each processing cycle, a plurality of first target register groups are determined, the target threads respectively corresponding to those first target register groups in the current processing cycle are determined from the corresponding thread groups, and the commands respectively corresponding to the determined target threads are distributed to the corresponding operation units. Each thread group is therefore accessed by at most one operation unit in each processing cycle, so the operation units that receive commands can access the corresponding register groups directly, without arbitration, to obtain the operands required by the commands. This improves both command distribution efficiency and command processing efficiency.

The defects of the above schemes are all findings obtained by the inventors through practice and careful study. Therefore, both the discovery of the above problems and the solutions to them proposed below should be regarded as contributions made by the inventors in the course of the present disclosure.

It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

To facilitate understanding of the present embodiment, a command distribution method disclosed in the embodiments of the present disclosure is first described in detail. The execution body of the command distribution method is generally a command processing device such as a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), or an artificial intelligence (Artificial Intelligence, AI) chip.

In the embodiments of the present disclosure, the operands of a command are the data that must be read from an external memory when the command is executed. For example, if the command multiplies data A by data B, its operands are data A and data B; as another example, if the command performs a convolution operation on a feature map M to be processed using a convolution kernel F, its operands are the feature map M and the convolution kernel F.

The command distribution method provided by the embodiment of the present disclosure is described below.

Referring to fig. 1, a flowchart of a command distribution method according to an embodiment of the disclosure is shown, where the method includes steps S101 to S103, where:

S101, determining, from a plurality of register groups, a plurality of first target register groups corresponding to a current processing cycle, wherein the first target register groups are different from a second target register group determined in at least one most recent historical processing cycle;

S102, determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle;

S103, distributing the commands respectively corresponding to the determined target threads to the corresponding operation units.

In the embodiments of the present disclosure, the registers are divided into a plurality of register groups. In each processing cycle, a plurality of first target register groups corresponding to the current processing cycle are determined from the register groups, and the target threads respectively corresponding to those first target register groups are then determined from the corresponding thread groups. After the commands corresponding to the target threads are distributed to the operation units, at most one operation unit accesses any one register group in a processing cycle, so the operation units that receive commands can access the corresponding register groups directly, without arbitrating among themselves, to obtain the operands required by the commands. This improves both command distribution efficiency and command processing efficiency.

The following describes the steps S101 to S103 in detail.

In S101, for example, the registers may be divided into a plurality of register groups in advance, and the threads currently issuing commands in the host may likewise be divided into a plurality of thread groups. Each register group corresponds to one thread group, and for each thread group, the operands of the commands generated by the threads in that thread group are stored in registers of the corresponding register group.

Here, for example, the threads currently being executed may be grouped based on the number of register groups and the number of threads currently being executed, to obtain the thread group corresponding to each register group.
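The grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the disclosure does not fix a particular assignment rule, so the round-robin rule used here, the zero-based group indexes, and the name `group_threads` are all hypothetical.

```python
def group_threads(thread_ids, num_register_groups):
    """Split running threads into one thread group per register group.

    Assumption (not from the disclosure): threads are assigned round-robin,
    so the thread at position p goes to group p % num_register_groups.
    """
    if num_register_groups <= 0:
        raise ValueError("num_register_groups must be positive")
    # One (initially empty) thread group per register group.
    groups = {g: [] for g in range(num_register_groups)}
    for position, thread_id in enumerate(thread_ids):
        groups[position % num_register_groups].append(thread_id)
    return groups


# Example: 8 running threads spread over 4 register groups.
thread_groups = group_threads(list(range(8)), 4)
```

With this rule the thread groups differ in size by at most one thread, which keeps the load on the register groups roughly even; other assignment rules would serve equally well for the method.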

When determining the plurality of first target register groups corresponding to the current processing cycle from the plurality of register groups, the register groups may, for example, be divided into at least two groupings, each containing a plurality of register groups. In each processing cycle, the register groups in one grouping are determined as the first target register groups for that cycle. Over successive processing cycles, the register groups in the different groupings are determined as the target register groups in turn.

For example, each register group may be numbered, and the correspondence between register group numbers and processing cycle numbers may be predetermined. For example, in an even-numbered processing cycle, the even-numbered register groups are determined as the target register groups for that cycle, and in an odd-numbered processing cycle, the odd-numbered register groups are determined as the target register groups for that cycle.
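A minimal sketch of the odd/even correspondence described above, assuming that register groups and processing cycles are both identified by integers. The function name `first_target_register_groups` is hypothetical and is introduced here only for illustration.

```python
def first_target_register_groups(cycle, register_group_numbers):
    """Return the register groups whose number has the same parity as the cycle.

    Odd cycles select odd-numbered groups and even cycles select
    even-numbered groups, so consecutive cycles never select the same group.
    """
    parity = cycle % 2
    return [n for n in register_group_numbers if n % 2 == parity]


groups = [1, 2, 3, 4]
odd_cycle_targets = first_target_register_groups(3, groups)   # odd-numbered groups
even_cycle_targets = first_target_register_groups(4, groups)  # even-numbered groups
```

Because the two selections are disjoint, the first target register groups of the current cycle always differ from those determined in the immediately preceding cycle.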

Here, the number of register groups may be related to the number of operands required by the commands to be executed. For example, if each command to be executed requires at most 2 operands, the number of register groups is 2; if each command requires at most 3 operands, the number of register groups is 3. More generally, if each command to be executed requires at most n operands, the number of register groups is n. Suppose that in the i-th processing cycle the 1st of the n register groups is taken as the target register group. An operation unit that receives a command with n operands in the i-th processing cycle reads one operand from the corresponding first target register group in that cycle, and reads the remaining n-1 operands from the same register group in the following (i+1)-th to (i+n-1)-th processing cycles. In those (i+1)-th to (i+n-1)-th processing cycles, the 2nd to n-th register groups are taken in turn as the target register groups, so that conflicts caused by operation units accessing the same register group are avoided.
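The timing argument above can be checked with a small simulation. The sketch below makes simplifying assumptions that are not part of the disclosure: exactly one register group is the target in each cycle, groups and cycles are zero-based, every dispatched command has the maximum number of operands n, and the name `read_schedule` is hypothetical.

```python
def read_schedule(num_groups, num_cycles):
    """Map each cycle to {register_group: issue_cycle of the command reading it}.

    The group targeted in cycle c is c % num_groups. A command issued in
    cycle c reads one operand per cycle from that group during cycles
    c .. c + num_groups - 1. A RuntimeError signals a read conflict.
    """
    schedule = {}
    for issue_cycle in range(num_cycles):
        group = issue_cycle % num_groups  # target register group of this cycle
        for offset in range(num_groups):  # one operand read per cycle
            reads = schedule.setdefault(issue_cycle + offset, {})
            if group in reads:
                raise RuntimeError(f"two readers on group {group}")
            reads[group] = issue_cycle
    return schedule


# Three register groups, three-operand commands, six issue cycles.
schedule = read_schedule(num_groups=3, num_cycles=6)
```

The simulation completes without raising, because a register group is targeted again only n cycles after its previous command was issued, by which time that command has read all n of its operands.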

For S102, after determining the plurality of first target register groups corresponding to the current processing cycle, for example, a plurality of target threads may be determined from thread groups corresponding to the plurality of first target register groups in at least one of the following manners:

(1) Ordering the threads in the thread group corresponding to each first target register group in a round-robin manner and, following that order, determining a different thread of the thread group as the target thread of the first target register group in each different processing cycle.

(2) For each first target register group, taking the threads in the corresponding thread group that have a command in the current processing cycle as candidate threads, and, according to the priority of each candidate thread, taking the candidate thread with the highest priority as the target thread of the first target register group in the current processing cycle.

(3) Determining the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle based on the command execution state information of each thread in the thread groups respectively corresponding to the plurality of first target register groups.

Here, for example, for each first target register group, a plurality of candidate threads for which command execution state information is ready may be determined from a thread group corresponding to the first target register group, and a target thread corresponding to each of the plurality of first target register groups in the current processing cycle may be determined from the plurality of candidate threads.

The command execution state information corresponding to any thread includes, for example, whether the command that the thread most recently dispatched to an operation unit has finished executing, and/or whether the operands required by the thread's current command to be dispatched are ready.

The operands of a command are ready when, for example, the data generated by other commands on which the command depends have been stored in the corresponding registers, and/or the operands that need to be read from external memory have been stored in the corresponding registers.

If the command most recently dispatched to the operation unit by the thread has finished executing, and/or the operands required by the thread's current command to be dispatched are ready, the command execution state information corresponding to the thread is considered to be in a ready state, and the thread can be used as a target thread.
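The ready-state test described above can be sketched as follows. The class `ThreadStatus`, its field names, and the `require_both` switch are hypothetical; the switch exists only because the text allows the two conditions to be combined as "and/or".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThreadStatus:
    thread_id: int
    last_command_done: bool  # most recently dispatched command has finished
    operands_ready: bool     # operands of the next command are in the registers


def is_ready(status: ThreadStatus, require_both: bool = True) -> bool:
    """Decide whether a thread's command execution state counts as ready."""
    if require_both:
        return status.last_command_done and status.operands_ready
    return status.last_command_done or status.operands_ready


def candidate_threads(group: List[ThreadStatus]) -> List[int]:
    """Return the ids of the threads in one thread group that are ready."""
    return [s.thread_id for s in group if is_ready(s)]


group = [
    ThreadStatus(0, True, True),
    ThreadStatus(1, False, True),
    ThreadStatus(2, True, False),
    ThreadStatus(3, True, True),
]
print(candidate_threads(group))  # [0, 3]
```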

In this case, in another embodiment of the present disclosure, the method further includes obtaining feedback information generated by the operation unit after executing the command, and generating command execution state information corresponding to a thread to which the executed command belongs based on the feedback information.

Thus, the command distributor can know the execution condition of each operation unit on the command in real time.
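One way to picture this feedback path is a small status table that is set on dispatch and cleared on feedback. This is an illustrative sketch under assumed names (`StatusTable`, `on_dispatch`, `on_feedback`); the disclosure does not specify the data structure.

```python
from typing import Dict


class StatusTable:
    """Tracks, per thread, whether its last dispatched command is still executing."""

    def __init__(self) -> None:
        self._in_flight: Dict[int, bool] = {}

    def on_dispatch(self, thread_id: int) -> None:
        # A command of this thread has just been sent to an operation unit.
        self._in_flight[thread_id] = True

    def on_feedback(self, thread_id: int) -> None:
        # The operation unit reported that the command has finished executing.
        self._in_flight[thread_id] = False

    def last_command_done(self, thread_id: int) -> bool:
        return not self._in_flight.get(thread_id, False)


table = StatusTable()
table.on_dispatch(7)
print(table.last_command_done(7))  # False while the command is executing
table.on_feedback(7)
print(table.last_command_done(7))  # True once feedback has arrived
```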

In one possible embodiment, in (3) above, more than one thread of a given first target register group may satisfy the requirement of (3). In that case, the target thread may be determined from the plurality of threads satisfying the requirement of (3) in combination with the priorities corresponding to the respective threads, or in a round-robin manner.

It should be noted here that, in a certain processing cycle, a certain first target register group may have no target thread, that is, the number of target threads determined is less than the number of first target register groups.
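The two tie-breaking options mentioned above, and the case in which a register group ends up with no target thread, can be sketched as follows. The function names and the representation of priorities as a dictionary are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence, Set


def pick_round_robin(
    thread_ids: Sequence[int], ready: Set[int], last_picked: Optional[int]
) -> Optional[int]:
    """Scan the thread group circularly, starting just after the last pick."""
    if not thread_ids:
        return None
    start = thread_ids.index(last_picked) + 1 if last_picked in thread_ids else 0
    for offset in range(len(thread_ids)):
        tid = thread_ids[(start + offset) % len(thread_ids)]
        if tid in ready:
            return tid
    return None  # no ready thread: this register group has no target thread


def pick_by_priority(ready: Set[int], priority: Dict[int, int]) -> Optional[int]:
    """Pick the ready thread with the highest priority (lowest id wins ties)."""
    if not ready:
        return None
    return max(sorted(ready), key=lambda tid: priority[tid])


ids = list(range(8))
print(pick_round_robin(ids, {2, 5}, last_picked=2))  # 5
print(pick_round_robin(ids, {2, 5}, last_picked=5))  # 2
print(pick_by_priority({2, 5}, {2: 1, 5: 3}))        # 5
print(pick_round_robin(ids, set(), last_picked=None))  # None
```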

When determining, from the determined plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, any one of the following ①～③ may be used, for example:

① Determining the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the candidate threads.

② Determining the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the priorities of the commands to be distributed respectively corresponding to the plurality of candidate threads and the occupation states of the operation units corresponding to the commands to be distributed.

The occupancy state of the arithmetic unit may include, for example, that a specific target thread has been allocated to the arithmetic unit during the current processing cycle, and/or that the number of commands received by the arithmetic unit during the historical processing cycle and not executed reaches a preset number.

Illustratively, the following is performed in order of priority from high to low:

First, at least one command to be distributed with the highest priority is determined according to the priorities of the commands to be distributed corresponding to the candidate threads. Based on the occupation state of the operation unit corresponding to that highest-priority command, it is determined whether the command can be distributed to the corresponding operation unit. If it can, the candidate thread corresponding to the highest-priority command is determined as a target thread; if it cannot, the candidate thread corresponding to that command is not taken as a target thread.

Then, at least one command to be distributed with the next-highest priority is determined from the candidate threads, and whether that command can be distributed to the corresponding operation unit is determined based on the occupation state of the operation unit corresponding to it.

......

Finally, at least one command to be distributed with the lowest priority is determined from the candidate threads, and whether that command can be distributed to the corresponding operation unit is determined based on the occupation state of the operation unit corresponding to it.

Based on the above procedure, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle are determined from the plurality of candidate threads.
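The priority-ordered procedure above can be sketched as a single pass over the candidates sorted from high to low priority. The `Candidate` fields, the `pending` counter, and the `max_pending` threshold are hypothetical stand-ins for the occupation state described in the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    thread_id: int
    bank: int       # first target register group the thread belongs to
    unit: str       # operation unit its command must go to
    priority: int   # larger value means higher priority


def select_targets(
    candidates: List[Candidate], pending: Dict[str, int], max_pending: int
) -> Dict[int, Candidate]:
    """Return a mapping from bank number to the chosen target thread."""
    chosen: Dict[int, Candidate] = {}
    units_taken = set()  # units already given a target thread this cycle
    for cand in sorted(candidates, key=lambda c: -c.priority):
        if cand.bank in chosen or cand.unit in units_taken:
            continue
        if pending.get(cand.unit, 0) >= max_pending:
            continue  # unit still holds too many unexecuted commands
        chosen[cand.bank] = cand
        units_taken.add(cand.unit)
    return chosen


cands = [
    Candidate(0, 0, "ALU0", 5),
    Candidate(1, 0, "ST", 3),
    Candidate(8, 2, "ALU0", 4),
    Candidate(9, 2, "LD", 1),
    Candidate(16, 4, "ST", 2),
]
result = select_targets(cands, pending={"ST": 4}, max_pending=4)
print({bank: c.thread_id for bank, c in result.items()})  # {0: 0, 2: 9}
```

In this example bank 4 receives no target thread, because its only candidate needs the ST unit, which is treated as occupied.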

③ Determining the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from the plurality of candidate threads based on the command types of the current commands to be distributed respectively corresponding to the candidate threads and the types of the operation units.

Here, the types of the arithmetic units are different, and the types of commands that can be processed are also different.

For example, an arithmetic operation unit can process arithmetic operation commands, a write address operation unit can process write address commands, a read address operation unit can process read address commands, and a transcendental function operation unit can process transcendental function commands.

When the target thread is determined, a plurality of target threads which can be respectively matched with the types of the operation units are determined from the candidate threads according to the types of the commands to be distributed corresponding to the candidate threads, and then the current commands to be distributed corresponding to the target threads are distributed to the operation units with the matched types.
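Type matching of this kind can be sketched as a lookup from command type to unit type plus a count of free units. The mapping keys, the unit counts, and the function name `match_by_type` are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from command type to the unit type able to process it.
UNIT_TYPE_FOR = {
    "arithmetic": "ALU",
    "write_address": "ST",
    "read_address": "LD",
    "tfu": "TFU",
}


def match_by_type(
    candidates: List[Tuple[int, str]], free_units: Dict[str, int]
) -> List[Tuple[int, str]]:
    """Assign candidates (thread id, command type) to free units of matching type."""
    remaining = dict(free_units)  # do not mutate the caller's counts
    dispatched = []
    for thread_id, command_type in candidates:
        unit_type = UNIT_TYPE_FOR[command_type]
        if remaining.get(unit_type, 0) > 0:
            remaining[unit_type] -= 1
            dispatched.append((thread_id, unit_type))
    return dispatched


cands = [(0, "arithmetic"), (8, "arithmetic"), (16, "arithmetic"), (24, "read_address")]
print(match_by_type(cands, {"ALU": 2, "ST": 1, "LD": 1, "TFU": 1}))
# [(0, 'ALU'), (8, 'ALU'), (24, 'LD')]
```

Thread 16 is not selected because both units of the matching type are already taken in this cycle.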

In another embodiment of the present disclosure, for some commands, the number of operands required in executing the command may be different.

After the command to be distributed corresponding to the target thread is distributed to the operation unit, the operation unit needs at least one period to read the operands corresponding to the command to be distributed from the corresponding register group, wherein the number of the periods for reading the operands is the same as the number of the operands corresponding to the command to be distributed.

Further, in response to the commands to be dispatched corresponding to the target threads determined for the current processing cycle including a multi-operand command to be dispatched, that is, a command with more than one operand, the first target register group corresponding to the multi-operand command distributes one corresponding operand to the operation unit of that command in each processing cycle from the current processing cycle to a target processing cycle, where the difference between the target processing cycle and the current processing cycle equals the number of operands minus one.

For the commands to be distributed with fewer operands, the operation unit can respectively read the operands corresponding to different commands to be distributed from the same target register group in a plurality of periods.

Further, in response to a multi-operand to-be-dispatched command with more than one operand in a to-be-dispatched command corresponding to a target thread determined for a current processing cycle, determining the operand number of the multi-operand to-be-dispatched command with the largest operand number in the multi-operand to-be-dispatched command;

For example, in response to a multi-operand to-be-dispatched command with two operands in a to-be-dispatched command corresponding to a target thread determined for a current processing cycle, for each single-operand to-be-dispatched command existing in the to-be-dispatched command corresponding to the target thread determined for the current processing cycle, another single-operand to-be-dispatched command in a ready state is determined for a first target register group in which the single-operand to-be-dispatched command is located in a next processing cycle of the current processing cycle.

In this way, an operation unit can read the plurality of operands of a multi-operand command to be dispatched from that command's first target register group over a plurality of processing cycles, while over the same processing cycles the operands of different single-operand commands to be dispatched can be read in turn from the first target register group of the single-operand command. Data reading efficiency is thereby improved while read conflicts on the same target register group are avoided.

In one embodiment, the number N of register groupings may be determined according to the operand count of the operation unit that requires the most operands, so that the N groupings are scheduled one per cycle over N consecutive processing cycles. The grouping scheduled in the i-th processing cycle is scheduled again only after the other groupings have each had their turn. If the command corresponding to the grouping scheduled in the i-th processing cycle needs exactly N operands, the N operands are distributed to the corresponding operation unit over those N cycles. If that command needs fewer than N operands, further commands whose operand counts match the number of cycles remaining before the grouping is next scheduled can be flexibly scheduled, which improves data reading efficiency.
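The idea of filling the cycles that remain before a register group is next scheduled can be sketched as a greedy packing of ready commands into a window of read cycles. This is only one possible policy; the function name `plan_bank`, the assumption that `ready` is already in priority order, and the greedy rule itself are not taken from the disclosure.

```python
from typing import List, Tuple


def plan_bank(window: int, ready: List[Tuple[str, int]]) -> List[Tuple[int, str]]:
    """Pack ready commands (name, operand count) of one register group
    into `window` consecutive read cycles, one operand read per cycle.

    Returns (start offset, command name) pairs.
    """
    plan = []
    next_free = 0  # first read cycle in the window that is still unused
    for name, operands in ready:
        if next_free + operands <= window:
            plan.append((next_free, name))
            next_free += operands
    return plan


# Two single-operand commands share the two-cycle window of one register group.
print(plan_bank(2, [("LD", 1), ("TFU", 1)]))  # [(0, 'LD'), (1, 'TFU')]
# A two-operand command uses the whole window, so nothing else fits.
print(plan_bank(2, [("ALU", 2), ("LD", 1)]))  # [(0, 'ALU')]
```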

Referring to FIG. 2 and FIG. 3, the embodiment of the present disclosure further provides a command distribution apparatus and a specific example of command distribution using it. The example includes a command distributor and 5 operation units connected to the command distributor, the 5 operation units being respectively:

Two arithmetic operation units (Arithmetic and Logic Unit, ALU), which process instructions requiring two operands.

A write address operation unit (ST), which processes instructions requiring two operands.

A read address operation unit (LD), which processes instructions requiring one operand.

A transcendental function operation unit (Tensor Function Unit, TFU), which processes instructions requiring two operands.

There are 64 threads in total, namely thread 0 to thread 63; 8 register groups (banks), namely Bank0 to Bank7; and 5 operation units. Each register group is allocated 8 threads.

Each bank has only one read path. Within one processing cycle, accesses by different operation units to the same bank conflict, whereas accesses by different operation units to different banks do not conflict.

For an instruction with two operands, the reading of the operands needs to be performed in two cycles in the same Bank.

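In even cycles, the register groups numbered 0, 2, 4 and 6 are taken as the first target register groups; in odd cycles, the register groups numbered 1, 3, 5 and 7 are taken as the first target register groups.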

Taking an even cycle as an example, the valid, highest-priority ALU, ST, LD and TFU instructions are selected from the 8 threads allocated to each even-numbered bank.

From these even-numbered banks, the two highest-priority ALU instructions are first selected and dispatched.

With the banks of the dispatched ALU instructions occupied, an ST instruction is then selected from the remaining banks and dispatched.

With the banks of the ALU and ST instructions occupied, an LD instruction is selected from the remaining banks and dispatched.

Because ALU and ST instructions have two operands, their operands must still be read from the same banks in the next cycle. In that next processing cycle, with the banks of the ALU and ST instructions occupied, a TFU instruction is selected from the remaining banks and dispatched.

A two-operand instruction dispatched in an even cycle must continue reading from the same even bank in the following odd cycle, but no bank conflict arises, because the following cycle dispatches instructions of the odd banks. Odd cycles are scheduled in the same way.

As shown in FIG. 3:

A. In the 0th processing cycle, the determined banks are Bank0, Bank2, Bank4 and Bank6.

The command determined for Bank0 is an ALU command, and the operation unit that reads operands from Bank0 is ALU0; the ALU command is distributed to ALU0 in the 0th processing cycle. In the 0th and 1st processing cycles, ALU0 reads the first operand alu0_r0 and the second operand alu0_r1 from Bank0, respectively.

The command determined for Bank2 is an ST command, and in the 0th processing cycle the ST command is distributed to the ST unit. The operation unit that reads operands from Bank2 is ST; in the 0th and 1st processing cycles, ST reads the first operand st_r0 and the second operand st_r1 from Bank2, respectively.

The commands determined for Bank4 are an LD command and a TFU command, and the operation units that read operands from Bank4 are LD and TFU. In the 0th processing cycle, the LD command is distributed to the operation unit LD, which reads the operand corresponding to the LD command from Bank4. In the 1st processing cycle, the TFU command is distributed to the operation unit TFU, which reads the operand corresponding to the TFU command from Bank4.

The command determined for Bank6 is an ALU command, and in the 0th processing cycle the ALU command is distributed to ALU1. The operation unit that reads operands from Bank6 is ALU1; in the 0th and 1st processing cycles, ALU1 reads the first operand alu1_r0 and the second operand alu1_r1 from Bank6, respectively.

B. In the 1st processing cycle, the determined banks are Bank1, Bank3, Bank5 and Bank7.

The command determined for Bank1 is an ST command, and in the 1st processing cycle the ST command is distributed to the ST unit. The operation unit that reads operands from Bank1 is ST; in the 1st and 2nd processing cycles, ST reads the first operand st_r0 and the second operand st_r1 from Bank1, respectively.

The command determined for Bank3 is an ALU command, and the operation unit that reads operands from Bank3 is ALU0; the ALU command is distributed to ALU0 in the 1st processing cycle. In the 1st and 2nd processing cycles, ALU0 reads the first operand alu0_r0 and the second operand alu0_r1 from Bank3, respectively.

The command determined for Bank5 is an ALU command, and in the 1st processing cycle the ALU command is distributed to ALU1. The operation unit that reads operands from Bank5 is ALU1; in the 1st and 2nd processing cycles, ALU1 reads the first operand alu1_r0 and the second operand alu1_r1 from Bank5, respectively.

The commands determined for Bank7 are an LD command and a TFU command, and the operation units that read operands from Bank7 are LD and TFU. In the 1st processing cycle, the LD command is distributed to the operation unit LD, which reads the operand corresponding to the LD command from Bank7. In the 2nd processing cycle, the TFU command is distributed to the operation unit TFU, which reads the operand corresponding to the TFU command from Bank7.

The 3rd processing cycle, the 4th processing cycle, and so on up to the 8th processing cycle proceed in a similar manner. In this way, within any one processing cycle, each register group is accessed by only one operation unit, so that data conflicts caused by several operation units accessing the same register group in the same processing cycle are avoided. This improves command distribution efficiency and, in turn, command processing efficiency.
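The FIG. 3 walkthrough can be replayed as a table of which unit reads which bank in each cycle. The dispatch list below is transcribed from the description of cycles 0 to 2; the number of read cycles per command follows that walkthrough (two for ALU and ST, one each for LD and TFU as shown there). The names `DISPATCHES` and `readers_by_slot` are hypothetical.

```python
from collections import defaultdict

# (dispatch cycle, bank, unit, number of read cycles), per the FIG. 3 walkthrough.
DISPATCHES = [
    (0, 0, "ALU0", 2), (0, 2, "ST", 2), (0, 4, "LD", 1), (0, 6, "ALU1", 2),
    (1, 4, "TFU", 1),
    (1, 1, "ST", 2), (1, 3, "ALU0", 2), (1, 5, "ALU1", 2), (1, 7, "LD", 1),
    (2, 7, "TFU", 1),
]


def readers_by_slot(dispatches):
    """Map each (cycle, bank) slot to the list of units reading the bank then."""
    slots = defaultdict(list)
    for cycle, bank, unit, reads in dispatches:
        for k in range(reads):
            slots[(cycle + k, bank)].append(unit)
    return slots


slots = readers_by_slot(DISPATCHES)
for cycle in range(3):
    row = [",".join(slots.get((cycle, bank), ["-"])) for bank in range(8)]
    print(f"cycle {cycle}: " + " ".join(f"{cell:>5}" for cell in row))

# Every occupied slot should have exactly one reader.
print(all(len(units) == 1 for units in slots.values()))
```

In this replay all eight banks are read in cycle 1, each by a single unit, which matches the stated goal of keeping every read path busy without conflicts.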

It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.

Based on the same inventive concept, the embodiment of the present disclosure further provides a command distributor corresponding to the command distribution method. Because the principle by which the command distributor solves the problem is similar to that of the command distribution method in the embodiment of the present disclosure, the implementation of the command distributor may refer to the implementation of the method, and repeated description is omitted.

Referring to FIG. 4, a schematic diagram of a command distributor according to an embodiment of the present disclosure is provided, where the command distributor includes a scheduler 41 and a distribution interface 42;

The scheduler 41 is configured to determine a plurality of first target register sets corresponding to a current processing cycle from a plurality of register sets, where the first target register sets are different from a second target register set determined by at least one last historical processing cycle;

the dispatch interface 42 is configured to dispatch the commands corresponding to the determined target threads to the corresponding computing units.

In a possible implementation manner, the scheduler 41 is configured, when determining, from a plurality of register sets, a plurality of first target register sets corresponding to a current processing cycle, to:

In a possible implementation, the scheduler 41 is further configured to:

In a possible implementation manner, the scheduler 41, when determining, from the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:

In a possible implementation manner, the scheduler 41, when determining, based on the command execution state of each thread in the thread groups respectively corresponding to the plurality of first target register groups, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle from those thread groups, is configured to:

In a possible implementation manner, the scheduler 41, when determining, from the plurality of candidate threads, the target threads respectively corresponding to the plurality of first target register groups in the current processing cycle, is configured to:

In a possible implementation, the scheduler 41 is further configured to:

In a possible embodiment, the scheduler 41 is further configured to:

In a possible implementation manner, the scheduler 41 is further configured to obtain feedback information generated by the operation unit after executing the command;

In a possible implementation, the scheduler 41 is further configured to:

For a description of the processing flow of each module in the command distributor, and the interaction flow between the modules, reference is made to the relevant description in the above method embodiment, and will not be described in detail here.

In addition, the command distributor provided by the embodiment of the present disclosure may be a chip capable of implementing the command distribution method provided by the embodiment of the present disclosure.

The embodiment of the disclosure also provides a chip, as shown in FIG. 5, comprising a controller 51, a command distributor 52, and an operator 53;

the controller 51 is configured to obtain commands corresponding to the multiple threads, and send the commands to the command distributor 52;

the command distributor 52 is configured to distribute the commands to the operator 53 based on the command distribution method provided by any one of the embodiments of the present disclosure;

the operator 53 is configured to read operands from the first target register group corresponding to a command, and execute the command based on the operands.

For the specific process by which the chip executes commands, reference may be made to the steps of the command distribution method described in the embodiments of the present disclosure, which are not repeated here.

The embodiment of the disclosure also provides electronic equipment, which comprises the chip provided by any embodiment of the disclosure.

The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the command distribution method described in the method embodiments above. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.

The embodiments of the present disclosure further provide a computer program product, where the computer program product carries a program code, where instructions included in the program code may be used to perform the steps of the command distribution method described in the foregoing method embodiments, and specifically reference may be made to the foregoing method embodiments, which are not described herein in detail.

Wherein the above-mentioned computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated in one operation unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or a part of the technical solution, or in the form of a software product stored in a storage medium, including several commands to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present disclosure. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.

It should be noted that the foregoing embodiments are merely specific implementations of the disclosure, and are not intended to limit the scope of the disclosure, and although the disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that any modification, variation or substitution of some of the technical features described in the foregoing embodiments may be made or equivalents may be substituted for those within the scope of the disclosure without departing from the spirit and scope of the technical aspects of the embodiments of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

1. A command distribution method, comprising:

determining a plurality of first target register groups corresponding to a current processing cycle from a plurality of register groups, wherein the first target register groups are different from a second target register group determined in at least one recent historical processing cycle;

determining, from the thread groups respectively corresponding to the plurality of first target register groups, target threads respectively corresponding to the plurality of first target register groups in the current processing cycle;

distributing commands corresponding to the determined target threads to the corresponding computing units;

the method further comprising:

in response to the presence of a multi-operand to-be-dispatched command having two operands among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, for each single-operand to-be-dispatched command among the to-be-dispatched commands corresponding to the target threads determined for the current processing cycle, determining, in the next processing cycle after the current processing cycle, another single-operand to-be-dispatched command in a ready state for the first target register group where the single-operand to-be-dispatched command is located.

2. The command distribution method according to claim 1, wherein determining a plurality of first target register groups corresponding to the current processing cycle from the plurality of register groups comprises:

In a case where the current processing cycle is an odd-numbered cycle, determining an odd-numbered register group among the multiple register groups as the first target register group;

When the current processing cycle is an even-numbered cycle, an even-numbered register group among the multiple register groups is determined as the first target register group.

3. The command distribution method according to claim 1, further comprising:

The number of register groups is determined according to the number of operands of the operation unit requiring the largest number of operands, and the registers are divided into the plurality of register groups.

4. The command distribution method according to any one of claims 1 to 3, characterized in that the step of determining, from the thread groups corresponding to the plurality of first target register groups respectively, the target threads corresponding to the plurality of first target register groups respectively in the current processing cycle comprises:

Based on the determined command execution status information of each thread in the thread groups corresponding to the plurality of first target register groups, target threads corresponding to the plurality of first target register groups in a current processing cycle are determined from the thread groups corresponding to the plurality of first target register groups.

5. The command distribution method according to claim 4, characterized in that the step of determining, based on the command execution status of each thread in the thread groups corresponding to the plurality of first target register groups, target threads corresponding to the plurality of first target register groups in a current processing cycle from the thread groups corresponding to the plurality of first target register groups comprises:

Determine, from the thread groups respectively corresponding to the plurality of first target register groups, a plurality of candidate threads whose command execution status information is in a ready state;

From the multiple candidate threads, target threads corresponding to the multiple first target register groups respectively in a current processing cycle are determined.

6. The command distribution method according to claim 5, wherein determining the target threads corresponding to the plurality of first target register groups respectively in the current processing cycle from the plurality of candidate threads comprises:

Based on the priorities of the to-be-dispatched commands respectively corresponding to the multiple candidate threads, target threads respectively corresponding to the multiple first target register groups in a current processing cycle are determined from the multiple candidate threads.

7. The command distribution method according to claim 5, wherein determining the target threads corresponding to the plurality of first target register groups respectively in the current processing cycle from the plurality of candidate threads comprises:

Based on the priorities of the to-be-dispatched commands respectively corresponding to the multiple candidate threads and the occupation status of the computing units corresponding to those to-be-dispatched commands, target threads respectively corresponding to the multiple first target register groups in the current processing cycle are determined from the multiple candidate threads.
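
By way of example only, the selection for a single first target register group might look like the following Python sketch. The tuple layout, the unit names, and the convention that a larger number means a higher priority are assumptions made for illustration.

```python
def pick_by_priority_and_occupancy(candidates, busy_units):
    """Pick one candidate for a single register group.

    candidates: list of (thread_id, priority, unit) tuples, where `unit`
    names the computing unit the pending command needs (hypothetical).
    busy_units: set of computing units that are currently occupied.
    Assumption: a larger priority value means a more urgent command.
    """
    # Drop candidates whose computing unit is occupied this cycle.
    eligible = [c for c in candidates if c[2] not in busy_units]
    if not eligible:
        return None
    # Among the rest, take the highest-priority command.
    return max(eligible, key=lambda c: c[1])
```

With an empty `busy_units` set the function reduces to the priority-only selection of claim 6.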

8. The command distribution method according to claim 5, wherein determining the target threads corresponding to the plurality of first target register groups respectively in the current processing cycle from the plurality of candidate threads comprises:

Based on the command types of the current to-be-dispatched commands respectively corresponding to the multiple candidate threads and the types of the computing units, target threads respectively corresponding to the multiple first target register groups in the current processing cycle are determined from the multiple candidate threads.

9. The command distribution method according to any one of claims 1 to 3, characterized in that it further comprises:

In response to the existence of a multi-operand to-be-dispatched command with more than one operand in the to-be-dispatched commands corresponding to the target thread determined for the current processing cycle, the first target register group corresponding to the multi-operand to-be-dispatched command distributes a corresponding operand to the computing unit corresponding to the to-be-dispatched command in each processing cycle from the current processing cycle to the target processing cycle;

The difference between the target processing cycle and the current processing cycle is equal to the number of operands of the multi-operand to-be-dispatched command minus one.
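
The cycle arithmetic of claim 9 can be illustrated with a short Python sketch. The function name and the operand labels are hypothetical; the sketch only assumes that one operand is delivered per processing cycle, starting in the current cycle.

```python
def operand_schedule(current_cycle, operands):
    """Map each processing cycle to the operand delivered in that cycle."""
    # One operand per cycle, beginning with the current cycle.
    return {current_cycle + i: op for i, op in enumerate(operands)}


def target_cycle(current_cycle, operand_count):
    """Last cycle in which an operand is delivered."""
    return current_cycle + operand_count - 1
```

For a three-operand command dispatched in cycle 10, operands are delivered in cycles 10, 11, and 12, so the target processing cycle is two cycles after the current one.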

10. The command distribution method according to any one of claims 1 to 3, characterized in that it further comprises:

In response to the existence of a multi-operand to-be-dispatched command having more than one operand among the to-be-dispatched commands corresponding to the target thread determined for the current processing cycle, determining the maximum operand number, which is the number of operands of the multi-operand to-be-dispatched command having the largest number of operands;

For each other to-be-dispatched command corresponding to the target thread determined for the current processing cycle and having an operand number less than the maximum operand number, in each processing cycle from the next processing cycle of the current processing cycle to the processing cycle before the first target register group is scheduled again, in response to the first target register group where the other to-be-dispatched command is located being idle, determine a to-be-dispatched command in a ready state for the first target register group from the thread group corresponding to the first target register group;

The number of operands of the to-be-dispatched command in the ready state is not greater than the number of processing cycles from the processing cycle where that command is located to the processing cycle where the first target register group is scheduled again.
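
As a rough sketch of the final limitation of claim 10, the following Python reads the count in that limitation as an operand count and assumes that the window spans the cycles from the fill command's cycle up to, but not including, the cycle in which the register group is scheduled again. Both readings, and all names, are assumptions.

```python
def fits_idle_window(operand_count, cycle, next_scheduled_cycle):
    """True if every operand can be delivered before the group is rescheduled."""
    # Assumed window length: cycles cycle .. next_scheduled_cycle - 1.
    return operand_count <= next_scheduled_cycle - cycle


def pick_fill_command(ready_commands, cycle, next_scheduled_cycle):
    """Pick the first ready command that fits the idle window.

    ready_commands: list of (name, operand_count) tuples (hypothetical).
    """
    for name, operand_count in ready_commands:
        if fits_idle_window(operand_count, cycle, next_scheduled_cycle):
            return name
    return None
```

Under these assumptions, an idle register group in cycle 6 that is scheduled again in cycle 8 can accept a command with at most two operands.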

11. The command distribution method according to claim 4, further comprising: obtaining feedback information generated by the computing unit after executing the command;

Based on the feedback information, command execution status information corresponding to the thread to which the executed command belongs is generated.
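
One possible shape of the feedback path in claim 11 is sketched below in Python. The dictionary keys and the status values are hypothetical; the disclosure does not fix a feedback format in this claim.

```python
def apply_feedback(status_table, feedback):
    """Return an updated thread-status table after a computing unit reports back.

    status_table: dict mapping thread_id to a status string.
    feedback: dict with hypothetical keys "thread_id" and "done".
    Assumption: a thread becomes "ready" again once its command completes.
    """
    updated = dict(status_table)  # leave the caller's table untouched
    updated[feedback["thread_id"]] = "ready" if feedback["done"] else "executing"
    return updated
```

The updated table is what a selection step such as that of claim 5 would consult in a later processing cycle.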

12. The command distribution method according to any one of claims 1 to 3, characterized in that it further comprises:

Based on the number of the register groups and the number of currently executing threads, the currently executing threads are grouped to obtain thread groups corresponding to each of the register groups.
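
Claim 12 does not prescribe a particular grouping rule. As one illustrative possibility, the Python sketch below assigns threads to register groups round-robin; the rule and the function name are assumptions.

```python
def group_threads(thread_ids, num_register_groups):
    """Split currently executing threads into one thread group per register group."""
    groups = [[] for _ in range(num_register_groups)]
    for index, thread_id in enumerate(thread_ids):
        # Assumed rule: the i-th thread goes to group i mod N.
        groups[index % num_register_groups].append(thread_id)
    return groups
```

With eight threads and four register groups, each thread group holds two threads, for example threads 0 and 4 in the first group.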

13. A command distributor, comprising: a scheduler and a distribution interface;

The scheduler is configured to determine a plurality of first target register groups corresponding to a current processing cycle from a plurality of register groups; wherein the first target register group is different from a second target register group determined in at least one recent historical processing cycle; and determine, from thread groups corresponding to the plurality of first target register groups, target threads corresponding to the plurality of first target register groups in the current processing cycle;

The distribution interface is used to distribute commands corresponding to the determined target threads to the corresponding computing units;

The scheduler is further used for:

14. The command distributor according to claim 13, wherein the scheduler, when determining a plurality of first target register groups corresponding to a current processing cycle from a plurality of register groups, is configured to:

15. The command distributor according to claim 13, wherein the scheduler is further used for:

16. The command distributor according to any one of claims 13 to 15, characterized in that the scheduler, when determining, from the thread groups corresponding to the plurality of first target register groups, the target threads corresponding to the plurality of first target register groups in the current processing cycle, is configured to:

17. The command distributor according to claim 16, characterized in that the scheduler, when determining the target threads corresponding to the plurality of first target register groups respectively in the current processing cycle from the thread groups respectively corresponding to the plurality of first target register groups based on the determined command execution status of each thread in the thread groups respectively corresponding to the plurality of first target register groups, is configured to:

18. The command distributor according to claim 17, wherein the scheduler, when determining the target threads corresponding to the plurality of first target register groups respectively in the current processing cycle from the plurality of candidate threads, is configured to:

19. The command distributor according to claim 17, wherein the scheduler, when determining the target threads corresponding to the plurality of first target register groups respectively in the current processing cycle from the plurality of candidate threads, is configured to:

20. The command distributor according to claim 17, wherein the scheduler, when determining the target threads corresponding to the plurality of first target register groups respectively in the current processing cycle from the plurality of candidate threads, is configured to:

21. The command distributor according to any one of claims 13 to 15, characterized in that the scheduler is further used for:

22. The command distributor according to any one of claims 13 to 15, characterized in that the scheduler is further used for:

23. The command distributor according to claim 16, characterized in that the scheduler is further used to: obtain feedback information generated by the computing unit after executing the command;

24. The command distributor according to any one of claims 13 to 15, characterized in that the scheduler is further used for:

25. A chip, characterized in that it comprises: a controller, a command distributor, and an operator;

The controller is used to obtain commands corresponding to the multiple threads respectively, and send the commands to the command distributor;

The command distributor is used to distribute the command to the operator based on the command distribution method according to any one of claims 1 to 12;

The operator is used to read, based on a command distributed by the command distributor, operands from the target register group corresponding to the command, and to execute the command based on the operands.

26. An electronic device, comprising the chip according to claim 25.

27. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the command distribution method according to any one of claims 1 to 12 are executed.