WO2022067510A1

WO2022067510A1 - Processor, processing method, and related device

Info

Publication number: WO2022067510A1
Application number: PCT/CN2020/118836
Authority: WO
Inventors: 王锦
Original assignee: 华为技术有限公司
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2022-04-07
Also published as: CN116507999A; CN116507999B

Abstract

Disclosed in the present invention are a processor, a processing method, and a related device. The processor comprises a data selection unit, an instruction decoding unit connected to the data selection unit, and M rows*N columns of general-purpose register sets; the instruction decoding unit is used for decoding X input instructions, obtaining at least one source operand address of each of the X instructions, Y source operand addresses in total, and sending the Y source operand addresses to the data selection unit; each source operand address among the Y source operand addresses comprises at least one column address bit, at least one row address bit, and at least one target address bit; the data selection unit is used for accessing, according to the at least one column address bit, the at least one row address bit, and the at least one target address bit comprised in the i-th source operand address, the general-purpose register corresponding to the i-th source operand address in the M rows*N columns of general-purpose register sets, and obtaining a corresponding source operand. According to the present invention, the operand selection cost can be reduced.

Description

A processor, processing method and related equipment

Technical Field

The present invention relates to the field of computer technology, and in particular, to a processor, a processing method and related equipment.

Background

As the scale and complexity of data grow in many fields, the demands on processor computing power keep rising. To improve the efficiency with which a processor executes instructions, most existing technologies divide the execution of an instruction into multiple small steps, that is, an instruction pipeline. For example, a common five-stage pipeline may include an instruction fetch stage, an instruction decoding stage, a data selection stage, an execution stage, and a write-back stage. In general, the more pipeline stages there are, that is, the more small steps an instruction is divided into, the higher the execution efficiency of the instruction. In the data selection stage, data must be read from the corresponding general-purpose register (General-Purpose Register, GPR) according to the source operand address of each instruction, so as to obtain the source operands required to execute the instruction. In the write-back stage, the calculation result of the instruction must be written back to the corresponding GPR. When a subsequent instruction needs the calculation result, it can be obtained from the GPR, that is, the calculation result can serve as a source operand of the subsequent instruction.

To sum up, with a small-capacity GPR (such as a 32-byte GPR), data that cannot be written back to the GPR often has to be written temporarily into the data memory and then, when a subsequent instruction needs it, read from the data memory back into the GPR, which results in frequent swapping of data between the data memory and the GPR. If dependencies between instructions are involved, this causes pipeline stalls and reduces processor performance. However, if the GPR is simply enlarged to solve this problem of frequent swapping, for example by using the entire data memory as the GPR (for example, expanding the GPR to 512 bytes), the selection range grows greatly (from a data range of 32 bytes to a data range of 512 bytes), which imposes a huge logic cost on the operand selection described above.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide a processor, a processing method, and related equipment, which can reduce the cost of operand selection and ensure the processing efficiency of the processor.

In a first aspect, an embodiment of the present invention provides a processor, including a data selection unit, an instruction decoding unit connected to the data selection unit, and M rows*N columns of general-purpose register groups, wherein each general-purpose register group among the M rows*N columns of general-purpose register groups includes K general-purpose registers; M, N, and K are integers greater than or equal to 1. The instruction decoding unit is configured to decode X input instructions, obtain at least one source operand address of each of the X instructions, Y source operand addresses in total, and send the Y source operand addresses to the data selection unit. Each of the Y source operand addresses includes at least one column address bit, at least one row address bit, and at least one target address bit; the at least one column address bit indicates the general-purpose register column to which the source operand address belongs, the at least one row address bit indicates the general-purpose register row to which the source operand address belongs, and the at least one target address bit indicates that the source operand address corresponds to the t-th general-purpose register among the K general-purpose registers; X, Y, and t are integers greater than or equal to 1. The data selection unit is configured to access, according to the at least one column address bit, the at least one row address bit, and the at least one target address bit included in the i-th source operand address, the general-purpose register corresponding to the i-th source operand address in the M rows*N columns of general-purpose register groups, and obtain the corresponding source operand; the i-th source operand address is one of the Y source operand addresses, and i is an integer greater than or equal to 1 and less than or equal to Y.

The embodiment of the present invention provides a processor that reduces the selection cost by narrowing the selection range of operands, through both the division of the general-purpose register groups and the design of the source operand address. In terms of the general-purpose register groups, the processor may include pre-divided M rows*N columns of general-purpose register groups, and each general-purpose register group may include at least one general-purpose register. In terms of the source operand address, the source operand address in this embodiment may include at least one column address bit, at least one row address bit, and at least one target address bit. The at least one column address bit can indicate the general-purpose register column to which the source operand address belongs (for example, if there are 4 columns of general-purpose registers in total, two binary column address bits can indicate the column), the at least one row address bit can indicate the general-purpose register row to which the source operand address belongs, and the at least one target address bit can indicate which general-purpose register within a general-purpose register group the source operand address corresponds to. The processor can therefore select the operand within the general-purpose register row and general-purpose register column indicated by the respective address bits of each source operand address.
For example, the general-purpose register column to which each source operand address belongs can be determined from the column address bits, and then, within that column, the general-purpose register corresponding to the source operand address can be determined from the remaining row address bits and target address bits. Alternatively, the general-purpose register row to which each source operand address belongs can be determined from the row address bits, and then, within that row, the general-purpose register corresponding to the source operand address can be determined from the remaining column address bits and target address bits; the data in that register is then read, that is, the corresponding source operand is obtained. Combining these two aspects, the embodiment of the present invention greatly narrows the selection range and reduces the logic cost of selection by performing operand selection within a designated general-purpose register column or a designated general-purpose register row. In the prior art, by contrast, performing operand selection directly over the entire data range (that is, over all general-purpose registers) incurs a large selection cost, while extracting part of the data in advance and then performing operand selection within that data range incurs additional hardware cost. Compared with these existing solutions, the embodiment of the present invention can, on the basis of the existing hardware structure, determine for each source operand address a smaller data range within the original larger data range and then perform operand selection within that smaller range, which greatly reduces the selection cost and safeguards the execution efficiency of instructions and the performance of the processor.
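The address structure described above can be illustrated with a minimal Python sketch. The geometry (M = 4, N = 4, K = 2), the field widths, and the bit order [row | column | target] are assumptions chosen only for illustration; the text does not fix a particular layout, and the names `decode` and `read_operand` are hypothetical.

```python
from dataclasses import dataclass

# Assumed geometry: M rows x N columns of register groups, K registers per group.
M, N, K = 4, 4, 2
ROW_BITS, COL_BITS, TGT_BITS = 2, 2, 1  # assumed field widths


@dataclass(frozen=True)
class DecodedAddress:
    row: int
    col: int
    target: int


def decode(addr: int) -> DecodedAddress:
    """Split an address into its fields, assuming the layout [row | column | target]."""
    target = addr & ((1 << TGT_BITS) - 1)
    col = (addr >> TGT_BITS) & ((1 << COL_BITS) - 1)
    row = (addr >> (TGT_BITS + COL_BITS)) & ((1 << ROW_BITS) - 1)
    return DecodedAddress(row, col, target)


def read_operand(regs, addr: int) -> int:
    """Access the register named by the three address fields and return its data."""
    d = decode(addr)
    return regs[d.row][d.col][d.target]


# Register file filled with a recognisable pattern: 100*row + 10*column + target.
regs = [[[100 * r + 10 * c + t for t in range(K)] for c in range(N)] for r in range(M)]

addr = 0b10_01_1  # row 2, column 1, target 1 under the assumed layout
print(decode(addr), read_operand(regs, addr))
```

With 4 columns, two column bits suffice, which matches the example given in the text; the remaining bits only have to discriminate among the registers of the selected column.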

In a possible implementation manner, the processor further includes an execution unit connected to the data selection unit, and the execution unit includes at least one arithmetic logic unit; the at least one arithmetic logic unit is configured to execute the X instructions based on the source operands corresponding to the Y source operand addresses, to obtain respective calculation results of the X instructions.

In this embodiment of the present invention, the processor may further include an execution unit, and the execution unit may include one or more arithmetic logic units. Based on the acquired source operands, the one or more arithmetic logic units can execute the computing tasks corresponding to the respective instructions, such as addition and subtraction, to obtain the calculation result of each instruction, which further improves the overall execution efficiency of the instructions.

In a possible implementation manner, the processor further includes a result write-back unit connected to the execution unit. The instruction decoding unit is further configured to acquire the respective destination operand addresses of the X instructions, X destination operand addresses in total; each of the X destination operand addresses includes at least one column address bit, at least one row address bit, and at least one target address bit; the at least one column address bit indicates the general-purpose register column to which the destination operand address belongs, the at least one row address bit indicates the general-purpose register row to which the destination operand address belongs, and the at least one target address bit indicates that the destination operand address corresponds to the t-th general-purpose register among the K general-purpose registers. The result write-back unit is configured to access, according to the at least one column address bit, the at least one row address bit, and the at least one target address bit included in the j-th destination operand address, the general-purpose register corresponding to the j-th destination operand address in the M rows*N columns of general-purpose register groups, and write the calculation result of the j-th instruction back into that general-purpose register; j is an integer greater than or equal to 1 and less than or equal to X.

In this embodiment of the present invention, the processor may further include a result write-back unit. In addition, the destination operand address in this embodiment may include at least one column address bit, at least one row address bit, and at least one target address bit. The at least one column address bit can indicate the general-purpose register column to which the destination operand address belongs (for example, if there are 4 columns of general-purpose registers, two binary column address bits can indicate the column), the at least one row address bit can indicate the general-purpose register row to which the destination operand address belongs, and the at least one target address bit can indicate which general-purpose register within a general-purpose register group the destination operand address corresponds to. The processor can therefore select the general-purpose register within the general-purpose register row and general-purpose register column indicated by the respective address bits of each destination operand address. For example, the general-purpose register column to which each destination operand address belongs can be determined from the column address bits, and then, within that column, the general-purpose register corresponding to the destination operand address can be determined from the remaining row address bits and target address bits. Alternatively, the general-purpose register row to which each destination operand address belongs can be determined from the row address bits, and then, within that row, the general-purpose register corresponding to the destination operand address can be determined from the remaining column address bits and target address bits; the respective calculation results of the multiple instructions are then written back to the corresponding general-purpose registers.
Therefore, by performing the selection within a designated general-purpose register column or a designated general-purpose register row, the selection range is greatly narrowed and the logic cost of selection is reduced.
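The write-back path can be sketched in the same way. The following is only an illustration under assumed parameters (M = 2, N = 4, K = 2, bit order [row | column | target]); `locate` and `write_back` are hypothetical helper names, and the results are stand-ins for values produced by the execution unit.

```python
M, N, K = 2, 4, 2          # assumed geometry: rows, columns, registers per group
COL_BITS, TGT_BITS = 2, 1  # assumed widths; the single row bit sits above them


def locate(addr):
    """Return (row, column, target), assuming the layout [row | column | target]."""
    target = addr & ((1 << TGT_BITS) - 1)
    col = (addr >> TGT_BITS) & ((1 << COL_BITS) - 1)
    row = addr >> (TGT_BITS + COL_BITS)
    return row, col, target


def write_back(regs, dest_addr, result):
    """Write one calculation result into the register named by dest_addr."""
    row, col, target = locate(dest_addr)
    regs[row][col][target] = result


regs = [[[0] * K for _ in range(N)] for _ in range(M)]

# Two instructions of one cycle; their destinations lie in different columns.
results = {0b0_01_0: 7, 0b1_10_1: 9}
for dest, value in results.items():
    write_back(regs, dest, value)

print(regs[0][1][0], regs[1][2][1])
```

The destination address is decoded exactly like a source address, so the same narrowing of the selection range applies to writes.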

In a possible implementation manner, the data selection unit is specifically configured to: determine, according to the at least one column address bit included in the i-th source operand address, the i'-th general-purpose register column to which the i-th source operand address belongs, and, in the i'-th general-purpose register column, access the corresponding general-purpose register according to the at least one row address bit and the at least one target address bit included in the i-th source operand address, where i' is an integer greater than or equal to 1 and less than or equal to N; or determine, according to the at least one row address bit included in the i-th source operand address, the i"-th general-purpose register row to which the i-th source operand address belongs, and, in the i"-th general-purpose register row, access the corresponding general-purpose register according to the at least one column address bit and the at least one target address bit included in the i-th source operand address, where i" is an integer greater than or equal to 1 and less than or equal to M.

In this embodiment of the present invention, the general-purpose register column to which each source operand address belongs may be determined from the column address bits, and then, within that column, the general-purpose register corresponding to the source operand address may be determined from the remaining row address bits and target address bits. For example, the at least one column address bit determines that the source operand address belongs to the general-purpose register groups of the 1st column; the row address bits and target address bits then determine that the source operand address corresponds to the 2nd general-purpose register in the general-purpose register group of the 1st row. Alternatively, the general-purpose register row to which each source operand address belongs may be determined from the row address bits, and then, within that row, the general-purpose register corresponding to the source operand address may be determined from the remaining column address bits and target address bits. For example, the at least one row address bit determines that the source operand address belongs to the general-purpose register groups of the 1st row; the column address bits and target address bits then determine that the source operand address corresponds to the 2nd general-purpose register in the general-purpose register group of the 1st column. In this way, the operand selection within the M rows*N columns of general-purpose register groups is completed according to the respective address bits of the source operand address, that is, the general-purpose register corresponding to the source operand address is determined, which greatly reduces the selection cost. It should be noted that the embodiment of the present application aims, through the division into rows and columns, to perform operand selection within the row and column indicated by the address bits, so as to narrow the selection range.
Therefore, the embodiment of the present application does not specifically limit whether the selection range is narrowed by rows or by columns. It can be understood that, in some possible implementations, if each general-purpose register group includes only one general-purpose register, the source operand address may not include target address bits.
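The two selection orders described above can be compared in a short sketch. It is an illustration only: the geometry M = 4, N = 8, K = 2 is assumed, the register file is modelled as a nested Python list, and the function names are hypothetical. Both orders reach the same register; they differ in how many registers remain selectable after the first step.

```python
M, N, K = 4, 8, 2  # assumed geometry: rows, columns, registers per group


def select_column_first(regs, row, col, target):
    # Step 1: the column bits pick one of N columns, leaving M groups.
    column = [regs[r][col] for r in range(M)]
    # Step 2: row and target bits pick one register among those M*K.
    return column[row][target]


def select_row_first(regs, row, col, target):
    # Step 1: the row bits pick one of M rows, leaving N groups.
    row_groups = regs[row]
    # Step 2: column and target bits pick one register among those N*K.
    return row_groups[col][target]


def remaining_candidates(order):
    """Registers still selectable after step 1 for the given order."""
    return M * K if order == "column" else N * K


# Each register holds its own coordinates so the result is easy to check.
regs = [[[(r, c, t) for t in range(K)] for c in range(N)] for r in range(M)]

print(select_column_first(regs, 0, 0, 1))
print(select_row_first(regs, 0, 0, 1))
print(M * N * K, remaining_candidates("column"), remaining_candidates("row"))
```

Under these assumed sizes the full range holds 64 registers, while the second step chooses among 8 (column first) or 16 (row first), which is the narrowing the paragraph refers to.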

In a possible implementation manner, the result write-back unit is specifically configured to: determine, according to the at least one column address bit included in the j-th destination operand address, the j'-th general-purpose register column to which the j-th destination operand address belongs, and, in the j'-th general-purpose register column, access the corresponding general-purpose register according to the at least one row address bit and the at least one target address bit included in the j-th destination operand address, where j' is an integer greater than or equal to 1 and less than or equal to N; or determine, according to the at least one row address bit included in the j-th destination operand address, the j"-th general-purpose register row to which the j-th destination operand address belongs, and, in the j"-th general-purpose register row, access the corresponding general-purpose register according to the at least one column address bit and the at least one target address bit included in the j-th destination operand address, where j" is an integer greater than or equal to 1 and less than or equal to M.

In this embodiment of the present invention, the general-purpose register column to which each destination operand address belongs may be determined from the column address bits, and then, within that column, the general-purpose register corresponding to the destination operand address may be determined from the remaining row address bits and target address bits. For example, the at least one column address bit determines that the destination operand address belongs to the general-purpose register groups of the 1st column; the row address bits and target address bits then determine that the destination operand address corresponds to the 2nd general-purpose register in the general-purpose register group of the 1st row. Alternatively, the general-purpose register row to which each destination operand address belongs may be determined from the row address bits, and then, within that row, the general-purpose register corresponding to the destination operand address may be determined from the remaining column address bits and target address bits. For example, the at least one row address bit determines that the destination operand address belongs to the general-purpose register groups of the 1st row; the column address bits and target address bits then determine that the destination operand address corresponds to the 2nd general-purpose register in the general-purpose register group of the 1st column. In this way, the selection within the M rows*N columns of general-purpose register groups is completed according to the respective address bits of the destination operand address, that is, the general-purpose register corresponding to the destination operand address is determined, which greatly reduces the selection cost.

In a possible implementation manner, any two different source operand addresses among the Y source operand addresses belong to different general-purpose register columns, and any two different destination operand addresses among the X destination operand addresses belong to different general-purpose register columns.

In this embodiment of the present invention, to further reduce the logic cost of selection, each general-purpose register column is constrained to select only one operand at a time. If the source operand addresses or destination operand addresses of multiple instructions corresponded to different general-purpose registers in the same column, the processor would have to access (that is, select) different general-purpose registers within one general-purpose register column to obtain the corresponding source operands or to write the calculation results, which tends to increase the logic cost of selection and runs counter to the technical problem that the present invention sets out to solve.
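The column constraint can be expressed as a simple check over the operand addresses of one cycle. This is a sketch under an assumed bit layout ([row | column | target] with 2 column bits and 1 target bit); `column_of` and `satisfies_column_constraint` are hypothetical names, not part of the described processor.

```python
def column_of(addr, tgt_bits=1, col_bits=2):
    """Extract the column field, assuming the layout [row | column | target]."""
    return (addr >> tgt_bits) & ((1 << col_bits) - 1)


def satisfies_column_constraint(addresses):
    """True if any two *different* addresses fall in different columns.

    Repeated use of the same address is allowed, because it still selects
    only one register in that column.
    """
    owner = {}  # column -> the single address allowed to use it
    for addr in set(addresses):
        col = column_of(addr)
        if col in owner:
            return False
        owner[col] = addr
    return True


ok = [0b00_00_0, 0b01_01_0, 0b00_00_0]  # columns 0 and 1, one address repeated
bad = [0b00_01_0, 0b10_01_1]            # two different registers in column 1

print(satisfies_column_constraint(ok), satisfies_column_constraint(bad))
```

A check of this kind would naturally sit in the compiler or scheduler that groups instructions into one cycle, which is an assumption about where the rule is enforced.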

In a possible implementation manner, the processor further includes a temporary register, and the temporary register corresponds to a target general-purpose register; the target general-purpose register is one general-purpose register in the M rows*N columns of general-purpose register groups; the temporary register stores the data in the target general-purpose register. The Y source operand addresses include a plurality of first source operand addresses, whose number is greater than a first threshold, and a second source operand address; the general-purpose register corresponding to the first source operand addresses is the target general-purpose register, and the second source operand address belongs to the general-purpose register column in which the target general-purpose register is located. The data selection unit is further configured to access the corresponding temporary register according to the first source operand address to obtain the corresponding source operand.

In this embodiment of the present invention, the above constraint rule, set to reduce the selection cost, can easily reduce instruction execution efficiency when data dependencies are handled. For example, if the source operands of multiple instructions are the same, and the source operands of other instructions processed in parallel in the same layer correspond to different general-purpose registers in the same general-purpose register column as the source operands of the multiple instructions, then under the constraint rule the multiple instructions cannot be processed in parallel with the other instructions in the same layer; empty instructions must be inserted and the multiple instructions placed in the next layer for processing. This lowers instruction utilization and instruction execution efficiency. Therefore, a general-purpose register that is frequently referenced and updated can be replaced with a temporary register by the compiler. The data in the general-purpose register can be stored in the temporary register, and the processor can access the temporary register according to the source operand address to obtain the corresponding source operand; this access is not subject to the constraint rule, which safeguards the execution efficiency of instructions.
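The substitution described above might look as follows on the compiler side. This is a hedged sketch, not the described implementation: the bit layout, the name `TEMP`, the flat `gpr_values` map, and the functions `redirect_to_temp` and `fetch` are all assumptions introduced for illustration.

```python
from collections import Counter

COL_BITS, TGT_BITS = 2, 1  # assumed field widths, layout [row | column | target]
TEMP = "tmp0"              # hypothetical name for the temporary register


def column_of(addr):
    return (addr >> TGT_BITS) & ((1 << COL_BITS) - 1)


def redirect_to_temp(sources, first_threshold):
    """Return (rewritten sources, mirrored GPR address or None).

    An address is redirected when it is used more than first_threshold times
    and some other address in the same cycle lies in its column.
    """
    counts = Counter(sources)
    for addr, uses in counts.items():
        clash = any(other != addr and column_of(other) == column_of(addr)
                    for other in counts)
        if uses > first_threshold and clash:
            return [TEMP if s == addr else s for s in sources], addr
    return list(sources), None


def fetch(src, gpr_values, temp_value):
    """Read from the temporary register or from a flat map of GPR values."""
    return temp_value if src == TEMP else gpr_values[src]


sources = [0b00_01_0, 0b00_01_0, 0b00_01_0, 0b10_01_1]  # three uses plus a column clash
rewritten, mirrored = redirect_to_temp(sources, first_threshold=2)

gpr_values = {0b00_01_0: 40, 0b10_01_1: 5}
temp_value = gpr_values[mirrored]  # the temporary register mirrors the target GPR
print([fetch(s, gpr_values, temp_value) for s in rewritten])
```

After the rewrite, only one register of the contested column is selected through the register array, so the instructions can stay in the same layer.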

In a possible implementation manner, the processor further includes a temporary register, and the temporary register corresponds to a target general-purpose register; the target general-purpose register is one general-purpose register in the M rows*N columns of general-purpose register groups; the temporary register stores the data in the target general-purpose register. The X destination operand addresses include a plurality of first destination operand addresses, whose number is greater than a first threshold, and a second destination operand address; the general-purpose register corresponding to the first destination operand addresses is the target general-purpose register, and the second destination operand address belongs to the general-purpose register column in which the target general-purpose register is located. The result write-back unit is further configured to access the corresponding temporary register according to the first destination operand address and write the corresponding calculation result back into the temporary register.

In this embodiment of the present invention, the above constraint rule, set to reduce the selection cost, can easily reduce instruction execution efficiency when data dependencies are handled. For example, if the destination operands of multiple instructions are the same, and the destination operands of other instructions processed in parallel in the same layer correspond to different general-purpose registers in the same general-purpose register column as the destination operands of the multiple instructions, then under the constraint rule the multiple instructions cannot be processed in parallel with the other instructions in the same layer; empty instructions must be inserted and the multiple instructions placed in the next layer for processing. This lowers instruction utilization and instruction execution efficiency. Therefore, a general-purpose register that is frequently referenced and updated can be replaced with a temporary register by the compiler. The data in the general-purpose register can be stored in the temporary register, and the processor can access the temporary register according to the destination operand address and write the corresponding calculation result back into the temporary register; this access is not subject to the constraint rule, which safeguards the execution efficiency of instructions.

In a possible implementation manner, the processor further includes an instruction acquisition unit connected to the instruction decoding unit; the instruction acquisition unit is configured to acquire the X instructions to be executed and send the X instructions to the instruction decoding unit; the X instructions are instructions that the processor executes in parallel within one clock cycle.

In this embodiment of the present invention, the processor may further include an instruction acquisition unit, through which multiple instructions to be executed can be acquired. These instructions may be instructions in the same layer, that is, instructions that the processor executes in parallel within one clock cycle (that is, within one beat), which improves the execution efficiency of the instructions.

In a possible implementation manner, the access types of the Y source operand addresses and the X destination operand addresses are all direct access or all indirect access.

In this embodiment of the present invention, a direct access gives the operand address directly, whereas an indirect access obtains the final operand address from a base address plus a variable. The operand address finally obtained by an indirect access is therefore very likely to differ from the operand address of a direct access while both belong to the same general-purpose register column. In that case two different operands would have to be selected from the same general-purpose register column at one time, which increases the logic cost of selection. If the accesses are all direct or all indirect, however, then under the aforementioned constraint rule, instructions whose operand addresses differ but belong to the same general-purpose register column can be arranged in different layers in advance, which avoids having to select two different operands from one general-purpose register column and reduces the logic cost of operand selection.
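The difference between the two access types can be made concrete with a small sketch. The operand encoding (tuples tagged "direct" or "indirect"), the bit layout, and the function names are assumptions for illustration; the text only states that an indirect address is a base address plus a variable.

```python
COL_BITS, TGT_BITS = 2, 1  # assumed field widths, layout [row | column | target]


def column_of(addr):
    return (addr >> TGT_BITS) & ((1 << COL_BITS) - 1)


def resolve(operand, variables):
    """Return the final operand address.

    operand is ("direct", addr) or ("indirect", base, variable_name).
    """
    if operand[0] == "direct":
        return operand[1]
    _, base, name = operand
    return base + variables[name]


direct = ("direct", 0b00_01_0)           # address known when the code is scheduled
indirect = ("indirect", 0b10_00_0, "i")  # address depends on the run-time value of i

a = resolve(direct, {})
b = resolve(indirect, {"i": 3})
print(a != b, column_of(a) == column_of(b))
```

In this example the indirect address only lands in the column of the direct address once the variable takes the value 3, which illustrates why mixing both access types makes such collisions hard to rule out ahead of time.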

In a second aspect, an embodiment of the present invention provides a processing method applied to a processor, where the processor includes a data selection unit, an instruction decoding unit connected to the data selection unit, and M rows*N columns of general-purpose register groups; each general-purpose register group among the M rows*N columns of general-purpose register groups includes K general-purpose registers; M, N, and K are integers greater than or equal to 1; the method includes:

Decoding, through the instruction decoding unit, X input instructions, obtaining at least one source operand address of each of the X instructions, Y source operand addresses in total, and sending the Y source operand addresses to the data selection unit; each of the Y source operand addresses includes at least one column address bit, at least one row address bit, and at least one target address bit; the at least one column address bit indicates the general-purpose register column to which the source operand address belongs, the at least one row address bit indicates the general-purpose register row to which the source operand address belongs, and the at least one target address bit indicates that the source operand address corresponds to the t-th general-purpose register among the K general-purpose registers; X, Y, and t are integers greater than or equal to 1;

Accessing, through the data selection unit, according to the at least one column address bit, the at least one row address bit, and the at least one target address bit included in the i-th source operand address, the general-purpose register corresponding to the i-th source operand address in the M rows*N columns of general-purpose register groups, to obtain the corresponding source operand; the i-th source operand address is one of the Y source operand addresses; i is an integer greater than or equal to 1 and less than or equal to Y.
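The two method steps above, decoding X instructions into Y source operand addresses and then selecting the operands, can be sketched end to end. The `Instruction` type, the opcodes, the geometry (M = 2, N = 4, K = 2), and the bit layout [row | column | target] are assumptions made for this illustration.

```python
from typing import NamedTuple


class Instruction(NamedTuple):
    opcode: str     # hypothetical mnemonic, not interpreted here
    sources: tuple  # one or more source operand addresses


def decode_bundle(bundle):
    """Collect the Y source operand addresses of the X instructions, in order."""
    return [addr for ins in bundle for addr in ins.sources]


def select_operands(regs, addresses, col_bits=2, tgt_bits=1):
    """Read one operand per address, assuming the layout [row | column | target]."""
    operands = []
    for addr in addresses:
        target = addr & ((1 << tgt_bits) - 1)
        col = (addr >> tgt_bits) & ((1 << col_bits) - 1)
        row = addr >> (tgt_bits + col_bits)
        operands.append(regs[row][col][target])
    return operands


M, N, K = 2, 4, 2  # assumed geometry
regs = [[[100 * r + 10 * c + t for t in range(K)] for c in range(N)] for r in range(M)]

bundle = [
    Instruction("add", (0b0_00_0, 0b0_01_1)),  # two source operands
    Instruction("neg", (0b1_10_0,)),           # one source operand
]
addresses = decode_bundle(bundle)  # X = 2 instructions, Y = 3 addresses
print(addresses, select_operands(regs, addresses))
```

The three addresses in this example fall in columns 0, 1, and 2, so each column is asked for at most one operand in the cycle.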

In a possible implementation manner, the processor further includes an execution unit connected to the data selection unit, the execution unit includes at least one arithmetic logic unit; the method further includes:

Through the at least one arithmetic logic unit, the X instructions are executed based on the source operands corresponding to the Y source operand addresses, respectively, to obtain respective calculation results of the X instructions.

In a possible implementation manner, the processor further includes a result write-back unit connected to the execution unit; the method further includes:

Acquiring, through the instruction decoding unit, the respective destination operand addresses of the X instructions, X destination operand addresses in total; each of the X destination operand addresses includes at least one column address bit, at least one row address bit, and at least one target address bit; the at least one column address bit indicates the general-purpose register column to which the destination operand address belongs, the at least one row address bit indicates the general-purpose register row to which the destination operand address belongs, and the at least one target address bit indicates that the destination operand address corresponds to the t-th general-purpose register among the K general-purpose registers;

Through the result write-back unit, according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the j-th destination operand address, the general-purpose register corresponding to the j-th destination operand address is accessed in the M rows*N columns of general-purpose register groups, and the calculation result of the j-th instruction is written back into the general-purpose register corresponding to the j-th destination operand address; j is an integer greater than or equal to 1 and less than or equal to X.

In a possible implementation manner, accessing, through the data selection unit, the general-purpose register corresponding to the i-th source operand address in the M rows*N columns of general-purpose register groups according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the i-th source operand address includes:

Through the data selection unit, according to the at least one column address bit included in the i-th source operand address, the i'-th general-purpose register column to which the i-th source operand address belongs is determined, and in the i'-th general-purpose register column, the corresponding general-purpose register is accessed according to the at least one row address bit and the at least one target address bit included in the i-th source operand address, where i' is an integer greater than or equal to 1 and less than or equal to N; or,

Through the data selection unit, according to the at least one row address bit included in the i-th source operand address, the i"-th general-purpose register row to which the i-th source operand address belongs is determined, and in the i"-th general-purpose register row, the corresponding general-purpose register is accessed according to the at least one column address bit and the at least one target address bit included in the i-th source operand address, where i" is an integer greater than or equal to 1 and less than or equal to M.

In a possible implementation manner, accessing, through the result write-back unit, the general-purpose register corresponding to the j-th destination operand address in the M rows*N columns of general-purpose register groups according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the j-th destination operand address includes:

Through the result write-back unit, according to the at least one column address bit included in the j-th destination operand address, the j'-th general-purpose register column to which the j-th destination operand address belongs is determined, and in the j'-th general-purpose register column, the corresponding general-purpose register is accessed according to the at least one row address bit and the at least one target address bit included in the j-th destination operand address, where j' is an integer greater than or equal to 1 and less than or equal to N; or,

Through the result write-back unit, according to the at least one row address bit included in the j-th destination operand address, the j"-th general-purpose register row to which the j-th destination operand address belongs is determined, and in the j"-th general-purpose register row, the corresponding general-purpose register is accessed according to the at least one column address bit and the at least one target address bit included in the j-th destination operand address, where j" is an integer greater than or equal to 1 and less than or equal to M.

In a possible implementation manner, the processor further includes a temporary register, and the temporary register corresponds to a target general-purpose register; the target general-purpose register is one of the general-purpose registers in the M rows*N columns of general-purpose register groups; the temporary register stores the data in the target general-purpose register; the Y source operand addresses include a plurality of first source operand addresses, whose number is greater than a first threshold, and a second source operand address; the general-purpose register corresponding to the first source operand address is the target general-purpose register, and the second source operand address belongs to the general-purpose register row where the target general-purpose register is located; the method further includes:

Through the data selection unit, access the corresponding temporary register according to the first source operand address, and obtain the corresponding source operand;
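The routing condition described above can be sketched as follows. This is only an illustrative reading of the text, not the claimed circuit: the function name `route_reads`, the `(row, column, t)` tuple form of a decoded address, and the concrete threshold value are all assumptions introduced for the example.

```python
from collections import Counter

def route_reads(addresses, target, threshold):
    """Decide, for each decoded source operand address, where it is served from.

    addresses: list of (row, col, t) tuples (hypothetical decoded form).
    target:    (row, col, t) of the target general-purpose register that the
               temporary register mirrors.
    threshold: the "first threshold" of the text (its value is an assumption).
    Returns a list of "temp" / "gpr" strings, one per address.
    """
    # Number of "first source operand addresses", i.e. reads of the target register.
    hits = Counter(addresses)[target]
    # A "second source operand address": a different register in the same row.
    same_row_other = any(a != target and a[0] == target[0] for a in addresses)
    use_temp = hits > threshold and same_row_other
    return ["temp" if (use_temp and a == target) else "gpr" for a in addresses]
```

For example, with three reads of register `(1, 2, 0)`, one read of `(1, 3, 1)` in the same row, and a threshold of 2, the three repeated reads are routed to the temporary register and the remaining read goes to the general-purpose register group.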

In a possible implementation manner, the processor further includes a temporary register, and the temporary register corresponds to a target general-purpose register; the target general-purpose register is one of the general-purpose registers in the M rows*N columns of general-purpose register groups; the temporary register stores the data in the target general-purpose register; the X destination operand addresses include a plurality of first destination operand addresses, whose number is greater than a first threshold, and a second destination operand address; the general-purpose register corresponding to the first destination operand address is the target general-purpose register, and the second destination operand address belongs to the general-purpose register row where the target general-purpose register is located; the method further includes:

Through the result write-back unit, the corresponding temporary register is accessed according to the first destination operand address, and the corresponding calculation result is written back into the temporary register.

In a possible implementation manner, the processor further includes an instruction acquisition unit connected to the instruction decoding unit; the method further includes:

Through the instruction acquisition unit, the X instructions to be executed are obtained and sent to the instruction decoding unit; the X instructions are instructions executed in parallel by the processor within one clock cycle.

In a third aspect, the present invention provides a semiconductor chip, which may include the processor provided by any one of the implementation manners of the foregoing first aspect.

In a fourth aspect, the present invention provides a semiconductor chip, which may include: the processor provided by any one of the implementation manners of the first aspect, an internal memory coupled to the processor, and an external memory.

In a fifth aspect, the present invention provides a system-on-chip (SoC) chip, where the SoC chip includes the processor provided by any one of the implementation manners of the first aspect, an internal memory coupled to the processor, and an external memory. The SoC chip may be composed of chips, or may include chips and other discrete devices.

In a sixth aspect, the present invention provides a chip system, where the chip system includes the processor provided by any one of the implementation manners of the foregoing first aspect. In a possible design, the chip system further includes a memory, and the memory is used for saving program instructions and data that are necessary for or related to the operation of the processor. The chip system may be composed of chips, or may include chips and other discrete devices.

In a seventh aspect, the present invention provides a processing device having a function of implementing any one of the processing methods in the second aspect. This function can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.

In an eighth aspect, the present invention provides a terminal, where the terminal includes a processor, and the processor is the processor provided by any one of the implementation manners of the foregoing first aspect. The terminal may also include a memory coupled to the processor, which stores program instructions and data necessary for the terminal. The terminal may also include a communication interface for the terminal to communicate with other devices or a communication network.

In a ninth aspect, the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processing method flow of any one of the implementation manners of the above-mentioned second aspect is implemented.

In a tenth aspect, an embodiment of the present invention provides a computer program, where the computer program includes instructions that, when the computer program is executed by a processor, enable the processor to execute the processing method flow described in any one of the implementation manners of the second aspect above.

Description of drawings

FIG. 1 is a schematic diagram of an IE structure in the prior art.

FIG. 2 is a schematic diagram of another IE structure in the prior art.

FIG. 3 is a schematic diagram of a program split in the prior art.

FIG. 4 is a schematic diagram of instruction processing based on PV extraction in the prior art.

FIG. 5 is a schematic diagram of an FV-based instruction processing provided by an embodiment of the present invention.

FIG. 6 is a schematic structural diagram of a processor according to an embodiment of the present invention.

FIG. 7 is a schematic structural diagram of another processor according to an embodiment of the present invention.

FIG. 8a to FIG. 8e are schematic diagrams of division manners of general-purpose registers according to an embodiment of the present invention.

FIG. 9 is a schematic diagram of a data selection provided by an embodiment of the present invention.

FIG. 10 is a schematic diagram of a result write-back provided by an embodiment of the present invention.

FIG. 11 is a schematic flowchart of a processing method provided by an embodiment of the present invention.

Detailed description

The embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.

The terms "first", "second", "third" and "fourth" in the description and claims of the present invention and the accompanying drawings are used to distinguish different objects, rather than to describe a specific order. Moreover, the "rows" and "columns" mentioned in the description, claims and drawings of the present application are not necessarily physical "rows" and "columns"; that is, the M rows*N columns of general-purpose register groups in the embodiments of the present application are not necessarily laid out as literal "rows" and "columns" in the actual circuit. In some possible implementations, the present application may define "rows" and "columns" based on vectors. For example, the N columns of general-purpose register groups in this embodiment of the present application may correspond to N column vectors, and the M rows of general-purpose register groups may correspond to M row vectors, where each element in a vector may represent a general-purpose register group. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes other steps or units inherent to these processes, methods, products or devices.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor a separate or alternative embodiment that is mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

The terms "component", "module", "system" and the like are used in this specification to refer to a computer-related entity, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device may be components. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures stored thereon. A component may, for example, communicate through local and/or remote processes based on a signal having one or more data packets (for example, data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet, which interacts with other systems via the signal).

First, some terms in the present invention will be explained so as to facilitate the understanding of those skilled in the art.

(1) Instruction pipeline: to improve the efficiency with which the processor executes instructions, the operation of an instruction is divided into multiple small steps, and each step is completed by a dedicated circuit. For example, an instruction needs to go through 3 stages to execute: fetch, decode, and execute, and each stage takes one machine cycle. If pipeline technology is not used, this instruction needs 3 machine cycles to execute; if instruction pipeline technology is used, then when this instruction completes "fetch" and enters "decode", the next instruction can already be fetched, which improves the execution efficiency of instructions. In general, the more pipeline stages there are, that is, the more small steps an instruction is divided into, the higher the execution efficiency of instructions.
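The benefit described above can be made concrete with a small cycle count. The sketch below assumes an idealized pipeline in which every stage takes exactly one machine cycle and there are no stalls or hazards; the function names are illustrative and do not come from the specification.

```python
def cycles_without_pipeline(n_instructions, n_stages=3):
    """Each instruction runs all of its stages before the next one starts."""
    return n_instructions * n_stages

def cycles_with_pipeline(n_instructions, n_stages=3):
    """Idealized pipeline: once full, one instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one adds a single cycle.
    return n_stages + (n_instructions - 1)
```

Under these assumptions, a single instruction still takes 3 cycles either way, while 10 instructions take 30 cycles without pipelining and 12 cycles with the 3-stage pipeline.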

Please refer to FIG. 1. FIG. 1 is a schematic structural diagram of an instruction processing engine (Instruction Engine, IE) in the prior art. The IE structure of the processor may be as shown in FIG. 1. In FIG. 1, a processor with a five-stage pipeline is taken as an example. The life cycle of an instruction in the pipeline structure may include an instruction fetch pipeline → an instruction decoding pipeline → a data selection pipeline → an execution pipeline → a write-back pipeline; that is, the pipeline structure divides the execution process of an instruction into at least five stages, in which the basic functions of each pipeline stage are as follows:

Instruction fetch pipeline: Instruction fetching (Instruction Fetching, IF) reads instructions from the instruction memory (Instruction Memory, IMEM) using the program counter (Program Counter, PC) as the address, and sends the instructions to the instruction decoder (Instruction Decoder, ID) after completing the verification;

Instruction decoding pipeline: decode the input instruction through the instruction decoder and check the validity, and then extract the command type, operand type, immediate value, source operand address and destination operand address of the instruction;

Data selection pipeline: The data selector (Data Selector, DS) selects data from the general-purpose registers (General-Purpose Register, GPR) according to the source operand address obtained in the instruction decoding pipeline, and data dependency analysis is required at the same time; the data selector is sometimes called a multiplexer (Mux); the label "4×2×16 bit src operands (source operand)" in FIG. 1 indicates that the processor can process 4 instructions in parallel in one beat and each instruction can include two source operands, that is, 8 source operands can be obtained from the general-purpose registers in one beat, and the size of each source operand can be 16 bits, that is, 2 bytes;

Execution pipeline (Execute, EX): Through multiple arithmetic logic units, the corresponding arithmetic, logic and shift operations are performed according to the command type and the obtained operands, and the operation results are output; instruction execution refers to the process of actually performing the operation specified by the instruction. For example, if the instruction is an addition instruction, the operands are added; if it is a subtraction instruction, the operands are subtracted, and so on;

Write-back pipeline: Write-back (Write-Back, WB) refers to the process of writing the result of instruction execution back to the general-purpose register set or to the data memory, or the process of reading data from the data memory and writing it back to the GPR, that is, performing read/write (load/store) as shown in FIG. 1; the label "4×16bit dst operands (destination operand)" in FIG. 1 indicates that the processor can process 4 instructions in parallel in one beat and each instruction can include one destination operand, that is, the calculation results of the 4 instructions can be written into the corresponding general-purpose registers in one beat, and the size of each destination operand (calculation result) can be 16 bits, that is, 2 bytes.
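The operand counts quoted from FIG. 1 in the data selection and write-back stages follow from simple multiplication. The short sketch below reproduces that arithmetic; the constant and function names are illustrative only.

```python
# Figures taken from the FIG. 1 labels quoted in the text.
INSTRUCTIONS = 4          # instructions processed in parallel
SRC_PER_INSTRUCTION = 2   # source operands per instruction
DST_PER_INSTRUCTION = 1   # destination operands per instruction
OPERAND_BITS = 16         # size of each operand

def operand_traffic():
    """Return operand counts and byte totals for one group of parallel instructions."""
    src_count = INSTRUCTIONS * SRC_PER_INSTRUCTION
    dst_count = INSTRUCTIONS * DST_PER_INSTRUCTION
    return {
        "src_operands": src_count,
        "src_bytes": src_count * OPERAND_BITS // 8,
        "dst_operands": dst_count,
        "dst_bytes": dst_count * OPERAND_BITS // 8,
    }
```

With these figures, the data selector must deliver 8 source operands (16 bytes) and the write-back stage must accept 4 results (8 bytes) each time four instructions are processed in parallel.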

It can be understood that the above-mentioned processor architecture and the pipeline structure of the processor are only some exemplary implementations provided by the embodiments of the present invention, and the processor architecture and the pipeline structure of the processor in the embodiments of the present invention include but are not limited to the above implementations.

As shown in FIG. 1, the capacity of the GPR is only 32 bytes. Because the capacity of the GPR is small, in the above write-back pipeline, if a calculation result cannot be written back to the GPR, the calculation result is temporarily written into the data memory, and when a subsequent instruction needs the calculation result, it is read from the data memory into the GPR. This results in frequent swapping of data between the data memory and the GPR, reduces the efficiency of instruction execution, and can even stall the pipeline. For such cases, one solution is to use the entire original data memory as the GPR, that is, to expand the capacity of the GPR so that it holds the full-range vector (Full Vector, FV), that is, the full range of operand selection. Please refer to FIG. 2, which is a schematic diagram of another IE structure in the prior art. As shown in FIG. 2, compared with the GPR with a capacity of only 32 bytes in FIG. 1, the capacity of the GPR in FIG. 2 has been expanded to 512 bytes, which ensures to a certain extent that all data can be stored in the GPR, without frequently swapping data in and out between the data memory and the GPR. However, as shown in FIG. 2, directly expanding the capacity of the GPR means that, in the data selection pipeline, the processor must select within the 512-byte data range according to each source operand address obtained by decoding in order to obtain the corresponding source operand, which greatly increases the logic cost of selection.
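The growth in selection cost can be illustrated with a rough count of how many candidate entries each operand selector must choose among. The sketch assumes that operands are addressed as aligned 2-byte (16-bit) slots and that selection cost grows with the number of candidates; both are simplifying assumptions for illustration, not statements from the specification.

```python
def selection_range(gpr_bytes, operand_bytes=2):
    """Return (candidate slots, address bits needed) for one operand selector."""
    candidates = gpr_bytes // operand_bytes
    # Bits needed to address `candidates` distinct slots.
    address_bits = (candidates - 1).bit_length()
    return candidates, address_bits
```

Under these assumptions, the 32-byte GPR of FIG. 1 gives 16 candidates per operand (4 address bits), while the 512-byte GPR of FIG. 2 gives 256 candidates per operand (8 address bits), and this wider selection is needed for every source operand fetched in parallel.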

To sum up, in order to facilitate the understanding of the embodiments of the present invention, the technical problems to be solved by the present invention are further analyzed and proposed. In the prior art, regarding narrowing the selection range and reducing the logic cost of operand selection, various technical solutions are included, and the following is an example of a commonly used solution.

Scheme 1: Extract a partial vector (PV) from FV, use PV as the selection range of operands, and select operands from PV.

In order to improve the ability to execute instructions in parallel, processors now use a Very Long Instruction Word (VLIW) architecture in which multiple instructions are parallelized to speed up the execution of instructions. In this way, for a large bit width FV (for example, the FV of 512B (byte) shown in FIG. 2 , or even the FV of 1024 bytes), the selection cost will be further amplified.

In order to solve the problem of high selection cost caused by using the entire FV as the selection range of the operand, a solution is proposed to extract a PV from the FV as the operands of the VLIW. The general idea of this scheme is to use the compiler to split the program into several interconnected small program blocks; when the previous program block finishes executing, the entry of the next program block is specified. By repeating this, the program selects a program path and completes the specified processing. Please refer to FIG. 3, which is a schematic diagram of program splitting in the prior art. As shown in FIG. 3, each circle in the figure represents a program block, and the connections between the program blocks represent program paths. For the source operands that need to be referenced and the destination operands that need to be written back in each instruction to be executed, the compiler generates the PV in advance through partial vector construction (PV Builder, PVB). The PV is then used as the selection range of the source operands and destination operands of the IE, and after the execution of the program block is completed, the PV is restored to the FV through full-range vector construction (FV Builder, FVB). Please refer to FIG. 4, which is a schematic diagram of instruction processing based on PV extraction in the prior art. As shown in FIG. 4, the FV is 512 bytes, and the extracted PV is only 128 bytes. Optionally, the extracted PV can be stored in a 128-byte general-purpose register group (for example, if one general-purpose register can store 1 byte of data, the general-purpose register group can include at least 128 general-purpose registers). In this way, the selection range is narrowed and the logic cost of selection is reduced.
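The extract-then-restore round trip of this prior-art scheme can be sketched as a gather followed by a scatter. The text does not specify how PVB and FVB are built, so the functions `build_pv` and `restore_fv`, the list of segment offsets, and the fixed segment size are all hypothetical and serve only to show the extra work that surrounds each program block.

```python
def build_pv(fv, segment_offsets, segment_bytes=4):
    """PVB sketch: gather the selected FV segments into a smaller PV."""
    pv = bytearray()
    for off in segment_offsets:
        pv += fv[off:off + segment_bytes]
    return pv

def restore_fv(fv, pv, segment_offsets, segment_bytes=4):
    """FVB sketch: scatter the (possibly modified) PV segments back into the FV."""
    for i, off in enumerate(segment_offsets):
        fv[off:off + segment_bytes] = pv[i * segment_bytes:(i + 1) * segment_bytes]
    return fv
```

For example, taking 32 segments of 4 bytes from a 512-byte FV yields a 128-byte PV; a value written into the PV during the program block only becomes visible in the FV after `restore_fv` has run.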

Disadvantages of Scheme 1:

1. A PVB must be added to extract the PV from the FV, and an FVB must be added to restore the PV to the FV, which significantly increases circuit area and power consumption overhead;

2. The access types of the source operand and the destination operand may include direct access and indirect access. As a result, PV extraction is likely to cause a trampling problem between indirect access and direct access to general-purpose registers;

3. In order to reduce the cost of PVB extraction and the cost of indirect addressing (PVB aligns the extraction segments to 4 bytes), indirect addressing is forced to support only operands with a granularity of 1 byte. As a result, an instruction with an operand size of 2 bytes is forcibly split into 2 instructions, which reduces the efficiency of instruction execution and increases the number of instructions, that is, increases the instruction space;

4. In order to save instruction space and improve the cache hit rate, the hardware needs to translate instruction operands based on the full-range vector (FV-based) into instruction operands based on the partial vector (PV-based), which further increases the hardware cost.

To sum up, although Scheme 1 in the above-mentioned prior art narrows the selection range through PV extraction and reduces the selection cost to a certain extent, it also adds cost and brings other problems, such as the above-mentioned increase in instruction space and reduction in instruction execution efficiency. It neither truly solves the logic cost of operand selection nor guarantees the execution efficiency of instructions. Therefore, in order to solve the problem that the current processing technology does not meet the actual business requirements, the technical problems to be solved in the present invention include the following aspects:

1. Based on the original FV, the selection range is narrowed, the selection cost is reduced, and the cost increase introduced by PV extraction and PV restoration is effectively avoided. Please refer to FIG. 5, which is a schematic diagram of FV-based instruction processing provided by an embodiment of the present invention. As shown in FIG. 5, based on the original FV, without adding PV extraction and PV restoration, the selection range of the source operand and the destination operand is constrained to reduce the logic cost of data selection;

2. Effectively solve the trampling problem of indirect access and direct access registers caused by PV extraction;

3. Effectively save the instruction space and enhance the cache hit rate, and save the cost of translation from FV-based instruction operands to PV-based instruction operands in hardware;

4. Since the PVs extracted in Scheme 1 for different program blocks (that is, the bundles) are different, free jumps cannot be realized between instruction blocks. It is therefore necessary to enable unconstrained jumps when jump instructions are executed between different instruction blocks, which can effectively eliminate the cost of the dedicated branch acceleration that program jumps would otherwise require.

Based on the above-mentioned processor architecture and the pipeline structure of the processor, the present invention provides a processor. Please refer to FIG. 6, which is a schematic structural diagram of a processor according to an embodiment of the present invention. The processor 10 may be located in any electronic device, such as a computer, a mobile phone, a tablet, a personal digital assistant, a smart wearable device, a smart vehicle, or a smart home appliance. The processor 10 may specifically be a chip or a chip set, or a circuit board on which the chip or the chip set is mounted. The chip or chip set, or the circuit board on which the chip or chip set is mounted, can be driven by necessary software.

Specifically, the processor 10 may include an instruction decoding unit 103, a data selection unit 104, and an M row*N column general-purpose register group 107 connected to the data selection unit, and each general-purpose register group in the M row*N column general-purpose register group 107 may include K general-purpose registers, where M, N and K are integers greater than or equal to 1. The instruction decoding unit 103 is connected to the data selection unit 104, and the instruction decoding unit 103 operates in the instruction decoding pipeline stage of the processor 10 to decode the X instructions to be executed, extract at least one source operand address of each of the X instructions (Y source operand addresses in total), and send the Y source operand addresses to the data selection unit 104. X and Y are integers greater than or equal to 1. The data selection unit 104 operates in the data selection pipeline stage of the processor 10 to select the corresponding general-purpose register from the M row*N column general-purpose register group 107 according to the source operand address and obtain the corresponding source operand. Specifically, each source operand address obtained by the instruction decoding unit 103 may include at least one column address bit, at least one row address bit and at least one target address bit. The at least one column address bit may be used to indicate the general-purpose register column to which each source operand address belongs, the at least one row address bit may be used to indicate the general-purpose register row to which each source operand address belongs, and the at least one target address bit may be used to indicate the correspondence between each source operand address and the t-th general-purpose register among the K general-purpose registers, where t is an integer greater than or equal to 1 and less than or equal to K. For example, M and N are 4, and K is 2, that is, there are 4 rows*4 columns of general-purpose register groups.
Each general-purpose register group can include 2 general-purpose registers, and each source operand address can include 2 column address bits, 2 row address bits, and 1 target address bit. For example, the value of the two column address bits can be 00, 01, 10 or 11, where 00 can indicate that the source operand address belongs to the first general-purpose register column, 01 can indicate that the source operand address belongs to the second general-purpose register column, 10 can indicate that the source operand address belongs to the third general-purpose register column, and 11 can indicate that the source operand address belongs to the fourth general-purpose register column. Optionally, if the i-th source operand address in the Y source operand addresses includes 2 column address bits, 2 row address bits and 1 target address bit, and their values are 10, 11 and 1 respectively, then the data selection unit 104 can access the corresponding general-purpose register in the M row*N column general-purpose register group 107 according to the 2 column address bits, 2 row address bits and 1 target address bit included in the i-th source operand address. For example, the data selection unit 104 can, in the third general-purpose register column, access the second general-purpose register in the fourth general-purpose register row to obtain the corresponding source operand; or, in the fourth general-purpose register row, access the second general-purpose register in the third general-purpose register column to obtain the corresponding source operand. i is an integer greater than or equal to 1 and less than or equal to Y.
In this way, based on the pre-divided M row*N column general-purpose register group and the at least one column address bit, at least one row address bit and at least one target address bit included in the source operand address, for each source operand address, the corresponding general-purpose register is selected from the general-purpose register column indicated by the at least one column address bit, or from the general-purpose register row indicated by the at least one row address bit, to obtain the corresponding source operand, thereby narrowing the selection range and reducing the logic cost of operand selection.
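The 4 rows*4 columns, K = 2 example can be sketched in a few lines. The packing of the five address bits as [column | row | target] from most significant to least significant bit is an assumption made for this illustration, as are the names `decode`, `select_operand` and `make_register_file`; the specification only states that the address contains these three fields. Indices in the code are zero-based, so column bits 10 give index 2, which is the third column.

```python
M, N, K = 4, 4, 2                      # rows, columns, registers per group
COL_BITS, ROW_BITS, TGT_BITS = 2, 2, 1

def make_register_file():
    """Fill register (row r, column c, t) with the recognisable value r*100 + c*10 + t."""
    return [[[r * 100 + c * 10 + t for t in range(K)] for c in range(N)] for r in range(M)]

def decode(address):
    """Split a packed address, assumed to be laid out as [column | row | target]."""
    target = address & ((1 << TGT_BITS) - 1)
    row = (address >> TGT_BITS) & ((1 << ROW_BITS) - 1)
    col = (address >> (TGT_BITS + ROW_BITS)) & ((1 << COL_BITS) - 1)
    return col, row, target

def select_operand(regs, address):
    """Column-first selection: narrow to one column, then pick by row and target bits."""
    col, row, target = decode(address)
    column = [regs[r][col] for r in range(M)]   # only M*K candidate registers remain
    return column[row][target]

def candidate_counts():
    """Candidates when selecting over the whole file versus within one column."""
    return M * N * K, M * K
```

With column bits 10, row bits 11 and target bit 1 (packed as `0b10111`), `decode` returns `(2, 3, 1)`, that is, the third column, fourth row and second register, matching the example above. Under this model the second selection step chooses among 8 registers rather than all 32.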

In a possible implementation manner, please refer to FIG. 7, which is a schematic structural diagram of another processor according to an embodiment of the present invention. As shown in FIG. 7, the processor 10 may further include an instruction memory 101, an instruction acquisition unit 102, an execution unit 105 and a result write-back unit 106. As shown in FIG. 7, the instruction memory 101, the instruction acquisition unit 102, the instruction decoding unit 103, the data selection unit 104, the execution unit 105 and the result write-back unit 106 can be connected in sequence, and the M row*N column general-purpose register group 107 is connected to the data selection unit 104 and the result write-back unit 106. The instruction acquisition unit 102 operates in the instruction fetch pipeline stage of the processor 10 to acquire the X instructions to be executed from the instruction memory 101, and can verify the X instructions. Optionally, the instruction decoding unit 103 can decode the X instructions to be executed, extract at least one source operand address of each of the X instructions, and can also extract the respective destination operand addresses of the X instructions to be executed (X destination operand addresses in total), as well as the command type, operand type, immediate data, and so on. Optionally, the instruction decoding unit 103 may further perform validity checking on the X instructions. Optionally, as described above, the destination operand address obtained through instruction decoding may include at least one column address bit, at least one row address bit and at least one target address bit.
The at least one column address bit can be used to indicate the general-purpose register column to which the destination operand address belongs, the at least one row address bit can be used to indicate the general-purpose register row to which the destination operand address belongs, and the at least one target address bit can be used to indicate the correspondence between the destination operand address and the t-th general-purpose register among the K general-purpose registers. The execution unit 105 may include multiple arithmetic logic units (ALUs) as shown in FIG. 7 (for example, the arithmetic logic unit 1051 and the arithmetic logic unit 1052), and optionally may also include other units for performing computing tasks, which is not specifically limited in this embodiment of the present invention. The execution unit 105 operates in the execution pipeline stage of the processor 10, and can complete the calculation task of an instruction through the arithmetic logic units therein to obtain the corresponding calculation result. Optionally, based on the VLIW architecture, the processor 10 can call multiple arithmetic logic units to execute the tasks of multiple instructions in parallel to obtain the calculation results of the multiple instructions, and the X instructions can be instructions that the processor executes in parallel within one beat. For example, if X is 4, that is, there are 4 instructions to be executed in one layer (stage), the tasks of the 4 instructions can be executed in parallel by 4 arithmetic logic units. The result write-back unit 106 operates in the write-back pipeline stage of the processor 10 to select the corresponding general-purpose register from the M rows*N columns of general-purpose register groups 107 according to the destination operand address, and write the calculation result back to that general-purpose register. 
Optionally, as described above, suppose for example that the j-th destination operand address among the X destination operand addresses includes 2 column address bits, 2 row address bits and 1 target address bit, whose values are 00, 01 and 0 respectively. The result write-back unit 106 can then access the corresponding general-purpose register in the M rows*N columns of general-purpose register groups 107 according to the 2 column address bits, 2 row address bits and 1 target address bit included in the j-th destination operand address. For example, the result write-back unit 106 may, in the first general-purpose register column, access the first general-purpose register in the second general-purpose register row and write the corresponding calculation result into that general-purpose register; or, in the second general-purpose register row, access the first general-purpose register in the first general-purpose register column. Here j is an integer greater than or equal to 1 and less than or equal to X. In this way, based on the pre-divided M rows*N columns of general-purpose register groups, and the at least one column address bit, at least one row address bit and at least one target address bit included in each destination operand address, the corresponding general-purpose register for each destination operand address is selected either from the general-purpose register column indicated by the at least one column address bit or from the general-purpose register row indicated by the at least one row address bit, and the corresponding calculation result is written into it, which narrows the selection range and reduces the logic cost of selection.

It should be noted that the connection relationship shown in FIG. 7 does not limit the connection relationship between them. In addition, the pipeline structure may be different according to the structure of each processor. Therefore, the pipeline structure referred to in the present invention refers to the pipeline structure of the processor 10, and does not specifically limit the pipeline structure of other processors.

As mentioned above, for example, if N is 4, that is, the general-purpose register groups are divided into 4 columns, and M is 8, that is, the general-purpose register groups are divided into 8 rows, then each source operand address can include 2 column address bits and 3 row address bits. For example, the values of the 3 row address bits can be 000, 001, 010, 011, 100, 101, 110 or 111, where 000 can indicate that the source operand address belongs to the 1st general-purpose register row, 001 can indicate that it belongs to the 2nd general-purpose register row, 010 can indicate that it belongs to the 3rd general-purpose register row, 111 can indicate that it belongs to the 8th general-purpose register row, and so on, which will not be detailed here. For another example, if K is 2, that is, each general-purpose register group includes 2 general-purpose registers, then each source operand address can include 1 target address bit, whose value can be 0 or 1, where 0 can indicate that the source operand address corresponds to the first general-purpose register in the general-purpose register group, and 1 can indicate that it corresponds to the second general-purpose register in the general-purpose register group. Therefore, through the division into M rows*N columns of general-purpose register groups, and the column address bits, row address bits and target address bits included in the source operand address, the selection range can be narrowed and the corresponding general-purpose register finally determined, so that the selection range is narrowed and the selection cost reduced without the additional hardware cost of PV extraction and PV restoration. 
For example, the first source operand address of the Y source operand addresses includes a total of 6 address bits, namely 2 column address bits, 3 row address bits and 1 target address bit, whose values are 01, 100 and 0 respectively (that is, the first source operand address can be 011000). The data selection unit 104 can then, according to the 2 column address bits, 3 row address bits and 1 target address bit included in the first source operand address, access the 1st general-purpose register of the register group in the 5th general-purpose register row of the 2nd general-purpose register column (that is, the source operand address 011000 corresponds to the 1st general-purpose register in the general-purpose register group at row 5 of column 2) to obtain the corresponding source operand, or access the 1st general-purpose register of the register group in the 2nd general-purpose register column of the 5th general-purpose register row to obtain the corresponding source operand.
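The field split used in the example above can be sketched as follows. This is a minimal illustration only: the function name split_address and the assumption that the column bits come first, followed by the row bits and then the target bits, are choices made for this sketch and are not stated as the hardware implementation.

```python
def split_address(bits: str, col_bits: int, row_bits: int, tgt_bits: int):
    """Split an operand address into zero-based (column, row, target) fields.

    Assumption of this sketch: column bits are the highest bits, target bits
    the lowest bits, and row bits sit in between.
    """
    if len(bits) != col_bits + row_bits + tgt_bits:
        raise ValueError("address width does not match the field widths")
    col = int(bits[:col_bits], 2)
    row = int(bits[col_bits:col_bits + row_bits], 2)
    tgt = int(bits[col_bits + row_bits:], 2)
    return col, row, tgt


# N = 4 columns, M = 8 rows, K = 2 registers per group, address 011000
col, row, tgt = split_address("011000", col_bits=2, row_bits=3, tgt_bits=1)
print(col + 1, row + 1, tgt + 1)  # 2 5 1 (column 2, row 5, 1st register)
```

The one-based values printed at the end match the ordinal wording of the text (2nd column, 5th row, 1st general-purpose register in the group).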

As mentioned above, in the same way, for example, N is 4, M is 8 and K is 2, that is, the general-purpose registers are divided into M rows*N columns of general-purpose register groups and each general-purpose register group can include 2 general-purpose registers; then each destination operand address can include 2 column address bits, 3 row address bits and 1 target address bit. For example, the values of the 3 row address bits can be 000, 001, 010, 011, 100, 101, 110 or 111, where 000 can indicate that the destination operand address belongs to the 1st general-purpose register row, 001 can indicate that it belongs to the 2nd general-purpose register row, 010 can indicate that it belongs to the 3rd general-purpose register row, 111 can indicate that it belongs to the 8th general-purpose register row, and so on, which will not be repeated here. For example, the value of the 1 target address bit may be 0 or 1, where 0 may indicate that the destination operand address corresponds to the first general-purpose register in the general-purpose register group, and 1 may indicate that it corresponds to the second general-purpose register in the general-purpose register group. Therefore, through the division into M rows*N columns of general-purpose register groups, and the column address bits, row address bits and target address bits included in the destination operand address, the selection range can be narrowed and the corresponding general-purpose register finally determined, so that the selection range is narrowed and the selection cost reduced without the additional hardware cost of PV extraction and PV restoration. 
For example, the third destination operand address of the X destination operand addresses includes a total of 6 address bits, namely 2 column address bits, 3 row address bits and 1 target address bit, whose values are 01, 100 and 0 respectively (that is, the third destination operand address can be 011000). The result write-back unit 106 can then, according to the 2 column address bits, 3 row address bits and 1 target address bit included in the third destination operand address, access the 1st general-purpose register of the register group in the 5th general-purpose register row of the 2nd general-purpose register column (that is, the destination operand address 011000 corresponds to the 1st general-purpose register in the general-purpose register group at row 5 of column 2), or access the 1st general-purpose register of the register group in the 2nd general-purpose register column of the 5th general-purpose register row, to write the corresponding calculation result into this general-purpose register.

Optionally, in order to further reduce the selection cost, it can be constrained that each general-purpose register column can select only 2 bytes of data per selection (that is, only one operand with a size of 2 bytes). Therefore, any two different source operand addresses among the above Y source operand addresses must belong to different general-purpose register columns. Obviously, if two different source operand addresses belong to the same general-purpose register column, that column needs to select at least two different source operands at a time, that is, 4 bytes of data, which increases the logic cost of selection. Optionally, the compiler can place instructions whose different source operand addresses belong to the same general-purpose register column in different layers for processing, so as to avoid the selection conflict of one general-purpose register column selecting multiple different source operands at one time and reduce the logic cost of selection. Similarly, any two different destination operand addresses among the above X destination operand addresses also need to belong to different general-purpose register columns. If two different destination operand addresses belong to the same general-purpose register column, that column needs to select at least two different general-purpose registers at a time to write two different calculation results, that is, 4 bytes of data, which increases the logic cost of selection, and so on, which will not be further described here.
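To illustrate how a compiler might separate conflicting instructions into different layers, the following is a minimal sketch. The function schedule_layers, its greedy in-order policy and the one-source-operand-per-instruction simplification are assumptions made for illustration; the text does not specify the compiler's actual algorithm, and data dependencies are ignored here.

```python
def schedule_layers(instrs, max_per_layer):
    """Greedily pack instructions into layers without column conflicts.

    instrs: list of (name, column, address) tuples, one source operand per
    instruction for simplicity (a hypothetical input format). Two accesses
    conflict when they use the same column with different addresses.
    Program order is preserved.
    """
    layers = [[]]
    col_addr = {}  # column -> address already selected in the current layer
    for name, col, addr in instrs:
        conflict = col in col_addr and col_addr[col] != addr
        if conflict or len(layers[-1]) == max_per_layer:
            layers.append([])  # start a new layer
            col_addr = {}
        layers[-1].append(name)
        col_addr[col] = addr
    return layers


program = [
    ("i0", 0, "R0"),
    ("i1", 1, "R2"),
    ("i2", 0, "R8"),  # same column as i0, different address: new layer
    ("i3", 0, "R8"),  # same column and same address as i2: no conflict
    ("i4", 3, "R6"),
]
print(schedule_layers(program, max_per_layer=4))
# [['i0', 'i1'], ['i2', 'i3', 'i4']]
```

The same check could be applied to destination operand addresses, since the text places the same per-column constraint on them.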

Optionally, in order to further reduce the selection cost, the access types of the Y source operand addresses and the X destination operand addresses may be constrained to be all direct access or all indirect access. It should be noted that with direct access the operand address is known directly, whereas with indirect access the operand address is obtained by adding a variable to a base address. As a result, an operand address obtained through indirect access is very likely to differ from an operand address used for direct access while both belong to the same general-purpose register column. In that case, the same general-purpose register column would need to select two different general-purpose registers at a time, to read two different source operands or to write the calculation results of two different instructions, which would greatly increase the logic cost of selection. However, if the operand addresses are all constrained to be direct access, the compiler can, according to the aforementioned constraint rules, place instructions whose operand addresses differ but belong to the same general-purpose register column in different layers for processing; or, if the operand addresses are all constrained to be indirect access, the compiler can do the same according to the aforementioned constraint rules and the difference that always exists between the base addresses of the indirect accesses. To sum up, optionally, the same operand in the same layer (for example, src1 shown in FIG. 9 below) can be either all direct access or all indirect access, so as to avoid the problem that the compiler cannot identify column conflicts between direct access and indirect access. 
It is understandable that only when the compiler recognizes a column conflict can it resolve the conflict during operand selection by mapping instructions reasonably. Therefore, the purpose of prohibiting the same operand in the same layer from mixing direct access and indirect access is to allow the compiler to recognize column conflicts. In this way, the situation in which two different general-purpose registers need to be selected at one time in the same general-purpose register column can be avoided in advance, and the logic cost of selection can be reduced.
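The access-type constraint described above can be checked mechanically. The following is a minimal sketch under assumed data structures: the function mixed_access_slots and the dictionary representation of an instruction are hypothetical and only illustrate the rule that one operand slot in one layer should not mix direct and indirect access.

```python
def mixed_access_slots(layer):
    """Return the operand slots whose access types are mixed within one layer.

    layer: list of dicts mapping a slot name (for example "src1") to either
    "direct" or "indirect". An empty result means the constraint holds.
    """
    kinds = {}  # slot -> set of access types seen in this layer
    for instr in layer:
        for slot, access in instr.items():
            kinds.setdefault(slot, set()).add(access)
    return sorted(slot for slot, seen in kinds.items() if len(seen) > 1)


layer = [
    {"src1": "direct", "src2": "indirect"},
    {"src1": "direct", "src2": "direct"},
]
print(mixed_access_slots(layer))  # ['src2']
```

In this sketch src1 passes because both instructions use direct access, while src2 is reported because it mixes the two access types in the same layer.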

As described above, in order to reduce the cost of data selection, the embodiments of the present invention divide the existing general-purpose registers to obtain M rows*N columns of general-purpose register groups 107, where each general-purpose register group may include at least one general-purpose register. Therefore, based on the at least one column address bit included in a source operand address or destination operand address, the processor can select the corresponding general-purpose register from the general-purpose register column indicated by the at least one column address bit according to the remaining row address bits and target address bits; or, based on the at least one row address bit included in the source operand address or destination operand address, select the corresponding general-purpose register from the general-purpose register row indicated by the at least one row address bit according to the remaining column address bits and target address bits. For each source operand address and destination operand address, the selection range is narrowed and the selection cost is greatly reduced.

Optionally, please refer to FIG. 8a-FIG. 8e, which are schematic diagrams of division manners of general-purpose registers provided by an embodiment of the present invention. As shown in FIG. 8a-FIG. 8e, the FV (that is, all general-purpose registers) can be divided into N banks (Banks) by interleaving at a granularity of K bytes, that is, the above-mentioned division into N columns. The selection range of each ALU is then constrained along the Bank dimension to reduce the selection cost. The interleaving granularity K is generally consistent with the size of the ALU operand, so that the processor can fetch the required source operand from one general-purpose register group (for example, if the size of the source operand in the instruction is 2 bytes, the interleaving granularity K can be 2 bytes, that is, each general-purpose register group includes 2 general-purpose registers, since each general-purpose register usually stores 1 byte of data). The number of banks N can generally be consistent with the size of the VLIW (that is, with the number of instructions included in each layer), in other words with the number of instructions processed in parallel by the processor in one beat, so that the processor can select the corresponding general-purpose register in a Bank for the source operand of each of the multiple instructions. For example, the source operand of instruction 1 can belong to Bank0, the source operand of instruction 2 can belong to Bank2, and so on, thereby reducing selection conflicts caused by multiple instructions selecting from the same Bank. For example, if the processor can process 4 instructions in parallel in one beat (that is, one clock cycle), N can be 4, that is, the general-purpose registers are divided into 4 Banks. 
Optionally, in some possible implementations, the interleaving granularity K may not be consistent with the size of the operand. For example, when the size of the operand is 2 bytes, the interleaving granularity K may also be 1, that is, each general-purpose register group may include 1 general-purpose register, or the interleaving granularity K may be 4, that is, each general-purpose register group may include 4 general-purpose registers, and so on, which is not specifically limited in this embodiment of the present invention. Optionally, in some possible implementations, the number of banks N may also be greater than the number of instructions processed in parallel by the processor in one beat. For example, if the processor can process 4 instructions in parallel in one beat, N may also be 8, that is, the general-purpose registers are divided into 8 Banks, so that selection conflicts within the same Bank can be further reduced, and so on, which is not specifically limited in this embodiment of the present invention.

Optionally, take a 512-byte FV as an example, that is, a total of 512 general-purpose registers (R0-R511) (the FV is mainly composed of general-purpose registers and may also include other registers, such as status registers; general-purpose registers are taken as the example here, which is not specifically limited in this embodiment of the present invention). With reference to FIG. 8a-FIG. 8e, the division manner of the general-purpose registers and the selection manner involved in the embodiments of the present invention are further elaborated below.

Optionally, as shown in Figure 8a, taking K equal to 2 and N equal to 4 as an example, the FV is divided into multiple Banks. Obviously, this division manner takes K*N=8 as the period, and the 512 general-purpose registers are interleaved in a "Z" shape into the corresponding Banks. As shown in Figure 8a, general-purpose registers R0-R1, R8-R9, R16-R17...R504-R505 can be divided into Bank0; R2-R3, R10-R11, R18-R19...R506-R507 can be divided into Bank1; R4-R5, R12-R13, R20-R21...R508-R509 can be divided into Bank2; and R6-R7, R14-R15, R22-R23...R510-R511 can be divided into Bank3. Each general-purpose register in R0 to R511 can store 1 byte of data; optionally, R0 and so on can also be written as R0b, where b represents 8 bits, that is, 1 byte. Thus, as described above, the 512 general-purpose registers are divided into 4 Banks (that is, 4 columns), each Bank includes 64 rows of general-purpose register groups, and each general-purpose register group includes 2 general-purpose registers. In this case, obviously, since 2⁹ = 512, both the source operand address and the destination operand address may include at least 9 address bits, which may include 2 column address bits used to indicate the Bank to which the address belongs, 6 row address bits used to indicate the general-purpose register row to which the address belongs, and 1 target address bit used to indicate which general-purpose register in the general-purpose register group the source operand address or destination operand address corresponds to. For example, taking the source operand address as an example, the upper 2 bits of the source operand address may be the column address bits, the lowest 1 bit may be the target address bit, and the middle 6 bits may be the row address bits. 
Optionally, the upper 6 bits of the source operand address can be the row address bits, the lowest 1 bit the target address bit and the middle 2 bits the column address bits; or the highest 1 bit of the source operand address can be the target address bit, the lowest 2 bits the column address bits and the middle 6 bits the row address bits, and so on, which is not specifically limited in this embodiment of the present invention. Optionally, taking the upper 2 bits as the column address bits, the lowest 1 bit as the target address bit and the middle 6 bits as the row address bits as an example, if the source operand address is 000000000, it can be determined that the source operand address belongs to Bank0, and the corresponding source operand can be obtained by accessing the first general-purpose register in the general-purpose register group of the first row in Bank0 (that is, the general-purpose register R0 shown in FIG. 8a). For another example, if the source operand address is 100000001, it can be determined that the source operand address belongs to Bank2, and the corresponding source operand can be obtained by accessing the second general-purpose register in the general-purpose register group of the first row in Bank2 (that is, the general-purpose register R5 shown in Figure 8a), and so on, which will not be repeated here. Optionally, if the size of the source operand is consistent with the interleaving granularity K, that is, if the size of the source operand is also 2 bytes, the addresses of 2 consecutive bytes belonging to the same Bank can be considered the same address. For example, the source operand addresses 000000000 and 000000001 can be considered the same address, that is, R0 and R1 shown in Figure 8a can be considered the same address: when the processor obtains the corresponding source operand according to the source operand address 000000000 or 000000001, it reads the data in both R0 and R1 to obtain a 2-byte source operand. 
Alternatively, R0h and R0, and R0h and R1, can also be considered the same address. Since h in R0h represents 16 bits, that is, 2 bytes, R0h is R0+R1, so in both cases the data in R0 and R1 is read to obtain a 2-byte source operand. As mentioned above, in the same way, the upper 2 bits of the destination operand address can be the column address bits, the lowest 1 bit the target address bit and the middle 6 bits the row address bits. For example, if the destination operand address is 111111110, it can be determined that the destination operand address belongs to Bank3, and by accessing the first general-purpose register in the general-purpose register group of row 64 in Bank3 (that is, the general-purpose register R510 shown in Figure 8a), the corresponding calculation result can be written back to the general-purpose register R510. For another example, if the destination operand address is 110000001, it can be determined that the destination operand address belongs to Bank3, and by accessing the second general-purpose register in the general-purpose register group of the first row in Bank3 (that is, the general-purpose register R7 shown in Figure 8a), the corresponding calculation result can be written back to the general-purpose register R7, and so on, which will not be repeated here.
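The mapping from an operand address to a register number under the Figure 8a division can be sketched as follows. This is a minimal illustration: the function register_index is hypothetical, and it assumes the layout used in the examples above (column bits highest, target bits lowest, row bits in the middle) together with "Z"-shaped interleaving, so that the register number is row*K*N + column*K + target.

```python
def register_index(bits: str, k: int, n: int, col_bits: int, tgt_bits: int) -> int:
    """Map an operand address to the register number Rx under Z interleaving.

    Assumes column bits are the highest bits and target bits the lowest bits.
    k is the interleaving granularity (registers per group), n the number of
    Banks (columns).
    """
    col = int(bits[:col_bits], 2)
    row = int(bits[col_bits:len(bits) - tgt_bits], 2)
    tgt = int(bits[len(bits) - tgt_bits:], 2)
    return row * k * n + col * k + tgt


# Figure 8a: K = 2, N = 4, 2 column bits, 6 row bits, 1 target bit
for addr in ("000000000", "100000001", "111111110", "110000001"):
    print(addr, "-> R%d" % register_index(addr, k=2, n=4, col_bits=2, tgt_bits=1))
# 000000000 -> R0
# 100000001 -> R5
# 111111110 -> R510
# 110000001 -> R7
```

The four results agree with the registers named in the Figure 8a examples (R0, R5, R510 and R7).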

Optionally, as shown in Figure 8b, taking K equal to 4 and N equal to 4 as an example, the FV is divided into multiple Banks. Obviously, this division manner takes K*N=16 as the period, and the 512 general-purpose registers are interleaved in a "Z" shape into the corresponding Banks. As shown in Figure 8b, general-purpose registers R0-R3, R16-R19, R32-R35...R496-R499 can be divided into Bank0; R4-R7, R20-R23, R36-R39...R500-R503 can be divided into Bank1; R8-R11, R24-R27, R40-R43...R504-R507 can be divided into Bank2; and R12-R15, R28-R31, R44-R47...R508-R511 can be divided into Bank3. Thus the 512 general-purpose registers are divided into 4 Banks (that is, 4 columns), each Bank includes 32 rows of general-purpose register groups, and each general-purpose register group includes 4 general-purpose registers. In the same way, both the source operand address and the destination operand address can include 9 address bits, which can include 2 column address bits used to indicate the Bank to which the address belongs, 5 row address bits used to indicate the general-purpose register row to which the address belongs, and 2 target address bits used to indicate which general-purpose register in the general-purpose register group the source operand address or destination operand address corresponds to. For example, taking the source operand address as an example, the upper 2 bits of the source operand address may be the column address bits, the lowest 2 bits may be the target address bits, and the middle 5 bits may be the row address bits. Optionally, the upper 5 bits of the source operand address may be the row address bits, the lowest 2 bits the target address bits and the middle 2 bits the column address bits, and so on, which is not specifically limited in this embodiment of the present invention. 
Optionally, taking the upper 2 bits as the column address bits, the lowest 2 bits as the target address bits and the middle 5 bits as the row address bits as an example, if the source operand address is 100000011, it can be determined that the source operand address belongs to Bank2, and the corresponding source operand can be obtained by accessing the fourth general-purpose register in the general-purpose register group of the first row in Bank2 (that is, the general-purpose register R11 shown in Figure 8b). For another example, if the source operand address is 100000110, it can be determined that the source operand address belongs to Bank2, and the corresponding source operand can be obtained by accessing the third general-purpose register in the general-purpose register group of the second row in Bank2 (that is, the general-purpose register R26 shown in Figure 8b). Optionally, reference may be made to the above-mentioned embodiment corresponding to FIG. 8a, which will not be repeated here.

Optionally, as shown in Figure 8c, taking K equal to 2 and N equal to 2 as an example, the FV is divided into multiple Banks. Obviously, this division manner takes K*N=4 as the period, and the 512 general-purpose registers are interleaved in a "Z" shape into the corresponding Banks. As shown in Figure 8c, general-purpose registers R0-R1, R4-R5, R8-R9...R508-R509 can be divided into Bank0, and R2-R3, R6-R7, R10-R11...R510-R511 can be divided into Bank1. Thus the 512 general-purpose registers are divided into 2 Banks (that is, 2 columns), each Bank includes 128 rows of general-purpose register groups, and each general-purpose register group includes 2 general-purpose registers. In the same way, both the source operand address and the destination operand address can include 9 address bits, which can include 1 column address bit used to indicate the Bank to which the address belongs, 7 row address bits used to indicate the general-purpose register row to which the address belongs, and 1 target address bit used to indicate which general-purpose register in the general-purpose register group the source operand address or destination operand address corresponds to. For example, taking the source operand address as an example, the highest 1 bit of the source operand address can be the column address bit, the lowest 1 bit can be the target address bit, and the middle 7 bits can be the row address bits. For example, if the source operand address is 100000001, it can be determined that the source operand address belongs to Bank1, and the corresponding source operand can be obtained by accessing the second general-purpose register in the general-purpose register group of the first row in Bank1 (that is, the general-purpose register R3 shown in Figure 8c). 
For another example, if the source operand address is 100000010, it can be determined that the source operand address belongs to Bank1, and the corresponding source operand can be obtained by accessing the first general-purpose register in the general-purpose register group of the second row in Bank1 (that is, the general-purpose register R6 shown in Figure 8c). Optionally, reference may be made to the above-mentioned embodiment corresponding to FIG. 8a, which will not be repeated here.

Optionally, reference may be made to FIG. 8d and FIG. 8e for other possible division manners. As shown in Figure 8d, with K*N=8 as the period, the 512 general-purpose registers are interleaved in a "Z" shape into 2 Banks, each Bank includes 64 rows of general-purpose register groups, and each general-purpose register group can include 4 general-purpose registers. As shown in Figure 8e, with K*N=16 as the period, the 512 general-purpose registers are interleaved in a "Z" shape into 8 Banks, each Bank includes 32 rows of general-purpose register groups, and each general-purpose register group can include 2 general-purpose registers, and so on, which will not be repeated here.

Optionally, if the FV is 1024 bytes, taking K equal to 2 and N equal to 4 as an example, the FV is divided into multiple Banks: the 1024 general-purpose registers can be divided into 4 Banks (that is, 4 columns), each Bank includes 128 rows of general-purpose register groups, and each general-purpose register group includes 2 general-purpose registers. Then both the source operand address and the destination operand address may include at least 10 address bits, which may include 2 column address bits, 7 row address bits and 1 target address bit, and so on, which will not be repeated here.
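The field widths quoted for the different division manners follow from the FV size, K and N. The following is a minimal sketch of that arithmetic; the function field_widths is hypothetical and assumes that the number of Banks, the number of rows and K are all powers of two, as in every example in the text.

```python
def field_widths(total_regs: int, k: int, n: int):
    """Return (column bits, row bits, target bits) for a given division.

    total_regs: number of 1-byte general-purpose registers in the FV.
    k: registers per general-purpose register group (interleaving granularity).
    n: number of Banks (columns).
    """
    if total_regs % (k * n) != 0:
        raise ValueError("total_regs must be a multiple of k * n")
    rows = total_regs // (k * n)

    def log2_exact(v: int) -> int:
        # Only exact powers of two give a whole number of address bits.
        if v < 1 or v & (v - 1):
            raise ValueError("expected a power of two")
        return v.bit_length() - 1

    return log2_exact(n), log2_exact(rows), log2_exact(k)


print(field_widths(512, 2, 4))   # (2, 6, 1)  Figure 8a
print(field_widths(512, 4, 4))   # (2, 5, 2)  Figure 8b
print(field_widths(512, 2, 2))   # (1, 7, 1)  Figure 8c
print(field_widths(1024, 2, 4))  # (2, 7, 1)  1024-byte FV
```

In each 512-register case the three widths sum to 9 address bits, and for the 1024-register case they sum to 10, matching the counts given in the text.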

Optionally, the manner of dividing the Banks is not limited to the "Z"-shaped interleaving shown in FIG. 8a-FIG. 8e. In some possible implementations, consecutive addresses can be assigned to each Bank. For example, taking a 512-byte FV and a number of Banks N equal to 4 as an example, R0-R127 can be divided into Bank0, R128-R255 into Bank1, R256-R383 into Bank2 and R384-R511 into Bank3, and so on, which is not specifically limited in this embodiment of the present invention.
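The difference between the interleaved division and the contiguous division can be seen by computing the Bank of a given register under each scheme. This is a minimal sketch; both function names are hypothetical, and the contiguous variant assumes equal-sized Banks (512 / 4 = 128 registers per Bank).

```python
def interleaved_bank(reg: int, k: int, n: int) -> int:
    """Bank of register Rreg under Z-shaped interleaving with granularity k."""
    return (reg // k) % n


def contiguous_bank(reg: int, total_regs: int, n: int) -> int:
    """Bank of register Rreg when each Bank holds one contiguous range.

    Assumes equal-sized Banks, so total_regs must be a multiple of n.
    """
    return reg // (total_regs // n)


for reg in (0, 8, 10, 127, 128, 511):
    print(reg, interleaved_bank(reg, k=2, n=4), contiguous_bank(reg, 512, n=4))
# 0 0 0
# 8 0 0
# 10 1 0
# 127 3 0
# 128 0 1
# 511 3 3
```

Under interleaving, neighbouring register groups land in different Banks, whereas the contiguous division keeps a whole address range in one Bank.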

Please refer to FIG. 9, which is a schematic diagram of data selection provided by an embodiment of the present invention. The arithmetic logic unit ALU mainly performs arithmetic operations (addition, subtraction, multiplication and division), logical operations (AND, OR, NOT and XOR) and shift operations on binary data. Arithmetic operations such as addition, subtraction, multiplication and division, and instructions such as OR, AND, ASL and ROL, are performed in the arithmetic logic unit ALU. As shown in Figure 9, an instruction can include two source operands. For example, the arithmetic logic unit ALU0 can be used to execute the task of instruction 0, which can include two source operands, src1 and src2; ALU1 can be used to execute the task of instruction 1, which can include two source operands, src1 and src2; ALU2 can be used to execute the task of instruction 2, which can include two source operands, src1 and src2; and ALU3 can be used to execute the task of instruction 3, which can include two source operands, src1 and src2. Instruction 0, instruction 1, instruction 2 and instruction 3 may be four instructions of the same layer under the VLIW architecture. The two source operands of each instruction (such as src1 and src2 of ALU0, src1 and src2 of ALU1, src1 and src2 of ALU2, and src1 and src2 of ALU3 shown in Figure 9) can be independent of each other, and are not subject to the above constraint rules with respect to each other. 
In this way, although the same source operand of different ALUs cannot access different general-purpose registers in the same Bank (for example, taking the division manner shown in Figure 8a as an example, src1 of ALU0 and src1 of ALU1 cannot access different general-purpose registers of Bank0, such as R0 and R8), there is no restriction on Bank access between different source operands, whether of different ALUs or of the same ALU. For example, src1 of ALU0 and src2 of ALU0 can access different general-purpose registers in the same Bank, such as R0 and R8 in Bank0 shown in Figure 8a, and src1 of ALU0 and src2 of ALU1 can also access different general-purpose registers in the same Bank, such as R10 and R11 in Bank1 shown in FIG. 8a, and so on, which will not be repeated here. Because src1 and src2 are independent of each other and are not constrained by the above constraint rules, the loss of instruction execution efficiency caused by the constraint rules can be reduced to a certain extent. Optionally, as mentioned above, since instruction 0, instruction 1, instruction 2 and instruction 3 are 4 instructions of the same layer, they need to be constrained by the above-mentioned selection constraint rules to avoid the increase in selection logic cost caused by selecting multiple different source operands in the same Bank at one time. It can be understood that the source operands of instructions of different layers are not constrained in accessing the Banks. For example, instruction 4 is an instruction of the next layer, which is in a different layer from instruction 0, instruction 1, instruction 2 and instruction 3. 
Therefore, when the processor processes the instruction of the next layer, the instruction 0, instruction 1, Instruction 2 and instruction 3 have completed the access to the corresponding Bank and obtained the source operands required for the calculation, then the access to the Bank by src1 or src2 of instruction 4 is the same as that of the previous instruction 0, instruction 1, instruction 2 and instruction 3. There will be no conflict. Similarly, since different ALUs or different source operands of the same ALU have no restrictions on accessing Bank, different ALUs or different source operands of the same ALU can also have any access type. src2 may be indirect access, for example, src2 of ALU1 may be direct access, src1 of ALU3 may be indirect access, etc., which will not be repeated here.
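The same-layer constraint described above can be sketched as a small check. This is only an illustration under stated assumptions: the bank mapping in `bank_of` is hypothetical (FIG. 8a is not reproduced here) and is chosen only so that R0 and R8 fall in Bank0 and R2, R10 and R11 fall in Bank1, as in the examples in the text; the function names are likewise illustrative.

```python
from collections import defaultdict

def bank_of(reg: int) -> int:
    """Hypothetical bank mapping (an assumption, not taken from FIG. 8a):
    R0/R1/R8/R9 -> Bank0, R2/R3/R10/R11 -> Bank1, and so on."""
    return (reg % 8) // 2

def find_conflicts(layer):
    """layer: list of (src1_reg, src2_reg) tuples, one per ALU of the same layer.
    Returns (slot, bank, registers) for each violation of the rule that the
    same source operand slot of different ALUs must not read different
    registers of the same Bank."""
    conflicts = []
    for slot in (0, 1):  # 0 -> src1, 1 -> src2; slots are checked independently
        regs_by_bank = defaultdict(set)
        for operands in layer:
            reg = operands[slot]
            regs_by_bank[bank_of(reg)].add(reg)
        for bank, regs in sorted(regs_by_bank.items()):
            if len(regs) > 1:  # same slot, same Bank, different registers
                conflicts.append((f"src{slot + 1}", bank, sorted(regs)))
    return conflicts

# src1 of ALU0 reads R0 and src1 of ALU1 reads R8: both in Bank0, so this conflicts.
bad_layer = [(0, 4), (8, 6), (2, 4), (5, 6)]
# src1 and src2 of ALU0 read R0 and R8: different slots, so this is allowed.
ok_layer = [(0, 8), (2, 10), (4, 12), (6, 14)]

print(find_conflicts(bad_layer))
print(find_conflicts(ok_layer))
```

Because each slot is checked on its own, src1 and src2 never constrain each other, which mirrors the independence of the two source operands described above.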

Please refer to FIG. 10. FIG. 10 is a schematic diagram of result write-back provided by an embodiment of the present invention. As shown in FIG. 10, each instruction can include a destination operand. During the write-back pipeline stage, ALU0, ALU1, ALU2 and ALU3 can write their respective calculation results back to the corresponding general-purpose registers according to their destination operand addresses. As mentioned above, and for the same reason, since instruction 0, instruction 1, instruction 2 and instruction 3 are four instructions of the same layer, they need to be constrained by the above selection constraint rules, to avoid the increase in selection logic cost caused by selecting multiple different destination operands (that is, the general-purpose registers corresponding to multiple different destination operand addresses) in the same Bank at one time. However, the destination operands of instructions in different layers are not restricted in accessing the Bank, and the access type of a destination operand is also not restricted. For details, reference may be made to the embodiment corresponding to FIG. 9 above, which will not be repeated here. Optionally, if the destination operand addresses of different instructions are the same, Write After Write (WAW) can be supported when the calculation results are written back to the general-purpose register: the result of the instruction that is logically executed last is taken as the final content of the register, that is, a later write can overwrite an earlier write. For example, if the destination operand addresses of ALU0, ALU1, ALU2 and ALU3 are all R0, and ALU3 writes the calculation result of instruction 3 into R0 after ALU0, ALU1 and ALU2, then R0 finally holds the calculation result of instruction 3.
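The Write After Write behaviour in the R0 example can be illustrated as follows. This is a minimal sketch, assuming that the logical execution order within a layer is ALU0, ALU1, ALU2, ALU3 as in the example; the function name `write_back` and the result values are hypothetical.

```python
def write_back(registers, results):
    """Apply the results of one layer to the register file.

    registers: dict mapping a register name to its value.
    results:   list of (destination, value) pairs in logical execution
               order (assumed here to be ALU0 first, ALU3 last).
    A later write to the same destination overwrites an earlier one (WAW)."""
    for dst, value in results:
        registers[dst] = value
    return registers

regs = {"R0": 0}
# All four ALUs target R0; the values 10..40 are made-up calculation results.
layer_results = [("R0", 10), ("R0", 20), ("R0", 30), ("R0", 40)]
write_back(regs, layer_results)
print(regs["R0"])  # 40, the result of instruction 3 (ALU3)
```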

Optionally, in order to alleviate the decrease in instruction execution efficiency that the above constraint rules cause when data dependencies are processed, the processor 10 in this embodiment of the present invention may further include one or more temporary registers (Temporary Register, T). The one or more temporary registers can be simultaneously visible to all ALUs (for example, ALU0, ALU1, ALU2 and ALU3 shown in FIG. 9) without any constraint. In addition, temporary registers are visible only to the compiler and not to the programmer, and can be used to improve instruction utilization. The compiler can replace general-purpose registers that need to be frequently referenced and updated with temporary registers.

For example, taking FIG. 8a and FIG. 9 as examples, among the four instructions of the same layer, suppose src1 of ALU0 needs to access R0 in Bank0, while src1 of ALU1, src1 of ALU2 and src1 of ALU3 all need to access R8 in Bank0. According to the constraint rule that the same source operand (such as src1) of different instructions in the same layer cannot access different general-purpose registers of the same Bank, instruction 1, instruction 2 and instruction 3 cannot be processed in the same layer as instruction 0. Empty instructions have to be inserted in this layer, and instruction 1, instruction 2 and instruction 3 have to be moved to the next layer for processing. The inserted empty instructions greatly reduce instruction utilization and instruction execution efficiency. In this case, knowing that R8 will be frequently accessed and referenced, R8 can be replaced by a temporary register T0 in the layer preceding instruction 0, instruction 1, instruction 2 and instruction 3; T0 stores the data of R8 and corresponds to R8. In this way, when instruction 1, instruction 2 and instruction 3 select data according to the source operand address, they can directly access the corresponding T0 to obtain the source operand, and are therefore not constrained by the constraint rules, which ensures instruction utilization and execution efficiency.
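The substitution in this example can be sketched as a small compiler-side rewrite. This is an illustration only: the bank mapping in `bank_of` is a hypothetical one chosen so that R0 and R8 share Bank0, the helper names are invented for this sketch, and the copy of R8 into T0 in the preceding layer is assumed to have already happened and is not modelled.

```python
def bank_of(reg: str) -> int:
    """Hypothetical bank mapping (an assumption): R0 and R8 share Bank0."""
    return (int(reg[1:]) % 8) // 2

def has_conflict(layer) -> bool:
    """True if the same source operand slot of two instructions in the layer
    reads different general-purpose registers of the same Bank."""
    for slot in range(2):          # 0 -> src1, 1 -> src2
        first_reg_in_bank = {}
        for operands in layer:
            reg = operands[slot]
            if reg.startswith("T"):  # temporary registers are unconstrained
                continue
            bank = bank_of(reg)
            if first_reg_in_bank.setdefault(bank, reg) != reg:
                return True
    return False

def substitute_temp(layer, hot_reg, temp="T0"):
    """Rewrite every read of hot_reg so that it reads the temporary register."""
    return [tuple(temp if r == hot_reg else r for r in ops) for ops in layer]

# (src1, src2) per ALU: ALU0 reads R0, while ALU1..ALU3 read R8 through src1.
layer = [("R0", "R4"), ("R8", "R6"), ("R8", "R4"), ("R8", "R6")]
print(has_conflict(layer))                         # True: R0 vs R8 in Bank0
print(has_conflict(substitute_temp(layer, "R8")))  # False: R8 now read via T0
```

After the rewrite all four instructions fit in one layer, so no empty instruction needs to be inserted.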

For another example, taking FIG. 8a and FIG. 10 as examples, among the four instructions of the same layer, suppose dst1 of ALU0 needs to access R2 in Bank1, while dst1 of ALU1, dst1 of ALU2 and dst1 of ALU3 all need to access R10 in Bank1. In the same way, knowing that R10 will be frequently accessed and referenced, R10 can be replaced by a temporary register T1 in the layer preceding instruction 0, instruction 1, instruction 2 and instruction 3; T1 stores the data of R10 and corresponds to R10. In this way, when instruction 1, instruction 2 and instruction 3 write back their calculation results according to the destination operand address, they can directly access the corresponding T1 and write their respective calculation results into T1 without being constrained by the constraint rules, which ensures instruction utilization and execution efficiency.

It should be noted that the embodiments of the present invention can be applied to any chip, apparatus, or device that processes multiple instructions in parallel, regardless of whether the implementation is in hardware or software.

Please refer to FIG. 11. FIG. 11 is a schematic flowchart of a processing method according to an embodiment of the present invention. The processing method is applied to a processor, and the processor includes a data selection unit, an instruction decoding unit connected to the data selection unit, and M rows*N columns of general-purpose register groups; each general-purpose register group in the M rows*N columns of general-purpose register groups includes K general-purpose registers; M, N and K are integers greater than or equal to 1. The processing method is applicable to any one of the processors in FIG. 1 to FIG. 3 above and to a device (such as a mobile phone, a computer, a server, etc.) including the processor. The method may include the following steps S201 to S202:

Step S201: through the instruction decoding unit, decode X input instructions to obtain at least one source operand address of each of the X instructions, Y source operand addresses in total, and send the Y source operand addresses to the data selection unit; each of the Y source operand addresses includes at least one column address bit, at least one row address bit and at least one target address bit; the at least one column address bit is used to indicate the general-purpose register column to which each source operand address belongs, and the at least one row address bit is used to indicate the general-purpose register row to which each source operand address belongs; the at least one target address bit is used to indicate the correspondence between each source operand address and the t-th general-purpose register in the K general-purpose registers; X, Y and t are integers greater than or equal to 1;

Step S202: through the data selection unit, according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the i-th source operand address, access the general-purpose register corresponding to the i-th source operand address in the M rows*N columns of general-purpose register groups to obtain the corresponding source operand; the i-th source operand address is one of the Y source operand addresses; i is an integer greater than or equal to 1 and less than or equal to Y.
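Steps S201 and S202 can be illustrated with a small address-decoding sketch. The concrete values M = 2, N = 4 and K = 2, the 4-bit address width and the bit layout (bit 3 as the row address bit, bits 2 to 1 as the column address bits, bit 0 as the target address bit) are assumptions made only for this illustration; the method itself requires only that each address contain at least one bit of each kind. The target index here is 0-based, whereas t in the text counts from 1.

```python
M, N, K = 2, 4, 2  # rows, columns, registers per group (illustrative values)

def decode(addr: int):
    """Split a 4-bit source operand address into (row, column, target).

    Assumed layout: bit 3 = row address bit, bits 2..1 = column address
    bits, bit 0 = target address bit (0-based index within the group)."""
    target = addr & 0b1
    column = (addr >> 1) & 0b11
    row = (addr >> 3) & 0b1
    return row, column, target

def select(register_file, addr: int):
    """Model of the data selection unit: read the register named by addr."""
    row, column, target = decode(addr)
    return register_file[row][column][target]

# register_file[row][column][target]; each register holds 100 + its own
# register number so that a read is easy to check.
register_file = [
    [[100 + (r * N + c) * K + t for t in range(K)] for c in range(N)]
    for r in range(M)
]

print(decode(10))                 # (1, 1, 0): row 1, column 1, first register
print(select(register_file, 10))  # 110, the content of R10 in this model
```

Under this layout R0 and R8 share column 0, and R2, R10 and R11 share column 1, which is consistent with the Bank examples given for FIG. 8a.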

It should be noted that, for the specific flow of the processing method described in the embodiments of the present invention, reference may be made to the relevant descriptions in the embodiments of the present invention described in FIG. 1 to FIG. 10 , which will not be repeated here.

An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium may store a program, and when the program is executed by a processor, the processor may perform some or all of the steps of any one of the methods described in the foregoing method embodiments.

An embodiment of the present invention further provides a computer program, where the computer program includes instructions, and when the computer program is executed by a multi-core processor, the processor can perform some or all of the steps of any one of the above method embodiments.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present invention is not limited by the described action sequence, because, in accordance with the present invention, certain steps may be performed in other orders or simultaneously. In addition, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the above-mentioned units is only a division by logical function, and other division methods may be used in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.

The above-mentioned units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

If the above-mentioned integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc., and specifically may be a processor in the computer device) to execute all or part of the steps of the above methods in the various embodiments of the present invention. The aforementioned storage medium may include a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or any other medium that can store program code.

As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features thereof may be equivalently replaced, and that these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions in the embodiments of the present invention.

Claims

A processor, characterized in that it includes a data selection unit, an instruction decoding unit connected to the data selection unit, and M rows*N columns of general-purpose register groups, each general-purpose register group in the M rows*N columns of general-purpose register groups including K general-purpose registers; M, N and K are integers greater than or equal to 1; wherein,

The instruction decoding unit is configured to decode X input instructions to obtain at least one source operand address of each of the X instructions, Y source operand addresses in total, and to send the Y source operand addresses to the data selection unit; each of the Y source operand addresses includes at least one column address bit, at least one row address bit and at least one target address bit; the at least one column address bit is used to indicate the general-purpose register column to which each source operand address belongs, and the at least one row address bit is used to indicate the general-purpose register row to which each source operand address belongs; the at least one target address bit is used to indicate the correspondence between each source operand address and the t-th general-purpose register in the K general-purpose registers; X, Y and t are integers greater than or equal to 1;

The data selection unit is configured to access, according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the i-th source operand address, the general-purpose register corresponding to the i-th source operand address in the M rows*N columns of general-purpose register groups, to obtain the corresponding source operand; the i-th source operand address is one of the Y source operand addresses; i is an integer greater than or equal to 1 and less than or equal to Y.
The processor of claim 1, wherein the processor further comprises an execution unit connected to the data selection unit, the execution unit comprising at least one arithmetic logic unit;

The at least one arithmetic logic unit is configured to execute the X instructions based on the source operands corresponding to the Y source operand addresses, and obtain respective calculation results of the X instructions.
The processor of claim 2, wherein the processor further comprises a result write-back unit connected to the execution unit; wherein,

The instruction decoding unit is further configured to obtain the respective destination operand addresses of the X instructions, X destination operand addresses in total; each of the X destination operand addresses includes the at least one column address bit, the at least one row address bit and the at least one target address bit; the at least one column address bit is used to indicate the general-purpose register column to which each destination operand address belongs, and the at least one row address bit is used to indicate the general-purpose register row to which each destination operand address belongs; the at least one target address bit is used to indicate the correspondence between each destination operand address and the t-th general-purpose register in the K general-purpose registers;

The result write-back unit is configured to access, according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the j-th destination operand address, the general-purpose register corresponding to the j-th destination operand address in the M rows*N columns of general-purpose register groups, and to write the calculation result of the j-th instruction back into the general-purpose register corresponding to the j-th destination operand address; j is an integer greater than or equal to 1 and less than or equal to X.
The processor according to claim 1, wherein the data selection unit is specifically configured to:

According to the at least one column address bit included in the i-th source operand address, determine the i'-th general-purpose register column to which the i-th source operand address belongs, and, in the i'-th general-purpose register column, access the corresponding general-purpose register according to the at least one row address bit and the at least one target address bit included in the i-th source operand address, where i' is an integer greater than or equal to 1 and less than or equal to N; or,

According to the at least one row address bit included in the i-th source operand address, determine the i"-th general-purpose register row to which the i-th source operand address belongs, and, in the i"-th general-purpose register row, access the corresponding general-purpose register according to the at least one column address bit and the at least one target address bit included in the i-th source operand address, where i" is an integer greater than or equal to 1 and less than or equal to M.
The processor according to claim 3, wherein the result write-back unit is specifically used for:

According to the at least one column address bit included in the j-th destination operand address, determine the j'-th general-purpose register column to which the j-th destination operand address belongs, and, in the j'-th general-purpose register column, access the corresponding general-purpose register according to the at least one row address bit and the at least one target address bit included in the j-th destination operand address, where j' is an integer greater than or equal to 1 and less than or equal to N; or,

According to the at least one row address bit included in the j-th destination operand address, determine the j"-th general-purpose register row to which the j-th destination operand address belongs, and, in the j"-th general-purpose register row, access the corresponding general-purpose register according to the at least one column address bit and the at least one target address bit included in the j-th destination operand address, where j" is an integer greater than or equal to 1 and less than or equal to M.
The processor according to claim 3, wherein any two different source operand addresses among the Y source operand addresses belong to different general-purpose register columns; and any two different destination operand addresses among the X destination operand addresses belong to different general-purpose register columns.
The processor of claim 3, wherein the processor further comprises a temporary register, the temporary register corresponds to a target general-purpose register; the target general-purpose register is one of the general-purpose registers in the M rows*N columns of general-purpose register groups; the temporary register stores the data in the target general-purpose register; the Y source operand addresses include a plurality of identical first source operand addresses, the number of which is greater than a first threshold, and a second source operand address; the general-purpose register corresponding to the first source operand address is the target general-purpose register, and the second source operand address belongs to the general-purpose register row where the target general-purpose register is located;

The data selection unit is further configured to access the corresponding temporary register according to the address of the first source operand, and obtain the corresponding source operand.
The processor of claim 3, wherein the processor further comprises a temporary register, the temporary register corresponds to a target general-purpose register; the target general-purpose register is one of the general-purpose registers in the M rows*N columns of general-purpose register groups; the temporary register stores the data in the target general-purpose register; the X destination operand addresses include a plurality of first destination operand addresses, the number of which is greater than the first threshold, and a second destination operand address; the general-purpose register corresponding to the first destination operand address is the target general-purpose register, and the second destination operand address belongs to the general-purpose register row where the target general-purpose register is located;

The result write-back unit is further configured to access the corresponding temporary register according to the target destination operand address, and write the corresponding calculation result back into the temporary register.
The processor according to any one of claims 1-8, wherein the processor further comprises an instruction acquisition unit connected to the instruction decoding unit; wherein,

The instruction acquisition unit is configured to acquire the X instructions to be executed, and send the X instructions to the instruction decoding unit; the X instructions are instructions executed in parallel by the processor within one clock cycle.
The processor according to any one of claims 1-9, wherein the access types of the Y source operand addresses and the X destination operand addresses are each a direct access type or an indirect access type.
A processing method, applied to a processor, characterized in that the processor comprises a data selection unit, an instruction decoding unit connected to the data selection unit, and a general-purpose register group of M rows*N columns, the M rows* Each general-purpose register group in the N-column general-purpose register group includes K general-purpose registers; M, N and K are integers greater than or equal to 1; the method includes:

Through the instruction decoding unit, decoding X input instructions to obtain at least one source operand address of each of the X instructions, Y source operand addresses in total, and sending the Y source operand addresses to the data selection unit; each of the Y source operand addresses includes at least one column address bit, at least one row address bit and at least one target address bit; the at least one column address bit is used to indicate the general-purpose register column to which each source operand address belongs, and the at least one row address bit is used to indicate the general-purpose register row to which each source operand address belongs; the at least one target address bit is used to indicate the correspondence between each source operand address and the t-th general-purpose register in the K general-purpose registers; X, Y and t are integers greater than or equal to 1;

Through the data selection unit, according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the i-th source operand address, accessing the general-purpose register corresponding to the i-th source operand address in the M rows*N columns of general-purpose register groups to obtain the corresponding source operand; the i-th source operand address is one of the Y source operand addresses; i is an integer greater than or equal to 1 and less than or equal to Y.
The method of claim 11, wherein the processor further comprises an execution unit connected to the data selection unit, the execution unit comprising at least one arithmetic logic unit; the method further comprises:

Through the at least one arithmetic logic unit, the X instructions are executed based on the source operands corresponding to the Y source operand addresses, respectively, to obtain respective calculation results of the X instructions.
The method according to claim 12, wherein the processor further comprises a result write-back unit connected to the execution unit; the method further comprises:

Through the instruction decoding unit, acquiring the respective destination operand addresses of the X instructions, X destination operand addresses in total; each of the X destination operand addresses includes the at least one column address bit, the at least one row address bit and the at least one target address bit; the at least one column address bit is used to indicate the general-purpose register column to which each destination operand address belongs, and the at least one row address bit is used to indicate the general-purpose register row to which each destination operand address belongs; the at least one target address bit is used to indicate the correspondence between each destination operand address and the t-th general-purpose register in the K general-purpose registers;

Through the result write-back unit, according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the j-th destination operand address, accessing the general-purpose register corresponding to the j-th destination operand address in the M rows*N columns of general-purpose register groups, and writing the calculation result of the j-th instruction back into the general-purpose register corresponding to the j-th destination operand address; j is an integer greater than or equal to 1 and less than or equal to X.
The method according to claim 11, wherein the accessing, through the data selection unit, according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the i-th source operand address, of the general-purpose register corresponding to the i-th source operand address in the M rows*N columns of general-purpose register groups includes:

Through the data selection unit, according to the at least one column address bit included in the i-th source operand address, determining the i'-th general-purpose register column to which the i-th source operand address belongs, and, in the i'-th general-purpose register column, accessing the corresponding general-purpose register according to the at least one row address bit and the at least one target address bit included in the i-th source operand address, where i' is an integer greater than or equal to 1 and less than or equal to N; or,

Through the data selection unit, according to the at least one row address bit included in the i-th source operand address, determining the i"-th general-purpose register row to which the i-th source operand address belongs, and, in the i"-th general-purpose register row, accessing the corresponding general-purpose register according to the at least one column address bit and the at least one target address bit included in the i-th source operand address, where i" is an integer greater than or equal to 1 and less than or equal to M.
The method according to claim 13, wherein the accessing, through the result write-back unit, according to the at least one column address bit, the at least one row address bit and the at least one target address bit included in the j-th destination operand address, of the general-purpose register corresponding to the j-th destination operand address in the M rows*N columns of general-purpose register groups includes:

Through the result write-back unit, according to the at least one column address bit included in the j-th destination operand address, determining the j'-th general-purpose register column to which the j-th destination operand address belongs, and, in the j'-th general-purpose register column, accessing the corresponding general-purpose register according to the at least one row address bit and the at least one target address bit included in the j-th destination operand address, where j' is an integer greater than or equal to 1 and less than or equal to N; or,

Through the result write-back unit, according to the at least one row address bit included in the j-th destination operand address, determining the j"-th general-purpose register row to which the j-th destination operand address belongs, and, in the j"-th general-purpose register row, accessing the corresponding general-purpose register according to the at least one column address bit and the at least one target address bit included in the j-th destination operand address, where j" is an integer greater than or equal to 1 and less than or equal to M.
The method according to claim 13, wherein any two different source operand addresses among the Y source operand addresses belong to different general-purpose register columns; and any two different destination operand addresses among the X destination operand addresses belong to different general-purpose register columns.
The method according to claim 13, wherein the processor further comprises a temporary register, the temporary register corresponds to a target general-purpose register; the target general-purpose register is one of the general-purpose registers in the M rows*N columns of general-purpose register groups; the temporary register stores the data in the target general-purpose register; the Y source operand addresses include a plurality of identical first source operand addresses, the number of which is greater than a first threshold, and a second source operand address; the general-purpose register corresponding to the first source operand address is the target general-purpose register, and the second source operand address belongs to the general-purpose register column where the target general-purpose register is located; the method further includes:

Through the data selection unit, the corresponding temporary register is accessed according to the address of the first source operand, and the corresponding source operand is acquired.
The method according to claim 13, wherein the processor further comprises a temporary register, the temporary register corresponds to a target general-purpose register; the target general-purpose register is one of the general-purpose registers in the M rows*N columns of general-purpose register groups; the temporary register stores the data in the target general-purpose register; the X destination operand addresses include a plurality of first destination operand addresses, the number of which is greater than the first threshold, and a second destination operand address; the general-purpose register corresponding to the first destination operand address is the target general-purpose register, and the second destination operand address belongs to the general-purpose register row where the target general-purpose register is located; the method further includes:

Through the result write-back unit, the corresponding temporary register is accessed according to the target destination operand address, and the corresponding calculation result is written back into the temporary register.
The method according to any one of claims 11-18, wherein the processor further comprises an instruction acquisition unit connected to the instruction decoding unit; the method further comprises:

Through the instruction acquisition unit, acquiring the X instructions to be executed, and sending the X instructions to the instruction decoding unit; the X instructions are instructions executed in parallel by the processor within one clock cycle.
The method according to any one of claims 11-19, wherein the access types of the Y source operand addresses and the X destination operand addresses are each a direct access type or an indirect access type.
A computer-readable storage medium, characterized in that, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method described in any one of the preceding claims 11-20 is implemented.
A computer program, characterized in that the computer program includes instructions that, when the computer program is executed by a processor, cause the processor to perform the method according to any one of the preceding claims 11-20.