
CN114996647B - Arithmetic unit, correlation apparatus and method - Google Patents


Info

Publication number
CN114996647B
Authority
CN
China
Prior art keywords
vector
index
source address
address
instruction
Prior art date
Legal status
Active
Application number
CN202110231342.4A
Other languages
Chinese (zh)
Other versions
CN114996647A
Inventor
李浩然
孙飞
罗竣文
王邦彦
赵梓豪
Current Assignee
Alibaba Innovation Co
Original Assignee
Alibaba Innovation Co
Priority date
Filing date
Publication date
Application filed by Alibaba Innovation Co
Priority to CN202110231342.4A
Publication of CN114996647A
Application granted
Publication of CN114996647B
Status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections


Abstract

The present disclosure provides an arithmetic unit and a related apparatus and method. The arithmetic unit includes: an instruction cache unit for receiving an index and effective element load instruction, where the instruction includes a first source address, a second source address, and a destination address; the first source address represents the starting storage address of the indexes of the un-pruned elements of a sparse matrix, the second source address represents the starting storage address of the vector to be multiplied, and the destination address represents the address to which the effective element vector is to be stored back; an un-pruned element index reading unit for reading the un-pruned element indexes according to the first source address; an effective element reading unit for reading, according to the second source address and the read un-pruned element indexes, the effective elements of the vector to be multiplied that are to be multiplied with the un-pruned elements of the sparse matrix; and a restore unit for assembling the effective elements into an effective element vector and storing it according to the destination address. The disclosure improves the operation efficiency of sparse matrix operations and reduces their processing overhead.

Description

Arithmetic unit, correlation apparatus and method
Technical Field
The present disclosure relates to the field of chips, and more particularly, to an arithmetic unit, a related apparatus, and a method.
Background
Deep Neural Networks (DNNs) use a large number of matrix and vector operations, such as matrix-matrix multiplication and matrix-vector multiplication. A deep neural network has multiple layers of nodes; each node is connected to nodes of the previous layer, receives their outputs as its input, and performs an operation (multiplication, convolution, etc.) with its weight matrix. The output of the deep neural network can thus be regarded as the product of a weight matrix with an input excitation vector or excitation matrix. Because matrix operations are computationally expensive, pruning techniques have been proposed to remove redundant connections in a DNN and thereby reduce computation intensity and memory overhead. Pruning sets to 0 the many weight-matrix elements that have little influence on the result; elements that are 0 need not participate in the matrix multiplication, so increasing sparsity reduces computation intensity and memory overhead. For example, with a sparsity of 30%, only 30% of the elements of the weight matrix are non-0, and the final result can be obtained simply by computing the products of those 30% of elements with the corresponding elements of the excitation vector and accumulating them.
In the multiplication of a sparse matrix with an excitation vector (the multiplication of a sparse matrix with an excitation matrix can be converted into multiplications of the sparse matrix with individual excitation vectors), typically 3 loads are required. First, the weight values that were not pruned (the non-0 weight values) must be loaded from the sparse matrix; pruned elements with value 0 need not participate in the operation. Second, the indexes of the un-pruned weights (the indexes of the non-0 weights) must be loaded from the sparse matrix; these are the indexes of the effective elements of the excitation vector that need to participate in the operation. Elements at other index positions of the excitation vector correspond to weights of 0, so their products are still 0 and have no practical significance. Third, the elements at those indexes must be loaded from the excitation vector. After these 3 loads, the products of the elements selected by index from the excitation vector and the un-pruned weights of the sparse matrix are accumulated to obtain the multiplication result of the sparse matrix and the excitation vector.
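For concreteness, this three-load baseline can be sketched in C. The data layout and the names below (row_values, row_indices, activation) are illustrative assumptions rather than the patent's data structures; the point is that the index load and the element gather are separate steps preceding the multiply-accumulate.

    #include <stddef.h>

    /* Baseline flow for one sparse row: three separate loads (un-pruned
     * weight values, their indexes, and the gathered excitation elements)
     * feed the multiply-accumulate. */
    float sparse_row_dot(const float *row_values,   /* load 1: un-pruned weights    */
                         const size_t *row_indices, /* load 2: their column indexes */
                         size_t nnz,                /* number of un-pruned weights  */
                         const float *activation)   /* excitation vector            */
    {
        float acc = 0.0f;
        for (size_t i = 0; i < nnz; i++) {
            float x = activation[row_indices[i]];   /* load 3: gather effective element */
            acc += row_values[i] * x;
        }
        return acc;
    }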
Because these 3 loads require 3 different instructions, the processing unit must fetch and decode instructions frequently, which incurs high processing overhead and low speed. Modern processing units, especially low-power processing units, often have a limited instruction fetch width and do not have sufficient cache space for these loads. A large number of such instructions seriously degrades operation efficiency and increases processing overhead.
Disclosure of Invention
In view of this, the present disclosure aims to improve the operation efficiency of the sparse matrix operations described above and to reduce their processing overhead.
According to an aspect of the present disclosure, there is provided an arithmetic unit including:
an instruction cache unit, configured to receive an index and effective element load instruction, where the instruction is used, in the multiplication of a sparse matrix with a vector to be multiplied, to acquire from the vector to be multiplied the effective element vector corresponding to the un-pruned elements of the sparse matrix; the instruction includes a first source address, a second source address, and a destination address, where the first source address represents the starting storage address of the indexes of the un-pruned elements of the sparse matrix, the second source address represents the starting storage address of the vector to be multiplied, and the destination address represents the address to which the effective element vector is to be stored back;
an un-pruned element index reading unit, configured to read the un-pruned element indexes according to the first source address;
an effective element reading unit, configured to read, according to the second source address and the read un-pruned element indexes, the effective elements of the vector to be multiplied that are to be multiplied with the un-pruned elements of the sparse matrix;
and a restore unit, configured to assemble the effective elements into an effective element vector and store it according to the destination address.
Optionally, the effective element reading unit adds the second source address to each read un-pruned element index to obtain the storage address of each effective element, and reads the effective elements according to those storage addresses.
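A minimal sketch of that address computation follows, assuming byte addressing; the disclosure describes the address as the sum of the second source address and the index, so the scaling by element size and the helper name are assumptions.

    #include <stddef.h>
    #include <stdint.h>

    /* Gather address for one effective element: second source address plus
     * the un-pruned element index (scaled by element size for byte
     * addressing; the scaling is an assumption). */
    uintptr_t effective_element_addr(uintptr_t second_src, size_t index,
                                     size_t elem_size)
    {
        return second_src + index * elem_size;
    }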
Optionally, the instruction cache unit is further configured to receive an un-pruned element value load instruction, where the un-pruned element value load instruction acquires, in the multiplication of the sparse matrix with the vector to be multiplied, the un-pruned element values of the sparse matrix; the un-pruned element value load instruction includes a third source address, where the third source address represents the starting storage address of the un-pruned element values; and the arithmetic unit further includes: an un-pruned element value reading unit, configured to read the un-pruned element values according to the third source address.
Optionally, the instruction cache unit is further configured to receive a vector multiplication instruction, where the vector multiplication instruction obtains, based on the un-pruned element values and the effective element vector, the multiplication result vector of the sparse matrix and the vector to be multiplied; and the arithmetic unit further includes: a multiplication result vector acquisition unit, configured to obtain the multiplication result vector by multiplying the un-pruned element values with the effective elements of the effective element vector.
Optionally, the first source address comprises a first base source address and the second source address comprises a second base source address.
Optionally, the first source address includes a first base source address and a first immediate, and the second source address includes a second base source address and a second immediate; the un-pruned element index reading unit reads the un-pruned element indexes according to the address obtained as the sum of the first base source address and the first immediate; and the effective element reading unit reads the effective elements of the vector to be multiplied according to the address obtained as the sum of the second base source address and the second immediate, together with the read un-pruned element indexes.
Optionally, the first source address includes a first base source address and a first coefficient, and the second source address includes a second base source address and a second coefficient; the un-pruned element index reading unit reads the un-pruned element indexes according to the address obtained as the sum of the first base source address and the product of the first coefficient, the vector length, and the element size; and the effective element reading unit reads the effective elements of the vector to be multiplied according to the address obtained as the sum of the second base source address and the product of the second coefficient, the vector length, and the element size, together with the read un-pruned element indexes.
Optionally, the first source address includes a first base source address and a third coefficient, and the second source address includes a second base source address and a fourth coefficient; the un-pruned element index reading unit reads the un-pruned element indexes according to the address obtained as the sum of the first base source address and the product of the third coefficient and the element size; and the effective element reading unit reads the effective elements of the vector to be multiplied according to the address obtained as the sum of the second base source address and the product of the fourth coefficient and the element size, together with the read un-pruned element indexes.
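The three optional addressing forms above reduce to three effective-address computations. A sketch, assuming vector_len and elem_size are configuration values and with all function names hypothetical:

    #include <stddef.h>
    #include <stdint.h>

    /* base + immediate */
    uintptr_t addr_immediate(uintptr_t base, uintptr_t imm)
    {
        return base + imm;
    }

    /* base + coefficient * vector length * element size */
    uintptr_t addr_vector_scaled(uintptr_t base, size_t coeff,
                                 size_t vector_len, size_t elem_size)
    {
        return base + coeff * vector_len * elem_size;
    }

    /* base + coefficient * element size */
    uintptr_t addr_element_scaled(uintptr_t base, size_t coeff,
                                  size_t elem_size)
    {
        return base + coeff * elem_size;
    }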
Optionally, the elements of the vector to be multiplied are stored in a plurality of memory banks of a tightly coupled memory, where elements in different memory banks can be accessed in parallel; the indexes of the un-pruned elements of the sparse matrix are stored in groups, a single group contains one un-pruned element index extracted from each of a plurality of rows of the sparse matrix, and different indexes in the same group correspond to different memory banks of the tightly coupled memory.
Optionally, the plurality of rows are a number of rows equal to the number of the plurality of memory banks, fetched sequentially from the sparse matrix in row-index order.
Optionally, the un-pruned element index reading unit reads a group of un-pruned element indexes starting from the first source address, and updates the first source address to the sum of the first source address and the length of a group.
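A minimal sketch of that group-wise read and pointer update; NUM_BANKS, the index element type, and the function name are assumptions:

    #include <stddef.h>
    #include <string.h>

    #define NUM_BANKS 4               /* assumed bank count */

    typedef unsigned short index_t;   /* assumed index element type */

    /* Read one group (one index per memory bank) and advance the first
     * source address by the length of a group. */
    void read_index_group(const index_t **first_src, index_t group[NUM_BANKS])
    {
        memcpy(group, *first_src, NUM_BANKS * sizeof(index_t));
        *first_src += NUM_BANKS;
    }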
Optionally, the effective element vector is assembled in a cache and moved back from the cache, according to the destination address, to the processing unit where the arithmetic unit is located.
Optionally, the effective element vector is assembled in the tightly coupled memory and moved back from the tightly coupled memory, according to the destination address, to the processing unit where the arithmetic unit is located.
According to an aspect of the present disclosure, there is provided a processing unit including:
an instruction fetch unit, configured to fetch the index and effective element load instruction from the memory;
an instruction decode unit, configured to decode the fetched index and effective element load instruction;
and the arithmetic unit described above, configured to receive and execute the decoded index and effective element load instruction.
According to an aspect of the present disclosure, there is provided a computing device comprising:
a memory for storing the index and effective element load instruction;
and a processing unit as described above.
According to an aspect of the present disclosure, there is provided a system on a chip comprising a processing unit as described above.
According to an aspect of the present disclosure, there is provided a data center including a computing device as described above.
According to an aspect of the present disclosure, there is provided an effective element vector acquisition method, where the effective element vector is the vector formed by the effective elements, acquired from the vector to be multiplied, that correspond to the un-pruned elements of a sparse matrix in the multiplication of the sparse matrix with the vector to be multiplied, the method including:
receiving an index and effective element load instruction, where the instruction is used to acquire the effective element vector and includes a first source address, a second source address, and a destination address, the first source address representing the starting storage address of the indexes of the un-pruned elements of the sparse matrix, the second source address representing the starting storage address of the vector to be multiplied, and the destination address representing the address to which the effective element vector is to be stored back;
reading the un-pruned element indexes according to the first source address;
reading, according to the second source address and the read un-pruned element indexes, the effective elements of the vector to be multiplied that are to be multiplied with the un-pruned elements of the sparse matrix;
and assembling the effective elements into the effective element vector and storing it according to the destination address.
Optionally, the reading, according to the second source address and the read un-pruned element indexes, of the effective elements of the vector to be multiplied that are to be multiplied with the un-pruned elements of the sparse matrix includes:
adding the second source address to each read un-pruned element index to obtain the storage address of each effective element;
and reading the effective elements according to those storage addresses.
Optionally, the elements of the vector to be multiplied are stored in a plurality of memory banks of a tightly coupled memory, where elements in different memory banks can be accessed in parallel; the indexes of the un-pruned elements of the sparse matrix are stored in groups, a single group contains one un-pruned element index extracted from each of a plurality of rows of the sparse matrix, and different indexes in the same group correspond to different memory banks of the tightly coupled memory.
Optionally, the plurality of rows are a number of rows equal to the number of the plurality of memory banks, fetched sequentially from the sparse matrix in row-index order.
Optionally, the reading of the un-pruned element indexes according to the first source address includes: starting from the first source address, reading a group of un-pruned element indexes and updating the first source address to the sum of the first source address and the length of a group.
In the embodiments of the present disclosure, the last two of the three loads in the multiplication of a sparse matrix with an excitation vector (loading the un-pruned weight indexes of the sparse matrix, and loading from the excitation vector the effective elements corresponding to those indexes) are executed by a single instruction. The instruction includes a first source address, a second source address, and a destination address. The first source address indicates the storage address of the un-pruned element indexes, and the second source address indicates the storage address of the excitation vector. The indexes of the un-pruned weights of the sparse matrix are loaded according to the first source address; the effective elements corresponding to those indexes are loaded from the excitation vector according to the second source address and the indexes; and the result is then stored back according to the destination address. With such an instruction, the last two of the three loads are executed simultaneously. Compared with the prior art, in which three instructions execute the three loads, the embodiments of the present disclosure greatly improve the operation efficiency of the multiplication of a sparse matrix with an excitation vector and reduce the processing overhead.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof with reference to the following drawings in which:
FIG. 1 is a block diagram of a data center to which one embodiment of the present disclosure is applied;
FIG. 2 is an internal block diagram of one server in a data center of one embodiment of the present disclosure;
FIG. 3 is an internal block diagram of an arithmetic unit as an instruction execution unit according to one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a Tightly Coupled Memory (TCM) storing the elements of the vector to be multiplied across memory banks according to one embodiment of the disclosure;
FIG. 5A is a schematic diagram of loading effective elements from the vector to be multiplied according to the indexes of the un-pruned elements of a sparse matrix, according to one embodiment of the present disclosure;
FIG. 5B is a schematic diagram of loading effective elements from the vector to be multiplied according to the indexes of the un-pruned elements of a sparse matrix, according to another embodiment of the present disclosure;
FIG. 6A is a schematic diagram of un-pruned element values in a sparse matrix acquired in accordance with an un-pruned element value load instruction in one embodiment of the present disclosure;
FIG. 6B is a schematic diagram of a group of un-pruned element indexes read according to the first source address of an index and effective element load instruction in one embodiment of the present disclosure;
FIG. 6C is a schematic diagram of loading effective elements from the vector to be multiplied according to the second source address of the index and effective element load instruction and the indexes obtained in FIG. 6B in one embodiment of the present disclosure;
FIG. 6D is a schematic diagram of an effective element vector obtained in one embodiment of the present disclosure;
FIG. 6E is a schematic diagram of the un-pruned element values corresponding to the un-pruned element indexes of FIG. 6B in one embodiment of the disclosure;
FIG. 6F is a schematic diagram of the result of multiplying the vector composed of the un-pruned element values of FIG. 6E with the effective element vector of FIG. 6D in one embodiment of the present disclosure;
FIG. 7 is a listing of various addressing modes and corresponding operand cases for embodiments of the present disclosure;
FIG. 8 is a histogram comparing an embodiment of the present disclosure with the prior art in terms of occupied clock cycles;
fig. 9 is a flow chart of an effective element vector acquisition method according to one embodiment of the present disclosure.
Detailed Description
The present disclosure is described below based on embodiments, but it is not limited to these embodiments. In the following detailed description, certain specific details are set forth; however, those skilled in the art can fully understand the present disclosure even without a description of these details. Well-known methods, procedures, and flows are not described in detail so as not to obscure the substance of the disclosure. The figures are not necessarily drawn to scale.
The following terms are used herein.
Deep Neural Network (DNN): deep neural networks are a research direction in the field of machine learning that moves machine learning closer to its original goal, Artificial Intelligence (AI). A deep neural network learns the inherent regularities and representation hierarchy of sample data, and the information obtained during learning greatly helps the interpretation of data such as text, images, and sound. Its ultimate goal is to give machines human-like analytical learning ability, able to recognize text, image, and sound data.
Weight: a deep neural network has multiple layers of nodes. Each node is connected to nodes of the previous layer, receives their outputs as its inputs, and processes these inputs to produce outputs to the nodes of the next layer. The internal processing can be seen as multiplying the outputs of the previous-layer nodes by values called weights (possibly plus an offset) and passing the results to the next layer. Since the inputs and outputs are not single values but take the form of vectors or matrices (the outputs of the previous-layer nodes received at a single point in time can be represented as a vector; the outputs at different points in time differ and can thus be represented as a matrix), the weights applied to the different previous-layer nodes at different points in time are likewise arranged in the form of a matrix, i.e., a weight matrix.
Excitation: once the weights of all the nodes of the deep neural network have been trained, the actual inputs of the nodes are called excitations. The inputs from multiple previous-layer nodes received at the same point in time form an excitation vector, and the inputs received at different points in time form an excitation matrix.
Sparse matrix: the elements of a matrix are examined, and any element with little influence on the overall result is changed to 0, i.e., pruned, so that when the matrix is multiplied with another vector or matrix, the 0 elements need not actually participate in the multiplication, greatly reducing computation intensity and memory overhead. In practice, the weight matrix is often pruned; that is, weights with little influence on the operation result of the deep neural network are set to 0 in the weight matrix. Then, when the weight matrix is multiplied with an excitation vector or excitation matrix as described above, the 0 weights are not multiplied, which reduces computation intensity and memory overhead.
Pruning: the sparsification process, in which matrix elements judged unimportant are changed to 0.
Sparseness: the ratio of the elements of the matrix that are not pruned to all elements of the matrix after pruning, e.g., 10% sparsity indicates that 90% of the elements in the matrix have become 0 and only 10% of the elements are non-0.
Effective elements: since the 0 elements of the sparse matrix do not participate in the multiplication of the sparse matrix with the vector to be multiplied, the elements of the vector to be multiplied that correspond to the indexes of 0 elements do not participate in the operation either and are regarded as invalid. Only the elements of the vector to be multiplied that correspond to the indexes of non-0 elements of the sparse matrix participate in the operation; these are called effective elements.
Element index: the number of an element. An element is uniquely identified by its index, which generally runs 0, 1, 2, and so on.
Base address: the address on which addressing is based. Addressing generally uses the base address directly or, when there is an offset, the address obtained by adding the offset to the base address.
Vector length: the number of elements contained in a vector, generally configured automatically by the system. Once configured, the number of elements in the excitation vector and in each row vector of the sparse matrix is the vector length.
Element size: typically the number of bytes occupied by an element. For example, if each element of a vector is represented by 2 bytes, the element size is 2 bytes.
Tightly Coupled Memory (TCM): a fixed-size Random Access Memory (RAM) tightly coupled to the processor core, providing performance comparable to a cache. Its advantage over a cache is that program code can control exactly which functions or data are placed in it, so its performance is determined by the user in advance rather than statistically, as with a cache.
Bank (bank): a portion of the memory space in the tightly coupled memory. A tightly coupled memory is divided into a plurality of memory banks, and different memory banks can be accessed in parallel, thereby improving access efficiency.
Processing unit: a unit that performs conventional processing (processing other than complex operations such as the image processing and fully connected operations of various deep neural networks). The processing unit may take a variety of forms, such as a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), etc.
System-on-a-chip (SoC): a complete system integrated on a single chip, packaging all or part of the electronic circuitry necessary to perform one or more functions. A so-called complete system typically includes a Central Processing Unit (CPU), memory, peripheral circuits, and the like.
Application environment of the present disclosure
The embodiments of the disclosure provide a method for acquiring, from the vector to be multiplied, the effective element vector corresponding to the un-pruned elements of a sparse matrix in the multiplication of the sparse matrix with the vector to be multiplied. The sparse matrix here may be a weight matrix, and the vector to be multiplied may be an excitation vector. The scheme is fairly general: it is applicable, for example, to a data center, to a general-purpose computer or embedded device that needs to multiply a sparse matrix with a vector, or to an Internet of Things (IoT) device, and it is independent of the hardware environment in which it is ultimately deployed. For exemplary purposes, the description below mainly uses the data center as the application scenario; those skilled in the art will appreciate that the disclosed embodiments are also applicable to other application scenarios.
Data center
Data centers are globally coordinated networks of specific devices used to transmit, accelerate, display, compute, and store data information over the Internet network infrastructure. In future development, data centers will also be an asset that enterprises compete over. With the widespread use of data centers, artificial intelligence and the like are increasingly applied within them. Deep learning, an important technology of artificial intelligence, has been widely applied to big data analysis in data centers. Deep learning involves the multiplication of a large number of weight matrices with excitation vectors, and the weight matrices are sparsified to improve operation efficiency and reduce overhead.
In a conventional large data center, the network architecture is generally as shown in fig. 1, i.e., a hierarchical inter-networking model. This model contains the following parts:
server 140: each server 140 is a processing and storage entity of a data center in which the processing and storage of large amounts of data is accomplished by these servers 140.
Access switch 130: access switch 130 is a switch used to allow server 140 access to a data center. An access switch 130 accesses a plurality of servers 140. The access switches 130 are typically located at the Top of the Rack, so they are also referred to as Top of Rack switches, which physically connect to the servers.
Aggregation switch 120: each aggregation switch 120 connects multiple access switches 130 while providing other services such as firewall, intrusion detection, network analysis, etc.
Core switch 110: core switch 110 provides high speed forwarding of packets into and out of the data center and connectivity for aggregation switch 120. The network of the entire data center is divided into an L3 layer routing network and an L2 layer routing network, and the core switch 110 provides a flexible L3 layer routing network for the network of the entire data center in general.
Typically, the aggregation switch 120 is the demarcation point between the L2 and L3 routing networks: below it is the L2 network, above it the L3 network. Each group of aggregation switches manages one Point of Delivery (POD), each of which is a separate VLAN network. Server migration within a POD does not require modifying IP addresses or default gateways, because one POD corresponds to one L2 broadcast domain.
The Spanning Tree Protocol (STP) is typically used between the aggregation switches 120 and the access switches 130. With STP, only one aggregation switch 120 is available for a given VLAN network; the other aggregation switches 120 are used only when a failure occurs. That is, the aggregation layer does not scale horizontally, since even if multiple aggregation switches 120 are added, only one is working at a time.
Server device
Fig. 2 shows a schematic block diagram of the server 140 of fig. 1. As shown in FIG. 2, the server 140 of embodiments of the present invention may include one or more processors 12, memory 14, L1 cache 16, L2 cache 15, TCM 17.
The memory 14 may be a main memory (referred to as main memory or simply memory) for storing instruction information and/or data information represented by data signals, such as data provided by the processor 12 (e.g., as a result of an operation), or may be used to implement data exchange between the processor 12 and an external storage device (referred to as auxiliary memory or external memory).
In some cases, the processor 12 may need to access the memory 14 to retrieve data from it or to modify data in it. Because of the slower access speed of the memory 14, the computer system 10 also includes caches coupled to the bus 18, i.e., the L1 cache 16 and the L2 cache 15, to mitigate the speed difference between the processor 12 and the memory 14. The caches store program data or message data from the memory 14 that may be called repeatedly. A cache may be a storage device such as Static Random Access Memory (SRAM). The cache of FIG. 2 is divided into two levels, the L1 cache 16 and the L2 cache 15, but in practice there may be three or more cache levels or other types of cache structures. While FIG. 2 shows the L1 cache 16 and the L2 cache 15 as external to the processor 12, they may in fact be internal to it.
Tightly Coupled Memory (TCM) 17 is tightly coupled to the processor core, providing a parallel access space. The Tightly Coupled Memory (TCM) 17 has a plurality of banks, and different banks can be accessed in parallel, thereby improving access efficiency. The L1 cache 16 and the L2 cache 15 cannot be accessed in parallel. Thus, when data requiring parallel access is stored, it can be put into the TCM 17. The excitation vectors in the embodiments of the present disclosure need to be accessed in parallel to improve the efficiency of multiplication with sparse matrices and are therefore put into the TCM 17.
The information interaction between memory 14 and caches (L1 cache 16 and L2 cache 15) is typically organized in blocks. In some embodiments, the cache and memory 14 may be divided into data blocks in the same spatial size, and the data blocks may be the smallest unit of data exchange (including one or more data of a preset length) between the cache and memory 14. For simplicity and clarity of description, each data block in the cache is simply referred to as a cache block, and different cache blocks have different cache block addresses; each data block in the memory 14 is simply referred to as a memory block, and different memory blocks have different memory block addresses. The cache block address includes, for example, a physical address tag for locating the data block.
Due to space and resource constraints, the cache cannot cache the entire contents of the memory 14, i.e., the storage capacity of the cache is typically less than that of the memory 14, and each cache block address provided by the cache cannot correspond to the entire memory block address provided by the memory 14. When the processor 12 needs to access the memory, it first accesses the cache via the bus 18 to determine whether the content to be accessed has been stored in the cache, and if so, the cache hits, at which point the processor 12 directly calls the content to be accessed from the cache. If the content that the processor 12 needs to access is not in the cache, the processor 12 needs to access the memory 14 via the bus 11 to look up the corresponding information in the memory 14. Because the cache access rate is very fast, the efficiency of the processor 12 may be significantly improved when the cache hits, thereby also improving the performance and efficiency of the overall server 140.
Processor and method for controlling the same
As shown in FIG. 2, each processor 12 may include one or more processor cores 120 for processing instructions, and the processing and execution of instructions may be controlled by a user (e.g., via an application program) and/or a system platform. In some embodiments, each processor core 120 may be configured to process a particular instruction set. In some embodiments, the instruction set may support Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or Very Long Instruction Word (VLIW) computing. Different processor cores 120 may each process different or identical instruction sets. In some embodiments, the processor core 120 may also include other processing modules, such as a digital signal processor (DSP). As an example, fig. 2 shows processor cores 1 through m, where m is a non-zero natural number.
In some embodiments, as shown in FIG. 2, the processor 12 may include a register file 126, which may include a plurality of registers for storing different types of data and/or instructions. For example, the register file 126 may include integer registers, floating point registers, status registers, instruction registers, pointer registers, and the like. The registers in the register file 126 may be implemented as general purpose registers or designed specifically according to the actual needs of the processor 12.
The processor 12 may include a memory management unit (MMU) 122 to translate virtual addresses into physical addresses. Some page table entries are cached in the memory management unit 122, which can also fetch uncached entries from memory. One or more memory management units 122 may be disposed in each processor core 120, and the memory management units 122 in different processor cores 120 may be synchronized with those in other processors or processor cores, so that each processor or processor core can share a unified virtual storage system.
The processor 12 is configured to execute sequences of instructions (i.e., programs). The process by which processor 12 executes each instruction includes: fetching the instruction from the memory storing the instruction, decoding the fetched instruction, executing the decoded instruction, saving the instruction execution result, and the like, and circulating until all instructions in the instruction sequence are executed or a shutdown instruction is encountered.
To achieve the above, the processor 12 may include an instruction fetch unit 124, an instruction decode unit 125, an instruction issue unit (not shown), an instruction execution unit 121, an instruction retirement unit (not shown), and the like.
The instruction fetch unit 124 acts as the boot engine of the processor 12: it moves instructions from the memory 14 into an instruction register (which may be one of the registers of the register file 126 shown in FIG. 2 used for storing instructions) and receives or computes the next instruction fetch address according to an instruction fetch algorithm, for example incrementing or decrementing the address by the instruction length.
After fetching an instruction, processor 12 enters an instruction decode stage where instruction decode unit 125 decodes the fetched instruction in accordance with a predetermined instruction format to obtain operand fetch information required by the fetched instruction to prepare for operation of instruction execution unit 121. Operand fetch information refers, for example, to an immediate, registers, or other software/hardware capable of providing source operands.
The instruction issue unit is typically present in the high performance processor 12 between the instruction decode unit 125 and the instruction execution unit 121 for scheduling and control of instructions to efficiently distribute individual instructions to the different instruction execution units 121, enabling parallel operation of multiple instructions. After an instruction is fetched, decoded, and dispatched to the corresponding instruction execution unit 121, the corresponding instruction execution unit 121 begins executing the instruction, i.e., performing the operation indicated by the instruction, and performing the corresponding function.
The instruction retirement unit (or referred to as an instruction writeback unit) is mainly responsible for writing the execution results generated by the instruction execution unit 121 back into a corresponding memory location (e.g., a register within the processor 12) so that subsequent instructions can quickly obtain corresponding execution results from the memory location.
For different classes of instructions, different instruction execution units 121 may be provided in the processor 12. The instruction execution unit 121 may be an operation unit (for example, an arithmetic logic unit or a vector operation unit, which performs operations on operands and outputs the results; this is the unit that carries out the multiplication of a sparse matrix with a vector to be multiplied according to the embodiments of the present disclosure), a memory execution unit (for example, for accessing memory according to an instruction to read data from memory or to write specified data into memory), a coprocessor, etc. In the processor 12, the individual instruction execution units 121 may run in parallel and output corresponding execution results.
The instruction execution unit 121, when executing some type of instruction (e.g., a memory access instruction), needs to access the memory 14 to retrieve information stored in the memory 14 or to provide data that needs to be written into the memory 14.
The instruction execution unit 121 for executing memory access instructions may also simply be referred to as a memory execution unit, such as a load store unit (LSU) and/or another unit used for memory access.
After the access instruction is fetched by the instruction fetch unit 124, the instruction decode unit 125 may decode the access instruction so that the source operands of the access instruction may be fetched. The decoded memory access instruction is provided to the corresponding instruction execution unit 121, and the instruction execution unit 121 may perform a corresponding operation on a source operand of the memory access instruction (e.g., an operation on a source operand stored in a register by an arithmetic logic unit) to obtain address information corresponding to the memory access instruction, and initiate a corresponding request, such as an address translation request, a write access request, etc., according to the address information.
The source operands of memory access instructions typically include address operands, which the instruction execution unit 121 operates on to obtain the virtual or physical address corresponding to the memory access instruction. When the memory management unit 122 is disabled, the instruction execution unit 121 can obtain the physical address of the memory access instruction directly through a logical operation. When the memory management unit 122 is enabled, the corresponding instruction execution unit 121 initiates an address translation request according to the virtual address corresponding to the memory access instruction, the address translation request including the virtual address corresponding to the address operand of the memory access instruction; the memory management unit 122 responds to the address translation request and translates the virtual address into a physical address based on an entry matching the virtual address, so that the instruction execution unit 121 can access the cache and/or the memory 14 based on the translated physical address.
Depending on the functionality, the memory access instructions may include load instructions and store instructions. The execution of the load instruction typically does not require modification of information in memory 14 or the cache, and instruction execution unit 121 need only read data stored in memory 14, the cache, or an external storage device based on the address operand of the load instruction.
Unlike load instructions, the source operands of store instructions include not only address operands but also data information, and the execution of store instructions typically requires modifications to the memory 14 and/or cache. The data information of the store instruction may point to write data, where the write data may be the result of execution of an instruction such as an operation instruction, a load instruction, etc., may be data provided by a register or other memory location in the processor 12, or may be an immediate.
General overview of the disclosure
As described above, the multiplication of a weight matrix with an excitation vector is often used in deep neural networks (the multiplication of a weight matrix with an excitation matrix can be decomposed into multiplications of the weight matrix with the individual excitation vectors). Typically, the weight matrix is pruned to obtain a sparse matrix. In the prior art, 3 loads are generally required in the process of multiplying the sparse weight matrix with the excitation vector, corresponding to 3 load instructions.
The first load instruction loads the weight values that were not pruned (the non-0 weight values) from the sparse weight matrix. Pruned elements with value 0 need not participate in the operation, i.e., need not be loaded. Thus, the first load instruction needs to include the starting storage address in the memory 14 of the un-pruned element values. The instruction fetch unit 124 fetches the instruction from the memory 14, the instruction decode unit 125 decodes it, and the instruction execution unit 121, which is the arithmetic unit according to the embodiments of the present disclosure, executes it. The instruction execution unit 121 loads the un-pruned weight values of the sparse weight matrix from the memory 14 according to that address and places them in a cache (e.g., the L2 cache 15).
The second load instruction loads the indexes of the weights that were not pruned (the indexes of the non-0 weights) from the sparse weight matrix. These indexes are the indexes of the effective elements of the excitation vector that need to participate in the operation. Elements at other index positions of the excitation vector would be multiplied by a weight of 0 and are therefore meaningless. The second load instruction needs to contain the starting storage address in the memory 14 of the un-pruned weight indexes of the sparse weight matrix. The instruction fetch unit 124 fetches the instruction from the memory 14, the instruction decode unit 125 decodes it, and it is sent to the instruction execution unit 121 for execution. The instruction execution unit 121 loads the un-pruned weight indexes of the sparse weight matrix from the memory 14 according to that address and places them in a cache (e.g., the L2 cache 15).
The third load instruction loads the elements at those indexes from the excitation vector according to the un-pruned weight indexes in the cache (e.g., the L2 cache 15); the loaded elements are the elements of the excitation vector to be multiplied with the weights loaded by the first load instruction. The third load instruction needs to contain the starting storage address of the excitation vector and the address of a register in the processor 12 (e.g., one of the registers of the register file 126) to which the elements loaded from the excitation vector are to be saved back. The instruction fetch unit 124 fetches the instruction from the memory 14, the instruction decode unit 125 decodes it, and it is sent to the instruction execution unit 121 for execution. The instruction execution unit 121 adds the starting storage address of the excitation vector to each un-pruned weight index to obtain the addresses of the effective elements of the excitation vector, loads the effective elements from those addresses, and saves them back to the register address in the processor 12 specified in the instruction.
Then, the instruction execution unit 121 multiplies and accumulates the un-pruned weight values obtained by the first load instruction with the effective elements obtained by the third load instruction to obtain the multiplication result of the sparse weight matrix and the excitation vector.
Because these 3 loads require 3 different instructions, frequent instruction fetching and decoding are needed, processing overhead is high, and speed is low. Modern processing units, in particular low-power processing units, typically have a limited instruction fetch width and do not have enough cache space for these loads. In the disclosed embodiments, the last two load instructions are replaced by a single index and effective element load instruction. The following is one example of this instruction:
LoadGatherFused pg, Destaddr, Addr1, Addr2
Addr1 is the first source address, indicating the storage address of the indexes of the un-pruned elements (for example, the weights of a weight matrix); Addr2 is the second source address, indicating the storage address of the vector to be multiplied (for example, the excitation vector); Destaddr is the destination address, indicating the address to which the vector of effective elements loaded from the vector to be multiplied (for example, the excitation vector) is copied back. pg is a control parameter that indicates how many elements of the vector employed by embodiments of the present disclosure are valid. For example, the excitation vector may have 24 elements of which only the first 16 are valid and represent excitation values; the last 8 may be invalid, represent no excitation value, and be set to 0, etc. The administrator can control the number of valid elements through pg.
The instruction fetch unit 124 fetches this instruction from the memory 14, the instruction decode unit 125 decodes it, and it is sent to the instruction execution unit 121 for execution. The instruction execution unit 121 loads the indexes of the un-pruned elements of the sparse matrix from the memory 14 according to the first source address Addr1, loads the effective elements corresponding to those indexes from the vector to be multiplied according to the addresses obtained by adding the indexes to the second source address Addr2, and then stores the resulting vector of effective elements back according to the destination address Destaddr. With such an instruction, the last two of the three loads described above are executed simultaneously by one instruction. Compared with the prior art, which uses three instructions to perform the loads, the disclosed embodiments use two instructions, improving the operation efficiency of the multiplication of a sparse matrix with a vector to be multiplied and reducing the processing overhead.
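As a rough software model of what LoadGatherFused does (a sketch of the semantics only, not the hardware implementation; the element types, the element count n, and the treatment of inactive pg lanes as zeros are assumptions):

    #include <stdbool.h>
    #include <stddef.h>

    /* Software model of: LoadGatherFused pg, Destaddr, Addr1, Addr2
     * Reads n un-pruned element indexes starting at Addr1, gathers the
     * matching effective elements of the vector to be multiplied at Addr2,
     * and stores the assembled effective element vector at Destaddr.
     * Lanes whose pg bit is clear are written as 0 (assumed convention). */
    void load_gather_fused(const bool *pg, float *Destaddr,
                           const size_t *Addr1, const float *Addr2,
                           size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            size_t idx = Addr1[i];          /* load un-pruned element index */
            Destaddr[i] = pg[i] ? Addr2[idx] : 0.0f;
        }
    }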
Detailed implementation of basic embodiments of the present disclosure
As shown in fig. 3, the instruction execution unit 121, the arithmetic unit of one embodiment of the present disclosure, includes an instruction cache unit 210, an un-pruned element index reading unit 220, an effective element reading unit 230, a restore unit 240, an un-pruned element value reading unit 250, and a multiplication result vector acquisition unit 260.
An implementation according to one basic embodiment of the present disclosure is described below in conjunction with fig. 5A.
In the example of fig. 5A, the sparse matrix may be a sparse weight matrix and the vector to be multiplied is the excitation vector. The sparse weight matrix has a plurality of rows and 24 columns; each row has 24 elements, indexed 0, 1, 2, and so on, where the un-pruned elements, i.e., the non-0 elements, are drawn with thick-line boxes and the pruned elements, i.e., the 0 elements, with thin-line boxes. The vector to be multiplied is a vector of 24 elements, also indexed 0, 1, 2, and so on, in one-to-one correspondence with the element indexes within each row of the sparse weight matrix. Multiplying the vector to be multiplied with the row vector of the first row of the sparse weight matrix gives the first element of the multiplication result vector, i.e., the element at index 0. Since the first row of the sparse weight matrix has only 4 non-0 elements, the element at index 0 of the multiplication result vector is obtained by multiplying those 4 elements with the elements at the corresponding index positions of the vector to be multiplied and accumulating the products; multiplying the 0 elements of the first row with the elements at the corresponding index positions would still give 0, with no effect on the result. Similarly, the element at index 1 of the multiplication result vector is obtained by multiplying and accumulating the 4 non-0 elements of the second row of the sparse weight matrix with the elements at the corresponding index positions of the vector to be multiplied, and so on, until the final multiplication result vector is obtained.
In this basic embodiment, the instruction fetch unit 124 first fetches an un-pruned element value load instruction from the memory 14. The un-pruned element value load instruction is an instruction that obtains the un-pruned element values of the sparse matrix. As shown in fig. 5A, in the first row of the sparse matrix, the element values at indexes 0, 2, 4, and 9 are not pruned, i.e., not 0, and the un-pruned element value load instruction obtains these element values. The un-pruned element value load instruction includes a third source address representing the starting storage address of the un-pruned element values. In the memory 14, the element values at indexes 0, 2, 4, and 9 are stored contiguously, so once the third source address, i.e., their starting address in the memory 14, is obtained, they can be loaded sequentially. The instruction fetch unit 124 fetches the un-pruned element value load instruction, the instruction decode unit 125 decodes it, and it is sent to the instruction execution unit 121 for execution. The instruction cache unit 210 of the instruction execution unit 121 receives the un-pruned element value load instruction, and the un-pruned element value reading unit 250 reads the un-pruned element values from the memory 14 according to the third source address in the instruction and places them in a cache, e.g., the L2 cache 15.
The instruction fetch unit 124 then fetches the index and effective element load instruction from the memory 14. The index and effective element load instruction is an instruction that obtains, from the vector to be multiplied, the effective element vector corresponding to the un-pruned elements of the sparse matrix. In this embodiment, the vector to be multiplied is stored in the memory 14. As shown in fig. 5A, since the element values at indexes 0, 2, 4, and 9 of the first row of the sparse matrix are not pruned, the elements at indexes 0, 2, 4, and 9 of the vector to be multiplied are taken out as the effective element vector. As indicated above, the index and effective element load instruction is LoadGatherFused pg, Destaddr, Addr1, Addr2. The first source address Addr1 represents the starting storage address of the indexes of the un-pruned elements of the sparse matrix; for the first row of the sparse matrix of fig. 5A, the un-pruned element indexes are 0, 2, 4, and 9. These indexes are stored contiguously in the memory 14, at a location different from where the un-pruned element values are stored, and the first source address Addr1 represents their starting storage address. The second source address Addr2 represents the starting storage address of the vector to be multiplied in the memory 14. The destination address Destaddr represents the address in the register file 126 of the processor 12 to which the effective element vector is to be stored back.
The instruction fetch unit 124 fetches the index and effective element load instruction, the instruction decoding unit 125 decodes it, and it is sent to the instruction execution unit 121 for execution. The instruction cache unit 210 of the instruction execution unit 121 receives the index and effective element load instruction. The un-pruned element index reading unit 220 reads the un-pruned element indexes, e.g., the un-pruned element indexes 0, 2, 4, 9 of the first row of the sparse matrix of fig. 5A, according to the first source address in the index and effective element load instruction, and places them in a cache (e.g., the L2 cache 15). Then, the effective element reading unit 230 adds the second source address, which gives the address of the first element of the vector to be multiplied, to each read un-pruned element index (e.g., 0, 2, 4, 9 as above), which serves as the offset from that base; each sum is the specific address of an effective element to be loaded, i.e., of an element to be multiplied with an un-pruned element of the sparse matrix. The effective element reading unit 230 then reads the effective elements according to these sum addresses, such as the elements at indexes 0, 2, 4, and 9 of the vector to be multiplied in fig. 5A. The restore unit 240 assembles the effective elements (e.g., the elements at indexes 0, 2, 4, 9 of the vector to be multiplied in the example above) into an effective element vector, which is stored into the register file 126 of the processor 12 according to the destination address.
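A minimal Python sketch of this gather step follows; whether each index is scaled by the element size before being added to the base is an implementation detail not fixed by the text, and this sketch assumes byte addressing with a 2-byte element size (all names here are hypothetical):

    # Model of the effective element gather: address = Addr2 + index * element size.
    ELEM_SIZE = 2  # bytes per element; an assumption consistent with the 16-bit elements used later
    def gather_effective(memory, addr2, indexes):
        # memory is a dict mapping byte address -> element value, standing in for memory 14
        return [memory[addr2 + i * ELEM_SIZE] for i in indexes]

    mem = {0x100 + i * ELEM_SIZE: v
           for i, v in enumerate([2.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 4.0])}
    assert gather_effective(mem, 0x100, [0, 2, 4, 9]) == [2.0, 2.0, 1.0, 4.0]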
Instruction fetch unit 124 then fetches the vector multiplication instruction from the memory 14. The vector multiplication instruction is used to obtain the multiplication result vector of the sparse matrix and the vector to be multiplied, based on the effective element vector obtained by the index and effective element load instruction and the un-pruned element values obtained by the un-pruned element value load instruction. The multiplication result vector acquisition unit 260 executes the vector multiplication instruction. Specifically, the un-pruned element values obtained by the un-pruned element value load instruction are assembled into an un-pruned element value vector, which is multiplied with the effective element vector to obtain one element of the multiplication result vector. For example, assuming that in the above example the element values at indexes 0, 2, 4, and 9 of the first row of the sparse matrix are 0.5, 1, 2, and 3, respectively, and the effective element vector is (2, 2, 1, 4), the first element of the multiplication result vector is 0.5×2 + 1×2 + 2×1 + 3×4 = 1 + 2 + 2 + 12 = 17. Multiplying the un-pruned element value vector of the second row of the sparse matrix with the effective element vector extracted according to the un-pruned element indexes of the second row yields the second element of the multiplication result vector, and so on.
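The worked example can be checked with a short Python sketch (the numbers are those of the example above; the loop is a model, not the instruction itself):

    # Dot product of the un-pruned element value vector and the effective element vector.
    values = [0.5, 1.0, 2.0, 3.0]     # un-pruned values of the first row (indexes 0, 2, 4, 9)
    effective = [2.0, 2.0, 1.0, 4.0]  # effective element vector
    first_result_element = sum(w * v for w, v in zip(values, effective))
    assert first_result_element == 17.0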
Through the above process, the multiplication of the sparse matrix and the vector to be multiplied is completed. Apart from the final vector multiplication instruction, only 2 load instructions are used, compared with the 3 load instructions used in the prior art, which improves operation efficiency and reduces processing overhead.
Embodiments of the present disclosure incorporating the TCM 17
In the embodiment of fig. 5A, when loading the effective elements from the vector to be multiplied according to the un-pruned element indexes of the sparse matrix, only one effective element can be loaded per clock cycle. As in fig. 5A, the element at index 0 of the vector to be multiplied is loaded according to index 0 in cycle 1, the element at index 2 according to index 2 in cycle 2, the element at index 4 according to index 4 in cycle 3, and the element at index 9 according to index 9 in cycle 4. This is because the vector to be multiplied is stored in the memory 14, which does not support simultaneous access to multiple elements of a vector.
A tightly coupled memory (TCM) 17, by contrast, can support simultaneous access to multiple elements of a vector. The TCM 17 is divided into a plurality of memory banks, and different banks can be accessed in parallel, improving access efficiency. Thus, in the embodiment of fig. 5B, the vector to be multiplied is stored in the TCM 17. The elements of the vector to be multiplied may be distributed across a plurality of memory banks 171, such as memory banks 0-3 of fig. 4. In this case, if the elements to be accessed simultaneously reside in different memory banks 171, they can be loaded simultaneously.
In the example of fig. 4, the elements A[0], A[4], A[8], A[12] of the vector to be multiplied, at indexes 0, 4, 8, and 12, are stored in memory bank 0; these indexes share the property that their remainder modulo 4 is 0. The elements A[1], A[5], A[9], A[13], at indexes 1, 5, 9, and 13, are stored in memory bank 1; their remainder modulo 4 is 1. The elements A[2], A[6], A[10], A[14], at indexes 2, 6, 10, and 14, are stored in memory bank 2; their remainder modulo 4 is 2. The elements A[3], A[7], A[11], A[15], at indexes 3, 7, 11, and 15, are stored in memory bank 3; their remainder modulo 4 is 3. It can be seen that in this embodiment, the remainder of an element's index modulo the total number of memory banks 171 is the index of the memory bank 171 in which the element is stored. This distribution of the elements of the vector to be multiplied across the memory banks is only an example; other distributions are also possible.
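The mapping can be sketched as follows (Python; the 4-bank configuration mirrors fig. 4, and the helper name is hypothetical):

    # The bank holding an element is its index modulo the number of banks.
    NUM_BANKS = 4
    def bank_of(index):
        return index % NUM_BANKS

    assert [bank_of(i) for i in (0, 4, 8, 12)] == [0, 0, 0, 0]  # A[0], A[4], A[8], A[12] -> bank 0
    assert [bank_of(i) for i in (1, 5, 9, 13)] == [1, 1, 1, 1]  # A[1], A[5], A[9], A[13] -> bank 1
    assert bank_of(6) == 2 and bank_of(15) == 3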
Correspondingly, the indexes of the un-pruned elements in the sparse matrix may be stored in groups, with the number of indexes in a group less than or equal to the total number of memory banks 171 in the TCM 17, while ensuring that different indexes in the same group correspond to different memory banks 171 of the TCM 17. Group 0 of fig. 5B contains the 4 indexes 0, 2, 5, 7, whose remainders modulo 4 are 0, 2, 1, 3, respectively, corresponding to four different memory banks 171. Thus, when the effective elements are loaded from the vector to be multiplied according to these 4 indexes, the corresponding 4 effective elements are located in 4 different memory banks 171 and can be fetched from the 4 memory banks 171 in parallel to form an effective element vector. Similarly, group 1 contains the 4 indexes 2, 3, 0, 5, whose remainders modulo 4 are 2, 3, 0, 1, respectively, corresponding to four different memory banks 171, and so on.
In addition, a single group may contain the index of one un-pruned element taken from each of a plurality of rows of the sparse matrix, the number of rows being equal to the total number of memory banks 171 in the TCM 17. For example, in the above example the total number of memory banks 171 in the TCM 17 is 4, so the index of one un-pruned element can be taken from each of 4 rows of the sparse matrix, taking care that these indexes correspond to different memory banks 171 of the TCM 17. In fig. 5B, index 0 is taken from the first row of the sparse matrix, index 2 from the second row, index 5 from the third row, and index 7 from the fourth row; their remainders modulo 4 all differ, so the indexes correspond to four different memory banks 171, and so on.
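This conflict-free property of a group can be expressed as a small check (a Python sketch; the group contents are those of fig. 5B, and the helper is hypothetical):

    # A group can be gathered in one cycle only if its indexes map to distinct banks.
    def conflict_free(group, num_banks=4):
        banks = [i % num_banks for i in group]
        return len(set(banks)) == len(banks)

    assert conflict_free([0, 2, 5, 7])      # group 0: banks 0, 2, 1, 3
    assert conflict_free([2, 3, 0, 5])      # group 1: banks 2, 3, 0, 1
    assert not conflict_free([0, 4, 2, 7])  # 0 and 4 would collide in bank 0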
Thus, as shown in fig. 5B, when the indexes of the un-pruned elements are loaded from the sparse matrix, it is no longer the un-pruned element indexes of one row that are loaded, but a group of indexes formed as described above. After the effective elements are loaded from the vector to be multiplied according to the group of indexes, the un-pruned element values corresponding to the indexes are multiplied with the corresponding effective elements, and each resulting product is accumulated into the element of the multiplication result vector corresponding to that index's row. As this operation repeats, the elements of the multiplication result vector keep accumulating the newly generated products. The product of an un-pruned element taken from the first row of the sparse matrix and the corresponding effective element is accumulated into the first element of the multiplication result vector; the product of an un-pruned element taken from the second row of the sparse matrix and the corresponding effective element is accumulated into the second element of the multiplication result vector; and so on.
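One step of the grouped scheme can be modeled as follows (a Python sketch under the assumption that the k-th index of a group comes from row base_row + k; the function and argument names are illustrative):

    # Process one group: the k-th index in the group comes from row base_row + k,
    # so its product lands in the corresponding element of the result vector.
    def accumulate_group(result, base_row, values, effective):
        for k, (w, v) in enumerate(zip(values, effective)):
            result[base_row + k] += w * v
        return result

    out = accumulate_group([0.0] * 4, 0, [2.0, 3.0, 1.0, 4.0], [1.0, 1.0, 2.0, 0.5])
    assert out == [2.0, 3.0, 2.0, 2.0]  # one product per result element

With group 0 of fig. 5B, base_row would be 0, so the four products update the first four elements of the multiplication result vector.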
Specific implementations of the above embodiments are described below in conjunction with figs. 6A-6F.
In the embodiment of fig. 5B, first, the instruction fetch unit 124 fetches an un-pruned element value load instruction from the memory 14. The un-pruned element value load instruction is an instruction for obtaining the values of the un-pruned elements in the sparse matrix. It includes a third source address representing the starting storage address of the un-pruned element values in the memory 14. In fig. 5B, since index 0 of the first row, index 2 of the second row, index 5 of the third row, and index 7 of the fourth row of the sparse matrix together form group 0, the element value at index 0 of the first row, the element value at index 2 of the second row, the element value at index 5 of the third row, and the element value at index 7 of the fourth row are also stored together consecutively; similarly, the element values at index 2 of the first row, index 3 of the second row, index 0 of the third row, and index 5 of the fourth row, corresponding to group 1, are stored together consecutively, and so on. In the above example, the third source address represents the address in the memory 14 of the element value at index 0 of the first row.
The instruction fetch unit 124 fetches the un-pruned element value load instruction, the instruction decoding unit 125 decodes it, and it is sent to the instruction execution unit 121 for execution. The instruction cache unit 210 of the instruction execution unit 121 receives the un-pruned element value load instruction. According to the third source address in the instruction, the un-pruned element value reading unit 220 sequentially fetches, by row index, the un-pruned element values of a plurality of rows, the number of rows being equal to the number of memory banks in the TCM 17. In the embodiment of fig. 5B, the number of banks in the TCM 17 is 4, so the un-pruned element values of 4 rows can be taken at a time: the first fetch takes the un-pruned element values of rows 0-3, the second fetch those of rows 4-7, and so on. Fig. 6A shows the un-pruned element values taken from the first 4 rows, where boxes containing values represent un-pruned elements and blank boxes represent pruned elements, i.e., 0 elements. In practice, they are not stored in the memory 14 row by row, but in the order of the index groups described above: the element value at index 0 of the first row, the element value at index 2 of the second row, the element value at index 5 of the third row, and the element value at index 7 of the fourth row are stored first, followed by the element value at index 2 of the first row, the element value at index 3 of the second row, the element value at index 0 of the third row, the element value at index 5 of the fourth row, and so on. The un-pruned element value reading unit 220 determines the first un-pruned element value to read from the third source address and then reads the element values sequentially, group by group, in this order.
Instruction fetch unit 124 then fetches the index and effective element load instruction from the memory 14. The index and effective element load instruction is the instruction for acquiring, from the vector to be multiplied, the effective element vector corresponding to the un-pruned elements in the sparse matrix, that is, LoadGatherFused Pg, Destaddr, Addr1, Addr2 described above. The first source address Addr1 represents the starting storage address of the indexes of the un-pruned elements in the sparse matrix. The un-pruned element indexes are stored in groups as shown in fig. 6B. Group 0 includes index 0 of the first row, index 2 of the second row, index 5 of the third row, and index 7 of the fourth row. Group 1 includes index 2 of the first row, index 3 of the second row, index 0 of the third row, index 5 of the fourth row, and so on. The first source address is the address of the first index of group 0. Since the indexes of the groups are stored sequentially, the subsequent indexes can be acquired in order from the first source address Addr1. The second source address Addr2 represents the starting storage address of the vector to be multiplied, which in this embodiment is stored in the TCM 17. The destination address Destaddr represents the address in the register file 126 of the processor 12 where the effective element vector is to be stored back.
The instruction fetch unit 124 fetches the index and effective element load instruction, the instruction decoding unit 125 decodes it, and it is sent to the instruction execution unit 121 for execution. The instruction cache unit 210 of the instruction execution unit 121 receives the index and effective element load instruction. The un-pruned element index reading unit 220 reads the un-pruned element indexes of one group, starting from the first source address in the instruction, and updates the first source address with the sum of the first source address plus the length of one group. Thus, the next time the index and effective element load instruction is executed, the un-pruned element indexes of the next group are read, until all groups have been read. For example, in fig. 6B, the indexes of group 0 are read first and placed in a cache (e.g., the L2 cache 15).
Then, the effective element reading unit 230 adds the second source address to each read un-pruned element index (e.g., indexes 0, 2, 5, 7 of group 0) to obtain the specific address of each effective element to be loaded, i.e., of each element to be multiplied with an un-pruned element of the sparse matrix. As shown in fig. 6C, the effective element reading unit 230 then reads the effective elements according to these sum addresses, such as the elements 1.5, 0.4, 0.8, 3.0 at indexes 0, 2, 5, 7 of the vector to be multiplied in fig. 6C.
The restore unit 240 assembles the effective elements (e.g., the elements at indexes 0, 2, 5, 7 of the vector to be multiplied in the example above) into an effective element vector, such as the vector of fig. 6D, which is stored into the register file 126 of the processor 12 according to the destination address. The restore unit 240 may assemble the effective elements into the vector in a cache (e.g., the L2 cache 15) and move it from the cache back, according to the destination address, to the register file of the processor 12 in which the instruction execution unit 121 is located. The restore unit 240 may also assemble the effective elements into the vector in the TCM 17 and move it back from the TCM 17 according to the destination address.
Instruction fetch unit 124 then fetches the vector multiplication instruction from the memory 14. The vector multiplication instruction is used for obtaining the multiplication result vector of the sparse matrix and the vector to be multiplied based on the un-pruned element values obtained by the un-pruned element value load instruction and the effective element vector obtained by the index and effective element load instruction. The multiplication result vector acquisition unit 260 executes the vector multiplication instruction. Specifically, each un-pruned element value obtained by the un-pruned element value load instruction is multiplied with the effective element corresponding to the same un-pruned element index obtained by the index and effective element load instruction, and the resulting product is accumulated into the element of the multiplication result vector corresponding to the sparse matrix row from which the un-pruned element came. For example, the product of an un-pruned element taken from the first row of the sparse matrix and the corresponding effective element is accumulated into the first element of the multiplication result vector; the product of an un-pruned element taken from the second row and the corresponding effective element is accumulated into the second element; and so on. Fig. 6E shows the un-pruned element values 0.5, 0.3, 1.6, 1.9 of index group 0 obtained by the un-pruned element value load instruction. The first element 0.5 of fig. 6E multiplied by the first element 1.5 of fig. 6D yields 0.75, which is added to the first element of the multiplication result vector of fig. 6F (since the first element 0.5 of fig. 6E is taken from the first row of the sparse matrix). The second element 0.3 of fig. 6E multiplied by the second element 0.4 of fig. 6D yields 0.12, which is added to the second element of the multiplication result vector of fig. 6F (since the second element 0.3 of fig. 6E is taken from the second row of the sparse matrix), and so on.
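The arithmetic of this walk-through can be verified with a short Python sketch (the values are those of figs. 6D and 6E; the loop is a model, not the vector multiplication instruction):

    # Group 0: one un-pruned value per row times the gathered effective elements.
    values = [0.5, 0.3, 1.6, 1.9]     # fig. 6E, rows 0-3
    effective = [1.5, 0.4, 0.8, 3.0]  # fig. 6D
    result = [0.0, 0.0, 0.0, 0.0]     # first four elements of the multiplication result vector
    for row, (w, v) in enumerate(zip(values, effective)):
        result[row] += w * v
    assert abs(result[0] - 0.75) < 1e-9 and abs(result[1] - 0.12) < 1e-9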
Through the above process, the multiplication of the sparse matrix and the vector to be multiplied is completed. Apart from the final vector multiplication instruction, two load instructions are used, compared with three load instructions in the prior art, which improves operation efficiency and reduces processing overhead. Meanwhile, since the vector to be multiplied is stored across the plurality of memory banks 171 of the TCM 17, and the indexes of the un-pruned elements in the sparse matrix are stored in groups whose indexes correspond to different memory banks of the TCM 17, the effective elements corresponding to the indexes of a group can be loaded from the vector to be multiplied in parallel within one cycle (as in fig. 5B, loading the effective elements according to indexes 0, 2, 5, and 7 can be executed in cycle 1), further improving operation efficiency.
Addressing modes employable by embodiments of the present disclosure
As shown in fig. 7, the index and effective element load instruction LoadGatherFused Pg, Destaddr, Addr1, Addr2 admits several variants.
In one embodiment, the index and effective element load instruction may be written LoadGatherFused Pg, Destaddr, Base1, Base2. That is, the first source address is given by a first base source address Base1, and the second source address by a second base source address Base2. A base address is an address that serves as a reference point: in addressing, a target address may lie at some offset from the base address, so it can be addressed as the base address plus the offset; without an offset, it can be addressed by the base address alone. For LoadGatherFused Pg, Destaddr, Base1, Base2, the un-pruned element index reading unit 220 reads the un-pruned element indexes according to the first base source address Base1, and the effective element reading unit 230 reads the effective elements according to the sum addresses of the second base source address Base2 and the read un-pruned element indexes.
In one embodiment, the index and effective element load instruction may be written LoadGatherFused Pg, Destaddr, Base1, Imm1, Base2, Imm2. That is, the address to be accessed is represented by a base address plus an offset given by an immediate. The first source address is the sum of the first base source address Base1 and the first immediate Imm1, and the second source address is the sum of the second base source address Base2 and the second immediate Imm2. For this form, the un-pruned element index reading unit 220 reads the un-pruned element indexes according to the sum of the first base source address Base1 and the first immediate Imm1, and the effective element reading unit 230 reads the effective elements according to the sum of the second base source address Base2, the second immediate Imm2, and the read un-pruned element indexes.
In one embodiment, the index and effective element load instruction may be written LoadGatherFused Pg, Destaddr, Base1, N1, Base2, N2. It still represents the address to be accessed as a base address plus an offset, except that the offset is determined by multiplying a coefficient by the vector length and the element size. The vector length is the number of elements contained in a vector; it is generally configured automatically by the system, and once configured, the number of elements in the excitation vector and in each row vector of the sparse matrix equals the vector length. In the example of fig. 5A, the vector length is 24 elements. The element size generally refers to the number of bytes an element occupies; in most cases, it is 2 bytes. In this embodiment, the first source address equals Base1 + N1 × Wlength × Elesize, where Base1 is the first base source address, N1 is the first coefficient, Wlength is the vector length, and Elesize is the element size. The second source address equals Base2 + N2 × Wlength × Elesize, where Base2 is the second base source address and N2 is the second coefficient. For this form of the instruction, the un-pruned element index reading unit 220 reads the un-pruned element indexes according to the address Base1 + N1 × Wlength × Elesize, and the effective element reading unit 230 reads the effective elements according to the sum addresses of Base2 + N2 × Wlength × Elesize and the read un-pruned element indexes.
In one embodiment, the index and effective element load instruction may be written LoadGatherFused Pg, Destaddr, Base1, N3, Base2, N4. It also represents the address to be accessed as a base address plus an offset, except that the offset is determined by multiplying a coefficient by the element size. In this embodiment, the first source address equals Base1 + N3 × Elesize, where Base1 is the first base source address, N3 is the third coefficient, and Elesize is the element size. The second source address equals Base2 + N4 × Elesize, where Base2 is the second base source address and N4 is the fourth coefficient. For this form of the instruction, the un-pruned element index reading unit 220 reads the un-pruned element indexes according to the address Base1 + N3 × Elesize, and the effective element reading unit 230 reads the effective elements according to the sum addresses of Base2 + N4 × Elesize and the read un-pruned element indexes.
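The four forms differ only in how the source addresses are computed, which the following Python sketch summarizes for the first source address (the constants mirror the examples in the text and are assumptions; the same formulas apply to the second source address):

    WLENGTH = 24  # vector length in elements, as in fig. 5A
    ELESIZE = 2   # element size in bytes

    def addr_base(base1): return base1                                 # LoadGatherFused Pg, Destaddr, Base1, Base2
    def addr_imm(base1, imm1): return base1 + imm1                     # ..., Base1, Imm1, Base2, Imm2
    def addr_scaled(base1, n1): return base1 + n1 * WLENGTH * ELESIZE  # ..., Base1, N1, Base2, N2
    def addr_element(base1, n3): return base1 + n3 * ELESIZE           # ..., Base1, N3, Base2, N4

    assert addr_scaled(0x1000, 1) == 0x1000 + 48  # one whole row vector ahead
    assert addr_element(0x1000, 4) == 0x1000 + 8  # four elements ahead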
These multiple forms of the index and effective element load instruction improve the flexibility of addressing with the instruction in embodiments of the present disclosure.
Effect assessment data and commercial value
To evaluate the effectiveness of embodiments of the present disclosure, simulations were run on the GEM5 simulator with the ARM SVE SIMD extension, where SIMD refers to single instruction, multiple data: a class of instructions that operate on multiple operands packed into a wide register. The index and effective element load instruction proposed by embodiments of the present disclosure was run in the environment of the GEM5 simulator and the GCC compiler. Embodiments of the present disclosure were configured with the vector length of ARM v8.2a: 8 elements per vector, 2 bytes (16 bits) per element. Two CPU models were used in the experiments: a MINOR model, with low power consumption but relatively poor performance, and an O3 model, with high performance but high power consumption. The TCM has 8 banks. The sparse matrices were, on one hand, synthesized matrices with sparsity 0, 0.25, and 0.5, respectively, and on the other hand, a sparse matrix from real data. In the real data, the probability of a weight being non-zero is 10%, so the sparsity is 0.9.
The final experimental results are shown in fig. 8. Fig. 8 shows 8 groups of histograms, for the CPU adopting the MINOR model and the O3 model, respectively, in the four cases where the sparsity is 0, 0.25, 0.5, and 0.9 (real data). Each group includes 2 bars: the left bar represents the number of clock cycles used in the prior art to compute the product of the sparse matrix and the vector to be multiplied, and the right bar the number of clock cycles used by the embodiments of the present disclosure. It can be seen that the embodiments of the present disclosure use fewer clock cycles, so the operation is faster: 1.16 times (with the O3 model) to 1.39 times (with the MINOR model) the speed of the existing method, and with real data (sparsity = 0.9), 1.2 times (O3 model) to 1.47 times (MINOR model). By compressing three load instructions into two in the multiplication of the sparse matrix and the vector to be multiplied, the instruction fetch and decode pressure in computing the product is greatly reduced, which is particularly suitable for modern low-power processors with limited fetch width. By applying the embodiments of the present disclosure, the market share of such low-power processors is expected to grow by more than 20%, showing a good market prospect.
Deep neural network operation method flow according to embodiments of the present disclosure
As shown in fig. 9, according to one embodiment of the present disclosure, there is also provided an effective element vector acquisition method, including:
Step 910, receiving an index and effective element load instruction, where the index and effective element load instruction is used to obtain the effective element vector and includes a first source address, a second source address, and a destination address, the first source address representing the starting storage address of the indexes of the un-pruned elements in the sparse matrix, the second source address representing the starting storage address of the vector to be multiplied, and the destination address representing the address where the effective element vector is to be stored back;
step 920, reading the un-pruned element indexes according to the first source address;
step 930, reading, in the vector to be multiplied, the effective elements for multiplication with the un-pruned elements in the sparse matrix, according to the second source address and the read un-pruned element indexes;
step 940, assembling the effective elements into an effective element vector and storing it according to the destination address.
Optionally, step 930 includes:
adding the second source address to each read un-pruned element index to obtain the storage address of each effective element;
and reading the effective elements according to the storage addresses.
Optionally, the elements of the vector to be multiplied are stored in a plurality of memory banks in a tightly coupled memory, wherein elements in the plurality of memory banks are allowed to be accessed in parallel; the indexes of the un-pruned elements in the sparse matrix are stored in groups, a single group contains the index of one un-pruned element taken from each of a plurality of rows of the sparse matrix, and different indexes in the same group correspond to different memory banks of the tightly coupled memory.
Optionally, the plurality of rows are rows, equal in number to the memory banks, fetched sequentially from the sparse matrix by row index.
Optionally, step 920 includes: starting from the first source address, reading the un-pruned element indexes of one group and updating the first source address with the sum of the first source address plus the length of one group. A minimal end-to-end sketch of these steps is given below.
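The following Python sketch models steps 920-940 for one group (the dict-based word-addressed memories and every name here are illustrative assumptions, not the hardware implementation):

    # Steps 920-940 of the method, for one group of un-pruned element indexes.
    def load_gather_fused(index_mem, vec_mem, addr1, addr2, group_len):
        indexes = [index_mem[addr1 + k] for k in range(group_len)]  # step 920
        effective = [vec_mem[addr2 + i] for i in indexes]           # step 930
        return effective                                            # step 940: vector to store at Destaddr

    index_mem = {0x10 + k: i for k, i in enumerate([0, 2, 5, 7])}   # group 0 of fig. 5B
    vec_mem = {0x80 + i: v for i, v in enumerate([1.5, 0.0, 0.4, 0.0, 0.0, 0.8, 0.0, 3.0])}
    assert load_gather_fused(index_mem, vec_mem, 0x10, 0x80, 4) == [1.5, 0.4, 0.8, 3.0]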
Since the implementation details of the above process are described in detail in the description of the foregoing apparatus embodiments, they are not repeated.
It should be understood that the embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the method embodiments are described relatively simply, since they are substantially similar to the methods described in the apparatus and system embodiments; for relevant details, refer to the descriptions of the other embodiments.
It should be understood that the foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
It should be understood that elements described herein in the singular or shown in the drawings are not intended to limit the number of elements to one. Furthermore, modules or elements described or illustrated herein as separate may be combined into a single module or element, and modules or elements described or illustrated herein as a single may be split into multiple modules or elements.
It is also to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. The use of these terms and expressions is not meant to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications are possible and are intended to be included within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims should be looked to in order to cover all such equivalents.

Claims (22)

1. An arithmetic unit comprising:
The instruction cache unit is used for receiving an index and effective element loading instruction, wherein, in multiplication of a sparse matrix and a vector to be multiplied, the index and effective element loading instruction is used for acquiring, from the vector to be multiplied, an effective element vector corresponding to the un-pruned elements in the sparse matrix; the index and effective element loading instruction comprises a first source address, a second source address and a destination address, wherein the first source address represents a starting storage address of the indexes of the un-pruned elements in the sparse matrix, the second source address represents a starting storage address of the vector to be multiplied, and the destination address represents an address where the effective element vector is to be stored back;
the un-pruned element index reading unit is used for reading the un-pruned element indexes according to the first source address;
An effective element reading unit, configured to read, in the vector to be multiplied, the effective elements for multiplication with the un-pruned elements in the sparse matrix according to the second source address and the read un-pruned element indexes;
And the restoring unit is used for assembling the effective elements into the effective element vector and storing the effective element vector according to the destination address.
2. The operation unit according to claim 1, wherein the effective element reading unit adds the second source address and each read un-pruned element index to obtain a storage address of each effective element, and reads the effective elements according to the storage addresses.
3. The arithmetic unit of claim 1, wherein the instruction cache unit is further configured to receive an un-pruned element value load instruction, the un-pruned element value load instruction obtaining, in the multiplication with the vector to be multiplied, the values of the un-pruned elements in the sparse matrix, the un-pruned element value load instruction including a third source address representing a starting storage address of the un-pruned element values;
The arithmetic unit further includes:
and the un-pruned element value reading unit is used for reading the un-pruned element values according to the third source address.
4. The arithmetic unit of claim 2, wherein the instruction cache unit is further configured to receive a vector multiplication instruction, the vector multiplication instruction obtaining, in the multiplication of the sparse matrix and the vector to be multiplied, a multiplication result vector of the sparse matrix and the vector to be multiplied based on the un-pruned element values and the effective element vector;
The arithmetic unit further includes:
And the multiplication result vector acquisition unit is used for obtaining the multiplication result vector through multiplication of the un-pruned element values with the effective elements in the effective element vector.
5. The arithmetic unit of claim 1, wherein the first source address comprises a first base source address and the second source address comprises a second base source address.
6. The arithmetic unit of claim 1, wherein the first source address comprises a first base source address and a first immediate, the second source address comprises a second base source address and a second immediate,
The un-pruned element index reading unit reads an un-pruned element index according to an address obtained by the sum of the first base source address and the first immediate;
the effective element reading unit reads the effective elements in the vector to be multiplied according to the address obtained by the sum of the second base source address and the second immediate and the read un-pruned element indexes.
7. The arithmetic unit of claim 1, wherein the first source address comprises a first base source address and a first coefficient, the second source address comprises a second base source address and a second coefficient,
The un-pruned element index reading unit reads the un-pruned element indexes according to the address obtained by summing the first base source address with the product of the first coefficient, the vector length, and the element size of the vector;
The effective element reading unit reads the effective elements in the vector to be multiplied according to the address obtained by summing the second base source address with the product of the second coefficient, the vector length, and the element size of the vector, and the read un-pruned element indexes.
8. The arithmetic unit of claim 1, wherein the first source address comprises a first base source address and a third coefficient, the second source address comprises a second base source address and a fourth coefficient,
The un-pruned element index reading unit reads the un-pruned element indexes according to the address obtained by summing the first base source address with the product of the third coefficient and the element size of the vector;
and the effective element reading unit reads the effective elements in the vector to be multiplied according to the address obtained by summing the second base source address with the product of the fourth coefficient and the element size of the vector, and the read un-pruned element indexes.
9. The arithmetic unit of claim 1, wherein the elements of the vector to be multiplied are stored in a plurality of memory banks in a tightly coupled memory, wherein elements in the plurality of memory banks are allowed to be accessed in parallel;
The indexes of the un-pruned elements in the sparse matrix are stored in groups, a single group contains the index of one un-pruned element taken from each of a plurality of rows of the sparse matrix, and different indexes in the same group correspond to different memory banks of the tightly coupled memory.
10. The arithmetic unit of claim 9, wherein the plurality of rows are rows, equal in number to the plurality of memory banks, taken sequentially out of the sparse matrix by row index.
11. The operation unit according to claim 9, wherein the un-pruned element index reading unit reads an un-pruned element index of a group from the first source address and updates the first source address with a sum of the first source address plus a length of the group.
12. The arithmetic unit of claim 1, wherein the effective element vector is assembled in a cache and moved back from the cache to the processing unit where the arithmetic unit resides according to the destination address.
13. The arithmetic unit of claim 9, wherein the valid element vector is assembled in the tightly coupled memory and is moved back from the tightly coupled memory to the processing unit in which the arithmetic unit resides in accordance with the destination address.
14. A processing unit, comprising:
An instruction fetching unit for fetching the index and effective element loading instruction from the memory;
the instruction decoding unit is used for decoding the fetched index and effective element loading instruction;
The arithmetic unit of any one of claims 1-13, configured to receive and execute the decoded index and effective element loading instruction.
15. A computing device, comprising:
a memory for storing the index and effective element loading instruction;
The processing unit of claim 14.
16. A system on a chip comprising a processing unit according to claim 14.
17. A data center comprising the computing device of claim 15.
18. An effective element vector obtaining method, wherein the effective element vector is a vector formed by effective elements which are obtained, in multiplication of a sparse matrix and a vector to be multiplied, from the vector to be multiplied and correspond to the un-pruned elements in the sparse matrix, the method comprising:
Receiving an index and effective element loading instruction, wherein the index and effective element loading instruction is used for acquiring the effective element vector, the index and effective element loading instruction comprises a first source address, a second source address and a destination address, the first source address represents a starting storage address of the indexes of the un-pruned elements in the sparse matrix, the second source address represents a starting storage address of the vector to be multiplied, and the destination address represents an address where the effective element vector is to be stored back;
Reading the un-pruned element indexes according to the first source address;
reading, in the vector to be multiplied, effective elements for multiplication with the un-pruned elements in the sparse matrix according to the second source address and the read un-pruned element indexes;
And assembling the effective elements into effective element vectors, and storing according to the destination addresses.
19. The method of claim 18, wherein the reading, in the vector to be multiplied, effective elements for multiplication with the un-pruned elements in the sparse matrix according to the second source address and the read un-pruned element indexes comprises:
adding the second source address and each read un-pruned element index to obtain the storage address of each effective element;
And reading the effective element according to the storage address.
20. The method of claim 18, wherein the elements of the vector to be multiplied are stored in a plurality of memory banks in a tightly coupled memory, wherein elements in the plurality of memory banks are allowed to be accessed in parallel;
The indexes of the un-pruned elements in the sparse matrix are stored in groups, a single group contains the index of one un-pruned element taken from each of a plurality of rows of the sparse matrix, and different indexes in the same group correspond to different memory banks of the tightly coupled memory.
21. The method of claim 20, wherein the plurality of rows are rows, equal in number to the plurality of memory banks, fetched sequentially from the sparse matrix according to a row index.
22. The method of claim 20, wherein the reading the un-pruned element indexes according to the first source address comprises: starting from the first source address, reading the un-pruned element indexes of one group and updating the first source address with the sum of the first source address plus the length of one group.
CN202110231342.4A 2021-03-02 2021-03-02 Arithmetic unit, correlation apparatus and method Active CN114996647B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110231342.4A CN114996647B (en) 2021-03-02 2021-03-02 Arithmetic unit, correlation apparatus and method

Publications (2)

Publication Number Publication Date
CN114996647A CN114996647A (en) 2022-09-02
CN114996647B true CN114996647B (en) 2024-10-15

Family

ID=83018542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110231342.4A Active CN114996647B (en) 2021-03-02 2021-03-02 Arithmetic unit, correlation apparatus and method

Country Status (1)

Country Link
CN (1) CN114996647B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI831588B (en) * 2023-01-30 2024-02-01 創鑫智慧股份有限公司 Neural network calculation device and numerical conversion method in neural network calculation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899180A (en) * 2014-03-06 2015-09-09 Arm有限公司 Data processing apparatus and method for performing vector scan operation
CN110073330A (en) * 2016-12-13 2019-07-30 Arm有限公司 Replicate element instruction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3336692B1 (en) * 2016-12-13 2020-04-29 Arm Ltd Replicate partition instruction

Also Published As

Publication number Publication date
CN114996647A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
Kiningham et al. GRIP: A graph neural network accelerator architecture
US12045614B2 (en) Streaming engine with cache-like stream data storage and lifetime tracking
US20230334008A1 (en) Execution engine for executing single assignment programs with affine dependencies
US20230185649A1 (en) Streaming engine with deferred exception reporting
US11994949B2 (en) Streaming engine with error detection, correction and restart
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
US20200228137A1 (en) Methods, systems, articles of manufacture, and apparatus to decode zero-value-compression data vectors
Gong et al. Save: Sparsity-aware vector engine for accelerating dnn training and inference on cpus
Chow et al. A programming example: Large FFT on the Cell Broadband Engine
Lin et al. ASTRO: Synthesizing application-specific reconfigurable hardware traces to exploit memory-level parallelism
US20190057050A1 (en) Pipeline circuit architecture to provide in-memory computation functionality
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN114429214A (en) Arithmetic unit, related device and method
CN114996647B (en) Arithmetic unit, correlation apparatus and method
GB2390700A (en) Narrow/wide Cache
Ham et al. Low-overhead general-purpose near-data processing in cxl memory expanders
Cicek et al. Energy efficient boosting of GEMM accelerators for DNN via reuse
KR20230082621A (en) Highly parallel processing architecture with shallow pipelines
CN113011577B (en) Processing unit, processor core, neural network training machine and method
WO2022199680A1 (en) Data processing device and method, and related product
US20220365751A1 (en) Compressed wallace trees in fma circuits
Lenjani et al. An overflow-free quantized memory hierarchy in general-purpose processors
Vieira et al. A compute cache system for signal processing applications
Dally et al. The message driven processor: An integrated multicomputer processing element
CN113688979B (en) Processing unit, acceleration unit, related device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240306

Address after: # 03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore

GR01 Patent grant