
CN115658146B - AI chip, tensor processing method, and electronic device - Google Patents


Info

Publication number
CN115658146B
CN115658146B
Authority
CN
China
Prior art keywords
addressing
tensor
addressing mode
engine
vector
Prior art date
Legal status
Active
Application number
CN202211597989.XA
Other languages
Chinese (zh)
Other versions
CN115658146A (en)
Inventor
王平
罗前
顾铭秋
Current Assignee
Chengdu Denglin Technology Co ltd
Suzhou Denglin Technology Co ltd
Original Assignee
Chengdu Denglin Technology Co ltd
Shanghai Denglin Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Denglin Technology Co., Ltd. and Shanghai Denglin Technology Co., Ltd.
Priority claimed from application CN202211597989.XA
Publication of CN115658146A
Application granted
Publication of CN115658146B
PCT application filed: PCT/CN2023/096078 (published as WO2024124807A1)
Legal status: Active
Anticipated expiration

Landscapes

  • Executing Machine-Instructions (AREA)

Abstract

This application relates to an AI chip, a tensor processing method, and an electronic device, and belongs to the field of computer technology. The AI chip includes a vector register, an engine unit, and a vector operation unit. The engine unit is configured to obtain, according to a tensor addressing mode, the source operands required by a vector operation instruction from original tensor data. The vector register is configured to store the source operands obtained by the engine unit. The vector operation unit is configured to obtain the source operands of the vector operation instruction from the vector register and perform the operation to obtain an operation result. In this application, the vector register stores the source operands obtained by the engine unit, so that the vector operation unit can obtain the source operands required by the vector operation instruction directly from the vector register and perform the operation.

Description

AI Chip, Tensor Processing Method, and Electronic Device

Technical Field

This application belongs to the field of computer technology, and specifically relates to an AI chip, a tensor processing method, and an electronic device.

Background

With the development of artificial intelligence (AI) technology, neural networks have become one of the most popular AI techniques today. Neural networks frequently involve computations between tensors. The architecture of a current AI chip used for tensor computation is shown in Figure 1. When computing between tensors, the general steps are as follows: Step 1, the processing unit (processing core) computes the addresses of the tensor elements corresponding to one or more source operands; Step 2, one or more source operands are read from memory into the vector operation unit according to those addresses; Step 3, the vector operation unit computes the result.

This AI chip supports only real-time addressing: after the addresses of the tensor elements are computed, the source operands must immediately be read from memory into the vector operation unit according to those addresses. If an error occurs partway through the computation, re-addressing is required, resulting in low computational efficiency and high overhead.

Summary of the Invention

In view of this, the purpose of this application is to provide an AI chip, a tensor processing method, and an electronic device, to address the problem that existing AI chips support only real-time addressing, so that an error partway through a computation requires re-addressing, resulting in low computational efficiency and high overhead.

The embodiments of this application are realized as follows:

In a first aspect, an embodiment of this application provides an AI chip, including a vector register, an engine unit, and a vector operation unit. The engine unit is configured to obtain, according to a tensor addressing mode, the source operands required by a vector operation instruction from original tensor data, where the number of source operands obtained by one addressing operation in the tensor addressing mode is greater than or equal to 1. The vector register is connected to the engine unit and is configured to store the source operands obtained by the engine unit. The vector operation unit is connected to the vector register and is configured to obtain the source operands of the vector operation instruction from the vector register and perform the operation to obtain an operation result. One addressing operation of the engine unit in the tensor addressing mode may determine one or more tensor elements.

In the embodiments of this application, the engine unit uses a tensor addressing mode to obtain the source operands required by a vector operation instruction from the tensor data, so that at least one source operand can be obtained in a single addressing operation. This greatly improves the efficiency of obtaining source operands, so that fewer instructions are needed to locate the required operands. At the same time, a vector register is used to store the source operands obtained by the engine unit, so that the vector operation unit can obtain the source operands of the vector operation instruction directly from the vector register. This design separates addressing from computation and allows the source operands required by a vector operation instruction to be fetched in advance and simply read when the operation is later performed. Even if an error occurs partway through the computation, there is no need to re-address: the corresponding source operands can be obtained directly from the vector register, which improves computational efficiency.

In a possible implementation of the embodiment of the first aspect, the parameters of the tensor addressing mode include: a start address characterizing the starting point of addressing, an end address characterizing the end point of addressing, a step characterizing the offset of the addressing pointer, and a size characterizing the shape of the data obtained by addressing.

In the embodiments of this application, the existing slice addressing mode is improved and extended: the tensor addressing mode (which can also be regarded as a new slice addressing mode) expands the three parameters of the original slice addressing mode to a larger parameter set (including size information characterizing the shape of the data obtained by addressing), thereby improving addressing efficiency and reducing redundant instructions.

In a possible implementation of the embodiment of the first aspect, the parameters of the tensor addressing mode further include a feature parameter characterizing whether an incomplete shape obtained by addressing is retained.

In the embodiments of this application, by introducing a feature parameter (denoted, for example, by partial) that characterizes whether an incomplete shape obtained by addressing is retained, the value of the feature parameter can be configured to flexibly decide whether to retain the data contained in an incomplete shape.
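As a concrete illustration, the extended four-parameter slice mode with the partial flag can be sketched as a software model in Python. This is a hypothetical sketch over a flat one-dimensional array; the function name, the flat layout, and the boolean `partial` flag are assumptions for illustration, not the chip's actual interface:

```python
def tensor_slice_address(data, start, stop, step, size, partial=True):
    """Extended slice addressing sketch: unlike a[start:stop:step], which
    yields one element per addressing step, each step here yields a block
    of `size` consecutive elements, so one addressing operation returns
    one or more source operands. `partial` controls whether an incomplete
    block at the boundary (fewer than `size` elements left) is retained."""
    blocks = []
    for pos in range(start, stop, step):
        block = data[pos:pos + size]
        if len(block) == size or (partial and block):
            blocks.append(block)
    return blocks
```

For example, with the mode 0:8:2:5 over ten elements, `partial=False` drops the final incomplete block starting at position 6, while `partial=True` retains its four remaining elements.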

In a possible implementation of the embodiment of the first aspect, the tensor addressing mode includes a nested double tensor addressing mode, which includes an outer iteration addressing mode and an inner iteration addressing mode, where the inner iteration addressing mode addresses within the tensor data obtained by the outer iteration addressing mode.

In the embodiments of this application, nested double tensor addressing achieves independent addressing in the outer and inner iteration addressing modes. Compared with a single addressing mode, addressing efficiency is higher: one instruction accomplishes the addressing of two single-level instructions. Moreover, because the inner iteration addressing mode addresses within the tensor data obtained by the outer iteration addressing mode, the data addressed by the outer iteration is only an intermediate result and does not need to be read or written. Compared with single-level tensor addressing, this reduces the data reads and writes of the first level (the outer iteration addressing mode).
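The nesting described above can be sketched in Python over a flat array. This is a hypothetical model for illustration only; packing each level's parameters as a (start, stop, step, size) tuple is an assumption, not the chip's instruction encoding:

```python
def double_tensor_address(data, outer, inner):
    """Nested double tensor addressing sketch. The outer iteration selects
    candidate regions from the original data; each region is only an
    intermediate result and is never written out. The inner iteration then
    addresses within each region to produce the source operands, so one
    instruction covers two levels of addressing."""
    o_start, o_stop, o_step, o_size = outer
    i_start, i_stop, i_step, i_size = inner
    operands = []
    for o in range(o_start, o_stop, o_step):
        region = data[o:o + o_size]          # outer level: intermediate only
        for i in range(i_start, i_stop, i_step):
            block = region[i:i + i_size]     # inner level: actual operands
            if len(block) == i_size:         # keep complete shapes only
                operands.append(block)
    return operands
```

A single call thus replaces two single-level addressing passes, and the outer-level regions never touch the operand storage.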

In a possible implementation of the embodiment of the first aspect, the engine unit is also connected to an external memory, and the external memory is configured to store the original tensor data.

In the embodiments of this application, when the vector register is connected to the external memory through the engine unit, the external memory can be used to store the original tensor data directly, rather than the source operands obtained by the engine unit, so that the storage space of the external memory does not need to be very large (because tensor operations involve a large amount of repeated data, the volume of data addressed in the tensor addressing mode may exceed that of the original tensor data).

In a possible implementation of the embodiment of the first aspect, there are multiple vector registers, and different vector registers store different source operands.

In the embodiments of this application, multiple vector registers are provided, with different vector registers storing different source operands, so that when performing a vector operation, the required operands can be read from the individual vector registers simultaneously, improving computational efficiency.

In a possible implementation of the embodiment of the first aspect, the engine unit includes multiple addressing engines, with one addressing engine corresponding to one vector register, and each addressing engine is configured to obtain the source operands required by a vector operation instruction from its corresponding original tensor data according to its own tensor addressing mode.

In the embodiments of this application, multiple addressing engines are provided, one per vector register, so that during addressing each addressing engine can obtain the source operands required by a vector operation instruction from its corresponding original tensor data according to its own tensor addressing mode. This not only improves addressing efficiency but also avoids interference between the engines.

In a possible implementation of the embodiment of the first aspect, the engine unit further includes a main engine configured to send control commands to each addressing engine, controlling each addressing engine to address according to its own tensor addressing mode and to obtain the source operands required by the vector operation instruction from its corresponding original tensor data.

In the embodiments of this application, the main engine sends control commands to each addressing engine to control the engines' independent addressing. Centralized control by the main engine ensures that the shapes of the source operands obtained by the individual addressing engines are consistent, guaranteeing the accuracy of the computation.

In a possible implementation of the embodiment of the first aspect, the control commands include three commands: an Advance command, a Reset command, and a NOP (no-operation) command. Each time the main engine sends a control command to an addressing engine, it sends a combined control command composed of at least one of the three commands, to control each addressing engine to address different dimensions of the original tensor data according to its own tensor addressing mode.

In the embodiments of this application, a combined control command composed of at least one of the Advance, Reset, and NOP commands is sent to the addressing engine, for example: "W cmd: NOP; H cmd: NOP", "W cmd: Advance; H cmd: NOP", and so on, thereby controlling each addressing engine to address different dimensions of the original tensor data (such as the width W and height H dimensions) according to its own tensor addressing mode. Only simple commands need to be combined, so a variety of functions can be realized.
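One way to model such combined per-dimension commands is the following Python sketch. The command names come from the text; the dictionary-based representation of per-dimension indices and steps is an assumption for illustration, not the hardware's actual command format:

```python
from enum import Enum

class Cmd(Enum):
    NOP = 0      # hold the index in this dimension
    ADVANCE = 1  # move the index forward by this dimension's step
    RESET = 2    # return the index to the start of this dimension

def apply_combined_command(indices, steps, combined):
    """Apply one combined control command to an addressing engine's
    per-dimension indices. `combined` maps a dimension name (e.g. 'W',
    'H') to one primitive command, mirroring combinations such as
    "W cmd: Advance; H cmd: NOP" from the text."""
    out = dict(indices)
    for dim, cmd in combined.items():
        if cmd is Cmd.ADVANCE:
            out[dim] += steps[dim]
        elif cmd is Cmd.RESET:
            out[dim] = 0
    return out
```

Because each dimension receives its own primitive, three simple commands compose into many distinct traversal patterns over the tensor's dimensions.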

In a second aspect, an embodiment of this application also provides an electronic device, including a memory and the AI chip provided in the first aspect. The memory is configured to store the tensor data required for computation. The AI chip is connected to the memory and is configured to obtain, according to a tensor addressing mode, the source operands required by a vector operation instruction from the original tensor data stored in the memory, and to operate on the obtained source operands according to the vector operation instruction to obtain an operation result.

In a third aspect, an embodiment of this application also provides a tensor processing method, including:

obtaining, according to a tensor addressing mode, the source operands required by a vector operation instruction from original tensor data, where the number of source operands obtained by one addressing operation in the tensor addressing mode is greater than or equal to 1; storing the obtained source operands in a vector register; and obtaining the source operands of the vector operation instruction from the vector register and performing the operation to obtain an operation result.
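The three steps of the method can be sketched as a minimal software model in Python. The function name and the packing of the addressing mode as a (start, stop, step) tuple are hypothetical; the point is the register stage that decouples addressing from computation:

```python
def tensor_process(data, mode, op):
    """Sketch of the claimed method: one addressing pass fills a vector
    register, then the operation reads operands from the register rather
    than re-addressing memory."""
    start, stop, step = mode
    # Step 1 + 2: address the original tensor data and store the obtained
    # source operands in the vector register
    vector_register = data[start:stop:step]
    # Step 3: the operation unit reads the register and computes; on an
    # error it could retry from the register without re-addressing
    return op(vector_register)
```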

The method can be applied to the AI chip of the first aspect and/or the electronic device of the second aspect. In this method, one addressing operation in the tensor addressing mode may obtain one or more tensor elements. For the principles and beneficial effects of the method, refer to the related descriptions of the embodiments or implementations of the other aspects.

In a possible implementation of the embodiment of the third aspect, obtaining the source operands required by a vector operation instruction from the original tensor data according to a tensor addressing mode includes: using an addressing engine, under a control command sent by the main engine, to obtain the source operands required by the vector operation instruction from the original tensor data according to the tensor addressing mode.

In a possible implementation of the embodiment of the third aspect, the parameters of the tensor addressing mode include: a start address characterizing the starting point of addressing, an end address characterizing the end point of addressing, a step characterizing the offset of the addressing pointer, and a size characterizing the shape of the data obtained by addressing.

In a possible implementation of the embodiment of the third aspect, the tensor addressing mode includes a nested double tensor addressing mode, which includes an outer iteration addressing mode and an inner iteration addressing mode. Obtaining the source operands required by a vector operation instruction from the original tensor data according to the tensor addressing mode includes: selecting at least one candidate region from the original tensor data using the outer iteration addressing mode, where the candidate region contains multiple tensor elements; and obtaining the source operands required by the vector operation instruction from the at least one candidate region using the inner iteration addressing mode.

Brief Description of the Drawings

To more clearly illustrate the technical solutions of the embodiments of this application or of the prior art, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort. The above and other objects, features, and advantages of this application will be clearer from the drawings. The same reference numerals indicate the same parts throughout the drawings. The drawings are not deliberately drawn to scale; the emphasis is on illustrating the gist of this application.

Figure 1 is a schematic structural diagram of an AI chip and a memory in the prior art.

Figure 2 is a schematic diagram of the principle of addressing in the slice addressing mode of the prior art.

Figure 3 is a schematic structural diagram of an AI chip connected to a memory according to an embodiment of this application.

Figure 4 is a schematic diagram of a broadcasting principle according to an embodiment of this application.

Figure 5 is a schematic diagram of another broadcasting principle according to an embodiment of this application.

Figure 6 is a schematic diagram of the principle of a first tensor addressing mode according to an embodiment of this application.

Figure 7 is a schematic diagram of the principle of a second tensor addressing mode according to an embodiment of this application.

Figure 8 is a schematic structural diagram of another AI chip according to an embodiment of this application.

Figure 9 is a schematic diagram of the principle of addressing in the 0:8:2:5 addressing mode according to an embodiment of this application.

Figure 10 is a schematic diagram of the principle of addressing in the 0:5:2:1 addressing mode according to an embodiment of this application.

Figure 11 is a schematic structural diagram of an electronic device according to an embodiment of this application.

Figure 12 is a schematic diagram of the principle of addressing in the 0:4:1:3 addressing mode according to an embodiment of this application.

Figure 13 is a schematic diagram of a tensor processing method according to an embodiment of this application.

Detailed Description

The technical solutions in the embodiments of this application are described below with reference to the drawings in the embodiments of this application.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings. The terms "comprise", "include", or any of their variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.

Through research, the inventors found that in the prior art the computation between tensors consumes a long time computing the addresses of the tensor elements corresponding to the source operands; that is, a large share of the time is wasted on address computation. The main reason is that the existing approach addresses based on the slice addressing mode of numpy (Numerical Python, an open-source numerical computing extension of Python). For example, for a tensor array a, a new array is extracted in the manner a[start:stop:step], and any dimension of a multidimensional array can be processed this way. Here, start is the start address of addressing, stop is the end address of addressing, and step is equivalent to stride, the offset of the addressing pointer. Taking the slice addressing mode start:stop:step = 0:5:2 as an example, its addressing principle is shown in Figure 2. This slice addressing mode obtains only one tensor element per addressing operation, for example (0,0), (2,0), (4,0), (0,2), (2,2), (4,2), (0,4), (2,4), (4,4) in the example. Since tensor computation involves a large number of source operands, obtaining the multiple tensor elements corresponding to many source operands requires a large number of instructions performing many addressing operations, so that a large number of instructions are wasted on addressing, that is, on non-effective computation.
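The conventional behavior described above can be reproduced directly with numpy. The sketch below applies 0:5:2 to both axes of a 5x5 tensor; each of the nine listed coordinates is a separate single-element access, which is exactly the per-element cost the new addressing mode is designed to avoid:

```python
import numpy as np

# 5x5 tensor; start:stop:step = 0:5:2 on both axes visits rows and
# columns 0, 2, and 4
a = np.arange(25).reshape(5, 5)

# the nine addressed coordinates, in the order listed in the text
coords = [(r, c) for c in range(0, 5, 2) for r in range(0, 5, 2)]
picked = [a[r, c] for r, c in coords]   # nine separate single-element accesses
sliced = a[0:5:2, 0:5:2]                # the same nine elements as one slice
```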

Based on this, this application provides a new AI chip that can determine the addresses of the tensor elements corresponding to at least one source operand in a single addressing operation. This greatly improves the efficiency of computing the addresses of the tensor elements corresponding to the source operands, so that fewer instructions are needed to locate the required source operands, which greatly reduces redundant instructions, increases effective instruction density, improves performance, and simplifies programming. At the same time, a vector register is used to store the source operands obtained by the engine unit, so that the vector operation unit can obtain the source operands of a vector operation instruction directly from the vector register. This design allows the source operands required by a vector operation instruction to be fetched in advance and simply read when the operation is later performed, separating addressing from computation: even if an error occurs partway through the computation, there is no need to re-address, since the corresponding source operands can be obtained directly from the vector register, improving computational efficiency.

The AI chip can accelerate tensor addressing and computation at the hardware level. The hardware can support multiple tensor addressing modes with good compatibility, which facilitates the rapid completion of computations between tensors.

In addition, in terms of flexibility, ease of use, and processing efficiency, the embodiments of this application propose new addressing modes for tensor addressing. On the one hand, based on the design idea of traditional slice addressing, which accesses array elements sequentially or at a fixed stride, a new slice-based tensor addressing mode is designed as a further extension; on the other hand, an addressing mode capable of nested double tensor addressing is designed. These new addressing modes can all be applied on the AI chip provided by the embodiments of this application, helping to reduce the total number of instructions required for computations between tensors, improve hardware processing performance, and improve tensor processing efficiency.

The principle of the AI chip provided by the embodiments of this application is described below with reference to Figure 3. The AI chip includes a vector register, an engine unit, and a vector operation unit, where the vector register is directly connected to the vector operation unit, and the engine unit may be directly connected to the vector operation unit.

The engine unit is configured to obtain, according to a tensor addressing mode, the source operands required by a vector operation instruction from the original tensor data, where the number of source operands obtained by one addressing operation in the tensor addressing mode is greater than or equal to 1. Tensor data is multi-dimensional data, usually laid out in some manner (such as a linear layout or tiled layout) and stored in a device with a storage function, such as memory or on-chip memory.

The vector register is configured to store the source operands obtained by the engine unit, so that even if an error occurs partway through the computation, there is no need to re-address: the corresponding source operands can be obtained directly from the vector register.

The vector operation unit is connected to the vector register and is configured to obtain the source operands of the vector operation instruction from the vector register and perform the operation to obtain an operation result. A vector operation instruction is an instruction that can compute on two or more operands at the same time. The type of the vector operation instruction can be any of various operation types such as addition, subtraction, multiplication, and multiply-accumulate.

Because the data stored in the vector register is the source operands of the vector operation instruction, addressed from the original tensor data according to the tensor addressing mode, the vector operation unit, being directly connected to the vector register, can obtain the result of tensor addressing simply by accessing the vector register.

When the vector register is directly connected to the engine unit, the engine unit is also connected to an external memory; in this case the vector register is effectively connected to the external memory through the engine unit. The external memory is used to store the original tensor data. Take a three-operand vector operation instruction such as A*B+C as an example, where A, B, and C are all source operands; each of A, B, and C may be a single element or an array containing multiple elements. The memory stores the original tensor data in which operand A is located, the original tensor data in which operand B is located, and the original tensor data in which operand C is located. According to the tensor addressing mode, the engine unit obtains operand A of the vector operation instruction from the original tensor data in which operand A is located and stores it in the vector register, obtains operand B from the original tensor data in which operand B is located and stores it in the vector register, and obtains operand C from the original tensor data in which operand C is located and stores it in the vector register.

The aforementioned AI chip may be an integrated circuit chip with data processing capability that can be used to process operations between tensors. For example, it may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device. The general-purpose processor may also be a microprocessor, or the AI chip may be any conventional processor, such as a graphics processing unit (GPU) or a general-purpose graphics processing unit (GPGPU).

The inventors found that operations on tensor elements are fairly regular; in general there are the following two kinds of operation:

First, tensor elements in two or more tensor data of identical dimension sizes are operated on one-to-one (called an element-wise operation). For example, assuming the width (W) and height (H) of two tensor data are both 5, i.e., W*H is 5*5, then when the tensor elements of these two tensor data are operated on one-to-one, the operation is performed between tensor elements at corresponding positions.
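The element-wise case can be sketched in NumPy (used here purely as an illustration; the hardware described above operates on vector registers, not NumPy arrays, and the array contents are made up for the example):

```python
import numpy as np

# Two tensor data of identical dimension sizes: W*H = 5*5.
a = np.arange(25).reshape(5, 5)
b = np.ones((5, 5), dtype=a.dtype)

# Element-wise addition: each element of `a` pairs one-to-one with the
# element at the same position in `b`.
c = a + b

assert c.shape == (5, 5)
assert c[0, 0] == 1 and c[4, 4] == 25
```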

Second, a single tensor element in one tensor data is operated on with a group of tensor elements in another tensor data (called broadcasting). For example, in a scenario where every pixel value of an image tensor (the first operand) must be divided by 255 (the second operand), assume the dimensions of the two operands are as follows: the number of channels, width, and height of the first operand are 3, 224, 224, respectively; the number of channels, width, and height of the second operand are 1, 1, 1, respectively.

In this case, the second operand is broadcast in all three dimensions (i.e., the aforementioned channel, width, and height dimensions). That is, before the division is performed, the second operand of dimension (1, 1, 1), with value 255, must be expanded into a tensor of dimension (3, 224, 224) in which every element is 255. The two operands then have identical dimension sizes, the division can be performed, and the tensor elements of the two tensor data are then operated on one-to-one.

As another example, suppose the number of channels, width, and height of the first operand are 3, 224, 224, and those of the second operand are 3, 1, 224. The second operand then only needs to be broadcast in the width direction, i.e., the second operand of dimension (3, 1, 224) is expanded into a tensor of dimension (3, 224, 224). The two operands then have identical dimension sizes and the division can be performed.
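Both broadcasting cases above can be reproduced in NumPy, whose broadcasting semantics match this description (the pixel value 128.0 is an arbitrary illustrative content):

```python
import numpy as np

# First operand: (channels, width, height) = (3, 224, 224).
img = np.full((3, 224, 224), 128.0, dtype=np.float32)

# Case 1: a (1, 1, 1) operand holding 255 is broadcast in all three
# dimensions before the division.
d1 = np.full((1, 1, 1), 255.0, dtype=np.float32)
out1 = img / d1            # d1 behaves as if expanded to (3, 224, 224)

# Case 2: a (3, 1, 224) operand only needs broadcasting along the
# dimension of size 1.
d2 = np.full((3, 1, 224), 255.0, dtype=np.float32)
out2 = img / d2

assert out1.shape == (3, 224, 224)
assert out2.shape == (3, 224, 224)
assert np.allclose(out1, out2)
```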

For a broadcast operation to occur, the two tensor data must have the same number of dimensions, and at least one dimension of one of the tensor data must be 1. During broadcast processing, the elements of the broadcast tensor data are replicated along each broadcast dimension, across all of its other dimensions.

To better understand the broadcast operation described above, an explanation is given below with reference to the schematic diagrams shown in FIG. 4 and FIG. 5. In FIG. 4, because the operand 5 differs from the array "np.arange(3)" (the first array) in size along the width dimension, 5 must be broadcast in the width direction before the addition, so that its size in the width dimension matches that of the array "np.arange(3)"; the addition is performed afterwards. Similarly, in FIG. 5, because the array "np.arange(3).reshape((3,1))" (the second array) and the array "np.arange(3)" differ in both the width and height dimensions, the array "np.arange(3).reshape((3,1))" must be broadcast in the width direction and the array "np.arange(3)" must be broadcast in the height direction, so that the two have the same dimension sizes; the addition is performed afterwards.
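The two figures can be reproduced directly; the expected values follow from NumPy's broadcasting rules:

```python
import numpy as np

# FIG. 4: the operand 5 is broadcast along the width of np.arange(3).
r4 = np.arange(3) + 5
assert (r4 == [5, 6, 7]).all()

# FIG. 5: the (3, 1) array is broadcast along the width and the (3,)
# array along the height, giving a (3, 3) result.
r5 = np.arange(3).reshape((3, 1)) + np.arange(3)
assert r5.shape == (3, 3)
assert (r5 == [[0, 1, 2], [1, 2, 3], [2, 3, 4]]).all()
```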

The principles and details of element-wise operations and broadcast operations on tensors are well known in the art and are not described further here.

In addition, the address computation patterns of common tensor elements (such as the slice addressing mode) are also fairly regular, where slicing refers to accessing tensor elements in a certain order or with a certain stride.

Besides the architectural improvements of the AI chip provided by the present application, in order to improve the addressing efficiency of tensor elements, the inventors of the present application, making full use of the regularity of existing tensor operations, propose a brand-new tensor addressing mode. This tensor addressing mode improves and extends the existing slice addressing mode, so that the parameters contained in the tensor addressing mode (which may also be regarded as a new slice addressing mode) are extended from the three parameters of the original slice addressing mode to more parameters, thereby improving addressing efficiency and reducing redundant instructions.

The parameters of the tensor addressing mode in the present application include: a start address characterizing the starting point of addressing (denoted start), an end address characterizing the end point of addressing (denoted stop), a step size characterizing the offset of the addressing pointer (denoted step), and a size characterizing the shape of the addressed data (denoted size). The expression of the tensor addressing mode can then be written as [start:stop:step:size]. Here, size describes the size of the data shape obtained by addressing; that is, at each step, all the element points contained in one shape are extracted, rather than only a single point. The values of start, stop, step, and size in the above expression are all configurable, to suit computations between various tensors.

Optionally, the parameters of the tensor addressing mode further include a characteristic parameter characterizing whether an incomplete addressed data shape is retained (denoted partial, used to reflect the completeness of a local tensor data shape). In this case, the expression of the tensor addressing mode is [start:stop:step:size:partial]. In one implementation, when partial=false, i.e., the partial parameter is set to false, the points contained in an incomplete shape at the edge are discarded, i.e., not retained; when partial=true, i.e., the partial parameter is set to true, the points contained in an incomplete shape at the edge are retained. Compared with the parameters of the existing slice addressing mode, the parameters of the tensor addressing mode in the present application add two parameters: size and partial.

For better understanding, the tensor addressing mode [start:stop:step:size] = 0:6:2:3 is taken as an example below to explain the cases of partial=false and partial=true, respectively. In this expression, start=0, stop=6, step=2, and size=3.

When partial=false, the points contained in an incomplete shape at the edge are discarded; the schematic diagram is shown in FIG. 6. In this case, at each step of the slide, it is judged according to size whether the currently addressed element points can form a complete shape; if they cannot, the points addressed at that step are discarded. When partial=true, the points contained in an incomplete shape at the edge are retained; the schematic diagram is shown in FIG. 7. In this case, each step still judges according to size whether the currently addressed points can form a complete shape, but even if a complete shape cannot be formed, the points addressed at that step are retained.

Comparing FIG. 6 and FIG. 7, when partial=false, the tensor element points contained in the incomplete shapes at the edges in FIG. 6 are discarded. In the examples of FIG. 6 and FIG. 7, a complete shape contains 9 points; shapes at the edges contain fewer than 9 points and are therefore incomplete. When partial=false, the points contained in an incomplete shape are discarded; when partial=true, they are retained.
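The behaviour of [start:stop:step:size:partial] along one dimension can be sketched as follows. This is a minimal illustrative model, not the chip's implementation; the function name and the exact boundary convention (windows are clipped at stop) are assumptions. FIG. 6 and FIG. 7 apply the same rule independently in W and H, which is why a complete two-dimensional shape there contains 3*3 = 9 points.

```python
def slice_addresses(start, stop, step, size, partial):
    """1-D sketch of the extended slice addressing mode
    [start:stop:step:size:partial]: at every step the pointer advances
    by `step` and a window of `size` consecutive element indices is
    emitted. With partial=False, windows that cannot form a complete
    shape are discarded; with partial=True they are kept, truncated
    at the boundary."""
    windows = []
    for p in range(start, stop, step):
        w = [i for i in range(p, p + size) if i < stop]
        if len(w) < size and not partial:
            continue  # incomplete shape at the edge is discarded
        windows.append(w)
    return windows

# [start:stop:step:size] = 0:6:2:3 in one dimension:
assert slice_addresses(0, 6, 2, 3, partial=False) == [[0, 1, 2], [2, 3, 4]]
assert slice_addresses(0, 6, 2, 3, partial=True)  == [[0, 1, 2], [2, 3, 4], [4, 5]]
```

Note how the last window [4, 5] is incomplete (fewer than size=3 points) and is kept only when partial=True.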

Comparing FIG. 6 (or FIG. 7) with FIG. 2, it can be seen that with the tensor addressing mode shown in the present application, one step can determine multiple tensor elements, and a single addressing operation can determine the addresses of the tensor elements corresponding to multiple source operands. For example, in the above example, one addressing operation (i.e., one step) addresses 9 tensor elements, whereas the existing approach requires 9 addressing operations, and correspondingly more instructions, to address those 9 tensor elements. Therefore, compared with existing addressing modes, the addressing mode provided by the embodiments of the present application greatly improves the efficiency of computing the addresses of the tensor elements corresponding to the source operands, so that fewer instructions are needed to find the required source operands, which reduces redundant instructions, increases effective instruction density, improves performance, and simplifies programming.

It can be understood that when each step only needs to determine 1 tensor element, this can be achieved by configuring size=1; each step then addresses only 1 tensor element, and the addressing mode is similar to the existing addressing mode (by changing the parameter values, the hardware is compatible with the existing addressing logic), as shown schematically in FIG. 1. Alternatively, the existing addressing mode can be used directly.

It should be noted that when each step only needs to address 1 tensor element, the size parameter may also be configured to be greater than 1; in this case, the required tensor element is further selected from the shape obtained according to size and step. Therefore, when each step only needs to address 1 tensor element, size need not be 1. In this way, the tensor addressing mode provided by the embodiments of the present application can reproduce the existing slice addressing behaviour while balancing processing efficiency and addressing flexibility.

There may be multiple vector registers, and different vector registers may store different source operands (each being a portion of tensor data). Take a three-operand vector operation instruction such as A*B+C as an example. In this case there may be 3 vector registers, for example vector register 1, vector register 2, and vector register 3. Vector register 1 may store source operand A, vector register 2 may store source operand B, and vector register 3 may store source operand C. The present application does not limit the specific instruction expression form.

In a vector operation instruction, each operand corresponds to an addressing mode, and the addressing modes of the multiple operands match one another; a broadcast operation may occur during the matching process, and after matching, the multiple operands have the same shape.

It can be understood that different operands (such as the aforementioned A and B) may also be stored in the same vector register, as long as the vector register has enough space. Therefore, the above example, in which different vector registers store different operands, should not be understood as a limitation of the present application.

When there are multiple vector registers, one or more addressing engines may be used to address the tensor data stored for the multiple vector registers.

To improve addressing efficiency, in an optional implementation, as shown in FIG. 8, the engine unit includes multiple addressing engines, one addressing engine corresponding to one vector register, and each addressing engine is configured to obtain, according to its own tensor addressing mode, the source operands required by the vector operation instruction from the corresponding original tensor data. For example, taking the case where 3 vector registers respectively store the 3 required source operands, addressing may be performed by 3 addressing engines (denoted, for example, addressing engine 1, addressing engine 2, and addressing engine 3). Addressing engine 1 corresponds to vector register 1 and is configured to obtain, according to the tensor addressing mode of addressing engine 1, the source operand A required by the vector operation instruction from the original tensor data in which source operand A is located; addressing engine 2 corresponds to vector register 2 and is configured to obtain, according to the tensor addressing mode of addressing engine 2, the source operand B required by the vector operation instruction from the original tensor data in which source operand B is located; and addressing engine 3 corresponds to vector register 3 and is configured to obtain, according to the tensor addressing mode of addressing engine 3, the source operand C required by the vector operation instruction from the original tensor data in which source operand C is located.

It can be understood that in an optional implementation, the same addressing engine may also obtain, from multiple original tensor data, the multiple source operands required by the vector operation instruction; for example, addressing engine 1 may obtain source operand A, source operand B, and source operand C of the foregoing example.

Each addressing engine, such as the aforementioned addressing engine 1, addressing engine 2, and addressing engine 3, is independent, and they do not interfere with one another. When obtaining the source operands required by the vector operation instruction, each addressing engine performs independent addressing with an independent tensor addressing mode; that is, each source operand of the vector operation instruction corresponds to an independent tensor addressing mode. For example, the aforementioned source operand A corresponds to tensor addressing mode 1, source operand B corresponds to tensor addressing mode 2, and source operand C corresponds to tensor addressing mode 3. Tensor addressing mode 1, tensor addressing mode 2, and tensor addressing mode 3 may contain the same kinds of parameters, for example all five of the aforementioned parameters (start:stop:step:size:partial), but the specific parameter values may differ.

It can be understood that when multiple addressing engines perform independent addressing, all of them may use the tensor addressing mode provided by the present application, or some of them may use the tensor addressing mode provided by the present application while the rest use existing addressing modes (the index addressing mode, or numpy's basic slice addressing mode). For example, besides all being the same type of addressing mode (such as the new slice addressing mode), tensor addressing modes 1, 2, and 3 may also correspond to different addressing modes; for instance, tensor addressing mode 1 corresponds to the tensor addressing mode of the present application (the new slice addressing mode), tensor addressing mode 2 uses the original three-parameter slice addressing mode, and tensor addressing mode 3 uses the original index addressing mode. Even so, compared with using existing addressing modes throughout, this still improves addressing efficiency to a certain extent and reduces the number of instructions required for addressing.

In one implementation, besides its own independent addressing mode, each addressing engine may also be configured to support handling broadcast operations.

To facilitate control of the addressing of the individual addressing engines, in an optional implementation the engine unit further includes a master engine, which is connected to each addressing engine and is configured to send control commands to each addressing engine, controlling each addressing engine to perform addressing according to its tensor addressing mode and obtain, from the corresponding original tensor data, the source operands required by the vector operation instruction. Introducing a master engine to centrally control these independent addressing engines can improve addressing efficiency. At every step, the master engine sends control commands to the addressing engines of the operands, so that each corresponding addressing engine traverses every dimension of the tensor data in order from low dimension to high dimension, addressing in the corresponding dimension.

The control commands include an Advance command, a Reset command, and a NOP (no-operation) command. That is, at every step the master engine sends at least one of these three control commands to the addressing engines of the operands. In other words, each time the master engine sends control commands to each addressing engine, it sends a combined control command formed from at least one of the three control commands, to control each addressing engine to address the different dimensions of the original tensor data according to its respective tensor addressing mode.

The Advance command means advancing (i.e., increasing) one step in the dimension concerned.

The Reset command means that the end of the dimension concerned has been reached and addressing in that dimension must restart from the beginning.

The NOP command means that no action takes place in the dimension concerned.

To better understand the logic by which the master engine controls the addressing engines using the Advance, Reset, and NOP commands above, an explanation is given below with reference to the schematic diagram shown in FIG. 7:

Step 1: W cmd: NOP; H cmd: NOP, where cmd denotes a command; the state at this point is shown in (1) of FIG. 7;

Step 2: W cmd: Advance; H cmd: NOP; the state at this point is shown in (2) of FIG. 7;

Step 3: W cmd: Advance; H cmd: NOP; the state at this point is shown in (3) of FIG. 7;

Step 4: W cmd: Reset; H cmd: Advance; the state at this point is shown in (4) of FIG. 7;

Step 5: W cmd: Advance; H cmd: NOP; the state at this point is shown in (5) of FIG. 7;

Step 6: W cmd: Advance; H cmd: NOP; the state at this point is shown in (6) of FIG. 7;

Step 7: W cmd: Reset; H cmd: Advance; the state at this point is shown in (7) of FIG. 7;

Step 8: W cmd: Advance; H cmd: NOP; the state at this point is shown in (8) of FIG. 7;

Step 9: W cmd: Advance; H cmd: NOP; the state at this point is shown in (9) of FIG. 7.
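The per-step command generation for this two-dimensional traversal (W wrapping before H advances) can be sketched as follows. The function and its return structure are illustrative assumptions, not the chip's implementation:

```python
def command_stream(num_w_steps, num_h_steps):
    """Sketch of the master engine's per-step (W cmd, H cmd) pairs for a
    2-D traversal that walks the low dimension W first, then H.
    NOP = no action, Advance = move one step, Reset = restart the
    dimension from the beginning."""
    cmds = [("NOP", "NOP")]  # Step 1: initial position, no movement yet
    w = 0
    for _ in range(num_w_steps * num_h_steps - 1):
        if w < num_w_steps - 1:
            cmds.append(("Advance", "NOP"))   # advance within the row
            w += 1
        else:
            cmds.append(("Reset", "Advance")) # W wraps around, H advances
            w = 0
    return cmds

stream = command_stream(3, 3)            # the 3*3 walk of FIG. 7
assert len(stream) == 9
assert stream[0] == ("NOP", "NOP")       # Step 1
assert stream[3] == ("Reset", "Advance") # Step 4
assert stream[6] == ("Reset", "Advance") # Step 7
assert stream[8] == ("Advance", "NOP")   # Step 9
```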

If a first dimension of the tensor data (which may be any dimension of the tensor data) requires broadcasting, then when the addressing engine corresponding to that tensor data performs addressing in the first dimension and receives an Advance command, it keeps the addressing pointer unchanged; in this way, when reading source operands from the vector register, the addressing engine keeps reading the source operand pointed to by the current addressing pointer, thereby achieving broadcasting. Correspondingly, if the first dimension of the tensor data requires broadcasting, then when the master engine sends control commands to control the addressing engine to address the first dimension, the master engine does not send a Reset command when the addressing engine reaches the end of the first dimension; it can keep sending Advance commands so that the source operand pointed to by the current addressing pointer is read continuously, and only after the broadcast is complete does it send other control commands.

For better understanding, an explanation is given below with reference to the schematic diagram shown in FIG. 4. Suppose the vector operation unit is to perform the operation shown in FIG. 4; addressing engine 1 is responsible for obtaining the operands in the array "np.arange(3)", and addressing engine 2 is responsible for obtaining the operand 5. At the initial moment, the master engine sends Advance commands to addressing engine 1 and addressing engine 2; addressing engine 1 obtains the operand 0 in the array "np.arange(3)", and addressing engine 2 obtains the operand 5. At the next moment, the master engine again sends Advance commands to both; addressing engine 1 advances one step and obtains the operand 1 in the array, while addressing engine 2, upon receiving the Advance command, keeps its addressing pointer unchanged and still outputs the operand 5. At the following moment, the master engine again sends Advance commands to both; addressing engine 1 advances one step and obtains the operand 2 in the array, while addressing engine 2 keeps its addressing pointer unchanged and still outputs the operand 5.
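This pointer-holding behaviour can be modelled with a minimal sketch (class and method names are invented for illustration; the real addressing engines are hardware):

```python
class AddressingEngine:
    """Toy model: an addressing engine that, when its dimension is
    broadcast, keeps its pointer unchanged on Advance, so repeated
    Advance commands keep yielding the same operand."""
    def __init__(self, data, broadcast=False):
        self.data = data
        self.broadcast = broadcast
        self.ptr = -1  # before the first element

    def advance(self):
        # A broadcast engine holds its pointer once the first element
        # has been reached; a normal engine moves one step.
        if not self.broadcast or self.ptr < 0:
            self.ptr += 1
        return self.data[self.ptr]

e1 = AddressingEngine([0, 1, 2])            # operands of np.arange(3)
e2 = AddressingEngine([5], broadcast=True)  # the scalar operand 5

# The master engine sends Advance to both engines at every moment:
out = [(e1.advance(), e2.advance()) for _ in range(3)]
assert out == [(0, 5), (1, 5), (2, 5)]
```

The pairs can then be added element-wise to give [5, 6, 7], matching FIG. 4.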

In an optional implementation, the tensor addressing mode may further include a nested double tensor addressing mode, which includes an outer-iteration addressing mode (outer iterator) and an inner-iteration addressing mode (inner iterator), where the inner-iteration addressing mode performs addressing on the basis of the tensor data obtained by the outer-iteration addressing mode.

In one implementation, the outer-iteration addressing mode (outer iterator) may be the aforementioned new slice addressing mode using the parameters start:stop:step:size:partial. In this case, the tensor data of a local region (for example, a shape delimited by these parameters) can be selected quickly, providing the data basis for addressing by the inner-iteration addressing mode. In some application scenarios, the local regions selected by the outer-iteration addressing mode are all complete shapes obtained with partial=true, so that no additional instructions are needed to signal a mode switch, or to handle the change of data shape (from complete to incomplete) that might otherwise arise in the inner-iteration addressing mode.

Compared with other addressing modes, addressing with the nested dual tensor addressing mode is more efficient: a single instruction accomplishes the addressing that would otherwise require two single-layer instructions. Moreover, since the inner iteration addressing mode addresses on the basis of the tensor data obtained by the outer iteration addressing mode, the data reads and writes of the first layer (the outer iteration addressing mode) are reduced compared with addressing using a single-layer tensor addressing mode.

The expressions of the outer iteration addressing mode and the inner iteration addressing mode may both be [start:stop:step:size], or [start:stop:step:size:partial]. The parameter values in the two expressions may differ.

For a better understanding, take the outer iteration addressing mode expression [start:stop:step:size] = 0:8:2:5 as an example; its schematic diagram is shown in FIG. 9, in which the traversed portion can be regarded as a feature map of size 8*8. Take 0:5:2:1 as the example value of the inner iteration addressing mode expression; its addressing schematic is shown in FIG. 10. Addressing according to the 0:8:2:5 addressing mode yields four groups of parameters, namely the parameters contained in the shapes (1), (2), (3) and (4) in FIG. 9. The inner iteration addressing mode then addresses on the basis of the tensor data obtained by the outer iteration addressing mode, that is, the 0:5:2:1 addressing mode is used to address the tensor elements obtained in (1), (2), (3) and (4) of FIG. 9 respectively.

As can be seen from the examples of FIG. 9 and FIG. 10, the outer iteration addressing mode is used to select at least one candidate region, each containing multiple tensor elements, from the tensor data, for example the regions (1), (2), (3) and (4) in FIG. 9; the inner iteration addressing mode is used to obtain the source operands required by the vector operation instruction from the at least one candidate region. That is, the inner iteration addressing mode addresses on the basis of (1), (2), (3) and (4) in FIG. 9 to obtain the source operands required by the vector operation instruction.

It should be noted that the inner iteration addressing mode does not necessarily wait until the outer iteration addressing mode has finished all of its addressing before it starts. Illustratively, once the outer iteration addressing mode has addressed the 5*5 elements contained in (1) of FIG. 9, the inner iteration addressing mode can address on this basis; after the inner iteration addressing mode finishes addressing this part of the data contained in (1) of FIG. 9, the outer iteration addressing mode can continue. When the outer iteration addressing mode has addressed the 5*5 elements contained in (2) of FIG. 9, the inner iteration addressing mode addresses on this basis; after it finishes addressing the data contained in (2) of FIG. 9, the outer iteration addressing mode continues, and so on, until all addressing is completed. In this way, the data addressed by the outer iteration addressing mode is only an intermediate result and does not need to be read or written; compared with addressing using a single-layer tensor addressing mode, the data reads and writes of the first layer (the outer iteration addressing mode) are reduced.
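The nested traversal just described (outer 0:8:2:5 over an 8*8 feature map, inner 0:5:2:1 within each 5*5 window) can be sketched as follows. The helper below is a simplified software model of the new slice addressing mode, not the hardware implementation; incomplete (partial) windows are simply excluded here:

```python
import numpy as np

def slice_addr(start, stop, step, size):
    """One-dimensional positions of the new slice mode: each step yields a
    whole window of `size` consecutive elements starting at the pointer."""
    return [range(p, p + size) for p in range(start, stop, step) if p + size <= stop]

fmap = np.arange(64).reshape(8, 8)  # the 8*8 feature map of FIG. 9

# Outer iteration 0:8:2:5 -> window start positions 0 and 2 per dimension,
# i.e. the four complete 5*5 regions (1)-(4) of FIG. 9.
outer = slice_addr(0, 8, 2, 5)
assert len(outer) == 2  # starts 0 and 2 (4+5 > 8 and 6+5 > 8 are excluded)

# Inner iteration 0:5:2:1 inside each 5*5 window -> points 0, 2, 4 per dim.
inner = slice_addr(0, 5, 2, 1)

results = []
for rows in outer:
    for cols in outer:
        window = fmap[np.ix_(list(rows), list(cols))]      # outer: one 5*5 region
        picked = window[np.ix_([r[0] for r in inner],      # inner: 3*3 points
                               [c[0] for c in inner])]
        results.append(picked)
print(results[0])
```

Interleaving, as the paragraph above notes, means each 5*5 `window` here is only a transient intermediate: the inner pick happens immediately, so the window itself never has to be written back.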

It can be understood that, besides the addressing mode shown in this application (the new slice addressing mode), the outer iteration addressing mode and the inner iteration addressing mode can also be extended, for example by using the existing addressing modes (the slice addressing mode and the index addressing mode) for addressing, so that multiple combinations are possible; some exemplary combinations are shown in Table 1 below.

Table 1

| Outer iteration addressing mode | new slice | new slice | slice     | slice | index | slice | index | index     | new slice |
| Inner iteration addressing mode | new slice | slice     | new slice | slice | index | index | slice | new slice | index     |

It should be noted that, when the outer iteration addressing mode uses the new slice addressing mode, each time the pointer moves by one step it obtains the 5*5 element parameters contained in one of the shapes (1), (2), (3) and (4) shown in FIG. 9. When the outer iteration addressing mode uses the existing slice addressing mode or the index addressing mode, only one element point is obtained per sliding step; to obtain the element parameters contained in the shapes (1), (2), (3) and (4) of FIG. 9, multiple addressing operations are required, so as to ensure that whatever addressing mode the outer iteration addressing mode uses, the same data as obtained with the new slice addressing mode can be obtained, thereby providing the basis for the inner iteration addressing mode. That is, whichever addressing mode the outer iteration addressing mode uses, the resulting data is identical, for example the multiple tensor element parameters contained in the shapes (1), (2), (3) and (4) of FIG. 9; the only difference is that with the new slice addressing mode the parameters contained in one shape are obtained per sliding step, whereas with the other addressing modes multiple addressing operations (sliding multiple steps) are needed to obtain the parameters that the new slice addressing mode obtains in a single step.

Based on the same inventive concept, an embodiment of the present application further provides an electronic device. As shown in FIG. 11, the electronic device includes a memory and the AI chip described above. The AI chip is connected to the memory and is configured to obtain, according to the tensor addressing mode, the source operands required by a vector operation instruction from the original tensor data stored in the memory, and to perform an operation on the obtained source operands according to the vector operation instruction to obtain an operation result.

The memory is used to store the original tensor data required for the operation, and may be any of various common memories, for example a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM). The random access memory may in turn be a static random access memory (SRAM) or a dynamic random access memory (DRAM). In addition, the memory may be a single data rate (SDR) memory or a double data rate (DDR) memory.

To better understand the interaction flow between the AI chip and the memory, take a vector operation instruction with three operands, such as D=A*B+C, as an example. First, the original tensor data containing operand A (e.g. H*W=4*4) is written into the memory, the original tensor data containing operand B (e.g. H*W=4*4) is written into the memory, and the original tensor data containing operand C (e.g. H*W=4*4) is written into the memory. The engine unit obtains operand A of the vector operation instruction from the original tensor data containing operand A according to the tensor addressing mode and writes it into vector register 1; it obtains operand B from the original tensor data containing operand B according to the tensor addressing mode and writes it into vector register 2; and it obtains operand C from the original tensor data containing operand C according to the tensor addressing mode and writes it into vector register 3. When a tensor operation needs to be performed, the vector operation unit directly obtains the source operand A corresponding to the vector operation instruction from vector register 1, the source operand B from vector register 2, and the source operand C from vector register 3, performs the operation, and obtains the operation result.

Assuming the tensor addressing mode [start:stop:step:size] = 0:4:1:3, then for each 4*4 block of original tensor data the engine unit generates 3*3*4 addresses (two window start positions per dimension, i.e. four 3*3 windows, 36 addresses in total), then reads the data at these addresses to obtain 3*3*4 data elements (i.e. the operands) and sends them to the vector register for storage. The principle by which the engine unit addresses according to the 0:4:1:3 addressing mode is shown in FIG. 12.
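The address generation for [start:stop:step:size] = 0:4:1:3 over a 4*4 tensor can be modeled in software as below (a sketch only; the real engine unit emits these addresses in hardware). Window starts 0 and 1 fit per dimension, giving 2*2 windows of 3*3 elements each:

```python
def gen_addresses(start, stop, step, size, width):
    """Flat row-major addresses produced by one 2-D pass of the new slice mode."""
    # Window origins that fit entirely before `stop` along one dimension.
    starts = [p for p in range(start, stop, step) if p + size <= stop]
    addrs = []
    for r0 in starts:            # window origin, row
        for c0 in starts:        # window origin, column
            for r in range(r0, r0 + size):
                for c in range(c0, c0 + size):
                    addrs.append(r * width + c)   # row-major flat address
    return addrs

addrs = gen_addresses(0, 4, 1, 3, width=4)
print(len(addrs))  # 36 addresses = 3*3 elements per window * 4 windows
```

Each step of the pointer thus contributes a whole 3*3 block of addresses, which is what lets a single instruction gather all 36 operand addresses.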

It can be understood that, provided the storage space of the memory is sufficient, the above process may first write the original tensor data containing all three operands into the memory and then perform addressing sequentially according to the tensor addressing mode, writing the addressed data into the vector registers. Alternatively, the original tensor data containing one of the operands may be written into the memory first, that data addressed according to the tensor addressing mode and the addressed operand written into a vector register, after which the original tensor data containing another operand is written into the memory, addressed according to the tensor addressing mode, and the addressed operand written into a vector register, and so on. This application does not limit the order in which the original tensor data of the operands is written into the memory, nor the order in which the addressed operands are written from the memory into the vector registers.

In addition, in the above process, the original tensor data containing the different operands may be addressed by the same addressing engine, or by different addressing engines.

The above electronic device may be, but is not limited to, a mobile phone, a tablet, a computer, a server, a vehicle-mounted device, a wearable device, an edge box, or another electronic device.

Based on the same inventive concept, an embodiment of the present application further provides a tensor processing method, as shown in FIG. 13. The principle of the tensor processing method is described below with reference to FIG. 13. The method can be applied to the aforementioned AI chip and to the aforementioned electronic device.

S1: Obtain the source operands required by a vector operation instruction from tensor data according to a tensor addressing mode, where the number of source operands obtained by one addressing operation of the tensor addressing mode is greater than or equal to 1.

The tensor addressing mode in the embodiments of this application improves and extends the existing slice addressing mode so that it contains more parameters. For example, the parameters of the tensor addressing mode include: a start address characterizing the addressing start point (e.g. denoted start), an end address characterizing the addressing end point (e.g. denoted stop), a step size characterizing the offset amplitude of the addressing pointer (e.g. denoted step), and a size characterizing the shape of the data obtained by addressing (e.g. denoted size). The expression of the tensor addressing mode is then [start:stop:step:size], where size describes the size of the data shape obtained by addressing, that is, at each step all the element points contained in one shape are extracted, rather than just one point. Thus a single addressing operation can determine the addresses of the tensor elements corresponding to at least one source operand, which greatly improves the efficiency of computing those addresses, so that only a few instructions are needed to find the required source operands, greatly reducing redundant instructions, increasing effective instruction density, improving performance, and simplifying programming.
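The contrast with the classic slice mode can be shown concretely. In the sketch below (a software model with invented helper names, not the hardware behavior), a classic start:stop:step slice yields one element point per step, while the extended start:stop:step:size form yields one whole size-shaped block per step:

```python
import numpy as np

data = np.arange(16)

# Classic slice 0:16:4 -> one element point per step.
classic = data[0:16:4]   # 0, 4, 8, 12

# New slice mode 0:16:4:3 -> a 3-element block per step (complete blocks only).
def new_slice(arr, start, stop, step, size):
    return [arr[p:p + size] for p in range(start, stop, step) if p + size <= stop]

blocks = new_slice(data, 0, 16, 4, 3)
print([b.tolist() for b in blocks])
```

With size in play, one addressing step delivers several operands at once, which is exactly the "number of source operands acquired by one addressing is greater than or equal to 1" property stated in S1.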

In one implementation, the engine unit in the above AI chip may obtain the source operands required by the vector operation instruction from the original tensor data stored in an external memory according to the tensor addressing mode.

S2: Store the obtained source operands into vector registers.

After the engine unit obtains the source operands according to the tensor addressing mode, it can store them into the vector registers, where different source operands can be stored in different vector registers, so that when the operands are subsequently read for the operation they can be read in parallel, thereby improving efficiency.

S3: Obtain the source operands required by the vector operation instruction from the vector registers and perform the operation to obtain an operation result.

After the source operands required by the vector operation instruction are obtained, the obtained source operands can be operated on according to the vector operation instruction to obtain the operation result. In an optional implementation, the vector operation unit in the above AI chip obtains the source operands required by the vector operation instruction from the vector registers, performs the operation, and obtains the operation result.

It can be understood that, based on the content disclosed in the embodiments of this application regarding the tensor addressing mode, efficient addressing can be performed at every stage of tensor processing. For example, in a computation between tensors, apart from the effective operation step that produces the final result, both reading the tensors to be computed and writing back the computed tensor result involve computing the addresses of tensor elements: reading a tensor to be computed requires reading data according to the addresses of its tensor elements, and after the tensors to be computed have been read from the corresponding addresses and the operation between them performed, at the stage of outputting the result it may in some scenarios be necessary to write the operation result to the addresses used to store the tensor result, a process that may also require computing the addresses of tensor elements. Based on the principle of the tensor addressing mode disclosed in this application, tensors can be read and written quickly while being addressed efficiently.

The implementation principle and the technical effects of the tensor processing method provided by the embodiments of this application are the same as those of the foregoing AI chip embodiments. For brevity, for matters not mentioned in the method embodiments, reference may be made to the corresponding content in the foregoing AI chip embodiments.

It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts of the embodiments, reference may be made to one another.

The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in this application, and all of them shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (9)

1. An AI chip, comprising:
an engine unit, configured to acquire source operands required by a vector operation instruction from original tensor data according to a tensor addressing mode, wherein the number of source operands acquired by one addressing operation of the tensor addressing mode is greater than or equal to 1, and the parameters of the tensor addressing mode comprise: a start address characterizing an addressing start point, an end address characterizing an addressing end point, a step size characterizing the offset amplitude of an addressing pointer, and a size characterizing the shape of data obtained by addressing, wherein the tensor elements extracted at each step are the element points contained in one data shape;
a vector register, connected to the engine unit and configured to store the source operands acquired by the engine unit;
and a vector operation unit, connected to the vector register and configured to acquire the source operands of the vector operation instruction from the vector register and perform an operation to obtain an operation result.
2. The AI chip of claim 1, wherein the parameters of the tensor addressing mode further include a characteristic parameter characterizing that a data shape obtained by addressing is preserved as an incomplete shape.
3. The AI chip of claim 1, wherein the tensor addressing mode comprises a nested dual tensor addressing mode, the dual tensor addressing mode comprising: an outer iteration addressing mode and an inner iteration addressing mode, wherein the inner iteration addressing mode performs addressing on the tensor data obtained by addressing in the outer iteration addressing mode.
4. The AI chip of any of claims 1-3, wherein the engine unit is further coupled to an external memory for storing the original tensor data, and the engine unit includes:
one or more addressing engines, each addressing engine being configured to acquire the source operands required by a vector operation instruction from the corresponding original tensor data according to its respective tensor addressing mode.
5. The AI chip of claim 4, wherein the engine unit further includes:
a main engine, configured to send control commands to each addressing engine, so as to control each addressing engine to perform addressing according to its respective tensor addressing mode and acquire the source operands required by the vector operation instruction from the corresponding original tensor data.
6. An electronic device, comprising:
a memory, configured to store the tensor data required for operation; and
the AI chip of any of claims 1-5, wherein the AI chip is coupled to the memory and is configured to obtain the source operands required by a vector operation instruction from the original tensor data stored in the memory according to a tensor addressing mode, and to perform an operation on the obtained source operands according to the vector operation instruction to obtain an operation result.
7. A tensor processing method, comprising:
obtaining source operands required by a vector operation instruction from original tensor data according to a tensor addressing mode, wherein the number of source operands obtained by one addressing operation of the tensor addressing mode is greater than or equal to 1;
storing the obtained source operands into a vector register, wherein the parameters of the tensor addressing mode comprise: a start address characterizing an addressing start point, an end address characterizing an addressing end point, a step size characterizing the offset amplitude of an addressing pointer, and a size characterizing the shape of data obtained by addressing, wherein the tensor elements extracted at each step are the element points contained in one data shape;
and acquiring the source operands of the vector operation instruction from the vector register and performing an operation to obtain an operation result.
8. The method of claim 7, wherein obtaining source operands required by a vector operation instruction from raw tensor data according to a tensor addressing mode comprises:
acquiring, by an addressing engine under a control command sent by a main engine, the source operands required by the vector operation instruction from the original tensor data according to the tensor addressing mode.
9. The method of claim 8, wherein the control command comprises: a combined control command formed by combining at least one of the three control commands, the main engine sending the combined control command each time it sends a control command to each addressing engine, so as to control each addressing engine to perform addressing in different dimensions of the original tensor data according to its respective tensor addressing mode.
CN202211597989.XA 2022-12-14 2022-12-14 A kind of AI chip, tensor processing method and electronic equipment Active CN115658146B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211597989.XA CN115658146B (en) 2022-12-14 2022-12-14 A kind of AI chip, tensor processing method and electronic equipment
PCT/CN2023/096078 WO2024124807A1 (en) 2022-12-14 2023-05-24 Ai chip, tensor processing method, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211597989.XA CN115658146B (en) 2022-12-14 2022-12-14 A kind of AI chip, tensor processing method and electronic equipment

Publications (2)

Publication Number Publication Date
CN115658146A CN115658146A (en) 2023-01-31
CN115658146B true CN115658146B (en) 2023-03-31

Family

ID=85017314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211597989.XA Active CN115658146B (en) 2022-12-14 2022-12-14 A kind of AI chip, tensor processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN115658146B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024124807A1 (en) * 2022-12-14 2024-06-20 成都登临科技有限公司 Ai chip, tensor processing method, and electronic device
CN116821019B (en) * 2023-08-30 2023-11-14 腾讯科技(深圳)有限公司 Data processing method, computer equipment and chip
CN117170750B (en) * 2023-09-04 2024-04-30 上海合芯数字科技有限公司 Multi-source operand instruction scheduling method, device, processor, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941789A (en) * 2018-09-21 2020-03-31 北京地平线机器人技术研发有限公司 Tensor operation method and device
CN110968345A (en) * 2018-09-29 2020-04-07 英特尔公司 Architecture and method for data parallel Single Program Multiple Data (SPMD) execution

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491359B (en) * 2016-04-22 2019-12-24 北京中科寒武纪科技有限公司 Submatrix operation device and method
US10534607B2 (en) * 2017-05-23 2020-01-14 Google Llc Accessing data in multi-dimensional tensors using adders
CN111079917B (en) * 2018-10-22 2023-08-11 北京地平线机器人技术研发有限公司 Tensor data block access method and device
GB2580327B (en) * 2018-12-31 2021-04-28 Graphcore Ltd Register files in a multi-threaded processor
US11016929B2 (en) * 2019-03-15 2021-05-25 Intel Corporation Scalar core integration
US11422781B1 (en) * 2020-02-04 2022-08-23 Xilinx, Inc. Generation of vector codes for tensor convolutions
CN111767508B (en) * 2020-07-09 2024-02-23 地平线(上海)人工智能技术有限公司 Method, device, medium and equipment for computing tensor data by computer
CN114489806B (en) * 2021-01-26 2024-08-09 沐曦集成电路(上海)有限公司 Processor device, multidimensional data reading and writing method, and computing equipment
CN113448739B (en) * 2021-08-31 2022-02-11 阿里云计算有限公司 Data processing method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941789A (en) * 2018-09-21 2020-03-31 北京地平线机器人技术研发有限公司 Tensor operation method and device
CN110968345A (en) * 2018-09-29 2020-04-07 英特尔公司 Architecture and method for data parallel Single Program Multiple Data (SPMD) execution

Also Published As

Publication number Publication date
CN115658146A (en) 2023-01-31

Similar Documents

Publication Publication Date Title
CN115658146B (en) A kind of AI chip, tensor processing method and electronic equipment
CN110990060B (en) Embedded processor, instruction set and data processing method of storage and computation integrated chip
US11294675B2 (en) Writing prefetched data into intra-core caches of cores identified by prefetching instructions
WO2019127517A1 (en) Data processing method and device, dma controller, and computer readable storage medium
JP5461533B2 (en) Local and global data sharing
TWI743627B (en) Method and device for accessing tensor data
CN102043729B (en) Memory management method and system of dynamic random access memory
CN112997147B (en) Vector registers implemented in memory
US20240281167A1 (en) In-memory associative processing for vectors
CN115599442B (en) A kind of AI chip, electronic equipment and tensor processing method
JP2021508861A (en) Neural network processing methods, computer systems and storage media
CN102262611B (en) 16-site RISC (Reduced Instruction-Set Computer) CUP (Central Processing Unit) system structure
WO2024124807A1 (en) Ai chip, tensor processing method, and electronic device
WO2022116067A1 (en) Data processing method and data processing apparatus for flash memory
CN101371248B (en) Configurable single instruction multiple data unit
CN114115507B (en) Memory and method for writing data
WO2019223383A1 (en) Direct memory access method and device, dedicated computing chip and heterogeneous computing system
US20170091147A1 (en) Fast fourier transform (fft) custom address generator
CN116432718A (en) Data processing method, device, equipment and readable storage medium
JPS5995660A (en) Data processor
US12182409B2 (en) Backward compatible processing-in-memory (PIM) protocol
CN114550207B (en) Method and device for detecting key points of neck and method and device for training detection model
US11842273B2 (en) Neural network processing
US20210240473A1 (en) Processor device
CN111814675B (en) Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 610095 No. 215, building 6, No. 1480, north section of Tianfu Avenue, high tech Zone, Chengdu, Sichuan

Patentee after: Chengdu Denglin Technology Co.,Ltd.

Country or region after: China

Patentee after: Suzhou Denglin Technology Co.,Ltd.

Address before: 610095 No. 215, building 6, No. 1480, north section of Tianfu Avenue, high tech Zone, Chengdu, Sichuan

Patentee before: Chengdu Denglin Technology Co.,Ltd.

Country or region before: China

Patentee before: Shanghai Denglin Technology Co.,Ltd.
