
CN118012505A - Artificial intelligence processors, integrated circuit chips, boards, electronic devices - Google Patents


Info

Publication number
CN118012505A
Authority
CN
China
Prior art keywords
processing
data
instruction
artificial intelligence
processing circuit
Prior art date
Legal status
Pending
Application number
CN202410218141.4A
Other languages
Chinese (zh)
Inventor
(Name withheld at the inventor's request)
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN202410218141.4A
Publication of CN118012505A
Legal status: Pending

Classifications

    All under section G (Physics), class G06F (Electric digital data processing):
    • G06F 9/30145: Instruction analysis, e.g. decoding, instruction word fields
    • G06F 15/781: On-chip cache; off-chip memory (under G06F 15/7807, system on chip)
    • G06F 9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look-ahead
    • G06F 9/3867: Concurrent instruction execution using instruction pipelines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Advance Control (AREA)

Abstract

The present disclosure provides a computing device, an integrated circuit chip, a board, an electronic apparatus, and a method of performing arithmetic operations using the computing device. The computing device may be included in a combined processing device that may also include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete the computing operation designated by the user. The combined processing device may further comprise a storage device connected to the computing device and to the other processing devices, respectively, for storing their data. The disclosed scheme can improve the efficiency of operations in various data-processing fields, including the field of artificial intelligence, thereby reducing the overall cost and overhead of such operations.

Description

Artificial Intelligence Processor, Integrated Circuit Chip, Board, and Electronic Device

This application is a divisional application of the patent application with application number 202010619460.8, filed on June 30, 2020, and entitled "Computing device, integrated circuit chip, board, electronic device, and computing method".

Technical Field

The present disclosure relates generally to the field of computing. More specifically, it relates to a computing device, an integrated circuit chip, a board, an electronic device, and a computing method.

Background

In a computing system, an instruction set is the collection of instructions used to perform computation and to control the system, and it plays a key role in improving the performance of the system's computing chips (e.g., processors). With their associated instruction sets, today's computing chips (in particular, chips for artificial intelligence) can carry out a variety of general-purpose or special-purpose control and data-processing operations. However, current instruction sets still suffer from several shortcomings. For example, existing instruction sets are constrained by the hardware architecture and offer poor flexibility. Further, many instructions can complete only a single operation, so performing multiple operations typically requires multiple instructions, which potentially increases on-chip I/O data traffic. In addition, current instructions leave room for improvement in execution speed, execution efficiency, and the power consumption they impose on the chip.

In addition, the arithmetic instructions of a traditional CPU are designed to perform basic single-data scalar operations, where "single-data" means that each operand of an instruction is a single scalar value. However, in tasks such as image processing and pattern recognition, the operands are often multi-dimensional vectors (i.e., tensor data), and scalar operations alone cannot let the hardware complete such tasks efficiently. How to execute multi-dimensional tensor operations efficiently is therefore another pressing problem in the computing field.

Summary of the Invention

To address at least the problems of the prior art described above, the present disclosure provides a hardware-architecture platform and an associated instruction scheme. With the disclosed scheme, instruction flexibility can be increased, instruction-execution efficiency improved, and computing cost and overhead reduced. Further, on top of the aforementioned hardware architecture, the disclosed scheme supports efficient memory access to and processing of tensor data, so that when a computing instruction includes multi-dimensional vector operands, tensor operations are accelerated and the computational overhead they incur is reduced.

In a first aspect, the present disclosure provides a computing device comprising a master processing circuit and at least one slave processing circuit, wherein:

the master processing circuit is configured to perform a master operation in response to a master instruction,

the slave processing circuit is configured to perform a slave operation in response to a slave instruction,

wherein the master operation includes a pre-processing operation and/or a post-processing operation for the slave operation; the master instruction and the slave instruction are obtained by parsing a computing instruction received by the computing device; an operand of the computing instruction includes a descriptor indicating the shape of a tensor; and the descriptor is used to determine the storage address of the data corresponding to the operand,

wherein the master processing circuit and/or the slave processing circuit are configured to perform their respective master operation and/or slave operation according to the storage address.

In a second aspect, the present disclosure provides an integrated circuit chip that includes the computing device mentioned in the preceding aspect and described in the embodiments that follow.

In a third aspect, the present disclosure provides a board that includes the integrated circuit chip mentioned in the preceding aspect and described in the embodiments that follow.

In a fourth aspect, the present disclosure provides an electronic device that includes the integrated circuit chip mentioned in the preceding aspect and described in the embodiments that follow.

In a fifth aspect, the present disclosure provides a method of performing a computing operation using the aforementioned computing device, wherein the computing device comprises a master processing circuit and at least one slave processing circuit, the method comprising:

configuring the master processing circuit to perform a master operation in response to a master instruction,

configuring the slave processing circuit to perform a slave operation in response to a slave instruction,

wherein the master operation includes a pre-processing operation and/or a post-processing operation for the slave operation; the master instruction and the slave instruction are obtained by parsing a computing instruction received by the computing device; an operand of the computing instruction includes a descriptor indicating the shape of a tensor; and the descriptor is used to determine the storage address of the data corresponding to the operand,

wherein the method further comprises configuring the master processing circuit and/or the slave processing circuit to perform their respective master operation and/or slave operation according to the storage address.

With the computing device, integrated circuit chip, board, electronic device, and method disclosed herein, the master and slave instructions associated with the master and slave operations can be executed efficiently, thereby accelerating the execution of computing operations. Further, because master and slave operations are combined, the disclosed computing device can support more types of computation. In addition, based on the pipelined arrangement of the disclosed computing device, the computing instructions can be configured flexibly to meet computational requirements.

Brief Description of the Drawings

The above and other objects, features, and advantages of the exemplary embodiments of the present disclosure will become readily understood by reading the detailed description below with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and identical or corresponding reference numerals indicate identical or corresponding parts, in which:

FIG. 1a is a schematic diagram of a computing device according to an embodiment of the present disclosure;

FIG. 1b is a schematic diagram of a data storage space according to an embodiment of the present disclosure;

FIG. 2 is a block diagram of a computing device according to an embodiment of the present disclosure;

FIG. 3 is a block diagram of the master processing circuit of a computing device according to an embodiment of the present disclosure;

FIGS. 4a, 4b, and 4c are schematic diagrams of matrix conversions performed by a data conversion circuit according to an embodiment of the present disclosure;

FIG. 5 is a block diagram of a slave processing circuit of a computing device according to an embodiment of the present disclosure;

FIG. 6 is a structural diagram of a combined processing device according to an embodiment of the present disclosure; and

FIG. 7 is a schematic structural diagram of a board according to an embodiment of the present disclosure.

Detailed Description

The scheme of the present disclosure uses a hardware architecture of a master processing circuit and at least one slave processing circuit to perform the associated data operations, so that relatively complex computations can be completed with relatively flexible, simplified computing instructions. Specifically, the scheme uses a master instruction and a slave instruction obtained by parsing a computing instruction: the master processing circuit executes the master instruction to implement the master operation, and the slave processing circuit executes the slave instruction to implement the slave operation, thereby realizing various complex computations including, for example, vector operations. Here, the master operation may include a pre-processing operation and/or a post-processing operation for the slave operation. In one embodiment, the pre-processing operation may be, for example, a data-conversion operation and/or a data-splicing operation. In another embodiment, the post-processing operation may be, for example, an arithmetic operation on the result output by the slave processing circuit. In some scenarios, when an operand of the computing instruction includes a descriptor indicating the shape of a tensor, the disclosed scheme uses that descriptor to determine the storage address of the data corresponding to the operand. On this basis, the master processing circuit and/or the slave processing circuit may be configured to perform their respective master operation and/or slave operation according to the storage address, where the master and/or slave operations may involve various computations on tensor data. In addition, depending on the arithmetic circuits or operators in the master processing circuit, the disclosed computing instructions support flexible, customized configuration to suit different application scenarios.

The technical solutions of the present disclosure are described in detail below with reference to the accompanying drawings.

FIG. 1a is a schematic diagram of a computing device 100 according to an embodiment of the present disclosure. As shown in FIG. 1a, the computing device 100 may include a master processing circuit 102 and slave processing circuits, such as the slave processing circuits 104, 106, and 108 shown in the figure. Although three slave processing circuits are shown here, those skilled in the art will appreciate that the computing device 100 may include any suitable number of slave processing circuits, and that the slave processing circuits may be connected to one another, and to the master processing circuit, in different ways; the present disclosure imposes no limitation in this regard. In one or more embodiments, the multiple slave processing circuits may execute various slave instructions (obtained, for example, by parsing a computing instruction) in parallel to improve the processing efficiency of the computing device.

In the context of the present disclosure, a computing instruction may be an instruction in the instruction system that forms the interface between software and hardware; it may be machine language, in binary or another form, that hardware such as a processor (or processing circuit) receives and processes. A computing instruction may include an opcode indicating the operation of the processor and one or more operands. Depending on the application scenario, a computing instruction may include one or more opcodes, and when it includes a single opcode, that opcode may indicate multiple operations of the processor. According to the disclosed scheme, an operand may include a descriptor indicating the shape of a tensor, and the descriptor may be used to determine the storage address of the data corresponding to the operand.
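
The opcode-plus-descriptor-operand structure described above can be modeled in software. The following is a hypothetical sketch; the class names and fields are illustrative assumptions, not the patent's actual encoding.

```python
# Hypothetical, software-level model of a computing instruction whose operand
# carries a tensor-shape descriptor. Names are illustrative, not from the patent.
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass
class Descriptor:
    """Indicates the shape of a tensor operand."""
    ident: int               # descriptor identifier, e.g. its number
    shape: Tuple[int, ...]   # size of each dimension, e.g. (2, 4)

@dataclass
class Operand:
    value: Union[int, Descriptor]   # an immediate scalar or a tensor descriptor

@dataclass
class Instruction:
    opcode: str                     # indicates the processor operation(s)
    operands: Tuple[Operand, ...]

inst = Instruction("CONV", (Operand(Descriptor(ident=0, shape=(2, 4))),))
print(inst.opcode)  # CONV
```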

In one embodiment, the master instruction and the slave instruction may be obtained by parsing a computing instruction received by the computing device. In operation, the master processing circuit may be configured to perform the master operation in response to the master instruction, and the slave processing circuit may be configured to perform the slave operation in response to the slave instruction. According to the disclosed scheme, the aforementioned master or slave instruction may be a micro-instruction or a control signal running inside the processor, and may include (or indicate) one or more operations. When an operand of the computing instruction includes a descriptor as described above, the master processing circuit and/or the slave processing circuit may be configured to access the tensor according to the storage address obtained from the descriptor. Through this descriptor-based memory-access mechanism, the disclosed scheme can significantly increase the speed of reading and storing tensor data during tensor computations, thereby accelerating the computation and reducing its overhead.

In one embodiment, the aforementioned master operation may include a pre-processing operation and/or a post-processing operation for the slave operation. Specifically, the master instruction executed by the master processing circuit may include, for example, a pre-processing operation that performs data conversion and/or data splicing on the data that will take part in the computation. In some application scenarios, the master instruction may also include a pre-processing operation that merely reads data selectively, for example reading out data stored in a dedicated or private buffer and sending it to the slave processing circuit, or generating random numbers for the slave processing circuit's computation. In other application scenarios, depending on the type and number of operators included in the master processing circuit, the master instruction may include one or more post-processing operations associated with the operators' functions. For example, the master instruction may include operations such as addition, multiplication, table lookup, comparison, averaging, and filtering applied to the intermediate or final results obtained after the slave processing circuit executes the slave instruction. In some application scenarios, those intermediate or final results may be tensors as described above, and their storage addresses may be obtained from the descriptors of the present disclosure.
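
The division of labor above (master-side pre-processing, slave computation, master-side post-processing) can be sketched as a chain of plain functions. This is a hedged software analogy; the concrete operations (int-to-float conversion, squaring, averaging) are illustrative assumptions, not the patent's operations.

```python
# Hedged, software-level sketch of pre-processing -> slave operation ->
# post-processing. All concrete operations here are illustrative assumptions.

def pre_process(chunks):
    # master-side pre-processing: data conversion (int -> float) and splicing
    return [float(x) for chunk in chunks for x in chunk]

def slave_op(xs):
    # stand-in for the slave operation (e.g. an element-wise computation)
    return [x * x for x in xs]

def post_process(results):
    # master-side post-processing, here averaging the slave output
    return sum(results) / len(results)

out = post_process(slave_op(pre_process([[1, 2], [3]])))
print(out)  # (1.0 + 4.0 + 9.0) / 3
```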

To make pre-processing and/or post-processing operations easy to identify, in some application scenarios the master instruction may include an identification bit that marks the pre-processing and/or post-processing operation. Thus, upon fetching the master instruction, the master processing circuit can determine from the identification bit whether to perform a pre-processing or a post-processing operation on the operation data. Additionally or alternatively, the pre-processing and post-processing operations in the master instruction may be distinguished by preset positions (i.e., instruction fields) of the computing instruction. For example, when the computing instruction uses a preset layout of (master instruction + slave instruction), it can be determined that the master instruction in this computing instruction involves a pre-processing operation for the slave operation. Likewise, when the computing instruction uses a preset layout of (slave instruction + master instruction), it can be determined that the master instruction involves a post-processing operation on the slave operation. For ease of understanding, suppose the computing instruction consists of three segments of predetermined bit widths (the preset positions mentioned above): the instruction in the first segment can then be designated as the master instruction for the pre-processing operation, the instruction in the middle segment as the slave instruction for the slave operation, and the instruction in the last segment as the master instruction for the post-processing operation.
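
The three-segment layout described above can be illustrated with a small bit-field split. The 16-bit width of each segment is an assumption made purely for the example; the disclosure does not fix any particular widths.

```python
# Illustrative only: the 16-bit width per segment is an assumption. This merely
# demonstrates the three-segment layout:
# (pre-processing master | slave | post-processing master).

SEG_BITS = 16
SEG_MASK = (1 << SEG_BITS) - 1

def split_compute_instruction(word):
    """Split a 48-bit computing instruction into its three segments."""
    pre   = (word >> (2 * SEG_BITS)) & SEG_MASK   # first segment: pre-processing
    slave = (word >> SEG_BITS) & SEG_MASK         # middle segment: slave op
    post  = word & SEG_MASK                       # last segment: post-processing
    return pre, slave, post

word = (0x00A1 << 32) | (0x0B02 << 16) | 0x0C03
print(split_compute_instruction(word))  # (161, 2818, 3075)
```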

The slave instruction executed by the slave processing circuit may include one or more operations associated with the functions of one or more arithmetic circuits in the slave processing circuit. The slave instruction may include operations performed on data that the master processing circuit has pre-processed. In some application scenarios, the slave instruction may include various operations such as arithmetic operations, logical operations, and data-type conversion. For example, the slave instruction may perform various vector-related multiply-accumulate operations, including for example convolution, on the pre-processed data. In other application scenarios, when the computing instruction contains no master instruction for a pre-processing operation, the slave processing circuit may also perform the slave operation directly on the input data according to the slave instruction.

In one or more embodiments, the master processing circuit 102 may be configured to obtain a computing instruction and parse it to produce the aforementioned master and slave instructions, and then send the slave instruction to the slave processing circuit. Specifically, the master processing circuit may include one or more decoding circuits (decoders) for parsing computing instructions. Through its internal decoding circuit, the master processing circuit can parse a received computing instruction into one or more master and/or slave instructions and send the corresponding slave instructions to the slave processing circuits for execution of the slave operations. Depending on the application scenario, the slave instruction can be delivered to the slave processing circuit in several ways. For example, when the computing device includes a storage circuit, the master processing circuit may send the slave instruction to the storage circuit, which forwards it to the slave processing circuit. As another example, when multiple slave processing circuits operate in parallel, the master processing circuit may broadcast the same slave instruction to all of them. Additionally or optionally, in some hardware architectures the computing device may include a separate circuit, unit, or module dedicated to parsing the computing instructions the device receives, as in the architecture described later in conjunction with FIG. 2.
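
The decode-and-broadcast behavior above can be sketched as follows. The dict-based "instruction" format and all names here are illustrative assumptions, not the patent's interfaces.

```python
# Hypothetical decode-and-dispatch sketch: the master parses a computing
# instruction into a master part and a slave part, then broadcasts the same
# slave instruction to every slave circuit.

class SlaveCircuit:
    def __init__(self):
        self.queue = []          # pending slave instructions

    def receive(self, slave_inst):
        self.queue.append(slave_inst)

def decode(compute_inst):
    # split the computing instruction into (master, slave) parts
    return compute_inst["main"], compute_inst["slave"]

def dispatch(compute_inst, slaves):
    main, slave = decode(compute_inst)
    for s in slaves:             # broadcast the same slave instruction
        s.receive(slave)
    return main                  # master part stays with the master circuit

slaves = [SlaveCircuit() for _ in range(3)]
main = dispatch({"main": "PRE_CONVERT", "slave": "VEC_MAC"}, slaves)
print(main, [s.queue for s in slaves])
```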

In one or more embodiments, the slave processing circuit of the present disclosure may include multiple arithmetic circuits for performing the slave operation, where these circuits may be connected and configured to perform multi-stage pipelined operations. Depending on the computing scenario, the arithmetic circuits may include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a number-conversion circuit for performing at least vector operations. In one embodiment, when the disclosed computing device is applied to computations in the field of artificial intelligence, the slave processing circuit may perform multi-dimensional convolution operations of a neural network according to the slave instruction.
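
The multi-stage pipeline idea can be illustrated by chaining stage functions. This is a hedged software analogy; the particular stage order and the ReLU-like comparison are illustrative assumptions, not the patent's circuit design.

```python
# Hedged sketch of a three-stage pipeline (multiply -> accumulate -> compare)
# modeled as chained functions; the stage set is an illustrative assumption.

def multiply_stage(a, b):
    return [x * y for x, y in zip(a, b)]   # element-wise vector multiply

def accumulate_stage(products):
    return sum(products)                    # accumulation stage

def compare_stage(acc, threshold=0):
    return max(acc, threshold)              # comparison stage

def pipeline(a, b):
    # data flows through the stages in order, as in a hardware pipeline
    return compare_stage(accumulate_stage(multiply_stage(a, b)))

print(pipeline([1, 2, 3], [4, -5, 6]))  # 1*4 + 2*(-5) + 3*6 = 12
```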

As mentioned above, the master operation and/or the slave operation of the present disclosure may also include various types of operations on tensor data. To this end, the disclosed scheme proposes using descriptors to obtain information about the shape of a tensor in order to determine the storage address of the tensor data, so that the tensor data can be fetched and stored through that address.

In one possible implementation, a descriptor may indicate the shape of N-dimensional tensor data, where N is a positive integer (e.g., N = 1, 2, or 3) or zero. A tensor may take many forms of data organization and may have different dimensionalities: a scalar can be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and a matrix as a tensor of two or more dimensions. The shape of a tensor includes information such as its number of dimensions and the size of each dimension. For example, for a tensor:

The shape of this tensor can be described by a descriptor as (2, 4); that is, two parameters indicate that the tensor is two-dimensional, the size of its first dimension (columns) is 2, and the size of its second dimension (rows) is 4. It should be noted that the present application does not limit the manner in which a descriptor indicates the shape of a tensor.
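As a purely illustrative software sketch (not part of the disclosed hardware), the shape information that a descriptor records for the example above can be derived as follows, treating a nested Python list as the tensor:

```python
def tensor_shape(data):
    """Return the shape of a nested list as a tuple of per-dimension sizes."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)

# A two-dimensional tensor whose first dimension has size 2
# and whose second dimension has size 4:
t = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(tensor_shape(t))  # (2, 4)
```

A scalar would yield the empty shape (), matching the view of a scalar as a 0-dimensional tensor.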

In one possible implementation, the value of N may be determined according to the dimensionality (order) of the tensor data, or set according to how the tensor data is to be used. For example, when N is 3, the tensor data is three-dimensional, and the descriptor may be used to indicate the shape of the three-dimensional tensor data in the three dimension directions (e.g., offsets, sizes, etc.). It should be understood that those skilled in the art can set the value of N according to actual needs; the present disclosure does not limit this.

In one possible implementation, the descriptor may include an identifier of the descriptor and/or the content of the descriptor. The identifier is used to distinguish descriptors from one another; for example, the identifier may be a serial number. The content of the descriptor may include at least one shape parameter representing the shape of the tensor data. For example, if the tensor data is three-dimensional and the shape parameters of two of its three dimensions are fixed, the content of its descriptor may include a shape parameter representing the remaining dimension of the tensor data.

In one possible implementation, the identifier and/or content of the descriptor may be stored in a descriptor storage space (internal memory), such as a register, an on-chip SRAM, or another medium cache. The tensor data indicated by the descriptor may be stored in a data storage space (internal or external memory), such as an on-chip cache or off-chip memory. The present disclosure does not limit the specific locations of the descriptor storage space and the data storage space.

In one possible implementation, the identifier and content of the descriptor, as well as the tensor data indicated by the descriptor, may be stored in the same region of the internal memory. For example, a contiguous region of the on-chip cache, with addresses ADDR0-ADDR1023, may be used to store the relevant content of the descriptor. Within this region, addresses ADDR0-ADDR63 may serve as the descriptor storage space, storing the identifier and content of the descriptor, while addresses ADDR64-ADDR1023 serve as the data storage space, storing the tensor data indicated by the descriptor. In the descriptor storage space, addresses ADDR0-ADDR31 may store the identifier of the descriptor and addresses ADDR32-ADDR63 may store its content. It should be understood that ADDR is not limited to one bit or one byte; it is used here to denote an address, that is, one address unit. Those skilled in the art can determine the descriptor storage space, the data storage space, and their specific addresses according to the actual situation; the present disclosure does not limit this.
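The address layout just described can be sketched in software as follows; this is an illustrative model only, with ADDR treated as a plain integer address unit rather than any particular hardware address width:

```python
# One contiguous on-chip region ADDR0-ADDR1023, split into a descriptor
# storage space (identifiers and content) and a data storage space.
DESC_ID_SPACE   = range(0, 32)     # ADDR0-ADDR31: descriptor identifiers
DESC_BODY_SPACE = range(32, 64)    # ADDR32-ADDR63: descriptor content
DATA_SPACE      = range(64, 1024)  # ADDR64-ADDR1023: tensor data

def region_of(addr):
    """Classify an address unit into the region it belongs to."""
    if addr in DESC_ID_SPACE:
        return "descriptor identifier"
    if addr in DESC_BODY_SPACE:
        return "descriptor content"
    if addr in DATA_SPACE:
        return "tensor data"
    raise ValueError("address outside ADDR0-ADDR1023")

print(region_of(40))   # descriptor content
print(region_of(100))  # tensor data
```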

In one possible implementation, the identifier and content of the descriptor and the tensor data indicated by the descriptor may be stored in different regions of the internal memory. For example, a register may serve as the descriptor storage space, storing the identifier and content of the descriptor, while an on-chip cache serves as the data storage space, storing the tensor data indicated by the descriptor.

In one possible implementation, when a register is used to store the identifier and content of a descriptor, the register number may serve as the identifier of the descriptor. For example, when the register number is 0, the identifier of the descriptor stored in it is set to 0. When the descriptor in the register is valid, a region may be allocated in the cache space, according to the size of the tensor data indicated by the descriptor, for storing that tensor data.

In one possible implementation, the identifier and content of the descriptor may be stored in an internal memory, while the tensor data indicated by the descriptor is stored in an external memory. For example, the identifier and content of the descriptor may be stored on-chip and the tensor data indicated by the descriptor stored off-chip.

In one possible implementation, the data address of the data storage space corresponding to each descriptor may be a fixed address. For example, a separate data storage space may be set aside for tensor data, with the starting address of each piece of tensor data in that space corresponding one-to-one to a descriptor. In this case, the control circuit can determine, from the descriptor alone, the data address in the data storage space of the data corresponding to an operand.

In one possible implementation, when the data address of the data storage space corresponding to a descriptor is a variable address, the descriptor may also be used to indicate the address of the N-dimensional tensor data; in this case, the content of the descriptor may further include at least one address parameter representing the address of the tensor data. For example, if the tensor data is three-dimensional, then when the descriptor points to the address of the tensor data, its content may include a single address parameter representing the address of the tensor data, such as the starting physical address of the tensor data, or multiple address parameters, such as the starting address of the tensor data plus an address offset, or per-dimension address parameters of the tensor data. Those skilled in the art can set the address parameters according to actual needs; the present disclosure does not limit this.

In one possible implementation, the address parameter of the tensor data may include a reference address, in the data storage space of the tensor data, of the data reference point of the descriptor. The reference address differs according to the choice of the data reference point. The present disclosure does not limit the selection of the data reference point.

In one possible implementation, the reference address may include the starting address of the data storage space. When the data reference point of the descriptor is the first data block of the data storage space, the reference address of the descriptor is the starting address of the data storage space. When the data reference point of the descriptor is some data block other than the first, the reference address of the descriptor is the address of that data block in the data storage space.

In one possible implementation, the shape parameters of the tensor data include at least one of the following: the size of the data storage space in at least one of the N dimension directions, the size of the storage region in at least one of the N dimension directions, the offset of the storage region in at least one of the N dimension directions, the positions, relative to the data reference point, of at least two vertices located diagonally in the N dimension directions, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address. Here, the data description position is the mapped position of a point or region in the tensor data indicated by the descriptor. For example, when the tensor data is three-dimensional, the descriptor may use three-dimensional spatial coordinates (x, y, z) to represent the shape of the tensor data, and the data description position of the tensor data may be the position, expressed in those coordinates, of a point or region of the tensor data as mapped into three-dimensional space.

It should be understood that those skilled in the art can select the shape parameters used to represent the tensor data according to the actual situation; the present disclosure does not limit this. By using descriptors during data access, associations between data can be established, thereby reducing the complexity of data access and improving the efficiency of instruction processing.

In one possible implementation, the content of the descriptor of the tensor data may be determined according to the reference address, in the data storage space of the tensor data, of the data reference point of the descriptor, the size of the data storage space in at least one of the N dimension directions, the size of the storage region in at least one of the N dimension directions, and/or the offset of the storage region in at least one of the N dimension directions.

FIG. 1b is a schematic diagram of a data storage space according to an embodiment of the present disclosure. As shown in FIG. 1b, the data storage space 21 stores two-dimensional data in row-major order, addressable by (x, y) (where the X axis points horizontally to the right and the Y axis points vertically downward). The size in the X-axis direction (the size of each row) is ori_x (not shown in the figure), the size in the Y-axis direction (the total number of rows) is ori_y (not shown in the figure), and the starting address PA_start (reference address) of the data storage space 21 is the physical address of the first data block 22. The data block 23 is a portion of the data in the data storage space 21; its offset 25 in the X-axis direction is denoted offset_x, its offset 24 in the Y-axis direction is denoted offset_y, its size in the X-axis direction is denoted size_x, and its size in the Y-axis direction is denoted size_y.

In one possible implementation, when a descriptor is used to define the data block 23, the data reference point of the descriptor may be the first data block of the data storage space 21, and the reference address of the descriptor may be agreed to be the starting address PA_start of the data storage space 21. The content of the descriptor of the data block 23 can then be determined by combining the size ori_x of the data storage space 21 on the X axis and its size ori_y on the Y axis with the offset offset_y of the data block 23 in the Y-axis direction, its offset offset_x in the X-axis direction, its size size_x in the X-axis direction, and its size size_y in the Y-axis direction.
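As an illustrative sketch (the record and field names below are assumptions for the example, not terms defined by the disclosure), the parameters combined above can be grouped into a simple descriptor record:

```python
from dataclasses import dataclass

@dataclass
class Descriptor2D:
    """Shape parameters of a 2-D data block, with PA_start as the agreed reference address."""
    pa_start: int   # reference address: start of the data storage space
    ori_x: int      # size of the data storage space along X (elements per row)
    ori_y: int      # size of the data storage space along Y (number of rows)
    offset_x: int   # offset of the data block along X
    offset_y: int   # offset of the data block along Y
    size_x: int     # size of the data block along X
    size_y: int     # size of the data block along Y

d = Descriptor2D(pa_start=0, ori_x=8, ori_y=6,
                 offset_x=2, offset_y=1, size_x=3, size_y=2)
print(d.size_x * d.size_y)  # number of elements covered by the block: 6
```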

In one possible implementation, the content of the descriptor may be represented by the following formula (1):

D = {ori_x, ori_y, offset_x, offset_y, size_x, size_y}    (1)

It should be understood that although in the above example the content of the descriptor represents a two-dimensional space, those skilled in the art can set the specific number of dimensions represented by the content of the descriptor according to the actual situation; the present disclosure does not limit this.

In one possible implementation, a reference address, in the data storage space, of the data reference point of the descriptor may be agreed upon, and on the basis of that reference address, the content of the descriptor of the tensor data may be determined according to the positions, relative to the data reference point, of at least two vertices located diagonally in the N dimension directions.

For example, a reference address PA_base of the data reference point of the descriptor in the data storage space may be agreed upon. For instance, one datum in the data storage space 21 (e.g., the datum at position (2, 2)) may be selected as the data reference point, and its physical address in the data storage space used as the reference address PA_base. The content of the descriptor of the data block 23 in FIG. 1b can then be determined from the positions of two diagonal vertices relative to the data reference point. First, the positions of at least two diagonal vertices of the data block 23 relative to the data reference point are determined, for example the diagonal vertices in the top-left-to-bottom-right direction, where the relative position of the top-left vertex is (x_min, y_min) and the relative position of the bottom-right vertex is (x_max, y_max). The content of the descriptor of the data block 23 can then be determined from the reference address PA_base, the relative position (x_min, y_min) of the top-left vertex, and the relative position (x_max, y_max) of the bottom-right vertex.
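A minimal sketch of the vertex-based description, assuming inclusive integer coordinates relative to the data reference point:

```python
def block_extent(x_min, y_min, x_max, y_max):
    """Return (size_x, size_y) spanned by two diagonal vertices (inclusive)."""
    return (x_max - x_min + 1, y_max - y_min + 1)

# Top-left vertex (0, 0) and bottom-right vertex (2, 1), relative to the
# data reference point, describe a block 3 elements wide and 2 elements high:
print(block_extent(0, 0, 2, 1))  # (3, 2)
```

The choice of which two diagonal vertices to use is free, as the surrounding text notes; only their relative positions matter for recovering the block's extent.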

In one possible implementation, the content of the descriptor (with reference address PA_base) may be represented by the following formula (2):

D = {(x_min, y_min), (x_max, y_max)}    (2)

It should be understood that although the above example uses the two diagonal vertices at the top-left and bottom-right corners to determine the content of the descriptor, those skilled in the art can choose the specific vertices among the at least two diagonal vertices according to actual needs; the present disclosure does not limit this.

In one possible implementation, the content of the descriptor of the tensor data may be determined according to the reference address, in the data storage space, of the data reference point of the descriptor and the mapping relationship between the data description position and the data address of the tensor data indicated by the descriptor. The mapping relationship between the data description position and the data address may be set according to actual needs; for example, when the tensor data indicated by the descriptor is three-dimensional spatial data, a function f(x, y, z) may be used to define the mapping relationship between the data description position and the data address.
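As an illustrative sketch, a mapping function f(x, y, z) of this kind could, for example, be a row-major layout; the base address and dimension sizes below are assumed example values, not values from the disclosure:

```python
PA_BASE, DIM_X, DIM_Y = 1000, 4, 3  # assumed example values (address units)

def f(x, y, z):
    """Map a data description position (x, y, z) to a data address,
    assuming a row-major layout starting at PA_BASE."""
    return PA_BASE + (z * DIM_Y + y) * DIM_X + x

print(f(0, 0, 0))  # 1000
print(f(1, 2, 1))  # 1000 + (1*3 + 2)*4 + 1 = 1021
```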

In one possible implementation, the content of the descriptor may be represented by the following formula (3):

D = {f(x, y, z)}    (3)

In one possible implementation, the descriptor is further used to indicate the address of the N-dimensional tensor data, in which case the content of the descriptor further includes at least one address parameter representing the address of the tensor data. For example, the content of the descriptor may be:

Here PA is an address parameter, which may be a logical address or a physical address. The descriptor parsing circuit may take PA as any one of a vertex, the middle point, or a preset point of the vector shape, and combine it with the shape parameters in the X and Y directions to obtain the corresponding data address.

In one possible implementation, the address parameter of the tensor data includes a reference address, in the data storage space of the tensor data, of the data reference point of the descriptor, and the reference address includes the starting address of the data storage space.

In one possible implementation, the descriptor may further include at least one address parameter representing the address of the tensor data. For example, the content of the descriptor may be:

Here PA_start is the reference address parameter, which has been described above and is not repeated here.

It should be understood that those skilled in the art can set the mapping relationship between the data description position and the data address according to the actual situation; the present disclosure does not limit this.

In one possible implementation, an agreed reference address may be set for a task, so that the descriptors in all instructions under that task use this reference address, and the descriptor content may include shape parameters based on this reference address. This reference address may be determined by setting the environment parameters of the task. For the description and use of the reference address, refer to the above embodiments. In this implementation, the content of a descriptor can be mapped to a data address more quickly.

In one possible implementation, the reference address may instead be included in the content of each descriptor, in which case the reference address of each descriptor may differ. Compared with setting a common reference address through environment parameters, this approach allows each descriptor to describe data more flexibly and to use a larger data address space.

In one possible implementation, the data address, in the data storage space, of the data corresponding to an operand of a processing instruction may be determined according to the content of the descriptor. The computation of the data address is completed automatically by hardware, and the computation method differs depending on how the content of the descriptor is represented. The present disclosure does not limit the specific method of computing the data address.

For example, if the content of the descriptor in the operand is represented by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y, and its size is size_x * size_y, then the starting data address PA1(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined by the following formula (4):

PA1(x,y) = PA_start + (offset_y - 1) * ori_x + offset_x    (4)

The data starting address PA1(x,y) determined by formula (4), together with the offsets offset_x and offset_y and the region sizes size_x and size_y, determines the storage region, in the data storage space, of the tensor data indicated by the descriptor.

In one possible implementation, when the operand further includes a data description position for the descriptor, the data address, in the data storage space, of the data corresponding to the operand may be determined according to the content of the descriptor together with the data description position. In this way, a subset of the tensor data indicated by the descriptor (e.g., one or more data elements) can be processed.

For example, if the content of the descriptor in the operand is represented by formula (1), the offsets of the tensor data indicated by the descriptor in the data storage space are offset_x and offset_y, its size is size_x * size_y, and the data description position for the descriptor included in the operand is (xq, yq), then the data address PA2(x,y) of the tensor data indicated by the descriptor in the data storage space can be determined by the following formula (5):

PA2(x,y) = PA_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)    (5)
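Formulas (4) and (5) can be transcribed literally into a software sketch (the parameter values in the example are assumed, and addresses are in address units):

```python
def pa1(pa_start, ori_x, offset_x, offset_y):
    """Formula (4): starting address of the tensor data indicated by the descriptor."""
    return pa_start + (offset_y - 1) * ori_x + offset_x

def pa2(pa_start, ori_x, offset_x, offset_y, xq, yq):
    """Formula (5): address of the element at data description position (xq, yq)."""
    return pa_start + (offset_y + yq - 1) * ori_x + (offset_x + xq)

# With PA_start = 0, ori_x = 10 (elements per row), offset_x = 2, offset_y = 3:
print(pa1(0, 10, 2, 3))        # (3-1)*10 + 2 = 22
print(pa2(0, 10, 2, 3, 1, 1))  # (3+1-1)*10 + (2+1) = 33
```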

The computing device of the present disclosure has been described above in conjunction with FIGS. 1a and 1b. By using the computing device of the present disclosure together with master and slave instructions, multiple operations can be completed with a single compute instruction. This reduces the data movement that each instruction would otherwise require when multiple operations must be completed by multiple instructions, alleviates the IO bottleneck of the computing device, and effectively improves computational efficiency while reducing computational overhead. In addition, the scheme of the present disclosure can flexibly set the types and number of operations included in a compute instruction, according to the types of operators configured in the master processing circuit and the functions of the operation circuits configured in the slave processing circuits and through the cooperation of the master and slave processing circuits, so that the computing device can perform many types of computing operations. This expands and enriches the application scenarios of the computing device and meets different computing needs.

In addition, since the master and slave processing circuits can be configured to support multi-stage pipelined operations, the execution efficiency of the operators in the master and slave processing circuits is improved, further shortening computation time. From the above description, those skilled in the art will understand that the hardware architecture shown in FIG. 1a is merely exemplary and not restrictive. Under the disclosure and teaching of the present disclosure, those skilled in the art may also add new circuits or devices to this architecture to implement more functions or operations. For example, a storage circuit may be added to the architecture shown in FIG. 1a to store various instructions and data (e.g., tensor data). Further, the master processing circuit and the slave processing circuits may be arranged at different physical or logical locations and connected through various data interfaces or interconnect units, so that the master operation and the slave operation described above, including the various operations on tensors described in conjunction with FIG. 1b, can be completed through their interaction.

FIG. 2 is a block diagram of a computing device 200 according to an embodiment of the present disclosure. It will be appreciated that the computing device 200 shown in FIG. 2 is a specific implementation of the computing device 100 shown in FIG. 1a, so the details of the master and slave processing circuits of the computing device 100 described in conjunction with FIG. 1a also apply to the computing device 200 shown in FIG. 2.

As shown in FIG. 2, the computing device 200 according to the present disclosure includes a master processing circuit 202 and a plurality of slave processing circuits 204, 206, and 208. Since the operation of the master and slave processing circuits was described in detail above in conjunction with FIG. 1a, it is not repeated here. Besides a master processing circuit and slave processing circuits identical to those of the computing device 100 shown in FIG. 1a, the computing device 200 of FIG. 2 further includes a control circuit 210 and a storage circuit 212. In one embodiment, the control circuit may be configured to fetch the compute instruction and parse it to obtain the master instruction and the slave instruction, send the master instruction to the master processing circuit 202, and send the slave instruction to one or more of the plurality of slave processing circuits 204, 206, and 208. In one scenario, the control circuit may send the parsed slave instruction to the slave processing circuits through the master processing circuit, as shown in FIG. 2. Alternatively, when a connection exists between the control circuit and the slave processing circuits, the control circuit may send the parsed slave instruction to the slave processing circuits directly. Similarly, when a connection exists between the storage circuit and the slave processing circuits, the control circuit may send the slave instruction to the slave processing circuits via the storage circuit. In some computing scenarios, when a compute instruction includes an operand involving a tensor operation, the control circuit may use the descriptors discussed above to determine the storage address of the data corresponding to the operand, such as the starting address of the tensor, and may instruct the master or slave processing circuits to fetch the tensor data participating in the tensor operation from the corresponding storage address in the storage circuit 212 in order to perform the tensor operation.
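As a purely illustrative software sketch (the instruction fields and operation names below are assumptions, not terms from the disclosure), the control circuit's splitting of a compute instruction into a master instruction and a slave instruction can be modeled as:

```python
# A compute instruction carrying a pre-processing (master) part and an
# arithmetic (slave) part; field names are hypothetical.
compute_instruction = {
    "master": {"op": "data_convert", "operand": "tensor_a"},
    "slave":  {"op": "convolution",  "operand": "tensor_a"},
}

def parse(instr):
    """Split a compute instruction into its master and slave instructions."""
    return instr["master"], instr["slave"]

master_instr, slave_instr = parse(compute_instruction)
print(master_instr["op"], slave_instr["op"])  # data_convert convolution
```

The master instruction would then be dispatched to the master processing circuit, and the slave instruction to one or more slave processing circuits, mirroring the dispatch paths described above.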

In one or more embodiments, the storage circuit 212 may store various kinds of computation-related data or instructions, including, for example, the tensors described above. In one scenario, the storage circuit may store neuron or weight data related to neural network operations, or store the final operation result obtained after the master processing circuit performs a post-processing operation. Additionally, the storage circuit may store an intermediate result obtained after the master processing circuit performs a pre-processing operation, or an intermediate result obtained after a slave processing circuit performs an operation. In operations on tensors, such intermediate results may themselves be tensor-type data, read and stored via storage addresses determined by descriptors. In some application scenarios, the storage circuit may serve as the on-chip memory of the computing device 200 and perform data read and write operations with off-chip memory, for example through a direct memory access ("DMA") interface. In some scenarios, when the compute instruction is parsed by the control circuit, the storage circuit may store the operation instructions obtained from that parsing, such as the master instruction and/or the slave instruction. In addition, although the storage circuit is shown as a single block in FIG. 2, depending on the application scenario it may be implemented as a memory comprising a main memory and a main cache, where the main memory may store related operation data such as neurons, weights, and various constant terms, and the main cache module may temporarily store intermediate data, such as data produced by the pre-processing operation and data awaiting the post-processing operation; depending on the configuration, these intermediate data may be invisible to the operator.

In the interaction between the main memory and the main processing circuit, the pipeline operation circuits in the main processing circuit may also perform corresponding operations with the aid of masks stored in the main storage circuit. For example, during a pipeline operation, an operation circuit may read a mask from the main storage circuit and use it to indicate whether the data on which the operation is performed are valid. The main storage circuit not only supports internal storage applications but can also exchange data with storage devices external to the computing device of the present disclosure, for example through direct memory access ("DMA").

FIG. 3 is a block diagram showing a main processing circuit 300 of a computing device according to an embodiment of the present disclosure. It can be understood that the main processing circuit 300 shown in FIG. 3 is the main processing circuit shown and described in conjunction with FIG. 1a and FIG. 2, so the descriptions of the main processing circuit in FIG. 1a and FIG. 2 also apply to the description below in conjunction with FIG. 3.

As shown in FIG. 3, the main processing circuit 300 may include a data processing unit 302, a first group of pipeline operation circuits 304, a last group of pipeline operation circuits 306, and one or more groups of pipeline operation circuits (represented by black circles) located between the two. In one embodiment, the data processing unit 302 includes a data conversion circuit 3021 and a data splicing circuit 3022. As described above, when the master operation includes a pre-processing operation for the slave operation, such as a data conversion operation or a data splicing operation, the data conversion circuit 3021 or the data splicing circuit 3022 performs the corresponding conversion or splicing operation according to the corresponding master instruction. The conversion and splicing operations are illustrated by examples below.

As far as the data conversion operation is concerned, when the bit width of the data input to the data conversion circuit is high (for example, 1024 bits), the data conversion circuit may convert the input data into data of a lower bit width (for example, an output bit width of 512 bits) according to the operation requirements. Depending on the application scenario, the data conversion circuit may support conversion between multiple data types of different bit widths, such as FP16 (16-bit floating point), FP32 (32-bit floating point), FIX8 (8-bit fixed point), FIX4 (4-bit fixed point), and FIX16 (16-bit fixed point). When the data input to the data conversion circuit is a matrix, the data conversion operation may be a transformation of the arrangement of the matrix elements; such a transformation may include, for example, matrix transposition and mirroring (described later in conjunction with FIGS. 4a-4c), matrix rotation by a predetermined angle (for example 90, 180, or 270 degrees), and conversion of matrix dimensions.
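The floating-point-to-fixed-point conversion described above can be sketched in software as follows. This is a minimal illustration only: the function name, the number of fraction bits, and the round-and-saturate behavior are assumptions for the sketch, as the disclosure does not specify the circuit's rounding or saturation rules.

```python
def fp_to_fix(values, frac_bits=4, total_bits=8):
    """Convert floating-point values to signed fixed-point integers.

    Sketch of an FPTOFIX-style conversion (e.g., toward FIX8):
    scale by 2**frac_bits, round, then saturate to the signed range.
    The exact rounding/saturation of the real circuit is assumed here.
    """
    lo = -(1 << (total_bits - 1))          # e.g., -128 for 8-bit
    hi = (1 << (total_bits - 1)) - 1       # e.g., 127 for 8-bit
    scale = 1 << frac_bits
    return [min(hi, max(lo, round(v * scale))) for v in values]
```

For instance, with 4 fraction bits, 1.5 maps to 24 and out-of-range values saturate at the 8-bit limits.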

As for the data splicing operation, the data splicing circuit may perform operations such as parity splicing on data blocks extracted from the data according to, for example, a bit length set in the instruction. For example, when the data is 32 bits wide, the data splicing circuit may divide the data into eight 4-bit data blocks, numbered 1 through 8, then splice blocks 1, 3, 5, and 7 together and splice blocks 2, 4, 6, and 8 together for the operation.
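The parity splicing example above can be modeled as follows. The ordering assumption (block 1 occupies the lowest bits, and spliced blocks are packed from low to high) is not stated in the disclosure and is chosen here only to make the sketch concrete.

```python
def parity_splice(word, block_bits=4, total_bits=32):
    """Split a word into fixed-width blocks and regroup them by parity.

    Sketch of the example: a 32-bit word is cut into eight 4-bit
    blocks; odd-numbered blocks (1, 3, 5, 7) are spliced together,
    as are even-numbered blocks (2, 4, 6, 8).
    """
    n = total_bits // block_bits
    mask = (1 << block_bits) - 1
    # blocks[0] corresponds to block 1 (assumed to be the lowest bits)
    blocks = [(word >> (i * block_bits)) & mask for i in range(n)]

    def join(bs):
        out = 0
        for i, b in enumerate(bs):
            out |= b << (i * block_bits)
        return out

    return join(blocks[0::2]), join(blocks[1::2])
```

With `word = 0x87654321`, the low-to-high nibbles are 1 through 8, so the two spliced results are `0x7531` and `0x8642`.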

In other application scenarios, the above data splicing operation may also be performed on data M (for example, a vector) obtained after an operation has been performed. Suppose the data splicing circuit first splits the lower 256 bits of the even-numbered rows of data M into units of 8 bits each, yielding 32 even-row unit data (denoted M_2i_0 through M_2i_31). Similarly, the lower 256 bits of the odd-numbered rows of data M may be split into 8-bit units, yielding 32 odd-row unit data (denoted M_(2i+1)_0 through M_(2i+1)_31). Further, the 32 odd-row units and the 32 even-row units are arranged alternately, from low bits to high bits, even-row unit first and odd-row unit second. Specifically, even-row unit 0 (M_2i_0) is placed at the lowest position, followed by odd-row unit 0 (M_(2i+1)_0), then even-row unit 1 (M_2i_1), and so on. When odd-row unit 31 (M_(2i+1)_31) has been placed, the 64 units have been spliced together to form one new 512-bit data item.
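The 512-bit interleaving described above can be sketched as follows, treating each row as an integer whose low bits hold the units. The function name and the integer representation are illustrative assumptions.

```python
def interleave_rows(even_row, odd_row, unit_bits=8, units=32):
    """Interleave the low units of an even row and an odd row.

    Sketch of the splicing example: the low 256 bits of each row are
    split into 32 eight-bit units, then placed alternately (even-row
    unit first) from low bits to high bits, yielding 512 bits.
    """
    mask = (1 << unit_bits) - 1
    out = 0
    for i in range(units):
        e = (even_row >> (i * unit_bits)) & mask   # M_2i_i
        o = (odd_row >> (i * unit_bits)) & mask    # M_(2i+1)_i
        out |= e << (2 * i * unit_bits)
        out |= o << ((2 * i + 1) * unit_bits)
    return out  # 64 units -> one 512-bit value
```

A reduced check with two 4-bit units per row makes the alternation easy to see: rows `0x21` and `0x43` interleave to `0x4231`.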

Depending on the application scenario, the data conversion circuit and the data splicing circuit in the data processing unit can be used together to perform data pre-processing more flexibly. For example, according to the operations included in the master instruction, the data processing unit may perform only data conversion without data splicing, only data splicing without data conversion, or both data conversion and data splicing. In some scenarios, when the master instruction includes no pre-processing operation for the slave operation, the data processing unit may be configured to disable the data conversion circuit and the data splicing circuit.

As mentioned above, the main processing circuit according to the present disclosure may include one or more groups of multi-stage pipeline operation circuits, such as the two groups 304 and 306 shown in FIG. 3. Each group performs a multi-stage pipeline operation comprising a first stage through an Nth stage, and each stage may include one or more operators, so as to perform the multi-stage pipeline operation according to the master instruction. In one embodiment, the main processing circuit of the present disclosure may be implemented as a Single Instruction Multiple Data (SIMD) module, and each group of multi-stage pipeline operation circuits may form one operation pipeline. According to the operation requirements, the pipeline may be provided, stage by stage, with varying numbers of identical or different functional modules (i.e., the operators of the present disclosure), such as addition modules (adders), multiplication modules (multipliers), and table-lookup modules.

In some application scenarios, when the ordering requirements of the pipeline are met, different functional modules on one pipeline can be used in combination, with each pipeline stage completing the operation represented by one operation code ("op") of a microinstruction. The SIMD module of the present disclosure can therefore support pipeline operations at different levels; that is, based on how the operators are arranged along the operation pipeline, it can flexibly support combinations of different numbers of ops.

Assume there is a pipeline similar to the first group of multi-stage pipeline operation circuits 304 and the second group 306 (named "stage1"), provided, in top-to-bottom order, with six functional modules forming a six-stage pipeline: stage1-1 adder 1 (first-stage adder), stage1-2 adder 2 (second-stage adder), stage1-3 multiplier 1 (first-stage multiplier), stage1-4 multiplier 2 (second-stage multiplier), stage1-5 adder 1 (first-stage adder), and stage1-6 adder 2 (second-stage adder). It can be seen that the first-stage adder (serving as the first stage of the pipeline) and the second-stage adder (serving as the second stage) work together to complete a two-stage addition. Likewise, the first-stage and second-stage multipliers perform a similar two-stage operation. Of course, the two-stage adders and multipliers here are merely illustrative and not restrictive; in some application scenarios, a multi-stage pipeline may contain only a single-stage adder or multiplier.

In some embodiments, two or more pipelines as described above may be provided, where each pipeline may include several identical or different operators to implement identical or different functions. Further, different pipelines may include different operators, so that each pipeline implements operations with different functions. The operators or circuits implementing these functions may include, but are not limited to, a random number processing circuit, an addition/subtraction circuit, a subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a divider, a pooler, a comparator, an absolute-value circuit, a logic operator, a position index circuit, or a filter. Taking the pooler as an example, it may be composed of operators such as adders, dividers, and comparators in order to perform pooling operations in a neural network.

In some application scenarios, the multi-stage pipeline operation in the main processing circuit may support unary operations (i.e., cases with only one input data item). Taking the operation at a scale layer + relu layer of a neural network as an example, suppose the computation instruction to be executed is expressed as result=relu(a*ina+b), where ina is the input data (which may be, for example, a vector, matrix, or tensor) and a and b are operation constants. For this computation instruction, a group of three-stage pipeline operation circuits of the present disclosure comprising a multiplier, an adder, and a nonlinear operator may be applied. Specifically, the multiplier of the first pipeline stage may compute the product of the input data ina and a to obtain the first-stage result. Then, the adder of the second pipeline stage may add that first-stage result (a*ina) to b to obtain the second-stage result. Finally, the relu activation function of the third pipeline stage may be applied to the second-stage result (a*ina+b) to obtain the final operation result.
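The three-stage computation just described can be modeled in software as follows; this is a behavioral sketch of result=relu(a*ina+b), not a model of the circuit timing, and the function name is illustrative.

```python
def scale_relu_pipeline(ina, a, b):
    """Three-stage pipeline sketch for result = relu(a*ina + b)."""
    stage1 = [a * x for x in ina]          # stage 1: multiplier computes a*ina
    stage2 = [y + b for y in stage1]       # stage 2: adder computes a*ina + b
    return [max(0.0, z) for z in stage2]   # stage 3: relu activation
```

For example, with a=2.0 and b=1.0, the input [1.0, -2.0] yields [3.0, 0.0].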

In some application scenarios, the multi-stage pipeline operation circuits in the main processing circuit may support binary operations (for example, the convolution instruction result=conv(ina,inb)) or ternary operations (for example, result=conv(ina,inb,bias)), where the input data ina, inb, and bias may each be a one-dimensional tensor (i.e., a vector, whose elements may be, for example, integer, fixed-point, or floating-point data), a two-dimensional tensor (i.e., a matrix), or a tensor of three or more dimensions. Taking result=conv(ina,inb) as an example, the convolution expressed by this instruction may be performed using the multiple multipliers, at least one adder tree, and at least one nonlinear operator included in a three-stage pipeline structure, where the two inputs ina and inb may, for example, be neuron data. Specifically, the first-stage pipeline multipliers first compute the products product=ina*inb (regarded as one microinstruction of the operation instruction, corresponding to the multiplication operation).
The adder tree of the second pipeline stage then sums the first-stage results "product" to obtain the second-stage result sum. Finally, the nonlinear operator of the third pipeline stage applies an activation operation to "sum" to obtain the final convolution result.

In some application scenarios, one or more stages of the pipeline operation circuits that will not be used in an operation can be bypassed; that is, one or more stages of the multi-stage pipeline may be used selectively as the operation requires, without forcing the operation through every pipeline stage. Taking the computation of a Euclidean distance as an example, with the instruction expressed as dis=sum((ina-inb)^2), only the pipeline stages composed of adders, multipliers, adder trees, and accumulators need be used to obtain the final result, while the unused pipeline stages may be bypassed before or during the pipeline operation.
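The bypassed Euclidean-distance computation can be sketched as follows: only subtract, multiply, and accumulate stages participate, and an activation stage, if present in the pipeline, is simply skipped. The function name is illustrative.

```python
def euclidean_distance_sq(ina, inb):
    """Pipeline sketch for dis = sum((ina - inb)^2) with bypass."""
    diff = [x - y for x, y in zip(ina, inb)]   # adder stage (subtraction)
    sq = [d * d for d in diff]                 # multiplier stage
    return sum(sq)                             # adder tree / accumulator
    # a nonlinear/activation stage would be bypassed here
```

For example, the squared distance between [1, 2, 3] and [0, 0, 0] is 14.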

In the aforementioned pipeline operation, each group of pipeline operation circuits can perform the pipeline operation independently. However, the groups may also perform the pipeline operation cooperatively. For example, the output of the serial pipeline operation performed by the first and second stages of the first group may serve as the input of the third pipeline stage of another group. As another example, the first and second stages of the first group may perform parallel pipeline operations and output their respective results as inputs to the first and/or second pipeline stages of another group.

FIGS. 4a, 4b, and 4c are schematic diagrams showing matrix conversions performed by the data conversion circuit according to an embodiment of the present disclosure. To better understand the conversion operations performed by the data conversion circuit 3021 in the main processing circuit, the transposition operation and the horizontal mirroring operation performed on an original matrix (which can be regarded as a 2-dimensional tensor under the present disclosure) are further described below as examples.

As shown in FIG. 4a, the original matrix is a matrix of (M+1) rows × (N+1) columns. According to the requirements of the application scenario, the data conversion circuit may transpose the original matrix shown in FIG. 4a to obtain the matrix shown in FIG. 4b. Specifically, the data conversion circuit exchanges the row and column indices of the elements of the original matrix to form the transposed matrix. For instance, the element "10" located at row 1, column 0 of the original matrix in FIG. 4a is located at row 0, column 1 of the transposed matrix in FIG. 4b. By analogy, the element "M0" located at row M, column 0 of the original matrix is located at row 0, column M of the transposed matrix in FIG. 4b.

As shown in FIG. 4c, the data conversion circuit can perform a horizontal mirroring operation on the original matrix shown in FIG. 4a to form a horizontally mirrored matrix. Specifically, through the horizontal mirroring operation, the data conversion circuit converts the arrangement of the original matrix from first-row-to-last-row order into last-row-to-first-row order, while the column indices of the elements remain unchanged. For instance, the elements "00" (row 0, column 0) and "10" (row 1, column 0) of the original matrix in FIG. 4a are located at row M, column 0 and row M-1, column 0, respectively, of the mirrored matrix in FIG. 4c. By analogy, the element "M0" at row M, column 0 of the original matrix is located at row 0, column 0 of the mirrored matrix in FIG. 4c.
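The two transformations of FIGS. 4a-4c can be sketched together as follows, with matrices represented as lists of rows; the function names are illustrative only.

```python
def transpose(matrix):
    """Swap row and column indices, as in FIG. 4a -> FIG. 4b."""
    return [list(row) for row in zip(*matrix)]

def horizontal_mirror(matrix):
    """Reverse the row order while keeping each element's column
    position unchanged, as in FIG. 4a -> FIG. 4c."""
    return [list(row) for row in reversed(matrix)]
```

With the 2x2 matrix [[0, 1], [10, 11]], transposition yields [[0, 10], [1, 11]] and horizontal mirroring yields [[10, 11], [0, 1]], matching the index relations described above.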

FIG. 5 is a block diagram showing a slave processing circuit 500 of a computing device according to an embodiment of the present disclosure. It should be understood that the structure shown in the figure is merely illustrative and not restrictive; based on the teachings of the present disclosure, those skilled in the art may also conceive of adding more operators to form pipeline operation circuits with more stages.

As shown in FIG. 5, the slave processing circuit 500 includes a four-stage pipeline operation circuit composed of a multiplier 502, a comparator 504, a selector 506, an accumulator 508, and a converter 510. In one application scenario, the slave processing circuit as a whole may perform vector (including, for example, matrix) operations.

When performing a vector operation, the slave processing circuit 500, according to the received microinstructions (the control signals shown in the figure), feeds vector data including weight data and neuron data (which can be regarded as 1-dimensional tensors under the present disclosure) into the multiplier. After the multiplication is performed, the multiplier passes its result to the selector 506. Here, the selector 506 selects the multiplier's result, rather than the comparator's, to pass to the accumulator 508, which performs the accumulation step of the vector operation. The accumulator then passes the accumulated result to the converter 510, which performs the data conversion operation described above. Finally, the converter outputs the accumulated sum (shown as "ACC_SUM" in the figure) as the final result.
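The data path of FIG. 5 can be sketched behaviorally as follows. The comparator predicate and the default identity `convert` are assumptions made only to keep the sketch self-contained; the real circuit's comparator semantics and conversion are not specified at this level.

```python
def slave_pipeline(weights, neurons, use_comparator=False,
                   convert=lambda acc: acc):
    """Behavioral sketch of the four-stage slave pipeline in FIG. 5.

    The selector chooses between the multiplier path (vector MAC) and
    the comparator path; the accumulator sums the selected results and
    the converter stage produces ACC_SUM.
    """
    acc = 0
    for w, n in zip(weights, neurons):
        if use_comparator:
            selected = 1 if w > n else 0   # comparator path (assumed predicate)
        else:
            selected = w * n               # multiplier path
        acc += selected                    # accumulator stage
    return convert(acc)                    # converter outputs ACC_SUM
```

For instance, weights [1, 2, 3] against neurons [4, 5, 6] accumulate to 32 on the multiplier path.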

In addition to performing the matrix multiply-accumulate ("MAC") operation between neuron data and weight data described above, the four-stage pipeline operation circuit shown in FIG. 5 may also be used to perform histogram operations, depthwise-layer multiply-accumulate operations, integration, winograd multiply-accumulate operations, and other operations in neural network computation. When performing a histogram operation, in the first stage the slave processing circuit feeds the input data to the comparator according to the microinstruction. Correspondingly, the selector 506 here selects the comparator's result, rather than the multiplier's, to pass to the accumulator for the subsequent operations.

From the above description, those skilled in the art will understand that, in terms of hardware arrangement, the slave processing circuit of the present disclosure may include multiple operation circuits for performing the slave operation, and these operation circuits are connected and configured to perform multi-stage pipelined operations. In one or more embodiments, the aforementioned operation circuits may include, but are not limited to, one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a conversion circuit, so as to perform at least vector operations, such as multi-dimensional convolution operations in a neural network.

In one operation scenario, the slave processing circuit of the present disclosure may, according to a slave instruction (implemented as, for example, one or more microinstructions or control signals), operate on data that has been pre-processed by the main processing circuit, to obtain the expected operation result. In another operation scenario, the slave processing circuit may send the intermediate result obtained from its operation (for example, via an interconnect interface) to the data processing unit in the main processing circuit, so that the data conversion circuit in the data processing unit performs data type conversion on the intermediate result, or the data splicing circuit in the data processing unit performs data splitting and splicing operations on it, thereby obtaining the final operation result. The operation of the master and slave processing circuits of the present disclosure is described below in conjunction with several exemplary instructions.

Taking the computation instruction "COSHLC", which includes a pre-processing operation, as an example, the operations it performs (including the pre-processing operation performed by the master processing circuit and the slave operation performed by the slave processing circuit) can be expressed as:

COSHLC = FPTOFIX + SHUFFLE + LT3DCONV,

where FPTOFIX denotes the data type conversion operation performed by the data conversion circuit in the master processing circuit, i.e., converting the input data from floating-point to fixed-point numbers; SHUFFLE denotes the data splicing operation performed by the data splicing circuit; and LT3DCONV denotes the 3DCONV operation, i.e., a convolution over 3-dimensional data, performed by the slave processing circuit (referred to as "LT"). It can be understood that when only the 3-dimensional convolution is to be performed, FPTOFIX and SHUFFLE, as parts of the master operation, may both be treated as optional operations.
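The way COSHLC composes its optional master-side steps with the slave-side convolution can be sketched as follows. All three callables are placeholders standing in for the circuit operations; their names and signatures are assumptions for illustration only.

```python
def coshlc(data, fptofix=None, shuffle=None, lt3dconv=lambda d: d):
    """Sketch of COSHLC = FPTOFIX + SHUFFLE + LT3DCONV.

    fptofix and shuffle model the optional master pre-processing
    steps; lt3dconv models the slave 3D convolution. Passing None
    for a pre-processing step corresponds to skipping it.
    """
    if fptofix is not None:
        data = fptofix(data)   # master: float -> fixed-point conversion
    if shuffle is not None:
        data = shuffle(data)   # master: data splicing
    return lt3dconv(data)      # slave: 3D convolution
```

For example, supplying only a toy `fptofix` (scale by 16 and truncate) and a toy `lt3dconv` (sum) shows the two-step flow with SHUFFLE skipped.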

Taking the computation instruction LCSU, which includes a post-processing operation, as an example, the operations it performs (including the slave operation performed by the slave processing circuit and the post-processing operation performed by the master processing circuit) can be expressed as:

LCSU = LT3DCONV + SUB,

where, after the slave processing circuit performs the LT3DCONV operation to obtain the 3D convolution result, the subtractor in the master processing circuit may perform the subtraction operation SUB on that result. Thus, in each execution cycle of the instruction, one binary operand (the convolution result and the subtrahend) is input and one unary operand (the final result obtained after executing the LCSU instruction) is output.

Taking as a further example the computation instruction SHLCAD, which includes a pre-processing operation, a slave operation, and a post-processing operation, the operations it performs (including the pre-processing operation performed by the master processing circuit, the slave operation performed by the slave processing circuit, and the post-processing operation performed by the master processing circuit) can be expressed as:

SHLCAD = SHUFFLE + LT3DCONV + ADD

where, in the pre-processing operation, the data splicing circuit performs the data splicing operation denoted by SHUFFLE. Then, the slave processing circuit performs the LT3DCONV operation on the spliced data to obtain the 3D convolution result. Finally, the adder in the master processing circuit performs the addition operation ADD on the 3D convolution result to obtain the final computation result.

From the above examples, those skilled in the art will understand that after a computation instruction is parsed, the operation instructions obtained under the present disclosure include, depending on the specific operations, one of the following combinations: a pre-processing instruction and a slave processing instruction; a slave processing instruction and a post-processing instruction; or a pre-processing instruction, a slave processing instruction, and a post-processing instruction. On this basis, in some embodiments the pre-processing instruction may include a data conversion instruction and/or a data splicing instruction. In other embodiments, the post-processing instruction includes one or more of the following: a random number processing instruction, an addition instruction, a subtraction instruction, a table lookup instruction, a parameter configuration instruction, a multiplication instruction, a pooling instruction, an activation instruction, a comparison instruction, an absolute-value instruction, a logical operation instruction, a position index instruction, or a filtering instruction. In still other embodiments, the slave processing instruction may include various types of operation instructions, including but not limited to instructions similar to those among the post-processing instructions, as well as instructions for complex data processing, such as vector operation instructions or tensor operation instructions.

Based on the above description in conjunction with FIG. 1 (including FIGS. 1a and 1b) through FIG. 5, those skilled in the art will understand that the present disclosure also discloses a method of performing computing operations using a computing device, where the computing device includes a main processing circuit and at least one slave processing circuit (i.e., the computing device discussed above in conjunction with FIGS. 1-5). The method includes configuring the main processing circuit to perform a master operation in response to a master instruction, and configuring the slave processing circuit to perform a slave operation in response to a slave instruction. In one embodiment, the aforementioned master operation includes a pre-processing operation and/or a post-processing operation for the slave operation, and the master instruction and the slave instruction are obtained by parsing the computation instruction received by the computing device. In another embodiment, an operand of the computation instruction includes a descriptor indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand.

基于上述的描述符设置，所述方法还可以包括将所述主处理电路和/或从处理电路配置成根据所述存储地址来执行各自对应的主运算操作和/或从处理操作。如前所述，通过本披露的描述符设置，可以提升张量运算的效率和数据访存的速率，进一步减小张量运算的开销。另外，尽管这里为了简明的目的，没有对方法的另外步骤进行描述，但本领域技术人员根据本披露公开的内容可以理解本披露的方法可以执行前文结合图1-图5所述的各类操作。Based on the above descriptor settings, the method may also include configuring the master processing circuit and/or the slave processing circuit to perform their respective master operations and/or slave processing operations according to the storage address. As previously mentioned, the descriptor settings of the present disclosure can improve the efficiency of tensor operations and the rate of data access, further reducing the overhead of tensor operations. In addition, although further steps of the method are not described here for the sake of brevity, those skilled in the art will understand from the contents of the present disclosure that the method of the present disclosure can perform the various operations described above in conjunction with Figures 1 to 5.
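To make the role of the descriptor concrete, the following is a minimal Python sketch of how a storage address might be resolved from a tensor-shape descriptor. The field layout (a base address plus per-dimension sizes), the row-major ordering, and the element size are assumptions for illustration only and are not specified by the disclosure:

```python
# Hypothetical sketch: resolving the storage address of a tensor element from
# a shape descriptor (base address + per-dimension sizes), assuming a
# row-major layout and a fixed element size in bytes.
def element_address(base_addr, dims, index, elem_size=4):
    """Map an N-dimensional index to a linear byte address."""
    assert len(dims) == len(index)
    offset = 0
    for size, i in zip(dims, index):
        assert 0 <= i < size
        offset = offset * size + i          # row-major linearization
    return base_addr + offset * elem_size

# A 2x3 tensor stored from base address 0x1000: element (1, 2) is the
# element at linear offset 5, i.e. 5 * 4 bytes past the base address.
assert element_address(0x1000, (2, 3), (1, 2)) == 0x1000 + 5 * 4
```

With such a mapping, the master and slave processing circuits can locate the operand data directly from the descriptor without receiving explicit addresses for every element.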

图6是示出根据本披露实施例的一种组合处理装置600的结构图。如图6中所示，该组合处理装置600包括计算处理装置602、接口装置604、其他处理装置606和存储装置608。根据不同的应用场景，计算处理装置中可以包括一个或多个计算装置610，该计算装置可以配置用于执行本文结合图1-图5所描述的操作。FIG. 6 is a structural diagram showing a combined processing device 600 according to an embodiment of the present disclosure. As shown in FIG. 6, the combined processing device 600 includes a computing processing device 602, an interface device 604, other processing devices 606, and a storage device 608. Depending on the application scenario, the computing processing device may include one or more computing devices 610, which may be configured to perform the operations described herein in conjunction with FIG. 1 to FIG. 5.

在不同的实施例中，本披露的计算处理装置可以配置成执行用户指定的操作。在示例性的应用中，该计算处理装置可以实现为单核人工智能处理器或者多核人工智能处理器。类似地，包括在计算处理装置内的一个或多个计算装置可以实现为人工智能处理器核或者人工智能处理器核的部分硬件结构。当多个计算装置实现为人工智能处理器核或人工智能处理器核的部分硬件结构时，就本披露的计算处理装置而言，其可以视为具有单核结构或者同构多核结构。In different embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In exemplary applications, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, the one or more computing devices included in the computing processing device may be implemented as artificial intelligence processor cores or partial hardware structures of artificial intelligence processor cores. When multiple computing devices are implemented as artificial intelligence processor cores or partial hardware structures of artificial intelligence processor cores, the computing processing device of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.

在示例性的操作中，本披露的计算处理装置可以通过接口装置与其他处理装置进行交互，以共同完成用户指定的操作。根据实现方式的不同，本披露的其他处理装置可以包括中央处理器(Central Processing Unit,CPU)、图形处理器(Graphics Processing Unit,GPU)、人工智能处理器等通用和/或专用处理器中的一种或多种类型的处理器。这些处理器可以包括但不限于数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等，并且其数目可以根据实际需要来确定。如前所述，仅就本披露的计算处理装置而言，其可以视为具有单核结构或者同构多核结构。然而，当将计算处理装置和其他处理装置共同考虑时，二者可以视为形成异构多核结构。In exemplary operations, the computing processing device of the present disclosure can interact with other processing devices through an interface device to jointly complete user-specified operations. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), and an artificial intelligence processor. These processors may include, but are not limited to, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number can be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure, considered on its own, can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and the other processing devices are considered together, they can be regarded as forming a heterogeneous multi-core structure.

在一个或多个实施例中，该其他处理装置可以作为本披露的计算处理装置(其可以具体化为人工智能例如神经网络运算的相关运算装置)与外部数据和控制的接口，执行包括但不限于数据搬运、对计算装置的开启和/或停止等基本控制。在另外的实施例中，其他处理装置也可以和该计算处理装置协作以共同完成运算任务。In one or more embodiments, the other processing devices can serve as an interface between the computing processing device of the present disclosure (which can be embodied as a computing device related to artificial intelligence operations such as neural network operations) and external data and control, performing basic controls including, but not limited to, data transfer and starting and/or stopping the computing device. In further embodiments, the other processing devices can also cooperate with the computing processing device to jointly complete computing tasks.

在一个或多个实施例中,该接口装置可以用于在计算处理装置与其他处理装置间传输数据和控制指令。例如,该计算处理装置可以经由所述接口装置从其他处理装置中获取输入数据,写入该计算处理装置片上的存储装置(或称存储器)。进一步,该计算处理装置可以经由所述接口装置从其他处理装置中获取控制指令,写入计算处理装置片上的控制缓存中。替代地或可选地,接口装置也可以读取计算处理装置的存储装置中的数据并传输给其他处理装置。In one or more embodiments, the interface device can be used to transmit data and control instructions between the computing and processing device and other processing devices. For example, the computing and processing device can obtain input data from other processing devices via the interface device and write it into a storage device (or memory) on the computing and processing device chip. Further, the computing and processing device can obtain control instructions from other processing devices via the interface device and write them into a control cache on the computing and processing device chip. Alternatively or optionally, the interface device can also read data from the storage device of the computing and processing device and transmit it to other processing devices.

附加地或可选地,本披露的组合处理装置还可以包括存储装置。如图中所示,该存储装置分别与所述计算处理装置和所述其他处理装置连接。在一个或多个实施例中,存储装置可以用于保存所述计算处理装置和/或所述其他处理装置的数据。例如,该数据可以是在计算处理装置或其他处理装置的内部或片上存储装置中无法全部保存的数据。Additionally or optionally, the combined processing device of the present disclosure may further include a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing device, respectively. In one or more embodiments, the storage device may be used to store data of the computing processing device and/or the other processing device. For example, the data may be data that cannot be fully stored in the internal or on-chip storage device of the computing processing device or other processing device.

在一些实施例里,本披露还公开了一种芯片(例如图7中示出的芯片702)。在一种实现中,该芯片是一种系统级芯片(System on Chip,SoC),并且集成有一个或多个如图6中所示的组合处理装置。该芯片可以通过对外接口装置(如图7中示出的对外接口装置706)与其他相关部件相连接。该相关部件可以例如是摄像头、显示器、鼠标、键盘、网卡或wifi接口。在一些应用场景中,该芯片上可以集成有其他处理单元(例如视频编解码器)和/或接口模块(例如DRAM接口)等。在一些实施例中,本披露还公开了一种芯片封装结构,其包括了上述芯片。在一些实施例里,本披露还公开了一种板卡,其包括上述的芯片封装结构。下面将结合图7对该板卡进行详细地描述。In some embodiments, the present disclosure further discloses a chip (e.g., chip 702 shown in FIG. 7 ). In one implementation, the chip is a system-on-chip (SoC) and is integrated with one or more combined processing devices as shown in FIG. 6 . The chip can be connected to other related components through an external interface device (e.g., external interface device 706 shown in FIG. 7 ). The related components may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., DRAM interfaces) may be integrated on the chip. In some embodiments, the present disclosure further discloses a chip packaging structure, which includes the above-mentioned chip. In some embodiments, the present disclosure further discloses a board card, which includes the above-mentioned chip packaging structure. The board card will be described in detail below in conjunction with FIG. 7 .

图7是示出根据本披露实施例的一种板卡700的结构示意图。如图7中所示，该板卡包括用于存储数据的存储器件704，其包括一个或多个存储单元710。该存储器件可以通过例如总线等方式与控制器件708和上文所述的芯片702进行连接和数据传输。进一步，该板卡还包括对外接口装置706，其配置用于芯片(或芯片封装结构中的芯片)与外部设备712(例如服务器或计算机等)之间的数据中继或转接功能。例如，待处理的数据可以由外部设备通过对外接口装置传递至芯片。又例如，所述芯片的计算结果可以经由所述对外接口装置传送回外部设备。根据不同的应用场景，所述对外接口装置可以具有不同的接口形式，例如其可以采用标准PCIE接口等。FIG. 7 is a schematic structural diagram of a board card 700 according to an embodiment of the present disclosure. As shown in FIG. 7, the board card includes a storage device 704 for storing data, which includes one or more storage units 710. The storage device can be connected to, and exchange data with, the control device 708 and the chip 702 described above via, for example, a bus. Further, the board card also includes an external interface device 706, which is configured for a data relay or transfer function between the chip (or a chip in a chip packaging structure) and an external device 712 (such as a server or a computer). For example, data to be processed can be transferred to the chip by the external device through the external interface device. For another example, the calculation results of the chip can be transmitted back to the external device via the external interface device. Depending on the application scenario, the external interface device may have different interface forms; for example, it may adopt a standard PCIE interface.

在一个或多个实施例中，本披露板卡中的控制器件可以配置用于对所述芯片的状态进行调控。为此，在一个应用场景中，该控制器件可以包括单片机(Micro Controller Unit,MCU)，以用于对所述芯片的工作状态进行调控。In one or more embodiments, the control device in the board card of the present disclosure may be configured to regulate the state of the chip. To this end, in one application scenario, the control device may include a micro controller unit (MCU) for regulating the working state of the chip.

根据上述结合图6和图7的描述，本领域技术人员可以理解本披露也公开了一种电子设备或装置，其可以包括一个或多个上述板卡、一个或多个上述芯片和/或一个或多个上述组合处理装置。According to the above description in conjunction with FIG. 6 and FIG. 7, those skilled in the art will understand that the present disclosure also discloses an electronic device or apparatus, which may include one or more of the above-mentioned board cards, one or more of the above-mentioned chips, and/or one or more of the above-mentioned combined processing devices.

根据不同的应用场景，本披露的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆；所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机；所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步，本披露的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中，根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器)，而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中，云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容，从而可以根据终端设备和/或边缘端设备的硬件信息，从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源，以便完成端云一体或云边端一体的统一管理、调度和协同工作。Depending on the application scenario, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, headphones, mobile storage, a wearable device, a visual terminal, an autonomous-driving terminal, a means of transport, a household appliance, and/or a medical device. The means of transport include aircraft, ships, and/or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include MRI machines, B-mode ultrasound scanners, and/or electrocardiographs. The electronic device or apparatus of the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or apparatus of the present disclosure can also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic devices or apparatuses with high computing power according to the solutions of the present disclosure can be applied to cloud devices (such as cloud servers), while electronic devices or apparatuses with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling, and collaborative work in a device-cloud integrated or cloud-edge-device integrated manner.

需要说明的是，为了简明的目的，本披露将一些方法及其实施例表述为一系列的动作及其组合，但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此，依据本披露的公开或教导，本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步，本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例，即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外，根据方案的不同，本披露对一些实施例的描述也各有侧重。鉴于此，本领域技术人员可以理解本披露某个实施例中没有详述的部分，也可以参见其他实施例的相关描述。It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will appreciate that the solutions of the present disclosure are not limited by the order of the described actions. Therefore, based on the disclosure or teaching of the present disclosure, those skilled in the art will appreciate that some of the steps therein may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the present disclosure may be regarded as optional embodiments, i.e., the actions or modules involved therein are not necessarily required for the implementation of one or some of the solutions of the present disclosure. In addition, depending on the solution, the descriptions of some embodiments in the present disclosure each have their own emphasis. In view of this, those skilled in the art will appreciate that, for parts not described in detail in one embodiment of the present disclosure, reference may also be made to the relevant descriptions of other embodiments.

在具体实现方面，基于本披露的公开和教导，本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如，就前文所述的电子设备或装置实施例中的各个单元来说，本文在考虑了逻辑功能的基础上对其进行划分，而实际实现时也可以有另外的划分方式。又例如，可以将多个单元或组件结合或者集成到另一个系统，或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言，前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中，前述的直接或间接耦合涉及利用接口的通信连接，其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。In terms of specific implementation, based on the disclosure and teaching of the present disclosure, those skilled in the art will understand that several embodiments disclosed herein can also be implemented in other ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are divided herein on the basis of logical function, and other divisions are possible in actual implementations. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As for the connection relationships between different units or components, the connections discussed above in conjunction with the accompanying drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections using interfaces, where the communication interfaces may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.

在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。In the present disclosure, the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units. The aforementioned components or units may be located in the same location or distributed on multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the scheme described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in the embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.

在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。In some implementation scenarios, the above-mentioned integrated unit can be implemented in the form of a software program module. If implemented in the form of a software program module and sold or used as an independent product, the integrated unit can be stored in a computer-readable memory. Based on this, when the scheme of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory, which may include several instructions to enable a computer device (such as a personal computer, a server or a network device, etc.) to perform some or all of the steps of the method described in the embodiment of the present disclosure. The aforementioned memory may include, but is not limited to, various media that can store program codes, such as a USB flash drive, a flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk or an optical disk.

在另外一些实现场景中，上述集成的单元也可以采用硬件的形式实现，即为具体的硬件电路，其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件，而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此，本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现，例如CPU、GPU、FPGA、DSP和ASIC等。进一步，前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等)，其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。In some other implementation scenarios, the above-mentioned integrated units may also be implemented in the form of hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (such as computing devices or other processing devices) can be implemented by appropriate hardware processors, such as a CPU, GPU, FPGA, DSP, and ASIC. Further, the aforementioned storage unit or storage device may be any appropriate storage medium (including magnetic storage media or magneto-optical storage media), and may be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, or RAM.

依据以下条款可更好地理解前述内容：The foregoing content may be better understood in view of the following clauses:

条款1、一种计算装置,包括主处理电路和至少一个从处理电路,其中:Clause 1. A computing device comprising a master processing circuit and at least one slave processing circuit, wherein:

所述主处理电路配置成响应于主指令来执行主运算操作,The main processing circuit is configured to perform main computing operations in response to main instructions,

所述从处理电路配置成响应于从指令来执行从运算操作,The slave processing circuit is configured to perform slave computing operations in response to slave instructions,

其中,所述主运算操作包括针对于所述从运算操作的前处理操作和/或后处理操作,所述主指令和所述从指令根据所述计算装置接收的计算指令解析得到,其中所述计算指令的操作数包括用于指示张量的形状的描述符,所述描述符用于确定所述操作数对应数据的存储地址,The master operation includes a pre-processing operation and/or a post-processing operation for the slave operation, the master instruction and the slave instruction are obtained by parsing the computing instruction received by the computing device, the operand of the computing instruction includes a descriptor for indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand,

其中所述主处理电路和/或从处理电路配置成根据所述存储地址来执行各自对应的主运算操作和/或从处理操作。The master processing circuit and/or the slave processing circuit are configured to execute respective corresponding master computing operations and/or slave processing operations according to the storage address.

条款2、根据条款1所述的计算装置,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。Clause 2. A computing device according to clause 1, wherein the computing instruction includes an identification of a descriptor and/or the content of the descriptor, and the content of the descriptor includes at least one shape parameter representing the shape of tensor data.

条款3、根据条款2所述的计算装置,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。Clause 3. A computing device according to clause 2, wherein the contents of the descriptor also include at least one address parameter representing an address of tensor data.

条款4、根据条款3所述的计算装置,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。Clause 4. A computing device according to clause 3, wherein the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in a data storage space of the tensor data.

条款5、根据条款4所述的计算装置,其中所述张量数据的形状参数包括以下至少一种:Clause 5. The computing device of clause 4, wherein the shape parameter of the tensor data comprises at least one of the following:

所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系,其中N为大于或等于零的整数。The size of the data storage space in at least one direction of the N dimensions, the size of the storage area of the tensor data in at least one direction of the N dimensions, the offset of the storage area in at least one direction of the N dimensions, the positions of at least two vertices at diagonal positions in the N dimensions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, where N is an integer greater than or equal to zero.

条款6、根据条款1所述的计算装置,其中所述主处理电路配置成:Clause 6. The computing device of clause 1, wherein the main processing circuit is configured to:

获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及Acquire the computing instruction and parse the computing instruction to obtain the master instruction and the slave instruction; and

将所述从指令发送至所述从处理电路。The slave instruction is sent to the slave processing circuit.

条款7、根据条款1所述的计算装置,还包括控制电路,所述控制电路配置成:Clause 7. The computing device of clause 1, further comprising a control circuit configured to:

获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及Acquire the computing instruction and parse the computing instruction to obtain the master instruction and the slave instruction; and

将所述主指令发送至所述主处理电路并且将所述从指令发送至所述从处理电路。The master instructions are sent to the master processing circuit and the slave instructions are sent to the slave processing circuit.

条款8、根据条款1所述的计算装置,其中所述主指令包括用于标识所述前处理操作和/或所述后处理操作的标识位。Clause 8. A computing device according to Clause 1, wherein the main instruction includes an identification bit for identifying the pre-processing operation and/or the post-processing operation.

条款9、根据条款1所述的计算装置,其中所述计算指令包括用于区分所述主指令中的所述前处理操作和所述后处理操作的预设位。Clause 9. A computing device according to clause 1, wherein the computing instruction includes a preset bit for distinguishing the pre-processing operation and the post-processing operation in the main instruction.

条款10、根据条款1所述的计算装置,其中所述主处理电路包括用于执行所述主运算操作的数据处理单元,并且所述数据处理单元包括用于执行数据转换操作的数据转换电路和/或用于执行数据拼接操作的数据拼接电路。Item 10. A computing device according to Item 1, wherein the main processing circuit includes a data processing unit for performing the main computing operation, and the data processing unit includes a data conversion circuit for performing a data conversion operation and/or a data splicing circuit for performing a data splicing operation.

条款11、根据条款10所述的计算装置,其中所述数据转换电路包括一个或多个转换器,用于实现计算数据在多种不同数据类型之间的转换。Item 11. A computing device according to Item 10, wherein the data conversion circuit includes one or more converters for realizing conversion of computing data between a plurality of different data types.

条款12、根据条款10所述的计算装置,其中所述数据拼接电路配置成以预定的位长对计算数据进行拆分,并且将拆分后获得的多个数据块按照预定顺序进行拼接。Clause 12. A computing device according to clause 10, wherein the data splicing circuit is configured to split the computing data into predetermined bit lengths and to splice the multiple data blocks obtained after the splitting in a predetermined order.
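As an illustration of the split-and-splice behaviour described in this clause, the following hypothetical Python sketch splits data into fixed-length blocks and reassembles them in a caller-supplied order. Operating on bytes rather than arbitrary bit lengths, and expressing the predetermined order as a permutation list, are simplifying assumptions and not part of the claimed circuit:

```python
# Hypothetical sketch of the split-and-splice behaviour: split a byte string
# into fixed-size blocks and reassemble them in a predetermined order.
def split_blocks(data: bytes, block_len: int):
    """Split data into consecutive blocks of block_len bytes."""
    return [data[i:i + block_len] for i in range(0, len(data), block_len)]

def splice_blocks(blocks, order):
    """Concatenate the blocks following a predetermined permutation."""
    return b"".join(blocks[i] for i in order)

blocks = split_blocks(b"ABCDEFGH", 2)        # [b'AB', b'CD', b'EF', b'GH']
assert splice_blocks(blocks, [3, 2, 1, 0]) == b"GHEFCDAB"
```

In hardware, the block length and splicing order would be fixed by the data splicing circuit rather than passed in by a caller.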

条款13、根据条款1所述的计算装置,其中所述主处理电路包括一组或多组流水运算电路,所述每组流水运算电路形成一条运算流水线并且包括一个或多个运算器,其中当所述每组流水运算电路包括多个运算器时,所述多个运算器被连接并配置成根据所述主指令选择性地参与以执行所述主运算操作。Item 13. A computing device according to Item 1, wherein the main processing circuit includes one or more groups of pipeline operation circuits, each group of pipeline operation circuits forms an operation pipeline and includes one or more operators, wherein when each group of pipeline operation circuits includes multiple operators, the multiple operators are connected and configured to selectively participate in performing the main operation according to the main instruction.

条款14、根据条款13所述的计算装置,其中所述主处理电路包括至少两条运算流水线,并且每条运算流水线包括以下中的一个或多个运算器或电路:Clause 14. A computing device according to clause 13, wherein the main processing circuit comprises at least two operation pipelines, and each operation pipeline comprises one or more of the following operators or circuits:

随机数处理电路、加法电路、减法电路、查表电路、参数配置电路、乘法器、除法器、池化器、比较器、求绝对值电路、逻辑运算器、位置索引电路或过滤器。A random number processing circuit, an addition circuit, a subtraction circuit, a table lookup circuit, a parameter configuration circuit, a multiplier, a divider, a pooler, a comparator, an absolute value circuit, a logic operator, a position index circuit, or a filter.

条款15、根据条款1所述的计算装置，其中所述从处理电路包括用于执行所述从运算操作的多个运算电路，并且所述多个运算电路被连接并配置成执行多级流水的运算操作，其中所述运算电路包括乘法电路、比较电路、累加电路和转数电路中的一个或多个，以至少执行向量运算。Clause 15. The computing device according to clause 1, wherein the slave processing circuit includes multiple operation circuits for performing the slave operations, and the multiple operation circuits are connected and configured to perform multi-stage pipelined operations, wherein the operation circuits include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a number conversion circuit, so as to perform at least vector operations.

条款16、根据条款15所述的计算装置,其中所述从指令包括对经所述前处理操作的计算数据执行卷积运算的卷积指令,所述从处理电路配置成:Clause 16. The computing device according to clause 15, wherein the slave instruction comprises a convolution instruction for performing a convolution operation on the calculation data subjected to the pre-processing operation, and the slave processing circuit is configured to:

根据所述卷积指令对经所述前处理操作的计算数据执行卷积运算。A convolution operation is performed on the calculated data after the pre-processing operation according to the convolution instruction.

条款17、一种集成电路芯片,包括根据条款1-16的任意一项所述的计算装置。Clause 17. An integrated circuit chip comprising a computing device according to any one of clauses 1-16.

条款18、一种板卡,包括根据条款17所述的集成电路芯片。Clause 18. A board comprising the integrated circuit chip according to clause 17.

条款19、一种电子设备,包括根据条款17所述的集成电路芯片。Clause 19. An electronic device comprising the integrated circuit chip according to clause 17.

条款20、一种使用计算装置来执行计算操作的方法,其中所述计算装置包括主处理电路和至少一个从处理电路,所述方法包括:Clause 20. A method of performing a computing operation using a computing device, wherein the computing device includes a master processing circuit and at least one slave processing circuit, the method comprising:

将所述主处理电路配置成响应于主指令来执行主运算操作,configuring the main processing circuit to perform main computing operations in response to main instructions,

将所述从处理电路配置成响应于从指令来执行从运算操作,configuring the slave processing circuit to perform slave computing operations in response to slave instructions,

其中所述主运算操作包括针对于所述从运算操作的前处理操作和/或后处理操作,所述主指令和所述从指令根据所述计算装置接收的计算指令解析得到,其中所述计算指令的操作数包括用于指示张量的形状的描述符,所述描述符用于确定所述操作数对应数据的存储地址,The master operation includes a pre-processing operation and/or a post-processing operation for the slave operation, the master instruction and the slave instruction are obtained by parsing the computing instruction received by the computing device, the operand of the computing instruction includes a descriptor for indicating the shape of a tensor, and the descriptor is used to determine the storage address of the data corresponding to the operand,

其中所述方法还包括将所述主处理电路和/或从处理电路配置成根据所述存储地址来执行各自对应的主运算操作和/或从处理操作。The method further comprises configuring the master processing circuit and/or the slave processing circuit to execute respective corresponding master computing operations and/or slave processing operations according to the storage address.

条款21、根据条款20所述的方法,其中所述计算指令包括描述符的标识和/或描述符的内容,所述描述符的内容包括表示张量数据的形状的至少一个形状参数。Clause 21. A method according to clause 20, wherein the computation instruction includes an identification of a descriptor and/or content of the descriptor, wherein the content of the descriptor includes at least one shape parameter representing a shape of tensor data.

条款22、根据条款21所述的方法,其中所述描述符的内容还包括表示张量数据的地址的至少一个地址参数。Clause 22. A method according to clause 21, wherein the contents of the descriptor also include at least one address parameter representing an address of tensor data.

条款23、根据条款22所述的方法,其中所述张量数据的地址参数包括所述描述符的数据基准点在所述张量数据的数据存储空间中的基准地址。Clause 23. A method according to clause 22, wherein the address parameter of the tensor data includes a reference address of a data reference point of the descriptor in a data storage space of the tensor data.

条款24、根据条款23所述的方法,其中所述张量数据的形状参数包括以下至少一种:Clause 24. The method of clause 23, wherein the shape parameter of the tensor data comprises at least one of the following:

所述数据存储空间在N个维度方向的至少一个方向上的尺寸、所述张量数据的存储区域在N个维度方向的至少一个方向上的尺寸、所述存储区域在N个维度方向的至少一个方向上的偏移量、处于N个维度方向的对角位置的至少两个顶点相对于所述数据基准点的位置、所述描述符所指示的张量数据的数据描述位置与数据地址之间的映射关系,其中N为大于或等于零的整数。The size of the data storage space in at least one direction of the N dimensions, the size of the storage area of the tensor data in at least one direction of the N dimensions, the offset of the storage area in at least one direction of the N dimensions, the positions of at least two vertices at diagonal positions in the N dimensions relative to the data reference point, and the mapping relationship between the data description position of the tensor data indicated by the descriptor and the data address, where N is an integer greater than or equal to zero.

条款25、根据条款20所述的方法,其中将所述主处理电路配置成:Clause 25. The method of clause 20, wherein the main processing circuit is configured to:

获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及Acquire the computing instruction and parse the computing instruction to obtain the master instruction and the slave instruction; and

将所述从指令发送至所述从处理电路。The slave instruction is sent to the slave processing circuit.

条款26、根据条款20所述的方法,其中计算装置包括控制电路,所述方法还包括将控制电路配置成:Clause 26. The method of clause 20, wherein the computing device includes a control circuit, the method further comprising configuring the control circuit to:

获取所述计算指令并对所述计算指令进行解析,以得到所述主指令和所述从指令;以及Acquire the computing instruction and parse the computing instruction to obtain the master instruction and the slave instruction; and

将所述主指令发送至所述主处理电路并且将所述从指令发送至所述从处理电路。The master instructions are sent to the master processing circuit and the slave instructions are sent to the slave processing circuit.
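Clauses 25 and 26 both describe parsing a single computing instruction into a master instruction and a slave instruction before dispatching each to its circuit. A minimal software sketch of that split — the field names 'pre', 'compute', and 'post' are invented for illustration; the patent does not specify an encoding:

```python
def parse_compute_instruction(instruction):
    """Split a computing instruction into a master instruction (pre/post
    processing for the main processing circuit) and a slave instruction
    (the operations for the slave processing circuits)."""
    master = {k: instruction[k] for k in ('pre', 'post') if k in instruction}
    slave = instruction.get('compute', [])
    return master, slave
```

For example, `parse_compute_instruction({'pre': ['convert'], 'compute': ['matmul'], 'post': ['pool']})` yields `{'pre': ['convert'], 'post': ['pool']}` as the master part and `['matmul']` as the slave part.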

条款27、根据条款20所述的方法,其中所述主指令包括用于标识所述前处理操作和/或所述后处理操作的标识位。Clause 27. A method according to clause 20, wherein the main instruction includes an identification bit for identifying the pre-processing operation and/or the post-processing operation.

条款28、根据条款20所述的方法,其中所述计算指令包括用于区分所述主指令中的所述前处理操作和所述后处理操作的预设位。Clause 28. The method of clause 20, wherein the computation instruction includes a preset bit for distinguishing the pre-processing operation from the post-processing operation in the host instruction.

条款29、根据条款20所述的方法，其中所述主处理电路包括数据处理单元，并且所述数据处理单元包括数据转换电路和/或数据拼接电路，所述方法包括将数据处理单元配置成执行所述主运算操作，并且将所述数据转换电路配置成执行数据转换操作，以及将所述数据拼接电路配置成执行数据拼接操作。Clause 29. The method of clause 20, wherein the main processing circuit includes a data processing unit, and the data processing unit includes a data conversion circuit and/or a data splicing circuit, and the method includes configuring the data processing unit to perform the main operation, configuring the data conversion circuit to perform a data conversion operation, and configuring the data splicing circuit to perform a data splicing operation.

条款30、根据条款29所述的方法,其中所述数据转换电路包括一个或多个转换器,所述方法包括将一个或多个转换器配置成实现计算数据在多种不同数据类型之间的转换。Clause 30. A method according to clause 29, wherein the data conversion circuit comprises one or more converters, and the method comprises configuring the one or more converters to implement conversion of computational data between a plurality of different data types.
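The converters of clause 30 change calculation data between data types (for example float32 to float16). The effect of such a conversion — including the rounding it introduces — can be mimicked in software with Python's standard struct module; the format codes here stand in for whichever types the hardware actually supports:

```python
import struct

def convert_dtype(values, fmt):
    """Pack values into a target machine type and unpack them again,
    reproducing the rounding a hardware type converter would apply.
    fmt: struct format code, e.g. 'e' float16, 'f' float32, 'b' int8.
    """
    packed = struct.pack(f'{len(values)}{fmt}', *values)
    return list(struct.unpack(f'{len(values)}{fmt}', packed))
```

Converting 0.1 to float16 this way returns a nearby but unequal value, making the precision loss of a narrowing conversion visible.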

条款31、根据条款29所述的方法,其中将所述数据拼接电路配置成以预定的位长对计算数据进行拆分,并且将拆分后获得的多个数据块按照预定顺序进行拼接。Clause 31. A method according to clause 29, wherein the data splicing circuit is configured to split the calculation data into predetermined bit lengths and to splice the multiple data blocks obtained after the splitting in a predetermined order.
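The data splicing circuit of clause 31 splits calculation data at a predetermined length and reassembles the resulting blocks in a predetermined order. In software terms — byte granularity is chosen here for readability, whereas the clause speaks of bit lengths:

```python
def splice(data: bytes, block_size: int, order) -> bytes:
    """Split `data` into fixed-size blocks and concatenate them in the
    given order, e.g. to interleave or reverse the halves of a word."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return b''.join(blocks[i] for i in order)
```

For example, `splice(b'AABBCCDD', 2, [3, 2, 1, 0])` reverses the four 2-byte blocks, yielding `b'DDCCBBAA'`.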

条款32、根据条款20所述的方法，其中所述主处理电路包括一组或多组流水运算电路，所述每组流水运算电路形成一条运算流水线并且包括一个或多个运算器，其中当所述每组流水运算电路包括多个运算器时，所述方法包括将所述多个运算器进行连接并且配置成根据所述主指令选择性地参与以执行所述主运算操作。Clause 32. The method of clause 20, wherein the main processing circuit includes one or more groups of pipeline operation circuits, each group of pipeline operation circuits forming an operation pipeline and including one or more operators, wherein when each group of pipeline operation circuits includes multiple operators, the method includes connecting the multiple operators and configuring them to selectively participate, according to the main instruction, in performing the main operation.

条款33、根据条款32所述的方法,其中所述主处理电路包括至少两条运算流水线,并且每条运算流水线包括以下中的一个或多个运算器或电路:Clause 33. A method according to clause 32, wherein the main processing circuit comprises at least two operation pipelines, and each operation pipeline comprises one or more of the following operators or circuits:

随机数处理电路、加减电路、减法电路、查表电路、参数配置电路、乘法器、除法器、池化器、比较器、求绝对值电路、逻辑运算器、位置索引电路或过滤器。Random number processing circuit, addition and subtraction circuit, subtraction circuit, table lookup circuit, parameter configuration circuit, multiplier, divider, pooler, comparator, absolute value circuit, logic operator, position index circuit or filter.
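Clauses 32–33 describe pipelines whose operators participate selectively according to the main instruction. A toy functional model of that idea — the operator pool below is invented for illustration, standing in for the hardware circuits the clause lists (adders, multipliers, comparators, and so on):

```python
# Hypothetical operator pool standing in for the clause's hardware circuits.
OPERATORS = {
    'abs':    abs,
    'negate': lambda x: -x,
    'add1':   lambda x: x + 1,
    'square': lambda x: x * x,
}

def run_pipeline(value, enabled):
    """Pass `value` through the operators named in `enabled`, in order;
    operators not named are bypassed (they do not participate)."""
    for name in enabled:
        value = OPERATORS[name](value)
    return value
```

For example, `run_pipeline(-3, ['abs', 'add1'])` engages only the absolute-value and increment stages and returns 4.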

条款34、根据条款20所述的方法，其中所述从处理电路包括多个运算电路，所述方法包括将所述多个运算电路配置成执行所述从运算操作，并且所述方法还包括将所述多个运算电路连接并配置成执行多级流水的运算操作，其中所述运算电路包括乘法电路、比较电路、累加电路和转数电路中的一个或多个，以至少执行向量运算。Clause 34. The method of clause 20, wherein the slave processing circuit includes multiple operation circuits, the method includes configuring the multiple operation circuits to perform the slave operation, and the method further includes connecting and configuring the multiple operation circuits to perform multi-stage pipelined operations, wherein the operation circuits include one or more of a multiplication circuit, a comparison circuit, an accumulation circuit, and a number-conversion circuit, so as to perform at least vector operations.

条款35、根据条款34所述的方法,其中所述从指令包括对经所述前处理操作的计算数据执行卷积运算的卷积指令,所述方法包括将所述从处理电路配置成:Clause 35. A method according to clause 34, wherein the slave instruction comprises a convolution instruction for performing a convolution operation on the computational data subjected to the pre-processing operation, the method comprising configuring the slave processing circuit to:

根据所述卷积指令对经所述前处理操作的计算数据执行卷积运算。A convolution operation is performed on the calculated data after the pre-processing operation according to the convolution instruction.
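Clause 35 has the slave processing circuit run a convolution over the pre-processed data. For reference, a minimal 1-D valid-mode convolution — written, as is conventional in neural networks, in cross-correlation form; this is the generic definition, not the patented circuit:

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (cross-correlation form): slide the
    kernel over the signal and sum the elementwise products."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]
```

For example, `conv1d([1, 2, 3, 4], [1, 1])` sums each adjacent pair, giving `[3, 5, 7]`.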

虽然本文已经示出和描述了本披露的多个实施例，但对于本领域技术人员显而易见的是，这样的实施例只是以示例的方式来提供。本领域技术人员可以在不偏离本披露思想和精神的情况下想到许多更改、改变和替代的方式。应当理解的是在实践本披露的过程中，可以采用对本文所描述的本披露实施例的各种替代方案。所附权利要求书旨在限定本披露的保护范围，并因此覆盖这些权利要求范围内的等同或替代方案。Although multiple embodiments of the present disclosure have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Those skilled in the art will envision many modifications, changes, and substitutions without departing from the spirit of the present disclosure. It should be understood that various alternatives to the embodiments described herein may be employed in practicing the present disclosure. The appended claims are intended to define the scope of protection of the present disclosure and thus to cover the equivalents and alternatives falling within the scope of those claims.

Claims (32)

1.一种执行计算指令的人工智能处理器，所述计算指令包括描述符，所述人工智能处理器包括：1. An artificial intelligence processor for executing a computing instruction, the computing instruction comprising a descriptor, the artificial intelligence processor comprising:
存储电路；a storage circuit;
主处理电路，自所述描述符获得的所述存储电路中的存储地址与指示N维的张量的形状来对所述张量进行访存，及进行前处理操作；a main processing circuit, which accesses the tensor based on the storage address in the storage circuit obtained from the descriptor and the indicated shape of the N-dimensional tensor, and performs a pre-processing operation; and
多个从处理电路，配置成基于所述前处理操作后的所述张量并行执行N维矩阵运算。a plurality of slave processing circuits configured to perform N-dimensional matrix operations in parallel based on the tensor after the pre-processing operation.

2.根据权利要求1所述的人工智能处理器，其中所述主处理电路用以执行所述前处理操作中的数据拼接操作。2. The artificial intelligence processor according to claim 1, wherein the main processing circuit is configured to perform a data splicing operation in the pre-processing operation.

3.根据权利要求1所述的人工智能处理器，其中所述主处理电路用以执行所述前处理操作中的数据转换操作。3. The artificial intelligence processor according to claim 1, wherein the main processing circuit is configured to perform a data conversion operation in the pre-processing operation.

4.根据权利要求1所述的人工智能处理器，其中所述张量存储在内部存储器的同一块区域。4. The artificial intelligence processor according to claim 1, wherein the tensor is stored in the same area of an internal memory.

5.根据权利要求1所述的人工智能处理器，其中所述主处理电路基于所述N维矩阵运算进行后处理操作。5. The artificial intelligence processor according to claim 1, wherein the main processing circuit performs a post-processing operation based on the N-dimensional matrix operations.

6.根据权利要求5所述的人工智能处理器，其中所述主处理电路通过所述计算指令的标识位来区分所述前处理操作和所述后处理操作。6. The artificial intelligence processor according to claim 5, wherein the main processing circuit distinguishes between the pre-processing operation and the post-processing operation based on an identification bit of the computing instruction.

7.根据权利要求5所述的人工智能处理器，其中所述计算指令包括以下组合中的一项：前处理指令和从处理指令；从处理指令和后处理指令；以及前处理指令、从处理指令和后处理指令。7. The artificial intelligence processor according to claim 5, wherein the computing instruction comprises one of the following combinations: a pre-processing instruction and a slave processing instruction; a slave processing instruction and a post-processing instruction; and a pre-processing instruction, a slave processing instruction, and a post-processing instruction.

8.根据权利要求1所述的人工智能处理器，其中所述从处理电路执行数据类型转换。8. The artificial intelligence processor according to claim 1, wherein the slave processing circuits perform data type conversion.

9.根据权利要求1所述的人工智能处理器，其中所述主处理电路和所述从处理电路配置成支持多级流水运算。9. The artificial intelligence processor according to claim 1, wherein the main processing circuit and the slave processing circuits are configured to support multi-stage pipeline operations.

10.根据权利要求1所述的人工智能处理器，其中10. The artificial intelligence processor according to claim 1, wherein
所述存储地址包括约定的所述描述符的数据基准点在所述数据存储空间中的基准地址，the storage address includes an agreed reference address of a data reference point of the descriptor in the data storage space, and
指示N维的张量的形状至少包括所述张量的存储区域在N个维度方向的至少一个方向上的尺寸。the indicated shape of the N-dimensional tensor includes at least the size of the storage region of the tensor in at least one of the N dimension directions.

11.一种执行计算指令的人工智能处理器，所述计算指令包括描述符，所述人工智能处理器包括：11. An artificial intelligence processor for executing a computing instruction, the computing instruction comprising a descriptor, the artificial intelligence processor comprising:
主处理电路，配置成执行主运算操作，所述主运算操作包括自所述描述符获得的存储地址与指示N维的张量的形状来对所述张量进行访存，及前处理操作；a main processing circuit configured to perform a main operation, the main operation including accessing the tensor based on the storage address obtained from the descriptor and the indicated shape of the N-dimensional tensor, and a pre-processing operation;
其中，所述主处理电路用以执行所述前处理操作中的拼接操作。wherein the main processing circuit is configured to perform a splicing operation in the pre-processing operation.

12.根据权利要求11所述的人工智能处理器，其中所述主处理电路用以执行所述前处理操作中的数据转换操作。12. The artificial intelligence processor according to claim 11, wherein the main processing circuit is configured to perform a data conversion operation in the pre-processing operation.

13.根据权利要求11所述的人工智能处理器，还包括：13. The artificial intelligence processor according to claim 11, further comprising:
多个从处理电路，配置成基于所述前处理操作后的所述张量并行执行N维矩阵运算。a plurality of slave processing circuits configured to perform N-dimensional matrix operations in parallel based on the tensor after the pre-processing operation.

14.根据权利要求13所述的人工智能处理器，其中所述张量存储在内部存储器的同一块区域。14. The artificial intelligence processor according to claim 13, wherein the tensor is stored in the same area of an internal memory.

15.根据权利要求13所述的人工智能处理器，其中所述主处理电路基于所述N维矩阵运算进行后处理操作。15. The artificial intelligence processor according to claim 13, wherein the main processing circuit performs a post-processing operation based on the N-dimensional matrix operations.

16.根据权利要求15所述的人工智能处理器，其中所述主处理电路通过所述计算指令的标识位来区分所述前处理操作和所述后处理操作。16. The artificial intelligence processor according to claim 15, wherein the main processing circuit distinguishes between the pre-processing operation and the post-processing operation based on an identification bit of the computing instruction.

17.根据权利要求15所述的人工智能处理器，其中所述计算指令包括以下组合中的一项：前处理指令和从处理指令；从处理指令和后处理指令；以及前处理指令、从处理指令和后处理指令。17. The artificial intelligence processor according to claim 15, wherein the computing instruction comprises one of the following combinations: a pre-processing instruction and a slave processing instruction; a slave processing instruction and a post-processing instruction; and a pre-processing instruction, a slave processing instruction, and a post-processing instruction.

18.根据权利要求13所述的人工智能处理器，其中所述从处理电路执行数据类型转换。18. The artificial intelligence processor according to claim 13, wherein the slave processing circuits perform data type conversion.

19.根据权利要求13所述的人工智能处理器，其中所述主处理电路和所述从处理电路配置成支持多级流水运算。19. The artificial intelligence processor according to claim 13, wherein the main processing circuit and the slave processing circuits are configured to support multi-stage pipeline operations.

20.根据权利要求11所述的人工智能处理器，其中20. The artificial intelligence processor according to claim 11, wherein
所述存储地址包括约定的所述描述符的数据基准点在所述数据存储空间中的基准地址，the storage address includes an agreed reference address of a data reference point of the descriptor in the data storage space, and
指示N维的张量的形状至少包括所述张量的存储区域在N个维度方向的至少一个方向上的尺寸。the indicated shape of the N-dimensional tensor includes at least the size of the storage region of the tensor in at least one of the N dimension directions.

21.一种执行计算指令的人工智能处理器，所述计算指令包括描述符，所述人工智能处理器包括：21. An artificial intelligence processor for executing a computing instruction, the computing instruction comprising a descriptor, the artificial intelligence processor comprising:
主处理电路，配置成执行主运算操作，所述主运算操作包括自所述描述符获得的存储地址与指示N维的张量的形状来对所述张量进行访存，及前处理操作；a main processing circuit configured to perform a main operation, the main operation including accessing the tensor based on the storage address obtained from the descriptor and the indicated shape of the N-dimensional tensor, and a pre-processing operation;
其中所述主处理电路用以执行所述前处理操作中的数据转换操作。wherein the main processing circuit is configured to perform a data conversion operation in the pre-processing operation.

22.根据权利要求21所述的人工智能处理器，还包括：22. The artificial intelligence processor according to claim 21, further comprising:
多个从处理电路，配置成基于所述前处理操作后的所述张量并行执行N维矩阵运算。a plurality of slave processing circuits configured to perform N-dimensional matrix operations in parallel based on the tensor after the pre-processing operation.

23.根据权利要求22所述的人工智能处理器，其中所述张量存储在内部存储器的同一块区域。23. The artificial intelligence processor according to claim 22, wherein the tensor is stored in the same area of an internal memory.

24.根据权利要求22所述的人工智能处理器，其中所述主处理电路基于所述N维矩阵运算进行后处理操作。24. The artificial intelligence processor according to claim 22, wherein the main processing circuit performs a post-processing operation based on the N-dimensional matrix operations.

25.根据权利要求24所述的人工智能处理器，其中所述主处理电路通过所述计算指令的标识位来区分所述前处理操作和所述后处理操作。25. The artificial intelligence processor according to claim 24, wherein the main processing circuit distinguishes between the pre-processing operation and the post-processing operation based on an identification bit of the computing instruction.

26.根据权利要求24所述的人工智能处理器，其中所述计算指令包括以下组合中的一项：前处理指令和从处理指令；从处理指令和后处理指令；以及前处理指令、从处理指令和后处理指令。26. The artificial intelligence processor according to claim 24, wherein the computing instruction comprises one of the following combinations: a pre-processing instruction and a slave processing instruction; a slave processing instruction and a post-processing instruction; and a pre-processing instruction, a slave processing instruction, and a post-processing instruction.

27.根据权利要求22所述的人工智能处理器，其中所述从处理电路执行数据类型转换。27. The artificial intelligence processor according to claim 22, wherein the slave processing circuits perform data type conversion.

28.根据权利要求22所述的人工智能处理器，其中所述主处理电路和所述从处理电路配置成支持多级流水运算。28. The artificial intelligence processor according to claim 22, wherein the main processing circuit and the slave processing circuits are configured to support multi-stage pipeline operations.

29.根据权利要求21所述的人工智能处理器，其中29. The artificial intelligence processor according to claim 21, wherein
所述存储地址包括约定的所述描述符的数据基准点在所述数据存储空间中的基准地址，the storage address includes an agreed reference address of a data reference point of the descriptor in the data storage space, and
指示N维的张量的形状至少包括所述张量的存储区域在N个维度方向的至少一个方向上的尺寸。the indicated shape of the N-dimensional tensor includes at least the size of the storage region of the tensor in at least one of the N dimension directions.

30.一种集成电路芯片，包括根据权利要求1-29的任意一项所述的人工智能处理器。30. An integrated circuit chip comprising the artificial intelligence processor according to any one of claims 1-29.

31.一种板卡，包括根据权利要求30所述的集成电路芯片。31. A board card comprising the integrated circuit chip according to claim 30.

32.一种电子设备，包括根据权利要求30所述的集成电路芯片。32. An electronic device comprising the integrated circuit chip according to claim 30.
CN202410218141.4A 2020-06-30 2020-06-30 Artificial intelligence processors, integrated circuit chips, boards, electronic devices Pending CN118012505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410218141.4A CN118012505A (en) 2020-06-30 2020-06-30 Artificial intelligence processors, integrated circuit chips, boards, electronic devices

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202410218141.4A CN118012505A (en) 2020-06-30 2020-06-30 Artificial intelligence processors, integrated circuit chips, boards, electronic devices
CN202010619460.8A CN113867800A (en) 2020-06-30 2020-06-30 Computing device, integrated circuit chip, board card, electronic equipment and computing method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202010619460.8A Division CN113867800A (en) 2020-06-30 2020-06-30 Computing device, integrated circuit chip, board card, electronic equipment and computing method

Publications (1)

Publication Number Publication Date
CN118012505A true CN118012505A (en) 2024-05-10

Family

ID=78981783

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010619460.8A Pending CN113867800A (en) 2020-06-30 2020-06-30 Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN202410218141.4A Pending CN118012505A (en) 2020-06-30 2020-06-30 Artificial intelligence processors, integrated circuit chips, boards, electronic devices

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010619460.8A Pending CN113867800A (en) 2020-06-30 2020-06-30 Computing device, integrated circuit chip, board card, electronic equipment and computing method

Country Status (2)

Country Link
CN (2) CN113867800A (en)
WO (1) WO2022001500A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599442B (en) * 2022-12-14 2023-03-10 成都登临科技有限公司 A kind of AI chip, electronic equipment and tensor processing method

Citations (6)

Publication number Priority date Publication date Assignee Title
CN107729990A (en) * 2017-07-20 2018-02-23 上海寒武纪信息科技有限公司 Support the device and method for being used to perform artificial neural network forward operation that discrete data represents
US20180329868A1 (en) * 2016-01-20 2018-11-15 Cambricon Technologies Corporation Limited Vector and Matrix Computing Device
CN109976808A (en) * 2017-12-26 2019-07-05 三星电子株式会社 The method and system and memory die of memory look-up mechanism
CN111047005A (en) * 2018-10-11 2020-04-21 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111061507A (en) * 2018-10-16 2020-04-24 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111079917A (en) * 2018-10-22 2020-04-28 北京地平线机器人技术研发有限公司 Tensor data block access method and device

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US8838997B2 (en) * 2012-09-28 2014-09-16 Intel Corporation Instruction set for message scheduling of SHA256 algorithm
CN103020890B (en) * 2012-12-17 2015-11-04 中国科学院半导体研究所 Vision processing device based on multi-level parallel processing
US9785565B2 (en) * 2014-06-30 2017-10-10 Microunity Systems Engineering, Inc. System and methods for expandably wide processor instructions
CN110096310B (en) * 2018-11-14 2021-09-03 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111078286B (en) * 2018-10-19 2023-09-01 上海寒武纪信息科技有限公司 Data communication method, computing system and storage medium
US11714875B2 (en) * 2019-12-28 2023-08-01 Intel Corporation Apparatuses, methods, and systems for instructions of a matrix operations accelerator


Non-Patent Citations (4)

Title
(美)(J.桑切斯)JULIO SANCHEZ等: "《WINDOWS图形编程》", 31 May 2000, 清华大学出版社, pages: 248 - 249 *
栗云天等: "《疯狂站长之PHP》", 31 October 2000, 中国水利水电出版社, pages: 347 *
王卫东等: "《数据结构辅导》", 31 July 2001, 西安电子科技大学出版社, pages: 71 *
陈海燕;杨超;刘胜;刘仲;: "一种高效的面向基2 FFT算法的SIMD并行存储结构", 电子学报, no. 02, 15 February 2016 (2016-02-15) *

Also Published As

Publication number Publication date
CN113867800A (en) 2021-12-31
WO2022001500A1 (en) 2022-01-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination