
CN111966402A - Instruction processing method and device and related product - Google Patents

Instruction processing method and device and related product

Info

Publication number
CN111966402A
Authority
CN
China
Prior art keywords
tensor
instruction
operated
executed
scalar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910420884.9A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910420884.9A priority Critical patent/CN111966402A/en
Priority to PCT/CN2020/088248 priority patent/WO2020233387A1/en
Publication of CN111966402A publication Critical patent/CN111966402A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/3001 Arithmetic instructions
    • G06F 9/30029 Logical and Boolean instructions, e.g. XOR, NOT
    • G06F 9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F 9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)

Abstract

The disclosure relates to a tensor instruction processing method and apparatus and a related product. The machine learning arithmetic device comprises one or more instruction processing apparatuses and is used for acquiring tensors to be operated and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to the other processing devices through an I/O interface. When the machine learning arithmetic device includes a plurality of instruction processing apparatuses, the instruction processing apparatuses can be connected to one another in a specific configuration to transfer data: they are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus; they share the same control system or have their own control systems, and share a memory or have their own memories; and their interconnection topology can be arbitrary. The tensor instruction processing method and apparatus and the related product provided by the embodiments of the disclosure have a wide application range and process instructions with high efficiency and at high speed.

Description

Instruction processing method and device and related product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a tensor instruction processing method and apparatus, and a related product.
Background
With the continuous development of science and technology, machine learning, and neural network algorithms in particular, is used more and more widely, and has performed well in fields such as image recognition, speech recognition and natural language processing. However, as the complexity of neural network algorithms increases, the types and number of data operations involved keep growing. In the related art, tensor-related operations on tensor data are performed with low efficiency and at low speed.
Disclosure of Invention
In view of the above, the present disclosure provides a tensor instruction processing method and apparatus and a related product, so as to improve the efficiency and speed of performing tensor-related operations on tensor data.
According to a first aspect of the present disclosure, there is provided a tensor instruction processing apparatus, the apparatus comprising:
the control module is used for parsing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and for acquiring, according to the operation code and the operation domain, the tensor to be operated, the scalar to be operated and the target address required for executing the tensor instruction;
the operation module is used for performing a tensor-scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and for storing the operation result at the target address;
wherein the operation code is used for indicating that the operation performed by the tensor instruction on data is a tensor-scalar multiplication operation, and the operation domain comprises the source address of the tensor to be operated, the scalar to be operated and the target address.
According to a second aspect of the present disclosure, there is provided a machine learning arithmetic device, the device including:
one or more tensor instruction processing devices of the first aspect, configured to acquire tensors to be computed and control information from other processing devices, execute a specified machine learning operation, and transmit an execution result to the other processing devices through an I/O interface;
when the machine learning operation device includes a plurality of the tensor instruction processing devices, the plurality of the tensor instruction processing devices may be connected to each other by a specific structure to transmit data;
the tensor instruction processing devices are interconnected through a Peripheral Component Interconnect Express (PCIE) bus and transmit data, so as to support larger-scale machine learning operations; the tensor instruction processing devices share the same control system or have their own respective control systems; the tensor instruction processing devices share a memory or have their own respective memories; and the plurality of tensor instruction processing devices are interconnected in an arbitrary interconnection topology.
According to a third aspect of the present disclosure, there is provided a combined processing apparatus, the apparatus comprising:
the machine learning arithmetic device according to the second aspect, a universal interconnect interface, and other processing devices;
wherein the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
According to a fourth aspect of the present disclosure, there is provided a machine learning chip, which includes the machine learning arithmetic device of the second aspect or the combined processing device of the third aspect.
According to a fifth aspect of the present disclosure, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.
According to a sixth aspect of the present disclosure, a board card is provided, which includes the machine learning chip packaging structure of the fifth aspect.
According to a seventh aspect of the present disclosure, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board of the sixth aspect.
According to an eighth aspect of the present disclosure, there is provided a tensor instruction processing method applied to a tensor instruction processing apparatus, the method including:
parsing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring, according to the operation code and the operation domain, the tensor to be operated, the scalar to be operated and the target address required for executing the tensor instruction;
performing a tensor-scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result at the target address;
wherein the operation code is used for indicating that the operation performed by the tensor instruction on data is a tensor-scalar multiplication operation, and the operation domain comprises the source address of the tensor to be operated and the target address.
According to a ninth aspect of the present disclosure, there is provided a computer-readable storage medium having stored therein a computer program which, when executed by one or more processors, implements the steps of the tensor instruction processing method described above.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The tensor instruction processing apparatus provided by the embodiments of the present disclosure comprises a control module and an operation module. The control module is used for parsing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and for acquiring, according to the operation code and the operation domain, the tensor to be operated and the target address required for executing the tensor instruction. The operation module is used for performing a tensor-scalar multiplication operation on the tensor to be operated to obtain an operation result, and for storing the operation result at the target address. The tensor instruction processing method and apparatus and the related product provided by the embodiments of the present disclosure have a wide application range and process tensor instructions with high efficiency and at high speed.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure.
Figs. 2a to 2f show block diagrams of a tensor instruction processing apparatus according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating an application scenario of a tensor instruction processing apparatus according to an embodiment of the present disclosure.
Fig. 4a, 4b show block diagrams of a combined processing device according to an embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
Figure 6 illustrates a flow diagram of a tensor instruction processing method according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 1, the apparatus includes a control module 11 and an operation module 12. The control module 11 is configured to parse the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and to obtain, according to the operation code and the operation domain, the tensor to be operated, the scalar to be operated and the target address required for executing the tensor instruction. The operation code is used for indicating that the operation performed by the tensor instruction on data is a tensor-scalar multiplication operation, and the operation domain includes the source address of the tensor to be operated and the target address. The operation module 12 is configured to perform the tensor-scalar multiplication operation on the tensor to be operated to obtain an operation result, and to store the operation result at the target address. Specifically, the operation module 12 may multiply each element in the tensor to be operated by the scalar to be operated, which may be given as an immediate or held in a scalar register, to obtain the operation result, and store the operation result at the target address.
Alternatively, the target address may be a start address, and the processing module may determine the storage space required for the operation result according to the start address and the size of the operation result, and store the operation result into the determined storage space.
In this embodiment, the control module may obtain each tensor to be operated from its corresponding address. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins. The operation module is used for performing, according to the tensor operation type, a tensor operation on the tensor to be operated to obtain an operation result, and storing the operation result at the target address. The tensor instruction processing apparatus provided by the embodiments of the present disclosure has a wide application range and processes tensor instructions with high efficiency and at high speed.
In this embodiment, the operation code may be the part of an instruction or field (usually indicated by a code) that specifies the operation to be performed; it is an instruction sequence number used to inform the device executing the instruction which instruction specifically needs to be executed. The operation domain may be the source of all data required for executing the corresponding instruction, where that data includes parameters such as the tensor to be operated, or stores parameters such as the tensor to be operated, the tensor operation type and the corresponding addresses, and so on. A tensor instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the tensor to be operated and the target address. It should be understood that the instruction format of the tensor instruction and the operation code and operation domain it contains may be set as needed by those skilled in the art, and the disclosure is not limited thereto.
Alternatively, there may be one or more source addresses and one or more target addresses of the tensor to be operated in the operation domain, with each source address corresponding to one target address, and there may be one or more tensors to be operated. That is, the operation module 12 can simultaneously multiply one or more tensors to be operated by the same scalar to be operated. Further optionally, there may also be one or more scalars to be operated, with one scalar to be operated provided for each tensor to be operated; the values of the scalars to be operated may be the same or different. In this case, the operation module 12 of the present application can simultaneously multiply one or more tensors to be operated by their respective scalars to be operated.
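For illustration only, the following Python sketch simulates in software the basic behaviour described above: the tensor to be operated is fetched from the source address, each element is multiplied by the scalar to be operated, and the result is stored at the target address. The flat list used as a memory model and all function names are assumptions of this sketch, not part of the disclosed hardware.

    # Software sketch of the tensor-scalar multiplication behaviour described above.
    # The memory model and names are illustrative assumptions only.
    def execute_tensor_scalar_mult(memory, src, dst, num_of_ele, scalar):
        """Multiply each element of the tensor stored at src by scalar and store it at dst."""
        tensor = memory[src:src + num_of_ele]      # fetch the tensor to be operated
        result = [x * scalar for x in tensor]      # element-wise tensor * scalar
        memory[dst:dst + num_of_ele] = result      # store the operation result at the target address
        return result

    memory = list(range(1024)) + [0] * 1024
    execute_tensor_scalar_mult(memory, src=0, dst=1024, num_of_ele=4, scalar=3)
    print(memory[1024:1028])   # [0, 3, 6, 9]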
Further optionally, the apparatus may include one or more control modules and one or more operation modules, and the number of the control modules and the number of the operation modules may be set according to actual needs, which is not limited in this disclosure.
Figure 2a illustrates a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2a, the operation module 12 may include at least one tensor operator 120, and the at least one tensor operator 120 is configured to perform a tensor operation corresponding to the type of the tensor operation. Specifically, the tensor operator is used for multiplying each element in the tensor to be operated by the scalar to be operated to obtain an operation result, so that the tensor and scalar multiplication operation is realized.
Furthermore, the operation module may further include a data access circuit, the data access circuit may obtain data to be operated from the storage module, and the data access circuit may further store an operation result in the storage module. Alternatively, the data access circuit may be a direct memory access module.
In this implementation, the tensor operator may include operators capable of performing arithmetic operations, logical operations and the like on the tensor, such as an adder, a divider, a multiplier and a comparator. The type and number of tensor operators may be set according to the size of the amount of tensor data to be operated, the type of the tensor operation, the required processing speed and efficiency of the tensor operation, and the like, which is not limited by the present disclosure.
Figure 2b illustrates a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2b, the operation module 12 may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122. The master operation submodule 121 and the plurality of slave operation submodules 122 each include the tensor operator (not shown).
The control module 11 is further configured to analyze the tensor instruction to obtain a plurality of operation instructions, and send the tensor to be operated and the plurality of operation instructions to the main operation submodule 121.
The main operation submodule 121 is configured to perform preamble processing on the tensor to be operated and to send the operation instructions and at least part of the tensor to be operated to the slave operation submodules; the tensor operator of the main operation submodule can also execute the tensor-scalar multiplication operation to obtain an intermediate result.
The tensor operators of the slave operation submodules 122 are configured to execute the tensor-scalar multiplication operation in parallel according to the data and the operation instructions received from the main operation submodule 121, to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the main operation submodule 121. The main operation submodule 121 is further configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result, and to store the operation result at the target address.
In a possible implementation, the tensor instruction and the tensor to be operated are split, according to the dimensions of the tensor to be operated, into at least one corresponding tensor operation instruction and at least one sub-tensor, the tensor operation instructions corresponding one-to-one to the sub-tensors. Each corresponding operation instruction and sub-tensor are sent to a slave operation submodule, which performs the tensor-scalar multiplication to obtain an intermediate result and sends the intermediate result to the main operation submodule. The main operation submodule combines the intermediate results to obtain the operation result and stores the operation result at the target address. When splitting according to the dimensions of the tensor, the tensor may be split into one or more one-dimensional tensors or into a plurality of multi-dimensional tensors, and the dimensionality of each split tensor is less than or equal to the dimensionality of the acquired tensor to be operated.
Alternatively, the tensor-scalar multiplication operation can be realized only by the operators in the main operation submodule. For example, when the operation instruction operates on scalar or vector data, the apparatus may control the main operation submodule to perform the operation corresponding to the operation instruction using its own operators. When the operation instruction operates on data with a dimensionality greater than or equal to 2, such as a matrix or a tensor, the apparatus can operate with the main operation submodule and the slave operation submodules cooperating; for the specific implementation, reference may be made to the description above.
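As a rough software analogy of the master/slave cooperation described above, the sketch below splits the tensor to be operated into sub-tensors, lets each "slave" produce an intermediate result, and lets the "master" combine the intermediate results. The splitting policy (row-wise chunks) and the module interfaces are assumptions of this sketch, not the hardware design of the disclosure.

    def split_tensor(tensor, num_slaves):
        """Split the tensor to be operated into sub-tensors, one chunk per slave submodule."""
        step = max(1, len(tensor) // num_slaves)
        return [tensor[i:i + step] for i in range(0, len(tensor), step)]

    def slave_multiply(sub_tensor, scalar):
        """Each slave operation submodule produces an intermediate result."""
        return [x * scalar for x in sub_tensor]

    def master_execute(tensor, scalar, num_slaves=4):
        """Master submodule: split, dispatch, then combine the intermediate results."""
        intermediates = [slave_multiply(sub, scalar) for sub in split_tensor(tensor, num_slaves)]
        return [x for part in intermediates for x in part]   # subsequent processing: concatenation

    print(master_execute(list(range(8)), scalar=2))   # [0, 2, 4, 6, 8, 10, 12, 14]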
It should be noted that, a person skilled in the art may set the connection manner between the master operation submodule and the plurality of slave operation submodules according to actual needs to implement the configuration setting of the operation module, for example, the configuration of the operation module may be an "H" configuration, an array configuration, a tree configuration, and the like, which is not limited in the present disclosure.
Figure 2c illustrates a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2c, the operation module 12 may further include one or more branch operation sub-modules 123, and the branch operation sub-module 123 is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. The main operation sub-module 121 is connected to one or more branch operation sub-modules 123. Therefore, the main operation sub-module, the branch operation sub-module and the slave operation sub-module in the operation module are connected by adopting an H-shaped structure, and data and/or operation instructions are forwarded by the branch operation sub-module, so that the resource occupation of the main operation sub-module is saved, and the instruction processing speed is further improved.
Figure 2d shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 2d, a plurality of slave operation sub-modules 122 are distributed in an array.
Each slave operation submodule 122 is connected to another adjacent slave operation submodule 122, the master operation submodule 121 is connected to k slave operation submodules 122 of the plurality of slave operation submodules 122, and the k slave operation submodules 122 are: n slave operator sub-modules 122 of row 1, n slave operator sub-modules 122 of row m, and m slave operator sub-modules 122 of column 1.
As shown in fig. 2d, the k slave operator modules include only the n slave operator modules in the 1 st row, the n slave operator modules in the m th row, and the m slave operator modules in the 1 st column, that is, the k slave operator modules are slave operator modules directly connected to the master operator module among the plurality of slave operator modules. The k slave operation submodules are used for forwarding data and instructions between the master operation submodules and the plurality of slave operation submodules. Therefore, the plurality of slave operation sub-modules are distributed in an array, the speed of sending data and/or operation instructions to the slave operation sub-modules by the master operation sub-module can be increased, and the instruction processing speed is further increased.
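Purely as an illustration of the array layout just described, the sketch below enumerates, for an m x n array of slave operation submodules, the k submodules directly connected to the master (the n submodules of row 1, the n submodules of row m and the m submodules of column 1); the 1-based indexing is an assumption of the sketch.

    def directly_connected_slaves(m, n):
        """Return the (row, col) indices of the k slave submodules wired directly to the master."""
        k = set()
        for col in range(1, n + 1):
            k.add((1, col))     # the n slave submodules of row 1
            k.add((m, col))     # the n slave submodules of row m
        for row in range(1, m + 1):
            k.add((row, 1))     # the m slave submodules of column 1
        return sorted(k)

    print(len(directly_connected_slaves(4, 5)))   # 5 + 5 + 4 - 2 = 12 directly connected submodules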
Figure 2e shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2e, the operation module may further include a tree sub-module 124. The tree submodule 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master operation submodule 121, and the plurality of branch ports 402 are connected to the plurality of slave operation submodules 122, respectively. The tree sub-module 124 has a transceiving function, and is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. Therefore, the operation modules are connected in a tree-shaped structure under the action of the tree-shaped sub-modules, and the speed of sending data and/or operation instructions from the main operation sub-module to the auxiliary operation sub-module can be increased by utilizing the forwarding function of the tree-shaped sub-modules, so that the instruction processing speed is increased.
In one possible implementation, the tree submodule 124 may be an optional structure of the apparatus, and may include at least one level of nodes. The nodes are wiring structures with a forwarding function and have no operation function themselves. The lowest-level nodes are connected to the slave operation submodules to forward data and/or operation instructions between the master operation submodule 121 and the slave operation submodules 122. In particular, if the tree submodule has zero levels of nodes, the apparatus does not require the tree submodule.
In one possible implementation, the tree submodule 124 may include a plurality of nodes of an n-ary tree structure, and the plurality of nodes of the n-ary tree structure may have a plurality of layers.
For example, fig. 2f shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 2f, the n-ary tree structure may be a binary tree structure with tree-type sub-modules including 2 levels of nodes 01. The lowest level node 01 is connected with the slave operation sub-module 122 to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122.
In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. The number of n in the n-ary tree structure and the number of layers of nodes in the n-ary tree structure may be set by those skilled in the art as needed, and the disclosure is not limited thereto.
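The following small sketch illustrates one way to group slave submodules under the forwarding nodes of an n-ary tree, level by level; the grouping strategy is an assumption used only to make the tree structure concrete, and the nodes here merely collect indices because, as stated above, they forward data without operating on it.

    def build_forwarding_tree(num_slaves, n=2):
        """Group slave submodule indices under forwarding nodes of an n-ary tree, level by level."""
        level = [[i] for i in range(num_slaves)]   # lowest-level nodes, one per slave submodule
        levels = [level]
        while len(level) > 1:
            level = [sum(level[i:i + n], []) for i in range(0, len(level), n)]
            levels.append(level)
        return levels                              # levels[-1][0] is the root connected to the master

    for depth, nodes in enumerate(build_forwarding_tree(4)):   # binary tree over 4 slaves
        print(depth, nodes)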
In one possible implementation, the operation domain may further include a tensor operation type.
The control module 11 may be further configured to determine a tensor operation type according to the operation domain.
In one possible implementation, the type of tensor operation may include at least one of: tensor multiplication operation, tensor and scalar multiplication operation, tensor addition operation, tensor summation operation, specified value storage operation meeting the operation condition, bitwise and operation, bitwise or operation, bitwise exclusive or operation, bitwise negation operation, bitwise maximum value operation and bitwise minimum value operation. The operation condition may include any one of the following: bitwise equal, bitwise unequal, bitwise less, bitwise greater than or equal to, bitwise greater than, bitwise less than or equal to. The specified value may be a numerical value such as 0, 1, etc., and the present disclosure does not limit this.
The operation satisfying the bit-wise equal storage of the specified value can be: judging whether corresponding bits of a first tensor to be operated and a second tensor to be operated in the tensors to be operated are equal, and storing a specified value when the corresponding bits of the first tensor to be operated and the second tensor to be operated are equal; and when the corresponding bits are not equal, storing the value of the first tensor to be operated or the second tensor to be operated in the corresponding bits, or storing a value which is different from the specified value, such as 0.
Satisfying the bitwise inequality store specified value operation may be: judging whether corresponding bits of a first tensor to be operated and a second tensor to be operated in the tensors to be operated are equal, and storing a specified value when the corresponding bits of the first tensor to be operated and the second tensor to be operated are not equal; when the corresponding bits are equal, the value of the first tensor to be operated or the second tensor to be operated at the corresponding bits is stored, or the value of 0 and the like which is different from the specified value is stored.
Satisfying the bitwise less than store specified value operation may be: judging the size relationship of a first tensor to be operated and a corresponding bit of a second tensor to be operated in the tensors to be operated, and storing a specified value when the value of the first tensor to be operated on the corresponding bit is smaller than the value of the second tensor to be operated; when the value of the first tensor to be operated on the corresponding bit is larger than or equal to the value of the second tensor to be operated, storing the value of the first tensor to be operated or the second tensor to be operated on the corresponding bit, or storing a value different from a specified value, such as 0.
Satisfying the bitwise greater than or equal to store the specified value operation may be: judging the size relationship of a first tensor to be operated and a corresponding bit of a second tensor to be operated in the tensors to be operated, and storing a specified value when the value of the first tensor to be operated on the corresponding bit is greater than or equal to the value of the second tensor to be operated; when the value of the first tensor to be operated on the corresponding bit is smaller than the value of the second tensor to be operated, storing the value of the first tensor to be operated or the second tensor to be operated on the corresponding bit, or storing a value such as 0 which is different from the specified value.
Satisfying the bitwise greater than store specified value operation may be: judging the size relationship of a corresponding bit of a first tensor to be operated and a second tensor to be operated in the tensors to be operated, and storing a specified value when the value of the first tensor to be operated on the corresponding bit is larger than the value of the second tensor to be operated; and when the value of the first tensor to be operated on the corresponding bit is less than or equal to the value of the second tensor to be operated, storing the value of the first tensor to be operated or the second tensor to be operated on the corresponding bit, or storing a value which is different from the specified value, such as 0.
Satisfying the bitwise less than or equal to store the specified value operation may be: judging the size relationship of a first tensor to be operated and a corresponding bit of a second tensor to be operated in the tensors to be operated, and storing a specified value when the value of the first tensor to be operated on the corresponding bit is less than or equal to the value of the second tensor to be operated; when the value of the first tensor to be operated on the corresponding bit is larger than the value of the second tensor to be operated, the value of the first tensor to be operated or the second tensor to be operated on the corresponding bit is stored, or the value of 0 and the like which is different from the specified value is stored.
In this implementation, different operation domain codes can be set for different tensor operation types to distinguish different operation categories. For example, the code of the "tensor multiplication operation" may be set to "mult". The code of the "tensor and scalar multiplication operation" may be set to "mult.const". The code of the "tensor addition operation" may be set to "add". The code of the "tensor summation operation" may be set to "sub". The code of the "bitwise AND operation" may be set to "and". The code of the "bitwise OR operation" may be set to "or". The code of the "bitwise exclusive-OR operation" may be set to "xor". The code of the "bitwise negation operation" may be set to "not". The code of the "bitwise maximum value operation" may be set to "max". The code of the "bitwise minimum value operation" may be set to "min". The code of the "store the specified value 1 if bitwise equal" operation may be set to "eq". The code of the "store the specified value 1 if bitwise unequal" operation may be set to "ne". The code of the "store the specified value 1 if bitwise less than" operation may be set to "lt". The code of the "store the specified value 1 if bitwise greater than or equal to" operation may be set to "ge". The code of the "store the specified value 1 if bitwise greater than" operation may be set to "gt". The code of the "store the specified value 1 if bitwise less than or equal to" operation may be set to "le".
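To make the comparison-and-store semantics above concrete, the sketch below applies an element-wise comparison to two tensors to be operated and stores the specified value 1 where the condition holds and 0 otherwise; treating the "bitwise" comparison as an element-wise comparison and using 0 as the alternative value are assumptions of this sketch.

    import operator

    COMPARISONS = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
                   "ge": operator.ge, "gt": operator.gt, "le": operator.le}

    def compare_and_store(op_code, tensor_a, tensor_b, specified_value=1, other_value=0):
        """Element-wise comparison of two tensors; store the specified value where the
        condition holds and another value (0 here) where it does not."""
        cmp = COMPARISONS[op_code]
        return [specified_value if cmp(a, b) else other_value for a, b in zip(tensor_a, tensor_b)]

    print(compare_and_store("lt", [1, 5, 3], [2, 4, 3]))   # [1, 0, 0]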
The operation types and the corresponding codes thereof can be set by those skilled in the art according to actual needs, and the disclosure does not limit this.
In one possible implementation, the operation field may further include an input quantity. The control module 11 is further configured to determine an input quantity according to the operation domain, and obtain a tensor to be operated, where the data quantity is the input quantity, from the data address to be operated.
In this implementation, the input quantity may be a parameter characterizing the amount of data of the tensor to be operated on, e.g., tensor length, width, etc.
In one possible implementation, a default input amount may be set. When the input quantity cannot be determined according to the operation domain, the default input quantity can be determined as the input quantity of the current tensor instruction, and the tensor to be operated with the data quantity as the default input quantity is obtained from the data address to be operated.
In one possible implementation, as shown in fig. 2 a-2 f, the apparatus may further include a storage module 13. The storage module 13 is used for storing the tensor to be operated.
In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratch pad cache. The tensor to be computed may be stored in the memory, cache, and/or register of the storage module as needed, which is not limited by this disclosure.
In a possible implementation manner, the apparatus may further include a direct memory access module for reading or storing data from the storage module.
In one possible implementation, as shown in fig. 2 a-2 f, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112, and a queue storage sub-module 113.
The instruction storage submodule 111 is used to store tensor instructions.
The instruction processing sub-module 112 is configured to parse the tensor instruction to obtain an operation code and an operation domain of the tensor instruction.
The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes multiple instructions to be executed that are sequentially arranged according to an execution order, and the instructions to be executed may include tensor instructions.
In this implementation, the instructions to be executed may also include computation instructions related or unrelated to tensor operations, which is not limited by this disclosure. The execution sequence of the multiple instructions to be executed can be arranged according to the receiving time, the priority level and the like of the instructions to be executed to obtain an instruction queue, so that the multiple instructions to be executed can be sequentially executed according to the instruction queue.
In one possible implementation, as shown in fig. 2 a-2 f, the control module 11 may further include a dependency processing sub-module 114.
The dependency relationship processing submodule 114 is configured to, when it is determined that a first to-be-executed instruction in the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage submodule 111, and after the zeroth to-be-executed instruction is executed, extract the first to-be-executed instruction from the instruction storage submodule 111 and send the first to-be-executed instruction to the operation module 12.
Whether the zeroth to-be-executed instruction preceding the first to-be-executed instruction is associated with the first to-be-executed instruction may be determined as follows: there is an association when the first storage address interval storing the data required by the first to-be-executed instruction overlaps the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction. Conversely, there is no association between the first to-be-executed instruction and the preceding zeroth to-be-executed instruction when the first storage address interval and the zeroth storage address interval have no overlapping area.
In this way, according to the dependency relationship between the first to-be-executed instruction and the preceding zeroth to-be-executed instruction, the subsequent first to-be-executed instruction is executed only after the preceding zeroth to-be-executed instruction has finished executing, which ensures the accuracy of the operation result.
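The overlap criterion described above can be expressed compactly; the sketch below checks whether the first storage address interval and the zeroth storage address interval overlap, with half-open (start, end) intervals as an assumption of the sketch.

    def has_dependency(first_interval, zeroth_interval):
        """Two to-be-executed instructions are associated when the storage address interval of
        the data required by the first overlaps that of the zeroth (intervals are half-open)."""
        f_start, f_end = first_interval
        z_start, z_end = zeroth_interval
        return f_start < z_end and z_start < f_end

    print(has_dependency((100, 200), (150, 300)))   # True  -> cache the first instruction
    print(has_dependency((100, 200), (200, 300)))   # False -> it can be issued immediately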
In one possible implementation, the instruction format of the tensor instruction may be:
opcode,dst,src,type,NumOfEle
where opcode is the operation code of the tensor instruction, and dst, src, type and NumOfEle are the operation domain of the tensor instruction. dst is the target address. src is the address of the tensor to be operated; when there are multiple tensors to be operated, src may include multiple data addresses to be operated src0, src1, ..., srcn, which is not limited by this disclosure. type is the tensor operation type, and may be a code of the tensor operation type, such as mult, mult.const, add, sub, eq, ne, lt, ge, gt, le, and, or, xor, not, max or min. NumOfEle is the input quantity.
When the to-be-computed tensors are multiple, the instruction format may include multiple to-be-computed data addresses, and the instruction format of the tensor instruction may be as follows, taking two to-be-computed tensors as an example:
opcode,dst,src0,src1,type,NumOfEle
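Assuming a plain comma-separated textual encoding of the field order given above (the encoding itself is not specified by the disclosure), a parser that separates the operation code from the operation domain could look like the following sketch.

    def parse_tensor_instruction(text):
        """Split an instruction of the form 'opcode,dst,src0,...,srcn,type,NumOfEle' into
        an operation code and an operation domain holding the remaining fields."""
        fields = [f.strip() for f in text.split(",")]
        opcode, dst, *srcs, op_type, num_of_ele = fields
        return {"opcode": opcode,
                "operation_domain": {"dst": int(dst),
                                     "src": [int(s) for s in srcs],
                                     "type": op_type,
                                     "NumOfEle": int(num_of_ele)}}

    print(parse_tensor_instruction("opcode, 500, 101, 102, add, 1024"))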
in one possible implementation, the instruction format of the tensor instruction may be:
type,dst,src,NumOfEle
in one possible implementation, the instruction format of the tensor instruction for the "tensor multiplication operation" may be set to: mult, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-computed tensor of the NumOfEle size from a first to-be-computed address src0, obtaining a second to-be-computed tensor of the NumOfEle size from a second to-be-computed address src1, and multiplying the first to-be-computed tensor and the second to-be-computed tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "tensor and scalar multiplication operation" may be set to: mult.const, dst, src0, NumOfEle, scale. It represents: obtaining a tensor to be operated of size NumOfEle from the first data address to be operated src0, obtaining the scalar to be operated scale and the input quantity NumOfEle from scalar registers, multiplying the tensor to be operated by the scalar to be operated to obtain an operation result, and storing the operation result at the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "tensor and scalar multiplication operation" may also be set to: mult.const, dst, src0, NumOfEle, scale. It represents: obtaining a tensor to be operated of size NumOfEle from the first data address to be operated src0, and obtaining the scalar to be operated scale and the input quantity NumOfEle from scalar registers. Optionally, the type of the first data to be operated may be fixed-point data, floating-point data, 16-bit data or 32-bit data; this is merely an example and is not intended to limit the type of the first data to be operated. The tensor to be operated is multiplied by the scalar to be operated to obtain an operation result, and the operation result is stored at the target address dst.
The input quantity NumOfEle may be an integer divisible by 64, but in other embodiments, the input quantity NumOfEle may also be an integer divisible by 2, 4, 8, 16, 32, or the like, and this is merely an example and is not intended to limit a specific value range of the input quantity.
Optionally, the storage module may include an on-chip memory space, which may be an on-chip NRAM, for storing tensor data or scalar data. The source address src0 and the target address dst may point to a memory space in the NRAM. Of course, in other embodiments, the memory spaces pointed to by the source address src0 and the target address dst may also be other memory spaces of the storage module.
Further, the source address src0 and the target address dst both refer to start addresses; each of the source address and the target address corresponds to a default address offset, and the default address offset may be a multiple of 64 bytes. Of course, in other embodiments, the default address offset may also be an integer multiple of 8 bytes, 16 bytes, 32 bytes or 128 bytes, etc.; this is only for illustration and is not a specific limitation. Specifically, the address offset may be determined according to the operation result.
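The sketch below illustrates the bookkeeping implied by the constraints above: NumOfEle is checked to be a multiple of 64, and the address offset applied after an operation is rounded up to a multiple of 64 bytes. The element width of 2 bytes (16-bit data) is an assumption of the sketch.

    def next_tensor_addresses(src, dst, num_of_ele, bytes_per_element=2, offset_granularity=64):
        """Advance src and dst by a default address offset rounded up to a 64-byte multiple."""
        assert num_of_ele % 64 == 0, "NumOfEle should be divisible by 64"
        size = num_of_ele * bytes_per_element
        offset = ((size + offset_granularity - 1) // offset_granularity) * offset_granularity
        return src + offset, dst + offset

    print(next_tensor_addresses(src=0, dst=65536, num_of_ele=1024))   # (2048, 67584)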
In one possible implementation, the instruction format of the tensor instruction for the "tensor addition operation" may be set to: add, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and adding the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "tensor summation operation" may be set to: sub, dst, src, NumOfEle. It represents: and acquiring a plurality of tensors to be operated with the size of NumOfEle from the address src to be operated, and performing summation operation on the plurality of tensors to be operated to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "bitwise and operation" may be set to: and, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and carrying out bitwise AND operation on the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for "bitwise or operation" may be set to: or, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and carrying out bitwise OR operation on the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "bitwise xor operation" may be set to: xor, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and carrying out bitwise XOR operation on the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "bitwise negation" operation may be set to: not, dst, src, NumOfEle. It represents: and acquiring a tensor to be operated with the size of NumOfEle from the address src to be operated, and performing bitwise negation operation on the tensor to be operated to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "maximum bitwise operation" may be set to: max, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-computed tensor of the NumOfEle size from a first to-be-computed address src0, obtaining a second to-be-computed tensor of the NumOfEle size from a second to-be-computed address src1, and carrying out bitwise maximum value computing on the first to-be-computed tensor and the second to-be-computed tensor to obtain a computing result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "minimum by bit operation" may be set to: min, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and carrying out bitwise minimum value calculation on the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
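The element-wise instructions listed above differ only in the operation applied to corresponding elements, so a software model can dispatch on the type code; the table below is a sketch of such a dispatch (integer tensor data is assumed for the bitwise operations, and the mapping itself is an illustration, not the hardware implementation).

    ELEMENTWISE_BINARY = {
        "mult": lambda a, b: a * b,
        "add":  lambda a, b: a + b,
        "and":  lambda a, b: a & b,   # bitwise operations assume integer tensor data
        "or":   lambda a, b: a | b,
        "xor":  lambda a, b: a ^ b,
        "max":  max,
        "min":  min,
    }

    def execute_binary(op_type, tensor_a, tensor_b):
        """Apply the element-wise operation selected by the type code to two tensors."""
        op = ELEMENTWISE_BINARY[op_type]
        return [op(a, b) for a, b in zip(tensor_a, tensor_b)]

    def execute_not(tensor):
        """Bitwise negation of each element (the unary 'not' instruction)."""
        return [~x for x in tensor]

    print(execute_binary("max", [1, 7, 3], [4, 2, 9]))   # [4, 7, 9]
    print(execute_not([0, 1]))                           # [-1, -2]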
In one possible implementation, the instruction format for the tensor instruction that satisfies the bit-wise equality store the specified value 1 operation may be set to: eq, dst, src0, src1, NumOfEle. It represents: the method includes the steps of obtaining a first to-be-computed tensor of the NumOfEle size from a first to-be-computed address src0, obtaining a second to-be-computed tensor of the NumOfEle size from a second to-be-computed address src1, comparing the first to-be-computed tensor with the second to-be-computed tensor in a bitwise mode, and storing a specified value 1 when corresponding bits of the first to-be-computed tensor and the second to-be-computed tensor are equal to each other to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that stores the specified value 1 operation if bit-wise inequality is satisfied may be set to: ne, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, comparing the first to-be-operated tensor with the second to-be-operated tensor by bit, and storing a specified value 1 when corresponding bits of the first to-be-operated tensor and the second to-be-operated tensor are not equal to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that satisfies the specified value 1 operation if bitwise less is stored may be set to: lt, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, comparing the first to-be-operated tensor with the second to-be-operated tensor in a bitwise mode, and storing a specified value 1 when the value of the first to-be-operated tensor on a corresponding bit is smaller than that of the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that satisfies the specified value 1 operation if bitwise is greater than or equal to is set to: ge, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, comparing the first to-be-operated tensor with the second to-be-operated tensor in a bitwise mode, and storing a specified value 1 when the value of the first to-be-operated tensor on the corresponding bit is larger than or equal to the value of the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that satisfies the specified value 1 operation if bitwise greater is stored may be set to: gt, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-computed tensor of the NumOfEle size from a first to-be-computed address src0, obtaining a second to-be-computed tensor of the NumOfEle size from a second to-be-computed address src1, comparing the first to-be-computed tensor with the second to-be-computed tensor in a bitwise mode, and storing a specified value 1 when the value of the first to-be-computed tensor on a corresponding bit is larger than that of the second to-be-computed tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that satisfies the specified value 1 operation if bitwise is less than or equal to is set to: le, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, comparing the first to-be-operated tensor with the second to-be-operated tensor in a bitwise mode, and storing a specified value 1 when the value of the first to-be-operated tensor on a corresponding bit is smaller than or equal to the value of the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
It should be understood that the position of the operation code and the operation field in the instruction format of the tensor instruction can be set by one skilled in the art according to requirements, and the disclosure is not limited thereto.
In one possible implementation manner, the apparatus may be disposed in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and an embedded Neural Network Processor (NPU).
It should be noted that, although the tensor instruction processing apparatus is described above by taking the above embodiment as an example, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
Application example
An application example according to an embodiment of the present disclosure is given below, taking "tensor operation by a tensor instruction processing apparatus" as an exemplary application scenario, to facilitate understanding of the flow of the tensor instruction processing apparatus. Those skilled in the art will understand that the following application example is merely intended to facilitate understanding of the embodiments of the present disclosure and should not be construed as limiting them.
Fig. 3 is a schematic diagram illustrating an application scenario of a tensor instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the tensor instruction processing device processes the tensor instruction as follows:
the control module 11 parses the obtained tensor instruction 1 (for example, tensor instruction 1 is @opcode#500#101#mult.const#1024) to obtain the operation code and operation domain of tensor instruction 1: the operation code is opcode, the target address is 500, the address of the tensor to be operated is 101, the tensor operation type is mult.const (tensor and scalar multiplication operation), and the input amount is 1024. The control module 11 acquires, from the tensor address 101, the tensor to be operated whose data amount is the input amount 1024, and acquires the scalar to be operated (provided as an immediate value or in a scalar register). The operation module 12 performs the tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain operation result 1, and stores operation result 1 into the target address 500.
Tensor instruction 1 may also be written in other instruction formats, for example @opcode#500#101#1024#mult.const or @mult.const#500#101#1024; the processing of tensor instructions in different instruction formats is similar and is not described again.
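To make the parse-then-execute flow of this example concrete, the sketch below uses Python. The "#"-separated text form, the field order, and the treatment of memory as a flat Python list mirror the example instruction @opcode#500#101#mult.const#1024 for illustration only and are not the device's actual encoding; the scalar is passed in separately, standing in for the immediate value or scalar register mentioned above.

```python
def parse_tensor_instruction(text):
    """Split a textual tensor instruction of the assumed form
    '@opcode#dst#src#type#NumOfEle' into an opcode and an operation domain."""
    fields = text.lstrip("@").split("#")
    opcode, dst, src, op_type, num = fields
    return opcode, {"dst": int(dst), "src": int(src),
                    "type": op_type, "num_of_ele": int(num)}

def execute(memory, text, scalar):
    opcode, domain = parse_tensor_instruction(text)
    if domain["type"] == "mult.const":            # tensor and scalar multiplication
        src, dst, n = domain["src"], domain["dst"], domain["num_of_ele"]
        tensor = memory[src:src + n]              # tensor to be operated
        memory[dst:dst + n] = [x * scalar for x in tensor]  # operation result

# Sketch of tensor instruction 1: source at 101, result at 500, 1024 elements.
memory = [0.0] * 2048
memory[101:101 + 1024] = [1.0] * 1024
execute(memory, "@opcode#500#101#mult.const#1024", scalar=3.0)
print(memory[500], memory[500 + 1023])  # 3.0 3.0
```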
The working process of the above modules can refer to the above related description.
Thus, the tensor instruction processing device can efficiently and quickly process tensor instructions.
The present disclosure provides a machine learning arithmetic device, which may include one or more of the tensor instruction processing devices described above and is configured to acquire tensors to be operated and control information from other processing devices and perform specified machine learning operations. The machine learning arithmetic device can obtain tensor instructions from other machine learning arithmetic devices or non-machine-learning arithmetic devices, and transmit execution results to peripheral devices (also referred to as other processing devices) through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one tensor instruction processing device is included, the tensor instruction processing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale neural network operations. In this case, the devices may share the same control system or have separate control systems, and may share memory or have separate memories for each accelerator. In addition, the interconnection manner may be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 4a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in fig. 4a, the combined processing device includes the machine learning arithmetic device, the universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing operations including data transfer and completing basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.
The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device acquires required input data from the other processing devices and writes it into the storage device on the machine learning arithmetic device; it can obtain control instructions from the other processing devices and write them into a control cache on the machine learning arithmetic device chip; it can also read data from the storage module of the machine learning arithmetic device and transmit the data to the other processing devices.
Fig. 4b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in fig. 4b, the combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing device respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing device, and is particularly suitable for data to be computed that cannot be fully held in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device can serve as the system on chip (SoC) of devices such as mobile phones, robots, unmanned aerial vehicles, and video monitoring equipment, effectively reducing the core area of the control portion, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
The present disclosure provides a machine learning chip, which includes the above machine learning arithmetic device or combined processing device.
The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in fig. 5, the board card includes the above-mentioned machine learning chip package structure or the above-mentioned machine learning chip. In addition to the machine learning chip 389, the board card may include other components, including but not limited to: a memory device 390, an interface device 391, and a control device 392.
The memory device 390 is coupled to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure) via a bus and is used for storing data. The memory device 390 may include multiple groups of memory cells 393, and each group of memory cells 393 is coupled to the machine learning chip 389 via a bus. It can be understood that each group of memory cells 393 may be DDR SDRAM (Double Data Rate SDRAM).
DDR doubles the speed of SDRAM without increasing the clock frequency by allowing data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM.
In one embodiment, the memory device 390 may include 4 groups of memory cells 393, and each group of memory cells 393 may include a plurality of DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of memory cells 393, the theoretical bandwidth of data transfer may reach 25600 MB/s.
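The 25600 MB/s figure is consistent with the stated transfer rate and the 64-bit (8-byte) data path noted above; a back-of-the-envelope check:

```latex
3200\,\mathrm{MT/s} \times 8\,\mathrm{B/transfer} = 25600\,\mathrm{MB/s}
```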
In one embodiment, each group of memory cells 393 includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389 to control the data transfer and data storage of each memory cell 393.
The interface device 391 is electrically coupled to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface, and the data to be processed is transmitted by the server to the machine learning chip 389 through the standard PCIE interface to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface; the present disclosure does not limit the specific form of the other interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is transmitted back to the external device (e.g., a server) by the interface device.
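The 16000 MB/s figure for PCIE 3.0 x16 follows from 8 GT/s per lane across 16 lanes; as a rough check (ignoring the 128b/130b encoding overhead, which lowers the usable rate to about 15.75 GB/s):

```latex
16\ \text{lanes} \times 8\,\mathrm{GT/s} \times \frac{1\,\mathrm{B}}{8\,\mathrm{bits}} = 16\,\mathrm{GB/s} = 16000\,\mathrm{MB/s}
```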
The control device 392 is electrically connected to the machine learning chip 389 and is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a single-chip microcomputer (MCU). The machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads, so it can be in different working states such as heavy load and light load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the machine learning chip.
The present disclosure provides an electronic device, which includes the above machine learning chip or board card.
The electronic device may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an aircraft, a ship, and/or a vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
Figure 6 illustrates a flow diagram of a tensor instruction processing method according to an embodiment of the present disclosure. As shown in fig. 6, the method is applied to the tensor instruction processing apparatus described above, and includes step S51 and step S52.
In step S51, the compiled tensor instruction is parsed to obtain the operation code and operation domain of the tensor instruction, and the tensor to be operated, the scalar to be operated, and the target address required for executing the tensor instruction are obtained according to the operation code and the operation domain. The operation code is used for indicating that the operation performed on data by the tensor instruction is a tensor and scalar multiplication operation, and the operation domain includes the source address of the tensor to be operated, the scalar to be operated, and the target address.
In step S52, tensor and scalar multiplication operation is performed on the tensor to be operated and the scalar to be operated to obtain an operation result, and the operation result is stored in the target address.
In one possible implementation, performing the tensor operation on the tensor to be operated to obtain the operation result may include: performing the tensor and scalar multiplication operation using at least one tensor operator. Specifically, each element in the tensor to be operated is multiplied by the scalar to be operated by at least one tensor operator to obtain the operation result, thereby realizing the tensor and scalar multiplication operation.
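At the operator level, this amounts to a plain element-by-element product; a minimal sketch, assuming the tensor is held as a flat list:

```python
def tensor_scalar_multiply(tensor_to_operate, scalar_to_operate):
    # Each element is multiplied by the scalar to form the operation result.
    return [element * scalar_to_operate for element in tensor_to_operate]

print(tensor_scalar_multiply([1.0, 2.0, 4.0], 0.5))  # [0.5, 1.0, 2.0]
```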
In one possible implementation, the method may further include: parsing the compiled tensor instruction to obtain a plurality of operation instructions. In this case, step S52 may include:
the control module analyzes the compiled tensor instructions to obtain a plurality of operation instructions and sends the tensor to be operated and the operation instructions to the main operation submodule;
the main operation submodule executes preorder processing on the tensor to be operated and sends the operation instruction and at least one part of the tensor to be operated to the slave operation submodule; the tensor operator of the main operation sub-module can execute the multiplication operation of the tensor and the scalar to obtain an intermediate result;
the tensor arithmetic unit of the slave arithmetic submodule executes multiplication operation of the tensor and the scalar in parallel according to data and an arithmetic instruction received from the master arithmetic submodule to obtain a plurality of intermediate results, and transmits the intermediate results to the master arithmetic submodule;
and the main operation submodule executes subsequent processing on the plurality of intermediate results to obtain an operation result and stores the operation result into the target address.
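A highly simplified sketch of this master/slave decomposition, using Python threads to stand in for the slave operation submodules; the chunking strategy, the number of slaves, and the use of threads are illustrative assumptions rather than the device's actual microarchitecture.

```python
from concurrent.futures import ThreadPoolExecutor

def slave_multiply(chunk, scalar):
    """Slave operation submodule: tensor-scalar multiply on its part of the
    tensor, producing one intermediate result."""
    return [x * scalar for x in chunk]

def master_execute(tensor, scalar, num_slaves=4):
    """Master operation submodule: split the tensor to be operated (pre-processing),
    dispatch the parts, then concatenate the intermediate results (post-processing)."""
    step = (len(tensor) + num_slaves - 1) // num_slaves
    chunks = [tensor[i:i + step] for i in range(0, len(tensor), step)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediates = list(pool.map(lambda c: slave_multiply(c, scalar), chunks))
    return [x for part in intermediates for x in part]  # operation result

print(master_execute([1, 2, 3, 4, 5, 6, 7, 8], 10))  # [10, 20, ..., 80]
```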
In one possible implementation, the operation field may further include an input quantity. The obtaining, according to the operation code and the operation domain, the tensor to be operated and the target address which are required for executing the tensor instruction may further include: and determining the input quantity according to the operation domain, and acquiring a tensor to be operated with the data quantity as the input quantity from the data address to be operated.
In one possible implementation, the method may further include: the tensor to be computed is stored.
In one possible implementation, the scalar to be computed is an immediate or scalar register.
In one possible implementation, parsing the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction may include:
storing the compiled tensor instruction;
analyzing the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction;
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed may include the compiled tensor instruction.
In one possible implementation, the method may further include: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed before the first instruction to be executed, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after it is determined that the zeroth instruction to be executed has finished executing.
The association relationship between the first instruction to be executed and the zeroth instruction to be executed before it may include: a first storage address interval storing the data required by the first instruction to be executed and a zeroth storage address interval storing the data required by the zeroth instruction to be executed have an overlapping area.
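The association relationship is essentially an interval-overlap test on storage address ranges; a minimal sketch, assuming half-open [start, end) address intervals:

```python
def intervals_overlap(first_start, first_end, zeroth_start, zeroth_end):
    """True if the first storage address interval and the zeroth storage
    address interval share at least one address (half-open intervals assumed)."""
    return first_start < zeroth_end and zeroth_start < first_end

def must_wait(first_instr_ranges, zeroth_instr_ranges):
    """The first instruction to be executed is cached until the zeroth one
    finishes whenever any of their data address ranges overlap."""
    return any(intervals_overlap(a, b, c, d)
               for (a, b) in first_instr_ranges
               for (c, d) in zeroth_instr_ranges)

# Example: first instruction reads [500, 600), zeroth instruction writes [550, 650)
print(must_wait([(500, 600)], [(550, 650)]))  # True -> execute in order
```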
In a possible implementation manner, compiling the obtained tensor instruction to obtain a compiled tensor instruction may include:
and generating an assembly file according to the tensor instruction, and translating the assembly file into a binary file. The binary file is a compiled tensor instruction.
It should be noted that, although the tensor instruction processing method is described above by taking the above embodiment as an example, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
The tensor instruction processing method provided by the embodiment of the disclosure has the advantages of wide application range, high tensor processing efficiency and high processing speed.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present disclosure, it should be understood that the disclosed system and apparatus may be implemented in other ways. For example, the above-described embodiments of systems and apparatuses are merely illustrative, and for example, a division of a device, an apparatus, and a module is merely a logical division, and an actual implementation may have another division, for example, a plurality of modules may be combined or integrated into another system or apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices, apparatuses or modules, and may be an electrical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present disclosure may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a form of hardware or a form of a software program module.
The integrated modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Embodiments of the present application further provide a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by one or more processing devices, the steps in the method according to any of the above embodiments are implemented. Specifically, when being executed by a processor, the computer program realizes the following steps:
analyzing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring a tensor to be operated, a scalar to be operated and a target address which are required by executing the tensor instruction according to the operation code and the operation domain;
carrying out tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
It should be clear that, in the embodiments of the present application, implementation manners of each step are consistent with implementation manners of each step in the foregoing method, and specific reference may be made to the above description, and details are not described here again.
The foregoing may be better understood in light of the following clauses:
clause 1: a tensor instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring, according to the operation code and the operation domain, a tensor to be operated, a scalar to be operated and a target address required for executing the tensor instruction;
the operation module is used for carrying out tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
Clause 2: the apparatus of clause 1, the operation module comprising:
and the tensor operators are used for multiplying each element in the tensor to be operated with the scalar to be operated to obtain an operation result so as to realize the tensor and scalar multiplication operation.
Clause 3: the apparatus of clause 1 or 2, the computation module comprising a master computation sub-module and a plurality of slave computation sub-modules, the master computation sub-module and the plurality of slave computation sub-modules each comprising the tensor operator;
the control module is further configured to analyze the compiled tensor instruction to obtain a plurality of operation instructions, and send the tensor to be operated and the plurality of operation instructions to the main operation sub-module;
the main operation submodule executes preorder processing on the tensor to be operated and sends the operation instruction and at least one part of the tensor to be operated to the slave operation submodule; the tensor operator of the main operation sub-module can execute the multiplication operation of the tensor and the scalar to obtain an intermediate result;
the tensor arithmetic device of the slave arithmetic submodule is used for executing multiplication operation of the tensor and the scalar in parallel according to the data and the arithmetic instruction received from the master arithmetic submodule to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to the master arithmetic submodule;
and the main operation sub-module is also used for executing subsequent processing on the plurality of intermediate results to obtain operation results and storing the operation results into the target address.
Clause 4: the apparatus of any of clauses 1-3, the operational field further comprising an input quantity,
the control module is further configured to determine the tensor to be operated according to a source address of the tensor to be operated and the input quantity.
Clause 5: the apparatus of any of clauses 1-4, wherein the scalar to be operated on is an immediate or a scalar register.
Clause 6: the apparatus of any of clauses 1-5, further comprising:
and the storage module is used for storing at least one of the tensor to be operated and the scalar to be operated.
Clause 7: the apparatus of any of clauses 1-6, wherein the control module comprises:
the instruction storage submodule is used for storing the compiled tensor instruction;
the instruction processing submodule is used for analyzing the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction;
the queue storage submodule is used for storing an instruction queue, the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the compiled tensor instructions.
Clause 8: the apparatus of any of clauses 1-7, the control module further comprising:
the dependency relationship processing submodule is used for caching a first instruction to be executed in the instruction storage submodule when the fact that the first instruction to be executed in the plurality of instructions to be executed is associated with a zeroth instruction to be executed before the first instruction to be executed is determined, extracting the first instruction to be executed from the instruction storage submodule after the zeroth instruction to be executed is executed, and sending the first instruction to be executed to the operation module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
Clause 9: a tensor instruction processing method applied to a tensor instruction processing apparatus, the method comprising:
analyzing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring a tensor to be operated, a scalar to be operated and a target address which are required by executing the tensor instruction according to the operation code and the operation domain;
carrying out tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
Clause 10: the method of clause 9, wherein the calculation module comprises:
and the tensor operators are used for multiplying each element in the tensor to be operated with the scalar to be operated to obtain an operation result so as to realize the tensor and scalar multiplication operation.
Clause 11: the method of clause 9 or 10, wherein the calculation module comprises a master calculation sub-module and a plurality of slave calculation sub-modules, each comprising the tensor calculator;
the control module analyzes the compiled tensor instructions to obtain a plurality of operation instructions and sends the tensor to be operated and the operation instructions to the main operation submodule;
the main operation submodule executes preorder processing on the tensor to be operated and sends the operation instruction and at least one part of the tensor to be operated to the slave operation submodule; the tensor operator of the main operation sub-module can execute the multiplication operation of the tensor and the scalar to obtain an intermediate result;
the tensor arithmetic unit of the slave arithmetic submodule executes multiplication operation of the tensor and the scalar in parallel according to data and an arithmetic instruction received from the master arithmetic submodule to obtain a plurality of intermediate results, and transmits the intermediate results to the master arithmetic submodule;
and the main operation sub-module executes subsequent processing on the plurality of intermediate results to obtain operation results, and stores the operation results into the target address.
Clause 11: the method of any of clauses 9-10, the operational field further comprising an input quantity,
the control module is further configured to determine the tensor to be operated according to a source address of the tensor to be operated and the input quantity.
Clause 13: The method according to any of clauses 9-12, wherein the scalar to be operated on is an immediate or a scalar register.
Clause 14: The method of any of clauses 9-13, further comprising:
and storing at least one of the tensor to be operated and the scalar to be operated.
Clause 15: The method according to any of clauses 9-14, wherein parsing the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction comprises:
the instruction storage submodule stores the compiled tensor instruction;
the instruction processing submodule analyzes the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction;
the queue storage submodule stores an instruction queue, the instruction queue comprises a plurality of instructions to be executed, which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the compiled tensor instructions.
Clause 16: The method of any of clauses 9-15, further comprising:
when determining that a first to-be-executed instruction in the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction in the instruction storage submodule, after the zeroth to-be-executed instruction is executed, extracting the first to-be-executed instruction from the instruction storage submodule and sending the first to-be-executed instruction to the operation module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
Clause 17: A computer readable storage medium having stored thereon a computer program which, when executed by one or more processing devices, implements the steps of the method of any of clauses 9-16.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A tensor instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring a tensor to be operated, a scalar to be operated and a target address which are required by execution of the tensor instruction according to the operation code and the operation domain;
the operation module is used for carrying out tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
2. The apparatus of claim 1, wherein the operation module comprises:
and the tensor arithmetic unit is used for multiplying each element in the tensor to be operated with the scalar to be operated to obtain an operation result so as to realize the multiplication operation of the tensor and the scalar.
3. The apparatus of claim 2, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, each of the master operation submodule and the plurality of slave operation submodules comprising the tensor operator;
the control module is also used for analyzing the tensor instruction to obtain a plurality of operation instructions, and sending the tensor to be operated and the plurality of operation instructions to the main operation submodule;
the main operation submodule executes preorder processing on the tensor to be operated and sends the operation instruction and at least one part of the tensor to be operated to the slave operation submodule; the tensor operator of the main operation sub-module can execute the multiplication operation of the tensor and the scalar to obtain an intermediate result;
the tensor arithmetic device of the slave arithmetic submodule is used for executing multiplication operation of the tensor and the scalar in parallel according to the data and the arithmetic instruction received from the master arithmetic submodule to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to the master arithmetic submodule;
and the main operation sub-module is also used for executing subsequent processing on the plurality of intermediate results to obtain operation results and storing the operation results into the target address.
4. The apparatus of claim 1, wherein the operational field further comprises an input quantity,
the control module is further configured to determine the tensor to be operated according to a source address of the tensor to be operated and the input quantity.
5. The apparatus according to any of claims 1-4, wherein the scalar to be operated on is an immediate or a scalar register.
6. The apparatus of any of claims 1-4, further comprising:
and the storage module is used for storing at least one of the tensor to be operated and the scalar to be operated.
7. The apparatus of any of claims 1-4, wherein the control module comprises:
the instruction storage submodule is used for storing the tensor instruction;
the instruction processing submodule is used for analyzing the tensor instruction to obtain an operation code and an operation domain of the tensor instruction;
the queue storage submodule is used for storing an instruction queue, the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the tensor instructions.
8. The apparatus of claim 7, wherein the control module further comprises:
the dependency relationship processing submodule is used for caching a first instruction to be executed in the instruction storage submodule when the fact that the first instruction to be executed in the plurality of instructions to be executed is associated with a zeroth instruction to be executed before the first instruction to be executed is determined, extracting the first instruction to be executed from the instruction storage submodule after the zeroth instruction to be executed is executed, and sending the first instruction to be executed to the operation module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
9. A tensor instruction processing method applied to a tensor instruction processing apparatus, the method comprising:
analyzing the obtained tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and obtaining a tensor to be operated, a scalar to be operated and a target address which are required by executing the tensor instruction according to the operation code and the operation domain;
performing tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by one or more processing means, carries out the steps of the method as set forth in claim 9.