
CN111966402A - Instruction processing method and device and related product - Google Patents

Instruction processing method and device and related product

Info

Publication number
CN111966402A
Authority
CN
China
Prior art keywords
tensor
instruction
operated
executed
scalar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910420884.9A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910420884.9A priority Critical patent/CN111966402A/en
Priority to PCT/CN2020/088248 priority patent/WO2020233387A1/en
Publication of CN111966402A publication Critical patent/CN111966402A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/3001 Arithmetic instructions
    • G06F 9/30029 Logical and Boolean instructions, e.g. XOR, NOT
    • G06F 9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F 9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)

Abstract

The disclosure relates to a tensor instruction processing method and apparatus and a related product. The machine learning arithmetic device comprises one or more instruction processing apparatuses and is used for acquiring tensors to be operated and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to the other processing devices through an I/O interface. When the machine learning arithmetic device includes a plurality of instruction processing apparatuses, the instruction processing apparatuses can be connected to one another in a specific configuration to transfer data: they are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus; they share the same control system or have their own control systems, and share a memory or have their own memories; and their interconnection topology can be arbitrary. The tensor instruction processing method and apparatus and the related product provided by the embodiments of the disclosure have a wide application range and process instructions with high efficiency and at high speed.

Description

Instruction processing method and device and related product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a tensor instruction processing method and apparatus, and a related product.
Background
With the continuous development of science and technology, machine learning, and neural network algorithms in particular, is used more and more widely, and has performed well in fields such as image recognition, speech recognition and natural language processing. However, as the complexity of neural network algorithms increases, the types and number of data operations involved keep growing. In the related art, tensor-related operations on tensor data are performed with low efficiency and at low speed.
Disclosure of Invention
In view of the above, the present disclosure provides a tensor instruction processing method and apparatus and a related product, so as to improve the efficiency and speed of performing tensor-related operations on tensor data.
According to a first aspect of the present disclosure, there is provided a tensor instruction processing apparatus, the apparatus comprising:
the control module is used for parsing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and for acquiring, according to the operation code and the operation domain, the tensor to be operated, the scalar to be operated and the target address required for executing the tensor instruction;
the operation module is used for performing a tensor-scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and for storing the operation result at the target address;
wherein the operation code is used for indicating that the operation performed by the tensor instruction on data is a tensor-scalar multiplication operation, and the operation domain comprises the source address of the tensor to be operated, the scalar to be operated and the target address.
According to a second aspect of the present disclosure, there is provided a machine learning arithmetic device, the device including:
one or more tensor instruction processing devices of the first aspect, configured to acquire tensors to be computed and control information from other processing devices, execute a specified machine learning operation, and transmit an execution result to the other processing devices through an I/O interface;
when the machine learning operation device includes a plurality of the tensor instruction processing devices, the plurality of the tensor instruction processing devices may be connected to each other by a specific structure to transmit data;
the tensor instruction processing devices are interconnected through a Peripheral Component Interconnect Express (PCIE) bus and transmit data, so as to support larger-scale machine learning operations; the tensor instruction processing devices share the same control system or have their own respective control systems; the tensor instruction processing devices share a memory or have their own respective memories; and the plurality of tensor instruction processing devices are interconnected in an arbitrary interconnection topology.
According to a third aspect of the present disclosure, there is provided a combined processing apparatus, the apparatus comprising:
the machine learning arithmetic device according to the second aspect, a universal interconnect interface, and other processing devices;
wherein the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation specified by the user.
According to a fourth aspect of the present disclosure, there is provided a machine learning chip, which includes the machine learning arithmetic device of the second aspect or the combined processing device of the third aspect.
According to a fifth aspect of the present disclosure, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.
According to a sixth aspect of the present disclosure, a board card is provided, which includes the machine learning chip packaging structure of the fifth aspect.
According to a seventh aspect of the present disclosure, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board of the sixth aspect.
According to an eighth aspect of the present disclosure, there is provided a tensor instruction processing method applied to a tensor instruction processing apparatus, the method including:
parsing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring, according to the operation code and the operation domain, the tensor to be operated, the scalar to be operated and the target address required for executing the tensor instruction;
performing a tensor-scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result at the target address;
wherein the operation code is used for indicating that the operation performed by the tensor instruction on data is a tensor-scalar multiplication operation, and the operation domain comprises the source address of the tensor to be operated and the target address.
According to a ninth aspect of the present disclosure, there is provided a computer-readable storage medium having stored therein a computer program which, when executed by one or more processors, implements the steps of the tensor instruction processing method described above.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The tensor instruction processing apparatus provided by the embodiments of the present disclosure comprises a control module and an operation module. The control module is used for parsing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and for acquiring, according to the operation code and the operation domain, the tensor to be operated and the target address required for executing the tensor instruction. The operation module is used for performing a tensor-scalar multiplication operation on the tensor to be operated to obtain an operation result, and for storing the operation result at the target address. The tensor instruction processing method and apparatus and the related product provided by the embodiments of the present disclosure have a wide application range and process tensor instructions with high efficiency and at high speed.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure.
Figs. 2a to 2f show block diagrams of a tensor instruction processing apparatus according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating an application scenario of a tensor instruction processing apparatus according to an embodiment of the present disclosure.
Fig. 4a, 4b show block diagrams of a combined processing device according to an embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
Figure 6 illustrates a flow diagram of a tensor instruction processing method according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 1, the apparatus includes a control module 11 and an operation module 12. The control module 11 is configured to parse the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and to obtain, according to the operation code and the operation domain, the tensor to be operated, the scalar to be operated and the target address required for executing the tensor instruction. The operation code is used for indicating that the operation performed by the tensor instruction on data is a tensor-scalar multiplication operation, and the operation domain includes the source address of the tensor to be operated and the target address. The operation module 12 is configured to perform the tensor-scalar multiplication operation on the tensor to be operated to obtain an operation result, and to store the operation result at the target address. Specifically, the operation module 12 may multiply each element in the tensor to be operated by the scalar to be operated, which may be given as an immediate or held in a scalar register, to obtain the operation result, and store the operation result at the target address.
Alternatively, the target address may be a start address, and the processing module may determine the storage space required for the operation result according to the start address and the size of the operation result, and store the operation result into the determined storage space.
In this embodiment, the control module may obtain each tensor to be operated from its corresponding address. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins. The operation module is used for performing, according to the tensor operation type, a tensor operation on the tensor to be operated to obtain an operation result, and storing the operation result at the target address. The tensor instruction processing apparatus provided by the embodiments of the present disclosure has a wide application range and processes tensor instructions with high efficiency and at high speed.
In this embodiment, the operation code may be the part of an instruction or field (usually indicated by a code) that specifies the operation to be performed; it is an instruction sequence number used to inform the device executing the instruction which instruction specifically needs to be executed. The operation domain may be the source of all data required for executing the corresponding instruction, where that data includes parameters such as the tensor to be operated, or stores parameters such as the tensor to be operated, the tensor operation type and the corresponding addresses, and so on. A tensor instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the tensor to be operated and the target address. It should be understood that the instruction format of the tensor instruction and the operation code and operation domain it contains may be set as needed by those skilled in the art, and the disclosure is not limited thereto.
Alternatively, there may be one or more source addresses and one or more target addresses of the tensor to be operated in the operation domain, with each source address corresponding to one target address, and there may be one or more tensors to be operated. That is, the operation module 12 can simultaneously multiply one or more tensors to be operated by the same scalar to be operated. Further optionally, there may also be one or more scalars to be operated, with one scalar to be operated provided for each tensor to be operated; the values of the scalars to be operated may be the same or different. In this case, the operation module 12 of the present application can simultaneously multiply one or more tensors to be operated by their respective scalars to be operated.
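For illustration only, the following Python sketch simulates in software the basic behaviour described above: the tensor to be operated is fetched from the source address, each element is multiplied by the scalar to be operated, and the result is stored at the target address. The flat list used as a memory model and all function names are assumptions of this sketch, not part of the disclosed hardware.

    # Software sketch of the tensor-scalar multiplication behaviour described above.
    # The memory model and names are illustrative assumptions only.
    def execute_tensor_scalar_mult(memory, src, dst, num_of_ele, scalar):
        """Multiply each element of the tensor stored at src by scalar and store it at dst."""
        tensor = memory[src:src + num_of_ele]      # fetch the tensor to be operated
        result = [x * scalar for x in tensor]      # element-wise tensor * scalar
        memory[dst:dst + num_of_ele] = result      # store the operation result at the target address
        return result

    memory = list(range(1024)) + [0] * 1024
    execute_tensor_scalar_mult(memory, src=0, dst=1024, num_of_ele=4, scalar=3)
    print(memory[1024:1028])   # [0, 3, 6, 9]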
Further optionally, the apparatus may include one or more control modules and one or more operation modules, and the number of the control modules and the number of the operation modules may be set according to actual needs, which is not limited in this disclosure.
Figure 2a illustrates a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2a, the operation module 12 may include at least one tensor operator 120, and the at least one tensor operator 120 is configured to perform a tensor operation corresponding to the type of the tensor operation. Specifically, the tensor operator is used for multiplying each element in the tensor to be operated by the scalar to be operated to obtain an operation result, so that the tensor and scalar multiplication operation is realized.
Furthermore, the operation module may further include a data access circuit, the data access circuit may obtain data to be operated from the storage module, and the data access circuit may further store an operation result in the storage module. Alternatively, the data access circuit may be a direct memory access module.
In this implementation, the tensor operator may include operators capable of performing arithmetic operations, logical operations and the like on the tensor, such as an adder, a divider, a multiplier and a comparator. The type and number of tensor operators may be set according to the size of the amount of tensor data to be operated, the type of the tensor operation, the required processing speed and efficiency of the tensor operation, and the like, which is not limited by the present disclosure.
Figure 2b illustrates a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2b, the operation module 12 may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122. The master operation submodule 121 and the plurality of slave operation submodules 122 each include the tensor operator (not shown).
The control module 11 is further configured to analyze the tensor instruction to obtain a plurality of operation instructions, and send the tensor to be operated and the plurality of operation instructions to the main operation submodule 121.
The main operation submodule 121 is configured to perform preamble processing on the tensor to be operated and to send the operation instructions and at least part of the tensor to be operated to the slave operation submodules; the tensor operator of the main operation submodule can also execute the tensor-scalar multiplication operation to obtain an intermediate result.
The tensor operators of the slave operation submodules 122 are configured to execute the tensor-scalar multiplication operation in parallel according to the data and the operation instructions received from the main operation submodule 121, to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the main operation submodule 121. The main operation submodule 121 is further configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result, and to store the operation result at the target address.
In a possible implementation, the tensor instruction and the tensor to be operated are split, according to the dimensions of the tensor to be operated, into at least one corresponding tensor operation instruction and at least one sub-tensor, the tensor operation instructions corresponding one-to-one to the sub-tensors. Each corresponding operation instruction and sub-tensor are sent to a slave operation submodule, which performs the tensor-scalar multiplication to obtain an intermediate result and sends the intermediate result to the main operation submodule. The main operation submodule combines the intermediate results to obtain the operation result and stores the operation result at the target address. When splitting according to the dimensions of the tensor, the tensor may be split into one or more one-dimensional tensors or into a plurality of multi-dimensional tensors, and the dimensionality of each split tensor is less than or equal to the dimensionality of the acquired tensor to be operated.
Alternatively, the tensor-scalar multiplication operation can be realized only by the operators in the main operation submodule. For example, when the operation instruction operates on scalar or vector data, the apparatus may control the main operation submodule to perform the operation corresponding to the operation instruction using its own operators. When the operation instruction operates on data with a dimensionality greater than or equal to 2, such as a matrix or a tensor, the apparatus can operate with the main operation submodule and the slave operation submodules cooperating; for the specific implementation, reference may be made to the description above.
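As a rough software analogy of the master/slave cooperation described above, the sketch below splits the tensor to be operated into sub-tensors, lets each "slave" produce an intermediate result, and lets the "master" combine the intermediate results. The splitting policy (row-wise chunks) and the module interfaces are assumptions of this sketch, not the hardware design of the disclosure.

    def split_tensor(tensor, num_slaves):
        """Split the tensor to be operated into sub-tensors, one chunk per slave submodule."""
        step = max(1, len(tensor) // num_slaves)
        return [tensor[i:i + step] for i in range(0, len(tensor), step)]

    def slave_multiply(sub_tensor, scalar):
        """Each slave operation submodule produces an intermediate result."""
        return [x * scalar for x in sub_tensor]

    def master_execute(tensor, scalar, num_slaves=4):
        """Master submodule: split, dispatch, then combine the intermediate results."""
        intermediates = [slave_multiply(sub, scalar) for sub in split_tensor(tensor, num_slaves)]
        return [x for part in intermediates for x in part]   # subsequent processing: concatenation

    print(master_execute(list(range(8)), scalar=2))   # [0, 2, 4, 6, 8, 10, 12, 14]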
It should be noted that, a person skilled in the art may set the connection manner between the master operation submodule and the plurality of slave operation submodules according to actual needs to implement the configuration setting of the operation module, for example, the configuration of the operation module may be an "H" configuration, an array configuration, a tree configuration, and the like, which is not limited in the present disclosure.
Figure 2c illustrates a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2c, the operation module 12 may further include one or more branch operation sub-modules 123, and the branch operation sub-module 123 is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. The main operation sub-module 121 is connected to one or more branch operation sub-modules 123. Therefore, the main operation sub-module, the branch operation sub-module and the slave operation sub-module in the operation module are connected by adopting an H-shaped structure, and data and/or operation instructions are forwarded by the branch operation sub-module, so that the resource occupation of the main operation sub-module is saved, and the instruction processing speed is further improved.
Figure 2d shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 2d, a plurality of slave operation sub-modules 122 are distributed in an array.
Each slave operation submodule 122 is connected to another adjacent slave operation submodule 122, the master operation submodule 121 is connected to k slave operation submodules 122 of the plurality of slave operation submodules 122, and the k slave operation submodules 122 are: n slave operator sub-modules 122 of row 1, n slave operator sub-modules 122 of row m, and m slave operator sub-modules 122 of column 1.
As shown in fig. 2d, the k slave operator modules include only the n slave operator modules in the 1 st row, the n slave operator modules in the m th row, and the m slave operator modules in the 1 st column, that is, the k slave operator modules are slave operator modules directly connected to the master operator module among the plurality of slave operator modules. The k slave operation submodules are used for forwarding data and instructions between the master operation submodules and the plurality of slave operation submodules. Therefore, the plurality of slave operation sub-modules are distributed in an array, the speed of sending data and/or operation instructions to the slave operation sub-modules by the master operation sub-module can be increased, and the instruction processing speed is further increased.
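Purely as an illustration of the array layout just described, the sketch below enumerates, for an m x n array of slave operation submodules, the k submodules directly connected to the master (the n submodules of row 1, the n submodules of row m and the m submodules of column 1); the 1-based indexing is an assumption of the sketch.

    def directly_connected_slaves(m, n):
        """Return the (row, col) indices of the k slave submodules wired directly to the master."""
        k = set()
        for col in range(1, n + 1):
            k.add((1, col))     # the n slave submodules of row 1
            k.add((m, col))     # the n slave submodules of row m
        for row in range(1, m + 1):
            k.add((row, 1))     # the m slave submodules of column 1
        return sorted(k)

    print(len(directly_connected_slaves(4, 5)))   # 5 + 5 + 4 - 2 = 12 directly connected submodules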
Figure 2e shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2e, the operation module may further include a tree sub-module 124. The tree submodule 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master operation submodule 121, and the plurality of branch ports 402 are connected to the plurality of slave operation submodules 122, respectively. The tree sub-module 124 has a transceiving function, and is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. Therefore, the operation modules are connected in a tree-shaped structure under the action of the tree-shaped sub-modules, and the speed of sending data and/or operation instructions from the main operation sub-module to the auxiliary operation sub-module can be increased by utilizing the forwarding function of the tree-shaped sub-modules, so that the instruction processing speed is increased.
In one possible implementation, the tree submodule 124 may be an optional structure of the apparatus, and may include at least one level of nodes. The nodes are wiring structures with a forwarding function and have no operation function themselves. The lowest-level nodes are connected to the slave operation submodules to forward data and/or operation instructions between the master operation submodule 121 and the slave operation submodules 122. In particular, if the tree submodule has zero levels of nodes, the apparatus does not require the tree submodule.
In one possible implementation, the tree submodule 124 may include a plurality of nodes of an n-ary tree structure, and the plurality of nodes of the n-ary tree structure may have a plurality of layers.
For example, fig. 2f shows a block diagram of a tensor instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 2f, the n-ary tree structure may be a binary tree structure with tree-type sub-modules including 2 levels of nodes 01. The lowest level node 01 is connected with the slave operation sub-module 122 to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122.
In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. The number of n in the n-ary tree structure and the number of layers of nodes in the n-ary tree structure may be set by those skilled in the art as needed, and the disclosure is not limited thereto.
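The following small sketch illustrates one way to group slave submodules under the forwarding nodes of an n-ary tree, level by level; the grouping strategy is an assumption used only to make the tree structure concrete, and the nodes here merely collect indices because, as stated above, they forward data without operating on it.

    def build_forwarding_tree(num_slaves, n=2):
        """Group slave submodule indices under forwarding nodes of an n-ary tree, level by level."""
        level = [[i] for i in range(num_slaves)]   # lowest-level nodes, one per slave submodule
        levels = [level]
        while len(level) > 1:
            level = [sum(level[i:i + n], []) for i in range(0, len(level), n)]
            levels.append(level)
        return levels                              # levels[-1][0] is the root connected to the master

    for depth, nodes in enumerate(build_forwarding_tree(4)):   # binary tree over 4 slaves
        print(depth, nodes)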
In one possible implementation, the operation domain may further include a tensor operation type.
The control module 11 may be further configured to determine a tensor operation type according to the operation domain.
In one possible implementation, the type of tensor operation may include at least one of: tensor multiplication operation, tensor and scalar multiplication operation, tensor addition operation, tensor summation operation, specified value storage operation meeting the operation condition, bitwise and operation, bitwise or operation, bitwise exclusive or operation, bitwise negation operation, bitwise maximum value operation and bitwise minimum value operation. The operation condition may include any one of the following: bitwise equal, bitwise unequal, bitwise less, bitwise greater than or equal to, bitwise greater than, bitwise less than or equal to. The specified value may be a numerical value such as 0, 1, etc., and the present disclosure does not limit this.
The operation satisfying the bit-wise equal storage of the specified value can be: judging whether corresponding bits of a first tensor to be operated and a second tensor to be operated in the tensors to be operated are equal, and storing a specified value when the corresponding bits of the first tensor to be operated and the second tensor to be operated are equal; and when the corresponding bits are not equal, storing the value of the first tensor to be operated or the second tensor to be operated in the corresponding bits, or storing a value which is different from the specified value, such as 0.
Satisfying the bitwise inequality store specified value operation may be: judging whether corresponding bits of a first tensor to be operated and a second tensor to be operated in the tensors to be operated are equal, and storing a specified value when the corresponding bits of the first tensor to be operated and the second tensor to be operated are not equal; when the corresponding bits are equal, the value of the first tensor to be operated or the second tensor to be operated at the corresponding bits is stored, or the value of 0 and the like which is different from the specified value is stored.
Satisfying the bitwise less than store specified value operation may be: judging the size relationship of a first tensor to be operated and a corresponding bit of a second tensor to be operated in the tensors to be operated, and storing a specified value when the value of the first tensor to be operated on the corresponding bit is smaller than the value of the second tensor to be operated; when the value of the first tensor to be operated on the corresponding bit is larger than or equal to the value of the second tensor to be operated, storing the value of the first tensor to be operated or the second tensor to be operated on the corresponding bit, or storing a value different from a specified value, such as 0.
Satisfying the bitwise greater than or equal to store the specified value operation may be: judging the size relationship of a first tensor to be operated and a corresponding bit of a second tensor to be operated in the tensors to be operated, and storing a specified value when the value of the first tensor to be operated on the corresponding bit is greater than or equal to the value of the second tensor to be operated; when the value of the first tensor to be operated on the corresponding bit is smaller than the value of the second tensor to be operated, storing the value of the first tensor to be operated or the second tensor to be operated on the corresponding bit, or storing a value such as 0 which is different from the specified value.
Satisfying the bitwise greater than store specified value operation may be: judging the size relationship of a corresponding bit of a first tensor to be operated and a second tensor to be operated in the tensors to be operated, and storing a specified value when the value of the first tensor to be operated on the corresponding bit is larger than the value of the second tensor to be operated; and when the value of the first tensor to be operated on the corresponding bit is less than or equal to the value of the second tensor to be operated, storing the value of the first tensor to be operated or the second tensor to be operated on the corresponding bit, or storing a value which is different from the specified value, such as 0.
Satisfying the bitwise less than or equal to store the specified value operation may be: judging the size relationship of a first tensor to be operated and a corresponding bit of a second tensor to be operated in the tensors to be operated, and storing a specified value when the value of the first tensor to be operated on the corresponding bit is less than or equal to the value of the second tensor to be operated; when the value of the first tensor to be operated on the corresponding bit is larger than the value of the second tensor to be operated, the value of the first tensor to be operated or the second tensor to be operated on the corresponding bit is stored, or the value of 0 and the like which is different from the specified value is stored.
In this implementation, different operation domain codes can be set for different tensor operation types to distinguish different operation categories. For example, the code of the "tensor multiplication operation" may be set to "mult". The code of the "tensor and scalar multiplication operation" may be set to "mult.const". The code of the "tensor addition operation" may be set to "add". The code of the "tensor summation operation" may be set to "sub". The code of the "bitwise AND operation" may be set to "and". The code of the "bitwise OR operation" may be set to "or". The code of the "bitwise exclusive-OR operation" may be set to "xor". The code of the "bitwise negation operation" may be set to "not". The code of the "bitwise maximum value operation" may be set to "max". The code of the "bitwise minimum value operation" may be set to "min". The code of the "store the specified value 1 if bitwise equal" operation may be set to "eq". The code of the "store the specified value 1 if bitwise unequal" operation may be set to "ne". The code of the "store the specified value 1 if bitwise less than" operation may be set to "lt". The code of the "store the specified value 1 if bitwise greater than or equal to" operation may be set to "ge". The code of the "store the specified value 1 if bitwise greater than" operation may be set to "gt". The code of the "store the specified value 1 if bitwise less than or equal to" operation may be set to "le".
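To make the comparison-and-store semantics above concrete, the sketch below applies an element-wise comparison to two tensors to be operated and stores the specified value 1 where the condition holds and 0 otherwise; treating the "bitwise" comparison as an element-wise comparison and using 0 as the alternative value are assumptions of this sketch.

    import operator

    COMPARISONS = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
                   "ge": operator.ge, "gt": operator.gt, "le": operator.le}

    def compare_and_store(op_code, tensor_a, tensor_b, specified_value=1, other_value=0):
        """Element-wise comparison of two tensors; store the specified value where the
        condition holds and another value (0 here) where it does not."""
        cmp = COMPARISONS[op_code]
        return [specified_value if cmp(a, b) else other_value for a, b in zip(tensor_a, tensor_b)]

    print(compare_and_store("lt", [1, 5, 3], [2, 4, 3]))   # [1, 0, 0]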
The operation types and the corresponding codes thereof can be set by those skilled in the art according to actual needs, and the disclosure does not limit this.
In one possible implementation, the operation field may further include an input quantity. The control module 11 is further configured to determine an input quantity according to the operation domain, and obtain a tensor to be operated, where the data quantity is the input quantity, from the data address to be operated.
In this implementation, the input quantity may be a parameter characterizing the amount of data of the tensor to be operated on, e.g., tensor length, width, etc.
In one possible implementation, a default input amount may be set. When the input quantity cannot be determined according to the operation domain, the default input quantity can be determined as the input quantity of the current tensor instruction, and the tensor to be operated with the data quantity as the default input quantity is obtained from the data address to be operated.
In one possible implementation, as shown in fig. 2 a-2 f, the apparatus may further include a storage module 13. The storage module 13 is used for storing the tensor to be operated.
In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratch pad cache. The tensor to be computed may be stored in the memory, cache, and/or register of the storage module as needed, which is not limited by this disclosure.
In a possible implementation manner, the apparatus may further include a direct memory access module for reading or storing data from the storage module.
In one possible implementation, as shown in fig. 2 a-2 f, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112, and a queue storage sub-module 113.
The instruction storage submodule 111 is used to store tensor instructions.
The instruction processing sub-module 112 is configured to parse the tensor instruction to obtain an operation code and an operation domain of the tensor instruction.
The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes multiple instructions to be executed that are sequentially arranged according to an execution order, and the instructions to be executed may include tensor instructions.
In this implementation, the instructions to be executed may also include computation instructions related or unrelated to tensor operations, which is not limited by this disclosure. The execution sequence of the multiple instructions to be executed can be arranged according to the receiving time, the priority level and the like of the instructions to be executed to obtain an instruction queue, so that the multiple instructions to be executed can be sequentially executed according to the instruction queue.
In one possible implementation, as shown in fig. 2 a-2 f, the control module 11 may further include a dependency processing sub-module 114.
The dependency relationship processing submodule 114 is configured to, when it is determined that a first to-be-executed instruction in the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage submodule 111, and after the zeroth to-be-executed instruction is executed, extract the first to-be-executed instruction from the instruction storage submodule 111 and send the first to-be-executed instruction to the operation module 12.
Whether the zeroth to-be-executed instruction preceding the first to-be-executed instruction is associated with the first to-be-executed instruction may be determined as follows: there is an association when the first storage address interval storing the data required by the first to-be-executed instruction overlaps the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction. Conversely, there is no association between the first to-be-executed instruction and the preceding zeroth to-be-executed instruction when the first storage address interval and the zeroth storage address interval have no overlapping area.
In this way, according to the dependency relationship between the first to-be-executed instruction and the preceding zeroth to-be-executed instruction, the subsequent first to-be-executed instruction is executed only after the preceding zeroth to-be-executed instruction has finished executing, which ensures the accuracy of the operation result.
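The overlap criterion described above can be expressed compactly; the sketch below checks whether the first storage address interval and the zeroth storage address interval overlap, with half-open (start, end) intervals as an assumption of the sketch.

    def has_dependency(first_interval, zeroth_interval):
        """Two to-be-executed instructions are associated when the storage address interval of
        the data required by the first overlaps that of the zeroth (intervals are half-open)."""
        f_start, f_end = first_interval
        z_start, z_end = zeroth_interval
        return f_start < z_end and z_start < f_end

    print(has_dependency((100, 200), (150, 300)))   # True  -> cache the first instruction
    print(has_dependency((100, 200), (200, 300)))   # False -> it can be issued immediately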
In one possible implementation, the instruction format of the tensor instruction may be:
opcode,dst,src,type,NumOfEle
where opcode is the operation code of the tensor instruction, and dst, src, type and NumOfEle are the operation domain of the tensor instruction. dst is the target address. src is the address of the tensor to be operated; when there are multiple tensors to be operated, src may include multiple data addresses to be operated src0, src1, ..., srcn, which is not limited by this disclosure. type is the tensor operation type, and may be a code of the tensor operation type, such as mult, mult.const, add, sub, eq, ne, lt, ge, gt, le, and, or, xor, not, max or min. NumOfEle is the input quantity.
When the to-be-computed tensors are multiple, the instruction format may include multiple to-be-computed data addresses, and the instruction format of the tensor instruction may be as follows, taking two to-be-computed tensors as an example:
opcode,dst,src0,src1,type,NumOfEle
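Assuming a plain comma-separated textual encoding of the field order given above (the encoding itself is not specified by the disclosure), a parser that separates the operation code from the operation domain could look like the following sketch.

    def parse_tensor_instruction(text):
        """Split an instruction of the form 'opcode,dst,src0,...,srcn,type,NumOfEle' into
        an operation code and an operation domain holding the remaining fields."""
        fields = [f.strip() for f in text.split(",")]
        opcode, dst, *srcs, op_type, num_of_ele = fields
        return {"opcode": opcode,
                "operation_domain": {"dst": int(dst),
                                     "src": [int(s) for s in srcs],
                                     "type": op_type,
                                     "NumOfEle": int(num_of_ele)}}

    print(parse_tensor_instruction("opcode, 500, 101, 102, add, 1024"))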
in one possible implementation, the instruction format of the tensor instruction may be:
type,dst,src,NumOfEle
in one possible implementation, the instruction format of the tensor instruction for the "tensor multiplication operation" may be set to: mult, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-computed tensor of the NumOfEle size from a first to-be-computed address src0, obtaining a second to-be-computed tensor of the NumOfEle size from a second to-be-computed address src1, and multiplying the first to-be-computed tensor and the second to-be-computed tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "tensor and scalar multiplication operation" may be set to: mult.const, dst, src0, NumOfEle, scale. It represents: obtaining a tensor to be operated of size NumOfEle from the first data address to be operated src0, obtaining the scalar to be operated scale and the input quantity NumOfEle from scalar registers, multiplying the tensor to be operated by the scalar to be operated to obtain an operation result, and storing the operation result at the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "tensor and scalar multiplication operation" may also be set to: mult.const, dst, src0, NumOfEle, scale. It represents: obtaining a tensor to be operated of size NumOfEle from the first data address to be operated src0, and obtaining the scalar to be operated scale and the input quantity NumOfEle from scalar registers. Optionally, the type of the first data to be operated may be fixed-point data, floating-point data, 16-bit data or 32-bit data; this is merely an example and is not intended to limit the type of the first data to be operated. The tensor to be operated is multiplied by the scalar to be operated to obtain an operation result, and the operation result is stored at the target address dst.
The input quantity NumOfEle may be an integer divisible by 64, but in other embodiments, the input quantity NumOfEle may also be an integer divisible by 2, 4, 8, 16, 32, or the like, and this is merely an example and is not intended to limit a specific value range of the input quantity.
Optionally, the storage module may include an on-chip memory space, which may be an on-chip NRAM, for storing tensor data or scalar data. The source address src0 and the target address dst may point to a memory space in the NRAM. Of course, in other embodiments, the memory spaces pointed to by the source address src0 and the target address dst may also be other memory spaces of the storage module.
Further, the source address src0 and the target address dst both refer to start addresses; each of the source address and the target address corresponds to a default address offset, and the default address offset may be a multiple of 64 bytes. Of course, in other embodiments, the default address offset may also be an integer multiple of 8 bytes, 16 bytes, 32 bytes or 128 bytes, etc.; this is only for illustration and is not a specific limitation. Specifically, the address offset may be determined according to the operation result.
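The sketch below illustrates the bookkeeping implied by the constraints above: NumOfEle is checked to be a multiple of 64, and the address offset applied after an operation is rounded up to a multiple of 64 bytes. The element width of 2 bytes (16-bit data) is an assumption of the sketch.

    def next_tensor_addresses(src, dst, num_of_ele, bytes_per_element=2, offset_granularity=64):
        """Advance src and dst by a default address offset rounded up to a 64-byte multiple."""
        assert num_of_ele % 64 == 0, "NumOfEle should be divisible by 64"
        size = num_of_ele * bytes_per_element
        offset = ((size + offset_granularity - 1) // offset_granularity) * offset_granularity
        return src + offset, dst + offset

    print(next_tensor_addresses(src=0, dst=65536, num_of_ele=1024))   # (2048, 67584)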
In one possible implementation, the instruction format of the tensor instruction for the "tensor addition operation" may be set to: add, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and adding the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "tensor summation operation" may be set to: sub, dst, src, NumOfEle. It represents: and acquiring a plurality of tensors to be operated with the size of NumOfEle from the address src to be operated, and performing summation operation on the plurality of tensors to be operated to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "bitwise and operation" may be set to: and, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and carrying out bitwise AND operation on the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for "bitwise or operation" may be set to: or, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and carrying out bitwise OR operation on the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "bitwise xor operation" may be set to: xor, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and carrying out bitwise XOR operation on the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "bitwise negation" operation may be set to: not, dst, src, NumOfEle. It represents: and acquiring a tensor to be operated with the size of NumOfEle from the address src to be operated, and performing bitwise negation operation on the tensor to be operated to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "maximum bitwise operation" may be set to: max, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-computed tensor of the NumOfEle size from a first to-be-computed address src0, obtaining a second to-be-computed tensor of the NumOfEle size from a second to-be-computed address src1, and carrying out bitwise maximum value computing on the first to-be-computed tensor and the second to-be-computed tensor to obtain a computing result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the tensor instruction for the "minimum by bit operation" may be set to: min, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, and carrying out bitwise minimum value calculation on the first to-be-operated tensor and the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
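The element-wise instructions listed above differ only in the operation applied to corresponding elements, so a software model can dispatch on the type code; the table below is a sketch of such a dispatch (integer tensor data is assumed for the bitwise operations, and the mapping itself is an illustration, not the hardware implementation).

    ELEMENTWISE_BINARY = {
        "mult": lambda a, b: a * b,
        "add":  lambda a, b: a + b,
        "and":  lambda a, b: a & b,   # bitwise operations assume integer tensor data
        "or":   lambda a, b: a | b,
        "xor":  lambda a, b: a ^ b,
        "max":  max,
        "min":  min,
    }

    def execute_binary(op_type, tensor_a, tensor_b):
        """Apply the element-wise operation selected by the type code to two tensors."""
        op = ELEMENTWISE_BINARY[op_type]
        return [op(a, b) for a, b in zip(tensor_a, tensor_b)]

    def execute_not(tensor):
        """Bitwise negation of each element (the unary 'not' instruction)."""
        return [~x for x in tensor]

    print(execute_binary("max", [1, 7, 3], [4, 2, 9]))   # [4, 7, 9]
    print(execute_not([0, 1]))                           # [-1, -2]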
In one possible implementation, the instruction format for the tensor instruction that satisfies the bit-wise equality store the specified value 1 operation may be set to: eq, dst, src0, src1, NumOfEle. It represents: the method includes the steps of obtaining a first to-be-computed tensor of the NumOfEle size from a first to-be-computed address src0, obtaining a second to-be-computed tensor of the NumOfEle size from a second to-be-computed address src1, comparing the first to-be-computed tensor with the second to-be-computed tensor in a bitwise mode, and storing a specified value 1 when corresponding bits of the first to-be-computed tensor and the second to-be-computed tensor are equal to each other to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that stores the specified value 1 operation if bit-wise inequality is satisfied may be set to: ne, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, comparing the first to-be-operated tensor with the second to-be-operated tensor by bit, and storing a specified value 1 when corresponding bits of the first to-be-operated tensor and the second to-be-operated tensor are not equal to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that satisfies the specified value 1 operation if bitwise less is stored may be set to: lt, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, comparing the first to-be-operated tensor with the second to-be-operated tensor in a bitwise mode, and storing a specified value 1 when the value of the first to-be-operated tensor on a corresponding bit is smaller than that of the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that satisfies the specified value 1 operation if bitwise is greater than or equal to is set to: ge, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, comparing the first to-be-operated tensor with the second to-be-operated tensor in a bitwise mode, and storing a specified value 1 when the value of the first to-be-operated tensor on the corresponding bit is larger than or equal to the value of the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that satisfies the specified value 1 operation if bitwise greater is stored may be set to: gt, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-computed tensor of the NumOfEle size from a first to-be-computed address src0, obtaining a second to-be-computed tensor of the NumOfEle size from a second to-be-computed address src1, comparing the first to-be-computed tensor with the second to-be-computed tensor in a bitwise mode, and storing a specified value 1 when the value of the first to-be-computed tensor on a corresponding bit is larger than that of the second to-be-computed tensor to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format for the tensor instruction that satisfies the specified value 1 operation if bitwise is less than or equal to is set to: le, dst, src0, src1, NumOfEle. It represents: the method comprises the steps of obtaining a first to-be-operated tensor of the NumOfEle size from a first to-be-operated address src0, obtaining a second to-be-operated tensor of the NumOfEle size from a second to-be-operated address src1, comparing the first to-be-operated tensor with the second to-be-operated tensor in a bitwise mode, and storing a specified value 1 when the value of the first to-be-operated tensor on a corresponding bit is smaller than or equal to the value of the second to-be-operated tensor to obtain an operation result. And stores the operation result into the target address dst.
It should be understood that the position of the operation code and the operation field in the instruction format of the tensor instruction can be set by one skilled in the art according to requirements, and the disclosure is not limited thereto.
In one possible implementation manner, the apparatus may be disposed in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and an embedded Neural Network Processor (NPU).
It should be noted that, although the tensor instruction processing apparatus is described above by taking the above embodiment as an example, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
Application example
An application example according to an embodiment of the present disclosure is given below, taking "tensor operation by a tensor instruction processing apparatus" as an exemplary application scenario, to facilitate understanding of the flow of the tensor instruction processing apparatus. Those skilled in the art will understand that the following application example is merely intended to facilitate understanding of the embodiments of the present disclosure and should not be construed as limiting them.
Fig. 3 is a schematic diagram illustrating an application scenario of a tensor instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the tensor instruction processing device processes the tensor instruction as follows:
the control module 11 parses the obtained tensor instruction 1 (for example, tensor instruction 1 is @opcode#500#101#mult.const#1024) to obtain the operation code and operation domain of tensor instruction 1: the operation code is opcode, the target address is 500, the address of the tensor to be operated is 101, the tensor operation type is mult.const (tensor and scalar multiplication operation), and the input amount is 1024. The control module 11 acquires, from the tensor address 101, the tensor to be operated whose data amount is the input amount 1024, and acquires the scalar to be operated (provided as an immediate value or in a scalar register). The operation module 12 performs the tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain operation result 1, and stores operation result 1 into the target address 500.
Tensor instruction 1 may also be written in other instruction formats, for example @opcode#500#101#1024#mult.const or @mult.const#500#101#1024; the processing of tensor instructions in different instruction formats is similar and is not described again.
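To make the parse-then-execute flow of this example concrete, the sketch below uses Python. The "#"-separated text form, the field order, and the treatment of memory as a flat Python list mirror the example instruction @opcode#500#101#mult.const#1024 for illustration only and are not the device's actual encoding; the scalar is passed in separately, standing in for the immediate value or scalar register mentioned above.

```python
def parse_tensor_instruction(text):
    """Split a textual tensor instruction of the assumed form
    '@opcode#dst#src#type#NumOfEle' into an opcode and an operation domain."""
    fields = text.lstrip("@").split("#")
    opcode, dst, src, op_type, num = fields
    return opcode, {"dst": int(dst), "src": int(src),
                    "type": op_type, "num_of_ele": int(num)}

def execute(memory, text, scalar):
    opcode, domain = parse_tensor_instruction(text)
    if domain["type"] == "mult.const":            # tensor and scalar multiplication
        src, dst, n = domain["src"], domain["dst"], domain["num_of_ele"]
        tensor = memory[src:src + n]              # tensor to be operated
        memory[dst:dst + n] = [x * scalar for x in tensor]  # operation result

# Sketch of tensor instruction 1: source at 101, result at 500, 1024 elements.
memory = [0.0] * 2048
memory[101:101 + 1024] = [1.0] * 1024
execute(memory, "@opcode#500#101#mult.const#1024", scalar=3.0)
print(memory[500], memory[500 + 1023])  # 3.0 3.0
```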
The working process of the above modules can refer to the above related description.
Thus, the tensor instruction processing device can efficiently and quickly process tensor instructions.
The present disclosure provides a machine learning arithmetic device, which may include one or more of the tensor instruction processing devices described above and is configured to acquire tensors to be operated and control information from other processing devices and perform specified machine learning operations. The machine learning arithmetic device can obtain tensor instructions from other machine learning arithmetic devices or non-machine-learning arithmetic devices, and transmit execution results to peripheral devices (also referred to as other processing devices) through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one tensor instruction processing device is included, the tensor instruction processing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale neural network operations. In this case, the devices may share the same control system or have separate control systems, and may share memory or have separate memories for each accelerator. In addition, the interconnection manner may be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 4a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in fig. 4a, the combined processing device includes the machine learning arithmetic device, the universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing operations including data transfer and completing basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.
The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device acquires required input data from the other processing devices and writes it into the storage device on the machine learning arithmetic device; it can obtain control instructions from the other processing devices and write them into a control cache on the machine learning arithmetic device chip; it can also read data from the storage module of the machine learning arithmetic device and transmit the data to the other processing devices.
Fig. 4b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in fig. 4b, the combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing device respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing device, and is particularly suitable for data to be computed that cannot be fully held in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device can serve as the system on chip (SoC) of devices such as mobile phones, robots, unmanned aerial vehicles, and video monitoring equipment, effectively reducing the core area of the control portion, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
The present disclosure provides a machine learning chip, which includes the above machine learning arithmetic device or combined processing device.
The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in fig. 5, the board card includes the above-mentioned machine learning chip package structure or the above-mentioned machine learning chip. In addition to the machine learning chip 389, the board card may include other components, including but not limited to: a memory device 390, an interface device 391, and a control device 392.
The memory device 390 is coupled to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure) via a bus and is used for storing data. The memory device 390 may include multiple groups of memory cells 393, and each group of memory cells 393 is coupled to the machine learning chip 389 via a bus. It can be understood that each group of memory cells 393 may be DDR SDRAM (Double Data Rate SDRAM).
DDR doubles the speed of SDRAM without increasing the clock frequency by allowing data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM.
In one embodiment, the memory device 390 may include 4 groups of memory cells 393, and each group of memory cells 393 may include a plurality of DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of memory cells 393, the theoretical bandwidth of data transfer may reach 25600 MB/s.
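The 25600 MB/s figure is consistent with the stated transfer rate and the 64-bit (8-byte) data path noted above; a back-of-the-envelope check:

```latex
3200\,\mathrm{MT/s} \times 8\,\mathrm{B/transfer} = 25600\,\mathrm{MB/s}
```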
In one embodiment, each group of memory cells 393 includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389 to control the data transfer and data storage of each memory cell 393.
The interface device 391 is electrically coupled to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface, and the data to be processed is transmitted by the server to the machine learning chip 389 through the standard PCIE interface to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface; the present disclosure does not limit the specific form of the other interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is transmitted back to the external device (e.g., a server) by the interface device.
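The 16000 MB/s figure for PCIE 3.0 x16 follows from 8 GT/s per lane across 16 lanes; as a rough check (ignoring the 128b/130b encoding overhead, which lowers the usable rate to about 15.75 GB/s):

```latex
16\ \text{lanes} \times 8\,\mathrm{GT/s} \times \frac{1\,\mathrm{B}}{8\,\mathrm{bits}} = 16\,\mathrm{GB/s} = 16000\,\mathrm{MB/s}
```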
The control device 392 is electrically connected to the machine learning chip 389 and is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a single-chip microcomputer (MCU). The machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads, so it can be in different working states such as heavy load and light load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the machine learning chip.
The present disclosure provides an electronic device, which includes the above machine learning chip or board card.
The electronic device may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an aircraft, a ship, and/or a vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
Figure 6 illustrates a flow diagram of a tensor instruction processing method according to an embodiment of the present disclosure. As shown in fig. 6, the method is applied to the tensor instruction processing apparatus described above, and includes step S51 and step S52.
In step S51, the compiled tensor instruction is parsed to obtain the operation code and operation domain of the tensor instruction, and the tensor to be operated, the scalar to be operated, and the target address required for executing the tensor instruction are obtained according to the operation code and the operation domain. The operation code is used for indicating that the operation performed on data by the tensor instruction is a tensor and scalar multiplication operation, and the operation domain includes the source address of the tensor to be operated, the scalar to be operated, and the target address.
In step S52, tensor and scalar multiplication operation is performed on the tensor to be operated and the scalar to be operated to obtain an operation result, and the operation result is stored in the target address.
In one possible implementation, performing the tensor operation on the tensor to be operated to obtain the operation result may include: performing the tensor and scalar multiplication operation using at least one tensor operator. Specifically, each element in the tensor to be operated is multiplied by the scalar to be operated by at least one tensor operator to obtain the operation result, thereby realizing the tensor and scalar multiplication operation.
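At the operator level, this amounts to a plain element-by-element product; a minimal sketch, assuming the tensor is held as a flat list:

```python
def tensor_scalar_multiply(tensor_to_operate, scalar_to_operate):
    # Each element is multiplied by the scalar to form the operation result.
    return [element * scalar_to_operate for element in tensor_to_operate]

print(tensor_scalar_multiply([1.0, 2.0, 4.0], 0.5))  # [0.5, 1.0, 2.0]
```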
In one possible implementation, the method may further include: parsing the compiled tensor instruction to obtain a plurality of operation instructions. In this case, step S52 may include:
the control module analyzes the compiled tensor instructions to obtain a plurality of operation instructions and sends the tensor to be operated and the operation instructions to the main operation submodule;
the main operation submodule executes preorder processing on the tensor to be operated and sends the operation instruction and at least one part of the tensor to be operated to the slave operation submodule; the tensor operator of the main operation sub-module can execute the multiplication operation of the tensor and the scalar to obtain an intermediate result;
the tensor arithmetic unit of the slave arithmetic submodule executes multiplication operation of the tensor and the scalar in parallel according to data and an arithmetic instruction received from the master arithmetic submodule to obtain a plurality of intermediate results, and transmits the intermediate results to the master arithmetic submodule;
and the main operation submodule executes subsequent processing on the plurality of intermediate results to obtain an operation result and stores the operation result into the target address.
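A highly simplified sketch of this master/slave decomposition, using Python threads to stand in for the slave operation submodules; the chunking strategy, the number of slaves, and the use of threads are illustrative assumptions rather than the device's actual microarchitecture.

```python
from concurrent.futures import ThreadPoolExecutor

def slave_multiply(chunk, scalar):
    """Slave operation submodule: tensor-scalar multiply on its part of the
    tensor, producing one intermediate result."""
    return [x * scalar for x in chunk]

def master_execute(tensor, scalar, num_slaves=4):
    """Master operation submodule: split the tensor to be operated (pre-processing),
    dispatch the parts, then concatenate the intermediate results (post-processing)."""
    step = (len(tensor) + num_slaves - 1) // num_slaves
    chunks = [tensor[i:i + step] for i in range(0, len(tensor), step)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediates = list(pool.map(lambda c: slave_multiply(c, scalar), chunks))
    return [x for part in intermediates for x in part]  # operation result

print(master_execute([1, 2, 3, 4, 5, 6, 7, 8], 10))  # [10, 20, ..., 80]
```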
In one possible implementation, the operation field may further include an input quantity. The obtaining, according to the operation code and the operation domain, the tensor to be operated and the target address which are required for executing the tensor instruction may further include: and determining the input quantity according to the operation domain, and acquiring a tensor to be operated with the data quantity as the input quantity from the data address to be operated.
In one possible implementation, the method may further include: the tensor to be computed is stored.
In one possible implementation, the scalar to be computed is an immediate or scalar register.
In one possible implementation, parsing the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction may include:
storing the compiled tensor instruction;
analyzing the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction;
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed may include the compiled tensor instruction.
In one possible implementation, the method may further include: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed before the first instruction to be executed, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after it is determined that the zeroth instruction to be executed has finished executing.
The association relationship between the first instruction to be executed and the zeroth instruction to be executed before it may include: a first storage address interval storing the data required by the first instruction to be executed and a zeroth storage address interval storing the data required by the zeroth instruction to be executed have an overlapping area.
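The association relationship is essentially an interval-overlap test on storage address ranges; a minimal sketch, assuming half-open [start, end) address intervals:

```python
def intervals_overlap(first_start, first_end, zeroth_start, zeroth_end):
    """True if the first storage address interval and the zeroth storage
    address interval share at least one address (half-open intervals assumed)."""
    return first_start < zeroth_end and zeroth_start < first_end

def must_wait(first_instr_ranges, zeroth_instr_ranges):
    """The first instruction to be executed is cached until the zeroth one
    finishes whenever any of their data address ranges overlap."""
    return any(intervals_overlap(a, b, c, d)
               for (a, b) in first_instr_ranges
               for (c, d) in zeroth_instr_ranges)

# Example: first instruction reads [500, 600), zeroth instruction writes [550, 650)
print(must_wait([(500, 600)], [(550, 650)]))  # True -> execute in order
```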
In a possible implementation manner, compiling the obtained tensor instruction to obtain a compiled tensor instruction may include:
and generating an assembly file according to the tensor instruction, and translating the assembly file into a binary file. The binary file is a compiled tensor instruction.
It should be noted that, although the tensor instruction processing method is described above by taking the above embodiment as an example, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
The tensor instruction processing method provided by the embodiment of the disclosure has the advantages of wide application range, high tensor processing efficiency and high processing speed.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present disclosure, it should be understood that the disclosed system and apparatus may be implemented in other ways. For example, the above-described embodiments of systems and apparatuses are merely illustrative, and for example, a division of a device, an apparatus, and a module is merely a logical division, and an actual implementation may have another division, for example, a plurality of modules may be combined or integrated into another system or apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices, apparatuses or modules, and may be an electrical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present disclosure may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a form of hardware or a form of a software program module.
The integrated modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Embodiments of the present application further provide a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by one or more processing devices, the steps in the method according to any of the above embodiments are implemented. Specifically, when being executed by a processor, the computer program realizes the following steps:
analyzing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring a tensor to be operated, a scalar to be operated and a target address which are required by executing the tensor instruction according to the operation code and the operation domain;
carrying out tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
It should be clear that, in the embodiments of the present application, implementation manners of each step are consistent with implementation manners of each step in the foregoing method, and specific reference may be made to the above description, and details are not described here again.
The foregoing may be better understood in light of the following clauses:
clause 1: a tensor instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring, according to the operation code and the operation domain, a tensor to be operated, a scalar to be operated and a target address required for executing the tensor instruction;
the operation module is used for carrying out tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
Clause 2: the apparatus of clause 1, the operation module comprising:
and the tensor operators are used for multiplying each element in the tensor to be operated with the scalar to be operated to obtain an operation result so as to realize the tensor and scalar multiplication operation.
Clause 3: the apparatus of clause 1 or 2, the computation module comprising a master computation sub-module and a plurality of slave computation sub-modules, the master computation sub-module and the plurality of slave computation sub-modules each comprising the tensor operator;
the control module is further configured to analyze the compiled tensor instruction to obtain a plurality of operation instructions, and send the tensor to be operated and the plurality of operation instructions to the main operation sub-module;
the main operation submodule executes preorder processing on the tensor to be operated and sends the operation instruction and at least one part of the tensor to be operated to the slave operation submodule; the tensor operator of the main operation sub-module can execute the multiplication operation of the tensor and the scalar to obtain an intermediate result;
the tensor arithmetic device of the slave arithmetic submodule is used for executing multiplication operation of the tensor and the scalar in parallel according to the data and the arithmetic instruction received from the master arithmetic submodule to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to the master arithmetic submodule;
and the main operation sub-module is also used for executing subsequent processing on the plurality of intermediate results to obtain operation results and storing the operation results into the target address.
Clause 4: the apparatus of any of clauses 1-3, the operational field further comprising an input quantity,
the control module is further configured to determine the tensor to be operated according to a source address of the tensor to be operated and the input quantity.
Clause 5: the apparatus of any of clauses 1-4, wherein the scalar to be operated on is an immediate or a scalar register.
Clause 6: the apparatus of any of clauses 1-5, further comprising:
and the storage module is used for storing at least one of the tensor to be operated and the scalar to be operated.
Clause 7: the apparatus of any of clauses 1-6, wherein the control module comprises:
the instruction storage submodule is used for storing the compiled tensor instruction;
the instruction processing submodule is used for analyzing the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction;
the queue storage submodule is used for storing an instruction queue, the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the compiled tensor instructions.
Clause 8: the apparatus of any of clauses 1-7, the control module further comprising:
the dependency relationship processing submodule is used for caching a first instruction to be executed in the instruction storage submodule when the fact that the first instruction to be executed in the plurality of instructions to be executed is associated with a zeroth instruction to be executed before the first instruction to be executed is determined, extracting the first instruction to be executed from the instruction storage submodule after the zeroth instruction to be executed is executed, and sending the first instruction to be executed to the operation module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
Clause 9: a tensor instruction processing method applied to a tensor instruction processing apparatus, the method comprising:
analyzing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring a tensor to be operated, a scalar to be operated and a target address which are required by executing the tensor instruction according to the operation code and the operation domain;
carrying out tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
Clause 10: the method of clause 9, wherein the calculation module comprises:
and the tensor operators are used for multiplying each element in the tensor to be operated with the scalar to be operated to obtain an operation result so as to realize the tensor and scalar multiplication operation.
Clause 11: the method of clause 9 or 10, wherein the calculation module comprises a master calculation sub-module and a plurality of slave calculation sub-modules, each comprising the tensor calculator;
the control module analyzes the compiled tensor instructions to obtain a plurality of operation instructions and sends the tensor to be operated and the operation instructions to the main operation submodule;
the main operation submodule executes preorder processing on the tensor to be operated and sends the operation instruction and at least one part of the tensor to be operated to the slave operation submodule; the tensor operator of the main operation sub-module can execute the multiplication operation of the tensor and the scalar to obtain an intermediate result;
the tensor arithmetic unit of the slave arithmetic submodule executes multiplication operation of the tensor and the scalar in parallel according to data and an arithmetic instruction received from the master arithmetic submodule to obtain a plurality of intermediate results, and transmits the intermediate results to the master arithmetic submodule;
and the main operation sub-module executes subsequent processing on the plurality of intermediate results to obtain operation results, and stores the operation results into the target address.
Clause 11: the method of any of clauses 9-10, the operational field further comprising an input quantity,
the control module is further configured to determine the tensor to be operated according to a source address of the tensor to be operated and the input quantity.
Clause 13: The method according to any of clauses 9-12, wherein the scalar to be operated on is an immediate or a scalar register.
Clause 14: The method of any of clauses 9-13, further comprising:
and storing at least one of the tensor to be operated and the scalar to be operated.
Clause 15: The method according to any of clauses 9-14, wherein parsing the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction comprises:
the instruction storage submodule stores the compiled tensor instruction;
the instruction processing submodule analyzes the compiled tensor instruction to obtain an operation code and an operation domain of the tensor instruction;
the queue storage submodule stores an instruction queue, the instruction queue comprises a plurality of instructions to be executed, which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the compiled tensor instructions.
Clause 16: The method of any of clauses 9-15, further comprising:
when determining that a first to-be-executed instruction in the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction in the instruction storage submodule, after the zeroth to-be-executed instruction is executed, extracting the first to-be-executed instruction from the instruction storage submodule and sending the first to-be-executed instruction to the operation module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
Clause 17: A computer readable storage medium having stored thereon a computer program which, when executed by one or more processing devices, implements the steps of the method of any of clauses 9-16.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A tensor instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the acquired tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and acquiring a tensor to be operated, a scalar to be operated and a target address which are required by execution of the tensor instruction according to the operation code and the operation domain;
the operation module is used for carrying out tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
2. The apparatus of claim 1, wherein the operation module comprises:
and the tensor arithmetic unit is used for multiplying each element in the tensor to be operated with the scalar to be operated to obtain an operation result so as to realize the multiplication operation of the tensor and the scalar.
3. The apparatus of claim 2, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, each of the master operation submodule and the plurality of slave operation submodules comprising the tensor operator;
the control module is also used for analyzing the tensor instruction to obtain a plurality of operation instructions, and sending the tensor to be operated and the plurality of operation instructions to the main operation submodule;
the main operation submodule executes preorder processing on the tensor to be operated and sends the operation instruction and at least one part of the tensor to be operated to the slave operation submodule; the tensor operator of the main operation sub-module can execute the multiplication operation of the tensor and the scalar to obtain an intermediate result;
the tensor arithmetic device of the slave arithmetic submodule is used for executing multiplication operation of the tensor and the scalar in parallel according to the data and the arithmetic instruction received from the master arithmetic submodule to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to the master arithmetic submodule;
and the main operation sub-module is also used for executing subsequent processing on the plurality of intermediate results to obtain operation results and storing the operation results into the target address.
4. The apparatus of claim 1, wherein the operational field further comprises an input quantity,
the control module is further configured to determine the tensor to be operated according to a source address of the tensor to be operated and the input quantity.
5. The apparatus according to any of claims 1-4, wherein the scalar to be operated on is an immediate or a scalar register.
6. The apparatus of any of claims 1-4, further comprising:
and the storage module is used for storing at least one of the tensor to be operated and the scalar to be operated.
7. The apparatus of any of claims 1-4, wherein the control module comprises:
the instruction storage submodule is used for storing the tensor instruction;
the instruction processing submodule is used for analyzing the tensor instruction to obtain an operation code and an operation domain of the tensor instruction;
the queue storage submodule is used for storing an instruction queue, the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the tensor instructions.
8. The apparatus of claim 7, wherein the control module further comprises:
the dependency relationship processing submodule is used for caching a first instruction to be executed in the instruction storage submodule when the fact that the first instruction to be executed in the plurality of instructions to be executed is associated with a zeroth instruction to be executed before the first instruction to be executed is determined, extracting the first instruction to be executed from the instruction storage submodule after the zeroth instruction to be executed is executed, and sending the first instruction to be executed to the operation module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
9. A tensor instruction processing method applied to a tensor instruction processing apparatus, the method comprising:
analyzing the obtained tensor instruction to obtain an operation code and an operation domain of the tensor instruction, and obtaining a tensor to be operated, a scalar to be operated and a target address which are required by executing the tensor instruction according to the operation code and the operation domain;
performing tensor and scalar multiplication operation on the tensor to be operated and the scalar to be operated to obtain an operation result, and storing the operation result into the target address;
the operation code is used for indicating that the tensor instruction performs operation on data as tensor and scalar multiplication operation, and the operation domain comprises a source address of a tensor to be operated, the scalar to be operated and the target address.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by one or more processing means, carries out the steps of the method as set forth in claim 9.