Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings. It is evident that the embodiments described are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments made by those skilled in the art without inventive effort fall within the scope of the present disclosure.
It should be understood that the terms "zero," "first," "second," and the like in the claims, specification and drawings of the present disclosure are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when," "once," "in response to a determination," or "in response to detection," depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determination," "in response to a determination," "upon detection of a [described condition or event]," or "in response to detection of a [described condition or event]."
With the wide use of neural network algorithms, the operator capability of computer hardware continues to improve, and the variety and number of data operations involved in practical applications continue to grow. Because programming languages are diverse and, at the current stage, there is no scalar data migration instruction widely applicable across programming languages, technicians in the related art must customize one or more instructions for each programming language environment to implement scalar data migration, so the efficiency and speed of scalar data migration are low. The present disclosure provides a scalar data migration instruction processing method, apparatus, computer device, and storage medium, in which scalar data migration can be implemented with only one instruction, and the efficiency and speed of performing scalar data migration can be significantly improved.
FIG. 1 illustrates a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a control module 11 and a processing module 12.
The control module 11 is configured to parse the obtained scalar data migration instruction to obtain an operation code and an operation domain of the scalar data migration instruction, obtain scalar data to be migrated and a target address required for executing the scalar data migration instruction according to the operation code and the operation domain, and determine migration parameters required for performing migration processing. The operation code is used for indicating that the scalar data migration instruction performs processing on scalar data as migration processing, the operation domain comprises a scalar data address to be migrated and a target address, and the migration parameters can comprise an initial storage space where the scalar data address to be migrated is located, a target storage space where the target address is located and a migration type for performing migration processing.
The processing module 12 stores scalar data to be migrated in the target address according to the migration parameters.
In this embodiment, there may be one or more pieces of scalar data to be migrated. The migration type may indicate the scalar data storage speed of the initial storage space, the scalar data storage speed of the target storage space, and the relative speed relationship between the two. In the scalar data migration instruction, different codes can be set for the different storage speed relationships between target storage spaces and initial storage spaces in order to distinguish them. For example, the code of the migration type "the storage speed of the initial storage space is greater than the storage speed of the target storage space" may be set to "st". The code of the migration type "the storage speed of the initial storage space is equal to the storage speed of the target storage space" may be set to "mv". The code of the migration type "the storage speed of the initial storage space is smaller than the storage speed of the target storage space" may be set to "ld". Those skilled in the art may set the migration types and their codes according to actual needs, which is not limited by the present disclosure.
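The mapping from relative storage speeds to the example codes above can be sketched as follows. This is a minimal illustration assuming the "st"/"mv"/"ld" codes described in this embodiment; the function name and numeric speed values are hypothetical, not part of the disclosure.

```python
def migration_type(initial_speed: int, target_speed: int) -> str:
    """Return the migration-type code for a pair of storage speeds.

    Follows the example codes above: "st" when the initial storage space
    is faster than the target, "ld" when it is slower, "mv" when equal.
    """
    if initial_speed > target_speed:
        return "st"  # storing into a slower space
    if initial_speed < target_speed:
        return "ld"  # loading into a faster space
    return "mv"      # equal speeds: plain move
```

As the disclosure notes, a skilled artisan may choose different codes; only the three-way speed comparison is essential here.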
In this embodiment, the migration parameters may include an identifier such as a name, a number, etc. of the initial storage space and the target storage space, so as to represent the initial storage space and the target storage space.
In this embodiment, the initial memory space may be the NRAM, DDR, a register, etc. of the device. The target memory space may be the NRAM or DDR of the device. NRAM (Nanotube Random Access Memory) is a carbon nanotube (CNT) based nonvolatile memory. DDR (also known as DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory) is a double-data-rate synchronous dynamic random access memory.
In this embodiment, the scalar data migration instruction acquired by the control module is a hardware instruction that does not need to be compiled and can be directly executed by hardware, and the control module can parse the acquired scalar data migration instruction. The control module may obtain the scalar data to be migrated from the scalar data address to be migrated. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the opcode may be the portion of an instruction (usually represented by a code) that specifies the operation to be performed, and serves as an instruction sequence number used to inform the apparatus executing the instruction which instruction specifically needs to be executed. The operation domain may be the source of all data required for executing the corresponding instruction, including the target address, the scalar data address to be migrated, the initial storage space where the scalar data address to be migrated is located, the target storage space where the target address is located, the migration parameters for performing migration processing, and so on. A scalar data migration instruction must include an opcode and an operation domain, where the operation domain includes at least the scalar data address to be migrated and the target address.
It should be appreciated that one skilled in the art may set the scalar data migration instruction format and the contained opcodes and operation fields as desired, and this disclosure is not limited in this regard.
In this embodiment, the apparatus may include one or more control modules and one or more processing modules, and the number of the control modules and the processing modules may be set according to actual needs, which is not limited in this disclosure. When the apparatus includes a control module, the control module may receive scalar data migration instructions and control one or more processing modules to perform scalar data migration. When the device comprises a plurality of control modules, the control modules can respectively receive scalar data migration instructions and control the corresponding processing module or processing modules to perform scalar data migration.
The scalar data migration instruction processing apparatus provided by the embodiment of the disclosure includes a control module and a processing module. The control module is used for parsing the obtained scalar data migration instruction to obtain the opcode and operation domain of the scalar data migration instruction, obtaining the scalar data to be migrated and the target address required for executing the scalar data migration instruction according to the opcode and operation domain, and determining the migration parameters required for migration processing. The processing module is used for storing the scalar data to be migrated in the target address according to the migration parameters. The scalar data migration instruction processing apparatus provided by the embodiment of the disclosure has a wide application range, and both its processing of scalar data migration instructions and the resulting scalar data migration are highly efficient and fast.
FIG. 2a illustrates a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 2a, the processing module 12 may include a master processing sub-module 121 and a plurality of slave processing sub-modules 122.
The master processing sub-module 121 is configured to process the scalar data to be migrated to obtain processed scalar data to be migrated, and to store the processed scalar data to be migrated in the target address. The processing performed on the scalar data to be migrated may include data type conversion and the like, which is not limited by the present disclosure.
In one possible implementation manner, the control module 11 is further configured to parse the obtained calculation instruction to obtain an operation domain and an operation code of the calculation instruction, and obtain data to be operated required for executing the calculation instruction according to the operation domain and the operation code. The processing module 12 is further configured to operate on the data to be operated according to the calculation instruction, so as to obtain a calculation result of the calculation instruction. The processing module may include a plurality of operators for performing operations corresponding to operation types of the calculation instructions.
In this implementation manner, the calculation instruction may be other instructions for performing arithmetic operations, logical operations, and other operations on data such as scalar, vector, matrix, tensor, etc., and those skilled in the art may set the calculation instruction according to actual needs, which is not limited in this disclosure.
In this implementation, the operators may include units capable of performing arithmetic operations, logical operations, and the like on data, such as adders, dividers, multipliers, and comparators. The type and number of operators may be set according to the data volume of the operation to be performed, the operation type, and the required processing speed and efficiency, and the present disclosure is not limited thereto.
In a possible implementation manner, the control module 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the data to be operated and the plurality of operation instructions to the main processing sub-module 121.
The master processing sub-module 121 is configured to perform preamble processing on data to be operated on, and perform transmission of data and operation instructions with the plurality of slave processing sub-modules 122.
The slave processing sub-modules 122 are configured to perform intermediate operations in parallel according to the data and operation instructions transmitted from the master processing sub-module 121 to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master processing sub-module 121.
The main processing sub-module 121 is further configured to perform subsequent processing on the plurality of intermediate results, obtain a calculation result of the calculation instruction, and store the calculation result in a corresponding address.
In this implementation, when the calculation instruction is an operation performed on scalar or vector data, the apparatus may control the main processing sub-module to perform an operation corresponding to the calculation instruction using an operator therein. When the calculation instruction is an operation for data with dimensions greater than or equal to 2, such as a matrix, a tensor and the like, the device can control the slave processing submodule to perform an operation corresponding to the calculation instruction by using an operator in the slave processing submodule.
It should be noted that, a person skilled in the art may set the connection manner between the master processing sub-module and the plurality of slave processing sub-modules according to actual needs, so as to implement an architecture setting of the processing module, for example, the architecture of the processing module may be an "H" type architecture, an array type architecture, a tree type architecture, etc., which is not limited in this disclosure.
FIG. 2b illustrates a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 2b, the processing module 12 may further include one or more branch processing sub-modules 123, where the branch processing sub-modules 123 are configured to forward data and/or operation instructions between the master processing sub-module 121 and the slave processing sub-modules 122. The master processing sub-module 121 is connected to the one or more branch processing sub-modules 123. In this way, the master processing sub-module, the branch processing sub-modules, and the slave processing sub-modules in the processing module are connected in an "H"-type architecture, and data and/or operation instructions are forwarded by the branch processing sub-modules, which saves resources of the master processing sub-module and further improves the instruction processing speed.
FIG. 2c illustrates a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 2c, a plurality of slave processing sub-modules 122 are distributed in an array.
Each of the slave processing sub-modules 122 is connected to other adjacent slave processing sub-modules 122, and the master processing sub-module 121 is connected to k slave processing sub-modules 122 among the plurality of slave processing sub-modules 122, where the k slave processing sub-modules 122 are: n slave processing sub-modules 122 of row 1, n slave processing sub-modules 122 of row m, and m slave processing sub-modules 122 of column 1.
As shown in FIG. 2c, the k slave processing sub-modules include only the n slave processing sub-modules in row 1, the n slave processing sub-modules in row m, and the m slave processing sub-modules in column 1; that is, the k slave processing sub-modules are those slave processing sub-modules, among the plurality of slave processing sub-modules, that are directly connected to the master processing sub-module. The k slave processing sub-modules are used for forwarding data and instructions between the master processing sub-module and the remaining slave processing sub-modules. In this way, distributing the plurality of slave processing sub-modules in an array can improve the speed at which the master processing sub-module sends data and/or operation instructions to the slave processing sub-modules, and can further improve the instruction processing speed.
FIG. 2d illustrates a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 2d, the processing module may also include a tree submodule 124. The tree submodule 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master processing sub-module 121, and the plurality of branch ports 402 are connected to the plurality of slave processing sub-modules 122, respectively. The tree submodule 124 has a transceiving function and is used for forwarding data and/or operation instructions between the master processing sub-module 121 and the slave processing sub-modules 122. Thus, the processing module is connected in a tree-shaped structure through the tree submodule, and by utilizing the forwarding function of the tree submodule, the speed at which the master processing sub-module sends data and/or operation instructions to the slave processing sub-modules can be improved, further improving the instruction processing speed.
In one possible implementation, the tree submodule 124 may be an optional component of the apparatus and may include at least one layer of nodes. A node is a wiring structure with a forwarding function; the node itself has no operation function. The lowest layer of nodes is connected to the slave processing sub-modules to forward data and/or operation instructions between the master processing sub-module 121 and the slave processing sub-modules 122. In particular, if the tree submodule has zero layers of nodes, the apparatus does not require a tree submodule.
In one possible implementation, tree submodule 124 may include a plurality of nodes of an n-ary tree structure, which may have a plurality of layers.
For example, FIG. 2e shows a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the disclosure. As shown in FIG. 2e, the n-ary tree structure may be a binary tree structure, and the tree submodule includes two layers of nodes 01. The lowest-layer nodes 01 are connected to the slave processing sub-modules 122 to forward data and/or operation instructions between the master processing sub-module 121 and the slave processing sub-modules 122.
In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. The value of n in the n-ary tree structure and the number of node layers in the n-ary tree structure can be set as desired by those skilled in the art, and this disclosure is not limited in this regard.
In one possible implementation, the operation domain may also include scalar data migration volumes. The control module 11 is further configured to determine a scalar data migration amount according to the operation domain, and obtain scalar data to be migrated corresponding to the scalar data migration amount from the scalar data address to be migrated.
In this implementation, the scalar data migration volume may be the volume of data of the acquired scalar data to be migrated.
In one possible implementation, a default scalar data migration amount may be preset. When the scalar data migration amount is not included in the operation domain, the default scalar data migration amount may be determined as the scalar data migration amount of the current scalar data migration instruction, and scalar data to be migrated corresponding to that migration amount is then acquired from the scalar data address to be migrated.
In one possible implementation, when the operation domain does not include the scalar data migration amount, all of the scalar data stored at the scalar data address to be migrated may be obtained directly.
In one possible implementation, the operation domain may also include migration parameters. Wherein, determining the migration parameters required for performing the migration process may include: and determining migration parameters required by migration processing according to the operation domain.
In one possible implementation, the opcode may also be used to indicate a migration parameter. Wherein, determining the migration parameters required for performing the migration process may include: and determining migration parameters required by migration processing according to the operation codes.
In one possible implementation, default migration parameters may also be set. When the migration parameters of the current scalar data migration instruction cannot be determined according to both the operation field and the operation code, the default migration parameters may be determined as the migration parameters of the current scalar data migration instruction.
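The three sources of migration parameters described above (operation domain, opcode, and a preset default) can be sketched as a simple fallback. This is an illustrative sketch only; the disclosure does not fix a precedence order between the operation domain and the opcode, so the ordering and the function name here are assumptions.

```python
def resolve_migration_params(domain_params=None, opcode_params=None,
                             default_params=None):
    """Return migration parameters from the first available source.

    Assumed precedence (not specified by the disclosure): parameters
    carried in the operation domain, then parameters indicated by the
    opcode, then the preset default migration parameters.
    """
    if domain_params is not None:
        return domain_params
    if opcode_params is not None:
        return opcode_params
    return default_params
```

The key point mirrored from the text is only the last branch: when neither the operation domain nor the opcode determines the parameters, the preset default is used.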
In one possible implementation, the initial storage space and the target storage space may be determined according to the scalar data address to be migrated and the target address, respectively, and the migration parameters may then be determined according to parameters such as the storage speed and storage space type of the initial storage space and the target storage space.
In one possible implementation, as shown in fig. 2 a-2 e, the apparatus may further comprise a storage module 13. The storage module 13 is used for storing scalar data to be migrated.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratch-pad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache is used for storing the data to be operated on, and the register is used for storing the scalar data to be migrated and the scalar data among the data to be operated on. The data to be operated on may be data related to the execution of calculation instructions and scalar data migration instructions.
In one possible implementation, the cache may comprise a neuron cache. The neuron cache, that is, the above-mentioned neuron random access memory, may be used to store neuron data in the data to be operated on, the neuron data including neuron vector data.
In one possible implementation, the apparatus may further include a direct memory access module for reading or storing data from the storage module.
In one possible implementation, as shown in fig. 2 a-2 e, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112, and a queue storage sub-module 113.
Instruction storage sub-module 111 is used to store scalar data migration instructions.
The instruction processing sub-module 112 is configured to parse the scalar data migration instruction to obtain an operation code and an operation domain of the scalar data migration instruction.
The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes a plurality of to-be-executed instructions sequentially arranged according to an execution order, and the plurality of to-be-executed instructions may include scalar data migration instructions.
In this implementation, the instructions to be executed may also include computing instructions related to or unrelated to scalar data migration, which is not limiting of the present disclosure. The instruction queue may be obtained by arranging the execution sequences of the plurality of instructions to be executed according to the receiving time, the priority level, and the like of the instructions to be executed, so that the plurality of instructions to be executed are sequentially executed according to the instruction queue.
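The queue-building rule just described, ordering instructions to be executed by attributes such as priority level and receiving time, can be sketched as below. The dictionary keys and the tie-breaking order (priority first, then receiving time) are illustrative assumptions; the disclosure leaves the exact ordering criteria open.

```python
def build_instruction_queue(instructions):
    """Arrange to-be-executed instructions into an execution order.

    Assumed ordering (one possibility permitted by the text): higher
    priority first, earlier receiving time first among equal priorities.
    """
    return sorted(instructions,
                  key=lambda inst: (-inst["priority"], inst["recv_time"]))
```

Because Python's sort is stable, instructions with identical priority and receiving time keep their original relative order.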
In one possible implementation, as shown in fig. 2 a-2 e, the control module 11 may include a dependency processing sub-module 114.
The dependency relationship processing sub-module 114 is configured to cache a first to-be-executed instruction in the instruction storage sub-module 111 when determining that there is an association relationship between the first to-be-executed instruction in the plurality of to-be-executed instructions and a zeroth to-be-executed instruction before the first to-be-executed instruction, and extract the first to-be-executed instruction from the instruction storage sub-module 111 and send the first to-be-executed instruction to the processing module 12 after the execution of the zeroth to-be-executed instruction is completed.
The association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before it includes: the first storage address interval storing the data required by the first to-be-executed instruction and the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction have an overlapping area. Conversely, when the first storage address interval and the zeroth storage address interval have no overlapping area, the first to-be-executed instruction and the zeroth to-be-executed instruction before it have no association relationship.
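The overlap test above amounts to a standard interval-intersection check, which can be sketched as follows. The function name and the convention that intervals are inclusive `(start, end)` pairs are assumptions for illustration.

```python
def has_dependency(first_interval, zeroth_interval):
    """Return True when the two storage address intervals overlap,
    i.e. when the first to-be-executed instruction is associated with
    the zeroth to-be-executed instruction before it.

    Intervals are inclusive (start, end) address pairs with start <= end.
    """
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start <= zeroth_end and zeroth_start <= first_end
```

When this check returns True, the dependency relationship processing sub-module would cache the first to-be-executed instruction until the zeroth finishes; when it returns False, the two may proceed independently.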
In this way, according to the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before it, the first to-be-executed instruction is executed only after the zeroth to-be-executed instruction has finished executing, ensuring the accuracy of the result.
In one possible implementation, the instruction format of the scalar data migration instruction may be:
migrate dst src type.space1.space2 size
where migrate is the opcode of the scalar data migration instruction, and dst, src, type.space1.space2, and size are the operation domain of the scalar data migration instruction. dst is the target address and src is the address of the scalar data to be migrated; when there are multiple pieces of scalar data to be migrated, src may include the addresses src0, src1, ..., srcn of the multiple pieces of scalar data to be migrated, which is not limited by the present disclosure. type.space1.space2 is the migration parameter: type represents the migration type, space1 represents the initial storage space where the scalar data address src to be migrated is located, and space2 represents the target storage space where the target address dst is located. size is the scalar data migration amount.
In one possible implementation, the instruction format of the scalar data migration instruction may also be:
type.space1.space2 dst src size
where type.space1.space2 is the opcode of the scalar data migration instruction, and dst, src, and size are the operation domain of the scalar data migration instruction. dst is the target address and src is the address of the scalar data to be migrated; when there are multiple pieces of scalar data to be migrated, src may include the addresses src0, src1, ..., srcn of the multiple pieces of scalar data to be migrated, which is not limited by the present disclosure. size is the scalar data migration amount. In the opcode type.space1.space2, type represents the migration type, space1 represents the initial storage space where the scalar data address src to be migrated is located, and space2 represents the target storage space where the target address dst is located.
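Parsing this second instruction format can be sketched as below. The function name and the returned dictionary layout are illustrative assumptions; the sketch handles only the single-src form of the format, with the concrete instruction from the application example later in this disclosure used as a sanity check.

```python
def parse_migration_instruction(text: str) -> dict:
    """Split a scalar data migration instruction of the format
    'type.space1.space2 dst src size' into opcode fields and
    operation domain (single-src form only)."""
    opcode, dst, src, size = text.split()
    mig_type, space1, space2 = opcode.split(".")
    return {
        "type": mig_type,     # migration type: ld, st, or mv
        "space1": space1,     # initial storage space identifier
        "space2": space2,     # target storage space identifier
        "dst": int(dst),      # target address
        "src": int(src),      # scalar data address to be migrated
        "size": int(size),    # scalar data migration amount
    }
```

For example, the instruction "ld.200.300 500 400 5" parses to migration type ld, initial storage space 200, target storage space 300, target address 500, source address 400, and migration amount 5.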
The type may be ld, st, or mv. ld indicates the migration type "the storage speed of the initial storage space is smaller than the storage speed of the target storage space". st indicates the migration type "the storage speed of the initial storage space is greater than the storage speed of the target storage space". mv indicates the migration type "the storage speed of the initial storage space is equal to the storage speed of the target storage space".
In one possible implementation, the instruction format of the scalar data migration instruction with the migration type "the storage speed of the initial storage space is smaller than the storage speed of the target storage space" may be set as: ld.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2, and the migration type ld, scalar data to be migrated of amount size is acquired from the scalar data address to be migrated src0 in the initial storage space space1, and stored in the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is smaller than the storage speed of the target storage space space2.
In one possible implementation, the instruction format of the scalar data migration instruction with the migration type "the storage speed of the initial storage space is greater than the storage speed of the target storage space" may be set as: st.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2, and the migration type st, scalar data to be migrated of amount size is acquired from the scalar data address to be migrated src0 in the initial storage space space1, and stored in the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is greater than the storage speed of the target storage space space2.
In one possible implementation, the instruction format of the scalar data migration instruction with the migration type "the storage speed of the initial storage space is equal to the storage speed of the target storage space" may be set as: mv.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2, and the migration type mv, scalar data to be migrated of amount size is acquired from the scalar data address to be migrated src0 in the initial storage space space1, and stored in the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is equal to the storage speed of the target storage space space2.
It should be appreciated that one skilled in the art may set the opcode of a scalar data migration instruction, the location of the opcode and the operation field in the instruction format, as desired, and this disclosure is not limited in this regard.
In one possible implementation, the apparatus may be provided in one or more of a graphics processor (Graphics Processing Unit, GPU for short), a central processor (Central Processing Unit, CPU for short), and an embedded neural network processor (Neural-network Processing Unit, NPU for short).
It should be noted that, although the scalar data migration instruction processing apparatus is described above by way of example in the above embodiments, those skilled in the art will appreciate that the present disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or actual application scene, so long as the technical scheme of the disclosure is met.
Application example
An application example according to an embodiment of the present disclosure is given below, taking "data migration with a scalar data migration instruction processing apparatus" as an exemplary application scenario, to facilitate understanding of the flow of the scalar data migration instruction processing apparatus. It will be appreciated by those skilled in the art that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be construed as limiting the embodiments of the present disclosure.
Fig. 3 shows a schematic diagram of an application scenario of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the scalar data migration instruction processing apparatus processes a scalar data migration instruction as follows:
the control module 11 parses the obtained scalar data migration instruction 1 (for example, the scalar data migration instruction 1 is ld.200.300 500 400 5) to obtain the operation code and the operation domain of the scalar data migration instruction 1. The operation code of the scalar data migration instruction 1 is ld, the initial storage space is 200, the target storage space is 300, the target address is 500, the scalar data address to be migrated is 400, and the scalar data migration amount is 5. It can be determined from the operation code ld that the storage speed of the initial storage space 200 is smaller than the storage speed of the target storage space 300. The control module 11 acquires, starting from the scalar data address to be migrated 400 in the initial storage space 200, scalar data to be migrated having a data size equal to the scalar data migration amount 5. The processing module 12 stores the scalar data to be migrated into the target address 500 in the target storage space 300 according to the migration parameters.
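The parsing performed by the control module 11 can be illustrated with a short sketch. This is a hypothetical model for understanding only: the field names, and the assumption that the instruction is a plain text string rather than a binary hardware encoding, are illustrative and not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class ScalarMigrationInstruction:
    opcode: str     # e.g. "ld": source is slower than destination
    src_space: int  # initial storage space (200 in the example)
    dst_space: int  # target storage space (300)
    dst_addr: int   # target address (500)
    src_addr: int   # scalar data address to be migrated (400)
    size: int       # scalar data migration amount (5)

def parse_instruction(text: str) -> ScalarMigrationInstruction:
    # Split "ld.200.300 500 400 5" into its operation code and operation domain.
    head, dst_addr, src_addr, size = text.split()
    opcode, src_space, dst_space = head.split(".")
    return ScalarMigrationInstruction(
        opcode=opcode,
        src_space=int(src_space),
        dst_space=int(dst_space),
        dst_addr=int(dst_addr),
        src_addr=int(src_addr),
        size=int(size),
    )

inst = parse_instruction("ld.200.300 500 400 5")
```

In this toy model, the opcode `ld` implies the migration type (initial storage speed smaller than target storage speed), so the migration parameters can be derived from the parsed fields alone.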
Besides the foregoing ld.200.300 500 400 5, the scalar data migration instruction 1 may also be written with the operation code and the operation domain in other positions, for example 500 400 ld.200.300 5; the processing procedures of the two forms are similar and will not be repeated.
The operation of the above modules may be described with reference to the relevant description above.
Thus, the scalar data migration instruction processing apparatus can process scalar data migration instructions efficiently and rapidly, achieving high processing efficiency and high processing speed for scalar data migration.
The present disclosure provides a machine learning operation device, which may include one or more of the scalar data migration instruction processing apparatuses described above, and which is used for acquiring scalar data to be migrated and control information from other processing devices and performing specified machine learning operations. The machine learning operation device may obtain scalar data migration instructions from other machine learning operation devices or non-machine-learning operation devices, and transfer execution results to peripheral devices (also referred to as other processing devices) through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces and servers. When more than one scalar data migration instruction processing apparatus is included, the scalar data migration instruction processing apparatuses may be linked and transfer data through a specific structure, for example, interconnected and transferring data through a PCIE bus, so as to support operations of a larger-scale neural network. In this case, the apparatuses may share the same control system or have independent control systems; they may share a memory, or each accelerator may have its own memory. In addition, the interconnection mode may be any interconnection topology.
The machine learning operation device has higher compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 4a shows a block diagram of a combined processing apparatus according to an embodiment of the disclosure. As shown in fig. 4a, the combined processing device includes the machine learning computing device, the universal interconnect interface, and other processing devices. The machine learning operation device interacts with other processing devices to jointly complete the operation designated by the user.
The other processing device may include one or more types of general-purpose/special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor, and the like. The number of processors included in the other processing device is not limited. The other processing device serves as an interface between the machine learning operation device and external data and control, including data carrying, and completes basic control such as starting and stopping of the machine learning operation device; the other processing device may also cooperate with the machine learning operation device to complete computing tasks.
And the universal interconnection interface is used for transmitting data and control instructions between the machine learning operation device and other processing devices. The machine learning operation device acquires required input data from other processing devices and writes the required input data into a storage device on a chip of the machine learning operation device; the control instruction can be obtained from other processing devices and written into a control cache on a machine learning operation device chip; the data in the memory module of the machine learning arithmetic device may be read and transmitted to the other processing device.
Fig. 4b shows a block diagram of a combined processing apparatus according to an embodiment of the disclosure. In a possible implementation, as shown in fig. 4b, the combined processing device may further comprise a storage device, which is connected to the machine learning operation device and the other processing device, respectively. The storage device is used for storing data of the machine learning operation device and the other processing device, and is particularly suitable for data to be computed that cannot be fully held in the internal storage of the machine learning operation device or the other processing device.
The combined processing device can be used as an SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, so that the core area of the control part is effectively reduced, the processing speed is improved, and the overall power consumption is reduced. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
The present disclosure provides a machine learning chip including the machine learning arithmetic device or the combination processing device described above.
The present disclosure provides a machine learning chip packaging structure including the machine learning chip described above.
The present disclosure provides a board card, and fig. 5 shows a schematic structural diagram of the board card according to an embodiment of the present disclosure. As shown in fig. 5, the board card includes the above machine learning chip package structure or the above machine learning chip. In addition to including machine learning chip 389, the board card may include other kits including, but not limited to: a memory device 390, an interface device 391 and a control device 392.
The memory device 390 is connected to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure) via a bus, and is used for storing data. The memory device 390 may include multiple sets of memory units 393. Each set of memory units 393 is connected to the machine learning chip 389 via a bus. It is understood that each set of memory units 393 may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency: data is read out on both the rising and falling edges of the clock pulse, making DDR twice as fast as standard SDRAM.
In one embodiment, the memory device 390 may include 4 sets of memory units 393. Each set of memory units 393 may include a plurality of DDR4 granules (chips). In one embodiment, the machine learning chip 389 may internally include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used to transfer data and 8 bits are used for ECC checking. It can be appreciated that when DDR4-3200 granules are employed in each set of memory units 393, the theoretical bandwidth of data transfer may reach 25600 MB/s.
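The 25600 MB/s figure quoted above follows directly from the DDR4-3200 data rate and the 64-bit data path (the 8 ECC bits carry no payload), as this back-of-the-envelope check shows:

```python
# DDR4-3200: 3200 million transfers per second on a 64-bit data bus.
transfers_per_second = 3200 * 10**6
data_bus_bytes = 64 // 8          # 64 data bits = 8 bytes per transfer
bandwidth_mb_s = transfers_per_second * data_bus_bytes // 10**6
# bandwidth_mb_s == 25600
```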
In one embodiment, each set of memory units 393 includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389, for controlling data transfer to and data storage in each set of memory units 393.
The interface device 391 is electrically connected to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure). The interface device 391 is used to enable data transfer between the machine learning chip 389 and an external device (e.g., a server or computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface: the data to be processed is transferred from the server to the machine learning chip 389 through the standard PCIE interface, thereby implementing data transfer. Preferably, when PCIe 3.0 x16 interface transmission is adopted, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may be another interface; the present disclosure is not limited to the specific form of the other interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is transmitted back to the external device (e.g., a server) by the interface device.
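The ~16000 MB/s figure can likewise be sanity-checked from the PCIe 3.0 link parameters (8 GT/s per lane with 128b/130b encoding, assuming a 16-lane link as is common for such boards):

```python
# PCIe 3.0: 8 GT/s per lane, 128b/130b line encoding, x16 link assumed.
lanes = 16
gt_per_s = 8 * 10**9
payload_bits_per_lane = gt_per_s * 128 / 130   # strip encoding overhead
bandwidth_mb_s = lanes * payload_bits_per_lane / 8 / 10**6
# roughly 15754 MB/s, commonly rounded to 16000 MB/s (16 GB/s)
```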
The control device 392 is electrically connected to the machine learning chip 389. The control device 392 is configured to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected via an SPI interface. The control device 392 may include a single-chip microcomputer (Micro Controller Unit, MCU). For example, the machine learning chip 389 may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads; therefore, the machine learning chip 389 may be in different working states such as heavy load and light load. The control device can regulate and control the working states of the plurality of processing chips, the plurality of processing cores and/or the plurality of processing circuits in the machine learning chip.
The present disclosure provides an electronic device including the machine learning chip or the board card described above.
The electronic device may include a data processing apparatus, a computer device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an aircraft, a ship, and/or a vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers, range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
FIG. 6 illustrates a flow chart of a scalar data migration instruction processing method according to an embodiment of the present disclosure. The method may be applied to, for example, a computer device comprising a memory and a processor, wherein the memory is used to store data used during execution of the method; the processor is configured to perform related processing and operation steps, such as performing step S51 and step S52 described below. As shown in fig. 6, the method is applied to the scalar data migration instruction processing apparatus described above, and includes step S51 and step S52.
In step S51, the control module is utilized to parse the obtained scalar data migration instruction to obtain an operation code and an operation field of the scalar data migration instruction, obtain scalar data to be migrated and a target address required for executing the scalar data migration instruction according to the operation code and the operation field, and determine migration parameters required for performing migration processing. The operation code is used for indicating that the scalar data migration instruction performs the migration processing on the scalar data, the operation domain comprises a scalar data address to be migrated and a target address, and the migration parameters comprise an initial storage space where the scalar data address to be migrated is located, a target storage space where the target address is located and a migration type for performing the migration processing.
In step S52, the processing module stores the scalar data to be migrated into the target address according to the migration parameters.
In one possible implementation, the processing module may include a master processing sub-module and a plurality of slave processing sub-modules. Wherein, step S52 may include:
processing, by the master processing sub-module, the scalar data to be migrated to obtain processed scalar data to be migrated, and storing the processed scalar data to be migrated into the target address.
In one possible implementation, the operation domain further includes a scalar data migration volume. Obtaining the scalar data to be migrated and the target address required for executing the scalar data migration instruction according to the operation code and the operation domain may include:
and determining the scalar data migration quantity according to the operation domain, and acquiring scalar data to be migrated corresponding to the scalar data migration quantity from the scalar data address to be migrated.
In one possible implementation, the operation domain further includes migration parameters. Wherein, determining the migration parameters required for performing the migration process may include: and determining migration parameters required by migration processing according to the operation domain.
In one possible implementation, the opcode is also used to indicate a migration parameter. Wherein, determining the migration parameters required for performing the migration process may include: and determining migration parameters required by migration processing according to the operation codes.
In one possible implementation, the method further includes: the scalar data to be migrated is stored using the memory module of the device,
wherein the memory module comprises at least one of a register and a cache,
the buffer memory is used for storing data to be operated and comprises at least one neuron buffer memory NRAM;
a register for storing the scalar data to be migrated and scalar data in the data to be operated;
and the neuron cache is used for storing neuron data in the data to be operated, wherein the neuron data comprises neuron vector data.
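The storage module above can be sketched as a scalar register file plus a neuron cache (NRAM) holding vector data. The class and method names are illustrative assumptions, not the actual hardware interface:

```python
class StorageModule:
    """Toy model of the storage module: registers for scalar data,
    NRAM for neuron (vector) data."""

    def __init__(self, num_registers: int, nram_words: int):
        self.registers = [0] * num_registers   # scalar data
        self.nram = [0.0] * nram_words         # neuron vector data

    def store_scalar(self, reg: int, value: int) -> None:
        self.registers[reg] = value

    def load_scalar(self, reg: int) -> int:
        return self.registers[reg]

sm = StorageModule(num_registers=8, nram_words=16)
sm.store_scalar(0, 42)
```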
In one possible implementation, step S51 may include:
storing scalar data migration instructions;
analyzing the scalar data migration instruction to obtain an operation code and an operation domain of the scalar data migration instruction;
and storing an instruction queue, where the instruction queue includes a plurality of to-be-executed instructions sequentially arranged in an execution order, and the plurality of to-be-executed instructions may include the scalar data migration instruction.
In one possible implementation, the method may further include:
when determining that a first to-be-executed instruction in the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and controlling execution of the first to-be-executed instruction after determining that execution of the zeroth to-be-executed instruction is completed.
The association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction includes: the first storage address interval for storing the data required by the first instruction to be executed and the zeroth storage address interval for storing the data required by the zeroth instruction to be executed have overlapping areas.
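The association test above reduces to checking whether two half-open storage address intervals overlap, which can be sketched as follows (the interval representation is an assumption for illustration):

```python
def intervals_overlap(first: tuple[int, int], zeroth: tuple[int, int]) -> bool:
    """Return True when the first instruction's storage address interval
    overlaps the zeroth instruction's interval, i.e. the first instruction
    must wait for the zeroth to finish. Intervals are half-open [start, end)."""
    f_start, f_end = first
    z_start, z_end = zeroth
    return f_start < z_end and z_start < f_end

# Instruction 1 reads [400, 405); instruction 0 writes [403, 410) -> must wait.
must_wait = intervals_overlap((400, 405), (403, 410))
```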
It should be noted that, although the scalar data migration instruction processing method is described above by way of example in the above embodiments, those skilled in the art will appreciate that the present disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or actual application scene, so long as the technical scheme of the disclosure is met.
The scalar data migration instruction processing method provided by the embodiments of the present disclosure has a wide application range, and achieves high processing efficiency and high processing speed both for scalar data migration instructions and for scalar data migration itself.
The present disclosure also provides a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the scalar data migration instruction processing method described above.
It should be noted that, for simplicity of description, the foregoing method embodiments are all depicted as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
It should be further noted that, although the steps in the flowchart of fig. 6 are shown sequentially as indicated by the arrows, the steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 6 may include multiple sub-steps or stages that are not necessarily performed at the same moment, but may be performed at different moments; the order of execution of these sub-steps or stages is also not necessarily sequential, and they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
It should be understood that the above-described device embodiments are merely illustrative and that the device of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is merely a logic function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted or not performed.
In addition, unless specifically stated, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules described above may be implemented either in hardware or in software program modules.
If implemented in hardware, the integrated units/modules may be digital circuits, analog circuits, and the like. Physical implementations of the hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise indicated, the storage module may be any suitable magnetic or magneto-optical storage medium, such as a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high-bandwidth memory (HBM), a hybrid memory cube (HMC), and the like.
If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units/modules may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments. The technical features of the foregoing embodiments may be arbitrarily combined, and for brevity, all of the possible combinations of the technical features of the foregoing embodiments are not described, however, all of the combinations of the technical features should be considered as being within the scope of the disclosure.
The foregoing may be better understood in light of the following clauses:
clause A1, a scalar data migration instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the obtained scalar data migration instruction to obtain an operation code and an operation domain of the scalar data migration instruction, obtaining scalar data to be migrated and a target address required by executing the scalar data migration instruction according to the operation code and the operation domain, and determining migration parameters required by migration processing;
the processing module stores the scalar data to be migrated into the target address according to the migration parameters,
the operation code is used for indicating that the scalar data migration instruction performs processing on scalar data as migration processing, the operation domain comprises a scalar data address to be migrated and the target address, and the migration parameters comprise an initial storage space where the scalar data address to be migrated is located, a target storage space where the target address is located and a migration type for performing migration processing.
Clause A2, the apparatus of clause A1, the processing module comprising a master processing sub-module and a plurality of slave processing sub-modules,
The main processing sub-module is used for processing the scalar data to be migrated to obtain processed scalar data to be migrated, and storing the processed scalar data to be migrated into the target address.
Clause A3, the apparatus of clause A1, the operation field further comprising a scalar data migration volume,
the control module is further configured to determine the scalar data migration amount according to the operation domain, and obtain scalar data to be migrated corresponding to the scalar data migration amount from the scalar data address to be migrated.
Clause A4, the apparatus of clause A1, the operation domain further comprising a migration parameter,
wherein determining migration parameters required for performing migration processing includes:
and determining migration parameters required by migration processing according to the operation domain.
Clause A5, the apparatus of clause A1, the opcode further being for indicating a migration parameter,
wherein determining migration parameters required for performing migration processing includes:
and determining migration parameters required by migration processing according to the operation code.
Clause A6, the apparatus of clause A1, further comprising:
a storage module for storing the scalar data to be migrated,
Wherein the memory module comprises at least one of a register and a cache,
the buffer memory is used for storing data to be operated, and the buffer memory comprises at least one neuron buffer memory NRAM;
the register is used for storing scalar data to be migrated and scalar data in the data to be operated;
the neuron cache is used for storing neuron data in the data to be operated, and the neuron data comprises neuron vector data.
Clause A7, the apparatus of clause A1, the control module comprising:
an instruction storage sub-module for storing the scalar data migration instruction;
the instruction processing sub-module is used for analyzing the scalar data migration instruction to obtain an operation code and an operation domain of the scalar data migration instruction;
the queue storage submodule is used for storing an instruction queue, the instruction queue comprises a plurality of instructions to be executed, the instructions to be executed are sequentially arranged according to an execution sequence, and the instructions to be executed comprise the scalar data migration instructions.
Clause A8, the apparatus of clause A7, the control module further comprising:
a dependency relationship processing sub-module, configured to cache a first to-be-executed instruction in the instruction storage sub-module when determining that the first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction before the first to-be-executed instruction, and, after execution of the zeroth to-be-executed instruction is completed, extract the first to-be-executed instruction from the instruction storage sub-module and send the first to-be-executed instruction to the processing module,
Wherein, the association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction includes:
and the first storage address interval for storing the data required by the first instruction to be executed and the zeroth storage address interval for storing the data required by the zeroth instruction to be executed have overlapping areas.
Clause A9, a machine learning computing device, the device comprising:
one or more scalar data migration instruction processing apparatuses according to any one of clauses A1 to A8, configured to acquire scalar data and control information to be migrated from other processing apparatuses, perform a specified machine learning operation, and transfer the execution result to the other processing apparatuses through an I/O interface;
when the machine learning arithmetic device comprises a plurality of scalar data migration instruction processing devices, the scalar data migration instruction processing devices can be connected through a specific structure and transmit data;
the plurality of scalar data migration instruction processing devices are interconnected and transmit data through a peripheral component interconnect express (PCIE) bus to support larger-scale machine learning operations; the plurality of scalar data migration instruction processing devices share the same control system or have respective control systems; the plurality of scalar data migration instruction processing devices share a memory or have respective memories; and the interconnection mode of the plurality of scalar data migration instruction processing devices is any interconnection topology.
Clause a10, a combination processing device, the combination processing device comprising:
the machine learning computing device, universal interconnect interface, and other processing device of clause A9;
the machine learning operation device interacts with the other processing devices to jointly complete the calculation operation designated by the user,
wherein the combination processing device further comprises: and a storage device connected to the machine learning operation device and the other processing device, respectively, for storing data of the machine learning operation device and the other processing device.
Clause a11, a machine learning chip, the machine learning chip comprising:
the machine learning computing device of clause A9 or the combination processing device of clause a 10.
Clause a12, an electronic device, comprising:
the machine learning chip of clause a 11.
Clause a13, a board card, the board card comprising: a memory device, an interface device, and a control device, and a machine learning chip as set forth in clause a 11;
wherein the machine learning chip is respectively connected with the storage device, the control device and the interface device;
The storage device is used for storing data;
the interface device is used for realizing data transmission between the machine learning chip and external equipment;
the control device is used for monitoring the state of the machine learning chip.
Clause a14, a scalar data migration instruction processing method, the method being applied to a scalar data migration instruction processing apparatus, the apparatus comprising a control module and a processing module, the method comprising:
analyzing the obtained scalar data migration instruction by utilizing a control module to obtain an operation code and an operation domain of the scalar data migration instruction, obtaining scalar data to be migrated and a target address required by executing the scalar data migration instruction according to the operation code and the operation domain, and determining migration parameters required by migration processing;
storing the scalar data to be migrated into the target address by a processing module according to the migration parameter,
the operation code is used for indicating that the scalar data migration instruction performs processing on scalar data as migration processing, the operation domain comprises a scalar data address to be migrated and the target address, and the migration parameters comprise an initial storage space where the scalar data address to be migrated is located, a target storage space where the target address is located and a migration type for performing migration processing.
Clause a15, the method of clause a14, the processing module comprising a master processing sub-module and a plurality of slave processing sub-modules,
wherein storing the scalar data to be migrated in the target address according to the migration parameter includes:
and processing the scalar data to be migrated by using the main processing sub-module to obtain processed scalar data to be migrated, and storing the processed scalar data to be migrated into the target address.
Clause a16, the method of clause a14, the operation field further comprising a scalar data migration volume,
the method for obtaining scalar data to be migrated and a target address required by executing a scalar data migration instruction according to the operation code and the operation domain comprises the following steps:
and determining the scalar data migration quantity according to the operation domain, and acquiring scalar data to be migrated corresponding to the scalar data migration quantity from the scalar data address to be migrated.
Clause a17, the method of clause a14, the operation domain further comprising a migration parameter,
wherein determining migration parameters required for performing migration processing includes:
and determining migration parameters required by migration processing according to the operation domain.
Clause a18, the method of clause a14, the operation code further being used to indicate a migration parameter,
wherein determining migration parameters required for performing migration processing includes:
and determining migration parameters required by migration processing according to the operation code.
Clause a19, the method of clause a14, further comprising:
storing the scalar data to be migrated by using a storage module of the device,
wherein the storage module comprises at least one of a register and a cache,
the cache is used for storing data to be operated on, and the cache comprises at least one neuron cache (NRAM);
the register is used for storing scalar data among the scalar data to be migrated and the data to be operated on; and
the neuron cache is used for storing neuron data in the data to be operated on, the neuron data comprising neuron vector data.
Clause a20, the method of clause a14, wherein analyzing the obtained scalar data migration instruction to obtain the operation code and the operation domain of the scalar data migration instruction includes:
storing the scalar data migration instruction;
analyzing the scalar data migration instruction to obtain the operation code and the operation domain of the scalar data migration instruction; and
storing an instruction queue, wherein the instruction queue comprises a plurality of instructions to be executed that are sequentially arranged in an execution order, and the plurality of instructions to be executed comprise the scalar data migration instruction.
Clause a21, the method of clause a20, the method further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed preceding the first instruction to be executed, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after it is determined that execution of the zeroth instruction to be executed is completed,
wherein the association relationship between the first instruction to be executed and the zeroth instruction to be executed preceding the first instruction to be executed includes:
a first storage address interval storing data required by the first instruction to be executed having an overlapping area with a zeroth storage address interval storing data required by the zeroth instruction to be executed.
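The association relationship of clause a21 reduces to an interval-overlap test between the two storage address intervals. The half-open `[start, end)` convention and the deferral bookkeeping below are assumptions of this sketch:

```python
# Clause a21 sketch: an instruction is associated with (and must wait for) an
# earlier instruction when their storage address intervals overlap.

def intervals_overlap(first, zeroth):
    """True when the first and zeroth storage address intervals share an
    overlapping area. Intervals are half-open [start, end) pairs."""
    f_start, f_end = first
    z_start, z_end = zeroth
    return f_start < z_end and z_start < f_end

def schedule(instrs):
    """For each (name, interval) in queue order, record which earlier
    instructions it must wait for before it may execute."""
    plan = []
    for i, (name, interval) in enumerate(instrs):
        deps = [n for n, iv in instrs[:i] if intervals_overlap(interval, iv)]
        plan.append((name, deps))  # deps must finish before `name` runs
    return plan

queue = [("i0", (0, 8)), ("i1", (4, 12)), ("i2", (20, 24))]
schedule(queue)  # → [("i0", []), ("i1", ["i0"]), ("i2", [])]
```

Note the half-open convention makes adjacent intervals such as `[0, 5)` and `[5, 10)` non-overlapping, so back-to-back buffers do not create a false dependency.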
Clause a22, a non-transitory computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the method of any one of clauses a14 to a21.
The foregoing has described the embodiments of the present application in detail, and specific examples have been applied herein to illustrate the principles and implementations of the present application; the above descriptions of the embodiments are provided only to assist in understanding the methods of the present application and the core ideas thereof. Meanwhile, those skilled in the art may, in accordance with the ideas of the present application, make modifications to the specific implementations and the scope of application; in view of the above, the contents of this specification should not be construed as limiting the present application.