
CN110515739B - Deep learning neural network model load calculation method, device, equipment and medium

Info

Publication number
CN110515739B
CN110515739B
Authority
CN
China
Prior art keywords
computing
resource allocation
calculation
network model
task
Prior art date
Legal status
Active
Application number
CN201911008660.3A
Other languages
Chinese (zh)
Other versions
CN110515739A (en)
Inventor
黎兴民
Current Assignee
Shanghai Suiyuan Intelligent Technology Co Ltd
Original Assignee
Shanghai Suiyuan Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Suiyuan Intelligent Technology Co Ltd filed Critical Shanghai Suiyuan Intelligent Technology Co Ltd
Priority to CN201911008660.3A
Publication of CN110515739A
Application granted
Publication of CN110515739B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources to service a request
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5017 Task decomposition

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the invention discloses a load calculation method, apparatus, device and medium for deep learning neural network models. The method comprises: analyzing a pre-constructed network model and decomposing the calculation flow of the network model into at least two calculation tasks; dividing each calculation task to form at least one calculation subtask; allocating resources to all calculation subtasks associated with each calculation task according to each resource allocation strategy to obtain an allocation data set of the calculation task under each resource allocation strategy; aggregating the allocation data sets of the calculation tasks under each resource allocation strategy to form a load matrix of the network model; and calculating the running time of each calculation subtask according to a performance parameter set of a chip to be evaluated, so as to determine a performance matrix.

Description

Deep learning neural network model load calculation method, device, equipment and medium
Technical Field
The embodiment of the invention relates to the field of data processing, and in particular to a deep learning neural network model load calculation method, apparatus, device and medium.
Background
The rapid development of the artificial intelligence industry has placed higher demands on the computing power of computers, and major semiconductor manufacturers are actively developing and launching special-purpose chips for accelerating deep learning training and inference.
Chip development and manufacturing is a relatively long process. In general, verification of the rationality of the chip architecture design and evaluation of its computing performance can only be performed after small-batch production and sample acquisition, which greatly lengthens the iterative cycle of product development and may even delay the time to market indefinitely, which is unacceptable for semiconductor manufacturers.
The existing solution is to simulate the chip architecture with a dedicated server, whose vendor provides a complete set of matched software and hardware, and to perform performance verification of the chip on that basis. However, this solution is expensive, and the simulation software runs slowly: even simple test samples generally need to run for hours. In addition, for the verification of an accelerator chip architecture supporting parallel computing, different splittings of the computing tasks and different scheduling strategies for the on-chip hardware resources lead to different chip operating loads and hence different performance; trial and exploration of these strategies helps to find structural defects early in the chip design.
Disclosure of Invention
The embodiment of the invention provides a deep learning neural network model load calculation method, apparatus, device and medium, which can improve the speed of simulating the performance of a chip running a deep learning network model.
In a first aspect, an embodiment of the present invention provides a method for calculating the load of a deep learning neural network model, including:
analyzing a pre-constructed network model, and decomposing a calculation process of the network model into at least two calculation tasks; wherein the at least two computing tasks have a dependency relationship;
dividing each computing task according to at least one pre-configured resource allocation strategy to form at least one computing subtask;
allocating resources to all computing subtasks associated with each computing task according to each resource allocation strategy, to obtain an allocation data set of the computing task under each resource allocation strategy; the resources include computing resources and storage resources;
aggregating the allocation data sets of each calculation task under each resource allocation strategy to form a load matrix of the network model;
and calculating the running time of each calculation subtask obtained by decomposing the network model under each resource allocation strategy according to the performance parameter set of the chip to be evaluated and the load matrix, and determining the performance matrix so as to evaluate the performance of the chip for running the network model.
In a second aspect, an embodiment of the present invention provides a deep learning neural network model load calculation apparatus, including:
the computing task analysis module is used for analyzing a pre-constructed network model and decomposing the computing process of the network model into at least two computing tasks; wherein the at least two computing tasks have a dependency relationship;
the computing task dividing module is used for dividing each computing task according to at least one pre-configured resource allocation strategy to form at least one computing subtask;
the resource allocation module is used for allocating resources to all the computing subtasks associated with each computing task according to each resource allocation strategy, to obtain an allocation data set of the computing task under each resource allocation strategy; the resources include computing resources and storage resources;
a load matrix generation module, configured to aggregate the allocation data sets of each computation task under each resource allocation policy to form a load matrix of the network model;
and the performance matrix calculation module is used for calculating the running time of each calculation subtask obtained by decomposing the network model under each resource allocation strategy according to the performance parameter set of the chip to be evaluated and the load matrix, and determining the performance matrix so as to evaluate the performance of the chip for running the network model.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the deep learning neural network model load calculation method according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the deep learning neural network model load calculation method according to any embodiment of the present invention is implemented.
The embodiment of the invention automatically analyzes the deep learning neural network model to form at least two calculation tasks, further divides each calculation task according to the pre-configured resource allocation strategies to form calculation subtasks, and allocates resources to the calculation subtasks under the different resource allocation strategies to obtain a load matrix for each strategy. It then calculates the running time of each calculation subtask under the different resource allocation strategies based on the performance parameter set of the chip to be evaluated, thereby determining a performance matrix for evaluating the performance of the chip running the network model. This solves the problems of high economic cost and low efficiency of simulating a network model on a chip in the prior art, and improves the speed of simulating the performance of a chip running a deep learning network model.
Drawings
FIG. 1 is a flowchart of a deep learning neural network model load calculation method according to the first embodiment of the present invention;
FIG. 2 is a flowchart of a deep learning neural network model load calculation method according to the second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a deep learning neural network model load calculation apparatus according to the third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer device according to the fourth embodiment of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not restrictive of it. It should further be noted that, for ease of description, only the structures associated with the present invention, rather than all structures, are shown in the drawings.
Example One
Fig. 1 is a flowchart of a deep learning neural network model load calculation method according to the first embodiment. This embodiment is applicable to simulating the process of running a network model on a chip. The method can be executed by the deep learning neural network model load calculation apparatus provided in the embodiment of the present invention; the apparatus can be implemented in software and/or hardware and can be integrated into a computer device, such as a terminal device or a server. As shown in Fig. 1, the method of this embodiment specifically includes:
s110, analyzing a pre-constructed network model, and decomposing a calculation process of the network model into at least two calculation tasks; wherein the at least two computing tasks have a dependency relationship.
The network model may also be referred to as a deep learning neural network (Deep Learning Neural Network) model.
The computational flow of the network model is used to represent a plurality of successive computational steps that the network model needs to perform at runtime. Wherein the calculation flow may be converted into a plurality of successive calculation steps.
A computational task is used to represent a certain computational step or steps. Each computational task is different.
A plurality of computing tasks with dependency relationships are combined in order; that is, the computing flow of the network model is a sequence of computing tasks, and the position of each computing task in the sequence is its execution order.
Illustratively, the data processing operations may include padding (Padding), reshaping (Reshape), convolution (Convolution), and pooling (Pooling), among others.
The network model may be built through a predefined programming interface. The user can input data related to the network model through the predefined programming interface to build the neural network model. Illustratively, the network model is established as

$Net = \{L_0, L_1, \dots, L_{S-1}\}$

where $L_i$ represents one layer of the neural network.

It will be appreciated that the user may pass the network model to the programming interface, and the structure of the network model can be obtained via the programming interface; that is, the structure of each layer in the network model is determined and fixed during subsequent processing, so that the data processing of each layer can be used as one computational task of the network model, denoted $T_{step}$. A sequence of S consecutive computing tasks is thus obtained:

$Net = \{T_0, T_1, \dots, T_{S-1}\}, \quad step \in \{0, 1, \dots, S-1\}$

where, from $T_1$ to $T_{S-1}$, the input of computing task $T_{step}$ is the output of $T_{step-1}$; that is, the output of the previous computing task serves as the input of the next one. A dependency relationship therefore exists between adjacent computing tasks, and the S computing tasks are executed sequentially to ensure the correctness of the calculation result.
In fact, the computation tasks are divided from the viewpoint of the running time sequence of the network model; that is, the computation flow of the network model is partitioned along its timeline.
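As an illustration only, the task-sequence decomposition can be pictured in a few lines of Python (this sketch and all names in it are the editor's assumptions, not part of the patented method):

```python
from dataclasses import dataclass

@dataclass
class ComputeTask:
    step: int               # position in the execution order
    op: str                 # data processing operation, e.g. "Convolution"
    depends_on: int | None  # index of the task whose output feeds this one

def decompose(layers: list[str]) -> list[ComputeTask]:
    # Each layer becomes one computing task; task step-1 feeds task step.
    return [ComputeTask(step=i, op=op, depends_on=i - 1 if i > 0 else None)
            for i, op in enumerate(layers)]

tasks = decompose(["Padding", "Reshape", "Convolution", "Pooling"])
```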
And S120, dividing each computing task according to at least one pre-configured resource allocation strategy to form at least one computing subtask.
Generally, a resource allocation policy is used to allocate resources required for executing a computing task, and the resource allocation policy may refer to a resource allocation manner, where the resources may specifically include computing resources and storage resources. The computing resources are used to perform computing tasks. The storage resources are used for storing data associated with executing the computing task. The calculation sub-tasks are used for forming the calculation tasks and are part of calculation in the calculation tasks.
In practice, a computing task may be further divided, e.g. subdivided into a plurality of computing subtasks. Each computing subtask is distinct and independent, and together all subtasks form the complete computing task. One dividing manner is to divide the calculation amount of the computing task equally into n computing subtasks, where n is greater than or equal to 1 and may be set as needed; the embodiment of the present invention does not limit this. Exemplarily, if the computing task is convolution over 10 feature maps, it can be divided into 10 computing subtasks, each performing convolution on 1 feature map; or it can be divided into 5 computing subtasks, each performing convolution on 2 feature maps, with the feature maps handled by the different subtasks mutually distinct.
As in the previous example, the computing task $T_{step}$ is related to its computing subtasks as follows:

$T_{step} = \{t_{step}^0, t_{step}^1, \dots, t_{step}^{Q-1}\}$

where the $t_{step}^q$ are the computing subtasks. All subtasks carry the same subscript step, indicating that they jointly belong to the computing task $T_{step}$; the superscript distinguishes the Q items split from $T_{step}$ and indicates that the computing subtasks have no dependency relationships among themselves and exist in parallel.
The dividing manner of each computing task can be determined according to the resource allocation strategy; that is, the number of computing subtasks into which each computing task is divided (the value of n) is determined by the resource allocation strategy.
Specifically, if the computing resources comprise n computing units, the computing task is divided into n computing subtasks, so that each computing unit executes one computing subtask.
In addition, the space size of the storage resources can be used as a basis for dividing the computing subtasks; or the dividing manner can be determined jointly from the computing resources and the storage resources. The specific configuration may be set as required, and the embodiment of the present invention is not particularly limited.
In summary, a computing subtask is a unit task obtained by dividing a computing task into pieces that are executed in parallel at the same time.
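A minimal sketch of this equal division, assuming the workload can be expressed as a single number (names are illustrative, not from the patent):

```python
# t_step^q represented as (step, q, workload): equal shares of the parent task.
def split_task(step: int, total_work: float, q_count: int) -> list[tuple[int, int, float]]:
    share = total_work / q_count
    return [(step, q, share) for q in range(q_count)]

# e.g. convolution over 10 feature maps split into 5 subtasks of 2 maps each
subtasks = split_task(step=2, total_work=10, q_count=5)
```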
S130, respectively allocating resources to all the computing subtasks associated with the computing task according to each resource allocation strategy to obtain an allocation data set of the computing task under each resource allocation strategy; the resources include computing resources and storage resources.
The allocation data set describes the resource allocation of all the computing subtasks into which a computing task is divided, and records the mapping relationship between each of those subtasks and its resources.
The resource allocation for the computation subtasks is actually the allocation of computation resources and storage resources for the computation subtasks, that is, the allocation of processors and storage space for the computation subtasks.
In general, different resource allocation policies allocate resources differently: the computing resources and/or storage resources are quantized and assigned to the computing subtasks in different quantities.
Specifically, the computing resources may be equally divided into N computing units, i.e. the set of computing resources is

$C = \{c_0, c_1, \dots, c_{N-1}\}$

It should be noted that the computing resources generally refer to the processors on the chip, and one computing unit is an integer number of processors: a computing unit may contain one processor, two processors, or even more. The embodiment of the present invention is not particularly limited in this respect.

At the same time, the storage resources are evenly divided into M equal parts to obtain the storage resource set

$D = \{d_0, d_1, \dots, d_{M-1}\}$

The storage resources are the storage space allocated for running the network model on the chip, and this storage space can be divided equally.
In practice, a resource allocation policy specifies the number of computing units and the number of storage units allocated to each computing subtask; that is, it assigns each subtask $t_{step}^q$ a group of computing units $C_{step}^q \subseteq C$ and a group of storage units $D_{step}^q \subseteq D$ such that

$\bigcup_{q=0}^{Q-1} C_{step}^q = C, \quad \bigcup_{q=0}^{Q-1} D_{step}^q = D, \quad C_{step}^i \cap C_{step}^j = \emptyset \text{ and } D_{step}^i \cap D_{step}^j = \emptyset \ (i \neq j)$

That is, the computing subtasks obtained by dividing the computing task exactly exhaust all the computing resources C and storage resources D, neither more nor less, and the allocations of different subtasks do not overlap.

Under one resource allocation strategy, each computing subtask obtained by dividing the computing task is allocated at least one computing unit and at least one storage unit; correspondingly, under different resource allocation strategies, each computing subtask is allocated different numbers of computing units and storage units. Taking each computing subtask under each resource allocation strategy, together with the numbers of computing units and storage units allocated to it, as one sequence, a set can be obtained by collecting the data of the different resource allocation strategies; this set is the allocation data set of the computing task under each resource allocation strategy.
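The exact-cover, non-overlap constraint can be pictured with a round-robin partition; this is one possible layout, not the patent's prescribed one:

```python
def partition(unit_count: int, q_count: int) -> list[list[int]]:
    # Deal unit indices round-robin: every unit is used exactly once and
    # no two subtasks share a unit (an exact, non-overlapping cover).
    groups: list[list[int]] = [[] for _ in range(q_count)]
    for unit in range(unit_count):
        groups[unit % q_count].append(unit)
    return groups

compute_groups = partition(8, 4)  # C_step^q for each subtask q
storage_groups = partition(4, 4)  # D_step^q for each subtask q
assert sorted(i for g in compute_groups for i in g) == list(range(8))
```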
And S140, aggregating the allocation data sets of the calculation tasks under the resource allocation strategies to form a load matrix of the network model.
The load matrix is used for describing the resource allocation condition of the network model under different resource allocation strategies. The load matrix records the mapping relation between each calculation stage and resource allocation in the operation process of the network model.
In this way, the allocation data set of each calculation task under each resource allocation strategy is obtained, and the load matrix of the network model is formed.
S150, calculating the running time of each calculation subtask obtained by the network model decomposition under each resource allocation strategy according to the performance parameter set of the chip to be evaluated and the load matrix, and determining the performance matrix so as to evaluate the performance of the chip running the network model.
The performance parameter set is used for describing the performance of the chip to be evaluated. The performance parameter set records performance parameters of the chip, and is illustratively a Performance Dictionary (PD). Runtime is used to describe the length of time spent performing a computation subtask. The performance matrix is used for evaluating the operation performance of the chip to be evaluated. And calculating the running time of each calculation subtask according to the performance parameters of the chip and the calculation unit and the storage unit which are distributed by each calculation subtask under different resource distribution strategies to form a performance matrix corresponding to the load matrix. Therefore, the running time of each computing subtask obtained by decomposing the network model under different resource allocation strategies can be determined, namely the running time of the network model under different resource allocation strategies is determined, and therefore the performance of the chip running network model is determined. It will be appreciated that the shorter the run time of the network model, the higher the performance of the chip running the network model.
Optionally, calculating, according to the performance parameter set of the chip to be evaluated and the load matrix, the running time of each computation subtask obtained by decomposing the network model under each resource allocation policy to determine the performance matrix includes: according to the performance parameter set of the chip to be evaluated, calculating the input data transfer time, the input data processing consumption time and the result data transfer time of each computation subtask in the load matrix under each resource allocation strategy; taking the sum of the input data transfer time, the input data processing consumption time and the result data transfer time of a computation subtask under a resource allocation strategy as the running time of that subtask under that strategy; and forming the performance matrix of the network model from the running times of the computing subtasks under the resource allocation strategies.
Typically, a computation subtask is completed through three processes: reading input data from the storage space, processing the input data, and writing the result data back to the storage space. The running time of a computation subtask therefore comprises the input data transfer time, the input data processing consumption time, and the result data transfer time.
The input data transfer time describes the time the computing subtask takes to acquire its input data, i.e. the time to transfer the input data from the allocated storage resources to the allocated computing resources. Specifically, the input data transfer time may be calculated based on the following formula:

$t_{input} = \dfrac{InputDataSize}{InputBandwidth}$

where InputDataSize is the size of the input data of the computing subtask, measured in bytes, and InputBandwidth is the data transmission bandwidth, in Byte/s, when transferring data from the storage resources $D_{step}^q$ to the computing resources $C_{step}^q$; it can be obtained from the performance parameter set.
The input data processing consumption time describes the time the computing subtask spends processing its input data, i.e. the time the computing resources take to process the input data. Specifically, the processing consumption time may be calculated based on the following formula:

$t_{process} = InputDataSize \times CostPerByte$

where CostPerByte is the time consumed to process each byte of data. It may be determined by the type of the computing task $T_{step}$, and may also be obtained from the performance parameter set.
The result data transfer time describes the time the computing subtask takes to output its result data, i.e. the time to transfer the result data from the allocated computing resources to the allocated storage resources. Specifically, the result data transfer time may be calculated based on the following formula:

$t_{output} = \dfrac{OutputDataSize}{OutputBandwidth}$

where OutputDataSize is the size of the result data obtained by the computing subtask, measured in bytes, and OutputBandwidth is the data transmission bandwidth, in Byte/s, when transferring the computed result from the computing resources $C_{step}^q$ to the storage resources $D_{step}^q$; it can be obtained from the performance parameter set.
And taking the sum of the input data transfer time, the processing consumption time of the input data and the result data transfer time of the calculation subtask as the running time of the calculation subtask under the resource allocation strategy.
The running times of the different computing subtasks under the different resource allocation strategies are collected, and the longest subtask running time is taken as the running time of the computing task. It will be appreciated that the computation subtasks into which a computation task is decomposed are executed in parallel, so the running time of the computation task equals the longest running time among its subtasks. The running time of the computing task is determined based on the following formulas:

$RunTime(t_{step}^q) = t_{input} + t_{process} + t_{output}$

$RunTime(T_{step}) = \max_{q \in \{0, 1, \dots, Q-1\}} RunTime(t_{step}^q)$

where $RunTime(T_{step})$ is the running time of the computing task at the step-th position.
By calculating the input data transfer time, the input data processing consumption time and the result data transfer time of each calculation subtask under each resource allocation strategy, the running time of each calculation subtask under each strategy is determined accurately, so the time consumed by running the whole network model can be counted accurately.
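The three-phase runtime model above reduces to a few arithmetic steps; the following sketch uses assumed parameter names for the quantities read from the performance parameter set:

```python
def subtask_runtime(in_bytes: float, out_bytes: float,
                    in_bw: float, out_bw: float,
                    cost_per_byte: float) -> float:
    t_input = in_bytes / in_bw             # input data transfer time
    t_process = in_bytes * cost_per_byte   # input data processing time
    t_output = out_bytes / out_bw          # result data transfer time
    return t_input + t_process + t_output

def task_runtime(subtask_times: list[float]) -> float:
    # Subtasks run in parallel, so the task ends with its slowest subtask.
    return max(subtask_times)
```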
The embodiment of the invention automatically analyzes the deep learning neural network model to form at least two calculation tasks, further divides each calculation task according to the pre-configured resource allocation strategies to form calculation subtasks, and allocates resources to the calculation subtasks under the different resource allocation strategies to obtain a load matrix for each strategy. It then calculates the running time of each calculation subtask under the different resource allocation strategies based on the performance parameter set of the chip to be evaluated, thereby determining a performance matrix for evaluating the performance of the chip running the network model. This solves the problems of high economic cost and low efficiency of simulating a network model on a chip in the prior art, and improves the speed of simulating the performance of a chip running a deep learning network model.
Example Two
Fig. 2 is a flowchart of a deep learning neural network model load calculation method according to the second embodiment of the present invention. This embodiment is refined on the basis of the above embodiment: analyzing the pre-constructed network model and decomposing its calculation flow into at least two calculation tasks is embodied as analyzing the network model to determine its hierarchical structure, where the hierarchical structure includes at least two layers, and taking the data processing operation associated with each layer as one calculation task to form at least two calculation tasks.
Specifically, the method of this embodiment specifically includes:
s210, analyzing a pre-constructed network model, and determining the hierarchical structure of the network model, wherein the hierarchical structure of the network model comprises at least two layers.
The hierarchical structure of the network model comprises at least two layers, namely the calculation process of the network model can be decomposed into at least two calculation tasks.
The network model, the computing task, the dependency relationship, the resource allocation policy, the computing subtask, the allocation data set, the performance parameter set, and the performance matrix in this embodiment may refer to the description of the foregoing embodiments.
S220, taking the data processing operation associated with each layer as one calculation task to form at least two calculation tasks, where the at least two calculation tasks have a dependency relationship.
The data processing operation associated with one layer is one calculation task; that is, the data processing operation executed by one node is one calculation task. The result obtained by the data processing of one node is transmitted as input to the next node, so that the next node continues the data processing operation. Correspondingly, the calculation result of one calculation task serves as the input data of the next calculation task, and the calculation flow of the network model is thereby mapped to a plurality of calculation tasks with dependency relationships.
And S230, dividing each computing task according to at least one pre-configured resource allocation strategy to form at least one computing subtask.
Optionally, dividing each computing task according to at least one pre-configured resource allocation policy to form at least one computing subtask includes: determining the allocation quantity of computing resources according to the at least one pre-configured resource allocation policy, and dividing each computing task according to the allocation quantity to form at least one computing subtask, where the number of computing subtasks into which each computing task is divided is less than or equal to the allocation quantity.
Specifically, the allocation quantity of computing resources refers to the number of computing resources that the chip can invoke. Generally, the greater the allocation quantity, the more computing subtasks a task can be divided into; the allocation quantity thus determines the partitioning of the computing subtasks. The computing resources can be equally divided into computing units according to the allocation quantity, with the number of computing units equal to the allocation quantity, so the allocation quantity equals the number of computing units to be allocated in each resource allocation strategy and also equals the maximum number of computing units allocated in any strategy. It will be appreciated that the allocation quantity of computing resources describes the operational capability of the chip, so it may be derived from the chip's performance parameter set.
Each computing subtask requires at least one computing unit to perform its data processing; therefore, a computing task can be divided into at most the allocation quantity of computing subtasks. Generally, one computing subtask is executed by one computing unit. The allocation quantity may be larger than the number of computing subtasks; in that case, at least one computing unit is idle and executes no computing subtask.
The allocation quantity is determined through the resource allocation strategy, and the number of computing subtasks is determined according to the allocation quantity, so the divided computing subtasks are adapted to the computing resources. This ensures that the computing subtasks are executed correctly and hence that the network model is simulated correctly, as the sketch below illustrates.
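A one-line illustration of this cap (the function name and the rule of leaving surplus units idle are assumptions drawn from the text above):

```python
def subtask_count(requested: int, allocation_quantity: int) -> int:
    # A task may be split into at most `allocation_quantity` subtasks;
    # if fewer are requested, some computing units simply stay idle.
    return min(requested, allocation_quantity)
```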
Optionally, before dividing each computing task according to the at least one pre-configured resource allocation policy to form at least one computing subtask, the method further includes: receiving at least one resource configuration table; analyzing each resource configuration table to obtain the combination relationship between computing resources and storage resources corresponding to that table; and using the combination relationship between the computing resources and the storage resources corresponding to one resource configuration table as one resource allocation policy.
The resource configuration table records a combination relationship between the computing resources and the storage resources. The combination relationship determines the computing resources and storage resources allocated to each computing subtask, and describes the relationship between the computing units and the storage units that are mapped to the same computing subtask.
In fact, the resource configuration table allocates computing resources and storage resources to each computing subtask, and a combination relationship exists between the computing resources and the storage resources allocated to the same computing subtask.
Illustratively, the resource configuration table takes the following form (one row per computing subtask):

$R_k = \{(C_{step}^0, D_{step}^0), (C_{step}^1, D_{step}^1), \dots, (C_{step}^{Q-1}, D_{step}^{Q-1})\}$

Accordingly, the resource allocation policy of the Q computation subtasks of the computing task over the computing resources and the storage resources is

$t_{step}^q \mapsto (C_{step}^q, D_{step}^q), \quad q = 0, 1, \dots, Q-1$

where $C_{step}^q \subseteq C$ and $D_{step}^q \subseteq D$ satisfy the exact-cover, non-overlap conditions given above.
the resource allocation table may be input by a tester to specify the resource allocation policy to be tested. And analyzing the received resource configuration table input by the user to obtain a resource allocation strategy.
All allocation strategies (i.e. all ways of assigning the individual computing subtasks $t_{step}^q$ to resources) are enumerated exhaustively. Assuming K allocation methods are obtained, a load matrix describing the chip load condition at each step of the whole network model can be obtained:

$W = (w_{step,k})_{S \times K}$

where $w_{step,k}$ is the allocation data of the computing task at position step under the k-th resource allocation strategy.
The resource allocation condition of each computing subtask is determined by receiving and analyzing the resource configuration tables, so that flexible resource allocation is realized, more chip load conditions are covered, the test range of the chip is enlarged, and the accuracy of the chip performance test is improved.
Optionally, the allocation quantity of computing resources included in each resource configuration table is equal to the number of processors in the chip performance parameter set; that is, the number of computing units contained in the computing resources equals the number of processors. Matching the allocation quantity to the number of processors adapts the allocation to the chip's actual processors, improving the rationality of resource allocation and the accuracy of the chip performance test.
Optionally, allocating resources to all computing subtasks associated with a computing task according to each resource allocation policy, to obtain the allocation data set of the computing task under each resource allocation policy, includes: traversing the resource configuration tables, starting from a target resource configuration table, until all resource configuration tables have been traversed; during the traversal of one resource configuration table, traversing the computing subtasks of the computing task, starting from a target computing subtask; for the current computing subtask being traversed, selecting a target computing resource from the combination relationship corresponding to the resource configuration table and acquiring the corresponding target storage resource; and establishing a correspondence among the current computing subtask, the target computing resource and at least one target storage resource, until all computing subtasks have been traversed, where the computing resources corresponding to the different computing subtasks of the computing task are different and the storage resources corresponding to the different computing subtasks are different. The allocation data of the computing task under one resource configuration table is generated from the computing resources and storage resources corresponding to its subtasks, and after all resource configuration tables have been traversed, the allocation data set of the computing task under all resource configuration tables is obtained.
Specifically, the target resource configuration table is one of the resource configuration tables. For example, it may be any randomly selected table, or all tables may be numbered and selected in numbering order, e.g. the table numbered 1 is selected first.
The allocation data describes the resource allocation of each computing subtask obtained by decomposing the computing task under the different resource configuration tables; that is, it determines the computing resources and storage resources corresponding to each subtask under each table.
The resource configuration tables are traversed one by one, starting from the target table, to obtain the resource allocation condition specified by each table.
All the computing subtasks of the current computing task are traversed, and the correspondence between each computing subtask and its computing and storage resources is established, thereby allocating computing resources and storage resources to every subtask obtained by decomposing the current computing task and reducing omissions.
The above steps are repeated until all the resource configuration tables have been traversed. The allocation data of each computing subtask is thereby obtained, forming the allocation data set of the computing task under the different resource configuration tables and determining the resource allocation of each subtask under each table.
For example, the allocation data set of the computing task at position step under the K resource configuration tables is

$w_{step} = \{w_{step,0}, w_{step,1}, \dots, w_{step,K-1}\}$

where $w_{step,k}$ records, for each subtask $t_{step}^q$, the computing units and storage units allocated by the k-th table.
by traversing the resource allocation table and respectively establishing the corresponding relation between all the calculation subtasks obtained by the calculation tasks and the calculation resources and the storage resources allocated by the current resource allocation table when the current resource allocation table is traversed, resource allocation of the calculation subtasks based on the current resource allocation table is realized, the calculation subtasks can be allocated to the calculation resources and the storage resources, flexible resource allocation is realized, the accuracy of resource allocation is improved, and the accuracy of performance test of the chip is improved.
S240, respectively allocating resources to all computing subtasks associated with the computing task according to each resource allocation strategy to obtain an allocation data set of the computing task under each resource allocation strategy; the resources include computing resources and storage resources.
And S250, aggregating the allocation data sets of each calculation task under each resource allocation strategy to form a load matrix of the network model.
Following the previous example, the allocation data sets of the calculation tasks under the K resource configuration tables are aggregated to obtain the load matrix $W = (w_{step,k})_{S \times K}$, with one row per computing task and one column per resource allocation strategy.
and S260, calculating the running time of each calculation subtask obtained by decomposing the network model under each resource allocation strategy according to the performance parameter set of the chip to be evaluated and the load matrix, and determining the performance matrix so as to evaluate the performance of the chip for running the network model.
Illustratively, the set of performance parameters is PD.
The running time of the computing task at position step under the k-th resource configuration table is calculated based on the following formulas:

$RunTime(t_{step}^q \mid R_k) = t_{input} + t_{process} + t_{output}$

$p_{step,k} = \max_{q \in \{0, 1, \dots, Q-1\}} RunTime(t_{step}^q \mid R_k)$
Further, the performance matrix is calculated based on the load matrix:

$P = (p_{step,k})_{S \times K}$

Each row of elements, $\{p_{step,0}, p_{step,1}, \dots, p_{step,K-1}\}$, corresponds to calculation tasks with the same step item; that is, it gives the running times of the computing task at position step (step = 0, 1, ..., S-1) of the whole network model under the K different resource allocation strategies.

Each column of elements, $\{p_{0,k}, p_{1,k}, \dots, p_{S-1,k}\}$, represents the running time of each computing task of the whole neural network under the k-th resource allocation strategy $R_k$. The computing tasks have dependency relationships and execute sequentially, so summing a column yields the total running time of the network model under the k-th allocation strategy.
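Putting the pieces together, the performance matrix and the per-policy total can be sketched as follows (field and key names are assumptions, and the three-phase runtime is inlined from the formulas above):

```python
def performance_matrix(load: list[list[list[dict]]],
                       pd: dict) -> list[list[float]]:
    # load[step][k]: per-subtask allocation records of task `step` under
    # policy k; P[step][k] is the slowest subtask's three-phase runtime.
    P = []
    for task_row in load:
        row = []
        for alloc in task_row:
            times = [a["in_bytes"] / pd["in_bw"]
                     + a["in_bytes"] * pd["cost_per_byte"]
                     + a["out_bytes"] / pd["out_bw"] for a in alloc]
            row.append(max(times))
        P.append(row)
    return P

def total_runtime(P: list[list[float]], k: int) -> float:
    # Tasks execute sequentially, so the model's total time is column k's sum.
    return sum(row[k] for row in P)
```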
Optionally, calculating, according to the performance parameter set of the chip to be evaluated and the load matrix, the running time of each computation subtask obtained by decomposing the network model under each resource allocation policy to determine the performance matrix includes: according to the performance parameter set of the chip to be evaluated, calculating the input data transfer time, the input data processing consumption time and the result data transfer time of each computation subtask in the load matrix; taking the sum of the input data transfer time, the input data processing consumption time and the result data transfer time of a computation subtask as the running time of that subtask under the resource allocation strategy; and forming the performance matrix of the network model from the running times of the computing subtasks under the resource allocation strategies.
The embodiment of the invention analyzes the network structure of the network model, determines the layers in the hierarchical structure, and takes the data processing operation associated with each layer as one calculation task to form at least two calculation tasks. The calculation flow of the network model is thereby decomposed into calculation tasks automatically, without depending on specific hardware equipment to support the simulation of the network model, which reduces the cost of simulating the network model; meanwhile, since the calculation tasks are obtained by dividing the network structure, the accuracy of the decomposition of the calculation flow is improved.
Example Three
Fig. 3 is a schematic diagram of a deep learning neural network model load calculation apparatus according to the third embodiment of the present invention. This embodiment provides the apparatus corresponding to the deep learning neural network model load calculation method provided in the above embodiments; the apparatus can be implemented in software and/or hardware and can be integrated into a computer device, etc.
Accordingly, the apparatus of the present embodiment may include:
the calculation task analysis module 310 is configured to analyze a pre-constructed network model, and decompose a calculation flow of the network model into at least two calculation tasks; wherein the at least two computing tasks have a dependency relationship;
the computing task dividing module 320 is configured to divide each computing task according to at least one pre-configured resource allocation policy to form at least one computing subtask;
a resource allocation module 330, configured to allocate resources to all computation subtasks associated with the computation task according to each resource allocation policy, respectively, to obtain an allocation data set of the computation task under each resource allocation policy; the resources include computing resources and storage resources;
a load matrix generation module 340, configured to aggregate the allocation data sets of each computing task under each resource allocation policy and form a load matrix of the network model;
and a performance matrix calculation module 350, configured to calculate, according to the performance parameter set of the chip to be evaluated and the load matrix, an operation time of each computation subtask obtained by decomposing the network model under each resource allocation policy, and determine a performance matrix, so as to evaluate performance of the chip in operating the network model.
The embodiment of the invention automatically analyzes the deep learning neural network model to form at least two calculation tasks, further divides each calculation task according to the pre-configured resource allocation strategies to form calculation subtasks, and allocates resources to the calculation subtasks under the different resource allocation strategies to obtain a load matrix for each strategy. It then calculates the running time of each calculation subtask under the different resource allocation strategies based on the performance parameter set of the chip to be evaluated, thereby determining a performance matrix for evaluating the performance of the chip running the network model. This solves the problems of high economic cost and low efficiency of simulating a network model on a chip in the prior art, and improves the speed of simulating the performance of a chip running a deep learning network model.
Further, the calculation task analysis module 310 includes a network model hierarchy analysis unit configured to analyze the network model to determine the hierarchical structure of the network model, where the hierarchical structure includes at least two layers, and to form at least two calculation tasks by taking the data processing operation associated with each layer as one calculation task.
Further, the calculation task dividing module 320 includes a resource allocation policy dividing unit configured to determine the allocation quantity of computing resources according to at least one pre-configured resource allocation policy, and to divide each calculation task according to the allocation quantity to form at least one calculation subtask, where the number of calculation subtasks into which each calculation task is divided is less than or equal to the allocation quantity.
Further, the deep learning neural network model load calculation apparatus includes a resource configuration table receiving module configured, before each computing task is divided according to the at least one pre-configured resource allocation policy to form at least one computing subtask, to receive and analyze at least one resource configuration table to obtain the combination relationship between computing resources and storage resources corresponding to each resource configuration table, and to use the combination relationship corresponding to one resource configuration table as one resource allocation policy.
Further, the resource allocation module 330 includes a resource configuration table traversal parsing unit configured to: traverse the resource configuration tables, starting from a target resource configuration table, until all resource configuration tables have been traversed; during the traversal of one resource configuration table, traverse the computing subtasks of the computing task, starting from a target computing subtask; for the current computing subtask being traversed, select a target computing resource from the combination relationship corresponding to the resource configuration table and acquire the corresponding target storage resource; establish a correspondence among the current computing subtask, the target computing resource and at least one target storage resource, until all computing subtasks have been traversed, where the computing resources corresponding to the different computing subtasks of the computing task are different and the storage resources corresponding to the different computing subtasks are different; generate the allocation data of the computing task under the resource configuration table from the computing resources and storage resources corresponding to the computing subtasks; and after all resource configuration tables have been traversed, obtain the allocation data set of the computing task under the resource configuration tables.
Further, the performance matrix calculation module 350 includes a running time calculation unit configured to: calculate, according to the performance parameter set of the chip to be evaluated, the input data transfer time, the input data processing consumption time and the result data transfer time of each calculation subtask in the load matrix under each resource allocation policy; take the sum of those three times under a resource allocation policy as the running time of the calculation subtask under that policy; and form the performance matrix of the network model from the running times of the calculation subtasks under the resource allocation policies.
Further, the allocation quantity of computing resources included in each resource configuration table is equal to the number of processors in the chip performance parameter set.
The deep learning neural network model load calculation device can execute the deep learning neural network model load calculation method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executed deep learning neural network model load calculation method.
Example Four
Fig. 4 is a schematic diagram of a computer device according to the fourth embodiment of the present invention. Fig. 4 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present invention. The computer device 12 shown in Fig. 4 is only an example and should not impose any limitation on the function or scope of the embodiments of the present invention.
As shown in Fig. 4, computer device 12 is embodied as a general-purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples the various system components, including the system memory 28, to the processing unit 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures, including, but not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus, to name a few.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
System memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. A storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly referred to as a "hard drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media) may be provided; in these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having at least one program module configured to carry out the functions of the embodiments of the present invention.
A program/utility 40, having a set (at least one) of program modules 42, may be stored, for example, in system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables computer device 12 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 22. Furthermore, computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN) and/or a wide area network (WAN)) via a network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, and the like.
The processing unit 16 executes the programs stored in the system memory 28 to perform various functional applications and data processing, such as implementing the deep learning neural network model load calculation method provided by any embodiment of the present invention.
Example five
The fifth embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the deep learning neural network model load calculation method provided by the foregoing embodiments of the present application, comprising: analyzing a pre-constructed network model, and decomposing the calculation process of the network model into at least two calculation tasks, wherein the at least two calculation tasks have dependency relationships; dividing each calculation task according to at least one pre-configured resource allocation policy to form at least one calculation subtask; allocating resources to all calculation subtasks associated with the calculation task according to each resource allocation policy to obtain an allocation data set of the calculation task under each resource allocation policy, wherein the resources include computing resources and storage resources; counting the allocation data sets of each calculation task under each resource allocation policy to form a load matrix of the network model; and calculating, according to the performance parameter set of the chip to be evaluated and the load matrix, the running time of each calculation subtask obtained by decomposing the network model under each resource allocation policy, and determining the performance matrix of the network model so as to evaluate the performance of the chip running the network model.
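For readability only, the flow recapped above can be condensed into the following sketch; all helper names and data layouts here (decompose, divide, the tuple-based model, the bw and ops_per_s chip fields) are hypothetical stand-ins for the steps of the method, not the actual implementation.

def decompose(model):
    # One computing task per layer; list order encodes the dependency chain.
    return [{"name": n, "ops": ops, "bytes": b} for n, ops, b in model]

def divide(task, policy):
    # Divide a task into at most policy["compute_units"] computing subtasks.
    n = policy["compute_units"]
    return [{"ops": task["ops"] / n,
             "input_bytes": task["bytes"] / n,
             "output_bytes": task["bytes"] / n}
            for _ in range(n)]

def evaluate(model, policies, chip):
    # Build the load matrix (allocation data per policy and task), then the
    # performance matrix: per-subtask running times under every policy.
    load = {(p_id, task["name"]): divide(task, policy)
            for p_id, policy in enumerate(policies)
            for task in decompose(model)}
    return {key: [(s["input_bytes"] + s["output_bytes"]) / chip["bw"]
                  + s["ops"] / chip["ops_per_s"]
                  for s in subtasks]
            for key, subtasks in load.items()}

model = [("conv1", 2e9, 4e6), ("fc1", 5e8, 1e6)]
policies = [{"compute_units": 2}, {"compute_units": 4}]
chip = {"bw": 100e9, "ops_per_s": 10e12}
print(evaluate(model, policies, chip))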
More specific examples (a non-exhaustive list) of the computer readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a RAM, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A deep learning neural network model load calculation method, characterized by comprising:
analyzing a pre-constructed network model, and decomposing a calculation process of the network model into at least two calculation tasks; wherein the at least two calculation tasks have a dependency relationship, and the network model is a deep learning neural network model;
dividing each computing task according to at least one pre-configured resource allocation policy to form at least one computing subtask;
allocating resources to all computing subtasks associated with the computing task according to each resource allocation policy to obtain an allocation data set of the computing task under each resource allocation policy; wherein the resources include computing resources and storage resources;
counting the allocation data sets of each calculation task under each resource allocation policy to form a load matrix of the network model;
and calculating, according to the performance parameter set of the chip to be evaluated and the load matrix, the running time of each calculation subtask obtained by decomposing the network model under each resource allocation policy, and determining the performance matrix, so as to evaluate the performance of the chip in running the network model.
2. The method of claim 1, wherein parsing the pre-constructed network model and decomposing the computation flow of the network model into at least two computation tasks comprises:
analyzing the network model to determine a hierarchical structure of the network model, wherein the hierarchical structure of the network model comprises at least two layers;
and taking the data processing operation associated with each layer as one calculation task, so as to form at least two calculation tasks.
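As an illustrative sketch only — the list-of-pairs representation of the hierarchical structure is an assumption of this example, not claimed structure — the step of claim 2 might look like:

def layers_to_tasks(hierarchy):
    # hierarchy: ordered (layer_name, operation) pairs, at least two layers.
    # Adjacent layers induce the dependency relationship between tasks.
    tasks = []
    for i, (name, op) in enumerate(hierarchy):
        tasks.append({"name": name, "op": op,
                      "depends_on": hierarchy[i - 1][0] if i > 0 else None})
    return tasks

print(layers_to_tasks([("conv1", "conv2d"), ("relu1", "relu")]))
# [{'name': 'conv1', 'op': 'conv2d', 'depends_on': None},
#  {'name': 'relu1', 'op': 'relu', 'depends_on': 'conv1'}]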
3. The method of claim 2, wherein the dividing of each of the computing tasks according to the at least one preconfigured resource allocation policy to form at least one computing subtask comprises:
determining the allocated amount of computing resources according to the at least one pre-configured resource allocation policy;
and dividing the calculation tasks according to the allocated amount to form at least one calculation subtask, wherein the number of calculation subtasks obtained by dividing each calculation task is less than or equal to the allocated amount.
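A minimal sketch of this division, assuming a task is measured as an integer number of work items (a representation invented for this example, not the claimed one):

def divide_by_allocation(work_items, allocated):
    # The number of subtasks never exceeds the allocated amount of
    # computing resources, as the claim requires.
    n = min(work_items, allocated)
    base, rem = divmod(work_items, n)
    return [base + (1 if i < rem else 0) for i in range(n)]

print(divide_by_allocation(10, 4))  # [3, 3, 2, 2]: four subtasks for 10 items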
4. The method of claim 3, further comprising, before dividing each of the computing tasks according to the at least one preconfigured resource allocation policy to form at least one computing subtask:
receiving at least one resource allocation table, and analyzing the at least one resource allocation table to obtain the combination relationship between the computing resources and the storage resources corresponding to each resource allocation table;
and using the combination relationship between the computing resources and the storage resources corresponding to each resource allocation table as one resource allocation policy.
5. The method according to claim 4, wherein the allocating resources to all the computing subtasks associated with the computing task according to each of the resource allocation policies to obtain an allocation data set of the computing task under each of the resource allocation policies comprises:
traversing the resource allocation tables, starting from a target resource allocation table among the resource allocation tables, until all the resource allocation tables are traversed;
during the traversal of a resource allocation table, traversing the computing subtasks of the computing task, starting from a target computing subtask among the computing subtasks; for the traversed current computing subtask, selecting a target computing resource from the combination relationship corresponding to the resource allocation table, acquiring the corresponding target storage resource, and establishing a correspondence among the current computing subtask, the target computing resource, and at least one target storage resource, until all the computing subtasks are traversed;
wherein the computing resources corresponding to the respective computing subtasks of the computing task are different from one another, and the storage resources corresponding to the respective computing subtasks of the computing task are different from one another;
generating an allocation data set of the computing task under the resource allocation table according to the computing resource corresponding to each computing subtask and the corresponding storage resources;
and acquiring an allocation data set of the computing task under each resource allocation table after all the resource allocation tables are traversed.
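By way of illustration, the double traversal of claim 5 can be sketched as follows, assuming each resource allocation table lists its computing and storage resources explicitly (a layout invented for this example):

def allocate(tables, subtasks):
    # For every resource allocation table, pair each computing subtask with
    # a distinct computing resource and a distinct storage resource, and
    # collect one allocation data set per table.
    datasets = []
    for table in tables:  # outer traversal over all resource allocation tables
        mapping = {}
        for sub, cu, mem in zip(subtasks, table["compute"], table["storage"]):
            mapping[sub] = {"compute": cu, "storage": [mem]}
        datasets.append(mapping)
    return datasets

tables = [{"compute": ["cu0", "cu1"], "storage": ["m0", "m1"]}]
print(allocate(tables, ["sub0", "sub1"]))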
6. The method according to claim 1, wherein the step of determining the performance matrix by calculating the running time of each computation subtask obtained by the network model decomposition under each resource allocation policy according to the performance parameter set of the chip to be evaluated and the load matrix comprises:
according to the performance parameter set of the chip to be evaluated, calculating the input data transfer time, input data processing consumption time, and result data transfer time of each calculation subtask in the load matrix under each resource allocation policy;
taking the sum of the input data transfer time, the input data processing consumption time, and the result data transfer time of a calculation subtask under a resource allocation policy as the running time of the calculation subtask under that policy;
and forming a performance matrix of the network model according to the running time of each calculation subtask under each resource allocation policy.
7. The method of claim 4, wherein the allocated amount of computing resources included in each of the resource allocation tables is equal to the number of processors in the set of chip performance parameters.
8. A deep learning neural network model load calculation device, comprising:
the computing task analysis module is used for analyzing a pre-constructed network model and decomposing the computing process of the network model into at least two computing tasks; wherein the at least two calculation tasks have a dependency relationship, and the network model is a deep learning neural network model;
the computing task dividing module is used for dividing each computing task according to at least one pre-configured resource allocation policy to form at least one computing subtask;
the resource allocation module is used for allocating resources to all the computing subtasks associated with the computing task according to each of the resource allocation policies to obtain an allocation data set of the computing task under each of the resource allocation policies; wherein the resources include computing resources and storage resources;
a load matrix generation module, configured to count an allocation data set of each computation task under each resource allocation policy to form a load matrix of the network model;
and the performance matrix calculation module is used for calculating, according to the performance parameter set of the chip to be evaluated and the load matrix, the running time of each calculation subtask obtained by decomposing the network model under each resource allocation policy, and determining the performance matrix, so as to evaluate the performance of the chip in running the network model.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the deep learning neural network model load calculation method according to any one of claims 1-7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the deep learning neural network model load calculation method as claimed in any one of claims 1-7.
CN201911008660.3A 2019-10-23 2019-10-23 Deep learning neural network model load calculation method, device, equipment and medium Active CN110515739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911008660.3A CN110515739B (en) 2019-10-23 2019-10-23 Deep learning neural network model load calculation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911008660.3A CN110515739B (en) 2019-10-23 2019-10-23 Deep learning neural network model load calculation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN110515739A CN110515739A (en) 2019-11-29
CN110515739B true CN110515739B (en) 2020-01-31

Family

ID=68633608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911008660.3A Active CN110515739B (en) 2019-10-23 2019-10-23 Deep learning neural network model load calculation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110515739B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111158901B (en) * 2019-12-09 2023-09-08 爱芯元智半导体(宁波)有限公司 Optimization method, optimization device, computer equipment and storage medium for calculation graph
CN111047017B (en) * 2019-12-18 2023-06-23 北京安兔兔科技有限公司 Neural network algorithm evaluation method and device and electronic equipment
CN111162946B (en) * 2019-12-30 2022-07-12 北京奇艺世纪科技有限公司 Method for constructing model inference network, data processing method, data processing device and storage medium
CN111340237B (en) * 2020-03-05 2024-04-26 腾讯科技(深圳)有限公司 Data processing and model running method, device and computer equipment
CN111860758B (en) * 2020-04-07 2024-05-03 北京嘀嘀无限科技发展有限公司 Deep learning model operation method and device, electronic equipment and medium
CN111738434B (en) * 2020-06-03 2023-04-07 中国科学院计算技术研究所 Method for executing deep neural network on heterogeneous processing unit
CN111753973B (en) * 2020-06-22 2024-11-26 深圳鲲云信息科技有限公司 A neural network chip optimization method, system, device and storage medium
CN111858070B (en) * 2020-08-05 2023-12-01 中国工商银行股份有限公司 Computing resource allocation method, device, equipment and storage medium
CN112036559B (en) * 2020-08-26 2024-12-27 北京灵汐科技有限公司 Neural network structure division method, device, computer equipment and storage medium
WO2022042519A1 (en) * 2020-08-27 2022-03-03 北京灵汐科技有限公司 Resource allocation method and apparatus, and computer device and computer-readable storage medium
CN111984423B (en) * 2020-09-02 2024-09-03 北京小米松果电子有限公司 Method, device and medium for running deep learning model
WO2022116142A1 (en) * 2020-12-04 2022-06-09 深圳大学 Resource scheduling method based on graph neural network
CN112598112B (en) * 2020-12-04 2021-09-10 深圳大学 Resource scheduling method based on graph neural network
US20220188620A1 (en) * 2020-12-10 2022-06-16 International Business Machines Corporation Time estimator for deep learning architecture
CN113268404B (en) * 2021-05-28 2024-08-06 曙光信息产业(北京)有限公司 Performance analysis and optimization method and device, computer equipment and storage medium
CN113741932A (en) * 2021-08-19 2021-12-03 浙江大华技术股份有限公司 Intelligent identification algorithm upgrading method and device for equipment and electronic device
CN113884857B (en) * 2021-09-29 2024-03-08 上海阵量智能科技有限公司 Chip, chip pressure testing method and device, electronic equipment and storage medium
CN114020450A (en) * 2021-10-08 2022-02-08 深圳云天励飞技术股份有限公司 Neural network model execution method, device, system and electronic equipment
KR20230142336A (en) * 2022-04-01 2023-10-11 리벨리온 주식회사 Method for measuring performance of neural processing device and Device for measuring performance
CN114721802A (en) * 2022-04-12 2022-07-08 北京灵汐科技有限公司 Resource scheduling method, scheduling device and processing core
CN117521841A (en) * 2022-07-28 2024-02-06 华为技术有限公司 Deep learning system and method
CN116501594B (en) * 2023-06-27 2023-09-08 上海燧原科技有限公司 System modeling evaluation method and device, electronic equipment and storage medium
CN116501505B (en) * 2023-06-27 2023-09-12 上海燧原科技有限公司 Method, device, equipment and medium for generating data stream of load task
CN116737605B (en) * 2023-08-11 2023-11-14 上海燧原科技有限公司 Data prefetching method, device, equipment and medium based on chip multilevel storage
CN118798275B (en) * 2024-09-10 2025-02-25 中昊芯英(杭州)科技有限公司 Model calculation method and related device
CN119248499A (en) * 2024-09-29 2025-01-03 上海稀宇极智科技有限公司 Task processing load analysis method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110333945A (en) * 2019-05-09 2019-10-15 成都信息工程大学 A dynamic load balancing method, system and terminal

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8515885B2 (en) * 2010-10-29 2013-08-20 International Business Machines Corporation Neuromorphic and synaptronic spiking neural network with synaptic weights learned using simulation
CN106649060A (en) * 2015-11-02 2017-05-10 中国移动通信集团公司 Equipment performance testing method and device
US10019668B1 (en) * 2017-05-19 2018-07-10 Google Llc Scheduling neural network processing
CN108197083B (en) * 2018-01-31 2021-04-13 湖南农业大学 A Data Center Workload Prediction Method Based on Wavelet Neural Network Fusion Linear Regression
CN109901878B (en) * 2019-02-25 2021-07-23 北京灵汐科技有限公司 Brain-like computing chip and computing equipment


Also Published As

Publication number Publication date
CN110515739A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110515739B (en) Deep learning neural network model load calculation method, device, equipment and medium
US7660884B2 (en) Apparatus, system, and method for generating a resource utilization description for a parallel data processing system
CN108205469B (en) MapReduce-based resource allocation method and server
US20130268941A1 (en) Determining an allocation of resources to assign to jobs of a program
EP2738675B1 (en) System and method for efficient resource management of a signal flow programmed digital signal processor code
CN111753983A (en) Method, system, device and storage medium for customizing neural network model
CN113296905A (en) Scheduling method, scheduling device, electronic equipment, storage medium and software product
CN112068957A (en) Resource allocation method, device, computer equipment and storage medium
US11809849B1 (en) Global modulo allocation in neural network compilation
CN116467061B (en) A method, device, storage medium and electronic equipment for task execution
CN110413539B (en) Data processing method and device
CN114816711A (en) Batch task processing method and device, computer equipment and storage medium
CN116701001B (en) Target task allocation method, device, electronic equipment and storage medium
CN115705496A (en) Quantum computer operating system and quantum computer
CN118963941A (en) Task allocation method and device
CN116560968A (en) Simulation calculation time prediction method, system and equipment based on machine learning
CN113704687B (en) Tensor calculation operation method, device and operation system
CN118210615A (en) Resource allocation method and device
CN115759260A (en) Inference method and device of deep learning model, electronic equipment and storage medium
CN111984418B (en) Automatic adjusting and optimizing method and device for granularity parameters of sparse matrix vector multiplication parallel tasks
CN115827225A (en) Distribution method of heterogeneous operation, model training method, device, chip, equipment and medium
Meng et al. PEARL: Enabling portable, productive, and high-performance deep reinforcement learning using heterogeneous platforms
Filelis-Papadopoulos et al. Characterization of hardware in self-managing self-organizing cloud environment
US20240062045A1 (en) Method and system for latency optimized heterogeneous deployment of convolutional neural network
CN117573523B (en) A parallel fuzz testing method based on complementarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant