CN111338695B - Data processing method based on pipeline technology and related product - Google Patents
Data processing method based on pipeline technology and related product
- Publication number
- CN111338695B (publication) · CN201811555572.0A (application)
- Authority
- CN
- China
- Prior art keywords
- data
- processor
- intermediate result
- input
- inference
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
- G06F9/3869—Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The application relates to a data processing method based on pipeline technology and a related product. Different processors execute different operation tasks, including a first data processing operation, a second data processing operation and a third data processing operation, so that the first processor and the second processor can simultaneously work on different data processing steps of a plurality of input data. The computation load of the first processor is thereby reduced and the data processing efficiency is improved.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method based on pipeline technology and a related product.
Background
With the rapid development of data processing technology, more and more data processing modes are available, and different technical fields use different modes. For example, in the field of neural networks, where the amount of data to be processed is very large, a processing mode has emerged in which neural network data is processed by means of a neural network offline model. First, a model data set and model structure parameters of an original network are obtained, where the model data set includes the network weight corresponding to each computing node in the original network, and the model structure parameters include the connection relations of the plurality of computing nodes in the original network and the computing attributes of each computing node; then, the original network is run according to its model data set and model structure parameters to obtain the instructions corresponding to each computing node in the original network; and finally, a neural network offline model corresponding to the original network is generated according to the network weight and the instructions corresponding to each computing node of the original network.
In the conventional technology, for technical fields in which the amount of data to be processed is very large, the workflow of data processing is mostly executed on a CPU (Central Processing Unit). However, as the amount of data to be processed increases, the conventional technology suffers from low efficiency.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing method and related product based on pipeline technology, which can improve data processing efficiency.
A method of data processing based on pipelining, the method comprising:
the first processor performs first data processing operation on current input data to obtain a current first intermediate result, then acquires next input data, and performs first data processing operation on the next input data until the first data processing operation of the last input data is completed;
the second processor obtains the current first intermediate result, performs second data processing operation on the current first intermediate result to obtain a current second intermediate result, then obtains a next first intermediate result corresponding to the next input data, and performs second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed;
the first processor obtains the current second intermediate result, performs third data processing operation on the current second intermediate result to obtain a current output result, then obtains a next second intermediate result corresponding to the next first intermediate result, and performs third data processing operation on the next second intermediate result until the third data processing operation of the last second intermediate result is completed;
and the starting time of the first data processing operation performed on the next input data by the first processor cannot be later than the ending time of the third data processing operation performed on the current second intermediate result by the first processor.
In one embodiment, the first data processing operation comprises data preprocessing, and/or the second data processing operation comprises data inference, and/or the third data processing operation comprises data post-processing;
if the second data processing operation includes data inference, the second processor obtains the current first intermediate result, performs the second data processing operation on the current first intermediate result to obtain a current second intermediate result, then obtains a next first intermediate result corresponding to the next input data, and performs the second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed, including:
and the second processor acquires the current first intermediate result, performs data inference on the current first intermediate result by using the neural network offline model to obtain current inference data, acquires a next first intermediate result corresponding to the next input data, and performs data inference on the next first intermediate result by using the neural network offline model until the data inference on the last first intermediate result is completed.
In one embodiment, the first intermediate result comprises preprocessed data, the method further comprising:
the first processor searches for an input storage space whose state is idle and writes the preprocessed data into the idle input storage space; wherein the number of the input storage spaces is multiple; and after the preprocessed data is written into the idle input storage space, the state of the idle input storage space is updated to an occupied state.
In one embodiment, the method further comprises:
the second processor searches the input storage space occupied by the state and reads the preprocessed data from the occupied input storage space; and after the second processor reads the preprocessed data, the state of the occupied input storage space is updated to be an idle state.
In one embodiment, the method further comprises:
the second processor searches for an output storage space whose state is idle and writes the inference data into the idle output storage space; wherein the number of the output storage spaces is multiple; and after the inference data is written into the idle output storage space, the state of the idle output storage space is updated to an occupied state.
In one embodiment, the method further comprises:
the first processor searches the output storage space occupied by the state and reads the inference data from the occupied output storage space; and after the first processor reads the inference data, updating the state of the occupied output storage space to be an idle state.
In one embodiment, the number of the input storage spaces is two, and the two input storage spaces form a ping-pong structure; and/or
The number of the output storage spaces is two, and the two output storage spaces form a ping-pong structure.
In one embodiment, the method further comprises:
the first processor acquires attribute information of the preprocessed data and attribute information of the input storage spaces; and determining an input storage space matched with the preprocessed data according to the attribute information of the preprocessed data and the attribute information of the plurality of input storage spaces, and writing the preprocessed data into the matched input storage space.
In one embodiment, the method further comprises:
the second processor acquires attribute information of the inference data and attribute information of the plurality of output storage spaces; and determining an output storage space matched with the inference data according to the attribute information of the inference data and the attribute information of the output storage spaces, and writing the inference data into the matched output storage space.
In one embodiment, the attribute information of the data includes at least one of a data size, a data type, and a data format, and correspondingly, the attribute information of the storage space includes at least one of a storage space size, a data type that the storage space can store, and a data format that the storage space can store.
In one embodiment, the first processor is a general purpose processor and the second processor is an artificial intelligence processor.
A data processing apparatus based on pipelining, the apparatus comprising:
the input data processing module is used for the first processor to perform first data processing operation on current input data to obtain a current first intermediate result, then obtain next input data and perform first data processing operation on the next input data until the first data processing operation of the last input data is completed;
a first intermediate result processing module, configured to obtain the current first intermediate result by the second processor, perform a second data processing operation on the current first intermediate result to obtain a current second intermediate result, then obtain a next first intermediate result corresponding to the next input data, and perform a second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed;
an output result determining module, configured to obtain the current second intermediate result by the first processor, perform a third data processing operation on the current second intermediate result to obtain a current output result, then obtain a next second intermediate result corresponding to the next first intermediate result, and perform a third data processing operation on the next second intermediate result until the third data processing operation of the last second intermediate result is completed;
and the starting time of the first processor for carrying out the first data processing operation on the next input data cannot be later than the ending time of the first processor for carrying out the third data processing operation on the current second intermediate result.
A data processing method based on pipeline technology, the method is applied to a heterogeneous computing architecture, the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the general processor receives input data and preprocesses the input data to obtain preprocessed data;
and the general processor receives inference data obtained by the artificial intelligence processor by using a neural network offline model to perform data inference on the preprocessed data, performs post-processing on the inference data to obtain post-processed data, and executes the step of receiving input data again before the general processor finishes post-processing the inference data to obtain the post-processed data.
A data processing method based on pipeline technology, the method is applied to a heterogeneous computing architecture, the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the artificial intelligence processor receives the preprocessed data and infers the preprocessed data through a neural network offline model to obtain inference data; the preprocessing data is obtained by receiving input data and preprocessing the input data by the general processor;
the artificial intelligence processor inputs the reasoning data into the general processor so that the general processor carries out post-processing on the reasoning data to obtain post-processing data;
and the starting time of the general-purpose processor for preprocessing the next input data cannot be later than the finishing time of the general-purpose processor for post-processing the inference data.
A data processing method based on pipeline technology, the method is applied to a heterogeneous computing architecture, the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the general processor receives input data and preprocesses the input data to obtain preprocessed data;
the general processor receives inference data obtained by the artificial intelligence processor performing data inference on the preprocessed data by using a neural network offline model; and while the artificial intelligence processor performs data inference on the preprocessed data by using the neural network offline model, the general processor executes in parallel the steps of receiving input data again and preprocessing the received input data again, and performs post-processing on the inference data to obtain post-processed data.
A data processing method based on pipelining, applied to a heterogeneous computing architecture including a general-purpose processor and an artificial intelligence processor, the method comprising:
the artificial intelligence processor receives the preprocessed data and performs inference on the preprocessed data through a neural network offline model to obtain inference data; the preprocessed data is obtained by the general processor receiving input data and preprocessing the input data;
the artificial intelligence processor inputs the reasoning data into the general processor so that the general processor carries out post-processing on the reasoning data to obtain post-processing data;
when the artificial intelligence processor infers the preprocessed data, the general processor executes the steps of receiving the input data again and preprocessing the received input data again and postprocessing the inferred data input by the artificial intelligence processor in parallel.
A board card applied to a heterogeneous computing architecture, the board card comprising: an artificial intelligence processor for performing the above method.
A motherboard for use in a heterogeneous computing architecture, the motherboard comprising: a general processor and the board card.
An electronic device is applied to a heterogeneous computing architecture, and the electronic device comprises the mainboard.
According to the data processing method based on the pipeline technology and the related product, first, the first processor performs the first data processing operation on current input data to obtain a current first intermediate result, and then performs the first data processing operation on next input data until the first data processing operation of the last input data is completed; then the second processor obtains the current first intermediate result and performs the second data processing operation on it to obtain a current second intermediate result, and then performs the second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed; finally, the first processor obtains the current second intermediate result, performs the third data processing operation on it to obtain a current output result, and then performs the third data processing operation on the next second intermediate result until the third data processing operation of the last second intermediate result is completed; and the starting time of the first data processing operation performed on the next input data by the first processor cannot be later than the ending time of the third data processing operation performed on the current second intermediate result by the first processor. It can be understood that the present application divides the workflow of data processing into a three-stage pipeline: in the first-stage pipeline, the first processor performs the first data processing operation; in the second-stage pipeline, the second processor performs the second data processing operation; and in the third-stage pipeline, the first processor performs the third data processing operation. Therefore, the first processor and the second processor simultaneously work on different data processing steps of a plurality of input data, the different data processing steps are efficiently executed on different processors, the computation load of the first processor is reduced, and the data processing efficiency is improved.
Drawings
FIG. 1 is a block diagram of a data processing system in one embodiment;
FIG. 2 is a block diagram of a data processing system in accordance with another embodiment;
FIG. 3 is a flow diagram illustrating a data processing method based on pipeline techniques in one embodiment;
fig. 4 is a flow chart illustrating a data processing method based on pipeline technology in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data processing method based on the pipeline technology provided by the embodiment of the application can be applied to the data processing system 10 shown in fig. 1. The data processing system 10 includes a first processor 110 and a second processor 120, wherein the first processor 110 and the second processor 120 are connected. It should be clear that, taking the application of the data processing method to the workflow of a neural network offline model as an example, the workflow of the neural network offline model is divided into three steps in the embodiment of the present application: a data preprocessing (DataPreProcessor) step, a data inference (Inferencer) step and a data post-processing (PostProcessor) step. The first processor 110 is configured to perform the data preprocessing step and the data post-processing step, and the second processor 120 is configured to perform the data inference step.
Specifically, in the first stage pipeline, the first processor 110 performs the data preprocessing (DataPreProcessor) step. Optionally, the data preprocessing includes importing raw data, format conversion, mean processing, and the like. In the second stage pipeline, the second processor 120 performs the data inference step. Optionally, the data inference includes offline model import, memory allocation, data copying, inference calculation, and the like. In the third stage pipeline, the first processor 110 performs the data post-processing step. Optionally, the data post-processing includes data copying, result calculation, and the like. The data preprocessing step, the data inference step and the data post-processing step are rearranged through the pipeline technology, so that the steps can be executed more efficiently.
In one embodiment, the workflow of the neural network offline model is executed as a pipeline of three steps; the pipeline arrangement of the data preprocessing (DataPreProcessor) step, the data inference (Inferencer) step and the data post-processing (PostProcessor) step is shown in Table 1:
| Time slot       | 1    | 2    | 3    | …  | n      | n+1    | n+2    | …  | m      | m+1    | m+2  |
|-----------------|------|------|------|----|--------|--------|--------|----|--------|--------|------|
| Preprocessing   | D(1) | D(2) | D(3) | …  | D(n)   | D(n+1) | D(n+2) | …  | D(m)   |        |      |
| Data inference  |      | I(1) | I(2) | …  | I(n-1) | I(n)   | I(n+1) | …  | I(m-1) | I(m)   |      |
| Post-processing |      |      | P(1) | …  | P(n-2) | P(n-1) | P(n)   | …  | P(m-2) | P(m-1) | P(m) |

TABLE 1
D(1) indicates that the first processor performs the preprocessing step on the first input data, I(1) indicates that the second processor performs the data inference step on the first preprocessed data, P(1) indicates that the first processor performs the post-processing step on the first inference data, and so on, until the workflows of all m input data are completed according to the pipeline technique.
As can be seen from Table 1, when the second processor 120 performs data inference on the nth preprocessed data, the first processor 110 simultaneously performs data preprocessing on the (n+1)th input data and data post-processing on the (n-1)th inference data. In this embodiment, because the pipeline technique is adopted, and on the premise that idle hardware resources are available, the workflow of the neural network offline model is divided so that the first processor 110 and the second processor 120 are kept fully loaded. It can be understood that the whole data processing process is: pipeline fill, pipeline full, pipeline drain. Once the pipeline is full, the first processor 110 and the second processor 120 are both working, and the performance is theoretically up to 3 times that of the non-pipelined case, so the efficiency of processing the network data is improved.
It should be clear that the pipeline arrangement of the three data processing steps of the neural network offline model shown in Table 1 is only one exemplary arrangement; other arrangements are possible. For example, while the first processor 110 performs data post-processing on the 1st inference data, it may simultaneously perform data preprocessing on the 2nd input data. In this way, the first processor 110 and the second processor 120 can simultaneously work on different data processing steps of a plurality of input data, so that the computation load of the first processor 110 is reduced, the neural network offline model is efficiently executed on the heterogeneous processors, and the efficiency of network data processing is improved. The present application does not limit the pipeline arrangement of the three data processing steps of the neural network offline model, as long as the first processor 110 and the second processor 120 can simultaneously work on different data processing steps of a plurality of input data.
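As a rough illustration of the theoretical speed-up mentioned above, the following Python sketch (not part of the patent; the per-stage durations are arbitrary assumptions) compares the total time of a fully serial workflow with that of a three-stage pipeline in which each stage can proceed concurrently:

```python
# Illustrative throughput estimate for the three-stage pipeline (not from the
# patent itself). Assumed per-item stage durations are arbitrary examples.
def serial_time(m, d, i, p):
    # Without pipelining, every input waits for all three stages to finish
    # before the next input is accepted.
    return m * (d + i + p)

def pipelined_time(m, d, i, p):
    # Once the pipeline is full, a new result is produced every max(d, i, p)
    # time units; the first item pays the full fill cost.
    bottleneck = max(d, i, p)
    return (d + i + p) + (m - 1) * bottleneck

m = 100                      # number of input data items
d = i = p = 1.0              # preprocessing, inference, post-processing times
print(serial_time(m, d, i, p))     # 300.0
print(pipelined_time(m, d, i, p))  # 102.0 -> roughly 3x when stages are balanced
```

When the three stages take similar time, the pipelined total approaches one third of the serial total, matching the factor-of-3 estimate above; with unbalanced stages, throughput is limited by the slowest stage.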
Alternatively, the first processor 110 may be a general-purpose processor such as a CPU, or another type of processor. The second processor 120 may act as a co-processor of the first processor 110. Alternatively, the second processor 120 may be an artificial intelligence processor such as an IPU (Intelligence Processing Unit) or an NPU (Neural-network Processing Unit), a dedicated processor such as a GPU, or a general-purpose processor such as a CPU.
When the type of the processor is an IPU (Intelligence Processing Unit), the intelligent processor comprises a main processing circuit and a plurality of slave processing circuits. In one embodiment, the intelligent processor further comprises a tree module, the tree module comprising a root port and a plurality of branch ports, wherein the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits; the tree module is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the plurality of slave processing circuits.
In another embodiment, the intelligent processor further includes one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit. The main processing circuit is specifically configured to determine that the input neurons are broadcast data and the weights are distribution data, divide the distribution data into a plurality of data blocks, and send at least one of the plurality of data blocks, the broadcast data and at least one of a plurality of operation instructions to the branch processing circuits; the branch processing circuits are used for forwarding the data blocks, the broadcast data and the operation instructions between the main processing circuit and the plurality of slave processing circuits; the plurality of slave processing circuits are used for performing operations on the received data blocks and broadcast data according to the operation instructions to obtain intermediate results and transmitting the intermediate results to the branch processing circuits; and the main processing circuit is used for performing subsequent processing on the intermediate results sent by the branch processing circuits to obtain the result of the calculation instruction, and sending the result of the calculation instruction to the controller unit.
In yet another embodiment, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, and the main processing circuit is connected with k slave processing circuits among the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits in row 1, the n slave processing circuits in row m, and the m slave processing circuits in column 1. The k slave processing circuits are used for forwarding data and instructions between the main processing circuit and the remaining slave processing circuits; the main processing circuit is used for determining that the input neurons are broadcast data and the weights are distribution data, dividing the distribution data into a plurality of data blocks, and sending at least one of the plurality of data blocks and at least one of a plurality of operation instructions to the k slave processing circuits; the plurality of slave processing circuits are used for performing operations on the received data blocks according to the operation instructions to obtain intermediate results and transmitting the intermediate results to the k slave processing circuits; and the main processing circuit is used for performing subsequent processing on the intermediate results sent by the k slave processing circuits to obtain the result of the calculation instruction, and sending the result of the calculation instruction to the controller unit.
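Purely as an illustration of the broadcast/distribution data flow described above (the function and variable names are assumptions, not the patent's implementation), the following Python sketch models a single layer in which the input neurons are broadcast to every slave circuit, the weight matrix is split into blocks distributed among the slaves, each slave produces an intermediate result, and the main circuit combines those results:

```python
import numpy as np

def simulate_master_slaves(weights, input_neurons, num_slaves=4):
    """Toy model (illustrative only) of the broadcast/distribution scheme."""
    # The main circuit treats the input neurons as broadcast data and the
    # weight matrix as distribution data, split into one block per slave.
    weight_blocks = np.array_split(weights, num_slaves, axis=0)

    # Each slave operates on its own weight block and the broadcast input,
    # producing an intermediate result.
    intermediate_results = [block @ input_neurons for block in weight_blocks]

    # The main circuit performs the subsequent processing: here, simply
    # concatenating the partial outputs into the final result.
    return np.concatenate(intermediate_results)

weights = np.random.rand(8, 16)
input_neurons = np.random.rand(16)
out = simulate_master_slaves(weights, input_neurons)
assert np.allclose(out, weights @ input_neurons)   # same result as one circuit
```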
In one embodiment, referring to fig. 2, the first processor 110 is disposed in the main Device (Host Device) 10a, and the second processor 120 is disposed in the auxiliary Device (Device) 10 b.
In one embodiment, the data processing system 10 further comprises a preprocessed data storage unit 130 and an inference data storage unit 140. The preprocessed data storage unit 130 is coupled to the first processor 110 and the second processor 120, respectively, and the inference data storage unit 140 is coupled to the first processor 110 and the second processor 120, respectively. The preprocessed data storage unit 130 comprises a plurality of input storage spaces, and the inference data storage unit 140 comprises a plurality of output storage spaces. Optionally, ping-pong buffers are opened up for the input storage spaces and the output storage spaces. The preprocessed data storage unit 130 or the inference data storage unit 140 may include a register, a cache, or any combination thereof. Specifically, the cache is used for storing operation instructions and is a scratchpad cache; the register is used for storing the neural network offline model, data and scalars. Optionally, the preprocessed data storage unit 130 and the inference data storage unit 140 are provided in the auxiliary device (Device) 10b.
In one embodiment, the data processing system 10 further includes a state updating unit coupled to the first processor 110, the second processor 120, the preprocessed data storage unit 130 and the inference data storage unit 140, respectively. The state updating unit is used for updating the state of the input storage spaces in the preprocessed data storage unit 130 and the state of the output storage spaces in the inference data storage unit 140. Optionally, the state of an input storage space includes an occupied state and an idle state, and the state of an output storage space likewise includes an occupied state and an idle state. The occupied state refers to a state in which data is stored, and the idle state refers to a state in which no data is stored. Optionally, the state updating unit is provided in the auxiliary device (Device) 10b.
As one possible implementation, the data processing system 10 in fig. 2 is taken as an example for detailed description. The data processing system 10 includes a main device (Host Device) 10a and an auxiliary device (Device) 10b, wherein the main device 10a includes the first processor 110, and the auxiliary device 10b includes the second processor 120, the preprocessed data storage unit 130, the inference data storage unit 140 and the state updating unit. Specifically, referring also to Table 1, after completing the preprocessing of the (n+1)th input data, the main device 10a first obtains an input cache whose state is idle in the auxiliary device 10b, and then copies the (n+1)th preprocessed data into the idle input cache; meanwhile, the auxiliary device 10b updates the cache state of that input cache to the occupied state. Before performing data inference on the nth preprocessed data, the auxiliary device 10b searches for an input cache whose state is occupied and an output cache whose state is idle, reads the nth preprocessed data from the occupied input cache, stores the nth inference data obtained by the data inference in the idle output cache in real time, updates the cache state of the input cache to idle, and updates the cache state of the output cache to occupied. Before post-processing the (n-1)th inference data, the main device 10a searches for an output cache whose state is occupied in the auxiliary device 10b, copies the (n-1)th inference data in the occupied output cache to the memory of the main device 10a, and after the auxiliary device 10b sets that output cache back to the idle state, the main device 10a performs the data post-processing. This embodiment combines the ping-pong operation technique with the pipeline technique, and can manage the on-chip space more efficiently.
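A minimal sketch of the buffer-state handshake just described is given below, with simple Python objects standing in for the on-chip input and output caches; the names and structure are illustrative assumptions rather than the patent's implementation.

```python
# Illustrative model of the buffer-state handshake between the main device
# and the auxiliary device (assumed structure, not the patent's code).
IDLE, OCCUPIED = "idle", "occupied"

class Buffer:
    def __init__(self):
        self.state = IDLE
        self.data = None

def find_buffer(buffers, state):
    """Return the first buffer whose state matches, or None."""
    return next((b for b in buffers if b.state == state), None)

input_buffers = [Buffer(), Buffer()]    # ping-pong input storage spaces
output_buffers = [Buffer(), Buffer()]   # ping-pong output storage spaces

def host_write_preprocessed(data):
    buf = find_buffer(input_buffers, IDLE)
    if buf is None:
        return False                    # main device must wait for a free space
    buf.data, buf.state = data, OCCUPIED
    return True

def device_infer(model):
    in_buf = find_buffer(input_buffers, OCCUPIED)
    out_buf = find_buffer(output_buffers, IDLE)
    if in_buf is None or out_buf is None:
        return False
    result = model(in_buf.data)         # data inference on the auxiliary device
    in_buf.data, in_buf.state = None, IDLE
    out_buf.data, out_buf.state = result, OCCUPIED
    return True

def host_read_inference():
    buf = find_buffer(output_buffers, OCCUPIED)
    if buf is None:
        return None
    data, buf.data, buf.state = buf.data, None, IDLE
    return data                         # main device post-processes this result

# Example hand-off for one item, with a stand-in "model":
host_write_preprocessed([1, 2, 3])
device_infer(lambda x: [v * 10 for v in x])
print(host_read_inference())            # [10, 20, 30]
```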
In one embodiment, a data processing method based on pipeline technology is provided, which is illustrated by taking the data processing system 10 in fig. 1 as an example, and includes the following steps:
the first processor performs first data processing operation on current input data to obtain a current first intermediate result, then acquires next input data, and performs first data processing operation on the next input data until the first data processing operation of the last input data is completed;
the second processor obtains the current first intermediate result, performs second data processing operation on the current first intermediate result to obtain a current second intermediate result, then obtains a next first intermediate result corresponding to the next input data, and performs second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed;
the first processor obtains the current second intermediate result, performs third data processing operation on the current second intermediate result to obtain a current output result, then obtains a next second intermediate result corresponding to the next first intermediate result, and performs third data processing operation on the next second intermediate result until the third data processing operation of the last second intermediate result is completed;
and the starting time of the first data processing operation performed on the next input data by the first processor cannot be later than the ending time of the third data processing operation performed on the current second intermediate result by the first processor.
Specifically, in the data processing method provided in this embodiment, first, the first processor performs the first data processing operation on current input data to obtain a current first intermediate result, and then performs the first data processing operation on next input data until the first data processing operation of the last input data is completed; then the second processor obtains the current first intermediate result and performs the second data processing operation on it to obtain a current second intermediate result, and then performs the second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed; finally, the first processor obtains the current second intermediate result, performs the third data processing operation on it to obtain a current output result, and then performs the third data processing operation on the next second intermediate result until the third data processing operation of the last second intermediate result is completed; and the starting time of the first data processing operation performed on the next input data by the first processor cannot be later than the ending time of the third data processing operation performed on the current second intermediate result by the first processor. It can be understood that the present application divides the workflow of data processing into a three-stage pipeline: in the first-stage pipeline, the first processor performs the first data processing operation; in the second-stage pipeline, the second processor performs the second data processing operation; and in the third-stage pipeline, the first processor performs the third data processing operation. Therefore, the first processor and the second processor simultaneously work on different data processing steps of a plurality of input data, the different data processing steps are efficiently executed on different processors, the computation load of the first processor is reduced, and the data processing efficiency is improved.
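The following Python sketch is an illustrative model (not the patent's code) of this three-stage arrangement: the main thread stands in for the first processor, a worker thread stands in for the second processor, and queues carry the first and second intermediate results. Note that the first operation on the next input is started before the third operation on the current second intermediate result is performed, consistent with the timing condition above.

```python
import queue
import threading

def first_op(x):       # first data processing operation (e.g. preprocessing)
    return x * 2

def second_op(x):      # second data processing operation (e.g. inference)
    return x + 1

def third_op(x):       # third data processing operation (e.g. post-processing)
    return x - 3

first_results = queue.Queue()    # first intermediate results -> second processor
second_results = queue.Queue()   # second intermediate results -> first processor

def second_processor():
    while True:
        item = first_results.get()
        if item is None:                 # no more first intermediate results
            second_results.put(None)
            return
        second_results.put(second_op(item))

threading.Thread(target=second_processor, daemon=True).start()

inputs = list(range(8))
outputs = []
in_flight = 0
for x in inputs:
    # The "first processor" starts the first operation on the next input
    # before performing the third operation on the current second
    # intermediate result, as required by the method.
    first_results.put(first_op(x))
    in_flight += 1
    if in_flight > 1:
        outputs.append(third_op(second_results.get()))
        in_flight -= 1
first_results.put(None)
while True:
    item = second_results.get()
    if item is None:
        break
    outputs.append(third_op(item))

print(outputs)   # [(2*x + 1) - 3 for x in range(8)] == [-2, 0, 2, ..., 12]
```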
In one embodiment, the data processing method based on the pipeline technology is applied to the workflow of the neural network offline model, and it should be clear that the workflow of the neural network offline model is divided into a data preprocessing step, a data reasoning step and a data post-processing step. On the basis of the above embodiments, it can be correspondingly obtained that the first data processing operation comprises data preprocessing, and/or the second data processing operation comprises data inference, and/or the third data processing operation comprises data post-processing. As shown in fig. 3, the present embodiment includes the following steps:
s202, the first processor preprocesses the current input data to obtain current preprocessed data, then obtains next input data, and preprocesses the next input data until the preprocessing of the last input data is completed.
The input data may include data in image, audio, or text format, among others. The image includes a still picture, pictures constituting a video, and the like. The audio includes human voice, music, noise, etc. Text includes structured text, text characters in various languages, and the like. Optionally, the input data is input neuron data.
Specifically, this embodiment adopts a pipeline technique, and the number of input data is required to be multiple; there may be 10 input data, 100 input data, or more, and the number may be determined according to actual requirements, which is not limited in this embodiment. The first processor first obtains the first input data and preprocesses it to obtain the first preprocessed data, then immediately obtains the next input data after the first preprocessed data is output, or obtains the next input data within a preset time after the output, and performs preprocessing on the next input data until the preprocessing of the last input data is finished. The preset time after output needs to satisfy the requirement, described below, that the starting time at which the first processor preprocesses the next input data cannot be later than the ending time at which the first processor post-processes the current inference data. It should be clear that any input data may serve as the current input data, and the input data that follows it is the next input data.
Optionally, taking an image as an example, the first processor obtains first input image data, and performs preprocessing operations such as format conversion and mean processing on the input image data to obtain first preprocessed image data. And then immediately acquiring next input image data, performing preprocessing operations such as format conversion, mean value processing and the like on the next input image data to obtain next preprocessed image data, and repeatedly executing preprocessing on other input image data until the preprocessing of the last input image data is finished.
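As an assumed example of such preprocessing (the specific conversions and mean values below are illustrative choices, not prescribed by the patent), format conversion and mean processing for one image might look as follows:

```python
import numpy as np

def preprocess_image(image_u8, mean):
    """Assumed example of the preprocessing mentioned above: format
    conversion followed by mean processing (illustrative only)."""
    # Format conversion: 8-bit HWC image to a float32 CHW tensor.
    image_f32 = image_u8.astype(np.float32).transpose(2, 0, 1)
    # Mean processing: subtract an assumed per-channel mean value.
    return image_f32 - mean.reshape(-1, 1, 1)

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
channel_mean = np.array([104.0, 117.0, 123.0], dtype=np.float32)
preprocessed = preprocess_image(image, channel_mean)
print(preprocessed.shape)   # (3, 224, 224)
```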
And S204, the second processor acquires the current preprocessed data, performs data inference on the current preprocessed data by using a neural network offline model to obtain current inference data, acquires next preprocessed data corresponding to the next input data, and performs data inference on the next preprocessed data until the data inference on the last preprocessed data is completed.
Specifically, the second processor first obtains first preprocessed data and a corresponding neural network offline model, performs data inference on the first preprocessed data according to the neural network offline model to obtain first inference data, and immediately obtains next preprocessed data after the first inference data is output, or obtains next preprocessed data within preset time after the first inference data is output, and performs data inference on the next preprocessed data according to the neural network offline model until data inference on the last preprocessed data is completed. The preset time after output needs to meet the requirement that the starting time of the first processor for preprocessing the next input data cannot be later than the finishing time of the first processor for post-processing the current inference data, which is described below. It should be clear that any preprocessed data may be used as the current preprocessed data, and the next preprocessed data is the next preprocessed data.
Optionally, taking the image as an example again, the second processor obtains the first preprocessed image data and the corresponding neural network offline model, and performs data inference calculation, such as memory allocation, data copying and inference computation, on the first preprocessed image data according to the neural network offline model to obtain the first inference image data. It then immediately obtains the next preprocessed image data and performs the data inference calculation on it according to the neural network offline model to obtain the next inference image data, and repeats the data inference of the other preprocessed image data until the data inference of the last preprocessed image data is completed.
And S206, the first processor acquires the current inference data, performs post-processing on the current inference data to acquire current post-processing data, acquires next inference data corresponding to the next pre-processing data, and performs post-processing on the next inference data until the post-processing of the last inference data is completed.
Specifically, based on the processes of S202 and S204, the first processor first obtains first inference data, performs post-processing on the first inference data to obtain first post-processing data, and immediately obtains next inference data after outputting the first post-processing data, or obtains next inference data within a preset time after outputting, and performs post-processing on the next inference data until the post-processing on the last inference data is completed. The preset time after output needs to meet the requirement that the starting time of the first processor for preprocessing the next input data cannot be later than the finishing time of the first processor for post-processing the current inference data, which is described below. It should be clear that any inference data can be taken as current inference data, and the next inference data as next inference data.
Optionally, taking the image as an example again, the first processor acquires the first inference image data and performs post-processing operations such as data copying and result calculation on it to obtain the first post-processed data. It then immediately acquires the next inference image data, performs post-processing operations such as data copying and result calculation on it to obtain the next post-processed data, and repeats the post-processing of the other inference image data until the post-processing of the last inference image data is completed.
The data processing method based on the pipeline technology comprises the steps that firstly, a first processor carries out preprocessing on current input data to obtain current preprocessed data, and then preprocessing is carried out on next input data until the preprocessing of the last input data is completed; then the second processor acquires the current preprocessed data, performs data inference on the current preprocessed data by using a neural network offline model to obtain current inference data, and performs data inference on the next preprocessed data until the data inference of the last preprocessed data is completed; finally, the first processor acquires current reasoning data, performs post-processing on the current reasoning data to acquire current post-processing data, and then acquires the next reasoning data for post-processing until the post-processing of the last reasoning data is completed; and the starting time of the first processor for preprocessing the next input data cannot be later than the finishing time of the first processor for post-processing the current inference data. It can be appreciated that the present application divides the workflow of the neural network offline model into three stages of pipelines, in the first stage of pipeline, the first processor performs the data preprocessing step; in the second stage pipeline, the second processor performs a data inference step; in the third stage pipeline, the first processor performs a data post-processing step. Therefore, the first processor and the second processor work for different data processing steps of a plurality of input data at the same time, the neural network offline model is efficiently executed on the heterogeneous processor, the calculation amount of the first processor is reduced, and the efficiency of network data processing is improved.
Referring to fig. 4, fig. 4 is a flow chart illustrating a data processing method based on pipeline technology according to another embodiment. The present embodiment relates to a specific process in which the first processor writes the preprocessed data into the free input memory space. On the basis of the above embodiment, the method further comprises the steps of:
s302, the first processor searches for an input storage space with an idle state, and writes the preprocessed data into the idle input storage space.
Wherein the number of the input storage spaces is multiple; and after the preprocessing data is written into the free input storage space, updating the state of the free input storage space to be an occupied state.
Specifically, the number of input storage spaces is plural; for example, the number of input storage spaces is 2, and the two input storage spaces constitute a ping-pong structure. Of course, the number of input storage spaces may be 3, 4, 5 or more, and may be determined according to actual requirements, which is not limited in this embodiment. The state of an input storage space can be divided into an idle state and an occupied state, wherein the idle state means that the input storage space stores no preprocessed data, and the occupied state means that it stores preprocessed data. Optionally, the state of the input storage space may be distinguished by a state identification.
Optionally, the input storage spaces of the ping-pong structure may operate in the following manner: the two input storage spaces are alternately read (i.e., the preprocessed data is read) and written (i.e., the preprocessed data is stored), and the processes of reading and writing alternate between the two input storage spaces. Initially the contents of both input storage spaces are invalid, i.e., in the idle state, and cannot be read. The previous preprocessed data is then written into one of the input storage spaces (the input storage space of the "ping" structure is usually written first), and that input storage space becomes valid after the writing is finished. The next preprocessed data is then written into the other input storage space so that it also becomes valid; meanwhile, the previous preprocessed data is read from the first input storage space, which becomes idle again after the reading is finished and can accept other preprocessed data. Thereafter, the next preprocessed data in the other input storage space is read and that input storage space also becomes idle, and so the cycle continues. It can thus be seen that each input storage space cycles back and forth between its possible states in the order: writable, readable, writable, readable, and so on.
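The alternation described above can be sketched, purely for illustration, with two Python list slots standing in for the ping and pong input storage spaces; an empty slot corresponds to the idle (writable) state and a filled slot to the occupied (readable) state.

```python
# Minimal sketch of the ping-pong alternation described above (illustrative
# only). Slot 0 is the "ping" space and slot 1 is the "pong" space.
slots = [None, None]            # None marks the idle (writable) state

def write(slot_index, preprocessed):
    assert slots[slot_index] is None, "slot must be idle before writing"
    slots[slot_index] = preprocessed                    # slot becomes readable

def read(slot_index):
    assert slots[slot_index] is not None, "slot must be occupied before reading"
    data, slots[slot_index] = slots[slot_index], None   # slot becomes idle again
    return data

write(0, "preprocessed data 1")   # ping written first
write(1, "preprocessed data 2")   # pong written while ping is still occupied
print(read(0))                    # ping read; slot 0 idle again
write(0, "preprocessed data 3")   # ping reused for the next data
print(read(1))                    # pong read
print(read(0))
```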
When the first processor obtains the preprocessed data, the input storage space with the idle state is searched first, the input storage space is determined to be the idle input storage space, and after the first processor determines that the idle input storage space exists, the preprocessed data is written into the idle input storage space for storage, so that the preprocessed data can be called later conveniently. After the preprocessing data are written into the idle input storage space, the state of the idle input storage space is updated to be an occupied state by the state updating unit.
In this embodiment, the preprocessed data are written into the idle input storage space for storage, so that when an error occurs in the data processing process, the preprocessed data can be obtained from the input storage space again, and the accuracy of the data is ensured.
In one embodiment, a specific process is involved in which the second processor reads the preprocessed data from the occupied input memory space. On the basis of the above embodiment, the method further comprises the steps of:
s304, the second processor searches the input storage space occupied by the state and reads the preprocessed data from the occupied input storage space; and after the second processor reads the preprocessed data, the state of the occupied input storage space is updated to be an idle state.
Specifically, before the second processor performs data inference on the preprocessed data, the second processor searches for an input storage space whose state is occupied, and after determining that an occupied input storage space exists, reads the preprocessed data from the occupied input storage space. After the second processor reads the preprocessed data, the state of the occupied input storage space is updated to the idle state by the state updating unit.
In one embodiment, a specific process is involved in which the second processor writes the inference data to free output storage space. On the basis of the above embodiment, the method further comprises the steps of:
s306, the second processor searches for an output storage space with an idle state and writes inference data into the idle output storage space; wherein the number of the output storage spaces is multiple; and after the idle output storage space is written with the reasoning data, updating the state of the idle output storage space to be an occupied state.
Specifically, the number of output storage spaces is plural, for example, the number of output storage spaces is 2, and the two output storage spaces constitute a ping-pong structure. Of course, the number of the output storage spaces may be 3, 4 or 5, or may be more, and the number of the output storage spaces may be divided according to actual requirements, which is not limited in this embodiment. The state of the output storage space can be divided into an idle state and an occupied state, the idle state is that the output storage space stores no inference data, and the occupied state is that the output storage space stores inference data. Optionally, the state of the output storage space may be differentiated by a state identification.
Optionally, the output storage spaces of the ping-pong structure may operate in the following manner: the two output storage spaces are alternately read (i.e., the inference data is read) and written (i.e., the inference data is stored), and the processes of reading and writing alternate between the two output storage spaces. Initially the contents of both output storage spaces are invalid, i.e., in the idle state, and cannot be read. The previous inference data is then written into one of the output storage spaces (the output storage space of the "ping" structure is usually written first), and that output storage space becomes valid after the writing is finished. The next inference data is then written into the other output storage space so that it also becomes valid; meanwhile, the previous inference data is read from the first output storage space, which becomes idle again after the reading is finished and can accept other inference data. Thereafter, the next inference data in the other output storage space is read and that output storage space also becomes idle, and so the cycle continues. It can thus be seen that each output storage space cycles back and forth between its possible states in the order: writable, readable, writable, readable, and so on.
When the second processor obtains the inference data, it first searches for an output storage space whose state is idle and determines it to be an idle output storage space; after the second processor determines that an idle output storage space exists, the inference data is written into the idle output storage space for storage, so as to facilitate subsequent retrieval. After the inference data is written into the idle output storage space, the state of the idle output storage space is updated to the occupied state by the state updating unit.
In this embodiment, the inference data is written into the idle output storage space for storage, so that when an error occurs in the data processing process, the inference data can be obtained from the output storage space again, and the accuracy of the data is ensured.
In one embodiment, a specific process involving the first processor reading the inference data from the occupied output memory space. On the basis of the above embodiment, the method further comprises the steps of:
s308, the first processor searches the output storage space occupied by the state and reads the inference data from the occupied output storage space; and after the first processor reads the inference data, updating the state of the occupied output storage space to be an idle state.
Specifically, before the first processor performs post-processing on the inference data, the first processor searches for an output storage space whose state is occupied, and after determining that an occupied output storage space exists, reads the inference data from the occupied output storage space. After the first processor reads the inference data, the state of the occupied output storage space is updated to the idle state by the state updating unit.
In one embodiment, the method further comprises:
s402, the first processor acquires attribute information of the preprocessed data and attribute information of the input storage spaces; and determining an input storage space matched with the preprocessed data according to the attribute information of the preprocessed data and the attribute information of the plurality of input storage spaces, and writing the preprocessed data into the matched input storage space.
Specifically, the attribute information of the preprocessed data refers to information that can characterize the nature of the preprocessed data, and the attribute information of the input storage space refers to information that can characterize the nature of the input storage space. Optionally, the attribute information of the preprocessed data includes at least one of the size of the preprocessed data, the type of the preprocessed data, and the format of the preprocessed data; correspondingly, the attribute information of the input storage space includes at least one of the size of the input storage space, the type of data that the input storage space can store, and the format of data that the input storage space can store. After the first processor acquires the attribute information of the preprocessed data and the attribute information of the plurality of input storage spaces, the first processor determines the input storage space matched with the preprocessed data according to these attributes. Matching may mean that the size of the preprocessed data is not larger than the size of the input storage space, that the type of the preprocessed data is the same as the type of data that can be stored in the input storage space, that the format of the preprocessed data is the same as the format of data that can be stored in the input storage space, or a combination of the above.
When the first processor determines that there is an input memory space that matches the preprocessed data, the first processor writes the preprocessed data to the matching input memory space.
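The matching rule above can be sketched as follows. DataAttr, SpaceAttr, and find_matching_space are hypothetical names introduced for illustration only; the size, type, and format fields mirror the attribute information described in this embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataAttr:
    size: int      # size of the pre-processed data, e.g. in bytes
    dtype: str     # data type, e.g. "float16"
    layout: str    # data format, e.g. "NCHW"

@dataclass
class SpaceAttr:
    capacity: int  # size of the input storage space
    dtype: str     # type of data the space can store
    layout: str    # format of data the space can store
    state: str     # "idle" or "occupied"

def find_matching_space(data: DataAttr, spaces: List[SpaceAttr]) -> Optional[SpaceAttr]:
    """Return the first idle storage space whose attributes match the data:
    the data is not larger than the space, and type and format agree."""
    for space in spaces:
        if (space.state == "idle"
                and data.size <= space.capacity
                and data.dtype == space.dtype
                and data.layout == space.layout):
            return space
    return None
```

Returning the first idle space that satisfies all three conditions is only one possible policy; the embodiment leaves the selection strategy open.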
In one embodiment, the method further comprises:
the second processor acquires attribute information of the inference data and attribute information of the plurality of output storage spaces, determines an output storage space matched with the inference data according to the attribute information of the inference data and the attribute information of the plurality of output storage spaces, and writes the inference data into the matched output storage space.
Specifically, the attribute information of the inference data refers to information that can characterize the nature of the inference data, and the attribute information of the output storage space refers to information that can characterize the nature of the output storage space. Optionally, the attribute information of the inference data includes at least one of the size of the inference data, the type of the inference data, and the format of the inference data; correspondingly, the attribute information of the output storage space includes at least one of the size of the output storage space, the type of data that the output storage space can store, and the format of data that the output storage space can store. After the second processor acquires the attribute information of the inference data and the attribute information of the plurality of output storage spaces, the second processor determines the output storage space matched with the inference data according to these attributes. Matching may mean that the size of the inference data is not larger than the size of the output storage space, that the type of the inference data is the same as the type of data that can be stored in the output storage space, that the format of the inference data is the same as the format of data that can be stored in the output storage space, or a combination of the above.
When the second processor determines that there is an output memory space that matches the inference data, the second processor writes the inference data to the matching output memory space.
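The output-side matching is symmetric. Under the same illustrative assumptions as the sketch above, the second processor could reuse find_matching_space with the inference data's attributes and the output storage spaces; the values below are placeholders.

```python
# Hypothetical reuse of find_matching_space on the output side.
output_spaces = [SpaceAttr(capacity=8192, dtype="float16", layout="NCHW", state="idle"),
                 SpaceAttr(capacity=8192, dtype="float16", layout="NCHW", state="occupied")]
inference_attr = DataAttr(size=4096, dtype="float16", layout="NCHW")

target = find_matching_space(inference_attr, output_spaces)
if target is not None:
    pass  # the second processor would write the inference data into `target`
```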
It should be understood that although the various steps in the flowcharts of fig. 3-4 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in fig. 3-4 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different times, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, there is provided a data processing apparatus based on pipeline technology, the apparatus comprising:
the input data processing module is used for the first processor to perform first data processing operation on current input data to obtain a current first intermediate result, then obtain next input data and perform first data processing operation on the next input data until the first data processing operation of the last input data is completed;
a first intermediate result processing module, configured to obtain the current first intermediate result by the second processor, perform a second data processing operation on the current first intermediate result to obtain a current second intermediate result, then obtain a next first intermediate result corresponding to the next input data, and perform a second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed;
an output result determining module, configured to obtain the current second intermediate result by the first processor, perform a third data processing operation on the current second intermediate result to obtain a current output result, then obtain a next second intermediate result corresponding to the next first intermediate result, and perform a third data processing operation on the next second intermediate result until the third data processing operation of the last second intermediate result is completed;
and the starting time of the first data processing operation performed on the next input data by the first processor cannot be later than the ending time of the third data processing operation performed on the current second intermediate result by the first processor.
In the above data processing device based on the pipeline technology, the first processor first performs the first data processing operation on the current input data to obtain the current first intermediate result, and then performs the first data processing operation on the next input data, until the first data processing operation on the last input data is completed. The second processor then obtains the current first intermediate result and performs the second data processing operation on it to obtain the current second intermediate result, and then performs the second data processing operation on the next first intermediate result, until the second data processing operation on the last first intermediate result is completed. Finally, the first processor obtains the current second intermediate result and performs the third data processing operation on it to obtain the current output result, and then performs the third data processing operation on the next second intermediate result, until the third data processing operation on the last second intermediate result is completed. The starting time of the first data processing operation performed on the next input data by the first processor cannot be later than the ending time of the third data processing operation performed on the current second intermediate result by the first processor. It can be understood that the present application divides the data processing workflow into a three-stage pipeline: in the first pipeline stage, the first processor performs the first data processing operation; in the second pipeline stage, the second processor performs the second data processing operation; and in the third pipeline stage, the first processor performs the third data processing operation. The first processor and the second processor therefore work on different data processing steps of multiple input data at the same time, the different data processing steps are executed efficiently on different processors, the computational load on the first processor is reduced, and the data processing efficiency is improved.
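The following sketch is one way such a three-stage pipeline could be modeled in Python; it is illustrative only, and run_pipeline, preprocess, infer, and postprocess are assumed placeholder names rather than anything defined in this application. Because the three stages run concurrently and hand results over through bounded queues, the first data processing operation on the next input data starts before the third data processing operation on the current second intermediate result has finished, which is consistent with the timing constraint stated above.

```python
import queue
import threading

def run_pipeline(inputs, preprocess, infer, postprocess):
    """Three-stage pipeline sketch: stage 1 and stage 3 model the first
    processor, stage 2 models the second processor. Bounded queues hold the
    first and second intermediate results."""
    q1 = queue.Queue(maxsize=2)   # first intermediate results
    q2 = queue.Queue(maxsize=2)   # second intermediate results
    outputs = []

    def stage1():                 # first data processing operation
        for x in inputs:
            q1.put(preprocess(x))
        q1.put(None)              # end-of-stream marker

    def stage2():                 # second data processing operation
        while (item := q1.get()) is not None:
            q2.put(infer(item))
        q2.put(None)

    def stage3():                 # third data processing operation
        while (item := q2.get()) is not None:
            outputs.append(postprocess(item))

    threads = [threading.Thread(target=s) for s in (stage1, stage2, stage3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs

# Example usage with trivial placeholder stages:
# results = run_pipeline(range(5), lambda x: x + 1, lambda x: x * 2, lambda x: x - 1)
```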
For specific limitations of the data processing apparatus based on the pipeline technology, reference may be made to the above limitations of the data processing method based on the pipeline technology, which are not repeated here. The various modules in the above pipeline-based data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to the modules.
In one embodiment, a data processing method based on pipeline technology is further provided, and the method is applied to a heterogeneous computing architecture, wherein the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the general processor receives input data and preprocesses the input data to obtain preprocessed data;
and the general processor receives inference data obtained by the artificial intelligence processor by using a neural network offline model to perform data inference on the preprocessed data, performs post-processing on the inference data to obtain post-processed data, and executes the step of receiving input data again before the general processor finishes post-processing the inference data to obtain the post-processed data.
In this embodiment, the specific implementation process of the above steps may refer to the description in the embodiment shown in fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, a data processing method based on pipeline technology is further provided, and the method is applied to a heterogeneous computing architecture, wherein the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the artificial intelligence processor receives the preprocessed data and infers the preprocessed data through a neural network offline model to obtain inference data; the preprocessed data is obtained by the general processor receiving input data and preprocessing the input data;
the artificial intelligence processor inputs the inference data into the general processor, so that the general processor performs post-processing on the inference data to obtain post-processed data;
and the starting time of the general-purpose processor for preprocessing the next input data cannot be later than the finishing time of the general-purpose processor for post-processing the inference data.
In this embodiment, the specific implementation process of the above steps may refer to the description in the embodiment shown in fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, a data processing method based on pipeline technology is further provided, and the method is applied to a heterogeneous computing architecture, wherein the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the general processor receives input data and preprocesses the input data to obtain preprocessed data;
the general processor receives inference data obtained by the artificial intelligence processor by using a neural network offline model to perform data inference on the preprocessed data; and when the artificial intelligence processor uses the neural network offline model to carry out data inference on the preprocessed data, the general processor executes, in parallel, the step of receiving input data again and preprocessing the received input data again, and the step of post-processing the inference data to obtain post-processed data.
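As a hedged illustration of this overlap, the following Python sketch uses concurrent.futures, under the assumption that preprocess, infer, and postprocess are placeholder callables standing in for the general processor's pre-/post-processing and the artificial intelligence processor's offline-model inference; inference is submitted asynchronously so the CPU-side work can proceed in parallel with it.

```python
from concurrent.futures import ThreadPoolExecutor

def overlapped_loop(inputs, preprocess, infer, postprocess):
    """While `infer` runs for input n, `preprocess` for input n+1 and
    `postprocess` for input n-1 are submitted to the pool, so the general
    processor's work overlaps the artificial intelligence processor's work."""
    results = []
    with ThreadPoolExecutor(max_workers=3) as pool:
        pre = preprocess(inputs[0])                  # prime the pipeline
        inf_future = pool.submit(infer, pre)         # inference for input 0
        post_future = None
        for i in range(1, len(inputs)):
            next_pre = pool.submit(preprocess, inputs[i])      # overlaps inference
            if post_future is not None:
                results.append(post_future.result())
            inference = inf_future.result()          # wait for the AI processor
            post_future = pool.submit(postprocess, inference)  # overlaps next inference
            inf_future = pool.submit(infer, next_pre.result())
        if post_future is not None:                  # drain the pipeline tail
            results.append(post_future.result())
        results.append(postprocess(inf_future.result()))
    return results
```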
In this embodiment, the specific implementation process of the above steps may refer to the description in the embodiment shown in fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.
In one embodiment, a data processing method based on pipeline technology is further provided, and the method is applied to a heterogeneous computing architecture, wherein the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the artificial intelligence processor receives the preprocessed data and infers the preprocessed data through a neural network offline model to obtain inference data; the preprocessed data is obtained by the general processor receiving input data and preprocessing the input data;
the artificial intelligence processor inputs the inference data into the general processor, so that the general processor performs post-processing on the inference data to obtain post-processed data;
when the artificial intelligence processor infers the preprocessed data, the general-purpose processor executes, in parallel, the step of receiving input data again and preprocessing the received input data again, and the step of post-processing the inference data input by the artificial intelligence processor.
In this embodiment, the specific implementation process of the above steps may refer to the description in the embodiment shown in fig. 3, and the implementation principle and the technical effect are similar, which are not described herein again.
In an embodiment, a board card is further provided, which is applied to a heterogeneous computing architecture, where the board card includes: an artificial intelligence processor for performing the method according to the above embodiments.
In one embodiment, a motherboard is further provided, which is applied to a heterogeneous computing architecture, and the motherboard includes: a general-purpose processor and the above board card.
In an embodiment, an electronic device is further provided, and is applied to a heterogeneous computing architecture, where the electronic device includes the motherboard.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (19)
1. A method for processing data based on pipeline technology, the method comprising:
the first processor performs first data processing operation on current input data to obtain a current first intermediate result, then acquires next input data, and performs first data processing operation on the next input data until the first data processing operation of the last input data is completed;
the second processor obtains the current first intermediate result, performs second data processing operation on the current first intermediate result to obtain a current second intermediate result, then obtains a next first intermediate result corresponding to the next input data, and performs second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed;
the first processor obtains the current second intermediate result, performs third data processing operation on the current second intermediate result to obtain a current output result, then obtains a next second intermediate result corresponding to the next first intermediate result, and performs third data processing operation on the next second intermediate result until the third data processing operation of the last second intermediate result is completed;
and the starting time of the first data processing operation performed on the next input data by the first processor cannot be later than the ending time of the third data processing operation performed on the current second intermediate result by the first processor.
2. The method of claim 1, wherein the first data processing operation comprises data pre-processing, and/or the second data processing operation comprises data inference, and/or the third data processing operation comprises data post-processing;
if the second data processing operation includes data inference, the second processor obtains the current first intermediate result, performs the second data processing operation on the current first intermediate result to obtain a current second intermediate result, then obtains a next first intermediate result corresponding to the next input data, and performs the second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed, including:
and the second processor acquires the current first intermediate result, performs data inference on the current first intermediate result by using the neural network offline model to obtain current inference data, acquires a next first intermediate result corresponding to the next input data, and performs data inference on the next first intermediate result by using the neural network offline model until the data inference on the last first intermediate result is completed.
3. The method of claim 2, wherein the first intermediate result comprises preprocessed data, the method further comprising:
the first processor searches for an input storage space with a free state, and writes the preprocessed data into the free input storage space; wherein the number of the input storage spaces is multiple; and after the preprocessed data is written into the free input storage space, updating the state of the free input storage space to be an occupied state.
4. The method of claim 3, further comprising:
the second processor searches the input storage space occupied by the state and reads the preprocessed data from the occupied input storage space; and after the second processor reads the preprocessed data, the state of the occupied input storage space is updated to be an idle state.
5. The method of claim 2, further comprising:
the second processor searches for an output storage space with an idle state and writes inference data into the idle output storage space; wherein the number of the output storage spaces is multiple; and after the inference data is written into the idle output storage space, updating the state of the idle output storage space to be an occupied state.
6. The method of claim 5, further comprising:
the first processor searches the output storage space occupied by the state and reads the inference data from the occupied output storage space; and after the first processor reads the inference data, updating the state of the occupied output storage space to be an idle state.
7. The method according to any one of claims 3 to 6,
the number of the input storage spaces is two, and the two input storage spaces form a ping-pong structure; and/or
The number of the output storage spaces is two, and the two output storage spaces form a ping-pong structure.
8. The method of claim 3, further comprising:
the first processor acquires attribute information of the preprocessed data and attribute information of the input storage spaces; and determining an input storage space matched with the preprocessed data according to the attribute information of the preprocessed data and the attribute information of the plurality of input storage spaces, and writing the preprocessed data into the matched input storage space.
9. The method of claim 5, further comprising:
the second processor acquires attribute information of the inference data and attribute information of the plurality of output storage spaces; and determining an output storage space matched with the inference data according to the attribute information of the inference data and the attribute information of the output storage spaces, and writing the inference data into the matched output storage space.
10. The method according to claim 8 or 9, wherein the attribute information of the data comprises at least one of a data size, a data type and a data format, and correspondingly, the attribute information of the storage space comprises at least one of a storage space size, a data type which can be stored in the storage space and a data format which can be stored in the storage space.
11. The method of claim 1, wherein the first processor is a general purpose processor and the second processor is an artificial intelligence processor.
12. A data processing apparatus based on pipeline technology, the apparatus comprising:
the input data processing module is used for the first processor to perform first data processing operation on current input data to obtain a current first intermediate result, then obtain next input data and perform first data processing operation on the next input data until the first data processing operation of the last input data is completed;
a first intermediate result processing module, configured to obtain the current first intermediate result by the second processor, perform a second data processing operation on the current first intermediate result to obtain a current second intermediate result, then obtain a next first intermediate result corresponding to the next input data, and perform a second data processing operation on the next first intermediate result until the second data processing operation of the last first intermediate result is completed;
an output result determining module, configured to obtain the current second intermediate result by the first processor, perform a third data processing operation on the current second intermediate result to obtain a current output result, then obtain a next second intermediate result corresponding to the next first intermediate result, and perform a third data processing operation on the next second intermediate result until the third data processing operation of the last second intermediate result is completed;
and the starting time of the first data processing operation performed on the next input data by the first processor cannot be later than the ending time of the third data processing operation performed on the current second intermediate result by the first processor.
13. A data processing method based on pipeline technology is applied to a heterogeneous computing architecture, the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the general processor receives input data and preprocesses the input data to obtain preprocessed data;
and the general processor receives inference data obtained by the artificial intelligence processor by using a neural network offline model to perform data inference on the preprocessed data, performs post-processing on the inference data to obtain post-processed data, and executes the step of receiving input data again before the general processor finishes post-processing the inference data to obtain the post-processed data.
14. A data processing method based on pipeline technology is applied to a heterogeneous computing architecture, the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the artificial intelligence processor receives the preprocessed data and infers the preprocessed data through a neural network offline model to obtain inference data; the preprocessed data is obtained by the general processor receiving input data and preprocessing the input data;
the artificial intelligence processor inputs the inference data into the general processor, so that the general processor performs post-processing on the inference data to obtain post-processed data;
and the starting time of the general-purpose processor for preprocessing the next input data cannot be later than the finishing time of the general-purpose processor for post-processing the inference data.
15. A data processing method based on pipeline technology is applied to a heterogeneous computing architecture, the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the general processor receives input data and preprocesses the input data to obtain preprocessed data;
the general processor receives inference data obtained by the artificial intelligence processor by using a neural network offline model to perform data inference on the preprocessed data; and when the artificial intelligence processor uses the neural network offline model to carry out data inference on the preprocessed data, the general processor executes, in parallel, the step of receiving input data again and preprocessing the received input data again, and the step of post-processing the inference data to obtain post-processed data.
16. A data processing method based on pipeline technology is applied to a heterogeneous computing architecture, the heterogeneous computing architecture comprises a general-purpose processor and an artificial intelligence processor, and the method comprises the following steps:
the artificial intelligence processor receives the preprocessed data and infers the preprocessed data through a neural network offline model to obtain inference data; the preprocessed data is obtained by the general processor receiving input data and preprocessing the input data;
the artificial intelligence processor inputs the inference data into the general processor, so that the general processor performs post-processing on the inference data to obtain post-processed data;
when the artificial intelligence processor infers the preprocessed data, the general-purpose processor executes, in parallel, the step of receiving input data again and preprocessing the received input data again, and the step of post-processing the inference data input by the artificial intelligence processor.
17. A board card applied to a heterogeneous computing architecture, the board card comprising: an artificial intelligence processor for performing the method of claim 14 or the method of claim 16.
18. A motherboard for use in a heterogeneous computing architecture, the motherboard comprising: a general-purpose processor and a board card as claimed in claim 17.
19. An electronic device, for use in a heterogeneous computing architecture, comprising a motherboard as claimed in claim 18.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811555572.0A CN111338695B (en) | 2018-12-19 | 2018-12-19 | Data processing method based on pipeline technology and related product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111338695A CN111338695A (en) | 2020-06-26 |
CN111338695B true CN111338695B (en) | 2022-05-17 |
Family
ID=71181661
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811555572.0A Active CN111338695B (en) | 2018-12-19 | 2018-12-19 | Data processing method based on pipeline technology and related product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111338695B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111985635A (en) * | 2020-09-02 | 2020-11-24 | 北京小米松果电子有限公司 | A method, device and medium for accelerating neural network inference processing |
CN112804003A (en) * | 2021-02-19 | 2021-05-14 | 上海剑桥科技股份有限公司 | Optical module communication-based storage method, system and terminal |
CN113076181B (en) * | 2021-03-04 | 2023-09-26 | 山东英信计算机技术有限公司 | A data processing process optimization method, system and storage medium |
CN113159290B (en) * | 2021-04-26 | 2022-08-09 | 青岛本原微电子有限公司 | Neural network model network reasoning optimization method |
CN113421209B (en) * | 2021-06-21 | 2022-12-30 | 安谋科技(中国)有限公司 | Image processing method, system on chip, electronic device, and medium |
CN114298295A (en) * | 2021-12-30 | 2022-04-08 | 上海阵量智能科技有限公司 | Chip, accelerator card, electronic device, and data processing method |
CN117669668B (en) * | 2023-05-04 | 2024-11-26 | 北京辉羲智能信息技术有限公司 | A feedback neural network model computing flow pipeline arrangement method and AI compiler |
CN116521378B (en) * | 2023-07-03 | 2023-09-19 | 苏州浪潮智能科技有限公司 | Sensor access method and device of server and baseboard management controller |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101221490A (en) * | 2007-12-20 | 2008-07-16 | 清华大学 | A Floating Point Multiply Add Unit with Data Forwarding Structure |
CN101504638A (en) * | 2009-03-19 | 2009-08-12 | 北京理工大学 | Point-variable assembly line FFT processor |
CN106095345A (en) * | 2015-04-30 | 2016-11-09 | 佳能株式会社 | There is image processing system and the control method thereof of multiple processing unit |
CN106406812A (en) * | 2015-10-02 | 2017-02-15 | 上海兆芯集成电路有限公司 | Microprocessor, and method of executing fused composite arithmetical operation therein |
CN107229596A (en) * | 2016-03-25 | 2017-10-03 | 扬智科技股份有限公司 | Non-pipeline fast Fourier transform processor and operation control method thereof |
CN108491924A (en) * | 2018-02-11 | 2018-09-04 | 江苏金羿智芯科技有限公司 | A kind of serial stream treatment device of Neural Network Data calculated towards artificial intelligence |
CN108804141A (en) * | 2017-04-28 | 2018-11-13 | 英特尔公司 | Supporting learned branch predictors |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9170836B2 (en) * | 2013-01-09 | 2015-10-27 | Nvidia Corporation | System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor |
Non-Patent Citations (1)
Title |
---|
Large-scale graph data processing based on Multi-GPU platforms; Zhang Heng et al.; Journal of Computer Research and Development (计算机研究与发展); 2018-01-15 (Issue 02); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111338695A (en) | 2020-06-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111338695B (en) | Data processing method based on pipeline technology and related product | |
CN112840356B (en) | Operation accelerator, processing method and related equipment | |
CN109543832B (en) | Computing device and board card | |
CN109522052B (en) | Computing device and board card | |
CN109685201B (en) | Operation method, device and related product | |
CN111353591B (en) | Computing device and related product | |
EP3754503B1 (en) | Allocation system, method and apparatus for machine learning, and computer device | |
US20220083857A1 (en) | Convolutional neural network operation method and device | |
CN113449859A (en) | Data processing method and device | |
CN111488177A (en) | Data processing method, data processing device, computer equipment and storage medium | |
CN112163601A (en) | Image classification method, system, computer equipment and storage medium | |
CN118132156B (en) | Operator execution method, device, storage medium and program product | |
CN113449841A (en) | Method and device for inserting conversion operator | |
CN109711540B (en) | Computing device and board card | |
CN111047045B (en) | Distribution system and method for machine learning operation | |
CN108491924B (en) | Neural network data serial flow processing device for artificial intelligence calculation | |
EP3447690A1 (en) | Maxout layer operation apparatus and method | |
CN119358617B (en) | Method, device, equipment and storage medium for constructing computing power engine | |
CN112596872A (en) | Task scheduling method, task preprocessing method, task processing device, task processing unit and task processing medium | |
CN112348182B (en) | Neural network maxout layer computing device | |
CN114330682A (en) | Hardware Architecture and Computation Method Applied to Fastformer Neural Network | |
CN111382848B (en) | Computing device and related product | |
CN109711538B (en) | Operation method, device and related product | |
CN109740730B (en) | Operation method, device and related product | |
CN116804915A (en) | Data interaction method, processor, device and medium based on memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||