
CN117437113A - System, method and storage medium for accelerating image data - Google Patents


Info

Publication number
CN117437113A
CN117437113A (application CN202311443683.3A)
Authority
CN
China
Prior art keywords
data
image data
matrix
instruction
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311443683.3A
Other languages
Chinese (zh)
Inventor
田旭
李文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Processor Technology Innovation Center
Original Assignee
Shanghai Processor Technology Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Processor Technology Innovation Center filed Critical Shanghai Processor Technology Innovation Center
Priority to CN202311443683.3A priority Critical patent/CN117437113A/en
Publication of CN117437113A publication Critical patent/CN117437113A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/60 - Memory management
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/20 - Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure discloses a system, method, and storage medium for acceleration processing of image data. The system comprises: a transmission network; a main memory for storing an input operation instruction and image data; the instruction cache is connected with the main memory through a transmission network and is used for caching the operation instruction received from the main memory; the data cache is connected with the main memory through a transmission network and is used for caching the image data received from the main memory; and a processing unit array composed of a plurality of processing units, which are respectively connected with the instruction cache and the data cache through a transmission network, wherein each processing unit is used for executing matrix multiplication and addition operation on the image data in the data cache according to the operation instruction in the instruction cache. The disclosed embodiments utilize the highly parallel nature of the data flow architecture to reduce bandwidth and energy consumption in data transmission, and perform matrix multiply-add operations in parallel through the processing unit array, thereby implementing an extensible and portable image data processing acceleration system.

Description

System, method and storage medium for accelerating image data
Technical Field
The present disclosure relates generally to the field of hardware accelerator technology. More particularly, the present disclosure relates to a system, method, and storage medium for acceleration processing of image data.
Background
The data flow architecture is a computer architecture that relies on data flow graph computation, which enables a compiler to schedule multiple sequential loops and functions simultaneously. It offers advantages such as low memory-access requirements and low synchronization overhead, and therefore exhibits excellent performance in neural network and scientific computing applications, such as the matrix multiply-add operations used in image processing.
Matrix multiply-add operations are an important component of scientific computing applications, as well as one of the basic and computationally intensive operations in machine learning and deep learning. They can be used to represent and perform many important linear operations, such as linear transformations, linear classifiers, linear regression, convolution, and embedding. Matrix multiply-add can also be used to describe and optimize the structure and parameters of neural networks. Its efficiency has an important effect on the performance and scalability of machine learning and deep learning, and the operation is widely applied in the fields of image processing and signal processing.
However, the existing computer architecture is limited, and the computing efficiency is still poor when performing large-scale matrix multiply-add operations. In view of this, it is desirable to provide an acceleration scheme for image data processing in order to achieve a more rapid and efficient matrix multiply-add operation of image data.
Disclosure of Invention
To address at least one or more of the technical problems mentioned above, the present disclosure proposes, among other things, an acceleration scheme for image data processing.
In a first aspect, the present disclosure provides a system for accelerating processing of image data comprising: a transmission network; a main memory for storing an input operation instruction and image data; the instruction cache is connected with the main memory through a transmission network and is used for caching the operation instruction received from the main memory; the data cache is connected with the main memory through a transmission network and is used for caching the image data received from the main memory; and a processing unit array composed of a plurality of processing units, which are respectively connected with the instruction cache and the data cache through a transmission network, wherein each processing unit is used for executing matrix multiplication and addition operation on the image data in the data cache according to the operation instruction in the instruction cache.
In some embodiments, the processing unit array comprises 16 processing units, each of which performs matrix multiply-add operations of 64×64 scale.
In some embodiments, the processing unit array comprises n² processing units connected in the form of a data flow graph, so that the size of the processing unit array is n×n, where n is a positive integer.
In a second aspect, the present disclosure provides a method for accelerating processing of image data, the method being applied to a system as set out in any one of the first aspects, the method comprising: carrying input image data from a main memory to a data cache; carrying an input operation instruction from a main memory to an instruction cache; acquiring a data reading instruction from an instruction cache and executing the data reading instruction in a processing unit array so as to distribute image data from the data cache to a plurality of processing units; acquiring a matrix multiplication and addition instruction from an instruction cache and executing the matrix multiplication and addition instruction in a processing unit array to generate a matrix multiplication and addition result based on image data; outputting the matrix multiplication and addition result to a data cache; and returning the matrix multiplication and addition result to the main memory.
In some embodiments, carrying the input image data from the main memory to the data cache comprises: dividing the input image data into several matrix data; and carrying the several matrix data from the main memory to the data cache.
In some embodiments, the processing unit array includes 16 processing units, each of which performs matrix multiply-add operations of 64×64 scale; and dividing the input image data into several matrix data includes: dividing image data of m×m scale into (m/64)² matrix data, each of size 64×64, where m is a positive number and a multiple of 64.
In some embodiments, configuration information is stored in the main memory; wherein dividing the input image data into a number of matrix data includes: the image data is divided into matrix data different in size from the image data according to the configuration information.
In some embodiments, wherein retrieving the data read instruction from the instruction cache and executing in the processing unit array comprises: the data read instruction is executed by an instruction offset technique to avoid retrieving consecutive rows or consecutive columns of matrix data from the data cache.
In some embodiments, wherein during the transferring of the input image data from the main memory to the data cache, the method further comprises: converting the image data carried in the main memory into SIMD data, so that the matrix multiplication and addition task of the image data is split into a plurality of matrix multiplication and addition subtasks; wherein generating a matrix multiply-add result based on the image data comprises: and distributing the matrix multiplication and addition subtasks to the processing units for execution to generate matrix multiplication and addition results.
In a third aspect, the present disclosure provides a computer storage medium having stored thereon computer readable instructions which, when executed by one or more processors, implement the method of any of the second aspects.
With the system for accelerated processing of image data provided above, embodiments of the present disclosure utilize the highly parallel nature of the data flow architecture to transmit image data through the transmission network, reducing memory accesses and thus the overhead of bandwidth and energy consumption. Matrix multiply-add operations are performed in parallel on a processing unit array formed of several processing units, adapting to matrix data of different sizes and shapes and to the characteristics and constraints of different hardware platforms, thereby realizing an image data processing acceleration system with high scalability and portability.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates an exemplary block diagram of an image data acceleration processing system of some embodiments of the present disclosure;
FIG. 2 illustrates an exemplary block diagram of an image data acceleration processing system of some embodiments of the present disclosure;
FIG. 3 illustrates an exemplary flow chart of an image data acceleration processing method of some embodiments of the present disclosure;
fig. 4 illustrates an exemplary flow chart of a data handling method of some embodiments of the present disclosure.
Detailed Description
The following describes the embodiments of the present disclosure clearly and completely with reference to the accompanying drawings; it is evident that the described embodiments are some, but not all, of the embodiments of the disclosure. All other embodiments obtained by those skilled in the art based on the embodiments in this disclosure without inventive effort fall within the scope of the present disclosure.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary application scenarios
Matrix multiply-add operations are an important component of scientific computing applications, and one of the most basic and computationally intensive operations in machine learning and deep learning. In the fields of image processing and signal processing, matrix multiply-add operations may be used to represent and perform many important linear operations, such as linear transformations, linear classifiers, linear regression, convolution, and embedding. Therefore, improving the efficiency with which a computer performs matrix multiply-add operations can, to a certain extent, improve the performance of machine learning and deep learning models, and in turn the efficiency of image data processing.
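All of the linear operations above reduce to the same primitive, D = A·B + C. As a minimal illustration (plain Python over lists of lists, an arithmetic sketch rather than the hardware implementation), the primitive the accelerator parallelizes looks like this:

```python
def matmul_add(a, b, c):
    """Compute D = A @ B + C for matrices given as lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]

# A 2x2 example of the primitive the PE array executes per 64x64 tile.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
D = matmul_add(A, B, C)  # [[20, 22], [43, 51]]
```

A convolution, a linear layer, or a linear regression step each instantiate this same call with differently shaped operands.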
However, matrix multiply-add operations may involve a large number of multiply and add operations, which require a significant amount of computational resources and time to be consumed. In addition, matrix multiply add operations involve a large amount of memory access and data transfer, requiring a large amount of bandwidth and power consumption. Also, the performance of matrix multiply add operations is also limited by the size and shape of the matrix, as well as the characteristics and constraints of the hardware platform. Therefore, existing computer architecture is still not efficient enough to perform large-scale matrix multiply-add operations.
In view of this, the embodiments of the present disclosure provide an acceleration scheme for image data processing, which uses the highly parallel feature of a data flow architecture and the parallel execution of matrix multiply-add operation by a processing unit array, so as to reduce the bandwidth and energy consumption during data transmission and improve the execution efficiency of the matrix multiply-add operation.
FIG. 1 illustrates an exemplary block diagram of an image data acceleration processing system 100, which may also be referred to as a graphics processing unit (GPU) acceleration card. Such a card is typically a separate expansion card that can be inserted into a computer to accelerate various computing tasks, such as graphics processing, scientific computing, deep learning training, data mining, and big data analysis, and can significantly improve computing performance and execution efficiency.
As shown in FIG. 1, the image data acceleration processing system 100 includes: a transmission network, a main memory, an instruction cache CUBF, a data cache SPM, and a processing element (PE) array, wherein the main memory, the instruction cache CUBF, the data cache SPM, and the PE array are communicatively connected through the transmission network for transmitting data and instructions.
In the system of the present embodiment, the main memory is used to store the input image data and operation instructions. The image data and operation instructions are transmitted to the image data acceleration processing system by the processor in the computer that executes the image data processing; specifically, the processor transmits the image data and operation instructions into the main memory, and the system then accelerates the image data processing.
In the system of the present embodiment, the instruction cache CUBF and the data cache SPM are both communicatively connected to the main memory through the transmission network; for example, data communication between the instruction cache CUBF and the main memory may be implemented through a Direct Memory Access (DMA) module (not shown in the figure). The instruction cache is used for caching operation instructions received from the main memory, and the data cache is used for caching image data received from the main memory. The PE array is communicatively connected to the instruction cache and the data cache through the transmission network, and can read and execute the required operation instructions from the instruction cache and read the image data required for the operation from the data cache. In addition, the data cache SPM may also be used to cache data other than image data, such as the weight data required for convolution operations and the intermediate and output results computed by the PE array.
It should be noted that the system of this embodiment is designed based on a DFGPU-E accelerator, whose architecture largely inherits that of the GPDPU accelerator, except that the DFGPU-E accelerator focuses on half-precision-scale data processing, including FP16-type data and INT8-type data.
In view of the architecture characteristics of the DFGPU-E accelerator, the data transmission path is as follows when the PE array in this embodiment reads the operation instruction and the image data: the input image data is firstly transmitted from the main memory to the data buffer SPM, the input operation instruction is transmitted from the main memory to the instruction buffer CUBF, and then the PE array reads the image data from the data buffer SPM through the data reading instruction in the instruction buffer CUBF.
Further, the PE array is composed of a plurality of PEs, each of which performs matrix multiply-add operations on the image data in the data cache according to the operation instructions in the instruction cache, so that the matrix multiply-add operation on the image data is completed by the PEs in parallel. In some embodiments, the PE array comprises n² PEs connected in the form of a data flow graph, where n is a positive integer, thus forming a PE matrix of size n×n. Illustratively, the PE array can include 16 PEs arranged in 4 rows and 4 columns.
The data flow architecture is a computer architecture, and the data flow architecture relies on data flow graph computation, so that a compiler can arrange a plurality of sequential loops and functions at the same time, and the data flow architecture has the advantages of less memory access requirements, low synchronization overhead and the like. In a dataflow architecture, a program may first generate a dataflow graph from data dependencies among instructions, and then execute by mapping algorithms onto processing units. When an operand is transferred from an upstream node to a downstream node in a dataflow graph, the node that obtains the operand can perform operations, so that the program of the dataflow architecture proceeds according to the dataflow graph generated by the data dependency.
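As a concrete, deliberately simplified illustration of this firing rule, the following Python sketch executes a dataflow graph in which a node runs as soon as all of its operands have arrived. The two-node example graph computes a multiply followed by an add, the dependency pattern underlying a multiply-add; the graph representation and node names are illustrative, not taken from the patent:

```python
from collections import defaultdict, deque

def run_dataflow(ops, edges, sources):
    """Execute a dataflow graph: a node fires once all operands arrive.

    ops:     {node: function taking its operands in edge order}
    edges:   list of (producer, consumer) pairs, one operand each
    sources: {node: constant value} for the graph inputs
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    values = dict(sources)
    pending = {n: len(preds[n]) for n in ops}
    ready = deque()
    for s in sources:                      # source operands flow downstream
        for d in succs[s]:
            pending[d] -= 1
            if pending[d] == 0:
                ready.append(d)
    while ready:                           # fire nodes as operands arrive
        node = ready.popleft()
        values[node] = ops[node](*[values[p] for p in preds[node]])
        for d in succs[node]:
            pending[d] -= 1
            if pending[d] == 0:
                ready.append(d)
    return values

# Multiply-then-add: the dependency pattern of a multiply-add operation.
result = run_dataflow(
    ops={"mul": lambda a, b: a * b, "add": lambda m, c: m + c},
    edges=[("a", "mul"), ("b", "mul"), ("mul", "add"), ("c", "add")],
    sources={"a": 3, "b": 4, "c": 5},
)  # result["add"] == 3 * 4 + 5 == 17
```

The scheduler mirrors the text above: an operand passed from an upstream node enables the downstream node, so execution order is dictated by data dependencies rather than program order.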
In this embodiment, PEs in the PE array are connected in the form of a data flow graph, and a program in the processor generates the data flow graph according to the data dependency relationship between matrix multiply-add instructions, and then maps the data flow graph to the PE array, so as to implement that a plurality of PEs execute matrix multiply-add operations simultaneously.
Further, since each PE is most efficient when performing matrix multiply-add operations of 64×64 scale, in order to achieve accelerated processing of image data, each processing unit may be set to perform matrix multiply-add operations of 64×64 scale. In view of the arrangement of PEs in the PE array and the matrix size each PE can handle, the input image data can be partitioned into matrix blocks before the matrix multiply-add operation is performed. For example, small-scale image data of 64×64, 128×128, and 256×256 can be split into 1, 4, and 16 matrix blocks of size 64×64, respectively; for a PE array consisting of 16 PEs, only 1, 4, and 16 PEs, respectively, are then needed to complete the matrix multiply-add operation in parallel.
Further, when facing a large-scale matrix multiply-add operation such as 512×512 or 1024×1024, for a PE array composed of 16 PEs, the image data may first be divided into 16 matrix blocks allocated to the 16 PEs, and matrix blocking then continues within each PE. Taking a 1024×1024 large-scale matrix as an example, it can be split into 16 blocks of 256×256 matrix data assigned in full to the 16 PEs, which are arranged in a 4×4 array. Each 256×256 block may then be further divided into 16 blocks of 64×64 matrix data, arranged in a 4×4 grid, and processed at the 64×64 scale. It follows that a 1024×1024 matrix is in essence processed as 256 matrix operations of 64×64 scale, logically arranged as a 16×16 grid.
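The two-level blocking just described can be checked with a short Python sketch (an illustration of the arithmetic, not of the hardware mapping): it enumerates the origins of the 64×64 tiles produced when a 1024×1024 matrix is cut into a 4×4 grid of 256×256 blocks, each block then cut into a 4×4 grid of 64×64 sub-tiles:

```python
def tile_origins(m, outer=4, pe_tile=64):
    """Enumerate (row, col) origins of the pe_tile x pe_tile tiles when an
    m x m matrix is cut into an outer x outer grid of blocks, and each
    block is cut again into pe_tile-sized sub-tiles."""
    block = m // outer              # 1024 // 4 = 256: one block per PE
    sub = block // pe_tile          # 256 // 64 = 4 sub-tiles per side
    return [
        (bi * block + ti * pe_tile, bj * block + tj * pe_tile)
        for bi in range(outer) for bj in range(outer)   # which PE
        for ti in range(sub) for tj in range(sub)       # tile inside the PE
    ]

tiles = tile_origins(1024)  # 16 blocks x 16 sub-tiles = 256 tiles of 64x64
```

The 256 origins confirm the count in the text: a 1024×1024 matrix decomposes into 256 operations at the 64×64 granularity.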
In order to fully utilize the computing resources of each PE, when dividing the input image data, the image data may be divided into several matrix data of the same size as the matrix processed by the PE. It should be noted that, when performing matrix division, it is necessary to determine the matrix size that each PE can process and the matrix size corresponding to the image data, and based on this, the main memory of some embodiments of the disclosure may also be used to store configuration information transmitted by the processor, where the configuration information reflects information such as the matrix size corresponding to the image data.
Because a plurality of PEs are arranged in each row or column of the PE array, and each PE has matrix data to process and computation results to output, some embodiments of the disclosure provide an image data acceleration processing system that allocates a separate data cache SPM to the PEs of each row or column. Each data cache SPM is connected to the main memory and to its PEs through the transmission network, which is responsible for data transmission between the main memory and the data cache SPM and between the PEs and the data cache SPM. The instruction cache CUBF is connected to the DMA module through the transmission network and exchanges information with the main memory through the DMA module (not shown in the figure). Fig. 2 illustrates an exemplary structure diagram of an image data acceleration processing system 200 of some embodiments of the present disclosure. As shown in fig. 2, the system 200 includes a plurality of data caches; for example, for a PE array formed by 16 PEs, SPM0, SPM1, SPM2, and SPM3 may be configured to cache the matrix data required by the first, second, third, and fourth columns of PEs, respectively, together with the calculation results those columns output.
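One plausible reading of this per-column layout can be written down in a few lines of Python (the grouping is illustrative; beyond the Fig. 2 example, the patent allows either a per-row or a per-column assignment):

```python
def spm_assignments(rows=4, cols=4):
    """Group the PEs of a rows x cols array by the column data cache
    (SPM0 .. SPM<cols-1>) that stages their operands and results."""
    return {f"SPM{c}": [(r, c) for r in range(rows)] for c in range(cols)}

caches = spm_assignments()
# Every PE in column 2 of the 4x4 array reads from and writes to SPM2.
```

Giving each column its own cache means the four columns can fetch operands concurrently instead of contending for a single shared buffer.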
The image data acceleration processing system described in the embodiments above makes full use of the highly parallel nature of the data flow architecture, performing data transfer and sharing through it so as to reduce operand and memory accesses and, in turn, the overhead of bandwidth and energy consumption. Meanwhile, deep pipeline-stage fusion of the PEs in the array achieves high parallelism and pipelining in the matrix multiply-add stage, improving matrix multiply-add efficiency. In addition, the architecture can adapt to image data of different sizes and shapes and to the characteristics and constraints of different hardware platforms; that is, the system has high scalability and portability.
An image data acceleration processing method applied to the above-described image data acceleration processing system is described below with reference to the drawings.
Fig. 3 illustrates an exemplary flowchart of an image data acceleration processing method 300 of some embodiments of the present disclosure. As shown in fig. 3, in step S301, input image data is transferred from a main memory to a data buffer. In the present embodiment, image data and operation instructions transmitted to the image data acceleration processing system by a processor executing image data processing in a computer are stored in a main memory.
Further, in some embodiments, before the data is transferred from the main memory to the data cache, various elements in the data flow architecture may be initialized, including but not limited to the main memory and the data cache, so as to initialize matrix data of different sizes in both.
In step S302, an input operation instruction is transferred from the main memory to the instruction cache. Because the system of this embodiment is designed based on the DFGPU-E accelerator, in view of its architecture, input image data is first transferred from the main memory to the data cache SPM and input operation instructions from the main memory to the instruction cache CUBF; the PE array then reads the image data from the data cache SPM via a data read instruction in the instruction cache CUBF. In some embodiments, the operation instructions may include data read instructions and matrix multiply-add instructions, where a data read instruction instructs the PE array to read the required data from the data cache, and a matrix multiply-add instruction instructs the PE array to perform a matrix multiply-add operation on the read data.
In step S303, a data read instruction is fetched from the instruction cache and executed in the processing unit array to distribute the image data from the data cache to the several processing units. In some embodiments, the image data acceleration processing system employs an instruction offset technique at the hardware level, through which data read instructions are executed to avoid retrieving consecutive rows or consecutive columns of matrix data from the data cache.
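The patent names the instruction offset technique but does not detail its mechanism; the following is therefore a hypothetical Python sketch of one such schedule, in which PE k starts its row fetches at row k and wraps around, so that no two PEs request the same row of the data cache in the same step:

```python
def offset_fetch_order(pe_index, num_rows):
    """Hypothetical instruction-offset schedule: PE k starts its row
    fetches at row k and wraps around, instead of every PE walking
    rows 0, 1, 2, ... in lockstep."""
    return [(pe_index + step) % num_rows for step in range(num_rows)]

orders = [offset_fetch_order(k, 4) for k in range(4)]
# Step 0 fetches rows 0, 1, 2, 3 across the four PEs: at every step the
# PEs touch four distinct rows of the data cache.
```

Staggering the starting index is a common way to keep simultaneously issued reads from piling onto the same consecutive rows or columns, which is the conflict the text says the technique avoids.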
In step S304, a matrix multiply-add instruction is acquired from the instruction cache and executed in the processing unit array to generate a matrix multiply-add result based on the image data. In step S304, in the execution period of one matrix multiplication and addition instruction, a plurality of PEs in the PE array may perform matrix multiplication and addition operation on the image data in parallel, thereby implementing highly parallel and pipelined operations and improving the processing efficiency of the image data.
Further, in the process of transferring the input image data from the main memory to the data cache, the image data transferred in the main memory can be converted into SIMD data, so that the matrix multiplication and addition task of the image data is split into a plurality of matrix multiplication and addition subtasks, and then when the processing unit array executes the matrix multiplication and addition instruction, the plurality of matrix multiplication and addition subtasks are distributed to a plurality of processing units to be executed in parallel, so as to generate a matrix multiplication and addition result.
SIMD, short for Single Instruction Multiple Data, refers to a single-instruction-stream, multiple-data-stream technique that employs one controller to control multiple processors, performing the same operation on each element of a set of data simultaneously to achieve spatial parallelism. SIMD stores multiple data elements in vector registers and uses a single instruction to operate on them in parallel, so that multiple data elements are processed in one instruction cycle, further improving computation speed.
Based on the SIMD technology, the matrix multiplication and addition task of the image data can be split into a plurality of matrix multiplication and addition subtasks, and each matrix multiplication and addition subtask corresponds to matrix data in one SIMD data format, so that matrix multiplication and addition operation is carried out on the matrix data through a plurality of PEs in a PE array in an instruction period of one matrix multiplication and addition operation, and the matrix multiplication and addition operation speed is improved.
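As a minimal stand-in for this split (the row-band partition below is an assumption chosen for illustration; the actual SIMD data layout is not specified here), the task D = A·B + C can be cut into independent subtasks, one per PE, and the partial results gathered:

```python
def split_matmul_add(a, b, c, num_pes):
    """Cut D = A @ B + C into independent row-band subtasks, one per PE,
    then gather the partial results.  The row-band partition is only a
    stand-in for the SIMD-format split described in the text."""
    n, cols = len(a), len(b[0])
    band = (n + num_pes - 1) // num_pes

    def subtask(lo, hi):            # what a single PE would compute
        return [
            [sum(a[i][k] * b[k][j] for k in range(len(b))) + c[i][j]
             for j in range(cols)]
            for i in range(lo, hi)
        ]

    bands = [subtask(p * band, min((p + 1) * band, n)) for p in range(num_pes)]
    return [row for part in bands for row in part]

A = [[1, 0], [0, 1]]
B = [[2, 3], [4, 5]]
C = [[1, 1], [1, 1]]
D = split_matmul_add(A, B, C, num_pes=2)  # [[3, 4], [5, 6]]
```

Because the subtasks share no intermediate state, they can execute in the same instruction cycle, which is the property the PE array exploits.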
In step S305, the matrix multiply-add result is output to the data buffer. In view of the architecture characteristics of the DFGPU-E accelerator, the matrix multiplication and addition result output by the PE array of this embodiment is first transmitted to the data cache SPM, and then the matrix multiplication and addition result is returned to the main memory through the transmission network.
In step S306, the matrix multiply-add result is returned to the main memory.
By the image data acceleration processing method shown in fig. 3, the embodiment can fully utilize the highly parallel characteristic of the data flow architecture, complete data transmission while reducing the access of operands and memory, and further reduce the cost of bandwidth and energy consumption. Meanwhile, the PE array in the image data acceleration processing system is utilized to carry out matrix multiplication and addition operation with high parallelism and pipelining, so that efficient matrix multiplication and addition operation is realized.
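Steps S301 to S306 can be walked through end to end in a toy Python simulation, with plain dictionaries standing in for the memories and a function standing in for each PE (all names here are illustrative, not from the patent):

```python
def run_acceleration_flow(image_tiles, pes):
    """Toy walk-through of steps S301-S306 with dicts as memories and
    callables as PEs (names illustrative, not from the patent)."""
    main_memory = {"data": image_tiles,
                   "instructions": ["READ", "MATMUL_ADD"]}
    spm = {"in": main_memory["data"]}             # S301: data -> data cache
    cubf = list(main_memory["instructions"])      # S302: instrs -> CUBF
    assert cubf[0] == "READ"                      # S303: distribute tiles
    work = list(zip(pes, spm["in"]))
    assert cubf[1] == "MATMUL_ADD"                # S304: parallel multiply-add
    spm["out"] = [pe(tile) for pe, tile in work]  # S305: results -> data cache
    main_memory["result"] = spm["out"]            # S306: back to main memory
    return main_memory["result"]

def tiny_pe(tile):
    """A 'PE': matrix multiply-add on one (A, B, C) tile triple."""
    a, b, c = tile
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) + c[i][j]
             for j in range(len(b[0]))] for i in range(len(a))]

tiles = [([[1]], [[2]], [[3]]), ([[4]], [[5]], [[6]])]
out = run_acceleration_flow(tiles, [tiny_pe, tiny_pe])  # [[[5]], [[26]]]
```

The two 1×1 tiles are processed independently, mirroring how the real system stages data in the SPM, dispatches it to PEs, and returns results through the SPM to main memory.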
In the acceleration processing of the image data, the input image data needs to be distributed to a plurality of PEs, so that in the execution of step S301, matrix blocking can be performed on the input image data. To illustrate this process, fig. 4 shows an exemplary flow chart of a data handling method 400 of some embodiments of the present disclosure. It will be appreciated that the data handling method is a specific implementation in step S301 described above, and so the features described above in connection with fig. 3 may be similarly applied thereto.
As shown in fig. 4, in step S401, the input image data is divided into several matrix data. In some embodiments, configuration information transmitted from the processor is stored in the main memory, and when step S401 is performed, the image data may be divided according to the configuration information into matrix data of a size different from that of the image data; for example, if the matrix size processed by each PE is 64×64 and the input image data is of size 128×128, step S401 divides the image data into several matrix data of size 64×64.
Taking a PE array made up of 16 PEs as an example, and assuming that each PE performs a matrix multiply-add operation at 64×64 scale, image data of m×m scale may be divided in the main memory, according to the configuration information, into (m/64)² matrix data when step S401 is performed, each of size 64×64, where m is a positive multiple of 64. For example, if 128×128 image data is received from the processor, it may be divided into four 64×64 matrix data and stored in the main memory.
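This blocking rule can be sketched as follows (a Python illustration under the stated assumptions; `split_into_tiles` is a hypothetical name):

```python
import numpy as np

def split_into_tiles(image, tile=64):
    # Split an m x m matrix (m a positive multiple of `tile`) into
    # (m / tile)^2 tiles of size tile x tile, row-major over positions.
    m, n = image.shape
    assert m == n and m % tile == 0
    return [image[r:r + tile, c:c + tile]
            for r in range(0, m, tile)
            for c in range(0, m, tile)]

tiles = split_into_tiles(np.zeros((128, 128)))
print(len(tiles))  # 4 tiles, each 64 x 64
```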
In step S402, the plurality of matrix data are carried from the main memory to the data cache. In some embodiments, the image data acceleration processing system may be as shown in fig. 2, which allocates one data cache SPM to each row or column of PEs individually. That system comprises a PE array formed by 16 PEs, with SPM0, SPM1, SPM2, and SPM3 arranged to buffer the matrix data required by the first, second, third, and fourth columns of PEs, respectively, together with the calculation results those PEs output.
Based on the image data acceleration processing system shown in fig. 2, the plurality of matrix data obtained by the division in step S401 may each be output to the data cache at the corresponding position, so that the PE that subsequently processes a given matrix data reads it from the data cache at its corresponding position.
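One possible placement rule for the fig. 2 layout can be sketched as follows (purely illustrative, as the disclosure does not specify the exact mapping; `spm_index` is a hypothetical name):

```python
# Hypothetical tile-to-SPM placement for a 4-column PE array with one
# SPM per column (SPM0..SPM3, as in fig. 2): tile k is buffered in the
# SPM of the PE column that will consume it, so each column's PEs read
# only from their own data cache.
def spm_index(tile_index, num_spms=4):
    return tile_index % num_spms

placement = {k: spm_index(k) for k in range(16)}  # 16 tiles -> 4 SPMs
```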
Combining the image data acceleration processing method shown in fig. 3 with the data handling method shown in fig. 4, the task executed by the image data acceleration processing system of the embodiments of the present disclosure may be split into three subtasks: data preprocessing, PE allocation, and result feedback.
The data preprocessing subtask divides the image data into matrix data of different scales based on the configuration information in the main memory, and carries the divided matrix data from the main memory into the data cache SPM; for example, image data of 256×256 scale is divided into 16 matrix data of 64×64 scale.
The PE allocation subtask assigns the matrix data in the data cache SPM to the individual PEs; for example, the above 16 matrix data of 64×64 scale are assigned to 16 PEs, so that each PE outputs a matrix multiply-add result.
The result feedback subtask returns the matrix multiply-add result output by each PE to the main memory, with the data cache serving as a transfer station.
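The three subtasks can be put together in a small software model (a NumPy sketch of blocked matrix multiplication under the stated assumptions, not the hardware implementation; all names are illustrative):

```python
import numpy as np

TILE = 64

def tile_grid(mat, tile=TILE):
    # Subtask 1 (data preprocessing): block a matrix into a grid of
    # tile x tile submatrices, as the main memory would before the
    # tiles are carried into the data cache SPM.
    m = mat.shape[0]
    return [[mat[r:r + tile, c:c + tile] for c in range(0, m, tile)]
            for r in range(0, m, tile)]

def accelerate(a, b, tile=TILE):
    at, bt = tile_grid(a, tile), tile_grid(b, tile)
    n = len(at)
    out = np.zeros_like(a)
    for i in range(n):
        for j in range(n):
            # Subtask 2 (PE allocation): each 64x64 multiply-add in
            # this chain corresponds to one PE's unit of work.
            acc = np.zeros((tile, tile))
            for k in range(n):
                acc = at[i][k] @ bt[k][j] + acc
            # Subtask 3 (result feedback): write the finished tile back
            # to "main memory" (here, the output array).
            out[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile] = acc
    return out
```

For a 256×256 input this performs the 16-tile decomposition described above and reproduces the full matrix product exactly.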
In summary, the present disclosure provides an image data acceleration processing system that uses the highly parallel nature of a dataflow architecture and a processing unit array to execute matrix multiply-add operations in parallel. This reduces the bandwidth and energy consumption required for image data transfers, and accelerates image data processing through highly parallel, pipelined matrix multiply-add operations, thereby improving the efficiency with which a computer performs image data processing.
The embodiments of the disclosure also provide an image data acceleration processing method applied to any of the image data acceleration processing systems of the disclosure, which completes data transfers while reducing operand and memory accesses. Meanwhile, the PE array in the image data acceleration processing system can be used to perform highly parallel, pipelined matrix multiply-add operations, achieving efficient matrix multiply-add computation.
The present disclosure may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon computer-readable instructions (or computer programs, or computer instruction codes) which, when executed by a processor of an electronic device (or electronic device, server, etc.), cause the processor to perform part or all of the steps of the above-described methods according to the present disclosure.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and are therefore to cover all equivalents or alternatives falling within the scope of these claims.

Claims (10)

1. A system for accelerating processing of image data, comprising:
a transmission network;
a main memory for storing an input operation instruction and image data;
an instruction cache connected with the main memory through the transmission network for caching operation instructions received from the main memory;
a data cache connected to the main memory via the transmission network for buffering image data received from the main memory; and
a processing unit array composed of a plurality of processing units and connected to the instruction cache and the data cache respectively through the transmission network, wherein each processing unit is used for executing matrix multiply-add operations on the image data in the data cache according to the operation instructions in the instruction cache.
2. The system of claim 1, wherein the array of processing units comprises 16 processing units, wherein each processing unit performs a 64 x 64 scale matrix multiply-add operation.
3. The system of claim 1 or 2, wherein the array of processing units comprises n² processing units connected in the form of a dataflow graph, and the scale of the processing unit array is n×n, wherein n is a positive integer.
4. A method for acceleration processing of image data, characterized in that the method is applied to a system according to any of claims 1-3, the method comprising:
carrying input image data from the main memory to the data cache;
carrying the input operation instruction from the main memory to the instruction cache;
acquiring a data reading instruction from the instruction cache and executing the data reading instruction in the processing unit array so as to distribute the image data from the data cache to a plurality of processing units;
acquiring a matrix multiplication and addition instruction from the instruction cache and executing the matrix multiplication and addition instruction in the processing unit array to generate a matrix multiplication and addition result based on the image data;
outputting the matrix multiplication and addition result to the data cache; and
and returning the matrix multiplication and addition result to the main memory.
5. The method of claim 4, wherein the carrying of the input image data from the main memory to the data cache comprises:
dividing input image data into a plurality of matrix data; and
and carrying the matrix data from the main memory to the data cache.
6. The method of claim 5, wherein the array of processing units comprises 16 processing units, wherein each processing unit performs a 64 x 64 scale matrix multiply add operation;
wherein dividing the input image data into a number of matrix data includes:
dividing image data of m×m scale into (m/64)² matrix data, wherein each matrix data has a size of 64×64, and wherein m is a positive multiple of 64.
7. The method of claim 5 or 6, wherein configuration information is stored in the main memory;
wherein dividing the input image data into a number of matrix data includes:
the image data is divided into matrix data of a size different from the image data according to the configuration information.
8. The method of claim 4, wherein fetching a data read instruction from the instruction cache and executing in the processing unit array comprises:
and executing a data reading instruction through an instruction offset technique to avoid acquiring matrix data of consecutive rows or consecutive columns from the data cache.
9. The method of claim 4, wherein during the carrying of the input image data from the main memory to the data cache, the method further comprises:
converting the image data carried in the main memory into SIMD data, so that the matrix multiplication and addition task of the image data is split into a plurality of matrix multiplication and addition subtasks;
wherein generating a matrix multiply-add result based on the image data comprises: and distributing the matrix multiplication and addition subtasks to a plurality of processing units for parallel execution to generate the matrix multiplication and addition result.
10. A computer storage medium having stored thereon computer readable instructions which, when executed by one or more processors, implement the method of any of claims 4-9.
CN202311443683.3A 2023-11-01 2023-11-01 System, method and storage medium for accelerating image data Pending CN117437113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311443683.3A CN117437113A (en) 2023-11-01 2023-11-01 System, method and storage medium for accelerating image data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311443683.3A CN117437113A (en) 2023-11-01 2023-11-01 System, method and storage medium for accelerating image data

Publications (1)

Publication Number Publication Date
CN117437113A true CN117437113A (en) 2024-01-23

Family

ID=89547732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311443683.3A Pending CN117437113A (en) 2023-11-01 2023-11-01 System, method and storage medium for accelerating image data

Country Status (1)

Country Link
CN (1) CN117437113A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2026012509A1 (en) * 2024-07-12 2026-01-15 Moffett International Co., Limited Hybrid network-on-chip (noc) for thread synchronization in many-core neural network accelerators
CN120179420A (en) * 2025-05-20 2025-06-20 北京清微智能科技有限公司 Processor chip, collective communication method and electronic device
CN120179420B (en) * 2025-05-20 2025-09-16 北京清微智能科技有限公司 Processor chip, aggregate communication method and electronic equipment

Similar Documents

Publication Publication Date Title
US20220365753A1 (en) Accelerated mathematical engine
CN107679620B (en) Artificial Neural Network Processing Device
US9015354B2 (en) Efficient complex multiplication and fast fourier transform (FFT) implementation on the ManArray architecture
US7584342B1 (en) Parallel data processing systems and methods using cooperative thread arrays and SIMD instruction issue
EP3659074B1 (en) Vector computational unit
US7788468B1 (en) Synchronization of threads in a cooperative thread array
Tanomoto et al. A CGRA-based approach for accelerating convolutional neural networks
US7861060B1 (en) Parallel data processing systems and methods using cooperative thread arrays and thread identifier values to determine processing behavior
CN109522254B (en) Arithmetic device and method
CN107704922B (en) Artificial Neural Network Processing Device
WO2019152069A1 (en) Instruction architecture for a vector computational unit
CN111047036A (en) Neural network processor, chip and electronic equipment
CN117437113A (en) System, method and storage medium for accelerating image data
Sunitha et al. Performance improvement of CUDA applications by reducing CPU-GPU data transfer overhead
CN111047035A (en) Neural network processor, chip and electronic equipment
CN104615584B (en) The method for solving vectorization calculating towards GPDSP extensive triangular linear equation group
EP4206999A1 (en) Artificial intelligence core, artificial intelligence core system, and loading/storing method of artificial intelligence core system
CN111091181A (en) Convolution processing unit, neural network processor, electronic device and convolution operation method
CN120508740B (en) CNN-oriented batch matrix multiplication parallel optimization method and system on Shenwei architecture
de Dinechin et al. Deep learning inference on the mppa3 manycore processor
de Dinechin et al. A qualitative approach to many-core architecture
US12423104B2 (en) Clipping operations using partial clip instructions
US20250036363A1 (en) Flooring divide using multiply with right shift
US12106102B1 (en) Vector clocks for highly concurrent execution engines
CN116627494B (en) Processor and processing method for parallel instruction transmission

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination