CN116362303A - Data processing device, data processing method and related device - Google Patents
- Publication number
- CN116362303A (application CN202111584166.9A)
- Authority
- CN
- China
- Prior art keywords
- space
- data
- stored
- sub
- address space
- Prior art date: 2021-12-22
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a data processing device, a data processing method and a related device, comprising a neural network processor. The N-1-line tail copy space after the tail address space of any annular cache space forming a circular cache space is used for storing a copy of the data in the front N-1-line head address space of the next annular cache space, and the N-1-line head copy space before the head address space of any annular cache space forming the circular cache space is used for storing a copy of the data in the rear N-1-line tail address space of the previous annular cache space. The frequency of accessing the memory during data reading can thereby be reduced, and the bus bandwidth occupation and system power consumption are reduced.
Description
Technical Field
The present disclosure relates to the technical field of neural network processors, and in particular, to a data processing apparatus, a data processing method, and a related apparatus.
Background
With the development of artificial intelligence, a neural network processor (Neural-network Processing Unit, NPU) is commonly integrated into a system to enhance the artificial intelligence capability of a device. An NPU generally adopts a data-driven parallel computing architecture to accelerate neural network operations and to overcome the low efficiency of conventional chips at such workloads. How to reduce the power consumption of the NPU has become a challenge.
Disclosure of Invention
In view of this, the present application provides a data processing apparatus, a data processing method and a related apparatus, which can reduce the frequency of accessing the memory during data reading through a specific storage space architecture, and reduce the bus bandwidth occupation and the system power consumption.
In a first aspect, an embodiment of the present application provides a data processing apparatus, including a neural network processor, where the neural network processor includes a processing unit array and M storage modules, the processing unit array includes M columns of processing unit sets, M is a positive even number, the neural network processor is applicable to a convolution kernel of NxN, and N is a positive integer greater than 1 and less than x, where x is defined below;
each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x line cache address spaces, where x is a positive integer greater than 2, the front N-1 lines forming a head address space and the rear N-1 lines forming a tail address space; the N-1-line tail copy space after the tail address space of any annular cache space forming a circular cache space is used for storing a copy of the data in the front N-1-line head address space of the next annular cache space, and the N-1-line head copy space before the head address space of any annular cache space forming the circular cache space is used for storing a copy of the data in the rear N-1-line tail address space of the previous annular cache space;
the M storage modules are used for storing data to be stored in a distributed manner, and the M columns of processing unit sets are used for reading the distributed data to be stored from the M storage modules.
In a second aspect, an embodiment of the present application provides a data processing method, which is applied to the data processing apparatus according to the first aspect of the embodiment of the present application, where the method includes:
acquiring data to be stored;
dividing the data to be stored into n sub-data to be stored according to the channel number n of the data to be stored, wherein n is a positive integer less than or equal to M/2;
and writing each sub-data to be stored into a circular cache space formed by INT(M/n) annular cache spaces, wherein each circular cache space is used for storing one sub-data to be stored.
In a third aspect, embodiments of the present application provide an electronic device, including a memory for storing a program, and a processor for executing the program stored in the memory, where the processor is configured to perform the steps of the method according to any one of the second aspects of the embodiments of the present application when the program stored in the memory is executed.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform a method according to any of the second aspects of embodiments of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps described in any of the methods of the second aspect of embodiments of the present application. The computer program product may be a software installation package.
It can be seen that, with the above data processing apparatus, data processing method and related apparatus, the apparatus includes a neural network processor, where the neural network processor includes a processing unit array and M storage modules, the processing unit array includes M columns of processing unit sets, M is a positive even number, the neural network processor is applicable to a convolution kernel of NxN, and N is a positive integer greater than 1 and less than x; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x line cache address spaces, where x is a positive integer greater than 2, the front N-1 lines forming a head address space and the rear N-1 lines forming a tail address space; the N-1-line tail copy space after the tail address space of any annular cache space forming a circular cache space is used for storing a copy of the data in the front N-1-line head address space of the next annular cache space, and the N-1-line head copy space before the head address space of any annular cache space forming the circular cache space is used for storing a copy of the data in the rear N-1-line tail address space of the previous annular cache space; the M storage modules are used for storing data to be stored in a distributed manner, and the M columns of processing unit sets are used for reading the distributed data to be stored from the M storage modules. The frequency of accessing the memory during data reading can thus be reduced, and the bus bandwidth occupation and system power consumption are reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a memory module according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an annular buffer space according to an embodiment of the present disclosure;
FIG. 4 is an exemplary block diagram of a circular cache space according to an embodiment of the present application;
fig. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
fig. 6 is a schematic architecture diagram of a neural network processor according to an embodiment of the present application;
fig. 7 is a schematic diagram of bandwidth ratio comparison provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In this context, the character "/" indicates that the front and rear associated objects are an "or" relationship. The term "plurality" as used in the embodiments herein refers to two or more.
The "connection" in the embodiments of the present application refers to various connection manners such as direct connection or indirect connection, so as to implement communication between devices, which is not limited in any way in the embodiments of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The following describes the background art and related terms of the present application.
In NPU architecture design, multiple levels of cache are typically designed to increase the data bandwidth of the system. The basic computational unit in an NPU is typically an arithmetic logic unit (Arithmetic and Logic Unit, ALU) with storage, also called a processing unit (Processing Element, PE); the storage in the PE is referred to as the level-0 cache. The data processed by a neural network is usually divided into a plurality of channels, such as image data comprising a plurality of color channels. In an NPU systolic array architecture for convolutional neural networks, one common approach is to split the buffer into separate small buffers composed of static random access memories (Static Random Access Memory, SRAM), each corresponding to a different channel, to increase the bandwidth available to different PEs, with each SRAM serving one column of PEs. When a column of PEs in the PE array needs to access data in an SRAM other than its own, one existing method is to add an extra bus between the buffers of different SRAMs to support data access across SRAMs, which increases the area and power consumption of the chip; another method is to splice multiple SRAMs into one large SRAM, but because different SRAMs hold overlapping data portions, the overlapping data must be fetched from the dynamic random access memory (Dynamic Random Access Memory, DRAM) multiple times, and the power consumption of each DRAM read is roughly 100 times that of an SRAM access, which greatly increases the bus bandwidth occupation and the system power consumption.
In order to solve the above problems, the present application provides a data processing apparatus, a data processing method, and related apparatus, which can reduce the frequency of accessing a memory during data reading through a specific storage space architecture, and reduce the bus bandwidth occupation amount and system power consumption.
In the following, a data processing apparatus according to an embodiment of the present application is described with reference to fig. 1, which is a schematic architecture diagram of a data processing apparatus provided in an embodiment of the present application. The data processing apparatus 100 includes a processing unit array 110 and M storage modules 120, the processing unit array 110 includes M columns of processing unit sets 111, M is a positive even number, the apparatus is suitable for a convolution kernel of NxN, and N is a positive integer greater than 1 and less than x.
Each storage module 120 comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x line cache address spaces, where x is a positive integer greater than 2, the front N-1 lines forming a head address space and the rear N-1 lines forming a tail address space; the N-1-line tail copy space after the tail address space of any annular cache space forming a circular cache space is used for storing a copy of the data in the front N-1-line head address space of the next annular cache space, and the N-1-line head copy space before the head address space of any annular cache space forming the circular cache space is used for storing a copy of the data in the rear N-1-line tail address space of the previous annular cache space;
the M storage modules 120 are configured to store the data to be stored in a distributed manner, and the M columns of processing unit sets are configured to read the distributed data to be stored from the M storage modules.
The storage module 120 may be an SRAM, and the processing unit set 111 may include a plurality of PEs.
For ease of understanding, an arbitrary one of the storage modules 120 in the embodiments of the present application is described with reference to fig. 2, which is a schematic structural diagram of a storage module provided in an embodiment of the present application. The storage module includes α annular cache spaces, namely annular cache space 0 through annular cache space α-1 in the figure, and different annular cache spaces correspond to different layers of the neural network.
In one possible embodiment, when the annular cache space of each storage module 120 is allocated, it may be allocated in units of the space occupied by an even number of image pixel rows, so as to ensure that the cache addresses of different storage modules 120 serving the same neural network layer are aligned.
Further, the annular cache space in the embodiment of the present application is described with reference to fig. 3, which is a schematic structural diagram of an annular cache space provided in an embodiment of the present application. The annular cache space includes x line cache address spaces, namely line 0 through line x-1 in the figure. When the convolution kernel is NxN, the front N-1 lines may be set as the head address space and the rear N-1 lines as the tail address space, and an N-1-line head copy space may be set before the head address space and an N-1-line tail copy space after the tail address space.
Further, the circular cache space in the embodiment of the present application is described with reference to fig. 4, which is an exemplary structure diagram of a circular cache space provided in an embodiment of the present application. When 2 annular cache spaces form a circular cache space, each annular cache space is set to include 8 line cache address spaces, namely line 0 to line 7 in the first annular cache space and line 8 to line 15 in the second (middle lines not shown), and the convolution kernel size is set to 3x3. The first 2 lines and the last 2 lines of each annular cache space are then mutually copied: the data in line 6 and line 7 are copied to the 2-line head copy space before line 8, the data in line 8 and line 9 are copied to the 2-line tail copy space after line 7, the data in line 14 and line 15 are copied to the 2-line head copy space before line 0, and the data in line 0 and line 1 are copied to the 2-line tail copy space after line 15, thereby forming the circular cache space. It will be appreciated that the copy spaces actually store data, and when data in a head address space or a tail address space is required, it may be retrieved from the corresponding head copy space or tail copy space. For example, when the data of line 7, line 8 and line 9 are needed, the data of line 7 is read from the first annular cache space, and the data of line 8 and line 9 are read from the tail copy space of the first annular cache space, which is not described in detail here.
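The mirroring rule of this example can be sketched in a few lines of Python (an editorial illustration, not part of the original disclosure; the variable names and the printed layout are assumptions):

```python
# Figure-4 layout: k = 2 annular cache spaces of x = 8 lines each form one
# circular cache space; with a 3x3 kernel (N = 3) the first and last
# N-1 = 2 lines of every space are mirrored into its neighbours.
N, x, k = 3, 8, 2

for i in range(k):
    head = list(range(i * x, i * x + (N - 1)))               # first N-1 lines
    tail = list(range((i + 1) * x - (N - 1), (i + 1) * x))   # last N-1 lines
    prv, nxt = (i - 1) % k, (i + 1) % k
    print(f"space {i}: head lines {head} -> tail copy space of space {prv}")
    print(f"space {i}: tail lines {tail} -> head copy space of space {nxt}")

# Output matches the text: lines 8,9 land in the tail copy space after line 7,
# lines 6,7 before line 8, lines 0,1 after line 15, and lines 14,15 before line 0.
```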
With the above data processing apparatus, data can be stored in a distributed manner and read from the distributed storage in parallel, reducing the frequency of memory accesses during data reading as well as the bus bandwidth occupation and the system power consumption.
Next, a data processing method in an embodiment of the present application will be described with reference to fig. 5, where the data processing method is applied to the data processing apparatus, and fig. 5 is a schematic flow chart of a data processing method provided in an embodiment of the present application, and specifically includes the following steps:
In step 501, data to be stored is acquired. The data to be stored may be obtained from a dynamic random access memory (Dynamic Random Access Memory, DRAM) before being written into the M storage modules, and may be image data having n channels.
In step 502, the data to be stored is divided into n sub-data to be stored according to its channel number n. To ensure that the data of each channel is distributed uniformly across the M storage modules, n is required to be less than or equal to M/2.
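A hedged sketch of these two steps (the NumPy layout, shapes and names below are illustrative assumptions, not from the disclosure):

```python
import numpy as np

M = 4                                      # number of storage modules
image = np.zeros((2, 2000, 128))           # channel-first image: n = 2 channels
n = image.shape[0]                         # step 502: channel number n
assert n <= M // 2                         # requires n <= M/2
sub_data = [image[c] for c in range(n)]    # one sub-data to be stored per channel
print(len(sub_data), sub_data[0].shape)    # 2 (2000, 128)
```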
In step 503, each sub-data to be stored is written into a circular cache space formed by INT(M/n) annular cache spaces.
Each circular cache space is used for storing one sub-data to be stored.
Specifically, the number of lines y of each sub-data to be stored may be acquired, and each sub-data to be stored may be written sequentially into the circular cache space formed by the INT(M/n) annular cache spaces, y/(INT(M/n)·x) rounds of writing being performed to store the complete sub-data.
It can be understood that when n is 2 and M is 4, there are 2 sub-data to be stored and 2 circular cache spaces can be constructed to store them respectively, i.e., each sub-data to be stored is written into a circular cache space formed by INT(4/2) = 2 annular cache spaces; when n is 2 and M is 8, there are likewise 2 sub-data to be stored and 2 circular cache spaces can be constructed, i.e., each sub-data to be stored is written into a circular cache space formed by INT(8/2) = 4 annular cache spaces. Maximally distributed storage can thereby be ensured.
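The grouping rule itself reduces to one integer division, as the following check shows (a sketch under the n ≤ M/2 constraint stated above; the function name is an assumption):

```python
def annular_spaces_per_circular_space(M: int, n: int) -> int:
    """Each of the n sub-data gets a circular cache space of INT(M/n)
    annular cache spaces, provided n does not exceed M/2."""
    assert n * 2 <= M, "requires n <= M/2"
    return M // n                          # INT(M/n)

print(annular_spaces_per_circular_space(4, 2))   # 2
print(annular_spaces_per_circular_space(8, 2))   # 4
```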
Specifically, when each sub-data to be stored is written into each line cache address space, the following steps may be performed:
s1, judging whether the first buffer address space for storing the data to be stored has a corresponding first tail copy space or a corresponding first head copy space.
The first cache address space may be any line of cache address space to which each first sub-data needs to be written, and since the N-1 head address space has a corresponding N-1 tail copy space and the N-1 tail address space has a corresponding N-1 head copy space, it may be determined whether the first cache address space has a corresponding first tail copy space or a corresponding first head copy space according to the number of lines. And will not be described in detail herein.
S2, when the corresponding first tail copy space or the corresponding first head copy space exists in the first cache address space for storing the data to be stored, the data to be stored is simultaneously written into the first cache address space, the first tail copy space or the first head copy space.
When the first buffer address space has a corresponding first header copy space, the data to be buffered can be written into the first buffer address space and the first header copy space at the same time, and when the first buffer address space has a corresponding first tail copy space, the data to be buffered can be written into the first buffer address space and the first tail copy space at the same time.
And S3, when the first buffer address space for storing the data to be stored does not have the corresponding first tail copy space or the corresponding first head copy space, writing the data to be stored into the first buffer address space.
And when the first cache address space is not the head address space or the tail address space, writing the data to be stored into the first cache space.
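A minimal Python sketch of S1-S3 (the Space class, write_line and the list-based storage are editorial assumptions; the patent does not prescribe an implementation), reusing the fig. 4 parameters:

```python
N, x = 3, 8                                  # 3x3 kernel, 8 lines per space

class Space:
    """One annular cache space: x data lines plus N-1-line head/tail copies."""
    def __init__(self):
        self.data = [None] * x
        self.head_copy = [None] * (N - 1)    # mirrors the previous space's tail
        self.tail_copy = [None] * (N - 1)    # mirrors the next space's head

def write_line(spaces, global_line, payload):
    idx = (global_line // x) % len(spaces)   # which annular cache space
    local = global_line % x                  # line index inside that space
    spaces[idx].data[local] = payload        # the line itself is always written
    if local < N - 1:                        # S1: head address space -> S2:
        spaces[(idx - 1) % len(spaces)].tail_copy[local] = payload
    elif local >= x - (N - 1):               # S1: tail address space -> S2:
        off = local - (x - (N - 1))
        spaces[(idx + 1) % len(spaces)].head_copy[off] = payload
    # otherwise S3: no copy space exists, only the ordinary write above

spaces = [Space(), Space()]
for line in range(2 * x):                    # write lines 0..15 of Figure 4
    write_line(spaces, line, f"row{line}")
print(spaces[0].tail_copy, spaces[1].tail_copy)  # ['row8','row9'] ['row0','row1']
```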
In step 504, the sub-data to be stored in each circular cache space is read.
Specifically, the INT(M/n) annular cache spaces may be read sequentially and cyclically, x lines at a time, for y/(INT(M/n)·x) rounds, to read each sub-data to be stored.
In one possible embodiment, when the sub-data to be stored in any head address space needs to be read, it is read from the tail copy space corresponding to that head address space.
In one possible embodiment, when the sub-data to be stored in any tail address space needs to be read, it is read from the head copy space corresponding to that tail address space.
In one possible embodiment, when neither a head address space nor a tail address space needs to be read, each line of sub-data to be stored is read in turn.
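Continuing the write_line sketch above (read_line and its dispatch are likewise assumptions), the three read cases look as follows:

```python
def read_line(spaces, global_line):
    idx = (global_line // x) % len(spaces)
    local = global_line % x
    if local < N - 1:                        # head address space: serve from the
        return spaces[(idx - 1) % len(spaces)].tail_copy[local]   # previous tail copy
    if local >= x - (N - 1):                 # tail address space: serve from the
        off = local - (x - (N - 1))
        return spaces[(idx + 1) % len(spaces)].head_copy[off]     # next head copy
    return spaces[idx].data[local]           # ordinary line: read directly

# Lines 7, 8 and 9 of the Figure-4 example are all served from the copy
# spaces around the space boundary, with no second trip to external memory:
print([read_line(spaces, l) for l in (7, 8, 9)])   # ['row7', 'row8', 'row9']
```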
Therefore, with the above data processing method, the data to be stored can be obtained from the external storage module and written into both the SRAMs and the corresponding copy spaces, so that the external storage module need not be accessed again during reading, reducing the bus bandwidth occupation and the system power consumption. The amount of data that can be stored is also increased: for example, when the number of channels is 2 and the number of SRAMs is 8, four SRAMs can be spliced together for each channel, and the image that can be stored is on average 4 times the size possible without splicing, which is not described in detail here.
For ease of understanding, the data processing apparatus and the data processing method of the present application are described below with an example. Suppose 4 storage modules are provided, namely SRAM0, SRAM1, SRAM2 and SRAM3, and the data to be stored is 2-channel image data. In the conventional method, the data of one channel is generally stored in SRAM0 and the data of the other channel in SRAM1, leaving both SRAM2 and SRAM3 empty, which is very wasteful of resources.
In the present scheme, the head and tail address spaces of SRAM0 and SRAM2 are set to mirror each other, as are those of SRAM1 and SRAM3. When data is acquired from the DRAM, the data of the overlapping portion between SRAM0 and SRAM2 can be written into the copy spaces of both SRAM0 and SRAM2 simultaneously, and the data of the overlapping portion between SRAM1 and SRAM3 into the copy spaces of both SRAM1 and SRAM3. When the processing unit set corresponding to SRAM0 or SRAM2 needs to read the overlapping data, the DRAM need not be accessed again; the data is simply obtained from the copy space. The frequency of accessing the memory during data reading can thus be reduced, along with the bus bandwidth occupation and the system power consumption.
It should be understood that the foregoing is illustrative, and the data of one channel may also be divided between other pairs of storage modules, for example stored in SRAM0 and SRAM3, or in SRAM1 and SRAM2, which is not limited here.
The scenario to which this embodiment of the application is applicable is:
n × 2 ≤ M
where n is the number of channels of the data to be stored and M is the number of storage modules.
For example, the data to be stored is a 2-channel image of 2000 lines. In the conventional method, channel 1 is stored in the annular storage space 1 of the allocated storage module 1, which can hold 20 lines of data, and the annular storage space 2 of the other storage module 2 can likewise hold 20 lines, so one fetch retrieves 40 lines of data. Based on the foregoing analysis, with a 3x3 convolution kernel, each 40-line fetch contains 8 lines of overlapping data in total. A total of 50 rounds are required to process the entire image, so 400 additional lines of overlapping data must be fetched; these 400 lines account for 400/2000 = 0.2 of the total image lines, i.e. 20% of the data is overlapping data that must be repeatedly carried from the DRAM. With the copy space strategy proposed by this technology, the bus bandwidth and system power consumption of moving that 20% of the data can be saved.
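The arithmetic can be verified directly (all values taken from the example above; this is an editorial check, not part of the disclosure):

```python
total_lines, fetch_lines, overlap_lines = 2000, 40, 8   # per the example above
rounds = total_lines // fetch_lines          # 50 rounds for the whole image
extra = rounds * overlap_lines               # 400 redundantly fetched lines
print(rounds, extra, extra / total_lines)    # 50 400 0.2 -> 20% of traffic saved
```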
As shown in fig. 7, which is a schematic diagram of bus bandwidth occupation comparison provided in an embodiment of the present application, the hatched portion is the bus bandwidth occupied when the overlapping area is read. In the conventional scheme, the overlapping portion must be read repeatedly, which increases the bus bandwidth occupation over a short period, whereas the present scheme does not access the DRAM repeatedly, so its bus bandwidth occupation is comparatively low.
An electronic device in an embodiment of the present application is described below with reference to fig. 8, which is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in fig. 8, the electronic device 800 includes a processor 801, a communication interface 802 and a memory 803, which are connected to each other. The electronic device 800 may further include a bus 804, through which the processor 801, the communication interface 802 and the memory 803 may be connected to each other; the bus 804 may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc. The bus 804 may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or one type of bus. The memory 803 is used to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform all or part of the method described above in fig. 5.
The foregoing description of the embodiments of the present application has been presented primarily in terms of a method-side implementation. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the functional units of the electronic device according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
The present application also provides a computer storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute some or all of the steps of any one of the methods described in the method embodiments above.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the methods described in the method embodiments above. The computer program product may be a software installation package, said computer comprising an electronic device.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, such as the above-described division of units, merely a division of logic functions, and there may be additional manners of dividing in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the above-mentioned method of the various embodiments of the present application. And the aforementioned memory includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
The embodiments of the present application have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present application; the above description of the embodiments is merely intended to help understand the method of the present application and its core idea. Meanwhile, a person skilled in the art may make changes to the specific implementations and the application scope according to the idea of the present application. In view of the above, the contents of this specification should not be construed as limiting the present application.
Claims (10)
1. A data processing device, characterized by comprising a neural network processor, wherein the neural network processor comprises a processing unit array and M storage modules, the processing unit array comprises M columns of processing unit sets, M is a positive even number, the neural network processor is applicable to a convolution kernel of NxN, and N is a positive integer greater than 1 and less than x;
each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x line cache address spaces, where x is a positive integer greater than 2, the front N-1 lines forming a head address space and the rear N-1 lines forming a tail address space; the N-1-line tail copy space after the tail address space of any annular cache space forming a circular cache space is used for storing a copy of the data in the front N-1-line head address space of the next annular cache space, and the N-1-line head copy space before the head address space of any annular cache space forming the circular cache space is used for storing a copy of the data in the rear N-1-line tail address space of the previous annular cache space;
the M storage modules are used for storing data to be stored in a distributed manner, and the M columns of processing unit sets are used for reading the distributed data to be stored from the M storage modules.
2. A data processing method applied to the data processing apparatus of claim 1, the method comprising:
acquiring data to be stored;
dividing the data to be stored into n sub-data to be stored according to the channel number n of the data to be stored, wherein n is a positive integer less than or equal to M/2;
and writing each sub-data to be stored into a circular cache space formed by INT(M/n) annular cache spaces, wherein each circular cache space is used for storing one sub-data to be stored.
3. The method according to claim 2, wherein writing each sub-data to be stored into a circular cache space formed by INT(M/n) annular cache spaces comprises:
acquiring the number of lines y of each sub-data to be stored;
and sequentially writing each sub-data to be stored into the circular cache space formed by the INT(M/n) annular cache spaces, y/(INT(M/n)·x) rounds of writing being performed to store the sub-data to be stored.
4. The method according to claim 2, wherein writing each sub-data to be stored into a circular cache space formed by INT(M/n) annular cache spaces comprises:
judging whether a first cache address space for storing the data to be stored has a corresponding first tail copy space or a corresponding first head copy space;
and when the first cache address space for storing the data to be stored has the corresponding first tail copy space or the corresponding first head copy space, writing the data to be stored into the first cache address space and the first tail copy space or the first head copy space simultaneously.
5. The method of claim 4, wherein after judging whether the first cache address space for storing the data to be stored has a corresponding first tail copy space or a corresponding first head copy space, the method further comprises:
and when the first cache address space for storing the data to be stored has neither the corresponding first tail copy space nor the corresponding first head copy space, writing the data to be stored into the first cache address space.
6. The method according to any one of claims 2-5, wherein after writing each sub-data to be stored into a circular cache space formed by INT(M/n) annular cache spaces, the method further comprises:
reading the sub-data to be stored in each circular cache space.
7. The method of claim 6, wherein the reading the sub-data to be stored in each circular cache space comprises:
sequentially and cyclically reading the INT(M/n) annular cache spaces, x lines at a time, for y/(INT(M/n)·x) rounds, to read each sub-data to be stored.
8. The method of claim 7, wherein sequentially and cyclically reading the INT(M/n) annular cache spaces to read each sub-data to be stored comprises:
when the sub-data to be stored in any head address space needs to be read, reading the sub-data to be stored from the tail copy space corresponding to the head address space;
when the sub-data to be stored in any tail address space needs to be read, reading the sub-data to be stored from the head copy space corresponding to the tail address space;
and when neither a head address space nor a tail address space needs to be read, sequentially reading each line of sub-data to be stored.
9. An electronic device, comprising: a memory for storing a program, and a processor for executing the program stored in the memory, wherein the processor is configured to perform the data processing method according to any one of claims 2 to 8 when the program stored in the memory is executed.
10. A computer storage medium storing program code for execution by an electronic device, the program code comprising instructions for performing the method of any one of claims 2 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111584166.9A CN116362303A (en) | 2021-12-22 | 2021-12-22 | Data processing device, data processing method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116362303A (en) | 2023-06-30
Family
ID=86910301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111584166.9A Pending CN116362303A (en) | 2021-12-22 | 2021-12-22 | Data processing device, data processing method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116362303A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |