CN118519960B - Fusion buffer architecture based on array structure - Google Patents
Fusion buffer architecture based on array structure
- Publication number
- CN118519960B (application CN202410969196.9A)
- Authority
- CN
- China
- Prior art keywords
- buffer
- data
- instruction
- array
- read
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8038—Associative processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/167—Interprocessor communication using a common memory, e.g. mailbox
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
- G06F7/575—Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/76—Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Analysis (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The application discloses a fusion buffer architecture based on an array structure, relating to the field of buffers, which comprises a buffer array and an instruction module connected to the buffer array through a signal line. The instruction module caches compiled instruction data and sends an instruction stream to the buffer array according to a time sequence. The buffer array comprises M rows of buffer blocks with identical structures; buffer blocks in adjacent rows are cascaded in sequence through unidirectional instruction signal lines, and the instruction stream sent by the instruction module is transferred to the M rows of buffer blocks row by row according to the time sequence. Each of the M rows of buffer blocks has a data input end and a data output end, receives external data input, and executes buffering and output tasks according to the instruction data transferred to that row. The architecture uses array units built from SRAM, adopts pipeline registers in place of the traditional bus design, and uses multiple instruction slots to control data read-write and transfer, thereby realizing a globally and uniformly managed buffer architecture whose access behavior is independent of design scale, reducing circuit design difficulty, and improving data caching and read-write efficiency.
Description
Technical Field
The embodiment of the application relates to the field of buffers, in particular to a fusion buffer architecture based on an array structure.
Background
In modern society and industry, in fields such as the Internet, big data, the Internet of Things, and autonomous driving, artificial intelligence based on neural networks can greatly improve efficiency and reduce cost, and its use is increasingly widespread. While the computations within a neural network can be run with general-purpose computing techniques, such as general-purpose CPUs, doing so is relatively inefficient. The industry therefore typically employs specialized processors to accelerate neural network computation, commonly referred to as neural network accelerators or artificial intelligence chips (AI chips).
Currently, the mainstream neural network accelerators in the industry are mainly general-purpose graphics processors (GPGPU); there are also accelerators based on reduced instruction set (RISC) cores, as well as domain-specific accelerators (collectively referred to as DSAs, such as the Google TPU chips) that combine systolic arrays with general-purpose computing cores. Taking a DSA accelerator as an example, a chip usually contains multiple computing cores and multiple memories. The architecture generally consists of several computing components or computing cores that complete the overall operation cooperatively; a single computing component or core has independent local caches and buffers; and the multiple computing components or cores exchange data via multi-level, multi-block buffers (L1 and L2 buffers). This type of mainstream design typically stores data repetitively, and the bus connections between the multiple caches increase circuit design complexity; the interactive execution of tasks across multiple caches greatly increases compiler complexity; inconsistent path lengths among the caches make data exchange delay uncertain and hurt programming efficiency; and the multi-buffer, multi-level connection topology scales inefficiently, which greatly reduces design reuse across chips of different compute classes. All of these problems lead to excessive circuit area or complex programming models, negatively affecting chip cost and efficiency.
Disclosure of Invention
The embodiment of the application provides a fusion buffer architecture based on an array structure, which solves the problems of complex circuit design and degraded chip efficiency and cost caused by multi-buffer, multi-level connection topologies. The architecture comprises a buffer array and an instruction module connected to the buffer array through a signal line; the instruction module caches compiled instruction data and sends an instruction stream to the buffer array according to a time sequence;
the buffer array comprises M rows of buffer blocks with identical structures; buffer blocks in adjacent rows are cascaded in sequence through unidirectional instruction signal lines, and the instruction stream sent by the instruction module is transferred to the M rows of buffer blocks row by row according to the time sequence;
the M rows of buffer blocks each have a data input end and a data output end, respectively receive external data input, and execute buffering and output tasks according to the instruction data transferred to that row.
Specifically, each buffer block includes N buffer units with identical structures, so that the M rows of buffer blocks form an M×N buffer array; the N buffer units of each row are cascaded in sequence through data lines, and data is transferred left and right between adjacent buffer units.
Specifically, the buffer unit comprises a buffer, a first data selector, a second data selector, and a first data register;
one data input end of the first data selector is connected to the buffer, the other data input end is connected to the second data register in the adjacent buffer unit, and the data output end is connected to the first data register, so that either the buffer output or the second data register output enters the first data register;
one data input end of the second data selector is connected to the buffer, the other data input end is connected to the first data register, and the data output end is connected to the second data register of the adjacent buffer unit, so that either the buffer output or the first data register output enters the adjacent buffer unit.
Specifically, each buffer unit further comprises an instruction register; in buffer units of the same column, the instruction registers of adjacent rows are cascaded through signal lines to register and transfer instructions;
the buffer further comprises n SRAM banks and a read-write circuit; the instruction register is connected to the read-write circuit, reads and sends instruction information according to the time sequence, and the read-write circuit executes read-write tasks according to the instruction information; when the read-write circuit executes a write instruction, it receives external target data and writes the target data into a designated position of the target bank; when the read-write circuit executes a read instruction, it obtains the target data from the target bank according to the instruction and sends it outward.
Specifically, in buffer units of the same row, units in adjacent columns transfer data laterally through the data registers; in buffer units of the same column, instructions are transferred longitudinally between adjacent rows through the cascaded instruction registers.
Specifically, the instruction register in each buffer unit is connected to the gating ends of the first data selector and the second data selector respectively, generates a gating signal according to the target instruction at the current timing, and controls the gated output of the data selectors based on the gating signal.
Specifically, the instruction module includes N instruction slots, each instruction slot containing a FIFO memory of cascade depth S for storing S compiled instructions; the FIFO memories send instructions in sequence to the corresponding columns of the buffer array according to the time sequence, and the instructions are transferred along the instruction stream direction.
Specifically, the compiled instruction comprises mode control bits, a data direction bit, channel selection bits, and bank selection bits;
the mode control bits are used for controlling the data operation mode executed by the target buffer unit;
the data direction bit is used for controlling the direction in which the target buffer unit transfers data;
the channel selection bits are used for controlling the gating state of the data selectors in the target buffer unit;
the bank selection bits are used for selecting the target bank in the buffer.
Specifically, the buffer units in the same column as an instruction slot execute the same compiled instruction in the order of the instruction stream. When the data operation mode is data reading, the read-write circuit reads the target data from the target bank according to the compiled instruction, gates the data output according to the channel selection bits, and sends the target data into the buffer unit at the corresponding position according to the data direction bit; when the data operation mode is data writing, the read-write circuit reads the target data in the target data register according to the compiled instruction and writes it into the target bank; when the data operation mode is no-operation, the read-write circuit does not execute, and the channel-gated data selector and data register transfer data laterally.
Specifically, the buffers, data selectors, and data registers in each row of buffer units share one power supply network, the instruction registers in all buffer units share another power supply network, and the power supply networks are connected to and controlled by the controller.
Specifically, when a buffer unit in the buffer array is damaged, the power supply switch of the row containing the damaged buffer unit is turned off, and the N columns of instruction streams skip the damaged row and continue to execute.
The technical scheme provided by the embodiment of the application has at least the following beneficial effects: the application uses array units based on SRAM, borrows the composition of a systolic array, adopts pipeline registers in place of the traditional bus design, and uses multiple instruction slots to control data read-write and transfer, realizing a globally and uniformly managed buffer architecture whose access is independent of design scale. Compared with traditional buffer designs, it brings substantial innovation and improvement in front-end design, data control and flow, algorithm support efficiency, and back-end physical implementation difficulty.
Drawings
FIG. 1 is a diagram of a typical buffer architecture in the related art;
FIG. 2 is a diagram of a buffer control architecture for a systolic array design in the related art;
FIG. 3 is a top-level design of the fused buffer architecture based on an array structure;
FIG. 4 is a schematic diagram of a fusion buffer architecture according to an embodiment of the present application;
FIG. 5 is a detailed diagram of a fusion buffer architecture;
FIG. 6 is a schematic diagram showing the structure of a buffer unit in a buffer block;
FIG. 7 shows a schematic diagram of the internal connections of a buffer array and two adjacent buffer units;
FIG. 8 is a schematic diagram of a design of an instruction module and a buffer array;
FIG. 9 is a schematic diagram of one way of masking a lateral row of buffer blocks;
FIG. 10 is a schematic diagram of another way of masking a longitudinal column of buffer units;
FIG. 11 is a schematic diagram of a structure applied to a stream tensor processor.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
References herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
FIG. 1 is a design diagram of a many-core buffer architecture in the related art, illustrated with four-level buffering as an example. The multi/many-core CPU contains several cores and a system-level L4 buffer. A single CPU core is internally divided into a four-core micro-architecture comprising four sub-cores and an SRAM L3 buffer. Dividing further, each sub-core contains a micro-architecture with an SRAM L1 buffer, an SRAM L2 buffer, a logic computing unit (ALU), and a register file. Each CPU core in this structure has its own local buffer arrangement, controlled independently of the others, and the four buffer levels of this layer-by-layer storage architecture are nested yet controlled independently of one another. In a newly designed chip, the same batch of data may exist in buffers of different levels, for example in data broadcasting and data exchange scenarios, which reduces storage utilization. The communication protocols between buffers of different layers are complex, and because the communication paths are of unequal length, the delay cannot be determined. For each layer of buffer design, the compiler must be optimized per task to reduce communication cost and improve computing efficiency.
FIG. 2 is a schematic diagram of a buffer control architecture for a systolic array design in the related art, in which three buffers are disposed along three sides of the computing array (core): input buffers A and B and output buffer C in FIG. 2. The three buffers are connected to each other and are all connected to the central computing core (array). The input and output buffers are controlled independently, each with its own read-write control circuit, to realize data transfer among them. In this architecture, data must be exchanged or transferred back and forth among the three buffers, which increases control complexity and transfer delay and thereby reduces computing efficiency.
FIG. 3 shows the fused buffer architecture based on an array structure, designed to address the above problems and defects. In the top-level design, the compute array core is completely separated from the buffer module, and all data read-write buffering for the compute array core is handled by the separate buffer module. This fuses all buffer levels into the same architectural layer, achieving a globally visible, controllable, non-hierarchical buffer, with the aim of improving read-write and storage efficiency.
FIG. 4 is a schematic structural diagram of the fusion buffer architecture based on an array structure according to an embodiment of the present application, expanding on the buffer module of FIG. 3. The fusion buffer architecture comprises a buffer array and an instruction module connected to the buffer array through a signal line. The instruction module caches instruction data compiled for specific tasks; at run time, it sends instruction streams to the buffer array according to a set time sequence. The buffer array comprises M rows of buffer blocks with identical structures, cascaded in sequence through unidirectional instruction signal lines, which transfer the instruction stream from the instruction module to the M rows of buffer blocks row by row according to the time sequence, in top-down order as shown in FIG. 4. The data in the M rows of buffer blocks are isolated from one another, i.e., data does not transfer between blocks of different rows. Each buffer block has a data input end and a data output end, used respectively to receive external data input and to execute buffering and output tasks according to the instructions transferred to that row. Because the M rows of buffer blocks are identical and connected in series through the signal lines, for a given instruction stream the M rows execute the instructions in sequence; the only difference between rows is a timing offset, and the offset between adjacent rows is identical. The instruction stream flows through all buffer blocks in turn, i.e., every instruction passes through every buffer block, so the overall timing is completely predictable and controllable. For example, denote the instruction stream as {instruction A, instruction B, instruction C, instruction D}. After it is delivered to the buffer array, buffer block M1 executes instruction A in the first clock cycle. In the second clock cycle, M1 executes instruction B while instruction A flows into M2 for execution. In the third clock cycle, instruction A flows to M3, instruction B flows to M2, and instruction C flows into M1. For the upper-level chip control program, the instruction execution state of the entire buffer array is determined once the timing and the instruction stream are planned; over a complete execution period, all buffer blocks execute the same set of instruction-stream tasks, i.e., their workloads are identical.
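To make the timing concrete, the following minimal Python sketch simulates the row-by-row pipelining described above. It is our illustration rather than anything specified in the patent; M, the instruction names, and all identifiers are assumed for demonstration.

```python
from collections import deque

# Hypothetical parameters: 4 buffer-block rows, 4 compiled instructions.
M = 4
stream = deque(["A", "B", "C", "D"])   # instruction stream from the module
rows = [None] * M                      # instruction held by each row this cycle

total = len(stream)
for cycle in range(1, M + total):
    # Shift the stream down one row: row m latches what row m-1 held.
    for m in range(M - 1, 0, -1):
        rows[m] = rows[m - 1]
    rows[0] = stream.popleft() if stream else None
    executing = {f"M{m + 1}": op for m, op in enumerate(rows) if op}
    # cycle 1: {'M1': 'A'}; cycle 2: {'M1': 'B', 'M2': 'A'}; and so on,
    # matching the walkthrough above.
    print(f"cycle {cycle}: {executing}")
```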
Based on the above design, in some embodiments a plurality of parallel buffer modules may also be configured to perform different tasks in parallel; for data interaction between these tasks, a switch circuit or routing device may be used.
FIG. 5 is a detailed diagram of the fused buffer architecture. Each buffer block can be further subdivided into N buffer units with identical structures, so the M rows of buffer blocks form an M×N buffer array. Within each row, adjacent buffer units are cascaded in sequence through data lines; the data lines are bidirectional, so data can be transferred laterally, both leftward and rightward, between adjacent units. The whole row thus forms a chained bidirectional buffer block, with data input and output ends at the two ends of the chain, enabling flexible buffered data transfer. In some embodiments, the instruction stream transfer between buffer blocks may be implemented with instruction registers, and the lateral data transfer with data registers; that is, a data register is disposed between each pair of adjacent buffers (buffer-core SRAMs), and lateral data transfer passes through these registers.
FIG. 6 is a schematic diagram of a buffer unit in a buffer block. In some embodiments, the buffer unit may include a buffer, a first data selector, a second data selector, and a first data register. This embodiment is illustrated with two adjacent buffer units in the same row: one data input end of the first data selector MUX1 in the first buffer unit is connected to the buffer, and the other data input end is connected to the second data register in the adjacent buffer unit; the data output end of MUX1 is connected to the first data register of this buffer unit, and MUX1 gates either the buffer output or the second data register output into the first data register.
Correspondingly, one data input end of the second data selector MUX2 in the first buffer unit is connected to the buffer, and the other data input end is connected to the first data register; the data output end of MUX2 is connected to the second data register of the adjacent buffer unit. MUX2 gates either the buffer output or the first register output into the adjacent buffer unit.
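Purely as an illustration of this datapath, the sketch below models one buffer unit and its two selectors in Python; the class and member names are our assumptions, not terminology from the patent.

```python
from dataclasses import dataclass

@dataclass
class BufferUnit:
    buffer_out: int = 0   # word currently read out of the local sram buffer
    first_reg: int = 0    # first data register, latched from MUX1

    def mux1(self, neighbor_second_reg: int, select_buffer: bool) -> None:
        # MUX1: gate either the local buffer output or the neighbour's
        # second data register into this unit's first data register.
        self.first_reg = self.buffer_out if select_buffer else neighbor_second_reg

    def mux2(self, select_buffer: bool) -> int:
        # MUX2: drive either the local buffer output or the first data
        # register onto the neighbour's second data register.
        return self.buffer_out if select_buffer else self.first_reg
```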
In the above structure, the data registers are used to exchange data between adjacent memory/buffer units; the exchange direction is bidirectional, and multiple channels may be provided in each direction. That is, buffer units in the same row realize lateral data transfer through the two data registers and the data selectors, while buffer units in the same column additionally need instruction registers in order to achieve controllable instruction operation. In buffer units of the same column, the instruction registers of adjacent rows are cascaded through signal lines to register and transfer instructions; as in FIG. 5, for an M×N array structure, the instruction registers of the N columns are each connected in series to implement instruction pipelining. Of course, in some embodiments the instruction pipeline may instead transmit bottom-up according to actual design requirements; the application does not limit the specific instruction flow direction.
FIG. 7 shows a schematic diagram of the internal connections of the buffer array and two adjacent buffer units. To realize large-capacity storage, the buffer may further be designed to include n SRAM banks and a read-write circuit. The instruction register is connected to the read-write circuit and registers and sends one instruction per cycle; the read-write circuit executes the corresponding read-write task according to the instruction information. When the read-write circuit executes a write instruction, it receives external target data and writes the target data into a designated position of the target bank; when it executes a read instruction, it obtains the target data from the target bank according to the instruction and sends it outward. "External" here means reading data from, or writing data to, the adjacent buffers; the number and size of the SRAM banks in each buffer are freely configurable.
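A toy model of this buffer core, written under the assumption of n independently addressed banks (all names are ours), might look like the following:

```python
class BufferCore:
    """Toy model of one buffer: n sram banks behind one read-write circuit."""

    def __init__(self, n_banks: int = 4, rows_per_bank: int = 1024):
        # Bank count and depth are freely configurable, as the text notes.
        self.banks = [[0] * rows_per_bank for _ in range(n_banks)]

    def execute_write(self, bank: int, row: int, external_data: int) -> None:
        # Write instruction: place external target data at the designated
        # position of the target bank.
        self.banks[bank][row] = external_data

    def execute_read(self, bank: int, row: int) -> int:
        # Read instruction: fetch the target data and send it outward,
        # i.e. toward an adjacent buffer unit.
        return self.banks[bank][row]
```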
The design focuses on simplifying circuit design, reducing compilation complexity, and improving data caching and read-write efficiency. The M×N buffer array simplifies the complex structure of multi-layer caches; in buffer units of the same row, adjacent columns transfer data laterally through the data registers; in buffer units of the same column, adjacent rows transfer instructions longitudinally through the cascaded instruction registers; and the longitudinal instruction stream together with the lateral data stream improves data caching and read-write efficiency. Based on this structure and design principle, the compiled instructions and the instruction module can also be optimized.
FIG. 8 is a schematic diagram of the design of the instruction module and the buffer array. The instruction module includes N instruction slots connected by signal lines to the N columns of buffer units; each instruction slot contains a FIFO memory of cascade depth S for storing S compiled instructions. The FIFO memories send instructions in sequence to the corresponding columns of the buffer array according to the time sequence, and the instructions are transferred along the instruction stream direction. For a compiled instruction, in one possible implementation an instruction code may be designed that includes mode control bits, a data direction bit, channel selection bits, and bank selection bits: for example, bits 0-4 are bit-offset bits, bits 5-14 are row selection bits, bits 15-18 are bank selection bits, bits 19-22 are data format bits, bits 23-25 are register channel selection bits, bit 26 is the direction transfer bit, and the highest bits 27-28 are the mode control bits. All valid instructions in the FIFO follow this format and are emitted as an instruction stream by clock cycle or timing, for example sending an instruction every other clock cycle, to achieve pipelined execution. The mode control bits control the data operation mode executed by the target buffer unit, including read, write, and no-operation modes; the data direction bit controls the direction in which the target buffer unit transfers data, leftward or rightward; the channel selection bits control the gating state of the data selectors in the target buffer unit; the bank selection bits select the target bank in the buffer; the row selection bits indicate the row of the target bank to read or write; and the bit-offset bits represent the data input source or data output destination. The buffer shown in FIG. 8 consists of 4 banks, each with a capacity of 128 KB. The rows and columns of the array are configured according to overall processor requirements. Instructions are pipelined unidirectionally through the instruction register of each buffer unit, and data flows bidirectionally between buffers of the same row through the data registers. The instruction module is composed of a plurality of instruction slots, one instruction slot controlling one column of buffers. The buffer units in the same column as an instruction slot execute the same compiled instruction in the order of the instruction stream. When the data operation mode is data reading, the read-write circuit reads the target data from the target bank according to the compiled instruction, gates the data output according to the channel selection bits, and sends the target data into the buffer unit at the corresponding position according to the data direction bit. When the data operation mode is data writing, the read-write circuit reads the target data in the target data register according to the compiled instruction and writes it into the corresponding row of the target bank. When the data operation mode is no-operation, the read-write circuit performs no read-write operation, and the channel-gated data selector and data register transfer data laterally.
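As a concrete and purely illustrative rendering of this 29-bit word, the helper below packs and unpacks the fields with the widths given in the text; the field names, the mode encoding used in the example, and both function names are our assumptions.

```python
FIELDS = [                 # (name, low_bit, width) per the layout in the text
    ("offset",   0,  5),   # bits 0-4: bit offset
    ("row",      5, 10),   # bits 5-14: row selection
    ("bank",    15,  4),   # bits 15-18: bank selection
    ("fmt",     19,  4),   # bits 19-22: data format
    ("channel", 23,  3),   # bits 23-25: register channel selection
    ("dir",     26,  1),   # bit 26: direction transfer
    ("mode",    27,  2),   # bits 27-28: mode control
]

def encode(**values: int) -> int:
    word = 0
    for name, low, width in FIELDS:
        v = values.get(name, 0)
        assert v < (1 << width), f"{name} out of range"
        word |= v << low
    return word

def decode(word: int) -> dict:
    return {name: (word >> low) & ((1 << width) - 1)
            for name, low, width in FIELDS}

# e.g. a read (mode=1 is an assumed encoding) of bank 2, row 17, sent left:
insn = encode(mode=1, bank=2, row=17, dir=0)
assert decode(insn)["bank"] == 2 and decode(insn)["row"] == 17
```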
Taking FIG. 6 as an example, assume data is transferred from right to left. When the write operation mode is executed, the instruction register sends the compiled instruction to the read-write circuit, which extracts the corresponding data from the first data register and writes it into the target bank in the SRAM. When the read operation mode is executed, with data again moving right to left, the instruction register sends the compiled instruction to the read-write circuit, which reads the target data from the target bank in the SRAM, gates the first data port of MUX2, and sends the bank data into the second data register. When the no-operation mode is executed, with data again moving right to left, the read-write circuit performs no read-write; the second data port of MUX2 is gated directly, and the data in the first data register passes through MUX2 into the second data register. Transfer in the leftward-to-rightward direction is similar and is not repeated here.
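The three modes in this walkthrough can be summarized behaviorally; the sketch below (our naming and simplification, for the right-to-left case only) returns what MUX2 drives into the left neighbour's second data register each cycle.

```python
def step(mode: str, bank: list, row: int, first_reg: int):
    """Return the word MUX2 forwards leftward, or None if it is consumed."""
    if mode == "write":   # read-write circuit drains the first data register
        bank[row] = first_reg
        return None
    if mode == "read":    # MUX2 gates its buffer-side port
        return bank[row]
    if mode == "nop":     # MUX2 gates its register-side port (pass-through)
        return first_reg
    raise ValueError(f"unknown mode: {mode}")

bank = [0] * 8
assert step("write", bank, 3, 42) is None and bank[3] == 42  # write mode
assert step("read", bank, 3, 0) == 42                        # read mode
assert step("nop", bank, 3, 7) == 7                          # pass-through
```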
As shown in FIG. 9, in some cases a high-capacity buffer can hurt chip yield, so yield can be increased by adding one or more redundant rows. If a given design requires a 16-row memory array to meet the storage requirement, the physical design can replicate one extra row, making 17; during testing, only buffer-array chips with 16 fully working rows need to be screened in, and the test result of any single row can be ignored without failing the whole buffer array, which raises the final yield. Similarly, if a design requires 32 rows, 2 extra rows can be replicated for 34 in the physical design, and the test results of any 2 rows can then be ignored, again improving the final yield. To support such redundancy, note that in some embodiments, when any buffer in the M×N array is damaged, or a data register is damaged, the sequential logic of the whole row of buffer blocks is broken, affecting the chip's data operation. The application therefore puts the buffers, data selectors, and data registers of each row of buffer units on one power supply network, and the instruction registers of all buffer units on another, with the power supply networks connected to and controlled by the controller. When a buffer unit in the buffer array is damaged, the power switch of the row containing it is turned off, i.e., the power supply of that row of buffer blocks is shut down; NOP instructions are filled into the corresponding FIFOs in the N instruction slots, and with the instruction registers still powered normally, the N columns of instruction streams skip the damaged row and continue executing. This realizes targeted masking of buffer blocks and improves yield.
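The yield benefit of a spare row can be estimated with a simple independence assumption that is ours, not the patent's: if each row passes testing with probability p, a chip is good when at least the required m rows pass.

```python
from math import comb

def yield_with_redundancy(m: int, r: int, p: float) -> float:
    """Probability that at least m of m+r independently yielding rows pass."""
    n = m + r
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

# Illustrative numbers only (a per-row yield of p = 0.99 is assumed):
print(yield_with_redundancy(16, 0, 0.99))  # 16 rows, no spare: about 0.85
print(yield_with_redundancy(16, 1, 0.99))  # 17 rows, 1 spare:  about 0.99
print(yield_with_redundancy(32, 2, 0.99))  # 34 rows, 2 spares, as in the text
```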
In another case, as shown in FIG. 10, chip cost must be controlled, so instead of adding redundant rows, a column of memory cells can be masked to improve production yield. For example, if a design requires a 4-column buffer array to meet the storage requirement, and production testing finds that one or more SRAMs of a memory block in one column are defective and unusable, the whole column of memory cells can be masked and the chip used, functionally, as a smaller-capacity 3-column chip without affecting overall function. By the same means, 2 or even 3 columns of memory blocks can be masked, improving overall chip yield to an even greater degree. In this design, the N columns of buffer units are powered through separate power supply networks, while all instruction registers still use a centralized power supply; NOP instructions are written into the instruction slots corresponding to the columns containing damaged buffer units, realizing targeted masking without affecting lateral data transfer.
Based on the above fusion buffer architecture with an array structure, the application can also be applied to a stream tensor processor, as shown in FIG. 11. The compiled instructions/programs can be stored in external memory. The tensor processor is internally divided into matrix computing modules, a data exchange module, a data buffer module, and vector computing modules; as in FIG. 11, the vector computing modules are placed on both sides of the middle in a symmetric structural design, and each functional module contains a corresponding instruction module. The data exchange module carries data input and output as well as data interaction between the functional blocks, so only half the compilation workload is actually needed in this design while all functional modules can be kept active, improving processing performance.
In summary, in the fusion array architecture designed by the application, no caches are layered; no matter the scale, there is no need to divide levels such as L1/L2/L3. All input and output buffers are fused together, so no dedicated control circuit per buffer needs to be designed. All buffers are controlled uniformly and allocated dynamically, so no data exchange between different buffers is needed. Pipelined transfer is adopted, with no bus design or bus control, improving operating frequency and communication efficiency. The array structure based on memory cells makes the circuit design of every memory cell completely consistent, reducing circuit design difficulty. Complex communication overhead is avoided and the timing is fully predictable, which improves programming efficiency and reduces compiler difficulty. The capacity of each buffer can be increased or reduced arbitrarily, with data-flow timing or delay changing linearly and in step with capacity. The overall buffer capacity can be scaled by simple configuration without changing the control or programming model, saving effort and improving efficiency.
The foregoing describes preferred embodiments of the present invention. It should be understood that the invention is not limited to the specific embodiments described above, and devices and structures not described in detail should be understood as being implemented in a manner common in the art. Any person skilled in the art may make many possible variations and modifications, or adapt them to equivalent embodiments, without departing from the technical solution of the present invention and without affecting its essential content. Therefore, any simple modification, equivalent variation, or adaptation of the above embodiments in accordance with the technical substance of the present invention still falls within the scope of the technical solution of the present invention.
Claims (11)
1. A fusion buffer architecture based on an array structure, characterized by comprising a buffer array and an instruction module connected to the buffer array through a signal line; the instruction module caches compiled instruction data and sends an instruction stream to the buffer array according to a time sequence;
the buffer array comprises M rows of buffer blocks with identical structures; buffer blocks in adjacent rows are cascaded in sequence through unidirectional instruction signal lines, and the instruction stream sent by the instruction module is transferred to the M rows of buffer blocks row by row according to the time sequence; buffer blocks in adjacent rows are separated by the same timing offset, the same instruction flows through all the buffer blocks in sequence, and all the buffer blocks execute the tasks of the same set of instruction streams;
the M rows of buffer blocks each have a data input end and a data output end, respectively receive external data input, and execute buffering and output tasks according to the instruction data transferred to that row.
2. The fusion buffer architecture based on an array structure according to claim 1, wherein each buffer block comprises N buffer units with identical structures, and the M rows of buffer blocks form an M×N buffer array; the N buffer units of each row are cascaded in sequence through data lines, and data is transferred left and right between adjacent buffer units.
3. The fusion buffer architecture based on an array structure according to claim 2, wherein the buffer unit comprises a buffer, a first data selector, a second data selector, and a first data register;
one data input end of the first data selector is connected to the buffer, the other data input end is connected to the second data register in the adjacent buffer unit, and the data output end is connected to the first data register, so that either the buffer output or the second data register output enters the first data register;
one data input end of the second data selector is connected to the buffer, the other data input end is connected to the first data register, and the data output end is connected to the second data register of the adjacent buffer unit, so that either the buffer output or the first data register output enters the adjacent buffer unit.
4. The fusion buffer architecture based on an array structure according to claim 3, wherein each buffer unit further comprises an instruction register, and in buffer units of the same column, the instruction registers of adjacent rows are cascaded through signal lines to register and transfer instructions;
the buffer further comprises n SRAM banks and a read-write circuit; the instruction register is connected to the read-write circuit, reads and sends instruction information according to the time sequence, and the read-write circuit executes read-write tasks according to the instruction information; when the read-write circuit executes a write instruction, it receives external target data and writes the target data into a designated position of the target bank; when the read-write circuit executes a read instruction, it obtains the target data from the target bank according to the instruction and sends it outward.
5. The fusion buffer architecture based on an array structure according to claim 3, wherein in buffer units of the same row, units in adjacent columns transfer data laterally through the data registers; and in buffer units of the same column, instructions are transferred longitudinally between adjacent rows through the cascaded instruction registers.
6. The fusion buffer architecture based on an array structure according to claim 5, wherein the instruction register in the buffer unit is connected to the gating ends of the first data selector and the second data selector respectively, generates a gating signal according to the target instruction at the current timing, and controls the gated output of the data selectors based on the gating signal.
7. The fusion buffer architecture based on an array structure according to claim 4, wherein the instruction module comprises N instruction slots, each instruction slot containing a FIFO memory of cascade depth S for storing S compiled instructions, the FIFO memories sending instructions in sequence to the corresponding columns of the buffer array according to the time sequence and transferring them along the instruction stream direction.
8. The fusion buffer architecture based on an array structure according to claim 7, wherein the compiled instruction includes mode control bits, a data direction bit, channel selection bits, and bank selection bits;
the mode control bits are used for controlling the data operation mode executed by the target buffer unit;
the data direction bit is used for controlling the direction in which the target buffer unit transfers data;
the channel selection bits are used for controlling the gating state of the data selectors in the target buffer unit;
the bank selection bits are used for selecting the target bank in the buffer.
9. The fusion buffer architecture based on an array structure according to claim 8, wherein the buffer units in the same column as an instruction slot execute the same compiled instruction in the order of the instruction stream; when the data operation mode is data reading, the read-write circuit reads the target data from the target bank according to the compiled instruction, gates the data output according to the channel selection bits, and sends the target data into the buffer unit at the corresponding position according to the data direction bit; when the data operation mode is data writing, the read-write circuit reads the target data in the target data register according to the compiled instruction and writes it into the target bank; when the data operation mode is no-operation, the read-write circuit does not execute, and the channel-gated data selector and data register transfer data laterally.
10. The fusion buffer architecture based on an array structure according to claim 4, wherein the buffers, data selectors, and data registers in each row of buffer units share one power supply network, the instruction registers in all buffer units share another power supply network, and the power supply networks are connected to and controlled by the controller.
11. The fusion buffer architecture based on an array structure according to claim 10, wherein when a buffer unit in the buffer array is damaged, the power supply switch of the row containing the damaged buffer unit is turned off, and the N columns of instruction streams skip the damaged row and continue to execute.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410969196.9A CN118519960B (en) | 2024-07-19 | 2024-07-19 | Fusion buffer architecture based on array structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118519960A CN118519960A (en) | 2024-08-20 |
CN118519960B true CN118519960B (en) | 2024-10-01 |
Family
ID=92281228
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410969196.9A Active CN118519960B (en) | 2024-07-19 | 2024-07-19 | Fusion buffer architecture based on array structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118519960B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107851017A * | 2015-07-31 | 2018-03-27 | ARM Limited | Apparatus and method for transferring a plurality of data structures between memory and one or more vectors of data elements stored in a register bank |
CN108681984A * | 2018-07-26 | 2018-10-19 | Zhuhai Micro Semiconductor Co., Ltd. | Acceleration circuit for 3*3 convolution operations |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10565134B2 * | 2017-12-30 | 2020-02-18 | Intel Corporation | Apparatus, methods, and systems for multicast in a configurable spatial accelerator |
CN112419142B * | 2020-11-25 | 2023-10-24 | 中科融合感知智能研究院(苏州工业园区)有限公司 | System and method for improving the efficiency of a DCNN (deep convolutional neural network) computing array |
Also Published As
Publication number | Publication date |
---|---|
CN118519960A (en) | 2024-08-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |