CN102012803A - Configurable matrix register unit for supporting multi-width SIMD and multi-granularity SIMT

Info

Publication number: CN102012803A
Application number: CN2010105594582A
Authority: CN
Inventors: 陈书明; 张凯; 陈海燕; 万江华; 彭元喜; 刘仲; 阳柳; 杨惠; 刘蓬侠; 胡春媚; 唐涛
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2010-11-25
Filing date: 2010-11-25
Publication date: 2011-04-13
Anticipated expiration: 2030-11-25
Also published as: CN102012803B

Abstract

The invention relates to a configurable matrix register unit supporting multi-width single instruction, multiple data (SIMD) and multi-granularity single instruction, multiple threads (SIMT) operation. The unit comprises a matrix register and a control register SR. The matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2 (M = 0 denoting no partitioning). The control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit. The width of the control register is log₂C + log₂T, where C is the number of block modes of the matrix register and T is the number of multi-thread modes that the vector processor can handle. The unit is simple in principle and easy to operate, its block size and thread count can be configured flexibly, and it supports access to vector data in both multi-width SIMD and multi-granularity SIMT modes.

Description

Configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT

Technical Field

The present invention relates to the design of vector registers in vector processors, and in particular to a matrix register in a vector processor whose block size and thread count are configurable, so that vector operation units operating in single instruction, multiple data (SIMD) and single instruction, multiple threads (SIMT) modes can access data at multiple widths and multiple granularities.

Background Art

With continued research into 4G wireless communication and video and image processing, vector processors have come into wide use. Rapidly evolving wireless communication protocols and video and image processing algorithms require a large number of matrix operations, such as channel estimation, MIMO equalization, and the DCT. The parallel granularity of matrix operations differs between algorithms, as does the size of the matrix blocks they process. Only a vector processor that efficiently supports matrix operations of these differing counts and block sizes can adapt well to such data-intensive applications and meet real-time processing requirements.

The core algorithms of wireless communication protocols and of video and image processing usually exhibit data-level and thread-level parallelism at the same time. Vector processors for such applications typically adopt very long instruction word (VLIW) and single instruction, multiple data (SIMD) architectures, and also support single instruction, multiple threads (SIMT) techniques, to obtain sufficient parallel computing capability. These two classes of algorithms share a further characteristic: as protocols evolve rapidly, the vector lengths the algorithms process keep changing, and so does the exploitable thread-level parallelism within them. For example, in 3G wireless communication protocols, protocol evolution keeps changing the number of antennas at the base station and the handset, so the vector length in the channel equalization matrix keeps changing as well. This means that both the width of the vector data a vector processing unit handles and the number of threads it processes simultaneously keep changing. These characteristics place a strong demand on the vector processor to support multi-width SIMD and multi-granularity SIMT processing effectively at the architecture level. The present invention therefore proposes a matrix register with configurable block size and thread count that meets the vector operation requirements of algorithms with different parallel granularities and block sizes.

The storage cell array of a matrix register generally consists of N*M storage cells (M and N are integers greater than 1), each typically 4, 8, 12, 16, or 32 bits wide. Logically, the array can be viewed as N row vector registers VR_0 to VR_{N-1} or as M column vector registers CVR_0 to CVR_{M-1}; N and M are usually powers of 2. Each row vector register VR_i contains M elements (storage cells) E_{i,0} to E_{i,M-1} (i = 0, 1, 2, ..., N-1), and each column vector register CVR_i contains N elements E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., M-1). The matrix register reads and writes row and column vectors under the control of the read/write enable, read/write address, and row/column select signals.

Existing work provides access to fixed-size blocks of data in such matrix registers. These techniques read or write one row vector or column vector of the matrix at a time, and the vector length is fixed. When the required vector length is smaller or larger than this fixed length, several short vectors are usually combined into one long vector for parallel processing, or one long vector is split into several short vectors processed in steps. Such techniques cannot flexibly handle matrix data of different sizes, do not support multi-width SIMD processing, and do not support simultaneous access to multiple matrices in a multi-granularity SIMT manner. They therefore offer neither sufficient flexibility nor sufficient parallelism, thread-level parallelism in particular.

In summary, how to process matrix data efficiently and flexibly in a vector processor, how to supply flexible and sufficient parallel operands for its multi-granularity SIMT and multi-width SIMD processing, and how to improve the parallel processing efficiency of vector and array processors so as to meet the demand for large-scale matrix operations in applications such as wireless communication and image processing remain active research problems in this field.

Summary of the Invention

The technical problem to be solved by the present invention is as follows: in view of the problems of the prior art, the invention provides a matrix register unit that is simple in principle and easy to operate, whose block size and thread count can be configured flexibly, and that supports access to vector data in both multi-width SIMD and multi-granularity SIMT modes.

To solve the above technical problem, the present invention adopts the following technical solution:

A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that it comprises a matrix register and a control register SR. The matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2. The control register records the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit. The width of the control register is log₂C + log₂T, where C is the number of block modes of the matrix register and T is the number of multi-thread modes that the vector processor can handle.

As further improvements of the present invention:

When M is 0, the matrix register is not partitioned, and the vector operation unit accesses one row vector or column vector of the matrix register at a time. When M is not 0, the vector operation unit accesses one or more sub-row vectors or sub-column vectors of equal length in the matrix register, depending on the number of threads processed simultaneously; these equal-length sub-row or sub-column vectors come from different blocks.

When the vector operation unit accesses the matrix register, the address decoding logic decodes the contents of the control register SR, the read/write address, and the row/column select signal, and selects either one row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing.

The control register SR is either a separate control register or is held in the reserved bits of another control register, in which case the number of reserved bits of that register is an integer greater than log₂C + log₂T.

Compared with the prior art, the present invention has the following advantages: the matrix register unit supports access to vector data in multi-width SIMD and multi-granularity SIMT modes, is simple in principle and easy to operate, and allows the block size and thread count to be configured flexibly. It enables efficient and flexible processing of matrix data in a vector processor and supplies flexible and sufficient parallel operands for multi-granularity SIMT and multi-width SIMD processing, thereby improving the parallel processing efficiency of vector and array processors and meeting the demand for large-scale matrix operations in applications such as wireless communication and image processing.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the overall structure of the matrix register of the present invention;

Fig. 2 is a schematic diagram of the architectural framework of the vector processor;

Fig. 3 is a schematic diagram of the structure of the SR register of the present invention;

Fig. 4 is a schematic diagram of the storage cell array of the matrix register of the present invention;

Fig. 5 is a schematic diagram of the row address space of the matrix register when it is not partitioned;

Fig. 6 is a schematic diagram of the column address space of the matrix register when it is not partitioned;

Fig. 7 is a schematic diagram of the row address space of the partitioned matrix register in single-thread mode;

Fig. 8 is a schematic diagram of the column address space of the partitioned matrix register in single-thread mode;

Fig. 9 is a schematic diagram of the row address space of the partitioned matrix register in M-thread mode;

Fig. 10 is a schematic diagram of the column address space of the partitioned matrix register in M-thread mode;

Fig. 11 is a schematic diagram of the decoding path of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a schematic diagram of the overall structure of the matrix register of the present invention. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT comprises a matrix register and a control register SR.

When the read/write enable signal is asserted, the address decoding logic decodes the read/write address under the control of the row/column select signal and the control register SR, and selects either one row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing. The matrix register consists of an N*N array of storage cells, each W bits wide, for a total capacity of N*N*W bits. The matrix register can be partitioned by configuring certain fields of the control register SR. When the contents of SR indicate the block mode M*M, the matrix register is divided into M parts along both rows and columns, giving M*M sub-blocks, each of size (N/M)*(N/M). M generally takes the values 0, 2, 4, 8, ... and does not exceed N/2. When M is 0, the matrix register is not partitioned (this may also be called block mode 0*0). When M is not 0, each row vector register or column vector register is logically divided into M sub-row vector registers or sub-column vector registers, each containing N/M storage cells. In this case the unit of access by the vector operation unit is the length of a sub-row or sub-column vector register, and one or more sub-row vectors or sub-column vectors can be selected per access. In the present invention, the functional units of the vector operation unit access the matrix register in the same way as an ordinary register; in addition, the matrix register provides column access, supports vector reads and writes of different vector lengths, and supports SIMT-style access. The maximum bandwidth of each read or write of the matrix register is N*W bits.
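The partition arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the names `is_power_of_two` and `block_geometry` and the dictionary keys are assumptions introduced here, while the formulas (sub-block size (N/M)*(N/M), sub-vector length N/M, capacity N*N*W bits, maximum access N*W bits, M not exceeding N/2) are taken from the text.

```python
def is_power_of_two(x: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return x > 0 and (x & (x - 1)) == 0


def block_geometry(n: int, m: int, w: int) -> dict:
    """Hypothetical helper: sizes implied by block mode M*M on an N*N array.

    n: array dimension N, m: block mode M (0 means unpartitioned),
    w: cell width W in bits.
    """
    if not is_power_of_two(n):
        raise ValueError("N must be a positive power of 2")
    capacity_bits = n * n * w          # total storage: N*N*W bits
    max_access_bits = n * w            # widest single access: N*W bits
    if m == 0:
        # Unpartitioned: one block, and a whole row or column per access.
        return {"blocks": 1, "block_dim": n, "vector_len": n,
                "capacity_bits": capacity_bits,
                "max_access_bits": max_access_bits}
    if not is_power_of_two(m) or m < 2 or m > n // 2:
        raise ValueError("M must be 0 or a power of 2 with 2 <= M <= N/2")
    sub = n // m                       # side of each sub-block: N/M
    return {"blocks": m * m, "block_dim": sub, "vector_len": sub,
            "capacity_bits": capacity_bits,
            "max_access_bits": max_access_bits}
```

For instance, under these assumptions a 16*16 register of 32-bit cells in block mode 4*4 has 16 sub-blocks of 4*4 cells, and each sub-row or sub-column vector holds 4 cells.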

Fig. 3 is a schematic diagram of the structure of the control register SR. SR records the current block mode of the matrix register and the number of threads the operation units are processing simultaneously. To reduce cost and improve design reusability, the invention implements SR as a control register of the processor, so the programmer accesses it with the existing control-register access instructions. The width of the valid data in SR is the smallest integer greater than (log₂C + log₂T), where C is the number of block modes of the matrix register, T is the number of multi-thread modes the vector processor can handle, and C and T are both integers greater than 1. SR can exist as a dedicated, separate control register. Alternatively, since control registers in a processor generally have reserved bits, if the number of reserved bits in an existing control register is greater than or equal to log₂C + log₂T, no additional SR register is needed; otherwise a control register SR must be added. In either case, the programmer accesses SR with the existing control-register access instructions, so SR can be configured dynamically without adding any new instructions.
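As a rough illustration of how the two SR fields might be sized and packed, the following Python sketch may help. Several things here are assumptions rather than statements of the patent: each field is given ceil(log₂ count) bits (which equals log₂C + log₂T in total when C and T are powers of 2), the block-mode field is placed in the high bits, and the names `field_bits`, `sr_width`, `pack_sr`, `unpack_sr`, and `fits_in_reserved_bits` are hypothetical.

```python
def field_bits(count: int) -> int:
    """Bits needed to tell `count` modes apart, i.e. ceil(log2(count))."""
    if count < 2:
        raise ValueError("the text requires C and T to be greater than 1")
    return (count - 1).bit_length()


def sr_width(c: int, t: int) -> int:
    """Assumed SR payload width: one field for C block modes, one for T thread modes."""
    return field_bits(c) + field_bits(t)


def pack_sr(block_mode: int, thread_mode: int, c: int, t: int) -> int:
    """Pack the two mode indices; the field order is an assumption."""
    if not 0 <= block_mode < c or not 0 <= thread_mode < t:
        raise ValueError("mode index out of range")
    return (block_mode << field_bits(t)) | thread_mode


def unpack_sr(sr: int, c: int, t: int) -> tuple:
    """Inverse of pack_sr: returns (block_mode, thread_mode)."""
    tb = field_bits(t)
    return sr >> tb, sr & ((1 << tb) - 1)


def fits_in_reserved_bits(reserved: int, c: int, t: int) -> bool:
    """True if an existing control register has enough reserved bits to host SR."""
    return reserved >= sr_width(c, t)
```

With four block modes and four thread modes, for example, this sketch yields a 4-bit payload, so an existing control register with at least 4 reserved bits could host it.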

Every access by the vector processing unit to the matrix register must supply three signals: read/write enable, read/write address, and row/column select. The address decoding logic decodes the contents of the control register SR, the read/write address, and the row/column select signal, and selects either one row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing.

The invention defines a complete address mapping scheme. Under different block modes and thread modes, the matrix register presents different address views, which give the programmer complete and flexible access. The rules of the address mapping scheme are as follows. Blocks are ordered top to bottom first, then left to right. Within a block, row addresses increase linearly from top to bottom and column addresses increase linearly from left to right. When the number of threads is L (L ≤ max(M)), L consecutive blocks in the row direction share one column vector address space, and L consecutive blocks in the column direction share one row vector address space. That is, when the number of threads is L, the vector processing units can simultaneously access L sub-row vectors or L sub-column vectors of the matrix register; each sub-row or sub-column vector is processed by V vector processing units, where V is the length of the sub-row or sub-column vector.
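Two pieces of this scheme lend themselves to a small Python sketch: the block ordering and the number of processing units kept busy by L threads. The figures are not reproduced here, so the reading of "top to bottom first, then left to right" as column-major block numbering is an assumption, as are the function names `block_coords` and `lane_allocation`.

```python
def block_coords(block_index: int, m: int) -> tuple:
    """Linear block index -> (block_row, block_col).

    Assumed reading of the ordering rule: walk down the first block column,
    then continue with the next block column to the right.
    """
    if not 0 <= block_index < m * m:
        raise ValueError("block index out of range")
    return block_index % m, block_index // m


def lane_allocation(n: int, m: int, l: int) -> dict:
    """L threads, each served by V = N/M processing units (V = sub-vector length)."""
    if m < 1 or n % m != 0 or not 1 <= l <= m:
        raise ValueError("expected M >= 1, M dividing N, and 1 <= L <= M")
    v = n // m
    return {"threads": l, "units_per_thread": v, "active_units": l * v}
```

Under this reading, the four blocks of a 2*2 partition are numbered top-left, bottom-left, top-right, bottom-right, and with N = 16, M = 4, L = 4 all 16 processing units are active.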

Fig. 2 shows the architectural framework of the vector processor. A vector processor generally consists of N parallel processing elements (PEs), each with i functional units such as MAC, ALU, division, and shift units; each functional unit can read and write the matrix register as needed. The N PEs operate in SIMD fashion, and each PE generally has a VLIW structure internally, with multiple functional units operating in parallel. The cell array of the matrix register is of size N*N and logically forms N row vector registers VR_0 to VR_{N-1} and N column vector registers CVR_0 to CVR_{N-1}. Each row vector register VR_i contains N storage cells E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1), and each column vector register CVR_i contains N storage cells E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., N-1). The matrix register is designed with multiple read/write ports, so it can supply source operands to multiple functional units of the N PEs simultaneously and accept writes from multiple functional units of the N PEs.

Fig. 4 is a schematic diagram of the storage cell array of the matrix register. The array generally consists of N*N storage cells, where N is usually a power of 2. Each cell is W bits wide, with W typically 4, 8, 12, 16, or 32. Logically, the array can be viewed as N row vector registers VR_0 to VR_{N-1} or as N column vector registers CVR_0 to CVR_{N-1}. Each row vector register VR_i contains N elements (storage cells) E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1); VR_0, for example, comprises cells E_{0,0} to E_{0,N-1}. The array is divided by column into N cell columns of N*W bits each, every column consisting of the N elements in the same column position. These N cell columns correspond one-to-one to the N column vector registers CVR_0 to CVR_{N-1} and implement the access function of the corresponding column vector register. CVR_{N-1}, for example, comprises the last element E_{i,N-1} (i = 0, 1, 2, ..., N-1) of every row vector register VR_0 to VR_{N-1}.
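The dual row/column view of the same cell array can be modelled with a plain list of lists. The sketch below is a software analogy only; the names `make_register`, `read_row`, and `read_col` are hypothetical, and each cell simply stores its own (i, j) index so that the views are easy to inspect.

```python
def make_register(n: int) -> list:
    """N*N array in which cell E_{i,j} holds the tag (i, j)."""
    return [[(i, j) for j in range(n)] for i in range(n)]


def read_row(reg: list, i: int) -> list:
    """Row vector register VR_i: cells E_{i,0} .. E_{i,N-1}."""
    return list(reg[i])


def read_col(reg: list, j: int) -> list:
    """Column vector register CVR_j: cells E_{0,j} .. E_{N-1,j}."""
    return [row[j] for row in reg]
```

Reading `read_col(reg, n - 1)` returns the last element of every row, which matches the CVR_{N-1} example in the text.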

Fig. 5 is a schematic diagram of the row address space of the matrix register when it is not partitioned. When the block-mode field of SR indicates that the matrix register is not partitioned, the matrix register supports only single-thread operation. The matrix register consists of N row vector registers, each containing N storage cells E_{i,0} to E_{i,N-1} (i = 0, 1, 2, ..., N-1). A functional unit of the vector operation unit can read or write one row vector register per access, with a data bandwidth of N*W bits. Row vector register addresses increase linearly from top to bottom. The vector operation unit accesses different row vectors of the matrix register by supplying different row addresses.

Fig. 6 is a schematic diagram of the column address space of the matrix register when it is not partitioned. When the block-mode field of SR indicates that the matrix register is not partitioned, the matrix register supports only single-thread operation. The matrix register consists of N column vector registers, each containing N storage cells E_{0,i} to E_{N-1,i} (i = 0, 1, 2, ..., N-1). A functional unit of the vector operation unit can read or write one column vector register per access, with a data bandwidth of N*W bits. Column vector register addresses increase linearly from left to right. The vector operation unit accesses different column vectors of the matrix register by supplying different column addresses.

Fig. 7 is a schematic diagram of the row address space of the partitioned matrix register in single-thread mode. When the block-mode field of SR indicates block mode M*M and the thread-mode field indicates single-thread operation, the matrix register consists of N*M sub-row vector registers, each containing N/M storage cells E_{i,0} to E_{i,N/M-1} (i = 0, 1, 2, ..., N*M-1). A functional unit of the vector operation unit can read or write one sub-row vector register per access, with a data bandwidth of (N/M)*W bits. Sub-row vector registers are addressed according to the following rule: blocks are ordered top to bottom first, then left to right, and within a block addresses increase linearly from top to bottom. The vector operation unit accesses different sub-row vectors of the matrix register by supplying different row addresses. Different values of M support SIMD operations of different lengths, that is, multi-width SIMD access.
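One possible reading of this single-thread sub-row addressing rule is sketched below in Python. Fig. 7 is not available here, so the exact numbering is an assumption: blocks are numbered down each block column first, and each block contributes N/M consecutive addresses. The function name `decode_sub_row` is hypothetical.

```python
def decode_sub_row(addr: int, n: int, m: int) -> list:
    """Sub-row address -> list of (row, col) cells, single-thread mode.

    Assumes block mode M*M with M >= 2, and blocks numbered top to bottom
    within a block column, then left to right across block columns.
    """
    s = n // m                                   # sub-vector length N/M
    if not 0 <= addr < n * m:
        raise ValueError("there are N*M sub-row addresses")
    block, r = divmod(addr, s)                   # block number, row inside block
    block_row, block_col = block % m, block // m
    row = block_row * s + r
    return [(row, block_col * s + k) for k in range(s)]
```

For N = 4 and M = 2 this gives eight addresses: addresses 0 to 3 walk down the left half of rows 0 to 3, and addresses 4 to 7 walk down the right half, so every cell is reachable by exactly one address.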

Fig. 8 is a schematic diagram of the column address space of the partitioned matrix register in single-thread mode. When the block-mode field of SR indicates block mode M*M and the thread-mode field indicates single-thread operation, the matrix register consists of N*M sub-column vector registers, each containing N/M storage cells E_{0,i} to E_{N/M-1,i} (i = 0, 1, 2, ..., N*M-1). A functional unit of the vector operation unit can read or write one sub-column vector register per access, with a data bandwidth of (N/M)*W bits. Sub-column vector registers are addressed according to the following rule: blocks are ordered top to bottom first, then left to right, and within a block addresses increase linearly from left to right. The vector operation unit accesses different sub-column vectors of the matrix register by supplying different column addresses. Different values of M support SIMD operations of different lengths, that is, multi-width SIMD access.

Fig. 9 is a schematic diagram of the row address space of the partitioned matrix register in M-thread mode. When the block-mode field of SR indicates block mode M*M and the thread-mode field indicates multi-thread operation with L threads, the matrix register consists of N*M/L sub-row vector registers, each containing N/M storage cells E_{i,0} to E_{i,N/M-1} (i = 0, 1, 2, ..., N*M/L-1). A functional unit of the vector operation unit can read or write L sub-row vector registers per access, with a data bandwidth of (L*N/M)*W bits. Sub-row vector registers are addressed according to the following rule: blocks are ordered top to bottom first, then left to right; within a block addresses increase linearly from top to bottom; and L consecutive blocks in the column direction share one row address space. For each row address, the vector operation unit accesses L different sub-row vectors of the matrix register. These L sub-row vectors share the same address but come from different matrix data blocks, so the access is in effect a multi-threaded data access. Fig. 9 shows the sub-row vector addressing when L equals M. Different values of M support SIMD operations of different lengths, that is, multi-width SIMD access. Different values of L provide thread-level parallelism of different granularities, that is, multi-granularity SIMT access.
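The shared row address space can be sketched in Python as follows. This is one interpretation and not the patent's definitive mapping: it assumes that L divides M, that blocks are numbered down each block column first, and that each group of L consecutive blocks in that numbering shares N/M row addresses. The function name `decode_sub_rows_mt` is hypothetical.

```python
def decode_sub_rows_mt(addr: int, n: int, m: int, l: int) -> list:
    """Shared row address -> L sub-row vectors, one per thread.

    Returns a list of L lists of (row, col) cells. Assumes M >= 2, L divides M,
    and blocks numbered top to bottom, then left to right.
    """
    s = n // m                                   # sub-vector length N/M
    if l < 1 or m % l != 0:
        raise ValueError("this sketch assumes L divides M")
    if not 0 <= addr < n * m // l:
        raise ValueError("there are N*M/L shared row addresses")
    group, r = divmod(addr, s)                   # block group, row inside block
    vectors = []
    for t in range(l):                           # thread t reads block group*L + t
        block = group * l + t
        block_row, block_col = block % m, block // m
        row = block_row * s + r
        vectors.append([(row, block_col * s + k) for k in range(s)])
    return vectors
```

With N = 4, M = 2, and L = 2, address 0 selects the left halves of rows 0 and 2, one per thread, from the two vertically adjacent blocks. With L = 1 the function degenerates to single-thread sub-row access.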

FIG. 10 is a schematic diagram of the column address space in M-thread mode when the matrix register of the present invention is divided into blocks. When the block mode field in SR indicates that the matrix register is divided into M*M blocks and the thread mode field indicates that the current operation is a multi-threaded operation with L threads, the matrix register consists of N*M/L sub-column vector registers, each containing N/M storage units E_0,i through E_N/M-1,i (i = 0, 1, 2, ..., (N-1)*M/L). The functional units of the vector operation unit can read and write L sub-column vector registers, and the data bandwidth of each read or write is (L*N/M)*W bits. The sub-column vector registers are addressed by the following rules: across blocks, addresses run from top to bottom first and then from left to right; within a block, addresses increase linearly from left to right; and L consecutive blocks in the row direction share one column address space. The vector operation unit can access L different sub-column vectors of the matrix register according to different row addresses. These L sub-column vectors share the same address but come from different matrix data blocks, so accessing them together is in essence a multi-threaded data access. FIG. 10 shows the sub-column vector addressing scheme for the case where L equals M.
Different values of M support SIMD operations of different lengths, that is, multi-width SIMD access. Different values of L give thread-level parallelism of different granularities, that is, the multi-granularity SIMT access mode.
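The quantities stated for the blocked, multi-threaded column mode can be checked with a short calculation. The following is a minimal sketch, not part of the disclosed design: the function name `sub_column_layout` and the example values (N = 16, M = 4, W = 32) are hypothetical, and the check L <= M <= N is an assumption drawn from the statement that L consecutive blocks in the row direction share one column address space.

```python
def sub_column_layout(n, m, l, w):
    """Return (address_count, elements_per_vector, bits_per_access).

    n: matrix register is n x n storage units
    m: register is divided into m x m blocks
    l: number of threads processed simultaneously
    w: width of one storage unit in bits
    """
    for name, value in (("n", n), ("m", m), ("l", l)):
        # Assumption: all three parameters are positive powers of 2.
        if value < 1 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of 2")
    if m > n or l > m:
        raise ValueError("expected l <= m <= n")  # assumption, see lead-in

    address_count = n * m // l          # N*M/L sub-column vector registers
    elements_per_vector = n // m        # N/M storage units per sub-column
    bits_per_access = l * elements_per_vector * w  # (L*N/M)*W bits
    return address_count, elements_per_vector, bits_per_access


# L equal to M, the case drawn in FIG. 10 (example sizes are hypothetical).
print(sub_column_layout(16, 4, 4, 32))
# Fewer threads: more distinct addresses, narrower access.
print(sub_column_layout(16, 4, 2, 32))
```

Halving L doubles the number of distinct sub-column addresses and halves the bits moved per access, which is the trade-off between SIMT granularity and access width described above.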

FIG. 11 is a schematic diagram of the decoding path of the matrix register of the present invention. The decoding path consists of address decoding logic, read and write data buffer units, and read and write data buses. When the vector processing unit reads or writes the matrix register, the address decoding logic decodes the address according to the read/write address, the row/column select signal, and the content of SR, and selects one or more row/column vectors, or one or more sub-row/sub-column vectors, for reading or writing. On a read, once a storage unit is selected by the decoding logic, its content is read out onto the read data bus of the row in which the storage unit is located and then sent to the read data buffer. The read data buffer unit assembles the data from the different storage units into a vector and returns it to the vector operation unit. On a write, the write data buffer unit splits the vector data from the vector operation unit into multiple data items to be written to different storage units. Once a storage unit is selected by the decoding logic, the content to be written is placed on the write data bus of the row in which the storage unit is located and is written into the storage unit when the clock is valid.
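The read and write flow through the decoding path can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name `MatrixRegisterModel` and its methods are hypothetical, only the unblocked mode (one whole row or column vector per access) is modeled, and the data buses and clocking are abstracted away.

```python
class MatrixRegisterModel:
    """Behavioral sketch of an N x N matrix register (unblocked mode only)."""

    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]  # storage units

    def _decode(self, address, select_row):
        # Address decoding: pick the storage units of one row or one column.
        if not 0 <= address < self.n:
            raise IndexError("address out of range")
        if select_row:
            return [(address, col) for col in range(self.n)]
        return [(row, address) for row in range(self.n)]

    def read(self, address, select_row):
        # Read buffer: gather the selected units into one vector.
        return [self.cells[r][c] for r, c in self._decode(address, select_row)]

    def write(self, address, select_row, vector):
        # Write buffer: split the vector into per-unit data items.
        if len(vector) != self.n:
            raise ValueError("vector length must equal n")
        for (r, c), value in zip(self._decode(address, select_row), vector):
            self.cells[r][c] = value


reg = MatrixRegisterModel(4)
for row in range(4):
    reg.write(row, True, [10 * row + col for col in range(4)])
print(reg.read(1, False))  # column 1 of the stored matrix: [1, 11, 21, 31]
```

Writing by rows and reading by columns in this model returns the transposed data without any explicit shuffle, which is the kind of access a row/column select signal makes possible.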

The above are only preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments; all technical solutions that fall under the idea of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the present invention shall also be regarded as falling within its scope of protection.

Claims

1. A configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT, characterized in that it comprises a matrix register and a control register SR; the matrix register, of size N*N, is divided into M*M blocks, where N is a positive integer and a power of 2, and M is an integer greater than or equal to 0 and a power of 2; the block mode of the matrix register and the number of threads processed simultaneously by the vector processing unit are recorded in the control register; and the width of the control register is log₂C + log₂T, where C is the number of block modes of the matrix register and T is the number of multi-thread modes that the vector processor can handle.

2. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 1, characterized in that: when M is 0, the matrix register is not divided into blocks, and the vector operation unit accesses one row vector or one column vector of the matrix register at a time; when M is not 0, the vector operation unit accesses, according to the number of threads processed simultaneously, one or more sub-row vectors or sub-column vectors of the same length in the matrix register, these sub-row or sub-column vectors of the same length coming from different block matrices.

3. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 2, characterized in that: when the vector operation unit accesses the matrix register, the address decoding logic unit decodes according to the content of SR, the read/write address, and the row/column select signal, and selects either one row vector or column vector of the matrix register, or one or more sub-row vectors or sub-column vectors, for reading or writing.

4. The configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT according to claim 1, 2 or 3, characterized in that: the control register SR is an independent control register, or is stored in the reserved bits of another control register, the length of the reserved bits of that other control register being an integer greater than log₂C + log₂T.
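For illustration only, and not as part of the claims, the control register width log₂C + log₂T can be worked through in a short sketch. The function names, the field order (block mode in the high bits, thread mode in the low bits), and the restriction of C and T to powers of 2 are all assumptions made for this example; the document does not specify an encoding.

```python
def _log2_exact(value):
    # Assumption: mode counts are powers of 2, so the logarithm is an integer.
    if value < 1 or value & (value - 1):
        raise ValueError("mode count must be a positive power of 2")
    return value.bit_length() - 1


def sr_width(c, t):
    """Width of SR in bits: log2(C) + log2(T)."""
    return _log2_exact(c) + _log2_exact(t)


def pack_sr(block_mode, thread_mode, c, t):
    """Pack the two mode fields into one integer (hypothetical layout)."""
    if not (0 <= block_mode < c and 0 <= thread_mode < t):
        raise ValueError("mode index out of range")
    return (block_mode << _log2_exact(t)) | thread_mode


def unpack_sr(value, t):
    """Recover (block_mode, thread_mode) from a packed SR value."""
    thread_bits = _log2_exact(t)
    return value >> thread_bits, value & ((1 << thread_bits) - 1)


# Hypothetical configuration: 4 block modes and 4 multi-thread modes.
print(sr_width(4, 4))       # 4
sr = pack_sr(2, 3, 4, 4)
print(sr, unpack_sr(sr, 4))  # 11 (2, 3)
```

With these example counts, SR needs 4 bits, so under claim 4 it would fit in another control register only if that register has more than 4 reserved bits.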