CN103389967B

CN103389967B - Device and method for SRAM-based matrix transposition

Info

Publication number: CN103389967B
Application number: CN201310367449.7A
Authority: CN
Inventors: 胡封林; 郭阳; 刘仲; 吴虎成; 李振涛; 罗恒; 余再祥; 亓磊; 申晖
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2013-08-21
Filing date: 2013-08-21
Publication date: 2016-06-01
Anticipated expiration: 2033-08-21
Also published as: CN103389967A

Abstract

The invention discloses a device and method for SRAM-based matrix transposition. The device comprises an address decoding module, a read-write bus, a diagonal read control module and n matrix transposition memory bodies. Each matrix transposition memory body is a matrix structure formed by connecting n rows and n columns of matrix storage modules. The modules on the diagonal from upper left to lower right are 14-transistor (14T) SRAM storage modules, and the modules off the diagonal are 12-transistor (12T) SRAM storage modules. The 14T SRAM storage modules access the diagonal elements of the matrix, and the 12T SRAM storage modules access the off-diagonal elements. The method selects an access mode according to the type of the input matrix and completes the transposition through word, row, column, or diagonal access. The invention is simple to implement, has a simple and compact structure, low cost, fast transposition and high efficiency, and is flexible and multifunctional.

Description

A device and method for matrix transposition based on SRAM

Technical field

The present invention relates to the field of matrix transposition data processing, and in particular to a device and method for SRAM-based matrix transposition.

Background

Matrix transposition is widely used in scientific and engineering computing, for example in matrix decomposition, linear algebra solvers, 3G communication algorithms, image and video algorithms, radar, and underwater acoustics. For an n×n source matrix A = [a_ij] with n > 1, the transposed matrix A^T = [b_ij] is formed by exchanging each lower-triangular element a_ij (i > j) with its symmetric element a_ji across the main diagonal. Formula (1) gives the element-wise correspondence between the transposed matrix A^T = [b_ij] and the source matrix A = [a_ij]:

$$b_{ji} = a_{ij} \qquad (i = 0, 1, \dots, n-1;\ j = 0, 1, \dots, n-1) \tag{1}$$

Fast, efficient matrix transposition methods and devices can greatly improve the performance and efficiency of many systems. Over the years, various software-based and hardware-based matrix transposition methods have been invented and widely applied.

Among software methods, the simplest takes the main diagonal as the axis and exchanges one pair of elements at a time. For an n×n matrix its time overhead is O(n(n-1)/2), which is inefficient. Eklundh proposed a fast matrix transposition algorithm that is completely independent of the hardware structure and is based on a divide-and-conquer strategy, with a time overhead of O(n log n). Its drawbacks are that n must be an integer power of 2 and that its real-time performance is poor. PRIM proposed an algorithm that removes the power-of-2 restriction on n, with the same time overhead as Eklundh's algorithm. For vector processors, it has been proposed to run Eklundh's algorithm on SIMD hardware; this approach needs additional hardware instructions to reach good efficiency, and its time overhead is O(log n). In addition, Intel, MIPS, and Motorola have each proposed their own matrix transposition methods; these differ in detail but have the same time overhead.
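The simplest software method described above can be sketched as follows. This is a minimal illustration rather than code from the patent; the function name `transpose_in_place` and the swap counter are assumptions added to make the n(n-1)/2 exchange count visible.

```python
def transpose_in_place(a):
    """Transpose a square matrix by swapping each lower-triangular element
    a[i][j] (i > j) with its mirror a[j][i]; return the number of swaps."""
    n = len(a)
    swaps = 0
    for i in range(n):
        for j in range(i):  # j < i, so (i, j) is below the main diagonal
            a[i][j], a[j][i] = a[j][i], a[i][j]
            swaps += 1
    return swaps


matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
swap_count = transpose_in_place(matrix)
print(matrix)      # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(swap_count)  # 3, which equals n(n-1)/2 for n = 3
```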

Among hardware methods, several designs have been implemented, such as the systolic array (Systolic Array), the universal array architecture (Universal Array Architecture), and memory skewing (Memory Skewing).

(1) Systolic Array method

In the Systolic Array method the transposition operation is simple: data are loaded by column or by row and read out from the top. For an n×n matrix, 3n-1 steps are needed to complete the transposition. However, the array is built from registers and combinational switches, so its hardware cost is relatively high.

(2) Universal Array Architecture method

The Universal Array Architecture method is both simpler and more efficient than the Systolic Array method. Data are written one row (column) at a time; once all data have been loaded, they are read out one column (row) at a time, and the transposition is complete when all data have been read. For an n×n matrix, 2n+1 steps are needed to complete the transposition. However, this method is also implemented with registers and combinational switches, so its hardware cost and power consumption are high.

(3) Memory Skewing method

The Memory Skewing method uses a parallel memory system to implement row and column operations on a matrix and was used in the Illiac IV and BSP computer systems. It adds considerable hardware overhead, which lowers memory access speed, and it is too complex to be of much practical value.
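The step counts quoted above for the two register-based designs can be compared numerically. The sketch below only evaluates the formulas 3n-1 and 2n+1 given in the text; the function names are hypothetical.

```python
def systolic_array_steps(n):
    """Steps quoted in the text for the Systolic Array method."""
    return 3 * n - 1


def universal_array_steps(n):
    """Steps quoted in the text for the Universal Array Architecture method."""
    return 2 * n + 1


for n in (4, 8, 16):
    print(n, systolic_array_steps(n), universal_array_steps(n))
# 4 11 9
# 8 23 17
# 16 47 33
```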

At present, the feature size of large-scale integrated circuits keeps shrinking and has reached the 10 nm level, while integration requirements keep rising. Both general-purpose CPUs and DSPs are moving from single-core to multi-core designs, and the demands on processing capability keep growing: from scalar to vector processing, from vector to matrix processing, and from fine-grained to coarse-grained matrix operations. To meet these needs, supercomputers and DSPs oriented to vector and matrix operations need matrix operation units backed by matrix register bodies. One of the first problems in matrix computing is matrix transposition, so a future matrix computer architecture first needs a transposition component that is high-performance, low-power, structurally flexible, and multifunctional. None of the existing approaches, whether software methods, dedicated hardware, SIMD (Single Instruction Multiple Data) network structures, or methods built on existing memories, meets the requirements of a matrix computer simply, efficiently, and flexibly.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the technical problems in the prior art, the invention provides a device and method for SRAM-based matrix transposition that is simple to implement, simple and compact in structure, low in cost, fast in transposition, efficient, flexible, and multifunctional.

To solve the above technical problem, the invention provides the following technical solution:

A device for SRAM-based matrix transposition comprises an address decoding module, a read-write bus, a diagonal read control module, and n matrix transposition memory bodies. Each matrix transposition memory body is a matrix structure formed by connecting n rows and n columns of matrix storage modules. In the matrix structure of each memory body, the matrix storage modules on the diagonal from upper left to lower right are all 14-transistor (14T) SRAM storage modules, and the matrix storage modules off the diagonal are all 12-transistor (12T) SRAM storage modules. The 14T SRAM storage modules are used to access the diagonal elements of the matrix to be transposed; the 12T SRAM storage modules are used to access its off-diagonal elements. The diagonal read control module is connected to the matrix transposition memory body and is used to decode the diagonal address and to control read operations on the diagonal elements of the matrix to be transposed.
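The module arrangement of one memory body can be illustrated with a small sketch. The function names are hypothetical, and the transistor count assumes that each module holds w 1-bit cells and covers the cell array only, not decoders, buses, or the diagonal read control module.

```python
def module_layout(n):
    """Return an n x n grid of module types: '14T' on the main diagonal,
    '12T' everywhere else."""
    return [["14T" if r == c else "12T" for c in range(n)] for r in range(n)]


def cell_array_transistors(n, w):
    """Transistors in the cell array of one memory body, assuming each
    module is built from w 1-bit cells (peripheral circuits excluded)."""
    diagonal_modules = n
    off_diagonal_modules = n * n - n
    return w * (14 * diagonal_modules + 12 * off_diagonal_modules)


for row in module_layout(3):
    print(row)
# ['14T', '12T', '12T']
# ['12T', '14T', '12T']
# ['12T', '12T', '14T']
print(cell_array_transistors(4, 8))  # 1600
```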

As a further improvement of the device of the invention: the 14T SRAM storage module comprises a plurality of 1-bit 14T SRAM cells connected in sequence, and the number of 1-bit 14T SRAM cells equals the word length w of the elements of the matrix to be transposed.

As a further improvement of the device of the invention: the 1-bit 14T SRAM cell has separate control transistors for read and write operations, which respectively control the gating of the read and write addresses. The control transistors of the 1-bit 14T SRAM cell comprise two row-access row control transistors, two row-access column control transistors, two column-access column control transistors, two column-access row control transistors, and two diagonal-element read control transistors. The diagonal-element read control transistors control the gating of the addresses of the diagonal elements of the matrix to be transposed; the row-access row, row-access column, column-access column, and column-access row control transistors control the gating of the row and column data addresses of the matrix to be transposed.

As a further improvement of the device of the invention: the 12T SRAM storage module comprises a plurality of 1-bit 12T SRAM cells connected in sequence, and the number of 1-bit 12T SRAM cells equals the word length w of the elements of the matrix to be transposed.

As a further improvement of the device of the invention: the 1-bit 12T SRAM cell has separate control transistors for read and write operations, which respectively control the gating of the read and write addresses. The control transistors of the 1-bit 12T SRAM cell comprise two row-access row control transistors, two row-access column control transistors, two column-access column control transistors, and two column-access row control transistors. These control transistors form the row and column address signal terminals for row and column data access and control the gating of the row and column data addresses of the matrix to be transposed.

As a further improvement of the device of the invention: the address decoding module comprises a row-access column address decoding module, a row-access row address decoding module, a column-access column address decoding module, and a column-access row address decoding module, each connected to the matrix transposition memory body.

The invention further provides a method for matrix transposition, with the following steps:

(1) Input the matrix to be transposed and determine its type. If it is of order m×m with m > n, go to step (2); if it is of order n×n, or of order m×m with m < n, go to step (3).

(2) Control several n-row, n-column registers to operate jointly, synchronize the row and column address decoding of all of them, and perform step (3) to access each register.

(3) Configure the access mode by instruction. If single-word access mode is configured, go to step (4); if row-write/column-read access mode is configured, perform step (5) and then step (6); if column-write/row-read access mode is configured, perform step (6) and then step (5); if diagonal-vector access mode is configured, go to step (7).

(4) Access in row-major fashion: the row address is decoded and the corresponding row-access row address is gated, the column address is decoded and the corresponding row-access column address is gated, and the single word is accessed through the row read-write bus.

(5) Read or write all elements of a row vector simultaneously: the row address is decoded and the corresponding row-access row address is gated, the column address is decoded and all row-access column addresses are gated, and the row is accessed through the row read-write bus.

(6) Read or write all elements of a column vector simultaneously: the row address is decoded and the corresponding column-access column address is gated, the column address is decoded and all column-access row addresses are gated, and the column is accessed through the column read-write bus.

(7) Access all elements of the diagonal vector simultaneously: after address decoding, all column-access row addresses are gated, and the diagonal is accessed through the column read bus.
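The access modes in steps (4) to (7) can be sketched as a behavioral model. This is an illustration under stated assumptions, not the circuit itself: the class `TransposeMemory` and its method names are hypothetical, indices are 0-based, and address decoding and bus timing are not modeled.

```python
class TransposeMemory:
    """Behavioral model of one n x n matrix transposition memory body."""

    def __init__(self, n):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]

    # Step (4): single-word access through the row port.
    def write_word(self, row, col, value):
        self.cells[row][col] = value

    def read_word(self, row, col):
        return self.cells[row][col]

    # Step (5): all elements of one row are accessed together.
    def write_row(self, row, values):
        self.cells[row] = list(values)

    def read_row(self, row):
        return list(self.cells[row])

    # Step (6): all elements of one column are accessed together.
    def write_column(self, col, values):
        for row, value in enumerate(values):
            self.cells[row][col] = value

    def read_column(self, col):
        return [self.cells[row][col] for row in range(self.n)]

    # Step (7): the diagonal is read out as one vector.
    def read_diagonal(self):
        return [self.cells[i][i] for i in range(self.n)]


def transpose_row_write_column_read(matrix):
    """Row-write/column-read mode: n row writes followed by n column reads."""
    n = len(matrix)
    mem = TransposeMemory(n)
    for r in range(n):
        mem.write_row(r, matrix[r])
    return [mem.read_column(c) for c in range(n)]


source = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(transpose_row_write_column_read(source))
# [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
```

Column-write/row-read mode is the mirror image: write each source row with `write_column` and read the result with `read_row`.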

As a further improvement of the invention: the access mode configured in step (3) is one of single-word access, row-write/column-read access, column-write/row-read access, and diagonal-vector access.
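Step (2) of the method says that several n×n units operate jointly for an m×m matrix with m > n, but it does not spell out how the data are partitioned. One possible reading, shown here purely as an assumption, is a tiled transposition: each n×n tile is transposed on its own and placed at the mirrored tile position. The function names are hypothetical, and m is assumed to be a multiple of n.

```python
def transpose_block(block):
    """Stand-in for one n x n memory body: rows written, columns read."""
    n = len(block)
    return [[block[r][c] for r in range(n)] for c in range(n)]


def blocked_transpose(matrix, n):
    """Transpose an m x m matrix using n x n tiles (assumes m % n == 0)."""
    m = len(matrix)
    if m % n != 0:
        raise ValueError("this sketch assumes m is a multiple of n")
    result = [[None] * m for _ in range(m)]
    for br in range(0, m, n):
        for bc in range(0, m, n):
            tile = [row[bc:bc + n] for row in matrix[br:br + n]]
            transposed = transpose_block(tile)
            # The tile at (br, bc) lands at the mirrored position (bc, br).
            for r in range(n):
                result[bc + r][br:br + n] = transposed[r]
    return result


big = [[4 * r + c for c in range(4)] for r in range(4)]
print(blocked_transpose(big, 2))
# [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
```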

As a further improvement of the method of the invention, step (5) is carried out as follows:

(5.1) Determine the row-vector size of the matrix to be transposed. If there are n rows, go to step (5.2); if there are m rows with m < n, go to step (5.3).

(5.2) Read or write all elements of the row vector simultaneously: the row address is decoded and the corresponding row-access row address among rows 1 to n is gated, the column address is decoded and all row-access column addresses are gated, and the row is accessed through the row read-write bus.

(5.3) Read or write all elements of the row vector simultaneously: the row address is decoded and the corresponding row-access row address among rows 1 to m is gated, the column address is decoded and the row-access column addresses of columns 1 to m are gated, and the row is accessed through the row read-write bus.
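The gating patterns of steps (5.2) and (5.3) can be written out as select-line vectors. The sketch below is an illustration only: the function name and the list-of-bits representation are assumptions, and indices are 0-based while the text counts rows and columns from 1.

```python
def row_access_select_lines(n, m, row):
    """Select lines for accessing one row of an m x m matrix (m <= n)
    stored in an n x n memory body.

    Returns (row_select, column_select): row_select is one-hot over the
    n rows, column_select enables the first m columns.
    """
    if not (0 < m <= n and 0 <= row < m):
        raise ValueError("row or matrix size out of range")
    row_select = [1 if r == row else 0 for r in range(n)]
    column_select = [1 if c < m else 0 for c in range(n)]
    return row_select, column_select


print(row_access_select_lines(4, 4, 2))  # ([0, 0, 1, 0], [1, 1, 1, 1])
print(row_access_select_lines(4, 3, 0))  # ([1, 0, 0, 0], [1, 1, 1, 0])
```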

As a further improvement of the method of the invention, step (6) is carried out as follows:

(6.1) Determine the column-vector size of the matrix to be transposed. If there are n columns, go to step (6.2); if there are m columns with m < n, go to step (6.3).

(6.2) Read or write all elements of the column vector simultaneously: the row address is decoded and the corresponding column-access column address among columns 1 to n is gated, the column address is decoded and all column-access row addresses are gated, and the column is accessed through the column read-write bus.

(6.3) Read or write all elements of the column vector simultaneously: the row address is decoded and the corresponding column-access column address among columns 1 to m is gated, the column address is decoded and all column-access row addresses are gated, and the column is accessed through the column read-write bus.
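Combining steps (5.3) and (6.3), an m×m matrix with m < n can be transposed inside an n×n cell array. The following sketch is hypothetical: it assumes that the unused cells simply hold zeros and that only the first m entries of each column read belong to the matrix, which the text does not state explicitly.

```python
def transpose_submatrix(matrix, n):
    """Transpose an m x m matrix (m <= n) inside an n x n cell array."""
    m = len(matrix)
    if m > n:
        raise ValueError("matrix does not fit in the cell array")
    cells = [[0] * n for _ in range(n)]
    # Row writes: row r selected, columns 0..m-1 enabled, as in step (5.3).
    for r in range(m):
        for c in range(m):
            cells[r][c] = matrix[r][c]
    # Column reads: column c selected, every row line enabled, as in step (6.3).
    # Assumption: entries beyond the first m are padding and are dropped.
    result = []
    for c in range(m):
        column = [cells[r][c] for r in range(n)]
        result.append(column[:m])
    return result


print(transpose_submatrix([[1, 2], [3, 4]], 4))  # [[1, 3], [2, 4]]
```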

Compared with the prior art, the invention has the following advantages:

(1) Building on the efficient read and write performance of SRAM storage cells, the invention separates read and write operations and supports row, column, and diagonal read operations, so the time overhead is small, transposition is fast, and efficiency is high.

(2) The invention implements matrix transposition with multiple 12T and 14T SRAM storage cells, so it uses few components and consumes little power.

(3) The invention can transpose matrices of various types. Instructions can trim the rows and columns of the matrix flexibly according to application requirements, and elements can be accessed in several ways, including single-word read and write, row read with column write, row write with column read, and diagonal read, so the structure is flexible and the functions are diverse.

Brief description of the drawings

FIG. 1 is a schematic structural diagram of the SRAM-based matrix transposition device of the invention.

FIG. 2 is a schematic diagram of the circuit structure of the matrix transposition memory body in this embodiment.

FIG. 3 is a schematic diagram of the circuit interface of the matrix transposition memory body in this embodiment.

FIG. 4 is a schematic diagram of the circuit structure of the 14T SRAM storage module in this embodiment.

FIG. 5 is a schematic diagram of the circuit interface of the 14T SRAM storage module in this embodiment.

FIG. 6 is a schematic diagram of the circuit structure of the 1-bit 14T SRAM cell in this embodiment.

FIG. 7 is a schematic diagram of the circuit interface of the 1-bit 14T SRAM cell in this embodiment.

FIG. 8 is a schematic diagram of the circuit structure of the 12T SRAM storage module in this embodiment.

FIG. 9 is a schematic diagram of the circuit interface of the 12T SRAM storage module in this embodiment.

FIG. 10 is a schematic diagram of the circuit structure of the 1-bit 12T SRAM cell in this embodiment.

FIG. 11 is a schematic diagram of the circuit interface of the 1-bit 12T SRAM cell in this embodiment.

Reference numerals: 1, matrix transposition memory body; 11, 14T SRAM storage module; 111, 1-bit 14T SRAM cell; 12, 12T SRAM storage module; 122, 1-bit 12T SRAM cell; 2, row-access column address decoding module; 3, row-access row address decoding module; 4, column-access column address decoding module; 5, column-access row address decoding module; 6, diagonal read control module.

Detailed description

The invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

As shown in FIG. 1, the SRAM-based matrix transposition device of this embodiment comprises an address decoding module, a read-write bus, n matrix transposition memory bodies 1, and a diagonal read control module 6. The address decoding module and the diagonal read control module 6 are both connected to the matrix transposition memory body 1. The address decoding module decodes the row and column addresses of the matrix to be transposed and gates the corresponding addresses. The read-write bus is divided into a row read-write bus and a column read-write bus. The row read-write bus accesses single words or row vectors of the matrix to be transposed according to the decoding result of the address decoding module; the column read-write bus accesses column vectors or diagonal elements of the matrix to be transposed according to the decoding result of the address decoding module. The diagonal read control module 6 is connected to the matrix transposition memory body, decodes the diagonal address, and controls read operations on the diagonal elements of the matrix to be transposed.

FIG. 2 and FIG. 3 show the structure and interface of the matrix transposition memory body of this embodiment, which is a matrix structure formed by connecting n rows and n columns of matrix storage modules. In the matrix structure of the matrix transposition memory body 1, the matrix storage modules on the diagonal L, which runs from upper left to lower right, are all 14T SRAM storage modules 11 and are used to access the diagonal elements of the matrix to be transposed. The matrix storage modules off the diagonal L are all 12T SRAM storage modules 12 and are used to access the off-diagonal elements of the matrix to be transposed.

In this embodiment, the matrix transposition memory body 1 has four address signal terminals for row and column data access: the row-data-access column address signal terminal Wr, the row-data-access row address signal terminal Wc, the column-data-access row address signal terminal Rc, and the column-data-access column address signal terminal Rr. In the matrix transposition memory body 1, the Wr terminals of the matrix storage modules in each row are connected together, over n rows in total, forming the row-data-access column address signal Wr<1:n>. The Wc terminals of the matrix storage modules in each row are connected together, over n rows in total, forming the row-data-access row address signal Wc<1:n>. The Rc terminals of the matrix storage modules in each column are connected together, over n columns in total, forming the column-data-access row address signal Rc<1:n>. The Rr terminals of the matrix storage modules in each column are connected together, over n columns in total, forming the column-data-access column address signal Rr<1:n>. The row access data terminals wd<w:1> of the matrix storage modules in each row are connected together, over n rows in total, forming the row access data true signal wd<nw:1>; the row access data complement signal /wd<nw:1> is formed in the same way. The column access data terminals rd<w:1> of the matrix storage modules in each column are connected together, over n columns in total, forming the column access data true signal rd<nw:1>; the column access data complement signal /rd<nw:1> is formed in the same way.
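The nw-bit data buses can be pictured as n words of w bits each laid side by side. The sketch below is an assumption about bit ordering, which the text does not specify: word k is placed in bits k·w+1 to (k+1)·w of the bus, and the complement bus is modeled as the bitwise inverse over nw bits. All function names are hypothetical.

```python
def pack_row(words, w):
    """Concatenate n w-bit words into one integer standing in for wd<nw:1>."""
    bus = 0
    for k, word in enumerate(words):
        if not 0 <= word < (1 << w):
            raise ValueError("word does not fit in w bits")
        bus |= word << (k * w)  # word k occupies the k-th w-bit field
    return bus


def unpack_row(bus, n, w):
    """Split an n*w-bit bus value back into n w-bit words."""
    mask = (1 << w) - 1
    return [(bus >> (k * w)) & mask for k in range(n)]


def complement_bus(bus, n, w):
    """Bitwise inverse over n*w bits, standing in for /wd<nw:1>."""
    return ~bus & ((1 << (n * w)) - 1)


bus = pack_row([0x1, 0x2, 0x3], 4)
print(hex(bus))                        # 0x321
print(unpack_row(bus, 3, 4))           # [1, 2, 3]
print(hex(complement_bus(bus, 3, 4)))  # 0xcde
```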

In this embodiment, the access-mode instruction controls the row and column address signal terminals for row and column data access; the address decoding module decodes the address, gates the corresponding address lines, and the corresponding access operation is performed. Because the memory body has n rows and n columns of matrix storage modules, whole rows and whole columns of the matrix to be transposed can be accessed at once. In addition, the 14T SRAM storage modules 11 and the diagonal read control module 6 allow the data on the diagonal of the matrix to be transposed to be read in a single operation.

Let the matrix to be transposed be A, with elements a_ij (i = 1, 2, 3, …, n; j = 1, 2, 3, …, n). An element unit refers to one element a_ij of matrix A, whose word length is w bits. The 14T SRAM storage module 11 is a w-bit element unit formed by connecting w 1-bit 14T SRAM cells 111 in sequence. FIG. 4 and FIG. 5 show the structure and interface of the 14T SRAM storage module 11 of this embodiment, which comprises w 1-bit 14T SRAM cells 111 connected in sequence. During matrix transposition, all bits of an element unit are read and written in the same way. Therefore, within each 14T SRAM storage module 11, the row-data-access column address signal terminals Wr, the row-data-access row address signal terminals Wc, the column-data-access row address signal terminals Rc, and the column-data-access column address signal terminals Rr of the cells are respectively connected together, forming the address gating terminals for row, column, or diagonal data access and controlling access to row, column, or diagonal data. The row and column data access terminals wd and rd are connected to the individual data bits of the matrix to be transposed and carry out row or column read and write operations on its data.

FIG. 6 and FIG. 7 show the circuit structure and interface of the 1-bit 14T SRAM cell 111 of this embodiment. The cell comprises two row-access column control transistors (first MOS transistor n1 and second MOS transistor n2), two row-access row control transistors (third MOS transistor n3 and fourth MOS transistor n4), two column-access column control transistors (fifth MOS transistor n5 and sixth MOS transistor n6), two column-access row control transistors (seventh MOS transistor n7 and eighth MOS transistor n8), and two diagonal-element read control transistors (ninth MOS transistor n9 and tenth MOS transistor n10). Among the address signal terminals of the 1-bit 14T SRAM cell 111, the row-access column signal terminal Wr is connected to the first MOS transistor n1 and the second MOS transistor n2 and controls the gating of the row-access column address. The row-access row signal terminal Wc is connected to the third MOS transistor n3 and the fourth MOS transistor n4 and controls the gating of the row-access row address. The column-access column signal terminal Rr is connected to the fifth MOS transistor n5 and the sixth MOS transistor n6 and controls the gating of the column-access column address. The column-access row signal terminal Rc is connected to the seventh MOS transistor n7 and the eighth MOS transistor n8 and controls the gating of the column-access row address. The diagonal control terminal Dia is connected to the ninth MOS transistor n9 and the tenth MOS transistor n10 and, together with the seventh MOS transistor n7 and the eighth MOS transistor n8, controls the gating of the diagonal address. The first to fourth MOS transistors n1, n2, n3, and n4 jointly support single-word access and row-vector access for row operations; the fifth to eighth MOS transistors n5, n6, n7, and n8 jointly support single-word access and column-vector access for column operations.
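The gating described for the 1-bit cell can be summarized as Boolean conditions. This is a logical sketch only, not a circuit simulation: it assumes that the two groups of control transistors on each access path act in series, so that a path conducts only when both of its select signals are asserted. The function name and the dictionary keys are hypothetical.

```python
def cell_access_paths(wr, wc, rr, rc, dia=False):
    """Which access paths of a 1-bit cell conduct for given select signals.

    Assumption: each path conducts only when both of its select signals
    are high. For a 12T cell, leave dia at False.
    """
    return {
        "row_port": wr and wc,        # n1/n2 (Wr) together with n3/n4 (Wc)
        "column_port": rr and rc,     # n5/n6 (Rr) together with n7/n8 (Rc)
        "diagonal_read": dia and rc,  # n9/n10 (Dia) together with n7/n8 (Rc)
    }


print(cell_access_paths(True, True, False, False))
# {'row_port': True, 'column_port': False, 'diagonal_read': False}
print(cell_access_paths(False, False, False, True, dia=True))
# {'row_port': False, 'column_port': False, 'diagonal_read': True}
```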

In this embodiment, the 12T SRAM storage module 12 is an element unit with word length w, formed by connecting w 1-bit 12T SRAM cells 122. FIG. 8 and FIG. 9 show the circuit structure and interface of the 12T SRAM storage module of the invention, which has the same structure and connections as the 14T SRAM storage module except for the diagonal control signal terminal Dia. The 1-bit 12T SRAM cells 122 of the w-bit 12T SRAM storage module 12 are connected as follows: the column address signal terminals Wr for row data access are connected together, the row address signal terminals Wc for row data access are connected together, the column address signal terminals Rr for column data access are connected together, and the row address signal terminals Rc for column data access are connected together. The true and complement row access data signals are wd<1:w> and /wd<1:w>, and the true and complement column access data signals are rd<1:w> and /rd<1:w>.

As shown in Fig. 10 and Fig. 11, the circuit structure and interface of the 1-bit 12-tube SRAM unit 122 in this embodiment are the same as those of the 1-bit 14-tube SRAM unit 111, except that the diagonal element readout control transistors (the ninth MOS transistor n9 and the tenth MOS transistor n10) are omitted. The unit includes two row read/write column control transistors (the first MOS transistor n1 and the second MOS transistor n2), two row read/write row control transistors (the third MOS transistor n3 and the fourth MOS transistor n4), two column read/write column control transistors (the fifth MOS transistor n5 and the sixth MOS transistor n6) and two column read/write row control transistors (the seventh MOS transistor n7 and the eighth MOS transistor n8). The row read/write column signal terminal Wr is connected to the first MOS transistor n1 and the second MOS transistor n2 and controls the gating of the row read/write column address; the row read/write row signal terminal Wc is connected to the third MOS transistor n3 and the fourth MOS transistor n4 and controls the gating of the row read/write row address; the column read/write column signal terminal Rr is connected to the fifth MOS transistor n5 and the sixth MOS transistor n6 and controls the gating of the column read/write column address; the column read/write row signal terminal Rc is connected to the seventh MOS transistor n7 and the eighth MOS transistor n8 and controls the gating of the column read/write row address.
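The gating of the 1-bit units can be illustrated with a small behavioural model. The following Python sketch is not taken from the patent: the class `Cell`, its method names, and the assumption that each access path opens only when both of its control signals are asserted are illustrative. The count of four storage transistors is likewise an assumption, inferred from 14 − 10 and 12 − 8.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Behavioural model of a 1-bit unit; has_dia=True models the 14-tube diagonal unit."""
    bit: int = 0
    has_dia: bool = False

    def row_port_open(self, wr: bool, wc: bool) -> bool:
        # Assumption: n1/n2 (gated by Wr) and n3/n4 (gated by Wc) must both conduct.
        return wr and wc

    def col_port_open(self, rr: bool, rc: bool) -> bool:
        # Assumption: n5/n6 (gated by Rr) and n7/n8 (gated by Rc) must both conduct.
        return rr and rc

    def dia_port_open(self, dia: bool, rc: bool) -> bool:
        # n9/n10 (gated by Dia) act together with n7/n8 (gated by Rc).
        # A 12-tube unit has no n9/n10, so this path never opens.
        return self.has_dia and dia and rc

    def transistor_count(self) -> int:
        # Assumption: 4 storage transistors plus 8 (or 10) access transistors.
        return 4 + 8 + (2 if self.has_dia else 0)
```

Under these assumptions, `Cell(has_dia=True).transistor_count()` gives 14 and `Cell().transistor_count()` gives 12, matching the two unit types.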

In this embodiment, corresponding to the four row and column data access address signal terminals of the matrix transpose memory body 1, the address decoding module comprises four address decoding modules: the row read/write column address decoding module 2, the row read/write row address decoding module 3, the column read/write column address decoding module 4 and the column read/write row address decoding module 5.

The row read/write column address decoding module 2 is connected to the row read/write column signal terminal Wr of the matrix transpose memory body 1; it decodes the row address in the matrix to be transposed and gates the column address for row data access. The row read/write row address decoding module 3 is connected to the row read/write row signal terminal Wc of the matrix transpose memory body 1; it decodes the row address in the matrix to be transposed and gates the row address for row data access. The column read/write column address decoding module 4 is connected to the column read/write column signal terminal Rr of the matrix transpose memory body 1; it decodes the column address in the matrix to be transposed and gates the column address for column data access. The column read/write row address decoding module 5 is connected to the column read/write row signal terminal Rc of the matrix transpose memory body 1; it decodes the column address in the matrix to be transposed and gates the row address for column data access. The diagonal read control module 6 is connected to the diagonal read control signal terminal Dia of the matrix transpose memory body 1; it decodes the diagonal address in the matrix to be transposed and gates the addresses of the diagonal elements.
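As an illustration of what the decoding modules produce, the following Python sketch models a decoder output as a list of n select lines. The function names `one_hot` and `all_lines` and the 1-based indexing are assumptions made for this example; the patent does not specify the decoder implementation.

```python
def one_hot(index: int, n: int) -> list[int]:
    """Decode a 1-based line index into n select lines with exactly one asserted."""
    if not 1 <= index <= n:
        raise ValueError("index out of range")
    return [1 if k == index else 0 for k in range(1, n + 1)]


def all_lines(n: int) -> list[int]:
    """Assert every select line, as used when a whole vector is accessed."""
    return [1] * n
```

For a row-vector access to row 3 with n = 4, one set of select lines would be `one_hot(3, 4)`, that is `[0, 0, 1, 0]`, while the other set would be `all_lines(4)`.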

In this embodiment, instructions provide several flexible access modes: single-word, row-vector, column-vector and diagonal access. The n matrix transpose memory bodies 1 can therefore be used as an on-chip LocalMemory, presenting a one-dimensional linear memory structure with a capacity of n³ words that is accessed by word address, row by row and word by word. They can also be used as two-dimensional on-chip registers RMi (i = 1, …, n) or as matrix transposition modules, accessing either one word at a time or one row/column vector Vi (i = 2, 3, …, n) at a time. They can further be used as n matrix transposition storage modules and combined into, for example, one n×2n or one 2n×2n matrix transposition module; the number of matrix transposition configurations that can be formed is $C_n^1 + C_n^2 + C_n^3 + \cdots + C_n^n = 2^n - 1$.
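The two counts in this paragraph can be checked numerically. The sketch below uses Python's standard `math.comb`; the function names are illustrative only.

```python
from math import comb


def transpose_configurations(n: int) -> int:
    """Number of non-empty selections of the n memory bodies: C(n,1) + ... + C(n,n)."""
    return sum(comb(n, k) for k in range(1, n + 1))


def local_memory_words(n: int) -> int:
    """Capacity in words of n bodies, each holding n x n words."""
    return n ** 3
```

For n = 4, the sum is 4 + 6 + 4 + 1 = 15 = 2⁴ − 1, and the capacity is 64 words.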

When accessed as two-dimensional on-chip registers RMi in this embodiment, there are n storage planes, and any two of these 2-D storage planes are isomorphic and have the same access modes. Each storage plane has n×n cells, each with a word length of w bits. Among the n×n cells of each storage plane, the cells on the main diagonal are 14-tube SRAM memory modules and the remaining cells are 12-tube SRAM memory modules. In each register RMi, RMi_Wr<1:n> provides row-access row control, RMi_Wc<1:n> provides row-access column control, RMi_Rr<1:n> provides column-access column control, RMi_Rc<1:n> provides column-access row control, and RMi_Dia provides main-diagonal read control; RMi_wd<nw:1> and RMi_/wd<nw:1> carry the row access data, and RMi_rd<nw:1> and RMi_/rd<nw:1> carry the column access data, where i = 1, 2, …, n.
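The composition of one storage plane implies a simple transistor budget for the cell array. The following Python sketch counts only the cell transistors (n diagonal cells of w 14-tube units, n² − n off-diagonal cells of w 12-tube units); decoders, buses and sense circuitry are not included, and the function name is an assumption of this example.

```python
def plane_cell_transistors(n: int, w: int) -> int:
    """Cell transistors in one n x n storage plane with w-bit words."""
    diagonal = n * w * 14                 # 14-tube units on the main diagonal
    off_diagonal = (n * n - n) * w * 12   # 12-tube units elsewhere
    return diagonal + off_diagonal
```

The total simplifies to w(12n² + 2n); for n = 4 and w = 8 this is 1600 transistors.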

In this embodiment, the access-mode configuration algorithms for single-word, row-vector, column-vector and diagonal-vector access in each register RMi are:

1) Algorithm A1 for word access:

2) Algorithm A2 for row-vector access and algorithm A3 for column-vector access:

4) Algorithm A4 for diagonal data access:
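The listings of algorithms A1 to A4 are not reproduced in this text. As a purely behavioural illustration of the four access modes, and not a reconstruction of those algorithms, a storage plane can be modelled in Python as follows; the class `Plane`, its method names and the 0-based indices are assumptions of this sketch.

```python
class Plane:
    """Behavioural model of one n x n storage plane (0-based indices)."""

    def __init__(self, n: int):
        self.n = n
        self.cells = [[0] * n for _ in range(n)]

    def write_word(self, i: int, j: int, value: int) -> None:
        self.cells[i][j] = value              # single-word access

    def read_word(self, i: int, j: int) -> int:
        return self.cells[i][j]

    def write_row(self, i: int, values: list[int]) -> None:
        if len(values) != self.n:
            raise ValueError("row vector must have n elements")
        self.cells[i] = list(values)          # all n cells of row i at once

    def read_row(self, i: int) -> list[int]:
        return list(self.cells[i])

    def write_column(self, j: int, values: list[int]) -> None:
        if len(values) != self.n:
            raise ValueError("column vector must have n elements")
        for i, v in enumerate(values):        # all n cells of column j at once
            self.cells[i][j] = v

    def read_column(self, j: int) -> list[int]:
        return [self.cells[i][j] for i in range(self.n)]

    def read_diagonal(self) -> list[int]:
        # Diagonal access is read-only, mirroring the diagonal read control.
        return [self.cells[k][k] for k in range(self.n)]
```

Writing rows `[1, 2]` and `[3, 4]` into a 2×2 plane and then calling `read_column(0)` returns `[1, 3]`, and `read_diagonal()` returns `[1, 4]`.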

In this embodiment, a matrix transposition can be completed by row write followed by column read, or by column write followed by row read. Matrices of several types, including n×n, m×m, m×n and n×m, can be transposed.

When the matrix to be transposed is of order n×n, the transposition proceeds as follows:

1) Transposition by row write and column read. During row write, the n elements of a row vector are accessed simultaneously: the row read/write row address module 3 decodes the row address and the row read/write column address signal terminal Wr<i> is strobed, where i = 1, 2, …, n is the row number; the row read/write column address decoding module 2 decodes the column address and the row read/write row address signal terminals Wc<1:n> are all strobed; the row is accessed through the row read/write bus. During column read, the n elements of a column vector are accessed simultaneously: the column read/write row address module 5 decodes the row address and the column read/write column address signal terminal Rr<j> is strobed, where j = 1, 2, …, n is the column number; the column read/write column address module 4 decodes the column address and the column read/write row address signal terminals Rc<1:n> are all strobed; the column is accessed through the column read/write bus;

2) Transposition by column write and row read. During column write, the n elements of a column vector are accessed simultaneously: the column read/write column address module 4 decodes the column address and the column read/write row addresses Rc<1:n> are all strobed; the column read/write row address module 5 decodes the row address and the column read/write column address signal terminal Rr<j> is strobed.
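The two transposition orders for an n×n matrix can be sketched in Python as below. This is a software illustration only: the storage array `store` and the function names are assumptions, and each loop iteration stands for one vector access.

```python
def transpose_row_write_column_read(a: list[list[int]]) -> list[list[int]]:
    """Write each source row into a storage row, then read the storage columns."""
    n = len(a)
    store = [[0] * n for _ in range(n)]
    for i in range(n):                       # n row writes
        store[i] = list(a[i])
    # n column reads; storage column j becomes row j of the result
    return [[store[i][j] for i in range(n)] for j in range(n)]


def transpose_column_write_row_read(a: list[list[int]]) -> list[list[int]]:
    """Write each source row into a storage column, then read the storage rows."""
    n = len(a)
    store = [[0] * n for _ in range(n)]
    for j in range(n):                       # n column writes
        for i in range(n):
            store[i][j] = a[j][i]
    return [list(store[i]) for i in range(n)]  # n row reads
```

Both functions return the same transposed matrix for a square input.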

When the matrix to be transposed is of order m×m with m<n, the transposition proceeds as follows:

1) Transposition by row write and column read. During row write, the m elements of a row vector are accessed simultaneously: the row read/write row address module 3 decodes the row address and the row read/write row address signal terminal Wc<i> is strobed, where i = 1, 2, …, m is the row number; the row read/write column address module 2 decodes the column address and the row read/write column address signal terminals Wr<1:n> are all strobed; the row is accessed through the row read/write bus. During column read, the n elements of a column vector are accessed simultaneously: the column read/write row address module 5 decodes the row address and the column read/write column address signal terminal Rr<j> is strobed, where j = 1, 2, …, m is the column number; the column read/write column address module 4 decodes the column address and the column read/write row address signal terminals Rc<1:m> are all strobed; the column is accessed through the column read/write bus;

2) Transposition by column write and row read. During column write, the n elements of a column vector are accessed simultaneously: the column read/write column address module 4 decodes the column address and the column read/write row addresses Rc<1:m> are all strobed; the column read/write row address module 5 decodes the row address and the column read/write column address signal terminal Rr<j> is strobed.
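For the m<n case, the effect of strobing only part of the select lines can be sketched as follows. The sketch assumes that lines 1 to m are asserted and lines m+1 to n are left idle, so that the matrix occupies the upper-left m×m corner of the n×n array; the function names and the mask representation are illustrative.

```python
def line_mask(n: int, m: int) -> list[int]:
    """Select lines 1..m asserted, lines m+1..n idle."""
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= m <= n")
    return [1] * m + [0] * (n - m)


def transpose_in_corner(a: list[list[int]], n: int) -> list[list[int]]:
    """Transpose an m x m matrix held in the upper-left corner of an n x n array."""
    m = len(a)
    mask = line_mask(n, m)
    store = [[0] * n for _ in range(n)]
    for i in range(m):                        # m row writes
        for j in range(n):
            if mask[j]:                       # only active lines take data
                store[i][j] = a[i][j]
    # m column reads, each returning only the m active cells
    return [[store[i][j] for i in range(n) if mask[i]]
            for j in range(n) if mask[j]]
```

With n = 4, a 2×2 input uses the mask `[1, 1, 0, 0]` and the remaining cells are never touched.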

When the matrix to be transposed is of order m×m with m>n, multiple RM registers formed by multiple matrix transpose memory bodies 1 must operate jointly, logically forming a matrix of order m×m. Rows and columns are accessed just as in an n×n matrix, except that the corresponding row read/write row address signal terminals Wr, row read/write column address signal terminals Wc, column read/write row address signal terminals Rr and column read/write column address signal terminals Rc of the RM registers are synchronized and work in lock-step to complete the transposition.
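One way to picture the joint operation of several RM registers is to tile the m×m matrix with n×n blocks. The text does not specify how the registers are arranged, so the square grid of tiles, the tile count and the function names in the following Python sketch are assumptions.

```python
from math import ceil


def tiles_needed(m: int, n: int) -> int:
    """Number of n x n tiles in a square grid covering an m x m matrix (assumed layout)."""
    return ceil(m / n) ** 2


def transpose_tiled(a: list[list[int]], n: int) -> list[list[int]]:
    """Row write then column read across a grid of n x n tiles."""
    m = len(a)
    t = ceil(m / n)
    # tiles[p][q] models the n x n register holding block (p, q)
    tiles = [[[[0] * n for _ in range(n)] for _ in range(t)] for _ in range(t)]
    for r in range(m):                        # logical row r spans one tile row
        for c in range(m):
            tiles[r // n][c // n][r % n][c % n] = a[r][c]
    # logical column c is read from one tile column across all tile rows
    return [[tiles[r // n][c // n][r % n][c % n] for r in range(m)]
            for c in range(m)]
```

For m = 5 and n = 2 this layout uses a 3×3 grid of tiles, with the last tile row and column only partly filled.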

When the matrix to be transposed is of order m×n or n×m, where m>n or m<n, it is not a square matrix; its operation follows the same principle as the transposition of an m×m matrix and is not repeated here. In all write and read procedures, the reading of columns (or rows) can begin only after all rows (or columns) have been written.
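The ordering constraint, that reading may start only after every row has been written, can be made explicit in a small model. The following Python sketch handles a rectangular matrix; the class `TransposeBuffer` and its guard are illustrative and not part of the described device.

```python
class TransposeBuffer:
    """Rows are written first; columns may be read only once all rows are present."""

    def __init__(self, rows: int, cols: int):
        self.rows, self.cols = rows, cols
        self.store = [[0] * cols for _ in range(rows)]
        self.written = set()

    def write_row(self, i: int, values: list[int]) -> None:
        if len(values) != self.cols:
            raise ValueError("row length mismatch")
        self.store[i] = list(values)
        self.written.add(i)

    def read_column(self, j: int) -> list[int]:
        if len(self.written) < self.rows:
            raise RuntimeError("all rows must be written before any column is read")
        return [self.store[i][j] for i in range(self.rows)]
```

For a 2×3 matrix, writing both rows and then reading columns 0 to 2 yields the three rows of the 3×2 transpose.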

With the above structure, read and write operations are separated while the efficient read/write performance of SRAM storage units is retained, and the 14-tube SRAM diagonal structure additionally enables diagonal read operations. By gating the row and column addresses for data access, row read, row write, column read, column write and diagonal read operations can be performed. Each matrix transpose memory body contains multiple storage units and supports several read/write schemes: serial or parallel read/write, serial write with parallel read, parallel write with serial read, and parallel or serial diagonal read. Generating A^T from matrix A requires only n+n clock cycles, so the implementation is simple and the transposition is fast and efficient. The rows and columns can also be flexibly tailored to actual application requirements, making the on-chip LocalMemory flexible in structure, versatile, easy to integrate and low in power consumption.
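The n+n figure can be put in context with a simple count. The sketch below assumes one access (word or vector) per clock cycle; the word-by-word figure is derived under that assumption for comparison and is not stated in the text.

```python
def vector_transpose_cycles(n: int) -> int:
    """n vector writes followed by n vector reads."""
    return n + n


def word_transpose_cycles(n: int) -> int:
    """n*n single-word writes followed by n*n single-word reads (assumed baseline)."""
    return n * n + n * n
```

For n = 8, vector access needs 16 cycles against 128 for word-by-word access under the same assumption.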

In this embodiment, the matrix transposition method using the device of the present invention comprises the following steps:

(3) Configure the transposition mode by instruction: if configured as single-word access mode, perform step (4); if configured as row-in/column-out access mode, perform step (5); if configured as column-in/row-out access mode, perform step (6); if configured as diagonal vector access mode, perform step (7);

(4) Access a single word in row-first order: the row address gating module for row data access decodes the row address and the row read/write row address Wc<i> is strobed, where i is the row containing the word; the column address decoding module for row data access decodes the column address and the row read/write column address Wr<j> is strobed, where j is the column containing the word; the word w is accessed through the row read/write bus.

(5) Access the elements of a row vector simultaneously: after row address decoding the row read/write row address is strobed, after column address decoding the row read/write column addresses are strobed, and the row is accessed through the row read/write bus;

(6) Access the elements of a column vector simultaneously: after row address decoding the column read/write column address is strobed, after column address decoding the column read/write row addresses are strobed, and the column is accessed through the column read/write bus;
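The mode selection of step (3) amounts to a dispatch from the configured mode to a sequence of steps. The following Python sketch is illustrative: the enum `Mode` and the table `STEPS` are assumptions, and the ordering for the two transposition modes follows step (3) of claim 5, where one vector step is followed by the other.

```python
from enum import Enum


class Mode(Enum):
    WORD = "single-word"
    ROW_WRITE_COLUMN_READ = "row write, column read"
    COLUMN_WRITE_ROW_READ = "column write, row read"
    DIAGONAL = "diagonal vector"


# Step numbers refer to the numbered steps of the method.
STEPS = {
    Mode.WORD: [4],
    Mode.ROW_WRITE_COLUMN_READ: [5, 6],
    Mode.COLUMN_WRITE_ROW_READ: [6, 5],
    Mode.DIAGONAL: [7],
}


def step_sequence(mode: Mode) -> list[int]:
    """Return the steps to perform for the configured access mode."""
    return list(STEPS[mode])
```

For example, `step_sequence(Mode.COLUMN_WRITE_ROW_READ)` returns `[6, 5]`.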

In this embodiment, matrices of order n×n, m×m, m×n and n×m can be transposed, where n>1 and either m>n or 1<m<n.

In this embodiment, the matrix can be transposed by row write and column read, or by column write and row read. Row writes and reads are performed by row-vector access operations, and column writes and reads are performed by column-vector access operations; the reading of columns (or rows) can begin only after all rows (or columns) have been written.

The specific implementation of steps (5) and (6) depends on the type of matrix to be transposed.

When transposing an n×n matrix, step (5) is: access the n elements of a row vector simultaneously; the row read/write row address Wc<i> is strobed, i = 1, 2, …, n, and the row read/write column addresses Wr<1:n> are all strobed; the row is accessed through the row read/write bus. Step (6) is: access the n elements of a column vector simultaneously; the column read/write column address Rr<j> is strobed, j = 1, 2, …, n, and the column read/write row addresses Rc<1:n> are all strobed; the column is accessed through the column read/write bus.

When transposing an m×m matrix with m<n, step (5) is: during row write, the row read/write column addresses Wr<1:m> are all strobed and the row read/write row address Wc<i> is strobed (i = 1, 2, …, m); during column read, the column read/write column address Rr<j> is strobed, j = 1, 2, …, m. Step (6) is: the column read/write row addresses Rc<1:m> are all strobed and the column read/write column address Rr<j> is strobed, j = 1, 2, …, m; for the row read that completes the transposition, the row read/write row address Wc<i> is strobed, i = 1, 2, …, m.

When m>n, step (2) must be performed first, that is, multiple n-row, n-column registers take part in the operation. In this embodiment, multiple RM registers formed by multiple matrix transpose memory bodies 1 are controlled to operate jointly, logically forming a matrix of order m×m. The corresponding row read/write column address signal terminals Wr, row read/write row address signal terminals Wc, column read/write row address signal terminals Rr and column read/write column address signal terminals Rc of the RM registers are synchronized, and the RM registers cooperate to complete the transposition; rows and columns are accessed in the same way as for an n×n matrix.

The read/write schemes may be serial or parallel read/write, serial write with parallel read, parallel write with serial read, or parallel or serial diagonal read, providing a matrix transposition function suitable for many kinds of matrices.

For matrices of order m×n and n×m, the transposition process is the same as the transposition steps for matrices of order m×m.

Further, the present invention may use SDRAM instead of SRAM; the principle is the same as that of the present invention.

The above are only preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments; all technical solutions within the idea of the present invention fall within its scope of protection. It should be noted that improvements and refinements made by those of ordinary skill in the art without departing from the principle of the present invention are also to be regarded as within the scope of protection of the present invention.

Claims

1. A device for matrix transposition based on SRAM, characterized in that it comprises an address decoding module, a read/write bus, a diagonal read control module (6) and n matrix transpose memory bodies (1); each matrix transpose memory body (1) is a matrix structure formed by connecting n rows and n columns of matrix storage modules, in which the matrix storage modules on the diagonal are all 14-tube SRAM memory modules (11) and the matrix storage modules off the diagonal are all 12-tube SRAM memory modules (12); the 14-tube SRAM memory modules (11) are used to access the diagonal elements of the matrix to be transposed; the 12-tube SRAM memory modules (12) are used to access the off-diagonal elements of the matrix to be transposed; the diagonal read control module (6) is connected to the matrix transpose memory body (1) and is used to decode the diagonal address and control the read operation on the diagonal elements of the matrix to be transposed; the 14-tube SRAM memory module (11) comprises a plurality of sequentially connected 1-bit 14-tube SRAM units (111), the number of 1-bit 14-tube SRAM units (111) being equal to the word length w of the elements of the matrix to be transposed;

the 12-tube SRAM memory module (12) comprises a plurality of sequentially connected 1-bit 12-tube SRAM units (122), the number of 1-bit 12-tube SRAM units (122) being equal to the word length w of the elements of the matrix to be transposed.

2. The device for matrix transposition based on SRAM according to claim 1, characterized in that: the 1-bit 14-tube SRAM unit (111) has separate control transistors for read and write operations, which respectively control the gating of the read and write addresses; the control transistors of the 1-bit 14-tube SRAM unit (111) comprise two row read/write row control transistors, two row read/write column control transistors, two column read/write column control transistors, two column read/write row control transistors and two diagonal element readout control transistors; the diagonal element readout control transistors are used to control the gating of the diagonal element addresses in the matrix to be transposed; the row read/write row control transistors, row read/write column control transistors, column read/write column control transistors and column read/write row control transistors are respectively used to control the gating of the row and column data access addresses.

3. The device for matrix transposition based on SRAM according to claim 1, characterized in that: the 1-bit 12-tube SRAM unit (122) has separate control transistors for read and write operations, which respectively control the gating of the read and write addresses;

the control transistors of the 1-bit 12-tube SRAM unit (122) comprise two row read/write row control transistors, two row read/write column control transistors, two column read/write column control transistors and two column read/write row control transistors; the row read/write row control transistors, row read/write column control transistors, column read/write column control transistors and column read/write row control transistors are used to control the gating of the row and column data access addresses.

4. The device for matrix transposition based on SRAM according to claim 1, characterized in that: the address decoding module comprises a row read/write column address decoding module (2), a row read/write row address decoding module (3), a column read/write column address decoding module (4) and a column read/write row address decoding module (5).

5. A matrix transposition method using the device according to any one of claims 1 to 4, characterized in that the steps are:

(1) Input the matrix to be transposed and determine its type: if it is of order m×m with m>n, perform step (2); if it is of order n×n, or of order m×m with m<n, perform step (3);

(2) Control multiple n-row, n-column registers to operate jointly, synchronize the row and column address decoding of each register, and go to step (3) to access each register;

(3) Configure the access mode by instruction: if configured as single-word access mode, perform step (4); if configured as row-write/column-read access mode, perform step (5) and then step (6); if configured as column-write/row-read access mode, perform step (6) and then step (5); if configured as diagonal vector access mode, perform step (7);

(4) Access in row-first order: after row address decoding, the corresponding row read/write row address is strobed; after column address decoding, the row read/write column address is strobed; the single word is accessed through the row read/write bus;

(5) Access the elements of a row vector simultaneously: after row address decoding, the corresponding row read/write row address is strobed; after column address decoding, all row read/write column addresses are strobed; the row is accessed through the row read/write bus;

(6) Access the elements of a column vector simultaneously: after row address decoding, the corresponding column read/write column address is strobed; after column address decoding, all column read/write row addresses are strobed; the column is accessed through the column read/write bus;

(7) Access the elements of the diagonal vector simultaneously: after address decoding, the column read/write row addresses are all strobed, and the diagonal is accessed through the column read bus.

6. The matrix transposition method according to claim 5, characterized in that the access mode configured in step (3) is one of single-word access, row-write/column-read access, column-write/row-read access and diagonal vector access.

7. The matrix transposition method according to claim 5, characterized in that step (5) specifically comprises:

(5.1) Determine the size of the row vectors of the matrix to be transposed: if there are n rows, perform step (5.2); if there are m rows and m<n, perform step (5.3);

(5.2) Access all elements of the row vector simultaneously: after row address decoding, the corresponding row read/write row address among rows 1 to n is strobed; after column address decoding, all row read/write column addresses are strobed; the row is accessed through the row read/write bus;

(5.3) Access all elements of the row vector simultaneously: after row address decoding, the corresponding row read/write row address among rows 1 to m is strobed; after column address decoding, all row read/write column addresses are strobed; the row is accessed through the row read/write bus.

8. The matrix transposition method according to claim 5, characterized in that step (6) specifically comprises:

(6.1) Determine the size of the column vectors of the matrix to be transposed: if there are n columns, perform step (6.2); if there are m columns and m<n, perform step (6.3);

(6.2) Access the elements of the column vector simultaneously: after row address decoding, the corresponding column read/write column address among columns 1 to n is strobed; after column address decoding, all column read/write row addresses are strobed; the column is accessed through the column read/write bus;

(6.3) Access the elements of the column vector simultaneously: after row address decoding, the corresponding column read/write column address among columns 1 to m is strobed; after column address decoding, all column read/write row addresses are strobed; the column is accessed through the column read/write bus.