CN108762719A - A parallel generalized inner product reconfigurable controller

Info

Publication number: CN108762719A
Application number: CN201810497969.2A
Authority: CN
Inventors: 李丽; 祁鹏展; 鲍贤亮; 宋文清; 李伟; 何书专; 潘红兵
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2018-05-21
Filing date: 2018-05-21
Publication date: 2018-11-06
Anticipated expiration: 2038-05-21
Also published as: CN108762719B

Abstract

The parallel generalized inner product reconfigurable controller of the invention comprises: an intermediate result calculation module, which receives source data, calculates an intermediate result vector Y_L from the source data, generates the address of vector Y_L, and stores it in a bank; each time the calculation of one intermediate result vector Y_L completes, it generates a completion signal and sends it to the final result calculation module as a start signal; a final result calculation module, which reads data into a complex multiply-accumulator to calculate the final result, obtaining the L-th element Z_L of the result matrix, generates the address of Z_L, and stores it in a bank; and a data storage address processing module, which selects data according to the ping-pong operation select signal and generates the correct bank address signals. Beneficial effects: the computation time is short and the storage resource utilization is high, so the high real-time requirement for obtaining test statistics during non-homogeneity detection can be met in many signal detection application scenarios.

Description

A Parallel Generalized Inner Product Reconfigurable Controller

Technical field

The invention belongs to the technical field of non-homogeneity detection, and in particular relates to a parallel generalized inner product reconfigurable controller.

Background

Space-time adaptive processing (STAP) is a technique for detecting moving targets. Conventional STAP algorithms require an estimate of the clutter covariance matrix. When secondary data are used to estimate the clutter covariance matrix, the secondary data must be independent and identically distributed (i.i.d.) in order to limit the performance loss.

In practical applications, the received signal echoes are contaminated not only by natural clutter but also by man-made non-homogeneous interference, so the i.i.d. condition is often not satisfied.

To deal with interfering targets in the samples, Melvin first proposed the idea of the non-homogeneity detector (NHD), which suppresses their influence on the clutter covariance matrix estimate by discarding the samples that contain interfering targets. The basic idea of NHD is to set a test statistic, based on the difference in statistical properties between samples contaminated by interfering targets and the other samples, that distinguishes the two kinds of sample.

For the choice of NHD test statistic, Gerlach et al. of the U.S. Naval Laboratory proposed two criteria: the generalized inner product (GIP) and the adaptive power residue. Let X_L denote the L-th sample of the initial samples; its corresponding autocorrelation matrix is expressed as [formula missing in the source text], where T is the clutter-plus-noise covariance matrix. Let [symbol missing in the source text] denote the sample covariance matrix formed from L samples; the GIP value of each sample can then be expressed as [formula missing in the source text]. Interfering targets can be removed effectively according to the GIP value of each sample.
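As a hedged point of reference for the quantities this paragraph describes, the GIP statistic is commonly written in the STAP literature in the form below. The symbols \hat{R} (sample covariance estimate) and K (number of samples averaged) are assumptions of this note and are not necessarily the notation of the patent's own formulas.

```latex
\hat{R} = \frac{1}{K}\sum_{k=1}^{K} X_k X_k^{H},
\qquad
\mathrm{GIP}_L = X_L^{H}\,\hat{R}^{-1}\,X_L
```

Read this way, the statistic has the same two-stage shape as the hardware described later: a vector-matrix product (the intermediate result Y_L) followed by an inner product with X_L (the final result Z_L).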

The clutter suppression capability of the generalized inner product non-homogeneity detection method depends on the number of samples: the larger the number of samples, the closer the clutter covariance matrix data are to reality and the stronger the clutter suppression. When the method is implemented in software and applied to a large number of samples, it suffers from low precision and long computation times, which makes it difficult to meet the high real-time requirements of practical non-homogeneity detection.

Summary of the invention

The purpose of the invention is to overcome the above deficiencies of the background art and to propose a parallel generalized inner product reconfigurable controller that better meets the demands of practical applications for high real-time performance and large-point-count computation. This is achieved through the following technical solution:

The parallel generalized inner product reconfigurable controller comprises:

An intermediate result calculation module, which receives source data, calculates the intermediate result vector Y_L from the source data, generates the address of vector Y_L, and stores it in a bank; each time the calculation of one intermediate result vector Y_L completes, it generates a completion signal and sends it to the final result calculation module as a start signal;

A final result calculation module, which uses the address generator to continuously generate the addresses of the elements of column X_L of matrix X and of the corresponding intermediate result vector Y_L, reads the data into the complex multiply-accumulator to obtain the L-th element Z_L of the result matrix Z_{1×N}, generates the address of Z_L, and stores it in a bank;

A data storage address processing module, which selects data according to the ping-pong operation select signal and, at the same time, processes the signals from the intermediate result calculation module and the final result calculation module that target the same bank, generating the correct bank address signals.

In a further design of the hardware implementation method of the parallel generalized inner product operation, calculating Y_L is the process of multiplying and accumulating X_L with each column of the square matrix T; the number of rows and columns of the square matrix T is equal to the number of columns of matrix X, and the multiply-accumulate process is carried out by multi-way parallel computation.

In a further design of the hardware implementation method of the parallel generalized inner product operation, the intermediate result calculation module is implemented with four-way parallelism.
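A minimal software sketch of the four-way multiply-accumulate described above may help. It assumes that Y_L[j] is the plain sum over i of X_L[i] * T[i][j], with no conjugation (the text does not say whether X_L is conjugated), and the function name `intermediate_vector` and the `lanes` parameter are hypothetical.

```python
def intermediate_vector(x_col, T, lanes=4):
    """Compute Y_L[j] = sum_i x_col[i] * T[i][j], `lanes` columns of T at a time.

    x_col : list of complex, one column X_L of X (length M)
    T     : M x M list of rows, T[i][j]
    Assumption: plain (non-conjugated) products.
    """
    m = len(x_col)
    y = [0j] * m
    for base in range(0, m, lanes):           # one group of up to `lanes` columns of T
        cols = range(base, min(base + lanes, m))
        acc = [0j] * len(cols)                # one accumulator per multiply-accumulate lane
        for i in range(m):                    # stream X_L and the T columns element by element
            for lane, j in enumerate(cols):   # the lanes run side by side in hardware
                acc[lane] += x_col[i] * T[i][j]
        for lane, j in enumerate(cols):       # write the finished group of results
            y[j] = acc[lane]
    return y
```

Each pass of the outer loop corresponds to one group of four T columns; the inner `lane` loop is sequential here only because this is a software model.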

In a further design of the hardware implementation method of the parallel generalized inner product operation, the source data of the intermediate result calculation module are stored as follows: matrix T is stored by column in bank0-bank3 and, once these are full, continues by column in bank4-bank7; matrix X is stored by column in bank8-bank11.

In a further design of the hardware implementation method of the parallel generalized inner product operation, the intermediate results of the intermediate result calculation module are stored as follows: odd-numbered items are stored in bank12 and even-numbered items in bank13.

In a further design of the hardware implementation method of the parallel generalized inner product operation, the intermediate result calculation module calculates intermediate results as follows: in one operation, the address generator first generates the addresses of one column X_L of X and of four columns of matrix T, and the corresponding matrix element data are transferred at the same time and fed into the complex multiply-accumulators to obtain the intermediate result Y_L; the address generator then generates the storage address of the intermediate result, and the intermediate result is stored in a bank.

In a further design of the hardware implementation method of the parallel generalized inner product operation, the final result calculation module calculates final results as follows: when the final result calculation module receives the intermediate result completion signal, the address generator continuously generates the addresses of the elements of column X_L of matrix X and of the corresponding intermediate result vector Y_L; the data are fed into the complex multiply-accumulator to obtain the final result Z_L, the address generator generates the storage address of the final result, and the final result is stored in a bank.

In a further design of the hardware implementation method of the parallel generalized inner product operation, the complex multipliers are all pipelined single-precision floating-point units with a latency of 4 clock cycles, and the memory access latency of the complex multipliers is set to 6 cycles.

In a further design of the hardware implementation method of the parallel generalized inner product operation, there are five complex multiply-accumulators, four of which calculate intermediate results in four-way parallel while the fifth calculates the final result concurrently.

In a further design of the hardware implementation method of the parallel generalized inner product operation, each complex multiply-accumulator consists of one complex multiplier and three complex adders, and its area after DC synthesis in a 40 nm CMOS process is 19993.56 μm².

Advantages of the invention

The parallel generalized inner product reconfigurable controller provided by the invention calculates one final result element immediately after each intermediate result is calculated, so the time for calculating Z_{L-1} can be hidden within the time for calculating Y_L; the computation time is short and the storage resource utilization is high. The controller can meet the high real-time requirement for obtaining test statistics when non-homogeneity detection is performed in many signal detection application scenarios.

Description of the drawings

Figure 1 is a schematic diagram of the architecture of the parallel generalized inner product reconfigurable controller.

Figure 2 is a schematic diagram of the data storage for the parallel generalized inner product.

Figure 3 is a schematic diagram of the calculation flow of the parallel generalized inner product algorithm.

Detailed description

The invention is described in detail below with reference to the drawings and a specific implementation example.

As shown in Figure 1, the parallel generalized inner product reconfigurable controller of this embodiment, taking four-way parallelism as an example, consists mainly of three sub-modules: the intermediate result calculation module, the final result calculation module, and the data storage address processing module. The intermediate result calculation module calculates the intermediate results; the final result calculation module calculates the final results; the data storage address processing module handles the bank addresses and related signals.

The intermediate result calculation module calculates the intermediate result vector Y_L in a fully pipelined manner. This includes generating the addresses of the elements of column X_L, performing the inner-product multiply-accumulate of column X_L with each column of the square matrix T_{M×M} to obtain the intermediate result vector Y_L, generating the address of vector Y_L, and storing it in a bank. Each time the calculation of one Y_L completes, the module issues a completion signal to the final result calculation module, which serves as the start signal for one calculation of that module.

The final result calculation module uses the address generator to continuously generate the addresses of the elements of column X_L of matrix X and of the corresponding intermediate result vector Y_L, reads the data into the complex multiply-accumulator to obtain the L-th element Z_L of the result matrix Z_{1×N}, generates the address of Z_L, and stores it in a bank.

The data storage address processing module selects data according to the ping-pong operation select signal and, at the same time, processes the signals from the intermediate result calculation module and the final result calculation module that target the same bank, generating the correct bank address and related signals.

As shown in Figure 1, the storage unit comprises 15 banks: matrix T is stored in bank0-bank7, matrix X in bank8-bank11, the intermediate results Y_L in bank12 and bank13, and the final parallel generalized inner product matrix in bank14. The arithmetic unit comprises 5 complex multiply-accumulators: complex multiply-accumulators 0-3 calculate intermediate results in four-way parallel, and complex multiply-accumulator 4 calculates the final result at the same time.

Figure 2 is a schematic diagram of the data storage for the parallel generalized inner product. The source data are stored as follows: matrix T is stored by column in bank0-bank3 and, once these are full, continues by column in bank4-bank7; matrix X is stored by column in bank8-bank11. This layout makes it convenient to run 4-way parallel operations when calculating the intermediate result Y_L, and it also simplifies the design of the corresponding DMA module. For the intermediate results Y_L, odd-numbered items such as Y₁, Y₃, ... are stored in bank12 (each new item overwrites the previous one), and even-numbered items such as Y₂, Y₄, ... are stored in bank13 (each new item overwrites the previous one). The final generalized inner product matrix is stored in bank14.
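The bank layout above can be sketched as a few lookup helpers. The text gives the bank ranges but not the exact column-to-bank rule or the bank capacity, so the round-robin interleaving (`col % 4`) and the `cols_per_bank` parameter below are assumptions made for illustration, as are all function names.

```python
Z_BANK = 14  # the final generalized inner product results go to bank14


def t_bank(col, cols_per_bank):
    """Bank holding column `col` (0-based) of matrix T.

    Assumption: columns are interleaved over four banks so that four
    consecutive columns can be read in parallel; once bank0-bank3 each
    hold `cols_per_bank` columns, storage continues in bank4-bank7.
    """
    lane = col % 4
    spilled = (col // 4) >= cols_per_bank   # bank0-bank3 are already full
    return lane + (4 if spilled else 0)


def x_bank(col):
    """Bank holding column `col` (0-based) of matrix X (bank8-bank11).

    Assumption: same round-robin interleaving as for T.
    """
    return 8 + col % 4


def y_bank(index):
    """Bank for intermediate result Y_index (1-based, as in the text):
    odd items go to bank12, even items to bank13 (ping-pong)."""
    return 12 if index % 2 == 1 else 13
```

With two intermediate banks, Y_L can be written into one bank while Y_{L-1} is still being read from the other, which is why each new item may overwrite the item stored two steps earlier.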

As shown in Figure 3, the parallel generalized inner product algorithm calculates intermediate results as follows: in one operation, address generator 1 first generates the addresses of one column X_L of X and of four columns of matrix T, and the corresponding matrix element data are transferred at the same time and fed into the complex multiply-accumulators to obtain the intermediate result Y_L; address generator 2 then generates the storage address of the intermediate result, and the intermediate result is stored in a bank.

Similarly, the parallel generalized inner product algorithm calculates final results as follows: in one operation, when the module receives the intermediate result completion signal, address generator 1 continuously generates the addresses of the elements of column X_L of matrix X and of the corresponding intermediate result vector Y_L. The data are fed into the complex multiply-accumulator to obtain the final result Z_L; address generator 2 then generates the storage address of the final result, and the final result is stored in a bank.

One complete calculation by the hardware implementation of the parallel generalized inner product algorithm of the invention comprises the following steps:

Step 1) Set L = 1 and start the calculation from the first column of matrix X;

Step 2) Calculate the intermediate result Y_L.

Calculating the intermediate result Y_L comprises the following steps:

Step 2-1) Using the addresses generated by the address generator sub-module, fetch the elements of X_L and of (T₁, T₂, T₃, T₄) in turn and feed them into the multiply-accumulate sub-module for complex multiply-accumulate, obtaining (Y_{L,1}, Y_{L,2}, Y_{L,3}, Y_{L,4});

Step 2-2) Using the addresses generated by the address generator sub-module, write (Y_{L,1}, Y_{L,2}, Y_{L,3}, Y_{L,4}) in order into the intermediate result bank, and at the same time fetch the next group of 4 columns of matrix T together with X_L; repeat sub-steps 2-1) and 2-2) until the calculation of Y_L is complete;

Step 3) Calculate the final result. This proceeds concurrently with steps 1) and 2): if Y_{L-1} has already been produced, fetch the elements of X_{L-1} and Y_{L-1} in turn at the addresses generated by the address generator and perform complex multiply-accumulate to obtain Z_{L-1}, then write the final result into the final result bank at the address generated by the address generator;

Step 4) If L < N, set L = L + 1 and go to step 2);

Step 5) Fetch the elements of X_N and Y_N in turn and perform complex multiply-accumulate to obtain Z_N, store it in the bank, and complete the inner product operation.
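Steps 1) to 5) can be modelled in software as a loop in which Z_{L-1} is produced during the iteration that computes Y_L. This is a sketch under stated assumptions: products are plain (non-conjugated), indices are 0-based, the two-slot list `y_banks` stands in for bank12/bank13, and the function name is hypothetical. In hardware the Y and Z calculations run concurrently; here they simply appear in the same loop iteration.

```python
def generalized_inner_products(X, T):
    """Return [Z_1, ..., Z_N] for an M x N matrix X and an M x M matrix T.

    X and T are lists of rows. Assumption: Y_L[j] = sum_i X[i][L] * T[i][j]
    and Z_L = sum_j X[j][L] * Y_L[j], without conjugation.
    """
    m, n = len(X), len(X[0])

    def column(L):
        return [X[i][L] for i in range(m)]

    z = [None] * n
    y_banks = [None, None]                       # ping-pong buffers for Y_L
    for L in range(n):                           # steps 1) and 4): walk the columns of X
        x_l = column(L)
        # Step 2): intermediate result Y_L (four columns of T at a time in hardware).
        y_l = [sum(x_l[i] * T[i][j] for i in range(m)) for j in range(m)]
        # Step 3): meanwhile, Z_{L-1} is formed from the other buffer.
        if L > 0:
            y_prev = y_banks[(L - 1) % 2]
            z[L - 1] = sum(a * b for a, b in zip(column(L - 1), y_prev))
        y_banks[L % 2] = y_l                     # store Y_L in its ping-pong buffer
    # Step 5): the last element has no later Y calculation to hide behind.
    z[n - 1] = sum(a * b for a, b in zip(column(n - 1), y_banks[(n - 1) % 2]))
    return z
```

Only two Y buffers are ever live, which mirrors the odd/even storage of intermediate results in two banks.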

The complex multipliers and complex adders used in the parallel generalized inner product reconfigurable controller of this embodiment are all pipelined single-precision floating-point units with a latency of 4 clock cycles, and the memory access latency is 6 cycles. With EDA simulation and synthesis tools, the operating frequency reaches 1 GHz.

The parallel generalized inner product reconfigurable controller of this embodiment uses five complex multiply-accumulators in total, four of which calculate intermediate results in four-way parallel while the fifth calculates the final result concurrently. Each complex multiply-accumulator consists of one complex multiplier and three complex adders, and its area after DC synthesis in a 40 nm CMOS process is 19993.56 μm².

The parallel generalized inner product reconfigurable controller of this embodiment calculates one final result element immediately after each intermediate result is calculated, so the time for calculating Z_{L-1} can be hidden within the time for calculating Y_L. Compared with first calculating all the intermediate results and only then calculating the final results, this takes less computation time and achieves higher storage resource utilization.
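A rough cycle count illustrates why hiding Z_{L-1} behind Y_L saves time. The model below is an assumption of this note, not a figure from the patent: it charges one cycle per multiply-accumulate step per lane and ignores the 4-cycle pipeline latency and the 6-cycle memory access latency. The function name and parameters are hypothetical.

```python
def cycle_estimate(m, n, lanes=4):
    """Return (sequential, overlapped) cycle estimates for an M x N matrix X.

    Assumed model: one Y_L needs ceil(m / lanes) column groups of m
    multiply-accumulate steps each; one Z_L needs m steps on its own lane.
    """
    y_cycles = -(-m // lanes) * m        # ceil(m / lanes) * m
    z_cycles = m
    # All Y_L first, then all Z_L: nothing overlaps.
    sequential = n * (y_cycles + z_cycles)
    # Z_{L-1} runs while Y_L is computed; z_cycles <= y_cycles always holds
    # in this model, so only the last Z_N is exposed.
    overlapped = n * y_cycles + z_cycles
    return sequential, overlapped
```

For example, under this model m = 8 and n = 16 give 384 cycles without overlap and 264 cycles with it. The overlapped scheme also needs only two Y_L buffers at a time, rather than storage for all N intermediate vectors.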

The parallel generalized inner product reconfigurable controller of this embodiment is characterized by fast computation, a flexibly variable number of points, and high storage resource utilization. It can meet the high real-time requirement for obtaining test statistics when non-homogeneity detection is performed in digital signal processing with large data volumes, for example in real-time signal detection scenarios.

The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A parallel generalized inner product reconfigurable controller, characterized by comprising:

an intermediate result calculation module, which receives source data, calculates the intermediate result vector Y_L from the source data, generates the address of vector Y_L, and stores it in a bank; each time the calculation of one intermediate result vector Y_L completes, it generates a completion signal and sends it to the final result calculation module as a start signal;

a final result calculation module, which uses the address generator to continuously generate the addresses of the elements of column X_L of matrix X and of the corresponding intermediate result vector Y_L, reads the data into the complex multiply-accumulator to calculate the final result, obtaining the L-th element Z_L of the result matrix Z_{1×N}, generates the address of Z_L, and stores it in a bank;

a data storage address processing module, which selects data according to the ping-pong operation select signal and, at the same time, processes the signals from the intermediate result calculation module and the final result calculation module that target the same bank, generating the correct bank address signals.

2. The hardware implementation method of the parallel generalized inner product operation according to claim 1, characterized in that: calculating Y_L is the process of multiplying and accumulating X_L with each column of the square matrix T; the number of rows and columns of the square matrix T is equal to the number of columns of matrix X, and the multiply-accumulate process is carried out by multi-way parallel computation.

3. The parallel generalized inner product reconfigurable controller according to claim 2, characterized in that: the intermediate result calculation module is implemented with four-way parallelism.

4. The parallel generalized inner product reconfigurable controller according to claim 3, characterized in that: the source data of the intermediate result calculation module are stored as follows: matrix T is stored by column in bank0-bank3 and, once these are full, continues by column in bank4-bank7; matrix X is stored by column in bank8-bank11.

5. The parallel generalized inner product reconfigurable controller according to claim 3, characterized in that: the intermediate results of the intermediate result calculation module are stored as follows: odd-numbered items are stored in bank12 and even-numbered items in bank13.

6. The parallel generalized inner product reconfigurable controller according to claim 1, characterized in that: the intermediate result calculation module calculates intermediate results as follows: in one operation, the address generator first generates the addresses of one column X_L of X and of four columns of matrix T, and the corresponding matrix element data are transferred at the same time and fed into the complex multiply-accumulators to obtain the intermediate result Y_L; the address generator then generates the storage address of the intermediate result, and the intermediate result is stored in a bank.

7. The parallel generalized inner product reconfigurable controller according to claim 1, characterized in that: the final result calculation module calculates final results as follows: when the final result calculation module receives the intermediate result completion signal, the address generator continuously generates the addresses of the elements of column X_L of matrix X and of the corresponding intermediate result vector Y_L; the data are fed into the complex multiply-accumulator to obtain the final result Z_L, the address generator generates the storage address of the final result, and the final result is stored in a bank.

8. The parallel generalized inner product reconfigurable controller according to claim 1, characterized in that: the complex multipliers are pipelined single-precision floating-point units with a latency of 4 clock cycles, and the memory access latency of the complex multipliers is set to 6 cycles.

9. The parallel generalized inner product reconfigurable controller according to claim 1, characterized in that: there are five complex multiply-accumulators, four of which calculate intermediate results in four-way parallel while the fifth calculates the final result concurrently.

10. The parallel generalized inner product reconfigurable controller according to claim 1, characterized in that: each complex multiply-accumulator consists of one complex multiplier and three complex adders, and its area after DC synthesis in a 40 nm CMOS process is 19993.56 μm².