CN100594491C - Reconfigurable Digital Signal Processor

Info

Publication number: CN100594491C
Application number: CN200610086398A
Authority: CN
Inventors: 洪一; 郭二辉; 赵斌; 洪灏; 彭勇俊; 陈风波
Original assignee: CETC 38 Research Institute
Current assignee: Anhui Core Century Technology Co ltd
Priority date: 2006-07-14
Filing date: 2006-07-14
Publication date: 2010-03-17
Anticipated expiration: 2026-07-14
Also published as: CN1900927A

Abstract

The invention discloses a reconfigurable digital signal processor (DSP). The internal hardware resources of the device can be restructured according to application requirements, so that various forms of filtering operations can be realized. The invention combines the advantages of the traditional application-specific integrated circuit (ASIC) and the general-purpose digital signal processor: it has the computing capability of a very-large-scale special-purpose device, adapts to different real-time digital signal processing tasks such as fast Fourier transform (FFT), inverse fast Fourier transform (IFFT), FIR pulse-group processing, and correlation processing, and is easy to use and inexpensive.

Description

Reconfigurable Digital Signal Processor

Technical Field

The invention discloses a reconfigurable digital signal processor (DSP) for real-time digital signal processing such as fast Fourier transform (FFT), inverse fast Fourier transform (IFFT), FIR pulse-group processing, and correlation processing.

Background Art

Since the 1960s, with the rapid development of computing and information technology, digital signal processing has grown quickly as an independent discipline and has been widely applied in many fields. Driven by rapid advances in large-scale integrated circuit and semiconductor technology and by steadily rising real-time processing requirements, digital signal processing capability has increased exponentially and plays an increasingly important role in scientific research, military, and civilian applications. Digital signal processing devices have become an important foundation for the rapid development of these fields. In real-time digital signal processing, filtering operations such as fast Fourier transform (FFT), inverse fast Fourier transform (IFFT), FIR pulse-group processing, and correlation processing are the most widely used. At present there are three main hardware implementations: general-purpose digital signal processors, field-programmable gate arrays (FPGA)/complex programmable logic devices (CPLD), and application-specific integrated circuits (ASIC).

On the one hand, each of the three has its limitations. The advantage of the general-purpose digital signal processor is the flexibility and generality of programming, but its computing power is limited. Large-capacity FPGAs/CPLDs have plentiful internal hardware resources, but firmware logic must be developed separately for each application, labor costs are high, and large-capacity FPGAs/CPLDs are expensive. A traditional ASIC has a fixed architecture and fixed hard-wired connections, so its function is relatively narrow and its range of application is severely limited. On the other hand, the technical demands on digital signal processing keep rising: as wideband applications expand, the scale of array processing grows, and the computation involved in processing cooperative and non-cooperative targets increases, the required signal processing speed keeps going up. For example, a complex 1024-point FFT is required to run at 100 MHz or more, and in some cases at 500 MHz or more. The three kinds of device above find it increasingly difficult to meet the requirements of real-time digital signal processing in terms of function, price, adaptability, and ease of use.

Summary of the Invention

The technical problem to be solved by the invention is to provide a reconfigurable digital signal processor for real-time digital signal processing that has the computing capability of a dedicated large-scale integrated circuit, adapts to different real-time processing tasks such as fast Fourier transform (FFT), inverse fast Fourier transform (IFFT), FIR pulse-group processing, and correlation processing, and is at the same time simple to use and inexpensive.

The technical solution adopted by the invention is as follows:

The internal hardware architecture and hardware interconnections of the reconfigurable digital signal processor can be restructured by configuring a control word, thereby realizing various forms of filtering operations such as FFT/IFFT, FIR pulse-group processing, and correlation processing.

The main architecture comprises an input unit, an output unit, a data exchange unit, and 4 basic units. The basic units contain 160 real floating-point multiply-accumulators in total, evenly distributed among the 4 basic units.

The organization of the hardware can be restructured by configuring the control word: through the configuration of the control word and control signals, the organization of the 160 real floating-point multiply-accumulators and of the data exchange unit can be changed to select different working modes, suiting three different computing tasks: FFT/IFFT, FIR pulse-group processing, and correlation.

The hardware scheduling scheme is a two-level method combining centralized and distributed control: the control word is first decoded at the first level by the global module and then decoded at the second level by each basic unit.

The overall architecture uses two-level control. The global control module coordinates the 4 basic units, and each basic unit has its own local control logic. The data exchange unit, located between the 4 basic units, sends the data of each basic unit to the other 3 basic units according to the control requirements.

The input unit receives the control word, performs first-level decoding on it, and distributes it to the other units. The control word shares a port with coefficient input 1. The control word receiving module receives the control word and passes it to the first-level decoding module. The decoded result is used in two ways: it produces global control signals that generate on-chip timing and synchronize coefficients and data, and it is sent to the other on-chip units through the control word distribution module. The coefficient synchronization module synchronizes coefficient input 1 with coefficient input 2. The data synchronization module synchronizes data input 1 with data input 2.

The data exchange unit is a set of multi-input, multi-output switches that exchanges data among the 4 basic units.

The output unit sorts the results of the basic units and outputs them in different formats. In FFT/IFFT and FIR pulse-group processing, the basic-unit output sorting module arranges the outputs of the basic units in frequency-channel order; in correlation mode, it converts the frame-by-frame data from the basic units into a continuous data stream at the same rate as the input data. The modulus module converts the real/imaginary output format to the magnitude/phase output format. The logarithm module converts the input magnitude to a logarithmic representation. The floating-point/fixed-point conversion module converts the output of the real/imaginary module or of the modulus module from floating-point to fixed-point format. The exponent normalization module unifies the exponents of floating-point results to a fixed value by shifting the mantissas accordingly. There are two sets of these 4 format conversion modules, one per output port, so that the two output ports can each output independently in any of the formats.

In each basic unit, the data memory comprises eight 512×40 dual-port RAMs, used for input buffering of operand data, temporary storage of intermediate results, and output buffering of results. The data buffer can feed data simultaneously to every complex multiply-accumulator and complex multiply-accumulator sub-array. The coefficient memory comprises ten 256×32 dual-port RAMs, used to store the coefficients for correlation processing and FIR filtering and the weighting coefficients for FFT operations. The coefficients for the FFT iterations are a fixed table implemented in dedicated logic. Each complex multiply-accumulator corresponds to 4 real floating-point multiply-accumulators, and each complex multiply-accumulator sub-array contains 16 real floating-point multiply-accumulators, equivalent to 4 complex multiply-accumulators. The two complex multiply-accumulator sub-arrays and the complex multiply-accumulators are combined in different ways to perform different operations. The real floating-point multiply-accumulator consists of 5 parts: fixed-point multiplication, truncation, exponent adjustment, fixed-point addition, and exponent judgment.

The organization adopted for each of the three filtering operations is as follows:

For FFT/IFFT operations, complex multiply-accumulate sub-array A and complex multiply-accumulate sub-array B each serve as a node for one iteration of the radix-8 algorithm, and the 4 complex multiply-accumulators in a sub-array perform one stage of radix-8 iteration. Complex multiply-accumulator A and complex multiply-accumulator B, placed in front of the sub-arrays, act as windowing units that apply a window to the data. The windowed data then enters the complex multiply-accumulate sub-array for one stage of radix-8 iteration. The output of each stage is first written to the data buffer and then sent through the data exchange unit to the corresponding other complex multiply-accumulate sub-array for the next stage.

For FIR pulse-group processing, the multiply-accumulators in the 2 complex multiply-accumulator modules and the 2 complex multiply-accumulator sub-array modules of a basic unit are used in parallel, and each complex multiply-accumulator corresponds to one frequency channel of the FIR pulse group, or to 2 frequency channels when conjugate operation is used.

For correlation, the complex multiply-accumulators and the complex multiply-accumulator sub-arrays are used in series, which is equivalent to 10 complex multiply-accumulators in series; the multiply-accumulators inside the basic unit are then configured as a cascade.
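The cascaded arrangement can be pictured with a short Python sketch. It assumes, purely for illustration, that each of the 10 cascaded complex multiply-accumulators accumulates one lag of the correlation, so that one pass produces a frame of 10 output points. The function name, the lag-to-stage mapping, and the use of conjugated coefficients are assumptions, not details given in the text.

```python
def correlate_frame(data, coeffs, first_lag, stages=10):
    """Return `stages` consecutive correlation lags.

    Assumed model: stage s of the cascade accumulates
    sum_n data[n + first_lag + s] * conj(coeffs[n]).
    Samples outside `data` are treated as zero.
    """
    frame = []
    for s in range(stages):
        lag = first_lag + s
        acc = 0
        for n, c in enumerate(coeffs):
            idx = n + lag
            if 0 <= idx < len(data):
                acc += data[idx] * c.conjugate()
        frame.append(acc)
    return frame


# A reference pattern [1, 2, 3] embedded at offset 2 peaks at lag 2.
data = [0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0]
frame = correlate_frame(data, [1, 2, 3], first_lag=0)
```

In this model the peak of the frame marks the offset at which the stored coefficients best match the data.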

The hardware scheduling scheme adopted is as follows:

Hardware resources are scheduled through the control word and control signals in a two-level mode that combines global control with local control inside each basic unit. The control word and control signals first enter the global control module for first-level decoding and control word distribution. This module first receives the control word and then generates a number of global control signals from it. These global controls include coordinating the actions of the 4 basic units, determining the output format of the results, and configuring the on-chip master clock. The second function of the global control module is to send control information to the local control module of each basic unit. This information includes the working mode, the subtype of the working mode, the number of points, the number of frequency channels, and the number of processing channels.

After receiving the information sent by the global control module, the local control module in each basic unit performs second-level decoding on it, converting it into the detailed control signals for hardware scheduling.
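The two-level decoding can be sketched in Python as follows. The text names the kinds of information carried (working mode, subtype, number of points, channel counts, output format) but gives no bit positions, so the field layout, the mode codes, and all function names below are hypothetical and serve only to show the split between global and local decoding.

```python
# Hypothetical (shift, width) layout inside the 64-bit control word.
FIELDS = {
    "mode": (62, 2),
    "subtype": (58, 4),
    "points": (44, 13),
    "freq_channels": (36, 8),
    "proc_channels": (31, 5),
    "output_format": (28, 3),
}
MODES = {0: "fft", 1: "fir", 2: "correlation"}  # assumed encoding
TOPOLOGY = {
    "fft": "two radix-8 cores",   # complex MAC + sub-array, twice
    "fir": "parallel",            # one complex MAC per frequency channel
    "correlation": "cascade",     # 10 complex MACs in series
}


def pack(**values):
    """Build a control word from named field values."""
    word = 0
    for name, value in values.items():
        shift, width = FIELDS[name]
        word |= (value & ((1 << width) - 1)) << shift
    return word


def extract(word, name):
    shift, width = FIELDS[name]
    return (word >> shift) & ((1 << width) - 1)


def global_decode(word):
    """First level: global signals plus the info forwarded to each unit."""
    global_signals = {"output_format": extract(word, "output_format")}
    local_names = ("mode", "subtype", "points", "freq_channels", "proc_channels")
    local_info = {name: extract(word, name) for name in local_names}
    return global_signals, local_info


def local_decode(local_info):
    """Second level: turn forwarded info into unit-level control details."""
    mode = MODES[local_info["mode"]]
    return {"mode": mode, "mac_topology": TOPOLOGY[mode],
            "points": local_info["points"]}


word = pack(mode=2, points=4000, proc_channels=16, output_format=1)
global_signals, local_info = global_decode(word)
unit_config = local_decode(local_info)
```

The point of the split is that only the global module needs to understand the whole word, while each basic unit interprets just the fields that affect its own multiply-accumulator arrangement.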

The invention offers significant technical progress and beneficial effects:

The invention combines the advantages of the traditional ASIC and the general-purpose digital signal processor.

High-speed processing capability: the invention is a reconfigurable special-purpose digital signal processor with a large amount of on-chip hardware, including as many as 160 floating-point multiply-accumulators, so that its processing capability exceeds that of traditional ASICs. Because the device's internal algorithms are implemented entirely in hardware, it delivers high-speed computing performance that general-purpose digital signal processor chips cannot match. The internal arithmetic format of the device is floating point. A single chip can perform an FFT of up to 4096 points; a 4096-point FFT takes 25.6 µs, and an FFT plus IFFT of the same size takes 51.2 µs. In FIR pulse-group processing the maximum number of frequency channels is 128; the maximum filter length is 256 below 80 frequency channels and 128 above 80 frequency channels. In correlation processing the maximum length per channel on a single chip is 4000, and up to 16 channels can be processed in parallel, with a very flexible mapping between the data of each channel and each set of coefficients.
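As a quick check on the FFT figures, the stated transform time can be converted to an equivalent per-point rate. The Python sketch below only restates the quoted numbers; reading them as a sustained throughput assumes back-to-back transforms, which the text does not state.

```python
FFT_POINTS = 4096
FFT_TIME_NS = 25_600          # 25.6 µs for a 4096-point FFT
FFT_IFFT_TIME_NS = 51_200     # 51.2 µs for FFT + IFFT of the same size

ns_per_point = FFT_TIME_NS / FFT_POINTS            # time per complex point
points_per_us = FFT_POINTS * 1000 / FFT_TIME_NS    # equivalent point rate
ifft_time_ns = FFT_IFFT_TIME_NS - FFT_TIME_NS      # implied IFFT time
```

The numbers work out to 6.25 ns per point, or 160 complex points per microsecond, and the FFT+IFFT figure is exactly twice the FFT figure.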

Reconfigurable function: unlike a conventional ASIC chip, the device's internal resources can be restructured according to application requirements, so that various forms of filtering operations such as FFT, IFFT, FIR pulse-group processing, and correlation processing can be realized. This makes the device more flexible to use and broadens its range of application.

Convenient use: the invention incorporates, to a degree, the flexibility of the general-purpose digital signal processor and uses a unique 64-bit function control word to configure the device for different application requirements. Use is very simple: there is no tedious programming and debugging, and no logic design and timing analysis as with an FPGA. The user only sends the 64-bit control word and inputs the data with the standard timing. Changing the control word restructures the device's internal resources, and the device then works in a different mode.

Brief Description of the Drawings

Figure 1: Hardware architecture block diagram 1

Figure 2: Block diagram of the basic unit

Figure 3: Block diagram of the real multiply-accumulator

Figure 4: Basic unit configuration for FFT operation

Figure 5: Basic unit configuration for FIR pulse-group operation

Figure 6: Basic unit configuration for correlation operation

Figure 7: Two-level scheduling framework for hardware resources

Figure 8: Hardware architecture block diagram 2

Figure 9: Implementation block diagram of the input unit

Figure 10: Implementation block diagram of the output unit

Figure 11: Timing diagram for 32-bit control word input

Figure 12: Timing diagram for 16-bit control word input

Detailed Description of the Embodiments

The invention is further described below with reference to the accompanying drawings.

The core of the invention is 160 multiply-accumulators, evenly distributed among 4 basic units. By configuring the control word and control signals, these 160 multiply-accumulators, together with the memories, the data exchange unit, and other hardware resources, can be made to work in different organizations, and these organizations determine the device's working modes. In addition, the invention schedules hardware resources with a two-level method that combines centralized control with distributed control.

The embodiment is described at 3 levels: the first is the device's hardware architecture, the second is the hardware organization, and the third is the hardware resource scheduling method. Each level is described below.

1. Main hardware architecture

The hardware architecture of the invention is shown in Figures 1 and 8.

The invention uses a fully synchronous design, i.e. the whole chip has a single clock domain. Considering functional partitioning and balanced use of logic resources, the top level of the device is divided into 7 parts: the 4 basic units as the core, plus the input unit, the output unit, and the data exchange unit.

The device uses a two-level control architecture. The global control module coordinates the 4 basic units, and each basic unit has its own local control logic. These 4 basic units are the core of the device and perform the main functions such as data buffering, computation, and address generation. The data exchange unit, located between the 4 basic units, sends the data of each basic unit to the other 3 basic units according to the control requirements. To serve different applications, the device integrates several dedicated function modules, including a modulus circuit, a logarithm circuit, an exponent normalization circuit, and a floating-point/fixed-point conversion circuit. These modules can output the results of the basic units in different ways, and the two output ports are completely independent of each other. The device also integrates a phase-locked loop for clock multiplication, so that the internal operating clock can either be supplied directly from outside or be derived from a low-speed external clock multiplied by the internal phase-locked loop.

The implementation of the input unit is shown in Figure 9. Apart from the test input pins used for design-for-test, all functional input pins go first to the input unit. The input unit is responsible for receiving the control word, performing first-level decoding on it, distributing it to the other units, synchronizing the input coefficients, synchronizing the input data, and generating the global timing signals needed on chip. The modules of the input unit are briefly described in turn below. The control word shares a port (coeff_in[31:0]) with coefficient input 1, so the incoming control word must first be captured; this is the task of the control word receiving module. Internally this module is mainly a set of registers, into which the 64-bit control word is loaded in 2 or 4 transfers according to an enable signal supplied from off chip. Once the control word has been received, the first-level decoding module decodes part of it into global control signals, which are used to generate on-chip timing, to synchronize coefficients and data, and so on. The control word distribution module decodes the 64-bit control word and sends it to the other on-chip units. Because coefficient input 1 also carries the control word while coefficient input 2 does not, the two coefficient inputs have different delays, and a synchronization module is needed to align them. Similarly, data input 1 and data input 2 differ in function, so the data on these two inputs must be synchronized according to working mode and data type.
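The register-loading step can be illustrated with a small Python sketch. It assumes that the two options correspond to two 32-bit transfers or four 16-bit transfers (consistent with the titles of Figures 11 and 12) and that the most significant part arrives first; the transfer order and the function name are assumptions, since the text does not specify them.

```python
def assemble_control_word(transfers, width):
    """Assemble a 64-bit control word from successive port transfers.

    width is the number of bits captured per transfer (16 or 32).
    Assumed order: most significant part first.
    """
    if width not in (16, 32):
        raise ValueError("width must be 16 or 32")
    if len(transfers) != 64 // width:
        raise ValueError("wrong number of transfers for this width")
    word = 0
    for part in transfers:
        if not 0 <= part < (1 << width):
            raise ValueError("transfer does not fit in the port width")
        word = (word << width) | part
    return word


word_from_32 = assemble_control_word([0x01234567, 0x89ABCDEF], 32)
word_from_16 = assemble_control_word([0x0123, 0x4567, 0x89AB, 0xCDEF], 16)
```

Under these assumptions both transfer widths yield the same 64-bit word, which is then handed to the first-level decoder.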

The implementation of the output unit is shown in Figure 10. Its main function is to sort the results of the basic units and output them in different formats. The filtering operations are performed in the basic units, and each of the 4 basic units outputs its own results, so these outputs must be sorted to ensure that the final results come out in order. The "basic-unit output sorting" module works as follows: when the device performs FFT/IFFT or FIR pulse-group processing, it arranges the outputs of the basic units in frequency-channel order; in correlation mode, each basic unit outputs frames of 10 points with a gap between consecutive frames, and the module converts the framed data from the basic units into a continuous data stream at the same rate as the input data. The modulus module converts the real/imaginary output format to the magnitude/phase output format. The logarithm module converts the input magnitude to a logarithmic representation. The floating-point/fixed-point conversion module converts the output of the real/imaginary module or of the modulus module from floating-point to fixed-point format. The exponent normalization module plays a role similar to that of the floating-point/fixed-point conversion module: it unifies the exponents of floating-point results to a fixed value by shifting the mantissas accordingly. There are two sets of these 4 format conversion modules, one for each output port of the chip, so that the two output ports can each output independently in any of the formats.
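Two of these conversions are easy to illustrate. The Python sketch below shows a magnitude/phase conversion and an exponent normalization that re-expresses a value mantissa × 2^exponent at a chosen target exponent by shifting the mantissa. Function names are illustrative, and hardware details such as word width, rounding, and overflow handling are not modelled.

```python
import math


def to_magnitude_phase(re, im):
    """Convert real/imaginary parts to (magnitude, phase in radians)."""
    return math.hypot(re, im), math.atan2(im, re)


def normalize_exponent(mantissa, exponent, target_exponent):
    """Return the mantissa that represents the same value at target_exponent.

    Raising the exponent shifts the mantissa right (low bits are lost);
    lowering it shifts left (overflow is ignored in this sketch).
    """
    shift = target_exponent - exponent
    if shift >= 0:
        return mantissa >> shift   # arithmetic shift, keeps the sign
    return mantissa << -shift


magnitude, phase = to_magnitude_phase(3.0, 4.0)
m_common = normalize_exponent(256, 3, 5)   # 256 * 2**3 == 64 * 2**5
```

After normalization all results share one exponent, so their mantissas can be compared or passed on as fixed-point numbers.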

The data exchange unit is in effect a set of multi-input, multi-output switches. It receives 8 groups of inputs from the 4 basic units and, through selector switches, routes any group of inputs to any group of outputs. Data can thus move freely among the 4 basic units, which guarantees the data paths needed for the 4 basic units to cooperate when implementing a particular algorithm.
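Functionally this is a crossbar. A minimal Python model is given below. It assumes 8 output groups to match the 8 input groups and labels the groups as two per basic unit; neither the output count nor the labels are specified in the text.

```python
def route(inputs, select):
    """Crossbar model: output j carries inputs[select[j]]."""
    if any(not 0 <= s < len(inputs) for s in select):
        raise ValueError("select index out of range")
    return [inputs[s] for s in select]


# Hypothetical labels: two input groups (a, b) from each basic unit 0..3.
inputs = [f"BU{u}.{g}" for u in range(4) for g in ("a", "b")]

# Example setting: each output takes data from the next basic unit.
rotate = [(j + 2) % 8 for j in range(8)]
outputs = route(inputs, rotate)
```

Because each output has its own selector, one input group can also be broadcast to several outputs at once.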

The 4 basic units are the core of the invention and perform most of the functions, such as data buffering, computation, and address generation. Their structure is shown in Figure 2. All 4 basic units have the same architecture. The "data memory" comprises eight 512×40 dual-port RAMs, used for input data buffering in each operating mode, temporary storage of intermediate FFT results, and output buffering of FFT results. As the figure shows, the "data buffer" can feed data simultaneously to every "complex multiply-accumulator" and "complex multiply-accumulator sub-array". The "coefficient memory" comprises ten 256×32 dual-port RAMs, used to store the filter coefficients for correlation processing and FIR filtering and the weighting coefficients for FFT operations. The "FFT iteration coefficients" block is a fixed table implemented in dedicated logic that supplies the iteration coefficients for FFT/IFFT operations. Each "complex multiply-accumulator" in the figure corresponds to 4 real floating-point multiply-accumulators, and each "complex multiply-accumulator sub-array" contains 16 real floating-point multiply-accumulators, equivalent to 4 "complex multiply-accumulators". The two "complex multiply-accumulator sub-arrays" and the "complex multiply-accumulators" can be combined in different ways to perform different operations.
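The resource counts quoted for a basic unit can be cross-checked with a few lines of Python. The figures are taken from the text (2 complex multiply-accumulators and 2 sub-arrays per basic unit, 4 basic units); only the variable names are introduced here.

```python
BASIC_UNITS = 4
COMPLEX_MACS_PER_UNIT = 2        # complex multiply-accumulators A and B
SUBARRAYS_PER_UNIT = 2           # sub-arrays A and B
REAL_MACS_PER_COMPLEX_MAC = 4
REAL_MACS_PER_SUBARRAY = 16

real_macs_per_unit = (COMPLEX_MACS_PER_UNIT * REAL_MACS_PER_COMPLEX_MAC
                      + SUBARRAYS_PER_UNIT * REAL_MACS_PER_SUBARRAY)
total_real_macs = BASIC_UNITS * real_macs_per_unit
complex_macs_per_unit = real_macs_per_unit // REAL_MACS_PER_COMPLEX_MAC

data_memory_bits = 8 * 512 * 40      # eight 512x40 dual-port RAMs
coeff_memory_bits = 10 * 256 * 32    # ten 256x32 dual-port RAMs
```

This gives 40 real multiply-accumulators per basic unit and 160 on the chip, and 10 complex-equivalent multiply-accumulators per basic unit, matching the figure of 10 used for the correlation and FIR modes.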

For FFT/IFFT operations, the "complex multiply-accumulator" acts as a weighting multiplier: it weights input data that require multi-stage processing and passes the result to the "complex multiply-accumulate sub-array" for radix-8 iteration. That is, in FFT/IFFT mode, "complex multiply-accumulator A" and "complex multiply-accumulator sub-array A" of a basic unit form one radix-8 core, and "complex multiply-accumulator B" and "complex multiply-accumulator sub-array B" form another, so that the whole chip has 8 such cores, across which the FFT/IFFT computation is decomposed for parallel processing.
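What one such core computes can be sketched in Python as a weighting step followed by an 8-point DFT, which is the result a radix-8 butterfly produces. This is a direct, unoptimized DFT written for clarity; it is not the device's butterfly structure, and the function names are illustrative.

```python
import cmath


def dft8(x):
    """Direct 8-point DFT of a list of 8 complex values."""
    if len(x) != 8:
        raise ValueError("expected 8 inputs")
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 8) for n in range(8))
        for k in range(8)
    ]


def weighted_radix8(x, weights):
    """Weight the inputs (complex MAC role), then apply the 8-point DFT
    (sub-array role)."""
    return dft8([w * v for w, v in zip(weights, x)])


ones = [1.0] * 8
spectrum = weighted_radix8(ones, ones)     # constant input -> energy in bin 0
impulse = weighted_radix8([1.0] + [0.0] * 7, ones)
```

Since 4096 = 8^4, a pure radix-8 decomposition of the largest supported transform would need four such stages; how the device handles sizes that are not powers of 8 is not stated in this section.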

For FIR pulse-group processing, the multiply-accumulators in the 2 "complex multiply-accumulator" modules and the 2 "complex multiply-accumulator sub-array" modules of a basic unit are used in parallel, and each complex multiply-accumulator corresponds to one frequency channel of the FIR pulse group. If conjugate operation is used, each complex multiply-accumulator corresponds to 2 frequency channels, and with one further round of multiplexing even more frequency channels can be handled.
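The parallel arrangement can be sketched in Python as a bank of filters that all see the same pulse samples, with one accumulator per frequency channel. Treating FIR pulse-group processing as such a filter bank is an interpretation, and the function name and example coefficients are illustrative.

```python
def fir_pulse_group(pulses, coeff_bank):
    """One output per frequency channel.

    Each channel plays the role of one complex multiply-accumulator that
    accumulates sum_n coeffs[n] * pulses[n] over the pulse group.
    """
    return [
        sum(h * x for h, x in zip(coeffs, pulses))
        for coeffs in coeff_bank
    ]


pulses = [1, 1, 1, 1]                 # a constant return across 4 pulses
coeff_bank = [
    [1, 1, 1, 1],                     # channel 0: passes constant input
    [1, -1, 1, -1],                   # channel 1: rejects constant input
]
channel_outputs = fir_pulse_group(pulses, coeff_bank)
```

With 10 complex-equivalent multiply-accumulators in a basic unit, 10 such channels can be accumulated side by side in this model.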

To summarize: for correlation, the "complex multiply-accumulators" and "complex multiply-accumulator sub-arrays" are used in series, equivalent to 10 complex multiply-accumulators in series; for FIR filtering, they are used in parallel, corresponding to 10 frequency channels of the filtering operation; for FFT, the "complex multiply-accumulator" acts as a weighting multiplier that weights the data before passing them to the "complex multiply-accumulator sub-array" for iteration.

The present invention performs complex arithmetic in a floating-point data format. Since each complex operation is composed of four real operations, the real floating-point multiplication accumulator is the core arithmetic component of the device. Its block diagram is shown in Figure 3 and comprises five parts: fixed-point multiplication, truncation, exponent adjustment, fixed-point addition, and exponent judgment.

The data to be processed is 20 bits wide, formatted as a 4-bit unsigned exponent plus a 16-bit signed mantissa, giving a dynamic range of $-2^{15}\times 2^{15}$ to $2^{15}\times 2^{15}-1$. Coefficients are 16-bit signed fixed-point numbers with a dynamic range of $-2^{15}$ to $2^{15}-1$. Intermediate results use a 24-bit floating-point format (a 4-bit unsigned exponent plus a 20-bit signed mantissa), chosen as a compromise between arithmetic precision and the area and speed of the hardware implementation. The multiplication accumulator operates in the following five steps.

①. The fixed-point multiplication part multiplies the 16-bit signed mantissa of the data by the 16-bit signed coefficient in fixed point.

②. The truncation part preserves as much of the precision of the fixed-point product as possible. The number of redundant sign bits in the 32-bit product determines the 20-bit mantissa of the 24-bit intermediate value and the 4-bit exponent that goes with it. If the 32-bit product carries k redundant sign bits and only 20 bits can be kept, maximum precision is obtained by discarding those k redundant sign bits (a left shift by k bits) and subtracting k from the exponent at the same time. The mantissa then consists of one sign bit followed by 19 data bits. This is equivalent to applying the following two operations simultaneously to a 32-bit fixed-point product $a_{31}a_{30}a_{29}\ldots a_{2}a_{1}a_{0}\times 2^{e}$:

Remove redundant sign bits and truncate: $a_{31}a_{30}a_{29}\ldots a_{2}a_{1}a_{0} \rightarrow a_{31-k}a_{30-k}a_{29-k}\ldots a_{31-18-k}a_{31-19-k}$

Subtract k from the original exponent: $e \rightarrow e-k$

When e > k > 12, that is, when the left shift exceeds 12 bits, the low-order bits are padded with zeros. Note also that the exponent must not fall below 0 after k is subtracted. Truncation must therefore take both the number of redundant sign bits and the size of the exponent into account: if the original exponent is smaller than the number of redundant sign bits, that is, e < k, only e redundant sign bits can be removed and the exponent is reduced to 0:

Remove redundant sign bits and truncate: $a_{31}a_{30}a_{29}\ldots a_{2}a_{1}a_{0} \rightarrow a_{31-e}a_{30-e}a_{29-e}\ldots a_{31-18-e}a_{31-19-e}$

Reduce the exponent to 0: $e \rightarrow 0$

③. In floating-point addition, the mantissas of the two operands can be added only when their exponents are equal. The exponent adjustment part aligns the exponents of the addend and the augend. Suppose $A_1 = a_1\times 2^{e_1}$ and $A_2 = a_2\times 2^{e_2}$ are to be added and $e_1 < e_2$. The exponent $e_1$ of $A_1$ is then raised to the value of $e_2$, and the mantissa of $A_1$ is sign-extended by $e_2-e_1$ bits and shifted right by $e_2-e_1$ bits.

④. Once the exponents have been aligned, the two adjusted mantissas are sign-extended and added in fixed point, producing a 21-bit fixed-point sum.

⑤. Because step ③ always raises the smaller exponent to the larger one, the redundant sign bits of the 21-bit fixed-point sum must be removed after the addition and the resulting exponent value must be checked. This is the role of the exponent judgment part.
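The five steps above can be sketched in software as follows. This is only an illustrative model, not the device's circuit: the function names are hypothetical, the overflow handling in step ⑤ (shift right by one bit, raise an error if the 4-bit exponent would overflow) is an assumption, and the helper `to_float` assumes that dropping the 12 low bits of the product leaves an implicit scale of 2**12 on the intermediate mantissa.

```python
from typing import Tuple

MANT_BITS = 20   # width of the signed intermediate mantissa
EXP_MAX = 15     # largest value of the 4-bit unsigned exponent


def redundant_sign_bits(value: int, width: int) -> int:
    """Count leading bits that only repeat the sign bit of a two's-complement value."""
    magnitude_bits = value.bit_length() if value >= 0 else (~value).bit_length()
    return width - 1 - magnitude_bits


def multiply(mant16: int, exp4: int, coeff16: int) -> Tuple[int, int]:
    """Steps 1-2: fixed-point multiply, then keep the 20 most useful bits."""
    product = mant16 * coeff16                       # 32-bit signed product
    k = redundant_sign_bits(product, 32)
    shift = min(k, exp4)                             # exponent must not drop below 0
    mant20 = (product << shift) >> (32 - MANT_BITS)  # arithmetic shift drops low bits
    return mant20, exp4 - shift


def add(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """Steps 3-5: align exponents, add mantissas, renormalise the 21-bit sum."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:
        ma >>= eb - ea                               # sign-extending right shift
        ea = eb
    else:
        mb >>= ea - eb
    total, exp = ma + mb, ea
    if redundant_sign_bits(total, MANT_BITS + 1) == 0:  # sum needs all 21 bits
        total >>= 1
        exp += 1
    if exp > EXP_MAX:                                # assumed overflow policy
        raise OverflowError("exponent exceeds the 4-bit range")
    return total, exp


def to_float(mant20: int, exp: int) -> float:
    """Value represented, including the implicit 2**12 scale left by truncation."""
    return mant20 * 2.0 ** (exp + 32 - MANT_BITS)
```

For example, `multiply(1000, 5, 2000)` finds 10 redundant sign bits in the product but can remove only 5 of them because the exponent is 5, which corresponds to the e < k case described in step ②.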

II. Organization of hardware resources

The present invention is a reconfigurable special-purpose digital signal processor. "Reconfigurable" means that the hardware resources can be organized by configuring different control words, so that they work in different modes. The invention can be configured into three classes of working mode: FFT/IFFT, FIR pulse group processing, and correlation processing. Within each class, the key parameters can be adjusted through the control word to meet different processing requirements. The organization of the hardware resources in each of the three classes is described below with reference to the drawings.

1. FFT/IFFT

The basic formula of the discrete Fourier transform (DFT) is:

$$X(k) = \sum_{i=0}^{N-1} w(i)\,x(i)\,e^{-j2\pi ki/N}, \quad k = 0, 1, \ldots, N-1$$

where $w(i)$ is the DFT weighting factor, $x(i)$ is the input data, $e^{-j2\pi ki/N}$ is the twiddle factor, and N is the number of DFT points. If N is composite, a long DFT can be converted into two shorter DFTs of lengths $N_1$ and $N_2$. If $N_1$ and $N_2$ can themselves be factored, the decomposition can be continued and the amount of computation falls further.
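The decomposition of a composite-length DFT can be illustrated with a short sketch. It uses the standard Cooley-Tukey index mapping and is not a description of the device's data flow; the function names are hypothetical, and the example length 64 = 8 × 8 is chosen only to echo the radix-8 structure.

```python
import cmath


def weighted_dft(x, w):
    """Direct DFT of the weighted input w(i) * x(i)."""
    n = len(x)
    return [
        sum(w[i] * x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
        for k in range(n)
    ]


def dft(x):
    """Unweighted DFT (all weights equal to 1)."""
    return weighted_dft(x, [1.0] * len(x))


def two_stage_dft(x, n1, n2):
    """Compute an N = n1 * n2 point DFT from n1 DFTs of length n2 and n2 DFTs of length n1."""
    n = n1 * n2
    assert len(x) == n
    # Stage 1: one length-n2 DFT for each decimated sequence x[r], x[r + n1], ...
    inner = [dft(x[r::n1]) for r in range(n1)]
    out = [0j] * n
    for k2 in range(n2):
        # Twiddle factors applied between the two stages
        column = [inner[r][k2] * cmath.exp(-2j * cmath.pi * r * k2 / n) for r in range(n1)]
        # Stage 2: one length-n1 DFT across the n1 branches
        outer = dft(column)
        for k1 in range(n1):
            out[k2 + n2 * k1] = outer[k1]
    return out
```

The direct form needs on the order of N² complex multiplications, whereas the two-stage form needs roughly N·(N₁ + N₂), which is the saving the paragraph above refers to.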

Depending on the decomposition, FFT algorithms include radix-2, radix-4, radix-8, mixed-radix, and others. The FFT/IFFT of the present invention uses radix-8 operations. This working mode has three subtypes (FFT, IFFT, and FFT+IFFT) and supports lengths such as 256, 512, 1024, 2048, and 4096 points; when the number of points is 2048 or fewer, two data streams can also be processed by FFT or IFFT simultaneously. Whatever the number of points, the 4 basic units (basic_unit) of the device are configured as shown in Figure 4 in this mode. "Complex multiplication accumulator sub-array A" and "complex multiplication accumulator sub-array B" of Figure 2 each act as one iteration node of the radix-8 algorithm, and the 4 complex multiplication accumulators in a sub-array complete one stage of radix-8 iteration. "Complex multiplication accumulator A" and "complex multiplication accumulator B", which precede the sub-arrays, act as windowing units and apply the window to the data. The windowed data then enters the sub-array for one stage of radix-8 iteration. The output of each stage is first written to the data cache and then sent through the data exchange unit of Figure 1 to the corresponding other sub-array for the next stage. The hardware organization in FFT/IFFT mode is described below for a single basic unit.

When the GA3816 device performs an FFT of more than 256 points, it works internally in a time-division multiplexed manner; the larger the number of points, the more multiplexing passes are needed. The data memory is configured as a ping-pong data cache. The total data storage in the device is 32×256×40 bits, of which 2×4×256×40 bits are allocated to each basic unit. Each basic unit sets aside half of its data-cache address space, that is, 4×256×40 bits, as the data input buffer. In this way the 4 basic units of a single device can complete an FFT of up to 4096 points.

To suppress sidelobes, the input data must be windowed before the FFT. The GA3816 device contains 40×256×32 bits of dual-port RAM as coefficient memory, of which 10×256×32 bits are allocated to each basic unit. In each basic unit, 8×256×32 bits serve as the window-function cache. This cache is divided into two groups, each with an addressing depth of 1024, which supply the window function to the windowing unit in front of each "complex multiplication accumulator sub-array".

Because the twiddle factors of the FFT/IFFT are regular, the present invention stores the twiddle factors of the radix-8 FFT in a fixed table. A single chip can complete an FFT/IFFT of at most 4096 points. The chip holds eight 512×32-bit twiddle-factor tables, distributed evenly among the 4 basic units (basic_unit). When the number of points is less than 4096, the twiddle factors are drawn from this same table. In addition, a 4×8×32-bit coefficient table stores the coefficients used in the radix-8 operation.
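Drawing the twiddle factors of a shorter transform from the 4096-point table can be sketched as a strided lookup. The sketch below assumes a single flat table of 4096 double-precision entries; how the device splits the entries across its eight 512×32-bit tables and how it quantizes them is not stated in the text, and the names `TABLE` and `twiddle` are hypothetical.

```python
import cmath

MAX_POINTS = 4096

# Flat table of the 4096-point twiddle factors W^m = exp(-j*2*pi*m/4096)
TABLE = [cmath.exp(-2j * cmath.pi * m / MAX_POINTS) for m in range(MAX_POINTS)]


def twiddle(n_points: int, k: int) -> complex:
    """Return exp(-j*2*pi*k/n_points) by reading every (4096/n_points)-th table entry."""
    if MAX_POINTS % n_points != 0:
        raise ValueError("n_points must divide 4096")
    stride = MAX_POINTS // n_points
    return TABLE[(k * stride) % MAX_POINTS]
```

A 256-point transform, for instance, reads every 16th entry of the table, so no separate table is needed for the shorter lengths.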

When the number of FFT points is 2048 or fewer, the present invention provides a second hardware organization for FFT/IFFT: two FFTs computed simultaneously. In this dual-input mode, basic units basic_unit0 and basic_unit2 form one group that computes the first stream, and basic_unit1 and basic_unit3 form another group that computes the second. Although the hardware resources are split into two groups, the principle of the radix-8 operation is unchanged in both. The two data streams enter simultaneously through data input ports 1 and 2, and the two sets of results are output in parallel from output ports 1 and 2. This mode can be used for simultaneous FFT/IFFT of two independent data sets or for a two-dimensional FFT/IFFT of one data set.

2. FIR pulse group processing

The basic function of FIR pulse group processing is a matrix operation, whose basic formula is: $Y = HX$

where $X = (X_0, X_1, \ldots, X_{N-1})^T$ and $Y = (Y_0, Y_1, \ldots, Y_{M-1})^T$, with $N \ge M$.

In implementation, the matrix operation above can be regarded as multiply-accumulate operations. Considering the operation of one row of the matrix H (hereinafter also called the "coefficient matrix") on the column X (hereinafter also called the "data"), with complex operands, the basic operation is a multiply-accumulate, expressed as:

$$Y_i = \sum_{j} H_{ij} X_j$$

FIR pulse group processing thus consists of i groups of multiply-accumulate operations, each comprising j multiplications and j-1 accumulations.
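The row-by-row multiply-accumulate view of Y = HX can be sketched as follows. The function name is hypothetical and the sketch uses ordinary Python complex numbers rather than the device's floating-point format; each row of the coefficient matrix plays the role of one channel.

```python
def fir_pulse_group(h, x):
    """Compute Y = H X with one multiply-accumulate chain per row (channel) of H."""
    results = []
    for row in h:                      # one channel per row of the coefficient matrix
        acc = 0j
        for coeff, sample in zip(row, x):
            acc += coeff * sample      # one multiplication and one accumulation
        results.append(acc)
    return results
```

Because every row works on the same data vector, the rows are independent of one another, which is what allows the channels to be computed by separate multiplication accumulators in parallel.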

Besides the basic form of FIR pulse group processing, the present invention provides several extended forms: sliding FIR pulse group processing, FIR pulse group processing of two data sets in parallel, FIR pulse group processing with two-channel addition, and so on. In both the basic and the extended forms, the hardware resources of the 4 basic units (basic_unit) are configured as shown in Figure 5 in this class of working mode.

In the organization shown in Figure 5, the 4 complex multiplication accumulators of "complex multiplication accumulator sub-array A" together with "complex multiplication accumulator A" form "parallel multiplication accumulation array 1"; likewise, the 4 complex multiplication accumulators of "complex multiplication accumulator sub-array B" together with "complex multiplication accumulator B" form "parallel multiplication accumulation array 2". Within a parallel multiplication accumulation array, one complex multiplication accumulator handles one channel of FIR pulse group processing (two channels in the conjugate case), so one array processes 5 channels in parallel (10 in the conjugate case) and one basic unit processes 10 channels in parallel (20 in the conjugate case). Consequently, when the input data rate equals the chip's operating clock, the 4 basic units can process at most 80 channels in parallel. If the parallel multiplication accumulation arrays are multiplexed, up to 128 channels can be processed.

The coefficients for FIR pulse group processing are stored in the coefficient memory shown in Figure 2. Each basic unit has ten 256×32-bit dual-port RAMs for coefficient storage, one per multiplication unit. Each RAM is permanently assigned to supply coefficients to one particular complex multiplication accumulator in the parallel multiplication accumulation array, so that a multiplication accumulator together with its coefficient RAM constitutes one channel of FIR pulse group processing.

Each basic unit (basic_unit) has eight 512×40-bit dual-port RAMs serving as data memory. In FIR pulse group processing, all 8 dual-port RAMs are used as data cache. In every basic unit the data cache receives the same data from the data input port and feeds it to the data inputs of the multiplication accumulators, where it is multiplied and accumulated with the coefficients of the different channels.

The second hardware organization for FIR pulse group processing handles two data sets in parallel. In this organization, the first basic unit (basic_unit0) and the third basic unit (basic_unit2) form one group, and the second basic unit (basic_unit1) and the fourth basic unit (basic_unit3) form another. The parallel multiplication accumulation arrays of the first group process the data entering through the "data input 1" port, and those of the second group process the data entering through the "data input 2" port. The data memory and coefficient memory of the first group store the first set of data and coefficients; those of the second group store the second set. A single device can thus perform FIR pulse group processing on two different data sets in parallel.

The two-group parallel mode described above divides the hardware resources of the whole chip into two groups, each processing one data set independently. Similarly, when the number of channels is 40 or fewer and the filter length is 80 or more, the data to be processed can be split into two segments, each half the filter length, to increase speed. The first segment enters through data input port 1 and the second through data input port 2. The two on-chip hardware groups multiply and accumulate their segments with the corresponding coefficients simultaneously, and the two partial results are then added to give the complete result. This is the third hardware organization in the FIR pulse group processing mode.
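That splitting the filter into two halves and adding the partial sums gives the same result as a single pass can be checked with a small sketch. The function names are hypothetical, and the two halves are computed one after the other here, whereas the device computes them at the same time on its two hardware groups.

```python
def mac(coeffs, data):
    """Plain multiply-accumulate over one segment."""
    total = 0
    for c, d in zip(coeffs, data):
        total += c * d
    return total


def split_mac(coeffs, data):
    """Process the two halves separately, then add the partial results."""
    half = len(coeffs) // 2
    first = mac(coeffs[:half], data[:half])    # segment entering via data input port 1
    second = mac(coeffs[half:], data[half:])   # segment entering via data input port 2
    return first + second
```

Each half needs only half as many accumulation steps, so with two hardware groups working simultaneously the accumulation time is roughly halved, at the cost of one extra addition at the end.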

3. Correlation processing

Correlation processing takes N consecutive samples at a time, in a sliding manner, from a continuously sampled data sequence and filters them. Its characteristic is that two adjacent filtering operations share N-1 samples. In its mathematical expression, N is the filter length, $x_i$ is the input signal, and $h_i$ is the corresponding filter coefficient.
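The sliding operation can be sketched as below. The indexing convention, in which output n is the sum of $h_i x_{n+i}$ for i from 0 to N-1, is an assumption made for illustration; the text only states that N consecutive samples are taken in a sliding manner and that adjacent windows share N-1 samples. The function name is hypothetical.

```python
def sliding_correlation(x, h):
    """For each window of len(h) consecutive samples, accumulate h[i] * x[start + i]."""
    n = len(h)
    return [
        sum(h[i] * x[start + i] for i in range(n))
        for start in range(len(x) - n + 1)
    ]
```

Moving from one output to the next advances the window by a single sample, so N-1 of the N samples are reused.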

In correlation mode, the hardware resources within a basic unit are configured as shown in Figure 6. The multiplication accumulators inside the basic unit are arranged in cascade. Each basic unit contains 10 floating-point complex multiplication accumulators. The output of the first complex multiplication accumulator is fed to the second, where it is added to the second product; that sum is the output of the second complex multiplication accumulator and is fed to the third as one addend of the third addition, and so on through all 10 cascaded complex multiplication accumulators. Each basic unit (basic_unit) can thus be organized as a cascade of 10 complex multiplication accumulators, and by multiplexing this "cascaded multiplication accumulation array" a different number of times, correlation with a filter length of 10×N points (N = 1, 2, 3, ..., 100) can be completed. The 4 basic units are cascaded in the same way: each basic unit passes its partial multiply-accumulate sum to the next, where it becomes the addend of the first complex multiplication accumulator in that unit's cascaded multiplication accumulation array. The 4 basic units accumulate in sequence, and the final correlation result is produced in the fourth basic unit (basic_unit3).

Every basic unit of the device has data memory of the same capacity: eight 512×40-bit dual-port RAMs. In correlation mode, the data memory of each basic unit is addressed as a single space and used as data cache. On input, the data cache of every basic unit stores the same data. During operation, in any one clock cycle the data cache supplies the same data to the 10 cascaded complex multiplication accumulators of its basic unit.

The correlation coefficients are held in the ten 256×32-bit dual-port RAMs of each basic unit. Within a basic unit, each 32-bit-wide dual-port RAM corresponds to one complex multiplication accumulator in the cascaded multiplication accumulation array and is permanently assigned to supply its coefficients. The coefficients $h_n, h_{n+1}, \ldots, h_{n+9}$ are stored in order in the 1st to 10th dual-port RAMs, so that within any one dual-port RAM the coefficients at address n and address n+1 differ in index by 10. This layout matches the 10 complex multiplication accumulators of the cascaded array: the pipeline performs 10 multiply-accumulate operations in the same clock cycle, so the coefficient cache must deliver 10 consecutive coefficients in that cycle.
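The interleaved coefficient layout can be sketched as a simple modulo mapping. The function names are hypothetical and the RAMs are numbered from 0 here, whereas the text counts them from 1 to 10.

```python
NUM_RAMS = 10  # one coefficient RAM per cascaded complex multiplication accumulator


def coefficient_location(index: int):
    """Map coefficient index m to (RAM number, address within that RAM)."""
    return index % NUM_RAMS, index // NUM_RAMS


def distribute(coeffs):
    """Spread a coefficient sequence over the 10 RAMs in round-robin order."""
    rams = [[] for _ in range(NUM_RAMS)]
    for index, value in enumerate(coeffs):
        rams[index % NUM_RAMS].append(value)
    return rams


def read_cycle(rams, address):
    """Coefficients delivered in one clock cycle: the same address read from every RAM."""
    return [ram[address] for ram in rams]
```

Reading one address from all 10 RAMs at once yields 10 consecutive coefficients, which is the property the pipeline relies on.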

III. Scheduling of hardware resources

The present invention schedules hardware resources through a two-level framework that combines centralized and distributed control. Figure 7 shows the two-level scheduling framework of the device. Control words and control signals first enter the "global control" module for first-level decoding and control-word distribution. This module receives the control word and generates global control signals from it; these coordinate the actions of the 4 basic units, determine the output format of the results, set the on-chip master clock, and so on. The second function of the "global control" module is to send control information to the "local control" module of each basic unit: the control words related to the operation parameters are decoded and sent to each basic unit. The information sent includes the working mode, the subtype of the working mode, the number of points, the number of channels, the number of processing channels, and so on.

On receiving the information sent by the global control module, the local control module of each basic unit (basic_unit) performs second-level decoding, converting it into detailed hardware scheduling control signals. For example, the working mode and its subtype determine the setting of certain selection switches; the number of processing channels determines how coefficients are stored and how coefficient memory addresses are generated; the number of points determines how many times the multiplication accumulation array is multiplexed, and hence whether and when a given register is updated. Hardware scheduling is divided into two levels mainly to keep the control orderly and simple, which also makes the implementation clearer.

Within a basic unit, key parameters are decoded by arithmetic. Whether the mode is FFT/IFFT, FIR pulse group, or correlation, the read/write addresses of data and coefficients, the number of multiplexing passes of the multiplication accumulation array, and similar quantities all follow regular patterns, so the control word can be decoded arithmetically to obtain all control information and timing signals. For example, in an FIR pulse group operation with a filter order of 40 and 40 points, the dedicated decoding multiplier computes that 40×40 = 1600 coefficients are needed in total. When coefficient write addresses are generated, addresses 0 to 1599 are produced in sequence and the coefficients are written to the corresponding cache. Arithmetic decoding is also a compromise between scheduling complexity and resource usage: first, an 8×8 fixed-point multiplier does not occupy many resources; second, enumerating every working state of the processor and decoding by table lookup would not only be tedious and laborious, but the storage and lookup logic would also occupy considerable resources.
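The arithmetic decoding in the example above amounts to one multiplication followed by a counter. A minimal sketch, with a hypothetical function name and the assumption that both operands must fit the 8-bit inputs of the 8×8 multiplier:

```python
def coefficient_write_addresses(filter_order: int, num_points: int) -> range:
    """Derive the coefficient write-address sequence from two control-word parameters."""
    if not (0 <= filter_order < 256 and 0 <= num_points < 256):
        raise ValueError("operands must fit the 8x8 decoding multiplier")
    total = filter_order * num_points   # work done by the decoding multiplier
    return range(total)                 # addresses 0 .. total-1, generated in sequence
```

The address sequence is computed from the parameters rather than looked up, so no table of working states has to be stored.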

The control word used for hardware resource scheduling is 64 bits wide. Its fields in the FFT/IFFT, FIR pulse group, and correlation working modes are described in Table 1, Table 2, and Table 3, respectively.

Control word | Name | Meaning and values
chip_id(3:0) | Cascade ID | Not used; may be fixed at 0000.
chip_num(3:0) | Number of cascaded chips | Not used; may be fixed at 0000.
chan_num(7:0) | Number of channels | Not used; may be fixed at 00000000.
work_model(4:0) | Working mode | FFT: 10000; IFFT: 10010; FFT+IFFT: 10011; dual-input FFT: 10100; dual-input IFFT: 10110.
coeff_num(4:0) | Number of coefficient groups | Not used; may be fixed at 00000.
coeff_chan(4:0) | Coefficient channel | Not used; may be fixed at 00000.
conj | Conjugate selection | Not used; may be set to '0'.
length(7:0) | Filter length | The lower four bits, length(3:0), encode the data length of the FFT/IFFT (that is, the number of points in one set of weighting coefficients): the length is 2 raised to the unsigned integer value of length(3:0). Supported codes are "1000" (256 points), "1001" (512 points), "1010" (1024 points), "1011" (2048 points), and "1100" (4096 points). The unsigned integer value of length(6:4) plus 1 gives the number of stages needed to split a large FFT/IFFT into small radix-8 operations, which depends on the length: 256 to 512 points need three stages, code "010"; 1024 to 4096 points need four stages, code "011". The top bit, length(7), is currently unused and set to '0'.
sel_coeff_ram(4:0) | Coefficient base address | In FFT/IFFT operation the coefficient set can be switched during operation. The integer value of sel_coeff_ram(4:0) selects which set of coefficients in the coefficient memory is used. The number of selectable sets depends on the number of points: 2 sets for 4096 points, 4 for 2048, 8 for 1024, 16 for 512, and 32 for 256, so sel_coeff_ram ranges from "00000" to "11111".
sel_data | Weighting selection | In FIR pulse group mode, selects whether the input data is weighted. Always '0' in FFT mode.
sel_coeff_pp | Coefficient ping-pong | Whether coefficients are written in ping-pong fashion in dual-input mode. No ping-pong: '0'; ping-pong: '1'.
result1_mod(1:0) | Output port 1 output mode | Final-result output port 1 has 4 output modes. 00: I/Q data; 01: magnitude and phase; 10: logarithm and phase; 11: correlation cascade data.
sel_AGC1(1:0) | Output port 1 gain control | Final-result output port 1 can output I and Q as 20-bit floating-point values in the input data format, as floating-point values normalized to the exponent given by control word AGC1(3:0), or as 20-bit fixed-point values. 00 or 11: floating-point output; 01: exponent-normalized output; 10: fixed-point output.
AGC1(3:0) | Output port 1 normalization level | When sel_AGC1 = 01 or 10, a reference exponent must be specified. AGC1(3:0) is that reference exponent, used for both exponent-normalized output and fixed-point output.
result2_mod(1:0) | Output port 2 output mode | Same as result1_mod(1:0). 00: I/Q data; 01: magnitude and phase; 10: logarithm and phase; 11: correlation cascade data.
sel_AGC2(1:0) | Output port 2 gain control | Same as sel_AGC1(1:0). 00 or 11: floating-point output; 01: exponent-normalized output; 10: fixed-point output.
AGC2(3:0) | Output port 2 normalization level | Same as AGC1(3:0). When sel_AGC2 = 01 or 10, the output floating-point exponent takes the AGC2 value as its reference for exponent normalization or floating-point to fixed-point conversion.

Table 1
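The length(7:0) encoding for FFT/IFFT mode can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the stage count is computed as the ceiling of log2(points)/3, which is an inference that reproduces the three-stage and four-stage codes listed in Table 1 rather than a rule stated in the text.

```python
import math

# Supported exponents per Table 1: 2**8 = 256 points up to 2**12 = 4096 points.
SUPPORTED_LOG2 = range(8, 13)


def encode_fft_length(points):
    """Build the 8-bit length(7:0) field for an FFT/IFFT of `points` points.

    Hypothetical helper: length(3:0) = log2(points), length(6:4) = stages - 1,
    length(7) = 0.
    """
    if points <= 0:
        raise ValueError("points must be positive")
    log2n = points.bit_length() - 1
    if points != 1 << log2n or log2n not in SUPPORTED_LOG2:
        raise ValueError(f"unsupported FFT length: {points}")
    # Assumption: radix-8 stage count = ceil(log2n / 3); this matches Table 1
    # (3 stages for 256/512 points, 4 stages for 1024..4096 points).
    stages = math.ceil(log2n / 3)
    return ((stages - 1) << 4) | log2n


def decode_fft_length(field):
    """Return (points, stages) from a length(7:0) field value."""
    points = 1 << (field & 0xF)
    stages = ((field >> 4) & 0x7) + 1
    return points, stages


if __name__ == "__main__":
    for n in (256, 512, 1024, 2048, 4096):
        print(n, format(encode_fft_length(n), "08b"))
```

For a 1024-point transform the sketch yields length(3:0) = 1010 and length(6:4) = 011, the same codes the table lists for that length.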

Control word | Name | Meaning and value
chip_id(3:0) | Cascade position | Held at 0000 in FIR mode.
chip_num(3:0) | Number of cascaded chips | Held at 0000 in FIR mode.
chan_num(7:0) | Number of channels | Number of channels in the FIR pulse group operation, chosen according to the application. For example, for 50 channels operating in parallel, chan_num+1 = 50.
work_model(4:0) | Operating mode | FIR pulse group mode: 01000; sliding FIR pulse group mode: 00010; FIR pulse group mode with two data sets processed in parallel: 00100; FIR pulse group mode with two channels summed: 00001; DFT implemented by FIR pulse group: same value as the ordinary FIR pulse group mode, 01000, and the control word sel_data must also be set to '1'.
coeff_num(4:0) | Number of coefficient groups | Held at 00000 in FIR pulse group mode.
coeff_chan(4:0) | Coefficient channel | Held at 00000 in FIR pulse group mode.
conj | Conjugate selection | Whether the coefficients are conjugated in FIR pulse group mode. If the filter bank uses coefficients that are conjugate between channels, the effective computation rate doubles. '0': not conjugated; '1': conjugated.
length(7:0) | Filter length | Length of the filtering operation in FIR pulse group mode: filter length N = length+1.
sel_coeff_ram(4:0) | Coefficient base address | Base address for changing coefficients during operation in FIR pulse group mode. The upper 2 bits are held at 00; the lower 3 bits select which region of the coefficient memory supplies the coefficients.
sel_data | Weighting enable | In FIR pulse group mode, selects whether the input data are weighted. '0': not weighted; '1': weighted. This control word is '1' if and only if the chip is implementing a DFT by FIR pulse group.
sel_coeff_pp | Coefficient ping-pong | Fixed at '0' in FIR pulse group mode.
result1_mod(1:0) | Output port 1 output mode | Final-result output port 1 has 4 output modes. 00: output I/Q data; 01: output modulus and phase angle; 10: output logarithm and phase angle; 11: output correlation cascade data.
sel_AGC1(1:0) | Output port 1 gain control | Final-result output port 1 can output I and Q as 20-bit floating-point values in the same format as the input data; it can output a normalized floating-point value that uses the value of control word AGC1(3:0) as the reference; or it can convert both I and Q to 20-bit fixed-point values. sel_AGC1(1:0) = 00 or 11: floating-point output; 01: exponent-normalized output; 10: fixed-point output.
AGC1(3:0) | Output port 1 normalization level | When sel_AGC1 = 01 or 10, a reference exponent value is required. AGC1(3:0) is that reference exponent, used for both exponent-normalized output and fixed-point output.
result2_mod(1:0) | Output port 2 output mode | Same as result1_mod(1:0). 00: output I/Q data; 01: output modulus and phase angle; 10: output logarithm and phase angle; 11: output correlation cascade data.
sel_AGC2(1:0) | Output port 2 gain control | Same as sel_AGC1(1:0). 00 or 11: floating-point output; 01: exponent-normalized output; 10: fixed-point output.
AGC2(3:0) | Output port 2 normalization level | Same as AGC1(3:0). When sel_AGC2 = 01 or 10, the exponent of the floating-point output is referenced to the value of AGC2 for exponent normalization or floating-point to fixed-point conversion.

Table 2
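The FIR pulse group settings in Table 2 follow a simple "value minus one" convention for chan_num and length, plus a small set of work_model codes. A minimal sketch of that mapping follows; the mode labels, the function name, and the dictionary layout are hypothetical conveniences, and only the bit codes and the N = length+1 rule come from the table.

```python
# work_model codes from Table 2; the string keys are illustrative labels only.
FIR_WORK_MODEL = {
    "fir": 0b01000,
    "sliding_fir": 0b00010,
    "dual_data_fir": 0b00100,
    "two_channel_sum_fir": 0b00001,
    "fir_dft": 0b01000,  # same code as "fir"; distinguished by sel_data = 1
}


def fir_fields(mode, channels, taps, conj=False):
    """Return the FIR-mode field values for a given channel and tap count.

    Hypothetical helper: chan_num and length are both 8-bit fields that
    store (count - 1), so counts from 1 to 256 are representable.
    """
    if mode not in FIR_WORK_MODEL:
        raise ValueError(f"unknown mode: {mode}")
    if not 1 <= channels <= 256:
        raise ValueError("channels must be in 1..256")
    if not 1 <= taps <= 256:
        raise ValueError("taps must be in 1..256")
    return {
        "work_model": FIR_WORK_MODEL[mode],
        "chan_num": channels - 1,  # chan_num + 1 = number of channels
        "length": taps - 1,        # filter length N = length + 1
        "conj": int(conj),
        "sel_data": 1 if mode == "fir_dft" else 0,
        "sel_coeff_pp": 0,         # fixed at 0 in FIR pulse group mode
        "coeff_num": 0,            # held at 00000 in FIR pulse group mode
        "coeff_chan": 0,           # held at 00000 in FIR pulse group mode
    }


if __name__ == "__main__":
    fields = fir_fields("fir", channels=50, taps=16)
    print({k: format(v, "b") for k, v in fields.items()})
```

With 50 parallel channels, as in the table's example, chan_num comes out as 49, so that chan_num+1 = 50.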

Control word | Name | Meaning and value
chip_id(3:0) | Cascade position | Position of this chip in the cascade chain when several chips are cascaded. For example, if this chip is the 3rd chip in the chain, chip_id+1 = 3. For single-chip use, chip_id is always 0000.
Chip_num(3:0) | Number of cascaded chips | Number of devices in the cascade chain when several chips are cascaded. For example, if 3 devices are cascaded, chip_num+1 = 3. For single-chip use, chip_num is always 0000.
Chan_num(7:0) | Number of channels | A value between 00000000 and 00001111, chosen according to the application. For example, to process 7 channels in parallel, chan_num+1 = 7.
Work_model(4:0) | Operating mode | 11000 in correlation mode.
coeff_num(4:0) | Number of coefficient groups | Number of coefficient groups needed for multi-channel operation. The value must not exceed the current number of channels chan_num(7:0). For example, with 3 coefficient groups, coeff_num+1 = 3.
Coeff_chan(4:0) | Coefficient channel | Correspondence between coefficient groups and channels, used together with the number of coefficient groups coeff_num and the number of channels chan_num. The three are related by coeff_chan+1 = [(chan_num+1)/(coeff_num+1)]+1; that is, (coeff_chan+1) is the integer part of (chan_num+1) divided by (coeff_num+1), plus 1. For example, if chan_num+1 = 7 and coeff_num+1 = 3, then coeff_chan+1 = 3. The quantity coeff_chan+1 is the number of channels served by each coefficient group. In this example, with 7 channels and 3 coefficient groups, coeff_chan+1 is 3, so channels 1, 2, and 3 use coefficient group 1, channels 4, 5, and 6 use group 2, and channel 7 uses group 3.
Conj | Conjugate selection | Held at 0 in correlation mode.
Length(7:0) | Filter length | Filter length of the correlation operation (that is, the number of points in one group of coefficients). With window length X, X = 40*(length+1).
Sel_coeff_ram(4:0) | Coefficient read base address | In correlation mode the coefficient memory is divided into two regions, so coefficients can be changed in real time while the operation is running. The upper 4 bits of Sel_coeff_ram(4:0) are held at 0000; the lowest bit selects which region's coefficients are used.
Sel_data | Weighting enable | Fixed at '0' in correlation mode.
Sel_coeff_pp | Coefficient ping-pong | Fixed at '0' in correlation mode.
Result1_mod(1:0) | Output 1 mode | 00: result1 outputs I/Q directly; 01: result1 outputs modulus and phase angle; 10: result1 outputs logarithm and phase angle; 11: result1 correlation cascade.
Sel_AGC1(1:0) | Output 1 gain control selection | 00 or 11: the gain of result1 is not processed; 01: result1 is exponent-normalized; 10: result1 is converted from floating point to fixed point.
AGC1(3:0) | Output 1 gain | When sel_AGC1 is 01 or 10, the exponent of floating-point result 1 is normalized according to the value of AGC1.
Result2_mod(1:0) | Output 2 mode | 00: result2 outputs I/Q directly; 01: result2 outputs modulus and phase angle; 10: result2 outputs logarithm and phase angle; 11: result2 correlation cascade.
Sel_AGC2(1:0) | Output 2 gain control selection | 00 or 11: the gain of result2 is not processed; 01: result2 is exponent-normalized; 10: result2 is converted from floating point to fixed point.
AGC2(3:0) | Output 2 gain | When sel_AGC2 is 01 or 10, the exponent of floating-point result 2 is normalized according to the value of AGC2.

Table 3
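The relationship between channels and coefficient groups in correlation mode can be illustrated with a short sketch. Two points are assumptions rather than statements from the text: the function names are hypothetical, and the channels-per-group value is computed by ceiling division. Table 3 words the rule as "integer part plus 1", which equals the ceiling whenever the division is not exact (as in the 7-channel, 3-group example); the single-channel, single-group example in Table 6 sets coeff_chan to 00000, which matches the ceiling reading, so that reading is used here.

```python
def coeff_chan_field(channels, groups):
    """Return the coeff_chan field for `channels` channels and `groups` groups.

    Assumption: coeff_chan + 1 = ceil(channels / groups), i.e. the number of
    channels served by each coefficient group.
    """
    if not 1 <= groups <= channels:
        raise ValueError("groups must be between 1 and the channel count")
    per_group = -(-channels // groups)  # ceiling division on integers
    return per_group - 1


def group_of_channel(channel_index, coeff_chan):
    """Map a 0-based channel index to its 0-based coefficient group."""
    return channel_index // (coeff_chan + 1)


if __name__ == "__main__":
    field = coeff_chan_field(channels=7, groups=3)
    mapping = [group_of_channel(c, field) for c in range(7)]
    print(field, mapping)
```

For 7 channels and 3 groups this assigns the first three channels to the first group, the next three to the second, and the last channel to the third, as described in Table 3.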

The 64-bit control word is input through the coefficient input port coeff_in, in either 32-bit or 16-bit mode. Pins control_en1 and control_en2 act as enables, and the word is sent while control_en1 and control_en2 are each low in turn. In 32-bit mode the function control word is sent in two groups, each occupying all 32 bits of coeff_in(31:0); control_en1 and control_en2 are each held low for 1 coeff_en cycle, with the first 32 bits of the control word sent in the coeff_en cycle in which control_en1 is low and the last 32 bits sent in the coeff_en cycle in which control_en2 is low. In 16-bit mode the function control word occupies the upper 16 bits of coeff_in(31:0); control_en1 and control_en2 are each held low for 2 coeff_en cycles, with the first 32 bits of the control word sent during the 2 coeff_en cycles in which control_en1 is low and the last 32 bits sent during the 2 coeff_en cycles in which control_en2 is low. The timing of 32-bit and 16-bit control word input is shown in Figure 11 and Figure 12, respectively.
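The two loading modes can be pictured as splitting one 64-bit value into bus beats. The sketch below is illustrative only and rests on assumptions that the text does not confirm: that the "first 32 bits" are the most significant half, that within each half the upper 16 bits are sent first in 16-bit mode, and that the unused lower half of coeff_in is driven to zero. The function name is hypothetical.

```python
def control_word_beats(word64, bus_width=32):
    """Split a 64-bit control word into (enable_pin, coeff_in_value) beats.

    Assumptions (not confirmed by the text): most significant bits are sent
    first, and in 16-bit mode the payload sits in coeff_in(31:16) with
    coeff_in(15:0) driven to zero.
    """
    if not 0 <= word64 < (1 << 64):
        raise ValueError("control word must fit in 64 bits")
    if bus_width == 32:
        # One coeff_en cycle per enable pin.
        return [
            ("control_en1", (word64 >> 32) & 0xFFFFFFFF),
            ("control_en2", word64 & 0xFFFFFFFF),
        ]
    if bus_width == 16:
        # Two coeff_en cycles per enable pin.
        beats = []
        for i, shift in enumerate((48, 32, 16, 0)):
            enable = "control_en1" if i < 2 else "control_en2"
            beats.append((enable, ((word64 >> shift) & 0xFFFF) << 16))
        return beats
    raise ValueError("bus_width must be 16 or 32")


if __name__ == "__main__":
    for pin, value in control_word_beats(0x0123456789ABCDEF, bus_width=16):
        print(pin, format(value, "08X"))
```

Either mode delivers the same 64 bits; the 16-bit mode simply takes four coeff_en cycles instead of two.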

On a general-purpose signal processing board that uses the present invention as its main processor, a 1024-point FFT, a 10th-order FIR pulse group operation, and a correlation operation on a 360-point linear frequency-modulated signal were carried out. The corresponding control word settings are listed in Table 4, Table 5, and Table 6, respectively.

Control word | Name | Meaning and value
work_model(4:0) | Operating mode | FFT: value 10000.
length(7:0) | Filter length | The lower four bits, length(3:0), give the data length of the FFT/IFFT operation (that is, the number of points in one group of weighting coefficients): the operation length is 2 raised to the unsigned integer value of length(3:0). Here the value is "1010", corresponding to 1024 points. The unsigned integer value of length(6:4) plus 1 is the number of stages needed to split a large-point FFT/IFFT into small-point radix-8 operations, and it depends on the operation length. Here the 1024-point operation needs four stages, code "011". The top bit, length(7), is set to '0'.
result1_mod(1:0) | Output port 1 output mode | Final-result output port 1 has 4 output modes. Here the control word is 01: output modulus and phase angle.
sel_AGC1(1:0) | Output port 1 gain control | Final-result output port 1 can output I and Q as 20-bit floating-point values in the same format as the input data; it can output a normalized floating-point value that uses the value of control word AGC1(3:0) as the reference; or it can convert both I and Q to 20-bit fixed-point values. Here the value is 10: fixed-point output.
AGC1(3:0) | Output port 1 normalization level | When sel_AGC1 = 01 or 10, a reference exponent value is required. AGC1(3:0) is that reference exponent, used for both exponent-normalized output and fixed-point output. Here the reference is 1011.

Table 4
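The output-format fields can be read back with a pair of lookup tables. The sketch below is a hedged illustration: the code-to-meaning mapping is taken from the control word tables, while the function name, the dictionary keys, and the descriptive strings are hypothetical.

```python
# Meaning of result1_mod / result2_mod codes, per the control word tables.
RESULT_MOD = {
    0b00: "I/Q",
    0b01: "modulus and phase angle",
    0b10: "logarithm and phase angle",
    0b11: "correlation cascade data",
}

# Meaning of sel_AGC1 / sel_AGC2 codes; 00 and 11 both mean floating point.
SEL_AGC = {
    0b00: "floating point",
    0b01: "exponent-normalized",
    0b10: "fixed point",
    0b11: "floating point",
}


def describe_output(result_mod, sel_agc, agc):
    """Summarize one output port's settings (hypothetical helper).

    The AGC reference exponent only matters when sel_agc is 01 or 10, so it
    is reported as None otherwise.
    """
    if result_mod not in RESULT_MOD or sel_agc not in SEL_AGC:
        raise ValueError("result_mod and sel_agc are 2-bit fields")
    if not 0 <= agc <= 0b1111:
        raise ValueError("agc is a 4-bit field")
    uses_reference = sel_agc in (0b01, 0b10)
    return {
        "format": RESULT_MOD[result_mod],
        "number_format": SEL_AGC[sel_agc],
        "reference_exponent": agc if uses_reference else None,
    }


if __name__ == "__main__":
    # Settings from the 1024-point FFT example in Table 4.
    print(describe_output(result_mod=0b01, sel_agc=0b10, agc=0b1011))
```

Applied to the Table 4 settings, this reports modulus and phase angle output in fixed point, referenced to exponent 1011 (decimal 11).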

Control word | Name | Meaning and value
chan_num(7:0) | Number of channels | Number of channels in the FIR pulse group operation. The rule is: number of channels = chan_num+1. This is a 10th-order FIR pulse group operation, so chan_num is 00001001.
work_model(4:0) | Operating mode | The value is 01000 in FIR pulse group mode. This is FIR pulse group mode, so it is set to 01000.
conj | Conjugate selection | Whether the coefficients are conjugated in FIR pulse group mode. If the filter bank uses coefficients that are conjugate between channels, the effective computation rate doubles. Here it is set to '1', selecting conjugation.
length(7:0) | Filter length | Length of the filtering operation in FIR pulse group mode: filter length N = length+1. This is a 10th-order FIR pulse group operation, so length is set to 00001001.
result1_mod(1:0) | Output port 1 output mode | Final-result output port 1 has 4 output modes. Here the value is 00: output I/Q data.
sel_AGC1(1:0) | Output port 1 gain control | Final-result output port 1 can output I and Q as 20-bit floating-point values in the same format as the input data; it can output a normalized floating-point value that uses the value of control word AGC1(3:0) as the reference; or it can convert both I and Q to 20-bit fixed-point values. Here sel_AGC1(1:0) is 10: fixed-point output.
AGC1(3:0) | Output port 1 normalization level | When sel_AGC1 = 01 or 10, a reference exponent value is required. AGC1(3:0) is that reference exponent, used for both exponent-normalized output and fixed-point output. Here the reference AGC1 is 0111.

Table 5

Control word | Name | Meaning and value
chip_id(3:0) | Cascade position | Position of this chip in the cascade chain when several chips are cascaded. A single chip is used here, so chip_id is set to 0000.
Chip_num(3:0) | Number of cascaded chips | Number of devices in the cascade chain when several chips are cascaded. A single chip is used here, so chip_num is set to 0000.
Chan_num(7:0) | Number of channels | Number of channels. The rule is: number of channels = chan_num+1. There is one channel here, so chan_num is set to 00000000.
Work_model(4:0) | Operating mode | The value is 11000 in correlation mode. This is a correlation operation, so work_model is set to 11000.
coeff_num(4:0) | Number of coefficient groups | Number of coefficient groups needed for multi-channel operation. The value must not exceed the current number of channels. There is only one coefficient group here, so coeff_num is set to 00000.
Coeff_chan(4:0) | Coefficient channel | Correspondence between coefficient groups and channels. Here one coefficient group serves one channel, so coeff_chan is set to 00000.
Length(7:0) | Filter length | Number of points in the correlation operation. With window length X, X = 40*(length+1). This is a 360-point correlation operation, so length is set to 00001001.
Result1_mod(1:0) | Output 1 mode | Output mode of the result1 output port. Here result1_mod is set to 10, so result1 outputs logarithm and phase angle.

Table 6

Claims

1. A reconfigurable digital signal processor, characterized in that: the hardware structure and hardware interconnections inside the processor can be restructured by configuring a control word, thereby realizing filtering operations in the form of fast Fourier transform/inverse fast Fourier transform, FIR pulse group processing, and correlation processing;

The hardware structure comprises an input unit, an output unit, a data exchange unit, and 4 basic units; the basic units contain 160 real floating-point multiply-accumulators in total, evenly distributed among the 4 basic units;

The input unit receives the control word, performs first-level decoding on it, and distributes control words to the output unit, the data exchange unit, and the 4 basic units: the input unit comprises a control word receiving module, a first-level decoding module, a control word distribution module, a coefficient synchronization module, a data synchronization module, and an on-chip clock signal generation module; the control word and coefficient inlet 1 share the same port; the control word receiving module receives the control word and sends it to the first-level decoding module for decoding, which on the one hand produces global control signals used to generate the on-chip timing and to provide synchronization for coefficients and data, and on the other hand transmits, through the control word distribution module, to the output unit, the data exchange unit, and the 4 basic units respectively; the coefficient synchronization module synchronizes coefficient inlet 1 and coefficient inlet 2, and the data synchronization module synchronizes data inlet 1 and data inlet 2;

The data exchange unit is a group of multi-input, multi-output switch combinations that exchanges data among the 4 basic units;

The output unit sorts the operation results of each basic unit and outputs them in different formats: the output unit comprises a basic-unit output result sorting module, a logarithm module, a floating-point/fixed-point conversion module, a modulus module, and an exponent normalization module; the basic-unit output result sorting module arranges the outputs of the basic units in channel order when working in FFT/IFFT and FIR pulse group processing modes and, when working in correlation processing mode, adjusts the data output frame by frame from each basic unit into a continuous data stream at the same rate as the input data; the modulus module converts the real part/imaginary part output format to the modulus/phase angle output format; the logarithm module converts the input modulus to a logarithmic representation; the floating-point/fixed-point conversion module converts the real part/imaginary part, or the output of the modulus module, from floating-point format to fixed-point format; the exponent normalization module unifies the exponents of operation results in floating-point format to a fixed value by shifting the mantissa correspondingly; there are two sets of each of the above 4 format conversion modules, one set per output port, ensuring that the two output ports can each independently output in any of the formats;

The basic unit comprises a data memory, complex multiply-accumulators, complex multiply-accumulator sub-arrays, a coefficient memory, and a timing control circuit; the data memory comprises 8 dual-port RAMs of 512 × 40, used for input buffering of operation data, temporary storage of intermediate results, and output buffering of operation results; the data buffer can feed data to each complex multiply-accumulator and complex multiply-accumulator sub-array simultaneously; the coefficient memory comprises 10 dual-port RAMs of 256 × 32, used to store the coefficients for correlation processing and FIR filtering and the weighting coefficients for FFT operations; the coefficients for the FFT iterative operations are a fixed table implemented in dedicated logic; each complex multiply-accumulator corresponds to 4 real floating-point multiply-accumulators; each complex multiply-accumulator sub-array comprises 16 real floating-point multiply-accumulators, equivalent to 4 complex multiply-accumulators; the two complex multiply-accumulator sub-arrays and the complex multiply-accumulators are combined in different ways according to the mode in order to perform different operations; the real floating-point multiply-accumulator is structured in 5 parts: a fixed-point multiplication part, a truncation part, an exponent adjustment part, a fixed-point addition part, and an exponent judgment part;

The organization of the hardware can be reconfigured by configuring the control word: through the configuration of the control word and control signals, the organization of the 160 real floating-point multiply-accumulators and of the data exchange unit can be changed so as to select different operating modes, adapting to three different computing tasks: FFT/IFFT, FIR pulse group processing, and correlation;

The organization adopted for each of the three filtering operations is, respectively:

For FFT/IFFT operations, the two complex multiply-accumulator sub-arrays each serve as the node of one iteration of the radix-8 algorithm, and the 4 complex multiply-accumulators in each sub-array complete one stage of radix-8 iteration: the complex multiply-accumulator in front of each sub-array serves as a windowing operator that windows the data; the windowed data enter that sub-array for one stage of radix-8 iteration; the output of each iteration stage is first written to the data buffer and then sent through the data exchange unit to the complex multiply-accumulator sub-array in the corresponding other basic unit for the next iteration stage;

For FIR pulse group processing, the multiply-accumulators in the 2 complex multiply-accumulator modules and the 2 complex multiply-accumulator sub-array modules inside a basic unit are used in parallel; each complex multiply-accumulator corresponds to one channel of the FIR pulse group, or to 2 channels in conjugate operation;

For correlation, the complex multiply-accumulators and the complex multiply-accumulator sub-arrays are used in series, equivalent to 10 complex multiply-accumulators connected in series.

2. The reconfigurable digital signal processor as claimed in claim 1, characterized in that the hardware scheduling scheme adopted is as follows:

The hardware scheduling scheme uses a two-level scheduling method that combines centralized and distributed control: the control word and control signals first enter the global control module for first-level decoding and control word distribution; this module first receives the control word and then generates global control signals from it, which include coordinating the actions of the 4 basic units, determining the output format of the operation results, and setting the on-chip master clock; the second function of the global control module is to transmit control information to the local control module of each basic unit, including the operating mode, the sub-type of the operating mode, the number of operation points, the number of channels, and the number of processing channels;

After receiving the information transmitted by the global control module, the local control module in each basic unit performs second-level decoding on it, converting it into the detailed hardware scheduling control signals.