CN102520906A - Vector dot-product accumulate network supporting reconfigurable fixed/floating point and configurable vector length
- Publication number: CN102520906A
- Authority: CN (China)
- Priority/filing date: 2011-12-13
- Legal status: Pending (application later deemed withdrawn after publication)

Description
Technical Field
The present invention relates to the technical field of high-performance digital signal processors, and in particular to a vector dot-product accumulate network that supports fixed/floating-point reconfiguration and a configurable vector length.
Background
In modern digital signal processing, the digital signal processor (DSP) is the core of the whole system, and DSP performance directly determines system performance. Inside a DSP, every computation, no matter how complex, is ultimately carried out by the arithmetic unit; the arithmetic unit is therefore the DSP's core component, and its computing power is the main index by which DSP performance is measured. In particular, as technology develops, compute-intensive fields such as modern radar signal processing, spaceborne satellite image processing, image compression, and high-definition video place ever higher demands on signal-processing capability. This poses growing challenges for the arithmetic unit, above all the pressure created by domain-specific variable-size, high-density parallel computation.
Modern digital signal processing makes heavy use of the "dot product" operation, for example in FFT, FIR filtering, and signal correlation. All of these operations multiply an input signal by a coefficient or local parameter and then integrate (accumulate), which amounts to taking the dot product of two vectors (or sequences). Mainstream DSPs currently have no dedicated dot-product instruction; the operation is generally completed by several multiply, accumulate, or multiply-accumulate instructions. This approach has several common drawbacks:
1) Low hardware utilization. A vector dot product uses only scalar multiply, accumulate, or scalar multiply-accumulate resources; vector computation resources go unused.
2) Weak processing capability. Typically only 16/32-bit scalar dot products can be executed, so a vector dot product must be assembled from many scalar dot products, giving low data throughput and poor efficiency.
3) Long execution time. One dot product requires multiple multiply, accumulate, or multiply-accumulate instructions executed serially, with data dependences between them, so the operation takes many clock cycles.
4) Limited flexibility. Only a specific data format or a specific dot-product length is supported, and the number of elements participating in the dot product is not configurable.
5) Difficult programming. The dot product is built from multiply, accumulate, or multiply-accumulate operations, none of which is a single-cycle instruction, so the programmer must reason about data dependences among the instructions.
The dot product belongs to the category of multi-operand computation, whose key design issues are sign extension and data precision. Several patents have discussed how to implement multi-operand arithmetic. The patent "A Reconfigurable Horizontal Summation Network Supporting Fixed and Floating Point" (application number 201010162375.X) proposes a multi-operand addition, but it does not analyze the precision of the floating-point data format, and the overall data length is not configurable. The patent "Instruction and Logic for Performing a Dot-Product Operation" (application number 201010535666.9) proposes a scheme for executing dot-product instructions with a configurable data format, but it remains a scalar dot product, and the number of elements participating in the operation is not configurable. The patent "Multifunctional Unit for a SIMD Vector Microprocessor" (application number 201010559300.5) describes a vector floating-point multiply-add unit, but it does not reconfigure fixed-point data, and again the number of elements participating in the multiply-add is not configurable.
Analyzing the dependences of the dot product at the arithmetic level shows that it is composed of multiplication and accumulation. The traditional approach of executing vector dot products on scalar units can therefore be replaced: reuse the existing vector multiplication resources, add accumulation-network resources, and execute the vector dot product on vector units, which fundamentally raises dot-product performance. At the same time, by analyzing the relationship between floating-point and fixed-point data formats, floating-point data can be preprocessed so that it reuses the fixed-point datapath; reconfigurable compression supports different data formats and granularities; and a Mask register configures the number of elements participating in the dot product. Providing such a vector dot-product accumulate network supporting different granularities, data formats, and vector lengths, and thereby powerful vector dot-product capability for digital signal processing, is the problem the present invention sets out to solve.
Summary of the Invention
(1) Technical Problem to Be Solved
In view of this, the main purpose of the present invention is to provide a vector dot-product accumulate network with fixed/floating-point reconfigurability and a configurable vector length. It supports dot products on 8/16/32-bit fixed-point data and on 32-bit reduced IEEE-754 single-precision floating-point data, allows the length of the vectors participating in the dot product to be configured flexibly, and makes rational use of vector computation resources, so as to raise the efficiency and throughput of vector dot products, simplify their software programming, and meet the demand for powerful vector dot-product computation in digital signal processing.
(2) Technical Solution
To achieve the above purpose, the present invention provides a vector dot-product accumulate network with fixed/floating-point reconfigurability and configurable vector length, comprising: a parallel reconfigurable multiplier 1, which receives vector data B and C and the data options FBS and U as inputs, performs the vector multiplication, and outputs the product B×C to the floating-point exponent and mantissa preprocessing part 2; a floating-point exponent and mantissa preprocessing part 2, which receives the product B×C from the parallel reconfigurable multiplier 1 and the scalar data A as inputs, performs maximum-exponent selection, exponent differencing, shift alignment, two's-complement conversion, and sticky-bit compensation, and outputs the processed vector result B×C and scalar result A to the reconfigurable compressor part 3; a reconfigurable compressor part 3, which receives and compresses the output of part 2 into a "sum string" (S) and a "carry string" (C) and passes them to the floating-point exponent and mantissa post-processing / fixed-point operation part 4; and a floating-point exponent and mantissa post-processing / fixed-point operation part 4, which adds the sum string S and carry string C received from part 3 and post-processes the mantissa sum to obtain the final vector dot-product accumulate result.
(3) Beneficial Effects
The vector dot-product accumulate network provided by the present invention preprocesses floating-point data so that it reuses the fixed-point datapath, supports different data formats and granularities through reconfigurable compression, and uses a Mask register to configure flexibly the length of the vectors participating in the dot product. It supports vector dot products on 8/16/32-bit fixed-point data and on reduced IEEE-754 single-precision floating-point data. The design delivers high performance with low overhead, rich functionality, compact encoding, and high speed; it shortens the critical-path delay of the floating-point vector dot-product accumulate, reduces the resources consumed by its fixed-point counterpart, simplifies software programming, and improves code density.
Description of Drawings
Figure 1 is a schematic structural diagram of the vector dot-product accumulate network supporting fixed/floating-point reconfiguration and configurable vector length, according to an embodiment of the present invention;
Figure 2 is a schematic diagram of 8×8 multipliers composed into a 16×16 multiplier, according to an embodiment of the present invention;
Figure 3 is an array diagram of 8×8 multipliers composed into a 16×16 multiplier, according to an embodiment of the present invention;
Figure 4 is a schematic diagram, within the floating-point exponent and mantissa preprocessing part, of the network that compares three exponents at a time to obtain the larger floating-point exponent, according to an embodiment of the present invention;
Figure 5 is a schematic diagram, within the floating-point exponent and mantissa preprocessing part, of the cascaded exponent-comparator network that obtains the maximum floating-point exponent, according to an embodiment of the present invention;
Figure 6 is a schematic diagram of the floating-point mantissa processing within the floating-point exponent and mantissa preprocessing part, according to an embodiment of the present invention;
Figure 7 is a schematic diagram of the 8/16/32-bit fixed-point and 32-bit floating-point reconfigurable compressor network within the reconfigurable compressor part, according to an embodiment of the present invention;
Figure 8 is a schematic diagram of the 32-bit sub-compressor within the reconfigurable compressor part, according to an embodiment of the present invention;
Figure 9 is a schematic diagram of the parallel exponent correction unit within the floating-point exponent and mantissa post-processing / fixed-point operand part, according to an embodiment of the present invention.
Detailed Description
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The main features of the invention are a reconfigurable data format and a configurable vector length. The following notation is used throughout: the dot-product instruction is written D = A + B DOT C {(U)}{(M)}{(FBS)}, where A and D are 32-bit scalar data, B and C are 512-bit vector data, and DOT is the dot-product operator. Mask is a 64-bit register whose bits each control one 8-bit byte of the B×C result. The M option indicates that the vector dot-product accumulate is governed by the Mask register; when the M option is absent, the Mask register has no effect on the operation. U is the unsigned option. FBS selects the data format as a 2-bit binary code: "00" denotes 32-bit fixed point, "01" 32-bit reduced single-precision floating point, "10" 8-bit bytes, and "11" 16-bit halfwords.
The embodiments assume that B and C are 512-bit vectors, but the invention applies wherever the bit width of B and C is a multiple of 32; the Mask register width relates to the vector length as LengthMask = LengthB / 8. A reference model of the instruction semantics is sketched below.
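The sketch is a minimal Python reference model of the fixed-point form of the instruction. It is illustrative rather than normative: the function name is ours, the exact effect of byte gating on masked lanes is one simplified reading of the Mask semantics, and saturation and the reduced floating-point format are omitted.

```python
# Hypothetical reference model of D = A + B DOT C for fixed-point lanes.
# Assumption: each Mask bit zeroes one byte of the corresponding B*C product
# before accumulation (the patent gates bytes of the B*C result by Mask bits).
def vdot_accumulate(a, b, c, mask=None, unsigned=False, lane_bits=32):
    lane_mask = (1 << lane_bits) - 1
    bytes_per_lane = lane_bits // 8
    acc = a
    for i, (bi, ci) in enumerate(zip(b, c)):
        prod = (bi * ci) & lane_mask              # product truncated to the lane width
        if mask is not None:                      # M option: byte-level gating
            keep = 0
            for k in range(bytes_per_lane):
                if (mask >> (i * bytes_per_lane + k)) & 1:
                    keep |= 0xFF << (8 * k)
            prod &= keep
        if not unsigned and (prod >> (lane_bits - 1)):
            prod -= 1 << lane_bits                # reinterpret lane as signed
        acc += prod
    return acc

# Sixteen 32-bit lanes of a 512-bit vector; with all 64 Mask bits set this is
# a plain dot product: D = 7 + 16 * (3 * 5).
b, c = [3] * 16, [5] * 16
assert vdot_accumulate(7, b, c, mask=(1 << 64) - 1) == 7 + 16 * 15
```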
As shown in Figure 1, the vector dot-product accumulate network according to an embodiment of the present invention comprises, connected in sequence, a parallel reconfigurable multiplier part 1; a floating-point exponent and mantissa preprocessing part 2; a reconfigurable compressor part 3; and a floating-point exponent and mantissa post-processing / fixed-point operation part 4. Specifically:

The parallel reconfigurable multiplier 1 receives vector data B and C and the data options FBS and U as inputs, performs the vector multiplication according to the selected data format, and outputs the result B×C to the floating-point exponent and mantissa preprocessing part 2.

The floating-point exponent and mantissa preprocessing part 2 receives the product B×C from the multiplier 1 and the scalar data A as inputs, performs maximum-exponent selection, exponent differencing, shift alignment, two's-complement conversion, and sticky-bit compensation, and outputs the processed vector result B×C and scalar result A to the reconfigurable compressor part 3.

The reconfigurable compressor part 3 receives the output of part 2 and compresses it into a "sum string" (S) and a "carry string" (C), which it outputs to the floating-point exponent and mantissa post-processing / fixed-point operation part 4.

The floating-point exponent and mantissa post-processing / fixed-point operation part 4 receives the sum string S and carry string C from part 3 and adds the mantissas. In fixed-point formats, the sum is processed directly into the fixed-point vector dot-product accumulate result; in floating-point format, leading-one detection, normalization shift, normalization rounding, exponent adjustment, and sign adjustment are performed to obtain the floating-point vector dot-product accumulate result.
The network is described in detail below with reference to Figures 2 to 9. Its implementation combines parallel, cascaded, reconfigurable, and configurable design.
The parallel reconfigurable multiplier 1 uses 16 identical 32-bit reconfigurable multipliers 11, supporting 8/16/32-bit fixed-point multiplication and 32-bit reduced IEEE-754 single-precision floating-point multiplication; the 16 multipliers work in parallel for a throughput of 16×32 bits. Each 32-bit reconfigurable multiplier is composed from basic 8×8 multipliers and can perform 8/16/32-bit signed/unsigned fixed-point multiplication as well as 32-bit reduced single-precision floating-point multiplication, producing a 512-bit (16×32) product. Wider multipliers are composed from narrower ones: with the 8×8 multiplier as the basic unit, four 8×8 multipliers are composed into one 16×16 multiplier, and four 16×16 multipliers into one 32×32 multiplier.
As shown in Figure 2, an 8×8 multiplier group is composed into a 16×16 multiplier; the 16-bit multiplication is described mathematically by Formula 1:
A×B = (AH×2^8 + AL) × (BH×2^8 + BL)
    = AH×BH×2^16 + (AH×BL + AL×BH)×2^8 + AL×BL    (Formula 1)
where AH/BH and AL/BL are the high and low 8 bits of the 16-bit data A/B, respectively. First, the four 8×8 multipliers work in parallel, each producing two compressed results through its Wallace compression tree; the eight compressed results of the four multipliers are weight-aligned and fed together into one 8-2 compressor to obtain the final sum string S and carry string C. S and C pass through a 24-bit adder to produce the high 24 bits of the 16×16 product, while the low 8 bits come directly from the low 8 bits of the AL×BL multiplier, whose carry-out serves as the carry-in of the 24-bit adder. Finally, a selector chooses the output: in 8×8 mode it outputs the two 16-bit results of the AL×BL and AH×BH multipliers; in 16×16 mode it outputs the 32-bit result formed by the 24-bit adder output and the low 8 bits of AL×BL.
Figure 3 shows the array diagram of 8×8 multipliers composed into a 16×16 multiplier; in the same way, four 16×16 multipliers are composed into one 32×32 multiplier.
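As an illustration of Formula 1 itself (not of the hardware structure), the sketch below composes a 16×16 unsigned multiply from four 8×8 partial products in Python; the function name is ours.

```python
# Formula 1 in executable form: A*B = AH*BH*2^16 + (AH*BL + AL*BH)*2^8 + AL*BL.
# This models only the arithmetic identity, not the Wallace trees, the 8-2
# compressor, or the 24-bit adder of the actual design.
def mul16_from_mul8(a, b):
    ah, al = a >> 8, a & 0xFF
    bh, bl = b >> 8, b & 0xFF
    hh, hl, lh, ll = ah * bh, ah * bl, al * bh, al * bl   # four 8x8 partial products
    return (hh << 16) + ((hl + lh) << 8) + ll

import random
for _ in range(1000):
    a, b = random.randrange(1 << 16), random.randrange(1 << 16)
    assert mul16_from_mul8(a, b) == a * b
```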
The above describes the idea of composing narrower multipliers into wider ones; in the concrete implementation, however, the multipliers of this invention all use saturate-and-truncate handling by default. That is, the 8×8 multiplier keeps only the low 8 bits of its result; when the product overflows into the high bits, the low 8 bits saturate to the maximum/minimum value. Likewise, the 16×16 multiplication keeps the low 16 bits of the result, and the 32×32 multiplication the low 32 bits.
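A hedged sketch of this saturate-and-truncate behavior follows; the signed bounds assume ordinary two's-complement limits, which the patent does not spell out.

```python
# Result handling assumed here: keep the low `bits` bits of an N x N product,
# saturating to the maximum/minimum representable value on overflow.
def mul_saturate_truncate(a, b, bits=8, signed=True):
    prod = a * b
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    return max(lo, min(hi, prod))

assert mul_saturate_truncate(100, 100) == 127    # overflow saturates to max
assert mul_saturate_truncate(-100, 100) == -128  # underflow saturates to min
assert mul_saturate_truncate(5, 6) == 30         # in-range product is exact
```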
As shown in Figure 1, the floating-point exponent and mantissa preprocessing part 2 comprises a cascaded exponent comparator 21, an exponent-difference array 22, a shift-alignment unit 23, a two's-complement conversion unit 24, and a sticky-bit compensation unit 25. The cascaded exponent comparator 21 obtains the maximum Emax of the 17 floating-point exponents. The exponent-difference array 22 computes the difference Emax − Ei between each floating-point exponent Ei and the maximum Emax; this difference is the shift distance used by the shift-alignment unit 23. The shift-alignment unit 23 takes the output Emax − Ei of the array 22 as its control signal and right-shifts the floating-point mantissas into alignment. The two's-complement conversion unit 24 negates particular shifted mantissas, namely those whose sign bit differs from the sign bit of the operand holding the maximum exponent. The sticky-bit compensation unit 25 applies one bit of compensation for shifted-out mantissa bits and for mantissas requiring complementing, producing the 17 compensation bits.
The preprocessing part 2 takes the 16 floating-point products from the parallel reconfigurable multiplier part 1 and separates out their exponents E0-E15, mantissas M0-M15, and sign bits S0-S15. The 16 product exponents E0-E15, together with the exponent E16 of the floating-point value in the scalar register, enter the cascaded exponent comparator 21 to obtain the maximum exponent Emax; the exponent-difference array 22, consisting of 17 parallel 8-bit subtractors, then computes each difference ΔEi = Emax − Ei. The shift-alignment unit 23 uses 17 parallel 32-bit shifters working simultaneously, each controlled by the corresponding ΔEi from the array 22; the shifter outputs feed the two's-complement conversion unit 24, which negates a shifted mantissa when its sign bit Si differs from Smax, the sign of the maximum-exponent operand. At the same time, the ΔEi shifted-out bits are sticky-compensated: when the mantissa requires complementing (Si ≠ Smax) or the shifted-out bits contain a 1, compensation is applied, i.e. Sticky_i = 1. The sticky-bit compensation unit 25 counts the number Comp of compensations required among Sticky0-Sticky16. The whole preprocessing part outputs a 512-bit vector mantissa, a 32-bit scalar mantissa, and the 5-bit sticky compensation count Comp.
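The following Python sketch models this preprocessing flow under the assumptions just stated (align to Emax, negate mantissas whose sign differs from Smax, sticky-compensate on negation or lost bits); it is a behavioral illustration with invented names, not the circuit.

```python
# Behavioral model of part 2 for one batch of 17 operands (16 products plus
# the scalar). Inputs are sign bits, biased exponents, and integer mantissas.
def preprocess(signs, exps, mants, width=32):
    e_max = max(exps)
    s_max = signs[exps.index(e_max)]          # sign of a maximum-exponent operand
    aligned, comp = [], 0
    for s, e, m in zip(signs, exps, mants):
        delta = e_max - e                     # shift distance from the subtractor array
        shifted_out = m & ((1 << delta) - 1)  # bits lost to the alignment shift
        m >>= delta
        if s != s_max:
            m = (-m) & ((1 << width) - 1)     # two's-complement conversion
        if s != s_max or shifted_out:
            comp += 1                         # Sticky_i = 1, tallied into Comp
        aligned.append(m)
    return e_max, s_max, aligned, comp
```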
When operating in fixed-point mode, the preprocessing part 2 passes the fixed-point result straight through on a separate path; fixed-point data undergoes no processing in this part.
As shown in Figure 4, to reduce the critical-path delay, the cascaded exponent comparison compares three values at a time: three 8-bit comparators work in parallel, each producing its flag bit, and the flags then select the maximum of the three values as the output.
As shown in Figure 5, E0-E16 enter the first-level comparator array, producing six "larger values"; these six values then pass through the second- and third-level comparator arrays in turn, finally yielding the maximum floating-point exponent Emax. The cascade is thus reduced from the original five levels (⌈log2 17⌉) to three levels (⌈log3 17⌉), removing the delay of two 8-bit comparator stages. Some control logic is added, but its delay is far smaller than an 8-bit comparator's, so the overall delay of the cascaded exponent comparator decreases.
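A small sketch of the three-way reduction, showing why 17 exponents need only three levels:

```python
# Reducing 17 values with 3-input maximum stages: ceil(log3(17)) = 3 levels,
# versus ceil(log2(17)) = 5 levels for a pairwise comparator tree.
def max3_tree(values):
    levels = 0
    while len(values) > 1:
        values = [max(values[i:i + 3]) for i in range(0, len(values), 3)]
        levels += 1
    return values[0], levels

e_max, levels = max3_tree(list(range(17)))
assert e_max == 16 and levels == 3            # 17 -> 6 -> 2 -> 1
```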
As shown in Figure 6, to keep the precision of the floating-point dot product compliant with the IEEE-754 standard, the floating-point mantissa is extended by 7 bits, i.e. the 24-bit mantissa is shifted left by 7 bits; at the same time, so that the floating-point mantissa can reuse the fixed-point mantissa compression path, a 0 is placed at the most significant bit, extending the 24-bit mantissa to 32 bits. During shift alignment the 32-bit mantissa is shifted right by ΔEi bits and the shifted-out ΔEi bits are preserved; when Si ≠ Smax the 32-bit mantissa must be complemented. The sticky-bit compensation unit takes Si ≠ Smax and the shifted-out ΔEi bits as control signals: when Si ≠ Smax, or when the OR-reduction of the shifted-out ΔEi bits is 1, complement compensation is applied, i.e. Sticky_i = 1; otherwise Sticky_i = 0.
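A sketch of this 24-to-32-bit mantissa packing follows, under the assumption that the reduced single-precision format keeps the usual implicit leading bit (the patent does not define the reduced format's handling of exponent zero):

```python
# Pack a 23-bit fraction into the 32-bit mantissa used by the compressor path:
# prepend the (assumed) implicit bit, shift left by 7 guard bits, keep MSB 0.
def extend_mantissa(frac23, exp):
    implicit = 0 if exp == 0 else 1           # assumption: zero for exponent 0
    m24 = (implicit << 23) | frac23
    return m24 << 7                           # MSB of the 32-bit result stays 0

assert extend_mantissa(0, 1) == 1 << 30       # mantissa 1.0 -> bit 30 set
assert extend_mantissa(0x7FFFFF, 0xFE) >> 31 == 0
```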
As shown in Figure 1, the reconfigurable compressor part 3 performs 8/16/32-bit fixed-point and 32-bit floating-point mantissa compression and comprises a Mask screening unit 31, a sign-extension unit 32, and a reconfigurable compressor network 33. The Mask screening unit 31 receives the output of the preprocessing part and examines the Mask register, which controls whether the vector register's contents participate in the dot product. When the M option is active, only the values whose corresponding Mask bits are 1 enter the compressor network; when M is inactive, the Mask register has no effect on the dot product. The Mask register is 64 bits wide, each bit indicating one byte of the vector register. The scalar register value and the sticky compensation Comp are unaffected by the Mask register. After Mask screening, the data enters the sign-extension unit 32, where each 8-bit byte of the 512-bit vector register is treated as an independent unit and extended by 8 sign bits: for an unsigned fixed-point dot product (U option active), 8 zeros are appended at the high end; for a signed fixed-point dot product (U inactive) or a floating-point dot product, 8 copies of the sign bit are appended. After sign extension, the vector mantissas, the scalar mantissa, and the Comp compensation unit enter the reconfigurable compressor network 33, which supports 8/16/32-bit signed/unsigned compression and reduces the 16/32/64 values of 32/16/8 bits, the 32-bit scalar, and the Comp compensation unit into the sum string S and carry string C.
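An illustrative model of the mask-and-sign-extend step (names are ours; each byte is widened from 8 to 16 bits here purely to show the 8-bit extension):

```python
# Gate each byte by its Mask bit, then extend it by 8 high bits: zeros in
# unsigned mode, copies of the sign bit in signed/floating-point mode.
def mask_and_extend(bytes_in, mask, unsigned, m_option=True):
    out = []
    for i, b in enumerate(bytes_in):          # each b is 0..255
        if m_option and not ((mask >> i) & 1):
            b = 0                             # screened-out byte does not participate
        if unsigned or b < 0x80:
            out.append(b)                     # high 8 bits padded with zeros
        else:
            out.append(0xFF00 | b)            # high 8 bits replicate the sign bit
    return out

assert mask_and_extend([0x90, 0x90], 0b01, unsigned=False) == [0xFF90, 0]
```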
As shown in Figure 7, the 512-bit vector mantissa, after sign extension, enters the first-level compressor array and is reduced by three layers of 32-bit compressors to two compressed results. In floating-point mode, the two outputs of the third-layer 32-bit compressor, the scalar register value, and the sticky compensation Comp enter a 4-2 compressor to produce the floating-point mantissa compression result; in 32-bit fixed-point mode, the Comp input of the 4-2 compressor is 0, producing the 32-bit fixed-point compression result. In the 16-bit fixed-point format, the outputs of the three 32-bit compressor layers pass in turn through a 16-bit compressor and then a 3-2 compressor, where they are compressed together with the scalar register value to produce the 16-bit fixed-point result. The 8-bit fixed-point format is handled similarly to the 16-bit fixed-point format.
As shown in Figure 8, the upper part of the figure is the compressor proper and the lower part the sign-extension portion. The 32-bit compressor consists of four 8-bit compressors connected in series through MUXes; according to the data format, each MUX feeds either the carry of the lower compressor or 0 into the next higher compressor. In 8-bit mode, all three MUXes select the 0 input; in 16-bit mode, the first and third MUXes select the lower carry and the second selects 0; in 32-bit mode, all three MUXes select the lower carry. The four 8-bit compressors of the sign-extension portion connect to the four 8-bit compressors of the compressor portion and receive the extended sign bits as input, guaranteeing arithmetic equivalence when compressing data of different granularities.
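The arithmetic behind all of these compressor layers is carry-save addition. The sketch below shows a behavioral 3-2 compressor and a reduction loop; it illustrates the S/C invariant only, not the 8-bit-sliced, MUX-reconfigurable structure of Figure 8.

```python
# Carry-save compression: a 3-2 compressor turns three addends into a "sum
# string" S and "carry string" C with S + C equal to the total (mod 2^width).
def compress_3_2(x, y, z, width=32):
    mask = (1 << width) - 1
    s = (x ^ y ^ z) & mask                            # bitwise sum, no carries
    c = (((x & y) | (y & z) | (x & z)) << 1) & mask   # carries, weighted x2
    return s, c

def compress_many(operands, width=32):
    ops = list(operands)
    while len(ops) > 2:                               # reduce three at a time
        ops.extend(compress_3_2(ops.pop(), ops.pop(), ops.pop(), width))
    return ops[0], (ops[1] if len(ops) > 1 else 0)

vals = [7, 9, 13, 21, 5]
s, c = compress_many(vals)
assert (s + c) & 0xFFFFFFFF == sum(vals) & 0xFFFFFFFF
```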
As shown in Figure 1, the floating-point exponent and mantissa post-processing / fixed-point operation part 4 comprises a mantissa addition unit 41, a leading-zero predictor PZD 42, a floating-point mantissa normalization-shift unit 43, a floating-point mantissa normalization-rounding unit 44, an exponent correction unit 45, a sign-bit correction unit 46, and a fixed-point result processing unit 47. The mantissa addition unit 41 adds the compressed S and C strings to obtain the mantissa sum. The leading-zero predictor PZD 42 pre-encodes the S and C strings entering unit 41 into a 0/1 string; a leading-zero detection circuit processes this string to locate the leading 1 and thereby control the shift distance of the normalized mantissa result. Because the pre-encoding may be off by one bit, the shifted result also passes through a compensation circuit that detects and corrects the error. The normalization-shift unit 43 shifts the result mantissa by the distance obtained from PZD 42 to yield the normalized floating-point mantissa. The normalization-rounding unit 44 rounds the normalized mantissa according to the Guard, Round, and Sticky bits. The exponent correction unit 45 and sign correction unit 46 adjust the exponent and sign bit according to the PZD output, the rounding outcome, and the mantissa sum, producing the final floating-point exponent and sign bit. The fixed-point result processing unit 47 processes the output of the mantissa addition unit 41 according to the fixed-point instruction options to obtain the fixed-point vector dot-product accumulate result.
The floating-point exponent and mantissa post-processing / fixed-point operand part 4 uses a fast compound adder to compute both S + C and −(S + C). In fixed-point formats, the fixed-point result processing 47 selects the S + C result and post-processes it into the vector fixed-point dot-product accumulate result. In floating-point format, the sign bit of the S + C result selects between S + C and −(S + C) and drives the sign-bit correction unit 46 to complete the floating-point sign correction: when the sign bit of S + C is negative, −(S + C) is selected and Smax is inverted to form the final result's sign bit; otherwise S + C is selected and the final sign equals Smax. The leading-zero predictor PZD 42 pre-encodes the S and C strings entering the mantissa addition unit 41 into a binary string; a leading-zero detection circuit processes this string to obtain the position of the leading 1 and the normalization shift distance Dnormal, which controls how many bits the normalization-shift unit 43 shifts left or right; because the pre-encoding may be off by one bit, the shifted result passes through a compensation circuit that detects and corrects the error. After the normalization shift, the normalization-rounding unit 44 evaluates the Guard/Round/Sticky bits on the shifted result and completes the rounding, at the same time determining whether a second rounding (SecondRound) is needed: a second rounding is required when rounding produces a carry out of the most significant mantissa bit, and it affects the floating-point result exponent. The exponent correction unit 45 completes the floating-point exponent adjustment based on the maximum exponent Emax, the normalization shift distance Dnormal, and the SecondRound flag.
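A minimal sketch of the compound-adder selection, assuming a two's-complement mantissa sum of a given width (the width parameter is illustrative):

```python
# Select the positive magnitude of the mantissa sum and fix the result sign:
# a negative S + C picks -(S + C) and inverts Smax; otherwise S + C and Smax.
def select_mantissa(s, c, s_max, width=32):
    mask = (1 << width) - 1
    total = (s + c) & mask
    if total >> (width - 1):                  # sign bit of S + C indicates negative
        return (-total) & mask, s_max ^ 1
    return total, s_max

assert select_mantissa(5, -12 & 0xFFFFFFFF, 0) == (7, 1)   # -7 -> magnitude 7, sign flipped
```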
As shown in Figure 9, the parallel exponent correction unit operates in parallel with the floating-point mantissa normalization-shift unit 43. Two 8-bit adders compute the values Emax + Dnormal and Emax + Dnormal + 1 respectively, and the second-rounding control logic (SecondRound) selects the final exponent result. When a second rounding is needed, the mantissa must be shifted left one more bit after the first rounding, and the floating-point result exponent is Emax + Dnormal + 1. With this parallel exponent adjustment, the exponent-adjust path, which previously contained two 8-bit adders in series, now costs only one adder's delay, shortening the critical path and improving the performance of the whole system.
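In behavioral terms the parallel correction is just a select between two precomputed sums, as the sketch below shows (names are ours):

```python
# Both candidate exponents are formed concurrently with the normalization
# shift; the SecondRound flag then selects one, so the exponent-adjust path
# costs a single 8-bit adder delay plus a MUX.
def correct_exponent(e_max, d_normal, second_round):
    candidate0 = e_max + d_normal             # first 8-bit adder
    candidate1 = e_max + d_normal + 1         # second 8-bit adder, in parallel
    return candidate1 if second_round else candidate0

assert correct_exponent(130, -3, False) == 127
assert correct_exponent(130, -3, True) == 128
```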
Based on the vector dot-product accumulate network shown in Figures 1 to 9, the present invention also provides a fixed/floating-point-reconfigurable, data-length-configurable summation method, which includes the following. 8/16/32-bit fixed-point data is reconfigurable, with 8-bit fixed-point data as the basic unit: two 8-bit fixed-point units plus the corresponding control logic are reconfigured into 16-bit fixed-point data, and four 8-bit units plus the corresponding control logic into 32-bit fixed-point data. Fixed and floating point are reconfigurable: after the floating-point mantissas are shift-aligned, the sign bit determines whether negation is performed; when the sign bit is 1, the mantissa is negated while the sign bit remains 1, and the sign bit and mantissa form a new 32-bit datum; when the sign bit is 0, the mantissa is left unchanged. After this sign handling the floating-point mantissa can reuse the fixed-point datapath. The data length is configurable via the Mask register: each Mask bit controls one bit field of the vector data, so configuring the Mask register's value configures the length of the data participating in the operation.
The specific embodiments above further describe the purpose, technical solution, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Citations

Patent Citations (4)
- CN1633637A (Intel Corp., published 2005-06-29): Multiply-accumulate (MAC) unit for single-instruction/multiple-data (SIMD) instructions.
- US 2008/0071851 A1 (Ronen Zohar, published 2008-03-20): Instruction and logic for performing a dot-product operation.
- CN101840324A (Institute of Automation, Chinese Academy of Sciences, published 2010-09-22): 64-bit fixed- and floating-point multiplier unit supporting complex operation and subword parallelism.
- CN101847087A (Institute of Automation, Chinese Academy of Sciences, published 2010-09-29): Reconfigurable transverse summing network structure supporting fixed and floating points.

Non-Patent Citations (1)
- Gu Rongrong, "Design of a High-Performance Reconfigurable Multiply-Add Unit", Popular Science & Technology, vol. 2010, no. 02, 1 March 2010.