CN117931123B

CN117931123B - Low-power-consumption variable-precision embedded DSP hard core structure applied to FPGA

Info

Publication number: CN117931123B
Application number: CN202410340137.5A
Authority: CN
Inventors: 李越航; 黄志洪; 蔡刚; 魏育成
Original assignee: Ehiway Microelectronic Science And Technology Suzhou Co ltd
Current assignee: Ehiway Microelectronic Science And Technology Suzhou Co ltd
Priority date: 2024-03-25
Filing date: 2024-03-25
Publication date: 2024-06-14
Anticipated expiration: 2044-03-25
Also published as: CN117931123A

Abstract

The invention provides a low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA, comprising an accumulation path and a multiplication and addition path. The accumulation path comprises an accumulation path input register and a variable precision floating point adder unit; the multiplication and addition path comprises a multiplication and addition path input register, a first-order multiplication and addition structure and a single-precision floating point adder unit. The multiplication and addition path input register implements a data shift register transmission function; the accumulation path input register implements the selection of whether data is registered. A preprocessing unit is arranged in the multiplication and addition path and comprises a coefficient selection unit, which pre-stores internal coefficients, and a pre-adder; the preprocessing unit receives data from the multiplication and addition path input register and pre-adds the input numbers as the calculation requires. The invention supports operations at multiple precisions while reducing the area overhead of the device, balancing cost and flexibility.

Description

Low-power-consumption variable-precision embedded DSP hard core structure applied to FPGA

Technical Field

The invention belongs to the technical field of digital integrated circuits, and particularly relates to a low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA.

Background

An FPGA is a semi-custom circuit within the family of application-specific integrated circuits: a programmable logic array that can implement specific logic functions or digital calculation tasks. The basic structure of an FPGA comprises programmable input/output units, configurable logic blocks, digital clock management modules, embedded RAM (random access memory), routing resources and embedded dedicated hard cores, and offers abundant routing resources and repeated reprogramming. Because of their flexibility, short development cycle and low cost, FPGAs are popular with engineers.

In recent years, with the rise of fields such as artificial intelligence and big data, the application range of FPGAs has expanded further. Specific tasks are implemented by embedding hard core circuits in the FPGA. The embedded DSP is a hardware resource dedicated to digital signal processing within the FPGA; it is the main provider of computing power in the FPGA, carries the compute-intensive floating-point and fixed-point workloads, and is used to accelerate a wide variety of calculation tasks. As application scenarios have multiplied, floating-point data has come into use in many of them; the earlier approach of building floating-point operations from fixed-point DSPs and logic resources consumes a large amount of on-chip resources and struggles to deliver efficient performance. To avoid occupying those on-chip resources, developers have designed embedded floating-point DSP IP cores, embedded in the FPGA chip, to accelerate specific operations and algorithms. As the application range of FPGAs continues to expand, floating-point operations, fixed-point operations, vector operations, scalar operations, mixed-precision operations and so on increasingly need to be integrated into the embedded DSP hard core structure. Because floating-point arithmetic is highly complex and different operation types place different demands on the hardware structure, the area overhead and time overhead of the embedded DSP grow. Because the internal resources of the FPGA are limited, prior-art FPGA designs try to reduce the overhead of the embedded DSP as far as possible.

The existing approach to improving hard cores inside the FPGA is to let different operations share the same hardware structure, so that diverse operation types are implemented at minimal overhead. In the prior art, mainstream embedded DSP designs achieve flexible switching between operations through different operation modes and flexible cascade ports. However, current commercial embedded DSPs do not support flexible switching among floating-point operations of multiple precisions; their internal multiplier and adder structures are costly, and they do not support double-precision floating-point operations. Some high-precision computing scenarios require double-precision floating-point support, yet the mainstream design method for multi-precision floating-point multipliers consumes a large amount of resources. There is therefore an urgent need for an embedded DSP hard core structure that reduces overhead, improves flexibility, and supports low power consumption and operations at multiple precisions, to serve as the arithmetic unit inside the FPGA.

Disclosure of Invention

The invention provides a low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA, which supports low-precision floating point operations as well as vector dot-product and scalar-product operations on low-precision fixed point numbers, while balancing overhead and flexibility. The input end is provided with selectable paths, so that unused paths can be selectively shut off. The invention therefore offers low power consumption and supports operations at multiple precisions.

Other objects and advantages of the present invention will be further appreciated from the technical features disclosed in the present invention.

A low-power consumption variable precision embedded DSP hard core structure applied to FPGA comprises an accumulation path and a multiplication and addition path; the accumulation path comprises an accumulation path input register and a variable precision floating point adder unit; the multiplication and addition path comprises a multiplication and addition path input register, a first-order multiplication and addition structure and a single-precision floating point adder unit;

the multiply-add path input register and the accumulate path input register include a register multiplexer and a register;

the multiplication and addition path input register is used for realizing the selection of data register and the data shift input;

the accumulation path input register is used for realizing the selection of data register;

a preprocessing unit is arranged in the multiplication and addition path and comprises a coefficient selection unit and a pre-adder, wherein the coefficient selection unit pre-stores internal coefficients; the preprocessing unit receives the multiplication and addition path input register data and performs pre-addition on the input numbers according to the calculation requirement.

The beneficial effects of the invention are as follows. Two paths are provided internally. The first-order multiplication and addition structure in the multiplication and addition path can perform independent multiplication at three precisions as well as two-way multiplication and addition on low-precision data. The single-precision floating point adder unit in the multiplication and addition path and the variable-precision floating point adder unit in the accumulation path work together, so that the DSP structure as a whole performs independent multiplication, multiplication and addition, multiplication and accumulation, and mixed-precision operations at three precisions. In addition, the multiplication and addition path input register cooperates with the preprocessing unit to preprocess data, enabling complex operations. The structure thus supports operations at multiple precisions while reducing the area overhead of the device, balancing cost and flexibility.

The multiply-add path input register and the register in the accumulate path input register are connected to the input port of the register multiplexer to form a register unit.

A primary pipeline register is arranged between the first-order multiplication and addition structure in the multiplication and addition path and the single-precision floating point adder unit;

and a second-stage pipeline register is arranged between the single-precision floating point adder unit in the multiplication and addition path and the variable-precision floating point adder unit in the accumulation path.

The accumulation path input register, the primary pipeline register, the secondary pipeline register and the output register comprise a register unit.

The multiplication and addition path input register is provided with four parallel input ends, and the four parallel input ends are connected in series; each input end has two input ports, each input port comprising a register unit;

Each input port of the input end is provided with an input port multiplexer;

An input multiplexer of a starting DSP in the DSP array selects initial data input from an input port of a multiplication and addition path input register of the starting DSP to transmit so as to realize a shift register transmission function;

the input multiplexer of a cascaded non-initial DSP selects the data output by the multiplication and addition path input register of the preceding DSP for transmission, so as to realize the shift register transmission function.

After being divided, the single-precision floating point number is input at two input ends, the data input by the two input ends are respectively shifted and transmitted, and the single-precision floating point number is spliced into a complete single-precision floating point number on a shifting output port of a multiplication and addition path input register, so that a shifting register transmission function of the single-precision floating point number is realized.

The preprocessing unit is used for a fixed point number pre-addition function with various precision.

At least two 18-bit pre-addition functional units are arranged in the preprocessing unit;

The two 18-bit pre-addition functional units realize a 27-bit addition function through an intermediate cascade channel.

The coefficient selection unit is used for pre-storing 8 coefficients at least, and the coefficients selected by the coefficient selection unit are output from the output port of the preprocessing unit.

The multiplication and addition path input register comprises a direct output port and a preprocessing output port;

Data that requires the operation of the preprocessing unit is input to the multiplication and addition path input register, passed from the preprocessing output port to the preprocessing unit, processed by the preprocessing unit, and output from the output port of the preprocessing unit;

The data which does not need to be subjected to preprocessing unit operation is input from the multiplication and addition path input register and then output to a first-order multiplication and addition structure from a direct output port for multiplication and addition operation.

The input end of the first-order multiplication and addition structure comprises two first-order multiplication and addition structure multiplexers; the data output by the output port of the preprocessing unit and the data output by the direct output port enter these multiplexers for selection, so as to select whether or not the data is preprocessed.

And a register clock gating circuit is arranged in the accumulation path input register, the multiplication and addition path input register, the primary pipeline register and the secondary pipeline register, and the register clock gating circuit selectively turns off unused paths.

The clock gating logic of the clock gating circuit is automatically inserted and generated by an EDA synthesis tool.

A multiplexer group is arranged between the first-stage pipeline register and a first-order multiply-add structure in the multiply-add path, and the output end of the multiplexer group is also connected to the second-stage pipeline register;

And the multiplexer group selects multiplication operation results and multiplication and addition operation results registered in the primary pipeline register to enter the accumulation path variable precision floating point adder unit or the multiplication and addition path single precision floating point adder unit for operation according to the operation precision.

The adder units in the accumulation path and the multiplication and addition path comprise an input register, a data processing module, a sign bit processing unit, an exponent (step code) alignment unit, a mantissa shifting unit, an LZA coding module, an error correction module, an ALU unit and a shifting module;

The data processing module divides floating point numbers stored in the input register into sign bits, exponent bits and mantissa bits;

The sign bit processing unit obtains the sign bit value of the operation result, determines the operation type in the addition path, and outputs a sign for the operation of the ALU unit;

The exponent alignment unit performs at least two operations, namely comparing the exponents and outputting a flag signal, and calculating the exponent difference;

the mantissa shifting unit performs at least three operations: the mantissa is shifted according to the exponent difference, the mantissas are then compared, and the relative magnitude of the floating point numbers is determined from the mantissa comparison together with the exponent comparison result; this magnitude determination is used to select the data inputs of the ALU unit;

the LZA (leading zero anticipation) coding module encodes the input data in advance and is used to calculate the number of leading zeros in the data;

The error correction module outputs a one-bit error correction signal to the shift module, which is used to correct errors of the leading zero anticipation algorithm during data shifting;

The shifting module is a barrel shifter.
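The leading-zero handling named above can be illustrated with a minimal behavioural sketch in Python. It is not the circuit of the invention: it assumes a leading-zero detector built hierarchically from 2-bit cells (in the spirit of the 2-bit, 4-bit and 32-bit LZD circuits listed for FIGS. 16 to 18), and it assumes that the anticipated shift amount is either exact or one too small, so that a one-bit correction suffices. The function names `lzd2`, `lzd` and `normalize` are hypothetical.

```python
def lzd2(bits2):
    """2-bit leading-zero detector cell.

    Returns (valid, count): valid is 1 if either bit is set,
    count is the number of leading zeros (0 or 1) when valid.
    """
    hi, lo = (bits2 >> 1) & 1, bits2 & 1
    valid = hi | lo
    count = (hi ^ 1) & lo  # exactly one leading zero for the pattern 01
    return valid, count


def lzd(value, width):
    """Hierarchical leading-zero detector for a power-of-two width >= 2.

    The count is only meaningful when valid is 1 (value is non-zero).
    """
    value &= (1 << width) - 1
    if width == 2:
        return lzd2(value)
    half = width // 2
    hi_valid, hi_count = lzd(value >> half, half)
    lo_valid, lo_count = lzd(value & ((1 << half) - 1), half)
    valid = hi_valid | lo_valid
    # If the upper half holds a 1, its count wins; otherwise the whole
    # upper half is zero and the lower-half count is offset by `half`.
    count = hi_count if hi_valid else half + lo_count
    return valid, count


def normalize(mantissa, predicted_shift, width=32):
    """Shift left by the anticipated amount, then apply a 1-bit correction.

    Assumption: predicted_shift is exact or one too small.
    Returns (normalized mantissa, correction bit).
    """
    mask = (1 << width) - 1
    shifted = (mantissa << predicted_shift) & mask
    correction = 0 if (shifted >> (width - 1)) & 1 else 1
    return (shifted << correction) & mask, correction
```

For example, `normalize(0b00010110, 2, width=8)` shifts by the anticipated two positions, finds the most significant bit still clear, and applies the one-bit correction.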

Compared with the prior art, the invention has the following beneficial effects: 1. The invention provides four registers, namely the accumulation path input register, the multiplication and addition path input register, the first-stage pipeline register and the second-stage pipeline register. Each of the four registers contains register units that can select whether or not to store data, and the number of pipeline stages is controlled through these registers, so that the structure is more flexible to use.

2. The multiplication and addition path register in the multiplication and addition path can be matched with the preprocessing unit to realize the preprocessing of data so as to realize complex operation.

3. The invention is provided with an accumulation path and a multiplication and addition path, wherein the multiplication and addition structure in the multiplication and addition path can realize multiplication and addition operation, and can selectively execute operations of multiplication, multiplication and addition and the like of three precision data by matching with a single-precision floating point adder unit of the multiplication and addition path and a variable-precision floating point adder unit in the accumulation path, so that the device has more flexible operation. Meanwhile, the single-precision floating point adder unit and the variable-precision floating point adder unit can realize addition operation of one group of double-precision floating point numbers and addition operation of two groups of single-precision floating point numbers.

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.

Drawings

In order to more clearly illustrate the technical solutions of specific embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a block diagram of the overall structure of a variable precision DSP core for use in an FPGA;

FIG. 2 is a block diagram of a multiply-add path input register module of the present invention;

FIG. 3 is a schematic diagram of a preprocessing module according to the present invention;

FIG. 4 is a schematic diagram of a register unit in an accumulation path input register, a primary pipeline register, and a secondary pipeline register according to the present invention;

FIG. 5 is a schematic diagram of a first-order multiply-add structure in accordance with the present invention;

FIG. 6 is a block diagram of a booster unit according to the invention;

FIG. 7 is a schematic diagram of three precision multiplicand mantissa splitting/packing approaches in the present invention;

FIG. 8 is a Booth coding circuit of the present invention;

FIG. 9 is a schematic diagram of single-precision mantissa symbol preprocessing in the present invention;

FIG. 10 is a schematic of a first stage compression process in accordance with the present invention;

FIG. 11 is a schematic diagram of a partial product arrangement of two precision two-stage compression in accordance with the present invention;

FIG. 12 is a schematic of a second stage compression process according to the present invention;

FIG. 13 is a schematic diagram of an adder in the multiplier of the present invention;

FIG. 14 is a schematic diagram of single precision mantissa protection bits, rounding bits and sticky bits processing in accordance with the present invention;

FIG. 15 is a schematic diagram of a multi-precision mantissa sticky bit processing in accordance with the present invention;

FIG. 16 is a circuit configuration diagram of a 2-bit LZD circuit of the present invention;

FIG. 17 is a block diagram of a 4-bit LZD circuit of the present invention;

FIG. 18 is a block diagram of a 32-bit LZD circuit of the present invention;

FIG. 19 is a circuit diagram illustrating a first order floating point multiply-add module and a single precision floating point adder architecture for determining the type of mantissa operation by sign bit processing in accordance with the present invention;

FIG. 20 is a schematic diagram of a barrel shifter in accordance with the present invention;

FIG. 21 is a schematic diagram of another barrel shifter in accordance with the present invention;

FIG. 22 is a diagram of a clock gating circuit in accordance with the present invention.

Detailed Description

The foregoing and other features, aspects, and advantages of the present invention will become more apparent from the following detailed description of a preferred embodiment, which proceeds with reference to the accompanying drawings. The directional terms mentioned in the following embodiments are, for example: upper, lower, left, right, front or rear, etc., are merely references to the directions of the attached drawings. Thus, the directional terminology is used for purposes of illustration and is not intended to be limiting of the invention.

A low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA is described in detail below with reference to the accompanying drawings. Referring to the overall structure of the DSP core shown in FIG. 1, the DSP core includes two paths: a multiply-add path for multiply-add operations and an accumulation path for addition operations. The multiply-add path includes a multiply-add path input register S20 at its input end, a preprocessing module S50, a first-order multiply-add structure S10, a first-stage pipeline register S60, a multiplexer group S90 and a single-precision floating point adder unit S30. The accumulation path includes an accumulation path input register S70 at its input end and a variable-precision floating point adder unit S40. The results of the multiply-add path and of the accumulation path are each output to an output register S80. A second-stage pipeline register S100 is arranged between the output end of the single-precision floating point adder unit S30 in the multiply-add path and the input end of the variable-precision floating point adder unit S40 in the accumulation path.

At least one register unit S200 is disposed in each of the multiply-add path input register S60, the accumulation path input register S70, the first-stage pipeline register S60 and the second-stage pipeline register S100. As shown in FIG. 4, the register unit includes a register multiplexer S2002 (the trapezoid in the figure) and a register S2001 (the rectangle in the figure); the register S2001 is connected to an input terminal of the multiplexer S2002, and the multiplexer S2002 selects whether the input data is registered. The multiply-add path input register S60 can not only select whether to register data but also implement a shift register transmission function. Its structure is shown in FIG. 2: it includes eight register units S200 and four input port multiplexers S601, arranged as four input ends, each containing one input port multiplexer S601 and two register units S200. Each input end is divided into an X input and a Y input, and the shift register transmission function is obtained by connecting the X inputs of the four input ends in series.

The shift function is implemented as follows. Referring to FIG. 2, a set of multiplexers (trapezoids) switches the register between two functions. The first is a half-precision/single-precision/9-bit/18-bit shift register output function (that is, a shift input function that passes input data from one DSP to the next). When the DSP is the starting DSP, the input multiplexer S601 selects data from X1 (a 9-bit number or a half-precision floating point number) or from {X1, X2} (an 18-bit number or a single-precision floating point number) and passes it to the output scan_out of the multiply-add path input register S60. When the DSP is a cascaded (non-initial) DSP, the input multiplexer S601 selects data from the scan_in input, which is connected to the scan_out output of the preceding DSP, and passes it to the output scan_out of the multiply-add path input register S60.

The second function is the conventional register selection function, which is implemented by the register units.

The multiply-add path input register supports shift operations at four precisions. When the input is a 9-bit fixed point number or a half-precision floating point number, it is transmitted to the Scan_out[31:16] port along the shift transmission path from X1 or from Scan_in[15:0] (the Scan_out port of the preceding DSP is connected to the Scan_in input of the current DSP). When the input is an 18-bit fixed point number or a single-precision floating point number, the input ports X1 and X2, or Scan_in[31:16] and Scan_in[15:0] (fed from the Scan_out port of the preceding DSP), are each 16 bits wide, so the 32-bit data is split across the two ports; together with the internal multiplexers, this implements shift register input of single-precision data. The two halves travel along two transmission paths to the Scan_out[31:16] and Scan_out[15:0] ports, which connect to the Scan_in port of the next DSP.
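The cascaded shift transmission can be modelled behaviourally as follows. This is a minimal sketch under stated assumptions, not the register circuit of FIG. 2: each DSP is reduced to a single 32-bit scan register, X1 is assumed to carry the upper 16 bits and X2 the lower 16 bits of a single-precision word, and the class and function names (`ScanStage`, `clock`) are hypothetical.

```python
class ScanStage:
    """One DSP's scan register with its input port multiplexer."""

    def __init__(self, is_first):
        self.is_first = is_first  # True for the starting DSP of the chain
        self.reg = 0              # 32-bit register driving scan_out

    def select(self, x1, x2, scan_in):
        # Starting DSP: take {X1, X2} from its own input ports.
        # Cascaded DSP: take scan_in, i.e. the previous DSP's scan_out.
        if self.is_first:
            return ((x1 & 0xFFFF) << 16) | (x2 & 0xFFFF)
        return scan_in & 0xFFFFFFFF


def clock(chain, x1, x2):
    """Advance every stage by one clock edge and return all scan_out values."""
    # Sample all multiplexer outputs first so the update is synchronous.
    next_values = []
    for index, stage in enumerate(chain):
        scan_in = chain[index - 1].reg if index > 0 else 0
        next_values.append(stage.select(x1, x2, scan_in))
    for stage, value in zip(chain, next_values):
        stage.reg = value
    return [stage.reg for stage in chain]


# Usage: shift three single-precision bit patterns through three DSPs.
chain = [ScanStage(True), ScanStage(False), ScanStage(False)]
for word in (0x3FC00000, 0x40490FDB, 0xC0000000):
    state = clock(chain, word >> 16, word & 0xFFFF)
```

After three clock edges the first word has reached the third DSP, and each scan_out holds a complete 32-bit word reassembled from its two 16-bit halves.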

The scheme of the application builds a pipeline from four register units, and the number of pipeline stages is controlled by selecting whether each register stores data. This selection is normally configured by the user, so the registers make the DSP structure more flexible to use.
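The effect of the register/bypass selection on pipeline depth can be sketched in Python. The model below is illustrative only: it treats each register unit as a register followed by a two-way multiplexer, omits all arithmetic between the stages, and uses hypothetical names (`RegisterUnit`, `run_pipeline`).

```python
class RegisterUnit:
    """A register plus a multiplexer that selects registered or direct data."""

    def __init__(self, use_register):
        self.use_register = use_register  # multiplexer select, set by the user
        self.q = 0                        # stored value

    def step(self, d):
        """Process one clock cycle: return the unit's output for input d."""
        if not self.use_register:
            return d                      # bypass: no added latency
        out, self.q = self.q, d           # registered: one cycle of latency
        return out


def run_pipeline(config, samples):
    """Feed samples through a chain of register units, one per clock cycle."""
    units = [RegisterUnit(flag) for flag in config]
    outputs = []
    for sample in samples:
        value = sample
        for unit in units:
            value = unit.step(value)
        outputs.append(value)
    return outputs


# Two of the four stages enabled: results appear two cycles late.
delayed = run_pipeline([True, False, True, False], [1, 2, 3, 4, 0, 0])
```

The latency in cycles equals the number of register units whose multiplexer selects the registered path, which is the sense in which the user controls the pipeline depth.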

As shown in FIG. 3, the preprocessing unit S50 in the multiply-add path performs pre-addition before multiplication. It contains a coefficient selection unit configured with 8 selectable coefficients, and two 18-bit pre-addition functional units that are cascaded through an intermediate cascade channel to implement a 27-bit addition function. The coefficient selection unit uses a multiplexer to select among coefficients preset in advance; when the coefficients of an operation do not change, they can be fetched internally instead of being supplied as inputs each time, which reduces the use of input ports. The selected internal coefficients are output from the preprocessing unit output ports (the x_2 and y_2 ports in FIG. 3).
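One way two 18-bit pre-adders could be cascaded into a 27-bit adder is sketched below. The split is an assumption made for illustration (the lower 18 bits go to one unit, the upper 9 bits to the other, and the carry travels over the cascade channel); the source does not specify how the operands are divided, and the names `preadd18` and `preadd27` are hypothetical.

```python
MASK18 = (1 << 18) - 1
MASK27 = (1 << 27) - 1


def preadd18(a, b, carry_in=0):
    """One 18-bit pre-adder: returns (18-bit sum, carry out)."""
    total = (a & MASK18) + (b & MASK18) + carry_in
    return total & MASK18, total >> 18


def preadd27(a, b):
    """27-bit addition built from two cascaded 18-bit pre-adders."""
    a &= MASK27
    b &= MASK27
    # Lower unit adds bits [17:0] and produces the cascade carry.
    low, carry = preadd18(a & MASK18, b & MASK18)
    # Upper unit adds bits [26:18] plus the carry; only 9 result bits are kept.
    high, _ = preadd18(a >> 18, b >> 18, carry)
    return ((high & 0x1FF) << 18) | low
```

In the unassumed 18-bit mode the two units would simply operate independently with the cascade carry forced to zero.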

The multiply-add path input register S60 works together with the preprocessing unit S50, so that data can either pass through or bypass the preprocessing unit. The specific steps are as follows. For data that does not require preprocessing, the inputs data_mux_x and data_mux_y pass through the input register and are output on ports x_1 and y_1. For data that requires preprocessing, data_ax and data_bx, and data_ay and data_by, are output through the input register units, processed by the preprocessing unit, and output on ports x_2 and y_2. The resulting outputs x_1, y_1, x_2 and y_2 are fed to the multiplexers S117 at the input of the first-order multiply-add structure, which select whether or not preprocessed data is used.

The first-order multiply-add structure S10 consists of two parts, a multi-precision multiplier in the upper half and two low-precision addition paths in the lower half, as shown in FIG. 5. Adding the two low-precision addition paths to the multiplier provides a two-group, low-precision multiply-add function. The two low-precision multiply-add results output by the first-order multiply-add structure enter the single-precision floating point adder unit S30 for addition. The single-precision floating point adder unit S30 can add the two half-precision multiply-add results (which the first-order multiply-add structure S10 outputs as single-precision floating point data) or two single-precision multiplication results (the multiplexer in the output section of the first-order multiply-add structure S10 can select the multiplication results for output), and can perform the corresponding fixed point multiply-add operations.

The multiplier and the two addition paths in the first-order multiply-add structure S10, as well as the structure of the adder unit and the way its functions are implemented, are described in detail below with reference to the accompanying drawings.

Referring to the first-order floating-point multiply-add structure of FIG. 5, the multiplicand Y and the multiplier X are input together into the data preprocessing unit S101, which splits the data and checks it for exceptional values. The data preprocessing unit divides the input multiplier and multiplicand into sign bits, exponent bits and mantissa bits. The multiplier implements floating point multiplication: it mainly computes on the mantissa parts, while the exponent bits and sign bits are processed by their own modules. The sign bit module and the exponent bit module do not take part in every operation; for data that contains no sign bit, the split produces no sign field, so the sign bit processing unit is not needed.

Sign bit processing is performed by the sign bit module S111. Two four-bit sign groups are formed by concatenating sign bits, and the two groups are XORed pairwise to obtain the sign bits required by the multiplier (in FIG. 5, one output of the sign bit module S111 is sent to the multiplier normalization and rounding module S107). To determine the operation type in the addition path, the sign bits are processed further. Referring to the circuit of FIG. 19 for determining the mantissa operation type, the signs a_sign and b_sign of the two floating point operands and the user-specified operation type Operate_in are input to that circuit, which outputs the operation type Operate_out; this output indicates whether the mantissa operation is an addition or a subtraction.
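The sign logic can be illustrated with a short Python sketch. The product-sign part follows the XOR of the two four-bit sign groups described above. The operation-type part is an assumption: the circuit of FIG. 19 is not reproduced in the text, so the sketch uses the conventional rule that the effective mantissa operation is the XOR of both operand signs and the requested operation (0 for addition, 1 for subtraction). Function names are hypothetical.

```python
def product_signs(a_signs, b_signs):
    """XOR two 4-bit sign groups, one bit per low-precision lane."""
    return (a_signs ^ b_signs) & 0xF


def effective_operation(a_sign, b_sign, operate_in):
    """Return 0 if the mantissas are added, 1 if they are subtracted.

    Assumed encoding: operate_in is 0 for a requested addition and
    1 for a requested subtraction.
    """
    return a_sign ^ b_sign ^ operate_in
```

For instance, adding a positive and a negative operand (`effective_operation(0, 1, 0)`) results in a mantissa subtraction, while subtracting a negative operand from a positive one (`effective_operation(0, 1, 1)`) results in a mantissa addition.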

Exponent bit processing is performed by the exponent bit module S110. The exponent fields of four half-precision floating point numbers are concatenated into one wide word; the word width supports one double-precision exponent, two single-precision exponents or four half-precision exponents. The small-width fields packed into the wide word are added to obtain a combined biased exponent value; the bias for the selected precision is chosen by a multiplexer and subtracted from the combined biased exponent, giving the exponent value required by the multiplier (in FIG. 5, one output of the exponent bit module S110 is sent to the multiplier normalization and rounding module S107).
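The exponent arithmetic can be sketched as follows, using the IEEE 754 biases of 15, 127 and 1023 for half, single and double precision. The packed layout is an assumption chosen for illustration: each 5-bit half-precision exponent is given a 6-bit field so that a single wide addition cannot carry between lanes. The sketch ignores normalization adjustments, overflow and underflow; all names are hypothetical.

```python
BIAS = {"half": 15, "single": 127, "double": 1023}


def product_exponent(exp_a, exp_b, precision):
    """Biased exponent of a product before normalization: ea + eb - bias."""
    return exp_a + exp_b - BIAS[precision]


def packed_half_exponent_sums(exps_a, exps_b):
    """Add four half-precision exponent pairs with one wide addition."""
    field = 6  # 5-bit exponents plus one guard bit per lane
    packed_a = sum(e << (field * i) for i, e in enumerate(exps_a))
    packed_b = sum(e << (field * i) for i, e in enumerate(exps_b))
    total = packed_a + packed_b
    return [(total >> (field * i)) & 0x3F for i in range(4)]


def half_product_exponents(exps_a, exps_b):
    """Per-lane biased product exponents for four half-precision pairs."""
    sums = packed_half_exponent_sums(exps_a, exps_b)
    return [s - BIAS["half"] for s in sums]
```

As a check on the formula, 2.0 and 4.0 in single precision have biased exponents 128 and 129, and `product_exponent(128, 129, "single")` gives 130, the biased exponent of 8.0.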

The mantissas obtained from the split enter the segmentation module S102, which divides the multiplicand data into four equal-width data packets and sends them to four encoders. The data encoding module S103 includes at least four data encoders, B1, B2, B3 and B4. The multiplier is encoded using radix-4 Booth encoding: each encoder takes the bits of its equal-width data packet in overlapping groups of three to produce the Booth-encoded partial products. Arranging the partial products at different positions yields different combinations, and these combinations implement floating point operations at the three precisions. The partial products produced by the encoding module are stored in an encoding register unit until the compression step uses them.
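Radix-4 Booth encoding with overlapping three-bit groups can be demonstrated in a few lines of Python. This is the textbook algorithm on a single operand, offered as a sketch; it does not model the four parallel encoders B1 to B4 or the precision-dependent arrangement of partial products, and the function names are hypothetical.

```python
def booth_radix4_digits(multiplier, width):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    width must be even. Each digit is in {-2, -1, 0, 1, 2} and is derived
    from three bits y[i+1], y[i], y[i-1], with adjacent groups sharing one bit.
    """
    extended = (multiplier & ((1 << width) - 1)) << 1  # implicit y[-1] = 0
    digits = []
    for i in range(0, width, 2):
        low = (extended >> i) & 1         # y[i-1]
        mid = (extended >> (i + 1)) & 1   # y[i]
        high = (extended >> (i + 2)) & 1  # y[i+1]
        digits.append(low + mid - 2 * high)
    return digits


def booth_multiply(multiplicand, multiplier, width):
    """Sum the Booth partial products; digit k carries weight 4**k."""
    digits = booth_radix4_digits(multiplier, width)
    return sum((d * multiplicand) << (2 * k) for k, d in enumerate(digits))
```

An unsigned mantissa can be handled by choosing a width at least one bit larger than the mantissa, so that the top bit is zero; for example, an 11-bit value is recoded with `width=12`, which produces six partial products instead of eleven.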

For negative partial products, the bit inversion is performed in the encoding module, while the add-1 operation is completed in the compression stage. This is illustrated with the single-precision mantissa in fig. 9: the value obtained from sign preprocessing is placed at the last bit of each partial product, which is the data arrangement shown in the figure.

The data processed by the multiplication unit enters the data compression module. The first-stage compression module S104 performs a first compression of the partial products obtained by encoding. The second-stage compression module S105 performs a second compression for medium-precision and high-precision numbers, compressing the first-stage result again to obtain the final compression result. The compressed partial products are stored in the compression register and then added in the adder S106. The adder result must still be rounded and normalized, and the result of that step is stored in the rounding register unit. After processing by the data reorganization module S108, the data stored in the rounding register unit can be output to the multiplexer S109, where it waits to be selected and output.

The multiplier can also operate on fixed-point numbers. A fixed-point operation simply omits the exponent processing and the sign bit processing; that is, the fixed-point number is carried in the mantissa bits. The three precision modes are selected by a control signal. In floating point mode, a single operation produces one double-precision floating point number, two single-precision floating point numbers, or four half-precision floating point numbers. In fixed-point mode, a single operation produces one 27-bit fixed-point number, two 18-bit fixed-point numbers, or four 9-bit fixed-point numbers. Fixed-point operation supports signed and unsigned selection; the selection module is located in the lower part of the encoding circuit. For convenience of description, four half-precision floating point mantissas and four 9-bit fixed-point numbers are referred to as four low-precision numbers, two single-precision floating point mantissas and two 18-bit fixed-point numbers as two medium-precision numbers, and one double-precision floating point mantissa and one 27-bit fixed-point number as one high-precision number. The application thus realizes more complex operations with less hardware and reduces the device area.

The data splitting, data compression, adder, normalization, and rounding operations in the mantissa multiplication unit are described in detail below:

Referring to fig. 7, three data segmentation modes are shown. The first data chain is one high-precision number, the second consists of two medium-precision numbers, and the third consists of four low-precision numbers. The multiplicand is divided into four groups of 14 bits each; each group is called a data packet, and the four data packets form one long chain. When two medium-precision numbers or four low-precision numbers are combined, a sign bit S_y and zero padding are required. The zero padding keeps the multiplicands of the four encoder groups at a uniform bit width, so that the encoding of data packets belonging to different multiplicands does not interfere. The sign bit S_y allows signed and unsigned fixed-point numbers to be computed separately.

When the three kinds of data with different bit widths in fig. 7 are input to the encoders, each selected data packet must also be concatenated with the highest bit of the preceding data packet. This extra bit is the common bit, so each data packet entering the encoding step is 15 bits wide. If the common bit is obtained by zero padding, it is called a truncated common bit: the data packet starts a new three-bit encoding group for a new multiplicand at that node, which separates the encodings. If the common bit is the original data value, it is called a continuous common bit: because it is original multiplicand data, a wide multiplicand continues to be encoded across the node and generates partial products, and the selection of encoding groups for the wide multiplicand is not disturbed. With the multiplicand denoted A, the data fed to the four encoders are: A[55:41], A[41:27], A[27:13], and {A[13:0], 1'b0}.
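The packet slicing can be sketched as below. This is a behavioral sketch only: the function name and the `continuous` parameter are hypothetical, introduced to model the choice between a continuous common bit (original data) and a truncated common bit (zero) at each of the three packet boundaries.

```python
def split_packets(a: int, continuous=(True, True, True)) -> list[int]:
    """Slice a 56-bit multiplicand A into four 15-bit encoder packets.

    Packets, highest first: A[55:41], A[41:27], A[27:13], {A[13:0], 1'b0}.
    The lowest bit of each of the first three packets is the common bit
    shared with the packet below it. continuous[i] False forces that bit
    to 0 (truncated common bit), isolating the packet from the one below.
    """
    a &= (1 << 56) - 1

    def bits(hi: int, lo: int) -> int:
        return (a >> lo) & ((1 << (hi - lo + 1)) - 1)

    packets = [bits(55, 41), bits(41, 27), bits(27, 13), bits(13, 0) << 1]
    for i, keep in enumerate(continuous):
        if not keep:
            packets[i] &= ~1  # zero-pad the common bit
    return packets
```

With all boundaries continuous the four packets overlap by one bit, which is what lets one wide multiplicand be encoded as a single chain.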

FIG. 8 shows the Booth encoding circuit. When three consecutive multiplicand bits y_{2i+1}, y_{2i}, y_{2i-1} and two consecutive multiplier bits x_i, x_{i-1} are input, the encoding circuit outputs pp_{ji}. The lower part of the encoding circuit is a sign bit selection module. It can supply the sign bit of each partial product from the sign bit preprocessing output, and it can also select the sign bit for fixed-point numbers, which have no exponent or sign bit processing. Specifically, the input control signal S_sel and the most significant multiplier bit X_{width-1} are combined by logic to produce S, which indicates whether the multiplier is signed: S is 1 for a signed multiplier and 0 for an unsigned one. S then selects either the sign signal or the unsign signal as the sign bit of each partial product. After the three consecutive multiplicand bits y_{2i+1}, y_{2i}, y_{2i-1} are input, floating point mantissa multiplication and unsigned multiplication use the unsign signal, which represents the sign bit of each partial product generated from unsigned numbers; signed multiplication uses the sign signal, which represents the sign bit of each partial product generated from signed numbers. The encoding scheme balances all paths through the encoder and the partial product generator so that they have the same delay. This design reduces the energy consumed by glitches in the circuit.

Because the partial products generated at each precision have different bit widths, and the sign precoding bits of the partial products are at correspondingly different positions, the data produced by the encoder must be packaged in a fixed form. Specifically, the inverted value obtained from sign bit preprocessing is placed at the head of the partial product and the value itself is placed at the tail; the constants for the different precisions are then calculated by the following formula.

In the above formula, S0, S1, S2.

The constant for the low-precision mantissa can thus be obtained as: 1010101011; the constant of the medium precision mantissa is: 11010101010101010101011; the constant of the high precision mantissa is: 1010101010 … 10101011.

The compression of the partial products obtained by encoding is shown in figs. 10, 11 and 12. The compression module is divided into two stages, which reduces the delay of low-precision operations. The first stage compresses the 8 groups of partial products produced by the encoders, as shown in fig. 10. It has two layers of compression and takes eight partial products as input; the eighth position is free and can be used to hold the constant.

The second compression stage operates on the results of the first stage. There are two cases: the addition of one set of high-precision partial products, and the addition of two sets of medium-precision partial products; the arrangement is shown in fig. 11. When two sets of medium-precision mantissa partial products are added, they must be offset from each other so that the two groups do not interfere. The selection and placement of the compressors are shown in figs. 10 and 12. The two-stage compression mainly uses 4-2 compressors and 3-2 compressors: a 4-2 compressor reduces four bits to two bits, and a 3-2 compressor reduces three bits to two bits. The same compression structure is used for both mantissa precisions. During first-stage compression, carries must be prevented from propagating into the next compression stage. Taking fig. 10 as an example, the 3-2 compressors in compression steps S401 and S402 handle isolated 2-bit groups, isolated 3-bit groups, and 2-bit groups that precede an isolated 3-bit group, reducing each to two bits. Low-precision operations skip the second compression stage, which reduces their delay.
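The principle of 3-2 and 4-2 compression can be shown with a word-level carry-save sketch. It models only the arithmetic identity (the sum of the outputs equals the sum of the inputs), not the bit placement of figs. 10 to 12; the function names are chosen for this example, and the 4-2 compressor is assumed to be built from two 3-2 compressors.

```python
def compress_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save 3-2 compression of three rows into a sum row and a carry row."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def compress_4_2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """4-2 compression modeled as two chained 3-2 compressors."""
    s, cy = compress_3_2(a, b, c)
    return compress_3_2(s, cy, d)


def reduce_eight(rows: list[int]) -> tuple[int, int]:
    """Two layers of 4-2 compression: eight rows -> four rows -> two rows."""
    assert len(rows) == 8
    s0, c0 = compress_4_2(*rows[:4])
    s1, c1 = compress_4_2(*rows[4:])
    return compress_4_2(s0, c0, s1, c1)
```

No carry ripples inside a compressor; the only carry-propagating addition is the final one that adds the two remaining rows.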

The adder S106 in the multiplier is built by joining narrow adders into one wide adder; whether a carry propagates between them is controlled by switching the connecting bit on or off. Each low-precision result fed into the adder is 22 bits wide, so four in parallel need 88 bits; each medium-precision result is 48 bits wide, so two in parallel need 96 bits; one high-precision result is 106 bits wide. To serve all of these widths with a convenient multiple, the wide adder is 108 bits, formed from four 27-bit narrow adders as shown in fig. 13. The final addition uses Kogge-Stone parallel adders. Because each narrow adder is 27 bits wide, four low-precision additions can run in parallel with no carry between the narrow adders; for high-precision mantissas, carry selection connects the narrow adders by their carries to provide the full width. This structure performs the final multi-precision mantissa addition.
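The carry-link control can be modeled as follows. This sketch uses plain integer addition inside each 27-bit segment instead of a Kogge-Stone prefix network, and the `links` parameter is a hypothetical stand-in for the connecting-bit control.

```python
SEG_BITS = 27
SEG_MASK = (1 << SEG_BITS) - 1


def segmented_add(a: int, b: int, links: tuple[bool, bool, bool]) -> int:
    """Add two 108-bit words as four 27-bit segments.

    links[i] True lets the carry out of segment i enter segment i + 1;
    False cuts the carry so the segments add independently.
    """
    result = 0
    carry = 0
    for i in range(4):
        seg_a = (a >> (i * SEG_BITS)) & SEG_MASK
        seg_b = (b >> (i * SEG_BITS)) & SEG_MASK
        total = seg_a + seg_b + carry
        result |= (total & SEG_MASK) << (i * SEG_BITS)
        carry = (total >> SEG_BITS) if (i < 3 and links[i]) else 0
    return result
```

Under this model, `(False, False, False)` gives four independent lanes, `(True, False, True)` gives two 54-bit lanes, and `(True, True, True)` gives one 108-bit adder.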

According to the IEEE 754 floating point standard there are four rounding rules for floating point mantissa results. The rounding decision depends on the values of three positions (sticky, round, saturation), illustrated for single-precision rounding in fig. 14. For multi-precision floating point numbers, the rounding operation must generate the rounding flag signals Hp_s, sp_s and Dp_s. These flag signals are the sticky bits; the sticky bit, the saturation bit and the round bit together determine the rounding type, which in turn decides whether the rounding operation adds 1. In this design the sticky bits of the wider precisions are built from those of the narrower ones by OR operations, and the flag for the high-precision mantissa is obtained by several such OR operations. The logic diagram is shown in fig. 15. Because the top two bits of the multiplication result have three possible values, 11, 10 and 01, normalization after rounding uses these two bits to decide whether the highest mantissa bit must be cut off before the mantissa enters the result.

The lower half of the first-order floating point multiply-add structure in fig. 5 is the two-way low-precision addition path, which realizes the two-way low-precision multiply-add operation. The exponent bit module S110 produces four groups of values, processed in the same way as the exponent processing of the multiplier part. The four groups enter the alignment module S112. Exponent alignment involves two parts, a comparator and a subtractor; it aligns the smaller exponent to the larger one and outputs the larger exponent, the exponent difference, and a flag signal indicating which exponent is larger. The larger exponent serves as the exponent of the coarse addition result, that is, the result in the two-way low-precision addition path before it passes through the normalization and rounding module. The exponent difference is the shift amount applied to the mantissa before mantissa addition, and the flag signal is used to place the larger operand at the augend position when the ALU unit S114 operates.

The exponent results from the alignment operation and the four low-precision products from the multiplier in the upper half enter the shifters. Two shifters are provided to meet the two-way multiply-add requirement. Each shifter has at least two parts, a barrel shifter and a comparator, and shifts the mantissa according to the exponent difference. The exponent comparison flag from the alignment operation and the mantissa comparison flag output by the shifter together decide which operand is placed at the augend position of the ALU unit S114. The larger operand is selected for the augend position because the mantissa of a floating point number is an unsigned fixed-point number whose sign is given only by the floating point sign bit; placing the larger number at the augend position prevents a negative result from subtracting a larger number from a smaller one. Two comparisons are needed because the case of equal exponents must be resolved by comparing the mantissas.
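The alignment and operand ordering described above can be sketched as follows. This is a simplified model with hypothetical names, assuming normalized mantissas of equal width and discarding the bits shifted out (a real datapath would keep them for rounding).

```python
def align_operands(exp_a: int, man_a: int, exp_b: int, man_b: int):
    """Order two operands by magnitude and align the smaller mantissa.

    Returns (large_exp, large_man, small_man_shifted, a_is_larger).
    Exponents are compared first; the mantissa comparison only decides
    the case of equal exponents.
    """
    a_is_larger = (exp_a, man_a) >= (exp_b, man_b)
    if a_is_larger:
        (big_e, big_m), (small_e, small_m) = (exp_a, man_a), (exp_b, man_b)
    else:
        (big_e, big_m), (small_e, small_m) = (exp_b, man_b), (exp_a, man_a)
    diff = big_e - small_e  # exponent difference = right-shift amount
    return big_e, big_m, small_m >> diff, a_is_larger
```

Because the larger operand is always returned first, a subsequent mantissa subtraction `large_man - small_man_shifted` cannot go negative under the stated assumptions.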

ALU unit S114 is used to implement addition and subtraction of floating-point mantissas and addition and subtraction of fixed-point numbers (unsigned numbers and signed numbers). ALU arithmetic units are prior art and their internal structure or implementation of the arithmetic functions are not described in detail herein.

Before the ALU result enters the addition path rounding module S116, it passes through a leading zero detection circuit (LZD), which finds the position of the first 1 in the data. The LZD is shown in figs. 16 to 18: fig. 16 is the 2-bit LZD circuit, fig. 17 is the 4-bit LZD circuit, and the 32-bit LZD circuit in fig. 18 is built from multiple 2-bit circuits in a 5-stage cascade. The LZD generates a flag signal that locates the first 1 in the mantissa of the floating point addition result. During normalization this first 1 becomes the hidden bit, the zeros in front of it are discarded, and the bits after it are placed in the mantissa field of the result. The exponent is processed accordingly: the shift amount is added to the exponent to obtain the final exponent. For example, in the data "00010101011" the first "1" is at the fourth position; removing "0001", that is, the first "1" and the zeros before it, leaves the remaining bits as the mantissa value. The mantissa has then been shifted by four bits, and the shift amount is added to the exponent to obtain the exponent of the final result.
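A hierarchical leading zero detector can be modeled recursively, each level combining the results of two half-width detectors. This is a behavioral sketch of the general technique with a hypothetical function name; it does not reproduce the gate structure of figs. 16 to 18, and it assumes the width is a power of two.

```python
def leading_zeros(bits: int, width: int) -> tuple[bool, int]:
    """Return (valid, count) for a width-bit word.

    valid is False when the word is all zeros; otherwise count is the
    number of zeros in front of the first 1.
    """
    bits &= (1 << width) - 1
    if width == 1:
        return bool(bits), 0
    half = width // 2
    hi_valid, hi_count = leading_zeros(bits >> half, half)
    lo_valid, lo_count = leading_zeros(bits, half)
    if hi_valid:
        return True, hi_count          # first 1 is in the upper half
    if lo_valid:
        return True, half + lo_count   # upper half is all zeros
    return False, 0
```

For a 32-bit word the recursion has five combining levels (2, 4, 8, 16 and 32 bits). Placing the example string "00010101011" at the top of a 16-bit word gives a count of 3, so the first 1 is at the fourth position.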

The result processed by the LZD circuit enters the addition path normalization and rounding module, which performs normalization and rounding to obtain the final result. The rounding mode is round to nearest, also called conventional rounding: the true value is rounded to the nearest representable value, and if it lies exactly halfway between two representable values, it is rounded to the one whose least significant bit is "0". If overflow occurs, the output is ∞ with the sign of the true value.
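Round to nearest with ties to even can be sketched on an integer mantissa as follows. The function name and the `drop` parameter (number of low bits to discard) are assumptions for this example; mantissa overflow after rounding and the overflow-to-infinity case are not modeled.

```python
def round_nearest_even(value: int, drop: int) -> int:
    """Discard the low `drop` bits of a non-negative value, rounding to nearest.

    A tie (discarded part exactly one half) rounds to the result whose
    least significant bit is 0.
    """
    if drop == 0:
        return value
    kept = value >> drop
    remainder = value & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1
    return kept
```

For instance, dropping two bits from 0b1010 is a tie with an even kept part and stays at 0b10, while 0b1110 is a tie with an odd kept part and rounds up to 0b100.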

The multiply-add structure contains three stages of registers: the encoding register, the compression register and the rounding register. They split the critical path, shortening it and guaranteeing the operating frequency of the module. The main functions of the multiply-add structure are independent multiplication at three precisions and the two-way low-precision multiply-add operation, both with normalization and rounding.

The adder unit (structure shown in fig. 6) is used at two positions in the DSP. Both instances perform addition, but they serve different DSP modes and differ in input bit width and in the functions they implement. The single-precision floating point adder unit S30, located after the first-order multiply-add structure S10, implements one group of half-precision floating point additions or one group of single-precision floating point additions. The variable precision floating point adder unit S40, located in the accumulation path, implements two groups of single-precision floating point additions or one group of double-precision floating point additions.

The single-precision floating point adder unit S30 behind the first-order multiply-add structure S10 is designed mainly for multiply-add operation. It is an adder unit for single-precision addition and mainly adds the two half-precision results (the low-precision outputs of the preceding two-way multiply-add) or two single-precision multiplication results, together with the corresponding fixed-point multiply-add operations. Because the widest operand is the floating point product (a 24 × 24 bit input gives a 48-bit output), the ALU in this adder has a 48-bit input width, which supports all three forms of multiply-add operation.

The variable precision floating point adder unit S40 in the accumulation path can add two single-precision floating point numbers in parallel or one double-precision floating point number. Its shifter is an improved multi-width barrel shifter that supports two bit widths: multiplexers placed in the middle of the shifter select either 0 or the data from the upper part, so that the same structure performs either one wide shift or two independent narrow shifts. Because the actual shifter is too wide to show here, the principle is illustrated with an 8-bit example in fig. 20. The shift amount is selected by the values of (b0, b1, b2), and a flag signal in the figure selects between shifting one 8-bit word and shifting two 4-bit words independently. The structure can be extended to larger widths that support two different bit widths, as shown in fig. 21.
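The 8-bit example can be modeled with the sketch below. Several points are assumptions made for illustration and are not taken from fig. 20: the shift direction (right), the use of one shared shift amount for both halves in split mode, and the names `barrel_shift_right` and `split`.

```python
def barrel_shift_right(data: int, amount: int, split: bool) -> int:
    """Three-stage logarithmic right shifter for 8-bit data.

    Stage k shifts by 2**k when bit k of `amount` (b0, b1, b2) is set.
    With split True, a 0 is selected instead of any bit that would cross
    from the upper 4-bit half into the lower one, so the two halves are
    shifted independently.
    """
    data &= 0xFF
    for stage in range(3):
        if (amount >> stage) & 1:
            step = 1 << stage
            data >>= step
            if split:
                keep = 0x0F >> step            # positions that survive in one half
                data &= keep | (keep << 4)     # same mask applied to both halves
    return data
```

For the input 0b10110100 and a shift of 3, the wide mode gives 0b00010110, while the split mode gives 0b00010000 (0b1011 >> 3 in the upper half and 0b0100 >> 3 in the lower half).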

The variable precision floating point adder unit S40 in the accumulation path mainly serves cascade addition, accumulation and independent floating point addition. Because this adder supports double-precision floating point addition, a double-precision floating point multiply-add operation can be realized by cascading two DSPs. To support independent single-precision floating point addition and multiply-accumulate at multiple precisions, several input selection paths are arranged in front of the adder so that different inputs can be chosen for the different modes.

A multiplexer group S90 is further arranged between the single-precision floating point adder unit S30 and the primary pipeline register S20 in the multiply-add path. The multiplexer group S90 selects whether the multiplication result or multiply-add result output by the first-order multiply-add structure S10 goes to the variable precision floating point adder unit S40 in the accumulation path or to the single-precision floating point adder unit S30 in the multiply-add path. For example, a double-precision floating point multiplication result goes to the variable precision floating point adder unit S40 of the accumulation path, while a single-precision multiplication result goes to the single-precision floating point adder unit S30 of the multiply-add path for addition.

The accumulation path input register S70, the multiply-add path input register S60, the primary pipeline register S20 and the secondary pipeline register S100 all use clock gating. The clock gating circuit is shown in fig. 22: the EN enable signal drives the select input of a selector that chooses between updating and holding the data, which switches the module on or off and thereby controls power consumption. When the DSP does not need a module, any of the four registers can be turned off selectively, giving dynamic control that reduces power consumption. To prevent glitches during clock gating, the gating is inserted automatically by the EDA synthesis tool during the design process. Clock gating is a common technique in the art and is not described further here.

Fig. 6 shows a block diagram of the adder unit, which includes an input register, a data processing module S301 that divides the input data input_a and input_b into sign bits, exponent bits and mantissa bits, a sign bit processing module S309, an exponent alignment module S308, a mantissa shift module S302, an LZA encoding module S310, an error correction module S311, an ALU unit S304, a preliminary shift module S305, and a shift correction module S306. The sign bit processing module S309 determines the operation type. Its circuit is shown in fig. 19, and its principle is the same as that used by the sign bit module S111 of the first-order multiply-add structure: the actual operation type is determined from the signs of the two input floating point operands and the user-specified operation type, and the result controls the operation of the ALU unit S304. Exponent alignment performs two operations: it compares the exponents and outputs a flag signal, and it calculates the exponent difference. Mantissa shifting performs three operations: it shifts the mantissa, compares the mantissas, and determines which floating point number is larger from the exponent comparison. The result decides the operand order at the ALU input, on the principle that the larger number is always placed at the augend position so that the subtraction never produces a negative result. The LZA encoding module pre-encodes the input data and uses the encoding to calculate the number of leading zeros. The error correction module outputs a one-bit correction signal for the leading zero prediction algorithm (which is realized jointly by the LZA encoding module, the error correction module, the preliminary shift module and the shift correction module). The correction signal is passed to the shift modules, which are mainly implemented as barrel shifters.

Before an operation is executed, the input data is checked for exceptional conditions, and an exception signal and the corresponding output are produced. This step is handled by the data preprocessing module S101 of the first-order multiply-add structure and the data processing module S301 of the adder unit, which evaluate the exceptional inputs of the multiplication, multiply-add and addition operations and output the related data. A floating point operation has four types of exception signal: Overflow, which indicates that the result is larger than the maximum value representable at the selected precision; Underflow, which indicates that the result is smaller than the minimum value representable at the selected precision; Inexact, which indicates that the result cannot be represented exactly at the selected precision; and Invalid (invalid operation), which indicates that the operation produces an invalid result.

Several definitions are given first. NaN (Not a Number) is defined in the IEEE 754 floating point standard to represent undefined or unrepresentable results, such as infinity minus infinity or 0 divided by 0. NaN is divided into two types: sNaN (signaling NaN) and qNaN (quiet NaN). sNaN is mainly used to handle special situations during program development and debugging. qNaN is mainly used to represent the results of invalid or undefined operations, e.g. 0 divided by 0, infinity minus infinity, or the square root of a negative number.

Denormal numbers are floating point numbers whose absolute value lies between 0 and the smallest normalized number; they represent values very close to 0. These values approach 0 in uniform steps, a property called gradual underflow, and they are handled as exceptions during processing.
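The input classes defined above can be recognized from the raw bit pattern, as in the following sketch. It assumes IEEE 754 binary16, binary32 and binary64 field widths and the usual convention that the most significant fraction bit distinguishes qNaN (1) from sNaN (0); the names `FORMATS` and `classify` are hypothetical and do not correspond to modules S101 or S301.

```python
# (exponent bits, fraction bits) for each IEEE 754 format.
FORMATS = {"half": (5, 10), "single": (8, 23), "double": (11, 52)}


def classify(bits: int, precision: str) -> str:
    """Classify a raw floating point encoding, ignoring the sign bit."""
    exp_bits, frac_bits = FORMATS[precision]
    frac = bits & ((1 << frac_bits) - 1)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    if exp == (1 << exp_bits) - 1:          # all-ones exponent
        if frac == 0:
            return "infinity"
        return "qNaN" if frac >> (frac_bits - 1) else "sNaN"
    if exp == 0:                            # all-zeros exponent
        return "zero" if frac == 0 else "denormal"
    return "normal"
```

A check of this kind on each operand is enough to select which case of the processing summary below applies.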

The following is a summary of the cases of data output for different cases of input for floating point multiplication, addition, multiply-add operations, where these abnormal input cases need to be considered in the operations.

1. The processing for normalized numbers is shown in Table 1:

TABLE 1 normalized number processing results

2. The processing for ∞ (infinity) is shown in Table 2:

Table 2 infinity processing results

3. NaN processing: if any input is a qNaN, the output is a qNaN.

4. The processing for denormal numbers is shown in Table 3:

TABLE 3 denormal number processing results

Through the design of the multiply-add structure, the low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA supports multiplication and multiply-add of four low-precision floating point numbers (FP16/BF16), multiplication and multiply-add of two single-precision floating point numbers, and one double-precision multiplication and multiply-add. It also supports multiplication and multiply-add of signed and unsigned fixed-point numbers, likewise at three precisions: four 8-bit operations, two 18-bit operations, or one 27-bit operation. The variable precision floating point adder in the accumulation path provides accumulation and chained addition. In addition, the DSP hard core structure has rich cascade selection ports, so multiple DSPs can be cascaded to realize various operations, such as double-precision floating point multiply-add, FIR filtering, vector dot product and complex multiplication.

The invention provides a low-power consumption variable precision embedded DSP hard core structure applied to an FPGA, and specific examples are applied to illustrate the structure and the working principle of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention. It should be noted that it will be apparent to those skilled in the art that various improvements and modifications can be made to the present invention without departing from the principles of the invention, and such improvements and modifications fall within the scope of the appended claims.

Claims

1. A low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA, characterized by comprising:

An accumulation path and a multiplication and addition path; the accumulation path comprises an accumulation path input register and a variable precision floating point adder unit; the multiplication and addition path comprises a multiplication and addition path input register, a first-order multiplication and addition structure and a single-precision floating point adder unit;

The multiplication and addition path input register is used for realizing a data shift register transmission function;

The multiplication and addition path input register and the register in the accumulation path input register are connected to the input port of the register multiplexer to form a register unit;

The multiplication and addition path input register is provided with four parallel input ends, and the four parallel input ends are connected in series; each input end has two input ports, and each input port comprises a register unit; each input port of an input end is provided with an input port multiplexer; the input port multiplexer of the starting DSP in a DSP array selects the initial data input from the input port of the multiplication and addition path input register of the starting DSP for transmission, so as to realize a shift register transmission function; the input port multiplexer of each cascaded non-starting DSP selects the data output by the multiplication and addition path input register of the previous DSP for transmission, so as to realize a shift register transmission function;

a preprocessing unit is arranged in the multiplication and addition path and comprises a coefficient selection unit and a pre-adder, wherein the coefficient selection unit pre-stores internal coefficients; the preprocessing unit receives the data of the multiplication and addition path input register and performs pre-addition on the input numbers according to the calculation requirement.
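By way of non-limiting illustration, the shift register transmission function recited above may be modeled behaviorally as follows. This is a minimal sketch only: the names `DspInputStage` and `clock_chain`, the single register unit per DSP, and the three-DSP array are assumptions made for the example and are not taken from the claimed structure.

```python
from dataclasses import dataclass


@dataclass
class DspInputStage:
    """One register unit behind an input port multiplexer (hypothetical model)."""
    is_start: bool   # True only for the starting DSP in the array
    reg: int = 0     # current register contents

    def mux(self, external_in: int, cascade_in: int) -> int:
        # Input port multiplexer: the starting DSP selects the external input
        # port; every non-starting DSP selects the preceding DSP's output.
        return external_in if self.is_start else cascade_in


def clock_chain(chain, external_in):
    """Advance every register by one clock edge and return the new contents."""
    # Sample all multiplexer outputs before updating any register, so each
    # register captures the pre-edge value of its neighbour.
    next_values = []
    for i, stage in enumerate(chain):
        cascade_in = chain[i - 1].reg if i > 0 else 0
        next_values.append(stage.mux(external_in, cascade_in))
    for stage, value in zip(chain, next_values):
        stage.reg = value
    return [stage.reg for stage in chain]


# Three cascaded DSPs; the first one is the starting DSP.
chain = [DspInputStage(is_start=(i == 0)) for i in range(3)]
history = [clock_chain(chain, sample) for sample in (10, 20, 30)]
print(history)
```

In this sketch each input sample advances one DSP per clock edge, which is the behavior a cascaded FIR-style data path would rely on.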

2. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 1, wherein: a primary pipeline register is arranged between the first-order multiplication and addition structure in the multiplication and addition path and the single-precision floating point adder unit;

3. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA according to claim 2, wherein: the accumulation path input register, the primary pipeline register, the secondary pipeline register and the output register each comprise a register unit.

4. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 1, wherein: a single-precision floating-point number is divided and input at two input ends, the data input at the two input ends are shifted and transmitted separately, and the data are spliced into a complete single-precision floating-point number at a shift output port of the multiplication and addition path input register, thereby realizing a shift register transmission function for single-precision floating-point numbers.
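By way of non-limiting illustration, the divide, shift and splice sequence of claim 4 may be sketched as follows. The claim does not state how the 32 bits are divided; the even 16/16 split, the two-deep lanes, and the helper names `split_fp32`, `splice_fp32` and `shift_lane` are assumptions made for the example.

```python
import struct


def split_fp32(x: float):
    """Divide an IEEE 754 single-precision value into two 16-bit halves."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16, bits & 0xFFFF


def splice_fp32(hi: int, lo: int) -> float:
    """Splice two 16-bit halves back into one single-precision value."""
    bits = ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def shift_lane(lane: list, value: int) -> int:
    """Shift one value into a lane and return the value shifted out."""
    lane.insert(0, value)
    return lane.pop()


# Two independent shift lanes, one per input end (depth 2 is an assumption).
hi_lane, lo_lane = [0, 0], [0, 0]
outputs = []
for sample in (1.5, -2.25, 0.0, 0.0):
    hi, lo = split_fp32(sample)
    # Each half is shifted separately, then spliced at the shift output port.
    outputs.append(splice_fp32(shift_lane(hi_lane, hi), shift_lane(lo_lane, lo)))
print(outputs)
```

Because both lanes have the same depth, the two halves of a given number leave their lanes on the same clock edge and can be spliced without further alignment.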

5. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 1, wherein: the preprocessing unit provides a fixed-point pre-addition function at multiple precisions.

6. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 1, wherein: at least two 18-bit pre-addition functional units are arranged in the preprocessing unit;

7. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 1, wherein: the coefficient selection unit pre-stores at least 8 coefficients, and the coefficient selected by the coefficient selection unit is output from the output port of the preprocessing unit.
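By way of non-limiting illustration, an 18-bit pre-addition and a selection among 8 pre-stored coefficients may be modeled as follows. The two's-complement wrap-around on overflow, the class name `Preprocessor`, and the coefficient values are assumptions made for the example; the claims do not specify overflow handling or coefficient contents.

```python
WIDTH = 18        # pre-adder operand width in bits
COEFF_DEPTH = 8   # minimum number of pre-stored coefficients


def to_signed(value: int, width: int = WIDTH) -> int:
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value & (1 << (width - 1)) else value


class Preprocessor:
    """Hypothetical model of a pre-adder plus coefficient selection unit."""

    def __init__(self, coefficients):
        if len(coefficients) < COEFF_DEPTH:
            raise ValueError("at least 8 coefficients must be pre-stored")
        self.coefficients = [to_signed(c) for c in coefficients]

    def pre_add(self, a: int, b: int) -> int:
        # Assumed behavior: the sum wraps to 18 bits instead of saturating.
        return to_signed(to_signed(a) + to_signed(b))

    def select_coefficient(self, index: int) -> int:
        return self.coefficients[index]


pre = Preprocessor(list(range(1, 9)))   # illustrative coefficients 1..8
summed = pre.pre_add(100, -30)
product = summed * pre.select_coefficient(3)
print(summed, product)
```

Pre-adding two samples before a single multiplication by a stored coefficient is the usual way a symmetric filter halves its multiplier count, which is one plausible use of such a unit.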

8. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 1, wherein: the multiplication and addition path input register comprises a direct output port and a preprocessing output port;

9. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 8, wherein: the first-order multiplication structure input end comprises two first-order multiplication structure multiplexers, and data output by the output port of the preprocessing unit and data output by the direct output port enter the first-order multiplication structure multiplexers for selection so as to realize the selection of whether the data needs preprocessing or not.

10. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 1, wherein: a register clock gating circuit is arranged in each of the accumulation path input register, the multiplication and addition path input register, the primary pipeline register and the secondary pipeline register, and the register clock gating circuit selectively turns off unused paths.
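By way of non-limiting illustration, the effect of register clock gating may be modeled as follows. This is a behavioral sketch only: the class name `GatedRegister` and the edge counter used as a rough stand-in for switching activity are assumptions made for the example, not a description of the gating circuit itself.

```python
class GatedRegister:
    """Hypothetical register whose clock is gated by an enable signal."""

    def __init__(self):
        self.q = 0            # stored value
        self.clock_edges = 0  # clock edges that actually reached the register

    def tick(self, d: int, enable: bool) -> int:
        # With the gate closed the register sees no clock edge, so it holds
        # its value and contributes no clock switching activity.
        if enable:
            self.q = d
            self.clock_edges += 1
        return self.q


reg = GatedRegister()
stimulus = [(1, True), (2, False), (3, False), (4, True)]
outputs = [reg.tick(d, enable) for d, enable in stimulus]
print(outputs, reg.clock_edges)
```

In this sketch the register is clocked on only two of the four cycles, which is the mechanism by which gating an unused path reduces dynamic power.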

11. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 10, wherein: the clock gating logic of the register clock gating circuit is automatically inserted and generated by an EDA synthesis tool.

12. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 10, wherein: a multiplexer group is arranged between the primary pipeline register and the first-order multiplication and addition structure in the multiplication and addition path, and the output end of the multiplexer group is also connected to the secondary pipeline register;

13. The low-power-consumption variable-precision embedded DSP hard core structure applied to an FPGA of claim 1, wherein: the adder units in the accumulation path and the multiplication and addition path each comprise an input register, a data processing module, a sign bit processing unit, an exponent alignment unit, a mantissa shifting unit, a leading-zero anticipation (LZA) encoding module, an error correction module, an ALU unit and a shifting module;

The shifting module is a barrel shifter.
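By way of non-limiting illustration, a barrel shifter and its use for exponent alignment may be sketched as follows. The 24-bit mantissa width, the function names `barrel_shift_right` and `align_mantissas`, and the omission of guard, round and sticky bits are assumptions made for the example.

```python
def barrel_shift_right(value: int, shift: int, width: int = 24) -> int:
    """Right shift built from log2 stages, each shifting by a power of two."""
    value &= (1 << width) - 1
    stage = 0
    while (1 << stage) < width:
        # Stage k shifts by 2**k when bit k of the shift amount is set.
        if (shift >> stage) & 1:
            value >>= 1 << stage
        stage += 1
    # Shift amounts beyond the range of the stages clear the result.
    return 0 if shift >> stage else value


def align_mantissas(exp_a: int, man_a: int, exp_b: int, man_b: int):
    """Shift the mantissa of the smaller-exponent operand to match the larger."""
    if exp_a >= exp_b:
        return exp_a, man_a, barrel_shift_right(man_b, exp_a - exp_b)
    return exp_b, barrel_shift_right(man_a, exp_b - exp_a), man_b


aligned = align_mantissas(5, 0x800000, 3, 0x800000)
print(aligned)
```

A barrel shifter of this form completes any shift amount in a fixed number of multiplexer stages, which is why it suits single-cycle alignment and normalization in a floating-point adder.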