CN112732220A

CN112732220A - Multiplier, method, integrated circuit chip and computing device for floating-point operation

Info

Publication number: CN112732220A
Application number: CN202011074061.4A
Authority: CN
Inventors: Inventor not disclosed
Original assignee: Anhui Cambricon Information Technology Co Ltd
Current assignee: Anhui Cambricon Information Technology Co Ltd
Priority date: 2019-10-14
Filing date: 2020-10-09
Publication date: 2021-04-30
Also published as: TWI763079B; CN112732221A; TW202115560A

Abstract

The present invention relates to a multiplier, method, integrated circuit chip and computing device for floating-point operations, wherein the computing device may be included in a combined processing device, which may also include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete the computing operation specified by the user. The combined processing device may also include a storage device, connected to the computing device and to the other processing devices respectively, for storing data of the computing device and the other processing devices. The solution of the present invention can be widely used in various floating-point data operations.

Description

Multiplier, method, integrated circuit chip and computing device for floating-point operation

Technical Field

The present disclosure relates generally to the field of floating point operations. More particularly, the present disclosure relates to methods, multipliers, integrated circuit chips and computing devices for floating point operations.

Background

In various current signal processing algorithms, such as inner product operations between vectors and convolution operations of matrices, a large number of multiply-add operations are used, and the efficiency of these multiply-add operations often depends on the execution speed of the multiplier. While current multipliers achieve significant improvements in execution efficiency, they also have room for improvement in processing floating point type data. Therefore, how to obtain a high-efficiency, low-power consumption and low-cost multiplier to perform the multiplication operation of floating-point data becomes a problem to be solved in the prior art.

Disclosure of Invention

To at least partially solve the technical problems mentioned in the background, the disclosed aspects provide a multiplier, a method, an integrated circuit chip and a computing device for floating-point operation.

In one aspect, the present disclosure provides a multiplier for performing multiplication operations on floating point numbers, wherein the multiplier comprises a mantissa processing unit for obtaining the mantissa after the multiplication operation according to the mantissas of the floating point numbers, and the mantissa processing unit comprises a control circuit for calling the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating point numbers is larger than the data bit width that the mantissa processing unit can process at one time.

In another aspect, the present disclosure provides a method for performing a floating-point number multiplication operation using a multiplier, wherein a mantissa processing unit of the multiplier is used to obtain the mantissa after the multiplication operation according to the mantissas of the floating-point numbers, and the mantissa processing unit includes a control circuit configured to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating-point numbers is greater than the data bit width that the mantissa processing unit can process at one time.

In yet another aspect, the present disclosure provides an integrated circuit chip comprising the multiplier. In one or more embodiments, the multiplier of the present disclosure may be formed as a stand-alone integrated circuit chip or disposed on an integrated circuit chip or computing device that implements operations on floating point numbers of a variety of different data formats.

With the multiplier, the corresponding operation method, the integrated circuit chip and the computing device disclosed by the invention, the operation on data of multiple floating point types can be supported without providing a plurality of independent multipliers for different floating point types of data. Therefore, the multiplier disclosed by the invention is flexible in application and can be widely applied to various floating-point data operations. In addition, when processing input data with a large bit width, the multiplier of the present disclosure supports a cyclic multiplexing operation, so that it is not necessary to arrange more processing chips, thereby also reducing the arrangement area of the integrated circuit.
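By way of illustration only, the repeated invocation of the mantissa processing unit for wide mantissas can be sketched in Python as follows. The chunking scheme, the function names (`narrow_multiply`, `split_chunks`, `wide_multiply`) and the `width` parameter are assumptions made for this sketch and are not taken from the disclosure; the sketch only shows that a unit limited to `width`-bit operands can produce a wider product when it is called once per pair of operand chunks.

```python
def narrow_multiply(a, b, width):
    """Hypothetical stand-in for one pass through the mantissa processing unit."""
    assert a < (1 << width) and b < (1 << width), "operand exceeds one-pass width"
    return a * b


def split_chunks(value, width):
    """Split a non-negative integer into width-bit chunks, least significant first."""
    mask = (1 << width) - 1
    chunks = [value & mask]
    value >>= width
    while value:
        chunks.append(value & mask)
        value >>= width
    return chunks


def wide_multiply(a, b, width):
    """Multiply two mantissas by reusing the narrow unit; returns (product, passes)."""
    product, passes = 0, 0
    for i, a_chunk in enumerate(split_chunks(a, width)):
        for j, b_chunk in enumerate(split_chunks(b, width)):
            # Each chunk product is weighted by the position of its two chunks.
            product += narrow_multiply(a_chunk, b_chunk, width) << ((i + j) * width)
            passes += 1
    return product, passes
```

Under these assumptions, two 24-bit mantissas pass through a 12-bit unit four times, while a 24-bit unit handles them in a single pass.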

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:

FIG. 1 is a schematic diagram illustrating a floating point data format according to an embodiment of the present disclosure;

FIG. 2 is a schematic block diagram illustrating a multiplier according to an embodiment of the present disclosure;

FIG. 3 is a block diagram showing more details of a multiplier according to an embodiment of the present disclosure;

FIG. 4 is a schematic block diagram illustrating a mantissa processing unit in accordance with an embodiment of the present disclosure;

FIG. 5 is a schematic diagram illustrating a partial product operation according to an embodiment of the present disclosure;

FIG. 6 is a flow and schematic block diagram illustrating the operation of a Wallace tree compressor in accordance with an embodiment of the present disclosure;

FIG. 7 is an overall schematic block diagram illustrating a multiplier in accordance with an embodiment of the present disclosure;

FIG. 8 is a flow chart illustrating a method of performing a floating point number multiply operation using a multiplier in accordance with an embodiment of the present disclosure;

FIG. 9 is a block diagram illustrating a combined processing device according to an embodiment of the present disclosure; and

FIG. 10 is a schematic diagram illustrating a structure of a board according to an embodiment of the present disclosure.

Detailed Description

The disclosed solution generally provides a multiplier, a method, an integrated circuit chip and a computing device for floating-point arithmetic. Unlike prior art floating-point arithmetic multipliers, the present disclosure provides a multiplier that supports multiple modes of operation, thereby overcoming the drawback of the prior art multiplier that can only support one type of floating-point operation. In particular, the present disclosure utilizes multiple operational modes to indicate different floating point data types, and during multiplication of floating point numbers, performs various types of operations of data based on one of the operational modes, including, for example, encoding, compression, summation, normalization, and rounding operations, to thereby implement operations associated with one of the multiple floating point data types. Therefore, the multiplier disclosed by the invention can support the operation under multiple modes, further improves the flexibility of floating-point operation and reduces the operation cost.

The technical solution of the present disclosure and various embodiments thereof will be described in detail below with reference to the accompanying drawings. It should be understood that numerous specific details are set forth with respect to floating point operations in order to provide a thorough understanding of the various embodiments of the disclosure. However, one of ordinary skill in the art, with the teachings of the present disclosure, may practice the embodiments described in the present disclosure without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to unnecessarily obscure the embodiments described in this disclosure. In addition, this description should not be taken as limiting the scope of the embodiments of the disclosure.

FIG. 1 is a schematic diagram illustrating a floating point data format 100 according to an embodiment of the present disclosure. As shown in FIG. 1, a floating point number to which the disclosed techniques may be applied may include three portions, namely a sign (or sign bit) 102, an exponent (or exponent bits) 104, and a mantissa (or mantissa bits) 106, where an unsigned floating point number has no sign or sign bit. In some embodiments, floating point numbers suitable for use in multipliers of the present disclosure may include at least one of half-precision floating point numbers, single-precision floating point numbers, brain floating point numbers, double-precision floating point numbers, and custom floating point numbers. In particular, in some embodiments, the floating point number format to which the disclosed solution may be applied may be a floating point format compliant with the IEEE 754 standard, such as a double-precision floating point number (float64, abbreviated as "FP64"), a single-precision floating point number (float32, abbreviated as "FP32"), or a half-precision floating point number (float16, abbreviated as "FP16"). In some other embodiments, the floating point format may be the existing 16-bit brain floating point format (bfloat16, abbreviated "BF16") or a custom floating point format, such as an 8-bit brain floating point (bfloat8, abbreviated "BF8"), an unsigned half-precision floating point (unsigned float16, abbreviated "UFP16"), or an unsigned 16-bit brain floating point (unsigned bfloat16, abbreviated "UBF16"). For ease of understanding, Table 1 below shows some of the data formats described above, with the sign bit width, exponent bit width, and mantissa bit width given for exemplary purposes only.

TABLE 1

Data type	Sign bit width	Exponent bit width	Mantissa bit width
FP16	1	5	10
BF16	1	8	7
FP32	1	8	23
BF8	1	5	3
UFP16	0	5 (or 6)	11 (or 10)
UBF16	0	8	8
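As a minimal illustration of how the bit widths in Table 1 determine the layout of a floating point number, the following Python sketch splits a raw bit pattern into its sign, exponent and mantissa fields. It assumes the sign occupies the most significant bit, followed by the exponent and then the mantissa, as in IEEE 754; the names `FORMATS` and `split_fields` are hypothetical, and only a subset of the formats in Table 1 is included.

```python
# (sign bits, exponent bits, mantissa bits) for a subset of the formats in Table 1.
FORMATS = {
    "FP16": (1, 5, 10),
    "BF16": (1, 8, 7),
    "FP32": (1, 8, 23),
    "UBF16": (0, 8, 8),
}


def split_fields(bits, fmt):
    """Split a raw bit pattern into (sign, exponent, mantissa) fields.

    Assumes the layout [sign | exponent | mantissa] from most to least
    significant bit. Unsigned formats report a sign of 0.
    """
    sign_w, exp_w, man_w = FORMATS[fmt]
    mantissa = bits & ((1 << man_w) - 1)
    exponent = (bits >> man_w) & ((1 << exp_w) - 1)
    sign = (bits >> (man_w + exp_w)) & 1 if sign_w else 0
    return sign, exponent, mantissa
```

For example, the FP16 pattern `0x3C00` (the value 1.0) splits into sign 0, exponent field 15 and mantissa field 0.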

For the various floating point number formats mentioned above, the multiplier of the present disclosure may, in operation, support at least a multiplication operation between two floating point numbers having any of the above-mentioned formats, where the two floating point numbers may have the same or different floating point data formats. For example, the multiplication operation between two floating-point numbers may be FP16 × FP16, BF16 × BF16, FP32 × FP32, FP32 × BF16, FP16 × BF16, FP32 × FP16, BF8 × BF16, UBF16 × UFP16, or UBF16 × FP16.

Fig. 2 is a schematic block diagram illustrating a multiplier 200 according to an embodiment of the present disclosure. As previously mentioned, the multiplier of the present disclosure supports multiplication operations of floating point numbers in a variety of data formats that may be indicated by the operational modes of the present disclosure, such that the multiplier operates in one of a plurality of operational modes.

As shown in fig. 2, the multiplier of the present disclosure may generally include an exponent processing unit 202 and a mantissa processing unit 204, where the exponent processing unit is to process exponent bits of a floating point number and the mantissa processing unit is to process mantissa bits of the floating point number. Alternatively or additionally, in some embodiments, when the floating point number processed by the multiplier has a sign bit, the multiplier may further include a sign processing unit 206, which may be used to process floating point numbers that include a sign bit.

In operation, the multiplier may perform a floating point operation on received, input or cached first and second floating point numbers having one of the floating point data formats discussed above, according to one of the operation modes. For example, when the multiplier is in the first operation mode, it may support the multiplication of two floating point numbers as FP16 × FP16, and when the multiplier is in the second operation mode, it may support the multiplication of two floating point numbers as BF16 × BF16. Similarly, when the multiplier is in the third operation mode, it may support the multiplication FP32 × FP32, and when the multiplier is in the fourth operation mode, it may support the multiplication FP32 × BF16. An example correspondence between operation modes and floating point number types is shown in Table 2 below.

TABLE 2

Operation mode number	Floating-point operand types
1	FP16 × FP16
2	BF16 × BF16
3	FP32 × FP32
4	FP32 × BF16

In one embodiment, table 2 above may be stored in a memory of the multiplier, and the multiplier selects one of the operation modes in the table according to an instruction received from an external device, such as external device 1012 shown in fig. 10. In another embodiment, the input of the operation mode may also be automatically realized via the mode selection unit 308 as shown in fig. 3. For example, when two floating point numbers of FP16 type are input to the multiplier of the present disclosure, the mode selection unit may select the multiplier to operate in the first operation mode according to the data formats of the two floating point numbers. For another example, when one FP32 type floating point number and one BF16 type floating point number are input to the multiplier of the present disclosure, the mode selection unit may select the multiplier to operate in the fourth operation mode according to the data formats of the two floating point numbers.
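The automatic mode selection described above can be sketched, purely as an illustration, as a lookup keyed by the data formats of the two operands. The names `MODE_TABLE` and `select_mode` are hypothetical, and raising an error for an unlisted format pair is an assumption of this sketch rather than behavior stated in the disclosure.

```python
# Operation modes of Table 2, keyed by (first operand format, second operand format).
MODE_TABLE = {
    ("FP16", "FP16"): 1,
    ("BF16", "BF16"): 2,
    ("FP32", "FP32"): 3,
    ("FP32", "BF16"): 4,
}


def select_mode(first_fmt, second_fmt):
    """Return the operation mode number for a pair of operand formats."""
    try:
        return MODE_TABLE[(first_fmt, second_fmt)]
    except KeyError:
        # Assumption: pairs not listed in Table 2 are rejected.
        raise ValueError(f"no operation mode for {first_fmt} x {second_fmt}") from None
```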

It can be seen that the different operational modes of the present disclosure are associated with corresponding floating point type data. That is, the operational modes of the present disclosure may be used to indicate a data format of a first floating point number and a data format of a second floating point number. In another embodiment, the operation mode of the present disclosure may indicate not only the data format of the first floating point number and the data format of the second floating point number, but also the data format after the multiplication operation. The extended operation mode in conjunction with table 2 is shown in table 3 below.

TABLE 3

Unlike the operation mode numbers shown in Table 2, the operation mode numbers in Table 3 are extended by one digit that indicates the data format after the floating-point multiplication operation. For example, when the multiplier operates in operation mode 21, it performs the floating-point operation on two input floating-point numbers as BF16 × BF16 and outputs the result of the floating-point multiplication in the FP16 data format.

The above designation of floating point data formats by numbered operation modes is merely exemplary and not limiting, and establishing indices to determine the formats of the multiplier and multiplicand according to the operation mode is also contemplated in accordance with the teachings of the present disclosure. For example, the operation mode may include two indexes, the first index indicating the type of the first floating point number and the second index indicating the type of the second floating point number. For example, the first index "1" in operation mode 13 indicates that the first floating point number (or multiplicand) is in the first floating point format, namely FP16, and the second index "3" indicates that the second floating point number (or multiplier) is in the second floating point format, namely FP32. Further, a third index may also be added to the operation mode to indicate the data format of the output result; e.g., the third index "1" in operation mode 131 may indicate that the data format of the output result is the first floating point format, namely FP16. When the number of operation modes increases, corresponding indexes or index hierarchies can be added as needed to facilitate establishing the relationship between the operation modes and the data formats.
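As a hedged sketch of the index-based scheme, the following Python function decodes an operation mode such as 13 or 131 into the formats of the two inputs and, when present, of the output. The text fixes only that index "1" denotes FP16 and index "3" denotes FP32; the mapping of index "2" to BF16, and the names `INDEX_TO_FORMAT` and `decode_mode`, are assumptions made for illustration.

```python
# Index "1" -> FP16 and "3" -> FP32 follow the text; "2" -> BF16 is an assumption.
INDEX_TO_FORMAT = {1: "FP16", 2: "BF16", 3: "FP32"}


def decode_mode(mode):
    """Decode a two- or three-index operation mode.

    Returns (first operand format, second operand format, output format),
    where the output format is None when the mode carries no third index.
    """
    digits = [int(d) for d in str(mode)]
    if len(digits) not in (2, 3):
        raise ValueError("operation mode must have two or three indexes")
    formats = [INDEX_TO_FORMAT[d] for d in digits]
    output = formats[2] if len(formats) == 3 else None
    return formats[0], formats[1], output
```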

In addition, although the operation mode is exemplarily designated by a number, in other examples the operation mode may be designated by other symbols or codes according to application requirements, for example by letters, symbols, numbers, or combinations thereof, and such an expression designates the operation mode and identifies the data formats of the first floating point number, the second floating point number and the output result. Additionally, when the expression takes the form of an instruction, the instruction may include three fields: a first field to indicate the data format of the first floating point number, a second field to indicate the data format of the second floating point number, and a third field to indicate the data format of the output result. Of course, these fields may be combined into one field, or a new field may be added to indicate more content related to the floating point data format. It can be seen that the operation modes of the present disclosure can be associated not only with the data formats of the input floating point numbers, but can also be used to normalize the output result so as to obtain a product in a desired data format.

Fig. 3 is a block diagram illustrating a more detailed structure of a multiplier 300 according to an embodiment of the present disclosure. As can be seen from the illustration of fig. 3, it not only includes exponent processing unit 202, mantissa processing unit 204, and optional sign processing unit 206 shown in fig. 2, but also illustrates internal components that these units may include and units related to the operation of these units, exemplary operations of which are described in detail below in connection with fig. 3.

In order to perform the multiplication operation of the floating point number, the exponent processing unit may be configured to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating point number, and the exponent of the second floating point number. In one embodiment, the exponent processing unit may be implemented by an addition and subtraction circuit. For example, the exponent processing unit may be configured to add the exponent of the first floating point number, the exponent of the second floating point number, and the corresponding offset value of the input floating point data format, and then subtract the offset value of the output floating point data format to obtain the multiplied exponent of the first floating point number and the second floating point number.
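The exponent path can be illustrated with the following Python sketch. It assumes the IEEE 754 convention in which the stored exponent field equals the true exponent plus a positive bias of 2^(w-1) - 1 for a w-bit exponent, so that the biases of the input formats are removed and the bias of the output format is applied; the sign attached to each offset value in the paragraph above depends on how that offset is defined. The names `EXP_BITS`, `bias` and `product_exponent` are hypothetical, and overflow, underflow and non-normalized inputs are not handled.

```python
EXP_BITS = {"FP16": 5, "BF16": 8, "FP32": 8}


def bias(fmt):
    """IEEE 754 style exponent bias for a format: 2**(w - 1) - 1."""
    return (1 << (EXP_BITS[fmt] - 1)) - 1


def product_exponent(exp_a, fmt_a, exp_b, fmt_b, fmt_out):
    """Biased exponent of the product, before any normalization adjustment."""
    true_exp = (exp_a - bias(fmt_a)) + (exp_b - bias(fmt_b))
    return true_exp + bias(fmt_out)
```

For instance, multiplying an FP32 value with exponent field 128 (true exponent 1) by a BF16 value with exponent field 129 (true exponent 2) yields the FP32 exponent field 130.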

Further, the mantissa processing unit of the multiplier may be configured to obtain the multiplied mantissa according to the aforementioned operation mode, the first floating point number, and the second floating point number. In one embodiment, the mantissa processing unit may include a partial product operation unit 312 to obtain an intermediate result from a mantissa of the first floating point number and a mantissa of the second floating point number, and a partial product summation unit 314. In some embodiments, the intermediate result may be a plurality of partial products obtained during a multiplication operation of the first floating point number and the second floating point number (as schematically illustrated in fig. 5 and 6). The partial product summing unit is used for summing the intermediate results to obtain a summed result, and taking the summed result as a mantissa after the multiplication operation.

To obtain the intermediate result, in one embodiment, the present disclosure uses a Booth encoding circuit to pad the high and low bits of the mantissa of the second floating point number (e.g., acting as the multiplier in the floating point operation) with 0, where padding the high bits with 0 converts the mantissa from an unsigned number into a signed number. It is to be understood that, depending on the encoding method, the mantissa of the first floating-point number (e.g., serving as the multiplicand in the floating-point operation) may instead be encoded (e.g., with its high and low bits padded with 0), or both mantissas may be encoded, to obtain a plurality of partial products. More description of the partial products is given later in conjunction with the accompanying drawings.

In another embodiment, the partial product summing unit may comprise an adder for summing the intermediate results to obtain the summed result. In a further embodiment, the partial product summing unit comprises a Wallace tree for summing the intermediate results to obtain a second intermediate result, and an adder for summing the second intermediate result to obtain the summed result. In these embodiments, the adder may include at least one of a full adder, a serial adder, and a carry-look-ahead adder.
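The two-step summation (compression followed by a final addition) can be sketched at word level in Python as follows. This is a functional illustration rather than a gate-level model: each call to the hypothetical `carry_save` function behaves like a row of 3:2 compressors (full adders) applied to three partial products, and `compress_and_add` repeats such layers until two rows remain, which a final adder then sums. The sketch assumes non-negative partial products.

```python
def carry_save(a, b, c):
    """3:2 compression of three words: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def compress_and_add(partial_products):
    """Reduce the rows in layers of 3:2 compressors, then add the last two rows."""
    rows = list(partial_products)
    while len(rows) > 2:
        next_rows = []
        # Compress rows three at a time; leftover rows pass to the next layer.
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save(rows[i], rows[i + 1], rows[i + 2]))
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    # The final addition stands in for the adder that follows the tree.
    return sum(rows)
```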

In one embodiment, the multiplier of the present disclosure further includes a regularization unit 318 and a rounding unit 320. The regularization unit may be configured to perform floating-point regularization on the multiplied mantissa and exponent to obtain a regularized exponent result and a regularized mantissa result, and to use these as the multiplied exponent and the multiplied mantissa. For example, depending on the data format indicated by the operation mode, the regularization unit may adjust the bit widths of the exponent and mantissa to conform to the requirements of that data format. In addition, the regularization unit may also make other adjustments to the exponent or mantissa. For example, in some application scenarios, when the value of the mantissa is not 0, the most significant bit of the mantissa should be 1; otherwise, the exponent bits may be modified and the mantissa bits shifted at the same time to bring the number into normalized form. In another embodiment, the regularization unit may further adjust the multiplied exponent according to the multiplied mantissa. For example, when the most significant bit of the mantissa after the multiplication is 1, 1 may be added to the exponent obtained after the multiplication. Accordingly, the rounding unit may be configured to perform a rounding operation on the regularized mantissa result according to a rounding mode, and to take the rounded mantissa as the mantissa after the multiplication operation. Depending on the application scenario, the rounding unit may perform rounding operations including, for example, rounding down, rounding up, and rounding to the nearest representable value. In some application scenarios, the rounding unit may also round the 1s shifted out during a right shift of the mantissa.
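A minimal Python sketch of the regularization and rounding steps is given below. It assumes that both inputs are normalized significands of `man_bits + 1` bits (with the leading 1 made explicit), that the output uses the same mantissa width as the inputs with `man_bits` of at least 1, and that the rounding mode is round-to-nearest with ties to even; these choices and the name `normalize_and_round` are illustrative assumptions, since the disclosure leaves the rounding mode open.

```python
def normalize_and_round(product, exponent, man_bits):
    """Normalize and round the integer product of two significands.

    `product` is the product of two (man_bits + 1)-bit significands whose
    leading bit is 1. Returns (adjusted exponent, man_bits-bit mantissa field).
    """
    shift = man_bits
    if product >> (2 * man_bits + 1):
        # Most significant bit of the product is 1: the value lies in [2, 4),
        # so shift one extra position and add 1 to the exponent.
        shift += 1
        exponent += 1
    kept = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest, ties to even.
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1
        if kept >> (man_bits + 1):
            # Rounding carried out of the significand; renormalize.
            kept >>= 1
            exponent += 1
    return exponent, kept & ((1 << man_bits) - 1)
```

For example, with `man_bits = 10` the product of the significands of 1.5 and 1.5 (each 1536) with exponent field 15 normalizes to exponent field 16 and mantissa field 128, which is the FP16 encoding of 2.25.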

In addition to the exponent processing unit and the mantissa processing unit, the multiplier of the present disclosure may optionally include a sign processing unit, which may be configured to obtain the sign after the multiplication operation from the sign of the first floating point number and the sign of the second floating point number when the input floating point numbers have a sign bit. For example, in one embodiment, the sign processing unit may include an exclusive-OR logic circuit 322 configured to perform an exclusive-OR operation on the sign of the first floating point number and the sign of the second floating point number to obtain the sign after the multiplication operation. In another embodiment, the sign processing unit may also be implemented by a truth table or logic determination.
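Both variants of the sign logic can be illustrated in a few lines of Python; the names `product_sign` and `SIGN_TRUTH_TABLE` are hypothetical, and a sign bit of 1 is assumed to denote a negative number.

```python
def product_sign(sign_a, sign_b):
    """Sign bit of the product: exclusive OR of the two input sign bits."""
    return sign_a ^ sign_b


# Equivalent truth-table form: (sign_a, sign_b) -> sign of the product.
SIGN_TRUTH_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```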

In addition, in order to make the input or received first and second floating point numbers conform to a prescribed format, in one embodiment, the multiplier of the present disclosure may further include a normalization processing unit 324 which, when the first floating point number or the second floating point number is a non-normalized non-zero floating point number, normalizes that floating point number according to the operation mode to obtain the corresponding exponent and mantissa. For example, when the selected operation mode is the second operation mode shown in Table 2 and the input first and second floating point numbers are FP16 type data, the FP16 type data may be normalized to BF16 type data by the normalization processing unit so that the multiplier operates in the second operation mode. In one or more embodiments, the normalization processing unit may be further configured to pre-process (e.g., expand) the mantissas of normalized floating-point numbers, which have an implicit 1, and of non-normalized floating-point numbers, which have no implicit 1, to facilitate subsequent operation of the mantissa processing unit. Based on the above description, it will be appreciated that the normalization processing unit 324 and the aforementioned regularization unit 318 may perform the same or similar operations in some embodiments, except that the normalization processing unit 324 normalizes the input floating point data, whereas the regularization unit 318 regularizes the mantissa and exponent to be output.
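The mantissa expansion mentioned above can be sketched in Python as follows, assuming the IEEE 754 convention that an exponent field of 0 marks a non-normalized number, which has no implicit 1 and is scaled as if its exponent field were 1. The name `expand_mantissa` is hypothetical, and infinities and NaNs (all-ones exponent field) are not treated.

```python
def expand_mantissa(exp_field, man_field, man_bits):
    """Make the implicit leading bit explicit.

    Returns (effective exponent field, significand of man_bits + 1 bits).
    """
    if exp_field == 0:
        # Non-normalized number: no implicit 1; scale as for exponent field 1.
        return 1, man_field
    # Normalized number: prepend the implicit 1 above the mantissa field.
    return exp_field, (1 << man_bits) | man_field
```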

The multiplier of the present disclosure and its various embodiments are described above in conjunction with FIG. 3. Based on the above description, those skilled in the art can understand that the scheme of the present disclosure obtains the result of the multiplication operation (including the exponent, the mantissa, and the optional sign) through the execution of the multiplier. Depending on the application scenario, for example when the foregoing regularization and rounding are not required, the result obtained by the mantissa processing unit and the exponent processing unit may be regarded as the final operation result. Where the foregoing regularization and rounding are required, the exponent and mantissa obtained after regularization and rounding may be regarded as the final operation result or as a part of the final operation result (when the final sign is considered). Further, through its multiple operation modes, the scheme of the present disclosure enables the multiplier to support operations on floating point numbers of different types or data formats, so that the multiplier can be reused, which saves both chip design overhead and computation cost. In addition, the multiplier of the present disclosure also supports the calculation of floating point numbers of high bit widths through a multiple-call mechanism. Since the multiplication of the mantissas (or mantissa bits or mantissa portions) is critical to the performance of the overall floating-point multiply operation, the mantissa operation of the present disclosure will be described below in conjunction with FIG. 4.

FIG. 4 is a schematic block diagram illustrating mantissa processing unit operations 400 in accordance with an embodiment of the present disclosure. As shown in fig. 4, the mantissa processing operations of the present disclosure may primarily involve two units, namely the partial product operation unit and the partial product summation unit discussed above in connection with fig. 3. From an operational timing perspective, the mantissa processing operation may be generally divided into a first stage in which the mantissa processing operation will obtain an intermediate result and a second stage in which the mantissa processing operation will obtain the mantissa result output from the adder 408.

In an exemplary specific operation, the first and second floating point numbers received by the multiplier may be divided into a plurality of portions, namely the aforementioned sign (optional), exponent, and mantissa. After optional normalization, the mantissa portions of the two floating point numbers are input into a mantissa processing unit (such as the mantissa processing unit in FIG. 2 or FIG. 3), and specifically into the partial product operation unit. As shown in FIG. 4, the present disclosure uses the Booth encoding circuit 402 to pad the high and low bits of the mantissa of the second floating-point number (i.e., the multiplier in the floating-point operation) with 0 and to perform the Booth encoding process, so that the intermediate result is obtained in the partial product generating circuit 404. Of course, the first floating point number and the second floating point number are used herein for illustrative purposes only and are not limiting, and thus in some application scenarios the first floating point number may be the multiplier and the second floating point number may be the multiplicand. Accordingly, in some encoding processes, encoding operations may also be performed on the floating point number that serves as the multiplicand.

For a better understanding of the technical solution of the present disclosure, Booth encoding is briefly introduced below. Generally, when two binary numbers are multiplied, the multiplication generates a large number of intermediate results called partial products, which are then accumulated to obtain the final product of the two binary numbers. The larger the number of partial products, the larger the area and power consumption of the array multiplier, the slower its execution speed, and the more difficult it is to implement the circuit. The objective of Booth encoding is to effectively reduce the number of partial-product summation terms, thereby reducing the circuit area. The algorithm first encodes the input multiplier according to a corresponding rule; in one embodiment, the encoding rule may be, for example, the rule shown in Table 4 below:

TABLE 4

In Table 4, y_{2i+1}, y_{2i} and y_{2i-1} may represent the values of each group of sub-data to be encoded (i.e., of the multiplier), and X may represent the mantissa of the first floating-point number (i.e., the multiplicand). After the Booth encoding processing is performed on each group of data to be encoded, a corresponding encoded signal PPi (i = 0, 1, 2, ..., n) is obtained. As schematically shown in Table 4, the encoded signals obtained after Booth encoding may fall into five classes, namely -2X, 2X, -X, X, and 0. Illustratively, based on the encoding rules described above, if the received multiplicand is the 8-bit data "X₇X₆X₅X₄X₃X₂X₁X₀", the following partial products can be obtained:

1) When the multiplier bits include the three consecutive bits "001" in the above table, the partial product is X, which can be expressed as "X₇X₆X₅X₄X₃X₂X₁X₀" with a 9th bit serving as the sign bit, i.e., PPi = {X[7], X};

2) When the multiplier bits include the three consecutive bits "011" in the above table, the partial product is 2X, which can be expressed as X shifted left by one bit, giving "X₇X₆X₅X₄X₃X₂X₁X₀0", i.e., PPi = {X, 0};

3) When the multiplier bits include the three consecutive bits "101" in the above table, the partial product is -X, which can be obtained by inverting "X₇X₆X₅X₄X₃X₂X₁X₀" bit by bit and then adding 1, i.e., PPi = ~{X[7], X} + 1;

4) When the multiplier bits include the three consecutive bits "100" in the above table, the partial product is -2X, which can be obtained by shifting "X₇X₆X₅X₄X₃X₂X₁X₀" left by one bit, inverting the result bit by bit and then adding 1, i.e., PPi = ~{X, 0} + 1;

5) When the multiplier bits include the three consecutive bits "111" or "000" in the above table, the partial product is 0, i.e., PPi = 9'b0.

It should be understood that the above description of the process of obtaining partial products in conjunction with Table 4 is merely exemplary and not limiting, and that one skilled in the art, given the teachings of this disclosure, may change the rules in Table 4 to obtain partial products other than those shown in Table 4. For example, when there is a specific number of consecutive bits (e.g., 3 bits or more) in the multiplier, the resulting partial product may be the complement of the multiplicand, or the "add 1" operation in items 3) and 4) above may be performed, for example, after the partial products have been summed.
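As a rough illustration of the encoding rule described in items 1) to 5), the following Python sketch recodes a multiplier into radix-4 Booth digits and sums the shifted partial products. The function names `booth_digits` and `booth_multiply` are hypothetical. The digits for the triplets "010" (X) and "110" (-X), which the items above do not list, are an assumption taken from the standard radix-4 Booth rule, and arbitrary-precision integers stand in for the 9-bit hardware signals.

```python
def booth_digits(y, width):
    """Recode a `width`-bit two's-complement multiplier y into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2} and is derived from the bit triplet
    (y_{2i+1}, y_{2i}, y_{2i-1}), with y_{-1} taken as 0. `width` must be even.
    """
    if width % 2:
        raise ValueError("width must be even")
    digits = []
    prev = 0  # y_{2i-1}; equals y_{-1} = 0 for the first group
    for i in range(0, width, 2):
        lo = (y >> i) & 1        # y_{2i}
        hi = (y >> (i + 1)) & 1  # y_{2i+1}
        digits.append(-2 * hi + lo + prev)
        prev = hi
    return digits


def booth_multiply(x, y, width):
    """Sum the partial products d_i * x * 4**i; the result is x times the signed value of y."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth_digits(y, width)))
```

For an 8-bit multiplier this yields four partial products instead of eight. Because the mantissa is extended with a leading 0 before encoding, its signed and unsigned values coincide.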

As can be appreciated from the introductory description above, by encoding the mantissa of the second floating point number with a Booth encoding circuit and using the mantissa of the first floating point number, a plurality of partial products may be generated by the partial product generation circuit as intermediate results, and the intermediate results may be input to a Wallace tree compressor 406 in the partial product summing unit. It should be understood that the use of Booth encoding to obtain partial products is only one preferred way of obtaining partial products in the present disclosure, and that one skilled in the art may obtain the partial products in other ways. For example, the partial products may be obtained by a shift operation, i.e., for each bit of the multiplier, the shifted multiplicand is selected as the partial product when the bit is 1 and 0 is selected when the bit is 0. Similarly, using the Wallace tree compressor to add the partial products is also exemplary only and not limiting, and those skilled in the art will recognize that other types of adders may be used to implement the partial product addition. The adder may be, for example, one or more full adders, half adders, or various combinations of the two.
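The shift-based alternative mentioned above can be sketched in a few lines of Python. The function name `shift_add_partial_products` is hypothetical, and the sketch assumes unsigned operands.

```python
def shift_add_partial_products(x, y, width):
    """Return one partial product per multiplier bit.

    Bit k of y selects either x shifted left by k (bit is 1) or 0 (bit is 0).
    """
    return [(x << k) if (y >> k) & 1 else 0 for k in range(width)]
```

This produces one row per multiplier bit, so an 8-bit multiplier yields eight rows to be summed.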

Regarding the Wallace tree compressor (or Wallace tree for short), it is mainly used to sum the above-mentioned intermediate results (i.e., the plurality of partial products) so as to reduce the number of accumulations of the partial products (i.e., to compress them). Generally, a Wallace tree compressor may employ a carry-save adder (CSA) architecture and the Wallace tree algorithm, and computation with a Wallace tree array is much faster than conventional carry-propagate addition.

Specifically, the Wallace tree compressor can calculate the sum of partial products of each row in parallel, for example, the accumulated number of N partial products can be reduced from N-1 to Log₂N times, thereby improving the speed of the multiplier and having important significance for the effective utilization of resources. The Wallace tree compressor can be designed into various types according to different application requirements, such as 7-2 Wallace trees, 4-2 Wallace trees, 3-2 Wallace trees and the like. In one or more embodiments, the present disclosure uses a 7-2 Wallace tree as an example of various floating point operations to implement the present disclosure, which will be described in detail later in conjunction with FIGS. 5 and 6.

In some embodiments, each Wallace tree of the present disclosure may be arranged to have M inputs and N outputs, and the number of Wallace trees may be no less than K, where N is a preset positive integer less than M and K is a positive integer no less than the maximum bit width of the intermediate results. For example, M may be 7 and N may be 2, i.e., a 7-2 Wallace tree as described in more detail below. When the maximum bit width of the intermediate results is 48, K may be 48, that is, the number of Wallace trees may be 48.

In some embodiments, one or more groups of the Wallace trees may be selected to sum the intermediate results according to the operation mode, where each group has X Wallace trees and X is the number of bits of the intermediate results. Further, the Wallace trees within each group are connected in sequence by carries, while no carry relationship exists between groups. In an exemplary connection, the Wallace tree compressors are chained through their carry bits: the carry output of a lower-order Wallace tree compressor is passed to the next higher-order Wallace tree compressor as its carry input (e.g., C_in in FIG. 6), and the carry output (C_out) of that higher-order compressor in turn becomes the carry input of the compressor one order above it. In addition, when one or more Wallace trees are selected from the plurality of Wallace tree compressors, the selection may be arbitrary; for example, the trees may be connected in the order 0, 1, 2, 3 or in the order 0, 2, 4, 6, as long as the selected Wallace tree compressors are connected according to the carry relationship described above.

The above Wallace tree and its operation are described below in connection with an illustrative example. Assume that the first and second floating point numbers are 16-bit data (e.g., FP16*FP16), that the bit width of the data supported by the multiplier is 32 bits (thereby supporting two sets of 16-bit multiplications in parallel), and that the Wallace tree is a 7-2 Wallace tree compressor with 7 inputs (i.e., one example value of M above) and 2 outputs (i.e., one example value of N above). In this example scenario, 48 Wallace trees (i.e., one example value of K above) may be employed to perform the multiplication of the two sets of data in parallel.

Among the 48 Wallace trees, the 0th to 23rd Wallace trees (i.e., the 24 Wallace trees in the first group) can complete the partial product summation of the first multiplication, and the Wallace trees in this group are connected in sequence by carries. Further, the 24th to 47th Wallace trees (i.e., the 24 Wallace trees in the second group) can complete the partial product summation of the second multiplication, and the Wallace trees in this group are likewise connected in sequence by carries. In addition, no carry relationship exists between the 23rd Wallace tree in the first group and the 24th Wallace tree in the second group, that is, no carry relationship exists between Wallace trees of different groups.
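The grouping described above can be modeled as two independent 24-bit lanes inside one 48-bit datapath. The Python sketch below is only an approximation under that assumption: it uses a row-wise 3:2 carry-save step rather than per-column 7-2 cells, and the names `csa_lanes`, `add_lanes` and `pack` are hypothetical. The key point is that a carry generated in column 23 is dropped instead of entering column 24.

```python
LANE_WIDTH = 24  # Wallace trees per group, as in the FP16*FP16 example
LANES = 2        # two groups, 48 columns in total
LANE_MASK = (1 << LANE_WIDTH) - 1
FULL_MASK = (1 << (LANE_WIDTH * LANES)) - 1
# Lowest column of every group; a carry must never enter these columns.
LANE_LSBS = sum(1 << (LANE_WIDTH * k) for k in range(LANES))


def pack(low, high):
    """Place two 24-bit values side by side in one 48-bit word."""
    return (low & LANE_MASK) | ((high & LANE_MASK) << LANE_WIDTH)


def csa_lanes(a, b, c):
    """3:2 carry-save step whose carries do not cross a group boundary."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    carry &= FULL_MASK & ~LANE_LSBS  # block the carry from column 23 into 24
    return s, carry


def add_lanes(a, b):
    """Final addition performed separately in each group (modulo 2**24)."""
    out = 0
    for k in range(LANES):
        shift = LANE_WIDTH * k
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        out |= (lane_sum & LANE_MASK) << shift
    return out
```

In the test values below, the low lane overflows on purpose, and the high lane result is unaffected by that overflow.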

Returning to fig. 4, after the partial products are summed and compressed by the wallace tree compressor, the compressed partial products are summed by an adder to obtain the result of the mantissa multiplication operation. Regarding the adder, in one or more embodiments of the present disclosure, it may include one of a full adder, a serial adder, and a carry-look-ahead adder for performing a summation operation on the last two rows of partial products resulting from the summation performed by the wallace tree compressor to obtain a result of the mantissa multiplication operation.

It will be appreciated that the result of the mantissa multiplication operation may be efficiently obtained by the mantissa multiplication operation illustrated in fig. 4, particularly by exemplary use of booth encoding and wallace trees. Specifically, the Booth coding process can effectively reduce the number of partial product summation terms, thereby reducing the circuit area, and the Wallace compression tree can calculate the sum of partial products of each row in parallel, thereby improving the speed of the multiplier.

An exemplary operation of the partial products and the 7-2 Wallace tree is described in detail below in conjunction with FIGS. 5 and 6. It is to be understood that this description is illustrative rather than restrictive, and is intended only to provide a better understanding of the aspects of the disclosure.

Fig. 5 shows partial products 500 obtained after passing through the partial product generation circuit in the mantissa processing unit described above in connection with Figs. 2-4, shown as the four rows of white dots between the two dotted lines in the figure, where each row of white dots identifies one partial product. To facilitate the subsequent Wallace tree compression, the bit width may be extended in advance. For example, the black dots in Fig. 5 are replicated copies of the most significant bit of each 9-bit partial product, and it can be seen that the partial products are extended and aligned to 16 (8+8) bits (i.e., the 8-bit width of the multiplicand mantissa plus the 8-bit width of the multiplier mantissa). In another embodiment, for example for a 25 × 13 binary multiplication, the partial products are extended to 38 (25+13) bits (i.e., the 25-bit width of the multiplicand mantissa plus the 13-bit width of the multiplier mantissa).
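The bit extension described above amounts to sign extension: the most significant bit of a partial product is replicated until the target width is reached. A minimal Python sketch follows; the helper names `sign_extend` and `to_signed` are hypothetical, and the left shift that aligns each partial product to its column is left out for brevity.

```python
def sign_extend(value, from_bits, to_bits):
    """Replicate the MSB of a `from_bits`-wide value up to `to_bits` bits."""
    msb = (value >> (from_bits - 1)) & 1
    if msb:
        # Fill bits from_bits .. to_bits-1 with copies of the sign bit.
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value & ((1 << to_bits) - 1)


def to_signed(value, bits):
    """Interpret a `bits`-wide pattern as a two's-complement integer."""
    return value - (1 << bits) if value >> (bits - 1) else value
```

Extending a 9-bit partial product to 16 bits changes its bit pattern but not the signed value it represents.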

FIG. 6 is a flow and schematic block diagram 600 illustrating the operation of a Wallace tree compressor in accordance with an embodiment of the present disclosure.

As shown in FIG. 6, when a multiplication operation is performed on the mantissas of two floating-point numbers, the 7 partial products shown in FIG. 6 may be obtained by Booth-encoding the multiplier and using the multiplicand, for example, as previously described. The number of partial products generated is reduced due to the use of the Booth encoding algorithm. For ease of understanding, a Wallace tree consisting of 7 elements is identified by a dashed box in the partial product portion of the figure, and the process of compressing from 7 elements to 2 elements is further shown by arrows. In one embodiment, the compression process (or summation process) can be implemented by means of full adders, each of which takes three elements as input and outputs two elements (i.e., a sum bit "sum" and a carry bit "carry" passed to the next higher position). A schematic block diagram of a 7-2 Wallace tree compressor is shown on the right side of FIG. 6; it is to be understood that the Wallace tree compressor has 7 inputs taken from one column of the partial products (the seven elements identified in the dashed box on the left side of FIG. 6). In operation, the carry input of the Wallace tree of column 0 is 0, and the carry output C_out of each column's Wallace tree is used as the carry input C_in of the Wallace tree of the next column.

As can be seen from the left part of fig. 6, the wallace tree including 7 elements can be compressed to include 2 elements after four times of compression. As previously mentioned, the present disclosure utilizes a 7-2 wallace tree compressor to finally compress the partial product of 7 rows into a partial product having two rows (i.e., the second intermediate result of the present disclosure), and utilizes an adder (e.g., a carry-look-ahead adder) to obtain the mantissa result.
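The four compression steps mentioned above can be reproduced with a small Python sketch that applies layers of full-adder (3:2) steps to whole rows until two rows remain. This is a simplification: it works on arbitrary-precision non-negative integers rather than on per-column 7-2 cells with C_in and C_out, and the names `csa` and `compress_to_two` are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry-save step: a + b + c == s + carry, with no carry propagation."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def compress_to_two(rows):
    """Apply layers of 3:2 steps until two rows remain; return (rows, layers)."""
    rows = list(rows)
    layers = 0
    while len(rows) > 2:
        nxt = []
        full = len(rows) - len(rows) % 3
        for i in range(0, full, 3):
            nxt.extend(csa(*rows[i:i + 3]))
        nxt.extend(rows[full:])  # leftover rows pass through unchanged
        rows = nxt
        layers += 1
    return rows, layers
```

Starting from 7 rows, the row count goes 7, 5, 4, 3, 2, which is four layers; the two remaining rows are then added by a conventional adder.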

To further illustrate the principles of the disclosed scheme, the following describes by way of example how the multiplier of the present disclosure performs the first-stage operations in four operation modes, FP16*FP16, BF16*BF16, FP32*FP32 and FP32*BF16, i.e., up to the point where the Wallace tree compressor sums the intermediate results to obtain a second intermediate result:

(1)FP16*FP16

In this operation mode of the multiplier, the mantissa of the floating point number is 10 bits; considering denormalized non-zero numbers under the IEEE754 standard, it can be extended by 1 bit so that the mantissa is 11 bits. In addition, since the mantissa is an unsigned number, a 1-bit 0 can be added at the high-order end when the Booth encoding algorithm is adopted, so that the total mantissa width is 12 bits. When the second floating point number, i.e., the multiplier, is Booth-encoded with reference to the first floating point number, the partial product generating circuit produces 7 partial products in each of the high and low parts, where the seventh partial product is 0 and the bit width of each partial product is 24 bits. The compression can then be performed by 48 7-2 Wallace trees, and the carry from the 23rd to the 24th Wallace tree is 0.

(2)BF16*BF16

In this operation mode of the multiplier, the mantissa of the floating-point number is 7 bits; considering denormalized non-zero numbers under the IEEE754 standard and the extension to a signed number, the mantissa may be extended to 9 bits. When the second floating point number, i.e., the multiplier, is Booth-encoded with reference to the first floating point number, the partial product generating circuit produces 7 effective partial products in each of the high and low parts, where the 6th and 7th partial products are 0 and the bit width of each partial product is 18 bits. Compression is performed by two groups of 7-2 Wallace trees, the 0th to 17th and the 24th to 41st, and the carry from the 23rd to the 24th Wallace tree is 0.

(3)FP32*FP32

In this operation mode of the multiplier, the mantissa of the floating-point number may be 23 bits; considering denormalized non-zero numbers under the IEEE754 standard and the extension to a signed number, the mantissa may be extended to 25 bits. To save area of the multiplication unit, the bit width supported by the multiplier can, for example, be designed to be small, and the multiplier of the present disclosure can be called twice to complete one operation in this operation mode. Each mantissa multiplication is therefore 25 bits by 13 bits: the mantissa of the first floating point number ina is extended by a 1-bit 0 to become a 25-bit signed number, and the 24-bit mantissa of the second floating point number inb is split into a high part and a low part of 12 bits each, each of which is extended by a 1-bit 0 to obtain two 13-bit multipliers, denoted inb_high13 and inb_low13. Specifically, the first call of the multiplier of the present disclosure computes ina * inb_low13, and the second call computes ina * inb_high13. In each calculation, 7 effective partial products are generated through Booth encoding, the bit width of each partial product is 38 bits, and the partial products are compressed by the 0th to 37th 7-2 Wallace trees.
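The two-call scheme for FP32*FP32 can be checked numerically with the Python sketch below. The function name `fp32_mantissa_product` is hypothetical, and the way the two call results are merged (shifting the second result left by 12 bits and adding) is an assumption; this passage describes only the two calls themselves.

```python
def fp32_mantissa_product(ina, inb):
    """Multiply two 24-bit mantissas through two 25 x 13-bit style calls."""
    if not (0 <= ina < 1 << 24 and 0 <= inb < 1 << 24):
        raise ValueError("mantissas must fit in 24 bits")
    # Each 12-bit half gets a leading 0, so as a 13-bit signed number it is non-negative.
    inb_low13 = inb & 0xFFF
    inb_high13 = (inb >> 12) & 0xFFF
    first = ina * inb_low13    # first call:  ina * inb_low13
    second = ina * inb_high13  # second call: ina * inb_high13
    return first + (second << 12)  # assumed recombination of the two results
```

Because both halves are zero-extended, treating them as signed 13-bit operands does not change their value.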

(4)FP32*BF16

In this operation mode of the multiplier, the mantissa of the first floating point number ina is 23 bits and the mantissa of the second floating point number inb is 7 bits. Considering denormalized non-zero numbers under the IEEE754 standard and the extension to signed numbers, the mantissas can be extended to 25 bits and 9 bits respectively, and a 25-bit by 9-bit multiplication is performed to obtain 7 effective partial products, of which the 6th and 7th partial products are 0. The bit width of each partial product is 34 bits, and compression is performed by the 0th to 33rd Wallace trees.
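The figures quoted for the four modes follow a common pattern, which the Python sketch below reproduces. The operand widths are taken from the text; the rule that radix-4 Booth encoding generates ceil(multiplier bits / 2) partial products, with the remaining rows of the 7-input tree set to 0, is an inference that happens to match every mode above. The names `MODES` and `mode_summary` are hypothetical.

```python
import math

# (multiplicand bits, multiplier bits) after extension, per call, as given in the text.
MODES = {
    "FP16*FP16": (12, 12),
    "BF16*BF16": (9, 9),
    "FP32*FP32": (25, 13),  # one of the two calls
    "FP32*BF16": (25, 9),
}


def mode_summary(multiplicand_bits, multiplier_bits, tree_inputs=7):
    """Derive partial-product counts and width for a 7-input Wallace tree."""
    generated = math.ceil(multiplier_bits / 2)  # one Booth digit per 2 multiplier bits
    return {
        "generated_partial_products": generated,
        "rows_fixed_to_zero": tree_inputs - generated,
        "partial_product_width": multiplicand_bits + multiplier_bits,
    }
```

For example, `mode_summary(*MODES["BF16*BF16"])` reports 5 generated partial products, 2 rows fixed to 0 (the 6th and 7th) and a width of 18 bits.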

How the multiplier of the present disclosure accomplishes the first stage operation in four operation modes is described above by way of specific examples, wherein the Booth encoding algorithm and 7-2 Wallace Tree are preferably used. Based on the above description, one skilled in the art will appreciate that the present disclosure uses 7 partial products, such that 7-2 Wallace trees can be multiplexed in different modes of operation.

The case where the multipliers (mantissa processing unit and exponent processing unit) of the present disclosure are called multiple times will be described in more detail below.

According to another aspect of the disclosure, as shown in FIG. 3, the mantissa processing unit may include a control circuit 316, and the control circuit 316 may be configured to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating point numbers is greater than the data bit width that the mantissa processing unit can process at one time. The data bit width that the mantissa processing unit can process at one time refers to the two bit widths (such as a multiplier bit width and a multiplicand bit width) supported by the mantissa processing unit. It can therefore be understood that the control circuit is configured to determine whether to call the mantissa processing unit multiple times to obtain the mantissa after the multiplication, either according to the mantissa bit width of one of the two floating point numbers and one of the two bit widths supported by the mantissa processing unit, or according to the mantissa bit widths of both floating point numbers and both bit widths supported by the mantissa processing unit. Calling the mantissa processing unit in the multiplier repeatedly thus avoids providing a large-area multiplier component for large-bit-width mantissa operations and allows a small-area multiplier component to process them instead, which helps reduce the chip area while providing wider applicability.

According to a first embodiment of the present disclosure, the two floating point numbers include a first floating point number and a second floating point number, the mantissa processing unit supports a first bit width and a second bit width, the mantissa of the first floating point number serves as a first input corresponding to the first bit width, the mantissa of the second floating point number serves as a second input corresponding to the second bit width, the bit width of the first input is smaller than or equal to the first bit width, and the control circuit is configured to, when the bit width of the second input is larger than the second bit width, call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation. According to this embodiment, it is known that the bit width of one of the two inputs is fixedly less than or equal to the bit width supported by the corresponding mantissa processing unit, and thus, it is only necessary to determine the size relationship between the bit width supported by the other input and the corresponding mantissa processing unit, and it can be determined whether to call the mantissa processing unit multiple times.

According to a second embodiment of the present disclosure, the two floating point numbers include a first floating point number and a second floating point number, the mantissa processing unit supporting a first bit width and a second bit width, a mantissa of the first floating point number being a first input corresponding to the first bit width, the mantissa of the second floating point number is used as a second input corresponding to the second bit width, and the control circuit is configured to call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, or when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width. According to this embodiment, the size relationship between the bit width of the two inputs and the two bit widths supported by the mantissa processing unit is uncertain, and it is necessary to determine whether to call the mantissa processing unit multiple times by determining the size relationship between the bit width of the two inputs and the bit width supported by the mantissa processing unit corresponding to each input.

According to this second embodiment, the control circuit selects the mantissa of the first floating point number as the second input corresponding to the second bit width and selects the mantissa of the second floating point number as the first input corresponding to the first bit width, either when the mantissa bit width of the first floating point number is smaller than that of the second floating point number and the first bit width is larger than the second bit width, or when the mantissa bit width of the first floating point number is larger than that of the second floating point number and the first bit width is smaller than the second bit width. It should be understood that when the mantissas of the two floating point numbers are input in no particular order, they may first be matched to the two bit widths supported by the mantissa processing unit according to the strategy of pairing the larger mantissa with the larger bit width and the smaller mantissa with the smaller bit width, so as to avoid performing multiple calls when the mantissa operation of the two floating point numbers could have been completed in a single pass.
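The matching strategy and the multi-call condition of the second embodiment can be expressed as two small predicates. The Python sketch below is a behavioral model only, not the control circuit itself, and the names `match_inputs` and `needs_multiple_calls` are hypothetical.

```python
def match_inputs(m1_bits, m2_bits, first_bw, second_bw):
    """Return (first_input_bits, second_input_bits) after the optional swap.

    By default the first mantissa feeds the first input. The two are swapped
    when the narrower mantissa would otherwise face the wider supported width.
    """
    swap = (m1_bits < m2_bits and first_bw > second_bw) or (
        m1_bits > m2_bits and first_bw < second_bw
    )
    return (m2_bits, m1_bits) if swap else (m1_bits, m2_bits)


def needs_multiple_calls(first_in, second_in, first_bw, second_bw):
    """True if either input is wider than the bit width it is mapped to."""
    return first_in > first_bw or second_in > second_bw
```

With supported widths of 32 and 12 bits, mantissas of 9 and 25 bits would need multiple calls if taken in input order, but a single call suffices once they are swapped.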

Further, when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines, according to the bit width of the first input and the first bit width, the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call. When the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, the control circuit determines, according to the bit width of the second input and the second bit width, the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call. When the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width, the control circuit determines the number of calls and the data input in each call according to the bit width of the first input and the first bit width together with the bit width of the second input and the second bit width.

In the present disclosure, the description about the first floating point number and the second floating point number is only for distinguishing the two floating point numbers, where "first" and "second" have no limiting role. Likewise, the description about the first bit-width and the second bit-width is only for distinguishing the two maximum processing bit-widths supported by the mantissa processing unit, and the description about the first input and the second input is only for distinguishing the two inputs of the mantissa processing unit corresponding to the two maximum processing bit-widths, and thus neither of "first" and "second" has a limiting role.

It should be noted that the floating-point number of the input multiplier described in the above embodiment is a floating-point number that conforms to the format required by the operation and is applicable to the internal and external components of the multiplier, i.e., a floating-point number that has been preprocessed, for example, by normalization. It should be understood that the floating point number input to the multiplier may be a normalized or denormalized floating point number, and as will be understood from the above description of the normalization unit, if at least one of the two input floating point numbers is a denormalized non-zero floating point number, the at least one floating point number may be first normalized by the normalization unit to obtain a normalized exponent and mantissa, and then the normalized mantissa may be used as an input to the mantissa processing unit to perform the floating point number multiplication operation described above. In addition, since the booth encoding circuit mentioned earlier in the present disclosure performs signed fixed-point multiplication, it is necessary to perform the floating-point multiplication described above by extending the mantissa by 1-bit 0, that is, by changing the mantissa to a signed positive number, and then using the extended signed mantissa as an input to the mantissa processing means. Of course, the floating-point number may be subjected to other preprocessing, and the mantissa of the preprocessed floating-point number may be used as an input of the mantissa processing unit to perform the floating-point number multiplication described above, for example, normalization of the floating-point number to apply the operation mode as described above with respect to the normalization unit.

Three examples of the multiple call mantissa processing unit according to the above-described second embodiment of the present disclosure will be described in detail below. For a clearer and intuitive understanding of these three examples, the first input may be, for example, a multiplier, the second input may be, for example, a multiplicand, the first bit width may be, for example, a maximum multiplier bit width supported by the mantissa processing unit, and the second bit width may be, for example, a maximum multiplicand bit width supported by the mantissa processing unit.

According to the first example of the multi-call mantissa processing unit of the present disclosure, in combination with the floating-point number multiplication operation according to the operation mode described above, taking as an example that two floating-point numbers input to the multiplier of the present disclosure are non-zero floating-point numbers that are denormalized, and in combination with the case that the booth encoding circuit used in the present disclosure performs signed fixed-point number multiplication, the two floating-point numbers are first normalized, and therefore, the mantissas of the two floating-point numbers are extended by 1bit, and in addition, in order to be applied to the booth encoding circuit in the embodiment of the present disclosure, the two mantissas are further extended by 1bit to form a signed number. After these pre-processing, the mantissas of the two floating-point numbers are matched to the input of the mantissa processing unit. Therefore, when the bit width of the multiplier is greater than the maximum multiplier bit width and the bit width of the multiplicand is less than or equal to the maximum multiplicand, the control circuit takes only the mantissa formed after normalizing the original mantissa corresponding to the multiplier as the mantissa to be truncated, and in order to be applied to the booth encoding circuit in the embodiment of the present disclosure, the sign bit is extended for each truncated part. 
In order to enable the mantissa processing unit to process the mantissa to be intercepted, a part with a bit width of A-1 is intercepted from the mantissa to be intercepted in each call, where A represents the maximum multiplier bit width supported by the mantissa processing unit. A multiplier part with a bit width of A is formed by adding a 1-bit 0 as the sign bit at the high-order end of the intercepted part, and this multiplier part serves as one input to the mantissa processing unit in each call. In addition, the multiplicand (which in this embodiment is a mantissa that has been normalized and extended with a sign bit) is input to the mantissa processing unit as the other input in each call. Thus, the number of calls of the mantissa processing unit may be determined using the following formula:

n＝ceil((B+1)/(A-1))，

wherein n represents the number of times of calling the mantissa processing unit, B represents the bit width of the mantissa which is not normalized and does not expand the sign bit, B +1 represents the bit width after normalization of the mantissa, B +1 can also be understood as B +2-1, that is, the bit width of the sign bit is subtracted from the bit width of the multiplier, a represents the bit width of the multiplier part (the maximum multiplier bit width supported by the mantissa processing unit), and a-1 represents the bit width of the part intercepted from the mantissa to be intercepted in each call.
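The formula can be evaluated directly; a short Python sketch using integer arithmetic follows. The function name `call_count` and its parameter names are hypothetical.

```python
def call_count(raw_mantissa_bits, max_operand_bits):
    """n = ceil((B + 1) / (A - 1)).

    B is the mantissa width before normalization and sign extension, and A is
    the maximum operand bit width supported by the mantissa processing unit.
    """
    if max_operand_bits < 2:
        raise ValueError("the supported bit width must leave room for a sign bit")
    numerator = raw_mantissa_bits + 1
    denominator = max_operand_bits - 1
    return -(-numerator // denominator)  # integer ceiling division
```

With B = 23 and A = 13 this gives 2, which agrees with the two calls used in the FP32*FP32 operation mode described earlier.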

For example, the maximum multiplier bit width supported by the mantissa processing unit is, for example, 8 bits, the maximum multiplicand bit width is, for example, 32 bits, the two floating point numbers input to the multiplier are floating point numbers of FP32 type and BF16 type, respectively, so that the multiplication operation is performed in the FP32 × BF16 operation mode, and the two floating point numbers are non-normalized non-zero numbers, so that the mantissas of the two floating point numbers have bit widths of 23 bits and 7 bits, respectively, and considering the IEEE754 standard, the bit widths of the two mantissas can be extended to 24 bits and 8 bits. To be suitable for use in the booth encoding circuit in the embodiments of the present disclosure, the two mantissas are then extended by 1bit 0 to 25bit and 9bit signed numbers. Therefore, the control circuit takes the mantissa with the bit width of 9 bits as the multiplier corresponding to the maximum multiplier bit width and takes the mantissa with the bit width of 25 bits as the multiplicand corresponding to the maximum multiplicand bit width, and because only the bit width (9 bits) of the multiplier is greater than the maximum multiplier bit width (8 bits) and the bit width (25 bits) of the multiplicand is less than the maximum multiplicand bit width (32 bits), the mantissa formed after normalizing the original mantissa corresponding to the multiplier is taken as the mantissa inb to be intercepted, and the multiplicand is taken as the multiplicand ina of the input mantissa processing unit. 
According to the above formula, ceil((7+1)/(8-1)) = 2, so the mantissa processing unit needs to be called twice. In each call, 7 bits of data are intercepted from inb; in the last (second) call, the remaining data is intercepted and padded with leading 0s to 7 bits. Each intercepted 7-bit part is extended by a 1-bit 0 (sign bit) to 8 bits to serve as a multiplier part inb_m, so the calculation performed in each call is ina × inb_m, that is, the multiplication of a multiplicand with a bit width of 25 bits by a multiplier part with a bit width of 8 bits, from which the mantissa result of that call is obtained. It should be noted that the mantissa to be intercepted may be intercepted in order from the high-order bits to the low-order bits, or from the low-order bits to the high-order bits. It is also noted that this example is equally applicable to the above-described first embodiment of the present disclosure.
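The interception in this example can be sketched in Python as follows. The sketch cuts the mantissa from the low-order bits upward, which is one of the two orders allowed above. The names `slice_multiplier` and `multi_call_product` are hypothetical, and merging the per-call results by shifting and adding is an assumption, since this passage describes only the individual calls.

```python
def slice_multiplier(mantissa, mantissa_bits, max_multiplier_bits):
    """Cut a normalized mantissa into (A-1)-bit parts, lowest part first.

    Each part is non-negative and fits in A bits with a 0 sign bit on top;
    the last part is implicitly padded with leading 0s.
    """
    chunk = max_multiplier_bits - 1
    mask = (1 << chunk) - 1
    return [(mantissa >> shift) & mask for shift in range(0, mantissa_bits, chunk)]


def multi_call_product(ina, inb, inb_bits, max_multiplier_bits):
    """Model the repeated calls ina * inb_m and merge their results."""
    chunk = max_multiplier_bits - 1
    parts = slice_multiplier(inb, inb_bits, max_multiplier_bits)
    # One term per call; weighting call k by 2**(chunk * k) is the assumed merge step.
    return sum((ina * part) << (chunk * k) for k, part in enumerate(parts))
```

For the 8-bit normalized BF16 mantissa and A = 8, the mantissa is cut into a 7-bit low part and a 1-bit high part, giving the two calls computed by the formula.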

According to a second example of calling the mantissa processing unit multiple times, and in combination with the floating point number multiplication according to the operation mode described above, take as an example the case in which the two floating point numbers input to the multiplier of the present disclosure are non-normalized non-zero floating point numbers and the Booth encoding circuit used in the present disclosure performs signed fixed point number multiplication. The two floating point numbers are first normalized, so their mantissas are each extended by 1 bit; in addition, in order to be applied to the Booth encoding circuit in the embodiments of the present disclosure, the two mantissas are each further extended by 1 bit to form signed numbers. After this preprocessing, the mantissas of the two floating point numbers are matched to the inputs of the mantissa processing unit. When the bit width of the multiplicand is greater than the maximum multiplicand bit width and the bit width of the multiplier is less than or equal to the maximum multiplier bit width, the control circuit takes only the mantissa formed by normalizing the original mantissa corresponding to the multiplicand as the mantissa to be intercepted, and, in order to be applied to the Booth encoding circuit in the embodiments of the present disclosure, extends each intercepted part by a sign bit.
In order to enable the mantissa processing unit to process the mantissa to be intercepted, a part with a bit width of C-1 is intercepted from the mantissa in each call, wherein C represents the maximum multiplicand bit width supported by the mantissa processing unit. The part with a bit width of C-1 intercepted in each call is supplemented with one bit of 0 at the high-order end as a sign bit to form a multiplicand part with a bit width of C, and the multiplicand part is used as one input of the mantissa processing unit in that call. In addition, the multiplier (in this embodiment, the multiplier is a mantissa that has been normalized and extended by a sign bit) is input to the mantissa processing unit as the other input in each call. Thus, the number of calls of the mantissa processing unit may be determined using the following formula:

n = ceil((D+1)/(C-1)),

wherein n represents the number of times the mantissa processing unit is called; D represents the bit width of the mantissa before normalization and without the sign bit extension; D+1 represents the bit width of the mantissa after normalization, which can also be understood as D+2-1, that is, the bit width of the multiplicand minus the bit width of the sign bit; C represents the bit width of the multiplicand part (the maximum multiplicand bit width supported by the mantissa processing unit); and C-1 represents the bit width of the part intercepted from the mantissa to be intercepted in each call.

For example, suppose the maximum multiplier bit width supported by the mantissa processing unit is 12 bits and the maximum multiplicand bit width is 16 bits, and the two floating point numbers input to the multiplier are of the FP32 type and the BF16 type, respectively, so that the multiplication is performed in the FP32 × BF16 operation mode. Both floating point numbers are non-normalized non-zero numbers, so their mantissas have bit widths of 23 bits and 7 bits, respectively; in view of the IEEE754 standard, the bit widths of the two mantissas can be extended to 24 bits and 8 bits. To be suitable for the Booth encoding circuit in the embodiments of the present disclosure, the two mantissas are then each extended by one bit of 0 to form 25-bit and 9-bit signed numbers. The control circuit therefore takes the 9-bit mantissa as the multiplier corresponding to the maximum multiplier bit width and the 25-bit mantissa as the multiplicand corresponding to the maximum multiplicand bit width. Because only the bit width of the multiplicand (25 bits) is greater than the maximum multiplicand bit width (16 bits) supported by the mantissa processing unit, while the bit width of the multiplier (9 bits) is less than the maximum multiplier bit width (12 bits), the mantissa formed by normalizing the original mantissa corresponding to the multiplicand is taken as the mantissa ina to be intercepted, and the multiplier is taken as the multiplier inb input to the mantissa processing unit.
According to the above formula, n = ceil((23+1)/(16-1)) = 2, so the mantissa processing unit needs to be called twice. In each call, 15 bits of data are intercepted from ina; in the last (second) call, the remaining data is intercepted and padded with 0 at the front to complete 15 bits. Each intercepted 15-bit part is extended by one bit of 0 (the sign bit) to 16 bits to form a multiplicand part ina_m. The calculation performed in each call is therefore ina_m × inb, that is, a multiplication of a multiplicand part with a bit width of 16 bits by the multiplier with a bit width of 9 bits, which yields the mantissa result of that call. It should be noted that the mantissa to be intercepted may be intercepted in order from the high-order bits to the low-order bits, or in order from the low-order bits to the high-order bits. It is noted that this example is equally applicable to the above-described first embodiment of the present disclosure.
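As a quick check of the call count in this example, the formula n = ceil((D+1)/(C-1)) can be evaluated directly. The sketch below is illustrative only; the function names `num_calls` and `part_widths` are assumptions introduced here and do not appear in the disclosure.

```python
from math import ceil

def num_calls(raw_mantissa_bits, max_width):
    """n = ceil((D + 1) / (C - 1)): the mantissa has D + 1 bits after
    normalization, and each call carries C - 1 data bits because one bit
    of the supported width C is reserved for the sign."""
    return ceil((raw_mantissa_bits + 1) / (max_width - 1))

def part_widths(raw_mantissa_bits, max_width):
    """Data bits actually taken from the mantissa in each call, before zero-padding."""
    total, step = raw_mantissa_bits + 1, max_width - 1
    n = num_calls(raw_mantissa_bits, max_width)
    return [min(step, total - i * step) for i in range(n)]

calls = num_calls(23, 16)       # FP32 mantissa, 16-bit maximum multiplicand width
widths = part_widths(23, 16)    # 15 bits in the first call, 9 remaining bits in the second
print(calls, widths)
```

The same helper reproduces the first example as well: `num_calls(7, 8)` gives 2, with part widths of 7 and 1.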

According to a third example of calling the mantissa processing unit multiple times, and in combination with the floating point number multiplication according to the operation mode described above, take as an example the case in which the two floating point numbers input to the multiplier of the present disclosure are non-normalized non-zero floating point numbers and the Booth encoding circuit used in the present disclosure performs signed fixed point number multiplication. The two floating point numbers are first normalized, so their mantissas are each extended by 1 bit; in addition, in order to be applied to the Booth encoding circuit in the embodiments of the present disclosure, the two mantissas are each further extended by 1 bit to form signed numbers. After this preprocessing, the mantissas of the two floating point numbers are matched to the inputs of the mantissa processing unit. When the bit width of the multiplier is greater than the maximum multiplier bit width and the bit width of the multiplicand (in this embodiment, the multiplicand is a mantissa that has been normalized and extended by a sign bit) is greater than the maximum multiplicand bit width, the control circuit takes both the mantissa formed by normalizing the original mantissa corresponding to the multiplier and the mantissa formed by normalizing the original mantissa corresponding to the multiplicand as mantissas to be intercepted, and, in order to be applied to the Booth encoding circuit in the embodiments of the present disclosure, extends each intercepted part by a sign bit.
In order to enable the mantissa processing unit to process the two mantissas to be intercepted, in each call a part with a bit width of A-1 is intercepted from the mantissa to be intercepted corresponding to the multiplier and a part with a bit width of C-1 is intercepted from the mantissa to be intercepted corresponding to the multiplicand, wherein A represents the maximum multiplier bit width supported by the mantissa processing unit and C represents the maximum multiplicand bit width supported by the mantissa processing unit. Each intercepted part with a bit width of A-1 is supplemented with one bit of 0 at the high-order end as a sign bit to form a multiplier part with a bit width of A, which is used as one input of the mantissa processing unit in that call; each intercepted part with a bit width of C-1 is likewise supplemented with one bit of 0 at the high-order end as a sign bit to form a multiplicand part with a bit width of C, which is used as the other input of the mantissa processing unit in that call. Thus, the number of calls of the mantissa processing unit may be determined using the following formula:

n = ceil((B+1)/(A-1)) * ceil((D+1)/(C-1))

where n represents the number of times the mantissa processing unit is called; B represents the bit width of the mantissa corresponding to the multiplier before normalization and without the sign bit extension; B+1 represents the bit width of that mantissa after normalization, which can also be understood as B+2-1, that is, the bit width of the multiplier minus the bit width of the sign bit; A represents the bit width of the multiplier part (the maximum multiplier bit width supported by the mantissa processing unit); A-1 represents the bit width of the part intercepted in each call from the mantissa to be intercepted corresponding to the multiplier; D represents the bit width of the mantissa corresponding to the multiplicand before normalization and without the sign bit extension; D+1 represents the bit width of that mantissa after normalization, which can also be understood as D+2-1, that is, the bit width of the multiplicand minus the bit width of the sign bit; C represents the bit width of the multiplicand part (the maximum multiplicand bit width supported by the mantissa processing unit); and C-1 represents the bit width of the part intercepted in each call from the mantissa to be intercepted corresponding to the multiplicand.

For example, suppose the maximum multiplier bit width supported by the mantissa processing unit is 8 bits and the maximum multiplicand bit width is 16 bits, and both floating point numbers input to the multiplier are of the FP32 type, so that the multiplication is performed in the FP32 × FP32 operation mode. Both floating point numbers are non-normalized non-zero numbers, so the mantissa bit widths of both are 23 bits; in view of the IEEE754 standard, the bit widths of both mantissas can be extended to 24 bits. To be suitable for the Booth encoding circuit in the embodiments of the present disclosure, the two mantissas are then each extended by one bit of 0 to form 25-bit signed numbers. The control circuit therefore selects the mantissas of the two floating point numbers as the multiplier corresponding to the maximum multiplier bit width and the multiplicand corresponding to the maximum multiplicand bit width, respectively (one of the mantissas is selected as the multiplier and the other as the multiplicand). Because the bit width of the multiplier (25 bits) is greater than the maximum multiplier bit width (8 bits) and the bit width of the multiplicand (25 bits) is greater than the maximum multiplicand bit width (16 bits), the mantissa formed by normalizing the original mantissa corresponding to the multiplier is taken as the mantissa inb to be intercepted, and the mantissa formed by normalizing the original mantissa corresponding to the multiplicand is taken as the mantissa ina to be intercepted. According to the above formula, n = ceil((23+1)/(8-1)) * ceil((23+1)/(16-1)) = 4 * 2 = 8, so the mantissa processing unit needs to be called eight times.
From inb, 7 bits of data are intercepted at a time; when fewer than 7 bits remain, all the remaining data is intercepted and padded with 0 at the front to complete 7 bits. Each intercepted 7-bit part is extended by one bit of 0 (the sign bit) to 8 bits to form a multiplier part inb_m, and since inb is intercepted into four parts, there are four multiplier parts inb_m1, inb_m2, inb_m3 and inb_m4. Likewise, from ina, 15 bits of data are intercepted at a time; when fewer than 15 bits remain, all the remaining data is intercepted and padded with 0 at the front to complete 15 bits. Each intercepted 15-bit part is extended by one bit of 0 (the sign bit) to 16 bits to form a multiplicand part ina_m, and since ina is intercepted into two parts, there are two multiplicand parts ina_m1 and ina_m2. Thus, when the mantissa processing unit is called eight times, the following calculations may, for example, be performed in sequence: ina_m1 × inb_m1, ina_m1 × inb_m2, ina_m1 × inb_m3, ina_m1 × inb_m4, ina_m2 × inb_m1, ina_m2 × inb_m2, ina_m2 × inb_m3, ina_m2 × inb_m4. Alternatively, the following calculations may be performed in sequence: inb_m1 × ina_m1, inb_m1 × ina_m2, inb_m2 × ina_m1, inb_m2 × ina_m2, inb_m3 × ina_m1, inb_m3 × ina_m2, inb_m4 × ina_m1, inb_m4 × ina_m2. The calculation performed in each call is a multiplication of a multiplicand part with a bit width of 16 bits by a multiplier part with a bit width of 8 bits, which yields the mantissa result of that call. It should be noted that the mantissas to be intercepted may be intercepted in order from the high-order bits to the low-order bits, or in order from the low-order bits to the high-order bits.
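The call count and the two call orders listed above can be enumerated mechanically. The following sketch is only an illustration: the variable names mirror the symbols A, B, C and D of the formula, while the list names `multiplicand_major` and `multiplier_major` are hypothetical labels introduced here.

```python
from itertools import product
from math import ceil

A, C = 8, 16        # maximum multiplier / multiplicand bit widths of the unit
B = D = 23          # FP32 mantissa bit widths before normalization

n_b = ceil((B + 1) / (A - 1))    # multiplier parts: inb_m1 .. inb_m4
n_a = ceil((D + 1) / (C - 1))    # multiplicand parts: ina_m1, ina_m2
n = n_b * n_a                    # total number of calls

ina_parts = [f"ina_m{i}" for i in range(1, n_a + 1)]
inb_parts = [f"inb_m{i}" for i in range(1, n_b + 1)]

# First order in the text: hold a multiplicand part, sweep the multiplier parts.
multiplicand_major = list(product(ina_parts, inb_parts))
# Second order in the text: hold a multiplier part, sweep the multiplicand parts.
multiplier_major = [(a, b) for b, a in product(inb_parts, ina_parts)]

for a, b in multiplicand_major:
    print(f"{a} x {b}")
```

Both orders cover the same eight part pairs, so the choice affects only the sequencing of calls, not the set of partial products.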

The above examples are for illustrative purposes only and are not limiting. Those skilled in the art will appreciate that, in other operation modes, a mantissa processing unit supporting any maximum bit width may likewise be called multiple times to complete the floating point number multiplication operation.

For the above multi-call mantissa processing unit, the mantissa processing unit may further include a shift and add circuit, where the shift and add circuit is configured to obtain the mantissa after the multiplication according to a mantissa result obtained by calling the mantissa processing unit each time.

Further, the shift and add circuit includes a shifter, an intermediate memory, and an adder. When the control circuit calls the mantissa processing unit a plurality of times according to the operation mode, after the first call the shifter shifts the mantissa result obtained by the first call to obtain a shifted mantissa result, which is stored in the intermediate memory. Starting from the second call, the shifter shifts the mantissa result obtained in the current call to obtain a current mantissa result, and the adder adds the current mantissa result to the result stored in the intermediate memory and stores the sum in the intermediate memory to update the intermediate memory. The result stored in the intermediate memory after the last call is taken as the mantissa after the multiplication operation.

In this embodiment, for example, the truncation of the mantissa to be truncated is performed in order from the upper bits to the lower bits. Each time the mantissa processing unit is called, the shifter shifts the mantissa result obtained in the current call according to the following formula:

Y = k + j

wherein Y represents the number of bits by which the mantissa result obtained in the current call needs to be shifted, k represents the total number of bits of all data following the part intercepted for the current call in the mantissa to be intercepted corresponding to the multiplier, and j represents the total number of bits of all data following the part intercepted for the current call in the mantissa to be intercepted corresponding to the multiplicand. It should be understood that if only the bit width of the multiplier is greater than the maximum multiplier bit width, or only the bit width of the multiplicand is greater than the maximum multiplicand bit width, then only the mantissa to be intercepted corresponding to the multiplier, or only the mantissa to be intercepted corresponding to the multiplicand, needs to be intercepted. All the data of the mantissa that does not need to be intercepted is used in every call, so no data follows it and the corresponding value of j or k is 0. Accordingly, for the case where only the bit width of the multiplier is greater than the maximum multiplier bit width, the above formula for calculating the shift number can be written as Y = k; for the case where only the bit width of the multiplicand is greater than the maximum multiplicand bit width, it can be written as Y = j.
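The shift rule can be expressed compactly if each mantissa is described by the list of part widths in high-to-low interception order. The sketch below is a minimal illustration under that assumption; `shift_amount` and its parameter names are hypothetical, and an operand that is not intercepted is modelled as a single part covering its full width.

```python
def shift_amount(multiplier_widths, b_idx, multiplicand_widths, a_idx):
    """Y = k + j for the call that uses multiplier part b_idx and
    multiplicand part a_idx (0-based, high-to-low interception order)."""
    k = sum(multiplier_widths[b_idx + 1:])      # bits following the multiplier part
    j = sum(multiplicand_widths[a_idx + 1:])    # bits following the multiplicand part
    return k + j

# Only the multiplier is intercepted (8 bits as 7 + 1); the 24-bit multiplicand
# is used whole in every call, so j = 0 and Y reduces to k.
y_single = [shift_amount([7, 1], i, [24], 0) for i in range(2)]

# Both operands are intercepted (24 bits as 7 + 7 + 7 + 3 and as 15 + 9).
y_first = shift_amount([7, 7, 7, 3], 0, [15, 9], 0)
print(y_single, y_first)
```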

For example, as described above, in the FP32 × BF16 operation mode, when only the bit width of the multiplier is greater than the maximum multiplier bit width, the mantissa processing unit is called twice, and the mantissa to be intercepted is intercepted, for example, in order from the high-order bits to the low-order bits. Specifically, let the multiplier parts in the two calls be inb_m1 and inb_m2, respectively. After the first call, the shifter shifts the result of ina × inb_m1 to the left. Since 7 bits of data are intercepted in the first call, the number of bits following the 7-bit data used in this call is k = 8 - 7 = 1 bit, and from the above formula Y = 1; the result is therefore shifted left by 1 bit to obtain a result R1, and the adder stores R1 in the intermediate memory. After the second call (the last call), the shifter shifts the result of ina × inb_m2 to the left. Since the last 1 bit of data is intercepted in the second call, no data follows the 1-bit data used in this call, and from the above formula Y = 0; the result is therefore not shifted, giving a result R2. The adder adds R2 to R1 stored in the intermediate memory and stores the sum in the intermediate memory to update the intermediate memory. Since the second call is the last call, the result stored in the intermediate memory after the second call is the mantissa after the multiplication operation. The shift-add circuit works in the same way in the above case in which only the multiplicand bit width is greater than the maximum multiplicand bit width.
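The two-call sequence can be checked numerically. The sketch below models the mantissas as plain Python integers and omits the zero sign bits, since they do not change the value; the sample values of `ina` and `inb` and the variable name `memory` (standing in for the intermediate memory) are assumptions chosen for illustration.

```python
ina = 0xC0FFEE        # arbitrary 24-bit normalized FP32 mantissa
inb = 0b10110101      # arbitrary 8-bit normalized BF16 mantissa

inb_m1 = inb >> 1     # first call: the high 7 bits of inb
inb_m2 = inb & 0b1    # second call: the remaining 1 bit (zero-padded at the front)

memory = (ina * inb_m1) << 1     # R1: k = 8 - 7 = 1, so Y = 1
memory += (ina * inb_m2) << 0    # R2: no data follows, so Y = 0

# The accumulated value equals the direct product of the two mantissas.
assert memory == ina * inb
```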

For example, as described above, in the FP32 × FP32 operation mode, when the bit width of the multiplier is greater than the maximum multiplier bit width and the bit width of the multiplicand is greater than the maximum multiplicand bit width, the mantissa processing unit is called eight times, and the mantissas to be intercepted are intercepted, for example, in order from the high-order bits to the low-order bits. Specifically, let the multiplier parts be inb_m1, inb_m2, inb_m3 and inb_m4 and the multiplicand parts be ina_m1 and ina_m2, and let the following calculations be performed in sequence over the eight calls: ina_m1 × inb_m1, ina_m1 × inb_m2, ina_m1 × inb_m3, ina_m1 × inb_m4, ina_m2 × inb_m1, ina_m2 × inb_m2, ina_m2 × inb_m3 and ina_m2 × inb_m4. After the first call, the shifter shifts the result of ina_m1 × inb_m1 to the left. Since 7 bits of data are intercepted from the mantissa to be intercepted corresponding to the multiplier in the first call, the number of bits following the 7-bit data used in this call is k = 24 - 7 = 17 bits; since 15 bits of data are intercepted from the mantissa to be intercepted corresponding to the multiplicand, the number of bits following the 15-bit data used in this call is j = 24 - 15 = 9 bits. From the above formula, Y = 17 + 9 = 26, so the result is shifted left by 26 bits to obtain a result S1, and the adder stores S1 in the intermediate memory. After the second call, the shifter shifts the result of ina_m1 × inb_m2 to the left. Since the next 7 bits of data are intercepted from the mantissa to be intercepted corresponding to the multiplier in the second call, the number of bits following the 7-bit data used in this call is k = 24 - 7 - 7 = 10 bits; since the same 15-bit data as in the previous call is used from the mantissa to be intercepted corresponding to the multiplicand, the number of bits following it is still j = 24 - 15 = 9 bits. From the above formula, Y = 10 + 9 = 19, so the result is shifted left by 19 bits to obtain a result S2; the adder adds S2 to S1 stored in the intermediate memory and stores the sum in the intermediate memory to update the intermediate memory. The mantissa processing unit is called repeatedly in this manner until the fourth call, after which the shifter shifts the result of ina_m1 × inb_m4 to the left. Since the last 3 bits of data in the mantissa to be intercepted corresponding to the multiplier are intercepted in the fourth call, no data follows the 3-bit data used in this call, so k = 0; since the same 15-bit data as in the previous calls is used from the mantissa to be intercepted corresponding to the multiplicand, the number of bits following it is still j = 24 - 15 = 9 bits. From the above formula, Y = 0 + 9 = 9, so the result is shifted left by 9 bits to obtain a result S4; the adder adds S4 to the result stored in the intermediate memory and stores the sum in the intermediate memory to update the intermediate memory. In the fifth to eighth calls, the last 9 bits of data in the mantissa to be intercepted corresponding to the multiplicand are intercepted and no data follows these 9 bits, so j = 0 in the fifth to eighth calls. After the fifth call, the shifter shifts the result of ina_m2 × inb_m1 to the left. Since the same 7-bit data as in the first call is intercepted from the mantissa to be intercepted corresponding to the multiplier in the fifth call, k = 24 - 7 = 17 bits. From the above formula, Y = 17 + 0 = 17, so the result is shifted left by 17 bits to obtain a result S5; the adder adds S5 to the result stored in the intermediate memory and stores the sum in the intermediate memory to update the intermediate memory. The mantissa processing unit is called repeatedly in this manner until the eighth call, after which the shifter shifts the result of ina_m2 × inb_m4 to the left. Since the last 3 bits of data in the mantissa to be intercepted corresponding to the multiplier are intercepted in the eighth call, no data follows the 3-bit data used in this call, so k = 0. From the above formula, Y = 0 + 0 = 0, so the result is not shifted, giving an unshifted result S8; the adder adds S8 to the result stored in the intermediate memory and stores the sum in the intermediate memory to update the intermediate memory. Since the eighth call is the last call, the result stored in the intermediate memory after the eighth call is the mantissa after the multiplication operation.
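The eight-call sequence can be modelled end to end to confirm that the shift amounts recombine the partial products into the full product. This is a behavioural sketch only: the function names `split_high_to_low` and `multi_call_multiply`, the integer modelling of the mantissas, and the sample operand values are assumptions, and the zero sign bits are omitted because they do not affect the value.

```python
def split_high_to_low(value, width, part_width):
    """Return (part, bits_following) pairs, intercepting from the high-order bits."""
    parts, remaining = [], width
    while remaining > 0:
        take = min(part_width, remaining)     # the last part may be shorter
        remaining -= take
        parts.append(((value >> remaining) & ((1 << take) - 1), remaining))
    return parts

def multi_call_multiply(ina, inb, width=24, a_part=15, b_part=7):
    """Model the shifter, adder and intermediate memory over all calls."""
    memory, shifts = 0, []
    for a_m, j in split_high_to_low(ina, width, a_part):      # ina_m1, ina_m2
        for b_m, k in split_high_to_low(inb, width, b_part):  # inb_m1 .. inb_m4
            y = k + j                          # Y = k + j
            memory += (a_m * b_m) << y         # shift, then accumulate
            shifts.append(y)
    return memory, shifts

ina, inb = 0xC0FFEE, 0xABCDEF                  # arbitrary 24-bit normalized mantissas
result, shifts = multi_call_multiply(ina, inb)
print(shifts)                                  # [26, 19, 12, 9, 17, 10, 3, 0]
assert result == ina * inb
```

The printed shift amounts for the first, second, fourth, fifth and eighth calls match the values 26, 19, 9, 17 and 0 derived in the text.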

On the other hand, in order to further reduce the area of the multiplier, the exponent processing unit includes a second control circuit (not shown in the figure) configured to determine to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation according to the exponent bit width of one of the two floating point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of the two floating point numbers and the two bit widths supported by the exponent processing unit.

According to a third embodiment of the present disclosure, the two floating point numbers include a first floating point number and a second floating point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating point number is used as a third input corresponding to the third bit width, the exponent of the second floating point number is used as a fourth input corresponding to the fourth bit width, the bit width of the third input is less than or equal to the third bit width, and the second control circuit is configured to, when the bit width of the fourth input is greater than the fourth bit width, call the exponent processing unit multiple times to obtain the exponent after the multiplication operation. According to this embodiment, the bit width of one of the two inputs is known to be less than or equal to the corresponding bit width supported by the exponent processing unit, so it is only necessary to compare the bit width of the other input with its corresponding supported bit width in order to determine whether to call the exponent processing unit multiple times.

According to a fourth embodiment of the present disclosure, the two floating point numbers include a first floating point number and a second floating point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating point number is used as a third input corresponding to the third bit width, the exponent of the second floating point number is used as a fourth input corresponding to the fourth bit width, and the second control circuit is configured to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width, when the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width, or when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width. According to this embodiment, the relationship between the bit widths of the two inputs and the two bit widths supported by the exponent processing unit is not known in advance, so the bit width of each input must be compared with its corresponding supported bit width in order to determine whether to call the exponent processing unit multiple times.

According to this fourth embodiment, when the exponent bit width of the first floating point number is smaller than the exponent bit width of the second floating point number and the third bit width is larger than the fourth bit width, or when the exponent bit width of the first floating point number is larger than the exponent bit width of the second floating point number and the third bit width is smaller than the fourth bit width, the second control circuit selects the exponent of the first floating point number as the fourth input corresponding to the fourth bit width and selects the exponent of the second floating point number as the third input corresponding to the third bit width. It should be understood that when the exponents of the two floating point numbers are input in no particular order, they may be matched with the two bit widths supported by the exponent processing unit according to the strategy of assigning the larger bit width to the larger supported bit width and the smaller bit width to the smaller supported bit width, so as to avoid calling the exponent processing unit multiple times for exponents that could have been processed in a single call.
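The swap condition of this matching strategy is a simple comparison, sketched below. The function `match_exponents` and its return convention are hypothetical and introduced only for illustration; the sample widths use the 5-bit FP16 and 8-bit FP32 exponent fields together with assumed supported widths.

```python
def match_exponents(exp_width_1, exp_width_2, third_width, fourth_width):
    """Return (number feeding the third input, number feeding the fourth input),
    where 1 and 2 denote the first and second floating point numbers."""
    swap = ((exp_width_1 < exp_width_2 and third_width > fourth_width) or
            (exp_width_1 > exp_width_2 and third_width < fourth_width))
    return (2, 1) if swap else (1, 2)

# First number has a 5-bit exponent, second an 8-bit exponent; the unit is
# assumed to support 8 bits on the third input and 5 bits on the fourth input.
print(match_exponents(5, 8, third_width=8, fourth_width=5))   # (2, 1): swapped
print(match_exponents(8, 5, third_width=8, fourth_width=5))   # (1, 2): unchanged
```

With the swap, both exponents fit their inputs and a single call suffices; without it, the 8-bit exponent would be paired with the 5-bit input and force multiple calls.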

Further, when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width, when the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width, or when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width, the second control circuit is configured to determine, in the case where the bit width of the third input is less than or equal to the bit width of the fourth input and the third bit width is less than or equal to the fourth bit width, the number of times the exponent processing unit is called and the data input to the exponent processing unit in each call according to the bit width of the fourth input and the third bit width. It is noted that, in the above three cases, the number of times the exponent processing unit is called and the data input to the exponent processing unit in each call are determined according to the larger of the bit widths of the third input and the fourth input and the smaller of the third bit width and the fourth bit width. Of course, when the bit widths of the third input and the fourth input are the same, or when the third bit width and the fourth bit width are the same, either of the two equal values may be used.

In this embodiment, the description of the first floating point number and the second floating point number serves only to distinguish the two floating point numbers, and "first" and "second" have no limiting effect. Likewise, the description of the third input and the fourth input serves only to distinguish the two inputs of the exponent processing unit, and the description of the third bit width and the fourth bit width serves only to distinguish the two maximum processing bit widths supported by the exponent processing unit for its two inputs; thus neither "third" nor "fourth" has a limiting effect.

It should be noted that the floating point number input to the multiplier described in the above embodiments is a floating point number that conforms to the format required by the operation and is suitable for the internal and external components of the multiplier, that is, a floating point number that has been preprocessed, for example by normalization. It should be understood that the floating point number input to the multiplier may be a normalized or non-normalized floating point number. As will be appreciated from the above description of the normalization unit, if at least one of the two input floating point numbers is a non-normalized non-zero floating point number, the at least one floating point number may first be normalized by the normalization unit to obtain a normalized exponent and mantissa, and the above-described floating point number multiplication may then be performed using the normalized exponent as an input to the exponent processing unit. Of course, other preprocessing may be performed on the floating point number, and the above-described floating point number multiplication may be performed using the exponent of the preprocessed floating point number as an input to the exponent processing unit; for example, the floating point number may be normalized to suit the operation mode, as described above with respect to the normalization unit.

An example of calling the exponent processing unit multiple times will be described in detail below. For a clearer and more intuitive understanding of this example, the third input may be regarded, for example, as a first addend and the fourth input as a second addend; the third bit width is then the maximum bit width supported by the exponent processing unit for the first addend, and the fourth bit width is the maximum bit width supported by the exponent processing unit for the second addend.

According to an example of calling the exponent processing unit multiple times, and in combination with the floating point number multiplication according to the operation mode described above, take as an example the case in which the two floating point numbers input to the multiplier of the present disclosure are non-normalized non-zero floating point numbers. The two floating point numbers are first normalized, and thus the mantissas of the two floating point numbers are each extended by 1 bit. After this preprocessing, the exponents of the two floating point numbers are matched with the inputs of the exponent processing unit. When the bit width of one of the two addends is greater than its corresponding maximum addend bit width and the bit width of the other addend is less than or equal to its corresponding maximum addend bit width, or when the bit widths of both addends are greater than their corresponding maximum addend bit widths, the second control circuit may determine the number of times the exponent processing unit is called according to the following formula:

m = ceil(P/(Q-1)),

wherein m represents the number of times the exponent processing unit is called, P represents the bit width of the wider of the two addends, Q represents the maximum addend bit width, and Q-1 represents the bit width of the part intercepted from each of the two addends in each call. In each call, a part with a bit width of Q-1 is intercepted from both addends at the same bit positions, so that parts with the same bit width and the same bit positions are added together; if the part intercepted in a call has fewer than Q-1 bits of data, or no data at all, it is padded with 0 at the front, or filled entirely with 0, to complete Q-1 bits of data. The two addend parts input to the exponent processing unit are formed by extending one carry bit in front of each intercepted part, so Q also represents the bit width of each addend part input to the exponent processing unit in each call.

Thus, each time the exponent processing unit is called, the second control circuit may intercept a (Q-1)-bit portion from each of the two addends, in the same order, as the input of the exponent processing unit, obtain the exponent result of that call from the exponent processing unit, and obtain the final exponent after calling the exponent processing unit m times. Note that the same order may be from the high-order bits to the low-order bits, or from the low-order bits to the high-order bits.

For example, suppose one addend is 6 bits wide, the other addend is 9 bits wide, and the maximum addend bit widths supported by the exponent processing unit are both 8 bits. The number of calls of the exponent processing unit is then ceil(9/(8-1)) = 2. The 6-bit addend is first padded with 0s so that the two addends have the same bit width. In each call, portions 7 bits wide are then intercepted from the two addends simultaneously, in order from the high-order bits to the low-order bits, and each intercepted portion is extended by one carry bit to form two 8-bit inputs for addition. In the second call (i.e., the last call), only 2 bits of data can be intercepted from each addend, so 0s are prepended to the intercepted 2-bit data to fill it out to 7 bits, and one carry bit is extended to form two 8-bit inputs for addition.
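The call count and the interception of (Q-1)-bit portions in the example above can be illustrated with the following minimal Python sketch. The function names `call_count` and `slices_high_to_low`, and the sample operand values, are hypothetical and chosen only for illustration; the sketch models the data flow in software and is not the disclosed circuit.

```python
def call_count(p: int, q: int) -> int:
    """Number of calls m = ceil(P / (Q - 1)), computed with integer arithmetic."""
    return -(-p // (q - 1))


def slices_high_to_low(value: int, width: int, q: int) -> list[int]:
    """Split `value`, held in `width` bits, into (Q-1)-bit portions, high to low.

    The last portion may carry fewer than Q-1 data bits; as an integer it is
    implicitly padded with 0s in front.
    """
    chunk = q - 1
    parts = []
    remaining = width              # bits not yet intercepted
    while remaining > 0:
        take = min(chunk, remaining)
        remaining -= take
        parts.append((value >> remaining) & ((1 << take) - 1))
    return parts


# Hypothetical operands: a 6-bit addend padded to 9 bits, and a 9-bit addend.
print(call_count(9, 8))                         # 2
print(slices_high_to_low(0b110101, 9, 8))       # [13, 1]
print(slices_high_to_low(0b101100111, 9, 8))    # [89, 3]
```

Each portion is at most 7 bits wide, so the sum of two portions fits in the 8 bits (Q) that include the extended carry bit.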

It is to be noted that the manner of calling the exponent processing unit in this example, both when the bit width of only one addend is greater than the corresponding maximum addend bit width and when the bit widths of both addends are greater than the corresponding maximum addend bit widths, is also applicable to the above-described third embodiment of the present disclosure.

According to an embodiment, the exponent processing unit may further include a second shift-and-add circuit configured to obtain the exponent after the multiplication operation according to an exponent result obtained each time the exponent processing unit is called.

Further, the second shift-and-add circuit includes a second shifter, a second intermediate memory, and a second adder. When the second control circuit calls the exponent processing unit a plurality of times, after the first call the second shifter shifts the exponent result obtained from the first call and stores the shifted exponent result in the second intermediate memory. Starting with the second call to the exponent processing unit, the second shifter shifts the exponent result obtained in the current call, and the second adder adds the shifted exponent result to the value stored in the second intermediate memory and stores the sum in the second intermediate memory to update it. The value stored in the second intermediate memory after the last call is taken as the exponent after the multiplication operation.

Each time the exponent processing unit is called, the second shifter shifts the exponent result obtained in that call as follows: when the two addends are intercepted in order from the high-order bits to the low-order bits, the exponent result obtained in the call is shifted to the left, and the number of bits shifted equals the number of bits that follow the portion intercepted from the addend in that call.

For example, with reference to the above example, one addend is 6 bits wide, the other addend is 9 bits wide, the maximum addend bit widths supported by the exponent processing unit are both 8 bits, and in each call portions 7 bits wide are intercepted from the two addends simultaneously in order from the high-order bits to the low-order bits. Specifically, after the first call to the exponent processing unit, the second shifter shifts the exponent result obtained from the first call to the left by 2 bits (because 2 bits of data follow the portion intercepted in that call) and stores the shifted exponent result in the second intermediate memory. In the second call to the exponent processing unit, the second shifter shifts the exponent result obtained in that call to the left by 0 bits, i.e., does not shift it, because no data follows the portion intercepted in that call. The second adder adds this unshifted exponent result to the value stored in the second intermediate memory and stores the sum in the second intermediate memory to update it. Since the second call is the last call, the value stored in the second intermediate memory after the second call is the exponent after the multiplication operation.
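The shift-and-add accumulation described above can be modelled in software as follows. This is a minimal sketch under stated assumptions: the function name `add_by_repeated_calls` and the sample operand values are hypothetical, the variable `memory` merely stands in for the second intermediate memory, and initializing it to 0 is equivalent to storing the shifted result of the first call directly.

```python
def add_by_repeated_calls(a: int, b: int, width: int, q: int) -> int:
    """Add two `width`-bit values using an adder limited to Q-bit inputs.

    Each call adds one (Q-1)-bit portion of each operand, intercepted from the
    high-order bits to the low-order bits; the partial sum is shifted left by
    the number of bits that follow the intercepted portion and accumulated.
    """
    chunk = q - 1
    memory = 0                      # models the second intermediate memory
    remaining = width               # bits that still follow the current portion
    while remaining > 0:
        take = min(chunk, remaining)
        remaining -= take
        mask = (1 << take) - 1
        part_a = (a >> remaining) & mask
        part_b = (b >> remaining) & mask
        partial = part_a + part_b   # one call; the sum fits in Q bits
        memory += partial << remaining
    return memory


# 6-bit and 9-bit operands, Q = 8: two calls, shifted by 2 bits and by 0 bits.
print(add_by_repeated_calls(0b110101, 0b101100111, 9, 8))  # 412
```

With these sample values the first call yields 13 + 89 = 102, which is shifted left by 2 bits to give 408, and the second call yields 1 + 3 = 4 with no shift, for a total of 412, the same as adding the two operands directly.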

As can be seen from the above detailed description of the cases in which the units of the multiplier of the present disclosure (the mantissa processing unit and the exponent processing unit) are called multiple times, the control module may include multiple sub-modules, which may respectively perform the various operations involved in the multiple calls, such as determining that the mantissa processing unit is to be called multiple times, determining the number of calls, determining the data input to the mantissa processing unit in each call, determining whether the mantissa bit width matches the bit width supported by the mantissa processing unit, adjusting the mantissa input, and the like. The second control module may likewise include a plurality of sub-modules, each of which may perform various operations in the multiple calls.

The operations performed by the multiplier of the present disclosure to multiply the mantissas of a first floating point number and a second floating point number when performing a floating point operation are described in detail above in conjunction with fig. 4-6. Of course, fig. 4 does not depict and describe other elements, such as exponent processing elements and sign processing elements, in order to focus on the description of the operation of the mantissa processing elements of the disclosed multiplier. The multiplier of the present disclosure will be described in its entirety with reference to fig. 7, and the description made above for the mantissa processing unit is also applicable to the case illustrated in fig. 7.

Fig. 7 is an overall schematic block diagram illustrating a multiplier 700 according to an embodiment of the present disclosure. It should be understood that the positions, existence and connection relationships of the various units depicted in the drawings are only exemplary and not limiting, for example, some of the units may be integrated, and other units may be separated or omitted or replaced according to different application scenarios.

The multiplier of the present disclosure can, in each operation mode, be exemplarily divided into a first stage and a second stage according to the operation flow, as depicted by the dotted line in the figure. In summary, in the first stage, the calculation result of the sign bit is output, the intermediate result of the exponent bits is output, and the intermediate result of the mantissa bits is output (e.g., including the Booth encoding process and the Wallace tree compression process of the fixed-point multiplication of the input mantissa bits described above). In the second stage, regularization and rounding operations are performed on the exponent and the mantissa to output the calculation result of the exponent and the calculation result of the mantissa.

As shown in fig. 7, the multiplier of the present disclosure may include a mode selection unit 702 and a normalization processing unit 704, wherein the mode selection unit may select an operation mode according to an input mode signal (in_mode). In one embodiment, the input mode signal may correspond to the operation mode number in table 2. For example, when the input mode signal indicates the operation mode number "1" in table 2, the multiplier may operate in the FP16 × FP16 operation mode, and when the input mode signal indicates the operation mode number "3" in table 2, the multiplier may operate in the FP32 × FP32 operation mode. For illustration purposes, fig. 7 shows only four exemplary operation modes: FP16 × FP16, BF16 × BF16, FP32 × FP32, and FP32 × BF16. However, as mentioned above, the multiplier of the present disclosure also supports a variety of other operation modes.

The normalization processing unit may be configured to normalize the first floating point number or the second floating point number according to the operation mode, so as to obtain the corresponding exponent and mantissa, when the first floating point number or the second floating point number is a non-normalized non-zero floating point number. For example, a floating point number in the data format indicated by the operation mode is normalized according to the IEEE 754 standard.

Further, the multiplier includes a mantissa processing unit to perform the multiplication of the mantissa of the first floating point number and the mantissa of the second floating point number. To this end, in one or more embodiments, the mantissa processing unit may include a bit number expansion circuit 706, a Booth encoder 708, a partial product generation circuit 710, a Wallace tree compressor 712, and an adder 714, where the bit number expansion circuit may be configured to expand the number of bits of the mantissa of at least one of the first floating point number and the second floating point number, e.g., by padding 0s in the high-order bits, so that the mantissa is suitable for operation of the Booth encoder. The control circuit can call the mantissa processing unit a plurality of times according to the mantissa obtained after the sign bit of the mantissa has been expanded by the bit number expansion circuit. Since the Booth encoder, the partial product generation circuit, the Wallace tree compressor, and the adder have been described in detail in conjunction with fig. 4-6, the same description applies here and is not repeated.

In some embodiments, the multiplier of the present disclosure further includes a regularization unit 716 and a rounding unit 718, which have the same functionality as the units shown in fig. 3. Specifically, the regularization unit may perform floating point regularization processing on the sum result and on the exponent data from the exponent processing unit according to the data format indicated by the output mode signal "out_mode" shown in fig. 7, so as to obtain a regularized exponent result and a regularized mantissa result. For example, depending on the data format indicated by the output mode signal, the regularization unit may adjust the bit widths of the exponent and mantissa to conform to the requirements of that data format. For another example, when the most significant bit of the mantissa is 0 and the mantissa is not 0, the regularization unit may repeatedly shift the mantissa left by 1 bit and decrement the exponent by 1 until the most significant bit is 1. In one embodiment, the rounding unit may be configured to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to take the rounded mantissa as the mantissa after the multiplication operation.
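The left-shift regularization step described above (shift the mantissa left by 1 bit and decrement the exponent by 1 until the most significant bit is 1) may be sketched in Python as follows. The function name `regularize` and the explicit `width` parameter are hypothetical; the sketch assumes the mantissa is an unsigned integer that fits in `width` bits and does not model exponent underflow.

```python
def regularize(mantissa: int, exponent: int, width: int) -> tuple[int, int]:
    """Shift `mantissa` left until bit `width - 1` is 1, adjusting `exponent`.

    A zero mantissa is returned unchanged, matching the condition that the
    step applies only when the mantissa is not 0.
    """
    if mantissa == 0:
        return mantissa, exponent
    msb = 1 << (width - 1)
    while (mantissa & msb) == 0:
        mantissa <<= 1      # shift left by 1 bit
        exponent -= 1       # compensate in the exponent
    return mantissa, exponent


print(regularize(0b00101000, 10, 8))  # (160, 8), i.e. mantissa 0b10100000
```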

In one or more embodiments, the aforementioned output mode signal may be a part of the operation mode and indicates the data format after the multiplication operation. For example, as described in table 3 above, when the operation mode number is "12", the digit "1" may be equivalent to the aforementioned "in_mode" signal, indicating that the FP16 × FP16 multiplication operation is performed, and the digit "2" may be equivalent to the "out_mode" signal, indicating that the data type of the output result is BF16. It will therefore be appreciated that in some application scenarios the output mode signal may be combined with the aforementioned input mode signal and provided to the mode selection unit. Based on this combined mode signal, the mode selection unit can specify the data formats of the input data and of the output result at the initial stage of the multiplier operation, without the output mode signal being provided separately to the regularization unit, whereby the operation can be further simplified.

In one or more embodiments, for the aforementioned rounding operation, the following 5 rounding modes may be exemplarily included.

(1) Rounding to the nearest value: in this mode, the result is rounded to the nearest representable value, and when two representable values are equally close, the even one (the one ending in 0 in binary) is taken as the rounding result;

(2) rounding off: exemplary operation see the examples below;

(3) rounding in the +∞ direction: under this rule, the result will be rounded towards positive infinity;

(4) rounding in the -∞ direction: under this rule, the result will be rounded towards negative infinity; and

(5) rounding towards 0: under this rule, the result is rounded towards 0.
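The five rounding modes listed above can be illustrated on an integer mantissa from which a number of low-order bits are dropped. The following Python sketch is an illustration only: the function name `round_magnitude`, the mode strings, and the sign-magnitude treatment (the mantissa is handled as a magnitude with a separate `negative` flag, and "rounding off" is taken as round-half-up on the magnitude) are assumptions, not details stated in the disclosure.

```python
def round_magnitude(mag: int, drop: int, mode: str, negative: bool = False) -> int:
    """Drop the low `drop` bits of the magnitude `mag` under a rounding mode."""
    kept = mag >> drop
    rem = mag & ((1 << drop) - 1)       # the discarded bits
    if rem == 0:
        return kept                     # exact, nothing to round
    half = 1 << (drop - 1)
    if mode == "nearest_even":          # (1) ties go to the even value
        up = rem > half or (rem == half and (kept & 1) == 1)
    elif mode == "round_off":           # (2) half or more rounds up
        up = rem >= half
    elif mode == "toward_pos_inf":      # (3) magnitude grows only if positive
        up = not negative
    elif mode == "toward_neg_inf":      # (4) magnitude grows only if negative
        up = negative
    elif mode == "toward_zero":         # (5) truncate
        up = False
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return kept + 1 if up else kept


print(round_magnitude(0b1010, 2, "nearest_even"))  # 2 (tie, 2 is even)
print(round_magnitude(0b1110, 2, "nearest_even"))  # 4 (tie, 3 is odd)
print(round_magnitude(0b1010, 2, "round_off"))     # 3
```

A carry out of the kept bits (for example, rounding an all-ones mantissa upward) would require a further right shift and exponent increment, which this sketch leaves to the caller.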

As an example of mantissa rounding in the "rounding off" mode: multiplying the 24-bit mantissas of two normalized floating point numbers yields a 48-bit mantissa (bits 47 to 0), of which only bits 46 to 24 are output after normalization processing (if the most significant bit of the mantissa is 0, the mantissa is shifted left by 1 bit; if the most significant bit of the mantissa is 1, the mantissa is not shifted and the previously obtained temporary exponent is incremented by 1). When bit 23 of the mantissa is 0, bits 23 to 0 are discarded; when bit 23 of the mantissa is 1, a carry of 1 is added into bit 24 and bits 23 to 0 are discarded.
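The 48-bit example above may be traced with the following Python sketch. The function name `normalize_and_round_off` is hypothetical, and the handling of a carry that propagates out of bit 46 (resetting the fraction to 0 and incrementing the exponent) is an addition that the text does not describe; it is included only so that the sketch always returns a 23-bit field.

```python
def normalize_and_round_off(product: int, exponent: int) -> tuple[int, int]:
    """Reduce a 48-bit mantissa product to the 23 output bits (bits 46..24).

    `product` is the product of two 24-bit mantissas whose leading bit is 1,
    and `exponent` is the temporary exponent obtained beforehand.
    """
    if (product >> 47) & 1:
        exponent += 1                 # MSB is 1: no shift, exponent + 1
    else:
        product <<= 1                 # MSB is 0: shift left by 1 bit
    fraction = (product >> 24) & ((1 << 23) - 1)    # bits 46..24
    if (product >> 23) & 1:           # bit 23 is 1: carry into bit 24
        fraction += 1
        if fraction >> 23:            # assumption: carry out of bit 46
            fraction = 0
            exponent += 1
    return fraction, exponent


# 1.5 * 1.5 = 2.25 = 1.125 * 2**1; the fraction 0.125 is 0x100000 in 23 bits.
print(normalize_and_round_off(0xC00000 * 0xC00000, 0))  # (1048576, 1)
```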

Returning to fig. 7, the multiplier of the present disclosure further includes an exponent processing unit 720 and a sign processing unit 722, where the exponent processing unit may be configured to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating point number, and the exponent of the second floating point number. For example, the exponent processing circuit may add the exponent bit data of the first floating point number, the exponent bit data of the second floating point number, and the respective corresponding offset values of the input floating point data types, and subtract the offset value of the output floating point data type, to obtain the exponent bit data of the product of the first floating point number and the second floating point number. In one or more embodiments, the exponent processing unit may be implemented as, or may include, an addition and subtraction circuit to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating point number, and the exponent of the second floating point number.
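As an illustration of combining the exponent fields with the offset values of the input and output data types, the following Python sketch uses the usual IEEE 754 style convention in which each stored exponent carries a positive bias, so that the input biases are removed and the output bias is re-applied. That sign convention, the function name `product_exponent`, and the `BIAS` table are assumptions made for this sketch; the bias values 15 for FP16 and 127 for BF16 and FP32 are the standard ones. Normalization and rounding adjustments, overflow, and underflow are not modelled.

```python
# Exponent biases of the formats named in this description.
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127}


def product_exponent(e1: int, fmt1: str, e2: int, fmt2: str, out_fmt: str) -> int:
    """Stored exponent of the product, before normalization and rounding."""
    unbiased = (e1 - BIAS[fmt1]) + (e2 - BIAS[fmt2])   # true exponents add
    return unbiased + BIAS[out_fmt]                    # re-bias for the output


# 2.0 * 2.0 in FP16 (stored exponent 16 each), output as FP32: 2**2 -> 129.
print(product_exponent(16, "FP16", 16, "FP16", "FP32"))  # 129
```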

The sign processing unit may in one embodiment be implemented as an exclusive or circuit for performing an exclusive or operation on the sign bit data of the first and second floating point numbers to obtain the sign bit data of the product of the first and second floating point numbers.
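The exclusive-or relationship between the input sign bits and the sign bit of the product can be stated in a few lines of Python. The function name `product_sign` is hypothetical, and the sketch models only the logic, not the circuit.

```python
def product_sign(sign_a: int, sign_b: int) -> int:
    """Sign bit of the product: 0 if the input signs agree, 1 if they differ."""
    return (sign_a ^ sign_b) & 1


for a in (0, 1):
    for b in (0, 1):
        print(a, b, product_sign(a, b))
```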

The multiplier of the present disclosure is described in detail in its entirety above in connection with fig. 7. From this description, those skilled in the art will appreciate that the multiplier of the present disclosure supports operation in multiple operation modes, thereby overcoming the disadvantage that prior art multipliers support only a single floating point operation type. Further, the multiplier of the present disclosure can be reused through multiple calls, so that floating point data of high bit width is supported and the operation cost and overhead are reduced. In one or more embodiments, the multiplier of the present disclosure may also be arranged or included in an integrated circuit chip or computing device to perform multiplication operations on floating point numbers in multiple operation modes.

Fig. 8 is a flow chart illustrating a method 800 of performing a floating point multiplication operation using a multiplier in accordance with an embodiment of the present disclosure. It will be appreciated that the multiplier referred to here is the multiplier described in detail above in connection with fig. 1-7, so the foregoing description of the multiplier and its internal components, functions and operations applies equally here.

As shown in fig. 8, the method 800 may include obtaining, with an exponent processing unit of the multiplier, the multiplied exponent according to an operation mode, an exponent of a first floating point number, and an exponent of a second floating point number at step S802. As previously mentioned, the operational mode may be one of a plurality of operational modes and may be used to indicate the data format of a floating point number. In one or more embodiments, the operational mode may also be used to determine the data format of the floating point number of the output result.

Next, at step S804, the method 800 may obtain, by using the mantissa processing unit of the multiplier, the mantissa after the multiplication operation according to the operation mode, the first floating point number, and the second floating point number. With respect to exemplary mantissa operations, the present disclosure uses the Booth encoding algorithm and the Wallace tree compressor in some preferred embodiments, thereby improving the efficiency of mantissa processing. Additionally, when the first floating point number and the second floating point number are signed numbers, the method 800 may also, at step S806, obtain the sign after the multiplication operation from the sign of the first floating point number and the sign of the second floating point number.
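Steps S802 to S806 can be traced end to end in software for one fixed operation mode. The following Python sketch is a simplified illustration under several assumptions: it covers only FP32 × FP32 with an FP32 result, accepts only normalized inputs whose product is also a normal number, uses round-to-nearest-even, and replaces the Booth encoding and Wallace tree compression with ordinary integer multiplication. The names `f32_bits`, `bits_f32`, and `fp32_multiply` are hypothetical.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of `x` rounded to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """FP32 value of a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def fp32_multiply(a_bits: int, b_bits: int) -> int:
    """Multiply two normalized FP32 bit patterns (normal results only)."""
    # S806: sign of the product is the XOR of the input signs.
    sign = ((a_bits >> 31) ^ (b_bits >> 31)) & 1
    # S802: add the exponent fields and remove one bias (127 for FP32).
    exp = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - 127
    # S804: multiply the 24-bit mantissas (hidden 1 restored), 48-bit result.
    man_a = (a_bits & 0x7FFFFF) | 0x800000
    man_b = (b_bits & 0x7FFFFF) | 0x800000
    product = man_a * man_b
    # Regularization: keep 24 bits starting at the leading 1.
    shift = 23
    if product >> 47:
        shift = 24
        exp += 1
    kept = product >> shift
    rem = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Rounding to nearest, ties to even.
    if rem > half or (rem == half and (kept & 1)):
        kept += 1
        if kept >> 24:              # rounding carried out of the mantissa
            kept >>= 1
            exp += 1
    return (sign << 31) | (exp << 23) | (kept & 0x7FFFFF)


result = fp32_multiply(f32_bits(1.5), f32_bits(2.5))
print(hex(result), bits_f32(result))  # 0x40700000 3.75
```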

Although the above-described method illustrates the use of the multiplier of the present disclosure in the form of steps to perform floating point multiplication operations, the order of the steps does not imply that the steps of the method must be performed in the order described, but rather may be processed in other orders or in parallel. In addition, other steps of the method 800 are not set forth herein for simplicity of description, but those skilled in the art will appreciate from this disclosure that the method may also perform the various operations described above in conjunction with fig. 1-7 by using multipliers.

In the above embodiments of the present disclosure, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

Fig. 9 is a block diagram illustrating a combined processing device 900 according to an embodiment of the present disclosure. As shown, the combined processing device 900 comprises a computing device 902, which may comprise the multiplier of the present disclosure as described above in connection with the figures. In addition, the combined processing device includes a universal interconnect interface 904 and other processing devices 906. The computing device according to the present disclosure interacts with the other processing devices to jointly complete operations specified by a user.

According to aspects of the present disclosure, the other processing devices may include one or more types of general purpose and/or special purpose processors, such as a central processing unit ("CPU"), a graphics processing unit ("GPU"), a neural network processor, etc.; their number is not limited but is determined according to actual needs. In one or more embodiments, the other processing devices may serve as the interface between the computing device of the present disclosure (which may be embodied as a machine learning computing device) and external data and control, performing basic control including, but not limited to, data handling and starting and stopping the machine learning computing device; the other processing devices may also cooperate with the machine learning computing device to complete computing tasks.

In accordance with aspects of the present disclosure, the universal interconnect interface may be used to transfer data and control instructions between the computing device and the other processing devices. For example, the computing device may obtain the required input data from the other processing devices via the universal interconnect interface and write the input data to an on-chip storage device of the computing device. Further, the computing device may obtain control instructions from the other processing devices via the universal interconnect interface and write the control instructions into an on-chip control cache of the computing device. Alternatively or additionally, the universal interconnect interface may also read data from a memory module of the computing device and transmit it to the other processing devices.

Optionally, the combined processing device may further comprise a storage device 908, which may be connected to the computing device and the other processing device, respectively. In one or more embodiments, the storage device may be configured to store data of the computing device and the other processing devices, and is particularly suitable for storing data that is not completely stored in the internal storage of the computing device or the other processing devices.

Depending on the application scenario, the combined processing device of the present disclosure can be used as a system on chip (SoC) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, video capture equipment and video monitoring equipment, so that the core area of the control part is effectively reduced, the processing speed is increased, and the overall power consumption is reduced. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the equipment, for example a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface.

In some embodiments, the present disclosure also discloses a chip or an integrated circuit chip comprising the above-mentioned computing device, the combination processing device, and the multiplier of the present disclosure. In other embodiments, the present disclosure also discloses a chip packaging structure, which includes the above chip.

In some embodiments, the present disclosure also discloses a board card comprising the above chip packaging structure. Referring to fig. 10, an exemplary board card is provided. In addition to the chip 1002, the board card may include other accessories, including but not limited to: a memory device 1004, an interface device 1006, and a control device 1008.

The memory device is connected with the chip in the chip packaging structure through a bus and used for storing data. The memory device may include multiple sets of memory cells 1010. Each group of the storage units is connected with the chip through a bus. It will be appreciated that each group of the memory cells may be a DDR SDRAM ("Double Data Rate SDRAM").

DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read out on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the memory cells. Each group of the memory cells may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits are used for ECC check.

In one embodiment, each group of the memory cells may include a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.

The interface device is electrically connected with the chip in the chip packaging structure. The interface device is used for enabling data transfer between the chip and an external device 1012, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIE interface, through which the server transmits the data to be processed to the chip, thereby realizing data transfer. In another embodiment, the interface device may also be another interface; the present disclosure does not limit the specific form of such other interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.

The control device is electrically connected with the chip so as to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit ("MCU"). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light load. The control device can regulate the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.

In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card. According to different application scenarios, the electronic device or apparatus may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or an automobile; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.

It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, optical, acoustic, magnetic or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.

The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product that is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present disclosure. The aforementioned memory comprises: a USB flash drive, a Read-Only Memory ("ROM"), a Random Access Memory ("RAM"), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.

The foregoing may be better understood in light of the following clauses:

Clause A1, a multiplier for performing multiplication operations on floating point numbers, wherein the multiplier comprises:

a mantissa processing unit configured to obtain the mantissa after the multiplication operation according to the mantissas of two floating point numbers,

wherein the mantissa processing unit comprises a control circuit, and the control circuit is configured to call the mantissa processing unit multiple times when the mantissa bit width of at least one of the two floating point numbers is greater than the data bit width that the mantissa processing unit can process at one time.

Clause A2, the multiplier of clause A1, wherein the two floating point numbers include a first floating point number and a second floating point number, the mantissa processing unit supports a first bit width and a second bit width, a mantissa of the first floating point number serves as a first input corresponding to the first bit width, a mantissa of the second floating point number serves as a second input corresponding to the second bit width, the bit width of the first input is less than or equal to the first bit width, and the control circuit is configured to, when the bit width of the second input is greater than the second bit width, call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation.

Clause A3, the multiplier of clause A1 or clause A2, wherein the two floating point numbers include a first floating point number and a second floating point number, the mantissa processing unit supports a first bit width and a second bit width, a mantissa of the first floating point number serves as a first input corresponding to the first bit width, a mantissa of the second floating point number serves as a second input corresponding to the second bit width, and the control circuit is configured to call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, or when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width.

Clause a4, the multiplier of any of clauses a1-A3, wherein the control circuit selects the mantissa of the first floating point number as the second input corresponding to the second bit width and selects the mantissa of the second floating point number as the first input corresponding to the first bit width when the mantissa bit width of the first floating point number is less than the mantissa bit width of the second floating point number and the first bit width is greater than the second bit width or when the mantissa bit width of the first floating point number is greater than the mantissa bit width of the second floating point number and the first bit width is less than the second bit width.
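
By way of non-limiting illustration, the operand selection described in this clause can be sketched in Python. The function name `assign_inputs` and the labels `"a"` and `"b"` are hypothetical and are used only for this sketch; the clause itself only states the two conditions under which the operands are exchanged.

```python
def assign_inputs(mant_a_bits, mant_b_bits, first_width, second_width):
    """Decide which mantissa feeds the first and second inputs.

    Returns a pair (first_input, second_input) of labels, where "a" is the
    first floating point number and "b" is the second one.
    """
    swap = (
        # The narrower mantissa would otherwise land on the wider port ...
        (mant_a_bits < mant_b_bits and first_width > second_width)
        # ... or the wider mantissa would land on the narrower port.
        or (mant_a_bits > mant_b_bits and first_width < second_width)
    )
    return ("b", "a") if swap else ("a", "b")
```

Under these assumptions, an 11-bit mantissa paired with a 24-bit mantissa on a unit whose first bit width is 24 and second bit width is 12 is routed so that the wider mantissa uses the wider input.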

Clause a5, the multiplier of any of clauses a1-a4, wherein when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call according to the bit width of the first input and the first bit width.

Clause a6, the multiplier of any of clauses a1-a5, wherein when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, the control circuit determines the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call according to the bit width of the second input and the second bit width.

Clause A7, the multiplier of any of clauses A1-A6, wherein when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width, the control circuit determines the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call according to the bit width of the first input, the first bit width, the bit width of the second input, and the second bit width.
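
As a minimal sketch of how the number of calls and the per-call data might be derived from the bit widths, the following Python function enumerates one call per pair of slices. It assumes that each input is cut into contiguous slices from the least significant bit upward; the clauses do not prescribe a slicing scheme, and the name `plan_calls` is hypothetical.

```python
def plan_calls(first_in_bits, second_in_bits, first_width, second_width):
    """List the (first_offset, second_offset) bit offsets of every call.

    Each call processes one slice of at most first_width bits of the first
    input and one slice of at most second_width bits of the second input.
    """
    # Integer ceiling division: number of slices needed per input.
    n_first = -(-first_in_bits // first_width)
    n_second = -(-second_in_bits // second_width)
    return [
        (i * first_width, j * second_width)
        for i in range(n_first)
        for j in range(n_second)
    ]
```

With this scheme, only the second input being too wide gives a call count that depends on the second input alone, and both inputs being too wide gives the product of the two slice counts.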

Clause A8, the multiplier of any of clauses a1-a7, wherein the mantissa processing unit further comprises a shift-and-add circuit for obtaining the multiplied mantissa according to the mantissa result obtained by each call to the mantissa processing unit.

Clause A9, the multiplier of any of clauses A1-A8, wherein the shift-and-add circuit comprises a shifter, an intermediate memory, and an adder, wherein when the control circuit calls the mantissa processing unit multiple times: after the first call, the shifter shifts the mantissa result obtained by the first call to obtain a shifted mantissa result and stores the shifted mantissa result in the intermediate memory; starting from the second call, the shifter shifts the mantissa result obtained in the current call to obtain a current mantissa result, and the adder adds the current mantissa result to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it; and the result stored in the intermediate memory after the last call is taken as the mantissa after the multiplication operation.
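
The shift-and-add accumulation can be illustrated with a short Python sketch. It is a software model under stated assumptions, not the circuit itself: the inner `unit` function stands in for one call of the mantissa processing unit, the slices are taken from the least significant bit upward, and initializing the accumulator to zero is equivalent to storing the shifted result of the first call directly. All names are hypothetical.

```python
def multiply_by_repeated_calls(a, b, a_bits, b_bits, first_width, second_width):
    """Multiply two unsigned mantissas using a narrow unit called several times."""

    def unit(x, y):
        # Stand-in for one call of the mantissa processing unit.
        assert x < (1 << first_width) and y < (1 << second_width)
        return x * y

    intermediate = 0  # plays the role of the intermediate memory
    for off_a in range(0, a_bits, first_width):
        slice_a = (a >> off_a) & ((1 << first_width) - 1)
        for off_b in range(0, b_bits, second_width):
            slice_b = (b >> off_b) & ((1 << second_width) - 1)
            partial = unit(slice_a, slice_b)       # result of the current call
            shifted = partial << (off_a + off_b)   # shifter aligns the result
            intermediate += shifted                # adder updates the memory
    return intermediate
```

The value left in `intermediate` after the last call equals the full-width product, because each partial result is weighted by the combined offset of the two slices that produced it.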

Clause A10, the multiplier according to any of clauses A1-A9, wherein the multiplier further comprises an exponent processing unit for obtaining the exponent after the multiplication operation from the exponents of the two floating point numbers, the exponent processing unit comprising a second control circuit configured to determine, according to the exponent bit width of one of the two floating point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of the two floating point numbers and the two bit widths supported by the exponent processing unit, to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation.

Clause a11, the multiplier according to any of clauses a1-a10, wherein the two floating point numbers include a first floating point number and a second floating point number, the exponent processing unit supports a third bit width and a fourth bit width, an exponent of the first floating point number is used as a third input corresponding to the third bit width, an exponent of the second floating point number is used as a fourth input corresponding to the fourth bit width, the bit width of the third input is less than or equal to the third bit width, and the second control circuit is configured to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the fourth input is greater than the fourth bit width.

Clause A12, the multiplier of any of clauses A1-A11, wherein the two floating point numbers include a first floating point number and a second floating point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating point number serves as a third input corresponding to the third bit width, the exponent of the second floating point number serves as a fourth input corresponding to the fourth bit width, and the second control circuit is configured to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width, when the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width, or when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width.

Clause a13, the multiplier of any of clauses a1-a12, wherein the second control circuit selects the exponent of the first floating point number as the fourth input corresponding to the fourth bit width and selects the exponent of the second floating point number as the third input corresponding to the third bit width when the exponent bit width of the first floating point number is less than the exponent bit width of the second floating point number and the third bit width is greater than the fourth bit width or when the exponent bit width of the first floating point number is greater than the exponent bit width of the second floating point number and the third bit width is less than the fourth bit width.

Clause a14, the multiplier of any of clauses a1-a13, wherein the second control circuit is configured to determine the number of times the exponent processing unit is called and the data input to the exponent processing unit in each call according to the bit width of the fourth input and the third bit width when the bit width of the third input is less than or equal to the bit width of the fourth input and the third bit width is less than or equal to the fourth bit width.

Clause a15, the multiplier of any of clauses a1-a14, wherein the exponent processing unit further comprises a second shift-and-add circuit for obtaining the multiplied exponent from the exponent result obtained each time the exponent processing unit is invoked.

Clause a16, the multiplier of any of clauses a1-a15, wherein the mantissa processing unit comprises a partial product operation unit and a partial product summation unit, wherein the partial product operation unit is configured to obtain an intermediate result from mantissas of the two floating point numbers, and the partial product summation unit is configured to sum the intermediate result to obtain a summed result, and to take the summed result as the mantissa after the multiplication operation.

Clause A17, the multiplier of any of clauses A1-A16, wherein the partial product operation unit comprises a Booth encoding circuit for Booth-encoding the mantissa of the first floating point number or the second floating point number to obtain the intermediate result.
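
For illustration only, the following Python sketch shows one common form of Booth encoding, radix-4 (modified) Booth recoding, applied to an unsigned mantissa. The clause does not state which radix is used, so the radix, the zero extension of the unsigned operand, and the function names are assumptions of this sketch.

```python
# Radix-4 Booth digit for each overlapping bit triplet (y[2i+1], y[2i], y[2i-1]).
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_digits(multiplier, bits):
    """Recode an unsigned `bits`-wide multiplier into digits in {-2,...,2}."""
    n_digits = bits // 2 + 1      # one extra digit keeps the top digit >= 0
    padded = multiplier << 1      # append the implicit bit y[-1] = 0
    return [BOOTH_DIGIT[(padded >> (2 * i)) & 0b111] for i in range(n_digits)]

def booth_partial_products(multiplicand, multiplier, bits):
    """Partial products (intermediate results), one per Booth digit."""
    return [(d * multiplicand) << (2 * i)
            for i, d in enumerate(booth_digits(multiplier, bits))]
```

Summing the partial products returned by `booth_partial_products` reproduces the ordinary product, while the number of partial products is roughly half the number of multiplier bits.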

Clause a18, the multiplier of any of clauses a1-a17, wherein the partial product summing unit comprises an adder for summing the intermediate results to obtain the summed result.

Clause A19, the multiplier of any of clauses A1-A18, wherein the partial product summing unit comprises a Wallace tree for summing the intermediate results to obtain a second intermediate result, and an adder for summing the second intermediate result to obtain the summed result.
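
The two-stage summation (Wallace tree followed by an adder) can be sketched in Python as a word-level carry-save reduction. This is a simplified model under stated assumptions: a hardware Wallace tree compresses each bit column with its own compressors, whereas the sketch applies 3:2 compression to whole non-negative integers. The function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 compressor on whole words: x + y + z == sum_word + carry_word."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word

def wallace_reduce(rows):
    """Reduce any number of rows to two rows without propagating carries."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fill a group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]

def sum_partial_products(rows):
    """Wallace-style reduction, then a single carry-propagating addition."""
    first, second = wallace_reduce(rows)   # the second intermediate result
    return first + second                  # the final adder
```

Only the final addition propagates carries across the full width, which is the usual motivation for placing a compressor tree before the adder.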

Clause A20, the multiplier of any of clauses A1-A19, wherein the adder comprises at least one of a full adder, a serial adder, and a carry-lookahead adder.

Clause a21, the multiplier of any of clauses a1-a20, wherein, when the number of intermediate results is less than M, zero values are supplemented as intermediate results so that the number of intermediate results is equal to M, where M is a preset positive integer.

Clause A22, the multiplier of any of clauses A1-A21, wherein each of the Wallace trees has M inputs and N outputs, the number of Wallace trees being no less than K, where N is a preset positive integer less than M and K is a positive integer no less than the maximum bit width of the intermediate result.

Clause A23, the multiplier of any of clauses A1-A22, wherein the partial product summation unit is configured to sum the intermediate results using one or more groups of the Wallace trees, wherein each group has X Wallace trees, X being the number of bits of the intermediate results, wherein the Wallace trees within each group are linked to one another by carries, and there is no carry relationship between Wallace trees of different groups.

Clause a24, the multiplier of any of clauses a1-a23, wherein the multiplier further comprises:

a normalization processing unit configured to, when at least one of the two floating point numbers is a non-normalized non-zero floating point number, normalize the at least one floating point number to obtain a corresponding exponent and mantissa.
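
As a hedged illustration of normalizing a non-normalized non-zero floating point number, the sketch below assumes an IEEE 754 style encoding in which such a number has an all-zero exponent field and a non-zero fraction field. The function name and parameters are hypothetical; the text does not specify the encoding.

```python
def normalize_subnormal(fraction, frac_bits, bias):
    """Normalize a subnormal operand (exponent field 0, fraction != 0).

    Returns (exponent, significand), where the significand has frac_bits + 1
    bits with its leading bit set, and the represented value is
    significand * 2 ** (exponent - frac_bits).
    """
    assert 0 < fraction < (1 << frac_bits)
    # Number of positions needed to bring the leading 1 to the hidden-bit slot.
    shift = frac_bits + 1 - fraction.bit_length()
    significand = fraction << shift
    exponent = 1 - bias - shift   # subnormals use the exponent 1 - bias
    return exponent, significand
```

For a half-precision operand (10 fraction bits, bias 15), the smallest subnormal is normalized to a significand of `1 << 10` with exponent -24.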

Clause a25, the multiplier of any of clauses a1-a24, wherein the multiplier is configured to perform a multiplication operation of the two floating point numbers according to an operation mode, the operation mode indicating a data format of the two floating point numbers, the mantissa processing unit is configured to obtain the multiplied mantissa according to the operation mode and the mantissa of the two floating point numbers, and the exponent processing unit is configured to obtain the multiplied exponent according to the operation mode and the exponent of the two floating point numbers.

Clause A26, the multiplier of any of clauses A1-A25, wherein the normalization processing unit is further configured to normalize at least one of the two floating point numbers according to the operation mode to obtain a corresponding exponent and mantissa.

Clause A27, the multiplier of any of clauses A1-A26, wherein the data format comprises at least one of a half-precision floating point number, a single-precision floating point number, a brain floating point number, a double-precision floating point number, and a custom floating point number.
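
To make the listed data formats concrete, the following Python sketch records the exponent and fraction field widths of the standard half-precision, brain floating point (bfloat16), single-precision, and double-precision encodings and splits a raw bit pattern into its fields. The format keys (`"fp16"` and so on) and the function name are hypothetical labels for this sketch; a custom floating point format could be added as a further entry with its own widths.

```python
# (exponent bits, fraction bits); every format also has one sign bit.
FORMATS = {
    "fp16": (5, 10),    # half precision
    "bf16": (8, 7),     # brain floating point
    "fp32": (8, 23),    # single precision
    "fp64": (11, 52),   # double precision
}

def split_fields(bits, fmt):
    """Split a raw encoding into (sign, exponent field, fraction field)."""
    exp_bits, frac_bits = FORMATS[fmt]
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction
```

The differing fraction widths are what make the mantissa bit width of an operand larger or smaller than the bit width supported by a given mantissa processing unit.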

Clause A28, the multiplier of any of clauses A1-A27, wherein the mantissa processing unit comprises a bit extension circuit for extending the number of bits of the mantissa of at least one of the first floating point number and the second floating point number.

Clause A29, the multiplier of any of clauses A1-A28, wherein the floating point number further comprises a sign, the multiplier further comprising:

a sign processing unit configured to obtain the sign after the multiplication operation according to the signs of the two floating point numbers.

Clause A30, the multiplier of any of clauses A1-A29, wherein the sign processing unit comprises an XOR logic circuit configured to perform an XOR operation on the signs of the two floating point numbers to obtain the sign after the multiplication operation.
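
The XOR operation on the sign bits can be written as a one-line Python sketch, assuming the usual convention that a sign bit of 0 denotes a non-negative number and 1 denotes a negative number. The function name is hypothetical.

```python
def product_sign(sign_a, sign_b):
    """Sign bit of a product: 1 (negative) only when the operand signs differ."""
    return sign_a ^ sign_b
```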

Clause a31, the multiplier of any of clauses a1-a30, further comprising a regularization unit for:

performing floating point regularization on the mantissa and the exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and taking the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.
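
One possible form of this post-multiplication adjustment is sketched below in Python. It assumes that both input significands are already normalized to `sig_bits` bits with the leading bit set, so that their product lies in the range [1, 4) when read as a fixed-point value; the text does not spell out the procedure, and the names are hypothetical.

```python
def regularize(product, exponent, sig_bits):
    """Align a raw significand product so its leading 1 sits at a fixed bit.

    `product` is the product of two sig_bits-wide significands with leading
    bit set, and `exponent` is the sum of their unbiased exponents. The
    result has its leading 1 at bit 2 * sig_bits - 1.
    """
    top = 2 * sig_bits - 1
    if product >> top:
        # Product in [2, 4): keep the bits and raise the exponent by one.
        return exponent + 1, product
    # Product in [1, 2): shift left so the leading 1 reaches the top bit.
    return exponent, product << 1
```

For example, with 4-bit significands, 1.5 times 1.5 gives a raw product of 144 whose top bit is set, so the exponent is incremented and the result reads as 1.125 times 2.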

Clause a32, the multiplier of any of clauses a1-a31, further comprising:

a rounding unit to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain a rounded mantissa, and to treat the rounded mantissa as the multiplied mantissa.
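
As an example of one rounding mode, the following Python sketch applies round-to-nearest, ties-to-even when discarding low-order mantissa bits. The clause only refers to "a rounding mode", so the choice of this mode and the function name are assumptions of the sketch.

```python
def round_nearest_even(mantissa, drop):
    """Discard the `drop` low bits (drop >= 1), rounding to nearest, ties to even."""
    kept = mantissa >> drop
    remainder = mantissa & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if remainder > half or (remainder == half and kept & 1):
        kept += 1   # may carry out of the top bit; the caller must renormalize
    return kept
```

If the increment carries out of the most significant bit, the mantissa has to be shifted right by one position and the exponent incremented, which the sketch leaves to the caller.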

Clause a33, a method of performing a floating-point multiplication operation using a multiplier, wherein,

obtaining the multiplied mantissa from the mantissa of the floating-point number using a mantissa processing unit of the multiplier,

Clause A34, an integrated circuit chip comprising the multiplier of any one of clauses A1-A31.

Clause A35, a computing device comprising the multiplier of any one of clauses A1-A31 or the integrated circuit chip of clause A34.

The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is intended to be exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. A person skilled in the art may, based on the idea of the present disclosure, make variations in the specific embodiments and in the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.

It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".

Claims

1. A multiplier for performing multiplication operations on floating point numbers, wherein the multiplier comprises:

2. The multiplier of claim 1, wherein the two floating point numbers include a first floating point number and a second floating point number, the mantissa processing unit supports a first bit width and a second bit width, a mantissa of the first floating point number serves as a first input corresponding to the first bit width, a mantissa of the second floating point number serves as a second input corresponding to the second bit width, the bit width of the first input is less than or equal to the first bit width, and the control circuit is configured to call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation when the bit width of the second input is greater than the second bit width.

3. The multiplier of claim 1, wherein the two floating point numbers include a first floating point number and a second floating point number, the mantissa processing unit supports a first bit width and a second bit width, a mantissa of the first floating point number serves as a first input corresponding to the first bit width, a mantissa of the second floating point number serves as a second input corresponding to the second bit width, and the control circuit is configured to call the mantissa processing unit multiple times to obtain the mantissa after the multiplication operation when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, or when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width.

4. The multiplier of claim 3, wherein the control circuit selects the mantissa of the first floating point number as the second input corresponding to the second bit width and selects the mantissa of the second floating point number as the first input corresponding to the first bit width when the mantissa bit width of the first floating point number is less than the mantissa bit width of the second floating point number and the first bit width is greater than the second bit width or when the mantissa bit width of the first floating point number is greater than the mantissa bit width of the second floating point number and the first bit width is less than the second bit width.

5. The multiplier of claim 4, wherein when the bit width of the first input is greater than the first bit width and the bit width of the second input is less than or equal to the second bit width, the control circuit determines the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call according to the bit width of the first input and the first bit width.

6. The multiplier of claim 4, wherein when the bit width of the second input is greater than the second bit width and the bit width of the first input is less than or equal to the first bit width, the control circuit determines the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call according to the bit width of the second input and the second bit width.

7. The multiplier of claim 4, wherein when the bit width of the first input is greater than the first bit width and the bit width of the second input is greater than the second bit width, the control circuit determines the number of times the mantissa processing unit is called and the data input to the mantissa processing unit in each call according to the bit width of the first input, the first bit width, the bit width of the second input, and the second bit width.

8. A multiplier as claimed in any one of claims 2 to 7, in which the mantissa processing unit further comprises a shift-and-add circuit for obtaining the multiplied mantissa from the mantissa result obtained each time the mantissa processing unit is invoked.

9. A multiplier according to claim 8, wherein the shift-and-add circuit comprises a shifter, an intermediate memory and an adder, wherein when the control circuit calls the mantissa processing unit a plurality of times: after the first call, the shifter shifts the mantissa result obtained by the first call to obtain a shifted mantissa result and stores the shifted mantissa result in the intermediate memory; starting from the second call, the shifter shifts the mantissa result obtained in the current call to obtain a current mantissa result, and the adder adds the current mantissa result to the result stored in the intermediate memory and stores the sum in the intermediate memory to update it; and the result stored in the intermediate memory after the last call is taken as the mantissa after the multiplication operation.

10. The multiplier of claim 1, further comprising an exponent processing unit to obtain the multiplied exponent from the exponents of the two floating point numbers, the exponent processing unit including a second control circuit configured to determine, according to the exponent bit width of one of the two floating point numbers and one of the two bit widths supported by the exponent processing unit, or according to the exponent bit widths of the two floating point numbers and the two bit widths supported by the exponent processing unit, to call the exponent processing unit multiple times to obtain the multiplied exponent.

11. The multiplier of claim 10, wherein the two floating point numbers include a first floating point number and a second floating point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating point number serves as a third input corresponding to the third bit width, the exponent of the second floating point number serves as a fourth input corresponding to the fourth bit width, the bit width of the third input is less than or equal to the third bit width, and the second control circuit is configured to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the fourth input is greater than the fourth bit width.

12. The multiplier of claim 10, wherein the two floating point numbers include a first floating point number and a second floating point number, the exponent processing unit supports a third bit width and a fourth bit width, the exponent of the first floating point number serves as a third input corresponding to the third bit width, the exponent of the second floating point number serves as a fourth input corresponding to the fourth bit width, and the second control circuit is configured to call the exponent processing unit multiple times to obtain the exponent after the multiplication operation when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is less than or equal to the fourth bit width, when the bit width of the fourth input is greater than the fourth bit width and the bit width of the third input is less than or equal to the third bit width, or when the bit width of the third input is greater than the third bit width and the bit width of the fourth input is greater than the fourth bit width.

13. The multiplier of claim 12, wherein the second control circuit selects the exponent of the first floating point number as the fourth input corresponding to the fourth bit width and selects the exponent of the second floating point number as the third input corresponding to the third bit width when the exponent bit width of the first floating point number is less than the exponent bit width of the second floating point number and the third bit width is greater than the fourth bit width or when the exponent bit width of the first floating point number is greater than the exponent bit width of the second floating point number and the third bit width is less than the fourth bit width.

14. A multiplier according to claim 13, wherein said second control circuit is adapted to determine the number of times said exponent processing unit is called and the data input to said exponent processing unit in each call, from the bit width of said fourth input and said third bit width, when the bit width of said third input is less than or equal to the bit width of said fourth input and said third bit width is less than or equal to said fourth bit width.

15. A multiplier as claimed in any one of claims 11 to 14, in which the exponent processing unit further comprises a second shift-and-add circuit for obtaining the multiplied exponent from the exponent result obtained each time the exponent processing unit is invoked.

16. The multiplier of claim 1, wherein the mantissa processing unit comprises a partial product operation unit to obtain an intermediate result from mantissas of the two floating point numbers, and a partial product summation unit to sum the intermediate result to obtain a summed result and take the summed result as the mantissa after the multiplication operation.

17. The multiplier of claim 16, wherein the partial product operation unit includes a Booth encoding circuit to Booth-encode the mantissa of the first or second floating point number to obtain the intermediate result.

18. A multiplier as claimed in claim 17, in which the partial product summing unit comprises an adder for summing the intermediate results to obtain the summed result.

19. A multiplier as claimed in claim 17, in which the partial product summing unit comprises a Wallace tree for summing the intermediate results to obtain a second intermediate result and an adder for summing the second intermediate result to obtain the summed result.

20. A multiplier as claimed in claim 18 or 19, in which the adder comprises at least one of a full adder, a serial adder and a carry-look-ahead adder.

21. A multiplier as claimed in claim 19, in which, when the number of intermediate results is less than M, zero values are supplemented as intermediate results so that the number of intermediate results is equal to M, where M is a preset positive integer.

22. A multiplier as claimed in claim 21, in which each of the Wallace trees has M inputs and N outputs, the number of Wallace trees being no less than K, where N is a preset positive integer less than M and K is a positive integer no less than the maximum bit width of the intermediate result.

23. A multiplier according to claim 22, wherein the partial product summing unit is adapted to sum the intermediate results using one or more groups of the Wallace trees, wherein each group has X Wallace trees, X being the number of bits of the intermediate results, wherein the Wallace trees within each group are linked to one another by carries, and there is no carry relationship between Wallace trees of different groups.

24. The multiplier of claim 10, wherein the multiplier further comprises:

25. The multiplier of claim 10, wherein the multiplier is configured to multiply the two floating point numbers according to an operation mode, the operation mode indicating a data format of the two floating point numbers, the mantissa processing unit is configured to obtain the multiplied mantissa according to the operation mode and the mantissa of the two floating point numbers, and the exponent processing unit is configured to obtain the multiplied exponent according to the operation mode and the exponent of the two floating point numbers.

26. The multiplier of claim 25, wherein the normalization processing unit is further configured to normalize at least one of the two floating point numbers according to the operation mode to obtain a corresponding exponent and mantissa.

27. The multiplier of claim 26, wherein the data format includes at least one of a half-precision floating point number, a single-precision floating point number, a brain floating point number, a double-precision floating point number, and a custom floating point number.

28. The multiplier of claim 17, wherein the mantissa processing unit includes a bit extension circuit to extend the number of bits of the mantissa of at least one of the first floating point number and the second floating point number.

29. The multiplier of claim 1, wherein the floating point number further comprises a sign, the multiplier further comprising:

30. The multiplier of claim 29, wherein the sign processing unit comprises an exclusive-OR logic circuit for performing an exclusive-OR operation on the signs of the two floating point numbers to obtain the sign after the multiplication operation.

31. A multiplier as claimed in claim 25, further comprising a regularization unit for:

32. A multiplier as claimed in claim 31, further comprising:

33. A method for performing a floating-point multiplication operation using a multiplier, wherein,

34. An integrated circuit chip comprising a multiplier as claimed in any one of claims 1 to 32.

35. A computing device comprising a multiplier according to any of claims 1-32 or an integrated circuit chip according to claim 34.