
CN115906947B - Model quantization methods and computing equipment - Google Patents

Model quantization methods and computing equipment

Info

Publication number
CN115906947B
CN115906947B (application CN202211604273.8A)
Authority
CN
China
Prior art keywords
index
value
precision
exponent
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211604273.8A
Other languages
Chinese (zh)
Other versions
CN115906947A
Inventor
康瑞鹏
游亮
龙欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Cloud Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Cloud Computing Ltd filed Critical Alibaba Cloud Computing Ltd
Priority to CN202211604273.8A priority Critical patent/CN115906947B/en
Publication of CN115906947A publication Critical patent/CN115906947A/en
Application granted granted Critical
Publication of CN115906947B publication Critical patent/CN115906947B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract


This application provides a model quantization method and computing device. The method involves: determining first model data to be quantized; determining a corresponding target exponent representation based on the numerical value of the first model data; and quantizing the first model data from a first precision to a second precision according to the target exponent representation to generate second model data. Both the first and second precisions correspond to floating-point types. Furthermore, exponent portions with the same actual value correspond to different exponent encoding values for different exponent representations. The technical solution provided by this application improves the flexibility of model quantization operations.

Description

Model quantization method and computing device
Technical Field
Embodiments of this application relate to the technical field of data processing, and in particular to a model quantization method and a computing device.
Background
Neural network models are usually computed in floating-point. Model quantization refers to the process of quantizing model data from high precision to low precision in order to compress the model, reduce memory footprint, and increase computation speed.
However, low-precision floating-point numbers have a limited representable range, which reduces the flexibility of model quantization.
Disclosure of Invention
Embodiments of this application provide a model quantization method and a computing device to address the limited flexibility of model quantization in the prior art.
In a first aspect, an embodiment of the present application provides a model quantization method, including:
determining first model data to be quantized in a model;
determining a corresponding target exponent representation according to the expression value of the first model data; and
quantizing the first model data from a first precision to a second precision according to the target exponent representation to generate second model data, where the first precision and the second precision both correspond to floating-point types, and exponent portions with the same actual value have different exponent encoding values under different exponent representations.
In a second aspect, an embodiment of the present application provides a computing device including a processing component and a storage component, where the storage component stores one or more computer instructions to be invoked and executed by the processing component to implement the model quantization method of the first aspect.
In a third aspect, an embodiment of the present application provides a computer storage medium storing a computer program which, when executed by a computing device, implements the model quantization method of the first aspect.
In the embodiments of this application, for first model data to be quantized, a corresponding target exponent representation is determined according to the expression value of the first model data, and the first model data is quantized from a first precision to a second precision according to the target exponent representation to generate second model data. The first precision and the second precision both correspond to floating-point types, and exponent portions with the same actual value have different exponent encoding values under different exponent representations. Because the floating-point exponent portion determines the representable data range and is constrained by its bit width, selecting the exponent representation according to the expression value of the first model data allows the first model data to be effectively quantized to the second precision, improving the flexibility of the model quantization operation.
These and other aspects of the application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an embodiment of a model quantization method provided by the present application;
FIG. 2 is a schematic diagram of scene interaction in a practical application of an embodiment of the present application;
FIG. 3 is a flowchart of an embodiment of a data quantization method provided by the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a data quantization apparatus according to the present application;
FIG. 5 is a schematic diagram of an embodiment of a computing device provided by the present application.
Detailed Description
To enable those skilled in the art to better understand the present application, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings.
Some of the flows described in the specification, claims, and figures of this application include multiple operations that appear in a particular order, but it should be understood that these operations may be executed out of the listed order or in parallel. Sequence numbers such as 101 and 102 merely distinguish the operations and do not imply any execution order, and the flows may include more or fewer operations. It should also be noted that the terms "first" and "second" herein distinguish different messages, devices, modules, and the like; they do not imply a sequence, nor do they require that the "first" and "second" items be of different types.
The technical solutions of the embodiments of this application apply to data-precision quantization scenarios, that is, quantizing data from one precision to another, mainly for floating-point data. In practice, most scientific computation uses floating-point arithmetic to preserve precision, particularly computation related to artificial intelligence (AI) models.
Artificial intelligence is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce intelligent machines that react in a manner similar to human intelligence. It studies the design principles and implementation methods of intelligent machines so that they can perceive, reason, and make decisions. Research in this field includes robotics, natural language processing, computer vision, decision and reasoning, human-machine interaction, recommendation, basic theory, and more.
An artificial intelligence model can be implemented as a neural network model, i.e., a deep learning model obtained by training an artificial neural network. Neural network models have made great progress in computer vision tasks such as image classification, object detection, and image segmentation, as well as in natural language processing and video classification. However, a neural network model often contains a large amount of model data, occupies considerable device resources, and is difficult to run efficiently on terminal devices. It therefore needs to be compressed to reduce the resources it occupies.
Model quantization is an effective method for compressing a neural network: model data in a neural network model is quantized from high precision (more bits) to low precision (fewer bits). Model quantization can be applied in the training stage or the inference stage, and the technical solutions of the embodiments of this application are suitable for both. Inference-stage quantization is particularly relevant: inference workloads for neural network models on public clouds keep growing, and there is demand for optimizing inference performance. Since inference involves no back-propagation, and many parameters in the network are unimportant or do not need very fine precision, model quantization can reduce the memory footprint of a neural network model and increase its computation speed.
In order to facilitate understanding of the technical solution of the present application, the following first explains the technical terms possibly related in the present application correspondingly:
Bit: a unit of information, transliterated from the English "bit". It is one digit of a binary number and a unit of measure of information; each 0 or 1 in a binary representation is one bit, so the bit count is the number of binary digits.
Floating-point number: data whose radix point position is not fixed, having both a fractional part and an integer part. In a computer, a floating-point number is generally divided into an exponent portion and a mantissa portion: the exponent portion is a binary fixed-point integer, and the mantissa portion is a binary fixed-point fraction. The length of the exponent portion determines the representable data range, and the length of the mantissa portion determines the precision. A floating-point number also has a sign bit and a base; the sign bit occupies the highest bit, with 0 representing a positive number and 1 a negative number.
Exponent offset value (exponent bias): in a floating-point representation, the exponent encoding value of the exponent portion equals the actual value of the exponent portion minus an offset; that offset is the exponent offset value. The exponent offset value specified by the IEEE 754 standard (a binary floating-point arithmetic standard) is 2^(e-1) - 1, where e is the number of bits of the exponent portion.
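As a quick illustration (a sketch for this editorial note, not part of the patent), the bias formula above can be checked in a few lines of Python:

```python
def ieee754_bias(e_bits: int) -> int:
    """Exponent offset (bias) per IEEE 754: 2^(e-1) - 1,
    where e_bits is the width of the exponent portion."""
    return 2 ** (e_bits - 1) - 1

# e = 4 -> FP8 E4M3, e = 5 -> FP16 / FP8 E5M2, e = 8 -> FP32
print(ieee754_bias(4), ieee754_bias(5), ieee754_bias(8))  # 7 15 127
```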
FP8: an 8-bit floating-point number; 1 bit is the sign bit, and the remaining 7 bits are split between the exponent portion and the mantissa portion.
FP16: a 16-bit floating-point number.
FP32: a 32-bit floating-point number.
Model parameters of current neural network models, and the computations on them, are mostly represented with 32-bit single-precision or 64-bit double-precision floating-point numbers. Model quantization may, for example, quantize model data from 32 bits to 16 bits or 8 bits. However, low-precision floating-point numbers suffer from a limited representable range, overflow, and similar problems, which affect model performance and computation and make model quantization inflexible.
To improve the flexibility of model quantization, the inventors arrived at the technical solutions of the embodiments of this application through a series of studies. In these embodiments, for first model data to be quantized, a corresponding target exponent representation is determined according to the expression value of the first model data, and the first model data is quantized from a first precision to a second precision according to the target exponent representation to generate second model data. Both precisions correspond to floating-point types. Exponent portions with the same actual value have different exponent encoding values under different exponent representations, and thus different representable data ranges, because the floating-point exponent portion determines the data range and is constrained by its bit width.
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
FIG. 1 is a flowchart of an embodiment of a model quantization method according to an embodiment of this application. The method may include the following steps:
Step 101: Determine first model data to be quantized in the model.
The first model data is the object to be quantized in the model and may refer to model parameters or input data.
The model may be used for processing tasks such as image classification, object detection, image segmentation, natural language processing (speech recognition, text recognition, etc.), or video classification. The model may specifically be a multimedia data processing model, and the input data may be multimedia data, such as images, text, or video.
In one implementation, the model may be an image processing model whose input data is image data; the image processing model may be used for image classification, object detection, or image segmentation.
In another implementation, the model may be a speech processing model whose input data is speech data; the speech processing model may be used for speech recognition, speech conversion, and the like.
In another implementation, the model may be a text processing model whose input data is text data; the text processing model may be used for text matching, intent recognition, text conversion, text classification, and so on.
In another implementation, the model may be a video processing model whose input data is video data; the video processing model may be used for video classification and the like.
Of course, the model may also refer to any model that performs other types of processing tasks, as the application is not specifically limited in this regard.
In one practical application, the model in the embodiments of this application may be a trained model; that is, the technical solutions may be applied in the model inference stage to quantize a trained model.
Optionally, after a model quantization request is received, the model to be quantized is determined, and then the first model data to be quantized in that model is determined. The model quantization request may be triggered by a user, or generated when a model invocation event is detected, for example when the model is requested to perform computation such as image classification, image segmentation, or natural language processing, or when a user terminal with a processing requirement generates and sends the model invocation event.
The technical solutions of the embodiments of this application may be executed by a server, in which case the model quantization request may be triggered by a user through an interface provided by the server, or the user may trigger the model quantization request or the model invocation event. The technical solutions may also be executed on the user side: after model training is completed, the solution can be deployed at the user side. For example, when the user side detects a model processing request it may generate a model invocation event, and before executing that event it first generates a model quantization request, thereby triggering the model quantization of the embodiments of this application.
The first model data to be quantized may be determined by traversing the network layers of the model. Alternatively, first model data meeting a quantization condition may be selected, where the quantization condition can be set according to the actual situation, for example model parameters of a specific type, model parameters of a specific network layer, or input data; for instance, model data involved in matrix multiplication operations such as convolution may be taken as the first model data to be quantized.
Step 102: Determine a corresponding target exponent representation according to the expression value of the first model data.
The first model data is a floating-point number, and its expression value is the actual value it represents, which for ease of understanding can be written in decimal. A floating-point number may be represented as (sign bit)(exponent portion)(mantissa portion). Its expression value is z = (-1)^n * x * 2^y, where n is the sign bit, x is the mantissa value, 2 is the base, and y is the exponent encoding value. The mantissa value is x = 1 + a_1*2^(-1) + a_2*2^(-2) + ... + a_m*2^(-m), where a_i are the mantissa bits; the actual value of the exponent portion is b_1*2^(e-1) + b_2*2^(e-2) + ... + b_e*2^0, where b_i are the exponent bits; and the exponent encoding value equals the actual value of the exponent portion minus the exponent offset value. Here e is the number of bits of the exponent portion and m is the number of bits of the mantissa portion.
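The formula above can be made concrete with a small decoder (an illustrative sketch; the function and field names are ours, not the patent's):

```python
def decode_float(sign: int, exp_bits: str, man_bits: str, bias: int) -> float:
    """Decode a normal floating-point value from its fields:
    z = (-1)^n * x * 2^y, where x = 1 + sum(a_i * 2^-i) over the mantissa
    bits and y = (actual value of the exponent portion) - (exponent offset)."""
    x = 1.0
    for i, bit in enumerate(man_bits, start=1):
        x += int(bit) * 2 ** -i       # mantissa bits after the implicit 1
    y = int(exp_bits, 2) - bias       # exponent encoding value
    return (-1) ** sign * x * 2 ** y

# FP8 E4M3 with the standard offset 7: 0 1111 111 -> 2^8 * (1 + 7/8) = 480
print(decode_float(0, "1111", "111", bias=7))  # 480.0
```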
Exponent portions with the same actual value have different exponent encoding values under different exponent representations; that is, the exponent-encoding ranges differ between representations, so the representable data ranges of the corresponding floating-point numbers differ as well.
Because the exponent portion determines the representable data range, with the bit width of the exponent portion unchanged, the embodiments of this application obtain different data ranges through different exponent representations. An exponent representation indicates how the exponent encoding value is computed from the exponent portion, so exponent portions with the same actual value have different encoding values under different representations. Specific implementations of the target exponent representation are described in detail in the embodiments below.
Quantization may be performed per network layer of the model, and the target exponent representation may be determined from multiple model data of that layer, for example from the maximum expression value among them. Alternatively, the target exponent representation may be determined from all the model data of the model, for example from the maximum expression value among all of them.
Step 103: Quantize the first model data from the first precision to the second precision according to the target exponent representation to generate second model data.
Both the first precision and the second precision correspond to floating-point types. The first precision may be the initial precision of the first model data, and the second precision the target quantization precision. In one practical application, the second precision may be 8 bits and the first precision 64, 32, or 16 bits, and so on; this application is not limited in this regard.
Optionally, the second precision may be included in the model quantization request.
Quantization from the first precision to the second precision may be performed in a conventional manner; it may also be performed in other ways to reduce quantization complexity, as described in detail in the embodiments below.
In this embodiment, the target exponent representation is selected for quantization according to the expression value of the first model data, so the representable data range of the second precision can be adjusted dynamically. The first model data can thus be effectively quantized to the second precision, improving the flexibility of the model quantization operation.
As an alternative, determining the corresponding target exponent representation according to the expression value of the first model data may include:
determining, according to the expression value of the first model data, a target exponent representation that adjusts the original exponent offset value to a target exponent offset value.
The original exponent offset value may be 2^(e-1) - 1 as specified by the IEEE 754 standard. For example, for FP8 in the E4M3 encoding format (the exponent portion occupies 4 bits and the mantissa portion 3 bits), the original exponent offset value is 7, and the exponent-encoding range of the exponent portion is [-6, 8]: the exponent portion starts from 0001, i.e., decimal 1, and subtracting the offset 7 gives the minimum encoding value -6. The maximum FP8 value is then 0 1111 111, whose expression value under the original offset is 2^8 * (1 + 7/8) = 480. Although this application is described using FP8 in the E4M3 format, it is not limited thereto and applies equally to other encoding formats such as E5M2 (5 exponent bits and 2 mantissa bits).
In the embodiments of this application, the original exponent offset value may be adjusted according to the expression value of the first model data to determine the target exponent offset value. For example, to extend the data range of FP8, the target exponent offset value may be set to 0; the exponent-encoding range of the exponent portion then becomes [1, 15], and the expression value of 0 1111 111 becomes 2^15 * (1 + 7/8) = 61440, so the maximum value FP8 can express is increased by a factor of 128.
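The 128x figure is easy to verify numerically (a sketch assuming E4M3 with the all-ones exponent code usable for a value, as in the example above):

```python
def max_e4m3(offset: int) -> float:
    """Largest FP8 E4M3 magnitude when the top code 0 1111 111 carries a
    value: mantissa 1 + 7/8 times 2^(15 - offset)."""
    return (1 + 7 / 8) * 2 ** (0b1111 - offset)

print(max_e4m3(7))   # 480.0   (original offset per IEEE 754)
print(max_e4m3(0))   # 61440.0 (adjusted offset; 61440 / 480 = 128)
```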
Optionally, to extend the representable range of the second precision, the target exponent offset value may be smaller than the original exponent offset value, although this is not a limitation; the target exponent offset value may, for example, be selected from a range determined by the maximum and minimum actual values of the exponent portion. This application is not limited in this regard.
In some embodiments, several candidate exponent offset values may be preset, with different candidates corresponding to different exponent representations. Determining, according to the expression value of the first model data, the target exponent representation that adjusts the original exponent offset value to the target exponent offset value may include:
determining a plurality of exponent representations that respectively adjust the original exponent offset value to the candidate exponent offset values, where different exponent representations correspond to different candidates;
determining the expression-value ranges corresponding to the exponent representations;
determining a corresponding target expression-value range according to the expression value of the first model data; and
taking the exponent representation corresponding to the target expression-value range as the target exponent representation.
The target expression-value range contains the expression value of the first model data. Optionally, if the expression value of the first model data falls within multiple expression-value ranges, the range with the largest boundary value may be selected as the target range.
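One way to read this selection rule, sketched in Python (the helper name and the E4M3 assumption are ours, not the patent's):

```python
def pick_offset_by_range(value, candidate_offsets):
    """Among candidate exponent offsets, return one whose E4M3 representable
    range covers |value|; if several ranges hit, prefer the one with the
    largest boundary value, per the embodiment above."""
    best, best_max = None, None
    for off in candidate_offsets:
        vmax = (1 + 7 / 8) * 2 ** (0b1111 - off)   # largest E4M3 value
        if abs(value) <= vmax and (best_max is None or vmax > best_max):
            best, best_max = off, vmax
    return best

print(pick_offset_by_range(300.0, [7, 0]))   # 0: both ranges hit, 0 is larger
print(pick_offset_by_range(1000.0, [7, 0]))  # 0: only the offset-0 range hits
```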
In some embodiments, determining the target exponent representation that adjusts the original exponent offset value to the target exponent offset value according to the expression value of the first model data may include:
determining a plurality of exponent representations that respectively adjust the original exponent offset value to candidate exponent offset values, where different exponent representations correspond to different candidates;
quantizing the first model data to the second precision under each of the exponent representations;
computing the precision loss corresponding to each exponent representation; and
selecting the exponent representation with the smallest precision loss as the target exponent representation.
Further, optionally, the exponent representation whose expression-value range contains the expression value of the first model data and whose precision loss is smallest may be selected as providing the target exponent offset value.
After the first model data is quantized to the second precision under each of the exponent representations, yielding second-precision data for each, the precision loss of each exponent representation may be computed as follows:
determining the second-precision data corresponding to each exponent representation;
determining, under each exponent representation, the expression value of its second-precision data;
computing the difference information between the expression value of the first model data and the expression value of each second-precision datum; and
taking the computed difference information as the precision loss.
The exponent representation corresponding to the smallest difference information may then be selected as the target exponent representation.
As an alternative, the difference information may specifically be an error, and the error may be computed in various ways, for example as the MAE (Mean Absolute Error), MSE (Mean Squared Error), or RMSE (Root Mean Squared Error); this application is not limited in this regard.
As another alternative, the difference information may be the ratio or the difference between the expression value of the first model data and the expression value of each second-precision datum.
The expression value of the second-precision data can be obtained by conversion according to the formula z = (-1)^n * x * 2^y.
Because the expression values of the second-precision data may differ across exponent representations, i.e., across candidate exponent offset values, computing the difference information and selecting the representation with the smallest difference as the target exponent representation preserves quantization accuracy.
As another alternative, determining the corresponding target exponent representation from the expression value of the first model data may include:
determining, according to the expression value of the first model data, a target exponent representation in which the maximum actual value of the exponent portion at the second precision represents a valid value.
The maximum actual value of the exponent portion can be used to represent a valid value because, conventionally, it is not used for valid values: it typically encodes special data such as infinity or NaN, while model data generally contains no such special values. For example, for FP8 as specified in the IEEE 754 style, 0 1111 111 represents the meaningless value NaN, and the maximum valid value is 0 1110 111 = 240; under the technical solution of the embodiments of this application, 0 1111 111 may instead express the valid value 480, doubling the representable range of FP8.
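The doubling can be checked directly (a sketch; the `nan_reserved` flag is our naming, and the convention with maximum 240 follows the text above):

```python
def e4m3_max(nan_reserved: bool, offset: int = 7) -> float:
    """Largest valid E4M3 magnitude: the top exponent code is 1110 when
    0 1111 111 is reserved for NaN, and 1111 when it carries a value."""
    top = 0b1110 if nan_reserved else 0b1111
    return (1 + 7 / 8) * 2 ** (top - offset)

print(e4m3_max(True))   # 240.0 (0 1110 111, NaN code reserved)
print(e4m3_max(False))  # 480.0 (range doubled)
```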
In some embodiments, determining, according to the expression value of the first model data, the target exponent representation in which the maximum actual value of the exponent portion at the second precision represents a valid value may include:
determining, when the model data meets a validity condition, the target exponent representation in which the maximum actual value of the exponent portion at the second precision represents a valid value.
The validity condition can be set according to the actual situation to judge that the model data is valid data, in which case the target exponent representation in which the maximum actual value of the exponent portion at the second precision represents a valid value can be determined.
Optionally, determining, according to the expressed value of the first model data, the target exponent representation in which the maximum actual value of the exponent portion corresponding to the second precision represents a valid value may include:
determining the target expressed value obtained when the exponent portion corresponding to the second precision takes its maximum actual value;
and, where the expressed value of the first model data is smaller than that target expressed value, determining a target exponent representation in which the maximum actual value of the exponent portion corresponding to the second precision is used to represent a valid value.
That is, one implementation of the validity condition is that the expressed value of the first model data is smaller than the target expressed value corresponding to the maximum actual value of the second-precision exponent portion.
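As a minimal sketch of this validity condition, assuming the E4M3-style format from the earlier example (bias 7, all-ones exponent code reused), with constants and function names of our own invention:

```python
FP8_MAX_STANDARD = 1.875 * 2 ** 7    # 240: largest value with exponent code 0b1110
FP8_MAX_EXTENDED = 1.875 * 2 ** 8    # 480: all-ones exponent code reused as a value

def can_reuse_max_code(expressed_value: float) -> bool:
    """Validity condition sketched above: the data's expressed value must stay
    below the value the all-ones exponent code would take on."""
    return abs(expressed_value) < FP8_MAX_EXTENDED

print(can_reuse_max_code(300.0))   # True: 300 fits only in the extended range
print(can_reuse_max_code(500.0))   # False: exceeds even the extended range
```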
Of course, as yet another alternative, determining the corresponding target exponent representation from the expressed value of the model data may include:
determining a target exponent representation in which the maximum actual value of the exponent portion corresponding to the second precision represents a valid value and the original exponent offset value is adjusted to a target exponent offset value.
In some embodiments, determining the exponent representation in which the maximum actual value of the second-precision exponent portion represents a valid value and the original exponent offset value is adjusted to the target exponent offset value may include:
determining a plurality of exponent representations in which the maximum actual value of the exponent portion corresponding to the second precision represents a valid value and the original exponent offset value is adjusted to the respective candidate exponent offset values;
determining the expressed value range corresponding to each exponent representation;
determining the target expressed value range according to the expressed value of the first model data;
and taking the exponent representation corresponding to the target expressed value range as the target exponent representation.
In this approach, a plurality of exponent representations can be configured in advance, each comprising the maximum actual value of the exponent portion being used to express a valid value and the original exponent offset value being adjusted to the corresponding candidate exponent offset value.
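The range-based selection above can be sketched as follows. This assumes the E4M3-style format from earlier, with a hypothetical candidate bias list and a tightest-fit selection rule (one plausible rule; the source does not fix a specific one):

```python
def fp8_max_value(bias: int, exp_bits: int = 4, mant_bits: int = 3) -> float:
    """Largest representable magnitude when the all-ones exponent code is
    reused for ordinary values (hypothetical E4M3-style variant)."""
    max_exp = (2 ** exp_bits - 1) - bias          # all-ones exponent code
    max_mant = 1 + (2 ** mant_bits - 1) / 2 ** mant_bits
    return max_mant * 2 ** max_exp

def pick_bias(abs_max: float, candidate_biases=(5, 6, 7, 8, 9)) -> int:
    """Choose the candidate whose range still covers the data but is tightest:
    a larger bias shrinks the maximum value, leaving more codes for small values."""
    fitting = [b for b in candidate_biases if fp8_max_value(b) >= abs_max]
    if not fitting:
        raise ValueError("no candidate range covers the data")
    return max(fitting)

print(pick_bias(100.0))   # 9: max value 1.875 * 2^6 = 120 still covers 100
print(pick_bias(300.0))   # 7: max value 480 is the tightest range covering 300
```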
In some embodiments, determining the exponent representation in which the maximum actual value of the second-precision exponent portion represents a valid value and the original exponent offset value is adjusted to the target exponent offset value may alternatively include:
determining a plurality of exponent representations in which the maximum actual value of the exponent portion corresponding to the second precision represents a valid value and the original exponent offset value is adjusted to the respective candidate exponent offset values;
determining the expressed value range corresponding to each exponent representation;
quantizing the first model data to the second precision according to each of the plurality of exponent representations;
calculating the precision loss corresponding to each exponent representation;
and selecting, as the target exponent representation, the exponent representation whose precision loss is smallest and whose expressed value range includes the expressed value of the first model data.
The calculation of the precision loss corresponding to each of the plurality of exponent representations is described in detail in the embodiments above and is not repeated here.
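A brute-force sketch of the loss-based selection, assuming squared error as the precision-loss metric (an assumption; the source leaves the metric to earlier embodiments) and enumerating every code of the hypothetical FP8 format:

```python
import math

def fp8_values(bias, exp_bits=4, mant_bits=3):
    """All non-negative values of a hypothetical FP8 format with the given
    exponent offset; the all-ones exponent code is kept as an ordinary value."""
    vals = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** mant_bits):
            if e == 0:   # subnormals: no implicit leading 1
                vals.append((m / 2 ** mant_bits) * 2 ** (1 - bias))
            else:        # normals: implicit leading 1
                vals.append((1 + m / 2 ** mant_bits) * 2 ** (e - bias))
    return sorted(vals)

def quantize(x, table):
    """Round x to the nearest representable magnitude, preserving the sign."""
    nearest = min(table, key=lambda v: abs(v - abs(x)))
    return math.copysign(nearest, x)

def total_loss(data, bias):
    table = fp8_values(bias)
    return sum((x - quantize(x, table)) ** 2 for x in data)

data = [0.03, 0.8, 5.5, 97.0]          # hypothetical model data
candidates = (5, 6, 7, 8, 9)           # hypothetical candidate offsets
best = min(candidates, key=lambda b: total_loss(data, b))
print(best)
```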
In some embodiments, quantizing the first model data from the first precision to the second precision may include:
determining the first precision of the first model data;
determining a first exponent portion and a first mantissa portion of the first model data under the first precision;
determining, according to the second precision, an exponent encoding threshold of a second exponent portion corresponding to the second precision;
and, where the exponent encoding value of the first exponent portion is smaller than the exponent encoding threshold, taking the exponent encoding value of the first exponent portion as the exponent encoding value of the second exponent portion and taking the data of the first mantissa portion as the data of the second mantissa portion.
The exponent encoding threshold may be determined based on the exponent offset value of the second precision; where the target exponent representation includes a target exponent offset value, that offset value is used. For example, assuming the target exponent offset value is 7, then for FP8 (the second precision being 8 bits) the actual values of the exponent portion span [-6, 8], and the exponent encoding threshold is 8.
If the exponent encoding value of the first exponent portion is smaller than the exponent encoding threshold, the exponent of the first exponent portion does not exceed the exponent range of the second exponent portion under the second precision. In that case the exponent encoding value of the first exponent portion is used directly as the exponent encoding value of the second exponent portion, and the data of the first mantissa portion is used as the data of the second mantissa portion.
Optionally, the data of the first mantissa portion may be converted into the data of the second mantissa portion by rounding.
For example, assume the first precision is 32 bits (the first model data is FP32) and the second precision is 8 bits (the second model data is FP8). The mantissa widths differ: the first mantissa portion of FP32 is 23 bits, while the second mantissa portion of FP8 is 3 bits. The first mantissa portion may therefore be rescaled by rounding from 23 bits down to 3 bits, with the rescaled result taken as the data of the second mantissa portion. For example, if the model data is 100, it converts to z1 = (-1)^0 * 1.100100 * 2^6, whose mantissa bits 100100 occupy 6 bits after the binary point; an 8-bit representation has only 3 mantissa bits, so 1.100100 must be rescaled by rounding 100100 from 6 bits to 3 bits. Since the fourth bit of 100100 is 1, rounding to 3 bits carries up and the third kept bit becomes 1: 100100 rounds to 101, 1.100100 rounds to 1.101, and the result is z1 = (-1)^0 * 1.101 * 2^6. If instead the converted result were z1 = (-1)^0 * 1.101000 * 2^6, the mantissa bits would be 101000 with a fourth bit of 0, so the fourth through sixth bits are simply discarded when rounding to 3 bits: 101000 rounds to 101, 1.101000 rounds to 1.101, and the final 8-bit result is again z1 = (-1)^0 * 1.101 * 2^6.
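The mantissa rounding in this example can be sketched directly on bit strings. This implements the round-up-on-dropped-1 scheme described above (not IEEE round-to-nearest-even); the helper name is our own:

```python
def round_mantissa(frac_bits: str, keep: int) -> str:
    """Round a binary fraction string to `keep` bits, rounding up when the
    first dropped bit is 1, as in the worked example above."""
    kept, dropped = frac_bits[:keep], frac_bits[keep:]
    value = int(kept, 2)
    if dropped and dropped[0] == "1":
        value += 1                      # carry into the kept bits
    # NOTE: a carry out of all `keep` bits would require bumping the
    # exponent; that corner case is omitted from this sketch.
    return format(value, f"0{keep}b")

print(round_mantissa("100100", 3))  # '101': fourth bit is 1, so round up
print(round_mantissa("101000", 3))  # '101': fourth bit is 0, so truncate
```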
In some embodiments, the method may further include, where the exponent encoding value of the first exponent portion is greater than the exponent encoding threshold, taking the exponent encoding threshold as the exponent encoding value of the second exponent portion of the second model data;
and quantizing the first mantissa portion according to the first exponent portion and the exponent encoding threshold to obtain the mantissa value of the second mantissa portion.
If the exponent encoding value of the first exponent portion is greater than the exponent encoding threshold, the exponent of the first exponent portion exceeds the exponent range of the second exponent portion. In that case the first exponent portion may be truncated, that is, the exponent encoding threshold is used directly as the exponent encoding value of the second exponent portion, and the excess over the exponent portion is compensated through the data of the mantissa portion.
Specifically, a difference between the exponent encoding value of the first exponent portion and the exponent encoding threshold may be determined, and the first mantissa portion quantized based on that difference and the radix, where the radix of the first precision equals the radix of the second precision.
Optionally, quantizing the first mantissa portion based on the difference and the radix may include raising the radix to the power of the difference (the radix as base, the difference as exponent) to generate a first operation result;
and multiplying the data of the first mantissa portion by the first operation result to generate a second operation result, which is taken as the data of the second mantissa portion.
For example, assume the model data is z1 = (-1)^n1 * x1 * 2^y1 and the quantized model data is z3 = (-1)^n3 * x3 * 2^y3, with n1 and n3 equal; quantizing z1 to z3 then means quantizing x1 to x3 and y1 to y3. With z1 being FP32 and z3 being FP8, the actual-value range of y1 is [-127, 128], the actual-value range of y3 is [-6, 8], and the exponent encoding threshold is 8. Therefore, if y1 is greater than 8, y3 may be set to 8, and the portion of y1 exceeding 8 compensated by adjusting x1. For instance, if y1 equals 10, which is greater than 8, then after y3 is set to 8, in order to keep the quantized z3 equal or substantially equal to z1, x3 may be adjusted to x1 * 2^2, that is, z3 = (-1)^n3 * (x1 * 2^2) * 2^8; in z3, the mantissa portion is x1 * 2^2 and the exponent portion is 8.
Optionally, the second operation result taken as the data of the second mantissa portion may further be processed by rounding; for example, x1 * 2^2 is rescaled from 23 bits to 3 bits, and the rescaled result is used as the data of the mantissa portion of z3.
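The exponent-truncation step can be sketched as follows. The identity it relies on is m * 2^e == (m * 2^(e - t)) * 2^t, so clamping the exponent to the threshold while multiplying the mantissa by 2 raised to the excess leaves the value unchanged; function and parameter names are illustrative:

```python
def clamp_exponent(mantissa: float, exponent: int, threshold: int = 8):
    """If the exponent exceeds the target format's encoding threshold,
    truncate it to the threshold and fold the excess into the mantissa."""
    if exponent > threshold:
        mantissa *= 2 ** (exponent - threshold)   # first result: 2^(e - t); folded in
        exponent = threshold
    return mantissa, exponent

m, e = clamp_exponent(1.5, 10)
print(m, e)                          # 6.0 8, since 1.5 * 2^10 == 6.0 * 2^8
assert m * 2 ** e == 1.5 * 2 ** 10   # value preserved before any mantissa rounding
```

In a real FP8 target the widened mantissa 6.0 would then overflow the 3-bit mantissa field and require the rounding step described earlier; this sketch stops at the value-preserving rewrite.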
In some embodiments, after quantizing the first model data from the first precision to the second precision according to the target exponent representation to generate the second model data, the method further comprises:
replacing the first model data in the model with the second model data;
and calling a floating point processing unit to perform operation processing on the model.
Wherein the first model data comprises model parameters and/or input data.
The floating point processing unit is a structure for running floating point operations; it is generally implemented by circuits and applied in a computer chip. After the target exponent representation is determined, the floating point processing unit can be designed accordingly. Thus, in some embodiments, the method may further comprise:
generating, according to the target exponent representation, a hardware description program corresponding to the floating point processing unit, the hardware description program being used to design the floating point processing unit.
In practical application, the technical solution of this embodiment is suitable for quantizing public-cloud neural network models. Inference demand for such models keeps growing, as does the need to optimize inference-task performance. Most public-cloud neural network models currently run in FP16, and through this embodiment their model data can be quantized to FP8. Because FP8 has lower precision, it enables effective compression of the neural network model; however, because the expressible value range of FP8 is limited, the quantization effect, and ultimately model performance, may be affected, which is why selecting a suitable exponent representation matters.
In addition, the neural network model can be used for processing tasks such as image classification, target detection, image segmentation, natural language processing (speech recognition, character recognition, and the like), or video classification; through the technology of this embodiment, a trained model can be quantized to optimize inference-task performance.
As can be seen from the above description, the neural network model may be a multimedia data processing model, and the technical solution of this embodiment therefore improves data processing efficiency.
For example, the neural network model may specifically be an image processing model, in which case the input data in the first model data refers to image data, and the image processing model may be used for image classification, object detection, or image segmentation. Through the technical solution of this embodiment, the image processing model can be effectively quantized and compressed, thereby improving image processing efficiency.
In yet another implementation, the neural network model may be specifically a speech processing model, and the input data may refer to speech data, and the speech processing model may be used for speech recognition or speech conversion, and so on. By the technical scheme of the embodiment of the application, the voice processing model can be effectively quantized, and the compression of the voice processing model is realized so as to improve the voice processing efficiency.
In yet another implementation, the neural network model may be specifically a text processing model, and the input data may refer to text data, and the text processing model may be used for text matching, intent recognition, text conversion, text classification, and so on. By the technical scheme provided by the embodiment of the application, the text processing model can be effectively quantized, and compression of the text processing model is realized, so that the text processing efficiency is improved.
In yet another implementation, the neural network model may be a video processing model, the input data may refer to video data, the video processing model may be used to perform video classification, and so on. By the technical scheme provided by the embodiment of the application, the video processing model can be effectively quantized, and the compression of the video processing model is realized, so that the video processing efficiency is improved.
As shown in fig. 2, which illustrates a system architecture applicable to an implementation scenario of the technical solution of the present application, a server 20 may determine the first model data to be quantized in a model based on a model quantization request triggered by a user, and determine the corresponding target exponent representation according to the expressed value of the first model data. The server 20 may be, for example, a cloud server provided by a cloud computing platform.
A hardware description program may be generated based on the target exponent representation, from which a floating point processing unit 21 can be designed. In practical applications, the floating point processing unit 21 may be deployed in the server as an external device.
The server 20 may quantize the first model data from the first precision to the second precision according to the target exponent representation to generate the second model data, replace the first model data in the model with the second model data, and then call the floating point processing unit 21, designed and generated based on the target exponent representation, to perform operation processing on the model.
In addition to the model quantization scenario, the technical solution of this embodiment can also be applied to other data precision quantization scenarios. As shown in fig. 3, an embodiment of the present application further provides a data quantization method, which may comprise the following steps:
301, determining first data to be quantized.
302, determining a corresponding target exponent representation according to the expressed value of the first data.
303, quantizing the first data from a first precision to a second precision according to the target exponent representation to generate second data.
The first precision and the second precision both correspond to floating point types, wherein exponent portions with the same actual value correspond to different exponent encoding values under different exponent representations.
The embodiment shown in fig. 3 differs from the embodiment shown in fig. 1 in that the first data is not limited to being first model data; the other identical or similar steps are described in detail in the embodiment shown in fig. 1 and are not repeated here.
According to the technical solution of this embodiment, the data quantization operation becomes more flexible: the target exponent representation is selected for quantization according to the expressed value of the first data, so the data range expressible at the second precision can be dynamically adjusted, effectively quantizing the first data to the second precision.
Fig. 4 is a schematic structural diagram of an embodiment of a data quantization apparatus according to an embodiment of the present application; the apparatus may include:
a first determining module 401, configured to determine first data to be quantized;
a second determining module 402, configured to determine a corresponding target exponent representation according to the expressed value of the first data;
a quantization module 403, configured to quantize the first data from a first precision to a second precision according to the target exponent representation to generate second data, where the first precision and the second precision both correspond to floating point types, and exponent portions with the same actual value correspond to different exponent encoding values under different exponent representations.
In practical application, the technical solution of this embodiment can be applied to a model quantization scenario: the first data may be first model data to be quantized in a model, and quantization finally generates second model data.
In some embodiments, the second determining module may specifically determine, according to the expressed value of the first model data, a target exponent representation that adjusts the original exponent offset value to a target exponent offset value;
the quantization module may specifically quantize the first model data from the first precision to the second precision according to the target exponent offset value to generate the second model data.
In some embodiments, the second determining module may specifically determine a plurality of exponent representations that adjust the original exponent offset value to candidate exponent offset values, where different exponent representations correspond to different candidate exponent offset values; quantize the first model data to the second precision according to each of the plurality of exponent representations; calculate the precision loss corresponding to each exponent representation; and select the exponent representation with the smallest precision loss as the target exponent representation.
In some embodiments, the second determining module may specifically determine, according to the expressed value of the first model data, a target exponent representation in which the maximum actual value of the exponent portion corresponding to the second precision is used to represent a valid value.
In some embodiments, the second determining module may specifically determine, where the first model data meets the validity condition, a target exponent representation in which the maximum actual value of the exponent portion corresponding to the second precision is used to represent a valid value.
In some embodiments, the second determining module may specifically determine the target expressed value obtained when the exponent portion corresponding to the second precision takes its maximum actual value, and, where the expressed value of the first model data is smaller than that target expressed value, determine a target exponent representation in which the maximum actual value of the second-precision exponent portion is used to represent a valid value.
In some embodiments, the second determining module may specifically determine a plurality of exponent representations in which the maximum actual value of the exponent portion corresponding to the second precision represents a valid value and the original exponent offset value is adjusted to the respective candidate exponent offset values; determine the expressed value range corresponding to each exponent representation; determine the target expressed value range according to the expressed value of the first model data; and take the exponent representation corresponding to the target expressed value range as the target exponent representation.
In some embodiments, the quantization module may specifically determine the first precision of the first model data; determine a first exponent portion and a first mantissa portion of the first model data under the first precision; determine, according to the second precision, an exponent encoding threshold of a second exponent portion corresponding to the second precision; and, where the exponent encoding value of the first exponent portion is smaller than the exponent encoding threshold, take the exponent encoding value of the first exponent portion as the exponent encoding value of the second exponent portion and take the data of the first mantissa portion as the data of the second mantissa portion.
In some embodiments, the quantization module is further configured to, where the exponent encoding value of the first exponent portion is greater than the exponent encoding threshold, take the exponent encoding threshold as the exponent encoding value of the second exponent portion of the second model data, and quantize the first mantissa portion according to the exponent encoding value of the first exponent portion and the exponent encoding threshold to obtain the mantissa value of the second mantissa portion.
In some embodiments, the quantization module quantizing the first mantissa portion according to the exponent encoding value of the first exponent portion and the exponent encoding threshold to obtain the mantissa value of the second mantissa portion may include:
determining a difference between the exponent encoding value of the first exponent portion and the exponent encoding threshold, and quantizing the first mantissa portion based on the difference and the radix, where the radix of the first precision equals the radix of the second precision.
In some embodiments, the quantization module quantizing the first mantissa portion based on the difference and the radix may include raising the radix to the power of the difference to generate a first operation result, multiplying the data of the first mantissa portion by the first operation result to generate a second operation result, and taking the second operation result as the data of the second mantissa portion.
In some embodiments, the apparatus may further comprise:
a processing module, configured to replace the first model data in the model with the second model data and call the floating point processing unit to perform operation processing on the model.
In some embodiments, the processing module is further configured to generate, according to the target exponent representation, a hardware description program corresponding to the floating point processing unit, the hardware description program being used to design the floating point processing unit.
The data quantization apparatus shown in fig. 4 may perform the model quantization method described in the embodiment shown in fig. 1; its implementation principle and technical effects are not repeated here. The specific manner in which the modules and units of the data quantization apparatus perform their operations has been described in detail in the method embodiments and is likewise not repeated.
Embodiments of the present application also provide a computing device, as shown in FIG. 5, that may include a storage component 501 and a processing component 502;
The storage component 501 stores one or more computer instructions for execution by the processing component 502 to implement the model quantization method as shown in fig. 1 or the data quantization method as shown in fig. 3.
Of course, the computing device may necessarily include other components as well, such as input/output interfaces, display components, communication components, and the like. The input/output interface provides an interface between the processing component and a peripheral interface module, which may be an output device, an input device, etc. The communication component is configured to facilitate wired or wireless communication between the computing device and other devices, and the like.
The computing device may further include a floating point processing unit to be invoked to perform an arithmetic process on the quantized model.
Wherein the processing component may include one or more processors to execute computer instructions to perform all or part of the steps of the methods described above. Of course, the processing component may also be implemented as one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic elements for executing the methods described above.
The storage component is configured to store various types of data to support operations at the terminal. The memory component may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The display component may be an Electroluminescent (EL) element, a liquid crystal display or a micro display having a similar structure, or a retina-directly displayable or similar laser scanning type display.
It should be noted that, the above-mentioned computing device may be a physical device or an elastic computing host provided by a cloud computing platform, etc. It may be implemented as a distributed cluster of multiple servers or terminal devices, or as a single server or single terminal device.
The computing device may also be specifically implemented as an electronic device, where the electronic device may be a device that is used by a user and has functions of computing, surfing the internet, communication, and the like, where the device may be, for example, a mobile phone, a tablet computer, a personal computer, a wearable device, and the like.
The embodiment of the application also provides a computer readable storage medium storing a computer program which, when executed by a computing device, can implement the model quantization method of the embodiment shown in fig. 1 or the data quantization method of the embodiment shown in fig. 3. The computer readable medium may be contained in the electronic device described in the above embodiment or may exist alone without being incorporated into the electronic device.
The embodiment of the present application further provides a computer program product comprising a computer program carried on a computer readable storage medium; when executed by a computer, the computer program can implement the model quantization method of the embodiment shown in fig. 1 or the data quantization method of the embodiment shown in fig. 3. In such embodiments, the computer program may be downloaded and installed from a network, and/or installed from a removable medium. When executed by a processor, the computer program performs the various functions defined in the system of the application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, which may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above embodiments are merely intended to illustrate, not to limit, the technical solution of the present application. Although the present application has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in the above embodiments may still be modified, or some technical features thereof may be equivalently replaced; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A method of model quantization, comprising:
determining first model data to be quantized in a model, wherein the model comprises a multimedia data processing model, input data processed by the multimedia data processing model comprises multimedia data, and the multimedia data comprises images, text or video;
determining a corresponding target exponent representation according to an expressed value of the first model data; and
quantizing the first model data from a first precision to a second precision according to the target exponent representation to generate second model data, wherein the first precision and the second precision both correspond to floating-point types, and exponent portions with the same actual value have different exponent encoding values under different exponent representations;
wherein said quantizing the first model data from the first precision to the second precision comprises:
determining the first precision of the first model data;
determining a first exponent portion and a first mantissa portion of the first model data corresponding to the first precision;
determining, according to the second precision, an exponent encoding threshold of a second exponent portion corresponding to the second precision; and
quantizing the first exponent portion according to a magnitude relation between the exponent encoding value corresponding to the first exponent portion and the exponent encoding threshold to obtain an exponent encoding value of the second exponent portion, and quantizing the first mantissa portion to obtain a mantissa value corresponding to a second mantissa portion.
2. The method of claim 1, wherein determining the corresponding target exponent representation according to the expressed value of the first model data comprises:
determining a target exponent representation that adjusts an original exponent offset value to a target exponent offset value according to the expressed value of the first model data;
and wherein quantizing the first model data from the first precision to the second precision according to the target exponent representation to generate the second model data comprises:
quantizing the first model data from the first precision to the second precision according to the target exponent offset value to generate the second model data.
3. The method of claim 2, wherein determining the target exponent representation that adjusts the original exponent offset value to the target exponent offset value according to the expressed value of the first model data comprises:
determining a plurality of exponent representations that adjust the original exponent offset value to candidate exponent offset values, wherein different exponent representations correspond to different candidate exponent offset values;
quantizing the first model data to the second precision according to each of the exponent representations;
calculating a precision loss corresponding to each of the exponent representations; and
selecting the exponent representation with the smallest precision loss as the target exponent representation.
4. The method of claim 1, wherein determining the corresponding target exponent representation according to the expressed value of the first model data comprises:
determining, according to the expressed value of the first model data, a target exponent representation in which the maximum actual value of the exponent portion corresponding to the second precision is used to represent an effective value.
5. The method of claim 4, wherein determining, according to the expressed value of the first model data, the target exponent representation in which the maximum actual value of the exponent portion corresponding to the second precision is used to represent an effective value comprises:
determining, in a case where the first model data satisfies an effectiveness condition, the target exponent representation in which the maximum actual value of the exponent portion corresponding to the second precision is used to represent an effective value.
6. The method of claim 1, wherein determining the corresponding target exponent representation according to the expressed value of the first model data comprises:
determining a plurality of exponent representations in which the maximum actual value of the exponent portion corresponding to the second precision is used to represent an effective value and the original exponent offset value is adjusted to corresponding candidate exponent offset values;
determining expressed-value ranges corresponding to the respective exponent representations;
determining a corresponding target expressed-value range according to the expressed value of the first model data; and
taking the exponent representation corresponding to the target expressed-value range as the target exponent representation.
7. The method of claim 1, wherein quantizing the first exponent portion according to the magnitude relation between the exponent encoding value of the first exponent portion and the exponent encoding threshold to obtain the exponent encoding value of the second exponent portion, and quantizing the first mantissa portion to obtain the mantissa value of the second mantissa portion, comprises:
in a case where the exponent encoding value corresponding to the first exponent portion is smaller than the exponent encoding threshold, taking the exponent encoding value of the first exponent portion as the exponent encoding value of the second exponent portion, and taking the data of the first mantissa portion as the data of the second mantissa portion.
8. The method of claim 1, wherein quantizing the first exponent portion according to the magnitude relation between the exponent encoding value of the first exponent portion and the exponent encoding threshold to obtain the exponent encoding value of the second exponent portion, and quantizing the first mantissa portion to obtain the mantissa value of the second mantissa portion, comprises:
in a case where the exponent encoding value corresponding to the first exponent portion is larger than the exponent encoding threshold, taking the exponent encoding threshold as the exponent encoding value of the second exponent portion of the second model data; and
quantizing the first mantissa portion according to the exponent encoding value of the first exponent portion and the exponent encoding threshold to obtain the mantissa value corresponding to the second mantissa portion.
9. The method of claim 8, wherein quantizing the first mantissa portion according to the exponent encoding value of the first exponent portion and the exponent encoding threshold to obtain the mantissa value corresponding to the second mantissa portion comprises:
determining a difference between the exponent encoding value of the first exponent portion and the exponent encoding threshold, and quantizing the first mantissa portion based on the difference and a radix, wherein the radix corresponding to the first precision is equal to the radix corresponding to the second precision.
10. The method of claim 9, wherein quantizing the first mantissa portion based on the difference and the radix comprises:
performing a power operation with the radix as the base and the difference as the exponent to generate a first operation result; and
performing a product operation on the data of the first mantissa portion and the first operation result to generate a second operation result, and taking the second operation result as the data of the second mantissa portion.
11. The method of claim 1, wherein after quantizing the first model data from the first precision to the second precision according to the target exponent representation to generate the second model data, the method further comprises:
replacing the first model data in the model with the second model data; and
invoking a floating-point processing unit to perform operation processing on the model.
12. The method of claim 11, further comprising:
generating a hardware description program corresponding to the floating-point processing unit according to the target exponent representation, wherein the hardware description program is used for designing the floating-point processing unit.
13. A computing device, comprising a processing component and a storage component;
wherein the storage component stores one or more computer instructions for execution by the processing component to implement the model quantization method of any one of claims 1 to 12.
14. A computer storage medium, storing a computer program which, when executed by a computing device, implements the model quantization method of any one of claims 1 to 12.
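The quantization procedure recited in claims 1 and 7 to 10 (compare the source exponent's encoding value against the target format's exponent encoding threshold; below the threshold, carry the exponent and mantissa over unchanged; above it, clamp the exponent at the threshold and fold the difference into the mantissa via a power of the shared radix), together with the loss-minimizing bias search of claim 3, can be sketched as follows. This is an illustrative reading, not the patented implementation: the function names, the E5M2-like bit widths, the squared-error loss, and the rounding choices are all assumptions made for the example.

```python
import math

def quantize_float(value, exp_bits=5, man_bits=2, bias=15):
    """Quantize a float to a hypothetical low-precision format (exp_bits exponent
    bits, man_bits fraction bits, configurable exponent bias), clamping the
    exponent at the encoding threshold and rescaling the mantissa (claims 7-10)."""
    if value == 0.0:
        return 0.0
    sign = -1.0 if value < 0 else 1.0
    mag = abs(value)
    # Decompose into exponent and mantissa; radix 2 is shared by both precisions.
    exp = math.floor(math.log2(mag))
    mantissa = mag / (2.0 ** exp)          # normalized into [1, 2)
    enc = exp + bias                       # exponent encoding value under this bias
    enc_max = (1 << exp_bits) - 2          # encoding threshold (all-ones reserved)
    if enc > enc_max:
        # Claims 8-10: clamp at the threshold, fold radix**difference into mantissa.
        diff = enc - enc_max
        mantissa *= 2.0 ** diff
        enc = enc_max
    elif enc < 1:
        # Simplified subnormal handling: shift the mantissa down instead.
        mantissa *= 2.0 ** (enc - 1)
        enc = 1
    # Round the mantissa to the target number of fraction bits.
    step = 2.0 ** (-man_bits)
    mantissa = round(mantissa / step) * step
    return sign * mantissa * (2.0 ** (enc - bias))

def choose_bias(values, candidate_biases, exp_bits=5, man_bits=2):
    """Claim 3 sketch: quantize under each candidate exponent offset (bias) and
    keep the representation with the smallest total precision loss."""
    def loss(bias):
        return sum((v - quantize_float(v, exp_bits, man_bits, bias)) ** 2
                   for v in values)
    return min(candidate_biases, key=loss)
```

For example, `quantize_float(3.14159)` rounds the 2-bit mantissa to yield 3.0, and `choose_bias` prefers a larger bias for data clustered near zero, since it trades unused large-exponent range for extra small-exponent range.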
CN202211604273.8A 2022-12-13 2022-12-13 Model quantization methods and computing equipment Active CN115906947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211604273.8A CN115906947B (en) 2022-12-13 2022-12-13 Model quantization methods and computing equipment

Publications (2)

Publication Number Publication Date
CN115906947A CN115906947A (en) 2023-04-04
CN115906947B true CN115906947B (en) 2026-01-30

Family

ID=86478005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211604273.8A Active CN115906947B (en) 2022-12-13 2022-12-13 Model quantization methods and computing equipment

Country Status (1)

Country Link
CN (1) CN115906947B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052916A (en) * 2020-10-10 2020-12-08 腾讯科技(深圳)有限公司 Data processing method and device based on neural network and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10579334B2 (en) * 2018-05-08 2020-03-03 Microsoft Technology Licensing, Llc Block floating point computations using shared exponents
US10802795B2 (en) * 2018-08-21 2020-10-13 Semiconductor Components Industries, Llc Systems and methods for image data compression
WO2021083154A1 (en) * 2019-10-30 2021-05-06 Huawei Technologies Co., Ltd. Method and apparatus for quantization of neural networks post training
US12175383B2 (en) * 2020-05-15 2024-12-24 Google Llc Systems and methods for compressing floating point tensors

Similar Documents

Publication Publication Date Title
CN108053028B (en) Data fixed-point processing method and device, electronic equipment and computer storage medium
CN107451658B (en) Floating-point arithmetic fixed-point method and system
US20190164043A1 (en) Low-power hardware acceleration method and system for convolution neural network computation
CN113508402B (en) Deriving a consistent software neural network layer from a quantized firmware neural network layer
CN109284761B (en) Image feature extraction method, device and equipment and readable storage medium
CN109002881A (en) The fixed point calculation method and device of deep neural network based on FPGA
CN111290732B (en) Floating-point number multiplication circuit based on posit data format
CN111967608A (en) Data processing method, device, equipment and storage medium
WO2020176248A1 (en) Neural network layer processing with scaled quantization
CN110503182A (en) Network layer operation method and device in deep neural network
CN116108909A (en) Data processing method, device, electronic device and storage medium
KR20220018199A (en) Computing device using sparsity data and operating method thereof
CN114998649B (en) Image classification model training method, image classification method and device
CN115906947B (en) Model quantization methods and computing equipment
CN117348837A (en) Quantization method and device for floating point precision model, electronic equipment and storage medium
CN115796256B (en) Model quantization method and device
EP4481554A1 (en) Methods for decomposition of high-precision matrix multiplications into multiple matrix multiplications of different data types
CN115062777B (en) Quantization method, quantization device, equipment and storage medium of convolutional neural network
CN115238236B (en) Data processing method, device, electronic equipment, medium and chip
CN115310035B (en) Data processing method, device, electronic equipment, medium and chip
CN117893397A (en) Image data processing method, device, equipment and medium
CN116306881A (en) Model compression method, device, electronic equipment and storage medium
TWI846454B (en) Optimizing method and computing system for deep learning network
CN120822628B (en) Quantitative reasoning method, device, equipment and medium based on multi-mode large model
CN113902928A (en) Image feature extraction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant