CN117610619A

CN117610619A - Quantized convolution method, device and equipment

Info

Publication number: CN117610619A
Application number: CN202311617458.7A
Authority: CN
Inventors: 赵海丞; 赵亚娟
Original assignee: Spreadtrum Communications Shanghai Co Ltd
Current assignee: Spreadtrum Communications Shanghai Co Ltd
Priority date: 2023-11-29
Filing date: 2023-11-29
Publication date: 2024-02-27

Abstract

The present application relates to the technical field of neural networks, and in particular, to a method, an apparatus, and a device for quantization convolution. A quantized convolution method comprising: determining a first result according to input data, input weights, first zero deviations and second zero deviations, wherein the input weights are weights corresponding to the input data, the first zero deviations are zero deviations of the input data, and the second zero deviations are zero deviations of the input weights; scaling the first result into a second result with the target bit width according to a preset scaling rule; and determining the quantized convolution result according to the second result and the third zero-point deviation, wherein the third zero-point deviation is the zero-point deviation of the second result.

Description

Quantized convolution method, device and equipment

Technical Field

The present application relates to the technical field of neural networks, and in particular, to a method, an apparatus, and a device for quantization convolution.

Background

Convolution is widely used in neural networks, but conventional neural network processors incur high computational costs in order to support convolution. Therefore, it is necessary to alleviate the processing pressure of the neural network by quantization. Quantization is a method for reducing the size of neural network models and speeding up reasoning, with the aim of changing parameters or weights in the computational model to a low bit-width representation, thereby reducing memory space and computational complexity. However, the existing convolution quantization calculation process is still complicated, and small processing pressure is also caused on the neural network.

Disclosure of Invention

The embodiment of the invention provides a quantized convolution method, a quantized convolution device and quantized convolution equipment, which are used for solving the problem that the quantized convolution method is complicated in the prior art, so that the processing pressure of a neural network is high.

In a first aspect, an embodiment of the present invention provides a quantized convolution method, including:

determining a first result according to input data, input weights, first zero deviations and second zero deviations, wherein the input weights are weights corresponding to the input data, the first zero deviations are zero deviations of the input data, and the second zero deviations are zero deviations of the input weights;

scaling the first result into a second result with the target bit width according to a preset scaling rule;

and determining the quantized convolution result according to the second result and the third zero-point deviation, wherein the third zero-point deviation is the zero-point deviation of the second result.

Optionally, the determining the first result according to the input data, the input weight, the first zero-point deviation and the second zero-point deviation includes:

obtaining calibration input data according to the input data and the first zero deviation, and obtaining calibration input weight according to the input weight and the second zero deviation;

determining a target multiplier according to the input data and the bit width of the input weight;

performing a number of data multiplications on the calibration input data and the calibration input weights by the target multiplier, wherein the number of data multiplications performed is determined by a convolution kernel;

and performing step-by-step pairwise accumulation on the results of the data multiplication by a target adder to obtain the first result.

Optionally, the determining the target multiplier according to the bit width of the input data and the input weight includes:

and determining a multiplier which is 1 bit larger than the input data and the input weight bit width as the target multiplier.

Optionally, the step-by-step pairwise accumulation is performed on the results of the plurality of data multiplications by a target adder, to obtain the first result, including:

determining a target number of target adders according to the number of the data multiplication results, and distributing each target adder to each level, wherein the bit width of the target adder of the next level is 1 bit larger than that of the target adder of the previous level;

in each stage, data addition is carried out once on two adjacent first input items through each target adder of the stage, and a data addition result is output to the next stage as the first input item of the next stage until the final stage outputs the first result, wherein the input quantity of the first stage is the result of the data multiplication, and the input quantity in each stage is carried out and only one data addition is carried out.

Optionally, the scaling the first result to a second result with the target bit width according to a preset scaling rule includes:

determining a first scaling factor and a second scaling factor according to the scaling factor of the input data, the scaling factor of the input weight and a target scaling factor of the quantized convolution result, wherein the target scaling factor is determined by the target bit width;

determining the second input item according to the first scaling factor and the first result;

and determining the second result according to the second input item and the second scaling factor based on a preset scaling rule.

Optionally, the determining, based on a preset scaling rule, the second result according to the second input item and the second scaling factor includes:

determining discrimination bits of the second input item according to the second scaling factor, wherein the discrimination bits comprise a first discrimination bit and a second discrimination bit;

and intercepting corresponding data in the second input item as the second result according to the size relation between the second input item and 0 and the discrimination bit.

Optionally, the intercepting, according to the size relation between the second input item and 0 and the discrimination bit, the corresponding data in the second input item as the second result includes:

when the second input item is larger than 0, if the first discrimination bit is 1, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is added with 1 and output as the second result;

when the second input item is larger than 0, if the first discrimination bit is 0 and the second discrimination bits are all 1, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is added with 1 and output as the second result;

when the second input item is larger than 0, if the first discrimination bit is 0 and the second discrimination bit is not all 1, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is output as the second result;

wherein the target bit is determined by the second scaling factor.

Optionally, the method further includes intercepting corresponding data in the second input item as the second result according to the magnitude relation between the second input item and 0 and the discrimination bit:

when the second input item is smaller than 0, if the first discrimination bit is 1 and the second discrimination bits are all 0, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is output as the second result;

when the second input item is smaller than 0, if the first discrimination bit is 1 and the second discrimination bit is not 0, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is added with 1 and output as the second result;

when the second input item is smaller than 0, if the first discrimination bit is 0, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is output as the second result;

wherein the target bit is determined by the second scaling factor.

when the second input item is 0, outputting 0 as the second result.

Optionally, the determining the quantized convolution result according to the second result and the third zero offset, where the third zero offset is a zero offset of the second result includes:

performing saturation truncation on the second result;

and performing data addition on the second result after saturation interception and the third zero point deviation through an adder which is 1 bit wider than the target bit to obtain the quantized convolution result.

In a second aspect, an embodiment of the present invention provides a quantized convolution apparatus, the apparatus comprising:

the first determining module is used for determining a first result according to input data, input weights, first zero deviations and second zero deviations, wherein the input weights are weights corresponding to the input data, the first zero deviations are zero deviations of the input data, and the second zero deviations are zero deviations of the input weights;

the scaling module scales the first result into a second result with the target bit width according to a preset scaling rule;

and the second determining module is used for determining the quantized convolution result according to the second result and the third zero deviation, wherein the third zero deviation is the zero deviation of the second result.

In a third aspect, an embodiment of the present invention provides an electronic device, including:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of the first aspects.

In a fourth aspect, an embodiment of the present invention provides a storage medium, where the storage medium includes a stored program, where the program, when executed, controls a device in which the storage medium is located to perform the method of any one of the first aspects.

The embodiment of the invention realizes equivalent transformation by selecting the target multiplier which is one bit larger than the bit width of the input data, accumulating the multiplication calculation result step by step, presetting the scaling rule, preferentially executing the saturation truncation, adjusting the zero point deviation and the like, reduces the complexity of quantized convolution calculation, simplifies the calculation logic, saves the calculation bit width and improves the processing efficiency of the neural network.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of a quantized convolution method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of accumulation of data multiplication results according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a quantized convolution device according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of another embodiment of a quantized convolution device according to the present disclosure;

fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For a better understanding of the technical solutions of the present application, embodiments of the present application are described in detail below with reference to the accompanying drawings.

It should be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

Convolution is widely applied to deep learning and is used for tasks such as image processing, feature extraction, object detection and positioning, natural language processing and the like. The method can capture the local mode of the input data through local operation, and reduce the complexity and the calculated amount of the model through parameter sharing, thereby improving the effect and the learning ability of the model.

Convolution is widely used in neural networks, but conventional neural network processors incur high computational costs in order to support convolution. Therefore, it is necessary to alleviate the processing pressure of the neural network by quantization.

Quantization is a method for reducing the size of neural network models and speeding up reasoning, with the aim of changing parameters or weights in the computational model to a low bit-width representation, thereby reducing memory space and computational complexity. Neural network processors are typically provided with a weighting unit that supports a weighting calculation, and the standard calculation formula for the weighting is:

wherein q _o Is the output quantized convolution result, q _a Is input data, Z _a Is the first zero point deviation of the input data, q _w Is the input weight, Z _w Is the second zero-point deviation of the input weight, q _b Is biased, Z _o Is the third zero point deviation of the quantized convolution result, N is the size of the convolution kernel, M and N are scaling factors, round _up-ha And round _away-from-0 Representing two rounding modes, saturation_trunc represents saturation truncation. Substituting the calculation result of the formula 1 into the formula 2 to obtain the final quantized convolution result.

Although the parameters and the weights in the calculation model are quantized through the above formula and the final convolution result is represented through the low bit width, the quantization process is still complex and tedious, and no small processing pressure is brought to the neural network. The invention aims to reduce the calculation complexity, simplify the calculation logic and save the calculation bit width through equivalent transformation and other methods.

As shown in fig. 1, a flowchart of a quantized convolution method according to an embodiment of the present invention includes the following specific steps:

s101, determining a first result according to input data, input weight, first zero point deviation and second zero point deviation.

The input weight is a weight corresponding to the input data, the first zero deviation is the zero deviation of the input data, and the second zero deviation is the zero deviation of the input weight.

Specifically, a subtracter is used for performing data subtraction on the input data and the first zero point deviation to obtain calibration input dataAnd subtracting the second zero point deviation from the input weight by using a subtracter to obtain a calibration input weight

In calculating the convolution, it is necessary to perform data multiplication on the calibration input data and the calibration input weights based on the size of the convolution kernel. Spread with multiplicative allocation rate by calculationTo determine a first result (product).

In the embodiment of the invention, the target multiplier is determined according to the bit width of the input data and the input weight, and the data multiplication of the calibration input data and the calibration input weight is directly executed through the target multiplier for a plurality of times, and the calculation is performed after the data multiplication is not performed according to the multiplication distribution rate.

Wherein, because the value ranges of the calibration input data and the calibration input weight are in a bit wide range larger by 1 bit compared with the value ranges of the input data and the input weight, when selecting the target multiplier, selecting the multiplier which is one bit larger than the bit wide of the input data and the input weight as the target multiplier to execute the data multiplication.

For example, when the input data and the input weight are int8 data, it can be determined that the range of values of the calibration input data and the calibration input weight is within the range of int9, so the int9 multiplier is selected as the target multiplier, and the data multiplication is performed. Therefore, polynomials generated by expanding according to a multiplication distribution law are avoided, the number of polynomials to be processed is reduced, and the universality is improved while the calculation logic is simplified.

After the data multiplication is performed for a plurality of times according to the size of the convolution kernel, the target adder is used for performing step-by-step pairwise accumulation on the obtained results of the plurality of data multiplications so as to obtain a first result.

Specifically, a target adder of a target number is determined according to the number of the data multiplication results and is distributed to each stage. In each stage, the target adder of the stage sequentially performs data addition on two adjacent first input items, and outputs a data addition result to the next stage, and the data addition is continuously performed as a first input quantity of the next stage target adder until a first result is output at one target adder of the final stage. Wherein each first entry performs only one data addition.

The first input quantity of the first stage is a data multiplication result obtained by executing data multiplication through the target multiplier. The first input items are added two by two, so that the number of target adders in the next stage is halved, and the input items of each stage are the same in bit width, so that the target adders in the same stage are the same in bit width, and the bit width is increased by 1 bit in the target multiplier of the next stage.

In one specific implementation, as shown in FIG. 2, the data multiplication results are accumulated. If the data multiplication result obtained by the above steps has 8 int8 data in total, it can be determined that 7 target adders are required, the first layer is a target adder of 4 int9, the second layer is a target adder of 2 int10, and the third layer is a target adder of 1 int 10. And taking the data multiplication result as a first input quantity, accumulating the data multiplication results step by the target multiplier in pairs, and finally outputting a first result by the int10 target multiplier of the final stage.

Compared with a sequential accumulation mode, whether data overflow is issued or not is not required to be considered during each addition calculation, namely, the data bit width of the target adder is not required to be increased during each addition calculation, and only the data bit width is required to be increased at each stage, so that more calculation resources are saved.

S102, scaling the first result into a second result with the target bit width according to a preset scaling rule.

Wherein the first result is scaled to the target bit width by performing a shift operation and rounding by a first scaling factor M and a second scaling factor n.

However, since the order and rule of the two shifts and rounding are determined, it is not necessary to perform a link-by-link calculation, and an equivalent scaling rule may be set according to the correspondence between the input and the output. Based on the preset scaling rule, the corresponding second result is directly output according to the characteristics of the input item, so that the actual operation process is replaced.

Specifically, the first scaling factor is determined according to the scaling factor of the input data, that is, the scaling relation between the original input data and the quantized input data, the scaling factor of the input weight, and the target scaling factor of the quantized convolution result. And determining a second scaling factor based on the first scaling factor and the actual shift operation relationship.

The first scaling factor is multiplied by the first result obtained in S101 to obtain a second input term (M (q _b +product)). Wherein the first result should also be offset q _b And (5) adding.

Determining the discrimination bit of the second input item through the second scaling factor. The discrimination bits include a first discrimination bit and a second discrimination bit. In general, the first discrimination bit is the (31+n) th bit, and the second discrimination bits are the 31 st bit to the 31+n-1 st bit ([ 31:31+n-1 ]). Based on a preset scaling rule, according to the size relation between the second input item and 0 and the discrimination bit of the second input item, intercepting corresponding data in the second input item as the second result.

When the second input item is larger than 0, if the first discrimination bit is 1, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is added with 1 and output as the second result; when the second input item is larger than 0, if the first discrimination bit is 0 and the second discrimination bits are all 1, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is added with 1 and output as the second result; when the second input item is larger than 0, if the first discrimination bit is 0 and the second discrimination bit is not all 1, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is output as the second result. Wherein the target bit is determined by a second scaling factor. The general target bit is the 32+n-th bit.

When the second input item is smaller than 0, if the first discrimination bit is 1 and the second discrimination bits are all 0, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is output as the second result; when the second input item is smaller than 0, if the first discrimination bit is 1 and the second discrimination bit is not 0, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is added with 1 and output as a second result; when the second input item is smaller than 0, if the first discrimination bit is 0, the target bit is intercepted to the highest bit as an intercepting result, and the intercepting result is output as the second result.

When the second input item is 0, outputting 0 as the second result.

As shown in table 1, a scaling rule table is provided in an embodiment of the present invention.

Input (M (q) _b +product))	Output (second result)
		If input is made>0 and (31+n) bit=1	Intercept [32+n: highest order bit]And +1
If input is made>0 and (31+n) bits=0 and [31:31+n-1]]The bits are all 1	Intercept [32+n: highest order bit]And +1
		If input is made>0 and (31+n) bits=0 and [31:31+n-1]]Bit errors of all 1	Intercept [32+n: highest order bit]
If input is made<0 and (31+n) bits=1 and [31:31+n-1]]Bits are all 0	Intercept [32+n: highest order bit]
		If input is made<0 and (31+n) bits=1 and [31:31+n-1]]Bit errors of 0	Intercept [32+n: highest order bit]And +1
If input is made<0 and (31+n) bit=0	Intercept [32+n: highest order bit]
		If the input=0	Output=0

Table 1: scaling rule table

S103, determining a quantized convolution result according to the second result and the third zero point deviation.

Specifically, the saturation truncation is performed on the second result, and then the second result after saturation truncation and the third zero deviation Z are performed by an adder with a bit width 1 bit larger than the target bit width _o And performing data addition to obtain a quantized convolution result.

In a specific embodiment, when the target bit width is int8, the second result is truncated to [ -128-Z by saturation truncation _o ,127-Z _o ]Within (2), then add it to Z by the adder of int9 _o And adding to obtain a final quantized convolution result.

If the second result is shown in formula 2, then the second result is first taken with Z _o Adding, and cutting the output to [ -128,127 by saturation cutting]In the way Z is added to the higher data bit width due to the non-truncated second result _o An adder of either int32 or int64 is typically required. While the scheme adds Z _o Only the adder of int9 is needed, and the calculated bit width is saved.

Fig. 3 is a schematic diagram of a quantization convolution apparatus according to an embodiment of the present invention. The overall process of quantizing the convolution is described below in conjunction with fig. 4. Take input data and input weight as int8 data as examples.

Referring to fig. 3, the input data and the first zero deviation are subjected to data subtraction by a subtracter to obtain calibration input data, and the input parameters and the second zero deviation are subjected to data subtraction by the subtracter to obtain calibration input weights. And performing a plurality of times of data multiplication on the calibration input data and the calibration input weight based on the target multiplier of int9, and accumulating the results of the data multiplication step by a target adder in pairs to obtain a first result.

The first result is multiplied by the first scaling factor by a multiplier, and the result is used as an input, and the second result is determined based on a preset scaling rule. And performing saturation truncation on the second result, and performing data addition on the second result after the saturation stage and the third zero point deviation through an adder of int9 to obtain a final quantized convolution result.

Corresponding to the quantization convolution method, the embodiment of the application also provides a quantization convolution device. Referring to fig. 4, a schematic structural diagram of a quantization convolution apparatus according to an embodiment of the present application may include: a first determination module 401, a scaling module 402 and a second determination module 403.

The first determining module 401 determines a first result according to input data, an input weight, a first zero-point deviation and a second zero-point deviation, wherein the input weight is a weight corresponding to the input data, the first zero-point deviation is a zero-point deviation of the input data, and the second zero-point deviation is a zero-point deviation of the input weight.

The scaling module 402 scales the first result into a second result with the target bit width according to a preset scaling rule.

A second determining module 403 determines the quantized convolution result according to the second result and the third zero offset, where the third zero offset is a zero offset of the second result.

Fig. 5 is a schematic structural view of an embodiment of the electronic device of the present specification. As shown in fig. 5, the electronic device may include at least one processor; and at least one memory communicatively coupled to the processing unit, wherein: the memory stores program instructions executable by the processing unit, and the processor invokes the program instructions to perform the quantized convolution method provided in this embodiment.

The electronic device may be a device capable of performing an intelligent dialogue with a user, for example: the cloud server, the embodiment of the present disclosure does not limit the specific form of the electronic device. It is understood that the electronic device herein is the machine mentioned in the method embodiment.

Fig. 5 shows a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present description. The electronic device shown in fig. 5 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments herein.

As shown in fig. 5, the electronic device is in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: one or more processors 510, a communication interface 520, a memory 530, and a communication bus 540 connecting the different system components (including the memory 530, the communication interface 520, and the processor 510).

Communication bus 540 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include industry Standard architecture (Industry Standard Architecture; hereinafter ISA) bus, micro channel architecture (Micro Channel Architecture; hereinafter MAC) bus, enhanced ISA bus, video electronics standards Association (Video Electronics Standards Association; hereinafter VESA) local bus, and peripheral component interconnect (Peripheral Component Interconnection; hereinafter PCI) bus.

Electronic devices typically include a variety of computer system readable media. Such media can be any available media that can be accessed by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.

Memory 530 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) and/or cache memory. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. Memory 530 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present description.

A program/utility having a set (at least one) of program modules may be stored in the memory 530, such program modules include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules typically carry out the functions and/or methods of the embodiments described herein.

The processor 510 executes various functional applications and data processing by running a program stored in the memory 530, for example, implementing the quantized convolution method provided by the embodiments shown in this specification.

Embodiments of the present specification provide a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the quantized convolution methods provided by the embodiments shown in the present specification.

The non-transitory computer readable storage media described above may employ any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory; EPROM) or flash Memory, an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for the present specification may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (Local Area Network; hereinafter: LAN) or a wide area network (Wide Area Network; hereinafter: WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present specification, the meaning of "plurality" means at least two, for example, two, three, etc., unless explicitly defined otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present specification in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present specification.

Depending on the context, the word "if" as used herein may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to detection". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event), depending on the context.

It should be noted that, the terminals in the embodiments of the present disclosure may include, but are not limited to, a personal Computer (Personal Computer; hereinafter referred to as a PC), a personal digital assistant (Personal Digital Assistant; hereinafter referred to as a PDA), a wireless handheld device, a Tablet Computer (Tablet Computer), a mobile phone, an MP3 player, an MP4 player, and the like.

In the embodiments provided in the present specification, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

In addition, each functional unit in each embodiment of the present specification may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.

The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a Processor (Processor) to perform part of the steps of the methods described in the embodiments of the present specification.

The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims

1. A method of quantized convolution, comprising:

2. The method of claim 1, wherein determining the first result based on the input data, the input weight, the first zero-point offset, the second zero-point offset, comprises:

3. The method of claim 2, wherein said determining a target multiplier based on bit widths of said input data and said input weights comprises:

4. The method of claim 2, wherein the step-by-step pairwise accumulation of the results of the number of data multiplications by a target adder to obtain the first result comprises:

5. The method of claim 1, wherein scaling the first result to a second result of a target bit width according to a preset scaling rule comprises:

6. The method of claim 5, wherein the determining the second result from the second input item and the second scaling factor based on a preset scaling rule comprises:

7. The method of claim 6, wherein said intercepting corresponding data in said second input as said second result according to a magnitude relation between said second input and 0, and said discrimination bit, comprises:

wherein the target bit is determined by the second scaling factor.

8. The method of claim 6, wherein the intercepting the corresponding data in the second input item as the second result according to the magnitude relation between the second input item and 0, and the discrimination bit, further comprises:

wherein the target bit is determined by the second scaling factor.

9. The method of claim 6, wherein the intercepting the corresponding data in the second input item as the second result according to the magnitude relation between the second input item and 0, and the discrimination bit, further comprises:

when the second input item is 0, outputting 0 as the second result.

10. The method of claim 1, wherein the determining the quantized convolution result from the second result and the third zero offset, the third zero offset being a zero offset of the second result, comprises:

performing saturation truncation on the second result;

11. A quantized convolution device, the device comprising:

12. An electronic device, comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-10.

13. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of any one of claims 1 to 10.