Model Quantization for PyTorch (Proposal) #18318

Closed
jspisak opened this issue Mar 22, 2019 · 24 comments
Labels: feature (A request for a proper, new feature) · high priority · oncall: quantization (Quantization support in PyTorch) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@jspisak (Contributor) commented Mar 22, 2019

🚀 tl;dr

Attached is a proposal for graph mode quantization in PyTorch (model_quantizer) that provides end-to-end post-training quantization support for both mobile and server backends. Model quantization supports fp32 and int8 precisions as a starting point and will expand to other precision types based on customer needs. Details can be found in the attached PDF doc:

Model Quantization for Pytorch.pdf

cc @soumith, @gchanan, @raghuramank100

@t-vi (Collaborator) commented Mar 22, 2019

> How to export to mobile

Ha!

@jspisak (Contributor, Author) commented Mar 22, 2019

> How to export to mobile
>
> Ha!

Won't be long now.. :)

@raghuramank100 (Contributor) commented:

We are initially planning to support export to NetDef from PyTorch, as the mobile runtime is still based on Caffe2.

@ezyang added the "feature", "high priority", "oncall: quantization", and "triaged" labels on Apr 6, 2019
@jgong5 (Collaborator) commented Apr 15, 2019

From the design doc, are FakeQuant ops only inserted at the submodule boundary? As an example with the LeNet below, do we have to break it into submodules in order to have all the ops quantized?

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out
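
For illustration, this is the kind of refactoring being asked about: a hypothetical rewrite of the same block with every functional op turned into a named submodule, so that observers/fake-quant can attach at module boundaries. The layer sizes assume standard LeNet-5 and are not from the original post.

import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120, 84)
        self.relu4 = nn.ReLU()
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        out = self.pool1(self.relu1(self.conv1(x)))
        out = self.pool2(self.relu2(self.conv2(out)))
        out = out.view(out.size(0), -1)
        out = self.relu3(self.fc1(out))
        out = self.relu4(self.fc2(out))
        return self.fc3(out)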

@t-vi (Collaborator) commented Apr 15, 2019

To me it sounds like they are. But the key sentence about this seems to be at the end of step 5:

> The model is ready for graph-optimizations.

I would think this means that the quantized regions at the top and bottom of the modules grow until they "meet". In that case one could either have a re-quantization step (adjusting min/max and the quantized values) in place of the dequant->quant pair, or eliminate it altogether. This would happen either as part of step 6 or after it (the whole point would be that the ConvRelu is quantized, I think(?); not sure why it isn't mentioned/drawn).

I'm very excited about this!

@jgong5 (Collaborator) commented Apr 15, 2019

@t-vi The concern is that users have to refactor their existing code quite a bit in order to have all the ops quantized. Usually, to get a decent performance boost, most ops should be quantized, with only a few exceptions left in full precision for acceptable accuracy.

@raghuramank100 (Contributor) commented Apr 15, 2019 via email

@jgong5 (Collaborator) commented Apr 16, 2019

@raghuramank100 Thanks for the answer. So for graph mode, fake-quant ops are inserted at the op boundary, correct? I noticed that step 3 in the design doc does not have a fake-quant op between Conv and ReLU. Should there be one in graph mode? Moreover, if fake-quant ops are inserted at the op boundary, how could we selectively fall back some ops to full precision?

The usage model of eager-mode quantization is still confusing to me. In my mind, a common workflow would start with an existing full-precision model: try graph mode with all ops quantized; if the accuracy target is not reached, debug in eager mode, selectively falling back some ops to full precision if needed; finally, deploy the mixed-precision model with graph mode. It sounds horrible if one has to refactor the model that much just to debug in eager mode...

@jgong5 (Collaborator) commented Apr 17, 2019

@raghuramank100 One more question: would there be an option to allow saving full-precision biases?

I'm asking because there are situations where biases are shared by multiple ops, e.g. RetinaNet, and quantized biases have to be requantized for individual ops due to the different activation scales. To avoid the extra requantization overhead, we would rather pass full-precision biases to the quantized kernel directly; the MKL-DNN int8 kernels, for example, support full-precision biases.
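
To illustrate the concern, a small sketch (not PyTorch internals, just the usual int8 convention where the bias is stored as int32 with scale = input_scale * weight_scale): a single fp32 bias shared by two consumers with different activation scales maps to two different int32 vectors, hence the requantization overhead.

import torch

bias_fp32 = torch.tensor([0.25, -0.5, 1.0])

def quantize_bias(bias, input_scale, weight_scale):
    # int32 bias, zero point 0, scale tied to the consuming op
    return torch.round(bias / (input_scale * weight_scale)).to(torch.int32)

# Same fp32 bias, two consumers with different activation scales -> different int32 biases
print(quantize_bias(bias_fp32, input_scale=0.02, weight_scale=0.01))
print(quantize_bias(bias_fp32, input_scale=0.05, weight_scale=0.01))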

@soumith (Member) commented Jun 26, 2019

Some more details have been posted here: https://github.com/pytorch/pytorch/wiki/Introducing-Quantized-Tensor
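
For a quick feel of what that page describes, a minimal sketch of creating and inspecting a quantized tensor (the scale and zero point here are arbitrary):

import torch

x = torch.randn(2, 3)
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=128, dtype=torch.quint8)

print(xq)                # quantized tensor carrying scale and zero_point
print(xq.int_repr())     # underlying uint8 values
print(xq.dequantize())   # back to float32, with quantization error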

@raghuramank100 (Contributor) commented:

> @raghuramank100 Thanks for the answer. So for graph mode, fake-quant ops are inserted at the op boundary, correct? I noticed that step 3 in the design doc does not have a fake-quant op between Conv and ReLU. Should there be one in graph mode? Moreover, if fake-quant ops are inserted at the op boundary, how could we selectively fall back some ops to full precision?
>
> The usage model of eager-mode quantization is still confusing to me. In my mind, a common workflow would start with an existing full-precision model: try graph mode with all ops quantized; if the accuracy target is not reached, debug in eager mode, selectively falling back some ops to full precision if needed; finally, deploy the mixed-precision model with graph mode. It sounds horrible if one has to refactor the model that much just to debug in eager mode...

Hi @jgong5,
I have uploaded a more detailed design document that hopefully answers your questions. For Eager mode the user has full control over where fake-quant operations are inserted. We are planning full support for eager mode first and follow up with graph mode functionality.

@raghuramank100 (Contributor) commented:

Please see the more detailed design doc at: https://github.com/pytorch/pytorch/wiki/torch_quantization_design_proposal. This document outlines an eager-friendly quantization design.

@gottbrath (Contributor) commented:

Hey folks.

I have a bit of a preview of our quantization API and workflow that I would love feedback on!

The attached tutorial covers the steps needed to quantize a model to 8-bit post-training in eager mode. Users need to prepare the model with a few simple changes, such as providing uniquely named attributes for repeated elements and fusing conv and batch norm. Then a high-level command inserts instrumentation so that the activation scaling can be calibrated with a sample set of data, and another high-level command applies the calibration and converts the model to quantized form. The resulting quantized model can be serialized into TorchScript using the JIT.
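
A rough sketch of that flow, assuming the torch.quantization eager-mode API; the toy model and the random calibration data below are placeholders, not the tutorial's ResNeXt:

import torch
import torch.nn as nn

class ToyNet(nn.Module):
    def __init__(self):
        super(ToyNet, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        return self.dequant(x)

model = ToyNet().eval()

# 1. Fuse conv + batch norm (+ relu)
torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]], inplace=True)

# 2. Attach a qconfig and insert observers (instrumentation)
model.qconfig = torch.quantization.default_qconfig
torch.quantization.prepare(model, inplace=True)

# 3. Calibrate activation ranges with sample data
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(1, 3, 32, 32))

# 4. Convert to the quantized model and serialize via TorchScript
torch.quantization.convert(model, inplace=True)
scripted = torch.jit.script(model)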

The user has a lot of control and can choose different quantization and calibration functions for different parts of the model and can apply quantization to the whole model or just parts.

The tutorial should work with top of tree. It is currently CPU focused.

Note that this tutorial covers only eager-mode post-training quantization. We plan to also support quantization-aware training and quantization of models already converted to TorchScript. Accuracy and performance are still being worked on.

Feedback welcomed -- particularly on the workflow and API.

resnext_demo8.ipynb.zip


@yaysummeriscoming commented:

@gottbrath just had a quick look, seems really promising! I imagine that post training quantisation for traced models will be simpler, along the lines of the current tflite implementation.

My thoughts:

1. Why torch.quantization.floatFunctional? Couldn't we just use nn.Module?

2. Can I keep parts of the model in float precision and, if so, how well will this be supported with mobile runtimes?

3. Feels a bit weird to have QuantWrapper - couldn't this functionality be integrated with nn.Module and activation statistics stored at sub-module level or as a tensor property?

4. Can I have different qconfigs for different sub-modules, or is this a QuantWrapper parameter only? For instance, I might like to use different quantisation parameters for the first or last layers.

Apart from that I quite like the workflow!

@banderlog commented:

@gottbrath I know that I will sound silly and naive, but the quantization process for the end user should look like model.int8() or torch.quantize(model, qtype='int8'). But great work on the first ever working PyTorch quantization example notebook (as far as I know) :)

@t-vi (Collaborator) commented Sep 19, 2019

On the wiki, the plan includes a 1-line quantization API, but we are not quite there yet.
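
As an aside, the nearest thing to a one-liner that landed shortly afterwards is dynamic quantization; a sketch, with a toy model that is purely illustrative:

import torch
import torch.nn as nn

float_model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
# One call: Linear weights become int8, activations are quantized dynamically at runtime
quantized_model = torch.quantization.quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)
print(quantized_model)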

@gottbrath (Contributor) commented:

@yaysummeriscoming and @banderlog -- yes, the long-term goal is to have a super simple quantization API. As @yaysummeriscoming recognized, doing that generally requires a graph representation of the full model, which is provided with JIT scripted/traced models. The current implementation is focused on eager mode and provides the building blocks that we will put together with some graph manipulation to provide the one-line version in the future.

It is worth noting that post-training quantization does generally require some calibration, which is a separate step in the current prepare-calibrate-convert implementation.
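
A minimal sketch of what that calibration step usually looks like, assuming `model` has already been through torch.quantization.prepare() and `data_loader` yields representative inputs (both names are placeholders):

import torch

def calibrate(model, data_loader, num_batches=100):
    model.eval()
    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            model(images)              # observers record activation ranges
            if i + 1 >= num_batches:
                break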

@raghuramank100 (Contributor) commented:

> @gottbrath just had a quick look, seems really promising! I imagine that post training quantisation for traced models will be simpler, along the lines of the current tflite implementation.
>
> My thoughts:
>
> 1. Why torch.quantization.floatFunctional? Couldn't we just use nn.Module?
>
> 2. Can I keep parts of the model in float precision and, if so, how well will this be supported with mobile runtimes?
>
> 3. Feels a bit weird to have QuantWrapper - couldn't this functionality be integrated with nn.Module and activation statistics stored at sub-module level or as a tensor property?
>
> 4. Can I have different qconfigs for different sub-modules, or is this a QuantWrapper parameter only? For instance, I might like to use different quantisation parameters for the first or last layers.
>
> Apart from that I quite like the workflow!

Great questions, some answers to provide more clarity:

  1. FloatFunctional: This was a heavily debated choice. The idea behind torch.nn is to have modules that contain learnable parameters, which are learnt via backprop (with a few exceptions like ReLU). With quantization, we have the problem that even operations like adding two quantized tensors require us to capture state, i.e., the range of the output. One option is to have FloatFunctional, which basically allows us to wrap any tensor operation into a module so that we can track output statistics (a short sketch of this follows after this list). nn.Add would have been cleaner for quantization, but would have changed what 'nn' means to the rest of the community.
  2. As answered above, we allow mixing float and quantized operations at the granularity of a module. As a developer, you can choose the module partitioning in your model and so control quantization down to the level of a single primitive operation (like having just one conv quantized, for example). However, to do that, one needs to specify where the activations are quantized and dequantized. This can be done using torch.quantization.QuantStub() and torch.quantization.DeQuantStub() operations. For example:
import torch
import torch.nn as nn


class exampleModule(nn.Module):
    def __init__(self):
        super(exampleModule, self).__init__()
        self.conv1 = nn.Conv2d(...)
        self.conv2 = nn.Conv2d(...)
        self.conv3 = nn.Conv2d(...)
        self.conv4 = nn.Conv2d(...)
        self.quant1 = torch.quantization.QuantStub()
        self.dequant1 = torch.quantization.DeQuantStub()
        self.quant2 = torch.quantization.QuantStub()
        self.dequant2 = torch.quantization.DeQuantStub()

    def forward(self, x):
        # conv1 runs quantized: quantize its input, dequantize its output
        x = self.quant1(x)
        x = self.conv1(x)
        x = self.dequant1(x)
        # conv2 and conv3 stay in float
        x = self.conv2(x)
        x = self.conv3(x)
        # conv4 runs quantized again
        x = self.quant2(x)
        x = self.conv4(x)
        x = self.dequant2(x)
        return x


def main():
    test_model = exampleModule()
    # Specify quantization configuration for the modules that need to be quantized
    test_model.conv1.qconfig = torch.quantization.default_qconfig
    test_model.conv4.qconfig = torch.quantization.default_qconfig
    test_model.quant1.qconfig = torch.quantization.default_qconfig
    test_model.quant2.qconfig = torch.quantization.default_qconfig

    # Call prepare, calibrate with representative data, then convert
    torch.quantization.prepare(test_model, inplace=True)
    calibrate(test_model)  # user-supplied calibration loop over sample data
    torch.quantization.convert(test_model, inplace=True)
  3. QuantWrapper is a convenience wrapper that just inserts a quant and a dequant at the beginning and end of a module. The reason this is not integrated at the module level is that in eager mode we do not have any visibility into the sequence of calls to modules in forward(). So the user needs to explicitly decide where activations are quantized/dequantized.
  4. You can definitely control how you want to quantize any layer and mix and match float/quantized layers.
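
To make point 1 concrete, a sketch of the FloatFunctional pattern using nn.quantized.FloatFunctional; the residual block here is made up for illustration:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self):
        super(ResidualBlock, self).__init__()
        self.conv = nn.Conv2d(8, 8, 3, padding=1)
        # Wraps the tensor-level add in a module so its output range can be observed
        self.skip_add = nn.quantized.FloatFunctional()

    def forward(self, x):
        # In the converted model this becomes an int8 add with its own scale/zero_point
        return self.skip_add.add(self.conv(x), x)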

@gottbrath (Contributor) commented:

@yaysummeriscoming -- I also checked, and PyTorch Mobile will support FP32 ops, so using mixed INT8 + FP32 on mobile should be possible.

@yaysummeriscoming commented:

@raghuramank100 & @gottbrath Thanks for the answers:

  1. OK, I must say I'm quite firmly on the nn.Module side. As I understand it, there's nothing preventing me from using an nn.Module like nn.Add then?

3-4: I gather that QuantWrapper isn't necessary - I can implement the quant/dequant functionality myself as in the example, if I choose?

This being the case, I'm very happy with the flexibility provided - mixed float/quant operation is a godsend.

Interested to see what the mobile/IoT deployment process will look like; I see there's been a lot of work done integrating QNNPACK. What's the current best approach? Can I run quantised models on my Raspberry Pi now?

@t-vi (Collaborator) commented Sep 27, 2019

Yes you can. We just completed a workshop with a dozen people running the quantized ResNet50 on a Pi 4. Inference time (on a Pi 4 Debian arm64 system with the RPi-Foundation-provided 64-bit test kernel running PyTorch) went from 2.7 s for the float model to 900 ms for the uint8 one (the absolute numbers leave room for improvement, but it is a neat start). Thanks for all the great work!
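
For anyone trying this on ARM, a sketch of selecting the QNNPACK backend before running a quantized model; the model path is hypothetical:

import torch

print(torch.backends.quantized.supported_engines)  # should include 'qnnpack' on ARM builds
torch.backends.quantized.engine = 'qnnpack'         # use QNNPACK kernels for int8 ops

model = torch.jit.load("quantized_resnet50.pt")     # hypothetical serialized quantized model
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))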

@raghuramank100 (Contributor) commented:

> @raghuramank100 & @gottbrath Thanks for the answers:
>
> 1. OK, I must say I'm quite firmly on the nn.Module side. As I understand it, there's nothing preventing me from using an nn.Module like nn.Add then?
>
> 3-4: I gather that QuantWrapper isn't necessary - I can implement the quant/dequant functionality myself as in the example, if I choose?
>
> This being the case, I'm very happy with the flexibility provided - mixed float/quant operation is a godsend.
>
> Interested to see what the mobile/IoT deployment process will look like; I see there's been a lot of work done integrating QNNPACK. What's the current best approach? Can I run quantised models on my Raspberry Pi now?

1. There is no nn.Add() available, as add is a tensor method with no learnable parameters. Wherever there are existing nn modules, the quantization method works off of those.
3-4: You are correct about QuantWrapper; you can instead manually insert quant/dequant stubs. We are working on graph mode quantization, where even this part will be automated.
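
For completeness, a sketch of the QuantWrapper convenience mentioned above, as opposed to the manual stubs; the tiny float model is just a placeholder:

import torch
import torch.nn as nn

float_model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())

# QuantWrapper inserts a QuantStub before and a DeQuantStub after the wrapped module
wrapped = torch.quantization.QuantWrapper(float_model)
wrapped.qconfig = torch.quantization.default_qconfig

torch.quantization.prepare(wrapped, inplace=True)
with torch.no_grad():
    wrapped(torch.randn(1, 3, 32, 32))   # calibration pass
torch.quantization.convert(wrapped, inplace=True)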

@alohali commented Oct 25, 2019

Hi all, is there any way to run QAT on an NVIDIA GPU?
I managed to run QAT and post-training quantization, but the speed is too low; a single iteration over my own dataset took several hours. I tried to run it on a CUDA device but ran into errors. Is there any doc/tutorial about this?

@gottbrath (Contributor) commented:

Closing this issue since we delivered this.

Spandana-K-R added a commit to Spandana-K-R/Optimizing-Deep-Learning-models that referenced this issue Jul 19, 2020