Model Quantization for PyTorch (Proposal) #18318

Closed
jspisak opened this issue Mar 22, 2019 · 24 comments
Labels: feature (A request for a proper, new feature) · high priority · oncall: quantization (Quantization support in PyTorch) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@jspisak (Contributor) commented Mar 22, 2019

🚀 tl;dr

Attached is a proposal for graph mode quantization in PyTorch (model_quantizer) that provides end-to-end post-training quantization support for both mobile and server backends. Model quantization supports fp32 and int8 precisions as a starting point and will expand to other precision types based on customer needs. Details can be found in the attached PDF doc:

Model Quantization for Pytorch.pdf

cc @soumith, @gchanan, @raghuramank100

@t-vi (Collaborator) commented Mar 22, 2019

> How to export to mobile

Ha!

@jspisak (Contributor, Author) commented Mar 22, 2019

> How to export to mobile
>
> Ha!

Won't be long now.. :)

@raghuramank100 (Contributor) commented:

We are initially planning to support export to NetDef from PyTorch, as the mobile runtime is still based on Caffe2.

@ezyang added the "feature", "high priority", "oncall: quantization", and "triaged" labels on Apr 6, 2019
@jgong5 (Collaborator) commented Apr 15, 2019

From the design doc, are FakeQuant ops only inserted at the submodule boundary? As an example with the LeNet below, do we have to break it into submodules in order to have all the ops quantized?

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out
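
For illustration, this is the kind of refactoring being asked about: a hypothetical rewrite of the same block with every functional op turned into a named submodule, so that observers/fake-quant can attach at module boundaries. The layer sizes assume standard LeNet-5 and are not from the original post.

import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120, 84)
        self.relu4 = nn.ReLU()
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        out = self.pool1(self.relu1(self.conv1(x)))
        out = self.pool2(self.relu2(self.conv2(out)))
        out = out.view(out.size(0), -1)
        out = self.relu3(self.fc1(out))
        out = self.relu4(self.fc2(out))
        return self.fc3(out)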

@t-vi (Collaborator) commented Apr 15, 2019

To me it sounds like they are. But the key sentence about this seems to be at the end of step 5:

> The model is ready for graph-optimizations.

I would think this means that the quantized regions at the top and bottom of the modules grow until they "meet". In that case one could either have a re-quantization step (adjusting min/max and the quantized values) in place of the dequant->quant pair, or eliminate it altogether. This would happen either as part of step 6 or after it (the whole point would be that the ConvRelu is quantized, I think(?); not sure why it isn't mentioned/drawn).

I'm very excited about this!

@jgong5 (Collaborator) commented Apr 15, 2019

@t-vi The concern is that users have to refactor their existing code quite a bit in order to have all the ops quantized. Usually, to get a decent performance boost, most ops should be quantized, with only a few exceptions left in full precision for acceptable accuracy.

@raghuramank100 (Contributor) commented Apr 15, 2019 via email

@jgong5 (Collaborator) commented Apr 16, 2019

@raghuramank100 Thanks for the answer. So for graph mode, fake-quant ops are inserted at the op boundary, correct? I noticed that step 3 in the design doc does not have a fake-quant op between Conv and ReLU. Should there be one in graph mode? Moreover, if fake-quant ops are inserted at the op boundary, how could we selectively fall back some ops to full precision?

The usage model of eager-mode quantization is still confusing to me. In my mind, a common workflow would start with an existing full-precision model: try graph mode with all ops quantized; if the accuracy target is not reached, debug in eager mode, selectively falling back some ops to full precision if needed; finally, deploy the mixed-precision model with graph mode. It sounds horrible if one has to refactor the model that much just to debug in eager mode...

@jgong5 (Collaborator) commented Apr 17, 2019

@raghuramank100 One more question: would there be an option to allow saving full-precision biases?

I'm asking because there are situations where biases are shared by multiple ops, e.g. RetinaNet, and quantized biases have to be requantized for individual ops due to the different activation scales. To avoid the extra requantization overhead, we would rather pass full-precision biases to the quantized kernel directly; the MKL-DNN int8 kernels, for example, support full-precision biases.
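
To illustrate the concern, a small sketch (not PyTorch internals, just the usual int8 convention where the bias is stored as int32 with scale = input_scale * weight_scale): a single fp32 bias shared by two consumers with different activation scales maps to two different int32 vectors, hence the requantization overhead.

import torch

bias_fp32 = torch.tensor([0.25, -0.5, 1.0])

def quantize_bias(bias, input_scale, weight_scale):
    # int32 bias, zero point 0, scale tied to the consuming op
    return torch.round(bias / (input_scale * weight_scale)).to(torch.int32)

# Same fp32 bias, two consumers with different activation scales -> different int32 biases
print(quantize_bias(bias_fp32, input_scale=0.02, weight_scale=0.01))
print(quantize_bias(bias_fp32, input_scale=0.05, weight_scale=0.01))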

@soumith (Member) commented Jun 26, 2019

Some more details have been posted here: https://github.com/pytorch/pytorch/wiki/Introducing-Quantized-Tensor
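
For a quick feel of what that page describes, a minimal sketch of creating and inspecting a quantized tensor (the scale and zero point here are arbitrary):

import torch

x = torch.randn(2, 3)
xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=128, dtype=torch.quint8)

print(xq)                # quantized tensor carrying scale and zero_point
print(xq.int_repr())     # underlying uint8 values
print(xq.dequantize())   # back to float32, with quantization error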

@raghuramank100 (Contributor) commented:

> @raghuramank100 Thanks for the answer. So for graph mode, fake-quant ops are inserted at the op boundary, correct? I noticed that step 3 in the design doc does not have a fake-quant op between Conv and ReLU. Should there be one in graph mode? Moreover, if fake-quant ops are inserted at the op boundary, how could we selectively fall back some ops to full precision?
>
> The usage model of eager-mode quantization is still confusing to me. In my mind, a common workflow would start with an existing full-precision model: try graph mode with all ops quantized; if the accuracy target is not reached, debug in eager mode, selectively falling back some ops to full precision if needed; finally, deploy the mixed-precision model with graph mode. It sounds horrible if one has to refactor the model that much just to debug in eager mode...

Hi @jgong5,
I have uploaded a more detailed design document that hopefully answers your questions. For Eager mode the user has full control over where fake-quant operations are inserted. We are planning full support for eager mode first and follow up with graph mode functionality.

@raghuramank100 (Contributor) commented:

Please see the more detailed design doc at: https://github.com/pytorch/pytorch/wiki/torch_quantization_design_proposal. This document outlines an eager-friendly quantization design.

@gottbrath (Contributor) commented:

Hey folks.

I have a bit of a preview of our quantization API and workflow that I would love feedback on!

The attached tutorial covers the steps needed to quantize a model to 8-bit post-training in eager mode. Users need to prepare the model with a few simple changes, such as providing uniquely named attributes for repeated elements and fusing conv and batch norm. Then a high-level command inserts instrumentation so that the activation scaling can be calibrated with a sample set of data, and another high-level command applies the calibration and converts the model to quantized form. The resulting quantized model can be serialized into TorchScript using the JIT.
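
A rough sketch of that flow, assuming the torch.quantization eager-mode API; the toy model and the random calibration data below are placeholders, not the tutorial's ResNeXt:

import torch
import torch.nn as nn

class ToyNet(nn.Module):
    def __init__(self):
        super(ToyNet, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        return self.dequant(x)

model = ToyNet().eval()

# 1. Fuse conv + batch norm (+ relu)
torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]], inplace=True)

# 2. Attach a qconfig and insert observers (instrumentation)
model.qconfig = torch.quantization.default_qconfig
torch.quantization.prepare(model, inplace=True)

# 3. Calibrate activation ranges with sample data
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(1, 3, 32, 32))

# 4. Convert to the quantized model and serialize via TorchScript
torch.quantization.convert(model, inplace=True)
scripted = torch.jit.script(model)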

The user has a lot of control and can choose different quantization and calibration functions for different parts of the model and can apply quantization to the whole model or just parts.

The tutorial should work with top of tree. It is currently CPU focused.

Note that this tutorial covers only eager-mode post-training quantization. We plan to also support quantization-aware training and quantization of models already converted to TorchScript. Accuracy and performance are still being worked on.

Feedback welcomed -- particularly on the workflow and API.

resnext_demo8.ipynb.zip


@yaysummeriscoming commented:

@gottbrath just had a quick look, seems really promising! I imagine that post training quantisation for traced models will be simpler, along the lines of the current tflite implementation.

My thoughts:

1. Why torch.quantization.floatFunctional? Couldn't we just use nn.Module?

2. Can I keep parts of the model in float precision and, if so, how well will this be supported with mobile runtimes?

3. Feels a bit weird to have QuantWrapper - couldn't this functionality be integrated with nn.Module and activation statistics stored at sub-module level or as a tensor property?

4. Can I have different qconfigs for different sub-modules, or is this a QuantWrapper parameter only? For instance, I might like to use different quantisation parameters for the first or last layers.

Apart from that I quite like the workflow!

@banderlog commented:

@gottbrath I know that I will sound silly and naive, but the quantization process for the end user should look like model.int8() or torch.quantize(model, qtype='int8'). But great work on the first ever working PyTorch quantization example notebook (as far as I know) :)

@t-vi (Collaborator) commented Sep 19, 2019

On the wiki, the plan includes a 1-line quantization API, but we are not quite there yet.
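
As an aside, the nearest thing to a one-liner that landed shortly afterwards is dynamic quantization; a sketch, with a toy model that is purely illustrative:

import torch
import torch.nn as nn

float_model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
# One call: Linear weights become int8, activations are quantized dynamically at runtime
quantized_model = torch.quantization.quantize_dynamic(float_model, {nn.Linear}, dtype=torch.qint8)
print(quantized_model)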

@gottbrath (Contributor) commented:

@yaysummeriscoming and @banderlog -- yes, the long-term goal is to have a super simple quantization API. As @yaysummeriscoming recognized, doing that generally requires a graph representation of the full model, which is provided with JIT scripted/traced models. The current implementation is focused on eager mode and provides the building blocks that we will put together with some graph manipulation to provide the one-line version in the future.

It is worth noting that post-training quantization does generally require some calibration, which is a separate step in the current prepare-calibrate-convert implementation.
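
A minimal sketch of what that calibration step usually looks like, assuming `model` has already been through torch.quantization.prepare() and `data_loader` yields representative inputs (both names are placeholders):

import torch

def calibrate(model, data_loader, num_batches=100):
    model.eval()
    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            model(images)              # observers record activation ranges
            if i + 1 >= num_batches:
                break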

@raghuramank100 (Contributor) commented:

> @gottbrath just had a quick look, seems really promising! I imagine that post training quantisation for traced models will be simpler, along the lines of the current tflite implementation.
>
> My thoughts:
>
> 1. Why torch.quantization.floatFunctional? Couldn't we just use nn.Module?
>
> 2. Can I keep parts of the model in float precision and, if so, how well will this be supported with mobile runtimes?
>
> 3. Feels a bit weird to have QuantWrapper - couldn't this functionality be integrated with nn.Module and activation statistics stored at sub-module level or as a tensor property?
>
> 4. Can I have different qconfigs for different sub-modules, or is this a QuantWrapper parameter only? For instance, I might like to use different quantisation parameters for the first or last layers.
>
> Apart from that I quite like the workflow!

Great questions, some answers to provide more clarity:

  1. FloatFunctional: This was a heavily debated choice. The idea behind torch.nn is to have modules that contain learnable parameters, which are learnt via backprop (with a few exceptions like ReLU). With quantization, we have the problem that even operations like adding two quantized tensors require us to capture state, i.e., the range of the output. One option is to have FloatFunctional, which basically allows us to wrap any tensor operation into a module so that we can track output statistics (a short sketch of this follows after this list). nn.Add would have been cleaner for quantization, but would have changed what 'nn' means to the rest of the community.
  2. As answered above, we allow mixing float and quantized operations at the granularity of a module. As a developer, you can choose the module partitioning in your model and so control quantization down to the level of a single primitive operation (like having just one conv quantized, for example). However, to do that, one needs to specify where the activations are quantized and dequantized. This can be done using torch.quantization.QuantStub() and torch.quantization.DeQuantStub() operations. For example:
import torch
import torch.nn as nn


class exampleModule(nn.Module):
    def __init__(self):
        super(exampleModule, self).__init__()
        self.conv1 = nn.Conv2d(...)
        self.conv2 = nn.Conv2d(...)
        self.conv3 = nn.Conv2d(...)
        self.conv4 = nn.Conv2d(...)
        self.quant1 = torch.quantization.QuantStub()
        self.dequant1 = torch.quantization.DeQuantStub()
        self.quant2 = torch.quantization.QuantStub()
        self.dequant2 = torch.quantization.DeQuantStub()

    def forward(self, x):
        # conv1 runs quantized: quantize its input, dequantize its output
        x = self.quant1(x)
        x = self.conv1(x)
        x = self.dequant1(x)
        # conv2 and conv3 stay in float
        x = self.conv2(x)
        x = self.conv3(x)
        # conv4 runs quantized again
        x = self.quant2(x)
        x = self.conv4(x)
        x = self.dequant2(x)
        return x


def main():
    test_model = exampleModule()
    # Specify quantization configuration for the modules that need to be quantized
    test_model.conv1.qconfig = torch.quantization.default_qconfig
    test_model.conv4.qconfig = torch.quantization.default_qconfig
    test_model.quant1.qconfig = torch.quantization.default_qconfig
    test_model.quant2.qconfig = torch.quantization.default_qconfig

    # Call prepare, calibrate with representative data, then convert
    torch.quantization.prepare(test_model, inplace=True)
    calibrate(test_model)  # user-supplied calibration loop over sample data
    torch.quantization.convert(test_model, inplace=True)
  3. QuantWrapper is a convenience wrapper that just inserts a quant and a dequant at the beginning and end of a module. The reason this is not integrated at the module level is that in eager mode we do not have any visibility into the sequence of calls to modules in forward(). So the user needs to explicitly decide where activations are quantized/dequantized.
  4. You can definitely control how you want to quantize any layer and mix and match float/quantized layers.
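
To make point 1 concrete, a sketch of the FloatFunctional pattern using nn.quantized.FloatFunctional; the residual block here is made up for illustration:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self):
        super(ResidualBlock, self).__init__()
        self.conv = nn.Conv2d(8, 8, 3, padding=1)
        # Wraps the tensor-level add in a module so its output range can be observed
        self.skip_add = nn.quantized.FloatFunctional()

    def forward(self, x):
        # In the converted model this becomes an int8 add with its own scale/zero_point
        return self.skip_add.add(self.conv(x), x)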

@gottbrath (Contributor) commented:

@yaysummeriscoming -- I also checked, and PyTorch Mobile will support FP32 ops, so using mixed INT8 + FP32 on mobile should be possible.

@yaysummeriscoming commented:

@raghuramank100 & @gottbrath Thanks for the answers:

  1. OK, I must say I'm quite firmly on the nn.Module side. As I understand it, there's nothing preventing me from using an nn.Module like nn.Add then?

3-4: I gather that QuantWrapper isn't necessary - I can implement the quant/dequant functionality myself as in the example, if I choose?

This being the case, I'm very happy with the flexibility provided - mixed float/quant operation is a godsend.

Interested to see what the mobile/IoT deployment process will look like; I see there's been a lot of work done integrating QNNPACK. What's the current best approach? Can I run quantised models on my Raspberry Pi now?

@t-vi (Collaborator) commented Sep 27, 2019

Yes you can. We just completed a workshop with a dozen people running the quantized ResNet50 on a Pi 4. Inference time (on a Pi 4 Debian arm64 system with the RPi-Foundation-provided 64-bit test kernel running PyTorch) went from 2.7 s for the float model to 900 ms for the uint8 one (the absolute numbers leave room for improvement, but it is a neat start). Thanks for all the great work!
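
For anyone trying this on ARM, a sketch of selecting the QNNPACK backend before running a quantized model; the model path is hypothetical:

import torch

print(torch.backends.quantized.supported_engines)  # should include 'qnnpack' on ARM builds
torch.backends.quantized.engine = 'qnnpack'         # use QNNPACK kernels for int8 ops

model = torch.jit.load("quantized_resnet50.pt")     # hypothetical serialized quantized model
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))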

@raghuramank100 (Contributor) commented:

> @raghuramank100 & @gottbrath Thanks for the answers:
>
> 1. OK, I must say I'm quite firmly on the nn.Module side. As I understand it, there's nothing preventing me from using an nn.Module like nn.Add then?
>
> 3-4: I gather that QuantWrapper isn't necessary - I can implement the quant/dequant functionality myself as in the example, if I choose?
>
> This being the case, I'm very happy with the flexibility provided - mixed float/quant operation is a godsend.
>
> Interested to see what the mobile/IoT deployment process will look like; I see there's been a lot of work done integrating QNNPACK. What's the current best approach? Can I run quantised models on my Raspberry Pi now?

1. There is no nn.Add() available, as add is a tensor method with no learnable parameters. Wherever there are existing nn modules, the quantization method works off of those.
3-4: You are correct about QuantWrapper; you can instead manually insert quant/dequant stubs. We are working on graph mode quantization, where even this part will be automated.
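
For completeness, a sketch of the QuantWrapper convenience mentioned above, as opposed to the manual stubs; the tiny float model is just a placeholder:

import torch
import torch.nn as nn

float_model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())

# QuantWrapper inserts a QuantStub before and a DeQuantStub after the wrapped module
wrapped = torch.quantization.QuantWrapper(float_model)
wrapped.qconfig = torch.quantization.default_qconfig

torch.quantization.prepare(wrapped, inplace=True)
with torch.no_grad():
    wrapped(torch.randn(1, 3, 32, 32))   # calibration pass
torch.quantization.convert(wrapped, inplace=True)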

@alohali commented Oct 25, 2019

Hi all, is there any way to run QAT on an NVIDIA GPU?
I managed to run QAT and post-training quantization, but the speed is too low; a single iteration over my own dataset took several hours. I tried to run it on a CUDA device but ran into errors. Is there any doc/tutorial about this?

@gottbrath (Contributor) commented:

Closing this issue since we delivered this.

Spandana-K-R added a commit to Spandana-K-R/Optimizing-Deep-Learning-models that referenced this issue Jul 19, 2020