Model Quantization for PyTorch (Proposal) #18318
Comments
Ha!
Won't be long now.. :)
We are initially planning to support export to NetDef from PyTorch, as the mobile runtime is still based on Caffe2.
From the design doc, are …
To me it sounds like they are. But the key sentence about this seems to be at the end of step 5:
I would think that this means that the quantized regions at the top and bottom of the modules grow until they "meet". In that event one could either have a re-quantization step (adjusting min/max and the quantized values) in the place of … I'm very excited about this!
@t-vi The concern is that users have to refactor their existing code quite a bit in order to have all the ops quantized. Usually, in order to get a decent performance boost, most ops should be quantized, with only a few exceptions left in full precision for acceptable accuracy.
Hi Jiang,
We are supporting two modes for quantization:
- In graph mode, we integrate with JIT IR and quantization is being implemented as a compilation pass, so even the insertion of fake quant ops is automated. Floating point ops are replaced with quantized ops when available. This should be the preferred option for maximum performance.
- In Eager mode, yes: one needs to do extra work as there is no notion of a graph. The short answer is that one needs to break up a module into sub-modules (at the level of an individual op) for quantization; a rough sketch of this follows below.
Thanks,
Raghu
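
For illustration, a minimal sketch of the eager-mode breakup described above; the module names and layer shapes here are hypothetical, not from the design doc:

import torch
import torch.nn as nn
import torch.nn.functional as F


# Before: the activation lives inside forward(), so there is no module
# boundary at which quantization can be configured or observed.
class BlockBefore(nn.Module):
    def __init__(self):
        super(BlockBefore, self).__init__()
        self.conv = nn.Conv2d(16, 16, 3, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))


# After: each op is its own sub-module, so each one can carry its own
# qconfig and be swapped for a quantized counterpart.
class BlockAfter(nn.Module):
    def __init__(self):
        super(BlockAfter, self).__init__()
        self.conv = nn.Conv2d(16, 16, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))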
@raghuramank100 Thanks for the answer. So for graph mode, fake quant ops are inserted at the op boundary, correct? I noticed that step 3 in the design doc does not have a fake quant op between Conv and ReLU. Should there be one in graph mode? Moreover, if fake quant ops are inserted at the op boundary, how could we selectively fall back some ops to full precision? The usage model of eager mode quantization is still confusing to me. In my mind, a common quantization workflow would start with an existing full-precision model: try graph mode with all ops quantized; if the accuracy target is not achieved, debug with eager mode and selectively fall back some ops to full precision if needed; finally, deploy the mixed-precision model with graph mode. It sounds horrible to me if one has to refactor the model that much just to debug with eager mode...
@raghuramank100 One more question: would there be an option to allow saving full-precision biases? I'm asking because there are situations where biases are shared by multiple ops, e.g. RetinaNet, and quantized biases have to be requantized for individual ops due to the different scales of the activations. To avoid the extra requantization overhead, we would rather pass full-precision biases to the quantized kernel directly; e.g., the MKL-DNN int8 kernels support full-precision biases.
Some more details have been posted here: https://github.com/pytorch/pytorch/wiki/Introducing-Quantized-Tensor
Hi @jgong5,
Please see the more detailed design doc at: https://github.com/pytorch/pytorch/wiki/torch_quantization_design_proposal. This document outlines an eager-friendly quantization design.
Hey folks. I have a bit of a preview of our quantization API and workflow that I would love feedback on! The attached tutorial covers the steps needed to quantize a model to 8-bit, post-training, in eager mode.

Users need to prepare the model with a few simple changes, such as providing uniquely named items for repeated elements and fusing conv and batch norm. Then there are some high-level commands that can insert instrumentation so that the activation scaling can be calibrated with a sample set of data. Another high-level command then applies the calibration and converts the model to quantized form. The resulting quantized model can be serialized into Torch Script using the JIT.

The user has a lot of control: they can choose different quantization and calibration functions for different parts of the model, and can apply quantization to the whole model or just parts of it. The tutorial should work with top of tree. It is currently CPU focused.

Note that this tutorial covers only eager mode post-training quantization. We plan to also support quantization-aware training and quantization of models already converted to Torch Script. Accuracy and performance are still being worked on. Feedback welcomed -- particularly on the workflow and API.
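
To make the prepare-calibrate-convert workflow above concrete, here is a rough sketch using the eager-mode APIs; MyModel, the fusion list, and calibration_loader are illustrative placeholders, not part of the tutorial:

import torch
import torch.quantization

model = MyModel()  # hypothetical float model with conv/bn/relu blocks
model.eval()

# 1. Model preparation: fuse conv + batch norm (+ relu) where applicable
model = torch.quantization.fuse_modules(model, [['conv1', 'bn1', 'relu1']])

# 2. Attach a quantization configuration and insert observers
model.qconfig = torch.quantization.default_qconfig
torch.quantization.prepare(model, inplace=True)

# 3. Calibrate activation ranges with a small sample set
with torch.no_grad():
    for inputs, _ in calibration_loader:  # hypothetical data loader
        model(inputs)

# 4. Convert the calibrated model to its quantized form
torch.quantization.convert(model, inplace=True)

# 5. Optionally serialize via TorchScript
scripted = torch.jit.script(model)
scripted.save("quantized_model.pt")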
@gottbrath Just had a quick look, seems really promising! I imagine that post-training quantisation for traced models will be simpler, along the lines of the current tflite implementation. My thoughts:
- Can I keep parts of the model in float precision, and if so, how well will this be supported with mobile runtimes?
- It feels a bit weird to have QuantWrapper -- couldn't this functionality be integrated with nn.Module, with activation statistics stored at sub-module level or as a tensor property?
- Can I have different qconfigs for different sub-modules, or is this a QuantWrapper parameter only?
Apart from that I quite like the workflow!
@gottbrath I know that I will sound silly and naive, but the quantization process for the end user should look like …
On the wiki, the plan includes a 1-line quantization API, but we are not quite there yet.
@yaysummeriscoming and @banderlog -- yes, the long-term goal is to have a super simple quantization API. As @yaysummeriscoming recognized, doing that generally requires a graph representation of the full model, which is provided with JIT scripted/traced models. The current implementation is focused on eager mode and provides the building blocks that we will put together with some graph manipulation to provide the one-line version in the future. It is worth noting that post-training quantization does generally require some calibration, which is a separate step in the current prepare-calibrate-convert implementation.
Great questions, some answers to provide more clarity:
import torch
import torch.nn as nn


class exampleModule(nn.Module):
    def __init__(self):
        super(exampleModule, self).__init__()
        self.conv1 = nn.Conv2d(...)
        self.conv2 = nn.Conv2d(...)
        self.conv3 = nn.Conv2d(...)
        self.conv4 = nn.Conv2d(...)
        self.quant1 = torch.quantization.QuantStub()
        self.dequant1 = torch.quantization.DeQuantStub()
        self.quant2 = torch.quantization.QuantStub()
        self.dequant2 = torch.quantization.DeQuantStub()

    def forward(self, x):
        # First conv runs quantized: quantize on entry, dequantize on exit
        x = self.quant1(x)
        x = self.conv1(x)
        x = self.dequant1(x)
        # conv2 and conv3 stay in float
        x = self.conv2(x)
        x = self.conv3(x)
        # conv4 runs quantized
        x = self.quant2(x)
        x = self.conv4(x)
        x = self.dequant2(x)
        return x


def main():
    test_model = exampleModule()
    # Specify quantization configuration only for the modules that need to be quantized
    test_model.conv1.qconfig = torch.quantization.default_qconfig
    test_model.conv4.qconfig = torch.quantization.default_qconfig
    test_model.quant1.qconfig = torch.quantization.default_qconfig
    test_model.quant2.qconfig = torch.quantization.default_qconfig
    # Insert observers, calibrate, then convert in place
    torch.quantization.prepare(test_model, inplace=True)
    calibrate(test_model)  # user-supplied: run representative inputs through the model
    torch.quantization.convert(test_model, inplace=True)
@yaysummeriscoming -- I also checked, and PyTorch Mobile will support FP32 ops, so using mixed INT8 + FP32 on mobile should be possible.
@raghuramank100 & @gottbrath Thanks for the answers:
3-4: I gather that QuantWrapper isn't necessary -- I can implement the quant/dequant functionality myself, as in the example, if I choose? This being the case, I'm very happy with the flexibility provided -- mixed float/quant operation is a godsend. Interested to see what the mobile/IoT deployment process will look like; I see there's been a lot of work done integrating QNNPACK. What's the current best approach? Can I run quantised models on my Raspberry Pi now?
Yes you can. We just completed a workshop with a dozen people running the quantized ResNet50 on a Pi 4. Inference time (on a Pi 4 Debian arm64 system with the RPi Foundation-provided 64-bit test kernel, running PyTorch) went from 2.7s for the float model to 900ms for the uint8 one (the absolute numbers leave room for improvement, but it is a neat start). Thanks for all the great work!
There is no nn.Add() available, as add is a tensor method with no learnable parameters. Wherever there are existing nn Modules, the quantization method works off of those.
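
For reference, a hedged sketch of one way to give a tensor-level add a module boundary in eager mode, assuming torch.nn.quantized.FloatFunctional is available in your build; the block and attribute names are hypothetical:

import torch
import torch.nn as nn


class AddBlock(nn.Module):
    """Hypothetical block that routes a tensor add through a module so it can be quantized."""

    def __init__(self):
        super(AddBlock, self).__init__()
        # FloatFunctional gives the add an observable module boundary
        self.skip_add = nn.quantized.FloatFunctional()

    def forward(self, x, y):
        # Behaves like x + y in float, and maps to a quantized add after convert
        return self.skip_add.add(x, y)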
Hi all, is there any way to run QAT on an NVIDIA GPU?
Closing this issue since we delivered this.
🚀 tl;dr
Attached is a proposal for graph mode quantization in PyTorch (model_quantizer) that provides end-to-end post-training quantization support for both mobile and server backends. Model quantization supports fp32 and int8 precisions as a starting point and will expand to support other precision types based on customer needs. Details can be found in the attached pdf doc:
Model Quantization for Pytorch.pdf
cc @soumith, @gchanan, @raghuramank100