🚀 The feature, motivation and pitch
Description
Add a unified quantization backend ‘X86’ for x86 CPU platforms. Make it the default PyTorch quantization backend for x86 in place of FBGEMM.
It is implemented by automatically selecting between FBGEMM and ONEDNN kernels during weight prepacking.
Motivation
The ONEDNN quantization backend takes advantage of features of the latest Intel® CPU products and supports more fused ops. It has shown better performance than FBGEMM in many (but not all) cases.
From an API design point of view, it would not be user-friendly to expose both the FBGEMM and ONEDNN backends to end users. We therefore propose a unified quantization backend named ‘X86’ that combines the strengths of both backends while keeping the API simple.
In the frontend, users will use the x86 backend by default on x86 platforms. Under the hood, we decide which kernel to run and hide the details from them. The selection between kernels is made automatically during weight prepacking, based on static information.
Thus, the X86 backend can replace FBGEMM and offer better performance.
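As an illustration of the intended frontend experience, below is a minimal sketch of FX graph mode post-training quantization with the x86 engine. The helper names (`get_default_qconfig_mapping`, `prepare_fx`, `convert_fx`) and the explicit engine assignment reflect recent PyTorch releases and are shown only as an example, not as part of this proposal.

```python
# Minimal sketch: select the unified x86 quantized engine and run FX graph
# mode post-training quantization. API names follow recent PyTorch releases
# and may differ in older versions.
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

torch.backends.quantized.engine = "x86"          # default on x86 once unified
qconfig_mapping = get_default_qconfig_mapping("x86")

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)                        # calibrate with representative data
quantized = convert_fx(prepared)                 # FBGEMM/ONEDNN kernels are chosen at prepack time
```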
Design philosophy
Auto kernel selection between FBGEMM and ONEDNN by simple heuristics without runtime info.
- For non-fused ops, choose ONEDNN if it is always better; otherwise, use simple heuristics to make the selection.
- For fused ops in FX quantization, the x86 QEngine suggests the quant fusion patterns statically, as the current fbgemm and onednn backends do now. The fusion patterns may include patterns from the onednn backend that the fbgemm backend does not support. If a fused op (e.g., conv-add-relu) is always better on ONEDNN than the non-fused conv + add-relu on FBGEMM, the fused op is exposed; otherwise, we expose conv + add-relu.
- At runtime, the x86 QEngine implements the fused ops by choosing the right kernels, fbgemm or onednn. The decision can be made statically (e.g., conv+add+relu is only available in onednn, so the onednn kernel is used).
- For implementation, the X86 backend will follow the QEngine and backend_config APIs (see the sketch after this list).
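As a rough sketch of the backend_config path (module paths and helper names such as `get_native_backend_config` are assumptions based on recent PyTorch releases), an explicit backend config can be passed to FX quantization like this, reusing `model`, `qconfig_mapping`, and `example_inputs` from the sketch above:

```python
# Sketch only: pass an explicit backend_config describing supported fusion
# patterns to FX graph mode quantization. Exact module paths may change
# between PyTorch releases.
from torch.ao.quantization.backend_config import get_native_backend_config
from torch.ao.quantization.quantize_fx import prepare_fx

backend_config = get_native_backend_config()     # patterns of the native CPU backends
prepared = prepare_fx(
    model, qconfig_mapping, example_inputs, backend_config=backend_config)
```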
Proposed Heuristics
Rules for auto selection between FBGEMM and ONEDNN:
- On platforms without VNNI (e.g., Skylake), FBGEMM is always used.
- On platforms with VNNI (e.g., Cascade lake, Ice lake, and future platforms):
- For linear, FBGEMM is always used.
- For convolution, FBGEMM is used for depthwise convolution with groups > 100; otherwise, ONEDNN is used.
- Currently, X86 supports the same fusion patterns as FBGEMM.
For the unified backend, selection occurs during weight prepacking, when `quantized::conv/linear_prepack` is called. The prepack function checks hardware info (with or without VNNI) and op parameters, and returns a proper prepacked weight object accordingly.
For example, for linear ops, or on platforms without VNNI, FBGEMM’s prepacked weight is always returned; for convolution with groups=1, ONEDNN’s prepacked weight is returned. At runtime, the corresponding kernels are then called automatically.
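The rules above can be summarized with a small hypothetical helper (illustration only; the real decision is made in C++ inside the prepack functions, as shown in the Additional context section below):

```python
# Hypothetical Python rendering of the dispatch heuristic described above.
# The actual selection happens in C++ during quantized::conv/linear_prepack.
def choose_x86_kernel(op: str, groups: int, has_vnni: bool) -> str:
    if not has_vnni:
        return "fbgemm"   # no VNNI (e.g., Skylake): FBGEMM is always used
    if op == "linear":
        return "fbgemm"   # linear: FBGEMM even on VNNI-capable CPUs
    if op == "conv" and groups > 100:
        return "fbgemm"   # depthwise conv with groups > 100: FBGEMM
    return "onednn"       # otherwise: ONEDNN
```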
We have completed the implementation and benchmarked common models. The following table lists the speedup ratio (geomean of throughputs) of the unified x86 backend vs. pure FBGEMM:
| Device \ Speedup (geomean) | 1 core/instance | 2 cores/instance | 4 cores/instance | 1 socket/instance |
|---|---|---|---|---|
| Intel(R) Xeon(R) Cascade Lake | 1.701 | 1.821 | 1.921 | 1.513 |
| Intel(R) Xeon(R) Ice Lake | 1.767 | 1.810 | 2.042 | 1.346 |
(Table updated on Feb 21, 2023)
Note:
- Multi-instance runs, using Jemalloc and Intel OpenMP (IOMP)
- Ratio > 1 means unified backend is better
- For details, please refer to the attached sheets
unified_qengine_poc_performance_bechmark.xlsx
Performance data updated on Jan 20, 2023:
int8_oob_x86_vs_fbgemm_20230119.xlsx
Performance data updated on Feb 5, 2023:
int8_benchmark_x86_vs_fbgemm_20230205.xlsx
(Using PyTorch nightly build on Feb 4, 2023, installed by pip)
Performance data updated on Feb 21, 2023:
int8_benchmark_x86_vs_fbgemm_20230221.xlsx
(Using PyTorch nightly build on Feb 20, 2023, installed by pip)
About qconfig
For compatibility, the new backend will use `reduce_range=True` to align with FBGEMM. However, for better accuracy, we hope to change it to `reduce_range=False` in the future.
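For illustration, here is a hedged sketch of what `reduce_range` means inside a qconfig; the observer choices below are assumptions for the example and not necessarily the exact defaults of the x86 backend:

```python
# Sketch: reduce_range=True restricts activation quantization to a 7-bit
# range (used to avoid potential overflow in FBGEMM's int8 kernels);
# reduce_range=False would use the full 8-bit range. Observer choices are
# illustrative only.
import torch
from torch.ao.quantization import QConfig, HistogramObserver, PerChannelMinMaxObserver

x86_like_qconfig = QConfig(
    activation=HistogramObserver.with_args(reduce_range=True),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric),
)
```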
Accuracy
We have run torchvision models to compare accuracy. Results show that FBGEMM and X86 backends give the same accuracy. For details, please see the worksheet:
torchvision_accuracy_comparison_fbgemm_vs_x86.xlsx
Plans
Original plans
The implementation is still pending some optimizations of the ONEDNN backend, which are not yet available in stock PyTorch. Thus, the numbers shown above cannot currently be reproduced with stock PyTorch.
We therefore plan to take the steps below to finally unify the x86 qengines:
- Update ideep in stock PyTorch. Many optimizations are based on the ideep update.
- Optimize performance of ONEDNN backend. PR(s) will be submitted after ideep's updates.
- Prepare PR of the unified qengine
- Publicize it to end users
We hope all these changes will land before the 1.13 release.
Current status
- Implementation is finished and the PRs have landed
- This feature is expected to be publicized with the PyTorch 2.0 release.
- We are continuing to work on improving the onednn backend. The dispatching heuristics and supported fusion patterns might change in the future.
Alternatives
N/A
Additional context
Example of implementing conv_prepack for the unified X86 quantization backend:
```cpp
template <int kSpatialDim = 2>
class QConvPackWeightInt8 final {
 public:
  // Public API to do conv prepack (omitted here for brevity)
 private:
  static c10::intrusive_ptr<ConvPackedParamsBase<kSpatialDim>> _run(
      Tensor weight,
      c10::optional<Tensor> bias,
      torch::List<int64_t> stride,
      torch::List<int64_t> padding,
      torch::List<int64_t> output_padding,
      torch::List<int64_t> dilation,
      int64_t groups,
      bool transpose) {
    auto& ctx = at::globalContext();
    if (ctx.qEngine() == at::QEngine::X86) {
#ifdef USE_FBGEMM
      // Heuristic: use FBGEMM when the CPU has no VNNI support (`no_vnni`
      // stands in for that hardware check) or for depthwise convolutions
      // with groups > 100.
      if (no_vnni || groups > 100) {
        return PackedConvWeight<kSpatialDim>::prepack(
            weight, bias, stride, padding, output_padding, dilation, groups, transpose);
      }
#endif
#if AT_MKLDNN_ENABLED()
      // Otherwise, prepack with ONEDNN; the returned packed-weight object
      // determines which kernel runs later.
      return PackedConvWeightsOnednn<kSpatialDim>::prepack(
          weight, bias, stride, padding, output_padding, dilation, groups, transpose);
#endif
    }
#ifdef USE_PYTORCH_QNNPACK
    if (ctx.qEngine() == at::QEngine::QNNPACK) {
      return PackedConvWeightsQnnp<kSpatialDim>::prepack(
          weight, bias, stride, padding, output_padding, dilation, groups,
          transpose);
    }
#endif
    TORCH_CHECK(
        false,
        "Didn't find engine for operation quantized::conv2d_prepack ",
        toString(ctx.qEngine()));
  }
};
```
The prepack function returns a prepacked-weight object that belongs to either FBGEMM or ONEDNN. The subsequent call to `prepacked_weight->run` then automatically dispatches to the correct kernel.
Original code can be found here for reference:
cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @leslie-fang-intel