🚀 The feature, motivation and pitch
Description
Add a unified quantization backend ‘X86’ for x86 CPU platforms. Make it the default PyTorch quantization backend for x86 in place of FBGEMM.
It is implemented by automatically selecting between FBGEMM and ONEDNN kernels during weight prepacking.
Motivation
The ONEDNN quantization backend takes advantage of features of the latest Intel® CPU products and supports more fused ops. It has shown better performance than FBGEMM in many (but not all) cases.
From an API design point of view, it would not be user-friendly to expose both the FBGEMM and ONEDNN backends to end users. We therefore propose a unified quantization backend named ‘X86’ that combines the strengths of both backends while keeping the API simple.
In the frontend, users will use the x86 backend by default on x86 platforms. Under the hood, we decide which kernel to run and hide the details from them. The selection between kernels is made automatically during weight prepacking, based on static information.
Thus, the X86 backend can replace FBGEMM and offer better performance.
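As an illustration of the intended frontend experience, below is a minimal sketch of FX graph mode post-training quantization with the x86 engine. The helper names (`get_default_qconfig_mapping`, `prepare_fx`, `convert_fx`) and the explicit engine assignment reflect recent PyTorch releases and are shown only as an example, not as part of this proposal.

```python
# Minimal sketch: select the unified x86 quantized engine and run FX graph
# mode post-training quantization. API names follow recent PyTorch releases
# and may differ in older versions.
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

torch.backends.quantized.engine = "x86"          # default on x86 once unified
qconfig_mapping = get_default_qconfig_mapping("x86")

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)                        # calibrate with representative data
quantized = convert_fx(prepared)                 # FBGEMM/ONEDNN kernels are chosen at prepack time
```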
Design philosophy
Auto kernel selection between FBGEMM and ONEDNN by simple heuristics without runtime info.
- For non-fused ops, choose ONEDNN if it is always better; otherwise, use simple heuristics to make the selection.
- For fused ops in FX quantization, the x86 QEngine suggests the quant fusion patterns statically, as the current fbgemm and onednn backends do now. The fusion patterns may include patterns from the onednn backend that the fbgemm backend does not support. If a fused op (e.g., conv-add-relu) is always better on ONEDNN than the non-fused conv + add-relu on FBGEMM, the fused op is exposed; otherwise, we expose conv + add-relu.
- At runtime, the x86 QEngine implements the fused ops by choosing the right kernels, fbgemm or onednn. The decision can be made statically (e.g., conv+add+relu is only available in onednn, so the onednn kernel is used).
- For implementation, the X86 backend will follow the QEngine and backend_config APIs (see the sketch after this list).
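As a rough sketch of the backend_config path (module paths and helper names such as `get_native_backend_config` are assumptions based on recent PyTorch releases), an explicit backend config can be passed to FX quantization like this, reusing `model`, `qconfig_mapping`, and `example_inputs` from the sketch above:

```python
# Sketch only: pass an explicit backend_config describing supported fusion
# patterns to FX graph mode quantization. Exact module paths may change
# between PyTorch releases.
from torch.ao.quantization.backend_config import get_native_backend_config
from torch.ao.quantization.quantize_fx import prepare_fx

backend_config = get_native_backend_config()     # patterns of the native CPU backends
prepared = prepare_fx(
    model, qconfig_mapping, example_inputs, backend_config=backend_config)
```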
Proposed Heuristics
Rules for auto selection between FBGEMM and ONEDNN:
- On platforms without VNNI (e.g., Skylake), FBGEMM is always used.
- On platforms with VNNI (e.g., Cascade lake, Ice lake, and future platforms):
- For linear, FBGEMM is always used.
- For convolution, FBGEMM is used for depthwise convolution with groups > 100; otherwise, ONEDNN is used.
- Currently, X86 supports the same fusion patterns as FBGEMM.
For the unified backend, selection occurs during weight prepacking, when `quantized::conv/linear_prepack` is called. The prepack function checks hardware info (with or without VNNI) and op parameters, and returns a proper prepacked weight object accordingly.
For example, for linear ops, or on platforms without VNNI, FBGEMM’s prepacked weight is always returned; for convolution with groups=1, ONEDNN’s prepacked weight is returned. At runtime, the corresponding kernels are then called automatically.
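The rules above can be summarized with a small hypothetical helper (illustration only; the real decision is made in C++ inside the prepack functions, as shown in the Additional context section below):

```python
# Hypothetical Python rendering of the dispatch heuristic described above.
# The actual selection happens in C++ during quantized::conv/linear_prepack.
def choose_x86_kernel(op: str, groups: int, has_vnni: bool) -> str:
    if not has_vnni:
        return "fbgemm"   # no VNNI (e.g., Skylake): FBGEMM is always used
    if op == "linear":
        return "fbgemm"   # linear: FBGEMM even on VNNI-capable CPUs
    if op == "conv" and groups > 100:
        return "fbgemm"   # depthwise conv with groups > 100: FBGEMM
    return "onednn"       # otherwise: ONEDNN
```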
We have completed the implementation and benchmarked common models. The following table lists the speedup ratio (geomean of throughputs) of the unified x86 backend vs. pure FBGEMM:
| Device \ Speedup (geomean) | 1 core/instance | 2 cores/instance | 4 cores/instance | 1 socket/instance |
|---|---|---|---|---|
| Intel(R) Xeon(R) Cascade Lake | 1.701 | 1.821 | 1.921 | 1.513 |
| Intel(R) Xeon(R) Ice Lake | 1.767 | 1.810 | 2.042 | 1.346 |
(Table updated on Feb 21, 2023)
Note:
- Multi-instance runs, using Jemalloc and Intel OpenMP (IOMP)
- Ratio > 1 means unified backend is better
- For details, please refer to the attached sheets
unified_qengine_poc_performance_bechmark.xlsx
Performance data updated on Jan 20, 2023:
int8_oob_x86_vs_fbgemm_20230119.xlsx
Performance data updated on Feb 5, 2023:
int8_benchmark_x86_vs_fbgemm_20230205.xlsx
(Using PyTorch nightly build on Feb 4, 2023, installed by pip)
Performance data updated on Feb 21, 2023:
int8_benchmark_x86_vs_fbgemm_20230221.xlsx
(Using PyTorch nightly build on Feb 20, 2023, installed by pip)
About qconfig
For compatibility, the new backend will use `reduce_range=True` to align with FBGEMM. However, for better accuracy, we hope to change it to `reduce_range=False` in the future.
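For illustration, here is a hedged sketch of what `reduce_range` means inside a qconfig; the observer choices below are assumptions for the example and not necessarily the exact defaults of the x86 backend:

```python
# Sketch: reduce_range=True restricts activation quantization to a 7-bit
# range (used to avoid potential overflow in FBGEMM's int8 kernels);
# reduce_range=False would use the full 8-bit range. Observer choices are
# illustrative only.
import torch
from torch.ao.quantization import QConfig, HistogramObserver, PerChannelMinMaxObserver

x86_like_qconfig = QConfig(
    activation=HistogramObserver.with_args(reduce_range=True),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric),
)
```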
Accuracy
We have run torchvision models to compare accuracy. Results show that FBGEMM and X86 backends give the same accuracy. For details, please see the worksheet:
torchvision_accuracy_comparison_fbgemm_vs_x86.xlsx
Plans
Original plans
The implementation is still pending some optimizations of the ONEDNN backend, which are not yet available in stock PyTorch. Thus, the numbers shown above cannot currently be reproduced with stock PyTorch.
We therefore plan to take the steps below to finally unify the x86 qengines:
- Update ideep in stock PyTorch. Many optimizations are based on the ideep update.
- Optimize performance of ONEDNN backend. PR(s) will be submitted after ideep's updates.
- Prepare PR of the unified qengine
- Publicize it to end users
We hope all these changes will land before the 1.13 release.
Current status
- Implementation is finished and the PRs have landed
- This feature is expected to be publicized with the PyTorch 2.0 release.
- We are continuing to work on improving the onednn backend. The dispatching heuristics and supported fusion patterns might change in the future.
Alternatives
N/A
Additional context
Example of implementing conv_prepack for the unified X86 quantization backend:
```cpp
template <int kSpatialDim = 2>
class QConvPackWeightInt8 final {
 public:
  // Public API to do conv prepack (omitted here for brevity)
 private:
  static c10::intrusive_ptr<ConvPackedParamsBase<kSpatialDim>> _run(
      Tensor weight,
      c10::optional<Tensor> bias,
      torch::List<int64_t> stride,
      torch::List<int64_t> padding,
      torch::List<int64_t> output_padding,
      torch::List<int64_t> dilation,
      int64_t groups,
      bool transpose) {
    auto& ctx = at::globalContext();
    if (ctx.qEngine() == at::QEngine::X86) {
#ifdef USE_FBGEMM
      // Heuristic: use FBGEMM when the CPU has no VNNI support (`no_vnni`
      // stands in for that hardware check) or for depthwise convolutions
      // with groups > 100.
      if (no_vnni || groups > 100) {
        return PackedConvWeight<kSpatialDim>::prepack(
            weight, bias, stride, padding, output_padding, dilation, groups, transpose);
      }
#endif
#if AT_MKLDNN_ENABLED()
      // Otherwise, prepack with ONEDNN; the returned packed-weight object
      // determines which kernel runs later.
      return PackedConvWeightsOnednn<kSpatialDim>::prepack(
          weight, bias, stride, padding, output_padding, dilation, groups, transpose);
#endif
    }
#ifdef USE_PYTORCH_QNNPACK
    if (ctx.qEngine() == at::QEngine::QNNPACK) {
      return PackedConvWeightsQnnp<kSpatialDim>::prepack(
          weight, bias, stride, padding, output_padding, dilation, groups,
          transpose);
    }
#endif
    TORCH_CHECK(
        false,
        "Didn't find engine for operation quantized::conv2d_prepack ",
        toString(ctx.qEngine()));
  }
};
```
The prepack function returns a prepacked-weight object that belongs to either FBGEMM or ONEDNN. The subsequent call to `prepacked_weight->run` then automatically dispatches to the correct kernel.
Original code can be found here for reference:
cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @leslie-fang-intel