[Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern #142036
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142036
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit f3d6c5a with merge base b576a8c.
BROKEN TRUNK: the failing job was also broken on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @jerryzh168, CI shows green. Would you like to import it? Thanks.
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge failed. Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/re-exporting the PR! Details for Dev Infra team: raised by workflow job.
Hi @jerryzh168, the merge by pytorchmergebot failed. Could you please import it again? Thanks.
Hi @jerryzh168, the failure looks unrelated. Could you please take a look and see if it can be imported again? Thanks.
Hi @jerryzh168, there is a dependent PR targeting the 2.6 branch cut. Could you please check if this PR can be imported and merged? Thanks.
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @jerryzh168, we would like this in 2.6. Could you please merge this PR once internal checks show green? Thanks.
Yeah, I'm trying to land this.
@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@jerryzh168 Thanks a lot
#142110)

### Summary

Extends #142036 with an Inductor pattern-matching pattern covering the torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled):

- Activation is int8-quantized symmetrically, per token.
- Weights are int8-quantized symmetrically, per channel, and statically (so their scales are constant too; due to the constant weights they would have been constant even with dynamic quantization). The weights themselves are also constant because freezing is enabled.

The matched pattern is `torch._int_mm` -> convert to FP32/BF16 -> [optional expand for activation scale] -> `mul` -> `mul`. We do not check whether the activation is dynamically quantized or whether the weights are statically quantized, since the replacement has no side effects even if that is not the case. In practice, it also matches the SmoothQuant int8 quantized linear pattern when its output is not reshaped (i.e., when the activation is 2D).

### More details

oneDNN int8 matmul supports applying a per-channel weight scale but not a vector activation scale. The latter could be applied as a post-op, but that is currently unsupported in ATen. Bias addition (which could be supported with an `add` post-op) is also left unfused. The fusion pattern used in this PR is `torch._int_mm` -> convert to FP32/BF16 -> `mul`, which is replaced by the oneDNN qlinear op. The speedup over eager mode comes from two sources:

1. Fusion of the int8 x int8 -> int32 GEMM, the conversion to FP32/BF16, and the application of the weight scale (in the BF16 case, many intermediate conversions are also avoided).
2. The weight is pre-packed and cached by Inductor, so a reorder is avoided at run time.

In the future, the whole pattern (including application of the activation scale, which would be a `mul` post-op) plus bias could be fused if the corresponding support is added in ATen.

### Verification

UT added in this PR:

```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```

#### Corresponding torchao UTs

1. int8 SmoothQuant legacy API: `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`. The difference from #139595 is that this pattern has no reshapes of the linear output.
2. int8 da8w8 (symmetrically, dynamically quantized activation and statically quantized weights): `TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`

Pull Request resolved: #142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
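For readers unfamiliar with the da8w8 flow summarized above, the following is a minimal, hedged sketch of the kind of eager-mode computation the pattern targets: an int8 x int8 -> int32 GEMM, a dtype conversion, and the two scale `mul`s. Shapes and tensor names are illustrative (not taken from the torchao tests), and it assumes a PyTorch build where `torch._int_mm` has a CPU implementation.

```python
import torch

# Illustrative da8w8 linear in eager mode (a sketch, not the PR's code).
M, K, N = 32, 64, 32
act_int8 = torch.randint(-128, 128, (M, K), dtype=torch.int8)   # quantized activation
wgt_int8 = torch.randint(-128, 128, (K, N), dtype=torch.int8)   # quantized weight
act_scales = torch.rand(M, 1)   # per-token activation scales (dynamic quantization)
wgt_scales = torch.rand(N)      # per-channel weight scales (static quantization)

acc = torch._int_mm(act_int8, wgt_int8)   # int8 x int8 -> int32 accumulator, shape [M, N]
out = acc.to(torch.float32)               # convert to FP32 (or BF16)
out = out * act_scales * wgt_scales       # the two `mul`s that the pattern matches
```

With the fusion in place, the GEMM, the dtype conversion, and the weight-scale `mul` are handled by the oneDNN qlinear op, while the activation-scale `mul` remains a separate (cheap, elementwise) op.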
…#142036)

Reopen of pytorch#139595

**About the PR**

In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ an optional `add` for bias), with `reshape` and `convert_dtype` in between. This PR adds a pass to fuse the corresponding patterns:

- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during Inductor's freezing process. The final graph contains only `onednn.qlinear_pointwise`, with packed weight constants. Note that `onednn.qlinear_pointwise` supports only a scalar activation scale, which is a limitation of the oneDNN library, so in that case we set the activation scale to 1 and the bias to none, and apply the scales and add the bias after `onednn.qlinear_pointwise`.

**Validation results**

Accuracy/perplexity is unchanged with or without this fusion pass. Latency improves by more than 10% with the fusion pass.

Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**

```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Differential Revision: [D66796966](https://our.internmc.facebook.com/intern/diff/D66796966)
Pull Request resolved: pytorch#142036
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
Co-authored-by: sanchitintel <sanchit.jain@intel.com>
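As a rough illustration of the op chain listed in the commit message above (`reshape -> _int_mm -> convert_element_type -> mul -> mul -> add -> reshape`), here is a hedged sketch of an unfused SmoothQuant-style linear with a 3-D activation and a bias. Shapes and names are made up for illustration and this is not the Torchao implementation itself; it assumes `torch._int_mm` is available on CPU.

```python
import torch

# Unfused SmoothQuant-style int8 linear, written as the matched op chain.
B, S, K, N = 2, 32, 64, 32
a_int8 = torch.randint(-128, 128, (B, S, K), dtype=torch.int8)   # quantized activation
b_int8 = torch.randint(-128, 128, (K, N), dtype=torch.int8)      # quantized weight
a_scale = torch.rand(B, S, 1)    # per-token activation scales
b_scale = torch.rand(N)          # per-channel weight scales
bias = torch.rand(N)

a_2d = a_int8.reshape(-1, K)                  # reshape to 2-D for the GEMM
acc = torch._int_mm(a_2d, b_int8)             # _int_mm: int8 x int8 -> int32
y = acc.to(torch.float32)                     # convert_element_type
y = y * b_scale                               # mul(b_scale), per output channel
y = y * a_scale.reshape(-1, 1)                # mul(a_scale), per token (order of the
                                              # two muls does not matter mathematically;
                                              # broadcasting stands in for the optional expand)
y = y + bias                                  # optional add for bias
out = y.reshape(B, S, N)                      # reshape back to 3-D
```

After the fusion pass, the `_int_mm`, dtype conversion, and weight-scale `mul` collapse into the oneDNN qlinear op with a prepacked weight; the remaining per-token scale and bias are applied as separate elementwise ops.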
ghstack-source-id: 426074b
Pull Request resolved: pytorch/pytorch#142036
Stack from ghstack (oldest at bottom):
Reopen of #139595
About the PR

In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ an optional `add` for bias), with `reshape` and `convert_dtype` in between. This PR adds a pass to fuse the corresponding patterns:

- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during Inductor's freezing process. The final graph contains only `onednn.qlinear_pointwise`, with packed weight constants. Note that `onednn.qlinear_pointwise` supports only a scalar activation scale, which is a limitation of the oneDNN library, so in that case we set the activation scale to 1 and the bias to none, and apply the scales and add the bias after `onednn.qlinear_pointwise` (a minimal sketch of this workaround follows).
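The hedged sketch below illustrates why that workaround is sound: applying the per-token activation scale and the bias after a GEMM computed with the activation scale fixed to 1 gives the same result as applying everything inside, because the scaling commutes with the matmul. Names and shapes are illustrative, and it assumes a PyTorch build with a CPU implementation of `torch._int_mm`.

```python
import torch

# Sanity check of the "apply activation scale and bias afterwards" workaround.
M, K, N = 32, 64, 32
a_int8 = torch.randint(-128, 128, (M, K), dtype=torch.int8)
w_int8 = torch.randint(-128, 128, (K, N), dtype=torch.int8)
a_scale = torch.rand(M, 1)   # per-token activation scales
w_scale = torch.rand(N)      # per-channel weight scales
bias = torch.rand(N)

acc = torch._int_mm(a_int8, w_int8).to(torch.float32)

# Reference: all scales and the bias applied together.
ref = acc * w_scale * a_scale + bias

# Workaround: the part a qlinear-style op can fuse (weight scale only,
# activation scale fixed to 1, no bias) ...
fused_part = acc * w_scale * 1.0
# ... followed by the leftover per-token scale and bias as separate ops.
out = fused_part * a_scale + bias

torch.testing.assert_close(out, ref)
```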
Validation results
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running the example script of SmoothQuant in Torchao (https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`
Test plan:
`python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm`
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov
Differential Revision: D66796966