[Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt by sanchitintel · Pull Request #142110 · pytorch/pytorch · GitHub

[Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt #142110


Closed
wants to merge 2 commits

Conversation

sanchitintel
Collaborator
@sanchitintel commented Dec 5, 2024

Stack from ghstack (oldest at bottom):

Summary

Extends #142036 by adding an Inductor pattern-matching pattern for the torchao API int8_dynamic_activation_int8_weight in the following scenario (inference-only, freezing enabled):

  • Activations are int8 quantized symmetrically, per token.
  • Weights are int8 quantized symmetrically, per channel, and statically, so their scales are constant (although with constant weights they would have been constant even under dynamic quantization). The weights themselves are also constant because freezing is enabled.

The pattern that's matched is torch._int_mm -> convert to FP32/BF16 -> [optional expand for activation scale] -> mul -> mul.
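For reference, the unfused computation that this pattern corresponds to looks roughly like the minimal sketch below (illustrative only, not the torchao implementation; the helper name and tensor shapes are made up):

```python
import torch

# Illustrative sketch of the da8w8 compute targeted by the pattern matcher.
# The helper name and shapes are made up; this is not the torchao implementation.
def da8w8_linear_sketch(x, w_int8, w_scale, out_dtype=torch.bfloat16):
    # Symmetric, per-token (dynamic) int8 quantization of the activation x: [M, K]
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0   # [M, 1]
    x_int8 = torch.clamp(torch.round(x / x_scale), -127, 127).to(torch.int8)

    # w_int8: [N, K] symmetric, per-channel int8 weight; w_scale: [N]
    # (both constant once freezing is enabled)
    acc = torch._int_mm(x_int8, w_int8.t().contiguous())  # int8 x int8 -> int32 GEMM
    y = acc.to(out_dtype)                                  # convert to FP32/BF16
    y = y * x_scale.to(out_dtype)                          # mul: per-token activation scale
    y = y * w_scale.to(out_dtype)                          # mul: per-channel weight scale
    return y
```

Under torch.compile with freezing enabled, Inductor matches this sequence and rewrites its GEMM/convert/weight-scale portion into a oneDNN qlinear call, as described under "More details" below.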

We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, though, since the implementation won't have any side effects even if that's not the case.

In practice, it also matches the SmoothQuant int8 quantized linear pattern if its output is not reshaped (i.e., if the activation is 2D).

More details

oneDNN int8 matmul supports applying a per-channel weight scale, but not a per-token (vector) activation scale; the latter could be applied as a post-op, but that is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also left unfused.

The fusion pattern used in this PR is torch._int_mm -> convert to FP32/BF16 -> mul, which is replaced by the oneDNN qlinear op.

The speedup over eager mode is due to two reasons:

  1. fusion of the int8 x int8 -> int32 GEMM, the conversion to FP32/BF16, and the application of the weight scale (in the BF16 case, many intermediate conversions are also avoided).
  2. the weight is pre-packed & cached by Inductor, so a reorder is avoided at run-time.

In the future, however, the whole pattern (including the application of the activation scale as a mul post-op) plus the bias could be fused, if the corresponding support were added in ATen.
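To make the current split concrete, here is a hypothetical sketch of what remains outside the fused op today (fused_qlinear is only a placeholder for the oneDNN qlinear call Inductor emits with a pre-packed weight; it is not a real API):

```python
# Hypothetical sketch: `fused_qlinear` stands in for the oneDNN qlinear op that
# Inductor emits with a pre-packed, per-channel-scaled weight; it is not a real API.
def da8w8_after_fusion(x_int8, fused_qlinear, x_scale, bias=None):
    y = fused_qlinear(x_int8)   # fused: int8 GEMM + FP32/BF16 conversion + weight scale
    y = y * x_scale             # per-token activation scale: still a separate mul
    if bias is not None:
        y = y + bias            # bias addition is also currently unfused
    return y
```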

Verification

Added a UT in this PR:

python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm

Corresponding torchao UTs
  1. int8 SmoothQuant legacy API - TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear
    The difference from [Inductor][CPU] Fuse SmoothQuant int8 linear pattern #139595 is that there are no reshapes of the linear output in this pattern.

  2. int8 da8w8 - symmetrically quantized activation (dynamically) & statically quantized weights - TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

[ghstack-poisoned]
pytorch-bot bot commented Dec 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142110

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 604e4ec with merge base b31d3b2:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@sanchitintel
Collaborator Author

[ghstack-poisoned]
sanchitintel added a commit that referenced this pull request Dec 5, 2024
@sanchitintel
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 13, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
Pull Request resolved: pytorch#142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: pytorch#142036
@github-actions github-actions bot deleted the gh/sanchitintel/4/head branch January 14, 2025 02:02