[Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern by Xia-Weiwen · Pull Request #142036 · pytorch/pytorch · GitHub

[Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern #142036


Closed
wants to merge 7 commits

Conversation

@Xia-Weiwen (Collaborator) commented Dec 4, 2024

Stack from ghstack (oldest at bottom):

Reopen of #139595

About the PR
In Torchao's SmoothQuant implementation, quantized linear is computed as `_int_mm(a, b)` followed by `mul(b_scale)` and `mul(a_scale)` (plus an optional `add` for bias), with `reshape` and `convert_dtype` ops in between.
This PR adds a pass to fuse the corresponding patterns:

  • (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
  • (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`; the latter is evaluated and frozen during Inductor's freezing process. The final graph contains only `onednn.qlinear_pointwise`, with packed weight constants.
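
For illustration, a minimal eager-mode sketch of the computation the pass matches (the function and tensor names here are hypothetical, not Torchao's actual code):

```
import torch

# Matched (unfused) SmoothQuant int8 linear:
# act_int8: (B, S, K) int8 activation, w_int8: (K, N) int8 weight,
# w_scale: (N,) per-channel weight scale, a_scale: (B*S, 1) per-token activation scale.
def smoothquant_linear_unfused(act_int8, w_int8, a_scale, w_scale, bias=None):
    B, S, K = act_int8.shape
    a_2d = act_int8.reshape(-1, K)       # reshape 3D activation to 2D
    c = torch._int_mm(a_2d, w_int8)      # int8 x int8 -> int32 GEMM
    c = c.to(torch.float32)              # convert_element_type
    c = c * w_scale                      # mul by per-channel weight scale
    c = c * a_scale                      # (expand ->) mul by per-token activation scale
    if bias is not None:
        c = c + bias                     # optional add for bias
    return c.reshape(B, S, -1)           # reshape back to 3D
```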

Note that `onednn.qlinear_pointwise` only supports a scalar activation scale (a limitation of the oneDNN library). When the activation scale is not a scalar, we therefore pass an activation scale of 1 and no bias to `onednn.qlinear_pointwise`, and apply the scales and the bias afterwards.
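
A rough, self-contained sketch of that fallback (`fused_qlinear` below is a hypothetical stand-in for `onednn.qlinear_pointwise`, whose real signature and prepacked weight argument differ):

```
import torch

def fused_qlinear(act_int8, act_scale, w_int8, w_scale, bias):
    # Stand-in for onednn.qlinear_pointwise: int8 GEMM, dequantize with the
    # given scales, then optional bias. The real op takes a prepacked weight.
    y = torch._int_mm(act_int8, w_int8).to(torch.float32)
    y = y * w_scale * act_scale
    return y if bias is None else y + bias

M, K, N = 32, 64, 64
act_int8 = torch.randint(-128, 127, (M, K), dtype=torch.int8)
w_int8 = torch.randint(-128, 127, (K, N), dtype=torch.int8)
a_scale = torch.rand(M, 1)   # per-token activation scale (a vector, not a scalar)
w_scale = torch.rand(N)      # per-channel weight scale
bias = torch.rand(N)

# The fused op only accepts a scalar activation scale, so it is called with
# scale 1.0 and no bias; the per-token scale and the bias are applied afterwards.
y = fused_qlinear(act_int8, 1.0, w_int8, w_scale, bias=None)
y = y * a_scale + bias
```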

Validation results
Accuracy/perplexity is unchanged by this fusion pass.
Latency improves by more than 10% with the fusion pass.
Test method:

  • Model: EleutherAI/gpt-j-6b
  • Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
  • Using Intel OMP and Tcmalloc
  • Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

Test plan

python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

Differential Revision: D66796966

pytorch-bot bot commented Dec 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142036

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit f3d6c5a with merge base b576a8c:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor release notes: quantization release notes category labels Dec 4, 2024
Xia-Weiwen added a commit that referenced this pull request Dec 4, 2024
ghstack-source-id: 02eb032
Pull Request resolved: #142036
@Xia-Weiwen changed the title from [Inductor][CPU] Fuse SmoothQuant int8 linear pattern to [Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern Dec 4, 2024
@Xia-Weiwen added the intel and ciflow/trunk labels and removed the open source label Dec 4, 2024
@Xia-Weiwen (Collaborator, Author):

Hi @jerryzh168, CI shows green. Would you like to import it? Thanks.

@jerryzh168 (Contributor):

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jerryzh168 (Contributor):

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try re-importing/re-exporting the PR!

Details for Dev Infra team: raised by workflow job.

@Xia-Weiwen (Collaborator, Author):

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try re-importing/re-exporting the PR!

Details for Dev Infra team

Hi @jerryzh168, the merge via pytorchmergebot failed. Could you please import it again? Thanks.

@Xia-Weiwen (Collaborator, Author):

Hi @jerryzh168, the failure looks unrelated. Could you please take a look and see if it can be imported again? Thanks.

@Xia-Weiwen (Collaborator, Author):

Hi @jerryzh168, there is a dependent PR targeting the 2.6 branch cut. Could you please check whether this PR can be imported and merged? Thanks.

@jerryzh168 (Contributor):

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@Xia-Weiwen (Collaborator, Author):

Hi @jerryzh168, there is a dependent PR targeting the 2.6 branch cut. Could you please check whether this PR can be imported and merged? Thanks.

Hi @jerryzh168, we would like this in 2.6. Could you please merge this PR once internal checks show green? Thanks.

@jerryzh168 (Contributor):

yeah I'm trying to land this

Xia-Weiwen added a commit that referenced this pull request Dec 10, 2024
ghstack-source-id: e5964c8
Pull Request resolved: #142036
Xia-Weiwen added a commit that referenced this pull request Dec 11, 2024
ghstack-source-id: 55fb77f
Pull Request resolved: #142036
Xia-Weiwen added a commit that referenced this pull request Dec 11, 2024
ghstack-source-id: 7b2a046
Pull Request resolved: #142036
@jerryzh168 (Contributor):

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@Xia-Weiwen (Collaborator, Author):

@jerryzh168 Thanks a lot

pytorchmergebot pushed a commit that referenced this pull request Dec 13, 2024
(#142110)

### Summary

Extends #142036 with an Inductor pattern-matching pass covering the torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled):

- Activation: int8, symmetrically quantized per token.
- Weights: int8, symmetrically quantized per channel, statically (so the scales are constant; they would be constant even with dynamic quantization anyway, since the weights are constant). The weights themselves are also constant because freezing is enabled.

The pattern that's matched is `torch._int_mm` -> convert to FP32/BF16 -> [optional expand for activation scale] -> `mul` -> `mul`.

We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, though, since the implementation has no side effects even if that is not the case.

In practice, it also matches the SmoothQuant int8 quantized linear pattern when its output is not reshaped (i.e., when the activation is 2D).
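
For illustration, a rough sketch of the matched computation (hypothetical names; this is not the code in the PR):

```
import torch

def da8w8_linear_unfused(a_int8, w_int8, a_scale, w_scale, out_dtype=torch.bfloat16):
    # a_int8: (M, K) int8 activation, w_int8: (K, N) int8 weight,
    # a_scale: (M, 1) per-token activation scale, w_scale: (N,) per-channel weight scale.
    c = torch._int_mm(a_int8, w_int8)   # int8 x int8 -> int32 GEMM
    c = c.to(out_dtype)                 # convert to FP32/BF16
    c = c * a_scale                     # (expand ->) mul: per-token activation scale
    c = c * w_scale                     # mul: per-channel weight scale
    return c                            # no reshape: the activation is already 2D
```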

### More details

oneDNN int8 matmul supports applying a per-channel weight scale but not a vector activation scale; the latter could be applied as a post-op, but that is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also left unfused.

The fusion pattern used in this PR is `torch._int_mm` -> convert to FP32/BF16 -> `mul`, which is replaced by the oneDNN qlinear op.

The speedup over eager mode comes from two sources:
1. Fusion of the int8 x int8 -> int32 GEMM, the conversion to FP32/BF16, and the application of the weight scale (in the BF16 case, many intermediate conversions are also avoided).
2. The weight is pre-packed and cached by Inductor, so a reorder is avoided at run time.

In the future, the whole pattern (including the application of the activation scale, which would be a mul post-op) plus bias could be fused, if the corresponding support is added in ATen.

### Verification

A UT is added in this PR:
```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```

#### Corresponding torchao UTs

1. int8 Smoothquant legacy API - `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`.
The difference from #139595 is that there are no reshapes of the linear output in this pattern.

2. int8 da8w8 - symmetrically quantized activation (dynamic) & statically quantized weights - `TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`

Pull Request resolved: #142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
@Xia-Weiwen deleted the gh/Xia-Weiwen/23/head branch December 14, 2024 12:51
bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
Pull Request resolved: pytorch#142036
Approved by: https://github.com/jerryzh168, https://github.com/jgong5

Co-authored-by: sanchitintel <sanchit.jain@intel.com>
bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
Esquains pushed a commit to Esquains/study1 that referenced this pull request Dec 15, 2024
Labels
ciflow/inductor, ciflow/trunk, intel, Merged, module: cpu, module: inductor, open source, release notes: quantization