[Intel GPU] qlinear.pointwise with mixed dtype support by ZhiweiYan-96 · Pull Request #136753 · pytorch/pytorch · GitHub

[Intel GPU] qlinear.pointwise with mixed dtype support #136753


Closed
wants to merge 58 commits

Conversation

ZhiweiYan-96
Collaborator
@ZhiweiYan-96 ZhiweiYan-96 commented Sep 26, 2024

Motivation

This PR adds mixed data type (AMP) support for the `qlinear_pointwise` op. With this PR, `qlinear` kernels can output a BF16 tensor rather than FP32/INT8.
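For illustration only, here is a minimal sketch of one way this path could be exercised from user code, assuming the PT2E quantization flow on XPU. The quantizer class, its import path, and the default-config helper below are assumptions about the surrounding stack rather than something introduced by this PR, and exact APIs vary across PyTorch versions:

```python
import torch
import torch.nn as nn
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
# Assumed import path and helpers for the XPU quantizer; adjust to your PyTorch build.
from torch.ao.quantization.quantizer.xpu_inductor_quantizer import (
    XPUInductorQuantizer,
    get_default_xpu_inductor_quantization_config,
)


class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


model = M().eval().to("xpu")
example_inputs = (torch.randn(2, 4, device="xpu"),)

# PT2E flow: export, annotate with the quantizer, calibrate, convert to an int8 graph.
exported = torch.export.export_for_training(model, example_inputs).module()
quantizer = XPUInductorQuantizer()
quantizer.set_global(get_default_xpu_inductor_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)  # calibration pass
converted = convert_pt2e(prepared)

# Under autocast(bf16), Inductor can lower to qlinear kernels that emit BF16 directly
# instead of dequantizing to FP32 first.
with torch.no_grad(), torch.autocast("xpu", dtype=torch.bfloat16):
    compiled = torch.compile(converted)
    out = compiled(*example_inputs)

print(out.dtype)  # expected: torch.bfloat16
```

If the assumed quantizer helpers are not available in your build, the unit tests listed below are the authoritative way to exercise these kernels.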

UT verification

DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_int8_mixed_bf16_xpu \
    -k test_qlinear_relu_int8_mixed_bf16_xpu \
    -k test_qlinear_add_int8_mixed_bf16_xpu

Runtime exemplification

# qlinear + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32,,4x4:4x4,0.0698242
# qlinear_add + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:-0.677141+sum:0.0132773,,4x4:4x4,0.0419922
# qlinear_add_relu + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.533096+sum:0.00416481+eltwise_relu,,4x4:4x4,0.0759277

As shown in the oneDNN verbose output, the destination descriptor `dst_bf16::blocked:ab::f0` demonstrates that the INT8 GEMM successfully outputs a BF16 tensor.
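For completeness, a small throwaway helper (not part of this PR) that extracts the destination dtype from such a verbose line, which is one way to automate the check above:

```python
import re

# One of the verbose lines quoted above, truncated to the relevant fields.
verbose_line = (
    "onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,"
    "src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 "
    "dst_bf16::blocked:ab::f0,attr-scratchpad:user ...,,4x4:4x4,0.0698242"
)

# Extract the data type from the dst_* memory descriptor, e.g. "bf16".
match = re.search(r"dst_([a-z0-9]+)::", verbose_line)
print(match.group(1))  # -> bf16
```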

Stack from ghstack (oldest at bottom):

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

pytorch-bot bot commented Sep 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136753

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 6 Unrelated Failures

As of commit 41c7207 with merge base 3591657:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ZainRizvi
Contributor

This commit is based on a commit that's about a month old and is no longer compatible with our CI system. Please rebase this onto the latest viable/strict branch.

ZhiweiYan-96 added a commit that referenced this pull request Oct 9, 2024
@ZhiweiYan-96 ZhiweiYan-96 added the topic: not user facing label Oct 9, 2024
@ZhiweiYan-96
Collaborator Author

@ZainRizvi Thanks a lot for the reminder! I have rebased my PR stack.

ZhiweiYan-96 added a commit that referenced this pull request Oct 9, 2024
ZhiweiYan-96 added a commit that referenced this pull request Oct 9, 2024
ZhiweiYan-96 added a commit that referenced this pull request Oct 17, 2024
ZhiweiYan-96 added a commit that referenced this pull request Oct 21, 2024
ZhiweiYan-96 added a commit that referenced this pull request Oct 23, 2024
ZhiweiYan-96 added a commit that referenced this pull request Oct 23, 2024
ZhiweiYan-96 added a commit that referenced this pull request Oct 23, 2024
is_qat=is_qat,
is_dynamic=is_dynamic,
)

def _qlinear_dequant_promotion_cpu_test_helper(
Collaborator

Suggested change
def _qlinear_dequant_promotion_cpu_test_helper(
def _qlinear_dequant_promotion_test_helper(

Collaborator Author

Thanks for the reminder; I've changed all the naming in this file.

@@ -2375,7 +2461,6 @@ def test_qlinear_relu_xpu(self):
(torch.randn((2, 4)).to(device="xpu"),), device="xpu"
)

@skipIfNoDynamoSupport
Collaborator

typo?

Collaborator Author

Thanks for the reminder; it's been added back.

@ZhiweiYan-96 ZhiweiYan-96 added the ciflow/xpu and keep-going labels Feb 12, 2025
ZhiweiYan-96 added a commit that referenced this pull request Feb 12, 2025
@EikanWang EikanWang requested a review from jerryzh168 February 14, 2025 03:00
ZhiweiYan-96 added a commit that referenced this pull request Feb 17, 2025
@ZhiweiYan-96 ZhiweiYan-96 added the ciflow/binaries_wheel label Feb 20, 2025
@EikanWang
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Feb 24, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

@EikanWang
Collaborator

@pytorchbot merge -i

aditew01 pushed a commit that referenced this pull request Feb 28, 2025

Pull Request resolved: #136753
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337, #135465

Co-authored-by: guangyey <guangye.yu@intel.com>
majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
@github-actions github-actions bot deleted the gh/ZhiweiYan-96/31/head branch March 27, 2025 02:10
Labels
ciflow/binaries_wheel (Trigger binary build and upload jobs for wheel on the PR)
ciflow/inductor
ciflow/trunk (Trigger trunk jobs on your pull request)
ciflow/xpu (Run XPU CI tasks)
keep-going (Don't stop on first failure, keep running tests until the end)
Merged
module: cpu (CPU specific problem (e.g., perf, algorithm))
module: inductor
open source
topic: not user facing (topic category)
Projects
Status: Done

9 participants