[Intel GPU] qconv.pointwise with mixed dtype XPU support #135465
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135465
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit 94f36e3 with merge base 3591657. FLAKY: one job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Please fix the linter error.
@@ -3,6 +3,7 @@
 #include <c10/core/MemoryFormat.h>
 #include <torch/library.h>

+#include <c10/core/ScalarType.h>
Place this under line 3.
modified
has_xpu = any(
    isinstance(input, torch.Tensor) and input.device.type == "xpu"
    for input in inputs
)
if check_autocast == torch.bfloat16 and (
    torch.ops.mkldnn._is_mkldnn_bf16_supported() or has_xpu
Suggested change:
has_xpu = any(
    isinstance(input, torch.Tensor) and input.device.type == "xpu"
    for input in inputs
)
device_type = 'xpu' if has_xpu else 'cpu'
if torch.ops.mkldnn._is_mkldnn_bf16_supported() or torch.ops.mkldnn._is_mkldnn_fp16_supported():
    maybe_autocast = torch.amp.autocast(device_type, check_autocast)
`torch.cpu.amp.autocast` is deprecated now. Use `torch.amp.autocast` instead and generalize the logic to minimize lines of code.
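A minimal sketch of the generalized check, assuming `inputs` and `check_autocast` are the same variables as in the test helper above and that a no-op context is acceptable when autocast does not apply:

```python
import contextlib

import torch

# Sketch only: pick the autocast device from the inputs and use the unified
# torch.amp.autocast API instead of the deprecated torch.cpu.amp.autocast.
has_xpu = any(
    isinstance(inp, torch.Tensor) and inp.device.type == "xpu" for inp in inputs
)
device_type = "xpu" if has_xpu else "cpu"
supported = (
    check_autocast == torch.bfloat16
    and (has_xpu or torch.ops.mkldnn._is_mkldnn_bf16_supported())
) or (
    check_autocast == torch.float16
    and (has_xpu or torch.ops.mkldnn._is_mkldnn_fp16_supported())
)
maybe_autocast = (
    torch.amp.autocast(device_type=device_type, dtype=check_autocast)
    if supported
    else contextlib.nullcontext()
)
```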
modified
Starting merge as part of PR stack under #136753
# Motivation
This PR aims to add mixed data type (AMP) support for the `qlinear_pointwise` op. With the current PR, we allow `qlinear` kernels to output a Tensor that is BF16, rather than FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
  -k test_qlinear_int8_mixed_bf16_xpu \
  -k test_qlinear_relu_int8_mixed_bf16_xpu \
  -k test_qlinear_add_int8_mixed_bf16_xpu
```

# Runtime exemplification
```bash
# qlinear + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32,,4x4:4x4,0.0698242
# qlinear_add + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:-0.677141+sum:0.0132773,,4x4:4x4,0.0419922
# qlinear_add_relu + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.533096+sum:0.00416481+eltwise_relu,,4x4:4x4,0.0759277
```

As shown in the oneDNN verbose output, the `dst_bf16::blocked:ab::f0` attribute demonstrates that we can successfully output a bf16 tensor from the int8 GEMM.

Pull Request resolved: #136753
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337, #135465
Co-authored-by: guangyey <guangye.yu@intel.com>
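For readers inspecting these logs, a small hedged helper (illustrative only, not part of the PR) that extracts the fused post-ops from a verbose record such as the `qlinear_add_relu` line above:

```python
# Hypothetical helper: pull the fused post-ops out of a oneDNN verbose line.
def fused_post_ops(verbose_line: str) -> list[str]:
    # Attributes appear as space-separated "attr-*" tokens inside the
    # comma-separated verbose record.
    for token in verbose_line.replace(",", " ").split():
        if token.startswith("attr-post-ops:"):
            return token[len("attr-post-ops:"):].split("+")
    return []


line = (
    "onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,"
    "src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_bf16::blocked:ab::f0,"
    "attr-scratchpad:user attr-zero-points:src0:0:s32 "
    "attr-post-ops:eltwise_linear:1:0.533096+sum:0.00416481+eltwise_relu,,4x4:4x4,0.0759277"
)
print(fused_post_ops(line))
# ['eltwise_linear:1:0.533096', 'sum:0.00416481', 'eltwise_relu']
```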
# Motivation
This PR aims to add mixed data type (AMP) support for the `qconv_pointwise` op. With the current PR, we allow `qconv` kernels to output a Tensor that is BF16, rather than FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
  -k test_qconv2d_int8_mixed_bf16_xpu \
  -k test_qconv2d_relu_int8_mixed_bf16_xpu \
  -k test_qconv2d_hardtanh_int8_mixed_bf16_xpu \
  -k test_qconv2d_hardswish_int8_mixed_bf16_xpu \
  -k test_qconv2d_silu_int8_mixed_bf16_xpu \
  -k test_qconv2d_add_int8_mixed_bf16_xpu \
  -k test_qconv2d_add_relu_int8_mixed_bf16_xpu
```

# Runtime verification
```bash
# qconv + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0539551
# qconv_silu + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0588379
# qconv_hardswish + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_hardswish:0.166667:0.5,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0568848
```

The `dst_bf16::blocked:acdb::f0` attribute in the oneDNN verbose output demonstrates that the output tensor is successfully computed as bf16.

Pull Request resolved: #135465
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337
Co-authored-by: guangyey <guangye.yu@intel.com>
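Similarly, a minimal hedged check (illustrative only, not part of the PR) that the destination memory descriptor of a qconv verbose record is bf16:

```python
# Hypothetical helper: confirm the destination descriptor of a oneDNN verbose
# record is bf16, as in the qconv lines above.
def dst_is_bf16(verbose_line: str) -> bool:
    # Memory descriptors like "dst_bf16::blocked:acdb::f0" are space-separated
    # tokens inside the comma-separated verbose record.
    return any(
        token.startswith("dst_bf16")
        for token in verbose_line.replace(",", " ").split()
    )


line = (
    "onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,"
    "src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 "
    "dst_bf16::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,"
    "mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0539551"
)
assert dst_is_bf16(line)
```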
# Motivation
This PR aims to add mixed data type (AMP) support for the `qconv_pointwise` op. With the current PR, we allow `qconv` kernels to output a Tensor that is BF16, rather than FP32/INT8.

# UT verification

# Runtime verification
The `dst_bf16::blocked:acdb::f0` attribute in the oneDNN verbose output demonstrates that the output tensor is successfully computed as bf16.

Stack from ghstack (oldest at bottom):
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov