allow to use bf16 as fp32 internal precision for mkldnn conv #126050

zhuhaozhe · 2024-05-13T06:55:47Z

Allow using `BF16` as the internal computation data type for FP32 convolutions by setting `torch.backends.mkldnn.conv.fp32_precision = "bf16"`.
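A minimal sketch of how the setting might be exercised from Python, assuming a PyTorch build that already exposes `torch.backends.mkldnn.conv.fp32_precision` as described in this PR. The helper name `compare_conv_precisions` and the tensor shapes are illustrative only and are not part of the PR; on builds without the setting (or without PyTorch) the helper simply returns `None`.

```python
import importlib.util


def compare_conv_precisions():
    """Return the max abs difference between conv2d at the currently active
    precision and conv2d with BF16 internal math, or None if unsupported."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed
    import torch
    import torch.nn.functional as F

    # Assumption: the per-op setting lives at torch.backends.mkldnn.conv.
    conv_cfg = getattr(torch.backends.mkldnn, "conv", None)
    if conv_cfg is None or not hasattr(conv_cfg, "fp32_precision"):
        return None  # older build without the per-op precision setting

    torch.manual_seed(0)
    x = torch.randn(1, 64, 16, 16)  # N, IC, H, W (FP32 on CPU)
    w = torch.randn(256, 64, 1, 1)  # OC, IC, kH, kW

    previous = conv_cfg.fp32_precision
    try:
        reference = F.conv2d(x, w)        # whatever precision was active
        conv_cfg.fp32_precision = "bf16"  # opt in to BF16 internal math
        fast = F.conv2d(x, w)             # inputs and outputs stay FP32
    finally:
        conv_cfg.fp32_precision = previous  # restore the global setting
    return (reference - fast).abs().max().item()
```

Calling `compare_conv_precisions()` gives a rough sense of the numerical gap introduced by the reduced internal precision; the tensors themselves remain FP32 throughout. The setting is global, which is why the sketch restores the previous value in a `finally` block.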

TestPlan

python test/test_mkldnn.py -k conv

Benchmarking

FP32 conv2d vs. conv2d with BF16 internal computation on SPR (Intel Sapphire Rapids)

Single core:

Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 185.5071 | 83.4749 | 2.22
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 194.7558 | 79.1683 | 2.46
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 1.9213 | 1.3690 | 1.40

56 cores:

Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 6.5804 | 7.4349 | 0.89
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 4.9940 | 3.8093 | 1.31
IC: 256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 8.8359 | 5.5802 | 1.58
IC: 1024, OC: 256, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 16.5800 | 9.2367 | 1.80
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 79.5436 | 38.3861 | 2.07
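For readers checking the numbers, the "Speed up" column appears to be the FP32 latency divided by the BF16-internal latency, so a value below 1 (the first 56-core row) is a slowdown. The short sketch below recomputes the column from the latencies in the two tables; the names `SINGLE_CORE`, `MULTI_CORE`, and `speedups` are illustrative and not from the PR.

```python
# (fp32 ms, bf16 internal ms) pairs copied from the benchmark tables
SINGLE_CORE = [
    (185.5071, 83.4749),
    (194.7558, 79.1683),
    (1.9213, 1.3690),
]
MULTI_CORE = [
    (6.5804, 7.4349),
    (4.9940, 3.8093),
    (8.8359, 5.5802),
    (16.5800, 9.2367),
    (79.5436, 38.3861),
]


def speedups(rows):
    """Return fp32_ms / bf16_ms for each row, rounded to two decimals."""
    return [round(fp32_ms / bf16_ms, 2) for fp32_ms, bf16_ms in rows]


print(speedups(SINGLE_CORE))  # [2.22, 2.46, 1.4]
print(speedups(MULTI_CORE))   # [0.89, 1.31, 1.58, 1.8, 2.07]
```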

Stack from ghstack (oldest at bottom):

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal

pytorch-bot · 2024-05-13T06:55:50Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126050

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 4fb47b9 with merge base 4015166:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / unstable-linux-focal-cuda12.6-py3.10-gcc11-sm89-xfail / build (gh)
ninja: build stopped: subcommand failed

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

jgong5 · 2024-05-15T01:12:08Z

aten/src/ATen/native/mkldnn/Conv.cpp

@@ -200,6 +200,11 @@ static void check_shape_forward(const Tensor& input,
 //  but weight/bias and grad_weight/grad_bias are always CPU tensor.
 //

+static bool enabled_fpmatch_mode_bf16_for_fp32_for_mkldnn_conv(){


Make the name shorter? E.g., `mkldnn_conv_enabled_fpmath_mode_bf16`?

jgong5 · 2024-05-15T01:13:27Z

aten/src/ATen/native/mkldnn/Conv.cpp

-    const ideep::attr_t& op_attr) {
+    ideep::attr_t& op_attr) {


This doesn't feel right. Does the caller expect op_attr to be changed?

Hi, Jiong.
I see there are two callers of this function; both create op_attr and pass it to _mkldnn_convolution_out. After _mkldnn_convolution_out returns, nothing else depends on op_attr, so I think it does not matter if op_attr is changed.
If you think we need a function that guarantees op_attr is not changed anyway, I can move the set_fpmath_mode call out of this function (_mkldnn_convolution_out) and do it in its two callers.

Better not to make it mutable in the first place.

Got it, updated.

Updated? But I still see that it is mutable.

Sorry, I may have forgotten to upload. Updated now.


ghstack-source-id: b02b0b1 Pull Request resolved: pytorch/pytorch#126050

[ghstack-poisoned]

pytorch-bot bot added ciflow/linux-aarch64 linux aarch64 CI workflow module: cpu CPU specific problem (e.g., perf, algorithm) module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration labels May 13, 2024

This was referenced May 13, 2024

refine fp32 precision api #125888

Open

allow to use bf16 as fp32 internal precision for mkldnn rnn #126051

Closed

allow to use bf16 as fp32 internal precision for mkldnn conv

a868d68

[ghstack-poisoned]

zhuhaozhe marked this pull request as draft May 13, 2024 06:57

pytorchbot added the open source label May 13, 2024

zhuhaozhe mentioned this pull request May 13, 2024

allow to use bf16 as fp32 internal precision for mkldnn conv backward #126054

Draft

zhuhaozhe added the ciflow/trunk Trigger trunk jobs on your pull request label May 14, 2024

zhuhaozhe requested a review from jgong5 May 14, 2024 13:28

jgong5 requested changes May 15, 2024


zhuhaozhe added 17 commits May 16, 2024 14:34

Update

6de04a4

[ghstack-poisoned]

Update

8b74635

[ghstack-poisoned]

Update

c515213

[ghstack-poisoned]

Update

b535ef3

[ghstack-poisoned]

Update

4773d80

[ghstack-poisoned]

Update

3b1baee

[ghstack-poisoned]

zhuhaozhe and others added 9 commits October 31, 2024 22:00

Update

b620826

[ghstack-poisoned]

Update

82d5f37

[ghstack-poisoned]

Update

d03333a

[ghstack-poisoned]

Update

84c00a7

[ghstack-poisoned]

Update

9ad308e

[ghstack-poisoned]

Update

0f34d50

[ghstack-poisoned]

Update

24f8e54

[ghstack-poisoned]

Update

33fdd85

[ghstack-poisoned]

pytorch-bot bot temporarily deployed to upload-benchmark-results January 20, 2025 05:18 Inactive

Update

48cdfeb

[ghstack-poisoned]

pytorch-bot bot temporarily deployed to upload-benchmark-results February 6, 2025 08:33 Inactive

yanbing-j added 3 commits February 8, 2025 02:13

Update

2f89df0

[ghstack-poisoned]

Update

e03ada2

[ghstack-poisoned]

Update

77d200b

[ghstack-poisoned]

yanbing-j added the topic: not user facing topic category label Mar 7, 2025

yanbing-j added 3 commits March 10, 2025 06:49

Update

78d0f1c

[ghstack-poisoned]

Update

d8ffb26

[ghstack-poisoned]

Update

c5ad56c

[ghstack-poisoned]

Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull request Apr 22, 2025

allow to use bf16 as fp32 internal precision for mkldnn conv

18c6d23

ghstack-source-id: b02b0b1 Pull Request resolved: pytorch/pytorch#126050

yanbing-j added 6 commits April 30, 2025 03:00

Update

29e6b76

[ghstack-poisoned]

Update

87a658d

[ghstack-poisoned]

Update

11c3797

[ghstack-poisoned]

Update

0b68004

[ghstack-poisoned]

Update

00baa75

[ghstack-poisoned]

Update

4fb47b9

[ghstack-poisoned]
