Enable fp16 linear layers in PyTorch via ACL #144992
Conversation
Signed-off-by: Renato Arantes <renato.arantes@arm.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144992
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 3 Cancelled Jobs, 25 Unrelated Failures as of commit fa04172 with merge base cf28d61.
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "module: arm"
@renato-arantes can you add some explanation as to why you are doing this? (I suspect performance; if so, I would love to see a script that one can run to measure the perf improvement before and after.)
Hi @malfet. Yes, this PR is about improving performance: it enables the path from PyTorch to ACL and therefore avoids running the bf16 reference implementation in oneDNN. On average, on an AWS c7g instance running with 16 threads, bf16 took 8191.85 μs versus 9629.15 μs for fp32, an improvement of 1437.30 μs, or about 17%. Here is the script used for this benchmark:
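(The original script is not preserved in this thread. A minimal sketch of what such a linear-layer benchmark might look like follows; the 2048x4096 input shape and 16 threads come from the discussion above, while the layer size and timing setup are assumptions.)

```python
# Hypothetical benchmark sketch -- not the author's script. It times a single
# nn.Linear forward pass per dtype using torch.utils.benchmark.
import torch
import torch.utils.benchmark as benchmark

def bench_linear(dtype, num_threads=16):
    x = torch.randn(2048, 4096, dtype=dtype)
    linear = torch.nn.Linear(4096, 4096, dtype=dtype)
    timer = benchmark.Timer(
        stmt="linear(x)",
        globals={"linear": linear, "x": x},
        num_threads=num_threads,
    )
    return timer.blocked_autorange()

for dtype in (torch.float32, torch.bfloat16, torch.half):
    print(dtype, bench_linear(dtype))
```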
Signed-off-by: Renato Arantes <renato.arantes@arm.com>
Thanks, this looks good to me.
I would improve the summary a bit and add before/after performance numbers for fp16 (and probably also for bf16 if you unblocked that as well). Also add the test script as a link in the summary :)
I left a couple of comments, let's make sure we address them before merging.
```diff
@@ -117,7 +117,7 @@ def cal_conv_generated_kernel_number(mod, input, dtype, dim=4):
     ):
         input_kernel = 1
     if output.is_contiguous(memory_format=torch.contiguous_format) or (
-        TEST_ACL and dtype == torch.bfloat16
+        TEST_ACL and (dtype == torch.bfloat16 or dtype == torch.half)
```
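For context, the helper branches on the memory format of the conv output. A small illustration of that check (illustrative only, not part of the test) is:

```python
import torch

# is_contiguous(memory_format=...) reports whether a tensor is laid out
# contiguously in the given format; a channels-last tensor fails the
# default (NCHW) contiguity check used by the test helper above.
x = torch.randn(1, 3, 8, 8)
print(x.is_contiguous(memory_format=torch.contiguous_format))     # True

x_cl = x.to(memory_format=torch.channels_last)
print(x_cl.is_contiguous(memory_format=torch.contiguous_format))  # False
print(x_cl.is_contiguous(memory_format=torch.channels_last))      # True
```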
Curious, how does this work on non-ACL?
```diff
@@ -90,6 +90,10 @@ inline bool mkldnn_bf16_device_check_arm() {
   return cpuinfo_initialize() && cpuinfo_has_arm_bf16();
 }
 
+inline bool mkldnn_fp16_device_check_arm() {
+  return cpuinfo_initialize() && cpuinfo_has_arm_neon_fp16();
+}
```
I guess we don't care about aarch32; otherwise we would need to check for fp16arith.
@pytorchmergebot revert -c nosignal -m "Accuracy Test failures"
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 5b37249. Reverted #144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures ([comment](#144992 (comment)))
@renato-arantes your PR has been successfully reverted.
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
This PR only passed the CI tests because the Arm Compute Library (ACL) version in the jammy docker image used in the CI is outdated (v24.04), while the one used in manylinux is v24.09. Reverting this should fix the CI failures in #138889.
> since the tolerance in the tests assumes FP32 accumulation, while ACL does FP16 accumulation.
PyTorch eager mode always does accumulation in fp32, even for fp16 input dtypes. There is a PR somewhere that introduces a context manager that allows for lower precision, but in general the default codepath should do reductions in higher-precision dtypes, as defined in op_math_t.
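As a rough illustration of the accumulation-precision point (a minimal sketch using F.linear on CPU; this is not the failing test itself):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2048, 4096)
w = torch.randn(1024, 4096)

ref = F.linear(x, w)                         # fp32 reference
out = F.linear(x.half(), w.half()).float()   # fp16 inputs/outputs

# With fp32 accumulation the remaining error is dominated by fp16 rounding of
# the inputs and outputs; accumulating in fp16 over a 4096-long reduction
# would lose noticeably more precision and can exceed tight test tolerances.
print((out - ref).abs().max())
```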
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
@renato-arantes what is the current status of this? Do we plan to re-land this with newer versions of oneDNN/ACL that do accumulation in FP32?
This pull request enables linear layers with the fp16 data type through the Arm Compute Library (ACL).
On a Graviton3 instance running with 16 threads, a linear layer applied to torch.randn(2048, 4096, dtype=torch.half) takes 50+% less time to complete than the same layer run with torch.randn(2048, 4096, dtype=torch.float32).
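An illustrative way to exercise the path this change targets (not from the PR; assumes an aarch64 build with oneDNN/ACL available):

```python
import torch

# fp16 linear layer in eager mode; on supported aarch64 CPUs this can be
# dispatched through oneDNN/ACL rather than a reference kernel.
print(torch.backends.mkldnn.is_available())  # oneDNN (mkldnn) backend present?

linear = torch.nn.Linear(4096, 4096).half()
x = torch.randn(2048, 4096, dtype=torch.half)
with torch.no_grad():
    y = linear(x)
print(y.dtype, y.shape)
```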
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @yf225 @ColinPeppler @desertfire