Enable qint8 and quint8 add for AArch64 using ACL directly by davsva01 · Pull Request #146620 · pytorch/pytorch · GitHub

Enable qint8 and quint8 add for AArch64 using ACL directly #146620


Closed
davsva01 wants to merge 4 commits from the acl_qadd branch

Conversation

davsva01 (Contributor) commented Feb 6, 2025

This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly.
It’s based on changes in PR #145942 which enables the use of ACL directly in ATen.
The relative performance improvement is ~15x with OMP_NUM_THREADS=1 and ~5.4x with OMP_NUM_THREADS=32.

Script to benchmark quantised add performance:

import torch
import torch.profiler as profiler

# Quantize two random fp32 tensors to qint8 with different scales/zero points.
a_f32 = torch.rand((400, 3456), dtype=torch.float)
b_f32 = torch.rand((400, 3456), dtype=torch.float)
a_q = torch.quantize_per_tensor(a_f32, 1.2, 0, torch.qint8)
b_q = torch.quantize_per_tensor(b_f32, 1.7, 5, torch.qint8)

# Profile 1000 iterations of the quantized add (output scale 1.3, zero point 2).
with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _ = torch.ops.quantized.add(a_q, b_q, 1.3, 2)
print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=50))
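
As a quick sanity check alongside the benchmark (not part of the PR), one can compare the quantized add against a float reference obtained by dequantizing, adding, and requantizing with the same output scale and zero point; the one-quantization-step tolerance is an assumption about rounding differences:

import torch

a_q = torch.quantize_per_tensor(torch.rand(400, 3456), 1.2, 0, torch.qint8)
b_q = torch.quantize_per_tensor(torch.rand(400, 3456), 1.7, 5, torch.qint8)

# Quantized add with output scale 1.3 and zero point 2 (same as the benchmark above).
out_q = torch.ops.quantized.add(a_q, b_q, 1.3, 2)

# Float reference: add the dequantized inputs, then requantize with the same parameters.
ref_q = torch.quantize_per_tensor(a_q.dequantize() + b_q.dequantize(), 1.3, 2, torch.qint8)

# The two paths should agree to within one quantization step of the output scale.
max_abs_err = (out_q.dequantize() - ref_q.dequantize()).abs().max().item()
print(f"max abs error vs float reference: {max_abs_err:.4f}")
assert max_abs_err <= 1.3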

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot pytorch-bot bot added the module: cpu, release notes: quantization, and release notes: releng labels Feb 6, 2025
linux-foundation-easycla bot commented Feb 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

pytorch-bot bot commented Feb 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146620

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a0d2046 with merge base 6c3492b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

#pragma once

#include <ATen/Config.h>
#if defined(__aarch64__) && AT_MKLDNN_ACL_ENABLED()
A reviewer (Contributor) commented on the hunk above:

Why restrict it to aarch64?

davsva01 (Contributor, Author) replied:

Thanks for the quick review!
Yes this is redundant, I have rebased on top of latest changes in #145942 addressing this.

malfet (Contributor) left a review comment:

Same generic feedback as for the previous PR: why restrict it to aarch64? Shouldn't IS_ACL_ENABLE be sufficient?

@mikaylagawarecki mikaylagawarecki added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 7, 2025
@davsva01 davsva01 force-pushed the acl_qadd branch 2 times, most recently from 4a1385a to bb887be on February 10, 2025 16:40
fadara01 added a commit to fadara01/pytorch that referenced this pull request Feb 19, 2025
Among many things, this version of ACL fixes the redundant declaration warning that we're blocked on in (pytorch#145942, pytorch#146620, pytorch#147337) and introduces better scheduling heuristics for GEMMs
pytorchmergebot pushed a commit that referenced this pull request Feb 19, 2025
Among many things, this version of ACL fixes the redundant declaration  warning that we're blocked on in (#145942, #146620, #147337) and introduces better scheduling heuristics for GEMMs

Fixes #ISSUE_NUMBER

Pull Request resolved: #147454
Approved by: https://github.com/malfet
fadara01 (Collaborator) commented:
@pytorchbot label "arm priority"

@fadara01 fadara01 requested a review from malfet February 20, 2025 10:35
Raymo111 pushed a commit that referenced this pull request Feb 20, 2025
pytorch-bot bot pushed a commit that referenced this pull request Feb 24, 2025
majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
All three carried the same ACL-upgrade commit as #147454 above.
fadara01 (Collaborator) commented Mar 4, 2025

@malfet, @digantdesai - Could you please give this another look?

ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary - e.g. quantization. See pytorch#145942, pytorch#147337, pytorch#146620.
This patch enables such use cases by exposing ACL to ATen.
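
As an unofficial aside (not part of the PR), one can check the active quantized engine and scan the build-configuration string for the oneDNN/ACL options; the exact flag names that appear are build-dependent, so treat this as a heuristic:

import torch

print(torch.backends.quantized.engine)             # active quantized engine, e.g. "onednn"
print(torch.backends.quantized.supported_engines)  # engines compiled into this build
build_info = torch.__config__.show()               # full build-configuration string
print([line for line in build_info.splitlines() if "MKLDNN" in line or "ACL" in line])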
fadara01 and others added 3 commits March 5, 2025 11:27
This enables a fast path for eager-mode dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PR pytorch#126687 enabled an optimized implementation of qlinear_dynamic for AArch64 through ideep → oneDNN → ACL, which improved performance by ~10x compared to the previous implementation.
However, the current qlinear_dynamic path (ideep → oneDNN → ACL) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (lowp_gemm) API - for example, ACL's lowp_gemm objects cache information such as the weights reduction or the weights in an optimized memory format, which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the GEMM kernel's optimal format) for each GEMM operation.
This PR addresses these sub-optimalities by integrating ACL directly with qlinear_dynamic. This approach yields an average speedup (averaged over context lengths of 2^3 up to 2^9) of ~50% for bert-base-uncased, bert-large-uncased, roberta-base, and distilbert-base-uncased with 16 threads on a Neoverse-V1 (with transformers==4.48); a sketch of this kind of workload is shown below.
To achieve this, we introduce PackedLinearWeightsACL (as a subclass of PackedLinearWeightsOnednn) with an implementation of qlinear_dynamic that uses ACL directly, while qlinear still follows the oneDNN path.
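
A minimal sketch of such an eager-mode dynamic quantization workload (a toy stand-in, not the PR's actual benchmark; the model, sizes, and dtype choices here are assumptions):

import torch
import torch.nn as nn

# Toy stand-in for a transformer feed-forward block (sizes are illustrative only).
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Dynamic quantization swaps nn.Linear for dynamically quantized Linear modules,
# whose forward dispatches to the quantized dynamic linear op discussed above.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.rand(8, 128, 768)  # (batch, sequence length, hidden size)
with torch.inference_mode():
    out = qmodel(x)
print(out.shape)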
This enables a fast path for eager-mode static quantization for AArch64 through Arm Compute Library (ACL) directly.

PR pytorch#145942 addressed the high overhead in qlinear_dynamic on AArch64 (due to redundant weight pre-transpositions and reductions) by enabling a path that calls ACL directly.
This does the same thing, but for (static) qlinear; a sketch of the corresponding workload follows.
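
A corresponding minimal eager-mode static quantization sketch (assumed toy module; the qconfig selection is an assumption and may need to match the machine's quantized engine), showing the path that ends in the (static) qlinear op:

import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # fp32 -> quantized at the input
        self.fc = nn.Linear(768, 768)
        self.dequant = torch.ao.quantization.DeQuantStub()  # quantized -> fp32 at the output

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().eval()
# Use the qconfig matching the active quantized engine (e.g. "onednn", "qnnpack", "x86").
model.qconfig = torch.ao.quantization.get_default_qconfig(torch.backends.quantized.engine)
prepared = torch.ao.quantization.prepare(model)
prepared(torch.rand(32, 768))                    # calibration pass to collect activation ranges
quantized = torch.ao.quantization.convert(prepared)
print(quantized(torch.rand(32, 768)).shape)      # the Linear now runs as a static qlinear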
This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly.
It's based on changes in PR pytorch#145942 which enables the use of ACL directly in ATen.
Relative performance improvement using OMP_NUM_THREADS=1 is ~15x, using OMP_NUM_THREADS=32 it’s ~5.4x.
fadara01 added a commit that referenced this pull request Mar 5, 2025
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However there are cases where it makes sense to utilize ACL directly without  oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen

ghstack-source-id: 266c621
Pull Request resolved: #148581
pytorchmergebot pushed a commit that referenced this pull request Mar 10, 2025
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However there are cases where it makes sense to utilize ACL directly without  oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen

Pull Request resolved: #148584
Approved by: https://github.com/malfet
fadara01 (Collaborator) commented:
Closing in favor of ghstack PR: #148653 which has all comments addressed

Labels
arm priority · module: cpu (CPU specific problem, e.g. perf, algorithm) · open source · release notes: quantization · release notes: releng · triaged
5 participants