Enable fp16 linear layers in PyTorch via ACL by renato-arantes · Pull Request #144992 · pytorch/pytorch · GitHub

Enable fp16 linear layers in PyTorch via ACL #144992


Open · wants to merge 2 commits into main

Conversation

@renato-arantes (Contributor) commented Jan 16, 2025

This pull request enables linear layers with the fp16 data type through the Arm Compute Library (ACL).

On a Graviton3 instance running with 16 threads, a linear layer applied to an input of torch.randn(2048, 4096, dtype=torch.half) takes 50+% less time to complete than the same layer with dtype=torch.float32.
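For context, a minimal sketch of the kind of workload this targets (the 2048-wide output below is an illustrative assumption, not part of the claim above; the acceleration applies on aarch64 builds of PyTorch that link oneDNN with ACL):

import torch
import torch.nn as nn

torch.set_grad_enabled(False)

# fp16 weights and activations; with this change, aarch64 builds that link ACL
# dispatch the underlying matmul to ACL fp16 kernels instead of the oneDNN
# reference implementation.
layer = nn.Linear(4096, 2048, bias=False).to(dtype=torch.half)
x = torch.randn(2048, 4096, dtype=torch.half)

y = layer(x)
print(y.dtype, y.shape)  # torch.float16, torch.Size([2048, 2048])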

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @yf225 @ColinPeppler @desertfire

Signed-off-by: Renato Arantes <renato.arantes@arm.com>
pytorch-bot added the module: cpu (CPU specific problem, e.g., perf, algorithm) and release notes: linalg_frontend (release notes category) labels Jan 16, 2025
pytorch-bot bot commented Jan 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144992

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Cancelled Jobs, 25 Unrelated Failures

As of commit fa04172 with merge base cf28d61:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@annop-w (Contributor) commented Jan 16, 2025

@pytorchbot label "module: arm"

pytorch-bot added the module: arm label (Related to ARM architecture builds of PyTorch. Includes Apple M1) Jan 16, 2025
malfet added the ciflow/trunk label (Trigger trunk jobs on your pull request) Jan 16, 2025
@malfet (Contributor) commented Jan 16, 2025

@renato-arantes can you add some explanation as to why you are doing this? (I suspect performance; if so, I would love to see some sort of script that one can run to measure the perf improvement before and after.)

@renato-arantes (Contributor, Author) commented Jan 17, 2025

Hi @malfet

Yes, this PR is about improving performance: it enables the path from PyTorch to ACL and therefore avoids running the fp16 reference implementation in oneDNN. On average, on an AWS c7g instance running with 16 threads, we got 8191.85 μs for fp16 and 9629.15 μs for fp32, an improvement of 1437.30 μs (fp16 is ~17% faster than fp32). Here is the script used for this benchmark:

import torch
import torch.nn as nn
import time
# Enable torch.no_grad globally
torch.set_grad_enabled(False)

# Define models as nn.Modules
class LinearModel(nn.Module):
    def __init__(self, input_size, output_size, bias):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size, bias=bias)

    def forward(self, x):
        return self.linear(x)

N = 2048
K = 4096
M = 512
bias = False
dtype = torch.float16 # <<<--- change here for fp32

model = LinearModel(K, N, bias=bias).to(dtype=dtype)

# Generate random inputs
input = torch.randn(M, K, dtype=dtype)

# Number of iterations for benchmarking
num_iterations = 1000

# Warm-up function
def warmup(model, input_tensor):
    for _ in range(100):  # Warm-up phase to stabilize performance
        _ = model(input_tensor)

# Benchmark function
def benchmark(model, input_tensor):
    start_time = time.time()

    for _ in range(num_iterations):
        _ = model(input_tensor)

    end_time = time.time()
    return (end_time - start_time) / num_iterations

# Warm-up
print("Warming up...")
warmup(model, input)

# Benchmark without profiler
print("Benchmarking...")
average_time = benchmark(model, input)

print(f"Average execution time for Linear Layer: {average_time * 1e6:.2f} microseconds")

Signed-off-by: Renato Arantes <renato.arantes@arm.com>
@IvanYashchuk IvanYashchuk removed their request for review January 20, 2025 09:38
nikhil-arm previously approved these changes Jan 20, 2025
digantdesai previously approved these changes Jan 23, 2025
@digantdesai (Contributor) left a comment


Thanks, this looks good to me.

I would improve the summary a bit and add before/after performance numbers for fp16 (and probably also for bf16 if you unblocked that as well). Also link the test script in the summary :)

I left a couple of comments, let's make sure we address them before merging.

@@ -117,7 +117,7 @@ def cal_conv_generated_kernel_number(mod, input, dtype, dim=4):
 ):
     input_kernel = 1
 if output.is_contiguous(memory_format=torch.contiguous_format) or (
-    TEST_ACL and dtype == torch.bfloat16
+    TEST_ACL and (dtype == torch.bfloat16 or dtype == torch.half)
Contributor comment:

Curious, how does this work on non-ACL builds?

@@ -90,6 +90,10 @@ inline bool mkldnn_bf16_device_check_arm() {
   return cpuinfo_initialize() && cpuinfo_has_arm_bf16();
 }
 
+inline bool mkldnn_fp16_device_check_arm() {
+  return cpuinfo_initialize() && cpuinfo_has_arm_neon_fp16();
+}
Contributor comment:

I guess we don't care about aarch32; otherwise we would need to check for fp16arith.

@nikhil-arm (Collaborator):

@pytorchmergebot revert -c nosignal -m "Accuracy Test failures"

@pytorchmergebot (Collaborator):

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Feb 15, 2025
This reverts commit 5b37249.

Reverted #144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures.
@pytorchmergebot (Collaborator):

@renato-arantes your PR has been successfully reverted.

pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Feb 15, 2025
@pytorch-bot pytorch-bot bot dismissed stale reviews from nikhil-arm, digantdesai, and malfet February 15, 2025 12:41

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

@fadara01 (Collaborator) commented Feb 15, 2025

This PR only passed the CI tests because the Arm Compute Library (ACL) version in the jammy docker image used in the CI is outdated (v24.04), while the one used in manylinux is v24.09.

ACL v24.04 with multi_isa=1 and arch=armv8a does not enable FP16, while ACL v24.09 does. Hence the CI above only tested the oneDNN reference implementation as FP16 was not enabled in ACL. Had ACL v24.09 been used, the CI would have failed (as it does in #138889 where the ACL version in jammy is up to date) since the tolerance in the tests assumes FP32 accumulation, while ACL does FP16 accumulation.

Reverting this should fix the CI failures in #138889
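To make the failure mode concrete, an illustrative sketch (not the actual CI test): comparing a half-precision linear against the same half inputs upcast to fp32 isolates the accumulation error. With fp32 accumulation the gap is roughly one ulp of the output; with fp16 accumulation it grows with K and can exceed tolerances written assuming fp32 accumulation.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
M, K, N = 512, 4096, 2048  # large K makes accumulation error visible

xh = torch.randn(M, K, dtype=torch.half)
wh = torch.randn(N, K, dtype=torch.half)

# Reference: identical (already fp16-quantized) inputs, accumulated in fp32.
ref = F.linear(xh.float(), wh.float())

# Path under test: fp16 end to end; the size of the error below depends on
# whether the backend accumulates in fp32 or fp16.
out = F.linear(xh, wh).float()

print("max abs error:", (out - ref).abs().max().item())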

@malfet (Contributor) left a comment

> since the tolerance in the tests assumes FP32 accumulation, while ACL does FP16 accumulation.

PyTorch eager mode always does accumulation in fp32, even for fp16 input dtypes. There is a PR somewhere that introduces a context manager that allows lower-precision accumulation, but in general the default codepath should do reductions in higher-precision dtypes, as defined by opmath_t.
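As a small illustration of that point (a sketch, not the failing test): summing many fp16 copies of 0.01 stays close to the true total because the reduction accumulates in a wider type, whereas a running sum rounded back to fp16 after every add stalls once the partial sum is large enough that 0.01 falls below half an fp16 ulp.

import torch

x = torch.full((10_000,), 0.01, dtype=torch.float16)

# Built-in reduction: accumulation happens in a higher-precision type,
# so the result lands near 100.
print(x.sum())

# Simulated fp16 accumulation: the result is rounded to fp16 after each add,
# and the sum stops growing well short of 100.
acc = torch.tensor(0.0, dtype=torch.float16)
for v in x:
    acc = acc + v
print(acc)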

Raymo111 pushed a commit that referenced this pull request Feb 20, 2025
This reverts commit 5b37249.

Reverted #144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures.
pytorch-bot bot pushed a commit that referenced this pull request Feb 24, 2025
This reverts commit 5b37249.

Reverted #144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures.
mikaylagawarecki added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Feb 25, 2025
majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
Stale bot comment:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@fadara01 (Collaborator) commented:

@renato-arantes what is the current status of this? Do we plan to re-land it with newer versions of oneDNN/ACL that do accumulation in FP32?

Labels
ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: arm (Related to ARM architecture builds of PyTorch. Includes Apple M1), module: cpu (CPU specific problem, e.g., perf, algorithm), module: inductor, open source, release notes: linalg_frontend (release notes category), Reverted, Stale, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)