[MPS] cholesky implementation #145701
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145701
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 65e0374 with merge base ed01514. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
id<MTLComputeCommandEncoder> computeEncoder = stream->commandEncoder();
[computeEncoder setBuffer:outBuffer offset:0 atIndex:0];
[computeEncoder setBytes:&N length:sizeof(int64_t) atIndex:2];
[computeEncoder setBytes:&NB length:sizeof(int64_t) atIndex:3];
Aren't all operations with the stream supposed to be serialized over its queue?
Hmm, I'm not really sure. I haven't encountered any issues with this approach. I can put these inside the dispatch to the stream's queue in a follow-up so we can be sure that no errors occur (~tomorrow, along with performance improvements of these kernels).
LGTM (with minor nits, if it passes CI)
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Requested in pytorch#77764
Closed pytorch#144193 due to a lot of conflicts when rebasing
Pull Request resolved: pytorch#145701
Approved by: https://github.com/malfet
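For context, a minimal usage sketch of what this PR enables on the MPS backend (the device, sizes, and tolerance below are illustrative, not taken from the PR):

```
import torch

# Build a small symmetric positive-definite matrix and move it to the MPS device.
A = torch.randn(4, 4, dtype=torch.float32)
A = A @ A.T + 4 * torch.eye(4)
A_mps = A.to("mps")

# Lower-triangular Cholesky factor computed on MPS.
L = torch.linalg.cholesky(A_mps, upper=False)

# Compare against the CPU reference implementation.
L_cpu = torch.linalg.cholesky(A, upper=False)
print(torch.allclose(L.cpu(), L_cpu, atol=1e-5))
```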
Followup to #145701. Optimizes the SYRK and TRSM kernels of the Cholesky decomposition on MPS. For the SYRK kernel it does matmuls with Apple's simdgroup matrices instead of a tiled implementation, and for the TRSM kernel we do vectorized loads. This PR also puts the command encoder inside the stream queue dispatch (as discussed on the last PR).

Script to collect perf:

```
import torch
import numpy as np
import time
import csv

matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3

def create_spd_matrix(n, batch_size):
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)

def run_cholesky_mps(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start

results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")
        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")

            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")
        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue

with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```

Observed speedups on M1 Pro (speedup chart attached to the PR).

Pull Request resolved: #145722
Approved by: https://github.com/malfet
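As a complement to the timing script above, a quick correctness check (not part of the PR; the size below is chosen just for illustration) that the factor computed on MPS actually reconstructs the input:

```
import torch

# Same SPD construction as the benchmark script above, at a single size.
n, batch_size = 512, 2
torch.manual_seed(42)
A = torch.randn(batch_size, n, n, dtype=torch.float32)
A = A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)

L = torch.linalg.cholesky(A.to("mps"), upper=False)

# L @ L^T should reproduce A up to float32 rounding error.
err = (L @ L.transpose(-2, -1)).cpu().sub(A).abs().max()
print(f"max |L @ L^T - A| = {err.item():.3e}")
```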
PR #145701 didn't have an experimental version of cholesky. This PR adds that version.
Pull Request resolved: #146799
Approved by: https://github.com/malfet
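Assuming the "experimental version" here refers to the torch.linalg.cholesky_ex variant (which returns an info tensor instead of raising when the factorization fails), a minimal sketch of how it would be used on MPS; the input and device placement are illustrative, not taken from the PR:

```
import torch

# SPD input moved to MPS; cholesky_ex returns the factor plus an `info` tensor.
A = torch.randn(4, 4, dtype=torch.float32)
A = A @ A.T + 4 * torch.eye(4)

L, info = torch.linalg.cholesky_ex(A.to("mps"), upper=False, check_errors=False)
print(info.item())  # 0 on success; a positive value reports the failing leading minor
print(L.cpu())      # lower-triangular Cholesky factor
```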