[MPS] cholesky implementation by Isalia20 · Pull Request #145701 · pytorch/pytorch


Closed
Isalia20 wants to merge 1 commit

Conversation

Isalia20 (Collaborator)

Requested in #77764.

Closed #144193 because it accumulated too many conflicts when rebasing.
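
For context, a minimal usage sketch of the op this PR implements (not part of the PR; assumes a PyTorch build with MPS available, and the tolerances are illustrative):

```
import torch

# Build a small symmetric positive-definite matrix and move it to MPS.
A = torch.randn(4, 4, dtype=torch.float32)
A = A @ A.T + 4 * torch.eye(4)
L = torch.linalg.cholesky(A.to("mps"))

# Verify the factorization against the CPU input: L @ L.T should reconstruct A.
torch.testing.assert_close(L.cpu() @ L.cpu().T, A, rtol=1e-4, atol=1e-4)
```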

pytorch-bot (bot) commented Jan 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145701

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 65e0374 with merge base ed01514:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot (bot) added the "release notes: mps" (Release notes category) label Jan 26, 2025
Isalia20 changed the title from "mps cholesky recommit" to "[MPS] cholesky implementation" Jan 26, 2025
Isalia20 mentioned this pull request Jan 26, 2025
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one that adds the new C++ functionality, and one that makes use of it from Python; land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.



malfet added the ciflow/trunk (trigger trunk jobs on your pull request), ciflow/mps (run MPS tests, a subset of trunk), and ciflow/inductor labels Jan 26, 2025
Comment on lines +822 to +825
id<MTLComputeCommandEncoder> computeEncoder = stream->commandEncoder();
[computeEncoder setBuffer:outBuffer offset:0 atIndex:0];
[computeEncoder setBytes:&N length:sizeof(int64_t) atIndex:2];
[computeEncoder setBytes:&NB length:sizeof(int64_t) atIndex:3];
Contributor

Shouldn't all operations on the stream be serialized over its queue?

Isalia20 (Collaborator, Author) commented Jan 26, 2025

Hmm, I'm not really sure; I haven't encountered any issues with this approach. I can move these inside the stream's dispatch queue in a follow-up so we can be sure no errors occur (probably tomorrow, along with performance improvements to these kernels).

malfet (Contributor) left a comment

LGTM (with minor nits, if it passes CI)

malfet added the "topic: improvements" (topic category) label Jan 26, 2025
malfet (Contributor) commented Jan 27, 2025

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status.

nWEIdia pushed a commit to nWEIdia/pytorch that referenced this pull request Jan 27, 2025
Requested in pytorch#77764

Closed pytorch#144193 because it accumulated too many conflicts when rebasing.
Pull Request resolved: pytorch#145701
Approved by: https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Jan 31, 2025
Followup to #145701

Optimizes the SYRK and TRSM kernels of the Cholesky decomposition on MPS. The SYRK kernel now performs its matmuls with Apple's simdgroup matrices instead of a tiled implementation, and the TRSM kernel uses vectorized loads. This PR also puts the command encoder inside the stream's queue dispatch (as discussed on the previous PR).

Script used to collect perf numbers:
```
import torch
import numpy as np
import time
import csv

matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3

def create_spd_matrix(n, batch_size):
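    # A @ A.T is positive semi-definite; adding n * I shifts it to strictly positive definite.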
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)

def run_cholesky_mps(A):
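    # Wall-clock one factorization, synchronizing before and after so queued GPU work is included.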
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start

results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")

        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")

            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue

with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
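
Not part of the PR: a short sketch for summarizing the timings the script above writes to cholesky_benchmark_times.csv:

```
import csv
from collections import defaultdict

# Group mean times by matrix size N across batch sizes.
times = defaultdict(list)
with open('cholesky_benchmark_times.csv') as f:
    for row in csv.DictReader(f):
        times[int(row['N'])].append((int(row['batch_size']), float(row['mean_time'])))

for n, entries in sorted(times.items()):
    print(f"N={n}: " + ", ".join(f"b={b}: {t:.4f}s" for b, t in sorted(entries)))
```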

Observed speedups on an M1 Pro:
![cholesky_speedup](https://github.com/user-attachments/assets/be3edb1a-8b4a-4039-9d7f-9b9a10f1c83a)
Pull Request resolved: #145722
Approved by: https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Feb 13, 2025
PR #145701 didn't include the experimental version of cholesky; this PR adds it.

Pull Request resolved: #146799
Approved by: https://github.com/malfet
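
Assuming the "experimental version" refers to torch.linalg.cholesky_ex, a minimal usage sketch (not part of the PR):

```
import torch

A = torch.randn(3, 3)
A = A @ A.T + 3 * torch.eye(3)

# With check_errors=False (the default), cholesky_ex returns an info tensor
# instead of raising on a failed factorization; info == 0 means success.
L, info = torch.linalg.cholesky_ex(A.to("mps"))
print(info.item())
```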
Labels
ciflow/inductor · ciflow/mps (run MPS tests, a subset of trunk) · ciflow/trunk (trigger trunk jobs on your pull request) · Merged · open source · release notes: mps (Release notes category) · topic: improvements (topic category)
Projects
None yet

4 participants