cublaslt autotuning support for TunableOp by bilal2vec · Pull Request #133896 · pytorch/pytorch · GitHub

cublaslt autotuning support for TunableOp #133896


Closed
bilal2vec wants to merge 21 commits from bilal/cublas_tuning

Conversation

@bilal2vec (Contributor) commented Aug 19, 2024

Adds support for cublaslt autotuning to TunableOp.

Todo:

  • Add and test ScaledGemmTunableOp
  • Benchmarking numbers
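
For readers unfamiliar with TunableOp, a minimal sketch of how it is usually switched on for a run may help. It assumes the `PYTORCH_TUNABLEOP_ENABLED`, `PYTORCH_TUNABLEOP_TUNING` and `PYTORCH_TUNABLEOP_FILENAME` environment variables described in the TunableOp README; the helper name `tunableop_env` and the results filename are hypothetical, and whether the cuBLASLt path from this PR is exercised depends on the PyTorch build in use.

```python
import os


def tunableop_env(results_file="tunableop_results.csv", tuning=True):
    """Build the environment variables that turn TunableOp on.

    The helper name and the default filename are illustrative only.
    """
    return {
        # Use TunableOp for supported GEMM calls.
        "PYTORCH_TUNABLEOP_ENABLED": "1",
        # Allow new tuning runs; "0" would only reuse previously saved results.
        "PYTORCH_TUNABLEOP_TUNING": "1" if tuning else "0",
        # File where the selected solutions are written and read back.
        "PYTORCH_TUNABLEOP_FILENAME": results_file,
    }


# Merge into the current environment before launching the workload,
# e.g. subprocess.run([sys.executable, "train.py"], env=env).
env = {**os.environ, **tunableop_env()}
```

Setting the variables in a child-process environment keeps the tuning configuration out of the training script itself.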

pytorch-bot bot commented Aug 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133896

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 2 Unrelated Failures

As of commit c0fe674 with merge base cf31724:

NEW FAILURES - The following jobs have failed:

  • Lint / lintrunner-clang / linux-job (gh)
    RuntimeError: Command docker exec -t 750487957d04daa75b92c3f90a41c1930ba6cc2ca276cb6d432b94fcdd484f99 /exec failed with exit code 1
  • Lint / lintrunner-noclang / linux-job (gh)
    RuntimeError: Command docker exec -t 93191f30e809045ef0fbc69534e6cb4cd34a2f2d0637357a77de38150196520f /exec failed with exit code 1
  • linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-test / test (gh)
    RuntimeError: recursive_directory_iterator in used pre-CXX11 binaries, see; ['std::filesystem::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::recursive_directory_iterator::depth() const', 'std::filesystem::recursive_directory_iterator::options() const', 'std::filesystem::recursive_directory_iterator::operator*() const', 'std::filesystem::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::recursive_directory_iterator::pop()', 'std::filesystem::recursive_directory_iterator::pop() [clone .cold]', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&) [clone .cold]', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator&&)', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator const&)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*) [clone .cold]', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::operator++()', 'std::filesystem::recursive_directory_iterator::operator++() [clone .cold]']
  • linux-binary-manywheel / manywheel-py3_9-cuda11_8-test / test (gh)
    RuntimeError: recursive_directory_iterator in used pre-CXX11 binaries, see; ['std::filesystem::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::recursive_directory_iterator::depth() const', 'std::filesystem::recursive_directory_iterator::options() const', 'std::filesystem::recursive_directory_iterator::operator*() const', 'std::filesystem::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::recursive_directory_iterator::pop()', 'std::filesystem::recursive_directory_iterator::pop() [clone .cold]', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&) [clone .cold]', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator&&)', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator const&)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*) [clone .cold]', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::operator++()', 'std::filesystem::recursive_directory_iterator::operator++() [clone .cold]']
  • linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test (gh)
    RuntimeError: recursive_directory_iterator in used pre-CXX11 binaries, see; ['std::filesystem::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::recursive_directory_iterator::depth() const', 'std::filesystem::recursive_directory_iterator::options() const', 'std::filesystem::recursive_directory_iterator::operator*() const', 'std::filesystem::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::recursive_directory_iterator::pop()', 'std::filesystem::recursive_directory_iterator::pop() [clone .cold]', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&) [clone .cold]', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator&&)', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator const&)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*) [clone .cold]', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::operator++()', 'std::filesystem::recursive_directory_iterator::operator++() [clone .cold]']
  • linux-binary-manywheel / manywheel-py3_9-cuda12_4-test / test (gh)
    RuntimeError: recursive_directory_iterator in used pre-CXX11 binaries, see; ['std::filesystem::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::recursive_directory_iterator::depth() const', 'std::filesystem::recursive_directory_iterator::options() const', 'std::filesystem::recursive_directory_iterator::operator*() const', 'std::filesystem::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::recursive_directory_iterator::pop()', 'std::filesystem::recursive_directory_iterator::pop() [clone .cold]', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&) [clone .cold]', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator&&)', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator const&)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*) [clone .cold]', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::operator++()', 'std::filesystem::recursive_directory_iterator::operator++() [clone .cold]']

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@bilal2vec force-pushed the bilal/cublas_tuning branch from ff68489 to 90665ef on August 22, 2024 16:29
@bilal2vec marked this pull request as ready for review on August 22, 2024 16:29
@bilal2vec force-pushed the bilal/cublas_tuning branch from 7c7e894 to 5690e16 on August 23, 2024 15:18
@bilal2vec
Contributor Author

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased bilal/cublas_tuning onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout bilal/cublas_tuning && git pull --rebase)

@bilal2vec
Contributor Author

Bfloat16 torch.bmm speedups with cublaslt tuning on an H100 PCIe power-limited to 300 W (script)

problem shape | time (us) | teraflops | power usage (W) | speedup vs pytorch default
2x8192x6144x4096_fw_x_b | 1408.91 (+-1.71) | 585.30 | 288.10 | 1.0028260060243004
2x8192x6144x4096_bw_dx_b | 1376.61 (+-3.42) | 599.03 | 288.10 | 1.003983442733681
2x8192x6144x4096_bw_dx_f | 1309.18 (+-3.58) | 629.89 | 288.10 | 1.037242082642822
2x8192x6144x4096_bw_dw_f | 1323.06 (+-0.90) | 623.28 | 288.10 | 1.0395725587010973
2x8192x4096x4096_fw_x_b | 956.44 (+-5.36) | 574.79 | 287.09 | 0.9729783627480307
2x8192x4096x4096_bw_dx_b | 956.46 (+-4.43) | 574.78 | 287.09 | 0.9634818156891094
2x8192x4096x4096_bw_dx_f | 932.57 (+-1.51) | 589.50 | 287.09 | 0.9675060718394479
2x8192x4096x4096_bw_dw_f | 897.84 (+-6.47) | 612.31 | 288.19 | 0.9988581443517976
2x8192x14336x4096_fw_x_b | 3244.85 (+-23.17) | 592.98 | 288.71 | 0.9221596044056165
2x8192x14336x4096_bw_dx_b | 2953.93 (+-41.98) | 651.38 | 288.71 | 1.0332857131326827
2x8192x14336x4096_bw_dx_f | 2998.66 (+-28.42) | 641.67 | 287.44 | 1.0149157609343817
2x8192x14336x4096_bw_dw_f | 3076.65 (+-47.26) | 625.40 | 287.44 | 1.014018474318837
2x8192x4096x14336_fw_x_b | 3105.81 (+-4.42) | 619.53 | 287.53 | 0.9950332778063514
2x8192x4096x14336_bw_dx_b | 3235.22 (+-19.01) | 594.75 | 287.18 | 0.9913850915656437
2x8192x4096x14336_bw_dx_f | 3149.58 (+-36.03) | 610.92 | 287.60 | 0.991196263521646
2x8192x4096x14336_bw_dw_f | 3004.65 (+-68.09) | 640.39 | 287.60 | 1.0204484860026661
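
As a sanity check on the table, the teraflops column appears to follow from the problem shape and the measured time. The sketch below assumes each shape name encodes batch x M x N x K and that a batched matmul costs 2·batch·M·N·K floating-point operations; the function name `bmm_tflops` is illustrative and is not taken from the benchmark script. Under those assumptions it reproduces the reported 585.30 for the first row.

```python
def bmm_tflops(batch, m, n, k, time_us):
    """Achieved TFLOP/s for a batched matmul, counting 2*batch*m*n*k FLOPs."""
    flops = 2 * batch * m * n * k
    seconds = time_us * 1e-6
    return flops / seconds / 1e12


# A few (shape name, time in microseconds) pairs copied from the table above.
rows = {
    "2x8192x6144x4096_fw_x_b": 1408.91,
    "2x8192x4096x4096_fw_x_b": 956.44,
    "2x8192x14336x4096_fw_x_b": 3244.85,
}

for name, time_us in rows.items():
    # The leading "AxBxCxD" part of the name is read as batch, M, N, K.
    batch, m, n, k = (int(d) for d in name.split("_")[0].split("x"))
    print(f"{name}: {bmm_tflops(batch, m, n, k, time_us):.2f} TFLOP/s")
```

Because the FLOP count is a product of the four dimensions, the result does not depend on which of the last three is taken to be M, N or K.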

@bilal2vec
Contributor Author

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #133896, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@albanD added the triaged label Aug 27, 2024
@bilal2vec
Contributor Author

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased bilal/cublas_tuning onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout bilal/cublas_tuning && git pull --rebase)

@bilal2vec
Contributor Author

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #133896, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

izaitsevfb added a commit to pytorch/test-infra that referenced this pull request Aug 29, 2024
Currently Dr.CI marks this signal as flaky
([example](pytorch/pytorch#133896 (comment))).
@bilal2vec
Contributor Author

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@bilal2vec
Contributor Author

ah merge needs an approval from a maintainer :/ @eqy

@jeffdaily
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot added a commit that referenced this pull request Oct 14, 2024
…137328)"

This reverts commit 25ac565.

Reverted #137328 on behalf of https://github.com/clee2000 due to need to revert this in order to revert #133896, please rebase and reland, sorry for the churn ([comment](#137328 (comment)))
@clee2000
Contributor

@pytorchbot revert -m "this is breaking internal builds, I've copied what I think is the most relevant part of the log below. I believe the job running internally uses an old version of cuda, could you put guards to make sure compilation still works on an older version of cuda/cublaslt?" -c ghfirst

In file included from fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:10:
In file included from fbcode/caffe2/aten/src/ATen/cuda/tunable/TunableGemm.h:17:
fbcode/caffe2/aten/src/ATen/cuda/tunable/GemmCublasLt.h:394:42: error: use of undeclared identifier 'CUBLASLT_MATMUL_DESC_A_SCALE_POINTER'; did you mean 'CUBLASLT_MATMUL_DESC_BIAS_POINTER'?
  394 |                 computeDesc.setAttribute(CUBLASLT_MATMUL_DESC_A_SCALE_POINTER, mat1_scale_ptr);
      |                                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                          CUBLASLT_MATMUL_DESC_BIAS_POINTER
fbcode/third-party-buck/platform010/build/cuda/11.4.2/include/cublasLt.h:593:3: note: 'CUBLASLT_MATMUL_DESC_BIAS_POINTER' declared here
  593 |   CUBLASLT_MATMUL_DESC_BIAS_POINTER = 8,
      |   ^

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 14, 2024
This reverts commit 19bbbef.

Reverted #133896 on behalf of https://github.com/clee2000 due to this is breaking internal builds, I've copied what I think is the most relevant part of the log below. I believe the job running internally uses an old version of cuda, could you put guards to make sure compilation still works on an older version of cuda/cublaslt? ([comment](#133896 (comment)))
@pytorchmergebot
Collaborator

@bilal2vec your PR has been successfully reverted.

@pytorch-bot pytorch-bot bot dismissed stale reviews from eqy and jeffdaily October 14, 2024 20:28

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

Labels
  • ciflow/rocm (Trigger "default" config CI on ROCm)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • open source
  • Reverted
  • Stale
  • topic: not user facing (topic category)
  • triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
7 participants