Speed up half tensors printing #141927

malfet · 2024-12-03T01:00:03Z

This PR removes copycast of reduced precision types to float before printing, that was added in #14418 to probably unblock printing when many operations, like is_nan and max were not supported on CPUs

(Reusing old test plan) Before the PR:

In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

after the PR

In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Also, this allows one printing 15Gb Metal tensors on 32GB Mac machine:

% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"  
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16)

Before this change it failed with non-descriptive

% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))
                 ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
RuntimeError: Invalid buffer size: 19.45 GB

Convert fp8 dtypes to float16, as float range is an overkill

This removes cast of reduced precision types to float before testing, which were added in #14418 (Reusing old test plan) Before the PR: ```python In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16) In [2]: %timeit str(a) 621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) ``` after the PR ```python In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16) In [2]: %timeit str(a) 449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) ``` Also, this allows one printing 15Gb Metal tensors on 32GB Mac machine: ``` % python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))" tensor([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16) ``` Before this change it failed with non-descriptive ``` % python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))" Traceback (most recent call last): File "<string>", line 1, in <module> import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16)) ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__ return torch._tensor_str._str(self, tensor_contents=tensor_contents) ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str return _str_intern(self, tensor_contents=tensor_contents) File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern tensor_str = _tensor_str(self, indent) File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str self = self.float() RuntimeError: Invalid buffer size: 19.45 GB ```

pytorch-bot · 2024-12-03T01:00:07Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141927

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7e2b8e7 with merge base 4959784 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vadimkantorov · 2024-12-03T01:17:19Z

Also interesting about float8's - I wonder if they can be converted instead to fp16 or bf16 (especially if these kernels exist) without losing precision and make use of the improvements in this PR wrt printing. This can also save some memory as fp8 -> fp32 conversion probably is hungry...

malfet · 2024-12-03T01:31:37Z

Also interesting about float8's - I wonder if they can be converted instead to fp16 or bf16 (especially if these kernels exist) without losing precision and make use of the improvements in this PR wrt printing. This can also save some memory as fp8 -> fp32 conversion probably is hungry...

Sure, I can convert float8 to float16, but also think it might worth an effort expanding a few more generic select ops to float8, see #141928

torch/_tensor_str.py

malfet · 2024-12-03T02:13:53Z

@pytorchbot merge

pytorchmergebot · 2024-12-03T02:17:03Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

malfet · 2024-12-03T06:59:44Z

@pytorchbot merge -f "Don't want to wait for last ROCm test"

pytorchmergebot · 2024-12-03T07:00:02Z

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

pytorchmergebot · 2024-12-03T07:01:37Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

This PR removes copycast of reduced precision types to float before printing, that was added in pytorch#14418 to probably unblock printing when many operations, like `is_nan` and `max` were not supported on CPUs (Reusing old test plan) Before the PR: ```python In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16) In [2]: %timeit str(a) 621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) ``` after the PR ```python In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16) In [2]: %timeit str(a) 449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) ``` Also, this allows one printing 15Gb Metal tensors on 32GB Mac machine: ``` % python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))" tensor([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16) ``` Before this change it failed with non-descriptive ``` % python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))" Traceback (most recent call last): File "<string>", line 1, in <module> import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16)) ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__ return torch._tensor_str._str(self, tensor_contents=tensor_contents) ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str return _str_intern(self, tensor_contents=tensor_contents) File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern tensor_str = _tensor_str(self, indent) File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str self = self.float() RuntimeError: Invalid buffer size: 19.45 GB ``` Convert fp8 dtypes to float16, as float range is an overkill Pull Request resolved: pytorch#141927 Approved by: https://github.com/ezyang

malfet added release notes: python_frontend python frontend release notes category topic: improvements topic category labels Dec 3, 2024

malfet requested review from ezyang and albanD December 3, 2024 01:00

Update _tensor_str.py

036f637

ezyang approved these changes Dec 3, 2024

View reviewed changes

malfet commented Dec 3, 2024

View reviewed changes

torch/_tensor_str.py Outdated Show resolved Hide resolved

Update torch/_tensor_str.py

7e2b8e7

malfet mentioned this pull request Dec 3, 2024

Output size of the matrix multiplication is larger than currently supported by the MPS backend: 72250,72250, needs to be less than 2**32 elements #141909

Closed

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 3, 2024

pytorchmergebot added the merging label Dec 3, 2024

albanD removed their request for review December 3, 2024 03:14

pytorchmergebot added the Merged label Dec 3, 2024

pytorchmergebot closed this in e499b46 Dec 3, 2024

pytorchmergebot removed the merging label Dec 3, 2024

malfet deleted the malfet-patch-4 branch December 12, 2024 22:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Speed up half tensors printing #141927

Speed up half tensors printing #141927

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Speed up half tensors printing #141927

Speed up half tensors printing #141927

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141927

✅ No Failures

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Merge started

Uh oh!

Uh oh!

Uh oh!

Merge started

Uh oh!

Uh oh!