Speed up half tensors printing by malfet · Pull Request #141927 · pytorch/pytorch · GitHub
Speed up half tensors printing #141927

Closed
@malfet malfet commented Dec 3, 2024

This PR removes the copy-cast of reduced-precision types to float before printing. The cast was added in #14418, probably to unblock printing at a time when many operations, such as is_nan and max, were not supported on CPUs.

Also convert fp8 dtypes to float16 rather than float for printing, as the full float range is overkill for them.


(Reusing old test plan) Before the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

After the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
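
To put the two timings side by side, here is a small arithmetic sketch. It only uses the mean values quoted above (621 μs and 449 μs); the variable names are illustrative, and results on other machines will differ.

```python
# Mean timings quoted in the PR description (microseconds per str(a) call).
before_us = 621.0
after_us = 449.0

speedup = before_us / after_us                  # how many times faster printing became
reduction = (before_us - after_us) / before_us  # fraction of time saved per call

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {reduction:.1%}")
```

On these numbers the change is roughly a 1.38x speedup, or a bit under 28% less time per call.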

This also makes it possible to print a roughly 10 GB float16 Metal tensor on a 32 GB Mac:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"  
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16)
```

Before this change, it failed with a non-descriptive error:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))
                 ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
RuntimeError: Invalid buffer size: 19.45 GB
```
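
The size in the error message can be reproduced from the tensor shape. The sketch below assumes the message reports binary gigabytes (GiB); it shows that the failing allocation matches a float32 copy of the tensor, which is what `self.float()` in the traceback would create.

```python
# Shape used in the reproduction command above.
rows = cols = 72250
numel = rows * cols

GIB = 2**30  # assumption: the "GB" in the error message means GiB

fp16_bytes = numel * 2  # float16 stores 2 bytes per element
fp32_bytes = numel * 4  # the float32 copy needs 4 bytes per element

print(f"float16 tensor: {fp16_bytes / GIB:.2f} GiB")
print(f"float32 copy:   {fp32_bytes / GIB:.2f} GiB")
```

The float32 figure comes out at 19.45 GiB, matching the `Invalid buffer size` in the traceback, while the original float16 tensor is about half that.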
@malfet added the "release notes: python_frontend" and "topic: improvements" labels Dec 3, 2024
@malfet malfet requested review from ezyang and albanD December 3, 2024 01:00
pytorch-bot bot commented Dec 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141927

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7e2b8e7 with merge base 4959784:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vadimkantorov commented Dec 3, 2024

The float8 types are interesting here too. I wonder whether they could be converted to fp16 or bf16 instead (especially if those kernels exist) without losing precision, so that they benefit from the printing improvements in this PR. That could also save some memory, since the fp8 -> fp32 conversion is probably memory-hungry...

malfet commented Dec 3, 2024

> The float8 types are interesting here too. I wonder whether they could be converted to fp16 or bf16 instead (especially if those kernels exist) without losing precision, so that they benefit from the printing improvements in this PR. That could also save some memory, since the fp8 -> fp32 conversion is probably memory-hungry...

Sure, I can convert float8 to float16, but I also think it might be worth the effort to extend a few more generic select ops to float8; see #141928
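
Whether fp8 values survive a conversion to float16 can be checked without PyTorch. The sketch below is a pure-Python illustration, not the code in this PR: it assumes the commonly described e5m2 layout (5 exponent bits, bias 15, with inf/NaN) and e4m3fn layout (4 exponent bits, bias 7, NaN only), and the helper names `decode_fp8`, `roundtrips_through_fp16`, and `finite_values` are hypothetical. It does not cover other fp8 variants or bf16.

```python
import struct


def decode_fp8(bits, exp_bits, man_bits, bias):
    """Decode a finite 8-bit pattern laid out as sign | exponent | mantissa."""
    sign = -1.0 if bits >> 7 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:  # zero or subnormal
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def roundtrips_through_fp16(value):
    """True if value is unchanged after packing into IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0] == value


def finite_values(exp_bits, man_bits, bias, is_special):
    """All finite values of an 8-bit format, skipping inf/NaN patterns."""
    return [
        decode_fp8(b, exp_bits, man_bits, bias)
        for b in range(256)
        if not is_special(b)
    ]


# Assumed layouts: e5m2 reserves the all-ones exponent for inf/NaN,
# e4m3fn reserves only the all-ones exponent-and-mantissa pattern for NaN.
e5m2 = finite_values(5, 2, 15, lambda b: ((b >> 2) & 0x1F) == 0x1F)
e4m3fn = finite_values(4, 3, 7, lambda b: (b & 0x7F) == 0x7F)

all_exact = all(roundtrips_through_fp16(v) for v in e5m2 + e4m3fn)
print(len(e5m2), len(e4m3fn), max(e5m2), max(e4m3fn), all_exact)
```

Under these assumptions every finite value of both formats is exactly representable in float16, which is consistent with the suggestion that the float32 range is not needed for printing them.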

malfet commented Dec 3, 2024

@pytorchbot merge

@pytorch-bot added the "ciflow/trunk" label Dec 3, 2024
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@albanD albanD removed their request for review December 3, 2024 03:14
malfet commented Dec 3, 2024

@pytorchbot merge -f "Don't want to wait for last ROCm test"

@pytorchmergebot

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Pull Request resolved: pytorch#141927
Approved by: https://github.com/ezyang
@malfet malfet deleted the malfet-patch-4 branch December 12, 2024 22:30
Labels
ciflow/trunk, Merged, release notes: python_frontend, topic: improvements