[complex] nansum & nanmean #93199

Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93199. Note: links to docs will display an error until the docs builds have been completed.

✅ No Failures as of commit 970201c. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Is there a tracking issue for the overarching goal and current state of complex type support in PyTorch? For anything that can't be jiterated, I'm a bit concerned about the added compile time, binary size, and test time relative to the value proposition.
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kComplexHalf,
    iter.dtype(), "nansum_cuda", [&]() {
      nansum_functor<scalar_t>{}(iter);
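For reference, a minimal sketch of the Python-level behavior this dispatch enables (assuming a CUDA build that includes this change; complex CPU support is follow-up work, and the exact handling of complex values with NaN components is defined by the kernel above, not by this sketch):

import torch

# Hypothetical values; assumes a CUDA build containing this PR.
x = torch.tensor([1 + 2j, complex('nan'), 3 - 1j],
                 dtype=torch.complex64, device='cuda')

# nansum/nanmean treat NaN entries as if they were absent; with this change
# the complex dtypes dispatch to nansum_functor on CUDA.
print(torch.nansum(x))
print(torch.nanmean(x))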
Any estimate of how much this affects compile time and binary size?
Hi, @malfet!

> Any estimate of how much this affects compile time and binary size?

A full build usually takes 45-60 minutes on my system. The last time I compiled, I got the following:
real 47m6.068s
user 537m54.813s
sys 17m20.032s
My System Info
Collecting environment information...
PyTorch version: 2.0.0a0+git295a5e2
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.22.1
Libc version: glibc-2.31
Python version: 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1650 Ti
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False
Versions of relevant libraries:
[pip3] mypy==0.960
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.1
[pip3] torch==2.0.0a0+gitf6a3034
[conda] mkl 2022.1.0 hc2b9512_224
[conda] mkl-include 2022.1.0 h06a4308_224
[conda] numpy 1.23.1 pypi_0 pypi
[conda] torch 2.0.0a0+gitc805b2b pypi_0 pypi
The total build time seems to have a lot of variance.

One thing you can do is build PyTorch first and then make a small change (maybe add a space) in ReduceSumProdKernel.cu to trigger an incremental build, so that only ReduceSumProdKernel.cu is recompiled and relinked. This gives a more granular measurement of the time required to compile this file.

Another option (if you are using ninja for building) is to use https://github.com/nico/ninjatracing and check how long this particular file takes.
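For concreteness, a rough sketch of the first approach (touch the file, rebuild, and time it). The repository path and build command here are assumptions about a typical ninja-based checkout, not part of this PR:

import subprocess
import time
from pathlib import Path

# Hypothetical paths for a local PyTorch checkout built with ninja.
repo = Path.home() / "pytorch"
kernel = repo / "aten/src/ATen/native/cuda/ReduceSumProdKernel.cu"

# Touch the file so only this translation unit (plus linking) is rebuilt.
kernel.touch()

start = time.perf_counter()
subprocess.run(["python", "setup.py", "develop"], cwd=repo, check=True)
print(f"Incremental rebuild took {time.perf_counter() - start:.1f} s")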
I added a line to check the build time, as you suggested. I got:
real 0m57.989s
user 0m55.940s
sys 0m4.402s
As discussed in #93199 (comment). Pull Request resolved: #93267 Approved by: https://github.com/Skylion007
Yes, there are some issues:

It was decided that we cannot jiterate the op. There are a few more trackers regarding complex support in PyTorch.

cc: @ngimel @kshitij12345 @anjali411

Thanks!
This PR looks good to me.
Thanks, @Skylion007, @kshitij12345, and @malfet, for the informative feedback throughout the PR! Note to self: I'll add complex CPU support in the next PR. :)

@pytorchbot label ciflow/trunk
FYI, the branch cutoff for 2.0 is really soon (tomorrow, I think) if you want to try to get it in by then.
LGTM. What's the compile-time increase as a result of this change?

I don't think this necessarily needs to go into the 2.0 release branch.
Fair, it'd be nice to add CPU and GPU support in the same release if it's not a big change.
Seems like it is silently updating the behaviour to support integral inputs for nanmean. Do we want that in this PR?

Also, I don't understand why the meta tests fail for nanmean. They seem to fail only on integral and complex inputs. Any idea why?
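For readers following along, a rough illustration of what the meta tests exercise (a hedged sketch; the failing cases in question are the integral and complex variants of exactly this kind of call):

import torch

# Meta tensors carry only shape/dtype metadata; calling the op on them
# exercises the shape- and dtype-inference path with no real computation.
x = torch.empty(10, 10, device='meta', dtype=torch.float32)
out = torch.nanmean(x)
print(out.shape, out.dtype, out.device)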
@malfet, there's no extra increase in compilation time; see #93199 (comment). I also benchmarked the performance of the jiterated vs. non-jiterated operators. Here are the results:

[-------------------------- torch.nansum(x) Benchmark --------------------------]
| non_jiterated
1 threads: ----------------------------------------------------------------------
dtype torch.complex64 - shape : torch.Size([]) | 8.4
dtype torch.complex64 - shape : torch.Size([5]) | 8.7
dtype torch.complex64 - shape : torch.Size([10, 10]) | 9.1
dtype torch.complex64 - shape : torch.Size([64, 64]) | 9.3
dtype torch.complex64 - shape : torch.Size([512, 256]) | 12.8
dtype torch.complex64 - shape : torch.Size([1024, 1024]) | 50.2
dtype torch.complex128 - shape : torch.Size([]) | 8.7
dtype torch.complex128 - shape : torch.Size([5]) | 8.6
dtype torch.complex128 - shape : torch.Size([10, 10]) | 9.0
dtype torch.complex128 - shape : torch.Size([64, 64]) | 10.0
dtype torch.complex128 - shape : torch.Size([512, 256]) | 23.5
dtype torch.complex128 - shape : torch.Size([1024, 1024]) | 114.2
Times are in microseconds (us).
[--------------------------------- torch.nansum(x) Benchmark ---------------------------------]
| jiterated | non_jiterated
1 threads: ------------------------------------------------------------------------------------
dtype torch.complex64 - shape : torch.Size([]) | 8.1 | 8.4
dtype torch.complex64 - shape : torch.Size([5]) | 8.1 | 8.7
dtype torch.complex64 - shape : torch.Size([10, 10]) | 8.6 | 9.1
dtype torch.complex64 - shape : torch.Size([64, 64]) | 8.6 | 9.3
dtype torch.complex64 - shape : torch.Size([512, 256]) | 12.5 | 12.8
dtype torch.complex64 - shape : torch.Size([1024, 1024]) | 50.0 | 50.2
dtype torch.complex128 - shape : torch.Size([]) | 8.0 | 8.7
dtype torch.complex128 - shape : torch.Size([5]) | 8.0 | 8.6
dtype torch.complex128 - shape : torch.Size([10, 10]) | 8.5 | 9.0
dtype torch.complex128 - shape : torch.Size([64, 64]) | 10.1 | 10.0
dtype torch.complex128 - shape : torch.Size([512, 256]) | 23.5 | 23.5
dtype torch.complex128 - shape : torch.Size([1024, 1024]) | 114.2 | 114.2
Times are in microseconds (us).

Benchmark script:
import torch
from torch.testing import make_tensor
import pickle
from torch.utils import benchmark
from itertools import product
device = 'cuda'
dtypes = [torch.complex64, torch.complex128]
# desc = "jiterated"
desc = "non_jiterated"
shapes = ((), (5,), (10, 10), (64, 64), (512, 256), (1024, 1024))
result = []
for dtype, shape in product(dtypes, shapes):
    x = make_tensor(shape, device=device, dtype=dtype)
    # Generate the JIT kernel (if it wasn't previously).
    torch.nansum(x)
    stmt = 'torch.nansum(x)'
    measurement = benchmark.Timer(
        stmt=stmt,
        globals={'x': x},
        label=stmt + " Benchmark",
        sub_label=f"dtype {dtype} - shape : {x.shape}",
        description=desc,
    ).blocked_autorange()
    result.append(measurement)

fname_prefix = "benchmark_nansum_"
benchmark.Compare(result).print()
with open(fname_prefix + desc, "wb") as f:
    pickle.dump(result, f)

results = []
for desc in ['jiterated', 'non_jiterated']:
    with open(fname_prefix + desc, 'rb') as f:
        results.append(pickle.load(f))
benchmark.Compare(results[0] + results[1]).print()

Thanks!
@kshitij12345, I think we can update both, to match NumPy:

NumPy:
>>> import numpy as np
>>> k = np.array([1, 2, 3], dtype=np.int32)
>>> np.mean(k)
2.0
>>> np.nanmean(k)
2.0

PyTorch:
>>> import torch
>>> k = torch.tensor([1, 2, 3], dtype=torch.int32)
>>> torch.mean(k)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: mean(): could not infer output dtype. Input dtype must be either a floating point or complex dtype. Got: Int
>>> torch.nanmean(k)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: nanmean(): expected input to have floating point dtype but got Int

What do you think?

Thanks!
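In the meantime, a minimal sketch of how integral inputs can be handled today by casting explicitly (the choice of float64 here is just an illustrative assumption, mirroring NumPy's default accumulation dtype):

import torch

k = torch.tensor([1, 2, 3], dtype=torch.int32)

# Explicitly cast to a floating dtype before calling nanmean; integer tensors
# cannot contain NaN, so this is equivalent to a plain mean.
print(torch.nanmean(k.to(torch.float64)))  # tensor(2., dtype=torch.float64)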
LGTM! I have a single comment, but it's not a blocker.

Thanks, @khushi-411.

I see the 2.0.0 branch is cut now, so I'm merging this now. :)

@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This reverts commit a038968.
Follows: pytorch#71472 Pull Request resolved: pytorch#93199 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kshitij12345
Follows: #71472
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10