[complex] nansum & nanmean by khushi-411 · Pull Request #93199 · pytorch/pytorch

[complex] nansum & nanmean #93199


Closed · wants to merge 10 commits

Conversation

khushi-411
Contributor
@khushi-411 khushi-411 commented Jan 28, 2023

@pytorch-bot
pytorch-bot bot commented Jan 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93199

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 970201c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jan 28, 2023
Contributor
@malfet malfet left a comment


Is there a tracking issue for the overarching goal/state of complex type support in PyTorch? For anything that can't be jiterated, I'm a bit concerned about the added compile time/binary size/test time compared to the value proposition.

Comment on lines 189 to 191
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kComplexHalf,
iter.dtype(), "nansum_cuda", [&]() {
nansum_functor<scalar_t>{}(iter);
Contributor


Any estimate of how much this affects compile time and binary size?

Contributor Author


Hi, @malfet!

Any estimate of how much this affects compile time and binary size?

A full build usually takes 45-60 minutes on my system. The last time I compiled, I got the following:

real	47m6.068s
user	537m54.813s
sys	17m20.032s
My System Info

Collecting environment information...
PyTorch version: 2.0.0a0+git295a5e2
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1 
CMake version: version 3.22.1
Libc version: glibc-2.31

Python version: 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1650 Ti
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

Versions of relevant libraries:
[pip3] mypy==0.960
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.1
[pip3] torch==2.0.0a0+gitf6a3034
[conda] mkl                       2022.1.0           hc2b9512_224  
[conda] mkl-include               2022.1.0           h06a4308_224  
[conda] numpy                     1.23.1                   pypi_0    pypi
[conda] torch                     2.0.0a0+gitc805b2b          pypi_0    pypi

Collaborator


The time for a full build seems to have a lot of variance.

One thing you can do is build PyTorch first and then make a small change (maybe add a space) in ReduceSumProdKernel.cu to trigger an incremental build, so that only ReduceSumProdKernel.cu is recompiled and relinked. This gives a more granular measure of the time required to compile that file.

Another option (if you are building with ninja) is to use https://github.com/nico/ninjatracing and check how long this particular file takes.
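
A minimal sketch of that incremental-build timing, in Python (the kernel path and the setup.py develop build command are assumptions about a typical source build; adjust them to your checkout):

import pathlib
import subprocess
import time

# Touch the single .cu file so the next build recompiles only that
# translation unit (plus linking), then time the rebuild.
kernel = pathlib.Path("aten/src/ATen/native/cuda/ReduceSumProdKernel.cu")
kernel.touch()

start = time.monotonic()
subprocess.run(["python", "setup.py", "develop"], check=True)
print(f"Incremental rebuild took {time.monotonic() - start:.1f} s")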

Contributor Author


I added a line to trigger an incremental build, as you suggested. I got:

real	0m57.989s
user	0m55.940s
sys	0m4.402s

pytorchmergebot pushed a commit that referenced this pull request Jan 31, 2023
@khushi-411
Contributor Author
khushi-411 commented Jan 31, 2023

Is there a tracking issue for the overarching goal/state of complex type support in PyTorch? For anything that can't be jiterated, I'm a bit concerned about the added compile time/binary size/test time compared to the value proposition.

Yes, there are some issues:

It was decided that we cannot jiterate the linalg operators, and the performance of some operators was severely affected. But there's no tracker for it other than the first issue I mentioned here, AFAIK. 😅

There are a few more trackers regarding complex support in PyTorch.

cc: @ngimel @kshitij12345 @anjali411

Thanks!

@khushi-411 khushi-411 marked this pull request as ready for review February 7, 2023 13:34
@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Feb 8, 2023
Collaborator
@Skylion007 Skylion007 left a comment


This PR looks good to me.

@khushi-411
Contributor Author

Thanks, @Skylion007, @kshitij12345, and @malfet, for the informative discussion throughout the PR!

Note to self: I'll add complex CPU support in the next PR. :)

@pytorchbot label ciflow/trunk

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 14, 2023
@Skylion007
Collaborator

Thanks, @Skylion007, @kshitij12345, and @malfet, for the informative discussion throughout the PR!

Note to self: I'll add complex CPU support in the next PR. :)

@pytorchbot label ciflow/trunk

FYI, branch cutoff for 2.0 is really soon (tomorrow, I think), if you want to try to get it in by then.

Contributor
@malfet malfet left a comment


LGTM, what's the compilation time increase as a result of this change?

@malfet
Contributor
malfet commented Feb 14, 2023

FYI, branch cutoff for 2.0 is really soon (tomorrow, I think), if you want to try to get it in by then.

I don't think this necessarily needs to go into the 2.0 release branch.

@Skylion007
Collaborator

FYI, branch cutoff for 2.0 is really soon (tomorrow, I think), if you want to try to get it in by then.

I don't think this necessarily needs to go into the 2.0 release branch.

Fair, it'd be nice to add CPU and GPU support in the same release if it's not a big change.

Collaborator
@kshitij12345 kshitij12345 left a comment


It seems like this is silently updating the behaviour to support integral inputs for nanmean. Do we want that in this PR?

Also, I don't understand why the meta tests fail for nanmean. They seem to fail only on integral and complex inputs. Any idea why?

@khushi-411
Contributor Author

LGTM, what's the compilation time increase as a result of this change?

@malfet, there's no noticeable extra increase in compilation time; see #93199 (comment).

I also benchmarked the jiterated vs. non-jiterated implementations. Here are the numbers:

[-------------------------- torch.nansum(x) Benchmark --------------------------]
                                                                 |  non_jiterated
1 threads: ----------------------------------------------------------------------
      dtype torch.complex64 - shape : torch.Size([])             |        8.4    
      dtype torch.complex64 - shape : torch.Size([5])            |        8.7    
      dtype torch.complex64 - shape : torch.Size([10, 10])       |        9.1    
      dtype torch.complex64 - shape : torch.Size([64, 64])       |        9.3    
      dtype torch.complex64 - shape : torch.Size([512, 256])     |       12.8    
      dtype torch.complex64 - shape : torch.Size([1024, 1024])   |       50.2    
      dtype torch.complex128 - shape : torch.Size([])            |        8.7    
      dtype torch.complex128 - shape : torch.Size([5])           |        8.6    
      dtype torch.complex128 - shape : torch.Size([10, 10])      |        9.0    
      dtype torch.complex128 - shape : torch.Size([64, 64])      |       10.0    
      dtype torch.complex128 - shape : torch.Size([512, 256])    |       23.5    
      dtype torch.complex128 - shape : torch.Size([1024, 1024])  |      114.2    

Times are in microseconds (us).

[--------------------------------- torch.nansum(x) Benchmark ---------------------------------]
                                                                 |  jiterated  |  non_jiterated
1 threads: ------------------------------------------------------------------------------------
      dtype torch.complex64 - shape : torch.Size([])             |      8.1    |        8.4    
      dtype torch.complex64 - shape : torch.Size([5])            |      8.1    |        8.7    
      dtype torch.complex64 - shape : torch.Size([10, 10])       |      8.6    |        9.1    
      dtype torch.complex64 - shape : torch.Size([64, 64])       |      8.6    |        9.3    
      dtype torch.complex64 - shape : torch.Size([512, 256])     |     12.5    |       12.8    
      dtype torch.complex64 - shape : torch.Size([1024, 1024])   |     50.0    |       50.2    
      dtype torch.complex128 - shape : torch.Size([])            |      8.0    |        8.7    
      dtype torch.complex128 - shape : torch.Size([5])           |      8.0    |        8.6    
      dtype torch.complex128 - shape : torch.Size([10, 10])      |      8.5    |        9.0    
      dtype torch.complex128 - shape : torch.Size([64, 64])      |     10.1    |       10.0    
      dtype torch.complex128 - shape : torch.Size([512, 256])    |     23.5    |       23.5    
      dtype torch.complex128 - shape : torch.Size([1024, 1024])  |    114.2    |      114.2    

Times are in microseconds (us).
script

import torch
from torch.testing import make_tensor
import pickle
from torch.utils import benchmark
from itertools import product

device = 'cuda'
dtypes = [torch.complex64, torch.complex128]
# desc = "jiterated"
desc = "non_jiterated"
shapes = ((), (5,), (10, 10), (64, 64), (512, 256), (1024, 1024))

result = []

for dtype, shape in product(dtypes, shapes):
    x = make_tensor(shape, device=device, dtype=dtype)

    # Generate the JIT kernel (if it wasn't previously).
    torch.nansum(x)
    stmt = 'torch.nansum(x)'
    measurement = benchmark.Timer(
                stmt=stmt,
                globals={'x': x},
                label=stmt+ " Benchmark",
                sub_label=f"dtype {dtype} - shape : {x.shape}",
                description=desc,
            ).blocked_autorange()

    result.append(measurement)

fname_prefix = "benchmark_nansum_"
benchmark.Compare(result).print()
with open(fname_prefix+desc, "wb") as f:
    pickle.dump(result, f)

results = []
for desc in ['jiterated', 'non_jiterated']:
    with open(fname_prefix+desc, 'rb') as f:
        results.append(pickle.load(f))

benchmark.Compare(results[0] + results[1]).print()

Thanks!

@khushi-411
Contributor Author

It seems like this is silently updating the behaviour to support integral inputs for nanmean. Do we want that in this PR?

@kshitij12345, I think we can update both mean's and nanmean's behavior in a separate PR, since NumPy supports integer inputs. Please see:

NumPy

>>> import numpy as np
>>> k = np.array([1, 2, 3], dtype=np.int32)
>>> np.mean(k)
2.0
>>> np.nanmean(k)
2.0

PyTorch

>>> import torch
>>> k = torch.tensor([1, 2, 3], dtype=torch.int32)
>>> torch.mean(k)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: mean(): could not infer output dtype. Input dtype must be either a floating point or complex dtype. Got: Int
>>> torch.nanmean(k)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: nanmean(): expected input to have floating point dtype but got Int
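
For reference (an illustrative workaround, not something this PR changes), casting to a floating dtype first already works on the PyTorch side:

>>> torch.mean(k.float())
tensor(2.)
>>> torch.nanmean(k.float())
tensor(2.)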

What do you think?

Also, I don't understand why the meta tests fail for nanmean. They seem to fail only on integral and complex inputs. Any idea why?

The meta_dispatch_outplace test failure is resolved in my latest commit.

Thanks!

Collaborator
@kshitij12345 kshitij12345 left a comment


LGTM! I have a single comment, but it's not a blocker.

Thanks @khushi-411

@khushi-411
Contributor Author

I see the 2.0.0 branch has been cut now, so I'm merging this :)

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

