[complex] nansum & nanmean by khushi-411 · Pull Request #93199 · pytorch/pytorch

[complex] nansum & nanmean #93199


Closed · wants to merge 10 commits

Conversation

khushi-411
Contributor
@khushi-411 khushi-411 commented Jan 28, 2023

@pytorch-bot
pytorch-bot bot commented Jan 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/93199

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 970201c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jan 28, 2023
Contributor
@malfet malfet left a comment


Is there a tracking issue for the overarching goal/state of complex type support in PyTorch? For anything that can't be jiterated, I'm a bit concerned about the added compile time/binary size/test time compared to the value proposition.

Comment on lines 189 to 191
AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND1(kComplexHalf,
iter.dtype(), "nansum_cuda", [&]() {
nansum_functor<scalar_t>{}(iter);
Contributor


Any estimate of how much this affects compile time and binary size?

Contributor Author


Hi, @malfet!

Any estimate of how much this affects compile time and binary size?

A full build usually takes 45-60 minutes on my system. The last time I compiled, I got the following:

real	47m6.068s
user	537m54.813s
sys	17m20.032s
My System Info

Collecting environment information...
PyTorch version: 2.0.0a0+git295a5e2
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1 
CMake version: version 3.22.1
Libc version: glibc-2.31

Python version: 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1650 Ti
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: False

Versions of relevant libraries:
[pip3] mypy==0.960
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.1
[pip3] torch==2.0.0a0+gitf6a3034
[conda] mkl                       2022.1.0           hc2b9512_224  
[conda] mkl-include               2022.1.0           h06a4308_224  
[conda] numpy                     1.23.1                   pypi_0    pypi
[conda] torch                     2.0.0a0+gitc805b2b          pypi_0    pypi

Collaborator


The time for a full build seems to have a lot of variance.

One thing you can do is build PyTorch first and then make a small change (maybe add a space) in ReduceSumProdKernel.cu to trigger an incremental build, so that only ReduceSumProdKernel.cu is recompiled and relinked. This gives a more granular measure of the time required to compile that file.

Another option (if you are building with ninja) is to use https://github.com/nico/ninjatracing and check how long this particular file takes.
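
A minimal sketch of that incremental-build timing, in Python (the kernel path and the setup.py develop build command are assumptions about a typical source build; adjust them to your checkout):

import pathlib
import subprocess
import time

# Touch the single .cu file so the next build recompiles only that
# translation unit (plus linking), then time the rebuild.
kernel = pathlib.Path("aten/src/ATen/native/cuda/ReduceSumProdKernel.cu")
kernel.touch()

start = time.monotonic()
subprocess.run(["python", "setup.py", "develop"], check=True)
print(f"Incremental rebuild took {time.monotonic() - start:.1f} s")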

Contributor Author


I added a line to trigger an incremental build, as you suggested. I got:

real	0m57.989s
user	0m55.940s
sys	0m4.402s

pytorchmergebot pushed a commit that referenced this pull request Jan 31, 2023
@khushi-411
Contributor Author
khushi-411 commented Jan 31, 2023

Is there a tracking issue for the overarching goal/state of complex type support in PyTorch? For anything that can't be jiterated, I'm a bit concerned about the added compile time/binary size/test time compared to the value proposition.

Yes, there are some issues:

It was decided that we cannot jiterate the linalg operators, and the performance of some operators was severely affected. But there's no tracker for it other than the first issue I mentioned here, AFAIK. 😅

There are a few more trackers regarding complex support in PyTorch.

cc: @ngimel @kshitij12345 @anjali411

Thanks!

@khushi-411 khushi-411 marked this pull request as ready for review February 7, 2023 13:34
@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Feb 8, 2023
Collaborator
@Skylion007 Skylion007 left a comment


This PR looks good to me.

@khushi-411
Contributor Author

Thanks, @Skylion007, @kshitij12345, and @malfet, for the informative discussion throughout the PR!

Note to self: I'll add complex CPU support in the next PR. :)

@pytorchbot label ciflow/trunk

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 14, 2023
@Skylion007
Collaborator

Thanks, @Skylion007, @kshitij12345, and @malfet, for the informative discussion throughout the PR!

Note to self: I'll add complex CPU support in the next PR. :)

@pytorchbot label ciflow/trunk

FYI, branch cutoff for 2.0 is really soon (tomorrow, I think), if you want to try to get it in by then.

Contributor
@malfet malfet left a comment


LGTM, what's the compilation time increase as a result of this change?

@malfet
Contributor
malfet commented Feb 14, 2023

FYI, branch cutoff for 2.0 is really soon (tomorrow, I think), if you want to try to get it in by then.

I don't think this necessarily needs to go into the 2.0 release branch.

@Skylion007
Collaborator

FYI, branch cutoff for 2.0 is really soon (tomorrow, I think), if you want to try to get it in by then.

I don't think this necessarily needs to go into the 2.0 release branch.

Fair, it'd be nice to add CPU and GPU support in the same release if it's not a big change.

Collaborator
@kshitij12345 kshitij12345 left a comment


It seems like this is silently updating the behaviour to support integral inputs for nanmean. Do we want that in this PR?

Also, I don't understand why the meta tests fail for nanmean. They seem to fail only on integral and complex inputs. Any idea why?

@khushi-411
Contributor Author

LGTM, what's the compilation time increase as a result of this change?

@malfet, there's no noticeable extra increase in compilation time; see #93199 (comment).

I also benchmarked the jiterated vs. non-jiterated implementations. Here are the numbers:

[-------------------------- torch.nansum(x) Benchmark --------------------------]
                                                                 |  non_jiterated
1 threads: ----------------------------------------------------------------------
      dtype torch.complex64 - shape : torch.Size([])             |        8.4    
      dtype torch.complex64 - shape : torch.Size([5])            |        8.7    
      dtype torch.complex64 - shape : torch.Size([10, 10])       |        9.1    
      dtype torch.complex64 - shape : torch.Size([64, 64])       |        9.3    
      dtype torch.complex64 - shape : torch.Size([512, 256])     |       12.8    
      dtype torch.complex64 - shape : torch.Size([1024, 1024])   |       50.2    
      dtype torch.complex128 - shape : torch.Size([])            |        8.7    
      dtype torch.complex128 - shape : torch.Size([5])           |        8.6    
      dtype torch.complex128 - shape : torch.Size([10, 10])      |        9.0    
      dtype torch.complex128 - shape : torch.Size([64, 64])      |       10.0    
      dtype torch.complex128 - shape : torch.Size([512, 256])    |       23.5    
      dtype torch.complex128 - shape : torch.Size([1024, 1024])  |      114.2    

Times are in microseconds (us).

[--------------------------------- torch.nansum(x) Benchmark ---------------------------------]
                                                                 |  jiterated  |  non_jiterated
1 threads: ------------------------------------------------------------------------------------
      dtype torch.complex64 - shape : torch.Size([])             |      8.1    |        8.4    
      dtype torch.complex64 - shape : torch.Size([5])            |      8.1    |        8.7    
      dtype torch.complex64 - shape : torch.Size([10, 10])       |      8.6    |        9.1    
      dtype torch.complex64 - shape : torch.Size([64, 64])       |      8.6    |        9.3    
      dtype torch.complex64 - shape : torch.Size([512, 256])     |     12.5    |       12.8    
      dtype torch.complex64 - shape : torch.Size([1024, 1024])   |     50.0    |       50.2    
      dtype torch.complex128 - shape : torch.Size([])            |      8.0    |        8.7    
      dtype torch.complex128 - shape : torch.Size([5])           |      8.0    |        8.6    
      dtype torch.complex128 - shape : torch.Size([10, 10])      |      8.5    |        9.0    
      dtype torch.complex128 - shape : torch.Size([64, 64])      |     10.1    |       10.0    
      dtype torch.complex128 - shape : torch.Size([512, 256])    |     23.5    |       23.5    
      dtype torch.complex128 - shape : torch.Size([1024, 1024])  |    114.2    |      114.2    

Times are in microseconds (us).
script

import torch
from torch.testing import make_tensor
import pickle
from torch.utils import benchmark
from itertools import product

device = 'cuda'
dtypes = [torch.complex64, torch.complex128]
# desc = "jiterated"
desc = "non_jiterated"
shapes = ((), (5,), (10, 10), (64, 64), (512, 256), (1024, 1024))

result = []

for dtype, shape in product(dtypes, shapes):
    x = make_tensor(shape, device=device, dtype=dtype)

    # Generate the JIT kernel (if it wasn't previously).
    torch.nansum(x)
    stmt = 'torch.nansum(x)'
    measurement = benchmark.Timer(
                stmt=stmt,
                globals={'x': x},
                label=stmt+ " Benchmark",
                sub_label=f"dtype {dtype} - shape : {x.shape}",
                description=desc,
            ).blocked_autorange()

    result.append(measurement)

fname_prefix = "benchmark_nansum_"
benchmark.Compare(result).print()
with open(fname_prefix+desc, "wb") as f:
    pickle.dump(result, f)

results = []
for desc in ['jiterated', 'non_jiterated']:
    with open(fname_prefix+desc, 'rb') as f:
        results.append(pickle.load(f))

benchmark.Compare(results[0] + results[1]).print()

Thanks!

@khushi-411
Contributor Author

It seems like this is silently updating the behaviour to support integral inputs for nanmean. Do we want that in this PR?

@kshitij12345, I think we can update both mean's and nanmean's behavior in a separate PR, since NumPy supports integer inputs. Please see:

NumPy

>>> import numpy as np
>>> k = np.array([1, 2, 3], dtype=np.int32)
>>> np.mean(k)
2.0
>>> np.nanmean(k)
2.0

PyTorch

>>> import torch
>>> k = torch.tensor([1, 2, 3], dtype=torch.int32)
>>> torch.mean(k)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: mean(): could not infer output dtype. Input dtype must be either a floating point or complex dtype. Got: Int
>>> torch.nanmean(k)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: nanmean(): expected input to have floating point dtype but got Int
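
For reference (an illustrative workaround, not something this PR changes), casting to a floating dtype first already works on the PyTorch side:

>>> torch.mean(k.float())
tensor(2.)
>>> torch.nanmean(k.float())
tensor(2.)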

What do you think?

Also, I don't understand why the meta tests fail for nanmean. They seem to fail only on integral and complex inputs. Any idea why?

The meta_dispatch_outplace test failure is resolved in my latest commit.

Thanks!

Collaborator
@kshitij12345 kshitij12345 left a comment


LGTM! I have a single comment, but it's not a blocker.

Thanks @khushi-411

@khushi-411
Contributor Author

I see the 2.0.0 branch has been cut now, so I'm merging this :)

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

