add max_and_min function and cpu kernel to speed up observers #41570
Conversation
💊 CI failures summary (Dr. CI): as of commit 4944b3f, no failures yet. 💚
About naming: pytorch already has … There was a discussion on …

sure, I'm flexible.
Also, keeping similar structure to var_mean, it would be good to return 2 tensors and not one. Another thing: since at this stage it is not intended as a fully usable API (e.g. it does not support dim), it would be good to keep it private for now.
ah, great, didn't know returning multiple tensors was supported in native_functions.yaml, will do
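For illustration, a minimal sketch of the call-site difference between returning one packed result and returning two values, using plain floats as stand-ins for tensors (hypothetical names; the real declarations live in native_functions.yaml and return tensors):

```cpp
#include <array>
#include <iostream>
#include <tuple>

// Shape A: one packed 2-element result, as in the first version of this PR
// (the benchmark below prints tensor([-5.4485, 5.0390])).
std::array<float, 2> min_max_packed() { return {-5.4485f, 5.0390f}; }

// Shape B: two separate results, mirroring var_mean's (Tensor, Tensor) return.
std::tuple<float, float> min_max_pair() { return {-5.4485f, 5.0390f}; }

int main() {
  auto packed = min_max_packed();
  std::cout << packed[0] << " " << packed[1] << "\n";

  // With two outputs, callers can unpack both values directly.
  auto [mn, mx] = min_max_pair();
  std::cout << mn << " " << mx << "\n";
  return 0;
}
```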
sure thing. For quantization we'll add CUDA and indices in future PRs, just to keep the PR size manageable. But yeah, unless this is actually needed outside of quantization, keeping it private would be good.
I'm pretty sure it will be usable outside of quantization, so perhaps we should work on making it a first-class citizen. Would it be a lot of trouble for you guys to start as _min_max and then switch to min_max if it becomes stable and public?
sg. Docs we can also add ourselves. For autograd support I haven't looked into it tbh.
Ok, this looks good for starters. Looking forward to cuda and dim implementations.
```cpp
  output2.fill_(result.second);
}

template <typename scalar_t, typename func_t, typename vec_func_t1, typename vec_func_t2>
```
nit: vec_func_t1 and vec_func_t2 are the same type, so you don't need 2 template args here
hmm, so I tried this a few days ago and the compiler wasn't happy. Looks like it would be some non-trivial extra work to go from lambdas to functions which resolve to the same templated type (context: https://stackoverflow.com/questions/7477310/why-cant-i-create-a-vector-of-lambdas-of-the-same-type-in-c11). Thoughts on if it's worth it?
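For context, a minimal standalone example of the issue (illustrative only, not the PR's kernel code): every lambda expression has its own unique closure type, so two different lambdas can never deduce to a single template parameter.

```cpp
#include <iostream>

// One deduced parameter: both arguments must deduce to the same type,
// which two distinct lambdas never do.
template <typename F>
int apply_one_param(F op1, F op2) {
  return op1(1, 2) + op2(1, 2);
}

// One parameter per callable: deduction succeeds.
template <typename F1, typename F2>
int apply_two_params(F1 op1, F2 op2) {
  return op1(1, 2) + op2(1, 2);
}

int main() {
  auto min_op = [](int a, int b) { return a < b ? a : b; };
  auto max_op = [](int a, int b) { return a > b ? a : b; };

  // apply_one_param(min_op, max_op);  // error: deduced conflicting types for 'F'
  std::cout << apply_two_params(min_op, max_op) << "\n";  // prints 3
  return 0;
}
```

Working around it (e.g. routing through function pointers or std::function) is the non-trivial extra work mentioned above, and either option can hurt inlining in a hot reduction loop.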
No, if it's nontrivial then it's fine to leave as is.
sounds good. thanks for the review!
…ers" Summary: For min/max based quantization observers, calculating min and max of a tensor takes most of the runtime. Since the calculation of min and max is done on the same tensor, we can speed this up by only reading the tensor once, and reducing with two outputs. One question I had is whether we should put this into the quantization namespace, since the use case is pretty specific. This PR implements the easier CPU path to get an initial validation. There is some needed additional work in future PRs, which @jpgraham will take a look at: * CUDA kernel and tests * making this work per channel * benchmarking on observer * benchmarking impact on QAT overhead Test Plan: ``` python test/test_torch.py TestTorch.test_min_and_max ``` quick bench (not representative of real world use case): https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca ``` (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390]) min and max separate 11.90243935585022 min and max combined 6.353186368942261 % decrease 0.466228209277153 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586]) min and max separate 3.468616485595703 min and max combined 1.8227086067199707 % decrease 0.4745142294372342 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146]) min and max separate 1.5707778930664062 min and max combined 0.8645427227020264 % decrease 0.4496085496757899 ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D22589349](https://our.internmc.facebook.com/intern/diff/D22589349) [ghstack-poisoned]
…ers" Summary: For min/max based quantization observers, calculating min and max of a tensor takes most of the runtime. Since the calculation of min and max is done on the same tensor, we can speed this up by only reading the tensor once, and reducing with two outputs. One question I had is whether we should put this into the quantization namespace, since the use case is pretty specific. This PR implements the easier CPU path to get an initial validation. There is some needed additional work in future PRs, which @jpgraham will take a look at: * CUDA kernel and tests * making this work per channel * benchmarking on observer * benchmarking impact on QAT overhead Test Plan: ``` python test/test_torch.py TestTorch.test_min_and_max ``` quick bench (not representative of real world use case): https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca ``` (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390]) min and max separate 11.90243935585022 min and max combined 6.353186368942261 % decrease 0.466228209277153 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586]) min and max separate 3.468616485595703 min and max combined 1.8227086067199707 % decrease 0.4745142294372342 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146]) min and max separate 1.5707778930664062 min and max combined 0.8645427227020264 % decrease 0.4496085496757899 ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D22589349](https://our.internmc.facebook.com/intern/diff/D22589349) [ghstack-poisoned]
Summary: For min/max based quantization observers, calculating min and max of a tensor takes most of the runtime. Since the calculation of min and max is done on the same tensor, we can speed this up by only reading the tensor once, and reducing with two outputs. One question I had is whether we should put this into the quantization namespace, since the use case is pretty specific. This PR implements the easier CPU path to get an initial validation. There is some needed additional work in future PRs, which @jpgraham will take a look at: * CUDA kernel and tests * making this work per channel * benchmarking on observer * benchmarking impact on QAT overhead Test Plan: ``` python test/test_torch.py TestTorch.test_min_and_max ``` quick bench (not representative of real world use case): https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca ``` (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390]) min and max separate 11.90243935585022 min and max combined 6.353186368942261 % decrease 0.466228209277153 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586]) min and max separate 3.468616485595703 min and max combined 1.8227086067199707 % decrease 0.4745142294372342 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146]) min and max separate 1.5707778930664062 min and max combined 0.8645427227020264 % decrease 0.4496085496757899 ``` Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 72bf781 Pull Request resolved: #41570
Summary: For min/max based quantization observers, calculating min and max of a tensor takes most of the runtime. Since the calculation of min and max is done on the same tensor, we can speed this up by only reading the tensor once, and reducing with two outputs. One question I had is whether we should put this into the quantization namespace, since the use case is pretty specific. This PR implements the easier CPU path to get an initial validation. There is some needed additional work in future PRs, which @jpgraham will take a look at: * CUDA kernel and tests * making this work per channel * benchmarking on observer * benchmarking impact on QAT overhead Test Plan: ``` python test/test_torch.py TestTorch.test_min_and_max ``` quick bench (not representative of real world use case): https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca ``` (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390]) min and max separate 11.90243935585022 min and max combined 6.353186368942261 % decrease 0.466228209277153 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586]) min and max separate 3.468616485595703 min and max combined 1.8227086067199707 % decrease 0.4745142294372342 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146]) min and max separate 1.5707778930664062 min and max combined 0.8645427227020264 % decrease 0.4496085496757899 ``` Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D22589349](https://our.internmc.facebook.com/intern/diff/D22589349) [ghstack-poisoned]
Summary: For min/max based quantization observers, calculating min and max of a tensor takes most of the runtime. Since the calculation of min and max is done on the same tensor, we can speed this up by only reading the tensor once, and reducing with two outputs. One question I had is whether we should put this into the quantization namespace, since the use case is pretty specific. This PR implements the easier CPU path to get an initial validation. There is some needed additional work in future PRs, which @jpgraham will take a look at: * CUDA kernel and tests * making this work per channel * benchmarking on observer * benchmarking impact on QAT overhead Test Plan: ``` python test/test_torch.py TestTorch.test_min_and_max ``` quick bench (not representative of real world use case): https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca ``` (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390]) min and max separate 11.90243935585022 min and max combined 6.353186368942261 % decrease 0.466228209277153 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586]) min and max separate 3.468616485595703 min and max combined 1.8227086067199707 % decrease 0.4745142294372342 (pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146]) min and max separate 1.5707778930664062 min and max combined 0.8645427227020264 % decrease 0.4496085496757899 ``` Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: a5e8ecb Pull Request resolved: #41570
This pull request has been merged in 302e566.
Stack from ghstack:
Summary:
For min/max based quantization observers, calculating min and max of a tensor
takes most of the runtime. Since the calculation of min and max is done
on the same tensor, we can speed this up by only reading the tensor
once, and reducing with two outputs.
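For illustration, a minimal standalone C++ sketch of the single-pass idea (made-up names, and a plain loop rather than the vectorized reduction machinery the actual CPU kernel plugs into):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Read the data exactly once and track both running reductions.
std::pair<float, float> min_and_max(const std::vector<float>& data) {
  assert(!data.empty());
  float lo = data[0];
  float hi = data[0];
  for (float v : data) {
    lo = std::min(lo, v);
    hi = std::max(hi, v);
  }
  return {lo, hi};  // two outputs from one pass over the input
}
```

Doing both reductions together reads the input once instead of twice, which lines up with the roughly 45% runtime reduction reported in the benchmark below.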
One question I had is whether we should put this into the quantization
namespace, since the use case is pretty specific.
This PR implements the easier CPU path to get an initial validation.
Some additional work is needed in future PRs, which @jpgraham will take a look at:

* CUDA kernel and tests
* making this work per channel
* benchmarking on observer
* benchmarking impact on QAT overhead
Test Plan:

```
python test/test_torch.py TestTorch.test_min_and_max
```

quick bench (not representative of real world use case):
https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca

```
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.0390) tensor(-5.4485) tensor([-5.4485, 5.0390])
min and max separate 11.90243935585022
min and max combined 6.353186368942261
% decrease 0.466228209277153
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.5586) tensor(-5.3983) tensor([-5.3983, 5.5586])
min and max separate 3.468616485595703
min and max combined 1.8227086067199707
% decrease 0.4745142294372342
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.2146) tensor(-5.2858) tensor([-5.2858, 5.2146])
min and max separate 1.5707778930664062
min and max combined 0.8645427227020264
% decrease 0.4496085496757899
```
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D22589349