nits on "add max_and_min function and cpu kernel to speed up observers" · pytorch/pytorch@4944b3f · GitHub

Commit 4944b3f

committed
nits on "add max_and_min function and cpu kernel to speed up observers"
Summary: For min/max based quantization observers, calculating the min and max of a tensor takes most of the runtime. Since min and max are computed over the same tensor, we can speed this up by reading the tensor only once and reducing with two outputs. One open question is whether this should live in the quantization namespace, since the use case is fairly specific.

This PR implements the easier CPU path to get an initial validation. Some additional work remains for future PRs, which @jpgraham will take a look at:
* CUDA kernel and tests
* making this work per channel
* benchmarking on observer
* benchmarking impact on QAT overhead

Test Plan:
```
python test/test_torch.py TestTorch.test_min_and_max
```

Quick bench (not representative of a real-world use case): https://gist.github.com/vkuzo/7fce61c3456dbc488d432430cafd6eca
```
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=1 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.0390) tensor(-5.4485) tensor([-5.4485,  5.0390])
min and max separate 11.90243935585022
min and max combined 6.353186368942261
% decrease 0.466228209277153
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=4 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.5586) tensor(-5.3983) tensor([-5.3983,  5.5586])
min and max separate 3.468616485595703
min and max combined 1.8227086067199707
% decrease 0.4745142294372342
(pytorch) [vasiliy@devgpu108.ash6 ~/local/pytorch] OMP_NUM_THREADS=8 python ~/nfs/pytorch_scripts/observer_bench.py
tensor(5.2146) tensor(-5.2858) tensor([-5.2858,  5.2146])
min and max separate 1.5707778930664062
min and max combined 0.8645427227020264
% decrease 0.4496085496757899
```

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D22589349](https://our.internmc.facebook.com/intern/diff/D22589349)

[ghstack-poisoned]
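The single-pass idea behind this change can be sketched outside of ATen as follows. This is a hypothetical standalone helper for illustration only, not the actual kernel (which uses `at::parallel_reduce` and vectorized chunk reducers); the function name `min_and_max` and the use of `std::vector<float>` are assumptions made for the sketch:

```cpp
#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

// Compute min and max in a single pass over the data: the tensor is
// read once, and both outputs are updated per element. Returns the
// identity pair (+inf-like, -inf-like bounds) for empty input.
std::pair<float, float> min_and_max(const std::vector<float>& data) {
    std::pair<float, float> acc(
        std::numeric_limits<float>::max(),     // identity for min
        std::numeric_limits<float>::lowest()); // identity for max
    for (float v : data) {
        acc.first = std::min(acc.first, v);
        acc.second = std::max(acc.second, v);
    }
    return acc;
}
```

Compared with calling `min()` and then `max()` separately, this halves the number of memory reads, which is where the benchmarked ~45% runtime reduction comes from for this memory-bound reduction.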
2 parents 8e793df + 6e72f9f commit 4944b3f

File tree

1 file changed

+4
-4
lines changed


aten/src/ATen/native/cpu/ReduceAllOpsKernel.cpp

Lines changed: 4 additions & 4 deletions
```
@@ -114,14 +114,14 @@ inline void reduce_all_impl_two_outputs(
     Tensor& output1,
     Tensor& output2,
     const Tensor& input,
-    const std::pair<scalar_t, scalar_t> ident_v,
+    const std::pair<scalar_t, scalar_t>& ident_v,
     func_t1 reduce_chunk_func,
     func_t2 reduce_acc_func) {
   using scalar_t_pair = std::pair<scalar_t, scalar_t>;
   const int64_t input_numel = input.numel();
   auto input_data = input.data_ptr<scalar_t>();
   scalar_t_pair result = at::parallel_reduce(0, input_numel, internal::GRAIN_SIZE, ident_v,
-    [&](int64_t start, int64_t end, const scalar_t_pair ident) -> scalar_t_pair {
+    [&](int64_t start, int64_t end, const scalar_t_pair& ident) -> scalar_t_pair {
       scalar_t_pair partial_out(ident);
       for (int64_t i = start; i < end; i++) {
         partial_out = reduce_chunk_func(partial_out, input_data[i]);
@@ -139,7 +139,7 @@ inline void reduce_all_impl_vec_two_outputs(
     Tensor& output1,
     Tensor& output2,
     const Tensor& input,
-    const std::pair<scalar_t, scalar_t> ident_v,
+    const std::pair<scalar_t, scalar_t>& ident_v,
     func_t reduce_acc_func,
     vec_func_t1 reduce_chunk_func1,
     vec_func_t2 reduce_chunk_func2) {
@@ -149,7 +149,7 @@ inline void reduce_all_impl_vec_two_outputs(
   auto input_data = input.data_ptr<scalar_t>();
   // NOTE: parallel_reduce not support bool type
   std::pair<scalar_t, scalar_t> result = at::parallel_reduce(0, input_numel, internal::GRAIN_SIZE, ident_v,
-    [&](int64_t start, int64_t end, const scalar_t_pair ident) -> scalar_t_pair {
+    [&](int64_t start, int64_t end, const scalar_t_pair& /* ident */) -> scalar_t_pair {
       scalar_t_pair partial_out = vec256::reduce2_all<scalar_t>(
         [=](Vec x, Vec y) { return reduce_chunk_func1(x, y); },
         [=](Vec x, Vec y) { return reduce_chunk_func2(x, y); },
```
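The diff above touches a two-output parallel reduction: each chunk folds elements into a `(min, max)` pair via a chunk reducer, and the per-chunk partials are then combined with an accumulator reducer. A serial sketch of that pattern, with simplified stand-in names (`reduce_two_outputs`, `pair_f`, explicit chunk loop) in place of `at::parallel_reduce` and the templated functors, might look like this:

```cpp
#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

using pair_f = std::pair<float, float>;

// Two-output reduction in the style of the kernel: split the input
// into chunks, reduce each chunk into a (min, max) partial starting
// from the identity pair, then combine the partials. The real kernel
// runs the chunk loop in parallel and vectorizes the inner fold.
pair_f reduce_two_outputs(const std::vector<float>& data, int num_chunks) {
    const pair_f ident{std::numeric_limits<float>::max(),
                       std::numeric_limits<float>::lowest()};
    const int64_t n = static_cast<int64_t>(data.size());
    std::vector<pair_f> partials(num_chunks, ident);
    for (int c = 0; c < num_chunks; ++c) {
        const int64_t start = c * n / num_chunks;
        const int64_t end = (c + 1) * n / num_chunks;
        pair_f acc = ident;
        for (int64_t i = start; i < end; ++i) {
            // "reduce_chunk_func": fold one element into both outputs
            acc.first = std::min(acc.first, data[i]);
            acc.second = std::max(acc.second, data[i]);
        }
        partials[c] = acc;
    }
    pair_f result = ident;
    for (const pair_f& p : partials) {
        // "reduce_acc_func": combine per-chunk partial results
        result.first = std::min(result.first, p.first);
        result.second = std::max(result.second, p.second);
    }
    return result;
}
```

The nit itself is simply to pass `ident_v` and `ident` by `const&` rather than by value, avoiding a `std::pair` copy at each call boundary; the reduction logic is unchanged.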
