API: Optimize np.unique for integer arrays; add `kind` parameter by MilesCranmer · Pull Request #21843 · numpy/numpy · GitHub

API: Optimize np.unique for integer arrays; add kind parameter #21843

Open

MilesCranmer wants to merge 15 commits into main from the unique-speed branch

Conversation

@MilesCranmer (Contributor) commented Jun 24, 2022

This commit implements the same lookup table method from #12065 for np.unique.

np.unique is, like np.isin, based on sorting and array comparison. These methods, though more generic, can be greatly sped up for integer dtypes with a lookup table. For unique, this works as follows:

  1. Allocate a zeroed lookup table with one slot per integer in the array's range: x = np.zeros(ar_range)
  2. Fill that table with 1 wherever a number exists in the array: x[ar - ar_min] = 1
  3. Return the indices of the slots that hold 1: return np.nonzero(x)[0] + ar_min

A similar strategy can be used for return_counts=True. Instead of filling the lookup table with 1, you can use the table to count, with, for example,

np.add.at(x, ar - ar_min, 1)

which will count the occurrences of each integer. (Though I am not sure if this is the best way to do it. Comments appreciated!)
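
Putting the pieces together, here is a minimal standalone sketch of the idea (the helper name is hypothetical; this is not the exact implementation in the PR):

import numpy as np

def unique_table_sketch(ar, return_counts=False):
    # Assumes an integer dtype and a value range small enough to allocate.
    ar_min = int(ar.min())
    ar_range = int(ar.max()) - ar_min + 1
    if return_counts:
        # Count occurrences of each shifted value.
        table = np.zeros(ar_range, dtype=np.intp)
        np.add.at(table, ar - ar_min, 1)
        present = np.nonzero(table)[0]
        return present + ar_min, table[present]
    # Mark which shifted values occur at least once.
    table = np.zeros(ar_range, dtype=bool)
    table[ar - ar_min] = True
    return np.nonzero(table)[0] + ar_min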


I can see speedups similar to those found with np.isin for large integer arrays:

In [2]: x = np.random.randint(0, 10_000_000, size=(10_000_000,))
In [3]: %%timeit
   ...: np.unique(x, kind='sort')
828 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %%timeit
   ...: np.unique(x, kind='table')
150 ms ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %%timeit
   ...: np.unique(x, kind='sort', return_counts=True)
876 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %%timeit
   ...: np.unique(x, kind='table', return_counts=True)
329 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments very much appreciated!

cc'ing everybody recently active in the isin PR: @seberg @WarrenWeckesser @hameerabbasi @shoyer @adeak


Mailing list discussion: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/3SJTDDF7D4VLZIUPQKHJMSYU2RVY3MHH/


Edit 1: Improved speed for return_counts=True.

Edit 2: No longer a work in progress.

Edit 3: Added mailing list link

@adeak (Contributor) left a comment

Thanks for the ping.

(Though I am not sure if this is the best way to do it. Comments appreciated!)

np.add.at is definitely how I'd approach this too.

Some early minor remarks below.

@MilesCranmer (Contributor Author)

Thanks!

The heart of the slowdown with return_counts is np.add.at. I am confused about why it is so slow compared to slicing. See the following:
[Screenshot: timing comparison of sliced assignment vs. np.add.at]

That is a 50x slowdown... Does it loop through the indices one by one or something? Aren't there atomic SIMD add operations that could be used instead?

@MilesCranmer (Contributor Author)

You could do normal slicing, except on a 2D array, then sum along one axis. Of course this would be N^2 memory usage...
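
For concreteness, a rough sketch of that 2D variant (illustrative only; the memory cost makes it impractical for anything large):

import numpy as np

N = 1_000  # keep this small: the scratch table below is len(idx) x N
idx = np.random.randint(0, N, size=(N,))

# One row per element: plain sliced assignment, then a sum along one axis.
table2d = np.zeros((len(idx), N), dtype=np.intp)
table2d[np.arange(len(idx)), idx] = 1
counts = table2d.sum(axis=0)  # occurrences of each value in idx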

@adeak (Contributor) commented Jun 24, 2022

I don't know about the implementations, but I assume the "unbuffered" in "Performs unbuffered in place operation" might be relevant. Note that you already lose a factor of 2 if you use += instead of =, and += is the operation that actually corresponds to np.add.at.

Side note: it's better to post code as text rather than images, because people will be able to copy-paste it, and people using screen readers will also be able to read your code and results. Here's more or less your code with N = 100_000:

N = 100_000
counts = np.zeros(N, dtype=np.intp)
idx = np.random.randint(0, N, size=(N,))

>>> %timeit counts[idx] = 1
... %timeit counts[idx] += 1
... %timeit np.add.at(counts, idx, 1)
257 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
486 µs ± 9.57 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
6.36 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@MilesCranmer (Contributor Author)

Right, will do that!

Regarding the +=, I think that operation doesn't actually allow for duplicate indices:

In [7]: N = 100_000
   ...: counts = np.zeros(N, dtype=np.intp)
   ...: idx = np.random.randint(0, 2, size=(N,))

In [8]: counts[idx] += 1

In [9]: np.max(counts)
Out[9]: 1

So I don't think it could be used here unfortunately.

@adeak (Contributor) commented Jun 24, 2022

So I don't think it could be used here unfortunately.

Indeed. My point was just that comparing the performance of np.add.at with an assignment isn't entirely fair to begin with, because they do different things. We're still talking about a factor of 25 in speed though (actually a factor of 12-13 in my runs with N = 100_000).

@MilesCranmer (Contributor Author) commented Jun 24, 2022

Ah, I see, sorry. My point is that even 12x or 25x seems extremely large. Why is it that slow?

I tested this. Here's a barebones implementation in C: https://gist.github.com/MilesCranmer/d262f468d570df4b90c8769ebdcce213

If I call this with gcc -O3 unique.c && ./a.out 10000000 10000000 1, which means generate 10,000,000 elements with a maximum of 10,000,000 and compute the unique values with counting, I get:

Time: 0.064353

About 64 ms. With counting turned off, i.e., gcc -O3 unique.c && ./a.out 10000000 10000000 0, I see the following:

Time: 0.038059

About 38 ms. So there is only a <2x difference...

However, here is the python attempt:

In [2]: x = np.random.randint(0, 10_000_000, size=(10_000_000,))
In [3]: counts = np.zeros(10_000_000, dtype=np.intp)
In [10]: %%timeit
    ...: counts[x] = 1
38.3 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: counts[:] = 0
In [12]: %%timeit
    ...: np.add.at(counts, x, 1)
1.14 s ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the counting version is much, much slower. Maybe np.add.at is written in pure Python or something?


Edit: I incorrectly passed a bool, so the times were the same regardless of the if statement. Now fixed: about a 2x difference overall.

Edit 2: Moved C example to a gist

@MilesCranmer (Contributor Author) commented Jun 26, 2022

Good news: I sped up the new method for return_counts=True by using np.bincount instead of np.add.at.
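
Roughly, the counting path now looks like this (a sketch, not the exact diff):

import numpy as np

x = np.random.randint(0, 10_000_000, size=(10_000_000,))
x_min = int(x.min())

# np.bincount counts every non-negative integer in a single pass,
# replacing the much slower np.add.at scatter-add.
table = np.bincount(x - x_min)
present = np.nonzero(table)[0]
values, counts = present + x_min, table[present]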

Here are the updated speeds, showing kind='table' in the lead both with and without return_counts=True:

In [2]: x = np.random.randint(0, 10_000_000, size=(10_000_000,))
In [3]: %%timeit
   ...: np.unique(x, kind='sort')
828 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %%timeit
   ...: np.unique(x, kind='table')
150 ms ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %%timeit
   ...: np.unique(x, kind='sort', return_counts=True)
876 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %%timeit
   ...: np.unique(x, kind='table', return_counts=True)
329 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@MilesCranmer (Contributor Author)

Okay, I think the core API should be ready for a review. I checked through the unit tests, and many of them are for non-integral types, so it won't be as easy to pytest.mark.parametrize them over kind. I did try this, but for about 90% of them I had to manually skip kind='table'.
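
To give a flavour, this is roughly what the parametrization attempt looked like (test bodies are illustrative, not the actual test suite):

import numpy as np
import pytest

@pytest.mark.parametrize("kind", ["sort", "table"])
def test_unique_counts(kind):
    a = np.array([5, 2, 2, 3, 5, 5], dtype=np.int64)
    values, counts = np.unique(a, return_counts=True, kind=kind)
    assert values.tolist() == [2, 3, 5]
    assert counts.tolist() == [2, 1, 3]

@pytest.mark.parametrize("kind", ["sort", "table"])
def test_unique_strings(kind):
    # Non-integral dtypes have to be skipped for the table method.
    if kind == "table":
        pytest.skip("kind='table' only supports integer dtypes")
    a = np.array(["b", "a", "b"])
    assert np.unique(a, kind=kind).tolist() == ["a", "b"]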

@MilesCranmer (Contributor Author) commented Jun 27, 2022

Here are the results of some newly added benchmarks. They show >12x speedups for large integer arrays, including with return_counts=True. (Run with python runtests.py --bench-compare v1.23.0 bench_lib.Unique.)

The only regressions appear for very tiny arrays, where the differences are on the order of microseconds. I think that is entirely due to the check for which method to use. If those regressions are not acceptable, I could switch off the new method for small arrays before the range computation is done; a hypothetical sketch of such a guard follows the tables below.

       before           after         ratio  bench_lib.Unique.<test>(<array_size>, <array_range>)
     [54c52f13]       [158f25b3]
     <v1.23.0^0>       <unique-speed>
+      7.69±0.03s         489±20ms     0.06  bench_lib.Unique.time_unique(100000000, 1778279)
+      5.81±0.03s         447±10ms     0.08  bench_lib.Unique.time_unique(100000000, 31622)
+       101±0.7ms       8.09±0.7ms     0.08  bench_lib.Unique.time_unique(1778279, 31622)
+         101±1ms         9.18±1ms     0.09  bench_lib.Unique.time_unique_with_counts(1778279, 31622)
+     1.22±0.01ms         111±40μs     0.09  bench_lib.Unique.time_unique(31622, 562)
+        70.6±1ms       7.84±0.9ms     0.11  bench_lib.Unique.time_unique(1778279, 562)
+      3.98±0.04s         442±10ms     0.11  bench_lib.Unique.time_unique(100000000, 562)
+      7.74±0.06s        870±200ms     0.11  bench_lib.Unique.time_unique_with_counts(100000000, 1778279)
+      69.6±0.4ms       8.89±0.7ms     0.13  bench_lib.Unique.time_unique_with_counts(1778279, 562)
+       131±0.9ms         19.0±1ms     0.14  bench_lib.Unique.time_unique(1778279, 1778279)
+         9.20±0s       1.49±0.09s     0.16  bench_lib.Unique.time_unique(100000000, 100000000)
+         607±2μs         111±40μs     0.18  bench_lib.Unique.time_unique(31622, 10)
+      2.37±0.04s         444±10ms     0.19  bench_lib.Unique.time_unique(100000000, 10)
+      38.4±0.9ms         7.35±1ms     0.19  bench_lib.Unique.time_unique(1778279, 10)
+      2.33±0.06s        462±200ms     0.20  bench_lib.Unique.time_unique_with_counts(100000000, 10)
+     1.72±0.01ms         357±20μs     0.21  bench_lib.Unique.time_unique(31622, 31622)
+     1.24±0.01ms         260±50μs     0.21  bench_lib.Unique.time_unique_with_counts(31622, 562)
+        38.8±1ms       8.92±0.9ms     0.23  bench_lib.Unique.time_unique_with_counts(1778279, 10)
+         139±3ms         38.2±3ms     0.28  bench_lib.Unique.time_unique_with_counts(1778279, 1778279)
+     1.91±0.02ms        717±100μs     0.38  bench_lib.Unique.time_unique_with_counts(31622, 31622)
+         630±6μs         253±40μs     0.40  bench_lib.Unique.time_unique_with_counts(31622, 10)

Slowdowns:

-      11.1±0.1μs       20.2±0.9μs     1.82  bench_lib.Unique.time_unique(562, 562)
-      19.0±0.7μs         34.4±7μs     1.82  bench_lib.Unique.time_unique_with_counts(562, 562)
-      10.1±0.2μs       22.3±0.9μs     2.20  bench_lib.Unique.time_unique(562, 1778279)
-     14.4±0.08μs         32.2±8μs     2.24  bench_lib.Unique.time_unique_with_counts(562, 10)
-      9.92±0.2μs       22.3±0.7μs     2.25  bench_lib.Unique.time_unique(562, 31622)
-      10.7±0.2μs         24.3±7μs     2.27  bench_lib.Unique.time_unique_with_counts(10, 10)
-     9.75±0.07μs       22.3±0.6μs     2.29  bench_lib.Unique.time_unique(562, 100000000)
-      8.22±0.1μs         19.1±1μs     2.32  bench_lib.Unique.time_unique(562, 10)
-      18.2±0.6μs        49.0±20μs     2.69  bench_lib.Unique.time_unique_with_counts(562, 1778279)
-      10.5±0.1μs        29.7±10μs     2.83  bench_lib.Unique.time_unique_with_counts(10, 31622)
-      16.9±0.1μs        49.1±10μs     2.90  bench_lib.Unique.time_unique_with_counts(562, 31622)
-      17.4±0.1μs        52.1±20μs     2.99  bench_lib.Unique.time_unique_with_counts(562, 100000000)
-      10.7±0.1μs         32.4±8μs     3.04  bench_lib.Unique.time_unique_with_counts(10, 100000000)
-     10.4±0.09μs        33.0±10μs     3.18  bench_lib.Unique.time_unique_with_counts(10, 562)
-      10.6±0.1μs        37.5±10μs     3.55  bench_lib.Unique.time_unique_with_counts(10, 1778279)
-     4.43±0.04μs         16.3±1μs     3.67  bench_lib.Unique.time_unique(10, 31622)
-      4.46±0.1μs         16.7±2μs     3.74  bench_lib.Unique.time_unique(10, 100000000)
-     4.33±0.08μs         17.1±1μs     3.95  bench_lib.Unique.time_unique(10, 1778279)
-     4.41±0.03μs       17.5±0.6μs     3.98  bench_lib.Unique.time_unique(10, 562)
-     4.50±0.06μs         18.1±5μs     4.02  bench_lib.Unique.time_unique(10, 10)
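
For illustration, a guard along these lines could keep tiny arrays on the sort path (names and threshold are made up, not the PR's code; it assumes kind=None means "choose automatically"):

SMALL_ARRAY_CUTOFF = 1000  # invented threshold, would need benchmarking

def should_try_table(ar, kind):
    if kind == "table":
        return True          # user explicitly asked for the table method
    if kind == "sort" or ar.size < SMALL_ARRAY_CUTOFF:
        return False         # tiny arrays: overhead outweighs the benefit
    return ar.dtype.kind in "bui"  # booleans and (un)signed integers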

@MilesCranmer (Contributor Author)

Posted to the mailing list.

@seberg (Member) commented Jun 28, 2022

Posted to the mailing list.

I have not seen any email; maybe you used the wrong list? This is the right one: https://mail.python.org/mailman3/lists/numpy-discussion.python.org/

@MilesCranmer (Contributor Author)

Ah, I see, sorry. It said: "Your message to the NumPy-Discussion mailing-list was rejected for the following reasons: The message is not from a list member."

Will do it correctly this time...

seberg added a commit that referenced this pull request Jun 29, 2022
This PR fixes the issue discussed in #12065 and #21843 where 'timedelta64' was noted to be a subtype of numpy.integer. This in principle should detect any cases where int(np.min(ar2)) fails. This PR also adds unit tests for these.

* TST: Create in1d test for timedelta input

* MAINT: fix in1d for timedelta input

* TST: in1d raise ValueError for timedelta input

* MAINT: Clean up type checking for isin kind="table"

* TST: Add test for mixed boolean/integer in1d

* MAINT: Increase readability of in1d type checking

* STY: Apply small code style tweaks

This is probably really mainly my personal opinion...

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@MilesCranmer force-pushed the unique-speed branch 3 times, most recently from 67da234 to 837101a on April 16, 2023 04:26
@MilesCranmer (Contributor Author)

Hey @seberg!

It's been a while, but I realized we never merged this. I have a bit of time, so I was wondering if we could push on this PR and try to get it merged? I just updated my branch against main and verified that the tests all pass.

Recent update: there was an issue with kind='table' + dtype=bool, so I implemented the same conversion to integers we use in in1d.
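
Concretely, the fix amounts to something like this (a sketch; the actual code mirrors the handling in in1d):

import numpy as np

def as_table_compatible(ar):
    # Hypothetical helper: the table method indexes with integer values,
    # so booleans are converted to 0/1 uint8 first, as in in1d.
    if ar.dtype == bool:
        return ar.astype(np.uint8)
    return ar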

Cheers,
Miles
