API: Optimize np.unique for integer arrays; add `kind` parameter by MilesCranmer · Pull Request #21843 · numpy/numpy · GitHub

API: Optimize np.unique for integer arrays; add kind parameter #21843

Open

MilesCranmer wants to merge 15 commits into main from the unique-speed branch

Conversation

@MilesCranmer (Contributor) commented Jun 24, 2022

This commit implements the same lookup table method from #12065 for np.unique.

np.unique is, like np.isin, based on sorting and array comparison. These methods, though more generic, can be greatly sped up for integer dtypes with a lookup table. For unique, this works as follows:

  1. Allocate a zeroed lookup table with one slot per integer in the array's range: x = np.zeros(ar_range)
  2. Fill that table with 1 wherever a number exists in the array: x[ar - ar_min] = 1
  3. Return the indices of the slots that hold 1: return np.nonzero(x)[0] + ar_min

A similar strategy can be used for return_counts=True. Instead of filling the lookup table with 1, you can use the table to count, with, for example,

np.add.at(x, ar - ar_min, 1)

which will count the occurrences of each integer. (Though I am not sure if this is the best way to do it. Comments appreciated!)
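
Putting the pieces together, here is a minimal standalone sketch of the idea (the helper name is hypothetical; this is not the exact implementation in the PR):

import numpy as np

def unique_table_sketch(ar, return_counts=False):
    # Assumes an integer dtype and a value range small enough to allocate.
    ar_min = int(ar.min())
    ar_range = int(ar.max()) - ar_min + 1
    if return_counts:
        # Count occurrences of each shifted value.
        table = np.zeros(ar_range, dtype=np.intp)
        np.add.at(table, ar - ar_min, 1)
        present = np.nonzero(table)[0]
        return present + ar_min, table[present]
    # Mark which shifted values occur at least once.
    table = np.zeros(ar_range, dtype=bool)
    table[ar - ar_min] = True
    return np.nonzero(table)[0] + ar_min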


I can see speedups similar to those found with np.isin for large integer arrays:

In [2]: x = np.random.randint(0, 10_000_000, size=(10_000_000,))
In [3]: %%timeit
   ...: np.unique(x, kind='sort')
828 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %%timeit
   ...: np.unique(x, kind='table')
150 ms ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %%timeit
   ...: np.unique(x, kind='sort', return_counts=True)
876 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %%timeit
   ...: np.unique(x, kind='table', return_counts=True)
329 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments very much appreciated!

cc'ing everybody recently active in the isin PR: @seberg @WarrenWeckesser @hameerabbasi @shoyer @adeak


Mailing list discussion: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/3SJTDDF7D4VLZIUPQKHJMSYU2RVY3MHH/


Edit 1: Improved speed for return_counts=True.

Edit 2: No longer a work in progress.

Edit 3: Added mailing list link

@adeak (Contributor) left a comment

Thanks for the ping.

(Though I am not sure if this is the best way to do it. Comments appreciated!)

np.add.at is definitely how I'd approach this too.

Some early minor remarks below.

@MilesCranmer (Contributor Author)

Thanks!

The heart of the slowdown with return_counts is np.add.at. I am confused about why it is so slow compared to slicing. See the following:
[Screenshot: timing comparison of sliced assignment vs. np.add.at]

That is a 50x slowdown... Does it loop through the indices one by one or something? Aren't there atomic SIMD add operations that could be used instead?

@MilesCranmer (Contributor Author)

You could do normal slicing, except on a 2D array, then sum along one axis. Of course this would be N^2 memory usage...
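
For concreteness, a rough sketch of that 2D variant (illustrative only; the memory cost makes it impractical for anything large):

import numpy as np

N = 1_000  # keep this small: the scratch table below is len(idx) x N
idx = np.random.randint(0, N, size=(N,))

# One row per element: plain sliced assignment, then a sum along one axis.
table2d = np.zeros((len(idx), N), dtype=np.intp)
table2d[np.arange(len(idx)), idx] = 1
counts = table2d.sum(axis=0)  # occurrences of each value in idx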

@adeak (Contributor) commented Jun 24, 2022

I don't know about the implementations, but I assume the "unbuffered" in "Performs unbuffered in place operation" might be relevant. Note that you already lose a factor of 2 if you use += instead of =, and += is the operation that actually corresponds to np.add.at.

Side note: it's better to post code as text rather than images, because people will be able to copy-paste it, and people using screen readers will also be able to read your code and results. Here's more or less your code with N = 100_000:

N = 100_000
counts = np.zeros(N, dtype=np.intp)
idx = np.random.randint(0, N, size=(N,))

>>> %timeit counts[idx] = 1
... %timeit counts[idx] += 1
... %timeit np.add.at(counts, idx, 1)
257 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
486 µs ± 9.57 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
6.36 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@MilesCranmer (Contributor Author)

Right, will do that!

Regarding the +=, I think that operation doesn't actually allow for duplicate indices:

In [7]: N = 100_000
   ...: counts = np.zeros(N, dtype=np.intp)
   ...: idx = np.random.randint(0, 2, size=(N,))

In [8]: counts[idx] += 1

In [9]: np.max(counts)
Out[9]: 1

So I don't think it could be used here unfortunately.

@adeak (Contributor) commented Jun 24, 2022

So I don't think it could be used here unfortunately.

Indeed. My point was just that comparing the performance of np.add.at with an assignment isn't entirely fair to begin with, because they do different things. We're still talking about a factor of 25 in speed though (actually a factor of 12-13 in my runs with N = 100_000).

@MilesCranmer (Contributor Author) commented Jun 24, 2022

Ah, I see, sorry. My point is that even 12x or 25x seems extremely large. Why is it that slow?

I tested this. Here's a barebones implementation in C: https://gist.github.com/MilesCranmer/d262f468d570df4b90c8769ebdcce213

If I call this with gcc -O3 unique.c && ./a.out 10000000 10000000 1, which means generate 10,000,000 elements with a maximum of 10,000,000 and compute the unique values with counting, I get:

Time: 0.064353

About 64 ms. With counting turned off, i.e., gcc -O3 unique.c && ./a.out 10000000 10000000 0, I see the following:

Time: 0.038059

About 38 ms. So there is only a <2x difference...

However, here is the python attempt:

In [2]: x = np.random.randint(0, 10_000_000, size=(10_000_000,))
In [3]: counts = np.zeros(10_000_000, dtype=np.intp)
In [10]: %%timeit
    ...: counts[x] = 1
38.3 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: counts[:] = 0
In [12]: %%timeit
    ...: np.add.at(counts, x, 1)
1.14 s ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the counting version is much, much slower. Maybe np.add.at is written in pure Python or something?


Edit: I incorrectly passed a bool, so the times were the same regardless of the if statement. Now fixed: about a 2x difference overall.

Edit 2: Moved C example to a gist

@MilesCranmer (Contributor Author) commented Jun 26, 2022

Good news: I sped up the new method for return_counts=True by using np.bincount instead of np.add.at.
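
Roughly, the counting path now looks like this (a sketch, not the exact diff):

import numpy as np

x = np.random.randint(0, 10_000_000, size=(10_000_000,))
x_min = int(x.min())

# np.bincount counts every non-negative integer in a single pass,
# replacing the much slower np.add.at scatter-add.
table = np.bincount(x - x_min)
present = np.nonzero(table)[0]
values, counts = present + x_min, table[present]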

Here are the updated speeds, showing kind='table' in the lead both with and without return_counts=True:

In [2]: x = np.random.randint(0, 10_000_000, size=(10_000_000,))
In [3]: %%timeit
   ...: np.unique(x, kind='sort')
828 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %%timeit
   ...: np.unique(x, kind='table')
150 ms ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %%timeit
   ...: np.unique(x, kind='sort', return_counts=True)
876 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %%timeit
   ...: np.unique(x, kind='table', return_counts=True)
329 ms ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@MilesCranmer (Contributor Author)

Okay, I think the core API should be ready for a review. I checked through the unit tests, and many of them are for non-integral types, so it won't be as easy to pytest.mark.parametrize them over kind. I did try this, but for about 90% of them I had to manually skip kind='table'.
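
To give a flavour, this is roughly what the parametrization attempt looked like (test bodies are illustrative, not the actual test suite):

import numpy as np
import pytest

@pytest.mark.parametrize("kind", ["sort", "table"])
def test_unique_counts(kind):
    a = np.array([5, 2, 2, 3, 5, 5], dtype=np.int64)
    values, counts = np.unique(a, return_counts=True, kind=kind)
    assert values.tolist() == [2, 3, 5]
    assert counts.tolist() == [2, 1, 3]

@pytest.mark.parametrize("kind", ["sort", "table"])
def test_unique_strings(kind):
    # Non-integral dtypes have to be skipped for the table method.
    if kind == "table":
        pytest.skip("kind='table' only supports integer dtypes")
    a = np.array(["b", "a", "b"])
    assert np.unique(a, kind=kind).tolist() == ["a", "b"]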

@MilesCranmer (Contributor Author) commented Jun 27, 2022

Here are the results of some newly added benchmarks. They show >12x speedups for large integer arrays, including with return_counts=True. (Run with python runtests.py --bench-compare v1.23.0 bench_lib.Unique.)

The only regressions appear for very tiny arrays, where the differences are on the order of microseconds. I think that is entirely due to the check for which method to use. If those regressions are not acceptable, I could switch off the new method for small arrays before the range computation is done; a hypothetical sketch of such a guard follows the tables below.

       before           after         ratio  bench_lib.Unique.<test>(<array_size>, <array_range>)
     [54c52f13]       [158f25b3]
     <v1.23.0^0>       <unique-speed>
+      7.69±0.03s         489±20ms     0.06  bench_lib.Unique.time_unique(100000000, 1778279)
+      5.81±0.03s         447±10ms     0.08  bench_lib.Unique.time_unique(100000000, 31622)
+       101±0.7ms       8.09±0.7ms     0.08  bench_lib.Unique.time_unique(1778279, 31622)
+         101±1ms         9.18±1ms     0.09  bench_lib.Unique.time_unique_with_counts(1778279, 31622)
+     1.22±0.01ms         111±40μs     0.09  bench_lib.Unique.time_unique(31622, 562)
+        70.6±1ms       7.84±0.9ms     0.11  bench_lib.Unique.time_unique(1778279, 562)
+      3.98±0.04s         442±10ms     0.11  bench_lib.Unique.time_unique(100000000, 562)
+      7.74±0.06s        870±200ms     0.11  bench_lib.Unique.time_unique_with_counts(100000000, 1778279)
+      69.6±0.4ms       8.89±0.7ms     0.13  bench_lib.Unique.time_unique_with_counts(1778279, 562)
+       131±0.9ms         19.0±1ms     0.14  bench_lib.Unique.time_unique(1778279, 1778279)
+         9.20±0s       1.49±0.09s     0.16  bench_lib.Unique.time_unique(100000000, 100000000)
+         607±2μs         111±40μs     0.18  bench_lib.Unique.time_unique(31622, 10)
+      2.37±0.04s         444±10ms     0.19  bench_lib.Unique.time_unique(100000000, 10)
+      38.4±0.9ms         7.35±1ms     0.19  bench_lib.Unique.time_unique(1778279, 10)
+      2.33±0.06s        462±200ms     0.20  bench_lib.Unique.time_unique_with_counts(100000000, 10)
+     1.72±0.01ms         357±20μs     0.21  bench_lib.Unique.time_unique(31622, 31622)
+     1.24±0.01ms         260±50μs     0.21  bench_lib.Unique.time_unique_with_counts(31622, 562)
+        38.8±1ms       8.92±0.9ms     0.23  bench_lib.Unique.time_unique_with_counts(1778279, 10)
+         139±3ms         38.2±3ms     0.28  bench_lib.Unique.time_unique_with_counts(1778279, 1778279)
+     1.91±0.02ms        717±100μs     0.38  bench_lib.Unique.time_unique_with_counts(31622, 31622)
+         630±6μs         253±40μs     0.40  bench_lib.Unique.time_unique_with_counts(31622, 10)

Slowdowns:

-      11.1±0.1μs       20.2±0.9μs     1.82  bench_lib.Unique.time_unique(562, 562)
-      19.0±0.7μs         34.4±7μs     1.82  bench_lib.Unique.time_unique_with_counts(562, 562)
-      10.1±0.2μs       22.3±0.9μs     2.20  bench_lib.Unique.time_unique(562, 1778279)
-     14.4±0.08μs         32.2±8μs     2.24  bench_lib.Unique.time_unique_with_counts(562, 10)
-      9.92±0.2μs       22.3±0.7μs     2.25  bench_lib.Unique.time_unique(562, 31622)
-      10.7±0.2μs         24.3±7μs     2.27  bench_lib.Unique.time_unique_with_counts(10, 10)
-     9.75±0.07μs       22.3±0.6μs     2.29  bench_lib.Unique.time_unique(562, 100000000)
-      8.22±0.1μs         19.1±1μs     2.32  bench_lib.Unique.time_unique(562, 10)
-      18.2±0.6μs        49.0±20μs     2.69  bench_lib.Unique.time_unique_with_counts(562, 1778279)
-      10.5±0.1μs        29.7±10μs     2.83  bench_lib.Unique.time_unique_with_counts(10, 31622)
-      16.9±0.1μs        49.1±10μs     2.90  bench_lib.Unique.time_unique_with_counts(562, 31622)
-      17.4±0.1μs        52.1±20μs     2.99  bench_lib.Unique.time_unique_with_counts(562, 100000000)
-      10.7±0.1μs         32.4±8μs     3.04  bench_lib.Unique.time_unique_with_counts(10, 100000000)
-     10.4±0.09μs        33.0±10μs     3.18  bench_lib.Unique.time_unique_with_counts(10, 562)
-      10.6±0.1μs        37.5±10μs     3.55  bench_lib.Unique.time_unique_with_counts(10, 1778279)
-     4.43±0.04μs         16.3±1μs     3.67  bench_lib.Unique.time_unique(10, 31622)
-      4.46±0.1μs         16.7±2μs     3.74  bench_lib.Unique.time_unique(10, 100000000)
-     4.33±0.08μs         17.1±1μs     3.95  bench_lib.Unique.time_unique(10, 1778279)
-     4.41±0.03μs       17.5±0.6μs     3.98  bench_lib.Unique.time_unique(10, 562)
-     4.50±0.06μs         18.1±5μs     4.02  bench_lib.Unique.time_unique(10, 10)
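
For illustration, a guard along these lines could keep tiny arrays on the sort path (names and threshold are made up, not the PR's code; it assumes kind=None means "choose automatically"):

SMALL_ARRAY_CUTOFF = 1000  # invented threshold, would need benchmarking

def should_try_table(ar, kind):
    if kind == "table":
        return True          # user explicitly asked for the table method
    if kind == "sort" or ar.size < SMALL_ARRAY_CUTOFF:
        return False         # tiny arrays: overhead outweighs the benefit
    return ar.dtype.kind in "bui"  # booleans and (un)signed integers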

@MilesCranmer (Contributor Author)

Posted to the mailing list.

@seberg (Member) commented Jun 28, 2022

Posted to the mailing list.

I have not seen any email; maybe you used the wrong list? This is the right one: https://mail.python.org/mailman3/lists/numpy-discussion.python.org/

@MilesCranmer (Contributor Author)

Ah, I see, sorry. It said: "Your message to the NumPy-Discussion mailing-list was rejected for the following reasons: The message is not from a list member."

Will do it correctly this time...

seberg added a commit that referenced this pull request Jun 29, 2022
This PR fixes the issue discussed in #12065 and #21843 where 'timedelta64' was noted to be a subtype of numpy.integer. This in principle should detect any cases where int(np.min(ar2)) fails. This PR also adds unit tests for these.

* TST: Create in1d test for timedelta input

* MAINT: fix in1d for timedelta input

* TST: in1d raise ValueError for timedelta input

* MAINT: Clean up type checking for isin kind="table"

* TST: Add test for mixed boolean/integer in1d

* MAINT: Increase readability of in1d type checking

* STY: Apply small code style tweaks

This is probably really mainly my personal opinion...

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@MilesCranmer force-pushed the unique-speed branch 3 times, most recently from 67da234 to 837101a on April 16, 2023 04:26
@MilesCranmer (Contributor Author)

Hey @seberg!

It's been a while, but I realized we never merged this. I have a bit of time, so I was wondering if we could push on this PR and try to get it merged? I just updated my branch against main and verified that the tests all pass.

Recent update: there was an issue with kind='table' + dtype=bool, so I implemented the same conversion to integers we use in in1d.
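
Concretely, the fix amounts to something like this (a sketch; the actual code mirrors the handling in in1d):

import numpy as np

def as_table_compatible(ar):
    # Hypothetical helper: the table method indexes with integer values,
    # so booleans are converted to 0/1 uint8 first, as in in1d.
    if ar.dtype == bool:
        return ar.astype(np.uint8)
    return ar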

Cheers,
Miles
