8000 ufunc.at perfomance >10x too slow · Issue #5922 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content
ufunc.at perfomance >10x too slow #5922
Closed
@d1manson

Description

@d1manson

I have created a Matlab-like accumarray function which tries to squeeze as much performance form numpy as possible for a specific list of functions: sum any all max min mean...etc.

The functions is available as a gist here.

There is another accumarray implementation available on github here, which was originally created by @ml31415, but over the last couple of days has had many of my suggestions incorporated (I have also updated my own gist in line with my recent suggestions).

The original purpose of this other implementation was to use scipy.weave to get massively improved performance over what was assumed to be the best-possible raw-numpy version. However it appears that most of the functions can be fairly heavily optimized without scipy.weave. [It's not really important here, but for reference, the remaining functions which are difficult-to-optimize are min max and prod.]

The main point, however, is that it should be simple to optimize by just using ufunc.at for the relevant ufunc (as this is exactly what ufunc.at is intended for), yet ufunc.at gets miserable performance....about 15x slower than scipy.weave and 10-25x slower than a bunch of carefully written alternative numpy algorithms (where such optimized algorithms exist). Surely ufunc.at could be improved?

Also, and on a separate note, would there be any interest in including accumarray itself in numpy?


Here are some benchmarking stats produced by my function (testing and benchmarking code is inclued at the bottom of the gist). Note that the baseline times are obtained by sorting-spliting-and-looping, using the named numpy function for each group; whereas the optimised functions do some kind of handcrafted vectorised operation in most cases, except max min and prod which use ufunc.at. Note also that the actual observed speedup depends on a variety of properties of the input. Here we are using 100 000 indices uniformly picked from the interval [0, 1000). Specifically, about 25% of the values are 0 (for use with the bool tests), the remainder are uniformly distributed on [-50,25).

std baseline 190.4 ms optimised 7.3 ms ... 26.2 x faster
all baseline 77.2 ms optimised 8.6 ms ... 9.0 x faster
min baseline 65.3 ms optimised 50.0 ms ... 1.3 x faster
max baseline 64.4 ms optimised 45.8 ms ... 1.4 x faster
sum baseline 64.6 ms optimised 2.4 ms ... 27.3 x faster
var baseline 173.4 ms optimised 7.7 ms ... 22.4 x faster
prod baseline 63.5 ms optimised 50.2 ms ... 1.3 x faster
any baseline 75.5 ms optimised 7.0 ms ... 10.8 x faster
mean baseline 100.2 ms optimised 3.7 ms ... 26.9 x faster

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0