Description
Description
sklearn.metrics.pairwise.manhattan_distances()
is very slow when applied to sparse matrices. The sparse matrix implementation uses the cython function _sparse_manhattan()
in sklearn.metrics.pairwise_fast.pyx
. The implementation uses an admittedly simple strategy, which turns out to be inefficient, in particular when the matrix has many features.
Steps/Code to Reproduce
I illustrate the issue in this gist on a 2000 x 10000 CSR matrix with density 0.001.
The gist shows my alternative Cython implementation which is orders of magnitude faster.
The current implementation takes 41s, the improved implementation 108ms (on 6 cores of an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz).
If the maintainers are willing to consider accepting a Pull Request on this, I'd be happy to submit one.