8000 manhattan_distances for sparse matrices is slow · Issue #14304 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
manhattan_distances for sparse matrices is slow #14304
Closed
@ptocca

Description

@ptocca

Description

sklearn.metrics.pairwise.manhattan_distances() is very slow when applied to sparse matrices. The sparse matrix implementation uses the cython function _sparse_manhattan() in sklearn.metrics.pairwise_fast.pyx. The implementation uses an admittedly simple strategy, which turns out to be inefficient, in particular when the matrix has many features.

Steps/Code to Reproduce

I illustrate the issue in this gist on a 2000 x 10000 CSR matrix with density 0.001.
The gist shows my alternative Cython implementation which is orders of magnitude faster.

The current implementation takes 41s, the improved implementation 108ms (on 6 cores of an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz).

If the maintainers are willing to consider accepting a Pull Request on this, I'd be happy to submit one.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0