8000 Introduce SIMD intrinsics for `_dist_metrics.pyx` · Issue #26010 · scikit-learn/scikit-learn · GitHub
Open
Micky774 opened this issue Mar 28, 2023 · 23 comments

Comments

@Micky774 (Contributor) commented Mar 28, 2023

Context

Pairwise distance computation is an essential part of many estimators in scikit-learn, and can take up a significant portion of run time in certain workflows. I believe that we may achieve significant performance gains in several (perhaps most) distance metric implementations by leveraging SIMD intrinsics.

Proof of Concept

I built a quick proof of concept to see what kinds of performance gains we could observe with a potentially naive use of SIMD intrinsics. I chose to optimize the ManhattanDistance.dist function. This implementation uses intrinsics found in SSE{1,2,3}. To ensure that the instructions are supported, it checks for the presence of the SSE3 instruction set (SSE3 implies SSE{1,2}) and provides the optimized implementation if so. Otherwise it provides a dummy implementation just to appease Cython, and the main function falls back to the current implementation on main. Note that on most modern hardware, support for SSE3 is a reasonable expectation (indeed, NumPy assumes it is always present when optimization is enabled). For the specific implementation, please take a look at this PR: Micky774#11

Note that the full benefit of the intrinsics is gained when compiling with -march="native"; however, the benefit is still significant when compiling with -march="nocona", as is often the default (e.g. when following the scikit-learn development instructions on Linux).
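For a sense of what such a kernel looks like, here is a minimal sketch (not the PR's actual code) of a vectorized Manhattan distance for float32. It restricts itself to SSE2 intrinsics, which are baseline on x86-64 and so compile without extra flags; the PoC additionally uses SSE3 instructions such as horizontal adds:

```cpp
#include <immintrin.h>  // SSE2 intrinsics
#include <cmath>
#include <cstddef>

// Sketch: Manhattan distance over float32, 4 lanes per iteration.
float manhattan_sse(const float* x, const float* y, std::size_t n) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  // only the sign bit set
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i));
        // |d|: clear the sign bit of each lane
        acc = _mm_add_ps(acc, _mm_andnot_ps(sign_mask, d));
    }
    // Horizontal sum of the 4 accumulator lanes
    // (SSE3's _mm_hadd_ps would shorten this).
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    float total = _mm_cvtss_f32(sums);
    for (; i < n; ++i)  // scalar tail for n not divisible by 4
        total += std::fabs(x[i] - y[i]);
    return total;
}
```

The scalar tail mirrors the fallback path; a production version would also need the compile-time or runtime capability checks discussed in the comments below.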

Benchmarks

The following benchmarks were produced by this gist: https://gist.github.com/Micky774/567a5fa199c05d90c4c08625b077840e

Summary: The SIMD implementations are ~2x faster than the current implementation for float32 and ~1.5x faster for float64.

Plots

[Benchmark plots from the gist: runtime comparison of the SIMD vs. current implementations for float32 and float64]

Discussion

I haven't looked too deeply into this yet, as first I wanted to see whether there was interest in the venture. I would love to hear what the other maintainers' thoughts are regarding exploring this route in a bit more detail. Obviously SIMD implementations will bring with them added complexity, but the performance gains are pretty compelling. In my opinion, the tradeoff is worth it.

CC: @scikit-learn/core-devs

@jjerphan (Member)

Hi @Micky774,

Thank you for exploring this! I think those really are encouraging results. 💯

If we aim for performance, I think using intrinsics is relevant.

Yet in the past, the use of lower-level languages in scikit-learn has been discussed, and I do not think maintainers agree on whether we should use them.

If I remember correctly, for some maintainers the use of such languages makes the code less approachable for readers and goes against some elements of the project's culture. Personally, those are arguments I do not fully understand, but I would be happy to continue the discussion.

Alternatively, we could imagine relying on SciPy's implementations of distance metrics (which are performant), but this might also create supplementary maintenance work in their codebase.

I am looking forward to other maintainers' opinions. In the meantime, I will review your PR.

@NicolasHug (Member) commented Mar 29, 2023

To ensure that the instructions are supported, it checks for the presence of the SSE3 instruction set (SSE3 implies SSE{1,2}) and provides the optimized implementation if so

But these are compile-time checks right? So if the binaries are compiled on a machine that supports SSE3, the binary will try to execute the intrinsics instructions, even if the end-users' machine doesn't support it?

Perhaps this is less of an issue for SSE3 since, as @Micky774 mentioned, it should be widely available. But FWIW, SIMD support and packaging is a hairy issue, and that might be a challenge for scikit-learn. Here's what PIL-SIMD has to say about why it's not upstreamed into PIL directly:

Why do not contribute SIMD to the original Pillow
Well, it's not that simple. First of all, the original Pillow supports a large number of architectures, not just x86. But even for x86 platforms, Pillow is often distributed via precompiled binaries. In order for us to integrate SIMD into the precompiled binaries we'd need to execute runtime CPU capabilities checks. To compile the code this way we need to pass the -mavx2 option to the compiler. But with the option included, a compiler will inject AVX instructions even for SSE functions (i.e. interchange them) since every SSE instruction has its AVX equivalent. So there is no easy way to compile such library, especially with setuptools.

In PyTorch, they work around this problem by compiling the optimized code with multiple instruction sets, and then leveraging the PyTorch dispatcher at run time to figure out which code path to use depending on the user's machine. But this is probably way too complex for scikit-learn.

EDIT: as far as I can tell the pytorch setup is very similar to that of numpy that Tim linked to below #26010 (comment)
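To make the dispatch idea concrete, here is a hypothetical sketch of the function-pointer pattern (names are illustrative; the real NumPy/PyTorch machinery also handles build-system concerns, and the vectorized kernel would be compiled in a separate translation unit with the appropriate -m flags):

```cpp
#include <cmath>
#include <cstddef>

// Baseline kernel: always safe to run.
static float dist_scalar(const float* x, const float* y, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += std::fabs(x[i] - y[i]);
    return s;
}

// Stand-in for an SSE3-optimized kernel that would live in a translation
// unit compiled with -msse3; it simply forwards to the scalar version here.
static float dist_sse3(const float* x, const float* y, std::size_t n) {
    return dist_scalar(x, y, n);
}

using dist_fn = float (*)(const float*, const float*, std::size_t);

// Queried once (e.g. at module import); everything else calls through
// the returned pointer. __builtin_cpu_supports is a GCC/Clang builtin.
dist_fn select_dist() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse3")) return dist_sse3;
#endif
    return dist_scalar;  // safe fallback on older or non-x86 CPUs
}
```

The key point is that the capability check happens on the end user's machine, not on the build machine, so a single wheel can ship both code paths.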

@glemaitre (Member)

Summary: The SIMD implementations are ~2x faster than the current implementation for float32 and 1.5x faster for float64.

What I am wondering is if these 2x and 1.5x are also obtained at the level of the estimator. My feeling here is that we will introduce a layer of complexity that we probably want to extend to a lot of subroutines. Is the complexity worth the 1.5x to 2x speed-up?

@vene (Member) commented Mar 29, 2023 via email

@betatim (Member) commented Mar 29, 2023

While thinking about this I came across some interesting links to read, notably NumPy's documentation of its SIMD optimization infrastructure.

@Micky774 (Contributor, Author)

But these are compile-time checks right? So if the binaries are compiled on a machine that supports SSE3, the binary will try to execute the intrinsics instructions, even if the end-users' machine doesn't support it?

Yes. Regarding this, we could either follow NumPy's strategy of building the superset and dispatching at runtime, or build minimally and specify which features we enable, to maximize portability. The PIL-SIMD discussion is based on enabling AVX2; however, I don't think we would need to if we only wanted a minimal optimization.

With that being said, NumPy and PyTorch have already built out viable runtime dispatch systems for SIMD intrinsics. While we can't quite apply their solutions one-to-one (and even if we could it would still be a lot of effort), we have a lot we could learn from them if we did decide to pursue that solution. As an aside, I'm personally more familiar with NumPy's system.

Just to minimize scope right now, I'd rather stay within SSE3 instructions since those are still incredibly portable afaik.

What I am wondering is if these 2x and 1.5x are also obtained at the level of the estimator. My feeling here is that we will introduce a layer of complexity that we probably want to extend to a lot of subroutines. Is the complexity worth the 1.5x to 2x speed-up?

Almost surely the gains would not be that dramatic for all estimators (or perhaps even most estimators); however, this would help those where distance computations are a potent bottleneck. I imagine such estimators would mostly be those which now use the PairwiseDistanceReduction backend. I also think distance computation will steadily become a more prominent bottleneck over time.

I think the main resistance would be to adding new c/c++ code to the
codebase due to the maintenance burden on new contributors. (not sure how
our stance on that changed since ive last been active though!)

Thanks for offering this point. It is one of my greatest apprehensions towards this as well. I feel like only recently, with the immense work of several maintainers, has the momentum for Cython contributions been increasing (shoutout to @thomasjpfan and @jjerphan for helping me learn Cython), and yet it still remains a rare target for non-maintainer contributions. The introduced SIMD intrinsics and surrounding C code would be even harder to foster support for.

With that being said, I think thorough education -- both in documentation and reviews -- can help mitigate that. I think if we keep the use in a very limited scope and focus on clarity, then we can even develop those skills in our community. I've done some work on the NumPy side of things, and I do adore that about their community: a lot of folks are eager to learn new skills when fumbling around the codebase and seeing syntax they had never seen before :)

@NicolasHug (Member)

Re Cython vs C++:

FWIW, I personally don't think that Cython (as we use it in scikit-learn) is a lower barrier to entry than C / C++. It might have been the case a long time ago, but today we make use of some very advanced Cython features and hit a bunch of edge cases, all of which require strong expertise. I remember spending days of trial and error trying to figure out how to do something in Cython that would have taken me 10 minutes in C, simply because there are far fewer resources on how to use Cython properly compared to C (cython/cython#2813, #17299).

What makes C / C++ hard for most Python programmers isn't the syntax, it's the [lack of] memory management and all the memory pointer handling. This isn't something that Cython makes easier - not in the scikit-learn code base at least.

I would suspect that those who are competent enough to write Cython code for scikit-learn would be just as competent at writing C/C++ code.

(This is not a dig at Cython, I love Cython, and I know how important it is to the ecosystem).

@GaelVaroquaux (Member) commented Mar 29, 2023 via email

@betatim (Member) commented Mar 29, 2023

Could we have a benchmark on something like KMeans as an example of how the pure algorithm improvements translate to "estimator improvements"?

After a quick scan of the NumPy resources linked above, I think the infrastructure surrounding the code that actually uses the intrinsics represents a significant amount of work.

I like the comment from Gael. One of the ideas (the idea?) behind plugins for computational backends is that anyone can write a plugin that does crazy things (as considered from the viewpoint of core scikit-learn) and try out other ideas to achieve performance improvements. Most of the time people think of GPUs, but why not have a scikit-learn-with-all-the-AVXs plugin? You would probably need less infrastructure code because you can also just say "Don't install this if you are on an RPi!"

@jjerphan (Member) commented Mar 29, 2023

I think Cython is still a good compromise and the best short- and medium-term solution for scikit-learn IMO. Yet the more we advance, the more we face limitations of Cython as a language, beyond problems solved with workarounds based on undocumented features or aspects of Cython.

After having spent several thousand hours reading and writing Cython code, and reading Cython's source code while trying to contribute to the project, I can say that:

  • I like Cython for:

    • its concision
    • its learning curve thanks to interactive mode in Jupyter (I am probably biased for having programmed in C or C++ before)
    • the presence of interfaces to BLAS in SciPy which removes some burden from scikit-learn maintainers' life ❤️
  • I do not like Cython for:

    • the lack of idioms for coming up with flexible designs with reliable and safe execution
    • the lack of clear and unified documentation
    • the paths and experience for contributing to Cython itself (Pyrex predates PEP 8, a lot of complexity, an unfortunately small bus factor, etc.)
    • the small pool of developers (and thus reviewers), which might make authors and reviewers wait and perform context switches over months, making motivation fade quickly

Still, if we were to reconsider Cython against alternatives, I think we would need to perform a qualitative analysis beforehand, e.g. for an alternative X:

  • Can we enforce conventions?
  • Do people like developing using X over Cython?
  • Is it compelling and convenient for people to learn and to express their ideas with X over Cython?
  • Is X properly documented and tooled?
  • Does X have other limitations?

We might also want to perform an evidence-backed analysis (e.g. ideally: can we reach optimal and safe execution in a prototype first, without changing scikit-learn's UX and packaging?). This is not negligible work.

I think the plugin API might make experimenting easier in this regard. 👍

@jeremiedbb (Member)

I'm not very enthusiastic about such low-level optimizations, not only because of the added complexity and maintenance burden, but mainly because I think scikit-learn is not the appropriate place for that (most of the time).

What I mean is that, to me, pairwise distances (for instance, since the PoC is about that) are a generic brick of scientific computing. I don't think it's the role of scikit-learn to implement state-of-the-art pairwise distances. Instead, I think this role belongs to libraries that expose common tools for scientific computing like SciPy, or even to dedicated libraries like KeOps.

While I definitely see the value of optimizing such algorithms, I think it would benefit a lot more people if we were able to contribute that upstream rather than keep optimizing it on our own.

Another thing to consider imo is the balance between the potential gain, the added complexity and the user interest. The more time we dedicate to performance improvement the less time we have to improve the usability of the library for instance. Since there are ongoing discussions regarding the future of the project I think it's something to keep in mind.

Finally, regarding this PoC specifically, before diving into SIMD there might be simple tricks that we should try first. In general, we should try simpler solutions first and increase complexity when they fail. Here, for instance, a manual unrolling might give similar results for a lot less added complexity (see this quick benchmark: Micky774#11 (comment)).
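For context, the manual unrolling being referred to is roughly the following (a sketch, not the benchmarked code): accumulating into independent variables breaks the loop-carried dependency chain, which lets the compiler and the CPU's out-of-order engine overlap the additions, and often lets the compiler auto-vectorize without any intrinsics.

```cpp
#include <cmath>
#include <cstddef>

// Sketch: 4-way manually unrolled Manhattan distance with
// independent accumulators.
double manhattan_unrolled(const double* x, const double* y, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += std::fabs(x[i] - y[i]);
        s1 += std::fabs(x[i + 1] - y[i + 1]);
        s2 += std::fabs(x[i + 2] - y[i + 2]);
        s3 += std::fabs(x[i + 3] - y[i + 3]);
    }
    for (; i < n; ++i) s0 += std::fabs(x[i] - y[i]);  // scalar tail
    return (s0 + s1) + (s2 + s3);
}
```

Note that, for floats, the reassociated summation order can produce slightly different rounding than a straight loop, which is one reason compilers don't do this transformation by default.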

@Micky774 (Contributor, Author)

Finally, regarding this PoC specifically, before diving into SIMD there might be simple tricks that we should try first. In general, we should try simpler solutions first and increase complexity when they fail. Here, for instance, a manual unrolling might give similar results for a lot less added complexity (see this quick benchmark: Micky774#11 (comment)).

Just wanted to update the conversation on this issue to match the developments in the PR: it seems the manual loop unrolling mainly benefits GCC-compiled code (or at least, the benefits are not observed when compiling with Clang), and it is independent of the SIMD optimization, so the two can be combined (cf. Micky774#11 (comment), Micky774#11 (comment)).

@Micky774 (Contributor, Author) commented Mar 29, 2023

I think Cython is still a good compromise and the best short- and medium-term solution for scikit-learn IMO. Yet the more we advance, the more we face limitations of Cython as a language.

I wanted to mention that you make wonderful points, and I feel like this is a conversation that will soon need to be had (even if just to affirm our use of Cython). Personally, I've been growing fond of the idea of writing in C/C++ directly and using Cython as a quick and easy way of binding to Python functions (e.g. my prototype SIMD implementation). SciPy does something similar -- writing in C++ directly and wrapping/binding it for Python -- so we could look more closely at their solutions. Note, however, that I do not have much experience with this strategy and can't properly advocate for it (yet).

@jjerphan (Member)

I think C/C++/SYCL implementations (cc @fcharras, who is working on GPU implementations for scikit-learn) with Python bindings (done with Cython, pybind11, or nanobind) are well suited.

Yet I think we first need to consider current performance bottlenecks, the complexity currently handled by Cython (e.g. how to compile and link against BLAS?), the number of people interested in such development, and ongoing work on performance (algorithm redesign, Array API support, Cython implementations, GPU implementations, etc.).

Currently, I do not know what people think about exploring such alternatives for scikit-learn.

@betatim (Member) commented Mar 30, 2023

I will wave the plugin flag once more :D I think plugins are an excellent way to get answers to all of the questions raised above, because anyone can write a plugin and use whatever technology they want to do it. Then we have some concrete code that can either be abandoned (we learnt something we didn't realise before implementation), maintained as a separate thing (maybe you want to iterate much faster than scikit-learn releases, serve a small niche of users, use exotic optimisation tricks, etc.), or integrated into the core of scikit-learn (if it proves beneficial and not too burdensome to maintain).

(Maybe the plugin stuff we are working on is not yet there in this fantastic land, but that is where we are trying to go. Or at least trying to reach this promised land is what gets me excited about plugins.)

@Micky774 (Contributor, Author) commented Apr 7, 2023

I think at this point I agree that there are better ways to handle this than natively supporting/requiring SSE3. In particular, what do people think about the following strategy:

  1. Update some metric arguments to accept DistanceMetric objects directly
  2. Provide separate SIMD-accelerated implementations of different distance metrics in perhaps a scikit-learn-contrib package
  3. Offer them as an alternative option in the documentation for those users looking to squeeze out more performance

@GaelVaroquaux (Member) commented Apr 10, 2023 via email

@jjerphan (Member) commented Apr 10, 2023

I also think developing plugins to experiment with keeping the UX of scikit-learn unchanged is the best approach.

Working on a plugin for SIMD-accelerated implementations of distance metrics might also help make progress in the plugin discussions (see #22438).

I started playing around a bit with xsimd, and it looks like a good candidate for such implementations. xsimd supports runtime dispatch to the best architecture, as well as dispatch to the best implementation based on memory (non-)alignment. Having a few C++ functions using SIMD (similar to the ones in xsimd's examples) backing Cython interfaces à la DistanceMetric should work and should be sufficient.

How does that sound to people? 🙂

Ping to @betatim, and @Micky774 who showed interest in previous comments. 🏓

@lorentzenchr (Member)

AFAIK, Apache Arrow C++ uses xsimd. More generally, can we ask some libraries about their experience with xsimd? A second point is that we would then officially allow more C++. I personally would be fine with this. C++ templates, for instance, seem way cleaner and more future-proof than Cython's Tempita templating. The point is that this should be a deliberate decision.

@jjerphan (Member) commented Apr 14, 2023

AFAIK, Apache Arrow C++ uses xsimd

Yes, and the developers of Firefox have also recently chosen to vendor xsimd and to use it for their web audio engine.

@jjerphan (Member)

I personally would be fine with this. C++ templates, for instance, seem way cleaner and more future-proof than Cython's Tempita templating. The point is that this should be a deliberate decision.

What do you think of first developing a plugin to isolate the C++ implementations from scikit-learn's codebase, coming up with a comparison of execution time and maintainability, and then deciding on a solution?

@lorentzenchr (Member)

Just for info: NumPy went with Google's Highway in numpy/numpy#24018.

@jjerphan (Member) commented Dec 3, 2023

IIRC, the (implicit?) consensus on the mailing list was to first assess the value of efficient non-Euclidean distance metric computations for end users.

9 participants