Q: why is the correlation metric commented in dist_metrics.pyx? · Issue #4213 · scikit-learn/scikit-learn · GitHub

Q: why is the correlation metric commented in dist_metrics.pyx? #4213


Closed
mvdoc opened this issue Feb 6, 2015 · 7 comments

Comments

@mvdoc
Contributor
mvdoc commented Feb 6, 2015

Hi,

in dist_metrics.pyx the correlation metric is commented out, and the comment states that it is not a 'true' metric. That is correct, since it doesn't satisfy the coincidence axiom (two distinct points can be at zero distance), but it would still be helpful to have such a metric implemented in Cython to speed up computations.

Also, note that the actual formula implemented in the commented code is incorrect. The returned value at the moment is

return (1. - x1Tx2) / sqrt(x1nrm * x2nrm)

but it should be instead

return 1. - (x1Tx2 / sqrt(x1nrm * x2nrm))
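(For reference, here is a minimal NumPy sketch of the corrected formula; `correlation_distance` is a hypothetical helper name, not the actual Cython code. It also shows why this isn't a 'true' metric: two distinct, perfectly correlated vectors end up at zero distance.)

```python
import numpy as np

def correlation_distance(x1, x2):
    # Center both vectors, then apply the corrected formula:
    # 1 - x1Tx2 / sqrt(x1nrm * x2nrm)
    x1c = x1 - x1.mean()
    x2c = x2 - x2.mean()
    x1Tx2 = np.dot(x1c, x2c)
    x1nrm = np.dot(x1c, x1c)
    x2nrm = np.dot(x2c, x2c)
    return 1.0 - x1Tx2 / np.sqrt(x1nrm * x2nrm)

x = np.array([1.0, 2.0, 4.0])
y = 2.0 * x + 5.0  # a distinct point, perfectly correlated with x

correlation_distance(x, x)  # ~0.0, as expected
correlation_distance(x, y)  # also ~0.0: distinct points at zero distance,
                            # violating the coincidence axiom
```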

Thanks,
Matteo

@larsmans
Member
larsmans commented Feb 7, 2015

SciPy has a vectorized version of it, cdist(X, Y, 'correlation'). Do you really need a KD-tree or ball tree with this metric? That's what dist_metrics.pyx is for.
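A quick sketch of that SciPy route, on toy data (not from the thread):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.random_sample((5, 3))
Y = rng.random_sample((4, 3))

# Pairwise correlation distances between rows of X and rows of Y,
# computed in vectorized C code rather than in a Python loop.
D = cdist(X, Y, 'correlation')
D.shape  # (5, 4)
```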

@jakevdp
Member
jakevdp commented Feb 8, 2015

dist_metrics.pyx contains metrics used by the KDTree/BallTree. I initially implemented the correlation metric, but it failed the BallTree unit tests. That's actually how I learned that it's not a true metric! I'd suggest using the scipy version that @larsmans mentioned.

@mvdoc
Contributor Author
mvdoc commented Feb 8, 2015

Thanks for the replies.

I ran into this issue because I wanted to run KNN classification using correlation distance as a metric, and passing a user-defined function was very slow. I ended up defining my own Cython function (I wasn't aware of cdist). However, I believe this still slows the whole process down, because even when algorithm='brute' is used the function has to be wrapped in the PyFuncDistance class. But maybe I'm wrong on this point?
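For context, a sketch of the callable-metric path described here, on toy data; `scipy.spatial.distance.correlation` stands in for the user-defined function. Whatever the exact wrapper involved, a Python callable is evaluated per pair of points, which is what makes it slow compared to a built-in metric.

```python
import numpy as np
from scipy.spatial.distance import correlation  # stands in for a user-defined metric
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.random_sample((100, 3))
y = rng.randint(0, 3, 100)

# A Python callable works as a metric, but it is called back into Python
# for every pair of points -- correct, yet far slower than a Cython metric.
clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', metric=correlation)
clf.fit(X, y)
pred = clf.predict(X)
```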

@jakevdp
Member
jakevdp commented Feb 8, 2015

I see. This use-case and others like it could be accommodated by adding a metric='precomputed' option, which is used in some other places in the package.
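A sketch of how that option works, assuming a scikit-learn version where the neighbors estimators accept metric='precomputed': the distance matrix is computed once up front and passed in place of the feature matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.random_sample((100, 3))
y = rng.randint(0, 3, 100)

# Compute the full pairwise correlation-distance matrix once...
D = squareform(pdist(X, 'correlation'))

# ...then hand the square (n_samples, n_samples) matrix to the estimator.
clf = KNeighborsClassifier(n_neighbors=5, metric='precomputed')
clf.fit(D, y)
pred = clf.predict(D)
```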

@jakevdp
Member
jakevdp commented Feb 8, 2015

Actually, it looks like the correlation distance is implemented natively for the brute-force method:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', metric='correlation')
X, y = np.random.random((100, 3)), np.random.randint(0, 3, 100)
clf.fit(X, y)
clf.predict(X)

@jakevdp
Member
jakevdp commented Feb 8, 2015

If brute-force isn't fast enough for your use-case, there do exist fast algorithms for finding neighbors with a correlation distance, but it will require a data structure that's not implemented in scikit-learn. Look up "Kernelized Locality Sensitive Hashing" for one example of an approach that might be useful.

@mvdoc
Contributor Author
mvdoc commented Feb 8, 2015

@jakevdp thanks for taking time to look into this and for the info on that method!

mvdoc closed this as completed Feb 8, 2015