Q: why is the correlation metric commented in dist_metrics.pyx? · Issue #4213 · scikit-learn/scikit-learn · GitHub

Q: why is the correlation metric commented in dist_metrics.pyx? #4213


Closed
mvdoc opened this issue Feb 6, 2015 · 7 comments

Comments

@mvdoc
Contributor
mvdoc commented Feb 6, 2015

Hi,

in dist_metrics.pyx the correlation metric is commented out, and the comment states that it is not a 'true' metric. That is correct, since it doesn't satisfy the coincidence axiom (two distinct points can be at zero distance), but it would still be helpful to have such a metric implemented in Cython to speed up computations.

Also, note that the actual formula implemented in the commented code is incorrect. The returned value at the moment is

return (1. - x1Tx2) / sqrt(x1nrm * x2nrm)

but it should be instead

return 1. - (x1Tx2 / sqrt(x1nrm * x2nrm))
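(For reference, here is a minimal NumPy sketch of the corrected formula; `correlation_distance` is a hypothetical helper name, not the actual Cython code. It also shows why this isn't a 'true' metric: two distinct, perfectly correlated vectors end up at zero distance.)

```python
import numpy as np

def correlation_distance(x1, x2):
    # Center both vectors, then apply the corrected formula:
    # 1 - x1Tx2 / sqrt(x1nrm * x2nrm)
    x1c = x1 - x1.mean()
    x2c = x2 - x2.mean()
    x1Tx2 = np.dot(x1c, x2c)
    x1nrm = np.dot(x1c, x1c)
    x2nrm = np.dot(x2c, x2c)
    return 1.0 - x1Tx2 / np.sqrt(x1nrm * x2nrm)

x = np.array([1.0, 2.0, 4.0])
y = 2.0 * x + 5.0  # a distinct point, perfectly correlated with x

correlation_distance(x, x)  # ~0.0, as expected
correlation_distance(x, y)  # also ~0.0: distinct points at zero distance,
                            # violating the coincidence axiom
```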

Thanks,
Matteo

@larsmans
Member
larsmans commented Feb 7, 2015

SciPy has a vectorized version of it, cdist(X, Y, 'correlation'). Do you really need a KD-tree or ball tree with this metric? That's what dist_metrics.pyx is for.
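A quick sketch of that SciPy route, on toy data (not from the thread):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.random_sample((5, 3))
Y = rng.random_sample((4, 3))

# Pairwise correlation distances between rows of X and rows of Y,
# computed in vectorized C code rather than in a Python loop.
D = cdist(X, Y, 'correlation')
D.shape  # (5, 4)
```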

@jakevdp
Member
jakevdp commented Feb 8, 2015

dist_metrics.pyx contains metrics used by the KDTree/BallTree. I initially implemented the correlation metric, but it failed the BallTree unit tests. That's actually how I learned that it's not a true metric! I'd suggest using the scipy version that @larsmans mentioned.

@mvdoc
Contributor Author
mvdoc commented Feb 8, 2015

Thanks for the replies.

I ran into this issue because I wanted to run KNN classification using correlation distance as a metric, and passing a user-defined function was very slow. I ended up defining my own Cython function (I wasn't aware of cdist). However, I believe this still slows the whole process down, because even when algorithm='brute' is used the function has to be wrapped in the PyFuncDistance class. But maybe I'm wrong on this point?
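For context, a sketch of the callable-metric path described here, on toy data; `scipy.spatial.distance.correlation` stands in for the user-defined function. Whatever the exact wrapper involved, a Python callable is evaluated per pair of points, which is what makes it slow compared to a built-in metric.

```python
import numpy as np
from scipy.spatial.distance import correlation  # stands in for a user-defined metric
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.random_sample((100, 3))
y = rng.randint(0, 3, 100)

# A Python callable works as a metric, but it is called back into Python
# for every pair of points -- correct, yet far slower than a Cython metric.
clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', metric=correlation)
clf.fit(X, y)
pred = clf.predict(X)
```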

@jakevdp
Member
jakevdp commented Feb 8, 2015

I see. This use-case and others like it could be accommodated by adding a metric='precomputed' option, which is used in some other places in the package.
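A sketch of how that option works, assuming a scikit-learn version where the neighbors estimators accept metric='precomputed': the distance matrix is computed once up front and passed in place of the feature matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.random_sample((100, 3))
y = rng.randint(0, 3, 100)

# Compute the full pairwise correlation-distance matrix once...
D = squareform(pdist(X, 'correlation'))

# ...then hand the square (n_samples, n_samples) matrix to the estimator.
clf = KNeighborsClassifier(n_neighbors=5, metric='precomputed')
clf.fit(D, y)
pred = clf.predict(D)
```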

@jakevdp
Member
jakevdp commented Feb 8, 2015

Actually, it looks like the correlation distance is implemented natively for the brute-force method:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', metric='correlation')
X, y = np.random.random((100, 3)), np.random.randint(0, 3, 100)
clf.fit(X, y)
clf.predict(X)

@jakevdp
Member
jakevdp commented Feb 8, 2015

If brute-force isn't fast enough for your use-case, there do exist fast algorithms for finding neighbors with a correlation distance, but it will require a data structure that's not implemented in scikit-learn. Look up "Kernelized Locality Sensitive Hashing" for one example of an approach that might be useful.

@mvdoc
Contributor Author
mvdoc commented Feb 8, 2015

@jakevdp thanks for taking time to look into this and for the info on that method!

mvdoc closed this as completed Feb 8, 2015