-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG] FIX pairwise distances with 'seuclidean' or 'mahalanobis' metrics #12701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] FIX pairwise distances with 'seuclidean' or 'mahalanobis' metrics #12701
Conversation
|
if n_jobs == 1: | ||
expected_dist = squareform(pdist(X, metric=metric)) | ||
else: | ||
expected_dist = cdist(X, Y, metric=metric) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use cdist in either case? We don't care much about performance in this test. And it would alow to simplify this test significantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well actually cdist and pdist disagree for mahalanobis distance on scipy (comment) and we use pdist when n_jobs == 1
and cdist otherwise in sklearn, so I'm forced to make the distinction.
It would be better to only use one of them in sklearn.
I made a fix which "works", but it's really not satisfying and it hides some bad behavior.
What's your opinion on these points ? |
After more thoughts and discussions,
Concretely, it means that calling Notice that it won't necessary give the same output as cdist or pdist, because cdist does not make a distinction between cdist(X,X) and cdist(X,X.copy()). |
sklearn/metrics/pairwise.py
Outdated
@@ -1264,6 +1264,19 @@ def pairwise_distances_chunked(X, Y=None, reduce_func=None, | |||
working_memory=working_memory) | |||
slices = gen_batches(n_samples_X, chunk_n_rows) | |||
|
|||
if metric == "seuclidean" and 'V' not in kwds: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create a helper function, please
|
||
assert_allclose(dist, expected_dist) | ||
|
||
set_config(working_memory=wm) # reset working memory to initial value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config_context is intended for exactly this purpose
I've labelled this for 0.20.2 as it may be effectively a regression. |
f39ff83
to
421c0ed
Compare
else: | ||
params = {'VI': np.linalg.inv(np.cov(np.vstack([X, Y]).T)).T} | ||
|
||
expected_dist = cdist(X, Y, metric=metric, **params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do the params need to be passed explicitly. doesn't cdist calculate these by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but as I've been trying to explain, pdist(X)
and cdist(X,X)
disagree, since var is computed on X for pdist and on [X,X] for cdist.
This fix proposes the following:
pairwise_distances(X, Y!=X)
gives same output ascdist(X,Y)
pairwise_distances(X)
orpairwise_distances(X,X)
gives same output aspdist(X)
or equivalently ascdist(X, X, V=var(X))
pairwise_distances(X, Y==X)
(but Y is not X) gives same output ascdist(X, X)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that if we want pairwise_distances(X,X)
to give same output as cdist(X, X)
, then pairwise_distances(X,X) != pairwise_distances(X)
which felt bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow entirely.
We have pairwise_distances(X, X) == pairwise_distances(X) != pairwise_distances(X, X.copy())
right now, right?
And we're using var(X)
in the first two and var([X, X])
in the last one.
Is that not the same behavior as cdist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's the current behavior. And it remains as is in this PR.
cdist
requires 2 arrays so not the same behavior. Currently
pairwise_distances(X, X) == pairwise_distances(X) == pdist(X)
if n_jobs = 1pairwise_distances(X, X) == pairwise_distances(X) == cdist(X,X) != pdist(X)
if n_jobs > 1
This PR change it to pairwise_distances(X, X) == pairwise_distances(X) == pdist(X)
for all n_jobs values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Shouldn't we then assert that in addition to explicitly passing params?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think @amueller is asking that we test the equivalence of dist_function to cdist or pdist as appropriate, where we do not explicitly pass params. At the moment we are only comparing to when params are given.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a comment in between yours which seems to have been deleted... Anyway, I added an assert to explicitly check the equivalence of pairwise distances with the appropriate pdist or cdist.
I'd like to point out the fact that this new assert is basically a scipy test, because it's just assert that pdist and cdist do agree if passed appropriate params. But I guess it doesn't hurt to add extra checks, since it was not that clear at first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also documenting behavior and ensuring that our understanding of the relative behavior is still correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point thx
I've not thought about this in depth, but I think we should, for now, have
it match past behaviour of pairwise_distances
|
This does not seem to be a regression from 0.20. The issue is also in |
The discrepancy from cdist is not a regression, no. Is that what you mean? |
There are 3 things:
|
Yes, that's what I thought. So let's fix the regression and the n_jobs
issue and put the tricky cdist compatibility issue on hold
…On Tue, 11 Dec 2018, 11:36 pm jeremiedbb ***@***.*** wrote:
There are 3 things:
- in pairwise_distances there is a bug when n_jobs > 1. This is not a
regression.
- in pairwise_distances_chunked there is a bug when there are more
than 1 chunks. It's kind of a regression since
pairwise_distances_chunked is new and used instead of
pairwise_distances in some places.
- pdist and cdist disagree. This is not a regression and actually this
is expected. It affects sklearn when Y is X for n_jobs==1 vs n_jobs>1.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#12701 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz68wjb1TDMGqV5hgjot0tdAJoVDoJks5u36a4gaJpZM4Y70Sa>
.
|
Then I think the current state of this PR is fine. I made it so that the behavior of pairwise distances matches the previous behavior with |
This is the only thing left for 0.20.2, right? I agree with @qinhanmin2014 that we should do that soon given that we started doing drastic changes for dropping 2.7 |
Well I think it's ready but it needs more reviews :) |
what's wrong with circleci ? |
@jeremiedbb please merge master in |
Well this fix is intended to go in 0.20.2 so we still need python 2 CI ? |
@jeremiedbb We don't run python2 CI in master (We only run it in 0.20.X branch). We'll try to tackle this and release 0.20.2 ASAP. |
(Travis and Appveyor don't run python3 now, the problem is that Circle won't merge master automatically) |
0eea3f1
to
66bbe65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not gone through the original issue carefully and I choose to trust @jeremiedbb
#12672 (comment)
This LGTM from my side, though I think maybe we can leave it to 0.21.
doc/whats_new/v0.20.rst
Outdated
...................... | ||
|
||
- |Fix| Fixed a bug in :func:`metrics.pairwise_distances` and | ||
:func:`metrics.pairwise_distances_chunked` where parameters of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe mention what parameter?
@@ -893,3 +896,40 @@ def test_check_preserve_type(): | |||
XB.astype(np.float)) | |||
assert_equal(XA_checked.dtype, np.float) | |||
assert_equal(XB_checked.dtype, np.float) | |||
|
|||
|
|||
@pytest.mark.parametrize("n_jobs", [1, 2, -1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can remove n_jobs=2 or n_jobs=-1?
else: | ||
params = {'VI': np.linalg.inv(np.cov(np.vstack([X, Y]).T)).T} | ||
|
||
expected_dist_explicit_params = cdist(X, Y, metric=metric, **params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like such a scipy test, but won't vote -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's wrong with that ?
You'd prefer to just test that pairwise_distances(X, n_jobs=1) == pairwise_distances(X, n_jobs=2)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you were right, 2 and -1 are redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the above comments.
I don't think we need to compare n_jobs=1 and n_jobs=2 since we've compared with scipy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to review on bed, apologies again for the meaningless comments above :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha no problem :D
# parallel, when metric has data-derived parameters. | ||
with config_context(working_memory=0.1): # to have more than 1 chunk | ||
rng = np.random.RandomState(0) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can remove some blank lines to reduce number of rows in a single file :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jeremiedbb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @jeremiedbb! |
…rics (scikit-learn#12701)" This reverts commit fedc3ac.
…rics (scikit-learn#12701)" This reverts commit fedc3ac.
Fixes #12672
The issue affects
pairwise_distances
andpairwise_distances_chunked
. When metrics params are not provided for these metrics, they are computed for each chunk of data instead of being precomputed for whole data. These two metrics are the only one with data-derived params.