[MRG] FIX pairwise distances with 'seuclidean' or 'mahalanobis' metrics by jeremiedbb · Pull Request #12701 · scikit-learn/scikit-learn · GitHub

Merged
merged 9 commits into scikit-learn:master from precompute-metrics-params on Dec 17, 2018

Conversation

jeremiedbb
Member

Fixes #12672

The issue affects pairwise_distances and pairwise_distances_chunked. When the metric parameters are not provided for these metrics, they are computed for each chunk of data instead of being precomputed on the whole data. These two metrics are the only ones with data-derived parameters.
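As a sketch of the idea behind the fix (illustrative only, not the merged scikit-learn code): the data-derived parameter has to be computed once on the full data and then passed to every chunk, otherwise each chunk would use a different value.

```python
import numpy as np
from scipy.spatial.distance import cdist

# For 'seuclidean', the variance vector V is data-derived. Precompute
# it on the whole X, then pass the same V to each chunk's cdist call.
rng = np.random.RandomState(0)
X = rng.rand(10, 3)

V = np.var(X, axis=0, ddof=1)  # precomputed on the whole data
chunks = [X[:5], X[5:]]
dist_chunks = np.vstack([cdist(chunk, X, metric='seuclidean', V=V)
                         for chunk in chunks])
dist_full = cdist(X, X, metric='seuclidean', V=V)

# With a shared V, chunked and unchunked results agree.
assert np.allclose(dist_chunks, dist_full)
```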

@jeremiedbb
Member Author
jeremiedbb commented Nov 30, 2018

This is just a reproducible test for now, so CI failure is expected :). The actual fix is coming.

if n_jobs == 1:
    expected_dist = squareform(pdist(X, metric=metric))
else:
    expected_dist = cdist(X, Y, metric=metric)
Member

Maybe use cdist in either case? We don't care much about performance in this test, and it would allow us to simplify this test significantly.

Member Author
@jeremiedbb jeremiedbb Dec 4, 2018

Well, actually cdist and pdist disagree for the mahalanobis distance in scipy (comment), and in sklearn we use pdist when n_jobs == 1 and cdist otherwise, so I'm forced to make the distinction.

It would be better to only use one of them in sklearn.
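The disagreement is easy to reproduce with scipy alone (a small check, not part of the PR):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# pdist estimates the covariance on X, while cdist(X, X) estimates it
# on the stacked [X, X]; with ddof=1 the divisors differ, so the two
# 'mahalanobis' results disagree by a constant scale factor.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)

d_pdist = squareform(pdist(X, metric='mahalanobis'))
d_cdist = cdist(X, X, metric='mahalanobis')

assert not np.allclose(d_pdist, d_cdist)
```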

@jeremiedbb
Member Author

I made a fix which "works", but it's really not satisfying and it hides some bad behavior.

  • Code has to be duplicated in pairwise_distances_chunked because it calls pairwise_distances for each chunk, but the metric params have to be computed before splitting. This is necessary unless we do some kind of fit for the metrics.

  • When Y is X, using pdist when n_jobs == 1 and cdist otherwise, which was probably done for efficiency, does not give consistent results due to a disagreement between these two functions in scipy. I don't know right now how it will be solved upstream, but I think it would be reasonable to use cdist for both (pdist does not make sense when n_jobs > 1).

  • The difference between pdist(X) and cdist(X,X) is due to the default ddof parameter of numpy var or cov. Its default value is 1 when computing the metric params V and VI. If X has N samples, var(X) is computed dividing by N-1 instead of N. But if Y == X, the two arrays are stacked, and var(np.vstack([X,X])) is computed dividing by 2N-1 != 2(N-1). We could decide to be consistent and set ddof=0 by default for both. After all it's just a choice of default, so maybe we don't need to be consistent with scipy's default.

What's your opinion on these points?
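The divisor mismatch in the third bullet can be checked directly (a small numpy illustration, not part of the PR):

```python
import numpy as np

# With ddof=1, duplicating the N samples changes the variance
# estimate: the sum of squared deviations doubles, but the divisor
# becomes 2N - 1 rather than 2(N - 1).
rng = np.random.RandomState(0)
X = rng.rand(5, 2)
n = X.shape[0]

v_X = np.var(X, axis=0, ddof=1)                   # SS / (N - 1)
v_XX = np.var(np.vstack([X, X]), axis=0, ddof=1)  # 2 * SS / (2N - 1)

# Same sum of squared deviations, different divisors:
assert np.allclose(v_XX, v_X * 2 * (n - 1) / (2 * n - 1))
```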

@jeremiedbb
Member Author

After more thought and discussions, pdist(X) != cdist(X,X) is expected and should remain as is. The 'seuclidean' and 'mahalanobis' metrics assume that X (and Y) are samples from an underlying distribution.

  • For the pairwise distances between X and Y, where Y is X, X and Y are the same samples from the distribution and therefore the variance or covariance should be estimated only on X.
  • On the other hand, if it happens that Y == X, X and Y are not the same samples from the distribution. They are two sets of samples which happen to take the same values, and therefore the variance or covariance should be estimated on (X, X).

Concretely, it means that calling pairwise_distances(X) or pairwise_distances(X,X) will give the same output, with var and cov estimated on X. Calling pairwise_distances(X, X.copy()) will give a different output, with var and cov estimated on (X, X).

Notice that it won't necessarily give the same output as cdist or pdist, because cdist does not make a distinction between cdist(X,X) and cdist(X,X.copy()).
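A quick scipy-only check of that last point (illustrative, not part of the PR): cdist only sees values, not object identity, so it cannot make the distinction described above.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.RandomState(0)
X = rng.rand(15, 3)

# cdist treats X and X.copy() identically...
same = np.allclose(cdist(X, X, metric='seuclidean'),
                   cdist(X, X.copy(), metric='seuclidean'))

# ...while pdist(X) and cdist(X, X) still disagree, because the
# variance is estimated on X vs. on the stacked [X, X].
differ = not np.allclose(squareform(pdist(X, metric='seuclidean')),
                         cdist(X, X, metric='seuclidean'))
```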

@jeremiedbb jeremiedbb changed the title [WIP] FIX pairwise distances with 'seuclidean' or 'mahalanobis' metrics [MRG] FIX pairwise distances with 'seuclidean' or 'mahalanobis' metrics Dec 6, 2018
@jnothman jnothman added this to the 0.20.2 milestone Dec 9, 2018
@@ -1264,6 +1264,19 @@ def pairwise_distances_chunked(X, Y=None, reduce_func=None,
working_memory=working_memory)
slices = gen_batches(n_samples_X, chunk_n_rows)

if metric == "seuclidean" and 'V' not in kwds:
Member

Create a helper function, please


assert_allclose(dist, expected_dist)

set_config(working_memory=wm) # reset working memory to initial value
Member

config_context is intended for exactly this purpose
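For reference, config_context restores the previous value on exit, so the manual reset with set_config becomes unnecessary (a minimal illustration of the suggestion):

```python
from sklearn import config_context, get_config

# The temporary working_memory only applies inside the block; the
# previous value is restored automatically on exit.
before = get_config()['working_memory']
with config_context(working_memory=0.1):
    inside = get_config()['working_memory']
after = get_config()['working_memory']
```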

@jnothman
Member
jnothman commented Dec 9, 2018

I've labelled this for 0.20.2 as it may be effectively a regression.

else:
    params = {'VI': np.linalg.inv(np.cov(np.vstack([X, Y]).T)).T}

expected_dist = cdist(X, Y, metric=metric, **params)
Member

Why do the params need to be passed explicitly? Doesn't cdist calculate these by default?

Member Author

Yes, but as I've been trying to explain, pdist(X) and cdist(X,X) disagree, since var is computed on X for pdist and on [X,X] for cdist.

This fix proposes the following:

  • pairwise_distances(X, Y != X) gives the same output as cdist(X, Y)
  • pairwise_distances(X) or pairwise_distances(X, X) gives the same output as pdist(X), or equivalently as cdist(X, X, V=var(X))
  • pairwise_distances(X, Y == X) (but Y is not X) gives the same output as cdist(X, X)

Member Author

Note that if we wanted pairwise_distances(X,X) to give the same output as cdist(X, X), then pairwise_distances(X,X) != pairwise_distances(X), which felt bad.

Member

I'm not sure I follow entirely.
We have pairwise_distances(X, X) == pairwise_distances(X) != pairwise_distances(X, X.copy()) right now, right?

And we're using var(X) in the first two and var([X, X]) in the last one.
Is that not the same behavior as cdist?

Member Author

Yes, it's the current behavior, and it remains as is in this PR.

cdist requires 2 arrays, so it's not the same behavior. Currently:

  • pairwise_distances(X, X) == pairwise_distances(X) == pdist(X) if n_jobs == 1
  • pairwise_distances(X, X) == pairwise_distances(X) == cdist(X, X) != pdist(X) if n_jobs > 1

This PR changes it to pairwise_distances(X, X) == pairwise_distances(X) == pdist(X) for all n_jobs values.

Member

Ok. Shouldn't we then assert that in addition to explicitly passing params?

Member

So I think @amueller is asking that we test the equivalence of dist_function to cdist or pdist as appropriate, where we do not explicitly pass params. At the moment we are only comparing to when params are given.

Member Author

I had a comment in between yours which seems to have been deleted... Anyway, I added an assert to explicitly check the equivalence of pairwise_distances with the appropriate pdist or cdist.

I'd like to point out that this new assert is basically a scipy test, because it just asserts that pdist and cdist agree when passed appropriate params. But I guess it doesn't hurt to add extra checks, since it was not that clear at first.

Member

It's also documenting behavior and ensuring that our understanding of the relative behavior is still correct.

Member Author

good point thx

@jnothman
Member
jnothman commented Dec 10, 2018 via email

@jeremiedbb
Member Author

This does not seem to be a regression from 0.20. The issue is also in pairwise_distances, not only in pairwise_distances_chunked. We can delay it to 0.21 if we need more time to find the most appropriate solution.

@jnothman
Member

The discrepancy from cdist is not a regression, no. Is that what you mean?

@jeremiedbb
Member Author
jeremiedbb commented Dec 11, 2018

There are 3 things:

  • In pairwise_distances there is a bug when n_jobs > 1. This is not a regression.
  • In pairwise_distances_chunked there is a bug when there is more than one chunk. It's kind of a regression, since pairwise_distances_chunked is new and is used instead of pairwise_distances in some places (but anyway, the previous bug was already there).
  • pdist and cdist disagree. This is not a regression and is actually expected. It affects sklearn when Y is X, for n_jobs == 1 vs n_jobs > 1.

@jnothman
Member
jnothman commented Dec 11, 2018 via email

@jeremiedbb
Member Author

Then I think the current state of this PR is fine. I made it so that the behavior of pairwise distances matches the previous behavior with n_jobs==1 in all situations.

@amueller
Member

This is the only thing left for 0.20.2, right? I agree with @qinhanmin2014 that we should do that soon, given that we've started making drastic changes for dropping Python 2.7.

@jeremiedbb
Member Author

Well I think it's ready but it needs more reviews :)

@amueller amueller mentioned this pull request Dec 14, 2018
@jeremiedbb
Member Author

What's wrong with CircleCI?

@qinhanmin2014
Member

@jeremiedbb please merge master in

@jeremiedbb
Member Author

Well this fix is intended to go in 0.20.2 so we still need python 2 CI ?

@qinhanmin2014
Member

Well this fix is intended to go in 0.20.2 so we still need python 2 CI ?

@jeremiedbb We don't run Python 2 CI on master (we only run it on the 0.20.X branch). We'll try to tackle this and release 0.20.2 ASAP.
(Maybe you can check your test with Python 2 locally if possible; apologies for the inconvenience.)

@qinhanmin2014
Member

(Travis and AppVeyor don't run Python 3 now; the problem is that Circle won't merge master automatically.)

@jeremiedbb jeremiedbb force-pushed the precompute-metrics-params branch from 0eea3f1 to 66bbe65 Compare December 17, 2018 09:39
Member
@qinhanmin2014 qinhanmin2014 left a comment

I've not gone through the original issue carefully and I choose to trust @jeremiedbb
#12672 (comment)

This LGTM from my side, though I think maybe we can leave it to 0.21.


- |Fix| Fixed a bug in :func:`metrics.pairwise_distances` and
:func:`metrics.pairwise_distances_chunked` where parameters of
Member

maybe mention what parameter?

@@ -893,3 +896,40 @@ def test_check_preserve_type():
XB.astype(np.float))
assert_equal(XA_checked.dtype, np.float)
assert_equal(XB_checked.dtype, np.float)


@pytest.mark.parametrize("n_jobs", [1, 2, -1])
Member

maybe we can remove n_jobs=2 or n_jobs=-1?

else:
    params = {'VI': np.linalg.inv(np.cov(np.vstack([X, Y]).T)).T}

expected_dist_explicit_params = cdist(X, Y, metric=metric, **params)
Member

I don't like such a scipy test, but won't vote -1.

Member Author

What's wrong with that?
You'd prefer to just test that pairwise_distances(X, n_jobs=1) == pairwise_distances(X, n_jobs=2)?

Member Author

Yes you were right, 2 and -1 are redundant.

Member

Apologies for the above comments.
I don't think we need to compare n_jobs=1 and n_jobs=2 since we've compared with scipy.

Member

It's hard to review in bed; apologies again for the meaningless comments above :)

Member Author

haha no problem :D

# parallel, when metric has data-derived parameters.
with config_context(working_memory=0.1):  # to have more than 1 chunk
    rng = np.random.RandomState(0)
Member

Maybe you can remove some blank lines to reduce the number of rows in a single file :)

Member
@qinhanmin2014 qinhanmin2014 left a comment

Thanks @jeremiedbb

Member
@amueller amueller left a comment

lgtm


@jnothman jnothman merged commit 876b149 into scikit-learn:master Dec 17, 2018
@jnothman
Member

Thanks @jeremiedbb!

@jeremiedbb jeremiedbb deleted the precompute-metrics-params branch January 31, 2019 12:58
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019