[MRG] FIX pairwise distances with 'seuclidean' or 'mahalanobis' metrics by jeremiedbb · Pull Request #12701 · scikit-learn/scikit-learn · GitHub

Merged
merged 9 commits into scikit-learn:master from precompute-metrics-params on Dec 17, 2018

Conversation

jeremiedbb
Member

Fixes #12672

The issue affects pairwise_distances and pairwise_distances_chunked. When the metric parameters are not provided for these metrics, they are computed for each chunk of data instead of being precomputed on the whole data. These two metrics are the only ones with data-derived parameters.
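As a sketch of the idea behind the fix (illustrative only, not the merged scikit-learn code): the data-derived parameter has to be computed once on the full data and then passed to every chunk, otherwise each chunk would use a different value.

```python
import numpy as np
from scipy.spatial.distance import cdist

# For 'seuclidean', the variance vector V is data-derived. Precompute
# it on the whole X, then pass the same V to each chunk's cdist call.
rng = np.random.RandomState(0)
X = rng.rand(10, 3)

V = np.var(X, axis=0, ddof=1)  # precomputed on the whole data
chunks = [X[:5], X[5:]]
dist_chunks = np.vstack([cdist(chunk, X, metric='seuclidean', V=V)
                         for chunk in chunks])
dist_full = cdist(X, X, metric='seuclidean', V=V)

# With a shared V, chunked and unchunked results agree.
assert np.allclose(dist_chunks, dist_full)
```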

@jeremiedbb
Member Author
jeremiedbb commented Nov 30, 2018

This is just a reproducible test for now, so CI failure is expected :). The actual fix is coming.

if n_jobs == 1:
    expected_dist = squareform(pdist(X, metric=metric))
else:
    expected_dist = cdist(X, Y, metric=metric)
Member

Maybe use cdist in either case? We don't care much about performance in this test, and it would allow us to simplify this test significantly.

Member Author
@jeremiedbb jeremiedbb Dec 4, 2018

Well, actually cdist and pdist disagree for the mahalanobis distance in scipy (comment), and in sklearn we use pdist when n_jobs == 1 and cdist otherwise, so I'm forced to make the distinction.

It would be better to only use one of them in sklearn.
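The disagreement is easy to reproduce with scipy alone (a small check, not part of the PR):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# pdist estimates the covariance on X, while cdist(X, X) estimates it
# on the stacked [X, X]; with ddof=1 the divisors differ, so the two
# 'mahalanobis' results disagree by a constant scale factor.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)

d_pdist = squareform(pdist(X, metric='mahalanobis'))
d_cdist = cdist(X, X, metric='mahalanobis')

assert not np.allclose(d_pdist, d_cdist)
```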

@jeremiedbb
Member Author

I made a fix which "works", but it's really not satisfying and it hides some bad behavior.

  • Code has to be duplicated in pairwise_distances_chunked because it calls pairwise_distances for each chunk, but the metric params have to be computed before splitting. This is necessary unless we do some kind of fit for the metrics.

  • When Y is X, using pdist when n_jobs == 1 and cdist otherwise, which was probably done for efficiency, does not give consistent results due to a disagreement between these two functions in scipy. I don't know right now how it will be solved upstream, but I think it would be reasonable to use cdist for both (pdist does not make sense when n_jobs > 1).

  • The difference between pdist(X) and cdist(X,X) is due to the default ddof parameter of numpy var or cov. Its default value is 1 when computing the metric params V and VI. If X has N samples, var(X) is computed dividing by N-1 instead of N. But if Y == X, the two arrays are stacked, and var(np.vstack([X,X])) is computed dividing by 2N-1 != 2(N-1). We could decide to be consistent and set ddof=0 by default for both. After all it's just a choice of default, so maybe we don't need to be consistent with scipy's default.

What's your opinion on these points?
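The divisor mismatch in the third bullet can be checked directly (a small numpy illustration, not part of the PR):

```python
import numpy as np

# With ddof=1, duplicating the N samples changes the variance
# estimate: the sum of squared deviations doubles, but the divisor
# becomes 2N - 1 rather than 2(N - 1).
rng = np.random.RandomState(0)
X = rng.rand(5, 2)
n = X.shape[0]

v_X = np.var(X, axis=0, ddof=1)                   # SS / (N - 1)
v_XX = np.var(np.vstack([X, X]), axis=0, ddof=1)  # 2 * SS / (2N - 1)

# Same sum of squared deviations, different divisors:
assert np.allclose(v_XX, v_X * 2 * (n - 1) / (2 * n - 1))
```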

@jeremiedbb
Member Author

After more thought and discussions, pdist(X) != cdist(X,X) is expected and should remain as is. The 'seuclidean' and 'mahalanobis' metrics assume that X (and Y) are samples from an underlying distribution.

  • For the pairwise distances between X and Y, where Y is X, X and Y are the same samples from the distribution and therefore the variance or covariance should be estimated only on X.
  • On the other hand, if it happens that Y == X, X and Y are not the same samples from the distribution. They are two sets of samples which happen to take the same values, and therefore the variance or covariance should be estimated on (X, X).

Concretely, it means that calling pairwise_distances(X) or pairwise_distances(X,X) will give the same output, with var and cov estimated on X. Calling pairwise_distances(X, X.copy()) will give a different output, with var and cov estimated on (X, X).

Notice that it won't necessarily give the same output as cdist or pdist, because cdist does not make a distinction between cdist(X,X) and cdist(X,X.copy()).
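A quick scipy-only check of that last point (illustrative, not part of the PR): cdist only sees values, not object identity, so it cannot make the distinction described above.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.RandomState(0)
X = rng.rand(15, 3)

# cdist treats X and X.copy() identically...
same = np.allclose(cdist(X, X, metric='seuclidean'),
                   cdist(X, X.copy(), metric='seuclidean'))

# ...while pdist(X) and cdist(X, X) still disagree, because the
# variance is estimated on X vs. on the stacked [X, X].
differ = not np.allclose(squareform(pdist(X, metric='seuclidean')),
                         cdist(X, X, metric='seuclidean'))
```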

@jeremiedbb jeremiedbb changed the title [WIP] FIX pairwise distances with 'seuclidean' or 'mahalanobis' metrics [MRG] FIX pairwise distances with 'seuclidean' or 'mahalanobis' metrics Dec 6, 2018
@jnothman jnothman added this to the 0.20.2 milestone Dec 9, 2018
@@ -1264,6 +1264,19 @@ def pairwise_distances_chunked(X, Y=None, reduce_func=None,
working_memory=working_memory)
slices = gen_batches(n_samples_X, chunk_n_rows)

if metric == "seuclidean" and 'V' not in kwds:
Member

Create a helper function, please


assert_allclose(dist, expected_dist)

set_config(working_memory=wm) # reset working memory to initial value
Member

config_context is intended for exactly this purpose
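For reference, config_context restores the previous value on exit, so the manual reset with set_config becomes unnecessary (a minimal illustration of the suggestion):

```python
from sklearn import config_context, get_config

# The temporary working_memory only applies inside the block; the
# previous value is restored automatically on exit.
before = get_config()['working_memory']
with config_context(working_memory=0.1):
    inside = get_config()['working_memory']
after = get_config()['working_memory']
```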

@jnothman
Member
jnothman commented Dec 9, 2018

I've labelled this for 0.20.2 as it may be effectively a regression.

else:
    params = {'VI': np.linalg.inv(np.cov(np.vstack([X, Y]).T)).T}

expected_dist = cdist(X, Y, metric=metric, **params)
Member

Why do the params need to be passed explicitly? Doesn't cdist calculate these by default?

Member Author

Yes, but as I've been trying to explain, pdist(X) and cdist(X,X) disagree, since var is computed on X for pdist and on [X,X] for cdist.

This fix proposes the following:

  • pairwise_distances(X, Y != X) gives the same output as cdist(X, Y)
  • pairwise_distances(X) or pairwise_distances(X, X) gives the same output as pdist(X), or equivalently as cdist(X, X, V=var(X))
  • pairwise_distances(X, Y == X) (but Y is not X) gives the same output as cdist(X, X)

Member Author

Note that if we wanted pairwise_distances(X,X) to give the same output as cdist(X, X), then pairwise_distances(X,X) != pairwise_distances(X), which felt bad.

Member

I'm not sure I follow entirely.
We have pairwise_distances(X, X) == pairwise_distances(X) != pairwise_distances(X, X.copy()) right now, right?

And we're using var(X) in the first two and var([X, X]) in the last one.
Is that not the same behavior as cdist?

Member Author

Yes, it's the current behavior, and it remains as is in this PR.

cdist requires 2 arrays, so it's not the same behavior. Currently:

  • pairwise_distances(X, X) == pairwise_distances(X) == pdist(X) if n_jobs == 1
  • pairwise_distances(X, X) == pairwise_distances(X) == cdist(X, X) != pdist(X) if n_jobs > 1

This PR changes it to pairwise_distances(X, X) == pairwise_distances(X) == pdist(X) for all n_jobs values.

Member

Ok. Shouldn't we then assert that in addition to explicitly passing params?

Member

So I think @amueller is asking that we test the equivalence of dist_function to cdist or pdist as appropriate, where we do not explicitly pass params. At the moment we are only comparing to when params are given.

Member Author

I had a comment in between yours which seems to have been deleted... Anyway, I added an assert to explicitly check the equivalence of pairwise_distances with the appropriate pdist or cdist.

I'd like to point out that this new assert is basically a scipy test, because it just asserts that pdist and cdist agree when passed appropriate params. But I guess it doesn't hurt to add extra checks, since it was not that clear at first.

Member

It's also documenting behavior and ensuring that our understanding of the relative behavior is still correct.

Member Author

good point thx

@jnothman
Member
jnothman commented Dec 10, 2018 via email

@jeremiedbb
Member Author

This does not seem to be a regression from 0.20. The issue is also in pairwise_distances, not only in pairwise_distances_chunked. We can delay it to 0.21 if we need more time to find the most appropriate solution.

@jnothman
Member

The discrepancy from cdist is not a regression, no. Is that what you mean?

@jeremiedbb
Member Author
jeremiedbb commented Dec 11, 2018

There are 3 things:

  • In pairwise_distances there is a bug when n_jobs > 1. This is not a regression.
  • In pairwise_distances_chunked there is a bug when there is more than one chunk. It's kind of a regression, since pairwise_distances_chunked is new and is used instead of pairwise_distances in some places (but anyway, the previous bug was already there).
  • pdist and cdist disagree. This is not a regression and is actually expected. It affects sklearn when Y is X, for n_jobs == 1 vs n_jobs > 1.

@jnothman
Member
jnothman commented Dec 11, 2018 via email

@jeremiedbb
Member Author

Then I think the current state of this PR is fine. I made it so that the behavior of pairwise distances matches the previous behavior with n_jobs==1 in all situations.

@amueller
Member

This is the only thing left for 0.20.2, right? I agree with @qinhanmin2014 that we should do that soon, given that we've started making drastic changes for dropping Python 2.7.

@jeremiedbb
Member Author

Well I think it's ready but it needs more reviews :)

@amueller amueller mentioned this pull request Dec 14, 2018
@jeremiedbb
Member Author

What's wrong with CircleCI?

@qinhanmin2014
Member

@jeremiedbb please merge master in

@jeremiedbb
Member Author

Well this fix is intended to go in 0.20.2 so we still need python 2 CI ?

@qinhanmin2014
Member

Well this fix is intended to go in 0.20.2 so we still need python 2 CI ?

@jeremiedbb We don't run Python 2 CI on master (we only run it on the 0.20.X branch). We'll try to tackle this and release 0.20.2 ASAP.
(Maybe you can check your test with Python 2 locally if possible; apologies for the inconvenience.)

@qinhanmin2014
Member

(Travis and AppVeyor don't run Python 3 now; the problem is that Circle won't merge master automatically.)

@jeremiedbb jeremiedbb force-pushed the precompute-metrics-params branch from 0eea3f1 to 66bbe65 Compare December 17, 2018 09:39
Member
@qinhanmin2014 qinhanmin2014 left a comment

I've not gone through the original issue carefully and I choose to trust @jeremiedbb
#12672 (comment)

This LGTM from my side, though I think maybe we can leave it to 0.21.


- |Fix| Fixed a bug in :func:`metrics.pairwise_distances` and
:func:`metrics.pairwise_distances_chunked` where parameters of
Member

maybe mention what parameter?

@@ -893,3 +896,40 @@ def test_check_preserve_type():
XB.astype(np.float))
assert_equal(XA_checked.dtype, np.float)
assert_equal(XB_checked.dtype, np.float)


@pytest.mark.parametrize("n_jobs", [1, 2, -1])
Member

maybe we can remove n_jobs=2 or n_jobs=-1?

else:
    params = {'VI': np.linalg.inv(np.cov(np.vstack([X, Y]).T)).T}

expected_dist_explicit_params = cdist(X, Y, metric=metric, **params)
Member

I don't like such a scipy test, but won't vote -1.

Member Author

What's wrong with that?
You'd prefer to just test that pairwise_distances(X, n_jobs=1) == pairwise_distances(X, n_jobs=2)?

Member Author

Yes you were right, 2 and -1 are redundant.

Member

Apologies for the above comments.
I don't think we need to compare n_jobs=1 and n_jobs=2 since we've compared with scipy.

Member

It's hard to review in bed; apologies again for the meaningless comments above :)

Member Author

haha no problem :D

# parallel, when metric has data-derived parameters.
with config_context(working_memory=0.1):  # to have more than 1 chunk
    rng = np.random.RandomState(0)
Member

Maybe you can remove some blank lines to reduce the number of rows in a single file :)

Member
@qinhanmin2014 qinhanmin2014 left a comment

Thanks @jeremiedbb

Member
@amueller amueller left a comment

lgtm


@jnothman jnothman merged commit 876b149 into scikit-learn:master Dec 17, 2018
@jnothman
Member

Thanks @jeremiedbb!

@jeremiedbb jeremiedbb deleted the precompute-metrics-params branch January 31, 2019 12:58
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019