Implement Gower similarity coeficient #9555

marcelobeckmann · 2017-08-15T07:19:12Z

Reference Issue

What does this implement/fix? Explain your changes.

Implements the Gower similarity in sklearn.metrics.pairwise.

Any other comments?

Unit tests are on the way, but please review this code and advise in the meantime.
This code handles NaN propagation and non-square matrices for parallel processing.

jnothman · 2017-08-17T14:31:01Z

In scikit-learn we often use numeric arrays where ints are used to represent categorical features. Perhaps this needs a categorical_features parameter that identifies the categorical columns when dtype is not discriminative.
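A minimal sketch of what such a parameter could look like, assuming a hypothetical helper named `categorical_mask` (not part of scikit-learn or of this PR). Explicit column indices take precedence; otherwise the helper falls back to looking for string values, which cannot detect integer-coded categories:

```python
import numpy as np


def categorical_mask(X, categorical_features=None):
    """Hypothetical helper: boolean mask, True where a column is categorical."""
    X = np.asarray(X)
    mask = np.zeros(X.shape[1], dtype=bool)
    if categorical_features is not None:
        # Explicit indices win: needed when ints encode categories.
        mask[np.asarray(categorical_features, dtype=int)] = True
        return mask
    if X.dtype.kind in "OUS":
        # Fallback (assumption): a column holding any string is categorical.
        for col in range(X.shape[1]):
            mask[col] = any(isinstance(v, str) for v in X[:, col])
    return mask


X_num = np.array([[0, 1.5], [2, 0.3], [1, 2.2]])
print(categorical_mask(X_num))        # nothing can be inferred from dtype
print(categorical_mask(X_num, [0]))   # column 0 flagged explicitly
```

With a purely numeric array the dtype carries no information, so only the explicit indices mark column 0 as categorical.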

Also, the builds are failing.

I'm marking the title as [WIP]. When you are satisfied with the testing and documentation, please mark it MRG and let us know.

ashimb9 · 2017-08-20T05:12:59Z

The linked issue contains a typo I believe and should be #5884

marcelobeckmann · 2017-08-24T06:15:31Z

Thanks for the hints, I'm working on this.

jnothman · 2017-09-10T10:15:09Z

Please make sure your code adheres to PEP8 to make it easier to review

jnothman

Please add tests. Numerous lines do not have test coverage.

sklearn/metrics/pairwise.py

marcelobeckmann · 2017-09-28T06:58:39Z

I fixed all the minor issues reported by Travis, but there now seems to be a problem with the CI servers that is making most CI jobs fail.

pip._vendor.requests.packages.urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)", ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

I'll continue pushing my code, just to check when the CI servers will be restored.

chang · 2017-11-10T18:48:00Z

Is there an update on this PR? It looks like the previous Travis CI issues have been resolved and the builds are now passing. Thanks for doing this!

marcelobeckmann · 2017-11-10T19:03:06Z

Hi, I changed the code to avoid zero division warnings, as proposed by Pierre, and CI is green. I need someone to review my code.

chang · 2017-12-02T01:30:55Z

Hi @marcelobeckmann, I'd love to see this work get merged - Gower similarity is very useful in the case of mixed data types, which we frequently encounter in the real world.

Pinging one of the core maintainers might help if your PR got lost in the queue. Also, I noticed that the issue # in the PR is incorrect, should be #5884.

sklearn/metrics/pairwise.py

sklearn/metrics/tests/test_pairwise.py

sklearn/metrics/pairwise.py

marcelobeckmann · 2017-12-04T17:49:46Z

Thanks for this review @jnothman , a lot of work to do before Christmas! :)

OlafEichstaedt · 2018-01-17T10:52:26Z

@marcelobeckmann Just downloaded your code as a jupyter notebook https://sourceforge.net/projects/gower-distance-4python/files/gower_function-v3.ipynb/download ... thanks for all the work you have put in, very, very, very useful indeed. I have a question, though: it seems that None in a scalar dimension is transformed to "NaN" and causes all the distances to this observation to be "nan". Is this a bug or a feature? Or am I missing something here? I will analyze the code some more but maybe you already have an answer ready at hand ...

marcelobeckmann · 2018-01-17T22:48:06Z

Hi @OlafEichstaedt, this is a feature of this implementation: None represents a missing value in an object, and the resulting numerical distance is NaN because there is nothing to compare. This result is equivalent to the R Gower implementation. Please contact me directly for further questions about the Jupyter notebook, as this forum is for the definitive implementation of Gower that I'm developing for scikit-learn.
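The propagation described here can be reproduced with plain NumPy. This is only an illustration of the arithmetic, not the notebook's code; the variable names are made up for the example, and `np.nansum` is shown purely as a contrast:

```python
import numpy as np

# None is converted to nan when the array is built with a float dtype.
row_a = np.array([1.0, 4.0, None], dtype=float)
row_b = np.array([2.0, 6.0, 3.0])

diff = np.abs(row_a - row_b)

print(np.isnan(row_a[2]))      # the None became nan
print(np.isnan(diff.sum()))    # a plain sum propagates the nan
print(np.nansum(diff))         # nansum skips the missing feature instead
```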

marcelobeckmann · 2018-02-01T07:58:43Z

Hi, just to let you know I'm performing some profiling for array vectorization, and making some adjustments for sparse matrices. My fixes are on the way.

jnothman · 2018-02-01T10:54:56Z

I would ignore the sparse matrix case for now...

sklearn-lgtm · 2018-02-26T17:19:53Z

This pull request introduces 1 alert when merging 2804c9d into 3e29334 - view on lgtm.com

new alerts:

1 for Variable defined multiple times

Comment posted by lgtm.com

jnothman · 2018-02-26T22:45:48Z

thanks for updating this. please ping when those tests are done and you are ready for another full review

jnothman · 2018-02-26T22:47:02Z

Test failures

jnothman · 2018-03-05T10:55:15Z

Please ping when this is ready for another review

sklearn-lgtm · 2018-03-10T23:30:39Z

This pull request introduces 1 alert when merging 5e0f965 into 3e29334 - view on lgtm.com

new alerts:

1 for Variable defined multiple times

Comment posted by lgtm.com

jnothman · 2018-03-10T23:56:07Z

@marcelobeckmann would you like help with this?

jnothman · 2018-03-11T13:16:23Z

Also, you can run the tests on your own machine, by running pytest sklearn/metrics for instance.

marcelobeckmann · 2018-03-14T06:27:34Z

Hi @jnothman,

The tests are passing locally using make and pytest. I'm using Python 3.5.3 and Anaconda 4.0.0 (64-bit). Now I'm modifying my code and pushing it several times to get some clue about why I'm getting unexpected values with Travis.

Any help or direction will be very welcome.

sklearn/metrics/pairwise.py

marcelobeckmann · 2018-03-15T08:30:57Z

Hi @jnothman,

I made the changes you proposed and this stopped affecting the other libraries, but I'm still getting errors in my assertions. I'm 100% sure my expected values are correct, and all the tests are passing locally, so there seems to be a discrepancy between the CI and my environment. I'm going to use PYTHON_VERSION="3.4" in my environment, and I'll see if I can reproduce the assertion errors locally.

sklearn/metrics/pairwise.py

jnothman

You still have tests failing on old numpy. I've not looked into it, but you might consider installing an old numpy (e.g. 1.9) and debugging

marcelobeckmann · 2019-11-27T08:44:52Z

I reverted the proposal to speed up the detection of categorical features in the case of all-NaN columns, because the builds py35_ubuntu_atlas_32bit, py35_ubuntu_atlas, and py35_conda_openblas return nan from np.nansum when 0 was expected.

Please let me know if you have an alternative to np.nansum on these platforms, or if you are happy with the current proposal to detect categorical attributes. I'm happy to help somehow.
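One possible alternative, sketched under the assumption that the goal is a column sum that yields 0 for all-NaN columns regardless of how `np.nansum` behaves on a given build. The helper name `nan_safe_column_sum` is hypothetical:

```python
import numpy as np


def nan_safe_column_sum(X):
    """Hypothetical helper: column sums with NaN replaced by 0 beforehand."""
    X = np.asarray(X, dtype=float)
    # Replace NaN explicitly, then use the ordinary sum.
    return np.where(np.isnan(X), 0.0, X).sum(axis=0)


X = np.array([[np.nan, 1.0],
              [np.nan, 2.0]])

print(nan_safe_column_sum(X))       # first column is all NaN, sums to 0
print(np.isnan(X).all(axis=0))      # direct test for all-NaN columns
```

The second call shows that all-NaN columns can also be detected directly, without relying on any sum.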

marcelobeckmann · 2019-11-29T06:23:28Z

Hi, can someone make a code review please?

cmarmo · 2020-01-25T11:43:56Z

doc/modules/metrics.rst

+.. topic:: References:
+
+    * Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its 
+    Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.


Suggested change

Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.

Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.

cmarmo · 2020-01-25T11:44:31Z

doc/modules/metrics.rst

+
+    * Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its 
+    Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.
+    http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf


Suggested change

http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf

http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf

cmarmo · 2020-01-25T11:49:13Z

Hi @marcelobeckmann my two small fixes are meant to remove (I hope) the sphinx warnings.
May I suggest changing the title of your PR: [WIP] -> [MRG]? I hope this will capture some attention. Anyway, this PR has already been added to the 0.23 milestone... good sign! :)

jnothman

Just a few comments on the recently addressed portions.

I'd like to review tests next time.

jnothman · 2020-01-28T02:23:19Z

sklearn/metrics/pairwise.py

+        def detect_cat(x):
+            if not np.isnan(x):
+                if np.issubdtype(type(x), np.number):
+                    raise ValueError(False)


This looks like a very unconventional way of providing control flow and passing values around. Why are we using exceptions rather than return values here?

Okay. I see that we're applying pyufunc to check each element individually, and using exceptions to abort as soon as we have a non-NaN. This logic is very unclear from your code, and I see no benefit in doing it this way rather than an explicit python loop over elements, or something more functional-style:

    non_nan_values = itertools.dropwhile(np.isnan, X[:, col])
    try:
        value = next(non_nan_values)
    except StopIteration:
        pass  # TODO: handle case when all values are NaN
    # TODO: determine type from value
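A runnable version of that functional-style idea might look like the sketch below. The helper names and the choice to treat an all-missing column as numeric are assumptions made for illustration, not decisions taken in this PR. A small `_is_missing` wrapper is used because `np.isnan` raises on strings:

```python
import itertools

import numpy as np


def _is_missing(value):
    """Hypothetical: None and float NaN count as missing."""
    return value is None or (
        isinstance(value, (float, np.floating)) and np.isnan(value)
    )


def column_is_categorical(column):
    """Decide from the first non-missing value of the column."""
    non_missing = itertools.dropwhile(_is_missing, column)
    try:
        value = next(non_missing)
    except StopIteration:
        # Assumption: an all-missing column is treated as numeric.
        return False
    return not np.issubdtype(type(value), np.number)


col_cat = np.array([np.nan, np.nan, "red"], dtype=object)
col_num = np.array([np.nan, 2.5], dtype=object)
print(column_is_categorical(col_cat), column_is_categorical(col_num))
```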

jnothman · 2020-01-28T02:44:18Z

sklearn/metrics/pairwise.py

+        with the numerical indexes of the categorical attribtes.
+
+        If the categorical_features array is not provided, by default all
+        non-numeric columns are considered categorical.


Note that behaviour is undefined if columns mix numeric and non-numeric values.

jnothman

Hi @marcelobeckmann, I'm finding the single very long test function hard to read. While I appreciate the attempt to show that gower_distances produces results equivalent to a simplified implementation (with nested for loops), overall it's very hard to see what your tests are asserting without a lot of attention.

A good test suite should look like a proof, or an essay. I would like to see a test suite with separate tests for different lemmas towards that proof or arguments in that essay:

* test_gower_distances_sample_pair(x1, y1, scale, categorical_features, expected) should show that for a series of toy examples, gower_distances([x1], [y1], scale=scale, categorical_features=categorical_features) calculates the [[expected]] distance between them. The current test spends too much effort setting up a matrix of results. Focus on one at a time. This should be parametrised to check different scalings, mixes of categorical and numeric, representation of categorical as string or numeric, number of features, and missing values.
* test_gower_distances_matrix(X, Y, expected_scale, expected_categorical_features) should check that the gower_distances(X, Y) result decomposes over sample pairs for the given scale and categorical_features. I.e. gower_distances(X, Y)[i, j] == gower_distances([X[i]], [Y[j]], scale=expected_scale, categorical_features=expected_categorical_features). This checks that the overall matrix is constructed correctly, and that categorical_features and scale are correctly inferred.
* test_gower_distances_validation should check that appropriate validation is performed.

Potentially there are things I've missed out, but I think these two tests, with carefully selected example parametrizations (perhaps with a comment for what that example adds to the previous), would demonstrate together that the implementation is correct. (Do we check elsewhere that gower_distances(X, X) == gower_distances(X)?)
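The structure proposed above can be sketched as follows. Since `gower_distances` is not available outside this PR, the sketch uses a toy stand-in (`gower_pair` and `gower_matrix`, both hypothetical) and a plain table of cases; in a real suite each tuple would become one parametrised case. With the toy stand-in the decomposition check is trivially true, and it only becomes meaningful against a vectorised implementation:

```python
import numpy as np


def gower_pair(x, y, scale, categorical):
    """Toy reference: mean per-feature dissimilarity of one sample pair."""
    terms = []
    for xk, yk, sk, is_cat in zip(x, y, scale, categorical):
        if is_cat:
            terms.append(0.0 if xk == yk else 1.0)
        else:
            terms.append(abs(xk - yk) / sk)
    return sum(terms) / len(terms)


def gower_matrix(X, Y, scale, categorical):
    """Toy stand-in for the function under test."""
    return np.array(
        [[gower_pair(x, y, scale, categorical) for y in Y] for x in X]
    )


PAIR_CASES = [
    # (x, y, scale, categorical, expected)
    ([1.0], [3.0], [4.0], [False], 0.5),        # one numeric feature
    (["a"], ["a"], [1.0], [True], 0.0),         # equal categories
    (["a", 1.0], ["b", 3.0], [1.0, 4.0], [True, False], 0.75),  # mixed
]


def check_sample_pair(x, y, scale, categorical, expected):
    assert np.isclose(gower_pair(x, y, scale, categorical), expected)


def check_matrix_decomposes(X, Y, scale, categorical):
    D = gower_matrix(X, Y, scale, categorical)
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            assert np.isclose(D[i, j], gower_pair(x, y, scale, categorical))


for case in PAIR_CASES:
    check_sample_pair(*case)
check_matrix_decomposes(
    [["a", 1.0], ["b", 3.0]], [["a", 5.0]], [1.0, 4.0], [True, False]
)
```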

jnothman · 2020-01-28T07:40:26Z

sklearn/metrics/tests/test_pairwise.py

@@ -602,44 +603,37 @@ def test_pairwise_distances_chunked():
        next(gen)


-@pytest.mark.parametrize("x_array_constr", [np.array, csr_matrix],


I assume that these changes to the euclidean distances tests have been included accidentally in a bad merge. Please revert this section (i.e. copy in the code from master).

jnothman · 2020-01-28T07:41:57Z

sklearn/metrics/tests/test_pairwise.py

+
+    D = gower_distances(X)
+
+    # These are the normalized values for X above


Suggested change

# These are the normalized values for X above

# These are the scaled values for X above

jnothman · 2020-01-28T07:42:26Z

sklearn/metrics/tests/test_pairwise.py

+
+    with pytest.raises(ValueError):
+        D = gower_distances(X, scale=[1])
+        print(D)


is there a reason to print in the test? especially after a ValueError?

jnothman · 2020-01-28T07:48:19Z

sklearn/metrics/tests/test_pairwise.py

+        for j in range(0, 4):
+            # The calculations below shows how it compares observation
+            # by observation, attribute by attribute.
+            D_expected[i][j] = (([1, 0][X[i][0] == X[j][0]] +


This doesn't extend to NaNs in individual values (as opposed to an entire row of NaNs which I think is an unhelpful degenerate case).
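A reference calculation that does cover NaNs in individual values could look like this sketch. It assumes a feature is skipped whenever either value is missing, which is one reading of the denominator in the user-guide formula; the function names are hypothetical:

```python
import math


def is_missing(value):
    """Hypothetical: None and float NaN count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def gower_pair_skip_missing(x, y, scale, categorical):
    """Mean dissimilarity over the features present in both samples."""
    total, counted = 0.0, 0
    for xk, yk, sk, is_cat in zip(x, y, scale, categorical):
        if is_missing(xk) or is_missing(yk):
            continue  # assumption: missing features are left out entirely
        total += (0.0 if xk == yk else 1.0) if is_cat else abs(xk - yk) / sk
        counted += 1
    # Nothing comparable at all: return NaN rather than divide by zero.
    return total / counted if counted else float("nan")


x = ["a", float("nan"), 2.0]
y = ["b", 1.0, 4.0]
d = gower_pair_skip_missing(x, y, [1.0, 1.0, 4.0], [True, False, False])
print(d)  # (1 + 0.5) / 2, the middle feature is skipped
```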

jnothman · 2020-01-28T08:06:18Z

sklearn/metrics/pairwise.py

+            scale = kwds['scale']
+        scale, _, _ = _precompute_gower_params(X, Y, scale, num_mask)
+
+        return {'scale': scale}


shouldn't we also return the determined categorical_features if they had been passed in as None?

jnothman · 2020-01-28T08:06:46Z

sklearn/metrics/pairwise.py

+        if 'categorical_features' in kwds:
+            categorical_features = kwds['categorical_features']
+
+        num_mask = ~ _detect_categorical_features(X, categorical_features)


Is there benefit to determining categorical features from both X and Y?
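One case where it would matter is a column that is entirely missing in X but holds strings in Y. The sketch below assumes a hypothetical string-based inference rule (`infer_categorical`), which is not the PR's `_detect_categorical_features`:

```python
import numpy as np


def infer_categorical(*arrays):
    """Hypothetical: a column is categorical if any stacked value is a str."""
    stacked = np.vstack([np.asarray(a, dtype=object) for a in arrays])
    return np.array(
        [any(isinstance(v, str) for v in stacked[:, k])
         for k in range(stacked.shape[1])]
    )


X = np.array([[np.nan, 1.0], [np.nan, 2.0]], dtype=object)
Y = np.array([["red", 3.0]], dtype=object)

print(infer_categorical(X))      # column 0 looks numeric from X alone
print(infer_categorical(X, Y))   # Y reveals that column 0 is categorical
```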

jnothman · 2020-01-28T08:07:00Z

sklearn/metrics/pairwise.py

+        if 'categorical_features' in kwds:
+            categorical_features = kwds['categorical_features']
+
+        num_mask = ~ _detect_categorical_features(X, categorical_features)


Is there benefit to determining categorical features from both X and Y?

NicolasHug

Quick pass on the user guide

NicolasHug · 2020-01-30T18:50:47Z

sklearn/metrics/pairwise.py

+    """Compute the distances between the observations in X and Y,
+    that may contain mixed types of data, using an implementation
+    of Gower formula.
+


Please add "Read more in the :ref:`User Guide <ref_to_UG>`"

NicolasHug · 2020-01-30T18:50:57Z

sklearn/metrics/pairwise.py

+
+    Returns
+    -------
+    similarities : ndarray, shape (n_samples_X, n_samples_Y)


NicolasHug · 2020-01-30T18:52:35Z

doc/modules/metrics.rst

+
+Gower distances
+-----------------
+The function :func:`gower_distances` computes the distances between the


Suggested change

The function :func:`gower_distances` computes the distances between the

The function :func:`~sklearn.metrics.pairwise.gower_distances` computes the distances between the

NicolasHug · 2020-01-30T18:54:19Z

doc/modules/metrics.rst

+s(x, y) : Calculates the similarity of all features (for k = 1 to n_features)
+of x and y, as described by the expressions:
+
+    s(x_k, y_k) = 0, if k represents a boolean or categorical attribute,


These should be rendered in latex :math:`formula here`

NicolasHug · 2020-01-30T18:57:14Z

doc/modules/metrics.rst

+
+Where:
+
+x, y : array_like (1, n_features) are the observations to be compared.


Suggested change

x, y : array_like (1, n_features) are the observations to be compared.

x, y : two samples to be compared.

NicolasHug · 2020-01-30T18:59:05Z

doc/modules/metrics.rst

+
+.. math::
+
+    g(\mathbf{x}, \mathbf{y}) = \frac{\sum_i(s(x_i, y_i))}{|\{i| x_i\text{ is not missing or }y_i\text{ is not missing}\}|}


use i or k to index the features but not both please
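For reference, the quoted formula rewritten with k as the only index, matching the s(x_k, y_k) notation used in the surrounding definitions. Here n stands for n_features, and the wording of the denominator is carried over unchanged from the diff:

```latex
g(\mathbf{x}, \mathbf{y}) =
  \frac{\sum_{k=1}^{n} s(x_k, y_k)}
       {\left|\{k \mid x_k \text{ is not missing or } y_k \text{ is not missing}\}\right|}
```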

NicolasHug · 2020-01-30T19:01:12Z

doc/modules/metrics.rst

+-----------------
+The function :func:`gower_distances` computes the distances between the
+observations in X and Y, that may contain combinations of numerical, boolean,
+or categorical attributes, using an implementation of Gower Similarity.


Please describe how we go from the similarity to the distance?

NicolasHug · 2020-01-30T19:02:13Z

doc/modules/metrics.rst

+    s(x_k, y_k) = 1, if k represents a boolean or categorical attribute,
+    and they are unequal.
+
+    s(x_k, y_k) = abs(x_k - y_k), if k represents a numerical attribute.


So IIUC, the scale of a numerical feature will have a huge impact on the final value? Should the features be standardized before computing the Gower similarity?

The features are currently being min-max scaled within Gower unless scale=False.
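The effect of that scaling can be seen in a small sketch. It assumes scaling by the per-feature range (max minus min), with constant features given a range of 1 to avoid division by zero; `feature_ranges` is a hypothetical helper, not the PR's code:

```python
import numpy as np


def feature_ranges(X):
    """Hypothetical helper: per-feature range, ignoring NaN."""
    X = np.asarray(X, dtype=float)
    ranges = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)
    ranges[ranges == 0] = 1.0  # constant feature: avoid dividing by zero
    return ranges


X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

ranges = feature_ranges(X)
raw = np.abs(X[0] - X[2])   # unscaled: the second feature dominates
scaled = raw / ranges       # scaled: both features contribute equally

print(raw, scaled)
```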

jnothman · 2020-03-29T09:51:55Z

Any chance we might be able to pull this towards the finish line in April? I know the world's a bit crazy right now...

NicolasHug · 2020-03-29T12:04:43Z

I can try taking this one up, if that's OK with @marcelobeckmann ?

adrinjalali · 2020-03-29T12:06:02Z

I just thought of the same thing and started working on it @NicolasHug lol

marcelobeckmann · 2020-03-29T15:16:14Z

Hi Nicolas, please feel free to take over this one, I'm not able to make it right now.

NicolasHug · 2020-03-29T15:17:25Z

Thanks for the notice @marcelobeckmann .

@adrinjalali go ahead

marcelobeckmann · 2020-04-03T15:24:20Z

This PR is closed. Please visit #16834 for further developments regarding Gower distance on scikit-learn.

jnothman added the Waiting for Reviewer label Aug 15, 2017

jnothman changed the title ~~[5584] Implement Gower similarity coeficient~~ [MRG] Implement Gower similarity coeficient Aug 17, 2017

jnothman reviewed Sep 10, 2017

jnothman reviewed Dec 4, 2017

jnothman reviewed Mar 14, 2018

jnothman previously requested changes Mar 15, 2018

jnothman reviewed Mar 15, 2018

marcelobeckmann added 6 commits November 22, 2019 07:37

Make some prints to figure out the unit test error in some specifc platforms

988028a

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into b5584

8454f97

Make some prints to figure out the unit test error in some specifc platforms

f1d840d

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into b5584

a8f2a65

Make some prints to figure out the unit test error in some specifc platforms

37359f0

Revert improvement to check full nan columns

63c179e

jnothman mentioned this pull request Dec 15, 2019

Feature Request: Include Heterogeneous Distance Metrics #15894

Open

cmarmo reviewed Jan 25, 2020

thomasjpfan self-assigned this Jan 27, 2020

jnothman reviewed Jan 28, 2020

NicolasHug reviewed Jan 30, 2020

github-actions bot added module:metrics module:neighbors labels Mar 2, 2020

adrinjalali removed the Waiting for Reviewer label Mar 29, 2020

Merge remote-tracking branch 'upstream/master' into b5584

8786f5d

adrinjalali mentioned this pull request Apr 3, 2020

[MRG] FEA Gower distance #16834

Open

marcelobeckmann closed this Apr 3, 2020

marcelobeckmann changed the title ~~[WIP] Implement Gower similarity coeficient~~ Implement Gower similarity coeficient Apr 9, 2020
