8000 Implement Gower similarity coeficient by marcelobeckmann · Pull Request #9555 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

Implement Gower similarity coeficient #9555

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 206 commits into from

Conversation

marcelobeckmann
Copy link
@marcelobeckmann marcelobeckmann commented Aug 15, 2017

Reference Issue

Fixes #5884

What does this implement/fix? Explain your changes.

Implements the Gower similarity in the sklearn.metrics.pairwse

Any other comments?

Unit tests are on the way, but please review and advise this piece of code while this. 8000
This code cares about NaN propagation and non square matrix for parallel processing.

@jnothman
Copy link
Member

In scikit-learn we often use numeric arrays where ints are used to represent categorical features. Perhaps this needs a categorical_features parameter that identifies the categorical columns when dtype is not discriminative.

Also, the builds are failing.

I'm marking the title as [WIP]. When you are satisfied with the testing and documentation, please mark it MRG and let us know.

@jnothman jnothman changed the title [5584] Implement Gower similarity coeficient [MRG] Implement Gower similarity coeficient Aug 17, 2017
@ashimb9
Copy link
Contributor
ashimb9 commented Aug 20, 2017

The linked issue contains a typo I believe and should be #5884

@marcelobeckmann
Copy link
Author

Thanks for the hints, I'm working on this.

@jnothman
Copy link
Member

Please make sure your code adheres to PEP8 to make it easier to review

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests. Numerous lines do not have test coverage.

@marcelobeckmann
Copy link
Author

I fixed all the minor changes proposed by Travis, but seems now there is a problem in the CI servers that are making most CI applications fail.

pip._vendor.requests.packages.urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)", ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

I'll continue pushing my code, just to check when the CI servers will be restored.

@chang
Copy link
Contributor
chang commented Nov 10, 2017

Is there an update on this PR? It looks like the previous Travis CI issues have been resolved and the builds are now passing. Thanks for doing this!

@marcelobeckmann
Copy link
Author

Hi, I changed the code to avoid zero division warnings, as proposed by Pierre, and CI is green. I need someone to review my code.

@chang
Copy link
Contributor
chang commented Dec 2, 2017

Hi @marcelobeckmann, I'd love to see this work get merged - Gower similarity is very useful in the case of mixed data types, which we frequently encounter in the real word.

Pinging one of the core maintainers might help if your PR got lost in the queue. Also, I noticed that the issue # in the PR is incorrect, should be #5884.

@marcelobeckmann
Copy link
Author

Thanks for this review @jnothman , a lot of work to do before Christmas! :)

@OlafEichstaedt
Copy link
OlafEichstaedt commented Jan 17, 2018

@marcelobeckmann 8000 Just downloaded your code as a jupyter notebook https://sourceforge.net/projects/gower-distance-4python/files/gower_function-v3.ipynb/download ... thanks for all the work you have put in, very, very, very useful indeed. I have a question, though: it seems that None in a scalar dimension is transformed to "NaN" and causes all the distances to this observation to be "nan". Is this a bug or a feature? Or am I missing something here? I will analyze the code some more but maybe you already have an answer ready at hand ...

@marcelobeckmann
Copy link
Author
marcelobeckmann commented Jan 17, 2018

Hi @OlafEichstaedt, this is a feature of this implementation, as None is related to a missing value in an object, and the resulting numerical distance is a NaN, as there is nothing to compare over there. This result is equivalent to the R Gower implementation. Please contact me directly for further questions about the jupyter notebook, as this forum is for the definitive implementation of Gower that I'm developing for scikit learn.

@marcelobeckmann
Copy link
Author

Hi, just to let you know I´m performing some profiling for array vectorization, and making some adjustments for sparse matrix. My fixes are on the way.

@jnothman
Copy link
Member
jnothman commented Feb 1, 2018 via email

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 2804c9d into 3e29334 - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@jnothman
Copy link
Member
jnothman commented Feb 26, 2018 via email

@jnothman
Copy link
Member

Test failures

@jnothman
Copy link
Member
jnothman commented Mar 5, 2018

Please ping when this is ready for another review

@sklearn-lgtm
Copy link

This pull request introduces 1 alert when merging 5e0f965 into 3e29334 - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@jnothman
Copy link
Member

@marcelobeckmann would you like help with this?

@jnothman
Copy link
Member

Also, you can run the tests on your own machine, by running pytest sklearn/metrics for instance.

@marcelobeckmann
Copy link
Author
marcelobeckmann commented Mar 14, 2018

Hi @jnothman,

The tests are passing locally using make and pytest. I'm using Python 3.5.3 and Anaconda 4.0.0 (64-bit). Now I'm modifying my code and pushing it several times to get some clue about why I'm getting unexpected values with Travis.

Any help or direction will be very welcome.

@marcelobeckmann
Copy link
Author

Hi @jnothman,

I made the changes you proposed and this stopped to affect the other libraries, but I'm still getting errors in my assertions. I'm 100% sure my expected values are correct, and all the tests are passing locally, seems to be a discrepancy between the CI and my environment. I'm going to use PYTHON_VERSION="3.4" in my environment, and I'll see if I can reproduce the assertion errors locally.

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still have tests failing on old numpy. I've not looked into it, but you might consider installing an old numpy (e.g. 1.9) and debugging

@marcelobeckmann
Copy link
Author
marcelobeckmann commented Nov 27, 2019

I reverted the proposal to speedup the detection of categorical features in case of full nan columns, because the deployments py35_ubuntu_atlas_32bit, py35_ubuntu_atlas, py35_conda_openblas are returning nan from np.nansum, when 0 was expected.

Please let me know if you have an alternative to np.nansum on these platforms, or if you are happy with the current proposal to detect categorical attributes. I'm happy to help somehow.

@marcelobeckmann
Copy link
Author

Hi, can someone make a code review please?

.. topic:: References:

* Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.


* Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871.
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf

@cmarmo
Copy link
Contributor
cmarmo commented Jan 25, 2020

Hi @marcelobeckmann my two small fixes are meant to remove (I hope) the sphinx warnings.
May I suggest to change the title of your PR : [WIP] -> [MRG]. Hope that this will capture some attention. Anyway, this PR already deserved to be added to the 0.23 milestone... good sign! :)

@thomasjpfan thomasjpfan self-assigned this Jan 27, 2020
Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments on the recently addressed portions.

I'd like to review tests next time.

def detect_cat(x):
if not np.isnan(x):
if np.issubdtype(type(x), np.number):
raise ValueError(False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a very unconventional way of providing control flow and passing values around. Why are we using exceptions rather than return values here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I see that we're applying pyufunc to check each element individually, and using exceptions to abort as soon as we have a non-NaN. This logic is very unclear from your code, and I see no benefit in doing it this way rather than an explicit python loop over elements, or something more functional-style:

non_nan_values = itertools.dropwhile(np.isnan, X[:, col])
try:
    value = next(non_nan_values)
except StopIteration:
    TODO: handle case when all values are NaN

TODO: determine type from value

with the numerical indexes of the categorical attribtes.

If the categorical_features array is not provided, by default all
non-numeric columns are considered categorical.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that behaviour is undefined if columns mix numeric and non-numeric values.

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @marcelobeckmann, I'm finding the single very long test function hard to read. While I appreciate the attempt to show that gower_distances produces results equivalent to a simplified implementation (with nested for loops), overall it's very hard to see what your tests are asserting without a lot of attention.

A good test suite should look like a proof, or an essay. I would like to see a test suite with separate tests for different lemmas towards that proof or arguments in that essay:

  1. test_gower_distances_sample_pair(x1, y1, scale, categorical_features, expected) should show that for a series of toy examples, gower_distances([x1], [y1], scale=scale, categorical_features=categorical_features) calculates the [[expected]] distance between them. The current test spends too much effort setting up a matrix of results. Focus on one at a time. This should be parametrised to check different scalings, mixes of categorical and numeric, representation of categorical as string or numeric, number of features, and missing values.
  2. test_gower_distances_matrix(X, Y, expected_scale, expected_categorical_features) should check that the gower_distances(X, Y) result decomposes over sample pairs for the given scale and categorical_features. I.e. gower_distances(X, Y)[i, j] == gower_distances([X[i]], [Y[i]], scale=expected_scale, categorical_features=expected_categorical_features). This checks that the overall matrix is constructed correctly, and that categorical_features and scale are correctly inferred.
  3. test_gower_distances_validation that appropriate validation is performed.

Potentially there are things I've missed out, but I think these two tests, with carefully selected example parametrizations (perhaps with a comment for what that example adds to the previous), would demonstrate together that the implementation is correct. (Do we check elsewhere that gower_distances(X, X) == gower_distances(X)?)

@@ -602,44 +603,37 @@ def test_pairwise_distances_chunked():
next(gen)


@pytest.mark.parametrize("x_array_constr", [np.array, csr_matrix],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that these changes to the euclidean distances tests have been included accidentally in a bad merge. Please revert this section (i.e. copy in the code from master).


D = gower_distances(X)

# These are the normalized values for X above
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# These are the normalized values for X above
# These are the scaled values for X above


with pytest.raises(ValueError):
D = gower_distances(X, scale=[1])
print(D)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to print in the test? especially after a ValueError?

for j in range(0, 4):
# The calculations below shows how it compares observation
# by observation, attribute by attribute.
D_expected[i][j] = (([1, 0][X[i][0] == X[j][0]] +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't extend to NaNs in individual values (as opposed to an entire row of NaNs which I think is an unhelpful degenerate case).

scale = kwds['scale']
scale, _, _ = _precompute_gower_params(X, Y, scale, num_mask)

return {'scale': scale}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also return the determined categorical_features if they had been passed in as None?

if 'categorical_features' in kwds:
categorical_features = kwds['categorical_features']

num_mask = ~ _detect_categorical_features(X, categorical_features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there benefit to determining categorical features from both X and Y?

if 'categorical_features' in kwds:
categorical_features = kwds['categorical_features']

num_mask = ~ _detect_categorical_features(X, categorical_features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there benefit to determining categorical features from both X and Y?

Copy link
Member
@NicolasHug NicolasHug left a comment
10000

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick pass on the user guide

"""Compute the distances between the observations in X and Y,
that may contain mixed types of data, using an implementation
of Gower formula.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add "Read more in the :ref:User Guide <ref_to_UG>"


Returns
-------
similarities : ndarray, shape (n_samples_X, n_samples_Y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distances?


Gower distances
-----------------
The function :func:`gower_distances` computes the distances between the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The function :func:`gower_distances` computes the distances between the
The function :func:`~sklearn.metrics.pairwise.gower_distances` computes the distances between the

s(x, y) : Calculates the similarity of all features (for k = 1 to n_features)
of x and y, as described by the expressions:

s(x_k, y_k) = 0, if k represents a boolean or categorical attribute,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be rendered in latex :math:`formula here`


Where:

x, y : array_like (1, n_features) are the observations to be compared.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
x, y : array_like (1, n_features) are the observations to be compared.
x, y : two samples to be compared.


.. math::

g(\mathbf{x}, \mathbf{y}) = \frac{\sum_i(s(x_i, y_i))}{|\{i| x_i\text{ is not missing or }y_i\text{ is not missing}\}|}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use i or k to index the features but not both please

-----------------
The function :func:`gower_distances` computes the distances between the
observations in X and Y, that may contain combinations of numerical, boolean,
or categorical attributes, using an implementation of Gower Similarity.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe how we go from the similarity to the distance?

s(x_k, y_k) = 1, if k represents a boolean or categorical attribute,
and they are unequal.

s(x_k, y_k) = abs(x_k - y_k), if k represents a numerical attribute.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So IIUC, the scale of a numerical feature will have a huge impact on the final value? Should the features be standardized before computing the Gower similarity?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The features are currently being min-max scaled within Gower unless scale=False.

@jnothman
Copy link
Member

Any chance we might be able to pull this towards the finish line in April? I know the world's a bit crazy right now...

@NicolasHug
Copy link
Member

I can try taking this one up, if that's OK with @marcelobeckmann ?

@adrinjalali
Copy link
Member

I just thought of the same thing and started working on it @NicolasHug lol

@marcelobeckmann
Copy link
Author
marcelobeckmann commented Mar 29, 2020 via email

@NicolasHug
Copy link
Member

Thanks for the notice @marcelobeckmann .

@adrinjalali go ahead

@marcelobeckmann
Copy link
Author

This PR is closed. Please visit #16834 for further developments regarding Gower distance on scikit-learn.

@marcelobeckmann marcelobeckmann changed the title [WIP] Implement Gower similarity coeficient Implement Gower similarity coeficient Apr 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Gower Similarity Coefficient
0