-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Implement Gower similarity coeficient #9555
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In scikit-learn we often use numeric arrays where ints are used to represent categorical features. Perhaps this needs a Also, the builds are failing. I'm marking the title as [WIP]. When you are satisfied with the testing and documentation, please mark it MRG and let us know. |
The linked issue contains a typo I believe and should be #5884 |
Thanks for the hints, I'm working on this. |
Please make sure your code adheres to PEP8 to make it easier to review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests. Numerous lines do not have test coverage.
I fixed all the minor changes proposed by Travis, but seems now there is a problem in the CI servers that are making most CI applications fail. pip._vendor.requests.packages.urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)", ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)) I'll continue pushing my code, just to check when the CI servers will be restored. |
Is there an update on this PR? It looks like the previous Travis CI issues have been resolved and the builds are now passing. Thanks for doing this! |
Hi, I changed the code to avoid zero division warnings, as proposed by Pierre, and CI is green. I need someone to review my code. |
Hi @marcelobeckmann, I'd love to see this work get merged - Gower similarity is very useful in the case of mixed data types, which we frequently encounter in the real word. Pinging one of the core maintainers might help if your PR got lost in the queue. Also, I noticed that the issue # in the PR is incorrect, should be #5884. |
Thanks for this review @jnothman , a lot of work to do before Christmas! :) |
@marcelobeckmann 8000 Just downloaded your code as a jupyter notebook https://sourceforge.net/projects/gower-distance-4python/files/gower_function-v3.ipynb/download ... thanks for all the work you have put in, very, very, very useful indeed. I have a question, though: it seems that None in a scalar dimension is transformed to "NaN" and causes all the distances to this observation to be "nan". Is this a bug or a feature? Or am I missing something here? I will analyze the code some more but maybe you already have an answer ready at hand ... |
Hi @OlafEichstaedt, this is a feature of this implementation, as None is related to a missing value in an object, and the resulting numerical distance is a NaN, as there is nothing to compare over there. This result is equivalent to the R Gower implementation. Please contact me directly for further questions about the jupyter notebook, as this forum is for the definitive implementation of Gower that I'm developing for scikit learn. |
Hi, just to let you know I´m performing some profiling for array vectorization, and making some adjustments for sparse matrix. My fixes are on the way. |
I would ignore the sparse matrix case for now...
…On 1 February 2018 at 18:58, Marcelo Beckmann ***@***.***> wrote:
Hi, just to let you know I´m performing some profiling for array
vectorization, and making some adjustments for sparse matrix. My fixes are
on the way.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9555 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz62-kxtp96knwK8YkiCVhTzGC-PQmks5tQW61gaJpZM4O3Q5v>
.
|
This pull request introduces 1 alert when merging 2804c9d into 3e29334 - view on lgtm.com new alerts:
Comment posted by lgtm.com |
thanks for updating this. please ping when those tests are done and you are
ready for another full review
|
Test failures |
Please ping when this is ready for another review |
This pull request introduces 1 alert when merging 5e0f965 into 3e29334 - view on lgtm.com new alerts:
Comment posted by lgtm.com |
@marcelobeckmann would you like help with this? |
Also, you can run the tests on your own machine, by running |
Hi @jnothman, The tests are passing locally using make and pytest. I'm using Python 3.5.3 and Anaconda 4.0.0 (64-bit). Now I'm modifying my code and pushing it several times to get some clue about why I'm getting unexpected values with Travis. Any help or direction will be very welcome. |
Hi @jnothman, I made the changes you proposed and this stopped to affect the other libraries, but I'm still getting errors in my assertions. I'm 100% sure my expected values are correct, and all the tests are passing locally, seems to be a discrepancy between the CI and my environment. I'm going to use PYTHON_VERSION="3.4" in my environment, and I'll see if I can reproduce the assertion errors locally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still have tests failing on old numpy. I've not looked into it, but you might consider installing an old numpy (e.g. 1.9) and debugging
I reverted the proposal to speedup the detection of categorical features in case of full nan columns, because the deployments py35_ubuntu_atlas_32bit, py35_ubuntu_atlas, py35_conda_openblas are returning nan from np.nansum, when 0 was expected. Please let me know if you have an alternative to np.nansum on these platforms, or if you are happy with the current proposal to detect categorical attributes. I'm happy to help somehow. |
Hi, can someone make a code review please? |
.. topic:: References: | ||
|
||
* Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its | ||
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871. | |
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871. |
|
||
* Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its | ||
Properties, Biometrics, Vol. 27, No. 4. (Dec., 1971), pp. 857-871. | ||
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf | |
http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf |
Hi @marcelobeckmann my two small fixes are meant to remove (I hope) the sphinx warnings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments on the recently addressed portions.
I'd like to review tests next time.
def detect_cat(x): | ||
if not np.isnan(x): | ||
if np.issubdtype(type(x), np.number): | ||
raise ValueError(False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a very unconventional way of providing control flow and passing values around. Why are we using exceptions rather than return values here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I see that we're applying pyufunc to check each element individually, and using exceptions to abort as soon as we have a non-NaN. This logic is very unclear from your code, and I see no benefit in doing it this way rather than an explicit python loop over elements, or something more functional-style:
non_nan_values = itertools.dropwhile(np.isnan, X[:, col])
try:
value = next(non_nan_values)
except StopIteration:
TODO: handle case when all values are NaN
TODO: determine type from value
with the numerical indexes of the categorical attribtes. | ||
|
||
If the categorical_features array is not provided, by default all | ||
non-numeric columns are considered categorical. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that behaviour is undefined if columns mix numeric and non-numeric values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @marcelobeckmann, I'm finding the single very long test function hard to read. While I appreciate the attempt to show that gower_distances produces results equivalent to a simplified implementation (with nested for loops), overall it's very hard to see what your tests are asserting without a lot of attention.
A good test suite should look like a proof, or an essay. I would like to see a test suite with separate tests for different lemmas towards that proof or arguments in that essay:
test_gower_distances_sample_pair(x1, y1, scale, categorical_features, expected)
should show that for a series of toy examples,gower_distances([x1], [y1], scale=scale, categorical_features=categorical_features)
calculates the[[expected]]
distance between them. The current test spends too much effort setting up a matrix of results. Focus on one at a time. This should be parametrised to check different scalings, mixes of categorical and numeric, representation of categorical as string or numeric, number of features, and missing values.test_gower_distances_matrix(X, Y, expected_scale, expected_categorical_features)
should check that thegower_distances(X, Y)
result decomposes over sample pairs for the givenscale
andcategorical_features
. I.e.gower_distances(X, Y)[i, j] == gower_distances([X[i]], [Y[i]], scale=expected_scale, categorical_features=expected_categorical_features)
. This checks that the overall matrix is constructed correctly, and thatcategorical_features
andscale
are correctly inferred.test_gower_distances_validation
that appropriate validation is performed.
Potentially there are things I've missed out, but I think these two tests, with carefully selected example parametrizations (perhaps with a comment for what that example adds to the previous), would demonstrate together that the implementation is correct. (Do we check elsewhere that gower_distances(X, X) == gower_distances(X)
?)
Sorry, something went wrong.
All reactions
@@ -602,44 +603,37 @@ def test_pairwise_distances_chunked(): | |||
next(gen) | |||
|
|||
|
|||
@pytest.mark.parametrize("x_array_constr", [np.array, csr_matrix], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that these changes to the euclidean distances tests have been included accidentally in a bad merge. Please revert this section (i.e. copy in the code from master).
Sorry, something went wrong.
All reactions
|
||
D = gower_distances(X) | ||
|
||
# These are the normalized values for X above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# These are the normalized values for X above | |
# These are the scaled values for X above |
Sorry, something went wrong.
All reactions
|
||
with pytest.raises(ValueError): | ||
D = gower_distances(X, scale=[1]) | ||
print(D) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason to print in the test? especially after a ValueError?
Sorry, something went wrong.
All reactions
for j in range(0, 4): | ||
# The calculations below shows how it compares observation | ||
# by observation, attribute by attribute. | ||
D_expected[i][j] = (([1, 0][X[i][0] == X[j][0]] + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't extend to NaNs in individual values (as opposed to an entire row of NaNs which I think is an unhelpful degenerate case).
Sorry, something went wrong.
All reactions
scale = kwds['scale'] | ||
scale, _, _ = _precompute_gower_params(X, Y, scale, num_mask) | ||
|
||
return {'scale': scale} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we also return the determined categorical_features
if they had been passed in as None
?
Sorry, something went wrong.
All reactions
if 'categorical_features' in kwds: | ||
categorical_features = kwds['categorical_features'] | ||
|
||
num_mask = ~ _detect_categorical_features(X, categorical_features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there benefit to determining categorical features from both X and Y?
Sorry, something went wrong.
All reactions
if 'categorical_features' in kwds: | ||
categorical_features = kwds['categorical_features'] | ||
|
||
num_mask = ~ _detect_categorical_features(X, categorical_features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there benefit to determining categorical features from both X and Y?
Sorry, something went wrong.
All reactions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick pass on the user guide
Sorry, something went wrong.
All reactions
"""Compute the distances between the observations in X and Y, | ||
that may contain mixed types of data, using an implementation | ||
of Gower formula. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add "Read more in the :ref:User Guide <ref_to_UG>
"
Sorry, something went wrong.
All reactions
|
||
Returns | ||
------- | ||
similarities : ndarray, shape (n_samples_X, n_samples_Y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
distances?
Sorry, something went wrong.
All reactions
|
||
Gower distances | ||
----------------- | ||
The function :func:`gower_distances` computes the distances between the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function :func:`gower_distances` computes the distances between the | |
The function :func:`~sklearn.metrics.pairwise.gower_distances` computes the distances between the |
Sorry, something went wrong.
All reactions
s(x, y) : Calculates the similarity of all features (for k = 1 to n_features) | ||
of x and y, as described by the expressions: | ||
|
||
s(x_k, y_k) = 0, if k represents a boolean or categorical attribute, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be rendered in latex :math:`formula here`
Sorry, something went wrong.
All reactions
|
||
Where: | ||
|
||
x, y : array_like (1, n_features) are the observations to be compared. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x, y : array_like (1, n_features) are the observations to be compared. | |
x, y : two samples to be compared. |
Sorry, something went wrong.
All reactions
|
||
.. math:: | ||
|
||
g(\mathbf{x}, \mathbf{y}) = \frac{\sum_i(s(x_i, y_i))}{|\{i| x_i\text{ is not missing or }y_i\text{ is not missing}\}|} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use i or k to index the features but not both please
Sorry, something went wrong.
All reactions
----------------- | ||
The function :func:`gower_distances` computes the distances between the | ||
observations in X and Y, that may contain combinations of numerical, boolean, | ||
or categorical attributes, using an implementation of Gower Similarity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please describe how we go from the similarity to the distance?
Sorry, something went wrong.
All reactions
s(x_k, y_k) = 1, if k represents a boolean or categorical attribute, | ||
and they are unequal. | ||
|
||
s(x_k, y_k) = abs(x_k - y_k), if k represents a numerical attribute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So IIUC, the scale of a numerical feature will have a huge impact on the final value? Should the features be standardized before computing the Gower similarity?
Sorry, something went wrong.
All reactions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The features are currently being min-max scaled within Gower unless scale=False.
Sorry, something went wrong.
All reactions
Any chance we might be able to pull this towards the finish line in April? I know the world's a bit crazy right now... |
All reactions
Sorry, something went wrong.
I can try taking this one up, if that's OK with @marcelobeckmann ? |
All reactions
Sorry, something went wrong.
I just thought of the same thing and started working on it @NicolasHug lol |
All reactions
Sorry, something went wrong.
Hi Nicolas, please feel free to take over this one, I'm not able to make it
right now.
…On Sun 29 Mar 2020, 13:04 Nicolas Hug, ***@***.***> wrote:
I can try taking this one up, if that's OK with @marcelobeckmann
<https://github.com/marcelobeckmann> ?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9555 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAG4N363RMFXHAQMEWZUYITRJ42OTANCNFSM4DW5BZXQ>
.
|
All reactions
Sorry, something went wrong.
Thanks for the notice @marcelobeckmann . @adrinjalali go ahead |
All reactions
Sorry, something went wrong.
This PR is closed. Please visit #16834 for further developments regarding Gower distance on scikit-learn. |
All reactions
Sorry, something went wrong.
jnothman
guylr
NicolasHug
iamDecode
cmarmo
adrinjalali
darena-mdsol
Successfully merging this pull request may close these issues.
Implement Gower Similarity Coefficient
Reference Issue
Fixes #5884
What does this implement/fix? Explain your changes.
Implements the Gower similarity in the sklearn.metrics.pairwse
Any other comments?
Unit tests are on the way, but please review and advise this piece of code while this. 8000
This code cares about NaN propagation and non square matrix for parallel processing.