8000 CLN Refactors _encode into two functions by thomasjpfan · Pull Request #17101 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

CLN Refactors _encode into two functions #17101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 12, 2020

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Related to #16018

What does this implement/fix? Explain your changes.

Refactors _encode into two functions.

  1. _encode : encodes
  2. _unique : np.unique but also works for objects. It only implements the subset of features of np.unique that we need.

Any other comments?

This PR is to make #16018 easier to review. _unique_python is written in a weird way because in #16018 I will be adding return_counts.

CC @NicolasHug This may be the refactor we spoke about months ago.

Copy link
Member
@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of this @thomasjpfan it does seem to look better

@@ -34,7 +34,7 @@
from ..utils.validation import _deprecate_positional_args
from ..exceptions import UndefinedMetricWarning
from ..preprocessing import label_binarize
from ..preprocessing._label import _encode
from ..preprocessing._label import _encode, _unique
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move these out of preprocessing._label and put these in utils, since they're not just used for encoding labels but also features?

uniques : array, optional
If passed, uniques are not determined from passed values (this
uniques : array
Uniques are not determined from passed values (this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the part in the parenthesis should be removed it seems?
I guess this could just be "The unique values in values"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can also repeat that it should be sorted when relying on numpy

def _encode(values, uniques=None, encode=False, check_unknown=True):
"""Helper function to factorize (find uniques) and encode values.
def _encode(values, *, uniques, check_unknown=True):
"""Helper function encode values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Helper function encode values.
"""Helper function to encode values into [0, n_uniques - 1]



def _unique(values, *, return_inverse=False):
"""Helper function to find uniques with support for python objects.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Helper function to find uniques with support for python objects.
"""Helper function to find unique values with support for python objects.

The sorted uniique values

unique_inverse : ndarray
The indicies to reconstruct the original array from the unique array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The indicies to reconstruct the original array from the unique array.
The indices to reconstruct the original array from the unique array.

Comment on lines 93 to 103
ret = (uniques, )

if return_inverse:
table = {val: i for i, val in enumerate(uniques)}
inverse = np.array([table[v] for v in values])
ret += (inverse, )

if len(ret) == 1:
ret = ret[0]

return ret
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this equivalent? it seems we can avoid the tuple juggling here

Suggested change
ret = (uniques, )
if return_inverse:
table = {val: i for i, val in enumerate(uniques)}
inverse = np.array([table[v] for v in values])
ret += (inverse, )
if len(ret) == 1:
ret = ret[0]
return ret
if return_inverse:
table = {val: i for i, val in enumerate(uniques)}
inverse = np.array([table[v] for v in values])
return uniques, inverse
else:
return uniques

Copy link
Member
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm happy with these changes. Will these functions be more coupled if we handle NaNs?

unique_inverse : ndarray
The indicies to reconstruct the original array from the unique array.
Only provided if `return_inverse` is True.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Returns section

@thomasjpfan
Copy link
Member Author

Yes, I'm happy with these changes. Will these functions be more coupled if we handle NaNs?

From my early prototyping with nan/None support (which builds on top of this PR), there seems to be no change needed to _encode and an update to only _unique.

@thomasjpfan
Copy link
Member Author
  • Move things to utils._encode (for the lack of a better name)
  • Added more test for _encode_check_unknown

@amueller
Copy link
Member

Do you need reviews or are @jnothman and @NicolasHug on it?

@thomasjpfan
Copy link
Member Author

I would give them a few more days. But having another set if eyes would not hurt.

This PR only updates private API which hopefully makes the encoder logic easier to reason about.

Copy link
Member
@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A f 67ED ew more but LGTM otherwise.



def _unique_python(values, *, return_inverse):
# Only used in _uniques below, see docstring there for details
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lolol

Suggested change
# Only used in _uniques below, see docstring there for details
# Only used in _uniques above, see docstring there for details

Comment on lines 67 to 68
The uniques values in `values`. If the dtype is not object, then
`uniques` need to be sorted.
Copy link
Member
A3DB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The uniques values in `values`. If the dtype is not object, then
`uniques` need to be sorted.
The unique values in `values`. If the dtype is not object, then
`uniques` needs to be sorted.

Parameters
----------
values : ndarray
Values to factorize or encode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Values to factorize or encode.
Values to encode.

not sure what "factorize" means?

return np.searchsorted(uniques, values)


def _encode_check_unknown(values, uniques, return_mask=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this helper isn't specific to any encoding logic. Should this signature be instead:

def _check_unknown(values, known_values, return_mask=False):

(it's OK to keep it in this file though)

(np.array(['b', 'a', 'c', 'a', 'c']),
np.array(['a', 'b', 'c']))],
ids=['int64', 'object', 'str'])
def test_encode_util(values, expected):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the lexicographic order expected? If so maybe add a comment here. If it's a strict API contract it should also be in the docstring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lexicographic order is not expected. Updated with more test to check for unordered cases.

return np.searchsorted(uniques, values)


def _check_unknown(values, uniques, return_mask=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should uniques be known_values?
The fact that they're unique isn't relevant for this funciton

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updatted to known_values and also added in the docstring that it must be unique.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I thought it would be fine even with duplicated values.

Copy link
Member
@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @thomasjpfan

@NicolasHug NicolasHug merged commit d40d993 into scikit-learn:master May 12, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0