-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
CLN Refactors _encode into two functions #17101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking care of this @thomasjpfan it does seem to look better
sklearn/metrics/_ranking.py
Outdated
@@ -34,7 +34,7 @@ | |||
from ..utils.validation import _deprecate_positional_args | |||
from ..exceptions import UndefinedMetricWarning | |||
from ..preprocessing import label_binarize | |||
from ..preprocessing._label import _encode | |||
from ..preprocessing._label import _encode, _unique |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move these out of preprocessing._label
and put these in utils
, since they're not just used for encoding labels but also features?
sklearn/preprocessing/_label.py
Outdated
uniques : array, optional | ||
If passed, uniques are not determined from passed values (this | ||
uniques : array | ||
Uniques are not determined from passed values (this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the part in the parenthesis should be removed it seems?
I guess this could just be "The unique values in values
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can also repeat that it should be sorted when relying on numpy
sklearn/preprocessing/_label.py
Outdated
def _encode(values, uniques=None, encode=False, check_unknown=True): | ||
"""Helper function to factorize (find uniques) and encode values. | ||
def _encode(values, *, uniques, check_unknown=True): | ||
"""Helper function encode values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Helper function encode values. | |
"""Helper function to encode values into [0, n_uniques - 1] |
sklearn/preprocessing/_label.py
Outdated
|
||
|
||
def _unique(values, *, return_inverse=False): | ||
"""Helper function to find uniques with support for python objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Helper function to find uniques with support for python objects. | |
"""Helper function to find unique values with support for python objects. |
sklearn/preprocessing/_label.py
Outdated
The sorted uniique values | ||
|
||
unique_inverse : ndarray | ||
The indicies to reconstruct the original array from the unique array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indicies to reconstruct the original array from the unique array. | |
The indices to reconstruct the original array from the unique array. |
sklearn/preprocessing/_label.py
Outdated
ret = (uniques, ) | ||
|
||
if return_inverse: | ||
table = {val: i for i, val in enumerate(uniques)} | ||
inverse = np.array([table[v] for v in values]) | ||
ret += (inverse, ) | ||
|
||
if len(ret) == 1: | ||
ret = ret[0] | ||
|
||
return ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this equivalent? it seems we can avoid the tuple juggling here
ret = (uniques, ) | |
if return_inverse: | |
table = {val: i for i, val in enumerate(uniques)} | |
inverse = np.array([table[v] for v in values]) | |
ret += (inverse, ) | |
if len(ret) == 1: | |
ret = ret[0] | |
return ret | |
if return_inverse: | |
table = {val: i for i, val in enumerate(uniques)} | |
inverse = np.array([table[v] for v in values]) | |
return uniques, inverse | |
else: | |
return uniques |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm happy with these changes. Will these functions be more coupled if we handle NaNs?
sklearn/preprocessing/_label.py
Outdated
unique_inverse : ndarray | ||
The indicies to reconstruct the original array from the unique array. | ||
Only provided if `return_inverse` is True. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add Returns section
From my early prototyping with nan/None support (which builds on top of this PR), there seems to be no change needed to |
|
Do you need reviews or are @jnothman and @NicolasHug on it? |
I would give them a few more days. But having another set if eyes would not hurt. This PR only updates private API which hopefully makes the encoder logic easier to reason about. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A f 67ED ew more but LGTM otherwise.
sklearn/utils/_encode.py
Outdated
|
||
|
||
def _unique_python(values, *, return_inverse): | ||
# Only used in _uniques below, see docstring there for details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lolol
# Only used in _uniques below, see docstring there for details | |
# Only used in _uniques above, see docstring there for details |
sklearn/utils/_encode.py
Outdated
The uniques values in `values`. If the dtype is not object, then | ||
`uniques` need to be sorted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uniques values in `values`. If the dtype is not object, then | |
`uniques` need to be sorted. | |
The unique values in `values`. If the dtype is not object, then | |
`uniques` needs to be sorted. |
sklearn/utils/_encode.py
Outdated
Parameters | ||
---------- | ||
values : ndarray | ||
Values to factorize or encode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Values to factorize or encode. | |
Values to encode. |
not sure what "factorize" means?
sklearn/utils/_encode.py
Outdated
return np.searchsorted(uniques, values) | ||
|
||
|
||
def _encode_check_unknown(values, uniques, return_mask=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this helper isn't specific to any encoding logic. Should this signature be instead:
def _check_unknown(values, known_values, return_mask=False):
(it's OK to keep it in this file though)
(np.array(['b', 'a', 'c', 'a', 'c']), | ||
np.array(['a', 'b', 'c']))], | ||
ids=['int64', 'object', 'str']) | ||
def test_encode_util(values, expected): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the lexicographic order expected? If so maybe add a comment here. If it's a strict API contract it should also be in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lexicographic order is not expected. Updated with more test to check for unordered cases.
sklearn/utils/_encode.py
Outdated
return np.searchsorted(uniques, values) | ||
|
||
|
||
def _check_unknown(values, uniques, return_mask=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should uniques
be known_values
?
The fact that they're unique isn't relevant for this funciton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updatted to known_values
and also added in the docstring that it must be unique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I thought it would be fine even with duplicated values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @thomasjpfan
Reference Issues/PRs
Related to #16018
What does this implement/fix? Explain your changes.
Refactors
_encode
into two functions._encode
: encodes_unique
:np.unique
but also works for objects. It only implements the subset of features ofnp.unique
that we need.Any other comments?
This PR is to make #16018 easier to review.
_unique_python
is written in a weird way because in #16018 I will be addingreturn_counts
.CC @NicolasHug This may be the refactor we spoke about months ago.