CLN Refactors _encode into two functions by thomasjpfan · Pull Request #17101 · scikit-learn/scikit-learn · GitHub

CLN Refactors _encode into two functions #17101

Merged

NicolasHug merged 7 commits into scikit-learn:master from thomasjpfan:refactor_encode

May 12, 2020

Member

thomasjpfan commented

Reference Issues/PRs

Related to #16018

What does this implement/fix? Explain your changes.

Refactors _encode into two functions.

_encode : encodes
_unique : np.unique but also works for objects. It only implements the subset of features of np.unique that we need.
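A minimal sketch of the split described above, not the code merged in this PR. The names `unique_sketch` and `encode_sketch` are hypothetical, and the object-dtype fallback (a sorted `set` plus a lookup table) is an assumption based on the snippets quoted later in this review:

```python
import numpy as np


def unique_sketch(values, *, return_inverse=False):
    """Find unique values; plain-Python fallback for object arrays (hypothetical)."""
    if values.dtype == object:
        # Assumption: object arrays go through Python sets rather than np.unique.
        uniques = np.array(sorted(set(values)), dtype=values.dtype)
        if return_inverse:
            table = {val: i for i, val in enumerate(uniques)}
            inverse = np.array([table[v] for v in values])
            return uniques, inverse
        return uniques
    # Non-object dtypes can rely on np.unique directly.
    return np.unique(values, return_inverse=return_inverse)


def encode_sketch(values, *, uniques):
    """Map each value to its position in `uniques` (hypothetical)."""
    if values.dtype == object:
        table = {val: i for i, val in enumerate(uniques)}
        # A value missing from `uniques` raises KeyError here.
        return np.array([table[v] for v in values])
    # Requires `uniques` to be sorted.
    return np.searchsorted(uniques, values)


obj_values = np.array(["b", "a", "c", "a"], dtype=object)
obj_uniques = unique_sketch(obj_values)
obj_codes = encode_sketch(obj_values, uniques=obj_uniques)

int_values = np.array([2, 1, 3, 1])
int_uniques = unique_sketch(int_values)
int_codes = encode_sketch(int_values, uniques=int_uniques)
```

The point of the split is that finding the unique values and encoding against a known set of unique values become two independent calls, so a caller that already has `uniques` never needs the first step.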

Any other comments?

This PR is to make #16018 easier to review. _unique_python is written in a weird way because in #16018 I will be adding return_counts.

CC @NicolasHug This may be the refactor we spoke about months ago.

thomasjpfan added 2 commits

May 1, 2020 19:31


          CLN Refactors encoding logic

805daf9


          TST Fix

723b7e8

github-actions bot added module:metrics module:preprocessing labels

NicolasHug reviewed

Member

NicolasHug left a comment

Thanks for taking care of this @thomasjpfan, it does seem to look better.

sklearn/metrics/_ranking.py Outdated

@@ -34,7 +34,7 @@
               from ..utils.validation import _deprecate_positional_args
               from ..exceptions import UndefinedMetricWarning
               from ..preprocessing import label_binarize
-              from ..preprocessing._label import _encode
+              from ..preprocessing._label import _encode, _unique

Member

NicolasHug

should we move these out of preprocessing._label and put these in utils, since they're not just used for encoding labels but also features?

sklearn/preprocessing/_label.py Outdated

    
                  uniques : array, optional

                      If passed, uniques are not determined from passed values (this

                  uniques : array

                      Uniques are not determined from passed values (this

Member

NicolasHug

the part in the parenthesis should be removed it seems?
I guess this could just be "The unique values in values"

Member

NicolasHug

maybe we can also repeat that it should be sorted when relying on numpy
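To illustrate why sortedness matters on the numpy path: `np.searchsorted` documents that its first argument must be sorted in ascending order (when no `sorter` is given). The sketch below uses made-up values and only shows the sorted case:

```python
import numpy as np

values = np.array([30, 10, 20, 10])

# np.unique returns the unique values in sorted order.
sorted_uniques = np.unique(values)

# With sorted uniques, searchsorted returns each value's position in
# `sorted_uniques`, i.e. a code in [0, n_uniques - 1].
codes = np.searchsorted(sorted_uniques, values)

# The codes index back into the uniques to recover the original values.
round_trip = sorted_uniques[codes]
```

If `uniques` were passed in unsorted, the binary search would carry no such guarantee, which is presumably why the docstring should state the requirement.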

sklearn/preprocessing/_label.py Outdated

-              def _encode(values, uniques=None, encode=False, check_unknown=True):
-                  """Helper function to factorize (find uniques) and encode values.
+              def _encode(values, *, uniques, check_unknown=True):
+                  """Helper function encode values.

Member

NicolasHug

Suggested change

      
                """Helper function encode values.
          
                """Helper function to encode values into [0, n_uniques - 1]

sklearn/preprocessing/_label.py Outdated



		def _unique(values, *, return_inverse=False):
		    """Helper function to find uniques with support for python objects.

Member

NicolasHug

Suggested change

      
                """Helper function to find uniques with support for python objects.
          
                """Helper function to find unique values with support for python objects.

sklearn/preprocessing/_label.py Outdated

+                      The sorted uniique values
+                  unique_inverse : ndarray
+                      The indicies to reconstruct the original array from the unique array.

Member

NicolasHug

Suggested change

      
                    The indicies to reconstruct the original array from the unique array.
          
                    The indices to reconstruct the original array from the unique array.

sklearn/preprocessing/_label.py Outdated

Comment on lines 93 to 103

+                  ret = (uniques, )
+                  if return_inverse:
+                      table = {val: i for i, val in enumerate(uniques)}
+                      inverse = np.array([table[v] for v in values])
+                      ret += (inverse, )
+                  if len(ret) == 1:
+                      ret = ret[0]
+                  return ret

Member

NicolasHug

Is this equivalent? it seems we can avoid the tuple juggling here

Suggested change

      
                ret = (uniques, )
          
                if return_inverse:
          
                    table = {val: i for i, val in enumerate(uniques)}
          
                    inverse = np.array([table[v] for v in values])
          
                    ret += (inverse, )
          
                if len(ret) == 1:
          
                    ret = ret[0]
          
                return ret
          
                if return_inverse:
          
                    table = {val: i for i, val in enumerate(uniques)}
          
                    inverse = np.array([table[v] for v in values])
          
                    return uniques, inverse
          
                else:
          
                    return uniques
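A quick way to check the equivalence asked about above. Both function names are hypothetical, and the line that computes `uniques` is an assumption, since the quoted snippet starts after it:

```python
import numpy as np


def unique_tuple_style(values, *, return_inverse):
    # Assumption: uniques come from a sorted set of the values.
    uniques = np.array(sorted(set(values)), dtype=values.dtype)
    ret = (uniques,)
    if return_inverse:
        table = {val: i for i, val in enumerate(uniques)}
        inverse = np.array([table[v] for v in values])
        ret += (inverse,)
    if len(ret) == 1:
        ret = ret[0]
    return ret


def unique_branch_style(values, *, return_inverse):
    # Same assumption as above for how uniques are computed.
    uniques = np.array(sorted(set(values)), dtype=values.dtype)
    if return_inverse:
        table = {val: i for i, val in enumerate(uniques)}
        inverse = np.array([table[v] for v in values])
        return uniques, inverse
    return uniques


sample = np.array(["b", "a", "c", "a"], dtype=object)

plain_a = unique_tuple_style(sample, return_inverse=False)
plain_b = unique_branch_style(sample, return_inverse=False)

uniq_a, inv_a = unique_tuple_style(sample, return_inverse=True)
uniq_b, inv_b = unique_branch_style(sample, return_inverse=True)
```

With a single optional return value the two styles behave the same. The tuple-building form only pays off once a second optional output (such as the `return_counts` mentioned in the PR description) is added.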

jnothman reviewed

Member

jnothman left a comment

Yes, I'm happy with these changes. Will these functions be more coupled if we handle NaNs?

sklearn/preprocessing/_label.py Outdated

+                  unique_inverse : ndarray
+                      The indicies to reconstruct the original array from the unique array.
+                      Only provided if `return_inverse` is True.
+                  """

Member

jnothman

Please add Returns section

Member Author

thomasjpfan commented

> Yes, I'm happy with these changes. Will these functions be more coupled if we handle NaNs?

From my early prototyping with nan/None support (which builds on top of this PR), there seems to be no change needed to _encode and an update to only _unique.

thomasjpfan added 3 commits

May 8, 2020 15:04


          CLN Move to utils

7f17b94


          TST More testing

14dbc77


          CLN Move functions around

e546b5e

Member Author

thomasjpfan commented

Moved things to utils._encode (for lack of a better name).
Added more tests for _encode_check_unknown.

Member

amueller commented

Do you need reviews or are @jnothman and @NicolasHug on it?

Member Author

thomasjpfan commented

I would give them a few more days. But having another set of eyes would not hurt.

This PR only updates private API which hopefully makes the encoder logic easier to reason about.

jnothman approved these changes

NicolasHug reviewed

Member

NicolasHug left a comment

A few more but LGTM otherwise.

sklearn/utils/_encode.py Outdated



		def _unique_python(values, *, return_inverse):
		    # Only used in _uniques below, see docstring there for details

Member

NicolasHug

lolol

Suggested change

      
                # Only used in _uniques below, see docstring there for details
          
                # Only used in _uniques above, see docstring there for details

sklearn/utils/_encode.py Outdated

Comment on lines 67 to 68

		The uniques values in `values`. If the dtype is not object, then
		`uniques` need to be sorted.

Member

NicolasHug

Suggested change

      
                    The uniques values in `values`. If the dtype is not object, then
          
                    `uniques` need to be sorted.
          
                    The unique values in `values`. If the dtype is not object, then
          
                    `uniques` needs to be sorted.

sklearn/utils/_encode.py Outdated

+                  Parameters
+                  ----------
+                  values : ndarray
+                      Values to factorize or encode.

Member

NicolasHug

Suggested change

      
                    Values to factorize or encode.
          
                    Values to encode.

not sure what "factorize" means?

sklearn/utils/_encode.py Outdated

		    return np.searchsorted(uniques, values)


		def _encode_check_unknown(values, uniques, return_mask=False):

Member

NicolasHug

I feel like this helper isn't specific to any encoding logic. Should this signature be instead:

def _check_unknown(values, known_values, return_mask=False):

(it's OK to keep it in this file though)
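For readers unfamiliar with the helper, here is a rough sketch of what a function with the proposed signature could do. The body is an assumption (the thread does not show the implementation), and `check_unknown_sketch` is a hypothetical name:

```python
import numpy as np


def check_unknown_sketch(values, known_values, return_mask=False):
    """Return the values not found in `known_values` (hypothetical sketch).

    If `return_mask` is True, also return a boolean mask that is True
    where an entry of `values` is known.
    """
    values = np.asarray(values)
    known_values = np.asarray(known_values)
    # True for every entry of `values` that appears in `known_values`.
    valid_mask = np.isin(values, known_values)
    # Sorted, de-duplicated unknown entries.
    diff = np.unique(values[~valid_mask])
    if return_mask:
        return diff, valid_mask
    return diff


diff, valid_mask = check_unknown_sketch(
    np.array([1, 2, 5, 2, 7]), np.array([1, 2, 3]), return_mask=True
)
no_unknowns = check_unknown_sketch(np.array([1, 2]), np.array([1, 2, 3]))
```

Nothing in this sketch depends on encoding, which is the reviewer's argument for the more general name.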

sklearn/utils/tests/test_encode.py

+                       (np.array(['b', 'a', 'c', 'a', 'c']),
+                        np.array(['a', 'b', 'c']))],
+                      ids=['int64', 'object', 'str'])
+              def test_encode_util(values, expected):

Member

NicolasHug

Is the lexicographic order expected? If so maybe add a comment here. If it's a strict API contract it should also be in the docstring.

Member Author

thomasjpfan

Lexicographic order is not expected. Updated with more tests to check for unordered cases.


          CLN Address comments

bdc4c89

NicolasHug reviewed

sklearn/utils/_encode.py Outdated

		    return np.searchsorted(uniques, values)


		def _check_unknown(values, uniques, return_mask=False):

Member

NicolasHug

Should uniques be known_values?
The fact that they're unique isn't relevant for this function

Member Author

thomasjpfan

Updated to known_values and also added in the docstring that it must be unique.

Member

NicolasHug

OK, I thought it would be fine even with duplicated values.
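The reviewer's intuition can be illustrated with a plain set difference, where duplicates in the known values make no difference. This is only a sketch with made-up data. The thread does not say why the docstring requires uniqueness, so the helper may rely on something that a set-based check does not:

```python
import numpy as np

values = np.array([1, 2, 5, 2])
known_unique = np.array([1, 2, 3])
known_with_duplicates = np.array([1, 2, 2, 3, 3])

# Converting to sets discards duplicates, so both differences agree.
unknown_from_unique = set(values.tolist()) - set(known_unique.tolist())
unknown_from_duplicates = set(values.tolist()) - set(known_with_duplicates.tolist())
```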


          CLN Adjust name

063c3f7

NicolasHug approved these changes

Member

NicolasHug left a comment

thanks @thomasjpfan

NicolasHug merged commit d40d993 into scikit-learn:master

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request


          CLN Refactors _encode into two functions (scikit-learn#17101)

a733e24

viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request


          CLN Refactors _encode into two functions (scikit-learn#17101)

7defe6e

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request


          CLN Refactors _encode into two functions (scikit-learn#17101)

27a344d

Labels

module:metrics module:preprocessing
