ENH Add per feature max_categories for OrdinalEncoder by Andrew-Wang-IB45 · Pull Request #26284 · scikit-learn/scikit-learn

Open · wants to merge 208 commits into main

Conversation

@Andrew-Wang-IB45 · Contributor

Reference Issues/PRs

Closes #26013.

What does this implement/fix? Explain your changes.

This PR allows a user to specify per-feature maximum categories for an OrdinalEncoder by passing max_categories as a dictionary that maps a valid feature name to the maximum number of output categories for that feature. Since infrequent categories are already identified per feature, instead of applying a single global limit when max_categories is an integer, the encoder looks up the current feature's name and uses its value in the dictionary as the upper limit.
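
For illustration, here is roughly how the proposed API would be used (a minimal sketch; the dict form of max_categories is what this PR adds, so it is not available in released scikit-learn):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame(
    {
        "color": ["red", "red", "red", "blue", "green", "yellow"],
        "size": ["S", "M", "L", "S", "M", "L"],
    }
)

# Cap "color" at 3 output categories (infrequent ones are grouped together);
# "size" is omitted from the dict, so all of its categories are kept.
enc = OrdinalEncoder(max_categories={"color": 3})
X_trans = enc.fit_transform(X)
```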

Any other comments?

Currently, this PR assumes that for any X that is not a pandas DataFrame, the feature names are those generated by get_feature_names_out. Any improvements or suggestions would be appreciated.
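
For a plain ndarray, those generated names are "x0", "x1", and so on, which is presumably what a user would have to use as dictionary keys under this assumption (this snippet uses the existing released API):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# A plain ndarray has no column labels, so get_feature_names_out falls
# back to generated names.
X = np.array([["a", "S"], ["b", "M"], ["a", "L"]], dtype=object)
enc = OrdinalEncoder().fit(X)
print(enc.get_feature_names_out())  # ['x0' 'x1']
```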

Andrew-Wang-IB45 changed the title from "Add per feature max_categories for OrdinalEncoder" to "ENH Add per feature max_categories for OrdinalEncoder" on Apr 25, 2023.
@betatim · Member left a review comment:

To fix the linting errors, run black . from the top-level directory. You can also use pre-commit to take care of formatting (step 9 of https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute).

@Andrew-Wang-IB45 · Contributor (Author)

I have considered running black . from the project root to address the linting issues, but that would change many files that are not covered by this PR. Would it be preferable to make a separate PR to properly lint the files?

@thomasjpfan · Member commented Apr 26, 2023

> Would it be preferable to make a separate PR to properly lint the files?

Please check that you are using black==23.3.0; different black versions give different formatting results. Scikit-learn is currently pinned to 23.3.0.

For future reference, the black version is listed in step 4 of the contributing guide. You can use pre-commit hooks as a quick way to make sure your commits are linted with the current versions; instructions are in step 9 of the contributing guide.

@Andrew-Wang-IB45 · Contributor (Author)

Thanks for your suggestions; I will update black to version 23.3.0 and see how that goes.

@Andrew-Wang-IB45 · Contributor (Author)

Hi @betatim, I have addressed the points you raised in your review. At your convenience, could you look over this and comment on the design choices? Thank you.

@betatim · Member commented May 15, 2023

I'll leave a few individual review comments on particular bits of code. One overall comment: I find the code hard to follow. That is not just because of the changes you made; it was already the case before. I think one thing that makes it tricky is that I can't get a good set of names into my head, and the code sometimes names things inconsistently or changes what they are called. For example, I think the one-hot encoding code talks about "features" where what it means is the number of categories in a particular feature of the input dataset, presumably because you need 5 features to one-hot encode 5 categories. But I am not sure.

The overall point is: can we try to come up with a consistent set of names for things and then apply it to the variables and docstrings? I think this would help future readers follow what the code does.
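
(As a small illustration of the categories-vs-features terminology point, using the existing OneHotEncoder API:)

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One input feature with five distinct categories expands into five
# output features, which is why the encoder code sometimes says
# "features" where it really means "categories".
X = np.array([["a"], ["b"], ["c"], ["d"], ["e"]], dtype=object)
ohe = OneHotEncoder(sparse_output=False).fit(X)
print(ohe.transform(X).shape)  # (5, 5)
```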

@Andrew-Wang-IB45 · Contributor (Author)

Yeah, when I was implementing the feature, I found it difficult to follow what the code was doing. At least for the changes I have made, it is definitely a good idea to find more descriptive variable and function names. A complete refactoring would be quite a large undertaking and is definitely beyond the scope of this PR, so it would probably need a separate one.

@thomasjpfan · Member left a review comment:

Thank you for the PR, @Andrew-Wang-IB45!

I think the complexity of this PR can be reduced if we add a new _max_categories_per_feature private attribute:

diff --git a/sklearn/preprocessing/_encoders.py b/sklearn/preprocessing/_encoders.py
index fd9941f533..c2fde8fe50 100644
--- a/sklearn/preprocessing/_encoders.py
+++ b/sklearn/preprocessing/_encoders.py
@@ -16,6 +16,7 @@ from ..utils.validation import _check_feature_names_in
 from ..utils._param_validation import Interval, StrOptions, Hidden
 from ..utils._param_validation import RealNotInt
 from ..utils._mask import _get_mask
+from ..utils.validation import _is_arraylike_not_scalar
 
 from ..utils._encode import _encode, _check_unknown, _unique, _get_counts
 
@@ -75,9 +76,9 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
         return_counts=False,
         return_and_ignore_missing_for_infrequent=False,
     ):
-        self._check_infrequent_enabled()
         self._check_n_features(X, reset=True)
         self._check_feature_names(X, reset=True)
+        self._check_infrequent_enabled()
         X_list, n_samples, n_features = self._check_X(
             X, force_all_finite=force_all_finite
         )
@@ -250,15 +251,33 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
             for category, indices in zip(self.categories_, infrequent_indices)
         ]
 
+    def _validate_max_categories(self):
+        parameter = getattr(self, "max_categories", None)
+        if isinstance(parameter, int) and parameter >= 1:
+            return [parameter] * self.n_features_in_
+
+        elif isinstance(parameter, dict):
+            if not hasattr(self, "feature_names_in_"):
+                raise ValueError("max_categories is a dict but X has no feature names")
+            return [parameter.get(col, None) for col in self.feature_names_in_]
+
+        elif _is_arraylike_not_scalar(parameter) and any(
+            p is not None for p in parameter
+        ):
+            return parameter
+
+        return None
+
     def _check_infrequent_enabled(self):
         """
         This functions checks whether _infrequent_enabled is True or False.
         This has to be called after parameter validation in the fit function.
         """
-        max_categories = getattr(self, "max_categories", None)
+        self._max_categories_per_feature = self._validate_max_categories()
         min_frequency = getattr(self, "min_frequency", None)
+
         self._infrequent_enabled = (
-            max_categories is not None and max_categories >= 1
+            self._max_categories_per_feature is not None
         ) or min_frequency is not None
 
     def _identify_infrequent(self, category_count, n_samples, col_idx):
@@ -290,9 +309,15 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
             infrequent_mask = np.zeros(category_count.shape[0], dtype=bool)
 
         n_current_features = category_count.size - infrequent_mask.sum() + 1
-        if self.max_categories is not None and self.max_categories < n_current_features:
+
+        if self._max_categories_per_feature is not None:
+            max_categories = self._max_categories_per_feature[col_idx]
+        else:
+            max_categories = None
+
+        if max_categories is not None and max_categories < n_current_features:
             # max_categories includes the one infrequent category
-            frequent_category_count = self.max_categories - 1
+            frequent_category_count = max_categories - 1
             if frequent_category_count == 0:
                 # All categories are infrequent
                 infrequent_mask[:] = True
@@ -1419,7 +1444,12 @@ class OrdinalEncoder(OneToOneFeatureMixin, _BaseEncoder):
         "encoded_missing_value": [Integral, type(np.nan)],
         "handle_unknown": [StrOptions({"error", "use_encoded_value"})],
         "unknown_value": [Integral, type
E7EE
(np.nan), None],
-        "max_categories": [Interval(Integral, 1, None, closed="left"), None],
+        "max_categories": [
+            Interval(Integral, 1, None, closed="left"),
+            "array-like",
+            dict,
+            None,
+        ],
         "min_frequency": [
             Interval(Integral, 1, None, closed="left"),
             Interval(RealNotInt, 0, 1, closed="neither"),

(Note that _validate_max_categories should include some more error checking, as you have in your PR.)
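
To make that concrete, here is a hedged sketch of the kind of extra validation the note refers to, written as a standalone helper; the name validate_max_categories and all error messages are illustrative, not the PR's actual code:

```python
import numbers


def validate_max_categories(max_categories, n_features, feature_names=None):
    """Return a list of per-feature limits (None meaning "no limit").

    A sketch only: it validates the int, dict, and array-like forms
    discussed above and raises on malformed input.
    """
    if max_categories is None:
        return None
    if isinstance(max_categories, numbers.Integral):
        if max_categories < 1:
            raise ValueError("max_categories must be >= 1")
        return [int(max_categories)] * n_features
    if isinstance(max_categories, dict):
        if feature_names is None:
            raise ValueError("max_categories is a dict but X has no feature names")
        unknown = set(max_categories) - set(feature_names)
        if unknown:
            raise ValueError(
                f"max_categories contains unknown feature names: {sorted(unknown)}"
            )
        return [max_categories.get(name) for name in feature_names]
    # array-like: one entry per feature, None meaning "no limit"
    limits = list(max_categories)
    if len(limits) != n_features:
        raise ValueError(
            f"max_categories has length {len(limits)}, expected {n_features}"
        )
    return limits
```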

@Andrew-Wang-IB45 · Contributor (Author)

All right, I was looking over the failed checks but could not decipher what exactly went wrong. Do you have any idea what may have caused this?
