ENH Add per feature max_categories for OrdinalEncoder by Andrew-Wang-IB45 · Pull Request #26284 · scikit-learn/scikit-learn

Open · wants to merge 208 commits into main

Conversation

@Andrew-Wang-IB45 · Contributor

Reference Issues/PRs

Closes #26013.

What does this implement/fix? Explain your changes.

This PR allows a user to specify per-feature maximum categories for an OrdinalEncoder by passing max_categories as a dictionary that maps a valid feature name to the maximum number of output categories for that feature. Since infrequent categories are already identified per feature, instead of applying a single global limit when max_categories is an integer, the encoder looks up the current feature's name and uses its value in the dictionary as the upper limit.
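
For illustration, here is roughly how the proposed API would be used (a minimal sketch; the dict form of max_categories is what this PR adds, so it is not available in released scikit-learn):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame(
    {
        "color": ["red", "red", "red", "blue", "green", "yellow"],
        "size": ["S", "M", "L", "S", "M", "L"],
    }
)

# Cap "color" at 3 output categories (infrequent ones are grouped together);
# "size" is omitted from the dict, so all of its categories are kept.
enc = OrdinalEncoder(max_categories={"color": 3})
X_trans = enc.fit_transform(X)
```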

Any other comments?

Currently, this PR assumes that for any X that is not a pandas DataFrame, the feature names are those generated by get_feature_names_out. Any improvements or suggestions would be appreciated.
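
For a plain ndarray, those generated names are "x0", "x1", and so on, which is presumably what a user would have to use as dictionary keys under this assumption (this snippet uses the existing released API):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# A plain ndarray has no column labels, so get_feature_names_out falls
# back to generated names.
X = np.array([["a", "S"], ["b", "M"], ["a", "L"]], dtype=object)
enc = OrdinalEncoder().fit(X)
print(enc.get_feature_names_out())  # ['x0' 'x1']
```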

Andrew-Wang-IB45 changed the title from "Add per feature max_categories for OrdinalEncoder" to "ENH Add per feature max_categories for OrdinalEncoder" on Apr 25, 2023.
@betatim · Member left a review comment:

To fix the linting errors, run black . from the top-level directory. You can also use pre-commit to take care of formatting (step 9 of https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute).

@Andrew-Wang-IB45 · Contributor (Author)

I have considered running black . from the project root to address the linting issues, but that would change many files that are not covered by this PR. Would it be preferable to make a separate PR to properly lint the files?

@thomasjpfan · Member commented Apr 26, 2023

> Would it be preferable to make a separate PR to properly lint the files?

Please check that you are using black==23.3.0; different black versions give different formatting results. Scikit-learn is currently pinned to 23.3.0.

For future reference, the black version is listed in step 4 of the contributing guide. You can use pre-commit hooks as a quick way to make sure your commits are linted with the current versions; instructions are in step 9 of the contributing guide.

@Andrew-Wang-IB45 · Contributor (Author)

Thanks for your suggestions; I will update black to version 23.3.0 and see how that goes.

@Andrew-Wang-IB45 · Contributor (Author)

Hi @betatim, I have addressed the points you raised in your review. At your convenience, could you look over this and comment on the design choices? Thank you.

@betatim · Member commented May 15, 2023

I'll leave a few individual review comments on particular bits of code. One overall comment: I find the code hard to follow. That is not just because of the changes you made; it was already the case before. I think one thing that makes it tricky is that I can't get a good set of names into my head, and the code sometimes names things inconsistently or changes what they are called. For example, I think the one-hot encoding code talks about "features" where what it means is the number of categories in a particular feature of the input dataset, presumably because you need 5 features to one-hot encode 5 categories. But I am not sure.

The overall point is: can we try to come up with a consistent set of names for things and then apply it to the variables and docstrings? I think this would help future readers follow what the code does.
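
(As a small illustration of the categories-vs-features terminology point, using the existing OneHotEncoder API:)

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One input feature with five distinct categories expands into five
# output features, which is why the encoder code sometimes says
# "features" where it really means "categories".
X = np.array([["a"], ["b"], ["c"], ["d"], ["e"]], dtype=object)
ohe = OneHotEncoder(sparse_output=False).fit(X)
print(ohe.transform(X).shape)  # (5, 5)
```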

@Andrew-Wang-IB45 · Contributor (Author)

Yeah, when I was implementing the feature, I found it difficult to follow what the code was doing. At least for the changes I have made, it is definitely a good idea to find more descriptive variable and function names. A complete refactoring would be quite a large undertaking and is definitely beyond the scope of this PR, so it would probably need a separate one.

@thomasjpfan · Member left a review comment:

Thank you for the PR, @Andrew-Wang-IB45!

I think the complexity of this PR can be reduced if we add a new _max_categories_per_feature private attribute:

diff --git a/sklearn/preprocessing/_encoders.py b/sklearn/preprocessing/_encoders.py
index fd9941f533..c2fde8fe50 100644
--- a/sklearn/preprocessing/_encoders.py
+++ b/sklearn/preprocessing/_encoders.py
@@ -16,6 +16,7 @@ from ..utils.validation import _check_feature_names_in
 from ..utils._param_validation import Interval, StrOptions, Hidden
 from ..utils._param_validation import RealNotInt
 from ..utils._mask import _get_mask
+from ..utils.validation import _is_arraylike_not_scalar
 
 from ..utils._encode import _encode, _check_unknown, _unique, _get_counts
 
@@ -75,9 +76,9 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
         return_counts=False,
         return_and_ignore_missing_for_infrequent=False,
     ):
-        self._check_infrequent_enabled()
         self._check_n_features(X, reset=True)
         self._check_feature_names(X, reset=True)
+        self._check_infrequent_enabled()
         X_list, n_samples, n_features = self._check_X(
             X, force_all_finite=force_all_finite
         )
@@ -250,15 +251,33 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
             for category, indices in zip(self.categories_, infrequent_indices)
         ]
 
+    def _validate_max_categories(self):
+        parameter = getattr(self, "max_categories", None)
+        if isinstance(parameter, int) and parameter >= 1:
+            return [parameter] * self.n_features_in_
+
+        elif isinstance(parameter, dict):
+            if not hasattr(self, "feature_names_in_"):
+                raise ValueError("max_categories is a dict but X has no feature names")
+            return [parameter.get(col, None) for col in self.feature_names_in_]
+
+        elif _is_arraylike_not_scalar(parameter) and any(
+            p is not None for p in parameter
+        ):
+            return parameter
+
+        return None
+
     def _check_infrequent_enabled(self):
         """
         This functions checks whether _infrequent_enabled is True or False.
         This has to be called after parameter validation in the fit function.
         """
-        max_categories = getattr(self, "max_categories", None)
+        self._max_categories_per_feature = self._validate_max_categories()
         min_frequency = getattr(self, "min_frequency", None)
+
         self._infrequent_enabled = (
-            max_categories is not None and max_categories >= 1
+            self._max_categories_per_feature is not None
         ) or min_frequency is not None
 
     def _identify_infrequent(self, category_count, n_samples, col_idx):
@@ -290,9 +309,15 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
             infrequent_mask = np.zeros(category_count.shape[0], dtype=bool)
 
         n_current_features = category_count.size - infrequent_mask.sum() + 1
-        if self.max_categories is not None and self.max_categories < n_current_features:
+
+        if self._max_categories_per_feature is not None:
+            max_categories = self._max_categories_per_feature[col_idx]
+        else:
+            max_categories = None
+
+        if max_categories is not None and max_categories < n_current_features:
             # max_categories includes the one infrequent category
-            frequent_category_count = self.max_categories - 1
+            frequent_category_count = max_categories - 1
             if frequent_category_count == 0:
                 # All categories are infrequent
                 infrequent_mask[:] = True
@@ -1419,7 +1444,12 @@ class OrdinalEncoder(OneToOneFeatureMixin, _BaseEncoder):
         "encoded_missing_value": [Integral, type(np.nan)],
         "handle_unknown": [StrOptions({"error", "use_encoded_value"})],
         "unknown_value": [Integral, type
E7EE
(np.nan), None],
-        "max_categories": [Interval(Integral, 1, None, closed="left"), None],
+        "max_categories": [
+            Interval(Integral, 1, None, closed="left"),
+            "array-like",
+            dict,
+            None,
+        ],
         "min_frequency": [
             Interval(Integral, 1, None, closed="left"),
             Interval(RealNotInt, 0, 1, closed="neither"),

(Note that _validate_max_categories should include some more error checking, as you have in your PR.)
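
To make that concrete, here is a hedged sketch of the kind of extra validation the note refers to, written as a standalone helper; the name validate_max_categories and all error messages are illustrative, not the PR's actual code:

```python
import numbers


def validate_max_categories(max_categories, n_features, feature_names=None):
    """Return a list of per-feature limits (None meaning "no limit").

    A sketch only: it validates the int, dict, and array-like forms
    discussed above and raises on malformed input.
    """
    if max_categories is None:
        return None
    if isinstance(max_categories, numbers.Integral):
        if max_categories < 1:
            raise ValueError("max_categories must be >= 1")
        return [int(max_categories)] * n_features
    if isinstance(max_categories, dict):
        if feature_names is None:
            raise ValueError("max_categories is a dict but X has no feature names")
        unknown = set(max_categories) - set(feature_names)
        if unknown:
            raise ValueError(
                f"max_categories contains unknown feature names: {sorted(unknown)}"
            )
        return [max_categories.get(name) for name in feature_names]
    # array-like: one entry per feature, None meaning "no limit"
    limits = list(max_categories)
    if len(limits) != n_features:
        raise ValueError(
            f"max_categories has length {len(limits)}, expected {n_features}"
        )
    return limits
```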

@Andrew-Wang-IB45 · Contributor (Author)

All right, I was looking over the failed checks but could not decipher what exactly went wrong. Do you have any idea what may have caused this?
