ENH Add per feature max_categories for OrdinalEncoder #26284
Conversation
To fix the linting errors, run black . from the top-level directory. You can also use pre-commit to take care of formatting (step 9 of https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute).
I have considered running
Please check that you are using For future reference, the
Thanks for your suggestions, I will update
Hi @betatim, I have addressed the points you raised in your review. At your convenience, can you look over this and comment on the design choices? Thank you.
I'll leave a few individual review comments on particular bits of code. One overall comment I have is that I find it hard to follow the code. This is not just because of the changes you made, but was already the case before. I think one thing that makes it tricky is that I can't get a good set of names in my head, and the variables in the code are also sometimes inconsistent or change what things are called. For example, I think the one-hot encoding talks about "features" when what it means is how many categories there are in a particular feature of the input dataset; the reason it calls them features is that you need 5 features to one-hot encode 5 categories. But I am not sure. The overall point is: can we try to come up with a consistent set of names for things and then apply it to the variables and docstrings? I think this would help future readers follow what the code does.
Yeah, when I was implementing the features, I found it difficult to follow what the code was doing. At least for the changes I have made, it is definitely a good idea to find more descriptive variable and function names. Refactoring this completely would probably require another PR, since it would be quite a large undertaking and would definitely be beyond the scope of this PR.
Thank you for the PR @Andrew-Wang-IB45 !
I think the complexity of this PR can be reduced if we add a new _max_categories_per_feature
private attribute:
diff --git a/sklearn/preprocessing/_encoders.py b/sklearn/preprocessing/_encoders.py
index fd9941f533..c2fde8fe50 100644
--- a/sklearn/preprocessing/_encoders.py
+++ b/sklearn/preprocessing/_encoders.py
@@ -16,6 +16,7 @@ from ..utils.validation import _check_feature_names_in
from ..utils._param_validation import Interval, StrOptions, Hidden
from ..utils._param_validation import RealNotInt
from ..utils._mask import _get_mask
+from ..utils.validation import _is_arraylike_not_scalar
from ..utils._encode import _encode, _check_unknown, _unique, _get_counts
@@ -75,9 +76,9 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
return_counts=False,
return_and_ignore_missing_for_infrequent=False,
):
- self._check_infrequent_enabled()
self._check_n_features(X, reset=True)
self._check_feature_names(X, reset=True)
+ self._check_infrequent_enabled()
X_list, n_samples, n_features = self._check_X(
X, force_all_finite=force_all_finite
)
@@ -250,15 +251,33 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
for category, indices in zip(self.categories_, infrequent_indices)
]
+ def _validate_max_categories(self):
+ parameter = getattr(self, "max_categories", None)
+ if isinstance(parameter, int) and parameter >= 1:
+ return [parameter] * self.n_features_in_
+
+ elif isinstance(parameter, dict):
+ if not hasattr(self, "feature_names_in_"):
+            raise ValueError("parameter is a dict but X has no feature names")
+ return [parameter.get(col, None) for col in self.feature_names_in_]
+
+ elif _is_arraylike_not_scalar(parameter) and any(
+ p is not None for p in parameter
+ ):
+ return parameter
+
+ return None
+
def _check_infrequent_enabled(self):
"""
This functions checks whether _infrequent_enabled is True or False.
This has to be called after parameter validation in the fit function.
"""
- max_categories = getattr(self, "max_categories", None)
+ self._max_categories_per_feature = self._validate_max_categories()
min_frequency = getattr(self, "min_frequency", None)
+
self._infrequent_enabled = (
- max_categories is not None and max_categories >= 1
+ self._max_categories_per_feature is not None
) or min_frequency is not None
def _identify_infrequent(self, category_count, n_samples, col_idx):
@@ -290,9 +309,15 @@ class _BaseEncoder(TransformerMixin, BaseEstimator):
infrequent_mask = np.zeros(category_count.shape[0], dtype=bool)
n_current_features = category_count.size - infrequent_mask.sum() + 1
- if self.max_categories is not None and self.max_categories < n_current_features:
+
+ if self._max_categories_per_feature is not None:
+ max_categories = self._max_categories_per_feature[col_idx]
+ else:
+ max_categories = None
+
+ if max_categories is not None and max_categories < n_current_features:
# max_categories includes the one infrequent category
- frequent_category_count = self.max_categories - 1
+ frequent_category_count = max_categories - 1
if frequent_category_count == 0:
# All categories are infrequent
infrequent_mask[:] = True
@@ -1419,7 +1444,12 @@ class OrdinalEncoder(OneToOneFeatureMixin, _BaseEncoder):
"encoded_missing_value": [Integral, type(np.nan)],
"handle_unknown": [StrOptions({"error", "use_encoded_value"})],
"unknown_value": [Integral, type(np.nan), None],
- "max_categories": [Interval(Integral, 1, None, closed="left"), None],
+ "max_categories": [
+ Interval(Integral, 1, None, closed="left"),
+ "array-like",
+ dict,
+ None,
+ ],
"min_frequency": [
Interval(Integral, 1, None, closed="left"),
Interval(RealNotInt, 0, 1, closed="neither"),
(Note that _check_max_features should include some more error checking, as you have in your PR.)
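The suggested helper can be sketched as a standalone function (a hypothetical rewrite for illustration only, not the scikit-learn implementation; `n_features` and `feature_names` stand in for the estimator's `n_features_in_` and `feature_names_in_` attributes, and the array-like check is approximated with a list/tuple test):

```python
def normalize_max_categories(max_categories, n_features, feature_names=None):
    """Normalize ``max_categories`` into a per-feature list, or None.

    Mirrors the suggested ``_validate_max_categories`` helper: an int is
    broadcast to every feature, a dict is resolved against the feature
    names, and a list/tuple is passed through when it sets at least one
    per-feature limit. ``None`` entries mean "no limit for this feature".
    """
    if isinstance(max_categories, int) and max_categories >= 1:
        return [max_categories] * n_features
    if isinstance(max_categories, dict):
        if feature_names is None:
            raise ValueError("max_categories is a dict but X has no feature names")
        return [max_categories.get(name) for name in feature_names]
    if isinstance(max_categories, (list, tuple)) and any(
        m is not None for m in max_categories
    ):
        return list(max_categories)
    # No usable limits: infrequent-category handling stays disabled.
    return None
```

For example, an integer limit of 3 over 2 features normalizes to `[3, 3]`, while `{"a": 2}` with feature names `["a", "b"]` normalizes to `[2, None]`.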
All right, I was looking over the failed checks but could not decipher what exactly went wrong. Do you have any idea what may have caused this?
Reference Issues/PRs
Closes #26013.
What does this implement/fix? Explain your changes.
This PR allows a user to specify per-feature max categories for an OrdinalEncoder by passing max_categories as a dictionary mapping a valid feature name to its corresponding maximum number of output categories. Since identifying infrequent categories is already done per feature, instead of applying the global limit when max_categories is an integer, the current feature name is retrieved and its corresponding value in the dictionary is used as the upper limit.
Any other comments?
Currently, this PR assumes that for any X that is not a pandas DataFrame, the feature names are the ones generated by get_feature_names_out. Any improvements or suggestions would be appreciated.
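The per-feature capping described above can be illustrated with a small self-contained sketch (a hypothetical helper, not the scikit-learn implementation): for one feature, at most `max_categories - 1` of its most frequent categories stay frequent, because `max_categories` also counts the single grouped infrequent category.

```python
from collections import Counter

def split_infrequent(values, max_categories):
    """Split one feature's categories into (frequent, infrequent) sets.

    ``max_categories`` includes the grouped "infrequent" category, so at
    most ``max_categories - 1`` real categories remain frequent. ``None``
    means no limit for this feature.
    """
    counts = Counter(values)
    # Mirrors the ``max_categories < n_categories + 1`` check: no grouping
    # is needed when the limit exceeds the number of distinct categories.
    if max_categories is None or max_categories > len(counts):
        return set(counts), set()
    # Keep the most common ``max_categories - 1`` categories; with a limit
    # of 1 every category is grouped as infrequent.
    frequent = {cat for cat, _ in counts.most_common(max_categories - 1)}
    return frequent, set(counts) - frequent
```

With values `["a", "a", "a", "b", "b", "c"]` and a per-feature limit of 2, only `"a"` stays frequent while `"b"` and `"c"` are grouped as infrequent; under the proposed dict API, that limit would come from the entry for this feature's name.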