ENH Adds infrequent categories support to OrdinalEncoder (#25677) · scikit-learn/scikit-learn@7df7062

Commit 7df7062

thomasjpfan, betatim, ogrisel, and amueller authored
ENH Adds infrequent categories support to OrdinalEncoder (#25677)
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Andreas Mueller <t3kcit@gmail.com>
1 parent ec8a2a6 commit 7df7062

4 files changed, +657 -187 lines changed

doc/modules/preprocessing.rst

Lines changed: 43 additions & 6 deletions
@@ -729,14 +729,15 @@ separate categories::
 See :ref:`dict_feature_extraction` for categorical features that are
 represented as a dict, not as scalars.

-.. _one_hot_encoder_infrequent_categories:
+.. _encoder_infrequent_categories:

 Infrequent categories
 ---------------------

-:class:`OneHotEncoder` supports aggregating infrequent categories into a single
-output for each feature. The parameters to enable the gathering of infrequent
-categories are `min_frequency` and `max_categories`.
+:class:`OneHotEncoder` and :class:`OrdinalEncoder` support aggregating
+infrequent categories into a single output for each feature. The parameters to
+enable the gathering of infrequent categories are `min_frequency` and
+`max_categories`.

 1. `min_frequency` is either an integer greater than or equal to 1, or a float in
    the interval `(0.0, 1.0)`. If `min_frequency` is an integer, categories with
@@ -750,11 +751,47 @@ categories are `min_frequency` and `max_categories`.
    input feature. `max_categories` includes the feature that combines
    infrequent categories.

-In the following example, the categories, `'dog', 'snake'` are considered
-infrequent::
+In the following example with :class:`OrdinalEncoder`, the categories `'dog'` and
+`'snake'` are considered infrequent::

    >>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
    ...               ['snake'] * 3], dtype=object).T
+   >>> enc = preprocessing.OrdinalEncoder(min_frequency=6).fit(X)
+   >>> enc.infrequent_categories_
+   [array(['dog', 'snake'], dtype=object)]
+   >>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
+   array([[2.],
+          [0.],
+          [1.],
+          [2.]])
+
+:class:`OrdinalEncoder`'s `max_categories` does **not** take into account missing
+or unknown categories. Setting `unknown_value` or `encoded_missing_value` to an
+integer will increase the number of unique integer codes by one each. This can
+result in up to `max_categories + 2` integer codes. In the following example,
+"a" and "d" are considered infrequent and grouped together into a single
+category, "b" and "c" are their own categories, unknown values are encoded as 3
+and missing values are encoded as 4.
+
+   >>> X_train = np.array(
+   ...     [["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3 + [np.nan]],
+   ...     dtype=object).T
+   >>> enc = preprocessing.OrdinalEncoder(
+   ...     handle_unknown="use_encoded_value", unknown_value=3,
+   ...     max_categories=3, encoded_missing_value=4)
+   >>> _ = enc.fit(X_train)
+   >>> X_test = np.array([["a"], ["b"], ["c"], ["d"], ["e"], [np.nan]], dtype=object)
+   >>> enc.transform(X_test)
+   array([[2.],
+          [0.],
+          [1.],
+          [2.],
+          [3.],
+          [4.]])
+
+Similarly, :class:`OneHotEncoder` can be configured to group together infrequent
+categories::
+
    >>> enc = preprocessing.OneHotEncoder(min_frequency=6, sparse_output=False).fit(X)
    >>> enc.infrequent_categories_
    [array(['dog', 'snake'], dtype=object)]
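
Reviewer's note: the `min_frequency` rule documented in this diff can be sketched in plain Python. This is only an illustration of the grouping behaviour shown in the doctest above, not scikit-learn's implementation. The helper name `fit_ordinal_codes` is hypothetical, and the sketch assumes an integer `min_frequency`, with frequent categories numbered in sorted order and all infrequent categories sharing the next code.

```python
from collections import Counter


def fit_ordinal_codes(values, min_frequency):
    """Hypothetical helper (not a scikit-learn function).

    Returns a category -> integer code mapping in which every category
    seen fewer than `min_frequency` times shares a single code.
    """
    counts = Counter(values)
    # Frequent categories keep their own code, assigned in sorted order.
    frequent = sorted(c for c, n in counts.items() if n >= min_frequency)
    infrequent = sorted(c for c, n in counts.items() if n < min_frequency)
    codes = {cat: i for i, cat in enumerate(frequent)}
    # All infrequent categories share one code placed after the frequent ones.
    shared_code = len(frequent)
    codes.update({cat: shared_code for cat in infrequent})
    return codes, infrequent


values = ["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3
codes, infrequent = fit_ordinal_codes(values, min_frequency=6)
print(infrequent)  # ['dog', 'snake']
print([codes[c] for c in ["dog", "cat", "rabbit", "snake"]])  # [2, 0, 1, 2]
```

With the same data as the doctest, `'dog'` (5 samples) and `'snake'` (3 samples) fall below the threshold of 6 and share code 2, matching the documented `transform` output.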

doc/whats_new/v1.3.rst

Lines changed: 5 additions & 0 deletions
@@ -396,6 +396,11 @@ Changelog
   :pr:`24935` by :user:`Seladus <seladus>`, :user:`Guillaume Lemaitre <glemaitre>`, and
   :user:`Dea María Léon <deamarialeon>`, :pr:`25257` by :user:`Gleb Levitski <glevv>`.

+- |Feature| :class:`preprocessing.OrdinalEncoder` now supports grouping
+  infrequent categories into a single feature. Grouping infrequent categories
+  is enabled by specifying how to select infrequent categories with
+  `min_frequency` or `max_categories`. :pr:`25677` by `Thomas Fan`_.
+
 - |Fix| :class:`AdditiveChi2Sampler` is now stateless.
   The `sample_interval_` attribute is deprecated and will be removed in 1.5.
   :pr:`25190` by :user:`Vincent Maladière <Vincent-Maladiere>`.
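
Reviewer's note: the changelog entry also mentions `max_categories`. The plain-Python sketch below shows how that limit might interact with separate codes for unknown and missing values, as this commit's documentation describes. It is not scikit-learn's code. The helpers `fit_codes_max_categories` and `encode` are hypothetical, `None` stands in for a missing value, and breaking ties in frequency alphabetically is an assumption made only for this example.

```python
from collections import Counter


def fit_codes_max_categories(values, max_categories):
    """Hypothetical helper: keep the most frequent categories, group the rest."""
    counts = Counter(v for v in values if v is not None)  # None marks missing
    # Rank by descending count; alphabetical tie-break is an assumption.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    if len(ranked) > max_categories:
        # One of the max_categories slots is reserved for the infrequent group.
        frequent = sorted(ranked[: max_categories - 1])
    else:
        frequent = sorted(ranked)
    codes = {cat: i for i, cat in enumerate(frequent)}
    shared_code = len(frequent)
    for cat in ranked:
        codes.setdefault(cat, shared_code)  # infrequent categories share a code
    return codes


def encode(value, codes, unknown_value, encoded_missing_value):
    """Hypothetical helper: missing and unknown values get their own codes."""
    if value is None:
        return encoded_missing_value
    return codes.get(value, unknown_value)


train = ["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3 + [None]
codes = fit_codes_max_categories(train, max_categories=3)
encoded = [
    encode(v, codes, unknown_value=3, encoded_missing_value=4)
    for v in ["a", "b", "c", "d", "e", None]
]
print(encoded)  # [2, 0, 1, 2, 3, 4]
```

The known categories use three codes (`max_categories=3`), and the unknown and missing codes add one each, giving the `max_categories + 2` distinct integers described in the documentation change.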
