8000 ENH Adds infrequent categories to OneHotEncoder (#16018) · glemaitre/scikit-learn@77ebe59 · GitHub
Commit 77ebe59

thomasjpfan, jnothman, rth, and glemaitre committed
ENH Adds infrequent categories to OneHotEncoder (scikit-learn#16018)
* ENH Completely adds infrequent categories
* STY Linting
* STY Linting
* DOC Improves wording
* DOC Lint
* BUG Fixes
* CLN Address comments
* CLN Address comments
* DOC Uses math to description float min_frequency
* DOC Adds comment regarding drop
* BUG Fixes method name
* DOC Clearer docstring
* TST Adds more tests
* FIX Fixes mege
* CLN More pythonic
* CLN Address comments
* STY Flake8
* CLN Address comments
* DOC Fix
* MRG
* WIP
* ENH Address comments
* STY Fix
* ENH Use functiion call instead of property
* ENH Adds counts feature
* CLN Rename variables
* DOC More details
* CLN Remove unneeded line
* CLN Less lines is less complicated
* CLN Less diffs
* CLN Improves readiabilty
* BUG Fix
* CLN Address comments
* TST Fix
* CLN Address comments
* CLN Address comments
* CLN Move docstring to userguide
* DOC Better wrapping
* TST Adds test to handle_unknown='error'
* ENH Spelling error in docstring
* BUG Fixes counter with nan values
* BUG Removes unneeded test
* BUG Fixes issue
* ENH Sync with main
* DOC Correct settings
* DOC Adds docstring
* DOC Immprove user guide
* DOC Move to 1.0
* DOC Update docs
* TST Remove test
* DOC Update docstring
* STY Linting
* DOC Address comments
* ENH Neater code
* DOC Update explaination for auto
* Update sklearn/preprocessing/_encoders.py
* TST Uses docstring instead of comments
* TST Remove call to fit
* TST Spelling error
* ENH Adds support for drop + infrequent categories
* ENH Adds infrequent_if_exist option
* DOC Address comments for user guide
* DOC Address comments for whats_new
* DOC Update docstring based on comments
* CLN Update test with suggestions
* ENH Adds computed property infrequent_categories_
* DOC Adds where the infrequent column is located
* TST Adds more test for infrequent_categories_
* DOC Adds docstring for _compute_drop_idx
* CLN Moves _convert_to_infrequent_idx into its own method
* TST Increases test coverage
* TST Adds failing test
* CLN Careful consideration of dropped and inverse_transform
* STY Linting
* DOC Adds docstrinb about dropping infrequent
* DOC Uses only
* DOC Numpydoc
* TST Includes test for get_feature_names_out
* DOC Move whats new
* DOC Address docstring comments
* DOC Docstring changes
* TST Better comments
* TST Adds check for handle_unknown='ignore' for infrequent
* CLN Make _infrequent_indices private
* CLN Change min_frequency default to None
* DOC Adds comments
* ENH adds support for max_categories=1
* ENH Describe lexicon ordering for ties
* DOC Better docstring
* STY Fix
* CLN Error when explicity dropping an infrequent category
* STY Grammar

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
1 parent 511d232 commit 77ebe59

File tree

6 files changed: +1270 −110 lines changed


doc/modules/preprocessing.rst

Lines changed: 117 additions & 12 deletions
@@ -594,17 +594,19 @@ dataset::
     array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

 If there is a possibility that the training data might have missing categorical
-features, it can often be better to specify ``handle_unknown='ignore'`` instead
-of setting the ``categories`` manually as above. When
-``handle_unknown='ignore'`` is specified and unknown categories are encountered
-during transform, no error will be raised but the resulting one-hot encoded
-columns for this feature will be all zeros
-(``handle_unknown='ignore'`` is only supported for one-hot encoding)::
-
-    >>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
+features, it can often be better to specify
+`handle_unknown='infrequent_if_exist'` instead of setting the `categories`
+manually as above. When `handle_unknown='infrequent_if_exist'` is specified
+and unknown categories are encountered during transform, no error will be
+raised, but the resulting one-hot encoded columns for this feature will be all
+zeros, or treated as an infrequent category if that is enabled
+(`handle_unknown='infrequent_if_exist'` is only supported for one-hot
+encoding)::
+
+    >>> enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X)
-    OneHotEncoder(handle_unknown='ignore')
+    OneHotEncoder(handle_unknown='infrequent_if_exist')
     >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
     array([[1., 0., 0., 0., 0., 0.]])

@@ -621,7 +623,8 @@ since co-linearity would cause the covariance matrix to be non-invertible::
     ... ['female', 'from Europe', 'uses Firefox']]
     >>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
     >>> drop_enc.categories_
-    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
+    [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object),
+    array(['uses Firefox', 'uses Safari'], dtype=object)]
     >>> drop_enc.transform(X).toarray()
     array([[1., 1., 1.],
            [0., 0., 0.]])
@@ -634,7 +637,8 @@ categories. In this case, you can set the parameter `drop='if_binary'`.
     ... ['female', 'Asia', 'Chrome']]
     >>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary').fit(X)
     >>> drop_enc.categories_
-    [array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object), array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
+    [array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object),
+    array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
     >>> drop_enc.transform(X).toarray()
     array([[1., 0., 0., 1., 0., 0., 1.],
            [0., 0., 1., 0., 0., 1., 0.],
@@ -699,6 +703,107 @@ separate categories::
 See :ref:`dict_feature_extraction` for categorical features that are
 represented as a dict, not as scalars.

+.. _one_hot_encoder_infrequent_categories:
+
+Infrequent categories
+---------------------
+
+:class:`OneHotEncoder` supports aggregating infrequent categories into a
+single output for each feature. The parameters that enable gathering
+infrequent categories are `min_frequency` and `max_categories`.
+
+1. `min_frequency` is either an integer greater than or equal to 1, or a float
+   in the interval `(0.0, 1.0)`. If `min_frequency` is an integer, categories
+   with a cardinality smaller than `min_frequency` are considered infrequent.
+   If `min_frequency` is a float, categories with a cardinality smaller than
+   this fraction of the total number of samples are considered infrequent.
+   The default value is 1, which means every category is encoded separately.
+
+2. `max_categories` is either `None` or any integer greater than 1. This
+   parameter sets an upper limit on the number of output features for each
+   input feature. `max_categories` includes the feature that combines the
+   infrequent categories.
+
+In the following example, the categories `'dog'` and `'snake'` are considered
+infrequent::
+
+    >>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
+    ...               ['snake'] * 3], dtype=object).T
+    >>> enc = preprocessing.OneHotEncoder(min_frequency=6, sparse=False).fit(X)
+    >>> enc.infrequent_categories_
+    [array(['dog', 'snake'], dtype=object)]
+    >>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
+    array([[0., 0., 1.],
+           [1., 0., 0.],
+           [0., 1., 0.],
+           [0., 0., 1.]])
+
+By setting `handle_unknown` to `'infrequent_if_exist'`, unknown categories are
+considered infrequent::
+
+    >>> enc = preprocessing.OneHotEncoder(
+    ...     handle_unknown='infrequent_if_exist', sparse=False, min_frequency=6)
+    >>> enc = enc.fit(X)
+    >>> enc.transform(np.array([['dragon']]))
+    array([[0., 0., 1.]])
+
+:meth:`OneHotEncoder.get_feature_names_out` uses 'infrequent_sklearn' as the
+infrequent feature name::
+
+    >>> enc.get_feature_names_out()
+    array(['x0_cat', 'x0_rabbit', 'x0_infrequent_sklearn'], dtype=object)
+
+When `handle_unknown` is set to `'infrequent_if_exist'` and an unknown
+category is encountered in transform:
+
+1. If infrequent category support was not configured or there was no
+   infrequent category during training, the resulting one-hot encoded columns
+   for this feature will be all zeros. In the inverse transform, an unknown
+   category will be denoted as `None`.
+
+2. If there is an infrequent category during training, the unknown category
+   will be considered infrequent. In the inverse transform,
+   'infrequent_sklearn' will be used to represent the infrequent category.
+
+Infrequent categories can also be configured using `max_categories`. In the
+following example, we set `max_categories=2` to limit the number of features
+in the output. This results in all but the `'cat'` category being considered
+infrequent, leading to two features: one for `'cat'` and one for the
+infrequent categories, which are all the others::
+
+    >>> enc = preprocessing.OneHotEncoder(max_categories=2, sparse=False)
+    >>> enc = enc.fit(X)
+    >>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
+    array([[0., 1.],
+           [1., 0.],
+           [0., 1.],
+           [0., 1.]])
+
+If both `max_categories` and `min_frequency` are non-default values, then
+categories are selected based on `min_frequency` first, and then at most
+`max_categories` categories are kept. In the following example,
+`min_frequency=4` considers only `'snake'` to be infrequent, but
+`max_categories=3` forces `'dog'` to also be infrequent::
+
+    >>> enc = preprocessing.OneHotEncoder(min_frequency=4, max_categories=3, sparse=False)
+    >>> enc = enc.fit(X)
+    >>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
+    array([[0., 0., 1.],
+           [1., 0., 0.],
+           [0., 1., 0.],
+           [0., 0., 1.]])
+
+If there are infrequent categories with the same cardinality at the cutoff of
+`max_categories`, the tie is broken by lexicographic ordering: among the tied
+categories, the lexicographically smaller ones become infrequent first. In the
+following example, "b", "c", and "d" have the same cardinality, and with
+`max_categories=3`, "b" and "c" are infrequent::
+
+    >>> X = np.asarray([["a"] * 20 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10], dtype=object).T
+    >>> enc = preprocessing.OneHotEncoder(max_categories=3).fit(X)
+    >>> enc.infrequent_categories_
+    [array(['b', 'c'], dtype=object)]
+
 .. _preprocessing_discretization:

 Discretization
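The selection rule spelled out across the examples in the new section can be sketched in plain Python. This is an illustrative re-implementation, not scikit-learn's actual code (`infrequent_categories` is a hypothetical helper), but it reproduces the outcomes of the doctests above:

```python
from collections import Counter

def infrequent_categories(values, min_frequency=1, max_categories=None):
    """Sketch of the rule: min_frequency is applied first, then
    max_categories caps the output width, counting the infrequent bucket
    as one feature. Count ties are broken lexicographically, with smaller
    categories becoming infrequent first."""
    counts = Counter(values)
    cats = sorted(counts)  # categories are kept in sorted order
    infrequent = {c for c in cats if counts[c] < min_frequency}
    n_outputs = len(cats) - len(infrequent) + 1  # +1 for the infrequent bucket
    if max_categories is not None and max_categories < n_outputs:
        # Stable sort: equal counts stay in lexicographic order, so the
        # lexicographically smaller ties fall into the infrequent slice.
        by_count = sorted(cats, key=lambda c: counts[c])
        infrequent.update(by_count[:len(cats) - (max_categories - 1)])
    return sorted(infrequent)

X = ['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 + ['snake'] * 3
print(infrequent_categories(X, min_frequency=6))                    # ['dog', 'snake']
print(infrequent_categories(X, max_categories=2))                   # ['dog', 'rabbit', 'snake']
print(infrequent_categories(X, min_frequency=4, max_categories=3))  # ['dog', 'snake']
```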
@@ -981,7 +1086,7 @@ Interestingly, a :class:`SplineTransformer` of ``degree=0`` is the same as
     Penalties <10.1214/ss/1038425655>`. Statist. Sci. 11 (1996), no. 2, 89--121.

   * Perperoglou, A., Sauerbrei, W., Abrahamowicz, M. et al. :doi:`A review of
-    spline function procedures in R <10.1186/s12874-019-0666-3>`.
+    spline function procedures in R <10.1186/s12874-019-0666-3>`.
     BMC Med Res Methodol 19, 46 (2019).

 .. _function_transformer:

doc/whats_new/v1.1.rst

Lines changed: 5 additions & 0 deletions
@@ -759,6 +759,11 @@ Changelog
 :mod:`sklearn.preprocessing`
 ............................

+- |Feature| :class:`preprocessing.OneHotEncoder` now supports grouping
+  infrequent categories into a single feature. Grouping infrequent categories
+  is enabled by specifying how to select infrequent categories with
+  `min_frequency` or `max_categories`. :pr:`16018` by `Thomas Fan`_.
+
 - |Enhancement| Adds a `subsample` parameter to :class:`preprocessing.KBinsDiscretizer`.
   This allows specifying a maximum number of samples to be used while fitting
   the model. The option is only available when `strategy` is set to `quantile`.

0 commit comments
