@@ -594,17 +594,19 @@ dataset::
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

If there is a possibility that the training data might have missing categorical
- features, it can often be better to specify ``handle_unknown='ignore'`` instead
- of setting the ``categories`` manually as above. When
- ``handle_unknown='ignore'`` is specified and unknown categories are encountered
- during transform, no error will be raised but the resulting one-hot encoded
- columns for this feature will be all zeros
- (``handle_unknown='ignore'`` is only supported for one-hot encoding)::
-
- >>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
+ features, it can often be better to specify
+ `handle_unknown='infrequent_if_exist'` instead of setting the `categories`
+ manually as above. When `handle_unknown='infrequent_if_exist'` is specified
+ and unknown categories are encountered during transform, no error will be
+ raised but the resulting one-hot encoded columns for this feature will be all
+ zeros or, if infrequent categories are enabled, treated as the infrequent
+ category (`handle_unknown='infrequent_if_exist'` is only supported for
+ one-hot encoding)::
+
+ >>> enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
- OneHotEncoder(handle_unknown='ignore')
+ OneHotEncoder(handle_unknown='infrequent_if_exist')
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])
@@ -621,7 +623,8 @@ since co-linearity would cause the covariance matrix to be non-invertible::
... ['female', 'from Europe', 'uses Firefox']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
- [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
+ [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object),
+ array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
[0., 0., 0.]])
@@ -634,7 +637,8 @@ categories. In this case, you can set the parameter `drop='if_binary'`.
... ['female', 'Asia', 'Chrome']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary').fit(X)
>>> drop_enc.categories_
- [array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object), array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
+ [array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object),
+ array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 0., 0., 1., 0., 0., 1.],
[0., 0., 1., 0., 0., 1., 0.],
@@ -699,6 +703,107 @@ separate categories::
See :ref:`dict_feature_extraction` for categorical features that are
represented as a dict, not as scalars.

+ .. _one_hot_encoder_infrequent_categories:
+
+ Infrequent categories
+ ---------------------
+
+ :class:`OneHotEncoder` supports aggregating infrequent categories into a single
+ output for each feature. The parameters to enable the gathering of infrequent
+ categories are `min_frequency` and `max_categories`.
+
+ 1. `min_frequency` is either an integer greater than or equal to 1, or a float
+ in the interval `(0.0, 1.0)`. If `min_frequency` is an integer, categories
+ with a cardinality smaller than `min_frequency` will be considered
+ infrequent. If `min_frequency` is a float, categories with a cardinality
+ smaller than this fraction of the total number of samples will be considered
+ infrequent. The default value is 1, which means every category is encoded
+ separately. A sketch using a fractional `min_frequency` follows the integer
+ example below.
+
+ 2. `max_categories` is either `None` or any integer greater than 1. This
+ parameter sets an upper limit to the number of output features for each
+ input feature. `max_categories` includes the feature that combines
+ infrequent categories.
+
+ In the following example, the categories `'dog'` and `'snake'` are considered
+ infrequent::
+
+ >>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
+ ... ['snake'] * 3], dtype=object).T
+ >>> enc = preprocessing.OneHotEncoder(min_frequency=6, sparse=False).fit(X)
+ >>> enc.infrequent_categories_
+ [array(['dog', 'snake'], dtype=object)]
+ >>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
+ array([[0., 0., 1.],
+ [1., 0., 0.],
+ [0., 1., 0.],
+ [0., 0., 1.]])
+
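+ `min_frequency` can also be given as a float. A minimal sketch, reusing the
+ `X` above (38 samples in total): `min_frequency=0.2` marks every category seen
+ in fewer than 20% of the samples as infrequent, which gives the same grouping
+ as `min_frequency=6`::
+
+ >>> enc = preprocessing.OneHotEncoder(min_frequency=0.2, sparse=False).fit(X)
+ >>> enc.infrequent_categories_
+ [array(['dog', 'snake'], dtype=object)]
+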
+ By setting `handle_unknown` to `'infrequent_if_exist'`, unknown categories will
+ be considered infrequent::
+
+ >>> enc = preprocessing.OneHotEncoder(
+ ... handle_unknown='infrequent_if_exist', sparse=False, min_frequency=6)
+ >>> enc = enc.fit(X)
+ >>> enc.transform(np.array([['dragon']]))
+ array([[0., 0., 1.]])
+
+ :meth:`OneHotEncoder.get_feature_names_out` uses 'infrequent_sklearn' as the
+ infrequent feature name::
+
+ >>> enc.get_feature_names_out()
+ array(['x0_cat', 'x0_rabbit', 'x0_infrequent_sklearn'], dtype=object)
+
+ When `handle_unknown` is set to `'infrequent_if_exist'` and an unknown
+ category is encountered in transform:
+
+ 1. If infrequent category support was not configured or there was no
+ infrequent category during training, the resulting one-hot encoded columns
+ for this feature will be all zeros. In the inverse transform, an unknown
+ category will be denoted as `None`.
+
+ 2. If there is an infrequent category during training, the unknown category
+ will be considered infrequent. In the inverse transform, 'infrequent_sklearn'
+ will be used to represent the infrequent category. Both cases are sketched
+ below.
+
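+ A minimal sketch of both cases, assuming the `X` and `enc` fitted above (the
+ name `enc_default` is used only for illustration)::
+
+ >>> # case 1: infrequent categories were not configured, so the unknown
+ >>> # category maps to all zeros and inverts to None
+ >>> enc_default = preprocessing.OneHotEncoder(
+ ... handle_unknown='infrequent_if_exist', sparse=False).fit(X)
+ >>> enc_default.inverse_transform(enc_default.transform(np.array([['dragon']])))
+ array([[None]], dtype=object)
+ >>> # case 2: `enc` was fit with min_frequency=6, so the unknown category
+ >>> # joins the infrequent column and inverts to 'infrequent_sklearn'
+ >>> enc.inverse_transform(enc.transform(np.array([['dragon']])))
+ array([['infrequent_sklearn']], dtype=object)
+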
+ Infrequent categories can also be configured using `max_categories`. In the
+ following example, we set `max_categories=2` to limit the number of features in
+ the output. This will result in all but the `'cat'` category being considered
+ infrequent, leading to two features, one for `'cat'` and one for the infrequent
+ categories, which are all the others::
+
+ >>> enc = preprocessing.OneHotEncoder(max_categories=2, sparse=False)
+ >>> enc = enc.fit(X)
+ >>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
+ array([[0., 1.],
+ [1., 0.],
+ [0., 1.],
+ [0., 1.]])
+
+ If both `max_categories` and `min_frequency` are non-default values, then
+ categories are selected based on `min_frequency` first and `max_categories`
+ categories are kept. In the following example, `min_frequency=4` considers
+ only `snake` to be infrequent, but `max_categories=3` forces `dog` to also be
+ infrequent::
+
+ >>> enc = preprocessing.OneHotEncoder(min_frequency=4, max_categories=3, sparse=False)
+ >>> enc = enc.fit(X)
+ >>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
+ array([[0., 0., 1.],
+ [1., 0., 0.],
+ [0., 1., 0.],
+ [0., 0., 1.]])
+
+ If there are infrequent categories with the same cardinality at the cutoff of
+ `max_categories`, then the tie is broken based on the lexicon ordering of the
+ categories. In the following example, "b", "c", and "d" have the same
+ cardinality; with `max_categories=3`, "b" and "c" are considered infrequent
+ while the tie with "d" is resolved by lexicon ordering::
+
+ >>> X = np.asarray([["a"] * 20 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10], dtype=object).T
+ >>> enc = preprocessing.OneHotEncoder(max_categories=3).fit(X)
+ >>> enc.infrequent_categories_
+ [array(['b', 'c'], dtype=object)]
+
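+ To make the tie-break visible, a short sketch reusing the `enc` fitted just
+ above: the output feature names show that `'d'` was kept as its own column
+ while `'b'` and `'c'` were grouped into the infrequent one::
+
+ >>> enc.get_feature_names_out()
+ array(['x0_a', 'x0_d', 'x0_infrequent_sklearn'], dtype=object)
+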
.. _preprocessing_discretization:

Discretization
@@ -981,7 +1086,7 @@ Interestingly, a :class:`SplineTransformer` of ``degree=0`` is the same as
Penalties <10.1214/ss/1038425655>`. Statist. Sci. 11 (1996), no. 2, 89--121.

* Perperoglou, A., Sauerbrei, W., Abrahamowicz, M. et al. :doi:`A review of
- spline function procedures in R <10.1186/s12874-019-0666-3>`.
+ spline function procedures in R <10.1186/s12874-019-0666-3>`.
BMC Med Res Methodol 19, 46 (2019).

.. _function_transformer: