@@ -729,14 +729,15 @@ separate categories::
See :ref:`dict_feature_extraction` for categorical features that are
represented as a dict, not as scalars.

- .. _one_hot_encoder_infrequent_categories:
+ .. _encoder_infrequent_categories:

Infrequent categories
---------------------

- :class:`OneHotEncoder` supports aggregating infrequent categories into a single
- output for each feature. The parameters to enable the gathering of infrequent
- categories are `min_frequency` and `max_categories`.
+ :class:`OneHotEncoder` and :class:`OrdinalEncoder` support aggregating
+ infrequent categories into a single output for each feature. The parameters to
+ enable the gathering of infrequent categories are `min_frequency` and
+ `max_categories`.

1. `min_frequency` is either an integer greater than or equal to 1, or a float in
   the interval `(0.0, 1.0)`. If `min_frequency` is an integer, categories with
@@ -750,11 +751,47 @@ categories are `min_frequency` and `max_categories`.
   input feature. `max_categories` includes the feature that combines
   infrequent categories.

- In the following example, the categories, `'dog', 'snake'` are considered
- infrequent::
+ In the following example with :class:`OrdinalEncoder`, the categories `'dog'` and
+ `'snake'` are considered infrequent::

>>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
...               ['snake'] * 3], dtype=object).T
+ >>> enc = preprocessing.OrdinalEncoder(min_frequency=6).fit(X)
+ >>> enc.infrequent_categories_
+ [array(['dog', 'snake'], dtype=object)]
+ >>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
+ array([[2.],
+        [0.],
+        [1.],
+        [2.]])
+
+ :class:`OrdinalEncoder`'s `max_categories` does **not** take into account missing
+ or unknown categories. Setting `unknown_value` or `encoded_missing_value` to an
+ integer will increase the number of unique integer codes by one each. This can
+ result in up to `max_categories + 2` integer codes. In the following example,
+ "a" and "d" are considered infrequent and grouped together into a single
+ category, "b" and "c" are their own categories, unknown values are encoded as 3
+ and missing values are encoded as 4.
+
+ >>> X_train = np.array(
+ ...     [["a"] * 5 + ["b"] * 20 + ["c"] * 10 + ["d"] * 3 + [np.nan]],
+ ...     dtype=object).T
+ >>> enc = preprocessing.OrdinalEncoder(
+ ...     handle_unknown="use_encoded_value", unknown_value=3,
+ ...     max_categories=3, encoded_missing_value=4)
+ >>> _ = enc.fit(X_train)
+ >>> X_test = np.array([["a"], ["b"], ["c"], ["d"], ["e"], [np.nan]], dtype=object)
+ >>> enc.transform(X_test)
+ array([[2.],
+        [0.],
+        [1.],
+        [2.],
+        [3.],
+        [4.]])
+
+ Similarly, :class:`OneHotEncoder` can be configured to group together infrequent
+ categories::
+
>>> enc = preprocessing.OneHotEncoder(min_frequency=6, sparse_output=False).fit(X)
>>> enc.infrequent_categories_
[array(['dog', 'snake'], dtype=object)]
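
For readers of this change, the integer `min_frequency` rule in the `'dog'`/`'snake'` example can be mimicked in plain Python. This is only a rough sketch: `ordinal_codes_with_infrequent` is a hypothetical helper, not scikit-learn's implementation, and it assumes that a category is infrequent when it appears fewer than `min_frequency` times. It ignores `max_categories`, missing values, and unknown categories.

```python
from collections import Counter


def ordinal_codes_with_infrequent(values, min_frequency):
    """Hypothetical helper: map each category to an ordinal code.

    Frequent categories get codes 0..k-1 in sorted order; every
    infrequent category shares the single trailing code k.
    """
    counts = Counter(values)
    # Assumption: "infrequent" means seen fewer than min_frequency times.
    infrequent = sorted(c for c in counts if counts[c] < min_frequency)
    frequent = sorted(c for c in counts if counts[c] >= min_frequency)

    codes = {category: code for code, category in enumerate(frequent)}
    shared_code = len(frequent)
    for category in infrequent:
        codes[category] = shared_code
    return codes


# Same counts as the example in the diff: dog=5, cat=20, rabbit=10, snake=3.
animals = ["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3
codes = ordinal_codes_with_infrequent(animals, min_frequency=6)
print([codes[a] for a in ["dog", "cat", "rabbit", "snake"]])
```

With `min_frequency=6`, `'dog'` and `'snake'` fall below the threshold and share the last code, which agrees with the `[2., 0., 1., 2.]` column in the doctest output of the diff.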