@@ -455,47 +455,87 @@ Such features can be efficiently coded as integers, for instance
``[0, 1, 3]`` while ``["female", "from Asia", "uses Chrome"]`` would be
``[1, 2, 1]``.

-Such integer representation can not be used directly with scikit-learn estimators, as these
-expect continuous input, and would interpret the categories as being ordered, which is often
-not desired (i.e. the set of browsers was ordered arbitrarily).
-
-One possibility to convert categorical features to features that can be used
-with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is
-implemented in :class:`OneHotEncoder`. This estimator transforms each
-categorical feature with ``m`` possible values into ``m`` binary features, with
-only one active.
+To convert categorical features to such integer codes, we can use the
+:class:`CategoricalEncoder`. When specifying that we want to perform an
+ordinal encoding, the estimator transforms each categorical feature to one
+new feature of integers (0 to ``n_categories - 1``)::
+
+  >>> enc = preprocessing.CategoricalEncoder(encoding='ordinal')
+  >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
+  >>> enc.fit(X)  # doctest: +ELLIPSIS
+  CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+            encoding='ordinal', handle_unknown='error')
+  >>> enc.transform([['female', 'from US', 'uses Safari']])
+  array([[ 0.,  1.,  1.]])
+
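To make the mechanics of the ordinal encoding concrete, here is a minimal plain-Python sketch (an illustration of the idea only, not scikit-learn's implementation; that categories are ordered alphabetically per column is an assumption consistent with the output shown above):

```python
# Toy data: one row per sample, one column per categorical feature.
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

# "Fit": collect the sorted list of categories seen in each column.
categories = [sorted(set(col)) for col in zip(*X)]

def ordinal_encode(rows):
    # Map each value to its index (0 to n_categories - 1) within the
    # sorted category list of its column.
    return [[float(categories[j].index(v)) for j, v in enumerate(row)]
            for row in rows]

print(ordinal_encode([['female', 'from US', 'uses Safari']]))
# [[0.0, 1.0, 1.0]] -- matching the transform output above
```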
+Such an integer representation cannot, however, be used directly with all
+scikit-learn estimators, as these expect continuous input, and would interpret
+the categories as being ordered, which is often not desired (i.e. the set of
+browsers was ordered arbitrarily).
+
+Another possibility to convert categorical features to features that can be used
+with scikit-learn estimators is to use a one-of-K, also known as one-hot or
+dummy encoding.
+This type of encoding is the default behaviour of the :class:`CategoricalEncoder`,
+which transforms each categorical feature with ``n_categories`` possible values
+into ``n_categories`` binary features, with one of them 1, and all others 0.

Continuing the example above::

-  >>> enc = preprocessing.OneHotEncoder()
-  >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  # doctest: +ELLIPSIS
-  OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
-         handle_unknown='error', n_values='auto', sparse=True)
-  >>> enc.transform([[0, 1, 3]]).toarray()
-  array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
-
-By default, how many values each feature can take is inferred automatically from the dataset.
-It is possible to specify this explicitly using the parameter ``n_values``.
-There are two genders, three possible continents and four web browsers in our
-dataset.
-Then we fit the estimator, and transform a data point.
-In the result, the first two numbers encode the gender, the next set of three
-numbers the continent and the last four the web browser.
-
-Note that, if there is a possibility that the training data might have missing categorical
-features, one has to explicitly set ``n_values``. For example,
-
-  >>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
-  >>> # Note that there are missing categorical values for the 2nd and 3rd
-  >>> # features
-  >>> enc.fit([[1, 2, 3], [0, 2, 0]])  # doctest: +ELLIPSIS
-  OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
-         handle_unknown='error', n_values=[2, 3, 4], sparse=True)
-  >>> enc.transform([[1, 0, 0]]).toarray()
-  array([[ 0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])
+  >>> enc = preprocessing.CategoricalEncoder()
+  >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
+  >>> enc.fit(X)  # doctest: +ELLIPSIS
+  CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+            encoding='onehot', handle_unknown='error')
+  >>> enc.transform([['female', 'from US', 'uses Safari'],
+  ...                ['male', 'from Europe', 'uses Safari']]).toarray()
+  array([[ 1.,  0.,  0.,  1.,  0.,  1.],
+         [ 0.,  1.,  1.,  0.,  0.,  1.]])
+
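The one-hot expansion itself can be sketched in a few lines of plain Python (an illustration only, not scikit-learn's implementation; the alphabetically sorted per-column category lists are an assumption matching the example above):

```python
# Toy data: one row per sample, one column per categorical feature.
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

# "Fit": per-column sorted category lists.
categories = [sorted(set(col)) for col in zip(*X)]

def one_hot_encode(rows):
    # Each value becomes a block of len(cats) columns containing a
    # single 1 at the position of its category, 0 everywhere else.
    encoded = []
    for row in rows:
        out = []
        for cats, value in zip(categories, row):
            block = [0.0] * len(cats)
            block[cats.index(value)] = 1.0
            out.extend(block)
        encoded.append(out)
    return encoded

print(one_hot_encode([['female', 'from US', 'uses Safari'],
                      ['male', 'from Europe', 'uses Safari']]))
# [[1.0, 0.0, 0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]]
```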
+By default, the values each feature can take are inferred automatically
+from the dataset and can be found in the ``categories_`` attribute::
+
+  >>> enc.categories_
+  [array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
+
+It is possible to specify this explicitly using the parameter ``categories``.
+There are two genders, four possible continents and four web browsers in our
+dataset::
+
+  >>> genders = ['female', 'male']
+  >>> locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
+  >>> browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
+  >>> enc = preprocessing.CategoricalEncoder(categories=[genders, locations, browsers])
+  >>> # Note that there are missing categorical values for the 2nd and 3rd
+  >>> # features
+  >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
+  >>> enc.fit(X)  # doctest: +ELLIPSIS
+  CategoricalEncoder(categories=[...],
+            dtype=<... 'numpy.float64'>, encoding='onehot',
+            handle_unknown='error')
+  >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
+  array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])
+
+If there is a possibility that the training data might have missing categorical
+features, it can often be better to specify ``handle_unknown='ignore'`` instead
+of setting the ``categories`` manually as above. When
+``handle_unknown='ignore'`` is specified and unknown categories are encountered
+during transform, no error will be raised but the resulting one-hot encoded
+columns for this feature will be all zeros
+(``handle_unknown='ignore'`` is only supported for one-hot encoding)::
+
+  >>> enc = preprocessing.CategoricalEncoder(handle_unknown='ignore')
+  >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
+  >>> enc.fit(X)  # doctest: +ELLIPSIS
+  CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+            encoding='onehot', handle_unknown='ignore')
+  >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
+  array([[ 1.,  0.,  0.,  0.,  0.,  0.]])
+
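The all-zeros behaviour for unknown categories can be sketched in plain Python as a small variant of one-hot encoding (an illustration only, not scikit-learn's implementation; the sorted per-column category lists are an assumption matching the example above):

```python
# Toy data: one row per sample, one column per categorical feature.
X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

# "Fit": per-column sorted category lists.
categories = [sorted(set(col)) for col in zip(*X)]

def one_hot_encode_ignore(rows):
    # Like plain one-hot encoding, but a value not seen during "fit"
    # leaves its block of columns all zeros instead of raising an error.
    encoded = []
    for row in rows:
        out = []
        for cats, value in zip(categories, row):
            block = [0.0] * len(cats)
            if value in cats:
                block[cats.index(value)] = 1.0
            out.extend(block)
        encoded.append(out)
    return encoded

print(one_hot_encode_ignore([['female', 'from Asia', 'uses Chrome']]))
# [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]] -- unknown location/browser give zeros
```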

See :ref:`dict_feature_extraction` for categorical features that are represented
-as a dict, not as integers.
+as a dict, not as scalars.

.. _imputation: