@@ -397,31 +397,37 @@ only one active.
Continuing the example above::
>>> enc = preprocessing.OneHotEncoder()
- >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
+ >>> enc.fit([['female', 'from US', 'uses Chrome'],
+ ... ['male', 'from Asia', 'uses Firefox']]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
- handle_unknown='error', n_values='auto', sparse=True)
- >>> enc.transform([[0, 1, 3]]).toarray()
- array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
+ handle_unknown='error', n_values=None, sparse=True, values='auto')
+ >>> enc.transform([['female', 'from Asia', 'uses Firefox']]).toarray()
+ array([[ 1., 0., 1., 0., 0., 1.]])
By default, how many values each feature can take is inferred automatically from the dataset.
- It is possible to specify this explicitly using the parameter ``n_values``.
+ It is possible to specify this explicitly using the parameter ``values``.
There are two genders, three possible continents and four web browsers in our
dataset.
Then we fit the estimator, and transform a data point.
- In the result, the first two numbers encode the gender, the next set of three
- numbers the continent and the last four the web browser.
+ In the result, the first two values encode the gender, the next set of three
+ values the continent and the last four values the web browser.
Note that, if there is a possibility that the training data might have missing categorical
features, one has to explicitly set ``n_values``. For example,
- >>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
+ >>> browsers = ['uses Internet Explorer', 'uses Chrome', 'uses Safari', 'uses Firefox']
+ >>> genders = ['male', 'female']
+ >>> locations = ['from Europe', 'from Asia', 'from US']
+ >>> enc = preprocessing.OneHotEncoder(values=[genders, locations, browsers])
>>> # Note that there are missing categorical values for the 2nd and 3rd
>>> # features
- >>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
+ >>> enc.fit([['female', 'from US', 'uses Chrome'],
+ ... ['male', 'from Asia', 'uses Internet Explorer']]) # doctest: +ELLIPSIS
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
- handle_unknown='error', n_values=[2, 3, 4], sparse=True)
- >>> enc.transform([[1, 0, 0]]).toarray()
- array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
+ handle_unknown='error', n_values=None, sparse=True,
+ values=[...])
+ >>> enc.transform([['male', 'from Europe', 'uses Safari']]).toarray()
+ array([[ 0., 1., 0., 1., 0., 0., 0., 0., 1.]])
See :ref:`dict_feature_extraction` for categorical features that are represented
as a dict, not as integers.
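
The column layout described in this section can be illustrated with a small pure-Python sketch. It is not scikit-learn's implementation: ``infer_categories`` and ``one_hot`` are hypothetical helper names introduced only for this illustration. It also assumes that each feature's categories are ordered alphabetically, which is an inference from the doctest output shown in this diff rather than something the diff states.

```python
def infer_categories(rows):
    """Collect the sorted unique values seen in each column (feature)."""
    return [sorted(set(column)) for column in zip(*rows)]


def one_hot(row, categories):
    """Encode one sample as a flat list with one block of columns per feature."""
    encoded = []
    for value, known in zip(row, categories):
        if value not in known:
            # Loosely mirrors the idea behind handle_unknown='error'.
            raise ValueError("unknown category: %r" % (value,))
        encoded.extend(1.0 if value == candidate else 0.0 for candidate in known)
    return encoded


train = [['female', 'from US', 'uses Chrome'],
         ['male', 'from Asia', 'uses Internet Explorer']]

# Inferred from the training data: only values that were seen get a column.
inferred = infer_categories(train)
print([len(c) for c in inferred])  # [2, 2, 2]

# Explicit category lists: 2 genders, 3 locations, 4 browsers.
# Sorting each list is an assumption made to match the documented output.
explicit = [sorted(['male', 'female']),
            sorted(['from Europe', 'from Asia', 'from US']),
            sorted(['uses Internet Explorer', 'uses Chrome',
                    'uses Safari', 'uses Firefox'])]

sample = ['male', 'from Europe', 'uses Safari']
print(one_hot(sample, explicit))
# [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

With the inferred categories, the same sample would raise ``ValueError`` because ``'from Europe'`` and ``'uses Safari'`` never appear in the training rows. This is the situation in which listing the categories explicitly becomes necessary.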