Refactored OneHotEncoder to work with strings · scikit-learn/scikit-learn@d8e4781 · GitHub

Commit d8e4781

Refactored OneHotEncoder to work with strings
1 parent f254ce2 commit d8e4781

3 files changed, +256 -181 lines changed


doc/modules/preprocessing.rst

Lines changed: 18 additions & 12 deletions
@@ -397,31 +397,37 @@ only one active.
 Continuing the example above::
 
     >>> enc = preprocessing.OneHotEncoder()
-    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]) # doctest: +ELLIPSIS
+    >>> enc.fit([['female', 'from US', 'uses Chrome'],
+    ...          ['male', 'from Asia', 'uses Firefox']]) # doctest: +ELLIPSIS
     OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
-           handle_unknown='error', n_values='auto', sparse=True)
-    >>> enc.transform([[0, 1, 3]]).toarray()
-    array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
+           handle_unknown='error', n_values=None, sparse=True, values='auto')
+    >>> enc.transform([['female', 'from Asia', 'uses Firefox']]).toarray()
+    array([[ 1., 0., 1., 0., 0., 1.]])
 
 By default, how many values each feature can take is inferred automatically from the dataset.
-It is possible to specify this explicitly using the parameter ``n_values``.
+It is possible to specify this explicitly using the parameter ``values``.
 There are two genders, three possible continents and four web browsers in our
 dataset.
 Then we fit the estimator, and transform a data point.
-In the result, the first two numbers encode the gender, the next set of three
-numbers the continent and the last four the web browser.
+In the result, the first two values are genders, the next set of three
+values are the continents and the last values are web browsers.
 
 Note that, if there is a possibility that the training data might have missing categorical
 features, one has to explicitly set ``n_values``. For example,
 
-    >>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
+    >>> browsers = ['uses Internet Explorer', 'uses Chrome' , 'uses Safari', 'uses Firefox']
+    >>> genders = ['male', 'female']
+    >>> locations = ['from Europe', 'from Asia', 'from US']
+    >>> enc = preprocessing.OneHotEncoder(values=[genders, locations, browsers])
     >>> # Note that for there are missing categorical values for the 2nd and 3rd
     >>> # feature
-    >>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
+    >>> enc.fit([['female', 'from US', 'uses Chrome'],
+    ...          ['male', 'from Asia', 'uses Internet Explorer']]) # doctest: +ELLIPSIS
     OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
-           handle_unknown='error', n_values=[2, 3, 4], sparse=True)
-    >>> enc.transform([[1, 0, 0]]).toarray()
-    array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
+           handle_unknown='error', n_values=None, sparse=True,
+           values=[...])
+    >>> enc.transform([['male', 'from Europe', 'uses Safari']]).toarray()
+    array([[ 0., 1., 0., 1., 0., 0., 0., 0., 1.]])
 
 See :ref:`dict_feature_extraction` for categorical features that are represented
 as a dict, not as integers.
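
To make the column layout in the two doctests above easier to follow, here is a minimal pure-Python sketch of string one-hot encoding. It is not the code from this commit: the helpers `fit_categories` and `one_hot` are hypothetical names, and the assumption that each feature's categories are sorted before encoding is inferred only from the outputs shown in the documentation change.

```python
def fit_categories(rows, values="auto"):
    """Return one sorted category list per feature (column).

    Hypothetical helper: with values="auto" the categories are inferred
    from the training rows; otherwise the explicit per-feature lists are used.
    Sorting is an assumption inferred from the doctest outputs above.
    """
    if values == "auto":
        n_features = len(rows[0])
        return [sorted({row[j] for row in rows}) for j in range(n_features)]
    return [sorted(feature_values) for feature_values in values]


def one_hot(row, categories):
    """Encode one sample as a flat list with a single 1.0 per feature."""
    encoded = []
    for value, known in zip(row, categories):
        if value not in known:
            # Mirrors the spirit of handle_unknown='error'.
            raise ValueError(f"unknown category: {value!r}")
        encoded.extend(1.0 if value == candidate else 0.0 for candidate in known)
    return encoded


# Categories inferred from the data: two values per feature, six columns.
train = [['female', 'from US', 'uses Chrome'],
         ['male', 'from Asia', 'uses Firefox']]
inferred_row = one_hot(['female', 'from Asia', 'uses Firefox'],
                       fit_categories(train))

# Categories given explicitly: 2 + 3 + 4 = 9 columns, even though the
# training rows never contain 'from Europe' or 'uses Safari'.
genders = ['male', 'female']
locations = ['from Europe', 'from Asia', 'from US']
browsers = ['uses Internet Explorer', 'uses Chrome', 'uses Safari', 'uses Firefox']
explicit_row = one_hot(['male', 'from Europe', 'uses Safari'],
                       fit_categories(train, values=[genders, locations, browsers]))
```

Under these assumptions, `inferred_row` and `explicit_row` reproduce the six-column and nine-column arrays in the updated doctests, which shows why explicit category lists are needed when the training data does not cover every category.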
