DOC Change CountVectorizer(...lambda..) to OneHotEncoder() in ColumnT… · scikit-learn/scikit-learn@2480368 · GitHub

Commit 2480368

maikia authored and qinhanmin2014 committed
DOC Change CountVectorizer(...lambda..) to OneHotEncoder() in ColumnTransformer examples (#13212)
1 parent d19a5dc commit 2480368

1 file changed, 13 additions and 9 deletions

doc/modules/compose.rst

Lines changed: 13 additions & 9 deletions
@@ -408,16 +408,19 @@ preprocessing or a specific feature extraction method::
  ... 'user_rating': [4, 5, 4, 3]})

  For this data, we might want to encode the ``'city'`` column as a categorical
- variable, but apply a :class:`feature_extraction.text.CountVectorizer
+ variable using :class:`preprocessing.OneHotEncoder
+ <sklearn.preprocessing.OneHotEncoder>` but apply a
+ :class:`feature_extraction.text.CountVectorizer
  <sklearn.feature_extraction.text.CountVectorizer>` to the ``'title'`` column.
  As we might use multiple feature extraction methods on the same column, we give
  each transformer a unique name, say ``'city_category'`` and ``'title_bow'``.
  By default, the remaining rating columns are ignored (``remainder='drop'``)::

  >>> from sklearn.compose import ColumnTransformer
  >>> from sklearn.feature_extraction.text import CountVectorizer
+ >>> from sklearn.preprocessing import OneHotEncoder
  >>> column_trans = ColumnTransformer(
- ... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
+ ... [('city_category', OneHotEncoder(dtype='int'),['city']),
  ... ('title_bow', CountVectorizer(), 'title')],
  ... remainder='drop')

@@ -428,7 +431,7 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::

  >>> column_trans.get_feature_names()
  ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
- ['city_category__London', 'city_category__Paris', 'city_category__Sallisaw',
+ ['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
  'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
  'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
  'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
@@ -443,8 +446,9 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::

  In the above example, the
  :class:`~sklearn.feature_extraction.text.CountVectorizer` expects a 1D array as
- input and therefore the columns were specified as a string (``'city'``).
- However, other transformers generally expect 2D data, and in that case you need
+ input and therefore the columns were specified as a string (``'title'``).
+ However, :class:`preprocessing.OneHotEncoder <sklearn.preprocessing.OneHotEncoder>`
+ as most of other transformers expects 2D data, therefore in that case you need
  to specify the column as a list of strings (``['city']``).

  Apart from a scalar or a single item list, the column selection can be specified
@@ -457,7 +461,7 @@ We can keep the remaining rating columns by setting
  transformation::

  >>> column_trans = ColumnTransformer(
- ... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
+ ... [('city_category', OneHotEncoder(dtype='int'),['city']),
  ... ('title_bow', CountVectorizer(), 'title')],
  ... remainder='passthrough')

@@ -474,7 +478,7 @@ the transformation::

  >>> from sklearn.preprocessing import MinMaxScaler
  >>> column_trans = ColumnTransformer(
- ... [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
+ ... [('city_category', OneHotEncoder(), ['city']),
  ... ('title_bow', CountVectorizer(), 'title')],
  ... remainder=MinMaxScaler())

@@ -492,14 +496,14 @@ above example would be::

  >>> from sklearn.compose import make_column_transformer
  >>> column_trans = make_column_transformer(
- ... (CountVectorizer(analyzer=lambda x: [x]), 'city'),
+ ... (OneHotEncoder(), ['city']),
  ... (CountVectorizer(), 'title'),
  ... remainder=MinMaxScaler())
  >>> column_trans # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
  ColumnTransformer(n_jobs=None, remainder=MinMaxScaler(copy=True, ...),
  sparse_threshold=0.3,
  transformer_weights=None,
- transformers=[('countvectorizer-1', ...)
+ transformers=[('onehotencoder', ...)

  .. topic:: Examples:

0 commit comments
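
Illustrative note (not part of the commit): the sketch below shows the pattern this change documents, with ``OneHotEncoder`` receiving the ``'city'`` column as a one-item list (2D selection) and ``CountVectorizer`` receiving ``'title'`` as a bare string (1D selection). Only the city names, the title tokens and the ``user_rating`` values are visible in the diff; the full title strings and the ``expert_rating`` column are assumptions made so the example runs. The diff's ``get_feature_names()`` call is left out because recent scikit-learn releases expose ``get_feature_names_out()`` instead, and the generated names may differ from those shown above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Assumed sample data: the diff shows only 'user_rating', the city names
# and the title tokens; the title strings and 'expert_rating' are made up
# to match those tokens.
X = pd.DataFrame({
    "city": ["London", "London", "Paris", "Sallisaw"],
    "title": ["His Last Bow", "How Watson Learned the Trick",
              "A Moveable Feast", "The Grapes of Wrath"],
    "expert_rating": [5, 3, 4, 5],
    "user_rating": [4, 5, 4, 3],
})

column_trans = ColumnTransformer(
    [
        # OneHotEncoder expects 2D input, so the column is a list of names.
        ("city_category", OneHotEncoder(dtype="int"), ["city"]),
        # CountVectorizer expects a 1D sequence of strings, so a bare string.
        ("title_bow", CountVectorizer(), "title"),
    ],
    remainder="drop",  # the two rating columns are ignored
)

features = column_trans.fit_transform(X)

# Inspect what each fitted transformer learned.
city_categories = list(
    column_trans.named_transformers_["city_category"].categories_[0]
)
title_vocab = sorted(column_trans.named_transformers_["title_bow"].vocabulary_)

print(features.shape)
print(city_categories)
print(title_vocab)
```

Passing ``'city'`` as a bare string to ``OneHotEncoder`` would hand it a 1D column, which it is expected to reject; wrapping the name in a list is what keeps the selection two-dimensional.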
