Merge branch 'master' into sw · scikit-learn/scikit-learn@dc14749 · GitHub

Commit dc14749

Author: Shangwu Yao
Merge branch 'master' into sw
2 parents e030adb + 1c61b8a, commit dc14749


55 files changed: +2941 additions, -1331 deletions

doc/conftest.py

Lines changed: 9 additions & 0 deletions
@@ -55,6 +55,13 @@ def setup_working_with_text_data():
     check_skip_network()
 
 
+def setup_compose():
+    try:
+        import pandas  # noqa
+    except ImportError:
+        raise SkipTest("Skipping compose.rst, pandas not installed")
+
+
 def pytest_runtest_setup(item):
     fname = item.fspath.strpath
     if fname.endswith('datasets/labeled_faces.rst'):
@@ -67,6 +74,8 @@ def pytest_runtest_setup(item):
         setup_twenty_newsgroups()
     elif fname.endswith('tutorial/text_analytics/working_with_text_data.rst'):
         setup_working_with_text_data()
+    elif fname.endswith('modules/compose.rst'):
+        setup_compose()
 
 
 def pytest_runtest_teardown(item):
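The hunk above guards optional-dependency doctests: try the import, and raise `SkipTest` when it fails so pytest reports a skip rather than an error. Below is a minimal, library-free sketch of the same pattern, assuming a hypothetical `fake_pandas_module` name (deliberately missing, so the skip path is exercised anywhere); it is not the scikit-learn code itself.

```python
import unittest


def make_doc_setup(module_name, doc_name):
    """Return a setup function that skips doc_name's doctests when
    module_name cannot be imported (the pattern added to conftest.py)."""
    def setup():
        try:
            __import__(module_name)
        except ImportError:
            # pytest translates unittest.SkipTest into a skip, not a failure
            raise unittest.SkipTest(
                "Skipping %s, %s not installed" % (doc_name, module_name))
    return setup


# 'fake_pandas_module' is a hypothetical, intentionally-missing module
setup_compose = make_doc_setup("fake_pandas_module", "compose.rst")
try:
    setup_compose()
    skipped = False
except unittest.SkipTest:
    skipped = True
```

A setup built on a module that does exist (e.g. `sys`) returns without raising, so the corresponding doctests run normally.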

doc/datasets/twenty_newsgroups.rst

Lines changed: 2 additions & 2 deletions
@@ -135,7 +135,7 @@ which is fast to train and achieves a decent F-score::
     >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
     0.88213592402729568
 
-(The example :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py` shuffles
+(The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles
 the training and test data, instead of segmenting by time, and in that case
 multinomial Naive Bayes gets a much higher F-score of 0.88. Are you suspicious
 yet of what's going on inside this classifier?)
@@ -215,4 +215,4 @@ the ``--filter`` option to compare the results.
 
 * :ref:`sphx_glr_auto_examples_model_selection_grid_search_text_feature_extraction.py`
 
-* :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`
+* :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`

doc/modules/classes.rst

Lines changed: 9 additions & 0 deletions
@@ -158,8 +158,15 @@ details.
    :toctree: generated
    :template: class.rst
 
+   compose.ColumnTransformer
    compose.TransformedTargetRegressor
 
+.. autosummary::
+   :toctree: generated/
+   :template: function.rst
+
+   compose.make_column_transformer
+
 .. _covariance_ref:
 
 :mod:`sklearn.covariance`: Covariance Estimators
@@ -749,6 +756,7 @@ Kernels:
    linear_model.logistic_regression_path
    linear_model.orthogonal_mp
    linear_model.orthogonal_mp_gram
+   linear_model.ridge_regression
 
 
 .. _manifold_ref:
@@ -1463,6 +1471,7 @@ Low-level methods
    utils.testing.assert_raise_message
    utils.testing.all_estimators
 
+
 Recently deprecated
 ===================
doc/modules/clustering.rst

Lines changed: 1 addition & 1 deletion
@@ -271,7 +271,7 @@ small, as shown in the example and cited reference.
 * :ref:`sphx_glr_auto_examples_cluster_plot_mini_batch_kmeans.py`: Comparison of KMeans and
   MiniBatchKMeans
 
-* :ref:`sphx_glr_auto_examples_text_document_clustering.py`: Document clustering using sparse
+* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`: Document clustering using sparse
   MiniBatchKMeans
 
 * :ref:`sphx_glr_auto_examples_cluster_plot_dict_face_patches.py`

doc/modules/compose.rst

Lines changed: 106 additions & 4 deletions
@@ -304,9 +304,13 @@ FeatureUnion: composite feature spaces
 :class:`FeatureUnion` combines several transformer objects into a new
 transformer that combines their output. A :class:`FeatureUnion` takes
 a list of transformer objects. During fitting, each of these
-is fit to the data independently. For transforming data, the
-transformers are applied in parallel, and the sample vectors they output
-are concatenated end-to-end into larger vectors.
+is fit to the data independently. The transformers are applied in parallel,
+and the feature matrices they output are concatenated side-by-side into a
+larger matrix.
+
+When you want to apply different transformations to each field of the data,
+see the related class :class:`sklearn.compose.ColumnTransformer`
+(see :ref:`user guide <column_transformer>`).
 
 :class:`FeatureUnion` serves the same purposes as :class:`Pipeline` -
 convenience and joint parameter estimation and validation.
@@ -357,4 +361,102 @@ and ignored by setting to ``None``::
 .. topic:: Examples:
 
  * :ref:`sphx_glr_auto_examples_plot_feature_stacker.py`
- * :ref:`sphx_glr_auto_examples_hetero_feature_union.py`
+
+
+.. _column_transformer:
+
+ColumnTransformer for heterogeneous data
+========================================
+
+.. warning::
+
+    The :class:`compose.ColumnTransformer <sklearn.compose.ColumnTransformer>`
+    class is experimental and the API is subject to change.
+
+Many datasets contain features of different types, say text, floats, and dates,
+where each type of feature requires separate preprocessing or feature
+extraction steps. Often it is easiest to preprocess data before applying
+scikit-learn methods, for example using `pandas <http://pandas.pydata.org/>`__.
+Processing your data before passing it to scikit-learn might be problematic for
+one of the following reasons:
+
+1. Incorporating statistics from test data into the preprocessors makes
+   cross-validation scores unreliable (known as *data leakage*),
+   for example in the case of scalers or imputing missing values.
+2. You may want to include the parameters of the preprocessors in a
+   :ref:`parameter search <grid_search>`.
+
+The :class:`~sklearn.compose.ColumnTransformer` helps perform different
+transformations for different columns of the data, within a
+:class:`~sklearn.pipeline.Pipeline` that is safe from data leakage and that can
+be parametrized. :class:`~sklearn.compose.ColumnTransformer` works on
+arrays, sparse matrices, and
+`pandas DataFrames <http://pandas.pydata.org/pandas-docs/stable/>`__.
+
+To each column, a different transformation can be applied, such as
+preprocessing or a specific feature extraction method::
+
+  >>> import pandas as pd
+  >>> X = pd.DataFrame(
+  ...     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
+  ...      'title': ["His Last Bow", "How Watson Learned the Trick",
+  ...                "A Moveable Feast", "The Grapes of Wrath"]})
+
+For this data, we might want to encode the ``'city'`` column as a categorical
+variable, but apply a :class:`feature_extraction.text.CountVectorizer
+<sklearn.feature_extraction.text.CountVectorizer>` to the ``'title'`` column.
+As we might use multiple feature extraction methods on the same column, we give
+each transformer a unique name, say ``'city_category'`` and ``'title_bow'``::
+
+  >>> from sklearn.compose import ColumnTransformer
+  >>> from sklearn.feature_extraction.text import CountVectorizer
+  >>> column_trans = ColumnTransformer(
+  ...     [('city_category', CountVectorizer(analyzer=lambda x: [x]), 'city'),
+  ...      ('title_bow', CountVectorizer(), 'title')])
+
+  >>> column_trans.fit(X)  # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
+  ColumnTransformer(n_jobs=1, remainder='passthrough', transformer_weights=None,
+      transformers=...)
+
+  >>> column_trans.get_feature_names()
+  ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
+  ['city_category__London', 'city_category__Paris', 'city_category__Sallisaw',
+   'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
+   'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
+   'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
+   'title_bow__wrath']
+
+  >>> column_trans.transform(X).toarray()
+  ... # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
+  array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
+         [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
+         [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
+         [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)
+
+In the above example, the
+:class:`~sklearn.feature_extraction.text.CountVectorizer` expects a 1D array as
+input and therefore the columns were specified as a string (``'city'``).
+However, other transformers generally expect 2D data, and in that case you need
+to specify the column as a list of strings (``['city']``).
+
+Apart from a scalar or a single item list, the column selection can be specified
+as a list of multiple items, an integer array, a slice, or a boolean mask.
+Strings can reference columns if the input is a DataFrame; integers are always
+interpreted as positional columns.
+
+The :func:`~sklearn.compose.make_column_transformer` function is available
+to more easily create a :class:`~sklearn.compose.ColumnTransformer` object.
+Specifically, the names will be given automatically. The equivalent for the
+above example would be::
+
+  >>> from sklearn.compose import make_column_transformer
+  >>> column_trans = make_column_transformer(
+  ...     ('city', CountVectorizer(analyzer=lambda x: [x])),
+  ...     ('title', CountVectorizer()))
+  >>> column_trans  # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
+  ColumnTransformer(n_jobs=1, remainder='passthrough', transformer_weights=None,
+      transformers=[('countvectorizer-1', ...)
+
+.. topic:: Examples:
+
+ * :ref:`sphx_glr_auto_examples_column_transformer.py`
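To make the mechanics of the section above concrete, here is a minimal, dependency-free sketch of the idea it documents: fit one transformer per named column, then concatenate the transformed rows side-by-side. The one-hot and bag-of-words helpers are illustrative stand-ins for `CountVectorizer`, not scikit-learn code (a real `CountVectorizer` also drops single-character tokens such as "a").

```python
def fit_one_hot(values):
    """Fit a one-hot encoder on a list of category values."""
    vocab = sorted(set(values))

    def transform(vals):
        return [[1 if v == w else 0 for w in vocab] for v in vals]
    return transform


def fit_bow(texts):
    """Fit a whitespace bag-of-words counter (CountVectorizer stand-in)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})

    def transform(ts):
        return [[t.lower().split().count(w) for w in vocab] for t in ts]
    return transform


def column_transform(data, transformers):
    """Fit each (name, fitter, column) spec on its own column, then
    concatenate the per-column outputs side-by-side, row by row."""
    fitted = [(fit(data[col]), col) for name, fit, col in transformers]
    n_rows = len(next(iter(data.values())))
    out = []
    for i in range(n_rows):
        row = []
        for transform, col in fitted:
            row.extend(transform([data[col][i]])[0])
        out.append(row)
    return out


X = {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"]}
result = column_transform(
    X, [('city_category', fit_one_hot, 'city'),
        ('title_bow', fit_bow, 'title')])
```

Each output row starts with the 3 one-hot city columns (sorted: London, Paris, Sallisaw) followed by the title word counts, mirroring the feature-name ordering shown in the doctest above.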

doc/modules/decomposition.rst

Lines changed: 1 addition & 1 deletion
@@ -347,7 +347,7 @@ compensating for LSA's erroneous assumptions about textual data.
 
 .. topic:: Examples:
 
-* :ref:`sphx_glr_auto_examples_text_document_clustering.py`
+* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`
 
 .. topic:: References:

doc/modules/feature_extraction.rst

Lines changed: 3 additions & 3 deletions
@@ -657,12 +657,12 @@ In particular in a **supervised setting** it can be successfully combined
 with fast and scalable linear models to train **document classifiers**,
 for instance:
 
-* :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`
+* :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`
 
 In an **unsupervised setting** it can be used to group similar documents
 together by applying clustering algorithms such as :ref:`k_means`:
 
-* :ref:`sphx_glr_auto_examples_text_document_clustering.py`
+* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`
 
 Finally it is possible to discover the main topics of a corpus by
 relaxing the hard assignment constraint of clustering, for instance by
@@ -916,7 +916,7 @@ Some tips and tricks:
   (Note that this will not filter out punctuation.)
 
-  The following example will, for instance, transform some British spelling
+  The following example will, for instance, transform some British spelling
   to American spelling::
 
   >>> import re
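The hunk ends just before the regex example it refers to. As a hedged illustration of that kind of preprocessor, here is a small sketch: a substitution table mapping a few British spellings to American ones, applied before tokenization. The word list and function name are hypothetical, not taken from scikit-learn's docs.

```python
import re

# Illustrative mapping; a real normalizer would use a fuller table
SPELLINGS = {'colour': 'color', 'behaviour': 'behavior',
             'organise': 'organize', 'analyse': 'analyze'}


def to_american(text):
    """Lower-case the text and rewrite known British spellings,
    matching whole words only via \\b boundaries."""
    pattern = re.compile(r'\b(' + '|'.join(SPELLINGS) + r')\b')
    return pattern.sub(lambda m: SPELLINGS[m.group(1)], text.lower())
```

Such a function can be passed as the `preprocessor` argument of a vectorizer, so the normalization happens before counting.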

doc/modules/feature_selection.rst

Lines changed: 1 addition & 1 deletion
@@ -198,7 +198,7 @@ alpha parameter, the fewer features selected.
 
 .. topic:: Examples:
 
-* :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`: Comparison
+* :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`: Comparison
   of different algorithms for document classification including L1-based
   feature selection.

doc/modules/linear_model.rst

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ its ``coef_`` member::
 .. topic:: Examples:
 
  * :ref:`sphx_glr_auto_examples_linear_model_plot_ridge_path.py`
-* :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`
+* :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`
 
 
 Ridge Complexity
Ridge Complexity

doc/modules/model_evaluation.rst

Lines changed: 7 additions & 7 deletions
@@ -565,7 +565,7 @@ false negatives and true positives as follows::
   for an example of using a confusion matrix to classify
   hand-written digits.
 
-* See :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`
+* See :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`
   for an example of using a confusion matrix to classify text
   documents.
 
@@ -598,7 +598,7 @@ and inferred labels::
   for an example of classification report usage for
   hand-written digits.
 
-* See :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`
+* See :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`
   for an example of classification report usage for text
   documents.
 
@@ -749,7 +749,7 @@ binary classification and multilabel indicator format.
 
 .. topic:: Examples:
 
-* See :ref:`sphx_glr_auto_examples_text_document_classification_20newsgroups.py`
+* See :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`
   for an example of :func:`f1_score` usage to classify text
   documents.
 
@@ -859,10 +859,10 @@ specified by the ``average`` argument to the
 :func:`average_precision_score` (multilabel only), :func:`f1_score`,
 :func:`fbeta_score`, :func:`precision_recall_fscore_support`,
 :func:`precision_score` and :func:`recall_score` functions, as described
-:ref:`above <average>`. Note that for "micro"-averaging in a multiclass setting
-with all labels included will produce equal precision, recall and :math:`F`,
-while "weighted" averaging may produce an F-score that is not between
-precision and recall.
+:ref:`above <average>`. Note that if all labels are included, "micro"-averaging
+in a multiclass setting will produce precision, recall and :math:`F`
+that are all identical to accuracy. Also note that "weighted" averaging may
+produce an F-score that is not between precision and recall.
 
 To make this more explicit, consider the following notation:
0 commit comments