[MRG] DOC add mixed categorical / continuous example with ColumnTransformer by partmor · Pull Request #11197 · scikit-learn/scikit-learn
Merged: 14 commits, Jun 13, 2018
2 changes: 1 addition & 1 deletion doc/faq.rst
@@ -362,7 +362,7 @@ of a single numeric dtype. These do not explicitly represent categorical
variables at present. Thus, unlike R's data.frames or pandas.DataFrame, we
require explicit conversion of categorical features to numeric values, as
discussed in :ref:`preprocessing_categorical_features`.
-See also :ref:`sphx_glr_auto_examples_compose_column_transformer.py` for an
+See also :ref:`sphx_glr_auto_examples_compose_column_transformer_mixed_types.py` for an
example of working with heterogeneous (e.g. categorical and numeric) data.

Why does Scikit-learn not directly work with, for example, pandas.DataFrame?
1 change: 1 addition & 0 deletions doc/modules/compose.rst
@@ -464,3 +464,4 @@ above example would be::
.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_compose_column_transformer.py`
+* :ref:`sphx_glr_auto_examples_compose_column_transformer_mixed_types.py`
104 changes: 104 additions & 0 deletions examples/compose/column_transformer_mixed_types.py
@@ -0,0 +1,104 @@
"""
===================================
Column Transformer with Mixed Types
===================================

This example illustrates how to apply different preprocessing and
feature extraction pipelines to different subsets of features,
using :class:`sklearn.compose.ColumnTransformer`.
This is particularly handy for the case of datasets that contain
heterogeneous data types, since we may want to scale the
numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after
mean-imputation, while the categorical data is one-hot
encoded after imputing missing values with a new category
(``'missing'``).

Finally, the preprocessing pipeline is integrated in a
full prediction pipeline using :class:`sklearn.pipeline.Pipeline`,
together with a simple classification model.
"""

# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

from __future__ import print_function

import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, CategoricalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV


# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex', 'pclass']

# Provisionally, use DataFrame.fillna() to impute missing values for the
# categorical features; SimpleImputer will eventually support
# strategy="constant".
data[categorical_features] = data[categorical_features].fillna(value='missing')

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = make_pipeline(SimpleImputer(), StandardScaler())
categorical_transformer = CategoricalEncoder('onehot-dense',
                                             handle_unknown='ignore')

preprocessing_pl = make_column_transformer(
    (numeric_features, numeric_transformer),
    (categorical_features, categorical_transformer),
    remainder='drop'
)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = make_pipeline(preprocessing_pl, LogisticRegression())

X = data.drop('survived', axis=1)
y = data.survived.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    shuffle=True)

clf.fit(X_train, y_train)
print("model score: %f" % clf.score(X_test, y_test))


###############################################################################
# Using the prediction pipeline in a grid search
###############################################################################
# Grid search can also be performed on the different preprocessing steps
# defined in the ``ColumnTransformer`` object, together with the classifier's
# hyperparameters as part of the ``Pipeline``.
# We will search for both the imputer strategy of the numeric preprocessing
# and the regularization parameter of the logistic regression using
# :class:`sklearn.model_selection.GridSearchCV`.


param_grid = {
    'columntransformer__pipeline__simpleimputer__strategy': ['mean', 'median'],
    # The last value was a duplicated 1.0; 10.0 widens the search range.
    'logisticregression__C': [0.1, 1.0, 10.0],
}

grid_search = GridSearchCV(clf, param_grid, cv=10, iid=False)
grid_search.fit(X_train, y_train)

print("best logistic regression from grid search: %f"
      % grid_search.score(X_test, y_test))
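
The example in this diff targets the scikit-learn development API of mid-2018 (`CategoricalEncoder`, `(columns, transformer)` tuples, `iid=False`). As a rough sketch only, the same idea might look like the following on a later release. It assumes that `OneHotEncoder` accepts string categories and that `SimpleImputer` supports `strategy='constant'`. The synthetic `toy` frame and its values are illustrative assumptions that stand in for the Titanic CSV, and the step names (`num`, `cat`, `preprocessor`, `classifier`) are likewise hypothetical and not part of the pull request.

```python
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for four of the Titanic columns (made-up values).
rng = np.random.RandomState(0)
n = 90
toy = pd.DataFrame({
    'age': rng.uniform(1, 80, n),
    'fare': rng.uniform(5, 500, n),
    'embarked': rng.choice(['C', 'S', 'Q'], n).astype(object),
    'sex': rng.choice(['female', 'male'], n).astype(object),
})

# Label loosely tied to 'sex', with about 10% of labels flipped as noise.
toy_y = (toy['sex'] == 'female').astype(int).values
flip = rng.rand(n) < 0.1
toy_y = np.where(flip, 1 - toy_y, toy_y)

# Punch a few holes so that both imputers have something to do.
toy.loc[toy.index % 10 == 0, 'age'] = np.nan
toy.loc[toy.index % 15 == 0, 'embarked'] = np.nan

# Numeric columns: mean-impute, then standard-scale.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Categorical columns: fill gaps with a new 'missing' category, then one-hot.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Each entry is (name, transformer, columns).
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, ['age', 'fare']),
    ('cat', categorical_transformer, ['embarked', 'sex']),
])

toy_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

# Nested parameter names follow the step names joined by double underscores.
toy_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0],
}

toy_search = GridSearchCV(toy_clf, toy_grid, cv=3)
toy_search.fit(toy, toy_y)
print(toy_search.best_params_)
```

Naming the steps explicitly, rather than relying on the names generated by `make_pipeline` and `make_column_transformer`, keeps the grid keys short and independent of the transformer class names.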